# Yelp review sentiment analysis

Based on https://github.com/nicknochnack/BERTSentiment

## 1. Install and Import Dependencies

Install `requests` (to fetch HTML pages), `beautifulsoup4` (to parse and scrape them), `pandas`, `numpy`, Hugging Face `transformers`, and PyTorch (`torch`).

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re

## 2. Instantiate Model

Create a tokenizer and a model from the same pretrained BERT checkpoint, so that the tokenizer's vocabulary matches the one the model was trained with.

See https://huggingface.co/docs/transformers/model_doc/auto for more information on the Auto tokenizers and models.

In [None]:
preModel = 'nlptown/bert-base-multilingual-uncased-sentiment'

tokenizer = AutoTokenizer.from_pretrained(preModel)
model = AutoModelForSequenceClassification.from_pretrained(preModel)

## 3. Encode and Calculate Sentiment

Encode a review using the tokenizer, storing the token IDs in a PyTorch tensor.

In [None]:
tokens = tokenizer.encode('The food was very good but the place was disappointing', return_tensors='pt')

Now apply the BERT model to the resulting tokens.

In [None]:
result = model(tokens)

The model returns one score per rating in `logits`. The rating with the highest logit is the predicted one.

In [None]:
result.logits

Now return the rating, converting it from an index 0..4 to a rating in the range 1..5. Note that this rating was derived from the review comment, independently of the numeric rating chosen by the person writing the review.

In [None]:
int(torch.argmax(result.logits))+1
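Raw logits are unnormalized scores, so they are hard to read as a measure of confidence. The following is a minimal sketch of how they could be turned into probabilities with a softmax, using only the standard library. The logit values are made up for illustration and are not output from the model.

```python
import math

# Hypothetical logits for ratings 1..5 (illustrative values, not real model output)
logits = [-1.2, 0.3, 2.1, 1.4, -0.8]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Index of the largest probability (same as the largest logit), shifted to 1..5
rating = max(range(len(probs)), key=probs.__getitem__) + 1
confidence = probs[rating - 1]

print(rating, round(confidence, 2))
```

With the real model output, `torch.softmax(result.logits, dim=1)` should perform the same normalization directly on the tensor.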

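## 4. Collect Reviews

Fetch the Yelp page for the business, parse the HTML, and collect the text of every `<p>` element whose class name contains `comment`.
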
In [None]:
url = 'https://www.yelp.com/biz/the-reg-waterford-waterford'
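r = requests.get(url)
r.raise_for_status()  # fail loudly if the page could not be fetched
soup = BeautifulSoup(r.text, 'html.parser')
# Match any class name that contains 'comment'
regex = re.compile('.*comment.*')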
results = soup.find_all('p', {'class':regex})
reviews = [result.text for result in results]
reviews
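To see how the class filter behaves without hitting the network, here is a small sketch that runs the same `find_all` call on an inline HTML snippet. The markup and class names are invented for illustration; Yelp's real class names differ and may change over time.

```python
import re
from bs4 import BeautifulSoup

# Invented markup standing in for a review page (not Yelp's actual HTML)
html = """
<div>
  <p class="comment__abc123 text">Great food, slow service.</p>
  <p class="footer-note">Copyright notice</p>
  <p class="user-comment">Would come back.</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Beautiful Soup tests a compiled pattern with search(), so 'comment'
# alone matches any class name containing that substring
regex = re.compile('comment')
results = soup.find_all('p', {'class': regex})
reviews = [p.text for p in results]
print(reviews)
```

If the live page returns an empty list, the likely causes are that the class names no longer contain `comment` or that the site served a different page to the script than to a browser.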

## 5. Load Reviews into DataFrame and Score

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame(np.array(reviews), columns=['review'])

In [None]:
reviewId = 3
df['review'].iloc[reviewId]

In [None]:
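def sentiment_score(review):
    """Return the predicted rating (1..5) for a review string."""
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    # argmax gives a class index 0..4; add 1 to get the star rating
    return int(torch.argmax(result.logits))+1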

sentiment_score(df['review'].iloc[reviewId])

Apply the sentiment model to score the reviews by predicting their sentiment rating.

In [None]:
df['sentiment'] = df['review'].apply(lambda x: sentiment_score(x[:512]))
df
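The slice `x[:512]` cuts each review to 512 characters, whereas BERT models are limited to 512 tokens, so the two limits are not the same thing. As an alternative sketch, the tokenizer can do the truncation itself through its `truncation` and `max_length` arguments. The function name `sentiment_score_truncated` and its `tokenizer` and `model` parameters are assumptions made for this example, not part of the original notebook.

```python
import torch

def sentiment_score_truncated(review, tokenizer, model, max_length=512):
    """Score a review from 1 to 5, truncating by tokens rather than characters."""
    # Calling the tokenizer returns a dict-like object of tensors
    # (input_ids, attention_mask, ...), cut to at most max_length tokens
    inputs = tokenizer(review, return_tensors='pt',
                       truncation=True, max_length=max_length)
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        result = model(**inputs)
    # Class index 0..4 becomes a rating 1..5
    return int(torch.argmax(result.logits)) + 1
```

It could then stand in for the lambda above, for example `df['review'].apply(lambda x: sentiment_score_truncated(x, tokenizer, model))`.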

If you look at the corresponding [review](https://www.yelp.com/biz/the-reg-waterford-waterford), you will see that the reviewer gave the place 3 stars, which is more positive than the rating derived by the sentiment model.