## Sentiment Analysis with Transformers

In this simple project, I use BERT, specifically the [bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) model from [Hugging Face](https://huggingface.co/), which is fine-tuned for sentiment analysis on product reviews in six languages and predicts a rating from 1 to 5 stars.

# Import Dependencies

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd

# Instantiate Model (May take some time)

In [2]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Downloading: 100%|██████████| 39.0/39.0 [00:00<00:00, 14.9kB/s]
Downloading: 100%|██████████| 953/953 [00:00<00:00, 529kB/s]
Downloading: 100%|██████████| 851k/851k [00:01<00:00, 728kB/s]  
Downloading: 100%|██████████| 112/112 [00:00<00:00, 65.3kB/s]
Downloading: 100%|██████████| 638M/638M [04:20<00:00, 2.57MB/s] 


# Test

In [3]:
tokens = tokenizer.encode('The deep learning course is very insightful and fun.', return_tensors='pt')
result = model(tokens)
result.logits

tensor([[-3.0342, -2.7840, -0.1796,  2.4267,  2.7624]],
       grad_fn=<AddmmBackward0>)

In [4]:
# The index of the largest logit (0-4) plus 1 gives the predicted sentiment score (1-5 stars)
int(torch.argmax(result.logits))+1

5
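
The logits are unnormalized scores, one per star rating from 1 to 5. As a rough illustration, the sketch below applies a softmax to the logits printed above to turn them into probabilities. It uses only the standard library so it runs on its own, and the helper name `softmax` is my own; inside the notebook, `torch.softmax(result.logits, dim=-1)` should give the same values.

```python
import math

# Logits copied from the output of the test cell above (one per star rating, 1 to 5).
logits = [-3.0342, -2.7840, -0.1796, 2.4267, 2.7624]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
predicted_stars = probs.index(max(probs)) + 1  # index 0 corresponds to 1 star

for stars, p in enumerate(probs, start=1):
    print(f"{stars} star(s): {p:.3f}")
print("Predicted rating:", predicted_stars)
```

The 4-star logit is only slightly below the 5-star one, so a noticeable share of the probability goes to 4 stars, which the argmax alone hides.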

# Collect Reviews

I will collect Yelp reviews and use the BERT model to analyse the sentiment of the reviews.

In [3]:
def extract_reviews_scores(links):
    df = None
    for link in links:
        try: # If the link is valid
            # Extracts the reviews
            r = requests.get(link)
            soup = BeautifulSoup(r.text, 'html.parser')
            regex = re.compile('.*comment.*')
            results = soup.find_all('p', {'class':regex})
            reviews = [result.text for result in results]

            # Extracts the score given by the reviewer
            result_len = len(results)
            regex_score = re.compile('i-stars__09f24__M1AR7')
            result_score = soup.find_all('div', {'class':regex_score})
            
            # On each webpage, the first score is the general review score of the restaurant. We skip it.
            # The next result_len scores are the scores we are interested in. The other scores beneath are for other restaurants which we are not interested in.
            # You can visit the website for more details.
            scores = result_score[1:result_len+1] 
            review_scores = [int(score['aria-label'][0]) for score in scores]

            if df is None:
                df = pd.DataFrame({'Review': np.array(reviews), 'Real_Score': np.array(review_scores)})
            else:
                df1 = pd.DataFrame({'Review': np.array(reviews), 'Real_Score': np.array(review_scores)})
                df = pd.concat([df, df1], ignore_index=True)
        except Exception as exc:  # Skip pages that fail to download or parse.
            print(f"Skipping {link}: {exc}")
    
    return df
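
The score extraction relies on the first character of each `aria-label` being the star count. A slightly more defensive parse is sketched below. The label format `'5 star rating'` is an assumption about Yelp's markup, and `parse_star_rating` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_star_rating(aria_label):
    """Return the leading integer star count from a label, or None if absent.

    Assumes labels look like '5 star rating'; the real markup may differ.
    """
    match = re.match(r"\s*([1-5])\b", aria_label)
    return int(match.group(1)) if match else None

print(parse_star_rating("5 star rating"))
print(parse_star_rating("rating unavailable"))
```

Returning `None` for an unexpected label makes a layout change on the site visible, instead of raising inside the scraping loop.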

In [4]:
# Links I will be scraping
# Note: you can substitute your own links, but they must be Yelp links, because the scraping code is written specifically for the Yelp website.

links = ['https://www.yelp.com/biz/kingdom-of-dumplings-san-francisco']
sample = 'https://www.yelp.com/biz/kingdom-of-dumplings-san-francisco?start='
start=10
for i in range(2, 224):
    link = sample + str(start)
    links.append(link)
    start += 10

len(links)

223
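
The pagination loop above can also be written as a list comprehension. This is only a sketch with the same base URL and the same offsets (10, 20, and so on up to 2220); the names `base` and `offset` are mine.

```python
base = 'https://www.yelp.com/biz/kingdom-of-dumplings-san-francisco'

# The first page has no query string; later pages are addressed by ?start=<offset>.
links = [base] + [f"{base}?start={offset}" for offset in range(10, 2230, 10)]

print(len(links))
```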

In [51]:
# Returns the predicted sentiment score (1-5) for a single review string, so it can be applied to a DataFrame column.

def sentiment_score(review):
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits))+1

In [52]:
# Extract reviews and scores
df = extract_reviews_scores(links)

In [54]:
df['sentiment'] = df['Review'].apply(lambda x: sentiment_score(x[:800])) # keep the first 800 characters so the input usually stays within BERT's 512-token limit

In [55]:
# visualize the result
df

| | Review | Real_Score | sentiment |
|---|---|---|---|
| 0 | This place is so good! We live in the sunset a... | 5.0 | 5 |
| 1 | I love this Chinese place, absolute must visit... | 5.0 | 5 |
| 2 | Came here after a long bike ride across SF wit... | 5.0 | 4 |
| 3 | I was craving Chinese food on a Thursday for d... | 5.0 | 5 |
| 4 | Walking by this restaurant, I was a bit shocke... | 3.0 | 3 |
| ... | ... | ... | ... |
| 1118 | We were planning to go to shanghai dumpling in... | 4.0 | 5 |
| 1119 | For dumplings in SF, this is my favorite place... | 4.0 | 4 |
| 1120 | Dirty utensils - with dried food sticking on t... | 2.0 | 2 |
| 1121 | I came here for lunch at my son's recommendati... | 5.0 | 5 |


In [57]:
# calculate accuracy
"""
Accuracy =  Number of correct predictions/Total Number of predictions
"""
correct_predictions = len(df[df.Real_Score == df.sentiment])
total_predictions = len(df)

accuracy = (correct_predictions/total_predictions) * 100

print("The accuracy is " + str(accuracy) + '%.')

The accuracy is 58.14781834372217%.
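
Exact-match accuracy counts a prediction of 4 for a 5-star review as just as wrong as a prediction of 1. As an illustration, the sketch below computes exact and within-one-star accuracy on only the nine rows displayed in the table above, so the numbers say nothing about the full dataset. The helper name `share_within` is hypothetical.

```python
# The nine rows visible in the displayed output (not the full dataset).
real = [5, 5, 5, 5, 3, 4, 4, 2, 5]
pred = [5, 5, 4, 5, 3, 5, 4, 2, 5]

def share_within(real_scores, predicted_scores, tolerance):
    """Fraction of predictions at most `tolerance` stars from the real score."""
    hits = sum(abs(r - p) <= tolerance for r, p in zip(real_scores, predicted_scores))
    return hits / len(real_scores)

exact = share_within(real, pred, tolerance=0)
within_one = share_within(real, pred, tolerance=1)
print(f"Exact match: {exact:.1%}, within one star: {within_one:.1%}")
```

On the full DataFrame, the same idea could be written as `(df.Real_Score - df.sentiment).abs().le(1).mean()`.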


In [59]:
# Exact-match accuracy is modest, but it treats a 4-vs-5 miss the same as a 1-vs-5 miss.
# The mean squared error (MSE) measures how far off the predictions are instead.
mse = np.square(df.Real_Score - df.sentiment).mean()
print("The MSE is", mse)

# Taking the square root gives an RMSE of about 0.91, so the predictions are typically within about one star of the real score.

The MSE is 0.8343722172751559
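
For reference, the root mean squared error is the square root of the mean of the squared differences, so it is expressed in stars rather than squared stars. Here is a small sketch on only the nine rows displayed in the table above, using the standard library; the values are illustrative and do not describe the full dataset.

```python
import math

# The nine rows visible in the displayed output (not the full dataset).
real = [5, 5, 5, 5, 3, 4, 4, 2, 5]
pred = [5, 5, 4, 5, 3, 5, 4, 2, 5]

squared_errors = [(r - p) ** 2 for r, p in zip(real, pred)]
mse = sum(squared_errors) / len(squared_errors)  # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print("MSE:", mse)
print("RMSE:", rmse)
```

With NumPy, as used in the notebook, this corresponds to `np.sqrt(np.square(df.Real_Score - df.sentiment).mean())`.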
