# Advanced NLP for Sentiment Analysis Using BERT

## Overview

In this project, we will use **Hugging Face Transformers** and a pre-trained **BERT** neural network for sentiment analysis. We will first run the model on a single prompt, and then use **BeautifulSoup** to scrape reviews from Yelp so that we can calculate sentiment on a larger scale.

There are three main steps that we are going to follow:
1. Download and install BERT from HF Transformers
2. Run sentiment analysis using BERT and Python
3. Scrape reviews from Yelp and calculate the score

## 1. Install and Import Dependencies

One of the key dependencies that we are going to need is **PyTorch**. You can find the install command for your platform at https://pytorch.org/.

In [1]:
!pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio===0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Looking in links: https://download.pytorch.org/whl/cu113/torch_stable.html


Now, we are going to install 5 other dependencies: **transformers**, **requests**, **beautifulsoup4**, **pandas**, and **numpy**.

We are going to leverage **transformers** for our actual NLP model. It allows us to easily download and load the model, specifically the multilingual BERT model that lets us perform sentiment analysis (https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment).
This model gives you a sentiment score between 1 and 5; this means that rather than just getting a confidence value between 0 and 1, you are actually getting a star rating.

The **requests** library is going to allow us to make a request to the Yelp site that we are going to be scraping.

**beautifulsoup4** is going to allow us to work through the soup that we get back from the page and extract the data we need.

**Pandas** is going to allow us to structure our data in a format that makes it easy to work with. And **NumPy** is going to give us some additional data transformation tools.

In [2]:
!pip install transformers requests beautifulsoup4 pandas numpy==1.17.5



In [19]:
# Importing Dependencies

from transformers import AutoTokenizer, AutoModelForSequenceClassification  
import torch  
import requests 
from bs4 import BeautifulSoup 
import re 
import numpy as np
import pandas as pd

# AutoTokenizer allows us to pass in a string and convert it into a sequence of numbers that we can then pass to our NLP model.
# AutoModelForSequenceClassification gives us the architecture from transformers to load our NLP model.
# We are going to use the argmax function from torch to extract the position of the highest score.
# requests is used to grab the webpage from Yelp.
# BeautifulSoup allows us to traverse the result from Yelp and extract the data we actually need, the reviews.
# re allows us to create a regex to extract the specific comments that we want.

## 2. Instantiate Model

Now, we are going to instantiate and set up our model. First, we create our tokenizer and then we load our model. Both come from the same pre-trained checkpoint on HF (bert-base-multilingual-uncased-sentiment). There are a number of NLP models available from HF, including models for translation, Q&A, classification, and generation. Here we are using a model for sentiment analysis.

In [4]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


## 3. Encode and Calculate Sentiment

Now, we are going to test our model. We are going to pass a string or a prompt to our tokenizer, tokenize it and pass it through our model and get our classification.

In [8]:
tokens = tokenizer.encode("I didn't like it, not recommending", return_tensors='pt')
# tokens

We do not need to decode the tokens, but this is how it works if we want to decode it:

In [10]:
# tokenizer.decode(tokens[0])  # tokens is a 2-D tensor, so pass its first row

Now, what we need to do is pass the tokens to our model:

In [11]:
result = model(tokens)

In [12]:
result

SequenceClassifierOutput(loss=None, logits=tensor([[ 3.0376,  3.0210,  1.0249, -2.3583, -3.8397]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

You can see that what we get back is a `SequenceClassifierOutput` object. What we need from the result to understand the sentiment is the **logits**. The values in this tensor are raw, unnormalized scores, one per class (1 to 5 stars); they are not probabilities, although applying a softmax would turn them into probabilities. **The position with the highest score represents the sentiment rating**. So, in the current case, the highest score is in the first position, so the **rating is 1**, meaning it is a negative review.

In [13]:
result.logits

tensor([[ 3.0376,  3.0210,  1.0249, -2.3583, -3.8397]],
       grad_fn=<AddmmBackward0>)

In [14]:
# to get the rating
int(torch.argmax(result.logits)) + 1  # since the position numbering starts from 0

1
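
To make the relationship between logits and ratings concrete, here is a minimal sketch that applies a softmax by hand to the logits printed above. It uses only the standard library so it runs without the model; the helper name `softmax` is introduced here for illustration, and in the notebook itself `torch.softmax(result.logits, dim=1)` should give the equivalent result.

```python
import math

# Logits copied from the example output above (one score per rating, 1 to 5).
logits = [3.0376, 3.0210, 1.0249, -2.3583, -3.8397]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
rating = probs.index(max(probs)) + 1  # positions start at 0, ratings at 1

for stars, p in enumerate(probs, start=1):
    print(f"{stars} star(s): {p:.3f}")
print("predicted rating:", rating)
```

For this prompt the first two classes are nearly tied (roughly 47% and 46%), so the model is fairly sure the review is negative but much less sure whether it is a 1 or a 2.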

## 4. Collect Reviews

Now, we are going to collect some reviews from Yelp. I am going to look at reviews for Rumi's Kitchen (one of my favorite restaurants). We are going to extract the reviews from this page https://www.yelp.com/biz/rumis-kitchen-atlanta-2 and pass them through our sentiment pipeline.

To build our scraper, first we are going to use the `requests` library to grab our webpage. What we get back is a response object; `r.text` gives us the text of that webpage, which is the full HTML that makes up the page. Then we can use `BeautifulSoup` to parse this text.

After that, using `re.compile`, we are going to extract the specific components that we want from this webpage: the reviews, which are the paragraphs whose class contains **comment** (if you inspect the webpage you can see it). Then, we pass that regex to `soup.find_all` to find all the tags within the soup that match our specific formatting. In this case, we are looking for paragraphs (`p` tags) that have a class matching our regex, which in this case is anything containing **comment**.

So far, we can see that the results of our code are wrapped inside of html tags, but we just want the texts. So, in the last step, we use a list comprehension to extract all the reviews from the tags.

In [16]:
r = requests.get('https://www.yelp.com/biz/rumis-kitchen-atlanta-2')
soup = BeautifulSoup(r.text, 'html.parser')
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class': regex})
reviews = [result.text for result in results]
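
To see what the class filter in the cell above actually accepts, here is a small standard-library sketch that applies the same pattern to a few class names. The class names are hypothetical, invented for illustration; Yelp's real class names are generated and may differ or change over time. The sketch assumes the pattern is applied as a search against each class name, which is how BeautifulSoup is understood to treat a compiled regex.

```python
import re

# Same pattern as in the scraper above.
regex = re.compile('.*comment.*')

# Hypothetical class names, for illustration only.
sample_classes = ['comment__09f24__abc', 'raw__09f24__xyz', 'css-comment', 'heading']

# Keep every class name that contains the substring "comment".
matching = [name for name in sample_classes if regex.search(name)]
print(matching)

# When searching, the shorter pattern 'comment' selects the same names.
simple = re.compile('comment')
same = [name for name in sample_classes if simple.search(name)]
print(same == matching)
```

If `reviews` comes back empty, the page markup has probably changed (or the request was blocked), and the class name in the regex needs to be checked against the live page.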

In [18]:
reviews[0]

"For starters, we got the hummus which was super yummy. It comes along with fresh pita bread \xa0that is literally fresh out the oven. The Sabazi plate is to cleanse your plate to taste every fresh ingredients. I'm definitely a fan of the homemade sodas. The flavors they have is \xa0peach, passion fruit and mango. The peach is my favorite thus far so definitely try it. For entrees, my boyfriend got the Lamb Koobideh Kabob and it comes with rice. I got the Chicken Kabob and it was so delicious but I definitely enjoyed the lamb more. We were so stuff that I didn't even get the chance to try a dessert. But, I'll definitely be back to Rumi Kitchen."

## 5. Load Reviews into DataFrame and Score

In the next step, we are going to load the reviews into a DataFrame, and then run through each one of these reviews and score it. A DataFrame makes it easier to go through the reviews and process them.

In [20]:
df = pd.DataFrame(np.array(reviews), columns=['reviews'])

In [27]:
df.head()

| | reviews |
|---|---|
| 0 | For starters, we got the hummus which was supe... |
| 1 | For starters, we got the hummus which was supe... |
| 2 | The food was good but there were a few things ... |
| 3 | Thai place was pretty good although my compani... |
| 4 | This place is fantastic!!! I've passed Rumi's ... |


In [26]:
df['reviews'].iloc[0]

"For starters, we got the hummus which was super yummy. It comes along with fresh pita bread \xa0that is literally fresh out the oven. The Sabazi plate is to cleanse your plate to taste every fresh ingredients. I'm definitely a fan of the homemade sodas. The flavors they have is \xa0peach, passion fruit and mango. The peach is my favorite thus far so definitely try it. For entrees, my boyfriend got the Lamb Koobideh Kabob and it comes with rice. I got the Chicken Kabob and it was so delicious but I definitely enjoyed the lamb more. We were so stuff that I didn't even get the chance to try a dessert. But, I'll definitely be back to Rumi Kitchen."

Now, we are going to loop through each one of these reviews and get a score for it. Before that, we are going to define a quick function to do the encoding and sentiment scoring. Encapsulating the sentiment pipeline in a function makes it easier to process multiple strings.

In [28]:
def sentiment_score(review):
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits)) + 1    

In [29]:
# Example
sentiment_score(df['reviews'].iloc[0])

4

We want to score all the reviews and store the scores inside our DataFrame. To do so, we use `apply` with a `lambda` function to run through each one of the reviews and store the result in a new column.

In [31]:
df['sentiment'] = df['reviews'].apply(lambda x: sentiment_score(x[:512])) 

In [32]:
df

| | reviews | sentiment |
|---|---|---|
| 0 | For starters, we got the hummus which was supe... | 4 |
| 1 | For starters, we got the hummus which was supe... | 4 |
| 2 | The food was good but there were a few things ... | 2 |
| 3 | Thai place was pretty good although my compani... | 3 |
| 4 | This place is fantastic!!! I've passed Rumi's ... | 5 |
| 5 | This place is really quite great. Food: falafe... | 3 |
| 6 | Excellent food, great wine, great service. Don... | 5 |
| 7 | Rumi's is a solid place in ATL for Persian foo... | 4 |
| 8 | Beautifully designed restaurant with Persian a... | 5 |
| 9 | Rumi's is one of the best Persian spots in Atl... | 4 |


Note that our NLP pipeline is limited in how much text it can process at one time: this model accepts at most 512 tokens per input. The slice `x[:512]` above keeps the first 512 characters of each review, not the first 512 tokens; since a token typically spans several characters, this usually keeps the input under the limit. Truncating may influence the result for long reviews. Alternatively, you could pass `truncation=True, max_length=512` to `tokenizer.encode`, or split a long review into several pieces, score each piece, and take the average, but in this case the slice is a quick workaround.
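
The "score in several pieces and average" idea can be sketched as follows. This is only an illustration under stated assumptions: the helper names `split_into_chunks` and `average_sentiment` are hypothetical, the text is split by characters (so a piece may start or end mid-word), and a stand-in scorer is used so the sketch runs without the model.

```python
from statistics import mean

def split_into_chunks(text, size=512):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def average_sentiment(text, score_fn, size=512):
    """Score each chunk with score_fn and return the mean of the scores."""
    chunks = split_into_chunks(text, size)
    if not chunks:
        raise ValueError("cannot score an empty review")
    return mean(score_fn(chunk) for chunk in chunks)

# Stand-in scorer (an assumption for this sketch, not the BERT model):
# returns 5 if the chunk mentions "great", otherwise 3.
def fake_score(chunk):
    return 5 if "great" in chunk else 3

long_review = "great food. " * 60 + "slow service. " * 60
print(average_sentiment(long_review, fake_score))
```

In the notebook, passing the `sentiment_score` function defined earlier as `score_fn` would average the model's ratings over the pieces; note that the result can then be a fraction rather than a whole star rating.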

We can repeat this process for another restaurant or any other business to get its sentiment scores. We just have to go back to step 4 and follow all the steps from there.