# Using a BERT Model for Sentiment Analysis of TripAdvisor Reviews

Using a BERT model from Hugging Face is a convenient way to run sentiment analysis on your data. The checkpoint used here is already trained for sentiment classification, so it can be applied directly to unlabelled data and will generate a sentiment prediction for each document.

### 1. Load relevant libraries


In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch 
import re

import pandas as pd 
import numpy as np 

### 2. Instantiate Model

In [2]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

### 3. Encode and Calculate Sentiment

In [3]:
tokens = tokenizer.encode('not as bad', return_tensors='pt')

In [4]:
result = model(tokens)

In [5]:
result

SequenceClassifierOutput(loss=None, logits=tensor([[-1.0146,  0.7619,  2.3528,  0.3902, -1.8820]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [6]:
result.logits

tensor([[-1.0146,  0.7619,  2.3528,  0.3902, -1.8820]],
       grad_fn=<AddmmBackward0>)

In [7]:
int(torch.argmax(result.logits))+1

3
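
The logits above are raw, unnormalised scores, one per class. As a minimal sketch of what the `argmax` and `+ 1` are doing, the snippet below applies a softmax to the same five numbers in plain Python (no model needed) and picks the most likely class. The helper name `softmax` and the variable names are illustrative, and the snippet assumes the five classes correspond to ratings 1 to 5 in order, as the rest of this notebook does.

```python
import math

# Logits copied from the output above, one per class (rating 1 to 5).
logits = [-1.0146, 0.7619, 2.3528, 0.3902, -1.8820]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# Class indices start at 0 while ratings start at 1, hence the + 1.
stars = probs.index(max(probs)) + 1

for rating, p in enumerate(probs, start=1):
    print(f"rating {rating}: {p:.3f}")
print("predicted rating:", stars)
```

Softmax does not change which class wins, so taking the `argmax` of the logits directly, as in the cell above, gives the same rating. The probabilities are only useful if you also want a rough sense of how confident the model is.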

### 4. Load TripAdvisor Reviews 

These reviews were scraped from the TripAdvisor website for the ROW NYC hotel in New York City. For details of the scraping process, refer to the code in the repository "Scraping TripAdvisor Reviews using Selenium".

In [8]:
df = pd.read_csv('Reviews TA.csv')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13465 entries, 0 to 13464
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Date    13465 non-null  object
 1   Title   13464 non-null  object
 2   Review  13465 non-null  object
 3   Rating  13465 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 420.9+ KB


In [10]:
reviews = df.drop(['Date', 'Title', 'Rating'], axis=1)

In [11]:
reviews.head()

|   | Review |
|---|--------|
| 0 | Very good location. Reasonable price. The room... |
| 1 | We stayed in this hotel just before Christmas ... |
| 2 | Pleasant staff and security in place. Stayed h... |
| 3 | Pricing was okay. Very noisy, small room. The ... |
| 4 | Rooms are filthy, elevators are dangerous and ... |


Create a function that tokenises and encodes a review, then scores it. The model's output ranges from 1 to 5, so ratings of 1 and 2 are treated as negative, 3 as neutral, and 4 and 5 as positive.

In [19]:
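def sentiment_score(review):
    # Tokenise and encode the review, truncating long reviews to 250 tokens
    tokens = tokenizer.encode(review, return_tensors='pt', max_length=250, truncation=True)
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        result = model(tokens)
    # The highest logit gives the class index (0-4); add 1 to get the rating (1-5)
    stars = int(torch.argmax(result.logits)) + 1

    if stars <= 2:
        return 'Negative'
    elif stars == 3:
        return 'Neutral'
    else:
        return 'Positive'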
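
The rating-to-label mapping inside `sentiment_score` can be checked on its own, without loading the model. The sketch below pulls that step into a separate helper. The name `label_from_stars` is hypothetical and not part of the original notebook, and the range check is an added assumption about how invalid input should be handled.

```python
def label_from_stars(stars):
    """Map a 1-5 rating to a coarse sentiment label."""
    if stars not in (1, 2, 3, 4, 5):
        # Assumption: anything outside 1-5 is treated as a caller error.
        raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")
    if stars <= 2:
        return 'Negative'
    if stars == 3:
        return 'Neutral'
    return 'Positive'


# Show the label assigned to every possible rating.
labels = {stars: label_from_stars(stars) for stars in range(1, 6)}
for stars, label in labels.items():
    print(stars, label)
```

Keeping the thresholds in one small function makes it easy to change the grouping later, for example to treat a rating of 3 as negative, without touching the model code.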

In [20]:
# Apply the function to the entire dataset of 13465 reviews
reviews['Sentiment'] = reviews['Review'].astype(str).apply(sentiment_score)

In [21]:
# Output of the BERT Model 
reviews

|   | Review | Sentiment |
|---|--------|-----------|
| 0 | Very good location. Reasonable price. The room... | Positive |
| 1 | We stayed in this hotel just before Christmas ... | Positive |
| 2 | Pleasant staff and security in place. Stayed h... | Positive |
| 3 | Pricing was okay. Very noisy, small room. The ... | Negative |
| 4 | Rooms are filthy, elevators are dangerous and ... | Negative |
| ... | ... | ... |
| 13460 | This was my first time to New York and we know... | Positive |
| 13461 | This is a great place in Time Square area to s... | Positive |
| 13462 | This was my first time visiting this hotel and... | Positive |
| 13463 | Not my first time in NYC, but it's my first ti... | Positive |


To get the sentiment of an individual document:

In [22]:
sentiment_score(' Hi you are not that lousy')

'Neutral'