# MSc in Data Science and Big Data 
## Master Thesis
## Innovation & Entrepreneurship Business School
### Guillermo Altesor 


In this notebook we run sentiment analysis on our TripAdvisor reviews using a Transformer model, specifically RoBERTa.

In [3]:
# Install the transformers library
!pip install transformers

Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
     ---------------------------------------- 4.9/4.9 MB 4.6 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.8.0-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
     ------------------------------------- 163.5/163.5 kB 10.2 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp310-cp310-win_amd64.whl (3.3 MB)
     ---------------------------------------- 3.3/3.3 MB 9.1 MB/s eta 0:00:00
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp310-cp310-win_amd64.whl (151 kB)
     -------------------------------------- 151.7/151.7 kB 4.6 MB/s eta 0:00:00
Installing collected packages: tokenizers, pyyaml, filelock, huggingface-hub, transformers
Successfully installed filelock-3.8.0 huggingface-hub-0.10.0 pyyaml-6.0 tokenizers-0.12.1 transformers-4.22.2




In [5]:
# Install the torch library
!pip install torch

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'C:\\Users\\guill\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python310\\site-packages\\caffe2\\python\\serialized_test\\data\\operator_test\\piecewise_linear_transform_test.test_multi_predictions_params_from_arg.zip'



Collecting torch
  Downloading torch-1.12.1-cp310-cp310-win_amd64.whl (162.2 MB)
     -------------------------------------- 162.2/162.2 MB 5.3 MB/s eta 0:00:00
Installing collected packages: torch
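
Because pip reported an `OSError` while installing `torch`, it may be worth confirming that both libraries can actually be located before moving on. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, not part of pip or of either package.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located on the import path."""
    # find_spec returns None when no such top-level module exists
    return importlib.util.find_spec(package_name) is not None


# Report on the two libraries this notebook depends on
for name in ("transformers", "torch"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Finding a package this way only shows that it is present on disk; a partially installed package could still fail on import.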


In [6]:
# Import required packages
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

# Create class for data preparation
class SimpleDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts
    
    def __len__(self):
        return len(self.tokenized_texts["input_ids"])
    
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}

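
To make the role of `SimpleDataset` concrete, here is a small sketch that wraps a hand-written dictionary shaped like tokenizer output. The token ids are made up for illustration and `toy_encoding` and `toy_dataset` are hypothetical names; the class is repeated so the snippet runs on its own.

```python
class SimpleDataset:
    """Expose a dict of per-text lists as an indexable dataset."""

    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts

    def __len__(self):
        return len(self.tokenized_texts["input_ids"])

    def __getitem__(self, idx):
        # Pick the idx-th entry from every field (input_ids, attention_mask, ...)
        return {k: v[idx] for k, v in self.tokenized_texts.items()}


# Made-up token ids standing in for real tokenizer output
toy_encoding = {
    "input_ids": [[0, 713, 2], [0, 9226, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

toy_dataset = SimpleDataset(toy_encoding)
print(len(toy_dataset))  # number of texts
print(toy_dataset[1])    # all fields for the second text
```

Each item is a plain dictionary holding one text's fields, which is the per-example shape that gets batched at prediction time.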


In [7]:
# Load tokenizer and model, create trainer
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model)

Downloading: 100%|██████████| 256/256 [00:00<00:00, 83.5kB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading: 100%|██████████| 687/687 [00:00<00:00, 222kB/s]
Downloading: 100%|██████████| 798k/798k [00:00<00:00, 1.09MB/s]
Downloading: 100%|██████████| 456k/456k [00:00<00:00, 955kB/s] 
Downloading: 100%|██████████| 150/150 [00:00<00:00, 45.0kB/s]
Downloading: 100%|██████████| 1.42G/1.42G [06:01<00:00, 3.93MB/s]


In [8]:
#!pip install drive

**To explore different options, we will also use Google Drive: the CSV is saved there and loaded from Drive into Colab.**

In [9]:
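# Input file and the column that holds the review text
file_name = "Cured_ATT.csv"
text_column = "Review"

# Load the reviews and drop rows without text before tokenizing
df_pred = pd.read_csv(file_name)
pred_texts = df_pred[text_column].dropna().astype(str).tolist()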
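
Dropping missing reviews from the column alone makes `pred_texts` shorter than `df_pred` whenever some reviews are empty, so list positions no longer match row numbers. If the predictions later need to be joined back to other columns, one option is to filter the whole frame first. This is a sketch on made-up data; `toy_df`, `clean_df` and `toy_texts` are hypothetical names.

```python
import pandas as pd

# Made-up reviews; the second row has no text
toy_df = pd.DataFrame({"Review": ["Great views", None, "Too crowded"]})

# Drop rows with a missing review and renumber, so row i matches text i
clean_df = toy_df.dropna(subset=["Review"]).reset_index(drop=True)
toy_texts = clean_df["Review"].astype(str).tolist()

print(len(clean_df), len(toy_texts))
```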

In [10]:
# Tokenize texts and create prediction data set
tokenized_texts = tokenizer(pred_texts, truncation=True, padding=True)
pred_dataset = SimpleDataset(tokenized_texts)
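
As a rough illustration of what `truncation=True` and `padding=True` do, the sketch below cuts each sequence to a maximum length and pads shorter ones up to the longest sequence in the call. It is a simplified stand-in, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical function, the token ids are made up, the pad id of 1 is an assumption, and the real tokenizer also keeps the end-of-sequence token when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=1):
    """Simplified sketch of truncation followed by padding to the longest sequence."""
    # Truncation: cut every sequence to at most max_length tokens
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    # Padding: extend shorter sequences with pad_id up to the longest one
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_and_truncate([[0, 10, 11, 12, 2], [0, 20, 2]], max_length=4)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Because the tokenizer is called once on the full list, every review is padded to the longest review in the whole set, which adds computation for short reviews.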

In [11]:
# Run predictions
predictions = trainer.predict(pred_dataset)

***** Running Prediction *****
  Num examples = 9860
  Batch size = 8
  0%|          | 4/1233 [01:58<10:55:23, 32.00s/it]

KeyboardInterrupt: 
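
The progress bar above can be sanity-checked with a little arithmetic: 9860 examples in batches of 8 give 1233 batches, and at roughly 32 s per batch the full run would take about 11 hours. A quick sketch, with illustrative variable names and the numbers copied from the output above:

```python
import math

num_examples = 9860       # "Num examples" in the log
batch_size = 8            # "Batch size" in the log
seconds_per_batch = 32.0  # rate shown by the progress bar

# The last batch is only partly full, so round up
num_batches = math.ceil(num_examples / batch_size)
total_hours = num_batches * seconds_per_batch / 3600

print(num_batches)
print(round(total_hours, 2))
```

A rate this slow would be consistent with running the large model on a CPU, although the log does not say which device was used.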

In [None]:
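# Transform predictions to labels
logits = predictions.predictions
preds = logits.argmax(-1)
labels = pd.Series(preds).map(model.config.id2label)
# Softmax over the logits, shifted by the row maximum for numerical stability
exp_logits = np.exp(logits - logits.max(-1, keepdims=True))
scores = (exp_logits / exp_logits.sum(-1, keepdims=True)).max(-1)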
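
The score computed above is the softmax probability of the winning class. The sketch below works through the same steps on made-up logits. The `id2label` mapping here is an assumption chosen to match the labels in the output table, and all `toy_` names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for two reviews, columns ordered as class 0, class 1
toy_logits = np.array([[-3.0, 3.5], [2.0, -1.0]])
# Assumed mapping, mirroring the POSITIVE label shown for class 1
toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

toy_probs = softmax(toy_logits)
toy_preds = toy_probs.argmax(axis=-1)                    # winning class per review
toy_labels = [toy_id2label[int(i)] for i in toy_preds]   # class id to label name
toy_scores = toy_probs.max(axis=-1)                      # probability of the winner

print(toy_labels)
print(toy_scores)
```

Subtracting the row maximum leaves the probabilities unchanged but keeps `np.exp` from overflowing on very large logits.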

In [None]:
# Create DataFrame with texts, predictions, labels, and scores
df = pd.DataFrame(list(zip(pred_texts,preds,labels,scores)), columns=['text','pred','label','score'])
df.head(50)



In [None]:
df.head(50)

| | text | pred | label | score |
|---|---|---|---|---|
| 0 | Worth the trip, cable car needs minimum 90 min... | 1 | POSITIVE | 0.998918 |
| 1 | Must see of Tenerife - A must see site on Tene... | 1 | POSITIVE | 0.998881 |
| 2 | A must visit place in tenerife. - Absolutely a... | 1 | POSITIVE | 0.998937 |
| 3 | Hike to the summit. - A drive up to El Tiede f... | 1 | POSITIVE | 0.998876 |
| 4 | Spectacular - It's number one for a reason. O... | 1 | POSITIVE | 0.998924 |
| 5 | Stunning views - I was unable to complete the ... | 1 | POSITIVE | 0.992296 |
| 6 | Top of Spain - Clearly one of the best places ... | 1 | POSITIVE | 0.998912 |
| 7 | Beautiful scenery - We hired a car to drive up... | 1 | POSITIVE | 0.998903 |
| 8 | Outstanding - Outstanding day and evening watc... | 1 | POSITIVE | 0.99894 |
| 9 | A volcano - As part of this he Teide National ... | 1 | POSITIVE | 0.998802 |


@article{hartmann2022,
  title={More than a feeling: Accuracy and Application of Sentiment Analysis},
  author={Hartmann, Jochen and Heitmann, Mark and Siebert, Christian and Schamp, Christina},
  journal={International Journal of Research in Marketing},
  year={2022}
}