<h1>Sentiment Analysis Application</h1>
I'll compare two approaches to sentiment prediction and deploy the model with the better scores.
<h2>1. TF-IDF and Naive Bayes</h2>

In [1]:
import pandas as pd
import numpy as np
import re
import spacy
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, f1_score

In [2]:
# spaCy is commonly used for NLP tasks; load its small English model
nlp = spacy.load('en_core_web_sm')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [6]:
# Convert the positive label to 1 and the negative label to 0
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x=='positive' else 0)

In [7]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [8]:
# Try the text preprocessing on a single example before wrapping it in a function
doc = re.sub('<[^<]+?>', '', df['review'].iloc[1])
doc = re.sub('[^a-zA-Z0-9]', ' ', doc)
doc = nlp(doc)
token_list = [token.lemma_ for token in doc if not token.is_stop]
' '.join(token_list)

'wonderful little production   film technique unassuming   old time BBC fashion give comforting   discomforte   sense realism entire piece   actor extremely choose   Michael Sheen   get polari   voice pat   truly seamless editing guide reference Williams   diary entry   worth watching terrificly write perform piece   masterful production great master s comedy life   realism come home little thing   fantasy guard   use traditional   dream   technique remain solid disappear   play knowledge sense   particularly scene concern Orton Halliwell set   particularly flat Halliwell s mural decorate surface   terribly'

In [9]:
def preprocessing(s):
    # Remove html tags
    s = re.sub('<[^<]+?>', '', s)
    # Only allow alphanumeric characters
    s = re.sub('[^a-zA-Z0-9]', ' ', s)
    # Convert characters to lowercase
    s = s.lower()
    # Generate list of lemmatized tokens with removed stop words
    s = nlp(s)
    s = [token.lemma_ for token in s if not token.is_stop]
    # Combine the items in the list
    s = ' '.join(s)
    return s
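
To see what the two regex steps do on their own, here is a minimal sketch that skips spaCy entirely. `clean_text` is a hypothetical helper name, not part of the notebook. It also shows where the runs of spaces in the single-example output come from: each removed punctuation mark turns into its own space.

```python
import re

def clean_text(s):
    # Hypothetical helper: only the regex and lowercase steps of preprocessing()
    s = re.sub('<[^<]+?>', '', s)        # drop HTML tags such as <br />
    s = re.sub('[^a-zA-Z0-9]', ' ', s)   # replace anything that is not a letter or digit with a space
    return s.lower()

sample = 'A wonderful little production. <br /><br />The filming technique'
cleaned = clean_text(sample)
print(repr(cleaned))
```

The period after "production" becomes a space and sits next to the original space, so the cleaned string contains a double space at that point.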

In [None]:
# This might take a long while
df['preprocessed_text'] = df['review'].apply(preprocessing)

In [None]:
# Save the new dataframe to a csv file in the Drive folder it is read back from
df.to_csv('/content/drive/My Drive/Colab Notebooks/data/preprocessed_imdb_dataset.csv', index=False)

In [10]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/preprocessed_imdb_dataset.csv')

In [11]:
df.head()

Unnamed: 0,review,sentiment,preprocessed_text
0,One of the other reviewers has mentioned that ...,1,reviewer mention watch 1 oz episode ll hook ...
1,A wonderful little production. <br /><br />The...,1,wonderful little production film technique u...
2,I thought this was a wonderful way to spend ti...,1,think wonderful way spend time hot summer week...
3,Basically there's a family where a little boy ...,0,basically s family little boy jake think s...
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,petter mattei s love time money visually s...


In [12]:
X = df['preprocessed_text']
y = df['sentiment']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=101)

In [14]:
# Initialize pipeline to vectorize and model the data
clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])
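
As a rough illustration of what the 'tfidf' step computes, the sketch below rebuilds the weights by hand on three made-up documents. It assumes scikit-learn's default settings (smoothed IDF and L2 normalization) and, as a simplification, splits on whitespace instead of using TfidfVectorizer's own tokenizer. `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Simplification: whitespace tokenization, not TfidfVectorizer's token pattern
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # L2-normalize so every document vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

toy_docs = ['wonderful film wonderful cast', 'terrible film', 'wonderful story']
toy_vectors = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the second toy document, "terrible" ends up with a larger weight than "film" because it appears in fewer documents, which is the behaviour that makes TF-IDF useful as input to Naive Bayes.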

In [15]:
clf.fit(X_train, y_train)

In [16]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.86      0.85      2500
           1       0.86      0.84      0.85      2500

    accuracy                           0.85      5000
   macro avg       0.85      0.85      0.85      5000
weighted avg       0.85      0.85      0.85      5000



This gives an accuracy and F1 score of 85%. Not bad!
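
For reference, the numbers in the classification report come from simple counts of correct and incorrect predictions. Here is a small sketch on made-up labels (not the IMDB data); `binary_report` is a hypothetical helper, not a scikit-learn function.

```python
def binary_report(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
toy_report = binary_report(toy_true, toy_pred)
print(toy_report)
```

With balanced classes, as in this dataset, accuracy and F1 tend to land close together, which matches the report.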

<h2>2. Transformers</h2>

In [80]:
from transformers import pipeline, AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import torch
from datasets import load_metric, Dataset

In [18]:
# Ran this on Google Colab's T4 runtime session
print(torch.cuda.get_device_name(0))

Tesla T4


In [19]:
torch.cuda.is_available()

True

In [27]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

In [29]:
# Use a pretrained sentiment-analysis model
classification = pipeline('sentiment-analysis', device=device)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [30]:
# Test it out on a string
classification('I thoroughly enjoyed this movie!')

[{'label': 'POSITIVE', 'score': 0.9998749494552612}]

In [31]:
# Test it out on a list
classification(['I hated this movie', 'This movie was trash', 'I loved the development of the movie. However it was not perfect. Still highly recommended'])

[{'label': 'NEGATIVE', 'score': 0.99973064661026},
 {'label': 'NEGATIVE', 'score': 0.999752938747406},
 {'label': 'POSITIVE', 'score': 0.9948720335960388}]
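
To score the pipeline's predictions against the 0/1 labels used earlier, the string labels need converting first. A minimal sketch, assuming the list-of-dictionaries format shown in the output; `to_binary` is a hypothetical helper and the scores are rounded copies of the values in that output.

```python
def to_binary(results):
    # Map the pipeline's string labels onto the notebook's encoding (1 = positive, 0 = negative)
    return [1 if r['label'] == 'POSITIVE' else 0 for r in results]

example_results = [
    {'label': 'NEGATIVE', 'score': 0.9997},
    {'label': 'NEGATIVE', 'score': 0.9998},
    {'label': 'POSITIVE', 'score': 0.9949},
]
binary_labels = to_binary(example_results)
print(binary_labels)
```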

In [32]:
X = df['review']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=101)

In [33]:
# Initialize the DistilBert Tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [84]:
# Encode the data and ensure each example has the same length
def preprocess_function(examples):
    return tokenizer(examples, truncation=True, padding='max_length')
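
Roughly, `truncation=True` with `padding='max_length'` forces every review to one fixed length. The sketch below mimics that on plain lists of integers. It is a simplification: it ignores special tokens, and `pad_or_truncate`, `max_length=8`, `pad_id=0`, and the token ids are assumptions chosen for readability, not the tokenizer's real behaviour or limits.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    # Keep at most max_length ids, then fill the remainder with the padding id
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        'input_ids': kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        'attention_mask': [1] * len(kept) + [0] * n_pad,
    }

short = pad_or_truncate([11, 12, 13])          # shorter than max_length: padded
long = pad_or_truncate(list(range(1, 13)))     # longer than max_length: truncated
print(short)
print(long)
```

Padding everything to the maximum length is simple but wasteful for short reviews, which is the trade-off to keep in mind when training time matters.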

In [85]:
tokenized_train_reviews = X_train.map(preprocess_function)
tokenized_test_reviews = X_test.map(preprocess_function)

In [86]:
train = pd.DataFrame(y_train).join(tokenized_train_reviews)
test = pd.DataFrame(y_test).join(tokenized_test_reviews)

In [87]:
train.head()

Unnamed: 0,sentiment,review
45011,1,"[input_ids, attention_mask]"
24946,0,"[input_ids, attention_mask]"
24522,1,"[input_ids, attention_mask]"
17521,0,"[input_ids, attention_mask]"
2982,0,"[input_ids, attention_mask]"


In [88]:
# To avoid indexing errors during training, convert the dataframes to Dataset format
train_input_ids = np.array([x['input_ids'] for x in train['review']])
train_attention_masks = np.array([x['attention_mask'] for x in train['review']])
train_labels = np.array(train['sentiment'])

train_dataset = Dataset.from_dict({
    'input_ids': train_input_ids,
    'attention_mask': train_attention_masks,
    'labels': train_labels
})

In [93]:
test_input_ids = np.array([x['input_ids'] for x in test['review']])
test_attention_masks = np.array([x['attention_mask'] for x in test['review']])
test_labels = np.array(test['sentiment'])

test_dataset = Dataset.from_dict({
    'input_ids': test_input_ids,
    'attention_mask': test_attention_masks,
    'labels': test_labels
})

In [42]:
# Initialize the transformer model to be used for training
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [94]:
# Pad each batch to its longest example (the inputs here are already padded to max_length)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [95]:
def compute_metrics(eval_pred):
    load_accuracy = load_metric('accuracy')
    load_f1 = load_metric('f1')

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = load_accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1 = load_f1.compute(predictions=predictions, references=labels)['f1']
    return {'accuracy': accuracy, 'f1': f1}
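
The `np.argmax(logits, axis=-1)` line picks, for each review, the class with the larger logit. Here is a dependency-free sketch on made-up logits. The softmax is included only to show how logits relate to probabilities; `compute_metrics` does not need it because softmax preserves the ordering of the logits.

```python
import math

def softmax(row):
    # Subtract the row maximum first for numerical stability
    shifted = [x - max(row) for x in row]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    # Index of the largest value in the row
    return max(range(len(row)), key=lambda i: row[i])

toy_logits = [[-1.2, 2.3], [0.8, -0.4]]   # one row per review, one column per class
toy_predictions = [argmax(row) for row in toy_logits]
toy_probabilities = [softmax(row) for row in toy_logits]
print(toy_predictions)
print(toy_probabilities)
```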

In [44]:
# Log in to Hugging Face by pasting an access token
from huggingface_hub import notebook_login
notebook_login()

In [49]:
torch.cuda.current_device()

0

In [96]:
# Specify the hyperparameters to use during training and push the final model to HF
repo_name = 'finetuned-sentiment-model-45000-training-examples'

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy='epoch',
    eval_strategy='epoch',
    push_to_hub=True
)

training_args.device

device(type='cuda', index=0)

In [97]:
# Combine the model, arguments, datasets, and metrics into a Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [98]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.1963,0.189923,0.9284,0.928285
2,0.1623,0.22142,0.9314,0.931768


TrainOutput(global_step=5626, training_loss=0.20152223894191815, metrics={'train_runtime': 5911.0591, 'train_samples_per_second': 15.226, 'train_steps_per_second': 0.952, 'total_flos': 1.192206587904e+16, 'train_loss': 0.20152223894191815, 'epoch': 2.0})
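
As a sanity check, the `global_step=5626` in the TrainOutput can be reproduced from the dataset size, the 10% test split, and the batch size. This sketch assumes a single GPU and no gradient accumulation, so one optimizer step corresponds to one batch of 16 reviews.

```python
import math

n_examples = 50_000              # 25,000 positive + 25,000 negative reviews
n_test = n_examples // 10        # test_size=0.1
n_train = n_examples - n_test    # reviews used for fine-tuning
batch_size = 16                  # per_device_train_batch_size
epochs = 2                       # num_train_epochs

# The last batch of each epoch is smaller than 16 but still counts as a step
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(n_train, steps_per_epoch, total_steps)
```

The 45,000 training reviews also explain the repository name chosen for the fine-tuned model.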

In [99]:
trainer.evaluate()

{'eval_loss': 0.22142043709754944,
 'eval_accuracy': 0.9314,
 'eval_f1': 0.9317684503680127,
 'eval_runtime': 84.6731,
 'eval_samples_per_second': 59.051,
 'eval_steps_per_second': 3.697,
 'epoch': 2.0}

A better model! 🥳 We got an accuracy and F1 score of 93%.

Furthermore, the model is available on the Hugging Face Hub, so we can use it when deploying our application.