
# Analysing Sentiment

Let's first import everything we need and load the dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob, Word
import nltk
import torch
from torch import nn
import seaborn as sns
nltk.download('punkt')

%matplotlib inline
sns.set(rc={'figure.figsize':(20,20)})
import warnings
warnings.filterwarnings('ignore')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget https://raw.githubusercontent.com/axel-sirota/implement-nlp-word-embedding/main/module3/data/yelp.csv
fi

Writing get_data.sh


In [3]:
!bash get_data.sh


--2025-05-02 06:04:04--  https://raw.githubusercontent.com/axel-sirota/implement-nlp-word-embedding/main/module3/data/yelp.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8091185 (7.7M) [text/plain]
Saving to: ‘yelp.csv’


2025-05-02 06:04:04 (106 MB/s) - ‘yelp.csv’ saved [8091185/8091185]



In [5]:
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})
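
To make the filtering and label mapping concrete, here is a minimal sketch on a made-up three-row DataFrame. The `toy` frame and its reviews are hypothetical, not rows from `yelp.csv`; the boolean mask and the `{1: 0, 5: 1}` mapping are the same as in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for yelp.csv, with only the two columns used above.
toy = pd.DataFrame({
    "stars": [5, 3, 1],
    "text": ["Loved it", "It was okay", "Never again"],
})

# Keep only the 5-star and 1-star rows, as in the notebook.
toy_best_worst = toy[(toy.stars == 5) | (toy.stars == 1)]

# Map 1-star reviews to class 0 (negative) and 5-star reviews to class 1 (positive).
toy_X = toy_best_worst.text
toy_y = toy_best_worst.stars.map({1: 0, 5: 1})

print(toy_X.tolist())  # ['Loved it', 'Never again']
print(toy_y.tolist())  # [1, 0]
```

The 3-star row is dropped, so the task becomes a binary classification problem.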


## Splitting into train and test sets and defining the model

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [7]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [8]:
# Dense float32 features for nn.Linear; int64 class labels for nll_loss.
X_train_tensor = torch.tensor(X_train_dtm.toarray(), dtype=torch.float32).to(device)
X_test_tensor = torch.tensor(X_test_dtm.toarray(), dtype=torch.float32).to(device)
y_train = torch.tensor(y_train.values, dtype=torch.long).to(device)
y_test = torch.tensor(y_test.values, dtype=torch.long).to(device)
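
The conversion above matters because `nn.Linear` expects floating-point inputs and `nll_loss` expects integer class indices. The sketch below checks the resulting dtypes on small made-up arrays (hypothetical values), using the explicit-dtype constructor `torch.tensor`, which should yield the same dtypes as the cell above.

```python
import numpy as np
import torch

# Hypothetical word counts and labels standing in for the real matrices.
toy_counts = np.array([[0, 1, 1], [1, 1, 0]])
toy_labels = np.array([1, 0])

features = torch.tensor(toy_counts, dtype=torch.float32)
labels = torch.tensor(toy_labels, dtype=torch.long)

print(features.dtype)  # torch.float32
print(labels.dtype)    # torch.int64
```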

In [9]:
model = nn.Sequential(
  nn.Linear(X_train_tensor.shape[1], 2),
  nn.LogSoftmax(dim = 1)
).to(device)

In [10]:
def forward(X):
  return model(X).to(device)

def loss(y_pred, y):
  return nn.functional.nll_loss(y_pred, y)

def metric(y_pred, y):  # -> accuracy
  return (1 / len(y)) * ((y_pred.argmax(dim = 1) == y).sum())
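
The model ends in `LogSoftmax`, so its outputs are log-probabilities, and `nll_loss` on log-probabilities gives the same value as `cross_entropy` on the raw scores. The sketch below checks this on made-up scores (the `toy_logits` and `toy_targets` values are hypothetical) and computes accuracy the same way as `metric`.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores for 3 samples and 2 classes.
toy_logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [1.5, 1.0]])
toy_targets = torch.tensor([0, 1, 1])

# LogSoftmax followed by NLL loss, as in the notebook.
log_probs = F.log_softmax(toy_logits, dim=1)
nll = F.nll_loss(log_probs, toy_targets)

# Cross-entropy applied directly to the raw scores.
ce = F.cross_entropy(toy_logits, toy_targets)
print(torch.allclose(nll, ce))  # True

# Accuracy: fraction of rows whose argmax matches the target.
toy_accuracy = (log_probs.argmax(dim=1) == toy_targets).float().mean()
print(toy_accuracy.item())  # about 0.667 (2 of 3 correct)
```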


## Let's verify the metric makes sense

In [11]:
y_train_pred = model(X_train_tensor).to(device)
y_train_pred.argmax(dim=1)

tensor([1, 1, 1,  ..., 1, 1, 1], device='cuda:0')

In [12]:
(y_train_pred.argmax(dim = 1) == y_train).sum()

tensor(2303, device='cuda:0')

In [13]:
metric(y_train_pred, y_train)

tensor(0.7047, device='cuda:0')
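
The two outputs above can be cross-checked with simple arithmetic: the metric should equal the number of correct predictions divided by the number of training samples. The training-set size is not printed in the notebook, so the sketch below infers it from the two printed values; treat the inferred size as an assumption.

```python
correct = 2303      # count printed by In [12]
accuracy = 0.7047   # value printed by In [13], rounded to 4 decimals

# Infer the training-set size (an assumption, since it is not printed above).
n_inferred = round(correct / accuracy)
print(n_inferred)  # 3268

# Recomputing the ratio should reproduce the printed accuracy up to rounding.
recomputed = correct / n_inferred
print(abs(recomputed - accuracy) < 5e-5)  # True
```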

In [14]:
del y_train_pred

## The training routine

In [15]:
optimizer = torch.optim.AdamW(model.parameters())

In [16]:
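epochs = 1000
for i in range(epochs):
  # Forward pass over the full training set (full-batch training).
  y_pred = forward(X_train_tensor)
  xe = loss(y_pred, y_train)
  accuracy = metric(y_pred, y_train)
  # Backpropagate to compute the gradients of the loss.
  xe.backward()
  # Report progress every 100 epochs.
  if i % 100 == 0:
    print("Loss: ", xe, " Accuracy ", accuracy.data.item())
  # Update the weights, then clear the gradients for the next iteration.
  optimizer.step()
  optimizer.zero_grad()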

Loss:  tensor(0.6754, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.7047123908996582
Loss:  tensor(0.1304, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9917380809783936
Loss:  tensor(0.0721, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9951040744781494
Loss:  tensor(0.0486, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9966340661048889
Loss:  tensor(0.0361, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9972460269927979
Loss:  tensor(0.0283, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9972460269927979
Loss:  tensor(0.0231, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9981640577316284
Loss:  tensor(0.0193, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9984700083732605
Loss:  tensor(0.0165, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9987760186195374
Loss:  tensor(0.0144, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9987760186195374
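
For readers who want to try the training routine without the Yelp data, here is a minimal self-contained sketch of the same loop on a tiny made-up dataset. The `toy_*` names and values are hypothetical. It clears the gradients at the start of each step rather than at the end, which should be equivalent here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy data: 4 "documents" over a 3-word vocabulary.
toy_X = torch.tensor([[2., 0., 1.], [1., 0., 0.], [0., 2., 1.], [0., 1., 0.]])
toy_y = torch.tensor([1, 1, 0, 0])

# Same architecture as the notebook: one linear layer plus LogSoftmax.
toy_model = nn.Sequential(nn.Linear(3, 2), nn.LogSoftmax(dim=1))
toy_optimizer = torch.optim.AdamW(toy_model.parameters())

losses = []
for step in range(200):
    toy_optimizer.zero_grad()                 # clear old gradients
    log_probs = toy_model(toy_X)              # forward pass
    toy_loss = nn.functional.nll_loss(log_probs, toy_y)
    toy_loss.backward()                       # compute gradients
    toy_optimizer.step()                      # update weights
    losses.append(toy_loss.item())

# Compare the first and last recorded loss values.
print(losses[0] > losses[-1])
```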


In [17]:
y_test_pred = forward(X_test_tensor)
print(f'Model accuracy is {metric(y_test_pred, y_test).item()}')

Model accuracy is 0.898533046245575
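
Evaluation does not need gradients, so a common pattern is to wrap the forward pass in `torch.no_grad()`, which avoids building the autograd graph. The sketch below shows the pattern on a small untrained model with made-up inputs (all `toy_*` names are hypothetical); it is an optional variation, not what the cell above does.

```python
import torch
from torch import nn

# Hypothetical untrained model and test inputs, for illustration only.
toy_model = nn.Sequential(nn.Linear(3, 2), nn.LogSoftmax(dim=1))
toy_X_test = torch.tensor([[1., 0., 2.], [0., 3., 1.]])

toy_model.eval()           # switch to inference mode
with torch.no_grad():      # skip gradient tracking during evaluation
    toy_pred = toy_model(toy_X_test)

print(toy_pred.requires_grad)  # False
print(toy_pred.shape)          # torch.Size([2, 2])
```

Note that the test accuracy of about 0.899 is noticeably lower than the final training accuracy of about 0.999, which suggests the model has fit the training set more closely than it generalizes.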


## Some manual validation

In [18]:
review = np.array(["This place was fantastic"])
vectorized_review = torch.Tensor(vect.transform(review).toarray()).to(device)

In [19]:
prediction = forward(vectorized_review)
prediction.argmax(dim = 1)

tensor([1], device='cuda:0')

The model outputs class 1, so it correctly classifies this review as positive.
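
The `argmax` above returns only the class index. Because the model outputs log-probabilities, exponentiating them recovers class probabilities, which gives a sense of how confident a prediction is. The sketch below uses made-up scores (`toy_log_probs` is hypothetical, not the model's real output for this review).

```python
import torch

# Hypothetical log-probabilities for one review over the two classes.
toy_log_probs = torch.log_softmax(torch.tensor([[0.3, 2.1]]), dim=1)

# exp() turns log-probabilities back into probabilities that sum to 1 per row.
toy_probs = toy_log_probs.exp()
print(toy_probs.sum(dim=1))

# The predicted class is the same whether argmax is taken on probs or log-probs.
print(toy_probs.argmax(dim=1))  # tensor([1])
```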