
# Loading Data and Installing Dependencies

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [19]:
# !ls drive/MyDrive/'Colab Notebooks'/'WiDS 2024'
# !pip uninstall torchtext torch -y
# !pip install torch==2.2.0 torchtext==0.17.0

import torch
import torchtext
from torch.utils.data import Dataset, DataLoader, random_split
from torchtext import datasets
from torchtext.vocab import vocab
from gensim.utils import tokenize
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('stopwords')
# nltk.download('wordnet')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
import re
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
df=pd.read_csv("drive/MyDrive/Colab Notebooks/WiDS 2024/reviews.csv")
df['sentiment'] = df['sentiment'].map({'positive':1, 'negative':0})

for i in range(5):
  rev=df.iloc[i]['review']
  # iloc selects the i-th row by position; df[i] would instead look for a column named i and raise a KeyError
  print(rev)
  # print(list(tokenize(rev)))
  print("No of paras:", len(rev.split('<br /><br />')))
  print("No of sentences:", len(rev.split('.')))
  print("No of words:", len(rev.split()))
  print("Label:", df.iloc[i]['sentiment'])

# Longest and shortest review lengths, in whitespace-separated words
word_counts = df['review'].apply(lambda x: len(x.split()))
print(word_counts.max())
print(word_counts.min())

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

## Tokenization and Preprocessing

In [11]:
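# Stopwords: very common words (e.g. "the", "is") that carry little sentiment signal
stops=set(stopwords.words('english'))
# Also add capitalized forms, since tokens are not lowercased before filtering
capstops=[word.capitalize() for word in stops]
stops.update(capstops)
# stops stays a set so that the membership test in custom_tokenize is O(1)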

# Stemmer and Lemmatizer for normalizing the words to root words
stemmer=PorterStemmer()
lemmatizer=WordNetLemmatizer()

def custom_tokenize(text):
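  text=re.sub('<.*>', '', text) # Meant to strip HTML tags like <br />; note that '<.*>' is greedy, so it also removes the text between the first '<' and the last '>'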
  tokens=list(tokenize(text))
  tokens=[token for token in tokens if token not in stops]
  # can do lower case as normalization
  # tokens=[stemmer.stem(token) for token in tokens]
  # tokens=[lemmatizer.lemmatize(token) for token in tokens]
  return tokens

for i in range(5) :
  rev=df.iloc[i]['review']
  print(custom_tokenize(rev))

up an can Just couldn No ll Couldn't Myself there any don't It Don Which mustn't Such Don't yours hadn
['One', 'reviewers', 'mentioned', 'watching', 'Oz', 'episode', 'hooked', 'right', 'exactly', 'happened', 'would', 'say', 'main', 'appeal', 'show', 'due', 'fact', 'goes', 'shows', 'dare', 'Forget', 'pretty', 'pictures', 'painted', 'mainstream', 'audiences', 'forget', 'charm', 'forget', 'romance', 'OZ', 'mess', 'around', 'first', 'episode', 'ever', 'saw', 'struck', 'nasty', 'surreal', 'say', 'ready', 'watched', 'developed', 'taste', 'Oz', 'got', 'accustomed', 'high', 'levels', 'graphic', 'violence', 'violence', 'injustice', 'crooked', 'guards', 'sold', 'nickel', 'inmates', 'kill', 'order', 'get', 'away', 'well', 'mannered', 'middle', 'class', 'inmates', 'turned', 'prison', 'bitches', 'due', 'lack', 'street', 'skills', 'prison', 'experience', 'Watching', 'Oz', 'may', 'become', 'comfortable', 'uncomfortable', 'viewing', 'thats', 'get', 'touch', 'darker', 'side']
['wonderful', 'little', 'p
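
The first token list above jumps from 'happened' straight to 'would', 'say', 'main', 'appeal', which is consistent with the greedy pattern `<.*>` swallowing everything between the first `<` and the last `>` of a review. A minimal sketch of the difference, using only the standard `re` module. The `strip_tags` helper and the sample string are illustrative and not part of the notebook:

```python
import re

sample = "Great start.<br /><br />Weak middle.<br /><br />Strong ending."

# Greedy: '.*' runs to the LAST '>' in the string, so the middle text is lost too
greedy = re.sub('<.*>', '', sample)

def strip_tags(text):
    # '[^>]+' cannot cross a '>', so each tag is matched and removed on its own;
    # replacing with a space keeps the words on either side from being glued together
    return re.sub(r'<[^>]+>', ' ', text)

print(greedy)
print(strip_tags(sample))
```

Swapping a pattern like this into `custom_tokenize` would change the token lists and the vocabulary built from them, so the outputs recorded in this notebook would no longer match.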

## Train-Test Split

In [13]:
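# 90% of the data for train+validation, the remaining 10% held out as test
train1 = df.sample(frac=0.9, random_state=25)
# 0.8889 of that 90% is ~80% of the full data; the rest (~10%) is validation
train = train1.sample(frac=0.8889, random_state=25)
valid = train1.drop(train.index)
test = df.drop(train1.index)
print(train.shape, valid.shape, test.shape, sep='\n')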

(40000, 2)
(5000, 2)
(5000, 2)
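
The two sampling fractions compose as 0.9 × 0.8889 ≈ 0.8, which is why the three parts come out at 80/10/10. The same idea can be sketched on plain Python indices, assuming 50,000 rows (the total implied by the shapes above). `three_way_split` is a hypothetical helper, not the pandas code used in this notebook:

```python
import random

def three_way_split(n_rows, valid_frac=0.1, test_frac=0.1, seed=25):
    # Shuffle row indices once, then cut the shuffled list into three disjoint slices
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_valid = round(n_rows * valid_frac)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = three_way_split(50_000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Cutting one shuffled index list keeps the three parts disjoint by construction, which the `drop(...index)` calls above achieve in pandas.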


## Vectorization and Mapping to a Vocabulary

In [30]:
import collections

counter = collections.Counter()
for _, row in train.iterrows():
  counter.update(custom_tokenize(row['review']))
  # if _%1000==0:
  #   print(_, end=' ')
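
`Counter.update` called with a list adds one to the count of every token in it, so after the loop `counter` holds token frequencies over the whole training split. A toy sketch, with made-up token lists standing in for the output of `custom_tokenize`:

```python
from collections import Counter

# Stand-ins for the token lists that custom_tokenize would return (illustrative only)
tokenized_reviews = [
    ["great", "film", "great", "cast"],
    ["dull", "film"],
]

counter = Counter()
for tokens in tokenized_reviews:
    counter.update(tokens)  # increments the count of each token in the list

print(counter.most_common(2))
```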


In [31]:
# Min_frequency to filter rare words
min_freq = 5
specials=["<unk>", "<pad>"]
train_vocab = vocab(counter, min_freq=min_freq, specials=specials)
train_vocab.set_default_index(train_vocab["<unk>"]) # out-of-vocabulary tokens map to <unk>, not <pad>
for i in range(5) :
  rev=df.iloc[i]['review']
  tokens=custom_tokenize(rev)
  for w in tokens:
    print(train_vocab[w], w, end=' ')
  print()

584 One 2261 reviewers 2906 mentioned 253 watching 8521 Oz 1079 episode 2100 hooked 308 right 6285 exactly 1173 happened 265 would 333 say 25 main 2675 appeal 258 show 1033 due 186 fact 1449 goes 660 shows 3212 dare 13519 Forget 506 pretty 1114 pictures 1113 painted 2010 mainstream 857 audiences 612 forget 6301 charm 612 forget 4876 romance 10492 OZ 3112 mess 321 around 84 first 1079 episode 383 ever 1051 saw 3989 struck 965 nasty 6778 surreal 333 say 5664 ready 385 watched 341 developed 1376 taste 8521 Oz 375 got 1562 accustomed 62 high 2651 levels 1637 graphic 930 violence 930 violence 12241 injustice 11228 crooked 17533 guards 4003 sold 26192 nickel 8475 inmates 2905 kill 1262 order 48 get 1016 away 338 well 7918 mannered 466 middle 3059 class 8475 inmates 685 turned 2902 prison 29802 bitches 1033 due 836 lack 764 street 7906 skills 2902 prison 2168 experience 670 Watching 8521 Oz 1222 may 1086 become 1801 comfortable 7792 uncomfortable 485 viewing 2475 thats 48 get 2963 touch 9612 
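
What the vocabulary lookup does can be sketched without torchtext: tokens seen fewer than `min_freq` times are dropped, the special tokens take the first indices, and anything missing falls back to `<unk>`. `build_vocab`, `encode`, and `max_len` are hypothetical names for this sketch, and right-padding to a fixed length is an assumption about how `<pad>` would be used later. The indices it produces are not meant to match the ones printed above.

```python
from collections import Counter

def build_vocab(counter, min_freq, specials=("<unk>", "<pad>")):
    # Specials take the first indices; remaining tokens need at least min_freq occurrences
    itos = list(specials) + [tok for tok, n in counter.items()
                             if n >= min_freq and tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, max_len):
    # Unknown tokens fall back to <unk>; short sequences are right-padded with <pad>
    ids = [stoi.get(tok, stoi["<unk>"]) for tok in tokens[:max_len]]
    return ids + [stoi["<pad>"]] * (max_len - len(ids))

counter = Counter({"film": 7, "great": 5, "nickel": 1})
stoi = build_vocab(counter, min_freq=5)
print(encode(["great", "nickel", "film"], stoi, max_len=5))
```

Keeping `<unk>` and `<pad>` as separate indices lets a model distinguish "a word it has not seen" from "no word here".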