# Sentiment Classification on IMDB Dataset

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv("IMDB Dataset.csv")
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Basic Data Analysis

In [3]:
print("Number of rows : ",data.shape[0])
print("Number of columns : ", data.shape[1])

Number of rows :  50000
Number of columns :  2


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
#sentiment_count
data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [6]:
data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [7]:
text = data["review"][1]
text

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

## Inferences from Analysis
**`Note`** 
- We have a balanced dataset with just two categories: positive and negative.
- There are no missing values.
- The reviews contain redundant words and HTML markup (for example `<br />` tags).
- Punctuation and stopwords appear throughout the reviews.
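
The inferences above can be checked programmatically. The sketch below runs the same kind of checks on a tiny, hypothetical stand-in DataFrame (`toy` is not the real data); on the real `data` frame the duplicate count would be expected to be non-zero, since `describe()` reports 49,582 unique reviews out of 50,000.

```python
import pandas as pd

# Hypothetical miniature stand-in for the IMDB DataFrame (not the real data).
toy = pd.DataFrame({
    "review": [
        "Great film.<br /><br />Loved it!",
        "Terrible plot, bad acting.",
        "Great film.<br /><br />Loved it!",   # exact duplicate of the first row
        "Boring and far too long.",
    ],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

class_counts = toy["sentiment"].value_counts().to_dict()   # class balance
missing = int(toy.isnull().sum().sum())                     # total missing cells
duplicates = int(toy["review"].duplicated().sum())          # repeated review texts
with_html = int(toy["review"].str.contains("<br", regex=False).sum())  # rows containing <br> tags

print(class_counts, missing, duplicates, with_html)
```

Whether duplicated reviews should be dropped before splitting is a design choice that this notebook does not make; the sketch only shows how to measure them.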

# [Text preparation](https://eugenia-anello.medium.com/nlp-tutorial-series-d0baaf7616e0)

## Data Cleaning and Pre-processing

#### [1. Remove HTML tags from strings](https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string)

In [8]:
import re
from bs4 import BeautifulSoup

In [9]:
cleantext = BeautifulSoup(text, "lxml").text
cleantext

'A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.'

#### [2. Remove punctuation and special characters from strings](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)
- We will remove everything except lowercase and uppercase letters using regular expressions.

In [10]:
cleantext = re.sub(r'\[[^]]*\]', ' ', cleantext)  # remove any text enclosed in square brackets
cleantext = re.sub(r'[^a-zA-Z]', ' ', cleantext)  # replace every character except a-z or A-Z with a space
cleantext = re.sub(r'[^\w\s]', ' ', cleantext)    # remove remaining punctuation (a no-op after the previous step)
cleantext

'A wonderful little production  The filming technique is very unassuming  very old time BBC fashion and gives a comforting  and sometimes discomforting  sense of realism to the entire piece  The actors are extremely well chosen  Michael Sheen not only  has got all the polari  but he has all the voices down pat too  You can truly see the seamless editing guided by the references to Williams  diary entries  not only is it well worth the watching but it is a terrificly written and performed piece  A masterful production about one of the great master s of comedy and his life  The realism really comes home with the little things  the fantasy of the guard which  rather than use the traditional  dream  techniques remains solid then disappears  It plays on our knowledge and our senses  particularly with the scenes concerning Orton and Halliwell and the sets  particularly of their flat with Halliwell s murals decorating every surface  are terribly well done '

#### [3. Remove stopwords from data](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)

Stopwords: very common words that carry little information for machine learning and are therefore removed.

In [11]:
import nltk
from nltk.corpus import stopwords

#nltk.download('stopwords')                 # download once if the corpus is not installed
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

# Tokenization
- First, convert every review to lowercase.
- Split the text into tokens: the string becomes a list in which each element is a word.
- Remove stopwords: the check runs on each word, which is why the text must be split first.

In [12]:
cleantext = cleantext.lower()               # convert the text to lowercase
stopword = set(stopwords.words('english'))  # a set gives fast membership tests
tokens = cleantext.split()
token_list = []
for token in tokens:
    if token not in stopword:               # keep only tokens that are not stopwords
        token_list.append(token)
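
One design choice worth noting: the NLTK list printed above includes 'no', 'nor' and 'not', so negations are removed along with the other stopwords. The sketch below uses a short, hypothetical stopword set (not the NLTK list) to show how negations could be kept if that turns out to matter for sentiment; it is an optional variation, not what this notebook does.

```python
# Hypothetical, shortened stopword set used only for illustration;
# the notebook uses NLTK's full English list instead.
base_stopwords = {"the", "is", "a", "was", "this", "not", "no", "nor", "and", "it"}
negations = {"not", "no", "nor"}               # words we choose to keep
custom_stopwords = base_stopwords - negations

def remove_stopwords(text, stopword_set):
    """Lowercase, split on whitespace and drop tokens found in stopword_set."""
    return [tok for tok in text.lower().split() if tok not in stopword_set]

sentence = "This movie was not good"
without_negation = remove_stopwords(sentence, base_stopwords)
with_negation = remove_stopwords(sentence, custom_stopwords)

print(without_negation)   # ['movie', 'good']
print(with_negation)      # ['movie', 'not', 'good']
```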

In [13]:
#token_list

## Normalize Words
- Convert each word to its base form.
- **For example**, playing, plays and play look different to the computer, but they share the same base word.

### [4. Lemmatization vs Stemming](https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_stemming_lemmatization.htm#)

- Stemming: reduces a word to its stem by stripping suffixes; the result may not be a valid word, and its meaning or context can be lost.
- Lemmatization: converts a word to its root form (lemma) while keeping its meaning/context, so the result is a valid word.

In [14]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [15]:
#nltk.download('wordnet')
#nltk.download('omw-1.4')

### Lemmatization

In [16]:
lemma_word = []
for token in token_list:
    lemma_word.append(lemmatizer.lemmatize(token))

join_text = ' '.join(lemma_word)
join_text

'wonderful little production filming technique unassuming old time bbc fashion give comforting sometimes discomforting sense realism entire piece actor extremely well chosen michael sheen got polari voice pat truly see seamless editing guided reference williams diary entry well worth watching terrificly written performed piece masterful production one great master comedy life realism really come home little thing fantasy guard rather use traditional dream technique remains solid disappears play knowledge sens particularly scene concerning orton halliwell set particularly flat halliwell mural decorating every surface terribly well done'

# Feature Engineering

## Vector / mathematical Representation of words
- Cleaned text isn’t enough to be passed directly to the classification model. The features need to be numeric, not strings

#### To vectorize the text we have the following methods

1. CountVectorizer (Bag of Words Model)
2. TfidfVectorizer (Bag of Words Model)
3. Keras Tokenizer (Embedding)
4. Word embedding : 
> a. word2vec  
> b. glove

## Bag of words  (CountVectorizer )
- It’s an algorithm that transforms the text into fixed-length vectors. This is possible by counting the number of times the word is present in a document.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer            
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
corpus = []
corpus.append(join_text)

In [19]:
vec = CountVectorizer()
BOW = vec.fit_transform(corpus)
BOW.toarray()

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1,
        1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1]], dtype=int64)

- The text is now numeric: each entry is 1, 2 or 3, the number of times the corresponding word appears in the text.

# TFIDF

- TF stands for Term Frequency: how many times a word (term) appears in a text (document).
- IDF stands for Inverse Document Frequency and is calculated as log(number of documents in corpus / number of documents containing the term).

- Finally **TF-IDF score is calculated as TF * IDF**

**`Note`**
- IDF acts as a balancing factor and diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely
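
The formula above is the textbook definition. `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`) uses a smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length. The standard-library sketch below reproduces that variant on whitespace-split tokens; `tfidf_rows` is a hypothetical helper, and the tokenisation is a simplification of what scikit-learn does.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2-normalised rows, mirroring scikit-learn's defaults."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(term in doc for doc in tokenized) for term in vocab}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in raw)) or 1.0
        rows.append([value / norm for value in raw])
    return vocab, rows

# One document: every idf is 1, so the scores are just normalised counts.
single_vocab, single_rows = tfidf_rows(["good good movie"])
print(single_vocab, single_rows)

# Two documents: "movie" occurs in both, so it is weighted below "good".
pair_vocab, pair_rows = tfidf_rows(["good movie", "bad movie"])
print(pair_vocab, pair_rows[0])
```

This also explains the single-review output in this section: with one document in the corpus every idf equals 1, so each score is a multiple of the same base value (0.19425717 is about 2 × 0.09712859 and 0.29138576 about 3 × 0.09712859).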

In [20]:
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(corpus)
tf_idf.toarray()

array([[0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.19425717, 0.09712859, 0.09712859,
        0.09712859, 0.19425717, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.19425717,
        0.09712859, 0.09712859, 0.19425717, 0.09712859, 0.09712859,
        0.19425717, 0.09712859, 0.19425717, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.19425717, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.09712859, 0.09712859, 0.09712859,
        0.09712859, 0.09712859, 0.29138576, 0.09

### Function for the complete data-cleaning process

In [21]:
# tqdm shows a progress bar while the loop runs
from tqdm import tqdm

In [22]:
def data_cleaner(data):
    """Strip HTML, keep letters only, lowercase, drop stopwords and lemmatize each review."""
    clean_data = []
    for sentence in tqdm(data):
        cleantext = BeautifulSoup(sentence, "lxml").text     # strip HTML tags
        cleantext = re.sub(r'\[[^]]*\]', ' ', cleantext)     # remove text enclosed in square brackets
        cleantext = re.sub(r'[^a-zA-Z]', ' ', cleantext)     # keep letters only
        cleantext = re.sub(r'[^\w\s]', '', cleantext)        # remove remaining punctuation (a no-op after the previous step)
        tokens = [token for token in cleantext.lower().split() if token not in stopword]  # drop stopwords
        clean_text = ' '.join(lemmatizer.lemmatize(token) for token in tokens)
        clean_data.append(clean_text.strip())                # remove surrounding whitespace
    return clean_data

In [23]:
clean_data = data_cleaner(data['review'])

100%|████████████████████████████████████████████████████████████████████████████████████████| 50000/50000 [01:13<00:00, 684.58it/s]


In [24]:
clean_data[100]

'short film inspired soon full length feature spatula madness hilarious piece contends similar cartoon yielding multiple writer short film star edward spatula fired job join fight evil spoon premise allows funny content near beginning barely present remainder feature film minute running time absorbed odd ball comedy small musical number unfortunately much else lie plot set really time show surely follows plot better many high budget hollywood film film worth watching least time take expect deep story'

In [25]:
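# Convert sentiment to numeric form: 1 for the positive label and 0 for the negative label.
# Assigning the mapped column back avoids the chained in-place replace, which newer pandas versions warn about.
data['sentiment'] = data['sentiment'].map({'negative': 0, 'positive': 1})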

In [26]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


# Machine Learning Models

### Feature Selection

- We will now apply the techniques discussed above to the whole dataset. There is no separate test dataset, so we set aside 20% of the data to test the performance of the model.

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
X_train, X_test, y_train, y_test = train_test_split(
    data['review'], data['sentiment'],
    test_size=0.2, random_state=42, stratify=data["sentiment"])

In [29]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(40000,) (40000,)
(10000,) (10000,)


## Cleaning only the training dataset

In [30]:
clean_traindata = data_cleaner(X_train)

100%|████████████████████████████████████████████████████████████████████████████████████████| 40000/40000 [00:59<00:00, 677.34it/s]


### Validating with an example

In [31]:
X_train.iloc[25]

"Martha Plimpton has done some prestigious movies, working with River Phoenix and Harrison Ford, but she was never able to expand her limited, tomboyish appeal into the same class as, say, Molly Ringwald. This film, which was barely released, is just an extension of her late '80s/early '90s attempts to find a screen-persona which was identifiable to moviegoers, and it represents another failure. Plimpton plays a troubled young woman who finds out on her 21st birthday that she was adopted and--worse than that--was actually abandoned as an infant on her parents' doorstep! She sets out to find her biological mother and father, but the viewer has no clue why she'd even want to (would simple curiosity give her this much determination?). Unattractive material given sitcom handling; it starts off on the wrong foot and never recovers. Plimpton gives a sour, surly performance, but Hector Elizondo and Mary Kay Place are fine as her adoptive parents. *1/2 from ****"

In [32]:
clean_traindata[25]

'martha plimpton done prestigious movie working river phoenix harrison ford never able expand limited tomboyish appeal class say molly ringwald film barely released extension late early attempt find screen persona identifiable moviegoer represents another failure plimpton play troubled young woman find st birthday adopted worse actually abandoned infant parent doorstep set find biological mother father viewer clue even want would simple curiosity give much determination unattractive material given sitcom handling start wrong foot never recovers plimpton give sour surly performance hector elizondo mary kay place fine adoptive parent'

## Cleaning test dataset separately

In [33]:
clean_testdata = data_cleaner(X_test)

100%|████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:14<00:00, 674.86it/s]


### Validating with an example

In [34]:
X_test.iloc[25]

"This is not a GREAT movie as tho the cast (especially the kids) admirably help to carry along this very sad yet contrived plot it is filled with cliché upon cliché. Poor family in 50's mid America, dying mother, alcoholic father, 10 children (1 of whom has epilepsy) and an awful decision to be made. Its very easy to watch and some of the kids performances are moving without being sickly or naff. And little Frank and Warrnen steal the show for me with the last scene leaving me bawling no matter how many times I see it. A great rainy afternoon movie i recommend to all. Only those with the hardest of hearts could fail to be moved by it. Not on a par to Sophies Choice but a good TV movie equivalent!!!"

In [35]:
clean_testdata[25]

'great movie tho cast especially kid admirably help carry along sad yet contrived plot filled clich upon clich poor family mid america dying mother alcoholic father child epilepsy awful decision made easy watch kid performance moving without sickly naff little frank warrnen steal show last scene leaving bawling matter many time see great rainy afternoon movie recommend hardest heart could fail moved par sophies choice good tv movie equivalent'

In [36]:
data.shape

(50000, 2)

## Model 1 : Logistic Regression
**Important tuning parameters for Logistic Regression:**

- C       : inverse of regularization strength (lambda)
- penalty : type of regularization - 'l1', 'l2', 'elasticnet'
- solver  : algorithm used for optimization - 'liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga'

> [Advanced optimization solver **`lbfgs`**](https://www.youtube.com/watch?v=j4TJZBIDR28)
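
The parameters listed above can be tuned with `GridSearchCV`, which this notebook imports but does not go on to use. The sketch below shows one possible setup on a tiny, made-up corpus; the texts, labels, grid values and `cv=2` are all assumptions for illustration, not settings taken from this notebook. Wrapping the vectorizer and the classifier in a `Pipeline` means the TF-IDF vocabulary is refit on each training fold only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 4 positive (1) and 4 negative (0) reviews.
toy_texts = [
    "great film loved", "wonderful acting great story",
    "loved every minute", "brilliant wonderful movie",
    "terrible film hated", "awful acting boring story",
    "hated every minute", "boring awful movie",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Candidate values for C (inverse regularization strength); illustrative only.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
```

On the real data the same pattern would take `clean_traindata` and `y_train` in place of the toy lists, with a larger `cv` value.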

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV

#### Vectorization : TF-IDF

In [38]:
tfidf_LRvec = TfidfVectorizer(ngram_range=(1, 2))

tfidf_LRvec_train = tfidf_LRvec.fit_transform(clean_traindata)
tfidf_LRvec_test = tfidf_LRvec.transform(clean_testdata)

In [39]:
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(tfidf_LRvec_train, y_train)
predict_LR = LR_model.predict(tfidf_LRvec_test)

In [40]:
print("Classification Report: \n", classification_report(y_test, predict_LR,target_names=['Negative','Positive']))
print("Confusion Matrix: \n", confusion_matrix(y_test, predict_LR))
print("Accuracy: \n", accuracy_score(y_test, predict_LR))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.91      0.88      0.89      5000
    Positive       0.89      0.91      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Confusion Matrix: 
 [[4416  584]
 [ 461 4539]]
Accuracy: 
 0.8955


## Model 2 : LinearSVC

- It applies a **linear kernel function** to perform classification.
- It is implemented on top of the **liblinear** library.
- It minimizes the **squared hinge loss** by default.
- It has an additional **penalty** parameter for the regularization norm: L1 or L2; L2 is the default.
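
Unlike logistic regression, `LinearSVC` does not provide `predict_proba`; its `decision_function` returns a signed distance from the separating hyperplane, and for a binary problem the sign determines the predicted class. A minimal sketch on made-up data (all `toy_` names are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = positive, 0 = negative.
toy_texts = [
    "great film loved", "wonderful acting great story",
    "loved every minute", "brilliant wonderful movie",
    "terrible film hated", "awful acting boring story",
    "hated every minute", "boring awful movie",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_vec = TfidfVectorizer()
toy_svc = LinearSVC(C=0.5, random_state=42)
toy_svc.fit(toy_vec.fit_transform(toy_texts), toy_labels)

new_reviews = toy_vec.transform(["wonderful story", "boring film"])
scores = toy_svc.decision_function(new_reviews)   # signed distances to the hyperplane
preds = toy_svc.predict(new_reviews)              # positive score -> class 1, otherwise class 0

print(scores, preds)
```

The magnitude of each score can serve as a rough confidence measure, although it is not a calibrated probability.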

### Vectorization : TF-IDF

In [41]:
tfidf_vec = TfidfVectorizer(ngram_range=(1, 3))

tfidf_SVCvec_train = tfidf_vec.fit_transform(clean_traindata)
tfidf_SVCvec_test = tfidf_vec.transform(clean_testdata)

In [42]:
from sklearn.svm import LinearSVC

linear_svc = LinearSVC(C=0.5, random_state=42)   #C=Regularization Parameter
linear_svc.fit(tfidf_SVCvec_train, y_train)

predict_SVC = linear_svc.predict(tfidf_SVCvec_test)

In [43]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("Classification Report: \n", classification_report(y_test, predict_SVC,target_names=['Negative','Positive']))
print("Confusion Matrix: \n", confusion_matrix(y_test, predict_SVC))
print("Accuracy: \n", accuracy_score(y_test, predict_SVC))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.91      0.89      0.90      5000
    Positive       0.89      0.92      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Confusion Matrix: 
 [[4447  553]
 [ 414 4586]]
Accuracy: 
 0.9033


## Model 3 : Multinomial Naive Bayes Classifier

- A Bayesian model combines prior probabilities with the observed features to estimate posterior probabilities, which is **helpful for classification with discrete features, as in text classification.**

In [44]:
tfidf_vec_NB = TfidfVectorizer(ngram_range=(1, 1))
tfidf_vec_train_NB = tfidf_vec_NB.fit_transform(clean_traindata)
tfidf_vec_test_NB = tfidf_vec_NB.transform(clean_testdata)

In [45]:
from sklearn.naive_bayes import MultinomialNB

multi_clf = MultinomialNB()
multi_clf.fit(tfidf_vec_train_NB, y_train)

predict_NB = multi_clf.predict(tfidf_vec_test_NB)

In [46]:
print("Classification Report: \n", classification_report(y_test, predict_NB,target_names=['Negative','Positive']))
print("Confusion Matrix: \n", confusion_matrix(y_test, predict_NB))
print("Accuracy: \n", accuracy_score(y_test, predict_NB))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.86      0.88      0.87      5000
    Positive       0.88      0.86      0.87      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Confusion Matrix: 
 [[4403  597]
 [ 719 4281]]
Accuracy: 
 0.8684


## Conclusion

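- **LinearSVC** with **TF-IDF vectorization** gives the highest accuracy (0.9033), ahead of Logistic Regression (0.8955) and Multinomial Naive Bayes (0.8684).
- The three models use different n-gram ranges ((1, 3), (1, 2) and (1, 1) respectively), so the comparison is not strictly like for like.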
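
As a cross-check, the accuracies can be recomputed from the confusion matrices printed above: accuracy is the sum of the diagonal (correct predictions) divided by the total number of test reviews. The helper name `accuracy_from_confusion` is introduced here for illustration.

```python
# Confusion matrices copied from the outputs above: rows are true labels
# (negative, positive), columns are predicted labels.
confusion = {
    "Logistic Regression": [[4416, 584], [461, 4539]],
    "LinearSVC": [[4447, 553], [414, 4586]],
    "Multinomial NB": [[4403, 597], [719, 4281]],
}

def accuracy_from_confusion(cm):
    """Accuracy = correct predictions (diagonal) / all predictions."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

accuracies = {name: accuracy_from_confusion(cm) for name, cm in confusion.items()}
best_model = max(accuracies, key=accuracies.get)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.4f}")
print("Best:", best_model)
```

The results (0.8955, 0.9033 and 0.8684) match the reported accuracy scores.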

## Testing
[Test on user data review link](https://www.imdb.com/title/tt6443346/reviews?ref_=tt_urv)

- 0: Negative; 1: positive

In [62]:
text_0 = ["I hoped 'Black Adam' would be a decent entry in DC's recent output as its linked with 'Shazam' which might be their best recent movie. But nope. Its another CGI filled dud On the plus side there is an effort put into the story and Dwayne Johnson and his rediculusly strong shoulders buldging from his costume are 100 percent real amongst all the boring CGI. But all the team building with rubbish characters no one cares about has already been DC's downfall with the justice league, Suicide squad and even that Harley Quinn movie. And it's takes up too much time here too. There's not enough time dedicated to making us care about any of these minor characters and their random powers. Just action, action, action that doesn't matter when you don't care about anyone in the movie. There's some decent gags in the movie I suppose. But overall its just alot of green screen and feels like your watching a computer game instead of a movie. It's more 'Aquaman' than 'Shazam.' So if you liked that one you'll probably like this"]
text_1 = ["Black Adam is not without its flaws, but i still enjoy the hell out of it, the problem that i have with Black Adam is that i never really buy all the Kahndaq people 100%, in the end i only still buy half of it, but i still fine with it, it's not like their characters are terrible, it just most of the time they are doing just a poor performance except the actress who played the mom and probably Ishmael for the most part, at least i still got some of the laughs from one of them, some of the slower scene was sometimes misplaced including some of the flashback, and some nit picky here and there, the rest of Black Adam is an exciting comic book film that fills with a very impressive battle that also fills with dazzling visual and cool style, Black Adam himself played incredibly well by Dwayne Johnson, he was just unstoppable in the most badass way, Justice Society has just become one of my favorite superhero team specifically in movies, they are fun and they are great, Hawkman, Doctor Fate, Cyclone, and Atom Smasher all of them are a hit for me, i love every single one of them, though i do wish they have something more with Doctor Fate because i think Pierce Brosnan was just unbelievably good playing the characters, again at least we do still have that one moment, i would love to see Justice Society again in movie maybe even their own movie, i would also love to have a specifically Doctor Fate standalone or a prequel film, and last but not least, the main villain, i think the main villain is doing as fine as most of the hero in the movie, so overall Black Adam is a quite a fun ride, 2 hours did feel quite goes by, an exciting back and forth battle that fills with dazzling visual effect and style, great score, great sound, great performances for most of the cast, being the first original DCEU movie that release in theaters only since WW84, Black Adam is not perfect but it's still great, it's still fun, it's still fascinating, it's still exciting, and i enjoy the hell out of it."]

In [65]:
#Testing on multinomial_NB classifier :-
text_vec = tfidf_vec_NB.transform(text_0)
print(multi_clf.predict(text_vec))

[0]


In [64]:
#Testing on linear_SVC classifier :-
text_vec = tfidf_vec.transform(text_0)
print(linear_svc.predict(text_vec))

[0]
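
The two test cells above pass the raw review text straight to the vectorizers, whereas the models were trained on text that had been through `data_cleaner`. A possible refinement is to route new reviews through the same cleaning step before vectorizing. The sketch below illustrates the pattern on made-up data; `simple_clean` is a simplified, hypothetical stand-in for `data_cleaner` (no stopword removal or lemmatization), and the toy model is not the one trained in this notebook.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def simple_clean(text):
    """Simplified stand-in for data_cleaner: strip tags, keep letters, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only
    return " ".join(text.lower().split())      # lowercase and collapse whitespace

# Hypothetical toy training data: 1 = positive, 0 = negative.
toy_reviews = [
    "Great film, loved it!",
    "Wonderful acting.<br />Brilliant story",
    "Awful film, hated it",
    "Boring story and terrible acting",
]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vec.fit_transform([simple_clean(r) for r in toy_reviews]), toy_labels)

def predict_sentiment(raw_reviews):
    """Clean new reviews the same way as the training data, then vectorize and predict."""
    cleaned = [simple_clean(review) for review in raw_reviews]
    return toy_model.predict(toy_vec.transform(cleaned))

cleaned_example = simple_clean("Awful!<br />Boring film.")
prediction = predict_sentiment(["Awful!<br />Boring film."])
print(cleaned_example, prediction)
```

In this notebook the equivalent call would be along the lines of `tfidf_vec.transform(data_cleaner(text_0))`; whether that changes the predictions shown above has not been checked here.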
