### Bag of words: Solutions


- In this exercise, you will classify whether a given movie review is **positive or negative**.
- You will use Bag of Words to pre-process the text and then apply different classification algorithms.
- scikit-learn's CountVectorizer provides a built-in implementation of Bag of Words.

In [1]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- The data consists of two columns:
    - review
    - sentiment
- Each review is the text a user wrote after watching a movie.
- The sentiment column tells whether the given review is positive or negative.

In [2]:
#1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable
df = pd.read_csv("movies_sentiment_data.csv")


#2. print the shape of the data
print(df.shape)

#3. print top 5 datapoints
df.head()

(19000, 2)


|   | review | sentiment |
|---|--------|-----------|
| 0 | I first saw Jake Gyllenhaal in Jarhead (2005) ... | positive |
| 1 | I enjoyed the movie and the story immensely! I... | positive |
| 2 | I had a hard time sitting through this. Every ... | negative |
| 3 | It's hard to imagine that anyone could find th... | negative |
| 4 | This is one military drama I like a lot! Tom B... | positive |


In [3]:
#create a new column 'Category' that is 1 if the sentiment is positive and 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

In [4]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.

df['Category'].value_counts()

Category
1    9500
0    9500
Name: count, dtype: int64

In [5]:
#Do the 'train-test' split with a test size of 20%
#(train_test_split was already imported in the first cell)

X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
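
The split above is called without a seed, so every run shuffles the data differently, and the class counts in the test set can drift slightly from 50/50. A minimal sketch of a reproducible, stratified split follows. The toy DataFrame and the value `random_state=42` are assumptions for illustration; the notebook itself uses neither `random_state` nor `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df; the notebook loads movies_sentiment_data.csv instead
toy_df = pd.DataFrame({
    "review": [f"review number {i}" for i in range(10)],
    "Category": [1, 0] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df.review,
    toy_df.Category,
    test_size=0.2,
    random_state=42,            # fixes the shuffle so reruns give the same split
    stratify=toy_df.Category,   # keeps the class ratio the same in train and test
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

With a fixed seed, classification reports from different cells are computed on the same test rows, which makes them directly comparable.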

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.

- use **Random Forest** as the classifier with 50 estimators (n_estimators=50) and entropy as the criterion.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [6]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),  #initializing the vectorizer
    ('random_forest', RandomForestClassifier(n_estimators=50, criterion='entropy'))  #using the RandomForest classifier
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83      1910
           1       0.83      0.82      0.82      1890

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800



In [12]:
reviews = ["I first saw Jake Gyllenhaal in Jarhead (2005) a little while back and, since then, I've been watching every one of his movies that arrives on my radar screen. Like Clive Owen, he has an intensity (and he even resembles Owen somewhat) that just oozes from the screen. I feel sure that, if he lands some meaty roles, he'll crack an Oscar one day...<br /><br />That's not to denigrate this film at all.<br /><br />It's a fine story, with very believable people (well, it's based upon the author's early shenanigans with rocketry), a great cast  Chris Cooper is always good, and Laura Dern is always on my watch list  with the appropriate mix of humor, pathos, excitement...and the great sound track with so many rock n roll oldies to get the feet tapping.<br /><br />But, this film had a very special significance for me: in 1957, I was the same age as Homer Hickham; like him, I looked up at the night stars to watch Sputnik as it scudded across the blackness; like Homer also, I experimented with rocketry in my backyard and used even the exact same chemicals for fuel; and like Homer, I also had most of my attempts end in explosive disaster! What fun it was...<br /><br />I didn't achieve his great (metaphorical and physical) heights though. But, that's what you find out when you see this movie.<br /><br />Sure, it's a basic family movie, but that's a dying breed these days, it seems. Take the time to see it, with the kids: you'll all have a lot of good laughs.",
"I enjoyed the movie and the story immensely! I have seen the original(1939 I believe) and enjoyed them both. To really appreciate the story one must be familiar with English culture and customs. The prof.(Peter O'Toole) was dedicated to his school and ""the boys"" in that school. It was an English ""public"" school, which we in the U.S. refer to as a private school (E.G. Andover). He is a very ascetic person and, on the surface, gives the appearance of being stiff, stuffy, uncaring, and weak to the point of being effeminate. He is strict in his educational standards because he DOES care for ""his lads"", i.e., he doesn't want them to get a cheap or weak education. He meets(through introduction) a ""dance hall girl""(Petula Clark) and is totally smitten. In England at the time, the reference to ""dance hall"" carried the connotation of extreme sexual promiscuity and was definitely ""lower class"". We find that the Prof. is in fact a very tough and courageous person as well as loyal to people and institutions that he loves and/or respects. Clark becomes more than a lover and wife...she ""leavens"" his personality and allows him to grow as a man and a person, much to the benefit of his beloved school and his own happiness. The first movie was set BEFORE WW II, this one goes through WW II, also, it is 1969( we've had the ""British Invasion""...Beetles, etc. Clark had hits and was very popular then...still is to me), the music is great, color and photography excellent. I think O'Toole played the character perfectly! There ARE dedicated people like ""Chips""...all around us but many do not receive the recognition. Very enjoyable movie and story!",
"I had a hard time sitting through this. Every single twist and turn is predictable. You're sitting there just waiting for it ... waiting for it ... and yes, there it is! Just as you predicted at the very beginning of the movie. Or 10 minutes out. Et cetera. <br /><br />Smart writing? No. <br /><br />Torture porn? No, there's no nudity. Other reviews calling this torture porn are most likely written by people on heavy drugs. Unfortunately there's no torture and no nudity (yes, no nudity).<br /><br />There's no suspense at all in this ""thriller"". The only good part about this movie is the ending, but I'm not going to spoil that.<br /><br />I'm giving it a 2/10. A 1/10 would be a horrible B-movie. This movie had better acting."
]

In [13]:
# Get predictions for new data
predictions = clf.predict(reviews)
predictions

array([1, 1, 0], dtype=int64)

- As the classification report above shows, precision, recall and F1-score are all above 80% for both classes (positive and negative sentiment). This seems to be an acceptable performance.
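
As a reminder of what the report's columns mean, the sketch below computes precision, recall and F1 for the positive class by hand and with scikit-learn. The labels are hypothetical and chosen only to keep the arithmetic small; they are not the notebook's predictions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted positive, actually positive
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted positive, actually negative
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted negative, actually positive

precision = tp / (tp + fp)  # share of positive predictions that are right
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

The same quantities are reported per class in the classification report, and the support column is the number of true samples of each class.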

**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.
- use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html



In [10]:

#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('KNN', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))  #using the KNN classifier with 10 neighbors
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.63      0.66      0.64      1849
           1       0.66      0.64      0.65      1951

    accuracy                           0.65      3800
   macro avg       0.65      0.65      0.65      3800
weighted avg       0.65      0.65      0.65      3800



- Hmm, here the metrics (precision, recall and F1-score) are noticeably lower, at roughly 63-66%. Let's try one more classifier and then discuss why the performance varies so much.

**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html



In [11]:

#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('Multi NB', MultinomialNB())  #using the Multinomial Naive Bayes classifier
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85      1849
           1       0.88      0.82      0.85      1951

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800



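- That's great! With MultinomialNB we again get more than 80% precision, recall and F1-score for both classes (positive and negative sentiment), on par with Random Forest. This seems to be an acceptable performance. Note that the Random Forest report has different support counts (1910/1890 versus 1849/1951 here), so it was computed on a different random split and the comparison is only approximate.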
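
Because train_test_split is called without a seed, reports produced in separate runs can come from different splits. One way to compare the three classifiers fairly is to fit them all on a single fixed split, as in the rough sketch below. The tiny corpus, the seed values and `n_neighbors=3` are assumptions made so the example runs quickly; the resulting scores say nothing about the IMDB data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews: six positive (1) followed by six negative (0)
texts = [
    "great film loved every minute",
    "wonderful acting and a great story",
    "loved the cast wonderful film",
    "brilliant story great ending",
    "a great and moving film",
    "loved it brilliant acting",
    "boring plot terrible acting",
    "awful film hated every minute",
    "terrible story and a boring cast",
    "hated the ending awful plot",
    "a boring and awful film",
    "terrible waste of time hated it",
]
labels = [1] * 6 + [0] * 6

# One fixed, stratified split shared by every model
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "multinomial_nb": MultinomialNB(),
}

scores = {}
for name, model in models.items():
    # Each model gets its own vectorizer so nothing leaks between pipelines
    pipe = Pipeline([("vectorizer", CountVectorizer()), (name, model)])
    pipe.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, pipe.predict(X_te))

print(scores)
```

The same loop could be run with X_train, X_test, y_train and y_test from the notebook, and with classification_report in place of accuracy_score.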

### Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?

- Machine learning algorithms do not work on text data directly, so we need to convert the text into numeric vectors and feed those to the models during training.
- In this process, we convert text into a very **high dimensional numeric vector** using the Bag of Words technique.
- Models like K-Nearest Neighbours (KNN) don't work well with high dimensional data. With one dimension per vocabulary word, the vectors are sparse and distances between points become less informative (the curse of dimensionality), and Euclidean distance on raw counts is also strongly affected by review length. Computing distances over so many dimensions is expensive too, which slows down prediction.
- Multinomial Naive Bayes only needs per-class word counts from the corpus (Bag of Words), stored in a contingency table, to estimate word probabilities. This simple calculation is a major reason it is a text-classification-friendly algorithm.
- Random Forest uses bootstrapping (row sampling) and random feature subsets (column sampling) across many decision trees, which helps reduce the high variance and overfitting that high dimensional data can cause. Each tree also splits on the words that best separate the categories.
- Machine learning is a trial-and-error scientific method: we try the candidate algorithms and select the one that gives good results and satisfies requirements such as latency and interpretability.
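
One plausible contributor to KNN's weaker result is that Euclidean distance on raw word counts is sensitive to document length. The sketch below uses three made-up documents (an assumption for illustration, not data from the notebook) to show a short positive review ending up closer to a short negative review than to a longer review that uses exactly the same words:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical documents, for illustration only
docs = [
    "great movie",        # short positive review
    "great movie " * 10,  # same words, ten times longer
    "boring plot",        # short negative review
]

X = CountVectorizer().fit_transform(docs)
print(X.shape)  # (documents, vocabulary size); the vocabulary grows with the corpus

eu = euclidean_distances(X)
co = cosine_distances(X)

# Euclidean: the long review looks far from the short one despite identical wording
print(eu[0, 1], eu[0, 2])
# Cosine: only the direction of the count vector matters, not its length
print(co[0, 1], co[0, 2])
```

Whether a length-insensitive metric such as cosine distance would actually improve KNN on this dataset is untested here; it is just one thing that could be tried.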

Refer to these resources for more background:
- https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
- https://analyticsindiamag.com/naive-bayes-why-is-it-favoured-for-text-related-tasks/