**NLP: Exercises**
 - In this exercise, you will classify whether a given movie review is positive or negative.
 - You will use Bag of Words to pre-process the text and then apply different classification algorithms.
 - scikit-learn's CountVectorizer provides a built-in implementation of Bag of Words.

In [1]:
import pandas as pd
df = pd.read_csv("movies_sentiment_data.csv")
df.head()

Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [2]:
df.shape

(19000, 2)

In [6]:
df['review'][0]

"I first saw Jake Gyllenhaal in Jarhead (2005) a little while back and, since then, I've been watching every one of his movies that arrives on my radar screen. Like Clive Owen, he has an intensity (and he even resembles Owen somewhat) that just oozes from the screen. I feel sure that, if he lands some meaty roles, he'll crack an Oscar one day...<br /><br />That's not to denigrate this film at all.<br /><br />It's a fine story, with very believable people (well, it's based upon the author's early shenanigans with rocketry), a great cast \x96 Chris Cooper is always good, and Laura Dern is always on my watch list \x96 with the appropriate mix of humor, pathos, excitement...and the great sound track with so many rock n roll oldies to get the feet tapping.<br /><br />But, this film had a very special significance for me: in 1957, I was the same age as Homer Hickham; like him, I looked up at the night stars to watch Sputnik as it scudded across the blackness; like Homer also, I experimented 

In [7]:
df['sentiment'].value_counts()

positive    9500
negative    9500
Name: sentiment, dtype: int64

In [8]:
df['category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

Unnamed: 0,review,sentiment,category
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive,1
1,I enjoyed the movie and the story immensely! I...,positive,1
2,I had a hard time sitting through this. Every ...,negative,0
3,It's hard to imagine that anyone could find th...,negative,0
4,This is one military drama I like a lot! Tom B...,positive,1


In [9]:
df['category'].value_counts()

0    9500
1    9500
Name: category, dtype: int64

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['category'], test_size=0.2)

In [11]:
X_train.shape

(15200,)

In [14]:
X_test.shape

(3800,)
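
The split above is random and unstratified, so the class counts in the test set can drift slightly from 50/50 and change from run to run. As a possible variation (not what this notebook uses), the sketch below passes `random_state` and `stratify` to `train_test_split`. The `toy` DataFrame and the `toy_*` variable names are hypothetical and exist only to keep the example self-contained.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 negative (0) and 10 positive (1) rows.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "category": [0, 1] * 10,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy["review"],
    toy["category"],
    test_size=0.2,
    random_state=42,           # fixes the shuffle so the split is reproducible
    stratify=toy["category"],  # keeps the class ratio equal in train and test
)

print(toy_y_train.value_counts())
print(toy_y_test.value_counts())
```

With stratification, both splits keep the same positive-to-negative ratio as the full data.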

**Exercise-1**

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.
Note:

 - use CountVectorizer for pre-processing the text.
 - use Random Forest as the classifier with n_estimators=50 and criterion='entropy'.
 - print the classification report.

In [16]:
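from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Step 1: turn raw reviews into Bag of Words count vectors.
# Step 2: classify the vectors with a 50-tree, entropy-based random forest.
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('random_forest', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

# Fit on the raw training text; the pipeline vectorizes it internally.
clf.fit(X_train, y_train)

# Predict on the held-out reviews and report per-class metrics.
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))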

              precision    recall  f1-score   support

           0       0.81      0.83      0.82      1844
           1       0.83      0.81      0.82      1956

    accuracy                           0.82      3800
   macro avg       0.82      0.82      0.82      3800
weighted avg       0.82      0.82      0.82      3800



**Exercise-2**

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.
Note:

 - use CountVectorizer for pre-processing the text.
 - use KNN as the classifier with n_neighbors=10 and metric='euclidean'.
 - print the classification report.

In [18]:
from sklearn.neighbors import KNeighborsClassifier

clf1 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('KNN', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

clf1.fit(X_train, y_train)
y_pred = clf1.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.65      0.64      1844
           1       0.66      0.65      0.66      1956

    accuracy                           0.65      3800
   macro avg       0.65      0.65      0.65      3800
weighted avg       0.65      0.65      0.65      3800



**Exercise-3**

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.
Note:

 - use CountVectorizer for pre-processing the text.
 - use Multinomial Naive Bayes as the classifier.
 - print the classification report.

In [19]:
from sklearn.naive_bayes import MultinomialNB

clf2 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

clf2.fit(X_train, y_train)
y_pred = clf2.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85      1844
           1       0.88      0.81      0.84      1956

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800
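
The three exercises differ only in the final estimator, so the comparison can also be written as a loop. This is a sketch on a hypothetical toy corpus (with `n_neighbors=3` because the toy set is tiny), not a rerun of the results above. The names `candidates` and `scores` are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only to keep the sketch runnable on its own.
train_texts = [
    "great movie", "loved the story", "wonderful cast", "great fun",
    "boring movie", "hated the story", "terrible cast", "boring mess",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great story", "boring cast"]
test_labels = [1, 0]

# One entry per classifier to try; the vectorizer step stays the same.
candidates = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, criterion="entropy", random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "nb": MultinomialNB(),
}

scores = {}
for name, estimator in candidates.items():
    pipe = Pipeline([
        ("vectorizer", CountVectorizer()),
        ("classifier", estimator),
    ])
    pipe.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipe.predict(test_texts))

print(scores)
```

On the real data, swapping in `X_train`, `y_train`, `X_test`, and `y_test` and printing `classification_report` inside the loop would reproduce the three exercises in one cell.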



**Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?**
 - Machine learning algorithms do not work on text data directly, so we need to convert the text into numeric vectors and feed those to the models during training.
 - In this process, we convert the text into very high-dimensional numeric vectors using the Bag of Words technique.
 - A model like K-Nearest Neighbours (KNN) doesn't work well with high-dimensional data. With a large number of mostly sparse dimensions, the distances between points tend to become similar, so the nearest neighbours carry less information about the class. Computing distances across that many dimensions is also expensive. Both effects hurt the model.
 - Multinomial Naive Bayes only needs the per-class word counts from the corpus (Bag of Words), stored in a contingency table, to estimate word probabilities. This simple calculation is the major reason it is a text-classification-friendly algorithm.
 - Random Forest uses bootstrapping (row and column sampling) across many decision trees, which reduces the high variance and overfitting that come with high-dimensional data. Its trees also split on the most informative words, which helps separate the categories.
 - Machine learning is like a trial-and-error scientific method: we keep trying the algorithms available to us and select the one that gives good results and satisfies requirements such as latency and interpretability.
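
The point about KNN and high-dimensional data can be illustrated with a small experiment. The sketch below uses synthetic uniform random points, not the review vectors, and a hypothetical helper `relative_contrast` that measures how much farther the farthest point is than the nearest one, relative to the nearest. It is an illustration of distance concentration under those assumptions, not a proof about this dataset.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # synthetic data in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    return (dists.max() - dists.min()) / dists.min()

contrast_low = relative_contrast(1000, 2)       # 2 dimensions
contrast_high = relative_contrast(1000, 10000)  # 10,000 dimensions

print(f"contrast in 2 dims:     {contrast_low:.2f}")
print(f"contrast in 10000 dims: {contrast_high:.2f}")
```

A large contrast means the nearest neighbour is much closer than a typical point. As the number of dimensions grows, the contrast shrinks, so "nearest" becomes a weaker signal for a distance-based model like KNN.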