### Bag of words


- In this exercise, you will classify whether a given movie review is **positive or negative**.
- You will use bag of words to pre-process the text and then apply different classification algorithms.
- Sklearn's CountVectorizer provides a built-in implementation of bag of words.
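
To make the idea concrete, here is a minimal plain-Python sketch of what a bag-of-words representation looks like. The helper names `build_vocabulary` and `bag_of_words` and the two toy sentences are made up for illustration; CountVectorizer does its own tokenization and returns a sparse matrix, so treat this only as a rough picture of the concept.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of the unique lowercase tokens in the documents."""
    tokens = set()
    for doc in documents:
        tokens.update(doc.lower().split())
    return sorted(tokens)

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# hypothetical toy corpus, not taken from the movie review dataset
docs = ["good movie good story", "bad movie"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]

print(vocab)    # ['bad', 'good', 'movie', 'story']
print(vectors)  # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each review becomes a vector of word counts, and word order is discarded. The classifiers in the exercises are trained on vectors of this kind.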

In [50]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier

In [51]:
# import the CSV file
file = pd.read_csv('../NLP/nlp-tutorials-main/9_bag_of_words/movies_sentiment_data.csv')
file.head()

Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [52]:
file.shape

(19000, 2)

In [53]:
#create a new column "Category": 1 if the sentiment is positive, 0 if it is negative
file['Category'] = file.sentiment.apply(lambda x: 0 if x == 'negative' else 1)

In [54]:
file.head()

Unnamed: 0,review,sentiment,Category
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive,1
1,I enjoyed the movie and the story immensely! I...,positive,1
2,I had a hard time sitting through this. Every ...,negative,0
3,It's hard to imagine that anyone could find th...,negative,0
4,This is one military drama I like a lot! Tom B...,positive,1


In [55]:
file.sentiment.value_counts()

positive    9500
negative    9500
Name: sentiment, dtype: int64

In [56]:
#check the distribution of 'Category' to see whether the target labels are balanced

file.Category.value_counts()

1    9500
0    9500
Name: Category, dtype: int64

The target labels are balanced: each class has 9,500 reviews.

In [57]:
#do the train-test split with a test size of 20%

X_train, X_test, y_train, y_test = train_test_split(file.review, file.Category, test_size=0.2)
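
The split above is random, so the exact scores change from run to run. As an optional variation (not what the cell above does), a fixed `random_state` makes the split repeatable and `stratify` keeps the class ratio the same in both parts. The sketch below uses made-up toy data and hypothetical variable names so that it runs on its own.

```python
from sklearn.model_selection import train_test_split

# toy stand-in data (hypothetical), balanced like the real dataset
reviews = [f"review {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is the same on every run
    stratify=labels,   # keep the class ratio identical in train and test
)

print(len(X_tr), len(X_te))  # 16 4
print(sorted(y_te))
```

With the real data this would correspond to passing `random_state=...` and `stratify=file.Category` to the call above.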


**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Random Forest** as the classifier with n_estimators=50 and criterion='entropy'.
- Print the classification report.


In [58]:
#1. create a pipeline object
clf = Pipeline([('count vectorizer', CountVectorizer()),
                ('random forest', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])


In [59]:
#2. fit with X_train and y_train
clf.fit(X_train, y_train)

In [60]:
# classification report
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.84      0.84      1922
           1       0.84      0.82      0.83      1878

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800



In [61]:
clf.score(X_test, y_test)


0.8326315789473684

**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **KNN** as the classifier with n_neighbors=10 and metric='euclidean'.
- Print the classification report.


In [62]:
model = Pipeline([('count vectorizer', CountVectorizer()),
                  ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))])

In [63]:
model.fit(X_train, y_train)

In [64]:
y_pre = model.predict(X_test)
print(classification_report(y_test, y_pre))

              precision    recall  f1-score   support

           0       0.66      0.64      0.65      1922
           1       0.64      0.67      0.65      1878

    accuracy                           0.65      3800
   macro avg       0.65      0.65      0.65      3800
weighted avg       0.65      0.65      0.65      3800



In [65]:
model.score(X_test, y_test)


0.6505263157894737

**Exercise-3**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- Use CountVectorizer for pre-processing the text.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.


In [70]:
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('Multi NB', MultinomialNB())   #using the Multinomial Naive Bayes classifier
])


clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.89      0.86      1922
           1       0.87      0.81      0.84      1878

    accuracy                           0.85      3800
   macro avg       0.85      0.85      0.85      3800
weighted avg       0.85      0.85      0.85      3800



The RandomForestClassifier report shows f1-score, precision and recall all above 80%, which is an acceptable performance. Multinomial Naive Bayes does slightly better, with an accuracy of 85%.
The KNeighborsClassifier report shows f1-score, precision and recall of only about 65%, which makes it the least suitable of the three models here.
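
For readers unsure how the three numbers in each report relate, here is a small sketch that computes precision, recall and f1-score for one class from raw counts. The function name and the counts are hypothetical, chosen only to show the arithmetic; they are not taken from the reports above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and f1 for one class from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp)   # of everything predicted as the class, how much was right
    recall = tp / (tp + fn)      # of everything truly in the class, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# hypothetical counts for illustration
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.89 0.84
```

Because f1 is a harmonic mean, it stays close to the lower of precision and recall, which is why a model needs both to be high to score well.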