In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

In this exercise, you will classify whether a given movie review is positive or negative.
You will use Bag of Words to pre-process the text and then apply different classification algorithms.
Sklearn's CountVectorizer provides a built-in implementation of Bag of Words.

About Data: IMDB Dataset
Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

This data consists of two columns: review and sentiment.
The review column holds the statements written by users after watching the movie.
The sentiment column tells whether the given review is positive or negative.


In [2]:
df = pd.read_csv(r"E:\Programming\NLP\bag_of_words\movies_sentiment_data.csv")

In [3]:
df.head()

,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [4]:
df.shape

(19000, 2)

In [5]:
# Encode the sentiment label as a number: 1 for positive, 0 for negative
df['Category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

In [6]:
df.head()

,review,sentiment,Category
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive,1
1,I enjoyed the movie and the story immensely! I...,positive,1
2,I had a hard time sitting through this. Every ...,negative,0
3,It's hard to imagine that anyone could find th...,negative,0
4,This is one military drama I like a lot! Tom B...,positive,1


In [7]:
df['Category'].value_counts()

Category
1    9500
0    9500
Name: count, dtype: int64

In [8]:
# Hold out 20% of the reviews for testing (train_test_split is already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)

In [9]:
clf1 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('random_forest', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

In [10]:
clf1.fit(X_train, y_train)

In [11]:
y_pred = clf1.predict(X_test)

In [12]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.83      0.83      1922
           1       0.83      0.83      0.83      1878

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800

As the report above shows, both classes (positive and negative sentiment) reach more than 80% precision, recall, and F1-score. This seems to be acceptable performance.

2 METHOD

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

Use CountVectorizer for pre-processing the text.
Use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean'.
Print the classification report.

In [13]:
clf2 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('KNN', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [14]:
clf2.fit(X_train, y_train)

In [15]:
y_pred = clf2.predict(X_test)


In [16]:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.64      0.64      0.64      1884
           1       0.64      0.64      0.64      1916

    accuracy                           0.64      3800
   macro avg       0.64      0.64      0.64      3800
weighted avg       0.64      0.64      0.64      3800



In [18]:
review = ['This production has absolutely no storyline. The acting is embarrassing. The promising Dutch television Sophie Hilbrand star should not add this movie to her CV. Her acting is far from flawless and personally I think she has crossed boundary of professional decency; relating to the way she exposes herself in this movie. This movie contains too much unnecessary nudity, vulgar sexual scenes and rude language. It also shows a wrong image of the Netherlands (as most movies do). Do not bother to watch this movie: a waste of time, a waste of money and an embarrassing record for Hilbrand, who has proved to be better with her close on on the screen.']
clf2.predict(review)

array([1], dtype=int64)

Hmm, here the metrics (precision, recall, and F1-score) are noticeably lower, at about 64%, and the clearly negative review above was predicted as positive (1). Let's try one more classifier and then discuss why performance varies so much.
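
One plausible contributor to KNN's weaker result, offered as an assumption and not something measured on this dataset, is that Euclidean distance on raw word counts is sensitive to review length. The sketch below uses three made-up reviews to show a short positive review landing closer to a short negative one than to a long positive one; cosine distance is shown only for contrast.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Made-up reviews for illustration only
docs = [
    "great great great great movie movie movie movie",  # long positive
    "great movie",                                       # short positive
    "awful movie",                                       # short negative
]

X = CountVectorizer().fit_transform(docs)

euclid = euclidean_distances(X)  # pairwise Euclidean distances on raw counts
cosine = cosine_distances(X)     # pairwise cosine distances (length-insensitive)

# Distance from the short positive review (row 1) to the other two reviews
print("euclidean: to long positive =", round(euclid[1, 0], 3),
      "| to short negative =", round(euclid[1, 2], 3))
print("cosine:    to long positive =", round(cosine[1, 0], 3),
      "| to short negative =", round(cosine[1, 2], 3))
```

Under Euclidean distance the nearest neighbour of the short positive review is the short negative one, while cosine distance groups the two positive reviews together. Whether this effect explains the gap on the real data would need to be checked separately.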

3 METHOD

Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

Use CountVectorizer for pre-processing the text.
Use Multinomial Naive Bayes as the classifier.
Print the classification report.

In [13]:
clf3 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('Multi NB', MultinomialNB())
])

In [14]:
clf3.fit(X_train, y_train)


In [15]:
y_pred = clf3.predict(X_test)

In [16]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.87      0.84      1922
           1       0.86      0.81      0.83      1878

    accuracy                           0.84      3800
   macro avg       0.84      0.84      0.84      3800
weighted avg       0.84      0.84      0.84      3800



That's great! With the MultinomialNB model, both classes (positive and negative sentiment) reach more than 80% precision, recall, and F1-score, on par with Random Forest. This seems to be acceptable performance.
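
The support counts in the reports above differ (1922/1878 in two reports, 1884/1916 in the KNN report), which suggests the split was redrawn between runs. A minimal sketch of a like-for-like comparison follows, assuming a fixed `random_state` and a tiny made-up corpus so that it runs on its own; the names `texts`, `labels`, and `scores` are illustrative, and the toy scores say nothing about the real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; in the notebook this would be df.review and df.Category
texts = ["great film", "loved it", "wonderful acting", "really enjoyable",
         "terrible film", "hated it", "awful acting", "really boring"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# A fixed random_state gives every classifier the same train/test split
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=10, metric="euclidean"),
    "multinomial_nb": MultinomialNB(),
}

scores = {}
for name, classifier in classifiers.items():
    pipeline = Pipeline([("vectorizer", CountVectorizer()), ("classifier", classifier)])
    pipeline.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, pipeline.predict(X_te))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```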


In [17]:
new = input("Enter The Review")

Enter The Review Greetings again from the darkness. Director Alejandro Amenabar creates life against all odds in this based on a true story version of one man's struggle to control his destiny. The great Javier Bardem is fascinating to watch in his role as Ramon. His eyes and head movements leave little doubt what is going on in his mind. The dream and fantasy sequences are not overused so prove very effective in explaining why he wants what he wants. Rather than force us to answer the euthanasia question, the real question posed is , What is Love? At every turn we see people in love, looking for love or dying to be loved. The script is tight and keeps the film moving despite being filmed mostly in one room. The supporting cast is wonderful and we truly feel their pain and how each family member deals with Ramon's decision. This is a gem and deserves to be seen.


In [20]:
review_list = [new]  # Creating a list with the input string
prediction = clf3.predict(review_list)

In [21]:
# predict returns an array; check the label of the first (and only) review
if prediction[0] == 1:
    print("Positive")
else:
    print("Negative")

Positive


In [22]:
import pickle

In [25]:
# Use a context manager so the file is closed after writing
with open('Sentiment_review_BOW', 'wb') as f:
    pickle.dump(clf3, f)

In [26]:
# Use a context manager so the file is closed after reading
with open('Sentiment_review_BOW', 'rb') as f:
    model = pickle.load(f)

In [32]:
model.predict(["Forest of the Damned starts out as five young friends, brother & sister Emilio (Richard Cambridge) & Ally (Sophie Holland) along with Judd (Daniel Maclagan), Molly (Nicole Petty) & Andrew (David Hood), set off on a week long holiday 'in the middle of nowhere', their words not mine. Anyway, before they know it they're deep in a forest & Emilio clumsily runs over a woman (Frances Da Costa), along with a badly injured person to add to their problems the van they're travelling in won't start & they can't get any signals on their mobile phones. They need to find help quickly so Molly & Judd wander off in the hope of finding a house, as time goes by & darkness begins to fall it becomes clear that they are not alone & that there is something nasty lurking in the woods...<br /><br />This English production was written & directed by Johannes Roberts & having looked over several other comments & reviews both here on the IMDb & across the internet Forest of the Damned seems to divide opinion with some liking it & other's not, personally it didn't do much for at all. The script is credited on screen to Roberts but here on the IMDb it lists Joseph London with 'additional screenplay material' whatever that means, the film is your basic backwoods slasher type thing like The Texas Chainsaw Massacre (1974) with your basic stranded faceless teenage victims being bumped off but uses the interesting concept of fallen angels who roam the forest & kill people for reason that are never explained to any great deal of satisfaction. Then there's Stephen, played by the ever fantastic Tom Savini, who is never given any sort of justification for what he does. Is he there to get victims for the angels? If so why did he kill Andrew by bashing his head in? The story is very loose, it never felt like a proper film. The character's are poor, the dialogue not much better & the lack of any significant story makes it hard to get into it or care about anything that's going on. Having said that it moves along at a reasonable pace & there are a couple of decent scenes here.<br /><br />Director Johannes doesn't do anything special, it's not a particularly stylish or flash film to look at. There's a few decent horror scenes & the Tom Savini character is great whenever he's on screen (although why didn't he hear Judd breaking the door down with an axe while escaping with Molly?) & it's a shame when he gets killed off. There are a couple of decent gore scenes here, someone has their head bashed in, there's a decapitation, someone gets shotgun blasted, someone throat is bitten out, someones lips are bitten off & someone is ripped in half. There is also a fair amount of full frontal female nudity, not that it helps much.<br /><br />Technically Forest of the Damned is OK, it's reasonably well made but nothing overly special or eye-catching. This was shot in England & Wales & it's quite odd to see an English setting for a very American themed backwards horror. The acting is generally pretty poor save for Savini who deserves to be in better than this. Horror author Shaun Hutson has an embarrassing cameo at the end & proves he should stick to writing rather than acting.<br /><br />Forest of the Damned was a pretty poor horror film, it seems to have fans out there so maybe I'm missing something but it's not a film I have much fondness for. Apart from one or two decent moments there's not much here to recommend."])

array([0], dtype=int64)

In [35]:
model.predict(["it is very bad want to more improvment"])

array([0], dtype=int64)

In [36]:
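# Compare the predicted label, not the pipeline object itself, with 1
prediction = model.predict(["it is very bad want to more improvment"])
if prediction[0] == 1:
    print("Positive")
else:
    print("Negative")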

Negative
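
The predict-then-print pattern used in the last few cells can be wrapped in a small helper. The sketch below is one possible way to do it: `predict_sentiment` and `LABEL_NAMES` are hypothetical names, and a tiny pipeline trained on two made-up reviews stands in for the saved model so that the example runs on its own.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Mapping from the numeric Category values to readable labels
LABEL_NAMES = {0: "Negative", 1: "Positive"}


def predict_sentiment(model, text):
    """Return 'Positive' or 'Negative' for a single review string."""
    label = int(model.predict([text])[0])  # predict expects a list of documents
    return LABEL_NAMES[label]


# Stand-in model trained on two made-up reviews (illustration only)
toy_model = Pipeline([("vectorizer", CountVectorizer()), ("nb", MultinomialNB())])
toy_model.fit(["great film loved it", "awful film hated it"], [1, 0])

# Round-trip through pickle in memory, mirroring the save/load cells
restored = pickle.loads(pickle.dumps(toy_model))

result = predict_sentiment(restored, "loved this great film")
print(result)  # Positive
```

With the real pipeline, the call would be `predict_sentiment(model, new)`.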
