# <center>**IMDb Movie Reviews Classification**</center>


### **Dataset Description:**

This is a dataset for binary sentiment classification of IMDb movie reviews. Each row has two fields:

- **Review**: the text of the review
- **Sentiment**: the label, `0` for negative and `1` for positive
  
Link to the dataset: [https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt](https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt)

In [1]:
!wget https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt


## Importing the Libraries

In [3]:
import numpy as np
import pandas as pd
import spacy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

nlp = spacy.load('en_core_web_sm')

In [6]:
data = pd.read_csv('/content/imdb_labelled.txt', sep = '\t', header=None, names=['Reviews', 'Sentiment'])
data.head()

|   | Reviews | Sentiment |
|---|---------|-----------|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |
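One thing worth checking after loading: review text can contain double quotes, and a parser that treats `"` as a quote character may merge several lines into one row when a field starts with an unmatched quote. The sketch below reproduces that effect with Python's standard `csv` module on three made-up lines (they are not taken from the dataset). Whether `imdb_labelled.txt` is affected has not been verified here. pandas' `read_csv` accepts the same `quoting=csv.QUOTE_NONE` option.

```python
import csv
import io

# Three hypothetical tab-separated lines; the first field opens with an unmatched quote.
raw = '"Best movie ever\t1\nBoring and slow\t0\nI "loved" it\t1\n'

# Default behaviour: '"' starts a quoted field that may span several lines.
default_rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# QUOTE_NONE: quote characters are treated as ordinary text.
literal_rows = list(
    csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(default_rows), "row(s) with default quoting")
print(len(literal_rows), "row(s) with quoting disabled")
```

Comparing `len(data)` with the number of lines in the downloaded file is a quick way to tell whether this applies.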


### **Data pre-processing**

In [8]:
from spacy.lang.en.stop_words import STOP_WORDS
import string
punct = string.punctuation
stop_words = list(STOP_WORDS)
print("STOP WORDS : ",stop_words)
print("Punctuations : ", punct)

STOP WORDS :  ['others', 'which', 'whereby', 'on', 'anyway', 'thus', 'below', 'always', 'does', 'you', 'of', 'me', 'just', 'too', 'thereby', 'not', 'hereupon', 'were', 'several', 'same', 'elsewhere', 'sixty', 'becomes', 'except', 'something', 'once', 'beforehand', 'during', 'yours', 'nevertheless', 'towards', 'third', 'please', 'can', 'every', 'he', 'see', 'whereupon', 'made', 'unless', 'i', 'only', 'them', 'hereafter', 'such', 'everything', 'between', '’m', 'move', 'the', 'either', 'becoming', 'anyhow', 'amount', 'one', 'around', 'first', 'an', 'beside', 'doing', 'some', 'used', 'toward', 'us', 'might', 'ever', 'fifty', 'own', 'mostly', 'somewhere', 'therein', 'up', 'already', 'whoever', 'noone', 'became', 'two', 'where', 'hereby', 'we', 'this', 'become', 'when', 'whereas', 'have', 'who', 'per', 'less', 'yet', '‘s', 'various', 'him', 'had', '’re', 'with', 'beyond', 'latter', 'do', 'though', 'twenty', 'again', 'about', '’s', 'serious', 'name', 'next', 'through', 'nothing', 'whatever', 

In [10]:
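def data_cleaning(sentence):
  """Tokenize a review with spaCy and return lowercased lemmas, minus stop words and punctuation."""
  doc = nlp(sentence)
  cleaned_token = []
  for token in doc:
    # spaCy 2.x lemmatizes every pronoun to "-PRON-"; keep the surface form in that case.
    if token.lemma_ != "-PRON-":
      word = token.lemma_.lower().strip()
    else:
      word = token.lower_
    # Drop punctuation and stop words.
    if (word not in punct) and (word not in stop_words):
      cleaned_token.append(word)
  return cleaned_token

# Use the spaCy-based cleaner as the vectorizer's tokenizer.
tfidf = TfidfVectorizer(tokenizer=data_cleaning)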
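A note on the punctuation filter: because `punct` is a single string, `word not in punct` is a substring test, not a per-character test. Multi-character tokens such as `...` or `--` are not substrings of `string.punctuation`, so they may slip through the filter. The sketch below compares the two checks on a few sample tokens. The helper `is_all_punct` is a hypothetical alternative, not part of the notebook, and swapping it in could change the reported scores.

```python
import string

punct = string.punctuation

def is_all_punct(word):
    """Hypothetical helper: True when every character of a non-empty token is punctuation."""
    return bool(word) and all(ch in punct for ch in word)

samples = ["!", "...", "--", "()", "good"]

# Substring test, as used in data_cleaning.
substring_check = {w: (w in punct) for w in samples}

# Per-character test.
per_char_check = {w: is_all_punct(w) for w in samples}

for w in samples:
    print(repr(w), substring_check[w], per_char_check[w])
```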

In [11]:
X = data['Reviews']
y = data['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state = 42)


### **Creating the pipeline and fitting the model**

In [13]:
from sklearn.svm import LinearSVC
svm = LinearSVC()

clf = Pipeline([('tfidf',tfidf),('svm',svm)])
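The `tfidf` stage of the pipeline turns each cleaned token list into a weight vector before the SVM sees it. As a rough sketch of what that involves, the snippet below computes TF-IDF by hand in plain Python, following the formula scikit-learn documents for its defaults: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The function `tfidf_rows` and the token lists are made up for illustration; they stand in for the output of `data_cleaning`.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))

    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # raw term counts
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical cleaned reviews.
docs = [["slow", "aimless", "movie"], ["good", "movie"], ["good", "scene"]]
rows = tfidf_rows(docs)
print(rows[0])
```

Terms that appear in fewer documents get a larger weight: in the first toy review, `slow` outweighs `movie`, which also occurs in the second review.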

In [15]:
clf.fit(X_train, y_train)



### **Model Evaluation**

In [17]:
y_pred = clf.predict(X_test)

In [19]:
acc = accuracy_score(y_test,y_pred)
print(acc)

0.8133333333333334


In [20]:
confusion_matrix(y_test, y_pred)

array([[62, 14],
       [14, 60]])

In [25]:
report = classification_report(y_test, y_pred)

In [36]:
print(report)

              precision    recall  f1-score   support

           0       0.82      0.82      0.82        76
           1       0.81      0.81      0.81        74

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150
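The figures in the report can be reproduced by hand from the confusion matrix above. A minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper `precision_recall_f1` is a hypothetical name introduced for illustration.

```python
# Confusion matrix from the evaluation step: rows = true label, columns = predicted label.
cm = [[62, 14],
      [14, 60]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions

def precision_recall_f1(cm, label):
    """Per-class precision, recall and F1 computed from a confusion matrix."""
    tp = cm[label][label]
    predicted = sum(row[label] for row in cm)  # column sum: everything predicted as `label`
    actual = sum(cm[label])                    # row sum: everything truly `label`
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p0, r0, f0 = precision_recall_f1(cm, 0)
p1, r1, f1 = precision_recall_f1(cm, 1)

print(f"accuracy: {accuracy:.2f}")
print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1:.2f}")
```

Because the two off-diagonal counts are equal (14 each), precision and recall coincide within each class.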



## **Movie Prediction for TOP GUN MAVERICK**

In [45]:
clf.predict(["""If there's any movie that deserves to be seen in the theaters with big screens and booming speakers. It's :Top Gun Maverick.

One of my best experiences in years!
"""])

array([1])

## **Movie Prediction for Brahmastra 😲**

In [47]:
clf.predict(["""I went to see first day first show! Was excited for movie but it was really a waste of time. It is almost like a cheap marvel copy and nothing more. Its just a love story with vfx. Not satisfied at all. Must work hard for next adition to this triology and not copy marvel characters and even shots. The vfx is not bad but its almost same like any superhero movie. Ranbir and alia are okay and not in full form. Music is also copied from other songs of pritam himself. So, public can figure out if they are getting what they were promised or not. Here, definitely I personally ended up in disappointment!"""])

array([0])