# Scenario: **IMDB Movie Reviews Classification (or Text Classification)**

### **Dataset Description:**

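This is a dataset for binary sentiment classification. The file used here, `imdb_labelled.txt`, contains 748 labelled review sentences in two tab-separated columns:

- **Review**: the review text
- **Sentiment**: the label, `0` for negative and `1` for positive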


### **Tasks to be performed:**

- Download the dataset from Dropbox
- Import the required libraries and load the dataset
- Pre-process and clean the dataset
- Split the dataset into training and testing sets using the `train_test_split` function from sklearn
- Create an SVC classifier (`LinearSVC`) and fit the model
- Evaluate the model


In [3]:
# Download the dataset from Dropbox
!wget https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt

'wget' is not recognized as an internal or external command,
operable program or batch file.
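
The cell above fails because `wget` is not available in the Windows shell. A minimal, portable sketch using only the Python standard library is shown below. `download_if_missing` is a hypothetical helper name introduced here, and whether the Dropbox share link returns the raw file rather than an HTML page is an assumption, so the downloaded content should be checked before loading it.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL copied from the cell above. Assumption: the link serves the raw text file;
# Dropbox share links sometimes return an HTML page instead.
DATA_URL = "https://www.dropbox.com/s/dctsk9k67x2jgnb/imdb_labelled.txt"


def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        # Create the target folder first so the download cannot fail on a missing directory.
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest


# Example call (performs a network request, so it is left commented out):
# data_path = download_if_missing(DATA_URL, "imdb_labelled.txt")
```

The returned path can then be passed to `pd.read_csv` in place of a hard-coded location.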


### Importing the required libraries

In [1]:
import numpy as np
import pandas as pd
import spacy
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
# Adjust this path to wherever imdb_labelled.txt was saved
data_imdb = pd.read_csv(r"C:\Users\Shivani Dussa\Downloads\imdb_labelled.txt", names=['Review', 'Sentiment'], sep='\t', header=None)

In [4]:
print(data_imdb.shape)
data_imdb.head()

(748, 2)


| | Review | Sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


### Data Pre-Processing

In [5]:
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
punct = string.punctuation

In [6]:
print(len(STOP_WORDS))
print(stopwords)

326
['whereafter', 'quite', 'thence', '‘d', 'show', 'take', 'and', 'really', 'twenty', 'off', 'front', 'throughout', 'you', 'rather', 'whither', 'always', 'least', 'beforehand', 'whereupon', 'whoever', 'did', 'be', 'used', 'while', 'sometimes', 'latterly', 'seemed', 'with', 'his', 'ours', 're', 'had', 'your', 'who', 'fifteen', 'until', 'not', "n't", 'whereby', 'often', 'well', '‘m', '‘ll', 'when', 'thereafter', 'hence', "'s", 'but', 'as', 'that', 'of', 'anywhere', 'n’t', 'whose', 'why', 'have', 'these', 'can', 'into', 'those', 'thereupon', '’ll', 'within', 'my', 'itself', 'during', 'nobody', 'make', 'myself', 'in', 'becomes', 'is', 'though', 'mostly', 'nowhere', 'elsewhere', 'forty', "'d", 'same', 'between', "'ve", 'seems', 'their', 'where', 'nothing', 'every', 'was', 'yourself', 'whether', 'full', 'anyway', 'onto', 'since', 'nevertheless', 'put', 'serious', 'still', 'using', 'one', 'become', 'perhaps', 'whom', 'already', 'more', 'back', 'after', 'enough', 'go', 'either', 'behind', 'do

In [7]:
print(punct)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [13]:
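def text_data_cleaning(sentence):
    """Tokenize a sentence with spaCy and return lowercased lemmas without stop words or punctuation."""
    tokenCollection = nlp(sentence)

    cleaned_tokens = []
    for tokenObj in tokenCollection:
        # spaCy v2 lemmatizes pronouns to the placeholder "-PRON-"; keep the original word instead
        if tokenObj.lemma_ == "-PRON-":
            word = tokenObj.lower_
        else:
            word = tokenObj.lemma_.lower().strip()
        # Keep the token only if it is neither a stop word nor punctuation
        if (word not in stopwords) and (word not in punct):
            cleaned_tokens.append(word)
    return cleaned_tokens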

tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)
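
One detail of `text_data_cleaning` is worth a closer look: `punct` is a string, so `word not in punct` is a substring test rather than a membership test over individual characters. The sketch below uses only the standard library to illustrate the difference. `punct_set` is a name introduced here for illustration and is not part of the notebook.

```python
import string

punct = string.punctuation            # one string of 32 punctuation characters
punct_set = set(string.punctuation)   # one entry per punctuation character

# Substring semantics: the empty string and any run of adjacent
# characters from `punct` count as "in" the string.
print("" in punct)      # True
print("()" in punct)    # True
print("..." in punct)   # False: "..." is not a contiguous part of punct

# Set semantics: only single punctuation characters match.
print("" in punct_set)    # False
print("()" in punct_set)  # False
print("." in punct_set)   # True
```

In the function above, the substring behaviour also drops whitespace-only tokens, because they strip to the empty string. Switching to a set would need an explicit empty-string check to keep that effect.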

In [9]:
tfidf

In [10]:
data_imdb.loc[0,'Review']

'A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  '

In [15]:
tokenCollection = nlp(data_imdb.loc[0,'Review'])

In [17]:
for tokenObj in tokenCollection:
    print(tokenObj.text,tokenObj.lemma_,tokenObj.is_alpha, tokenObj.is_punct)

A a True False
very very True False
, , False True
very very True False
, , False True
very very True False
slow slow True False
- - False True
moving move True False
, , False True
aimless aimless True False
movie movie True False
about about True False
a a True False
distressed distressed True False
, , False True
drifting drift True False
young young True False
man man True False
. . False True
    False False


### **Split the data set into training and testing set using the train_test_split function from sklearn**

In [28]:
X = data_imdb['Review']
y = data_imdb['Sentiment']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 42)
X_train.shape,X_test.shape

((598,), (150,))
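
The split sizes follow from the row count. The short check below assumes that, for a fractional `test_size`, scikit-learn rounds the test count up and gives the remaining rows to the training set, which matches the shapes printed above.

```python
import math

n_samples = 748      # rows in data_imdb, from the shape printed earlier
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # 149.6 rounds up to 150
n_train = n_samples - n_test                # the remaining 598 rows

print(n_train, n_test)  # 598 150
```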

In [29]:
y_train.shape,y_test.shape

((598,), (150,))

### **Creating the pipeline and fitting the model**
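- The pipeline has 2 steps:
  - `tfidf`: converts each review into TF-IDF features using the `text_data_cleaning` tokenizer
  - `svm`: a linear support vector classifier (`LinearSVC`) trained on those features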
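
To make the first step concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf_weights` and the toy documents are assumptions for illustration, and whitespace splitting stands in for the spaCy tokenizer.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalise the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


rows = tfidf_weights(["slow aimless movie", "great movie", "great acting"])
for row in rows:
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

Terms that appear in fewer documents, such as `slow`, receive a larger weight than terms shared across documents, such as `movie`.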

In [35]:
from sklearn.svm import LinearSVC
svm = LinearSVC()

clf = Pipeline([('tfidf',tfidf),('svm',svm)])

In [36]:
clf

In [37]:
clf.fit(X_train,y_train)

In [38]:
y_pred = clf.predict(X_test)

In [39]:
y_pred

array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0], dtype=int64)

### Evaluate the model

In [40]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.82      0.82        76
           1       0.81      0.81      0.81        74

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150



In [42]:
confusion_matrix(y_test,y_pred)

array([[62, 14],
       [14, 60]], dtype=int64)
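
As a sanity check, the figures in the classification report can be recomputed by hand from this confusion matrix. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
tn, fp = 62, 14   # true label 0: predicted 0, predicted 1
fn, tp = 14, 60   # true label 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # of the reviews predicted 1, how many are 1
recall_pos = tp / (tp + fn)      # of the reviews labelled 1, how many were found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy          = {accuracy:.2f}")       # 0.81
print(f"precision class 1 = {precision_pos:.2f}")  # 0.81
print(f"recall class 1    = {recall_pos:.2f}")     # 0.81
print(f"precision class 0 = {precision_neg:.2f}")  # 0.82
print(f"recall class 0    = {recall_neg:.2f}")     # 0.82
```

These values agree with the `classification_report` output above.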

In [44]:
clf.predict(['wow','this sucks'])

array([0, 0], dtype=int64)

In [45]:
clf.predict(['Worth of watching it. Please like it'])

array([1], dtype=int64)

In [46]:
clf.predict(['Loved it. amazing'])

array([1], dtype=int64)

In [50]:
clf.predict(['A very, very, very slow-moving'])

array([0], dtype=int64)

In [53]:
rv = '''
I was reluctantly dragged into the theater, thinking that they didn't need to make a Top Gun 2 and that the first one was where that story needed to end.

I could have a couple paragraphs to summarize my feelings after walking out of the theater, but I'm going to leave it with just one sentence.

I was wrong.
'''
clf.predict([rv])

array([0], dtype=int64)