# Text Classifier: Movie Reviews

#### Problem statement:
* Sentiment analysis is a widely studied and challenging problem. The goal is to classify the polarity of a given text at the document, sentence, or feature level. Here I try to determine whether the opinion expressed in a movie review is **positive** or **negative**.

#### Data:
* Large Movie Review Dataset: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [1]:
import pandas as pd
import numpy as np
import glob
import re

#### Read the text data

In [2]:
num_data = pd.DataFrame()
num_data.loc["pos", "train"] = len(glob.glob("./aclImdb/train/pos/*.txt"))
num_data.loc["neg", "train"] = len(glob.glob("./aclImdb/train/neg/*.txt"))
num_data.loc["pos", "test"] = len(glob.glob("./aclImdb/test/pos/*.txt"))
num_data.loc["neg", "test"] = len(glob.glob("./aclImdb/test/neg/*.txt"))
num_data

Unnamed: 0,train,test
pos,12500.0,12500.0
neg,12500.0,12500.0


There are a total of 50,000 records in the train and test data sets. Half are positive and half are negative.

#### PyPrind (Python Progress Indicator)
* ProgPercent(iterations, track_time=True, stream=2, title='', monitor=False, update_interval=None)
* Initializes a progress indicator object that reports the progress of an iterative computation in the console.
* iterations = the number of iterations in the computation.

In [3]:
import pyprind

#length = 50000
pper = pyprind.ProgPercent(50000)

In [4]:
# Create labels for positive and negative
labels = {"pos": 1, "neg": 0}

Read all the Positive and Negative reviews

In [5]:
rows = []

for i in ("train", "test"):
    for j in ("pos", "neg"):
        path = "./aclImdb/%s/%s/*.txt" % (i, j)
        for file in glob.glob(path)[0:12500]:
            with open(file, "r", encoding="utf8") as infile:
                text = infile.read()
            # DataFrame.append was removed in pandas 2.0; collect the rows in a list instead
            rows.append([text, labels[j]])
            pper.update()

# Build the DataFrame once, which also avoids copying it on every iteration
dataframe = pd.DataFrame(rows, columns=["review", "sentiment"])
len(dataframe)

[100 %] Time elapsed: 00:04:07 | ETA: 00:00:00
Total time elapsed: 00:04:07


50000

In [6]:
dataframe.head(25005)

Unnamed: 0,review,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1
...,...,...
25000,I went and saw this movie last night after bei...,1
25001,Actor turned director Bill Paxton follows up h...,1
25002,As a recreational golfer with some knowledge o...,1
25003,"I saw this film in a sneak preview, and it is ...",1


Save the dataframe to CSV files

In [7]:
dataframe[0:25000].to_csv("./movie_reviews_train_data.csv", index=False)
dataframe[25000:50000].to_csv("./movie_reviews_test_data.csv", index=False)
dataframe.to_csv("./movie_reviews_data.csv", index=False)

Read the train csv file

In [8]:
train_data = pd.read_csv('movie_reviews_train_data.csv')

#### Shuffle the data
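The extracted data is sorted by label, and we should not train on an ordered dataset. The cell below shows two ways to shuffle it: the permutation function from the np.random submodule, or DataFrame.sample. Both lines are left commented out here; train_test_split, used later, shuffles the rows by default before splitting.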

In [9]:
# Shuffle data
#train_data = train_data.reindex(np.random.permutation(train_data.index))
# Another way
#train_data = train_data.sample(frac=1, random_state=42).reset_index(drop=True)
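
As a minimal sketch of the two shuffling approaches in the cell above, the toy frame below (an assumed stand-in for `train_data`, not the real reviews) starts sorted by label and is then shuffled both ways. The seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data: all positives first, then all negatives
toy = pd.DataFrame({
    "review": ["r%d" % k for k in range(10)],
    "sentiment": [1] * 5 + [0] * 5,
})

# Option 1: reindex with a random permutation of the index
rng = np.random.RandomState(42)  # arbitrary seed, for reproducibility only
shuffled_a = toy.reindex(rng.permutation(toy.index))

# Option 2: sample every row without replacement, then renumber the index
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Both keep every row; only the order changes
print(shuffled_a["sentiment"].tolist())
print(shuffled_b["sentiment"].tolist())
```

Option 1 keeps the original index labels in their new order, while option 2 renumbers the rows from 0, which is usually more convenient when the frame is later aligned with another object by position.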

In [10]:
train_data.head()

Unnamed: 0,review,sentiment
0,Bromwell High is a cartoon comedy. It ran at t...,1
1,Homelessness (or Houselessness as George Carli...,1
2,Brilliant over-acting by Lesley Ann Warren. Be...,1
3,This is easily the most underrated film inn th...,1
4,This is not the typical Mel Brooks film. It wa...,1


### Data preprocessing and removing the unwanted data

In [11]:
# Remove <br /> line-break tags (literal match)
train_data['review'] = train_data['review'].str.replace('<br />', '', regex=False)
# Remove numbers (since pandas 2.0, regex=True must be passed explicitly)
train_data['review'] = train_data['review'].str.replace(r'\d+', '', regex=True)
# Remove the "--" character sequence (literal match)
train_data['review'] = train_data['review'].str.replace('--', '', regex=False)
# Remove punctuation
train_data['review'] = train_data['review'].str.replace(r'[^\w\s]', '', regex=True)
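
To make the effect of these four replacements concrete, here is a small sketch that applies the same steps to a single made-up sentence using only the standard `re` module. The sample text is an assumption for illustration, not a review from the dataset.

```python
import re

# Made-up sample, not taken from the dataset
sample = "Great movie!<br />I watched it 3 times -- loved it."

step1 = sample.replace("<br />", "")      # drop line-break tags
step2 = re.sub(r"\d+", "", step1)         # drop digits
step3 = step2.replace("--", "")           # drop double hyphens
cleaned = re.sub(r"[^\w\s]", "", step3)   # drop punctuation

print(cleaned)

# Variant: substituting a space for the tag keeps neighbouring words apart
spaced = sample.replace("<br />", " ")
print(spaced)
```

Because the tag is replaced with an empty string, the words on either side of it are glued together (`movieI` in this sample). Replacing the tag with a space, as in the variant, is one possible way to avoid that if it matters for the vocabulary.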

In [14]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()

In [17]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    
    print(raw_review)
    
    # 1. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", raw_review) 
    print("\n")
    print(letters_only)
    
    # 2. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    print("\n")
    print(words)
    
    # 3. In Python, searching a set is much faster than searching a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))  
    print("\n")
    
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in stops] 
    print("\n")
    print(meaningful_words)
    
    # 5. Stemming
    words = [stemmer.stem(word) for word in meaningful_words]
    print("\n")
    print(words)
    
    # 6. Join the stemmed words back into a single string and return it
    return " ".join(words)

In [18]:
review_to_words(train_data.review[0])

Bromwell High is a cartoon comedy It ran at the same time as some other programs about school life such as Teachers My  years in the teaching profession lead me to believe that Bromwell Highs satire is much closer to reality than is Teachers The scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools I knew and their students When I saw the episode in which a student repeatedly tried to burn down the school I immediately recalled  at  High A classic line INSPECTOR Im here to sack one of your teachers STUDENT Welcome to Bromwell High I expect that many adults of my age think that Bromwell High is far fetched What a pity that it isnt


Bromwell High is a cartoon comedy It ran at the same time as some other programs about school life such as Teachers My  years in the teaching profession lead me to believe that Bromwell Highs satire is much closer to reality than is Teac

'bromwel high cartoon comedi ran time program school life teacher year teach profess lead believ bromwel high satir much closer realiti teacher scrambl surviv financi insight student see right pathet teacher pomp petti whole situat remind school knew student saw episod student repeatedli tri burn school immedi recal high classic line inspector im sack one teacher student welcom bromwel high expect mani adult age think bromwel high far fetch piti isnt'

#### Why use only CountVectorizer
* The TfidfTransformer transforms a count matrix into a normalized tf or tf-idf representation. Both the CountVectorizer and the TfidfTransformer (with use_idf=False) produce term frequencies, but the TfidfTransformer additionally normalizes the counts, whereas the CountVectorizer returns the raw counts used here.

Ref - https://www.quora.com/What-is-the-difference-between-TfidfVectorizer-and-CountVectorizer-1

Ref - https://stackoverflow.com/questions/35867484/pass-tokens-to-countvectorizer
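
The difference described above can be seen on a two-document toy corpus (an assumption for illustration). The sketch compares the raw counts from `CountVectorizer` with the output of `TfidfTransformer(use_idf=False)`, whose rows are L2-normalized by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["good good movie", "bad movie"]  # toy corpus, for illustration only

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Vocabulary terms ordered by column index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# Term frequencies without idf weighting; rows are L2-normalized by default
tf = TfidfTransformer(use_idf=False).fit_transform(counts)

count_rows = counts.toarray()
tf_rows = tf.toarray()
row_norms = np.linalg.norm(tf_rows, axis=1)

print(vocab)
print(count_rows)
print(tf_rows.round(3))
```

The count matrix keeps the fact that "good" appears twice in the first document, while every normalized row has unit length regardless of document length.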

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

#cv = CountVectorizer()
#cv = CountVectorizer(stop_words='english')
cv = CountVectorizer(stop_words = 'english', lowercase = True, decode_error = 'ignore',min_df=0.01, max_df=0.99)
tdm = cv.fit_transform(train_data.review)
#type(tdm)

In [22]:
# Count the number of words in the dictionary
# (get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out)
len(cv.get_feature_names_out())

1478

### The dictionary length varies based on the parameters we give CountVectorizer:
* CountVectorizer() --> 138,141 words
* CountVectorizer(stop_words='english') --> 137,831 words
* CountVectorizer(stop_words='english', min_df=0.01, max_df=0.99) --> 1,478 words
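
A rough sketch of why `min_df` and `max_df` shrink the dictionary so sharply: as floats, they are document-frequency proportions, so terms appearing in too few or too many documents are dropped. The four-document corpus and the `min_df=0.5` threshold below are assumptions chosen so the effect is visible at toy scale (the notebook itself uses 0.01).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, for illustration only
toy_docs = [
    "the plot was great",
    "the acting was great",
    "the ending was dull",
    "the soundtrack was loud",
]

# No pruning: every token of two or more characters is kept
full_vec = CountVectorizer().fit(toy_docs)
full_vocab = sorted(full_vec.vocabulary_)

# Keep terms found in at least 50% and at most 99% of the documents
pruned_vec = CountVectorizer(min_df=0.5, max_df=0.99).fit(toy_docs)
pruned_vocab = sorted(pruned_vec.vocabulary_)

print(len(full_vocab), full_vocab)
print(len(pruned_vocab), pruned_vocab)
```

Here "the" and "was" occur in every document and exceed `max_df`, the one-off words fall below `min_df`, and only "great" (two of four documents) survives.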

In [23]:
# Print out the dictionary
# (get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out)
print(cv.get_feature_names_out())



Create Matrix

In [24]:
Matrix = tdm.todense()
Matrix

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 2, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [25]:
Matrix.shape

(25000, 1478)

In [26]:
Matrix = pd.DataFrame(Matrix)
Matrix.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1468,1469,1470,1471,1472,1473,1474,1475,1476,1477
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
Matrix['sentiment'] = train_data.sentiment

In [28]:
Matrix.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1469,1470,1471,1472,1473,1474,1475,1476,1477,sentiment
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [29]:
Matrix.shape

(25000, 1479)

#### Train Test Split

In [30]:
from sklearn.model_selection import train_test_split

y = Matrix["sentiment"]
X = Matrix.drop(["sentiment"], axis=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=456)

In [31]:
print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)

(18750, 1478)
(6250, 1478)
(18750,)
(6250,)


### Model-1 LogisticRegression

In [32]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver="lbfgs", max_iter=500)
logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [33]:
y_train_pred=logreg.predict(X_train)
y_val_pred=logreg.predict(X_val)

In [34]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_train,y_train_pred))
print('\n')
print(confusion_matrix(y_val,y_val_pred))

[[8338 1048]
 [ 915 8449]]


[[2635  479]
 [ 477 2659]]


In [35]:
from sklearn.metrics import classification_report

print("Classification Report on Train Data")
print(classification_report(y_train,y_train_pred,digits=2))
print("\n")

print("Classification Report on Test Data")
print(classification_report(y_val,y_val_pred,digits=2))

Classification Report on Train Data
              precision    recall  f1-score   support

           0       0.90      0.89      0.89      9386
           1       0.89      0.90      0.90      9364

    accuracy                           0.90     18750
   macro avg       0.90      0.90      0.90     18750
weighted avg       0.90      0.90      0.90     18750



Classification Report on Test Data
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      3114
           1       0.85      0.85      0.85      3136

    accuracy                           0.85      6250
   macro avg       0.85      0.85      0.85      6250
weighted avg       0.85      0.85      0.85      6250
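
The figures in the second report can be recomputed by hand from the second confusion matrix printed above. The sketch below copies that matrix and derives accuracy, plus precision, recall, and F1 for class 1, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Second confusion matrix for Model-1, copied from the output above
# (rows = true class, columns = predicted class, in the order 0, 1)
cm = np.array([[2635, 479],
               [477, 2659]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # of all predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of all true positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 2), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
```

The support column of the report is simply each row sum of the matrix, for example 477 + 2659 = 3136 for class 1.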



### Model-2 MultinomialNB

Multinomial Naive Bayes **(MultinomialNB)**: the multinomial naive Bayes model is typically used for discrete counts. In a **text classification problem**, it takes the idea of Bernoulli trials one step further: instead of recording whether a **"word occurs in the document"**, it counts how often the word occurs in the document. You can think of each count as the "number of times outcome number x_i is observed over the n trials".

The Complement Naive Bayes **(Complement NB)** classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It is particularly suited for **imbalanced data sets**.

**__Why Naive Bayes?__**
1. Works well when there are many data points and few features
2. Fast to predict
3. Works well for multiclass problems
4. Works well for categorical features
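
As a small illustration of multinomial naive Bayes on word counts, the sketch below trains on four made-up reviews (an assumption, not the IMDb data) and classifies two unseen phrases.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training reviews: 1 = positive, 0 = negative
toy_reviews = ["great great fun", "fun great plot", "dull boring plot", "boring dull dull"]
toy_labels = [1, 1, 0, 0]

# Turn each review into a vector of word counts
toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews)

# Fit the model on the counts (Laplace smoothing, alpha=1.0, is the default)
toy_nb = MultinomialNB()
toy_nb.fit(toy_counts, toy_labels)

# Classify unseen text using the same vocabulary
new_counts = toy_cv.transform(["great fun", "boring dull"])
toy_preds = toy_nb.predict(new_counts)
print(toy_preds)
```

The first phrase is assigned to class 1 and the second to class 0, because each contains words that were counted far more often in one class than in the other.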

In [36]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

classifier.fit(X_train,y_train)

# Predictions on train data
y_pred1=classifier.predict(X_train)
print(confusion_matrix(y_train,y_pred1))

# Predictions on test data
y_pred2=classifier.predict(X_val)
print(confusion_matrix(y_val,y_pred2))

[[7811 1575]
 [1464 7900]]
[[2546  568]
 [ 550 2586]]


In [37]:
from sklearn.metrics import classification_report

print("Classification Report on Train Data")
print(classification_report(y_train,y_pred1,digits=2))
print("\n")

print("Classification Report on Test Data")
print(classification_report(y_val,y_pred2,digits=2))

Classification Report on Train Data
              precision    recall  f1-score   support

           0       0.84      0.83      0.84      9386
           1       0.83      0.84      0.84      9364

    accuracy                           0.84     18750
   macro avg       0.84      0.84      0.84     18750
weighted avg       0.84      0.84      0.84     18750



Classification Report on Test Data
              precision    recall  f1-score   support

           0       0.82      0.82      0.82      3114
           1       0.82      0.82      0.82      3136

    accuracy                           0.82      6250
   macro avg       0.82      0.82      0.82      6250
weighted avg       0.82      0.82      0.82      6250



### Model-3 LogisticRegressionCV

In [38]:
from sklearn.linear_model import LogisticRegressionCV

logreg_cv = LogisticRegressionCV(cv=5, solver="lbfgs", max_iter=500)

logreg_cv.fit(X_train,y_train)

# Predictions on train data
y_pred3=logreg_cv.predict(X_train)
print(confusion_matrix(y_train,y_pred3))

print('\n')

# Predictions on test data
y_pred4=logreg_cv.predict(X_val)
print(confusion_matrix(y_val,y_pred4))

[[8264 1122]
 [ 926 8438]]


[[2630  484]
 [ 422 2714]]


In [39]:
from sklearn.metrics import classification_report

print("Classification Report on Train Data")
print(classification_report(y_train,y_pred3,digits=2))
print("\n")

print("Classification Report on Test Data")
print(classification_report(y_val,y_pred4,digits=2))

Classification Report on Train Data
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      9386
           1       0.88      0.90      0.89      9364

    accuracy                           0.89     18750
   macro avg       0.89      0.89      0.89     18750
weighted avg       0.89      0.89      0.89     18750



Classification Report on Test Data
              precision    recall  f1-score   support

           0       0.86      0.84      0.85      3114
           1       0.85      0.87      0.86      3136

    accuracy                           0.86      6250
   macro avg       0.86      0.86      0.86      6250
weighted avg       0.86      0.86      0.86      6250



### Model-4 LogisticRegression with GridSearchCV

In [40]:
from sklearn.model_selection import GridSearchCV

parameters = {"C": np.arange(0.1, 1, 0.1)}
grid = GridSearchCV(logreg, parameters, cv=5)
grid.fit(X_train,y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=500, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'C': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [41]:
topC = grid.best_params_["C"]
print(topC)

0.1
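
The selected value of 0.1 is the smallest one in the searched grid, so the best `C` may lie below the range that was tried. One possible follow-up, sketched here on synthetic data (an assumption, not the review features), is to search a wider, log-spaced range. The grid bounds are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the review features)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Log-spaced grid from 0.001 to 100, so values below 0.1 are also tried
wide_grid = {"C": np.logspace(-3, 2, 6)}
demo_search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=500), wide_grid, cv=5
)
demo_search.fit(X_demo, y_demo)

best_C = demo_search.best_params_["C"]
print(best_C)
```

If the best value again falls on an edge of the grid, the range can be extended further in that direction before refining around the winner.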


In [42]:
y_train_pred_1=grid.predict(X_train)
y_val_pred_2=grid.predict(X_val)

In [43]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_train,y_train_pred_1))
print('\n')
print(confusion_matrix(y_val,y_val_pred_2))

[[8310 1076]
 [ 917 8447]]


[[2633  481]
 [ 446 2690]]


In [44]:
from sklearn.metrics import classification_report

print("Classification Report on Train Data")
print(classification_report(y_train,y_train_pred_1,digits=2))
print("\n")

print("Classification Report on Test Data")
print(classification_report(y_val,y_val_pred_2,digits=2))

Classification Report on Train Data
              precision    recall  f1-score   support

           0       0.90      0.89      0.89      9386
           1       0.89      0.90      0.89      9364

    accuracy                           0.89     18750
   macro avg       0.89      0.89      0.89     18750
weighted avg       0.89      0.89      0.89     18750



Classification Report on Test Data
              precision    recall  f1-score   support

           0       0.86      0.85      0.85      3114
           1       0.85      0.86      0.85      3136

    accuracy                           0.85      6250
   macro avg       0.85      0.85      0.85      6250
weighted avg       0.85      0.85      0.85      6250



### Model-5 RandomForestClassifier

In [45]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)

# predict the labels on train dataset
pred_train = rfc.predict(X_train)

# predict the labels on validation dataset
pred_val = rfc.predict(X_val)



In [46]:
print(confusion_matrix(y_train,pred_train))
print('\n')
print(confusion_matrix(y_val,pred_val))

[[9355   31]
 [ 121 9243]]


[[2585  529]
 [ 976 2160]]


In [47]:
print("Classification Report on Train Data")
print(classification_report(y_train,pred_train,digits=2))

print("\n")

print("Classification Report on Test Data")
print(classification_report(y_val,pred_val,digits=2))

Classification Report on Train Data
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      9386
           1       1.00      0.99      0.99      9364

    accuracy                           0.99     18750
   macro avg       0.99      0.99      0.99     18750
weighted avg       0.99      0.99      0.99     18750



Classification Report on Test Data
              precision    recall  f1-score   support

           0       0.73      0.83      0.77      3114
           1       0.80      0.69      0.74      3136

    accuracy                           0.76      6250
   macro avg       0.76      0.76      0.76      6250
weighted avg       0.76      0.76      0.76      6250

