# Polarity prediction on movie reviews: comparison of parameters and learning methods

## Importing libraries
(scikit-learn for learning, nltk for text processing and pandas for data representations)

## Learning methods
To classify a vector of numbers, we used:

- Logistic regression
- MultinomialNB
- kNN
- Random Forest
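
As a rough illustration, the four classifiers listed above share the same scikit-learn `fit`/`predict` interface, so they can be swapped freely behind the same vectorizer. The sketch below assumes a tiny hand-made count matrix (`X_toy` and `y_toy` are hypothetical and are not the review data) and arbitrary hyperparameters chosen only to keep it small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 6 "documents" described by 3 non-negative counts
X_toy = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0],
                  [0, 3, 1], [0, 2, 2], [1, 4, 0]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

# The hyperparameters here are illustrative assumptions, not tuned values
classifiers = {
    "Logistic regression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

# Every estimator is trained and queried through the same two calls
toy_predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_toy, y_toy)
    toy_predictions[name] = clf.predict(X_toy)
    print(name, toy_predictions[name])
```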

## Text representations
We tried different representations of the data to see how they influence learning:
- Bag-of-words
- n-grams
- Term frequency (bag-of-words normalized)
- Term frequency times inverse document frequency 
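
To make the four representations above concrete, here is a minimal sketch on a three-document toy corpus (`toy_docs` is hypothetical). It assumes that "term frequency" means the L2-normalized counts produced by `TfidfTransformer(use_idf=False)`, which is how the pipelines in this notebook switch between the two weightings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus
toy_docs = ["good movie", "bad movie", "good good plot"]

# Bag-of-words: raw counts of each word per document
bow_vec = CountVectorizer()
bow = bow_vec.fit_transform(toy_docs)
vocab = sorted(bow_vec.vocabulary_)          # column order of `bow`
print(vocab)
print(bow.toarray())

# n-grams: here 1-grams and 2-grams together
bigram_vec = CountVectorizer(ngram_range=(1, 2))
bigram_vec.fit(toy_docs)
bigram_vocab = sorted(bigram_vec.vocabulary_)
print(bigram_vocab)

# Term frequency: counts normalized per document (L2 norm by default)
tf = TfidfTransformer(use_idf=False).fit_transform(bow)

# tf-idf: same, but words present in many documents are down-weighted
tfidf = TfidfTransformer(use_idf=True).fit_transform(bow)
print(tf.toarray().round(2))
print(tfidf.toarray().round(2))
```

In the second document ("bad movie"), both words have the same term frequency, but "bad" occurs in fewer documents than "movie", so tf-idf gives it the larger weight.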

### Additional filter on the training set

We try removing English stop words. The features are 1-grams and 2-grams; we did not consider 3-grams because of the resulting vector length.

We ignore terms that appear in more than 70% of the documents, which is intuitively meaningful: at a frequency of 60%, we still find words such as "uh" in our texts. We also keep only terms that appear in at least 2 documents, so as not to generalize from a very specific example.

We kept in mind that removing stop words can lead to information loss. The tf-idf representation already decreases the influence of stop words, but we wanted to get rid of the most common ones to improve performance.
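
The effect of the stop-word list and of the two document-frequency thresholds described above can be checked on a small example. This is only a sketch on a hypothetical four-sentence corpus (`toy_docs`), using the same `CountVectorizer` arguments as the rest of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of 4 documents
toy_docs = [
    "the film was great",
    "the film was awful",
    "the plot was great",
    "the ending surprised everyone",
]

# No filtering: every word of the corpus becomes a feature
unfiltered = CountVectorizer().fit(toy_docs)

# stop_words='english' drops "the" and "was";
# max_df=0.7 would drop any remaining term found in more than 70% of the documents;
# min_df=2 drops terms found in a single document ("awful", "plot", ...)
filtered = CountVectorizer(stop_words='english', max_df=0.7, min_df=2).fit(toy_docs)

kept_terms = sorted(filtered.vocabulary_)
print(len(unfiltered.vocabulary_), "terms before filtering")
print(kept_terms)  # ['film', 'great']
```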

## Importing libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from glob import glob
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import metrics
from scikitplot.metrics import plot_confusion_matrix



## Importing the datasets
(We used the small dataset)

In [2]:
# Get all file paths
posFiles = glob('review_polarity/txt_sentoken/pos/*')
negFiles = glob('review_polarity/txt_sentoken/neg/*')
# Read text files (the context manager closes each file once it has been read)
def read_text(path):
    with open(path) as f:
        return f.read()

posReviews = np.array([read_text(f) for f in posFiles])
negReviews = np.array([read_text(f) for f in negFiles])
# Use pandas to label the data and print a random sample
polarity_files_df = pd.DataFrame({'pos':posReviews,'neg':negReviews})
polarity_files_df = pd.melt(polarity_files_df, value_vars=['pos','neg'],value_name="text",var_name="label")
polarity_files_df["label_num"] = polarity_files_df.label.map({"neg":0, "pos":1})
polarity_files_df.sample(5)

| | label | text | label_num |
|------|-------|------|-----------|
| 1729 | neg | this talky , terribly-plotted thriller stars a... | 0 |
| 484 | pos | the sweet hereafter could serve as a textbook ... | 1 |
| 1074 | neg | at one point in this movie there is a staging ... | 0 |
| 279 | pos | marie ( charlotte rampling , " aberdeen " ) an... | 1 |
| 1032 | neg | an affluent horse breeder's past comes up to h... | 0 |


The larger and the smaller datasets are two different datasets, so we only compare results obtained on the smaller one. We still report results computed on the larger dataset to give an idea of how each method behaves on another dataset.
(Results on the larger dataset were obtained with the same code as for the small dataset, using only the train set, as below. This let us avoid adapting the code to a larger dataset.)
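
Since the same loading steps are reused for both datasets, they could be wrapped in a small helper. The function below is a hypothetical sketch (`load_polarity_frame` is not part of the original notebook); it assumes one review per file and UTF-8 text, and unlike the two-column `DataFrame` construction it does not require the two classes to have the same number of files.

```python
from pathlib import Path

import pandas as pd


def load_polarity_frame(pos_dir, neg_dir):
    """Hypothetical helper: read one review per file and label it."""
    rows = []
    for label, folder in (("pos", pos_dir), ("neg", neg_dir)):
        for path in sorted(Path(folder).glob("*")):
            # Assumption: the review files are UTF-8 encoded
            rows.append({"label": label,
                         "text": path.read_text(encoding="utf-8")})
    frame = pd.DataFrame(rows, columns=["label", "text"])
    # Same numeric encoding as in the notebook: neg -> 0, pos -> 1
    frame["label_num"] = frame.label.map({"neg": 0, "pos": 1})
    return frame
```

For the small dataset this would be called as `load_polarity_frame('review_polarity/txt_sentoken/pos', 'review_polarity/txt_sentoken/neg')`.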

In [3]:
# Get all files path
#posFiles = glob('aclImdb_v1/aclImdb/test/pos/*')
#negFiles = glob('aclImdb_v1/aclImdb/test/neg/*')
# Read text files
#posReviews = np.array([open(f).read() for f in posFiles])
#negReviews = np.array([open(f).read() for f in negFiles])
# Use pandas to label, mix the data and print a sample
#polarity_files_df = pd.DataFrame({'pos':posReviews,'neg':negReviews})
#polarity_files_df = pd.melt(polarity_files_df, value_vars=['pos','neg'],value_name="text",var_name="label")
#polarity_files_df["label_num"] = polarity_files_df.label.map({"neg":0, "pos":1})
#polarity_files_df.sample(5)

## Split and shuffle the data
Used when comparing confusion matrices. We also look at the length of the vectors after preprocessing.

In [4]:
# Split and shuffle the data (85% for training and 15% for testing)

X_train, X_test, y_train, y_test = train_test_split(polarity_files_df.text, polarity_files_df.label_num, test_size=0.15, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

vec1 = CountVectorizer(stop_words='english', max_df =0.7, min_df=2)
vec1.fit_transform(X_train)
print(len(vec1.get_feature_names()))


vec2 = CountVectorizer(stop_words='english', max_df =0.7)
vec2.fit_transform(X_train)
print(len(vec2.get_feature_names()))

# Removing max_df=0.7 changes almost nothing here (only 3 more features), but keeping it seems meaningful for larger datasets
vec3 = CountVectorizer(stop_words='english')
vec3.fit_transform(X_train)
print(len(vec3.get_feature_names()))


vec4 = CountVectorizer()
vec4.fit_transform(X_train)
print(len(vec4.get_feature_names()))


(1700,)
(300,)
(1700,)
(300,)
22142
36990
36993
37298


## Logistic Regression

In [5]:
from sklearn.linear_model import LogisticRegression

### Influence of parameters

In [6]:
# Test a range of hyperparameters
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', max_df =0.7, min_df=2)),
                     ('tfidf', TfidfTransformer()),
                     ('lr', LogisticRegression())
                    ])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf' : (True,False),
              'lr__C' : (0.5,1,2,20)
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, return_train_score=True)

gs_clf.fit(X_train, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.7, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        ...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'lr__C': (0.5, 1, 2, 20), 'tfidf__use_idf': (True, False), 'vect__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [7]:
# Getting the best scores
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

print(gs_clf.best_score_)

df = pd.DataFrame({'rank':gs_clf.cv_results_['rank_test_score'], 
                  'ngram_range':gs_clf.cv_results_['param_vect__ngram_range'],
                  'tfidf':gs_clf.cv_results_['param_tfidf__use_idf'],
                  'lr__C':gs_clf.cv_results_['param_lr__C'],
                  'mean_test_score':gs_clf.cv_results_['mean_test_score'], 
                  'mean_train_score':gs_clf.cv_results_['mean_train_score']}).set_index('rank')

# A less regularized classifier (large C) fits the training set more closely,
# yet generalizes about as well as the others, as the table below shows
df.sort_values('rank',ascending=True).head(10)

lr__C: 20
tfidf__use_idf: True
vect__ngram_range: (1, 1)
0.836470588235


| rank | lr__C | mean_test_score | mean_train_score | ngram_range | tfidf |
|------|-------|-----------------|------------------|-------------|-------|
| 1 | 20.0 | 0.836471 | 1.0 | (1, 1) | True |
| 2 | 20.0 | 0.835294 | 1.0 | (1, 2) | True |
| 3 | 20.0 | 0.834706 | 1.0 | (1, 2) | False |
| 4 | 20.0 | 0.833529 | 1.0 | (1, 1) | False |
| 5 | 2.0 | 0.832941 | 0.994117 | (1, 1) | True |
| 6 | 1.0 | 0.825294 | 0.987059 | (1, 1) | True |
| 6 | 2.0 | 0.825294 | 0.996176 | (1, 2) | True |
| 6 | 2.0 | 0.825294 | 0.979706 | (1, 1) | False |
| 6 | 2.0 | 0.825294 | 0.985589 | (1, 2) | False |
| 10 | 0.5 | 0.82 | 0.973825 | (1, 1) | True |
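
One way to read such a grid is to look at the gap between the mean training score and the mean validation score for each value of `C`. The sketch below shows the mechanics on synthetic data (`make_classification`, not the reviews), so its numbers say nothing about the results above. The names `toy_search` and `gap_df` and the choices `cv=3` and `max_iter=1000` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized reviews
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Same C values as in the grid above; larger C means weaker regularization
toy_search = GridSearchCV(LogisticRegression(max_iter=1000),
                          {'C': [0.5, 1, 2, 20]},
                          cv=3, return_train_score=True)
toy_search.fit(X_toy, y_toy)

gap_df = pd.DataFrame({
    'C': [p['C'] for p in toy_search.cv_results_['params']],
    'mean_train_score': toy_search.cv_results_['mean_train_score'],
    'mean_test_score': toy_search.cv_results_['mean_test_score'],
})
# A large positive gap suggests the model specializes on the training folds
gap_df['gap'] = gap_df.mean_train_score - gap_df.mean_test_score
print(gap_df.sort_values('C'))
```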


### Confusion matrix on bag-of-words

On the larger dataset, the following code reached an accuracy of 0.89.

In [8]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range =(1,2), max_df =0.7, min_df=2)),
                     ('lr', LogisticRegression()),
                    ])

text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

# Labels are sorted as 0 (neg), 1 (pos), so target_names must follow that order
print(metrics.classification_report(y_test, y_pred, target_names=["neg","pos"]))
print(metrics.accuracy_score(y_test, y_pred))

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, normalize=True, title='Normalized confusion matrix - Logistic Regression')

              precision    recall  f1-score   support

         neg       0.82      0.85      0.83       151
         pos       0.84      0.81      0.83       149

   micro avg       0.83      0.83      0.83       300
   macro avg       0.83      0.83      0.83       300
weighted avg       0.83      0.83      0.83       300

0.83
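
The normalized confusion matrix can also be computed without any plotting library. The sketch below divides each row of the matrix (one row per true class, 0 = neg and 1 = pos) by its total, which, as far as we understand, is what `normalize=True` does in `plot_confusion_matrix`. The labels `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 negative (0) and 6 positive (1) reviews
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Divide each row by its sum: the diagonal then holds the recall of each class
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

print(cm)             # [[3 1]
                      #  [2 4]]
print(cm_normalized)  # first row: [0.75 0.25]
```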



## MultinomialNB

In [9]:
from sklearn.naive_bayes import MultinomialNB

### Influence of parameters

In [10]:
# Test a range of hyperparameters
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', max_df =0.7, min_df=2)),
                     ('tfidf', TfidfTransformer()),
                     ('nb', MultinomialNB())
                    ])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf' : (True,False),
              'nb__alpha': (0,1,2,10)
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, return_train_score=True)

gs_clf.fit(X_train, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.7, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        ...linear_tf=False, use_idf=True)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'nb__alpha': (0, 1, 2, 10), 'tfidf__use_idf': (True, False), 'vect__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [None]:
# Getting the 10 best scores
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

print(gs_clf.best_score_)

df = pd.DataFrame({'rank':gs_clf.cv_results_['rank_test_score'], 
                  'ngram_range':gs_clf.cv_results_['param_vect__ngram_range'],
                  'tfidf':gs_clf.cv_results_['param_tfidf__use_idf'],
                  'nb__alpha':gs_clf.cv_results_['param_nb__alpha'],
                  'mean_test_score':gs_clf.cv_results_['mean_test_score'], 
                  'mean_train_score':gs_clf.cv_results_['mean_train_score']}).set_index('rank')

df.sort_values('rank',ascending=True).head(10)

nb__alpha: 1
tfidf__use_idf: False
vect__ngram_range: (1, 2)
0.815294117647


| rank | mean_test_score | mean_train_score | nb__alpha | ngram_range | tfidf |
|------|-----------------|------------------|-----------|-------------|-------|
| 1 | 0.815294 | 0.970295 | 1 | (1, 2) | False |
| 2 | 0.814706 | 0.957648 | 1 | (1, 1) | False |
| 2 | 0.814706 | 0.959708 | 2 | (1, 2) | False |
| 4 | 0.809412 | 0.98147 | 2 | (1, 2) | True |
| 4 | 0.809412 | 0.948827 | 2 | (1, 1) | False |
| 6 | 0.806471 | 0.984705 | 1 | (1, 2) | True |
| 7 | 0.804706 | 0.972354 | 10 | (1, 2) | True |
| 8 | 0.802353 | 0.974119 | 1 | (1, 1) | True |
| 9 | 0.801765 | 0.968825 | 2 | (1, 1) | True |
| 9 | 0.801765 | 0.959709 | 10 | (1, 1) | True |


### Confusion matrix on bag-of-words

On the larger dataset, the following code reached an accuracy of 0.88.

In [None]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range =(1,2), max_df =0.7, min_df=2)),
                     ('nb', MultinomialNB()),
                    ])

text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

# Labels are sorted as 0 (neg), 1 (pos), so target_names must follow that order
print(metrics.classification_report(y_test, y_pred, target_names=["neg","pos"]))
print(metrics.accuracy_score(y_test, y_pred))

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, normalize=True, title='Normalized confusion matrix - Multinomial Naive Bayes')

              precision    recall  f1-score   support

         neg       0.81      0.77      0.79       151
         pos       0.78      0.81      0.80       149

   micro avg       0.79      0.79      0.79       300
   macro avg       0.79      0.79      0.79       300
weighted avg       0.79      0.79      0.79       300

0.793333333333



## K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

### Influence of parameters

In [None]:
# Test a range of hyperparameters
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', max_df =0.7, min_df=2)),
                     ('tfidf', TfidfTransformer()),
                     ('knn', KNeighborsClassifier()),
                    ])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf' : (True,False),
              'knn__n_neighbors': (10, 25, 50),
              'knn__p' : (1,2),
              'knn__weights': ('uniform', 'distance')
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, return_train_score=True)

gs_clf.fit(X_train, y_train)



In [None]:
# Getting the 10 best scores
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

print(gs_clf.best_score_)

df = pd.DataFrame({'rank':gs_clf.cv_results_['rank_test_score'], 
                  'ngram_range':gs_clf.cv_results_['param_vect__ngram_range'],
                  'tfidf':gs_clf.cv_results_['param_tfidf__use_idf'],
                  'knn__n_neighbors': gs_clf.cv_results_['param_knn__n_neighbors'],
                  'knn__p': gs_clf.cv_results_['param_knn__p'], 
                  'knn__weights': gs_clf.cv_results_['param_knn__weights'], 
                  'mean_test_score':gs_clf.cv_results_['mean_test_score'], 
                  'mean_train_score':gs_clf.cv_results_['mean_train_score']}).set_index('rank')

df.sort_values('rank',ascending=True).head(10)

### Confusion matrix on bag-of-words

On the larger dataset, the following code reached an accuracy of 0.54.

In [None]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range =(1,2), max_df =0.7, min_df=2)),
                     ('knn', KNeighborsClassifier()),
                    ])

text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

# Labels are sorted as 0 (neg), 1 (pos), so target_names must follow that order
print(metrics.classification_report(y_test, y_pred, target_names=["neg","pos"]))
print(metrics.accuracy_score(y_test, y_pred))

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, normalize=True, title='Normalized confusion matrix - K-Nearest Neighbors')

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Influence of parameters

In [None]:
# Test a range of hyperparameters
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', max_df =0.7, min_df=2)),
                     ('tfidf', TfidfTransformer()),
                     ('rdc', RandomForestClassifier())
                    ])

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf' : (True,False),
              'rdc__n_estimators': (10, 25, 50)
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, return_train_score=True)

gs_clf.fit(X_train, y_train)

In [None]:
# Getting the 10 best scores
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))

print(gs_clf.best_score_)

df = pd.DataFrame({'rank':gs_clf.cv_results_['rank_test_score'], 
                  'ngram_range':gs_clf.cv_results_['param_vect__ngram_range'],
                  'tfidf':gs_clf.cv_results_['param_tfidf__use_idf'],
                  'rdc__n_estimators': gs_clf.cv_results_['param_rdc__n_estimators'],
                  'mean_test_score':gs_clf.cv_results_['mean_test_score'], 
                  'mean_train_score':gs_clf.cv_results_['mean_train_score']
                  }).set_index('rank')

df.sort_values('rank',ascending=True).head(10)

### Confusion matrix on bag-of-words

On the larger dataset, the following code reached an accuracy of 0.77.

In [None]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range =(1,2), max_df =0.7, min_df=2)),
                     ('rdc', RandomForestClassifier()),
                    ])

text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

# Labels are sorted as 0 (neg), 1 (pos), so target_names must follow that order
print(metrics.classification_report(y_test, y_pred, target_names=["neg","pos"]))
print(metrics.accuracy_score(y_test, y_pred))

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, normalize=True, title='Normalized confusion matrix - Random Forest')