**Sentiment Analysis of IMDB Movie Reviews**


This Notebook is heavily based on the Notebook by [Lakshmipathi N](https://www.kaggle.com/lakshmi25npathi) found on [Kaggle](https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews).

**Import necessary libraries**

In [48]:
#Load the libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
# https://online.stat.psu.edu/stat504/lesson/1/1.7
from utils import preprocesser_text, binarize_sentiment, train_test_split, evaluate, _remove_between_square_brackets
import collections
from collections import Counter
import numpy as np


import os
import warnings

**Import the training dataset**

In [52]:
#importing the training data
imdb_data=pd.read_csv('data/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


**Exploratory data analysis**

In [4]:
#Summary of the dataset
imdb_data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


**Sentiment count**

In [5]:
#sentiment count
imdb_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

We can see that the dataset is balanced.

**Text normalization**

In [6]:
imdb_data_norm = preprocesser_text(imdb_data)

Pandas Apply: 100%|██████████| 50000/50000 [00:06<00:00, 8046.39it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:00<00:00, 491636.00it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:00<00:00, 50639.50it/s]
Pandas Apply: 100%|██████████| 50000/50000 [02:20<00:00, 355.39it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:33<00:00, 1486.32it/s]


**Splitting the training dataset**

When doing a very basic train/test split, both partitions should have a class balance close to 50/50. This is the case here.
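A quick way to verify the balance is to compute the share of each class in both partitions. The sketch below uses plain Python lists that reproduce the class counts reported for this split; `class_shares` is a hypothetical helper written for illustration and is not part of the notebook's `utils` module.

```python
from collections import Counter

def class_shares(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Stand-ins for norm_train.sentiment and norm_test.sentiment,
# built from the class counts reported for this split.
train_labels = ["negative"] * 20007 + ["positive"] * 19993
test_labels = ["positive"] * 5007 + ["negative"] * 4993

train_shares = class_shares(train_labels)
test_shares = class_shares(test_labels)
print(train_shares)
print(test_shares)
```

Both partitions stay within a tenth of a percentage point of a 50/50 split, so plain accuracy is a reasonable metric here.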

In [7]:
#normalized train reviews
norm_train, norm_test = train_test_split(imdb_data_norm)
print(norm_train.sentiment.value_counts())
print(norm_test.sentiment.value_counts())
norm_train.review[0]

negative    20007
positive    19993
Name: sentiment, dtype: int64
positive    5007
negative    4993
Name: sentiment, dtype: int64


'one review ha mention watch 1 oz episod youll hook right thi exactli happen meth first thing struck oz wa brutal unflinch scene violenc set right word go trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti wa surreal couldnt say wa readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison exp

**Bag of words model**

The bag-of-words model converts text documents into numerical vectors. Each column of the resulting matrix stands for one n-gram: with n = 1 a column is a single word, with n = 2 it is a sequence of two consecutive words, and so on. Each row is one document (here, one review) and holds how often each n-gram appears in it.
min_df: A float x means that an n-gram has to appear in at least a fraction x of the documents. An integer is read as an absolute number of documents.
max_df: A float x means that an n-gram may appear in at most a fraction x of the documents. An integer is read as an absolute number of documents.
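To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words matrix with unigrams and bigrams. The helper `word_ngrams` and the two toy documents are assumptions made for illustration; scikit-learn's CountVectorizer additionally applies its own tokenization and stores the result as a sparse matrix.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=1):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

docs = ["good movi good act", "bad movi"]

# Build the vocabulary (one column per distinct n-gram), sorted alphabetically.
vocabulary = sorted({g for doc in docs for g in word_ngrams(doc, 1, 2)})

# One row per document: how often each vocabulary entry occurs in it.
matrix = []
for doc in docs:
    counts = Counter(word_ngrams(doc, 1, 2))
    matrix.append([counts[g] for g in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Even for two tiny documents the bigrams double the number of columns, which hints at why the vocabulary grows so quickly with the n-gram range.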

In [88]:
#Count vectorizer for bag of words
cv=CountVectorizer(ngram_range=(1,1), min_df=0, max_df=1.)
#transformed train reviews
cv_train_reviews=cv.fit_transform(norm_train.review)
#transformed test reviews
cv_test_reviews=cv.transform(norm_test.review)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)
print('Vocab: ', cv.get_feature_names()[:5])
print('Vocab Length: ', len(cv.get_feature_names()))
cv_train_reviews[0]

BOW_cv_train: (40000, 156136)
BOW_cv_test: (10000, 156136)
Vocab:  ['00', '000', '0000000000001', '00000001', '000001']
Vocab Length:  156136


<1x156136 sparse matrix of type '<class 'numpy.int64'>'
	with 143 stored elements in Compressed Sparse Row format>

In [113]:
norm_test.review

40000    first want say lean liber polit scale found mo...
40001    wa excit see sitcom would hope repres indian c...
40002    look cover read stuff entir differ type movi c...
40003    like mani count appear denni hopper make thi c...
40004    thi movi wa tv day didnt enjoy first georg jun...
                               ...                        
49995    thought thi movi right good job wasnt creativ ...
49996    bad plot bad dialogu bad act idiot direct anno...
49997    cathol taught parochi elementari school nun ta...
49998    im go disagre previou comment side maltin thi ...
49999    one expect star trek movi high art fan expect ...
Name: review, Length: 10000, dtype: object

In [164]:
# We would expect that the dimension of the vectorized words would match the number of unique words.

results = Counter()
norm_train.review.str.split().apply(results.update)

try:
    assert len(results.keys()) ==  len(cv.get_feature_names()), "The Dimension of the vectorized words does not match the number of unique words!"
except AssertionError as E:
    print(E)

The Dimension of the vectorized words does not match the number of unique words!


In [132]:
# We can check which ones are missing:
print("Unique Words in CountVectorizer but not in the reviews:")
print(set(cv.get_feature_names()).difference(results.keys()))
print("Unique Words in the Dataset but not in the CountVectorizer:")
print(set(results.keys()).difference(cv.get_feature_names()))

Unique Words in CountVectorizer but not in the reviews:
{'adventure', 'that', 'screams', 'in', 'settings', 'the', 'passed', 'mystery', 'and', 'oanyhow', 'actors'}
Unique Words in the Dataset but not in the CountVectorizer:
{'h', '^d', 'r', 'c', 'f^k', '^^', '6', '0', '_', '7', '2', 'e', '^^^', 'in\\al', 'spoilers^th', 'spoilers^^thi', 'time^^', '3', 'k', 'z', 'n', 'o^less', 'w\\', 'w', 'tortu^^^^^', '^', 'f', '5', 'f^', 'c^m', 'x', 'horror\\adventur', '1', '^^thi', '`', 'passed^mild', 'lives\\i', '\\the', 'screams\\jack', 'actors\\actress', 'hallmark\\lifetim', 'u', 'l', '^^contain', 'm^er', ']', '^_', '\\', 'horror\\fantasi', 'j', 'g', '^^i', '\\and\\', 'b', 'v', '10\\10i', '10^30', '^_^', '^oanyhow', '4', 'that^^', 'least^^', 'q', 'c^p', 'mexican\\english', '^____^the', 'hell^^', '5\\\\7', 'settings\\costum', '9', 'f^^ing', 'time^_^', 'p', '8', 'horror\\mystery\\action\\adventure\\combat'}


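The difference is explained by the regex pattern that scikit-learn's CountVectorizer uses for tokenization (parameter `token_pattern`). By default it only keeps tokens of two or more word characters, so single-character tokens are dropped and tokens are split at characters such as `\` and `^`. The comparison also shows that our preprocessing function did not reliably remove all special characters: its regex pattern for special characters should remove `\` and `^` as well, but currently does not.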
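The effect can be reproduced with the standard `re` module alone. The sketch below compares whitespace splitting (as used for the `Counter` above) with the default `token_pattern` of CountVectorizer; the sample string is made up from tokens of the kind listed in the output.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: runs of two or
# more word characters, delimited by word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sample = r"horror\adventur 6 f^k time^^ movi"

# Splitting on whitespace keeps special characters and single characters.
whitespace_tokens = sample.split()
# The regex splits at non-word characters and drops one-character tokens.
pattern_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens)
print(pattern_tokens)
```

Tokens such as `6` and `f^k` vanish entirely, while `horror\adventur` turns into two separate vocabulary entries, which matches both directions of the difference observed above.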

**Term Frequency-Inverse Document Frequency model (TFIDF)**

TF-IDF converts text documents into a matrix of tf-idf features. For each review, it counts how often a word appears in that review and multiplies the count by the inverse document frequency, which penalizes words that appear in many documents. Roughly: number of documents / number of documents containing the word.

> The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

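tf(t, d) = number of occurrences of word t in document d

idf(t) = log(number of documents / number of documents containing word t) + 1


tf-idf(t, d) = tf(t, d) * idf(t)

min_df: A float x means that an n-gram has to appear in at least a fraction x of the documents. An integer is read as an absolute number of documents.
max_df: A float x means that an n-gram may appear in at most a fraction x of the documents. An integer is read as an absolute number of documents.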
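The formulas can be checked by hand on a toy corpus. The following sketch implements the unsmoothed variant from the quote above in plain Python; the documents and the helper names `idf` and `tfidf` are assumptions for illustration. TfidfVectorizer itself uses `smooth_idf=True` and L2-normalizes each row by default, so its numbers will differ from the ones printed here.

```python
import math
from collections import Counter

docs = [
    "good movi good act".split(),
    "bad movi".split(),
    "good stori".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Unsmoothed idf as in the quoted formula: ln(n / df(t)) + 1."""
    return math.log(n_docs / df[term]) + 1

def tfidf(doc):
    """Raw term count times idf, without the final length normalization."""
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

The word "act" occurs only once in the first document but appears in no other document, so it receives a higher weight than "movi", which occurs equally often but is shared with a second document.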

In [9]:
#Tfidf vectorizer
tv=TfidfVectorizer(ngram_range=(1,3), min_df=0., max_df=1.)
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train.review)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test.review)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)
print('Tfidf_test:',len(tv.vocabulary_))

Tfidf_train: (40000, 6983231)
Tfidf_test: (10000, 6983231)
Tfidf_test: 6983231


In [10]:
#Tfidf vectorizer
tv=TfidfVectorizer(ngram_range=(1,3), min_df=0.001, max_df=0.999)
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train.review)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test.review)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)
print('Tfidf_test:',len(tv.vocabulary_))

Tfidf_train: (40000, 17053)
Tfidf_test: (10000, 17053)
Tfidf_test: 17053


As seen above, filtering out very common and very rare n-grams (with max_df and min_df) reduces the vocabulary size, in our example from 6'983'231 to 17'053. As we will see later, this improves the overall score by reducing overfitting.
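One detail is easy to miss: scikit-learn reads a float value of min_df or max_df as a proportion of documents, but an integer as an absolute number of documents. The following sketch mimics that documented rule in plain Python to show that `max_df=1` and `max_df=1.0` select very different vocabularies. The helper `filter_vocabulary` and the toy documents are assumptions for illustration, not scikit-learn's implementation.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Floats are read as a proportion of documents, integers as absolute counts.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

docs = ["good movi", "bad movi", "good stori", "movi"]

# Float 1.0: terms may appear in up to 100% of the documents.
all_terms = filter_vocabulary(docs, min_df=0, max_df=1.0)
# Integer 1: terms may appear in at most one document.
rare_only = filter_vocabulary(docs, min_df=0, max_df=1)

print(all_terms)
print(rare_only)
```

With the integer setting only terms unique to a single document remain, so a model trained on them has almost nothing that transfers to unseen reviews.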

**Split and binarize the sentiment data**

In [11]:
#Binarizing the sentiment labels
train_sentiments = binarize_sentiment(norm_train.sentiment)
test_sentiments = binarize_sentiment(norm_test.sentiment)
print(train_sentiments.unique())
print(test_sentiments.unique())

[1 0]
[0 1]


**Modelling the dataset**

To compare different parameters for the model and the text preprocessor, we can create a function that does all the vectorizing and modelling. This allows us to quickly compare and evaluate results.

In [45]:
def vectorize_train_validate_evaluate(word_vectorizer, n_gram, analyzer, min_df, max_df, model, parameters, train_x, train_y, test_x, test_y, cv=5):
    """Pipeline to compare different types of preprocessing of words, models and parameters."""
    # Instantiate the vectorizer class and build the vocabulary on the training data only
    word_vectorizer = word_vectorizer(ngram_range=n_gram, analyzer=analyzer, min_df=min_df, max_df=max_df)
    train_x = word_vectorizer.fit_transform(train_x)
    test_x = word_vectorizer.transform(test_x)
    print('Length of Vocabulary after vectorizing the corpus:',len(word_vectorizer.vocabulary_))
    # Grid search with cross-validation to find the best hyperparameters
    grid_search = GridSearchCV(estimator = model,
                            param_grid = parameters, 
                            scoring = 'accuracy',
                            cv = cv,
                            n_jobs = -1, 
                            verbose = 2)
    grid_search.fit(train_x,train_y)
    print(f'Best Score: {grid_search.best_score_}. Best Params: {grid_search.best_params_}')

    # Refit the best estimator on the full training set
    # (redundant, because GridSearchCV already does this with its default refit=True)
    grid_search.best_estimator_.fit(train_x,train_y)
    print(grid_search.best_estimator_)

    # Predict on the test set
    y_pred=grid_search.best_estimator_.predict(test_x)
    print(y_pred)

    # Predict on the training set
    y_pred_train=grid_search.best_estimator_.predict(train_x)
    print(y_pred_train)

    # Accuracy on the test set
    print("Accuracy test:",evaluate(test_y,y_pred)[0])
    # Accuracy on the training set
    print("Accuracy train:",evaluate(train_y,y_pred_train)[0])
    # Classification report on the test set
    print("Accuracy test:",evaluate(test_y,y_pred)[1])
    # Classification report on the training set
    print("Accuracy train:",evaluate(train_y,y_pred_train)[1])

In [33]:
#training the model
lr=LogisticRegression()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# NOTE: max_df=1 is an integer, so scikit-learn reads it as an absolute count and keeps
# only n-grams that occur in a single document; max_df=1.0 would keep all n-grams.
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0, max_df=1, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0, max_df=1, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 6209089
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.500175. Best Params: {'C': 0.01, 'max_iter': 500, 'penalty': 'l2'}
LogisticRegression(C=0.01, max_iter=500)
[0 0 0 ... 0 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.6061
Accuracy train: 0.996275
Accuracy test:               precision    recall  f1-score   support

    Negative       0.56      0.98      0.71      4993
    Positive       0.93      0.23      0.37      5007

    accuracy                           0.61     10000
   macro avg       0.75      0.61      0.54     10000
weighted avg       0.75      0.61      0.54     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.99      1.00      1.00     20007
    Positive       1.00      0.99      1.00     19993

    accuracy                           1.00     40000
   macro avg       1.00      1.00      1.00     40000
weighted avg       1.00      1.00      1.00    

In [34]:
#training the model
lr=LogisticRegression()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17053
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.888525. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LogisticRegression(C=1, max_iter=500)
[0 0 0 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8986
Accuracy train: 0.929425
Accuracy test:               precision    recall  f1-score   support

    Negative       0.90      0.90      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.94      0.92      0.93     20007
    Positive       0.92      0.94      0.93     19993

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93     40000



**Changing the n-gram range**

As seen below, training speeds up considerably when the n-gram range is changed to (3,3), i.e. when only trigrams are used. However, the accuracy on the test set also drops from about 90% to about 66%, which is below our baseline.

In [35]:
#training the model
lr=LogisticRegression()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(3,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(3,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 789
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.6503. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LogisticRegression(C=1, max_iter=500)
[1 0 0 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.6569
Accuracy train: 0.670275
Accuracy test:               precision    recall  f1-score   support

    Negative       0.73      0.50      0.59      4993
    Positive       0.62      0.81      0.70      5007

    accuracy                           0.66     10000
   macro avg       0.67      0.66      0.65     10000
weighted avg       0.67      0.66      0.65     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.75      0.52      0.61     20007
    Positive       0.63      0.82      0.71     19993

    accuracy                           0.67     40000
   macro avg       0.69      0.67      0.66     40000
weighted avg       0.69      0.67      0.66     40000


**Using characters as n-grams**

We can see that the vocabulary size drops from about 17'000 to about 7'900, even though we use the same n-gram range. The quality on the test set also drops by 2.7%.
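For intuition, character n-grams slide a window over the raw characters instead of over the words. The sketch below is a simplified stand-in for the `analyzer='char'` setting; `char_ngrams` is a hypothetical helper, and scikit-learn additionally normalizes whitespace before extracting the n-grams.

```python
def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, spaces included."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

sample = "bad movi"
grams = char_ngrams(sample, 1, 3)

print(grams)
print(len(grams), "n-grams,", len(set(grams)), "distinct")
```

Because the number of possible short character sequences is bounded by the size of the alphabet, far fewer distinct features can exist than for word n-grams, which is one plausible reason for the smaller vocabulary.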

**Linear Support Vector Classifier**

Hyperparameter tuning and training are very quick. However, with the unfiltered settings (min_df=0, max_df=1), the model overfits heavily, as seen below:

In [36]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values. However, because of the long fitting time, we don't do a gridsearch if we use all the words (min_df = 1, max_df = 1.0)
parameters = [{'C': [1, 0.1, 0.01], 'penalty': ['l2'], 'max_iter':[500]}]
# NOTE: max_df=1 is an integer, so scikit-learn reads it as an absolute count and keeps
# only n-grams that occur in a single document; max_df=1.0 would keep all n-grams.
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0, max_df=1, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0, max_df=1, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 6209089
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.500175. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=1, max_iter=500)
[0 0 0 ... 0 1 1]
[1 1 1 ... 1 0 0]
Accuracy test: 0.7485
Accuracy train: 0.996275
Accuracy test:               precision    recall  f1-score   support

    Negative       0.73      0.78      0.76      4993
    Positive       0.77      0.72      0.74      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.99      1.00      1.00     20007
    Positive       1.00      0.99      1.00     19993

    accuracy                           1.00     40000
   macro avg       1.00      1.00      1.00     40000
weighted avg       1.00      1.00      1.00     40000


In [37]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [1, 0.1, 0.01], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17053
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.88965. Best Params: {'C': 0.1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.1, max_iter=500)
[0 0 1 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.899
Accuracy train: 0.931325
Accuracy test:               precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.94      0.92      0.93     20007
    Positive       0.92      0.94      0.93     19993

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93     40000


After filtering n-grams with min_df and max_df, the quality is very good; the filtering seems to avoid overfitting on a few very specific tokens.

In [38]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.1, max_df=0.9, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.1, max_df=0.9, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 153
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.7649. Best Params: {'C': 0.1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.1, max_iter=500)
[0 0 0 ... 1 0 0]
[1 1 1 ... 1 1 0]
Accuracy test: 0.7608
Accuracy train: 0.7685
Accuracy test:               precision    recall  f1-score   support

    Negative       0.76      0.75      0.76      4993
    Positive       0.76      0.77      0.76      5007

    accuracy                           0.76     10000
   macro avg       0.76      0.76      0.76     10000
weighted avg       0.76      0.76      0.76     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.77      0.76      0.77     20007
    Positive       0.76      0.78      0.77     19993

    accuracy                           0.77     40000
   macro avg       0.77      0.77      0.77     40000
weighted avg       0.77      0.77      0.77     40000


We can see that if we drop n-grams that appear in fewer than 10% or in more than 90% of the documents, the vocabulary shrinks to 153 entries. This leads to a poor test accuracy of 76.1%.

In [39]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,4), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,4), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17108
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.889625. Best Params: {'C': 0.1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.1, max_iter=500)
[0 0 1 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8991
Accuracy train: 0.93145
Accuracy test:               precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.94      0.92      0.93     20007
    Positive       0.92      0.94      0.93     19993

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93     40000


Increasing the n-gram range of the word vectorizer to (1,4) changes the test accuracy only marginally, from 89.90% to 89.91%, and increases the vocabulary size from 17053 to 17108.

In [40]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 7890
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.864325. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=1, max_iter=500)
[0 0 1 ... 1 1 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8732
Accuracy train: 0.9112
Accuracy test:               precision    recall  f1-score   support

    Negative       0.88      0.87      0.87      4993
    Positive       0.87      0.88      0.87      5007

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.92      0.91      0.91     20007
    Positive       0.91      0.92      0.91     19993

    accuracy                           0.91     40000
   macro avg       0.91      0.91      0.91     40000
weighted avg       0.91      0.91      0.91     40000

Length of Vo



Best Score: 0.8468. Best Params: {'C': 0.01, 'max_iter': 500, 'penalty': 'l2'}




LinearSVC(C=0.01, max_iter=500)
[0 0 1 ... 1 1 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8652
Accuracy train: 0.921875
Accuracy test:               precision    recall  f1-score   support

    Negative       0.87      0.86      0.86      4993
    Positive       0.86      0.87      0.87      5007

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.93      0.91      0.92     20007
    Positive       0.92      0.93      0.92     19993

    accuracy                           0.92     40000
   macro avg       0.92      0.92      0.92     40000
weighted avg       0.92      0.92      0.92     40000



With characters as the analyzer unit, we get a vocabulary of 7890 entries. However, this does not reduce the training time, because the model does not converge within 500 iterations.

**Multinomial Naive Bayes for bag of words and tfidf features**

In [47]:
#training the model
mnb=MultinomialNB()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'alpha': [0.01,0.1,1]}]
# NOTE: in the first call max_df=1 is an integer, so scikit-learn reads it as an absolute count
# and keeps only n-grams that occur in a single document; max_df=1. (float) keeps all n-grams.
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0, max_df=1, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0, max_df=1., 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 6209089
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.5001749999999999. Best Params: {'alpha': 0.01}
MultinomialNB(alpha=0.01)
[0 0 0 ... 0 1 1]
[1 1 1 ... 1 0 0]
Accuracy test: 0.7512
Accuracy train: 0.996275
Accuracy test:               precision    recall  f1-score   support

    Negative       0.75      0.75      0.75      4993
    Positive       0.75      0.75      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.99      1.00      1.00     20007
    Positive       1.00      0.99      1.00     19993

    accuracy                           1.00     40000
   macro avg       1.00      1.00      1.00     40000
weighted avg       1.00      1.00      1.00     40000


In [42]:
#training the model
mnb=MultinomialNB()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'alpha': [0.01,0.1,1]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17053
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.87065. Best Params: {'alpha': 0.1}
MultinomialNB(alpha=0.1)
[0 0 0 ... 0 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8727
Accuracy train: 0.89095
Accuracy test:               precision    recall  f1-score   support

    Negative       0.88      0.86      0.87      4993
    Positive       0.86      0.89      0.87      5007

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.91      0.87      0.89     20007
    Positive       0.88      0.91      0.89     19993

    accuracy                           0.89     40000
   macro avg       0.89      0.89      0.89     40000
weighted avg       0.89      0.89      0.89     40000


In [43]:
#training the model
mnb=MultinomialNB()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'alpha': [0.01,0.1,1]}]
# Tfidf vectorizer
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 7890
Fitting 2 folds for each of 3 candidates, totalling 6 fits
Best Score: 0.814025. Best Params: {'alpha': 1}
MultinomialNB(alpha=1)
[0 0 0 ... 1 1 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8142
Accuracy train: 0.82485
Accuracy test:               precision    recall  f1-score   support

    Negative       0.82      0.80      0.81      4993
    Positive       0.81      0.83      0.82      5007

    accuracy                           0.81     10000
   macro avg       0.81      0.81      0.81     10000
weighted avg       0.81      0.81      0.81     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.83      0.81      0.82     20007
    Positive       0.82      0.84      0.83     19993

    accuracy                           0.82     40000
   macro avg       0.83      0.82      0.82     40000
weighted avg       0.83      0.82      0.82     40000
