**Sentiment Analysis of IMDB Movie Reviews**


This Notebook is heavily based on the Notebook by [Lakshmipathi N](https://www.kaggle.com/lakshmi25npathi) found on [Kaggle](https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews).

**Import necessary libraries**

In [55]:
#Load the libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
# https://online.stat.psu.edu/stat504/lesson/1/1.7
from utils import preprocesser_text, binarize_sentiment, train_test_split, evaluate

import os
import warnings

**Import the training dataset**

In [11]:
#importing the training data
imdb_data=pd.read_csv('data/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


**Exploratory data analysis**

In [12]:
#Summary of the dataset
imdb_data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


**Sentiment count**

In [13]:
#sentiment count
imdb_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

We can see that the dataset is balanced.

**Text normalization**

In [14]:
imdb_data_norm = preprocesser_text(imdb_data)

Pandas Apply: 100%|██████████| 50000/50000 [00:09<00:00, 5118.26it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:00<00:00, 274732.95it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:01<00:00, 29940.49it/s]
Pandas Apply: 100%|██████████| 50000/50000 [03:19<00:00, 250.78it/s]
Pandas Apply: 100%|██████████| 50000/50000 [00:49<00:00, 1017.21it/s]


**Splitting the training dataset**

For a basic train/test split, both parts should keep the roughly 50/50 class balance of the full dataset. The value counts below confirm that this is the case here.

In [15]:
#normalized train reviews
norm_train, norm_test = train_test_split(imdb_data_norm)
print(norm_train.sentiment.value_counts())
print(norm_test.sentiment.value_counts())
norm_train.review[0]

negative    20007
positive    19993
Name: sentiment, dtype: int64
positive    5007
negative    4993
Name: sentiment, dtype: int64


'one review ha mention watch 1 oz episod youll hook right thi exactli happen meth first thing struck oz wa brutal unflinch scene violenc set right word go trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti wa surreal couldnt say wa readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison exp
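The `train_test_split` used above comes from the local `utils` module, whose implementation is not shown here. As a minimal sketch of one way to guarantee the class balance instead of checking it afterwards, scikit-learn's own `train_test_split` accepts a `stratify` argument. The toy DataFrame and the names `toy`, `toy_train`, and `toy_test` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split as sk_train_test_split

# Toy stand-in for the normalized reviews (hypothetical data, not the IMDB set).
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

# stratify keeps the class proportions of the full data in both parts.
toy_train, toy_test = sk_train_test_split(
    toy, test_size=0.2, stratify=toy["sentiment"], random_state=42
)

print(toy_train["sentiment"].value_counts().to_dict())
print(toy_test["sentiment"].value_counts().to_dict())
```

With a stratified split, the 80/20 partition keeps exactly half of each part positive, rather than only approximately half as in the counts above.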

**Bag of words model**

The bag of words model converts text documents into numerical count vectors. It can also build n_grams: for n = 1 every single word is a feature, for n = 2 every sequence of two consecutive words is a feature, and so on. Each row of the resulting matrix corresponds to one review, and each column holds how often a specific n_gram appears in that review.
min_df: float x: Means that an n_gram has to appear in at least a fraction x of the documents (0.001 corresponds to 0.1%)
max_df: float x: Means that an n_gram may appear in at most a fraction x of the documents (0.999 corresponds to 99.9%)
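To make the row/column layout concrete, here is a small sketch on a two-document toy corpus (the corpus and the names `toy_docs`, `toy_cv`, and `toy_counts` are made up for illustration). It uses unigrams and bigrams only, without any `min_df` or `max_df` filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good movie", "bad movie"]

# ngram_range=(1, 2): every single word and every pair of consecutive words is a feature.
toy_cv = CountVectorizer(ngram_range=(1, 2))
toy_counts = toy_cv.fit_transform(toy_docs)

# Order the n-grams by their column index in the count matrix.
features = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(features)
print(toy_counts.toarray())  # one row per document, one column per n-gram
```

Each row has one entry per n_gram in the vocabulary, so "good movie" gets a count of 1 in the columns for "good", "movie", and "good movie", and 0 elsewhere.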

In [16]:
#Count vectorizer for bag of words
cv=CountVectorizer(ngram_range=(1,3), min_df=0.001, max_df=0.999)
#transformed train reviews
cv_train_reviews=cv.fit_transform(norm_train.review)
#transformed test reviews
cv_test_reviews=cv.transform(norm_test.review)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)
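# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print('Vocab: ', cv.get_feature_names()[:5])
print('Vocab Length: ', len(cv.get_feature_names()))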
cv_train_reviews[0]

BOW_cv_train: (40000, 17053)
BOW_cv_test: (10000, 17053)
Vocab:  ['010', '10', '10 10', '10 becaus', '10 line']
Vocab Length:  17053


<1x17053 sparse matrix of type '<class 'numpy.int64'>'
	with 143 stored elements in Compressed Sparse Row format>

**Term Frequency-Inverse Document Frequency model (TFIDF)**

It converts text documents into a matrix of tfidf features. For each review, it counts how often a word appears in that review (the term frequency) and multiplies this count by the inverse document frequency, which down-weights words that appear in many documents. The inverse document frequency is based on the ratio: number of documents / number of documents containing the word.

> The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

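tf(t, d) = number of occurrences of term t in document d / maximum number of occurrences of term t over all documents (scikit-learn itself uses the raw count as tf)

idf(t) = number of documents / number of documents containing term t (scikit-learn additionally takes the logarithm and adds 1, as in the quote above)

tf-idf(t, d) = tf(t, d) * idf(t)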

min_df: float x: Means that an n_gram has to appear in at least a fraction x of the documents (0.001 corresponds to 0.1%)
max_df: float x: Means that an n_gram may appear in at most a fraction x of the documents (0.999 corresponds to 99.9%)
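As a small worked check of the quoted formula, the sketch below recomputes one tf-idf value by hand and compares it with `TfidfVectorizer`. It sets `smooth_idf=False` and `norm=None` so the output matches idf(t) = log [ n / df(t) ] + 1 directly. The cells in this notebook keep the defaults (`smooth_idf=True`, `norm='l2'`), so their actual values differ. The toy corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good movie", "bad movie", "good good plot"]

# No smoothing and no row normalization, so values equal tf * (log(n / df) + 1).
toy_tv = TfidfVectorizer(smooth_idf=False, norm=None)
toy_tfidf = toy_tv.fit_transform(toy_docs).toarray()

n_docs = len(toy_docs)
col = toy_tv.vocabulary_["good"]
df_good = 2  # "good" occurs in the first and the third document
idf_good = np.log(n_docs / df_good) + 1

# The third document contains "good" twice, so tf = 2.
manual_value = 2 * idf_good
print(toy_tfidf[2, col], manual_value)
```

The two printed numbers agree, which confirms that the term frequency here is the raw count and that the logarithm is the natural logarithm.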

In [25]:
#Tfidf vectorizer
tv=TfidfVectorizer(ngram_range=(1,3), min_df=0., max_df=1.)
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train.review)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test.review)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)
print('Vocab length:',len(tv.vocabulary_))

Tfidf_train: (40000, 6983231)
Tfidf_test: (10000, 6983231)
Vocab length: 6983231


In [17]:
#Tfidf vectorizer
tv=TfidfVectorizer(ngram_range=(1,3), min_df=0.001, max_df=0.999)
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train.review)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test.review)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)
print('Vocab length:',len(tv.vocabulary_))

Tfidf_train: (40000, 17053)
Tfidf_test: (10000, 17053)
Vocab length: 17053


**Split and binarize the sentiment data**

In [18]:
#Binarizing the sentiment labels of the train and test split
train_sentiments = binarize_sentiment(norm_train.sentiment)
test_sentiments = binarize_sentiment(norm_test.sentiment)
print(train_sentiments.unique())
print(test_sentiments.unique())

[1 0]
[0 1]
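The helper `binarize_sentiment` lives in the local `utils` module and is not shown. The following is only a hypothetical stand-in that would produce labels of this shape, assuming positive maps to 1 and negative to 0 (which is consistent with the order of the classes in the reports further down, but not confirmed by the source).

```python
import pandas as pd

def binarize_sentiment_sketch(sentiment: pd.Series) -> pd.Series:
    """Hypothetical stand-in for utils.binarize_sentiment: positive -> 1, negative -> 0."""
    return sentiment.map({"positive": 1, "negative": 0})

toy_labels = pd.Series(["positive", "negative", "negative", "positive"])
binary_labels = binarize_sentiment_sketch(toy_labels)
print(binary_labels.tolist())
```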


**Modelling the dataset**

To compare different parameters for the model and for the text vectorizer, we can create a function that does all the vectorizing and modelling. This allows us to quickly compare and evaluate results.

In [51]:
def vectorize_train_validate_evaluate(word_vectorizer, n_gram, analyzer, min_df, max_df, model, parameters, train_x, train_y, test_x, test_y):
    """Vectorize the text, tune the model with a grid search and report train/test metrics.

    word_vectorizer is a vectorizer class (e.g. TfidfVectorizer), not an instance.
    """
    word_vectorizer = word_vectorizer(ngram_range=n_gram, analyzer=analyzer, min_df=min_df, max_df=max_df)
    # Learn the vocabulary on the training reviews only, then reuse it for the test reviews
    train_x = word_vectorizer.fit_transform(train_x)
    test_x = word_vectorizer.transform(test_x)
    print('Length of Vocabulary after vectorizing the corpus:',len(word_vectorizer.vocabulary_))
    # Grid search with 5-fold cross-validation to find the best hyperparameters
    grid_search = GridSearchCV(estimator = model,
                            param_grid = parameters, 
                            scoring = 'accuracy',
                            cv = 5,
                            n_jobs = -1, 
                            verbose = 2)
    grid_search.fit(train_x,train_y)
    print(f'Best Score: {grid_search.best_score_}. Best Params: {grid_search.best_params_}')

    # Fit the best estimator on the full training set
    # (GridSearchCV already does this with its default refit=True, so this is redundant but harmless)
    grid_search.best_estimator_.fit(train_x,train_y)
    print(grid_search.best_estimator_)

    # Predict on the test set
    y_pred=grid_search.best_estimator_.predict(test_x)
    print(y_pred)

    # Predict on the training set
    y_pred_train=grid_search.best_estimator_.predict(train_x)
    print(y_pred_train)

    # evaluate returns the accuracy at index 0 and the classification report at index 1
    test_metrics = evaluate(test_y,y_pred)
    train_metrics = evaluate(train_y,y_pred_train)
    # Accuracy scores
    print("Accuracy test:",test_metrics[0])
    print("Accuracy train:",train_metrics[0])
    # Classification reports (printed under the same "Accuracy" labels as the scores)
    print("Accuracy test:",test_metrics[1])
    print("Accuracy train:",train_metrics[1])
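One design point worth noting: the function above fits the vectorizer on the whole training set before the grid search splits it into folds, so the vocabulary and document frequencies have already seen each validation fold. A possible alternative, sketched below on a made-up toy corpus, is to wrap vectorizer and model in a scikit-learn `Pipeline` so that `GridSearchCV` refits the vectorizer inside every fold and can also tune vectorizer settings. All data and names in this sketch are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical data) so that the sketch runs within seconds.
texts = ["great film", "loved it", "wonderful acting", "great fun",
         "terrible film", "hated it", "awful acting", "terrible bore"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=500)),
])

# Step-prefixed names let one grid cover vectorizer and model settings together.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1],
}

search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

The trade-off is runtime: the vectorizer is refit for every fold and every candidate, which is noticeably slower on 40,000 reviews than vectorizing once up front.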

In [62]:
#training the model
lr=LogisticRegression()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17053
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.8937250000000001. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LogisticRegression(C=1, max_iter=500)
[0 0 0 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8986
Accuracy train: 0.929425
Accuracy test:               precision    recall  f1-score   support

    Negative       0.90      0.90      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.94      0.92      0.93     20007
    Positive       0.92      0.94      0.93     19993

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93 

**Changing the ngram-range**

As seen below, training speeds up considerably when the range of ngrams is restricted to trigrams only, (3,3); the vocabulary shrinks to 789 entries. However, the accuracy on the test set also drops from about 90% to about 66%, which is below our baseline.

In [53]:
#training the model
lr=LogisticRegression()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(3,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(3,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=lr, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 789
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.653425. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LogisticRegression(C=1, max_iter=500)
[1 0 0 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.6569
Accuracy train: 0.670275
Accuracy test:               precision    recall  f1-score   support

    Negative       0.73      0.50      0.59      4993
    Positive       0.62      0.81      0.70      5007

    accuracy                           0.66     10000
   macro avg       0.67      0.66      0.65     10000
weighted avg       0.67      0.66      0.65     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.75      0.52      0.61     20007
    Positive       0.63      0.82      0.71     19993

    accuracy                           0.67     40000
   macro avg       0.69      0.67      0.66     40000
weighted avg       0.69      0.67      0.66     40000


**Using chars as ngram**

We can see that the vocabulary size drops from about 17,000 to about 7,900, even though we use the same n_gram range. The quality on the test set also drops by 2.7%.
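To see what the two analyzers actually emit, the sketch below calls `build_analyzer()` on a short piece of stemmed text taken from the sample review. The variable names are made up for illustration. On a large corpus, character n_grams of length up to 3 are bounded by the alphabet, whereas the number of distinct word n_grams keeps growing with the text, which is one plausible reason for the smaller vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that turns one document into a list of features.
word_analyzer = CountVectorizer(analyzer="word", ngram_range=(1, 2)).build_analyzer()
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(1, 3)).build_analyzer()

word_features = word_analyzer("oz wa brutal")
char_features = char_analyzer("oz wa")

print(word_features)
print(char_features)
```

Note that character n_grams include the space, so they can span word boundaries (for example "z w").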

**Linear Support Vector Classifier**

Hyperparameter tuning and training are very quick. Also, the quality on the test set is similar for both bow and tfidf, at around 89.9%. Thus, we will try different combinations of parameters.

In [58]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17053
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.894725. Best Params: {'C': 0.1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.1, max_iter=500)
[0 0 1 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.899
Accuracy train: 0.931325
Accuracy test:               precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.94      0.92      0.93     20007
    Positive       0.92      0.94      0.93     19993

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93     40000


In [65]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.1, max_df=0.9, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.1, max_df=0.9, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 153
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.7661250000000001. Best Params: {'C': 0.01, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.01, max_iter=500)
[0 0 0 ... 1 0 0]
[1 1 1 ... 1 1 0]
Accuracy test: 0.7632
Accuracy train: 0.767975
Accuracy test:               precision    recall  f1-score   support

    Negative       0.77      0.75      0.76      4993
    Positive       0.76      0.77      0.77      5007

    accuracy                           0.76     10000
   macro avg       0.76      0.76      0.76     10000
weighted avg       0.76      0.76      0.76     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.77      0.76      0.77     20007
    Positive       0.76      0.77      0.77     19993

    accuracy                           0.77     40000
   macro avg       0.77      0.77      0.77     40000
weighted avg       0.77      0.77      0.77     4

We can see that if we drop words that appear in less than 10% of the documents and those that appear in more than 90% of the documents, we are left with a vocabulary size of 153. This leads to a poor test accuracy of 76.3%.
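The effect of the threshold on the vocabulary size can be reproduced on a toy corpus. The sketch below (corpus and names made up for illustration) sweeps `min_df` and records how many words survive each setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good movie", "bad movie", "good plot", "bad plot twist", "good ending"]

vocab_sizes = {}
for min_df in (0.0, 0.3, 0.5):
    # A float min_df is a fraction of documents: a word needs df >= min_df * n_docs.
    vec = CountVectorizer(min_df=min_df).fit(toy_docs)
    vocab_sizes[min_df] = len(vec.vocabulary_)
    print(min_df, sorted(vec.vocabulary_))

print(vocab_sizes)
```

Raising the threshold first removes the words that occur only once ("twist", "ending") and then everything except the most frequent word, mirroring how a high `min_df` discards most of the informative but rarer vocabulary.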

In [63]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,4), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,4), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17108
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.89495. Best Params: {'C': 0.1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.1, max_iter=500)
[0 0 1 ... 1 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8991
Accuracy train: 0.93145
Accuracy test:               precision    recall  f1-score   support

    Negative       0.90      0.89      0.90      4993
    Positive       0.90      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.94      0.92      0.93     20007
    Positive       0.92      0.94      0.93     19993

    accuracy                           0.93     40000
   macro avg       0.93      0.93      0.93     40000
weighted avg       0.93      0.93      0.93     40000


Increasing the ngram-range of the word vectorizer from (1,3) to (1,4) raises the test accuracy only marginally, from 89.90% to 89.91%, and grows the vocabulary from 17053 to 17108.

In [59]:
#training the model
svc=LinearSVC()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'C': [0.01,0.1,1], 'penalty': ['l2'], 'max_iter':[500]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=svc, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 7890
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.8679. Best Params: {'C': 1, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=1, max_iter=500)
[0 0 1 ... 1 1 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8732
Accuracy train: 0.9112
Accuracy test:               precision    recall  f1-score   support

    Negative       0.88      0.87      0.87      4993
    Positive       0.87      0.88      0.87      5007

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.92      0.91      0.91     20007
    Positive       0.91      0.92      0.91     19993

    accuracy                           0.91     40000
   macro avg       0.91      0.91      0.91     40000
weighted avg       0.91      0.91      0.91     40000

Length of Voc

Best Score: 0.8568. Best Params: {'C': 0.01, 'max_iter': 500, 'penalty': 'l2'}
LinearSVC(C=0.01, max_iter=500)
[0 0 1 ... 1 1 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8647
Accuracy train: 0.922575
Accuracy test:               precision    recall  f1-score   support

    Negative       0.86      0.87      0.87      4993
    Positive       0.87      0.86      0.86      5007

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.92      0.92      0.92     20007
    Positive       0.92      0.92      0.92     19993

    accuracy                           0.92     40000
   macro avg       0.92      0.92      0.92     40000
weighted avg       0.92      0.92      0.92     40000



Using char as the analyzer, we get a vocabulary of 7890. However, this does not decrease the training time, because the model does not converge within 500 iterations.

**Multinomial Naive Bayes for bag of words and tfidf features**

In [61]:
#training the model
mnb=MultinomialNB()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'alpha': [0.01,0.1,1]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='word', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 17053
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.872475. Best Params: {'alpha': 0.01}
MultinomialNB(alpha=0.01)
[0 0 0 ... 0 0 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8739
Accuracy train: 0.891575
Accuracy test:               precision    recall  f1-score   support

    Negative       0.88      0.86      0.87      4993
    Positive       0.87      0.89      0.88      5007

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.91      0.87      0.89     20007
    Positive       0.88      0.91      0.89     19993

    accuracy                           0.89     40000
   macro avg       0.89      0.89      0.89     40000
weighted avg       0.89      0.89      0.89     40000


In [60]:
#training the model
mnb=MultinomialNB()
tv=TfidfVectorizer
cv=CountVectorizer
# Gridsearch to find the best values:
parameters = [{'alpha': [0.01,0.1,1]}]
# Tfidf
vectorize_train_validate_evaluate(tv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

# Count vectorizer
vectorize_train_validate_evaluate(cv, n_gram=(1,3), analyzer='char', min_df=0.001, max_df=0.999, 
                                    model=mnb, parameters=parameters, train_x=norm_train.review, train_y=train_sentiments, 
                                    test_x=norm_test.review, test_y=test_sentiments)

Length of Vocabulary after vectorizing the corpus: 7890
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Best Score: 0.815475. Best Params: {'alpha': 1}
MultinomialNB(alpha=1)
[0 0 0 ... 1 1 0]
[1 1 1 ... 1 0 0]
Accuracy test: 0.8142
Accuracy train: 0.82485
Accuracy test:               precision    recall  f1-score   support

    Negative       0.82      0.80      0.81      4993
    Positive       0.81      0.83      0.82      5007

    accuracy                           0.81     10000
   macro avg       0.81      0.81      0.81     10000
weighted avg       0.81      0.81      0.81     10000

Accuracy train:               precision    recall  f1-score   support

    Negative       0.83      0.81      0.82     20007
    Positive       0.82      0.84      0.83     19993

    accuracy                           0.82     40000
   macro avg       0.83      0.82      0.82     40000
weighted avg       0.83      0.82      0.82     40000

Length of Vocabulary after vectorizing the cor