<a href="https://colab.research.google.com/github/MohammedHamood/IMDBReviews/blob/main/IMDBReviews_AB_LR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMDB Reviews - AdaBoost with Logistic Regression

## Data Pre-Processing

In [None]:
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
import preprocessingNLP as PNLP
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA, NMF, TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import Normalizer

# Import Dataset
print("Downloading Dataset ...")
!wget -nv "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!tar -xf aclImdb_v1.tar.gz
IMDB_train = load_files('aclImdb/train/', categories=("pos", "neg"), encoding='utf-8')
IMDB_test = load_files('aclImdb/test/', categories=('pos', 'neg'), encoding='utf-8')
print("Dataset Downloaded")

# Preprocessing
print("PREPROCESSING ...")
IMDB_train.data = PNLP.customNLP(IMDB_train.data)
IMDB_test.data = PNLP.customNLP(IMDB_test.data)
IMDB_train.data, IMDB_train.target = PNLP.removeEmptyInstances(IMDB_train.data, IMDB_train.target)
print("PREPROCESSING DONE!")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Downloading Dataset ...
2020-03-10 17:15:14 URL:http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz [84125825/84125825] -> "aclImdb_v1.tar.gz" [1]
Dataset Downloaded
PREPROCESSING ...
PREPROCESSING DONE!


## Setting Hyper-Parameters

In this section, RandomizedSearchCV estimates the validation accuracy of AdaBoost for different combinations of the text-vectorization hyper-parameters in the validation pipeline: *ngram_range* for CountVectorizer and *use_idf* for TfidfTransformer. AdaBoostClassifier keeps its default settings except for *base_estimator*, which is set to a LogisticRegression configured with the parameters found to be optimal for it.
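To make the two searched parameters concrete, here is a minimal sketch on a hypothetical two-review corpus (the `docs` list is invented for illustration and is not part of the IMDB data). It shows how *ngram_range* changes the feature set and how *use_idf* changes the relative weight of a word that appears in every document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical two-review corpus, used only for illustration.
docs = ["great movie", "dull movie"]

# ngram_range controls which word n-grams become features.
unigram_vocab = sorted(CountVectorizer(ngram_range=(1, 1)).fit(docs).vocabulary_)
bigram_vocab = sorted(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)
print(unigram_vocab)
print(bigram_vocab)

# use_idf decides whether terms found in many documents are down-weighted.
vect = CountVectorizer()
counts = vect.fit_transform(docs)
great, movie = vect.vocabulary_["great"], vect.vocabulary_["movie"]

weights = {}
for use_idf in (False, True):
    tfidf = TfidfTransformer(use_idf=use_idf).fit_transform(counts).toarray()
    # Row 0 is "great movie": compare the rarer word with the common one.
    weights[use_idf] = (tfidf[0, great], tfidf[0, movie])
    print("use_idf={0}: great={1:.3f}, movie={2:.3f}".format(use_idf, *weights[use_idf]))
```

With *use_idf=False* both words in the first review get the same weight. With *use_idf=True* the word "movie", which occurs in both reviews, is weighted lower than "great".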

In [None]:
# Define parameters
parameters = {
     'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
     'tfidf__use_idf': (True, False),
}

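# Create a pipeline
# Note: from scikit-learn 1.2 on, base_estimator is named estimator and penalty='none'
# is written penalty=None (the old spellings were removed in 1.4)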
pip = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('Norm', Normalizer(copy=False)),
                ('clf', AdaBoostClassifier(base_estimator=LogisticRegression(
                    max_iter=300, penalty='none', solver='newton-cg', tol=0.0001, n_jobs=-1)))])

# Initialize RandomizedSearchCV
n_iter_search = 6
cv_folds = 3
Ada_LR_rand_search = RandomizedSearchCV(pip, param_distributions=parameters, 
                               n_iter=n_iter_search, cv=cv_folds)

# Utility function to report best scores
def report(results, n_top=10):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(results['mean_test_score'][candidate],
                          results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("Mean Fit Time: %.3f seconds" %results['mean_fit_time'][candidate])
            print("")

# Execute RandomizedSearchCV and print best results
Ada_LR_rand_search.fit(IMDB_train.data, IMDB_train.target)
report(Ada_LR_rand_search.cv_results_)

Model with rank: 1
Mean validation score: 0.890 (std: 0.003)
Parameters: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True}
Mean Fit Time: 10.873 seconds

Model with rank: 2
Mean validation score: 0.887 (std: 0.003)
Parameters: {'vect__ngram_range': (1, 3), 'tfidf__use_idf': True}
Mean Fit Time: 24.118 seconds

Model with rank: 3
Mean validation score: 0.882 (std: 0.002)
Parameters: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': False}
Mean Fit Time: 12.105 seconds

Model with rank: 4
Mean validation score: 0.882 (std: 0.002)
Parameters: {'vect__ngram_range': (1, 3), 'tfidf__use_idf': False}
Mean Fit Time: 26.471 seconds

Model with rank: 5
Mean validation score: 0.870 (std: 0.002)
Parameters: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True}
Mean Fit Time: 2.174 seconds

Model with rank: 6
Mean validation score: 0.855 (std: 0.002)
Parameters: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False}
Mean Fit Time: 2.555 seconds



These results give a first indication of suitable settings for CountVectorizer and TfidfTransformer. The rankings returned by RandomizedSearchCV can change from run to run, but *use_idf=True* scores higher than *use_idf=False* for every *ngram_range* tested. This suggests that down-weighting words that occur in many documents helps the classifier. Adding bigrams also helps: *ngram_range=(1, 2)* achieves the highest mean validation accuracy (0.890), while adding trigrams with *(1, 3)* roughly doubles the fit time without improving the score. Therefore, *ngram_range=(1, 2)* will be used for CountVectorizer.
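The search space above is small enough to enumerate. The sketch below assumes scikit-learn's `ParameterGrid` and `ParameterSampler` helpers and an arbitrary `random_state`, neither of which appears in the notebook. It counts the combinations and checks how many distinct ones six draws return. According to the scikit-learn documentation, sampling is done without replacement when every value is given as a list, so six iterations should visit each combination once, although the order of the draws can differ between runs.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the cell above, written with lists.
param_space = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "tfidf__use_idf": [True, False],
}

grid_size = len(ParameterGrid(param_space))

# random_state=0 is an arbitrary choice made for this illustration.
sampled = list(ParameterSampler(param_space, n_iter=6, random_state=0))
distinct = {(p["vect__ngram_range"], p["tfidf__use_idf"]) for p in sampled}

print("Combinations in the grid:", grid_size)
print("Distinct combinations sampled:", len(distinct))
```

For a space this small, GridSearchCV would evaluate the same candidates. A randomized search becomes more useful when the space is too large to cover exhaustively.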

Next, GridSearchCV is used to tune the hyper-parameters of AdaBoostClassifier itself. Only the parameters most likely to affect validation accuracy are evaluated, by trying every combination of the listed values for *n_estimators* and *learning_rate*.

In [None]:
# Set relevant parameters
parameters = {
    'n_estimators': (30, 50, 100, 200),
    'learning_rate': (0.01, 0.1, 1)
}

# Create a pipeline
pip = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                ('tfidf', TfidfTransformer(use_idf=True)),
                ('Norm', Normalizer(copy=False)),
                ('clf', GridSearchCV(AdaBoostClassifier(
                    base_estimator=LogisticRegression(
                        max_iter=300, penalty='none', solver='newton-cg', tol=0.0001, n_jobs=-1)),
                        parameters, cv=cv_folds, n_jobs=-1))])

# Execute pipeline
pip.fit(IMDB_train.data, IMDB_train.target)

# Collect and print results
# (mean_test_score is the mean cross-validated score, i.e. the validation accuracy)
val_accuracies = pip['clf'].cv_results_['mean_test_score']
fit_times = pip['clf'].cv_results_['mean_fit_time']
param_sets = pip['clf'].cv_results_['params']

for params, fit_time, accuracy in zip(param_sets, fit_times, val_accuracies):
    print("Parameter: {0}".format(params))
    print("Training Time: {0:.3f} seconds".format(fit_time))
    print("Validation Accuracy: {0:.3f}".format(accuracy))
    print("")

print("Validation Accuracy: " + str(pip['clf'].best_score_))
print("Optimal Parameters: " + str(pip['clf'].best_params_))

Validation Accuracy: 0.8907998316572933
Optimal Parameters: {'learning_rate': 0.01, 'n_estimators': 30}

Parameter: {'learning_rate': 0.01, 'n_estimators': 30}
Training Time: 4.697 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.01, 'n_estimators': 50}
Training Time: 4.582 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.01, 'n_estimators': 100}
Training Time: 4.550 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.01, 'n_estimators': 200}
Training Time: 4.549 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.1, 'n_estimators': 30}
Training Time: 4.587 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.1, 'n_estimators': 50}
Training Time: 4.551 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.1, 'n_estimators': 100}
Training Time: 4.579 seconds
Validation Accuracy: 0.891

Parameter: {'learning_rate': 0.1, 'n_estimators': 200}
Training Time: 4.620 seconds
Validation Accuracy: 0.891

Paramet

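Varying *learning_rate* and *n_estimators* has no measurable effect on the validation set accuracy: every combination shown scores 0.891, and the mean fit time stays at about 4.6 seconds even when *n_estimators* grows from 30 to 200. One possible explanation is that boosting stops after very few rounds, since scikit-learn's AdaBoost ends early once a base estimator fits the weighted training data without error, which an unpenalized logistic regression on high-dimensional TF-IDF features may well do. Therefore, the default values of both parameters will be used for calculating the test set accuracy.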
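One way to investigate such a flat grid is to count how many boosting rounds were actually fitted, which AdaBoostClassifier exposes through its `estimators_` attribute. The sketch below is an illustration on hypothetical, linearly separable toy data, not a measurement on the IMDB features. It also approximates the unpenalized setting with a very large `C`, which is an assumption made to keep the example portable across scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical, linearly separable toy data (not the IMDB features).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X[:, 0] += np.where(y == 1, 1.0, -1.0)  # open a margin between the classes

# Assumption: a very large C stands in for the unpenalized setting.
# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
base = LogisticRegression(C=1e6, max_iter=1000)
clf = AdaBoostClassifier(base, n_estimators=50).fit(X, y)

rounds_fitted = len(clf.estimators_)
print("Requested rounds:", clf.n_estimators)
print("Rounds actually fitted:", rounds_fitted)
print("Training accuracy:", clf.score(X, y))
```

If the same check on the fitted IMDB pipeline (for example `len(pip['clf'].estimators_)` after fitting the final pipeline) also reports far fewer rounds than requested, the ensemble behaves much like a single logistic regression. That would account for the identical scores.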

## Final Result

In [None]:
# Create a pipeline
pip = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                ('tfidf', TfidfTransformer(use_idf=True)),
                ('Norm', Normalizer(copy=False)),
                ('clf', AdaBoostClassifier(base_estimator=LogisticRegression(
                        max_iter=300, penalty='none', solver='newton-cg', tol=0.0001, n_jobs=-1)))])

# Evaluate validation set accuracy
start_time = time.time()
scores = cross_val_score(pip, IMDB_train.data, IMDB_train.target, cv=10)
valid_accuracy = np.mean(scores)
print("10-Cross Validation Runtime: %s seconds" % (time.time() - start_time))
print("Validation Set Accuracy: {0}".format(valid_accuracy))

# Fit the model
start_time = time.time()
pip.fit(IMDB_train.data, IMDB_train.target)
print("Training Runtime: %s seconds" % (time.time() - start_time))

# Get prediction on test set
start_time = time.time()
IMDB_pred = pip.predict(IMDB_test.data)
print("Prediction Runtime: %s seconds" % (time.time() - start_time))

# Compute test set accuracy
test_accuracy = np.mean(IMDB_pred==IMDB_test.target)
print("Test Set Accuracy: {0}".format(test_accuracy))

10-Cross Validation Runtime: 221.6829879283905 seconds
Validation Set Accuracy: 0.89612
Training Runtime: 22.44027042388916 seconds
Prediction Runtime: 7.357857704162598 seconds
Test Set Accuracy: 0.88464
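To put the gap between the two reported accuracies in context, the following sketch does the arithmetic. It assumes the standard aclImdb test split of 25,000 reviews, a size the notebook does not print. It also treats test predictions as independent trials, which is a simplification.

```python
import math

# Figures reported in the output above.
valid_accuracy = 0.89612
test_accuracy = 0.88464

# Assumption: the standard aclImdb test split of 25,000 reviews.
n_test = 25_000

n_correct = round(test_accuracy * n_test)
gap = valid_accuracy - test_accuracy

# Binomial standard error of an accuracy estimated from n_test samples.
std_err = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

print("Correct test predictions:", n_correct)
print("Validation minus test accuracy: {0:.5f}".format(gap))
print("Approx. standard error of test accuracy: {0:.4f}".format(std_err))
```

Under these assumptions the gap of roughly 0.011 is several standard errors wide, so it is unlikely to be sampling noise alone. It suggests that the cross-validated estimate is somewhat optimistic for the held-out test reviews.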
