# Sentiment Analysis on IMDB movie reviews

# Main steps in the code
1. Load `labeledTrainData.tsv` from the `data` folder into a dataframe `train`.
2. Build a function to clean the reviews in the input file: `review_cleaner(train['review'], lemmatize, stem)`.
3. Build a function to train the sentiment prediction models: `train_predict_sentiment(cleaned_reviews, y=train["sentiment"], ngram=1, max_features=1000)`.
4. Train models on unigrams of the reviews without lemmatizing or stemming.
5. Train models on unigrams and bigrams of the reviews with lemmatizing and stemming, then compare the performance.

In [1]:
# make compatible with Python 2 and Python 3
# (__future__ imports must come before any other statement in the cell)
from __future__ import print_function, division, absolute_import

# Remove warnings
import warnings

warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
%matplotlib inline


## Data set

The labeled training data set consists of 25,000 IMDB movie reviews. There is also an unlabeled test set with 25,000 IMDB movie reviews. Sentiment is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1 (reviews rated 5 or 6 are not included in the analysis). No individual movie has more than 30 reviews.
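
The labeling rule described above can be sketched as a small function. This is only an illustration of the stated thresholds; the name `rating_to_sentiment` is hypothetical and the data set already ships with the labels applied.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating to the binary sentiment label described above.

    Hypothetical helper: ratings below 5 map to 0, ratings of 7 or more
    map to 1, and anything else (5 or 6) is excluded, signalled by None.
    """
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None  # neutral ratings are not part of the data set


for rating in (3, 5, 8):
    print(rating, "->", rating_to_sentiment(rating))
```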

## File description

* **labeledTrainData** - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. 

* **testData** - The unlabeled test set. 25,000 rows containing an id, and text for each review. 

## Data columns
* **id** - Unique ID of each review
* **sentiment** - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* **review** - Text of the review


## 1. Data set statistics


In [3]:
import numpy as np
import pandas as pd

train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
# train.shape should be (25000,3)

In [4]:
train.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
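
The quote characters around each `id` and `review` above come from `quoting=3`, which is the value of `csv.QUOTE_NONE` and tells pandas to treat quotes as ordinary characters. The sketch below shows this on a tiny in-memory file and also checks the class balance; the three rows are made up for illustration and are not taken from the IMDB data.

```python
import csv
import io

import pandas as pd

# Hypothetical three-row sample in the same tab-delimited layout
sample_tsv = (
    "id\tsentiment\treview\n"
    '"1_8"\t1\t"A fine film."\n'
    '"2_3"\t0\t"Not my thing."\n'
    '"3_9"\t1\t"Loved it."\n'
)

sample = pd.read_csv(
    io.StringIO(sample_tsv), header=0, delimiter="\t", quoting=csv.QUOTE_NONE
)

print(sample.shape)                        # rows, columns
print(sample["sentiment"].value_counts())  # reviews per class
print(sample.loc[0, "id"])                 # quotes are kept as part of the value
```

On the real file, the same `value_counts()` call on `train["sentiment"]` gives the number of positive and negative reviews.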


In [6]:
# import packages

import bs4 as bs
import nltk

# nltk.download('all')
from nltk.tokenize import sent_tokenize  # tokenizes sentences
import re

from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")

eng_stopwords = stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anhnguyen/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


<div id='sec3'></div>
## 2. Preparing the data set for classification



Create a function called `review_cleaner` that reads in a review and:

- Removes HTML tags (using BeautifulSoup)
- **Extracts emoticons (emotion symbols, aka smileys :D )**
- Removes non-letters (using a regular expression)
- Converts all words to lowercase letters and tokenizes them (using the `.split()` method on the review strings, so that every word in the review is an element in a list)
- Removes all the English stopwords from the list of movie review words
- Joins the words back into one string separated by spaces and appends the emoticons to the end

(Transform the list of stopwords to a set before removing the stopwords, and use the set to look up stopwords, since membership tests on a set are much faster than scans over a list.)
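
The steps above can be sketched on a single string using only the standard library. This is a simplified illustration, not the notebook's function: it skips the HTML removal step, and `TOY_STOPWORDS` and `clean_one` are hypothetical names, with the small set standing in for NLTK's English stopword list.

```python
import re

# Same emoticon pattern as in the cleaning function: eyes, optional nose, mouth
EMOTICON_RE = re.compile(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)")

# Hypothetical stand-in for set(stopwords.words("english"))
TOY_STOPWORDS = {"but", "the", "was", "i", "it"}


def clean_one(text):
    """Extract emoticons, keep letters only, drop stopwords, rejoin."""
    emoticons = EMOTICON_RE.findall(text)  # must run before punctuation is removed
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    kept = [word for word in words if word not in TOY_STOPWORDS]
    return " ".join(kept + emoticons)


print(clean_one("Great movie :) but the ending was sad :-( I loved it ;)"))
# great movie ending sad loved :) :-( ;)
```

The emoticons have to be collected before the non-letter characters are replaced, otherwise they would be erased along with the rest of the punctuation.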

In [7]:
ps = PorterStemmer()
wnl = WordNetLemmatizer()  # needs the WordNet corpus: nltk.download("wordnet")

# Build the stopword set once; looking up a word in a set is fast
eng_stopwords = set(stopwords.words("english"))

# Matches simple emoticons such as :) ;-( =D :P
EMOTICON_RE = re.compile(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)")


def review_cleaner(reviews, lemmatize=True, stem=False):
    """
    Clean and preprocess an iterable of reviews.

    1. Remove HTML tags
    2. Extract emoticons
    3. Use regex to remove all non-letter characters
    4. Make strings lower case and tokenize / word split reviews
    5. Remove English stopwords, then optionally lemmatize or stem
       (lemmatizing takes precedence if both flags are set)
    6. Rejoin to one string, with the emoticons appended

    Returns a list of cleaned review strings.
    """
    cleaned_reviews = []
    # Iterate over the argument, not the global `train` dataframe
    for i, review in enumerate(reviews):
        # print progress
        if (i + 1) % 500 == 0:
            print("Done with %d reviews" % (i + 1))

        # 1. Remove HTML tags (explicit parser avoids a BeautifulSoup warning)
        review = bs.BeautifulSoup(review, "html.parser").text

        # 2. Use regex to find emoticons
        emoticons = EMOTICON_RE.findall(review)

        # 3. Remove punctuation and other non-letters
        review = re.sub("[^a-zA-Z]", " ", review)

        # 4. Tokenize into words (all lower case)
        review = review.lower().split()

        # 5. Remove stopwords, then lemmatize or stem the remaining words
        clean_review = []
        for word in review:
            if word not in eng_stopwords:
                if lemmatize:
                    word = wnl.lemmatize(word)
                elif stem:
                    # Skip "oed": presumably a workaround for a Porter
                    # stemmer crash on this token in some NLTK releases
                    if word == "oed":
                        continue
                    word = ps.stem(word)
                clean_review.append(word)

        # 6. Join the review to one sentence and append the emoticons
        review_processed = " ".join(clean_review + emoticons)
        cleaned_reviews.append(review_processed)

    return cleaned_reviews

## 3. Function to train and validate a sentiment analysis model using Random Forest Classifier

In [8]:
from sklearn.ensemble import RandomForestClassifier

# CountVectorizer can actually handle a lot of the preprocessing for us
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics  # for confusion matrix, accuracy score etc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix


np.random.seed(0)


def train_predict_sentiment(
    cleaned_reviews, y=train["sentiment"], ngram=1, max_features=1000
):
    print("Creating the bag of words model!\n")
    # CountVectorizer is scikit-learn's bag of words tool; here we spell out more keywords
    vectorizer = CountVectorizer(
        ngram_range=(1, ngram),
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=max_features,
    )

    X_train, X_test, y_train, y_test = train_test_split(
        cleaned_reviews, y, random_state=0, test_size=0.2
    )

    train_bag = vectorizer.fit_transform(X_train).toarray()
    test_bag = vectorizer.transform(X_test).toarray()
    #     print('TOP 20 FEATURES ARE: ',(vectorizer.get_feature_names()[:20]))

    print("Training the random forest classifier!\n")
    # Initialize a Random Forest classifier with 50 trees
    forest = RandomForestClassifier(n_estimators=50)

    # Fit the forest to the training set, using the bag of words as
    # features and the sentiment labels as the target variable
    forest = forest.fit(train_bag, y_train)

    train_predictions = forest.predict(train_bag)
    test_predictions = forest.predict(test_bag)

    train_acc = metrics.accuracy_score(y_train, train_predictions)
    valid_acc = metrics.accuracy_score(y_test, test_predictions)
    print(
        " The training accuracy is: ",
        train_acc,
        "\n",
        "The validation accuracy is: ",
        valid_acc,
    )
    print()
    print("CONFUSION MATRIX:")
    print("         Predicted")
    print("          neg pos")
    print(" Actual")
    c = confusion_matrix(y_test, test_predictions)
    print("     neg  ", c[0])
    print("     pos  ", c[1])

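    # Extract feature importance
    print("\nTOP TEN IMPORTANT FEATURES:")
    importances = forest.feature_importances_
    indices = np.argsort(importances)[::-1]
    top_10 = indices[:10]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    print([feature_names[ind] for ind in top_10])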

## 4. Train and test the model on the IMDB data

In [9]:
# Here I use the original reviews without lemmatizing and stemming
original_clean_reviews = review_cleaner(train["review"], lemmatize=False, stem=False)
train_predict_sentiment(
    cleaned_reviews=original_clean_reviews,
    y=train["sentiment"],
    ngram=1,
    max_features=1000,
)

Done with 500 reviews
Done with 1000 reviews
Done with 1500 reviews
Done with 2000 reviews
Done with 2500 reviews
Done with 3000 reviews
Done with 3500 reviews
Done with 4000 reviews
Done with 4500 reviews
Done with 5000 reviews
Done with 5500 reviews
Done with 6000 reviews
Done with 6500 reviews
Done with 7000 reviews
Done with 7500 reviews
Done with 8000 reviews
Done with 8500 reviews
Done with 9000 reviews
Done with 9500 reviews
Done with 10000 reviews
Done with 10500 reviews
Done with 11000 reviews
Done with 11500 reviews
Done with 12000 reviews
Done with 12500 reviews
Done with 13000 reviews
Done with 13500 reviews
Done with 14000 reviews
Done with 14500 reviews
Done with 15000 reviews
Done with 15500 reviews
Done with 16000 reviews
Done with 16500 reviews
Done with 17000 reviews
Done with 17500 reviews
Done with 18000 reviews
Done with 18500 reviews
Done with 19000 reviews
Done with 19500 reviews
Done with 20000 reviews
Done with 20500 reviews
Done with 21000 reviews
Done with 21

## 5.1 UNIGRAM setting

In [11]:
# For original reviews with unigram and 1000 max_features:
original_clean_reviews = review_cleaner(train["review"], lemmatize=False, stem=False)
train_predict_sentiment(
    cleaned_reviews=original_clean_reviews,
    y=train["sentiment"],
    ngram=1,
    max_features=1000,
)

In [12]:
# For lemmatized reviews with unigram and 1000 max_features:
wnl_clean_reviews = review_cleaner(train["review"], lemmatize=True, stem=False)
train_predict_sentiment(
    cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000
)

In [13]:
# For stemmed reviews with unigram and 1000 max_features:
ps_clean_reviews = review_cleaner(train["review"], lemmatize=False, stem=True)
train_predict_sentiment(
    cleaned_reviews=ps_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000
)

### Results for original reviews with unigram and 1000 max_features:
`original_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=False)`

`train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000)`

Training accuracy: 1.0. Validation accuracy: 0.829.

### Results for lemmatized reviews with unigram and 1000 max_features:
`wnl_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)`

`train_predict_sentiment(cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000)`

Training accuracy: 0.99995. Validation accuracy: 0.8186.

### Results for stemmed reviews with unigram and 1000 max_features:
`ps_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=True)`

`train_predict_sentiment(cleaned_reviews=ps_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000)`

Training accuracy: 1.0. Validation accuracy: 0.825.

## 5.2 BIGRAM setting

In [14]:
# For original reviews with bigram and 1000 max_features:
original_clean_reviews = review_cleaner(train["review"], lemmatize=False, stem=False)
train_predict_sentiment(
    cleaned_reviews=original_clean_reviews,
    y=train["sentiment"],
    ngram=2,
    max_features=1000,
)

In [15]:
# For lemmatized reviews with bigram and 1000 max_features:
wnl_clean_reviews = review_cleaner(train["review"], lemmatize=True, stem=False)
train_predict_sentiment(
    cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=2, max_features=1000
)

In [16]:
# For stemmed reviews with bigram and 1000 max_features:
ps_clean_reviews = review_cleaner(train["review"], lemmatize=False, stem=True)
train_predict_sentiment(
    cleaned_reviews=ps_clean_reviews, y=train["sentiment"], ngram=2, max_features=1000
)

### Results for original reviews with bigram and 1000 max_features:
`original_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=False)`

`train_predict_sentiment(cleaned_reviews=original_clean_reviews, y=train["sentiment"], ngram=2, max_features=1000)`

Training accuracy: 0.99995. Validation accuracy: 0.8182.

### Results for lemmatized reviews with bigram and 1000 max_features:
`wnl_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)`

`train_predict_sentiment(cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=2, max_features=1000)`

Training accuracy: 0.99995. Validation accuracy: 0.8208.

### Results for stemmed reviews with bigram and 1000 max_features:
`ps_clean_reviews = review_cleaner(train['review'], lemmatize=False, stem=True)`

`train_predict_sentiment(cleaned_reviews=ps_clean_reviews, y=train["sentiment"], ngram=2, max_features=1000)`

Training accuracy: 0.99995. Validation accuracy: 0.8236.

## 5.3 UNIGRAM setting for lemmatized reviews with different maximum features

In [17]:
# For lemmatized reviews with unigram, and 10 max_features:
wnl_clean_reviews = review_cleaner(train["review"], lemmatize=True, stem=False)
train_predict_sentiment(
    cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=10
)

In [18]:
# For lemmatized reviews with unigram, and 100 max_features
wnl_clean_reviews = review_cleaner(train["review"], lemmatize=True, stem=False)
train_predict_sentiment(
    cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=100
)

In [19]:
# For lemmatized reviews with unigram, and 1000 max_features
wnl_clean_reviews = review_cleaner(train["review"], lemmatize=True, stem=False)
train_predict_sentiment(
    cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000
)

In [20]:
# For lemmatized reviews with unigram, and 5000 max_features
wnl_clean_reviews = review_cleaner(train["review"], lemmatize=True, stem=False)

train_predict_sentiment(
    cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=5000
)

### Results for lemmatized reviews with unigram and 10 max_features:
`wnl_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)`

`train_predict_sentiment(cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=10)`

Training accuracy: 0.87155. Validation accuracy: 0.5588.

### Results for lemmatized reviews with unigram and 100 max_features:
`wnl_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)`

`train_predict_sentiment(cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=100)`

Training accuracy: 0.9998. Validation accuracy: 0.7186.

### Results for lemmatized reviews with unigram and 1000 max_features:
`wnl_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)`

`train_predict_sentiment(cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=1000)`

Training accuracy: 1.0. Validation accuracy: 0.8214.

### Results for lemmatized reviews with unigram and 5000 max_features:
`wnl_clean_reviews = review_cleaner(train['review'], lemmatize=True, stem=False)`

`train_predict_sentiment(cleaned_reviews=wnl_clean_reviews, y=train["sentiment"], ngram=1, max_features=5000)`

Training accuracy: 1.0. Validation accuracy: 0.8374.

# SUMMARY

The results show the following:
- For unigrams, all three versions of the reviews give nearly identical training accuracy (1.0, 0.99995 and 1.0 for original, lemmatized and stemmed reviews), while the validation accuracy is highest for the original reviews: 0.829, compared with 0.8186 for lemmatized and 0.825 for stemmed reviews.
- For bigrams, the training accuracy is the same (0.99995) for all three versions of the reviews. The validation accuracy, however, improves slightly with lemmatizing and stemming, from 0.8182 to 0.8208 and 0.8236 respectively.
- For unigrams with lemmatized reviews, a larger number of max features gives better performance with the Random Forest Classifier. Going from 10 to 100 max features, the training accuracy rose by 14.7% in relative terms (from 0.87155 to 0.9998) and the validation accuracy rose by 28.6% (from 0.5588 to 0.7186). Going from 100 to 1000 max features, the validation accuracy rose by a further 14.3% (from 0.7186 to 0.8214).
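
The percentages quoted for the max-features comparison are relative increases, which can be reproduced from the reported validation accuracies. The snippet below is a small sketch of that arithmetic; the helper name `relative_increase` is hypothetical.

```python
def relative_increase(before, after):
    """Percentage increase of `after` relative to `before`."""
    return (after - before) / before * 100


# Validation accuracies reported in section 5.3 (lemmatized reviews, unigrams)
validation = {10: 0.5588, 100: 0.7186, 1000: 0.8214, 5000: 0.8374}

sizes = sorted(validation)
for small, large in zip(sizes, sizes[1:]):
    gain = relative_increase(validation[small], validation[large])
    print("%d -> %d max_features: +%.1f%%" % (small, large, gain))
```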

Hence, with unigrams, lemmatizing and stemming do not noticeably affect the results, while with bigrams they improve the classification performance slightly, though not significantly. Note that the same configuration (lemmatized reviews, unigrams, 1000 max features) scored a validation accuracy of 0.8186 in section 5.1 and 0.8214 in section 5.3, so differences of this size are within run-to-run variation. As for the number of max features, the larger it is, the better the classification models perform, especially in validation accuracy. However, the gain flattens beyond 1000 max features: going from 1000 to 5000 raises the validation accuracy only from 0.8214 to 0.8374 (1.9%).