## Final Project Day 2 Solution: Tree-based Models and Regression Models for a Classification Task

We continue working with the final project dataset to see how tree-based models (Decision Tree, Random Forest) and a regression-based model (Logistic Regression), tuned with hyperparameter search techniques (GridSearch, RandomizedSearch), perform at predicting the __isPositive__ field of the dataset.

1. Reading the dataset
2. Exploratory data analysis and missing value imputation
3. Stop word removal and stemming
4. Splitting the training dataset into training and validation
5. Computing Bag of Words features
6. Fitting a classifier (with hyperparameter tuning) and checking the model performance
    * Find more details on the __LogisticRegression__ here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    * Find more details on the __DecisionTreeClassifier__ here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
    * Find more details on the __RandomForestClassifier__ here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
    * Find more details on the __GridSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
    * Find more details on the __RandomizedSearchCV__ here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
7. Ideas for improvement

*Note: Incorporate everything you have learned over Day 1 and Day 2. Feel free to reuse your processed data from Day 1 to avoid redundant work (steps 1-5).*

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Rating of the review (1 for positive, 0 for negative)
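
As a small worked example of the __log_votes__ transformation log(1+votes), the sketch below applies it to a few vote counts. The counts 0, 2, and 13 are assumptions chosen for illustration; they happen to reproduce the __log_votes__ values visible in the sample rows of this notebook, but the raw vote counts are not part of the dataset.

```python
import numpy as np

# Hypothetical raw vote counts (not part of the dataset)
votes = np.array([0, 2, 13])

# log(1 + votes); log1p is numerically stable for small values
log_votes = np.log1p(votes)
print(np.round(log_votes, 6))

# The transformation can be inverted with expm1
recovered = np.expm1(log_votes)
print(np.round(recovered).astype(int))
```

Adding 1 before taking the logarithm keeps reviews with zero votes at a value of 0 instead of negative infinity.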


### 1. Reading the datasets

We will use the __pandas__ library to read our datasets.

In [1]:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
# SPDX-License-Identifier: MIT-0

import warnings
warnings.filterwarnings('ignore')

import pandas as pd

df_train = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')

Let's look at the first five rows of the dataset.

In [2]:
df_train.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


In [3]:
import pandas as pd

df_test = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

In [4]:
df_test.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,Kaspersky offers the best security for your co...,State of the art protection,True,1465516800,0.0,1.0
1,This Value was extremely discounted which I ap...,Quickbooks,True,1393632000,0.0,1.0
2,Some dufus probably got stock options by the t...,Sad,False,1228176000,2.639057,0.0
3,I have reviewed the software and it is beyond ...,Excellent product,True,1402531200,0.0,1.0
4,"Plain old simple you need Anti-Virus,I have tr...",A must have,True,1367539200,0.0,1.0


In [5]:
print('The shape of the training dataset is:', df_train.shape)
print('The shape of the test dataset is:', df_test.shape)

The shape of the training dataset is: (70000, 6)
The shape of the test dataset is: (8000, 6)


### 2. Exploratory data analysis and missing value imputation

Let's look at the target distribution for our datasets.

In [6]:
df_train["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

In [7]:
df_test["isPositive"].value_counts()

1.0    4980
0.0    3020
Name: isPositive, dtype: int64
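
To judge how imbalanced the classes are, it can help to turn these counts into proportions. The sketch below rebuilds the counts shown above as small Series objects (the names `train_counts` and `test_counts` are made up for this example); on the real data, `df_train["isPositive"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts copied from the outputs above
train_counts = pd.Series({1.0: 43692, 0.0: 26308})
test_counts = pd.Series({1.0: 4980, 0.0: 3020})

# Share of each class in the two datasets
train_share = train_counts / train_counts.sum()
test_share = test_counts / test_counts.sum()

print(train_share.round(4))
print(test_share.round(4))
```

Roughly 62% of the reviews are positive in both datasets, which is one reason the classifiers below use `class_weight='balanced'` and report the F1-score alongside accuracy.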

Let's check the number of missing values in each column:

In [8]:
print(df_train.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


In [9]:
print(df_test.isna().sum())

reviewText    2
summary       1
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


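Let's fill in a placeholder for the missing __reviewText__ values:

In [10]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify the DataFrame
df_train["reviewText"] = df_train["reviewText"].fillna("Missing")
df_test["reviewText"] = df_test["reviewText"].fillna("Missing")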
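
The __summary__ field also has a few missing values. It is not used by the models in this notebook, but if you decide to add it later, the same placeholder approach could be applied to both text columns at once. A minimal sketch on a made-up DataFrame (`df_toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing text fields
df_toy = pd.DataFrame({
    "reviewText": ["works fine", np.nan, "did not install"],
    "summary": [np.nan, "Two Stars", "Do not buy"],
})

# Fill several columns in one call by passing a column -> value mapping
df_toy = df_toy.fillna({"reviewText": "Missing", "summary": "Missing"})
print(df_toy)
```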

### 3. Stop word removal and stemming

We will apply the text processing methods discussed in class.

In [11]:
# Download the NLTK tokenizer models and the stop word list
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace (spaces, tabs, newlines)
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
    
    return final_text_list
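
To make the individual cleaning steps easier to follow, here is a simplified sketch of the same idea on a single made-up sentence. It avoids NLTK so that it runs without any downloads: `str.split()` stands in for `word_tokenize`, the stop word list is a tiny hypothetical one, and stemming is skipped, so the output will differ from what `process_text` produces.

```python
import re

def normalize(sent):
    """Lowercase, trim, collapse whitespace, and strip HTML tags."""
    sent = sent.lower().strip()
    sent = re.sub(r"\s+", " ", sent)
    sent = re.sub(r"<.*?>", "", sent)
    return sent

raw = "  This is <b>NOT</b>\n a   good product!<br />  "
cleaned = normalize(raw)
print(cleaned)

# Hypothetical miniature stop word list; "not" is excluded from it
# because negations carry sentiment
toy_stop_words = {"this", "is", "a", "not"}
keep = {"not"}
stop_words_toy = toy_stop_words - keep

tokens = [w for w in cleaned.split() if len(w) > 2 and w not in stop_words_toy]
print(tokens)
```

If "not" were left in the stop word list, the sentence would reduce to "good product!", which reverses its sentiment. This is the motivation for the `excluding` list above.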

In [13]:
print("Pre-processing training reviewText")
df_train["reviewText"] = process_text(df_train["reviewText"].tolist())

print("Pre-processing test reviewText")
df_test["reviewText"] = process_text(df_test["reviewText"].tolist())

Pre-processing training reviewText
Pre-processing test reviewText


### 4. Splitting the training dataset into training and validation

The scikit-learn library provides the __train_test_split()__ function for splitting datasets. In the example below, we keep 90% of the data for training and hold out the remaining 10% for validation.

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df_train["reviewText"].tolist(), 
                                                  df_train["isPositive"].tolist(), 
                                                  test_size=0.10, 
                                                  shuffle=True)
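
The split above is random and changes on every run. As an optional variation (not what this notebook uses), `train_test_split` also accepts `random_state` for reproducibility and `stratify` to preserve the class ratio in both parts. A minimal sketch on made-up data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 positive and 40 negative examples
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

X_tr, X_va, y_tr, y_va = train_test_split(
    texts,
    labels,
    test_size=0.10,
    shuffle=True,
    stratify=labels,   # keep the 60/40 class ratio in both splits
    random_state=42,   # make the split reproducible
)

print(len(X_tr), len(X_va))
print(sum(y_tr), sum(y_va))
```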

### 5. Computing Bag of Words features

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

X_test = df_test["reviewText"].tolist()
y_test = df_test["isPositive"].tolist()

# Initialize the binary count vectorizer
# (binary=True records whether a word is present, not how often it occurs)
count_vectorizer = CountVectorizer(binary=True,
                                   max_features=50  # Limit the vocabulary size
                                  )
# Fit on the training data only, then transform it
X_train_text_vectors = count_vectorizer.fit_transform(X_train)
# Only transform (reuse the vocabulary learned from the training data)
X_val_text_vectors = count_vectorizer.transform(X_val)
# Only transform
X_test_text_vectors = count_vectorizer.transform(X_test)
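
To see what `binary=True` and `max_features` do, consider the following sketch on three made-up documents. The documents and the vocabulary limit of 2 are assumptions chosen so that the result is easy to verify by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["good good product", "not good product", "bad product"]

# Keep only the 2 most frequent words; record presence (1) or absence (0)
toy_vectorizer = CountVectorizer(binary=True, max_features=2)
X_toy = toy_vectorizer.fit_transform(docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(X_toy.toarray())
```

Two things are worth noticing: "good" occurs twice in the first document but is encoded as 1, and the rarer words "not" and "bad" fall outside the vocabulary limit. A small vocabulary can therefore discard exactly the words that carry sentiment.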

### 6. Fitting a classifier (with hyperparameter tuning) and checking the model performance

#### 6.1 LogisticRegression 

Using the __LogisticRegression__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Using the __RandomizedSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
        

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score

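# The 'liblinear' solver supports both the 'l1' and 'l2' penalties
# (the default 'lbfgs' solver in recent scikit-learn versions does not support 'l1')
lrClassifier = LogisticRegression(class_weight = 'balanced', solver = 'liblinear')
parameters = {'penalty': ['l2', 'l1'],
              'C': [0.01, 0.02, 0.05]}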

# NOTE: RandomizedSearchCV uses by default the score function of the estimator to evaluate
# (r2_score for regression; accuracy_score for classification). If desired,
# other scoring functions can be specified via the 'scoring' parameter. 
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# NOTE: You can experiment with different cv numbers, default = 5
# NOTE: You can also experiment with different n_iter
# (number of parameter settings that are sampled by the RandomizedSearch), default = 10
lrClassifier_rand = RandomizedSearchCV(lrClassifier,
                                       parameters,
                                       cv=5,
                                       verbose=1,
                                       n_jobs=-1)
lrClassifier_rand.fit(X_train_text_vectors, y_train)

print("Best parameters: ", lrClassifier_rand.best_params_)
print("Best score: ", lrClassifier_rand.best_score_)

lrClassifier_rand_val_predictions = lrClassifier_rand.predict(X_val_text_vectors)
lrClassifier_rand_test_predictions = lrClassifier_rand.predict(X_test_text_vectors)

print("LR with RandomizedSearchCV on Validation: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_val, lrClassifier_rand_val_predictions), f1_score(y_val, lrClassifier_rand_val_predictions)))
print("LR with RandomizedSearchCV on Test: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_test, lrClassifier_rand_test_predictions), f1_score(y_test, lrClassifier_rand_test_predictions)))

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    8.5s finished


Best parameters:  {'penalty': 'l2', 'C': 0.02}
Best score:  0.7541428571428571
LR with RandomizedSearchCV on Validation: Accuracy Score: 0.748714, F1-score: 0.796953
LR with RandomizedSearchCV on Test: Accuracy Score: 0.748750, F1-score: 0.795773
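
As the note in the cell above mentions, the search selects the best candidate by accuracy unless told otherwise. Since the classes are imbalanced, one option is to select by F1-score instead via the `scoring` parameter. The following is a sketch on synthetic data, not a rerun of the experiment above; the dataset from `make_classification` and the value `n_iter=4` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced data standing in for the text features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.4, 0.6], random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear"),
    {"penalty": ["l1", "l2"], "C": [0.01, 0.02, 0.05]},
    n_iter=4,          # sample 4 of the 6 possible combinations
    scoring="f1",      # rank candidates by F1-score instead of accuracy
    cv=5,
    random_state=0,    # make the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```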


#### 6.2 DecisionTreeClassifier 

Using the __DecisionTreeClassifier__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Using the __RandomizedSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [17]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score

dtClassifier = DecisionTreeClassifier(class_weight = 'balanced')
parameters = {'max_depth': [10, 20, 30],
              'min_samples_leaf': [5, 15, 25]}

# NOTE: RandomizedSearchCV uses by default the score function of the estimator to evaluate
# (r2_score for regression; accuracy_score for classification). If desired,
# other scoring functions can be specified via the 'scoring' parameter. 
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# NOTE: You can experiment with different cv numbers, default = 5
# NOTE: You can also experiment with different n_iter
# (number of parameter settings that are sampled by the RandomizedSearch), default = 10
dtClassifier_rand = RandomizedSearchCV(dtClassifier,
                                       parameters,
                                       cv=5,
                                       verbose=1,
                                       n_jobs=-1)
dtClassifier_rand.fit(X_train_text_vectors, y_train)

print("Best parameters: ", dtClassifier_rand.best_params_)
print("Best score: ", dtClassifier_rand.best_score_)

dtClassifier_rand_val_predictions = dtClassifier_rand.predict(X_val_text_vectors)
dtClassifier_rand_test_predictions = dtClassifier_rand.predict(X_test_text_vectors)

print("DTClassifier with RandomizedSearchCV on Validation: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_val, dtClassifier_rand_val_predictions), f1_score(y_val, dtClassifier_rand_val_predictions)))
print("DTClassifier with RandomizedSearchCV on Test: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_test, dtClassifier_rand_test_predictions), f1_score(y_test, dtClassifier_rand_test_predictions)))

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:   49.5s finished


Best parameters:  {'min_samples_leaf': 25, 'max_depth': 10}
Best score:  0.732904761904762
DTClassifier with RandomizedSearchCV on Validation: Accuracy Score: 0.730143, F1-score: 0.777267
DTClassifier with RandomizedSearchCV on Test: Accuracy Score: 0.734375, F1-score: 0.779770
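
The introduction also mentions GridSearch. For a small search space like this one, an exhaustive `GridSearchCV` is a reasonable alternative to random sampling, because it evaluates every combination. Below is a minimal sketch on synthetic data; the dataset and the shallow `max_depth` values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the text features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"max_depth": [2, 4, 6],
              "min_samples_leaf": [5, 15, 25]}

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    cv=5,
)
grid.fit(X_demo, y_demo)

# All 3 x 3 = 9 combinations are evaluated
print("Candidates evaluated:", len(grid.cv_results_["params"]))
print("Best parameters:", grid.best_params_)
```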


#### 6.3 RandomForestClassifier 

Using the __RandomForestClassifier__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Using the __RandomizedSearchCV__ from here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [18]:
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score

rfClassifier = RandomForestClassifier(class_weight = 'balanced')
parameters = {'n_estimators': [200, 300, 400],
              'max_depth': [10, 20],
              'min_samples_leaf': [15, 25]}

# NOTE: RandomizedSearchCV uses by default the score function of the estimator to evaluate
# (r2_score for regression; accuracy_score for classification). If desired,
# other scoring functions can be specified via the 'scoring' parameter. 
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# NOTE: You can experiment with different cv numbers, default = 5
# NOTE: You can also experiment with different n_iter
# (number of parameter settings that are sampled by the RandomizedSearch), default = 10
rfClassifier_rand = RandomizedSearchCV(rfClassifier,
                                       parameters,
                                       cv=5,
                                       verbose=1,
                                       n_jobs=-1)
rfClassifier_rand.fit(X_train_text_vectors, y_train)

print("Best parameters: ", rfClassifier_rand.best_params_)
print("Best score: ", rfClassifier_rand.best_score_)

rfClassifier_rand_val_predictions = rfClassifier_rand.predict(X_val_text_vectors)
rfClassifier_rand_test_predictions = rfClassifier_rand.predict(X_test_text_vectors)

print("RFClassifier with RandomizedSearchCV on Validation: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_val, rfClassifier_rand_val_predictions), f1_score(y_val, rfClassifier_rand_val_predictions)))
print("RFClassifier with RandomizedSearchCV on Test: Accuracy Score: %f, F1-score: %f" % \
      (accuracy_score(y_test, rfClassifier_rand_test_predictions), f1_score(y_test, rfClassifier_rand_test_predictions)))

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed: 22.0min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 23.3min finished


Best parameters:  {'n_estimators': 200, 'min_samples_leaf': 15, 'max_depth': 10}
Best score:  0.7462063492063492
RFClassifier with RandomizedSearchCV on Validation: Accuracy Score: 0.744286, F1-score: 0.792102
RFClassifier with RandomizedSearchCV on Test: Accuracy Score: 0.748500, F1-score: 0.793810
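
The three searches report 6, 9, and 10 candidates respectively. This follows from the size of each parameter grid and the default `n_iter=10`: when the grid contains fewer combinations than `n_iter`, all of them are evaluated. The sketch below counts the combinations with `ParameterGrid`; the dictionary name `grids` is made up for this example.

```python
from sklearn.model_selection import ParameterGrid

# Parameter grids copied from the three cells above
grids = {
    "LogisticRegression": {"penalty": ["l2", "l1"],
                           "C": [0.01, 0.02, 0.05]},
    "DecisionTreeClassifier": {"max_depth": [10, 20, 30],
                               "min_samples_leaf": [5, 15, 25]},
    "RandomForestClassifier": {"n_estimators": [200, 300, 400],
                               "max_depth": [10, 20],
                               "min_samples_leaf": [15, 25]},
}

n_iter = 10  # default number of sampled settings in RandomizedSearchCV
candidates = {}
for name, grid in grids.items():
    n_combinations = len(ParameterGrid(grid))
    candidates[name] = min(n_iter, n_combinations)
    print(f"{name}: {n_combinations} combinations, {candidates[name]} evaluated")
```

Only the Random Forest search is truly randomized here: it samples 10 of its 12 combinations, so two settings are never tried.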


### 7. Ideas for improvement

**Preprocessing**: We can usually improve performance with some additional work. You can try the following:
* Change the feature extractor to TF or TF-IDF, and experiment with different vocabulary sizes (the models above use only 50 words).
* Add the other text field, __summary__, to the model and compute Bag of Words features for it.
* Engineer additional features, such as the presence of certain punctuation marks, all-capitalized words, or specific words that might be useful for this problem.
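
The preprocessing ideas above could be combined roughly as follows. This is a sketch on three made-up reviews, not a tuned solution: the example texts, the `max_features` values, and the helper `handcrafted` are all assumptions. Note that the handcrafted features are computed on the raw text, before lowercasing, since capitalization and punctuation are lost otherwise.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw reviews and summaries
review_texts = ["great product works well",
                "NOT working!!! waste of money",
                "does the job"]
summaries = ["five stars", "dont buy", "okay"]

# Separate TF-IDF vocabularies for the two text fields
review_vec = TfidfVectorizer(max_features=500)
summary_vec = TfidfVectorizer(max_features=100)
X_review = review_vec.fit_transform(review_texts)
X_summary = summary_vec.fit_transform(summaries)

def handcrafted(text):
    """Count exclamation marks and all-capitalized words (hypothetical helper)."""
    n_exclaim = text.count("!")
    n_caps = sum(1 for w in text.split() if len(w) > 1 and w.isupper())
    return [n_exclaim, n_caps]

X_extra = csr_matrix(np.array([handcrafted(t) for t in review_texts]))

# Stack all feature blocks side by side into one sparse matrix
X_all = hstack([X_review, X_summary, X_extra]).tocsr()
print(X_all.shape)
```

As with the Bag of Words features, the vectorizers should be fitted on the training split only and then applied to the validation and test data with `transform`.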

**Hyperparameter Tuning**: 
* It is always a good idea to try other parameter ranges and/or combinations of parameters, including threshold calibration.
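
One possible reading of threshold calibration is to replace the default decision threshold of 0.5 with the value that maximizes the F1-score on the validation set. The sketch below illustrates this on synthetic data; the dataset, the candidate thresholds, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data standing in for the text features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.4, 0.6], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = LogisticRegression(class_weight="balanced", solver="liblinear")
clf.fit(X_tr, y_tr)

# Predicted probability of the positive class on the validation set
proba_pos = clf.predict_proba(X_va)[:, 1]

# Candidate thresholds from 0.10 to 0.90 in steps of 0.05
thresholds = np.round(np.arange(0.10, 0.91, 0.05), 2)
f1_by_threshold = [f1_score(y_va, (proba_pos >= t).astype(int))
                   for t in thresholds]

best_idx = int(np.argmax(f1_by_threshold))
best_threshold = thresholds[best_idx]
print("Best threshold:", best_threshold)
print("Validation F1 at best threshold:", f1_by_threshold[best_idx])
```

The threshold should be chosen on the validation data only and then applied unchanged to the test data; otherwise the test score becomes optimistic.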
