<a href="https://colab.research.google.com/github/LxYuan0420/aws-machine-learning-university-accelerated-nlp/blob/master/colab_notebooks/MLA_NLP_Lecture2_Tree_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
%cd /gdrive/MyDrive/Colab Notebooks/git/aws-machine-learning-university-accelerated-nlp/colab_notebooks

/gdrive/MyDrive/Colab Notebooks/git/aws-machine-learning-university-accelerated-nlp/colab_notebooks


**Machine Learning Accelerator - Natural Language Processing - Lecture 2**

Tree-Based Models for a Classification Problem, and Hyperparameter Tuning
We continue to work with our review dataset to see how tree-based classifiers (Decision Tree, Random Forest), together with hyperparameter search techniques (GridSearchCV, RandomizedSearchCV), perform when predicting the isPositive field of our review dataset (which is very similar to the final project dataset).

1. Reading the dataset
1. Exploratory data analysis
1. Stop word removal and stemming
1. Train - Validation Split
1. Data processing with Pipeline and ColumnTransformer
1. Fit the Decision Tree classifier
    1. Find more details on the DecisionTreeClassifier here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
1. Test the classifier
1. Fit and test the Random Forest classifier
    1. Find more details on the RandomForestClassifier here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
1. Hyperparameter Tuning
    1. Find more details on the GridSearchCV here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
    1. Find more details on the RandomizedSearchCV here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
1. Ideas for improvement



Overall dataset schema:

1. reviewText: Text of the review
1. summary: Summary of the review
1. verified: Whether the purchase was verified (True or False)
1. time: UNIX timestamp for the review
1. rating: Rating of the review
1. log_votes: Logarithm-adjusted votes log(1+votes)
1. isPositive: Whether the review is positive or negative (1 or 0)
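
As a quick check on the log_votes definition, the sketch below applies log(1+votes) with NumPy to a few hypothetical vote counts. The raw counts are assumed for illustration only; the dataset stores just the transformed value. Under this definition, a review with 2 votes maps to about 1.098612.

```python
import numpy as np

# Hypothetical raw vote counts (assumed for illustration; the dataset only stores log_votes)
votes = np.array([0, 1, 2, 10])

# log(1 + votes): np.log1p keeps 0 votes at 0.0 and compresses large counts
log_votes = np.log1p(votes)

# Inverting the transform with expm1 recovers the raw counts
recovered = np.expm1(log_votes)

print(np.round(log_votes, 6))
print(np.round(recovered).astype(int))
```

The +1 inside the logarithm is what keeps reviews with zero votes well defined, since log(0) is not.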

**1. Reading the dataset**

In [3]:
import pandas as pd

df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')
df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


**2. Exploratory data analysis**

In [4]:
df["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64
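
The class counts above give a useful reference point: a model that always predicts the majority class would already be right about 62% of the time. A minimal sketch of that arithmetic, using only the counts printed above:

```python
# Class counts copied from the value_counts() output above
counts = {1.0: 43692, 0.0: 26308}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Total reviews: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier trained later should be judged against this baseline rather than against 50%.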

In [5]:
df.isna().sum()

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64

**3. Text Processing: Stop words removal and stemming**


In [6]:
import nltk

nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:

import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""
            
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace (spaces, tabs, newlines)
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list
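
To see what the cleaning steps inside process_text do before tokenization, the sketch below applies only the lowercase, strip, whitespace, and HTML-tag steps to a made-up review string. The helper name clean_basic and the sample text are hypothetical. Stop word removal and stemming are left out so the example runs without the NLTK downloads.

```python
import re

def clean_basic(sent):
    """Apply the pre-tokenization cleaning steps only (hypothetical helper)."""
    if not isinstance(sent, str):  # treat missing values (e.g. NaN) as empty text
        sent = ""
    sent = sent.lower()                # lowercase
    sent = sent.strip()                # remove leading/trailing whitespace
    sent = re.sub(r"\s+", " ", sent)   # collapse runs of spaces, tabs, newlines
    sent = re.sub(r"<.*?>", "", sent)  # remove HTML tags
    return sent

sample = "  This product is <b>NOT</b>\n\tworth   the money  "
cleaned = clean_basic(sample)

print(repr(cleaned))
print(repr(clean_basic(float("nan"))))
```

Note that "not" survives this stage. Keeping it through stop word removal as well is the purpose of the excluding list above.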

**4. Train - validation Split**

We hold out 10% of the data for validation (a 90/10 split).

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[["reviewText", "summary", "time", "log_votes"]],
                                                  df["isPositive"],
                                                  test_size=0.1,
                                                  shuffle=True,
                                                  random_state=42)
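
With test_size=0.1, the 70,000 reviews split into 63,000 training rows and 7,000 validation rows. The sketch below checks this on a stand-in array of row indices, which is an assumption made so the example runs without the CSV file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 70,000 rows of the review dataset (row indices only)
rows = np.arange(70000)

train_rows, val_rows = train_test_split(rows, test_size=0.1, shuffle=True, random_state=42)

# No row should end up in both sets
overlap = set(train_rows.tolist()) & set(val_rows.tolist())

print(len(train_rows), len(val_rows))
print(len(overlap))
```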

In [9]:
print("Processing the reviewText field")

X_train["reviewText"] = process_text(X_train["reviewText"].tolist())
X_val["reviewText"] = process_text(X_val["reviewText"].tolist())


Processing the reviewText field


In [10]:
X_train["summary"] = process_text(X_train["summary"].tolist())
X_val["summary"] = process_text(X_val["summary"].tolist())

**5. Data processing with Pipeline and ColumnTransform**

In the previous examples, we have seen how to use a Pipeline to prepare a single data field for our machine learning model. This time, we will focus on multiple fields: numeric and text fields. We are using the DecisionTreeClassifier from Sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html.

For the numerical features pipeline, the numerical_processor below, we use a MinMaxScaler (features don't have to be scaled when using Decision Trees, but it's a good idea to see how to use more data transforms). If different processing is desired for different numerical features, different pipelines should be built, just as shown below for the two text features.
For the text features pipelines, text_preprocessor_0 and text_preprocessor_1 below, we use CountVectorizer() on the text fields.
The selective preparations of the dataset features are then put together into a collective ColumnTransformer, which is finally used in a Pipeline along with an estimator. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future.
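
The way a ColumnTransformer stitches per-column outputs together can be seen on a tiny made-up DataFrame. The sketch below reuses the column names of the review data, but the rows and the toy_ variable names are assumptions for illustration. The output has two scaled numeric columns, followed by one binary column per distinct word in each text field.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up dataset with the same column names as the review data
toy = pd.DataFrame({
    "time": [100, 200, 300],
    "log_votes": [0.0, 1.0, 2.0],
    "summary": ["great game", "bad game", "great value"],
    "reviewText": ["love this game", "not work", "work great"],
})

toy_preprocessor = ColumnTransformer([
    # A list of column names passes a 2D frame to the scaler
    ("numerical_pre", Pipeline([("num_scaler", MinMaxScaler())]), ["time", "log_votes"]),
    # A single column name (a string, not a list) passes 1D text to CountVectorizer
    ("text_pre_0", Pipeline([("text_vect_0", CountVectorizer(binary=True))]), "summary"),
    ("text_pre_1", Pipeline([("text_vect_1", CountVectorizer(binary=True))]), "reviewText"),
])

features = toy_preprocessor.fit_transform(toy)

# 2 numeric columns + 4 distinct summary words + 6 distinct reviewText words
print(features.shape)
```

Passing the text column as a plain string rather than a one-element list matters here: CountVectorizer expects a 1D sequence of documents, not a 2D frame.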

In [11]:
numerical_features = ["time", "log_votes"]
text_features = ["summary", "reviewText"]

model_features = numerical_features + text_features
model_target = "isPositive"

In [13]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

# Column Transformer
numerical_processor = Pipeline([("num_scalar", MinMaxScaler())])

# preprocess text features
text_preprocessor_0 = Pipeline([("text_vect_0", CountVectorizer(binary=True, max_features=50))])
text_preprocessor_1 = Pipeline([("text_vect_1", CountVectorizer(binary=True, max_features=150))])

data_preprocessor = ColumnTransformer([
    ("numerical_pre", numerical_processor, numerical_features),
    ("text_pre_0", text_preprocessor_0, text_features[0]),
    ("text_pre_1", text_preprocessor_1, text_features[1])
])


# Pipeline: preprocessing followed by the Decision Tree classifier
pipeline = Pipeline([
    ("data_preprocessing", data_preprocessor),
    ("dt", DecisionTreeClassifier(max_depth=10, min_samples_leaf=15))
])

**6. Fit the Decision Tree classifier**

In [14]:
pipeline.fit(X_train, y_train.values)

Pipeline(memory=None,
         steps=[('data_preprocessing',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numerical_pre',
                                                  Pipeline(memory=None,
                                                           steps=[('num_scalar',
                                                                   MinMaxScaler(copy=True,
                                                                                feature_range=(0,
                                                                                               1)))],
                                                           verbose=False),
                                                  ['time', 'log_votes']),
                                                 ('text_pre_0',
                               

**7. Test the classifier**

Let's evaluate the trained Decision Tree classifier on the validation dataset. We use the pipeline's predict() method, which applies the same preprocessing steps to the raw validation data before classifying, and report precision, recall, F1-score, and accuracy.

In [15]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_pred = pipeline.predict(X_val)
print(classification_report(y_val.values, val_pred))
print(f"Accuracy on validation set: {accuracy_score(y_val.values, val_pred)}")


              precision    recall  f1-score   support

         0.0       0.71      0.71      0.71      2557
         1.0       0.83      0.83      0.83      4443

    accuracy                           0.79      7000
   macro avg       0.77      0.77      0.77      7000
weighted avg       0.79      0.79      0.79      7000

Accuracy on validation set: 0.7872857142857143


**8. Fit and test the Random Forest classifier**

This time, we will use the Random Forest classifier. Let's update our pipeline for that.

In [17]:
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ("data_preprocessing", data_preprocessor),
    ("rf", RandomForestClassifier(n_estimators=150, max_depth=10, min_samples_leaf=15))
])

pipeline.fit(X_train[model_features], y_train.values)

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))



[[1634  923]
 [ 403 4040]]
              precision    recall  f1-score   support

         0.0       0.80      0.64      0.71      2557
         1.0       0.81      0.91      0.86      4443

    accuracy                           0.81      7000
   macro avg       0.81      0.77      0.79      7000
weighted avg       0.81      0.81      0.81      7000

Accuracy (validation): 0.8105714285714286
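
The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, and recall with NumPy from the matrix printed above, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = np.array([[1634, 923],
               [403, 4040]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / all true members of that class

print(f"accuracy:  {accuracy:.4f}")
print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
```

The low recall for class 0 (about 0.64) comes from the 923 negative reviews predicted as positive, which is the main weakness of this model on the validation set.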


**9. Hyperparameter Tuning**

Let's try different parameter values and see how the RandomForestClassifier in our pipeline performs under some combinations of parameters.

Warning: The number of hyperparameter values tried, together with the number of cross-validation folds, can greatly increase training time! The grid below is deliberately small (2 x 2 values, 3 folds), and the run shown still took about 3 minutes on 2 workers. A larger grid on a RandomForestClassifier can take from many minutes to hours, so a DecisionTreeClassifier is a faster, though lower performing, model to experiment with.

9.1 GridSearchCV
Find more details on the GridSearchCV here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [18]:
from sklearn.model_selection import GridSearchCV

#parameter grid
param_grid = {
    "rf__max_depth": [10,20],
    "rf__min_samples_leaf": [5,10]
}

grid_search = GridSearchCV(pipeline,
                           param_grid,
                           cv=3,
                           verbose=10,
                           n_jobs=-1)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   34.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:  3.0min finished


Best parameters: {'rf__max_depth': 20, 'rf__min_samples_leaf': 5}
Best score: 0.8341269841269842
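
The "4 candidates, totalling 12 fits" line in the log can be reproduced by enumerating the grid. The sketch below uses ParameterGrid from sklearn.model_selection with the same param_grid. The cv_folds variable is just a name chosen here for the cv=3 setting.

```python
from sklearn.model_selection import ParameterGrid

# Keys use the <step name>__<parameter> convention to reach into the pipeline step "rf"
param_grid = {
    "rf__max_depth": [10, 20],
    "rf__min_samples_leaf": [5, 10],
}
cv_folds = 3

# Every combination of the listed values is one candidate
candidates = list(ParameterGrid(param_grid))
total_fits = len(candidates) * cv_folds

for params in candidates:
    print(params)
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Because refit defaults to True in GridSearchCV, the best combination is then fitted once more on the full training set, which is why best_estimator_ can be used for prediction directly.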


In [19]:
val_predictions = grid_search.best_estimator_.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[1836  721]
 [ 469 3974]]
              precision    recall  f1-score   support

         0.0       0.80      0.72      0.76      2557
         1.0       0.85      0.89      0.87      4443

    accuracy                           0.83      7000
   macro avg       0.82      0.81      0.81      7000
weighted avg       0.83      0.83      0.83      7000

Accuracy (validation): 0.83
