===========================================


Title: 6.2 Exercises


Author: Chad Wood


Date: 28 Jan 2022


Modified By: Chad Wood


Description: This program demonstrates building and using a logistic regression machine learning model that uses multiple feature engineering techniques to classify Amazon reviews as either positive or negative.


=========================================== 

## 6.2 Exercises

<b>(1) Using the Amazon Alexa reviews dataset, build a logistic regression model to predict positive or negative feedback based on review text. Be sure to run a test with something random you create (out of sample). Remember: 1 is positive, 0 is negative.</b>

In [1]:
import pandas as pd
import lib.normalizer as nm
import nltk

# Allows no/not to be retained through normalization
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

data = pd.read_csv('data/amazon_alexa.tsv', sep='\t')
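
To illustrate why `no` and `not` are taken out of the stopword list, here is a minimal sketch. It uses a tiny hand-written stopword set (an assumption standing in for the NLTK list, so nothing needs to be downloaded) and a hypothetical `strip_stopwords` helper that is not part of `lib.normalizer`.

```python
# Hypothetical miniature stopword set standing in for NLTK's English list
default_stopwords = {'the', 'is', 'a', 'no', 'not', 'it'}

# Same set with the negations retained, mirroring the cell above
negation_safe = default_stopwords - {'no', 'not'}


def strip_stopwords(text, stopwords):
    """Drops every whitespace-separated token found in `stopwords`."""
    return ' '.join(tok for tok in text.split() if tok not in stopwords)


review = 'it is not a good speaker'
lost = strip_stopwords(review, default_stopwords)
kept = strip_stopwords(review, negation_safe)

print(lost)  # good speaker
print(kept)  # not good speaker
```

With the default set, a negative review collapses into text that looks positive, which is the failure the modified list is meant to avoid.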

In [2]:
# Normalizes data
verified_reviews = nm.Normalizer(data['verified_reviews'])
data['verified_reviews'] = verified_reviews.normalize(
    strip_html=True, remove_special_chars=True, 
    remove_digits=True, remove_stopwords=True,
    remove_accented_chars=True, expand_contractions=True,
    lemmatize_text=True, text_lower=True,
    stopwords=stopword_list)

In [3]:
import numpy as np
import re

# Drops nan rows
data = data.dropna().reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


### <i> Below is my initial take, before realizing the book walks us through the process.</i>

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Attempts to use BOW Feature Model for ML Model
def bow(corpus):
    # Gets bag of words features
    cv = CountVectorizer(min_df=0., max_df=1.)
    cv_X = cv.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2
    cv_names = cv.get_feature_names_out()

    return pd.DataFrame(cv_X.toarray(), columns=cv_names)

In [5]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

# Gets BOW Model of corpus
corpus_model = bow(data['verified_reviews'])

# Splits data into training/testing pairs
train_corpus, test_corpus, train_label_nums, test_label_nums = train_test_split(np.array(corpus_model),
                                                                                np.array(data['feedback']),
                                                                                test_size=0.35, random_state=42)

# Builds the ML model, trains it with corpus
logistic = linear_model.LogisticRegression()
logistic.fit(train_corpus, train_label_nums)

# Predicts rating values
y_pred = logistic.predict(test_corpus)

In [9]:
import importlib
import lib.model_evaluation_utils as meu

# Reloads the book module after updating it: pd.MultiIndex no longer
# accepts the labels arg, which was renamed to codes
importlib.reload(meu)

# Generates confusion matrix
meu.display_confusion_matrix(true_labels=test_label_nums, predicted_labels=y_pred)

          Predicted:    
                   1   0
Actual: 1        991   9
        0         66  37
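
As a worked check of what this matrix implies, the sketch below derives accuracy and per-class precision and recall by hand. The four counts are copied from the output above; the variable names are my own.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
tp, fn = 991, 9   # actual 1 predicted as 1 / predicted as 0
fp, tn = 66, 37   # actual 0 predicted as 1 / predicted as 0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total

# Positive class (1)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

# Negative class (0)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f'accuracy:  {accuracy:.2f}')                              # 0.93
print(f'class 1:   P={precision_pos:.2f} R={recall_pos:.2f}')    # P=0.94 R=0.99
print(f'class 0:   P={precision_neg:.2f} R={recall_neg:.2f}')    # P=0.80 R=0.36
```

The model recovers only 37 of the 103 negative reviews, so the high overall accuracy mostly reflects the positive class.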


<i><b>"Be sure to run a test with something random you create (out of sample)." :</b></i>

In [170]:
# Builds sample
sample = ['not fan write review not meet expectation',
          'really love write test review']
sample_label = [0, 1]

# Builds sample features against the training vocabulary; calling bow(sample)
# would fit a new vocabulary that the model was not trained on
tmp = pd.DataFrame(0, index=range(len(sample)), columns=corpus_model.columns)

# Marks each sample word that appears in the training vocabulary
for row, string in enumerate(sample):
    for word in string.split():
        if word in tmp.columns:
            tmp.loc[row, word] = 1

sample_corpus = np.array(tmp)
sample_label = np.array(sample_label)

# Predicts and displays confusion matrix
sample_pred = logistic.predict(sample_corpus)
meu.display_confusion_matrix(true_labels=sample_label, predicted_labels=sample_pred)

          Predicted:   
                   1  0
Actual: 1          1  0
        0          1  0
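
A shorter route to the same kind of out-of-sample test is to keep the fitted `CountVectorizer` and call its `transform` method on the new text, which maps it onto the training vocabulary and ignores unseen words. The sketch below shows this on a toy corpus; the training documents and labels are made up for illustration and are not the Alexa data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-normalized training reviews (1 = positive, 0 = negative)
train_docs = ['love echo great sound', 'not work return device',
              'great product love', 'stop work not happy']
train_labels = [1, 0, 1, 0]

# Fits the vocabulary on the training text only
cv = CountVectorizer()
train_features = cv.fit_transform(train_docs)
model = LogisticRegression().fit(train_features, train_labels)

# Out-of-sample text is projected onto the same vocabulary
new_docs = ['really love write test review',
            'not fan write review not meet expectation']
new_features = cv.transform(new_docs)
new_pred = model.predict(new_features)

print(new_features.shape)  # (2, number of words in the training vocabulary)
print(new_pred)
```

Because the column layout comes from the fitted vectorizer, no manual DataFrame alignment is needed.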


### <i> Below is my second take, following the book's guide (pg 315+).</i>

In [172]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Splits data into training/testing pairs
train_corpus, test_corpus, train_label_nums, test_label_nums = train_test_split(np.array(data['verified_reviews']),
                                                                                np.array(data['feedback']),
                                                                                test_size=0.35, random_state=42)

# builds BOW features
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

# transforms test_corpus into features
cv_test_features = cv.transform(test_corpus)

In [182]:
# Logistic Regression
log_reg = linear_model.LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
log_reg.fit(cv_train_features, train_label_nums)

# Gets CV accuracy
lr_bow_cv_scores = cross_val_score(log_reg, cv_train_features, train_label_nums, cv=5)
lr_bow_cv_mean_score = np.mean(lr_bow_cv_scores)
print('CV Accuracy (5-fold):', lr_bow_cv_scores) # 5-fold
print('Average CV Accuracy:', lr_bow_cv_mean_score) # Average

# Model test accuracy
lr_bow_test_score = log_reg.score(cv_test_features, test_label_nums)
print('Test Accuracy:', lr_bow_test_score)

CV Accuracy (5-fold): [0.92682927 0.94634146 0.92665037 0.93887531 0.93154034]
Average CV Accuracy: 0.9340473492754487
Test Accuracy: 0.9320036264732547
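
One caveat with the cross-validation above: the vocabulary in `cv_train_features` was fitted on the whole training split, so each validation fold has already contributed to it. The effect is probably small for plain counts. A possible alternative, sketched here on a made-up toy corpus, wraps the vectorizer and the classifier in a scikit-learn `Pipeline` so the vectorizer is refit inside every fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical normalized reviews, four per class (1 = positive, 0 = negative)
docs = ['love echo great sound', 'great product love', 'love music quality',
        'easy setup great', 'not work return device', 'stop work not happy',
        'no sound bad speaker', 'not connect wifi return']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together on each training fold
pipeline = make_pipeline(
    CountVectorizer(),
    LogisticRegression(penalty='l2', C=1, max_iter=100, random_state=42),
)

# cv=2 only because the toy corpus is tiny; the notebook uses cv=5
scores = cross_val_score(pipeline, docs, labels, cv=2)
print('CV Accuracy (2-fold):', scores)
print('Average CV Accuracy:', scores.mean())
```

The pipeline takes raw text directly, so the same object can later be fitted on the full training corpus and used to score the test corpus.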


In [184]:
# Using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# builds TF-IDF features
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

# transforms test_corpus into features
tv_test_features = tv.transform(test_corpus)

# Logistic Regression
log_reg = linear_model.LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
log_reg.fit(tv_train_features, train_label_nums)

# Gets CV accuracy
lr_tfidf_scores = cross_val_score(log_reg, tv_train_features, train_label_nums, cv=5)
lr_tfidf_mean_score = np.mean(lr_tfidf_scores)
print('CV Accuracy (5-fold):', lr_tfidf_scores) # 5-fold
print('Average CV Accuracy:', lr_tfidf_mean_score) # Average

# Model test accuracy
lr_tfidf_score = log_reg.score(tv_test_features, test_label_nums)
print('Test Accuracy:', lr_tfidf_score)

CV Accuracy (5-fold): [0.92439024 0.92439024 0.92665037 0.92420538 0.92420538]
Average CV Accuracy: 0.9247683224998509
Test Accuracy: 0.9084315503173164
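
To make the difference from raw counts concrete, the following sketch shows two properties of `TfidfVectorizer` with its default settings: a word that occurs in every document is weighted lower than a rarer word, and each document vector is scaled to unit L2 length. The three short documents are invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents: 'love' appears in all three, the other words in one each
docs = ['love echo', 'love sound', 'love music']

tv = TfidfVectorizer(use_idf=True)
weights = tv.fit_transform(docs).toarray()
vocab = tv.vocabulary_  # maps each term to its column index

# In the first document both words occur once, yet the rarer word weighs more
love_weight = weights[0, vocab['love']]
echo_weight = weights[0, vocab['echo']]
print('love:', love_weight, 'echo:', echo_weight)

# Default norm='l2' rescales every row to length 1
row_norms = np.linalg.norm(weights, axis=1)
print('row norms:', row_norms)
```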


<b>At the end of Chapter 5, the author uses a custom-built class to summarize model performance. This class doesn’t actually exist (from the author) but you can make it a reality. Using the object you have from mnb_predictions, create something similar to the output on page 335.</b>

In [191]:
# Textbook uses *mnb_predictions = gs_mnb.predict(test_corpus)* for predicted_labels arg
# y_pred from my model returns the same output, with respect to label names/nums.
meu.display_classification_report(true_labels=test_label_nums,
                                  predicted_labels=y_pred)

              precision    recall  f1-score   support

           1       0.94      0.99      0.96      1000
           0       0.80      0.36      0.50       103

    accuracy                           0.93      1103
   macro avg       0.87      0.68      0.73      1103
weighted avg       0.93      0.93      0.92      1103
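
For context on these numbers, the sketch below compares the reported test accuracies with a majority-class baseline, meaning a trivial model that always predicts 1. The support counts and accuracies are copied from the outputs above. It assumes both takes were scored on the same test split, which the identical `test_size` and `random_state` suggest.

```python
# Support values copied from the classification report above
support_pos, support_neg = 1000, 103
total = support_pos + support_neg

# Accuracy of always predicting the majority class (1)
baseline_accuracy = max(support_pos, support_neg) / total

# Test accuracies printed earlier in this notebook
bow_test_accuracy = 0.9320036264732547
tfidf_test_accuracy = 0.9084315503173164

bow_gain = bow_test_accuracy - baseline_accuracy
tfidf_gain = tfidf_test_accuracy - baseline_accuracy

print(f'Majority-class baseline: {baseline_accuracy:.4f}')  # 0.9066
print(f'BOW gain over baseline:    {bow_gain:+.4f}')
print(f'TF-IDF gain over baseline: {tfidf_gain:+.4f}')
```

With roughly 91% of the test reviews labelled positive, accuracy alone says little here, which is why the recall of 0.36 on class 0 is the more informative figure.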

