# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
# YOUR CODE HERE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [3]:
# YOUR CODE HERE
filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")
df = pd.read_csv(filename, header=0)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
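
As a rough illustration of the class-imbalance and missingness checks mentioned above, the sketch below uses `value_counts()` and `isnull()` on a small made-up DataFrame. The name `toy_df` and its rows are hypothetical stand-ins, not the lab data.

```python
import pandas as pd

# Hypothetical stand-in data; in the lab, the real data lives in `df`.
toy_df = pd.DataFrame({
    "Review": ["great read", "loved it", "boring", "a classic", "not for me"],
    "Positive Review": [True, True, False, True, False],
})

# Share of examples in each class; shares near 0.5 suggest a balanced binary label.
class_share = toy_df["Positive Review"].value_counts(normalize=True)
print(class_share)

# The mean of a boolean column is the fraction of True values.
positive_share = toy_df["Positive Review"].mean()
print(positive_share)

# Number of missing values per column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

The same three calls can be applied to `df` once the label column is known.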


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
# YOUR CODE HERE
df.head()

| | Review | Positive Review |
|---|---|---|
| 0 | This was perhaps the best of Johannes Steinhof... | True |
| 1 | This very fascinating book is a story written ... | True |
| 2 | The four tales in this collection are beautifu... | True |
| 3 | The book contained more profanity than I expec... | False |
| 4 | We have now entered a second time of deep conc... | True |


In [5]:
df.shape

(1973, 2)

In [6]:
y = df['Positive Review']
X = df['Review']

In [7]:
X.head()

0    This was perhaps the best of Johannes Steinhof...
1    This very fascinating book is a story written ...
2    The four tales in this collection are beautifu...
3    The book contained more profanity than I expec...
4    We have now entered a second time of deep conc...
Name: Review, dtype: object

In [8]:
X.shape

(1973,)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1234)
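
The split above is a plain random split. If the label classes were noticeably imbalanced, one option would be to pass `stratify` to `train_test_split` so that both partitions keep the same class proportions. The sketch below shows this on made-up data; `X_toy` and `y_toy` are hypothetical and not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 examples, 5 of them (25%) positive.
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([True] * 5 + [False] * 15)

# stratify=y_toy keeps the 25% positive rate in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1234, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```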

In [10]:
X_train.head()

1369    As my brother said when flipping through this ...
1366    Cooper's book is yet another warm and fuzzy ma...
385     I have many robot books and this is the best a...
750     As China re-emerges as a dominant power in the...
643     I have been a huge fan of Michael Crichton for...
Name: Review, dtype: object

In [11]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train)
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
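
The cell above fits the vectorizer on the training text only and then reuses it to transform the test text. A minimal sketch of what that means, using a made-up corpus (`train_docs` and `test_docs` are hypothetical): words that appear only in the test text are not in the learned vocabulary and are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus, standing in for X_train and X_test.
train_docs = ["a good book", "a bad book", "good plot and good prose"]
test_docs = ["a good unseen story"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)  # learn vocabulary and IDF weights from training text only
train_matrix = vectorizer.transform(train_docs)
test_matrix = vectorizer.transform(test_docs)

# The default tokenizer drops one-character tokens such as "a".
print(sorted(vectorizer.vocabulary_))

# Both matrices share the same columns, one per vocabulary term.
print(train_matrix.shape, test_matrix.shape)

# "unseen" and "story" were never seen during fit, so only "good" is non-zero.
print(test_matrix.nnz)
```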

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [12]:
# YOUR CODE HERE
from sklearn.model_selection import GridSearchCV

In [13]:
# Find the best max_iter with a 5-fold cross-validated grid search
model = LogisticRegression()
param_grid_lr = {'max_iter': [100, 150, 200, 250, 300, 400, 500]}
grid_search = GridSearchCV(model, param_grid_lr, cv=5, scoring='accuracy')
grid_search.fit(X_train_tfidf, y_train)
best_max_iter = grid_search.best_params_['max_iter']
print("Best max_iter value: ", best_max_iter)

Best max_iter value:  100
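
Note that `max_iter` only caps the number of solver iterations; it does not change the model's complexity. A possible alternative, sketched below under stated assumptions, is to search over the regularization strength `C` and the vectorizer's `min_df` together by wrapping both steps in a `Pipeline`, so the vectorizer is re-fit inside each cross-validation fold. The corpus, the step names `tfidf` and `clf`, and the grid values are all made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 positive and 6 negative reviews.
toy_reviews = [
    "great book loved it", "wonderful book great story",
    "loved this great book", "excellent story loved it",
    "great writing wonderful book", "loved the excellent writing",
    "boring book hated it", "terrible book boring story",
    "hated this boring book", "awful story hated it",
    "boring writing terrible book", "hated the awful writing",
]
toy_labels = np.array([True] * 6 + [False] * 6)

# Vectorizer and classifier in one estimator, so each CV fold fits its own vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=100)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__min_df": [1, 2],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
search.fit(toy_reviews, toy_labels)

print(search.best_params_)
print(round(search.best_score_, 4))
```

On the lab data, the same pattern would take `X_train` (raw text) and `y_train` in place of the toy inputs.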


In [14]:
lr_model = LogisticRegression(max_iter=100)
lr_model.fit(X_train_tfidf, y_train)
probability_predictions = lr_model.predict_proba(X_test_tfidf)[:,1]
class_label_predictions = lr_model.predict(X_test_tfidf)

auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

# Note: the slice [1:5] skips index 0 and returns four entries, not five.
first_five = list(tfidf_vectorizer.vocabulary_.items())[1:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))


AUC on the test data: 0.9161
The size of the feature space: 19029
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('my', 11353), ('brother', 2455), ('said', 14836), ('when', 18601)]:
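
To help interpret the AUC value, the sketch below computes `roc_auc_score` on four made-up scores and compares it with a direct count: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are hypothetical and contain no ties.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_sklearn = roc_auc_score(y_true, scores)

# Pairwise view: how often does a positive example outrank a negative one?
pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]
auc_pairs = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(auc_sklearn, auc_pairs)  # both are 0.75: 3 of the 4 pairs are ranked correctly
```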


In [15]:
#test prediction
print('Review #1:\n')
print(X_test.to_numpy()[124])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[124])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[124]))

Review #1:

I've been a fan of Carol Dweck's scholarly work for years. Her work on self-esteem, self-concept, and the incremental vs. entity theories of intelligence provides some of the most powerfully useful tools I've encountered for educators and parents in their work with children, as well as in their own self-awareness and lives. I'm delighted to see this information written here in such a user-friendly conversational tone, rich with stories that illustrate the nuances and complexities of Dweck's research and ideas. I'm recommending this book to all of my graduate students (teachers and principals working with gifted learners), as well as to parents of high-ability children.

Dona Matthews, Ph.D., Director of the Hunter College Center for Gifted Studies and Education, City University of New York


Prediction: Is this a good review? True

Actual: Is this a good review? True



In [16]:
#test prediction
print('Review #1:\n')
print(X_test.to_numpy()[120])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[120])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[120]))

Review #1:

Book goes over a lot of information in a very short time, but not much of that information is worth anything unless you're building a circle-track or drag car. Took the hit and ordered Stanforth's Competition Car Suspension


Prediction: Is this a good review? False

Actual: Is this a good review? False



In [17]:
for min_df in [1,10,100,1000]:
    print('\nMin Document Frequency Value: {0}'.format(min_df))
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1,2))
    tfidf_vectorizer.fit(X_train)
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    model = LogisticRegression(max_iter=100)
    model.fit(X_train_tfidf, y_train)
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]
    auc = roc_auc_score(y_test, probability_predictions)
    print('AUC on the test data: {:.4f}'.format(auc))
    len_feature_space = len(tfidf_vectorizer.vocabulary_)
    print('The size of the feature space: {0}'.format(len_feature_space))
    first_five = list(tfidf_vectorizer.vocabulary_.items())[1:5]
    print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))
    first_five_stop = list(tfidf_vectorizer.stop_words_)[1:5]
    print('Glimpse of first 5 stop words \n{}:'.format(first_five_stop))
    


Min Document Frequency Value: 1
AUC on the test data: 0.9310
The size of the feature space: 143560
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('my', 79875), ('brother', 20610), ('said', 105149), ('when', 137651)]:
Glimpse of first 5 stop words 
[]:

Min Document Frequency Value: 10
AUC on the test data: 0.9254
The size of the feature space: 4257
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('my', 2288), ('brother', 588), ('said', 2967), ('when', 4049)]:
Glimpse of first 5 stop words 
['too comfortable', 'fact in', 'by discarding', 'guidance of']:

Min Document Frequency Value: 100
AUC on the test data: 0.8625
The size of the feature space: 279
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('my', 144), ('when', 258), ('through', 233), ('this', 226)]:
Glimpse of first 5 stop words 
['the war', 'too comfortable', 'fact in', 'by discarding']:

Min Document Frequency Value: 1000
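
To make the effect of `min_df` and `max_df` concrete, the sketch below applies integer thresholds to a four-document made-up corpus. With integers, `min_df` drops terms that appear in fewer than that many documents, and `max_df` drops terms that appear in more than that many. The corpus and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus. Document frequencies: good=3, book=2, plot=1, prose=1, bad=1.
docs = ["good book", "good plot", "good prose", "bad book"]

full_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)

# min_df=2 keeps only terms found in at least 2 documents.
min_vocab = set(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

# max_df=2 keeps only terms found in at most 2 documents.
max_vocab = set(TfidfVectorizer(max_df=2).fit(docs).vocabulary_)

print(sorted(min_vocab))               # rare terms removed
print(sorted(max_vocab))               # the most common term removed
print(sorted(full_vocab - max_vocab))  # what max_df pruned
```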


In [18]:
for max_df in [1,10,100,1000]:
    print('\nMax Document Frequency Value: {0}'.format(max_df))
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, ngram_range=(1,2))
    tfidf_vectorizer.fit(X_train)
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    model = LogisticRegression(max_iter=100)
    model.fit(X_train_tfidf, y_train)
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]
    auc = roc_auc_score(y_test, probability_predictions)
    print('AUC on the test data: {:.4f}'.format(auc))
    len_feature_space = len(tfidf_vectorizer.vocabulary_)
    print('The size of the feature space: {0}'.format(len_feature_space))
    first_five = list(tfidf_vectorizer.vocabulary_.items())[1:5]
    print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))
    first_five_stop = list(tfidf_vectorizer.stop_words_)[1:5]
    print('Glimpse of first 5 stop words \n{}:'.format(first_five_stop))


Max Document Frequency Value: 1
AUC on the test data: 0.7418
The size of the feature space: 105091
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('mushy', 58767), ('dumped', 27415), ('venus', 98087), ('brother said', 15070)]:
Glimpse of first 5 stop words 
['or even', 'the war', 'the promises', 'and brief']:

Max Document Frequency Value: 10
AUC on the test data: 0.8755
The size of the feature space: 139729
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('acting', 2318), ('pretend', 93896), ('dumps', 36257), ('mushy', 77677)]:
Glimpse of first 5 stop words 
['the war', 'of money', 'national', 'aren']:

Max Document Frequency Value: 100
AUC on the test data: 0.9235
The size of the feature space: 143285
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('said', 104967), ('flipping', 45748), ('girls', 50151), ('start', 112862)]:
Glimpse of first 5 stop words 
['might', 'did', 'for', 'th

In [27]:
best_vectorizer = TfidfVectorizer(min_df=1, max_df=1000, ngram_range=(1,2))
best_vectorizer.fit(X_train)
X_train_best = best_vectorizer.transform(X_train)
X_test_best = best_vectorizer.transform(X_test)
best_model = LogisticRegression(max_iter=100)
best_model.fit(X_train_best, y_train)
probability_predictions = best_model.predict_proba(X_test_best)[:,1]
class_label_predictions = best_model.predict(X_test_best)

auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

# Inspect the vectorizer that was actually used to build X_train_best.
len_feature_space = len(best_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

first_five = list(best_vectorizer.vocabulary_.items())[1:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))


AUC on the test data: 0.9326
The size of the feature space: 143550
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('my', 79869), ('brother', 20608), ('said', 105142), ('when', 137641)]:


In [20]:
#test prediction
print('Review #1:\n')
print(X_test.to_numpy()[124])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[124])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[124]))

Review #1:

I've been a fan of Carol Dweck's scholarly work for years. Her work on self-esteem, self-concept, and the incremental vs. entity theories of intelligence provides some of the most powerfully useful tools I've encountered for educators and parents in their work with children, as well as in their own self-awareness and lives. I'm delighted to see this information written here in such a user-friendly conversational tone, rich with stories that illustrate the nuances and complexities of Dweck's research and ideas. I'm recommending this book to all of my graduate students (teachers and principals working with gifted learners), as well as to parents of high-ability children.

Dona Matthews, Ph.D., Director of the Hunter College Center for Gifted Studies and Education, City University of New York


Prediction: Is this a good review? True

Actual: Is this a good review? True



In [21]:
#test prediction
print('Review #1:\n')
print(X_test.to_numpy()[120])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[120])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[120]))

Review #1:

Book goes over a lot of information in a very short time, but not much of that information is worth anything unless you're building a circle-track or drag car. Took the hit and ordered Stanforth's Competition Car Suspension


Prediction: Is this a good review? False

Actual: Is this a good review? False



In [22]:
import pickle

In [23]:
pkl_model_filename = "Pickle_BookReview_Regression_Model.pkl"
# Use a context manager so the file is closed after writing.
with open(pkl_model_filename, 'wb') as f:
    pickle.dump(best_model, f)

In [24]:
# Use a context manager so the file is closed after reading.
with open(pkl_model_filename, 'rb') as f:
    persistent_model = pickle.load(f)
persistent_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
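
The pickled model above expects TF-IDF features, so scoring new raw text later also requires the fitted vectorizer. One possible approach, sketched here with made-up data, is to bundle both steps with `make_pipeline` and pickle the pipeline as a single object. The sketch round-trips through `pickle.dumps` and `pickle.loads` in memory rather than writing a file; `docs`, `labels`, and `new_reviews` are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data.
docs = ["great book", "loved it", "boring book", "hated it"]
labels = [True, True, False, False]

# One estimator that holds both the fitted vectorizer and the fitted model.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=100))
text_clf.fit(docs, labels)

# Serialize and restore the whole pipeline.
blob = pickle.dumps(text_clf)
restored = pickle.loads(blob)

# The restored pipeline accepts raw text directly.
new_reviews = ["great story", "boring story"]
print(restored.predict(new_reviews))
```

As with any pickle file, only load pipelines that come from a trusted source.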

In [29]:
prediction = persistent_model.predict(X_test_best) 
print(prediction)

[ True  True False  True  True False False False  True  True  True False
  True  True False  True False False False False False  True False  True
 False False  True  True False False False False False  True  True False
  True False  True  True  True  True False  True False False False  True
  True  True False False False False False False  True  True False False
  True  True False  True False  True  True False False  True False False
 False  True  True False False  True False False  True  True False  True
 False  True False False  True False False False False False False False
 False  True  True False  True  True  True False  True  True  True  True
 False False  True False  True  True False False  True  True False  True
 False  True False False  True False  True  True  True  True  True False
 False  True False  True  True  True  True False False  True  True  True
  True  True False  True False False  True False  True  True  True  True
 False  True  True False  True  True False  True Fa