# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [3]:
filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")
df = pd.read_csv(filename, header=0)
df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
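As a rough illustration of the inspection techniques described above, the sketch below runs a few of them on a tiny made-up DataFrame. The rows in `toy_df` are hypothetical stand-ins, not rows from any of the course data sets; only the column names mirror the book review file.

```python
import pandas as pd

# Toy stand-in for the book review data (hypothetical rows, not the real file)
toy_df = pd.DataFrame({
    "Review": [
        "A wonderful, moving story.",
        "Dull and far too long.",
        None,
        "I loved every page of this book.",
    ],
    "Positive Review": [True, False, False, True],
})

# Data type of each column
print(toy_df.dtypes)

# Number of missing values per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# Class proportions; values near 0.5 suggest a balanced binary label
class_ratios = toy_df["Positive Review"].value_counts(normalize=True)
print(class_ratios)

# Review length in characters, summarized with describe()
review_lengths = toy_df["Review"].str.len()
print(review_lengths.describe())
```

For a text data set, review length is one simple numeric summary that `describe()` can work with; the missing-value count shows whether any reviews need to be dropped or filled before vectorization.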


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
# Create labeled examples
y = df['Positive Review']
X = df['Review']
print('The size of X: ' + str(X.shape))
X.head()

The size of X: (1973,)


0    This was perhaps the best of Johannes Steinhof...
1    This very fascinating book is a story written ...
2    The four tales in this collection are beautifu...
3    The book contained more profanity than I expec...
4    We have now entered a second time of deep conc...
Name: Review, dtype: object

In [5]:
# Check whether class imbalance exists
class_distribution = y.value_counts()
print('Class Distribution: ' + str(class_distribution))

# Calculate class ratios
class_ratios = class_distribution / len(df)
print('\nClass Ratios: ' + str(class_ratios))

# Class imbalance occurs when there is a significant disparity in the number 
# of instances between different classes. 
# In this case, the class ratios are very close (0.503 and 0.497), 
# so the class distribution is balanced.

Class Distribution: False    993
True     980
Name: Positive Review, dtype: int64

Class Ratios: False    0.503294
True     0.496706
Name: Positive Review, dtype: float64


In [6]:
# Take a look at an example of a positive and a negative review
print('A Positive Review:\n', X[2])
print('A Negative Review:\n', X[3])

A Positive Review:
 The four tales in this collection are beautifully composed; they are art, not just stories.  Each story is deep in its unique complexities.  Each one has plots and subplots and paints an impeccable image of the story upon the reader's mind.  And when I look back upon the book as a whole, upon the adventurous stories, the excitement and emotion that the author presents so exquisitely, I can't help but be extremely impressed.

A Negative Review:
 The book contained more profanity than I expected to read in a book by Rita Rudner.  I had expected more humor from a comedienne.  Too bad, because I really like her humor



## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [7]:
# Split labeled examples into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, 
                                                    random_state=1234)
X_train.head()

500     There is a reason this book has sold over 180,...
1047    There is one thing that every cookbook author ...
1667    Being an engineer in the aerospace industry I ...
1646    I have no idea how this book has received the ...
284     It is almost like dream comes true when I saw ...
Name: Review, dtype: object

In [8]:
# Implement TF-IDF vectorizer to transform text
# 1. Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer to X_train
tfidf_vectorizer.fit(X_train)

# 3. Print the first 50 items in the vocabulary
print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

      
# 4. Transform *both* the training and test data using the fitted vectorizer and its transform() method
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


# 5. Print the transformed training matrix
print(X_train_tfidf.todense())

# 6. Print the transformed test matrix
print(X_test_tfidf.todense())


Vocabulary size 18558: 
[('there', 16673), ('is', 9043), ('reason', 13533), ('this', 16714), ('book', 2189), ('has', 7803), ('sold', 15423), ('over', 11793), ('180', 73), ('000', 1), ('copies', 3867), ('it', 9076), ('gets', 7240), ('right', 14207), ('to', 16835), ('the', 16627), ('point', 12568), ('accompanies', 444), ('each', 5372), ('strategy', 15943), ('with', 18277), ('visual', 17844), ('aid', 750), ('so', 15386), ('you', 18497), ('can', 2604), ('get', 7239), ('mental', 10534), ('picture', 12402), ('in', 8491), ('your', 18501), ('head', 7844), ('further', 7051), ('its', 9088), ('section', 14743), ('on', 11601), ('analyzing', 974), ('stocks', 15886), ('and', 984), ('commentary', 3384), ('state', 15782), ('of', 11543), ('financial', 6568), ('statements', 15786), ('market', 10286), ('are', 1220), ('money', 10863), ('if', 8336), ('just', 9282), ('starting', 15774)]

[[0.         0.16185315 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.       
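Printing a dense TF-IDF matrix mostly shows zeros, because each review uses only a small part of the vocabulary. As an alternative way to inspect the result, the sketch below reports the shape and the fraction of nonzero entries of the sparse matrix. The corpus and the `toy_` names are made up for illustration and are not part of the lab data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus standing in for X_train and X_test
toy_train = ["the book was great", "the plot was dull", "great plot great book"]
toy_test = ["a dull book"]

# Fit on the training text only, then transform both sets
toy_vectorizer = TfidfVectorizer()
toy_train_tfidf = toy_vectorizer.fit_transform(toy_train)
toy_test_tfidf = toy_vectorizer.transform(toy_test)

# Shape is (number of documents, vocabulary size)
n_rows, n_cols = toy_train_tfidf.shape

# nnz is the number of stored nonzero entries in the sparse matrix
density = toy_train_tfidf.nnz / (n_rows * n_cols)

print("Training matrix shape:", toy_train_tfidf.shape)
print("Fraction of nonzero entries: {:.3f}".format(density))
print("Test matrix shape:", toy_test_tfidf.shape)
```

The test matrix has the same number of columns as the training matrix because the vocabulary is fixed when the vectorizer is fitted; words that appear only in the test text are ignored.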

In [21]:
# Fit a logistic regression model to the transformed training data 
# and evaluate the model
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# Make predictions on the transformed test data using the predict_proba() method
probability_predictions = model.predict_proba(X_test_tfidf)[:,1]
loss = log_loss(y_test, probability_predictions)
print('Log loss of the Logistic Regression model: ' + str(loss))

# Make predictions on the transformed test data using the predict() method
class_label_predictions = model.predict(X_test_tfidf)
acc_score = accuracy_score(y_test, class_label_predictions)
print('Accuracy score of the Logistic Regression model: ' + str(acc_score))


# Compute the area under the ROC curve for the test data.
auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

# Compute the 5-fold cross-validation AUC on the training data
cv_scores = cross_val_score(model, X_train_tfidf, y_train, cv=5, scoring='roc_auc')
mean_cv_scores = np.mean(cv_scores)
print('Mean Cross Validation on the test data:', mean_cv_scores)

len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

# Get a glimpse of the features (note: the slice [1:5] skips the first entry and returns four items)
first_five =  list(tfidf_vectorizer.vocabulary_.items())[1:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'
     .format(first_five))


Log loss of the Logistic Regression model: 0.6688500721532677
Accuracy score of the Logistic Regression model: 0.6072874493927125
AUC on the test data: 0.6435
Mean Cross Validation on the test data: 0.5779984229539844
The size of the feature space: 9
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('this', 7), ('book', 1), ('it', 4), ('to', 8)]:
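The cross-validation above runs on a matrix whose vocabulary and IDF weights were fitted on the whole training set, so each validation fold has already influenced the features. One common way to avoid this is to put the vectorizer and the model in a scikit-learn `Pipeline`, so the vectorizer is refitted inside every fold. The sketch below assumes a small made-up corpus (`toy_reviews`, `toy_labels`) and is an illustration, not the method used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical reviews: six positive followed by six negative
toy_reviews = [
    "I loved this book",
    "a great book with a great plot",
    "loved the characters and the plot",
    "this book was great",
    "wonderful story, I loved it",
    "great characters and a wonderful story",
    "I hated this book",
    "a boring book with a dull plot",
    "hated the characters and the plot",
    "this book was boring",
    "dull story, I hated it",
    "boring characters and a dull story",
]
toy_labels = [True] * 6 + [False] * 6

# The pipeline refits the vectorizer on the training portion of each fold
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=200))

# Pass the raw text; cross_val_score handles the per-fold fit and transform
toy_cv_scores = cross_val_score(
    toy_pipeline, toy_reviews, toy_labels, cv=3, scoring="roc_auc"
)
print("Per-fold AUC:", toy_cv_scores)
print("Mean cross-validation AUC:", toy_cv_scores.mean())
```

With the real data, the same pattern would take `X_train` and `y_train` in place of the toy lists, leaving the test set untouched until the final evaluation.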


In [10]:
# Spot-check the model on a single test example
print('Review #1:\n')
print(X_test.to_numpy()[238])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[238]))

print('Actual: Is this a good review?{}\n'.format(y_test.to_numpy()[238]))

Review #1:

I have read other books by Alesia Holliday and enjoyed them so I looked forward to reading this book.  Unfortunately, I could not get any farther than the first 25 pages.  I even tried diving in further into the book to see if it got better and I still could not read more than 5 pages without turning away.  The best I can do to pin down why I dislike it so much is to say that it tries too hard.  No character seems to even approach reality.  They are all, including the main character and her love interest, over the top


Prediction: Is this a good review? False

Actual: Is this a good review?False



In [23]:
# Experiment with Different Document Frequency Values and Analyze the Results
for min_df in [1, 10, 100, 1000]:
    print('\nMin Document Frequency Value: {0}'.format(min_df))
    
    #1. Create a TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1,2))
    
    #2. Fit the vectorizer to X_train
    tfidf_vectorizer.fit(X_train)

    #3. Transform the training and testing data
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    
    #4. Fit a logistic regression model to the transformed training data 
    # and evaluate the model
    model = LogisticRegression(max_iter=200)
    model.fit(X_train_tfidf, y_train)
    
    #5. Make predictions on the transformed test data
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]
    loss = log_loss(y_test, probability_predictions)
    print('Log loss of the Logistic Regression model: ' + str(loss))
    
    #6. Make predictions on the transformed test data using the predict() method
    class_label_predictions = model.predict(X_test_tfidf)
    acc_score = accuracy_score(y_test, class_label_predictions)
    print('Accuracy score of the Logistic Regression model: ' + str(acc_score))
    
    #7. Compute the Area under the ROC curve for the test data
    auc = roc_auc_score(y_test, probability_predictions)
    print('AUC on the test data: {:.4f}'.format(auc))
    
    #8. Compute the 5-fold cross-validation AUC on the training data
    cv_scores = cross_val_score(model, X_train_tfidf, y_train, cv=5, scoring='roc_auc')
    mean_cv_scores = np.mean(cv_scores)
    print('Mean Cross Validation on the test data:', mean_cv_scores)

    #9. Compute the size of the resulting feature space
    len_feature_space = len(tfidf_vectorizer.vocabulary_)
    print('The size of the feature space: {0}'.format(len_feature_space))
    
    #10. Get a glimpse of the features:
    first_five = list(tfidf_vectorizer.vocabulary_.items())[1:5]
    print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}'
         .format(first_five))
    
    #11. Print the first five "stop words"
    first_five_stop = list(tfidf_vectorizer.stop_words_)[1:5]
    print('Glimpse of first 5 stop words \n{}'.format(first_five_stop))


Min Document Frequency Value: 1
Log loss of the Logistic Regression model: 0.5595754586404746
Accuracy score of the Logistic Regression model: 0.8562753036437247
AUC on the test data: 0.9268
Mean Cross Validation on the test data: 0.8963567733060099
The size of the feature space: 138486
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('is', 61671), ('reason', 97323), ('this', 120815), ('book', 18054)]
Glimpse of first 5 stop words 
[]

Min Document Frequency Value: 10
Log loss of the Logistic Regression model: 0.4970517918147751
Accuracy score of the Logistic Regression model: 0.8380566801619433
AUC on the test data: 0.9195
Mean Cross Validation on the test data: 0.898919599323726
The size of the feature space: 4023
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('is', 1687), ('reason', 2699), ('this', 3396), ('book', 464)]
Glimpse of first 5 stop words 
['sagan fans', 'almoxt', 'grisham', 'kanji previously']

Min Do

In [13]:
#1. The model with min_df=1 has the highest test AUC (0.9268) and the highest
#   accuracy (0.8563). Its mean cross-validation AUC (0.8964) is also high,
#   which suggests it generalizes across folds of the training data.
#   Its log loss (0.5596) is the weaker result: it is higher than the log loss
#   for min_df=10, so its predicted probabilities fit the test labels less well.
#2. The model with min_df=10 follows closely, with a high test AUC (0.9195), a
#   slightly higher mean cross-validation AUC (0.8989), and a lower log loss
#   (0.4971), using a much smaller feature space (4,023 vs. 138,486 features).
#3. The models with min_df=100 and min_df=1000 have decreasing AUC and 
#   cross-validation scores, indicating less power to discriminate 
#   between the classes. 
#4. Overall, based on the test outputs, the model with min_df=1 seems to be 
#   the best performer on AUC and accuracy, although min_df=10 is close on 
#   both and is better on mean cross-validation AUC and log loss.
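
The comparison above picks `min_df` by looking at test-set scores. A common alternative is to choose it by cross-validation on the training data only, for example with `GridSearchCV` over a pipeline. The sketch below assumes a made-up corpus and a reduced grid (`[1, 2]`) so that it runs on a dozen short reviews; the names `toy_reviews`, `toy_labels`, and `toy_search` are illustrative and not part of the lab.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical reviews: six positive followed by six negative
toy_reviews = [
    "I loved this book",
    "a great book with a great plot",
    "loved the characters and the plot",
    "this book was great",
    "wonderful story, I loved it",
    "great characters and a wonderful story",
    "I hated this book",
    "a boring book with a dull plot",
    "hated the characters and the plot",
    "this book was boring",
    "dull story, I hated it",
    "boring characters and a dull story",
]
toy_labels = [True] * 6 + [False] * 6

# Named steps let the grid address the vectorizer's min_df parameter
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=200)),
])

# Candidate values are kept small because the toy corpus is tiny
param_grid = {"tfidf__min_df": [1, 2]}

# Each candidate is scored by 3-fold cross-validated AUC on the training text
toy_search = GridSearchCV(toy_pipeline, param_grid, cv=3, scoring="roc_auc")
toy_search.fit(toy_reviews, toy_labels)

print("Best min_df:", toy_search.best_params_["tfidf__min_df"])
print("Mean AUC per candidate:", toy_search.cv_results_["mean_test_score"])
```

On the real data, the grid could hold the same values tried in the loop, and the held-out test set would then be scored once with `toy_search.best_estimator_` after the choice is made.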