# Lab 8: Define and Solve an ML Problem of Your Choosing

In [7]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind, formulate a project plan, and implement that plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [8]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1) I have chosen the Book Review data set (`bookReviewsData.csv`).
2) I will predict whether a given book review is positive. The label is the `Positive Review` column.
3) This is a supervised learning problem, specifically binary classification.
4) The only feature is the review text that readers leave, stored in the `Review` column.
5) A publishing company could use this model to identify which books readers review favorably. That would help it avoid losses on poorly received titles and invest in books with good reviews.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [9]:
df.isnull().sum()

Review             0
Positive Review    0
dtype: int64
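
Since this is a classification problem, the label balance is worth checking alongside missing values. The following is a minimal sketch on a small hypothetical DataFrame (`toy` and its rows are made up for illustration); in the notebook, the same call would be applied to `df['Positive Review']`.

```python
import pandas as pd

# Hypothetical stand-in for df; the rows below are invented for illustration
toy = pd.DataFrame({
    "Review": ["Loved it", "Too slow", "Great read", "Wonderful"],
    "Positive Review": [True, False, True, True],
})

# Share of each class in the label column; values far from 0.5 suggest imbalance
class_share = toy["Positive Review"].value_counts(normalize=True)
print(class_share)
```

If one class dominates, accuracy alone can be misleading, which is one reason to evaluate with AUC.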

In [14]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.6.7-py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 15.3 MB/s            
Collecting regex>=2021.8.3
  Downloading regex-2023.8.8-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB)
     |████████████████████████████████| 759 kB 49.6 MB/s            
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     |████████████████████████████████| 78 kB 1.6 MB/s             
Collecting importlib-resources
  Downloading importlib_resources-5.4.0-py3-none-any.whl (28 kB)
Collecting zipp>=3.1.0
  Downloading zipp-3.6.0-py3-none-any.whl (5.3 kB)
Installing collected packages: zipp, importlib-resources, tqdm, regex, nltk
Successfully installed importlib-resources-5.4.0 nltk-3.6.7 regex-2023.8.8 tqdm-4.64.1 zipp-3.6.0


In [37]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import re

nltk.download('stopwords')
nltk.download('punkt')

def preprocess_text(text):
    # Tokenize: Split the text into words (tokens)
    words = nltk.word_tokenize(text)
    
    # Lowercase: Convert all words to lowercase
    words = [word.lower() for word in words]
    
    # Remove stopwords: Remove common words that don’t add much meaning
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # Keep only tokens that begin with a letter; this drops punctuation tokens
    # and tokens that start with a digit or symbol
    words = [word for word in words if re.match(r'[a-zA-Z]', word)]
    
    return ' '.join(words)


# Apply the preprocessing function to the 'Review' column
df['Processed_Review'] = df['Review'].apply(preprocess_text)

# Show the first few rows of the dataframe to inspect the processed reviews
print(df[['Review', 'Processed_Review']].head())


[nltk_data] Downloading package stopwords to /home/codio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/codio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


                                              Review  \
0  This was perhaps the best of Johannes Steinhof...   
1  This very fascinating book is a story written ...   
2  The four tales in this collection are beautifu...   
3  The book contained more profanity than I expec...   
4  We have now entered a second time of deep conc...   

                                    Processed_Review  
0  perhaps best johannes steinhoff books since de...  
1  fascinating book story written form numerous l...  
2  four tales collection beautifully composed art...  
3  book contained profanity expected read book ri...  
4  entered second time deep concern science math ...  
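
One detail of the filter in `preprocess_text` is that `re.match(r'[a-zA-Z]', word)` only tests the first character of each token. The sketch below, using a hypothetical token list, compares that check with `str.isalpha()`, which requires every character to be a letter. It is meant as an illustration of the difference, not as a claim about which filter suits this data set better.

```python
import re

# Hypothetical tokens, similar to what a tokenizer might produce
tokens = ["book", "n't", "2nd", "...", "well-written", "a1"]

# Check used in preprocess_text: the token must start with a letter
starts_with_letter = [t for t in tokens if re.match(r'[a-zA-Z]', t)]

# Stricter alternative: every character must be a letter
fully_alphabetic = [t for t in tokens if t.isalpha()]

print(starts_with_letter)
print(fully_alphabetic)
```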


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

There is only one feature, the `Review` column, so I will keep it. The data preparation techniques I used consist of text cleaning: tokenizing the text, converting it to lowercase, removing stopwords, and removing non-alphabetic tokens. I then apply TF-IDF vectorization to transform the text into numerical feature vectors. The model I will use is logistic regression. I will fit the model to the training data, make predictions, and evaluate its performance with the area under the ROC curve (AUC). Finally, I will experiment with different values of TF-IDF hyperparameters such as `min_df` and `max_df` to find the best-performing configuration.
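
One possible way to carry out the tuning step of this plan is a scikit-learn `Pipeline` combined with `GridSearchCV`, so that each candidate `min_df` is scored by cross-validation on the training data rather than on the test set. The sketch below is an assumption about how that could look, not the method used later in this notebook; `toy_reviews` and `toy_labels` are hypothetical stand-ins for the training split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train
toy_reviews = [
    "great book loved it",
    "loved the story great read",
    "great characters loved the ending",
    "wonderful book great writing",
    "loved this wonderful story",
    "great wonderful read",
    "boring book bad writing",
    "bad story boring read",
    "boring characters bad ending",
    "terrible book boring plot",
    "bad terrible story",
    "boring terrible read",
]
toy_labels = [1] * 6 + [0] * 6

# The vectorizer is refit inside every fold, so validation text never
# influences the vocabulary or IDF weights
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=200)),
])

# Candidate values are illustrative; the real grid would be chosen for the data
param_grid = {"tfidf__min_df": [1, 2]}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(toy_reviews, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

With this design the test set is touched only once, after the hyperparameters are fixed, which gives a less optimistic estimate of how the model generalizes.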

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [39]:
y = df['Positive Review'] 
X = df['Review']

X.shape

(1973,)

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=1234)

X_train.head()

500     There is a reason this book has sold over 180,...
1047    There is one thing that every cookbook author ...
1667    Being an engineer in the aerospace industry I ...
1646    I have no idea how this book has received the ...
284     It is almost like dream comes true when I saw ...
Name: Review, dtype: object
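
The split above is not stratified. If the label turned out to be imbalanced, passing `stratify` to `train_test_split` would keep the class proportions the same in both splits. A minimal sketch on hypothetical data (`reviews` and `labels` are invented placeholders):

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 8 positive and 4 negative placeholder reviews
reviews = ["review {}".format(i) for i in range(12)]
labels = [True] * 8 + [False] * 4

# stratify=labels preserves the 2:1 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, train_size=0.75, stratify=labels, random_state=1234
)

print(sum(y_tr), len(y_tr))
print(sum(y_te), len(y_te))
```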

In [42]:
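for min_df in [1, 3, 6, 8, 10, 100, 1000]:

    print('\nMin Document Frequency Value: {0}'.format(min_df))

    # Build a TF-IDF vectorizer over unigrams and bigrams that ignores terms
    # appearing in fewer than min_df training documents
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1,2))

    # Learn the vocabulary and IDF weights from the training data only
    tfidf_vectorizer.fit(X_train)

    # Transform the training and test reviews with the fitted vectorizer
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # Fit a logistic regression model on the TF-IDF training features
    model = LogisticRegression(max_iter=200)
    model.fit(X_train_tfidf, y_train)

    # Predicted probability of the positive class for each test review
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

    # Area under the ROC curve on the test set
    auc = roc_auc_score(y_test, probability_predictions)
    print('AUC on the test data: {:.4f}'.format(auc))

    # Number of terms (features) kept in the vocabulary
    len_feature_space = len(tfidf_vectorizer.vocabulary_)
    print('The size of the feature space: {0}'.format(len_feature_space))

    # Peek at the mapping from term to column index
    first_five = list(tfidf_vectorizer.vocabulary_.items())[0:5]
    print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))

    # Peek at terms excluded from the vocabulary (here, by the min_df cutoff)
    first_five_stop = list(tfidf_vectorizer.stop_words_)[0:5]
    print('Glimpse of first 5 stop words \n{}:'.format(first_five_stop))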
    


Min Document Frequency Value: 1
AUC on the test data: 0.9268
The size of the feature space: 138486
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('there', 119835), ('is', 61671), ('reason', 97323), ('this', 120815), ('book', 18054)]:
Glimpse of first 5 stop words 
[]:

Min Document Frequency Value: 3
AUC on the test data: 0.9280
The size of the feature space: 17684
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('there', 15005), ('is', 7641), ('reason', 11976), ('this', 15162), ('book', 2272)]:
Glimpse of first 5 stop words 
['most others', 'israel will', 'swath of', 'naive catherine', 'they used']:

Min Document Frequency Value: 6
AUC on the test data: 0.9258
The size of the feature space: 7337
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('there', 6166), ('is', 3114), ('reason', 4913), ('this', 6239), ('book', 879)]:
Glimpse of first 5 stop words 
['most others', 'israel will'
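
The sharp drop in feature-space size in the output above follows from how `min_df` works: a term is kept only if it appears in at least `min_df` documents. A small sketch on a hypothetical four-document corpus (unigrams only, to keep the counts easy to follow) shows the same effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus; "good" and "book" each appear in two documents,
# every other word appears in one
docs = ["good book", "good story", "bad book", "great read"]

sizes = {}
for min_df in [1, 2]:
    vec = TfidfVectorizer(min_df=min_df)
    vec.fit(docs)
    sizes[min_df] = len(vec.vocabulary_)
    print(min_df, sorted(vec.vocabulary_))

print(sizes)
```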

In [34]:
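# class_label_predictions is not defined in the cells above. Assumption: it holds
# hard labels from the most recently fitted `model` and its `X_test_tfidf`.
class_label_predictions = model.predict(X_test_tfidf)

print('Review #1:\n')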
print(X_test.to_numpy()[13])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[13])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[13]))

Review #1:

This is the best review of the JFK assassination that I have seen. There is still a large "assassination industry," which can afford to find documents that you haven't read and charge you with ignorance if you haven't read them, and find 15 more if you read them. This gives a common-sense overview that seems quite reasonable. I trust it. I am always willing to consider other opinions, but the balance of evidence has always indicated that Oswald acted alone.

It would be nice to have a new edition of this book..


Prediction: Is this a good review? True

Actual: Is this a good review? True



In [35]:
print('Review #2:\n')
print(X_test.to_numpy()[250])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[250])) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[250]))

Review #2:

A wonderful wonderful book.Although the book was written some 150 year ago, Mary Eliza Rogers takes you into the intimacies of daily life in Palestine in the 1850's as if it was occurring today. She writes from her  heart with honesty,integrity and a clear mind. And although written at a  time of Victorian prejudicies and colonialism she writes without bias or  judgement. From her beautiful and colourful descriptions one can envisage  the Holy Land as it was before undergoing the process of modernisation and  change. For anyone who has any attachment to this land it is a truly  wonderful and personal experience to read this book


Prediction: Is this a good review? True

Actual: Is this a good review? True

