# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

<Double click this Markdown cell to make it editable, and record your answers here.>

#### Data Set Chosen

Book Review data set: `bookReviewsData.csv`

#### Prediction Task

* Prediction objective: predict whether a book review is positive or negative.
* Label: `Positive Review`, a boolean column (`True` for a positive review, `False` otherwise).

#### Learning Type and Problem Classification

* Supervised or unsupervised: supervised, because every review comes with a known label.
* Type of problem: classification, specifically binary classification (positive or not positive).

#### Features

* `Review`: the raw review text, which is the only feature column in the file.
* Review length: a numeric feature that can be derived from the review text.

The `Unnamed: 0` column is a row index and is not used as a feature.

#### Importance of the Problem

Predicting the sentiment of book reviews helps publishers and authors understand how well a book is being received by its audience. It can inform marketing strategies and editorial decisions, and improve future book offerings. For example, if the model flags a large share of incoming reviews for a title as negative, a company can investigate the underlying causes and make improvements to meet reader expectations.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
import pandas as pd
import os

# File path for the book reviews data set
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")

# Load the data set
df = pd.read_csv(bookReviewDataSet_filename)

# Display the first few rows
df.head()


Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True
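The inspection techniques listed in the task description (data types, missing values, class balance) can be sketched as follows. This is a minimal illustration rather than the notebook's actual analysis: the `sample` DataFrame is a hypothetical stand-in with the same two columns as the file, and in the notebook the loaded `df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use the loaded `df` instead.
sample = pd.DataFrame({
    "Review": [
        "This was perhaps the best book I have read this year.",
        "The book contained more profanity than I expected.",
        "A fascinating story, beautifully written.",
        None,  # a missing review, to show how missingness is counted
    ],
    "Positive Review": [True, False, True, False],
})

# Data type of each column
print(sample.dtypes)

# Number of missing values per column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Class balance: counts per label value and the share of positive reviews
print(sample["Positive Review"].value_counts())
positive_share = sample["Positive Review"].mean()
print(positive_share)

# Review length in characters (a missing review counts as length 0)
review_length = sample["Review"].fillna("").str.len()
print(review_length.describe())
```

A positive share far from 0.5 would suggest class imbalance, which is the case the class-weighting idea in the project plan is meant to handle.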


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

<Double click this Markdown cell to make it editable, and record your answers here.>

#### New Feature List

After inspecting the data, I have decided to focus on the following features:

Features to keep:

* `Review`: the text of the book review.
* `review_length`: an engineered feature holding the length of the review text.

Features to remove:

* `Unnamed: 0`: a row index that carries no information about the label.
* Any other features that are irrelevant to the prediction task or contain too many missing values that can't be reasonably filled.

#### Data Preparation Techniques

To prepare the data for modeling, I will apply the following data preparation techniques:

* Handling missing values:
    * For numerical columns, fill missing values with the mean of the column.
    * For categorical columns, fill missing values with the mode or a placeholder value like 'Unknown'.
* Feature engineering:
    * Text data: convert the review text (the `Review` column) into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.
    * Review length: create a new feature `review_length` by calculating the length of each review.
    * Categorical data: apply one-hot encoding to any categorical feature that is added later; the columns shown by `df.head()` include none.
* Outlier detection and removal: use box plots to detect outliers in numerical features and handle them by capping or removing them.
* Scaling features: apply standard scaling to numerical features so that they are on a comparable scale, which matters for models that are sensitive to feature magnitudes, such as logistic regression and SVMs.
* Addressing class imbalance: use techniques such as class weighting during model training to address any class imbalance in the target variable.

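The two feature engineering steps above (TF-IDF vectorization and review length) can be sketched as follows. The three reviews are hypothetical examples, and `stop_words="english"` is an assumed setting rather than a choice made in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example reviews; in the notebook these would come from df["Review"].
reviews = pd.Series([
    "A wonderful, moving story.",
    "Dull plot and flat characters.",
    "Wonderful characters and a gripping plot.",
])

# Engineered numeric feature: number of characters in each review
review_length = reviews.str.len()

# TF-IDF turns each review into a sparse vector of term weights.
# stop_words="english" (an assumption here) drops common words such as "and".
vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = vectorizer.fit_transform(reviews)

print(X_tfidf.shape)                  # (number of reviews, vocabulary size)
print(sorted(vectorizer.vocabulary_)) # the terms that were kept
```

In a real run the vectorizer should be fit on the training split only and then applied to the test split with `transform`, so that no information from the test reviews leaks into the features.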
#### Model Selection

The primary model to be used is Logistic Regression, due to its simplicity and interpretability for binary classification problems. Other models may also be considered for comparison and improvement purposes:

* Logistic Regression: as a baseline model.
* Support Vector Machines (SVM): for potentially better performance on high-dimensional data.
* Random Forest: for robustness and ability to handle feature interactions.
* Neural Networks: for potentially capturing more complex patterns in the data.

#### Model Training and Evaluation Plan

Model training:

* Split the data into training and testing sets (e.g., 75% training, 25% testing).
* Train the Logistic Regression model using the training set, applying techniques such as class weighting to handle any imbalance in the target variable.
* Train additional models for comparison if needed.

Performance analysis:

* Evaluate the model's performance using metrics such as accuracy, precision, recall, and F1 score.
* Use cross-validation to check that the model generalizes well to new data and is not overfitting to the training data.

Model improvement:

* Analyze model performance metrics and identify areas for improvement.
* Tune hyperparameters using techniques such as grid search or random search.
* Experiment with feature selection and engineering to identify the most predictive features.
* Consider ensemble methods (e.g., bagging, boosting) to combine multiple models and improve performance.

Final model selection and validation:

* Select the best-performing model based on validation metrics.
* Perform a final evaluation on the test set to confirm the model's performance.
* Document and interpret the final model's predictions and their implications for understanding book review sentiments.

#### Timeline

* Week 1: Data exploration and cleaning, feature engineering, initial model training.
* Week 2: Model evaluation, hyperparameter tuning, model comparison.
* Week 3: Final model selection, validation, and documentation.

By following this plan, I aim to build a robust and accurate model to predict the sentiment of book reviews, providing valuable insights for publishers and authors.
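The baseline described in the plan (a 75/25 split, TF-IDF features, and a class-weighted Logistic Regression) might look like the sketch below. The twelve reviews in `toy` are hypothetical placeholders so that the example runs without the data file. The `random_state` value and the pipeline layout are assumptions, not results from this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews; in the notebook, use df["Review"] and df["Positive Review"].
positive = [
    "A wonderful and moving story.",
    "I loved every chapter of this book.",
    "Beautifully written and very engaging.",
    "An excellent read with great characters.",
    "Fascinating, insightful and well researched.",
    "One of the best books I have read.",
]
negative = [
    "Dull plot and flat characters.",
    "I could not finish this boring book.",
    "Poorly written and full of errors.",
    "A disappointing and tedious read.",
    "The story was confusing and far too long.",
    "One of the worst books I have read.",
]
toy = pd.DataFrame({
    "Review": positive + negative,
    "Positive Review": [True] * 6 + [False] * 6,
})

X = toy["Review"]
y = toy["Positive Review"].astype(int)  # cast True/False to 1/0 for a readable report

# 75% training, 25% testing, keeping the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1234
)

# The pipeline fits TF-IDF on the training reviews only, then trains the classifier.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)

# Accuracy, precision, recall and F1 on the held-out reviews
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=0))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the test reviews out of the TF-IDF vocabulary, which is why the split happens before any fitting.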

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [4]:
# Importing necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

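# Machine learning packages (supervised text classification, per the project plan)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split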

# Neural network packages
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.optimizers import Adam

# Additional utilities
import warnings
warnings.filterwarnings('ignore')




<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [6]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

df.head()

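For the model improvement step, one option named in the project plan is a grid search with cross-validation. The sketch below illustrates that idea on hypothetical toy reviews. The parameter grid, the 3-fold setting, and the F1 scoring choice are assumptions for illustration, not tuned values from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; in the notebook, use the training split of the book reviews.
reviews = [
    "A wonderful and moving story.",
    "I loved every chapter of this book.",
    "Beautifully written and very engaging.",
    "An excellent read with great characters.",
    "Fascinating, insightful and well researched.",
    "One of the best books I have read.",
    "Dull plot and flat characters.",
    "I could not finish this boring book.",
    "Poorly written and full of errors.",
    "A disappointing and tedious read.",
    "The story was confusing and far too long.",
    "One of the worst books I have read.",
]
labels = [1] * 6 + [0] * 6  # 1 = positive review, 0 = not positive

# Named steps let the grid address parameters as "<step>__<parameter>".
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: unigrams vs. unigrams plus bigrams, and three
# regularization strengths for the classifier.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold stratified cross-validation, scoring each candidate by F1
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(reviews, labels)

print(search.best_params_)
print(search.best_score_)
```

After the search, `search.best_estimator_` holds the pipeline refit on all of the data passed to `fit`, and it can be evaluated once on the held-out test set.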
