# Lab 8: Define and Solve an ML Problem of Your Choosing

In [6]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [7]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)
df.head()

| | Review | Positive Review |
|---|---|---|
| 0 | This was perhaps the best of Johannes Steinhof... | True |
| 1 | This very fascinating book is a story written ... | True |
| 2 | The four tales in this collection are beautifu... | True |
| 3 | The book contained more profanity than I expec... | False |
| 4 | We have now entered a second time of deep conc... | True |


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

**Data Set Chosen:**
- Book Review data set: `bookReviewsData.csv`

**Prediction Objective:**
- Predict whether a book review is positive or not.

**Label:**
- `Positive Review` (True/False)

**Type of Learning Problem:**
- This is a supervised learning problem.
- It is a classification problem, specifically a binary classification problem.

**Features:**
- The primary feature will be the text of the review (`Review` column).
- Additional features might be derived from the text, such as:
  - Length of the review
  - Sentiment scores
  - Presence of specific keywords or phrases
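
The derived features listed above could be computed with pandas string methods. The snippet below is only a sketch on made-up reviews: the `toy` DataFrame, the `keywords` list, and the new column names are assumptions for illustration and are not part of the lab data.

```python
import pandas as pd

# Toy reviews standing in for the `Review` column (hypothetical data).
toy = pd.DataFrame({
    "Review": [
        "I loved this book. Highly recommend it!",
        "Boring and far too long.",
    ]
})

# Hypothetical keyword list, chosen only for illustration.
keywords = ["loved", "recommend", "boring"]

# Review length in characters and in whitespace-separated tokens.
toy["n_chars"] = toy["Review"].str.len()
toy["n_words"] = toy["Review"].str.split().str.len()

# One binary column per keyword (case-insensitive substring match).
lowered = toy["Review"].str.lower()
for kw in keywords:
    toy[f"has_{kw}"] = lowered.str.contains(kw, regex=False).astype(int)

print(toy.drop(columns="Review"))
```

Columns like these could be stacked next to the TF-IDF matrix, though whether they add signal beyond the text itself would need to be checked empirically.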

**Importance of the Problem:**
- A model that predicts whether a book review is positive can be highly valuable for companies in the book publishing and retail industry. By automatically categorizing reviews, companies can:
  - Quickly identify and highlight positive reviews, which can improve marketing efforts and increase sales.
  - Monitor and address negative reviews promptly, enhancing customer satisfaction and loyalty.
  - Analyze trends and patterns in customer feedback to improve product offerings and services.
  - Enhance personalized recommendations by understanding customer preferences based on review sentiments.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [9]:
# Display basic information about the dataset
print(df.info())
print(df.describe())
print(df['Positive Review'].value_counts())

# Display the first few rows of the dataset
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1973 entries, 0 to 1972
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Review           1973 non-null   object
 1   Positive Review  1973 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 17.5+ KB
None
                                                   Review Positive Review
count                                                1973            1973
unique                                               1865               2
top     I have read several of Hiaasen's books and lov...           False
freq                                                    3             993
False    993
True     980
Name: Positive Review, dtype: int64


| | Review | Positive Review |
|---|---|---|
| 0 | This was perhaps the best of Johannes Steinhof... | True |
| 1 | This very fascinating book is a story written ... | True |
| 2 | The four tales in this collection are beautifu... | True |
| 3 | The book contained more profanity than I expec... | False |
| 4 | We have now entered a second time of deep conc... | True |


In [19]:
# Check for missing values and decide on an appropriate strategy (e.g., removal or imputation).
# Since this is an NLP problem, rows with a missing review can simply be removed.
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

# Drop rows with missing values
df = df.dropna()

Missing values:
 Review             0
Positive Review    0
dtype: int64


In [20]:
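# For NLP, the main feature is the text of the review.
# TF-IDF converts the text into numerical features.
# Note: X reflects df['Review'] as it is when this cell runs. If df is changed
# afterwards (cleaning, resampling), rerun this cell so X stays aligned with the labels.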
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text reviews into numerical features using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(df['Review'])
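
Fitting the vectorizer on every review before the train/test split lets the test reviews influence the vocabulary and IDF weights. One common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The sketch below uses a made-up corpus; `texts`, `labels`, and `pipe` are illustrative names, not objects from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus standing in for df['Review'] and df['Positive Review'] (hypothetical data).
texts = ["great read", "loved the plot", "wonderful characters", "a joy to read",
         "dull and slow", "hated the ending", "poorly written", "a waste of time"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so test reviews never shape the vocabulary or IDF weights.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", LogisticRegression()),
])
pipe.fit(text_train, y_train)    # vectorizer is fitted on training text only
preds = pipe.predict(text_test)  # test text is only transformed, never fitted
print(preds)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside each cross-validation fold.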



In [21]:
# Addressing class imbalance
# Check class distribution
class_distribution = df['Positive Review'].value_counts()
print("Class distribution:\n", class_distribution)

# If there's a significant imbalance, we can consider undersampling the majority class
if class_distribution.min() / class_distribution.max() < 0.5:
    # Separate the majority and minority classes
    majority_class = df[df['Positive Review'] == class_distribution.idxmax()]
    minority_class = df[df['Positive Review'] == class_distribution.idxmin()]
    
    # Undersample the majority class
    majority_class_undersampled = majority_class.sample(len(minority_class), random_state=42)
    
    # Combine the undersampled majority class with the minority class
    df = pd.concat([majority_class_undersampled, minority_class])
    
    # Shuffle the data to mix the classes
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)


Class distribution:
 False    993
True     980
Name: Positive Review, dtype: int64
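
With the counts printed above, the minority-to-majority ratio is well above the 0.5 threshold used in the cell, so the undersampling branch does not run. The arithmetic can be checked in isolation; the `counts` Series below just re-enters the printed numbers by hand.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({False: 993, True: 980})

# Ratio of the smaller class to the larger class.
ratio = counts.min() / counts.max()
needs_undersampling = ratio < 0.5

print(round(ratio, 3))       # 0.987
print(needs_undersampling)   # False
```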


In [22]:
# Data Cleaning and Preprocessing
# Preprocess text data by removing punctuation, converting to lowercase, and removing stopwords.
import string

# Define a list of common stopwords
stopwords = set([
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
    'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was',
    'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
    'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between',
    'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
    'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both',
    'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
    'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
])

# Define a function to preprocess text
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text

# Apply preprocessing
df['Review'] = df['Review'].apply(preprocess_text)
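
To see what this preprocessing does to an individual review, it can help to run it on a couple of short strings. The sketch below repeats the function with a small subset of the stopword list so it runs on its own; the sample sentences are made up.

```python
import string

# Small subset of the stopword list above, enough for this illustration.
stopwords = {"i", "it", "the", "was", "not", "don", "t"}

def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stopwords)

print(preprocess_text("I didn't like it!"))        # didnt like
print(preprocess_text("The plot was NOT great."))  # plot great
```

Two side effects are visible here. Stripping punctuation before filtering turns "didn't" into "didnt", which the `'don'` and `'t'` entries do not catch, and dropping "not" leaves "plot great" for a negative sentence. For sentiment classification it may be worth keeping negation words in the text.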


In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

# Convert the target variable to numerical format
y = df['Positive Review'].astype(int)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred))


Accuracy: 0.8151898734177215
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.80      0.81       195
           1       0.81      0.83      0.82       200

    accuracy                           0.82       395
   macro avg       0.82      0.81      0.82       395
weighted avg       0.82      0.82      0.82       395

ROC-AUC Score: 0.8150000000000001
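
The ROC-AUC above is computed from hard 0/1 predictions, which reduces it to the mean of the two per-class recalls (here (0.80 + 0.83) / 2 ≈ 0.815). ROC-AUC is normally computed from scores such as `model.predict_proba(X_test)[:, 1]`. The sketch below shows the difference on made-up labels and probabilities; all values are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities for the positive class (hypothetical values).
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.60, 0.45, 0.70, 0.90])

# Hard predictions at the default 0.5 threshold, as model.predict() would return.
y_hard = (y_prob >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_hard)  # single operating point only
auc_from_scores = roc_auc_score(y_true, y_prob)  # uses the full ranking of scores

print(round(auc_from_labels, 3))  # 0.667
print(round(auc_from_scores, 3))  # 0.889
```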


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

**1. New Feature List:**
- After inspecting the data, the primary feature we will use is:
  - `Review`: The text of the book review.
- The target variable (label) is:
  - `Positive Review`: Indicates whether the review is positive or not.
- No additional features are necessary for this text classification problem.

**2. Data Preparation Techniques:**
- **Handling Missing Values:** Remove rows with missing values in the `Review` column.
- **Text Preprocessing:** Preprocess the text data by:
  - Converting text to lowercase.
  - Removing punctuation.
  - Removing stopwords.
- **Feature Extraction:** Convert the text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).
- **Addressing Class Imbalance:** Check for class imbalance and, if necessary, undersample the majority class to balance the dataset.

**3. Model(s):**
- Suitable models for text classification include:
  - Logistic Regression
  - Support Vector Machine (SVM)
  - Random Forest
  - Naive Bayes

**4. Plan to Train, Analyze, and Improve the Model:**

- **Model Building:**
  - Split the dataset into training and test sets.
  - Train multiple models (Logistic Regression, SVM, Random Forest, Naive Bayes) on the training data.

- **Model Evaluation:**
  - Evaluate the models on the test set using metrics such as:
    - Accuracy
    - Precision
    - Recall
    - F1-Score
    - ROC-AUC
  - Compare the performance of different models to select the best one.

- **Model Improvement:**
  - Perform hyperparameter tuning using techniques like GridSearchCV to find the best parameters for the chosen model.
  - Consider using more advanced models (e.g., Gradient Boosting) if necessary.
  - Perform additional feature engineering, such as exploring n-grams or sentiment analysis, to improve model performance.
  - Validate the final model using cross-validation to ensure it generalizes well to new data.
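
The model comparison described in this plan could be organized as a loop over candidate classifiers, each scored with cross-validation. This is a sketch under assumptions: the corpus is made up, `candidates` and `results` are illustrative names, and `LinearSVC` stands in for the SVM mentioned in the plan.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the review text and labels (hypothetical data).
texts = [
    "loved every page", "a wonderful story", "great characters and plot",
    "beautifully written", "could not put it down", "highly recommend this book",
    "boring from start to finish", "a complete waste of time", "poorly written and dull",
    "hated the ending", "the plot made no sense", "would not recommend",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "logreg": LogisticRegression(),
    "linear_svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
}

# Each candidate gets its own TF-IDF step, refitted inside every CV fold.
results = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
    results[name] = scores.mean()

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the real data, the same loop would take `df['Review']` and the integer labels, and a larger `cv` value would be reasonable.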

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [25]:
# Data has already been prepared in PART 3 

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_rep)
print("ROC-AUC Score:", roc_auc)


Accuracy: 0.8455696202531645
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.83      0.84       195
           1       0.84      0.86      0.85       200

    accuracy                           0.85       395
   macro avg       0.85      0.85      0.85       395
weighted avg       0.85      0.85      0.85       395

ROC-AUC Score: 0.8453846153846153


In [26]:
# Improve Model Performance
# Define a parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear']
}

# Perform GridSearchCV to find the best parameters
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred_best = best_model.predict(X_test)
best_accuracy = accuracy_score(y_test, y_pred_best)
best_classification_rep = classification_report(y_test, y_pred_best)
best_roc_auc = roc_auc_score(y_test, y_pred_best)

print("Best Model Accuracy:", best_accuracy)
print("Best Model Classification Report:\n", best_classification_rep)
print("Best Model ROC-AUC Score:", best_roc_auc)


Best Model Accuracy: 0.8455696202531645
Best Model Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.84      0.84       195
           1       0.84      0.85      0.85       200

    accuracy                           0.85       395
   macro avg       0.85      0.85      0.85       395
weighted avg       0.85      0.85      0.85       395

Best Model ROC-AUC Score: 0.8454487179487179
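
To judge whether tuning changed anything, it may help to look at which `C` was selected and how the cross-validated score varies across the grid. The sketch below runs the same grid on synthetic numeric data from `make_classification`, which is an assumption made so the example runs on its own; in the notebook the same attributes are available on `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the TF-IDF features (hypothetical).
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "solver": ["liblinear"]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="roc_auc")
grid.fit(X_toy, y_toy)

# One row per grid point: mean and spread of the cross-validated ROC-AUC.
summary = pd.DataFrame(grid.cv_results_)[["param_C", "mean_test_score", "std_test_score"]]

print(grid.best_params_)
print(summary.sort_values("mean_test_score", ascending=False))
```

If the mean scores differ by less than their standard deviations, the choice of `C` is unlikely to matter much on the test set.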


### Commentary and Analysis

**Initial Data Preparation:**
- **Loading Data:** The dataset is loaded using `pd.read_csv()`. It contains book reviews and a target variable indicating whether the review is positive.
- **Handling Missing Values:** We checked for missing values and found none (both columns have 1,973 non-null entries), so the `dropna()` call removed no rows.
- **Text Preprocessing:** Preprocessing involved converting text to lowercase, removing punctuation, and removing common stopwords. This is essential to normalize the text and reduce noise in the data.
- **Feature Extraction:** We used a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, limited to 5,000 features with English stop words removed, to convert the text into numerical features. TF-IDF gives more weight to terms that are frequent within a review but rare across the corpus, which reduces the influence of words that appear in most reviews.

**Splitting the Data:**
- The dataset was split into training and test sets using an 80-20 split (1,578 training reviews and 395 test reviews). This leaves enough data to train the model while holding out a separate set to evaluate its performance.

**Model Training and Evaluation:**
- **Training:** We trained a Logistic Regression model on the training data. Logistic Regression is a good baseline model for binary classification problems.
- **Evaluation:** The model was evaluated on the test set using accuracy, precision, recall, F1-score, and ROC-AUC score. These metrics provide a comprehensive view of the model’s performance:
  - **Accuracy:** Measures the proportion of correctly predicted instances.
  - **Precision:** Indicates the proportion of positive identifications that were actually correct.
  - **Recall:** Measures the proportion of actual positives that were correctly identified.
  - **F1-Score:** The harmonic mean of precision and recall, providing a balance between the two.
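  - **ROC-AUC Score:** Represents the ability of the model to distinguish between positive and negative classes. A higher score indicates better performance. In this notebook it was computed from hard class predictions rather than predicted probabilities, so the reported value equals the mean of the two per-class recalls (balanced accuracy).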
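
The metric definitions above can be checked by hand against a confusion matrix. The labels below are made up so that the counts are easy to follow; they are not results from this notebook.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels and predictions (hypothetical values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 8 / 10
precision = tp / (tp + fp)                          # 3 / 4
recall = tp / (tp + fn)                             # 3 / 4
f1 = 2 * precision * recall / (precision + recall)  # 0.75

# The hand-computed values agree with scikit-learn's functions.
print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```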

**Model Improvement:**
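- **Hyperparameter Tuning:** We performed hyperparameter tuning using GridSearchCV with 5-fold cross-validation, scored by ROC-AUC. The only parameter varied was `C`, the inverse of the regularization strength, over the values 0.1, 1, 10, and 100. The solver was fixed to `liblinear`.
- **Best Model Evaluation:** The best model obtained from GridSearchCV was evaluated on the test set. Its test accuracy was identical to that of the untuned model (0.8456), and its ROC-AUC was essentially unchanged (0.84545 versus 0.84538). Per-class recall shifted by one point in each direction (0.83 to 0.84 for class 0, 0.86 to 0.85 for class 1). Over the values of `C` tried, tuning therefore did not produce a meaningful improvement, which suggests that further gains are more likely to come from different features (for example n-grams) or a different model than from regularization strength alone.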
