# Lab 8: Define and Solve an ML Problem of Your Choosing

In [28]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns



In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [29]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)  # YOUR CODE HERE

df.head()

| | Review | Positive Review |
|---|---|---|
| 0 | This was perhaps the best of Johannes Steinhof... | True |
| 1 | This very fascinating book is a story written ... | True |
| 2 | The four tales in this collection are beautifu... | True |
| 3 | The book contained more profanity than I expec... | False |
| 4 | We have now entered a second time of deep conc... | True |


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?


1. Book Review data set: `bookReviewsData.csv`
2. I will not predict a label, because this is an unsupervised learning problem. The question is: what common themes or topics do people write about in book reviews? The goal is to uncover hidden topics in the reviews using topic modeling.
3. This is an unsupervised learning problem, specifically a clustering problem, so it is neither binary nor multi-class classification.
4. The primary feature is the review text. I will extract numerical features from it using TF-IDF vectorization.
5. Understanding what people focus on in reviews (e.g., plot, characters, writing style) helps companies like Goodreads identify emerging trends or reader concerns, automatically tag reviews or cluster similar ones, and recommend books based on reader interests.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [30]:
# YOUR CODE HERE

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1973 entries, 0 to 1972
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Review           1973 non-null   object
 1   Positive Review  1973 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 17.5+ KB
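
`df.info()` confirms that neither column has missing values, but it says little about the text itself. The sketch below shows a few further checks that suit a text data set: word counts per review, duplicate reviews, and the balance of the `Positive Review` column (inspected here only, not used for modeling). The rows in `toy_df` are hypothetical stand-ins so that the cell runs on its own; in this notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the notebook's `df`
toy_df = pd.DataFrame({
    "Review": [
        "A wonderful story with vivid characters.",
        "Poorly organized and full of errors.",
        "Great reference, lots of useful information.",
        "Too short.",
    ],
    "Positive Review": [True, False, True, False],
})

# Number of whitespace-separated words in each raw review
word_counts = toy_df["Review"].str.split().str.len()
print(word_counts.describe())

# Missing and duplicated review texts
n_missing = int(toy_df["Review"].isna().sum())
n_duplicates = int(toy_df["Review"].duplicated().sum())
print("missing:", n_missing, "duplicates:", n_duplicates)

# Share of positive vs. negative reviews
label_share = toy_df["Positive Review"].value_counts(normalize=True)
print(label_share)
```

The word-count summary is a quick way to judge whether a minimum-length filter is needed and where to set it.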


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 


1. After inspecting the data, I plan to keep only the `Review` column, which contains the raw text. I will remove the `Positive Review` column since this is an unsupervised problem. I will also create new features, such as cleaned text and review length, to help with analysis.
2. To prepare the data for topic modeling, I will:
    - Clean the text by lowercasing, removing punctuation and numbers, and removing stopwords
    - Create a column for the cleaned text
    - Filter out very short reviews (fewer than 5 words)
    - Use TF-IDF vectorization to convert the text into numerical features for modeling
3. I will use KMeans clustering.
4. I will train the KMeans model on the TF-IDF matrix:
    - I will try different values for `k` (number of clusters), such as 3, 5, or 7
    - I will evaluate the clusters by:
        - Printing the top words that appear in each cluster
        - Reading example reviews from each cluster to check for patterns
    - I will improve the model by adjusting the number of clusters and TF-IDF settings (like `min_df`, `max_df`, or using bi-grams)
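
The plan to compare several values of `k` can be sketched as follows. This is one possible approach rather than part of the plan itself: it adds the silhouette score (`sklearn.metrics.silhouette_score`) as a numeric complement to reading top words and sample reviews. `toy_reviews` is a hypothetical corpus, and the candidate values are scaled down to fit it; on the full TF-IDF matrix the candidates would be 3, 5, and 7.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the cleaned review text
toy_reviews = [
    "plot characters story novel twist",
    "characters story novel ending plot",
    "novel story characters romance",
    "story plot novel suspense",
    "history war world century people",
    "history people world empire",
    "world history century politics",
    "people history war leaders",
    "theory text edition exercises proofs",
    "text theory edition examples",
    "edition theory exercises chapters",
    "theory text proofs examples",
]

X_toy = TfidfVectorizer().fit_transform(toy_reviews)

# Fit one KMeans model per candidate k and record its silhouette score
scores = {}
for k_candidate in (2, 3, 4):
    model = KMeans(n_clusters=k_candidate, n_init=10, random_state=42)
    labels = model.fit_predict(X_toy)
    scores[k_candidate] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
for k_candidate, score in scores.items():
    print(f"k={k_candidate}: silhouette={score:.3f}")
print("Highest silhouette at k =", best_k)
```

A higher silhouette score suggests tighter, better separated clusters. On sparse TF-IDF features the scores are often low, so the number is best read alongside the top words rather than on its own.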
  



## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [31]:
# YOUR CODE HERE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS




<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [32]:
# YOUR CODE HERE
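# Keep only the 'Review' column and drop 'Positive Review'.
# .copy() gives an independent DataFrame, so adding columns below
# does not trigger pandas' SettingWithCopyWarning.
df = df[['Review']].copy()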

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Removes numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Removes punctuation
    words = text.split()
    words = [word for word in words if word not in ENGLISH_STOP_WORDS]  # Removes stopwords
    return " ".join(words)

# Apply cleaning to a new column
df['cleaned_review'] = df['Review'].apply(clean_text)

# Add review length (word count) column
df['review_length'] = df['cleaned_review'].apply(lambda x: len(x.split()))

# Remove very short reviews (fewer than 5 words after cleaning).
# .copy() keeps later column assignments free of SettingWithCopyWarning.
df = df[df['review_length'] >= 5].copy()

df[['Review', 'cleaned_review', 'review_length']].head()

| | Review | cleaned_review | review_length |
|---|---|---|---|
| 0 | This was perhaps the best of Johannes Steinhof... | best johannes steinhoffs books does deal stell... | 41 |
| 1 | This very fascinating book is a story written ... | fascinating book story written form numerous l... | 120 |
| 2 | The four tales in this collection are beautifu... | tales collection beautifully composed art just... | 31 |
| 3 | The book contained more profanity than I expec... | book contained profanity expected read book ri... | 15 |
| 4 | We have now entered a second time of deep conc... | entered second time deep concern science math ... | 151 |
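
One side effect of `clean_text` is worth checking: stripping punctuation turns contractions such as "don't" into "dont", which is not in `ENGLISH_STOP_WORDS` and therefore survives the stopword filter (it shows up among the Cluster 0 top words in this notebook's output). The sketch below reproduces the function and shows one possible way to handle this by extending the stopword set. The `stop_words` parameter and the `EXTRA_STOP_WORDS` set are illustrative assumptions, not part of the original cell.

```python
import re
import string

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical additions: common contractions once the apostrophe is stripped
EXTRA_STOP_WORDS = {"dont", "doesnt", "didnt", "isnt", "wasnt", "cant", "wont"}
EXTENDED_STOP_WORDS = ENGLISH_STOP_WORDS.union(EXTRA_STOP_WORDS)

def clean_text(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, drop digits and punctuation, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [word for word in text.split() if word not in stop_words]
    return " ".join(words)

sample = "The 2 books? Don't read them!"
original = clean_text(sample)                       # default stopword list
extended = clean_text(sample, EXTENDED_STOP_WORDS)  # with the extra entries
print(original)
print(extended)
```

With the default list the contraction remains as the token `dont`; with the extended set it is removed. Whether to drop such tokens is a judgment call, since negations can carry meaning in reviews.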


In [33]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_df=0.95, min_df=5, stop_words='english')

# Fit and transform the cleaned text
X = vectorizer.fit_transform(df['cleaned_review'])

# Save the words
feature_names = vectorizer.get_feature_names_out()

In [34]:
# Set number of clusters
k = 5

# Train the KMeans model
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

# Adding cluster labels to the DataFrame
df['cluster'] = kmeans.labels_
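
Before reading the top words, it can help to check how many reviews landed in each cluster, since one very large cluster often signals a generic "everything else" group. The sketch below shows that check on a hypothetical toy corpus (`toy_df`); in this notebook the equivalent call would be `df['cluster'].value_counts()`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews covering two loose themes
toy_df = pd.DataFrame({
    "Review": [
        "story characters plot novel",
        "novel plot twist characters",
        "characters story ending novel",
        "history world war people",
        "world history empire people",
        "people war history century",
    ]
})

X_toy = TfidfVectorizer().fit_transform(toy_df["Review"])
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_toy)
toy_df["cluster"] = toy_kmeans.labels_

# Count and share of reviews per cluster
cluster_sizes = toy_df["cluster"].value_counts().sort_index()
cluster_share = cluster_sizes / len(toy_df)
print(pd.DataFrame({"n_reviews": cluster_sizes, "share": cluster_share}))
```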


In [35]:
# Show top 10 words in each cluster using the cluster centers

for cluster_num in range(k):
    print(f"\nCluster {cluster_num} Top Words:")
    center = kmeans.cluster_centers_[cluster_num]
    top_indices = center.argsort()[-10:][::-1]
    top_words = [feature_names[i] for i in top_indices]
    print(", ".join(top_words))



Cluster 0 Top Words:
great, book, information, recommend, lot, read, books, just, dont, know

Cluster 1 Top Words:
work, book, theory, written, text, edition, does, best, like, just

Cluster 2 Top Words:
book, read, good, just, really, like, want, reading, books, author

Cluster 3 Top Words:
book, life, history, read, time, people, children, new, world, just

Cluster 4 Top Words:
characters, story, book, read, like, books, novel, just, plot, really


In [36]:
# Print 2 sample reviews from each cluster
for i in range(k):
    print(f"\n Cluster {i} Sample Reviews:")
    sample_reviews = df[df['cluster'] == i]['Review'].head(2)
    for review in sample_reviews:
        print("-", review[:200], "...\n")


 Cluster 0 Sample Reviews:
- While this book is a good attempt at placing statistical topics necessary to toxicology in one spot, the mistakes are inexcusable.  Many formula are incorrect as well as text referring to the wrong ta ...

- This book is terribly organized.  I'm not sure what happened in writing this book, but it seems clear that Professor Jones did not compile this in the chronological order that it is printed in.

My ma ...


 Cluster 1 Sample Reviews:
- Lovers of Mr. Rochester beware - in this, his second book of literary puzzles, John Sutherland turns his considerable powers of literary analysis towards, amongst other things, undoing the good reputa ...

- As the name implies, this is about the elements of programming style. The examples are a bit dated (old languages, not C/C++/Java/the-next-great-language). But this isn't a *language* programming book ...


 Cluster 2 Sample Reviews:
- The book contained more profanity than I expected to read in a book by Rita Rudner

To implement my project plan, I followed the machine learning life cycle steps below:

Data Preparation:
- I kept only the `Review` column and removed the `Positive Review` label.
- I cleaned the text by lowercasing and removing numbers, punctuation, and stopwords.
- I created a review length column and removed very short reviews (fewer than 5 words after cleaning) to reduce noise.

Feature Engineering:
- I used TF-IDF vectorization to convert the cleaned reviews into numerical features.
- I used `max_df=0.95` and `min_df=5` to ignore extremely common or rare words.

Modeling:
- I used KMeans Clustering with k=5 to group similar reviews together based on their text.
- Each review was assigned to one of the 5 clusters.

Evaluation:
- I printed the top 10 words in each cluster to understand what each group is about.
- I also printed example reviews from each cluster, which helped me confirm that the clusters make sense and contain similar themes.


This process helped me uncover hidden patterns and themes in book reviews, even without any labels, by applying unsupervised machine learning techniques.