
IMPORTANT: Before you start, enter your name and student number below.

**Full Name**: Victoria Epshtein

**Student Number**: 400250225

# Predictive Analytics for Nata Supermarket

Welcome to Part III of our case assignment. In this part, we will continue working with the same dataset of **Nata Supermarket**.  Our focus here will be on performing predictive modeling tasks.

Throughout this assignment, please ensure that your results are reproducible by setting **random_state to 20**.



In [1]:
import pandas as pd

df = pd.read_csv('/content/natadata.csv')
display(df.head())

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,4/9/2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,8/3/2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10/2/2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0


# Loading and preparing the data

To begin with, load the data as a `pandas` data frame. Recall that there are missing values in the data. **Make sure to address** the following issues from Part I of the assignment before starting your analysis:

* Remove the missing values
* Remove any column of constant values
* Convert the column `Dt_Customer` to number of days the customer has been with the company.

In [2]:
# Remove rows with missing values
df.dropna(inplace=True)

# Remove columns with constant values
# Identify columns where all values are the same
constant_columns = [col for col in df.columns if df[col].nunique() == 1]
df.drop(columns=constant_columns, inplace=True)

# Convert Dt_Customer to number of days the customer has been with the company
# Parse 'Dt_Customer' to datetime. The column mixes formats (e.g. 4/9/2012 and 21-08-2013);
# with errors='coerce', entries that cannot be parsed become NaT instead of raising an error
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], errors='coerce', dayfirst=False)

# Drop rows where the Dt_Customer conversion failed (this removes the rows that became NaT)
df.dropna(subset=['Dt_Customer'], inplace=True)

# Determine the most recent date in the dataset
latest_date = df['Dt_Customer'].max()

# Calculate the number of days since the customer joined
df['Customer_Tenure_Days'] = (latest_date - df['Dt_Customer']).dt.days

# Drop the original Dt_Customer column as it's no longer needed
df.drop(columns=['Dt_Customer'], inplace=True)

display(df.head())

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,...,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Customer_Tenure_Days
0,5524,1957,Graduation,Single,58138.0,0,0,58,635,88,...,4,7,0,0,0,0,0,0,1,971
1,2174,1954,Graduation,Single,46344.0,1,1,38,11,1,...,2,5,0,0,0,0,0,0,0,125
3,6182,1984,Graduation,Together,26646.0,1,0,26,11,4,...,4,6,0,0,0,0,0,0,0,65
5,7446,1967,Master,Together,62513.0,0,1,16,520,42,...,10,6,0,0,0,0,0,0,0,453
7,6177,1985,PhD,Married,33454.0,1,0,32,76,10,...,4,8,0,0,0,0,0,0,0,488
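
The rows shown above skip index 2 and 4, whose `Dt_Customer` values (`21-08-2013`, `19-01-2014`) use a dash-separated layout. A minimal sketch of one way to keep such rows is to parse each layout explicitly and combine the results. It assumes slash-separated dates are month/day/year and dash-separated dates are day-month-year, which the data description does not confirm; `raw` is a hypothetical sample, not the assignment file.

```python
import pandas as pd

# Hypothetical sample mixing the two layouts seen in Dt_Customer
raw = pd.Series(["4/9/2012", "8/3/2014", "21-08-2013", "19-01-2014"])

# Assumption: slash dates are month/day/year, dash dates are day-month-year
slash = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
dash = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

# Use the slash parse where it succeeded, otherwise fall back to the dash parse
parsed = slash.fillna(dash)

# Tenure in days relative to the latest date in the sample
tenure_days = (parsed.max() - parsed).dt.days
print(tenure_days.tolist())
```

With this approach no row is lost to `NaT` as long as every value matches one of the two assumed layouts.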


# Question 1 (45 points)

In this question, we will predict a customer's spending amount on each product category over a two-year period. Let us assume that when we try to predict a customer's spending on a product category (such as wines), their spending on other products is not observable.

In this question and Question 2, we will focus on **wines**.

(i). **Split the data** into two data frames, X (**features**) and y (**target**).

Then, further **split the data** into **training** and **testing** sets.

In [3]:
from sklearn.model_selection import train_test_split

# Define the target variable (y) and features (X)
y = df['MntWines']
X = df.drop(columns=['MntWines'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (724, 26)
Shape of X_test: (181, 26)
Shape of y_train: (724,)
Shape of y_test: (181,)
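
Question 1 assumes that spending on other product categories is not observable when predicting wine spending. A hedged sketch of one way to reflect that, shown on a hypothetical toy frame rather than the assignment data, is to exclude every `Mnt*` column (and the `ID` identifier, which carries no predictive meaning) from the features:

```python
import pandas as pd

# Hypothetical toy frame reusing a few of the dataset's column names
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Income": [58138.0, 46344.0, 26646.0],
    "MntWines": [635, 11, 11],
    "MntFruits": [88, 1, 4],
    "MntMeatProducts": [500, 6, 20],
})

target = "MntWines"

# All spending columns share the 'Mnt' prefix
spending_cols = [c for c in toy.columns if c.startswith("Mnt")]

y_toy = toy[target]
# Drop the target, the other spending columns, and the identifier
X_toy = toy.drop(columns=spending_cols + ["ID"])

print(list(X_toy.columns))  # ['Income']
```

Applied to the notebook's `df`, the same pattern would also remove the need to fill the `Mnt*` columns with zeros when predicting for a new customer.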


(ii). **Build a machine learning pipeline** that combines the following steps to predict spending amount on wines:

* Performing one-hot encoding for the categorical features;
* A random forest model for regression.

In the random forest model, **specify** the following hyperparameters:
* Number of trees;
* Maximum depth of any tree
* Minimum number of data points required to split a node;
* Minimum number of data points in any leaf node

In addition, **fit your model** to the training data.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# Identify categorical and numerical features
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create a preprocessor for one-hot encoding categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', 'passthrough', numerical_features) # Keep numerical features as they are
    ])

# Build the Random Forest Regressor pipeline
# Specify hyperparameters as requested
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(
        n_estimators=100,      # Number of trees
        max_depth=10,          # Maximum depth of any tree
        min_samples_split=5,   # Minimum number of data points required to split a node
        min_samples_leaf=3,    # Minimum number of data points in any leaf node
        random_state=20        # For reproducibility
    ))
])

# Fit the model to the training data
rf_pipeline.fit(X_train, y_train)

print("Random Forest Regressor pipeline built and fitted successfully!")

Random Forest Regressor pipeline built and fitted successfully!


(iii). Use the model to **predict** the spending amount on wines by a customer with the following features.

| Feature | Value |
|---------|-------|
| Age | 48 |
| Education | Graduation |
| Marital_Status | Married |
| Income | 80,000 |
| Kidhome | 1 |
| Teenhome | 1 |
| Dt_Customer | 2016-10-10 |
| Recency | 43 |
| NumDealsPurchases | 2 |
| NumWebPurchases | 1 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 15 |
| NumWebVisitsMonth | 5 |
| AcceptedCmp1,2,3,4,5 | 0 |
| Complain | 0 |




(iv). **Consider** **two** measures to evaluate the model's performance on the test dataset.

Based on your computational results, how would you describe the model's performance?

In [7]:
import pandas as pd
import numpy as np

# Create a DataFrame for the new customer's features
new_customer_data = pd.DataFrame({
    'ID': [0], # ID can be arbitrary for a single prediction
    'Year_Birth': [1971], # Derived from Age 48 (2019 - 48 = 1971)
    'Education': ['Graduation'],
    'Marital_Status': ['Married'],
    'Income': [80000.0],
    'Kidhome': [1],
    'Teenhome': [1],
    'Recency': [43],
    'NumDealsPurchases': [2],
    'NumWebPurchases': [1],
    'NumCatalogPurchases': [0],
    'NumStorePurchases': [15],
    'NumWebVisitsMonth': [5],
    'AcceptedCmp1': [0],
    'AcceptedCmp2': [0],
    'AcceptedCmp3': [0],
    'AcceptedCmp4': [0],
    'AcceptedCmp5': [0],
    'Complain': [0],
    'Response': [0],
    # Add the missing 'Mnt' columns with a default value of 0
    'MntSweetProducts': [0],
    'MntFruits': [0],
    'MntMeatProducts': [0],
    'MntGoldProds': [0],
    'MntFishProducts': [0]
})

# Recalculate the `latest_date` from the original Dt_Customer column
# This is necessary because Dt_Customer was dropped from the main 'df'
temp_df_for_date = pd.read_csv('/content/natadata.csv')
temp_df_for_date['Dt_Customer'] = pd.to_datetime(temp_df_for_date['Dt_Customer'], errors='coerce', dayfirst=False)
temp_df_for_date.dropna(subset=['Dt_Customer'], inplace=True)
latest_date_ref = temp_df_for_date['Dt_Customer'].max()

# Calculate 'Customer_Tenure_Days' for the new customer
new_customer_Dt_Customer = pd.to_datetime('2016-10-10', errors='coerce')
new_customer_data['Customer_Tenure_Days'] = (latest_date_ref - new_customer_Dt_Customer).days

# Predict the spending amount for the new customer
predicted_spending = rf_pipeline.predict(new_customer_data)

print(f"Predicted spending amount on wines for the new customer: ${predicted_spending[0]:.2f}")

Predicted spending amount on wines for the new customer: $332.94


(v). **Perform** a 6-fold cross validation with a performance score of your choice.

**Note**: You may need to research how to specify the performance score for regression models.
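
A minimal sketch of a 6-fold cross-validation with `cross_val_score`, using negative mean absolute error as the score. To stay self-contained it runs on synthetic data from `make_regression` with a plain `RandomForestRegressor`; in the notebook, the pipeline and training split (`rf_pipeline`, `X_train`, `y_train`) would presumably take their place.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training data (an assumption, not the Nata dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=20)

model = RandomForestRegressor(n_estimators=20, random_state=20)

# Six shuffled folds with a fixed seed for reproducibility
cv = KFold(n_splits=6, shuffle=True, random_state=20)

# scikit-learn maximizes scores, so error metrics are exposed as negatives
neg_mae = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae

print(mae_per_fold.round(2))
print(f"Mean MAE over 6 folds: {mae_per_fold.mean():.2f}")
```

Other regression scores such as `"r2"` or `"neg_mean_squared_error"` can be passed to `scoring` in the same way.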

In [8]:
from sklearn.metrics import r2_score, mean_absolute_error

# Make predictions on the test set
y_pred = rf_pipeline.predict(X_test)

# Calculate R-squared
r_squared = r2_score(y_test, y_pred)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

print(f"R-squared on the test set: {r_squared:.4f}")
print(f"Mean Absolute Error (MAE) on the test set: {mae:.2f}")

print("\nModel performance description:")
if r_squared > 0.7:
    print("The R-squared value is high, indicating that the model explains a large proportion of the variance in wine spending, suggesting a good fit.")
elif r_squared > 0.3:
    print("The R-squared value is moderate, suggesting that the model explains some of the variance in wine spending, but there is still room for improvement.")
else:
    print("The R-squared value is low, indicating that the model does not explain much of the variance in wine spending, suggesting a poor fit.")

print(f"The Mean Absolute Error (MAE) of {mae:.2f} means, on average, the model's predictions for wine spending are off by approximately ${mae:.2f}.")

R-squared on the test set: 0.7457
Mean Absolute Error (MAE) on the test set: 102.09

Model performance description:
The R-squared value is high, indicating that the model explains a large proportion of the variance in wine spending, suggesting a good fit.
The Mean Absolute Error (MAE) of 102.09 means, on average, the model's predictions for wine spending are off by approximately $102.09.
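
To make the two measures concrete, the sketch below computes MAE and R-squared by hand on a handful of hypothetical values (not the notebook's predictions). It follows the standard definitions that `mean_absolute_error` and `r2_score` implement.

```python
import numpy as np

# Hypothetical actual and predicted wine spending
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# MAE: average absolute gap between prediction and actual value
mae_demo = np.mean(np.abs(y_true - y_hat))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print(f"MAE: {mae_demo:.2f}, R-squared: {r2_demo:.2f}")
```

MAE is expressed in the same units as the target (dollars here), while R-squared is unitless, which is why the two are usually read together.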


(vi). **Perform** hyperparameter tuning using `GridSearchCV` for the following hyperparameters:

* Number of trees: 50, 100
* Maximum depth of any tree: 5, 10, 15
* Minimum number of data points required to split a node: 3, 6
* Minimum number of data points in any leaf node: 2,4,8


Based on your computational result, **show**:
* the best hyperparameter combination
* the corresponding performance score

In addition, **retrieve** the best model (the one corresponding to the best performance score).

In [9]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for GridSearchCV
param_grid = {
    'regressor__n_estimators': [50, 100],
    'regressor__max_depth': [5, 10, 15],
    'regressor__min_samples_split': [3, 6],
    'regressor__min_samples_leaf': [2, 4, 8]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    rf_pipeline,           # The pipeline to tune
    param_grid,            # The parameter grid
    cv=6,                  # 6-fold cross-validation
    scoring='neg_mean_squared_error', # Using negative mean squared error for scoring
    n_jobs=-1,             # Use all available cores
    verbose=1
)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_rmse = np.sqrt(-grid_search.best_score_)

# Retrieve the best model
best_rf_model = grid_search.best_estimator_

print("Best hyperparameters found:")
for param, value in best_params.items():
    print(f"- {param}: {value}")

print(f"\nCorresponding best RMSE: {best_rmse:.2f}")
print("\nBest Random Forest model retrieved successfully!")

Fitting 6 folds for each of 36 candidates, totalling 216 fits
Best hyperparameters found:
- regressor__max_depth: 15
- regressor__min_samples_leaf: 2
- regressor__min_samples_split: 3
- regressor__n_estimators: 100

Corresponding best RMSE: 165.97

Best Random Forest model retrieved successfully!
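
The log line "Fitting 6 folds for each of 36 candidates, totalling 216 fits" follows directly from the grid size. The small sketch below reproduces that count and shows the conversion from a `neg_mean_squared_error` score to RMSE; the score value used is hypothetical.

```python
from itertools import product
from math import sqrt

grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, 15],
    "min_samples_split": [3, 6],
    "min_samples_leaf": [2, 4, 8],
}

# Every combination of one value per hyperparameter is one candidate
candidates = list(product(*grid.values()))
n_folds = 6
print(len(candidates), len(candidates) * n_folds)  # 36 216

# The scorer reports negated MSE; negate it and take the root to get RMSE
hypothetical_best_score = -10000.0
rmse = sqrt(-hypothetical_best_score)
print(rmse)  # 100.0
```

Note that the root of the fold-averaged MSE is not necessarily equal to the average of per-fold RMSE values, so the reported RMSE is best read as a summary of the averaged squared error.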


# Question 2 (24 points)

In this question, we will compare the performance of the best model found through `GridSearchCV` in Question 1 with the performance of the linear regression model.

(i). **Construct** a linear regression model using all relevant features and fit it to the training data.

Further, **evaluate** the model's performance on the test data and compare it with the best random forest model found in Question 1, with respect to the two performance measures considered in Question 1.

**Note**: You may use the same training and testing datasets as in Question 1.

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Identify categorical and numerical features (re-using from Q1)
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Create a preprocessor for one-hot encoding categorical features
# Re-using the preprocessor definition to ensure consistency
preprocessor_lr = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', 'passthrough', numerical_features)
    ])

# Build the Linear Regression pipeline
lr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_lr),
    ('regressor', LinearRegression())
])

# Fit the linear regression model to the training data
lr_pipeline.fit(X_train, y_train)

print("Linear Regression model built and fitted successfully!")

# Make predictions on the test set using the Linear Regression model
y_pred_lr = lr_pipeline.predict(X_test)

# Calculate R-squared for Linear Regression
r_squared_lr = r2_score(y_test, y_pred_lr)

# Calculate Mean Absolute Error (MAE) for Linear Regression
mae_lr = mean_absolute_error(y_test, y_pred_lr)

print(f"\n--- Linear Regression Model Performance on Test Set ---")
print(f"R-squared (Linear Regression): {r_squared_lr:.4f}")
print(f"Mean Absolute Error (MAE) (Linear Regression): {mae_lr:.2f}")

# Evaluate the best Random Forest model (from Q1) on the same test set.
# GridSearchCV refits the best configuration on the full training data by default (refit=True),
# so best_rf_model can be used for prediction directly.

y_pred_rf_best = best_rf_model.predict(X_test)
r_squared_rf_best = r2_score(y_test, y_pred_rf_best)
mae_rf_best = mean_absolute_error(y_test, y_pred_rf_best)

print(f"\n--- Best Random Forest Model Performance on Test Set (from Q1) ---")
print(f"R-squared (Random Forest): {r_squared_rf_best:.4f}")
print(f"Mean Absolute Error (MAE) (Random Forest): {mae_rf_best:.2f}")

print("\n--- Comparison of Model Performance ---")
if r_squared_rf_best > r_squared_lr:
    print(f"The Best Random Forest model has a higher R-squared ({r_squared_rf_best:.4f}) compared to Linear Regression ({r_squared_lr:.4f}), indicating it explains more variance.")
else:
    print(f"The Linear Regression model has a higher or similar R-squared ({r_squared_lr:.4f}) compared to Best Random Forest ({r_squared_rf_best:.4f}), indicating similar or better explained variance.")

if mae_rf_best < mae_lr:
    print(f"The Best Random Forest model has a lower MAE ({mae_rf_best:.2f}) compared to Linear Regression ({mae_lr:.2f}), suggesting it makes smaller absolute errors on average.")
else:
    print(f"The Linear Regression model has a lower or similar MAE ({mae_lr:.2f}) compared to Best Random Forest ({mae_rf_best:.2f}), suggesting similar or better average absolute errors.")


Linear Regression model built and fitted successfully!

--- Linear Regression Model Performance on Test Set ---
R-squared (Linear Regression): 0.6348
Mean Absolute Error (MAE) (Linear Regression): 139.58

--- Best Random Forest Model Performance on Test Set (from Q1) ---
R-squared (Random Forest): 0.7409
Mean Absolute Error (MAE) (Random Forest): 101.75

--- Comparison of Model Performance ---
The Best Random Forest model has a higher R-squared (0.7409) compared to Linear Regression (0.6348), indicating it explains more variance.
The Best Random Forest model has a lower MAE (101.75) compared to Linear Regression (139.58), suggesting it makes smaller absolute errors on average.


(ii). Let's further compare the distribution of prediction errors by the two models in the following steps.

**Step 1**. For both the linear regression and the (best) random forest model, compute the absolute residual for each record in the test dataset.
  * Note that the absolute residual is the distance between the predicted value and the actual value, i.e., $|y_{pred}-y_{test}|$.

So we end up with two sets of absolute residuals (one by the linear regression model and the other by the random forest model).

**Step 2**. For each pair of absolute residuals for the same test data point, we can define a point in a scatterplot. Generate such a scatterplot using `plotly.express` (LR residuals vs. RF residuals).

**Step 3**. Add a 45-degree reference line to the plot. This can be done using the following code. (You may need to change `min_val` and `max_val` for better visualization.)

```
min_val = 0
max_val = 10
fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)
```

**Implement the above steps**.

Note that the above steps essentially create a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). How would you interpret the plot when comparing the two predictive models?

In [11]:
import plotly.express as px
import pandas as pd
import numpy as np

# Step 1: Compute the absolute residuals for both models
absolute_residuals_rf = np.abs(y_test - y_pred_rf_best)
absolute_residuals_lr = np.abs(y_test - y_pred_lr)

# Create a DataFrame for plotting
residuals_df = pd.DataFrame({
    'RandomForest_Absolute_Residuals': absolute_residuals_rf,
    'LinearRegression_Absolute_Residuals': absolute_residuals_lr
})

# Step 2: Generate a scatterplot using plotly.express
fig = px.scatter(residuals_df,
                 x='LinearRegression_Absolute_Residuals',
                 y='RandomForest_Absolute_Residuals',
                 title='Comparison of Absolute Residuals: Linear Regression vs. Random Forest',
                 labels={
                     'LinearRegression_Absolute_Residuals': 'Linear Regression Absolute Residuals',
                     'RandomForest_Absolute_Residuals': 'Random Forest Absolute Residuals'
                 },
                 hover_data={'LinearRegression_Absolute_Residuals': ':.2f', 'RandomForest_Absolute_Residuals': ':.2f'})

# Step 3: Add a 45 degree reference line
# Determine min and max values for the reference line based on the data
min_val = min(residuals_df['LinearRegression_Absolute_Residuals'].min(), residuals_df['RandomForest_Absolute_Residuals'].min())
max_val = max(residuals_df['LinearRegression_Absolute_Residuals'].max(), residuals_df['RandomForest_Absolute_Residuals'].max())

# Add some padding to min_val and max_val for better visualization
padding = (max_val - min_val) * 0.05
min_val -= padding
max_val += padding

fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
    name='45-degree line'
)

fig.update_layout(showlegend=False)
fig.show()

print("\nInterpretation of the scatterplot:")
print("The scatterplot compares the absolute prediction errors (residuals) of the Linear Regression model against the Random Forest model for each test data point. Points below the 45-degree red dashed line indicate instances where the Random Forest model had a smaller absolute error than the Linear Regression model. Conversely, points above the line signify cases where Linear Regression performed better. The further a point is from the line, the larger the difference in absolute error between the two models.")
print("If most points lie below the 45-degree line, it suggests that the Random Forest model generally makes smaller errors than the Linear Regression model. If points are clustered around the line, the models have similar error distributions. Outliers far from the origin indicate instances where one or both models made large errors.")



Interpretation of the scatterplot:
The scatterplot compares the absolute prediction errors (residuals) of the Linear Regression model against the Random Forest model for each test data point. Points below the 45-degree red dashed line indicate instances where the Random Forest model had a smaller absolute error than the Linear Regression model. Conversely, points above the line signify cases where Linear Regression performed better. The further a point is from the line, the larger the difference in absolute error between the two models.
If most points lie below the 45-degree line, it suggests that the Random Forest model generally makes smaller errors than the Linear Regression model. If points are clustered around the line, the models have similar error distributions. Outliers far from the origin indicate instances where one or both models made large errors.
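
The visual reading can be backed by a simple count. The sketch below, on hypothetical residuals rather than the notebook's values, computes the share of test points that fall below the 45-degree line (random forest error smaller) and above it (linear regression error smaller); in the notebook, `absolute_residuals_lr` and `absolute_residuals_rf` would presumably be used instead.

```python
import numpy as np

# Hypothetical absolute residuals for five test points
abs_res_lr = np.array([120.0, 40.0, 200.0, 15.0, 90.0])
abs_res_rf = np.array([80.0, 55.0, 150.0, 10.0, 90.0])

# With LR on the x-axis and RF on the y-axis, a point lies below the
# 45-degree line when the RF error is smaller than the LR error
rf_better = abs_res_rf < abs_res_lr
lr_better = abs_res_rf > abs_res_lr

print(f"RF smaller error: {rf_better.mean():.0%}")
print(f"LR smaller error: {lr_better.mean():.0%}")
print(f"Median gap (LR - RF): {np.median(abs_res_lr - abs_res_rf):.1f}")
```

A share well above 50% for the random forest, together with a positive median gap, would support the conclusion suggested by the test-set MAE comparison.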


# Question 3. (24 points)

In this question, we will consider a classification problem on customers' spending on meat products.


(i). For the column that represents customers' spending on meat products, **calculate** the 33.33% and 66.67% percentiles. (**Hint**: You may use the function `df['MntMeatProducts'].quantile([1/3,2/3])`.)

Based on the two percentiles, **label** each row in the dataset:
* If a customer's spending is below the 33.33% percentile, label their spending as "low";
* If a customer's spending is between the two percentiles, label their spending as "medium";
* If a customer's spending is above the 66.67% percentile, label their spending as "high".

In [12]:
# Calculate the 33.33% and 66.67% percentiles for 'MntMeatProducts'
meat_spending_quantiles = df['MntMeatProducts'].quantile([1/3, 2/3])
lower_bound = meat_spending_quantiles.iloc[0]
upper_bound = meat_spending_quantiles.iloc[1]

print(f"33.33% percentile for MntMeatProducts: {lower_bound:.2f}")
print(f"66.67% percentile for MntMeatProducts: {upper_bound:.2f}")

# Label each row based on the spending percentiles
def categorize_meat_spending(spending):
    if spending < lower_bound:
        return 'low'
    elif spending > upper_bound:
        return 'high'
    else:
        return 'medium'

df['MntMeatProducts_Category'] = df['MntMeatProducts'].apply(categorize_meat_spending)

print("\nDistribution of MntMeatProducts_Category:")
display(df['MntMeatProducts_Category'].value_counts())

33.33% percentile for MntMeatProducts: 24.00
66.67% percentile for MntMeatProducts: 161.00

Distribution of MntMeatProducts_Category:


MntMeatProducts_Category,count
medium,312
high,300
low,293
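
The same labelling rule can be written in vectorized form. The sketch below uses `numpy.select` on hypothetical spending values, with the two cut points hard-coded from the output above (24 and 161); values exactly equal to a cut point fall into "medium", matching the `categorize_meat_spending` function.

```python
import numpy as np

# Hypothetical meat spending values
spending = np.array([5, 24, 60, 161, 400])

# Cut points taken from the notebook output and hard-coded here
low_cut, high_cut = 24.0, 161.0

# Conditions are checked in order; anything matching neither is 'medium'
labels = np.select(
    [spending < low_cut, spending > high_cut],
    ["low", "high"],
    default="medium",
)

print(labels.tolist())  # ['low', 'medium', 'medium', 'medium', 'high']
```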


(ii). **Build a K-nearest-neighbors model** to predict the label of a customer's spending on meat products. The model should be part of a machine learning pipeline that preprocesses the data.





In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# Define the target variable (y) and features (X) for the classification task
y_meat_category = df['MntMeatProducts_Category']
X_meat_features = df.drop(columns=['MntMeatProducts', 'MntMeatProducts_Category'])

# Split the data into training and testing sets
X_train_meat, X_test_meat, y_train_meat, y_test_meat = train_test_split(
    X_meat_features, y_meat_category, test_size=0.2, random_state=20, stratify=y_meat_category # Stratify to maintain class distribution
)

# Identify categorical and numerical features for this new X
categorical_features_meat = X_meat_features.select_dtypes(include=['object']).columns
numerical_features_meat = X_meat_features.select_dtypes(include=['int64', 'float64']).columns

# Create a preprocessor for one-hot encoding categorical features and scaling numerical features
meat_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features_meat),
        ('num', StandardScaler(), numerical_features_meat) # Scale numerical features
    ])

# Build the K-Nearest Neighbors Classifier pipeline
knn_pipeline = Pipeline(steps=[
    ('preprocessor', meat_preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5)) # Using default n_neighbors=5
])

# Fit the model to the training data
knn_pipeline.fit(X_train_meat, y_train_meat)

print("K-Nearest Neighbors Classifier pipeline built and fitted successfully!")

K-Nearest Neighbors Classifier pipeline built and fitted successfully!
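
The pipeline standardizes numerical features before the K-nearest-neighbors step because distances are otherwise dominated by large-scale columns such as `Income`. A small sketch with three hypothetical customers illustrates how the nearest neighbor can change once each column is standardized the way `StandardScaler` does (zero mean, unit variance):

```python
import numpy as np

# Hypothetical customers described by (Income, Kidhome)
a = np.array([50000.0, 0.0])
b = np.array([50500.0, 2.0])   # similar income, different household
c = np.array([50800.0, 0.0])   # same household, slightly higher income

# Raw Euclidean distances are driven almost entirely by the income column
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(f"raw:    a-b = {raw_ab:.1f}, a-c = {raw_ac:.1f}")

# Standardize each column using the population standard deviation
data = np.vstack([a, b, c])
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

scaled_ab = np.linalg.norm(scaled[0] - scaled[1])
scaled_ac = np.linalg.norm(scaled[0] - scaled[2])
print(f"scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

On the raw values, `b` is the closer neighbor of `a`; after scaling, the difference in `Kidhome` counts as well and `c` becomes the closer one.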


(iii). Evaluate the performance of your model in part (ii) on the test data by **generating the classification report**.

Further, **interpret** each number in the classification report based on the current context.

In [14]:
from sklearn.metrics import classification_report

# Make predictions on the test set
y_pred_meat = knn_pipeline.predict(X_test_meat)

# Generate the classification report
report = classification_report(y_test_meat, y_pred_meat)

print("Classification Report for K-Nearest Neighbors Model:")
print(report)

print("\nInterpretation of the Classification Report:")
print("The classification report provides a detailed breakdown of the model's performance for each class ('high', 'low', 'medium' meat spending) and overall:")
print("- **Precision**: For a given class, it indicates the proportion of positive identifications that were actually correct. For example, if precision for 'high' is 0.70, it means 70% of customers predicted to have 'high' meat spending actually did.")
print("- **Recall**: For a given class, it indicates the proportion of actual positives that were correctly identified. For example, if recall for 'high' is 0.65, it means the model correctly identified 65% of all customers who actually had 'high' meat spending.")
print("- **F1-score**: This is the harmonic mean of precision and recall, providing a single metric that balances both. A higher F1-score indicates a better balance between precision and recall.")
print("- **Support**: This is the number of actual occurrences of each class in the test set.")
print("- **Accuracy**: This is the overall percentage of correctly classified instances across all classes. It indicates how often the model is correct overall.")
print("- **Macro Avg**: The average of precision, recall, and F1-score calculated independently for each class, useful when class imbalance is present.")
print("- **Weighted Avg**: The average of precision, recall, and F1-score weighted by the support of each class, useful for understanding overall performance considering class distribution.")


Classification Report for K-Nearest Neighbors Model:
              precision    recall  f1-score   support

        high       0.86      0.85      0.86        60
         low       0.78      1.00      0.87        59
      medium       0.80      0.60      0.69        62

    accuracy                           0.81       181
   macro avg       0.82      0.82      0.81       181
weighted avg       0.82      0.81      0.80       181


Interpretation of the Classification Report:
The classification report provides a detailed breakdown of the model's performance for each class ('high', 'low', 'medium' meat spending) and overall:
- **Precision**: For a given class, it indicates the proportion of positive identifications that were actually correct. For example, if precision for 'high' is 0.70, it means 70% of customers predicted to have 'high' meat spending actually did.
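- **Recall**: For a given class, it indicates the proportion of actual positives that were correctly identified. For example, if recall for 'high' is 0.65, it means the model correctly identified 65% of all customers who actually had 'high' meat spending.
- **F1-score**: This is the harmonic mean of precision and recall, providing a single metric that balances both. A higher F1-score indicates a better balance between precision and recall.
- **Support**: This is the number of actual occurrences of each class in the test set.
- **Accuracy**: This is the overall percentage of correctly classified instances across all classes. It indicates how often the model is correct overall.
- **Macro Avg**: The average of precision, recall, and F1-score calculated independently for each class, useful when class imbalance is present.
- **Weighted Avg**: The average of precision, recall, and F1-score weighted by the support of each class, useful for understanding overall performance considering class distribution.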
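
The numbers in the report above can be tied back to customer counts. The sketch below reconstructs the approximate number of correctly classified customers per class from the printed recall and support, then checks the reported accuracy and the F1-score for "medium". Because the report shows only two decimals, the reconstructed counts are approximations.

```python
# Per-class recall and support as printed in the classification report
recall = {"high": 0.85, "low": 1.00, "medium": 0.60}
support = {"high": 60, "low": 59, "medium": 62}

# Recall = correct / support, so correct is roughly recall * support
correct = {k: round(recall[k] * support[k]) for k in recall}
total = sum(support.values())

accuracy = sum(correct.values()) / total
print(correct, f"accuracy ~ {accuracy:.2f}")

# F1 is the harmonic mean of precision and recall; for 'medium':
precision_medium, recall_medium = 0.80, 0.60
f1_medium = 2 * precision_medium * recall_medium / (precision_medium + recall_medium)
print(f"F1 (medium) ~ {f1_medium:.2f}")
```

Read in context: all 59 low spenders in the test set were labelled "low" (recall 1.00), while only about 37 of the 62 medium spenders were recognized as "medium" (recall 0.60), which is where the model loses most of its accuracy.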

# Describe how you used Gen. AI. in this assignment (2 points)

I used Gemini AI as a tool during this assignment, giving it prompts based on the assignment questions and monitoring the output, sometimes having to adjust the prompt in order to get the desired outcome.

**Note**: The remaining 5 points will be assigned to readability of the work.