
IMPORTANT: Before you start, enter your name and student number below.

**Full Name**: Md Samiul Karim Shaown

**Student Number**:400587186

# Predictive Analytics for Nata Supermarket

Welcome to Part III of our case assignment. In this part, we will continue working with the same dataset of **Nata Supermarket**.  Our focus here will be on performing predictive modeling tasks.

Throughout this assignment, please ensure that your results are reproducible by setting **`random_state` to 20**.



# Loading and preparing the data

To begin with, load the data as a `pandas` data frame. Recall that there are missing values in the data. **Make sure to address** the following issues from Part I of the assignment before starting your analysis:

* Remove the missing values
* Remove any column of constant values
* Convert the column `Dt_Customer` to number of days the customer has been with the company.

In [5]:
# These are the libraries to use for data and modeling.

import pandas as pd             # For working with tables (DataFrames)
import numpy as np              # For math operations
from datetime import datetime   # For working with dates

# Machine Learning tools
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Will use 20 everywhere to keep results consistent
RANDOM_STATE = 20


In [6]:
# Load the dataset
df = pd.read_csv("Nata Supermarkets.csv")

# Look at the first few rows
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-04-09,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-08-03,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-10-02,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0


In [8]:
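# Convert Dt_Customer to a datetime column.
# Note: with errors="coerce", any date string that cannot be parsed becomes NaT,
# and those rows are removed by dropna() further down.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], dayfirst=True, errors="coerce")

# Create a new column with the number of days each customer has been with the company.
# Note: datetime.today() changes from run to run, so Customer_Days grows by one each day.
today = datetime.today()
df["Customer_Days"] = (today - df["Dt_Customer"]).dt.days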

# Find constant columns (columns where every value is the same)
constant_columns = [col for col in df.columns if df[col].nunique() == 1]
print("Constant columns:", constant_columns)

# Drop the constant columns
df = df.drop(columns=constant_columns)

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Drop missing rows
df = df.dropna()

# Check final shape
df.shape


Constant columns: []
Missing values:
 ID                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
Customer_Days          0
dtype: int64


(905, 28)
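
The `head()` output above shows `Dt_Customer` in two layouts (`2012-04-09` and `21-08-2013`). Depending on the pandas version, `errors="coerce"` may turn strings that do not match the inferred format into `NaT`, and `dropna()` then removes those rows. The sketch below shows one possible way to keep both layouts. It assumes the ISO-looking strings are year-month-day and the others are day-month-year; the helper name `parse_mixed_dates` and the fixed `reference_date` are illustrative choices, not part of the assignment.

```python
import pandas as pd


def parse_mixed_dates(raw: pd.Series) -> pd.Series:
    """Parse strings that are either YYYY-MM-DD or DD-MM-YYYY (assumed layouts)."""
    iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
    dmy = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")
    # Keep the ISO parse where it succeeded, otherwise fall back to day-month-year.
    return iso.fillna(dmy)


sample = pd.Series(["2012-04-09", "21-08-2013", "19-01-2014", "not a date"])
parsed = parse_mixed_dates(sample)

# A fixed reference date (hypothetical value) keeps the day count reproducible.
reference_date = pd.Timestamp("2015-01-01")
customer_days = (reference_date - parsed).dt.days

print(parsed)
print(customer_days)
```

Only the last sample string stays `NaT` here, so only genuinely unparseable dates would be dropped.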

# Question 1 (45 points)

In this question, we will predict a customer's spending amount on each product category over a two-year period. Let us assume that when we try to predict a customer's spending on a product category (such as wines), their spending on other products is not observable.

In this question and Question 2, we will focus on **wines**.

(i). **Split the data** into two data frames, X (**features**) and y (**target**).

Then, further **split the data** into **training** and **testing** sets.

In [12]:
# Create X (features) and y (target)

# y is the column we want to predict.
# Predict wine spending, which is stored in the column "MntWines".

y = df["MntWines"]   # target variable


# X contains the remaining columns used to predict wine spending.
# Drop the target (MntWines), the identifier (ID), and Dt_Customer, which is
# already represented by Customer_Days.
# Note: the other Mnt* spending columns are kept as features here.
X = df.drop(columns=["MntWines", "ID", "Dt_Customer"])


# Split into training and testing sets so the model learns from the training set
# and is evaluated on data it has not seen.
# test_size=0.3 means 30% of the data is used for testing.
# random_state=RANDOM_STATE (20) keeps the split the same on every run.

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=RANDOM_STATE
)

# Print shapes so we understand the split
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (633, 25)
Testing set shape: (272, 25)
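
The question states that spending on other product categories is not observable when predicting wine spending, while the split above keeps the other `Mnt*` columns among the 25 features. A minimal sketch of how those columns could be excluded is shown below, on a small made-up frame; the names `toy`, `other_spend_columns`, and `X_alt` are hypothetical and the values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same kind of columns as the assignment data.
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0],
    "Recency": [58, 38, 26],
    "MntWines": [635, 11, 426],
    "MntFruits": [88, 1, 49],
    "MntMeatProducts": [546, 6, 127],
    "Customer_Days": [4000, 3500, 3700],
})

target = "MntWines"

# Every other spending column starts with "Mnt"; treat them as unobservable.
other_spend_columns = [c for c in toy.columns if c.startswith("Mnt") and c != target]

y_alt = toy[target]
X_alt = toy.drop(columns=[target, *other_spend_columns])

print(other_spend_columns)
print(list(X_alt.columns))
```

With this variant, the example customer in part (iii) would not need placeholder values for the other spending columns.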


(ii). **Build a machine learning pipeline** that combines the following steps to predict spending amount on wines:

* Performing one-hot encoding for the categorical features;
* A random forest model for regression.

In the random forest model, **specify** the following hyperparameters:
* Number of trees;
* Maximum depth of any tree
* Minimum number of data points required to split a node;
* Minimum number of data points in any leaf node

In addition, **fit your model** to the training data.

In [13]:
# Build preprocessing + Random Forest pipeline

# Identify categorical and numeric columns
categorical_columns = X.select_dtypes(include=["object"]).columns
numeric_columns = X.select_dtypes(exclude=["object"]).columns

# Preprocessing: one-hot encode categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns)
    ],
    remainder="passthrough"
)

# Random Forest model with required hyperparameters
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=3,
    min_samples_leaf=2,
    random_state=RANDOM_STATE
)

# Build the full pipeline
rf_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", rf_model)
])

# Fit the model
rf_pipeline.fit(X_train, y_train)

print("Model training complete.")


Model training complete.


(iii). Use the model to **predict** the spending amount on wines by a customer with the following features.

| Feature | Value |
|---------|-------|
| Age | 48 |
| Education | Graduation |
| Marital_Status | Married |
| Income | 80,000 |
| Kidhome | 1 |
| Teenhome | 1 |
| Dt_Customer | 2016-10-10 |
| Recency | 43 |
| NumDealsPurchases | 2 |
| NumWebPurchases | 1 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 15 |
| NumWebVisitsMonth | 5 |
| AcceptedCmp1,2,3,4,5 | 0 |
| Complain | 0 |




In [15]:
# Predict wine spending for the given customer

# The table gives an age (48), but the model uses Year_Birth,
# so convert the age to a birth year.
year_birth = datetime.today().year - 48

# Enrollment date and days as customer
dt_customer = pd.to_datetime("2016-10-10")
customer_days = (datetime.today() - dt_customer).days

example_customer = {
    "Year_Birth": year_birth,
    "Education": "Graduation",
    "Marital_Status": "Married",
    "Income": 80000,
    "Kidhome": 1,
    "Teenhome": 1,
    "Dt_Customer": dt_customer,
    "Recency": 43,
    # The Mnt* values below are not given in the table; 0 is a placeholder.
    "MntFruits": 0,
    "MntMeatProducts": 0,
    "MntFishProducts": 0,
    "MntSweetProducts": 0,
    "MntGoldProds": 0,
    "NumDealsPurchases": 2,
    "NumWebPurchases": 1,
    "NumCatalogPurchases": 0,
    "NumStorePurchases": 15,
    "NumWebVisitsMonth": 5,
    "AcceptedCmp1": 0,
    "AcceptedCmp2": 0,
    "AcceptedCmp3": 0,
    "AcceptedCmp4": 0,
    "AcceptedCmp5": 0,
    "Complain": 0,
    "Response": 0,  # not given in the table; assumed to be 0
    "Customer_Days": customer_days
}

# Select and order the columns to match X (this also drops Dt_Customer)
example_df = pd.DataFrame([example_customer])[X.columns]

predicted_wine = rf_pipeline.predict(example_df)[0]
print(f"Predicted wine spending for this customer: {predicted_wine:.2f}")

Predicted wine spending for this customer: 289.03


(iv). **Consider** **two** measures to evaluate the model's performance on the test dataset.

Based on your computational results, how would you describe the model's performance?

In [17]:
# Evaluate model performance on the test set

y_pred_test = rf_pipeline.predict(X_test)

mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_test)

print(f"RMSE on test set: {rmse:.2f}")
print(f"R^2 on test set:  {r2:.3f}")

RMSE on test set: 173.28
R^2 on test set:  0.752


On the test set, the model achieved an RMSE of 173.28, so its wine-spending predictions are typically off by roughly 173 dollars (RMSE weights large errors more heavily than small ones). The R² of 0.752 indicates that the model explains about 75% of the variation in customers’ wine spending, a reasonably strong result for customer-behavior data. Overall, the model captures most of the important patterns in the data, although an error of this size means a meaningful share of the variability remains unexplained.
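
For reference, the two measures can be computed by hand from their definitions, $RMSE = \sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below does this with NumPy on four made-up values (not the assignment data), which makes the interpretation of each number easier to follow.

```python
import numpy as np

# Made-up actual and predicted spending values for four customers.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus the ratio of residual variation to total variation.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f}")  # RMSE: 15.81
print(f"R^2:  {r2_manual:.3f}")    # R^2:  0.980
```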

(v). **Perform** a 6-fold cross validation with a performance score of your choice.

**Note**: You may need to look up how to specify the performance score for regression models.

In [18]:
# 6-fold cross validation using RMSE

# scikit-learn scorers follow a "higher is better" convention, so RMSE is
# exposed as its negative ("neg_root_mean_squared_error").
# We flip the sign afterwards to recover ordinary RMSE values.

cv_scores = cross_val_score(
    rf_pipeline,
    X,
    y,
    cv=6,
    scoring="neg_root_mean_squared_error"
)

# Convert negative RMSE to positive RMSE
rmse_scores = -cv_scores

print("RMSE for each fold:", rmse_scores)
print("Average RMSE:", rmse_scores.mean())
print("Std deviation:", rmse_scores.std())


RMSE for each fold: [167.92593391 167.86975141 164.62881925 154.44207789 188.57700039
 142.21553138]
Average RMSE: 164.27651903661285
Std deviation: 14.143729272232836


Based on the 6-fold cross-validation, the RMSE values range from approximately 142 to 189, with an average RMSE of 164.28. The standard deviation of 14.14 is about 9% of the mean, so the model performs fairly consistently across different subsets of the data. Overall, the cross-validation results suggest that the Random Forest model generalizes reasonably well and is stable in predicting customers’ wine spending.
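
When `cv=6` is passed as an integer for a regressor, scikit-learn uses `KFold` without shuffling, so the folds follow the row order of the data. If the rows might be ordered in some meaningful way, one option is to pass an explicit shuffled splitter. The sketch below illustrates this on synthetic data with a plain linear model so that it runs quickly; `X_demo`, `y_demo`, and `rmse_per_fold` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the real features and target.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=20
)

# Shuffled 6-fold splitter with a fixed seed for reproducibility.
cv = KFold(n_splits=6, shuffle=True, random_state=20)

neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=cv, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse  # flip the sign to get ordinary RMSE values

print("RMSE per fold:", rmse_per_fold.round(2))
print("Mean RMSE:", rmse_per_fold.mean().round(2))
```

The same `cv` object could be passed to the random forest pipeline in place of `cv=6`.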

(vi). **Perform** hyperparameter tuning using `GridSearchCV` for the following hyperparameters:

* Number of trees: 50, 100
* Maximum depth of any tree: 5, 10, 15
* Minimum number of data points required to split a node: 3, 6
* Minimum number of data points in any leaf node: 2,4,8


Based on your computational result, **show**:
* the best hyperparameter combination
* the corresponding performance score

In addition, **retrieve** the best model (the one corresponding to the best performance score).

In [19]:
# Hyperparameter tuning with GridSearchCV

param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, 10, 15],
    "model__min_samples_split": [3, 6],
    "model__min_samples_leaf": [2, 4, 8],
}

grid_search = GridSearchCV(
    rf_pipeline,
    param_grid=param_grid,
    cv=6,
    scoring="neg_root_mean_squared_error"
)

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV score (negative RMSE):", grid_search.best_score_)

# Retrieve best model
best_model = grid_search.best_estimator_
best_model


Best parameters: {'model__max_depth': 15, 'model__min_samples_leaf': 2, 'model__min_samples_split': 3, 'model__n_estimators': 100}
Best CV score (negative RMSE): -169.9744204410092





The GridSearchCV results show that the best-performing hyperparameter combination is
`n_estimators = 100`, `max_depth = 15`, `min_samples_split = 3`, and `min_samples_leaf = 2`.
This configuration achieved the lowest cross-validated RMSE on the training data: the best (negative) CV score is -169.97, which corresponds to an RMSE of approximately 169.97.
The best model was retrieved with `best_estimator_`; it is the pipeline refit on the full training set with these hyperparameters.
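
As a quick sanity check on the size of the search, the number of candidate combinations and cross-validation fits can be counted with `ParameterGrid`, and the reported negative score can be converted to an ordinary RMSE. This is only a small arithmetic sketch; the variable names `n_candidates`, `n_cv_fits`, and `best_rmse` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, 10, 15],
    "model__min_samples_split": [3, 6],
    "model__min_samples_leaf": [2, 4, 8],
}

# 2 * 3 * 2 * 3 candidate combinations, each evaluated on 6 folds.
n_candidates = len(ParameterGrid(param_grid))
n_cv_fits = n_candidates * 6

# best_score_ reported above is a negative RMSE; flip the sign to read it.
best_score = -169.9744204410092
best_rmse = -best_score

print(n_candidates, n_cv_fits, round(best_rmse, 2))  # 36 216 169.97
```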

# Question 2 (24 points)

In this question, we will compare the performance of the best model found through `GridSearchCV` in Question 1 with the performance of the linear regression model.

(i). **Construct** a linear regression model using all relevant features and fit it to the training data.

Further, **evaluate** the model's performance on the test data and compare it with the best random forest model found in Question 1, with respect to the two performance measures considered in Question 1.

**Note**: You may use the same training and testing datasets as in Question 1.

In [21]:
# Linear Regression model (using same preprocessing as before)

lin_model = LinearRegression()

lin_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", lin_model)
])

# Fit the linear regression model
lin_pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred_lin = lin_pipeline.predict(X_test)

# Evaluate using RMSE and R square
mse_lin = mean_squared_error(y_test, y_pred_lin)
rmse_lin = np.sqrt(mse_lin)
r2_lin = r2_score(y_test, y_pred_lin)

print(f"Linear Regression RMSE: {rmse_lin:.2f}")
print(f"Linear Regression R^2:  {r2_lin:.3f}")

# For comparison, evaluate the best Random Forest model from GridSearch
y_pred_best_rf = best_model.predict(X_test)

mse_best_rf = mean_squared_error(y_test, y_pred_best_rf)
rmse_best_rf = np.sqrt(mse_best_rf)
r2_best_rf = r2_score(y_test, y_pred_best_rf)

print("\nBest Random Forest RMSE:", rmse_best_rf)
print("Best Random Forest R^2: ", r2_best_rf)


Linear Regression RMSE: 216.09
Linear Regression R^2:  0.615

Best Random Forest RMSE: 172.42154066700184
Best Random Forest R^2:  0.7546090489385436


The Linear Regression model produced an RMSE of 216.09 and an R² of 0.615, meaning it explains about 61.5% of the variation in wine spending. In comparison, the best Random Forest model from GridSearchCV performed substantially better, with an RMSE of 172.42 and an R² of 0.755. This indicates that the Random Forest captures more of the underlying patterns in customer behavior and provides more accurate predictions than the linear model.
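
To put the gap between the two models in relative terms, the reported test metrics can be compared with a little arithmetic. The sketch below simply reuses the rounded values printed above; the variable names are illustrative.

```python
# Rounded test-set metrics reported above.
rmse_lin, r2_lin = 216.09, 0.615
rmse_rf, r2_rf = 172.42, 0.755

# Relative reduction in RMSE when moving from linear regression to random forest.
rmse_reduction_pct = 100 * (rmse_lin - rmse_rf) / rmse_lin

# Absolute gain in the share of variance explained.
r2_gain = r2_rf - r2_lin

print(f"RMSE reduction: {rmse_reduction_pct:.1f}%")  # RMSE reduction: 20.2%
print(f"R^2 gain: {r2_gain:.3f}")                    # R^2 gain: 0.140
```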

(ii). Let's further compare the distribution of prediction errors by the two models in the following steps.

**Step 1**. For both the linear regression and the (best) random forest model, compute the absolute residual for each record in the test dataset.
  * Note that the absolute residual is the distance between the predicted value and the actual value, i.e., $|y_{pred}-y_{test}|$.

So we end up with two sets of absolute residuals (one by the linear regression model and the other by the random forest model).

**Step 2**. For each pair of absolute residuals for the same test data point, we can define a point in a scatterplot. Generate such a scatterplot using `plotly.express` (LR residuals vs. RF residuals).

**Step 3**. Add a 45-degree reference line to the plot. This can be done using the following code. (You may need to change `min_val` and `max_val` for better visualization.)

```
min_val = 0
max_val = 10
fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)
```

**Implement the above steps**.

Note that the above steps essentially create a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). How would you interpret the plot (for comparing the two predictive models)?

In [22]:
# Compare absolute residuals of Linear Regression vs. Random Forest

import plotly.express as px

# Absolute residuals: |y_pred - y_test|
abs_res_lin = np.abs(y_pred_lin - y_test)
abs_res_rf = np.abs(y_pred_best_rf - y_test)

# Create DataFrame for plotting
residual_df = pd.DataFrame({
    "Linear Regression": abs_res_lin,
    "Random Forest": abs_res_rf
})

# Scatter plot (Q–Q style)
# In plotly.express, the keys of `labels` must be the column names being plotted.
fig = px.scatter(
    residual_df,
    x="Linear Regression",
    y="Random Forest",
    title="Absolute Residuals: Linear Regression vs. Random Forest",
    labels={"Linear Regression": "Linear Regression Residuals",
            "Random Forest": "Random Forest Residuals"}
)

# Add 45 degree diagonal reference line
min_val = min(residual_df.min())
max_val = max(residual_df.max())

fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash")
)

fig.show()


The scatter plot compares the absolute residuals from Linear Regression (x-axis) and Random Forest (y-axis) for each test observation. Most points lie below the 45-degree line, which means that for most customers the Random Forest residual is smaller than the Linear Regression residual. Only a small number of points lie above the line, where Linear Regression is the more accurate of the two. Overall, the plot supports the earlier metrics: the Random Forest model provides better predictive accuracy on this test set.
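
The visual impression can also be backed by a number: the share of test points that fall below the 45-degree line. The sketch below computes it on made-up predictions (not the assignment data); with the real arrays, `abs_res_lin` and `abs_res_rf` from the cell above would take the place of `abs_lr` and `abs_rf`.

```python
import numpy as np

# Made-up actual values and predictions from two hypothetical models.
y_true = np.array([100.0, 250.0, 400.0, 50.0, 300.0])
pred_lr = np.array([160.0, 200.0, 520.0, 45.0, 260.0])
pred_rf = np.array([120.0, 240.0, 450.0, 70.0, 290.0])

abs_lr = np.abs(pred_lr - y_true)
abs_rf = np.abs(pred_rf - y_true)

# A point lies below the 45-degree line when the RF residual is the smaller one.
share_rf_better = np.mean(abs_rf < abs_lr)

# Median difference in absolute error (positive means RF is closer).
median_gap = np.median(abs_lr - abs_rf)

print(f"RF closer on {share_rf_better:.0%} of points")  # RF closer on 80% of points
print(f"Median error gap: {median_gap:.1f}")            # Median error gap: 40.0
```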

# Question 3. (24 points)

In this question, we will consider a classification problem on customers' spending on meat products.


(i). For the column that represents customers' spending on meat products, **calculate** the 33.33rd and 66.67th percentiles. (**Hint**: You may use the function `df['MntMeatProducts'].quantile([1/3,2/3])`.)

Based on the two percentiles, **label** each row in the dataset:
* If a customer's spending is below the 33.33rd percentile, label their spending as "low";
* If a customer's spending is above the 66.67th percentile, label their spending as "high";
* If a customer's spending is between the two percentiles, label their spending as "medium".

In [23]:
# Calculate percentiles for meat spending

# 33.33% and 66.67% percentiles
q_low, q_high = df["MntMeatProducts"].quantile([1/3, 2/3])

print("33.33% percentile:", q_low)
print("66.67% percentile:", q_high)

# Label customers based on meat spending
def label_meat_spending(x):
    if x < q_low:
        return "low"
    elif x > q_high:
        return "high"
    else:
        return "medium"

df["Meat_Segment"] = df["MntMeatProducts"].apply(label_meat_spending)

# Check distribution
df["Meat_Segment"].value_counts()


33.33% percentile: 24.0
66.67% percentile: 161.0


Meat_Segment,count
medium,312
high,300
low,293


The 33.33rd percentile for meat spending is 24, and the 66.67th percentile is 161. Customers spending below 24 were labeled as “low”, those spending above 161 as “high”, and all others (including values equal to either threshold) as “medium”. After applying these thresholds, the dataset contains 293 low, 312 medium, and 300 high meat-spending customers, giving a well-balanced target distribution for the classification model.
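
The same labeling rule can also be written without a row-by-row function. The sketch below uses `np.select` on a few made-up spending values with the thresholds reported above, mainly to show how values that sit exactly on a threshold are treated; the names `spend` and `segment` are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up spending values, including both threshold values.
spend = pd.Series([10, 24, 100, 161, 200], name="MntMeatProducts")
q_low, q_high = 24.0, 161.0  # thresholds reported above

# Conditions are checked in order; anything matching neither is "medium".
segment = pd.Series(
    np.select([spend < q_low, spend > q_high], ["low", "high"], default="medium"),
    index=spend.index,
    name="Meat_Segment",
)

print(segment.tolist())  # ['low', 'medium', 'medium', 'medium', 'high']
```

Because the comparisons are strict, spending of exactly 24 or 161 lands in the "medium" group, matching the `label_meat_spending` function.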

(ii). **Build a K-nearest-neighbors model** to predict the label of a customer's spending on meat products. The model should be part of a machine learning pipeline that preprocesses the data.





In [25]:
# K-Nearest-Neighbors classifier for Meat_Segment

# Features (X) and target (y) for classification
X_cls = df.drop(columns=["Meat_Segment", "MntMeatProducts", "ID", "Dt_Customer"])
y_cls = df["Meat_Segment"]

# Train-test split
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.3, random_state=RANDOM_STATE
)

# Identify column types
cat_cols = X_cls.select_dtypes(include=["object"]).columns
num_cols = X_cls.select_dtypes(exclude=["object"]).columns

# Preprocessing: scale numeric features + one-hot encode categorical ones
preprocess_cls = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ]
)

# KNN model (k = 5 is a reasonable starting point)
knn_model = KNeighborsClassifier(n_neighbors=5)

# Full pipeline
knn_pipeline = Pipeline(steps=[
    ("preprocess", preprocess_cls),
    ("model", knn_model)
])

# Fit model
knn_pipeline.fit(Xc_train, yc_train)

print("KNN model training complete.")

KNN model training complete.
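
The value `k = 5` above is a starting point rather than a tuned choice. One possible way to pick `k` is to search over a few candidates with cross-validation, keeping the scaler inside the pipeline so that it is fit only on each training fold. The sketch below does this on synthetic three-class data; the candidate list, the data, and the names `knn_demo` and `k_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=20
)

# Scaling matters for KNN because distances depend on feature units.
knn_demo = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

k_grid = {"model__n_neighbors": [3, 5, 7, 9, 11]}
k_search = GridSearchCV(knn_demo, param_grid=k_grid, cv=6, scoring="accuracy")
k_search.fit(X_demo, y_demo)

best_k = k_search.best_params_["model__n_neighbors"]
print("Best k:", best_k)
print("Mean CV accuracy:", round(k_search.best_score_, 3))
```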


(iii). Evaluate the performance of your model in part (ii) on the test data by **generating the classification report**.

Further, **interpret** each number in the classification report based on the current context.

In [26]:
# Classification report for the KNN model

# Predict labels for the test set
yc_pred = knn_pipeline.predict(Xc_test)

# Generate classification report
print(classification_report(yc_test, yc_pred))


              precision    recall  f1-score   support

        high       0.87      0.88      0.87        97
         low       0.82      0.90      0.86        90
      medium       0.72      0.64      0.68        85

    accuracy                           0.81       272
   macro avg       0.80      0.80      0.80       272
weighted avg       0.81      0.81      0.81       272



The KNN model performs well overall, with an accuracy of about 81%: roughly four out of five of the 272 test customers are assigned the correct spending label. For the “high” segment, precision is 0.87 (87% of customers predicted as high spenders truly are high spenders) and recall is 0.88 (88% of the 97 actual high spenders are identified). The “low” segment has precision 0.82 and recall 0.90, so the model captures 90% of the 90 actual low spenders, although 18% of its “low” predictions belong to another segment.

The “medium” segment has lower scores (precision 0.72, recall 0.64, F1-score 0.68 over 85 customers), so this group is harder for the model to distinguish, likely because medium spenders sit next to both thresholds and overlap with the low and high groups. The macro and weighted averages (about 0.80 and 0.81) are close to each other because the three classes have similar support. Overall, the KNN model is effective at identifying clear low and high meat spenders, while medium spenders are more difficult to classify accurately.
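
Precision and recall in the report come directly from the confusion matrix: precision divides each diagonal entry by its column total (all predictions of that label), and recall divides it by its row total (all true members of that label). The sketch below works this through on eight made-up labels, not the assignment data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["high", "low", "medium"]

# Made-up true and predicted segments for eight customers.
y_true = ["high", "high", "low", "low", "medium", "medium", "medium", "high"]
y_pred = ["high", "medium", "low", "low", "medium", "low", "medium", "high"]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=labels)

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)  # per predicted label (column totals)
recall = correct / cm.sum(axis=1)     # per true label (row totals)

print(cm)
for name, p, r in zip(labels, precision, recall):
    print(f"{name:>6}: precision={p:.2f}, recall={r:.2f}")
```

Here one true "high" customer is predicted as "medium" and one true "medium" customer as "low", which is the kind of neighboring-segment confusion discussed above.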

# Describe how you used Gen. AI. in this assignment (2 points)

I used Google Colab’s “Generate AI Code” feature and ChatGPT occasionally throughout this assignment to clarify syntax, verify my machine learning pipeline setup, and better understand certain steps related to model training, evaluation, and hyperparameter tuning.

These tools helped me confirm that my code structure was correct, for example when setting up preprocessing with ColumnTransformer, creating pipelines for regression and classification models, specifying scoring parameters for cross-validation, and organizing the steps for GridSearchCV. I also used AI support to troubleshoot minor issues, such as aligning feature columns for predictions, and to validate that my evaluation metrics (RMSE, R², precision, recall, etc.) were computed correctly.

All AI usage was limited to verifying, debugging, or refining my approach. The modeling decisions, parameter choices, and interpretations of results were my own, and the AI support simply helped ensure my reasoning remained consistent with the outputs and data. I maintained full ownership of the final work submitted.

**Note**: The remaining 5 points will be assigned to readability of the work.