<a href="https://colab.research.google.com/github/Data-Analytics-with-Python/individual-assignment-iii-torkelfaa/blob/main/Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT: Before you start, enter your name and student number below.

**Full Name**: Torkel Faarlund

**Student Number**: 400677803

# Predictive Analytics for Nata Supermarket

Welcome to Part III of our case assignment. In this part, we will continue working with the same dataset of **Nata Supermarket**.  Our focus here will be on performing predictive modeling tasks.

Throughout this assignment, please ensure that your results are reproducible by setting the **random_state to 20**



# Loading and preparing the data

To begin with, load the data as a `pandas` data frame. Recall that there are missing values in the data. **Make sure to address** the following issues from part I of the assignment before starting your analysis:

* Remove the missing values
* Remove any column of constant values
* Convert the column `Dt_Customer` to number of days the customer has been with the company.

In [2]:
import pandas as pd

df = pd.read_excel("nata_supermarket.xlsx")
df = df.dropna()

# Removing any column of constant values
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns = constant_cols)

# Converting Dt_Customer to number of days with the company

# Different date formats gave me issues
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], dayfirst=True)

# Choosing reference date = most recent customer enrollment
max_date = df["Dt_Customer"].max()

# Calculating days each customer has been with the company
df["Days_With_Company"] = (max_date - df["Dt_Customer"]).dt.days



# Question 1 (45 points)

In this question, we will predict a customer's spending amount on each product category over a two year period. Let us assume that when we try to predict a customer's spending on a product category (such as wines), their spending on other products is not observable.

In this question and Question 2, we will focus on **wines**.

(i). **Split the data** into two data frames, X (**features**) and y (**target**).

Then, further **split the data** into **training** and **testing** sets.

In [3]:
from sklearn.model_selection import train_test_split

# Dropping the original datetime column because I got errors when trying to model the RandomForest
df = df.drop(columns=["Dt_Customer"])

# Defining x and y, making sure to remove the spending columns we're not observing
cols_to_remove = [
    "MntMeatProducts", "MntFishProducts", "MntFruits", "MntSweetProducts",
    "MntGoldProds", "ID", "Response"
]

x = df.drop(columns=["MntWines"] + cols_to_remove)
y = df["MntWines"]

# Train/test split (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)


(ii). **Build a machine learning pipeline** that combines the following steps to predict spending amount on wines:

* Performing one-hot encoding for the categorical features;
* A random forest model for regression.

In the random forest model, **specify** the following hyperparameters:
* Number of trees;
* Maximum depth of any tree
* Minimum number of data points required to split a node;
* Minimum number of data points in any leaf node

In addition, **fit your model** to the training data.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Identifying categorical and numerical columns
categorical_cols = x_train.select_dtypes(include=["object"]).columns
numerical_cols = x_train.select_dtypes(exclude=["object"]).columns

# Preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)

# Random Forest model
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=12,
    min_samples_split=10,
    min_samples_leaf=4,
    random_state=42
)

# Combining preprocessing + model into a pipeline
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("rf", rf)
])

# Fitting the pipeline to the training data
model.fit(x_train, y_train)


(iii). Use the model to **predict** the spending amount on wines by a customer with the following features.

| Feature |  Value  |
|---------|---------|
| Age     | 48      |
|Education|Graduation|
|Marital_Status| Married|
|Income|80,000|
|Kidhome|1|
|Teenhome|1|
|Dt_Customer|2016-10-10|
|Recency|43|
|NumDealsPurchases|2|
|NumWebPurchases|1|
|NumCatalogPurchases|0|
|NumStorePurchases|15|
|NumWebVisitsMonth|5|
|AcceptedCmp1,2,3,4,5| 0 |
|Complain|0|




In [5]:
# Computing their days with company
customer_days_with_company = (max_date - pd.to_datetime("2016-10-10")).days

# Building the profile
customer = pd.DataFrame({
    "Year_Birth": [2025 - 48],  # assumes the stated age of 48 is measured as of 2025
    "Education": ["Graduation"],
    "Marital_Status": ["Married"],
    "Income": [80000],
    "Kidhome": [1],
    "Teenhome": [1],
    "Recency": [43],
    "NumDealsPurchases": [2],
    "NumWebPurchases": [1],
    "NumCatalogPurchases": [0],
    "NumStorePurchases": [15],
    "NumWebVisitsMonth": [5],
    "AcceptedCmp1": [0],
    "AcceptedCmp2": [0],
    "AcceptedCmp3": [0],
    "AcceptedCmp4": [0],
    "AcceptedCmp5": [0],
    "Complain": [0],
    "Days_With_Company": [customer_days_with_company],
})

# Predicting
model.predict(customer)


array([379.17129436])
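
The `Days_With_Company` value fed into this prediction depends on the reference date. The sketch below uses only the standard library and a hypothetical helper, `days_with_company`, which is not part of the notebook. Both reference dates are illustrative assumptions rather than values taken from the dataset. It shows that the tenure becomes negative whenever the enrollment date falls after the reference date, so it may be worth checking the sign of `customer_days_with_company` before trusting the prediction.

```python
from datetime import date


def days_with_company(enrolled: date, reference: date) -> int:
    """Hypothetical helper: days between enrollment and a reference date."""
    return (reference - enrolled).days


enrolled = date(2016, 10, 10)  # enrollment date from the feature table

# Assumed reference dates, for illustration only
later_reference = date(2017, 1, 1)
earlier_reference = date(2014, 1, 1)

print(days_with_company(enrolled, later_reference))    # positive tenure
print(days_with_company(enrolled, earlier_reference))  # negative tenure
```

A negative tenure lies outside the range the model saw during training, so a tree-based model would treat it like the smallest tenure it has seen.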

(iv). **Consider** **two** measures to evaluate the model's performance on the test dataset.

Based on your computational results, how would you describe the model's performance?

In [8]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Predictions on the test set
y_pred = model.predict(x_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

mae, rmse, df["MntWines"].mean(), df["MntWines"].median()


(83.91732381156834,
 np.float64(145.93282683777426),
 np.float64(305.09160649819495),
 174.5)

The mean spending is 305 with a median of 174.5.

The model has an MAE of 84, which is about 28% of the mean spending and about 48% of the median spending.

The RMSE is higher, at 146. Because RMSE penalises large deviations more heavily than MAE, the gap between the two suggests that the model occasionally makes much larger errors.

Overall, the model captures general spending patterns reasonably well, but there is still a lot of variability in customer behavior that the model does not fully explain.
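
The relative size of these errors can be checked directly from the reported numbers. The short sketch below hard-codes the values printed above (so it runs without the dataset) and expresses the MAE as a share of the mean and median spending, plus the RMSE-to-MAE ratio.

```python
# Values copied from the printed output of the evaluation cell
mae = 83.91732381156834
rmse = 145.93282683777426
mean_spend = 305.09160649819495
median_spend = 174.5

mae_vs_mean = mae / mean_spend      # MAE as a share of mean spending
mae_vs_median = mae / median_spend  # MAE as a share of median spending
rmse_to_mae = rmse / mae            # equals 1 only if all absolute errors are identical

print(f"MAE / mean:   {mae_vs_mean:.1%}")
print(f"MAE / median: {mae_vs_median:.1%}")
print(f"RMSE / MAE:   {rmse_to_mae:.2f}")
```

A ratio well above 1 is consistent with a minority of customers whose spending is predicted poorly.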

(v). **Perform** a 6-fold cross validation with a performance score of your choice.

**Note**: You may need to research how to specify the performance score for regression models.

In [12]:
from sklearn.model_selection import cross_val_score

r2_scores = cross_val_score(
    model,
    x,
    y,
    cv=6,
    scoring="r2"
)

print("R2 scores for each fold:", r2_scores)
print("Average R2:", r2_scores.mean())



R2 scores for each fold: [0.82322955 0.81323926 0.77429362 0.81741003 0.76456831 0.82303443]
Average R2: 0.8026292019333733


Using 6-fold cross-validation with the R2 performance metric, the model achieved R2 values between 0.76 and 0.82, with an average of 0.803.

This means the model explains roughly 80% of the variation in wine spending across customers, which is a strong result for this type of behavioral data.

Compared to the earlier MAE and RMSE, which suggested moderate absolute prediction accuracy, the high R2 indicates that the model captures the overall spending patterns very well.
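
To see how stable the score is across folds, the fold-level values can be summarised with more than just the mean. This is a minimal sketch that copies the six scores from the output above and uses only the standard library. The sample standard deviation is an extra statistic added here for illustration, not one the notebook computes.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output
r2_scores = [0.82322955, 0.81323926, 0.77429362, 0.81741003, 0.76456831, 0.82303443]

avg_r2 = mean(r2_scores)
spread = stdev(r2_scores)  # sample standard deviation across the six folds

print(f"Average R2: {avg_r2:.3f}")
print(f"Std across folds: {spread:.3f}")
print(f"Range: {min(r2_scores):.2f} to {max(r2_scores):.2f}")
```

A small spread relative to the mean suggests the result does not hinge on one particular split of the data.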

(vi). **Perform** hyperparameter tuning using `GridSearchCV` for the following hyperparameters:

* Number of trees: 50, 100
* Maximum depth of any tree: 5, 10, 15
* Minimum number of data points required to split a node: 3, 6
* Minimum number of data points in any leaf node: 2,4,8


Based on your computational result, **show**:
* the best hyperparameter combination
* the corresponding performance score

In addition, **retrieve** the best model (the one corresponding to the best performance score).

In [13]:
from sklearn.model_selection import GridSearchCV

# Defining parameter grid for the random forest inside pipeline
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [5, 10, 15],
    "rf__min_samples_split": [3, 6],
    "rf__min_samples_leaf": [2, 4, 8],
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=6,
    scoring="neg_mean_absolute_error",
    n_jobs=-1
)

# Fitting to training data
grid_search.fit(x_train, y_train)

# Retrieving best parameters
best_params = grid_search.best_params_
best_score = -grid_search.best_score_

best_model = grid_search.best_estimator_


In [14]:
print("Best hyperparameters:", best_params)
print("Best MAE:", best_score)
print("Best model:", best_model)


Best hyperparameters: {'rf__max_depth': 15, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 3, 'rf__n_estimators': 50}
Best MAE: 88.27487708947812
Best model: Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  Index(['Education', 'Marital_Status'], dtype='object')),
                                                 ('num', 'passthrough',
                                                  Index(['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency',
       'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
       'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3',
       'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2',
       'Complain', 'Days_With_Company'],
      dtype='object'))])),
                ('rf',
                 RandomForestRegressor(max_depth=15, min_samples_lea
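
The cost of this search follows from the size of the grid. As a rough sketch, the block below enumerates the same grid with `itertools.product` to count the candidate settings and the number of model fits with 6-fold cross-validation. It deliberately avoids scikit-learn so it runs on its own.

```python
from itertools import product

# Same grid as in the GridSearchCV cell
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [5, 10, 15],
    "rf__min_samples_split": [3, 6],
    "rf__min_samples_leaf": [2, 4, 8],
}
cv_folds = 6

combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)  # 2 * 3 * 2 * 3
n_fits = n_candidates * cv_folds  # one fit per candidate per fold, excluding the final refit

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` also refits the best candidate once on the full training set, which is the model returned as `best_estimator_`.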

# Question 2 (24 points)

In this question, we will compare the performance of the best model found through `GridSearchCV` in Question 1 with the performance of the linear regression model.

(i). **Construct** a linear regression model using all relevant features and fit it to the training data.

Further, **evaluate** the model's performance on the test data and compare it with the best random forest model found in Question 1, with respect to the two performance measures considered in Question 1.

**Note**: You may use the same training and testing datasets as in Question 1.

In [16]:
from sklearn.linear_model import LinearRegression

# Identifying categorical and numerical columns
categorical_cols = x_train.select_dtypes(include=["object"]).columns
numerical_cols   = x_train.select_dtypes(exclude=["object"]).columns

# Preprocessor
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)

# Linear regression model
linreg = LinearRegression()

# Pipeline
linreg_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("linreg", linreg)
])

# Fit model
linreg_model.fit(x_train, y_train)

In [17]:
# Predictions
y_pred_lin = linreg_model.predict(x_test)

# Results
lin_mae = mean_absolute_error(y_test, y_pred_lin)
lin_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lin))

lin_mae, lin_rmse


(119.3884787716785, np.float64(183.77800324871012))

In [18]:
# Predictions
y_pred_rf = best_model.predict(x_test)

# Results
rf_mae = mean_absolute_error(y_test, y_pred_rf)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))

rf_mae, rf_rmse


(78.34333922796718, np.float64(138.93577003828855))

On the test set, the linear model achieved an MAE of approximately 119.4 and an RMSE of about 183.8.

In comparison, the best Random Forest model found via `GridSearchCV` obtained a substantially lower MAE of 78.3 and RMSE of 138.9 on the same test data.

The Random Forest model clearly outperforms linear regression on both performance measures.

This indicates that wine spending depends on the features in a nonlinear way that a simple linear model cannot capture, whereas the Random Forest can model more complex relationships and interactions between variables.
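
The size of the gap can be expressed as a relative reduction in error. The sketch below hard-codes the four test-set values printed above. `relative_reduction` is a hypothetical helper introduced only for this illustration.

```python
# Test-set metrics copied from the outputs of the two evaluation cells
lin_mae, lin_rmse = 119.3884787716785, 183.77800324871012
rf_mae, rf_rmse = 78.34333922796718, 138.93577003828855


def relative_reduction(baseline: float, improved: float) -> float:
    """Hypothetical helper: share of the baseline error that was removed."""
    return (baseline - improved) / baseline


mae_reduction = relative_reduction(lin_mae, rf_mae)
rmse_reduction = relative_reduction(lin_rmse, rf_rmse)

print(f"MAE reduction:  {mae_reduction:.1%}")
print(f"RMSE reduction: {rmse_reduction:.1%}")
```

Both reductions come from a single train/test split, so they are best read as indicative rather than exact.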

(ii). Let's further compare the distribution of prediction errors by the two models in the following steps.

**Step 1**. For both the linear regression and the (best) random forest model, compute the absolute residual for each record in the test dataset.
  * Note that the absolute residual is the distance between the predicted value and the actual value, i.e., $|y_{pred}-y_{test}|$.

So we end up with two sets of absolute residuals (one by the linear regression model and the other by the random forest model).

**Step 2**. For each pair of absolute residuals for the same test data point, we can define a point in a scatterplot. Generate such a scatterplot using `plotly.express` (LR residuals vs. RF residuals).

**Step 3**. Add a 45 degree reference line to the plot. This can be done using the following code. (You may need to change `min_val` and `max_val` for better visualization).

```
min_val = 0
max_val = 10
fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)
```

**Implement the above steps**.

Note that the above steps essentially create a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). How would you interpret the plot (for comparing the two predictive models)?

In [20]:
import plotly.express as px

# Absolute residuals
y_pred_lin = linreg_model.predict(x_test)
y_pred_rf  = best_model.predict(x_test)

abs_res_lin = np.abs(y_pred_lin - y_test)
abs_res_rf  = np.abs(y_pred_rf  - y_test)

residuals_df = pd.DataFrame({
    "Linear Residual": abs_res_lin,
    "Random Forest Residual": abs_res_rf
})

# Scatter plot
fig = px.scatter(
    residuals_df,
    x="Linear Residual",
    y="Random Forest Residual",
    title="Absolute Residuals: Linear Regression vs Random Forest",
    labels={
        "Linear Residual": "Linear Regression Absolute Residual",
        "Random Forest Residual": "Random Forest Absolute Residual"
    }
)

# Adding 45-degree reference line
min_val = 0
max_val = residuals_df.max().max()

fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash")
)

fig.show()


The majority of points lie below the reference line, meaning that in most cases the Random Forest model has smaller residuals than the Linear Regression model.

This is especially apparent for observations where Linear Regression produces large errors (300–900), while the Random Forest residuals remain substantially smaller.

Overall, the plot confirms the earlier results: the Random Forest model consistently outperforms Linear Regression, both in typical prediction accuracy and in avoiding large errors.

The distribution of residuals shows that Random Forest generalizes better and handles unusual customers more effectively than the linear model.
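
The visual impression that most points lie below the reference line can be backed by a simple count. The sketch below uses hypothetical residual values (five made-up test points, not the notebook's data) to show the calculation. In the notebook, the same comparison could be applied to `abs_res_lin` and `abs_res_rf`.

```python
# Hypothetical absolute residuals for five test points (illustrative values only)
abs_res_lin = [120.0, 45.0, 310.0, 15.0, 80.0]
abs_res_rf = [60.0, 50.0, 140.0, 10.0, 80.0]

pairs = list(zip(abs_res_lin, abs_res_rf))
rf_better = sum(rf < lin for lin, rf in pairs)   # points below the 45-degree line
lin_better = sum(rf > lin for lin, rf in pairs)  # points above the line
ties = len(pairs) - rf_better - lin_better       # points exactly on the line

share_rf_better = rf_better / len(pairs)
print(f"Random Forest has the smaller residual for {share_rf_better:.0%} of points")
```

With the linear residual on the x-axis and the Random Forest residual on the y-axis, a point below the line is one where the Random Forest error is the smaller of the two.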

# Question 3. (24 points)

In this question, we will consider a classification problem on customers' spending on meat products.


(i). For the column that represents customers' spending on meat products, **calculate** the 33.33rd and 66.67th percentiles. (**Hint**: You may use the function `df['MntMeatProducts'].quantile([1/3,2/3])`.)

Based on the two percentiles, **label** each row in the dataset
* If a customer's spending is below the 33.33% percentile, label their spending as "low";

* If a customer's spending is above the 66.67% percentile, label their spending as "high";


* If a customer's spending is between the two percentiles, label their spending as "medium".

In [21]:
percentiles = df["MntMeatProducts"].quantile([1/3, 2/3])
p33 = percentiles.loc[1/3]
p66 = percentiles.loc[2/3]

p33, p66


(np.float64(24.0), np.float64(146.33333333333258))

In [22]:
conditions = [
    df["MntMeatProducts"] < p33,
    df["MntMeatProducts"] > p66
]

choices = ["low", "high"]

df["MeatClass"] = np.select(
    conditions,
    choices,
    default="medium"
)
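
The labelling rule treats values that fall exactly on a cut-off as "medium", because both comparisons are strict. The sketch below mirrors the `np.select` logic with a hypothetical pure-Python helper, `label_spending`, and the two percentile values printed above. The five spending amounts are made up to probe the boundaries.

```python
def label_spending(amount: float, low_cut: float, high_cut: float) -> str:
    """Hypothetical helper mirroring the np.select rule used in the notebook."""
    if amount < low_cut:
        return "low"
    if amount > high_cut:
        return "high"
    return "medium"


p33, p66 = 24.0, 146.33333333333258  # percentiles copied from the output above

# Illustrative amounts chosen around the two cut-offs
labels = [label_spending(a, p33, p66) for a in (10, 24, 100, 146, 147)]
print(labels)  # ['low', 'medium', 'medium', 'medium', 'high']
```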

(ii). **Build a K-nearest-neighbors model** to predict the label of a customer's spending on meat products. The model should be part of a machine learning pipeline that preprocesses the data.





In [25]:
y = df["MeatClass"]

# Dropping the meat spending column and the target
x = df.drop(columns=["MntMeatProducts", "MeatClass"])


x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

In [27]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Identifying column types
categorical_cols = x_train.select_dtypes(include=["object"]).columns
numeric_cols      = x_train.select_dtypes(exclude=["object"]).columns

# Preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ]
)

# KNN model
knn = KNeighborsClassifier(n_neighbors=5)

# Pipeline
knn_model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("knn", knn)
])

# Fitting
knn_model.fit(x_train, y_train)
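
The `StandardScaler` step matters because K-nearest-neighbors relies on distances between customers. The following sketch uses two hypothetical customers and assumed standard deviations (none of these numbers come from the dataset) to show how an unscaled income column would dominate the Euclidean distance. Mean-centering is omitted because shifting both points by the same amount does not change the distance between them.

```python
import math

# Two hypothetical customers: (income, number of kids at home)
a = (80000.0, 0.0)
b = (81000.0, 2.0)

raw_distance = math.dist(a, b)  # dominated by the income difference

# Assumed standard deviations used to put both features on a similar scale
income_std, kids_std = 25000.0, 0.5
a_scaled = (a[0] / income_std, a[1] / kids_std)
b_scaled = (b[0] / income_std, b[1] / kids_std)
scaled_distance = math.dist(a_scaled, b_scaled)  # now driven by the difference in kids

print(f"Raw distance:    {raw_distance:.2f}")
print(f"Scaled distance: {scaled_distance:.2f}")
```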


(iii). Evaluate the performance of your model in part (ii) on the test data by **generating the classification report**.

Further, **interpret** each number in the classification report based on the current context.

In [30]:
from sklearn.metrics import classification_report, accuracy_score

y_pred = knn_model.predict(x_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.7972972972972973
              precision    recall  f1-score   support

        high       0.85      0.85      0.85       148
         low       0.81      0.88      0.84       146
      medium       0.73      0.66      0.69       150

    accuracy                           0.80       444
   macro avg       0.80      0.80      0.80       444
weighted avg       0.79      0.80      0.79       444



The KNN classifier achieves an accuracy of about 80% on the test set.

The model performs very well for identifying both high and low meat spenders, with precision, recall, and F1-scores above 0.80 for both groups.

This means that the model reliably detects customers at the extremes of meat spending.

In contrast, performance is weaker for the medium spending group (F1 ≈ 0.69), indicating that customers with mid-level spending patterns are harder to distinguish.

Overall, the model has strong predictive performance, particularly for identifying low and high spenders.
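
The F1-scores in the report follow from precision and recall as their harmonic mean. As a quick check, the sketch below recomputes them from the rounded precision and recall values shown in the report, so small differences from the printed F1 values are due to rounding. `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Hypothetical helper: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the classification report
report = {"high": (0.85, 0.85), "low": (0.81, 0.88), "medium": (0.73, 0.66)}

f1 = {label: f1_from(p, r) for label, (p, r) in report.items()}
for label, score in f1.items():
    print(f"{label:>6}: F1 = {score:.2f}")
```

For the medium class, the recall of 0.66 means that roughly two thirds of the 150 medium spenders in the test set were labelled correctly, while the precision of 0.73 means that about a quarter of the customers predicted as medium actually belong to another group.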

# Describe how you used Gen. AI. in this assignment (2 points)

I used AI to help with coding and troubleshooting throughout the assignment.

The AI helped with preparing the data and implementing the machine learning models.

All analysis and interpretations were done by me.

**Note**: The remaining 5 points will be assigned to readability of the work.