

IMPORTANT: Before you start, enter your name and student number below.

**Full Name**: Karim Kaufmann

**Student Number**: 400677758

# Predictive Analytics for Nata Supermarket

Welcome to Part III of our case assignment. In this part, we will continue working with the same dataset of **Nata Supermarket**.  Our focus here will be on performing predictive modeling tasks.

Throughout this assignment, please ensure that your results are reproducible by setting **random_state to 20**.
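As a minimal illustration of why a fixed seed matters, the sketch below splits a toy array twice with `random_state=20` and checks that both splits are identical. The toy array is an assumption for illustration and is not the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix (illustrative only)
toy = np.arange(10)

# Two independent calls with the same seed
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=20)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=20)

# With an identical seed, both calls shuffle the rows in the same way
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)
```

The same idea applies to any estimator that uses randomness internally, such as a random forest.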



# Loading and preparing the data

To begin with, load the data as a `pandas` data frame. Recall that there are missing values in the data. **Make sure to address** the following issues from Part I of the assignment before starting your analysis:

* Remove the missing values
* Remove any column of constant values
* Convert the column `Dt_Customer` to number of days the customer has been with the company.

In [2]:
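import pandas as pd

# Load the semicolon-separated data file
df = pd.read_csv("Mappe1.csv", sep=";")

# Remove rows with missing values
df = df.dropna()

# Remove columns that hold a single constant value
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=constant_cols)

# Convert Dt_Customer to tenure in days, measured from the latest enrollment date in the data
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])
reference_date = df["Dt_Customer"].max()
df["Customer_Days"] = (reference_date - df["Dt_Customer"]).dt.days

print(df.head())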

     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumStorePurchases  NumWebVisitsMonth  \
0  2012-09-04       58       635  ...                  4                  7   
1  2014-03-08       38        11  ...                  2                  5   
2  2013-08-21       26       426  ...                 10                  4   
3  2014-02-10       26        11  ...                  4                  6   
4  2014-01-19       94       173  ...                  6                  5   

   AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp



# Question 1 (45 points)

In this question, we will predict a customer's spending amount on each product category over a two-year period. Let us assume that when we try to predict a customer's spending on a product category (such as wines), their spending on other products is not observable.

In this question and Question 2, we will focus on **wines**.

(i). **Split the data** into two data frames, X (**features**) and y (**target**).

Then, further **split the data** into **training** and **testing** sets.

In [3]:
from sklearn.model_selection import train_test_split

y = df["MntWines"]

X = df.drop(columns=[
    "MntWines",
    "MntFruits",
    "MntMeatProducts",
    "MntFishProducts",
    "MntSweetProducts",
    "MntGoldProds"
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=20
)


(ii). **Build a machine learning pipeline** that combines the following steps to predict spending amount on wines:

* Performing one-hot encoding for the categorical features;
* A random forest model for regression.

In the random forest model, **specify** the following hyperparameters:
* Number of trees;
* Maximum depth of any tree
* Minimum number of data points required to split a node;
* Minimum number of data points in any leaf node

In addition, **fit your model** to the training data.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

df = pd.read_csv("Mappe1.csv", sep=";")
df = df.dropna()

df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])
reference_date = df["Dt_Customer"].max()
df["Customer_Days"] = (reference_date - df["Dt_Customer"]).dt.days
df = df.drop(columns=["Dt_Customer"])

y = df["MntWines"]
X = df.drop(columns=[
    "MntWines",
    "MntFruits",
    "MntMeatProducts",
    "MntFishProducts",
    "MntSweetProducts",
    "MntGoldProds"
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=20
)

cat = X_train.select_dtypes(include="object").columns

pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat)],
    remainder="passthrough"
)

model = Pipeline([
    ("prep", pre),
    ("rf", RandomForestRegressor(
        n_estimators=100,
        max_depth=10,
        min_samples_split=2,
        min_samples_leaf=1,
        random_state=20
    ))
])

model.fit(X_train, y_train)
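The pipeline above passes `handle_unknown="ignore"` to the encoder. The sketch below shows what that option does on a toy column: a category that was not present during fitting is encoded as a row of zeros rather than raising an error. The category values are hypothetical and chosen only for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories (not taken from the assignment data)
train_cats = [["Graduation"], ["PhD"], ["Master"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# A category seen during fit gets exactly one 1 in its row
seen = encoder.transform([["PhD"]]).toarray()

# A category not seen during fit is encoded as all zeros instead of raising an error
unseen = encoder.transform([["Basic"]]).toarray()

print(encoder.categories_[0])
print(seen, unseen)
```

This matters when a rare category lands only in the test set after the train/test split.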






(iii). Use the model to **predict** the spending amount on wines by a customer with the following features.

| Feature | Value |
|---------|-------|
| Age | 48 |
| Education | Graduation |
| Marital_Status | Married |
| Income | 80,000 |
| Kidhome | 1 |
| Teenhome | 1 |
| Dt_Customer | 2016-10-10 |
| Recency | 43 |
| NumDealsPurchases | 2 |
| NumWebPurchases | 1 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 15 |
| NumWebVisitsMonth | 5 |
| AcceptedCmp1, 2, 3, 4, 5 | 0 |
| Complain | 0 |




In [9]:
import pandas as pd

# Derive Year_Birth from the given Age.
# Assumption: Age 48 is measured in the enrollment year (Dt_Customer 2016-10-10),
# so Year_Birth = 2016 - 48 = 1968.

new = pd.DataFrame([{
    "ID": 0,  # placeholder: ID is not given in the task, but the model was fit with this column
    "Year_Birth": 1968,
    "Education": "Graduation",
    "Marital_Status": "Married",
    "Income": 80000,
    "Kidhome": 1,
    "Teenhome": 1,
    "Recency": 43,
    "NumDealsPurchases": 2,
    "NumWebPurchases": 1,
    "NumCatalogPurchases": 0,
    "NumStorePurchases": 15,
    "NumWebVisitsMonth": 5,
    "AcceptedCmp3": 0,
    "AcceptedCmp4": 0,
    "AcceptedCmp5": 0,
    "AcceptedCmp1": 0,
    "AcceptedCmp2": 0,
    "Complain": 0,
    "Z_CostContact": 3,  # assumed value, not given in the task
    "Z_Revenue": 11,  # assumed value, not given in the task
    "Response": 0  # assumed value, not given in the task
}])

# Dt_Customer enters the model as Customer_Days, measured from the same reference date as the training data
new["Customer_Days"] = (reference_date - pd.to_datetime("2016-10-10")).days

pred = model.predict(new)
pred

array([398.01631576])
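One detail worth checking for this prediction is the sign of `Customer_Days`. The sketch below repeats the same date arithmetic with a hypothetical reference date of 2014-12-31; the real `reference_date` is whatever `df["Dt_Customer"].max()` returns. The printed rows show enrollment dates from 2012 to 2014, which suggests (but does not prove) that the reference date is earlier than 2016-10-10.

```python
import pandas as pd

# Hypothetical reference date; in the notebook it is df["Dt_Customer"].max()
example_reference = pd.Timestamp("2014-12-31")
join_date = pd.Timestamp("2016-10-10")

# Same arithmetic as the Customer_Days feature
customer_days = (example_reference - join_date).days

# A join date after the reference date yields a negative tenure
print(customer_days < 0)
```

If the value is negative, the new customer lies outside the range of tenures seen in training, so the prediction is an extrapolation and should be read with caution.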

(iv). **Consider** **two** measures to evaluate the model's performance on the test dataset.

Based on your computational results, how would you describe the model's performance?

In [7]:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

r2, rmse

(0.7766443885863452, np.float64(150.67947150929112))
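To make the two measures concrete, the sketch below computes RMSE and R² by hand on toy values and compares them with the scikit-learn functions. The numbers are illustrative only. Read this way, the R² of about 0.78 reported above means the model explains roughly 78% of the variance in wine spending on the test set, and the RMSE of about 151 is a typical prediction error in the same units as `MntWines`.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Toy actual and predicted values (illustrative only)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

# RMSE: square root of the mean squared error, in the same units as the target
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 minus the residual sum of squares over the total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, r2_manual)
```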

(v). **Perform** a 6-fold cross validation with a performance score of your choice.

**Note**: You may need to research how to specify the performance score for regression models.

In [10]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=6,
    scoring="neg_mean_squared_error"
)

cv_rmse = (-cv_scores) ** 0.5
cv_rmse.mean(), cv_rmse.std()


(np.float64(157.30765504605802), np.float64(7.938246423175109))
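The scoring string `"neg_mean_squared_error"` returns negative values because scikit-learn treats every score as something to maximize. The sketch below shows the sign flip and square root on synthetic data; the dataset and the small forest are assumptions made to keep the example fast, not the assignment setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only, not the supermarket data)
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=20)
rf_toy = RandomForestRegressor(n_estimators=10, random_state=20)

# scikit-learn maximizes scores, so error metrics are reported as negatives
neg_mse = cross_val_score(rf_toy, X_toy, y_toy, cv=6, scoring="neg_mean_squared_error")

# Flip the sign, then take the square root to get one RMSE per fold
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

A small standard deviation across folds, relative to the mean, suggests the error estimate does not depend much on which rows end up in which fold.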

(vi). **Perform** hyperparameter tuning using `GridSearchCV` for the following hyperparameters:

* Number of trees: 50, 100
* Maximum depth of any tree: 5, 10, 15
* Minimum number of data points required to split a node: 3, 6
* Minimum number of data points in any leaf node: 2,4,8


Based on your computational result, **show**:
* the best hyperparameter combination
* the corresponding performance score

In addition, **retrieve** the best model (the one corresponding to the best performance score).

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [5, 10, 15],
    "rf__min_samples_split": [3, 6],
    "rf__min_samples_leaf": [2, 4, 8]
}

grid = GridSearchCV(
    model,
    param_grid,
    cv=6,
    scoring="neg_mean_squared_error",
    n_jobs=-1
)

grid.fit(X_train, y_train)

best_params = grid.best_params_
best_score = grid.best_score_
best_model = grid.best_estimator_

best_params, best_score


({'rf__max_depth': 15,
  'rf__min_samples_leaf': 2,
  'rf__min_samples_split': 3,
  'rf__n_estimators': 100},
 np.float64(-24559.18747344191))
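The best score above is a negative mean squared error, which is hard to read directly. The short calculation below converts it to an RMSE and counts how many models the grid search trains. Note that this is the square root of the mean fold MSE, which can differ slightly from the mean of per-fold RMSE values.

```python
import math

# Best cross-validated score reported above (negative mean squared error)
best_neg_mse = -24559.18747344191

# Flip the sign and take the square root to express it as an RMSE
best_cv_rmse = math.sqrt(-best_neg_mse)

# Grid size: 2 tree counts x 3 depths x 2 split sizes x 3 leaf sizes, each fit on 6 folds
n_combinations = 2 * 3 * 2 * 3
n_fits = n_combinations * 6

print(round(best_cv_rmse, 2), n_combinations, n_fits)
```

The result is an RMSE of roughly 156.7 across 36 combinations and 216 cross-validation fits, plus one final refit of the best combination on the full training set (the `GridSearchCV` default).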

# Question 2 (24 points)

In this question, we will compare the performance of the best model found through `GridSearchCV` in Question 1 with the performance of the linear regression model.

(i). **Construct** a linear regression model using all relevant features and fit it to the training data.

Further, **evaluate** the model's performance on the test data and compare it with the best random forest model found in Question 1, with respect to the two performance measures considered in Question 1.

**Note**: You may use the same training and testing datasets as in Question 1.

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

cat = X_train.select_dtypes(include="object").columns

pre_lin = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat)],
    remainder="passthrough"
)

lin_model = Pipeline([
    ("prep", pre_lin),
    ("lin", LinearRegression())
])

lin_model.fit(X_train, y_train)

y_pred_lin = lin_model.predict(X_test)
r2_lin = r2_score(y_test, y_pred_lin)
rmse_lin = np.sqrt(mean_squared_error(y_test, y_pred_lin))

y_pred_rf = best_model.predict(X_test)
r2_rf = r2_score(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print("Linear Regression - R2:", r2_lin, "RMSE:", rmse_lin)
print("Best Random Forest - R2:", r2_rf, "RMSE:", rmse_rf)


Linear Regression - R2: 0.6926620041622538 RMSE: 176.751774191032
Best Random Forest - R2: 0.781371724604439 RMSE: 149.0763733268506


(ii). Let's further compare the distribution of prediction errors by the two models in the following steps.

**Step 1**. For both the linear regression and the (best) random forest model, compute the absolute residual for each record in the test dataset.
  * Note that the absolute residual is the distance between the predicted value and the actual value, i.e., $|y_{pred}-y_{test}|$.

So we end up with two sets of absolute residuals (one from the linear regression model and the other from the random forest model).

**Step 2**. Each pair of absolute residuals for the same test data point defines a point in a scatterplot. Generate such a scatterplot using `plotly.express` (LR residuals vs. RF residuals).

**Step 3**. Add a 45-degree reference line to the plot. This can be done using the following code. (You may need to change `min_val` and `max_val` for better visualization.)

```
min_val = 0
max_val = 10
fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)
```

**Implement the above steps**.

Note that the above steps essentially create a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). How would you interpret the plot when comparing the two predictive models?

In [13]:
import numpy as np
import pandas as pd
import plotly.express as px

y_pred_lin = lin_model.predict(X_test)
y_pred_rf = best_model.predict(X_test)

res_lin = np.abs(y_pred_lin - y_test.values)
res_rf = np.abs(y_pred_rf - y_test.values)

res_df = pd.DataFrame({
    "LR_residual": res_lin,
    "RF_residual": res_rf
})

fig = px.scatter(
    res_df,
    x="LR_residual",
    y="RF_residual",
    labels={"LR_residual": "Linear Regression |y_pred - y_true|",
            "RF_residual": "Random Forest |y_pred - y_true|"},
    title="Absolute residuals: Linear Regression vs Random Forest"
)

min_val = 0
max_val = float(res_df.max().max())

fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)

fig.show()
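One way to summarize such a plot numerically is to count how many points fall on each side of the 45-degree line. With LR residuals on the x-axis and RF residuals on the y-axis, a point below the line means the random forest had the smaller error for that customer. The sketch below uses toy residuals, which are illustrative only.

```python
import numpy as np

# Toy absolute residuals for the same five test points (illustrative only)
lr_abs = np.array([50.0, 120.0, 30.0, 200.0, 80.0])
rf_abs = np.array([40.0, 90.0, 45.0, 150.0, 80.0])

# Points below the 45-degree line: the RF error is smaller than the LR error
rf_better = np.mean(rf_abs < lr_abs)
# Points above the line: the LR error is smaller
lr_better = np.mean(rf_abs > lr_abs)
# Points exactly on the line
ties = np.mean(rf_abs == lr_abs)

print(rf_better, lr_better, ties)
```

If clearly more than half of the points sit below the line, that supports the conclusion drawn from the aggregate R² and RMSE comparison.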


# Question 3. (24 points)

In this question, we will consider a classification problem on customers' spending on meat products.


(i). For the column that represents customers' spending on meat products, **calculate** the 33.33rd and 66.67th percentiles. (**Hint**: You may use the function `df['MntMeatProducts'].quantile([1/3,2/3])`.)

Based on the two percentiles, **label** each row in the dataset:
* If a customer's spending is below the 33.33rd percentile, label their spending as "low";
* If a customer's spending is above the 66.67th percentile, label their spending as "high";
* If a customer's spending is between the two percentiles, label their spending as "medium".

In [14]:
import numpy as np
import pandas as pd

q_low, q_high = df["MntMeatProducts"].quantile([1/3, 2/3])

def label_meat(x):
    if x < q_low:
        return "low"
    elif x > q_high:
        return "high"
    else:
        return "medium"

df["MeatLabel"] = df["MntMeatProducts"].apply(label_meat)
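As an alternative to a row-by-row `apply`, the same labeling rule can be written in vectorized form with `np.select`. The sketch below uses a toy series, and the names `spend`, `q_lo`, and `q_hi` are illustrative rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy spending values (illustrative only)
spend = pd.Series([5, 10, 20, 40, 80, 160])
q_lo, q_hi = spend.quantile([1/3, 2/3])

# Same rule as the labeling function: strictly below -> low, strictly above -> high, otherwise medium
labels = np.select([spend < q_lo, spend > q_hi], ["low", "high"], default="medium")

print(pd.Series(labels).value_counts())
```

Because the comparisons are strict, values exactly equal to a percentile fall into "medium", which matches the rule stated in the task.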


(ii). **Build a K-nearest-neighbors model** to predict the label of a customer's spending on meat products. The model should be part of a machine learning pipeline that preprocesses the data.





In [15]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

X_cls = df.drop(columns=["MeatLabel", "MntMeatProducts"])
y_cls = df["MeatLabel"]

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cls, y_cls, test_size=0.3, random_state=20, stratify=y_cls
)

cat_cols = X_train_c.select_dtypes(include="object").columns
num_cols = X_train_c.columns.difference(cat_cols)

pre_cls = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ]
)

knn_model = Pipeline(
    [
        ("preprocess", pre_cls),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ]
)

knn_model.fit(X_train_c, y_train_c)
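K-nearest-neighbors relies on distances, so features measured on large scales can dominate those on small scales. The sketch below illustrates why the pipeline standardizes numeric columns, using three toy customers described by income and number of kids. The values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three toy customers: [Income, Kidhome] (illustrative values only)
customers = np.array([
    [30000.0, 0.0],
    [31000.0, 2.0],
    [60000.0, 0.0],
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances are driven almost entirely by Income
raw_01 = dist(customers[0], customers[1])
raw_02 = dist(customers[0], customers[2])

# After standardization both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(customers)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

On the raw data, customer 1 looks about thirty times closer to customer 0 than customer 2 does, even though customers 0 and 1 differ by two children. After scaling, the two distances are of similar size.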


(iii). Evaluate the performance of your model in part (ii) on the test data by **generating the classification report**.

Further, **interpret** each number in the classification report based on the current context.

In [16]:
from sklearn.metrics import classification_report

y_pred_c = knn_model.predict(X_test_c)

print(classification_report(y_test_c, y_pred_c))


              precision    recall  f1-score   support

        high       0.85      0.83      0.84       222
         low       0.78      0.88      0.82       219
      medium       0.68      0.61      0.64       224

    accuracy                           0.77       665
   macro avg       0.77      0.77      0.77       665
weighted avg       0.77      0.77      0.77       665
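To help read the report, the sketch below derives precision, recall, and F1 per class from a confusion matrix on toy labels and checks them against scikit-learn. The labels are illustrative only. In the report above, for example, a precision of 0.85 for "high" means that 85% of the customers predicted as high spenders on meat truly were, a recall of 0.83 means the model found 83% of the actual high spenders, and support is the number of test customers in each class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy true and predicted labels (illustrative only)
y_true = ["high", "high", "low", "low", "medium", "medium", "medium", "low"]
y_hat = ["high", "medium", "low", "low", "medium", "low", "medium", "low"]
classes = ["high", "low", "medium"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=classes)
correct = np.diag(cm)

# Precision: of all customers predicted as a class, the share that truly belong to it
precision = correct / cm.sum(axis=0)
# Recall: of all customers truly in a class, the share the model found
recall = correct / cm.sum(axis=1)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```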



# Describe how you used Gen. AI. in this assignment (2 points)

Gen. AI was an important source of support; without it, the assignment would have been very difficult. It helped with finding a starting point and with resolving error messages.

**Note**: The remaining 5 points will be assigned to readability of the work.