<a href="https://colab.research.google.com/github/Data-Analytics-with-Python/individual-assignment-iii-torkelfaa/blob/main/Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT: Before you start, enter your name and student number below.

**Full Name**:

**Student Number**:

# Predictive Analytics for Nata Supermarket

Welcome to Part III of our case assignment. In this part, we will continue working with the same dataset of **Nata Supermarket**.  Our focus here will be on performing predictive modeling tasks.

Throughout this assignment, please ensure that your results are reproducible by setting the **random_state to 20**



# Loading and preparing the data

To begin with, load the data as a `pandas` data frame. Recall that you there are missing values in the data. **Make sure to address** the following issues from part I of the assignment before starting your analysis:

* Remove the missing values
* Remove any column of constant values
* Convert the column `Dt_Customer` to number of days the customer has been with the company.

In [59]:
import pandas as pd

df = pd.read_excel("nata_supermarket.xlsx")
df = df.dropna()

# Removing any column of constant values
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns = constant_cols)

# Converting Dt_Customer to number of days with the company

# Different date formats gave me issues
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], dayfirst=True)

# Choosing reference date = most recent customer enrollment
max_date = df["Dt_Customer"].max()

# Calculating days each customer has been with the company
df["Days_With_Company"] = (max_date - df["Dt_Customer"]).dt.days



# Question 1 (45 points)

In this question, we will predict a customer's spending amount on each product category over a two year period. Let us assume that when we try to predict a customer's spending on a product category (such as wines), their spending on other products is not observable.

In this question and Question 2, we will focus on **wines**.

(i). **Split the data** into two data frames, X (**features**) and y (**target**).

Then, further **split the data** into **training** and **testing** sets.

In [60]:
from sklearn.model_selection import train_test_split

# Dropping the original datetime column because I got errors when trying to model the RandomForest
df = df.drop(columns=["Dt_Customer"])

# Defining x and y, making sure to remove the spending columns we're not observing
cols_to_remove = [
    "MntMeatProducts", "MntFishProducts", "MntFruits", "MntSweetProducts",
    "MntGoldProds", "ID", "Response"
]

x = df.drop(columns=["MntWines"] + cols_to_remove)
y = df["MntWines"]

# 3. Train/test split (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)


(ii). **Build a machine learning pipeline** that combines the following steps to predict spending amount on wines:

* Performing one-hot encoding for the categorical features;
* A random forest model for regression.

In the random forest model, **specify** the following hyperparameters:
* Number of trees;
* Maximum depth of any tree
* Minimum number of data points required to split a node;
* Minimum number of data points in any leaf node

In addition, **fit your model** to the training data.

In [61]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Identifying categorical and numerical columns
categorical_cols = x_train.select_dtypes(include=["object"]).columns
numerical_cols = x_train.select_dtypes(exclude=["object"]).columns

# Preprocessing
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)

# Random Forest model
rf = RandomForestRegressor(
    n_estimators=300,        # Number of trees
    max_depth=12,            # Max depth of each tree
    min_samples_split=10,    # Minimum samples required to split a node
    min_samples_leaf=4,      # Minimum samples in a leaf node
    random_state=42
)

# Combining preprocessing + model into a pipeline
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("rf", rf)
])

# Fitting the pipeline to the training data
model.fit(x_train, y_train)


(iii). Use the model to **predict** the spending amount on wines by a customer with the following features.

| Feature |  Value  |
|---------|---------|
| Age     | 48      |
|Education|Graduation|
|Marital_Status| Married|
|Income|80,000|
|Kidhome|1|
|Teenhome|1|
|Dt-Customer|2016-10-10|
|Recency|43|     
|NumDealsPurchases|2|
|   NumWebPurchases|1|
|NumCatalogPurchases|0|
|NumStorePurchases|15|
|NumWebVisitsMonth|5|
|AcceptedCmp1,2,3,4,5| 0 |
|Complain|0|




In [62]:
# Computing their days with company
customer_days_with_company = (max_date - pd.to_datetime("2016-10-10")).days

# Building the profile
customer = pd.DataFrame({
    "Year_Birth": [2025 - 48],
    "Education": ["Graduation"],
    "Marital_Status": ["Married"],
    "Income": [80000],
    "Kidhome": [1],
    "Teenhome": [1],
    "Recency": [43],
    "NumDealsPurchases": [2],
    "NumWebPurchases": [1],
    "NumCatalogPurchases": [0],
    "NumStorePurchases": [15],
    "NumWebVisitsMonth": [5],
    "AcceptedCmp1": [0],
    "AcceptedCmp2": [0],
    "AcceptedCmp3": [0],
    "AcceptedCmp4": [0],
    "AcceptedCmp5": [0],
    "Complain": [0],
    "Days_With_Company": [customer_days_with_company],
})

# Predicting
model.predict(customer)


array([379.17129436])

(iv). **Consider** **two** measures to evaluate the model's performance on the test dataset.

Based on you computational results, how would you describe the model's performance?

(v). **Perform** a 6-fold cross validation with a performance score of your choice.

**Note**: You may need to research on how to specify the performance score for regression models.

(vi). **Perform** hyperparameter tuning using `GridSearchCV` for the following hyperparameters:

* Number of trees: 50, 100
* Maximum depth of any tree: 5, 10, 15
* Minimum number of data points required to split a node: 3, 6
* Minimum number of data points in any leaf node: 2,4,8


Based on your computational result, **show**:
* the best hyperparameter comination
* the corresponding performance score

In addition, **retrieve** the best model (the one corresponding to the best performance score).

# Question 2 (24 points)

In this question, we will compare the performance of the best model found through `GridSearchCV` in Question 1 with the performance of the linear regression model.

(i). **Construct** a linear regression model using all relevant features and fit it to the training data.

Further, **evaluate** the model's performance on the test data and compare it with the best random forest model found in Question 1, with respect to the two performance considered in Question 1.

**Note**: You may use the same training and testing datasets as in Question 1.

(ii). Let's further compare the distribution of prediction errors by the two models in the following steps.

**Step 1**. For both the linear regression and the (best) random forest model, compute the absolute residual residual for each record in the test dataset.
  * Note that the absolute residual is distance between the predicted value and actual value, i.e., $|y_{pred}-y_{test}|$.

So we end up with two sets of absolute residuals (one by the linear regression model and the other by the random forest model).

**Step 2**. For each pair of absolute residuals for the same test data point, we can define a point in a scatterplot. Genereate such a scatterplot using `plotly.express` (LR residuals vs. RF residuals).  

**Step 3**. Add a 45 degree reference line to the plot. This can be done using the following codes. (You may need to change `min_val` and `max_val` for better visualization).

```
min_val = 0
max_val = 10
fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)
```

**Implement the above steps**.

Note that the above steps essentially creates a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). How would you interpret the plot (for comparing the two predictive models).

# Question 3. (24 points)

In this question, we will consider a classification problem on customers' spendings on meat products.


(i). For the column that represents customers' spendings on meat products, **calculate** the 33.33% and 66.67% percentiles. (**Hint**: You may use the function `df['MntMeatProducts'].quantile([1/3,2/3])`.)

Based on the two percentiles, **label** each row in the dataset
* If a customer's spending is below the 33.33% percentile, label their spending as "low";

* If a customer's spending is above the 66.67% percentile, label their spending as "high";


* If a customer's spending is between the two percentiles, label their spending as "medium".

(ii). **Build a K-nearest-neighbors model** to predict the label of a customer's spending on meat products. The model should be part of a machine learning pipeline that preprocesses the data.





(iii). Evaluate the performace of your model in part (ii) on the test data by **generating the classification report**.

Further, **interpret** each number in the classification report based on the current context.

# Describe how you used Gen. AI. in this assignment (2 points)

**Note**: The remaining 5 points will be assigned to readability of the work.