<a href="https://colab.research.google.com/github/Data-Analytics-with-Python/individual-assignment-iii-mozzimmashafique-jpg/blob/main/Assignment_2_MozzimmaShafique.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

IMPORTANT: Before you start, enter your name and student number below.

**Full Name**: Mozzimma Shafique

**Student Number**: 400643800

# Predictive Analytics for Nata Supermarket

Welcome to Part III of our case assignment. In this part, we will continue working with the same dataset of **Nata Supermarket**.  Our focus here will be on performing predictive modeling tasks.

Throughout this assignment, please ensure that your results are reproducible by setting the **random_state to 20**.



# Loading and preparing the data

To begin with, load the data as a `pandas` data frame. Recall that there are missing values in the data. **Make sure to address** the following issues from Part I of the assignment before starting your analysis:

* Remove the missing values
* Remove any column of constant values
* Convert the column `Dt_Customer` to number of days the customer has been with the company.

In [31]:
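import pandas as pd
from datetime import datetime

random_state = 20

# Load the data and drop rows with missing values
df = pd.read_csv("NATA Supermarket.csv")
df = df.dropna()

# Drop columns that hold a single constant value
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)

# Convert the enrolment date to the number of days as a customer.
# Dates that fail to parse become NaT (errors="coerce"), and the result
# depends on the day the notebook is run because of datetime.today().
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], dayfirst=True, errors="coerce")
df["Customer_Days"] = (datetime.today() - df["Dt_Customer"]).dt.days

df.head()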



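| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Customer_Days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | ... | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4824.0 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | ... | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4274.0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | NaT | 26 | 426 | ... | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | ... | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4300.0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | NaT | 94 | 173 | ... | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |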
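
The `NaT` entries above come from dates that could not be parsed, and `datetime.today()` makes `Customer_Days` change from run to run. The following is a minimal sketch of one way to make the tenure calculation reproducible, using only the standard library. The day-first `DD-MM-YYYY` format and the reference date are assumptions for illustration, not facts about the dataset.

```python
from datetime import date, datetime

# Assumption: a fixed "as of" date chosen for illustration; any fixed date works.
REFERENCE_DATE = date(2015, 1, 1)

def tenure_days(date_text, fmt="%d-%m-%Y", reference=REFERENCE_DATE):
    """Return the days between the enrolment date and the reference date.

    Returns None when the text does not match `fmt`, so bad rows stay visible.
    """
    try:
        enrolled = datetime.strptime(date_text, fmt).date()
    except ValueError:
        return None
    return (reference - enrolled).days

print(tenure_days("04-09-2012"))  # day-first: 4 September 2012
print(tenure_days("2012/09/04"))  # does not match the assumed format -> None
```

In pandas, the same idea would be to pass an explicit `format=` to `pd.to_datetime` and subtract the dates from a fixed `pd.Timestamp`.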


# Question 1 (45 points)

In this question, we will predict a customer's spending amount on each product category over a two-year period. Let us assume that when we try to predict a customer's spending on a product category (such as wines), their spending on other products is not observable.

In this question and Question 2, we will focus on **wines**.

(i). **Split the data** into two data frames, X (**features**) and y (**target**).

Then, further **split the data** into **training** and **testing** sets.

In [43]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["MntWines"])
y = df["MntWines"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state
)
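
Since spending on the other product categories is assumed to be unobservable, those columns would normally be kept out of the features as well. Here is a small sketch of that selection, assuming every spending column shares the `Mnt` prefix. The column list is an illustrative subset, not the full dataset.

```python
# Illustrative subset of column names; the real frame has more columns.
columns = ["Year_Birth", "Education", "Income", "Recency",
           "MntWines", "MntMeatProducts", "NumStorePurchases"]

target = "MntWines"

# Assumption: all spending columns start with "Mnt".
spending_cols = [c for c in columns if c.startswith("Mnt")]
feature_cols = [c for c in columns if c not in spending_cols]

print(spending_cols)  # columns kept out of X (this includes the target)
print(feature_cols)   # candidate feature columns
```

With a data frame, this would correspond to `X = df[feature_cols]` and `y = df[target]`.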


(ii). **Build a machine learning pipeline** that combines the following steps to predict spending amount on wines:

* Performing one-hot encoding for the categorical features;
* A random forest model for regression.

In the random forest model, **specify** the following hyperparameters:
* Number of trees;
* Maximum depth of any tree
* Minimum number of data points required to split a node;
* Minimum number of data points in any leaf node

In addition, **fit your model** to the training data.

In [33]:
import pandas as pd
import numpy as np
from datetime import datetime

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


df = pd.read_csv("NATA Supermarket.csv")

df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], dayfirst=True, errors="coerce")
df = df.dropna(subset=["Dt_Customer"])

df["Customer_Days"] = (pd.to_datetime("today") - df["Dt_Customer"]).dt.days
df = df.drop(columns=["Dt_Customer"])

y = df["MntWines"]
X = df.drop(columns=["MntWines"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20
)

categorical_cols = X.select_dtypes(include=["object"]).columns
numeric_cols = X.select_dtypes(exclude=["object"]).columns

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numeric_cols),
    ]
)

model = RandomForestRegressor(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=20
)

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", model)
])

pipeline.fit(X_train, y_train)




(iii). Use the model to **predict** the spending amount on wines by a customer with the following features.

| Feature | Value |
|---------|-------|
| Age | 48 |
| Education | Graduation |
| Marital_Status | Married |
| Income | 80,000 |
| Kidhome | 1 |
| Teenhome | 1 |
| Dt_Customer | 2016-10-10 |
| Recency | 43 |
| NumDealsPurchases | 2 |
| NumWebPurchases | 1 |
| NumCatalogPurchases | 0 |
| NumStorePurchases | 15 |
| NumWebVisitsMonth | 5 |
| AcceptedCmp1,2,3,4,5 | 0 |
| Complain | 0 |
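
The table gives an age and an enrolment date, while the pipeline above was trained on `Year_Birth` and `Customer_Days`, so the values need translating first. Below is a hedged sketch of that step. The reference date is an assumption (chosen after the enrolment date so that tenure is positive), and the resulting record would still need every other column the pipeline saw during fitting.

```python
from datetime import date

# Assumption: "today" for the purposes of age and tenure; not given in the assignment.
REFERENCE_DATE = date(2017, 1, 1)

new_customer = {
    "Year_Birth": REFERENCE_DATE.year - 48,                       # Age 48
    "Education": "Graduation",
    "Marital_Status": "Married",
    "Income": 80000,
    "Kidhome": 1,
    "Teenhome": 1,
    "Customer_Days": (REFERENCE_DATE - date(2016, 10, 10)).days,  # from Dt_Customer
    "Recency": 43,
    "NumDealsPurchases": 2,
    "NumWebPurchases": 1,
    "NumCatalogPurchases": 0,
    "NumStorePurchases": 15,
    "NumWebVisitsMonth": 5,
    "Complain": 0,
}
# AcceptedCmp1 ... AcceptedCmp5 are all 0 in the table.
new_customer.update({f"AcceptedCmp{i}": 0 for i in range(1, 6)})

print(new_customer["Year_Birth"], new_customer["Customer_Days"])
```

Wrapping the record in a one-row frame, `pd.DataFrame([new_customer])`, and passing it to `pipeline.predict` would then return the predicted wine spending, provided the column names match those used in training.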




(iv). **Consider** **two** measures to evaluate the model's performance on the test dataset.

Based on your computational results, how would you describe the model's performance?

In [34]:
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

y_pred = pipeline.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

r2, rmse


(0.7363223650948857, np.float64(159.0417189937229))

With R² ≈ 0.74 and RMSE ≈ 159 on the test set, the model shows fair predictive power. R² is the proportion of the variation in customers’ wine spending that the model accounts for, so roughly 74% of that variation is captured and about a quarter is not. RMSE measures the typical size of the prediction error in the same units as the target, so predictions miss actual wine spending by about 159 on a root-mean-square basis. Overall, the model predicts wine spending reasonably but not precisely, and a better-tuned or more flexible model could plausibly improve on it.
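
To make the two measures concrete, the sketch below computes R² and RMSE directly from their definitions. The numbers are toy values for illustration, not taken from the dataset.

```python
import math

def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse

# Toy numbers for illustration only.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 320, 380]
r2, rmse = r2_and_rmse(y_true, y_pred)
print(round(r2, 3), round(rmse, 2))
```

R² compares the model's squared errors with those of always predicting the mean, while RMSE reports the error in the units of the target itself.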

(v). **Perform** a 6-fold cross validation with a performance score of your choice.

**Note**: You may need to research how to specify the performance score for regression models.

In [35]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=6,
    scoring="r2"
)

cv_scores


array([0.7148906 , 0.74947025, 0.7536044 , 0.75303726, 0.67674126,
       0.82509598])
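
The six fold scores are easier to read as a mean and a spread. A short sketch using the scores printed above:

```python
from statistics import mean, stdev

# Fold scores copied from the output above.
cv_scores = [0.7148906, 0.74947025, 0.7536044, 0.75303726, 0.67674126, 0.82509598]

print(f"mean R^2 = {mean(cv_scores):.3f}")
print(f"std  R^2 = {stdev(cv_scores):.3f}")  # sample standard deviation across folds
```

An error-based score such as `scoring="neg_root_mean_squared_error"` could be used in the same way. scikit-learn negates it so that larger values are still better.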

(vi). **Perform** hyperparameter tuning using `GridSearchCV` for the following hyperparameters:

* Number of trees: 50, 100
* Maximum depth of any tree: 5, 10, 15
* Minimum number of data points required to split a node: 3, 6
* Minimum number of data points in any leaf node: 2,4,8


Based on your computational result, **show**:
* the best hyperparameter combination
* the corresponding performance score

In addition, **retrieve** the best model (the one corresponding to the best performance score).

In [36]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, 10, 15],
    "model__min_samples_split": [3, 6],
    "model__min_samples_leaf": [2, 4, 8]
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=6,
    scoring="r2"
)

grid_search.fit(X_train, y_train)

grid_search.best_params_, grid_search.best_score_


({'model__max_depth': 15,
  'model__min_samples_leaf': 2,
  'model__min_samples_split': 3,
  'model__n_estimators': 100},
 np.float64(0.7694583491125032))

In [37]:
best_model = grid_search.best_estimator_
best_model
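
To see how much work the grid search above does, the sketch below enumerates the same grid with `itertools.product` and counts the candidate settings and cross-validation fits.

```python
from itertools import product

param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [5, 10, 15],
    "model__min_samples_split": [3, 6],
    "model__min_samples_leaf": [2, 4, 8],
}

# Every combination of one value per hyperparameter.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 6
print(len(combinations))            # candidate settings
print(len(combinations) * n_folds)  # model fits during cross-validation
```

With the default `refit=True`, one further fit on the whole training set produces `best_estimator_`.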


# Question 2 (24 points)

In this question, we will compare the performance of the best model found through `GridSearchCV` in Question 1 with the performance of the linear regression model.

(i). **Construct** a linear regression model using all relevant features and fit it to the training data.

Further, **evaluate** the model's performance on the test data and compare it with the best random forest model found in Question 1, with respect to the two performance measures considered in Question 1.

**Note**: You may use the same training and testing datasets as in Question 1.

In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

categorical_cols = X.select_dtypes(include=["object"]).columns
numeric_cols = X.select_dtypes(exclude=["object"]).columns

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), categorical_cols),

        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median"))
        ]), numeric_cols)
    ]
)

lin_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LinearRegression())
])

lin_pipeline.fit(X_train, y_train)

y_pred_lin = lin_pipeline.predict(X_test)

r2_lin = r2_score(y_test, y_pred_lin)
rmse_lin = np.sqrt(mean_squared_error(y_test, y_pred_lin))

r2_lin, rmse_lin


(0.5100466809393469, np.float64(216.79625003420065))

To assess the linear regression model, I computed its R² and root mean squared error (RMSE) on the test data and compared them with the best random forest model from Question 1. Linear regression reached R² ≈ 0.51 and RMSE ≈ 217 (the random forest evaluated in Question 1(iv) scored R² ≈ 0.74 and RMSE ≈ 159 on the same test set), so it explains less of the variance in customers' wine spending and makes larger prediction errors. The random forest's higher R² and lower RMSE suggest that it captures the underlying patterns in the data better. In short, the random forest outperforms linear regression on both evaluation measures.

(ii). Let's further compare the distribution of prediction errors by the two models in the following steps.

**Step 1**. For both the linear regression and the (best) random forest model, compute the absolute residual for each record in the test dataset.
  * Note that the absolute residual is the distance between the predicted value and the actual value, i.e., $|y_{pred}-y_{test}|$.

So we end up with two sets of absolute residuals (one by the linear regression model and the other by the random forest model).

**Step 2**. Each pair of absolute residuals for the same test data point defines a point in a scatterplot. Generate such a scatterplot using `plotly.express` (LR residuals vs. RF residuals).

**Step 3**. Add a 45-degree reference line to the plot. This can be done using the following code. (You may need to change `min_val` and `max_val` for better visualization.)

```
min_val = 0
max_val = 10
fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash"),
)
```

**Implement the above steps**.

Note that the above steps essentially create a [Q-Q plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot). How would you interpret the plot when comparing the two predictive models?

In [39]:
import numpy as np
import plotly.express as px

y_pred_lin = lin_pipeline.predict(X_test)
y_pred_rf_best = best_model.predict(X_test)

abs_res_lin = np.abs(y_pred_lin - y_test)
abs_res_rf = np.abs(y_pred_rf_best - y_test)

fig = px.scatter(
    x=abs_res_lin,
    y=abs_res_rf,
    labels={"x": "Linear Regression Residuals", "y": "Random Forest Residuals"},
    title="Absolute Residuals: Linear Regression vs Random Forest"
)

min_val = 0
max_val = max(abs_res_lin.max(), abs_res_rf.max())

fig.add_shape(
    type="line",
    x0=min_val, y0=min_val,
    x1=max_val, y1=max_val,
    line=dict(color="red", width=2, dash="dash")
)

fig.show()


The Q–Q style scatterplot compares the absolute residuals of the linear regression model (x-axis) with those of the random forest model (y-axis) for each test observation. If the two models performed similarly, the points would lie close to the 45-degree reference line. In this plot most points fall below the line, which means the random forest typically has a smaller residual than linear regression on the same observation. This indicates that the random forest predicts more accurately and captures the relationships in the data better than linear regression.
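
The visual impression of "most points below the line" can also be expressed as a single number: the share of test points where the random forest's absolute residual is the smaller one. A minimal sketch with toy residuals (not the notebook's values):

```python
def share_rf_better(abs_res_lin, abs_res_rf):
    """Fraction of test points where the random forest's absolute residual is smaller."""
    pairs = list(zip(abs_res_lin, abs_res_rf))
    below_line = sum(1 for lin, rf in pairs if rf < lin)
    return below_line / len(pairs)

# Toy residuals for illustration only.
abs_res_lin = [120.0, 45.0, 300.0, 10.0, 80.0]
abs_res_rf = [60.0, 50.0, 150.0, 5.0, 80.0]
print(share_rf_better(abs_res_lin, abs_res_rf))
```

A share well above 0.5 would support the reading that the random forest is more accurate on most observations. Points exactly on the line are ties and count for neither model.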

# Question 3. (24 points)

In this question, we will consider a classification problem on customers' spending on meat products.


(i). For the column that represents customers' spending on meat products, **calculate** the 33.33rd and 66.67th percentiles. (**Hint**: You may use the function `df['MntMeatProducts'].quantile([1/3,2/3])`.)

Based on the two percentiles, **label** each row in the dataset:
* If a customer's spending is below the 33.33rd percentile, label their spending as "low";
* If a customer's spending is above the 66.67th percentile, label their spending as "high";
* If a customer's spending is between the two percentiles, label their spending as "medium".

In [40]:
p33, p66 = df["MntMeatProducts"].quantile([1/3, 2/3])

def label_spending(x):
    if x < p33:
        return "low"
    elif x > p66:
        return "high"
    else:
        return "medium"

df["Meat_Spend_Level"] = df["MntMeatProducts"].apply(label_spending)

df["Meat_Spend_Level"].head()


| | Meat_Spend_Level |
|---|---|
| 0 | high |
| 1 | low |
| 3 | low |
| 5 | medium |
| 7 | medium |


(ii). **Build a K-nearest-neighbors model** to predict the label of a customer's spending on meat products. The model should be part of a machine learning pipeline that preprocesses the data.





In [41]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

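# Features are every column except the label.
# Caution: MntMeatProducts, from which the label was derived, is still among them.
X_class = df.drop(columns=["Meat_Spend_Level"])
y_class = df["Meat_Spend_Level"]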

categorical_cols = X_class.select_dtypes(include=["object"]).columns
numeric_cols = X_class.select_dtypes(exclude=["object"]).columns

preprocessor_class = ColumnTransformer(
    transformers=[
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))
        ]), categorical_cols),
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), numeric_cols)
    ]
)

knn_model = KNeighborsClassifier(n_neighbors=5)

knn_pipeline = Pipeline([
    ("preprocess", preprocessor_class),
    ("model", knn_model)
])

knn_pipeline.fit(X_class, y_class)
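
The `StandardScaler` step matters for K-nearest neighbors because the model ranks neighbors by distance, and a feature measured in large units (such as income) would otherwise dominate. The sketch below shows the effect with plain Python. The four customers are toy values, and `standardize` is a hypothetical helper that mimics the zero-mean, unit-variance rescaling.

```python
import math
from statistics import mean, pstdev

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale each column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

# Toy customers: [Income, Kidhome]. Illustrative values only.
rows = [[50000, 0], [50500, 2], [54000, 0], [90000, 1]]

# Raw distances from customer 0: income dominates, so customer 1 looks closest.
raw_d01, raw_d02 = euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2])
print(raw_d01 < raw_d02)

# Standardized distances: the difference in Kidhome now counts, so customer 2 is closer.
z = standardize(rows)
std_d01, std_d02 = euclidean(z[0], z[1]), euclidean(z[0], z[2])
print(std_d01 < std_d02)
```

The nearest neighbor of the first customer changes once the features are on a common scale, which is why the scaler sits inside the pipeline ahead of the classifier.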


(iii). Evaluate the performance of your model in part (ii) on the test data by **generating the classification report**.

Further, **interpret** each number in the classification report based on the current context.

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_class, y_class, test_size=0.2, random_state=20
)

knn_pipeline.fit(X_train_c, y_train_c)

y_pred_c = knn_pipeline.predict(X_test_c)

print(classification_report(y_test_c, y_pred_c))


              precision    recall  f1-score   support

        high       0.93      0.89      0.91        56
         low       0.70      0.91      0.79        55
      medium       0.81      0.66      0.73        73

    accuracy                           0.80       184
   macro avg       0.81      0.82      0.81       184
weighted avg       0.82      0.80      0.80       184



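Precision is the share of customers the model labeled “low”, “medium”, or “high” who truly belong to that group. For example, 0.93 for “high” means that 93% of the customers predicted to be high spenders on meat really were, while 0.70 for “low” means that 30% of the customers predicted to be low spenders belonged to another group. Higher precision means fewer wrong labels.

Recall is the share of customers who truly belong to a group that the model managed to identify. The recall of 0.66 for “medium” means the model found only 66% of the actual medium spenders, whereas it found 89% of the high spenders and 91% of the low spenders. Higher recall means fewer missed customers.

F1-score is the harmonic mean of precision and recall, so it is high only when both are high. It balances false positives against false negatives in a single value per class: 0.91 for “high”, 0.79 for “low”, and 0.73 for “medium”.

Support is the number of test-set customers who actually belong to each spending group: 56 high, 55 low, and 73 medium, for 184 in total. The accuracy of 0.80 means that 80% of these 184 customers were labeled correctly. The macro average is the unweighted mean of the per-class scores, and the weighted average weights each class by its support.

A well-performing model needs high precision, recall, and F1-score for all three labels. A low score for any class indicates that the model struggles to separate customers in that class from the others. Here the weakest figures are the precision for “low” (0.70) and the recall for “medium” (0.66), so medium spenders are the group most often mislabeled.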
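
For reference, the per-class figures in a classification report can be reproduced from the true and predicted labels. The sketch below does so from the definitions, using toy labels rather than the notebook's predictions.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class, from their definitions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true members of the class
    return precision, recall, f1, support

# Toy labels for illustration only.
y_true = ["low", "low", "medium", "medium", "medium", "high"]
y_pred = ["low", "low", "low", "medium", "high", "high"]
print(per_class_metrics(y_true, y_pred, "low"))
```

The same harmonic-mean formula links the reported values: for the “high” row, 2 × 0.93 × 0.89 / (0.93 + 0.89) ≈ 0.91.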

# Describe how you used Gen. AI. in this assignment (2 points)

I used Generative AI as an auxiliary aid to clarify what the assignment required and to check my understanding of how the machine learning methods work. It also helped me find the relevant Python functions and syntax, such as how to build a pipeline, preprocess the data, and evaluate a model. All code decisions were made, and all results produced, by me within the Colab notebook.

**Note**: The remaining 5 points will be assigned to readability of the work.