# Lab 11 (Evaluable)

We work for one of the most popular car buying and selling platforms in the world. The product team wants to introduce a price recommender that suggests a price based on the characteristics of the car a user wants to sell. They have asked the Data Science team to tackle the challenge, which includes:
- An exhaustive analysis of the data of the vehicles introduced in the platform in the past.
- The development of a predictive pricing model.
- The creation of a Streamlit app that lets users view the results of the analysis and interact with the model.
- Adding an explainability tab to the app so that all users can understand why each price is recommended to them.

# Practice Information:
**Due date:** November 28th (14h)

**Submission procedure:** via Moodle.

**Name:** Luca Franceschi

**NIA:** 253885

### Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from random import seed
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
#from xgboost import XGBRegressor
import lightgbm as lgb
import pickle

import warnings
warnings.simplefilter('ignore')

sns.set_palette("icefire")

### Read the Data

In [None]:
df = pd.read_csv("car_ad_display.csv", encoding = "ISO-8859-1", sep=";").drop(columns='Unnamed: 0')
df.head(3)

## Data Gathering and Data Wrangling

In [None]:
df.info()

In [None]:
df = df.dropna()
df.info()

#### EX1: How many different entries do we have for the car names column?

**Solution:** Both the `car` (brand) and `model` columns have 8467 entries.
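
A minimal sketch of how entries and distinct values can be told apart, using a small hypothetical DataFrame rather than the lab's data: `len` counts rows, while `nunique` counts the different values in a column.

```python
import pandas as pd

# Hypothetical toy data, not taken from car_ad_display.csv
toy = pd.DataFrame({
    "car": ["BMW", "BMW", "Audi", "Ford", "Audi"],
    "model": ["X5", "320", "A4", "Focus", "A4"],
})

n_rows = len(toy)                  # number of entries (rows)
n_brands = toy["car"].nunique()    # number of distinct brands
n_models = toy["model"].nunique()  # number of distinct models

print(n_rows, n_brands, n_models)
```

On the real data, the same two calls would be applied to `df["car"]` and `df["model"]`.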

#### Let's reduce the number of car names with a cutoff

In [None]:
def shorten_categories(categories, cutoff):
    categorical_map = {}
    for i in range(len(categories)):
        if categories.values[i] >= cutoff:
            categorical_map[categories.index[i]] = categories.index[i]
        else:
            categorical_map[categories.index[i]] = 'Other'
    return categorical_map

In [None]:
car_map = shorten_categories(df.car.value_counts(), 10)
df['car'] = df['car'].map(car_map)
df.car.value_counts()
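
The cutoff idea above can also be sketched without an explicit loop. This is only an alternative under the same assumption (labels seen fewer than `cutoff` times become `'Other'`); the toy series is hypothetical.

```python
import pandas as pd

# Hypothetical brand column
cars = pd.Series(["BMW", "BMW", "BMW", "Audi", "Audi", "Lada"])

cutoff = 2
counts = cars.map(cars.value_counts())             # frequency of each row's own label
shortened = cars.where(counts >= cutoff, "Other")  # keep frequent labels, replace the rest

print(shortened.tolist())
```

Both versions leave the number of rows unchanged; only the labels of rare categories are rewritten.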

#### EX2: Do the same with car model feature!

In [None]:
# CODE HERE 
# NOTE: with the same threshold, roughly 2/3 of the data is collapsed into 'Other'

model_map = shorten_categories(df.model.value_counts(), 10)
df['model'] = df['model'].map(model_map)
df.model.value_counts()

#### EX3: Plot a bar chart of the TOP 10 most expensive cars.
#### What is the mean price per car brand? (for the top 10 most expensive)

In [None]:
# Duplicates are averaged, and different models / bodies of the same brand are treated as different cars.
# Grouping by brand alone would rank brands, and the exercise asks for cars specifically.

tmp = df.groupby(['car', 'model', 'body']).mean(numeric_only=True).sort_values('price', ascending=False)
tmp = tmp.reset_index()
# Double quotes outside so the f-string also parses on Python < 3.12
tmp['index'] = tmp.apply(lambda x: f"{x['car']}: {x['model']} ({x['body']})", axis=1)
tmp = tmp.reset_index(drop=True)
tmp.head(10)

In [None]:
# CODE HERE 
sns.barplot(tmp[:10], x='price', y='index')
plt.title('Top 10 most expensive cars')
plt.show()

In [None]:
tmp = df.groupby(['car']).mean(numeric_only=True).sort_values('price', ascending=False)
tmp = tmp.reset_index()
tmp.head(10)

In [None]:
sns.barplot(tmp[:10], x='price', y='car')
plt.title('Top 10 most expensive brands on average')
plt.show()

#### Let's analyze each variable distribution (except for car and model)

In [None]:
# Define the type of plot for each column based on the data type
plot_types = {}
columns = [x for x in df.columns if x not in ["car", "model"]]

for col in columns:
    if df[col].dtype == 'object':  # Categorical columns
        plot_types[col] = 'bar'
    else:
        unique_values = df[col].nunique()
        if unique_values < 10:  # Discrete columns
            plot_types[col] = 'bar'
        else:  # Continuous columns
            plot_types[col] = 'kde'

n_cols = 3
n_rows = (len(columns) + 2) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 5, n_rows * 4))
axes = axes.flatten()

# Plot each column in the dataframe
for i, col in enumerate(columns):
    ax = axes[i]
    if plot_types[col] == 'bar':
        # For categorical and discrete data, use a count plot (bar chart)
        sns.countplot(x=col, data=df, ax=ax)
        ax.set_title(f'Count Plot of {col}')
        ax.set_xlabel('')
        ax.set_ylabel('Counts')
        plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
    else:
        # For continuous data, use a density plot
        sns.kdeplot(df[col], ax=ax, fill=True)
        ax.set_title(f'Density Plot of {col}')
        ax.set_xlabel(col)
        ax.set_ylabel('Density')

# Hide any unused subplots
for j in range(i + 1, n_rows * n_cols):
    fig.delaxes(axes[j])

fig.tight_layout()
plt.show()

In [None]:
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns

n_cols = 2
n_rows = (len(numeric_columns) + n_cols - 1) // n_cols  # ceiling division

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 3, n_rows * 2))
axes = axes.flatten()

for i, col in enumerate(numeric_columns):
    sns.boxplot(y=col, data=df, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}')

for j in range(i + 1, n_rows * n_cols):
    fig.delaxes(axes[j])

fig.tight_layout()
plt.show()
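
The plotting cells above size their subplot grid from the number of columns to draw. A sketch of the general ceiling-division rule for that, where `grid_rows` is a hypothetical helper and not part of the notebook:

```python
def grid_rows(n_plots: int, n_cols: int) -> int:
    """Smallest number of rows that fits n_plots into n_cols columns."""
    return (n_plots + n_cols - 1) // n_cols

print(grid_rows(7, 3))
print(grid_rows(4, 2))
print(grid_rows(6, 3))
```

Any axes left over in the last row can then be removed with `fig.delaxes`, as the cells above already do.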

#### Let's analyze each variable behaviour with respect to the target (price)

In [None]:
target = 'price'
features = [x for x in df.columns if x not in ["car", "model", target]]

n_cols = 3
n_rows = (len(features) + 2) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 6))
axes = axes.flatten()

# Plot each feature against the target variable in the dataframe
for i, feature in enumerate(features):
    ax = axes[i]
    if df[feature].dtype == 'object' or df[feature].nunique() < 10:
        # For categorical data, use a boxplot or violin plot
        sns.boxplot(x=feature, y=target, data=df, ax=ax)
    else:
        # For numerical data, use a scatter plot
        sns.scatterplot(x=feature, y=target, data=df, ax=ax)
    ax.set_title(f'{feature} vs {target}')
    plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

# Hide any unused subplots
for j in range(i + 1, n_rows * n_cols):
    fig.delaxes(axes[j])

fig.tight_layout()
plt.show()

#### As we see, there are many outliers in the features and in the target data.
#### Let's get rid of outliers in the target

In [None]:
# Keep only prices between 1K and 100K
df = df[df["price"] <= 100000]
df = df[df["price"] >= 1000]

plt.figure(figsize=(6, 4))
sns.kdeplot(x="price", data=df, fill=True)
plt.title('Density Plot of Price')  # set the title on the current figure, not a stale `ax`
plt.show()

#### Let's get rid of outliers in the rest of the numeric features

In [None]:
# Keep only rows with mileage up to 600
df = df[df["mileage"] <= 600]

# Keep only rows with engV up to 7.5
df = df[df["engV"] <= 7.5]

# Keep only cars from 1975 onwards
df = df[df["year"] >= 1975]
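
The price and feature thresholds used above can also be combined into one boolean mask, which makes the full filtering rule visible in a single place. A minimal sketch on hypothetical listings (only the column names match the lab's dataset):

```python
import pandas as pd

# Hypothetical listings, not real rows from the dataset
toy = pd.DataFrame({
    "price":   [500, 12000, 250000, 8000],
    "mileage": [120, 90, 10, 750],
    "engV":    [1.6, 2.0, 6.0, 99.0],
    "year":    [1999, 2012, 2016, 1970],
})

mask = (
    toy["price"].between(1000, 100000)  # inclusive on both ends by default
    & (toy["mileage"] <= 600)
    & (toy["engV"] <= 7.5)
    & (toy["year"] >= 1975)
)
filtered = toy[mask]

print(len(filtered))
```

A single mask also makes it easy to count how many rows each condition rejects, for example with `(~mask).sum()`.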

#### Check how the behaviour of the features with the target has changed

In [None]:
target = 'price'
features = [x for x in df.columns if x not in ["car", "model", target]]

n_cols = 3
n_rows = (len(features) + 2) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 6))
axes = axes.flatten()

# Plot each feature against the target variable in the dataframe
for i, feature in enumerate(features):
    ax = axes[i]
    if df[feature].dtype == 'object' or df[feature].nunique() < 10:
        # For categorical data, use a boxplot or violin plot
        sns.boxplot(x=feature, y=target, data=df, ax=ax)
    else:
        # For numerical data, use a scatter plot
        sns.scatterplot(x=feature, y=target, data=df, ax=ax)
    ax.set_title(f'{feature} vs {target}')
    plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

# Hide any unused subplots
for j in range(i + 1, n_rows * n_cols):
    fig.delaxes(axes[j])

fig.tight_layout()
plt.show()

#### EX4: Which of the features do you predict would be more important for estimating the price?

**Solution:** Judging from the plots above, the features whose values separate prices most clearly (i.e., different classes or ranges of the same feature show very different price distributions), and which are therefore most likely to be important for predicting the price, are:

- Year
- EngV
- Mileage
- Drive
- (Not seen in above plots, but seen in exercise before) Brand

#### EX5: After all changes, how many rows are left?

In [None]:
# CODE HERE:
len(df)

**Solution:** 
8224 non-null rows

### Let's prepare the data for the model:

In [None]:
df_original = df.copy()
df.info()

In [None]:
#Let's encode the string features:

le_car = LabelEncoder()
df['car'] = le_car.fit_transform(df['car'])
print('*CAR: \n', df["car"].unique(), '\n')

le_body = LabelEncoder()
df['body'] = le_body.fit_transform(df['body'])
print('*BODY: \n', df["body"].unique(), '\n')

le_engType = LabelEncoder()
df['engType'] = le_engType.fit_transform(df['engType'])
print('*EngType: \n', df["engType"].unique(), '\n')

le_drive = LabelEncoder()
df['drive'] = le_drive.fit_transform(df['drive'])
print('*DRIVE: \n', df["drive"].unique(), '\n')
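
To make the encoding step above concrete, here is a small sketch of what `LabelEncoder` does on hypothetical body types: labels are stored in sorted order and replaced by their position in that order.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["sedan", "crossover", "van", "sedan"])

print(list(le.classes_))                   # labels in sorted order
print(codes.tolist())                      # integer code of each input label
print(le.inverse_transform([0]).tolist())  # back from code to label
```

Because the codes only reflect alphabetical order, their numeric size carries no meaning about the cars. Calling `transform` with a label that was not seen during fitting raises a `ValueError`.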

In [None]:
#Encode the registration string feature as an integer (0/1) boolean feature
yes_l = ['yes', 'YES', 'Yes', 'y', 'Y']
df['registration'] = np.where(df['registration'].isin(yes_l), 1, 0)
df['registration'].value_counts()

In [None]:
# We will drop 'model' feature as there is no simple way to handle that amount of unique values.
df = df.drop(columns='model')

In [None]:
print(df.info())
df.head(3)

#### EX6: Now that all data is in numeric data type, Plot the correlation matrix among features:

In [None]:
# CODE HERE

corr = df.corr(numeric_only=True)
plt.figure(figsize=(6, 5))
heatmap = sns.heatmap(corr, vmin=-1, vmax=1, annot=True, cmap='icefire')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':10});

#### EX7: Which variables are more correlated with the target?

**Solution:**

- Most positively correlated: engV, year, drive
- Most negatively correlated: mileage, body (since `body` is a label-encoded categorical, the *sign* of its correlation is not meaningful, only the fact that it is correlated)
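
For reference, `DataFrame.corr` computes the Pearson coefficient by default. For two columns $x$ and $y$ with $n$ rows and means $\bar{x}$, $\bar{y}$:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Since the formula works on the numeric values themselves, the result for a label-encoded column depends on the arbitrary integer codes assigned to its categories.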

## Model training

In [None]:
#Let's split train and test data
X = df.drop("price", axis=1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_test.to_csv('X_test.csv')

#### Ensure X and y have the same length for both train and test

In [None]:
print("Length X_train:", len(X_train))
print("Length y_train:", len(y_train))
print("Length X_test:", len(X_test))
print("Length y_test:", len(y_test))

#### Try different models:

In [None]:
#Linear Regression:

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
display(linear_reg)

y_pred_test = linear_reg.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("{:,.02f}".format(error))
# This is an error metric (RMSE), so it is printed without a currency symbol

In [None]:
#Random Forest:

random_forest_reg = RandomForestRegressor(random_state=0)
random_forest_reg.fit(X_train, y_train)
display(random_forest_reg)

y_pred_test = random_forest_reg.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("{:,.02f}".format(error))

In [None]:
#LightGBM:

lgb_reg = lgb.LGBMRegressor()
lgb_reg.fit(X_train, y_train)
display(lgb_reg)

y_pred_test = lgb_reg.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("{:,.02f}".format(error))
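
All three cells above report the same metric, the root mean squared error (RMSE). A minimal sketch of that computation with NumPy only, where `rmse` is a hypothetical helper and the prices are made up:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices and predictions
print(rmse([10000, 20000, 30000], [11000, 18000, 30000]))
```

Squaring before averaging means that a few large misses weigh more than many small ones, which is worth keeping in mind when comparing models on a target with a long tail such as price.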

#### It seems that LightGBM performs best for this use case, so let's continue with this algorithm and run a grid search to choose the best parameters (this can take some minutes):

In [None]:
# Add as many parameters as you want
max_depth = [2, 8, 12]
n_estimators = [50, 100, 300]
learning_rate = [0.1]

parameters = {
    "max_depth": max_depth,
    "n_estimators": n_estimators,
    "learning_rate": learning_rate}

lgb_reg = lgb.LGBMRegressor(random_state=42, force_row_wise=True)

# Grid Search
gs = GridSearchCV(lgb_reg, parameters, scoring='neg_mean_squared_error')
gs.fit(X_train, y_train)
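
To get a feel for why the search takes a while, the number of model fits can be counted from the grid. A sketch, assuming the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not given:

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "max_depth": [2, 8, 12],
    "n_estimators": [50, 100, 300],
    "learning_rate": [0.1],
}

n_candidates = len(ParameterGrid(parameters))  # every combination of the listed values
cv_folds = 5                                   # GridSearchCV default when cv is not set
n_fits = n_candidates * cv_folds               # excludes the final refit on all training data

print(n_candidates, n_fits)
```

Because the scoring is `neg_mean_squared_error`, `gs.best_score_` is a negative number; `np.sqrt(-gs.best_score_)` gives a cross-validated figure on the same scale as the RMSE values printed elsewhere in this notebook.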

In [None]:
lgb_reg = gs.best_estimator_
lgb_reg.fit(X_train, y_train)

y_pred_test = lgb_reg.predict(X_test)

error = np.sqrt(mean_squared_error(y_test, y_pred_test))
print("{:,.02f}".format(error))

print("The R2_score is:", r2_score(y_test, y_pred_test))

#### EX8: Test with an invented example (just run the code and answer the questions):

In [None]:
A = []
Q = [
    "Enter your car brand "+  str(list(df_original['car'].unique()[:5]))[:-1]+" , ...]: ",
    "Enter the body category of your car "+ str(list(df_original['body'].unique()))+': ',
    "Enter the mileage: ",
    "Enter the engV (use '.' as decimal): ",
    "Enter the engType "+ str(list(df_original['engType'].unique()))+': ',
    "Enter whether it is registered (yes/no): ",
    "Enter the year of the car: ",
    "Enter the drive type of the car "+ str(list(df_original['drive'].unique()))+': ']

for q in Q:
    a = input(q)
    A.append(a)

print("Your answers are:", A)

In [None]:
X_sample = np.array([A])

# Apply the encoder and data type corrections:
X_sample[:, 0] = str(X_sample[:, 0][0] if X_sample[:, 0][0] in list(df_original['car'].unique()) else 'Other')
X_sample[:, 0] = le_car.transform(X_sample[:,0])
X_sample[:, 1] = le_body.transform(X_sample[:,1])
X_sample[:, 4] = le_engType.transform(X_sample[:,4])
X_sample[:, 5] = int(1 if X_sample[:, 5][0] in yes_l else 0)
X_sample[:, 7] = le_drive.transform(X_sample[:,7])

X_sample = np.array([[
    int(X_sample[0, 0]),
    int(X_sample[0, 1]),
    int(X_sample[0, 2]),
    float(X_sample[0, 3]),
    int(X_sample[0, 4]),
    int(X_sample[0, 5]),
    int(X_sample[0, 6]),
    int(X_sample[0, 7])
]])

print('The encoded array is: ', X_sample)
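
The encoding steps above can be sketched as one small function. Plain dictionaries stand in for the fitted `LabelEncoder` objects here, so `encode_answers`, `CAR_CODES` and the other tables are hypothetical names with made-up codes, not the notebook's actual encoders.

```python
# Hypothetical code tables standing in for le_car, le_body, le_engType, le_drive
CAR_CODES = {"Audi": 0, "BMW": 1, "Other": 2}
BODY_CODES = {"crossover": 0, "sedan": 1}
ENGTYPE_CODES = {"Diesel": 0, "Petrol": 1}
DRIVE_CODES = {"front": 0, "full": 1, "rear": 2}
YES_ANSWERS = {"yes", "YES", "Yes", "y", "Y"}

def encode_answers(answers):
    """Turn the eight raw string answers into one numeric feature row."""
    car, body, mileage, eng_v, eng_type, registered, year, drive = answers
    car = car if car in CAR_CODES else "Other"  # unseen brands fall back to 'Other'
    return [
        CAR_CODES[car],
        BODY_CODES[body],
        int(mileage),
        float(eng_v),
        ENGTYPE_CODES[eng_type],
        1 if registered in YES_ANSWERS else 0,
        int(year),
        DRIVE_CODES[drive],
    ]

row = encode_answers(["Tesla", "sedan", "120", "2.0", "Petrol", "yes", "2012", "rear"])
print(row)
```

Keeping the conversion in one function makes it easier to reuse the same logic later, for example inside the Streamlit app.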

In [None]:
y_pred_sample = lgb_reg.predict(X_sample)
print("Your car estimated price is: ","${:,.02f}".format(y_pred_sample[0]))

Which questions? The exercise statement does not list any.

### Store and read the model

In [None]:
# Store
import os

os.makedirs('models', exist_ok=True)  # make sure the target folder exists
data = {"model": lgb_reg, "le_car": le_car, "le_body": le_body, "le_engType":le_engType , "le_drive":le_drive}
with open('models/model.pkl', 'wb') as file:
    pickle.dump(data, file)

In [None]:
# Read
with open('models/model.pkl', 'rb') as file:
    data = pickle.load(file)

model = data["model"]
le_car = data["le_car"]
le_body = data["le_body"]
le_engType = data["le_engType"]
le_drive = data["le_drive"]

y_pred_sample = model.predict(X_sample)
print("Your car estimated price is: ","${:,.02f}".format(y_pred_sample[0]))
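
The store/read round trip can be checked in isolation. The sketch below uses a placeholder dictionary instead of the real model and encoders, and a temporary folder so nothing is written next to the notebook.

```python
import os
import pickle
import tempfile

# Stand-in payload; in the notebook this would hold the model and encoders
payload = {"model": "placeholder", "le_car": ["Audi", "BMW", "Other"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "models")
    os.makedirs(model_dir, exist_ok=True)  # open() does not create missing folders
    path = os.path.join(model_dir, "model.pkl")

    with open(path, "wb") as file:
        pickle.dump(payload, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == payload)
```

Note that `pickle.load` can execute arbitrary code, so only files from a trusted source should be unpickled.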

In [None]:
df.to_csv('car_ad_display_clean.csv')

## Explainability AI

As excellent data scientists, we cannot conclude our work without understanding how the model works. In this section of the project, we will apply SHAP as a technique to understand, debug and explain our model.

### Global explainability

#### EX9: Train a Shap explainer and calculate the shap_values object for the X_test dataset. Print the shap values object of the first sample of X_test.

In [None]:
# CODE HERE

import shap
shap.initjs()
explainer = shap.Explainer(model)
shap_values = explainer(X_test)

display(X_test)
print(shap_values[0])

#### EX10: What is the average predicted price across all cars?

**Solution:**

The `base_values` attribute holds the average prediction for any car: in this case, $14,393.64.

Note: ($ is in local currency, not necessarily USD)
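
For reference, the base value is the starting point of SHAP's additive decomposition. For a sample $x$ with $M$ features, where $\phi_0$ is the base value and $\phi_i(x)$ is the SHAP value of feature $i$:

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x)
```

Here $\phi_0$ is the expected model output, roughly the average prediction over the data the explainer treats as its reference, so every individual prediction is explained as a deviation from that average.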

#### Let's plot the summary plot and bar plot for global explainability of the model.

In [None]:
#Global Explainability
shap.summary_plot(shap_values, X_test)

In [None]:
#Bar plot: mean absolute SHAP value of each feature
shap.plots.bar(shap_values)

#### EX11: What are your insights?

**Solution:**
The features that matter most for this model can be read off the previous bar plot, in descending order. This suggests that if we removed, for instance, `body` from the analysis, the results would probably not change drastically. If we dropped or altered `year`, however, the results would change drastically, and the model would likely get significantly worse.
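
The ranking in the bar plot is based on the mean absolute SHAP value of each feature. A minimal sketch of that aggregation on a hypothetical SHAP matrix (the numbers are invented for illustration):

```python
import numpy as np

feature_names = ["year", "engV", "body"]
# Hypothetical SHAP matrix: one row per car, one column per feature
shap_matrix = np.array([
    [-8000.0, -1500.0, 100.0],
    [6000.0, 2000.0, -50.0],
    [4000.0, -500.0, 150.0],
])

importance = np.abs(shap_matrix).mean(axis=0)  # mean |SHAP| per feature
ranking = [feature_names[i] for i in np.argsort(-importance)]

print(ranking)
```

Taking absolute values first matters: a feature that pushes some prices up and others down would otherwise average out to nearly zero and look unimportant.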

#### Let's do a deep dive into the variables `mileage`, `engV` and `year`.

In [None]:
shap.plots.scatter(shap_values[:,"mileage"])

In [None]:
shap.plots.scatter(shap_values[:,"engV"])

In [None]:
shap.plots.scatter(shap_values[:,"year"])

#### EX12: What are the most relevant insights about the evolution of the features' values and their Shap values?

**Solution:**
`mileage` matters a lot, but only over a small part of its range. Once the mileage goes past 100 (probably thousand miles), its SHAP value barely changes, so differences in mileage within `[100, 600]` have little effect on the prediction. This is especially true beyond `300`.

The effect of `engV` is roughly linear, with a break-even point at around `3L`: below it the impact on the price is negative, above it positive.

The effect of `year` looks exponential. Cars manufactured before the `2010`s all have a similar, clearly negative impact; after that, the impact grows roughly exponentially with the manufacturing year.

#### Let's analyze the relationship of the variables `engV` and `year` and their Shap values according to the value of `mileage`.

In [None]:
#Let's analyze the evolution of Shap values of engV based on mileage
shap.dependence_plot("engV", shap_values.values, X_test, interaction_index= "mileage")

In [None]:
#Let's analyze the evolution of Shap values of year based on mileage
shap.dependence_plot("year", shap_values.values, X_test, interaction_index= "mileage")

#### EX13: What are the most relevant insights about the evolution of the features' values and their Shap values?

**Solution:**

Manufacturing year seems to be quite negatively correlated with mileage (as the heatmap above also shows). Higher engine volume also seems to go with higher mileage, but the relationship is weaker.

### Local explainability

Local explainability helps to understand the prediction for a particular case. In other words, XAI gets close to a personalized explanation of each prediction. Let's use the first sample of X_test for the following steps.

#### Using the waterfall, force and decision plots, we can explain how the model works.

In [None]:
shap.plots.waterfall(shap_values[0])

In [None]:
shap.plots.force(shap_values[0])

In [None]:
shap.decision_plot(shap_values[0].base_values,shap_values[0].values, X_test.iloc[0])

Given this example (`year=1990, engV=1.8, ...`), we can see how each variable affects the final prediction. Most variables push the price down (in blue), taking it from the mean prediction ($14,393.64) to the actual prediction ($2,981.22): the year removes around $8,000, engV around $1,500, and so on. These plots give a good sense of why the predicted value is what it is.
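
As a quick sanity check on these numbers: SHAP values are additive, so the contributions of all features must sum to the gap between the mean prediction and this car's prediction. The sketch below uses the two figures quoted above; the `year` and `engV` contributions are the rough values read off the plots, not exact outputs.

```python
base_value = 14393.64  # mean prediction reported above
prediction = 2981.22   # model output for this car

total_shap = prediction - base_value         # sum of all SHAP values for this sample
approx_year, approx_engv = -8000.0, -1500.0  # rough values read off the plots
remaining = total_shap - (approx_year + approx_engv)

print(round(total_shap, 2), round(remaining, 2))
```

Under these approximations, `year` and `engV` account for most of the reduction, and the other features together explain the remainder.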