# Ceteris-paribus profiles

## Importing libraries and loading models

In [12]:
import pickle
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import dalex as dx
warnings.filterwarnings('ignore')
np.random.seed(23)


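def load_pickle(*path_parts):
    # Build the path with os.path.join so it is not tied to Windows separators,
    # and use a context manager so the file is closed after reading.
    with open(os.path.join('..', 'resources', *path_parts), 'rb') as f:
        return pickle.load(f)


gb = load_pickle('models', 'gradient_boosting.pkl')
nn = load_pickle('models', 'neural_network.pkl')
rf = load_pickle('models', 'random_forest.pkl')

df = load_pickle('data', 'housing_preproc.pkl')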

## Model prediction (`neural network`)

In [13]:
from sklearn.model_selection import train_test_split

X, y = df.drop(columns=["median_house_value"]), df[["median_house_value"]]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

In [14]:
nn.predict(X_test.iloc[[211]])

array([1.87196857])

In [15]:
y_test.iloc[[211]]

| | median_house_value |
|---|---|
| 17798 | 1.81852 |


## Model prediction (other models)

In [21]:
rf.predict(X_test.iloc[[211]])

array([1.88803219])

In [22]:
gb.predict(X_test.iloc[[211]])

array([2.33324731])

For this observation, the `neural network` came closest to the true target value (1.872 predicted vs. 1.819 actual), with the `random forest` right behind it (1.888). `Gradient boosting` was the furthest off of the three models (2.333).
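
To make the comparison explicit, the absolute errors can be computed from the values printed above. This is a minimal sketch that hard-codes the three predictions and the true value from the cell outputs; the dictionary and variable names are illustrative and not part of the original notebook.

```python
# Predictions for observation 211, copied from the cell outputs above.
predictions = {
    "neural network": 1.87196857,
    "random forest": 1.88803219,
    "gradient boosting": 2.33324731,
}
actual = 1.81852  # true median_house_value for the same observation

# Absolute error of each model, then model names sorted from best to worst.
abs_errors = {name: abs(pred - actual) for name, pred in predictions.items()}
ranking = sorted(abs_errors, key=abs_errors.get)

for name in ranking:
    print(f"{name}: {abs_errors[name]:.3f}")
```

Note that this ranks the models on a single observation only, so it says little about their overall quality.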

## Ceteris-paribus profile (`neural network`)

In [16]:
nn_exp = dx.Explainer(nn, X_train, y_train, label = "Neural network")

Preparation of a new explainer is initiated

  -> data              : 13828 rows 13 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 13828 values
  -> model_class       : sklearn.model_selection._search.GridSearchCV (default)
  -> label             : Neural network
  -> predict function  : <function yhat_default at 0x0000016D4AF82E50> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = -1.52, mean = -0.0227, max = 3.15
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.51, mean = 0.021, max = 3.11
  -> model_info        : package sklearn

A new explainer has been created!


In [17]:
cp_nn = nn_exp.predict_profile(X_test.iloc[[211]])
cp_nn.plot()

Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 13/13 [00:00<00:00, 419.26it/s]


The CP profiles for the `neural network` show that the predicted target value decreases as `longitude` and `latitude` increase (with some small fluctuations).
A similar trend can be observed for `population`. <br>
Increasing `total rooms` leads to a rapid increase in the predicted value. <br>
`Total bedrooms`, `households` and `median income` are more interesting: none of them has a monotonic trend. `Total bedrooms` and `median income` first trend downward and, after a certain point, switch to an upward trend. `Households`, on the other hand, behaves the other way round: it first increases, then decreases. <br>
`Housing median age` is almost constant, meaning that the prediction barely changes whatever its value. <br>
For `ocean proximity`, the only case in which the predicted value changes is when the household is located on an island, which lowers the prediction.
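
As a reminder of what such a profile represents: a ceteris-paribus profile varies one feature over a grid of values while keeping all other features of the chosen observation fixed, and records the model's prediction at each grid point. The sketch below illustrates the idea with a hypothetical linear `toy_model`. The helper `ceteris_paribus_profile`, the toy model and the feature values are assumptions made for illustration and do not reproduce dalex's actual implementation.

```python
def ceteris_paribus_profile(predict, observation, feature, grid):
    """Return (value, prediction) pairs for one feature, all others held fixed."""
    profile = []
    for value in grid:
        modified = dict(observation)  # copy, so the original observation is untouched
        modified[feature] = value     # change only the feature of interest
        profile.append((value, predict(modified)))
    return profile


def toy_model(row):
    # Hypothetical stand-in for a fitted model's predict function.
    return 0.5 * row["median_income"] - 0.1 * row["latitude"]


observation = {"median_income": 2.0, "latitude": 1.0}
grid = [0.0, 1.0, 2.0, 3.0]

profile = ceteris_paribus_profile(toy_model, observation, "median_income", grid)
for value, prediction in profile:
    print(f"median_income={value:.1f} -> prediction={prediction:.2f}")
```

For a linear toy model the profile is a straight line. The non-monotonic shapes described above can only arise because the fitted models are non-linear.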

## Ceteris Paribus profiles for other models

In [18]:
gb_exp = dx.Explainer(gb, X_train, y_train, label = "Gradient Boosting")
rf_exp = dx.Explainer(rf, X_train, y_train, label = "Random Forest")

Preparation of a new explainer is initiated

  -> data              : 13828 rows 13 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 13828 values
  -> model_class       : sklearn.model_selection._search.RandomizedSearchCV (default)
  -> label             : Gradient Boosting
  -> predict function  : <function yhat_default at 0x0000016D4AF82E50> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = -1.49, mean = -0.00167, max = 3.02
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.31, mean = -2.95e-17, max = 3.08
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 13828 rows 13 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Con

In [None]:
cp_rf = rf_exp.predict_profile(X_test.iloc[[211]])
cp_rf.plot()

The CP profiles for the `random forest` model differ from those of the `neural network`.
What they share is the decrease with `longitude` and `latitude` and the nearly constant `housing median age`. <br>
`Total rooms`, `total bedrooms`, `population` and `households` are nearly constant here, except near the left end of their ranges. <br>
`Median income` shows a fairly rapid increase in the predicted value. <br>
`Ocean proximity` behaves quite differently: the prediction increases slightly for `<1H OCEAN` and decreases for inland.

In [19]:
cp_gb = gb_exp.predict_profile(X_test.iloc[[211]])
cp_gb.plot()

Calculating ceteris paribus: 100%|████████████████████████████████████████████████████| 13/13 [00:00<00:00, 324.93it/s]


The CP profiles for `gradient boosting` are clearly more similar to those of the `random forest` than to those of the `neural network`. <br>
They also show a decrease with `longitude` and `latitude`, as well as a constant `housing median age`. <br>
Similarly, `total bedrooms`, `population` and `households` are constant except near the left end of their ranges. <br>
`Median income` increases rapidly as well. <br>
`Total rooms` is similar too: it is constant, but without the fluctuations at the ends of its range. <br>
The main difference lies in `ocean proximity`. Although the predicted value decreases for inland households, it also decreases slightly for near bay and increases slightly for near ocean.
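
The visual impression that two profiles are "more similar" can be backed by a number, for instance the mean absolute difference between two profiles evaluated on the same grid. The following is a minimal sketch with made-up profile values (not taken from the models above). `profile_distance` is a hypothetical helper, and each profile is centered first so that only its shape is compared, not the prediction level.

```python
def center(profile):
    """Subtract the mean so that only the shape of the profile remains."""
    mean = sum(profile) / len(profile)
    return [p - mean for p in profile]


def profile_distance(profile_a, profile_b):
    """Mean absolute difference between two centered profiles on a shared grid."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must be evaluated on the same grid")
    a, b = center(profile_a), center(profile_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up predictions along one feature's grid, for illustration only.
flat_then_rising = [1.0, 1.0, 1.5, 2.0]
shifted_copy = [1.4, 1.4, 1.9, 2.4]   # same shape, higher prediction level
u_shaped = [2.0, 1.0, 1.0, 2.0]       # a different shape

same_shape = profile_distance(flat_then_rising, shifted_copy)
different_shape = profile_distance(flat_then_rising, u_shaped)
print(same_shape, different_shape)
```

Centering matters here because two models can predict different levels for the same observation, as `gradient boosting` and `random forest` do, while still responding to a feature in the same way.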

## Summary

As we have seen, the `neural network` differs significantly from the other two models. It is worth noting that it gave the best prediction of the target value for this observation. Although the `random forest` prediction was almost as close, its CP profiles mostly differ from those of the `neural network`. Conversely, although `gradient boosting` was far off in its prediction, its CP profiles are very similar to those of the `random forest`. <br>
A plausible explanation is that the `neural network` is built very differently from the other two models, which are both tree-based ensembles, so its CP profiles differ as well. This shows how important it is to choose a model carefully, with regard to what matters most to us.