## Homework 1 - Michał Gromadzki

### Importing libraries

In [28]:
import numpy as np 
import pandas as pd 
import dalex as dx
import os
import matplotlib.pyplot as pl
import seaborn as sns
import warnings
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor
warnings.filterwarnings('ignore')

### Loading dataset

In [29]:
df = pd.read_csv('insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### EDA and Preprocessing

Checking for nulls.

In [30]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

No nulls.

Encoding categorical features.

In [31]:
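# LabelEncoder sorts the classes alphabetically and maps them to 0..n-1.
le = LabelEncoder()
# sex: female -> 0, male -> 1
le.fit(df.sex.drop_duplicates())
df.sex = le.transform(df.sex)
# smoker: no -> 0, yes -> 1
le.fit(df.smoker.drop_duplicates())
df.smoker = le.transform(df.smoker)
# region: one integer code per region
le.fit(df.region.drop_duplicates())
df.region = le.transform(df.region)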
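
To make the integer codes easier to read in later tables, the mapping produced by `LabelEncoder` can be printed explicitly. The sketch below is only illustrative: it uses the category values visible in the data preview, and it assumes the fourth region is called `northeast`, which does not appear in the rows shown above.

```python
from sklearn.preprocessing import LabelEncoder

# Category values as seen in the data preview.
# "northeast" is an assumption: it is not among the first five rows.
columns = {
    "sex": ["female", "male"],
    "smoker": ["yes", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
}

mappings = {}
for name, values in columns.items():
    le = LabelEncoder().fit(values)
    # classes_ is sorted, so the position of each class is its integer code.
    mappings[name] = {str(label): int(code) for code, label in enumerate(le.classes_)}

for name, mapping in mappings.items():
    print(name, mapping)
```

Note that the codes for `region` impose an arbitrary order on a nominal variable, which a linear model will treat as numeric.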

Checking correlation.

In [32]:
df.corr()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
age,1.0,-0.020856,0.109272,0.042469,-0.025019,0.002127,0.299008
sex,-0.020856,1.0,0.046371,0.017163,0.076185,0.004588,0.057292
bmi,0.109272,0.046371,1.0,0.012759,0.00375,0.157566,0.198341
children,0.042469,0.017163,0.012759,1.0,0.007673,0.016569,0.067998
smoker,-0.025019,0.076185,0.00375,0.007673,1.0,-0.002181,0.787251
region,0.002127,0.004588,0.157566,0.016569,-0.002181,1.0,-0.006208
charges,0.299008,0.057292,0.198341,0.067998,0.787251,-0.006208,1.0


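Only <mark>smoker</mark> is strongly correlated with <mark>charges</mark> (about 0.79); <mark>age</mark> (0.30) and <mark>bmi</mark> (0.20) are only weakly correlated with it.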

### Models

#### LinearRegression

In [33]:
x = df.drop(['charges'], axis = 1)
y = df.charges

x_train,x_test,y_train,y_test = train_test_split(x,y, random_state = 0)
lr = LinearRegression().fit(x_train,y_train)

y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

print(lr.score(x_test,y_test))

0.7962732059725786
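
The value printed by `lr.score` is the coefficient of determination (R2) on the test set. As a small check of what that number means, the sketch below computes R2 by hand and compares it with `score` and `r2_score`. The data here is synthetic and the coefficients are made up for illustration; it is not the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X[:150], y[:150])
X_test, y_test = X[150:], y[150:]
y_pred = model.predict(X_test)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

score = model.score(X_test, y_test)
print(score, r2_score(y_test, y_pred), r2_manual)
```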


#### Forest

In [34]:
forest = RandomForestRegressor(n_estimators = 100,
                              criterion = 'squared_error',  # named 'mse' in scikit-learn < 1.0
                              random_state = 1,
                              n_jobs = -1)
forest.fit(x_train,y_train)
forest_train_pred = forest.predict(x_train)
forest_test_pred = forest.predict(x_test)

print('MSE train data: %.3f, MSE test data: %.3f' % (
mean_squared_error(y_train,forest_train_pred),
mean_squared_error(y_test,forest_test_pred)))
print('R2 train data: %.3f, R2 test data: %.3f' % (
r2_score(y_train,forest_train_pred),
r2_score(y_test,forest_test_pred)))

MSE train data: 3729086.094, MSE test data: 19933823.142
R2 train data: 0.974, R2 test data: 0.873
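
MSE values of this size are hard to interpret directly. Taking the square root gives the RMSE, which is expressed in the same units as `charges`. The short calculation below uses only the two MSE values printed above.

```python
import math

# MSE values copied from the output above.
mse_train = 3729086.094
mse_test = 19933823.142

# RMSE is in the same units as the target (charges).
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"RMSE train: {rmse_train:.0f}, RMSE test: {rmse_test:.0f}")
print(f"test/train RMSE ratio: {rmse_test / rmse_train:.2f}")
```

The test error being more than twice the training error suggests that the forest fits the training data considerably more closely than unseen data.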


## Homework

### 1.Selecting observation and calculating the model prediction.

In [54]:
slct = x_train.iloc[[100]]
print("LinearRegression:")
print(lr.predict(slct))
print("Forest:")
print(forest.predict(slct))
print("correct:")
print(y_train.iloc[[100]])

LinearRegression:
[29835.16273088]
Forest:
[19786.6794334]
correct:
1011    18767.7377
Name: charges, dtype: float64
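
The double brackets in `x_train.iloc[[100]]` matter: they return a one-row DataFrame rather than a Series, and scikit-learn's `predict` expects 2-D input. A minimal illustration on a toy frame (values taken from the data preview; only two columns are used here for brevity):

```python
import pandas as pd

x = pd.DataFrame({"age": [19, 18, 28], "bmi": [27.9, 33.77, 33.0]})

row_as_series = x.iloc[1]    # single brackets: 1-D Series
row_as_frame = x.iloc[[1]]   # double brackets: DataFrame with one row

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```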


### 2.Calculating the model prediction decomposition using Break Down.

Creating explainers.

In [36]:
explainer = dx.Explainer(lr, 
                        data = x_test,  
                        y = y_test)
explainer_forest = dx.Explainer(forest, 
                        data = x_test,  
                        y = y_test)

Preparation of a new explainer is initiated

  -> data              : 335 rows 6 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 335 values
  -> model_class       : sklearn.linear_model._base.LinearRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x000002A1BF05A0D0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.86e+02, mean = 1.34e+04, max = 4.02e+04
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.09e+04, mean = -11.0, max = 2.2e+04
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 335 rows 6 cols
  -> target variable   : Par

In [37]:
bd_pr = explainer.predict_parts(
                       new_observation = slct,
                       type = "break_down")
bd_pr_forest = explainer_forest.predict_parts(
                       new_observation = slct,
                       type = "break_down")
bd_pr.plot()
bd_pr_forest.plot()
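
The idea behind a Break Down decomposition can be sketched in a few lines: start from the mean prediction over the data, fix one variable at a time to the value of the selected observation, and record how the mean prediction changes. The code below is a simplified illustration of that idea, not dalex's implementation (dalex also chooses the variable order for you). The function `break_down`, the linear model, and its coefficients are hypothetical.

```python
import numpy as np

def break_down(predict, data, observation, order):
    """Fix features one by one to the observation's values and record
    the change in the mean prediction over `data` at each step."""
    current = data.copy()
    baseline = predict(current).mean()
    previous = baseline
    contributions = {}
    for j in order:
        current[:, j] = observation[j]      # fix feature j for every row
        mean_now = predict(current).mean()
        contributions[j] = mean_now - previous
        previous = mean_now
    return baseline, contributions

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))
coef = np.array([3.0, -2.0, 0.5])           # hypothetical linear model

def predict(X):
    return 1.0 + X @ coef

obs = np.array([1.0, 2.0, -1.0])
baseline, contrib = break_down(predict, data, obs, order=[0, 1, 2])

# Baseline plus all contributions recovers the model's prediction.
total = baseline + sum(contrib.values())
# For a linear model each contribution is coef_j * (x_j - mean of feature j).
expected = coef * (obs - data.mean(axis=0))
print(total, predict(obs[None, :])[0])
print([contrib[j] for j in range(3)], expected)
```

For an additive model such as linear regression the contributions do not depend on the order; for models with interactions, such as a random forest, they can.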

### 3.Calculating the model prediction decomposition using SHAP values.

In [38]:
shap_pr = explainer.predict_parts(
                       new_observation = slct,
                       type = "shap")
shap_pr_forest = explainer_forest.predict_parts(
                       new_observation = slct,
                       type = "shap")
shap_pr.plot()
shap_pr_forest.plot()

Low <mark>BMI</mark> noticeably reduces the predicted cost.
Being a <mark>smoker</mark> massively increases the predicted cost.
The other variables have a relatively small impact on the prediction; <mark>age</mark> and <mark>children</mark> appear to increase it slightly.

The second model (random forest) gives more weight to <mark>BMI</mark> than the first (linear regression) and less weight to <mark>smoker</mark>.

Some of these conclusions could also be drawn from the correlation matrix.
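
The SHAP-style attributions plotted in this section can be understood as sequential contributions averaged over variable orderings. The sketch below enumerates all orderings for a tiny hypothetical model with an interaction term; as far as I know, dalex instead averages over a sample of random orderings, so this shows the principle rather than the library's code. All function names here are made up for illustration.

```python
import itertools
import numpy as np

def ordered_contributions(predict, data, observation, order):
    """Contributions obtained by fixing features in the given order."""
    current = data.copy()
    previous = predict(current).mean()
    out = np.zeros(data.shape[1])
    for j in order:
        current[:, j] = observation[j]
        mean_now = predict(current).mean()
        out[j] = mean_now - previous
        previous = mean_now
    return out

def shap_by_permutations(predict, data, observation):
    """Average the ordered contributions over every feature ordering."""
    orders = itertools.permutations(range(data.shape[1]))
    return np.mean(
        [ordered_contributions(predict, data, observation, o) for o in orders],
        axis=0,
    )

rng = np.random.default_rng(1)
data = rng.normal(size=(300, 3))

def predict(X):
    # Hypothetical model with an interaction between features 1 and 2.
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

obs = np.array([1.0, 2.0, 3.0])
first = ordered_contributions(predict, data, obs, (0, 1, 2))
second = ordered_contributions(predict, data, obs, (0, 2, 1))
shap = shap_by_permutations(predict, data, obs)
baseline = predict(data).mean()

print("order (0, 1, 2):", first)
print("order (0, 2, 1):", second)
print("averaged:", shap)
```

The two orderings split the interaction effect differently between features 1 and 2, while the averaged values still sum to the prediction minus the baseline.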

### 4.Finding two observations in the dataset that have different effects

In [39]:
df.describe()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
count,1338.0,1338.0,1338.0,1338.0,1338.0,1338.0,1338.0
mean,39.207025,0.505232,30.663397,1.094918,0.204783,1.515695,13270.422265
std,14.04996,0.50016,6.098187,1.205493,0.403694,1.104885,12110.011237
min,18.0,0.0,15.96,0.0,0.0,0.0,1121.8739
25%,27.0,0.0,26.29625,0.0,0.0,1.0,4740.28715
50%,39.0,1.0,30.4,1.0,0.0,2.0,9382.033
75%,51.0,1.0,34.69375,2.0,0.0,2.0,16639.912515
max,64.0,1.0,53.13,5.0,1.0,3.0,63770.42801


The youngest person is 18 years old, and the oldest is 64 years old.

Selecting a young person.

In [40]:
young = x_train.loc[x_train["age"]==18].reset_index().drop(["index"],axis=1)
young

Unnamed: 0,age,sex,bmi,children,smoker,region
0,18,0,20.79,0,0,2
1,18,0,31.92,0,0,0
2,18,0,38.17,0,0,2
3,18,0,28.215,0,0,0
4,18,0,40.185,0,0,0
5,18,0,31.35,0,0,2
6,18,1,23.32,1,0,2
7,18,0,40.28,0,0,0
8,18,1,38.17,0,1,2
9,18,1,41.14,0,0,2


In [41]:
slct_young = young.iloc[[45]]

Selecting an old person.

In [42]:
old = x_train.loc[x_train["age"]==64].reset_index().drop(["index"],axis=1)
old

Unnamed: 0,age,sex,bmi,children,smoker,region
0,64,1,39.16,1,0,2
1,64,1,24.7,1,0,1
2,64,0,30.115,3,0,1
3,64,0,22.99,0,1,2
4,64,1,36.96,2,1,2
5,64,1,26.41,0,0,0
6,64,0,31.825,2,0,0
7,64,1,38.19,0,0,0
8,64,0,39.05,3,0,2
9,64,1,23.76,0,1,2


In [43]:
slct_old = old.iloc[[5]]

Analyzing the effects.

#### Young

In [57]:
bd_pr_young = explainer.predict_parts(
                       new_observation = slct_young,
                       type = "break_down")
bd_pr_young_forest = explainer_forest.predict_parts(
                       new_observation = slct_young,
                       type = "break_down")
bd_pr_young.plot()
bd_pr_young_forest.plot()

In [58]:
shap_pr_young = explainer.predict_parts(
                       new_observation = slct_young,
                       type = "shap")
shap_pr_young_forest = explainer_forest.predict_parts(
                       new_observation = slct_young,
                       type = "shap")
shap_pr_young.plot()
shap_pr_young_forest.plot()

#### Old

In [46]:
bd_pr_old = explainer.predict_parts(
                       new_observation = slct_old,
                       type = "break_down")
bd_pr_old_forest = explainer_forest.predict_parts(
                       new_observation = slct_old,
                       type = "break_down")
bd_pr_old.plot()
bd_pr_old_forest.plot()

In [47]:
shap_pr_old = explainer.predict_parts(
                       new_observation = slct_old,
                       type = "shap")
shap_pr_old_forest = explainer_forest.predict_parts(
                       new_observation = slct_old,
                       type = "shap")
shap_pr_old.plot()
shap_pr_old_forest.plot()

Lower age decreases the predicted cost, while higher age increases the predicted cost.

### 5.Comment

RandomForestRegressor fits the data better than LinearRegression (test R2 of 0.873 vs. 0.796). In tasks 2 and 3, the second model gives more weight to <mark>BMI</mark> than the first and less weight to <mark>smoker</mark>.

Task 4 yields the most conclusions. Firstly, even the oldest person has a lower predicted cost than the youngest person, simply because the young person is a smoker. This highlights how much weight the models give to this variable.
Secondly, RandomForestRegressor seems to pay more attention to variables such as <mark>age</mark> and <mark>BMI</mark>.
Lastly, it is hard to draw conclusions about the other variables, because there seems to be no pattern in how the models estimate their influence on the predicted cost.

In conclusion, RandomForestRegressor is the better-suited model for this task.