## Steps for Google Colab users

In [None]:
# nothing 

# Introduction

Context: Train a linear regressor to predict the weight of a person based on their gender (m/f) and their height.

For our task, fitting the linear regressor means finding the weights (coefficients) $w_0$, $w_1$, $w_2$ in the formula $\hat y = w_0 + w_1 \cdot X_1 + w_2 \cdot X_2$, with
- $\hat y$ the predicted weight
- $X_1$ the binary value encoding the gender (1 if female, 0 if male)
- $X_2$ the height in cm
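
To make the formula concrete, here is a minimal sketch that evaluates it for one person. The function name `predict_weight` and the weight values are hypothetical, chosen only for illustration; they are not the fitted values from this notebook.

```python
def predict_weight(w0, w1, w2, is_female, height_cm):
    """Evaluate y_hat = w0 + w1 * X1 + w2 * X2 for one person."""
    return w0 + w1 * is_female + w2 * height_cm


# Hypothetical weights for illustration only (not fitted on the dataset).
W0, W1, W2 = -100.0, -9.0, 1.0

example_male = predict_weight(W0, W1, W2, is_female=0, height_cm=180)
example_female = predict_weight(W0, W1, W2, is_female=1, height_cm=180)
print(example_male, example_female)  # prints 80.0 71.0
```

Each weight has a direct reading: $w_2$ scales the height, $w_1$ is added only when $X_1 = 1$, and $w_0$ is the constant offset.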

## Goal
- Understand why the linear regressor is an interpretable model

In [35]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

In [36]:
# Kaggle dataset (for experimentation only; may not be accurate for every demographic)

# dataset_url = "../dataset/weight-height.csv" # in local
dataset_url = "https://raw.githubusercontent.com/PaulQbFeng/ml_interpretability_starter_pack/master/dataset/weight-height.csv"

In [37]:
raw_df = pd.read_csv(dataset_url)
print(raw_df.shape)
raw_df.sample(5)

(10000, 3)


| | Gender | Height | Weight |
|---|---|---|---|
| 3959 | Male | 68.866302 | 190.385361 |
| 7417 | Female | 61.017988 | 120.139649 |
| 7777 | Female | 61.079452 | 124.802926 |
| 9844 | Female | 64.893463 | 133.854826 |
| 4303 | Male | 66.927669 | 169.535121 |


## Preprocessing 

- Use lowercase column names
- Convert height from inches to cm
- Convert weight from pounds to kg
- Encode the gender column as a binary column named is_female, with value 1 if female, 0 if male

In [38]:
INCH_TO_CM = 2.54
POUND_TO_KG = 0.45359237

df = pd.DataFrame()
df["is_female"] = (raw_df.Gender == "Female").astype(int)
df["height"] = raw_df.Height * INCH_TO_CM
df["weight"] = raw_df.Weight * POUND_TO_KG

In [39]:
df.sample(5)

| | is_female | height | weight |
|---|---|---|---|
| 7261 | 1 | 162.891066 | 55.017396 |
| 4802 | 0 | 170.781925 | 79.790821 |
| 4014 | 0 | 177.923615 | 92.886418 |
| 4860 | 0 | 165.33358 | 72.788301 |
| 2770 | 0 | 176.225969 | 88.336898 |


## Create train and test set

In [None]:
feature_cols = ["is_female", "height"]
pred_col = ["weight"]

X = df[feature_cols]
y = df[pred_col]

In [None]:
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=2)

## Train linear regressor

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

#### Get the weights of the model (TODO)

In [None]:
w0 = 
w1, w2 = 
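
As a hint for the TODO above, here is a minimal sketch of one way to read the weights from a fitted `LinearRegression`, using its `intercept_` and `coef_` attributes. It runs on synthetic data built from known weights, not on the Kaggle dataset; the names `toy_X`, `toy_y`, and `toy_reg` and the chosen weights are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from known weights (illustration only).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "is_female": rng.integers(0, 2, size=200),
    "height": rng.uniform(150, 200, size=200),
})
toy_y = pd.DataFrame(
    {"weight": -100.0 - 9.0 * toy_X["is_female"] + 1.0 * toy_X["height"]}
)

toy_reg = LinearRegression().fit(toy_X, toy_y)

# The target is a one-column DataFrame (2D), as in this notebook, so
# intercept_ has shape (1,) and coef_ has shape (1, n_features).
# Hence the [0] indexing to get plain scalars.
toy_w0 = toy_reg.intercept_[0]
toy_w1, toy_w2 = toy_reg.coef_[0]
print(toy_w0, toy_w1, toy_w2)
```

The order of the values in `coef_` follows the column order of the training features, here `is_female` then `height`.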

## Visualize the test set + prediction

### Split test set into male / female to plot it in 2D

In [None]:
# test set
X_test_female = X_test.loc[X_test.is_female == 1]["height"]
y_test_female = y_test.loc[X_test_female.index]["weight"]

X_test_male = X_test.loc[X_test.is_female == 0]["height"]
y_test_male = y_test.loc[X_test_male.index]["weight"]

In [None]:
# predictions
heights_boundary = np.array([140, 210]) # only need 2 points to plot a line
y_pred_male = w0 + w1 * 0 + w2 * heights_boundary
y_pred_female = w0 + w1 * 1 + w2 * heights_boundary
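
The two lines are computed by hand from the formula. A quick sanity check, sketched below on synthetic data, is that the manual formula and the model's own `predict` method agree. The names `demo_X`, `demo_y`, `demo_reg`, `b0`, `b1`, `b2` and the data-generating weights are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic noisy data (illustration only, not the Kaggle dataset).
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({
    "is_female": rng.integers(0, 2, size=100),
    "height": rng.uniform(150, 200, size=100),
})
demo_y = pd.DataFrame({
    "weight": -85.0
    - 8.0 * demo_X["is_female"]
    + 0.9 * demo_X["height"]
    + rng.normal(0, 3, size=100)
})

demo_reg = LinearRegression().fit(demo_X, demo_y)
b0 = demo_reg.intercept_[0]
b1, b2 = demo_reg.coef_[0]

# Manual prediction for women at the two boundary heights.
demo_heights = np.array([140, 210])
manual_female = b0 + b1 * 1 + b2 * demo_heights

# Same prediction through the model; columns must match the training features.
boundary_female = pd.DataFrame({"is_female": [1, 1], "height": demo_heights})
model_female = demo_reg.predict(boundary_female).ravel()

print(np.allclose(manual_female, model_female))
```

If the two disagree on the real data, the most likely cause is that the weights were read in the wrong order or with the wrong indexing.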

### Plot

In [None]:
LABEL_FONT_SIZE = 20
DOT_SIZE = 15
LINE_WIDTH = 3

plt.figure(figsize=(14,10))

plt.title("Weight and Height distribution with their predictions for the test set", size=LABEL_FONT_SIZE)
plt.xlabel("Height (cm)", size=LABEL_FONT_SIZE)
plt.ylabel("Weight (kg)", size=LABEL_FONT_SIZE)

plt.scatter(X_test_male, y_test_male.values, color='sandybrown', label="male", s=DOT_SIZE)
plt.scatter(X_test_female, y_test_female.values, color='cornflowerblue', label="female", s=DOT_SIZE)

plt.plot(heights_boundary, y_pred_female, color='red', label="female prediction", linewidth=LINE_WIDTH) 
plt.plot(heights_boundary, y_pred_male, color='green', label="male prediction", linewidth=LINE_WIDTH) 

# # for fun (optional)
# my_height = 190 # cm
# my_weight = 84 # kg
# plt.plot(my_height, my_weight, color="magenta", marker="D", label="my weight")

plt.legend(prop={"size": 15})
plt.grid(True)
plt.show()

## Questions 

1. Is my linear regressor suited to this task?
    - TODO
2. In general, taller people are heavier on average. Does my model reflect this behaviour?
    - TODO
3. More precisely, according to this model and dataset, if someone grows by 2 cm, can I tell precisely what the new predicted value would be?
    - TODO
4. In general, men are heavier than women of the same height. Does my model reflect this behaviour?
    - TODO
5. More precisely, according to this model and dataset, do I know the numerical difference between the predicted weights of a man and a woman of the same height?
    - TODO
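
One way to check answers to questions 3 and 5 numerically is to compare predictions directly. The sketch below uses hypothetical weights (`w0_demo`, `w1_demo`, `w2_demo`) and a hypothetical helper `predict_demo`; substitute the fitted values to reason about the actual model.

```python
# Hypothetical weights for illustration only; replace with the fitted w0, w1, w2.
w0_demo, w1_demo, w2_demo = -100.0, -9.0, 1.0


def predict_demo(is_female, height_cm):
    """Linear prediction y_hat = w0 + w1 * is_female + w2 * height."""
    return w0_demo + w1_demo * is_female + w2_demo * height_cm


# Question 3: same person, 2 cm taller. The change does not depend on the
# starting height or on the gender, only on w2.
delta_growth = predict_demo(0, 172) - predict_demo(0, 170)

# Question 5: same height, female minus male. The change depends only on w1.
delta_gender = predict_demo(1, 170) - predict_demo(0, 170)

print(delta_growth, delta_gender)  # prints 2.0 -9.0
```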

# Conclusion

