# Regularization

Let's improve our understanding of what affected **Titanic** passengers' chances of survival
- We will use logistic classifiers, which are easy to interpret
- Remember, we already did this with statsmodels in the lecture "Decision Science - Logistic Regression"
- There, we used `p-values` & statistical assumptions to detect which features were irrelevant / did not generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**
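
For reference, the two penalties differ only in the term added to the log-loss. The following is a sketch in scikit-learn's parameterization, where $C$ is the inverse regularization strength; the symbols $w$ (coefficients), $b$ (intercept) and $\hat{p}_i$ (predicted survival probability of passenger $i$) are notation assumed here for illustration:

$$
\ell_i(w, b) = -\,y_i \log \hat{p}_i - (1 - y_i) \log (1 - \hat{p}_i)
$$

$$
\text{L2:} \quad \min_{w,\,b} \; \frac{1}{2} \lVert w \rVert_2^2 + C \sum_{i=1}^{n} \ell_i(w, b)
\qquad\qquad
\text{L1:} \quad \min_{w,\,b} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \ell_i(w, b)
$$

A smaller $C$ gives the penalty more weight relative to the log-loss, so the coefficients shrink more.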

## 0. We load and preprocess the data for you

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")

# the dataset is already one-hot-encoded
data.head()

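|   | survived | pclass | age | sibsp | parch | fare | sex_female | class_First | class_Third | who_child | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |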


In [3]:
# We build X and y

y = data["survived"]
X = data.drop(columns="survived")
X.head()

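|   | pclass | age | sibsp | parch | fare | sex_female | class_First | class_Third | who_child | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |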


In [4]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X.shape

(714, 12)
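
As a reminder of what the scaling step does, here is a minimal sketch on a made-up 3x2 array (`toy` is hypothetical, not Titanic data). It checks that `MinMaxScaler` matches the formula `(x - min) / (max - min)` applied column by column, which puts every feature in `[0, 1]` so that a penalty on coefficient size treats features comparably.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 10.0],
                [3.0, 20.0],
                [5.0, 40.0]])

# Scaling done by scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

# Same scaling done by hand, column by column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```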

## 1.  Logistic Regression without regularization

❓ Rank the features in decreasing order of importance according to a simple **non-regularized** Logistic Regression

- Careful, `LogisticRegression` is penalized by default
- Increase `max_iter` until the model converges
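
One possible way to find a sufficient `max_iter` is to watch for scikit-learn's `ConvergenceWarning`. The sketch below uses synthetic data from `make_classification` rather than the Titanic set, and the helper `converged` is a hypothetical name introduced here. It approximates "no regularization" with a very large `C`, because the spelling of the `penalty` argument differs across scikit-learn versions.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)


def converged(max_iter):
    """Fit a (nearly) unregularized model and report whether the solver converged."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # A very large C makes the default L2 penalty negligible
        LogisticRegression(C=1e6, max_iter=max_iter).fit(X_toy, y_toy)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)


# Multiply max_iter by 10 until no ConvergenceWarning is raised
max_iter = 1
while not converged(max_iter):
    max_iter *= 10

print(f"max_iter = {max_iter} is enough for convergence")
```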

In [28]:
from sklearn.linear_model import LogisticRegression

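# penalty='none' disables regularization (scikit-learn >= 1.2 spells it penalty=None)
lr = LogisticRegression(penalty='none')
lr.fit(X_scaled, y)

# Map each feature to its coefficient, sorted by decreasing absolute value
dico = dict(zip(X_scaled.columns, lr.coef_[0]))
dico = dict(sorted(dico.items(), key=lambda item: abs(item[1]), reverse=True))
dico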

{'embark_town_Queenstown': -11.91872516247785,
 'embark_town_Southampton': -11.523125645249683,
 'embark_town_Cherbourg': -11.221670926005537,
 'sex_female': 2.671882643323036,
 'pclass': 2.5471874647397943,
 'sibsp': -2.477131472641966,
 'class_Third': -2.4568912003453716,
 'class_First': 2.3604172889309374,
 'age': -2.1961513649407185,
 'fare': 1.3588116679941569,
 'who_child': 1.3363564092808387,
 'parch': -0.8938204023430882}

❓ How do you interpret, in plain English, the value of the coefficient for `sex_female`?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class, etc.),
being a woman increases your log-odds of survival by 2.67 (your coef value)"
    
> "Controlling for all other explanatory factors available in this dataset,
being a woman multiplies your odds of survival by exp(2.67) ≈ 14.4"

</details>


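All other things being equal, being female increases the log-odds of survival by 2.67 compared with being male, i.e. it multiplies the odds of survival by exp(2.67) ≈ 14.4.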
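
To make the log-odds interpretation concrete, here is a small worked sketch. The coefficient is copied from the output above; the 20% baseline probability and the helper `shifted_probability` are hypothetical, chosen only for illustration.

```python
import numpy as np

coef_sex_female = 2.671882643323036  # copied from the unregularized model's output

# A coefficient is a change in log-odds; exponentiate it to get an odds ratio
odds_ratio = np.exp(coef_sex_female)


def shifted_probability(p_baseline, odds_ratio):
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    odds = p_baseline / (1 - p_baseline) * odds_ratio
    return odds / (1 + odds)


# Hypothetical baseline: a male passenger with a 20% survival probability
p_female = shifted_probability(0.20, odds_ratio)
print(f"odds ratio: {odds_ratio:.2f}, probability: {p_female:.2f}")
```

Note that the odds are multiplied by roughly 14, not the probability: a probability cannot exceed 1, whereas odds are unbounded.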

❓ What is the feature that most impacts the chances of survival according to your model ? 

In [29]:
top_1_feature = ["embark_town_Queenstown"]

In [30]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature = top_1_feature)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/05-Model-Tuning/02-Regularization
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 1 item

tests/test_unregularized.py::TestUnregularized::test_top_1 PASSED        [100%]



💯 You can commit your code:

git add tests/unregularized.pickle

git commit -m 'Completed unregularized step'

git push origin master


## 2.  Logistic Regression with an L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with an **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent of the "Ridge" regressor.

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance
- By "strongly regularized" we mean "more strongly than sklearn's default regularization" (in sklearn, a smaller `C` means stronger regularization; the default is `C=1.0`)
- sklearn's default values are useful orders of magnitude to keep in mind for "scaled features"
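
To see what "stronger than the default" does in practice, the following sketch fits L2-penalized models for several values of `C` and prints the norm of the coefficient vector. It runs on synthetic data from `make_classification` (an assumption made so the snippet is self-contained), so the numbers will not match the Titanic ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, min-max scaled like the notebook's features
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_toy = MinMaxScaler().fit_transform(X_toy)

# Smaller C means stronger L2 regularization (the sklearn default is C=1.0)
norms = {}
for C in [100, 1, 0.01, 0.001]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_toy, y_toy)
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:<6} ||coef||_2 = {norms[C]:.4f}")
```

The coefficients shrink toward zero as `C` decreases, but with an L2 penalty they typically do not become exactly zero.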

In [46]:
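# C is the inverse of the regularization strength: 1/1000 is far stronger than the default C=1
lr_l2 = LogisticRegression(C=1/1000)
lr_l2.fit(X_scaled, y)

# Map each feature to its coefficient, sorted by decreasing absolute value
dico = dict(zip(X_scaled.columns, lr_l2.coef_[0]))
dico = dict(sorted(dico.items(), key=lambda item: abs(item[1]), reverse=True))
dico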

{'sex_female': 0.08649986671813313,
 'class_Third': -0.053643798505083724,
 'pclass': -0.047718312639078275,
 'class_First': 0.04179281510950037,
 'embark_town_Cherbourg': 0.02366475837636093,
 'embark_town_Southampton': -0.021575260846777133,
 'who_child': 0.015244703010485671,
 'fare': 0.008876166261881145,
 'age': -0.005415661316647762,
 'parch': 0.004379301883743804,
 'embark_town_Queenstown': -0.0032032879733073726,
 'sibsp': -0.001203489268659918}

❓ What are the top 2 features driving chances of survival according to your model ?

In [47]:
# Fill your top 2 features below
top_2_features = ['sex_female','class_Third']

#### 🧪 Test your code below

In [48]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2 = top_2_features)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/05-Model-Tuning/02-Regularization
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 1 item

tests/test_ridge.py::TestRidge::test_top2 PASSED                         [100%]



💯 You can commit your code:

git add tests/ridge.pickle

git commit -m 'Completed ridge step'

git push origin master


## 3. Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with an **L1** term to **filter out the less important features**.  
This is the "classification" equivalent of the **Lasso** regressor.
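
The filtering effect of L1 can be sketched by counting how many coefficients are exactly zero as `C` decreases. The snippet uses synthetic data from `make_classification` and the `liblinear` solver (both are assumptions made for a short, deterministic example; the notebook itself uses `saga` on the Titanic data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 10 features, only 3 of them informative
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_toy = MinMaxScaler().fit_transform(X_toy)

# Count coefficients set to exactly 0 for increasingly strong L1 penalties
zeros = {}
for C in [10, 1, 0.1, 0.01]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X_toy, y_toy)
    zeros[C] = int(np.sum(model.coef_ == 0))
    print(f"C={C:<5} zero coefficients: {zeros[C]} / {X_toy.shape[1]}")
```

Unlike L2, which only shrinks coefficients, a strong enough L1 penalty sets some of them to exactly zero, which is why it can be read as a feature filter.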

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its feature importance

In [57]:
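# The default lbfgs solver does not support the L1 penalty, hence solver='saga'
lr_l1 = LogisticRegression(C=1/5, penalty='l1', solver='saga')
lr_l1.fit(X_scaled, y)

# Map each feature to its coefficient, sorted by decreasing absolute value
dico = dict(zip(X_scaled.columns, lr_l1.coef_[0]))
dico = dict(sorted(dico.items(), key=lambda item: abs(item[1]), reverse=True))
dico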

{'sex_female': 2.267919758067576,
 'pclass': -0.7882736735835298,
 'who_child': 0.6869034873330024,
 'class_Third': -0.6109082043303047,
 'class_First': 0.3178436603672172,
 'embark_town_Cherbourg': 0.28651354917121713,
 'age': -0.13873555827442766,
 'sibsp': -0.0454225360201478,
 'parch': 0.0,
 'fare': 0.0,
 'embark_town_Queenstown': 0.0,
 'embark_town_Southampton': 0.0}

❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?
- Do you notice how some of them were "highly important" according to the non-regularized model ? 
- From now on, we will always regularize our linear models!
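
Rather than reading the zeros off a printed dictionary, the zero-impact features can be extracted programmatically. Below is a minimal sketch on synthetic data; the column names `feature_0` ... `feature_9`, the value `C=0.2` and the `liblinear` solver are hypothetical choices for illustration, not the notebook's setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with hypothetical column names
X_arr, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
columns = [f"feature_{i}" for i in range(X_arr.shape[1])]
X_toy = pd.DataFrame(MinMaxScaler().fit_transform(X_arr), columns=columns)

# L1-penalized logistic regression
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.2).fit(X_toy, y_toy)

# Label each coefficient with its feature name
coefs = pd.Series(model.coef_[0], index=X_toy.columns)

# Features filtered out by the L1 penalty, and the remaining ones ranked by |coef|
zero_features = coefs[coefs == 0].index.tolist()
kept_features = coefs[coefs != 0].abs().sort_values(ascending=False)

print("zero impact:", zero_features)
print(kept_features)
```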

In [58]:
# Note: the L1 model above also sets 'parch' and 'fare' to exactly 0
zero_impact_features = ['embark_town_Southampton','embark_town_Queenstown']

#### 🧪 Test your code below

In [59]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())

platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /Users/humbert/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/humbert/code/HumbertMonnot/data-challenges/05-ML/05-Model-Tuning/02-Regularization
plugins: anyio-3.4.0, dash-2.0.0
collecting ... collected 1 item

tests/test_lasso.py::TestLasso::test_zero_impact PASSED                  [100%]



💯 You can commit your code:

git add tests/lasso.pickle

git commit -m 'Completed lasso step'

git push origin master


**🏁 Congratulations! Don't forget to commit and push your notebook**