# Regularization

Let's improve our understanding of what impacted **Titanic** passengers' chance of survival
- We will use logistic classifiers, which are easy to interpret
- Remember that we already did this with statsmodels in the lecture "Decision Science - Logistic Regression"
- There, we used `p-values` & statistical assumptions to detect which features were irrelevant or did not generalize
- This time, we will use `regularization` to detect relevant/irrelevant features based on under/overfitting criteria
- **Our goal is to compare `L1` and `L2` penalties**

## 0. We load and preprocess the data for you

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_titanic_dataset_encoded.csv")

# the dataset is already one-hot-encoded
data.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,1,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,0,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [4]:
# We build X and y

y = data["survived"]
X = data.drop(columns="survived")
X.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,3,22.0,1,0,7.25,0,0,1,0,0,0,1
1,1,38.0,1,0,71.2833,1,1,0,0,1,0,0
2,3,26.0,0,0,7.925,1,0,1,0,0,0,1
3,1,35.0,1,0,53.1,1,1,0,0,0,0,1
4,3,35.0,0,0,8.05,0,0,1,0,0,0,1


In [6]:
# We MinMaxScale our features for you
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled.shape

(714, 12)

In [7]:
X_scaled

Unnamed: 0,pclass,age,sibsp,parch,fare,sex_female,class_First,class_Third,who_child,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
0,1.0,0.271174,0.2,0.000000,0.014151,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.0,0.472229,0.2,0.000000,0.139136,1.0,1.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.321438,0.0,0.000000,0.015469,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.0,0.434531,0.2,0.000000,0.103644,1.0,1.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.434531,0.0,0.000000,0.015713,0.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
709,1.0,0.484795,0.0,0.833333,0.056848,1.0,0.0,1.0,0.0,0.0,1.0,0.0
710,0.5,0.334004,0.0,0.000000,0.025374,0.0,0.0,0.0,0.0,0.0,0.0,1.0
711,0.0,0.233476,0.0,0.000000,0.058556,1.0,1.0,0.0,0.0,0.0,0.0,1.0
712,0.0,0.321438,0.0,0.000000,0.058556,0.0,1.0,0.0,0.0,1.0,0.0,0.0


## 1.  Logistic Regression without regularization

❓ Rank the features by decreasing order of importance after training a simple **non-regularized** Logistic Regression (i.e. look at the coefficients after fitting)
- Careful: `LogisticRegression` is penalized by default
  - take a look at the [penalty parameter](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to find out how to remove the penalty
- Increase `max_iter` to a larger number until the model converges

<details>
    <summary>Hint</summary>
    <img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/05-ML/05-Model-Tuning/model_selection.png" alt="penalizing a regression" width="500" height="600">
</details>


In [11]:
from sklearn.linear_model import LogisticRegression

In [20]:
model = LogisticRegression(penalty=None, max_iter=1000)
model.fit(X_scaled, y)

In [21]:
pd.Series(model.coef_.tolist()[0], index = X_scaled.columns).sort_values(ascending=False)

sex_female                  2.671883
pclass                      2.547187
class_First                 2.360417
fare                        1.358812
who_child                   1.336356
parch                      -0.893820
age                        -2.196151
class_Third                -2.456891
sibsp                      -2.477131
embark_town_Cherbourg     -11.221671
embark_town_Southampton   -11.523126
embark_town_Queenstown    -11.918725
dtype: float64

In [22]:
# Alternative: the same ranking as a DataFrame
feature_importance = pd.DataFrame({
    "Feature": X_scaled.columns,
    "Coefficient": model.coef_[0]
}).sort_values(by="Coefficient", ascending=False).reset_index(drop=True)
feature_importance
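
The sort above is by signed value, so strongly negative coefficients end up at the bottom even though they have the largest effect on the log-odds. A minimal sketch of ranking by absolute magnitude instead, using the coefficient values printed above (the helper name `rank_by_magnitude` is hypothetical, not part of the challenge):

```python
# Coefficients copied from the non-regularized model output above
coefs = {
    "sex_female": 2.671883,
    "pclass": 2.547187,
    "class_First": 2.360417,
    "fare": 1.358812,
    "who_child": 1.336356,
    "parch": -0.893820,
    "age": -2.196151,
    "class_Third": -2.456891,
    "sibsp": -2.477131,
    "embark_town_Cherbourg": -11.221671,
    "embark_town_Southampton": -11.523126,
    "embark_town_Queenstown": -11.918725,
}


def rank_by_magnitude(coefficients):
    """Return (feature, coefficient) pairs sorted by decreasing |coefficient|."""
    return sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_by_magnitude(coefs)
for feature, value in ranking:
    print(f"{feature:<26}{value:>12.6f}")
```

With the `pd.Series` built in the notebook, the same ordering should be obtainable with `.abs().sort_values(ascending=False)`.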

❓ How do you interpret, in plain English, the value for the coefficient `sex_female`?

<details>
    <summary>Answer</summary>

> "All other things being equal (such as age, ticket class, etc.),
being a woman increases your log-odds of survival by 2.67 (your coef value)"
    
> "Controlling for all other explanatory factors available in this dataset,
being a woman multiplies your odds of survival by exp(2.67) ≈ 14"

</details>


This positive value means that being a woman is associated with a higher probability of survival on the Titanic.

The model suggests that being a woman is an important factor that increases the chances of having survived in this context.
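
To make the log-odds interpretation concrete, here is a small worked sketch of the arithmetic. The coefficient is the one printed above; the 20% baseline probability and the helper `apply_odds_ratio` are hypothetical, chosen only for illustration:

```python
import math

coef_sex_female = 2.671883  # coefficient from the non-regularized model above

# exp(coefficient) is the odds ratio: the factor by which the odds are multiplied
odds_ratio = math.exp(coef_sex_female)
print(f"odds ratio: {odds_ratio:.2f}")  # roughly 14.5


def apply_odds_ratio(probability, ratio):
    """Convert a probability to odds, scale the odds, and convert back."""
    odds = probability / (1 - probability)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)


baseline = 0.20  # hypothetical baseline survival probability
shifted = apply_odds_ratio(baseline, odds_ratio)
print(f"probability: {baseline:.2f} -> {shifted:.2f}")
```

Note that the odds are multiplied by a constant factor, but the resulting change in probability depends on the baseline you start from.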

❓ What is the feature that most impacts the chances of survival according to your model?  
Fill the `top_1_feature` list below with the name of this feature

In [16]:
top_1_feature = ["embark_town_Queenstown"]

In [17]:
from nbresult import ChallengeResult
result = ChallengeResult('unregularized', top_1_feature = top_1_feature)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /root/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /root/code/MonicaVenzor/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 1 item

test_unregularized.py::TestUnregularized::test_top_1 PASSED              [100%]



💯 You can commit your code:

git add tests/unregularized.pickle

git commit -m 'Completed unregularized step'

git push origin master



## 2.  Logistic Regression with an L2 penalty

Let's use a **Logistic model** whose log-loss has been penalized with an **L2** term to figure out the **most important features** without overfitting.  
This is the "classification" equivalent of the "Ridge" regressor

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance (look at the coefficients)
- By "strongly regularized" we mean "more than Sklearn's default regularization factor" (in scikit-learn, a smaller `C` means stronger regularization; the default is `C=1.0`).
- Sklearn's default values are useful orders of magnitude to keep in mind for "scaled features"
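
As a rough illustration of what "more than the default" means in practice, the sketch below fits L2-penalized models for several values of `C` and prints the norm of the coefficient vector. It runs on synthetic data (not the Titanic dataset), and all names such as `X_demo` and `coef_norms` are made up for this example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for scaled features in [0, 1] (NOT the Titanic data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 5))
true_weights = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
probabilities = 1 / (1 + np.exp(-(X_demo @ true_weights - 1.0)))
y_demo = (rng.uniform(size=300) < probabilities).astype(int)

coef_norms = {}
for C in (10.0, 1.0, 0.1, 0.01):
    # The L2 penalty is scikit-learn's default; only C changes here
    model_demo = LogisticRegression(C=C, max_iter=1000)
    model_demo.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model_demo.coef_))
    print(f"C={C:<6} ||coef||_2 = {coef_norms[C]:.3f}")
```

The coefficient norm should shrink as `C` decreases, which is why a "strongly regularized" model uses a `C` below 1.0.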

In [39]:
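# Note: C is left at scikit-learn's default (C=1.0); a smaller C (e.g. C=0.1) would regularize more strongly
model_r = LogisticRegression(penalty='l2', max_iter=1000)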
model_r.fit(X_scaled, y)

In [40]:
pd.Series(model_r.coef_.tolist()[0], index = X_scaled.columns).sort_values(ascending=False)

sex_female                 2.482402
who_child                  1.126462
class_First                0.648990
fare                       0.482459
embark_town_Cherbourg      0.260330
embark_town_Southampton   -0.116155
embark_town_Queenstown    -0.354136
parch                     -0.539578
pclass                    -0.715694
class_Third               -0.786230
age                       -1.516539
sibsp                     -1.561901
dtype: float64

❓ What are the top 2 features driving chances of survival according to your model?  
Fill the `top_2_features` list below with the name of these features

In [41]:
top_2_features = ["sex_female", "who_child"]

#### 🧪 Test your code below

In [42]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2 = top_2_features)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /root/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /root/code/MonicaVenzor/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 1 item

test_ridge.py::TestRidge::test_top2 PASSED                               [100%]



💯 You can commit your code:

git add tests/ridge.pickle

git commit -m 'Completed ridge step'

git push origin master



## 3. Logistic Regression with an L1 penalty

This time, we'll use a logistic model whose log-loss has been penalized with an **L1** term to **filter out the less important features**.  
This is the "classification" equivalent of the **Lasso** regressor
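
To see the filtering effect in isolation, the following sketch compares how many coefficients remain nonzero under L1 and L2 penalties at the same `C`. It uses synthetic data rather than the Titanic dataset, and the names (`X_demo`, `nonzero_counts`, etc.) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for scaled features in [0, 1] (NOT the Titanic data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 5))
true_weights = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
probabilities = 1 / (1 + np.exp(-(X_demo @ true_weights - 1.0)))
y_demo = (rng.uniform(size=300) < probabilities).astype(int)

nonzero_counts = {}
for C in (1.0, 0.1, 0.001):
    # liblinear is one of the solvers that supports the L1 penalty
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    l2_model = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    l1_model.fit(X_demo, y_demo)
    l2_model.fit(X_demo, y_demo)
    nonzero_counts[C] = (
        int(np.count_nonzero(l1_model.coef_)),
        int(np.count_nonzero(l2_model.coef_)),
    )
    print(f"C={C:<6} nonzero coefs  L1: {nonzero_counts[C][0]}  L2: {nonzero_counts[C][1]}")
```

With a small enough `C`, the L1 model drives every coefficient to exactly zero, whereas the L2 coefficients shrink toward zero but stay nonzero. That is the property that makes L1 usable for feature selection.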

❓ Instantiate a **strongly regularized** `LogisticRegression` and rank its features by importance

In [45]:
model_l = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=1000)
model_l.fit(X_scaled, y)

In [46]:
pd.Series(model_l.coef_.tolist()[0], index = X_scaled.columns).sort_values(ascending=False)

sex_female                 2.000536
who_child                  0.255837
sibsp                      0.000000
parch                      0.000000
fare                       0.000000
class_First                0.000000
embark_town_Cherbourg      0.000000
embark_town_Queenstown     0.000000
age                       -0.063072
class_Third               -0.144568
embark_town_Southampton   -0.239184
pclass                    -1.442778
dtype: float64

❓ What are the features that have absolutely no impact on chances of survival, according to your L1 model?  
Fill the `zero_impact_features` list below with the name of these features; you may have to add elements to the list.

- Do you notice how some of them were "highly important" according to the non-regularized model? 
- From now on, we will always regularize our linear models!

In [47]:
zero_impact_features = ["sibsp", "parch", "fare", "class_First", "embark_town_Cherbourg", "embark_town_Queenstown"]

#### 🧪 Test your code below

In [48]:
from nbresult import ChallengeResult
result = ChallengeResult('lasso', zero_impact_features = zero_impact_features)
result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /root/.pyenv/versions/3.10.6/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /root/code/MonicaVenzor/05-ML/05-Model-Tuning/data-regularization/tests
plugins: anyio-3.6.2, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 1 item

test_lasso.py::TestLasso::test_zero_impact PASSED                        [100%]



💯 You can commit your code:

git add tests/lasso.pickle

git commit -m 'Completed lasso step'

git push origin master



**🏁 Congratulations! Don't forget to commit and push your notebook**