# Regularization

## 1. Import data

- Each row corresponds to the profile of a health insurance client
- Our target is the `price_range` category of their contract
- Our features are the client's characteristics

👉 Our goal is to identify which features are most influential in defining the insurance price range  
👉 For that, we will use and fine-tune linear models, which are easy to interpret

In [None]:
import pandas as pd
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/medical_costs_small.csv")
data

👇 Create your `X` and `y`. Encode your binary target and scale your features, replacing the original values with the scaled ones.

In [None]:
# X and y

In [None]:
# Encode

In [None]:
# Scale
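The three preparation steps can be sketched on a small stand-in table. This is only an illustration: the toy values, the `age`/`bmi`/`children` columns, the `cheap`/`expensive` labels and the choice of `StandardScaler` are all assumptions, not the contents of the real dataset or the expected solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 60],
    "bmi": [21.5, 30.1, 25.3, 33.8, 27.0, 29.4],
    "children": [0, 2, 1, 3, 0, 1],
    "price_range": ["cheap", "expensive", "cheap", "expensive", "cheap", "expensive"],
})

# Split features and target
X_toy = toy.drop(columns="price_range")
y_toy = toy["price_range"]

# Encode the binary target as 0/1 (LabelEncoder sorts classes alphabetically)
encoder = LabelEncoder()
y_toy_encoded = encoder.fit_transform(y_toy)

# Standardize the features and replace the raw values with the scaled ones
scaler = StandardScaler()
X_toy_scaled = pd.DataFrame(
    scaler.fit_transform(X_toy), columns=X_toy.columns, index=X_toy.index
)
```

Scaling matters here because the coefficients of a regularized linear model are only comparable across features when the features share the same scale.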

## 2. Logistic Regression with Ridge (L2) penalty

We'll use a Ridge model to figure out the **most important features** while limiting the risk of overfitting on our small dataset.

- Fit and tune a Ridge classification model on your dataset to maximize `accuracy`.  
- Your goal here is to find the value of `C` (the inverse of the regularization strength) that gives the best score.

<details>
    <summary>Hints</summary>

- Feel free to use `RandomizedSearchCV` or `GridSearchCV` or a combination of both.

- [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) uses **Ridge** regularization by default. 

- `C` = 1/`alpha`: the larger `C` is, the less you regularize, and the higher the risk of overfitting.

</details>
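As a rough sketch of the tuning step, a grid search over `C` could look like the following. The synthetic data from `make_classification`, the grid bounds and the names `X_demo`, `y_demo` and `ridge_search` are assumptions made for illustration; on the real data you would pass your own scaled `X` and encoded `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: replace with your own X and y)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# LogisticRegression applies an L2 (Ridge) penalty by default;
# C is explored on a log scale because it is an inverse strength
ridge_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 2, 6)},
    scoring="accuracy",
    cv=5,
)
ridge_search.fit(X_demo, y_demo)

print(ridge_search.best_params_, ridge_search.best_score_)
```

A coarse logarithmic grid like this one is usually a reasonable first pass; you can then refine around the best value with a narrower grid or a `RandomizedSearchCV`.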


#### Find your most important features

In [None]:
# Sort features
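One possible way to rank features is to sort the fitted coefficients by absolute value. The sketch below uses synthetic data and hypothetical feature names (`feature_0` to `feature_4`); it assumes the features were scaled beforehand, otherwise coefficient magnitudes are not comparable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
feature_names = [f"feature_{i}" for i in range(5)]
X_raw, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=1
)
X_demo = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=feature_names)

# Fit an L2-penalized logistic regression (C=1.0 is an arbitrary choice here)
model = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary target
coefs = pd.Series(model.coef_[0], index=feature_names)

# Rank by magnitude: the sign only tells the direction of the effect
ranked = coefs.abs().sort_values(ascending=False)
top_2_demo = ranked.index[:2].tolist()
print(ranked)
```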

In [None]:
# Fill your top 2 features names below
top_2_features = ["", ""]

#### 🧪 Test your code below

In [None]:
from nbresult import ChallengeResult
result = ChallengeResult('ridge', top_2_features = top_2_features)
result.write()
print(result.check())

## 3. Logistic Regression with Lasso (L1) penalty

This time, we'll use a **Lasso**-regularized model to **filter out the less important features**.  

👇 Fine-tune your model

<details>
    <summary>Hints</summary>

- You can still use the `LogisticRegression` model
- Read the documentation carefully: not all solvers support all penalty types
    
</details>


#### ❓ Find features that have the least impact
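A minimal sketch of an L1-penalized fit is shown below, assuming synthetic data and hypothetical feature names. It uses the `liblinear` solver, which is one of the solvers that accept `penalty="l1"` (`saga` is another); the value `C=0.05` is an arbitrary choice made to apply strong regularization.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
names = [f"feature_{i}" for i in range(8)]
X_raw, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=2, n_redundant=0, random_state=2
)
X_demo = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=names)

# L1 penalty requires a compatible solver; a small C means strong regularization
lasso_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso_model.fit(X_demo, y_demo)

# Features whose coefficient was shrunk to exactly zero have no impact on predictions
lasso_coefs = pd.Series(lasso_model.coef_[0], index=names)
dropped = lasso_coefs[lasso_coefs == 0].index.tolist()
print(lasso_coefs)
print("Features with a zero coefficient:", dropped)
```

How many coefficients reach zero depends on `C` and on the data, so the list may be empty for weak regularization.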

## 4. `GridSearchCV` with multiple metrics?

What if we wanted to measure a performance metric other than accuracy?

A grid search can be computationally expensive, and you don't want to run one for each performance metric.

👇 Can you run **one** `GridSearchCV` that tracks the `accuracy`, `precision` and `recall` scores at each fit, while keeping `accuracy` as the decision metric used to automatically choose the `best_estimator_`? (Read the docs!)

<details><summary>Hints</summary>

Look at the `refit` argument
    
</details>
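A multi-metric search can be sketched as follows: `scoring` takes a list of scorer names and `refit` names the metric used to select `best_estimator_`. The synthetic data, the grid values and the name `multi_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: replace with your own X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=3)

multi_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring=["accuracy", "precision", "recall"],  # all three are recorded per fit
    refit="accuracy",  # metric used to pick and refit the best estimator
    cv=5,
)
multi_search.fit(X_demo, y_demo)

print(multi_search.best_params_)
```

With several metrics, `refit` cannot stay at its default of `True`: it has to name one of the scorers (or be `False`), otherwise the search does not know which metric defines "best".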

Take some time to understand what's in your grid search's `cv_results_` instance attribute.

❓ Can you rank your fits by mean cross-validated `recall` score?
(Turn `cv_results_` into a DataFrame to make things clearer)
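One way to do this, sketched on synthetic data, is to wrap `cv_results_` in a DataFrame and sort on the `mean_test_recall` column. The data, grid values and variable names below are assumptions; the column names follow the `mean_test_<metric>` and `rank_test_<metric>` pattern that a multi-metric search produces.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small multi-metric search
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=3)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring=["accuracy", "precision", "recall"],
    refit="accuracy",
    cv=5,
).fit(X_demo, y_demo)

# cv_results_ is a dict of arrays: one entry per hyperparameter combination
results = pd.DataFrame(search.cv_results_)

# Keep the most readable columns and sort by mean cross-validated recall
columns = [
    "param_C",
    "mean_test_accuracy",
    "mean_test_precision",
    "mean_test_recall",
    "rank_test_recall",
]
ranked_by_recall = results[columns].sort_values("mean_test_recall", ascending=False)
print(ranked_by_recall)
```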

### 🏁 Congratulations! Don't forget to push the notebook before moving on