# Regularization in ML
- Regularization is a technique used in machine learning to prevent or reduce overfitting and to improve a model's generalization performance.
- It adds a penalty term to the loss function the model minimizes, not to the prediction formula $y = B_0 + B_1X_1 + \epsilon$ (where $\epsilon$ is the error term): $\text{Loss} = \text{MSE} + \lambda \cdot \text{Penalty}(B)$, where $\lambda$ sets the penalty strength.
- There's no single best regularization strength. You control it with a hyperparameter (`alpha` in scikit-learn's `Lasso`, `Ridge`, and `ElasticNet`) and tune it to get the best outcome.
- Types of Regularization:
    - L1 Regularization (Lasso)
        - **Technique:** It applies a penalty proportional to the absolute value of the coefficients.
        - **Effect:** It encourages sparsity in the model, meaning it can shrink some coefficients to exactly zero. This makes Lasso useful for feature selection, since only the most informative features keep non-zero coefficients, which also helps reduce overfitting.
        - **Visualization:** Imagine a diamond-shaped constraint region with sharp corners; the solution often lands on a corner, where some coefficients are exactly zero.
        - **Suitable for:** Scenarios where feature selection is desirable and model interpretability matters, since zeroed coefficients drop features from the model.
    - L2 Regularization (Ridge)
        - **Technique:** It applies a penalty proportional to the squared value of the coefficients.
        - **Effect:** It prevents the coefficients from growing too large, thereby reducing the variance of the model. Unlike L1, L2 regularization doesn't drive coefficients to exactly zero; it shrinks them toward zero.
        - **Visualization:** Imagine a circular constraint region with no corners, so coefficients shrink smoothly instead of snapping to zero.
        - **Suitable for:** When feature selection is not a primary concern, and reducing model complexity for better generalization is important.
    - ElasticNet
        - **Technique:** It combines L1 and L2 regularization by adding a weighted mix of both penalties to the loss function.
        - **Effect:** Because you can control the balance between L1 and L2, it's often recommended when you have highly correlated features, where Lasso alone tends to keep one feature from a correlated group somewhat arbitrarily.
        - **Visualization:** The constraint region falls somewhere between L1's sharp corners and L2's smooth shape, depending on the `l1_ratio`.
        - **Suitable for:** When you want both feature selection and coefficient shrinkage, or when you're unsure which approach (L1 or L2) might be better for your problem.
- Applying Regularization: 
    - For linear regression in scikit-learn, switch from `LinearRegression()` to `Lasso()`, `Ridge()`, or `ElasticNet()`.
    - Other algorithms expose regularization through hyperparameters. For example, `LogisticRegression` takes a `penalty` argument that accepts `'l1'`, `'l2'`, `'elasticnet'`, or `None` (default `'l2'`).
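
A minimal sketch of the difference in effect between the three penalties, using synthetic data rather than any dataset from this notebook (the data, the `alpha` values, and the variable names are illustrative assumptions). Only the first two of six features drive the target, so Lasso is expected to zero out at least some of the rest, while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data (assumption): 6 features, only the first two influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Illustrative penalty strengths, not tuned values
demo_estimators = {
    'Lasso': Lasso(alpha=0.5),
    'Ridge': Ridge(alpha=0.5),
    'ElasticNet': ElasticNet(alpha=0.5, l1_ratio=0.5),
}

zero_counts = {}
for name, est in demo_estimators.items():
    est.fit(X_demo, y_demo)
    # count coefficients that are exactly zero (i.e. features dropped by the model)
    zero_counts[name] = int(np.sum(est.coef_ == 0))
    print(name, np.round(est.coef_, 3), 'zero coefficients:', zero_counts[name])
```

Comparing the printed coefficient vectors side by side is usually the quickest way to see which penalty suits a given feature set.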

Here's a table summarizing the key points:

| Feature                 | L1 Regularization (Lasso) | L2 Regularization (Ridge) | Elastic Net Regularization |
|-------------------------|---------------------------|---------------------------|----------------------------|
| Penalty Term             | Absolute value of coefficients | Square of coefficients    | L1 + L2 (with weight)       |
| Effect on Coefficients   | Drives some to zero (sparse) | Shrinks towards zero       | Combination of both         |
| Feature Selection       | Yes                       | No                         | Can be achieved             |
| Model Complexity        | Reduced                   | Reduced                    | Reduced                     |
| Generalization           | Improved                  | Improved                    | Improved (potentially)     |
| Constraint shape         | Diamond (sharp corners)    | Circle (smooth)             | Between the two shapes     |
| Suitable for             | Feature selection, interpretability | Reducing complexity          | Combining selection & shrinkage |


Choosing the right regularization technique depends on your specific problem and the desired outcome. Experimentation is often necessary to find the best approach for your dataset and model. 

- L1 coefficient shrinkage step: $\beta \leftarrow \beta - \lambda \cdot \text{sign}(\beta)$
- Where: $\beta$ is the coefficient, $\lambda$ is the regularization strength, and $\text{sign}(\beta)$ is the sign of the coefficient ($-1$ for negative, $+1$ for positive)
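
One way to see where this update comes from is sketched below. It assumes a squared-error loss and a gradient-descent step size $\eta$, neither of which is stated above, so treat it as an illustration and not as the exact procedure a given library uses.

$$
J_{L1}(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j} |\beta_j|,
\qquad
\frac{\partial}{\partial \beta_j}\,\lambda\,|\beta_j| = \lambda \cdot \text{sign}(\beta_j) \quad (\beta_j \neq 0)
$$

$$
\text{L1 penalty step:}\quad \beta_j \leftarrow \beta_j - \eta\,\lambda \cdot \text{sign}(\beta_j)
\qquad\qquad
\text{L2 penalty step:}\quad \beta_j \leftarrow \beta_j - 2\,\eta\,\lambda\,\beta_j = (1 - 2\eta\lambda)\,\beta_j
$$

With $\eta$ absorbed into $\lambda$, the L1 step matches the formula above: every non-zero coefficient moves toward zero by the same fixed amount, which is why small coefficients can reach exactly zero. The L2 step instead scales each coefficient by a constant factor, so coefficients get smaller but do not land on zero.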

![reg](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F17277811%2F75f6d401b8efcc9329cde3ffe0bf6d71%2Fridge2.png?generation=1723038136194204&alt=media)

# Automating Multiple Regression Models with Hyperparameter Tuning

In [32]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

#model evaluation
from sklearn.metrics import mean_squared_error, r2_score

In [33]:
path = '/Users/bassel_instructor/Documents/Datasets/'

df = pd.read_csv(path+'insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


Expenses represents the target, which is the amount of medical expenses.

In [34]:
df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
expenses    0
dtype: int64

In [35]:
for col in ['sex', 'smoker', 'region']:
    print(col, ':',df[col].unique())

sex : ['female' 'male']
smoker : ['yes' 'no']
region : ['southwest' 'southeast' 'northwest' 'northeast']


- `region` is not ordinal, so we use one-hot encoding (`get_dummies()`)
- `sex` and `smoker` are binary, so we can use either one-hot encoding or label encoding (`map()` or `factorize()`)
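
A small sketch of the three encoding options on a toy frame (the `toy` DataFrame and the new column names are made up for illustration). Note that `factorize()` numbers categories in order of first appearance, while `map()` lets you choose the codes explicitly:

```python
import pandas as pd

# Toy data (assumption): mirrors the column names but not the real rows
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
})

# map(): explicit codes chosen by you
toy['smoker_map'] = toy['smoker'].map({'no': 0, 'yes': 1})

# factorize(): codes assigned in order of first appearance ('yes' is seen first here)
codes, uniques = pd.factorize(toy['smoker'])
toy['smoker_factorize'] = codes

# get_dummies(): one 0/1 column per category for the non-ordinal feature
toy = pd.get_dummies(toy, columns=['region'], dtype=int)
print(toy)
print('factorize categories:', list(uniques))
```

If the direction of the code matters for interpretation (for example, wanting 1 to mean "smoker"), the explicit `map()` version is the safer choice.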

In [36]:
df_org = df.copy()

In [37]:
df = pd.get_dummies(data=df, columns=['region'], dtype=int)
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,expenses,region_northeast,region_northwest,region_southeast,region_southwest
0,19,female,27.9,0,yes,16884.92,0,0,0,1
1,18,male,33.8,1,no,1725.55,0,0,1,0
2,28,male,33.0,3,no,4449.46,0,0,1,0
3,33,male,22.7,0,no,21984.47,0,1,0,0
4,32,male,28.9,0,no,3866.86,0,1,0,0


In [38]:
df['sex'], sex_mapping = pd.factorize(df['sex'])
df['smoker'], smoker_mapping = pd.factorize(df['smoker'])
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,expenses,region_northeast,region_northwest,region_southeast,region_southwest
0,19,0,27.9,0,0,16884.92,0,0,0,1
1,18,1,33.8,1,1,1725.55,0,0,1,0
2,28,1,33.0,3,1,4449.46,0,0,1,0
3,33,1,22.7,0,1,21984.47,0,1,0,0
4,32,1,28.9,0,1,3866.86,0,1,0,0


> We can perform further data preprocessing and wrangling, but we'll skip this to jump straight to our topic.

In [39]:
X = df.drop(columns='expenses')  # features: everything except the target
y = df['expenses']

In [None]:
# Optional - hold off
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Define The Models and Their Hyperparameters To Test

In [40]:
models = [
    {'name':'Linear Regression', 'model':LinearRegression()}, #basic linear regression no hyperparameters
    {'name':'Ridge Regression', 'model': Ridge(), 'params':{'alpha': [0.01, 0.1, 1, 10 , 100]}}, # experimenting with diff regularization intensity values
    {'name':'Lasso Regression', 'model': Lasso(), 'params':{'alpha': [0.01, 0.1, 1, 10 , 100]}},
    {'name':'ElasticNet Regression', 'model': ElasticNet(), 'params':{'alpha': [0.01, 0.1, 1], 'l1_ratio':[0.2, 0.3,0.5]}} # experimenting with 2 diff hyperparameters
]

In [11]:
for model_info in models:
    print(model_info)

{'name': 'Linear Regression', 'model': LinearRegression()}
{'name': 'Ridge Regression', 'model': Ridge(), 'params': {'alpha': [0.01, 0.1, 1, 10, 100]}}
{'name': 'Lasso Regression', 'model': Lasso(), 'params': {'alpha': [0.01, 0.1, 1, 10, 100]}}
{'name': 'ElasticNet Regression', 'model': ElasticNet(), 'params': {'alpha': [0.01, 0.1, 1], 'l1_ratio': [0.2, 0.3, 0.5]}}
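
Each `params` dictionary expands into a grid of candidate combinations, and a grid search fits every candidate once per fold. A quick sketch of how to count that work up front, assuming 5 folds (the `param_grids` and `fit_counts` names are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in the models list above
param_grids = {
    'Ridge Regression': {'alpha': [0.01, 0.1, 1, 10, 100]},
    'Lasso Regression': {'alpha': [0.01, 0.1, 1, 10, 100]},
    'ElasticNet Regression': {'alpha': [0.01, 0.1, 1], 'l1_ratio': [0.2, 0.3, 0.5]},
}

n_folds = 5  # assumption: matches cv=5
fit_counts = {}
for name, grid in param_grids.items():
    n_candidates = len(ParameterGrid(grid))  # number of hyperparameter combinations
    fit_counts[name] = n_candidates * n_folds
    print(f"{name}: {n_candidates} candidates x {n_folds} folds = {fit_counts[name]} fits")
```

By default `GridSearchCV` also refits the best candidate once on all the data it was given, so the real total is one fit higher per model.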


Two methods:
- **Method 1**: Use this for a quick and easy `GridSearchCV` run, especially on a small dataset
    - No pre-split
    - Choose a specific metric
    - Let `GridSearchCV` run the search based on the specified metric
    - Get the `best_estimator_`
- **Method 2**: Use this if you want a second evaluation on held-out data with multiple metrics, and you have enough data to afford multiple splits
    - Pre-split the data (train vs test)
    - Let `GridSearchCV` run the search using each estimator's default scoring metric
    - Run an additional evaluation on the test data with multiple metrics
    - Decide which model is the best based on 2 metrics:
        - `mean_squared_error`
        - `r2_score`

#### Method 1

In [42]:

for model_info in models:
    model_gs = GridSearchCV(model_info['model'] # the estimator to tune
                            , model_info.get('params', {}) # empty grid for models without hyperparameters
                            , cv = 5  # number of folds in cross-validation
                            , scoring='neg_mean_squared_error' # negated MSE, so higher is better
                            )
    
    model_gs.fit(X, y) # you can use X_train, y_train if you want to preserve another slice of the data for additional evaluation

# Note: these prints run after the loop, so they report only the last model fitted (ElasticNet)
print("Best parameters:", model_gs.best_params_)
print("Best cross-validation score:", model_gs.best_score_)

Best parameters: {'alpha': 0.01, 'l1_ratio': 0.5}
Best cross-validation score: -36985204.79887968
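
Because the prints sit after the loop, only the last model's result is reported. A minimal sketch of collecting one row per model instead, so the candidates can be compared directly. It runs on synthetic stand-in data from `make_regression`, not the insurance dataset, and the `demo_models`, `rows`, and `summary` names are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_models = [
    {'name': 'Linear Regression', 'model': LinearRegression()},
    {'name': 'Ridge Regression', 'model': Ridge(), 'params': {'alpha': [0.1, 1, 10]}},
    {'name': 'Lasso Regression', 'model': Lasso(), 'params': {'alpha': [0.1, 1, 10]}},
]

rows = []
for model_info in demo_models:
    gs = GridSearchCV(model_info['model'],
                      model_info.get('params', {}),
                      cv=5,
                      scoring='neg_mean_squared_error')
    gs.fit(X_demo, y_demo)
    # record the result inside the loop so no model is overwritten
    rows.append({'name': model_info['name'],
                 'best_params': gs.best_params_,
                 'cv_mse': -gs.best_score_})  # flip the sign back to plain MSE

# lowest cross-validated MSE first
summary = pd.DataFrame(rows).sort_values('cv_mse').reset_index(drop=True)
print(summary)
```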


#### Method 2

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [25]:
# track needed parameters for final evaluation
model_names = []
best_parameters = []
mean_sq_err_scores = []
r_sq_scores = []

for model_info in models:
    model_gs = GridSearchCV(model_info['model'] # the estimator to tune
                            , model_info.get('params', {})
                            , cv = 5  # number of folds in cross-validation
                            # no scoring given, so GridSearchCV uses each estimator's default score method (R^2 for these regressors)
                            )
    
    model_gs.fit(X_train, y_train) # fit on the training split only; the test split is held out for the final evaluation

    # calculate the predicted values for evaluation
    gs_best_model = model_gs.best_estimator_
    y_pred = gs_best_model.predict(X_test)

    #calculate evaluation metrics
    mse_val = mean_squared_error(y_test, y_pred)
    r2_val = r2_score(y_test, y_pred)

    #append insights to the empty lists
    model_names.append(model_info['name'])
    best_parameters.append(model_gs.best_params_)
    mean_sq_err_scores.append(mse_val)
    r_sq_scores.append(r2_val)
    



In [21]:
results_dict = {'model_names':model_names,
'best_parameters':best_parameters,
'mean_sq_err_scores':mean_sq_err_scores,
'r_sq_scores':r_sq_scores}

In [22]:
pd.DataFrame(results_dict)

Unnamed: 0,model_names,best_parameters,mean_sq_err_scores,r_sq_scores
0,Linear Regression,{},39129920.0,0.742771
1,Ridge Regression,{'alpha': 1},39155500.0,0.742602
2,Lasso Regression,{'alpha': 0.1},39129820.0,0.742771
3,ElasticNet Regression,"{'alpha': 0.01, 'l1_ratio': 0.5}",39327990.0,0.741469


What we're looking for:
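- The highest `r2_score`, as close to 1 as possible
- The lowest `mean_squared_error`. Note: unlike `r2_score`, MSE has no fixed scale or upper bound. It is expressed in squared units of the target, so it is only meaningful when comparing models on the same data. The goal is always the lowest value possible.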
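
A short sketch of how these two criteria can be applied, using the MSE values copied from the results table above. The RMSE column and the tiny `y_toy` array are additions for illustration: RMSE puts the error back in the units of the target, and the last lines show that always predicting the mean gives an `r2_score` of 0, which is the reference point for R².

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# MSE values copied from the results table above
results = pd.DataFrame({
    'model_names': ['Linear Regression', 'Ridge Regression',
                    'Lasso Regression', 'ElasticNet Regression'],
    'mean_sq_err_scores': [39129920.0, 39155500.0, 39129820.0, 39327990.0],
})

# RMSE is in the same units as the target (expenses), which is easier to read
results['rmse'] = np.sqrt(results['mean_sq_err_scores'])

# rank models from lowest to highest MSE
ranked = results.sort_values('mean_sq_err_scores').reset_index(drop=True)
print(ranked)

# R^2 reference point: a model that always predicts the mean scores 0
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
baseline_r2 = r2_score(y_toy, np.full_like(y_toy, y_toy.mean()))
print('R2 of a mean-only prediction:', baseline_r2)
```

The differences between the four linear models here are small relative to the RMSE, so the ranking alone should not be read as a strong preference for one of them.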

In [23]:
mean_sq_err_scores

[39129920.26774458, 39155503.99207848, 39129820.88042316, 39327986.12157916]

In [26]:
1000 * .8 # rows used for training in each fold with 5-fold cross-validation only (assuming 1,000 rows)

800.0

In [27]:
1000 * .8 * .8 # rows used for training in each fold after an 80/20 pre-split plus 5-fold cross-validation

640.0
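
The same arithmetic can be checked against scikit-learn's own splitters. This sketch assumes a placeholder array of 1,000 rows and an unshuffled 5-fold split; the variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X_placeholder = np.zeros((1000, 1))  # 1,000 rows of dummy data (assumption)

# Method 1: cross-validation only
sizes_cv_only = [len(train_idx) for train_idx, _ in KFold(n_splits=5).split(X_placeholder)]

# Method 2: 80/20 pre-split first, then cross-validation on the training part
X_tr, X_te = train_test_split(X_placeholder, test_size=0.2, random_state=0)
sizes_presplit = [len(train_idx) for train_idx, _ in KFold(n_splits=5).split(X_tr)]

print('Training rows per fold, CV only:', sizes_cv_only)
print('Training rows per fold, pre-split + CV:', sizes_presplit)
```

This is the trade-off behind choosing a method: the pre-split gives an untouched test set but leaves fewer rows for each training fold.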

### Checking for Overfitting

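Pass `return_train_score=True` to `GridSearchCV` so that `cv_results_` also records the scores on the training folds. Comparing them with the validation scores shows whether a model is overfitting.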

In [53]:

for model_info in models:
    model_gs = GridSearchCV(model_info['model'] # the estimator to tune
                            , model_info.get('params', {})
                            , cv = 5  # number of folds in cross-validation
                            , scoring='neg_mean_squared_error'
                            , return_train_score=True # also record scores on the training folds
                            )
    
    model_gs.fit(X, y) # you can use X_train, y_train if you want to preserve another slice of the data for additional evaluation

# Note: these prints run after the loop, so they report only the last model fitted (ElasticNet)
print("Best parameters:", model_gs.best_params_)
print("Best cross-validation score:", -model_gs.best_score_) # sign flipped to show plain MSE


Best parameters: {'alpha': 0.01, 'l1_ratio': 0.5}
Best cross-validation score: 36985204.79887968


In [52]:
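# Extract results (model_gs holds only the last model fitted in the loop, ElasticNet)
results = model_gs.cv_results_
train_scores = results['mean_train_score']  # mean negated MSE on the training folds
test_scores = results['mean_test_score']    # mean negated MSE on the validation folds

# Compare train and test scores
print("\nTrain vs Test Scores:")
for params, train_score, test_score in zip(results['params'], train_scores, test_scores):
    print(f"Parameters: {params}")
    # scores are negated MSE, so flip the sign to print plain MSE
    print(f"Train Score: {-train_score:,.3f}, Test Score: {-test_score:,.3f}")
    # equals validation MSE minus training MSE; a large positive gap signals overfitting
    print(f"Difference: {train_score - test_score:,.3f}")
    print()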


Train vs Test Scores:
Parameters: {'alpha': 0.01, 'l1_ratio': 0.2}
Train Score: 36,660,505.825, Test Score: 37,103,700.203
Difference: 443,194.378

Parameters: {'alpha': 0.01, 'l1_ratio': 0.3}
Train Score: 36,614,113.254, Test Score: 37,058,579.050
Difference: 444,465.795

Parameters: {'alpha': 0.01, 'l1_ratio': 0.5}
Train Score: 36,537,902.763, Test Score: 36,985,204.799
Difference: 447,302.035

Parameters: {'alpha': 0.1, 'l1_ratio': 0.2}
Train Score: 46,542,647.353, Test Score: 47,004,015.953
Difference: 461,368.600

Parameters: {'alpha': 0.1, 'l1_ratio': 0.3}
Train Score: 44,864,073.852, Test Score: 45,317,968.531
Difference: 453,894.679

Parameters: {'alpha': 0.1, 'l1_ratio': 0.5}
Train Score: 41,601,533.171, Test Score: 42,041,992.423
Difference: 440,459.252

Parameters: {'alpha': 1, 'l1_ratio': 0.2}
Train Score: 100,146,532.713, Test Score: 100,692,685.882
Difference: 546,153.169

Parameters: {'alpha': 1, 'l1_ratio': 0.3}
Train Score: 97,166,582.592, Test Score: 97,719,361.001
D
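
Absolute differences are hard to judge when the MSE values are this large, so it can help to express the gap relative to the training error. A minimal sketch on synthetic stand-in data (not the insurance dataset); the `report` frame and its column names are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

gs = GridSearchCV(ElasticNet(),
                  {'alpha': [0.01, 0.1, 1], 'l1_ratio': [0.2, 0.5]},
                  cv=5,
                  scoring='neg_mean_squared_error',
                  return_train_score=True)
gs.fit(X_demo, y_demo)

cv = pd.DataFrame(gs.cv_results_)
report = pd.DataFrame({
    'params': cv['params'],
    'train_mse': -cv['mean_train_score'],  # flip the sign back to plain MSE
    'val_mse': -cv['mean_test_score'],
})
# gap as a percentage of the training error; large positive values suggest overfitting
report['gap_pct'] = 100 * (report['val_mse'] - report['train_mse']) / report['train_mse']
print(report)
```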

Overall, the linear models reach an `r2_score` of about 0.74, and the training and validation errors above differ by roughly 1% or less, which suggests little overfitting. We have multiple options to improve the score:
- Go back to the hyperparameter lists, tweak them, and rerun
- Consider non-linear models such as `RandomForestRegressor()`
- Go back to the data prep stage and improve feature engineering, e.g. data scaling, outlier treatment, etc.
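
As an illustration of the data scaling option: Ridge, Lasso, and ElasticNet penalize coefficient size, so features on very different scales are penalized unevenly. One common pattern is to put a scaler and the model in a `Pipeline` so that scaling is refit inside every cross-validation fold. This is a sketch on synthetic data, and the step names `scaler` and `ridge` are arbitrary labels chosen here:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),  # standardize features to zero mean, unit variance
    ('ridge', Ridge()),
])

# hyperparameters of a pipeline step are addressed as <step name>__<parameter>
alpha_grid = [0.01, 0.1, 1, 10, 100]
grid = GridSearchCV(pipe, {'ridge__alpha': alpha_grid}, cv=5,
                    scoring='neg_mean_squared_error')
grid.fit(X_demo, y_demo)

print('Best parameters:', grid.best_params_)
print('Best cross-validation MSE:', -grid.best_score_)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training.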

### Using `RandomForestRegressor()`

In [54]:
from sklearn.ensemble import RandomForestRegressor

rfr_model = RandomForestRegressor()

rfr_model.fit(X_train, y_train)

In [56]:
#evaluation
y_test_pred = rfr_model.predict(X_test)

#calculate the evaluation metrics
mse_val = mean_squared_error(y_test, y_test_pred)
r2_val = r2_score(y_test, y_test_pred)

In [60]:
print("R2 Score:", r2_val)
print(f"Mean Squared Error score:{mse_val:,.1f}" )

R2 Score: 0.8257337087906462
Mean Squared Error score:26,509,517.1


In [58]:
pd.DataFrame(results_dict)

Unnamed: 0,model_names,best_parameters,mean_sq_err_scores,r_sq_scores
0,Linear Regression,{},39129920.0,0.742771
1,Ridge Regression,{'alpha': 1},39155500.0,0.742602
2,Lasso Regression,{'alpha': 0.1},39129820.0,0.742771
3,ElasticNet Regression,"{'alpha': 0.01, 'l1_ratio': 0.5}",39327990.0,0.741469


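**Observation** The Random Forest model (an ensemble of decision trees that can capture non-linear relationships and feature interactions) did noticeably better than the linear regression models: an R2 of about 0.83 versus 0.74, and an MSE of about 26.5 million versus 39.1 million. This could be because the relationship between the features and `expenses` is non-linear. Note that this comparison rests on a single train/test split and an untuned forest.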
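
To illustrate why a tree ensemble can beat a linear model when the data is non-linear, here is a sketch on synthetic data where the target depends on an interaction between a binary feature and a continuous one. The features `flag` and `level` are hypothetical and not taken from the insurance dataset. Cross-validated R2 is used so the comparison does not hinge on one split:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): the effect of `level` exists only when `flag` is 1
rng = np.random.default_rng(0)
n = 600
flag = rng.integers(0, 2, size=n)   # hypothetical binary feature
level = rng.normal(size=n)          # hypothetical continuous feature
X_demo = np.column_stack([flag, level])
y_demo = 5.0 * flag * level + rng.normal(scale=0.3, size=n)

# mean R^2 across 5 folds for each model
lin_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2').mean()
rf_r2 = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                        X_demo, y_demo, cv=5, scoring='r2').mean()

print(f"Linear Regression mean R2: {lin_r2:.3f}")
print(f"Random Forest mean R2:     {rf_r2:.3f}")
```

A linear model has no way to express "this slope applies only when the flag is set" unless an interaction term is engineered by hand, whereas the forest can learn it from splits.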