# Regularisation in Machine Learning
Regularisation is a technique used in machine learning to prevent overfitting: it adds a penalty on model complexity to the loss, so that the model performs better on unseen data.

There are three common types of regularisation for linear models:
1. Lasso (L1)
2. Ridge (L2)
3. Elastic Net (L1 and L2 combined)

Each type applies penalties in different ways to control model complexity and improve generalisation.

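Note: plain linear regression (ordinary least squares) minimises squared error with no penalty on the coefficients, so every feature is free to take whatever weight fits the training data best.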

## L1 (Lasso) Regularisation

- It adds the sum of the absolute values of the coefficients as a penalty term to the loss function (MSE). This penalty can shrink some coefficients to exactly zero, which keeps only the important features and drops the less important ones.

$$\text{cost} = \text{MSE} + \lambda \sum_i |w_i|$$

(MSE plus λ times the sum of the absolute values of all model weights)

The hyperparameter λ controls the strength of the penalty:
- Higher λ = stronger regularisation (more coefficients pushed toward 0)
- Lower λ = weaker regularisation (model fits the data more closely, but with a higher risk of overfitting)

Use cross-validation to choose the λ that gives the best balance between bias and variance.

- Because some coefficients are pushed to exactly 0, L1 effectively performs feature selection.

- Geometrically, the constraint region ∑|wi| ≤ t forms a diamond shape (in two dimensions).
- When the contours of the loss first touch the diamond at a corner, some weights become exactly 0 → sparse solution.
- Result: only the most important features remain.
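The effect of the penalty strength on sparsity can be sketched with scikit-learn's `Lasso`. This is only an illustration: the data is synthetic (generated with `make_regression`, not the notebook's dataset) and the alpha values are arbitrary choices. In scikit-learn, the `alpha` parameter plays the role of λ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption, not the notebook's dataset):
# 10 features, of which only 3 actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

alphas = [0.01, 1.0, 10.0, 1000.0]  # arbitrary values, from weak to very strong
zero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    # Count coefficients that the L1 penalty has set to exactly zero
    n_zero = int(np.sum(model.coef_ == 0.0))
    zero_counts.append(n_zero)
    print(f"alpha={alpha:>7}: {n_zero} of {X.shape[1]} coefficients are exactly 0")
```

With a large enough alpha every coefficient is zeroed and the model predicts only the mean of the target, which is the underfitting end of the trade-off.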

- Disadvantages:
    - can underperform when all features are relevant.
    - unstable when features are correlated (tends to pick one of the correlated features and drop the others)

## L2 (Ridge) Regularisation

- It adds the sum of the squared coefficients as a penalty term to the loss function (MSE).
    - this shrinks all of the coefficients

$$\text{cost} = \text{MSE} + \lambda \sum_i w_i^2$$

(MSE plus λ times the sum of all squared model weights)

- Encourages small weights (shrinks them toward zero), but rarely makes them exactly zero.

- Geometrically, the constraint region ∑wi² ≤ t forms a circle (or a sphere in higher dimensions).
- The circle has no corners, so the optimum tends to shrink all weights smoothly and none drop exactly to zero.
- Result: Keeps all features, but reduces their magnitude (weight shrinkage).
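A simplified one-weight view makes the difference between the two penalties concrete. Assume, purely for illustration, a single weight whose unpenalised estimate is `z`, with the loss written as `0.5*(w - z)**2`. Minimising that loss plus `lam*abs(w)` gives soft-thresholding (values within `lam` of zero become exactly zero), while minimising it plus `lam*w**2` gives `z / (1 + 2*lam)` (every value is scaled down, none reach zero). The function names below are hypothetical.

```python
def l1_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(z) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # |z| <= lam: the weight is set exactly to zero
    return magnitude if z > 0 else -magnitude


def l2_shrink(z, lam):
    """Minimiser of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


lam = 0.5
for z in [3.0, 1.0, 0.4, -0.2]:
    print(f"z={z:+.1f}   L1 -> {l1_shrink(z, lam):+.3f}   L2 -> {l2_shrink(z, lam):+.3f}")
```

The L1 rule subtracts a fixed amount from every weight, so small weights hit zero; the L2 rule multiplies by a fixed factor, so a non-zero weight stays non-zero.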

- Disadvantages:
    - no feature selection
    - less interpretable model, harder to know which features matter
    - not ideal when you have many irrelevant variables
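    - the penalty corresponds to a Gaussian (normal) prior on the weights, which may not suit problems where a few weights are large and most are zero
    - less suited than L1 to sparse problems, where only a few features carry signal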


## Elastic Net
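- Combines the L1 and L2 penalties, so it can set some coefficients to zero while still sharing weight between correlated features.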


$$\text{cost} = \text{MSE} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$
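The three cost functions can be written as one function, with Lasso and Ridge as the special cases λ2 = 0 and λ1 = 0. The sketch below is a direct transcription of the formulas in these notes, not scikit-learn's implementation; the function name `penalised_cost` is hypothetical, and leaving the intercept `b` unpenalised is an assumption (it is the usual convention).

```python
import numpy as np


def penalised_cost(X, y, w, b, lam1=0.0, lam2=0.0):
    """MSE + lam1 * sum(|w_i|) + lam2 * sum(w_i^2); the intercept b is not penalised."""
    residuals = y - (X @ w + b)
    mse = np.mean(residuals ** 2)
    return mse + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)


# Tiny hand-made example: y is fitted exactly, so MSE is 0 and only the penalty remains.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
w = np.array([2.0, -1.0])
b = 0.5
y = X @ w + b

print("no penalty :", penalised_cost(X, y, w, b))
print("L1 only    :", penalised_cost(X, y, w, b, lam1=0.1))
print("L2 only    :", penalised_cost(X, y, w, b, lam2=0.1))
print("elastic net:", penalised_cost(X, y, w, b, lam1=0.1, lam2=0.1))
```

Note that scikit-learn scales these objectives differently: `Lasso` and `ElasticNet` divide the squared error by `2 * n_samples`, `Ridge` uses the unscaled sum of squared errors, and `ElasticNet` is parameterised by `alpha` and `l1_ratio` rather than by separate λ1 and λ2. The same numeric `alpha` therefore does not mean the same penalty strength across the three estimators.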



## When to use

- Many irrelevant or redundant features: L1 (Lasso), which will eliminate them
- All features are useful but may be noisy: L2 (Ridge), which reduces overfitting without discarding any
- Highly correlated features: L2 (Ridge), which tends to share weight between them
- Need interpretability (which features matter): L1 (Lasso)
- Numerical stability and smooth optimisation: L2 (Ridge)
- Want both benefits: Elastic Net (mix of L1 and L2)
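The behaviour with correlated features can be illustrated with an extreme, artificial case: the same column included twice. This is a sketch on synthetic data under that assumption, with arbitrary alpha values. With exact duplicates the Lasso solution is not unique, so which copy receives the weight depends on the solver; the code simply prints what scikit-learn's defaults produce.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Extreme case of correlation (an assumption for illustration): the same column twice.
X = np.column_stack([x, x])
y = 3.0 * x + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Compare how each model distributes the total weight over the two copies
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```

Ridge's squared penalty is smallest when the weight is split evenly between the two copies, whereas the L1 penalty is indifferent to how the weight is split, which is why Lasso's choice among correlated features can look arbitrary.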


## Comparison of each regularisation and the base model

In [53]:
# setup

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error

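# Load the Auto dataset and use the car name as the index
data = pd.read_csv('../../data/Auto.csv')
data.set_index('name', inplace=True)

# Features are every column except the target, mpg
data_X = data.drop('mpg', axis=1)
data_y = data['mpg']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2, random_state=42)
data.head()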

| name | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 |
| buick skylark 320 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 |
| plymouth satellite | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 |
| amc rebel sst | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 |
| ford torino | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 |


In [63]:
# Base model: ordinary least squares with no regularisation (score is R^2 on the test set)

model = LinearRegression()
model.fit(X_train, y_train)
r_score = model.score(X_test, y_test)
coefficients = model.coef_
print("Coefficients:", coefficients)
r_score

Coefficients: [-0.34578883  0.01510871 -0.02130175 -0.00614163  0.03795001  0.76774258
  1.61345707]


0.7901500386760352

In [64]:
# L1 Regularisation (Lasso)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_r_score = lasso_model.score(X_test, y_test)
lasso_r_score

0.789304646084754

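With alpha=0.1, L1 has not improved R² over the base model (0.7893 vs 0.7902). Let's try a range of alpha values and keep the one with the best score. Note that this sweep scores each alpha on the test split rather than with k-fold cross-validation, so the best score it reports is an optimistic estimate.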

In [65]:
# L1 Regularisation (Lasso): sweep alpha and score each value on the test split

parameters = {'alpha': [i for i in np.arange(0.01, 1.0, 0.01)]}
scores = []
for parameter in parameters['alpha']:
    model = Lasso(alpha=parameter)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
best_alpha = parameters['alpha'][np.argmax(scores)]
print("Best alpha:", best_alpha)
print("Best score:", max(scores))
coefficients = Lasso(alpha=best_alpha).fit(X_train, y_train).coef_
print("Coefficients:", coefficients)

Best alpha: 0.49
Best score: 0.7933732960042428
Coefficients: [-0.         -0.         -0.01122415 -0.00630222  0.          0.71450716
  0.26975546]


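An alpha of 0.49 gives the best score (0.7934), which is slightly higher than the base model's 0.7902. At this alpha, Lasso has set three coefficients (cylinders, displacement and acceleration) to exactly zero.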

In [66]:
# L2 Regularisation (Ridge)
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train, y_train)
ridge_r_score = ridge_model.score(X_test, y_test)
ridge_r_score

0.7901661107227567

L2 gives a very slightly better score than the base model for alpha=0.1 (0.79017 vs 0.79015).

In [68]:

# L2 Regularisation (Ridge): sweep alpha and score each value on the test split
parameters = {'alpha': [i for i in np.arange(0.01, 1.0, 0.01)]}
scores = []
for parameter in parameters['alpha']:
    model = Ridge(alpha=parameter)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score)
best_alpha = parameters['alpha'][np.argmax(scores)]
print("Best alpha:", best_alpha)
print("Best score:", max(scores))
coefficients = Ridge(alpha=best_alpha).fit(X_train, y_train).coef_
print("Coefficients:", coefficients)

Best alpha: 0.99
Best score: 0.7903067847624526
Coefficients: [-0.33971326  0.01492141 -0.02114254 -0.00614542  0.03810602  0.76745057
  1.59935737]


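After sweeping alpha, L1 gives the best result (R² ≈ 0.7934) compared with L2 (0.7903) and the base model (0.7902), although the differences are small. This could be because some features are irrelevant or redundant: Lasso drops three of them, while Ridge keeps all seven with coefficients almost unchanged from the base model. The best Ridge alpha (0.99) sits at the upper edge of the grid, so a wider range of alpha values may be worth trying.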
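The alpha sweeps in this notebook score every candidate on the test split. A minimal sketch of choosing alpha by k-fold cross-validation on the training data only, using scikit-learn's `LassoCV` and `RidgeCV`, might look like the following. Several things here are assumptions rather than part of the original analysis: the data is a synthetic stand-in so the example runs without `Auto.csv`, five folds are used, and a `StandardScaler` step is added because the penalties depend on feature scale.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (an assumption, so the sketch runs anywhere).
X, y = make_regression(n_samples=300, n_features=7, n_informative=4,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

alphas = np.arange(0.01, 1.0, 0.01)  # same grid as the sweeps in this notebook

# Standardising first puts every coefficient on the same scale before it is penalised.
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10_000))
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))

for name, pipe in [("Lasso", lasso_cv), ("Ridge", ridge_cv)]:
    pipe.fit(X_train, y_train)            # alpha is chosen using the training folds only
    best_alpha = pipe[-1].alpha_          # fitted attribute of LassoCV / RidgeCV
    test_r2 = pipe.score(X_test, y_test)  # the test split is used once, for the final report
    print(f"{name}: best alpha={best_alpha:.2f}, test R^2={test_r2:.3f}")
```

Because the test split plays no part in selecting alpha, the reported R² is a fairer estimate of performance on unseen data than the maximum over a test-set sweep.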