# **Regularization**
- Regularization is a technique in machine learning that helps prevent overfitting by adding a penalty on large coefficients to the model's loss function.

- In simple words → it makes the model simpler by shrinking the weights (slopes).

- Think of it as: "Don’t let the line tilt too aggressively, keep coefficients small and stable."

There are mainly two types:

## 1. **Ridge Regression (L2 Regularization)**
- Ridge Regression is a type of linear regression that adds a penalty to large coefficients to prevent overfitting. 
- It shrinks coefficients (slopes) but never makes them exactly zero.

In [1]:
"""
Ridge Regression (L2 Regularization)
------------------------------------

Mathematical Model:
    ŷ = B0 + B1*x

How Ridge differs from Linear Regression:
    In normal Linear Regression:
        B1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

    In Ridge Regression:
        B1 = Σ(xi - x̄)(yi - ȳ) / [Σ(xi - x̄)² + λ]

    - λ (lambda) is the regularization strength (hyperparameter).
    - Larger λ → coefficients shrink more (slope becomes smaller).
    - λ = 0 → Ridge becomes normal linear regression.
    
Intercept (B0):
    Still calculated the same way:
        B0 = ȳ - B1 * x̄
    But because B1 is smaller in Ridge, B0 adjusts accordingly.
    
Example:
    x = [1,2,3,4,5], y = [3,4,2,5,6]
    x̄ = 3, ȳ = 4

    Numerator = 7
    Denominator = 10 + λ

    If λ = 2:
        B1 = 7 / 12 ≈ 0.583
        B0 = 4 - (7/12 * 3) = 2.25

    Final Ridge line:
        ŷ = 2.25 + 0.583x
"""

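The closed-form slope and intercept described above can be checked numerically. The following is a minimal sketch in plain Python using the example data from the cell above. The helper name `ridge_line` is hypothetical, and the formula assumes a single feature with an intercept that is not penalized.

```python
def ridge_line(xs, ys, lam):
    """Return (intercept, slope) for single-feature ridge regression.

    Assumes the intercept is not penalized, so the slope is
    Sxy / (Sxx + lam) and the intercept is y_bar - slope * x_bar.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 5, 6]

# lam = 0 reproduces ordinary least squares; larger lam shrinks the slope.
for lam in [0, 2, 10, 100]:
    b0, b1 = ridge_line(xs, ys, lam)
    print(f"lambda={lam:>3}: slope={b1:.3f}, intercept={b0:.3f}")
```

As λ grows, the slope moves toward 0 and the intercept moves toward ȳ = 4, so the fitted line flattens toward the mean of y.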

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
housing = fetch_california_housing(as_frame=True)

# 'housing' is a Bunch object (like a dictionary with extra attributes)
# housing.data   -> features (X)
# housing.target -> target values (y)
# housing.frame  -> full pandas DataFrame with both features + target combined
df = housing.frame

# Show first 5 rows of the dataset
df.head()

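|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |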


In [4]:
# Separate the input features (X) from the dataset
# .drop("MedHouseVal", axis=1) removes the target column (house value)
# axis=1 → means drop by column (axis=0 would mean drop by row)
# Result: 'x' contains only independent variables (features) like MedInc, HouseAge, etc.
x = df.drop("MedHouseVal", axis=1)

# Select the target column (dependent variable)
# 'y' will only contain the house prices (MedHouseVal) which we want to predict
y = df["MedHouseVal"]

In [6]:
# test_size=0.2 → 20% of the rows go to the test set, 80% to the training set
# random_state=45 → seeds the shuffle so the same split is reproduced on every run
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=45
)

### **Standard Scaler**
- StandardScaler is a preprocessing step in Machine Learning that puts all features on the same scale before training a model.
- In raw datasets, some features may have very large values (e.g., income = 100,000) while others may have very small values (e.g., age = 20). Because the Ridge penalty acts on coefficient size, and coefficient size depends on each feature's units, unscaled features are penalized unevenly because of their scale, not because they are truly more or less important.
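To make the scale problem concrete, the sketch below compares how strongly the same penalty shrinks a single-feature ridge slope for two made-up features. It relies on the single-feature ridge formula, where the ridge slope equals the ordinary slope multiplied by Sxx / (Sxx + λ). The function names and the data are hypothetical and chosen only for illustration.

```python
import statistics


def ridge_shrink_factor(xs, lam):
    """Ratio of the single-feature ridge slope to the ordinary least-squares slope."""
    x_bar = statistics.fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxx / (sxx + lam)


def standardize(xs):
    """Rescale values to mean 0 and population standard deviation 1."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


# Illustrative (made-up) features on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20_000, 30_000, 40_000, 50_000, 60_000]
lam = 100.0

# On raw data the same lam shrinks the small-scale feature far more.
age_factor = ridge_shrink_factor(ages, lam)
income_factor = ridge_shrink_factor(incomes, lam)
print("raw:   ", age_factor, income_factor)

# After standardization both features are shrunk by the same factor.
age_factor_std = ridge_shrink_factor(standardize(ages), lam)
income_factor_std = ridge_shrink_factor(standardize(incomes), lam)
print("scaled:", age_factor_std, income_factor_std)
```

The value of λ means something different before and after scaling, so it usually needs to be re-tuned. The point here is only that standardized features are treated alike by the penalty.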

In [7]:
"""
StandardScaler Example (Step by Step)

We want to standardize the data:
    X = [10, 20, 30, 40, 50]

Formula:
    z_i = (x_i - μ) / σ
    where μ = mean, σ = standard deviation

-----------------------------------------------------
Step 1: Compute mean (μ)
    μ = (10 + 20 + 30 + 40 + 50) / 5 = 30

Step 2: Compute variance and std (σ)
    (10-30)² = 400
    (20-30)² = 100
    (30-30)² = 0
    (40-30)² = 100
    (50-30)² = 400
    Sum = 1000

    Variance = 1000 / 5 = 200   (population variance: divide by n, as StandardScaler does)
    σ = √200 ≈ 14.142

Step 3: Transform values (z = (x - μ) / σ)
    For 10 → (10-30)/14.142 ≈ -1.414
    For 20 → (20-30)/14.142 ≈ -0.707
    For 30 → (30-30)/14.142 = 0
    For 40 → (40-30)/14.142 ≈ 0.707
    For 50 → (50-30)/14.142 ≈ 1.414

    Transformed data:
    Z ≈ [-1.414, -0.707, 0, 0.707, 1.414]

Step 4: Verify results
    Mean of Z ≈ 0
    Std of Z = 1

-----------------------------------------------------
Summary:
    StandardScaler transforms each feature so that
    it has mean = 0 and std = 1. This ensures all
    features are on the same scale.
"""

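The four steps above can be reproduced in a few lines of plain Python. This is a sketch of the arithmetic only, not of how scikit-learn implements it internally; the variable names are chosen for illustration.

```python
import math

data = [10, 20, 30, 40, 50]

# Step 1: mean
mu = sum(data) / len(data)

# Step 2: population variance (divide by n) and standard deviation
variance = sum((v - mu) ** 2 for v in data) / len(data)
sigma = math.sqrt(variance)

# Step 3: z-score for every value
z = [(v - mu) / sigma for v in data]

# Step 4: verify that the result has mean 0 and standard deviation 1
z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))

print("mean:", mu)
print("std: ", round(sigma, 3))
print("z:   ", [round(v, 3) for v in z])
print("check:", round(z_mean, 6), round(z_std, 6))
```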

In [None]:
# Step 1: Initialize StandardScaler
# StandardScaler will standardize data → mean = 0, std = 1
scaler = StandardScaler()

# Step 2: Fit the scaler on training data and transform it
# - fit(): calculates mean and std from training data
# - transform(): applies scaling using that mean and std
# - fit_transform(): does both together
x_train_scaled = scaler.fit_transform(x_train)

# Step 3: Transform test data using training mean & std
# NOTE: We only use transform here (not fit), because:
#       - Test data must use same scaling as training
#       - If we call fit again, it will leak test info (data leakage)
x_test_scaled = scaler.transform(x_test)

# Step 4: Initialize Ridge Regression with alpha=1.0 (penalty strength)
ridge = Ridge(alpha=1.0)

# Step 5: Train Ridge model on scaled training data
ridge.fit(x_train_scaled, y_train)

# Step 6: Predict target values on scaled test data
y_pred_ridge = ridge.predict(x_test_scaled)

# Step 7: Evaluate model performance using Mean Squared Error
# MSE is the average squared difference between actual and predicted values (0 = perfect)
print("Ridge's Mean Squared Error: ", mean_squared_error(y_test, y_pred_ridge))

# Step 8: Evaluate model performance using R² Score
# R² is the fraction of variance in the test targets explained by the model (1 = perfect)
print("Ridge's R2 Score: ", r2_score(y_test, y_pred_ridge))

Ridge's Mean Squared Error:  0.5209389458537057
Ridge's R2 Score:  0.6074101939965981
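One way to make the "fit on train, transform on test" rule harder to break is to bundle the scaler and the model with scikit-learn's `make_pipeline`. The sketch below is an optional alternative to the manual steps, not the approach used in the cells above. It runs on small synthetic data (an assumption, so that it needs no download) and checks that the pipeline gives the same predictions as the manual route.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with three features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.03, -50.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Manual route: fit the scaler on the training data only, then reuse it on the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
manual_model = Ridge(alpha=1.0).fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)

# Pipeline route: fit() scales and trains on the training data,
# predict() applies the stored training mean and std before predicting.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

print("Same predictions:", np.allclose(manual_pred, pipe_pred))
```

Because the pipeline stores the training mean and standard deviation internally, there is no separate step where the scaler could accidentally be refit on test data.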
