# L1 and L2 Regularization: Mathematical Foundations and Practical Implementation

## Mathematical Formulation

### L1 Regularization (Lasso)

#### Objective Function
$\min_w \left[ \text{Loss}(w) + \lambda \sum_{j=1}^p |w_j| \right]$

Where:
- $w_j$: Model coefficients
- $\lambda$: Regularization strength
- $\text{Loss}(w)$: Original model loss function

#### Key Properties
1. **Penalty Calculation**:
   $L1\_\text{penalty} = \lambda \sum_{j=1}^p |w_j|$

2. **Geometric Interpretation**:
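   - Constraint region $\sum_j |w_j| \le t$ is a diamond in 2D (an octahedron in 3D, a cross-polytope in general)
   - Loss contours tend to first touch this region at a corner or edge, where one or more coordinates are exactly zero
   - As a result, sufficiently large $\lambda$ sets some coefficients to exactly zero, giving sparse solutions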
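
The exact zeros can be seen in a simplified special case. The sketch below assumes orthonormal features ($X^\top X = I$) and a residual-sum-of-squares loss, in which case the L1-penalized solution is the ordinary least-squares solution passed through a soft-thresholding operator with threshold $\lambda / 2$. The helper name `soft_threshold` and the coefficient values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries with |z| <= t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Hypothetical least-squares coefficients (illustrative values only).
w_ols = np.array([3.0, -0.4, 0.2, -2.5])

lam = 1.0                      # regularization strength
w_l1 = soft_threshold(w_ols, lam / 2.0)

print("OLS coefficients:", w_ols)
print("L1 coefficients: ", w_l1)
print("Exact zeros:     ", int(np.sum(w_l1 == 0)))
```

Coefficients whose magnitude is below the threshold are removed entirely, while the larger ones are shifted toward zero by a constant amount.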

### L2 Regularization (Ridge)

#### Objective Function
$\min_w \left[ \text{Loss}(w) + \lambda \sum_{j=1}^p w_j^2 \right]$

Where:
- $w_j$: Model coefficients
- $\lambda$: Regularization strength
- $\text{Loss}(w)$: Original model loss function

#### Key Properties
1. **Penalty Calculation**:
   $L2\_\text{penalty} = \lambda \sum_{j=1}^p w_j^2$

2. **Geometric Interpretation**:
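   - Constraint region $\sum_j w_j^2 \le t$ is a disk in 2D (a ball in higher dimensions)
   - Shrinks all coefficients toward zero, with the strongest shrinkage along low-variance directions of the data
   - In general, does not set coefficients to exactly zero for any finite $\lambda$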
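
For a residual-sum-of-squares loss, the L2-penalized problem has the closed-form solution $w = (X^\top X + \lambda I)^{-1} X^\top y$. The following is a minimal NumPy sketch of that formula on synthetic data; the function name `ridge_closed_form`, the data, and the $\lambda$ grid are assumptions chosen for illustration, and no intercept is fitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, 4 features, one true coefficient equal to zero.
X = rng.normal(size=(50, 4))
w_true = np.array([2.0, 0.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_closed_form(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:6.1f}  ||w||_2={norms[-1]:.4f}  exact zeros={int(np.sum(w == 0))}")
```

The coefficient norm decreases as $\lambda$ grows, but the coefficients stay nonzero, including the one whose true value is zero.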

## Comprehensive Python Implementation

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

class RegularizationComparison:
    def __init__(self, n_features=20, n_informative=5, noise=0.1):
        # Generate synthetic data
        X, y = make_regression(
            n_samples=200, 
            n_features=n_features, 
            n_informative=n_informative, 
            noise=noise, 
            random_state=42
        )
        
        # Split data
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        
        # Scale features
        self.scaler = StandardScaler()
        self.X_train_scaled = self.scaler.fit_transform(self.X_train)
        self.X_test_scaled = self.scaler.transform(self.X_test)
    
    def compare_regularization(self, alphas=(0.01, 0.1, 1, 10)):
        results = {
            'Lasso': [],
            'Ridge': []
        }
        
        for alpha in alphas:
            # Lasso Regression
            lasso = Lasso(alpha=alpha, random_state=42)
            lasso.fit(self.X_train_scaled, self.y_train)
            results['Lasso'].append({
                'alpha': alpha,
                'coefficients': lasso.coef_,
                'score': lasso.score(self.X_test_scaled, self.y_test)
            })
            
            # Ridge Regression
            ridge = Ridge(alpha=alpha, random_state=42)
            ridge.fit(self.X_train_scaled, self.y_train)
            results['Ridge'].append({
                'alpha': alpha,
                'coefficients': ridge.coef_,
                'score': ridge.score(self.X_test_scaled, self.y_test)
            })
        
        return results

# Visualization
def plot_coefficient_paths(results):
    plt.figure(figsize=(12, 5))
    
    # Lasso Coefficients
    plt.subplot(1, 2, 1)
    lasso_coeffs = [res['coefficients'] for res in results['Lasso']]
    plt.title('Lasso Coefficient Paths')
    plt.xlabel('Regularization Strength, log(λ)')
    plt.ylabel('Coefficient Value')
    plt.plot(np.log([r['alpha'] for r in results['Lasso']]), 
             lasso_coeffs, marker='o')
    
    # Ridge Coefficients
    plt.subplot(1, 2, 2)
    ridge_coeffs = [res['coefficients'] for res in results['Ridge']]
    plt.title('Ridge Coefficient Paths')
    plt.xlabel('Regularization Strength, log(λ)')
    plt.ylabel('Coefficient Value')
    plt.plot(np.log([r['alpha'] for r in results['Ridge']]), 
             ridge_coeffs, marker='o')
    
    plt.tight_layout()
    plt.show()

# Demonstration
comparison = RegularizationComparison()
results = comparison.compare_regularization()
plot_coefficient_paths(results)
```

## Mathematical Insights

### Regularization Strength Analysis

#### L1 Regularization Loss Function
$J(w) = \text{MSE} + \lambda \sum_{j=1}^p |w_j|$

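#### L2 Regularization Loss Function
$J(w) = \text{MSE} + \lambda \sum_{j=1}^p w_j^2$

Note that scikit-learn scales the two objectives differently: `Lasso` minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \|w\|_1$, while `Ridge` minimizes $\|y - Xw\|_2^2 + \alpha \|w\|_2^2$. The same `alpha` value therefore does not represent the same regularization strength in both models.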

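### Sparsity Behavior
- **L1 Regularization**: 
  The number of coefficients that are exactly zero generally grows as $\lambda$ increases, and above a finite value of $\lambda$ all coefficients are zero. This is a deterministic property of the solution, not a probability.
- **L2 Regularization**: 
  Coefficient magnitudes shrink toward zero as $\lambda$ increases, with $\|w\|_2 \to 0$ only as $\lambda \to \infty$. In general, no coefficient becomes exactly zero at finite $\lambda$.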
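
How sparsity depends on $\lambda$ can be checked empirically by counting exact zeros. The sketch below fits scikit-learn's `Lasso` and `Ridge` over a small grid of `alpha` values on synthetic data; the dataset parameters and the grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

alphas = [0.01, 0.1, 1.0, 10.0]
lasso_zeros, ridge_zeros = [], []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Count coefficients that are exactly zero (not merely small).
    lasso_zeros.append(int(np.sum(lasso.coef_ == 0)))
    ridge_zeros.append(int(np.sum(ridge.coef_ == 0)))
    print(f"alpha={alpha:5.2f}  Lasso zeros={lasso_zeros[-1]:2d}  Ridge zeros={ridge_zeros[-1]:2d}")
```

On this data the Lasso zero count rises with `alpha`, while Ridge keeps every coefficient nonzero.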

## Practical Guidelines

### When to Use L1 (Lasso)
- High-dimensional datasets
- Feature selection required
- Sparse model desired
- Many irrelevant features

### When to Use L2 (Ridge)
- All features potentially relevant
- Multicollinearity present
- Smooth coefficient reduction needed
- Stable predictions desired

### Hybrid Approach: Elastic Net
Combines L1 and L2 penalties:
$\text{Loss} + \lambda_1 \sum |w_j| + \lambda_2 \sum w_j^2$
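
In scikit-learn, `ElasticNet` is parameterized by `alpha` and `l1_ratio` rather than by $\lambda_1$ and $\lambda_2$. Its objective is $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha \cdot \text{l1\_ratio} \cdot \|w\|_1 + \frac{1}{2}\alpha (1 - \text{l1\_ratio}) \|w\|_2^2$. The sketch below converts between the two parameterizations, assuming the loss term is that $\frac{1}{2n}$-scaled squared error; the helper `to_sklearn_params` and the data are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

def to_sklearn_params(lam1, lam2):
    """Map (lambda_1, lambda_2) to scikit-learn's (alpha, l1_ratio).

    Derived from lam1 = alpha * l1_ratio and lam2 = 0.5 * alpha * (1 - l1_ratio).
    Requires lam1 + 2 * lam2 > 0.
    """
    alpha = lam1 + 2.0 * lam2
    l1_ratio = lam1 / alpha
    return alpha, l1_ratio

X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

alpha, l1_ratio = to_sklearn_params(lam1=1.0, lam2=0.5)
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)

print(f"alpha={alpha}, l1_ratio={l1_ratio}")
print("Nonzero coefficients:", int((model.coef_ != 0).sum()))
```

Setting `l1_ratio=1.0` recovers the Lasso penalty. For a pure L2 penalty, scikit-learn recommends `Ridge` rather than `ElasticNet` with `l1_ratio=0`.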

## Key Takeaways
1. L1 creates sparse models
2. L2 shrinks all coefficients toward zero without eliminating any
3. Regularization strength is crucial
4. Always cross-validate to find optimal parameters
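
One way to cross-validate the regularization strength is with scikit-learn's `LassoCV` and `RidgeCV`, which search a grid of `alpha` values internally. This is a minimal sketch on synthetic data; the grid, the fold count, and the dataset are assumptions, and a real project would substitute its own data and ranges.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Candidate strengths on a log scale (illustrative range).
alphas = np.logspace(-3, 2, 30)

# Scaling lives inside the pipeline, so it is fitted on whatever data the
# pipeline is fitted on and reapplied automatically at prediction time.
lasso_cv = make_pipeline(
    StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10000, random_state=0)
).fit(X, y)
ridge_cv = make_pipeline(
    StandardScaler(), RidgeCV(alphas=alphas, cv=5)
).fit(X, y)

best_lasso_alpha = lasso_cv[-1].alpha_
best_ridge_alpha = ridge_cv[-1].alpha_
print(f"Best Lasso alpha: {best_lasso_alpha:.4f}")
print(f"Best Ridge alpha: {best_ridge_alpha:.4f}")
```

Because the two estimators scale their loss terms differently, the selected `alpha` values are not directly comparable between Lasso and Ridge.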

**Pro Tip**: Experiment with different regularization techniques and strengths to find the best model performance!