**Supervised Learning 1**

# Linear Regression

## Part 3: Regularised Regression (**Bonus**)

<br>

When features are correlated (multicollinearity), regularisation can improve model stability and prevent overfitting. 

<br>

**The problem: Multicollinearity and Overfitting**
- When predictors are highly correlated (like our deprivation measures), coefficient estimates become unstable
- Small changes in data can lead to large changes in coefficients
- Models might overfit to training data

**A solution: Regularisation**
- Adds a penalty term to the loss function
- *Shrinks* coefficients toward zero
- Trades a small increase in bias for a large decrease in variance
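
To make the instability described above concrete, here is a minimal sketch on synthetic data (not the Glasgow dataset). It assumes two almost identical predictors, refits OLS and Ridge on 200 fresh samples, and compares how much each coefficient moves between fits. The helper `coef_spread` and all the numbers are made up for illustration.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)

def coef_spread(make_model, n_repeats=200, n=60):
    """Refit on fresh noisy samples and return the std of each coefficient."""
    coefs = []
    for _ in range(n_repeats):
        x1 = rng.normal(size=n)
        x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(size=n)           # true coefficients are (1, 1)
        coefs.append(make_model().fit(X, y).coef_)
    return np.std(coefs, axis=0)

ols_spread = coef_spread(linear_model.LinearRegression)
ridge_spread = coef_spread(lambda: linear_model.Ridge(alpha=1.0))

print("OLS coefficient std:  ", ols_spread.round(2))
print("Ridge coefficient std:", ridge_spread.round(2))
```

The Ridge coefficients should vary far less from sample to sample, which is the variance reduction that the penalty buys.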

<br>

**Recap:** Types of Linear Regression

1. **Ordinary Least Squares (OLS)** - Minimises the sum of squared errors; inference (e.g. confidence intervals) assumes normally distributed errors
2. **Ridge Regression** - Adds L2 penalty to prevent overfitting, useful with multicollinearity
3. **Lasso Regression** - Adds L1 penalty, can perform feature selection
4. **Elastic Net** - Combines Ridge and Lasso penalties
5. Plus more...
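
For reference, the objectives these models minimise can be written side by side. This is a sketch following the scaling conventions used in the scikit-learn documentation (other textbooks scale the terms differently); $\alpha$ is the regularisation strength and $\rho$ stands for Elastic Net's `l1_ratio` parameter.

$$
\begin{aligned}
\text{OLS:} \quad & \min_{\beta}\ \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2 \\
\text{Ridge:} \quad & \min_{\beta}\ \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} \beta_j^2 \\
\text{Lasso:} \quad & \min_{\beta}\ \frac{1}{2N} \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} |\beta_j| \\
\text{Elastic Net:} \quad & \min_{\beta}\ \frac{1}{2N} \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2 + \alpha \rho \sum_{j} |\beta_j| + \frac{\alpha (1 - \rho)}{2} \sum_{j} \beta_j^2
\end{aligned}
$$

Because Ridge and Lasso scale the error term differently, the same `alpha` value does not imply the same amount of shrinkage in both models.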

> [📚 Scikit-learn Linear Models Documentation](https://scikit-learn.org/stable/modules/linear_model.html)

<br>

---

#### Setup: Import Libraries

In [1]:
import pandas as pd         # For data manipulation
import altair as alt        # For plotting our results
import numpy as np          # For numerical operations


# Models, preprocessing and metrics
from sklearn import linear_model

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


---

<br>
<br>

(We'll continue with the case study we saw in the previous notebooks)

## Case Study: The Glasgow Effect

The "Glasgow Effect" refers to the unexplained poor health outcomes in Glasgow compared to other UK cities, even after accounting for deprivation. Let's explore relationships between deprivation and life expectancy using real data from 61 Glasgow neighbourhoods.

**Dataset Variables:**
- `incomeDeprevation`: Proportion of population experiencing income deprivation
- `employmentDeprivation`: Proportion experiencing employment deprivation  
- `childPoverty`: Child poverty rate
- `femaleLE`, `maleLE`: Life expectancy by gender
- `disabilityRate`: Proportion with disabilities

**Load the data**

In [2]:
# Load the Glasgow health data (directly into a pandas dataframe)
data_url = 'https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/charts/extreme/glasgowHealthData.csv'
data = pd.read_csv(data_url)
data.head()

| | areaName | incomeDeprevation | employmentDeprivation | childPoverty | femaleLE | maleLE | disabilityRate |
|---|---|---|---|---|---|---|---|
| 0 | Anniesland, Jordanhill and Whiteinch | 0.14 | 0.15 | 0.14 | 80.8 | 75.8 | 0.19 |
| 1 | Arden and Carnwadric | 0.26 | 0.25 | 0.34 | 76.0 | 72.8 | 0.22 |
| 2 | Baillieston and Garrowhill | 0.12 | 0.12 | 0.14 | 81.6 | 76.0 | 0.21 |
| 3 | Balornock and Barmulloch | 0.29 | 0.27 | 0.38 | 78.2 | 70.8 | 0.3 |
| 4 | Bellahouston, Craigton and Mosspark | 0.2 | 0.18 | 0.22 | 80.5 | 73.9 | 0.29 |


<br>
<br>

**Exploratory analysis**

Let's create a correlation matrix to check correlations between features

In [3]:
# First, let's check how correlated our features are
features = ['incomeDeprevation', 'employmentDeprivation', 'childPoverty', 'disabilityRate']
correlation_matrix = data[features].corr()      # Built-in method to calculate a correlation matrix

print("Feature Correlations:")
print(correlation_matrix.round(3))
print("\n*Note*: High correlations (>0.7) between features can cause multicollinearity")
print("This means the features are explaining similar variance in the outcome.")

Feature Correlations:
                       incomeDeprevation  employmentDeprivation  childPoverty  \
incomeDeprevation                  1.000                  0.987         0.901   
employmentDeprivation              0.987                  1.000         0.858   
childPoverty                       0.901                  0.858         1.000   
disabilityRate                     0.856                  0.886         0.654   

                       disabilityRate  
incomeDeprevation               0.856  
employmentDeprivation           0.886  
childPoverty                    0.654  
disabilityRate                  1.000  

*Note*: High correlations (>0.7) between features can cause multicollinearity
This means the features are explaining similar variance in the outcome.
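
The matrix is easier to scan if we list only the pairs above the 0.7 threshold. The sketch below rebuilds the matrix from the rounded values printed above so that it runs on its own; in the notebook you could use `correlation_matrix` directly. The names `upper`, `pairs` and `high_pairs` are just illustrative.

```python
import numpy as np
import pandas as pd

features = ['incomeDeprevation', 'employmentDeprivation', 'childPoverty', 'disabilityRate']

# Values copied from the correlation matrix printed above
corr = pd.DataFrame(
    [[1.000, 0.987, 0.901, 0.856],
     [0.987, 1.000, 0.858, 0.886],
     [0.901, 0.858, 1.000, 0.654],
     [0.856, 0.886, 0.654, 1.000]],
    index=features, columns=features,
)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().reset_index()
pairs.columns = ['feature_a', 'feature_b', 'correlation']

# NaN entries (if any survive stacking) fail the comparison and are filtered out here
high_pairs = pairs[pairs['correlation'].abs() > 0.7].sort_values('correlation', ascending=False)
print(high_pairs.to_string(index=False))
```

Only the child poverty and disability rate pair falls below the threshold, so almost every feature overlaps heavily with the others.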


<br>

**Prep our data:** As before, we need to extract our input and output data.

- **X** (uppercase): Feature matrix (can have multiple columns). E.g. Numpy array, numeric pandas DataFrame or Series
- **y** (lowercase): Target vector (single column of outcomes)

In [4]:
X_multi = data[['incomeDeprevation', 'employmentDeprivation', 'childPoverty', 'disabilityRate']].values
y_male = data['maleLE'].values

<br>
<br>

### **Example: Ridge Regression**

Ridge regression is often the first choice for handling multicollinearity. Let's see how it works:

**How Ridge Works:**
- Adds L2 penalty: minimises (RSS + α × Σβ²)
- Shrinks all coefficients toward zero, more strongly as α increases
- Never sets coefficients exactly to zero
- Good when you believe all features are relevant
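
The penalised objective has a closed-form solution, which can be checked against scikit-learn. This is a minimal sketch on synthetic data (the shapes loosely mimic our 61 neighbourhoods and 4 features, but the values are random). It assumes standardised features and a centred target so that the intercept can be left out of the formula.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(61, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=61)    # make two columns highly correlated
y = 70 - 2 * X[:, 0] - X[:, 2] + rng.normal(size=61)

alpha = 1.0
X_std = StandardScaler().fit_transform(X)
y_centred = y - y.mean()                          # the intercept is not penalised

# Closed form: beta = (X'X + alpha * I)^-1 X'y
n_features = X_std.shape[1]
beta_closed = np.linalg.solve(
    X_std.T @ X_std + alpha * np.eye(n_features),
    X_std.T @ y_centred,
)

beta_sklearn = linear_model.Ridge(alpha=alpha).fit(X_std, y).coef_
print("Closed form:", beta_closed.round(4))
print("sklearn:    ", beta_sklearn.round(4))
```

Adding `alpha` to the diagonal of `X'X` is what keeps the system well conditioned even when columns are nearly collinear.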

<br>

### Step 1. Standardise the features

In this step, we rescale each feature so that it has a mean of 0 and a standard deviation of 1. The shape of each feature's distribution stays the same, but all features are now on a comparable scale. This matters because the penalty acts on the size of the coefficients, and coefficient size depends on the units a feature is measured in.

> Note: when we use models with regularisation, we typically need to *standardise* the values. There are different methods to achieve this, but the general principle is to put each individual feature (i.e. column) on a common scale. Standardising does not make a feature normally distributed.

> `StandardScaler` removes the mean and scales to unit variance.
> See the scikit-learn documentation on [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [5]:
# Standardise features (required for fair regularisation)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_multi)

In [6]:
X_scaled[:5]        # Inspect the first 5 rows of the scaled data

array([[-0.87711595, -0.55766863, -1.42855513, -0.6940697 ],
       [ 0.57220431,  0.72602142,  0.2717739 , -0.14659231],
       [-1.11866933, -0.94277564, -1.42855513, -0.32908477],
       [ 0.93453438,  0.98275943,  0.61183971,  1.31334741],
       [-0.15245582, -0.17256161, -0.74842351,  1.13085494]])
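
To see what the scaler does under the hood, here is a small sketch that standardises by hand and compares the result with `StandardScaler`. It uses the first three rows of `incomeDeprevation` and `disabilityRate` from `data.head()`, so the scaled values differ from the full-data output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three rows of incomeDeprevation and disabilityRate from data.head()
X_small = np.array([[0.14, 0.19],
                    [0.26, 0.22],
                    [0.12, 0.21]])

# z = (x - column mean) / column std; np.std defaults to ddof=0, as StandardScaler does
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)
scaled = StandardScaler().fit_transform(X_small)

print("Same result:", np.allclose(manual, scaled))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds: ", scaled.std(axis=0).round(6))
```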

<br>

### Step 2. Fit the model

When fitting a Ridge regression, we need to specify an `alpha` parameter.

This affects the strength of the regularisation (the `L2` penalty term). If `alpha=0`, the model is equivalent to the ordinary least squares (OLS) linear regression we've seen before. The higher the alpha value, the stronger the regularisation (the more the coefficients are shrunk).

- By default this is set as 1. So we can keep it at this or set it ourselves.
- The idea of finding the **best** parameter value introduces another key step relevant for most machine learning models.
    - New step: Testing multiple parameter values and using best practices / rules of thumb to determine the best value for our case.
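
One common way to choose `alpha` is cross-validation. The sketch below uses scikit-learn's `RidgeCV` inside a pipeline on synthetic data; the candidate alphas, the 5-fold setting and the data itself are assumptions made for illustration, not recommendations for the Glasgow dataset.

```python
import numpy as np
from sklearn import linear_model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 61
base = rng.normal(size=n)
# Four noisy copies of the same underlying signal, i.e. highly correlated columns
X = np.column_stack([base + 0.1 * rng.normal(size=n) for _ in range(4)])
y = 72 - 1.5 * base + rng.normal(size=n)

candidate_alphas = [0.1, 1.0, 10.0, 100.0]

# Scaling the full data here is a simplification; the cross-validation runs inside RidgeCV
pipe = make_pipeline(
    StandardScaler(),
    linear_model.RidgeCV(alphas=candidate_alphas, cv=5),
)
pipe.fit(X, y)

print("Selected alpha:", pipe[-1].alpha_)
```

With only 61 rows the selected value can change with the random split, so treat it as a guide rather than a definitive answer.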

> 📕 See the scikit-learn docs on [Ridge regressions](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)

Instantiate the model, fit it, and make predictions

In [7]:
# 1. Instantiate the model
model = linear_model.Ridge(alpha=0.5)

# 2. Fit the model
model.fit(X_scaled, y_male)

# 3. Make predictions
y_pred = model.predict(X_scaled)

In [None]:
# Analyse the results
print(f"R-squared score: {model.score(X_scaled, y_male):.3f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")


# View the coefficients for each feature
for i, feature in enumerate(features):
    print(f"{feature}: {model.coef_[i]:.3f}")

# View the intercept
print(f"Intercept: {model.intercept_}")


R-squared score: 0.585
Coefficients: [-0.48951676 -0.44989148 -0.75509423 -0.1768346 ]
Intercept: 72.64590163934426
incomeDeprevation: -0.490
employmentDeprivation: -0.450
childPoverty: -0.755
disabilityRate: -0.177
Intercept: 72.64590163934426


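> Note: the values printed above match the `alpha=100` row of the table in the next section, not `alpha=0.5`. This cell was most likely re-run after a later cell had refitted `model`. Running the cells in order with `alpha=0.5` should give an R² between the `alpha=0.1` and `alpha=1.0` rows (about 0.77).

<br>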
<br>
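
Because the model is fitted on standardised features, each coefficient is the change in life expectancy per one standard deviation of that feature, not per unit. The sketch below shows one way to convert back to the original units using the scaler's `mean_` and `scale_` attributes. The data is synthetic and the names `coef_original` and `intercept_original` are illustrative.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.uniform(0.05, 0.45, size=(61, 2))    # proportions, like the deprivation rates
y = 80 - 15 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=1.0, size=61)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = linear_model.Ridge(alpha=0.5).fit(X_scaled, y)

# Coefficients per one-unit change in the ORIGINAL features
coef_original = model.coef_ / scaler.scale_
intercept_original = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)

# Both parameterisations give identical predictions
pred_scaled = model.predict(X_scaled)
pred_original = intercept_original + X @ coef_original
print("Predictions match:", np.allclose(pred_scaled, pred_original))
print("Per-unit coefficients:", coef_original.round(2))
```

A per-unit coefficient here means a change from 0 to 1 in a proportion, so divide by 10 to read it as the effect of a 10 percentage point change.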

### Bonus: Looping through alphas 

Generally, we'll want to try multiple regularisation strength (alpha) values. We can do this with a loop, saving the results to a dataframe on each iteration.

In [18]:

# Fit OLS for comparison
ols_model = linear_model.LinearRegression()
ols_model.fit(X_scaled, y_male)

# Fit Ridge with different regularisation strengths
alphas = [0, 0.1, 1.0, 10.0, 100.0]
ridge_results = []

for alpha in alphas:
    if alpha == 0:
        # OLS (no regularisation)
        model = linear_model.LinearRegression()
    else:
        model = linear_model.Ridge(alpha=alpha)
    
    model.fit(X_scaled, y_male)
    
    ridge_results.append({
        'alpha': alpha,
        'r2_score': model.score(X_scaled, y_male),
        'income_coef': model.coef_[0],
        'employment_coef': model.coef_[1], 
        'child_poverty_coef': model.coef_[2],
        'coef_sum': np.abs(model.coef_).sum()  # Total coefficient magnitude (all four features)
    })

ridge_df = pd.DataFrame(ridge_results)

ridge_df

| | alpha | r2_score | income_coef | employment_coef | child_poverty_coef | coef_sum |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.776704 | 2.633796 | -2.536875 | -3.429797 | 9.057865 |
| 1 | 0.1 | 0.77637 | 2.134101 | -2.127187 | -3.321544 | 8.033603 |
| 2 | 1.0 | 0.770566 | 0.546348 | -0.934045 | -2.87697 | 4.813086 |
| 3 | 10.0 | 0.742826 | -0.523563 | -0.518655 | -1.920869 | 3.294568 |
| 4 | 100.0 | 0.58546 | -0.489517 | -0.449891 | -0.755094 | 1.871337 |
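
The R² values in this table are measured on the same data the model was fitted to, so they can only go down as `alpha` increases. To judge whether regularisation actually helps, we would compare scores on data the model has not seen. A minimal sketch on synthetic data, assuming a simple 70/30 train/test split (the split size and the data are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 120
base = rng.normal(size=n)
# Four highly correlated columns built from one underlying signal
X = np.column_stack([base + 0.2 * rng.normal(size=n) for _ in range(4)])
y = 72 - 1.5 * base + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rows = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    # The scaler is fitted on the training rows only, then applied to the test rows
    pipe = make_pipeline(StandardScaler(), linear_model.Ridge(alpha=alpha))
    pipe.fit(X_train, y_train)
    rows.append({
        'alpha': alpha,
        'train_r2': pipe.score(X_train, y_train),
        'test_r2': pipe.score(X_test, y_test),
    })

split_results = pd.DataFrame(rows)
print(split_results.round(3))
```

The `train_r2` column falls as `alpha` grows, while `test_r2` is the column to watch when picking a value.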


<br>
<br>
<br>

### Bonus: Comparing regularisation methods

Now let's compare Ridge, Lasso, and Elastic Net to understand their different behaviours:

**Key Differences:**
- **Ridge (L2)**: Shrinks all coefficients toward zero, keeps all features
- **Lasso (L1)**: Can set coefficients to exactly zero, performs feature selection
- **Elastic Net**: Combines L1 and L2, balances feature selection with stability
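
The difference between shrinking and selecting is easiest to see when some features are known to be irrelevant. The sketch below uses synthetic data in which only the first two of five features affect the target, then counts how many coefficients each model sets to exactly zero. The alpha values and `l1_ratio=0.5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the last three are pure noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_scaled = StandardScaler().fit_transform(X)

ridge = linear_model.Ridge(alpha=1.0).fit(X_scaled, y)
lasso = linear_model.Lasso(alpha=0.5).fit(X_scaled, y)
enet = linear_model.ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_scaled, y)

for name, m in [('Ridge', ridge), ('Lasso', lasso), ('ElasticNet', enet)]:
    n_zero = int(np.sum(m.coef_ == 0))
    print(f"{name:<10} zero coefficients: {n_zero}  coefs: {m.coef_.round(2)}")
```

Ridge keeps small non-zero weights on the noise features, whereas Lasso removes them entirely at the cost of also shrinking the two real effects.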


<br>

In [22]:
# Standardise features for regularised models
scaler = StandardScaler()
# X_scaled = X_multi.copy()
X_scaled = scaler.fit_transform(X_multi)

# Compare different regression types
models = {
    'OLS': linear_model.LinearRegression(),
    'Ridge (α=1.0)': linear_model.Ridge(alpha=1.0),
    'Lasso (α=0.1)': linear_model.Lasso(alpha=0.1),
    'ElasticNet (α=0.5)': linear_model.ElasticNet(alpha=0.5)
}

results = []
for name, model in models.items():
    model.fit(X_scaled, y_male)
    results.append({
        'Model': name,
        'R²': model.score(X_scaled, y_male),
        'Income Coef': model.coef_[0],
        'Employment Coef': model.coef_[1],
        'Child Poverty Coef': model.coef_[2]
    })

results_df = pd.DataFrame(results)
results_df

| | Model | R² | Income Coef | Employment Coef | Child Poverty Coef |
|---|---|---|---|---|---|
| 0 | OLS | 0.776704 | 2.633796 | -2.536875 | -3.429797 |
| 1 | Ridge (α=1.0) | 0.770566 | 0.546348 | -0.934045 | -2.87697 |
| 2 | Lasso (α=0.1) | 0.762128 | -0.0 | -0.0 | -2.835505 |
| 3 | ElasticNet (α=0.5) | 0.705737 | -0.429191 | -0.337789 | -1.607309 |


<br>

Notice that Lasso has set the coefficients for income and employment to exactly zero. Both features are strongly correlated with child poverty (0.90 and 0.86), so they add little once child poverty is in the model and Lasso drops them. This does not mean they are unrelated to life expectancy.

<br>
<br>

**Visualisation:** compare the coefficients across each model

In [20]:
# Transform the dataframe to long format
results_df_long = results_df.melt(id_vars=['Model', 'R²'], var_name='Feature', value_name='Coefficient')
results_df_long.head()

| | Model | R² | Feature | Coefficient |
|---|---|---|---|---|
| 0 | OLS | 0.776704 | Income Coef | 2.633796 |
| 1 | Ridge (α=1.0) | 0.770566 | Income Coef | 0.546348 |
| 2 | Lasso (α=0.1) | 0.762128 | Income Coef | -0.0 |
| 3 | ElasticNet (α=0.5) | 0.705737 | Income Coef | -0.429191 |
| 4 | OLS | 0.776704 | Employment Coef | -2.536875 |


In [21]:
# Visualise coefficient shrinkage
coef_chart = alt.Chart(results_df_long).mark_bar().encode(
    x=alt.X('Model:N').sort(list(models.keys())),
    y='Coefficient:Q',
    color='Feature:N',
    column='Feature:N'
).properties(
    width=150,
    height=200
)
print('\nCoefficient values across regression types:')
coef_chart.display()


Coefficient values across regression types:


<br>
<br>

---

<br>
<br>

## Key takeaways (from notebooks 1-3)

1. **Simple relationships can be powerful**: Income deprivation alone explains ~58% of variation in male life expectancy

2. **Multiple factors matter**: Adding employment and child poverty improves the model, but with diminishing returns

3. **Regularisation prevents overfitting**: When working with correlated predictors (common in policy data), Ridge or Lasso regression can provide more stable estimates

4. **Policy implications**:
   - Income support programs could have substantial health impacts
   - A 10 percentage point reduction in income deprivation is associated with ~1.5 years increased life expectancy
   - Some neighbourhoods outperform predictions - understanding why could inform best practices

---

#### Useful evaluation metrics:

$$\text{RMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=0}^{N - 1} (y_i - \hat{y}_i)^2}{N}}$$
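
The formula translates directly into NumPy. In the sketch below, `y_true` holds the first five `maleLE` values from `data.head()`, while `y_hat` contains made-up predictions purely for illustration. The result is cross-checked against scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First five maleLE values from data.head(); the predictions are invented for this example
y_true = np.array([75.8, 72.8, 76.0, 70.8, 73.9])
y_hat = np.array([75.0, 73.5, 76.4, 71.9, 73.0])

# Square the errors, average them, then take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(f"RMSE (manual):  {rmse_manual:.3f} years")
print(f"RMSE (sklearn): {rmse_sklearn:.3f} years")
```

RMSE is in the same units as the target, so here it reads as a typical prediction error in years of life expectancy.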

---