**Supervised Learning 1**

# Linear Regression

## Part 1: Introduction to Linear Regression

Linear regression is a fundamental supervised learning technique that models the relationship between variables. In public policy, it helps us understand how different factors influence outcomes - crucial for evidence-based decision making.

<br>
<br>

## Types of Linear Regression

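1. **Ordinary Least Squares (OLS)** - Minimises the sum of squared errors; normally distributed errors are assumed for classical inference (p-values, confidence intervals), not for the fit itself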
2. **Ridge Regression** - Adds L2 penalty to prevent overfitting, useful with multicollinearity
3. **Lasso Regression** - Adds L1 penalty, can perform feature selection
4. **Elastic Net** - Combines Ridge and Lasso penalties
5. Plus more...

> [📚 Scikit-learn Linear Models Documentation](https://scikit-learn.org/stable/modules/linear_model.html)
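
As a rough illustration of how the four model types listed above are created in scikit-learn, the sketch below fits each one to a small synthetic dataset. The data, the variable names and the penalty strengths (`alpha`, `l1_ratio`) are assumptions chosen for illustration only; they are not taken from the Glasgow dataset used later.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): y depends on the first feature only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=100)

# One estimator per regression type; penalty strengths are arbitrary example values
models = {
    "OLS": linear_model.LinearRegression(),
    "Ridge": linear_model.Ridge(alpha=1.0),
    "Lasso": linear_model.Lasso(alpha=0.1),
    "Elastic Net": linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5),
}

fitted_coefs = {}
for name, estimator in models.items():
    estimator.fit(X_demo, y_demo)            # Same .fit(X, y) interface for every model
    fitted_coefs[name] = estimator.coef_     # One coefficient per feature
    print(f"{name:12s} coefficients: {np.round(estimator.coef_, 3)}")
```

All four share the same `.fit()` / `.predict()` interface, so swapping model types usually means changing only the line that creates the estimator. With L1-penalised models, coefficients on irrelevant features are often shrunk to exactly zero, which is what makes feature selection possible.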

<br>

In this notebook, we'll implement a simple linear regression using OLS and a single feature (explanatory variable). In later notebooks, we'll see models using multiple features and other regression types.

---

#### Setup: Import Libraries

In [1]:
import pandas as pd         # For data manipulation
import altair as alt        # For plotting our results
import numpy as np          # For numerical operations


# Import the linear models module from scikit-learn
from sklearn import linear_model



---

<br>
<br>

## Case Study: The Glasgow Effect

The "Glasgow Effect" refers to the unexplained poor health outcomes in Glasgow compared to other UK cities, even after accounting for deprivation. Let's explore the relationship between deprivation and life expectancy using real data from 61 Glasgow neighbourhoods.

**Dataset Variables:**
- `incomeDeprevation`: Proportion of population experiencing income deprivation
- `employmentDeprivation`: Proportion experiencing employment deprivation  
- `childPoverty`: Child poverty rate
- `femaleLE`, `maleLE`: Life expectancy by gender
- `disabilityRate`: Proportion with disabilities

Load the data:

In [2]:
# Load the Glasgow health data (directly into a pandas dataframe)
data_url = 'https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/charts/extreme/glasgowHealthData.csv'
data = pd.read_csv(data_url)
data.head()

| | areaName | incomeDeprevation | employmentDeprivation | childPoverty | femaleLE | maleLE | disabilityRate |
|---|---|---|---|---|---|---|---|
| 0 | Anniesland, Jordanhill and Whiteinch | 0.14 | 0.15 | 0.14 | 80.8 | 75.8 | 0.19 |
| 1 | Arden and Carnwadric | 0.26 | 0.25 | 0.34 | 76.0 | 72.8 | 0.22 |
| 2 | Baillieston and Garrowhill | 0.12 | 0.12 | 0.14 | 81.6 | 76.0 | 0.21 |
| 3 | Balornock and Barmulloch | 0.29 | 0.27 | 0.38 | 78.2 | 70.8 | 0.30 |
| 4 | Bellahouston, Craigton and Mosspark | 0.20 | 0.18 | 0.22 | 80.5 | 73.9 | 0.29 |


<br>

Explore the dataset:

In [3]:
# Display basic information about our dataset
print(f"Dataset shape: {data.shape[0]} neighbourhoods, {data.shape[1]} variables")
print(f"\nColumns: {list(data.columns)}")

print(f"\nSummary statistics for key variables:")
data[['incomeDeprevation', 'maleLE', 'femaleLE']].describe().round(3)       # [[]] allows us to select specific columns. Then we use .describe() to get the summary statistics and .round(3) to round to 3 decimal places

Dataset shape: 61 neighbourhoods, 7 variables

Columns: ['areaName', 'incomeDeprevation', 'employmentDeprivation', 'childPoverty', 'femaleLE', 'maleLE', 'disabilityRate']

Summary statistics for key variables:


| | incomeDeprevation | maleLE | femaleLE |
|---|---|---|---|
| count | 61.0 | 61.0 | 61.0 |
| mean | 0.213 | 72.646 | 78.51 |
| std | 0.083 | 3.388 | 2.465 |
| min | 0.06 | 66.2 | 73.1 |
| 25% | 0.14 | 70.1 | 76.7 |
| 50% | 0.21 | 72.4 | 78.5 |
| 75% | 0.28 | 75.0 | 80.2 |
| max | 0.38 | 81.7 | 84.3 |


So we have 61 observations (neighbourhoods) and 61 valid values for each of our key variables, so there are no missing values. The median life expectancy across these areas is 72.4 years for males and 78.5 years for females.
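
Reading the `count` row is one way to spot missing values; a more direct check is to count them per column. The sketch below does this on a small hand-typed sample (the first five rows shown earlier), so the medians it prints describe only that sample and will differ from the full-dataset figures quoted above. The names `sample`, `missing_per_column` and `medians` are illustrative.

```python
import pandas as pd

# First five rows of the dataset as displayed earlier (a sample, not the full 61 rows)
sample = pd.DataFrame({
    "areaName": [
        "Anniesland, Jordanhill and Whiteinch",
        "Arden and Carnwadric",
        "Baillieston and Garrowhill",
        "Balornock and Barmulloch",
        "Bellahouston, Craigton and Mosspark",
    ],
    "incomeDeprevation": [0.14, 0.26, 0.12, 0.29, 0.20],
    "maleLE": [75.8, 72.8, 76.0, 70.8, 73.9],
    "femaleLE": [80.8, 76.0, 81.6, 78.2, 80.5],
})

# Count missing values in each column (0 everywhere means no gaps)
missing_per_column = sample.isna().sum()
print(missing_per_column)

# The median is the '50%' row reported by .describe()
medians = sample[["maleLE", "femaleLE"]].median()
print(medians)
```

On the full dataset, the same two calls could be applied to `data` directly.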

<br>
<br>

## Simple Linear Regression (OLS)

Let's start with a simple model: How does income deprivation relate to male life expectancy?

<br>

### Step 1: Explore the Relationship

In [4]:
alt.Chart(data).mark_point(color='rgba(128,0,0,.8)').encode(
    x=alt.X('incomeDeprevation:Q').axis(format='%').title('Income deprivation rate'),
    y=alt.Y('maleLE:Q').scale(zero=False, padding=40).title('Male life expectancy (years)').axis(titleAngle=0, titleAlign='left', titleX=1, titleY=-2),      # This Axis code just moves our y-axis label to the top left corner of the chart, which is nicer to read
    tooltip=['areaName:N', alt.Tooltip('incomeDeprevation:Q', format='.1%'), alt.Tooltip('maleLE:Q', format='.1f')]
)

**What do we see?** There appears to be a strong negative relationship - neighbourhoods with higher income deprivation have lower male life expectancy.

> Our supervised learning problem is to learn a function that approximates this relationship.
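
A visual impression of a "strong negative relationship" can be backed up with a correlation coefficient. The sketch below computes the Pearson correlation on the five sample rows shown earlier; this is an illustration on a tiny subset, so the value will not match the correlation in the full 61-row dataset.

```python
import pandas as pd

# Five sample rows copied from the data preview (not the full dataset)
sample = pd.DataFrame({
    "incomeDeprevation": [0.14, 0.26, 0.12, 0.29, 0.20],
    "maleLE": [75.8, 72.8, 76.0, 70.8, 73.9],
})

# Pearson correlation: -1 is a perfect negative linear relationship, +1 a perfect positive one
r = sample["incomeDeprevation"].corr(sample["maleLE"])
print(f"Pearson r (5-row sample): {r:.3f}")
```

For a regression with a single feature, the square of this correlation equals the R-squared of the fitted line.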

<br>
<br>

### Step 2: Prepare Training Data

We need to extract our set of inputs (income deprivation) and set of outputs (life expectancy).

In machine learning notation:
- **X** (uppercase): Feature matrix (can have multiple columns). E.g. Numpy array, numeric pandas DataFrame or Series
- **y** (lowercase): Target vector (single column of outcomes)

> [📚 Why this notation?](https://scikit-learn.org/stable/getting_started.html); [Glossary of terms](https://scikit-learn.org/stable/glossary.html#term-X)

In [5]:
# Prepare feature matrix X and target vector y
X = data[['incomeDeprevation']].values  # Note: double brackets select a subset of columns and keep the result as a DataFrame. We then convert to a numpy array with .values
y_male = data['maleLE']

print(f"X shape: {X.shape} (61 neighbourhoods x 1 feature)")        # Check our extracted data has the correct shape (61 observations x 1 feature)
print(f"y shape: {y_male.shape} (61 target values)")

X shape: (61, 1) (61 neighbourhoods x 1 feature)
y shape: (61,) (61 target values)
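
The difference between single and double brackets matters because scikit-learn expects `X` to be two-dimensional, even with one feature. The sketch below, using a three-row toy DataFrame typed in by hand, shows the shapes produced by each selection style and one common way to reshape a 1D array into a single-column matrix.

```python
import pandas as pd

# Three rows copied from the data preview, for illustration only
df = pd.DataFrame({
    "incomeDeprevation": [0.14, 0.26, 0.12],
    "maleLE": [75.8, 72.8, 76.0],
})

as_frame = df[["incomeDeprevation"]]   # Double brackets: DataFrame, shape (3, 1)
as_series = df["incomeDeprevation"]    # Single brackets: Series, shape (3,)

X_2d = as_frame.to_numpy()                         # Already 2D
X_reshaped = as_series.to_numpy().reshape(-1, 1)   # 1D -> 2D with one column

print(as_frame.shape, as_series.shape)
print(X_2d.shape, X_reshaped.shape)
```

Passing the 1D version as `X` to `.fit()` would raise an error asking for a 2D array, which is why the double brackets (or a reshape) are needed.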


<br>
<br>

### Step 3: Fit the Model

Estimate the function that best maps our known inputs (`X`) to our known outputs (`y`). This is our *training* step.

In [61]:
# Create and fit the linear regression model
model_male = linear_model.LinearRegression()
model_male.fit(X, y_male)

<br>

**Parameters**: After fitting the model, we can extract its estimated parameters. Which parameters are available depends on the model type; as you might expect, a regression model has coefficients (`coef_`) and an intercept (`intercept_`). (Note: in scikit-learn, attributes estimated during fitting end with a trailing underscore `_`.)

In [62]:
print(model_male.coef_)         # Returned as an array because we could have multiple features (independent variables). In this example, there's only one coefficient, so we could access it with `model_male.coef_[0]`
print(model_male.intercept_)

[-31.00897722]
79.23912187855267
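
To see where these numbers come from, it can help to compute them by hand. With a single feature, OLS has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks this against scikit-learn on the five sample rows shown earlier, so its values will differ from the full-data coefficients printed above.

```python
import numpy as np
from sklearn import linear_model

# Five sample rows copied from the data preview (not the full dataset)
x = np.array([0.14, 0.26, 0.12, 0.29, 0.20])
y = np.array([75.8, 72.8, 76.0, 70.8, 73.9])

# Closed-form OLS for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Same fit with scikit-learn (X must be 2D, hence the reshape)
model = linear_model.LinearRegression().fit(x.reshape(-1, 1), y)

print(f"By hand:      slope={slope:.4f}, intercept={intercept:.4f}")
print(f"scikit-learn: slope={model.coef_[0]:.4f}, intercept={model.intercept_:.4f}")
```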


<br>

**Evaluation**: We can calculate our R-squared value using the `.score(X, y)` method. This gives a sense of how much of the variance in our dependent variable is *explained* by our independent variable.

In [63]:
print(f"R-squared: {model_male.score(X, y_male):.3f}")

R-squared: 0.584
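
R-squared can also be computed directly from its definition: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on the five sample rows shown earlier and compares the result with `.score()`; because it uses only a small sample, the value will not equal the 0.584 reported for the full dataset.

```python
import numpy as np
from sklearn import linear_model

# Five sample rows copied from the data preview (not the full dataset)
X_sample = np.array([[0.14], [0.26], [0.12], [0.29], [0.20]])
y_sample = np.array([75.8, 72.8, 76.0, 70.8, 73.9])

model = linear_model.LinearRegression().fit(X_sample, y_sample)
y_hat = model.predict(X_sample)

ss_res = np.sum((y_sample - y_hat) ** 2)             # Variation left unexplained by the model
ss_tot = np.sum((y_sample - y_sample.mean()) ** 2)   # Total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R-squared:  {r2_manual:.4f}")
print(f".score() R-squared: {model.score(X_sample, y_sample):.4f}")
```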


So in our simple model, income deprivation accounts for about 58% of the variation in male life expectancy across neighbourhoods, which makes it a strong predictor for a single variable.

<br>
<br>

### Step 4: Generate Predictions

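Here we apply our fitted model parameters back to the input data and observe the model's predictions. Strictly speaking this is not a test, because the model has already seen these data during training; a true test would use held-out observations. The in-sample predictions will, however, allow us to add a regression line to our original scatter plot.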

<br>

Call the `.predict(X)` method on our fitted model to generate predictions over our input data `X`.

In [54]:
model_male.predict(X)       # This will return an array of predicted values for each input in `X`. Since we have 61 observations, this will return an array of 61 predicted values.

array([74.89786507, 71.1767878 , 75.51804461, 70.24651848, 73.03732643,
       71.79696734, 73.65750598, 75.51804461, 71.48687757, 74.58777529,
       68.07589008, 77.06849347, 74.89786507, 70.24651848, 74.89786507,
       73.03732643, 73.34741621, 68.07589008, 69.31624917, 72.72723666,
       71.48687757, 73.34741621, 72.72723666, 71.79696734, 69.93642871,
       70.55660826, 70.55660826, 74.58777529, 76.7584037 , 72.72723666,
       77.37858325, 75.51804461, 72.10705712, 70.24651848, 75.20795484,
       72.41714689, 73.96759575, 76.44831393, 72.10705712, 72.72723666,
       67.45571053, 73.96759575, 72.41714689, 72.72723666, 76.7584037 ,
       69.31624917, 70.55660826, 77.06849347, 69.93642871, 68.38597985,
       75.20795484, 76.13822416, 69.31624917, 74.89786507, 69.62633894,
       69.93642871, 73.34741621, 70.86669803, 69.93642871, 71.79696734,
       75.51804461])
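
Each of these predictions is simply the intercept plus the coefficient multiplied by the neighbourhood's income deprivation rate. The sketch below reproduces two of them by hand, using the parameter values printed earlier; the helper function `predict_male_le` is a hypothetical name introduced here for illustration.

```python
# Parameter values copied from the printed model output above
intercept = 79.23912187855267
slope = -31.00897722


def predict_male_le(income_deprivation):
    """Predicted male life expectancy (years) for a deprivation rate given as a proportion."""
    return intercept + slope * income_deprivation


# First neighbourhood in the data has incomeDeprevation = 0.14
first_prediction = predict_male_le(0.14)
print(f"{first_prediction:.4f}")

# Fifth neighbourhood has incomeDeprevation = 0.20
fifth_prediction = predict_male_le(0.20)
print(f"{fifth_prediction:.4f}")
```

These match the first and fifth entries of the array returned by `.predict(X)`, up to rounding of the printed coefficient.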

<br>

To plot these predictions, we'll need to add them back to a dataframe:

In [55]:
## Add the predicted values to the original dataframe, in a new column called 'predicted_maleLE'
data['predicted_maleLE'] = model_male.predict(X)

# OPTIONAL: calculate the residuals, by subtracting the predicted values from the actual values.
data['residual'] = data['maleLE'] - data['predicted_maleLE']

# View our dataframe with the new column
data.head(3)

| | areaName | incomeDeprevation | employmentDeprivation | childPoverty | femaleLE | maleLE | disabilityRate | predicted_maleLE | residual |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Anniesland, Jordanhill and Whiteinch | 0.14 | 0.15 | 0.14 | 80.8 | 75.8 | 0.19 | 74.897865 | 0.902135 |
| 1 | Arden and Carnwadric | 0.26 | 0.25 | 0.34 | 76.0 | 72.8 | 0.22 | 71.176788 | 1.623212 |
| 2 | Baillieston and Garrowhill | 0.12 | 0.12 | 0.14 | 81.6 | 76.0 | 0.21 | 75.518045 | 0.481955 |


> Note: If our model is well-specified, looking at residuals can be a useful way to identify interesting points or routes for further analysis. Which neighbourhoods have a much BETTER or much WORSE outcome than predicted? Why?
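
One way to act on this is to sort neighbourhoods by residual. The sketch below fits a model to the five sample rows shown earlier and ranks them; it is a minimal illustration on a tiny sample, so the ranking says nothing about the full dataset. A positive residual means life expectancy is higher than the model predicts.

```python
import pandas as pd
from sklearn import linear_model

# Five sample rows copied from the data preview (not the full dataset)
sample = pd.DataFrame({
    "areaName": [
        "Anniesland, Jordanhill and Whiteinch",
        "Arden and Carnwadric",
        "Baillieston and Garrowhill",
        "Balornock and Barmulloch",
        "Bellahouston, Craigton and Mosspark",
    ],
    "incomeDeprevation": [0.14, 0.26, 0.12, 0.29, 0.20],
    "maleLE": [75.8, 72.8, 76.0, 70.8, 73.9],
})

X_sample = sample[["incomeDeprevation"]].to_numpy()
model = linear_model.LinearRegression().fit(X_sample, sample["maleLE"])

# Residual = actual minus predicted
sample["residual"] = sample["maleLE"] - model.predict(X_sample)

# Rank from most above the line to most below it
ranked = sample.sort_values("residual", ascending=False)
print(ranked[["areaName", "residual"]].round(3))

best_area = sample.loc[sample["residual"].idxmax(), "areaName"]
worst_area = sample.loc[sample["residual"].idxmin(), "areaName"]
```

On the full dataframe, `data.nlargest(5, 'residual')` and `data.nsmallest(5, 'residual')` would give the same kind of shortlist.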

<br>
<br>
<br>

### Step 5: Visualise Model Fit

Plot our regression results against our actual data points. (Hint: this will be a layered chart, with a points layer and a line layer.)

In [64]:
#  Set our base encoding, just including the x-axis as it will be shared across both layers (no mark type here, as we set these in the points and line layers)
base = alt.Chart(data).encode(
    x = alt.X('incomeDeprevation:Q').axis(format='%').title('Income deprivation rate')
)

# Add our points layer (from `base.` rather than `alt.Chart()`)
points = base.mark_point(size=60, opacity=0.7, color='darkblue').encode(
    y = alt.Y('maleLE:Q').scale(zero=False).title('Male life expectancy (years)'),
    tooltip=[   
                'areaName:N', 
                alt.Tooltip('incomeDeprevation:Q', format='.1%'),
                alt.Tooltip('maleLE:Q', format='.1f'),
                alt.Tooltip('predicted_maleLE:Q', format='.1f', title='Predicted'),
                alt.Tooltip('residual:Q', format='.2f')
            ]
)

# Add our trend line layer (from `base.` rather than `alt.Chart()`)
trend = base.mark_line(color='red', size=2, strokeDash=[10, 5]).encode(
    y = alt.Y('predicted_maleLE:Q')
)

# Combine our layers
points + trend

<br>

Remind ourselves of the model parameters:

In [57]:
# Formatting the output
print("Model equation: y = β₀ + β₁x")
print(f"  Intercept (β₀): {model_male.intercept_:.2f} years")
print(f"  Coefficient (β₁): {model_male.coef_[0]:.2f}")
print(f"  R-squared: {model_male.score(X, y_male):.3f}")

Model equation: y = β₀ + β₁x
  Intercept (β₀): 79.24 years
  Coefficient (β₁): -31.01
  R-squared: 0.584
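
Because income deprivation is stored as a proportion (0 to 1), the coefficient of -31.01 describes the change in predicted life expectancy for a move from 0% to 100% deprivation, which is not a realistic comparison. A more interpretable figure is the effect of a 10 percentage point change, sketched below using the rounded values printed above. The variable names are illustrative.

```python
# Rounded parameter values copied from the printed output above
intercept = 79.24   # Predicted male life expectancy at 0% income deprivation
slope = -31.01      # Change in years per unit (i.e. 100 percentage points) of deprivation

# Effect of a 10 percentage point increase in income deprivation
change_in_percentage_points = 10
effect_years = slope * change_in_percentage_points / 100

print(f"A {change_in_percentage_points} point rise in income deprivation is associated with "
      f"a {effect_years:.2f} year change in predicted male life expectancy")
```

This is an association in the fitted model, not evidence that reducing income deprivation by that amount would cause the corresponding change.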


<br>
<br>

**We've successfully implemented a linear regression model with OLS!** 

<br>
<br>

---

<br>
<br>

### <font color='Green'><strong>Regression Exercise: </strong></font>

The examples above introduce you to performing a linear regression in Python with a simple, single-variable OLS model.

In these exercises, you'll write your own code to perform a regression of **female life expectancy** on income deprivation.

<br>

**EX 1.1** Separate the input and output data

In [7]:
X = ...  # TODO: select the input feature(s)
y = ...  # TODO: select the target variable

<br>

**EX 1.2** Fit the model. 

<br>

**EX 1.3** Calculate fitted values (predictions) on the input data.

<br>

**EX 1.4** Visualise the fitted regression 

<br>

**EX 1.5** Compare the relationship between income deprivation and female life expectancy, with our original analysis on male life-expectancy.

> Hint: You may wish to plot both charts side by side, or compare performance metrics.

<br>
<br>

---

<br>
<br>

## Key takeaways

1. **Simple relationships can be powerful**: Income deprivation alone may explain ~58% of variation in male life expectancy

Where might you take this analysis further?
- Compare the relationship with female life expectancy
- Compare income deprivation to other explanatory characteristics
- Investigate more specific features (i.e. factors that may correlate with income deprivation), to better inform policy responses.

<br>

---