**Supervised Learning 1**

# Linear Regression

## Part 1: Introduction to Linear Regression - BONUS CHARTS


<br>

In this notebook, we'll take the single-variable regression results from LinReg1 and create a few more advanced charts showing the results. We will show:
- Generating sample input data x so we can generate predictions over any x-range (not just between our minimum and maximum input values).
- Calculating an approximate error band (a rough stand-in for confidence intervals) and building an advanced chart with layers for our actual data points, predicted line, and error band.

<br>

---

#### Setup: Import Libraries

In [2]:
import pandas as pd         # For data manipulation
import altair as alt        # For plotting our results
import numpy as np          # For numerical operations


# Import the linear model module from sklearn
from sklearn import linear_model



---

<br>
<br>

**Continuing the `Glasgow Effect` case study:**

1. Import Glasgow data
2. Separate into input and output data
3. Fit the model
4. Calculate fitted values (predictions) on the input data.
5. Plot simple chart





**1. Load the data:**

In [10]:
# Load the Glasgow health data (directly into a pandas dataframe)
data_url = 'https://raw.githubusercontent.com/RDeconomist/RDeconomist.github.io/main/charts/extreme/glasgowHealthData.csv'
data = pd.read_csv(data_url)
data.head()

| | areaName | incomeDeprevation | employmentDeprivation | childPoverty | femaleLE | maleLE | disabilityRate |
|---|---|---|---|---|---|---|---|
| 0 | Anniesland, Jordanhill and Whiteinch | 0.14 | 0.15 | 0.14 | 80.8 | 75.8 | 0.19 |
| 1 | Arden and Carnwadric | 0.26 | 0.25 | 0.34 | 76.0 | 72.8 | 0.22 |
| 2 | Baillieston and Garrowhill | 0.12 | 0.12 | 0.14 | 81.6 | 76.0 | 0.21 |
| 3 | Balornock and Barmulloch | 0.29 | 0.27 | 0.38 | 78.2 | 70.8 | 0.3 |
| 4 | Bellahouston, Craigton and Mosspark | 0.2 | 0.18 | 0.22 | 80.5 | 73.9 | 0.29 |


<br>

**2. Split the input and output data**

In [6]:
# Prepare feature matrix X and target vector y
X = data[['incomeDeprevation']].values  # Note: double brackets select a subset of columns and keep the result as a DataFrame. We then convert it to a 2D numpy array with .values
y_male = data['maleLE']
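
To see why the double brackets matter, here is a minimal sketch on a tiny hypothetical DataFrame (three made-up rows, not the full Glasgow data). It compares the array shapes produced by double and single brackets; sklearn expects the 2D form for its inputs.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the Glasgow data
df = pd.DataFrame({
    "incomeDeprevation": [0.14, 0.26, 0.12],
    "maleLE": [75.8, 72.8, 76.0],
})

X_2d = df[["incomeDeprevation"]].values   # double brackets -> DataFrame -> 2D array
x_1d = df["incomeDeprevation"].values     # single brackets -> Series -> 1D array

print(X_2d.shape)  # (3, 1): three rows, one feature column
print(x_1d.shape)  # (3,): a flat vector
```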

<br>

**3. Fit the Model**

In [7]:
# Create and fit the linear regression model
model_male = linear_model.LinearRegression()
model_male.fit(X, y_male)

LinearRegression()

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |
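
As a rough check on what `fit()` computes, the sketch below fits a model on hypothetical toy data that lies exactly on the line y = 80 - 30x, then compares `coef_` and `intercept_` with the textbook single-variable least-squares formulas. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data lying exactly on y = 80 - 30x
x = np.array([0.1, 0.2, 0.3, 0.4])
y = 80 - 30 * x

model = linear_model.LinearRegression().fit(x.reshape(-1, 1), y)

# Closed-form least squares for one variable:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(model.coef_[0], model.intercept_)   # fitted by sklearn
print(slope, intercept)                   # computed by hand
```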


<br>

**4. Generate Predictions**

In [11]:
## Add the predicted values to the original dataframe, in a new column called 'predicted_maleLE'
data['predicted_maleLE'] = model_male.predict(X)

# OPTIONAL: calculate the residuals, by subtracting the predicted values from the actual values.
data['residual'] = data['maleLE'] - data['predicted_maleLE']

# View our dataframe with the new column
data.head(3)

| | areaName | incomeDeprevation | employmentDeprivation | childPoverty | femaleLE | maleLE | disabilityRate | predicted_maleLE | residual |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Anniesland, Jordanhill and Whiteinch | 0.14 | 0.15 | 0.14 | 80.8 | 75.8 | 0.19 | 74.897865 | 0.902135 |
| 1 | Arden and Carnwadric | 0.26 | 0.25 | 0.34 | 76.0 | 72.8 | 0.22 | 71.176788 | 1.623212 |
| 2 | Baillieston and Garrowhill | 0.12 | 0.12 | 0.14 | 81.6 | 76.0 | 0.21 | 75.518045 | 0.481955 |
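
The predictions and residuals above can also be reproduced by hand. This is a small sketch on hypothetical noisy data (generated with a fixed seed, not the Glasgow values): it rebuilds the predictions as intercept + slope × x, and checks that the residuals of a least-squares fit with an intercept sum to approximately zero.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical noisy data, loosely shaped like the case study
rng = np.random.default_rng(0)
x = rng.uniform(0.05, 0.4, size=50)
y = 79 - 31 * x + rng.normal(0, 1.5, size=50)
X = x.reshape(-1, 1)

model = linear_model.LinearRegression().fit(X, y)

# A prediction is just intercept + slope * x
manual_pred = model.intercept_ + model.coef_[0] * x

# Residual = actual - predicted
residuals = y - model.predict(X)

print(np.allclose(manual_pred, model.predict(X)))
print(np.isclose(residuals.sum(), 0.0, atol=1e-8))
```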


<br>
<br>
<br>

**5. Basic Visualisation**

Plot our regression results against our actual data points. (Hint: this will be a layered chart, with a points layer and a line layer.)

In [9]:
#  Set our base encoding, just including the x-axis as it will be shared across both layers (no mark type here, as we set these in the points and line layers)
base = alt.Chart(data).encode(
    x = alt.X('incomeDeprevation:Q').axis(format='%').title('Income deprivation rate')
)

# Add our points layer (from `base.` rather than `alt.Chart()`)
points = base.mark_point(size=60, opacity=0.7, color='darkblue').encode(
    y = alt.Y('maleLE:Q').scale(zero=False).title('Male life expectancy (years)')
)

# Add our trend line layer (from `base.` rather than `alt.Chart()`)
trend = base.mark_line(color='red', size=2, strokeDash=[10, 5]).encode(
    y = alt.Y('predicted_maleLE:Q')
)

# Combine our layers
points + trend

(Up until here, all the same as in the previous notebook example.)

<br>


<br>
<br>

---

<br>
<br>

## **Bonus 1:** Generate predictions across full theoretical range

- In *Step 4* above, we made predictions over our input data. But this means our prediction is constrained by the minimum and maximum of our data (notice that our regression line doesn't extend past our $x_{min}$ and $x_{max}$).
- What if we want to generate predictions across a **full theoretical range**?

<br>

We can use NumPy's `linspace()` function to generate an array of input data, evenly spaced over a set interval. (See the function documentation [here](https://numpy.org/devdocs//reference/generated/numpy.linspace.html).)
- `numpy.linspace(start, stop, num=50)`: with start=0 and stop=100, num=50 sets the spacing so that the returned array has 50 values, including both endpoints:
    - Array [0, 2.04081633, 4.08163265, 6.12244898, ... , 97.95918367, 100]
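
A quick sketch to confirm the spacing described above: with 50 values from 0 to 100 inclusive there are 49 gaps, so each step is 100 / 49.

```python
import numpy as np

values = np.linspace(0, 100, num=50)

print(len(values))    # 50 values in total
print(values[:3])     # first few values
print(values[-1])     # the stop value is included

# 50 points means 49 equal gaps between 0 and 100
step = values[1] - values[0]
print(step)
```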

<br>

Our input data is income deprivation, with a possible range of 0 to 1 (0-100%), so we could predict over that whole range. Our observed minimum and maximum values are 6% and 38%, so extending the range to 0-50% is a reasonable choice.

In [13]:
## Create evenly spaced x-values from 0% to 50% deprivation. We use 30 points, but any number works: a linear model gives a straight line regardless of how many points we predict at.
x_range = np.linspace(0, 0.5, 30).reshape(-1, 1)  # reshape for sklearn (it expects a 2D array, i.e. a matrix with one column)

## Generate predictions for our sample input x-values
y_pred_range = model_male.predict(x_range)
y_pred_range

array([79.23912188, 78.70448434, 78.1698468 , 77.63520926, 77.10057173,
       76.56593419, 76.03129665, 75.49665911, 74.96202157, 74.42738403,
       73.8927465 , 73.35810896, 72.82347142, 72.28883388, 71.75419634,
       71.2195588 , 70.68492127, 70.15028373, 69.61564619, 69.08100865,
       68.54637111, 68.01173357, 67.47709604, 66.9424585 , 66.40782096,
       65.87318342, 65.33854588, 64.80390834, 64.26927081, 63.73463327])
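
To illustrate why the number of points doesn't matter for a linear model, here is a minimal sketch on a hypothetical three-point fit (made-up values, not the Glasgow model). It predicts at just the two endpoints, interpolates between them, and compares the result with predictions at all 30 points.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical small dataset, used only to get a fitted line
X = np.array([[0.1], [0.2], [0.3]])
y = np.array([76.0, 73.2, 69.9])
model = linear_model.LinearRegression().fit(X, y)

x_two = np.linspace(0, 0.5, 2).reshape(-1, 1)     # just the two endpoints
x_many = np.linspace(0, 0.5, 30).reshape(-1, 1)   # 30 points, as above

y_two = model.predict(x_two)
y_many = model.predict(x_many)

# Straight-line interpolation between the endpoints reproduces all 30 predictions
y_interp = np.interp(x_many.ravel(), x_two.ravel(), y_two)
print(np.allclose(y_interp, y_many))
```

More points only become necessary when the fitted curve is not a straight line.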

In [14]:
# Create dataframe for plotting
trend_data = pd.DataFrame({
    'incomeDeprevation': x_range.flatten(),     # Flatten the array to make it a 1-dimensional array (so it can be added as a column to our dataframe)
    'predicted_maleLE': y_pred_range
})

print(f"Generated {len(trend_data)} points for smooth trend line")

trend_data.head()

Generated 30 points for smooth trend line


| | incomeDeprevation | predicted_maleLE |
|---|---|---|
| 0 | 0.0 | 79.239122 |
| 1 | 0.017241 | 78.704484 |
| 2 | 0.034483 | 78.169847 |
| 3 | 0.051724 | 77.635209 |
| 4 | 0.068966 | 77.100572 |


Now that we've created a new DataFrame with our sample input data, let's move on to plotting it (with a more advanced chart example).

<br>
<br>
<br>

## **Bonus 2:** Advanced charting

We can go further with our visualisation, for example by adding a confidence band, model statistics, and interactivity.

<br>

**Bonus 2.1** Let's build our chart just like we did above, but using our `trend_data` dataframe for the regression line layer.

In [15]:
# First, set up our main points layer. This uses our original data, and the same encoding as before.
points = alt.Chart(data).mark_point(size=60, opacity=0.7, color='darkblue').encode(
    x=alt.X('incomeDeprevation:Q').axis(format='%').title('Income deprivation rate'),
    y=alt.Y('maleLE:Q').scale(zero=False).title('Male life expectancy (years)').axis(titleAngle=0, titleAlign='left', titleX=1, titleY=-3),
    tooltip=['areaName:N', 
             alt.Tooltip('incomeDeprevation:Q', format='.1%'),
             alt.Tooltip('maleLE:Q', format='.1f'),
             alt.Tooltip('predicted_maleLE:Q', format='.1f', title='Predicted'),
             alt.Tooltip('residual:Q', format='.2f')]
)

# Next, set up our trend line layer. This uses our `trend_data` dataframe, and the same encoding as before (since we named the columns the same thing).
trend = alt.Chart(trend_data).mark_line(color='red', size=2).encode(
    x=alt.X('incomeDeprevation:Q'),
    y=alt.Y('predicted_maleLE:Q')
)
# c1 = points + trend      # If you want the trend to be on top of the points, put the points first.
c1 = trend + points      # The order determines the layering. If you want the trend to be underneath the points, put the trend first.
c1.display()

<br>
<br>

**Bonus 2.2** Next, we'll add a third layer with confidence bands.

<br>


> Note: we'll generate approximate bounds showing the typical spread of the residuals. This isn't technically a confidence band; it is a rough indicator of the typical scatter around the line, which is fine for exploratory visualisation.

Calculate ±1.96 × the residual standard deviation. If the residuals are roughly normally distributed, about 95% of the existing data points will fall within this band around the regression line.

In [16]:
# Add a rough band around the trend line (simplified: uses the residual standard deviation, not a true confidence interval)
residual_std = data['residual'].std()
trend_data['upper'] = trend_data['predicted_maleLE'] + 1.96 * residual_std
trend_data['lower'] = trend_data['predicted_maleLE'] - 1.96 * residual_std

band = alt.Chart(trend_data).mark_area(opacity=0.2, color='red').encode(
    x='incomeDeprevation:Q',
    y='lower:Q',
    y2='upper:Q'
)
band
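
The ±1.96 × residual standard deviation rule can be sanity-checked by counting how many points fall inside the band. The sketch below does this on hypothetical simulated data with normally distributed noise; it uses `np.polyfit` as a stand-in for the sklearn fit, and the seed, sample size and noise level are all assumptions.

```python
import numpy as np

# Hypothetical simulated data with normal noise around a straight line
rng = np.random.default_rng(42)
x = rng.uniform(0.05, 0.4, size=500)
y = 79 - 31 * x + rng.normal(0, 2.0, size=500)

# Degree-1 polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Same rule as above; ddof=1 matches the default of pandas' .std()
half_width = 1.96 * residuals.std(ddof=1)

# Share of points lying within the band around the fitted line
inside = np.abs(residuals) <= half_width
coverage = inside.mean()
print(f"{coverage:.1%} of points fall inside the band")
```

With normal noise the share should land close to 95%. With skewed or heavy-tailed residuals it can differ noticeably, which is one reason the band is only a rough guide.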

<br>

Add the band as a layer underneath our main chart.

In [18]:
c2 = band + c1      # Putting the band underneath the points and trend line
c2.display()

<br>
<br>

**Bonus 3:** Add text layer with model statistics

Create a DataFrame holding our formatted model statistics (R² and the fitted equation).

In [19]:
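# Note: the minus sign in the equation is hard-coded, which works here because the fitted slope is negative.
df_text = pd.DataFrame([{
    'text': f"R² = {model_male.score(X, y_male):.3f} | y = {model_male.intercept_:.1f} - {abs(model_male.coef_[0]):.1f}x"
}])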
df_text

| | text |
|---|---|
| 0 | R² = 0.584 \| y = 79.2 - 31.0x |
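
For context on the statistics in this label, here is a small sketch on hypothetical simulated data. It recomputes R² by hand as 1 - SS_res / SS_tot and compares it with `model.score()`. It also includes a hypothetical helper, `format_equation` (not part of the original notebook), that picks the sign from the slope so the label stays correct for a positive slope.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical simulated data
rng = np.random.default_rng(1)
x = rng.uniform(0.05, 0.4, size=40)
y = 79 - 31 * x + rng.normal(0, 2.0, size=40)
X = x.reshape(-1, 1)

model = linear_model.LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot


def format_equation(intercept, slope):
    """Hypothetical helper: format the fitted line with a sign-aware slope term."""
    sign = "-" if slope < 0 else "+"
    return f"y = {intercept:.1f} {sign} {abs(slope):.1f}x"


print(np.isclose(r2_manual, model.score(X, y)))
print(format_equation(model.intercept_, model.coef_[0]))
```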


In [20]:
# Define text layer with model statistics and custom x, y position.
text = alt.Chart(df_text).mark_text(
    align='left',
    baseline='top',
    lineBreak='|',
    lineHeight=12,
    dx=50,
    # dy=10,
    fontSize=12,
    color='red'
).encode(
    x=alt.value(40),    # We manually set the position of the text (i.e. pixel position, starting from top left corner)
    y=alt.value(2),
    text=alt.Text('text:N')
)

# Combine all layers
final_chart = (c2 + text).properties(
    width=500,
    height=350,
    title=alt.TitleParams(
        text="Linear Regression: Deprivation & Life Expectancy",
        subtitle="Data points and fitted line, with approximate 95% bounds",
        anchor='start',
        frame='group',
        fontSize=15,
        fontWeight='bold'
    )
).interactive()

final_chart.display()

<br>

We could save this with:

In [None]:
# !pip install vl-convert-python        ## UNCOMMENT IF ERROR SAVING AS PNG.

final_chart.save('linear_regression_chart.png', scale_factor=2.0)
final_chart.save('linear_regression_chart.json')

<br>

---