# Linear regression with scikit-learn

This notebook introduces linear regression using the scikit-learn package. We load data with pandas, build regression models with scikit-learn, and visualise results with Matplotlib.

**By the end of this notebook you will be able to:**

- Load a dataset with pandas and explore its structure
- Select features and a target variable for a regression model
- Fit univariate and multivariate linear regression models with scikit-learn
- Interpret model coefficients and make predictions
- Standardise features using `StandardScaler`

> **Note:** This notebook focuses on model creation and prediction. Splitting data into training and test sets for model evaluation is covered in a later notebook.

## The mtcars example dataset

`mtcars` is a famous statistical dataset. The data comes from the 1974 *Motor Trend* US magazine and contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles from that era. The objective of this exercise is to analyse which factors contribute to fuel efficiency (measured in the `mpg` column). The [mtcars dataset page](https://zomalex.co.uk/datasets/mt_cars_dataset.html) has full details.

Key columns used in this notebook:

| Column | Description |
|--------|-------------|
| `mpg`  | Miles per gallon — our **target variable** |
| `wt`   | Weight in 1,000 lbs |
| `hp`   | Gross horsepower |
| `am`   | Transmission type (0 = automatic, 1 = manual) |
| `vs`   | Engine shape (0 = V-shaped, 1 = straight) |

> **Note:** With only 32 rows, this dataset is small, which makes it ideal for learning the scikit-learn API without long runtimes.

## Imports

The libraries used in this notebook:

- **NumPy** — efficient numerical arrays and mathematical operations
- **pandas** — tabular data (DataFrames) for loading and manipulating the dataset
- **Matplotlib** — plotting and visualisation
- **scikit-learn `linear_model`** — the `LinearRegression` class for building regression models
- **scikit-learn `StandardScaler`** — scales features to zero mean and unit standard deviation

In [None]:
import numpy as np                               # numerical arrays and maths
import pandas as pd                              # DataFrames for data manipulation
import matplotlib.pyplot as plt                  # plotting

from sklearn import linear_model                 # LinearRegression and other models
from sklearn.preprocessing import StandardScaler # feature scaling

## Load the dataset

Load the `mtcars` dataset using pandas and display the first five rows.

In [None]:
df = pd.read_csv('https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/mtcars.csv')
df.head()

### Explore the dataset

Check the shape (rows × columns) and descriptive statistics. Key things to look for:

- `mpg` ranges from about 10 to 34 — this is our target
- `wt` ranges from about 1.5 to 5.4 (thousands of lbs)
- Some columns (e.g. `am`, `vs`) have only 0 or 1 as values — they are binary flags

In [None]:
print(f'Shape: {df.shape}')
df.describe()

## Select features and target

Models are trained using a set of **features** (`X`) and a **target variable** (`y`).

We start with a single feature — `wt` (weight) — to predict `mpg` (miles per gallon). This is **univariate linear regression**.

**Why `wt`?** The scatter plot below shows a strong negative correlation between weight and MPG: heavier cars tend to be less fuel-efficient. This makes `wt` a good starting predictor.

**Why double brackets `[[]]`?** `df[['wt']]` returns a **DataFrame** (2-D), while `df['wt']` returns a **Series** (1-D). scikit-learn's `.fit()` expects a 2-D array for `X`, so always use double brackets when selecting features.

In [None]:
X = df[['wt']] # DataFrame (2-D) — scikit-learn requires 2-D input for features
y = df['mpg']  # Series (1-D) — the target variable

print(f'X shape: {X.shape},  y shape: {y.shape}')
X.head()
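
If a feature starts out as a 1-D NumPy array rather than a DataFrame column, `reshape(-1, 1)` produces the same 2-D layout. The sketch below compares the three shapes on a small made-up frame (`toy` is hypothetical and is not the mtcars data).

```python
import pandas as pd

# Hypothetical toy frame for illustration only (not the mtcars data)
toy = pd.DataFrame({'wt': [2.0, 3.0, 4.0], 'mpg': [30.0, 22.0, 15.0]})

as_frame = toy[['wt']]                           # DataFrame: 2-D
as_series = toy['wt']                            # Series: 1-D
as_column = toy['wt'].to_numpy().reshape(-1, 1)  # NumPy array reshaped to one column: 2-D

print(as_frame.shape, as_series.shape, as_column.shape)  # (3, 1) (3,) (3, 1)
```

Either of the 2-D forms can be passed as `X` to `.fit()`; the 1-D Series is the right shape for `y`.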

## Visualise the data

Plot the relationship between `wt` (weight in 1,000 lbs) and `mpg` (miles per gallon).

> **Think about it:** What kind of relationship do you see? Is it positive or negative? Does it look linear?

In [None]:
plt.scatter(X, y, alpha=0.7)
plt.xlabel('Weight (1,000 lbs)')
plt.ylabel('Miles per Gallon')
plt.title('Scatterplot of Weight vs MPG')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

## Univariate linear regression model

Linear regression finds the straight line that **best fits** the data by minimising the sum of squared differences between actual and predicted values (the *least squares* criterion).

The fitted line has the form:

$$\hat{y} = \text{coef\_} \times x + \text{intercept\_}$$

In our case: **mpg = m × wt + b**, where `m` is the slope (`coef_`) and `b` is the y-intercept (`intercept_`). After fitting, we can read off these values directly from the model.

In [None]:
model1 = linear_model.LinearRegression()
model1.fit(X, y)
print(f'Coefficient (slope): {model1.coef_[0]:.2f}')
print(f'Intercept:           {model1.intercept_:.2f}')

Using the coefficient and intercept above, the fitted model equation is:

In [None]:
print(f'mpg = {model1.coef_[0]:.2f} \u00d7 wt + {model1.intercept_:.2f}')
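
To make the least squares criterion concrete, the sketch below computes the slope and intercept by hand for a single feature and compares them with what `LinearRegression` returns. The data is made up for illustration, and the names `x_toy`, `y_toy` and `toy_model` are hypothetical, chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustrative data (not from mtcars)
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares solution for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
#   intercept = mean_y - slope * mean_x
x_dev = x_toy - x_toy.mean()
y_dev = y_toy - y_toy.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = y_toy.mean() - slope * x_toy.mean()

# scikit-learn should arrive at the same line
toy_model = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)

print(f'By hand:      slope={slope:.4f}, intercept={intercept:.4f}')
print(f'scikit-learn: slope={toy_model.coef_[0]:.4f}, intercept={toy_model.intercept_:.4f}')
```

The two printed lines should agree, which is a useful sanity check that `coef_` and `intercept_` are the least squares estimates.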

## Make a prediction

Using the model, predict the MPG of a car that weighs 3,000 lbs. In the dataset, weight is measured in thousands of lbs, so `wt = 3.0` means 3,000 lbs.

In [None]:
example_weight = 3.0  # 3,000 lbs
example_weight_df = pd.DataFrame({'wt': [example_weight]})
mpg_pred1 = model1.predict(example_weight_df)
print(f'Predicted MPG for a 3,000 lb car: {mpg_pred1[0]:.2f} mpg')
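
For a single-feature model, `.predict()` does nothing more than apply the fitted equation. As a rough check, the sketch below fits a model on a small made-up frame (`toy`, hypothetical values, not mtcars) and compares `.predict()` with the formula evaluated by hand.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data for illustration only
toy = pd.DataFrame({'wt': [2.0, 2.5, 3.0, 3.5, 4.0],
                    'mpg': [30.0, 26.5, 22.0, 19.5, 15.0]})
toy_model = LinearRegression().fit(toy[['wt']], toy['mpg'])

new_wt = 3.2  # an arbitrary weight, in 1,000 lbs

# Prediction via the model API (input keeps the same column name used in fit)
by_predict = toy_model.predict(pd.DataFrame({'wt': [new_wt]}))[0]

# Prediction via the fitted equation: coef_ * x + intercept_
by_formula = toy_model.coef_[0] * new_wt + toy_model.intercept_

print(f'predict(): {by_predict:.4f}   formula: {by_formula:.4f}')
```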

To draw the regression line on the scatter plot, we predict MPG for every weight value in the dataset. This gives us the y-coordinates for each point on the line.

In [None]:
# Predict MPG for every car in the dataset — used to draw the regression line
y_pred = model1.predict(X)

Recreate the scatter plot and overlay the model's regression line. The line represents the model's prediction for every value of `wt` in the dataset.

In [None]:
plt.scatter(X, y, label='Data Points', alpha=0.7)
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('Weight (1,000 lbs)')
plt.ylabel('Miles per Gallon')
plt.title('Weight vs MPG with Regression Line')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend()
plt.show()

## Multivariate linear regression model

Multivariate regression uses **several features** to predict the target variable. We use these four columns:

| Feature | Description | Type |
|---------|-------------|------|
| `wt`    | Weight in 1,000 lbs | Continuous numeric |
| `hp`    | Gross horsepower | Continuous numeric |
| `am`    | Transmission (0 = automatic, 1 = manual) | Binary (0/1) |
| `vs`    | Engine shape (0 = V-shaped, 1 = straight) | Binary (0/1) |

These features cover a mix of numeric and binary categorical types. Because `am` and `vs` are already coded as 0/1 integers, no additional encoding is needed.

> **Think about it:** Do you expect this model to perform better than the univariate model? Why?

In [None]:
X2 = df[['wt', 'hp', 'am', 'vs']]
print(f'Shape: {X2.shape}')
X2.head()

There are now four coefficients — one per feature. Each coefficient represents the **change in MPG for a one-unit increase in that feature, holding all other features constant**.

In [None]:
model2 = linear_model.LinearRegression()
model2.fit(X2, y)
print(f'Intercept: {model2.intercept_:.2f}')
print('Coefficients:')
for feature, coef in zip(X2.columns, model2.coef_):
    print(f'  {feature}: {coef:.2f}')

In [None]:
pd.DataFrame({'Feature': X2.columns, 'Coefficient': model2.coef_.round(2)})

Interpreting the sign of each coefficient:

- **`wt` (negative):** Heavier cars get fewer MPG, even after controlling for horsepower, transmission type, and engine shape.
- **`hp` (negative):** More powerful engines are less fuel-efficient.
- **`am` (positive):** Manual transmission cars tend to get more MPG than automatics with otherwise identical features.
- **`vs` (positive):** Straight engines tend to be slightly more efficient than V-shaped ones.
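
The phrase "holding all other features constant" can be checked directly: raise one feature by one unit, leave the rest alone, and the prediction should move by exactly that feature's coefficient. The sketch below does this on synthetic data (the arrays and the name `toy_model` are hypothetical, not taken from mtcars).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic two-feature data for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
y_toy = 3.0 - 2.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=30)

toy_model = LinearRegression().fit(X_toy, y_toy)

base = np.array([[1.0, 1.0]])
bumped = base + np.array([[1.0, 0.0]])  # feature 0 up by one unit, feature 1 unchanged

diff = toy_model.predict(bumped)[0] - toy_model.predict(base)[0]
print(f'Change in prediction: {diff:.4f}')
print(f'Coefficient 0:        {toy_model.coef_[0]:.4f}')
```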

Using the multivariate model, predict the MPG of a car that weighs 3,000 lbs (`wt = 3.0`), has `hp = 110`, uses manual transmission (`am = 1`), and has a V-shaped engine (`vs = 0`).


In [None]:
example_weight = 3.0
example_horsepower = 110
example_am = 1 # manual transmission
example_vs = 0 # v-shaped engine
example_array = pd.DataFrame([[example_weight, example_horsepower, example_am, example_vs]],
                              columns=['wt', 'hp', 'am', 'vs'])
mpg_pred2 = model2.predict(example_array)
print(f'Predicted MPG for weight 3.0 and horsepower 110, manual transmission, v-shaped engine: {mpg_pred2[0]:.2f}')

## Feature scaling

Different features can have very different ranges: `wt` varies from about 1.5 to 5.4, while `hp` ranges from roughly 50 to 335.

**Important:** Scaling is *not required* for ordinary linear regression: rescaling a feature changes its coefficient but leaves the fitted predictions unchanged. However, many other models, including K-Nearest Neighbours (KNN), Support Vector Machines (SVM), and K-Means clustering, are sensitive to feature scale and usually need standardised inputs. We include scaling here to demonstrate the technique in a familiar context.

### Standardisation (z-score normalisation)

`StandardScaler` transforms each feature to have **mean = 0** and **standard deviation = 1**:

$$z = \frac{x - \mu}{\sigma}$$

In plain terms: it centres each feature at zero and rescales it so that one unit equals one standard deviation.

> **Warning:** Always fit the scaler on **training data only**. Use `.transform()` (not `.fit_transform()`) on validation or test data. Fitting on test data leaks information about the test set into your model.

In [None]:
scaler = StandardScaler()
# fit_transform = fit() + transform() combined; every row here is training data, so this is safe
X2_scaled = scaler.fit_transform(X2)
print(f'Original X2 (first 5 rows):\n{X2.values[:5]}')
print(f'\nScaled X2 (first 5 rows):\n{X2_scaled[:5].round(2)}')

In [None]:
print(f'Means: {X2_scaled.mean(axis=0).round(2)}')
print(f'Stds:  {X2_scaled.std(axis=0).round(2)}')
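
The z-score formula can be reproduced by hand with NumPy. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`, NumPy's default), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below uses a small made-up array (`X_toy`, hypothetical values) to show the difference.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales
X_toy = np.array([[1.5, 52.0],
                  [3.2, 110.0],
                  [5.4, 335.0],
                  [2.6, 95.0]])

scaled = StandardScaler().fit_transform(X_toy)

# z = (x - mean) / std, computed column by column with the population std (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
print(np.allclose(manual, scaled))        # True

# Using the sample std (ddof=1) gives slightly different values
manual_ddof1 = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=1)
print(np.allclose(manual_ddof1, scaled))  # False
```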

### Fit the scaled model

With scaled features, each coefficient now represents the change in MPG per **one standard deviation increase** in that feature, rather than per one unit. This makes the magnitudes directly comparable across features with very different ranges.

In [None]:
model2_scaled = linear_model.LinearRegression()
model2_scaled.fit(X2_scaled, y)
print(f'Intercept: {model2_scaled.intercept_:.2f}')
print('Coefficients (per standard deviation):')
for feature, coef in zip(X2.columns, model2_scaled.coef_):
    print(f'  {feature}: {coef:.2f}')

### Scaling new data

Any new data must be scaled using the **same scaler** that was fitted on the training data. The scaler remembers the training mean and standard deviation; reusing it ensures new data is shifted to the same scale the model was trained on.

> **Warning:** Never call `.fit_transform()` or `.fit()` on new (test/prediction) data. Doing so computes new statistics from that data, shifting the scale relative to what the model learned, and silently produces wrong predictions.
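
This notebook has no train/test split, but once one is introduced the pattern might look like the sketch below: fit the scaler on the training rows only, then call `.transform()` on both sets. The data is synthetic, and `X_toy` and `toy_scaler` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data, loosely shaped like weight and horsepower
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=[3.0, 150.0], scale=[1.0, 60.0], size=(40, 2))

X_train, X_test = train_test_split(X_toy, test_size=0.25, random_state=0)

toy_scaler = StandardScaler().fit(X_train)      # statistics come from training rows only
X_train_scaled = toy_scaler.transform(X_train)
X_test_scaled = toy_scaler.transform(X_test)    # reuse the training mean and std

print('Train means:', X_train_scaled.mean(axis=0).round(2))
print('Test means: ', X_test_scaled.mean(axis=0).round(2))
```

The scaled training means are zero by construction. The scaled test means are generally close to zero but not exactly zero, which is expected: the test rows are measured against the training statistics.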

In [None]:
example_array_scaled = scaler.transform(example_array)
print(f'Original input:  {example_array.values}')
print(f'Scaled input:    {example_array_scaled.round(2)}')

In [None]:
mpg_pred2_scaled = model2_scaled.predict(example_array_scaled)
print(f'Prediction from unscaled model2:        {mpg_pred2[0]:.2f} mpg')
print(f'Prediction from scaled  model2_scaled:  {mpg_pred2_scaled[0]:.2f} mpg')

The two predictions are equal up to floating-point rounding. **Scaling changes how coefficients are interpreted, not what the model predicts.** Both models have learned the same underlying relationship; the scaled model simply expresses it in standardised units.
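
The link between the two sets of coefficients can be written down explicitly. Substituting $x_j = \mu_j + \sigma_j z_j$ into the unscaled equation gives a scaled coefficient of $\beta_j \sigma_j$ and a scaled intercept of $\beta_0 + \sum_j \beta_j \mu_j$. The sketch below checks this on synthetic data; all names (`X_toy`, `raw_model`, `scaled_model`, `toy_scaler`) are hypothetical and separate from the notebook's models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X_toy = rng.normal(loc=[3.0, 150.0], scale=[1.0, 60.0], size=(30, 2))
y_toy = 35.0 - 4.0 * X_toy[:, 0] - 0.03 * X_toy[:, 1] + rng.normal(scale=0.5, size=30)

toy_scaler = StandardScaler().fit(X_toy)
X_toy_scaled = toy_scaler.transform(X_toy)

raw_model = LinearRegression().fit(X_toy, y_toy)
scaled_model = LinearRegression().fit(X_toy_scaled, y_toy)

# Convert the unscaled parameters into standardised units
coef_from_raw = raw_model.coef_ * toy_scaler.scale_                             # beta_j * sigma_j
intercept_from_raw = raw_model.intercept_ + raw_model.coef_ @ toy_scaler.mean_  # beta_0 + sum(beta_j * mu_j)

print('Converted coefficients:', coef_from_raw.round(4))
print('Scaled-model coef_:    ', scaled_model.coef_.round(4))
print('Same predictions:      ', np.allclose(raw_model.predict(X_toy),
                                             scaled_model.predict(X_toy_scaled)))
```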

## Summary

In this notebook you:

- Loaded the `mtcars` dataset with pandas and explored its structure
- Selected features (`X`) and a target (`y`) for linear regression
- Fitted a **univariate model** (`model1`) using `wt` alone and visualised the regression line
- Fitted a **multivariate model** (`model2`) using `wt`, `hp`, `am`, and `vs`
- Standardised features with `StandardScaler` and fitted a **scaled model** (`model2_scaled`)
- Confirmed that scaling does not change predictions — only the interpretation of coefficients

### Model comparison

| Model | Features | Coefficients represent |
|-------|----------|------------------------|
| `model1` | `wt` | Change in MPG per 1,000 lb increase |
| `model2` | `wt`, `hp`, `am`, `vs` | Change in MPG per one-unit increase in each feature |
| `model2_scaled` | `wt`, `hp`, `am`, `vs` (scaled) | Change in MPG per one standard deviation increase |

### Suggested exercises

1. **Add a feature:** Add `cyl` (number of cylinders) to `X2` and refit the model. Do the other coefficients change?
2. **Try a different predictor:** Replace `wt` with `hp` in the univariate model and compare the regression line.
3. **Inspect residuals:** Compute `y - y_pred` for `model1`. Plot a histogram — are the residuals roughly normally distributed?