# Linear regression with scikit-learn

This notebook introduces linear regression using the scikit-learn package. We start by loading data with pandas. We will create a linear regression model first using a single predictor (univariate model) and then several predictors (multivariate model). We will select the columns needed to train the model, fit the model, and make predictions.

We will also look at how to scale features. Ordinary linear regression does not need this, but other ML models, such as K-Nearest Neighbors (KNN), K-Means clustering, and Support Vector Machines (SVM), are sensitive to feature scale and usually need it to work well.

In these examples, we use a small dataset. We will visualise the data using Matplotlib.

In later examples, we will split the data into training and test datasets so we can evaluate the model. In this example, the focus is on the basics of scikit-learn, so we will not do that.


## The mtcars example dataset

`mtcars` is a famous statistical dataset. The data comes from the 1974 *Motor Trend* US magazine and contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles from that era. The objective of this exercise is to analyse which factors contribute to fuel efficiency (measured in the `mpg` column). The [mtcars dataset page](https://zomalex.co.uk/datasets/mt_cars_dataset.html) has full details.


The import statements below bring in NumPy, pandas, Matplotlib, and two modules from scikit-learn.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

Load the `mtcars` dataset using pandas, and display the first two rows.


In [None]:
df = pd.read_csv('https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/mtcars.csv')
df.head(2)

Models are trained using a set of features (`X`) and a target variable (`y`). In this case, we have one feature (`wt`, weight) and one target variable (`mpg`, miles per gallon). This is univariate linear regression.


In [None]:
X = df[['wt']] # this is a pandas dataframe
y = df['mpg']  # this is a pandas series

X.shape, y.shape
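
The double brackets matter here: scikit-learn estimators expect the features as a 2D table (rows by columns) and the target as a 1D sequence. A minimal sketch of the difference, using a small made-up DataFrame (`toy` and its values are illustrative, not rows from `mtcars`):

```python
import pandas as pd

# Illustrative values only; these are not rows from mtcars.
toy = pd.DataFrame({'wt': [2.0, 2.5, 3.0], 'mpg': [26.0, 23.0, 21.0]})

features_2d = toy[['wt']]  # list of column names -> DataFrame (2D)
target_1d = toy['mpg']     # single column name   -> Series (1D)

print(type(features_2d).__name__, features_2d.shape)
print(type(target_1d).__name__, target_1d.shape)
```

Selecting with a single pair of brackets (`toy['wt']`) would give a 1D Series, which `fit` does not accept as the feature matrix.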

## Visualise the data

Plot the relationship between `wt` (weight) and `mpg` (miles per gallon).


In [None]:
plt.scatter(X, y)
plt.xlabel('Weight')
plt.ylabel('Miles per Gallon')
plt.title('Scatterplot of Weight vs MPG')
plt.show()

## Univariate linear regression model

Our first machine learning model uses linear regression to find the relationship between two numeric variables.


In [None]:
model1 = linear_model.LinearRegression()
model1.fit(X, y)
print(f'model1.coef_: {model1.coef_}\nmodel1.intercept_: {model1.intercept_}')
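
For a single predictor, the fitted line has a closed-form solution: the slope is the sum of cross-deviations of `x` and `y` divided by the sum of squared deviations of `x`, and the intercept makes the line pass through the point of means. The sketch below computes both by hand on made-up numbers (`weights`, `mpgs`, and `fit_line` are illustrative names, not part of scikit-learn). With its default settings, `LinearRegression` should give the same values up to floating-point error.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    # Sum of cross-deviations and sum of squared x-deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Illustrative values only; these are not rows from mtcars.
weights = [2.0, 2.5, 3.0, 3.5, 4.0]
mpgs = [26.0, 23.0, 21.0, 18.0, 15.0]

slope, intercept = fit_line(weights, mpgs)
print(f'slope: {slope:.2f}, intercept: {intercept:.2f}')
print(f'prediction at wt = 3.0: {intercept + slope * 3.0:.2f}')
```

A negative slope means that heavier cars are predicted to have lower fuel efficiency.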

Using the model, predict the MPG of a car that weighs 3,000 lbs (`wt = 3.0` in the dataset).


In [None]:
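example_weight = 3.0
# One-row DataFrame with the same column name used in fit;
# a bare NumPy array also works but triggers a feature-name warning
example_weight_df = pd.DataFrame({'wt': [example_weight]})
mpg_pred1 = model1.predict(example_weight_df)
print(f'Predicted MPG for weight 3.0: {mpg_pred1[0]:.2f}')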

In [None]:
# Predict MPG for every car in the dataset (used below to draw the regression line)
y_pred = model1.predict(X)

Recreate the scatter plot and add the model's regression line.


In [None]:
plt.scatter(X, y, label='Data Points')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('Weight')
plt.ylabel('Miles per Gallon')
plt.title('Weight vs MPG with Regression Line')
plt.legend()
plt.show()

## Multivariate linear regression model

Multivariate regression (more precisely, multiple linear regression, since there is still a single target) uses several features to predict the target variable. We use these columns:

* `wt`: weight (thousands of lbs)
* `hp`: horsepower
* `am`: transmission type (`am = 0` automatic, `am = 1` manual)
* `vs`: engine type (`vs = 0` V-shaped, `vs = 1` straight)


In [None]:
X2 = df[['wt', 'hp', 'am', 'vs']]
X2.shape

There are now four coefficients, one for each feature.


In [None]:
model2 = linear_model.LinearRegression()
model2.fit(X2, y)
print(f'model2.coef_: {model2.coef_}\nmodel2.intercept_: {model2.intercept_}')
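
`coef_` holds the coefficients in the same order as the columns passed to `fit`, so it can help to label them. A small sketch on made-up data (the values and the names `toy`, `features`, and `coef_by_feature` are illustrative). Each coefficient is the change in predicted `mpg` for a one-unit increase in that feature, with the other features held fixed.

```python
import pandas as pd
from sklearn import linear_model

# Illustrative values only; these are not rows from mtcars.
toy = pd.DataFrame({
    'wt':  [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    'hp':  [90, 110, 100, 150, 180, 200],
    'mpg': [27.0, 23.5, 22.0, 18.0, 15.5, 13.0],
})
features = toy[['wt', 'hp']]

model = linear_model.LinearRegression().fit(features, toy['mpg'])

# Pair each coefficient with the column it belongs to
coef_by_feature = pd.Series(model.coef_, index=features.columns)
print(coef_by_feature)
```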

Using the multivariate model, predict the MPG of a car that weighs 3,000 lbs (`wt = 3.0`), has `hp = 110`, uses manual transmission (`am = 1`), and has a V-shaped engine (`vs = 0`).


In [None]:
example_weight = 3.0
example_horsepower = 110
example_am = 1 # manual transmission
example_vs = 0 # V-shaped engine
# One-row DataFrame with the same columns as X2, which avoids scikit-learn's feature-name warning
example_array = pd.DataFrame([[example_weight, example_horsepower, example_am, example_vs]], columns=X2.columns)
mpg_pred2 = model2.predict(example_array)
print(f'Predicted MPG for weight 3.0, horsepower 110, manual transmission, V-shaped engine: {mpg_pred2[0]:.2f}')

## Scaling features

Different features have different ranges. In `mtcars`, weight lies between about 1.5 and 5.5 (thousands of lbs), while horsepower ranges from about 50 to over 300. For models that rely on distances or regularisation, features with larger ranges can dominate training and lead to worse models. To avoid this, we can put features on a comparable scale using standardisation (z-score normalisation): for each feature, subtract the mean and divide by the standard deviation, so that every scaled feature has mean 0 and standard deviation 1.
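
As a concrete check of that formula, the sketch below standardises one made-up column by hand. It uses the population standard deviation (`statistics.pstdev`, which divides by n), matching what `StandardScaler` computes; the values and the name `horsepower` are illustrative.

```python
from statistics import fmean, pstdev

# Illustrative values only; these are not rows from mtcars.
horsepower = [90.0, 110.0, 130.0, 150.0, 170.0]

hp_mean = fmean(horsepower)
hp_std = pstdev(horsepower)  # population standard deviation (divides by n)

# z-score: subtract the mean, then divide by the standard deviation
z_scores = [(value - hp_mean) / hp_std for value in horsepower]
print([round(z, 3) for z in z_scores])
```

After standardisation the column has mean 0 and standard deviation 1, regardless of its original units.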


In [None]:
scaler = StandardScaler()
X2_scaled = scaler.fit_transform(X2)
print(f'X2[:5]:\n{X2.values[:5]}\nX2_scaled[:5]:\n{X2_scaled[:5]}')

In [None]:
model2_scaled = linear_model.LinearRegression()
model2_scaled.fit(X2_scaled, y)
print(f'model2_scaled.coef_: {model2_scaled.coef_}\nmodel2_scaled.intercept_: {model2_scaled.intercept_:.2f}')

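Any new data points must also be scaled before making predictions. Use the scaler that was already fitted on the training data (`transform`, not `fit_transform`), so the new points are scaled with the same mean and standard deviation.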


In [None]:
example_array_scaled = scaler.transform(example_array)
example_array_scaled
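
One way to avoid forgetting this step is to bundle the scaler and the model with scikit-learn's `make_pipeline`, so that `predict` applies the fitted scaling automatically. This is an alternative to the manual approach above rather than what this notebook does; the data and the names `toy`, `pipeline`, and `new_car` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative values only; these are not rows from mtcars.
toy = pd.DataFrame({
    'wt':  [2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    'hp':  [90, 110, 100, 150, 180, 200],
    'mpg': [27.0, 23.5, 22.0, 18.0, 15.5, 13.0],
})
features, target = toy[['wt', 'hp']], toy['mpg']

# fit() fits the scaler first, then fits the model on the scaled features
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(features, target)

# New data is passed unscaled; the pipeline scales it before predicting
new_car = pd.DataFrame({'wt': [3.0], 'hp': [110]})
pipeline_pred = pipeline.predict(new_car)
print(f'Predicted MPG: {pipeline_pred[0]:.2f}')
```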

In [None]:
mpg_pred2_scaled = model2_scaled.predict(example_array_scaled)
print(f'Predicted MPG (scaled model) for weight 3.0, horsepower 110, manual transmission, V-shaped engine: {mpg_pred2_scaled[0]:.2f}')
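
For ordinary least squares, standardising the features changes the coefficients but not the predictions, so this result should match the unscaled model's prediction. The sketch below checks this on made-up data and converts the scaled coefficients back to the original units by dividing by `scaler.scale_` (the per-feature standard deviations). All names and values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Illustrative values only; these are not rows from mtcars.
features = np.array([[2.0, 90.0], [2.5, 110.0], [3.0, 100.0],
                     [3.5, 150.0], [4.0, 180.0], [4.5, 200.0]])
target = np.array([27.0, 23.5, 22.0, 18.0, 15.5, 13.0])

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

raw_model = LinearRegression().fit(features, target)
scaled_model = LinearRegression().fit(features_scaled, target)

# Compare predictions from the two models, up to floating-point error
same_predictions = np.allclose(raw_model.predict(features),
                               scaled_model.predict(features_scaled))

# Dividing by the standard deviations converts back to original-unit coefficients
recovered_coef = scaled_model.coef_ / scaler.scale_
same_coefficients = np.allclose(recovered_coef, raw_model.coef_)

print(same_predictions, same_coefficients)
```

The scaled coefficients are still useful on their own: because every feature is in units of standard deviations, their magnitudes can be compared with each other more directly than the unscaled ones.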

End of tutorial
