# Getting Started with scikit-learn

This notebook introduces the scikit-learn package. We start by loading data with pandas, then build a linear regression model, first with a single predictive variable (univariate model) and then with several variables (multivariate model). Along the way we select the columns needed to train the model, fit the model and make predictions. We also look at how to scale features and how to handle categorical variables.

In these examples we will use a small example dataset. We will visualise the data using the matplotlib package.

In later examples, we will split the data into training and test datasets so that we can evaluate the model. This example focuses on the basics of scikit-learn, so we skip that step here.

## The mtcars example dataset

mtcars (Motor Trend Car Road Tests) is a famous statistical dataset. The data comes from the 1974 Motor Trend US magazine and contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models). The objective of this exercise is to analyse which factors contribute to fuel efficiency (measured in the mpg column). The [mtcars dataset page](https://zomalex.co.uk/datasets/mt_cars_dataset.html) has full details.

The import statements below use numpy, pandas, matplotlib and several modules from scikit-learn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder


Load the mtcars dataset using pandas and display the first two rows.

In [None]:
df = pd.read_csv('https://zomalextrainingstorage.blob.core.windows.net/datasets/misc/mtcars.csv')
df.head(2)

Models are trained using a set of features (X) and a target variable (y). In this case, we have only one feature (weight) and one target variable (miles per gallon). This is univariate linear regression.

In [None]:
X = df.loc[:, ['wt']] # this is a pandas dataframe
y = df.loc[:,  'mpg'] # this is a pandas series

X.shape, y.shape
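scikit-learn estimators expect the features as a 2D structure (rows by columns) and the target as a 1D structure, which is why `X` is selected with a list of column names and `y` with a single name. A minimal sketch of the difference, using a made-up three-row frame (the values are illustrative and not taken from mtcars):

```python
import pandas as pd

# Toy frame with made-up values, used only to illustrate shapes
toy = pd.DataFrame({'wt': [2.5, 3.0, 3.5], 'mpg': [24.0, 21.0, 18.0]})

X_toy = toy.loc[:, ['wt']]  # list of column names -> DataFrame (2D)
y_toy = toy.loc[:, 'mpg']   # single column name   -> Series (1D)

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```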

## Visualise the data

Plot the relationship between 'wt' (weight) and 'mpg' (miles per gallon).

In [None]:
plt.scatter(X, y)
plt.xlabel('Weight')
plt.ylabel('Miles per Gallon')
plt.title('Scatterplot of Weight vs MPG')
plt.show()

## Univariate linear regression model

Our first machine learning model uses linear regression to find a relationship between two numeric variables.

In [None]:
model = linear_model.LinearRegression()
model.fit(X.values, y)
print(f'model.coef_: {model.coef_}\nmodel.intercept_: {model.intercept_}')

Using the model, predict the mpg of a car that weighs 3,000 lbs (3.0 in the dataset).

In [None]:
example_weight = 3.0
example_weight_array = np.array([[example_weight]]) # expects 2D numpy array
mpg_pred = model.predict(example_weight_array)
print(f'Predicted MPG for weight 3.0: {mpg_pred[0]}')
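For a univariate model, `predict` simply evaluates the fitted line: intercept plus coefficient times the feature value. The sketch below checks this on made-up points that lie exactly on a known line; the data and the names `toy_model`, `by_hand` and `by_predict` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn import linear_model

# Made-up points lying exactly on the line mpg = 37 - 5 * wt
wt = np.array([[2.0], [3.0], [4.0], [5.0]])
mpg = np.array([27.0, 22.0, 17.0, 12.0])

toy_model = linear_model.LinearRegression().fit(wt, mpg)

new_wt = 3.5
# Evaluate the fitted line by hand and compare with predict()
by_hand = toy_model.intercept_ + toy_model.coef_[0] * new_wt
by_predict = toy_model.predict(np.array([[new_wt]]))[0]

print(by_hand, by_predict)
```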

In [None]:
# Predict the values from the model
y_pred = model.predict(X.values)

Recreate the scatter plot and add the regression line from the model.

In [None]:
plt.scatter(X, y, label='Data Points')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.xlabel('Weight')
plt.ylabel('Miles per Gallon')
plt.title('Weight vs MPG with Regression Line')
plt.legend()
plt.show()

## Multivariate linear regression model

Multivariate regression uses several variables to predict the target variable. For example, we could use weight (wt) and horsepower (hp) to predict miles per gallon.

In [None]:
X_multi = df.loc[:, ['wt', 'hp']]
X_multi.shape

Fit the model as before. Note that there are now two coefficients, one for each feature.

In [None]:
model_multi = linear_model.LinearRegression()
model_multi.fit(X_multi.values, y)
print(f'model_multi.coef_: {model_multi.coef_}\nmodel_multi.intercept_: {model_multi.intercept_}')
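Written out, the fitted model has the form below. The symbols are introduced here for illustration: $b_0$ stands for `model_multi.intercept_`, and $b_1$, $b_2$ for the two entries of `model_multi.coef_`, in the same column order as `X_multi` (wt, then hp).

$$
\widehat{\text{mpg}} = b_0 + b_1 \cdot \text{wt} + b_2 \cdot \text{hp}
$$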

Using the multivariate model, predict the mpg of a car that weighs 3,000 lbs (3.0 in the dataset) and has a horsepower of 110.

In [None]:
example_weight = 3.0
example_horsepower = 110
example_multi_array = np.array([[example_weight, example_horsepower]]) # expects 2D numpy array
mpg_pred_multi = model_multi.predict(example_multi_array)
print(f'Predicted MPG for weight 3.0 and horsepower 110: {mpg_pred_multi[0]}')

## Scaling features

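Different features have different ranges. In this dataset, weight lies between about 1.5 and 5.5 (thousands of lbs) while horsepower ranges from about 50 to over 300. For many algorithms (for example, those that use regularisation, gradient descent or distances), features with larger ranges can dominate the training process and lead to worse models. Ordinary least squares, as used by `LinearRegression`, gives the same predictions with or without scaling, but scaling makes the coefficients directly comparable and is a good habit to build. We can scale the features to a common range using standardization (z-score normalization). This involves subtracting the mean and dividing by the standard deviation for each feature.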


In [None]:
scaler = StandardScaler()
X_multi_scaled = scaler.fit_transform(X_multi.values)
print(f'X_multi[:5] {X_multi.values[:5]} \nX_multi_scaled[:5]: {X_multi_scaled[:5]}')
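To make the z-score calculation concrete, the sketch below reproduces `StandardScaler` by hand on made-up values (not taken from mtcars). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales
features = np.array([[2.0, 100.0],
                     [3.0, 150.0],
                     [4.0, 200.0]])

scaled_by_sklearn = StandardScaler().fit_transform(features)

# Column-wise z-score; np.std defaults to ddof=0 (population standard deviation)
scaled_by_hand = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled_by_sklearn, scaled_by_hand))  # True
```

After scaling, each column has a mean of 0 and a standard deviation of 1.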

In [None]:
model_multi_scaled = linear_model.LinearRegression()
model_multi_scaled.fit(X_multi_scaled, y)
print(f'model_multi_scaled.coef_: {model_multi_scaled.coef_}\nmodel_multi_scaled.intercept_: {model_multi_scaled.intercept_:.2f}')

We now need to scale any new data points with the same fitted scaler before making predictions.

In [None]:
example_multi_array_scaled = scaler.transform(example_multi_array)
example_multi_array_scaled

In [None]:
mpg_pred_multi_scaled = model_multi_scaled.predict(example_multi_array_scaled)
print(f'Predicted MPG for weight 3.0 and horsepower 110 (scaled): {mpg_pred_multi_scaled[0]:.2f}')
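For plain least-squares regression, the scaled and unscaled models should agree on their predictions; only the coefficients and intercept change. The sketch below checks this on made-up data (the values and variable names are illustrative assumptions, not the mtcars rows).

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

# Made-up data: columns are weight-like and horsepower-like values
features = np.array([[2.2, 95.0], [2.8, 110.0], [3.2, 150.0],
                     [3.6, 180.0], [4.1, 210.0], [5.0, 245.0]])
target = np.array([26.0, 22.5, 20.0, 17.5, 15.0, 12.0])
new_point = np.array([[3.0, 110.0]])

# Model fitted on the raw features
plain = linear_model.LinearRegression().fit(features, target)

# Model fitted on standardized features, using one fitted scaler throughout
toy_scaler = StandardScaler().fit(features)
scaled = linear_model.LinearRegression().fit(toy_scaler.transform(features), target)

pred_plain = plain.predict(new_point)[0]
pred_scaled = scaled.predict(toy_scaler.transform(new_point))[0]
print(pred_plain, pred_scaled)
```

With standardized features, the intercept equals the mean of the target, because every scaled feature has a mean of zero.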

## One-hot encoding of categorical features

The mtcars dataset has a column 'am' which is the transmission type: 0 = automatic, 1 = manual. Even though this looks like a numeric variable, it is actually a categorical variable. We can use one-hot encoding to convert it into a numeric format suitable for machine learning models. There are two ways of doing this: the pandas `get_dummies()` function or the scikit-learn `OneHotEncoder` class.

For simplicity, the examples below do not scale the data, although in practice we would.

### Using pandas get_dummies() function

In [None]:
df1 = pd.get_dummies(df, columns=['am'], dtype=int) # dtype=int gives 0/1 columns rather than booleans
df1.head(2)

In [None]:
X1 = df1.loc[:, ['wt', 'hp', 'am_0', 'am_1']] # this is a pandas dataframe
y1 = df1.loc[:, 'mpg'] # this is a pandas series
model1 = linear_model.LinearRegression()
model1.fit(X1.values, y1)
print(f'model1.coef_: {model1.coef_}\nmodel1.intercept_: {model1.intercept_}')
example_weight = 3.0
example_horsepower = 110
example_transmission = 1  # manual
# column order must match X1: wt, hp, am_0, am_1
example_array1 = np.array([[example_weight, example_horsepower, 1 - example_transmission, example_transmission]]) # expects 2D numpy array
mpg_pred1 = model1.predict(example_array1)
print(f'Predicted MPG for weight 3.0, horsepower 110 and manual transmission: {mpg_pred1[0]:.2f}')
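With both `am_0` and `am_1` in the model, the two columns always sum to 1, so one of them is redundant alongside the intercept. This does not change predictions for valid inputs, but it makes the individual coefficients harder to interpret. One common alternative is to keep a single indicator column, sketched here with the `drop_first=True` option of `get_dummies` on a made-up frame (the values are illustrative, not from mtcars):

```python
import pandas as pd

# Made-up rows; 'am' is 0 (automatic) or 1 (manual)
toy = pd.DataFrame({'wt': [2.6, 3.2, 3.4, 2.3],
                    'am': [1, 0, 0, 1]})

# drop_first=True drops the first category (am_0), leaving only am_1;
# dtype=int gives 0/1 integers rather than booleans
toy_encoded = pd.get_dummies(toy, columns=['am'], drop_first=True, dtype=int)

print(toy_encoded.columns.tolist())  # ['wt', 'am_1']
print(toy_encoded['am_1'].tolist())  # [1, 0, 0, 1]
```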

### Using scikit-learn OneHotEncoder class

In [None]:
df2 = df.copy()
encoder = OneHotEncoder(sparse_output=False)
am_encoded = encoder.fit_transform(df2[['am']])
am_encoded[:5]


In [None]:
am_encoded_df = pd.DataFrame(am_encoded, columns=encoder.get_feature_names_out(['am']))
df2 = pd.concat([df2, am_encoded_df], axis=1) 
df2.head(5)


In [None]:
X2 = df2.loc[:, ['wt', 'hp', 'am_0', 'am_1']] # this is a pandas dataframe
y2 = df2.loc[:, 'mpg'] # this is a pandas series
model2 = linear_model.LinearRegression()
model2.fit(X2.values, y2)
print(f'model2.coef_: {model2.coef_}\nmodel2.intercept_: {model2.intercept_}')


Predict the mpg of a car that weighs 3,000 lbs (3.0 in the dataset), has a horsepower of 110 and manual transmission (am = 1).

In [None]:
# column order must match X2: wt, hp, am_0, am_1 (manual transmission means am_0 = 0, am_1 = 1)
example_array2 = np.array([[example_weight, example_horsepower, 0, 1]]) # expects 2D numpy array
mpg_pred2 = model2.predict(example_array2)
print(f'Predicted MPG for weight 3.0, horsepower 110 and manual transmission: {mpg_pred2[0]:.2f}')
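In practice, scaling and one-hot encoding are often combined so that new data passes through exactly the same steps as the training data. One possible way to do this, sketched below, uses scikit-learn's `ColumnTransformer` and `Pipeline`. The rows are illustrative values that reuse the mtcars column names, and the step names (`scale`, `onehot`, `preprocess`, `regress`) are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows (not the real dataset) with mtcars-style column names
toy = pd.DataFrame({
    'wt':  [2.6, 2.9, 2.3, 3.2, 3.4, 3.5],
    'hp':  [110, 110, 93, 110, 175, 105],
    'am':  [1, 1, 1, 0, 0, 0],
    'mpg': [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
})

# Scale the numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), ['wt', 'hp']),
    ('onehot', OneHotEncoder(), ['am']),
])

# Chain preprocessing and regression into a single estimator
pipeline = Pipeline([('preprocess', preprocess), ('regress', LinearRegression())])
pipeline.fit(toy[['wt', 'hp', 'am']], toy['mpg'])

# New data is passed in its raw form; the pipeline applies the fitted transforms
new_car = pd.DataFrame({'wt': [3.0], 'hp': [110], 'am': [1]})
pred = pipeline.predict(new_car)
print(pred[0])
```

The new car is supplied with its raw values, so there is no need to scale or encode it by hand before calling `predict`.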
