# Creating our Linear Regression

Now that we have selected the best features, it's time to build a model. Our goal this time is to find the linear regression with the lowest error.

### Steps

- Split the data into train and test sets
- Build a model using the train data
- Graph the regression line on a scatter plot (if it's a simple linear regression)
- Validate the model using error metrics (MAE, MSE, RMSE)
- Predict values with the created model

### Import Libraries

In [27]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

from sklearn.model_selection import train_test_split
from sklearn import linear_model
import sklearn.metrics as metrics

### Load the dataset

In [28]:
dataset_path = './datasets/model_features.csv'

df = pd.read_csv(dataset_path)
df.head()

Unnamed: 0,engine_size,cylinders,fuel_consumption,co2_emissions
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136
3,3.5,6,11.1,255
4,3.5,6,10.6,244


# Simple Linear Regression

First, let's build a simple linear regression using the feature with the highest correlation: `fuel_consumption`.

### Split data into train and test

Fixing a seed makes a pseudorandom number generator produce the same sequence of numbers on every run. `train_test_split` takes its seed through the `random_state` argument, so the same rows land in the training and test sets each time the notebook is executed.

In [29]:
# random_state (below) seeds the shuffle inside train_test_split, so the same rows
# are selected for the training and test sets on every run.
# np.random.seed only affects code that draws from NumPy's global generator.
np.random.seed(0)

df_train, df_test = train_test_split(
    df,
    train_size=0.7,
    test_size=0.3,
    random_state=100
)
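
To see which argument fixes the split, here is a minimal sketch on a made-up list of ten row ids (the list and the variable names are illustrative, not part of the dataset). Two calls with the same `random_state` should return the same partition.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataframe rows: ten row ids
rows = list(range(10))

# Two independent calls with the same random_state
train_a, test_a = train_test_split(
    rows, train_size=0.7, test_size=0.3, random_state=100
)
train_b, test_b = train_test_split(
    rows, train_size=0.7, test_size=0.3, random_state=100
)

# Same seed, same partition
print(train_a == train_b, test_a == test_b)
```

Changing `random_state` to another integer generally produces a different partition, which is a quick way to check how sensitive the error metrics are to the split.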

### Train the model

Interestingly, after training, the regression coefficient is a two-dimensional array (`[[...]]`). scikit-learn returns `coef_` with shape `(n_targets, n_features)` when the target is passed as a 2-D column, so the same attribute can hold the coefficients of a model trained on several variables. Here the model is trained on a single variable, so the array holds a single value and looks like this: `[[16.31...]]`.

That is why the coefficient has to be extracted as `regr.coef_[0][0]`.
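
The shape of `coef_` depends on the shape of the target passed to `fit`. A minimal sketch on made-up data (the arrays and variable names below are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0]])  # features: (n_samples, n_features)
y_1d = np.array([3.0, 5.0, 7.0, 9.0])       # target as a flat vector
y_2d = y_1d.reshape(-1, 1)                  # target as a single column

flat = LinearRegression().fit(x, y_1d)
column = LinearRegression().fit(x, y_2d)

print(flat.coef_.shape)    # 1-D: one entry per feature
print(column.coef_.shape)  # 2-D: (n_targets, n_features)
```

In this notebook the target is selected with double brackets (`df_train[['co2_emissions']]`), which yields a 2-D column, hence the nested array.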

In [30]:
# create an instance of the LinearRegression
regr = linear_model.LinearRegression()

# Convert the columns from dataframes to numpy arrays
train_x = np.asanyarray(df_train[['fuel_consumption']])
train_y = np.asanyarray(df_train[['co2_emissions']])

# Train the model using the training sets
regr.fit(train_x, train_y)

# These are the coefficients and intercept of the line of best fit
intercept = regr.intercept_
coefficients = regr.coef_

print('Intercept: ', intercept)
print('Coefficient of x: ', coefficients)

Intercept:  [67.45466384]
Coefficient of x:  [[16.31163802]]
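
`LinearRegression` fits by ordinary least squares. For a single feature the solution has a closed form, sketched below in plain Python on made-up points (the data and the helper name `fit_line` are assumptions for illustration, not the notebook's code):

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the least-squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points lying exactly on y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)  # 1.0 2.0
```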


The equation of a line and the simple linear regression formula look very much alike. They have the same mathematical meaning: both are first-degree polynomials.

\begin{equation*}
y = mx + b
\end{equation*}

\begin{equation}
y = \theta_0 + \theta_1 x_1
\end{equation}

Now that we know the intercept and the coefficient of x, we just need to plug in a value of `fuel_consumption` to predict the CO2 emissions. For example, if `fuel_consumption` is 11.1:

\begin{equation}
y = \theta_0 + \theta_1 x_1
\end{equation}

\begin{equation*}
y = (67.454) + (16.311)x
\end{equation*}

\begin{equation*}
y = (67.454) + (16.311)(11.1)
\end{equation*}

\begin{equation*}
y = 67.454 + 181.052
\end{equation*}

\begin{equation*}
y = 248.506
\end{equation*}

In [31]:
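# enter a fuel_consumption value to make a prediction
fuel_consumption = 11.1
yhat = regr.predict([[fuel_consumption]])

# predict returns a 2-D array, so take the single value out before formatting;
# the result differs slightly from the hand calculation, which used shortened coefficients
print("Predicted: %.3f" % yhat[0][0])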

Predicted: 248.514
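
The equations above use coefficients cut to three decimals (67.454 and 16.311), while the model keeps full precision. A small sketch, using the values printed by the training cell, shows where the difference in the last digits comes from:

```python
# Intercept and slope as printed after training
theta_0 = 67.45466384
theta_1 = 16.31163802
fuel_consumption = 11.1

# Prediction with the full printed precision
full = theta_0 + theta_1 * fuel_consumption

# Prediction with the three-decimal values used in the equations
shortened = 67.454 + 16.311 * fuel_consumption

print("Full precision: %.3f" % full)       # Full precision: 248.514
print("Three decimals: %.3f" % shortened)  # Three decimals: 248.506
```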


### Graphical representation of the regression line

In [32]:
fig = px.scatter(
    df_train,
    x='fuel_consumption',
    y='co2_emissions',
    width=600,
)

fig.add_trace(
    go.Scatter(
        x=train_x[:,0],
        y=regr.coef_[0][0]*train_x[:,0] + regr.intercept_[0],
        mode='lines',
        name='Regression Line'
    )
)

fig.show()

### Error Functions

In [33]:
# Convert the columns from dataframes to numpy arrays
test_x = np.asanyarray(df_test[['fuel_consumption']])
test_y = np.asanyarray(df_test[['co2_emissions']])

# Make predictions using the testing set
predictions = regr.predict(test_x)

# compare the predicted values with the real values
MAE = metrics.mean_absolute_error(test_y, predictions)
MSE = metrics.mean_squared_error(test_y, predictions)
RMSE = np.sqrt(MSE)

print('Mean Absolute Error: %.3f' % MAE)
print('Mean Squared Error: %.3f' % MSE)
print('Root Mean Squared Error: %.3f' % RMSE)

Mean Absolute Error: 19.892
Mean Squared Error: 792.319
Root Mean Squared Error: 28.148
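
For reference, the three metrics can be written out in a few lines of plain Python. This is a sketch of the definitions on made-up numbers (the helper names and the sample values are illustrative, not the notebook's data):

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up observed and predicted values
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

print(mae(y_true, y_pred))            # 2.5
print(mse(y_true, y_pred))            # 6.5
print("%.3f" % rmse(y_true, y_pred))  # 2.550
```

MAE and RMSE are expressed in the same units as the target (here, CO2 emissions), while MSE is in squared units and penalizes large errors more heavily.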


### Multiple Linear Regression (2 variables)

- Using 2 variables: `fuel_consumption` and `engine_size`.

In [34]:
# Select features and target
X = ['fuel_consumption', 'engine_size']
Y = ['co2_emissions']

# create an instance of the LinearRegression
regr = linear_model.LinearRegression()

# Convert the columns from dataframes to numpy arrays
train_x = np.asanyarray(df_train[X])
train_y = np.asanyarray(df_train[Y])

# Train the model using the training sets
regr.fit(train_x, train_y)

# These are the coefficients and intercept of the best-fit plane
intercept = regr.intercept_
coefficients = regr.coef_

print('Intercept: ', intercept)
print('Coefficients: ', coefficients)

Intercept:  [77.52641485]
Coefficients:  [[ 9.8607939  19.40228938]]


In [35]:
# Convert the columns from dataframes to numpy arrays
test_x = np.asanyarray(df_test[X])
test_y = np.asanyarray(df_test[Y])

# Make predictions using the testing set
predictions = regr.predict(test_x)

# compare the predicted values with the real values
MAE = metrics.mean_absolute_error(test_y, predictions)
MSE = metrics.mean_squared_error(test_y, predictions)
RMSE = np.sqrt(MSE)

print('Mean Absolute Error: %.3f' % MAE)
print('Mean Squared Error: %.3f' % MSE)
print('Root Mean Squared Error: %.3f' % RMSE)

Mean Absolute Error: 16.540
Mean Squared Error: 541.856
Root Mean Squared Error: 23.278


In [36]:
# to make a prediction, enter fuel_consumption and engine_size values
fuel_consumption = 9.6
engine_size = 2.0

# predict returns a 2-D array, so take the single value out before formatting
yhat = regr.predict([[fuel_consumption, engine_size]])
print("Predicted: %.3f" % yhat[0][0])

Predicted: 210.995
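
With several features, the prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below reproduces the result by hand in plain Python, using the intercept and coefficients printed for the two-variable model (the variable names are illustrative):

```python
# Values printed after training the two-variable model
intercept = 77.52641485
coefficients = [9.8607939, 19.40228938]  # fuel_consumption, engine_size

# Same inputs as the prediction cell
features = [9.6, 2.0]

# y = theta_0 + theta_1 * x_1 + theta_2 * x_2
yhat = intercept + sum(c * x for c, x in zip(coefficients, features))
print("Predicted: %.3f" % yhat)  # Predicted: 210.995
```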


### Multiple Linear Regression (3 variables)

- Using 3 variables: `fuel_consumption`, `engine_size` and `cylinders`.

In [37]:
# Select features and target
X = ['fuel_consumption', 'engine_size', 'cylinders']
Y = ['co2_emissions']

# create an instance of the LinearRegression
regr = linear_model.LinearRegression()

# Convert the columns from dataframes to numpy arrays
train_x = np.asanyarray(df_train[X])
train_y = np.asanyarray(df_train[Y])

# Train the model using the training sets
regr.fit(train_x, train_y)

# These are the coefficients and intercept of the best-fit hyperplane
intercept = regr.intercept_
coefficients = regr.coef_

print('Intercept: ', intercept)
print('Coefficients: ', coefficients)

Intercept:  [65.39417241]
Coefficients:  [[ 9.73505283 11.14798722  7.09354891]]


In [38]:
# Convert the columns from dataframes to numpy arrays
test_x = np.asanyarray(df_test[X])
test_y = np.asanyarray(df_test[Y])

# Make predictions using the testing set
predictions = regr.predict(test_x)

# compare the predicted values with the real values
MAE = metrics.mean_absolute_error(test_y, predictions)
MSE = metrics.mean_squared_error(test_y, predictions)
RMSE = np.sqrt(MSE)

print('Mean Absolute Error: %.3f' % MAE)
print('Mean Squared Error: %.3f' % MSE)
print('Root Mean Squared Error: %.3f' % RMSE)

Mean Absolute Error: 16.238
Mean Squared Error: 514.250
Root Mean Squared Error: 22.677
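
Since the goal is the regression with the lowest error, the three test-set RMSE values reported in this notebook can be compared directly. A small sketch (the dictionary and its labels are illustrative; the numbers are the ones printed above):

```python
# Test-set RMSE reported for each model in this notebook
rmse_by_model = {
    "fuel_consumption": 28.148,
    "fuel_consumption + engine_size": 23.278,
    "fuel_consumption + engine_size + cylinders": 22.677,
}

# Model with the lowest RMSE
best = min(rmse_by_model, key=rmse_by_model.get)
print(best, rmse_by_model[best])

# Relative improvement over the single-feature model
baseline = rmse_by_model["fuel_consumption"]
improvement = (baseline - rmse_by_model[best]) / baseline
print("%.1f%%" % (improvement * 100))  # 19.4%
```

Most of the gain comes from adding `engine_size`; adding `cylinders` lowers the RMSE only slightly further on this split.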
