# Single and Multiple Linear Regression

This project uses a dataset in Microsoft Excel format. The file 'cars.xls' contains 9 variables and 392 entries.

In [2]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler

In [3]:
# Loading the 'cars' dataset and displaying the first 5 rows
cars_df = pd.read_excel('cars.xls')
cars_df.head()

Unnamed: 0,Model,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130,3504,12.0,70,US
1,buick skylark 320,15.0,8,350.0,165,3693,11.5,70,US
2,plymouth satellite,18.0,8,318.0,150,3436,11.0,70,US
3,amc rebel sst,16.0,8,304.0,150,3433,12.0,70,US
4,ford torino,17.0,8,302.0,140,3449,10.5,70,US


In [4]:
# Displaying information about the dataset: the variables it contains and their types
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Model         392 non-null    object 
 1   MPG           392 non-null    float64
 2   Cylinders     392 non-null    int64  
 3   Displacement  392 non-null    float64
 4   Horsepower    392 non-null    int64  
 5   Weight        392 non-null    int64  
 6   Acceleration  392 non-null    float64
 7   Year          392 non-null    int64  
 8   Origin        392 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 27.7+ KB


### Single Linear Regression

Here one variable is chosen as the input and one as the output: Horsepower is the input and MPG (miles per gallon) is the output. In other words, given a car's horsepower, the model should predict its fuel economy (MPG).

In [5]:
y = cars_df.MPG
X = cars_df.Horsepower

The dataset is then split into a training part and a test part for each variable (X_train, X_test and y_train, y_test). This is done with scikit-learn's train_test_split function.

In [6]:
# Splitting of the dataset in training and testing sets
X_train, X_test, y_train, y_test = train_test_split(pd.DataFrame(X), y, test_size = 0.3, random_state = 42)

With test_size = 0.3, a random 30% of the rows are held out as the test set, so the model is later evaluated on cars it was not trained on. Setting random_state = 42 makes the split reproducible.
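
As a minimal sketch of what the split does, the example below runs train_test_split on a small made-up table. The values in `toy_df` and the names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration only and are not taken from 'cars.xls'.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Horsepower and MPG columns
toy_df = pd.DataFrame({
    "Horsepower": [130, 165, 150, 150, 140, 95, 88, 46, 110, 70],
    "MPG": [18.0, 15.0, 18.0, 16.0, 17.0, 24.0, 27.0, 26.0, 21.0, 32.0],
})

X_toy = toy_df[["Horsepower"]]  # double brackets keep X two-dimensional
y_toy = toy_df["MPG"]

# 30% of the rows are held out; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)

# Every row lands in exactly one of the two sets
all_rows = sorted(X_tr.index.tolist() + X_te.index.tolist())
print(all_rows)
```

No values are generated by the split: rows are only shuffled and assigned to one of the two sets, and X and y stay aligned by index.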

The model is then trained (fitted) on the training set using scikit-learn's LinearRegression class.

In [8]:
# Calling of the linear regression function and fitting of the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
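
To make concrete what fitting produces, here is a small sketch on made-up data that lies exactly on a known line. The data and the names `hp`, `mpg`, `line`, `slope`, `intercept` and `by_hand` are hypothetical; only `LinearRegression`, `coef_`, `intercept_` and `predict` come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line MPG = 40 - 0.15 * Horsepower
hp = pd.DataFrame({"Horsepower": [50.0, 100.0, 150.0, 200.0]})
mpg = 40.0 - 0.15 * hp["Horsepower"]

line = LinearRegression().fit(hp, mpg)

slope = line.coef_[0]        # change in MPG per extra unit of horsepower
intercept = line.intercept_  # predicted MPG at zero horsepower

# predict() is equivalent to applying the fitted line by hand
new_hp = pd.DataFrame({"Horsepower": [120.0]})
by_hand = intercept + slope * 120.0
print(slope, intercept, line.predict(new_hp)[0], by_hand)
```

On the real data, the same two attributes of `regressor` would give the fitted slope and intercept of the Horsepower-to-MPG line.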

Next, the output is predicted. y_prediction holds the predicted MPG for each Horsepower value in X_test.

In [9]:
y_prediction = regressor.predict(X_test)

# Display of the prediction values
print(y_prediction)

[29.60673686 21.95467101 31.10388018 29.44038761 26.77879949 28.60864132
 12.80546185 28.60864132 25.28165617 32.93372202  9.47847669 23.11911582
 15.30070071 28.60864132 18.96038438 26.44610097 25.28165617 28.60864132
 27.77689503 26.44610097 25.28165617 32.43467424 29.77308612 19.45943215
 29.93943538 26.11340245 26.61245023 23.61816359 33.43276979 28.44229206
 16.96419329 26.11340245 20.29117844 28.77499058 16.132447   29.93943538
 17.79593957 27.77689503 17.79593957  5.31974525 19.45943215 28.60864132
 28.77499058 28.10959355 16.46514551  5.31974525 24.44990988 29.60673686
 29.10768909 29.44038761 15.63339922 25.28165617 26.11340245 26.44610097
 22.45371879 22.7864173  22.7864173  24.44990988 25.28165617 25.28165617
  3.65625268 26.11340245 21.95467101 26.44610097 25.28165617 27.61054577
 26.44610097 29.44038761 22.7864173   8.14768263 29.10768909 11.97371556
 24.94895765 26.11340245 26.94514874 27.77689503 16.96419329 16.132447
 26.44610097 22.7864173  25.78070394 24.94895765 16.1

The model's performance can be measured using the root mean squared error (RMSE).

In [10]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

# Display of the root mean square error(RMSE)
print(RMSE)

4.955413560049773


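The RMSE is the square root of the mean of the squared differences between predictions and observations, and it is expressed in the same unit as the output. In this Single Linear Regression example, the typical prediction error on the test set is therefore roughly 5 MPG. Whether this counts as small depends on the range of MPG values in the dataset, so the figure is most useful as a baseline for comparison with other models.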
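
The definition can be checked by hand. The sketch below uses made-up `observed` and `predicted` arrays (assumptions for illustration, not values from the dataset) and compares a manual calculation with the one based on `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted MPG values
observed = np.array([18.0, 15.0, 24.0, 30.0])
predicted = np.array([20.0, 14.0, 21.0, 28.0])

# Square the errors, average them, then take the square root
errors = predicted - observed
rmse_by_hand = np.sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(observed, predicted))
print(rmse_by_hand, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.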

### Multiple Linear Regression

The aim now is to give the model a set of characteristics and have it predict MPG. MPG (miles per gallon) is again the output, but the input is no longer a single variable; it is a set of variables, which is why this is called Multiple Linear Regression.

The code is the same up to the point where the variables are selected.

In [11]:
y = cars_df.MPG
X = cars_df[['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Year']]

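The dataset is then split as before. Note that this call uses the defaults of train_test_split: 25% of the rows go to the test set and no random_state is set, so the split differs from the one used for the single regression and the numbers below will vary from run to run.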

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

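There is now more than one input variable, and the variables are measured on very different scales (for example Cylinders versus Weight), so the inputs are scaled to the range [0, 1] with a MinMaxScaler object. The scaler is fitted on the training set only and then applied to the test set. For ordinary least squares this step does not change the predictions, but it puts the coefficients on a comparable scale.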

In [13]:
scaler = MinMaxScaler()
X_train_sc = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_test_sc = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)
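
As a rough illustration of what the scaler computes, the sketch below reproduces min-max scaling by hand on two made-up features. The frames `train` and `test` and the names `mm`, `train_sc`, `test_sc` and `by_hand` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test rows with two features on different scales
train = pd.DataFrame({
    "Horsepower": [50.0, 100.0, 150.0],
    "Weight": [2000.0, 3000.0, 4000.0],
})
test = pd.DataFrame({
    "Horsepower": [75.0, 200.0],
    "Weight": [2500.0, 4500.0],
})

mm = MinMaxScaler()
train_sc = mm.fit_transform(train)  # learns min and max from the training rows
test_sc = mm.transform(test)        # reuses those training statistics

# Same result by hand: (x - min) / (max - min), using the training min and max
by_hand = (test - train.min()) / (train.max() - train.min())
print(test_sc)
print(by_hand.to_numpy())
```

Test values outside the training range fall outside [0, 1] after scaling, which is expected when the scaler is fitted on the training set only.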

The model is now trained on the scaled training set.

In [15]:
# Calling of the linear regression function and fitting of the model
regressor = LinearRegression()
regressor.fit(X_train_sc, y_train)

The output is then predicted and the model evaluated.

In [16]:
y_prediction = regressor.predict(X_test_sc)

# Display of the prediction values
print(y_prediction)

[25.73146391  7.43951976 32.18196458 21.00014858 22.53215943 12.07313305
 31.07802344 18.70570939 32.81541637 31.81168725 32.61043458 11.06012822
 27.94874796 30.68561148 32.83789505 19.18726691 28.47080114 30.5001502
 24.74054475 23.05386019 21.86033009 11.79604586 28.76799185 24.6604082
 28.5413625  33.49322741 29.39043222 21.48874234 12.23709543 11.19383148
 25.99637191 24.93856666 25.72009983  4.05423623 26.83585687 26.42815938
 21.31950279 20.15764453 22.6006999  23.11876225 25.73954047 28.97233255
 34.78222074 13.66498786 31.63817327 17.30324441 28.97269136 30.5674087
 32.93444306 34.82234891 20.71642317 23.01054837 33.51411853 28.25370044
  9.39730646 20.28635188 21.20303754 21.13816736 15.89709444 28.26170624
 26.46383173 29.65188255 26.97693107 16.3918127  25.02700416 15.47368771
 10.73896685 14.88627357 30.81196694 20.43942173 30.86883163 32.48057862
 22.8073551  33.32083108  9.52960052 20.81043883  9.5364579  28.96464728
 19.78387906 16.03247005 26.91240846  9.21439082 22.96

In [17]:
# Calculation of the root mean square error (RMSE)
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

# Display of the root mean square error(RMSE)
print(RMSE)

3.6013431683964


In the Multiple Linear Regression example the error is smaller (about 3.60 versus 4.96 MPG). This suggests that the model benefits from the additional inputs and predicts better than the Single Linear Regression model, although the two were evaluated on different test splits, so the comparison is indicative rather than exact. More generally, adding further related variables as inputs can improve predictions of a target variable.
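
A like-for-like comparison evaluates both models on the same held-out rows. The sketch below does this on synthetic data; the frame `synth`, its coefficients, and the helper `holdout_rmse` are assumptions made up for illustration and say nothing about the actual 'cars.xls' values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cars data: MPG depends on two inputs plus noise
rng = np.random.default_rng(0)
n = 200
synth = pd.DataFrame({
    "Horsepower": rng.uniform(50, 200, n),
    "Weight": rng.uniform(1800, 4500, n),
})
synth["MPG"] = (
    50 - 0.08 * synth["Horsepower"] - 0.005 * synth["Weight"]
    + rng.normal(0, 1, n)
)

# One split, shared by both models
train_df, test_df = train_test_split(synth, test_size=0.3, random_state=42)

def holdout_rmse(features):
    """Fit on the training rows and return the RMSE on the shared test rows."""
    model = LinearRegression().fit(train_df[features], train_df["MPG"])
    pred = model.predict(test_df[features])
    return np.sqrt(mean_squared_error(test_df["MPG"], pred))

rmse_single = holdout_rmse(["Horsepower"])
rmse_multi = holdout_rmse(["Horsepower", "Weight"])
print(rmse_single, rmse_multi)
```

Applied to the real dataset, the same pattern (one split, two feature lists) would remove the effect of the differing test sets from the comparison.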