# Linear regression

## Simple linear regression

As we have seen, we will fit a linear model between a predictor variable and the dependent variable. Linear regression is used for many purposes, such as forecasting and estimation.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("mpg.csv")
df = df.loc[df["horsepower"] != "?"]
df["horsepower"] = df["horsepower"].apply(int)
df.head(1)

| Unnamed: 0 | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
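
The cleaning step above (dropping the rows where horsepower is "?" and converting the rest to integers) can also be written with pd.to_numeric. This is only a sketch on a small made-up dataframe, not on the real mpg.csv; the names raw and clean are ours.

```python
import pandas as pd

# Small made-up sample standing in for mpg.csv (values are illustrative only)
raw = pd.DataFrame({
    "horsepower": ["130", "?", "150", "95"],
    "weight": [3504, 2046, 3433, 2372],
})

# errors="coerce" turns anything that is not a number (here "?") into NaN
raw["horsepower"] = pd.to_numeric(raw["horsepower"], errors="coerce")

# Drop the rows where horsepower could not be parsed, then cast to int
clean = raw.dropna(subset=["horsepower"]).copy()
clean["horsepower"] = clean["horsepower"].astype(int)

print(clean)
```

Compared with filtering on the literal "?", this version also catches any other non-numeric entry, which may or may not be what you want.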


For now, we manually split the dataframe into a training set and a test set using NumPy's random module.

In [3]:
msk = np.random.rand(len(df)) < 0.8 # boolean array; each entry is True with probability 0.8
msk[0:9]

array([ True,  True,  True, False,  True,  True,  True,  True,  True])

In [4]:
train = df.loc[msk]
test = df.loc[~msk] # in pandas, ~ negates a boolean mask
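
The random mask gives a different split on every run, and only roughly 80% of the rows end up in the training set. As a sketch of two repeatable alternatives (the dataframe df_demo and the seed 42 are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the cleaned dataframe
df_demo = pd.DataFrame({"weight": np.arange(100), "horsepower": np.arange(100) * 2})

# Option 1: the same mask idea, but with a seeded generator so the split is repeatable
rng = np.random.default_rng(42)
msk_demo = rng.random(len(df_demo)) < 0.8
train_demo, test_demo = df_demo.loc[msk_demo], df_demo.loc[~msk_demo]

# Option 2: scikit-learn's helper, which gives an exact 80/20 split
train_sk, test_sk = train_test_split(df_demo, test_size=0.2, random_state=42)
print(len(train_sk), len(test_sk))
```

Either option should work here; the helper is convenient when the exact proportion matters.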

We fit the linear model with the scikit-learn package as follows: first we create an empty model, then we pass our data to it.

In [5]:
from sklearn import linear_model
reg1 = linear_model.LinearRegression()
# scikit-learn expects 2-D arrays, so we convert the pandas columns into NumPy arrays
train_x = np.asanyarray(train[["weight"]])   # the double brackets select a DataFrame, giving shape (n, 1)
train_y = np.asanyarray(train[["horsepower"]])
train_y[1:5]

array([[165],
       [150],
       [140],
       [198]], dtype=int64)

In [6]:
fitted_reg = reg1.fit(train_x, train_y)
type(fitted_reg)

sklearn.linear_model._base.LinearRegression

In [7]:
print(fitted_reg.coef_,fitted_reg.intercept_)

[[0.03914781]] [-12.32316676]
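
The two printed numbers are the slope and the intercept of the fitted line, so a prediction is just intercept + slope * x. A minimal sketch on made-up data (the points lie exactly on y = 2x + 1, so the values are easy to check by hand):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * x + 1.0

model = LinearRegression().fit(x, y)

slope = model.coef_[0, 0]        # coef_ has shape (1, 1) because y is 2-D
intercept = model.intercept_[0]  # intercept_ has shape (1,) for the same reason

# Computing the prediction by hand gives the same value as predict()
x_new = np.array([[10.0]])
manual = intercept + slope * x_new
print(manual, model.predict(x_new))
```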


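To get a full regression summary like the one R prints, we have to use a different tool, statsmodels, because scikit-learn's LinearRegression does not report standard errors or p-values. Note that sm.OLS expects the dependent variable first and does not add an intercept on its own. The call below passes weight first and horsepower second, so the summary that follows describes weight regressed on horsepower without an intercept (hence the "uncentered" R-squared), not the model fitted above.
Afterwards, we will see how the scikit-learn model works on our test data.

In [8]:
import statsmodels.api as sm
# sm.OLS takes (endog, exog): here train_x (weight) is the dependent variable
# and train_y (horsepower) is the only regressor, with no intercept term
mod = sm.OLS(train_x,train_y)
fii = mod.fit()
fii.summary2()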

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.967 |
| Dependent Variable: | y | AIC: | 4928.9913 |
| Date: | 2021-03-16 10:11 | BIC: | 4932.7533 |
| No. Observations: | 318 | Log-Likelihood: | -2463.5 |
| Df Model: | 1 | F-statistic: | 9447.0 |
| Df Residuals: | 317 | Prob (F-statistic): | 5.27e-238 |
| R-squared (uncentered): | 0.968 | Scale: | 314570.0 |

|  | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x1 | 27.3820 | 0.2817 | 97.1945 | 0.0000 | 26.8278 | 27.9363 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 94.013 | Durbin-Watson: | 0.974 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 365.214 |
| Skew: | -1.222 | Prob(JB): | 0.0 |
| Kurtosis: | 7.647 | Condition No.: | 1.0 |


In [9]:
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[["weight"]])   # the double brackets keep the array 2-D
res_x = np.asanyarray(test[["horsepower"]])   # observed horsepower for the test set

In [10]:
prediction = reg1.predict(test_x)
prediction = prediction.astype(np.int32) # cast to int because horsepower is an integer; note that astype truncates rather than rounds
prediction[0:3]

array([[122],
       [127],
       [ 82]])

In [11]:
print("Mean absolute error: {}".format(np.mean(np.absolute(prediction - res_x))))
print("Mean squared error (MSE): {}".format(np.mean((prediction - res_x) ** 2)))
print("R2-score: {}".format(r2_score(res_x, prediction)))

Mean absolute error: 12.58108108108108
Mean squared error (MSE): 287.3378378378378
R2-score: 0.7877690229167782
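
The same three quantities are available as ready-made functions in sklearn.metrics. A small sketch with made-up numbers, simple enough to check by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up observed values and predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 195.0, 260.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
r2 = r2_score(y_true, y_pred)              # note the order: observed values first

print(mae, mse, r2)
```

The errors here are 10, -10, -5 and 10, so the mean absolute error is 35 / 4 = 8.75 and the mean squared error is 325 / 4 = 81.25.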


## Multiple regression

The process is the same as before; we only need to take care of the NumPy array that we pass to the model:

In [12]:
train_x = np.asanyarray(train[["weight","cylinders","mpg"]]) 
train_y = np.asanyarray(train[["horsepower"]])
train_x[0:3]
# train_x is now a matrix with one column per predictor

array([[3504.,    8.,   18.],
       [3693.,    8.,   15.],
       [3436.,    8.,   18.]])

In [13]:
from sklearn import linear_model
reg2 = linear_model.LinearRegression() 
fitted_reg2 = reg2.fit(train_x, train_y)

In [14]:
print(fitted_reg2.coef_)

[[ 0.01946206  7.06244775 -0.91334255]]


In [15]:
test_x = np.asanyarray(test[["weight","cylinders","mpg"]]) 
res_x = np.asanyarray(test[["horsepower"]])

In [16]:
prediction = reg2.predict(test_x)
prediction = prediction.astype(np.int32)

In [17]:
print("Mean squared error (MSE): {}".format(np.mean((prediction - res_x) ** 2)))

Mean squared error (MSE): 267.55405405405406


We can also compute the coefficient of determination (R2) on the test set with the model's score method, which uses the unrounded predictions:

In [18]:
reg2.score(test_x, res_x)

0.8037692804648471
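
For LinearRegression, score returns the same R2 that r2_score computes from the raw predictions. Truncating the predictions to integers first, as we did above, changes the value slightly. A quick check on made-up data (the coefficients and the noise level are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up data with three predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# score() and r2_score() agree when given the same, unrounded predictions
r2_from_score = model.score(X, y)
r2_from_metric = r2_score(y, model.predict(X))

# Truncating the predictions to integers throws information away
r2_truncated = r2_score(y, model.predict(X).astype(np.int32))

print(r2_from_score, r2_from_metric, r2_truncated)
```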

## Polynomial regression

Instead of using a new kind of model, we add new columns to the matrix that we pass to the model. For degree 2 these columns hold the squares and the pairwise products of the original predictors, plus a constant column of ones.

In [19]:
train_x = np.asanyarray(train[["weight","cylinders","mpg"]]) 
train_y = np.asanyarray(train[["horsepower"]])
test_x = np.asanyarray(test[["weight","cylinders","mpg"]]) 
res_x = np.asanyarray(test[["horsepower"]])

The cell below is the only step that differs from what we did before:

In [20]:
from sklearn.preprocessing import PolynomialFeatures 
poly = PolynomialFeatures(degree=2)
poly_x = poly.fit_transform(train_x)
poly_x[0:1]

array([[1.0000000e+00, 3.5040000e+03, 8.0000000e+00, 1.8000000e+01,
        1.2278016e+07, 2.8032000e+04, 6.3072000e+04, 6.4000000e+01,
        1.4400000e+02, 3.2400000e+02]])
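
To see which term each of the ten columns corresponds to, we can look at the powers_ attribute of the fitted transformer. The sketch below rebuilds the row printed above from its three original values; the names list is our own labelling of the columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# weight, cylinders, mpg of the first training row shown above
row = np.array([[3504.0, 8.0, 18.0]])
poly_demo = PolynomialFeatures(degree=2)
expanded = poly_demo.fit_transform(row)

# powers_[i, j] is the exponent of input j in output column i
names = ["weight", "cylinders", "mpg"]
for exps, value in zip(poly_demo.powers_, expanded[0]):
    term = " * ".join(f"{n}^{e}" for n, e in zip(names, exps) if e > 0) or "1"
    print(f"{term:25s} {value:g}")
```

The order is: the constant, the three original predictors, and then the six degree-2 terms (weight squared, weight times cylinders, and so on).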

In [21]:
reg3 = linear_model.LinearRegression()
fitted_reg3 = reg3.fit(poly_x, train_y)
print(fitted_reg3.coef_)

[[ 0.00000000e+00  2.97297284e-02 -2.41168193e+01  1.83697357e+00
   5.86442432e-07 -1.77201348e-03 -3.13834472e-04  3.72030822e+00
  -3.85054953e-01 -6.47805344e-03]]
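
To predict with this model, the test matrix has to go through the same polynomial expansion first (with transform, not a new fit_transform). One way to avoid forgetting this step is to chain the two stages in a pipeline. This is a sketch on made-up data with an exactly quadratic target, so it says nothing about the fit on the mpg data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def target(X):
    """Made-up quadratic relationship used only for this example."""
    return 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2


rng = np.random.default_rng(2)
X_train = rng.uniform(0, 5, size=(80, 2))
X_test = rng.uniform(0, 5, size=(20, 2))

# The pipeline expands the features and fits the regression in one call,
# and applies the same expansion automatically when predicting or scoring
pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipe.fit(X_train, target(X_train))

r2_test = pipe.score(X_test, target(X_test))
print(r2_test)
```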


## p-values and summary

To see the details of our regression we have to import a different module, statsmodels, and fit the same model with it:

In [22]:
import statsmodels.api as sm
mod = sm.OLS(train_y,poly_x)
fii = mod.fit()
fii.summary2()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.81 |
| Dependent Variable: | y | AIC: | 2712.9446 |
| Date: | 2021-03-16 10:11 | BIC: | 2750.5651 |
| No. Observations: | 318 | Log-Likelihood: | -1346.5 |
| Df Model: | 9 | F-statistic: | 150.7 |
| Df Residuals: | 308 | Prob (F-statistic): | 3.00e-107 |
| R-squared: | 0.815 | Scale: | 287.85 |

|  | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 77.4736 | 92.8004 | 0.8348 | 0.4045 | -105.1294 | 260.0766 |
| x1 | 0.0297 | 0.0437 | 0.6803 | 0.4968 | -0.0563 | 0.1157 |
| x2 | -24.1168 | 15.6528 | -1.5407 | 0.1244 | -54.9168 | 6.6832 |
| x3 | 1.8370 | 3.4573 | 0.5313 | 0.5956 | -4.9659 | 8.6399 |
| x4 | 0.0000 | 0.0000 | 0.0935 | 0.9255 | -0.0000 | 0.0000 |
| x5 | -0.0018 | 0.0048 | -0.3710 | 0.7109 | -0.0112 | 0.0076 |
| x6 | -0.0003 | 0.0009 | -0.3569 | 0.7214 | -0.0020 | 0.0014 |
| x7 | 3.7203 | 1.2946 | 2.8736 | 0.0043 | 1.1728 | 6.2678 |
| x8 | -0.3851 | 0.3349 | -1.1498 | 0.2511 | -1.0440 | 0.2739 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 88.116 | Durbin-Watson: | 1.32 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 268.937 |
| Skew: | 1.231 | Prob(JB): | 0.0 |
| Kurtosis: | 6.774 | Condition No.: | 1085390501.0 |
