# Linear regression in Python

The following analysis uses data from http://www-bcf.usc.edu/~gareth/ISL/data.html
https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb

## Part 1: Loading libraries and data

Loading the libraries

In [79]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

from sklearn import linear_model

Loading the data (the '?' character marks missing values, read as NaN, in the dataset)

In [5]:
auto = pd.read_csv('data/Auto.csv', na_values='?')
auto.head()

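|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |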


In [45]:
auto.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64

Drop the 5 rows containing NaN values

In [46]:
auto = auto.dropna()
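
To make the effect of `na_values='?'` and `dropna()` concrete, here is a minimal sketch on a made-up three-row CSV. The inline data and the names `toy` and `toy_clean` are illustrative assumptions, not part of the Auto dataset.

```python
import io
import pandas as pd

# Made-up sample: the second row has '?' in the horsepower column
csv_text = "mpg,horsepower\n18.0,130\n25.0,?\n16.0,150\n"

# '?' is parsed as NaN, so horsepower stays a numeric (float) column
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')

missing_per_column = toy.isnull().sum()  # number of NaN values per column
toy_clean = toy.dropna()                 # drop every row with at least one NaN

print(missing_per_column)
print(len(toy), '->', len(toy_clean))
```

Without `na_values='?'`, pandas would read the whole `horsepower` column as strings, and `isnull()` would not flag the `'?'` entries.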

In [47]:
auto.describe(include = 'all')

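|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|------|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392 |
| unique |  |  |  |  |  |  |  |  | 301 |
| top |  |  |  |  |  |  |  |  | toyota corolla |
| freq |  |  |  |  |  |  |  |  | 5 |
| mean | 23.445918 | 5.471939 | 194.41199 | 104.469388 | 2977.584184 | 15.541327 | 75.979592 | 1.576531 |  |
| std | 7.805007 | 1.705783 | 104.644004 | 38.49116 | 849.40256 | 2.758864 | 3.683737 | 0.805518 |  |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |  |
| 25% | 17.0 | 4.0 | 105.0 | 75.0 | 2225.25 | 13.775 | 73.0 | 1.0 |  |
| 50% | 22.75 | 4.0 | 151.0 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |  |
| 75% | 29.0 | 8.0 | 275.75 | 126.0 | 3614.75 | 17.025 | 79.0 | 2.0 |  |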


Computing the correlation matrix

In [48]:
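# Keep only numeric columns: recent pandas versions raise an error on the text column 'name'
auto.select_dtypes(include='number').corr()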

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|
| mpg | 1.0 | -0.777618 | -0.805127 | -0.778427 | -0.832244 | 0.423329 | 0.580541 | 0.565209 |
| cylinders | -0.777618 | 1.0 | 0.950823 | 0.842983 | 0.897527 | -0.504683 | -0.345647 | -0.568932 |
| displacement | -0.805127 | 0.950823 | 1.0 | 0.897257 | 0.932994 | -0.5438 | -0.369855 | -0.614535 |
| horsepower | -0.778427 | 0.842983 | 0.897257 | 1.0 | 0.864538 | -0.689196 | -0.416361 | -0.455171 |
| weight | -0.832244 | 0.897527 | 0.932994 | 0.864538 | 1.0 | -0.416839 | -0.30912 | -0.585005 |
| acceleration | 0.423329 | -0.504683 | -0.5438 | -0.689196 | -0.416839 | 1.0 | 0.290316 | 0.212746 |
| year | 0.580541 | -0.345647 | -0.369855 | -0.416361 | -0.30912 | 0.290316 | 1.0 | 0.181528 |
| origin | 0.565209 | -0.568932 | -0.614535 | -0.455171 | -0.585005 | 0.212746 | 0.181528 | 1.0 |


Effect of a few factors on fuel economy (mpg)

In [49]:
fig, axs = plt.subplots(1, 4, sharey=True, figsize=(16, 8))

# One scatter plot of mpg against each candidate predictor
for ax, col in zip(axs, ['horsepower', 'weight', 'cylinders', 'acceleration']):
    auto.plot(kind='scatter', x=col, y='mpg', ax=ax)

[Figure: four scatter plots of mpg against horsepower, weight, cylinders and acceleration, sharing the y-axis]

Horsepower, weight and the number of cylinders all have a strong effect on fuel economy: the higher these values, the lower the mpg, that is, the higher the fuel consumption.
Conversely, mpg rises with the acceleration value, although this relationship is much weaker (a correlation of 0.42 with mpg, against roughly -0.8 for the other three).
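
Because `mpg` is miles per gallon, a higher value means lower fuel consumption, which is easy to read the wrong way round. The sketch below converts a few `mpg` values from the summary table into litres per 100 km. It assumes US gallons; the function name `mpg_to_l_per_100km` is illustrative.

```python
# Exact unit definitions; US gallon is an assumption about the dataset's units
KM_PER_MILE = 1.609344
LITRES_PER_US_GALLON = 3.785411784

def mpg_to_l_per_100km(mpg):
    """Convert fuel economy (miles per US gallon) to consumption (L/100 km)."""
    litres_per_mile = LITRES_PER_US_GALLON / mpg
    return litres_per_mile / KM_PER_MILE * 100

# Minimum, median and third quartile of mpg from the describe() output
for mpg in (9.0, 22.75, 29.0):
    print(mpg, 'mpg =', round(mpg_to_l_per_100km(mpg), 1), 'L/100 km')
```

The conversion is an inverse relationship, so a negative coefficient on `mpg` corresponds to a predictor that increases consumption.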

## Building the models

In [50]:
lm = smf.ols(formula='mpg ~ horsepower', data=auto).fit()

print(lm.params)
print()

Intercept     39.935861
horsepower    -0.157845
dtype: float64
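
The two fitted parameters define a straight line, so a prediction can be reproduced by hand. The sketch below copies the coefficients from the output above; the function name `predict_mpg` and the input value of 98 horsepower are illustrative choices.

```python
# Coefficients copied by hand from the fitted model mpg ~ horsepower
intercept = 39.935861
slope = -0.157845

def predict_mpg(horsepower):
    """Predicted mpg on the fitted line: intercept + slope * horsepower."""
    return intercept + slope * horsepower

# 98 hp is an arbitrary illustrative input
print(round(predict_mpg(98), 2))
```

Each additional unit of horsepower lowers the predicted mpg by about 0.16.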



In [51]:
lm = smf.ols(formula='mpg ~ weight', data=auto).fit()

print(lm.params)
print(lm.rsquared)

Intercept    46.216525
weight       -0.007647
dtype: float64
0.692630433121


In [52]:
lm = smf.ols(formula='mpg ~ cylinders', data=auto).fit()

print(lm.params)
print(lm.rsquared)

Intercept    42.915505
cylinders    -3.558078
dtype: float64
0.604688988944


In [53]:
lm = smf.ols(formula='mpg ~ acceleration', data=auto).fit()

print(lm.params)
print(lm.rsquared)

Intercept       4.833250
acceleration    1.197624
dtype: float64
0.179207050156
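
For a regression with a single predictor and an intercept, R² equals the square of the Pearson correlation between the predictor and the response. The sketch below checks this against numbers printed in this notebook. The values are copied by hand from the correlation matrix and the model outputs, and the dictionary names are illustrative.

```python
# Correlation of each predictor with mpg, from the correlation matrix
corr_with_mpg = {
    'horsepower': -0.778427,
    'weight': -0.832244,
    'cylinders': -0.777618,
    'acceleration': 0.423329,
}

# R² values printed by the single-predictor models (none was printed for horsepower)
reported_r2 = {
    'weight': 0.692630433121,
    'cylinders': 0.604688988944,
    'acceleration': 0.179207050156,
}

squared_corr = {name: r ** 2 for name, r in corr_with_mpg.items()}

for name, r2 in squared_corr.items():
    print(name, round(r2, 4), reported_r2.get(name, 'not printed'))
```

The squared correlations match the reported R² values to within rounding. By the same identity, the horsepower model should have an R² of about 0.606.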


In [54]:
lm = smf.ols(formula='mpg ~ weight + cylinders + horsepower', data=auto).fit()

print(lm.params)
print(lm.rsquared)

Intercept     45.736817
weight        -0.005272
cylinders     -0.388974
horsepower    -0.042728
dtype: float64
0.707651894313


In [55]:
lm = smf.ols(formula='mpg ~ weight + cylinders + horsepower + acceleration', data=auto).fit()

print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.708
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     234.2
Date:                Thu, 05 Jan 2017   Prob (F-statistic):          6.02e-102
Time:                        20:56:09   Log-Likelihood:                -1120.1
No. Observations:                 392   AIC:                             2250.
Df Residuals:                     387   BIC:                             2270.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       46.2740      2.448     18.902   

Adding acceleration barely increases R² (from 0.7077 to 0.708), and its coefficient has a very high p-value. The predictors weight, cylinders and horsepower therefore seem sufficient. Now let's test the accuracy of this model.
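
R² never decreases when a predictor is added, so adjusted R², which penalises the number of predictors, is a fairer basis for comparison. The sketch below applies the standard formula to the values reported above. The helper name `adjusted_r2` is illustrative, and the four-predictor R² is only available rounded to three decimals.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

n = 392  # number of observations in the full dataset

adj_3 = adjusted_r2(0.707651894313, n, 3)  # weight + cylinders + horsepower
adj_4 = adjusted_r2(0.708, n, 4)           # plus acceleration (R² as rounded in the summary)

print(round(adj_3, 4), round(adj_4, 4))
```

Both values come out at roughly 0.705, in line with the `Adj. R-squared` reported in the summary, so the adjusted measure gives no clear reason to prefer the larger model.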

Splitting the data into training and test sets

In [56]:
train = auto.sample(frac=0.8,random_state=200)
test = auto.drop(train.index)
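
The split works because `sample` keeps the original index labels, so `drop(train.index)` removes exactly the sampled rows. The sketch below runs the same two calls on a stand-in frame with 392 rows. The frame `toy` and its column `x` are made up for illustration.

```python
import pandas as pd

# Stand-in frame with the same number of rows as the cleaned dataset
toy = pd.DataFrame({'x': range(392)})

toy_train = toy.sample(frac=0.8, random_state=200)  # random 80% of the rows
toy_test = toy.drop(toy_train.index)                # the remaining 20%

print(len(toy_train), len(toy_test))
# The two parts share no index label
print(toy_train.index.intersection(toy_test.index).empty)
```

The training size of 314 matches the `No. Observations` value in the summary of the model fitted on `train`.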

In [57]:
lm = smf.ols(formula='mpg ~ weight + cylinders + horsepower + acceleration', data=train).fit()

print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.699
Model:                            OLS   Adj. R-squared:                  0.695
Method:                 Least Squares   F-statistic:                     179.6
Date:                Thu, 05 Jan 2017   Prob (F-statistic):           2.68e-79
Time:                        20:56:09   Log-Likelihood:                -899.94
No. Observations:                 314   AIC:                             1810.
Df Residuals:                     309   BIC:                             1829.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       49.7889      2.838     17.545   

In [62]:
predicted = lm.predict(test)

In [82]:
# sqrt(square(x)) is |x|, so this is the mean absolute error (MAE), not the MSE
MAE = np.abs(predicted - np.array(test['mpg'], dtype='float')).mean()

print(MAE)

3.06361335887
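
Mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE) are easy to confuse. The sketch below computes all three on made-up values. It also shows that taking the square root element by element, before averaging, gives the MAE rather than the RMSE. The arrays `y_true` and `y_pred` are illustrative.

```python
import numpy as np

# Made-up observations and predictions
y_true = np.array([20.0, 30.0, 25.0])
y_pred = np.array([22.0, 27.0, 25.0])
errors = y_pred - y_true  # [2, -3, 0]

mae = np.abs(errors).mean()     # mean of |error|
mse = np.square(errors).mean()  # mean of error squared
rmse = np.sqrt(mse)             # square root taken after averaging

# Square root taken before averaging: this is the MAE again
sqrt_then_mean = np.sqrt(np.square(errors)).mean()

print(mae, mse, rmse, sqrt_then_mean)
```

MAE and RMSE are both expressed in the units of the response (here mpg). RMSE weights large errors more heavily.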
