# Regression

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data exploration</a></span></li><li><span><a href="#Data-preparation" data-toc-modified-id="Data-preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data preparation</a></span><ul class="toc-item"><li><span><a href="#Predictors,-target" data-toc-modified-id="Predictors,-target-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Predictors, target</a></span></li><li><span><a href="#Train-/-Test-split" data-toc-modified-id="Train-/-Test-split-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Train / Test split</a></span></li></ul></li><li><span><a href="#Linear-regression" data-toc-modified-id="Linear-regression-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Linear regression</a></span><ul class="toc-item"><li><span><a href="#Train-model" data-toc-modified-id="Train-model-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Train model</a></span></li><li><span><a href="#Evaluate-model" data-toc-modified-id="Evaluate-model-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Evaluate model</a></span></li><li><span><a href="#Cross-validation" data-toc-modified-id="Cross-validation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Cross validation</a></span></li></ul></li></ul></div>

 * **Regression** models are used when the target variable is **quantitative**: 
  - salaries
  - gas emissions
  - age of person in a picture
  - ...
 * **Classification** models are used when the target variable is **qualitative**: 
  - surviving (or not) the Titanic
  - paying back (or not) a loan
  - identifying a dog (or not) in a picture
  - deciding which of three plant species a specimen belongs to
  - ...

## Data exploration

In [2]:
import pandas as pd

In [3]:
# load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

In [4]:
print(load_boston().get("DESCR"))

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [5]:
data = load_boston()  # load the dataset once and reuse it
boston = pd.DataFrame(data.data, columns=data.feature_names)
boston['MEDV'] = data.target  # append the target as the last column

In [6]:
boston.head()

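| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |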


In [7]:
boston.shape

(506, 14)

In [8]:
boston.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

## Data preparation

### Predictors, target

In [9]:
X = boston.drop('MEDV', axis=1)

In [10]:
X.shape

(506, 13)

In [11]:
X.head()

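| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |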


In [12]:
y = boston.MEDV

In [13]:
y.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

In [14]:
y.shape

(506,)

### Train / Test split

In [17]:
from sklearn.model_selection import train_test_split

In [47]:
# hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=66)

**NOTE**: `train_test_split` splits the data at random. The `random_state` parameter sets a seed that makes the split reproducible (the same split every time).
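To make the role of the seed concrete, here is a minimal dependency-free sketch of a seeded split. The helper `seeded_split` is hypothetical and is not scikit-learn's implementation, so the selected rows will differ from those of `train_test_split`. It only illustrates that a fixed seed reproduces the same split, and that rounding the test size up gives 102 test rows out of 506.

```python
import math
import random


def seeded_split(n_samples, test_size=0.2, seed=66):
    """Return (train_idx, test_idx) from a seeded shuffle of the row indices."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    # Round the test size up, so 20% of 506 rows gives 102 test rows.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_a, test_a = seeded_split(506, seed=66)
train_b, test_b = seeded_split(506, seed=66)

print(len(train_a), len(test_a))  # 404 102
print(test_a == test_b)           # True: same seed, same split
```

Changing `seed` produces a different shuffle, and therefore a different train / test partition.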

In [48]:
X_train.shape

(404, 13)

In [49]:
X_test.shape

(102, 13)

In [50]:
y_train.shape

(404,)

In [51]:
y_test.shape

(102,)

## Linear regression

### Train model

In [52]:
from sklearn.linear_model import LinearRegression

In [53]:
lr = LinearRegression()

We fit the model using only the training data

In [54]:
lr.fit(X_train, y_train)

LinearRegression()

Let's inspect the fitted model

In [55]:
lr.coef_

array([-1.29124243e-01,  4.60162200e-02,  2.46404449e-02,  2.12576822e+00,
       -1.82819798e+01,  3.35436229e+00, -6.18071832e-03, -1.64225594e+00,
        3.13694837e-01, -1.31019350e-02, -1.02213803e+00,  8.47049428e-03,
       -5.26472460e-01])

In [56]:
lr.intercept_

42.5862973050686

### Evaluate model

First, we predict the target for the test entries

In [57]:
y_pred = lr.predict(X_test)

In [58]:
y_pred[:10]

array([11.85230551, 34.84114546, 30.1070265 , 42.82578231, 19.19547111,
       18.18205574, 23.13229461, 21.28863276, 37.27498141, 14.59726415])

Next, we compare the predictions with the actual values

In [59]:
y_test[:10]

413    16.3
98     43.8
0      24.0
257    50.0
247    20.5
465    19.9
170    17.4
483    21.8
232    41.7
156    13.1
Name: MEDV, dtype: float64
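To see where a prediction comes from, it can be recomputed by hand as the intercept plus the sum of each coefficient times the matching feature value. The sketch below copies the `lr.coef_` and `lr.intercept_` values printed earlier and applies them to the row with index 0 (the first row of `boston.head()`), which is the third entry of the test set shown above. The printed coefficients are rounded, so the result should agree with the third value of `y_pred` only up to small rounding error.

```python
# Coefficients and intercept copied from the lr.coef_ / lr.intercept_ outputs.
coef = [-1.29124243e-01, 4.60162200e-02, 2.46404449e-02, 2.12576822e+00,
        -1.82819798e+01, 3.35436229e+00, -6.18071832e-03, -1.64225594e+00,
        3.13694837e-01, -1.31019350e-02, -1.02213803e+00, 8.47049428e-03,
        -5.26472460e-01]
intercept = 42.5862973050686

# Row with index 0, in column order
# CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT.
row_0 = [0.00632, 18.0, 2.31, 0.0, 0.538, 6.575, 65.2, 4.09, 1.0, 296.0,
         15.3, 396.9, 4.98]

# Linear regression prediction: intercept + sum(coefficient * feature value).
prediction = intercept + sum(c * x for c, x in zip(coef, row_0))
print(round(prediction, 3))  # close to 30.107, the third value of y_pred
```

The actual value for that row is 24.0, so the model overestimates it by about 6.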

In [60]:
from sklearn import metrics
import numpy as np

In [61]:
metrics.mean_squared_error(y_test, y_pred).round(3)

15.786

In [62]:
metrics.r2_score(y_test, y_pred).round(3)

0.811
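As a reminder of what these two metrics measure, here is a small dependency-free sketch of the formulas. The function names `mse` and `r2` and the toy values are illustrative assumptions, not part of scikit-learn or of the Boston data.

```python
def mse(y_true, y_hat):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)


def r2(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot: the share of variance the model explains."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, chosen only so the arithmetic is easy to follow.
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]

print(mse(y_true, y_hat))  # (1 + 0 + 4) / 3
print(r2(y_true, y_hat))   # 1 - 5 / 8 = 0.375
```

A lower MSE is better, while R² equals 1 for perfect predictions and 0 for a model that always predicts the mean of `y_true`.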

Let's see how the model performs on the training set (less important, since the model has already seen this data)

In [63]:
y_pred_train = lr.predict(X_train)

In [64]:
metrics.r2_score(y_train, y_pred_train).round(3)

0.721

In [65]:
metrics.mean_squared_error(y_train, y_pred_train).round(3)

23.648

Now try another train / test split (for example, with a different `random_state`) and compare the scores...
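The effect of the split can be illustrated without the Boston data. The sketch below is a simplified stand-in, assuming synthetic data with a single predictor and a hand-written least-squares fit; `fit_line` and `holdout_mse` are hypothetical helpers. It refits the model on several seeded splits and prints the test MSE for each one.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_mse(xs, ys, seed, test_fraction=0.2):
    """Shuffle with the given seed, hold out a test set, return the test MSE."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = int(test_fraction * len(xs))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    slope, intercept = fit_line([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
    errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
    return sum(errors) / len(errors)


# Synthetic data (an assumption for illustration): y = 3x + 5 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3 * x + 5 + rng.gauss(0, 2) for x in xs]

scores_by_seed = {seed: holdout_mse(xs, ys, seed) for seed in (1, 2, 3, 66)}
for seed, score in scores_by_seed.items():
    print(seed, round(score, 3))
```

Each seed holds out different rows, so the test MSE changes from one split to the next even though the data and the model stay the same.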

### Cross validation

Why trust one particular 80% / 20% split made by `train_test_split`? It is better to repeat the evaluation over several splits and average the results.

In [66]:
from sklearn.model_selection import cross_val_score

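By default, `cross_val_score` uses the estimator's own `score` method, which for a regressor is R². Below we request the negative mean squared error instead; scikit-learn negates the MSE so that a higher score is always better.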

In [73]:
lr = LinearRegression()

In [95]:
scores = cross_val_score(lr, X, y, scoring='neg_mean_squared_error', cv=5)

In [96]:
scores

array([-12.46030057, -26.04862111, -33.07413798, -80.76237112,
       -33.31360656])

The test score depends strongly on the particular train / test split!

The average over the 5 splits gives a better idea of future performance
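With an integer `cv` and a regressor, scikit-learn builds the folds with `KFold` without shuffling, so each test fold is a contiguous block of rows. The sketch below reproduces that index layout for 506 rows; `kfold_indices` is a hypothetical helper written for illustration, not the library's code. If the rows are ordered in some meaningful way, contiguous folds can look quite different from one another, which may partly explain the spread in the scores above.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Print train size, test size, and the first and last test row of each fold.
for train_idx, test_idx in kfold_indices(506, n_splits=5):
    print(len(train_idx), len(test_idx), test_idx[0], test_idx[-1])
```

Every row lands in exactly one test fold, so each of the five scores is computed on rows the corresponding model never saw during fitting.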

In [97]:
np.mean(scores)

-37.13180746769886
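Since the scores are negative MSE values, flipping the sign recovers the usual MSE, and taking the square root gives an error in the same units as the target. A short sketch using the five fold scores printed above:

```python
import math

# Fold scores copied from the cross_val_score output (negative MSE).
scores = [-12.46030057, -26.04862111, -33.07413798, -80.76237112, -33.31360656]

mse_per_fold = [-s for s in scores]  # flip the sign back to plain MSE
mean_mse = sum(mse_per_fold) / len(mse_per_fold)
rmse = math.sqrt(mean_mse)           # root of the averaged MSE, in target units

print(round(mean_mse, 3))  # 37.132
print(round(rmse, 3))      # about 6.094
```

The cross-validated MSE of about 37.1 is much higher than the 15.786 obtained on the single test split, which shows how optimistic one lucky split can be.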