# Implementing a Multiple-Variable Linear Regression with Sklearn

In [32]:
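import pandas as pd
import numpy as np
import sklearn
import sklearn.metrics          # explicit import so sklearn.metrics.mean_squared_error resolves
import sklearn.model_selection  # explicit import so sklearn.model_selection.train_test_split resolves
from sklearn.linear_model import LinearRegression
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook needs an older scikit-learn release.
from sklearn.datasets import load_boston

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns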

### Loading the data

In [25]:
b = load_boston()
boston = pd.DataFrame(b.data)
boston.columns = b.feature_names
boston['Price'] = b.target

In [99]:
print(b.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

## Creating the LinearRegression object

* [Sklearn - LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [26]:
lreg = LinearRegression()

**x_multi** holds the multiple variables on which we want to run the linear regression.

**y_target** is our target. We want to predict house prices from the variables in *x_multi*.

In [27]:
x_multi = boston.drop('Price', axis=1)  # axis=1 drops a column; the positional form is deprecated in pandas
y_target = boston.Price
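The feature/target separation above can be tried on a tiny DataFrame first. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and only the column names mimic the Boston data.

```python
import pandas as pd

# Hypothetical toy data; the column names only mimic the Boston DataFrame.
toy = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1],
    "AGE": [65.2, 78.9, 61.1],
    "Price": [24.0, 21.6, 34.7],
})

# axis=1 means "drop a column". drop() returns a new DataFrame,
# so `toy` itself still contains the Price column afterwards.
toy_features = toy.drop("Price", axis=1)
toy_target = toy["Price"]

print(toy_features.shape)  # (3, 2)
print(toy_target.shape)    # (3,)
```

Because `drop()` does not modify the original, the same DataFrame can still be used later to read the target column.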

### Below we see that the DataFrame has 13 columns. These columns are the *features* that will be used in the regression

In [28]:
x_multi.head()

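| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |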


### Features of boston

In [97]:
boston.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'Price'],
      dtype='object')

In [29]:
lreg.fit(x_multi,y_target)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [30]:
print("Intercept: {}".format(lreg.intercept_))
print('Number of coefficients used: {}'.format(len(lreg.coef_)))

Intercept: 36.4911032803616
Number of coefficients used: 13
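The intercept and the 13 coefficients together define the fitted model: a prediction is the intercept plus the dot product of the coefficients with the feature values. The sketch below checks this on synthetic data rather than on the Boston set; the names `X_demo`, `y_demo`, `true_coef` and `demo_model` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noiseless data with known parameters (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef

demo_model = LinearRegression().fit(X_demo, y_demo)

# predict() is equivalent to intercept_ + X @ coef_ for a linear model.
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_
print(np.allclose(manual_pred, demo_model.predict(X_demo)))  # True

# One coefficient per feature column.
print(len(demo_model.coef_))  # 3
```

On the Boston data the same relationship holds, with 13 entries in `coef_`, one per feature column.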


## Splitting the data into training and test sets

* **[sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**

### Defining the features

X is the Boston DataFrame without the Price column, so X holds the 13 features used to predict house prices.

In [85]:
X = boston.drop('Price',axis=1)

In [96]:
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X,boston.Price)
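Called without extra arguments, `train_test_split` shuffles the rows and holds out 25% of them for testing, so each run produces a different split. The sketch below uses small synthetic arrays (`X_demo` and `y_demo` are illustrative names) to show the default proportion and how `random_state` and `test_size` could be used to control the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 samples with 2 features; row i is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default behaviour: 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_te, X_te2))  # True

# test_size changes the held-out proportion.
X_tr3, X_te3, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_te3.shape)  # (20, 2)
```

Fixing `random_state` is optional, but without it the error values printed later in the notebook will change from run to run.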

In [87]:
lreg1 = LinearRegression()


lreg1.fit(X_train,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# Further reading on how to run predict correctly

* **[How to run linear regression in python scikit learn](http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/)**

* **[train-test-Split and Cross validation in python ](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)**

In [88]:
predict_train = lreg1.predict(X_train)
predict_test = lreg1.predict(X_test)

In [89]:
print("Mean Squared Error for train: {}".format(sklearn.metrics.mean_squared_error(Y_train,predict_train)))
print("Mean Squared Error for test: {}".format(sklearn.metrics.mean_squared_error(Y_test,predict_test)))

Mean Squared Error for train: 20.74001654650508
Mean Squared Error for test: 26.866481411598816
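The mean squared error is the average of the squared differences between actual and predicted values. As a minimal sketch, it can be computed by hand with NumPy; the helper `mse` and the three sample values below are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Differences are 1, 0 and -2, so the squares are 1, 0 and 4; their mean is 5/3.
error = mse(y_true, y_pred)
print(error)

# The square root (RMSE) is expressed in the same units as the target.
print(np.sqrt(error))
```

Taking the square root of the values reported above gives roughly 4.55 for the training set and 5.18 for the test set. Assuming the target is expressed in thousands of dollars, as in the standard Boston dataset description, this is the typical size of a prediction error.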


# Notes

1 - In `sklearn.model_selection.train_test_split(X,boston.Price)`, `X` is an `MxN` matrix, where `M` is the number of examples in the dataset and `N`, the number of columns, is the number of `features` on which we build our regression. `boston.Price` is the target: what we want to estimate, or predict, from the `features` defined in `X`.

2 - We pass `X_train` and `Y_train` to `LinearRegression.fit()` because we are telling the model: for the features described in `X_train`, the observed value was `Y_train`. This lets the `.fit()` method *learn* from those examples.

3 - To test our regression we call `LinearRegression.predict(X_test)`, so the evaluation uses data different from the data used for training.

4 - With the result of `LinearRegression.predict(X_test)` we can measure the error using the [Mean Squared Error](https://www.probabilitycourse.com/chapter9/9_1_5_mean_squared_error_MSE.php) together with `Y_test`, and see how far the predicted values are from the actual ones.
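The four steps above can be sketched without scikit-learn to show what happens under the hood. This version uses synthetic data and NumPy's ordinary least squares solver in place of `LinearRegression`; the data, the 25% hold-out and every variable name are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, known parameters plus a little noise.
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n_samples)

# Step 1: shuffle the row indices and hold out 25% of them for testing.
indices = rng.permutation(n_samples)
n_test = n_samples // 4
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, Y_train = X[train_idx], y[train_idx]
X_test, Y_test = X[test_idx], y[test_idx]

# Step 2: fit by ordinary least squares on the training rows only.
# A leading column of ones lets the solver estimate the intercept.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
weights, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)

# Step 3: predict for both sets; the test rows were never seen during fitting.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
predict_train = A_train @ weights
predict_test = A_test @ weights

# Step 4: compare predictions with the actual values via the mean squared error.
mse_train = np.mean((Y_train - predict_train) ** 2)
mse_test = np.mean((Y_test - predict_test) ** 2)
print(mse_train, mse_test)
```

The first entry of `weights` plays the role of `intercept_` and the remaining entries play the role of `coef_`. Comparing the training and test errors is what shows how well the model generalizes.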