<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Explorando-os-Dados" data-toc-modified-id="Explorando-os-Dados-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Exploring the Data</a></span></li><li><span><a href="#Aplicando-Transformações" data-toc-modified-id="Aplicando-Transformações-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Applying Transformations</a></span></li><li><span><a href="#Treinando-um-modelo" data-toc-modified-id="Treinando-um-modelo-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training a Model</a></span></li></ul></div>

    In this part, you will implement linear regression with multiple variables to
    predict the prices of houses. Suppose you are selling your house and you
    want to know what a good market price would be. One way to do this is to
    first collect information on recent houses sold and make a model of housing
    prices.
    The file ex1data2.txt contains a training set of housing prices in Portland, Oregon.
    The first column is the size of the house (in square feet), the
    second column is the number of bedrooms, and the third column is the price
    of the house.

In [2]:
# Import libraries and read the file
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('ex1data2.csv', names=['size', 'bedrooms', 'price'])
df.head()

Unnamed: 0,size,bedrooms,price
0,2104,3,399900
1,1600,3,329900
2,2400,3,369000
3,1416,2,232000
4,3000,4,539900


### Exploring the Data

In [12]:
def check_dataframe(df):
    """
    Go through the dataset and report its dimensions, missing data, duplicated rows and data types.
    """
    print('- - - - - DIMENSIONS - - - - -')
    print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns.')
    
    print('\n- - - - - MISSING DATA - - - - -')
    print('Does the dataset have missing values?')
    # any().any() is True if at least one cell is null
    missing = df.isnull().any().any()
    print(f'A: {missing}')
    if missing:
        print('\nHow many?')
        print(f'A: {df.isnull().sum().sum()}')
        print('\nIn which columns?')
        print(f'A: {df.isnull().sum()}')
    
    print('\n- - - - - DUPLICATED DATA - - - - -')
    print('Does the dataset have duplicated rows?')
    # any() is True if at least one row repeats an earlier one
    duplicated = df.duplicated().any()
    print(f'A: {duplicated}')
    if duplicated:
        print('\nHow many?')
        print(f'A: {df.duplicated().sum()}')
    
    print('\n- - - - - DATA TYPES - - - - -')
    print(f'{df.dtypes}')

In [13]:
# Call the function
check_dataframe(df)

- - - - - DIMENSIONS - - - - -
The dataset has 47 rows and 3 columns.

- - - - - MISSING DATA - - - - -
Does the dataset have missing values?
A: False

- - - - - DUPLICATED DATA - - - - -
Does the dataset have duplicated rows?
A: False

- - - - - DATA TYPES - - - - -
size        int64
bedrooms    int64
price       int64
dtype: object
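
The missing-data and duplicate checks can be tried on a tiny frame to see what each pandas expression reports. This is a minimal sketch; the frame `toy` and its values are made up for illustration and are not part of the housing data. `isnull().any().any()` flags any missing cell, whereas `isnull().all().any()` only flags a column that is missing in every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing cell and one repeated row
toy = pd.DataFrame({
    'size': [2104, 1600, 1600, np.nan],
    'bedrooms': [3, 3, 3, 2],
})

# True if at least one cell is missing anywhere in the frame
has_missing = toy.isnull().any().any()

# True only if some column is missing in every row
has_empty_column = toy.isnull().all().any()

# Number of missing cells per column, and in total
missing_per_column = toy.isnull().sum()
missing_total = int(missing_per_column.sum())

# True if at least one row repeats an earlier row
has_duplicates = toy.duplicated().any()
duplicate_count = int(toy.duplicated().sum())

print(has_missing, has_empty_column, missing_total, has_duplicates, duplicate_count)
```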


### Applying Transformations

In [14]:
# Split the data into training and test sets
train_set, test_set = train_test_split(df, test_size=.2, random_state=42)
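
As a rough check of what this split produces, the sketch below applies the same call to a stand-in array of 47 row indices (the array `rows` is illustrative, not the real data). With `test_size=.2`, scikit-learn rounds the test size up, so 47 rows give 10 test rows and 37 training rows, and a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 47 rows of the dataset (indices only)
rows = np.arange(47)

# Same arguments as the split above, run twice
train_a, test_a = train_test_split(rows, test_size=.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=.2, random_state=42)

# Sizes of the two parts
print(len(train_a), len(test_a))

# The same random_state gives the same partition
print(np.array_equal(test_a, test_b))
```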

In [15]:
train_set.head()

Unnamed: 0,size,bedrooms,price
8,1380,3,212000
3,1416,2,232000
6,1534,3,314900
40,1664,2,368500
33,3137,3,579900


In [16]:
test_set.head()

Unnamed: 0,size,bedrooms,price
27,2526,3,469000
39,2162,4,287000
26,1458,3,464500
43,1200,3,299000
24,3890,3,573900


In [21]:
# Split features and target
X_train = train_set.loc[:, ['size', 'bedrooms']]
X_test = test_set.loc[:, ['size', 'bedrooms']]

y_train = train_set['price']
y_test = test_set['price']

In [28]:
# Min-max scaling
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_train_scaler[:5]

array([[0.145615  , 0.5       ],
       [0.1555433 , 0.25      ],
       [0.18808605, 0.5       ],
       [0.22393822, 0.25      ],
       [0.63017099, 0.5       ]])
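
The scaled sizes above follow from the min-max formula `(x - min) / (max - min)`. The sketch below reproduces them by hand. It assumes the training set's smallest and largest sizes are 852 and 4478 square feet; those two values are inferred from the scaled output and are not printed anywhere in this notebook. The helper `min_max_scale` is illustrative.

```python
import numpy as np

def min_max_scale(values, v_min, v_max):
    """Map values linearly so that v_min becomes 0 and v_max becomes 1."""
    values = np.asarray(values, dtype=float)
    return (values - v_min) / (v_max - v_min)

# Sizes of the first five training rows shown earlier
sizes = [1380, 1416, 1534, 1664, 3137]

# Assumed training-set extremes (inferred, not printed in the notebook)
SIZE_MIN, SIZE_MAX = 852, 4478

scaled_sizes = min_max_scale(sizes, SIZE_MIN, SIZE_MAX)
print(scaled_sizes)
```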

In [30]:
# Standardization
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
X_train_std_scaler = std_scaler.fit_transform(X_train)
X_train_std_scaler[:5]

array([[-0.77562586, -0.16666667],
       [-0.72751943, -1.4       ],
       [-0.56983725, -0.16666667],
       [-0.3961196 , -1.4       ],
       [ 1.57223508, -0.16666667]])
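
Standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand on five rows only, so its numbers will not match the output above, which uses the statistics of the whole training set. The helper `standardize` is illustrative; it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

def standardize(X):
    """Center each column and divide by its population standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # ddof=0 by default
    return (X - mean) / std, mean, std

# Five (size, bedrooms) rows from the training sample shown earlier
X_toy = [[1380, 3], [1416, 2], [1534, 3], [1664, 2], [3137, 3]]

Z, col_mean, col_std = standardize(X_toy)

# Each standardized column has mean 0 and standard deviation 1
print(Z.mean(axis=0), Z.std(axis=0))
```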

### Training a Model


IMPORTANT: The Octave program predicted a price of \$293,081.46 for a house with a size of 1650 square feet and 3 bedrooms.

In [59]:
# Store the value in a variable for comparison
octave_pred = 293081.46

In [60]:
# Linear regression with multiple variables

# Training without feature scaling
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_test)

# Compute the error
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(f'lin_rmse: {lin_rmse}')
prediction = lin_reg.predict([[1650, 3]])
print(f'prediction for 1650 size and 3 bedrooms: {prediction}')
print(f'diff from octave prediction: {octave_pred - prediction[0]}')

lin_rmse: 92792.37331148326
prediction for 1650 size and 3 bedrooms: [280536.50711115]
diff from octave prediction: 12544.952888852858
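
The RMSE printed above is the square root of the mean squared difference between the true and predicted prices, so it is expressed in the same unit as the price. A minimal hand-written version, with made-up toy values and an illustrative helper name `rmse`:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical prices and predictions, for illustration only
y_true_toy = [300000, 200000]
y_pred_toy = [310000, 170000]

# Errors are -10000 and 30000, so the mean squared error is 5e8
toy_rmse = rmse(y_true_toy, y_pred_toy)
print(toy_rmse)
```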


In [78]:
# Training with min-max scaling
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaler, y_train)

# Note: a new scaler is fit on the test set here, so the test features are scaled
# with the test set's own min/max rather than the training set's
scaler = MinMaxScaler()
X_test_scaler = scaler.fit_transform(X_test)
y_pred = lin_reg.predict(X_test_scaler)

# Compute the error
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(f'lin_rmse: {lin_rmse}')
# Note: fit_transform on a single row returns all zeros for that row
X_pred_scaler = scaler.fit_transform([[1650, 3]])
prediction = lin_reg.predict(X_pred_scaler)
print(f'prediction for 1650 size and 3 bedrooms: {prediction}')
print(f'diff from octave prediction: {octave_pred - prediction[0]}')

lin_rmse: 93266.45833378336
prediction for 1650 size and 3 bedrooms: [193273.24739475]
diff from octave prediction: 99808.2126052519


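On this small dataset, min-max scaling does not improve the result: the RMSE is essentially unchanged (93266.46 versus 92792.37 without scaling), and the prediction for the 1650 sq ft, 3-bedroom house moves further from the Octave value.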
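
One thing worth checking in the scaled runs is how the test set and the query point are rescaled. The sketch below assumes the usual scikit-learn pattern of fitting a scaler on the training features only and then calling `transform` on everything else. The toy arrays reuse rows printed earlier in this notebook, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Toy data: first rows of the training and test samples shown earlier
X_train_toy = np.array([[1380, 3], [1416, 2], [1534, 3], [1664, 2], [3137, 3]], dtype=float)
y_train_toy = np.array([212000, 232000, 314900, 368500, 579900], dtype=float)
X_test_toy = np.array([[2526, 3], [2162, 4], [1458, 3]], dtype=float)
query = np.array([[1650, 3]], dtype=float)

# Fit the scaler once, on the training features only
toy_scaler = MinMaxScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)

# Reuse the fitted scaler for any later data
X_test_scaled = toy_scaler.transform(X_test_toy)
query_scaled = toy_scaler.transform(query)

scaled_model = LinearRegression().fit(X_train_scaled, y_train_toy)
raw_model = LinearRegression().fit(X_train_toy, y_train_toy)

# With a consistent scale, both models give the same predictions
pred_scaled = scaled_model.predict(query_scaled)
pred_raw = raw_model.predict(query)
print(np.allclose(pred_scaled, pred_raw))

# Refitting a scaler on a single row collapses that row to zeros,
# so the model then returns its intercept
refit_query = MinMaxScaler().fit_transform(query)
print(refit_query, scaled_model.predict(refit_query), scaled_model.intercept_)
```

In this sketch the scaled and unscaled models agree once the same fitted scaler is applied everywhere, which is expected for ordinary least squares with an intercept.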

In [77]:
# Training with standardization
lin_reg = LinearRegression()
lin_reg.fit(X_train_std_scaler, y_train)

# Note: a new scaler is fit on the test set here, so the test features are scaled
# with the test set's own mean and standard deviation rather than the training set's
std_scaler = StandardScaler()
X_test_std_scaler = std_scaler.fit_transform(X_test)
y_pred = lin_reg.predict(X_test_std_scaler)

# Compute the error
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(f'lin_rmse: {lin_rmse}')
# Note: fit_transform on a single row returns all zeros for that row
X_pred_scaler = std_scaler.fit_transform([[1650, 3]])
prediction = lin_reg.predict(X_pred_scaler)
print(f'prediction for 1650 size and 3 bedrooms: {prediction}')
print(f'diff from octave prediction: {octave_pred - prediction[0]}')

lin_rmse: 109412.53077611802
prediction for 1650 size and 3 bedrooms: [323170.16216216]
diff from octave prediction: -30088.702162162168


In [79]:
# Training and testing on the full dataset
lin_reg = LinearRegression()
X = df.loc[:, ['size', 'bedrooms']]
y = df.loc[:, 'price']
lin_reg.fit(X, y)

y_pred = lin_reg.predict(X)

# Compute the error
lin_mse = mean_squared_error(y, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(f'lin_rmse: {lin_rmse}')
prediction = lin_reg.predict([[1650, 3]])
print(f'prediction for 1650 size and 3 bedrooms: {prediction}')
print(f'diff from octave prediction: {octave_pred - prediction[0]}')

lin_rmse: 63926.2082498693
prediction for 1650 size and 3 bedrooms: [293081.4643349]
diff from octave prediction: -0.004334896104410291
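
The near-zero difference is what one would expect if the Octave program also solves ordinary least squares exactly; that is an assumption, since the Octave code is not shown here. The sketch below fits the same kind of model with NumPy's least-squares solver on the first five rows of the dataset only, so its coefficients are illustrative and will differ from the full-data fit. The helper names are hypothetical.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return [intercept, coef_1, coef_2, ...] minimizing the squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first parameter acts as the intercept
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

def predict(theta, X):
    """Apply the fitted parameters to new rows."""
    X = np.asarray(X, dtype=float)
    return theta[0] + X @ theta[1:]

# First five rows of the dataset, as printed by df.head()
X_head = [[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]]
y_head = [399900, 329900, 369000, 232000, 539900]

theta = fit_least_squares(X_head, y_head)
print(theta)
print(predict(theta, [[1650, 3]]))
```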


In [81]:
# Training and testing on the full dataset - APPLYING STANDARD SCALING
lin_reg = LinearRegression()
X = df.loc[:, ['size', 'bedrooms']]
y = df.loc[:, 'price']

std_scaler = StandardScaler()
X_scaler = std_scaler.fit_transform(X)

lin_reg.fit(X_scaler, y)

y_pred = lin_reg.predict(X_scaler)

# Compute the error
lin_mse = mean_squared_error(y, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(f'lin_rmse: {lin_rmse}')
# Note: fit_transform on a single row returns all zeros for that row
X_pred_scaler = std_scaler.fit_transform([[1650, 3]])
prediction = lin_reg.predict(X_pred_scaler)
print(f'prediction for 1650 size and 3 bedrooms: {prediction}')
print(f'diff from octave prediction: {octave_pred - prediction[0]}')

lin_rmse: 63926.20824986929
prediction for 1650 size and 3 bedrooms: [340412.65957447]
diff from octave prediction: -47331.199574468075
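
One possible reading of the prediction above: `fit_transform` on a single row returns zeros, and a least-squares model trained on standardized features predicts the mean training price for an all-zero input. If so, 340412.66 would simply be the mean of `price`. The sketch below illustrates the property with NumPy on the first five rows of the dataset; the numbers are illustrative, not the full-data values.

```python
import numpy as np

# First five rows of the dataset, as printed by df.head()
X_head = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y_head = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

# Standardize the features (zero mean, unit standard deviation per column)
Z = (X_head - X_head.mean(axis=0)) / X_head.std(axis=0)

# Least-squares fit with an intercept column
A = np.column_stack([np.ones(len(Z)), Z])
theta, *_ = np.linalg.lstsq(A, y_head, rcond=None)
intercept = theta[0]

# Prediction for an all-zero standardized row is just the intercept
zero_row_prediction = intercept + np.zeros(2) @ theta[1:]

print(intercept, y_head.mean(), zero_row_prediction)
```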


In [82]:
# Training and testing on the full dataset - APPLYING MIN-MAX SCALING
lin_reg = LinearRegression()
X = df.loc[:, ['size', 'bedrooms']]
y = df.loc[:, 'price']

scaler = MinMaxScaler()
X_scaler = scaler.fit_transform(X)

lin_reg.fit(X_scaler, y)

y_pred = lin_reg.predict(X_scaler)

# Compute the error
lin_mse = mean_squared_error(y, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(f'lin_rmse: {lin_rmse}')
# Note: fit_transform on a single row returns all zeros for that row
X_pred_scaler = scaler.fit_transform([[1650, 3]])
prediction = lin_reg.predict(X_pred_scaler)
print(f'prediction for 1650 size and 3 bedrooms: {prediction}')
print(f'diff from octave prediction: {octave_pred - prediction[0]}')

lin_rmse: 63926.20824986929
prediction for 1650 size and 3 bedrooms: [199467.38469349]
diff from octave prediction: 93614.07530651346


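The best result, that is, the prediction closest to the Octave value, is the one obtained without the transformations provided by scikit-learn. On the full dataset the RMSE is the same with and without scaling (63926.21); only the prediction for the single house differs.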

Next step: analyze the coefficients.
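
As a possible starting point for that analysis, the sketch below compares coefficients fitted on raw and on standardized features and converts one set into the other. It assumes ordinary least squares with an intercept and uses only the first five rows of the dataset, so the numbers are illustrative. The helper `fit_ols` and the variable names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta[0], theta[1:]

# First five rows of the dataset, as printed by df.head()
X_head = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
y_head = np.array([399900, 329900, 369000, 232000, 539900], dtype=float)

col_mean = X_head.mean(axis=0)
col_std = X_head.std(axis=0)
Z = (X_head - col_mean) / col_std

b_raw, w_raw = fit_ols(X_head, y_head)   # price change per unit of each feature
b_std, w_std = fit_ols(Z, y_head)        # price change per standard deviation

# Convert the standardized coefficients back to the raw feature units
w_back = w_std / col_std
b_back = b_std - np.sum(w_std * col_mean / col_std)

print(w_raw, w_back)
print(b_raw, b_back)
```

Standardized coefficients are easier to compare across features because they share a unit, while raw coefficients read directly as price per square foot and price per bedroom.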