<a href="https://colab.research.google.com/github/Mario-RJunior/calculadora-imoveis/blob/master/machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

In this step we use machine learning techniques to build a model that predicts rental prices for properties in the city of São Paulo. As input we use the CSV files generated during the [exploratory analysis](https://github.com/Mario-RJunior/calculadora-imoveis/blob/master/analise_exploratoria.ipynb).

## 1) Importing the data

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import r2_score

In [2]:
treino = pd.read_csv('https://raw.githubusercontent.com/Mario-RJunior/calculadora-imoveis/master/treino_preprocessado.csv')
teste = pd.read_csv('https://raw.githubusercontent.com/Mario-RJunior/calculadora-imoveis/master/teste_preprocessado.csv')

- Reordering the columns

In [3]:
# Define the column order
order_columns = ['zona_leste', 'zona_norte', 'zona_oeste', 'zona_sul', 'quartos', 'area', 'preco']

# Apply the new order
treino = treino.reindex(columns=order_columns)
teste = teste.reindex(columns=order_columns)

In [4]:
# First rows of the training set
treino.head()

Unnamed: 0,zona_leste,zona_norte,zona_oeste,zona_sul,quartos,area,preco
0,0,0,1,0,0.693147,3.044522,6.908755
1,0,1,0,0,0.693147,3.713572,7.601402
2,0,0,1,0,1.609438,5.70711,9.615205
3,0,0,0,1,1.098612,4.110874,7.496097
4,0,0,1,0,1.098612,5.493061,8.412055


In [5]:
# First rows of the test set
teste.head()

Unnamed: 0,zona_leste,zona_norte,zona_oeste,zona_sul,quartos,area,preco
0,0,0,0,1,1.386294,4.465908,7.313887
1,0,0,0,1,1.386294,5.968708,10.59666
2,0,0,1,0,1.609438,6.196444,9.305741
3,1,0,0,0,0.693147,4.795791,7.266129
4,0,0,1,0,0.693147,3.970292,8.537192


In [6]:
# Split into features (X) and target (y)
X_train = treino.drop(labels='preco', axis=1)
y_train = treino['preco']
X_test = teste.drop('preco', axis=1)
y_test = teste['preco']

## 2) Baseline model

We first build a ***baseline*** model to serve as a reference against which the other models are compared. A model is only worth considering if it outperforms the baseline.

We use a linear regression as the baseline.

In [7]:
# Fit the linear regression model and score it on the test set
rl = LinearRegression()
rl.fit(X_train, y_train)
rl.score(X_test, y_test)

0.5604465174223303

Note that this model has an R² score of approximately 0.56 on the test set, which is fairly weak. We now need to try other models to find the most suitable one.
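
For regressors, scikit-learn's `score` method returns the coefficient of determination R². As a minimal sketch of what that number measures (the helper `r2` and the toy values are illustrative, not part of the notebook):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [7.0, 8.0, 9.0, 10.0]       # toy targets (illustrative)
perfect = r2(y_true, y_true)         # predictions equal to the targets
mean_only = r2(y_true, [8.5] * 4)    # always predicting the mean of y_true
print(perfect, mean_only)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean of the targets, and negative values mean worse than that.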

## 3) Trying other models

### 3.1) K-Nearest Neighbors (KNN)

In [8]:
neigh = KNeighborsRegressor()
neigh.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [9]:
neigh.score(X_test, y_test)

0.6399890193124183

### 3.2) Random Forest

In [10]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [11]:
rf.score(X_test, y_test)

0.6401118217835746

### 3.3) AdaBoost

In [12]:
regr = AdaBoostRegressor()
regr.fit(X_train, y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)

In [13]:
regr.score(X_test, y_test)

0.5889791482428012

To save time, we can now fit and evaluate several algorithms in a single loop.

In [14]:
# Import the estimators
from sklearn.linear_model import RidgeCV, Lasso, ElasticNet, LassoLars, HuberRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [15]:
# List of estimators to evaluate
reg_list = [RidgeCV(),
            LGBMRegressor(), 
            XGBRegressor(objective='reg:squarederror'),
            SVR(),
            GradientBoostingRegressor(),
            MLPRegressor()
            ]

In [16]:
# Fit and evaluate each model
from sklearn.model_selection import cross_val_score
import numpy as np

for reg in reg_list:
    print(f'Treinando Modelo {reg.__class__.__name__}')
    reg.fit(X_train, y_train)
    
    train_score = reg.score(X_train, y_train)
    cv_scores = cross_val_score(reg, X_train, y_train)
    test_score = reg.score(X_test, y_test)
    
    print(f"R2 Score Train: {train_score}")
    print(f"R2 Score Valid: {np.mean(cv_scores):.2f} +- {np.std(cv_scores):.2f}")
    print(f"R2 Score Test: {test_score}")
    print('='*80)

Treinando Modelo RidgeCV
R2 Score Train: 0.6949954947925867
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5628158188285842
Treinando Modelo LGBMRegressor
R2 Score Train: 0.8335649897226411
R2 Score Valid: 0.73 +- 0.01
R2 Score Test: 0.6524018217807195
Treinando Modelo XGBRegressor
R2 Score Train: 0.8173085197618972
R2 Score Valid: 0.74 +- 0.01
R2 Score Test: 0.6754518346344691
Treinando Modelo SVR
R2 Score Train: 0.7331335431720392
R2 Score Valid: 0.72 +- 0.03
R2 Score Test: 0.6424482920301618
Treinando Modelo GradientBoostingRegressor
R2 Score Train: 0.8279163984754841
R2 Score Valid: 0.74 +- 0.01
R2 Score Test: 0.6641387415328062
Treinando Modelo MLPRegressor
R2 Score Train: 0.722333753655436
R2 Score Valid: 0.70 +- 0.05
R2 Score Test: 0.6035065145791316
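
The "R2 Score Valid" line comes from `cross_val_score`, which by default runs 5-fold cross-validation (in scikit-learn 0.22 and later) and returns one R² value per fold for regressors. A minimal sketch on synthetic data follows; the dataset from `make_regression` and the `Ridge` estimator are stand-ins, not the notebook's data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# cross_val_score clones the estimator and refits it on each training fold,
# so a model fitted beforehand with reg.fit(...) is not reused here
cv_scores = cross_val_score(Ridge(), X, y, cv=5)

print(f"R2 Score Valid: {np.mean(cv_scores):.2f} +- {np.std(cv_scores):.2f}")
```

The standard deviation across folds gives a rough idea of how stable the score is with respect to the choice of training data.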




To go further, we can try every regressor available in scikit-learn.

In [17]:
# Try every scikit-learn regressor
from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='regressor')

relatorio = {'nome':[],
             'train_score':[],
             'cv_scores_mean':[],
             'test_score':[],
             'estimador':[]
             }

ignore_list = ['IsotonicRegression',
 'MultiOutputRegressor',
 'ElasticNet',
 'MultiTaskElasticNet',
 'MultiTaskElasticNetCV',
 'MultiTaskLasso',
 'MultiTaskLassoCV',
 'RadiusNeighborsRegressor',
 'RegressorChain',
 'StackingRegressor',
 'VotingRegressor']

In [18]:
estimators.extend(
    [('LGBMRegressor', LGBMRegressor),
     ('XGBRegressor', XGBRegressor)]
)

In [19]:
# Fit and evaluate each model
for name, RegressorClass in estimators:
  if name not in ignore_list:
    print(f'Treinando Modelo {name}')
    reg = RegressorClass()
    reg.fit(X_train, y_train)

    train_score = reg.score(X_train, y_train)
    cv_scores = cross_val_score(reg, X_train, y_train)
    test_score = reg.score(X_test, y_test)

    print(f"R2 Score Train: {train_score}")
    print(f"R2 Score Valid: {np.mean(cv_scores):.2f} +- {np.std(cv_scores):.2f}")
    print(f"R2 Score Test: {test_score}")
    print('='*80)

    relatorio['nome'].append(name)
    relatorio['train_score'].append(train_score)
    relatorio['cv_scores_mean'].append(np.mean(cv_scores))
    relatorio['test_score'].append(test_score)
    relatorio['estimador'].append(reg)

Treinando Modelo ARDRegression
R2 Score Train: 0.6949800511386295
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5626085196812614
Treinando Modelo AdaBoostRegressor
R2 Score Train: 0.7227109642770126
R2 Score Valid: 0.67 +- 0.02
R2 Score Test: 0.5987349402882008
Treinando Modelo BaggingRegressor
R2 Score Train: 0.8909822029365575
R2 Score Valid: 0.70 +- 0.02
R2 Score Test: 0.6459046767997183
Treinando Modelo BayesianRidge
R2 Score Train: 0.6949911822249593
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5627539730309865
Treinando Modelo CCA
R2 Score Train: 0.5268173739911579
R2 Score Valid: 0.52 +- 0.05
R2 Score Test: 0.4247894039998378
Treinando Modelo DecisionTreeRegressor
R2 Score Train: 0.9216100052828616
R2 Score Valid: 0.61 +- 0.04
R2 Score Test: 0.5969019540428854
Treinando Modelo DummyRegressor
R2 Score Train: 0.0
R2 Score Valid: -0.01 +- 0.01
R2 Score Test: -0.0013810536838241294
Treinando Modelo ElasticNetCV
R2 Score Train: 0.6949788061574413
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.562556270900655
Treinando Modelo ExtraTreeRegressor
R2 Score Train: 0.9216100052828616
R2 Score Valid: 0.63 +- 0.05
R2 Score Test: 0.5957530125634056
Treinando Modelo ExtraTreesRegressor
R2 Score Train: 0.9216100052828616
R2 Score Valid: 0.67 +- 0.02
R2 Score Test: 0.6300444069881002
Treinando Modelo GaussianProcessRegressor
R2 Score Train: 0.8261834894491148
R2 Score Valid: -96045.91 +- 128609.13
R2 Score Test: -34350.71313366961
Treinando Modelo GradientBoostingRegressor
R2 Score Train: 0.827916398475484
R2 Score Valid: 0.74 +



R2 Score Train: 0.7227757115921967
R2 Score Valid: 0.70 +- 0.05
R2 Score Test: 0.6050512661348435
Treinando Modelo NuSVR
R2 Score Train: 0.7353168434788004
R2 Score Valid: 0.73 +- 0.03
R2 Score Test: 0.6448316482280202
Treinando Modelo OrthogonalMatchingPursuit
R2 Score Train: 0.6256572870612799
R2 Score Valid: 0.62 +- 0.04
R2 Score Test: 0.4850259732055404
Treinando Modelo OrthogonalMatchingPursuitCV
R2 Score Train: 0.6949996135157472
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5629558167496815
Treinando Modelo PLSCanonical
R2 Score Train: 0.3593709550845988
R2 Score Valid: 0.35 +- 0.01
R2 Score Test: 0.3414828762223163
Treinando Modelo PLSRegression
R2 Score Train: 0.6884671082779386
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5611160410265197
Treinando Modelo PassiveAggressiveRegressor
R2 Score Train: 0.6536677259433654
R2 Score Valid: 0.57 +- 0.14
R2 Score Test: 0.5360731333058997
Treinando Modelo RANSACRegressor
R2 Score Train: 0.663082470854452
R2 Score Valid: 0.67 +- 0.05
R2 Score Test: 0.5084429098734113
Treinando Modelo RandomForestRegressor
R2 Score Train: 0.8997566341307691
R2 Score Valid: 0.70 +- 0.02
R2 Score Test: 0.6483247331998391
Treinando Modelo Ridge
R2 Score Train: 0.6949954947925869
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5628158188285834
Treinando Modelo RidgeCV
R2 Score Train: 0.6949954947925867
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5628158188285842
Treinando Modelo SGDRegressor
R2 Score Train: 0.678695633313417
R2 Score Valid: 0.67 +- 0.05
R2 Score Test: 0.5381032582442044
Treinando Modelo SVR
R2 Score Train: 0.7331335431720392
R2 Score Valid: 0.72 +- 0.03
R2 Score Test: 0.6424482920301618
Treinando Modelo TheilSenRegressor
R2 Score Train: 0.6672597972850639
R2 Score Valid: 0.67 +- 0.05
R2 Score Test: 0.5178070551022393
Treinando Modelo TransformedTargetRegressor
R2 Score Train: 0.6944799959879222
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5604465174223303
T

## 4) Building a report

To compare the models more easily, we collect each one's final results in a report.

In [20]:
relatorio = pd.DataFrame(relatorio).sort_values(by='cv_scores_mean', ascending=False)
relatorio.head(10)

Unnamed: 0,nome,train_score,cv_scores_mean,test_score,estimador
11,GradientBoostingRegressor,0.827916,0.739262,0.663938,"([DecisionTreeRegressor(ccp_alpha=0.0, criteri..."
41,XGBRegressor,0.817309,0.738721,0.675452,"XGBRegressor(base_score=0.5, booster='gbtree',..."
40,LGBMRegressor,0.833565,0.731345,0.652402,"LGBMRegressor(boosting_type='gbdt', class_weig..."
12,HistGradientBoostingRegressor,0.837885,0.728931,0.662837,HistGradientBoostingRegressor(l2_regularizatio...
26,NuSVR,0.735317,0.726797,0.644832,"NuSVR(C=1.0, cache_size=200, coef0=0.0, degree..."
37,SVR,0.733134,0.723368,0.642448,"SVR(C=1.0, cache_size=200, coef0=0.0, degree=3..."
14,KNeighborsRegressor,0.806649,0.707182,0.639989,"KNeighborsRegressor(algorithm='auto', leaf_siz..."
33,RandomForestRegressor,0.899757,0.70219,0.648325,"(DecisionTreeRegressor(ccp_alpha=0.0, criterio..."
2,BaggingRegressor,0.890982,0.69648,0.645905,"(DecisionTreeRegressor(ccp_alpha=0.0, criterio..."
25,MLPRegressor,0.722776,0.695437,0.605051,"MLPRegressor(activation='relu', alpha=0.0001, ..."
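
The gap between the training score and the cross-validation score is a quick indicator of overfitting. A sketch using four rows copied from the report above (the `gap` column is an addition for illustration):

```python
import pandas as pd

# Values copied from the report above
df = pd.DataFrame({
    'nome': ['GradientBoostingRegressor', 'XGBRegressor', 'SVR', 'RandomForestRegressor'],
    'train_score': [0.827916, 0.817309, 0.733134, 0.899757],
    'cv_scores_mean': [0.739262, 0.738721, 0.723368, 0.702190],
})

# A large positive gap means the model fits the training data much better
# than it generalizes to held-out folds
df['gap'] = df['train_score'] - df['cv_scores_mean']
print(df.sort_values('gap', ascending=False))
```

Among these rows the random forest shows the widest gap (about 0.20) and SVR the narrowest (about 0.01).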


## 5) Tuning the model with GridSearchCV

In [21]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

### 5.1) XGBoost model

In [22]:
# Define the parameter grid
parameters = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15,20,25],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'subsample': [0.7, 0.8, 0.9]
}

# Create the regressor
xgb_reg = XGBRegressor(objective='reg:squarederror')

# Create the grid search
gs = GridSearchCV(xgb_reg, parameters)
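
Before running an exhaustive search it helps to know how many fits it implies. A small sketch that counts them for the grid above, assuming the default 5-fold cross-validation (used when `cv=None` in scikit-learn 0.22 and later), the default `refit=True`, and Python 3.8+ for `math.prod`:

```python
from math import prod

# Same grid as defined above
parameters = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15, 20, 25],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'subsample': [0.7, 0.8, 0.9],
}

n_combinations = prod(len(values) for values in parameters.values())
n_folds = 5                             # assumed default for cv=None
n_fits = n_combinations * n_folds + 1   # +1 for the final refit on all training data

print(n_combinations, n_fits)  # 486 2431
```

With 486 combinations and up to 1000 trees of depth up to 25 each, the search can take a long time; passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel.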

In [23]:
# Fit the grid search
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                    colsample_bylevel=1, colsample_bynode=1,
                                    colsample_bytree=1, gamma=0,
                                    importance_type='gain', learning_rate=0.1,
                                    max_delta_step=0, max_depth=3,
                                    min_child_weight=1, missing=None,
                                    n_estimators=100, n_jobs=1, nthread=None,
                                    objective='reg:squarederror',
                                    random_state=0, reg_alp...
                                    scale_pos_weight=1, seed=None, silent=None,
                                    subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'colsample_bytree': [0.7, 0.8],
                         'max_depth': [15, 20, 25],
                         'n_es

In [24]:
# Inspect the best estimator
best_gs = gs.best_estimator_
best_gs

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=15, min_child_weight=1, missing=None, n_estimators=400,
             n_jobs=1, nthread=None, objective='reg:squarederror',
             random_state=0, reg_alpha=1.3, reg_lambda=1.2, scale_pos_weight=1,
             seed=None, silent=None, subsample=0.7, verbosity=1)

In [25]:
# Best cross-validation score
gs.best_score_

0.7272645195429972

In [26]:
# Score on the test set
best_gs.score(X_test, y_test)

0.6596895056107179

### 5.2) Gradient Boosting model

In [27]:
# Define the parameter grid
param_grid = {
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

# Create the regressor
gbr_reg = GradientBoostingRegressor()

# Create the grid search
gbr_gs = GridSearchCV(gbr_reg, param_grid)

In [28]:
# Fit the grid search
gbr_gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=3,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_ite...
                            

In [29]:
# Inspect the best estimator
best_gbr_gs = gbr_gs.best_estimator_
best_gbr_gs

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls',
                          max_depth=100, max_features=2, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=5, min_samples_split=8,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [30]:
# Best cross-validation score
gbr_gs.best_score_

0.7091683963726417

In [31]:
# Score on the test set
best_gbr_gs.score(X_test, y_test)

0.6536582862514058
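
The tuned model's best cross-validation score (0.709) is lower than the 0.739 obtained with the default hyperparameters in the report of section 4. That can happen because the grid does not contain the defaults (for example `max_depth=3`). One way to guard against this, sketched here on synthetic data, is to include the default values in the grid. The dataset and the small grid are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the training data (assumption for illustration)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Fixed random_state so both evaluations fit identical models
default_cv = cross_val_score(GradientBoostingRegressor(random_state=0), X, y).mean()

# The grid contains the defaults (max_depth=3, n_estimators=100)
param_grid = {'max_depth': [3, 10], 'n_estimators': [100, 200]}
gs = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid)
gs.fit(X, y)

print(f"default: {default_cv:.3f}, best in grid: {gs.best_score_:.3f}")
```

Because the default combination is one of the candidates and both evaluations use the same folds, the best cross-validation score of the search cannot fall below the default model's.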

## 6) Saving the best model

We export the original Gradient Boosting model, since it performed better than the grid-searched version (cross-validation R² of about 0.74 versus 0.71, and test R² of about 0.66 versus 0.65). To do so, we recreate the model.

In [32]:
# Recreate the model
best_gb = GradientBoostingRegressor()
best_gb.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [33]:
# Import pickle
import pickle

In [34]:
# Export the model, closing the file once it has been written
with open('gb_regressor.pkl', 'wb') as f:
    pickle.dump(best_gb, f, protocol=4)
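
To use the exported file later, the model can be loaded back with `pickle.load`. A self-contained sketch of the round trip, using a small model trained on synthetic data and a temporary file rather than the notebook's `gb_regressor.pkl`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the training data (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=6, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')  # hypothetical path
with open(path, 'wb') as f:
    pickle.dump(model, f, protocol=4)

with open(path, 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model reproduces the original predictions
same = np.allclose(model.predict(X), loaded.predict(X))
print(same)
```

Pickled scikit-learn models should be loaded with the same library version that created them, and only from trusted sources.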