<a href="https://colab.research.google.com/github/Mario-RJunior/calculadora-imoveis/blob/master/machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

In this stage we use Machine Learning techniques to build a model that predicts rental prices for properties in the city of São Paulo. As input we use the CSV files generated during the [exploratory analysis](https://github.com/Mario-RJunior/calculadora-imoveis/blob/master/analise_exploratoria.ipynb).

## 1) Importing the data

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import r2_score

In [2]:
treino = pd.read_csv('https://raw.githubusercontent.com/Mario-RJunior/calculadora-imoveis/master/treino_preprocessado.csv')
teste = pd.read_csv('https://raw.githubusercontent.com/Mario-RJunior/calculadora-imoveis/master/teste_preprocessado.csv')

In [3]:
# Split into features (X) and target (y)
X_train = treino.drop(labels='preco', axis=1)
y_train = treino['preco']
X_test = teste.drop('preco', axis=1)
y_test = teste['preco']

## 2) Baseline Model

We start with a model called the ***Baseline***, which serves as the reference against which the other models are compared. For a model to be considered good, it must outperform the baseline.

As the baseline we use a linear regression algorithm.

In [4]:
# Creating the linear regression model
rl = LinearRegression()
rl.fit(X_train, y_train)
rl.score(X_test, y_test)

0.5619293634962716

Note that this model has a score (R²) of approximately 0.56, which is fairly weak. We now need to test other models to find the best and most suitable one.
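
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below uses made-up values (not the rental data) to show that the manual computation agrees with `r2_score`, and that a model which always predicts the mean scores exactly 0, which helps put a value like 0.56 in context.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values for illustration only; they are not taken from the rental dataset.
y_true = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y_pred = np.array([1100.0, 1400.0, 2100.0, 2300.0, 3200.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# A model that always predicts the mean of y_true scores exactly 0.
r2_mean_model = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_mean_model)
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that.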

## 3) Testing other models

### 3.1) K-Nearest Neighbors (KNN)

In [5]:
neigh = KNeighborsRegressor()
neigh.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [6]:
neigh.score(X_test, y_test)

0.6399890193124183

### 3.2) Random Forest

In [7]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [8]:
rf.score(X_test, y_test)

0.6488110031897488

### 3.3) AdaBoost

In [9]:
regr = AdaBoostRegressor()
regr.fit(X_train, y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)

In [10]:
regr.score(X_test, y_test)

0.6049797074091133

To save time, we can now train several algorithms in a single loop.

In [11]:
# Importing the estimators
from sklearn.linear_model import RidgeCV, Lasso, ElasticNet, LassoLars, HuberRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [12]:
# Creating a list with all the estimators
reg_list = [RidgeCV(),
            LGBMRegressor(), 
            XGBRegressor(objective='reg:squarederror'),
            SVR(),
            GradientBoostingRegressor(),
            MLPRegressor()
            ]

In [13]:
# Training and scoring each model
from sklearn.model_selection import cross_val_score
import numpy as np

for reg in reg_list:
    print(f'Treinando Modelo {reg.__class__.__name__}')
    reg.fit(X_train, y_train)
    
    train_score = reg.score(X_train, y_train)
    cv_scores = cross_val_score(reg, X_train, y_train)
    test_score = reg.score(X_test, y_test)
    
    print(f"R2 Score Train: {train_score}")
    print(f"R2 Score Valid: {np.mean(cv_scores):.2f} +- {np.std(cv_scores):.2f}")
    print(f"R2 Score Test: {test_score}")
    print('='*80)

Treinando Modelo RidgeCV
R2 Score Train: 0.694995494792588
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5628158188285723
Treinando Modelo LGBMRegressor
R2 Score Train: 0.8335649897226411
R2 Score Valid: 0.73 +- 0.01
R2 Score Test: 0.6524018217807195
Treinando Modelo XGBRegressor
R2 Score Train: 0.8173085197618972
R2 Score Valid: 0.74 +- 0.01
R2 Score Test: 0.6748886425763778
Treinando Modelo SVR
R2 Score Train: 0.7331335431720392
R2 Score Valid: 0.72 +- 0.03
R2 Score Test: 0.6424482920301618
Treinando Modelo GradientBoostingRegressor
R2 Score Train: 0.8279163984754841
R2 Score Valid: 0.74 +- 0.01
R2 Score Test: 0.6640035513816185
Treinando Modelo MLPRegressor
R2 Score Train: 0.7141430902070662
R2 Score Valid: 0.70 +- 0.04
R2 Score Test: 0.5946274407296019
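
The "Valid" figures come from `cross_val_score`, which clones the estimator and refits it on each fold, so the earlier `fit` call does not influence them. When `cv` is not given, recent scikit-learn versions use 5 folds. The sketch below makes the folds explicit; the synthetic data from `make_regression` and the choice of `Ridge` are stand-ins chosen for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / y_train (assumption: not the rental data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Explicit 5-fold split with shuffling and a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")

print(f"R2 Valid: {cv_scores.mean():.2f} +- {cv_scores.std():.2f}")
```

Passing an explicit `KFold` with a seed keeps the validation scores comparable across models and across reruns.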




To try even more estimators, we can go further and attempt every regressor available in scikit-learn.

In [14]:
# Testing all scikit-learn regressors
from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='regressor')

relatorio = {'nome':[],
             'train_score':[],
             'cv_scores_mean':[],
             'test_score':[],
             'estimador':[]
             }

ignore_list = ['IsotonicRegression',
 'MultiOutputRegressor',
 'ElasticNet',
 'MultiTaskElasticNet',
 'MultiTaskElasticNetCV',
 'MultiTaskLasso',
 'MultiTaskLassoCV',
 'RadiusNeighborsRegressor',
 'RegressorChain',
 'StackingRegressor',
 'VotingRegressor']

In [15]:
estimators.extend(
    [('LGBMRegressor', LGBMRegressor),
     ('XGBRegressor', XGBRegressor)]
)

In [16]:
# Training the models
for name, RegressorClass in estimators:
  if name not in ignore_list:
    print(f'Treinando Modelo {name}')
    reg = RegressorClass()
    reg.fit(X_train, y_train)

    train_score = reg.score(X_train, y_train)
    cv_scores = cross_val_score(reg, X_train, y_train)
    test_score = reg.score(X_test, y_test)

    print(f"R2 Score Train: {train_score}")
    print(f"R2 Score Valid: {np.mean(cv_scores):.2f} +- {np.std(cv_scores):.2f}")
    print(f"R2 Score Test: {test_score}")
    print('='*80)

    relatorio['nome'].append(name)
    relatorio['train_score'].append(train_score)
    relatorio['cv_scores_mean'].append(np.mean(cv_scores))
    relatorio['test_score'].append(test_score)
    relatorio['estimador'].append(reg)

Treinando Modelo ARDRegression
R2 Score Train: 0.6949800511386349
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.562608519680974
Treinando Modelo AdaBoostRegressor
R2 Score Train: 0.7310431950151979
R2 Score Valid: 0.67 +- 0.02
R2 Score Test: 0.607812382228452
Treinando Modelo BaggingRegressor
R2 Score Train: 0.8926635451491537
R2 Score Valid: 0.70 +- 0.01
R2 Score Test: 0.6400439852894649
Treinando Modelo BayesianRidge
R2 Score Train: 0.6949911822249593
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5627539730309861
Treinando Modelo CCA
R2 Score Train: 0.5268173739911581
R2 Score Valid: 0.52 +- 0.05
R2 Score Test: 0.4247894039998381
Treinando Modelo DecisionTreeRegressor
R2 Score Train: 0.9216100052828616
R2 Score Valid: 0.61 +- 0.03
R2 Score Test: 0.6007909942643348
Treinando Modelo DummyRegressor
R2 Score Train: 0.0
R2 Score Valid: -0.01 +- 0.01
R2 Score Test: -0.0013810536838241294
Treinando Modelo ElasticNetCV
R2 Score Train: 0.6949788060660995
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5625562730896081
Treinando Modelo ExtraTreeRegressor
R2 Score Train: 0.9216100052828616
R2 Score Valid: 0.60 +- 0.03
R2 Score Test: 0.5942091200753566
Treinando Modelo ExtraTreesRegressor
R2 Score Train: 0.9216100052828616
R2 Score Valid: 0.67 +- 0.02
R2 Score Test: 0.6325820011972915
Treinando Modelo GaussianProcessRegressor
R2 Score Train: 0.8261821704221463
R2 Score Valid: -96043.98 +- 128607.82
R2 Score Test: -34346.624542560974
Treinando Modelo GradientBoostingRegressor
R2 Score Train: 0.8279163984754841
R2 Score Valid: 0.74 +- 0.01
R2 Score Test: 0.6651743145126934
Treinando Modelo HistGradientBoostingRegressor
R2 Score Train: 0.8378845337772923
R2 Score Va...
R2 Score Train: 0.6859048377813051
R2 Score Valid: 0.68 +- 0.04
R2 Score Test: 0.5420618396728419
Treinando Modelo MLPRegressor
R2 Score Train: 0.7145006068905753
R2 Score Valid: 0.70 +- 0.04
R2 Score Test: 0.5831582181912104
Treinando Modelo NuSVR
R2 Score Train: 0.7353168434788004
R2 Score Valid: 0.73 +- 0.03
R2 Score Test: 0.6448316482280203
Treinando Modelo OrthogonalMatchingPursuit
R2 Score Train: 0.6256572870612798
R2 Score Valid: 0.62 +- 0.04
R2 Score Test: 0.4850259732055404
Treinando Modelo OrthogonalMatchingPursuitCV
R2 Score Train: 0.6949996135157471
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5629558167496813
Treinando Modelo PLSCanonical
R2 Score Train: 0.3593709550845988
R2 Score Valid: 0.35 +- 0.01
R2 Score Test: 0.34148287622231643
Treinando Modelo PLSRegression
R2 Score Train: 0.6884671082779386
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5611160410265197
Treinando Modelo PassiveAggressiveRegressor
R2 Score Train: 0.617407427307525
R2 Score Valid: 0.57 +- 0.03
R2 Score Test: 0.4879555108704702
Treinando Modelo RANSACRegressor
R2 Score Train: 0.668619337109893
R2 Score Valid: 0.65 +- 0.05
R2 Score Test: 0.5261063340013321
Treinando Modelo RandomForestRegressor
R2 Score Train: 0.9001974006920409
R2 Score Valid: 0.71 +- 0.02
R2 Score Test: 0.6400048509765646
Treinando Modelo Ridge
R2 Score Train: 0.6949954947925869
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5628158188285834
Treinando Modelo RidgeCV
R2 Score Train: 0.694995494792588
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5628158188285723
Treinando Modelo SGDRegressor
R2 Score Train: 0.6670217212289327
R2 Score Valid: 0.67 +- 0.05
R2 Score Test: 0.5413483380682542
Treinando Modelo SVR
R2 Score Train: 0.7331335431720392
R2 Score Valid: 0.72 +- 0.03
R2 Score Test: 0.6424482920301618
Treinando Modelo TheilSenRegressor
R2 Score Train: 0.6668551290738163
R2 Score Valid: 0.67 +- 0.04
R2 Score Test: 0.5176913842002184
Treinando Modelo TransformedTargetRegressor
R2 Score Train: 0.6944855440085005
R2 Score Valid: 0.69 +- 0.04
R2 Score Test: 0.5619293634962716
...

## 4) Building a report

To better assess the performance of the models, we can build a report with the final results of each one.

In [17]:
relatorio = pd.DataFrame(relatorio).sort_values(by='cv_scores_mean', ascending=False)
relatorio.head(10)

| | nome | train_score | cv_scores_mean | test_score | estimador |
|---|---|---|---|---|---|
| 11 | GradientBoostingRegressor | 0.827916 | 0.741404 | 0.665174 | ([DecisionTreeRegressor(ccp_alpha=0.0, criteri... |
| 41 | XGBRegressor | 0.817309 | 0.73965 | 0.674889 | XGBRegressor(base_score=0.5, booster='gbtree',... |
| 40 | LGBMRegressor | 0.833565 | 0.731345 | 0.652402 | LGBMRegressor(boosting_type='gbdt', class_weig... |
| 12 | HistGradientBoostingRegressor | 0.837885 | 0.728931 | 0.662837 | HistGradientBoostingRegressor(l2_regularizatio... |
| 26 | NuSVR | 0.735317 | 0.726797 | 0.644832 | NuSVR(C=1.0, cache_size=200, coef0=0.0, degree... |
| 37 | SVR | 0.733134 | 0.723368 | 0.642448 | SVR(C=1.0, cache_size=200, coef0=0.0, degree=3... |
| 14 | KNeighborsRegressor | 0.806649 | 0.707182 | 0.639989 | KNeighborsRegressor(algorithm='auto', leaf_siz... |
| 33 | RandomForestRegressor | 0.900197 | 0.705031 | 0.640005 | (DecisionTreeRegressor(ccp_alpha=0.0, criterio... |
| 2 | BaggingRegressor | 0.892664 | 0.696814 | 0.640044 | (DecisionTreeRegressor(ccp_alpha=0.0, criterio... |
| 25 | MLPRegressor | 0.714501 | 0.696112 | 0.583158 | MLPRegressor(activation='relu', alpha=0.0001, ... |
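
Besides ranking by `cv_scores_mean`, it can help to look at the gap between the training score and the cross-validated score, since a large gap suggests overfitting. The sketch below copies a few rows from the report above; the column name `train_cv_gap` is an illustrative choice, not part of the original notebook.

```python
import pandas as pd

# Scores copied from the report above (rounded as displayed).
report = pd.DataFrame({
    "nome": ["GradientBoostingRegressor", "XGBRegressor", "SVR", "RandomForestRegressor"],
    "train_score": [0.827916, 0.817309, 0.733134, 0.900197],
    "cv_scores_mean": [0.741404, 0.739650, 0.723368, 0.705031],
    "test_score": [0.665174, 0.674889, 0.642448, 0.640005],
})

# Gap between training and cross-validated R2: a larger gap suggests more overfitting.
report["train_cv_gap"] = report["train_score"] - report["cv_scores_mean"]

print(report.sort_values("train_cv_gap", ascending=False)[["nome", "train_cv_gap"]])
```

In this subset, `RandomForestRegressor` shows the widest gap and `SVR` the narrowest, even though their test scores are close.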


## 5) Model tuning with GridSearchCV

In [18]:
# Importing GridSearchCV
from sklearn.model_selection import GridSearchCV

In [19]:
# Defining the parameter grid
parameters = {
    'n_estimators': [400, 700, 1000],
    'colsample_bytree': [0.7, 0.8],
    'max_depth': [15,20,25],
    'reg_alpha': [1.1, 1.2, 1.3],
    'reg_lambda': [1.1, 1.2, 1.3],
    'subsample': [0.7, 0.8, 0.9]
}

# Creating the regressor
xgb_reg = XGBRegressor(objective='reg:squarederror')

# Creating the grid search
gs = GridSearchCV(xgb_reg, parameters)
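
Before launching an exhaustive search it is worth estimating its cost. The sketch below counts the candidate combinations in this grid with `ParameterGrid`; the fold count of 5 is an assumption based on the default used when `cv=None` in recent scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
parameters = {
    "n_estimators": [400, 700, 1000],
    "colsample_bytree": [0.7, 0.8],
    "max_depth": [15, 20, 25],
    "reg_alpha": [1.1, 1.2, 1.3],
    "reg_lambda": [1.1, 1.2, 1.3],
    "subsample": [0.7, 0.8, 0.9],
}

n_candidates = len(ParameterGrid(parameters))   # 3 * 2 * 3 * 3 * 3 * 3
n_folds = 5                                     # assumed default when cv=None
n_fits = n_candidates * n_folds                 # excludes the final refit

print(n_candidates, n_fits)
```

With 486 candidates and 5 folds this amounts to 2430 model fits, so passing `n_jobs=-1` to `GridSearchCV` or narrowing the grid may be worth considering.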

In [20]:
# Training the model
gs.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=XGBRegressor(base_score=0.5, booster='gbtree',
                                    colsample_bylevel=1, colsample_bynode=1,
                                    colsample_bytree=1, gamma=0,
                                    importance_type='gain', learning_rate=0.1,
                                    max_delta_step=0, max_depth=3,
                                    min_child_weight=1, missing=None,
                                    n_estimators=100, n_jobs=1, nthread=None,
                                    objective='reg:squarederror',
                                    random_state=0, reg_alp...
                                    scale_pos_weight=1, seed=None, silent=None,
                                    subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'colsample_bytree': [0.7, 0.8],
                         'max_depth': [15, 20, 25],
                         'n_es

In [21]:
# Inspecting the best estimator
gs.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=20, min_child_weight=1, missing=None, n_estimators=400,
             n_jobs=1, nthread=None, objective='reg:squarederror',
             random_state=0, reg_alpha=1.3, reg_lambda=1.3, scale_pos_weight=1,
             seed=None, silent=None, subsample=0.7, verbosity=1)
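
A natural next step is to compare the tuned model's cross-validated score with its score on the held-out test set. The sketch below illustrates the relevant `GridSearchCV` attributes on synthetic data, with `GradientBoostingRegressor` swapped in for `XGBRegressor` and a deliberately tiny grid so that it runs in seconds. These are assumptions for illustration; on the notebook's objects the equivalent calls would be `gs.best_params_`, `gs.best_score_` and `gs.score(X_test, y_test)`.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the rental data (assumption, for illustration only).
X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Deliberately tiny grid so the sketch runs in seconds.
small_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV R2: {search.best_score_:.3f}")
# With refit=True (the default), score() uses the best estimator refitted on all of X_tr.
print(f"Test R2: {search.score(X_te, y_te):.3f}")
```

Comparing the test score of the tuned model against the untuned scores in the report shows whether the search actually improved generalization.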