# **Random Forest Model: Classification and Regression**

---



## **Obtaining the Data**

The first step in any data analysis or *Machine Learning* project is understanding the problem. In this case, the original dataset was taken from the [Kaggle](https://www.kaggle.com) platform and contains data on property rentals in Brazil, covering apartments, houses in gated communities, studio apartments, and so on.

This *notebook* is dedicated to applying the ***Random Forest*** algorithm to a dataset previously prepared with techniques such as data cleaning and transformation, with the goal of estimating property prices.

To see the *notebook* with the details of the preprocessing applied to the dataset, you can follow this [link](https://github.com/TailUFPB/Tutorials/blob/main/Pré-processamento/limpeza_process_dados.ipynb).

## **Search-Based Models**

If you have ever studied *Machine Learning*, you have certainly come across the term Decision Tree. If you have not, no problem: we will explain what it is and how to use it.

*Decision Tree* is the name of a very well-known algorithm that is widely used by the community. The model uses a 'divide and conquer' strategy to solve a decision problem. It is a search-based algorithm in which a complex problem is divided into simpler problems, to which the same strategy is then applied recursively.

According to Lorena et al. (2011), the strength of this approach comes from its ability to divide the instance space into subspaces, each of which is fitted using a different model.
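To make the idea of recursively dividing the instance space more concrete, here is a minimal sketch that trains a shallow decision tree and prints its rules. It uses the public Iris dataset that ships with Scikit-Learn rather than the rental dataset, and the names `X_demo`, `y_demo` and `demo_tree` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small public dataset, used only for illustration (not the rental data)
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# A shallow tree keeps the printed rules short and readable
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Each nested rule is one "divide" step applied to a smaller subspace
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level in the printed rules corresponds to one further division of the subspace produced by the rule above it.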

**But what does this have to do with *Random Forest*?**

*Random Forest* is a tree-based supervised learning algorithm. As the name suggests, the model works by building a 'forest' at random out of several decision trees.

<center><img alt="IA" width="15%" src="https://www.flaticon.com/svg/static/icons/svg/2913/2913483.svg"></center>

According to Niklas Donges, in a Towards Data Science article, put simply: the random forest algorithm builds several decision trees and combines them to obtain a more accurate and more stable prediction. It is precisely the randomness in how the trees are built that tends to make *Random Forest* perform better than a single classic decision tree.

A major advantage of the random forest algorithm is that it can be used for both classification and regression tasks, which together make up the majority of current machine learning systems.

**How the Algorithm Works**

In short, Random Forest works as follows:

1. Create a *Bootstrap Dataset*: a dataset built by randomly drawing samples (rows) from the original dataset with replacement, which gives each tree a different view of the data;

2. Start building the trees that will make up the forest: at each node, N features are selected at random as candidates for the split.

3. Among those candidates, check which feature separates the data best and split the node on it.

4. Repeat the random selection for the following nodes, and then the whole process for each new tree. In this way, several trees are created, each with different data and a different size.

5. Run a new sample through all the trees and combine their results: for classification, by majority vote, that is, the predicted class is the one with the most votes; for regression, by averaging the trees' predictions.
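The procedure can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not the way Scikit-Learn implements it internally (its `RandomForestClassifier` averages the trees' predicted probabilities instead of counting hard votes). The names `X_toy`, `y_toy`, `forest` and `n_trees` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binary classification data, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)

n_trees = 25  # odd number, so a binary majority vote cannot tie
forest = []
for i in range(n_trees):
    # Bootstrap dataset: row indices drawn at random WITH replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # max_features="sqrt": each split considers a random subset of features
    # and keeps the one that separates the data best
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_toy[idx], y_toy[idx])
    forest.append(tree)

# Majority vote: every tree predicts, the most voted class wins
all_votes = np.stack([t.predict(X_toy) for t in forest])  # (n_trees, n_samples)
y_vote = (all_votes.sum(axis=0) > n_trees / 2).astype(int)

train_accuracy = (y_vote == y_toy).mean()
print('Accuracy of the hand-made forest on its own training data:', train_accuracy)
```

The accuracy printed here is measured on the same data used for training, so it only confirms that the voting mechanism works; it says nothing about generalization.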

For a complete Towards Data Science article on the Random Forest model, click [here](https://towardsdatascience.com/understanding-random-forest-58381e0602d2).

## **Implementing the Random Forest Model**

In this *notebook*, the *Random Forest* algorithm was implemented for both regression and classification using the *Scikit-Learn* library, with the goal of predicting the price of the properties described in the *dataset*.

In [None]:
# IMPORTING THE MAIN DATA MANIPULATION AND VISUALIZATION LIBRARIES

import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
# IMPORTING THE DATASET

df = pd.read_csv('https://raw.githubusercontent.com/jeffersonverissimo/datasets/master/dataset.csv')

In [None]:
# VIEWING THE FIRST 5 ROWS OF THE DATASET

df.head()

Unnamed: 0,quintal,churrasqueira,banheiros,banheira,quartos,lareira,andares,mobilhado,jardim,estacionamento,academia,jacuzzi,vistaMontanhas,vagasGaragem,salaoFestas,parquinho,piscina,precoTotal,sauna,isolamentoAcustico,quadra,suites,quadraTenis,area,andar,areaUtilizada,Acre,Alagoas,Amazonas,Bahia,Ceará,Distrito Federal,Espírito Santo,Goiás,Maranhão,Mato Grosso,Mato Grosso do Sul,Minas Gerais,Paraná,Paraíba,Pará,Pernambuco,Piauí,Rio Grande do Norte,Rio Grande do Sul,Rio de Janeiro,Rondônia,Santa Catarina,Sergipe,São Paulo,Tocantins,areaCat,areaUtilizadaCat
0,0,0,1,0,2,0,1,0,0,0,0,0,0,1,0,0,0,1.35,0,0,0,1,0,45,0,45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0,0,3,0,3,0,1,0,0,0,0,0,0,1,0,0,0,5.561,0,0,0,1,0,140,0,140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0
2,0,0,3,0,3,0,1,0,0,0,0,0,0,2,0,0,0,2.935,0,0,0,1,0,100,0,100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0
3,0,0,2,0,1,0,1,0,0,0,0,0,0,1,0,0,1,137.437,1,0,0,0,0,80,0,80,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0
4,0,1,3,0,2,0,1,0,0,0,0,0,0,2,1,1,1,3.136,1,0,1,2,0,69,0,69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0


## **Random Forest - Regression**

The Scikit-Learn library makes building models easier through predefined classes and methods, with performance-critical routines implemented in compiled code (largely Cython) for efficiency.

There is no clearer example of the power and simplicity of a model like random forest than its implementation through Scikit-learn. For demonstration purposes, this section tackles a regression problem, as follows.

In [None]:
# SAVING A COPY OF THE ORIGINAL DATASET

df_rgs = df.copy()


Scikit-learn also provides an indispensable tool for training models: splitting the data into training and validation sets. A model that cannot generalize with reasonable accuracy cannot be applied to real problems, that is, it cannot be considered a viable solution. Hence the need to hold out part of the original dataset and judge a model by its performance on examples it has not seen before. The following cell demonstrates this technique.

In [None]:
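from sklearn.model_selection import train_test_split

# Features: every column except the target
X = df_rgs.loc[:, df_rgs.columns != "precoTotal"]
# Target: the total price
y = df_rgs.loc[:, "precoTotal"]

# Hold out 25% of the rows for validation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=1701)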


Finally, the regression model is built and evaluated in just a few lines.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Instantiating the model
regressor = RandomForestRegressor(random_state=1701, verbose=1)

# Fitting it to the training data
regressor.fit(X_train, y_train)

# Getting predictions for the validation data
y_pred = regressor.predict(X_test)

# Evaluating the model on unseen data (score returns the coefficient of determination, R²)
regressor.score(X_test, y_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   11.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished


0.2036399950133535

Note that our model did not do well when evaluated on its ability to generalize: the value returned by `score` is the coefficient of determination (R²), and about 0.20 on the held-out data is low. This can happen, for example, when the model's hyperparameters are not tuned properly. The intention here is to show how useful the library is for solving problems with random forests: a simplified view, but one that succinctly introduces how to use models of this kind.
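One simple way to look at generalization is to compare the score on the training data with the score on the held-out data, and to complement R² with an error metric expressed in the units of the target. The sketch below does this on synthetic data generated with `make_regression`, not on the rental dataset; the names `X_syn`, `y_syn`, `rf_demo` and the `Xs_*`/`ys_*` variables are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration
X_syn, y_syn = make_regression(n_samples=500, n_features=10,
                               noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0)

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0)
rf_demo.fit(Xs_train, ys_train)

# R² on the data the model has already seen
r2_train = rf_demo.score(Xs_train, ys_train)

# R² and mean absolute error on data the model has not seen
ys_pred = rf_demo.predict(Xs_test)
r2_test = r2_score(ys_test, ys_pred)
mae_test = mean_absolute_error(ys_test, ys_pred)

print('R² (train):', r2_train)
print('R² (test): ', r2_test)
print('MAE (test):', mae_test)
```

A large gap between the training and test R² suggests the model is fitting details of the training data that do not carry over to new samples.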

## **Random Forest - Classification**

In [None]:
# SAVING A COPY OF THE ORIGINAL DATASET

df_clf = df.copy()

As you can see, the target variable of the dataset is numeric and takes many distinct values, so it cannot be used directly as the label of a classification model.

At this stage, before building the classification model, the variable must be adapted to the problem. The `precoTotal` variable will therefore be transformed into a form the algorithm accepts.

Since these are property prices, we will bin them into three categories: low, medium and high, or cheap (`barato`), medium (`médio`) and expensive (`caro`). The categories can then be converted back to numeric codes, the format the model accepts.

In [None]:
# SPLITTING THE VALUES INTO CHEAP (BARATO), MEDIUM (MÉDIO) AND EXPENSIVE (CARO)

preco_new = pd.qcut(df_clf.precoTotal, q = 3, labels = ['barato', 'médio', 'caro'])

preco_new

0        barato
1          caro
2         médio
3          caro
4         médio
          ...  
25159    barato
25160     médio
25161     médio
25162     médio
25163      caro
Name: precoTotal, Length: 25164, dtype: category
Categories (3, object): ['barato' < 'médio' < 'caro']

In [None]:
# CHECKING THE SIZE OF THE CLASSES CREATED

preco_new.value_counts()

médio     8414
barato    8392
caro      8358
Name: precoTotal, dtype: int64

As shown above, the data in the `precoTotal` column were split into three classes of roughly equal size.

Now we need to convert these data back to numeric values.

In [None]:
# CONVERTING FROM CATEGORICAL TO NUMERIC

preco_new_transf = preco_new.cat.codes

In [None]:
# CHECKING THE SIZE OF THE CLASSES AFTER THE CONVERSION

preco_new_transf.value_counts()

1    8414
0    8392
2    8358
dtype: int64

The codes of the new variable therefore correspond to:

* 0 = cheap (`barato`)
* 1 = medium (`médio`)
* 2 = expensive (`caro`)
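The correspondence between codes and labels does not have to be written by hand: the integer code of each category is its position in `.cat.categories`. The sketch below shows this on a handful of made-up prices; `toy_prices`, `toy_classes` and `code_to_label` are hypothetical names used only for the example.

```python
import pandas as pd

# Made-up prices, used only for illustration
toy_prices = pd.Series([1.2, 3.5, 0.8, 9.9, 4.1, 2.2])

# Same binning as in the notebook: three quantile-based classes
toy_classes = pd.qcut(toy_prices, q=3, labels=['barato', 'médio', 'caro'])

# The code of a category is its position in the ordered category list
code_to_label = dict(enumerate(toy_classes.cat.categories))
print(code_to_label)  # {0: 'barato', 1: 'médio', 2: 'caro'}
```

The same expression applied to `preco_new` would recover the mapping for the real data.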

In [None]:
# CREATING A NEW COLUMN IN THE DATAFRAME

df_clf['preco'] = preco_new_transf

With this step complete, we can view the dataset and see that the `preco` column is now part of it.

In [None]:
df_clf.head()

Unnamed: 0,quintal,churrasqueira,banheiros,banheira,quartos,lareira,andares,mobilhado,jardim,estacionamento,academia,jacuzzi,vistaMontanhas,vagasGaragem,salaoFestas,parquinho,piscina,precoTotal,sauna,isolamentoAcustico,quadra,suites,quadraTenis,area,andar,areaUtilizada,Acre,Alagoas,Amazonas,Bahia,Ceará,Distrito Federal,Espírito Santo,Goiás,Maranhão,Mato Grosso,Mato Grosso do Sul,Minas Gerais,Paraná,Paraíba,Pará,Pernambuco,Piauí,Rio Grande do Norte,Rio Grande do Sul,Rio de Janeiro,Rondônia,Santa Catarina,Sergipe,São Paulo,Tocantins,areaCat,areaUtilizadaCat,preco
0,0,0,1,0,2,0,1,0,0,0,0,0,0,1,0,0,0,1.35,0,0,0,1,0,45,0,45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0
1,0,0,3,0,3,0,1,0,0,0,0,0,0,1,0,0,0,5.561,0,0,0,1,0,140,0,140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,2
2,0,0,3,0,3,0,1,0,0,0,0,0,0,2,0,0,0,2.935,0,0,0,1,0,100,0,100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1
3,0,0,2,0,1,0,1,0,0,0,0,0,0,1,0,0,1,137.437,1,0,0,0,0,80,0,80,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,2
4,0,1,3,0,2,0,1,0,0,0,0,0,0,2,1,1,1,3.136,1,0,1,2,0,69,0,69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,1


Now we can drop the original price column (`precoTotal`) and separate the variables into the dependent and independent ones that will feed the model.

In [None]:
# DROPPING THE 'precoTotal' VARIABLE

df_clf.drop('precoTotal', axis = 1, inplace = True)

In [None]:
# CHECKING THE CHANGES

df_clf.head()

Unnamed: 0,quintal,churrasqueira,banheiros,banheira,quartos,lareira,andares,mobilhado,jardim,estacionamento,academia,jacuzzi,vistaMontanhas,vagasGaragem,salaoFestas,parquinho,piscina,sauna,isolamentoAcustico,quadra,suites,quadraTenis,area,andar,areaUtilizada,Acre,Alagoas,Amazonas,Bahia,Ceará,Distrito Federal,Espírito Santo,Goiás,Maranhão,Mato Grosso,Mato Grosso do Sul,Minas Gerais,Paraná,Paraíba,Pará,Pernambuco,Piauí,Rio Grande do Norte,Rio Grande do Sul,Rio de Janeiro,Rondônia,Santa Catarina,Sergipe,São Paulo,Tocantins,areaCat,areaUtilizadaCat,preco
0,0,0,1,0,2,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,45,0,45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0
1,0,0,3,0,3,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,140,0,140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,3.0,2
2,0,0,3,0,3,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,100,0,100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1
3,0,0,2,0,1,0,1,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,80,0,80,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,2
4,0,1,3,0,2,0,1,0,0,0,0,0,0,2,1,1,1,1,0,1,2,0,69,0,69,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,1


Now that we have adapted the target variable for the classification model, we can start implementing the Random Forest algorithm.

In [None]:
# IMPORTING THE RANDOM FOREST MODEL FOR CLASSIFICATION

from sklearn.ensemble import RandomForestClassifier

In [None]:
# INSTANTIATING THE MODEL

rfc_model = RandomForestClassifier(random_state = 42, criterion = 'entropy', oob_score = True)

Like other models, Random Forest has several hyperparameters, for example the `criterion` and `random_state` passed explicitly when the model was instantiated.

We can inspect every parameter currently set on the model, defaults included, with the `.get_params()` method:

In [None]:
# CHECKING THE MODEL PARAMETERS

from pprint import pprint

print('Parameters currently in use:\n')
pprint(rfc_model.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': True,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}


The meaning of some of these parameters is detailed below:

* **n_estimators**: number of trees in the forest;

* **max_features**: maximum number of features considered when splitting a node;

* **max_depth**: maximum number of levels in each decision tree;

* **min_samples_split**: minimum number of samples a node must contain before it can be split;

* **min_samples_leaf**: minimum number of samples allowed in a leaf node;

* **bootstrap**: whether each tree is trained on a bootstrap sample (drawn with replacement) or on the whole dataset.

A detailed description of each parameter is available on the official Scikit-Learn site through this [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
# DEFINING THE FEATURES (X) AND THE TARGET (y)

X = df_clf.drop('preco', axis = 1)
y = df_clf.preco

In [None]:
# SPLITTING THE DATA INTO TRAINING AND TEST SETS

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.28)

In [None]:
# TRAINING THE MODEL

rfc_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
# MODEL PERFORMANCE

# IMPORTING THE METRICS TOOLS

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [None]:
# PREDICTING ON THE TEST DATA

y_pred = rfc_model.predict(X_test)

In [None]:
# VIEWING THE OVERALL ACCURACY

print('[Accuracy] Random Forest:', accuracy_score(y_test, y_pred))

[Accuracy] Random Forest: 0.7150156116945785


In [None]:
# VIEWING THE RANDOM FOREST OUT-OF-BAG SCORE

print('[Out-of-Bag] Random Forest:', rfc_model.oob_score_)

[Out-of-Bag] Random Forest: 0.7061485815211392


In [None]:
# VIEWING THE CLASSIFICATION REPORT

print('\n[Classification Report] Random Forest')

print(classification_report(y_test, y_pred))


[Classification Report] Random Forest
              precision    recall  f1-score   support

           0       0.67      0.66      0.67      2342
           1       0.72      0.70      0.71      2412
           2       0.75      0.78      0.77      2292

    accuracy                           0.72      7046
   macro avg       0.71      0.72      0.72      7046
weighted avg       0.71      0.72      0.71      7046



From the results, we can see that the model had the most difficulty classifying the data for **index 0**, that is, when the **property is cheap** (f1-score of 0.67). The best performance was for the **type 2** target, that is, for **expensive properties** (f1-score of 0.77).

Overall, with an accuracy of about 0.72 on the test set, we can conclude that the model's performance leaves considerable room for improvement.

A very important aspect of implementing a *Machine Learning* model is knowing which hyperparameter values to use. This is not an easy task: there are many of them, and each one affects the model's performance in a different way.

The way to address this is *Hyperparameter Tuning*. And of course, the Scikit-Learn library has a module for it: `sklearn.model_selection`, which includes search tools such as `GridSearchCV` and `RandomizedSearchCV`.

Thus, one way to try to improve the model's performance would be to tune its hyperparameters.
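As a minimal sketch of what such tuning could look like, the example below runs a small grid search with cross-validation. It uses synthetic three-class data rather than the rental dataset, and the grid values are arbitrary choices made for illustration, not recommendations; `X_cls`, `y_cls`, `param_grid` and `search` are assumed names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data, used only for illustration
X_cls, y_cls = make_classification(n_samples=300, n_features=10,
                                   n_informative=5, n_classes=3,
                                   random_state=0)

# Candidate values for a few hyperparameters (arbitrary, kept small to run fast)
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 5],
}

# Every combination in the grid is evaluated with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
)
search.fit(X_cls, y_cls)

print('Best parameters:', search.best_params_)
print('Best cross-validated accuracy:', search.best_score_)
```

On the real data, the same pattern would be applied to `X_train` and `y_train`, keeping the test set aside for a final evaluation of `search.best_estimator_`.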

### **References**

LORENA, Ana Carolina; GAMA, João; FACELI, Katti. Inteligência Artificial: Uma abordagem de aprendizado de máquina. Grupo Gen-LTC, 2011.