# Data processing

**Author:** Douglas Trajano

This notebook handles the data processing used to train the model that predicts the `room_type` variable.

The functions that process the data should live in a **.py** file, which makes it easier to deploy the model to an endpoint later.

The complete project structure is available [here](https://github.com/DougTrajano/ds_airbnb_rio).

## / imports

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline

## / load dataset

In [2]:
df = pd.read_csv("data/listings.csv", low_memory=False)
df.head()

| Unnamed: 0 | host_response_time | host_response_rate | host_is_superhost | host_listings_count | host_identity_verified | is_location_exact | property_type | room_type | accommodates | bathrooms | ... | amenities_mountain_view | amenities_soaking_tub | amenities_beach_view | amenities_jetted_tub | amenities_sun_loungers | amenities_high-resolution_computer_monitor | amenities_private_pool | amenities_bidet | amenities_brick_oven | amenities_hbo_go |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 100.0 | 1.0 | 2.0 | 1.0 | 1 | 0 | 0 | 5 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 91.0 | 0.0 | 3.0 | 1.0 | 1 | 1 | 0 | 2 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 100.0 | 1.0 | 1.0 | 1.0 | 1 | 1 | 0 | 3 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 100.0 | 1.0 | 1.0 | 1.0 | 1 | 1 | 0 | 3 | 1.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 100.0 | 1.0 | 1.0 | 1.0 | 1 | 2 | 0 | 2 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## / Split into X and y, data normalization

We split the dataset into two parts: `X` and `y`.

`X` receives every feature except `room_type`.
`y` receives only the `room_type` values.

From the features in `X` we will infer the value of `y`.

We also normalize the data with `MinMaxScaler`, which rescales each feature to the [0, 1] range so that features with large numeric ranges do not dominate the others.

In [3]:
X = df.drop(columns=["room_type"])
y = df["room_type"].values

scaler = MinMaxScaler()
X = scaler.fit_transform(X.values)

print("X shape:", X.shape)

X shape: (33715, 210)
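
To make the rescaling concrete, here is a minimal sketch of the min-max formula, (x - min) / (max - min), in plain Python. It assumes the column minimum and maximum are learned from the training rows only and then reused on held-out rows, which is a common way to keep test data out of preprocessing (the cell above fits the scaler on the full dataset). The function names and toy rows are illustrative, not part of the project.

```python
def fit_min_max(rows):
    """Learn per-column minimum and maximum from a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each value with (x - min) / (max - min), column by column."""
    scaled = []
    for row in rows:
        scaled_row = []
        for x, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has no range; map it to 0.0 instead of dividing by zero.
            scaled_row.append((x - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy data (hypothetical): two features with very different ranges.
train_rows = [[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]]
test_rows = [[2.0, 600.0]]

mins, maxs = fit_min_max(train_rows)
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)

print(train_scaled)
print(test_scaled)
```

Held-out values that lie outside the training range end up outside [0, 1], as the second test feature does here.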


## / Class balancing

As we saw earlier in the exploratory analysis, the classes we will feed to the algorithm are imbalanced.

If this is not handled properly it can produce unexpected results, because the algorithm also learns from the number of examples provided for each class.

Here we use a simple technique: compute a weight for each class and pass the weights to the algorithm, which takes them into account when fitting the model.

In [4]:
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

class_weights = {
    0: class_weights[0],
    1: class_weights[1],
    2: class_weights[2],
    3: class_weights[3]
}

print(class_weights)

{0: 0.35197519522278364, 1: 0.964719011102209, 2: 11.788461538461538, 3: 26.673259493670887}
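
For reference, the `balanced` heuristic in scikit-learn weights each class by n_samples / (n_classes * count of that class), so rarer classes receive larger weights. A minimal sketch of that formula in plain Python, using hypothetical toy labels and an illustrative function name:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Toy labels (hypothetical): class 0 is common, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

The same pattern appears in the weights printed above: the most frequent class gets a weight below 1 and the rarest class gets the largest weight.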


## / Train/test split

In this step we split the dataset into two more parts: train and test.

The goal is to hold out a portion that will be used to evaluate the classifier and the selected hyperparameters.

In [6]:
from sklearn.model_selection import train_test_split

# Split the dataset in train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print("train size:", len(X_train))
print("test size:", len(X_test))

train size: 22589
test size: 11126
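
The split above is purely random. With classes as rare as 2 and 3, a stratified split, which keeps the class proportions similar in both parts, is a common alternative; in scikit-learn this is the `stratify` parameter of `train_test_split`. The sketch below shows the idea in plain Python. The `stratified_split` helper and the toy labels are hypothetical, and this is not how the notebook's split was produced.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) keeping roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test part.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 90 examples of class 0 and 10 of class 3.
toy_labels = [0] * 90 + [3] * 10
train_idx, test_idx = stratified_split(toy_labels)

print("train size:", len(train_idx))
print("test size:", len(test_idx))
print("rare class in test:", sum(toy_labels[i] == 3 for i in test_idx))
```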


# / modeling

Now that the dataset is ready, we will test a few algorithms to find the one that produces the best model for this data, along with its hyperparameters.

### / Logistic Regression

The first algorithm we will test is logistic regression.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [7]:
%%time
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000, random_state=0, class_weight=class_weights, n_jobs=-1)
clf.fit(X_train, y_train)

scores = cross_val_score(clf, X, y, cv=5)
y_pred = clf.predict(X_test)

print("Accuracy (cross-validation): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print()
print(classification_report(y_test, y_pred, digits=4))

Accuracy (cross-validation): 0.88 (+/- 0.08)

              precision    recall  f1-score   support

           0     0.9728    0.8876    0.9283      7910
           1     0.8352    0.8861    0.8599      2872
           2     0.6061    0.8967    0.7233       242
           3     0.1706    0.8431    0.2838       102

    accuracy                         0.8870     11126
   macro avg     0.6462    0.8784    0.6988     11126
weighted avg     0.9220    0.8870    0.9003     11126

Wall time: 4min 57s
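
The report above can be cross-checked by hand. F1 is the harmonic mean of precision and recall, the macro average is the plain mean over classes, and the weighted average weights each class by its support. The sketch below recomputes these from the precision, recall and support columns printed above; because those inputs are already rounded to four decimals, the results may differ from the report in the last digit.

```python
# Precision, recall and support per class, copied from the report above.
report = {
    0: (0.9728, 0.8876, 7910),
    1: (0.8352, 0.8861, 2872),
    2: (0.6061, 0.8967, 242),
    3: (0.1706, 0.8431, 102),
}


def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = {cls: f1_score_from(p, r) for cls, (p, r, _) in report.items()}
total_support = sum(s for _, _, s in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1[cls] * s for cls, (_, _, s) in report.items()) / total_support

for cls in report:
    print(cls, round(f1[cls], 4))
print("macro avg f1:", round(macro_f1, 4))
print("weighted avg f1:", round(weighted_f1, 4))
```

Class 3 explains why the macro average sits well below the weighted one: its high recall comes with low precision, and the macro average gives that small class the same weight as class 0.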


### / Random Forest

The second algorithm we will test is random forest.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [21]:
%%time
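from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid explored by GridSearchCV (4 * 4 * 4 * 2 = 128 candidates)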
parameters = {'n_estimators':[50, 100, 200, 300],
             "min_samples_leaf": [5, 10, 15, 20],
             "min_samples_split": [5, 10, 15, 20],
             "criterion": ["gini", "entropy"]}

model = RandomForestClassifier(random_state=0, class_weight=class_weights, n_jobs=-1)
clf = GridSearchCV(model, parameters, cv=5, verbose=2, n_jobs=-1)
clf.fit(X_train, y_train)
clf = clf.best_estimator_
preds = clf.predict(X_test)
print(classification_report(y_test, preds, digits=4))
clf

Fitting 5 folds for each of 128 candidates, totalling 640 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed: 15.1min
[Parallel(n_jobs=-1)]: Done 640 out of 640 | elapsed: 26.2min finished


              precision    recall  f1-score   support

           0     0.9869    0.9774    0.9821      7910
           1     0.9443    0.9614    0.9527      2872
           2     0.7891    0.9587    0.8657       242
           3     0.8514    0.6176    0.7159       102

    accuracy                         0.9695     11126
   macro avg     0.8929    0.8788    0.8791     11126
weighted avg     0.9703    0.9695    0.9695     11126

Wall time: 26min 20s


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                       class_weight={0: 0.35197519522278364,
                                     1: 0.964719011102209,
                                     2: 11.788461538461538,
                                     3: 26.673259493670887},
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=300,
                       n_jobs=-1, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
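
The count of 640 fits reported above follows from the grid: `GridSearchCV` tries every combination of the listed values and fits each one once per fold. A short sketch that enumerates the same grid with `itertools.product` (the `candidates` and `cv_folds` names are illustrative):

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    "n_estimators": [50, 100, 200, 300],
    "min_samples_leaf": [5, 10, 15, 20],
    "min_samples_split": [5, 10, 15, 20],
    "criterion": ["gini", "entropy"],
}
cv_folds = 5

# Every combination of one value per hyperparameter.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

print("candidates:", len(candidates))
print("total fits:", len(candidates) * cv_folds)
print("first candidate:", candidates[0])
```

Because the number of fits is the product of all list lengths times the number of folds, adding one more value to any list increases the run time noticeably.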

In [31]:
%%time
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=300, criterion='gini', random_state=0, 
                             min_samples_leaf=5, min_samples_split=5,
                             class_weight=class_weights, n_jobs=-1)
clf.fit(X_train, y_train)

scores = cross_val_score(clf, X, y, cv=5)
y_pred = clf.predict(X_test)

print("Accuracy (cross-validation): %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print()
print(classification_report(y_test, y_pred, digits=4))

Accuracy (cross-validation): 0.97 (+/- 0.02)

              precision    recall  f1-score   support

           0     0.9869    0.9774    0.9821      7910
           1     0.9443    0.9614    0.9527      2872
           2     0.7891    0.9587    0.8657       242
           3     0.8514    0.6176    0.7159       102

    accuracy                         0.9695     11126
   macro avg     0.8929    0.8788    0.8791     11126
weighted avg     0.9703    0.9695    0.9695     11126

Wall time: 1min 39s


In [32]:
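def save_model(clf, filename="model.sav"):
    # Serialize the fitted classifier; the context manager closes the file.
    with open(filename, "wb") as f:
        pickle.dump(clf, f)
    return "Model saved on {}".format(filename)

def load_model(filename="model.sav"):
    # Only unpickle files that come from a trusted source.
    with open(filename, "rb") as f:
        return pickle.load(f)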

save_model(clf)

'Model saved on model.sav'
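
Since new listings must be scaled the same way as the training data before prediction, one option is to store the fitted scaler next to the classifier. A minimal sketch, assuming a single pickle file that holds both; the `save_bundle` and `load_bundle` names, the `bundle.sav` filename and the placeholder objects are hypothetical, and in the notebook the fitted `scaler` and `clf` would take their place.

```python
import os
import pickle
import tempfile


def save_bundle(scaler, clf, filename):
    """Pickle the preprocessing step and the classifier together."""
    with open(filename, "wb") as f:
        pickle.dump({"scaler": scaler, "model": clf}, f)


def load_bundle(filename):
    """Return (scaler, clf); only unpickle files from a trusted source."""
    with open(filename, "rb") as f:
        bundle = pickle.load(f)
    return bundle["scaler"], bundle["model"]


# Placeholders (hypothetical) standing in for the fitted scaler and classifier.
scaler_stub = {"kind": "scaler"}
clf_stub = {"kind": "classifier"}

path = os.path.join(tempfile.mkdtemp(), "bundle.sav")
save_bundle(scaler_stub, clf_stub, path)
loaded_scaler, loaded_clf = load_bundle(path)
print(loaded_scaler, loaded_clf)
```

At prediction time the loaded scaler would be applied with `transform` (not `fit_transform`) before calling `predict` on the loaded classifier.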

# / Conclusions

The random forest reached good accuracy (about 0.97 in cross-validation), but before the model is used in production we need more examples of the `Shared room` and `Hotel room` classes. Together these classes represent only **3%** of the dataset.