# Machine Learning Prediction

Project goals:
- Build a model that predicts whether a booking will be cancelled.

## Data Acquisition

Let's import the libraries needed for this part of the analysis:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

Let's load the dataset and look at the first rows with the `head` method:

In [2]:
df = pd.read_csv("hotel_bookings.csv")
df.head()

## Preprocessing

Let's look at which attributes have null values.

In [3]:
df.isnull().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

Since the most significant attributes (such as `is_canceled`) contain no null values, we don't need to drop any rows from the dataset; we only fill in the null values:

In [4]:
# fill NaN values with 0
df['children'] = df['children'].fillna(0)

# fill NaN values with the mode (most frequent value) of each column;
# assigning the result back avoids in-place modification of a column selection
for col in ['country', 'agent', 'company']:
    df[col] = df[col].fillna(df[col].mode()[0])
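
As a quick sanity check of the imputation above, the sketch below applies the same idea to a small made-up DataFrame and verifies that no nulls remain. The `toy` frame and its values are invented for illustration and are not taken from `hotel_bookings.csv`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration)
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0],
    "country": ["PRT", None, "PRT"],
    "agent": [9.0, 9.0, np.nan],
})

# Numeric count: treat a missing value as zero children
toy["children"] = toy["children"].fillna(0)

# Categorical / identifier columns: use the most frequent value (the mode)
for col in ["country", "agent"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Total number of nulls left across all columns
remaining_nulls = int(toy.isnull().sum().sum())
print(remaining_nulls)
```

Running the same `isnull().sum()` check on the real `df` after the fill step is a cheap way to confirm that every column was handled.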

Let's look at the dataset's attributes:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119390 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

We will keep the following attributes unchanged:
`is_canceled`, `lead_time`, `arrival_date_year`, `arrival_date_week_number`, `arrival_date_day_of_month`, `adults`,
`stays_in_weekend_nights`, `stays_in_week_nights`, `is_repeated_guest`, `booking_changes`, `days_in_waiting_list`, `adr`, `required_car_parking_spaces`, `total_of_special_requests`, `agent`, `company`.

1. We will replace the following attributes: 
*   `children` and `babies` -> `has_children` 
*   `previous_cancellations` and `previous_bookings_not_canceled` -> `total_bookings`
*   `reserved_room_type` and `assigned_room_type` -> `same_room`

2. We will remove the following attributes:
`meal`, `reservation_status_date`, `reservation_status`, `children`, `babies`, `reserved_room_type`, `assigned_room_type`

3. Finally, some attributes hold categorical values. We will convert the following attributes to integers:
`hotel`, `country`, `market_segment`, `distribution_channel`, `deposit_type`, `customer_type`, `arrival_date_month`



**1. Replacing the attributes:**

In [6]:
# Combine children and babies into a binary has_children flag
# (1 if the booking includes at least one child or baby, 0 otherwise)
df['has_children'] = ((df['children'] + df['babies']) >= 1).astype(int)

# Combine previous_cancellations and previous_bookings_not_canceled into total_bookings
df['total_bookings'] = df['previous_cancellations'] + df['previous_bookings_not_canceled']

# Combine assigned_room_type and reserved_room_type into a new same_room column
# (1 if the assigned room matches the reserved one, 0 otherwise)
df['same_room'] = (df['assigned_room_type'] == df['reserved_room_type']).astype(int)
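
To make the three derived features concrete, here is a small worked example on made-up bookings. The `toy` frame is hypothetical; only the column names come from the dataset. It uses vectorized comparisons, which should give the same 0/1 flags as a row-wise function.

```python
import pandas as pd

# Three invented bookings (values are illustrative only)
toy = pd.DataFrame({
    "children": [0, 1, 0],
    "babies": [0, 0, 2],
    "previous_cancellations": [0, 2, 1],
    "previous_bookings_not_canceled": [3, 0, 1],
    "reserved_room_type": ["A", "A", "D"],
    "assigned_room_type": ["A", "C", "D"],
})

# 1 if the booking includes any child or baby
toy["has_children"] = ((toy["children"] + toy["babies"]) >= 1).astype(int)

# Total number of earlier bookings, cancelled or not
toy["total_bookings"] = (
    toy["previous_cancellations"] + toy["previous_bookings_not_canceled"]
)

# 1 if the guest got the room type they reserved
toy["same_room"] = (
    toy["assigned_room_type"] == toy["reserved_room_type"]
).astype(int)

print(toy[["has_children", "total_bookings", "same_room"]])
```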

**2. Removing the attributes:**

In [7]:
df.drop(['meal', 'children', 
         'babies', 'assigned_room_type', 'reserved_room_type',
         'reservation_status', 'reservation_status_date'], axis=1, inplace=True)

**3. Converting the attributes:**

Let's convert the months to integer values:

In [8]:
monthMap = {'January':1, 'February':2, 'March':3, 
             'April':4, 'May':5, 'June':6, 
             'July':7, 'August':8, 'September':9, 
             'October':10, 'November':11, 'December':12}

# map each month name to its number
df['arrival_date_month'] = df['arrival_date_month'].map(monthMap)

Let's also convert the other columns with categorical values:

In [9]:
le = preprocessing.LabelEncoder()
atts = ['hotel', 'country', 'market_segment', 'distribution_channel', 
        'deposit_type', 'customer_type']

for att in atts:
    df[att] = le.fit_transform(df[att])
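
Because a single `LabelEncoder` is refit for each column, the mapping for earlier columns is lost once the loop finishes. If the original labels are needed later (for example, to interpret predictions), one option is to keep one encoder per column. The sketch below shows this on made-up data; the `encoders` dict and the `toy` frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample with two of the categorical columns
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit"],
})

# Hypothetical helper dict: column name -> fitted LabelEncoder
encoders = {}
for col in ["hotel", "deposit_type"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Integer codes are assigned in sorted label order
hotel_codes = toy["hotel"].tolist()

# Recover the original labels from the integer codes
hotel_labels = list(encoders["hotel"].inverse_transform(toy["hotel"]))
print(hotel_codes, hotel_labels)
```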

## Prediction

Let's split the dataset into 80% for training and 20% for testing:

In [10]:
y = df['is_canceled']
df.drop(['is_canceled'], axis='columns', inplace=True)

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
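
The split above is random, so each run produces different train and test sets. A minimal sketch of a reproducible variant follows, assuming one wants a fixed seed and the same class ratio in both splits. The `random_state` value and the synthetic `X`/`y` arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 60 of class 0 and 40 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify=y keeps the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_tr), len(X_te), int(y_te.sum()))
```

With stratification, the 20 test samples contain 8 positives, matching the 40% positive rate of the full set.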

Because the attributes vary widely in scale, leaving them unscaled can bias the result toward variables with larger magnitudes. Let's normalize the data so that all attributes share the same range:



In [None]:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training data only, then apply it to both splits
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
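
For reference, `MinMaxScaler` with its default range rescales each column as `(x - min) / (max - min)`, where `min` and `max` come from the data it was fitted on. The sketch below checks this by hand on made-up numbers (the `train` and `test` arrays are illustrative only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented training and test samples with two features on very different scales
train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
test = np.array([[2.5, 600.0]])

# Fit on the training data only
scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)

# Same computation by hand, using the training min and max per column
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (test - col_min) / (col_max - col_min)

print(scaled, manual)
```

Note that the second test value (600) lies above the training maximum (500), so it scales to 1.25: test data is not guaranteed to stay inside [0, 1].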

Decision Tree:

In [16]:
from sklearn import tree

from sklearn import metrics 
from sklearn.metrics import confusion_matrix

# Create Decision Tree classifier object
clf = tree.DecisionTreeClassifier(criterion="gini", max_depth=15)

# Train Decision Tree classifier
# (after scaling, X_train and X_test are NumPy arrays, so .values is not needed)
clf = clf.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

#tree.plot_tree(clf)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1 score:",metrics.f1_score(y_test, y_pred))

print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8645196415110143
Precision: 0.8300279981334577
Recall: 0.800067468795682
F1 score: 0.8147724019467507
Confusion Matrix: 
[[13528  1457]
 [ 1778  7115]]
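
The reported scores follow directly from the confusion matrix. scikit-learn lays it out with true classes as rows and predicted classes as columns, so for a binary problem it reads `[[TN, FP], [FN, TP]]`. The short sketch below recomputes the four metrics from the Decision Tree counts printed above.

```python
# Counts taken from the Decision Tree confusion matrix above
tn, fp = 13528, 1457   # true class 0: correctly kept / wrongly flagged as cancelled
fn, tp = 1778, 7115    # true class 1: missed cancellations / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of predicted cancellations, how many were real
recall = tp / (tp + fn)                      # of real cancellations, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```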


Let's save the model to a file:

In [21]:
import pickle

# the context manager closes the file even if pickling fails
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)
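
Loading the model back is the mirror operation. The sketch below does a full save-and-load round trip with a tiny stand-in classifier and a temporary file, so it runs on its own. The stand-in model and its training data are assumptions for illustration; in the notebook, the file would be `model.pkl` and the model would be `clf`.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model trained on four invented one-feature samples
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

# Save to a temporary location (the notebook writes 'model.pkl' instead)
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and check it predicts the same as the original
with open(path, "rb") as f:
    restored = pickle.load(f)

restored_pred = restored.predict([[0], [3]]).tolist()
same_predictions = restored_pred == model.predict([[0], [3]]).tolist()
print(restored_pred, same_predictions)
```

A pickled scikit-learn model should be loaded with the same library version that wrote it, and pickle files from untrusted sources should never be loaded, because unpickling can execute arbitrary code.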

Logistic Regression:

In [None]:
from sklearn.linear_model import LogisticRegression

# Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg = logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1 score:",metrics.f1_score(y_test, y_pred))

print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7872099840857694
Precision: 0.7980440856876746
Recall: 0.5762806860217464
F1 score: 0.669270324806353
Confusion Matrix: 
[[13656  1301]
 [ 3780  5141]]


SVM:

In [None]:
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# Create an SVM classifier
clf = svm.SVC(kernel='rbf') 

# Train the model on the training data
clf.fit(X_train, y_train)

# Predict on the test data
# (SVC.predict already returns class labels, so no thresholding is needed)
y_pred = clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("F1 score:",metrics.f1_score(y_test, y_pred))

print("Confusion Matrix: ")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8245246670575425
Precision: 0.832
Recall: 0.6644994955722453
F1 score: 0.7388757322697246
Confusion Matrix: 
[[13760  1197]
 [ 2993  5928]]


MLP:

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.optimizers import SGD

# create model with one hidden layer and one output layer
model = tf.keras.Sequential([
  layers.Dense(5, 
               activation=tf.nn.relu,
               kernel_initializer=tf.keras.initializers.RandomNormal(mean=0,stddev=1),
               ),
  layers.Dense(1, 
               activation=tf.nn.sigmoid) 
])

# set loss function, optimizer and evaluation metric
# (learning_rate replaces the deprecated lr argument)
model.compile(loss='MeanSquaredError', 
              optimizer=SGD(learning_rate=0.01, momentum=0.9), 
              metrics=['accuracy'])


# train the model
history = model.fit(x=X_train,
                    y=y_train,
                    epochs=100,
                    validation_data=(X_test, y_test),
                    verbose=1)

Epoch 1/100
...
Epoch 77/100
Epoch 78