### Anteriormente, fizemos a análise exploratória do dataset de preços de passagens de avião visando responder algumas perguntas sobre o comportamento dos preços. Neste notebook, iremos utilizar técnicas de transformação dos dados e machine learning para realizar a previsão dos preços das passagens.

In [154]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
df_inicial = pd.read_csv('Clean_Dataset.csv')

In [3]:
df_inicial

Unnamed: 0.1,Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955
...,...,...,...,...,...,...,...,...,...,...,...,...
300148,300148,Vistara,UK-822,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,49,69265
300149,300149,Vistara,UK-826,Chennai,Afternoon,one,Night,Hyderabad,Business,10.42,49,77105
300150,300150,Vistara,UK-832,Chennai,Early_Morning,one,Night,Hyderabad,Business,13.83,49,79099
300151,300151,Vistara,UK-828,Chennai,Early_Morning,one,Evening,Hyderabad,Business,10.00,49,81585


In [4]:
df_inicial['flight'].nunique()

1561

### Vamos começar dispensando as colunas de indíce e "flight", que diz respeito ao código do vôo.

In [5]:
df_inicial.drop(columns=df_inicial.columns[[0,2]], axis=1, inplace=True)

In [6]:
df_inicial.columns

Index(['airline', 'source_city', 'departure_time', 'stops', 'arrival_time',
       'destination_city', 'class', 'duration', 'days_left', 'price'],
      dtype='object')

### Agora, vamos fazer a transformação das variáveis categóricas. Utilizaremos o OneHotEncoder, que cria arrays de 0 e 1. A escolha por esse método se deu por conta do risco dos algoritmos estabelecerem relações de proximidade. Por exemplo, se os destinos estivessem classificados como números de 1 - 6, o algoritmo poderia entender que 5 e 6 são mais semelhantes do que 1 e 2.
### É importante observar que o algoritmo de RandomForest consegue lidar com variáveis categóricas, porém, a transformação será feita para trabalharmos com outros algoritmos que não possuem essa característica.

In [7]:
df_inicial.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300153 entries, 0 to 300152
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   airline           300153 non-null  object 
 1   source_city       300153 non-null  object 
 2   departure_time    300153 non-null  object 
 3   stops             300153 non-null  object 
 4   arrival_time      300153 non-null  object 
 5   destination_city  300153 non-null  object 
 6   class             300153 non-null  object 
 7   duration          300153 non-null  float64
 8   days_left         300153 non-null  int64  
 9   price             300153 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 22.9+ MB


In [8]:
from sklearn.preprocessing import OneHotEncoder

In [9]:
categoricas = df_inicial.select_dtypes(include=['object']).columns.tolist()
ohe = OneHotEncoder()
array_ohe = ohe.fit_transform(df_inicial[categoricas]).toarray()


In [10]:
df_ohe = pd.DataFrame(array_ohe)
df_ohe

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,25,26,27,28,29,30,31,32,33,34
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300148,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
300149,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
300150,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
300151,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [11]:
df_cont = df_inicial.drop(columns=categoricas,axis=1)
df_ohe_final = pd.concat([df_cont, df_ohe], axis=1)
df_ohe_final

Unnamed: 0,duration,days_left,price,0,1,2,3,4,5,6,...,25,26,27,28,29,30,31,32,33,34
0,2.17,1,5953,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,2.33,1,5953,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,2.17,1,5956,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,2.25,1,5955,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
4,2.33,1,5955,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300148,10.08,49,69265,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
300149,10.42,49,77105,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
300150,13.83,49,79099,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
300151,10.00,49,81585,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


### Agora que as variáveis categóricas passaram pelo OneHotEncoder, vamos separar as váriaveis independentes em X e a variável dependente (price) em Y. Após isso, faremos a divisão em treino, validação e teste.

In [12]:
x = df_ohe_final.drop(columns='price', axis=1)
y= df_ohe_final['price']

In [13]:
from sklearn.model_selection import train_test_split

x_t, x_teste, y_t, y_teste = train_test_split(x, y, test_size=0.2, random_state=42)

x_treino, x_val, y_treino, y_val = train_test_split(x_t, y_t, test_size=0.25, random_state=42)

### Dessa maneira, dividimos o dataset deixando 20% dos dados para validação e teste e o restante para o treinamento.

In [14]:
x_teste.shape, x_val.shape, y_teste.shape, y_val.shape, x_treino.shape, y_treino.shape

((60031, 37), (60031, 37), (60031,), (60031,), (180091, 37), (180091,))

### Agora que os dados estão divididos, podemos fazer o escalonamento deles sem nos preocupar com data leakage.

In [15]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [16]:
def escalonador(df):
  numericas = ['days_left', 'duration']
  df_scaled = pd.DataFrame(scaler.fit_transform(df[numericas]))
  df.drop(columns=numericas,axis=1,inplace=True)
  df_final = pd.concat([df.reset_index(drop=True), df_scaled.reset_index(drop=True)], axis=1)
  return df_final

In [17]:
x_treino_final = escalonador(x_treino)
x_teste_final = escalonador(x_teste)
x_val_final = escalonador(x_val)

In [18]:
x_teste_final.shape, x_treino_final.shape, x_val_final.shape

((60031, 37), (180091, 37), (60031, 37))

### Agora, faremos a validação cruzada e verificaremos o valor médio do score de cada algoritmo. Também verificaremos se há indícios de overfitting. Trabalharemos com modelos de Regressão Linear, RandomForest e SGD.

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn import metrics

In [148]:
lr = LinearRegression()
cv = KFold(n_splits=10, random_state=42, shuffle=True)
cv_lr = cross_val_score(lr, x_treino_final, y_treino, cv=cv)

In [149]:
cv_lr

array([0.90951191, 0.91375732, 0.9120879 , 0.9134644 , 0.91189805,
       0.90858701, 0.91116949, 0.90983045, 0.91164852, 0.91158513])

In [40]:
cv_lr.mean()

0.9113540178961095

In [20]:
from sklearn.ensemble import RandomForestRegressor

In [27]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
cv_rf = cross_val_score(rf, x_treino_final, y_treino, cv=cv)

In [150]:
cv_rf

array([0.98428648, 0.98481028, 0.98551265, 0.98567203, 0.98443745,
       0.98508988, 0.98479165, 0.98402733, 0.98429971, 0.98535136])

In [29]:
cv_rf.mean()

0.9848278820410737

In [37]:
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(max_iter=1000, tol=1e-3)
cv_sgd = cross_val_score(sgd, x_treino_final, y_treino, cv=cv)

In [151]:
cv_sgd

array([0.90942722, 0.9137201 , 0.91201256, 0.91329143, 0.91164462,
       0.90851551, 0.91107612, 0.90981748, 0.91152535, 0.91171097])

In [39]:
cv_sgd.mean()

0.9112741350174295

### Não há muita variação no desempenho em cada fold. O algoritmo de RandomForest mostrou um score consideravelmente maior que os outros. Vamos agora realizar a tunagem dos hiperparâmetros, primeiro utilizando otimizador Bayesiano, já que nossa RandomForest possui muitos parâmetros a serem testados e o GridSearch tomaria tempo de mais ao testar todas as combinações possíveis. Para o modelo SGD, utilizaremos GridSearch pois estamos tunando menos hiperparâmetros.

In [None]:
!pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

In [None]:
search_space_rf = {'max_depth': Integer(5, 15),
                   'max_features': Categorical(['auto', 'log2','sqrt']),
                   'min_samples_leaf': Integer(2,10),
                   'n_estimators': Integer(100,300),
                   'bootstrap': Categorical([True, False])
}

rf_bayes = BayesSearchCV(rf, search_space_rf, n_iter=20, random_state=42, n_jobs=1, cv=cv)
rf_bayes.fit(x_val, y_val)



In [49]:
rf_bayes.best_params_

OrderedDict([('bootstrap', True),
             ('max_depth', 15),
             ('max_features', 'auto'),
             ('min_samples_leaf', 4),
             ('n_estimators', 297)])

In [51]:
rf_bayes.best_score_

0.9569700677637212

### Como para o modelo SGD testaremos poucos hiperparâmetros, é válido usarmos o GridSearch que testará cada combinação possível. Neste algoritmo, verificaremos qual regularização desempenha melhor. Lembrando que l1 corresponde a regularização de Lasso, que pode reduzir a variância do modelo, além de zerar coeficientes de algumas variáveis quando há autocorrelação. A regularização l2 é a regularização Ridge, na qual é aplicada penalização mas não há redução do número de variáveis. ElasticNet é uma combinação de l1 e l2, com parâmetro entre 0 e 1 que controla quão próxima está de uma ou outra. Além disso, testaremos qual taxa de aprendizado desempenha melhor.

In [None]:
from sklearn.model_selection import GridSearchCV
grid_sgd = {'learning_rate': ['optimal', 'adaptive', 'invscaling', 'constant'],
                   'penalty': ['l1', 'l2','elasticnet']
}

sgd_grid_search = GridSearchCV(sgd, grid_sgd, cv=cv)
sgd_grid_search.fit(x_val, y_val)

In [157]:
sgd_grid_search.best_params_

{'learning_rate': 'adaptive', 'penalty': 'l1'}

In [158]:
sgd_grid_search.best_score_

0.9054939829084214

### Agora que temos os melhores parâmetros, vamos treinar nossos modelos finais e realizar as previsões.

In [None]:
modelo_rf_final = RandomForestRegressor(max_depth=15, max_features='auto',min_samples_leaf=4,n_estimators=297, random_state=42)
modelo_rf_final.fit(x_treino_final, y_treino)

In [None]:
modelo_sgd_final = SGDRegressor(learning_rate='adaptive', penalty='l1', max_iter=1000, tol=1e-3, random_state=42)
modelo_sgd_final.fit(x_treino_final, y_treino)


In [None]:
modelo_lr_final = LinearRegression()
modelo_lr_final.fit(x_treino_final, y_treino)


In [160]:
y_pred_lr = modelo_lr_final.predict(x_teste_final)
y_pred_rf = modelo_rf_final.predict(x_teste_final)
y_pred_sgd = modelo_sgd_final.predict(x_teste_final)

In [None]:
from sklearn.metrics import mean_squared_error
import math

### Vamos definir uma função para retornar o Score e o RMSE

In [152]:
def score_rmse(modelo,y, alg):
  score = modelo.score(x_teste_final, y_teste)
  rmse = mean_squared_error(y_teste, y)
  print(f'Score - {alg}: {score*100:.2f}%')
  print(f'RMSE - {alg}: {math.sqrt(rmse)}')

In [161]:
score_rmse(modelo=modelo_lr_final, y=y_pred_lr, alg='Regressão Linear')
score_rmse(modelo=modelo_rf_final, y=y_pred_rf, alg = 'Random Forest')
score_rmse(modelo=modelo_sgd_final, y=y_pred_sgd, alg = 'SGD')

Score - Regressão Linear: 91.13%
RMSE - Regressão Linear: 6761.840509560028
Score - Random Forest: 97.50%
RMSE - Random Forest: 3587.112838809718
Score - SGD: 91.13%
RMSE - SGD: 6761.860178747728


### Temos o resultado de que o algoritmo RandomForest possui o maior score e o menor RMSE. Vamos agora visualizar o tamanho do erro dos resultados da RandomForest mais de perto.

In [108]:
df_erro = pd.DataFrame(y_pred_rf - y_teste).abs()
df_erro.rename(columns={'price':'Erro'}, inplace=True)

In [162]:
df_erro

Unnamed: 0,Erro
27131,2054.575311
266857,4210.665558
141228,135.464643
288329,5170.241511
97334,1047.016225
...,...
5234,163.559116
5591,177.643388
168314,822.345525
175191,1449.038139


In [163]:
df_erro.describe()

Unnamed: 0,Erro
count,60031.0
mean,1871.513335
std,3060.224233
min,0.0
25%,294.797159
50%,830.860251
75%,1921.108138
max,46187.063279


### Os dados indicam um desvio de 3000 e em até 75% das previsões o erro é no máximo de aproximadamente 2000. A mediana do erro é inferior a 1000. Importante ressaltar que estamos avaliando o erro em módulo.

### Sendo assim, podemos concluir que dentre os três algoritmos testados, aquele que conseguiu o melhor resultado foi o RandomForest. Mas, será que os hiperparâmetros que testamos realmente nos forneceram resultados melhores? Vamos testar um modelo de RandomForest no conjunto de teste sem utilizar os hiperparâmetros testados pela otimização bayesiana.

In [None]:
rf_sem_tunagem = RandomForestRegressor(random_state=42)
rf_sem_tunagem.fit(x_treino_final, y_treino)

In [132]:
y_pred_rf_st = modelo_lr_final.predict(x_teste_final)

In [133]:
score_rmse(modelo=rf_sem_tunagem, y=y_pred_rf_st, alg='Random Forest sem tunagem')

Acurácia - Random Forest sem tunagem: 98.19%
RMSE - Random Forest sem tunagem: 6761.840509560028


### O modelo apresenta um score maior, porém um RMSE também maior. Vamos olhar este erro de perto.

In [138]:
df_erro_st = pd.DataFrame(y_pred_rf_st - y_teste).abs()
df_erro_st.rename(columns={'price':'Erro'}, inplace=True)

In [143]:
df_erro, df_erro_st

(               Erro
 27131   2054.575311
 266857  4210.665558
 141228   135.464643
 288329  5170.241511
 97334   1047.016225
 ...             ...
 5234     163.559116
 5591     177.643388
 168314   822.345525
 175191  1449.038139
 287693  4551.353632
 
 [60031 rows x 1 columns],
           Erro
 27131   3986.0
 266857  9775.0
 141228  4161.0
 288329  5212.0
 97334     50.0
 ...        ...
 5234       2.0
 5591    4117.0
 168314  1742.0
 175191  6634.0
 287693  8929.0
 
 [60031 rows x 1 columns])

In [146]:
df_erro_st.describe(), df_erro.describe()

(               Erro
 count  60031.000000
 mean    4558.256351
 std     4994.517159
 min        0.000000
 25%     1454.000000
 50%     3123.000000
 75%     5514.500000
 max    59315.000000,
                Erro
 count  60031.000000
 mean    1871.513335
 std     3060.224233
 min        0.000000
 25%      294.797159
 50%      830.860251
 75%     1921.108138
 max    46187.063279)

In [141]:
df_erro.sort_values(by='Erro').head(10)


Unnamed: 0,Erro
209089,0.0
226397,0.0
97759,0.0
216934,0.0
101685,0.0
99465,0.0
221108,0.0
61180,0.0
100757,0.0
265671,0.0


In [142]:
df_erro_st.sort_values(by='Erro').head(10)

Unnamed: 0,Erro
125364,0.0
211936,0.0
84828,0.0
26299,0.0
70879,0.0
293886,1.0
130137,1.0
216192,1.0
170429,1.0
43165,1.0


### Por todas as métricas vemos que o modelo com a tunagem dos hiperparâmetros erra menos no dataset de teste. Enquanto o modelo tunado possui desvio de aproximadamente 3000, o modelo sem tunagem possui desvio de 5000. Além disso, os quartis do modelo sem tunagem possuem valores mais elevados de erro do que o tunado.

### Por fim, vamos recapitular o que foi feito. Primeiro, transformamos os dados categóricos por meio do OneHotEncoder. Embora os algoritmos de RandomForest saibam lidar com dados categóricos, fizemos esta transformação porque também trabalhamos com Regressão Linear e SGD. Após separarmos os conjuntos de treino, teste e validação, fizemos o escalonamento dos dados, evitando data leakage. Então, fizemos a validação cruzada para entendermos a precisão dos algoritmos e verificar se estaria havendo overfitting. Utilizamos a otimização bayesiana na RandomForest pois havia muitos hiperparâmetros a serem tunados e o GridSearch com o SGD pois testamos menos hiperparâmetros para ele. Finalmente, comparamos os modelos construídos e concluímos que a RandomForest nos fornece as previsões com o menor erro.