In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Predicting Passenger Satisfaction with Air Travel

This notebook attempts to predict a passenger's satisfaction with a flight. The data is available on [Kaggle](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction). The *dataset* has 24 *features* used to predict the *target*, which is the passenger's satisfaction level, either *neutral or dissatisfied* or *satisfied*; in other words, this is a binary classification problem.

I chose a tabular dataset because I wanted something that would let me test several kinds of models. I don't think that would be possible with, say, an image dataset, where, in my view, the main approach would be "simply" testing configurations of convolutional neural networks. With tabular data, I can validate several models, including a neural network.

It is therefore a problem that can be solved with supervised models, since the satisfaction level of each passenger is known.

# Contents

- <b>Dataset characterization and cleaning</b>
- <b>Applying the models</b>
    - <b>Splitting into training and test data</b>
    - <b>Training the models</b>
- <b>Analysis of the results</b>
- <b>Conclusion</b>

# Dataset characterization and cleaning

In [64]:
train_df = pd.read_csv("./dataset/train.csv")
test_df = pd.read_csv("./dataset/test.csv")

The data already comes split into training (80%) and test (20%) sets.

In [65]:
print(f"Number of training rows: {len(train_df)}")
print(f"Number of test rows: {len(test_df)}")

Number of training rows: 103904
Number of test rows: 25976
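
As a quick sanity check of the 80/20 split, the proportions can be recomputed from the two row counts printed above. This is only an arithmetic sketch; the variable names are illustrative and not part of the original notebook.

```python
# Row counts reported by len(train_df) and len(test_df) above
n_train = 103904
n_test = 25976

total = n_train + n_test
train_share = n_train / total
test_share = n_test / total

print(f"train: {train_share:.1%}, test: {test_share:.1%}")
```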


In [66]:
train_df.head()

,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [67]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

The dataset has 25 columns, one of which is the target column (*satisfaction*). Note, however, that columns 0 (Unnamed: 0) and 1 (id) are of no interest for classification. In addition, most of the columns are categorical, so we can change their types from int64 to category. The object-typed columns are also categorical, and we can extract *dummy variables* from them.
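
One practical effect of casting low-cardinality integer columns to `category` is lower memory use, because pandas stores small integer codes instead of 64-bit values. The sketch below illustrates this on a toy rating column (values 0 to 5, made up for the example, not taken from the dataset); the exact savings on the real data depend on each column's cardinality.

```python
import pandas as pd

# Toy rating column with values 0-5, similar in spirit to the survey columns
ratings = pd.Series([0, 1, 2, 3, 4, 5] * 10_000, dtype="int64")
as_category = ratings.astype("category")

# Compare the bytes used by the values themselves (index excluded)
int_bytes = ratings.memory_usage(index=False)
cat_bytes = as_category.memory_usage(index=False)

print(f"int64: {int_bytes} bytes, category: {cat_bytes} bytes")
```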

In [68]:
train_df.describe().T

,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,103904.0,51951.5,29994.645522,0.0,25975.75,51951.5,77927.25,103903.0
id,103904.0,64924.210502,37463.812252,1.0,32533.75,64856.5,97368.25,129880.0
Age,103904.0,39.379706,15.114964,7.0,27.0,40.0,51.0,85.0
Flight Distance,103904.0,1189.448375,997.147281,31.0,414.0,843.0,1743.0,4983.0
Inflight wifi service,103904.0,2.729683,1.327829,0.0,2.0,3.0,4.0,5.0
Departure/Arrival time convenient,103904.0,3.060296,1.525075,0.0,2.0,3.0,4.0,5.0
Ease of Online booking,103904.0,2.756901,1.398929,0.0,2.0,3.0,4.0,5.0
Gate location,103904.0,2.976883,1.277621,0.0,2.0,3.0,4.0,5.0
Food and drink,103904.0,3.202129,1.329533,0.0,2.0,3.0,4.0,5.0
Online boarding,103904.0,3.250375,1.349509,0.0,2.0,3.0,4.0,5.0


In [69]:
train_df.isna().sum()

Unnamed: 0                             0
id                                     0
Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             310
satisfaction    

In [70]:
test_df.isna().sum()

Unnamed: 0                            0
id                                    0
Gender                                0
Customer Type                         0
Age                                   0
Type of Travel                        0
Class                                 0
Flight Distance                       0
Inflight wifi service                 0
Departure/Arrival time convenient     0
Ease of Online booking                0
Gate location                         0
Food and drink                        0
Online boarding                       0
Seat comfort                          0
Inflight entertainment                0
On-board service                      0
Leg room service                      0
Baggage handling                      0
Checkin service                       0
Inflight service                      0
Cleanliness                           0
Departure Delay in Minutes            0
Arrival Delay in Minutes             83
satisfaction                          0


Both files have missing values in the *Arrival Delay in Minutes* column. I will assume that the missing values are equal to 0.
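
To gauge how much this assumption could matter, the share of affected rows can be computed from the counts above. This is a small arithmetic sketch with illustrative variable names; it suggests that zero-filling touches well under 1% of the rows in either file, although it says nothing about whether 0 is the right value for them.

```python
# Missing counts from isna().sum() and row counts from len() above
missing_train, n_train = 310, 103904
missing_test, n_test = 83, 25976

train_missing_share = missing_train / n_train
test_missing_share = missing_test / n_test

print(f"train: {train_missing_share:.2%}, test: {test_missing_share:.2%}")
```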

In [71]:
train_df['satisfaction'].value_counts()

neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64

In [72]:
test_df['satisfaction'].value_counts()

neutral or dissatisfied    14573
satisfied                  11403
Name: satisfaction, dtype: int64

Both classes are reasonably well represented in the training and test sets: about 57% *neutral or dissatisfied* and 43% *satisfied* in training, and about 56% and 44% in test.
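
The class balance can be made explicit by turning the counts above into proportions. The sketch below does this with plain Python on the reported counts; the helper `class_shares` is hypothetical, and on the actual DataFrames `value_counts(normalize=True)` should give the same information directly.

```python
# Class counts reported by value_counts() above
train_counts = {"neutral or dissatisfied": 58879, "satisfied": 45025}
test_counts = {"neutral or dissatisfied": 14573, "satisfied": 11403}

def class_shares(counts):
    """Return each class count divided by the total number of rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

for name, shares in [("train", train_shares), ("test", test_shares)]:
    print(name, {label: round(share, 3) for label, share in shares.items()})
```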

Now let's actually clean the data. The memory used by the training data, in KB, is:

In [73]:
train_df.memory_usage().sum()/1024

20293.875

And by the test data:

In [74]:
test_df.memory_usage().sum()/1024

5073.5625

In [112]:
from sklearn.preprocessing import MinMaxScaler

def transform_to_categorical(df, cols):
    df = df.copy()
    for col in cols:
        df[col] = df[col].astype('category')
    
    return df

def fill_with_zero(df, cols):
    df = df.copy()
    for col in cols:
        df[col] = df[col].fillna(0)
    
    return df

def to_dummies(df, cols):
    df = df.copy()
    for col in cols:
        dummies = pd.get_dummies(df[col], drop_first=True)
        df = pd.concat([df, dummies], axis=1)
        df.drop(col, inplace=True, axis=1)
    
    return df

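def normalize(df, cols, scaler, already_fitted_scaler):
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    numerical_only_data = df[cols]
    # Fit only once (on the training data); reuse the fitted scaler afterwards
    if not already_fitted_scaler:
        scaler.fit(numerical_only_data.values)
    
    numerical_scaled = scaler.transform(numerical_only_data.values)
    df[cols] = numerical_scaled
    return df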
        
def clean_data(df, scaler, already_fitted_scaler=False):
    # Drop the columns that carry no information for classification
    cleaned_df = df.drop(['Unnamed: 0', 'id'], axis=1)
    
    # Fill missing values with zeros
    COL_WITH_NAN = ['Arrival Delay in Minutes']
    cleaned_df = fill_with_zero(cleaned_df, COL_WITH_NAN)
    
    # Cast to categorical every column except the two delay columns
    # (this includes Age and Flight Distance)
    NUM_COLS = ['Departure Delay in Minutes', 'Arrival Delay in Minutes']
    CAT_COLS = cleaned_df.columns.difference(NUM_COLS)
    cleaned_df = transform_to_categorical(cleaned_df, CAT_COLS)
    
    # Build dummy variables from the columns that were originally of type object
    COLS_TO_DUMMY = ['Gender', 'Customer Type', 'Type of Travel', 'Class', 'satisfaction']
    cleaned_df = to_dummies(cleaned_df, COLS_TO_DUMMY)
    
    # Normalize the numerical columns
    cleaned_df = normalize(cleaned_df, NUM_COLS, scaler, already_fitted_scaler)
    
    return cleaned_df

In [113]:
scaler = MinMaxScaler()

In [114]:
cleaned_train_df = clean_data(train_df, scaler, False)

In [115]:
cleaned_train_df.columns.difference(train_df.columns)

Index(['Eco', 'Eco Plus', 'Male', 'Personal Travel', 'disloyal Customer',
       'satisfied'],
      dtype='object')

We can see that new columns have appeared. All of them are *dummy variables* derived from one of the original columns. Even the *target* column, now named *satisfied*, had its content changed from "neutral or dissatisfied" and "satisfied" to 0s and 1s, with 1 representing satisfaction.
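
The column names above follow from how `pd.get_dummies` behaves with `drop_first=True`: each category becomes a column named after its value, and the first category in sorted order is dropped. A minimal sketch on toy values (the two small Series are made up for illustration):

```python
import pandas as pd

# Three travel classes: 'Business' sorts first, so it is the dropped level
travel_class = pd.Series(["Business", "Eco", "Eco Plus", "Eco"], name="Class")
class_dummies = pd.get_dummies(travel_class, drop_first=True)
print(list(class_dummies.columns))  # ['Eco', 'Eco Plus']

# Binary target: 'neutral or dissatisfied' is dropped, 'satisfied' remains
target = pd.Series(["neutral or dissatisfied", "satisfied"], name="satisfaction")
target_dummies = pd.get_dummies(target, drop_first=True)
print(list(target_dummies.columns))  # ['satisfied']
```

Depending on the pandas version, the dummy values may be stored as booleans rather than 0/1 integers; either way, the dropped level is the one encoded as all zeros.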

In [116]:
cleaned_test_df = clean_data(test_df, scaler, True)

Note that I reused the scaler fitted on the training data for the test data. The *DataFrames* now occupy, in KB:

In [121]:
cleaned_train_df.memory_usage().sum()/1024

4121.7890625

In [122]:
cleaned_test_df.memory_usage().sum()/1024

1149.734375

Roughly a fifth of the initial size (about 20% for the training data and 23% for the test data). In addition, the quantitative columns are normalized in both *datasets*.

In [125]:
cleaned_train_df.describe()[['Departure Delay in Minutes', 'Arrival Delay in Minutes']]

,Departure Delay in Minutes,Arrival Delay in Minutes
count,103904.0,103904.0
mean,0.009306,0.009554
std,0.024014,0.0244
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.007538,0.008207
max,1.0,1.0


In [126]:
cleaned_test_df.describe()[['Departure Delay in Minutes', 'Arrival Delay in Minutes']]

,Departure Delay in Minutes,Arrival Delay in Minutes
count,25976.0,25976.0
mean,0.008986,0.009276
std,0.023507,0.023653
min,0.0,0.0
25%,0.0,0.0
50%,0.0,0.0
75%,0.007538,0.008207
max,0.708543,0.703914
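
The test-set maxima below 1 (about 0.71 and 0.70) are consistent with reusing the training-fitted scaler. With the default range, min-max scaling maps x to (x - min) / (max - min) using the minimum and maximum seen during fitting, so a test set whose largest delay is smaller than the training maximum never reaches 1. The following is a minimal pure-Python sketch of that behaviour, with toy numbers and hypothetical helper names (`fit_min_max`, `apply_min_max`) standing in for `MinMaxScaler.fit` and `MinMaxScaler.transform`:

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the training values."""
    return min(values), max(values)

def apply_min_max(values, fitted):
    """Scale values with previously learned (min, max) parameters."""
    lo, hi = fitted
    return [(v - lo) / (hi - lo) for v in values]

# Toy delays in minutes (not taken from the dataset)
train_delays = [0, 10, 40]
test_delays = [0, 20, 30]

fitted = fit_min_max(train_delays)            # learned from training data only
scaled_train = apply_min_max(train_delays, fitted)
scaled_test = apply_min_max(test_delays, fitted)

print(scaled_train)  # [0.0, 0.25, 1.0]
print(scaled_test)   # [0.0, 0.5, 0.75]
```

Fitting on the training data only keeps information about the test distribution out of the preprocessing step, which is why the test values are not rescaled to span the full 0 to 1 range.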
