### Logistic Regression
### Although the goal here is not to explain how the independent variables drive the delay, one class of each dummy-encoded variable has been dropped to avoid multicollinearity in the model.

In [12]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression

import missingno as msng
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 10)

mf = pd.read_csv('dataset_SCL_2.csv')

In [4]:
mf.head()

Unnamed: 0.1,Unnamed: 0,OPERA,MES,TIPOVUELO,SIGLADES,DIANOM,temporada_alta,periodo_dia,atraso_15,C_J_destinos,C_J_aerolineas
0,42405,Grupo LATAM,8,N,Antofagasta,Domingo,0,noche,0,2,1
1,65490,Grupo LATAM,12,N,Puerto Montt,Miercoles,1,mañana,0,2,1
2,37211,Grupo LATAM,7,I,Rosario,Sabado,1,mañana,1,3,1
3,8036,Sky Airline,2,N,Iquique,Viernes,1,noche,0,2,1
4,41039,Grupo LATAM,8,N,Antofagasta,Viernes,0,tarde,0,2,1


In [5]:
mf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68206 entries, 0 to 68205
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      68206 non-null  int64 
 1   OPERA           68206 non-null  object
 2   MES             68206 non-null  int64 
 3   TIPOVUELO       68206 non-null  object
 4   SIGLADES        68206 non-null  object
 5   DIANOM          68206 non-null  object
 6   temporada_alta  68206 non-null  int64 
 7   periodo_dia     66976 non-null  object
 8   atraso_15       68206 non-null  int64 
 9   C_J_destinos    68206 non-null  int64 
 10  C_J_aerolineas  68206 non-null  int64 
dtypes: int64(6), object(5)
memory usage: 5.7+ MB


In [6]:
# Create the dummy variables. drop_first is used to avoid perfect multicollinearity in the logistic regression

features_reg_log = pd.concat([pd.get_dummies(mf['OPERA'], prefix = 'OPERA', drop_first=True), 
                      pd.get_dummies(mf['MES'], prefix = 'MES', drop_first=True), 
                      pd.get_dummies(mf['TIPOVUELO'], prefix = 'TIPOVUELO', drop_first=True),
                      pd.get_dummies(mf['SIGLADES'], drop_first=True),
                      pd.get_dummies(mf['DIANOM'], drop_first=True),
                      pd.get_dummies(mf['periodo_dia'], drop_first=True),
                      pd.get_dummies(mf['C_J_destinos'], prefix = 'C_J_destinos', drop_first=True),
                      pd.get_dummies(mf['C_J_aerolineas'], prefix = 'C_J_aerolineas', drop_first=True)], axis = 1)
label = mf['atraso_15']
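
To illustrate why one class per encoded variable is dropped, here is a minimal sketch on a toy column (hypothetical values, not rows from `dataset_SCL_2.csv`). It assumes the model includes an intercept: with every dummy kept, the dummy columns sum to the intercept column, so the design matrix loses a rank. With `drop_first=True` the matrix has full column rank.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({"periodo_dia": ["mañana", "tarde", "noche", "tarde", "mañana"]})

# One dummy per category vs. one category dropped
full = pd.get_dummies(toy["periodo_dia"], dtype=float)
reduced = pd.get_dummies(toy["periodo_dia"], drop_first=True, dtype=float)

def with_intercept(dummies):
    """Prepend a column of ones, as a model with an intercept would."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy()])

X_full = with_intercept(full)        # intercept + 3 dummies
X_reduced = with_intercept(reduced)  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

# The full encoding has more columns than its rank (redundant column);
# the reduced encoding has as many columns as its rank.
print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```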

In [7]:
features_reg_log.head()

Unnamed: 0,OPERA_Aeromexico,OPERA_Air Canada,OPERA_Air France,OPERA_Alitalia,OPERA_American Airlines,OPERA_Austral,OPERA_Avianca,OPERA_British Airways,OPERA_Copa Air,OPERA_Delta Air,...,Sabado,Viernes,noche,tarde,C_J_destinos_2,C_J_destinos_3,C_J_aerolineas_2,C_J_aerolineas_3,C_J_aerolineas_4,C_J_aerolineas_5
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,0,0,0,0,0


The data is split into training and test sets: 70% training, 30% testing.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(features_reg_log, label, test_size = 0.3, random_state = 100)

In [9]:
x_train.shape, x_test.shape

((47744, 109), (20462, 109))

In [10]:
y_train.value_counts(normalize=True)

0    0.813903
1    0.186097
Name: atraso_15, dtype: float64

In [11]:
y_test.value_counts(normalize=True)

0    0.81776
1    0.18224
Name: atraso_15, dtype: float64
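
The share of delayed flights differs slightly between the two samples (about 18.6% in training and 18.2% in testing). One possible way to keep the proportions aligned is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (an assumption for illustration, not the notebook's data), so it shows only the mechanism, not a result for this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical): 18% positives
y_toy = np.array([1] * 180 + [0] * 820)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=100, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```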

In [13]:
logReg = LogisticRegression()
model = logReg.fit(x_train, y_train)

In [14]:
y_pred = model.predict(x_test)

#### Logistic Regression Metrics

In [15]:
confusion_matrix(y_test, y_pred)

array([[16577,   156],
       [ 3575,   154]])

In [16]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.99      0.90     16733
           1       0.50      0.04      0.08      3729

    accuracy                           0.82     20462
   macro avg       0.66      0.52      0.49     20462
weighted avg       0.76      0.82      0.75     20462
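
As a cross-check, the class 1 figures in the report can be recomputed by hand from the confusion matrix above. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes). It also computes the accuracy of a naive baseline that always predicts class 0, for comparison.

```python
# Counts copied from the confusion matrix output above
tn, fp = 16577, 156   # true class 0: predicted 0, predicted 1
fn, tp = 3575, 154    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of the flights predicted as delayed, how many were
recall_1 = tp / (tp + fn)      # of the delayed flights, how many were detected
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Baseline: always predict "no delay" (class 0)
baseline_accuracy = (tn + fp) / total

print(f"accuracy:          {accuracy:.4f}")
print(f"precision class 1: {precision_1:.4f}")
print(f"recall class 1:    {recall_1:.4f}")
print(f"f1 class 1:        {f1_1:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```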



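### This model reaches an accuracy of 82%, which is essentially what always predicting class 0 would achieve (81.8% of the test set). Its sensitivity (recall) for class 1 is very low (4%), and its precision for class 1 is only 50%. The result of this model is disappointing.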