# **Pair programming exercises Module 3 Sprint 1**
## **Logistic Regression: Lesson 2 - Preprocessing**
---


Using the same dataset you used yesterday, the goals of today's exercises are:
- Standardize the numeric variables in your dataset.
  
- Encode the categorical variables. Remember to take into account whether your variables are ordered or not.
  
- Check whether your data are balanced. If they are not, use some of the tools learned in the lesson to balance them.
  
- Save the dataframe with the changes you have applied so you can use it in the next lesson.

In [1]:
# Data handling
# =============
import numpy as np
import pandas as pd

# Libraries for data visualization
# ================================
import matplotlib.pyplot as plt
import seaborn as sns

# Standardizing numeric variables and encoding categorical variables
# ==================================================================
from sklearn.preprocessing import RobustScaler

# Handling imbalanced data
# ========================
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek

# Splitting the data into train and test
# ======================================
from sklearn.model_selection import train_test_split

# Warning configuration
# =====================
import warnings
warnings.filterwarnings("ignore")

# Display preferences
# ===================
plt.rcParams["figure.figsize"] = (20,20)
pd.options.display.max_columns = None


In [2]:
df = pd.read_pickle("datos/invistico_airline_eda.pkl")
df.head(5)

Unnamed: 0,satisfaction,gender,customer_type,age,type_of_travel,class,flight_distance,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,onboard_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding,departure_delay_in_minutes
0,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,2,4,2,3,3,0,3,5,3,2,0
1,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,0,2,2,3,4,4,4,2,3,2,310
2,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,2,0,2,2,3,3,4,4,4,2,0
3,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0
4,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,4,3,4,2,2,0,2,4,2,5,0


In [3]:
df_numericas = df.select_dtypes(include = np.number)
df_numericas.head()

Unnamed: 0,age,flight_distance,departure_delay_in_minutes
0,65,265,0
1,47,2464,310
2,15,2138,0
3,60,623,0
4,70,354,0


### Standardizing the predictor variables with Robust Scaler

We chose the Robust Scaler method because our dataframe contains outliers.

In [4]:

scaler = RobustScaler()

scaler.fit(df_numericas)

X_escaladas = scaler.transform(df_numericas)


df_numericas_estandar = pd.DataFrame(X_escaladas, columns = df_numericas.columns)
df_numericas_estandar.head(2)

Unnamed: 0,age,flight_distance,departure_delay_in_minutes
0,1.041667,-1.400844,0.0
1,0.291667,0.454852,25.833333
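
As a minimal sketch of why this scaler tolerates outliers: with its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range (IQR), so a single extreme value barely affects how the rest of the column is scaled. The toy array below is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (illustrative data, not from the dataset).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: centre on the median, scale by the IQR (75th - 25th percentile).
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

# RobustScaler with default parameters should give the same numbers.
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```

The `age` values shown above are consistent with this rule: 65 maps to 1.041667 and 47 to 0.291667, which corresponds to a median of 40 and an IQR of 24.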


In [5]:
df.drop(["age", "flight_distance", "departure_delay_in_minutes"], axis = 1, inplace=True)
df.head()

Unnamed: 0,satisfaction,gender,customer_type,type_of_travel,class,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,onboard_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding
0,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,2,2,4,2,3,3,0,3,5,3,2
1,satisfied,Male,Loyal Customer,Personal Travel,Business,0,0,0,3,0,2,2,3,4,4,4,2,3,2
2,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,2,0,2,2,3,3,4,4,4,2
3,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,3,4,3,1,1,0,1,4,1,3
4,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,4,3,4,2,2,0,2,4,2,5


In [6]:
df = pd.concat([df, df_numericas_estandar], axis = 1)
df.head()

Unnamed: 0,satisfaction,gender,customer_type,type_of_travel,class,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,onboard_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding,age,flight_distance,departure_delay_in_minutes
0,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,2,2,4,2,3,3,0,3,5,3,2,1.041667,-1.400844,0.0
1,satisfied,Male,Loyal Customer,Personal Travel,Business,0,0,0,3,0,2,2,3,4,4,4,2,3,2,0.291667,0.454852,25.833333
2,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,2,0,2,2,3,3,4,4,4,2,-1.041667,0.179747,0.0
3,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0.833333,-1.098734,0.0
4,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,4,3,4,2,2,0,2,4,2,5,1.25,-1.325738,0.0


### Encoding with standardized data ###

- Variables that have NO order:

In [7]:
lista_columnas = ["gender", "customer_type", "type_of_travel"]

df_encoded = pd.DataFrame()


for columna in lista_columnas:
    df_dummies = pd.get_dummies(df[columna], prefix_sep = "_", prefix = columna, dtype = int)

    df_encoded = pd.concat([df_encoded, df_dummies], axis = 1)



In [8]:
df_encoded.head()

Unnamed: 0,gender_Female,gender_Male,customer_type_Loyal Customer,customer_type_disloyal Customer,type_of_travel_Business travel,type_of_travel_Personal Travel
0,1,0,1,0,0,1
1,0,1,1,0,0,1
2,1,0,1,0,0,1
3,1,0,1,0,0,1
4,1,0,1,0,0,1


In [9]:
df_final = pd.concat([df, df_encoded], axis = 1)
df_final.head()

Unnamed: 0,satisfaction,gender,customer_type,type_of_travel,class,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,onboard_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding,age,flight_distance,departure_delay_in_minutes,gender_Female,gender_Male,customer_type_Loyal Customer,customer_type_disloyal Customer,type_of_travel_Business travel,type_of_travel_Personal Travel
0,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,2,2,4,2,3,3,0,3,5,3,2,1.041667,-1.400844,0.0,1,0,1,0,0,1
1,satisfied,Male,Loyal Customer,Personal Travel,Business,0,0,0,3,0,2,2,3,4,4,4,2,3,2,0.291667,0.454852,25.833333,0,1,1,0,0,1
2,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,2,0,2,2,3,3,4,4,4,2,-1.041667,0.179747,0.0,1,0,1,0,0,1
3,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0.833333,-1.098734,0.0,1,0,1,0,0,1
4,satisfied,Female,Loyal Customer,Personal Travel,Eco,0,0,0,3,4,3,4,2,2,0,2,4,2,5,1.25,-1.325738,0.0,1,0,1,0,0,1


In [10]:
df_final.drop(lista_columnas, axis = 1, inplace=True)
df_final.head(2)

Unnamed: 0,satisfaction,class,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,onboard_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding,age,flight_distance,departure_delay_in_minutes,gender_Female,gender_Male,customer_type_Loyal Customer,customer_type_disloyal Customer,type_of_travel_Business travel,type_of_travel_Personal Travel
0,satisfied,Eco,0,0,0,2,2,4,2,3,3,0,3,5,3,2,1.041667,-1.400844,0.0,1,0,1,0,0,1
1,satisfied,Business,0,0,0,3,0,2,2,3,4,4,4,2,3,2,0.291667,0.454852,25.833333,0,1,1,0,0,1
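
The steps above (build the dummies column by column, concatenate them, drop the originals) can probably be condensed into one call, since `pd.get_dummies` accepts a `columns` argument and removes the encoded columns itself. This is only a sketch on a toy frame; `toy` and its values are illustrative stand-ins for the notebook's `df`.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (illustrative values).
toy = pd.DataFrame({
    "age": [65, 47],
    "gender": ["Female", "Male"],
    "customer_type": ["Loyal Customer", "disloyal Customer"],
})

# One-hot encode the listed columns; the originals are dropped automatically
# and the new columns are named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["gender", "customer_type"], dtype=int)

print(encoded.columns.tolist())
```

The untouched columns come first, followed by the dummy columns in the order the source columns were listed.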



- Variables that have an order:

In [11]:
ordinales = df.select_dtypes(include='category').drop(columns=lista_columnas)
ordinales.sample()

Unnamed: 0,class,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,onboard_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding
28088,Eco,3,4,4,2,4,4,4,4,5,3,5,4,4,4


In [14]:
# count the rows for each combination of class and satisfaction, sorted by count
cat1 = df.groupby(['class', 'satisfaction'])['gender'].count().reset_index().sort_values(by='gender')
cat1

Unnamed: 0,class,satisfaction,gender
5,Eco Plus,satisfied,4019
4,Eco Plus,dissatisfied,5392
0,Business,dissatisfied,18065
3,Eco,satisfied,22973
2,Eco,dissatisfied,35336
1,Business,satisfied,44095
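
One of the goals of this lesson is to check whether the target is balanced. As a rough sketch, the counts printed above can be summed per `satisfaction` level. The `counts` frame below simply retypes that output, and the column name `n` is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame({
    "class": ["Eco Plus", "Eco Plus", "Business", "Eco", "Eco", "Business"],
    "satisfaction": ["satisfied", "dissatisfied", "dissatisfied",
                     "satisfied", "dissatisfied", "satisfied"],
    "n": [4019, 5392, 18065, 22973, 35336, 44095],
})

# Total rows per target level, then each level's share of the dataset.
totals = counts.groupby("satisfaction")["n"].sum()
shares = (totals / totals.sum()).round(3)

print(totals)
print(shares)
```

With these figures the split is roughly 55% satisfied to 45% dissatisfied.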


In [18]:
cat2 = df['class'].value_counts()
cat2

Business    62160
Eco         58309
Eco Plus     9411
Name: class, dtype: int64

In [19]:
cat2.index

CategoricalIndex(['Business', 'Eco', 'Eco Plus'], categories=['Business', 'Eco', 'Eco Plus'], ordered=False, dtype='category')

In [21]:
cat2.index[0]

'Business'

In [22]:
# for each ordinal column, list its categories sorted by frequency (most common first)
cat_ordinales = {col: df[col].value_counts().index.tolist() for col in ordinales.columns}
cat_ordinales

{'class': ['Business', 'Eco', 'Eco Plus'],
 'seat_comfort': [3, 2, 4, 1, 5, 0],
 'departure_arrival_time_convenient': [4, 5, 3, 2, 1, 0],
 'food_and_drink': [3, 4, 2, 1, 5, 0],
 'gate_location': [3, 4, 2, 1, 5, 0],
 'inflight_wifi_service': [4, 5, 3, 2, 1, 0],
 'inflight_entertainment': [4, 5, 3, 2, 1, 0],
 'online_support': [4, 5, 3, 2, 1, 0],
 'ease_of_online_booking': [4, 5, 3, 2, 1, 0],
 'onboard_service': [4, 5, 3, 2, 1, 0],
 'leg_room_service': [4, 5, 3, 2, 1, 0],
 'baggage_handling': [4, 5, 3, 2, 1],
 'checkin_service': [4, 3, 5, 2, 1, 0],
 'cleanliness': [4, 5, 3, 2, 1, 0],
 'online_boarding': [4, 3, 5, 2, 1, 0]}
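
Note that `value_counts()` sorts categories by frequency, so the lists above are not in rank order (for example, `seat_comfort` starts with 3, 2, 4). A minimal sketch of one way to encode an ordered variable is an explicit mapping from label to rank. The ranking `Eco < Eco Plus < Business` below is an assumption made for illustration, not something the notebook establishes, and `toy` stands in for the real dataframe.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (illustrative values).
toy = pd.DataFrame({"class": ["Eco", "Business", "Eco Plus", "Eco"]})

# Hypothetical ranking, assumed for illustration only.
class_order = {"Eco": 0, "Eco Plus": 1, "Business": 2}

# map() replaces each label with its rank; astype(int) also covers the case
# where the column has category dtype, as in the notebook.
toy["class_encoded"] = toy["class"].map(class_order).astype(int)

print(toy)
```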

In [23]:
cat2 = df['food_and_drink'].value_counts()
cat2

3    28150
4    27216
2    27146
1    21076
5    20347
0     5945
Name: food_and_drink, dtype: int64