# Integrative Practical Assignment - Intelligent Systems

## Students:

+ Azul Zaietz - 102214
+ Lisandro Torresetti - 99846

### Dataset

The [Dataset](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction) contains information about the satisfaction of passengers of US airlines.

### Attributes

+ Gender: Gender of the passengers (Female, Male)


+ Customer Type: The customer type (Loyal customer, disloyal customer)


+ Age: The actual age of the passengers


+ Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)


+ Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)


+ Flight distance: The flight distance of this journey


+ Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)


+ Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient


+ Ease of Online booking: Satisfaction level of online booking


+ Gate location: Satisfaction level of Gate location


+ Food and drink: Satisfaction level of Food and drink


+ Online boarding: Satisfaction level of online boarding


+ Seat comfort: Satisfaction level of Seat comfort


+ Inflight entertainment: Satisfaction level of inflight entertainment


+ On-board service: Satisfaction level of On-board service


+ Leg room service: Satisfaction level of Leg room service


+ Baggage handling: Satisfaction level of baggage handling


+ Check-in service: Satisfaction level of Check-in service


+ Inflight service: Satisfaction level of inflight service


+ Cleanliness: Satisfaction level of Cleanliness


+ Departure Delay in Minutes: Minutes of delay at departure


+ Arrival Delay in Minutes: Minutes of delay at arrival


+ Satisfaction: Airline satisfaction level (satisfied, neutral or dissatisfied)


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style(style="darkgrid")

In [None]:
# Constants
SHORT_FLIGHTS = 1000
LONG_FLIGHTS = 2000
SHORT_TAG = 'short'
MEDIUM_TAG = 'medium'
LONG_TAG = 'long'

YOUNG_ADULT = 30
YOUNG_ADULT_TAG = 'youngAdult'
ADULT = 45
ADULT_TAG = 'adult'
SENIOR_TAG = 'senior'
SATISFIED = 'satisfied'

In [None]:
# Helper functions
def getFlightDistanceTag(distance):
    if distance < SHORT_FLIGHTS:
        return SHORT_TAG
    if SHORT_FLIGHTS <= distance < LONG_FLIGHTS:
        return MEDIUM_TAG
    return LONG_TAG

def getAgeTags(age):
    if age < YOUNG_ADULT:
        return YOUNG_ADULT_TAG
    if YOUNG_ADULT <= age < ADULT:
        return ADULT_TAG
    return SENIOR_TAG

def getSatisfactionAsIntFromTag(satisfaction):
    return int(satisfaction == SATISFIED)

def plotHistogram(dataFrame, column, bins=60, xlim=None, figSize=(16, 4)):
    # Draw a histogram of one column; xlim is an optional (min, max) pair for the x axis
    _, ax = plt.subplots(figsize=figSize)
    sns.histplot(dataFrame[column], bins=bins, color='darkorange', ax=ax)
    if xlim is not None:
        ax.set_xlim(xlim[0], xlim[1])
    ax.set_title(f"{column} histogram", fontsize=15)

def satisfactionAsInt(value):
    # Round a mean satisfaction to 0 or 1; an exact 0.5 rounds to 0 (NumPy rounds half to even)
    return int(np.round(value))

In [None]:
flightsDf = pd.read_csv('./dataset/train.csv', encoding='utf-8')
flightsDf 

In [None]:
# Drop the first two columns, which carry no useful information
flightsDf = flightsDf.drop(flightsDf.columns[[0, 1]], axis = 1)

In [None]:
flightsDf.info()

In [None]:
# Checking for NaNs
for column in flightsDf.columns.values:
    print(f"{column}: {flightsDf[column].isnull().sum()}")
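
The per-column loop above can also be written as a single vectorized call. A minimal sketch on a small made-up frame (the `toyDf` name and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for flightsDf
toyDf = pd.DataFrame({
    'Age': [25, 40, 61],
    'Arrival Delay in Minutes': [5.0, np.nan, 0.0],
})

# isnull() yields a boolean frame; sum() counts the True values per column
nanCounts = toyDf.isnull().sum()
print(nanCounts)
```

The result is a Series indexed by column name, so it can be filtered (for example `nanCounts[nanCounts > 0]`) to list only the columns that need attention.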

In [None]:
# Remove rows with NaN in 'Arrival Delay in Minutes'; copy() avoids a SettingWithCopyWarning on the assignment below
flightsDf = flightsDf[flightsDf['Arrival Delay in Minutes'].notnull()].copy()

# Departure Delay in Minutes and Arrival Delay in Minutes should have the same data type
flightsDf['Arrival Delay in Minutes'] = flightsDf['Arrival Delay in Minutes'].astype(np.int64)

In [None]:
for column in flightsDf.columns.values:
    if flightsDf[column].dtype == object:
        continue
    mean = np.round(np.mean(flightsDf[column]), 4)
    stdDesv = np.round(np.std(flightsDf[column]), 4)
    minValue, maxValue = min(flightsDf[column]), max(flightsDf[column])
    print(f"Column: {column} \n Min Value: {minValue}\n Max Value: {maxValue} \n Mean: {mean} \n Std: {stdDesv}\n")
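
pandas offers a comparable summary through `DataFrame.describe()`. One difference worth knowing: `np.std` defaults to the population standard deviation (`ddof=0`), whereas pandas' `std()` and `describe()` default to the sample standard deviation (`ddof=1`). A small sketch with made-up values (`toyDf` is illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for flightsDf
toyDf = pd.DataFrame({'Age': [20, 30, 40, 50]})

# Population standard deviation (divides by n)
populationStd = np.std(toyDf['Age'].to_numpy())
# Sample standard deviation (divides by n - 1), the pandas default
sampleStd = toyDf['Age'].std()
# describe() reports count, mean, std (sample), min, quartiles and max
summary = toyDf.describe()
print(summary)
```

On a dataset with many rows the two estimates are nearly identical, so the choice mostly matters for small groups.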

## Age and Type of Travel

We will look at the number of passengers aged 10 or younger who travel on business.

In [None]:
lessThan10 = flightsDf[flightsDf['Age'] <= 10]
lessThan10[lessThan10['Type of Travel'] == 'Business travel']

Since a business trip is a trip made for work or business purposes, we consider these rows inconsistent, because business travel is not normal at age 10 or younger. There may be exceptional cases, but we consider it better to remove these rows from the dataset.
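
A minimal sketch of this kind of filter on made-up rows, combining both conditions in one boolean mask so that only the rows matching both are dropped (`toyDf`, `isInconsistent` and `cleanDf` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame standing in for flightsDf
toyDf = pd.DataFrame({
    'Age': [8, 9, 35, 10],
    'Type of Travel': ['Business travel', 'Personal Travel',
                       'Business travel', 'Business travel'],
})

# True only for passengers aged 10 or younger on business travel
isInconsistent = (toyDf['Age'] <= 10) & (toyDf['Type of Travel'] == 'Business travel')

# Keep every row that is not flagged
cleanDf = toyDf[~isInconsistent]
print(cleanDf)
```

The 9-year-old on personal travel and the 35-year-old on business travel both survive the filter.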

In [None]:
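# Drop only the inconsistent rows: passengers aged 10 or younger on business travel
inconsistentRows = lessThan10[lessThan10['Type of Travel'] == 'Business travel']
flightsDf = flightsDf.drop(inconsistentRows.index)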
flightsDf

## Plots

In [None]:
for column in ['Age', 'Flight Distance']:
    plotHistogram(flightsDf, column)

In [None]:
for delay in ['Departure Delay in Minutes', 'Arrival Delay in Minutes']:
    plotHistogram(flightsDf, delay, bins=100, xlim=(0, 200))

In [None]:
# Services
services = [
    'Inflight wifi service',
    'Departure/Arrival time convenient',
    'Ease of Online booking',
    'Gate location',
    'Food and drink',
    'Online boarding',
    'Seat comfort',
    'Inflight entertainment',
    'On-board service',
    'Leg room service',
    'Baggage handling',
    'Checkin service',
    'Inflight service',
    'Cleanliness'
]

inflightServices = [
    'Inflight wifi service',
    'Food and drink',
    'Seat comfort',
    'Inflight entertainment',
    'On-board service',
    'Leg room service',
    'Inflight service',
    'Cleanliness'
]
externalServices = [
    'Departure/Arrival time convenient',
    'Ease of Online booking',
    'Gate location',
    'Online boarding',
    'Baggage handling',
    'Checkin service',
]
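
A quick sanity check, sketched here with the lists repeated so the snippet runs on its own, that the two sub-lists split `services` without overlap or omissions (`overlap`, `missing` and `isPartition` are illustrative names):

```python
services = [
    'Inflight wifi service', 'Departure/Arrival time convenient',
    'Ease of Online booking', 'Gate location', 'Food and drink',
    'Online boarding', 'Seat comfort', 'Inflight entertainment',
    'On-board service', 'Leg room service', 'Baggage handling',
    'Checkin service', 'Inflight service', 'Cleanliness',
]
inflightServices = [
    'Inflight wifi service', 'Food and drink', 'Seat comfort',
    'Inflight entertainment', 'On-board service', 'Leg room service',
    'Inflight service', 'Cleanliness',
]
externalServices = [
    'Departure/Arrival time convenient', 'Ease of Online booking',
    'Gate location', 'Online boarding', 'Baggage handling',
    'Checkin service',
]

# Services that appear in both sub-lists
overlap = set(inflightServices) & set(externalServices)
# Services that were not assigned to either sub-list
missing = set(services) - set(inflightServices) - set(externalServices)
# The sub-lists partition services if nothing overlaps, nothing is missing,
# and the sizes add up (which also rules out names absent from services)
isPartition = (
    not overlap
    and not missing
    and len(inflightServices) + len(externalServices) == len(services)
)
print(isPartition)
```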

## Groups

We will run the following _groupby_ operations to analyze different behaviors, but first the following columns are added to the dataset:

+ **satisfactionBinary**: so that satisfaction can be aggregated when grouping; 1 = satisfied, 0 = neutral or dissatisfied.


+ **distanceTag**: short, medium, long (below 1000, from 1000 up to 2000, 2000 and above). The classification is based on [Flight length](https://en.wikipedia.org/wiki/Flight_length).


+ **ageTag**: youngAdult, adult, senior (under 30, 30 to 44, 45 and over). Classifies passengers by age.


+ **takeOffDelay**: Yes/No, whether the departure delay is greater than 0 minutes.


+ **landDelay**: Yes/No, whether the arrival delay is greater than 0 minutes.
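
As an alternative to applying a Python function row by row, the distance and age tags could also be derived with `pd.cut`. This is a sketch, not the notebook's method: the thresholds mirror the constants defined earlier (1000 and 2000 for distance, 30 and 45 for age), `right=False` makes each bin closed on the left to match the `<` comparisons in the helper functions, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for flightsDf
toyDf = pd.DataFrame({
    'Flight Distance': [300, 1000, 1999, 2000, 4500],
    'Age': [18, 30, 44, 45, 70],
})

# Bins are [-inf, 1000), [1000, 2000), [2000, inf)
toyDf['distanceTag'] = pd.cut(
    toyDf['Flight Distance'],
    bins=[-np.inf, 1000, 2000, np.inf],
    labels=['short', 'medium', 'long'],
    right=False,
)

# Bins are [-inf, 30), [30, 45), [45, inf)
toyDf['ageTag'] = pd.cut(
    toyDf['Age'],
    bins=[-np.inf, 30, 45, np.inf],
    labels=['youngAdult', 'adult', 'senior'],
    right=False,
)
print(toyDf)
```

The resulting columns are categorical, which keeps the tags in their natural order when they are later used as groupby keys.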

Based on these new columns, the following groupings are computed:

+ **ageTag, distanceTag**: to see how the length of the flight affects passenger satisfaction depending on age.


+ **class, distanceTag**: similar to the previous case, but across the different passenger classes.


+ **class, takeOffDelay** and **class, landDelay**: to see which affects satisfaction more, departing late or arriving late, taking the passenger's class into account.
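
Because `satisfactionBinary` is 0 or 1, its mean within a group is the fraction of satisfied passengers in that group. A sketch on made-up rows that also reports the group size, so that small groups are not over-interpreted (`toyDf` and `grouped` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame standing in for flightsDf
toyDf = pd.DataFrame({
    'Class': ['Eco', 'Eco', 'Eco', 'Business', 'Business', 'Business'],
    'takeOffDelay': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No'],
    'satisfactionBinary': [0, 1, 1, 1, 1, 0],
})

# mean = share of satisfied passengers, count = number of rows in the group
grouped = (
    toyDf.groupby(['Class', 'takeOffDelay'])['satisfactionBinary']
    .agg(['mean', 'count'])
)
print(grouped)
```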



In [None]:
# Derived columns used to compute statistics when grouping by an attribute
flightsDf['satisfactionBinary'] = flightsDf['satisfaction'].apply(getSatisfactionAsIntFromTag)
flightsDf['distanceTag'] = flightsDf['Flight Distance'].apply(getFlightDistanceTag)
flightsDf['ageTag'] = flightsDf['Age'].apply(getAgeTags)
flightsDf['takeOffDelay'] = flightsDf['Departure Delay in Minutes'].apply(lambda t: 'Yes' if t > 0 else 'No')
flightsDf['landDelay'] = flightsDf['Arrival Delay in Minutes'].apply(lambda t: 'Yes' if t > 0 else 'No')

In [None]:
ageDistanceGroup = flightsDf.groupby(['ageTag', 'distanceTag']).agg({'satisfactionBinary': 'mean'})
ageDistanceGroup['satisfaction'] = ageDistanceGroup['satisfactionBinary'].apply(satisfactionAsInt)
ageDistanceGroup

In [None]:
classDistanceGroup = flightsDf.groupby(['Class', 'distanceTag']).agg({'satisfactionBinary': 'mean'})
classDistanceGroup['satisfaction'] = classDistanceGroup['satisfactionBinary'].apply(satisfactionAsInt)
classDistanceGroup

In [None]:
# Take off Delay
takeOffGroup = flightsDf.groupby(['Class', 'takeOffDelay']).agg({'satisfactionBinary': 'mean'})
takeOffGroup

In [None]:
# Land Delay
landGroup = flightsDf.groupby(['Class', 'landDelay']).agg({'satisfactionBinary': 'mean'})
landGroup

In both cases, the passengers most bothered by a delay are those in the Eco and Eco Plus classes. For the Business class, the mean is similar in both cases.
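
One way to make this comparison explicit is to pivot the delay flag into columns and subtract, which gives the drop in satisfaction rate associated with a delay for each class. A sketch with made-up numbers, not the notebook's results (`toyDf`, `rates` and `dropWithDelay` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame standing in for flightsDf
toyDf = pd.DataFrame({
    'Class': ['Eco', 'Eco', 'Eco', 'Eco',
              'Business', 'Business', 'Business', 'Business'],
    'takeOffDelay': ['No', 'No', 'Yes', 'Yes',
                     'No', 'No', 'Yes', 'Yes'],
    'satisfactionBinary': [1, 1, 1, 0,
                           1, 0, 1, 0],
})

# Rows: Class; columns: 'No' and 'Yes'; values: share of satisfied passengers
rates = (
    toyDf.groupby(['Class', 'takeOffDelay'])['satisfactionBinary']
    .mean()
    .unstack('takeOffDelay')
)

# Positive values mean satisfaction is lower when the flight is delayed
rates['dropWithDelay'] = rates['No'] - rates['Yes']
print(rates)
```

The same pattern applies to `landDelay`, and comparing the two `dropWithDelay` columns side by side shows which kind of delay is associated with the larger drop.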