# US Airline Satisfaction Mini Project 

In this project, we perform some analysis on a dataset from the __[US Airline passenger satisfaction survey](https://www.kaggle.com/datasets/najibmh/us-airline-passenger-satisfaction-survey?resource=download)__.

## Contents
- [Problem](#problem)
- [Data Preparation](#data-preparation)
- [Exploratory Analysis](#exploratory-analysis)
- [Sampling](#sampling)

---

<a id="problem"></a>
## Problem
Based on passenger ratings, we would like to find out how the individual ratings affect a passenger's final verdict of _satisfied_ or _unsatisfied_ with the service provided by the US airline.

**Specifically**:
1. Can we predict whether a customer would be satisfied?
1. What are the most important factors that affect customer satisfaction?

---

<a id="data-preparation"></a>
## Data Preparation

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set() # set the default Seaborn style for graphics

### Import the Dataset
Source: __[US Airline passenger satisfaction survey](https://www.kaggle.com/datasets/najibmh/us-airline-passenger-satisfaction-survey?resource=download)__

Attached file: `satisfaction_v2.csv`  

In [None]:
satisfactionData = pd.read_csv('satisfaction.csv')
satisfactionData.head()

In [None]:
satisfactionData.info()

#### Initial Observations
* There are `24` columns and `129880` rows in the dataset.   
* The response variable seems to be `satisfaction_v2`.
* The following `5` columns are non-predictors or unlikely to be predictors: ID, Gender, Customer Type, Age and Type of Travel.
* The remaining `18` columns are potential predictor variables.

#### Predictor Variables
* There are `16` variables identified as `int64` by default, but it seems that only `Flight Distance` and `Departure Delay in Minutes` are actually numeric. The remaining `14` variables are ratings from 0 to 5 and should be considered Categorical.
* The `Arrival Delay in Minutes` variable is identified as `float64` by default, and it seems to be Numeric.
* The `Class` variable is identified as `object` by default, and is most likely Categorical.  
* We noted that `Arrival Delay in Minutes` seems to be missing some values.
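
The missing-value observation can be confirmed directly instead of being read off `info()`. Below is a minimal sketch, assuming a small made-up frame (`toy`) that stands in for `satisfactionData`; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for satisfactionData; the values are made up.
toy = pd.DataFrame({
    "Flight Distance": [265, 2464, 2138, 623],
    "Departure Delay in Minutes": [0, 310, 0, 15],
    "Arrival Delay in Minutes": [0.0, np.nan, 0.0, np.nan],
})

# Count the missing values in each column and keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Applied to the real data, the same two lines would list every column with gaps together with the number of missing entries.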

### Dataset Cleaning

<div class="alert alert-block alert-info">
    <b>Missing Values: </b> We note that <code>Arrival Delay in Minutes</code> has count <code>129487</code> instead of <code>129880</code>. This is because it contains <code>NULL</code> values. We will replace them with <code>0</code> here.
</div>

In [None]:
# Check count
satisfactionData['Arrival Delay in Minutes'].count()

In [None]:
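# Assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame
satisfactionData['Arrival Delay in Minutes'] = satisfactionData['Arrival Delay in Minutes'].fillna(0)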
# Check count
satisfactionData['Arrival Delay in Minutes'].count()

<div class="alert alert-block alert-info">
    Check that the <code>id</code>s are unique. 
</div>

In [None]:
len(satisfactionData["id"].unique())
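
Comparing the number of unique values with the row count works, but pandas can also answer the question directly. The sketch below uses a hypothetical `ids` Series (not the real `id` column) to show `is_unique` and `duplicated()`.

```python
import pandas as pd

# Hypothetical ids; the last one repeats the first on purpose.
ids = pd.Series([11112, 110278, 103199, 11112], name="id")

# True only when no value appears more than once
print(ids.is_unique)

# Values that repeat an earlier row
repeated = ids[ids.duplicated()].tolist()
print(repeated)
```

On the real data, `satisfactionData["id"].is_unique` would give a single yes/no answer without needing the row count.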

<div class="alert alert-block alert-info">
    <b>Ordinal Categorical Variables</b><br>
    Most ordinal categorical variables are ratings stored as <code>int</code>, so no conversion is required. <br>
    The <code>non-int</code> variables <code>Class</code> and <code>Customer Type</code> will be converted in 
    <a href="#exploratory-analysis">Exploratory Analysis</a>
</div>

In [None]:
from pandas.api.types import CategoricalDtype
cat_type_class = CategoricalDtype(categories=['Eco', 'Eco Plus', 'Business'], ordered=True)
cat_type_customer = CategoricalDtype(categories=['disloyal Customer', 'Loyal Customer'], ordered=True)

# Work on a copy so the original DataFrame is left unchanged
test = satisfactionData.copy()
test["Class"] = test["Class"].astype(cat_type_class)
test["Class"]

In [None]:
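test = satisfactionData.copy()
#print(test)
# Identity mapping: the ratings are already integers from 0 to 5
scale_mapper = {0:0, 1:1, 2:2, 3:3, 4:4, 5:5}
scaled = test["Seat comfort"].replace(scale_mapper) 

from pandas.api.types import CategoricalDtype

cat_type_ratings = CategoricalDtype(categories=[0,1,2,3,4,5], ordered=True)
# astype returns a new Series, so assign it back to keep the conversion
test["Seat comfort"] = test["Seat comfort"].astype(cat_type_ratings)
test.info()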

In [None]:
test["Seat comfort"]

# TESTING

In [None]:
test = satisfactionData.copy()
test['Arrival Delay in Minutes'] = test['Arrival Delay in Minutes'].fillna(0)

In [None]:
from pandas.api.types import CategoricalDtype

cat_type_ratings = CategoricalDtype(categories=[0,1,2,3,4,5], ordered=True)
cat_type_class = CategoricalDtype(categories=['Eco', 'Eco Plus', 'Business'], ordered=True)

test["Class"] = test["Class"].astype(cat_type_class)
test["Class"]

In [None]:
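# Assign by column name: assigning through iloc may keep the original int64 dtype
rating_cols = test.columns[8:22]
test[rating_cols] = test[rating_cols].astype(cat_type_ratings)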
test.info()
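
Since the point of this step is the resulting dtype, it helps to check it explicitly. The sketch below shows one way to convert several rating columns at once and verify the result, assuming a made-up frame `toy` and the hypothetical names `rating_dtype` and `rating_cols`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical stand-in with two rating columns and one numeric column.
toy = pd.DataFrame({
    "Seat comfort": [0, 3, 5, 2],
    "Food and drink": [1, 4, 4, 0],
    "Flight Distance": [265, 2464, 2138, 623],
})

rating_dtype = CategoricalDtype(categories=[0, 1, 2, 3, 4, 5], ordered=True)
rating_cols = ["Seat comfort", "Food and drink"]

# astype with a column -> dtype mapping returns a new frame with those columns converted.
toy = toy.astype({col: rating_dtype for col in rating_cols})
print(toy.dtypes)
```

Passing a mapping to `astype` leaves the other columns untouched, so `Flight Distance` stays numeric.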

In [None]:
# Remove columns that are not used as predictors

test = test.drop(['Gender','Customer Type','Age','Type of Travel','Class'], axis = 1)
test.info()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
cats = [0,1,2,3,4,5]
ordi = OrdinalEncoder(categories=[cats])
ordi.fit(test[['Seat comfort']])


In [None]:
encoded = pd.DataFrame(ordi.transform(test[['Seat comfort']]),columns=['Seat comfort'])
encoded.info()

In [None]:
test['Seat comfort'].unique()

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

y = pd.DataFrame(test['satisfaction_v2'])
X = pd.DataFrame(test.drop('satisfaction_v2', axis = 1))
dectree = DecisionTreeClassifier(max_depth = 5)  # change max_depth to experiment
dectree.fit(X, y)
# Plot the trained Decision Tree
f = plt.figure(figsize=(24,24))
plot_tree(dectree, filled=True, rounded=True, precision=0,
          feature_names=list(X.columns), 
          class_names=["neutral or dissatisfied","satisfied"])

In [None]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import plot_tree


# Extract Response and Predictors
y = pd.DataFrame(test['satisfaction_v2'])
X = pd.DataFrame(test.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3)  # change max_depth to experiment
dectree.fit(X_train, y_train)                    # train the decision tree model

# Plot the trained Decision Tree
f = plt.figure(figsize=(24,24))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=list(X_train.columns),   # some sklearn versions require a list here
          class_names=["neutral or dissatisfied","satisfied"])
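
The two questions in the Problem section can be checked numerically once a tree is trained. The sketch below is only an illustration on synthetic ratings (NOT the survey data), where satisfaction is made to depend on `Seat comfort` alone. It reports test accuracy, the confusion matrix and the impurity-based feature importances.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic ratings, NOT the survey data: satisfaction depends only on "Seat comfort".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Seat comfort": rng.integers(0, 6, size=500),
    "Gate location": rng.integers(0, 6, size=500),
})
y = np.where(X["Seat comfort"] >= 3, "satisfied", "neutral or dissatisfied")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
dectree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Accuracy and confusion matrix on the held-out test set
accuracy = dectree.score(X_test, y_test)
print("Test accuracy:", accuracy)
print(confusion_matrix(y_test, dectree.predict(X_test), labels=dectree.classes_))

# Impurity-based importances, largest first
importances = pd.Series(dectree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

On the real predictors, the same `feature_importances_` ranking would be a first, rough answer to which ratings matter most. It reflects this particular tree, so it is worth comparing across several values of `max_depth`.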

---
<a id="exploratory-analysis"></a>
## Exploratory Analysis

### Response Variable
Let's take a look at the response variable `satisfaction_v2`.

In [None]:
sb.catplot(y = 'satisfaction_v2', data = satisfactionData, kind = "count")

In [None]:
countG, countB = satisfactionData['satisfaction_v2'].value_counts()
print("[satisfied] : [neutral/dissatisfied] = [", countG, "] : [", countB, "]")

<div class="alert alert-block alert-info">
    The <code>satisfied</code> to <code>neutral/dissatisfied</code> ratio of <code> 71087 : 58794 </code> (roughly 55% to 45%) is reasonably balanced, so we will not perform any rebalancing. 
</div>
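
The balance can also be expressed as proportions. A minimal sketch, using only the two counts quoted above (the Series `counts` is built by hand here rather than read from the data):

```python
import pandas as pd

# Counts quoted above for the two response classes
counts = pd.Series({"satisfied": 71087, "neutral or dissatisfied": 58794})

# Share of each class among all responses
shares = counts / counts.sum()
print(shares.round(3))
```

On the DataFrame itself, `satisfactionData['satisfaction_v2'].value_counts(normalize=True)` would give the same kind of result in one call.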

### Predictor Variables
Let's take a look at the `18` predictor variables.<br>
We shall split them into the following subcategories.

* Passenger: variables relating to the passenger.
* Service: variables corresponding to the services provided by the airline.
* Others: variables that do not fall in the above categories.

In [None]:
satisfactionData.iloc[:,6:24].info()

#### Passenger Variables
Variables relating to the passenger. <br>
**Categorical** : `Class` `Departure/Arrival time convenient` `Type of Travel` `Customer Type` `Gender` <br>
**Numeric** : `Age` <br>

<div class="alert alert-block alert-info">
    <b>Class (Categorical)</b><br>
    The class variable seems to describe the type of flight class the passenger was in.<br>
    Since this is normally chosen by the passenger, we labeled it under <b>Passenger Variables</b><br>
    <b>Values</b><br>
    We observed that there are 3 unique values for the <code>Class</code> variable.<br>
    It seems like their ordinal values (ascending) are as follows:<br>
    1: <code>Eco</code> 2: <code>Eco Plus</code> 3: <code>Business</code><br> 
    We will convert them accordingly. <br>
    <b>Distribution</b><br>
    The most common value is <code>Business</code>, followed closely by <code>Eco</code>.<br>    
    <code>Eco Plus</code> is the least common. <br>
    <b>Relation</b><br>
    <code>Business</code> class has the highest satisfied rate, while passengers in <code>Eco</code> and
    <code>Eco Plus</code> more often give neutral/dissatisfied ratings.
</div>

In [None]:
print(satisfactionData['Class'].describe())
classTypes = satisfactionData['Class'].unique()
print(classTypes)

In [None]:
from pandas.api.types import CategoricalDtype
cat_type_class = CategoricalDtype(categories=['Eco', 'Eco Plus', 'Business'], ordered=True)
#cat_type_customer = CategoricalDtype(categories=[ 'disloyal Customer', 'Loyal Customer'], ordered=True)
satisfactionData['Class'] = satisfactionData["Class"].astype(cat_type_class)
satisfactionData['Class'].head()
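
Declaring the order is what makes `Class` ordinal rather than just categorical. The sketch below uses a hypothetical four-row sample (`classes`) to show what the ordered dtype enables: integer codes, order-aware comparisons and a meaningful maximum.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type_class = CategoricalDtype(categories=["Eco", "Eco Plus", "Business"], ordered=True)

# Hypothetical sample of the Class column
classes = pd.Series(["Business", "Eco", "Eco Plus", "Eco"]).astype(cat_type_class)

codes = classes.cat.codes.tolist()       # position of each value in the category order
above_eco = (classes > "Eco").tolist()   # comparisons follow the declared order
highest = classes.max()                  # highest class present in the sample

print(codes)
print(above_eco)
print(highest)
```

The codes start at 0, so adding 1 gives the 1 to 3 numbering described in the note on `Class` above.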

In [None]:
sb.catplot(y = 'Class', data = satisfactionData, kind = "count")

In [None]:
# satisfaction_v2 vs Class
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Class']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

In [None]:
sb.catplot(x='satisfaction_v2', col="Class", col_order=['Business', 'Eco', 'Eco Plus'], data = satisfactionData, kind = "count")

In [None]:
sb.catplot(x='Class', data = satisfactionData, hue= 'satisfaction_v2', kind = "count")
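
Counts alone can be hard to compare when the groups have different sizes, so it can help to look at rates instead. The sketch below uses made-up rows (not the survey data) and `pd.crosstab` with `normalize="index"` to get the share of each satisfaction level within each class.

```python
import pandas as pd

# Hypothetical rows, not taken from the survey
toy = pd.DataFrame({
    "Class": ["Business"] * 4 + ["Eco"] * 4,
    "satisfaction_v2": [
        "satisfied", "satisfied", "satisfied", "neutral or dissatisfied",
        "satisfied", "neutral or dissatisfied", "neutral or dissatisfied", "neutral or dissatisfied",
    ],
})

# normalize="index" turns each row of counts into proportions that sum to 1
rates = pd.crosstab(toy["Class"], toy["satisfaction_v2"], normalize="index")
print(rates)
```

The same call on `satisfactionData` would put a number on the relation between `Class` and `satisfaction_v2` seen in the plots.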

<div class="alert alert-block alert-info">
    <b>Departure/Arrival time convenient (Categorical)</b><br>
    This variable seems to describe the convenience of the flight departure and arrival times.<br>
    Although flight timings are provided by the airline, the passenger normally picks the timeslot.<br>
    As such, we labeled it under <b>Passenger Variables</b><br>
    <b>Values</b><br>
    We observed that there are 6 unique values from 0 to 5.<br>
    It is a <i>rating</i> type variable.<br>
    <b>Distribution</b><br>
    Rating <code>3</code> is the most common, followed closely by <code>2</code> and <code>4</code>.<br>    
    Rating <code>0</code> is the least common. <br>
    <b>Relation</b>
</div>

In [None]:
satisfactionData['Departure/Arrival time convenient'].describe()

In [None]:
sb.catplot(x = 'Departure/Arrival time convenient', data = satisfactionData, kind = "count")

In [None]:
# satisfaction_v2 vs Departure/Arrival time convenient
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Departure/Arrival time convenient']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

In [None]:
sb.catplot(x = 'Departure/Arrival time convenient', data = satisfactionData, 
           hue="satisfaction_v2", kind = "count", aspect= 2.5)

<div class="alert alert-block alert-info">
    <b>Type of Travel (Categorical)</b><br>
    This variable seems to describe the type/purpose of travel of the passenger.<br>
    <b>Values</b><br>
    We observed that there are 2 unique values: <code>Personal Travel</code> and <code>Business travel</code>. <br>
    <b>Distribution</b><br>
    <code>Business travel</code> is the more common value, with 89693 records. <br>
    <b>Relation</b><br>
    <code>Business travel</code> appears to have higher satisfaction.
</div>

In [None]:
print(satisfactionData['Type of Travel'].describe())
travelTypes = satisfactionData['Type of Travel'].unique()
print(travelTypes)

In [None]:
sb.catplot(x = 'Type of Travel', data = satisfactionData, kind = "count", aspect= 2.5)

In [None]:
# satisfaction_v2 vs Type of Travel
f = plt.figure(figsize=(6, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Type of Travel']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

In [None]:
sb.catplot(x = 'Type of Travel', data = satisfactionData, 
           hue="satisfaction_v2", kind = "count", aspect= 2.5)

#### Service Variables
_Rating_ type variables that are affected by the service provided by the airline.<br>
<code>Seat comfort</code> 
<code>Food and drink</code> 
<code>Inflight wifi service</code> 
<code>Inflight entertainment</code> 
<code>Online support</code> 
<code>Ease of Online booking</code> 
<code>On-board service</code> 
<code>Leg room service</code> 
<code>Baggage handling</code>
<code>Checkin service</code>
<code>Cleanliness</code>
<code>Online boarding</code>


#### Circumstance Variables
Variables that depend on the circumstances of the flight rather than on the passenger or the airline's service.<br>
<code>Flight Distance</code> 
<code>Gate location</code>
   

In [None]:
satisfactionData['Flight Distance'].describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 32))
sb.boxplot(data = satisfactionData['Flight Distance'], orient = "h", ax = axes[0])
sb.histplot(data = satisfactionData['Flight Distance'], ax = axes[1])
sb.violinplot(data = satisfactionData['Flight Distance'], orient = "h", ax = axes[2])

In [None]:
# satisfaction_v2 vs Flight Distance
f = plt.figure(figsize=(16, 8))
sb.stripplot(x = 'Flight Distance', y = 'satisfaction_v2', data = satisfactionData)

In [None]:
satisfactionData['Gate location'].describe()

In [None]:
sb.catplot(y = 'Gate location', data = satisfactionData, kind = "count")

In [None]:
# satisfaction_v2 vs Gate location
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Gate location']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

#### Other Variables
<code>Departure Delay in Minutes</code> 
<code>Arrival Delay in Minutes</code>

In [None]:
satisfactionData['Departure Delay in Minutes'].describe()

In [None]:
# Keep only the flights that had a non-zero departure delay
test = satisfactionData.loc[satisfactionData['Departure Delay in Minutes'] != 0]
test

In [None]:
test['Departure Delay in Minutes'].describe()

In [None]:
f = plt.figure(figsize=(18, 4))
#sb.boxplot(data = test[['Departure Delay in Minutes']], orient = "h",showfliers=False)
sb.boxplot(data = test[['Departure Delay in Minutes']], orient = "h",showfliers=True)

In [None]:
plt.figure(figsize=(18, 18))
sb.kdeplot(data = test,x='Departure Delay in Minutes',hue='satisfaction_v2')

In [None]:
departureDelay = test[['Departure Delay in Minutes','satisfaction_v2']].copy()
# Quartiles only make sense for the numeric column, so leave out satisfaction_v2
delay = departureDelay[['Departure Delay in Minutes']]
# Calculate the quartiles
Q1 = delay.quantile(0.25)
Q3 = delay.quantile(0.75)
# Rule to identify outliers
rule = ((delay < (Q1 - 1.5 * (Q3 - Q1))) | (delay > (Q3 + 1.5 * (Q3 - Q1))))
outliers = rule.any(axis = 1)
departureDelay

In [None]:
# Find the rows where ANY column is True
outliers = rule.any(axis = 1)   # axis = 1 combines across columns, giving one flag per row

# Check the outliers -- it's a boolean Series
outliers

In [None]:
outliers.value_counts()

In [None]:
outlierindices = outliers.index[outliers == True]
outlierindices

In [None]:
# Remove the outliers based on the row indices obtained above
departureDelay.drop(axis = 0,               # 0 drops row 1 drops column
                index = outlierindices, # this takes a list as input
                inplace = True)         # not overwritten by default 

# Check the clean data
departureDelay
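
The outlier steps above follow the usual 1.5 × IQR rule. As a compact illustration, the sketch below wraps the rule in a hypothetical helper, `iqr_outlier_mask` (our own name, not a pandas function), and applies it to a few made-up delays.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical delays in minutes; 300 is far from the rest
delays = pd.Series([5, 10, 12, 15, 20, 25, 300], name="Departure Delay in Minutes")

mask = iqr_outlier_mask(delays)   # boolean Series, True marks an outlier
cleaned = delays[~mask]           # keep only the rows that are not outliers
print(cleaned.tolist())
```

Filtering with the boolean mask gives the same result as dropping rows by index, without modifying the original Series.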

In [None]:
sb.histplot(data = departureDelay['Departure Delay in Minutes'])

In [None]:
sb.kdeplot(data = departureDelay['Departure Delay in Minutes'])

In [None]:
departureDelay

In [None]:
#f = plt.figure(figsize=(20, 20))
sb.kdeplot(data = departureDelay,x='Departure Delay in Minutes',hue='satisfaction_v2')

#### Non-Predictor Variables
<code>Gender</code> <code>Customer Type</code> <code>Age</code> <code>Type of Travel</code>

In [None]:
satisfactionData['Gender'].describe()

In [None]:
sb.catplot(y = 'Gender', data = satisfactionData, kind = "count")

In [None]:
# satisfaction_v2 vs Gender
f = plt.figure(figsize=(6, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Gender']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

In [None]:
satisfactionData['Age'].describe()

In [None]:
f = plt.figure(figsize=(16, 8))
sb.stripplot(x = 'Age', y = 'satisfaction_v2', data = satisfactionData)

In [None]:
sb.catplot(x = 'Age', y = 'satisfaction_v2', row = 'Gender', data = satisfactionData, kind = 'box', aspect = 4)

#### Customer Type

In [None]:
satisfactionData['Customer Type'].describe()

In [None]:
sb.catplot(y = 'Customer Type', data = satisfactionData, kind = "count")

In [None]:
# satisfaction_v2 vs Customer Type
f = plt.figure(figsize=(6, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Customer Type']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

#### Type of Travel

In [None]:
satisfactionData['Type of Travel'].describe()

In [None]:
sb.catplot(y = 'Type of Travel', data = satisfactionData, kind = "count")

In [None]:
#sb.catplot(y = 'Type of Travel', data = satisfactionData, kind = "count")
# satisfaction_v2 vs Type of Travel
f = plt.figure(figsize=(6, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Type of Travel']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")