# US Airline Satisfaction Mini Project 

In this project, we perform some analysis on a dataset from the __[US Airline passenger satisfaction survey](https://www.kaggle.com/datasets/najibmh/us-airline-passenger-satisfaction-survey?resource=download)__.

## Contents
- [Problem](#problem)
- [Data Preparation](#data-preparation)
- [Exploratory Analysis](#exploratory-analysis)
- [Models](#models)


---

<a id="problem"></a>
## Problem
Based on passenger ratings, we would like to find out how the different individual ratings affect whether a passenger ends up _satisfied_ or _unsatisfied_ with the service provided by US Airline.

**Specifically**:
1. Can we predict whether a customer will be satisfied?
1. What are the most important factors that affect customer satisfaction?

---

<a id="data-preparation"></a>
## Data Preparation

### Essential Libraries

Let us begin by importing the essential Python Libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set() # set the default Seaborn style for graphics

### Import the Dataset
Source: __[US Airline passenger satisfaction survey](https://www.kaggle.com/datasets/najibmh/us-airline-passenger-satisfaction-survey?resource=download)__

Attached file: `satisfaction_v2.csv`  

In [None]:
satisfactionData = pd.read_csv('satisfaction.csv')
satisfactionData.head()

In [None]:
satisfactionData.info()

#### Initial Observations
* There are `24` columns and `129880` rows in the dataset.   
* The response variable seems to be `satisfaction_v2`.
* The following `5` columns are non-predictors or unlikely to be predictors: ID, Gender, Customer Type, Age and Type of Travel.
* The remaining `18` columns are potential predictor variables.

#### Predictor Variables
* There are `16` variables identified as `int64` by default, but only `Flight Distance` and `Departure Delay in Minutes` appear to be truly numeric. The remaining `14` variables are ratings from 0 to 5 and should be treated as categorical.
* The `Arrival Delay in Minutes` variable is identified as `float64` by default and appears to be numeric.
* The `Class` variable is identified as `object` by default and is most likely categorical.
* `Arrival Delay in Minutes` appears to be missing some values.
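
A quick way to confirm which columns have gaps is to count the nulls per column. The sketch below uses a small made-up frame (`demo` is hypothetical, not the survey data) so that it runs on its own; on the real data the same call would presumably be applied to `satisfactionData`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data
demo = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 5, 0],
    "Arrival Delay in Minutes": [0.0, np.nan, 7.0, np.nan],
})

# Number of missing values in each column
missing_per_column = demo.isna().sum()

# Show only the columns that actually have missing values
print(missing_per_column[missing_per_column > 0])
```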

### Dataset Cleaning

<div class="alert alert-block alert-info">
    <b>Missing Values: </b> <code>Arrival Delay in Minutes</code> has a non-null count of <code>129487</code> instead of <code>129880</code> because it contains <code>NULL</code> values. We will replace them with <code>0</code> here.
</div>

In [None]:
# Check count
satisfactionData['Arrival Delay in Minutes'].count()

In [None]:
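# Assign the result back; calling fillna(inplace=True) on a selected column may not update the frame in newer pandas
satisfactionData['Arrival Delay in Minutes'] = satisfactionData['Arrival Delay in Minutes'].fillna(0)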
# Check count
satisfactionData['Arrival Delay in Minutes'].count()

<div class="alert alert-block alert-info">
    Check that the <code>id</code>s are unique. 
</div>

In [None]:
len(satisfactionData["id"].unique())
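
Comparing the number of unique ids with the row count by eye works, but the check can also be made explicit. A minimal sketch on made-up ids (`demo_ids` is hypothetical), assuming the same calls would be applied to the `id` column:

```python
import pandas as pd

# Hypothetical toy ids; the last one repeats an earlier value
demo_ids = pd.Series([101, 102, 103, 102], name="id")

# True only when every value appears exactly once
print(demo_ids.is_unique)

# Number of rows whose id already appeared earlier
n_duplicates = int(demo_ids.duplicated().sum())
print(n_duplicates)
```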

<div class="alert alert-block alert-info">
    <b>Ordinal Categorical Variables</b><br>
    Most ordinal categorical variables are rating types already stored as <code>int</code>, so no conversion is required. <br>
    The <code>non-int</code> types <code>Class</code> and <code>Customer Type</code> are converted in 
    <a href="#exploratory-analysis">Exploratory Analysis</a>
</div>

---
<a id="exploratory-analysis"></a>
## Exploratory Analysis

### Response Variable
Let's take a look at the response variable `satisfaction_v2`.

In [None]:
sb.catplot(y = 'satisfaction_v2', data = satisfactionData, kind = "count")

In [None]:
countG, countB = satisfactionData['satisfaction_v2'].value_counts()
print("[satisfied] : [neutral/dissatisfied] = [", countG, "] : [", countB, "]")

<div class="alert alert-block alert-info">
    The <code>satisfied</code> to <code>neutral/dissatisfied</code> ratio of <code> 71087 : 58793 </code> is acceptable. We will not perform any rebalancing. 
</div>
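
The balance can also be expressed as proportions, and the counts can be looked up by label so that the result does not depend on the sort order of `value_counts()`. A small sketch on a made-up response column (`demo_response` is hypothetical):

```python
import pandas as pd

# Hypothetical toy response column
demo_response = pd.Series(
    ["satisfied", "neutral or dissatisfied", "satisfied", "satisfied",
     "neutral or dissatisfied"],
    name="satisfaction_v2",
)

counts = demo_response.value_counts()
shares = demo_response.value_counts(normalize=True)

# Look the counts up by label instead of relying on their order
print(counts["satisfied"], counts["neutral or dissatisfied"])
print(shares.round(2))
```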

### Predictor Variables
Let's take a look at the `18` predictor variables.<br>
We shall split them into the following subcategories.

* Passenger: variables relating to the passenger.
* Service: variables corresponding to the services provided by the airline.
* Others: variables that do not fall into the above categories.

In [None]:
satisfactionData.iloc[:,6:24].info()

<a id="passenger-variables-ea"></a>
#### Passenger Variables
Variables relating to the passenger. <br>
**Categorical** : 
[`Class`](#class-ea)
[`Type of Travel`](#type-of-travel-ea)
[`Customer Type`](#customer-type-ea)
[`Gender`](#gender-ea) <br>

**Numeric** : [`Age`](#age-ea) <br>

<a id="class-ea"></a>
<div class="alert alert-block alert-info">
    <b>Class (Categorical)</b><br>
    The <code>Class</code> variable seems to describe the flight class the passenger travelled in.<br>
    Since this is normally chosen by the passenger, we labeled it under <b>Passenger Variables</b>.<br>
    <b>Values</b><br>
    We observed that there are 3 unique values for the <code>Class</code> variable.<br>
    Their ordinal ranks (ascending) appear to be as follows:<br>
    1: <code>Eco</code> 2: <code>Eco Plus</code> 3: <code>Business</code><br> 
    We will convert them accordingly. <br>
    <b>Distribution</b><br>
    The most common value is <code>Business</code>, followed closely by <code>Eco</code>.<br>    
    <code>Eco Plus</code> is the least common. <br>
    <b>Relation</b><br>
    <code>Business</code> class has the highest satisfaction rate, while passengers in <code>Eco</code> and
    <code>Eco Plus</code> are more often neutral/dissatisfied.
    <br><br><a href="#passenger-variables-ea">Return</a>
</div>

In [None]:
print(satisfactionData['Class'].describe())
classTypes = satisfactionData['Class'].unique()
print(classTypes)

In [None]:
from pandas.api.types import CategoricalDtype
cat_type_class = CategoricalDtype(categories=['Eco', 'Eco Plus', 'Business'], ordered=True)
satisfactionData['Class'] = satisfactionData["Class"].astype(cat_type_class)
satisfactionData['Class'].head()
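
An ordered categorical keeps the original labels while still carrying the rank. The sketch below, on a made-up series (`demo_class` is hypothetical), shows one way to read the 1/2/3 ranks back out and to compare classes:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

class_dtype = CategoricalDtype(categories=["Eco", "Eco Plus", "Business"], ordered=True)

# Hypothetical toy values
demo_class = pd.Series(["Business", "Eco", "Eco Plus", "Eco"]).astype(class_dtype)

# .cat.codes is 0-based, so add 1 to obtain the ranks 1, 2, 3
class_rank = demo_class.cat.codes + 1
print(class_rank.tolist())

# Ordered categoricals also support comparisons against a category
above_eco = demo_class > "Eco"
print(above_eco.tolist())
```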

In [None]:
sb.catplot(x = 'Class', data = satisfactionData, kind = "count", aspect= 2)

In [None]:
sb.catplot(x='Class', data = satisfactionData, hue= 'satisfaction_v2', kind = "count", aspect= 2)

In [None]:
# satisfaction_v2 vs Class
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Class']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")
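
The heatmap shows raw counts, which makes rates hard to compare when the groups differ in size. One possible complement is a column-normalised cross-tabulation; the sketch below uses a made-up frame (`demo` is hypothetical) rather than the survey data.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    "Class": ["Eco", "Eco", "Business", "Business", "Business", "Eco Plus"],
    "satisfaction_v2": ["neutral or dissatisfied", "satisfied", "satisfied",
                        "satisfied", "neutral or dissatisfied",
                        "neutral or dissatisfied"],
})

# normalize="columns" makes each Class column sum to 1,
# so every cell is the share of that class with the given response
rates = pd.crosstab(demo["satisfaction_v2"], demo["Class"], normalize="columns")
print(rates.round(2))
```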

<a id="type-of-travel-ea"></a>
<div class="alert alert-block alert-info">
    <b>Type of Travel (Categorical)</b><br>
    This variable seems to describe the type/purpose of the passenger's travel.<br>
    <b>Values</b><br>
    We observed that there are 2 unique values: <code>Personal Travel</code> and <code>Business travel</code>. <br>
    <b>Distribution</b><br>
    <code>Business travel</code> is the more common value, with a count of 89693. <br>
    <b>Relation</b><br>
    <code>Business travel</code> appears to have a higher satisfaction rate.
    <br><br><a href="#passenger-variables-ea">Return</a>
</div>

In [None]:
print(satisfactionData['Type of Travel'].describe())
travelTypes = satisfactionData['Type of Travel'].unique()
print(travelTypes)

In [None]:
sb.catplot(x = 'Type of Travel', data = satisfactionData, kind = "count", 
           aspect= 2, order=['Business travel', 'Personal Travel'] )

In [None]:
sb.catplot(x = 'Type of Travel', data = satisfactionData, kind = "count", hue="satisfaction_v2",
           aspect= 2, order=['Business travel', 'Personal Travel'])

In [None]:
# satisfaction_v2 vs Type of Travel
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Type of Travel']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<a id="customer-type-ea"></a>
<div class="alert alert-block alert-info">
    <b>Customer Type (Categorical)</b><br>
    This variable seems to describe whether the passenger is a loyal customer.<br>
    <b>Values</b><br>
    We observed that there are 2 unique values: <code>Loyal Customer</code> and <code>disloyal Customer</code>. <br>
    We will assign them ordinal ranks as follows: <br>
    1: <code>disloyal Customer</code> 2: <code>Loyal Customer</code> <br>
    <b>Distribution</b><br>
    <code>Loyal Customer</code> is the more common value, with a count of 106100. <br>
    <b>Relation</b><br>
    <code>Loyal Customer</code> appears to have a higher satisfaction rate.
    <br><br><a href="#passenger-variables-ea">Return</a>
</div>

In [None]:
# Keep working on the cleaned frame; re-reading the CSV here would undo the earlier cleaning and conversions
print(satisfactionData['Customer Type'].describe())
customerTypes = satisfactionData['Customer Type'].unique()
print(customerTypes)

In [None]:
cat_type_customer = CategoricalDtype(categories=['disloyal Customer', 'Loyal Customer'], ordered=True)
satisfactionData['Customer Type'] = satisfactionData['Customer Type'].astype(cat_type_customer)
satisfactionData['Customer Type'].head()

In [None]:
sb.catplot(x = 'Customer Type', data = satisfactionData, kind = "count", aspect= 2)

In [None]:
sb.catplot(x = 'Customer Type', data = satisfactionData, 
           hue="satisfaction_v2", kind = "count", aspect= 2)

In [None]:
# satisfaction_v2 vs Customer Type
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Customer Type']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<a id="gender-ea"></a>
<div class="alert alert-block alert-info">
    <b>Gender (Categorical)</b><br>
    <b>Values</b><br>
    There are 2 unique values: <code>Male</code> and <code>Female</code>. <br>
    <b>Distribution</b><br>
    The two values are almost evenly distributed (63981 : 65899).  <br>
    <b>Relation</b><br>
    It appears that <code>Female</code> passengers have a higher satisfaction rate.
    <br><br><a href="#passenger-variables-ea">Return</a>
</div>

In [None]:
satisfactionData['Gender'].describe()

In [None]:
sb.catplot(x = 'Gender', data = satisfactionData, kind = "count", aspect= 2)

In [None]:
sb.catplot(x = 'Gender', data = satisfactionData, 
           hue="satisfaction_v2", kind = "count", aspect= 2)

In [None]:
# satisfaction_v2 vs Gender
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Gender']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<a id="age-ea"></a>
<div class="alert alert-block alert-info">
    <b>Age (Numeric)</b><br>
    <b>Values</b><br>
    <code>Age</code> is a numeric variable; its range and quartiles are given by <code>describe()</code> in the cell below.<br>
    <b>Relation</b><br>
    It appears that passengers aged <code>40</code> to <code>60</code> have a higher satisfaction rate.
    <br><br><a href="#passenger-variables-ea">Return</a>
</div>

In [None]:
satisfactionData['Age'].describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 32))
sb.boxplot(data = satisfactionData['Age'], orient = "h", ax = axes[0])
sb.histplot(data = satisfactionData['Age'], ax = axes[1])
sb.violinplot(data = satisfactionData['Age'], orient = "h", ax = axes[2])

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 64))
sb.boxplot(data = satisfactionData, orient = "h", ax = axes[0],
          x ='Age', y = 'satisfaction_v2')
sb.kdeplot(data = satisfactionData, ax = axes[1],
          x ='Age', hue = 'satisfaction_v2')
sb.violinplot(data = satisfactionData, orient = "h", ax = axes[2],
          x ='Age', y = 'satisfaction_v2')
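
The age pattern read off the plots can be cross-checked numerically by binning ages and computing the share of satisfied passengers per bin. The sketch below uses made-up rows (`demo` is hypothetical), and the bin edges are an assumption chosen to mirror the 40 to 60 range; note that `pd.cut` bins are right-closed, so age 40 falls in the first band.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    "Age": [22, 35, 45, 52, 58, 67],
    "satisfaction_v2": ["neutral or dissatisfied", "neutral or dissatisfied",
                        "satisfied", "satisfied",
                        "neutral or dissatisfied", "satisfied"],
})

# Right-closed bins: (0, 40], (40, 60], (60, 120]
age_band = pd.cut(demo["Age"], bins=[0, 40, 60, 120],
                  labels=["<=40", "41-60", ">60"])

# Mean of a boolean column is the share of True values
is_satisfied = demo["satisfaction_v2"].eq("satisfied")
rate_by_band = is_satisfied.groupby(age_band, observed=False).mean()
print(rate_by_band)
```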

<div class="alert alert-block alert-info">
    <b>Age + Gender</b><br>
    Analyse the relation with Age + Gender.<br>
    We observe that the graphs are fairly similar. Not much difference in relation.
</div>

In [None]:
sb.catplot(x = 'Age', y = 'satisfaction_v2', row = 'Gender', data = satisfactionData, kind = 'box', aspect = 4)

In [None]:
sb.catplot(x = 'Age', y = 'satisfaction_v2', row = 'Gender', 
           data = satisfactionData, kind = 'violin', aspect = 4)

<a id="service-variables-ea"></a>
#### Service Variables
For our problem case, we will be focusing mainly on services on board the plane.<br>
**Focus**:
<code>Seat comfort</code> 
<code>Food and drink</code> 
<code>Inflight wifi service</code> 
<code>Inflight entertainment</code> 
<code>On-board service</code> 
<code>Leg room service</code> 
<code>Checkin service</code>
<code>Cleanliness</code>

**Non-Focus**:
<code>Online support</code> 
<code>Ease of Online booking</code> 
<code>Baggage handling</code>
<code>Online boarding</code>



<div class="alert alert-block alert-info">
    <b>Focus Service Variables (Categorical)</b><br>
    <b>Values</b><br>
    We observed that there are 6 unique values from 0 to 5.<br>
    All of which are <i>rating</i> type variables.<br>
    <b>Relation</b><br>
    We observed that, in general, the two highest ratings (4 and 5) have a higher satisfaction rate.<br>
    More notably, 
    <code>Seat comfort</code>,
    <code>Food and drink</code> and
    <code>Inflight entertainment</code> seem to have a strong relation to satisfaction. <br>
    Specifically, the two highest ratings have a higher satisfaction rate and, additionally,<br>
    the mid-range ratings have a higher neutral/dissatisfied rate.
</div>

In [None]:
focusVariables = ['Seat comfort', 'Food and drink', 
                  'Inflight wifi service', 'Inflight entertainment',
                  'On-board service','Leg room service','Checkin service',
                  'Cleanliness']

satisfactionData[focusVariables].describe()

In [None]:
from IPython.display import Markdown, display

def printHeader(col, phref):
    # Render an info box with the column name and a link back to the parent section
    markdown = f'<div class="alert alert-block alert-info"><b>{col}</b><br><br><a href="#{phref}">Return</a></div>'
    display(Markdown(markdown))

def printExploratoryAnalysis(col, data, dataType, parentSectionId):
    printHeader(col, parentSectionId)
    dataType = dataType.lower()  # accept 'categorical' / 'Numerical' in any casing
    if dataType == 'categorical':
        # Distribution, distribution split by response, and count heatmap
        sb.catplot(x = col, data = data, kind = "count", aspect= 2)
        sb.catplot(x = col, data = data, hue="satisfaction_v2", kind = "count", aspect= 2)
        f = plt.figure(figsize=(15, 4))
        sb.heatmap(data.groupby(['satisfaction_v2', col]).size().unstack(),
                   linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")
        plt.show()
    elif dataType == 'numerical':
        # Univariate plots of the column taken from the frame passed in
        f, axes = plt.subplots(3, 1, figsize=(64, 32))
        sb.boxplot(data = data[col], orient = "h", ax = axes[0])
        sb.histplot(data = data[col], ax = axes[1])
        sb.violinplot(data = data[col], orient = "h", ax = axes[2])
        # The same column plotted against the response
        f, axes = plt.subplots(3, 1, figsize=(64, 48))
        sb.boxplot(data = data, orient = "h",
                   x = col, y = 'satisfaction_v2', ax = axes[0])
        sb.kdeplot(data = data,
                   x = col, hue = 'satisfaction_v2', ax = axes[1])
        sb.violinplot(data = data, orient = "h",
                      x = col, y = 'satisfaction_v2', ax = axes[2])
        plt.show()
    display(Markdown('---'))

In [None]:
for var in focusVariables:
    printExploratoryAnalysis(var, satisfactionData, 'categorical', 'service-variables-ea')

<a id="other-variables-ea"></a>
#### Other Variables
Other variables that are not related to customer or airline service. <br>
<b>Categorical</b>: 
<a href="#gate-location-ea"><code>Gate location</code></a>
<a href="#departure-arrival-convenient-ea"><code>Departure/Arrival convenient</code></a>
<br>
<b>Numeric</b> : 
<a href="#flight-distance-ea"><code>Flight Distance</code></a>
<a href="#departure-delay-ea"><code>Departure Delay in Minutes</code></a>
<a href="#arrival-delay-ea"><code>Arrival Delay in Minutes</code></a>


<a id="gate-location-ea"></a>
<div class="alert alert-block alert-info">
    <b>Gate Location (Categorical)</b><br>
    This variable most likely represents the convenience of the gate location.<br>
    As the airline may not choose its gate location, we did not include it under service.<br>
    <b>Values</b><br>
    We observed that there are 6 unique values from 0 to 5.<br>
    It is a <i>rating</i> type variable.<br>
    <b>Distribution</b><br>
    Rating <code>3</code> is the most common. <br>
    Rating <code>0</code> is the least common. <br>
    <b>Relation</b>
    <br><br><a href="#other-variables-ea">Return</a>
</div>

In [None]:
satisfactionData['Gate location'].describe()

In [None]:
# catplot creates its own figure, so no separate plt.figure() call is needed
sb.catplot(x = 'Gate location', data = satisfactionData, kind = "count", aspect= 2)

In [None]:
sb.catplot(x = 'Gate location', data = satisfactionData, 
           hue="satisfaction_v2", kind = "count", aspect= 2)

In [None]:
# satisfaction_v2 vs Gate location
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Gate location']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<a id="departure-arrival-convenient-ea"></a>
<div class="alert alert-block alert-info">
    <b>Departure/Arrival time convenient (Categorical)</b><br>
    This variable seems to describe the convenience of the flight departure and arrival times.<br>
    Although flight timings are provided by the airline, the passenger normally picks the time slot.<br>
    As such, we labeled it under <b>Other Variables</b>.<br>
    <b>Values</b><br>
    We observed that there are 6 unique values from 0 to 5.<br>
    It is a <i>rating</i> type variable.<br>
    <b>Distribution</b><br>
    Rating <code>3</code> is the most common, followed closely by <code>2</code> and <code>4</code>.<br>    
    Rating <code>0</code> is the least common. <br>
    <b>Relation</b>
    <br><br><a href="#other-variables-ea">Return</a>
</div>

In [None]:
satisfactionData[['Departure/Arrival time convenient']].describe()

In [None]:
sb.catplot(x = 'Departure/Arrival time convenient', data = satisfactionData, kind = "count", aspect= 2)

In [None]:
sb.catplot(x = 'Departure/Arrival time convenient', data = satisfactionData, 
           hue="satisfaction_v2", kind = "count", aspect= 2)

In [None]:
# satisfaction_v2 vs Departure/Arrival time convenient
f = plt.figure(figsize=(15, 4))
sb.heatmap(satisfactionData.groupby(['satisfaction_v2', 'Departure/Arrival time convenient']).size().unstack(), 
           linewidths = 1, annot = True, fmt = 'g', annot_kws = {"size": 18}, cmap = "BuGn")

<a id="flight-distance-ea"></a>
<div class="alert alert-block alert-info">
    <b>Flight Distance (Numeric)</b><br>
    This variable describes the flight distance most likely in miles.<br>
    <b>Relation</b><br>
    The satisfaction rate appears to be higher for flights below <code>1000</code> miles.
    <br><br><a href="#other-variables-ea">Return</a>
</div>

In [None]:
satisfactionData['Flight Distance'].describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 32))
sb.boxplot(data = satisfactionData['Flight Distance'], orient = "h", ax = axes[0])
sb.histplot(data = satisfactionData['Flight Distance'], ax = axes[1])
sb.violinplot(data = satisfactionData['Flight Distance'], orient = "h", ax = axes[2])

In [None]:
f = plt.figure(figsize=(15, 8))
sb.boxplot(data = satisfactionData, orient = "h",
          x ='Flight Distance', y = 'satisfaction_v2')

In [None]:
f = plt.figure(figsize=(15, 8))
sb.kdeplot(data = satisfactionData, x='Flight Distance',hue='satisfaction_v2')

In [None]:
f = plt.figure(figsize=(15,8))
sb.violinplot(data = satisfactionData, orient = 'h',
              x = 'Flight Distance', y = 'satisfaction_v2')

<a id="departure-delay-ea"></a>
<div class="alert alert-block alert-info">
    <b>Departure Delay in Minutes (Numeric)</b><br>
    Rows with a departure delay of <code>0</code> are excluded.<br>
    <br><br><a href="#other-variables-ea">Return</a>
</div>

In [None]:
satisfactionData['Departure Delay in Minutes'].describe()

In [None]:
departDelayData = satisfactionData.loc[~((satisfactionData['Departure Delay in Minutes'] == 0))]
departDelayData['Departure Delay in Minutes'].describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 32))
sb.boxplot(data = departDelayData[['Departure Delay in Minutes']], orient = "h", ax = axes[0])
sb.histplot(data = departDelayData[['Departure Delay in Minutes']], ax = axes[1])
sb.violinplot(data = departDelayData[['Departure Delay in Minutes']], orient = "h", ax = axes[2])
#sb.boxplot(data = departDelayData[['Departure Delay in Minutes']], orient = "h",showfliers=True)

<div class="alert alert-block alert-info">
    <b>Remove outliers</b>
</div>
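
The rule used in the next cell is the common 1.5 × IQR fence: a value counts as an outlier if it lies below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. A minimal standalone sketch on made-up delays (the helper `iqr_bounds` and the series `demo_delay` are hypothetical names used only for illustration):

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy delays with one extreme value
demo_delay = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

lower, upper = iqr_bounds(demo_delay)

# Keep only the values inside the fences (inclusive)
kept = demo_delay[demo_delay.between(lower, upper)]
print(lower, upper, kept.tolist())
```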

In [None]:
departDelayDataClean = departDelayData[['Departure Delay in Minutes','satisfaction_v2']].copy()
# Calculate the quartiles on the numeric column only (satisfaction_v2 is a string column)
departDelay = departDelayDataClean['Departure Delay in Minutes']
Q1 = departDelay.quantile(0.25)
Q3 = departDelay.quantile(0.75)
# Rule to identify outliers: more than 1.5 * IQR outside the quartiles
departDelayOutliers = ((departDelay < (Q1 - 1.5 * (Q3 - Q1))) 
                       | (departDelay > (Q3 + 1.5 * (Q3 - Q1))))
departDelayOutlierindices = departDelayOutliers.index[departDelayOutliers]

# Remove the outliers based on the row indices obtained above
departDelayDataClean.drop(index = departDelayOutlierindices, # row labels to drop
                          inplace = True)                    # modify the frame in place
# Check the clean data
departDelayDataClean['Departure Delay in Minutes'].describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 32))
sb.boxplot(data = departDelayDataClean[['Departure Delay in Minutes']], orient = "h", ax = axes[0])
sb.histplot(data = departDelayDataClean[['Departure Delay in Minutes']], ax = axes[1])
sb.violinplot(data = departDelayDataClean[['Departure Delay in Minutes']], orient = "h", ax = axes[2])

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 48))
sb.boxplot(data = departDelayDataClean, orient = "h",
           x ='Departure Delay in Minutes', y = 'satisfaction_v2', ax = axes[0])
sb.kdeplot(data = departDelayDataClean, 
           x='Departure Delay in Minutes', hue='satisfaction_v2', ax = axes[1])
sb.violinplot(data = departDelayDataClean, orient = "h",
               x ='Departure Delay in Minutes', y = 'satisfaction_v2', ax = axes[2])

<a id="arrival-delay-ea"></a>
<div class="alert alert-block alert-info">
    <b>Arrival Delay in Minutes (Numeric)</b><br>
    Rows with an arrival delay of <code>0</code> are excluded.<br>
    <br><br><a href="#other-variables-ea">Return</a>
</div>

In [None]:
arriveDelayData = satisfactionData.loc[~((satisfactionData['Arrival Delay in Minutes'] == 0))]
arriveDelayData['Arrival Delay in Minutes'].describe()

In [None]:
arriveDelayDataClean = arriveDelayData[['Arrival Delay in Minutes','satisfaction_v2']].copy()
# Calculate the quartiles on the numeric column only (satisfaction_v2 is a string column)
arriveDelay = arriveDelayDataClean['Arrival Delay in Minutes']
adQ1 = arriveDelay.quantile(0.25)
adQ3 = arriveDelay.quantile(0.75)
# Rule to identify outliers: more than 1.5 * IQR outside the quartiles
adrule = ((arriveDelay < (adQ1 - 1.5 * (adQ3 - adQ1))) 
          | (arriveDelay > (adQ3 + 1.5 * (adQ3 - adQ1))))
arriveDelayOutliers = adrule
arriveDelayOutlierindices = arriveDelayOutliers.index[arriveDelayOutliers]

# Remove the outliers based on the row indices obtained above
arriveDelayDataClean.drop(index = arriveDelayOutlierindices, # row labels to drop
                          inplace = True)                    # modify the frame in place
# Check the clean data
arriveDelayDataClean['Arrival Delay in Minutes'].describe()

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 32))
sb.boxplot(data = arriveDelayDataClean[['Arrival Delay in Minutes']], orient = "h", ax = axes[0])
sb.histplot(data = arriveDelayDataClean[['Arrival Delay in Minutes']], ax = axes[1])
sb.violinplot(data = arriveDelayDataClean[['Arrival Delay in Minutes']], orient = "h", ax = axes[2])

In [None]:
f, axes = plt.subplots(3, 1, figsize=(64, 48))
sb.boxplot(data = arriveDelayDataClean, orient = "h", ax = axes[0],
          x ='Arrival Delay in Minutes', y = 'satisfaction_v2')
sb.kdeplot(data = arriveDelayDataClean, ax = axes[1],
          x ='Arrival Delay in Minutes', hue = 'satisfaction_v2')
sb.violinplot(data = arriveDelayDataClean, orient = "h", ax = axes[2],
          x ='Arrival Delay in Minutes', y = 'satisfaction_v2')

---

<a id="models"></a>
## Models
### Creating a Model for satisfaction_v2 : Attempt 1 - Multi-Variate Classification Tree

In [None]:
# Import the encoder from sklearn
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors (not the response)
satisfactionData_cat = satisfactionData[['Class']]
ohe.fit(satisfactionData_cat)
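# get_feature_names_out replaces get_feature_names, which was removed from recent scikit-learn releases
satisfactionData_cat_ohe = pd.DataFrame(ohe.transform(satisfactionData_cat).toarray(), 
                                  columns=ohe.get_feature_names_out(satisfactionData_cat.columns))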

# Check the encoded variables
satisfactionData_cat_ohe.info()

In [None]:
# Combining Ordinal Category variables with the OHE Categorical variables
'''
list of columns 
['Seat comfort','Food and drink','Inflight wifi service','Inflight entertainment',
'On-board service','Leg room service', 'Checkin service','Cleanliness']

'''
satisfactionData_num = satisfactionData[['Seat comfort','Food and drink',
'Inflight wifi service','Inflight entertainment','On-board service','Leg room service',
'Checkin service','Cleanliness']]
satisfactionData_res = satisfactionData['satisfaction_v2']
satisfactionData_ohe = pd.concat([satisfactionData_num, satisfactionData_cat_ohe, satisfactionData_res], 
                           sort = False, axis = 1).reindex(index=satisfactionData_num.index)

# Check the final dataframe
satisfactionData_ohe.info()

In [None]:
# Scratch check: convert one rating column to an ordered categorical on a copy,
# so that the main dataframe is left untouched
from pandas.api.types import CategoricalDtype

test = satisfactionData.copy()
cat_type_ratings = CategoricalDtype(categories=[0,1,2,3,4,5], ordered=True)
test["Seat comfort"] = test["Seat comfort"].astype(cat_type_ratings)
test.info()

In [None]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import plot_tree

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 3)  # change max_depth to experiment
dectree.fit(X_train, y_train)                    # train the decision tree model

# Plot the trained Decision Tree
f = plt.figure(figsize=(24,24))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=list(X_train.columns),   # a plain list is accepted across scikit-learn versions
          class_names=["neutral or dissatisfied","satisfied"])
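
For the second question (which factors matter most), one option is to inspect the impurity-based feature importances of a fitted tree. This is a sketch under assumptions, not necessarily the approach the project takes: the toy data (`X_demo`, `y_demo`) is made up, and with the real model the same attribute would be read from `dectree`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy ratings; "Seat comfort" alone separates the two labels,
# while "Cleanliness" is constant and therefore carries no information
X_demo = pd.DataFrame({
    "Seat comfort": [0, 1, 2, 3, 4, 5, 4, 1],
    "Cleanliness":  [3, 3, 3, 3, 3, 3, 3, 3],
})
y_demo = ["neutral or dissatisfied", "neutral or dissatisfied",
          "neutral or dissatisfied", "neutral or dissatisfied",
          "satisfied", "satisfied", "satisfied", "neutral or dissatisfied"]

tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Importances are normalised to sum to 1 across the features
importances = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Impurity-based importances only reflect the splits the tree actually chose, so a shallow tree will assign zero to any feature it never splits on.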

#### Check the accuracy of the Model

Print the Classification Accuracy and all other Accuracy Measures from the Confusion Matrix.  

| Confusion Matrix  |       |        |        |      
| :---              | :---: | :----: | :----: |         
| Actual Negative   |  (0)  |   TN   |   FP   |             
| Actual Positive   |  (1)  |   FN   |   TP   |       
|                   |       |   (0)   |   (1)   |       
|                   |       | Predicted Negative    |   Predicted Positive  |     


* `TPR = TP / (TP + FN)` : True Positive Rate = True Positives / All Positives    
* `TNR = TN / (TN + FP)` : True Negative Rate = True Negatives / All Negatives    

* `FPR = FP / (TN + FP)` : False Positive Rate = False Positives / All Negatives 
* `FNR = FN / (TP + FN)` : False Negative Rate = False Negatives / All Positives 
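
The four rates above can be wrapped in a small helper that takes a 2×2 confusion matrix laid out as in the table (rows are actual classes, columns are predicted classes). The function name `rates_from_confusion` and the counts in `cm_demo` are hypothetical and are not results from the model.

```python
import numpy as np

def rates_from_confusion(cm):
    """Compute TPR, TNR, FPR and FNR from a matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
    }

# Hypothetical counts for illustration only
cm_demo = np.array([[40, 10],
                    [5, 45]])

rates = rates_from_confusion(cm_demo)
print(rates)
```

Since `TPR + FNR = 1` and `TNR + FPR = 1`, reporting one rate from each pair is enough to recover the other.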

In [None]:
# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

### Create a Model for satisfaction_v2 : Attempt 2 - Resampling

In [None]:
# Upsample Bad to match Good
from sklearn.utils import resample
satisfactionBad = satisfactionData_ohe[satisfactionData_ohe.satisfaction_v2 == 'neutral or dissatisfied']
satisfactionGood = satisfactionData_ohe[satisfactionData_ohe.satisfaction_v2 == 'satisfied']
 
# Upsample the Bad samples
satisfactionBad_up = resample(satisfactionBad, 
                        replace=True,                     # sample with replacement
                        n_samples=satisfactionGood.shape[0])    # to match number of Good
 
# Combine the two classes back after upsampling
satisfactionData_ohe_up = pd.concat([satisfactionGood, satisfactionBad_up])
 
# Check the ratio of the classes
satisfactionData_ohe_up['satisfaction_v2'].value_counts()
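
Upsampling the full dataset before the train/test split, as in the cell above, can place copies of the same row in both sets, which tends to make the test scores look better than they are. One possible alternative, sketched here on made-up data (`demo` is hypothetical), is to split first and upsample only the training portion:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced toy data: 12 "satisfied" vs 4 "neutral or dissatisfied"
demo = pd.DataFrame({
    "Seat comfort": [5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 1, 2, 1, 2],
    "satisfaction_v2": ["satisfied"] * 12 + ["neutral or dissatisfied"] * 4,
})

# Split first, so that test rows are never duplicated into the training set
train, test = train_test_split(demo, test_size=0.25, random_state=0,
                               stratify=demo["satisfaction_v2"])

train_good = train[train["satisfaction_v2"] == "satisfied"]
train_bad = train[train["satisfaction_v2"] == "neutral or dissatisfied"]

# Upsample the minority class within the training set only
train_bad_up = resample(train_bad, replace=True,
                        n_samples=len(train_good), random_state=0)
train_balanced = pd.concat([train_good, train_bad_up])

print(train_balanced["satisfaction_v2"].value_counts())
```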

In [None]:
# Quick plot to check the balanced classes visually
sb.catplot(y = 'satisfaction_v2', data = satisfactionData_ohe_up, kind = "count")

In [None]:
# Confirm that the OHE is still in place
# and that the samples have now increased
satisfactionData_ohe_up.info()

In [None]:
# Import essential models and functions from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import plot_tree

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 4)  # change max_depth to experiment
dectree.fit(X_train, y_train)                    # train the decision tree model

# Plot the trained Decision Tree
f = plt.figure(figsize=(24,24))
plot_tree(dectree, filled=True, rounded=True, 
          feature_names=list(X_train.columns),   # a plain list works across sklearn versions
          class_names=["neutral or dissatisfied","satisfied"])

#### Check the accuracy of the Model

In [None]:
# Predict the Response corresponding to Predictors
y_train_pred = dectree.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", dectree.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = dectree.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", dectree.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})
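
The four rates are computed the same way in every evaluation cell, so they could be wrapped in a small helper. The sketch below is one possible version; the function name `confusion_rates` and the example counts are hypothetical, and it assumes sklearn's layout (rows are actual classes, columns are predicted classes, index 1 is the positive class).

```python
def confusion_rates(cm):
    """Return TPR, TNR, FPR and FNR from a 2x2 confusion matrix.

    Assumes rows are actual classes, columns are predicted classes,
    and index 1 is the positive class.
    """
    tn, fp = cm[0][0], cm[0][1]
    fn, tp = cm[1][0], cm[1][1]
    return {
        "TPR": tp / (tp + fn),   # share of actual positives predicted positive
        "TNR": tn / (tn + fp),   # share of actual negatives predicted negative
        "FPR": fp / (fp + tn),   # share of actual negatives predicted positive
        "FNR": fn / (fn + tp),   # share of actual positives predicted negative
    }

# Hypothetical counts, for illustration only
example_cm = [[40, 10],
              [5, 45]]
rates = confusion_rates(example_cm)
for name, value in rates.items():
    print(name, ":\t", value)
```

With the notebook's variables, the call would look like `confusion_rates(confusion_matrix(y_test, y_test_pred))`.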

### Create a Model for satisfaction_v2 : Attempt 3 - Random Forest

In [None]:
# Import essential models and functions from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [None]:
# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 100,  # n_estimators denote number of trees
                                 max_depth = 4)       # set the maximum depth of each tree

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.satisfaction_v2.to_numpy())   # pass the response as a 1-D array

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = rforest.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", rforest.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = rforest.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", rforest.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import essential models and functions from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 1000,  # CHANGE AND EXPERIMENT
                                 max_depth = 4)       # CHANGE AND EXPERIMENT

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.satisfaction_v2.to_numpy())   # pass the response as a 1-D array

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = rforest.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", rforest.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = rforest.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", rforest.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

#### Increase the Depth of Decision Trees in the Forest

In [None]:
# Import essential models and functions from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 100,  # CHANGE AND EXPERIMENT
                                 max_depth = 10)       # CHANGE AND EXPERIMENT

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.satisfaction_v2.to_numpy())   # pass the response as a 1-D array

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = rforest.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", rforest.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = rforest.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", rforest.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

#### Increase both Number and Depth of Decision Trees in the Forest

In [None]:
# Import essential models and functions from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 1000,  # CHANGE AND EXPERIMENT
                                 max_depth = 10)       # CHANGE AND EXPERIMENT

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.satisfaction_v2.to_numpy())   # pass the response as a 1-D array

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = rforest.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", rforest.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = rforest.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", rforest.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

### Create a Model for satisfaction_v2 : Attempt 4 - Random Forest with GridSearchCV

In [None]:
# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [None]:
# Import GridSearch for hyperparameter tuning using Cross-Validation (CV)
from sklearn.model_selection import GridSearchCV

# Define the Hyper-parameter Grid to search on, in case of Random Forest
param_grid = {'n_estimators': np.arange(100,1001,100),   # number of trees 100, 200, ..., 1000
              'max_depth': np.arange(2, 11)}             # depth of trees 2, 3, 4, 5, ..., 10

#param_grid = {'n_estimators': np.arange(10,101,10),
#              'max_depth': np.arange(2, 4)} 

In [None]:
# Create the Hyper-parameter Grid
hpGrid = GridSearchCV(RandomForestClassifier(),   # the model family
                      param_grid,                 # the search grid
                      cv = 3,                     # 3-fold cross-validation
                      scoring = 'accuracy')       # score to evaluate

# Train the models using Cross-Validation
hpGrid.fit(X_train, y_train.satisfaction_v2.to_numpy())   # pass the response as a 1-D array

In [None]:
# Fetch the best Model or the best set of Hyper-parameters
print(hpGrid.best_estimator_)

# Print the score (accuracy) of the best Model after CV
print(np.abs(hpGrid.best_score_))
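
As a hedged illustration of how the search result can be reused without copying the numbers by hand, the sketch below runs a deliberately tiny grid on synthetic data. The names `toy_grid`, `toy_search` and `best_model`, and the data from `make_classification`, are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic problem (hypothetical stand-in for the airline data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.3, random_state=0)

# A deliberately tiny grid so the search finishes quickly
toy_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}
toy_search = GridSearchCV(RandomForestClassifier(random_state=0),
                          toy_grid, cv=3, scoring='accuracy')
toy_search.fit(X_tr, y_tr)

# The chosen hyper-parameters as a plain dict
print(toy_search.best_params_)

# With the default refit=True, best_estimator_ is already refit on all of X_tr
best_model = toy_search.best_estimator_
test_score = best_model.score(X_te, y_te)
print(test_score)
```

Reading `best_params_` or using `best_estimator_` directly keeps the final model in step with whatever the search actually selected.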

#### Use the Best Model found through GridSearchCV

In [None]:
# Import essential models and functions from sklearn
from sklearn.model_selection import train_test_split

# Extract Response and Predictors
y = pd.DataFrame(satisfactionData_ohe_up['satisfaction_v2'])
X = pd.DataFrame(satisfactionData_ohe_up.drop('satisfaction_v2', axis = 1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Import RandomForestClassifier model from Scikit-Learn
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest object
rforest = RandomForestClassifier(n_estimators = 400,   # found using GridSearchCV
                                 max_depth = 10)       # found using GridSearchCV

# Fit Random Forest on Train Data
rforest.fit(X_train, y_train.satisfaction_v2.to_numpy())   # pass the response as a 1-D array

In [None]:
# Import confusion_matrix from Scikit-Learn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_train_pred = rforest.predict(X_train)

# Print the Classification Accuracy
print("Train Data")
print("Accuracy  :\t", rforest.score(X_train, y_train))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTrain = confusion_matrix(y_train, y_train_pred)
tpTrain = cmTrain[1][1] # True Positives : Good (1) predicted Good (1)
fpTrain = cmTrain[0][1] # False Positives : Bad (0) predicted Good (1)
tnTrain = cmTrain[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTrain = cmTrain[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Train :\t", (tpTrain/(tpTrain + fnTrain)))
print("TNR Train :\t", (tnTrain/(tnTrain + fpTrain)))
print()

print("FPR Train :\t", (fpTrain/(tnTrain + fpTrain)))
print("FNR Train :\t", (fnTrain/(tpTrain + fnTrain)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_train, y_train_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})

In [None]:
# Import the required metric from sklearn
from sklearn.metrics import confusion_matrix

# Predict the Response corresponding to Predictors
y_test_pred = rforest.predict(X_test)

# Print the Classification Accuracy
print("Test Data")
print("Accuracy  :\t", rforest.score(X_test, y_test))
print()

# Print the Accuracy Measures from the Confusion Matrix
cmTest = confusion_matrix(y_test, y_test_pred)
tpTest = cmTest[1][1] # True Positives : Good (1) predicted Good (1)
fpTest = cmTest[0][1] # False Positives : Bad (0) predicted Good (1)
tnTest = cmTest[0][0] # True Negatives : Bad (0) predicted Bad (0)
fnTest = cmTest[1][0] # False Negatives : Good (1) predicted Bad (0)

print("TPR Test :\t", (tpTest/(tpTest + fnTest)))
print("TNR Test :\t", (tnTest/(tnTest + fpTest)))
print()

print("FPR Test :\t", (fpTest/(fpTest + tnTest)))
print("FNR Test :\t", (fnTest/(fnTest + tpTest)))

# Plot the two-way Confusion Matrix
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18})
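
The second question in the problem statement asks which factors matter most. A fitted random forest exposes `feature_importances_`, which could be ranked as sketched below. The data here is synthetic and the column names `f0` to `f4` are placeholders, so the printed ranking says nothing about the airline survey itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder column names (not the airline columns)
X_arr, y_arr = make_classification(n_samples=500, n_features=5,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_toy = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3', 'f4'])

# Fit a small forest, mirroring the settings style used in this notebook
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_toy, y_arr)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_,
                        index=X_toy.columns).sort_values(ascending=False)
print(importances)
```

With the notebook's variables, the same idea would be `pd.Series(rforest.feature_importances_, index=X_train.columns)`.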

### Create a Model for satisfaction_v2 : Attempt 5 - Logistic Regression 

In [None]:
col_names = ['id','satisfaction_v2','Gender','Customer Type',
'Age','Type of Travel','Class','Flight Distance',
'Seat comfort','Departure/Arrival time convenient','Food and drink','Gate location','Inflight wifi service',
'Inflight entertainment','Online support','Ease of Online booking','On-board service','Leg room service','Baggage handling',
'Checkin service','Cleanliness','Online boarding','Departure Delay in Minutes','Arrival Delay in Minutes']
# load dataset, replacing the file's header row (row 0) with col_names
pima = pd.read_csv("satisfaction.csv", header=0, names=col_names)
#pima.head()

In [None]:
# Select eight of the service-rating columns as predictors
feature_cols = ['Seat comfort','Food and drink',
'Inflight wifi service','Inflight entertainment','On-board service','Leg room service',
'Checkin service','Cleanliness']

X = pima[feature_cols] # Features
y = pima.satisfaction_v2 # Target variable

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model; max_iter is raised from the default of 100 to avoid a convergence warning
logreg = LogisticRegression(solver='lbfgs', max_iter=500)

# fit the model with data
logreg.fit(X_train,y_train)

y_pred = logreg.predict(X_test)

In [None]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
class_names = logreg.classes_ # class labels, in the order used by the confusion matrix
fig, ax = plt.subplots()

# create heatmap with the class labels on both axes
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names),
            annot=True, cmap="YlGnBu", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
#pos_label='satisfied' will take satisfied as positive else will take 1 as default
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred,pos_label='satisfied'))
print("Recall:", metrics.recall_score(y_test, y_pred,pos_label='satisfied'))

In [None]:
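y_pred_proba = logreg.predict_proba(X_test)[:, 1]   # predicted probability of the 'satisfied' class
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba, pos_label='satisfied')
auc = metrics.roc_auc_score(y_test, y_pred_proba)

plt.plot(fpr, tpr, label="Logistic Regression, AUC = " + str(round(auc, 3)))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()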

### Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies.

The AUC score is ~0.87, which is considered good: 1 represents a perfect classifier and 0.5 represents a worthless one.
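
To make the 1 versus 0.5 scale concrete, AUC can be read as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below computes it that way on a few made-up scores; the function name `pairwise_auc` and all the numbers are hypothetical.

```python
def pairwise_auc(labels, scores, positive):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of 'satisfied'
labels = ['satisfied', 'satisfied',
          'neutral or dissatisfied', 'neutral or dissatisfied']
perfect = [0.9, 0.8, 0.3, 0.1]         # every positive outranks every negative
uninformative = [0.5, 0.5, 0.5, 0.5]   # the scores carry no information
mixed = [0.8, 0.35, 0.4, 0.1]          # one of the four pairs is ranked wrongly

auc_perfect = pairwise_auc(labels, perfect, 'satisfied')
auc_uninformative = pairwise_auc(labels, uninformative, 'satisfied')
auc_mixed = pairwise_auc(labels, mixed, 'satisfied')
print(auc_perfect, auc_uninformative, auc_mixed)
```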