

# **Predicting Airline Customer Satisfaction**
## Phase 1: Data Preparation & Visualisation


<center> Names & IDs of group members </center> 

Names  | IDs
------------- | -------------
Matthew Bentham  | S3923076
John Murrowood  | S3923075
Isxaq Warsame  |  S3658179



__________

### Table of contents:
- [Introduction](#intro)
   - [Data source](#ds)
   - [Dataset details](#dd)
   - [Dataset features](#df)
   - [Target Feature](#tf)
- [Goals & Objectives](#gao) 
- [Data Cleaning & Preprocessing](#dprep)
- [Data Exploration & Visualisation](#dvis)
- [Literature Review](#lr)
- [Summary & Conclusions](#sum)
- [References](#ref)


### INTRODUCTION <a name="intro"></a>

#### **Data source:** <a name="ds"></a>

The US airline passenger satisfaction survey dataset was sourced from Kaggle, where it was uploaded by John D (2018). It contains survey results recording whether each customer was satisfied with their flight, along with passenger and flight information. It also records which aspects of the flight service each customer was or was not satisfied with.

URL: [US Airline Passenger Satisfaction](https://www.kaggle.com/datasets/johndddddd/customer-satisfaction)

#### **Dataset details:** <a name="dd"></a>

This dataset records whether customers were satisfied with their domestic flight within the USA. It includes personal details of each traveller, such as age, gender and type of travel (personal or business), as well as flight information, including in-flight duration, gate departure and whether the flight was delayed. Customers also rated aspects of the flight such as in-flight wifi, cleanliness and leg room. These features will be used in a classification problem to predict the target feature: whether a customer is satisfied or not.

The dataset has 24 features, including the target feature, split into descriptive features and survey response features, and 129,880 observations before any preprocessing is performed.

##### **Dataset Retrieval**
- The data was downloaded from Kaggle as an xlsx file. Link: [US Airline Passenger Satisfaction](https://www.kaggle.com/datasets/johndddddd/customer-satisfaction)
- As the data file is in the same GitHub directory as this report, 'satisfaction.xlsx' can be read directly.
- The first 10 rows are displayed.

In [3]:
# Reading in required packages, and setting up warnings filter
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate



airplane_df = pd.read_csv('satisfaction_cleaned_5000.csv')
airplane_df.head(10)

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Satisfaction
0,Male,Loyal Customer,8,Personal Travel,Eco,2359,0,5,0,3,...,2,3,2,4,3,4,2,4.0,2.0,satisfied
1,Male,Loyal Customer,32,Business travel,Business,1843,1,1,1,1,...,1,1,1,2,2,2,1,1.0,4.0,neutral or dissatisfied
2,Male,disloyal Customer,25,Business travel,Business,1578,4,0,4,2,...,4,5,4,4,3,4,4,2.0,0.0,satisfied
3,Male,Loyal Customer,52,Business travel,Business,384,3,5,4,4,...,3,3,3,3,3,3,4,15.0,12.0,neutral or dissatisfied
4,Male,disloyal Customer,47,Business travel,Business,1927,4,4,4,3,...,5,5,5,5,4,4,5,0.0,0.0,satisfied
5,Male,Loyal Customer,45,Business travel,Eco,1441,2,2,2,2,...,1,4,3,4,3,4,1,0.0,0.0,neutral or dissatisfied
6,Male,Loyal Customer,22,Business travel,Eco Plus,2513,4,3,3,3,...,4,4,5,1,3,4,4,0.0,0.0,satisfied
7,Male,disloyal Customer,43,Business travel,Business,2095,3,3,3,3,...,3,5,4,5,5,4,3,0.0,3.0,satisfied
8,Male,Loyal Customer,66,Personal Travel,Eco,1442,3,5,3,4,...,2,5,1,4,4,1,2,34.0,26.0,neutral or dissatisfied
9,Female,disloyal Customer,36,Business travel,Eco,1528,2,4,2,3,...,5,2,2,3,3,4,5,0.0,6.0,neutral or dissatisfied


# **One-Hot-Encoding & Integer-Encoding**
- As the target feature takes one of two values, satisfied or neutral/dissatisfied, we must integer-encode it. Nominal descriptive features, by contrast, would normally never be integer-encoded.
- Normally scikit-learn would be used for this, but since the target is a binary variable we can continue with pandas.
- Through visual inspection, it was confirmed that satisfied was correctly encoded as 1 and not 0.
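
The visual check described above can also be done programmatically. The following is a minimal sketch on made-up labels (the `labels` series is hypothetical and only mimics the `Satisfaction` column); it assumes the same `get_dummies(..., drop_first=True)` approach used in this report and compares it with an explicit mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels standing in for the 'Satisfaction' column
labels = pd.Series(
    ["satisfied", "neutral or dissatisfied", "satisfied", "neutral or dissatisfied"],
    name="Satisfaction",
)

# get_dummies sorts the levels, so drop_first removes 'neutral or dissatisfied'
# and the remaining indicator column is 1 for 'satisfied'
encoded = pd.get_dummies(labels, drop_first=True, dtype=np.int64).iloc[:, 0]

# An explicit mapping gives the same codes without relying on level order
explicit = labels.map({"neutral or dissatisfied": 0, "satisfied": 1})

# Cross-tabulating the labels against the codes replaces the visual check
print(pd.crosstab(labels, encoded))
```

If the cross-tabulation has non-zero counts only where a label meets its intended code, the encoding is as expected.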

In [7]:
# Creating a categorical columns list to be used with get_dummies()
categorical_cols = airplane_df.columns[airplane_df.dtypes==object].tolist()
categorical_cols
# Checking dataframe pre-encoding


['Gender', 'Customer Type', 'Type of Travel', 'Class', 'Satisfaction']

In [8]:
airplane_df.head()

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Satisfaction
0,Male,Loyal Customer,8,Personal Travel,Eco,2359,0,5,0,3,...,2,3,2,4,3,4,2,4.0,2.0,satisfied
1,Male,Loyal Customer,32,Business travel,Business,1843,1,1,1,1,...,1,1,1,2,2,2,1,1.0,4.0,neutral or dissatisfied
2,Male,disloyal Customer,25,Business travel,Business,1578,4,0,4,2,...,4,5,4,4,3,4,4,2.0,0.0,satisfied
3,Male,Loyal Customer,52,Business travel,Business,384,3,5,4,4,...,3,3,3,3,3,3,4,15.0,12.0,neutral or dissatisfied
4,Male,disloyal Customer,47,Business travel,Business,1927,4,4,4,3,...,5,5,5,5,4,4,5,0.0,0.0,satisfied


In [9]:
for i in categorical_cols:
    if airplane_df[i].nunique() == 2:  # binary column: a single 0/1 indicator is enough
        airplane_df[i] = pd.get_dummies(airplane_df[i], drop_first=True, dtype=np.int64)

# Remaining categorical columns have more than two levels and are one-hot encoded here
airplane_df = pd.get_dummies(airplane_df, dtype=np.int64)
airplane_df.head()  # Checking dataframe post-encoding

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,...,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Satisfaction,Class_Business,Class_Eco,Class_Eco Plus
0,1,0,8,1,2359,0,5,0,3,2,...,4,3,4,2,4.0,2.0,1,0,1,0
1,1,0,32,0,1843,1,1,1,1,1,...,2,2,2,1,1.0,4.0,0,1,0,0
2,1,1,25,0,1578,4,0,4,2,4,...,4,3,4,4,2.0,0.0,1,1,0,0
3,1,0,52,0,384,3,5,4,4,4,...,3,3,3,4,15.0,12.0,0,1,0,0
4,1,1,47,0,1927,4,4,4,3,5,...,5,4,4,5,0.0,0.0,1,1,0,0
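
To make the two encoding paths easier to follow, here is a small sketch on a hypothetical three-row frame (the `toy` data is made up; only the column and level names are taken from the dataset). It shows a binary column collapsing to one 0/1 indicator and a three-level column expanding to one indicator per level.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; level names follow the 'Gender' and 'Class' columns
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Class": ["Eco", "Business", "Eco Plus"],
})

# Binary column: drop the first (alphabetical) level and keep one indicator
toy["Gender"] = pd.get_dummies(toy["Gender"], drop_first=True, dtype=np.int64).iloc[:, 0]

# Multi-level column: one indicator column per level
toy = pd.get_dummies(toy, columns=["Class"], dtype=np.int64)
print(toy)
```

Here `Male` becomes 1 because `Female` is the first level alphabetically and is dropped, which agrees with the encoded output shown above.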


- Checking to see if the data types are all numeric after encoding

In [10]:
airplane_df.dtypes

Gender                                 int64
Customer Type                          int64
Age                                    int64
Type of Travel                         int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes           float64
Arrival Delay in Minutes             float64
Satisfaction                           int64
Class_Busi
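
Rather than reading the dtype listing by eye, the same check can be expressed in code. This is a sketch on a hypothetical frame (`toy` is made up for illustration); it lists any columns that are still non-numeric, so an empty list would mean the encoding is complete.

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns and one column left as text
toy = pd.DataFrame({
    "Age": [8, 32],
    "Departure Delay in Minutes": [4.0, 1.0],
    "Class": ["Eco", "Business"],
})

# Columns whose dtype is not numeric; an empty list means nothing is left to encode
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```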

## Scaling of Features
Once one-hot encoding has taken place, the features are scaled using min-max scaling, which maps each feature to the range [0, 1].

In [11]:
from sklearn import preprocessing

airplane_df_scaled = airplane_df.copy()  # copying dataframe
scaler = preprocessing.MinMaxScaler()    # setting up the scaling function
airplane_arr = scaler.fit_transform(airplane_df_scaled)  # fitting and transforming the dataframe

airplane_df_scaled = pd.DataFrame(airplane_arr, columns=airplane_df.columns)  # converting back to a dataframe, as scikit-learn outputs a NumPy array
airplane_df_scaled.head()

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,...,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Satisfaction,Class_Business,Class_Eco,Class_Eco Plus
0,1.0,0.0,0.012821,1.0,0.334638,0.0,1.0,0.0,0.5,0.4,...,0.75,0.5,0.75,0.4,0.03125,0.015504,1.0,0.0,1.0,0.0
1,1.0,0.0,0.320513,0.0,0.259823,0.2,0.2,0.2,0.0,0.2,...,0.25,0.25,0.25,0.2,0.007812,0.031008,0.0,1.0,0.0,0.0
2,1.0,1.0,0.230769,0.0,0.221401,0.8,0.0,0.8,0.25,0.8,...,0.75,0.5,0.75,0.8,0.015625,0.0,1.0,1.0,0.0,0.0
3,1.0,0.0,0.576923,0.0,0.048282,0.6,1.0,0.8,0.75,0.8,...,0.5,0.5,0.5,0.8,0.117188,0.093023,0.0,1.0,0.0,0.0
4,1.0,1.0,0.512821,0.0,0.272002,0.8,0.8,0.8,0.5,1.0,...,1.0,0.75,0.75,1.0,0.0,0.0,1.0,1.0,0.0,0.0
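
For reference, min-max scaling with the default range replaces each value x in a column with (x - min) / (max - min), where min and max are taken over that column. The sketch below applies the formula by hand to a hypothetical frame (the values are made up and are not from the dataset), which is one way to sanity-check scaled output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset but the values are made up
toy = pd.DataFrame({"Age": [10, 20, 40], "Flight Distance": [100, 600, 1100]})

# Manual min-max scaling, applied column by column: (x - min) / (max - min)
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(manual)
```

Every column ends up with a minimum of 0 and a maximum of 1, so binary indicator columns are left unchanged apart from becoming floats.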


### Summary & conclusion: <a name="sum"></a>

### References: <a name="ref"></a>

In [69]:
"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656082/" # Discretization    

'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2656082/'