In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive')
import numpy as np
from pylab import rcParams
import statsmodels.api as sm
import itertools
import seaborn as sns
import plotly.express as px

Mounted at /content/drive


In [2]:
train = pd.read_csv('/content/drive/My Drive/Colab Notebooks/archive/train.csv')
test = pd.read_csv('/content/drive/My Drive/Colab Notebooks/archive/test.csv')

In [3]:
train.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [4]:
train.shape

(103904, 25)

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

There are 310 missing values in the Arrival Delay in Minutes column. They are handled by replacing them with the column mean.

In [6]:
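# Assign the filled column back; inplace=True on a column selection may not update train in recent pandas versions
train['Arrival Delay in Minutes'] = train['Arrival Delay in Minutes'].fillna(train['Arrival Delay in Minutes'].mean())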
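The missing-value count and the effect of the fill can be checked directly. The sketch below uses a small made-up frame (`toy` is a stand-in, not the real dataset) and also computes the median, since the mean of a right-skewed column such as a delay is pulled up by a few large values. Filling with the median would be an alternative; it is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration).
col = "Arrival Delay in Minutes"
toy = pd.DataFrame({col: [0.0, 5.0, np.nan, 10.0, 185.0, np.nan]})

n_missing = int(toy[col].isna().sum())   # number of NaN entries
mean_value = toy[col].mean()             # NaN values are skipped by default
median_value = toy[col].median()         # less sensitive to the one large delay

# Assign the result back rather than using inplace=True on a column selection.
toy[col] = toy[col].fillna(mean_value)

print(n_missing, mean_value, median_value)
```

On this toy column the mean is much larger than the median, which is the kind of gap worth looking at before choosing a fill value.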

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

In [8]:
train.duplicated().sum()

0

Removing unnecessary columns

In [9]:
train.drop(['id','Unnamed: 0'],axis=1,inplace=True)

# Distribution of numeric columns

In [10]:
fig = px.histogram(train, x="Age",width=700, height=400)
fig.show()

The histogram above shows the distribution of the Age column. The data appear to be approximately normally distributed, with comparatively few people aged 70 and above.

In [11]:
fig = px.histogram(train, x="Flight Distance",width=700, height=400)
fig.show()

The distribution of Flight Distance is heavily right-skewed, which shows that most flights cover shorter distances rather than long ones. There are some observations with a flight distance of more than 4000, but they are few compared to the number of observations below 4000. This suggests that these values may be outliers.
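The outlier judgement above is visual. One common way to make it explicit is Tukey's 1.5 × IQR rule, sketched below on made-up distances. This rule is swapped in for illustration; the notebook itself does not use it, and for a strongly skewed column it can flag many legitimate long flights.

```python
import pandas as pd

# Made-up flight distances for illustration only.
distance = pd.Series([200, 350, 500, 650, 800, 1000, 1200, 4800], name="Flight Distance")

q1 = distance.quantile(0.25)
q3 = distance.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as potential outliers

flagged = distance[distance > upper_fence]
print(upper_fence, flagged.tolist())
```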

In [12]:
fig = px.histogram(train, x="Arrival Delay in Minutes",width=600, height=300)
fig.show()

In [13]:
fig = px.histogram(train, x="Departure Delay in Minutes",width=600, height=300)
fig.show()

In [14]:
fig = px.scatter(train, x="Arrival Delay in Minutes",y = 'Departure Delay in Minutes',width=600, height=300)
fig.show()

The two histograms above show the distributions of arrival delay and departure delay. Both suggest that most flights are delayed by less than 100 minutes, with only a very small number delayed for longer than that. In the scatter plot, most values are concentrated in the range 0 to 550, and values above 550 on either axis become widely scattered. This suggests that these observations are outliers, so they will be removed. Even after removing the values above 550, however, the distribution remains extremely right-skewed. This is likely due to the nature of the problem: most flights are minimally delayed, with a few exceptions that are delayed for extreme periods.

The scatter plot also suggests a linear relationship between departure and arrival delay. This makes sense: when departure is delayed, arrival will naturally be delayed by a similar amount, since the flight time stays roughly the same.
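The removal step and the two claims that follow it (skew remains, delays are linearly related) can be sketched as below. The frame `toy` and its values are made up; the 550-minute cut-off is the one read off the scatter plot in the text. Pearson correlation is used here as a rough measure of linearity, which is an assumption about how one might quantify the relationship.

```python
import pandas as pd

# Made-up delays (minutes) for illustration; the last row mimics an extreme case.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 0, 5, 12, 30, 90, 900],
    "Arrival Delay in Minutes":   [0, 2, 4, 15, 28, 95, 880],
})

threshold = 550  # cut-off taken from the text
mask = (
    (toy["Departure Delay in Minutes"] <= threshold)
    & (toy["Arrival Delay in Minutes"] <= threshold)
)
trimmed = toy[mask]

# Positive skewness means a long right tail remains after trimming.
skew_after = trimmed["Arrival Delay in Minutes"].skew()
# Pearson correlation between the two delay columns.
corr = trimmed["Departure Delay in Minutes"].corr(trimmed["Arrival Delay in Minutes"])

print(len(trimmed), skew_after, corr)
```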

The remaining columns (inflight wifi service, departure/arrival time convenience, ease of online booking, gate location, food and drink, online boarding, seat comfort, inflight entertainment, on-board service, leg room service, baggage handling, check-in service, inflight service and cleanliness) are ratings on a scale from 0 to 5. Because these values are discrete, it is difficult to identify outliers in them.
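Although outliers are hard to define for discrete ratings, it is still possible to confirm that every rating falls inside the expected 0 to 5 range and to look at how often each score occurs. A minimal sketch on a made-up frame, using two of the rating column names from the dataset:

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    "Inflight wifi service": [3, 0, 5, 2, 3],
    "Cleanliness": [5, 1, 5, 2, 3],
})
rating_cols = ["Inflight wifi service", "Cleanliness"]

# True for a column when all of its values lie in [0, 5].
in_range = toy[rating_cols].apply(lambda s: s.between(0, 5).all())

# Frequency of each score for one column, ordered by score.
counts = toy["Cleanliness"].value_counts().sort_index()

print(in_range)
print(counts)
```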





# Visualizing Categorical Data 

In [15]:
group_gender = train.groupby(by='Gender', as_index= False).agg(count_gender = ('Customer Type', 'count'))
fig = px.bar(group_gender,x= 'Gender', y = 'count_gender', height=400, width = 600)
fig.show()

The bar chart shows that this dataset is evenly distributed between male and female customers. 

In [16]:
group_cs_type = train.groupby(by='Customer Type', as_index= False).agg(customer_type = ('Customer Type', 'count'))
fig = px.bar(group_cs_type,x= 'Customer Type', y = 'customer_type', height=400, width = 600)
fig.show()

The bar chart above shows that there are more loyal customers than disloyal customers. This suggests the airline company should focus on retaining its loyal customers.

In [17]:
group_class = train.groupby(by='Class', as_index= False).agg(class_count = ('Class', 'count'))
fig = px.bar(group_class,x= 'Class', y = 'class_count', height=400, width = 600)
fig.show()

The bar chart above shows a similar number of customers in economy class and business class, but far fewer in eco plus class than in the other two. The airline company should look into why there is such a difference between the classes, considering aspects such as price and the benefits provided.

In [18]:
group_satisfaction = train.groupby('satisfaction', as_index=False).agg(satisfaction_count = ('satisfaction', 'count'))
fig = px.pie(group_satisfaction, values='satisfaction_count', names='satisfaction', width = 500, height = 300)
fig.show()

Lastly, the pie chart shows that satisfied and dissatisfied customers are roughly balanced, with slightly more neutral or dissatisfied customers. Because neither class dominates, a model trained on this data is less likely to be biased towards one class.
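The class balance read off the pie chart can also be expressed as numbers. The sketch below uses a made-up label column (the counts are illustrative, not the real proportions) and `value_counts(normalize=True)` to get the share of each class.

```python
import pandas as pd

# Made-up labels for illustration: 11 dissatisfied, 9 satisfied.
satisfaction = pd.Series(
    ["neutral or dissatisfied"] * 11 + ["satisfied"] * 9,
    name="satisfaction",
)

# Share of each class, sorted from most to least frequent.
shares = satisfaction.value_counts(normalize=True)

# Ratio of the largest to the smallest class; values near 1 indicate balance.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(imbalance_ratio)
```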

# Multivariate Analysis

**Who are most dissatisfied?**

In [21]:
grp_satisfaction_age = train.groupby(by=['satisfaction', 'Age'], as_index=False).agg(count = ('Age', 'count'))
fig = px.line(grp_satisfaction_age, x="Age", y="count", color="satisfaction",  width = 800, height = 500)
fig.show()

There are two spikes for the neutral or dissatisfied group, in the age ranges 22 to 27 and 36 to 39, while most satisfied customers are in the age range 39 to 60.

In [22]:
grp_customer_type = train.groupby(by=['Customer Type', 'Age'], as_index=False).agg(count = ('Age', 'count'))
fig = px.line(grp_customer_type, x="Age", y="count", color="Customer Type", width = 800, height = 500)
fig.show()

The second graph shows that disloyal customers are concentrated in the age range 20 to 27, while loyal customers peak between ages 39 and 60. This pattern is similar to the one for customer satisfaction. It suggests that dissatisfied customers tend not to be loyal customers, which is plausible, as customers would rather return to an airline that gave them a satisfying flight experience.
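The link between loyalty and satisfaction is inferred here from two age curves that look alike. A more direct check is a cross-tabulation of the two columns. The sketch below uses made-up rows, so the shares it prints say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Customer Type": ["Loyal Customer"] * 5 + ["disloyal Customer"] * 4,
    "satisfaction": [
        "satisfied", "satisfied", "satisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "satisfied",
    ],
})

# Each row sums to 1: the satisfaction split within each customer type.
table = pd.crosstab(toy["Customer Type"], toy["satisfaction"], normalize="index")
print(table)
```

If the dissatisfied share is clearly higher in the disloyal row of the real data, that supports the interpretation above without relying on age as an intermediary.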

The airline company should target customers in the age range 20 to 27 who fly economy class, and offer them extra promotions and amenities that would improve their flight experience and increase their satisfaction. This could include complimentary meals or drinks, more comfortable seating, or in-flight entertainment options.

In addition, the airline company should look after customers in the age range 39 to 60 who fly business class in order to retain their loyalty. This can be done by continuing to provide excellent service to this group and by offering special privileges in return for their loyalty to the company.

# What are the inflight factors affecting flight satisfaction?

Several metrics relate to the inflight experience: flight distance, inflight wifi service, food and drink, seat comfort, inflight entertainment, on-board service, leg room service, inflight service and cleanliness.

In [26]:
fig = px.box(train.sort_values(by='Class'), x="Class", y="Flight Distance", width = 800, height = 500)
fig.show()

From this boxplot, we can see that most economy class and economy plus class flyers travel shorter distances, while business class flyers have a higher median flight distance.
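The central line of each box is the median rather than the mean, so a per-class summary makes the comparison explicit. The sketch below uses made-up distances; the class label `Eco` for economy class is an assumption, since only `Eco Plus` and `Business` are visible in the preview above.

```python
import pandas as pd

# Made-up rows for illustration only; "Eco" is an assumed label.
toy = pd.DataFrame({
    "Class": ["Business", "Business", "Business", "Eco", "Eco", "Eco", "Eco Plus", "Eco Plus"],
    "Flight Distance": [500, 2000, 3500, 300, 600, 1500, 400, 800],
})

# Median and mean flight distance per class.
summary = toy.groupby("Class")["Flight Distance"].agg(["median", "mean"])
print(summary)
```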