*Author's comment: If you liked my work, please don't forget to upvote.*

# Introduction
Previously, in part one of this project, factors affecting passengers’ overall satisfaction were examined through EDA and data visualisation. It revealed that passengers who were travelling for non-business purposes found the service poor. However, if faced with a particularly difficult stakeholder, more evidence may be needed to make the case for improvements. As such, in this project, the data will be modelled to assess how predictable passengers’ overall satisfaction is from their profile and their experience of various services. The purpose is to further emphasise that satisfaction outcomes are predictable and not random. 

In [1]:
# libraries for data handling
import numpy as np 
import pandas as pd 

# kaggle specific library for reading data
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# libraries for data visualisation 
import matplotlib.pyplot as plt
import seaborn as sns

#set style of graphs
sns.set_style("whitegrid")


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

/kaggle/input/airline-passenger-satisfaction/train.csv
/kaggle/input/airline-passenger-satisfaction/test.csv


# Overview

Below is the data following the transformations undertaken in the EDA. Some aspects of the coding have been altered for efficiency and the data visualisations have been removed. 


In [2]:
# import data
train = pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv")

# show top 5 rows
train.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [3]:
# drop unnamed column
train.drop("Unnamed: 0", axis = 1, inplace = True)

# create dummy variables (drop_first=True keeps one 0/1 column per binary variable)
satisfaction_dummy = pd.get_dummies(train["satisfaction"], drop_first = True)  # -> 'satisfied'
Gender_dummy = pd.get_dummies(train["Gender"], drop_first = True)  # -> 'Male'
Customer_dummy = pd.get_dummies(train["Customer Type"], drop_first = True)  # -> 'disloyal Customer'
Type_dummy = pd.get_dummies(train["Type of Travel"], drop_first = True)  # -> 'Personal Travel'

# append the dummy columns to the original data
train = pd.concat([train, satisfaction_dummy, Gender_dummy, Customer_dummy, Type_dummy], axis=1)

In [4]:
# library for encoding variables
from sklearn.preprocessing import LabelEncoder 

In [5]:
# encode class variable
le = LabelEncoder()
train['Class_code'] = le.fit_transform(train['Class'])
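
One thing worth checking here is that `LabelEncoder` numbers the labels in sorted order, not in any service-tier order. The sketch below uses a made-up four-row frame (`demo`, `le_demo`, `label_to_code` and `class_order` are hypothetical names, and the tier ordering Eco < Eco Plus < Business is an assumption) to show the codes it produces next to an explicit mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data for illustration only, not taken from train.csv
demo = pd.DataFrame({"Class": ["Eco Plus", "Business", "Eco", "Business"]})

# LabelEncoder sorts the labels before numbering them
le_demo = LabelEncoder()
demo["Class_code"] = le_demo.fit_transform(demo["Class"])
label_to_code = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(label_to_code)

# assumed tier ordering, written out explicitly instead of relying on sorting
class_order = {"Eco": 0, "Eco Plus": 1, "Business": 2}
demo["Class_ordinal"] = demo["Class"].map(class_order)
print(demo)
```

Tree-based models are largely indifferent to which integer each class receives, but distance-based and linear models treat the codes as magnitudes, so the choice of ordering may matter more for KNN and logistic regression.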

As the first part of the project highlighted, the relationship between passengers’ ratings of the services and their overall satisfaction was clear when the ratings were examined in their entirety. As such, in this section of the project, each model will be run twice: once where the services are evaluated individually and once where the total score, expressed as a percentage, is evaluated. 

In [6]:
# single out variables which are related to service scores
service_scores = train[['Inflight wifi service','Departure/Arrival time convenient', 
                              'Ease of Online booking','Gate location', 'Food and drink', 
                              'Online boarding', 'Seat comfort','Inflight entertainment', 'On-board service', 'Leg room service',
                              'Baggage handling', 'Checkin service', 'Inflight service','Cleanliness']]

In [7]:
# calculate sum of scores
train["Total_score"] = service_scores.sum(axis = 1)

# find max possible score 
max_score = len(service_scores.columns)*5

# convert sum of scores to percentage
train["Total_score_percent"] = round((train["Total_score"]/max_score)*100,1)

In [8]:
#round departure delay to hours
train["Departure Delay in hour"]= round(train["Departure Delay in Minutes"]/60,1)

In [9]:
# update null values to 0 
train['Arrival Delay in Minutes'] = train['Arrival Delay in Minutes'].fillna(0)

In [10]:
#round arrival delay to hours
train["Arrival Delay in hour"]= round(train["Arrival Delay in Minutes"]/60,1)

In [11]:
# rearrange values
train = train[['id', 'Gender','Male', 'Customer Type','disloyal Customer', 'Age', 'Type of Travel','Personal Travel', 'Class',
       'Class_code','Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Total_score',
       'Total_score_percent', 'Departure Delay in Minutes','Departure Delay in hour', 'Arrival Delay in Minutes',
               'Arrival Delay in hour', 'satisfaction', 'satisfied', ]]

In [12]:
# reconfirm variables 
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 33 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 103904 non-null  int64  
 1   Gender                             103904 non-null  object 
 2   Male                               103904 non-null  uint8  
 3   Customer Type                      103904 non-null  object 
 4   disloyal Customer                  103904 non-null  uint8  
 5   Age                                103904 non-null  int64  
 6   Type of Travel                     103904 non-null  object 
 7   Personal Travel                    103904 non-null  uint8  
 8   Class                              103904 non-null  object 
 9   Class_code                         103904 non-null  int64  
 10  Flight Distance                    103904 non-null  int64  
 11  Inflight wifi service              1039

In [13]:
# full list of variables
X1 = train[['Male', 'disloyal Customer', 'Age', 'Personal Travel', 'Class_code','Flight Distance', 
       'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in hour','Arrival Delay in hour',]]
Y1 = train['satisfied']

In [14]:
# condensed list of variable
X2 = train[['Male', 'disloyal Customer', 'Age', 'Personal Travel', 'Class_code','Flight Distance', 
       'Total_score_percent', 'Departure Delay in hour','Arrival Delay in hour',]]
Y2 = train['satisfied']

# Modelling 
Several classification models can be used to model the data, and different models will yield different accuracies. For this project, four models will mainly be used:
* random forest
* logistic regression
* KNN
* decision trees

To ensure that the results do not depend on one particular split of the training dataset, each model will be evaluated with 10-fold cross-validation.  
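
As a rough illustration of what `cross_val_score` does for a classifier, the sketch below repeats the procedure by hand with `StratifiedKFold`, which is the splitter scikit-learn uses when `cv` is an integer and the estimator is a classifier. The data come from `make_classification` and are a synthetic stand-in, not the airline dataset; `X_demo`, `y_demo`, `manual_scores` and `helper_scores` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in data, not the airline dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

model = LogisticRegression(solver="liblinear")
folds = StratifiedKFold(n_splits=10)

# manual loop: fit on nine folds, score accuracy on the held-out fold
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# the helper used throughout this notebook wraps the same loop in one call
helper_scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X_demo, y_demo, cv=folds
)

print(np.round(manual_scores, 3))
print(np.round(helper_scores, 3))
```

Each of the ten scores is the accuracy on a fold the model never saw during fitting, which is why the average is a more trustworthy estimate than the accuracy on a single split.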


## Logistic Regression

In [15]:
Log_total_score1 = cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X1, Y1, cv=10)
print('Cross-Validation Accuracy Scores -', Log_total_score1)
print('Cross-Validation Accuracy Scores AVG -', Log_total_score1.mean())

Cross-Validation Accuracy Scores - [0.87498797 0.87691271 0.87036859 0.87508421 0.8746872  0.87266603
 0.87507218 0.87449471 0.88036574 0.880077  ]
Cross-Validation Accuracy Scores AVG - 0.8754716327865353


In [16]:
Log_total_score2 = cross_val_score(LogisticRegression(solver='liblinear',multi_class='ovr'), X2, Y2, cv=10)
print('Cross-Validation Accuracy Scores -', Log_total_score2)
print('Cross-Validation Accuracy Scores AVG -', Log_total_score2.mean())

Cross-Validation Accuracy Scores - [0.84284477 0.84553941 0.8437109  0.84351843 0.83849856 0.83974976
 0.84205967 0.84282964 0.85033686 0.85553417]
Cross-Validation Accuracy Scores AVG - 0.844462217386798


## Random Forest

In [17]:
RFC1 = cross_val_score(RandomForestClassifier(n_estimators=40), X1, Y1,cv=10)
print('Cross-Validation Accuracy Scores -', RFC1)
print('Cross-Validation Accuracy Scores AVG -', RFC1.mean())

Cross-Validation Accuracy Scores - [0.96015783 0.96131267 0.96102396 0.96458474 0.96217517 0.9599615
 0.96304139 0.95957652 0.96400385 0.96448508]
Cross-Validation Accuracy Scores AVG - 0.9620322706525201


In [18]:
RFC2 = cross_val_score(RandomForestClassifier(n_estimators=40), X2, Y2,cv=10)
print('Cross-Validation Accuracy Scores -', RFC2)
print('Cross-Validation Accuracy Scores AVG -', RFC2.mean())

Cross-Validation Accuracy Scores - [0.84582812 0.84274853 0.84996632 0.85006255 0.84359962 0.84196343
 0.84677575 0.85033686 0.85370549 0.85803657]
Cross-Validation Accuracy Scores AVG - 0.8483023233347063


## KNN

In [19]:
# scale variables
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X1)

X1_scaled = scaler.transform(X1)

scaler = StandardScaler()
scaler.fit(X2)

X2_scaled = scaler.transform(X2)
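
One caveat with scaling before cross-validation is that the scaler sees every row, including the rows later held out in each fold. A possible alternative, sketched below on synthetic data, is to place the scaler and the classifier in a pipeline so that the scaling is refitted inside each fold. `X_demo`, `y_demo`, `knn_pipeline` and `pipeline_scores` are hypothetical names, and this is not the approach taken in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data, not the airline dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=1)

# the scaler is fitted on the training folds only, then applied to the held-out fold
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
pipeline_scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=10)

print(pipeline_scores.mean())
```

With a dataset as large as this one, the difference between the two approaches is likely to be small, but the pipeline keeps the held-out folds untouched by preprocessing.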

In [20]:
KNN1 = cross_val_score(KNeighborsClassifier(n_neighbors = 2), X1_scaled, Y1,cv=10)
print('Cross-Validation Accuracy Scores -', KNN1)
print('Cross-Validation Accuracy Scores AVG -', KNN1.mean())

Cross-Validation Accuracy Scores - [0.91463767 0.90895968 0.91184679 0.91511885 0.91116458 0.91453321
 0.91231954 0.90856593 0.91530318 0.91953802]
Cross-Validation Accuracy Scores AVG - 0.9131987433783715


In [21]:
KNN2 = cross_val_score(KNeighborsClassifier(n_neighbors = 2), X2_scaled, Y2,cv=10)
print('Cross-Validation Accuracy Scores -', KNN2)
print('Cross-Validation Accuracy Scores AVG -', KNN2.mean())

Cross-Validation Accuracy Scores - [0.81031662 0.80954672 0.80791069 0.81599461 0.81000962 0.81000962
 0.81385948 0.81847931 0.81607315 0.81953802]
Cross-Validation Accuracy Scores AVG - 0.8131737847098561


## Decision Tree Classifier

In [22]:
decision_tree_result1 = cross_val_score(DecisionTreeClassifier(), X1, Y1,cv=10)
print('Cross-Validation Accuracy Scores -', decision_tree_result1)
print('Cross-Validation Accuracy Scores AVG -', decision_tree_result1.mean())

Cross-Validation Accuracy Scores - [0.94572226 0.9458185  0.94678087 0.94995669 0.94706449 0.94610202
 0.9454283  0.9479307  0.9479307  0.94744947]
Cross-Validation Accuracy Scores AVG - 0.9470183996312052


In [23]:
decision_tree_result2 = cross_val_score(DecisionTreeClassifier(), X2, Y2,cv=10)
print('Cross-Validation Accuracy Scores -', decision_tree_result2)
print('Cross-Validation Accuracy Scores AVG -', decision_tree_result2.mean())

Cross-Validation Accuracy Scores - [0.78596863 0.78683476 0.79309017 0.79309017 0.78565929 0.78152069
 0.78768046 0.79201155 0.79316651 0.79268527]
Cross-Validation Accuracy Scores AVG - 0.7891707508783837


In [24]:
# Log_limited uses the condensed-variable scores (Log_total_score2) from cell 16
models = {'Log_full':Log_total_score1,'Log_limited':Log_total_score2, "RFC_full": RFC1, 
       "RFC_limited": RFC2,"KNN_scaled_full": KNN1, "KNN_scaled_limited": KNN2, 
          "decision_tree_full": decision_tree_result1, 
          "decision_tree_limited":decision_tree_result2,}
df = pd.DataFrame(models)

Collecting the scores for both the full and the condensed sets of variables reveals the following: 

In [25]:
df1 = df.transpose()
df1["avg"]= df1.mean(axis =1)
df1.style.background_gradient(cmap ="RdPu" ,subset='avg')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,avg
Log_full,0.874988,0.876913,0.870369,0.875084,0.874687,0.872666,0.875072,0.874495,0.880366,0.880077,0.875472
Log_limited,0.842845,0.845539,0.843711,0.843518,0.838499,0.83975,0.84206,0.84283,0.850337,0.855534,0.844462
RFC_full,0.960158,0.961313,0.961024,0.964585,0.962175,0.959962,0.963041,0.959577,0.964004,0.964485,0.962032
RFC_limited,0.845828,0.842749,0.849966,0.850063,0.8436,0.841963,0.846776,0.850337,0.853705,0.858037,0.848302
KNN_scaled_full,0.914638,0.90896,0.911847,0.915119,0.911165,0.914533,0.91232,0.908566,0.915303,0.919538,0.913199
KNN_scaled_limited,0.810317,0.809547,0.807911,0.815995,0.81001,0.81001,0.813859,0.818479,0.816073,0.819538,0.813174
decision_tree_full,0.945722,0.945818,0.946781,0.949957,0.947064,0.946102,0.945428,0.947931,0.947931,0.947449,0.947018
decision_tree_limited,0.785969,0.786835,0.79309,0.79309,0.785659,0.781521,0.78768,0.792012,0.793167,0.792685,0.789171


As the table shows, even the worst-performing model (the decision tree on the condensed variables) predicted passenger satisfaction from passenger data and service ratings with over 75% accuracy, and the best (the random forest on the full set of variables) did so with about 96% accuracy. 

Given that satisfaction is subjective, it may be susceptible to randomness and noise. That said, these scores are extremely high, especially under the random forest model. Therefore, stakeholders cannot ignore that passengers’ overall satisfaction can be predicted and is not independent of their experience flying with the airline. 
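
One way to put such accuracies in context is to compare them with a baseline that always predicts the most common class. In the appendix's hold-out split, 19,418 of 34,289 passengers fall in class 0, so a baseline of that kind would score roughly 57% there. The sketch below shows the comparison on synthetic data with a similar class balance; `X_demo`, `y_demo`, `baseline_scores` and `forest_scores` are hypothetical names, and the numbers it prints say nothing about the airline data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in data with roughly a 57/43 class split
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.57, 0.43], random_state=0,
)

# baseline: always predict the majority class
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=10
)

# model: same random forest settings as used in this notebook
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=40, random_state=0), X_demo, y_demo, cv=10
)

print("baseline:", baseline_scores.mean())
print("forest:  ", forest_scores.mean())
```

The gap between the two averages, rather than the model's accuracy on its own, is what indicates how much the features actually explain.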

This is further emphasised by the results of two random forest models which focused on either passenger profiles or passenger experience, “RFC_Passenger_data” and “RFC_passenger_experience”. Whilst both had over 75% accuracy in predicting satisfaction, passenger experience was the much better predictor (about 94% against about 76%). This implies that most of the dissatisfaction does not stem from factors such as customer demography. In other words, if the airline wishes to improve customer satisfaction, marketing to attract a different demographic may help, but improving its services is likely to have the biggest impact. 

In [26]:
X3 = train[['Male', 'disloyal Customer', 'Age', 'Personal Travel', 'Class_code','Flight Distance', 
       'Departure Delay in hour','Arrival Delay in hour',]]
Y3 = train['satisfied']

In [27]:
X4 = train[['Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', ]]
Y4 = train['satisfied']

In [28]:
RFC3 = cross_val_score(RandomForestClassifier(n_estimators=40), X3, Y3,cv=10)
print('Cross-Validation Accuracy Scores -', RFC3)
print('Cross-Validation Accuracy Scores AVG -', RFC3.mean())

Cross-Validation Accuracy Scores - [0.75594264 0.76104321 0.76123568 0.75748244 0.75842156 0.75630414
 0.75274302 0.76852743 0.75582291 0.76958614]
Cross-Validation Accuracy Scores AVG - 0.75971091718985


In [29]:
RFC4 = cross_val_score(RandomForestClassifier(n_estimators=40), X4, Y4,cv=10)
print('Cross-Validation Accuracy Scores -', RFC4)
print('Cross-Validation Accuracy Scores AVG -', RFC4.mean())

Cross-Validation Accuracy Scores - [0.94418247 0.94302762 0.94341257 0.94793571 0.94552454 0.94205967
 0.94446583 0.94600577 0.94533205 0.94552454]
Cross-Validation Accuracy Scores AVG - 0.944747078360271


In [30]:
models = { "RFC_Passenger_data": RFC3, "RFC_passenger_experience": RFC4 }
df2 = pd.DataFrame(models)

df_profile_service = df2.transpose()
df_profile_service["avg"]= df_profile_service.mean(axis=1)
df_profile_service.style.background_gradient(cmap ="RdPu" ,subset='avg')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,avg
RFC_Passenger_data,0.755943,0.761043,0.761236,0.757482,0.758422,0.756304,0.752743,0.768527,0.755823,0.769586,0.759711
RFC_passenger_experience,0.944182,0.943028,0.943413,0.947936,0.945525,0.94206,0.944466,0.946006,0.945332,0.945525,0.944747
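
A complementary check, not carried out in this notebook, would be to fit one random forest on all variables and inspect its `feature_importances_`. The sketch below does this on synthetic data in which the first three columns are informative and the last three are noise; the column names and the variables `forest` and `importances` are hypothetical. On the airline data, the same idea would use `X1` and its column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data; shuffle=False keeps the informative columns first
X_demo, y_demo = make_classification(
    n_samples=800, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = ["informative_1", "informative_2", "informative_3",
                 "noise_1", "noise_2", "noise_3"]

forest = RandomForestClassifier(n_estimators=40, random_state=0)
forest.fit(X_demo, y_demo)

# impurity-based importances sum to 1; larger values mean more use in splits
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favour variables with many distinct values, so they are best read as a rough ranking and not as a precise measure of effect.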


# Conclusion 
This was part two of a two-part series examining customer satisfaction. As this part of the project highlights, passengers’ overall satisfaction could be predicted with a high degree of accuracy. This implies that their satisfaction is closely tied to their experience of the services, and thus, if the airline is keen to improve customer satisfaction, the quality of the service must be improved. 

In a workplace, projects such as these may be the first step in improving the company’s effectiveness. In most cases, stakeholders who accept the issue of low satisfaction may not require such extensive evidence and argument before improving the company’s services. However, there may be instances where stakeholders do not recognise an issue. In this case, a comprehensive argument such as this may be needed. 

# Appendix 
Below is the code for the SVC model. However, this model is not recommended here because of the size of the data and the computational power required. 


In [31]:
from sklearn.model_selection import train_test_split

In [32]:
X_train,X_test, y_train, y_test = train_test_split(X1_scaled,Y1, test_size = .33, random_state = 360)

In [33]:
model = SVC()
model.fit(X_train, y_train)

SVC()

In [34]:
predictions = model.predict(X_test)

In [35]:
from sklearn.metrics import classification_report, confusion_matrix

In [36]:
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))


[[18765   653]
 [  956 13915]]
              precision    recall  f1-score   support

           0       0.95      0.97      0.96     19418
           1       0.96      0.94      0.95     14871

    accuracy                           0.95     34289
   macro avg       0.95      0.95      0.95     34289
weighted avg       0.95      0.95      0.95     34289
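
The figures in the report can be recovered from the confusion matrix by hand. The sketch below does so for class 1 (satisfied), using the four counts printed above; the variable names are hypothetical, and the layout assumed is scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[18765, 653],
               [956, 13915]])

# for a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = (int(v) for v in cm.ravel())

accuracy = (tp + tn) / cm.sum()          # share of all predictions that are correct
precision_1 = tp / (tp + fp)             # of those predicted satisfied, share truly satisfied
recall_1 = tp / (tp + fn)                # of those truly satisfied, share predicted satisfied
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

Rounded to two decimals, these agree with the accuracy of 0.95 and the class 1 row (0.96, 0.94, 0.95) in the report.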

