# **Airline Customer Satisfaction Capstone**

## Pre-Processing & Training

With the first two stages of this project complete (Data Wrangling and Exploratory Data Analysis), we now pre-process the data and split it into training and test sets so that it can be used in the upcoming models.

## 1. Table of Contents

## 2. Import Packages

In [24]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## 3. Load Data

In [25]:
df=pd.read_csv('/Users/lauren/Desktop/airline_data_cleaned2.csv')

## 4. Explore the Data

In [26]:
df.head()

,Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,0,1,Male,48,First-time,Business,Business,821,2,5.0,...,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
1,1,2,Female,35,Returning,Business,Business,821,26,39.0,...,5,4,5,5,3,5,2,5,5,Satisfied
2,2,3,Male,41,Returning,Business,Business,853,0,0.0,...,3,5,3,5,5,3,4,3,3,Satisfied
3,3,4,Male,50,Returning,Business,Business,1905,0,0.0,...,5,5,5,4,4,5,2,5,5,Satisfied
4,4,5,Female,49,Returning,Business,Business,3470,0,1.0,...,3,4,4,5,4,3,3,3,3,Satisfied


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 25 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Unnamed: 0                              129880 non-null  int64  
 1   ID                                      129880 non-null  int64  
 2   Gender                                  129880 non-null  object 
 3   Age                                     129880 non-null  int64  
 4   Customer Type                           129880 non-null  object 
 5   Type of Travel                          129880 non-null  object 
 6   Class                                   129880 non-null  object 
 7   Flight Distance                         129880 non-null  int64  
 8   Departure Delay                         129880 non-null  int64  
 9   Arrival Delay                           129487 non-null  float64
 10  Departure and Arrival Time Convenience  1298

In [28]:
df.shape

(129880, 25)

In [29]:
df.dtypes

Unnamed: 0                                  int64
ID                                          int64
Gender                                     object
Age                                         int64
Customer Type                              object
Type of Travel                             object
Class                                      object
Flight Distance                             int64
Departure Delay                             int64
Arrival Delay                             float64
Departure and Arrival Time Convenience      int64
Ease of Online Booking                      int64
Check-in Service                            int64
Online Boarding                             int64
Gate Location                               int64
On-board Service                            int64
Seat Comfort                                int64
Leg Room Service                            int64
Cleanliness                                 int64
Food and Drink                              int64


In [30]:
df.isnull().sum()

Unnamed: 0                                  0
ID                                          0
Gender                                      0
Age                                         0
Customer Type                               0
Type of Travel                              0
Class                                       0
Flight Distance                             0
Departure Delay                             0
Arrival Delay                             393
Departure and Arrival Time Convenience      0
Ease of Online Booking                      0
Check-in Service                            0
Online Boarding                             0
Gate Location                               0
On-board Service                            0
Seat Comfort                                0
Leg Room Service                            0
Cleanliness                                 0
Food and Drink                              0
In-flight Service                           0
In-flight Wifi Service            

In [31]:
df=df.dropna(how='any')
df.isnull().sum()

Unnamed: 0                                0
ID                                        0
Gender                                    0
Age                                       0
Customer Type                             0
Type of Travel                            0
Class                                     0
Flight Distance                           0
Departure Delay                           0
Arrival Delay                             0
Departure and Arrival Time Convenience    0
Ease of Online Booking                    0
Check-in Service                          0
Online Boarding                           0
Gate Location                             0
On-board Service                          0
Seat Comfort                              0
Leg Room Service                          0
Cleanliness                               0
Food and Drink                            0
In-flight Service                         0
In-flight Wifi Service                    0
In-flight Entertainment         

In [32]:
df = df.drop(['Unnamed: 0', 'ID'], axis = 1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129487 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Gender                                  129487 non-null  object 
 1   Age                                     129487 non-null  int64  
 2   Customer Type                           129487 non-null  object 
 3   Type of Travel                          129487 non-null  object 
 4   Class                                   129487 non-null  object 
 5   Flight Distance                         129487 non-null  int64  
 6   Departure Delay                         129487 non-null  int64  
 7   Arrival Delay                           129487 non-null  float64
 8   Departure and Arrival Time Convenience  129487 non-null  int64  
 9   Ease of Online Booking                  129487 non-null  int64  
 10  Check-in Service                        1294

## 5. Dummy Variables

To use categorical variables in a machine learning model, we first need to represent them numerically. We do that with dummy variables.

There are five categorical columns in our data: Gender, Customer Type, Type of Travel, Class, and Satisfaction. Each of these columns needs to be converted to dummy variables. Satisfaction is the target we want to predict; the other four are features.
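The encoding step can be illustrated on a small toy frame. This is a minimal sketch: the three rows are hypothetical, and only the column names are taken from the data above. It shows that `pd.get_dummies` creates one indicator column per category level, named `<column>_<level>`.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names mirror the notebook's data.
toy = pd.DataFrame({
    "Age": [48, 35, 41],
    "Class": ["Business", "Economy", "Economy Plus"],
    "Satisfaction": ["Neutral or Dissatisfied", "Satisfied", "Satisfied"],
})

# One indicator column per category level; non-encoded columns are kept as-is.
toy_encoded = pd.get_dummies(toy, columns=["Class", "Satisfaction"])
print(list(toy_encoded.columns))
```

Note that the two Satisfaction indicator columns are complements of each other, so a single one of them is enough to serve as a binary target.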

In [33]:
airline_data_encoded = pd.get_dummies(df, columns=['Gender', 'Customer Type', 'Type of Travel', 'Class', 'Satisfaction'])

In [34]:
print(airline_data_encoded.columns)

Index(['Age', 'Flight Distance', 'Departure Delay', 'Arrival Delay',
       'Departure and Arrival Time Convenience', 'Ease of Online Booking',
       'Check-in Service', 'Online Boarding', 'Gate Location',
       'On-board Service', 'Seat Comfort', 'Leg Room Service', 'Cleanliness',
       'Food and Drink', 'In-flight Service', 'In-flight Wifi Service',
       'In-flight Entertainment', 'Baggage Handling', 'Gender_Female',
       'Gender_Male', 'Customer Type_First-time', 'Customer Type_Returning',
       'Type of Travel_Business', 'Type of Travel_Personal', 'Class_Business',
       'Class_Economy', 'Class_Economy Plus',
       'Satisfaction_Neutral or Dissatisfied', 'Satisfaction_Satisfied'],
      dtype='object')


In [35]:
airline_data_encoded.head()

,Age,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,...,Gender_Male,Customer Type_First-time,Customer Type_Returning,Type of Travel_Business,Type of Travel_Personal,Class_Business,Class_Economy,Class_Economy Plus,Satisfaction_Neutral or Dissatisfied,Satisfaction_Satisfied
0,48,821,2,5.0,3,3,4,3,3,3,...,1,1,0,1,0,1,0,0,1,0
1,35,821,26,39.0,2,2,3,5,2,5,...,0,0,1,1,0,1,0,0,0,1
2,41,853,0,0.0,4,4,4,5,4,3,...,1,0,1,1,0,1,0,0,0,1
3,50,1905,0,0.0,2,2,3,4,2,5,...,1,0,1,1,0,1,0,0,0,1
4,49,3470,0,1.0,3,3,3,5,3,3,...,0,0,1,1,0,1,0,0,0,1


In [36]:
df.columns

Index(['Gender', 'Age', 'Customer Type', 'Type of Travel', 'Class',
       'Flight Distance', 'Departure Delay', 'Arrival Delay',
       'Departure and Arrival Time Convenience', 'Ease of Online Booking',
       'Check-in Service', 'Online Boarding', 'Gate Location',
       'On-board Service', 'Seat Comfort', 'Leg Room Service', 'Cleanliness',
       'Food and Drink', 'In-flight Service', 'In-flight Wifi Service',
       'In-flight Entertainment', 'Baggage Handling', 'Satisfaction'],
      dtype='object')

In [37]:
airline_data_encoded

,Age,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,...,Gender_Male,Customer Type_First-time,Customer Type_Returning,Type of Travel_Business,Type of Travel_Personal,Class_Business,Class_Economy,Class_Economy Plus,Satisfaction_Neutral or Dissatisfied,Satisfaction_Satisfied
0,48,821,2,5.0,3,3,4,3,3,3,...,1,1,0,1,0,1,0,0,1,0
1,35,821,26,39.0,2,2,3,5,2,5,...,0,0,1,1,0,1,0,0,0,1
2,41,853,0,0.0,4,4,4,5,4,3,...,1,0,1,1,0,1,0,0,0,1
3,50,1905,0,0.0,2,2,3,4,2,5,...,1,0,1,1,0,1,0,0,0,1
4,49,3470,0,1.0,3,3,3,5,3,3,...,0,0,1,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,28,447,2,3.0,4,4,4,4,2,5,...,1,0,1,0,1,0,0,1,1,0
129876,41,308,0,0.0,5,3,5,3,4,5,...,1,0,1,0,1,0,0,1,1,0
129877,42,337,6,14.0,5,2,4,2,1,3,...,1,0,1,0,1,0,0,1,1,0
129878,50,337,31,22.0,4,4,3,4,1,4,...,1,0,1,0,1,0,0,1,0,1


## 6. Train/Test Split

We need to split the data into a training set and a testing set.

First, let's determine the partition sizes for a 70/30 train/test split:

In [38]:
len(airline_data_encoded) * .7, len(airline_data_encoded) * .3

(90640.9, 38846.1)

In [40]:
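# Reset the index so that row labels are contiguous after dropna.
# Note: reset_index() without drop=True keeps the old index as a regular column
# (visible as 'index' and 'level_0' in the X_train output below); drop=True avoids this.
airline_data_encoded = airline_data_encoded.reset_index()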
X_train, X_test, y_train, y_test = train_test_split(airline_data_encoded.drop(columns = ['Satisfaction_Neutral or Dissatisfied', 'Satisfaction_Satisfied']),
                                                    airline_data_encoded.Satisfaction_Satisfied,
                                                    test_size = 0.3,
                                                    random_state = 47)

In [41]:
X_train.shape, X_test.shape

((90640, 29), (38847, 29))

In [42]:
y_train.shape, y_test.shape

((90640,), (38847,))
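The partition sizes above are whole numbers even though 70% and 30% of 129487 are not. The following sketch reproduces them with plain arithmetic. It assumes that scikit-learn rounds a fractional `test_size` up and gives the remaining rows to the training set, which agrees with the shapes reported above.

```python
import math

n_samples = 129487  # rows remaining after dropna, per df.info() above
test_size = 0.3

# Assumed rounding rule: test partition rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```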

In [43]:
# Inspect each partition: contents, type, shape, and length.
print("\n")
print("X_train:", X_train, type(X_train), X_train.shape, len(X_train))

print("\n")
print("X_test", X_test, type(X_test), X_test.shape, len(X_test))

print("\n")
print("y_train", y_train, type(y_train), y_train.shape, len(y_train))

print("\n")
print("y_test", y_test, type(y_test), y_test.shape, len(y_test))






X_train:         level_0   index  Age  Flight Distance  Departure Delay  Arrival Delay  \
71316     71316   71562   39             3867                5            0.0   
47965     47965   48151   43              259                0            0.0   
33348     33348   33471   18              896                0           21.0   
93037     93037   93328   29             1024               83           65.0   
104200   104200  104527   26             1047                0            0.0   
...         ...     ...  ...              ...              ...            ...   
23112     23112   23204   61              223               23           18.0   
11528     11528   11568   22              771              150          130.0   
112967   112967  113317   41             2876               22           20.0   
51078     51078   51275   24              109                0            5.0   
103559   103559  103883   14              882                0            0.0   

        Departur

In [45]:
# Construct the LogisticRegression model
clf = LogisticRegression()

# Fit the model on the training data.
clf.fit(X_train, y_train) 

# Print the accuracy from the testing data.
# Introduce variable to be reused later
y_predict_test = clf.predict(X_test)
print("\n")
print("[Test] Accuracy score (y_predict_test, y_test):",accuracy_score(y_predict_test, y_test))

# Note the order in which the parameters must be passed
# according to the documentation ... although there should be
# no difference since it is a one-to-one comparison ...
# ref: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
print("\n")
print("[Test] Accuracy score: (y_test, y_predict_test)",accuracy_score(y_test, y_predict_test))

y_predict_training = clf.predict(X_train)
print("\n")
print("[Training] Accuracy score: (y_train, y_predict_training)",accuracy_score(y_train, y_predict_training))



[Test] Accuracy score (y_predict_test, y_test): 0.6477205447010065


[Test] Accuracy score: (y_test, y_predict_test) 0.6477205447010065


[Training] Accuracy score: (y_train, y_predict_training) 0.6457634598411297


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


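The training accuracy (about 0.646) and the test accuracy (about 0.648) are nearly identical, which indicates low variance: the model is not overfitting the training set. Both scores are low, however, which points to high bias, meaning the model underfits the data. The solver also stopped at its iteration limit, as the warning above shows, so scaling the features or raising max_iter may improve the fit.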
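One way to act on the convergence warning is to standardize the features inside a pipeline before fitting. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the airline dataset, the rescaled first feature is a stand-in for a wide-ranging column such as Flight Distance, and `max_iter=1000` is an arbitrary choice. It compares training and test accuracy in the same way as the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the airline dataset).
X, y = make_classification(n_samples=2000, n_features=10, random_state=47)
# Put one feature on a much larger scale than the others.
X[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=47)

# The scaler is fitted on the training data only, then applied to both sets.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={abs(train_acc - test_acc):.3f}")
```

A small gap between the two scores suggests low variance, while the absolute level of the scores indicates how much bias remains. Whether scaling helps on the airline data would have to be checked by rerunning the notebook.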