# **Airline Customer Satisfaction Capstone**

## Pre-Processing & Training

Now that the first two stages of this project (Data Wrangling and Exploratory Data Analysis) are complete, it is time to pre-process the data and split it into training and test sets so that it can be used in the upcoming models.

## 1. Table of Contents

## 2. Import Packages

In [41]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from sklearn.linear_model import LogisticRegression

## 3. Load Data

In [42]:
airline_data_cleaned2=pd.read_csv('/Users/lauren/Desktop/airline_data_cleaned2.csv')

## 4. Explore the Data

In [3]:
airline_data_cleaned2.head()

Unnamed: 0.1,Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,0,1,Male,48,First-time,Business,Business,821,2,5.0,...,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
1,1,2,Female,35,Returning,Business,Business,821,26,39.0,...,5,4,5,5,3,5,2,5,5,Satisfied
2,2,3,Male,41,Returning,Business,Business,853,0,0.0,...,3,5,3,5,5,3,4,3,3,Satisfied
3,3,4,Male,50,Returning,Business,Business,1905,0,0.0,...,5,5,5,4,4,5,2,5,5,Satisfied
4,4,5,Female,49,Returning,Business,Business,3470,0,1.0,...,3,4,4,5,4,3,3,3,3,Satisfied


In [4]:
airline_data_cleaned2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 25 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Unnamed: 0                              129880 non-null  int64  
 1   ID                                      129880 non-null  int64  
 2   Gender                                  129880 non-null  object 
 3   Age                                     129880 non-null  int64  
 4   Customer Type                           129880 non-null  object 
 5   Type of Travel                          129880 non-null  object 
 6   Class                                   129880 non-null  object 
 7   Flight Distance                         129880 non-null  int64  
 8   Departure Delay                         129880 non-null  int64  
 9   Arrival Delay                           129487 non-null  float64
 10  Departure and Arrival Time Convenience  1298

## 5. Dummy Variables

To use categorical variables in a machine learning model, we first need to represent them in a quantitative way.  In order to do that, we use dummy variables.

There are five categorical columns in our data: Gender, Customer Type, Type of Travel, Class, and Satisfaction.  Each of these columns will need to be converted to dummy variables.

In [28]:
airline_data_encoded = pd.get_dummies(airline_data_cleaned2, columns=['Gender', 'Customer Type', 'Type of Travel', 'Class', 'Satisfaction'])

In [30]:
print(airline_data_encoded.columns)

Index(['Unnamed: 0', 'ID', 'Age', 'Flight Distance', 'Departure Delay',
       'Arrival Delay', 'Departure and Arrival Time Convenience',
       'Ease of Online Booking', 'Check-in Service', 'Online Boarding',
       'Gate Location', 'On-board Service', 'Seat Comfort', 'Leg Room Service',
       'Cleanliness', 'Food and Drink', 'In-flight Service',
       'In-flight Wifi Service', 'In-flight Entertainment', 'Baggage Handling',
       'Gender_Female', 'Gender_Male', 'Customer Type_First-time',
       'Customer Type_Returning', 'Type of Travel_Business',
       'Type of Travel_Personal', 'Class_Business', 'Class_Economy',
       'Class_Economy Plus', 'Satisfaction_Neutral or Dissatisfied',
       'Satisfaction_Satisfied'],
      dtype='object')


In [31]:
airline_data_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 31 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Unnamed: 0                              129880 non-null  int64  
 1   ID                                      129880 non-null  int64  
 2   Age                                     129880 non-null  int64  
 3   Flight Distance                         129880 non-null  int64  
 4   Departure Delay                         129880 non-null  int64  
 5   Arrival Delay                           129487 non-null  float64
 6   Departure and Arrival Time Convenience  129880 non-null  int64  
 7   Ease of Online Booking                  129880 non-null  int64  
 8   Check-in Service                        129880 non-null  int64  
 9   Online Boarding                         129880 non-null  int64  
 10  Gate Location                           1298

In [32]:
airline_data_encoded.head()

Unnamed: 0.1,Unnamed: 0,ID,Age,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,...,Gender_Male,Customer Type_First-time,Customer Type_Returning,Type of Travel_Business,Type of Travel_Personal,Class_Business,Class_Economy,Class_Economy Plus,Satisfaction_Neutral or Dissatisfied,Satisfaction_Satisfied
0,0,1,48,821,2,5.0,3,3,4,3,...,1,1,0,1,0,1,0,0,1,0
1,1,2,35,821,26,39.0,2,2,3,5,...,0,0,1,1,0,1,0,0,0,1
2,2,3,41,853,0,0.0,4,4,4,5,...,1,0,1,1,0,1,0,0,0,1
3,3,4,50,1905,0,0.0,2,2,3,4,...,1,0,1,1,0,1,0,0,0,1
4,4,5,49,3470,0,1.0,3,3,3,5,...,0,0,1,1,0,1,0,0,0,1


The output above shows that all five categorical columns have been replaced by 0/1 dummy columns. Every column is now numeric, which is what the models need, so we can move on to the next step.
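As a small illustration of what `pd.get_dummies` does, here is a minimal sketch on a hypothetical three-row-category toy frame (the `toy` data is invented for the example, not taken from the airline file). It also shows the optional `drop_first=True` argument, which this notebook does not use: because the dummy columns for one category always sum to 1, one of them is redundant and can be dropped if desired.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
toy = pd.DataFrame({
    "Class": ["Business", "Economy", "Economy Plus", "Economy"],
    "Age": [48, 35, 41, 50],
})

# One 0/1 column per category, as in the cell above.
full = pd.get_dummies(toy, columns=["Class"], dtype=int)

# Same encoding, but the first category (Business) is dropped as the baseline.
reduced = pd.get_dummies(toy, columns=["Class"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row with zeros in every remaining `Class_` column is understood to belong to the dropped category.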

## 6. Train/Test Split

We need to split the data into a training set and a testing set.

First, let's determine the partition sizes for a 70/30 train/test split:

In [33]:
len(airline_data_encoded) * .7, len(airline_data_encoded) * .3

(90916.0, 38964.0)

In [34]:
X_train, X_test, y_train, y_test = train_test_split(airline_data_encoded.drop(columns = ['Satisfaction_Neutral or Dissatisfied', 'Satisfaction_Satisfied', 'ID']),
                                                    airline_data_encoded.Satisfaction_Satisfied,
                                                    test_size = 0.3,
                                                    random_state = 47)

In [35]:
X_train.shape, X_test.shape

((90916, 28), (38964, 28))

In [36]:
y_train.shape, y_test.shape

((90916,), (38964,))
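The split above is a plain random 70/30 split. One optional variation, not used in this notebook, is to pass `stratify=` so that both sets keep the same proportion of satisfied customers. The sketch below uses synthetic arrays (`X_toy` and `y_toy` are hypothetical stand-ins) just to show the call and a quick check of the class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 3 features, 43% positive class.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 430 + [0] * 570)

# Same test_size and random_state as above, plus stratification on the target.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=47, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
# Share of the positive class in each set; both should be close to 0.43.
print(y_tr.mean(), y_te.mean())
```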

In [37]:
# Print each split along with its type, shape, and length.
print("\n")
print("X_train:", X_train, type(X_train), X_train.shape, len(X_train)) #TrainX

print("\n")
print("X_test", X_test, type(X_test), X_test.shape, len(X_test)) #TestX

print("\n")
print("y_train", y_train, type(y_train), y_train.shape, len(y_train)) #Trainy

print("\n")
print("y_test", y_test, type(y_test), y_test.shape, len(y_test)) #Testy





X_train:         Unnamed: 0  Age  Flight Distance  Departure Delay  Arrival Delay  \
13253        13253   32             3567                0            0.0   
31443        31443   15             1541               33           32.0   
75968        75968   33              297                8           44.0   
7298          7298   54              139                0            0.0   
101540      101540   69              345                0            0.0   
...            ...  ...              ...              ...            ...   
23112        23112   49             2976                1           37.0   
11528        11528   68              191                4            0.0   
112967      112967   34             1129               45           62.0   
51078        51078   28              308                8            4.0   
103559      103559   70              738                0            0.0   

        Departure and Arrival Time Convenience  Ease of Online Booking  \
13

In [46]:
# Count the missing values in each column of airline_data_encoded (`.isnull().sum()`)
# and their percentage (`100 * .isnull().mean()`), then combine both into a single
# DataFrame with the column names 'count' and '%', sorted by count.
missing = pd.concat([airline_data_encoded.isnull().sum(), 100 * airline_data_encoded.isnull().mean()], axis=1)
missing.columns = ['count', '%']
missing = missing.sort_values(by='count', ascending=False)

# Fill the missing Arrival Delay values with 0 and confirm that none remain.
airline_data_encoded['Arrival Delay'] = airline_data_encoded['Arrival Delay'].fillna(0)
print(airline_data_encoded['Arrival Delay'].isnull().sum())

0


In [48]:
# Re-run the missing-value summary to confirm that no column of
# airline_data_encoded has missing values after the fill.
missing = pd.concat([airline_data_encoded.isnull().sum(), 100 * airline_data_encoded.isnull().mean()], axis=1)
missing.columns = ['count', '%']
missing.sort_values(by='count', ascending=False)

Unnamed: 0,count,%
Unnamed: 0,0,0.0
In-flight Service,0,0.0
Satisfaction_Neutral or Dissatisfied,0,0.0
Class_Economy Plus,0,0.0
Class_Economy,0,0.0
Class_Business,0,0.0
Type of Travel_Personal,0,0.0
Type of Travel_Business,0,0.0
Customer Type_Returning,0,0.0
Customer Type_First-time,0,0.0
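The table confirms that `airline_data_encoded` has no missing values left. Since the train/test split was made before the fill, it is also worth checking whether the fill reaches the split sets. The sketch below uses a hypothetical ten-row toy frame (invented for the example) to show that filling the source frame after `train_test_split` leaves the split sets unchanged, and one way to fill them directly.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with four missing Arrival Delay values.
toy = pd.DataFrame({
    "Arrival Delay": [5.0, np.nan, 0.0, np.nan, 12.0, 3.0, np.nan, 8.0, 1.0, np.nan],
    "Age": [48, 35, 41, 50, 49, 32, 15, 33, 54, 69],
})
toy_y = pd.Series([0, 1, 1, 1, 1, 0, 0, 1, 0, 1])

# Split first, as in this notebook ...
X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.3, random_state=47)

# ... then fill the source frame afterwards.
toy["Arrival Delay"] = toy["Arrival Delay"].fillna(0)

# The source frame is clean, but the split sets still hold all four NaNs.
missing_in_splits = (
    X_tr["Arrival Delay"].isnull().sum() + X_te["Arrival Delay"].isnull().sum()
)
print(toy["Arrival Delay"].isnull().sum(), missing_in_splits)

# Filling the split sets themselves removes the remaining NaNs.
X_tr_filled = X_tr.fillna({"Arrival Delay": 0})
X_te_filled = X_te.fillna({"Arrival Delay": 0})
print(X_tr_filled["Arrival Delay"].isnull().sum(), X_te_filled["Arrival Delay"].isnull().sum())
```

The same check on the real data would be `X_train.isnull().sum().sum()` and `X_test.isnull().sum().sum()`.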


In [47]:
# Construct the LogisticRegression model
clf = LogisticRegression()

# Fit the model on the training data.
clf.fit(X_train, y_train) 

# Print the accuracy from the testing data.
# Introduce variable to be reused later
y_predict_test = clf.predict(X_test)
print("\n")
print("[Test] Accuracy score (y_predict_test, y_test):",accuracy_score(y_predict_test, y_test))

# Note the order in which the parameters must be passed
# according to the documentation ... although there should be
# no difference since it is a one-to-one comparison ...
# ref: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score
print("\n")
print("[Test] Accuracy score: (y_test, y_predict_test)",accuracy_score(y_test, y_predict_test))

y_predict_training = clf.predict(X_train)
print("\n")
print("[Training] Accuracy score: (y_train, y_predict_training)",accuracy_score(y_train, y_predict_training))

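ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The fit fails because `X_train` still contains missing `Arrival Delay` values. The train/test split in step 6 was made before those values were filled, and `train_test_split` returns new DataFrames, so filling `airline_data_encoded` afterward does not change `X_train` or `X_test`. The fill needs to happen before the split, or be applied to the split sets themselves.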
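One possible way to avoid this error is to make the imputation part of the model itself, so that whatever data reaches `fit` or `predict` is filled first. The sketch below is an assumption about how that could look, not the notebook's actual next step: it uses a hypothetical synthetic frame (`toy`, `toy_y`), fills missing values with 0 via `SimpleImputer` to mirror the `fillna(0)` above, and adds `StandardScaler` and a larger `max_iter` as optional choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the airline features.
rng = np.random.default_rng(47)
n = 500
toy = pd.DataFrame({
    "Arrival Delay": rng.exponential(15.0, size=n),
    "Seat Comfort": rng.integers(1, 6, size=n),  # ratings 1 to 5
})
# Made-up target: "satisfied" when Seat Comfort is 4 or 5.
toy_y = (toy["Seat Comfort"] >= 4).astype(int)

# Knock out 20 Arrival Delay values to imitate the missing data.
toy.loc[rng.choice(n, size=20, replace=False), "Arrival Delay"] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.3, random_state=47)

# Impute with 0, scale, then fit logistic regression, all in one estimator.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_tr, y_tr)  # no ValueError: NaNs are filled inside the pipeline

test_accuracy = accuracy_score(y_te, model.predict(X_te))
print("[Test] Accuracy score:", test_accuracy)
```

Keeping the imputer inside the pipeline means the same fill is applied to the training data, the test data, and any cross-validation folds without extra bookkeeping.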