## Team 7. Final Project

In [70]:
# Initial imports
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sqlalchemy import create_engine

## Dataset Information

This dataset contains the outcomes of **103,904** airline passenger satisfaction surveys. The variables in this dataset can be used to predict whether a passenger could become a loyal customer. According to *The Importance of Customer Loyalty Programs: Statistics and Trends*, **80%** of a company's future revenue will come from just 20% of its existing clients. With this in mind, the airline could launch a loyalty program strategy to overcome post-pandemic hardships.

The dataset includes the following columns:

* `gender`: Gender of the passengers (Female/Male).
* `customer type`: The customer type (Loyal customer/Disloyal customer).
* `age`: The age of the passenger.
* `type travel`: Purpose of the passenger's flight (Personal Travel/Business Travel).
* `class`: Travel class of the passenger (Business/Eco/Eco Plus).
* `flight distance`: The flight distance of the journey.
* `departure delay`: Departure delay in minutes.
* `arrival delay`: Arrival delay in minutes.
* `satisfaction`: Airline satisfaction level (satisfied/neutral or dissatisfied).
    
The machine learning model is designed to answer the following questions:
1. Can a machine learning model predict the type of customer (loyal vs disloyal) a ticket purchaser will become in the future? 
2. What factors are highly correlated with the decision to become a loyal customer?
3. Which are the areas of opportunity?
4. Where in the process, from buying a ticket to completing the flight, does the company lose or gain the client's loyalty?


In [71]:
# Retrieve dataset from PostgreSQL RDS AWS

# Prompt for the PostgreSQL connection credentials (nothing is echoed or stored in the notebook)
from getpass import getpass

user = getpass('Enter user')
password = getpass('Enter password')
database = getpass('Enter database')
port = getpass('Enter port')

Enter user········
Enter password········
Enter database········
Enter port········


In [72]:
# Create the engine connection to PostgreSQL in AWS
engine = create_engine('postgresql://'+user+':'+password+'@database-1.cetgij0pjfvj.us-east-1.rds.amazonaws.com:'+port+'/'+
                      database)
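
Concatenating raw credentials into the connection string can produce an invalid URL when the password contains characters such as `@` or `/`. A minimal sketch of one way to guard against this, assuming the credentials are percent-encoded with the standard library's `urllib.parse.quote`; the helper name `build_postgres_url` and the example host and credentials are hypothetical and not part of the original notebook:

```python
from urllib.parse import quote

def build_postgres_url(user, password, host, port, database):
    """Return a PostgreSQL URL with percent-encoded credentials (hypothetical helper)."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )

# Placeholder values for illustration only
url = build_postgres_url("team7", "p@ss/word", "example-host", "5432", "flights")
print(url)
```

The resulting string could then be passed to `create_engine` in place of the concatenated one.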

In [73]:
# Declare the SQL query to get dataset rows
query_flights_surveys = """SELECT sid,id,gender,customer_type,age,type_travel,class_name,flight_distance,departure_delay,
                   arrival_delay,satisfaction 
                   FROM flights_train f
                   INNER JOIN classes c ON f.class_no = c.class_no 
                   ORDER BY sid"""

In [74]:
# Retrieve flights rows from PostgreSQL
df = pd.read_sql(query_flights_surveys, con=engine, index_col='sid')
df.head()

sid,id,gender,customer_type,age,type_travel,class_name,flight_distance,departure_delay,arrival_delay,satisfaction
1,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,25,18,neutral or dissatisfied
2,5047,Male,disloyal Customer,25,Business travel,Business,235,1,6,neutral or dissatisfied
3,110028,Female,Loyal Customer,26,Business travel,Business,1142,0,0,satisfied
4,24026,Female,Loyal Customer,25,Business travel,Business,562,11,9,neutral or dissatisfied
5,119299,Male,Loyal Customer,61,Business travel,Business,214,0,0,satisfied


## Data Preprocessing
The dataset contains categorical and text features (gender, customer type, type travel, class and satisfaction). Therefore, these features 
must be converted to numerical data for use in our machine learning model. 

The data preprocessing includes the following steps:
    
1. Detect missing values with the Pandas DataFrame method *isna()*.
2. Make sure the variables use the correct data types, using the Pandas DataFrame *dtypes* property.
3. Use Scikit-learn's *LabelEncoder* class to transform categorical and text variables into numerical data as follows:
    
* `gender`: 0 (Female), 1 (Male)
* `customer_type`: 0 (Loyal Customer), 1 (Disloyal Customer)
* `type_travel`: 0 (Business travel), 1 (Personal Travel) 
* `class`: 0 (Business), 1 (Eco), 2 (Eco Plus)
* `satisfaction`: 0 (neutral or dissatisfied), 1 (satisfied)

4. Drop the identification `id` column.
5. Finally, verify the information about the DataFrame, including the index type and columns, non-null values, and memory usage.
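
The integer codes listed in step 3 follow from how `LabelEncoder` works: it sorts the distinct labels and numbers them in that order. The sketch below illustrates this on the category values shown in the column descriptions and the data preview; the `columns` and `mappings` names are illustrative assumptions, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Category values as listed in the column descriptions and data preview
columns = {
    "gender": ["Male", "Female"],
    "customer_type": ["Loyal Customer", "disloyal Customer"],
    "type_travel": ["Personal Travel", "Business travel"],
    "class_name": ["Eco Plus", "Business", "Eco"],
    "satisfaction": ["neutral or dissatisfied", "satisfied"],
}

mappings = {}
for name, values in columns.items():
    le = LabelEncoder().fit(values)
    # classes_ is sorted; each label's code is its position in that array
    mappings[name] = {str(label): int(code) for code, label in enumerate(le.classes_)}
    print(name, mappings[name])
```

Note that the sort is case-sensitive, which is why `Loyal Customer` (capital L) receives code 0 and `disloyal Customer` receives code 1.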

In [75]:
# Identifying null values with the Pandas dataframe isna() function
df.isna().sum()

id                 0
gender             0
customer_type      0
age                0
type_travel        0
class_name         0
flight_distance    0
departure_delay    0
arrival_delay      0
satisfaction       0
dtype: int64

In [76]:
# Making sure we are using the correct data types with the Pandas dataframe dtypes property
df.dtypes

id                  int64
gender             object
customer_type      object
age                 int64
type_travel        object
class_name         object
flight_distance     int64
departure_delay     int64
arrival_delay       int64
satisfaction       object
dtype: object

In [77]:
# Using Scikit-learn's LabelEncoder to transform object-type columns into numerical data
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Refit the encoder on each object column; labels are numbered in sorted order
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
        
df.head()

sid,id,gender,customer_type,age,type_travel,class_name,flight_distance,departure_delay,arrival_delay,satisfaction
1,70172,1,0,13,1,2,460,25,18,0
2,5047,1,1,25,0,0,235,1,6,0
3,110028,0,0,26,0,0,1142,0,0,1
4,24026,0,0,25,0,0,562,11,9,0
5,119299,1,0,61,0,0,214,0,0,1


In [78]:
# Dropping the identification id column
df = df.drop(['id'], axis=1)
df.head()

sid,gender,customer_type,age,type_travel,class_name,flight_distance,departure_delay,arrival_delay,satisfaction
1,1,0,13,1,2,460,25,18,0
2,1,1,25,0,0,235,1,6,0
3,0,0,26,0,0,1142,0,0,1
4,0,0,25,0,0,562,11,9,0
5,1,0,61,0,0,214,0,0,1


In [79]:
# Verify the information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103904 entries, 1 to 103904
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   gender           103904 non-null  int32
 1   customer_type    103904 non-null  int32
 2   age              103904 non-null  int64
 3   type_travel      103904 non-null  int32
 4   class_name       103904 non-null  int32
 5   flight_distance  103904 non-null  int64
 6   departure_delay  103904 non-null  int64
 7   arrival_delay    103904 non-null  int64
 8   satisfaction     103904 non-null  int32
dtypes: int32(5), int64(4)
memory usage: 5.9 MB


## Separate the Features (X) from the Target (y)

We can see from the preview of the dataset that multiple variables, such as the customer's age, the travel class, the flight distance, the departure delay, the arrival delay, and the level of customer satisfaction, can be used to predict the outcome: whether a customer is loyal (0) or disloyal (1). Therefore, the **customer type** column is the target or dependent variable (y) in our model, and the remaining features are the independent variables (X).

In [80]:
y = df["customer_type"]
X = df.drop(columns="customer_type")

## Split our data into training and testing
We split the dataset into **random** train and test subsets using the Scikit-learn **train_test_split** function. The model learns from the training subset, and the testing subset is used to assess its performance. 
We call **train_test_split** with four arguments:
* The input `X` variables
* The output `y`, or what we wish to predict: **customer type** 
* `random_state` of 1 to ensure that the same rows are assigned to the train and test sets on every run
* `stratify` set to `y` so that loyal and disloyal customers are split proportionally between the two subsets
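
To illustrate what `stratify` does, the sketch below splits a small synthetic, imbalanced label vector and compares the class shares in each subset. The data and the names `X_demo` and `y_demo` are made up for illustration; only `train_test_split` and its arguments come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 800 zeros and 200 ones (illustrative only)
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# With no test_size given, 25% of the rows go to the test subset
print(len(y_tr), len(y_te))
# Stratification keeps the share of class 1 the same in both subsets
print(y_tr.mean(), y_te.mean())
```

The same default explains the shapes reported for the real data: 25% of 103,904 rows is 25,976 test rows, leaving 77,928 for training.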

In [81]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
print(X_train.shape)
print(X_test.shape)

(77928, 8)
(25976, 8)


In [82]:
# Creating StandardScaler instance
scaler = StandardScaler()

In [83]:
# Fitting the StandardScaler on the training data only
X_scaler = scaler.fit(X_train)

In [84]:
# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [85]:
X_train_scaled[:5]

array([[ 1.01312379e+00,  6.36114472e-01, -6.71835080e-01,
        -9.56542010e-01,  2.80029742e+00, -3.85785435e-01,
        -3.89998570e-01,  1.14392629e+00],
       [ 1.01312379e+00,  1.49592975e+00,  1.48846053e+00,
         6.56698819e-01, -4.04407455e-04,  3.56559819e+00,
         3.36948028e+00, -8.74182196e-01],
       [-9.87046217e-01, -1.54649354e+00,  1.48846053e+00,
         6.56698819e-01,  8.68354990e-02, -1.94319861e-02,
        -3.89998570e-01, -8.74182196e-01],
       [ 1.01312379e+00, -1.41421427e+00,  1.48846053e+00,
         6.56698819e-01,  1.23298737e+00, -3.85785435e-01,
        -3.89998570e-01, -8.74182196e-01],
       [ 1.01312379e+00, -1.48035390e+00,  1.48846053e+00,
         6.56698819e-01, -9.36979955e-01, -3.85785435e-01,
        -3.89998570e-01,  1.14392629e+00]])

In [86]:
import numpy as np
print(np.mean(X_train_scaled[:,0]))
print(np.std(X_train_scaled[:,0]))

-9.728840712658183e-17
0.9999999999999999


## The Logistic Regression Model
A logistic regression model is a classification algorithm that can analyze continuous and categorical variables. With the combination of input variables, logistic regression predicts the probability of the input data belonging to one of two groups. In our case, the passenger satisfaction information could be used by an airline to determine whether a passenger does or does not qualify as a loyal client. 

We chose a logistic regression model because our dataset is of moderate size: 103,904 data points (fewer than two hundred thousand), with eight independent variables and one target. The model also suits our mix of numerical and categorical variables.
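
As a rough sketch of how logistic regression turns inputs into a probability, the example below passes a weighted sum of the features through the sigmoid function. The weights and bias are hypothetical placeholders, not the coefficients fitted in this notebook, and `predict_proba` here is an illustrative helper rather than the Scikit-learn method of the same name.

```python
import math

def predict_proba(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the inputs."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two scaled features (not fitted values)
weights = [1.2, -0.7]
bias = 0.0

# Probability of class 1 (disloyal, in this notebook's encoding)
p = predict_proba([0.0, 0.0], weights, bias)
print(p)

# Assign class 1 only when its probability exceeds 0.5
label = int(p > 0.5)
print(label)
```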

In [87]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='lbfgs', random_state=1)
classifier

LogisticRegression(random_state=1)

## Fit (train) the model using the training data
Because flight distance and flight delay (departure and arrival) are measured in different units (miles and minutes, respectively), we scaled the training and testing subsets using the Scikit-learn **StandardScaler** class, fitted on the training subset only. After scaling, we trained the logistic regression model using the **fit()** method with 77,928 data points.
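
The scaling step can be sketched by hand with NumPy: each column is centered on its training mean and divided by its training standard deviation (the population standard deviation, which is what `StandardScaler` uses). The rows below reuse the flight distance and departure delay values from the data preview purely as an illustration.

```python
import numpy as np

# Flight distance (miles) and departure delay (minutes) from the preview rows
train = np.array([[460.0, 25.0], [235.0, 1.0], [1142.0, 0.0], [562.0, 11.0]])
test = np.array([[214.0, 0.0]])

# Learn per-column statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
```

Reusing the training statistics for the test rows keeps information about the test subset out of the fitting step.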

In [88]:
classifier.fit(X_train_scaled, y_train)

LogisticRegression(random_state=1)

## Make predictions

In [89]:
y_pred = classifier.predict(X_test_scaled)
results = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results.head(10)

,Prediction,Actual
0,0,0
1,1,1
2,0,0
3,0,0
4,1,1
5,0,0
6,0,0
7,0,0
8,0,1
9,0,0


## Final Accuracy score
The model achieved an accuracy score of **0.89**, which means that roughly nine of ten observations in the testing set were predicted correctly. A score this high on a single split could be a sign of **overfitting**. However, testing the model with another dataset that contains the outcomes of **25,976** airline passenger satisfaction surveys yields a similar accuracy score of **0.9**.
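
Because loyal customers far outnumber disloyal ones, accuracy is easier to interpret next to a naive baseline that always predicts the majority class. The sketch below recomputes both figures from the confusion matrix counts reported in this notebook; the variable names are illustrative.

```python
# Counts taken from the confusion matrix reported in this notebook
loyal_as_loyal, loyal_as_disloyal = 20082, 1149        # actual loyal
disloyal_as_loyal, disloyal_as_disloyal = 1758, 2987   # actual disloyal

total = loyal_as_loyal + loyal_as_disloyal + disloyal_as_loyal + disloyal_as_disloyal

# Accuracy: share of test rows on the diagonal of the confusion matrix
accuracy = (loyal_as_loyal + disloyal_as_disloyal) / total

# Baseline: always predict the majority class (loyal)
baseline = (loyal_as_loyal + loyal_as_disloyal) / total

print(f"accuracy = {accuracy:.4f}")
print(f"baseline = {baseline:.4f}")
```

With these counts, the model scores about 0.888 against a majority-class baseline of about 0.817, so the gain over always guessing "loyal" is roughly seven percentage points.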

In [90]:
from sklearn.metrics import accuracy_score
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {acc_score}")

Accuracy Score: 0.8880890052356021


## Calculating the confusion matrix

**Precision**

From our results, the precision for loyal customers is given by the ratio TP / (TP + FP), which is 20,082 / (20,082 + 1,758) = 0.9195. The precision for disloyal customers is 2,987 / (2,987 + 1,149) = 0.7222. The high precision for the loyal class indicates a low number of false positive predictions. For the disloyal class, roughly 28% of the customers predicted as disloyal are actually loyal.

**Recall**

From our results, the recall for loyal customers is given by the ratio TP / (TP + FN), which is 20,082 / (20,082 + 1,149) = 0.9459. The recall for disloyal customers is 2,987 / (2,987 + 1,758) = 0.6295. The high recall for the loyal class indicates a low number of false negative predictions, whereas the model misses about 37% of the disloyal customers.

**F1 score**

From our results, the F1 score, the harmonic mean of precision and recall (sensitivity), is given by the formula 2 (Precision × Recall) / (Precision + Recall), which is 0.93 and 0.67 for loyal and disloyal customers, respectively. The high recall means that most loyal customers are classified correctly. The high precision, on the other hand, means that when the model predicts that a customer is loyal, there is a high likelihood that the customer really is loyal.
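
The three metrics above can be reproduced directly from the confusion matrix counts. The following is a minimal sketch in plain Python; the helper `class_metrics` is a hypothetical name introduced for illustration.

```python
# Confusion matrix from this notebook: rows are actual, columns are predicted
cm = [[20082, 1149],   # actual loyal (0)
      [1758, 2987]]    # actual disloyal (1)

def class_metrics(cm, k):
    """Precision, recall, and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # TP + FP (column sum)
    actual_k = sum(cm[k])                     # TP + FN (row sum)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, name in enumerate(["loyal", "disloyal"]):
    p, r, f1 = class_metrics(cm, k)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Rounded to two decimals, these values agree with the `classification_report` output for classes 0 and 1.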


In [91]:
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(
    cm, index=["Actual loyal", "Actual disloyal"], columns=["Predicted loyal", "Predicted disloyal"]
)

display(cm_df)

,Predicted loyal,Predicted disloyal
Actual loyal,20082,1149
Actual disloyal,1758,2987


In [92]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.92      0.95      0.93     21231
           1       0.72      0.63      0.67      4745

    accuracy                           0.89     25976
   macro avg       0.82      0.79      0.80     25976
weighted avg       0.88      0.89      0.89     25976



In [None]:
# Write the encoded DataFrame (df) to a table in PostgreSQL
df.to_sql('flights_data_encoded', con=engine, if_exists='replace')