<p align="center"><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="260" height="110" /></p>

![image-3.png](attachment:image-3.png)

## FLIGHT PASSENGER SATISFACTION PREDICTION (MACHINE LEARNING - INTERMEDIATE)

**Submitted by NARAYAN V. SHANBHAG**

**GCD Student, INSAID**

**Batch: May 9, 2021**

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

---
<a name = Section1></a>
# **1. Introduction**
---
### Company Introduction - FLY HIGH

##### Your client for this project is an airline company, FLY HIGH.

- 1) Due to **fierce** competition in the airline industry, the airline company needs to focus on the passenger’s experience and satisfaction.
- 2) **Customer feedback**, in particular, is critical since it is an outcome measurement for business performance.
- 3) So, they need to analyze the data of the passenger's travel history.
- 4) One of the key **measurements** in this process is whether the passenger feels satisfied or not.

### Current Scenario
 - Currently, they have a manual process to analyze customer satisfaction based on the number of feedback and complaint emails that they receive.

---
<a name = Section2></a>
# **2. Problem Statement**
---
- This section gives a general introduction to the problem that most companies confront.

### The company suffers from the following problems:

- 1) Manually analyzing passenger data to understand whether a passenger is satisfied or not is a tedious task.
- 2) This process needs to be repeated every time they receive some feedback.

- The company has hired you as data science consultants.

- They want to **automate** the process of **predicting** the passenger satisfaction based on the travel history data collected by the airline company.


### Your Role
- 1) You are given a dataset containing the answers to the different questions asked as feedback.
- 2) Your task is to build a classification model using the dataset.
- 3) Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.

### Project Deliverable
- Deliverable: **Predict whether the customer is satisfied or not.**
- Machine Learning Task: **Classification**
- Target Variable: **satisfaction**
- Win Condition: **N/A (best possible model)**

### Evaluation Metric
- The model evaluation will be based on the **Accuracy score.**
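
To make the metric concrete, here is a minimal sketch of how the accuracy score is computed. The labels below are made up for illustration (1 = satisfied, 0 = neutral or dissatisfied) and are not taken from the project data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (illustration only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

# The same value computed with scikit-learn
sklearn_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, sklearn_accuracy)
```

Six of the eight predictions match, so both values come out as 0.75.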

# **3. Installing & Importing Libraries**
---
- This section covers installing and importing the libraries that will be required.

### **Installing Libraries**

In [None]:
!pip install -q datascience                                         
# Package that is required by pandas profiling
!pip install -q pandas-profiling                                    
# Library to generate basic statistics about data

# To install more libraries insert your code here..


### **Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- Do not execute the cells under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [None]:
!pip install -q --upgrade pandas-profiling                          
# Upgrading pandas profiling to the latest version

### **Importing Libraries**

- You can get a head start with the basic libraries imported in the cell below.

- If you want to import some additional libraries, feel free to do so.

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing package pandas (For Panel Data Analysis)
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis)
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface to use matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import scipy as sp                                                  # Importing library for scientific calculations
#-------------------------------------------------------------------------------------------------------------------------------
import warnings
warnings.filterwarnings('ignore')

# 4. Data Description

- The dataset contains passenger boarding and deboarding information and ratings of the services provided during the flight.
- The column **satisfaction** is also present in the dataset; it measures the overall satisfaction of the passenger.
- This is the value that we have to predict for future samples.
 
#### The dataset is divided into two parts: Train and Test sets.

#### Training Set: 
 - The train set contains **83123 rows** and **24 columns**
 - The last column **satisfaction** is the **target variable**.

#### Testing Set:
 - The test set contains **20781 rows** and **23 columns**.
 - The test set **doesn’t contain** the **satisfaction** column.
 - It needs to be predicted for the test set.
 

**Train Set:**

| Records |Features|Target Variable|
|:--|:--|:--|
|**83123**|**24**|**satisfaction**|

**Test Set:**

|Records|Features|Predicted Variable|
|:--|:--|:--|
|**20781**|**23**|**satisfaction**|
 
|**ID**|**Feature**|**Description**|
|:--|:--|:--|
|01| Id                    | Unique Id|
|02| Gender                | Gender of the Passenger|  
|03| Customer Type	       | If the customer is loyal or disloyal| 
|04| Age                   | Age of the customer|   
|05| Type of Travel	       | If the travel is for a business or a personal purpose |
|06| Class                 | Class of the aircraft in which the customer is travelling|
|07| Flight Distance       | Distance covered by the flight|
|08| Inflight wifi service       | If there is inflight wifi service or not|
|09| Departure/Arrival time convenient      | Rating by the customer for Departure/Arrival Time|
|10| Ease of Online booking	      | Rating by the customer for this facility|
|11| Gate location         | Rating by the customer for this facility|
|12| Food and drink        | Rating by the customer for this facility|
|13| Online boarding       | Rating by the customer for this facility |
|14| Seat comfort          | Rating by the customer for this facility|
|15| Inflight entertainment     | Rating by the customer for this facility|
|16| On-board service      | Rating by the customer for this facility|
|17| Leg room service	   | Rating by the customer for this facility|
|18| Baggage handling	   | Rating by the customer for this facility|
|19| Checkin service	   | Rating by the customer for this facility|
|20| Inflight service	   | Rating by the customer for this facility|
|21| Cleanliness	       | Rating by the customer for this facility|
|22| Departure Delay in Minutes   | Departure delay of the flight, in minutes|
|23| Arrival Delay in Minutes     | Arrival delay of the flight, in minutes|
|24| satisfaction          |If the passenger is satisfied or not|

---
<a name = Section4></a>
# **5. Data Acquisition**
---

- This section covers acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or use a direct link to the source (generally preferred in most cases).

- You will be working with a direct link to the source to get a head start on your work.

- Before going further you must have a good idea about the features of the data set, as listed in the feature table above.



In [None]:
#from google.colab import files
#uploaded = files.upload()
#import io
#data = pd.read_csv(io.BytesIO(uploaded['Churn_train.csv']))


In [None]:
#data = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/car_sales.csv', encoding='cp1252')
data = pd.read_csv("D://cust_train.csv", skipinitialspace=True)      # Load the train set, stripping spaces after delimiters
print('Data Shape:', data.shape)
data.head()
# The train set is expected to have 83123 RECORDS/ROWS and 24 FEATURES/COLUMNS (see Data Description)

### **Data Information**
- **In total there are 24 Columns/Features: the Id column, 5 CATEGORICAL columns (Gender, Customer Type, Type of Travel, Class and satisfaction) and 18 NUMERICAL columns**
- **Check whether the NUMERICAL Columns have 0s and whether they are relevant**


In [None]:
# Insert your code here
data.info()


### **Data Description**

- **To get a quick description of the data you can use the `describe` method defined in the pandas library.**
- **It gives the 5-Point or 5-Number summary and other details such as Count, Mean and Standard Deviation of the data-set**

In [None]:
data.describe()

In [None]:
data_final = pd.read_csv("D://cust_test.csv")
print('Data Shape:', data_final.shape)
data_final

In [None]:
data_final.info()

In [None]:
data_final.describe()

---
<a name = Section5></a>
# **6. Data Pre-Profiling**
---

- This section covers generating a profiling report of the data.

- You need to perform pandas profiling and draw some observations from it.

In [None]:
# Insert your code here...
profile_satisfaction = ProfileReport(df=data)
profile_satisfaction

**6.1 Data Pre-Profiling for TEST SET**

In [None]:
profile1_satisfaction = ProfileReport(df=data_final)
profile1_satisfaction

---
<a name = Section6></a>
# **7. Data Pre-Processing**
---

- This section covers manipulating unstructured data so that it is ready for further processing and analysis.

- To turn unstructured data into structured data you need to verify and repair the integrity of the data by:
  - **Handling/Checking Duplicate Data for both the TRAIN and TEST Data Sets**

  - Handling missing data,

  - Handling redundant data,

  - Handling inconsistent data,

  - Handling outliers,

  - Handling typos

- **There are actually NO DUPLICATE RECORDS/ROWS in the DATA-SETS**



In [None]:
# Insert your code here...
data[data.duplicated()]

In [None]:
data_final[data_final.duplicated()]
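
As a small illustration of what the duplicate check does, the sketch below runs it on a made-up frame that contains one repeated row. The frame `toy` and its values are hypothetical; only `duplicated` and `drop_duplicates` are the pandas calls of interest.

```python
import pandas as pd

# Hypothetical frame with one fully repeated row (the second and third rows)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "Age": [34, 51, 51, 27],
    "Class": ["Eco", "Business", "Business", "Eco"],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```

If the count is zero, as reported for the train and test sets above, no rows need to be dropped.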

- **Now check ALL the NUMERICAL COLUMNS for MISSING and ZERO values and Replace/Substitute them with appropriate values**

In [None]:
data.isnull().sum()

In [None]:
(data == 0 ).sum(axis = 0)

In [None]:
data_final.isnull().sum()

In [None]:
(data_final == 0 ).sum(axis = 0)
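
The two checks above can also be read side by side. The sketch below builds one summary table with a missing count and a zero count per column; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing delay value and a few zeros
toy = pd.DataFrame({
    "Arrival Delay in Minutes": [0.0, 12.0, np.nan, 0.0],
    "Inflight wifi service": [3, 0, 4, 5],
})

# One row per column, with the number of missing values and the number of zeros
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "zeros": (toy == 0).sum(),
})

print(summary)
```

A missing value is not counted as a zero, so the two columns of the summary can be interpreted independently.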

### **Check whether the DataSet is Balanced for the satisfaction column**

In [None]:
data.satisfaction.value_counts().plot(kind='pie',autopct='%1.1f%%', startangle=90);
data.satisfaction.value_counts()
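
Balance is easier to judge from class shares than from raw counts. The sketch below uses a made-up target column to show `value_counts(normalize=True)`; the share of the largest class is also the accuracy of a model that always predicts that class, which is a useful baseline for the accuracy metric. The five values are hypothetical.

```python
import pandas as pd

# Hypothetical target values (illustration only)
toy_target = pd.Series(
    ["satisfied", "neutral or dissatisfied", "neutral or dissatisfied",
     "satisfied", "neutral or dissatisfied"],
    name="satisfaction",
)

# Share of each class instead of the raw count
class_share = toy_target.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class
majority_baseline = float(class_share.max())

print(class_share)
print(majority_baseline)
```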

In [None]:
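figure = plt.figure(figsize = (20,10))
# Correlation is computed on the numerical columns only; text columns such as Gender are excluded
HeatMap = sns.heatmap(data.select_dtypes(include = 'number').corr(),annot = True,cmap = 'coolwarm',vmin = -1, vmax = 1,linecolor = 'black',linewidths = 1)
HeatMap.set_title('Correlation HeatMap', fontdict = {'fontsize':16})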

### OBSERVATIONS
- Look for pairs of features with HIGH CORRELATION (values close to **1** or **-1**) in the heatmap above.
- Related pairs such as **Departure Delay in Minutes** and **Arrival Delay in Minutes** are natural candidates to check.

In [None]:
data.columns

In [None]:
data['Arrival Delay in Minutes'].mean()

In [None]:
data['Arrival Delay in Minutes'].median()

In [None]:
data_final['Arrival Delay in Minutes'].mean()

In [None]:
data_final['Arrival Delay in Minutes'].median()

In [None]:
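# Fill the missing values with the column mean computed above (assigning back avoids chained in-place modification)
data['Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].fillna(value = data['Arrival Delay in Minutes'].mean())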
data.info()

In [None]:
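# Fill the missing values with the column mean computed above (assigning back avoids chained in-place modification)
data_final['Arrival Delay in Minutes'] = data_final['Arrival Delay in Minutes'].fillna(value = data_final['Arrival Delay in Minutes'].mean())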
data_final.info()
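
The cells above fill each data set with its own mean. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to compute the fill value once on the train set and reuse it for the test set, so that both sets are treated identically. The frames `toy_train` and `toy_test` and their values are made up.

```python
import numpy as np
import pandas as pd

col = "Arrival Delay in Minutes"

# Hypothetical train and test frames, each with one missing value
toy_train = pd.DataFrame({col: [0.0, 10.0, np.nan, 50.0]})
toy_test = pd.DataFrame({col: [5.0, np.nan]})

# The fill value is learned from the train set only (mean() skips missing values)
fill_value = toy_train[col].mean()

# The same value is applied to both sets
toy_train[col] = toy_train[col].fillna(fill_value)
toy_test[col] = toy_test[col].fillna(fill_value)

print(fill_value)
print(toy_test)
```

For a skewed column such as a delay in minutes, `median()` could be swapped in for `mean()` in the same pattern.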

In [None]:
data_final.head()

In [None]:
data_intermediate = data[['Gender', 'Customer Type','Type of Travel', 'Class', 'satisfaction']]
data_intermediate

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
data_intermediate = data_intermediate.apply(LabelEncoder().fit_transform)
data_intermediate
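
`LabelEncoder` assigns integer codes in sorted order of the values it sees during fitting. The sketch below, an illustration under assumed category values rather than the notebook's own approach, fits one encoder per column on the train data and reuses it for the test data, which guarantees that both sets share the same codes. The frames and the category names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train and test frames with assumed category values
toy_train = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Class": ["Eco", "Business", "Eco Plus"],
})
toy_test = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Class": ["Eco Plus", "Eco"],
})

encoders = {}
train_encoded = toy_train.copy()
test_encoded = toy_test.copy()

for column in toy_train.columns:
    # Fit on the train column only, then apply the same mapping to both sets
    encoders[column] = LabelEncoder().fit(toy_train[column])
    train_encoded[column] = encoders[column].transform(toy_train[column])
    test_encoded[column] = encoders[column].transform(toy_test[column])

print(train_encoded)
print(test_encoded)
```

Keeping the fitted encoders in a dictionary also allows `inverse_transform` to be used later to turn codes back into the original text values.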

In [None]:
data_categorical = data_intermediate.drop(['satisfaction'],axis = 1)

In [None]:
data_numerical = data[['Age','Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',]]
data_numerical

In [None]:
data_model = pd.concat([data_categorical,data_numerical], axis = 1)
data_model

In [None]:
data_model.columns

In [None]:
x = data_model[['Gender', 'Customer Type', 'Type of Travel', 'Class', 'Age',
       'Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes',
       'Arrival Delay in Minutes']]
x

In [None]:
y = data_intermediate['satisfaction']                               # 1-D target, as scikit-learn estimators expect
y

### **7.1 Data Cleaning**

- In this section, we will **remove** columns which are **redundant** for the model.

---
# **8. Exploratory Data Analysis**

#### **Question:** What is the distribution of the **target** feature?

# **9. Data Post Processing**

### **9.1 Data Splitting**

- Now we will split our data into train set and test set.

- We will keep **80%** data in the **train** set, and **20%** data in the **test** set.

### Import `train_test_split` and split the data into train and test sets for x and y.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Splitting data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# Display the shape of training and testing data
print('x_train shape: ', x_train.shape)
print('y_train shape: ', y_train.shape)
print('x_test shape: ', x_test.shape)
print('y_test shape: ', y_test.shape)
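
The split above is purely random. As an optional variation, not used in this notebook, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below shows this on made-up arrays with a 60/40 class mix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 30 of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(50, 2)
y_toy = np.array([0] * 30 + [1] * 20)

# stratify=y_toy keeps the 60/40 class mix in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print('train shape:', X_tr.shape, 'share of class 1:', y_tr.mean())
print('test shape: ', X_te.shape, 'share of class 1:', y_te.mean())
```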

# **10. Model Development & Evaluation**

### **10.1 Model Development & Evaluation without PCA**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (For Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
import plotly.graph_objs as go                                      # Importing plotly for interactive visualizations
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler                    # Importing to scale the features in the dataset
from sklearn.model_selection import train_test_split                # To properly split the dataset into train and test sets
from sklearn.ensemble import RandomForestClassifier                 # To create a random forest classifier model
from sklearn.linear_model import LogisticRegression                 # To create a logistic regression model
from sklearn import metrics                                         # Importing to evaluate the classification models
from sklearn.decomposition import PCA                               # Importing to create an instance of PCA model
#-------------------------------------------------------------------------------------------------------------------------------
from random import randint                                          # Importing to generate random integers
#-------------------------------------------------------------------------------------------------------------------------------
import time                                                         # For time functionality
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

### Importing the evaluation metrics for Classification model - Logistic Regression


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

In [None]:
#clfs = [LogisticRegression(solver='liblinear'), RandomForestClassifier(random_state=42)]

In [None]:
logreg_model =  LogisticRegression(solver='liblinear')

In [None]:
randforest_model = RandomForestClassifier(random_state=42)

In [None]:
logreg_model.fit(x_train, y_train)

In [None]:
randforest_model.fit(x_train, y_train)

In [None]:
y_pred_train_lr = logreg_model.predict(x_train)
y_pred_train_lr

In [None]:
y_pred_test_lr = logreg_model.predict(x_test)
y_pred_test_lr

In [None]:
y_pred_train_rf = randforest_model.predict(x_train)
y_pred_train_rf

In [None]:
y_pred_test_rf = randforest_model.predict(x_test)
y_pred_test_rf

### ACCURACY SCORE and F1-SCORE of Logistic regression Train Set

In [None]:
ACCURACY_SCORE_TRAIN_LR = accuracy_score(y_train,y_pred_train_lr)
F1_SCORE_TRAIN_LR = f1_score(y_train,y_pred_train_lr)
print(ACCURACY_SCORE_TRAIN_LR)
print(F1_SCORE_TRAIN_LR)

### ACCURACY SCORE and F1-SCORE of Logistic regression Test Set

In [None]:
ACCURACY_SCORE_TEST_LR = accuracy_score(y_test,y_pred_test_lr)
F1_SCORE_TEST_LR = f1_score(y_test,y_pred_test_lr)
print(ACCURACY_SCORE_TEST_LR)
print(F1_SCORE_TEST_LR)

### ACCURACY SCORE and F1-SCORE of Random Forest Classifier Train Set

In [None]:
ACCURACY_SCORE_TRAIN_RF = accuracy_score(y_train,y_pred_train_rf)
F1_SCORE_TRAIN_RF = f1_score(y_train,y_pred_train_rf)
print(ACCURACY_SCORE_TRAIN_RF)
print(F1_SCORE_TRAIN_RF)

### ACCURACY SCORE and F1-SCORE of Random Forest Classifier Test Set

In [None]:
ACCURACY_SCORE_TEST_RF = accuracy_score(y_test,y_pred_test_rf)
F1_SCORE_TEST_RF = f1_score(y_test,y_pred_test_rf)
print(ACCURACY_SCORE_TEST_RF)
print(F1_SCORE_TEST_RF)
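
Accuracy and F1 are single numbers; a confusion matrix shows which kind of error a model makes. The sketch below uses made-up labels (1 = satisfied, 0 = neutral or dissatisfied) to show how the four cells relate to precision and recall. The values are for illustration only and are not results from the models above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (illustration only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels 0/1, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

# Precision = tp / (tp + fp), recall = tp / (tp + fn)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)

print(tn, fp, fn, tp)
print(precision, recall)
```

The same calls can be applied to `y_test` and the test predictions of either model to see whether errors lean towards false positives or false negatives.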

## Analyzing the Test File

### Separate out the Categorical and Numerical Columns in the Test DataSet

In [None]:
data_submission = data_final['id']
data_submission.shape

In [None]:
data_final_categorical = data_final[['Gender', 'Customer Type','Type of Travel', 'Class']]
data_final_categorical

In [None]:
data_final_categorical = data_final_categorical.apply(LabelEncoder().fit_transform)
data_final_categorical.head()

In [None]:
data_final_numerical = data_final[['Age','Flight Distance', 'Inflight wifi service',
       'Departure/Arrival time convenient', 'Ease of Online booking',
       'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
       'Inflight entertainment', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Inflight service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',]]
data_final_numerical.head()

In [None]:
data_final_model = pd.concat([data_final_categorical,data_final_numerical], axis = 1)
data_final_model.head()

In [None]:
data_final_model.shape

### Predict the label values using the earlier trained Logistic Regression Model

In [None]:
y_pred_test_final_logistic = logreg_model.predict(data_final_model)
y_pred_test_final_logistic

### Convert the array into a DataFrame

In [None]:
y_pred_test_final_logistic = pd.DataFrame(y_pred_test_final_logistic)

In [None]:
y_pred_test_final_logistic

In [None]:
y_pred_test_final_random_forest = randforest_model.predict(data_final_model)
y_pred_test_final_random_forest

In [None]:
y_pred_test_final_random_forest = pd.DataFrame(y_pred_test_final_random_forest)
y_pred_test_final_random_forest.head()

In [None]:
y_pred_test_final_random_forest.shape

## Prepare the submission file, which should have only two columns: the KEY/INDEX column (id) and the TARGET column (satisfaction)

In [None]:
submission_file = pd.concat([data_submission,y_pred_test_final_random_forest], axis = 1)

In [None]:
submission_file.head()

### To convert the predicted values from 0/1 back to neutral or dissatisfied/satisfied

In [None]:
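# Name the columns, then map only the prediction column so that id values of 0 or 1 are left untouched
submission_file.columns = ['id', 'satisfaction']
submission_file['satisfaction'] = submission_file['satisfaction'].map({1:"satisfied", 0:"neutral or dissatisfied"})
submission_file.head()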

In [None]:
y_pred_test_final_logistic.replace({1:"satisfied", 0:"neutral or dissatisfied"}, inplace = True)
y_pred_test_final_logistic.head()

### To write the final data to the submission file which is .csv without HEADER and INDEX

In [None]:
submission_file.to_csv('D://Flight_Passenger_Satisfaction_Prediction_Submission.csv', header=False, index=False)

In [None]:
y_pred_test_final_logistic.to_csv('D://Flight_Passenger_Satisfaction_Prediction_Intermediate.csv',header=False, index=False)
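
The submission steps can be summarised in one compact sketch: build a two-column frame from the ids and the predictions, map the codes back to text, and write it without header and index. The ids and predictions below are made up, and the output goes to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and model predictions (illustration only)
ids = pd.Series([101, 102, 103], name="id")
predictions = np.array([1, 0, 1])

# LabelEncoder sorts alphabetically, so 0 = neutral or dissatisfied and 1 = satisfied
label_names = {0: "neutral or dissatisfied", 1: "satisfied"}

submission = pd.DataFrame({
    "id": ids,
    "satisfaction": pd.Series(predictions).map(label_names),
})

# Write without header and index, as required for the submission file
buffer = io.StringIO()
submission.to_csv(buffer, header=False, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```

Mapping only the prediction column, rather than replacing values across the whole frame, keeps the id column unchanged.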

### Thank You !!!