<a href="https://colab.research.google.com/github/Arka1212/Airline-Passenger-Referral-Prediction/blob/main/Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***The data comprises airline reviews from 2006 to 2019 for popular airlines around the world, with multiple-choice and free-text questions. The data was scraped in Spring 2019. The main objective is to predict whether passengers will recommend the airline to their friends.***

# **Brief feature descriptions:**

## **airline:** Name of the airline.
## **overall:** Overall rating given to the trip, from 1 to 10.
## **author:** Author of the review.
## **review_date:** Date of the review.
## **customer_review:** The customer's review in free-text format.
## **aircraft:** Type of aircraft.
## **traveller_type:** Type of traveller (e.g. business, leisure).
## **cabin:** Cabin class of the flight.
## **date_flown:** Flight date.
## **seat_comfort:** Rated from 1 to 5.
## **cabin_service:** Rated from 1 to 5.
## **food_bev:** Rated from 1 to 5.
## **entertainment:** Rated from 1 to 5.
## **ground_service:** Rated from 1 to 5.
## **value_for_money:** Rated from 1 to 5.
## **recommended:** Binary target variable.
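
The documented rating ranges can be checked directly once the data is loaded. The sketch below is only an illustration: the `toy` frame, its values, and the helper `out_of_range_counts` are hypothetical and not part of the original notebook. The ranges come from the descriptions above.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "overall": [7.0, 2.0, None],
    "seat_comfort": [4.0, 1.0, 5.0],
    "value_for_money": [4.0, 6.0, None],  # 6.0 is deliberately out of range
})

# Documented ranges as (min, max) per rating column.
expected_ranges = {
    "overall": (1, 10),
    "seat_comfort": (1, 5),
    "value_for_money": (1, 5),
}

def out_of_range_counts(frame, ranges):
    """Count non-null values that fall outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = frame[col].dropna()  # missing ratings are not violations
        counts[col] = int((~values.between(low, high)).sum())
    return counts

violations = out_of_range_counts(toy, expected_ranges)
print(violations)
```

Missing values are excluded before the comparison, so a column with many nulls is not reported as out of range.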




In [16]:
# Importing necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math
import scipy.stats as stat
from datetime import *
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
# Mounting drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Reading dataset.
path = '/content/drive/MyDrive/Capstone Projects/Machine Learning (Classification)/Airline Passenger Referral Prediction/data_airline_reviews.xlsx'
df = pd.read_excel(path)

In [6]:
# Shape of the dataset.
df.shape

# ROWS = 131895 & COLUMNS = 17

(131895, 17)

In [4]:
# Glimpse of the dataset.
df.head()

,airline,overall,author,review_date,customer_review,aircraft,traveller_type,cabin,route,date_flown,seat_comfort,cabin_service,food_bev,entertainment,ground_service,value_for_money,recommended
0,,,,,,,,,,,,,,,,,
1,Turkish Airlines,7.0,Christopher Hackley,8th May 2019,✅ Trip Verified | London to Izmir via Istanb...,,Business,Economy Class,London to Izmir via Istanbul,2019-05-01 00:00:00,4.0,5.0,4.0,4.0,2.0,4.0,yes
2,,,,,,,,,,,,,,,,,
3,Turkish Airlines,2.0,Adriana Pisoi,7th May 2019,✅ Trip Verified | Istanbul to Bucharest. We ...,,Family Leisure,Economy Class,Istanbul to Bucharest,2019-05-01 00:00:00,4.0,1.0,1.0,1.0,1.0,1.0,no
4,,,,,,,,,,,,,,,,,


In [12]:
# Columns.
df.columns

Index(['airline', 'overall', 'author', 'review_date', 'customer_review',
       'aircraft', 'traveller_type', 'cabin', 'route', 'date_flown',
       'seat_comfort', 'cabin_service', 'food_bev', 'entertainment',
       'ground_service', 'value_for_money', 'recommended'],
      dtype='object')

In [21]:
# Data description.
df.describe()

,overall,seat_comfort,cabin_service,food_bev,entertainment,ground_service,value_for_money
count,64017.0,60681.0,60715.0,52608.0,44193.0,39358.0,63975.0
mean,5.14543,2.95216,3.191814,2.90817,2.863372,2.69282,2.943962
std,3.477532,1.441362,1.565789,1.481893,1.507262,1.612215,1.58737
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,1.0,1.0,2.0,1.0,1.0,1.0,1.0
50%,5.0,3.0,3.0,3.0,3.0,3.0,3.0
75%,9.0,4.0,5.0,4.0,4.0,4.0,4.0
max,10.0,5.0,5.0,5.0,5.0,5.0,5.0


In [14]:
# Dataset information.
df.info()

# The dataset has a large number of missing values: no column has more than 65947 non-null entries out of 131895 rows.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131895 entries, 0 to 131894
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   airline          65947 non-null  object 
 1   overall          64017 non-null  float64
 2   author           65947 non-null  object 
 3   review_date      65947 non-null  object 
 4   customer_review  65947 non-null  object 
 5   aircraft         19718 non-null  object 
 6   traveller_type   39755 non-null  object 
 7   cabin            63303 non-null  object 
 8   route            39726 non-null  object 
 9   date_flown       39633 non-null  object 
 10  seat_comfort     60681 non-null  float64
 11  cabin_service    60715 non-null  float64
 12  food_bev         52608 non-null  float64
 13  entertainment    44193 non-null  float64
 14  ground_service   39358 non-null  float64
 15  value_for_money  63975 non-null  float64
 16  recommended      64440 non-null  object 
dtypes: float64(7), object(10)

In [18]:
# Count of null or missing values.
df.isnull().sum()

airline             65948
overall             67878
author              65948
review_date         65948
customer_review     65948
aircraft           112177
traveller_type      92140
cabin               68592
route               92169
date_flown          92262
seat_comfort        71214
cabin_service       71180
food_bev            79287
entertainment       87702
ground_service      92537
value_for_money     67920
recommended         67455
dtype: int64

In [27]:
df.isnull().sum().sum()

# So, the dataset has 1326305 null values in total.

1326305
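
Raw null counts are easier to compare as a share of the rows. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `summary` layout are assumptions for illustration, not part of the original notebook). It uses `DataFrame.isnull().mean()` to get the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "airline": ["A", None, "B", None],
    "overall": [7.0, np.nan, np.nan, np.nan],
})

null_counts = toy.isnull().sum()       # nulls per column
null_pct = toy.isnull().mean() * 100   # share of rows that are null, in percent

# Combine both views and list the most incomplete columns first.
summary = pd.DataFrame({"null_count": null_counts, "null_pct": null_pct.round(2)})
summary = summary.sort_values("null_pct", ascending=False)
print(summary)

total_nulls = int(null_counts.sum())   # same quantity as isnull().sum().sum()
print(total_nulls)
```

Applied to `df`, the same two calls would show which columns are missing in most rows, which is one way to decide what to drop.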

In [67]:
a_df = df.copy()

# Making a copy of the dataset so as to keep the original data intact.

## **DATA CLEANING & PRE-PROCESSING**

In [68]:
# Removing unnecessary columns.
airline_df = a_df.drop(['author','aircraft','date_flown','route','review_date','customer_review'],axis = 1)

# These columns are not used in the predictive analysis, and several of them (aircraft, route, date_flown) are mostly null, so they have been removed.

In [69]:
# Dropping rows in which every column is null.
airline_df = airline_df.loc[~airline_df.isnull().all(axis=1),:]

# Removed rows with no values.
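
The boolean mask above keeps every row that has at least one non-null value. As a hedged illustration on a hypothetical toy frame (not the notebook's data), the sketch below shows that `DataFrame.dropna(how='all')` gives the same result, and that partially filled rows are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are entirely null, row 3 is partially null.
toy = pd.DataFrame({
    "airline": [None, "A", None, "B"],
    "overall": [np.nan, 7.0, np.nan, np.nan],
})

# Approach used in the cell above: keep rows that are not all-null.
mask_result = toy.loc[~toy.isnull().all(axis=1), :]

# Equivalent built-in: drop rows only when every column is null.
dropna_result = toy.dropna(how="all")

print(mask_result.equals(dropna_result))  # True
print(dropna_result.index.tolist())       # [1, 3]
```

Both forms preserve the original index labels, which is why the cleaned frame keeps labels such as 1, 3, 5 rather than being renumbered from 0.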

In [71]:
# New shape of the dataset.
airline_df.shape

# ROWS = 65947 & COLUMNS = 11

(65947, 11)

In [72]:
airline_df.isnull().sum()

# We still have some missing values to deal with.

airline                0
overall             1930
traveller_type     26192
cabin               2644
seat_comfort        5266
cabin_service       5232
food_bev           13339
entertainment      21754
ground_service     26589
value_for_money     1972
recommended         1507
dtype: int64

In [74]:
airline_df.head()

,airline,overall,traveller_type,cabin,seat_comfort,cabin_service,food_bev,entertainment,ground_service,value_for_money,recommended
1,Turkish Airlines,7.0,Business,Economy Class,4.0,5.0,4.0,4.0,2.0,4.0,yes
3,Turkish Airlines,2.0,Family Leisure,Economy Class,4.0,1.0,1.0,1.0,1.0,1.0,no
5,Turkish Airlines,3.0,Business,Economy Class,1.0,4.0,1.0,3.0,1.0,2.0,no
7,Turkish Airlines,10.0,Solo Leisure,Economy Class,4.0,5.0,5.0,5.0,5.0,5.0,yes
9,Turkish Airlines,1.0,Solo Leisure,Economy Class,1.0,1.0,1.0,1.0,1.0,1.0,no
