#### INTRODUCTION
The objective of this study is to analyse the flight booking dataset obtained from the 'Easemytrip' website and to conduct statistical hypothesis tests in order to extract meaningful information from it. The Linear Regression algorithm will be trained on the dataset to predict a continuous target variable, the ticket price. 'Easemytrip' is an internet platform for booking flight tickets, and hence a platform that potential passengers use to buy tickets. A thorough study of the data should reveal insights that are valuable to passengers.

#### Research Questions
The aim of our study is to answer the below research questions:
a) Does price vary with Airlines?
b) How is the price affected when tickets are bought in just 1 or 2 days before departure?
c) Does ticket price change based on the departure time and arrival time?
d) How does the price change with the Source and Destination?
e) How does the ticket price vary between Economy and Business class?
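
Most of these questions come down to comparing the price distribution across the values of one categorical column. A minimal sketch of that comparison, assuming a cleaned frame with a numeric `price` column; the helper name `price_summary` and the toy rows are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

def price_summary(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Count, mean and median of price for every value of the `by` column."""
    summary = frame.groupby(by)["price"].agg(["count", "mean", "median"])
    # Most expensive group first, so differences are easy to read off.
    return summary.sort_values("mean", ascending=False)

# Made-up rows purely for illustration (hypothetical airlines "A" and "B").
toy = pd.DataFrame({
    "airline": ["A", "A", "B", "B"],
    "class": ["Economy", "Business", "Economy", "Business"],
    "price": [4000, 30000, 5000, 40000],
})

by_airline = price_summary(toy, "airline")  # relates to question a)
by_class = price_summary(toy, "class")      # relates to question e)
print(by_airline)
print(by_class)
```

A summary table like this only describes the sample; whether a difference is statistically significant still has to be checked with a hypothesis test.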

#### DATA COLLECTION AND METHODOLOGY
The Octoparse scraping tool was used to extract data from the website. Data was collected in two parts: one for economy class tickets and another for business class tickets. A total of 300261 distinct flight booking options were extracted from the site. Data was collected for 50 days, from February 11th to March 31st, 2022.
The data is secondary data collected from the Easemytrip website.
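
Because the data arrives in two parts, the two tables have to be stacked before economy and business prices can be compared. A minimal sketch, assuming both parts share the same columns; the helper name `combine_classes` and the small stand-in frames are illustrative only:

```python
import pandas as pd

def combine_classes(economy: pd.DataFrame, business: pd.DataFrame) -> pd.DataFrame:
    """Stack the two scraped tables and record the seat class of every row."""
    economy = economy.assign(**{"class": "Economy"})
    business = business.assign(**{"class": "Business"})
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat([economy, business], ignore_index=True)

# Tiny stand-in frames with made-up values, used in place of the real CSV files.
economy = pd.DataFrame({"airline": ["Airline X", "Airline Y"], "price": [5000, 6000]})
business = pd.DataFrame({"airline": ["Airline Z"], "price": [25000]})

combined = combine_classes(economy, business)
print(combined)
```

Adding the `class` column before concatenating keeps the origin of each row, which would otherwise be lost once the two files are merged.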

#### DATASET
The dataset contains information about flight booking options from the Easemytrip website for flight travel between India's top 6 metro cities. There are 300261 datapoints and 11 features in the cleaned dataset.

#### FEATURES
The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable; it stores the ticket price.
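
The two kinds of derived features described above (time-of-day bins and days left) can be sketched as follows. The bin edges, the label names and the booking date used here are assumptions; the description only states that there are six time labels and that days left is the trip date minus the booking date:

```python
import pandas as pd

# Assumed bin edges (hour of day) and label names, chosen for illustration.
EDGES = [0, 4, 8, 12, 16, 20, 24]
LABELS = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]

def time_of_day(hhmm: pd.Series) -> pd.Series:
    """Map 'HH:MM' strings to one of six time-of-day labels."""
    hours = hhmm.str.split(":").str[0].astype(int)
    # right=False makes each bin include its left edge, e.g. [16, 20).
    return pd.cut(hours, bins=EDGES, labels=LABELS, right=False)

def days_left(trip_date: pd.Series, booking_date: pd.Series) -> pd.Series:
    """Whole days between booking and travel, for 'DD-MM-YYYY' strings."""
    trip = pd.to_datetime(trip_date, format="%d-%m-%Y")
    booked = pd.to_datetime(booking_date, format="%d-%m-%Y")
    return (trip - booked).dt.days

dep = pd.Series(["18:00", "07:05", "00:30"])
print(time_of_day(dep).tolist())

trip = pd.Series(["11-02-2022", "31-03-2022"])
booked = pd.Series(["10-02-2022", "10-02-2022"])  # hypothetical booking date
print(days_left(trip, booked).tolist())
```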

To boost learning, try to create an end-to-end project using the dataset.

In [49]:
import pandas as pd

df = pd.read_csv('flight_data/business.csv')

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93487 entries, 0 to 93486
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        93487 non-null  object
 1   airline     93487 non-null  object
 2   ch_code     93487 non-null  object
 3   num_code    93487 non-null  int64 
 4   dep_time    93487 non-null  object
 5   from        93487 non-null  object
 6   time_taken  93487 non-null  object
 7   stop        93487 non-null  object
 8   arr_time    93487 non-null  object
 9   to          93487 non-null  object
 10  price       93487 non-null  object
dtypes: int64(1), object(10)
memory usage: 7.8+ MB


In [51]:
df

Unnamed: 0,date,airline,ch_code,num_code,dep_time,from,time_taken,stop,arr_time,to,price
0,11-02-2022,Air India,AI,868,18:00,Delhi,02h 00m,non-stop,20:00,Mumbai,25612
1,11-02-2022,Air India,AI,624,19:00,Delhi,02h 15m,non-stop,21:15,Mumbai,25612
2,11-02-2022,Air India,AI,531,20:00,Delhi,24h 45m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:45,Mumbai,42220
3,11-02-2022,Air India,AI,839,21:25,Delhi,26h 30m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,44450
4,11-02-2022,Air India,AI,544,17:15,Delhi,06h 40m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,46690
...,...,...,...,...,...,...,...,...,...,...,...
93482,31-03-2022,Vistara,UK,822,09:45,Chennai,10h 05m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,19:50,Hyderabad,69265
93483,31-03-2022,Vistara,UK,826,12:30,Chennai,10h 25m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,22:55,Hyderabad,77105
93484,31-03-2022,Vistara,UK,832,07:05,Chennai,13h 50m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:55,Hyderabad,79099
93485,31-03-2022,Vistara,UK,828,07:00,Chennai,10h 00m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,17:00,Hyderabad,81585


In [52]:
#checking if this has any missing values

df.isnull().sum()

date          0
airline       0
ch_code       0
num_code      0
dep_time      0
from          0
time_taken    0
stop          0
arr_time      0
to            0
price         0
dtype: int64

In [53]:
#checking for duplicate records (the empty result below means there are none to remove)

df[df.duplicated()]

Unnamed: 0,date,airline,ch_code,num_code,dep_time,from,time_taken,stop,arr_time,to,price
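
The cell above only displays duplicated rows. Had any been found, they could be removed along these lines. This is a minimal sketch; the helper name `drop_exact_duplicates` and the toy frame are made up for illustration:

```python
import pandas as pd

def drop_exact_duplicates(frame: pd.DataFrame):
    """Return the frame without fully identical rows, plus how many were dropped."""
    n_dupes = int(frame.duplicated().sum())
    # The first occurrence of each row is kept; the index is rebuilt afterwards.
    cleaned = frame.drop_duplicates().reset_index(drop=True)
    return cleaned, n_dupes

# Made-up rows: the second row repeats the first one exactly.
toy = pd.DataFrame({
    "airline": ["A", "A", "B"],
    "price": [100, 100, 200],
})
cleaned, n_dupes = drop_exact_duplicates(toy)
print(n_dupes)
print(cleaned)
```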


In [54]:
#taking date column and dividing the values in date,month,year columns
df['Date'] = df['date'].str.split('-').str[0]
df['Month'] = df['date'].str.split('-').str[1]
df['Year'] = df['date'].str.split('-').str[2]

df.drop('date', axis=1, inplace=True)
df

Unnamed: 0,airline,ch_code,num_code,dep_time,from,time_taken,stop,arr_time,to,price,Date,Month,Year
0,Air India,AI,868,18:00,Delhi,02h 00m,non-stop,20:00,Mumbai,25612,11,02,2022
1,Air India,AI,624,19:00,Delhi,02h 15m,non-stop,21:15,Mumbai,25612,11,02,2022
2,Air India,AI,531,20:00,Delhi,24h 45m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:45,Mumbai,42220,11,02,2022
3,Air India,AI,839,21:25,Delhi,26h 30m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,44450,11,02,2022
4,Air India,AI,544,17:15,Delhi,06h 40m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,46690,11,02,2022
...,...,...,...,...,...,...,...,...,...,...,...,...,...
93482,Vistara,UK,822,09:45,Chennai,10h 05m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,19:50,Hyderabad,69265,31,03,2022
93483,Vistara,UK,826,12:30,Chennai,10h 25m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,22:55,Hyderabad,77105,31,03,2022
93484,Vistara,UK,832,07:05,Chennai,13h 50m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:55,Hyderabad,79099,31,03,2022
93485,Vistara,UK,828,07:00,Chennai,10h 00m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,17:00,Hyderabad,81585,31,03,2022
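
An alternative to splitting the string by hand is to parse the column as a date, which also validates the values. A minimal sketch on two sample dates in the same `DD-MM-YYYY` layout as the `date` column; it is shown as an option and is not the approach taken in this notebook:

```python
import pandas as pd

sample = pd.DataFrame({"date": ["11-02-2022", "31-03-2022"]})

# An explicit format avoids day/month ambiguity and raises on malformed dates.
parsed = pd.to_datetime(sample["date"], format="%d-%m-%Y")

sample["Date"] = parsed.dt.day
sample["Month"] = parsed.dt.month
sample["Year"] = parsed.dt.year
sample = sample.drop(columns="date")
print(sample)
```

The resulting columns are already integers, so no later `astype(int)` conversion is needed for them.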


In [55]:
#doing same to dep_time

df['dep_hr'] = df['dep_time'].str.split(':').str[0]
df['dep_min'] = df['dep_time'].str.split(':').str[1]
df.drop('dep_time', axis=1, inplace=True)
df

Unnamed: 0,airline,ch_code,num_code,from,time_taken,stop,arr_time,to,price,Date,Month,Year,dep_hr,dep_min
0,Air India,AI,868,Delhi,02h 00m,non-stop,20:00,Mumbai,25612,11,02,2022,18,00
1,Air India,AI,624,Delhi,02h 15m,non-stop,21:15,Mumbai,25612,11,02,2022,19,00
2,Air India,AI,531,Delhi,24h 45m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:45,Mumbai,42220,11,02,2022,20,00
3,Air India,AI,839,Delhi,26h 30m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,44450,11,02,2022,21,25
4,Air India,AI,544,Delhi,06h 40m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,46690,11,02,2022,17,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93482,Vistara,UK,822,Chennai,10h 05m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,19:50,Hyderabad,69265,31,03,2022,09,45
93483,Vistara,UK,826,Chennai,10h 25m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,22:55,Hyderabad,77105,31,03,2022,12,30
93484,Vistara,UK,832,Chennai,13h 50m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:55,Hyderabad,79099,31,03,2022,07,05
93485,Vistara,UK,828,Chennai,10h 00m,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,17:00,Hyderabad,81585,31,03,2022,07,00


In [56]:
#same for time_taken
df['time_taken_hr'] = df['time_taken'].str.replace('h','').str.split(' ').str[0]
df['time_taken_min'] = df['time_taken'].str.replace('m','').str.split(' ').str[1]
df.drop('time_taken', axis=1, inplace=True)
df

Unnamed: 0,airline,ch_code,num_code,from,stop,arr_time,to,price,Date,Month,Year,dep_hr,dep_min,time_taken_hr,time_taken_min
0,Air India,AI,868,Delhi,non-stop,20:00,Mumbai,25612,11,02,2022,18,00,02,00
1,Air India,AI,624,Delhi,non-stop,21:15,Mumbai,25612,11,02,2022,19,00,02,15
2,Air India,AI,531,Delhi,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:45,Mumbai,42220,11,02,2022,20,00,24,45
3,Air India,AI,839,Delhi,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,44450,11,02,2022,21,25,26,30
4,Air India,AI,544,Delhi,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,23:55,Mumbai,46690,11,02,2022,17,15,06,40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93482,Vistara,UK,822,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,19:50,Hyderabad,69265,31,03,2022,09,45,10,05
93483,Vistara,UK,826,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,22:55,Hyderabad,77105,31,03,2022,12,30,10,25
93484,Vistara,UK,832,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,20:55,Hyderabad,79099,31,03,2022,07,05,13,50
93485,Vistara,UK,828,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,17:00,Hyderabad,81585,31,03,2022,07,00,10,00
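
The feature list describes Duration as a single continuous value in hours, whereas the cell above keeps hours and minutes apart. A minimal sketch of one way to get the combined value directly from strings such as `02h 15m`; the helper name `duration_hours` is an assumption:

```python
import pandas as pd

def duration_hours(time_taken: pd.Series) -> pd.Series:
    """Convert strings like '02h 15m' into a float number of hours."""
    # Named groups become the columns 'h' and 'm'; non-matching rows become NaN.
    parts = time_taken.str.extract(r"(?P<h>\d+)h\s*(?P<m>\d+)m").astype(float)
    return parts["h"] + parts["m"] / 60

sample = pd.Series(["02h 00m", "02h 15m", "24h 45m"])
hours = duration_hours(sample)
print(hours.tolist())
```

Rows that do not match the pattern come out as NaN instead of raising, so they can be inspected with `isnull()` afterwards.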


In [57]:
#same for arrival_time

df['arrival_time_hr'] = df['arr_time'].str.split(':').str[0]
df['arrival_time_min'] = df['arr_time'].str.split(':').str[1]
df.drop('arr_time', axis=1, inplace=True)
df

Unnamed: 0,airline,ch_code,num_code,from,stop,to,price,Date,Month,Year,dep_hr,dep_min,time_taken_hr,time_taken_min,arrival_time_hr,arrival_time_min
0,Air India,AI,868,Delhi,non-stop,Mumbai,25612,11,02,2022,18,00,02,00,20,00
1,Air India,AI,624,Delhi,non-stop,Mumbai,25612,11,02,2022,19,00,02,15,21,15
2,Air India,AI,531,Delhi,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Mumbai,42220,11,02,2022,20,00,24,45,20,45
3,Air India,AI,839,Delhi,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Mumbai,44450,11,02,2022,21,25,26,30,23,55
4,Air India,AI,544,Delhi,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Mumbai,46690,11,02,2022,17,15,06,40,23,55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93482,Vistara,UK,822,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Hyderabad,69265,31,03,2022,09,45,10,05,19,50
93483,Vistara,UK,826,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Hyderabad,77105,31,03,2022,12,30,10,25,22,55
93484,Vistara,UK,832,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Hyderabad,79099,31,03,2022,07,05,13,50,20,55
93485,Vistara,UK,828,Chennai,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t...,Hyderabad,81585,31,03,2022,07,00,10,00,17,00


In [69]:
#extracting the number of stops from the stop column ('non-stop' becomes 0)

df['number_of_stops'] = df['stop'].str.replace('+','', regex=False).str.replace('non','0', regex=False).str.split('-').str[0]
df.drop('stop', axis=1, inplace=True)
df

Unnamed: 0,airline,ch_code,num_code,from,to,price,Date,Month,Year,dep_hr,dep_min,time_taken_hr,time_taken_min,arrival_time_hr,arrival_time_min,number_of_stops
0,Air India,AI,868,Delhi,Mumbai,25612,11,02,2022,18,00,02,00,20,00,0
1,Air India,AI,624,Delhi,Mumbai,25612,11,02,2022,19,00,02,15,21,15,0
2,Air India,AI,531,Delhi,Mumbai,42220,11,02,2022,20,00,24,45,20,45,1
3,Air India,AI,839,Delhi,Mumbai,44450,11,02,2022,21,25,26,30,23,55,1
4,Air India,AI,544,Delhi,Mumbai,46690,11,02,2022,17,15,06,40,23,55,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93482,Vistara,UK,822,Chennai,Hyderabad,69265,31,03,2022,09,45,10,05,19,50,1
93483,Vistara,UK,826,Chennai,Hyderabad,77105,31,03,2022,12,30,10,25,22,55,1
93484,Vistara,UK,832,Chennai,Hyderabad,79099,31,03,2022,07,05,13,50,20,55,1
93485,Vistara,UK,828,Chennai,Hyderabad,81585,31,03,2022,07,00,10,00,17,00,1
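
The same extraction can be written with a single regular expression that ignores the trailing whitespace seen in the scraped `stop` values. A minimal sketch; the helper name `stops_count` is an assumption, and the `2+-stop` sample value is inferred from the `'+'` replacement in the cell above and does not appear in the displayed rows:

```python
import pandas as pd

def stops_count(stop: pd.Series) -> pd.Series:
    """Turn 'non-stop', '1-stop...', '2+-stop' into the integers 0, 1, 2."""
    cleaned = stop.str.replace("non-stop", "0-stop", regex=False)
    # Take the leading digits only; anything after them (tabs, newlines) is ignored.
    return cleaned.str.extract(r"^\s*(\d+)", expand=False).astype(int)

sample = pd.Series(["non-stop", "1-stop\n\t\t\t", "2+-stop"])
stops = stops_count(sample)
print(stops.tolist())
```

If a value contains no leading digit, the final `astype(int)` raises, which surfaces unexpected labels early.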


In [70]:
#removing the thousands separators from price

df['price'] = df['price'].str.replace(',','', regex=False)
df

Unnamed: 0,airline,ch_code,num_code,from,to,price,Date,Month,Year,dep_hr,dep_min,time_taken_hr,time_taken_min,arrival_time_hr,arrival_time_min,number_of_stops
0,Air India,AI,868,Delhi,Mumbai,25612,11,02,2022,18,00,02,00,20,00,0
1,Air India,AI,624,Delhi,Mumbai,25612,11,02,2022,19,00,02,15,21,15,0
2,Air India,AI,531,Delhi,Mumbai,42220,11,02,2022,20,00,24,45,20,45,1
3,Air India,AI,839,Delhi,Mumbai,44450,11,02,2022,21,25,26,30,23,55,1
4,Air India,AI,544,Delhi,Mumbai,46690,11,02,2022,17,15,06,40,23,55,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93482,Vistara,UK,822,Chennai,Hyderabad,69265,31,03,2022,09,45,10,05,19,50,1
93483,Vistara,UK,826,Chennai,Hyderabad,77105,31,03,2022,12,30,10,25,22,55,1
93484,Vistara,UK,832,Chennai,Hyderabad,79099,31,03,2022,07,05,13,50,20,55,1
93485,Vistara,UK,828,Chennai,Hyderabad,81585,31,03,2022,07,00,10,00,17,00,1
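
When it is unclear whether every price is well formed, `pd.to_numeric` with `errors="coerce"` converts what it can and marks the rest as missing. A minimal sketch; the helper name `clean_price` and the sample strings (including the deliberately malformed `n/a`) are made up for illustration:

```python
import pandas as pd

def clean_price(price: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to numbers; bad values become NaN."""
    digits = price.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(digits, errors="coerce")

sample = pd.Series(["25,612", "42,220", "n/a"])
prices = clean_price(sample)
print(prices)
print(int(prices.isnull().sum()))  # number of values that could not be parsed
```

The result is a float column while NaN values are present; once those rows are handled it can be cast with `astype(int)`.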


In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93487 entries, 0 to 93486
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   airline           93487 non-null  object
 1   ch_code           93487 non-null  object
 2   num_code          93487 non-null  int64 
 3   from              93487 non-null  object
 4   to                93487 non-null  object
 5   price             93487 non-null  object
 6   Date              93487 non-null  object
 7   Month             93487 non-null  object
 8   Year              93487 non-null  object
 9   dep_hr            93487 non-null  object
 10  dep_min           93487 non-null  object
 11  time_taken_hr     93487 non-null  object
 12  time_taken_min    93487 non-null  object
 13  arrival_time_hr   93487 non-null  object
 14  arrival_time_min  93487 non-null  object
 15  number_of_stops   93487 non-null  object
dtypes: int64(1), object(15)
memory usage: 11.4+ MB


In [73]:
#converting the cleaned string columns to integers in one step
int_cols = ['price', 'Date', 'Month', 'Year', 'dep_hr', 'dep_min',
            'time_taken_hr', 'time_taken_min', 'arrival_time_hr',
            'arrival_time_min', 'number_of_stops']
df[int_cols] = df[int_cols].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93487 entries, 0 to 93486
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   airline           93487 non-null  object
 1   ch_code           93487 non-null  object
 2   num_code          93487 non-null  int64 
 3   from              93487 non-null  object
 4   to                93487 non-null  object
 5   price             93487 non-null  int64 
 6   Date              93487 non-null  int64 
 7   Month             93487 non-null  int64 
 8   Year              93487 non-null  int64 
 9   dep_hr            93487 non-null  int64 
 10  dep_min           93487 non-null  int64 
 11  time_taken_hr     93487 non-null  int64 
 12  time_taken_min    93487 non-null  int64 
 13  arrival_time_hr   93487 non-null  int64 
 14  arrival_time_min  93487 non-null  int64 
 15  number_of_stops   93487 non-null  int64 
dtypes: int64(12), object(4)
memory usage: 11.4+ MB


In [75]:
#still no missing values

df.isnull().sum()

airline             0
ch_code             0
num_code            0
from                0
to                  0
price               0
Date                0
Month               0
Year                0
dep_hr              0
dep_min             0
time_taken_hr       0
time_taken_min      0
arrival_time_hr     0
arrival_time_min    0
number_of_stops     0
dtype: int64
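
A null check does not catch values that are present but implausible, such as an hour of 24. A minimal sketch of a range check over the integer columns created above; the bounds and the helper name `out_of_range` are assumptions, and the toy frame contains one deliberately invalid hour:

```python
import pandas as pd

# Assumed valid ranges (inclusive) for the derived integer columns.
RANGES = {
    "Date": (1, 31), "Month": (1, 12),
    "dep_hr": (0, 23), "dep_min": (0, 59),
    "arrival_time_hr": (0, 23), "arrival_time_min": (0, 59),
    "time_taken_min": (0, 59),
}

def out_of_range(frame: pd.DataFrame) -> dict:
    """Count out-of-range values for every known column present in the frame."""
    counts = {}
    for col, (low, high) in RANGES.items():
        if col in frame.columns:
            counts[col] = int((~frame[col].between(low, high)).sum())
    return counts

toy = pd.DataFrame({"dep_hr": [18, 24], "dep_min": [0, 15]})
report = out_of_range(toy)
print(report)
```

Calling `out_of_range(df)` on the cleaned frame would return a count per column, where any non-zero entry points at rows worth inspecting.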