### FEATURES
The features of the cleaned dataset are described below:
1) Airline: The name of the airline company. It is a categorical feature with 6 distinct airlines.
2) Flight: The flight code of the plane. It is a categorical feature.
3) Source City: The city from which the flight takes off. It is a categorical feature with 6 unique cities.
4) Departure Time: A derived categorical feature created by grouping departure times into bins. It has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: A derived categorical feature created by grouping arrival times into bins. It has 6 unique time labels.
7) Destination City: The city where the flight lands. It is a categorical feature with 6 unique cities.
8) Class: A categorical feature for the seat class, with two distinct values: Business and Economy.
9) Duration: A continuous feature giving the total travel time between the cities, in hours.
10) Days Left: A derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable, which stores the ticket price.
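
The binned time features and Days Left described in the list can be derived roughly as follows. This is only a minimal sketch: the bin edges, the label names, and the columns `dep_hour`, `booking_date`, and `trip_date` are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative only.
sample = pd.DataFrame({
    "dep_hour": [2, 5, 9, 13, 18, 22],
    "booking_date": pd.to_datetime(["2022-02-01"] * 6),
    "trip_date": pd.to_datetime(["2022-02-11", "2022-02-02", "2022-03-01",
                                 "2022-02-15", "2022-02-21", "2022-02-03"]),
})

# Assumed bin edges (hours) and labels: six left-closed bins covering 0-24.
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]
sample["departure_time"] = pd.cut(sample["dep_hour"], bins=edges, labels=labels, right=False)

# Days left = trip date minus booking date, in whole days.
sample["days_left"] = (sample["trip_date"] - sample["booking_date"]).dt.days
```

With `right=False` each bin includes its lower edge, so an hour of exactly 4 falls into the second bin rather than the first.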

In [149]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [150]:
df=pd.read_excel("flight_data.xlsx")
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [151]:
df.duplicated().sum()

220

In [152]:
df.drop_duplicates(inplace=True)

In [153]:
df.duplicated().sum()

0

In [154]:
df.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64
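
Before deciding how to handle the missing values counted above, it can help to look at the affected rows. A minimal sketch on a small hypothetical frame (the data below is made up; only the column names `Airline`, `Route`, and `Total_Stops` come from the notebook):

```python
import numpy as np
import pandas as pd

# Toy data with one row missing both Route and Total_Stops.
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Route": ["BLR → DEL", np.nan, "DEL → BOM → COK"],
    "Total_Stops": ["non-stop", np.nan, "1 stop"],
})

# Rows with at least one missing value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# Drop rows missing either column; this returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna(subset=["Route", "Total_Stops"])
```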

In [155]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10463 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10463 non-null  object
 1   Date_of_Journey  10463 non-null  object
 2   Source           10463 non-null  object
 3   Destination      10463 non-null  object
 4   Route            10462 non-null  object
 5   Dep_Time         10463 non-null  object
 6   Arrival_Time     10463 non-null  object
 7   Duration         10463 non-null  object
 8   Total_Stops      10462 non-null  object
 9   Additional_Info  10463 non-null  object
 10  Price            10463 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 980.9+ KB


In [156]:
# Exploratory slices of the "d/mm/yyyy" journey-date strings; these lists are not used later.
journey_dates=df["Date_of_Journey"].tolist()
all_date=[date[:-8] for date in journey_dates]
all_month=[date[-7:-5] for date in journey_dates]
all_year=[date[-4:] for date in journey_dates]

In [157]:
df["day_of_journy"]=df["Date_of_Journey"].str[:-8].astype(int)
df["month_of_journy"]=df["Date_of_Journey"].str[-7:-5].astype(int)
df["year_of_journy"]=df["Date_of_Journey"].str[-4:].astype(int)
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019
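
The slicing above depends on the exact `d/mm/yyyy` layout of the strings. As an alternative sketch (not what the notebook does), `pd.to_datetime` with an explicit format can parse the column first. The toy frame below is an assumption that reuses three values from the output above.

```python
import pandas as pd

toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Day-first format; %d also accepts single-digit days such as "1".
parsed = pd.to_datetime(toy["Date_of_Journey"], format="%d/%m/%Y")

# Same column names as in the notebook, filled from the parsed dates.
toy["day_of_journy"] = parsed.dt.day
toy["month_of_journy"] = parsed.dt.month
toy["year_of_journy"] = parsed.dt.year
```

Parsing first means a malformed date raises an error instead of silently producing a wrong slice.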


In [158]:
df.drop(columns=["Date_of_Journey"],inplace=True)

In [159]:
df.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019


In [160]:
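# Keep only the "HH:MM" part; next-day arrivals carry a date suffix such as "01:10 22 Mar".
df["Arrival_Time"]=df["Arrival_Time"].str.split(" ").str[0]
# Note: arrival_Hour is not cast here, so it stays a string (object dtype); arrival_Minute is cast to int.
df["arrival_Hour"]=(df["Arrival_Time"].str.split(":").str[0])
df["arrival_Minute"]=df["Arrival_Time"].str.split(":").str[1].astype(int)
df.head()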

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy,arrival_Hour,arrival_Minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19h,2 stops,No info,13882,9,6,2019,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019,21,35
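
Some `Arrival_Time` values carry a date suffix (for example `01:10 22 Mar`), which the split on a space discards. A hedged sketch that pulls out hour and minute as integers in one pass and also keeps a flag for the suffix. The `has_date_suffix` column is a hypothetical addition, not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})

# Named groups become column names; anchoring at the start ignores any date suffix.
parts = toy["Arrival_Time"].str.extract(
    r"^(?P<arrival_Hour>\d{1,2}):(?P<arrival_Minute>\d{2})"
)
toy = toy.join(parts.astype(int))

# Hypothetical extra feature: True when the value included a date suffix.
toy["has_date_suffix"] = toy["Arrival_Time"].str.contains(" ", regex=False)
```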


In [161]:
df.columns

Index(['Airline', 'Source', 'Destination', 'Route', 'Dep_Time', 'Arrival_Time',
       'Duration', 'Total_Stops', 'Additional_Info', 'Price', 'day_of_journy',
       'month_of_journy', 'year_of_journy', 'arrival_Hour', 'arrival_Minute'],
      dtype='object')

In [162]:
if "Arrival_Time" in df.columns:
    df.drop(columns=["Arrival_Time"], inplace=True)
df.head(3)

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy,arrival_Hour,arrival_Minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,9,6,2019,4,25


In [163]:
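def to_minute(s: str) -> int:
    """Convert a duration such as "2h 50m" or "19h" to total minutes."""
    minutes = 0
    for part in s.split():
        if part[-1] == "h":
            minutes += int(part[:-1]) * 60  # hours part, e.g. "2h"
        else:
            minutes += int(part[:-1])  # assumed to be the minutes part, e.g. "50m"
    return minutes

df["Total_minute"]=df["Duration"].apply(to_minute)
df.head(3)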

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy,arrival_Hour,arrival_Minute,Total_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15,445
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,9,6,2019,4,25,1140
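
The loop in `to_minute` can also be expressed without `apply`. A minimal vectorized sketch, assuming every value looks like `Xh Ym`, `Xh`, or `Ym` (the `5m` entry below is a made-up edge case; the other three values come from the output above):

```python
import pandas as pd

durations = pd.Series(["2h 50m", "7h 25m", "19h", "5m"])

# Optional hour and minute parts; a part that is absent is captured as NaN.
parts = durations.str.extract(r"^(?:(?P<hours>\d+)h)?\s*(?:(?P<minutes>\d+)m)?$")

# Convert to numbers, treat absent parts as 0, and combine into total minutes.
parts = parts.apply(pd.to_numeric).fillna(0).astype(int)
total_minutes = parts["hours"] * 60 + parts["minutes"]
```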


In [164]:
df.drop(columns=["Duration"],inplace=True)

In [165]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy,arrival_Hour,arrival_Minute,Total_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,non-stop,No info,3897,24,3,2019,1,10,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,2 stops,No info,7662,1,5,2019,13,15,445


In [166]:
# Keys must match the labels in the data exactly ("1 stop" is singular).
no_of_route={
	"non-stop":0,
	"1 stop":1,
	"2 stops":2,
	"3 stops":3,
	"4 stops":4
}
df["Total_Stops"]=df["Total_Stops"].map(no_of_route).fillna(0)

In [167]:
df.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy,arrival_Hour,arrival_Minute,Total_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,0.0,No info,3897,24,3,2019,1,10,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,2.0,No info,7662,1,5,2019,13,15,445
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,2.0,No info,13882,9,6,2019,4,25,1140
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,1.0,No info,6218,12,5,2019,23,30,325
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,1.0,No info,13302,1,3,2019,21,35,285


In [168]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10463 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Airline          10463 non-null  object 
 1   Source           10463 non-null  object 
 2   Destination      10463 non-null  object 
 3   Route            10462 non-null  object 
 4   Dep_Time         10463 non-null  object 
 5   Total_Stops      10463 non-null  float64
 6   Additional_Info  10463 non-null  object 
 7   Price            10463 non-null  int64  
 8   day_of_journy    10463 non-null  int32  
 9   month_of_journy  10463 non-null  int32  
 10  year_of_journy   10463 non-null  int32  
 11  arrival_Hour     10463 non-null  object 
 12  arrival_Minute   10463 non-null  int32  
 13  Total_minute     10463 non-null  int64  
dtypes: float64(1), int32(4), int64(2), object(7)
memory usage: 1.0+ MB


In [169]:
df["Total_Stops"].unique()

array([0., 2., 1., 3., 4.])
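
Because `.map()` returns NaN for any label missing from the dictionary and `fillna(0)` then turns that NaN into 0, a mistyped key would go unnoticed. A minimal sketch of a check that could run before filling. The toy series and the `stops_map` name are assumptions for illustration.

```python
import pandas as pd

stops = pd.Series(["non-stop", "2 stops", "1 stop", None])
stops_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

mapped = stops.map(stops_map)

# Labels that were present in the data but missing from the dictionary.
unmapped = stops[mapped.isna() & stops.notna()].unique().tolist()
assert not unmapped, f"unmapped labels: {unmapped}"

# Only genuinely missing entries are left to fill.
total_stops = mapped.fillna(0).astype(int)
```

Checking `unmapped` separates values that were missing in the source from values the dictionary failed to cover, which `fillna(0)` alone cannot distinguish.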

In [172]:
df["Total_Stops"]=df["Total_Stops"].astype(int)
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Total_Stops,Additional_Info,Price,day_of_journy,month_of_journy,year_of_journy,arrival_Hour,arrival_Minute,Total_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,0,No info,3897,24,3,2019,1,10,170
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,2,No info,7662,1,5,2019,13,15,445


In [173]:
df.isnull().sum()

Airline            0
Source             0
Destination        0
Route              1
Dep_Time           0
Total_Stops        0
Additional_Info    0
Price              0
day_of_journy      0
month_of_journy    0
year_of_journy     0
arrival_Hour       0
arrival_Minute     0
Total_minute       0
dtype: int64

In [175]:
df.drop(columns=["Route"],inplace=True)

In [176]:
df.isnull().sum()

Airline            0
Source             0
Destination        0
Dep_Time           0
Total_Stops        0
Additional_Info    0
Price              0
day_of_journy      0
month_of_journy    0
year_of_journy     0
arrival_Hour       0
arrival_Minute     0
Total_minute       0
dtype: int64