## EDA and Feature Engineering: Flight Price Prediction
### FEATURES
The features of the cleaned dataset are described below:
1. Airline: The name of the airline company. A categorical feature with 6 distinct airlines.
2. Flight: The plane's flight code. A categorical feature.
3. Source City: The city from which the flight takes off. A categorical feature with 6 unique cities.
4. Departure Time: A derived categorical feature created by grouping departure times into bins. It has 6 unique time labels.
5. Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6. Arrival Time: A derived categorical feature created by grouping arrival times into bins. It has 6 distinct time labels.
7. Destination City: The city where the flight lands. A categorical feature with 6 unique cities.
8. Class: A categorical feature for the seat class, with two distinct values: Business and Economy.
9. Duration: A continuous feature giving the total travel time between the cities, in hours.
10. Days Left: A derived feature calculated by subtracting the booking date from the trip date.
11. Price: The target variable, which stores the ticket price.
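
The two derived features in the list (binned departure/arrival time and Days Left) could be built roughly as follows. This is only a sketch: the bin edges, the label names, and the sample hours and dates are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical departure hours (0-23); not taken from the dataset.
dep_hour = pd.Series([2, 5, 9, 13, 18, 22])

# Assumed bin edges and label names: six 4-hour windows.
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]

# right=False makes each bin half-open: [0, 4), [4, 8), ...
dep_bin = pd.cut(dep_hour, bins=edges, labels=labels, right=False)

# Days Left: trip date minus booking date, in whole days (hypothetical dates).
booking = pd.to_datetime(pd.Series(["2022-02-01", "2022-02-10"]))
trip = pd.to_datetime(pd.Series(["2022-02-11", "2022-02-12"]))
days_left = (trip - booking).dt.days
```

Half-open bins keep hour 0 inside the first window, which the default `right=True` would exclude.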

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [92]:
df=pd.read_excel('flight_price.xlsx')
df.head(2)

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662


In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [94]:
df.isna().sum()

Unnamed: 0,0
Airline,0
Date_of_Journey,0
Source,0
Destination,0
Route,1
Dep_Time,0
Arrival_Time,0
Duration,0
Total_Stops,1
Additional_Info,0


In [95]:
df.describe()

Unnamed: 0,Price
count,10683.0
mean,9087.064121
std,4611.359167
min,1759.0
25%,5277.0
50%,8372.0
75%,12373.0
max,79512.0


In [96]:
df["Date"]=df['Date_of_Journey'].str.split('/').str[0]
df["Month"]=df['Date_of_Journey'].str.split('/').str[1]
df["Year"]=df['Date_of_Journey'].str.split('/').str[2]

In [97]:
df["Date"]=df["Date"].astype(int)
df["Month"]=df["Month"].astype(int)
df["Year"]=df["Year"].astype(int)
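
As an alternative to splitting the string and casting each part, the date column could be parsed once with `pd.to_datetime`. The sketch below uses two sample values in the same day/month/year layout as `Date_of_Journey`; the `Weekday` column is an extra that is not part of the notebook.

```python
import pandas as pd

# Two sample values in the same day/month/year layout as Date_of_Journey.
journey = pd.Series(["24/03/2019", "1/05/2019"])

# An explicit format avoids any day/month ambiguity.
parsed = pd.to_datetime(journey, format="%d/%m/%Y")

date_parts = pd.DataFrame({
    "Date": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})

# Assumed extra feature: day of the week (0 = Monday).
date_parts["Weekday"] = parsed.dt.dayofweek
```

Parsing to a datetime also makes malformed dates fail loudly instead of producing a bad integer.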

In [98]:
df.drop('Date_of_Journey',axis=1,inplace=True)

In [99]:
df["Dep_Hour"]=df['Dep_Time'].str.split(':').str[0]
df["Dep_Min"]=df['Dep_Time'].str.split(':').str[1]
df.drop('Dep_Time',axis=1,inplace=True)

In [100]:
df["Dep_Hour"]=df["Dep_Hour"].astype(int)
df["Dep_Min"]=df["Dep_Min"].astype(int)

In [101]:
df["Duration_hour"]=df["Duration"].str.split(" ").str[0].str.split("h").str[0]

In [102]:
df["Duration_min"]=df["Duration"].str.split(" ").str[1].str.split("m").str[0]

In [103]:
df["Duration_hour"]=df["Duration_hour"].str.split("m").str[0]

In [104]:
df["Duration_min"].isna().sum()

1032

In [105]:
df["Duration_min"]=df["Duration_min"].fillna(0)

In [106]:
df["Duration_hour"]=df["Duration_hour"].astype(int)
df["Duration_min"]=df["Duration_min"].astype(int)

In [107]:
df.drop('Duration',axis=1,inplace=True)
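
One caveat with the split-based parsing above: a duration that has only a minute part (for example a value like `5m`, if one is present) ends up with `Duration_hour` equal to 5 and `Duration_min` equal to 0. A regex-based sketch that handles hour-only and minute-only values separately might look like this; the sample strings are illustrative.

```python
import pandas as pd

# Illustrative durations: full, hour-only, and minute-only.
durations = pd.Series(["2h 50m", "19h", "5m"])

# Capture the number before "h" and the number before "m" independently.
hour_part = durations.str.extract(r"(\d+)\s*h", expand=False)
min_part = durations.str.extract(r"(\d+)\s*m", expand=False)

# A missing part becomes NaN, which is then treated as zero.
hours = pd.to_numeric(hour_part, errors="coerce").fillna(0).astype(int)
minutes = pd.to_numeric(min_part, errors="coerce").fillna(0).astype(int)

# A single numeric duration is often easier for a model to use.
total_minutes = hours * 60 + minutes
```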

In [108]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Arrival_Time,Total_Stops,Additional_Info,Price,Date,Month,Year,Dep_Hour,Dep_Min,Duration_hour,Duration_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,01:10 22 Mar,non-stop,No info,3897,24,3,2019,22,20,2,50
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,13:15,2 stops,No info,7662,1,5,2019,5,50,7,25


In [109]:
df["Arrival_Time"]

Unnamed: 0,Arrival_Time
0,01:10 22 Mar
1,13:15
2,04:25 10 Jun
3,23:30
4,21:35
...,...
10678,22:25
10679,23:20
10680,11:20
10681,14:10


In [113]:
df["Arrival_hour"]= df["Arrival_Time"].str.split(" ").str[0].str.split(":").str[0]
df["Arrival_min"]= df["Arrival_Time"].str.split(" ").str[0].str.split(":").str[1]

In [117]:
df["Arrival_hour"]=df["Arrival_hour"].astype(int)
df["Arrival_min"]=df["Arrival_min"].astype(int)

In [118]:
df.drop('Arrival_Time',axis=1,inplace=True)

In [121]:
df.drop('Route',axis=1,inplace=True)

In [122]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Total_Stops,Additional_Info,Price,Date,Month,Year,Dep_Hour,Dep_Min,Duration_hour,Duration_min,Arrival_hour,Arrival_min
0,IndiGo,Banglore,New Delhi,non-stop,No info,3897,24,3,2019,22,20,2,50,1,10
1,Air India,Kolkata,Banglore,2 stops,No info,7662,1,5,2019,5,50,7,25,13,15


In [123]:
df["Total_Stops"].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [124]:
df["Total_Stops"].mode()

Unnamed: 0,Total_Stops
0,1 stop


In [125]:
# Map stop labels to integers; the missing value is replaced with 1, matching the mode ('1 stop')
df["Total_Stops"]=df["Total_Stops"].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4,np.nan:1})

In [128]:
df["Total_Stops"].unique()

array([0, 2, 1, 3, 4])

In [130]:
df["Total_Stops"].isna().sum()

0
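
The mapping above hard-codes `np.nan: 1` because the mode happens to be `1 stop`. A sketch that derives the fill value from the data instead, shown on a small made-up series, could look like this:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value.
stops = pd.Series(["non-stop", "2 stops", "1 stop", np.nan, "1 stop"])

stop_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

# mode() ignores NaN by default; take the first (most frequent) label.
most_common = stops.mode()[0]

# Fill first, then map, so the dictionary needs no NaN key.
encoded = stops.fillna(most_common).map(stop_map).astype(int)
```

This keeps the imputation correct if the most frequent label changes with new data.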

In [133]:
df["Airline"].unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [134]:
df["Source"].unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

In [135]:
df["Destination"].unique()

array(['New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'],
      dtype=object)

**Apply an encoding technique to convert the categorical data into numerical values**

In [136]:
from sklearn.preprocessing import OneHotEncoder


In [137]:
encoder=OneHotEncoder()

In [139]:
encoder.fit_transform(df[["Airline","Source","Destination"]]).toarray()

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [141]:
df1 = pd.DataFrame(encoder.fit_transform(df[["Airline","Source","Destination"]]).toarray(),columns=encoder.get_feature_names_out())

In [143]:
df = pd.concat([df,df1],axis=1)

In [145]:
df.drop(["Airline","Source","Destination"],axis=1,inplace=True)

In [147]:
df.head(2)

Unnamed: 0,Total_Stops,Additional_Info,Price,Date,Month,Year,Dep_Hour,Dep_Min,Duration_hour,Duration_min,...,Source_Chennai,Source_Delhi,Source_Kolkata,Source_Mumbai,Destination_Banglore,Destination_Cochin,Destination_Delhi,Destination_Hyderabad,Destination_Kolkata,Destination_New Delhi
0,0,No info,3897,24,3,2019,22,20,2,50,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2,No info,7662,1,5,2019,5,50,7,25,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [149]:
df["Additional_Info"].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)
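
The output above lists both `'No info'` and `'No Info'`, which one-hot encoding treats as two separate categories. If they are meant to be the same label, they could be merged before encoding. A minimal sketch on sample values:

```python
import pandas as pd

# Sample values, including both spellings seen in the output above.
info = pd.Series(["No info", "No Info", "Business class", "No info"])

# Merge the capitalised variant into the more common spelling.
normalized = info.replace({"No Info": "No info"})

n_before = info.nunique()
n_after = normalized.nunique()
```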

In [150]:
# applying one-hot encoding
encoder.fit_transform(df[["Additional_Info"]]).toarray()

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])

In [152]:
df=pd.concat([df,pd.DataFrame(encoder.fit_transform(df[["Additional_Info"]]).toarray(),columns=encoder.get_feature_names_out())],axis=1)

In [155]:
df.drop('Additional_Info',axis=1,inplace=True)

In [156]:
df.head()

Unnamed: 0,Total_Stops,Price,Date,Month,Year,Dep_Hour,Dep_Min,Duration_hour,Duration_min,Arrival_hour,...,Additional_Info_1 Long layover,Additional_Info_1 Short layover,Additional_Info_2 Long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,0,3897,24,3,2019,22,20,2,50,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,7662,1,5,2019,5,50,7,25,13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2,13882,9,6,2019,9,25,19,0,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,6218,12,5,2019,18,5,5,25,23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,13302,1,3,2019,16,50,4,45,21,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [160]:
df.to_csv('flight_price.csv',index=False)

In [161]:
pd.read_csv('flight_price.csv')

Unnamed: 0,Total_Stops,Price,Date,Month,Year,Dep_Hour,Dep_Min,Duration_hour,Duration_min,Arrival_hour,...,Additional_Info_1 Long layover,Additional_Info_1 Short layover,Additional_Info_2 Long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,0,3897,24,3,2019,22,20,2,50,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2,7662,1,5,2019,5,50,7,25,13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2,13882,9,6,2019,9,25,19,0,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,6218,12,5,2019,18,5,5,25,23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,13302,1,3,2019,16,50,4,45,21,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,0,4107,9,4,2019,19,55,2,30,22,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10679,0,4145,27,4,2019,20,45,2,35,23,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10680,0,7229,27,4,2019,8,20,3,0,11,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10681,0,12648,1,3,2019,11,30,2,40,14,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


The data is now ready for regression modelling.
