## Flight_Prediction

##### EDA on the Data

EDA and Feature Engineering for Flight Price Prediction

FEATURES -

The features of the cleaned dataset are explained below:

1) Airline: The name of the airline company. It is a categorical feature with 6 different airlines.
2) Flight: The plane's flight code. It is a categorical feature.
3) Source City: The city from which the flight takes off. It is a categorical feature with 6 unique cities.
4) Departure Time: A derived categorical feature created by grouping time periods into bins. It stores the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: A derived categorical feature created by grouping time intervals into bins. It stores the arrival time and has 6 distinct time labels.
7) Destination City: The city where the flight lands. It is a categorical feature with 6 unique cities.
8) Class: A categorical feature holding the seat class. It has two distinct values: Business and Economy.
9) Duration: A continuous feature giving the total travel time between the cities, in hours.
10) Days Left: A derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable, which stores the ticket price.

In [80]:
# importing libraries
import numpy as np
import pandas as pd

In [81]:
## loading the data from excel file
df = pd.read_excel("flight_price.xlsx")
## top 5 records
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [82]:
## dimension of the dataset
df.shape

(10683, 11)

In [83]:
## Checking for the datatype of the features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [84]:
### Five number summary
df.describe()

Unnamed: 0,Price
count,10683.0
mean,9087.064121
std,4611.359167
min,1759.0
25%,5277.0
50%,8372.0
75%,12373.0
max,79512.0


- The lowest price is 1759 and the highest price is 79512.
- 75% of the prices are at or below 12373; the remaining 25% are higher.
- The median price is 8372 and the mean price is about 9087.
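
The mean sitting above the median, together with a maximum far beyond the 75th percentile, suggests a right-skewed price distribution with a few very expensive tickets. A minimal sketch of how that could be checked with the common 1.5 × IQR rule of thumb follows. The toy series is built only from the five summary values above (it is not the real dataset), and the rule itself is an assumption, not something the notebook applies.

```python
import pandas as pd

# Toy series made of the min, quartiles and max reported by df.describe();
# it stands in for df['Price'] purely for illustration
prices = pd.Series([1759, 5277, 8372, 12373, 79512], name="Price")

q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # rule-of-thumb threshold for unusually high values

outliers = prices[prices > upper_fence]
print(f"IQR = {iqr}, upper fence = {upper_fence}")
print(outliers)
```

On the real data the same lines would run against df['Price'] directly.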

### Data Cleaning and Feature Extraction

##### Extracting the day, month and year from the Date_of_Journey column:

In [85]:
## Date_of_Journey is a day/month/year string, so split on '/' and take each part
df['Date'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2]

##### Converting the extracted features to integers:

In [86]:
df['Date'] = df['Date'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)

In [87]:
# Checking the results
df[['Date' , 'Month', 'Year']].dtypes

Date     int64
Month    int64
Year     int64
dtype: object
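
As an alternative to splitting the string by hand, the same three columns can be derived by parsing the date once. This is a sketch on a toy frame that copies the day/month/year format seen in df.head(); the explicit format string is an assumption based on those sample rows.

```python
import pandas as pd

# Toy frame mimicking the Date_of_Journey strings shown in df.head()
toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# An explicit format avoids day/month ambiguity (1/05 is 1 May, not 5 January)
parsed = pd.to_datetime(toy["Date_of_Journey"], format="%d/%m/%Y")

toy["Date"] = parsed.dt.day
toy["Month"] = parsed.dt.month
toy["Year"] = parsed.dt.year
print(toy)
```

The resulting columns are already integers, so no separate astype step is needed.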

In [88]:
## removing the Date_of_Journey column
df.drop('Date_of_Journey' ,axis = 1 , inplace = True)

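##### Data Extraction from the Arrival_Time column
Here we split Arrival_Time into Arrival_hours and Arrival_min. Some values carry a trailing date (for example 01:10 22 Mar), so the clock time is first isolated by splitting on the space.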

In [89]:
## extracting using the str.split function
df['Arrival_hours'] = df['Arrival_Time'].str.split(' ').str[0].str.split(':').str[0] 
df['Arrival_min'] = df['Arrival_Time'].str.split(' ').str[0].str.split(':').str[1] 

##### Converting the extracted features to integers

In [90]:
df['Arrival_hours'] = df['Arrival_hours'].astype(int)
df['Arrival_min'] = df['Arrival_min'].astype(int)

In [91]:
df[['Arrival_hours' , 'Arrival_min']].dtypes

Arrival_hours    int64
Arrival_min      int64
dtype: object

In [92]:
## removing the Arrival_Time column from the dataset
df.drop('Arrival_Time',axis = 1 , inplace = True )

##### Data Extraction from the Dep_Time column
Here we split Dep_Time into hours and minutes, as we did for the Arrival_Time column.

##### Extracting the features from the column and changing the datatype

In [93]:
## using the split function
df['Dep_Hour'] = (df['Dep_Time'].str.split(':').str[0]).astype(int)
df['Dep_min'] = (df['Dep_Time'].str.split(':').str[1]).astype(int)

In [94]:
## checking for the changes
df[['Dep_Hour', 'Dep_min']].dtypes

Dep_Hour    int64
Dep_min     int64
dtype: object
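
Arrival_Time and Dep_Time are handled with almost identical split logic, so the step can be wrapped in one helper. The sketch below uses a hypothetical function name, split_clock_time, and a regular expression in place of the chained splits; it assumes every value starts with an HH:MM clock time, as the sample rows do.

```python
import pandas as pd

def split_clock_time(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Return integer hour and minute columns parsed from 'HH:MM' strings.

    Anything after the clock time (such as ' 22 Mar') is ignored.
    """
    parts = series.str.extract(r"^(\d{1,2}):(\d{2})")
    parts.columns = [f"{prefix}_hour", f"{prefix}_min"]
    return parts.astype(int)

# Toy values copied from the Arrival_Time column in df.head()
toy = pd.Series(["01:10 22 Mar", "13:15", "04:25 10 Jun"])
arrival = split_clock_time(toy, "Arrival")
print(arrival)
```

The astype(int) call would raise on a value that does not match the pattern, which makes malformed rows visible instead of passing them through silently.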

In [95]:
## removing the Dep_Time column
df.drop('Dep_Time' , axis = 1 , inplace = True)

In [96]:
## Route is not necessary for the model so removing it 
df.drop('Route', axis = 1 , inplace = True)

##### Data Extraction from the Duration column
Here we split the Duration column into hours and minutes.

In [97]:
df['Duration_hours'] = (df['Duration'].str.split(' ').str[0].str.split('h').str[0])
df['Duration_min'] = (df['Duration'].str.split(' ').str[1].str.split('m').str[0])

##### Filling null values and changing the data type from string to integer

In [98]:
## using fillna to replace the null values with 0
df['Duration_hours'] = df['Duration_hours'].fillna(0)
## durations such as '19h' have no minutes part, so Duration_min is null there; fill with 0, then convert to int
df['Duration_min'] = (df['Duration_min'].fillna(0)).astype(int)

In [99]:
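## a Duration of just '5m' has no hours part, which leaves the string '5m' in Duration_hours
## replacing '5m' with '5' so the column can be converted to int (note that this treats the value as 5 hours)
df['Duration_hours'] = df['Duration_hours'].replace('5m', '5')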

In [100]:
## checking for applied changes
df[['Duration_min' , 'Duration_hours']].dtypes

Duration_min       int64
Duration_hours    object
dtype: object

In [101]:
## changing data type str into int
df['Duration_hours'] = df['Duration_hours'].astype(int)
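
The split-based extraction above relies on every Duration value having the form Xh Ym, so values such as 19h or 5m need separate handling afterwards. A minimal sketch of an alternative that looks for the hour and minute parts independently is shown below. The toy values are taken from the formats seen in this dataset, and the total_minutes column is an illustrative extra, not something the notebook creates.

```python
import pandas as pd

toy = pd.Series(["2h 50m", "19h", "5m"], name="Duration")

# Search for each unit on its own; a missing unit gives NaN, which becomes 0
hours = pd.to_numeric(toy.str.extract(r"(\d+)\s*h")[0]).fillna(0).astype(int)
minutes = pd.to_numeric(toy.str.extract(r"(\d+)\s*m")[0]).fillna(0).astype(int)

total_minutes = hours * 60 + minutes
print(pd.DataFrame({"hours": hours, "minutes": minutes, "total": total_minutes}))
```

With this approach a duration of 5m is recorded as 0 hours and 5 minutes, so no manual replacement is required.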

In [102]:
## removing the Duration column
df.drop('Duration', axis = 1 , inplace = True)

In [103]:
# checking for all the unique values in the Total_Stops column
df['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [104]:
## finding the most frequent value in Total_Stops
df['Total_Stops'].mode()

0    1 stop
Name: Total_Stops, dtype: object

In [105]:
## Applying Imputation using mode
df['Total_Stops'] = df['Total_Stops'].fillna('1 stop')

##### Label Encoding

In [106]:
## mapping each stop label to its number of stops
df['Total_Stops'] = df['Total_Stops'].map({'non-stop' : 0, '2 stops': 2 , '1 stop' : 1, '3 stops' : 3 , '4 stops' : 4 })
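
Series.map turns any label that is missing from the dictionary into NaN without raising an error, so it can be worth verifying the result. The following sketch repeats the mode imputation and mapping on a toy series and adds such a check; the names stop_map and unmapped are illustrative.

```python
import pandas as pd

# Toy column with one missing value; '1 stop' is the most frequent label
toy = pd.Series(["non-stop", "1 stop", "2 stops", None, "1 stop", "4 stops"],
                name="Total_Stops")

stop_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

filled = toy.fillna(toy.mode()[0])  # mode imputation, as in the cells above
encoded = filled.map(stop_map)

# Labels absent from stop_map would show up as NaN after the mapping
unmapped = filled[encoded.isna()].unique()
assert len(unmapped) == 0, f"unmapped labels: {unmapped}"

encoded = encoded.astype(int)
print(encoded.tolist())
```

Because the number of stops has a natural order, mapping to integers keeps that order, which one-hot encoding would discard.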

###### Feature Engineering on the remaining Categorical columns

In [107]:
## checking the unique values in each column
## OneHotEncoding is used here because these columns have few distinct values
print(df['Destination'].unique())
print(df['Source'].unique())
print(df['Airline'].unique())

['New Delhi' 'Banglore' 'Cochin' 'Kolkata' 'Delhi' 'Hyderabad']
['Banglore' 'Kolkata' 'Delhi' 'Chennai' 'Mumbai']
['IndiGo' 'Air India' 'Jet Airways' 'SpiceJet' 'Multiple carriers' 'GoAir'
 'Vistara' 'Air Asia' 'Vistara Premium economy' 'Jet Airways Business'
 'Multiple carriers Premium economy' 'Trujet']


##### OneHotEncoding

In [108]:
## importing onehotencoder from sklearn
from sklearn.preprocessing import OneHotEncoder

# initializing the encoder
encoder = OneHotEncoder()

In [109]:
df['Additional_Info'].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)

In [120]:
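## mapping both spellings of 'No info' to 0; every other value is missing from the mapping and becomes NaN
df['Additional_Info'] = df['Additional_Info'].map({'No info' : 0 , 'No Info' : 0 })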
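
The unique values above contain both 'No info' and 'No Info', which differ only in capitalisation. One possible way to merge them while keeping the other categories intact is to normalise the text first. This is a sketch of an alternative on toy values, not the approach taken in this notebook, and the has_info flag is a hypothetical feature.

```python
import pandas as pd

toy = pd.Series(["No info", "No Info", "In-flight meal not included", "Business class"],
                name="Additional_Info")

# Lower-casing collapses the two spellings into a single category
normalised = toy.str.strip().str.lower()

# Optional binary flag: 1 when the row carries any real information
has_info = (normalised != "no info").astype(int)
print(normalised.unique())
print(has_info.tolist())
```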

In [121]:
## encoding using fit_transform on the Airline, Source, Destination and Additional_Info columns
encoder.fit_transform(df[['Airline', 'Source', 'Destination', 'Additional_Info']]).toarray()

array([[0., 0., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.]])

In [124]:
## converting the encoded features into a dataframe
df1 = pd.DataFrame(encoder.fit_transform(df[['Airline', 'Source', 'Destination', 'Additional_Info']]).toarray(), columns = encoder.get_feature_names_out())

In [125]:
## concatenating the encoded df1 with df
df_final = pd.concat([df,df1] , axis = 1)
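
pd.concat with axis = 1 aligns rows by index, which works here because df and df1 both use the default 0..n-1 index. For a quick exploratory pass, pandas can also produce a similar result in one call with get_dummies. The sketch below uses a toy frame with values from df.head(); it is an alternative shown for comparison, not the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Banglore", "Kolkata", "Kolkata"],
    "Price": [3897, 7662, 6218],
})

# get_dummies encodes the listed columns and drops the originals in one step
encoded = pd.get_dummies(toy, columns=["Airline", "Source"], dtype=float)
print(encoded.columns.tolist())
```

A fitted OneHotEncoder can be reused to transform new data with the same column layout, which get_dummies does not offer, so the encoder remains the safer choice when a model will later be applied to unseen rows.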

In [127]:
df_final.head()

Unnamed: 0,Airline,Source,Destination,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hours,...,Source_Kolkata,Source_Mumbai,Destination_Banglore,Destination_Cochin,Destination_Delhi,Destination_Hyderabad,Destination_Kolkata,Destination_New Delhi,Additional_Info_0.0,Additional_Info_nan
0,IndiGo,Banglore,New Delhi,0,0.0,3897,24,3,2019,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,Air India,Kolkata,Banglore,2,0.0,7662,1,5,2019,13,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,Jet Airways,Delhi,Cochin,2,0.0,13882,9,6,2019,4,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,IndiGo,Kolkata,Banglore,1,0.0,6218,12,5,2019,23,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,IndiGo,Banglore,New Delhi,1,0.0,13302,1,3,2019,21,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


##### We don't need the Additional_Info column, so removing it

In [79]:
## fit_transform returns a sparse matrix, so toarray() is called on its result (not on the DataFrame) before wrapping it in a DataFrame
df_final1 = pd.DataFrame(encoder.fit_transform(df_final[['Additional_Info']]).toarray(), columns = encoder.get_feature_names_out())

In [34]:
## removing the Additional_Info column using drop
df_final.drop('Additional_Info', inplace = True , axis = 1 )

In [35]:
## preprocessed data
df_final

Unnamed: 0,Airline,Source,Destination,Total_Stops,Price,Date,Month,Year,Arrival_hours,Arrival_min,...,Source_Chennai,Source_Delhi,Source_Kolkata,Source_Mumbai,Destination_Banglore,Destination_Cochin,Destination_Delhi,Destination_Hyderabad,Destination_Kolkata,Destination_New Delhi
0,IndiGo,Banglore,New Delhi,0,3897,24,3,2019,1,10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,Air India,Kolkata,Banglore,2,7662,1,5,2019,13,15,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Jet Airways,Delhi,Cochin,2,13882,9,6,2019,4,25,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,IndiGo,Kolkata,Banglore,1,6218,12,5,2019,23,30,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,IndiGo,Banglore,New Delhi,1,13302,1,3,2019,21,35,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,0,4107,9,4,2019,22,25,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10679,Air India,Kolkata,Banglore,0,4145,27,4,2019,23,20,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10680,Jet Airways,Banglore,Delhi,0,7229,27,4,2019,11,20,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
10681,Vistara,Banglore,New Delhi,0,12648,1,3,2019,14,10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### Model Training

In [36]:
df_final.columns

Index(['Airline', 'Source', 'Destination', 'Total_Stops', 'Price', 'Date',
       'Month', 'Year', 'Arrival_hours', 'Arrival_min', 'Dep_Hour', 'Dep_min',
       'Duration_hours', 'Duration_min', 'Airline_Air Asia',
       'Airline_Air India', 'Airline_GoAir', 'Airline_IndiGo',
       'Airline_Jet Airways', 'Airline_Jet Airways Business',
       'Airline_Multiple carriers',
       'Airline_Multiple carriers Premium economy', 'Airline_SpiceJet',
       'Airline_Trujet', 'Airline_Vistara', 'Airline_Vistara Premium economy',
       'Source_Banglore', 'Source_Chennai', 'Source_Delhi', 'Source_Kolkata',
       'Source_Mumbai', 'Destination_Banglore', 'Destination_Cochin',
       'Destination_Delhi', 'Destination_Hyderabad', 'Destination_Kolkata',
       'Destination_New Delhi'],
      dtype='object')
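
The column list shows that df_final still holds the raw text columns Airline, Source and Destination next to their one-hot encoded versions. Most estimators expect numeric input only, so a typical next step is to drop those columns and separate the target. The sketch below does this on a small toy frame; the names X and y are conventional placeholders, and no model is fitted here.

```python
import pandas as pd

# Toy frame with a few of the columns listed above
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India"],
    "Source": ["Banglore", "Kolkata"],
    "Destination": ["New Delhi", "Banglore"],
    "Total_Stops": [0, 2],
    "Price": [3897, 7662],
    "Airline_IndiGo": [1.0, 0.0],
})

# Drop the raw text columns (already one-hot encoded) and split off the target
X = toy.drop(columns=["Airline", "Source", "Destination", "Price"])
y = toy["Price"]

# Every remaining feature should be numeric before a model is fitted
assert X.select_dtypes(exclude="number").empty
print(X.columns.tolist())
```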