### EDA and Feature Engineering: Flight Price Prediction

This notebook covers the work needed before training a machine learning model. The raw data is not clean and contains both categorical and numerical features; until it is cleaned and converted into a numerical format, it cannot be passed to a model for proper training.

Features:   
1. Airline: The name of the airline. It is a categorical feature with 12 distinct values.
2. Date_of_Journey: The journey date, stored as a string in day/month/year format.
3. Source: City from which the flight takes off. It is a categorical feature with 5 cities.
4. Destination: City where the flight lands. It is a categorical feature with 6 distinct values.
5. Route: The sequence of airport codes from the source to the destination, including any stops.
6. Dep_Time: Departure time, stored as an HH:MM string.
7. Arrival_Time: Arrival time, stored as an HH:MM string that is sometimes followed by a day and month.
8. Duration: Total travel time, stored as a string such as "2h 50m".
9. Total_Stops: A categorical feature with 5 distinct values (non-stop to 4 stops) that stores the number of stops between the source and the destination.
10. Additional_Info: Short notes about the flight, such as "No info" or "In-flight meal not included".
11. Price: The target variable, which stores the ticket price.

In [408]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [409]:
!pip install openpyxl



In [410]:
df = pd.read_excel('2.1 - flight_price.xlsx')

In [411]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [412]:
df.tail()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648
10682,Air India,9/05/2019,Delhi,Cochin,DEL → GOI → BOM → COK,10:55,19:15,8h 20m,2 stops,No info,11753


In [413]:
## Get basic information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB
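The `df.info()` output above shows one missing value each in `Route` and `Total_Stops`. A minimal sketch of how such gaps can be counted and located is given below; the `sample` frame is made-up toy data, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows) with the same kind of gaps as Route and Total_Stops
sample = pd.DataFrame({
    "Route": ["BLR → DEL", np.nan, "CCU → BLR"],
    "Total_Stops": ["non-stop", np.nan, "non-stop"],
    "Price": [3897, 7480, 4107],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```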


In [414]:
df.describe()

Unnamed: 0,Price
count,10683.0
mean,9087.064121
std,4611.359167
min,1759.0
25%,5277.0
50%,8372.0
75%,12373.0
max,79512.0


In [415]:
## Feature engineering 
df['Date'] = df['Date_of_Journey'].str.split('/').str[0]
df['Month'] = df['Date_of_Journey'].str.split('/').str[1]
df['Year'] = df['Date_of_Journey'].str.split('/').str[2]
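As an alternative to splitting the string, the date can be parsed with `pd.to_datetime` and the parts read from the `.dt` accessor, which yields integers directly. This is only a sketch on a small hand-built `sample` frame; it assumes every value follows the day/month/year layout seen in the rows above.

```python
import pandas as pd

# A few date strings in the same day/month/year layout as Date_of_Journey
sample = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "01/03/2019"]})

# Parse once, then pull out the integer components
parsed = pd.to_datetime(sample["Date_of_Journey"], format="%d/%m/%Y")
sample["Date"] = parsed.dt.day
sample["Month"] = parsed.dt.month
sample["Year"] = parsed.dt.year

print(sample)
```

Parsing also fails loudly on malformed dates, which a plain string split would let through silently.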

In [416]:
df['Date_of_Journey'].str.split('/').str[0]

0        24
1         1
2         9
3        12
4        01
         ..
10678     9
10679    27
10680    27
10681    01
10682     9
Name: Date_of_Journey, Length: 10683, dtype: object

In [417]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019


In [418]:
df.info() # Three additional features that we have created are still object type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
 11  Date             10683 non-null  object
 12  Month            10683 non-null  object
 13  Year             10683 non-null  object
dtypes: int64(1), object(13)
memory usage: 1.1+ MB


In [419]:
## Converting the data of the columns from object/string type to int type
df['Date']= df['Date'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)

In [420]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
 11  Date             10683 non-null  int64 
 12  Month            10683 non-null  int64 
 13  Year             10683 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 1.1+ MB


In [421]:
# Dropping Date_of_Journey column
df.drop('Date_of_Journey',axis=1, inplace=True)

In [422]:
df.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019


In [423]:
## Alternative
## df['Arrival_Time'].apply(lambda x : x.split(' ')[0])

In [424]:
df['Arrival_hour'] = df['Arrival_Time'].str.split(':').str[0]
df['Arrival_minute'] = df['Arrival_Time'].str.split(':').str[1].str.split(' ').str[0]
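The chained splits above can also be written as a single regular expression. The sketch below uses `Series.str.extract` with named groups on a few sample strings; the pattern assumes every value starts with an `HH:MM` time, as in the rows shown so far.

```python
import pandas as pd

# Sample values in the two layouts seen in Arrival_Time
arrival = pd.Series(["01:10 22 Mar", "13:15", "04:25 10 Jun"])

# Capture the leading hour and minute; any trailing date is ignored
parts = arrival.str.extract(r"^(?P<Arrival_hour>\d{1,2}):(?P<Arrival_minute>\d{2})").astype(int)

print(parts)
```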

In [425]:
df.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019,21,35


In [426]:
df['Arrival_hour'] = df['Arrival_hour'].astype(int)
df['Arrival_minute'] = df['Arrival_minute'].astype(int)

In [427]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Source           10683 non-null  object
 2   Destination      10683 non-null  object
 3   Route            10682 non-null  object
 4   Dep_Time         10683 non-null  object
 5   Arrival_Time     10683 non-null  object
 6   Duration         10683 non-null  object
 7   Total_Stops      10682 non-null  object
 8   Additional_Info  10683 non-null  object
 9   Price            10683 non-null  int64 
 10  Date             10683 non-null  int64 
 11  Month            10683 non-null  int64 
 12  Year             10683 non-null  int64 
 13  Arrival_hour     10683 non-null  int64 
 14  Arrival_minute   10683 non-null  int64 
dtypes: int64(6), object(9)
memory usage: 1.2+ MB


In [428]:
df.drop('Arrival_Time',axis=1,inplace=True)

In [429]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15


In [430]:
## Two raw features (Date_of_Journey and Arrival_Time) have been handled so far

In [431]:
df.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,9,6,2019,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,12,5,2019,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,1,3,2019,21,35


In [432]:
df['Departure_hour'] = df['Dep_Time'].str.split(':').str[0]
df['Departure_minute'] = df['Dep_Time'].str.split(':').str[1]

In [433]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute,Departure_hour,Departure_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,2019,13,15,5,50


In [434]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           10683 non-null  object
 1   Source            10683 non-null  object
 2   Destination       10683 non-null  object
 3   Route             10682 non-null  object
 4   Dep_Time          10683 non-null  object
 5   Duration          10683 non-null  object
 6   Total_Stops       10682 non-null  object
 7   Additional_Info   10683 non-null  object
 8   Price             10683 non-null  int64 
 9   Date              10683 non-null  int64 
 10  Month             10683 non-null  int64 
 11  Year              10683 non-null  int64 
 12  Arrival_hour      10683 non-null  int64 
 13  Arrival_minute    10683 non-null  int64 
 14  Departure_hour    10683 non-null  object
 15  Departure_minute  10683 non-null  object
dtypes: int64(6), object(10)
memory usage: 1.3+ MB


In [435]:
df['Departure_hour'] = df['Departure_hour'].astype(int)
df['Departure_minute'] = df['Departure_minute'].astype(int)

In [436]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           10683 non-null  object
 1   Source            10683 non-null  object
 2   Destination       10683 non-null  object
 3   Route             10682 non-null  object
 4   Dep_Time          10683 non-null  object
 5   Duration          10683 non-null  object
 6   Total_Stops       10682 non-null  object
 7   Additional_Info   10683 non-null  object
 8   Price             10683 non-null  int64 
 9   Date              10683 non-null  int64 
 10  Month             10683 non-null  int64 
 11  Year              10683 non-null  int64 
 12  Arrival_hour      10683 non-null  int64 
 13  Arrival_minute    10683 non-null  int64 
 14  Departure_hour    10683 non-null  int64 
 15  Departure_minute  10683 non-null  int64 
dtypes: int64(8), object(8)
memory usage: 1.3+ MB


In [437]:
df.drop('Dep_Time',axis=1,inplace=True)

In [438]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute,Departure_hour,Departure_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,1,5,2019,13,15,5,50


In [439]:
## The same approach works for Duration: split on the space, then strip the 'h' and 'm' suffixes
df['Duration_hour'] = df['Duration'].str.split(' ').str[0].str.split('h').str[0]
df['Duration_minute'] = df['Duration'].str.split(' ').str[1].str.split('m').str[0]

In [440]:
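## A minutes-only duration such as '5m' has no 'h' part, so the hour extraction returns the raw string
var = '5m'
var.split(' ')[0].split('h')[0]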

'5m'
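The `'5m'` case shows that the split-based approach depends on every duration having an hours part. A more tolerant sketch is to extract hours and minutes with separate patterns and treat a missing part as zero. The sample values and the `total_minutes` column are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Sample durations covering the three layouts: hours and minutes, hours only, minutes only
durations = pd.Series(["2h 50m", "19h", "5m", "7h 25m"])

# Extract each part independently; a missing part becomes NaN and is then filled with 0
hours = pd.to_numeric(durations.str.extract(r"(\d+)h", expand=False)).fillna(0).astype(int)
minutes = pd.to_numeric(durations.str.extract(r"(\d+)m", expand=False)).fillna(0).astype(int)

# Optional single numeric feature combining both parts
total_minutes = hours * 60 + minutes

print(pd.DataFrame({"Duration_hour": hours, "Duration_minute": minutes, "total_minutes": total_minutes}))
```

Whether a five-minute flight with two stops is a valid record is a separate question; dropping that row, as done in the following cells, remains a reasonable choice.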

In [441]:
df.head()

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute,Departure_hour,Departure_minute,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,24,3,2019,1,10,22,20,2,50.0
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,1,5,2019,13,15,5,50,7,25.0
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,9,6,2019,4,25,9,25,19,
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,12,5,2019,23,30,18,5,5,25.0
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,1,3,2019,21,35,16,50,4,45.0


In [442]:
df[df['Duration_hour'] == '5m']

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute,Departure_hour,Departure_minute,Duration_hour,Duration_minute
6474,Air India,Mumbai,Hyderabad,BOM → GOI → PNQ → HYD,5m,2 stops,No info,17327,6,3,2019,16,55,16,50,5m,


In [443]:
drop_index = df[df['Duration_hour'] == '5m'].index
df.drop(drop_index,inplace=True)

In [444]:
df['Duration_minute'].isnull().sum()

np.int64(1031)

In [445]:
df['Duration_minute'] = df['Duration_minute'].fillna(0)

In [446]:
df['Duration_minute'].isnull().sum()

np.int64(0)

In [447]:
df['Departure_hour'].isnull().sum()

np.int64(0)

In [448]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10682 entries, 0 to 10682
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           10682 non-null  object
 1   Source            10682 non-null  object
 2   Destination       10682 non-null  object
 3   Route             10681 non-null  object
 4   Duration          10682 non-null  object
 5   Total_Stops       10681 non-null  object
 6   Additional_Info   10682 non-null  object
 7   Price             10682 non-null  int64 
 8   Date              10682 non-null  int64 
 9   Month             10682 non-null  int64 
 10  Year              10682 non-null  int64 
 11  Arrival_hour      10682 non-null  int64 
 12  Arrival_minute    10682 non-null  int64 
 13  Departure_hour    10682 non-null  int64 
 14  Departure_minute  10682 non-null  int64 
 15  Duration_hour     10682 non-null  object
 16  Duration_minute   10682 non-null  object
dtypes: int64(8), object(9)

In [449]:
df['Duration_hour'] = df['Duration_hour'].astype(int)

In [450]:
## Convert Duration_minute itself (not Departure_minute) to int
df['Duration_minute'] = df['Duration_minute'].astype(int)

In [451]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10682 entries, 0 to 10682
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           10682 non-null  object
 1   Source            10682 non-null  object
 2   Destination       10682 non-null  object
 3   Route             10681 non-null  object
 4   Duration          10682 non-null  object
 5   Total_Stops       10681 non-null  object
 6   Additional_Info   10682 non-null  object
 7   Price             10682 non-null  int64 
 8   Date              10682 non-null  int64 
 9   Month             10682 non-null  int64 
 10  Year              10682 non-null  int64 
 11  Arrival_hour      10682 non-null  int64 
 12  Arrival_minute    10682 non-null  int64 
 13  Departure_hour    10682 non-null  int64 
 14  Departure_minute  10682 non-null  int64 
 15  Duration_hour     10682 non-null  int64 
 16  Duration_minute   10682 non-null  int64 
dtypes: int64(10), object(7)

In [452]:
## Categorical features
df['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [453]:
# NaN values mean missing values
## Finding how many values are missing
df['Total_Stops'].isnull().sum()

np.int64(1)

In [454]:
df['Total_Stops'].mode()
## Since the most frequently occurring value is '1 stop', we will use it to fill the missing value

0    1 stop
Name: Total_Stops, dtype: object

In [455]:
## Encode the stops as ordinal labels so that the model can use them and we can examine their relationship with price
## Assumption: the more stops a flight has, the cheaper it is, so non-stop gets the highest label
## The single missing value is mapped to 3, the label of the mode ('1 stop')
df['Total_Stops'] = df['Total_Stops'].map({'non-stop':4,'1 stop':3, '2 stops':2, '3 stops':1, '4 stops':0, np.nan:3})

In [456]:
df['Total_Stops'].isnull().sum()

np.int64(0)
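The imputation and the encoding can also be kept as two explicit steps: fill the gap with the mode, then map the labels. The sketch below does this on a toy series and, as an assumption that differs from the cell above, uses the literal number of stops as the label rather than the reversed scale. Both are ordinal encodings; only the direction differs.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value, mirroring Total_Stops
stops = pd.Series(["non-stop", "2 stops", "1 stop", np.nan, "1 stop"])

# Step 1: fill the missing value with the most frequent label
filled = stops.fillna(stops.mode()[0])

# Step 2: map each label to its literal stop count
stop_count = filled.map({"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4})

print(stop_count.tolist())
```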

In [457]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute,Departure_hour,Departure_minute,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,4,No info,3897,24,3,2019,1,10,22,20,2,50
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2,No info,7662,1,5,2019,13,15,5,50,7,25


In [458]:
## We are dropping Route because we already have source and destination
df.drop('Route',axis=1,inplace=True)

In [459]:
df.head(2)

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,Arrival_hour,Arrival_minute,Departure_hour,Departure_minute,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,2h 50m,4,No info,3897,24,3,2019,1,10,22,20,2,50
1,Air India,Kolkata,Banglore,7h 25m,2,No info,7662,1,5,2019,13,15,5,50,7,25


In [460]:
## Dealing with other categorical features
df['Airline'].unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [461]:
df['Source'].unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

In [462]:
df['Destination'].unique()

array(['New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'],
      dtype=object)

In [463]:
df['Additional_Info'].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)
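The unique values above contain both `'No info'` and `'No Info'`, which differ only in capitalization and would otherwise become two separate one-hot columns. A small sketch of merging them before encoding, on a toy series:

```python
import pandas as pd

# Toy series containing both spellings
info = pd.Series(["No info", "No Info", "In-flight meal not included", "No Info"])

# Collapse the capitalized variant into the more common spelling
normalized = info.replace({"No Info": "No info"})

print(normalized.unique())
```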

In [464]:
## Using OneHotEncoding

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

## A row was dropped earlier, so df's index has a gap at 6474.
## Passing index=df.index keeps each encoded frame aligned with df when concatenating along axis=1.
airline_encoded = encoder.fit_transform(df[['Airline']]).toarray()
airline_encoded_df = pd.DataFrame(airline_encoded,columns=encoder.get_feature_names_out(),index=df.index)
df = pd.concat([df,airline_encoded_df],axis=1)

source_encoded = encoder.fit_transform(df[['Source']]).toarray()
source_encoded_df = pd.DataFrame(source_encoded, columns=encoder.get_feature_names_out(),index=df.index)
df = pd.concat([df,source_encoded_df],axis=1)

destination_encoded = encoder.fit_transform(df[['Destination']]).toarray()
destination_encoded_df = pd.DataFrame(destination_encoded,columns=encoder.get_feature_names_out(),index=df.index)
df = pd.concat([df,destination_encoded_df],axis=1)

additional_encoded = encoder.fit_transform(df[['Additional_Info']]).toarray()
additional_encoded_df = pd.DataFrame(additional_encoded, columns=encoder.get_feature_names_out(),index=df.index)
df = pd.concat([df,additional_encoded_df],axis=1)

df.head()

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Date,Month,Year,...,Additional_Info_1 Long layover,Additional_Info_1 Short layover,Additional_Info_2 Long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,IndiGo,Banglore,New Delhi,2h 50m,4,No info,3897,24,3,2019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,Air India,Kolkata,Banglore,7h 25m,2,No info,7662,1,5,2019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,Jet Airways,Delhi,Cochin,19h,2,No info,13882,9,6,2019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,IndiGo,Kolkata,Banglore,5h 25m,3,No info,6218,12,5,2019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,IndiGo,Banglore,New Delhi,4h 45m,3,No info,13302,1,3,2019,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [468]:
## Drop the original categorical columns now that they are encoded
## Duration is dropped as well, since Duration_hour and Duration_minute replace it
df.drop(['Airline','Source','Destination','Additional_Info','Duration'],axis=1,inplace=True)

In [469]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10682 entries, 0 to 10682
Data columns (total 44 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Total_Stops                                   10682 non-null  int64  
 1   Price                                         10682 non-null  int64  
 2   Date                                          10682 non-null  int64  
 3   Month                                         10682 non-null  int64  
 4   Year                                          10682 non-null  int64  
 5   Arrival_hour                                  10682 non-null  int64  
 6   Arrival_minute                                10682 non-null  int64  
 7   Departure_hour                                10682 non-null  int64  
 8   Departure_minute                              10682 non-null  int64  
 9   Duration_hour                                 10682 non-null  int64  
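Because one row was dropped earlier, `df` no longer has a continuous index, and `pd.concat(..., axis=1)` aligns rows by index label rather than by position. The sketch below illustrates this on toy data: a frame with a gap in its index is joined with an encoded array, once with a default `RangeIndex` and once with the frame's own index. The `encoded` array is a hand-written stand-in for encoder output.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap (label 1 is missing), like df after a row drop
frame = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"]}, index=[0, 2, 3])

# Stand-in for the dense one-hot output of an encoder (columns: Air India, IndiGo)
encoded = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
columns = ["Airline_Air India", "Airline_IndiGo"]

# Default RangeIndex (0, 1, 2) does not match the labels (0, 2, 3): extra row and NaNs
misaligned = pd.concat([frame, pd.DataFrame(encoded, columns=columns)], axis=1)

# Reusing the frame's index keeps every encoded row next to its source row
aligned = pd.concat([frame, pd.DataFrame(encoded, columns=columns, index=frame.index)], axis=1)

print(misaligned)
print(aligned)
```

Calling `reset_index(drop=True)` right after dropping the row would be another way to avoid the mismatch.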

In [470]:
## Alternatively, encode all four columns with a single encoder
## (run this in place of the column-by-column encoding, before the original columns are dropped)
# from sklearn.preprocessing import OneHotEncoder
#
# encoder = OneHotEncoder()
#
# ## Double brackets select a DataFrame containing the listed columns
# encoded = encoder.fit_transform(df[['Airline','Source','Destination','Additional_Info']]).toarray()
#
# ## index=df.index keeps the encoded rows aligned with df
# encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out(),index=df.index)
#
# df = pd.concat([df,encoded_df],axis=1)
#
# df.head()

The data is now ready to be passed to an ML model for training.
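For quick exploration, the encoding step can also be sketched with `pd.get_dummies`, which encodes the listed columns, drops the originals, and preserves the index in one call. This is an alternative technique to the `OneHotEncoder` used in this notebook, shown on a toy frame. Unlike a fitted encoder, it does not remember the categories for later use on new data.

```python
import pandas as pd

# Toy frame with two categorical columns and the target
sample = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Banglore", "Kolkata", "Kolkata"],
    "Price": [3897, 7662, 6218],
})

# One-hot encode the listed columns; the originals are replaced by indicator columns
encoded = pd.get_dummies(sample, columns=["Airline", "Source"], dtype=int)

print(encoded)
```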