## EDA And Feature Engineering Flight Price Prediction
### FEATURES
The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature obtained created by grouping time periods into bins. It stores information about the departure time and have 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: This is a derived characteristic that is calculated by subtracting the trip date by the booking date.
11) Price: Target variable stores information of the ticket price.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder 

In [3]:
df = pd.read_excel('flight_price.xlsx')
df

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


### Insights
- Date_of_Journey feature ranges between '01/03/2019' to '9/06/2019'
-

In [5]:
df['Date_of_Journey'].value_counts().index.max()

'9/06/2019'

In [6]:
df['Date_of_Journey'].value_counts().index.min()


'01/03/2019'

### spliting Date_of_Journey feature
Bcz my model will not be able to understand this feature | But if I will convert it to date, month and year my model will be able to understand it

In [7]:
# df['Date_of_Journey'].str.split('/')[0] # this gives me first index
df['date']= df['Date_of_Journey'].str.split('/').str[0] # date
df['month']=df['Date_of_Journey'].str.split('/').str[1] # month
df['year']=df['Date_of_Journey'].str.split('/').str[2] # year


### now coverting dtypes of features

In [8]:
df['date'] = df['date'].astype('int64')
df['month'] = df['month'].astype('int64')
df['year'] = df['year'].astype('int64')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
 11  date             10683 non-null  int64 
 12  month            10683 non-null  int64 
 13  year             10683 non-null  int64 
dtypes: int64(4), object(10)
memory usage: 1.1+ MB


### droping Date_of_Journey

In [10]:
df.drop(labels='Date_of_Journey',axis=1,inplace=True)

In [11]:
df

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,date,month,year
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,2019
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,2019
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,2019
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,2019
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,9,4,2019
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,27,4,2019
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,27,4,2019
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,1,3,2019


### Insights
- `droping year` would be the best idea bcz year is same in complete dataset , So this dose not any sense

#### droping year

In [12]:
df.drop('year',axis=1,inplace=True)

## Working with Dep_Time means departure time

### Insights
- Dep_Time has two values hour and minute | dividing this would be best idea


In [13]:
df['Dep_Time'].str.split(':')
df['hour'] = df['Dep_Time'].str.split(':').str[0]
df['minute'] = df['Dep_Time'].str.split(':').str[1]
df['minute']


0        20
1        50
2        25
3        05
4        50
         ..
10678    55
10679    45
10680    20
10681    30
10682    55
Name: minute, Length: 10683, dtype: object

In [14]:
max(df['Dep_Time'].value_counts().index)
min(df['Dep_Time'].value_counts().index)

'00:20'

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Source           10683 non-null  object
 2   Destination      10683 non-null  object
 3   Route            10682 non-null  object
 4   Dep_Time         10683 non-null  object
 5   Arrival_Time     10683 non-null  object
 6   Duration         10683 non-null  object
 7   Total_Stops      10682 non-null  object
 8   Additional_Info  10683 non-null  object
 9   Price            10683 non-null  int64 
 10  date             10683 non-null  int64 
 11  month            10683 non-null  int64 
 12  hour             10683 non-null  object
 13  minute           10683 non-null  object
dtypes: int64(3), object(11)
memory usage: 1.1+ MB


In [16]:
df['dep_minute'] = df['minute'].astype('int64')
df['dep_hour'] = df['minute'].astype('int64')

In [17]:
df.drop(['minute','hour'],axis=1,inplace=True)

In [18]:
df

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,24,3,20,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,1,5,50,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,9,6,25,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,12,5,5,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,1,3,50,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,9,4,55,55
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,27,4,45,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,27,4,20,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,1,3,30,30


### droping Departure_time and Dep_Time

In [19]:
# df.drop(labels=['Dep_Time','Departure_time'],axis=1,inplace=True)

## working with arrival time

In [20]:
df['Arrival_hours']=df['Arrival_Time'].str.split(' ').str[0].str.split(':').str[0]
df['Arrival_minutes']=df['Arrival_Time'].str.split(' ').str[0].str.split(':').str[1]
df.drop('Arrival_Time',axis=1,inplace=True)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Source           10683 non-null  object
 2   Destination      10683 non-null  object
 3   Route            10682 non-null  object
 4   Dep_Time         10683 non-null  object
 5   Duration         10683 non-null  object
 6   Total_Stops      10682 non-null  object
 7   Additional_Info  10683 non-null  object
 8   Price            10683 non-null  int64 
 9   date             10683 non-null  int64 
 10  month            10683 non-null  int64 
 11  dep_minute       10683 non-null  int64 
 12  dep_hour         10683 non-null  int64 
 13  Arrival_hours    10683 non-null  object
 14  Arrival_minutes  10683 non-null  object
dtypes: int64(5), object(10)
memory usage: 1.2+ MB


In [22]:
df['Arrival_hours'] = df['Arrival_hours'].astype('int64')
df['Arrival_minutes'] = df['Arrival_minutes'].astype('int64')


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Source           10683 non-null  object
 2   Destination      10683 non-null  object
 3   Route            10682 non-null  object
 4   Dep_Time         10683 non-null  object
 5   Duration         10683 non-null  object
 6   Total_Stops      10682 non-null  object
 7   Additional_Info  10683 non-null  object
 8   Price            10683 non-null  int64 
 9   date             10683 non-null  int64 
 10  month            10683 non-null  int64 
 11  dep_minute       10683 non-null  int64 
 12  dep_hour         10683 non-null  int64 
 13  Arrival_hours    10683 non-null  int64 
 14  Arrival_minutes  10683 non-null  int64 
dtypes: int64(7), object(8)
memory usage: 1.2+ MB


In [24]:
df

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,24,3,20,20,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,1,5,50,50,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,9,6,25,25,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,12,5,5,5,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,1,3,50,50,21,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,2h 30m,non-stop,No info,4107,9,4,55,55,22,25
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,2h 35m,non-stop,No info,4145,27,4,45,45,23,20
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,3h,non-stop,No info,7229,27,4,20,20,11,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,2h 40m,non-stop,No info,12648,1,3,30,30,14,10


### Insights
Route has not any importance

#### droping route

In [25]:
df.drop('Route',axis=1,inplace=True)

In [26]:
df

Unnamed: 0,Airline,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes
0,IndiGo,Banglore,New Delhi,22:20,2h 50m,non-stop,No info,3897,24,3,20,20,1,10
1,Air India,Kolkata,Banglore,05:50,7h 25m,2 stops,No info,7662,1,5,50,50,13,15
2,Jet Airways,Delhi,Cochin,09:25,19h,2 stops,No info,13882,9,6,25,25,4,25
3,IndiGo,Kolkata,Banglore,18:05,5h 25m,1 stop,No info,6218,12,5,5,5,23,30
4,IndiGo,Banglore,New Delhi,16:50,4h 45m,1 stop,No info,13302,1,3,50,50,21,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,19:55,2h 30m,non-stop,No info,4107,9,4,55,55,22,25
10679,Air India,Kolkata,Banglore,20:45,2h 35m,non-stop,No info,4145,27,4,45,45,23,20
10680,Jet Airways,Banglore,Delhi,08:20,3h,non-stop,No info,7229,27,4,20,20,11,20
10681,Vistara,Banglore,New Delhi,11:30,2h 40m,non-stop,No info,12648,1,3,30,30,14,10


## working with Duration feature

In [27]:
df['Duration'].str.split(' ').str[0]

0         2h
1         7h
2        19h
3         5h
4         4h
        ... 
10678     2h
10679     2h
10680     3h
10681     2h
10682     8h
Name: Duration, Length: 10683, dtype: object

In [28]:
# removing h from hours
df['Duration_hour'] = df['Duration'].str.split(' ').str[0].str.split('h').str[0]
df['Duration_minute'] = df['Duration'].str.split(' ').str[1].str.split('m').str[0]


In [29]:
df

Unnamed: 0,Airline,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,22:20,2h 50m,non-stop,No info,3897,24,3,20,20,1,10,2,50
1,Air India,Kolkata,Banglore,05:50,7h 25m,2 stops,No info,7662,1,5,50,50,13,15,7,25
2,Jet Airways,Delhi,Cochin,09:25,19h,2 stops,No info,13882,9,6,25,25,4,25,19,
3,IndiGo,Kolkata,Banglore,18:05,5h 25m,1 stop,No info,6218,12,5,5,5,23,30,5,25
4,IndiGo,Banglore,New Delhi,16:50,4h 45m,1 stop,No info,13302,1,3,50,50,21,35,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,19:55,2h 30m,non-stop,No info,4107,9,4,55,55,22,25,2,30
10679,Air India,Kolkata,Banglore,20:45,2h 35m,non-stop,No info,4145,27,4,45,45,23,20,2,35
10680,Jet Airways,Banglore,Delhi,08:20,3h,non-stop,No info,7229,27,4,20,20,11,20,3,
10681,Vistara,Banglore,New Delhi,11:30,2h 40m,non-stop,No info,12648,1,3,30,30,14,10,2,40


In [30]:
df['Duration'].str.split(' ').str[0].str.split('h').str[0]

0         2
1         7
2        19
3         5
4         4
         ..
10678     2
10679     2
10680     3
10681     2
10682     8
Name: Duration, Length: 10683, dtype: object

In [31]:
df['Duration_minute'].isnull().sum()

1032

In [32]:
df['Duration_minute'].fillna(0,inplace=True)

In [33]:

df[df['Duration_hour'] == '5m'].index
df.drop(df[df['Duration_hour'] == '5m'].index,inplace=True)
df['Duration_hour'] = df['Duration_hour'].astype('int64')

In [34]:
df['Duration_minute'] = df['Duration_minute'].astype('int64')
df['Duration_hour'] = df['Duration_hour'].astype('int64')

In [35]:
df

Unnamed: 0,Airline,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,22:20,2h 50m,non-stop,No info,3897,24,3,20,20,1,10,2,50
1,Air India,Kolkata,Banglore,05:50,7h 25m,2 stops,No info,7662,1,5,50,50,13,15,7,25
2,Jet Airways,Delhi,Cochin,09:25,19h,2 stops,No info,13882,9,6,25,25,4,25,19,0
3,IndiGo,Kolkata,Banglore,18:05,5h 25m,1 stop,No info,6218,12,5,5,5,23,30,5,25
4,IndiGo,Banglore,New Delhi,16:50,4h 45m,1 stop,No info,13302,1,3,50,50,21,35,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,19:55,2h 30m,non-stop,No info,4107,9,4,55,55,22,25,2,30
10679,Air India,Kolkata,Banglore,20:45,2h 35m,non-stop,No info,4145,27,4,45,45,23,20,2,35
10680,Jet Airways,Banglore,Delhi,08:20,3h,non-stop,No info,7229,27,4,20,20,11,20,3,0
10681,Vistara,Banglore,New Delhi,11:30,2h 40m,non-stop,No info,12648,1,3,30,30,14,10,2,40


In [36]:
df

Unnamed: 0,Airline,Source,Destination,Dep_Time,Duration,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,22:20,2h 50m,non-stop,No info,3897,24,3,20,20,1,10,2,50
1,Air India,Kolkata,Banglore,05:50,7h 25m,2 stops,No info,7662,1,5,50,50,13,15,7,25
2,Jet Airways,Delhi,Cochin,09:25,19h,2 stops,No info,13882,9,6,25,25,4,25,19,0
3,IndiGo,Kolkata,Banglore,18:05,5h 25m,1 stop,No info,6218,12,5,5,5,23,30,5,25
4,IndiGo,Banglore,New Delhi,16:50,4h 45m,1 stop,No info,13302,1,3,50,50,21,35,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,19:55,2h 30m,non-stop,No info,4107,9,4,55,55,22,25,2,30
10679,Air India,Kolkata,Banglore,20:45,2h 35m,non-stop,No info,4145,27,4,45,45,23,20,2,35
10680,Jet Airways,Banglore,Delhi,08:20,3h,non-stop,No info,7229,27,4,20,20,11,20,3,0
10681,Vistara,Banglore,New Delhi,11:30,2h 40m,non-stop,No info,12648,1,3,30,30,14,10,2,40


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10682 entries, 0 to 10682
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10682 non-null  object
 1   Source           10682 non-null  object
 2   Destination      10682 non-null  object
 3   Dep_Time         10682 non-null  object
 4   Duration         10682 non-null  object
 5   Total_Stops      10681 non-null  object
 6   Additional_Info  10682 non-null  object
 7   Price            10682 non-null  int64 
 8   date             10682 non-null  int64 
 9   month            10682 non-null  int64 
 10  dep_minute       10682 non-null  int64 
 11  dep_hour         10682 non-null  int64 
 12  Arrival_hours    10682 non-null  int64 
 13  Arrival_minutes  10682 non-null  int64 
 14  Duration_hour    10682 non-null  int64 
 15  Duration_minute  10682 non-null  int64 
dtypes: int64(9), object(7)
memory usage: 1.4+ MB


In [38]:
df.drop('Duration',axis=1,inplace=True)

In [39]:
df

Unnamed: 0,Airline,Source,Destination,Dep_Time,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,22:20,non-stop,No info,3897,24,3,20,20,1,10,2,50
1,Air India,Kolkata,Banglore,05:50,2 stops,No info,7662,1,5,50,50,13,15,7,25
2,Jet Airways,Delhi,Cochin,09:25,2 stops,No info,13882,9,6,25,25,4,25,19,0
3,IndiGo,Kolkata,Banglore,18:05,1 stop,No info,6218,12,5,5,5,23,30,5,25
4,IndiGo,Banglore,New Delhi,16:50,1 stop,No info,13302,1,3,50,50,21,35,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,19:55,non-stop,No info,4107,9,4,55,55,22,25,2,30
10679,Air India,Kolkata,Banglore,20:45,non-stop,No info,4145,27,4,45,45,23,20,2,35
10680,Jet Airways,Banglore,Delhi,08:20,non-stop,No info,7229,27,4,20,20,11,20,3,0
10681,Vistara,Banglore,New Delhi,11:30,non-stop,No info,12648,1,3,30,30,14,10,2,40


## Working with Total_Stops feature

In [40]:
df['Total_Stops'].value_counts()

1 stop      5625
non-stop    3491
2 stops     1519
3 stops       45
4 stops        1
Name: Total_Stops, dtype: int64

In [41]:
df['Total_Stops'].isnull().sum()

1

In [42]:
df['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [43]:
df['Total_Stops'].mode()

0    1 stop
Name: Total_Stops, dtype: object

### Filling nan value of Total_Stops

In [44]:
df['Total_Stops'].fillna('1 stop',inplace=True)

In [45]:
df['Total_Stops'].isnull().sum()

0

In [46]:
df['Total_Stops'].value_counts()

1 stop      5626
non-stop    3491
2 stops     1519
3 stops       45
4 stops        1
Name: Total_Stops, dtype: int64

### renaming Total_Stops features categories name

In [47]:
c = pd.Categorical(df['Total_Stops'])
# c.rename_categories({'1 stop':'1-stop', '2 stops':'2-stop', '3 stops':'3-stop', '4 stops':'4-stop'},inplace=True)
c
# this didnot worked

['non-stop', '2 stops', '2 stops', '1 stop', '1 stop', ..., 'non-stop', 'non-stop', 'non-stop', 'non-stop', '2 stops']
Length: 10682
Categories (5, object): ['1 stop', '2 stops', '3 stops', '4 stops', 'non-stop']

In [48]:
df['Total_Stops'].isnull().sum()

0

In [49]:
df['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', '4 stops'],
      dtype=object)

In [50]:
df['Total_Stops']= df['Total_Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})

In [51]:
df['Total_Stops'].isnull().sum()

0

In [52]:
df['Total_Stops'].value_counts()

1    5626
0    3491
2    1519
3      45
4       1
Name: Total_Stops, dtype: int64

In [53]:
df['Total_Stops'].dtype

dtype('int64')

In [54]:
df.drop('Dep_Time',axis=1,inplace=True)

In [55]:
df

Unnamed: 0,Airline,Source,Destination,Total_Stops,Additional_Info,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes,Duration_hour,Duration_minute
0,IndiGo,Banglore,New Delhi,0,No info,3897,24,3,20,20,1,10,2,50
1,Air India,Kolkata,Banglore,2,No info,7662,1,5,50,50,13,15,7,25
2,Jet Airways,Delhi,Cochin,2,No info,13882,9,6,25,25,4,25,19,0
3,IndiGo,Kolkata,Banglore,1,No info,6218,12,5,5,5,23,30,5,25
4,IndiGo,Banglore,New Delhi,1,No info,13302,1,3,50,50,21,35,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,0,No info,4107,9,4,55,55,22,25,2,30
10679,Air India,Kolkata,Banglore,0,No info,4145,27,4,45,45,23,20,2,35
10680,Jet Airways,Banglore,Delhi,0,No info,7229,27,4,20,20,11,20,3,0
10681,Vistara,Banglore,New Delhi,0,No info,12648,1,3,30,30,14,10,2,40


In [56]:
df['Total_Stops'].value_counts()

1    5626
0    3491
2    1519
3      45
4       1
Name: Total_Stops, dtype: int64

In [57]:
df['Additional_Info'].value_counts()

No info                         8344
In-flight meal not included     1982
No check-in baggage included     320
1 Long layover                    19
Change airports                    7
Business class                     4
No Info                            3
1 Short layover                    1
Red-eye flight                     1
2 Long layover                     1
Name: Additional_Info, dtype: int64

### Doing OneHotEncoding in Airline,Source,Destination ,Additional_Info features

In [58]:
encoder = OneHotEncoder()
encoder

In [59]:
Airline_encoded  = encoder.fit_transform(df[['Airline','Source','Destination','Additional_Info']])

In [60]:
encoded_df = pd.DataFrame(data=Airline_encoded.toarray(),columns=encoder.get_feature_names_out())
encoded_df

Unnamed: 0,Airline_Air Asia,Airline_Air India,Airline_GoAir,Airline_IndiGo,Airline_Jet Airways,Airline_Jet Airways Business,Airline_Multiple carriers,Airline_Multiple carriers Premium economy,Airline_SpiceJet,Airline_Trujet,...,Additional_Info_1 Long layover,Additional_Info_1 Short layover,Additional_Info_2 Long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10678,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10679,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10680,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [61]:
final_df = pd.concat(objs=[df,encoded_df],axis=1)

In [62]:
final_df.drop(labels=['Airline','Source','Destination','Additional_Info'],axis=1,inplace=True)

In [73]:
final_df.drop(index=6474,inplace=True)
df.isnull().sum()

Airline            0
Source             0
Destination        0
Total_Stops        0
Additional_Info    0
Price              0
date               0
month              0
dep_minute         0
dep_hour           0
Arrival_hours      0
Arrival_minutes    0
Duration_hour      0
Duration_minute    0
dtype: int64

In [75]:
final_df

Unnamed: 0,Total_Stops,Price,date,month,dep_minute,dep_hour,Arrival_hours,Arrival_minutes,Duration_hour,Duration_minute,...,Additional_Info_1 Long layover,Additional_Info_1 Short layover,Additional_Info_2 Long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No Info,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,0.0,3897.0,24.0,3.0,20.0,20.0,1.0,10.0,2.0,50.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,2.0,7662.0,1.0,5.0,50.0,50.0,13.0,15.0,7.0,25.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2.0,13882.0,9.0,6.0,25.0,25.0,4.0,25.0,19.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,6218.0,12.0,5.0,5.0,5.0,23.0,30.0,5.0,25.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,13302.0,1.0,3.0,50.0,50.0,21.0,35.0,4.0,45.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,0.0,4107.0,9.0,4.0,55.0,55.0,22.0,25.0,2.0,30.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10679,0.0,4145.0,27.0,4.0,45.0,45.0,23.0,20.0,2.0,35.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10680,0.0,7229.0,27.0,4.0,20.0,20.0,11.0,20.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
10681,0.0,12648.0,1.0,3.0,30.0,30.0,14.0,10.0,2.0,40.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
