In [1]:
## EDA And Feature Engineering Flight Price Prediction
### FEATURES
"""The various features of the cleaned dataset are explained below:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that gives the total travel time between cities, in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable; it stores the ticket price."""
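
The two derived features described above (time-of-day bins and days left) can be sketched as follows. This is a minimal illustration on made-up data: the column names, bin edges, and labels are assumptions, since the actual cut points used to build the cleaned dataset are not stated here.

```python
import pandas as pd

# Hypothetical sample: departure hour (0-23) plus booking and trip dates.
sample = pd.DataFrame({
    "dep_hour": [1, 5, 9, 14, 19, 23],
    "booking_date": pd.to_datetime(["2022-02-01"] * 6),
    "trip_date": pd.to_datetime([
        "2022-02-11", "2022-02-02", "2022-03-01",
        "2022-02-15", "2022-02-21", "2022-02-08",
    ]),
})

# Assumed bin edges and labels (six bins of four hours each).
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]

# right=False makes each bin include its left edge: [0, 4), [4, 8), ...
sample["departure_time"] = pd.cut(
    sample["dep_hour"], bins=edges, labels=labels, right=False
)

# Days left = trip date minus booking date, in whole days.
sample["days_left"] = (sample["trip_date"] - sample["booking_date"]).dt.days

print(sample[["dep_hour", "departure_time", "days_left"]])
```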

"The various features of the cleaned dataset are explained below:\n1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.\n2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.\n3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.\n4) Departure Time: This is a derived categorical feature obtained created by grouping time periods into bins. It stores information about the departure time and have 6 unique time labels.\n5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.\n6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.\n7) Destination City: City where the flight will land. It is a categorical feature having 

In [21]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder

In [3]:
df = pd.read_excel("flight_price.xlsx")

In [4]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [5]:
# Converting all categorical values to numerical values

In [6]:
# Converting Date_of_Journey (day/month/year strings) to numerical values
df['Date']= df['Date_of_Journey'].str.split('/').str[0]
df['Month']= df['Date_of_Journey'].str.split('/').str[1]
df['Year']= df['Date_of_Journey'].str.split('/').str[2]
df['Date']= df['Date'].astype(int)
df['Month']= df['Month'].astype(int)
df['Year']= df['Year'].astype(int)
df.drop('Date_of_Journey',axis=1,inplace=True)
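
As an alternative to splitting the string by hand, the same three columns could be obtained by parsing the date once and reading its components. The sketch below runs on a small made-up frame (`demo` is a hypothetical name) and assumes every value follows the day/month/year layout seen in the preview above.

```python
import pandas as pd

# Hypothetical sample in the same day/month/year layout as Date_of_Journey.
demo = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Parse once with an explicit format, then read components from the .dt accessor.
parsed = pd.to_datetime(demo["Date_of_Journey"], format="%d/%m/%Y")
demo["Date"] = parsed.dt.day
demo["Month"] = parsed.dt.month
demo["Year"] = parsed.dt.year

demo = demo.drop(columns="Date_of_Journey")
print(demo)
```

An explicit `format` makes a malformed date raise an error instead of being silently misread as month/day/year.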

In [7]:
# Converting Arrival_Time to Numerical Values
df['Arrival_hours']=df['Arrival_Time'].str.split(' ').str[0].str.split(':').str[0]
df['Arrival_minutes']=df['Arrival_Time'].str.split(' ').str[0].str.split(':').str[1]
df['Arrival_hours']= df['Arrival_hours'].astype(int)
df['Arrival_minutes']= df['Arrival_minutes'].astype(int)
df.drop('Arrival_Time',axis=1,inplace=True)

In [8]:
# Converting Dep_Time to Numerical Values
df['Departure_hours'] = df['Dep_Time'].str.split(':').str[0]
df['Departure_minutes'] = df['Dep_Time'].str.split(':').str[1]
df['Departure_hours'] = df['Departure_hours'].astype(int)
df['Departure_minutes'] = df['Departure_minutes'].astype(int)
df.drop('Dep_Time',axis=1,inplace=True)
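
The arrival and departure cells repeat the same split-and-cast steps. One possible way to share that logic is a small helper built on `Series.str.extract`; `split_clock` and the `demo` frame are hypothetical names used only for this sketch. The pattern reads the leading `HH:MM` and ignores any trailing date such as `22 Mar`.

```python
import pandas as pd

# Hypothetical sample mirroring the two time columns in the preview above.
demo = pd.DataFrame({
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10 22 Mar", "13:15"],
})

def split_clock(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Return <prefix>_hours and <prefix>_minutes parsed from a leading 'HH:MM'."""
    parts = series.str.extract(r"^(\d{1,2}):(\d{2})").astype(int)
    parts.columns = [f"{prefix}_hours", f"{prefix}_minutes"]
    return parts

demo = demo.join(split_clock(demo["Dep_Time"], "Departure"))
demo = demo.join(split_clock(demo["Arrival_Time"], "Arrival"))
demo = demo.drop(columns=["Dep_Time", "Arrival_Time"])
print(demo)
```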

In [9]:
# Converting Duration to Numerical Values
df['Duration_hours']=df['Duration'].str.split(' ').str[0].str.split('h').str[0]
df['Duration_minutes']=df['Duration'].str.split(' ').str[1].str.split('m').str[0]
# Row 6474 is set by hand to 0 hours and 5 minutes. Using .loc performs the
# assignment in a single step, so it updates df itself rather than a temporary
# copy (chained assignment stops working under Copy-on-Write in pandas 3.0).
df.loc[6474, 'Duration_hours'] = '0'
df.loc[6474, 'Duration_minutes'] = '5'
# Durations without a minutes part (e.g. "19h") have no second token, so fill with 0
df['Duration_minutes'] = df['Duration_minutes'].fillna(0)
df['Duration_minutes'] = df['Duration_minutes'].astype(int)
df['Duration_hours'] = df['Duration_hours'].astype(int)
df.drop('Duration',axis=1,inplace=True)
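
A pattern-based parse is one way to avoid patching individual rows by index. The sketch below is an assumption-laden illustration on made-up strings: it presumes every duration is some combination of `<n>h` and `<n>m`, and the name `Duration_total_minutes` is introduced here only for the example.

```python
import pandas as pd

# Hypothetical sample covering the shapes a duration string can take.
durations = pd.Series(["2h 50m", "19h", "5m", "7h 25m"])

# Extract each part independently; a missing part becomes NaN and is filled with "0".
hours = durations.str.extract(r"(\d+)\s*h")[0].fillna("0").astype(int)
minutes = durations.str.extract(r"(\d+)\s*m")[0].fillna("0").astype(int)

demo = pd.DataFrame({"Duration_hours": hours, "Duration_minutes": minutes})

# Optional single numeric feature combining both parts.
demo["Duration_total_minutes"] = demo["Duration_hours"] * 60 + demo["Duration_minutes"]
print(demo)
```

Because hours and minutes are matched separately, strings with only one of the two parts need no special handling.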

In [11]:
df['Total_Stops'].unique()

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [13]:
# The mode of Total_Stops is '1 stop', so NaN values will be filled with 1
df['Total_Stops'].mode()

0    1 stop
Name: Total_Stops, dtype: object

In [14]:
df['Total_Stops'] = df['Total_Stops'].map({'non-stop':0, '2 stops':2, '1 stop':1, '3 stops':3,'4 stops':4,np.nan:1})

In [18]:
df['Total_Stops'].isnull().sum()

np.int64(0)
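
The fill value above is hard-coded to 1 after inspecting the mode. A variant that computes the mode from the data and then applies the same label-to-integer mapping might look like this; the sample series and the names `stop_map`, `filled`, and `encoded` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing one missing value.
stops = pd.Series(["non-stop", "2 stops", "1 stop", np.nan, "1 stop", "4 stops"])

# Fill missing entries with the most frequent label, computed from the data.
most_common = stops.mode().iloc[0]
filled = stops.fillna(most_common)

# Same label-to-integer mapping as in the notebook, without a NaN key.
stop_map = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}
encoded = filled.map(stop_map)

# Any label missing from stop_map would surface here as NaN.
assert encoded.notna().all()
print(encoded.tolist())
```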

In [22]:
df['Airline'].unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [23]:
df['Source'].unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

In [24]:
df['Destination'].unique()

array(['New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'],
      dtype=object)

In [28]:
# Some values are repeated and differ only in case ('No info' vs 'No Info'); they need cleaning
df['Additional_Info'].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)

In [33]:
df['Additional_Info'] = df['Additional_Info'].str.capitalize()

In [34]:
df['Additional_Info'].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 short layover',
       '1 long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 long layover'], dtype=object)
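
A small illustration of what the normalization does, on made-up values: `str.capitalize()` upper-cases the first character and lower-cases the rest, which is why `1 Short layover` became `1 short layover` above. The extra `str.strip()` is an assumption added here in case stray whitespace is present; the notebook does not show any.

```python
import pandas as pd

# Hypothetical values with case and whitespace variants of the same label.
info = pd.Series(["No info", "No Info", " no info ", "1 Long layover"])

# Strip surrounding whitespace, then normalize case.
cleaned = info.str.strip().str.capitalize()

print(cleaned.unique())
```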

In [25]:
encoder = OneHotEncoder()

In [35]:
encoder.fit_transform(df[['Airline','Source','Destination','Additional_Info']]).toarray()

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.]], shape=(10683, 32))

In [37]:
df1 = pd.DataFrame(encoder.fit_transform(df[['Airline','Source','Destination','Additional_Info']]).toarray(),columns=encoder.get_feature_names_out())

In [42]:
df = pd.concat([df,df1],axis=1).reset_index(drop=True)

In [46]:
df.drop(columns= ['Airline','Destination','Source','Route','Additional_Info'],inplace=True)
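
The encode, concatenate, and drop steps above can also be done in one call with `pandas.get_dummies`. This is a different technique from the `OneHotEncoder` used in the notebook, not a description of it, and it is shown on a made-up frame. Unlike a fitted encoder, `get_dummies` does not remember the categories, so it is less suited to transforming new data later.

```python
import pandas as pd

# Hypothetical sample with two categorical columns and the target.
demo = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo"],
    "Source": ["Banglore", "Kolkata", "Delhi"],
    "Price": [3897, 7662, 6218],
})

# Replaces each listed column with integer indicator columns named <column>_<value>.
encoded = pd.get_dummies(demo, columns=["Airline", "Source"], dtype=int)

print(encoded.columns.tolist())
```

Passing `dtype=int` yields integer indicators directly, so no separate float-to-int conversion is needed in this variant.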

In [50]:
# Converting all float columns (the one-hot indicator columns) to int
df[df.select_dtypes(include='float').columns] = df.select_dtypes(include='float').astype(int)

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 43 columns):
 #   Column                                        Non-Null Count  Dtype
---  ------                                        --------------  -----
 0   Total_Stops                                   10683 non-null  int64
 1   Price                                         10683 non-null  int64
 2   Date                                          10683 non-null  int64
 3   Month                                         10683 non-null  int64
 4   Year                                          10683 non-null  int64
 5   Arrival_hours                                 10683 non-null  int64
 6   Arrival_minutes                               10683 non-null  int64
 7   Departure_hours                               10683 non-null  int64
 8   Departure_minutes                             10683 non-null  int64
 9   Duration_hours                                10683 non-null  int64
 10  Duration_m

In [53]:
df

Unnamed: 0,Total_Stops,Price,Date,Month,Year,Arrival_hours,Arrival_minutes,Departure_hours,Departure_minutes,Duration_hours,...,Destination_New Delhi,Additional_Info_1 long layover,Additional_Info_1 short layover,Additional_Info_2 long layover,Additional_Info_Business class,Additional_Info_Change airports,Additional_Info_In-flight meal not included,Additional_Info_No check-in baggage included,Additional_Info_No info,Additional_Info_Red-eye flight
0,0,3897,24,3,2019,1,10,22,20,2,...,1,0,0,0,0,0,0,0,1,0
1,2,7662,1,5,2019,13,15,5,50,7,...,0,0,0,0,0,0,0,0,1,0
2,2,13882,9,6,2019,4,25,9,25,19,...,0,0,0,0,0,0,0,0,1,0
3,1,6218,12,5,2019,23,30,18,5,5,...,0,0,0,0,0,0,0,0,1,0
4,1,13302,1,3,2019,21,35,16,50,4,...,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,0,4107,9,4,2019,22,25,19,55,2,...,0,0,0,0,0,0,0,0,1,0
10679,0,4145,27,4,2019,23,20,20,45,2,...,0,0,0,0,0,0,0,0,1,0
10680,0,7229,27,4,2019,11,20,8,20,3,...,0,0,0,0,0,0,0,0,1,0
10681,0,12648,1,3,2019,14,10,11,30,2,...,1,0,0,0,0,0,0,0,1,0
