## EDA and Feature Engineering: Flight Price Prediction

Check the dataset information at the link below:
https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction


### FEATURES

The various features of the cleaned dataset are explained below:

1. Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2. Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3. Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4. Departure Time: A derived categorical feature created by grouping time periods into bins. It stores the departure time and has 6 unique time labels.
5. Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6. Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7. Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8. Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9. Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10. Days Left: A derived feature calculated by subtracting the booking date from the trip date.
11. Price: The target variable; it stores the ticket price.
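
Two of the features listed above are derived rather than recorded: the time-of-day labels come from binning, and Days Left is a date difference. The sketch below shows one way such columns could be built. The bin edges, label names, column names (`dep_hour`, `booking_date`, `trip_date`) and toy values are all assumptions for illustration, not the dataset's actual definitions.

```python
import pandas as pd

# Toy data; column names and values are hypothetical, not taken from the dataset.
toy = pd.DataFrame({
    "dep_hour": [2, 5, 9, 14, 19, 22],
    "booking_date": pd.to_datetime(["2022-02-10"] * 6),
    "trip_date": pd.to_datetime(["2022-02-11", "2022-02-12", "2022-02-20",
                                 "2022-03-01", "2022-03-15", "2022-03-31"]),
})

# Assumed bin edges: six 4-hour windows covering the day, left-inclusive.
edges = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]
toy["departure_time"] = pd.cut(toy["dep_hour"], bins=edges, labels=labels, right=False)

# Days left = trip date minus booking date, in whole days.
toy["days_left"] = (toy["trip_date"] - toy["booking_date"]).dt.days
print(toy[["dep_hour", "departure_time", "days_left"]])
```

Using `right=False` keeps hour 0 inside the first bin; with the default right-inclusive bins, midnight departures would fall outside every interval.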


In [1]:
# Import the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df=pd.read_excel('flight_price.xlsx')
df.head()

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
| 1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
| 2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
| 3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
| 4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |


In [None]:
df.tail()

In [None]:
## Get basic information about the data
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
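## Feature Engineering
# Date_of_Journey is a 'DD/MM/YYYY' string; split it into day, month and year parts.
# Note: the 'Date' column holds the day of the month.
df['Date']=df['Date_of_Journey'].str.split('/').str[0]
df['Month']=df['Date_of_Journey'].str.split('/').str[1]
df['Year']=df['Date_of_Journey'].str.split('/').str[2]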

In [None]:
df.info()

In [None]:
df['Date']=df['Date'].astype(int)
df['Month']=df['Month'].astype(int)
df['Year']=df['Year'].astype(int)
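
The split-and-cast steps above can also be done in one pass by parsing the column as a datetime. This is a sketch of that alternative rather than what the notebook runs; it assumes every value follows the day/month/year layout seen in the `df.head()` output, and it uses a toy frame so it can run on its own.

```python
import pandas as pd

# Toy frame using 'Date_of_Journey' values shown in the df.head() output.
toy = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# Parse the day-first strings explicitly, then read the components as integers.
journey = pd.to_datetime(toy["Date_of_Journey"], format="%d/%m/%Y")
toy["Date"] = journey.dt.day
toy["Month"] = journey.dt.month
toy["Year"] = journey.dt.year
print(toy)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first readings of values such as `1/05/2019`, and a malformed date raises an error instead of silently producing a wrong integer.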

In [None]:
df.info()

In [None]:
## Drop Date Of Journey

df.drop('Date_of_Journey',axis=1,inplace=True)

In [None]:
df.head()

In [None]:
# Some arrival times carry a date suffix (e.g. '01:10 22 Mar'); keep only the 'HH:MM' part.
df['Arrival_Time']=df['Arrival_Time'].apply(lambda x:x.split(' ')[0])

In [None]:
df['Arrival_hour']=df['Arrival_Time'].str.split(':').str[0]
df['Arrival_min']=df['Arrival_Time'].str.split(':').str[1]

In [None]:
df.head(2)

In [None]:
df['Arrival_hour']=df['Arrival_hour'].astype(int)
df['Arrival_min']=df['Arrival_min'].astype(int)

In [None]:
df.drop('Arrival_Time',axis=1,inplace=True)

In [None]:
df.head(2)

In [None]:
df['Departure_hour']=df['Dep_Time'].str.split(':').str[0]
df['Departure_min']=df['Dep_Time'].str.split(':').str[1]

In [None]:
df['Departure_hour']=df['Departure_hour'].astype(int)
df['Departure_min']=df['Departure_min'].astype(int)
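
The arrival and departure columns go through the same split, cast and drop steps, so the logic could be factored into one helper. The sketch below is an assumption about how that might look; `split_clock` is a hypothetical name, and the toy values are taken from the `df.head()` output.

```python
import pandas as pd

def split_clock(series: pd.Series) -> pd.DataFrame:
    """Return integer hour and minute columns from strings that start with 'HH:MM'."""
    # The regex ignores any trailing text such as ' 22 Mar'.
    parts = series.str.extract(r"^(\d{1,2}):(\d{2})")
    parts.columns = ["hour", "min"]
    return parts.astype(int)

toy = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15"],
                    "Dep_Time": ["22:20", "05:50"]})
arrival = split_clock(toy["Arrival_Time"]).add_prefix("Arrival_")
departure = split_clock(toy["Dep_Time"]).add_prefix("Departure_")
toy = pd.concat([toy, arrival, departure], axis=1)
print(toy)
```

The resulting column names (`Arrival_hour`, `Arrival_min`, `Departure_hour`, `Departure_min`) match those created cell by cell above.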

In [None]:
df.info()

In [None]:
df.drop('Dep_Time',axis=1,inplace=True)

In [None]:
df.head(2)

In [None]:
df['Total_Stops'].unique()

In [None]:
df[df['Total_Stops'].isnull()]

In [None]:
df[df['Route'].isnull()]

In [None]:
df['Total_Stops'].mode()

In [None]:
df['Total_Stops'].unique()

In [None]:
# Map the Total_Stops labels to integers and fill the missing value with 1, the code for the mode ('1 stop').
df['Total_Stops']=df['Total_Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4,np.nan:1})
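
The mapping above hard-codes the fill value for missing entries. A variant that derives the most frequent label from the data is sketched below; it runs on a toy column, and `stop_codes` and `most_common` are illustrative names rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({"Total_Stops": ["non-stop", "1 stop", "2 stops", "1 stop", np.nan]})

stop_codes = {"non-stop": 0, "1 stop": 1, "2 stops": 2, "3 stops": 3, "4 stops": 4}

# Take the most frequent label from the data instead of hard-coding it.
most_common = toy["Total_Stops"].mode()[0]

# Fill first, then map, so the result contains no NaN and can be cast to int.
toy["Total_Stops"] = toy["Total_Stops"].fillna(most_common).map(stop_codes).astype(int)
print(toy["Total_Stops"].tolist())
```

Filling before mapping keeps the column as integers; if a label outside `stop_codes` appeared, the final `astype(int)` would fail loudly rather than leave a silent NaN.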

In [None]:
df[df['Total_Stops'].isnull()]

In [None]:
df.head(2)

In [None]:
df.drop('Route',axis=1,inplace=True)

In [None]:
df.head(5)

In [None]:
df['Duration'].str.split(' ').str[0].str.split('h').str[0]
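
The expression above only previews the hour part and does not store a result. One possible way to turn `Duration` into a single numeric column is sketched below. It assumes values look like `2h 50m`, `19h` or `5m`; the `5m` case is a hypothetical edge case added for illustration, and `Duration_min` is an assumed column name.

```python
import pandas as pd

# Toy values; '5m' is a hypothetical minutes-only edge case.
toy = pd.DataFrame({"Duration": ["2h 50m", "7h 25m", "19h", "5m"]})

# Capture the optional hour and minute parts; a missing part becomes NaN.
parts = toy["Duration"].str.extract(r"(?:(\d+)h)?\s*(?:(\d+)m)?")
parts = parts.apply(pd.to_numeric).fillna(0).astype(int)

# Total duration in minutes.
toy["Duration_min"] = parts[0] * 60 + parts[1]
print(toy)
```

Splitting on `'h'` alone would return `'5m'` as the hour for a minutes-only value, which is why the sketch matches the two parts separately.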

In [None]:
df['Airline'].unique()

In [None]:
df['Source'].unique()

In [None]:
df['Additional_Info'].unique()

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
encoder=OneHotEncoder()

In [None]:
encoder.fit_transform(df[['Airline','Source','Destination']]).toarray()

In [None]:
# One-hot encode the three categorical columns and label the resulting columns.
encoded=encoder.fit_transform(df[['Airline','Source','Destination']]).toarray()
pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

### a. Does price vary with Airlines?
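
One possible starting point for this question is a per-airline price summary. The sketch below uses the five rows from the `df.head()` output as toy data, so the numbers it prints say nothing about the full dataset; on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Toy rows copied from the df.head() output; on the real data, use df directly.
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo", "IndiGo"],
    "Price": [3897, 7662, 13882, 6218, 13302],
})

# Count, mean and median price per airline, highest mean first.
summary = (toy.groupby("Airline")["Price"]
              .agg(["count", "mean", "median"])
              .sort_values("mean", ascending=False))
print(summary)
```

Reporting the median next to the mean helps when a few expensive tickets skew an airline's average.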

### b. How is the price affected when tickets are bought just 1 or 2 days before departure?

### c. Does ticket price change based on the departure time and arrival time?

### d. How does the price change with the Source and Destination?
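
A route-level view could be built with a pivot table of mean price by source and destination. This is a sketch on toy rows from the `df.head()` output, not a result for the full dataset; `route_prices` is an illustrative name.

```python
import pandas as pd

# Toy rows copied from the df.head() output; on the real data, use df directly.
toy = pd.DataFrame({
    "Source": ["Banglore", "Kolkata", "Delhi", "Kolkata", "Banglore"],
    "Destination": ["New Delhi", "Banglore", "Cochin", "Banglore", "New Delhi"],
    "Price": [3897, 7662, 13882, 6218, 13302],
})

# Mean price for each Source x Destination pair; unseen routes appear as NaN.
route_prices = toy.pivot_table(index="Source", columns="Destination",
                               values="Price", aggfunc="mean")
print(route_prices)
```

The resulting grid can be passed to `sns.heatmap` for a visual comparison of routes.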

### e. How does the ticket price vary between Economy and Business class?
