## EDA And Feature Engineering Flight Price Prediction
Details about the dataset are available on Kaggle:
https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction

### FEATURES
The various features of the cleaned dataset are explained below:
---
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that gives the total travel time between cities, in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable; it stores the ticket price.
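
The two derived features described above (Days Left and the binned Departure Time) can be sketched as follows. This is only an illustration: the column names `booking_date`, `journey_date`, and `dep_hour`, the sample values, the bin edges, and the bin labels are all assumptions, since the source does not say how the bins were defined.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are made up for illustration.
sample = pd.DataFrame({
    'booking_date': pd.to_datetime(['2022-02-10', '2022-02-01']),
    'journey_date': pd.to_datetime(['2022-02-11', '2022-03-01']),
    'dep_hour': [6, 21],
})

# Days Left = trip date minus booking date, expressed in whole days.
sample['days_left'] = (sample['journey_date'] - sample['booking_date']).dt.days

# Group the departure hour into six bins (edges and labels are assumed, not from the dataset).
bin_edges = [0, 4, 8, 12, 16, 20, 24]
bin_labels = ['Late_Night', 'Early_Morning', 'Morning', 'Afternoon', 'Evening', 'Night']
sample['departure_time'] = pd.cut(
    sample['dep_hour'], bins=bin_edges, labels=bin_labels, right=False
)

print(sample[['days_left', 'departure_time']])
```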

In [1]:
# Import the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df=pd.read_excel('flight_price.xlsx')
df.head()

In [None]:
df.tail()

In [None]:
# Get basic information about the data
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [7]:
## Feature Engineering
# Split Date_of_Journey (day/month/year) into separate Date, Month and Year columns
df['Date']=df['Date_of_Journey'].str.split('/').str[0]
df['Month']=df['Date_of_Journey'].str.split('/').str[1]
df['Year']=df['Date_of_Journey'].str.split('/').str[2]

In [None]:
df.info()

In [9]:
df['Date']=df['Date'].astype(int)
df['Month']=df['Month'].astype(int)
df['Year']=df['Year'].astype(int)
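
As an alternative to splitting the string and casting each part, the same three columns could be derived by parsing the date once with `pd.to_datetime`. This is a minimal sketch on made-up sample values, assuming the `Date_of_Journey` strings follow a day/month/year layout (which the split order above suggests).

```python
import pandas as pd

# Made-up sample values in the assumed day/month/year layout.
sample = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '01/05/2019']})

# Parse once, then read the integer components from the .dt accessor.
parsed = pd.to_datetime(sample['Date_of_Journey'], format='%d/%m/%Y')
sample['Date'] = parsed.dt.day
sample['Month'] = parsed.dt.month
sample['Year'] = parsed.dt.year

print(sample)
```

Parsing with an explicit `format` raises an error on malformed dates instead of silently producing odd string fragments, which makes bad rows easier to spot.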

In [None]:
df.info()

In [11]:
## Drop Date Of Journey

df.drop('Date_of_Journey',axis=1,inplace=True)

In [None]:
df.head()

In [13]:
df['Arrival_Time']=df['Arrival_Time'].apply(lambda x:x.split(' ')[0])

In [None]:
df.head()

In [15]:
df['Arrival_hour']=df['Arrival_Time'].str.split(':').str[0]
df['Arrival_min']=df['Arrival_Time'].str.split(':').str[1]

In [None]:
df.head(2)

In [None]:
df.info()

In [21]:
df['Arrival_hour']=df['Arrival_hour'].astype(int)
df['Arrival_min']=df['Arrival_min'].astype(int)
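
The arrival-time steps above (drop the trailing date, split on `:`, cast to int) could also be done in one pass with a regular expression. A minimal sketch, assuming raw values look like `'01:10 22 Mar'` or `'13:15'`; the sample values are made up.

```python
import pandas as pd

# Made-up sample values; some carry a trailing date, some do not.
sample = pd.DataFrame({'Arrival_Time': ['01:10 22 Mar', '13:15', '04:25 10 Jun']})

# Named groups become the column names of the extracted DataFrame.
parts = sample['Arrival_Time'].str.extract(
    r'^(?P<Arrival_hour>\d{1,2}):(?P<Arrival_min>\d{2})'
)

# Cast the captured strings to integers and attach them to the original frame.
sample = sample.join(parts.astype(int))

print(sample)
```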

In [22]:
df.drop('Arrival_Time',axis=1,inplace=True)

In [None]:
df.head(2)

In [24]:
df['Departure_hour']=df['Dep_Time'].str.split(':').str[0]
df['Departure_min']=df['Dep_Time'].str.split(':').str[1]

In [25]:
df['Departure_hour']=df['Departure_hour'].astype(int)
df['Departure_min']=df['Departure_min'].astype(int)

In [None]:
df.info()

In [27]:
df.drop('Dep_Time',axis=1,inplace=True)

In [None]:
df.head(2)

In [None]:
df['Total_Stops'].unique()

In [None]:
# Show the rows where the Total_Stops column has missing values (NaN).
df[df['Total_Stops'].isnull()]

In [None]:
# Find the mode (the most frequent value) of the Total_Stops column.
df['Total_Stops'].mode()

In [None]:
df['Total_Stops'].unique()

In [35]:
# Map the categorical Total_Stops labels to integers.
# Missing values (np.nan) are replaced with 1.
df['Total_Stops']=df['Total_Stops'].map({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4,np.nan:1})
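
The mapping and the imputation can also be written as two explicit steps, so the fill value comes from the data rather than being hard-coded. This is a sketch on made-up sample values, not the notebook's own method; it fills missing entries with the mode of the mapped column.

```python
import numpy as np
import pandas as pd

# Made-up sample values, including one missing entry.
sample = pd.DataFrame({'Total_Stops': ['non-stop', '1 stop', np.nan, '2 stops', '1 stop']})

stop_map = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

# Step 1: map labels to numbers; missing values stay NaN.
mapped = sample['Total_Stops'].map(stop_map)

# Step 2: fill NaN with the most frequent value, then cast to int.
most_common = mapped.mode()[0]
sample['Total_Stops'] = mapped.fillna(most_common).astype(int)

print(sample)
```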

In [None]:
df[df['Total_Stops'].isnull()]

In [None]:
df.head(2)

In [40]:
df.drop('Route',axis=1,inplace=True)

In [None]:
df.head(2)

In [None]:
df['Duration'].str.split(' ').str[0].str.split('h').str[0]
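
The expression above only pulls out the hours part. One possible way to turn the whole `Duration` string into a single numeric feature is sketched below, assuming values look like `'2h 50m'`, `'19h'`, or `'5m'`; the sample values and the `Duration_minutes` column name are hypothetical.

```python
import pandas as pd

# Made-up sample values covering hours+minutes, hours only, and minutes only.
sample = pd.DataFrame({'Duration': ['2h 50m', '19h', '5m']})

# Extract each component separately; a missing component becomes 0.
hours = pd.to_numeric(
    sample['Duration'].str.extract(r'(\d+)\s*h', expand=False)
).fillna(0).astype(int)
minutes = pd.to_numeric(
    sample['Duration'].str.extract(r'(\d+)\s*m', expand=False)
).fillna(0).astype(int)

# Total duration in minutes (hypothetical column name).
sample['Duration_minutes'] = hours * 60 + minutes

print(sample)
```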

In [None]:
df['Airline'].unique()

In [None]:
df['Source'].unique()

In [None]:
df['Additional_Info'].unique()

In [46]:
from sklearn.preprocessing import OneHotEncoder

In [47]:
encoder=OneHotEncoder()

In [None]:
# One-hot encode the categorical columns Airline, Source and Destination.
# fit_transform() learns the categories and returns a sparse matrix; toarray() converts it to a dense NumPy array.
encoder.fit_transform(df[['Airline','Source','Destination']]).toarray()

In [None]:
# One-hot encode Airline, Source and Destination and wrap the result in a new DataFrame
# whose column names come from encoder.get_feature_names_out().
pd.DataFrame(encoder.fit_transform(df[['Airline','Source','Destination']]).toarray(),columns=encoder.get_feature_names_out())