# **Flight fare prediction**
---


## **Problem Statement**


---

Flight ticket prices can be something hard to guess, today we might see a price, check out the price of the same flight tomorrow, it will be a different story.

We might have often heard travelers saying that flight ticket prices are so unpredictable.

That’s why we will try to use machine learning to solve this problem.

This can help airlines by predicting what prices they can maintain.

**Task 1:-**Prepare a complete data analysis report on the given data.

**Task 2:-**Create a predictive model which will help the customers to predict future flight prices and plan their journey accordingly.

### **Domain Analysis**
---

We have to analyze the **flight fare prediction** using Machine Learning dataset using essential exploratory data analysis techniques then will draw some predictions about the price of the flight based on some features such as what **type of airline** it is, what is the **arrival time**, what is the **departure time**, what is the **duration of the flight**, **source**, **destination** and more



### Attribute Information :

---

- **Airline:** So this column will have all the types of airlines like Indigo, Jet Airways, Air India, and many more.

- **Date_of_Journey:** This column will let us know about the date on which the passenger’s journey will start.

- **Source:** This column holds the name of the place from where the passenger’s journey will start.

- **Destination:** This column holds the name of the place to where passengers wanted to travel.

- **Route:** Here we can know about what the route is through which passengers have opted to travel from his/her source to their destination.

- **Arrival_Time:** Arrival time is when the passenger will reach his/her destination.

- **Duration:** Duration is the whole period that a flight will take to complete its journey from source to destination.

- **Total_Stops:** This will let us know in how many places flights will stop there for the flight in the whole journey.

- **Additional_Info:** In this column, we will get information about food, kind of food, and other amenities.

- **Price:** Price of the flight for a complete journey including all the expenses before onboarding.



In [2]:
#Mounting drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# importing libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load dataset
data=pd.read_excel('/content/drive/MyDrive/Internship phase 3 CDS certification/Projects/Flight fare prediction/Flight_Fare.xlsx')

### **Exploratory Data Analysis**

---


### Basic Checks



In [None]:
data.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [None]:
data.tail()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648
10682,Air India,9/05/2019,Delhi,Cochin,DEL → GOI → BOM → COK,10:55,19:15,8h 20m,2 stops,No info,11753


In [None]:
data.shape

(10683, 11)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [None]:
data.describe()

Unnamed: 0,Price
count,10683.0
mean,9087.064121
std,4611.359167
min,1759.0
25%,5277.0
50%,8372.0
75%,12373.0
max,79512.0


In [None]:
data.duplicated().sum()

220

In [None]:
data.loc[data.duplicated()]

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
683,Jet Airways,1/06/2019,Delhi,Cochin,DEL → NAG → BOM → COK,14:35,04:25 02 Jun,13h 50m,2 stops,No info,13376
1061,Air India,21/05/2019,Delhi,Cochin,DEL → GOI → BOM → COK,22:00,19:15 22 May,21h 15m,2 stops,No info,10231
1348,Air India,18/05/2019,Delhi,Cochin,DEL → HYD → BOM → COK,17:15,19:15 19 May,26h,2 stops,No info,12392
1418,Jet Airways,6/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,05:30,04:25 07 Jun,22h 55m,2 stops,In-flight meal not included,10368
1674,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,18:25,21:20,2h 55m,non-stop,No info,7303
...,...,...,...,...,...,...,...,...,...,...,...
10594,Jet Airways,27/06/2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,12:35 28 Jun,13h 30m,2 stops,No info,12819
10616,Jet Airways,1/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,09:40,12:35 02 Jun,26h 55m,2 stops,No info,13014
10634,Jet Airways,6/06/2019,Delhi,Cochin,DEL → JAI → BOM → COK,09:40,12:35 07 Jun,26h 55m,2 stops,In-flight meal not included,11733
10672,Jet Airways,27/06/2019,Delhi,Cochin,DEL → AMD → BOM → COK,23:05,19:00 28 Jun,19h 55m,2 stops,In-flight meal not included,11150


In [None]:
data.keys()

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

**Insights from basic checks**

- No null value here
- 8 independent columns are categorical & object type
- Should we remove Date_of_Journey,Dep_Time,Arrival_Time or change to datetime
- How to change duration?
- We should drop Additional_Info.
- Should we remove 220 duplicated value


### Univariate Analysis

In [None]:
plt.figure(figsize=(20,25),facecolor='white')
plotnumber = 1

for column in data:
  ax = plt.subplot(6,2,plotnumber)
  sns.countplot(x = data[column])
  plt.xlabel(column,fontsize=10)
  plotnumber+=1
plt.tight_layout()

In [None]:
# Using sweetviz

!pip install sweetviz
import sweetviz as sv

uv_report = sv.analyze(data)
uv_report.show_html()

**Insights from univariate Analysis**

- People used **Jet Airways** maximum
- From **Delhi** maximum people used Air
- To reach **Cochin** maximum people used Air
- Maximum plane used one stop
- No valuable insights from **Date_of_Journey**,**Route**,**Dep_Time**,**Arrival_Time** ,**Duration**

### Bivariate Analysis

In [None]:
# Using Autoviz

!pip install autoviz


In [None]:
from autoviz import AutoViz_Class

av = AutoViz_Class()

bv_report = av.AutoViz('/content/drive/MyDrive/Internship phase 3 CDS certification/Projects/Flight fare prediction/Flight_Fare.xlsx')


Output hidden; open in https://colab.research.google.com to view.

### **Data Preprocessing**
---


In [None]:
data.keys()

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

### Dropping unwanted column

In [4]:
# dropping Additional_Info as no specific relationship with price

data.drop('Additional_Info',axis=1,inplace=True)

### Imputing missing values

- There is no missing value , so we are skipping this step

In [5]:
# Convert to time format

data.Dep_Time = pd.to_datetime(data.Dep_Time).dt.hour
data.Arrival_Time = pd.to_datetime(data.Arrival_Time).dt.hour


In [6]:
data['Day_of_Journey']= pd.to_datetime(data.Date_of_Journey).dt.day


In [7]:
data['Month_of_Journey']=pd.to_datetime(data.Date_of_Journey).dt.month

In [8]:
data.drop('Date_of_Journey',axis=1,inplace=True)

### Encoding Categorical Variables

- Will perform one hot encoding for 'Airline','Source', 'Destination', 'Route'
- Will perform label encoding for Total_Stops

In [None]:
# Encoding Airline

#data_Airline = pd.get_dummies(data['Airline'],prefix='Airline',drop_first=True)
#data = pd.concat([data,data_Airline],axis=1).drop(['Airline'],axis=1)


In [None]:
# Encoding Source

#data_Source = pd.get_dummies(data['Source'],prefix='Source',drop_first=True)
#data = pd.concat([data,data_Source],axis=1).drop(['Source'],axis=1)

In [None]:
# Encoding Destination

#data_Destination = pd.get_dummies(data['Destination'],prefix='Destination',drop_first=True)
#data = pd.concat([data,data_Destination],axis=1).drop(['Destination'],axis=1)

In [None]:
# Encoding Route

#data_Route = pd.get_dummies(data['Route'],prefix='Route',drop_first=True)
#data = pd.concat([data,data_Route],axis=1).drop(['Route'],axis=1)

In [9]:
# Encoding Total_Stops

data.Total_Stops.replace({'non-stop':0,'2 stops':2, '1 stop':1, '3 stops':3,'4 stops':4},inplace=True)
data.loc[data.Total_Stops.isnull(),'Total_Stops']=0

In [10]:
data.loc[data.Total_Stops.isnull(),'Total_Stops']=0

In [12]:
from sklearn.preprocessing import LabelEncoder

lc = LabelEncoder()

data.Airline = lc.fit_transform(data.Airline)
data.Source = lc.fit_transform(data.Source)
data.Destination = lc.fit_transform(data.Destination)
data.Route = lc.fit_transform(data.Route)
data.Dep_Time = lc.fit_transform(data.Dep_Time)
data.Arrival_Time = lc.fit_transform(data.Arrival_Time)

### Change string time to number format

In [13]:
# String time to mins

def convert_time_to_minutes(time_str):
    try:
        hours, minutes = time_str.split('h')
        hours = int(hours.strip())

        if minutes.strip('m '):
            minutes = int(minutes.strip('m '))
        else:
            minutes = 0  # Set minutes to 0 if it's an empty string

        total_minutes = hours * 60 + minutes
        return total_minutes
    except ValueError:
        print(f"Invalid time format: {time_str}")
        return None

data['Duration'] = data['Duration'].apply(convert_time_to_minutes)

Invalid time format: 5m


In [14]:
# imputing Nan value with Median
data.loc[data['Duration'].isnull(),'Duration']=data['Duration'].median()

In [15]:
data.Duration.isnull().sum()

0

### Creating independent & dependent feature

In [16]:
X = data.drop(columns= ['Price'],axis=1)
y = data.Price

### Creating training & testing data


In [17]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=73)

### Scaling down

In [19]:
# Scaling down Duration

from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

### Model creation

In [20]:
# Using Decision Tree
from sklearn.tree import DecisionTreeRegressor

model1 = DecisionTreeRegressor()

model1.fit(X_train,y_train)

y_predict1 = model1.predict(X_test)

In [21]:
# Using Linear Regression

from sklearn.linear_model import LinearRegression

model2 = LinearRegression()

model2.fit(X_train,y_train)

y_predict2 = model2.predict(X_test)

In [22]:
# Using Lasso Regression

from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)

model3 = LassoCV(alphas=np.arange(0.1,10,0.1),cv=cv,tol=1)
model3.fit(X_train,y_train)

y_predict3 = model3.predict(X_test)

In [23]:
# Using Ridge Regression

from sklearn.linear_model import RidgeCV
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model4 = RidgeCV(alphas = np.arange(0.1,10,0.1),cv= cv ,scoring = 'neg_mean_absolute_error')
model4.fit(X_train,y_train)
y_predict4 = model4.predict(X_test)

In [24]:
# Using KNN

from sklearn.neighbors import KNeighborsRegressor
model5 = KNeighborsRegressor(n_neighbors=5)
model5.fit(X_train,y_train)
y_predict5 = model5.predict(X_test)

In [25]:
# Using Random Forest

from sklearn.ensemble import RandomForestRegressor
model6 = RandomForestRegressor(n_estimators = 100)
model6.fit(X_train,y_train)
y_predict6=model6.predict(X_test)

In [26]:
#Using Xgboost

import xgboost as xgb
model7 = xgb.XGBRegressor()
model7.fit(X_train,y_train)
y_predict7=model7.predict(X_test)

### Model Evaluation

In [27]:
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error

In [29]:
print('Decision tree score :',r2_score(y_test,y_predict1)*100,'%')
print('Linear Regression score :',r2_score(y_test,y_predict2)*100,'%')
print('Lasso Regression score :',r2_score(y_test,y_predict3)*100,'%')
print('Ridge Regression score :',r2_score(y_test,y_predict4)*100,'%')
print('KNN Regression score :',r2_score(y_test,y_predict5)*100,'%')
print('Random Forest score :',r2_score(y_test,y_predict6)*100,'%')
print('XGBoost Regressor score :',r2_score(y_test,y_predict7)*100,'%')

Decision tree score : 70.73807943131291 %
Linear Regression score : 42.92014736837992 %
Lasso Regression score : 38.867070552158644 %
Ridge Regression score : 42.91570431236026 %
KNN Regression score : 57.421399744322635 %
Random Forest score : 79.3879933009721 %
XGBoost Regressor score : 84.25253542100324 %
