# Predict The Flight Ticket Price 
Flight ticket prices can be something hard to guess, today we might see a price, check out the price of the same flight tomorrow, it will be a different story. We might have often heard travellers saying that flight ticket prices are so unpredictable. Here you will be provided with prices of flight tickets for various airlines between the months of March and June of 2019 and between various cities.

Size of training set: 10683 records

Size of test set: 2671 records

FEATURES:
Airline: The name of the airline.

Date_of_Journey: The date of the journey

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight

Price: The price of the ticket





https://github.com/dsrscientist/Data-Science-ML-Capstone-Projects



In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [40]:
train = pd.read_excel('Data_Train.xlsx')
test = pd.read_excel('Test_set.xlsx')

In [41]:
train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [42]:
test.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info
0,Jet Airways,6/06/2019,Delhi,Cochin,DEL → BOM → COK,17:30,04:25 07 Jun,10h 55m,1 stop,No info
1,IndiGo,12/05/2019,Kolkata,Banglore,CCU → MAA → BLR,06:20,10:20,4h,1 stop,No info
2,Jet Airways,21/05/2019,Delhi,Cochin,DEL → BOM → COK,19:15,19:00 22 May,23h 45m,1 stop,In-flight meal not included
3,Multiple carriers,21/05/2019,Delhi,Cochin,DEL → BOM → COK,08:00,21:00,13h,1 stop,No info
4,Air Asia,24/06/2019,Banglore,Delhi,BLR → DEL,23:55,02:45 25 Jun,2h 50m,non-stop,No info


In [43]:
train.shape,test.shape

((10683, 11), (2671, 10))

In [44]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [None]:
test.info()

In [45]:
train.isnull().sum()


Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [46]:
test.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
dtype: int64

In [47]:
train = train.dropna()

In [48]:
AirlineDetails=train.append(test)

In [49]:
AirlineDetails.shape,train.shape

((13353, 11), (10682, 11))

In [50]:
info=['The name of the airline.','\bThe date of the journey','\tThe source from which the service begins.',
      'The destination where the service ends.','\tThe route taken by the flight to reach the destination.',
      'The time when the journey starts from the source.','Time of arrival at the destination.',
      'Total duration of the flight.','Total stops between the source and destination.',
      '\bAdditional information about the flight','\tTarget:The price of the ticket']

for i in range(len(info)):
    print(AirlineDetails.columns[i]+":\t\t\t"+info[i])

Airline:			The name of the airline.
Date_of_Journey:			The date of the journey
Source:				The source from which the service begins.
Destination:			The destination where the service ends.
Route:				The route taken by the flight to reach the destination.
Dep_Time:			The time when the journey starts from the source.
Arrival_Time:			Time of arrival at the destination.
Duration:			Total duration of the flight.
Total_Stops:			Total stops between the source and destination.
Additional_Info:			Additional information about the flight
Price:				Target:The price of the ticket


In [51]:
AirlineDetails['Date']=AirlineDetails['Date_of_Journey'].str.split('/').str[0]
AirlineDetails['Month']=AirlineDetails['Date_of_Journey'].str.split('/').str[1]
AirlineDetails['Year']=AirlineDetails['Date_of_Journey'].str.split('/').str[2]

In [52]:
AirlineDetails['Arrival_Time']=AirlineDetails['Arrival_Time'].str.split(' ').str[0]

In [53]:
AirlineDetails['Arrival_Hour']=AirlineDetails['Arrival_Time'].str.split(':').str[0]
AirlineDetails['Arrival_Min']=AirlineDetails['Arrival_Time'].str.split(':').str[1]

AirlineDetails['Arrival_Hour']=AirlineDetails['Arrival_Hour'].astype(int)
AirlineDetails['Arrival_Min']=AirlineDetails['Arrival_Min'].astype(int)


In [54]:
AirlineDetails['Total_Stops']=AirlineDetails['Total_Stops'].replace('non-stop','0 stop')
AirlineDetails['Stops']=AirlineDetails['Total_Stops'].str.split(' ').str[0]
AirlineDetails['Stops']=AirlineDetails['Stops'].astype(int)

In [55]:
AirlineDetails['Route1']=AirlineDetails['Route'].str.split('→ ').str[0]
AirlineDetails['Route2']=AirlineDetails['Route'].str.split('→ ').str[1]
AirlineDetails['Route3']=AirlineDetails['Route'].str.split('→ ').str[2]
AirlineDetails['Route4']=AirlineDetails['Route'].str.split('→ ').str[3]
AirlineDetails['Route5']=AirlineDetails['Route'].str.split('→ ').str[4]


AirlineDetails['Route1'].fillna('None',inplace=True)
AirlineDetails['Route2'].fillna('None',inplace=True)
AirlineDetails['Route3'].fillna('None',inplace=True)
AirlineDetails['Route4'].fillna('None',inplace=True)
AirlineDetails['Route5'].fillna('None',inplace=True)

In [56]:
AirlineDetails['Dep_Hour']=AirlineDetails['Dep_Time'].str.split(':').str[0]
AirlineDetails['Dep_Min']=AirlineDetails['Dep_Time'].str.split(':').str[1]

AirlineDetails['Dep_Hour']=AirlineDetails['Dep_Hour'].astype(int)
AirlineDetails['Dep_Min']=AirlineDetails['Dep_Min'].astype(int)



In [57]:
AirlineDetails['Duration_Hour']=AirlineDetails['Duration'].str.split('h').str[0]
AirlineDetails['Duration_Hour']=AirlineDetails['Duration_Hour'].str.split('m').str[-1]
AirlineDetails['Duration_Hour']=AirlineDetails['Duration_Hour'].replace('','0')
AirlineDetails['Duration_Hour'].fillna('0',inplace=True)
AirlineDetails['Duration_Hour']=AirlineDetails['Duration_Hour'].astype(int)

AirlineDetails['Duration_Min']=AirlineDetails['Duration'].str.split('m').str[0]
AirlineDetails['Duration_Min']=AirlineDetails['Duration_Min'].str.split(' ').str[1]
AirlineDetails['Duration_Min'].fillna('0',inplace=True)
AirlineDetails['Duration_Min']=AirlineDetails['Duration_Min'].astype(int)


In [58]:
AirlineDetails=AirlineDetails.drop(['Date_of_Journey','Arrival_Time','Route','Dep_Time','Total_Stops','Duration'],axis=1)

In [59]:
AirlineDetails.head(10)

Unnamed: 0,Airline,Source,Destination,Additional_Info,Price,Date,Month,Year,Arrival_Hour,Arrival_Min,Stops,Route1,Route2,Route3,Route4,Route5,Dep_Hour,Dep_Min,Duration_Hour,Duration_Min
0,IndiGo,Banglore,New Delhi,No info,3897.0,24,3,2019,1,10,0,BLR,DEL,,,,22,20,2,50
1,Air India,Kolkata,Banglore,No info,7662.0,1,5,2019,13,15,2,CCU,IXR,BBI,BLR,,5,50,7,25
2,Jet Airways,Delhi,Cochin,No info,13882.0,9,6,2019,4,25,2,DEL,LKO,BOM,COK,,9,25,19,0
3,IndiGo,Kolkata,Banglore,No info,6218.0,12,5,2019,23,30,1,CCU,NAG,BLR,,,18,5,5,25
4,IndiGo,Banglore,New Delhi,No info,13302.0,1,3,2019,21,35,1,BLR,NAG,DEL,,,16,50,4,45
5,SpiceJet,Kolkata,Banglore,No info,3873.0,24,6,2019,11,25,0,CCU,BLR,,,,9,0,2,25
6,Jet Airways,Banglore,New Delhi,In-flight meal not included,11087.0,12,3,2019,10,25,1,BLR,BOM,DEL,,,18,55,15,30
7,Jet Airways,Banglore,New Delhi,No info,22270.0,1,3,2019,5,5,1,BLR,BOM,DEL,,,8,0,21,5
8,Jet Airways,Banglore,New Delhi,In-flight meal not included,11087.0,12,3,2019,10,25,1,BLR,BOM,DEL,,,8,55,25,30
9,Multiple carriers,Delhi,Cochin,No info,8625.0,27,5,2019,19,15,1,DEL,BOM,COK,,,11,25,7,50


In [60]:
from sklearn.preprocessing import LabelEncoder
Lb=LabelEncoder()
AirlineDetails['Airline']=Lb.fit_transform(AirlineDetails['Airline'])
AirlineDetails['Source']=Lb.fit_transform(AirlineDetails['Source'])
AirlineDetails['Destination']=Lb.fit_transform(AirlineDetails['Destination'])
AirlineDetails['Additional_Info']=Lb.fit_transform(AirlineDetails['Additional_Info'])
AirlineDetails['Route1']=Lb.fit_transform(AirlineDetails['Route1'])
AirlineDetails['Route2']=Lb.fit_transform(AirlineDetails['Route2'])
AirlineDetails['Route3']=Lb.fit_transform(AirlineDetails['Route3'])
AirlineDetails['Route4']=Lb.fit_transform(AirlineDetails['Route4'])
AirlineDetails['Route5']=Lb.fit_transform(AirlineDetails['Route5'])

In [61]:
AirlineDetails_train=AirlineDetails[0:10682]
AirlineDetails_test=AirlineDetails[10682:]
AirlineDetails_test=AirlineDetails_test.drop(['Price'],axis=1)

In [62]:
AirlineDetails_train.shape,AirlineDetails_test.shape

((10682, 20), (2671, 19))

In [63]:
x=AirlineDetails_train.drop(['Price'],axis=1)
y=AirlineDetails_train.Price
x.shape,y.shape

((10682, 19), (10682,))

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
def maxr2_score(regr,df_x,y):
    max_r_score=0
    for r_state in range(42,101):
        x_train,x_val,y_train,y_val=train_test_split(x,y,random_state=r_state,test_size=0.20)
        regr.fit(x_train,y_train)
        y_pred=regr.predict(x_val)
        r2_scr=r2_score(y_val,y_pred)
        #print('r2_score corresponding to random state: ',r_state,"is:",r2_scr)
        if r2_scr>max_r_score:
            max_r_score=r2_scr
            final_r_state=r_state
    print('\n\nmax r2 score corresponding to random state:',final_r_state,"is",max_r_score)
    return final_r_state

In [65]:
#lets use linear regression and check max  r2 score corresponding to different random states
from sklearn.linear_model import LinearRegression
lreg=LinearRegression()
r_state=maxr2_score(lreg,x,y)



max r2 score corresponding to random state: 64 is 0.5487767863569135


In [None]:
#lets use grid_search to find optimal value of n_neigbors for KNN model
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
neighbors={'n_neighbors':range(1,30)}
knr=KNeighborsRegressor()
gknr=GridSearchCV(knr,neighbors,cv=10)
gknr.fit(x,y)
gknr.best_params_

In [None]:
#lets use KNN Regression and check max r2 score corresponding to different random states
knr=KNeighborsRegressor(n_neighbors=2)
r_state=maxr2_score(knr,x,y)

In [None]:
#lets check the mean r2 score of both linear regression model and knn regression
from sklearn.model_selection import cross_val_score
print('Mean r2 score for linear regression : ',cross_val_score(lreg,x,y,cv=5,scoring='r2').mean())
print('Standard deviation in r2 score for Linear Regression : ',cross_val_score(lreg,x,y,cv=5,scoring='r2').std())
print('\n\n Mean r2 score for KNN Regression : ',cross_val_score(knr,x,y,cv=5,scoring='r2').mean())
print('Standard deviation in r2 score for KNN Regression : ',cross_val_score(knr,x,y,cv=5,scoring='r2').std())

In [None]:
#lets check Lasso regression and find best value of alpha
from sklearn.linear_model import Lasso
lsreg=Lasso()
parameters={'alpha':[0.001,0.01,0.1,1]}
clf=GridSearchCV(lsreg,parameters,cv=10)
clf.fit(x,y)
clf.best_params_

In [None]:
#lets check max r2 score when we use lasso
lsreg=Lasso(alpha=1)
r_state=maxr2_score(lsreg,x,y)

In [None]:
#lets use cross val score with lasso
print("Mean r2 score for lasso regression : ",cross_val_score(lsreg,x,y,cv=5,scoring='r2').mean())
print('standard deviation for lasso ragression : ',cross_val_score(lsreg,x,y,scoring='r2').std())

In [None]:
# we tried all the model and till now KNN Regression is the best
#random state corresponding to highest r2_score is 49
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=49,test_size=0.20)
knr=KNeighborsRegressor(n_neighbors=2)      
knr.fit(x_train,y_train)
y_pred=lreg.predict(x_test)

In [None]:
#lets find the rmse and r2_score using sklearn.metrics
import numpy as np
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
print("RMSE is : ",np.sqrt(mean_squared_error(y_test,y_pred)))
print('r2_score is : ',r2_score(y_test,y_pred))

In [None]:
from sklearn.externals import joblib
joblib.dump(knr,'Model_FlightPrice.pkl')

In [None]:
model=joblib.load('Model_FlightPrice.pkl')

In [None]:
result=pd.DataFrame(model.predict(AirlineDetails_test))

In [None]:
result.to_csv('FlightTicketprdictresults.csv')