# Flight Price Prediction Model

## Instructions:
1. You will be given a dataset.
2. Find the cheapest and the most expensive flights at a specific time.
3. Carry out exploratory data analysis (EDA).
4. Build an ML model.
5. Find a sweet spot for a cheap ticket.

- Ahmed is a customer of sastaticket.pk. He is planning to fly from Karachi to Islamabad for his brother's wedding and is currently choosing tickets. He has to go to Islamabad but also wants to save some money, so he chooses to wait instead of buying now, simply because ticket prices are too high.
- Is this the right decision? Won't ticket prices increase in the future? Perhaps there is a sweet spot Ahmed is hoping to find, and he just might find it. This is the problem that you will be tackling in this competition.
- Can you predict future prices accurately enough that you can tell Ahmed, with confidence, that he has made the wrong decision? Your task boils down to generating optimal predictions for flight prices of multiple airlines. Your model will contribute greatly to Sastaticket's rich and diverse set of operating algorithms.

## 1. Exploratory Data Analysis:

We start by extracting basic information from the data: shape, column types, missing values, and summary statistics.

In [1]:
# Import Libraries 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load your data (raw strings keep the backslashes in Windows paths literal)
df = pd.read_csv(r'E:\Python\ML\Sasta Ticket Dataset\sastaticket_train.csv')
df_test = pd.read_csv(r'E:\Python\ML\Sasta Ticket Dataset\sastaticket_test.csv')

In [3]:
print('Rows, Columns', df.shape)

Rows, Columns (5000, 14)


In [4]:
# Structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    5000 non-null   int64  
 1   Unnamed: 0.1  5000 non-null   int64  
 2   f1            5000 non-null   object 
 3   f2            5000 non-null   object 
 4   f3            5000 non-null   object 
 5   f4            5000 non-null   object 
 6   f5            5000 non-null   object 
 7   f6            5000 non-null   object 
 8   f7            5000 non-null   bool   
 9   f8            5000 non-null   float64
 10  f9            5000 non-null   int64  
 11  f10           5000 non-null   object 
 12  Unnamed: 0.2  5000 non-null   int64  
 13  target        5000 non-null   float64
dtypes: bool(1), float64(2), int64(4), object(7)
memory usage: 376.0+ KB


In [5]:
# Check null value
df.isnull().sum()

Unnamed: 0      0
Unnamed: 0.1    0
f1              0
f2              0
f3              0
f4              0
f5              0
f6              0
f7              0
f8              0
f9              0
f10             0
Unnamed: 0.2    0
target          0
dtype: int64

In [6]:
# Summary stats
df.describe()

,Unnamed: 0,Unnamed: 0.1,f8,f9,Unnamed: 0.2,target
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,10862930.0,10862930.0,22.4944,0.9446,10862930.0,10104.3518
std,6275456.0,6275456.0,8.887101,0.607951,6275456.0,3359.936118
min,2499.0,2499.0,0.0,0.0,2499.0,4990.0
25%,5417290.0,5417290.0,20.0,1.0,5417290.0,7796.0
50%,10938030.0,10938030.0,20.0,1.0,10938030.0,9403.0
75%,16215820.0,16215820.0,32.0,1.0,16215820.0,11245.0
max,21774430.0,21774430.0,45.0,2.0,21774430.0,33720.0


In [7]:
df.head()

,Unnamed: 0,Unnamed: 0.1,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,Unnamed: 0.2,target
0,276919,276919,2021-01-08 12:43:27.828728+00:00,x,y,2021-01-23 05:00:00+00:00,2021-01-23 07:00:00+00:00,gamma,True,0.0,0,c-2,276919,7400.0
1,12092463,12092463,2021-07-01 04:45:11.397541+00:00,x,y,2021-07-01 13:00:00+00:00,2021-07-01 15:00:00+00:00,alpha,True,35.0,1,a-9,12092463,15377.0
2,11061788,11061788,2021-06-24 11:28:47.565115+00:00,x,y,2021-07-29 14:00:00+00:00,2021-07-29 16:00:00+00:00,gamma,True,20.0,1,c-4,11061788,6900.0
3,8799808,8799808,2021-06-05 11:09:48.655927+00:00,x,y,2021-06-09 16:00:00+00:00,2021-06-09 18:00:00+00:00,alpha,True,15.0,1,a-23,8799808,9707.0
4,16391150,16391150,2021-07-29 09:53:51.065306+00:00,x,y,2021-08-23 05:00:00+00:00,2021-08-23 06:55:00+00:00,beta,True,20.0,0,b-1,16391150,6500.0


In [8]:
print(df.columns)

Index(['Unnamed: 0', 'Unnamed: 0.1', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7',
       'f8', 'f9', 'f10', 'Unnamed: 0.2', 'target'],
      dtype='object')


In [9]:
# Feature engineering
# Find the unique values in each categorical column
cat_list = ['f2', 'f3', 'f6', 'f8', 'f9', 'f10']

# unique values in each column
for i in cat_list:
    print(i, df[i].unique())
    print('.............................') # Separator line

f2 ['x']
.............................
f3 ['y']
.............................
f6 ['gamma' 'alpha' 'beta' 'omega']
.............................
f8 [ 0. 35. 20. 15. 32. 40. 45.]
.............................
f9 [0 1 2]
.............................
f10 ['c-2' 'a-9' 'c-4' 'a-23' 'b-1' 'a-5' 'b-9' 'a-7' 'd-1' 'c-6' 'a-1' 'd-5'
 'b-69' 'b-19' 'd-3' 'b-319' 'b-369' 'b-67' 'b-73']
.............................
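The output above shows that `f2` and `f3` each hold a single value, which is why they are dropped in the next cell. A minimal sketch of how such constant columns could be detected automatically, using a small made-up frame (`toy` and `constant_cols` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the training data; the values mirror the
# unique-value output above but the frame itself is made up.
toy = pd.DataFrame({
    "f2": ["x", "x", "x"],
    "f3": ["y", "y", "y"],
    "f6": ["gamma", "alpha", "beta"],
})

# A column with only one distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Drop them in one step.
toy = toy.drop(columns=constant_cols)
```

On the real data the same list comprehension could be run on `df` before deciding what to drop.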


In [10]:
df.drop(['Unnamed: 0.1', 'f2', 'f3','Unnamed: 0.2'], axis=1, inplace=True)

In [11]:
df.head()

,Unnamed: 0,f1,f4,f5,f6,f7,f8,f9,f10,target
0,276919,2021-01-08 12:43:27.828728+00:00,2021-01-23 05:00:00+00:00,2021-01-23 07:00:00+00:00,gamma,True,0.0,0,c-2,7400.0
1,12092463,2021-07-01 04:45:11.397541+00:00,2021-07-01 13:00:00+00:00,2021-07-01 15:00:00+00:00,alpha,True,35.0,1,a-9,15377.0
2,11061788,2021-06-24 11:28:47.565115+00:00,2021-07-29 14:00:00+00:00,2021-07-29 16:00:00+00:00,gamma,True,20.0,1,c-4,6900.0
3,8799808,2021-06-05 11:09:48.655927+00:00,2021-06-09 16:00:00+00:00,2021-06-09 18:00:00+00:00,alpha,True,15.0,1,a-23,9707.0
4,16391150,2021-07-29 09:53:51.065306+00:00,2021-08-23 05:00:00+00:00,2021-08-23 06:55:00+00:00,beta,True,20.0,0,b-1,6500.0


In [12]:
# Type casting: inspect the dtypes first
# (do not assign the result: df.info() returns None and would overwrite df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  5000 non-null   int64  
 1   f1          5000 non-null   object 
 2   f4          5000 non-null   object 
 3   f5          5000 non-null   object 
 4   f6          5000 non-null   object 
 5   f7          5000 non-null   bool   
 6   f8          5000 non-null   float64
 7   f9          5000 non-null   int64  
 8   f10         5000 non-null   object 
 9   target      5000 non-null   float64
dtypes: bool(1), float64(2), int64(2), object(5)
memory usage: 258.9+ KB


In [None]:
# Convert the timestamp columns into datetime objects
df['f1'] = pd.to_datetime(df['f1'])
df['f4'] = pd.to_datetime(df['f4'])
df['f5'] = pd.to_datetime(df['f5'])

In [None]:
df.info()

In [None]:
# Add duration columns (in seconds) obtained by subtracting timestamps
df.insert(0, "time_to_dep(s)", (df['f4'] - df['f1']).dt.total_seconds())
df.insert(1, "travel_time(s)", (df['f5'] - df['f4']).dt.total_seconds())
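As a worked example of the subtraction above, the sketch below uses the `f1` and `f4` values from the first row of `df.head()`. The variable names `t_first` and `t_departure` are illustrative; the notebook itself only tells us that `f4 - f1` is treated as the time to departure.

```python
import pandas as pd

# f1 and f4 of the first row shown in df.head()
t_first = pd.to_datetime(pd.Series(["2021-01-08 12:43:27.828728+00:00"]))
t_departure = pd.to_datetime(pd.Series(["2021-01-23 05:00:00+00:00"]))

delta = t_departure - t_first          # timedelta Series: 14 days 16:16:32.171272
seconds = delta.dt.total_seconds()     # the same duration as float seconds
days = seconds / 86400                 # fractional days, often easier to read

print(delta.iloc[0], seconds.iloc[0], round(days.iloc[0], 2))
```

`dt.total_seconds()` returns plain floats, which is what the scaler and the models expect later on.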

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df = df.rename(columns={'f1':'f12'})

In [None]:
df.head()

In [None]:
cat_cols = ['f6', 'f7', 'f8', 'f9']
num_cols = ['time_to_dep(s)', 'travel_time(s)']

In [None]:
# Plotting categorical count plots
c=1
plt.figure(figsize= (10,40))
for i in cat_cols:
    plt.subplot(2,4,c)
    sns.countplot(x=df[i])  # pass the column explicitly as x
    plt.xticks(rotation=90)
    c= c+1
plt.tight_layout(pad=3.0)  # adjust spacing once, after all subplots exist
plt.show()

In [None]:
# Plotting distributions of the numerical columns
c=1
plt.figure(figsize= (10,30))
for i in num_cols:
    plt.subplot(6,3,c)
    sns.histplot(df[i], kde=True)  # distplot is deprecated; histplot replaces it
    c= c+1
plt.show()

In [None]:
# Box plots of the numerical columns
c=1
plt.figure(figsize= (10,30))
for i in num_cols:
    plt.subplot(6,3,c)
    sns.boxplot(x=df[i])
    c= c+1
plt.show()

In [None]:
# Target 
sns.displot(df['target'])

In [None]:
sns.boxplot(x=df.target)

In [None]:
# Skew and kurtosis of the numeric columns
num_df = df.select_dtypes(include='number')
print(num_df.skew())
print(num_df.kurtosis())
# Outlier removal task
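The outlier removal task is left open in the cell above. One common option is the interquartile-range (IQR) rule; the sketch below is an assumption about how it could be done, not the notebook's own method. The helper `iqr_filter` and the factor `k=1.5` are illustrative, and the toy prices are the five `target` values from `df.head()` plus the maximum reported by `df.describe()`.

```python
import pandas as pd

def iqr_filter(frame, column, k=1.5):
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

# Five prices from df.head() plus the maximum price from df.describe()
toy = pd.DataFrame({"target": [6500.0, 6900.0, 7400.0, 9707.0, 15377.0, 33720.0]})

filtered = iqr_filter(toy, "target")
print(filtered)
```

Whether extreme fares should be removed at all depends on the goal: if expensive last-minute tickets are part of what the model must predict, clipping or a log transform may be a safer choice than dropping rows.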

In [None]:
# encoding of variables

df.head()

In [None]:
# Drop the raw timestamp columns (f1 was renamed to f12 above; f2 is already gone)
df.drop(['f12', 'f4', 'f5'], axis=1, inplace=True)

In [None]:
df.head()

In [None]:
# encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
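# apply label encoder
df['f6'] = le.fit_transform(df['f6'])
df['f7'] = le.fit_transform(df['f7'])
df['f8'] = le.fit_transform(df['f8'])
df['f10'] = le.fit_transform(df['f10'])  # f10 holds strings such as 'c-2', which the models cannot use directly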
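Reusing one `LabelEncoder` for several columns works, but each `fit_transform` call overwrites the previous mapping, so the codes of earlier columns can no longer be decoded. A minimal sketch of what the encoder does with the four `f6` categories (the names `f6_values` and `encoder` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# The four categories seen in f6, with one repeat
f6_values = ["gamma", "alpha", "beta", "omega", "alpha"]

encoder = LabelEncoder()
codes = encoder.fit_transform(f6_values)

# classes_ is sorted, and each code is the index of its class in that list
print(list(encoder.classes_))
print(list(codes))

# Keeping the fitted encoder allows codes to be mapped back to labels
print(list(encoder.inverse_transform(codes)))
```

Keeping one fitted encoder per column (for example in a dict keyed by column name) would also allow the same mapping to be applied to the test set.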

In [None]:
df.sample(10)

In [None]:
df.describe()

In [None]:
# sklearn function to scale our data / normalize
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# StandardScaler expects 2D input, so select each column with double brackets
df[['time_to_dep(s)']] = sc.fit_transform(df[['time_to_dep(s)']])
df[['travel_time(s)']] = sc.fit_transform(df[['travel_time(s)']])
df[['target']] = sc.fit_transform(df[['target']])
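Because the target is standardized here, model predictions come out in scaled units rather than in prices. A minimal sketch of how they could be converted back, assuming a dedicated scaler is kept for the target (`target_scaler` is a hypothetical name; the prices are the five `target` values from `df.head()`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Target values from df.head(), shaped as a single-column 2D array
prices = np.array([[7400.0], [15377.0], [6900.0], [9707.0], [6500.0]])

target_scaler = StandardScaler()
scaled = target_scaler.fit_transform(prices)   # zero mean, unit variance

# inverse_transform undoes the scaling, e.g. for model predictions
restored = target_scaler.inverse_transform(scaled)
print(restored.ravel())
```

Error metrics such as MAE and RMSE computed on the scaled target are also in scaled units, so they need the same back-conversion before they can be read as prices.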

In [None]:
# Splitting our data into X and y

X = df.drop(['target'], axis=1)
y = df['target']

In [None]:
X.head()

In [None]:
y.head()

## 2. ML Modeling:

In [None]:
# Regression pipeline or algos

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor  # the target is continuous, so use the regressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# root mean squared error
# rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
# shorten the names
lr = LinearRegression()
dt = DecisionTreeRegressor()
svr = SVR()
knn = KNeighborsRegressor()

In [None]:
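# Fit and predict with a single model first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr.fit(X_train, y_train)
pred = lr.predict(X_test)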

In [None]:
# model loop

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for i in [lr, dt, svr, knn]: # loop over all models
    i.fit(X_train, y_train) # fitting our model
    pred = i.predict(X_test) # predict
    test_score = r2_score(y_test, pred) # test score
    train_score= r2_score(y_train, i.predict(X_train)) # train score
    # report only models whose train and test scores are close (no strong overfit)
    if abs(train_score-test_score) <= 0.1:
        print(i)
        print('R2 score is: ', r2_score(y_test, pred))
        print('Mean Absolute error is', mean_absolute_error(y_test, pred))
        print('Mean squared error is', mean_squared_error(y_test, pred))
        print('RMSE is', np.sqrt(mean_squared_error(y_test, pred)))
        print('----------------------------------------------------')
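The loop above scores each model on a single train/test split, so the result depends on which rows land in the test set. A hedged alternative is k-fold cross-validation; the sketch below uses synthetic data (`X_demo`, `y_demo` are made up for illustration) rather than the ticket data, and `cv=5` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: three features, linear signal plus small noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Five R2 scores, one per held-out fold
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(scores)
print("mean R2:", scores.mean())
```

With the notebook's data, `X_demo` and `y_demo` would be replaced by `X` and `y`, and the same call could be repeated for each model in the list.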

In [None]:
# model loop

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for i in [lr, dt, svr, knn]: # loop over all models
    i.fit(X_train, y_train) # fitting our model
    pred = i.predict(X_test) # predict
    test_score = r2_score(y_test, pred) # test score
    train_score= r2_score(y_train, i.predict(X_train)) # train score
    # report only models whose train and test scores are close (no strong overfit)
    if abs(train_score-test_score) <= 0.1:
        print(i)
        print('R2 score is: ', r2_score(y_test, pred))
        print('Mean Absolute error is', mean_absolute_error(y_test, pred))
        print('Mean squared error is', mean_squared_error(y_test, pred))
        print('RMSE is', np.sqrt(mean_squared_error(y_test, pred)))
        print('----------------------------------------------------')

# To save prediction (pred holds the output of the last model in the loop, knn)
res = pd.DataFrame(pred)
res.index = X_test.index # it's important for comparison
res.columns = ['prediction']
res.to_csv(r"E:\Python\ML\prediction_results_with_traintestsplit.csv")

In [None]:
df_test.head()

In [None]:
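# final data prediction
# NOTE: df_test must first go through the same preprocessing as df
# (column drops, duration features, encoding, scaling) so that its columns match X

lr = LinearRegression().fit(X,y)
pred = lr.predict(df_test[X.columns])
# To save prediction
res = pd.DataFrame(pred)
res.index = df_test.index # align the predictions with the test rows
res.columns = ['prediction']
res.to_csv("prediction_results.csv")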
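For the final prediction to work, the test frame needs the same engineered columns as the training frame. One way to keep the two in sync is to put the steps in a function and call it on both. The sketch below covers only the duration features; `add_time_features` is a hypothetical helper, and it assumes the test file carries the same `f1`, `f4` and `f5` columns as the training file. The toy row is the last row of `df.head()`.

```python
import pandas as pd

def add_time_features(frame):
    """Return a copy with the two duration features and without raw timestamps."""
    out = frame.copy()
    for col in ["f1", "f4", "f5"]:
        out[col] = pd.to_datetime(out[col])
    out["time_to_dep(s)"] = (out["f4"] - out["f1"]).dt.total_seconds()
    out["travel_time(s)"] = (out["f5"] - out["f4"]).dt.total_seconds()
    return out.drop(columns=["f1", "f4", "f5"])

# One row in the raw layout (values from the last row of df.head())
toy_test = pd.DataFrame({
    "f1": ["2021-07-29 09:53:51.065306+00:00"],
    "f4": ["2021-08-23 05:00:00+00:00"],
    "f5": ["2021-08-23 06:55:00+00:00"],
    "f6": ["beta"],
})

features = add_time_features(toy_test)
print(features)
```

Encoders and scalers should be fitted on the training data only and then applied to the test data with `transform`, not `fit_transform`, so that both sets share one mapping.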

# Second Method

In [20]:
df1 = pd.read_csv(r'E:\Python\ML\Sasta Ticket Dataset\sastaticket_train.csv')
df1.head()

,Unnamed: 0,Unnamed: 0.1,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,Unnamed: 0.2,target
0,276919,276919,2021-01-08 12:43:27.828728+00:00,x,y,2021-01-23 05:00:00+00:00,2021-01-23 07:00:00+00:00,gamma,True,0.0,0,c-2,276919,7400.0
1,12092463,12092463,2021-07-01 04:45:11.397541+00:00,x,y,2021-07-01 13:00:00+00:00,2021-07-01 15:00:00+00:00,alpha,True,35.0,1,a-9,12092463,15377.0
2,11061788,11061788,2021-06-24 11:28:47.565115+00:00,x,y,2021-07-29 14:00:00+00:00,2021-07-29 16:00:00+00:00,gamma,True,20.0,1,c-4,11061788,6900.0
3,8799808,8799808,2021-06-05 11:09:48.655927+00:00,x,y,2021-06-09 16:00:00+00:00,2021-06-09 18:00:00+00:00,alpha,True,15.0,1,a-23,8799808,9707.0
4,16391150,16391150,2021-07-29 09:53:51.065306+00:00,x,y,2021-08-23 05:00:00+00:00,2021-08-23 06:55:00+00:00,beta,True,20.0,0,b-1,16391150,6500.0


In [21]:
# f1 and f4 are ISO 8601 timestamps, so pandas parses them without an explicit format
df1['f1'] = pd.to_datetime(df1['f1'])
df1['f4'] = pd.to_datetime(df1['f4'])

# Time between f1 and f4, kept as a timedelta column
df1['delta'] = df1['f4'] - df1['f1']

In [22]:
df1.head()

,Unnamed: 0,Unnamed: 0.1,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,Unnamed: 0.2,target,delta
0,276919,276919,2021-01-08 12:43:27.828728+00:00,x,y,2021-01-23 05:00:00+00:00,2021-01-23 07:00:00+00:00,gamma,True,0.0,0,c-2,276919,7400.0,14 days 16:16:32.171272
1,12092463,12092463,2021-07-01 04:45:11.397541+00:00,x,y,2021-07-01 13:00:00+00:00,2021-07-01 15:00:00+00:00,alpha,True,35.0,1,a-9,12092463,15377.0,0 days 08:14:48.602459
2,11061788,11061788,2021-06-24 11:28:47.565115+00:00,x,y,2021-07-29 14:00:00+00:00,2021-07-29 16:00:00+00:00,gamma,True,20.0,1,c-4,11061788,6900.0,35 days 02:31:12.434885
3,8799808,8799808,2021-06-05 11:09:48.655927+00:00,x,y,2021-06-09 16:00:00+00:00,2021-06-09 18:00:00+00:00,alpha,True,15.0,1,a-23,8799808,9707.0,4 days 04:50:11.344073
4,16391150,16391150,2021-07-29 09:53:51.065306+00:00,x,y,2021-08-23 05:00:00+00:00,2021-08-23 06:55:00+00:00,beta,True,20.0,0,b-1,16391150,6500.0,24 days 19:06:08.934694


In [None]:
df.describe()

In [None]:
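# function to format a number of seconds as H:MM:SS (whole days are discarded)
def convert(seconds):
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600  # remove the full hours before counting minutes
    minutes = seconds // 60
    seconds %= 60
    return "%d:%02d:%02d" % (hour, minutes, seconds)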

# column addition
df['time_1'] = df['time_to_dep(s)'].apply(convert)
df['time_2'] = df['travel_time(s)'].apply(convert)
df.sample(100)
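The same conversion can be written with `divmod`, which returns quotient and remainder in one step. The sketch below is a variant, not the notebook's function: `to_hms` is a hypothetical name, and unlike `convert` it keeps hours beyond 24 instead of discarding whole days. The inputs are durations taken from rows of `df.head()`.

```python
def to_hms(total_seconds):
    """Format a duration in seconds as H:MM:SS, keeping hours beyond 24."""
    total_seconds = int(total_seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "%d:%02d:%02d" % (hours, minutes, seconds)

# 6900 s is the travel time of the last row in df.head() (05:00 to 06:55)
print(to_hms(6900))

# 1268192 s is the whole-second time to departure of the first row (14 days 16:16:32)
print(to_hms(1268192))
```

Note that both functions expect raw seconds; they give meaningless results if applied after the duration columns have been standardized.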

In [None]:
df['time_1'].min()

In [None]:
df['time_1'].max()
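Instruction 5 asks for a sweet spot for a cheap ticket. One possible approach, sketched here under assumptions, is to bucket the time to departure into windows and compare the mean price per window. The bin edges and the names `days_to_dep`, `window` and `cheapest_window` are illustrative; the five rows are the `delta` and `target` values shown in `df1.head()`, which is far too little data to draw a conclusion from.

```python
import pandas as pd

# Days to departure (delta, rounded) and price for the five rows of df1.head()
toy = pd.DataFrame({
    "days_to_dep": [14.68, 0.34, 35.10, 4.20, 24.80],
    "target": [7400.0, 15377.0, 6900.0, 9707.0, 6500.0],
})

# Weekly windows up to four weeks, then one wide window for everything later
bins = [0, 7, 14, 21, 28, 60]
toy["window"] = pd.cut(toy["days_to_dep"], bins=bins)

# Mean price per window, skipping windows without any rows
mean_price = toy.groupby("window", observed=True)["target"].mean()
cheapest_window = mean_price.idxmin()

print(mean_price)
print("cheapest window:", cheapest_window)
```

On the full training set the same grouping could be repeated per `f6` category to check whether the cheapest window differs between them.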