# AIRFLIGHT PRICE PREDICTION

## Instructions
1. Look into the data
2. Find the cheapest and the most expensive flight at any specific time
3. EDA and feature engineering
4. ML modeling
5. Find a sweet spot for a cheap ticket

## What are you to do?
Ahmed is a customer of Sastaticket.pk. He is planning to fly from Karachi to Islamabad for his
brother’s wedding and is currently choosing tickets. Ahmed has to go to
Islamabad, but he also wants to save some money in the process, so he chooses to wait
instead of buying now, simply because ticket prices are too high.

Is this the right decision? Won’t ticket prices increase in the future? Perhaps there is a
sweet-spot Ahmed is hoping to find and maybe he just might find it.
This is the problem that you will be tackling in this competition. Can you predict future prices
accurately enough to tell Ahmed - with confidence - that he has made
the wrong decision?

Your task boils down to generating optimal predictions for flight prices of multiple airlines. If
successful, your model will contribute greatly to Sastaticket’s rich and diverse set of operating
algorithms.

In [5]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

In [6]:
# loading data from the CSV files
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv')
X_test = pd.read_csv('X_test.csv')

In [7]:
#looking into shapes of the datasets
X_train.shape

(21776590, 11)

In [8]:
y_train.shape

(21776590, 2)

In [9]:
X_test.shape

(4532489, 10)

In [10]:
#looking into the datasets
X_train.head()

Unnamed: 0.1,Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10
0,0,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 05:00:00+00:00,2021-01-10 07:00:00+00:00,gamma,True,0.0,0,c-2
1,1,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 05:00:00+00:00,2021-01-10 07:00:00+00:00,gamma,True,32.0,1,c-2
2,2,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 11:00:00+00:00,2021-01-10 13:00:00+00:00,gamma,True,32.0,1,c-4
3,3,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 11:00:00+00:00,2021-01-10 13:00:00+00:00,gamma,True,32.0,2,c-4
4,4,2020-12-31 09:46:18.191119+00:00,x,y,2021-01-25 11:00:00+00:00,2021-01-25 12:55:00+00:00,beta,False,20.0,0,b-69


X_train includes the following features:
- f1: Ticket Purchase Date Time
- f2: Origin
- f3: Destination
- f4: Departure Date Time
- f5: Arrival Date Time
- f6: Airline
- f7: Refundable Ticket
- f8: Baggage Weight
- f9: Baggage Pieces
- f10: Flight Number
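
Since f1, f4 and f5 are timestamps, they can be turned into numeric features later in the feature engineering step. The following is only a minimal sketch on two rows copied from the `X_train.head()` output; the column names `days_to_departure` and `flight_minutes` are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Two toy rows taken from the X_train.head() output (only f1, f4, f5 kept).
toy = pd.DataFrame({
    "f1": ["2020-12-31 09:46:17.463002+00:00", "2020-12-31 09:46:18.191119+00:00"],
    "f4": ["2021-01-10 05:00:00+00:00", "2021-01-25 11:00:00+00:00"],
    "f5": ["2021-01-10 07:00:00+00:00", "2021-01-25 12:55:00+00:00"],
})

# Parse the timestamp strings into timezone-aware datetimes.
for col in ["f1", "f4", "f5"]:
    toy[col] = pd.to_datetime(toy[col], utc=True)

# Hypothetical derived features (illustrative names, not from the dataset):
# whole days between purchase and departure, and flight duration in minutes.
toy["days_to_departure"] = (toy["f4"] - toy["f1"]).dt.days
toy["flight_minutes"] = (toy["f5"] - toy["f4"]).dt.total_seconds() / 60

print(toy[["days_to_departure", "flight_minutes"]])
```

How far in advance a ticket is bought is likely to matter for the "sweet spot" question, which is why a purchase-to-departure gap is a natural first feature to try.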

In [11]:
y_train.head()

Unnamed: 0.1,Unnamed: 0,target
0,0,7400.0
1,1,8650.0
2,2,9150.0
3,3,10400.0
4,4,8697.0


y_train has the following variable:
- target

In [12]:
# merge the two datasets into one
df = pd.concat([X_train, y_train], axis=1)  # horizontal: columns are placed side by side

# with axis=0 the two datasets would instead be stacked vertically, one below the other
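
To make the difference between the two axes concrete, here is a small sketch on toy frames (the values are made up for illustration and `X` and `y` are stand-ins for the real data). Note that `axis=1` aligns rows by index label, so it only pairs features with the right target when both frames share the same index.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train with a shared default index.
X = pd.DataFrame({"f9": [0, 1, 1]})
y = pd.DataFrame({"target": [7400.0, 8650.0, 9150.0]})

# axis=1: columns are placed side by side, rows are matched on the index.
side_by_side = pd.concat([X, y], axis=1)

# axis=0: rows are stacked; columns missing in one frame are filled with NaN.
stacked = pd.concat([X, y], axis=0)

print(side_by_side.shape)
print(stacked.shape)
print(stacked)
```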

In [13]:
df.head()

Unnamed: 0.2,Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,Unnamed: 0.1,target
0,0,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 05:00:00+00:00,2021-01-10 07:00:00+00:00,gamma,True,0.0,0,c-2,0,7400.0
1,1,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 05:00:00+00:00,2021-01-10 07:00:00+00:00,gamma,True,32.0,1,c-2,1,8650.0
2,2,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 11:00:00+00:00,2021-01-10 13:00:00+00:00,gamma,True,32.0,1,c-4,2,9150.0
3,3,2020-12-31 09:46:17.463002+00:00,x,y,2021-01-10 11:00:00+00:00,2021-01-10 13:00:00+00:00,gamma,True,32.0,2,c-4,3,10400.0
4,4,2020-12-31 09:46:18.191119+00:00,x,y,2021-01-25 11:00:00+00:00,2021-01-25 12:55:00+00:00,beta,False,20.0,0,b-69,4,8697.0


In [14]:
df.shape

(21776590, 13)

With about 21.8 million rows, the data is too large to work on comfortably, so we will take a sample and develop on that. Once everything works, the same commands can be run on the full data.

In [15]:
# take a random sample of 5000 rows
df = df.sample(5000)

In [16]:
#looking into shape of the dataset now
df.shape

(5000, 13)
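
`df.sample(5000)` draws a different set of rows on every run. If a repeatable sample is wanted, pandas accepts a `random_state` argument; the sketch below shows this on a toy frame. The seed value 42 and the toy data are arbitrary assumptions and are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged df (values are made up).
full = pd.DataFrame({"f9": np.arange(100), "target": np.arange(100) * 10.0})

# The same seed gives the same rows each time the cell is run.
sample_a = full.sample(n=10, random_state=42)
sample_b = full.sample(n=10, random_state=42)

print(sample_a.index.equals(sample_b.index))

# Sampling after the merge keeps each feature row with its own target.
print((sample_a["target"] == sample_a["f9"] * 10).all())
```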

In [17]:
# saving the training sample to a CSV file for later use (the index is written as an extra column)
df.to_csv('Xy_train_sample.csv')

In [18]:
# looking at the shape of the X_test data
X_test.shape

(4532489, 10)

In [19]:
# sample 500 random rows and save them to another CSV
X_test.sample(500).to_csv('X_test_sample.csv')

In [20]:
#loading the sample data 
df_train = pd.read_csv('Xy_train_sample.csv')

In [21]:
df_train.shape

(5000, 14)

In [22]:
#loading the sample data
df_test = pd.read_csv('X_test_sample.csv')

In [23]:
df_test.shape 

(500, 11)

We now have training data of shape (5000, 14) and test data of shape (500, 11). Let's continue the work in another file.
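
Each reloaded sample has one more column than the frame it was saved from (13 to 14 for train, 10 to 11 for test). This is because `to_csv` writes the index by default, and `read_csv` then loads it back as an `Unnamed: 0` column. A minimal sketch of the effect on a toy frame, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Toy frame with made-up values.
small = pd.DataFrame({"f9": [0, 1, 2], "target": [7400.0, 8650.0, 9150.0]})

# Default: the index is written and comes back as an extra column.
buf_with_index = io.StringIO()
small.to_csv(buf_with_index)
buf_with_index.seek(0)
with_index = pd.read_csv(buf_with_index)

# index=False: only the real columns are written.
buf_no_index = io.StringIO()
small.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
no_index = pd.read_csv(buf_no_index)

print(with_index.columns.tolist())
print(no_index.columns.tolist())
```

If the extra columns are not needed, passing `index=False` when saving, or dropping the `Unnamed` columns after loading, keeps the frames tidy.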