# Flight Price Prediction

### Problem Statement

Anyone who has booked a flight ticket knows how unpredictably prices vary. The cheapest
available ticket on a given flight becomes more or less expensive over time. Airlines usually do
this to maximize revenue, based on:
1. Time-of-purchase patterns (making sure last-minute purchases are expensive)
2. Keeping the flight as full as they want it (raising prices on a flight that is filling up, to slow sales and hold back inventory for those expensive last-minute purchases)

The project is to collect flight fare data along with other features and build a model that
predicts flight fares. It consists of three phases. This notebook covers the first phase, data
collection: scraping at least 1500 rows of flight data from different websites. There is no limit
on the number of columns. Typically they are airline name, date of journey, source, destination,
route, departure time, arrival time, duration, total stops, and the target variable, price.

In [1]:
# importing libraries

import selenium
import pandas as pd
from selenium import webdriver
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementNotInteractableException

In [2]:
# initializing the driver to a variable

driver= webdriver.Chrome('chromedriver.exe')

In [20]:
# Giving address to the driver

driver.get('https://www.yatra.com/')

In [21]:
# Creating lists to store data.

airline=[]
date=[]
source=[]
destination=[]
dep_time=[]
arvl_time=[]
duration=[]
stops=[]
price=[]

# Navigating to the required data.

for f in range(1,10):
    
    # Clicking on the home button.
    driver.find_element_by_xpath("//div[@class='header-left-menu ftL']/a").click()
    time.sleep(5)
    
    # Clicking on the departure option.
    driver.find_element_by_id("BE_flight_origin_city").click()
    time.sleep(1)

    # Selecting the departure from.
    driver.find_element_by_xpath("//div[@class='viewport']/div/div/li[{}]/div".format(f)).click()
    time.sleep(1)

    #Selecting the destination.
    driver.find_element_by_xpath("//div[@class='viewport']/div/div/li[{}]/div".format(f+1)).click()
    time.sleep(1)

    #Selecting the date.
    driver.find_element_by_xpath("//li[@class='datepicker flex1']").click()
    time.sleep(1)
    driver.find_element_by_id("09/10/2021").click()
    time.sleep(1)

    # Clicking the search button.
    driver.find_element_by_id("BE_flight_flsearch_btn").click()
    time.sleep(3)

    # Selecting sort by departure
    driver.find_element_by_xpath("//p[@class='pr uprcse option-label inline-block cursor-pointer ']").click()
    time.sleep(1)
    
    # Scraping data
    
    for j in range(3,20,2):
        
        try:
            # Scraping the date.
            dt= driver.find_element_by_xpath("//div[@class='day-li text-center cursor-pointer pr active font-primary-color']/p[1]")
            
            # Scraping the airline name.
            arl= driver.find_elements_by_xpath("//span[@class='i-b text ellipsis']")
            for i in arl:
                airline.append(i.text)
                date.append(dt.text.split(', ')[-1])
            
            # Scraping the source.
            src= driver.find_elements_by_xpath("//div[@class='i-b col-4 no-wrap text-right dtime col-3']/p[1]")
            for i in src:
                source.append(i.text)
            
            # Scraping the destination.
            des= driver.find_elements_by_xpath("//div[@class='i-b pdd-0 text-left atime col-5']/p[2]")
            for i in des:
                destination.append(i.text)
            
            # Scraping the departure time.
            dtm= driver.find_elements_by_xpath("//div[@class='i-b pr']")
            for i in dtm:
                dep_time.append(i.text)
            
            # Scraping the arrival time.
            atm= driver.find_elements_by_xpath("//p[@class='bold fs-15 mb-2 pr time']")
            for i in atm:
                arvl_time.append(i.text.replace('\n+ 1 day',''))
            
            # Scraping the duration.
            dr= driver.find_elements_by_xpath("//p[@class='fs-12 bold du mb-2']")
            for i in dr:
                duration.append(i.text)
            
            # Scraping the number of stops
            st= driver.find_elements_by_xpath("//div[@class=' font-lightgrey fs-10 tipsy i-b fs-10']/span[1]")
            for i in st:
                stops.append(i.text)
            
            # Scraping the price.
            pr= driver.find_elements_by_xpath("//div[@class='i-b tipsy fare-summary-tooltip fs-18']")
            for i in pr:
                price.append(i.text)

            try:
                # Selecting next date.
                driver.find_element_by_xpath("//ul[@class='full-width no-wrap ovf-hidden']/li[{}]".format(j)).click()
                time.sleep(5)
                
                # Bringing upcoming dates into view.
                driver.find_element_by_xpath("//i[@class='ytfi-angle-right']").click()
                time.sleep(1)
                driver.find_element_by_xpath("//i[@class='ytfi-angle-right']").click()
                time.sleep(1)

                # Selecting sort by departure
                driver.find_element_by_xpath("//p[@class='pr uprcse option-label inline-block cursor-pointer ']").click()
                time.sleep(1)

            except ElementNotInteractableException:
                pass
            
        except NoSuchElementException:
            pass
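The nine lists filled above are parallel: row k of the final table is element k of every list. `pd.DataFrame` raises a `ValueError` when the lists passed in a dict differ in length, which can happen if one XPath matches fewer elements than another on some page. A minimal sketch of a pre-check follows; the helper names `column_lengths` and `assert_same_length` are hypothetical, and the toy data only stands in for the scraped lists.

```python
def column_lengths(columns):
    """Return {name: len(values)} for a dict of parallel lists."""
    return {name: len(values) for name, values in columns.items()}


def assert_same_length(columns):
    """Return the shared length, or raise ValueError listing every length."""
    lengths = column_lengths(columns)
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    # All lists have the same length (0 if the dict is empty).
    return next(iter(lengths.values()), 0)


# Toy data standing in for the scraped lists (illustrative values only).
sample = {
    "Airline": ["Air India", "Air Asia"],
    "Price": ["17990", "6166"],
}
n_rows = assert_same_length(sample)
```

Calling `assert_same_length` on the dict that is about to be passed to `pd.DataFrame` would report which column is short, rather than only that the lengths disagree.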

In [29]:
# Creating dataframe for the data.

flights_yt= pd.DataFrame({"Airline":airline,"Date":date,"Source":source,"Destination":destination,"Departure_time":dep_time,
                          "Arrival_time":arvl_time,"Duration":duration,"Stops":stops,"Price":price})
flights_yt

Unnamed: 0,Airline,Date,Source,Destination,Departure_time,Arrival_time,Duration,Stops,Price
0,Air India,9 Oct,New Delhi,Mumbai,04:55,14:25,9h 30m,1 Stop,17990
1,Air India,9 Oct,New Delhi,Mumbai,05:20,23:05,17h 45m,2 Stop(s),9000
2,Air India,9 Oct,New Delhi,Mumbai,05:20,09:20,28h 00m,2 Stop(s),9000
3,Air Asia,9 Oct,New Delhi,Mumbai,05:35,22:10,16h 35m,2 Stop(s),6166
4,Air India,9 Oct,New Delhi,Mumbai,05:45,09:40,3h 55m,1 Stop,7468
...,...,...,...,...,...,...,...,...,...
2386,Air India,24 Oct,Jaipur,Lucknow,14:00,19:40,29h 40m,2 Stop(s),14093
2387,Air India,24 Oct,Jaipur,Lucknow,14:00,19:40,29h 40m,2 Stop(s),14093
2388,Air India,24 Oct,Jaipur,Lucknow,14:00,19:40,29h 40m,2 Stop(s),14093
2389,Air India,24 Oct,Jaipur,Lucknow,14:00,19:40,29h 40m,2 Stop(s),14093
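
The last four rows shown above (2386 to 2389) are identical in every column. One possible cause, which the notebook does not confirm, is that the same results page was scraped again after a click on the next date failed, since the inner `except` clause silently continues. A minimal sketch of removing exact repeats while keeping the first occurrence, written on plain tuples; `drop_duplicate_rows` is a hypothetical helper, not part of the original notebook.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Three copies of one row from the output above, plus one distinct row.
repeated = ("Air India", "24 Oct", "Jaipur", "Lucknow", "14:00", "19:40",
            "29h 40m", "2 Stop(s)", "14093")
other = ("Air India", "9 Oct", "New Delhi", "Mumbai", "05:45", "09:40",
         "3h 55m", "1 Stop", "7468")
rows = [repeated, repeated, other, repeated]
unique_rows = drop_duplicate_rows(rows)
```

On the DataFrame itself, `flights_yt.drop_duplicates()` is the pandas equivalent. Whether exact repeats should be dropped depends on whether they are scraping artifacts or genuinely separate listings.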


In [32]:
#flights_yt.to_csv('flight_yatra.csv', index=False)

In [62]:
# Giving another address to the driver

driver.get('https://www.makemytrip.com/')

In [77]:
# Creating lists to store data.

airline=[]
date=[]
source=[]
destination=[]
dep_time=[]
arvl_time=[]
duration=[]
stops=[]
price=[]

# Navigating to the required data.

for f in range(1,9):
    
    # Clicking on the home button.
    try:
        driver.find_element_by_xpath("//a[@class='chMmtLogo']").click()
    except ElementNotInteractableException:
        pass
        
    # Clicking on the departure option.
    driver.find_element_by_id("fromCity").click()
    time.sleep(1)

    # Selecting the departure from. 
    driver.find_element_by_xpath("//div[@id='react-autowhatever-1']/div[2]/ul/li[{}]".format(f)).click()
    time.sleep(1)
    
    # Clicking the destination box.
    try:
        driver.find_element_by_xpath("//div[@class='fsw_inputBox searchToCity inactiveWidget ']").click()
        time.sleep(1)
    except NoSuchElementException:
        pass
        
    #Selecting the destination.
    driver.find_element_by_xpath("//div[@id='react-autowhatever-1']/div[2]/ul/li[{}]".format(f+1)).click()
    time.sleep(1)
    
    #Selecting the date.
    driver.find_element_by_xpath("//div[@class='DayPicker-Day DayPicker-Day--selected']").click()
    time.sleep(1)

    # Clicking the search button.
    driver.find_element_by_xpath("//a[@class='primaryBtn font24 latoBold widgetSearchBtn ']").click()
    time.sleep(2)

    # Selecting sort by departure
    driver.find_element_by_xpath("//div[@class='sortby-dom-sctn departure_sorter ']/span").click()
    
    
    # Scraping data.
    
    for j in range(4,20,2):
        
        try:
            # Scraping the date.
            dt= driver.find_element_by_xpath("//div[@class='weeklyFareItems active']/a/p[1]")
            
            # Scraping the airline name.
            arl= driver.find_elements_by_xpath("//span[@class='boldFont blackText airlineName']")
            for i in arl:
                airline.append(i.text)
                date.append(dt.text.split(', ')[-1])
            
            # Scraping the source.
            src= driver.find_elements_by_xpath("//div[@class='flightTimeSection flexOne timeInfoLeft']/div/p[2]")
            for i in src:
                source.append(i.text)
            
            # Scraping the destination.
            des= driver.find_elements_by_xpath("//div[@class='flightTimeSection flexOne timeInfoRight']/div/p[2]")
            for i in des:
                destination.append(i.text)
            
            # Scraping the departure time.
            dtm= driver.find_elements_by_xpath("//div[@class='flightTimeSection flexOne timeInfoLeft']/div/p[1]")
            for i in dtm:
                dep_time.append(i.text)
            
            # Scraping the arrival time.
            atm= driver.find_elements_by_xpath("//div[@class='flightTimeSection flexOne timeInfoRight']/div/p[1]")
            for i in atm:
                arvl_time.append(i.text.replace('\n+ 1 day',''))
            
            # Scraping the duration.
            dr= driver.find_elements_by_xpath("//div[@class='stop-info flexOne']/p[1]")
            for i in dr:
                duration.append(i.text)
            
            # Scraping the number of stops.
            st= driver.find_elements_by_xpath("//p[@class='flightsLayoverInfo']")
            for i in st:
                stops.append(i.text.split(' via')[0])
            
            # Scraping the price.
            pr= driver.find_elements_by_xpath("//p[@class='blackText fontSize18 blackFont white-space-no-wrap']")
            for i in pr:
                price.append(i.text.replace('₹ ',''))

            try:
                # Selecting the next date.
                driver.find_element_by_xpath("//div[@class='slider-list']/div[{}]".format(j)).click()
                time.sleep(6)

                # Selecting sort by departure
                driver.find_element_by_xpath("//div[@class='sortby-dom-sctn departure_sorter ']/span").click()

            except NoSuchElementException:
                pass
            
        except NoSuchElementException:
            pass

In [83]:
# Creating dataframe for the data.

flights_mt= pd.DataFrame({"Airline":airline,"Date":date,"Source":source,"Destination":destination,"Departure_time":dep_time,
                          "Arrival_time":arvl_time,"Duration":duration,"Stops":stops,"Price":price})
flights_mt

Unnamed: 0,Airline,Date,Source,Destination,Departure_time,Arrival_time,Duration,Stops,Price
0,IndiGo,Oct 9,Mumbai,New Delhi,05:35,10:50,05 h 15 m,1 stop,9461
1,AirAsia,Oct 9,Mumbai,New Delhi,05:55,17:45,11 h 50 m,1 stop,5941
2,Go First,Oct 9,Mumbai,New Delhi,06:00,08:05,02 h 05 m,Non stop,5942
3,IndiGo,Oct 9,Mumbai,New Delhi,06:05,14:55,08 h 50 m,1 stop,7896
4,IndiGo,Oct 9,Mumbai,New Delhi,06:05,08:15,02 h 10 m,Non stop,8043
...,...,...,...,...,...,...,...,...,...
720,IndiGo,Oct 23,Chennai,Goa,14:30,20:50,06 h 20 m,1 stop,5042
721,Air India,Oct 23,Chennai,Goa,14:55,20:30,05 h 35 m,1 stop,8928
722,IndiGo,Oct 23,Chennai,Goa,15:50,19:35,03 h 45 m,1 stop,8560
723,IndiGo,Oct 23,Chennai,Goa,15:55,17:25,01 h 30 m,Non stop,5672
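
The `Duration` column is formatted differently across the scraped sites: the Yatra output uses `9h 30m`, while the MakeMyTrip output uses `05 h 15 m`. A numeric value is easier to compare and to feed to a model. The following is a minimal sketch that converts either style to minutes; the function name `duration_to_minutes` and the regular expression are assumptions based only on the sample rows shown in this notebook.

```python
import re

# Matches "9h 30m", "05 h 15 m" and "10h 5m": hours, "h", minutes, "m",
# with optional whitespace between the parts.
_DURATION = re.compile(r"^\s*(\d+)\s*h\s*(\d+)\s*m\s*$", re.IGNORECASE)


def duration_to_minutes(text):
    """Convert a scraped duration string to a total number of minutes."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError("unrecognised duration: {!r}".format(text))
    hours = int(match.group(1))
    minutes = int(match.group(2))
    return hours * 60 + minutes


# Sample strings taken from the outputs in this notebook.
yatra_minutes = duration_to_minutes("9h 30m")
mmt_minutes = duration_to_minutes("05 h 15 m")
```

With pandas, the same function could be applied through `Series.map`, for example on `flights_mt["Duration"]`. Strings in any other format raise a `ValueError` instead of being silently guessed.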


In [84]:
#flights_mt.to_csv('flight_mkmytrip.csv', index=False)

In [3]:
# Giving another address to the driver

driver.get('https://in.via.com/')

In [4]:
# Getting the URLs of the flight routes.

routs=[]
rt= driver.find_elements_by_xpath("//div[@class='deal']/a")
for i in rt:
    routs.append(i.get_attribute('href'))

In [44]:
# Creating lists to store data.

airline=[]
date=[]
source=[]
destination=[]
dep_time=[]
arvl_time=[]
duration=[]
stops=[]
price=[]

# Scraping the required data.

for r in routs:
    driver.get(r)
    
    # Clicking on the date box close button.
    driver.find_element_by_id('vc-close').click()
    time.sleep(1)
    
    # selecting the date.
    driver.find_element_by_xpath("//div[@class='lowFares-slider lowWeeekFares-slider']/div/div[2]").click()
    time.sleep(5)
    
    # selecting sort by depart.
    driver.find_element_by_xpath("//div[@class='depart sortClass  js-toolTipLeft']").click()
    time.sleep(1)
    
    for n in range(15):
        
        # Scraping the date.
        dt= driver.find_element_by_xpath("//span[@class='dt']")
        
        # Scraping the airline name.
        arl= driver.find_elements_by_xpath("//div[@class='name js-toolTip']")
        for a in arl:
            airline.append(a.text)
            date.append(dt.text.split(', ')[-1])
        
        # Scraping the source.
        src= driver.find_elements_by_xpath("//div[@class='depTime']/div[2]")
        for s in src:
            source.append(s.text)
        
        # Scraping the destination.
        des= driver.find_elements_by_xpath("//div[@class='arrTime']/div[2]")
        for d in des:
            destination.append(d.text)
        
        # Scraping the departure time.
        dtm= driver.find_elements_by_xpath("//div[@class='depTime']/div[1]")
        for t in dtm:
            dep_time.append(t.text)
        
        # Scraping the arrival time.
        atm= driver.find_elements_by_xpath("//div[@class='arrTime']/div[1]")
        for m in atm:
            arvl_time.append(m.text)
        
        # Scraping the number of stops.
        st= driver.find_elements_by_xpath("//span[@class='stops']")
        for p in st:
            stops.append(p.text.replace('(s)',''))
        
        # Scraping the duration.
        dr= driver.find_elements_by_xpath("//div[@class='dur']")
        for i,j in zip(dr,st):
            duration.append(i.text.replace(j.text,''))
        
        # Scraping the price.
        pr= driver.find_elements_by_xpath("//span[@class='price']")
        for c in pr:
            price.append(c.text)
        
        try:
            # Clicking on the next day
            driver.find_element_by_xpath("//a[@class='via_next_day']").click()
            time.sleep(10)

            # Selecting sort by departure
            driver.find_element_by_xpath("//div[@class='depart sortClass  js-toolTipLeft']").click()
            time.sleep(3)

        except NoSuchElementException:
            pass

In [64]:
# Creating dataframe for the data.

flights_via= pd.DataFrame({"Airline":airline,"Date":date,"Source":source,"Destination":destination,"Departure_time":dep_time,
                          "Arrival_time":arvl_time,"Duration":duration,"Stops":stops,"Price":price})
flights_via

Unnamed: 0,Airline,Date,Source,Destination,Departure_time,Arrival_time,Duration,Stops,Price
0,AirAsia India,Oct 10 2021,Delhi,Goa,04:55,15:00,10h 5m,2 Stop,11821
1,AirAsia India,Oct 10 2021,Delhi,Goa,04:55,15:00,10h 5m,2 Stop,12347
2,AirIndia,Oct 10 2021,Delhi,Goa,04:55,06:50,25h 55m,3 Stop,13688
3,AirIndia,Oct 10 2021,Delhi,Goa,05:15,10:05,4h 50m,1 Stop,9148
4,AirAsia India,Oct 10 2021,Delhi,Goa,05:20,07:50,2h 30m,Non-Stop,7426
...,...,...,...,...,...,...,...,...,...
3325,Vistara,Oct 24 2021,Mumbai,Ahmedabad,11:55,10:20,22h 25m,1 Stop,10668
3326,GO FIRST,Oct 24 2021,Mumbai,Ahmedabad,12:40,17:10,4h 30m,1 Stop,9461
3327,GO FIRST,Oct 24 2021,Mumbai,Ahmedabad,12:40,22:05,9h 25m,1 Stop,9461
3328,GO FIRST,Oct 24 2021,Mumbai,Ahmedabad,13:25,21:45,8h 20m,1 Stop,10537
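
The `Stops` column also varies by site: the outputs in this notebook contain `1 Stop`, `2 Stop(s)`, `1 stop`, `Non stop` and `Non-Stop`. The sketch below maps these labels to an integer count. The name `stops_to_int` is hypothetical, and the rule that any label starting with "non" means zero stops is an assumption drawn from the samples shown.

```python
import re


def stops_to_int(text):
    """Convert a scraped stops label to an integer number of stops."""
    cleaned = text.strip().lower()
    # "Non stop" and "Non-Stop" both mean a direct flight.
    if cleaned.startswith("non"):
        return 0
    # "1 stop", "2 stop(s)", "3 stop": a leading count followed by "stop".
    match = re.match(r"(\d+)\s*stop", cleaned)
    if match is None:
        raise ValueError("unrecognised stops label: {!r}".format(text))
    return int(match.group(1))


# Labels taken from the three outputs in this notebook.
labels = ["1 Stop", "2 Stop(s)", "1 stop", "Non stop", "Non-Stop", "3 Stop"]
counts = [stops_to_int(label) for label in labels]
```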


In [65]:
#flights_via.to_csv('flight_via.csv', index=False)

In [66]:
# Loading the scraped data.

flights1= pd.read_csv('flight_yatra.csv')
flights2= pd.read_csv('flight_mkmytrip.csv')
flights3= pd.read_csv('flight_via.csv')

In [78]:
# Storing all the data in one dataframe.

df= pd.concat([flights1,flights2,flights3], axis=0, ignore_index=True)
df

Unnamed: 0,Airline,Date,Source,Destination,Departure_time,Arrival_time,Duration,Stops,Price
0,Air India,9 Oct,New Delhi,Mumbai,04:55,14:25,9h 30m,1 Stop,17990
1,Air India,9 Oct,New Delhi,Mumbai,05:20,23:05,17h 45m,2 Stop(s),9000
2,Air India,9 Oct,New Delhi,Mumbai,05:20,09:20,28h 00m,2 Stop(s),9000
3,Air Asia,9 Oct,New Delhi,Mumbai,05:35,22:10,16h 35m,2 Stop(s),6166
4,Air India,9 Oct,New Delhi,Mumbai,05:45,09:40,3h 55m,1 Stop,7468
...,...,...,...,...,...,...,...,...,...
6136,Vistara,Oct 24 2021,Mumbai,Ahmedabad,11:55,10:20,22h 25m,1 Stop,10668
6137,GO FIRST,Oct 24 2021,Mumbai,Ahmedabad,12:40,17:10,4h 30m,1 Stop,9461
6138,GO FIRST,Oct 24 2021,Mumbai,Ahmedabad,12:40,22:05,9h 25m,1 Stop,9461
6139,GO FIRST,Oct 24 2021,Mumbai,Ahmedabad,13:25,21:45,8h 20m,1 Stop,10537
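
The combined frame mixes three date styles: `9 Oct` (Yatra), `Oct 9` (MakeMyTrip) and `Oct 10 2021` (Via). The sketch below parses all three into `datetime.date` objects. The name `parse_scraped_date` is hypothetical, and the default year of 2021 is an assumption taken from the Via dates and the `09/10/2021` date picker id used earlier. A manual month table is used so that parsing does not depend on the system locale.

```python
from datetime import date

MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def parse_scraped_date(text, default_year=2021):
    """Parse '9 Oct', 'Oct 9' or 'Oct 10 2021' into a datetime.date."""
    day = None
    month = None
    year = default_year  # assumed when the string carries no year
    for token in text.replace(",", " ").split():
        if token.isdigit():
            value = int(token)
            # Numbers above 31 cannot be a day, so treat them as the year.
            if value > 31:
                year = value
            else:
                day = value
        elif token.lower()[:3] in MONTHS:
            month = MONTHS[token.lower()[:3]]
    if day is None or month is None:
        raise ValueError("unrecognised date: {!r}".format(text))
    return date(year, month, day)


# One sample per site, taken from the outputs in this notebook.
yatra_date = parse_scraped_date("9 Oct")
mmt_date = parse_scraped_date("Oct 9")
via_date = parse_scraped_date("Oct 10 2021")
```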


In [80]:
# Saving the dataset.

df.to_excel("final_flights.xlsx")  