
## Data Mining

**Objective**: Extract flight prices and related details from Kayak for **one-way**, **economy** flights for **1 adult passenger** from **Dublin Airport to Sydney Airport**, departing between **30th October 2024 and 30th January 2025**.


## Scenario

The aim is to collect data from `kayak.ie` that can later be used to predict flight prices for different dates and times. The search is limited to:
- Flights from Dublin to Sydney Airport
- Flights between 30th October 2024 and 30th January 2025
- For 1 adult passenger
- Economy class
- One-way flights


### Extraction Variables

The following variables will be extracted:

1. **Date**
2. **Flight Name**
3. **Stops**
4. **Price**
5. **Duration**
6. **Departure Time**
7. **Arrival Time**


### Setup and Imports

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import time
import csv

### Scraping Logic

In [None]:
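# Scrape flight data for every date from start_date to end_date (inclusive).
# time.sleep() pauses after each page load so the results can render and
# requests are spaced out, which reduces the chance of being flagged as a bot.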
def scrape_flights_for_month(start_date, end_date):
    driver = webdriver.Chrome()
    
    flight_details = []
    current_date = start_date

    while current_date <= end_date:

        formatted_date = current_date.strftime('%Y-%m-%d')
        url = f'https://www.kayak.ie/flights/DUB-SYD/{formatted_date}?sort=bestflight_a'
        
        driver.get(url)
        
        time.sleep(10) 
        
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        flight_containers = soup.find_all('div', class_="yuAt yuAt-pres-rounded yuAt-mod-box-shadow")
        for item in flight_containers:
            try:
                flight_name = item.find('div', class_="J0g6-operator-text").get_text(strip=True)
                stops = item.find('span', class_="JWEO-stops-text").get_text(strip=True)
                price = item.find('div', class_="f8F1-price-text").get_text(strip=True)
                
                # Extract the duration. It shares a class with other elements
                # (such as stops), so look for the div that has no child <span>.
                duration = None
                divs = item.find_all('div', class_="vmXl vmXl-mod-variant-default")

                for div in divs:
                    # if there are no spans inside the div, it's the duration
                    if not div.find('span'):  
                        duration = div.get_text(strip=True)
                        break

                times_div = item.find('div', class_="vmXl vmXl-mod-variant-large")
                if times_div:
                    # Departure and arrival times are separated by an en dash
                    times = times_div.get_text(strip=True).split("–")
                    departure_time = times[0].strip() if len(times) > 0 else "Departure not found"
                    arrival_time = times[1].strip() if len(times) > 1 else "Arrival not found"
                else:
                    departure_time, arrival_time = "Departure not found", "Arrival not found"

                flight_details.append([formatted_date, flight_name, stops, price, duration, departure_time, arrival_time])
            except Exception as e:
                print(f"Error extracting data: {e}")
        
        current_date += timedelta(days=1)

    driver.quit()
    return flight_details


In [None]:

start_date = datetime(2024, 10, 30)
end_date = datetime(2024, 11, 29)

all_flight_details = []

while start_date <= datetime(2025, 1, 30):
    #print(f"scraping from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
    monthly_flights = scrape_flights_for_month(start_date, end_date)
    
    all_flight_details.extend(monthly_flights)
    
    start_date = end_date + timedelta(days=1)
    end_date = start_date + timedelta(days=30)
    
    # Pause for 5 seconds between batches to reduce the chance of bot detection
    time.sleep(5)
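
The batching loop above advances `start_date` and `end_date` by hand. As a minimal sketch of the same idea, the generator below yields inclusive date windows and clips the last one to the final date. The name `date_windows` and the `window_days` parameter are hypothetical and not part of the original notebook.

```python
from datetime import datetime, timedelta

def date_windows(first_day, last_day, window_days=31):
    """Yield (start, end) pairs that together cover first_day..last_day inclusive."""
    start = first_day
    while start <= last_day:
        # Clip the final window so it never runs past last_day.
        end = min(start + timedelta(days=window_days - 1), last_day)
        yield start, end
        start = end + timedelta(days=1)

# Same overall range as the notebook: 30 Oct 2024 to 30 Jan 2025.
windows = list(date_windows(datetime(2024, 10, 30), datetime(2025, 1, 30)))
for start, end in windows:
    print(start.strftime('%Y-%m-%d'), 'to', end.strftime('%Y-%m-%d'))
```

Each `(start, end)` pair could then be passed to `scrape_flights_for_month`, which keeps the overall end date in one place instead of repeating it in the loop condition.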



### Storing Data into CSV

Once the flight details have been collected, they are written to a CSV file for further cleaning and analysis.


In [None]:
csv_file_name = 'flight_details.csv'

header = ['Date', 'Flight Name', 'Stops', 'Price', 'Duration', 'Departure-Time', 'Arrival-Time']

# Prices may contain currency symbols, so set the encoding explicitly
with open(csv_file_name, mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(header)
    writer.writerows(all_flight_details)

print(f"Flight details have been written to {csv_file_name}")

Flight details have been written to flight_details.csv
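
The scraped `Price` and `Duration` values are stored as raw text, so a later cleaning step will probably need to convert them to numbers. The sketch below is one possible approach, assuming prices look like `€1,234` (whole currency units, no decimals) and durations look like `23h 45m`. Both formats and both helper names are assumptions and should be checked against the actual CSV.

```python
import re

def parse_price(text):
    """Return the integer amount in a price string such as '€1,234', or None."""
    # Assumes whole currency units: every non-digit character is dropped.
    digits = re.sub(r"[^\d]", "", text or "")
    return int(digits) if digits else None

def parse_duration_minutes(text):
    """Convert a duration such as '23h 45m' to minutes, or None if unrecognised."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)m)?\s*", text or "")
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(parse_price("€1,234"))
print(parse_duration_minutes("23h 45m"))
```

Returning `None` for unrecognised text keeps bad rows visible during cleaning instead of silently turning them into zeros.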


## Issues Faced

1. **Multiple Elements Sharing the Same Class**: 
   - **Problem**: The class `vmXl vmXl-mod-variant-default` is shared by several elements within a flight card (for example duration and stops), so selecting by class alone is ambiguous.
   - **Solution**: Inspected the child elements within those containers to tell them apart. For example, the duration is the element with no `<span>` child.

2. **Website Blocking (Captcha)**: 
   - **Problem**: After multiple consecutive requests, the website prompts a captcha.
   - **Solution**: Added `time.sleep()` pauses between requests to reduce the chance of detection, and restarted the driver after scraping each month of dates.
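
The notebook uses fixed pauses. A common variation, shown here only as a sketch, is to add a random extra delay so that requests are not evenly spaced. The helper name `polite_sleep` and its default values are hypothetical, and randomised delays are not guaranteed to avoid a captcha.

```python
import random
import time

def polite_sleep(base_seconds=10.0, jitter_seconds=5.0, sleep=time.sleep):
    """Pause for base_seconds plus a random extra of up to jitter_seconds.

    The sleep function is injectable so the helper can be tested without waiting.
    Returns the delay that was used.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay
```

In the scraping loop, a call such as `polite_sleep()` could take the place of `time.sleep(10)` after `driver.get(url)`.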


## Comments

- `+1` in flight arrival time indicates that the flight lands the next day.
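
For analysis it may help to turn the arrival text into a full timestamp. The sketch below assumes the arrival string uses 24-hour `HH:MM` with an optional `+N` day suffix (for example `06:30+1`) and that the offset counts days after the departure date. The function name `arrival_datetime` is hypothetical, and the result is the local time at the arrival airport, not a time-zone-aware value.

```python
import re
from datetime import datetime, timedelta

def arrival_datetime(departure_date, arrival_text):
    """Combine a 'YYYY-MM-DD' departure date with an arrival string such as '06:30+1'."""
    match = re.fullmatch(r"\s*(\d{1,2}):(\d{2})\s*(?:\+(\d+))?\s*", arrival_text)
    if not match:
        # Covers placeholders such as "Arrival not found"
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    day_offset = int(match.group(3) or 0)  # no suffix means same-day arrival
    base = datetime.strptime(departure_date, "%Y-%m-%d")
    return base + timedelta(days=day_offset, hours=hour, minutes=minute)

print(arrival_datetime("2024-10-30", "06:30+1"))
```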
