# Web Scraping for Collecting Flights Data in Central America

## Introduction

This Jupyter Notebook collects data on airports in Central America and their flights using web scraping, as part of our SQL database project for airport management.

Web scraping lets us extract the relevant data from Airport Info (https://airportinfo.live/). The data covers flights in Central America and includes schedules, the airline operating each flight, and the departure and arrival airports, among other fields.

## Objectives

1. Obtain data on flights in Central America from https://airportinfo.live/.
2. Extract key information for each flight, such as its schedule, its airline, and airport details.
3. Store the collected data in a suitable format for further analysis and use in our SQL database.

## Libraries

* **Requests**: The requests library allows users to make HTTP requests to the web pages they want to analyze, facilitating the download of HTML content from these pages for further processing.

* **Beautiful Soup (bs4)**: Beautiful Soup is a useful tool for parsing and searching HTML elements in the downloaded content. It enables users to search and extract specific information from web pages, such as titles, paragraphs, links, and more.

* **Selenium**: When websites use JavaScript to load dynamic content, Selenium becomes a valuable choice. With this library, users can automate a web browser to interact with the website and extract data from pages that require interaction.

* **Pandas**: Pandas is an essential library for structuring and manipulating extracted data. It allows users to create DataFrames to organize data into rows and columns, facilitating operations such as cleaning, filtering, and processing.

In [1]:
import time

import bs4
import chromedriver_autoinstaller
import pandas as pd
import requests
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support import ui
from selenium.webdriver.support.ui import WebDriverWait

In [5]:
chromedriver_autoinstaller.install()



## Web Scraping for Data Collection

In this section, we will explore web scraping, an automated technique for extracting data from websites.
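
As a small illustration of what "extracting data" means in practice, the sketch below pulls the cell text out of an HTML table. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup or Selenium so that it runs without a browser or network access. The markup in `SAMPLE_HTML` and the class name `TableTextParser` are invented for the example; the real site's structure may differ.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the row being read, None outside <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<table><tbody>
  <tr><td>10:15</td><td>XX 123</td><td>SCHEDULED</td></tr>
  <tr><td>11:40</td><td>YY 456</td><td>ARRIVED</td></tr>
</tbody></table>
"""

parser = TableTextParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)
```

The cells below take the Selenium route instead, because the flight table is only filled in after interacting with the page.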

In [None]:
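def wb_driver(extension):
    """Open a Chrome session on the airportinfo.live page given by `extension`."""
    driver = webdriver.Chrome()

    # `extension` is the path appended to the site's base URL.
    driver.get(f"https://airportinfo.live/{extension}")

    return driver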

In [None]:
# Lines that close one flight row in the text of the results table.
ROW_END_MARKERS = (
    "SCHEDULED", "ARRIVED", "IN AIR", "UNKNOWN", "DETAILS",
    "You’re eligible for compensation from this flight. Check what you’re owed for free. Check now",
)


def recolect_flight(extension):
    """Collect the flight rows shown on one airport page, hour by hour.

    Returns a list of rows; each row is the list of text lines belonging to one flight.
    """
    driver = wb_driver(extension)
    data = []

    try:
        # Wait for the start-time selector, then scroll it into view.
        hora = WebDriverWait(driver, 20).until(
            EC.visibility_of_element_located((By.XPATH, '//*[@id="timeStart"]')))
        driver.execute_script("arguments[0].scrollIntoView(true);", hora)

        # Open the date picker and choose day 9 (hard-coded).
        date_input = driver.find_element(By.ID, "datepicker_input")
        date_input.click()
        time.sleep(2)
        day_9 = driver.find_element(By.XPATH, "//a[text()='9']")
        day_9.click()

        # Step through the 24 options of the start-time selector.
        for i in range(1, 25):
            hora_fija = driver.find_element(By.XPATH, f'//*[@id="timeStart"]/option[{i}]')
            hora_fija.click()
            time.sleep(10)

            boton_hora = driver.find_element(By.XPATH, '//*[@id="fs_div"]/div[3]/button')
            boton_hora.click()
            time.sleep(5)

            try:
                vuelos = driver.find_element(By.TAG_NAME, 'tbody').text.split('\n')
            except NoSuchElementException:
                # No results table for this hour; move on to the next one.
                continue

            # Cut the flat list of lines into rows, closing a row at each marker line.
            j = 0
            for a, line in enumerate(vuelos):
                if line in ROW_END_MARKERS:
                    data.append(vuelos[j:a + 1])
                    j = a + 1
    finally:
        # Close the browser even if a lookup fails part-way through.
        driver.quit()

    return data
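
The row-splitting step inside `recolect_flight` can be tried without a browser. The sketch below is a standalone version of that idea: it walks a flat list of text lines and closes a row each time it meets a status marker. The helper name `group_rows` and the sample text are invented for illustration and are not taken from the site.

```python
def group_rows(lines, end_markers):
    """Split a flat list of text lines into rows, closing a row at each marker line."""
    rows, start = [], 0
    for idx, line in enumerate(lines):
        if line in end_markers:
            rows.append(lines[start:idx + 1])
            start = idx + 1
    # Lines after the last marker do not form a complete row and are dropped.
    return rows


# Invented sample standing in for the text of the results table.
sample = "10:15\nXX 123\nSCHEDULED\n11:40\nYY 456\nCodeshare ZZ 9\nARRIVED\ntrailing"
rows = group_rows(sample.split("\n"), {"SCHEDULED", "ARRIVED"})
print(rows)
```

Because rows can have different lengths (the second one carries an extra "Codeshare" line), later processing has to locate fields by position relative to such lines rather than by a fixed column number.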

In [None]:
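def extract_all(list_airports):
    """Scrape every airport in `list_airports` and append the parsed rows to `arrivos`.

    Relies on two DataFrames defined elsewhere in the notebook: `df_info_aero`
    (airport names in its "Nombre" column) and `arrivos` (the output table).
    """
    arrivos_rows = []

    for k, airport in enumerate(list_airports):
        data = recolect_flight(airport)
        time.sleep(5)

        for row in data:
            # Only rows starting with 'Delay' or an empty line are processed.
            if row[0] not in ('Delay', ''):
                continue

            # The field positions shift by one when a "Codeshare ..." line is present.
            if row[0] == 'Delay':
                codeshare_index = 4 if row[2].split(' ')[0] == "Codeshare" else 3
            else:
                # row[0] is '' in this branch, so this test never matches and the index is always 1.
                codeshare_index = 2 if row[0].split(' ')[0] == "Codeshare" else 1

            # Skip rows too short to hold the field after `codeshare_index`.
            if len(row) <= codeshare_index + 1:
                continue

            # Use the next field if it starts with a digit; otherwise fall back to
            # the last word of the field before `codeshare_index`.
            next_field = row[codeshare_index + 1]
            if next_field != '' and next_field[0].isdigit():
                fourth = next_field
            else:
                fourth = row[codeshare_index - 1].split(' ')[-1]

            arrivos_rows.append([row[codeshare_index - 1], df_info_aero["Nombre"][k],
                                 row[codeshare_index], fourth, row[-3]])

    # Append after the existing rows (assumes `arrivos` has five columns and a default integer index).
    for new_row in arrivos_rows:
        arrivos.loc[len(arrivos)] = new_row

    return arrivos_rows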
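
Since the collected rows are meant to feed a SQL database, the sketch below shows one way five-field rows shaped like `arrivos_rows` could be loaded into a table. It uses the standard library's `sqlite3` with an in-memory database purely for illustration; the table name `flights`, the column names, the helper `save_rows`, and the sample values are all assumptions, since the notebook does not name the five fields or the target database.

```python
import sqlite3


def save_rows(conn, rows):
    """Create the table if needed, insert five-field rows, and return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flights "
        "(field_1 TEXT, airport_name TEXT, field_3 TEXT, field_4 TEXT, field_5 TEXT)"
    )
    # Parameterized inserts avoid building SQL strings from scraped text.
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM flights").fetchone()[0]


# Invented rows with the same five-field shape as `arrivos_rows`.
sample_rows = [
    ["XX 123", "Sample Airport", "City A", "10:15", "SCHEDULED"],
    ["YY 456", "Sample Airport", "City B", "11:40", "ARRIVED"],
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
total = save_rows(conn, sample_rows)
print(total)
```

With a real database, only the connection line would change; the insert logic stays the same as long as each row keeps five fields in the same order.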