# Web Scraping for Collecting Airlines Data

## Introduction

This Jupyter Notebook collects data about airlines worldwide by scraping Wikipedia's "List of airline codes" page.

## Objectives

1. Obtain each airline's IATA code, ICAO code, name, call sign, and country from Wikipedia.
2. Store the collected data in a format suitable for further analysis and for loading into our database.

## Libraries

* **Requests**: The requests library allows users to make HTTP requests to web pages, facilitating the download of HTML content from Wikipedia for further processing.

* **Beautiful Soup (bs4)**: Beautiful Soup is a useful tool for parsing and searching HTML elements in the downloaded content. It enables users to extract specific information from Wikipedia pages, such as titles, paragraphs, links, and more.

* **Pandas**: Pandas is an essential library for structuring and manipulating extracted data. It allows users to organize data into rows and columns, facilitating operations such as cleaning, filtering, and processing.
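As a rough illustration of how the parsing and structuring steps fit together, the sketch below runs Beautiful Soup and Pandas on a small hand-written HTML string instead of a downloaded page, so the Requests step is skipped. The HTML, the airline names, and the `demo` variable are made up for this example and do not come from Wikipedia.

```python
import bs4
import pandas as pd

# Hypothetical, hand-written HTML standing in for a downloaded page
html = """
<table class="wikitable">
  <tr><th>IATA</th><th>ICAO</th><th>Airline</th></tr>
  <tr><td>AA</td><td>AAA</td><td>Example Air</td></tr>
  <tr><td>BB</td><td>BBB</td><td>Sample Airways</td></tr>
</table>
"""

# Parse the HTML and locate the table by its CSS class
soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# Collect the text of every data cell, skipping the header row
rows = []
for tr in table.find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

# Organize the extracted values into rows and columns
demo = pd.DataFrame(rows, columns=["iata", "icao", "airline_name"])
print(demo)
```

Working on an inline string keeps the example independent of the network and of any changes to the live page.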

In [3]:
import requests
import bs4
import pandas as pd

In [21]:
def get_airlines():
    """Scrape airline codes from Wikipedia and return them as a DataFrame."""

    # URL of the Wikipedia page
    url = "https://en.wikipedia.org/wiki/List_of_airline_codes"

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page

    soup = bs4.BeautifulSoup(response.text, 'html.parser')

    # Find the table that contains the airline codes
    iata_table = soup.find('table', {'class': 'wikitable'})

    # Initialize a list to store the table data
    iata_data = []

    # Iterate over the table rows
    for row in iata_table.find_all('tr')[1:]:  # Skip the first row, which contains the headers

        # Get all the cells in the row
        cells = row.find_all('td')

        # Five cells are read below (indexes 0 to 4), so require at least five
        if len(cells) >= 5:

            # Get the IATA and ICAO codes and check that neither is empty
            iata_code = cells[0].get_text(strip=True)
            iaco = cells[1].get_text(strip=True)

            if (iata_code and iata_code != "n/a") and (iaco and iaco != "n/a"):

                # Both codes are present, so read the remaining columns
                airline_name = cells[2].get_text(strip=True)
                call_sign = cells[3].get_text(strip=True)
                country = cells[4].get_text(strip=True)

                # Append the row to the list
                iata_data.append([iata_code, iaco, airline_name, call_sign, country])

    return pd.DataFrame(iata_data, columns=["iata", "iaco", "airline_name", "call_sign", "country"])
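The function above only drops rows whose codes are empty or "n/a". A possible extra check, sketched below, is to verify the shape of each code after scraping. It assumes that IATA airline designators are two letters or digits and that ICAO airline designators are three letters. The helper name `has_valid_codes` and the sample rows are hypothetical, and the column names follow the ones used in this notebook.

```python
import pandas as pd

def has_valid_codes(df):
    """Return a boolean Series that is True where both codes look well formed."""
    # Assumption: IATA airline codes are exactly two letters or digits
    iata_ok = df["iata"].str.fullmatch(r"[A-Z0-9]{2}", na=False)
    # Assumption: ICAO airline codes are exactly three letters
    icao_ok = df["iaco"].str.fullmatch(r"[A-Z]{3}", na=False)
    return iata_ok & icao_ok

# Made-up rows: the last two have malformed codes
sample = pd.DataFrame({
    "iata": ["2T", "Q5", "n/a", "ABC"],
    "iaco": ["TBS", "MLA", "XYZ", "AB"],
})

mask = has_valid_codes(sample)
clean = sample[mask]
print(clean)
```

Rows that fail the check could be inspected by hand rather than discarded, since the source table may contain footnote markers or other annotations.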

In [22]:
data = get_airlines()

In [24]:
data.head()

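|   | iata | iaco | airline_name | call_sign | country |
|---|------|------|--------------|-----------|---------|
| 0 | PR | BOI | 2GO | ABAIR | Philippines |
| 1 | 2T | TBS | Timbis Air | TIMBIS | Kenya |
| 2 | 3F | FIE | FlyOne Armenia | ARMRIDER | Armenia |
| 3 | Q5 | MLA | 40-Mile Air | MILE-AIR | United States |
| 4 | 4D | ASD | Air Sinai | AIR SINAI | Egypt |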


In [25]:
data.to_csv("Airlines.csv", index=False)  # index=False keeps the row index out of the file
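By default, `DataFrame.to_csv` also writes the row index, and reading such a file back produces an extra `Unnamed: 0` column. The sketch below shows the difference using an in-memory buffer instead of a file on disk. The `roundtrip` helper and the one-row `sample` frame are illustrative only.

```python
import io
import pandas as pd

# One-row frame with the same columns as the scraped data
sample = pd.DataFrame(
    [["2T", "TBS", "Timbis Air", "TIMBIS", "Kenya"]],
    columns=["iata", "iaco", "airline_name", "call_sign", "country"],
)

def roundtrip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = roundtrip(sample)                   # index written as an unnamed first column
without_index = roundtrip(sample, index=False)   # only the data columns are written

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` is usually the simpler choice when the index is just the default row counter and the file is meant to be loaded into a database.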