# Airport data parsing and database construction


Wikipedia airport pages are used as the source for building a complete route network; the traffic on each route is not known at this stage.  
The first step is to list every airport served by commercial airlines that is referenced on Wikipedia.
A community-based list of airports served is available for each continent:  
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_North_America  
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_South_America  
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Europe  
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Africa  
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Asia  
- https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Oceania  

Note that the content of these lists is continuously modified by the community. The version used in this notebook dates from April 2023.

The related URLs of airport Wikipedia pages are retrieved by parsing the list using an HTML analysis library (BeautifulSoup).

In [2]:
# Import necessary libraries

import numpy as np
import requests
from requests.adapters import HTTPAdapter, Retry
from bs4 import BeautifulSoup
from string import Template
import pandas as pd
from tqdm.notebook import tqdm

## Getting a list of airports

In [2]:
# Define the URL for the list of airports in all continents
url_nam = "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_North_America"
url_sam = "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_South_America"
url_eu = "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Europe"
url_af = "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Africa"
url_as = "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Asia"
url_oc = "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Aviation/Airline_destination_lists:_Oceania"

# url_list = [url_nam, url_sam, url_eu, url_af, url_as, url_oc]

### Process done iteratively, for each continent
url = url_oc

In [3]:
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Find all the unordered lists ('ul') on the page
ul_lists = soup.find_all("ul")

# Extract the links to airport pages from the unordered lists
airport_links = []
for ul in ul_lists:
    for link in ul.find_all("a"):
        href = link.get("href")
        if href and "/wiki/" in href and "Airport" in link.text:
            link = ("https://en.wikipedia.org" + href, link.text)
            airport_links.append(link)


# airport_links_nam = list(dict.fromkeys(airport_links))
# airport_links_sam = list(dict.fromkeys(airport_links))
# airport_links_eu = list(dict.fromkeys(airport_links))
# airport_links_af = list(dict.fromkeys(airport_links))
# airport_links_as = list(dict.fromkeys(airport_links))
airport_links_oc = list(dict.fromkeys(airport_links))

print(
    "Maximal parsing time estimated (if 0.5s/query) :{} s for {} airports in the file".format(
        0.5 * len(airport_links), len(airport_links)
    )
)

Maximal parsing time estimated (if 0.5s/query) :292.0 s for 584 airports in the file
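
Note that `dict.fromkeys(airport_links)` removes duplicates of the whole `(url, link text)` pair, so the same URL can survive twice when it is linked under two different texts. The sketch below shows one possible way to merge several per-continent lists while de-duplicating on the URL alone. `merge_airport_links` is a hypothetical helper that is not part of the notebook, and the shortened label `"Vanua Lava"` is made up for the illustration.

```python
def merge_airport_links(*link_lists):
    """Concatenate several (url, name) lists, keeping the first name seen for each URL."""
    merged = {}
    for links in link_lists:
        for url, name in links:
            # setdefault keeps the first name and ignores later duplicates of the URL
            merged.setdefault(url, name)
    return list(merged.items())


# Tiny stand-ins for two per-continent lists (URLs taken from this notebook).
links_a = [
    ("https://en.wikipedia.org/wiki/Qaarsut_Airport", "Qaarsut Airport"),
    ("https://en.wikipedia.org/wiki/Vanua_Lava_Airport", "Vanua Lava Airport"),
]
links_b = [
    # Same URL as above, but with a different (illustrative) link text
    ("https://en.wikipedia.org/wiki/Vanua_Lava_Airport", "Vanua Lava"),
]

all_links = merge_airport_links(links_a, links_b)
estimated_seconds = 0.5 * len(all_links)  # same 0.5 s/query assumption as above
print(len(all_links), estimated_seconds)
```

With such a merged list, the parsing loop further down would only need to run once instead of once per continent, assuming the combined run stays within acceptable request rates.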


A list of airports whose URLs were incorrectly reported in the continental airport lists is created by hand.

In [4]:
pb_airports = [
    (
        "https://en.wikipedia.org/wiki/Newcastle_International_Airport",
        "Newcastle International Airport",
    ),
    ("https://en.wikipedia.org/wiki/Qaarsut_Airport", "Qaarsut Airport"),
    (
        "https://en.wikipedia.org/wiki/San_Jos%C3%A9_Airport_(Guatemala)",
        "San José Airport (Guatemala)",
    ),
    (
        "https://en.wikipedia.org/wiki/Santa_Rosa_Airport_(Argentina)",
        "Santa Rosa Airport (Argentina)",
    ),
    (
        "https://en.wikipedia.org/wiki/Santa_Cruz_Airport_(Argentina)",
        "Santa Cruz Airport (Argentina)",
    ),
    (
        "https://en.wikipedia.org/wiki/Las_Brujas_Airport_(Colombia)",
        "Las Brujas Airport (Colombia)",
    ),
    (
        "https://en.wikipedia.org/wiki/San_Fernando_Airport_(Philippines)",
        "San Fernando Airport (Philippines)",
    ),
    (
        "https://en.wikipedia.org/wiki/Richmond_Airport_(Queensland)",
        "Richmond Airport (Queensland)",
    ),
    (
        "https://en.wikipedia.org/wiki/St_George_Airport_(Queensland)",
        "St George Airport (Queensland)",
    ),
    (
        "https://en.wikipedia.org/wiki/Obo_Airport_(Papua_New_Guinea)",
        "Obo Airport (Papua New Guinea)",
    ),
    (
        "https://en.wikipedia.org/wiki/Santa_Ana_Airport_(Solomon_Islands)",
        "Santa Ana Airport (Solomon Islands)",
    ),
    (
        "https://en.wikipedia.org/wiki/Redcliffe_Airport_(Vanuatu)",
        "Redcliffe Airport (Vanuatu)",
    ),
    ("https://en.wikipedia.org/wiki/Vanua_Lava_Airport", "Vanua Lava Airport"),
]

A list of "missing" airports, absent from the continental lists but appearing among the destinations served from other airports, is created in 03-routes_processing.ipynb and re-injected here.

In [14]:
missing_airports = pd.read_csv("data/missing_airports.csv")
# missing_airports=pd.read_csv('data/extra_airport_refs.csv')
missing_airports = [
    [missing, "nnn"] for missing in missing_airports.loc[:, "0"].to_list()
]
missing_airports

[['https://en.wikipedia.org/wiki/Dabo_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Caldas_Novas_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Prince_Abdul_Majeed_bin_Abdulaziz_International_Airport',
  'nnn'],
 ['https://en.wikipedia.org/wiki/Ramechhap_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Orcas_Island_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Codrington_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Kuk%C3%ABs_International_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Deoghar_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Santa_Maria_Airport_(Rio_Grande_do_Sul)',
  'nnn'],
 ['https://en.wikipedia.org/wiki/Robert_Atty_Bessing_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Port-Menier_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Iranshahr_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Postville_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Caye_Chapel_Airport', 'nnn'],
 ['https://en.wikipedia.org/wiki/Rocky_Mountain_Metro

## Parsing for airport related information
Two closely related data sources are used for each airport: its Wikidata item and its Wikipedia page.  
Wikidata is a structured database and is therefore a much more reliable place to look up information. A SPARQL request is sent for each airport.  
The Wikipedia page is parsed with BeautifulSoup by looking for the HTML tags that hold the relevant information, which can lead to mismatches.
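
As a reduced illustration of how a per-airport SPARQL request can be built, the sketch below fills a query template with `string.Template`, which leaves the SPARQL braces untouched. It is not the query used in this notebook: it only asks for the ICAO (P239) and IATA (P238) codes, and it binds the item with a `VALUES` clause so that every property can sit in an `OPTIONAL` block. This is one possible way to avoid a mandatory ICAO field, under the assumption that the standard `wd:` and `wdt:` prefixes of the Wikidata Query Service are available. `build_query` and `QUERY_TEMPLATE` are hypothetical names.

```python
import re
from string import Template

# Reduced query: the item is bound with VALUES, so both codes can be OPTIONAL.
QUERY_TEMPLATE = Template("""
SELECT ?item ?icao ?iata
WHERE {
  VALUES ?item { wd:$QID }
  OPTIONAL { ?item wdt:P239 ?icao. }
  OPTIONAL { ?item wdt:P238 ?iata. }
}
""")


def build_query(qid):
    """Return the SPARQL text for one Wikidata item, rejecting malformed QIDs."""
    if not re.fullmatch(r"Q[1-9]\d*", str(qid)):
        raise ValueError("not a Wikidata QID: {!r}".format(qid))
    return QUERY_TEMPLATE.substitute(QID=qid)


query = build_query("Q934833")  # Orcas Island Airport, as listed in this notebook
print(query)
```

Validating the QID before substitution means that a page without a Wikidata link fails early instead of producing a syntactically valid query that returns no rows.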

In [16]:
# Select one of the airport link lists built above, and rerun this cell once per list.

# airport_links=airport_links_nam
# airport_links=airport_links_sam
# airport_links=airport_links_eu
# airport_links=airport_links_af
# airport_links=airport_links_as
# airport_links=airport_links_oc
# airport_links=pb_airports
airport_links = missing_airports

airport_list = []
for airport_link, airport_name in tqdm(airport_links):

    if airport_link in [
        "https://en.wikipedia.org/wiki/Lumbia_Airport",
        "https://en.wikipedia.org/wiki/Woomera_Airport",
        "https://en.wikipedia.org/wiki/Saadani_National_Park",
    ]:
        continue

    # Open the Wikipedia page of the airport and parse it into a soup object.
    response = requests.get(airport_link)
    soup = BeautifulSoup(response.content, "html.parser")

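    # Find the Wikidata item (QID) of the airport on its Wikipedia page.
    # If the page has no Wikidata link, QNumber stays nan and the query below returns no entry.
    QNumber = np.nan
    wikibase_item = soup.find("li", {"id": "t-wikibase"})
    if wikibase_item is not None and wikibase_item.a is not None:
        QNumber = wikibase_item.a["href"].rsplit("/")[-1]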

    ################# AIRPORT WIKIDATA SPARQL QUERY SECTION #################

    url = "https://query.wikidata.org/sparql"

    # Define the SPARQL request. Use https://query.wikidata.org/ to test or modify the query.
    # Only the ICAO identifier is mandatory (I could not work out how to make every field optional, not being proficient in SPARQL); all other requested fields are optional.
    query_string = """# Scroll down and hit blue arrow down to run and see the results + the sources
    SELECT ?item ?itemLabel ?icao ?iata (MAX(?population) AS ?max_population) (MAX(?lon) as ?max_lon) (MAX(?lat) as ?max_lat) (MAX(?alt_in_m) as ?max_alt_in_m) (MAX(?length_in_m) AS ?max_rwy_lenght_m) (MAX(?passengers19) AS ?max_passengers19) (MAX(?passengers2) AS ?maxpax)  (SAMPLE(COALESCE(?reference_URL, ?reference_URL2)) AS ?sample_reference_URL)
    WHERE
    { 
      FILTER ( ?item = <http://www.wikidata.org/entity/$QID> )
      ?item wdt:P239 ?icao.
      OPTIONAL{?item wdt:P238 ?iata.}
      OPTIONAL{ ?item p:P625 ?coordinate.
                 ?coordinate ps:P625 ?coord.
                 ?coordinate psv:P625 ?coordinate_node.
                 ?coordinate_node wikibase:geoLongitude ?lon.
                 ?coordinate_node wikibase:geoLatitude ?lat.
              }
      OPTIONAL { ?item p:P2044 ?elevation .
                ?elevation psn:P2044 ?elevationnode.
                ?elevationnode  wikibase:quantityAmount     ?altitude.
                ?elevationnode wikibase:quantityUnit ?alt_unit.
                # conversion to SI unit
                ?alt_unit p:P2370/psv:P2370 [                # conversion to SI unit
                wikibase:quantityAmount ?alt_conversion;
                wikibase:quantityUnit wd:Q11573;      # meter
      ]
      BIND(?altitude * ?alt_conversion AS ?alt_in_m).
               }
      OPTIONAL {?item p:P529 ?runway.
      ?runway  ps:P529 ?number.
      ?runway pqv:P2043 ?valuenode.
      ?valuenode     wikibase:quantityAmount     ?length.
      ?valuenode     wikibase:quantityUnit       ?unit.
      # conversion to SI unit
      ?unit p:P2370/psv:P2370 [                # conversion to SI unit
         wikibase:quantityAmount ?conversion;
         wikibase:quantityUnit wd:Q11573;      # meter
      ]
      BIND(?length * ?conversion AS ?length_in_m).}




    #   ?stmnode       psv:P2043                   ?valuenode.
    #   ?valuenode     wikibase:quantityAmount     ?length2.

      OPTIONAL {?item wdt:P931 ?city.
      ?city wdt:P1082 ?population.}

      OPTIONAL {    
        ?item p:P3872 ?statement.
        ?statement pqv:P585 ?timevalue;
                   ps:P3872 ?passengers19.
        ?timevalue wikibase:timeValue ?date.
        ?timevalue wikibase:timePrecision 9 
        OPTIONAL { ?statement pq:P518 ?applies. }
        OPTIONAL { ?statement prov:wasDerivedFrom / (pr:P854|pr:P4656) ?reference_URL. }
        FILTER (BOUND(?applies)=false || ?applies = wd:Q2165236 )
        MINUS { ?statement wikibase:rank wikibase:DeprecatedRank }
        BIND (YEAR(?date) AS ?year)
        FILTER (?year = 2019)
      }
      OPTIONAL {    
        ?item p:P3872 ?statement2.
        ?statement2 pqv:P585 ?timevalue2;
                   ps:P3872 ?passengers2.
        ?timevalue2 wikibase:timeValue ?date2.
        ?timevalue2 wikibase:timePrecision 9 
        OPTIONAL { ?statement2 pq:P518 ?applies2. }
        OPTIONAL { ?statement2 prov:wasDerivedFrom / (pr:P854|pr:P4656) ?reference_URL2. }
        FILTER (BOUND(?applies2)=false || ?applies2 = wd:Q2165236 )
        MINUS { ?statement2 wikibase:rank wikibase:DeprecatedRank }
        BIND (YEAR(?date2) AS ?year2)
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } 
    } GROUP BY ?item ?itemLabel ?icao ?iata """

    # Template lets $ mark the substitution point in the string, since {} is already used by SPARQL syntax.
    query_string = Template(query_string)
    query_string = query_string.substitute({"QID": QNumber})

    # Prepare the request. Sending a User-Agent seems to be enough to avoid being rejected by Wikidata for too frequent requests (error 429).

    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "AirProjectBot/0.0 (antoine732@hotmail.fr)",
    }
    params = {"query": query_string, "format": "json", "no-headers": "1"}

    # If the previous precaution is not sufficient, the request is retried at most 5 times, after waiting for an adequate period.
    # If it still fails, or if any other problem occurs (timeout, ...), nan is returned. Airports with nan are handled afterwards.

    try:
        session = requests.Session()
        retries = Retry(
            total=5,
            backoff_factor=1,
            status_forcelist=[429],
            respect_retry_after_header=True,
        )
        session.mount("https://", HTTPAdapter(max_retries=retries))
        response = session.get(url, headers=headers, params=params)
        query_result = response.json()
    except Exception:
        query_result = np.nan

    # Initialise a dictionary of nan values to store the request results.
    # This step handles the case of fields missing from the request response.
    airport_results = {
        "item": np.nan,
        "itemLabel": np.nan,
        "icao": np.nan,
        "iata": np.nan,
        "max_population": np.nan,
        "max_lon": np.nan,
        "max_lat": np.nan,
        "max_alt_in_m": np.nan,
        "max_rwy_lenght_m": np.nan,
        "max_passengers19": np.nan,
        "maxpax": np.nan,
        "sample_reference_URL": np.nan,
    }

    # Insert the request results into the dictionary.

    if query_result is not np.nan:
        if len(query_result["results"]["bindings"]) == 0:
            print("Airport {} query ok but return no entry".format(airport_link))
        else:
            for key, value in query_result["results"]["bindings"][0].items():
                airport_results[key] = value["value"]
    else:
        print("Airport {} query not ok".format(airport_link))

    # Check that the query returned an item; otherwise every field remains nan for later handling.
    wdta_item = airport_results["item"]
    if wdta_item is not np.nan:
        wdta_item = wdta_item.rsplit("/")[-1]
    wdta_itemLabel = airport_results["itemLabel"]
    wdta_icao = airport_results["icao"]
    wdta_iata = airport_results["iata"]
    wdta_max_population = airport_results["max_population"]
    wdta_lon = airport_results["max_lon"]
    wdta_lat = airport_results["max_lat"]
    wdta_alt_in_m = airport_results["max_alt_in_m"]
    wdta_max_rwy_lenght_m = airport_results["max_rwy_lenght_m"]
    wdta_max_passengers19 = airport_results["max_passengers19"]
    wdta_maxpax = airport_results["maxpax"]
    wdta_sample_reference_URL = airport_results["sample_reference_URL"]

    ################# AIRPORT WIKIPEDIA PAGE PARSING SECTION #################

    wdpa_link = soup.find("link", {"rel": "canonical"}).get("href")
    wpda_iata = np.nan
    wpda_icao = np.nan
    wpda_passengers = np.nan
    wpda_movements = np.nan
    wpda_year_data = np.nan

    # Find the airport information card
    vcard = soup.find("table", class_="infobox vcard")

    if vcard is not None:
        # Extract the relevant information from the vcard when available.
        # "nickname" is the HTML class of the spans holding the IATA and ICAO codes on Wikipedia pages.

        if vcard.find("td", class_="infobox-full-data").find("div", class_="hlist"):
            for nickname in (
                vcard.find("td", class_="infobox-full-data")
                .find("div", class_="hlist")
                .find_all("span", class_="nowrap")
            ):
                if nickname is not None:
                    if nickname.a.text == "IATA" and nickname.span is not None:
                        wpda_iata = nickname.find("span", class_="nickname").text
                    if nickname.a.text == "ICAO" and nickname.span is not None:
                        wpda_icao = nickname.find("span", class_="nickname").text

        # Extract the traffic statistics, which could be useful. Note that this method is not very robust:
        # because Wikipedia is collaborative, many other keywords may be used in the infobox.
        for th in vcard.find_all("th"):
            if (
                ("passenger" in th.text.lower() or "passengers" in th.text.lower())
                and not (
                    "movements" in th.text.lower() or "operations" in th.text.lower()
                )
                and "%" not in th.parent.select_one("td").text.split()[0]
            ):
                wpda_passengers = th.parent.select_one("td").text.split()[0]
            if (
                ("movements" in th.text.lower() or "operations" in th.text.lower())
                and not (
                    "passenger" in th.text.lower() or "passengers" in th.text.lower()
                )
                and "%" not in th.parent.select_one("td").text.split()[0]
            ):
                wpda_movements = th.parent.select_one("td").text.split()[0]
            if "statistics" in th.text.lower():
                if len(th.text.lower().split()) > 1:
                    wpda_year_data = th.text.lower().split()[1]

    airport_list.append(
        [
            wdta_item,
            wdta_itemLabel,
            wdta_icao,
            wdta_iata,
            wdta_max_population,
            wdta_lon,
            wdta_lat,
            wdta_alt_in_m,
            wdta_max_rwy_lenght_m,
            wdta_max_passengers19,
            wdta_maxpax,
            wdta_sample_reference_URL,
            wdpa_link,
            wpda_iata,
            wpda_icao,
            wpda_passengers,
            wpda_movements,
            wpda_year_data,
        ]
    )
    print(wdpa_link)

airport_df = pd.DataFrame(
    airport_list,
    columns=[
        "QID",
        "itemLabel",
        "icao",
        "iata",
        "max_population",
        "lon",
        "lat",
        "alt_in_m",
        "max_rwy_lenght_m",
        "max_passengers19",
        "maxpax",
        "traffic_source_url",
        "wdpa_link",
        "wpda_iata",
        "wpda_icao",
        "wpda_passengers",
        "wpda_movements",
        "wpda_year_data",
    ],
)

  0%|          | 0/129 [00:00<?, ?it/s]

https://en.wikipedia.org/wiki/Dabo_Airport
https://en.wikipedia.org/wiki/Caldas_Novas_Airport
https://en.wikipedia.org/wiki/Prince_Abdul_Majeed_bin_Abdulaziz_International_Airport
https://en.wikipedia.org/wiki/Ramechhap_Airport
https://en.wikipedia.org/wiki/Orcas_Island_Airport
https://en.wikipedia.org/wiki/Barbuda_Codrington_Airport
https://en.wikipedia.org/wiki/Kuk%C3%ABs_International_Airport_Zayed
Airport https://en.wikipedia.org/wiki/Deoghar_Airport query ok but return no entry
https://en.wikipedia.org/wiki/Deoghar_Airport
https://en.wikipedia.org/wiki/Santa_Maria_Airport_(Rio_Grande_do_Sul)
https://en.wikipedia.org/wiki/Robert_Atty_Bessing_Airport
https://en.wikipedia.org/wiki/Port-Menier_Airport
https://en.wikipedia.org/wiki/Iranshahr_Airport
Airport https://en.wikipedia.org/wiki/Postville_Airport query ok but return no entry
https://en.wikipedia.org/wiki/Postville_Airport
https://en.wikipedia.org/wiki/Caye_Chapel_Airport
https://en.wikipedia.org/wiki/Rocky_Mountain_Metropolitan
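
The result-handling step of the cell above (pre-filling every field with NaN, then overwriting the fields present in the first binding) can be isolated into a small function, which makes it easy to check without sending any request. This is a minimal sketch, assuming the standard SPARQL JSON results layout (`results` then `bindings`, each cell carrying a `value`). `flatten_bindings` and `FIELDS` are hypothetical names, and the sample response is hand-written from values shown in this notebook.

```python
import math

# Field names returned by the query in this notebook.
FIELDS = [
    "item", "itemLabel", "icao", "iata", "max_population", "max_lon", "max_lat",
    "max_alt_in_m", "max_rwy_lenght_m", "max_passengers19", "maxpax",
    "sample_reference_URL",
]


def flatten_bindings(query_result, fields):
    """Return {field: value} for the first binding, with NaN for missing fields."""
    row = {field: math.nan for field in fields}
    if not isinstance(query_result, dict):
        return row  # failed request (NaN placeholder): keep every field as NaN
    bindings = query_result.get("results", {}).get("bindings", [])
    if bindings:
        for key, cell in bindings[0].items():
            if key in row:
                row[key] = cell["value"]
    return row


# Hand-written response containing only three of the fields.
sample_response = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q934833"},
                "icao": {"type": "literal", "value": "KORS"},
                "iata": {"type": "literal", "value": "ESD"},
            }
        ]
    }
}

row = flatten_bindings(sample_response, FIELDS)
print(row["icao"], row["iata"], row["max_population"])
```

All values come back as strings, so numeric columns such as the coordinates still need an explicit conversion before any computation.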

In [17]:
airport_df

Unnamed: 0,QID,itemLabel,icao,iata,max_population,lon,lat,alt_in_m,max_rwy_lenght_m,max_passengers19,maxpax,traffic_source_url,wdpa_link,wpda_iata,wpda_icao,wpda_passengers,wpda_movements,wpda_year_data
0,Q12474045,Dabo Singkep Airport,WIDS,SIQ,,104.579,-0.4788,17,,,,,https://en.wikipedia.org/wiki/Dabo_Airport,SIQ,WIDS,10486,1205,(2016)
1,Q3813543,Caldas Novas Airport,SBCN,CLV,95183,-48.61,-17.724722222222,685,2100,136214,149480,https://www.anac.gov.br/assuntos/dados-e-estat...,https://en.wikipedia.org/wiki/Caldas_Novas_Air...,CLV,SBCN,,,
2,Q4120431,Prince Abdul Majeed bin Abdul Aziz Domestic Ai...,OEAO,ULH,32413,38.11694444,26.48333333,624,,61138,78337,https://gaca.gov.sa/web/ar-sa/media/airnav2019,https://en.wikipedia.org/wiki/Prince_Abdul_Maj...,ULH,OEAO,,,
3,Q8214364,Ramechhap Airport,VNRC,RHP,5222,86.061388888889,27.393888888889,474,,,,,https://en.wikipedia.org/wiki/Ramechhap_Airport,RHP,VNRC,,,
4,Q934833,Orcas Island Airport,KORS,ESD,,-122.911,48.7081,9.4488186,884.2248,18410,20202,https://www.faa.gov/airports/planning_capacity...,https://en.wikipedia.org/wiki/Orcas_Island_Air...,ESD,KORS,,41800,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123,Q3912807,Puvirnituq Airport,CYPX,YPX,,-77.2875,60.052222222222,25.2984498,,,,,https://en.wikipedia.org/wiki/Puvirnituq_Airport,YPX,CYPX,,5802,(2010)
124,Q7489162,Shaoguan Guitou Airport,ZGSG,HSC,2855131,113.42111111,24.97861111,,,,102348,http://www.caac.gov.cn/XXGK/XXGK/TJSJ/202303/P...,https://en.wikipedia.org/wiki/Shaoguan_Danxia_...,HSC,ZGSG,9423,148,(2021)
125,Q4505797,Qinhuangdao Beidaihe Airport,ZBDH,BPE,3146300,119.057557,39.662501,14,,506522,506522,https://zh.wikipedia.org/wiki/%E4%B8%AD%E5%8D%...,https://en.wikipedia.org/wiki/Qinhuangdao_Beid...,BPE,ZBDH,217642,8352,(2021)
126,Q2876019,New Castle Airport,KILG,ILG,70898,-75.6067,39.6786,24.384048,2217.42,1114,2086,https://www.faa.gov/airports/planning_capacity...,https://en.wikipedia.org/wiki/Wilmington_Airpo...,ILG,KILG,52456,48024,(2019)
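
Since the Wikipedia infobox is parsed by HTML tags, its codes can disagree with those returned by Wikidata. A quick consistency check between the two sources might look like the sketch below. `flag_code_mismatches` is a hypothetical helper, the column names are those of `airport_df`, and the third sample row is a made-up placeholder used only to trigger a mismatch.

```python
import pandas as pd


def flag_code_mismatches(df):
    """Return the rows where Wikidata and Wikipedia disagree on the IATA or ICAO code."""
    # Only compare rows where both sources actually provide a code
    both_iata = df["iata"].notna() & df["wpda_iata"].notna()
    both_icao = df["icao"].notna() & df["wpda_icao"].notna()
    iata_differs = both_iata & (df["iata"] != df["wpda_iata"])
    icao_differs = both_icao & (df["icao"] != df["wpda_icao"])
    return df[iata_differs | icao_differs]


sample_df = pd.DataFrame(
    {
        "QID": ["Q934833", "Q8214364", "Q0"],  # "Q0" is a placeholder, not a real airport
        "icao": ["KORS", "VNRC", "XXXX"],
        "iata": ["ESD", "RHP", "AAA"],
        "wpda_icao": ["KORS", "VNRC", "XXXY"],
        "wpda_iata": ["ESD", None, "AAA"],
    }
)

mismatches = flag_code_mismatches(sample_df)
print(mismatches["QID"].tolist())
```

Rows where one source has no code at all are deliberately not flagged here; they would need a separate pass.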


In [18]:
# Run this cell to save the result
# airport_df.to_csv('data/missing_arpt_v2.csv')
airport_df.to_csv("data/missing_arpt_extra.csv")