# Appending Details About Airports

In this project, I merge several dataframes into a single dataframe that holds all of the information I need about airports. This dataset may appear in a future project of mine that handles graph data for flights (and the airports they fly between).

I want to thank the folks who provided the data. Links to the data sources are as follows:
1) https://www.kaggle.com/giovamata/airlinedelaycauses

2) https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States

3) https://data.humdata.org/dataset/ourairports-usa

The first block includes all of the necessary library imports. The second block imports data from the Wikipedia source and prints out the first 12 samples.

In [1]:
import pandas as pd

In [2]:
wiki_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States', attrs= {'class' : 'wikitable sortable'})
airport_details = wiki_data[0]

print(airport_details.head(12), '\n\n')

          City  FAA IATA  ICAO  \
0      ALABAMA  NaN  NaN   NaN   
1   Birmingham  BHM  BHM  KBHM   
2       Dothan  DHN  DHN  KDHN   
3   Huntsville  HSV  HSV  KHSV   
4       Mobile  MOB  MOB  KMOB   
5   Montgomery  MGM  MGM  KMGM   
6       ALASKA  NaN  NaN   NaN   
7    Anchorage  LHD  NaN  PALH   
8    Anchorage  MRI  MRI  PAMR   
9    Anchorage  ANC  ANC  PANC   
10       Aniak  ANI  ANI  PANI   
11      Bethel  BET  BET  PABE   

                                              Airport Role  Enplanements  
0                                                 NaN  NaN           NaN  
1      Birmingham–Shuttlesworth International Airport  P-S     1457562.0  
2                             Dothan Regional Airport  P-N       52855.0  
3   Huntsville International Airport (Carl T. Jone...  P-S      580932.0  
4                             Mobile Regional Airport  P-N      297544.0  
5        Montgomery Regional Airport (Dannelly Field)  P-N      170544.0  
6                               

Here, I read the origin and destination columns from the flights file and combine them into a single sorted list of unique airport codes, stored as a one-column dataframe named `airport_code`. I then call `info()` to confirm the row count and that no values are missing.

In [3]:
data = pd.read_csv('DelayedFlights (For Practicing with Graph Databases).csv', usecols=['Origin', 'Dest'])
# Stack origins and destinations into one Series (Series.append was removed in pandas 2.0)
airports = pd.concat([data['Origin'], data['Dest']], ignore_index=True)
# Keep each airport code once, in alphabetical order
airports = pd.DataFrame({'airport_code': sorted(set(airports))})
airports.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   airport_code  298 non-null    object
dtypes: object(1)
memory usage: 2.5+ KB


In [4]:
airports.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   airport_code  298 non-null    object
dtypes: object(1)
memory usage: 2.5+ KB
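
As a small illustration of the de-duplication step, here is a sketch on a toy frame. The four flights below are made up for the example; only the column names `Origin` and `Dest` come from the real file.

```python
import pandas as pd

# Toy stand-in for the Origin/Dest columns of the flights file (illustrative values).
flights = pd.DataFrame({
    "Origin": ["ATL", "ORD", "ATL", "DEN"],
    "Dest":   ["ORD", "DEN", "LAX", "ATL"],
})

# Stack both columns into one Series, then keep the sorted unique codes.
codes = pd.concat([flights["Origin"], flights["Dest"]], ignore_index=True)
airports = pd.DataFrame({"airport_code": sorted(codes.unique())})

print(airports)
```

Each code appears once no matter how many flights use it as an origin or a destination.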


Here, I merge the two dataframes. `DataFrame.join` matches on the index by default, so I use `merge` instead, which lets me name the key column on each side.

In [5]:
# Left merge: keep every airport code, even if Wikipedia has no matching FAA code
all_data = pd.merge(left=airports, right=airport_details, how='left', left_on='airport_code', right_on='FAA')
all_data.drop(columns=['FAA', 'Role', 'Enplanements'], inplace=True)
all_data

Unnamed: 0,airport_code,City,IATA,ICAO,Airport
0,ABE,Allentown,ABE,KABE,Lehigh Valley International Airport (was Allen...
1,ABI,Abilene,ABI,KABI,Abilene Regional Airport
2,ABQ,Albuquerque,ABQ,KABQ,Albuquerque International Sunport
3,ABY,Albany,ABY,KABY,Southwest Georgia Regional Airport
4,ACK,Nantucket,ACK,KACK,Nantucket Memorial Airport
...,...,...,...,...,...
293,WYS,West Yellowstone,WYS,KWYS,Yellowstone Airport
294,XNA,Fayetteville,XNA,KXNA,Northwest Arkansas National Airport
295,YAK,Yakutat,YAK,PAYA,Yakutat Airport (also see Yakutat Seaplane Base)
296,YKM,Yakima,YKM,KYKM,Yakima Air Terminal (McAllister Field)


In [6]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 298 entries, 0 to 297
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   airport_code  298 non-null    object
 1   City          280 non-null    object
 2   IATA          280 non-null    object
 3   ICAO          280 non-null    object
 4   Airport       280 non-null    object
dtypes: object(5)
memory usage: 14.0+ KB
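
One way to see which codes found no match is the `indicator` option of `pd.merge`, which adds a `_merge` column describing where each row came from. This is only a sketch on toy data: the code `ZZZ` is made up, and the real notebook does not use `indicator`.

```python
import pandas as pd

airports = pd.DataFrame({"airport_code": ["BHM", "DHN", "ZZZ"]})  # "ZZZ" is a made-up code
details = pd.DataFrame({
    "FAA": ["BHM", "DHN", "HSV"],
    "City": ["Birmingham", "Dothan", "Huntsville"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(airports, details, how="left",
                  left_on="airport_code", right_on="FAA", indicator=True)

# Rows marked "left_only" are codes with no match in the details table.
unmatched = merged.loc[merged["_merge"] == "left_only", "airport_code"].tolist()
print(unmatched)
```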


After the merge, 18 rows have null values, presumably because the Wikipedia table has no entry matching those codes. First, I list the affected airport codes. Then, in the next block, I fill in the values that I looked up manually. At the end, I call `info()` again to confirm that the updates were saved to the dataframe.

In [7]:
all_null_data = all_data[all_data.isnull().any(axis=1)]
all_null_data['airport_code']

8      ADK
54     CEC
58     CIC
60     CLD
75     CYS
103    FCA
127    HHH
141    IPL
144    IYK
190    MOD
192    MQT
211    OXR
214    PFN
222    PMD
251    SCE
263    SLE
277    TEX
297    YUM
Name: airport_code, dtype: object

In [8]:
all_data.loc[251,['City', 'IATA', 'ICAO', 'Airport']] = ['Benner Township', 'SCE', 'KUNV', 'University Park Airport']
all_data.loc[103,['City', 'IATA', 'ICAO', 'Airport']] = ['Kalispell', 'FCA', 'KGPI', 'Glacier Park International Airport']
all_data.loc[127,['City', 'IATA', 'ICAO', 'Airport']] = ['Hilton Head Island', 'HHH', 'KHXD', 'Hilton Head Airport']
all_data.loc[141,['City', 'IATA', 'ICAO', 'Airport']] = ['Imperial County', 'IPL', 'KIPL', 'Imperial County Airport']
all_data.loc[144,['City', 'IATA', 'ICAO', 'Airport']] = ['Inyokern', 'IYK', 'KIYK', 'Inyokern Airport']
all_data.loc[190,['City', 'IATA', 'ICAO', 'Airport']] = ['Modesto', 'MOD', 'KMOD', 'Modesto City–County Airport']
all_data.loc[192,['City', 'IATA', 'ICAO', 'Airport']] = ['Gwinn', 'MQT', 'KSAW', 'Sawyer International Airport']
all_data.loc[211,['City', 'IATA', 'ICAO', 'Airport']] = ['Oxnard', 'OXR', 'KOXR', 'Oxnard Airport']
all_data.loc[214,['City', 'IATA', 'ICAO', 'Airport']] = ['Panama City', 'ECP', 'KECP', 'Northwest Florida Beaches International Airport']
all_data.loc[222,['City', 'IATA', 'ICAO', 'Airport']] = ['Palmdale', 'PMD', 'KPMD', 'Palmdale Regional Airport']
all_data.loc[263,['City', 'IATA', 'ICAO', 'Airport']] = ['Salem', 'SLE', 'KSLE', 'McNary Field (Salem Municipal Airport)']
all_data.loc[277,['City', 'IATA', 'ICAO', 'Airport']] = ['Telluride', 'TEX', 'KTEX', 'Telluride Regional Airport']
all_data.loc[297,['City', 'IATA', 'ICAO', 'Airport']] = ['Yuma', 'YUM', 'KNYL', 'Yuma International Airport']
all_data.loc[54,['City', 'IATA', 'ICAO', 'Airport']] = ['Crescent City', 'CEC', 'KCEC', 'Del Norte County Regional Airport']
all_data.loc[58,['City', 'IATA', 'ICAO', 'Airport']] = ['Chico', 'CIC', 'KCIC', 'Chico Municipal Airport']
all_data.loc[60,['City', 'IATA', 'ICAO', 'Airport']] = ['Carlsbad', 'CLD', 'KCRQ', 'McClellan-Palomar Airport']
all_data.loc[75,['City', 'IATA', 'ICAO', 'Airport']] = ['Cheyenne', 'CYS', 'KCYS', 'Cheyenne Regional Airport']
all_data.loc[8,['City', 'IATA', 'ICAO', 'Airport']] = ['Adak', 'ADK', 'PADK', 'Adak Airport']
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 298 entries, 0 to 297
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   airport_code  298 non-null    object
 1   City          298 non-null    object
 2   IATA          298 non-null    object
 3   ICAO          298 non-null    object
 4   Airport       298 non-null    object
dtypes: object(5)
memory usage: 22.1+ KB
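
The manual fixes above address rows by their integer index, which only works while the row order stays the same. A possible alternative, sketched here on a three-row toy frame, is to key the fixes by airport code. The names `manual_fixes` and `remaining_nulls` are hypothetical; the two airports shown reuse values from the block above.

```python
import pandas as pd

all_data = pd.DataFrame({
    "airport_code": ["ADK", "BHM", "CYS"],
    "City": [None, "Birmingham", None],
    "ICAO": [None, "KBHM", None],
})

# Manually researched values, keyed by airport code instead of row number.
manual_fixes = {
    "ADK": {"City": "Adak", "ICAO": "PADK"},
    "CYS": {"City": "Cheyenne", "ICAO": "KCYS"},
}

for code, values in manual_fixes.items():
    mask = all_data["airport_code"] == code  # select the row(s) for this code
    for column, value in values.items():
        all_data.loc[mask, column] = value

# Count the null cells that are left after applying the fixes.
remaining_nulls = int(all_data.isnull().sum().sum())
print(remaining_nulls)
```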


Near the end of this project, I came across additional data (coordinates and elevation) that could be useful later, so I loaded it as well. In this block, I also filter out the airport types and columns that I don't need, which keeps the later merge small. At the end of this block and in the following ones, I use a few inspection methods to check that the data is complete and accurate.

In [9]:
coordinates = pd.read_csv('us-airports.csv')

for x in coordinates['type'].unique():
    print(x)

type_filter = ['large_airport', 'medium_airport', 'small_airport', 'closed']
coordinates = coordinates[coordinates['type'].isin(type_filter)]

coordinates.drop(columns=['id', 'type', 'iso_country', 'continent', 'country_name', 'scheduled_service',
                          'score', 'last_updated', 'iso_region', 'region_name', 'home_link', 'name',
                          'wikipedia_link', 'keywords'], inplace=True)

coordinates.describe(include='all')

large_airport
medium_airport
small_airport
closed
seaplane_base
heliport
balloonport


Unnamed: 0,ident,latitude_deg,longitude_deg,elevation_ft,local_region,municipality,gps_code,iata_code,local_code
count,20891,20891.0,20891.0,19537.0,20891,20837,13895,1938,14558
unique,20891,,,,51,8275,13845,1935,14479
top,KLAX,,,,TX,Houston,29KY,CLG,28PA
freq,1,,,,2788,57,2,2,2
mean,,38.619007,-97.764234,1338.584276,,,,,
std,,7.087849,18.664144,1571.560061,,,,,
min,,18.9163,-178.65638,-223.0,,,,,
25%,,33.52445,-108.374096,345.0,,,,,
50%,,38.440899,-95.300301,820.0,,,,,
75%,,42.30984,-85.049103,1475.0,,,,,
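
The filter-then-drop step can also be written as a single selection that names the columns to keep rather than the ones to drop. The sketch below uses a tiny made-up frame (the heliport row and its coordinates are illustrative), and `keep_types` / `keep_columns` are hypothetical names.

```python
import pandas as pd

coordinates = pd.DataFrame({
    "ident": ["KLAX", "XHEL", "KORD"],            # "XHEL" is a made-up heliport
    "type": ["large_airport", "heliport", "large_airport"],
    "latitude_deg": [33.942501, 40.0, 41.9786],
    "longitude_deg": [-118.407997, -75.0, -87.9048],
    "iata_code": ["LAX", None, "ORD"],
    "keywords": [None, None, None],
})

keep_types = ["large_airport", "medium_airport", "small_airport", "closed"]
keep_columns = ["ident", "latitude_deg", "longitude_deg", "iata_code"]

# Row filter and column selection in one .loc call.
filtered = coordinates.loc[coordinates["type"].isin(keep_types), keep_columns]
print(filtered)
```

Listing the columns to keep means the result does not change if the source file later gains extra columns.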


In [10]:
coordinates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20891 entries, 0 to 29161
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ident          20891 non-null  object 
 1   latitude_deg   20891 non-null  float64
 2   longitude_deg  20891 non-null  float64
 3   elevation_ft   19537 non-null  float64
 4   local_region   20891 non-null  object 
 5   municipality   20837 non-null  object 
 6   gps_code       13895 non-null  object 
 7   iata_code      1938 non-null   object 
 8   local_code     14558 non-null  object 
dtypes: float64(3), object(6)
memory usage: 1.6+ MB


In [11]:
coordinates.head(12)

Unnamed: 0,ident,latitude_deg,longitude_deg,elevation_ft,local_region,municipality,gps_code,iata_code,local_code
0,KLAX,33.942501,-118.407997,125.0,CA,Los Angeles,KLAX,LAX,LAX
1,KORD,41.9786,-87.9048,672.0,IL,Chicago,KORD,ORD,ORD
2,KJFK,40.639801,-73.7789,13.0,NY,New York,KJFK,JFK,JFK
3,KATL,33.6367,-84.428101,1026.0,GA,Atlanta,KATL,ATL,ATL
4,KSFO,37.618999,-122.375,13.0,CA,San Francisco,KSFO,SFO,SFO
5,KDFW,32.896801,-97.038002,607.0,TX,Dallas-Fort Worth,KDFW,DFW,DFW
6,KEWR,40.692501,-74.168701,18.0,NJ,New York,KEWR,EWR,EWR
7,KLAS,36.080101,-115.152,2181.0,NV,Las Vegas,KLAS,LAS,LAS
8,KMCO,28.429399,-81.308998,96.0,FL,Orlando,KMCO,MCO,MCO
9,KDEN,39.861698,-104.672997,5431.0,CO,Denver,KDEN,DEN,DEN


Next, I merge the two dataframes and drop irrelevant features. After that, I run some functions and methods to make sure that the merge completed correctly.

In [12]:
airports_with_coords = pd.merge(left=all_data, right=coordinates, how='left', left_on='IATA', right_on='iata_code')
features_to_drop = ['IATA', 'ident', 'iata_code', 'gps_code', 'local_code']
airports_with_coords.drop(columns=features_to_drop, inplace=True)

airports_with_coords.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 298 entries, 0 to 297
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   airport_code   298 non-null    object 
 1   City           298 non-null    object 
 2   ICAO           298 non-null    object 
 3   Airport        298 non-null    object 
 4   latitude_deg   293 non-null    float64
 5   longitude_deg  293 non-null    float64
 6   elevation_ft   293 non-null    float64
 7   local_region   293 non-null    object 
 8   municipality   293 non-null    object 
dtypes: float64(3), object(6)
memory usage: 23.3+ KB


In [13]:
airports_with_coords.describe(include='all')

Unnamed: 0,airport_code,City,ICAO,Airport,latitude_deg,longitude_deg,elevation_ft,local_region,municipality
count,298,298,298,298,293.0,293.0,293.0,293,293
unique,298,284,298,298,,,,50,282
top,ABE,Columbus,KABE,Lehigh Valley International Airport (was Allen...,,,,CA,New York
freq,1,3,1,1,,,,27,3
mean,,,,,38.983548,-99.295017,1204.453925,,
std,,,,,8.130953,21.243182,1769.52659,,
min,,,,,19.721399,-176.642783,-54.0,,
25%,,,,,33.562901,-112.497002,89.0,,
50%,,,,,38.805801,-93.663101,544.0,,
75%,,,,,42.932598,-83.353401,1203.0,,


In [14]:
airports_with_coords.head(12)

Unnamed: 0,airport_code,City,ICAO,Airport,latitude_deg,longitude_deg,elevation_ft,local_region,municipality
0,ABE,Allentown,KABE,Lehigh Valley International Airport (was Allen...,40.651773,-75.442797,393.0,PA,Allentown
1,ABI,Abilene,KABI,Abilene Regional Airport,32.411301,-99.6819,1791.0,TX,Abilene
2,ABQ,Albuquerque,KABQ,Albuquerque International Sunport,35.040199,-106.609001,5355.0,NM,Albuquerque
3,ABY,Albany,KABY,Southwest Georgia Regional Airport,31.532946,-84.196215,197.0,GA,Albany
4,ACK,Nantucket,KACK,Nantucket Memorial Airport,41.253101,-70.060204,47.0,MA,Nantucket
5,ACT,Waco,KACT,Waco Regional Airport,31.6113,-97.230499,516.0,TX,Waco
6,ACV,Arcata/Eureka,KACV,Arcata Airport,40.978101,-124.109,221.0,CA,Arcata/Eureka
7,ACY,Atlantic City,KACY,Atlantic City International Airport,39.4576,-74.577202,75.0,NJ,Atlantic City
8,ADK,Adak,PADK,Adak Airport,51.883564,-176.642783,18.0,AK,Adak
9,ADQ,Kodiak,PADQ,Kodiak Airport (Benny Benson State Airport),57.75,-152.494003,78.0,AK,Kodiak
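
The earlier `describe` output shows that `iata_code` is not unique in the coordinates file (`CLG` appears twice). A left merge on a duplicated key silently adds rows, so the unchanged count of 298 is worth checking. As a sketch, the `validate` argument of `pd.merge` can make that check explicit. The latitude values for `CLG` below are made up for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"IATA": ["LAX", "CLG"]})
right = pd.DataFrame({
    "iata_code": ["LAX", "CLG", "CLG"],        # duplicated key
    "latitude_deg": [33.942501, 36.0, 36.1],   # CLG values are illustrative
})

# validate="many_to_one" raises MergeError if the right-hand keys are not unique.
try:
    pd.merge(left, right, how="left", left_on="IATA", right_on="iata_code",
             validate="many_to_one")
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

# Without validation, the duplicated key produces an extra row.
unchecked = pd.merge(left, right, how="left", left_on="IATA", right_on="iata_code")
print(duplicate_keys_detected, len(unchecked))
```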


Here, I find the five airports that are still missing latitude, longitude, or elevation after the merge.

In [15]:
nulls_in_category_data = airports_with_coords[airports_with_coords.isnull().any(axis=1)]
nulls_in_category_data['airport_code']

40     BQN
225    PSE
261    SJU
271    STT
272    STX
Name: airport_code, dtype: object

Next, I manually fill in the missing data for these five airports and check that it was saved.

In [16]:
# Columns to fill, in the order the values are listed for each airport
# (latitude comes before longitude)
fill_columns = ['airport_code', 'City', 'ICAO', 'Airport', 'latitude_deg',
                'longitude_deg', 'elevation_ft', 'local_region', 'municipality']

airports_with_coords.loc[225, fill_columns] = ['PSE', 'Bo. Vayas / Bo. Sabanetas',
                                               'TJPS', 'Mercedita International Airport',
                                               18.008333, -66.563056, 28, 'PR', 'Ponce']

airports_with_coords.loc[261, fill_columns] = ['SJU', 'Carolina', 'TJSJ',
                                               'Luis Muñoz Marín International Airport',
                                               18.439167, -66.001944, 9, 'PR', 'Carolina']

airports_with_coords.loc[271, fill_columns] = ['STT', 'Saint Thomas', 'TIST',
                                               'Cyril E. King Airport', 18.337222,
                                               -64.973333, 24, 'VI', 'Saint Thomas']

airports_with_coords.loc[272, fill_columns] = ['STX', 'St. Croix',
                                               'TISX', 'Henry E. Rohlsen Airport',
                                               17.704444, -64.801667, 74,
                                               'VI', 'St. Croix']

airports_with_coords.loc[40, fill_columns] = ['BQN', 'Aguadilla',
                                              'TJBQ',
                                              'Rafael Hernández Marín International Airport',
                                              18.4954, -67.1356, 237, 'PR', 'Aguadilla']

airports_with_coords.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 298 entries, 0 to 297
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   airport_code   298 non-null    object 
 1   City           298 non-null    object 
 2   ICAO           298 non-null    object 
 3   Airport        298 non-null    object 
 4   latitude_deg   298 non-null    float64
 5   longitude_deg  298 non-null    float64
 6   elevation_ft   298 non-null    float64
 7   local_region   298 non-null    object 
 8   municipality   298 non-null    object 
dtypes: float64(3), object(6)
memory usage: 31.4+ KB
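
`info()` only confirms that no values are missing; it cannot tell whether hand-entered coordinates landed in the right columns. A simple range check can catch a swapped latitude/longitude pair. The sketch below assumes every airport in this dataset lies north of the equator and west of the prime meridian, which suits the codes listed here but would not hold for, say, Guam. The row `BAD` is made up to show a swapped pair.

```python
import pandas as pd

coords = pd.DataFrame({
    "airport_code": ["ADK", "SJU", "BAD"],               # "BAD" is a made-up row
    "latitude_deg": [51.883564, 18.439167, -66.001944],  # last value is really a longitude
    "longitude_deg": [-176.642783, -66.001944, 18.439167],
})

# Flag rows whose latitude is not positive or whose longitude is not negative.
suspect = coords[(coords["latitude_deg"] <= 0) | (coords["longitude_deg"] >= 0)]
print(suspect["airport_code"].tolist())
```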


Finally, I export the dataframe to a JSON file using the `table` orientation. The code below does that (I checked the result by loading it into my database of choice).

In [17]:
# Export the updated information to a JSON file
# (to_json returns None when given a path, so there is nothing to assign)
airports_with_coords.to_json('airports_with_details.json', orient="table")
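
Besides loading the file into a database, the export can be checked from pandas itself by reading the JSON back. This is a sketch on a two-row frame that keeps the JSON in memory instead of writing `airports_with_details.json`.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "airport_code": ["ABE", "ABI"],
    "latitude_deg": [40.651773, 32.411301],
})

# With no path argument, to_json returns the JSON text instead of writing a file.
json_text = frame.to_json(orient="table")

# orient="table" stores a schema next to the data, so read_json can rebuild the frame.
restored = pd.read_json(io.StringIO(json_text), orient="table")
print(restored)
```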