# Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [159]:
flights_sample = pd.read_csv("../../data/processed/flights_sample.csv", index_col=None)

### Some Feature Engineering:

# Change the column dtypes to the correct type for the date columns
date_columns = [
    'Scheduled Departure Time (local time)',
    'Actual Departure Time (local time)',
    'Wheels Off (local time)',
    'Wheels On (local time)',
    'Scheduled Arrival Time (local time)',
    'Actual Arrival Time (local time)',
]
for col in date_columns:
    flights_sample[col] = pd.to_datetime(flights_sample[col])

# Create new columns for the hour of the day of the actual departure time and of the wheels-on time.
# .dt.hour returns floats when the column contains missing values (NaT), so cast to pandas'
# nullable integer dtype 'Int64', which keeps the missing values as <NA> instead of raising an error.
flights_sample['Actual Departure Hour'] = flights_sample['Actual Departure Time (local time)'].dt.hour.astype('Int64')
flights_sample['Wheels On Hour'] = flights_sample['Wheels On (local time)'].dt.hour.astype('Int64')

# Create a new column that calculates the difference between the departure delay and the arrival delay
flights_sample['Difference in Delay (Dep - Arr [minutes])'] = flights_sample['Departure Delay (minutes)'] - flights_sample['Arrival Delay (minutes)']

#Create departure and arrival state column
flights_sample['Departure State'] = flights_sample['Origin Airport (City, State)'].str[-2:]
flights_sample['Arrival State'] = flights_sample['Destination Airport (City, State)'].str[-2:]

# Remove the rows that have missing_airports
#missing_airports = pd.read_csv("../../data/raw/missing_airports.csv", index_col=None)
#flights_sample = flights_sample[~flights_sample['Origin Airport (IATA Code)'].isin(missing_airports)]
#flights_sample = flights_sample[~flights_sample['Destination Airport (IATA Code)'].isin(missing_airports)]

# Orientation

In [160]:
origin_airport = pd.read_csv("../../data/raw/unique_origin_airports.csv", index_col=None)
origin_airport.rename(columns={'origin': 'IATA code'}, inplace=True)

dest_airport = pd.read_csv("../../data/raw/unique_dest_airports.csv", index_col=None)
dest_airport.rename(columns={'dest': 'IATA code'}, inplace=True)

all_airports = pd.concat([origin_airport, dest_airport])
all_airports = all_airports.drop_duplicates()

In [176]:
all_airports.shape

(376, 1)

Okay, so we would need to pull the weather for 376 different airports, on a daily basis, for 2 years and 7 days.

In [180]:
(376 * (2 * 365))

274480

If we're going by day, that's a total of ~275,000 API calls, and this is just for the sample; it doesn't even account for the test set.

World Weather API only allows 500 requests a day, and each pull can cover at most one month.

In [181]:
(376 * (2 * 12)) / 500

18.048

At 500 requests a day, that's about 18 days of pulling. This is not a viable solution...

Let's keep searching

https://home.openweathermap.org/history_bulks/new

It allows complete history pulls for 10 USD a pull.. wow, that would be expensive

https://rapidapi.com/iddogino/api/global-weather-history/pricing

This one allows 10,000 pulls a month, and each pull only covers one day:

In [183]:
(376 * (2 * 365)) / 10000

27.448

At 10,000 pulls a month, that's more than 27 months. Still not a viable solution...
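The quota arithmetic for both APIs can be wrapped in one small helper. This is only a sketch: `periods_needed` is a hypothetical name, and the quotas are the ones quoted above (500 requests a day with monthly pulls, 10,000 pulls a month with daily pulls).

```python
import math

def periods_needed(n_airports, pulls_per_airport, quota_per_period):
    """Return (total API calls, number of quota periods needed to make them)."""
    total_calls = n_airports * pulls_per_airport
    return total_calls, math.ceil(total_calls / quota_per_period)

# World Weather API: one pull per airport per month over 2 years, 500 requests per day
wwo_calls, wwo_days = periods_needed(376, 2 * 12, 500)

# Global Weather History: one pull per airport per day over 2 years, 10,000 pulls per month
gwh_calls, gwh_months = periods_needed(376, 2 * 365, 10_000)

print(f"World Weather API: {wwo_calls} calls, {wwo_days} days")
print(f"Global Weather History: {gwh_calls} calls, {gwh_months} months")
```

Rounding up with `math.ceil` reflects that a partly used day or month of quota still has to be waited out.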

This website looks promising:
https://www.ncdc.noaa.gov/cdo-web/datasets

There's an FTP server that lets us pull daily historical summaries per weather station. We can even download the data for all of the world's weather stations, one file per year. And it's free.

After further investigation, this looks like the most viable solution:
- year.csv provides daily weather summaries per weather station, keyed by a weather station ID.
- ghcnd-stations.txt provides the location (lat, long) of all weather stations.

If we can get the lat,long of every airport, we could get the weather data from the closest weather station.
- Initially found Global Airport Database, but it doesn't contain all the airports we're using.
- Finally found World Airports which contains all our airports and more.

Entire process detailed below.
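As a reference point for the "closest weather station" idea, here is a minimal sketch of a direct nearest-station lookup using the haversine great-circle distance. It is not the approach followed below (which matches on rounded coordinates). The function names and the two stations are hypothetical, and only the CYS coordinates come from the airport data used later in this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_station(airport_lat, airport_lon, station_ids, station_lats, station_lons):
    """Return (station id, distance in km) of the station closest to the airport."""
    distances = haversine_km(airport_lat, airport_lon,
                             np.asarray(station_lats), np.asarray(station_lons))
    i = int(np.argmin(distances))
    return station_ids[i], float(distances[i])

# Hypothetical stations, for illustration only
station_ids = ['STATION_A', 'STATION_B']
station_lats = [41.2, 39.8]
station_lons = [-104.8, -104.7]

# Cheyenne Regional (CYS)
closest_id, closest_km = nearest_station(41.155701, -104.811997,
                                         station_ids, station_lats, station_lons)
print(closest_id, round(closest_km, 1))
```

Because the distance is vectorized over the stations, one call per airport is enough even with every station in ghcnd-stations.txt.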

## Global Airport Database (Incomplete - for archive purposes)

In [67]:
airport_location = pd.read_csv("GlobalAirportDatabase.txt", sep=":")

# Add column headers to the airport_location df
airport_location.columns = ['Airport ID', 'Airport Code', 'Airport Name', 'City', 'Country', 'Latitude Degrees', 'Latitude Minutes', 'Latitude Seconds', 'Latitude Direction', 'Longitude Degrees', 'Longitude Minutes', 'Longitude Seconds', 'Longitude Direction', 'Altitude', 'Latitude', 'Longitude']

# Drop all the columns except the Airport Code, Name, City, Country, Latitude and Longitude
airport_location = airport_location.drop(['Airport ID', 'Latitude Degrees', 'Latitude Minutes', 'Latitude Seconds', 'Latitude Direction', 'Longitude Degrees', 'Longitude Minutes', 'Longitude Seconds', 'Longitude Direction', 'Altitude'], axis=1)

In [69]:
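# Compare against the 'IATA code' column: passing the DataFrame itself would compare against its column names.
# Named in_flights rather than filter to avoid shadowing the Python builtin.
in_flights = airport_location['Airport Code'].isin(all_airports['IATA code'])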

In [71]:
# Copy all records from airport_location that are in all_airports to a new df
airport_latlong = airport_location[airport_location['Airport Code'].isin(all_airports['IATA code'])]

In [73]:
airport_latlong.shape

(217, 6)

In [79]:
missing_airports = np.setdiff1d(all_airports, airport_latlong['Airport Code'].values)
missing_airports

array(['ABE', 'ABR', 'ACV', 'ALO', 'ALW', 'APN', 'ASE', 'ATW', 'ATY',
       'AVL', 'AVP', 'AZA', 'AZO', 'BFF', 'BGM', 'BIL', 'BIS', 'BJI',
       'BKG', 'BMI', 'BRD', 'BTM', 'BZN', 'CAK', 'CGI', 'CHO', 'CID',
       'CIU', 'CKB', 'CMI', 'CMX', 'CNY', 'COD', 'CRW', 'CSG', 'CWA',
       'DAB', 'DBQ', 'DIK', 'DVL', 'EAR', 'EAT', 'EAU', 'ECP', 'EGE',
       'EKO', 'ELM', 'ERI', 'ESC', 'EUG', 'EVV', 'FAR', 'FAY', 'FCA',
       'FLG', 'FNT', 'FSD', 'FWA', 'GCC', 'GJT', 'GPT', 'GRI', 'GSO',
       'GSP', 'GST', 'GTR', 'GUC', 'HDN', 'HGR', 'HHH', 'HSV', 'HTS',
       'HVN', 'HYA', 'HYS', 'IDA', 'IFP', 'IMT', 'ITH', 'JAC', 'JLN',
       'JMS', 'LAR', 'LAW', 'LBE', 'LBF', 'LBL', 'LEX', 'LSE', 'LWB',
       'LWS', 'LYH', 'MBS', 'MEI', 'MFR', 'MGM', 'MHK', 'MHT', 'MKG',
       'MLI', 'MMH', 'MRY', 'MSO', 'MTJ', 'MVY', 'OAJ', 'OGD', 'ORH',
       'OTH', 'OWB', 'PAH', 'PGD', 'PGV', 'PIA', 'PIB', 'PIH', 'PIR',
       'PLN', 'PSC', 'PSG', 'PSM', 'PUW', 'PVU', 'RAP', 'RDD', 'RDM',
       'RFD', 'RHI',

In [81]:
# Count how many rows in flights_sample involve one of the missing_airports (as origin, then as destination)
print(flights_sample[flights_sample['Origin Airport (IATA Code)'].isin(missing_airports)].shape[0])
print(flights_sample[flights_sample['Destination Airport (IATA Code)'].isin(missing_airports)].shape[0])

13149
13079


We can't just ignore them.. I need another data source to account for the missing ones

## Getting the Lat/Long of all airports using World Airports

In [161]:
#let's try this dataset
airport_location = pd.read_csv("world-airports.csv", usecols=['country_name', 'local_region', 'iata_code', 'local_code', 'name', 'type', 'latitude_deg', 'longitude_deg', 'elevation_ft'])
## Credits to: https://ourairports.com/world.html

#Reorder columns
airport_location = airport_location[['iata_code', 'local_code', 'name', 'latitude_deg', 'longitude_deg', 'elevation_ft', 'type', 'local_region', 'country_name']]

#Keeping only United States airports
airport_location = airport_location[airport_location['country_name'] == 'United States']

airport_location.head()

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
1,LAX,LAX,Los Angeles International Airport,33.942501,-118.407997,125.0,large_airport,CA,United States
2,ORD,ORD,Chicago O'Hare International Airport,41.9786,-87.9048,672.0,large_airport,IL,United States
3,JFK,JFK,John F Kennedy International Airport,40.639447,-73.779317,13.0,large_airport,NY,United States
4,ATL,ATL,Hartsfield Jackson Atlanta International Airport,33.6367,-84.428101,1026.0,large_airport,GA,United States
6,SFO,SFO,San Francisco International Airport,37.618999,-122.375,13.0,large_airport,CA,United States


In [162]:
airport_codes = airport_location['iata_code'].unique()

# Check if we have any values in all_airports (The list of all airport_codes in the LHL flight dataset) that are not in airport_codes
missing_airports = all_airports[~all_airports['IATA code'].isin(airport_codes)]

missing_airports.shape

(11, 1)

We have 11 airports missing, meaning 11 airports from the Lighthouse Labs flights dataset are not in the airport dataset we found. Let's investigate further.

In [163]:
missing_airports

Unnamed: 0,IATA code
11,PSE
16,BQN
34,SPN
39,ROP
42,GUM
93,ISN
165,STT
169,CYS
211,PPG
222,STX


-- Ran this SQL Query on the flights table:
SELECT count(*)
FROM flights
WHERE origin IN ('PSE', 'BQN', 'SPN', 'ROP', 'GUM', 'ISN', 'STT', 'CYS', 'PPG', 'STX', 'SJU')
   OR dest   IN ('PSE', 'BQN', 'SPN', 'ROP', 'GUM', 'ISN', 'STT', 'CYS', 'PPG', 'STX', 'SJU')

--> Returned: 140,485... That's roughly 1% of the dataset, which is pretty significant
--> Looking at the codes above, some are outside the 50 states, like GUM (Guam). Will remove the United States filter to see if we can match more.

### Re-running the check above without the US filter

In [165]:
airport_location = pd.read_csv("world-airports.csv", usecols=['country_name', 'local_region', 'iata_code', 'local_code', 'name', 'type', 'latitude_deg', 'longitude_deg', 'elevation_ft'])
## Credits to: https://ourairports.com/world.html

#Reorder columns
airport_location = airport_location[['iata_code', 'local_code', 'name', 'latitude_deg', 'longitude_deg', 'elevation_ft', 'type', 'local_region', 'country_name']]

#Keeping only United States airports
#airport_location = airport_location[airport_location['country_name'] == 'United States']
## Removed as some of our airports (e.g. GUM) are not listed under 'United States'

airport_location.head()

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
0,LHR,,London Heathrow Airport,51.4706,-0.461941,83.0,large_airport,ENG,United Kingdom
1,LAX,LAX,Los Angeles International Airport,33.942501,-118.407997,125.0,large_airport,CA,United States
2,ORD,ORD,Chicago O'Hare International Airport,41.9786,-87.9048,672.0,large_airport,IL,United States
3,JFK,JFK,John F Kennedy International Airport,40.639447,-73.779317,13.0,large_airport,NY,United States
4,ATL,ATL,Hartsfield Jackson Atlanta International Airport,33.6367,-84.428101,1026.0,large_airport,GA,United States


In [166]:
airport_codes = airport_location['iata_code'].unique()

# Check if we have any values in all_airports (The list of all airport_codes in the LHL flight dataset) that are not in airport_codes
missing_airports = all_airports[~all_airports['IATA code'].isin(airport_codes)]

missing_airports.shape

(2, 1)

In [167]:
missing_airports

Unnamed: 0,IATA code
93,ISN
169,CYS


Okay, we're down to two. Open-source research reveals ISN is a closed airport, which is probably why it's missing from our airport dataset, while CYS is active...

Since Sloulin Field Airport (ISN) closed to the public on October 10, 2019, we can disregard it, as it shouldn't appear in our test dataset (Jan 2020)

In [169]:
airport_location[airport_location['name'].str.contains('Cheyenne')]

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
568,,CYS,Cheyenne Regional Jerry Olson Field,41.155701,-104.811997,6159.0,medium_airport,WY,United States
5749,,84D,Cheyenne Eagle Butte Airport,44.984402,-101.251,2448.0,small_airport,SD,United States
7934,,SYF,Cheyenne County Municipal Airport,39.761101,-101.795998,3413.0,small_airport,KS,United States


There it is! It just doesn't have an IATA code... Let's manually add it

In [170]:
airport_location.loc[568, 'iata_code'] = 'CYS'

# Check if it worked
airport_location[airport_location['name'].str.contains('Cheyenne')]

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
568,CYS,CYS,Cheyenne Regional Jerry Olson Field,41.155701,-104.811997,6159.0,medium_airport,WY,United States
5749,,84D,Cheyenne Eagle Butte Airport,44.984402,-101.251,2448.0,small_airport,SD,United States
7934,,SYF,Cheyenne County Municipal Airport,39.761101,-101.795998,3413.0,small_airport,KS,United States
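The assignment above relies on the row label 568, which depends on the row order of the downloaded file. A slightly more robust sketch selects the row with a boolean mask instead. The toy frame below only mimics three columns of the real one, and it assumes that `local_code == 'CYS'` identifies a single row (in the full file an extra country filter may be needed).

```python
import pandas as pd

# Toy stand-in for airport_location (only the columns needed here)
airports = pd.DataFrame({
    'iata_code': [None, None, 'LAX'],
    'local_code': ['CYS', '84D', 'LAX'],
    'name': ['Cheyenne Regional Jerry Olson Field',
             'Cheyenne Eagle Butte Airport',
             'Los Angeles International Airport'],
})

# Select the row by content rather than by index label
mask = (airports['local_code'] == 'CYS') & airports['iata_code'].isna()
assert mask.sum() == 1, "expected exactly one airport to patch"

airports.loc[mask, 'iata_code'] = 'CYS'
print(airports)
```

The assertion makes the patch fail loudly if a future version of the file already has the code filled in or contains several matching rows.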


In [172]:
airport_codes = airport_location['iata_code'].unique()

# Check if we have any values in all_airports (The list of all airport_codes in the LHL flight dataset) that are not in airport_codes
missing_airports = all_airports[~all_airports['IATA code'].isin(airport_codes)]

missing_airports

Unnamed: 0,IATA code
93,ISN


And now our only missing airport is ISN, which won't be a problem on our dataset. Fantastic!

In [175]:
# Store airport_location in a csv for future use
airport_location.to_csv("../../data/processed/flights_enrichment_airportLocation.csv", index=False)

### The next step is to find which weather stations are close to our airports

In [232]:
#import ghcnd-stations
weather_stations = pd.read_csv("../../data/raw/ghcnd-stations.txt", sep='\t', header=None, index_col=None)

# Ok so we have to split the first column into 3 columns Station ID, Latitude, Longitude and keep only those columns
weather_stations = weather_stations[0].str.split(expand=True)
weather_stations = weather_stations.iloc[:, :3]
weather_stations.columns = ['StationID', 'Latitude', 'Longitude']
weather_stations

Unnamed: 0,StationID,Latitude,Longitude
0,ACW00011604,17.1167,-61.7833
1,ACW00011647,17.1333,-61.7833
2,AE000041196,25.3330,55.5170
3,AEM00041194,25.2550,55.3640
4,AEM00041217,24.4330,54.6510
...,...,...,...
123179,ZI000067969,-21.0500,29.3670
123180,ZI000067975,-20.0670,30.8670
123181,ZI000067977,-21.0170,31.5830
123182,ZI000067983,-20.2000,32.6160
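ghcnd-stations.txt is a fixed-width file, so an alternative to splitting on whitespace is `pd.read_fwf`. The sketch below assumes the column positions as I understand them from the GHCN-Daily readme (ID in characters 1-11, latitude in 13-20, longitude in 22-30). The two sample lines reuse the IDs and coordinates shown above, with the remaining fields left out.

```python
import io
import pandas as pd

# Two lines laid out like ghcnd-stations.txt (only the first three fields)
sample = io.StringIO(
    "ACW00011604  17.1167  -61.7833\n"
    "AE000041196  25.3330   55.5170\n"
)

stations = pd.read_fwf(
    sample,
    colspecs=[(0, 11), (12, 20), (21, 30)],  # half-open [start, end) character ranges
    names=['StationID', 'Latitude', 'Longitude'],
    header=None,
)
print(stations)
print(stations.dtypes)
```

Reading by position parses latitude and longitude as floats directly, and it is not affected by station names that contain spaces.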


Before we go any further let's educate ourselves on the accuracy of the lat/longs:

Accuracy versus decimal places:

decimal places	degrees	distance
0	1.0	111 km
1	0.1	11.1 km
2	0.01	1.11 km
3	0.001	111 m
4	0.0001	11.1 m
5	0.00001	1.11 m
6	0.000001	0.111 m
7	0.0000001	1.11 cm
8	0.00000001	1.11 mm
source: http://wiki.gis.com/wiki/index.php/Decimal_degrees

airport_location has 6 decimal places and the weather stations have 4. This is more precise than we need.. we could keep only one decimal (roughly 11 km).

We have another problem, however: with one decimal, a station that sits right next to an airport can still land in the neighbouring 0.1-degree cell on the EW and/or the NS axis, so an exact match on the rounded values would miss it.

To solve this, we'll look for weather stations within the same lat/long cell as the airport, as well as the cells with 0.1 less in latitude and/or longitude (to the south, west and south-west).
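A small illustration of the boundary problem, using made-up coordinates. The helper `cell` is hypothetical: it expresses a coordinate as an integer number of tenths of a degree, which also avoids comparing floats after subtracting 0.1.

```python
def cell(value):
    """Integer number of tenths of a degree, e.g. 41.98 -> 420."""
    return int(round(value * 10))

# Made-up airport and station, about 5 km apart
airport = (41.98, -87.90)
station = (41.94, -87.88)

airport_cell = (cell(airport[0]), cell(airport[1]))
station_cell = (cell(station[0]), cell(station[1]))

# Exact match on the rounded coordinates misses the station
exact_match = airport_cell == station_cell

# Also accept the cells one tenth lower in latitude and/or longitude
neighbours = {(airport_cell[0] - dlat, airport_cell[1] - dlon)
              for dlat in (0, 1) for dlon in (0, 1)}
neighbour_match = station_cell in neighbours

print(airport_cell, station_cell, exact_match, neighbour_match)
```

Note that this only looks towards lower latitude and longitude, as described above. A station that rounds into the cell to the north or east of the airport would still be missed unless those offsets are added too.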

In [233]:
#Let's start by rounding the latitude and longitude of weather_stations and airport_location to 1 decimals

#convert the latitude and longitude to float
weather_stations['Latitude'] = weather_stations['Latitude'].astype(float)
weather_stations['Longitude'] = weather_stations['Longitude'].astype(float)

weather_stations['Latitude'] = weather_stations['Latitude'].round(1)
weather_stations['Longitude'] = weather_stations['Longitude'].round(1)
airport_location['latitude_deg'] = airport_location['latitude_deg'].round(1)
airport_location['longitude_deg'] = airport_location['longitude_deg'].round(1)

In [234]:
# Now let's loop through the weather_stations and check whether an airport falls in the same 0.1-degree cell
# or in the neighbouring cell 0.1 lower in latitude and/or longitude; if it does, add its iata_code to a new column.
# NOTE: this compares every station with every airport, which is very slow on ~123k stations.
weather_stations['Airport'] = None
for index, row in weather_stations.iterrows():
    for index2, row2 in airport_location.iterrows():
        # Skip airports without an IATA code (there is nothing to record)
        if pd.isnull(row2['iata_code']):
            continue
        # Round after subtracting: x - 0.1 does not always land exactly on the one-decimal float
        lat_match = row['Latitude'] in (row2['latitude_deg'], round(row2['latitude_deg'] - 0.1, 1))
        lon_match = row['Longitude'] in (row2['longitude_deg'], round(row2['longitude_deg'] - 0.1, 1))
        if lat_match and lon_match:
            # If there's already an airport, add a comma and the new airport
            if pd.notnull(weather_stations.loc[index, 'Airport']):
                weather_stations.loc[index, 'Airport'] = weather_stations.loc[index, 'Airport'] + ',' + row2['iata_code']
            # If there's no airport, add the airport
            else:
                weather_stations.loc[index, 'Airport'] = row2['iata_code']

KeyboardInterrupt: 

In [235]:
weather_stations

Unnamed: 0,StationID,Latitude,Longitude,Airport
0,ACW00011604,17.1,-61.8,ANU
1,ACW00011647,17.1,-61.8,ANU
2,AE000041196,25.3,55.5,SHJ
3,AEM00041194,25.3,55.4,"DXB,SHJ"
4,AEM00041217,24.4,54.7,AUH
...,...,...,...,...
123179,ZI000067969,-21.0,29.4,
123180,ZI000067975,-20.1,30.9,
123181,ZI000067977,-21.0,31.6,
123182,ZI000067983,-20.2,32.6,


In [231]:
# weather stations that have the same latitude AND longitude as an airport (NE)
# A merge on both keys is needed: two separate .isin() filters test each axis independently,
# and the second assignment would overwrite the first
NEweather_stations = weather_stations.merge(airport_location, left_on=['Latitude', 'Longitude'], right_on=['latitude_deg', 'longitude_deg'])
NEweather_stations

Unnamed: 0,StationID,Latitude,Longitude,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
0,ACW00011604,17.1,-61.8,ANU,,V.C. Bird International Airport,17.1,-61.8,62.0,medium_airport,03,Antigua and Barbuda
1,ACW00011647,17.1,-61.8,ANU,,V.C. Bird International Airport,17.1,-61.8,62.0,medium_airport,03,Antigua and Barbuda
2,AE000041196,25.3,55.5,SHJ,,Sharjah International Airport,25.3,55.5,111.0,large_airport,SH,United Arab Emirates
3,AEM00041194,25.3,55.4,DXB,,Dubai International Airport,25.3,55.4,62.0,large_airport,DU,United Arab Emirates
4,AEM00041217,24.4,54.7,AUH,,Abu Dhabi International Airport,24.4,54.7,88.0,large_airport,AZ,United Arab Emirates
...,...,...,...,...,...,...,...,...,...,...,...,...
32895,ZI000067965,-20.0,28.6,BUQ,,Joshua Mqabuko Nkomo International Airport,-20.0,28.6,4359.0,medium_airport,BU,Zimbabwe
32896,ZI000067975,-20.1,30.9,MVZ,,Masvingo International Airport,-20.1,30.9,3595.0,medium_airport,MV,Zimbabwe
32897,ZI000067977,-21.0,31.6,BFO,,Buffalo Range Airport,-21.0,31.6,1421.0,medium_airport,MV,Zimbabwe
32898,ZI000067983,-20.2,32.6,CHJ,,Chipinge Airport,-20.2,32.6,3720.0,small_airport,MA,Zimbabwe


In [222]:
# Pair each weather station with the airports in the same 0.1-degree cell and in the three
# neighbouring cells 0.1 lower in latitude and/or longitude.
# Latitude and longitude have to match together, so we merge on both keys
# (separate .isin() filters would test each axis independently).
matches = []
for dlat, dlon in [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1)]:
    # Shift the airport coordinates, then round again so the float keys compare equal
    shifted = airport_location.assign(
        Latitude=(airport_location['latitude_deg'] - dlat).round(1),
        Longitude=(airport_location['longitude_deg'] - dlon).round(1),
    )
    matches.append(weather_stations.merge(shifted, on=['Latitude', 'Longitude']))

# Concatenate the 4 dataframes
weather_stations = pd.concat(matches)

# Drop duplicates
weather_stations = weather_stations.drop_duplicates()

In [224]:
weather_stations.groupby('iata_code').count().sort_values(by='StationID', ascending=False)

Unnamed: 0_level_0,StationID,Latitude,Longitude,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
iata_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
WBU,45,45,45,45,45,45,45,45,45,45,45
CYS,37,37,37,37,37,37,37,37,37,37,37
BFK,32,32,32,32,32,32,32,32,32,32,32
BJC,29,29,29,29,29,29,29,29,29,29,29
MSP,29,29,29,29,29,29,29,29,29,29,29
...,...,...,...,...,...,...,...,...,...,...,...
KFA,1,1,1,0,1,1,1,1,1,1,1
KGA,1,1,1,0,1,1,1,1,1,1,1
KGG,1,1,1,0,1,1,1,1,1,1,1
KGJ,1,1,1,0,1,1,1,1,1,1,1


Ok some of them have a LOT of weather stations around.. but it should be fine, it will just provide more data!

Every iata_code listed here has at least 1 weather station, but airports with no matching station don't show up in a groupby at all, so this alone doesn't prove full coverage. Now let's make sure all the iata codes from airport_location are in weather_stations.

In [226]:
airport_location[airport_location['iata_code'].isin(weather_stations['StationID'])]

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
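Note that the check above compares IATA codes with StationID values, which never coincide, so it returns an empty frame whatever the coverage is. A minimal sketch of a check that lists the airports without any matched station, on toy data and assuming the matched frame carries an `iata_code` column as in the groupby above:

```python
import pandas as pd

# Toy stand-ins: three airports, and station matches for only two of them
airport_location_toy = pd.DataFrame({'iata_code': ['AAA', 'BBB', 'CCC']})
matched_toy = pd.DataFrame({
    'StationID': ['ST1', 'ST2', 'ST3'],
    'iata_code': ['AAA', 'AAA', 'CCC'],
})

# Comparing airport codes with station IDs: always empty
wrong = airport_location_toy[airport_location_toy['iata_code'].isin(matched_toy['StationID'])]

# Comparing airport codes with the matched airport codes, negated: airports with no station
unmatched = airport_location_toy[~airport_location_toy['iata_code'].isin(matched_toy['iata_code'])]

print(len(wrong), unmatched['iata_code'].tolist())
```

An empty `unmatched` frame is what would confirm that every airport has at least one station.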


### Once we've filtered using the lat/longs, we'll have the Station IDs to filter the main weather CSV.

In [52]:
raw2018 = pd.read_csv('2018.csv')

In [54]:
raw2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36333281 entries, 0 to 36333280
Data columns (total 8 columns):
 #   Column       Dtype  
---  ------       -----  
 0   AE000041196  object 
 1   20180101     int64  
 2   TMAX         object 
 3   259          int64  
 4   Unnamed: 4   object 
 5   Unnamed: 5   object 
 6   S            object 
 7   Unnamed: 7   float64
dtypes: float64(1), int64(2), object(5)
memory usage: 2.2+ GB
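The column names in the `info()` output (AE000041196, 20180101, TMAX, 259, ...) are actually the first data row, which suggests the file has no header line. Below is a sketch of reading it with explicit column names and keeping only the stations of interest, chunk by chunk, so the 36 million rows never have to sit in memory at once. The column labels are my own, based on my reading of the GHCN-Daily by-year format (station ID, date, element, value, three flags, observation time). `load_station_weather` is a hypothetical helper, and apart from the first row the sample data is made up.

```python
import io
import pandas as pd

# Labels chosen here for illustration; the file itself has no header
GHCN_COLUMNS = ['StationID', 'Date', 'Element', 'Value', 'MFlag', 'QFlag', 'SFlag', 'ObsTime']

def load_station_weather(source, station_ids, chunksize=1_000_000):
    """Read a headerless by-year CSV in chunks, keeping only the given stations."""
    wanted = set(station_ids)
    kept = []
    for chunk in pd.read_csv(source, header=None, names=GHCN_COLUMNS, chunksize=chunksize):
        kept.append(chunk[chunk['StationID'].isin(wanted)])
    return pd.concat(kept, ignore_index=True)

# Tiny in-memory stand-in for 2018.csv
sample = io.StringIO(
    "AE000041196,20180101,TMAX,259,,,S,\n"
    "AE000041196,20180101,TMIN,112,,,S,\n"
    "ZZ000000001,20180101,PRCP,0,,,S,\n"
)

weather = load_station_weather(sample, ['AE000041196'], chunksize=2)
print(weather)
```

With the real file, `source` would be the path to 2018.csv and `station_ids` the StationID values matched to our airports.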
