# Preparation

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", 120)

In [2]:
flights_sample = pd.read_csv("../../data/raw/flights_sample+test.csv", index_col=None)

# Orientation

In [3]:
flights_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 860556 entries, 0 to 860555
Data columns (total 17 columns):
 #   Column                                      Non-Null Count   Dtype 
---  ------                                      --------------   ----- 
 0   Marketer - Unique Carrier Code              860556 non-null  object
 1   Operator - Unique Carrier Code              860556 non-null  object
 2   Tail Number                                 858407 non-null  object
 3   Flight Number                               860556 non-null  int64 
 4   Origin Airport (IATA Code)                  860556 non-null  object
 5   Destination Airport (IATA Code)             860556 non-null  object
 6   Scheduled Departure Time (local time)       860556 non-null  object
 7   Scheduled Arrival Time (local time)         860556 non-null  object
 8   Scheduled Elapsed Time                      860556 non-null  int64 
 9   Distance (miles)                            860556 non-null  int64 
 10  Differen

In [4]:
# Printed a list of all airports from the flights table
origin_airport = pd.read_csv("../../data/raw/unique_origin_airports.csv", index_col=None)
origin_airport.rename(columns={'origin': 'IATA code'}, inplace=True)

# Printed a list of all airports from the flights table
dest_airport = pd.read_csv("../../data/raw/unique_dest_airports.csv", index_col=None)
dest_airport.rename(columns={'dest': 'IATA code'}, inplace=True)

# The list above didn't include the Final test ones, but we have them in our flights_sample, so we'll list all unique arrivals and departures for this one as well
sample_origin_airport = pd.DataFrame()
sample_dest_airport = pd.DataFrame()

sample_origin_airport['IATA code'] = flights_sample['Origin Airport (IATA Code)'].unique()
sample_dest_airport['IATA code'] = flights_sample['Destination Airport (IATA Code)'].unique()

# Merging all 6x so we have a list with all airports
all_airports = pd.concat([origin_airport, dest_airport, sample_origin_airport, sample_dest_airport], ignore_index=True)
all_airports = all_airports.drop_duplicates()

In [5]:
all_airports.shape

(378, 1)

Okay, so we need to pull the weather for 378x different airports we would need to get the weather from, on a daily basis for 2 years and 7x days

In [6]:
(378 * (2 * 365))

275940

If we're going by day. that's a total of ~275,000 API calls.. and this is just for the sample.. we don't even account for the test.

World Weather API is only allowing 500 request a day, and each pull can only be of up to a month. 

In [7]:
(378 * (2 * 12)) / 500

18.144

This is not a viable solution...

Let's keep searching

https://home.openweathermap.org/history_bulks/new

Allows to do complete history pulls for 10USD a pull.. wow.. that would be expensive

https://rapidapi.com/iddogino/api/global-weather-history/pricing

This guy allows 10,000 pull a month and it's only for a day at a time:

In [8]:
(378 * (2 * 365)) / 10000

27.594

Still not a viable solution...

This website looks promising:
https://www.ncdc.noaa.gov/cdo-web/datasets

There's an FTP server which allows to pull daily historical summaries per weather stations. We can even download the worlds' weather stations per year. And it's free.

After further investigation, this looks like the most viable solution:
- year.csv provides daily wheather summaries per weatherstations which a weather station id.
- ghcnd-stations.txt provides the location (lat, long) of all weather stations.

If we can get the lat,long of every airport, we could get the weather data from the closest weather station.
- Initially found Global Airport Database, but it doesn't contain all the airports we're using.
- Finally found World Airports which contains all our airports and more.

Entire process detailed below.

# Getting the Lat/Long of all airports using World Airports

## Take 1 - w/ US Filter (dropped)

In [9]:
#let's try this dataset
airport_location = pd.read_csv("../../data/raw/world-airports.csv", usecols=['country_name', 'local_region', 'iata_code', 'local_code', 'name', 'type', 'latitude_deg', 'longitude_deg', 'elevation_ft'])
## Credits to: https://ourairports.com/world.html

#Reorder columns
airport_location = airport_location[['iata_code', 'local_code', 'name', 'latitude_deg', 'longitude_deg', 'elevation_ft', 'type', 'local_region', 'country_name']]

#Keeping only United States Airport
airport_location = airport_location[airport_location['country_name'] == 'United States']

airport_location.head()

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
1,LAX,LAX,Los Angeles International Airport,33.942501,-118.407997,125.0,large_airport,CA,United States
2,ORD,ORD,Chicago O'Hare International Airport,41.9786,-87.9048,672.0,large_airport,IL,United States
3,JFK,JFK,John F Kennedy International Airport,40.639447,-73.779317,13.0,large_airport,NY,United States
4,ATL,ATL,Hartsfield Jackson Atlanta International Airport,33.6367,-84.428101,1026.0,large_airport,GA,United States
6,SFO,SFO,San Francisco International Airport,37.618999,-122.375,13.0,large_airport,CA,United States


In [10]:
airport_codes = airport_location['iata_code'].unique()

# Check if we have any values in all_airports (The list of all airport_codes in the LHL flight dataset) that are not in airport_codes
missing_airports = all_airports[~all_airports['IATA code'].isin(airport_codes)]

missing_airports.shape

(11, 1)

We have 11x airports missing, meaning 11x airports from the lighthouse labs flights dataset is not in the airport dataset we found. Let's further investigate.

In [11]:
missing_airports

Unnamed: 0,IATA code
11,PSE
16,BQN
34,SPN
39,ROP
42,GUM
93,ISN
165,STT
169,CYS
211,PPG
222,STX


Returned: 140,485... It's roughly 1% of the dataset.. it's pretty significant

Looking at some codes above, some are international, like GUM. Will remove the united states filter see if we can get more.

## Take 2 - without the US Filter

In [12]:
airport_location = pd.read_csv("../../data/raw/world-airports.csv", usecols=['country_name', 'local_region', 'iata_code', 'local_code', 'name', 'type', 'latitude_deg', 'longitude_deg', 'elevation_ft'])
## Credits to: https://ourairports.com/world.html

#Reorder columns
airport_location = airport_location[['iata_code', 'local_code', 'name', 'latitude_deg', 'longitude_deg', 'elevation_ft', 'type', 'local_region', 'country_name']]

#Keeping only United States Airport
#airport_location = airport_location[airport_location['country_name'] == 'United States']
## Removed as some flights end up being international

airport_location.head()

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
0,LHR,,London Heathrow Airport,51.4706,-0.461941,83.0,large_airport,ENG,United Kingdom
1,LAX,LAX,Los Angeles International Airport,33.942501,-118.407997,125.0,large_airport,CA,United States
2,ORD,ORD,Chicago O'Hare International Airport,41.9786,-87.9048,672.0,large_airport,IL,United States
3,JFK,JFK,John F Kennedy International Airport,40.639447,-73.779317,13.0,large_airport,NY,United States
4,ATL,ATL,Hartsfield Jackson Atlanta International Airport,33.6367,-84.428101,1026.0,large_airport,GA,United States


In [13]:
airport_codes = airport_location['iata_code'].unique()

# Check if we have any values in all_airports (The list of all airport_codes in the LHL flight dataset) that are not in airport_codes
missing_airports = all_airports[~all_airports['IATA code'].isin(airport_codes)]

missing_airports.shape

(2, 1)

In [14]:
missing_airports

Unnamed: 0,IATA code
93,ISN
169,CYS


Okay we're down to two. Open Source research reveals ISN is a closed airport so that's probably why we don't have it in our airport dataset and CYS is active... 

Since Sloulin Field Airport closed to the public on October 10, 2019, we can disregard as it shouldn't appear in our test dataset (Jan 2020)

In [15]:
airport_location[airport_location['name'].str.contains('Cheyenne')]

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
568,,CYS,Cheyenne Regional Jerry Olson Field,41.155701,-104.811997,6159.0,medium_airport,WY,United States
5749,,84D,Cheyenne Eagle Butte Airport,44.984402,-101.251,2448.0,small_airport,SD,United States
7934,,SYF,Cheyenne County Municipal Airport,39.761101,-101.795998,3413.0,small_airport,KS,United States


There it is! It just doesn't have an IATA code... Let's manually add it

In [16]:
airport_location.loc[568, 'iata_code'] = 'CYS'

# Check if it worked
airport_location[airport_location['name'].str.contains('Cheyenne')]

Unnamed: 0,iata_code,local_code,name,latitude_deg,longitude_deg,elevation_ft,type,local_region,country_name
568,CYS,CYS,Cheyenne Regional Jerry Olson Field,41.155701,-104.811997,6159.0,medium_airport,WY,United States
5749,,84D,Cheyenne Eagle Butte Airport,44.984402,-101.251,2448.0,small_airport,SD,United States
7934,,SYF,Cheyenne County Municipal Airport,39.761101,-101.795998,3413.0,small_airport,KS,United States


In [17]:
airport_codes = airport_location['iata_code'].unique()

# Check if we have any values in all_airports (The list of all airport_codes in the LHL flight dataset) that are not in airport_codes
missing_airports = all_airports[~all_airports['IATA code'].isin(airport_codes)]

missing_airports

Unnamed: 0,IATA code
93,ISN


And now our only missing airport is ISN, which won't be a problem on our dataset. Fantastic!

## Saving the results to a CSV

Although we're just using this to get our weather stations, there is some additional information on the airports we could use as well. We will save it as a CSV and conduct data_preparation on it at a later time in its own jupyter notebook.

In [18]:
# Store airport_location in a csv for future use
airport_location.to_csv("../../data/processed/flights_enrichment_airportLocation.csv", index=False)

# Finding the Weather Stations that are proximate to our airports, using lat/long

In [19]:
airport_location.shape

(35816, 9)

For our purpose, we'll remove the airports that are not in the LHL dataset. 

In [20]:
airport_location = airport_location[airport_location['iata_code'].isin(all_airports['IATA code'])]
airport_location.shape

(377, 9)

What a trim down!

Now let's import the list of weather stations and their location

In [21]:
#import ghcnd-stations
weather_stations = pd.read_csv("../../data/raw/ghcnd-stations.txt", sep='\t', header=None, index_col=None)

# Ok so we have to split the first column into 3 columns Station ID, Latitude, Longitude and keep only those columns
weather_stations = weather_stations[0].str.split(expand=True)
weather_stations = weather_stations.iloc[:, :3]
weather_stations.columns = ['StationID', 'Latitude', 'Longitude']

#Let's start by rounding the latitude and longitude of weather_stations and airport_location to 1 decimals

#convert the latitude and longitude to float
weather_stations['Latitude'] = weather_stations['Latitude'].astype(float)
weather_stations['Longitude'] = weather_stations['Longitude'].astype(float)

weather_stations['Latitude'] = weather_stations['Latitude'].round(1)
weather_stations['Longitude'] = weather_stations['Longitude'].round(1)
airport_location['latitude_deg'] = airport_location['latitude_deg'].round(1)
airport_location['longitude_deg'] = airport_location['longitude_deg'].round(1)

Before we go any further let's educate ourselves on the accuracy of the lat/longs:

Accuracy versus decimal places decimal places	degrees	distance
0	1.0	111 km
1	0.1	11.1 km
2	0.01	1.11 km
3	0.001	111 m
4	0.0001	11.1 m
5	0.00001	1.11 m
6	0.000001	0.111 m
7	0.0000001	1.11 cm
8	0.00000001	1.11 mm
source: http://wiki.gis.com/wiki/index.php/Decimal_degrees

airport_location have 6x digits and weather stations have 4. This is too precise for our needs.. we could keep only one decimal..

We have another problem however, the way decimal work, if the station is to the west and/or north, we wouldn't find it as the decimal point would be one less for either the EW or the NS axis. 

To solve this, we'll look for weather stations withing the same lat/long as well as the one with .1 less (NW, SW, SE).

-----UPDATE----

After a bit more thought on this we should do +0.1 as well. The reason being is if the airport is at 35.19 and the weather station at 35.21 we would miss it that way. So we'll work with a central box surrounded by a N, NE, E, SE, S, SW, W, and NW box as well.

Ok this is slow, let's create a North_Latitude and South_Latitude column as well as an East_Longitude and West_Longitude for airport_location

In [22]:
airport_location['North_Latitude'] = airport_location['latitude_deg'] + 0.1
airport_location['South_Latitude'] = airport_location['latitude_deg'] - 0.1
airport_location['East_Longitude'] = airport_location['longitude_deg'] + 0.1
airport_location['West_Longitude'] = airport_location['longitude_deg'] - 0.1

# So now, we can make Central df, a North df, a NorthEast df, a East df, a SouthEast df, a South df, a SouthWest df, a West df, and a NorthWest df
Central = airport_location[['iata_code', 'latitude_deg', 'longitude_deg']]
North = airport_location[['iata_code', 'North_Latitude', 'longitude_deg']]
NorthEast = airport_location[['iata_code', 'North_Latitude', 'East_Longitude']]
East = airport_location[['iata_code', 'latitude_deg', 'East_Longitude']]
SouthEast = airport_location[['iata_code', 'South_Latitude', 'East_Longitude']]
South = airport_location[['iata_code', 'South_Latitude', 'longitude_deg']]
SouthWest = airport_location[['iata_code', 'South_Latitude', 'West_Longitude']]
West = airport_location[['iata_code', 'latitude_deg', 'West_Longitude']]
NorthWest = airport_location[['iata_code', 'North_Latitude', 'West_Longitude']]

# Now let's merge the weather_stations with the Central, North, NorthEast, East, SouthEast, South, SouthWest, West, and NorthWest airport_location
Central = pd.merge(weather_stations, Central, how='left', left_on=['Latitude', 'Longitude'], right_on=['latitude_deg', 'longitude_deg'])
North = pd.merge(weather_stations, North, how='left', left_on=['Latitude', 'Longitude'], right_on=['North_Latitude', 'longitude_deg'])
NorthEast = pd.merge(weather_stations, NorthEast, how='left', left_on=['Latitude', 'Longitude'], right_on=['North_Latitude', 'East_Longitude'])
East = pd.merge(weather_stations, East, how='left', left_on=['Latitude', 'Longitude'], right_on=['latitude_deg', 'East_Longitude'])
SouthEast = pd.merge(weather_stations, SouthEast, how='left', left_on=['Latitude', 'Longitude'], right_on=['South_Latitude', 'East_Longitude'])
South = pd.merge(weather_stations, South, how='left', left_on=['Latitude', 'Longitude'], right_on=['South_Latitude', 'longitude_deg'])
SouthWest = pd.merge(weather_stations, SouthWest, how='left', left_on=['Latitude', 'Longitude'], right_on=['South_Latitude', 'West_Longitude'])
West = pd.merge(weather_stations, West, how='left', left_on=['Latitude', 'Longitude'], right_on=['latitude_deg', 'West_Longitude'])
NorthWest = pd.merge(weather_stations, NorthWest, how='left', left_on=['Latitude', 'Longitude'], right_on=['North_Latitude', 'West_Longitude'])


In [23]:
# And finally, let's combine all of the dataframes into one
weather_stations = pd.concat([Central, North, NorthEast, East, SouthEast, South, SouthWest, West, NorthWest])

# Now let's drop the columns we don't need
weather_stations = weather_stations.drop(['North_Latitude', 'South_Latitude', 'East_Longitude', 'West_Longitude', 'latitude_deg', 'longitude_deg'], axis=1)

# And let's drop the duplicates
weather_stations = weather_stations.drop_duplicates()

# And let's drop the rows where there's no airport
weather_stations = weather_stations.dropna(subset=['iata_code'])

In [24]:
weather_stations['iata_code'].value_counts()

MSP    124
DCA    117
ABQ    107
SAT    100
CHS     98
      ... 
XWA      1
OTZ      1
OME      1
GFK      1
BRW      1
Name: iata_code, Length: 376, dtype: int64

In [25]:
weather_stations['StationID'].value_counts()

US1TXDA0057    2
US1TXDA0098    2
USW00013907    2
US1TXDA0102    2
US1OHFR0104    2
              ..
US1TXBXR361    1
US1TXBXR331    1
US1TXBXR273    1
US1TXBXR242    1
USW00094918    1
Name: StationID, Length: 6838, dtype: int64

As we can see, some weather stations are close to multiple airports. Here's the two next step:

1. Consolidate every airports proximate to a weather station by putting them all into a single column following this format: 'RDU', 'YYZ'. That way when we call it we can call it using like: 'RDU' and be sure we grabbed the right one.
2. Generate a list of unique weather stations, so we can significantly diminish the large CSVs we are working with

In [26]:
#1 - Thank you Google!
weather_stations = weather_stations.groupby(['StationID', 'Latitude', 'Longitude'])['iata_code'].apply(','.join).reset_index()

In [27]:
weather_stations

Unnamed: 0,StationID,Latitude,Longitude,iata_code
0,AQC00914005,-14.3,-170.6,PPG
1,AQC00914021,-14.3,-170.6,PPG
2,AQC00914060,-14.3,-170.7,PPG
3,AQC00914135,-14.3,-170.7,PPG
4,AQC00914138,-14.3,-170.7,PPG
...,...,...,...,...
6833,VQC00674900,17.8,-64.8,STX
6834,VQC00677600,18.3,-64.9,STT
6835,VQC00679222,18.2,-65.0,STT
6836,VQW00011624,17.7,-64.8,STX


Now let's get a list of all those useful weather stations

In [28]:
#2 - Have a unique list of weather stations
station_list = weather_stations['StationID'].unique()
#station_list

We're ready to get our weather.

# Prepare the weather table, keeping only the weather stations of interest

In [3]:
# Consolidating our raw weather into one dataframe
weather_2018 = pd.read_csv('../../data/raw/2018.csv', header=None)
weather_2019 = pd.read_csv('../../data/raw/2019.csv', header=None)
weather_2020 = pd.read_csv('../../data/raw/2020.csv', header=None)
weather = pd.concat([weather_2018, weather_2019, weather_2020], axis=0)

#Let's keep only the rows that have a stationID that's in our station_list
weather = weather[weather[0].isin(station_list)]

NameError: name 'station_list' is not defined

In [4]:
weather.shape

(108570329, 8)

In [30]:
weather.head(15)

Unnamed: 0,0,1,2,3,4,5,6,7
188,AQW00061705,20180101,TMAX,314,,,W,2400.0
189,AQW00061705,20180101,TMIN,252,,,W,2400.0
190,AQW00061705,20180101,PRCP,190,,,W,2400.0
191,AQW00061705,20180101,ADPT,250,,,W,
192,AQW00061705,20180101,ASLP,10102,,,W,
193,AQW00061705,20180101,ASTP,10098,,,W,
194,AQW00061705,20180101,AWBT,261,,,W,
195,AQW00061705,20180101,AWND,23,,,W,
196,AQW00061705,20180101,RHAV,94,,,W,
197,AQW00061705,20180101,RHMN,83,,,W,


According to the documentation, the 3rd column is the element and the 4th one its value. 

Let's rename the first four columns to Station, Date, Element, and Value and drop the rest

# Keeping only the relevant weather information 

In [31]:
weather = weather[[weather.columns[0], weather.columns[1], weather.columns[2], weather.columns[3]]]
weather = weather.rename(columns={0: 'Station', 1: 'Date', 2: 'Element', 3: 'Value'})

#Drop rows with no date
weather = weather.dropna(subset=['Date'])

# The observation column would need to be pivoted and would need to be added under the appropriate column name
weather = weather.pivot_table(index=['Station', 'Date'], columns='Element', values='Value', aggfunc='first').reset_index()

Let's drop the columns that we don't need.

After some research, the factors affecting planes are:
- Fog
- Ice
- High winds
- Heavy Reain
- Low Air Density (combination of heat, altitude and air pressure)

Let's keep the relevant ones:

In [32]:
weather = weather[['Station', 'Date', 'PRCP', 'SNOW', 'TAVG', 'TMAX','TMIN', 'ADPT', 'ASLP', 'ASTP', 'AWND', 'PSUN', 'RHAV', 'RHMN', 'RHMX', 'WSFG', 'WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT07', 'WT08', 'WT09', 'WT10', 'WT11']]

In [33]:
#Let's rename the WT columns
weather = weather.rename(columns={'WT01': 'Fog', 'WT02': 'Heavy_Fog', 'WT03': 'Thunder', 'WT04': 'Ice_Pellets', 'WT05': 'Hail', 'WT06': 'Glaze_or_Rime', 'WT07': 'Dust_or_Sand', 'WT08': 'Smoke_or_Haze', 'WT09': 'Blowing or Drifting Snow', 'WT10': 'Tornado_or_Funnel_Cloud', 'WT11': 'High_or_Damaging_Winds'})

In [34]:
weather.head()

Element,Station,Date,PRCP,SNOW,TAVG,TMAX,TMIN,ADPT,ASLP,ASTP,AWND,PSUN,RHAV,RHMN,RHMX,WSFG,Fog,Heavy_Fog,Thunder,Ice_Pellets,Hail,Glaze_or_Rime,Dust_or_Sand,Smoke_or_Haze,Blowing or Drifting Snow,Tornado_or_Funnel_Cloud,High_or_Damaging_Winds
0,AQC00914141,20190801,5.0,,,,,,,,,,,,,,,,,,,,,,,,
1,AQC00914141,20190802,0.0,,,,,,,,,,,,,,,,,,,,,,,,
2,AQC00914141,20190803,0.0,,,,,,,,,,,,,,,,,,,,,,,,
3,AQC00914141,20190804,0.0,,,,,,,,,,,,,,,,,,,,,,,,
4,AQC00914141,20190805,135.0,,,,,,,,,,,,,,,,,,,,,,,,


In [35]:
weather.shape

(2379183, 27)

In [36]:
weather[weather['Smoke_or_Haze'].notnull()].shape

(47263, 27)

Looking at the amount of results for each WT, we'll only keep, the fogs, smoke/haze, and thunder ones. The others are too insignificant. 

We'll also merge fog and heavy fog on a scale from 1 to 2. We'll modify the values in fog to 2 where heavy fog is 1 and replace the NANs by 0.

In [37]:
weather.loc[weather['Heavy_Fog'] == 1, 'Fog'] = 2
weather['Fog'] = weather['Fog'].fillna(0)

In [38]:
############## weather[weather['PRCP'].notnull()].shape
############## weather[weather['SNOW'].notnull()].shape
# weather[weather['TAVG'].notnull()].shape
############## weather[weather['TMAX'].notnull()].shape
# weather[weather['TMIN'].notnull()].shape
# weather[weather['ADPT'].notnull()].shape
# weather[weather['ASLP'].notnull()].shape
############# weather[weather['ASTP'].notnull()].shape
############# weather[weather['AWND'].notnull()].shape
# weather[weather['PSUN'].notnull()].shape
############# weather[weather['RHAV'].notnull()].shape
# weather[weather['RHMN'].notnull()].shape
# weather[weather['RHMX'].notnull()].shape
###### # weather[weather['WSFG'].notnull()].shape

Surprisingly enough, we have more TMAX/MIN than TAVG, so we'll work with max as high temperature as more effect on planes than low temperature (lower air density at high temperature)

For pressure we'll use ASTP

For humidity we have slightly more with RHAV, so we'll use this one

For wind, we'll use AWND

Let's format the columns we keep

In [39]:
#Tune values
weather['PRCP'] = weather['PRCP']/10 #PRCP is in tenth of a mm as well
weather['TMAX'] = weather['TMAX']/10 #temp is in tenth of degree
weather['ASTP'] = weather['ASTP']/10 #pressure is originally in (hpa * 10)
weather['AWND'] = weather['AWND']/10 #wind is in tenth of m/s


# Rename columns
weather = weather.rename(columns={'SNOW': 'Snowfall (mm)'})
weather = weather.rename(columns={'PRCP': 'Precipitation (mm)'})
weather = weather.rename(columns={'TMAX': 'Maximum Temperature (*C)'})
weather = weather.rename(columns={'ASTP': 'Avg Pressure for the day (hPa)'})
weather = weather.rename(columns={'AWND': 'Avg Wind Speed (m/s)'})
weather = weather.rename(columns={'RHAV': 'Avg Humidity (%)'})

Let's remove the columns we don't need

In [40]:
weather = weather.drop(['TAVG', 'TMIN', 'ADPT', 'ASLP', 'PSUN', 'RHMN', 'RHMX', 'WSFG', 'Heavy_Fog','Ice_Pellets', 'Hail', 'Glaze_or_Rime', 'Dust_or_Sand', 'Blowing or Drifting Snow', 'Tornado_or_Funnel_Cloud', 'High_or_Damaging_Winds'], axis=1)

# Add weather stations' location information to the dataset

Remember those weather-stations? Let's add the lat, long, and iata code from weather_stations to our weather dataframe. 

In [41]:
weather = pd.merge(weather, weather_stations, left_on='Station', right_on='StationID', how='left')

#drop the stationID column (duplicate added during last step)
weather = weather.drop('StationID', axis=1)

# Split the date column into year, month, and day
weather['Year'] = weather['Date'].apply(lambda x: str(x)[:4])
weather['Month'] = weather['Date'].apply(lambda x: str(x)[4:6])
weather['Day'] = weather['Date'].apply(lambda x: str(x)[6:])

#Switch them to Integer
weather['Year'] = weather['Year'].astype(int)
weather['Month'] = weather['Month'].astype(int)
weather['Day'] = weather['Day'].astype(int)

# Drop the date column
weather = weather.drop('Date', axis=1)



In [42]:
weather.describe()

Unnamed: 0,Precipitation (mm),Snowfall (mm),Maximum Temperature (*C),Avg Pressure for the day (hPa),Avg Wind Speed (m/s),Avg Humidity (%),Fog,Thunder,Smoke_or_Haze,Latitude,Longitude,Year,Month,Day
count,2302364.0,1290966.0,685127.0,275883.0,372155.0,276900.0,2379183.0,43749.0,47263.0,2379183.0,2379183.0,2379183.0,2379183.0,2379183.0
mean,3.088246,2.092188,19.219024,975.808618,3.552319,86.729516,0.06838566,1.0,1.0,37.53879,-95.95755,2019.035,6.602999,15.71715
std,9.402053,15.23931,11.717693,54.124061,1.784641,13.252523,0.2839928,0.0,0.0,6.987813,21.46382,0.8185386,3.41209,8.799691
min,0.0,0.0,-42.8,746.0,0.0,0.0,0.0,1.0,1.0,-14.3,-170.7,2018.0,1.0,1.0
25%,0.0,0.0,10.6,971.9,2.3,82.0,0.0,1.0,1.0,33.4,-107.9,2018.0,4.0,8.0
50%,0.0,0.0,21.1,993.6,3.2,90.0,0.0,1.0,1.0,37.9,-93.3,2019.0,7.0,16.0
75%,1.0,0.0,28.9,1010.2,4.5,96.0,0.0,1.0,1.0,41.9,-81.8,2020.0,10.0,23.0
max,602.0,1080.0,56.1,1044.4,29.3,100.0,2.0,1.0,1.0,71.3,145.7,2020.0,12.0,31.0


We seem to have some PRCP and SNOW outliers.. let's check

In [43]:
weather.sort_values(by='Precipitation (mm)', ascending=False).head(15)

Unnamed: 0,Station,Precipitation (mm),Snowfall (mm),Maximum Temperature (*C),Avg Pressure for the day (hPa),Avg Wind Speed (m/s),Avg Humidity (%),Fog,Thunder,Smoke_or_Haze,Latitude,Longitude,iata_code,Year,Month,Day
1927308,USC00512802,602.0,,,,,,0.0,,,19.8,-155.1,ITO,2018,8,23
1849580,USC00410611,502.9,,27.2,,,,0.0,,,30.1,-94.1,BPT,2019,9,19
1939182,USC00517724,475.0,,,,,,0.0,,,19.8,-155.1,ITO,2018,8,23
326862,US1FLES0026,461.5,,,,,,0.0,,,30.5,-87.2,PNS,2020,9,16
419406,US1HIHI0039,431.5,,,,,,0.0,,,19.7,-155.1,ITO,2018,8,23
1939183,USC00517724,412.8,,,,,,0.0,,,19.8,-155.1,ITO,2018,8,24
418939,US1HIHI0011,411.7,,,,,,0.0,,,19.8,-155.1,ITO,2018,8,25
329790,US1FLES0054,396.0,,,,,,0.0,,,30.5,-87.2,PNS,2020,9,16
2183923,USW00021504,381.0,,26.1,1010.8,3.7,100.0,1.0,1.0,,19.7,-155.1,ITO,2018,8,24
1808166,USC00317170,360.4,0.0,25.0,,,,0.0,,,35.1,-76.9,EWN,2018,9,14


High but it is not a mistake

In [44]:
#view weathers ordered by SNOW in descending order to see outliers
weather.sort_values(by='Snowfall (mm)', ascending=False).head(15)

Unnamed: 0,Station,Precipitation (mm),Snowfall (mm),Maximum Temperature (*C),Avg Pressure for the day (hPa),Avg Wind Speed (m/s),Avg Humidity (%),Fog,Thunder,Smoke_or_Haze,Latitude,Longitude,iata_code,Year,Month,Day
1070273,US1NYTG0008,72.4,1080.0,,,,,0.0,,,42.1,-76.1,BGM,2020,12,17
1028437,US1NYBM0051,,1003.0,,,,,0.0,,,42.1,-76.0,BGM,2020,12,17
1989264,USW00003103,36.6,912.0,-4.9,770.4,3.8,92.0,2.0,,,35.1,-111.7,FLG,2019,2,21
1026830,US1NYBM0024,29.7,699.0,,,,,0.0,,,42.1,-75.9,BGM,2020,12,17
1786655,USC00300684,,671.0,,,,,0.0,,,42.2,-76.0,BGM,2020,12,17
2022002,USW00004725,42.9,671.0,-5.5,955.3,3.6,88.0,2.0,,1.0,42.2,-76.0,BGM,2020,12,17
131451,US1AZYV0028,33.0,648.0,,,,,0.0,,,34.6,-112.5,PRC,2019,2,22
63819,US1AZCN0033,25.1,630.0,,,,,0.0,,,35.2,-111.7,FLG,2019,2,22
67706,US1AZCN0113,,622.0,,,,,0.0,,,35.2,-111.7,FLG,2019,2,22
1776671,USC00274234,30.7,617.0,-1.1,,,,0.0,,,42.8,-71.4,MHT,2018,3,14


That's a lot of snow, but the two largest ones basically confirm each others as they are from different sensors. I guess this is GTG as well.

# Group all stations per airport code

In [45]:
#Just a test
weather[(weather['iata_code'].str.contains('DEN')) & (weather['Year'] == 2018) & (weather['Month'] == 1) & (weather['Day'] == 8)]

Unnamed: 0,Station,Precipitation (mm),Snowfall (mm),Maximum Temperature (*C),Avg Pressure for the day (hPa),Avg Wind Speed (m/s),Avg Humidity (%),Fog,Thunder,Smoke_or_Haze,Latitude,Longitude,iata_code,Year,Month,Day
212105,US1COAD0087,0.0,0.0,,,,,0.0,,,40.0,-104.8,DEN,2018,1,8
213192,US1COAD0120,0.0,,,,,,0.0,,,40.0,-104.8,DEN,2018,1,8
215341,US1COAD0204,0.0,0.0,,,,,0.0,,,40.0,-104.8,DEN,2018,1,8
286210,US1COWE0187,0.0,0.0,,,,,0.0,,,40.0,-104.7,DEN,2018,1,8
1641115,USC00050950,0.0,0.0,10.0,,,,0.0,,,39.9,-104.8,DEN,2018,1,8
1986663,USW00003017,0.0,0.0,9.4,831.4,2.4,78.0,0.0,,,39.8,-104.7,DEN,2018,1,8


This is great! Lots of different sensors

Now let's group those sensors. We'll use a max() to consolidate the values instead of the average so if we have a few precipitations and a bunch of zeros it's not pulling it down

In [46]:
weather = weather.groupby(['Year', 'Month', 'Day', 'iata_code']).max().reset_index()

In [47]:
# Drop Lat and Long as no longer accurate
weather = weather.drop(['Latitude', 'Longitude'], axis=1)

In [48]:
weather.sample(15)

Unnamed: 0,Year,Month,Day,iata_code,Station,Precipitation (mm),Snowfall (mm),Maximum Temperature (*C),Avg Pressure for the day (hPa),Avg Wind Speed (m/s),Avg Humidity (%),Fog,Thunder,Smoke_or_Haze
188536,2019,5,30,CLT,USW00013881,0.0,0.0,35.0,984.4,2.9,71.0,0.0,,1.0
92690,2018,9,10,ORD,USW00094846,0.0,0.0,23.3,992.5,3.0,90.0,0.0,,
353933,2020,8,21,MRY,USW00023259,0.0,0.0,31.1,,2.6,,0.0,,1.0
30294,2018,3,24,JFK,USW00094789,0.0,0.0,10.6,1017.9,5.8,57.0,0.0,,
25047,2018,3,10,CLT,USW00013881,0.5,0.0,13.9,987.5,1.3,100.0,0.0,,1.0
241177,2019,10,20,LNK,USW00014939,0.0,0.0,17.8,961.1,5.6,86.0,0.0,,
207083,2019,7,19,PIE,USW00012873,12.7,0.0,32.2,,2.8,,0.0,1.0,
55588,2018,6,1,GUM,GQW00041415,1.5,0.0,32.2,,5.5,88.0,1.0,,
237187,2019,10,9,PGD,USW00012812,83.8,0.0,33.3,,2.1,,2.0,1.0,
147658,2019,2,7,LAS,USW00053123,0.0,0.0,8.9,946.8,2.5,44.0,0.0,,


In [49]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402757 entries, 0 to 402756
Data columns (total 14 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Year                            402757 non-null  int64  
 1   Month                           402757 non-null  int64  
 2   Day                             402757 non-null  int64  
 3   iata_code                       402757 non-null  object 
 4   Station                         402757 non-null  object 
 5   Precipitation (mm)              399624 non-null  float64
 6   Snowfall (mm)                   342687 non-null  float64
 7   Maximum Temperature (*C)        382023 non-null  float64
 8   Avg Pressure for the day (hPa)  271552 non-null  float64
 9   Avg Wind Speed (m/s)            341327 non-null  float64
 10  Avg Humidity (%)                272540 non-null  float64
 11  Fog                             402757 non-null  float64
 12  Thunder         

# Some last minute tweaks

In [50]:
# Let's switch the NaNs to zeros for Thunder and smoke
weather['Thunder'] = weather['Thunder'].fillna(0)
weather['Smoke_or_Haze'] = weather['Smoke_or_Haze'].fillna(0)

# Save the CSV file

Look at that beautiful weather data! Now let's save it before our computer crash!

In [51]:
weather.to_csv("../../data/processed/flights_enrichment_weather.csv", index=False)