# Merging Airline Delay and Weather Datasets

In this notebook, we merge two data sources to create richer features for our flight delay classification problem. This involves:
* selecting the columns we wish to keep for later analysis
* converting and cleaning data where required
* handling missing values

### Import required modules

Import and configure the required modules.

In [None]:
# !pip install pandas scikit-learn > /dev/null 2>&1

In [None]:
# Define required imports
import pandas as pd
# These set pandas max column and row display in the notebook
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

### Read datasets

We start by reading in the processed flight delay and weather datasets.

In [None]:
flight_path = 'data/jfk_flight_features.csv'
flight_data = pd.read_csv(flight_path, parse_dates=['flight_date'])
flight_data.head()

In [None]:
weather_path = 'data/jfk_weather_features.csv'
weather_data = pd.read_csv(weather_path, parse_dates=['DATE'])
weather_data.head()

### Merge datasets

The next step is to merge or join the two datasets, such that for each flight record in the flight delay dataset, we have information about the weather conditions present for that flight. 

**Note:** We have to be careful not to "leak" information. Recall that our weather observations come from automated weather station reports that are generated on the 51st minute of each hour. We must ensure that the weather report used for flight delay prediction covers weather conditions present _before_ the flight departure; otherwise we would be giving our model a glimpse into the future!

This makes joining the datasets a little tricky. One simple approach is to join the record for a given flight day and hour with the weather reading for the same day but the _previous hour_. We can do this by extracting two "join keys" from each dataset: the first for the `date` and the second for the `hour` of the record. If we set the `hour` join key for the flight to the hour _before_ the hour of the flight's scheduled departure, then the corresponding weather report comes from the hour before the flight would depart.

In [None]:
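# Parse the scheduled departure time (HHMM) and use the previous hour as the join key.
# errors='ignore' is deprecated and would return unparsed input, on which `.dt` fails.
flight_data.loc[:, 'hour_key'] = pd.to_datetime(flight_data['sched_dep_time'], format='%H%M').dt.hour - 1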
flight_data.loc[:, 'date_key'] = flight_data['flight_date'].dt.date
flight_data.head()

In [None]:
weather_data.loc[:, 'date_key'] = weather_data['DATE'].dt.date
weather_data.loc[:, 'hour_key'] = weather_data['DATE'].dt.hour
weather_data.head()
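
One edge case with the keys above: a flight scheduled between 00:00 and 00:59 gets an `hour_key` of -1, which matches no weather hour, so an inner join would drop it. The sketch below shows one possible way around this, assuming `sched_dep_time` holds zero-padded `HHMM` strings (an assumption about the data, not something this notebook confirms). It builds a full departure timestamp and subtracts one hour, so the keys roll back to 23:00 of the previous day. The toy frame and the `prev_hour` variable are hypothetical.

```python
import pandas as pd

# Toy flights; the real column formats are assumed, not confirmed.
flights = pd.DataFrame({
    'flight_date': pd.to_datetime(['2019-01-02', '2019-01-02']),
    'sched_dep_time': ['0030', '1525'],
})

# Combine the flight date and the HHMM time into one timestamp.
dep_time = pd.to_datetime(flights['sched_dep_time'], format='%H%M')
sched_dep_ts = (flights['flight_date']
                + pd.to_timedelta(dep_time.dt.hour, unit='h')
                + pd.to_timedelta(dep_time.dt.minute, unit='m'))

# Step back one hour, then derive both join keys from the shifted timestamp.
prev_hour = sched_dep_ts - pd.Timedelta(hours=1)
flights['date_key'] = prev_hour.dt.date
flights['hour_key'] = prev_hour.dt.hour
print(flights)
```

With this variant, the 00:30 flight is keyed to hour 23 of the previous day rather than to a non-existent hour -1.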

Next, we join the datasets together based on the "join keys" we have created:

In [None]:
flight_weather_data = flight_data.merge(weather_data, how='inner', on=['date_key', 'hour_key'])
flight_weather_data.head()
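
As an optional check on the join, note that an inner merge silently drops flights with no matching weather row, and would duplicate flights if an hour had more than one weather row. The sketch below uses made-up toy frames (`flight_id` and `temp_c` are hypothetical columns) with a left join, `indicator=True`, and `validate='many_to_one'` to surface both situations.

```python
import pandas as pd

flights = pd.DataFrame({
    'flight_id': [1, 2, 3],
    'date_key': ['2019-01-02'] * 3,
    'hour_key': [14, 14, -1],   # -1 mimics a flight scheduled in hour 0
})
weather = pd.DataFrame({
    'date_key': ['2019-01-02', '2019-01-02'],
    'hour_key': [14, 15],
    'temp_c': [3.0, 2.5],
})

# validate raises MergeError if a (date_key, hour_key) pair is duplicated in weather.
checked = flights.merge(weather, how='left', on=['date_key', 'hour_key'],
                        indicator=True, validate='many_to_one')

# Flights without a weather match are flagged as 'left_only'.
unmatched = checked[checked['_merge'] == 'left_only']
print(f"{len(unmatched)} of {len(flights)} flights have no weather match")
```

Running the same check on the real frames would show how many flights the inner join discards.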

For the first record in our flight dataset, we can see that the flight departs at 15:25. The corresponding weather report is timestamped at 14:51.

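**Note:** All we guarantee here is that the weather report comes from the clock hour before the flight's scheduled departure hour, not from _precisely 1 hour before_ departure. With reports generated on the 51st minute, the gap ranges from 9 minutes (a departure on the hour) to 68 minutes (a departure at minute 59).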
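
If a tighter guarantee is wanted, one alternative (not the approach used in this notebook) is `pandas.merge_asof`, which pairs each flight with the most recent weather report before its departure timestamp. The following is a minimal sketch on toy data; the `sched_dep_ts`, `flight_id`, and `temp_c` columns are hypothetical, and both frames must be sorted by their time keys.

```python
import pandas as pd

flights = pd.DataFrame({
    'flight_id': [1, 2],
    'sched_dep_ts': pd.to_datetime(['2019-01-02 15:25', '2019-01-02 15:55']),
})
weather = pd.DataFrame({
    'DATE': pd.to_datetime(['2019-01-02 13:51', '2019-01-02 14:51',
                            '2019-01-02 15:51']),
    'temp_c': [3.5, 3.0, 2.5],
})

# direction='backward' picks the latest report before departure;
# allow_exact_matches=False excludes a report stamped at the departure time itself.
merged = pd.merge_asof(
    flights.sort_values('sched_dep_ts'),
    weather.sort_values('DATE'),
    left_on='sched_dep_ts', right_on='DATE',
    direction='backward', allow_exact_matches=False,
    tolerance=pd.Timedelta(hours=2),  # ignore reports older than 2 hours
)

# How old is the matched report at the scheduled departure time?
merged['report_age'] = merged['sched_dep_ts'] - merged['DATE']
print(merged)
```

In practice a prediction is made some time before departure, so it may be appropriate to shift `sched_dep_ts` back by the desired lead time before joining.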

### Save the Merged Data

Finally, we save the merged dataset for use by downstream tasks.

In [None]:
flight_weather_data.to_csv('data/jfk_flight_weather_features.csv', index=False, float_format='%g')
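
Note that `float_format='%g'` writes floats with at most six significant digits, which keeps the file compact but rounds larger or more precise values. A quick illustration with made-up numbers, written to an in-memory buffer rather than a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'value': [1013.25, 0.1234567, 123456.789]})

# Write to an in-memory buffer using the same float format as above.
buffer = io.StringIO()
df.to_csv(buffer, index=False, float_format='%g')
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows which values were rounded on the way out.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip['value'].tolist())
```

If downstream tasks need more precision, a format such as `'%.10g'` or omitting `float_format` would be options to consider.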

<a id="authors"></a> 
### Authors

This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.