# Download Primary and Secondary Datasets

The primary dataset comprises eight months (May–December, 2019) of flight and weather data from US airports from the [Historical Flight Delay and Weather Data USA](https://www.kaggle.com/datasets/ioanagheorghiu/historical-flight-and-weather-data) dataset on [Kaggle](https://www.kaggle.com/).

The data was originally sourced from the [United States Bureau of Transportation Statistics](https://www.bts.gov/browse-statistical-products-and-data/bts-publications/airline-service-quality-performance-234-time) and the [National Oceanic and Atmospheric Administration](https://www.ncdc.noaa.gov/cdo-web/datatools/lcd).

<hr>

The secondary dataset comes from [The Global Airport Database](https://www.partow.net/miscellaneous/airportdatabase/index.html). It contains a list of airport codes along with their three-dimensional geographic coordinates (latitude, longitude, and altitude).

The data types for its columns (useful for setting up a database schema) can be found at the link above.

In [1]:
# Primary dataset; format: Kaggle USERNAME/DATASET
datasource_primary = 'ioanagheorghiu/historical-flight-and-weather-data'

# Secondary dataset; format: URL
datasource_secondary = 'https://www.partow.net/downloads/GlobalAirportDatabase.zip'

In [2]:
import os

In [3]:
# Ensure the necessary folder structure exists, and make `resources` the working directory
os.chdir('..')
os.makedirs('resources', exist_ok=True)
os.chdir('resources')
data_dir = os.path.join('.', 'data')
for d in ['config', data_dir, 'images']:
    os.makedirs(d, exist_ok=True)

In [4]:
# Filepaths to downloaded datasets
dataset_primary = os.path.join(data_dir,os.path.basename(datasource_primary) + '.zip')
dataset_secondary = os.path.join(data_dir,os.path.basename(datasource_secondary))

### Install the Kaggle API

If you do not already have the Kaggle API installed, **enable the cell below** by converting it to Cell Type `Code`. (In the Jupyter Notebook menus, select `Cell` > `Cell Type` > `Code`.) Then go to the [Kaggle API documentation](https://www.kaggle.com/docs/api) and follow the `Authentication` instructions.

Additional details about using the Kaggle API with Jupyter Notebook can be found in [this Kaggle notebook](https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook), [this TechnoWhisp article](https://technowhisp.com/kaggle-api-python-documentation/), or [this Stack Overflow answer](https://stackoverflow.com/a/60309843).

### Important
Your `kaggle.json` API key file must be in the proper location as specified in the `Authentication` instructions above.
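
Before calling the API, it can help to confirm that the key file is where the client will look for it. The sketch below assumes the locations described in the Kaggle API documentation: the directory named by the `KAGGLE_CONFIG_DIR` environment variable if it is set, otherwise `~/.kaggle`. The helper name `find_kaggle_key` is hypothetical and is not part of the `kaggle` package.

```python
import os


def find_kaggle_key(env=None):
    """Return the path to kaggle.json if it exists, otherwise None.

    Assumption: the key lives in $KAGGLE_CONFIG_DIR if that variable is set,
    and in ~/.kaggle otherwise.
    """
    env = os.environ if env is None else env
    config_dir = env.get('KAGGLE_CONFIG_DIR') or os.path.join(
        os.path.expanduser('~'), '.kaggle'
    )
    key_path = os.path.join(config_dir, 'kaggle.json')
    return key_path if os.path.isfile(key_path) else None


key_path = find_kaggle_key()
print(key_path or 'kaggle.json not found; see the Authentication instructions.')
```

If the function returns `None`, authentication in the next section is likely to fail until the file is moved into place.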

## Download Primary Dataset

In [5]:
import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi

In [6]:
kag = KaggleApi()
kag.authenticate()

In [7]:
# Download primary dataset from Kaggle
kag.dataset_download_files(
    dataset=datasource_primary,
#     unzip=True,
    path=data_dir,
)

print('Download complete.')

Download complete.


## Download Secondary Dataset

In [8]:
import requests

In [9]:
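# Download the secondary dataset, unless it has already been downloaded
if os.path.exists(dataset_secondary):
    print('Download complete. (File already exists.)')
else:
    # Stream the response so the archive is written to disk in chunks
    with requests.get(datasource_secondary, stream=True, timeout=30) as response:
        response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
        with open(dataset_secondary, 'wb') as dl_file:
            for chunk in response.iter_content(chunk_size=8192):
                dl_file.write(chunk)
    print('Download complete.')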

Download complete.
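
A failed or interrupted download can leave a file that is not a valid archive. As a minimal sketch, the standard library can check a downloaded file before it is extracted. The helper name `check_zip` is hypothetical; `zipfile.is_zipfile` and `ZipFile.testzip` are standard-library calls.

```python
import zipfile


def check_zip(path):
    """Return True if `path` is a readable ZIP archive with no corrupt members."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path, 'r') as archive:
        # testzip() returns the name of the first bad member, or None if all are intact
        return archive.testzip() is None
```

In this notebook it could be called as `check_zip(dataset_primary)` and `check_zip(dataset_secondary)` before the extraction step.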


## Unzip Downloaded Datasets

In [10]:
import zipfile

In [11]:
for dataset in [dataset_primary,dataset_secondary]:
    with zipfile.ZipFile(dataset, 'r') as data_zip:
        # Extract only files that do not already exist
        member_list = (set(data_zip.namelist()) - set(os.listdir(data_dir)))
        data_zip.extractall(data_dir, members=member_list)
print('Extraction complete.')

Extraction complete.
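
The comparison against `os.listdir(data_dir)` only sees top-level names, so any members stored inside subfolders of an archive would be extracted again on every run. If that matters, one alternative is to test each member's full target path. A sketch under that assumption, with the hypothetical helper name `extract_missing`:

```python
import os
import zipfile


def extract_missing(zip_path, target_dir):
    """Extract only the archive members not already present under target_dir.

    Returns the list of member names that were extracted.
    """
    with zipfile.ZipFile(zip_path, 'r') as archive:
        # Compare each member's full destination path, including subfolders
        missing = [
            name for name in archive.namelist()
            if not os.path.exists(os.path.join(target_dir, name))
        ]
        archive.extractall(target_dir, members=missing)
    return missing
```

Calling `extract_missing(dataset, data_dir)` inside the loop above would stand in for the set difference. Like the original, it checks only whether a file exists, not whether its contents match the archive.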
