# Copy the flight data to the default datastore

In this notebook, you'll use a series of shell scripts to download the high-dimensional airline data: the Airline Service Quality Performance dataset, distributed by the U.S. Bureau of Transportation Statistics (1987-2021, https://www.bts.dot.gov/browse-statistical-products-and-data/bts-publications/airline-service-quality-performance-234-time).

This dataset is publicly available and is updated on an ongoing basis by the U.S. Bureau of Transportation Statistics.

Each month, the Bureau publishes a new CSV file containing all flight information for the prior month. A single month of data may not be enough to build a robust machine learning model, so the goal is to download several years' worth of this data and combine it into a single CSV file. As written, the download cell in this notebook fetches January through August 2019; widen its `seq` ranges to cover more months or years.
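
One detail to watch when combining: each monthly file carries its own header row, so a plain concatenation repeats the header inside the combined file. The following is a minimal sketch of header-aware combining with the Python standard library. The function name `combine_csv_files` is hypothetical, and the sketch assumes that no quoted field contains an embedded line break.

```python
def combine_csv_files(sources, destination):
    """Concatenate CSV files line by line, keeping only the first header.

    Assumes every file starts with the same header row and that no
    quoted field spans multiple lines. Returns the number of data rows.
    """
    header = None
    rows_written = 0
    with open(destination, "w", newline="") as out:
        for source in sources:
            with open(source, newline="") as src:
                file_header = src.readline()
                if not file_header:
                    continue  # skip empty files
                if header is None:
                    # First non-empty file: keep its header for the output
                    header = file_header.rstrip("\r\n")
                    out.write(header + "\n")
                elif file_header.rstrip("\r\n") != header:
                    raise ValueError(f"{source} has a different header row")
                for line in src:
                    # Guard against a final line without a trailing newline
                    if not line.endswith(("\n", "\r")):
                        line += "\n"
                    out.write(line)
                    rows_written += 1
    return rows_written
```

For this notebook, `sources` would be the sorted list of extracted monthly CSV files and `destination` would be the path stored in `csvfile_full`.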

In addition to the flight data, you'll also download a file containing metadata and geo-coordinates for each airport and a file containing the code mappings for each airline. Airlines and airports rarely change, so these files are static rather than updated monthly. They do, however, contain information that will later need to be mapped onto the full airline dataset. (Megginson, David. "airports.csv", distributed by OurAirports. August 2, 2021. https://ourairports.com/data/airports.csv)

## Get the data

For easy reuse, you'll copy the data to the default datastore, which is an Azure Storage Account. To get the data, you first copy it from the source to the compute instance. Then, you upload it to the datastore.
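
Because the source publishes one archive per month, copying the data starts with enumerating the monthly archive URLs. The following is a minimal sketch that reuses the URL pattern from the download cell in this notebook; the names `BASE_URL`, `FILE_STEM`, and `monthly_archive_urls` are hypothetical and only illustrate how the year and month ranges map to file names.

```python
# URL pattern taken from the wget command in the download cell
BASE_URL = "https://transtats.bts.gov/PREZIP"
FILE_STEM = "On_Time_Reporting_Carrier_On_Time_Performance_1987_present"


def monthly_archive_urls(years, months):
    """Return one archive URL per (year, month) pair, year-major order."""
    return [
        f"{BASE_URL}/{FILE_STEM}_{year}_{month}.zip"
        for year in years
        for month in months
    ]


# Same range as the shell loop: January through August 2019
urls = monthly_archive_urls(years=range(2019, 2020), months=range(1, 9))
print(len(urls))  # 8
print(urls[0])
```

Passing a wider `years` range, for example `range(2017, 2020)`, would list three years of monthly archives.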

In [None]:
# Set the location for data and name for final raw csv
data_dir = "./data"
csvfile_full = f"{data_dir}/airlines_raw_data_full.csv"

In [None]:
%%bash

# Files are stored on transtats.bts.gov, one file per month; download each month for a subset of years
for month in `seq 1 8`; do 
  for year in `seq 2019 2019`; do
    wget -q --no-check-certificate https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${year}_${month}.zip
    unzip -qou On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${year}_${month}.zip
  done
done

In [None]:
%%bash -s "$data_dir"
data_dir="${1}"

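# Remove the temporary .zip files and combine all monthly CSV files into a single large file.
# Each monthly file has its own header row, so keep the header from the first file only.
rm -f On_Time_Reporting_Carrier_On_Time_Performance_1987_present_*zip*
mkdir -p "${data_dir}"
first=1
for f in On_Time_Reporting_Carrier_On_Time_Performance*csv; do
  if [ "${first}" -eq 1 ]; then cat "${f}"; first=0; else tail -n +2 "${f}"; fi
done > "${data_dir}/airlines_raw_data_full.csv"
rm On_Time_Reporting_Carrier_On_Time_Performance*csv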

In [None]:
%%bash -s "$data_dir"
data_dir="${1}"

# Download some static files that describe metadata about airports, airlines, and locations
cd "${data_dir}"
wget -q --no-check-certificate https://ourairports.com/data/airports.csv
wget -q --no-check-certificate https://sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com/airline_csv/carriers.csv

## Explore the data

Let's check whether you have all the data, how much data you have and what it contains.
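
The same checks can also be sketched in Python, which is convenient if you want the numbers in a variable rather than in cell output. This is a minimal sketch using only the standard library; the function name `describe_csv` is hypothetical.

```python
import csv
import os


def describe_csv(path):
    """Return the size in bytes, the line count, and the header fields."""
    # Parse only the first row to get the column names
    with open(path, newline="") as handle:
        columns = next(csv.reader(handle), [])
    # Count physical lines (header included); unlike `wc -l`, an
    # unterminated final line is counted too
    with open(path, "rb") as handle:
        lines = sum(1 for _ in handle)
    return {
        "size_bytes": os.path.getsize(path),
        "lines": lines,
        "columns": columns,
    }
```

Calling `describe_csv(csvfile_full)` would summarize the combined flight file, and the same call works for `airports.csv` and `carriers.csv` in `data_dir`.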

In [None]:
%%bash -s "$data_dir"
data_dir="${1}"

# Print out some debug information about the downloaded files
ls -lh "${data_dir}/airlines_raw_data_full.csv"
ls -lh "${data_dir}/carriers.csv"
ls -lh "${data_dir}/airports.csv"
wc -l  "${data_dir}/airlines_raw_data_full.csv"
wc -l  "${data_dir}/carriers.csv"
wc -l  "${data_dir}/airports.csv"

In [None]:
%%bash -s "$data_dir"
data_dir="${1}"

# Take a look at the headers for each of the CSV files
head -n 1 "${data_dir}/airlines_raw_data_full.csv"
echo
head -n 1 "${data_dir}/carriers.csv"
echo
head -n 1 "${data_dir}/airports.csv"

## Upload all the files to the default datastore

To process the data with cuDF, you need to use the compute cluster, because cuDF runs on GPUs. To give the cluster access to your data, you'll upload all the files to the default datastore (the Azure Storage Account created together with the Azure Machine Learning workspace).
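
Before uploading, it can help to confirm that each local file exists and is non-empty, since the download cells run `wget -q` and print little when a download fails. The following is a minimal pre-upload check; the function name `find_missing_or_empty` is hypothetical and not part of the Azure Machine Learning SDK.

```python
from pathlib import Path


def find_missing_or_empty(paths):
    """Return the paths that are not regular files or have zero size."""
    problems = []
    for path in map(Path, paths):
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(str(path))
    return problems
```

For example, `find_missing_or_empty([csvfile_full, f"{data_dir}/airports.csv", f"{data_dir}/carriers.csv"])` should return an empty list when all three files are ready to upload.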

In [None]:
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
# Get the default datastore
default_ds = ws.get_default_datastore()

default_ds.upload_files(files=['./data/airlines_raw_data_full.csv', './data/airports.csv', './data/carriers.csv'], # Upload the flight, airport, and carrier CSV files in ./data
                       target_path='airport-data/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

You can check whether the data has successfully been uploaded by going to [https://portal.azure.com](https://portal.azure.com). Navigate to the Azure Storage Account created with the Azure Machine Learning workspace (both will be in the same resource group).

In the Storage Account, go to Containers. Open the container whose name starts with `azureml-blobstore` and find the `airport-data` folder.