# Preprocess the data with RAPIDS using GPUs

In this notebook, you'll use a subset of high-dimensional airline data: the Airline Service Quality Performance dataset, distributed by the U.S. Bureau of Transportation Statistics and covering 1987-2021 (https://www.bts.dot.gov/browse-statistical-products-and-data/bts-publications/airline-service-quality-performance-234-time).

This dataset is publicly available and updated on an ongoing basis by the U.S. Bureau of Transportation Statistics.

Each month, the Bureau publishes a new CSV file containing all flight information for the prior month. To train a robust machine learning model, you'd want to combine data from multiple years into a single training dataset. In this exercise, you'll use only 10 days of data for illustration purposes. Even with much larger amounts of data, however, the script should execute quickly because it uses cuDF to load and preprocess the data on the GPU.
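
Combining several monthly files into one DataFrame can be sketched as below. This is a minimal illustration, not the code in `preprocess-rapids.py`: the function name `load_flight_files` and the `df_lib` parameter are hypothetical. It relies only on `read_csv` and `concat`, which cuDF and pandas both provide, so you can pass `pandas` as `df_lib` to try it on a machine without a GPU.

```python
def load_flight_files(paths, df_lib=None):
    """Read several monthly CSV files and stack them into one DataFrame.

    `df_lib` is a hypothetical hook: any module exposing `read_csv` and
    `concat` (cuDF by default, or pandas for CPU-only experiments).
    """
    if df_lib is None:
        # Imported lazily because cuDF needs a GPU-enabled environment.
        import cudf as df_lib

    if not paths:
        raise ValueError("no input files given")

    # Sort the paths so the row order is reproducible between runs.
    frames = [df_lib.read_csv(path) for path in sorted(paths)]

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return df_lib.concat(frames, ignore_index=True)
```

With cuDF installed, `load_flight_files(["2021-01.csv", "2021-02.csv"])` would return a single GPU DataFrame; the file names here are placeholders.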

In addition to the flight data, you'll also download a file containing metadata and geo-coordinates for each airport, and a file containing the code mappings for each airline. Airlines and airports rarely change, so these files are static and are not updated on a monthly basis. They do, however, contain information that will later be mapped onto the full airline dataset. (Megginson, David. "airports.csv", distributed by OurAirports. August 2, 2021. https://ourairports.com/data/airports.csv)

## Get environment variables

Before you can submit the job, you need to retrieve the objects it depends on: the workspace and the environment the script will run in.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()

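In [None]:
from azureml.core import Environment
from azureml.core.runconfig import DockerConfiguration

# Retrieve the registered environment by name from the workspace
rapidsai_env = Environment.get(workspace=ws, name="rapids-mlflow")
# Extra arguments passed to the Docker run command
d_config = DockerConfiguration(arguments=['-it'])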

## Define the configuration and submit the run

Now that you have defined all necessary variables, you can define the script run configuration and submit the run.

**Warning!** Change the value of the `compute_target` argument to the name of your compute cluster before running the code below!

In [None]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory='script',
                      script='preprocess-rapids.py',
                      compute_target="<your-compute-cluster>",
                      environment=rapidsai_env,
                      docker_runtime_config=d_config)
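
To catch a forgotten or mistyped cluster name before submitting, you could look the name up in the workspace first. The sketch below is an optional addition, not part of the original notebook: the helper name `get_compute_target` is hypothetical, and it assumes only that the workspace object exposes `compute_targets`, the dictionary of compute targets keyed by name that `azureml.core.Workspace` provides.

```python
def get_compute_target(ws, name):
    """Return the compute target called `name`, or fail with a helpful message.

    `ws` is expected to be an azureml.core.Workspace; only its
    `compute_targets` dictionary (name -> compute target) is used.
    """
    targets = ws.compute_targets
    if name not in targets:
        available = ", ".join(sorted(targets)) or "none"
        raise ValueError(
            f"Compute target '{name}' not found. Available targets: {available}"
        )
    return targets[name]
```

The returned object can be passed as `compute_target` to `ScriptRunConfig` in place of the name string.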

To learn what is done during preprocessing, explore the script `preprocess-rapids.py` in the `script` folder.

The following cell initiates the run. Note that the compute cluster first has to scale up from 0 nodes; once a node is available, it executes the script. The script itself should run quickly, and you can see the execution time in the **Details** tab of the **Experiment** run afterwards.

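In [None]:
from azureml.core import Experiment

# Submit the configuration to the 'preprocess-data' experiment
run = Experiment(ws,'preprocess-data').submit(src)
# Block until the run finishes and stream its logs to the notebook
run.wait_for_completion(show_output=True)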

You should get a notification in the Studio that a new run has started and is running. 

You can also navigate to the **Experiments** tab, and find the experiment `preprocess-data` there. 

Once the run has finished, have a look at the **Metrics** tab to learn how much data was processed. In the **Details** tab, you can see how long the run took. You'll also find the processed data in the **outputs** folder on the **Outputs+logs** tab.
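
The same information can also be retrieved from the notebook instead of the Studio. The following is a minimal sketch, assuming `run` is the `azureml.core.Run` object returned by `submit`; the helper name `summarize_run` and the default download folder `preprocessed` are hypothetical. It uses the `get_status`, `get_metrics`, `get_file_names`, and `download_files` methods of `Run`.

```python
def summarize_run(run, output_directory="preprocessed"):
    """Collect status, logged metrics, and output files of a finished run.

    `run` is expected to be an azureml.core.Run. Files under `outputs/`
    are downloaded to `output_directory` only if the run completed.
    """
    status = run.get_status()
    metrics = run.get_metrics()

    # Keep only the files the script wrote to the run's outputs folder.
    output_files = [
        name for name in run.get_file_names() if name.startswith("outputs/")
    ]

    if status == "Completed" and output_files:
        run.download_files(prefix="outputs/", output_directory=output_directory)

    return {"status": status, "metrics": metrics, "output_files": output_files}
```

Calling `summarize_run(run)` after `run.wait_for_completion(...)` returns a dictionary you can inspect directly; which metrics appear depends on what `preprocess-rapids.py` logs.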