# Preprocess the data with RAPIDS using GPUs

In this notebook, we submit a job that takes the data in the datastore as input and runs a Python script.

The script uses cuDF, the RAPIDS GPU DataFrame library, to load and preprocess the data.

## Get environment variables

Before we can submit the job, we need to retrieve the objects it depends on: the workspace, the environment, and the input datasets in the datastore.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()

In [None]:
from azureml.core import Environment

rapidsai_env = Environment.get(workspace=ws, name="rapids-mlflow")

In [None]:
# Build the Docker image for the environment ahead of time and stream the build log.
build = rapidsai_env.build(workspace=ws)
build.wait_for_completion(show_output=True)

In [None]:
from azureml.core.dataset import Dataset

default_ds = ws.get_default_datastore()

# Reference each CSV file in the default datastore and mount it on the compute target at run time.
data_full_ds = Dataset.File.from_files(default_ds.path('airport-data/airlines_raw_data_full.csv')).as_mount()
airports_ds = Dataset.File.from_files(default_ds.path('airport-data/airports.csv')).as_mount()
carriers_ds = Dataset.File.from_files(default_ds.path('airport-data/carriers.csv')).as_mount()

## Define the configuration and submit the run

Now that we have defined all necessary variables, we can define the script run configuration and submit the run.

**Warning!** Change the value of the `compute_target` argument to the name of your compute cluster before running the code below!
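
One way to catch a forgotten placeholder before submitting is a small check like the sketch below. It relies only on the `compute_targets` dictionary that a `Workspace` exposes; the helper name `resolve_compute_target` and the stand-in workspace are hypothetical and exist only so the example can be tried without an Azure connection.

```python
from types import SimpleNamespace


def resolve_compute_target(ws, name):
    """Return the compute target called `name`, or raise a descriptive error."""
    # Reject the angle-bracket placeholder used in this notebook.
    if name.startswith("<") and name.endswith(">"):
        raise ValueError("Replace the placeholder with your compute cluster name.")
    # On a real Workspace, compute_targets maps names to compute target objects.
    targets = ws.compute_targets
    if name not in targets:
        available = ", ".join(sorted(targets)) or "none"
        raise ValueError(f"No compute target named {name!r}; available: {available}")
    return targets[name]


# Stand-in for a Workspace (made-up cluster name), used only for illustration.
fake_ws = SimpleNamespace(compute_targets={"gpu-cluster": "gpu-cluster-object"})
print(resolve_compute_target(fake_ws, "gpu-cluster"))
```

In the notebook, you would pass the real `ws` and your own cluster name instead of the stand-in.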

In [None]:
from azureml.core import ScriptRunConfig

# Each mounted dataset is passed to the script as the value of a command-line flag.
src = ScriptRunConfig(source_directory='script',
                      script='preprocess-rapids.py',
                      arguments=['--data-file-full', data_full_ds,
                                 '--airports-file', airports_ds,
                                 '--carriers-file', carriers_ds],
                      compute_target="<your-compute-cluster-name>",
                      environment=rapidsai_env)

To learn what is done during preprocessing, explore the script `preprocess-rapids.py` in the `script` folder.
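
As a rough orientation, the sketch below shows how a script could receive the three flags passed in `arguments` and read the files with cuDF. It is an illustrative assumption, not the contents of `preprocess-rapids.py`: the function names, help texts, and example paths are hypothetical, and it assumes each flag resolves to a readable CSV path on the compute node.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags that the ScriptRunConfig passes to the script."""
    parser = argparse.ArgumentParser(description="Illustrative preprocessing entry point")
    parser.add_argument("--data-file-full", required=True, help="path to the flights CSV")
    parser.add_argument("--airports-file", required=True, help="path to the airports CSV")
    parser.add_argument("--carriers-file", required=True, help="path to the carriers CSV")
    return parser.parse_args(argv)


def load_frames(args):
    """Read each CSV into a GPU DataFrame. Assumes cuDF is installed in the run environment."""
    import cudf  # imported lazily so the argument parsing can be tried without a GPU

    flights = cudf.read_csv(args.data_file_full)
    airports = cudf.read_csv(args.airports_file)
    carriers = cudf.read_csv(args.carriers_file)
    return flights, airports, carriers


# Made-up paths for illustration; during a run, Azure ML supplies the mounted locations.
example = parse_args([
    "--data-file-full", "/mnt/data/airlines_raw_data_full.csv",
    "--airports-file", "/mnt/data/airports.csv",
    "--carriers-file", "/mnt/data/carriers.csv",
])
print(example.data_file_full)  # /mnt/data/airlines_raw_data_full.csv
```

Note that `argparse` turns the hyphens in each flag into underscores, so `--data-file-full` becomes `args.data_file_full`.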

The following cell starts the run. The compute cluster first has to scale up from 0 nodes; once a node is available, it executes the script. The script itself should run quickly, and afterwards you can see the execution time in the **Details** tab of the **Experiment** run.

In [None]:
from azureml.core import Experiment

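# Submit the configuration to the 'preprocess-data' experiment and stream the logs until the run ends.
run = Experiment(ws, 'preprocess-data').submit(src)
run.wait_for_completion(show_output=True)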

You should get a notification in the Studio that a new run has started.

You can also navigate to the **Experiments** tab and find the experiment `preprocess-data` there.

Once the run has finished, open the **Metrics** tab to see how much data was processed. The **Details** tab shows how long the run took.
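
The run time can also be derived in code. The sketch below assumes that the dictionary returned by `run.get_details()` holds `startTimeUtc` and `endTimeUtc` as UTC timestamp strings in the format shown; the helper names and the sample values are made up for illustration.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse a UTC timestamp such as '2021-03-01T10:15:30.000000Z' (assumed format)."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {timestamp!r}")


def run_duration_seconds(details):
    """Return the seconds elapsed between the recorded start and end of a run."""
    start = parse_utc(details["startTimeUtc"])
    end = parse_utc(details["endTimeUtc"])
    return (end - start).total_seconds()


# Made-up sample standing in for the dictionary returned by run.get_details().
sample_details = {
    "startTimeUtc": "2021-03-01T10:15:30.000000Z",
    "endTimeUtc": "2021-03-01T10:17:00.500000Z",
}
print(run_duration_seconds(sample_details))  # 90.5
```

In the notebook, `run_duration_seconds(run.get_details())` should give a comparable figure for the real run, and `run.get_metrics()` returns the logged metrics as a dictionary.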