# Tutorial 01 - Data Preprocessing

We provide a converter that turns raw datasets into [CommonRoad scenarios](https://commonroad.in.tum.de/scenarios) in `.xml` format. The public converter is available [here](https://commonroad.in.tum.de/dataset-converters). In addition, CommonRoad-RL provides tools (`./commonroad_rl/tools/pickle_scenario`) to convert `.xml` scenarios to `.pickle` format, which reduces loading time during training.

This tutorial shows how to use these tools to prepare training and testing data for the highD dataset. A similar procedure applies to the inD dataset.

## 0. Preparation
Please follow the README.md to install the CommonRoad-RL package and make sure of the following:
* the current path is `commonroad-rl/commonroad_rl`, i.e. one level above the `tutorials` folder
* the interactive Python kernel is started from the correct environment

In [None]:
# Check current path
%cd ..
%pwd

# Check interactive python kernel
import sys
sys.executable

## 1. Acquire the dataset
To download the whole raw highD dataset, please go to [the highD home page](https://www.highd-dataset.com).

To facilitate the following exercises, we have prepared sample data under `tutorials/data/highD/raw`, where you should see three `.csv` files recording the track information and one `.jpg` file showing the track background.

## 2. Convert raw .csv data to .xml files

Clone and install [dataset-converters](https://gitlab.lrz.de/tum-cps/dataset-converters/-/tree/master) in the `commonroad-rl/install` folder.

In [None]:
%cd ../install/
!git clone https://gitlab.lrz.de/tum-cps/dataset-converters.git
%cd dataset-converters
!pip install -r requirements.txt 

In [None]:
!python -m src.main highD ../../commonroad_rl/tutorials/data/highD/raw/ ../../commonroad_rl/tutorials/data/highD/xmls/ --num_time_steps_scenario 1000 
%cd ../../commonroad_rl

Now there should be 50 `.xml` files in the output folder `tutorials/data/highD/xmls`.
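As a quick sanity check, the number of converted files can be counted from Python. The snippet below is a minimal sketch, not part of CommonRoad-RL: `count_files` is a hypothetical helper, and it assumes the current path is still `commonroad-rl/commonroad_rl`.

```python
from pathlib import Path


def count_files(folder, pattern="*.xml"):
    """Count the files directly inside `folder` that match `pattern`.

    Hypothetical helper for this tutorial; returns 0 if the folder is missing.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(1 for path in folder.glob(pattern) if path.is_file())


xml_dir = "tutorials/data/highD/xmls"
print(f"{count_files(xml_dir)} .xml files found in {xml_dir}")
```

If the printed count is not 50, re-check the input and output paths passed to the converter.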

## 3. Validate .xml files against CommonRoad .xsd specification

To check if the converted `.xml` files comply with the CommonRoad scenario format, use the validation tool in `commonroad_rl/tools`.

In [None]:
!python -m commonroad_rl.tools.validate_cr -s tools/XML_commonRoad_XSD_2020a.xsd tutorials/data/highD/xmls/*

## 4. Visualize CommonRoad scenarios
There is a visualization tool in `commonroad_rl/tools`, which can be executed by a simple command at the terminal; for example,  
`python -m commonroad_rl.tools.visualize_cr tutorials/data/highD/xmls/DEU_LocationB-3_1_T-1.xml`. 

However, this script does not work in a Jupyter notebook because of a backend error. Therefore, we use the `commonroad-io` package here instead. Let's try it with a sample scenario.

In [None]:
%matplotlib inline
import os
import glob
import matplotlib.pyplot as plt
from IPython import display

from commonroad.common.file_reader import CommonRoadFileReader
from commonroad.visualization.draw_dispatch_cr import draw_object

files = "tutorials/data/highD/xmls/*.xml"
file_path = sorted(glob.glob(files))[0]

# Read in the scenario and planning problem set
scenario, planning_problem_set = CommonRoadFileReader(file_path).open()

# Plot the scenario for 40 time steps; each time step corresponds to 0.1 seconds
for i in range(0, 40):
    # Comment out the line below to keep the full sequence of plots
    display.clear_output(wait=True)
    
    plt.figure(figsize=(20, 10))
    
    # Plot the scenario at time step i
    draw_object(scenario, draw_params={'time_begin': i})
    
    # Plot the planning problem set
    draw_object(planning_problem_set)
    plt.gca().set_aspect('equal')
    plt.show()

## 5. Convert .xml files to .pickle data
Since an RL training/testing session involves tens of thousands of iterations, each of which accesses the scenarios, it is a good idea to convert the `.xml` files to `.pickle` format so that they load more efficiently during training and testing. For example, loading 3000 `.xml` files takes about 2 h, while loading the same number of `.pickle` files takes only 10 min.
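To illustrate the mechanism, here is a minimal sketch of a pickle round trip with a timed load. It uses only the Python standard library; `timed_pickle_roundtrip` and the `fake_scenario` dictionary are hypothetical stand-ins, since real scenarios are CommonRoad objects, and the timing it prints says nothing about the figures quoted above.

```python
import os
import pickle
import tempfile
import time


def timed_pickle_roundtrip(obj):
    """Write `obj` to a temporary .pickle file, load it back, and time the load."""
    fd, path = tempfile.mkstemp(suffix=".pickle")
    os.close(fd)
    try:
        with open(path, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        start = time.perf_counter()
        with open(path, "rb") as f:
            loaded = pickle.load(f)
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return loaded, elapsed


# Hypothetical stand-in for a parsed scenario (not the CommonRoad data model)
fake_scenario = {
    "scenario_id": "example",
    "obstacles": [{"id": i, "x": 0.1 * i} for i in range(1000)],
}
loaded, seconds = timed_pickle_roundtrip(fake_scenario)
print(f"loaded {len(loaded['obstacles'])} obstacles in {seconds:.6f} s")
```

Loading a pickle restores the already-parsed Python object directly, so the `.xml` parsing work does not have to be repeated on every access.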

Furthermore, the conversion tool in `commonroad_rl/tools/pickle_scenario` separates road networks from obstacles, since many scenarios can share the same road network data. Road networks are stored in the `meta_scenario` folder, whereas obstacles are stored in the `problem` folder.

In [None]:
!python -m commonroad_rl.tools.pickle_scenario.xml_to_pickle -i tutorials/data/highD/xmls -o tutorials/data/highD/pickles

Now in the output folder `tutorials/data/highD/pickles`, there should be a `meta_scenario` folder containing meta information and a `problem` folder containing 50 `.pickle` files.
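To peek into the result, one of the generated files can be loaded with the standard `pickle` module. This is a minimal sketch under assumptions: `load_first_pickle` is a hypothetical helper, the internal structure of the pickled objects is not described in this tutorial (so only the type is printed), and unpickling requires the CommonRoad-RL environment so that the stored classes can be imported. Only unpickle files you created yourself.

```python
import pickle
from pathlib import Path


def load_first_pickle(folder):
    """Load the alphabetically first .pickle file in `folder`.

    Hypothetical helper; returns (None, None) if no file is found.
    """
    folder = Path(folder)
    files = sorted(folder.glob("*.pickle")) if folder.is_dir() else []
    if not files:
        return None, None
    with open(files[0], "rb") as f:
        return files[0].name, pickle.load(f)


name, content = load_first_pickle("tutorials/data/highD/pickles/problem")
if name is None:
    print("no .pickle files found")
else:
    print(name, type(content))
```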

## 6. Split .pickle data for training and testing
As a final step, let's randomly split the 50 problems into training and testing sets at a ratio of 7:3, again using a provided script in `commonroad_rl/utils_run`.

In [None]:
!python -m commonroad_rl.utils_run.split_dataset -i tutorials/data/highD/pickles/problem -otrain tutorials/data/highD/pickles/problem_train -otest tutorials/data/highD/pickles/problem_test -tr_r 0.7

Now in `tutorials/data/highD/pickles`, there should be a `problem_train` folder containing 35 pickles and a `problem_test` folder containing 15 pickles.
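The arithmetic behind the 35/15 result can be sketched in a few lines. This is an illustration of a random 7:3 split in general, not the implementation of `split_dataset`; the function `split_items`, the fixed seed, and the file names are assumptions made for the example.

```python
import random


def split_items(items, train_ratio=0.7, seed=0):
    """Shuffle `items` and split them into (train, test) lists.

    Illustrative only; the provided split_dataset script may work differently.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the 50 problem pickles
names = [f"problem_{i}.pickle" for i in range(50)]
train, test = split_items(names)
print(len(train), len(test))  # 35 15
```

Every file ends up in exactly one of the two sets, so the training and testing scenarios do not overlap.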

**Note**: For each data conversion step, we provide a bash script that runs the conversion on multiple threads. To save runtime, please use those scripts instead if you want to convert the whole dataset.

## 7. Separate training data for multi envs (skip this step if not using multi env)
To train the model on multiple environments, the scenarios need to be separated into different folders. We can use a provided script in `commonroad_rl/tools/pickle_scenario` to do this.
Here is an example that separates all `.pickle` files (both train and test) into 5 folders each:

In [None]:
!python -m commonroad_rl.tools.pickle_scenario.copy_files -i tutorials/data/highD/pickles/problem_train -o tutorials/data/highD/pickles/problem_train -f "*.pickle" -n 5

In [None]:
!python -m commonroad_rl.tools.pickle_scenario.copy_files -i tutorials/data/highD/pickles/problem_test -o tutorials/data/highD/pickles/problem_test -f "*.pickle" -n 5

Now in the output folders `tutorials/data/highD/pickles/problem_train` and `tutorials/data/highD/pickles/problem_test`, you should have 5 folders named `0`, `1`, `2`, `3`, `4`, each containing a different part of the scenarios.
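The idea of distributing files over numbered folders can be sketched as follows. This is one possible scheme (round-robin assignment) chosen for illustration; how `copy_files` actually assigns scenarios to folders is not described here and may differ. `partition_round_robin` and the file names are hypothetical.

```python
def partition_round_robin(items, n):
    """Assign `items` to `n` buckets named "0" .. str(n - 1) in round-robin order.

    Illustrative only; the provided copy_files script may distribute differently.
    """
    buckets = {str(i): [] for i in range(n)}
    for index, item in enumerate(items):
        buckets[str(index % n)].append(item)
    return buckets


# Hypothetical file names standing in for the 35 training pickles
train_files = [f"problem_{i}.pickle" for i in range(35)]
buckets = partition_round_robin(train_files, 5)
for folder, files in buckets.items():
    print(folder, len(files))
```

With 35 files and 5 folders, each bucket receives 7 files and no file appears in more than one bucket.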