
# Combining Reinforcement Learning with Monte Carlo Tree Search

## Motivation
Deep Reinforcement Learning (DRL) learns a mapping from the observed state of the world to the best action to take, which is especially useful in constantly changing and possibly high-dimensional worlds. This mapping is learned without any prior knowledge of that world. To do so, a deep neural network is trained offline using rewards and penalties. Once trained, the network can act online and is real-time capable, since no time-limited optimization is performed during execution. Such behavior is desirable for autonomous cars, which have to act and react fast to ensure safety and efficiency.

On the other hand, there are decision processes such as Monte Carlo Tree Search (MCTS), an informed search algorithm that has no training phase. Instead, it uses a heuristic to find the best action to take. One major challenge with vanilla MCTS is scalability to larger environments with long episodes, which can render the algorithm incapable of finding a solution in real time.

You might ask: if DRL is already real-time capable on its own, why are we even talking about MCTS? The problem with DRL is that it may not be robust when a completely new scenario is encountered, and the resulting unexpected behavior can be dangerous. A newer AlphaGo implementation that combines DRL with Monte Carlo Tree Search has shown great improvements in performance. For this reason, we want to implement that idea in our project and see whether it achieves better results in autonomous driving. Concretely, we use the Q-values from the DRL agent as a heuristic that guides the tree search.


## Pipeline
We start in BARK-ML, where we use Python to train a DRL agent known as the categorical DQN (CDQN). Because gathering information about the world in the vicinity of our ego car is a speed-critical part of the project, we compute the observation of the world in C++ instead of Python, since C++ is considerably faster for this task.  
Afterwards, we save the trained Q-network in Python and load it in C++ for the later, time-constrained evaluations.

In parallel, we work in planner-mcts and implement three different heuristic functions in C++ that influence the choice of moves in the tree search.

Finally, we evaluate our results in BARK using Python, calling the C++ observer and the Q-network evaluator from BARK-ML and accessing the different heuristics from planner-mcts.
Let's now discuss each step in more detail.

## Training and Model Saving
The first step is to train the CDQN Agent. This is achieved by doing the following steps:

1. Clone bark-ml from https://github.com/SebastianGra/bark-ml_MCTS_RL, follow the steps described there to create a virtual environment, and activate it.

2. In examples/tfa.py (or examples/tfa_discrete.py, respectively), change the paths in lines 43, 44 and 45, where the checkpoints, summaries and model are saved, to your local paths.

3. In bark_ml/behaviors/discrete_behavior.py, change the number of possible discrete actions. This later influences how long the tree search takes to go through the possible actions. For faster results, reduce the number of actions from 10 and 5 (lines 25 and 29, respectively) to 4 and 2. This results in 8 possible combinations of actions that the ego vehicle can perform.

4. In bark_ml/library_wrappers/lib_tf_agents/tfa_wrapper.py, go to line 22 and change the dtype from float32 to `int32`.

5. Finally, train the agent with the command `bazel run //examples:tfa -- --mode=train`. This also saves the Q-network, which the tree search later needs to evaluate the Q-values for a given state of the environment.
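
To make the effect of step 3 concrete, the following sketch counts the action combinations before and after the reduction. It is only an illustration: the names `n_first` and `n_second` are hypothetical, since the text does not name the two action dimensions, and the values inside each dimension are placeholders. Only the counts (10 and 5 versus 4 and 2) are taken from the steps above.

```python
from itertools import product

def count_action_combinations(n_first: int, n_second: int) -> int:
    """Count the discrete actions obtained by combining two action dimensions."""
    first_dim = range(n_first)    # placeholder values for the first dimension
    second_dim = range(n_second)  # placeholder values for the second dimension
    return len(list(product(first_dim, second_dim)))

default_count = count_action_combinations(10, 5)
reduced_count = count_action_combinations(4, 2)
print(default_count, reduced_count)
```

Since every node of the search tree can branch into one child per action combination, a smaller action set directly reduces the work per tree level.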

## Testing

Testing is an important part of our work. In our project you can test both the model loader and the observer we have implemented.

To test the **model loader** do the following:

1. In the file bark_ml/tests/model_loader_test.cc, change the path in line 15 to the same local path used before to save the model.

2. Clone the TensorFlow library from https://github.com/steven-guo94/libtensorflow_so

3. Run `export LD_LIBRARY_PATH=/.../libtensorflow_so/libtensorflow/lib` in the terminal, replacing the path with the local path of the TensorFlow library.

4. Run this command in the terminal to test the saved model `bazel run //bark_ml/tests:model_loader_test`.

To test the **observer**, an additional, more comprehensive observer test has been implemented in C++. It verifies several properties of the observer:

- The concatenated state array has the correct length for a given number of observed agents.
- Agents too far away from the ego position are filtered out.
- The observed agent states are stacked into the concatenated state array in the correct order (ego first, then from closest to farthest).
- The normalization of the state values works correctly.

To execute the test together with the new observer, do the following:

1. Ensure that line 11 in observer_test.cc says `#include "bark_ml/observers/nearest_observer.hpp"` so it uses the correct observer.

2. Execute the test via command `bazel run //bark_ml/tests:observer_test`.
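
The properties checked by the observer test can be illustrated with a small, self-contained sketch. This is not the C++ `nearest_observer` from the repository: the function name, its parameters, the reduction of an agent state to `(x, y)`, and the zero-padding of missing agents are all assumptions made for illustration.

```python
import math

def observe_nearest(ego, others, max_num_agents, max_distance, x_range, y_range):
    """Return a flat, normalized state list: ego first, then the nearest agents."""
    def dist(agent):
        return math.hypot(agent[0] - ego[0], agent[1] - ego[1])

    def normalize(value, value_range):
        low, high = value_range
        return (value - low) / (high - low)

    # Drop agents that are too far away, then sort the rest by distance to the ego.
    nearby = sorted((a for a in others if dist(a) <= max_distance), key=dist)
    selected = [ego] + nearby[:max_num_agents]

    observation = []
    for x, y in selected:
        observation += [normalize(x, x_range), normalize(y, y_range)]

    # Assumption: pad with zeros so the length is independent of the traffic density.
    missing = max_num_agents - (len(selected) - 1)
    observation += [0.0] * (2 * missing)
    return observation

obs = observe_nearest(
    ego=(0.0, 0.0),
    others=[(30.0, 0.0), (10.0, 0.0), (500.0, 0.0)],
    max_num_agents=2,
    max_distance=100.0,
    x_range=(-100.0, 100.0),
    y_range=(-100.0, 100.0),
)
print(obs)
```

In this toy call the agent at `x = 500` is filtered out, the remaining two agents are ordered by distance, and every value lies in `[0, 1]` after normalization.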




## Monte Carlo Tree Search with Heuristics

As mentioned in Motivation, MCTS uses different heuristics to find the best action to take. You can access the different heuristics from planner-mcts.

To play with **planner-mcts** do the following:

1. Get the code locally: `git clone https://github.com/bark-simulator/planner-mcts.git`. 

2. Switch to the branch which we are working on: `git checkout migrate_to_new_bark`.

3. Set up the virtual environment: `bash util/setup_test_venv.sh`.

4. Get into the venv: `source util/into_test_venv.sh`.

In our project, three heuristics are used:

1. **Random heuristic**: which is defined here https://github.com/juloberno/mamcts/blob/master/mcts/heuristics/random_heuristic.h .

2. **Domain heuristic**: which is defined here https://github.com/bark-simulator/planner-mcts/blob/migrate_to_new_bark/bark_mcts/models/behavior/heuristics/domain_heuristic.hpp . To test the domain heuristic run this command in terminal: `bazel test //bark_mcts/models/behavior/tests:single_agent_domain_heuristic_test`.
    
3. **NN heuristic**: which is defined here https://github.com/bark-simulator/planner-mcts/blob/migrate_to_new_bark/bark_mcts/models/behavior/heuristics/nn_heuristic.hpp . To test the NN heuristic run this command in terminal: `bazel test //bark_mcts/models/behavior/tests:single_agent_nn_heuristic_test`.
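
To illustrate the difference between a rollout-based heuristic and a learned one, here is a minimal sketch on a toy one-dimensional world. It does not reproduce the C++ heuristics linked above: the function names, the toy environment, and the choice of the maximum Q-value as the leaf estimate are assumptions, and `toy_q_values` is a hypothetical stand-in for the trained Q-network.

```python
import random

def random_heuristic(state, step, actions, is_terminal, max_depth=20, rng=random):
    """Estimate a leaf value by a random rollout (sum of collected rewards)."""
    total = 0.0
    for _ in range(max_depth):
        if is_terminal(state):
            break
        state, reward = step(state, rng.choice(actions))
        total += reward
    return total

def nn_heuristic(state, q_values):
    """Estimate a leaf value from Q-values: the value of the best action."""
    return max(q_values(state))

# Toy 1-D world: the state is a position and the goal is at position 5.
ACTIONS = (-1, +1)

def step(state, action):
    next_state = state + action
    return next_state, (1.0 if next_state == 5 else 0.0)

def is_terminal(state):
    return state == 5

def toy_q_values(state):
    # Hypothetical stand-in for the Q-network: values grow toward the goal.
    return [0.9 ** abs(5 - (state + a)) for a in ACTIONS]

nn_estimate = nn_heuristic(3, toy_q_values)
rollout_estimate = random_heuristic(3, step, ACTIONS, is_terminal, rng=random.Random(0))
print(nn_estimate, rollout_estimate)
```

The rollout needs many simulated steps per leaf and its result varies between runs, whereas the Q-value estimate costs a single evaluation of the value function.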



## Benchmarking

Systematically benchmarking behavior models requires
1. A reproducible set of scenarios (we call it **BenchmarkDatabase**)
2. Metrics, which you use to study the performance (we call it **Evaluators**)
3. The behavior model(s) under test

To run this benchmarking notebook, you should run the following commands in terminal:
1. Download the bark repository: `git clone https://github.com/Lizhu-Chen/bark.git`.
    
2. Download the libtensorflow: `git clone https://github.com/steven-guo94/libtensorflow_so.git`.
    
3. Export the local path of libtensorflow: `export LD_LIBRARY_PATH=/.../libtensorflow_so/libtensorflow/lib` (replace the path with your local one).

4. Open bark in the terminal, and switch to the branch: `git checkout practical_course_mcts_rl`.
    
5. Set up and enter the virtual environment: `bash install.sh` and then `source dev_into.sh`.
    
6. Run the notebook: `bazel run //docs/tutorials:run --define planner_uct=true`.
    

Our **BenchmarkRunner** can then run the benchmark and produce the results.

In [1]:
import os
import unittest
import ray
import matplotlib.pyplot as plt
from IPython.display import Video

from bark.runtime.scenario.scenario import *
from benchmark_database.load.benchmark_database import BenchmarkDatabase
from benchmark_database.serialization.database_serializer import DatabaseSerializer
from bark.benchmark.benchmark_runner import BenchmarkRunner, BenchmarkConfig, BenchmarkResult
from bark.benchmark.benchmark_runner_mp import BenchmarkRunnerMP
from bark.benchmark.benchmark_analyzer import BenchmarkAnalyzer

from bark.core.world.evaluation import *
from bark.runtime.commons.parameters import ParameterServer

from bark.runtime.viewer.matplotlib_viewer import MPViewer
from bark.runtime.viewer.video_renderer import VideoRenderer


from bark.core.models.behavior import BehaviorUCTSingleAgentMacroActions


### Database
The benchmark database provides a reproducible set of scenarios.
A scenario is created by a ScenarioGenerator (we have a couple of them). The scenarios are serialized into binary files (ending in `.bark_scenarios`) and packed together with the map file and the parameter files into a `.zip` archive. We call this zipped archive a release. It can be published on GitHub or processed locally; the local release is named `benchmark_database_test.zip`.
It is usually saved in the `bark/bazel-bin/examples/benchmark_database.runfiles/benchmark_database_release/database` folder. Find the local path of the `.zip` file and adapt the call `dbs.process("your local path of the .zip file")` accordingly.



### DatabaseSerializer

We start with the **DatabaseSerializer**, which recursively serializes all sets of scenario parameter files within a folder.

We will process the database directory from GitHub.

In [None]:
dbs = DatabaseSerializer(test_scenarios=2, test_world_steps=2, num_serialize_scenarios=10)
dbs.process("/home/vivienne/Praktikum/fork/bark/bazel-bin/modules/benchmark/tests/py_benchmark_process_tests.runfiles/benchmark_database")  # replace with your local path
local_release_filename = dbs.release(version="test")

print('Filename:', local_release_filename)

Then reload the release to verify that it is parsed correctly.

In [None]:
db = BenchmarkDatabase(database_root=local_release_filename)
scenario_generation, _ = db.get_scenario_generator(scenario_set_id=0)

for scenario_generation, _ in db:
  print('Scenario: ', scenario_generation)

### Evaluators

Evaluators compute a boolean, integer or real-valued metric based on the current simulation world state.

The evaluators currently available in BARK are:
- StepCount: returns the step count the scenario is at.
- GoalReached: checks if a controlled agent’s Goal Definition is satisfied.
- DrivableArea: checks whether the agent is inside its RoadCorridor.
- Collision(ControlledAgent): checks whether any agent or only the currently controlled agent collided.

Let's now map those evaluators to symbols that are easier to interpret.

In [None]:
evaluators = {"success": "EvaluatorGoalReached",
              "collision": "EvaluatorCollisionEgoAgent",
              "max_steps": "EvaluatorStepCount"}

We will now define the terminal conditions of our benchmark. We state that a scenario ends if
- a collision occurred
- the number of time steps exceeds the limit
- the definition of success becomes true (which we defined as reaching the goal, using EvaluatorGoalReached)
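
Each terminal condition can be read as a predicate applied to the result of one evaluator. The sketch below shows this idea in plain Python. It is not the BenchmarkRunner's actual implementation: `check_terminal` and the two example result dictionaries are hypothetical, and the step limit of 40 is taken from the notebook cell that defines the conditions.

```python
def check_terminal(evaluation_results, terminal_when):
    """Return the names of all terminal conditions that currently hold."""
    return [name for name, condition in terminal_when.items()
            if condition(evaluation_results[name])]

example_terminal_when = {
    "collision": lambda x: x,
    "max_steps": lambda x: x > 40,
    "success": lambda x: x,
}

# Hypothetical evaluator results at two different points of a simulation.
still_running = {"collision": False, "max_steps": 12, "success": False}
timed_out = {"collision": False, "max_steps": 41, "success": False}

running_hits = check_terminal(still_running, example_terminal_when)
timeout_hits = check_terminal(timed_out, example_terminal_when)
print(running_hits, timeout_hits)
```

A scenario would stop as soon as the returned list is non-empty, and the names in it tell why the run ended.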

In [None]:
terminal_when = {"collision": lambda x: x,
                 "max_steps": lambda x: x > 40,
                 "success": lambda x: x}

### Behaviors Under Test
Let's now define the parameters for different heuristics.

In [None]:
scenario_param_file ="macro_action_params.json" # must be within examples params folder
params1 = ParameterServer(filename= os.path.join("examples/mcts_rl/params/",scenario_param_file))
params2 = ParameterServer(filename= os.path.join("examples/mcts_rl/params/",scenario_param_file))
params3 = ParameterServer(filename= os.path.join("examples/mcts_rl/params/",scenario_param_file))

Parameters of **Random Heuristic**

In [None]:
params1["BehaviorUctSingleAgent"]["UseRandomHeuristic"]=True
params1["BehaviorUctSingleAgent"]["UseNNHeuristic"]=False
params1["BehaviorUctSingleAgent"]["Mcts"]["UctStatistic"]["ReturnLowerBound"] = -10000.0
params1["BehaviorUctSingleAgent"]["Mcts"]["UctStatistic"]["ReturnUpperBound"] = 10000.0

Parameters of **Domain Heuristic**

In [None]:
params2["BehaviorUctSingleAgent"]["UseRandomHeuristic"]=False
params2["BehaviorUctSingleAgent"]["UseNNHeuristic"]=False
params2["BehaviorUctSingleAgent"]["Mcts"]["UctStatistic"]["ReturnLowerBound"] = -10000.0
params2["BehaviorUctSingleAgent"]["Mcts"]["UctStatistic"]["ReturnUpperBound"] = 10000.0

Parameters of **NN Heuristic**

In [None]:
params3["BehaviorUctSingleAgent"]["UseRandomHeuristic"]=False
params3["BehaviorUctSingleAgent"]["UseNNHeuristic"]=True
params3["BehaviorUctSingleAgent"]["Mcts"]["UctStatistic"]["ReturnLowerBound"] = -10000.0
params3["BehaviorUctSingleAgent"]["Mcts"]["UctStatistic"]["ReturnUpperBound"] = 10000.0
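
A common reason to give a UCT planner a lower and an upper return bound is to normalize returns before they enter the UCB formula. The sketch below assumes that this is the role of `ReturnLowerBound` and `ReturnUpperBound` here. The function names, the clipping, the exploration constant and the exact form of the bonus term are illustrative assumptions and may differ from what the planner does.

```python
import math

def normalize_return(value, lower_bound, upper_bound):
    """Min-max normalize a return into [0, 1], clipping values outside the bounds."""
    clipped = min(max(value, lower_bound), upper_bound)
    return (clipped - lower_bound) / (upper_bound - lower_bound)

def ucb_score(mean_return, action_visits, node_visits,
              lower_bound=-10000.0, upper_bound=10000.0, exploration=0.7):
    """UCB1-style score: normalized mean return plus an exploration bonus."""
    exploitation = normalize_return(mean_return, lower_bound, upper_bound)
    bonus = exploration * math.sqrt(math.log(node_visits) / action_visits)
    return exploitation + bonus

midpoint = normalize_return(0.0, -10000.0, 10000.0)
rarely_tried = ucb_score(mean_return=500.0, action_visits=2, node_visits=100)
often_tried = ucb_score(mean_return=500.0, action_visits=50, node_visits=100)
print(midpoint, rarely_tried, often_tried)
```

Under these assumptions, wide bounds compress typical returns into a narrow band around 0.5, so the exploration bonus carries relatively more weight in the action choice.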

We use the same behavior model for the three heuristics; only the parameters differ.

In [None]:
behaviors_tested = {
    "RandomHeuristic": BehaviorUCTSingleAgentMacroActions(params1),
    "DomainHeuristic": BehaviorUCTSingleAgentMacroActions(params2),
    "NNHeuristic": BehaviorUCTSingleAgentMacroActions(params3),
}

### Benchmark Runner

The BenchmarkRunner evaluates behavior models with different parameter configurations over the entire benchmarking database.

In [None]:
benchmark_runner = BenchmarkRunner(benchmark_database=db,
                                   evaluators=evaluators,
                                   terminal_when=terminal_when,
                                   behaviors=behaviors_tested,
                                   log_eval_avg_every=10)

result = benchmark_runner.run(maintain_history=True)

We will now dump the results to a file so that they can be post-processed later.

In [None]:
result.dump(os.path.join("./benchmark_results.pickle"))


### Benchmark Results

Benchmark results contain
- the evaluated metrics of each simulation run, as a pandas DataFrame
- the world state of every simulation (optional)

In [None]:
result_loaded = BenchmarkResult.load(os.path.join("./benchmark_results.pickle"))

We will now first analyze the dataframe.

In [None]:
df = result_loaded.get_data_frame()

df.head()
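
One typical analysis of such a table is the success rate per behavior. The sketch below shows the computation in plain Python on made-up rows. The column names `behavior` and `success` are assumed to match the keys used for filtering later in this notebook, and the rows are illustrative placeholders, not benchmark results. With pandas, the same aggregation could be written as `df.groupby("behavior")["success"].mean()`, assuming those columns exist in the data frame.

```python
from collections import defaultdict

def success_rate_per_behavior(rows):
    """Fraction of successful runs for each behavior name."""
    successes = defaultdict(int)
    runs = defaultdict(int)
    for row in rows:
        runs[row["behavior"]] += 1
        successes[row["behavior"]] += int(bool(row["success"]))
    return {name: successes[name] / runs[name] for name in runs}

# Made-up rows for illustration only; these are not benchmark results.
toy_rows = [
    {"behavior": "BehaviorA", "success": False},
    {"behavior": "BehaviorA", "success": True},
    {"behavior": "BehaviorB", "success": True},
    {"behavior": "BehaviorB", "success": True},
]
rates = success_rate_per_behavior(toy_rows)
print(rates)
```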

### Benchmark Analyzer

The benchmark analyzer filters the results so that we can visualize what really happened. The filters are set via a dictionary of lambda functions that specify the evaluation criteria which must be fulfilled.

A config is basically a simulation run, where step size, controlled agent, terminal conditions and metrics have been defined.

Let us first load the results into the BenchmarkAnalyzer and then filter the results.

In [None]:
analyzer = BenchmarkAnalyzer(benchmark_result=result_loaded)


configs_rd = analyzer.find_configs(criteria={"behavior": lambda x: x=="RandomHeuristic", "success": lambda x : not x})
configs_dm = analyzer.find_configs(criteria={"behavior": lambda x: x=="DomainHeuristic", "success": lambda x : not x})
configs_nn = analyzer.find_configs(criteria={"behavior": lambda x: x=="NNHeuristic", "success": lambda x : not x})
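
The criteria dictionary can be understood as a conjunction of per-column predicates: a run matches only if every lambda returns true for the corresponding value. The sketch below shows this in plain Python. `find_matching_indices` is a hypothetical helper, not the analyzer's `find_configs`, and the rows are made up for illustration.

```python
def find_matching_indices(rows, criteria):
    """Indices of the rows for which every criterion returns True."""
    return [index for index, row in enumerate(rows)
            if all(check(row[column]) for column, check in criteria.items())]

# Made-up rows for illustration only; these are not benchmark results.
example_rows = [
    {"behavior": "BehaviorA", "success": True},
    {"behavior": "BehaviorA", "success": False},
    {"behavior": "BehaviorB", "success": False},
]
failed_a = find_matching_indices(
    example_rows,
    criteria={"behavior": lambda x: x == "BehaviorA", "success": lambda x: not x},
)
print(failed_a)
```

In the notebook, the same pattern selects the unsuccessful runs of each heuristic so that they can be inspected in detail.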

We will now create a video from some of the filtered runs, here from the domain heuristic. We use the Matplotlib viewer and render everything to a video.

In [None]:
fig = plt.figure(figsize=[15, 15])
viewer = MPViewer(x_range=[-75, 75],
                  y_range=[-75, 75],
                  follow_agent_id=True)
video_exporter = VideoRenderer(renderer=viewer, world_step_time=0.2)

analyzer.visualize(viewer = video_exporter, real_time_factor = 1, configs_idx_list=configs_dm[1:3], fontsize=6)
                   
video_exporter.export_video(filename="./heuristic_test_video.mp4")
