# Experiments Report

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob

# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import IFrame, display, HTML

## References
**Experiments code**
* https://github.com/PrinceJavier/rl_plas_experiment

**Reference paper and code**
* PLAS: Latent Action Space for Offline Reinforcement Learning (2020)
* https://sites.google.com/view/latent-policy
* https://github.com/Wenxuan-Zhou/PLAS

**RL Environment**
* https://www.gymlibrary.dev/environments/box2d/lunar_lander/

## September 6, 2022 Update
### Data collection process
* We generated our own dataset because ready-to-use offline data for `LunarLander-v2` was not available.
* Defined a `LunarLander-v2` environment
```python
import gym

env = gym.make(
    "LunarLander-v2",
    continuous = True,
    gravity = -10.0,
    enable_wind = True,
    wind_power = 5, # between 0 and 20
    turbulence_power = 1, # between 0 and 2
)
```
* Trained a PPO model
```python
from stable_baselines3 import PPO
model = PPO("MlpPolicy", env, verbose=1, device='cpu')
model.learn(total_timesteps=100000)
```
* Generated 1,000,000 datapoints using the trained PPO model
    * `observations` - array of observations per time step
    * `actions` - array of selected actions per time step
    * `next_observations` - array of the observations that follow the selected action at each time step
    * `rewards` - array of rewards per time step
    * `terminals` - array of `True` or `False` values indicating whether the episode ended at each time step
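
The collection step can be sketched as a plain rollout loop. This is a minimal sketch, not the code in `data_gen_lunar_lander.ipynb`: `collect_transitions` and `ToyEnv` are hypothetical names, the loop assumes the older `gym` step API that returns `(obs, reward, done, info)`, and the toy environment only stands in for `LunarLander-v2` so that the snippet runs on its own.

```python
import numpy as np

KEYS = ("observations", "actions", "next_observations", "rewards", "terminals")


def collect_transitions(env, policy, n_steps):
    """Roll out `policy` in `env` for `n_steps` steps and return per-step arrays."""
    data = {key: [] for key in KEYS}
    obs = env.reset()
    for _ in range(n_steps):
        action = policy(obs)
        # Assumption: 4-tuple step API (obs, reward, done, info)
        next_obs, reward, done, _info = env.step(action)
        for key, value in zip(KEYS, (obs, action, next_obs, reward, done)):
            data[key].append(value)
        # Start a new episode whenever the current one ends
        obs = env.reset() if done else next_obs
    return {key: np.asarray(values) for key, values in data.items()}


class ToyEnv:
    """Hypothetical stand-in env: 8-dim observations, episodes of 5 steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(8, dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.full(8, self.t, dtype=np.float32)
        reward = -float(np.abs(action).sum())
        return obs, reward, self.t >= 5, {}


dataset = collect_transitions(
    ToyEnv(), policy=lambda obs: np.zeros(2, dtype=np.float32), n_steps=12
)
print({key: values.shape for key, values in dataset.items()})
```

With the trained PPO model, the `policy` argument could be `lambda obs: model.predict(obs)[0]`, since `model.predict` returns an `(action, state)` pair.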

### Experiments
* Trained PLAS model on the collected offline data
* For evaluation, varied `wind_power` and `turbulence_power` to create out-of-distribution environments (the training data used `wind_power = 5` and `turbulence_power = 1`)
```python
wind_powers = np.arange(0, 21, 1)
turb_powers = np.arange(0, 2, 0.2)
```
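
One way to turn these two ranges into a list of evaluation settings is sketched below. The helper name `evaluation_grid` is hypothetical, and the baseline pair `(5, 1)` is taken from the data collection environment. The sketch skips a pair only when both values match the baseline, so settings such as `wind_power = 5` with a different turbulence are still evaluated.

```python
import itertools

import numpy as np


def evaluation_grid(wind_powers, turb_powers, baseline=(5, 1)):
    """Return all (wind, turbulence) pairs except the baseline pair itself."""
    return [
        (float(w), float(t))
        for w, t in itertools.product(wind_powers, turb_powers)
        # np.isclose avoids exact float comparison on the arange output
        if not (np.isclose(w, baseline[0]) and np.isclose(t, baseline[1]))
    ]


wind_powers = np.arange(0, 21, 1)
turb_powers = np.arange(0, 2, 0.2)
grid = evaluation_grid(wind_powers, turb_powers)
print(len(grid), grid[:3])
```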

### Takeaways

### Notebooks
* `data_gen_lunar_lander.ipynb` - code for the data collection process
* `experiments_lunar_lander.ipynb` - code for experiments

### Training Performance
* Average reward (30 iterations per evaluation) fluctuates widely from one evaluation to the next
* It shows no clear upward trend over training iterations
* **Is there a way to make this more stable?**
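
To judge how much of the fluctuation is evaluation noise, it may help to report a standard error next to each average. The following is a minimal sketch under assumptions: `evaluate` and `ConstantRewardEnv` are hypothetical names, the loop assumes the older `gym` step API that returns `(obs, reward, done, info)`, and the stand-in environment exists only to make the snippet runnable.

```python
import numpy as np


def evaluate(env, policy, n_episodes=30):
    """Return the mean episode return and its standard error over n_episodes."""
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # Assumption: 4-tuple step API (obs, reward, done, info)
            obs, reward, done, _info = env.step(policy(obs))
            total += reward
        returns.append(total)
    returns = np.asarray(returns)
    return returns.mean(), returns.std(ddof=1) / np.sqrt(len(returns))


class ConstantRewardEnv:
    """Hypothetical stand-in: each episode lasts 5 steps with reward 1.0 per step."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(8)

    def step(self, action):
        self.t += 1
        return np.zeros(8), 1.0, self.t >= 5, {}


mean_return, sem = evaluate(ConstantRewardEnv(), policy=lambda obs: np.zeros(2))
print(mean_return, sem)
```

If the gap between consecutive evaluations is within a few standard errors, the curve may be noisier than the policy itself is unstable.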

<img src = "report_files/lunar_lander_plas_training_perf.png" width="500" />

### Results

In [45]:
df_results = pd.read_csv("report_files/results.csv")
df_results

| Unnamed: 0 | env_type | wind_power | turb_power | agent_type | avg_reward |
|---|---|---|---|---|---|
| 0 | baseline | 5 | 1 | random | -250.404784 |
| 1 | baseline | 5 | 1 | plas | 157.242059 |


In [None]:
# heat map of turb power vs wind power
# random and for plas
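
A possible shape for this heat map cell is sketched below. It assumes `results.csv` keeps the columns shown in the results table (`wind_power`, `turb_power`, `agent_type`, `avg_reward`). The helper names `reward_grid` and `plot_reward_heatmaps` are hypothetical, and the demo frame contains only the two baseline rows listed above, so each panel is a single cell until the full results are loaded.

```python
import matplotlib.pyplot as plt
import pandas as pd


def reward_grid(df, agent_type):
    """Pivot avg_reward into a turb_power (rows) x wind_power (columns) grid."""
    rows = df[df["agent_type"] == agent_type]
    return rows.pivot_table(
        index="turb_power", columns="wind_power", values="avg_reward", aggfunc="mean"
    )


def plot_reward_heatmaps(df, agent_types=("random", "plas")):
    """Draw one heat map per agent, each with its own color scale."""
    fig, axes = plt.subplots(
        1, len(agent_types), figsize=(5 * len(agent_types), 4), squeeze=False
    )
    for ax, agent in zip(axes[0], agent_types):
        grid = reward_grid(df, agent)
        image = ax.imshow(grid.to_numpy(), origin="lower", aspect="auto")
        ax.set_xticks(range(len(grid.columns)))
        ax.set_xticklabels(grid.columns)
        ax.set_yticks(range(len(grid.index)))
        ax.set_yticklabels(grid.index)
        ax.set_xlabel("wind_power")
        ax.set_ylabel("turb_power")
        ax.set_title(f"{agent}: avg_reward")
        fig.colorbar(image, ax=ax)
    return fig


# Demo input: only the two baseline rows shown in the results table
df_demo = pd.DataFrame(
    {
        "env_type": ["baseline", "baseline"],
        "wind_power": [5, 5],
        "turb_power": [1, 1],
        "agent_type": ["random", "plas"],
        "avg_reward": [-250.404784, 157.242059],
    }
)
fig = plot_reward_heatmaps(df_demo)
```

In the notebook, `plot_reward_heatmaps(df_results)` would replace the demo call once the out-of-distribution rows are in `results.csv`.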

### Baseline environment (`w=5`, `t=1`)
```python
env = gym.make(
    "LunarLander-v2",
    continuous = True,
    gravity = -10.0,
    enable_wind = True,
    wind_power = 5, # between 0 and 20
    turbulence_power = 1, # between 0 and 2
)
```

**Random model**

<img src = "report_files/random_baseline.gif" width="300"/>

**PLAS model**

<img src = "report_files/plas_baseline.gif" width="300"/>

### Out-of-distribution environments
* We tested different combinations of `wind_power` and `turbulence_power`

In [52]:
wind_powers = np.arange(0, 21, 5)
turb_powers = np.arange(0, 2.5, 0.5)

def mean_reward(agent_type, w, t):
    # avg_reward for one agent at (w, t); NaN if results.csv has no matching row
    mask = ((df_results.agent_type == agent_type)
            & (df_results.wind_power == w)
            & np.isclose(df_results.turb_power, t))
    return np.around(df_results.loc[mask, 'avg_reward'].mean(), 3)

for w in wind_powers:
    for t in turb_powers:
        if not (w == 5 and t == 1): # skip only the baseline (w=5, t=1)
            w = np.around(w, 3)
            t = np.around(t, 3)

            # scalar rewards, so the heading shows a number instead of a DataFrame
            rand_reward = mean_reward('random', w, t)
            plas_reward = mean_reward('plas', w, t)

            display(HTML(f"<h4>wind_power = {w} | turbulence_power = {t}</h4>"))
            display(HTML(f"<h5>(left) Random model reward = {rand_reward} vs (right) PLAS model reward = {plas_reward}</h5>"))
            display(HTML(f"""
                        <div class="row">
                                <img src='report_files/random_w={w}_t={t}.gif' width=300 />
                                <img src='report_files/plas_w={w}_t={t}.gif' width=300 />
                        </div>
                        """))
            print()

## August 30, 2022 Update

### Experiments
* Evaluated the performance of the PLAS model on three datasets for the bullet walker simulation
    * `bullet-walker2d-random-v0-0` - collected using a random policy
    * `bullet-walker2d-medium-v0-0` - collected using a medium policy (not random but not expert)
    * `bullet-walker2d-expert-v0-0` - collected using an expert policy (well-trained model)
* Environment is the default `bullet-walker2d` environment

### Takeaways
* The model trained on data from an expert policy performs the best
    * Its performance dropped off quickly over training iterations, which is not what we want.
    * **Can this be a point of improvement?**
* The model trained on data from a medium policy can perform better than random
    * Learning is unstable over training iterations
    * **Can we somehow stabilize this?**
* The model trained on random-policy data can learn a policy better than random, but its performance is still poor
    * The VAE model underfits the data
    * The best the model is able to do is keep the robot stationary so that it does not fall down
    * This may be a very difficult setting: the model has to extrapolate good state-action combinations from a dataset of poor state-action combinations
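
For context on the underfitting remark: PLAS models the dataset's actions with a conditional VAE, and such a model is typically trained by maximizing the evidence lower bound below. The notation (encoder $q$, decoder $p$, latent $z$, standard-normal prior) is the textbook form and an assumption here, not something copied from the experiment code.

$$
\log p(a \mid s) \;\ge\; \mathbb{E}_{z \sim q(z \mid s, a)}\big[\log p(a \mid s, z)\big] \;-\; D_{\mathrm{KL}}\big(q(z \mid s, a)\,\|\,\mathcal{N}(0, I)\big)
$$

Under this reading, underfitting would show up as a reconstruction term (the first term) that stays poor even on the training data, meaning the decoded actions do not match the dataset actions well.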

### Notebook
* See `experiments_bulletwalker.ipynb` for details of the experiment


In [None]:
IFrame('https://docs.google.com/presentation/d/e/2PACX-1vSI_Ea2Vl7JAB8cOz42qJC3uFaW7B_-c5mqEE72DwbmVutuzuEiHxWWh9fCYSlcFs9m2MG88pvjd3xZ/embed?start=false&loop=false&delayms=3000', 
       '100%', '600')