Last time, we looked at an example reinforcement learning problem that balanced an object in space, following [this tutorial](https://youtu.be/cO5g5qLrLSo).

In this lab, you will choose a reinforcement learning problem to explore. Here are some suggestions for problems that you can investigate.

1. Autonomous driving: Building a racing car with reinforcement learning. [Tutorial](https://youtu.be/Mut_u40Sqz4?t=6020), [code on github](https://github.com/nicknochnack/ReinforcementLearningCourse/blob/main/Project%202%20-%20Self%20Driving.ipynb). 
2. Custom environments: Shower environment to get the temperature right every time. [Tutorial](https://youtu.be/Mut_u40Sqz4?t=6020), [code on github](https://github.com/nicknochnack/ReinforcementLearningCourse/blob/main/Project%203%20-%20Custom%20Environment.ipynb).
3. Some other options to check out later if interested:
    1. Solving the Lunar Landing Problem using a Stable Baselines algorithm: [tutorial](https://youtu.be/nRHjymV2PX8), [code on github](https://github.com/nicknochnack/StableBaselinesRL). Note that ACER is only available in the earlier `stable_baselines` package, which depends on TensorFlow 1.x and is not available on edStem.
    2. Datasets for Deep Data-Driven Reinforcement Learning (D4RL): [environments description](https://sites.google.com/view/d4rl/home), [code on github](https://github.com/rail-berkeley/d4rl).

**Note:** This is a rough guide to the main steps in a reinforcement learning program. Please add more sections as your implementation requires, with comments describing each section.

# Problem description
Enter in the text cell below the problem that you chose to solve with reinforcement learning.

Custom environment: Shower

# **Build an RL environment**

**1. Install packages**

Note: Please inform the TA of any additional packages that you need to install for the problem that you selected.

In [3]:
import os
if not os.getenv("ED_COURSE_ID"):
    # collections is part of the Python standard library, so it is not installed with pip
    !pip install tensorflow stable_baselines3 torch gym box2d-py --user

**1.b import packages**

In [7]:
# Add your code here to import all needed packages. 
# Contact the instructor if you get an error.

import gym 
from gym import Env
from gym.spaces import Discrete, Box, Dict, Tuple, MultiBinary, MultiDiscrete 
import numpy as np
import random
import os
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy

**2. Create the environment**

In [8]:
Discrete(3)

Discrete(3)

In [9]:
Box(0,1,shape=(3,3)).sample()

array([[0.6923383 , 0.7911673 , 0.20257261],
       [0.70782256, 0.6032806 , 0.02770894],
       [0.44738367, 0.6103288 , 0.9298064 ]], dtype=float32)

In [10]:
Box(0,255,shape=(3,3), dtype=int).sample()

array([[149,  44,  38],
       [ 83,  65,  96],
       [153,  51, 121]])

In [11]:
Tuple((Discrete(2), Box(0,100, shape=(1,)))).sample()

(1, array([7.638086], dtype=float32))

In [12]:
Dict({'height':Discrete(2), "speed":Box(0,100, shape=(1,))}).sample()

OrderedDict([('height', 0), ('speed', array([39.94341], dtype=float32))])

**3. Test the environment with random policy**

In [13]:
# Trigger Ed's X display
!xdpyinfo

# Add your code here to display the environment with random choice
MultiBinary(4).sample()


MultiDiscrete([5,2,2]).sample()

name of display:    :1.0
version number:    11.0
vendor string:    The X.Org Foundation
vendor release number:    12009000
X.Org version: 1.20.9
maximum request size:  16777212 bytes
motion buffer size:  256
bitmap unit, bit order, padding:    32, LSBFirst, 32
image byte order:    LSBFirst
number of supported pixmap formats:    6
supported pixmap formats:
    depth 1, bits_per_pixel 1, scanline_pad 32
    depth 4, bits_per_pixel 8, scanline_pad 32
    depth 8, bits_per_pixel 8, scanline_pad 32
    depth 16, bits_per_pixel 16, scanline_pad 32
    depth 24, bits_per_pixel 32, scanline_pad 32
    depth 32, bits_per_pixel 32, scanline_pad 32
keycode range:    minimum 8, maximum 255
focus:  PointerRoot
number of extensions:    23
    BIG-REQUESTS
    Composite
    DAMAGE
    DOUBLE-BUFFER
    GLX
    Generic Event Extension
    MIT-SCREEN-SAVER
    MIT-SHM
    Present
    RANDR
    RECORD
    RENDER
    SHAPE
    SYNC
    VNC-EXTENSION
    X-Resource
    

array([0, 0, 1])

# **Build and Train the Model**

**4. Build the training model**

In [14]:
class ShowerEnv(Env):
    def __init__(self):
        # Actions we can take, down, stay, up
        self.action_space = Discrete(3)
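        # Temperature array: a single value between 0 and 100
        self.observation_space = Box(low=np.array([0]), high=np.array([100]), dtype=np.float32)
        # Set start temp, with the same shape and dtype as the observation space
        self.state = np.array([38 + random.randint(-3,3)], dtype=np.float32)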
        # Set shower length
        self.shower_length = 60
        
    def step(self, action):
        # Apply action: subtracting 1 maps the action index to a temperature change
        # 0 - 1 = -1 (temperature down)
        # 1 - 1 =  0 (no change)
        # 2 - 1 = +1 (temperature up)
        self.state += action - 1
        # Reduce shower length by 1 second
        self.shower_length -= 1 
        
        # Calculate reward: +1 inside the comfortable range, -1 outside it
        if self.state >= 37 and self.state <= 39:
            reward = 1
        else:
            reward = -1

        # Check if shower is done
        done = self.shower_length <= 0
        
        # Apply temperature noise
        #self.state += random.randint(-1,1)
        # Set placeholder for info
        info = {}
        
        # Return step information
        return self.state, reward, done, info

    def render(self, mode='human'):
        # Implement viz (the mode argument matches gym's Env.render signature)
        pass
    
    def reset(self):
        # Reset shower temperature
        self.state = np.array([38 + random.randint(-3,3)], dtype=np.float32)
        # Reset shower time
        self.shower_length = 60 
        return self.state
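
As a sanity check on the reward scale, the sketch below re-implements the noise-free dynamics of `ShowerEnv.step` in plain Python and runs a hand-written "thermostat" rule that always moves the temperature toward 38. The names `shower_step`, `thermostat_action`, and `run_thermostat` are hypothetical helpers, not part of the notebook or of gym. The result is only a reference point for what a well-trained agent could score in a 60-step episode.

```python
def shower_step(temp, action):
    """Apply one action (0 = down, 1 = stay, 2 = up) and return (new_temp, reward)."""
    temp += action - 1
    reward = 1 if 37 <= temp <= 39 else -1
    return temp, reward


def thermostat_action(temp, target=38):
    """Hand-written rule (an assumption, not a learned policy): step toward the target."""
    if temp < target:
        return 2
    if temp > target:
        return 0
    return 1


def run_thermostat(start_temp, shower_length=60):
    """Total reward of the thermostat rule over one noise-free episode."""
    temp, score = start_temp, 0
    for _ in range(shower_length):
        temp, reward = shower_step(temp, thermostat_action(temp))
        score += reward
    return score


# Start temperatures that 38 + random.randint(-3, 3) can produce
for start in range(35, 42):
    print(start, run_thermostat(start))
```

Under these assumptions the rule scores 58 when the episode starts at 35 or 41 and 60 otherwise, so an `ep_rew_mean` approaching 60 would suggest the agent has learned comparable behaviour.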
        

In [32]:
env=ShowerEnv()

In [34]:
env.observation_space.sample()

array([17.05793], dtype=float32)

In [33]:
env.reset()

array([36.])

In [35]:
from stable_baselines3.common.env_checker import check_env

In [39]:
check_env(env, warn=True)

Test environment

In [40]:

episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

Episode:1 Score:-36
Episode:2 Score:-28
Episode:3 Score:-52
Episode:4 Score:-34
Episode:5 Score:-16


In [42]:
env.close()

Train model

In [45]:
log_path = os.path.join('Training', 'Logs')

In [46]:
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [47]:
model.learn(total_timesteps=400000)

Logging to Training/Logs/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 60       |
|    ep_rew_mean     | -34.4    |
| time/              |          |
|    fps             | 1446     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 60          |
|    ep_rew_mean          | -32.5       |
| time/                   |             |
|    fps                  | 1709        |
|    iterations           | 2           |
|    time_elapsed         | 2           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.007906474 |
|    clip_fraction        | 0.0154      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.09       |
|    explained_variance   | 4.63e-05    |
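
The log reports `total_timesteps` of 2048 after the first iteration, which matches PPO's default rollout length (`n_steps=2048`) with a single environment. As a rough back-of-the-envelope sketch, assuming that default and assuming training stops at the first rollout boundary at or past the requested budget, the run above would need roughly this many iterations:

```python
import math

total_timesteps = 400_000   # value passed to model.learn above
n_steps = 2048              # assumed: PPO's default rollout length, one environment
episode_length = 60         # shower_length in ShowerEnv

iterations = math.ceil(total_timesteps / n_steps)
timesteps_collected = iterations * n_steps
episodes_per_rollout = n_steps / episode_length

print(iterations)                       # 196
print(timesteps_collected)              # 401408
print(round(episodes_per_rollout, 1))   # 34.1
```

Each logged `ep_rew_mean` is therefore an average over a few dozen 60-step episodes, not a single run.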

# **Save, Reload and Evaluate the model**

**5. Save the model**

In [49]:
model.save('PPO')

**6. Reload the model**

In [53]:
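# Add your code here to test the system:
# Reload the model saved in step 5
model = PPO.load('PPO', env=env)

# Evaluate and test: returns (mean reward, standard deviation of reward)
evaluate_policy(model, env, n_eval_episodes=10, render=True)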

(12.0, 58.787753826796276)
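
`evaluate_policy` returns the mean and the standard deviation of the per-episode rewards, so the result above reads as a mean of 12.0 with a spread of about 58.8. A spread that large, relative to a maximum episode score of 60, suggests the episodes were split between very good and very bad outcomes. As an illustration only (the per-episode scores were not recorded in this notebook), one hypothetical split that reproduces both numbers is six episodes at +60 and four at -60:

```python
import math

# Hypothetical per-episode rewards; the actual values were not logged.
episode_rewards = [60] * 6 + [-60] * 4

mean_reward = sum(episode_rewards) / len(episode_rewards)
# Population standard deviation (divide by N), as NumPy's np.std does by default
variance = sum((r - mean_reward) ** 2 for r in episode_rewards) / len(episode_rewards)
std_reward = math.sqrt(variance)

print(mean_reward, std_reward)
```

To see the individual scores instead of guessing, `evaluate_policy` accepts `return_episode_rewards=True`, which returns the per-episode rewards and lengths.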

**7. Display the environment**

In [0]:
# Trigger Ed's X display
!xdpyinfo
# Add loop here to display the smart agent!
