- Goal: Build a reinforcement learning model that automatically adjusts the shower temperature to bring it into the optimal range
- Optimal temperature: between 37 and 39 degrees
- Shower length: 60 seconds
- Actions: Turn Down, Leave, Turn Up
- Task: Build a model that keeps us in the optimal range for as long as possible

### 1. Import Dependencies

In [26]:
import gym
from gym import Env # import the environment class
from gym.spaces import Discrete, Box, Dict, Tuple, MultiBinary, MultiDiscrete

import numpy as np
import random 
import os

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

### 2. Types of Spaces

In [27]:
Discrete(3).sample()

0

In [28]:
Box(0,1,shape=(3,)).sample()

array([0.40021256, 0.9457223 , 0.839329  ], dtype=float32)

In [29]:
Tuple((Discrete(3),Box(0,1,shape=(3,)))).sample()
# Tuple lets you combine different spaces into one
# Stable Baselines3 does not support Tuple spaces

(1, array([0.65918005, 0.8343058 , 0.03300654], dtype=float32))

In [30]:
Dict({'height':Discrete(2),'speed':Box(0,100,shape=(1,))}).sample()

OrderedDict([('height', 1), ('speed', array([36.79398], dtype=float32))])

In [31]:
MultiBinary(4).sample()
# each of the 4 positions is independently 0 or 1

array([0, 1, 0, 1], dtype=int8)

In [32]:
MultiDiscrete([5,2,2]).sample()
# first value: 0 to 4
# second value: 0 to 1
# third value: 0 to 1

array([2, 0, 1], dtype=int64)
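
To make the sampling semantics of the discrete spaces above concrete, here is a minimal sketch in plain Python. It only mimics what a sample looks like; it is not how gym implements these spaces, and the helper names (`sample_discrete`, `sample_multi_binary`, `sample_multi_discrete`) are hypothetical.

```python
import random


def sample_discrete(n, rng):
    """Mimic Discrete(n): one integer from 0 to n - 1."""
    return rng.randrange(n)


def sample_multi_binary(n, rng):
    """Mimic MultiBinary(n): n independent values, each 0 or 1."""
    return [rng.randint(0, 1) for _ in range(n)]


def sample_multi_discrete(nvec, rng):
    """Mimic MultiDiscrete(nvec): position i holds an integer from 0 to nvec[i] - 1."""
    return [rng.randrange(n) for n in nvec]


rng = random.Random(0)  # seeded only so repeated runs give the same draws
action = sample_discrete(3, rng)
flags = sample_multi_binary(4, rng)
levels = sample_multi_discrete([5, 2, 2], rng)
print(action, flags, levels)
```

Each position of a `MultiDiscrete` sample has its own range, which is why `[5,2,2]` yields one value from 0 to 4 followed by two values from 0 to 1.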

### 3. Building an Environment
- Build an agent to give us the best shower possible
- The temperature starts at a random value
- The optimal range is 37 to 39 degrees
- However, the agent does not know that we prefer 37 to 39 degrees, so we need to train it to learn which adjustments bring the temperature into that range

In [33]:
class ShowerEnv(Env):
    # The four most important methods: __init__, step, render, reset
    def __init__(self):
        # Three actions: turn down, leave unchanged, turn up
        self.action_space = Discrete(3)
        # A Box action space could make this richer, e.g. turning by a chosen number of degrees
        # Observation: the current temperature, a single value between 0 and 100
        self.observation_space = Box(low=0, high=100, shape=(1,))
        self.state = 38 + random.randint(-3, 3)  # start within 3 degrees of 38
        self.shower_length = 60  # seconds remaining

    def step(self, action):
        # Apply the impact of the action on the state:
        # 0 = decrease, 1 = no change, 2 = increase
        self.state += action - 1

        # Decrease the remaining shower time
        self.shower_length -= 1

        # Calculate the reward: +1 inside the optimal range, -1 outside
        if 37 <= self.state <= 39:
            reward = 1
        else:
            reward = -1

        # Check whether the shower is done
        done = self.shower_length <= 0

        info = {}  # additional info (unused here)

        return self.state, reward, done, info

    def render(self):
        # Visualization is not implemented
        pass

    def reset(self):
        # Reset the temperature and the remaining time
        self.state = np.array([38 + random.randint(-3, 3)]).astype(float)
        self.shower_length = 60
        return self.state

In [34]:
env = ShowerEnv()

In [35]:
env.observation_space.sample()

array([57.26816], dtype=float32)

In [36]:
env.action_space.sample()

0

In [37]:
env.reset()

array([40.])

### 4. Test Environment

In [38]:
episodes = 5
for episode in range(1,episodes + 1):
    obs = env.reset()
    done = False
    score = 0
    
    while not done:
        env.render()
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        score += reward
    print('episode:{} Score:{}'.format(episode,score))
env.close()

episode:1 Score:-28
episode:2 Score:0
episode:3 Score:-10
episode:4 Score:-20
episode:5 Score:-58
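
The random policy above scores between -58 and 0. As a point of comparison for the trained agent, the sketch below replays the same rules in plain Python (no gym) with a hand-written rule that steers toward 38. The names `thermostat_policy` and `simulate_episode` are hypothetical, and the simulation assumes the dynamics are exactly those in `ShowerEnv.step`: the temperature shifts by `action - 1`, and the reward is +1 inside 37 to 39 and -1 outside.

```python
def thermostat_policy(temperature):
    """Hypothetical baseline: steer toward 38, the middle of the optimal range."""
    if temperature > 38:
        return 0  # turn down
    if temperature < 38:
        return 2  # turn up
    return 1  # leave unchanged


def simulate_episode(start_temperature, policy, shower_length=60):
    """Replay the ShowerEnv rules for one episode and return the total score."""
    temperature = start_temperature
    score = 0
    for _ in range(shower_length):
        temperature += policy(temperature) - 1  # 0 -> -1, 1 -> 0, 2 -> +1
        score += 1 if 37 <= temperature <= 39 else -1
    return score


# Every possible start: 38 plus or minus up to 3 degrees
scores = {start: simulate_episode(start, thermostat_policy) for start in range(35, 42)}
print(scores)
```

Under these rules the baseline scores 60 from every start except 35 and 41, where the first step still lands outside the range and the score is 58. A trained agent cannot exceed these values, so they serve as a reference when reading `ep_rew_mean` during training.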


### 5. Train Model

In [39]:
log_path = os.path.join('training','logs')
model = PPO('MlpPolicy', env, verbose=1,tensorboard_log=log_path)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [41]:
model.learn(total_timesteps=40000)

Logging to training\logs\PPO_2
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 60       |
|    ep_rew_mean     | -28.6    |
| time/              |          |
|    fps             | 498      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 60          |
|    ep_rew_mean          | -25.8       |
| time/                   |             |
|    fps                  | 389         |
|    iterations           | 2           |
|    time_elapsed         | 10          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008549665 |
|    clip_fraction        | 0.0561      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.07       |
|    explained_variance   | -0.00113    |

<stable_baselines3.ppo.ppo.PPO at 0x16856c34a30>
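
The log advances `total_timesteps` in blocks of 2048, which matches PPO's default rollout length (`n_steps=2048` with a single environment). The arithmetic below is a rough sketch of what a 40000-step budget means in that case. It assumes the default `n_steps` is in effect and that training stops at the first rollout boundary at or past the requested budget.

```python
import math

n_steps = 2048           # assumed: PPO default rollout length, consistent with the log
total_timesteps = 40000  # value passed to model.learn
episode_length = 60      # ep_len_mean reported in the log

# Number of rollout/update iterations needed to reach the budget
iterations = math.ceil(total_timesteps / n_steps)

# Timesteps actually collected, since rollouts are not cut short
timesteps_collected = iterations * n_steps

# Roughly how many 60-step showers fit into one rollout
episodes_per_rollout = n_steps / episode_length

print(iterations, timesteps_collected, round(episodes_per_rollout, 1))
```

Under these assumptions the run performs 20 iterations and collects 40960 timesteps, with about 34 episodes contributing to each policy update.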

### 6. Save Model

In [42]:
shower_path = os.path.join('Training','Saved Models','shower_Model_PPO')

In [43]:
model.save(shower_path)



In [44]:
del model

In [45]:
model = PPO.load(shower_path, env)

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [47]:
evaluate_policy(model,env,n_eval_episodes=10,render=True)

TypeError: ShowerEnv.render() got an unexpected keyword argument 'mode'
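
The error occurs because, with `render=True`, the environment's `render` method ends up being called with a `mode` keyword argument (most likely forwarded by the `DummyVecEnv` wrapper), while `ShowerEnv.render(self)` accepts no arguments. The sketch below reproduces the mismatch in plain Python. The classes and `call_like_wrapper` are hypothetical stand-ins, not the actual Stable Baselines3 code.

```python
class RenderWithoutMode:
    """Stand-in for the current ShowerEnv: render takes no arguments."""

    def render(self):
        pass


class RenderWithMode:
    """Stand-in for a possible fix: render accepts an optional mode."""

    def render(self, mode="human"):
        # Visualization is still not implemented; only the signature changed
        pass


def call_like_wrapper(env):
    """Hypothetical wrapper that forwards the render mode as a keyword argument."""
    return env.render(mode="human")


try:
    call_like_wrapper(RenderWithoutMode())
    failed = False
except TypeError:
    failed = True  # same kind of error as in the traceback above

call_like_wrapper(RenderWithMode())  # accepted without error
print(failed)
```

Applied to `ShowerEnv`, this would mean defining `def render(self, mode='human')`. Since the method draws nothing anyway, calling `evaluate_policy` with `render=False` is another way to avoid the error.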