## Reinforcement Learning

#### run global setup

In [None]:
try:
    with open("../global_setup.py") as setupfile:
        exec(setupfile.read())
except FileNotFoundError:
    print('Setup already completed')

#### run local setup

In [None]:
%matplotlib inline
import random

import matplotlib.colors  # needed for ListedColormap in the Cliff World cell
import matplotlib.pyplot as plt
import seaborn as sns

from src.rl.CliffworldEnv import CliffworldEnv
from src.rl.RandomAgent import RandomAgent
from src.rl.TabularQAgent import TabularQAgent
from src.rl.util import run_episode, run_episode_ss

### Environment, Agents, Actions and Goals

A simplified way to model the world is as an environment with well-defined states in which agents act to achieve their goals. Each state corresponds to one distinct situation that can arise in the environment. The state changes as agents perform actions: for example, a hungry agent decides to eat, and as a result the refrigerator becomes less full.
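The interaction described above (observe a state, choose an action, receive a new state and a reward) can be sketched as a short loop. This toy is not part of the `src.rl` package: the `FridgeEnv` class, its `reset`/`step` methods and the reward of 1 per meal are hypothetical, loosely modeled on the refrigerator example.

```python
from typing import Callable, Tuple


class FridgeEnv:
    """Hypothetical toy environment: the state is the number of meals left."""

    def __init__(self, meals: int = 3):
        self.initial_meals = meals
        self.meals = meals

    def reset(self) -> int:
        """Restore the initial state and return it."""
        self.meals = self.initial_meals
        return self.meals

    def step(self, action: str) -> Tuple[int, float, bool]:
        """Apply an action and return (next_state, reward, done)."""
        reward = 0.0
        if action == "eat" and self.meals > 0:
            self.meals -= 1  # eating changes the state: the fridge is less full
            reward = 1.0     # hypothetical reward for satisfying hunger
        done = self.meals == 0
        return self.meals, reward, done


def run(env: FridgeEnv, policy: Callable[[int], str], max_steps: int = 100) -> float:
    """Let the agent act until the episode ends; return the total reward."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)  # the agent decides based on the current state
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def always_eat(state: int) -> str:
    return "eat"


print(run(FridgeEnv(meals=3), always_eat))  # 3.0
```

The `max_steps` cap keeps the loop finite for policies that never reach a terminal state, such as one that always waits.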

The most accessible digital versions of such simplified worlds are computer games, in which the agent is the player, or a non-player (computer-controlled) character, trying to take the best actions to reach the goal of the game.

For real-world scenarios it is much harder to define the environment, agents, actions and goals in a way that makes the problem tractable, because the world is complex and has an enormous number of possible states.

## Cliff World

In [None]:
sns.set(rc={'figure.figsize': (15, 15)})
states_colors = matplotlib.colors.ListedColormap(
    ['#9A9A9A', '#D886BA', '#4D314A', '#6E9183'])
cmap_default = 'Blues'
cpal_default = sns.color_palette("Blues_d")


random.seed(1)

env = CliffworldEnv()
env.render(mode='reward')
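The layout and rewards of the project's `CliffworldEnv` are defined in `src.rl` and are not shown here. As an assumption, the sketch below follows the classic Cliff Walking gridworld from Sutton and Barto's *Reinforcement Learning: An Introduction*: a 4×12 grid, the start in the bottom-left corner, the goal in the bottom-right corner, and the cells between them forming the cliff. Every step costs -1, and stepping into the cliff costs -100 and sends the agent back to the start. The class name `CliffWalk`, its methods and the action numbering are hypothetical; the notebook's environment may differ.

```python
from typing import Tuple

State = Tuple[int, int]  # (row, col); row 0 is the top of the grid

# Hypothetical action numbering: 0 = up, 1 = right, 2 = down, 3 = left
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]


class CliffWalk:
    """Hypothetical Cliff Walking gridworld in the Sutton & Barto layout."""

    def __init__(self, rows: int = 4, cols: int = 12):
        self.rows, self.cols = rows, cols
        self.start: State = (rows - 1, 0)
        self.goal: State = (rows - 1, cols - 1)
        # The cliff is the bottom row between start and goal
        self.cliff = {(rows - 1, c) for c in range(1, cols - 1)}
        self.state = self.start

    def reset(self) -> State:
        self.state = self.start
        return self.state

    def step(self, action: int) -> Tuple[State, float, bool]:
        """Return (next_state, reward, done)."""
        dr, dc = MOVES[action]
        r, c = self.state
        # Moves that would leave the grid keep the agent at the edge
        r = min(max(r + dr, 0), self.rows - 1)
        c = min(max(c + dc, 0), self.cols - 1)
        if (r, c) in self.cliff:
            self.state = self.start  # fell off the cliff: back to the start
            return self.state, -100.0, False
        self.state = (r, c)
        return self.state, -1.0, self.state == self.goal

    def render(self) -> str:
        """Text map: S start, G goal, C cliff, A agent, . free cell."""
        symbols = {self.start: "S", self.goal: "G"}
        symbols.update({cell: "C" for cell in self.cliff})
        symbols[self.state] = "A"  # the agent overrides the cell symbol
        return "\n".join(
            "".join(symbols.get((r, c), ".") for c in range(self.cols))
            for r in range(self.rows)
        )


# The shortest safe path: one step up, eleven steps right, one step down
env = CliffWalk()
env.reset()
total = 0.0
for action in [0] + [1] * 11 + [2]:
    _, reward, done = env.step(action)
    total += reward
print(total, done)  # -13.0 True
```

In this layout the shortest path hugs the cliff edge, so a single exploratory step down from it costs -100; that trade-off between short and safe paths is what makes the task a common test for exploration.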

In [None]:
# Baseline: an agent that acts randomly, without learning
agent = RandomAgent()
sum_r, ss = run_episode_ss(env, agent)

env.render(mode='path', ss=ss)
print("Reward achieved:", sum_r)

In [None]:
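def run_experiment(env, agent, epsilon_decay, n_episodes) -> list:
    """Train `agent` on `env` for `n_episodes`, then run one greedy episode.

    Returns the total reward of every training episode.
    """
    rewards = []
    for _ in range(n_episodes):
        sum_r = run_episode(env, agent, learn=True)
        rewards.append(sum_r)
        agent.epsilon *= epsilon_decay  # explore less after every episode
    agent.epsilon = 0  # act purely greedily for the final evaluation run
    sum_r, ss = run_episode_ss(env, agent)
    print(f'Trained for {n_episodes} episodes. The final greedy episode '
          f'achieved a reward of {sum_r}. Path taken:')
    env.render(mode='path', ss=ss)
    return rewards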


alpha = 0.1  # learning rate
epsilon = 1.0  # initial exploration rate (probability of a random action)
gamma = 1.0  # discount factor (1.0 = future rewards are not discounted)
agent = TabularQAgent(alpha, epsilon, gamma)

rewards = run_experiment(env, agent, 0.99, 1000)
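The source of `TabularQAgent` is not shown. Assuming it implements standard tabular Q-learning with ε-greedy exploration, as its name and its `alpha`, `epsilon` and `gamma` parameters suggest, the update applied after each step is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The sketch below implements that update and ε-greedy action selection with a dictionary-backed Q-table. The class name `QTable`, its `act`/`learn` methods and the helper `epsilon_after` are hypothetical, not the notebook's API.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class QTable:
    """Hypothetical tabular Q-learning agent with epsilon-greedy exploration."""

    def __init__(self, actions: List[int], alpha: float, epsilon: float, gamma: float):
        self.actions = actions
        self.alpha = alpha      # learning rate
        self.epsilon = epsilon  # probability of taking a random action
        self.gamma = gamma      # discount factor
        # Q[(state, action)] defaults to 0.0 for pairs not seen yet
        self.Q: Dict[Tuple[Hashable, int], float] = defaultdict(float)

    def act(self, state: Hashable) -> int:
        if random.random() < self.epsilon:
            return random.choice(self.actions)  # explore
        # Exploit: choose among the actions with the highest Q-value
        best = max(self.Q[(state, a)] for a in self.actions)
        return random.choice([a for a in self.actions if self.Q[(state, a)] == best])

    def learn(self, state, action, reward, next_state, done) -> None:
        # Terminal transitions have no future value to bootstrap from
        future = 0.0 if done else max(self.Q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * future
        self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])


def epsilon_after(n_episodes: int, epsilon0: float = 1.0, decay: float = 0.99) -> float:
    """Exploration rate after n multiplicative decays, as in run_experiment."""
    return epsilon0 * decay ** n_episodes


print(epsilon_after(1000))
```

With the notebook's settings (ε starting at 1.0, decay 0.99 per episode), ε first drops below 0.01 after 459 episodes and is about 4.3 × 10⁻⁵ after 1000, so the second half of training is almost entirely greedy.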

In [None]:
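# sns.tsplot is deprecated and has been removed from recent seaborn releases;
# lineplot draws the same reward-per-episode curve
ax = sns.lineplot(x=list(range(1, len(rewards) + 1)), y=rewards)
ax.set(xlabel='# episodes', ylabel='reward')
sns.despine()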
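Per-episode rewards from an ε-greedy agent are noisy, especially early in training when ε is close to 1 and single falls off the cliff dominate an episode's total. A trailing moving average makes the learning trend easier to read. This is a sketch; the function name `moving_average` and the default window of 50 episodes are arbitrary choices.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 50) -> List[float]:
    """Trailing mean over the last `window` values (fewer at the start)."""
    averages = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        averages.append(running / min(i + 1, window))
    return averages


print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

In the notebook, the smoothed curve could be drawn with `plt.plot(moving_average(rewards))` alongside the raw rewards.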

In [None]:
env.render(mode='policy', Q=agent.Q, A=agent.A)
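The policy plot shows, for every state, the action the trained agent prefers. A greedy policy can be read off a Q-table by taking the action with the highest value in each state. The sketch below assumes a Q-table stored as a dict keyed by `(state, action)`, with grid cells `(row, col)` as states and the hypothetical action numbering 0 = up, 1 = right, 2 = down, 3 = left; the actual structure of `agent.Q` and `agent.A` in `TabularQAgent` is not shown and may differ.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]
# Hypothetical action numbering: 0 = up, 1 = right, 2 = down, 3 = left
ARROWS = {0: "^", 1: ">", 2: "v", 3: "<"}


def greedy_policy(Q: Dict[Tuple[State, int], float], actions: List[int],
                  rows: int, cols: int) -> List[str]:
    """Return one string per grid row, with an arrow for each cell's greedy action."""
    lines = []
    for r in range(rows):
        row = ""
        for c in range(cols):
            # Unseen (state, action) pairs count as 0.0; ties go to the first action
            best = max(actions, key=lambda a: Q.get(((r, c), a), 0.0))
            row += ARROWS[best]
        lines.append(row)
    return lines


Q = {((0, 0), 1): 1.0, ((0, 1), 2): 0.5, ((0, 1), 1): -1.0}
print("\n".join(greedy_policy(Q, [0, 1, 2, 3], rows=1, cols=2)))  # >v
```

Cells the agent never visited keep all-zero values and fall back to the first action, so arrows there say nothing about what the agent learned.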