In [None]:
import random
import numpy as np
import gymnasium as gym
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

## 6.1. Introduction to Reinforcement Learning

For the problem formulation, we introduce the [gymnasium](https://gymnasium.farama.org/) library. It implements control problems from the past and present of reinforcement learning that have served as milestones in the development of the technique. Researchers who work on the same standard problems have the advantage that their work is easier to compare and to transfer. On the other hand, if benchmark problems become too prevalent in a community, they may drive research in a single, uniform direction that is no longer as productive. Note that gymnasium is the community-maintained successor of gym, which was originally developed by OpenAI, a private company.

gymnasium uses a unifying framework that defines every control problem as an *environment*. The basic building blocks of an environment are `env = gym.make` to create the environment, `env.reset` to start an episode, `env.render` to give a human-readable representation of the state of the environment, and `env.step` to perform an action.

We start the exercises with the 4x4 [FrozenLake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) environment. It is a kind of maze with "frozen", traversable squares marked by `F` and "holes", terminal squares marked by `H` in which the episode is lost. The agent starts at the start square `S` and only receives a reward when it reaches the goal square `G`. We mostly look at the deterministic case, in which movement on the frozen lake is deterministic; this is controlled by the argument `is_slippery=False` when creating the environment. If the lake is slippery, a movement in a certain direction may by chance result in the agent arriving at a different square than expected.

In [None]:
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="human")
#print(env.action_space)
#print(env.observation_space)

In [None]:
starting_state, _ = env.reset()
#print(starting_state)
env.render()

The `env.action_space` always implements a `sample` method, which returns a valid, random action. We can use this to have a look at the dynamics of the system. You can execute the following cell a few times to see what happens. When the agent enters a terminal state, you need to execute `env.reset` to start anew.

In [None]:
state, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(state, reward, terminated, truncated, info)

### 1. a) Random Agent
We provide the framework for the random agent and a `rollout` method that runs a policy for one episode.

In [None]:
def rollout(env, agent):
    state, _ = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = agent.action(state)
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        total_reward += reward
    return total_reward

class RandomAgent:
    def __init__(self, action_space, observation_space):
        self.action_space = action_space
        self.observation_space = observation_space
        
    # We pass the state only for compatibility
    def action(self, state):
        # your code goes here
        return None
    
def compute_avg_return(env, agent, num_episodes=5000):
    # your code goes here
    return avg_reward

Add your code to estimate the `avg_return_random_agent` for the deterministic case and `avg_return_random_agent_slippery` for the stochastic case!

In [None]:
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode=None)
# your code goes here

In [None]:
print("Estimation for the deterministic case:", avg_return_random_agent)
print("Estimation for the stochastic case:", avg_return_random_agent_slippery)

### 1. b) Iterative Policy Evaluation
We provide a `set_state` method that changes the state of the environment. This is a rather unusual way to interact with this framework. Note that the random policy is stochastic, while the environment is not. In the value update, we sum the values of all possible actions, each weighted by its probability of being picked by the agent. The architecture of the agent does not provide access to these inner dynamics, so instead of passing the agent or its dynamics as a variable, we implement iterative policy evaluation just for the random agent, with the probability of `0.25` for each action hard-coded.

We also provide `all_states` and `all_actions`, lists of all admissible states and actions for the environment.
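
The value update described above can be sketched on a much smaller problem. The following is only an illustration under stated assumptions: the four-cell corridor and the helper `step_model` are hypothetical and stand in for the FrozenLake dynamics, and the policy picks each of the two actions with probability `0.5`.

```python
def step_model(state, action):
    """Hypothetical deterministic model of a 4-cell corridor; cell 3 is the goal."""
    if state == 3:  # terminal state: no further reward can be collected
        return state, 0.0, True
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == 3 else 0.0
    return next_state, reward, next_state == 3


def evaluate_uniform_policy(states, actions, discount_rate,
                            threshold=1e-3, max_iter=10000):
    """Iterative policy evaluation for a policy that picks actions uniformly."""
    v = {s: 0.0 for s in states}
    action_prob = 1.0 / len(actions)
    for _ in range(max_iter):
        delta = 0.0
        for s in states:
            new_value = 0.0
            for a in actions:
                s_next, r, terminal = step_model(s, a)
                # Do not bootstrap from a terminal successor.
                bootstrap = 0.0 if terminal else v[s_next]
                new_value += action_prob * (r + discount_rate * bootstrap)
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < threshold:  # stop once a full sweep barely changes v
            break
    return v


v_corridor = evaluate_uniform_policy([0, 1, 2, 3], [0, 1], discount_rate=0.9)
print(v_corridor)
```

Cells closer to the goal end up with higher values, and the terminal cell keeps the value 0.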

In [None]:
all_states = list(range(env.observation_space.n))
all_actions = list(range(env.action_space.n))

def set_state(env, state):
    env.reset()
    env.unwrapped.s = state  # reach the underlying environment regardless of how many wrappers gym.make adds
    return env

def visualize_value_fct(v):
    print(np.round(np.array(list(v.values())).reshape((4,4)),3))

In [None]:
def iterative_policy_iteration_random_agent(env, all_states, all_actions, discount_rate, 
                                            threshold=0.001, max_iter=10000):
    v = {s: 0 for s in all_states}  # value function, initialized to 0
    # your code goes here
    return v

In [None]:
v_random = iterative_policy_iteration_random_agent(env, all_states, all_actions, discount_rate=0.9)
visualize_value_fct(v_random)

### 1. c) Value Iteration
Use value iteration to find the optimal policy!
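
As a reference point for what value iteration computes, here is a minimal sketch on a hypothetical five-cell corridor rather than on FrozenLake. The names `model`, `TERMINAL`, and `value_iteration_sketch` are assumptions made for this illustration; in the exercise, the transitions come from the environment instead.

```python
# Hypothetical corridor: cell 0 is a hole, cell 4 is the goal, both terminal.
TERMINAL = {0, 4}
ACTIONS = [0, 1]  # 0 = left, 1 = right


def model(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    next_state = state - 1 if action == 0 else state + 1
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward


def action_value(v, state, action, discount_rate):
    next_state, reward = model(state, action)
    return reward + discount_rate * v[next_state]  # v of terminal cells stays 0


def value_iteration_sketch(discount_rate, threshold=1e-3, max_iter=10000):
    v = {s: 0.0 for s in range(5)}
    for _ in range(max_iter):
        delta = 0.0
        for s in range(5):
            if s in TERMINAL:
                continue
            # Bellman optimality update: take the best action instead of averaging.
            best = max(action_value(v, s, a, discount_rate) for a in ACTIONS)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < threshold:
            break
    # The greedy policy with respect to v is optimal once v has converged.
    policy = {s: max(ACTIONS, key=lambda a: action_value(v, s, a, discount_rate))
              for s in range(5) if s not in TERMINAL}
    return v, policy


v_sketch, policy_sketch = value_iteration_sketch(discount_rate=0.9)
print(v_sketch, policy_sketch)
```

The only difference to policy evaluation is the `max` over actions in place of the probability-weighted sum.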

In [None]:
def value_iteration(env, all_states, all_actions, discount_rate, threshold=0.001, max_iter=10000):
    v = {s: 0 for s in all_states}  # value function, initialized to 0
    # your code goes here
    return v

In [None]:
v_optimal = value_iteration(env, all_states, all_actions, discount_rate=0.9)
visualize_value_fct(v_optimal)

### 2. a) Sarsa & Q-Learning
With the language of a Q-table, we can define a more general agent by a Q-function.

*Please do not use* `set_state` *anymore! Instead always start an episode with* `state, _ = env.reset()`!
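
The two algorithms differ only in the value they bootstrap from. The sketch below shows a single update step for each, assuming the Q-table is a dict of NumPy arrays as in `Discrete_Q_Agent`; the function names `sarsa_update` and `q_learning_update` are hypothetical and not part of the provided framework.

```python
import numpy as np


def sarsa_update(Q, state, action, reward, next_state, next_action, done,
                 alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the agent actually takes next."""
    bootstrap = 0.0 if done else Q[next_state][next_action]
    Q[state][action] += alpha * (reward + gamma * bootstrap - Q[state][action])


def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action in the next state."""
    bootstrap = 0.0 if done else np.max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * bootstrap - Q[state][action])


# Two tiny Q-tables with identical contents to compare the two updates.
Q_sarsa = {0: np.zeros(2), 1: np.array([0.5, 1.0])}
Q_qlearn = {0: np.zeros(2), 1: np.array([0.5, 1.0])}

sarsa_update(Q_sarsa, state=0, action=1, reward=0.0,
             next_state=1, next_action=0, done=False)
q_learning_update(Q_qlearn, state=0, action=1, reward=0.0,
                  next_state=1, done=False)
print(Q_sarsa[0], Q_qlearn[0])
```

With an exploratory next action, Sarsa uses the lower value `0.5`, while Q-Learning always uses the maximum `1.0`.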

In [None]:
def visualize_q_fct(q):
    acts = {0 : "L", 1 : "D", 2 : "R", 3 : "U"} 
    for j in range(4):
        print("Value for action", acts[j], ":")
        print(np.round(np.array([q[i][j] for i in range(16)]).reshape((4,4)), 3))
    for i in range(4):
        print([acts[np.argmax(q[4*i + j])] for j in range(4)])
        
def argmax_tiebreak(array):
    return np.random.choice(np.where(array == array.max())[0])

In [None]:
class Discrete_Q_Agent:
    def __init__(self, action_space, observation_space, epsilon=0.9):
        self.action_space = action_space
        self.observation_space = observation_space
        self.epsilon = epsilon
        self.reset_Q()
    
    def reset_Q(self):
        all_states = list(range(self.observation_space.n))
        self.actions = list(range(self.action_space.n))
        self.Q = {s: np.zeros(self.action_space.n) for s in all_states}

    def action(self, state):
        # your code goes here
        return action

In [None]:
def Sarsa(env, q_agent, alpha=0.1, gamma=0.99, rollouts=10000):
    # your code goes here
    return q_agent, q_agent.Q

In [None]:
def Q_Learning(env, q_agent, alpha=0.1, gamma=0.99, rollouts=10000):
    # your code goes here
    return q_agent, q_agent.Q

In [None]:
env_slippery = gym.make("FrozenLake-v1", is_slippery=True)
q_agent = Discrete_Q_Agent(env_slippery.action_space, env_slippery.observation_space, epsilon=0.9)
q_agent, q = Sarsa(env_slippery, q_agent)
visualize_q_fct(q)

In [None]:
env_slippery = gym.make("FrozenLake-v1", is_slippery=True)
q_agent = Discrete_Q_Agent(env_slippery.action_space, env_slippery.observation_space, epsilon=0.9)
q_agent, q = Q_Learning(env_slippery, q_agent)
visualize_q_fct(q)

### 2. b) Cartpole
Next, try the [Cartpole](https://www.gymlibrary.ml/environments/classic_control/cart_pole/) environment. It has a continuous state space, so we need to adjust our methods to accommodate that.

In [None]:
# your code goes here

### 2. c) Cartpole learning
The observation space of the Cartpole environment can be accessed with `env.observation_space`. It is a [`Box`](https://gymnasium.farama.org/api/spaces/fundamental/#box) space, which contains lower bounds, upper bounds, number of dimensions, and datatype. The second and fourth dimensions are unbounded. We can make them bounded by clipping every value beyond a certain threshold. Also, the first and third dimensions have wider admissible bounds than is useful during training!

Hint: Binned Q-Learning is not the most efficient or useful algorithm for this problem. With the provided hyperparameters I achieved only a mean reward of ~100 after 50000 rollouts of training without any further tuning. Can you achieve a better result by changing the hyperparameters or employing some additional technique?
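
One possible way to turn a continuous state into Q-table indices is sketched below. It uses the clip values and window sizes from the cell that follows; the helper `state_to_bins` is a hypothetical name, and rounding to the nearest bin is just one of several reasonable choices.

```python
import numpy as np

window_size = np.array([0.25, 0.25, 0.01, 0.1])
low_clip = np.array([-3.75, -3.75, -0.25, -2.5])
high_clip = np.array([3.75, 3.75, 0.25, 2.5])


def state_to_bins(state):
    """Clip the state, then map every dimension to an integer bin index."""
    clipped = np.clip(np.asarray(state, dtype=float), low_clip, high_clip)
    # Rounding instead of flooring avoids off-by-one errors from float division.
    indices = np.round((clipped - low_clip) / window_size).astype(int)
    return tuple(int(i) for i in indices)


q_table = np.zeros([31, 31, 51, 51, 2])

center = state_to_bins(np.zeros(4))
extreme = state_to_bins([100.0, -100.0, 1.0, -10.0])
print(center, extreme)
print(q_table[center])  # the two action values stored for that bin
```

The largest index per dimension is `(high_clip - low_clip) / window_size`, which matches the table shape of 31, 31, 51, and 51 bins.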

In [None]:
learning_rate = 0.1
discounting_rate = 0.95
number_episodes = 50000
total_reward = 0

q_table = np.zeros([31, 31, 51, 51, 2])
window_size = np.array([0.25, 0.25, 0.01, 0.1])
low_clip = [-3.75, -3.75, -0.25, -2.5]
high_clip = [3.75, 3.75, 0.25, 2.5]

# your code goes here

In [None]:
env = gym.make("CartPole-v1")
bagent = Binned_Q_Agent_Cartpole(window_size, q_table)
binned_q_learning(env, bagent, num_episodes=50000)

### 3. a) Linear function control
Implement linear gradient Sarsa here. Most of the time, the linear policy is able to solve the problem (500 reward) after a few thousand episodes, but sometimes it just does not converge. The algorithm is a bit shaky as is! I also needed to add one little tweak: normalize the state by clipping it, just as in the previous task, and then dividing by the clip value. This normalizes the state vectors to [-1,1] and stabilizes the algorithm.

Note that for a linear formulation of Q_theta(., a), Grad(Q_theta(., a)) at state vector s is just that state vector s.
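
Because the gradient is just the state vector, a single update step is short. The following sketch assumes `theta` has one row per action, as in `Linear_Q_Agent`; the function names `q_value` and `grad_sarsa_update` and the step size are assumptions for this illustration.

```python
import numpy as np


def q_value(theta, state, action):
    """Linear action value: dot product of the action's weights and the state."""
    return theta[action] @ state


def grad_sarsa_update(theta, state, action, reward, next_state, next_action,
                      done, alpha=0.01, gamma=0.99):
    """One semi-gradient Sarsa step; modifies theta in place."""
    target = reward
    if not done:
        target += gamma * q_value(theta, next_state, next_action)
    td_error = target - q_value(theta, state, action)
    # The gradient of theta[action] @ state w.r.t. theta[action] is the state.
    theta[action] += alpha * td_error * state
    return td_error


theta = np.zeros((2, 4))
s = np.array([0.1, -0.2, 0.3, 0.4])
s_next = np.array([0.2, -0.1, 0.2, 0.3])
err = grad_sarsa_update(theta, s, action=0, reward=1.0,
                        next_state=s_next, next_action=1, done=False)
print(err, theta)
```

Only the weights of the action that was taken change; the other row of `theta` stays untouched.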

In [None]:
class Linear_Q_Agent:
    def __init__(self, action_space, observation_space, epsilon=0.9):
        self.action_space = action_space
        self.observation_space = observation_space
        self.epsilon = epsilon
        self.theta = np.zeros((action_space.n, observation_space.shape[0]))
        
    def norm_state(self, state):
        norm_state = state
        norm_state = np.clip(norm_state,low_clip,high_clip)
        norm_state /= high_clip
        return norm_state

# your code goes here

In [None]:
lin_agent = Linear_Q_Agent(env.action_space, env.observation_space)
lin_agent = Grad_Sarsa(env, lin_agent, rollouts=10000)

### 3. b) DQN
As a suggestion, I provided the interfaces for functions, some hyperparameters, and the architecture of the neural net that approximates Q. For this algorithm to somewhat work, I needed at least experience replay. But other techniques may also be interesting and work even better. Please feel free to experiment!
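
Experience replay can be built on the `deque` imported at the top of the notebook. This is a minimal sketch, not the intended solution: the class `ReplayMemory` is a hypothetical helper, and in the exercise the same logic would live in `remember` and `learn_from_replay`.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Never ask for more transitions than are currently stored.
        batch_size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.remember(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = memory.sample(batch_size=2)
stored_states = [transition[0] for transition in memory.buffer]
print(len(memory), stored_states, batch)
```

Sampling uniformly from past transitions breaks the strong correlation between consecutive steps of one episode.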

*Note*: Whenever you call either `model.predict` or `model.fit`, you can gain a lot of performance by doing it as a batch. E.g. use 
```
X = []
y = []
for i in I:
    X.append(get_data(i))
    y.append(get_label(i))
model.fit(np.array(X), np.array(y))  # stack the lists into one batch array each
```
instead of
```
for i in I:
    model.fit(get_data(i), get_label(i))
```
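
The same batching idea applies to the training targets. The sketch below assumes that `q_current` and `q_next` each come from one `model.predict` call on the stacked states and next states of a replay batch; here they are replaced by small hand-written arrays, and the function name `batched_td_targets` is hypothetical.

```python
import numpy as np


def batched_td_targets(q_current, q_next, actions, rewards, dones, gamma=0.95):
    """Build the regression targets for a whole replay batch at once."""
    targets = q_current.copy()
    # No bootstrapping for transitions that ended the episode.
    bootstrap = gamma * q_next.max(axis=1) * (1.0 - dones)
    # Only the entry of the action that was taken receives a new target.
    targets[np.arange(len(actions)), actions] = rewards + bootstrap
    return targets


q_current = np.array([[0.0, 0.0], [1.0, 2.0]])  # stand-in for model.predict(states)
q_next = np.array([[1.0, 3.0], [5.0, 4.0]])     # stand-in for model.predict(next_states)
actions = np.array([1, 0])
rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 1.0])

targets = batched_td_targets(q_current, q_next, actions, rewards, dones, gamma=0.5)
print(targets)
```

The resulting array can be passed to a single `model.fit` call together with the stacked states.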

In [None]:
memory_size = 2000
epsilon = 0.05
learning_rate = 0.001

class DQN_Agent:
    def _init_model(self, state_dim, action_dim, learning_rate):
        model = Sequential()
        model.add(Dense(32, input_dim=state_dim, activation='relu'))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(action_dim, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
        return model
        
    def action(self, state):
        pass
    
    def remember(self, state, action, reward, next_state, done):
        pass

    def learn_from_replay(self, batch_size):
        pass
    
def DQN(env, agent, replay_batch_size=128, rollouts=2000):
    pass

### 3. c) Another one
Browse the [environments](https://gymnasium.farama.org/) to pick another challenge! Maybe even record a video with the [RecordVideo wrapper](https://gymnasium.farama.org/api/wrappers/misc_wrappers/#gymnasium.wrappers.RecordVideo)!