#### Authors and Group members
Time spent:
Nima Hansen: 14 hours 
Kailash de Jesus Hornig: 14 hours


# DAT405 Introduction to Data Science and AI
## 2022-2023, Reading Period 1
## Assignment 5: Reinforcement learning and classification
There will be an overall grade for this assignment. To get a pass grade (grade 3), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well.

The exercise takes place in a notebook environment where you can choose to use Jupyter or Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical.
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells: use the former to document and explain your results, and the latter to write code snippets that execute the required tasks.

This assignment is about **sequential decision making** under uncertainty (Reinforcement learning). In a sequential decision process, the process jumps between different states (the environment), and in each state the decision maker, or agent, chooses among a set of actions. Given the state and the chosen action, the process jumps to a new state. At each jump the decision maker receives a reward, and the objective is to find a sequence of decisions (or an optimal policy) that maximizes the accumulated rewards.

We will use **Markov decision processes** (MDPs) to model the environment, and below is a primer on the relevant background theory. The assignment can be divided in two parts:



* To make things concrete, we will first focus on decision making under **no** uncertainty, i.e., given that we have a world model, we can calculate the exact and optimal actions to take in it. We will first introduce the **Markov Decision Process (MDP)** as the world model. Then we give one algorithm (out of many) to solve it.


* Next, we will work through one type of reinforcement learning algorithm called Q-learning. Q-learning is an algorithm for making decisions under uncertainty, where the uncertainty is over the possible world model (here the MDP). It will find the optimal policy for the **unknown** MDP, assuming we do infinite exploration.

## Primer
### Decision Making
The problem of **decision making under uncertainty** (commonly known as **reinforcement learning**) can be broken down into
two parts. First, how do we learn about the world? This involves both the problem of modeling our initial uncertainty about the world, and that of drawing conclusions from evidence and our initial belief. Secondly, given what we currently know about the world, how should we decide what to do, taking into account future events and observations that may change our conclusions?
Typically, this will involve creating long-term plans covering possible future eventualities. That is, when planning under uncertainty, we also need to take into account what possible future knowledge could be generated when implementing our plans. Intuitively, executing plans which involve trying out new things should give more information, but it is hard to tell whether this information will be beneficial. The choice between doing something which is already known to produce good results and experimenting with something new is known as the **exploration-exploitation dilemma**.

### The exploration-exploitation trade-off

Consider the problem of selecting a restaurant to go to during a vacation. Let's say the
best restaurant you have found so far was **Les Epinards**. The food there is
usually to your taste and satisfactory. However, a well-known recommendations
website suggests that **King’s Arm** is really good! It is tempting to try it out. But
there is a risk involved. It may turn out to be much worse than **Les Epinards**,
in which case you will regret going there. On the other hand, it could also be
much better. What should you do?
It all depends on how much information you have about either restaurant,
and how many more days you’ll stay in town. If this is your last day, then it’s
probably a better idea to go to **Les Epinards**, unless you are expecting **King’s
Arm** to be significantly better. However, if you are going to stay there longer,
trying out **King’s Arm** is a good bet. If you are lucky, you will be getting much
better food for the remaining time, while otherwise you will have missed only
one good meal out of many, making the potential risk quite small.

### Markov Decision Processes
Markov Decision Processes (MDPs) provide a mathematical framework for modeling sequential decision making under uncertainty. An *agent* moves between *states* in a *state space*, choosing *actions* that affect the transition probabilities between states and the subsequent *rewards* received after a jump. This is then repeated for a finite or infinite number of epochs. The objective, or the *solution* of the MDP, is to optimize the accumulated rewards of the process.

Thus, an MDP consists of five parts: 

* Decision epochs: $t \in \{1,2,...,T\}$, where $T\leq \infty$
* State space: $S=\{s_1,s_2,...,s_N\}$ of the underlying environment
* Action space $A=\{a_1,a_2,...,a_K\}$ available to the decision maker at each decision epoch
* Transition probabilities $p(s_{t+1}|s_t,a_t)$ for jumping from state $s_t$ to state $s_{t+1}$ after taking action $a_t$
* Reward functions $R_t = r(s_t,a_t,s_{t+1})$ resulting from the chosen action and subsequent transition

A *decision policy* is a function $\pi: S \rightarrow A$ that gives instructions on what action to choose in each state. A policy can either be *deterministic*, meaning that the action is given for each state, or *randomized*, meaning that there is a probability distribution over the set of possible actions for each state. Given a specific policy $\pi$ we can then compute the *expected total reward* when starting in a given state $s_1 \in S$, which is also known as the *value* for that state, 

$$V^\pi (s_1) = E\left[ \sum_{t=1}^{T} r(s_t,a_t,s_{t+1}) {\Large |} s_1\right] = \sum_{t=1}^{T} r(s_t,a_t,s_{t+1}) p(s_{t+1} | a_t,s_t)$$ 

where $a_t = \pi(s_t)$. To ensure convergence and to control how much credit to give to future rewards, it is common to introduce a *discount factor* $\gamma \in [0,1]$. For instance, if we think all future rewards should count equally, we would use $\gamma = 1$, while if we value near-future rewards higher than more distant rewards, we would use $\gamma < 1$. The expected total *discounted* reward then becomes

$$V^\pi( s_1) = \sum_{t=1}^T \gamma^{t-1} r(s_t,a_t, s_{t+1}) p(s_{t+1} | s_t, a_t) $$
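As a small illustration of the discounting above, the following sketch computes the discounted sum for one fixed reward sequence, i.e. the rewards realized along a single trajectory rather than an expectation. The function name `discounted_return` and the reward values are assumptions made for the example only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * r_t for t = 1..T along one trajectory."""
    # enumerate starts at 0, so the first reward r_1 gets weight gamma**0.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]  # hypothetical rewards r_1, r_2, r_3

undiscounted = discounted_return(rewards, gamma=1.0)  # all rewards count equally
discounted = discounted_return(rewards, gamma=0.5)    # later rewards count less
print(undiscounted, discounted)
```

With $\gamma = 1$ every reward has weight 1, while with $\gamma = 0.5$ the weights are 1, 0.5 and 0.25.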

Now, to find the *optimal* policy we want to find the policy $\pi^*$ that gives the highest total reward $V^*(s)$ for all $s\in S$. That is, we want to find the policy where

$$V^*(s) \geq V^\pi(s), s\in S$$

To solve this we use a dynamic programming equation called the *Bellman equation*, given by

$$V(s) = \max_{a\in A} \left\{\sum_{s'\in S} p(s'|s,a)( r(s,a,s') +\gamma V(s')) \right\}$$

It can be shown that if $\pi$ is a policy such that $V^\pi$ fulfills the Bellman equation, then $\pi$ is an optimal policy.
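To make the right-hand side of the Bellman equation concrete, here is a minimal sketch of a single Bellman backup for one state. The two-state MDP, the dictionaries `P`, `R`, `V` and the helper `bellman_backup` are all hypothetical and only serve as an illustration.

```python
def bellman_backup(state, actions, P, R, V, gamma):
    """Return max over a of sum_s' p(s'|s,a) * (r(s,a,s') + gamma * V(s'))."""
    best = float("-inf")
    for a in actions:
        q = sum(p * (R[(state, a, s_next)] + gamma * V[s_next])
                for s_next, p in P[(state, a)].items())
        best = max(best, q)
    return best

# Hypothetical two-state MDP, invented for illustration only.
P = {("s1", "stay"): {"s1": 1.0},
     ("s1", "go"): {"s1": 0.5, "s2": 0.5}}
R = {("s1", "stay", "s1"): 1.0,
     ("s1", "go", "s1"): 0.0,
     ("s1", "go", "s2"): 4.0}
V = {"s1": 0.0, "s2": 2.0}  # current value estimates

backed_up = bellman_backup("s1", ["stay", "go"], P, R, V, gamma=0.5)
print(backed_up)
```

Here "stay" is worth $1$ and "go" is worth $0.5 \cdot 0 + 0.5 \cdot (4 + 0.5 \cdot 2) = 2.5$, so the backup returns the value of "go".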

A real world example would be an inventory control system. The states could be the amount of items we have in stock, and the actions would be the amount of items to order at the end of each month. The discrete time would be each month and the reward would be the profit. 


## Question 1

The first question covers a deterministic MDP, where the action is directly given by the state, described as follows:

* The agent starts in state **S** (see table below)
* The possible actions are **N** (north), **S** (south), **E** (east), and **W** (west). 
* The transition probabilities in each box are uniform. Note, however, that you cannot move outside the grid, thus all actions are not available in every box.
* When reaching **F**, the game ends (absorbing state).
* The numbers in the boxes represent the rewards you receive when moving into that box. 
* Assume no discount in this model: $\gamma = 1$
    
| | | |
|----------|----------|---------|
|-1 |1|**F**|
|0|-1|1|  
|-1 |0|-1|  
|**S**|-1|1|

Let $(x,y)$ denote the position in the grid, such that $S=(0,0)$ and $F=(2,3)$.

**1a)** What is the optimal path of the MDP above? Is it unique? Submit the path as a single string of directions. E.g. NESW will make a circle.

**1b)** What is the optimal policy (i.e. the optimal action in each state)?

**1c)** What is the expected total reward for the policy in 1b)?


#### Answer 1a)
a) 
The optimal path is EENNN, which gives a total reward of 0. There are alternative paths, e.g. EENNWNE, that also give a total reward of 0, but they require more steps. So the path is unique in the sense that no other path with the same number of steps gives the same total reward. However, Marina describes an optimal path as one that obtains the highest possible reward, and several paths give a total of 0, so in that sense it is not unique.
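The totals quoted for the two paths can be checked with a short sketch like the one below. It assumes that the reward for entering **F** is 0 (the table does not state it), and the names `GRID_ROWS_TOP_DOWN` and `path_reward` are made up for this illustration.

```python
# Rewards for moving INTO each cell, indexed as (x, y) with S = (0, 0).
# Assumption: the reward for entering F is 0, since the text does not state it.
GRID_ROWS_TOP_DOWN = [
    [-1, 1, 0],   # y = 3, F at (2, 3)
    [0, -1, 1],   # y = 2
    [-1, 0, -1],  # y = 1
    [0, -1, 1],   # y = 0, S at (0, 0)
]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def path_reward(path, start=(0, 0), goal=(2, 3)):
    """Total undiscounted reward of a path; raises if it leaves the grid."""
    x, y = start
    total = 0
    for step in path:
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
        if not (0 <= x <= 2 and 0 <= y <= 3):
            raise ValueError(f"path leaves the grid at step {step!r}")
        total += GRID_ROWS_TOP_DOWN[3 - y][x]
    if (x, y) != goal:
        raise ValueError("path does not end in F")
    return total

short = path_reward("EENNN")
longer = path_reward("EENNWNE")
print(short, longer)
```

Both paths end in **F** with the same total, which is the sense in which the optimal path is not unique.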

b) 
- (0,0) = E
- (0,1) = E & N
- (0,2) = E & N
- (0,3) = E
- (1,0) = E
- (1,1) = N & E & S
- (1,2) = N & E
- (1,3) = E
- (2,0) = N
- (2,1) = N
- (2,2) = N
- (2,3) = Nothing, absorbing state

Useful definition: "optimal policy" above interpreted as -> set of rules for which action to take in each state for the max reward in the long run. (Lecture 9)

c) 
$$V^\pi (s_1) = E\left[ \sum_{t=1}^{T} r(s_t,a_t,s_{t+1}) {\Large |} s_1\right] = \sum_{t=1}^{T} r(s_t,a_t,s_{t+1}) p(s_{t+1} | a_t,s_t)$$  <br>

(0,0): a = E    V = 0 + (-1) = -1,

(1,0): a = E    V = (-1) + 1 = 0,

(2,0): a = N    V = 1 + (-1) = 0,

(2,1): a = N    V = (-1) + 1 = 0,

(2,2): a = N    V = 1 + 0 = 1


Thus, the expected total reward is -1+0+0+0+1=0.


## Value Iteration

For larger problems we need to utilize algorithms to determine the optimal policy $\pi^*$. *Value iteration* is one such algorithm that iteratively computes the value for each state. Recall that for a policy to be optimal, it must satisfy the Bellman equation above, meaning that plugging in a given candidate $V^*$ in the right-hand side (RHS) of the Bellman equation should result in the same $V^*$ on the left-hand side (LHS). This property will form the basis of our algorithm. Essentially, it can be shown that repeated application of the RHS to any initial value function $V^0(s)$ will eventually lead to the value $V$ which satisfies the Bellman equation. Hence repeated application of the Bellman equation will also lead to the optimal value function. We can then extract the optimal policy by simply noting which actions satisfy the equation.    

The process of repeated application of the Bellman equation is what we here call the _value iteration_ algorithm. In practice it proceeds as follows:

```
epsilon is a small value, threshold
initialize V_0[s] arbitrarily (e.g. 0) for each state s
for k from 1 to infinity
do
    for each state s
    do
        V_k[s] = max_a Σ_s′ p(s′|s,a)*(r(s,a,s′) + γ*V_k−1[s′])
    end
    if |V_k[s] − V_k−1[s]| < epsilon for all s
        for each state s
        do
            π(s) = argmax_a Σ_s′ p(s′|s,a)*(r(s,a,s′) + γ*V_k−1[s′])
        end
        return π, V_k
    end
end
```

**Example:** We will illustrate the value iteration algorithm by going through two iterations. Below is a 3x3 grid with the rewards given in each state. Assume now that given a certain state $s$ and action $a$, there is a probability 0.8 that the action will be performed and a probability 0.2 that no action is taken. For instance, if we take action **E** in state $(x,y)$ we will go to $(x+1,y)$ 80 percent of the time (given that the action is available in that state), and remain still 20 percent of the time. We will use a discount factor $\gamma = 0.9$. Let the initial value be $V^0(s)=0$ for all states $s\in S$. 

| | | |  
|----------|----------|---------|  
|0|0|0|
|0|10|0|  
|0|0|0|  


**Iteration 1**: The first iteration is trivial, $V^1(s)$ becomes the $\max_a \sum_{s'} p(s'|s,a) r(s,a,s')$ since $V^0$ was zero for all $s'$. The updated values for each state become

| | | |  
|----------|----------|---------|  
|0|8|0|
|8|2|8|  
|0|8|0|  
  
**Iteration 2**:  
  
Starting with cell (0,0) (lower left corner): We find the expected value of each move:  
Action **S**: 0  
Action **E**: 0.8( 0 + 0.9 \* 8) + 0.2(0 + 0.9 \* 0) = 5.76  
Action **N**: 0.8( 0 + 0.9 \* 8) + 0.2(0 + 0.9 \* 0) = 5.76  
Action **W**: 0

Hence either **E** or **N** is the best action at this stage.

Similarly for cell (1,0):

Action **N**: 0.8( 10 + 0.9 \* 2) + 0.2(0 + 0.9 \* 8) = 10.88 (Action **N** is the maximizing action)  

Similar calculations for remaining cells give us:

| | | |  
|----------|----------|---------|  
|5.76|10.88|5.76|
|10.88|8.12|10.88|  
|5.76|10.88|5.76|  
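The two hand calculations above can be reproduced with a few lines of code. This is only a sketch of the arithmetic for iteration 2, not a full value iteration, and the helper name `expected_value` is made up for the example.

```python
gamma = 0.9
p_move, p_stay = 0.8, 0.2

def expected_value(r_next, v_next, r_here, v_here):
    """Expected value of one action: move with prob. 0.8, stay with prob. 0.2."""
    return p_move * (r_next + gamma * v_next) + p_stay * (r_here + gamma * v_here)

# Cell (0,0), action E: the neighbour has reward 0 and V^1 = 8; own reward 0, V^1 = 0.
corner = expected_value(r_next=0, v_next=8, r_here=0, v_here=0)

# Cell (1,0), action N: the centre has reward 10 and V^1 = 2; own reward 0, V^1 = 8.
edge = expected_value(r_next=10, v_next=2, r_here=0, v_here=8)

print(round(corner, 2), round(edge, 2))
```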


## Question 2

**2a)** Code the value iteration algorithm just described here, and show the converging optimal value function and the optimal policy for the above 3x3 grid.

**2b)** Explain why the result of 2a) does not depend on the initial value $V_0$.

In [1]:
#Imports
import numpy as np

# Set value iteration parameters
d_factor = 0.9 # Discount factor 
p_action = 0.8 # Probability of making an action
p_no_action = 0.2 #probability of not making an action
eps = 0.001 # Epsilon, this is the tolerance for precision
max_iter = 100000 # some big value ("infinity")

# First, we need to define our state space
state_space = []
# Nested for loop over the 3x3 board, coordinates (x, y)
for x in range(3):
    for y in range(3):
        state_space.append((x, y))

# Initialize the policy with None values
pi = {}
for i in state_space:
    pi[i] = None

# Setting all possible actions for each state
poss_actions = {
    (0,0):('N', 'E'), 
    (0,1):('N', 'E', 'S'),    
    (0,2):('S', 'E'),
    (1,0):('W', 'E', 'N'),
    (1,1):('S', 'W', 'E', 'N'),
    (2,0):('W', 'N'),
    (2,1):('W', 'N', 'S'),
    (1,2):('W', 'E', 'S'),
    (2,2):('S', 'W'),
    }

#Rewards for all states (only 1,1 carries a non zero value)
rewards = {}
for i in state_space:
    if i == (1,1): rewards[i] = 10
    else: rewards[i] = 0

# Creating a 3x3 dictionary with coordinates as keys, values = expected values      
V = {}
for i in state_space: # initiate with values 0
   V[i] = 0

#Creating the algorithm
for i in range(max_iter):
    #to not read from the changed V when iterating through states, only in a new iteration
    copied_V = V.copy()
    #initialize a max diff to know when to stop
    max_diff = 0
    
    #for each state in our state space
    for s in state_space:
        #old expected value in state (needed for calculating abs diff)
        vk_0 = V[s]
        #latest expected value in state for a specific action
        vk_1 = 0 

        #For each possible action of the specific state
        #calculate expected value ("v") for a certain action
        #if this is bigger than the "v" previous action/actions generated
        #set vk_1 = "v" and set this as action/policy for the specific state
        #After iterating all actions, we have the policy/action providing the highest "v"
        #for the specific state, then set V[s] = this expected value

        # Defining how action impact coordinates
        for action in poss_actions[s]:   
            if action == 'W':
                next_s = [s[0]-1, s[1]]
            if action == 'E':
                next_s = [s[0]+1, s[1]]
            if action == 'S':
                next_s = [s[0], s[1]-1]
            if action == 'N':
                next_s = [s[0], s[1]+1]

            #calculating the expected value for the action
            #tuple function added to get the right format
            v = p_action*(rewards[tuple(next_s)]+d_factor*copied_V[tuple(next_s)]) + p_no_action*(rewards[s]+d_factor*copied_V[s])

            #if v>vk_1 this is a better policy for state s 
            if v > vk_1: 
                vk_1 = v #update the latest expected value to the higher value
                pi[s] = action #Add the policy/action 

        #update the expected value in the dictionary for given state
        V[s] = vk_1 

        #set the max diff to the greatest value of max diff or |V_k[s]-V_k-1[s]|
        #This will make sure that we get the maximum difference of |V_k[s]-V_k-1[s]| for all states
        max_diff= max(max_diff, np.abs(vk_0 - V[s]))

    #Print only the first 5 iterations and the iteration where we reached convergence
    if(i<5):
        print("\nIteration nr " + str(i+1) + ": ")
        print("printed V: ", V)
        print("printed optimal policy: ", pi)
        
    if max_diff < eps: # terminating process
        print("\nIteration nr " + str(i+1) + ": ")
        print("printed V: ", V)
        print("printed optimal policy: ", pi)
        break

# source of inspiration: 
#https://towardsdatascience.com/implement-value-iteration-in-python-a-minimal-working-example-f638907f3437 



Iteration nr 1: 
printed V:  {(0, 0): 0, (0, 1): 8.0, (0, 2): 0, (1, 0): 8.0, (1, 1): 2.0, (1, 2): 8.0, (2, 0): 0, (2, 1): 8.0, (2, 2): 0}
printed optimal policy:  {(0, 0): None, (0, 1): 'E', (0, 2): None, (1, 0): 'N', (1, 1): 'S', (1, 2): 'S', (2, 0): None, (2, 1): 'W', (2, 2): None}

Iteration nr 2: 
printed V:  {(0, 0): 5.760000000000001, (0, 1): 10.88, (0, 2): 5.760000000000001, (1, 0): 10.88, (1, 1): 8.120000000000001, (1, 2): 10.88, (2, 0): 5.760000000000001, (2, 1): 10.88, (2, 2): 5.760000000000001}
printed optimal policy:  {(0, 0): 'N', (0, 1): 'E', (0, 2): 'S', (1, 0): 'N', (1, 1): 'S', (1, 2): 'S', (2, 0): 'W', (2, 1): 'W', (2, 2): 'S'}

Iteration nr 3: 
printed V:  {(0, 0): 8.870400000000002, (0, 1): 15.804800000000002, (0, 2): 8.870400000000002, (1, 0): 15.804800000000002, (1, 1): 11.295200000000001, (1, 2): 15.804800000000002, (2, 0): 8.870400000000002, (2, 1): 15.804800000000002, (2, 2): 8.870400000000002}
printed optimal policy:  {(0, 0): 'N', (0, 1): 'E', (0, 2): 'S', 

#### Comments on prints: 
According to the theory above, we obtain the optimal policy by noting which actions satisfy the Bellman equation. At convergence (iteration 82, given our epsilon) we have reached the optimal value function, so the policy printed in iteration 82 is the optimal policy. Each state is represented as a coordinate, e.g. (0,0), which has the optimal action "N".

## Answer Q2

2b) Explain why the result of 2a) does not depend on the initial value V0

Because we use a discount factor $\gamma < 1$, value iteration always converges to the optimal value function: in each iteration the previous estimates are multiplied by $\gamma$, so the influence of the initial values shrinks geometrically (after $k$ iterations it is scaled by at most $\gamma^k$). In addition, there is only one value function that fulfills the Bellman equation $$V(s) = \max_{a\in A} \left\{\sum_{s'\in S} p(s'|s,a)( r(s,a,s') +\gamma V(s')) \right\}$$ and by definition this is the optimal value function. The initial value does not affect this fixed point; it only affects the number of iterations needed to reach it. The optimal policy is still the policy that solves this equation, so the result does not change.
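The independence from the starting values can also be checked numerically. The sketch below is a compact, self-contained re-implementation of value iteration for the 3x3 grid (it is not the code cell from 2a, and names such as `value_iteration` and `neighbours` are chosen for this example). It runs the algorithm from two different constant initial values and compares the results.

```python
GAMMA, P_MOVE, P_STAY, EPS = 0.9, 0.8, 0.2, 1e-6
STATES = [(x, y) for x in range(3) for y in range(3)]
REWARD = {s: (10 if s == (1, 1) else 0) for s in STATES}
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def neighbours(s):
    """States reachable from s by the actions that stay inside the grid."""
    result = []
    for dx, dy in MOVES.values():
        nxt = (s[0] + dx, s[1] + dy)
        if nxt in REWARD:
            result.append(nxt)
    return result

def value_iteration(v_init):
    """Repeat the Bellman update from a constant start value until the change is below EPS."""
    V = {s: v_init for s in STATES}
    while True:
        new_V = {
            s: max(P_MOVE * (REWARD[n] + GAMMA * V[n])
                   + P_STAY * (REWARD[s] + GAMMA * V[s])
                   for n in neighbours(s))
            for s in STATES
        }
        if max(abs(new_V[s] - V[s]) for s in STATES) < EPS:
            return new_V
        V = new_V

from_zero = value_iteration(0.0)
from_hundred = value_iteration(100.0)
largest_gap = max(abs(from_zero[s] - from_hundred[s]) for s in STATES)
print(largest_gap)
```

The largest difference between the two results is of the same order as the stopping threshold, so both runs end at the same value function up to the chosen precision.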

## Reinforcement Learning (RL)
Until now, we understood that knowing the MDP, specifically $p(s'|a,s)$ and $r(s,a,s')$, allows us to efficiently find the optimal policy using the value iteration algorithm. Reinforcement learning (RL), or decision making under uncertainty, however, arises from the question of making optimal decisions without knowing the true world model (the MDP in this case).

So far we have defined the value function for a policy through $V^\pi$. Let's now define the *action-value function*

$$Q^\pi(s,a) = \sum_{s'} p(s'|a,s) [r(s,a,s') + \gamma V^\pi(s')]$$

i.e., the value of taking action $a$ in state $s$ and then following the policy $\pi$ onwards. For a deterministic policy, the value function and the action-value function are directly related through

$$V^\pi (s) = Q^\pi (s,\pi(s))$$

Similarly to the value function, the optimal $Q$-value equation is:

$$Q^*(s,a) = \sum_{s'} p(s'|a,s) [r(s,a,s') + \gamma V^*(s')]$$

and the relationship between $Q^*(s,a)$ and $V^*(s)$ is simply

$$V^*(s) = \max_{a\in A} Q^*(s,a).$$
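In a tabular setting, this last relationship is a row-wise maximum over the Q-table. The sketch below uses a made-up table with three states and two actions; the numbers are not from any environment in this assignment.

```python
# Hypothetical Q-table: one row per state, one column per action (a, b).
Q = [
    [1.0, 3.0],
    [2.5, 0.5],
    [4.0, 4.5],
]
ACTIONS = ["a", "b"]

# V*(s) = max_a Q*(s, a); the greedy policy picks the maximising action.
V = [max(row) for row in Q]
policy = [ACTIONS[row.index(max(row))] for row in Q]

print(V)
print(policy)
```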

## Q-learning

Q-learning is an RL method where the agent learns about its unknown environment (i.e. the MDP is unknown) through exploration. In each time step *t* the agent chooses an action *a* based on the current state *s*, observes the reward *r* and the next state *s'*, and repeats the process in the new state. Q-learning is then a method that allows the agent to act optimally. Here we will focus on the simplest form of Q-learning algorithms, which can be applied when all states are known to the agent, and the state and action spaces are reasonably small. This simple algorithm uses a table of Q-values for each $(s,a)$ pair, which is then updated in each time step using the update rule in step $k+1$

$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha \left( r(s,a) + \gamma \max_{a'} Q_k(s',a') - Q_k(s,a) \right) $$ 

where $\gamma$ is the discount factor as before, and $\alpha$ is a pre-set learning rate. It can be shown that this algorithm converges to the optimal policy of the underlying MDP for certain values of $\alpha$ as long as there is sufficient exploration. For our case, we set a constant $\alpha=0.1$.
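A single application of the update rule can be written as a small function. This is a minimal sketch using plain Python lists; the function name `q_update`, the table and the observed transition are invented for the example, and $\gamma = 0.5$ is used only to keep the arithmetic easy to follow.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new Q[s][a]."""
    td_target = r + gamma * max(Q[s_next])    # r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (td_target - Q[s][a])  # move Q(s, a) a step towards the target
    return Q[s][a]

# Hypothetical 2-state, 2-action table; all values invented for the example.
Q = [[0.0, 0.0],
     [1.0, 2.0]]

# Observed transition: in state 0, action 1 gave reward 10 and led to state 1.
new_value = q_update(Q, s=0, a=1, r=10.0, s_next=1, alpha=0.1, gamma=0.5)
print(new_value)
```

The target is $10 + 0.5 \cdot 2 = 11$, and with $\alpha = 0.1$ the entry moves one tenth of the way from 0 towards it. Only the entry for the visited state-action pair changes.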

## OpenAI Gym

We shall use already available simulators for different environments (worlds) from the popular [OpenAI Gym library](https://www.gymlibrary.dev/). It implements different types of simulators, including ATARI games, although here we will only focus on simple ones, such as the **Chain environment** illustrated below.
![alt text](https://chalmersuniversity.box.com/shared/static/6tthbzhpofq9gzlowhr3w8if0xvyxb2b.jpg)
The figure corresponds to an MDP with 5 states $S = \{1,2,3,4,5\}$ and two possible actions $A=\{a,b\}$ in each state. The arrows indicate the resulting transitions for each state-action pair, and the numbers correspond to the rewards for each transition.

## Question 3 
You are to first familiarize yourself with the framework of [the OpenAI environments](https://www.gymlibrary.dev/), and then implement the Q-learning algorithm for the <code>NChain-v0</code> environment depicted above, using default parameters and a discount factor of $\gamma=0.95$. Report the final $Q^*$ table after convergence of the algorithm. For an example on how to do this, you can refer to the Q-learning of the **Frozen lake environment** (<code>q_learning_frozen_lake.ipynb</code>), uploaded on Canvas. Hint: start with a small learning rate.

Note that the NChain environment is not available among the standard environments. You need to load the <code>gym_toytext</code> package in addition to the standard gym:

<code>
!pip install gym-legacy-toytext<br>
import gym<br>
import gym_toytext<br>
env = gym.make("NChain-v0")<br>
</code>

In [8]:
!pip install gym-legacy-toytext

import gym
import gym_toytext
env = gym.make("NChain-v0")

# The following code is run in order to avoid an error message later ('numpy.random._generator.Generator' object has no attribute 'rand')
#currently commented away for less echos
#!pip install ale-py==0.7.4
#!pip install git+https://github.com/DLR-RM/stable-baselines3


In [3]:
#According to framework: returns the type of actions and number of valid discrete actions
print(env.action_space)
#According to framework: returns the type of observations and number of valid observations
print(env.observation_space)

Discrete(2)
Discrete(5)


In the following code we have mainly reused the code from the provided reference (Q-learning on Frozen Lake), as we believe that is what was intended. In addition, the Q-learning section states "For our case, we set a constant $\alpha=0.1$", but here the hint is to start with a small learning rate, so we set the learning rate to a small number and check how well the algorithm converges.

In [4]:
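# Imports
import random as rd
import numpy as np
# Question 3

# Setting our parameters (using q_learning_frozen_lake as reference)
d_factor = 0.95  # discount factor (gamma)
learning_rate = 0.001  # small learning rate, following the hint in the task
eps = 0.5  # probability of taking a random (exploring) action
num_episodes = 5000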

# initialize the Q table
Q = np.zeros([5, 2])

for i in range(num_episodes):
	state = env.reset()
	done = False
	while done == False:
        # First we select an action:
		if rd.uniform(0, 1) < eps: # Flip a skewed coin
			action = env.action_space.sample() # Explore action space
		else:
			action = np.argmax(Q[state,:]) # Exploit learned values
        # Then we perform the action and receive the feedback from the environment
		new_state, reward, done, info = env.step(action)
        # Finally we learn from the experience by updating the Q-value of the selected action
		update = reward + (d_factor*np.max(Q[new_state,:])) - Q[state, action]
		Q[state,action] += learning_rate*update 
		state = new_state
	# Print the Q-table for the last 10 episodes to check whether it has plateaued
	if i>=4990:
		print("Epsiode number:", i)
		print(Q)

Epsiode number: 4990
[[61.38357798 60.44599567]
 [64.94594351 61.24950031]
 [69.6425965  62.54467745]
 [75.59138155 63.97079169]
 [83.82775874 65.89970627]]
Epsiode number: 4991
[[61.39318336 60.4586788 ]
 [64.93513604 61.26044397]
 [69.64322005 62.55270658]
 [75.61387796 63.94517717]
 [83.74194211 65.78222291]]
Epsiode number: 4992
[[61.39091731 60.47095877]
 [64.94314095 61.26577702]
 [69.65352377 62.56565451]
 [75.5330452  63.95081542]
 [83.73149688 65.79797257]]
Epsiode number: 4993
[[61.39060414 60.48529783]
 [64.91508879 61.27323325]
 [69.68618036 62.54476454]
 [75.53367501 63.92387893]
 [83.41157753 65.84369724]]
Epsiode number: 4994
[[61.38320872 60.49769521]
 [64.9122726  61.28694189]
 [69.71619233 62.54444021]
 [75.54866386 64.02740874]
 [83.57701871 65.96632251]]
Epsiode number: 4995
[[61.3878858  60.51082355]
 [64.89055845 61.32768486]
 [69.7574578  62.53181058]
 [75.61705675 64.02550221]
 [83.46355779 66.01277648]]
Epsiode number: 4996
[[61.39092647 60.51805982]
 [64.87964

By setting epsilon to 0.5, the agent explores the action space (takes a random action) in roughly 50% of the steps and exploits the learned Q-values in the remaining 50%. Looking at the printed Q-tables for episodes 4990 onwards, we do not see convergence to constant values. The change in a given entry between two consecutive episodes ranges from less than 0.001 up to roughly 0.3, and the entries sometimes decrease and sometimes increase.

## Question 4

**4a)** What is the importance of exploration in RL? Explain with an example.

**4b)** Explain what makes reinforcement learning different from supervised learning tasks such as regression or classification.


#### 4a answer) 
Exploration is important in RL because it is during exploration that the agent learns, by testing different actions in different states. Without exploration the agent would not learn anything about its environment or which action to choose, so some exploration is necessary. Exploitation instead means that the agent utilizes previous knowledge (its estimated values) to get the most reward or the least punishment. However, this is based only on the current estimates, which may not be the true values, and the agent might not know that there is a better reward because it has not explored enough, i.e. what the agent believes is the best action may not actually be the optimal action. 

An example of exploration from the real world could be a human going for a walk with their dog (the agent). The human gives the dog two candies (reward) if it sits and does not bark (action) when seeing another dog (state), but only one candy (reward) if it merely does not bark (action). With some exploration, let's say testing one random action, the dog would learn that not barking when meeting another dog provides one candy. By only exploiting this, the dog would always receive one candy, believing that this is the optimal action when meeting a new dog. However, this is not the optimal action: if the dog had explored other actions, such as sitting and not barking, it would have received two candies. A big difference from this simple example is that in reality we do not have just a small set of actions to choose from. In a problem with a small finite set of actions we could explore every alternative, but in the real world the dog could run, sit, bark, lie down, bite the owner, bite the other dog, and so on. So even though exploration is very important, the dog eventually reaches a point where further exploration might not be worth it, and sticking with the action that has given the highest reward (exploiting) is good enough, since it otherwise may not receive a reward at all. There is therefore a trade-off between exploring and exploiting. Exploring is always necessary, but it is important to be aware of the trade-off.
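The dog example can be simulated as a tiny two-action problem. The sketch below is an illustration under simplifying assumptions: there is a single state, the rewards are deterministic (one or two candies), the dog keeps a running average of the reward per action, and the names `run`, `REWARD` and the value `epsilon=0.1` are chosen for the example.

```python
import random

REWARD = {"no_bark": 1, "sit_no_bark": 2}  # candies per action, as in the example
ACTIONS = list(REWARD)

def run(epsilon, steps=1000, seed=0):
    """Total candies collected by an epsilon-greedy dog with sample-average estimates."""
    rng = random.Random(seed)
    estimate = {a: 0.0 for a in ACTIONS}
    count = {a: 0 for a in ACTIONS}
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)             # explore: try a random action
        else:
            action = max(ACTIONS, key=estimate.get)  # exploit (ties go to the first action)
        reward = REWARD[action]
        count[action] += 1
        estimate[action] += (reward - estimate[action]) / count[action]
        total += reward
    return total

greedy_total = run(epsilon=0.0)     # never explores, stays with the first action tried
exploring_total = run(epsilon=0.1)  # occasionally tries the other action
print(greedy_total, exploring_total)
```

The purely greedy dog never discovers that sitting pays two candies, while the exploring dog switches to the better action once it has tried it.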


#### 4b answer)
When using supervised learning we have an answer sheet telling us what is correct and what is not: we train our model on pre-labeled data, and this is in turn what our predictions are based on. In other words, to use supervised learning we need to have the right values (e.g. house prices) given in advance, together with the variables that help determine them (e.g. living area). It is worth mentioning that there is no exploration going on, as everything is known from the data. In contrast, when using reinforcement learning, no right answers are provided. The model/agent learns from the environment by interacting with and exploring it. From its previous decisions and their consequences (e.g. reward for right decisions, punishment for wrong decisions) the agent creates its own answers about which actions seem best in a given situation (state). Thereby, we can use an agent in an unknown environment, learning by trial and error.

An analogy to illustrate this difference could be: learning how life works by reading a book about it (with decisions, consequences and analysis) versus learning by experiencing life itself (learning iteratively by making decisions, facing their consequences yourself, and moving on from there). A book may hold an answer to how right or good a decision was and can provide this information directly, while life may not give as clearly defined and coherent answers to decisions. Not everything in life is a black and white label either, but more a list of pros and cons that come with every choice. 



## Question 5

**5a)** Give a summary of how a decision tree works and how it extends to random forests.

**5b)** State at least one advantage and one drawback with using random forests over decision trees.

#### 5a 
Decision trees can be used to support decisions, e.g. in classification. As the name suggests, a decision tree is a representation in the form of a tree-shaped diagram. First we have the root node, which represents the overall problem/question. Then from each node (if it is not a leaf), we have branches connecting to either internal nodes or leaves, based on the answer at the earlier node. An internal node contains another subquestion which, based on the answer, branches into another leaf or internal node. If we have reached a leaf, a decision can be made based on the previous answers and the traversal of the tree is complete. In classification, this means we receive a label. 

The extension to a random forest is done by using a set of trees, where the trees vote individually and the label (in a classification problem) is determined by the majority of the votes. The idea is that many trees (a forest) will result in a better decision process than a single tree. In more detail, an individual decision tree for the random forest is created by first creating a bootstrapped data set (rows sampled at random, with replacement, from the original set). Then a random subset of features (columns in the data set), for example two, is considered when creating first the root and later the internal nodes of each tree. The root split is chosen as the best candidate among these features for separating the samples in the bootstrapped data set. The rest of the tree is built by again choosing a random subset of features at each step to build the next internal node. This is repeated until a full tree is built. The leaves are the different end points (in classification, labels) of each path that can be followed through a decision tree. By repeating the building process, with its random steps, different trees are created and together they form a random forest. When later used, the random forest lets every tree classify by its own structure, and every tree gets a vote; as previously mentioned, the majority of the votes determines the end result.
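The bootstrap, random-feature and voting steps described above can be sketched in a few lines. This is a heavily simplified illustration and not how a real random forest library works: each "tree" is a single split (a stump) on one randomly chosen feature, the threshold is simply the midpoint between the two class means instead of an impurity-based split, and the data set and all names (`fit_stump`, `forest_predict`) are invented for the example.

```python
import random
from collections import Counter

def fit_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample, using one randomly chosen feature."""
    n = len(rows)
    idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample rows with replacement
    feature = rng.randrange(len(rows[0]))       # random feature for this tree
    by_class = {0: [], 1: []}
    for i in idx:
        by_class[labels[i]].append(rows[i][feature])
    if not by_class[0] or not by_class[1]:      # the sample contains a single class
        only = 0 if by_class[0] else 1
        return lambda row: only
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    threshold = (mean0 + mean1) / 2             # simplified split point
    high_class = 1 if mean1 > mean0 else 0
    return lambda row: high_class if row[feature] > threshold else 1 - high_class

def forest_predict(trees, row):
    """Each tree votes; the most common label wins."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: class 0 has small feature values, class 1 large ones.
rows = [(1, 2), (2, 1), (2, 3), (3, 2), (8, 9), (9, 8), (7, 9), (9, 7)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
trees = [fit_stump(rows, labels, rng) for _ in range(25)]
low_pred = forest_predict(trees, (2, 2))
high_pred = forest_predict(trees, (8, 8))
print(low_pred, high_pred)
```

An individual stump can be poor, for instance when its bootstrap sample happens to contain only one class, but the majority vote over many stumps is not affected by a few such trees.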

#### 5b 
Advantages of a random forest over a decision tree 
- Better accuracy and more robust, for relatively little extra work.
- Does not suffer from overfitting as much (averaging over many trees cancels out much of the error of the individual trees). It hence performs better on new data and is also more flexible.
- Can handle missing values (using either: 1. median values for continuous variables, 2. proximity-weighted averages of missing values).

Disadvantages of a random forest over a decision tree
- Time consuming, both when building the forest and when running the decision process. It is slow in generating predictions because it has multiple decision trees and a voting process on top of that. 
- Difficult to interpret compared to a single decision tree. The end result might not be explainable, and in some applications this might cause troublesome situations (e.g. health care, where critical and risky decisions are made). 


Source: 
- Lecture 10 in this course


# References
Primer/text based on the following references:
* http://www.cse.chalmers.se/~chrdimi/downloads/book.pdf
* https://github.com/olethrosdc/ml-society-science/blob/master/notes.pdf
