###  1. Write out the MP/MRP/MDP/Policy definitions and MRP/MDP Value Function definitions in your own style/notation (so you really internalize these concepts)

##### MP:
A Markov Process is a memoryless random process: the next state depends only on the current state, not on the history before it. It can be represented by a tuple $<S,P>$, where $S$ is a finite set of states and $P$ is a state transition probability matrix. The transition probability from $s$ to $s'$ is $P_{ss'}=\mathbb{P}[S_{t+1}=s'|S_{t}=s]$.

##### MRP: 
A Markov Reward Process is a Markov Process with rewards. It can be represented by a tuple $<S,P,R,\gamma>$, where $S$ is a finite set of states, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma$ is a discount factor $\in [0,1]$. The reward function of state $s$ is $R_s=\mathbb{E}[R_{t+1}|S_t=s]$.

##### MDP: 
A Markov Decision Process is a Markov Reward Process with decisions. It can be represented by a tuple $<S,A,P,R,\gamma>$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma$ is a discount factor $\in [0,1]$. The transition probability from $s$ to $s'$ under action $a$ is $P_{ss'}^a=\mathbb{P}[S_{t+1}=s'|S_{t}=s,A_{t}=a]$. The reward function of state $s$ with action $a$ is $R_s^a=\mathbb{E}[R_{t+1}|S_t=s, A_t=a]$.

##### MRP value function: 
The value function of an MRP is the expected return starting from state $s$.
$$v(s) = \mathbb{E}[G_t|S_t=s]$$
where $G_t$ is the total discounted reward from time step $t$ onwards:
$$G_t = R_{t+1}+\gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
where $R_{t+k+1}$ is the reward received $k+1$ time steps after $t$.
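
As a small illustration of the return formula, the sketch below evaluates the sum for a finite list of rewards (an infinite sum cannot be computed directly, so truncation to a finite episode is an assumption here). The function name `discounted_return` and the example rewards are hypothetical.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma^k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Work backwards so that each step applies G = R + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Rewards R_{t+1}, R_{t+2}, R_{t+3} of a made-up three-step episode
print(discounted_return([1.0, 2.0, 4.0], gamma=0.5))  # 1 + 0.5*2 + 0.25*4 = 3.0
```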

##### MDP value function: 
The value function of an MDP is the expected return starting from state $s$ and then following policy $\pi$.
$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s]$$
where $\pi$ is the policy, a distribution over actions given the current state:
$$\pi(a|s) = \mathbb{P}[A_t=a|S_t=s]$$

### 2. Think about the data structures/class design to represent MP/MRP/MDP/Policy/Value Functions and implement them with clear type declarations. 
Remember - your data structure/code design must resemble the Mathematical/notational formalism as much as possible. 
Specifically the data structure/code design of MRP/MDP should be incremental (and not independent) to that of MP/MRP.

In [None]:
from typing import List, Mapping, Set, Sequence
from utils.generic_typevars import S

class MP:
    """A class representing a Markov Process <S, P>"""
    def __init__(self, transitions: Mapping[S, Mapping[S, float]]) -> None:
        """transitions: maps each state s to a dictionary {s': P(s, s')}, i.e. one row of the transition matrix"""
        self.all_states_list: List[S] = list(transitions.keys())  # names of all states
        self.transitions = transitions  # transition matrix as a dictionary of dictionaries

    def get_all_states(self) -> List[S]:
        return self.all_states_list

    def get_transition(self, s: S) -> Mapping[S, float]:
        """Return the transition probabilities out of state s."""
        try:
            return self.transitions[s]
        except KeyError:  # a missing dictionary key raises KeyError, not ValueError
            raise KeyError(f"Invalid state: {s!r}") from None
            
    # TODO: remaining MP methods (unfinished)
            

In [None]:
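class MRP(MP):
    """A class representing a Markov Reward Process <S, P, R, gamma>"""
    def __init__(self, transitions: Mapping[S, Mapping[S, float]],
                 rewards: Mapping[S, float], gamma: float) -> None:
        super().__init__(transitions)  # super() must be called; it passes self implicitly
        self.rewards = rewards  # R(s): expected immediate reward in state s
        self.gamma = gamma  # discount factor in [0, 1]

    def get_reward(self, s: S) -> float:
        """Return the expected immediate reward R(s)."""
        try:
            return self.rewards[s]
        except KeyError:  # a missing dictionary key raises KeyError, not ValueError
            raise KeyError(f"Invalid state: {s!r}") from None

    def get_value_function(self):
        # Not implemented yet
        raise NotImplementedError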

In [None]:
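class MDP(MRP):
    """A class representing a Markov Decision Process <S, A, P, R, gamma>"""
    def __init__(self, transitions: Mapping[S, Mapping[S, float]], rewards, gamma, actions):
        super().__init__(transitions, rewards, gamma)  # super() must be called; it passes self implicitly
        self.actions = actions  # finite set of actions A

    def get_value_function(self):
        # Not implemented yet: v_pi depends on a policy, which this method does not receive
        raise NotImplementedError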
        

### 3. Separately implement the $r(s,s')$ and the $R(s) = \sum_{s'} p(s,s') * r(s,s')$ definitions of MRP

See functions in class MRP.

### 4. Write code to convert/cast the r(s,s') definition of MRP to the R(s) definition of MRP (put some thought into code design here)

See functions in class MRP.
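
One possible shape for the conversion, written as a standalone function rather than as a method: it takes the transition probabilities $p(s,s')$ and the pairwise rewards $r(s,s')$ as nested dictionaries and returns $R(s) = \sum_{s'} p(s,s') \cdot r(s,s')$. The name `expected_rewards`, the nested-dictionary layout for $r(s,s')$, and the weather example are assumptions made for illustration.

```python
from typing import Dict, Hashable, Mapping, TypeVar

S = TypeVar("S", bound=Hashable)


def expected_rewards(
    transitions: Mapping[S, Mapping[S, float]],
    pair_rewards: Mapping[S, Mapping[S, float]],
) -> Dict[S, float]:
    """Convert r(s, s') into R(s) = sum over s' of p(s, s') * r(s, s')."""
    return {
        # Every reachable s' must have a reward entry; a missing one raises KeyError
        s: sum(p * pair_rewards[s][s1] for s1, p in row.items())
        for s, row in transitions.items()
    }


# Hypothetical two-state example
p = {"sun": {"sun": 0.8, "rain": 0.2}, "rain": {"sun": 0.5, "rain": 0.5}}
r = {"sun": {"sun": 2.0, "rain": -1.0}, "rain": {"sun": 1.0, "rain": -3.0}}
print(expected_rewards(p, r))  # R(sun) is about 1.4, R(rain) is -1.0
```

Keeping the conversion outside the class means an MRP given in the $r(s,s')$ form can be cast once and then handed to any code that expects the $R(s)$ form.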

### 5. Write code to create a MRP given a MDP and a Policy
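
A minimal sketch of one way to do this, independent of the classes above. Averaging over the policy gives $P^{\pi}_{ss'} = \sum_{a} \pi(a|s) P_{ss'}^a$ and $R^{\pi}_s = \sum_{a} \pi(a|s) R_s^a$. The function name `mdp_to_mrp` and the nested-dictionary layouts (state, then action, then next state) are assumptions, not the notebook's own design.

```python
from collections import defaultdict
from typing import Dict, Hashable, Mapping, Tuple, TypeVar

S = TypeVar("S", bound=Hashable)
A = TypeVar("A", bound=Hashable)


def mdp_to_mrp(
    transitions: Mapping[S, Mapping[A, Mapping[S, float]]],
    rewards: Mapping[S, Mapping[A, float]],
    policy: Mapping[S, Mapping[A, float]],
) -> Tuple[Dict[S, Dict[S, float]], Dict[S, float]]:
    """Average the MDP dynamics and rewards over the policy pi(a|s)."""
    mrp_transitions: Dict[S, Dict[S, float]] = {}
    mrp_rewards: Dict[S, float] = {}
    for s, action_probs in policy.items():
        row: Dict[S, float] = defaultdict(float)
        reward = 0.0
        for a, pi_sa in action_probs.items():
            reward += pi_sa * rewards[s][a]  # contribution to R^pi(s)
            for s1, prob in transitions[s][a].items():
                row[s1] += pi_sa * prob  # contribution to P^pi(s, s')
        mrp_transitions[s] = dict(row)
        mrp_rewards[s] = reward
    return mrp_transitions, mrp_rewards


# Hypothetical two-state, two-action MDP with a uniform random policy
P = {
    "low": {"wait": {"low": 1.0}, "work": {"low": 0.5, "high": 0.5}},
    "high": {"wait": {"low": 0.5, "high": 0.5}, "work": {"high": 1.0}},
}
R = {"low": {"wait": 0.0, "work": -1.0}, "high": {"wait": 2.0, "work": 1.0}}
pi = {"low": {"wait": 0.5, "work": 0.5}, "high": {"wait": 0.5, "work": 0.5}}
mrp_P, mrp_R = mdp_to_mrp(P, R, pi)
print(mrp_P["low"], mrp_R["low"])  # {'low': 0.75, 'high': 0.25} -0.5
```

The discount factor $\gamma$ carries over from the MDP unchanged, so it is not part of the conversion.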

### 6. Write out the MDP/MRP Bellman Equations

1. Bellman Equation for $v_{\pi}$:

$$ v_{\pi}(s)=\sum_{a \in A} \pi(a|s)q_{\pi}(s,a)$$

2. Bellman Equation for $q_{\pi}$:

$$ q_{\pi}(s,a) = R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a v_{\pi}(s')$$

3.  Bellman Equation for $v_{\pi}$ (2):

$$ v_{\pi}(s)= \sum_{a \in A} \pi(a|s) \left ( R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a v_{\pi}(s') \right )$$

4. Bellman Equation for $q_{\pi}$ (2):

$$ q_{\pi}(s,a) = R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a \sum_{a' \in A} \pi(a'|s')q_{\pi}(s',a') $$
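
The four equations above cover the MDP case. For an MRP there are no actions, so the corresponding equation, written with the symbols $R_s$ and $P_{ss'}$ from question 1, takes the following form:

5. Bellman Equation for $v$ (MRP):

$$ v(s) = R_s + \gamma\sum_{s' \in S}P_{ss'} v(s')$$

In matrix form this reads $v = R + \gamma P v$, and hence $v = (I - \gamma P)^{-1} R$ whenever $I - \gamma P$ is invertible, which is guaranteed for $\gamma < 1$.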

### 7. Write code to calculate MRP Value Function (based on Matrix inversion method you learnt in this lecture)

See functions in class MRP.
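
As a standalone sketch of the matrix inversion method (independent of the classes above), the function below solves $(I - \gamma P)v = R$ for $v$. It assumes NumPy is available, that every state appears as a key of `transitions`, and that $I - \gamma P$ is invertible. It calls `np.linalg.solve` instead of forming the inverse explicitly, which gives the same result when the system is well conditioned. The name `mrp_value_function` and the example chain are hypothetical.

```python
from typing import Dict, Hashable, Mapping, TypeVar

import numpy as np

S = TypeVar("S", bound=Hashable)


def mrp_value_function(
    transitions: Mapping[S, Mapping[S, float]],
    rewards: Mapping[S, float],
    gamma: float,
) -> Dict[S, float]:
    """Solve v = R + gamma * P v, i.e. (I - gamma * P) v = R."""
    states = list(transitions.keys())
    # Dense transition matrix; transitions that are not listed have probability 0
    P = np.array([[transitions[s].get(s1, 0.0) for s1 in states] for s in states])
    R = np.array([rewards[s] for s in states])
    v = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
    return {s: float(x) for s, x in zip(states, v)}


# Hypothetical chain: "a" earns 1 per step until it falls into the absorbing state "b"
P = {"a": {"a": 0.5, "b": 0.5}, "b": {"b": 1.0}}
R = {"a": 1.0, "b": 0.0}
print(mrp_value_function(P, R, gamma=0.5))  # v(a) is about 4/3, v(b) is 0
```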

### 8. Write code to generate the stationary distribution for an MP

See functions in class MP.
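
A possible standalone sketch, independent of the classes above: a stationary distribution $\pi$ satisfies $\pi P = \pi$ with $\sum_s \pi(s) = 1$, which can be stacked into one linear system and solved by least squares. This assumes NumPy is available and that the chain has a unique stationary distribution (for example, it is irreducible). The name `stationary_distribution` and the weather example are hypothetical.

```python
from typing import Dict, Hashable, Mapping, TypeVar

import numpy as np

S = TypeVar("S", bound=Hashable)


def stationary_distribution(
    transitions: Mapping[S, Mapping[S, float]],
) -> Dict[S, float]:
    """Solve pi P = pi together with sum(pi) = 1."""
    states = list(transitions.keys())
    n = len(states)
    P = np.array([[transitions[s].get(s1, 0.0) for s1 in states] for s in states])
    # Rows 0..n-1 encode (P^T - I) pi = 0; the last row encodes sum(pi) = 1
    lhs = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return {s: float(x) for s, x in zip(states, pi)}


# Hypothetical two-state weather chain
P = {"sun": {"sun": 0.9, "rain": 0.1}, "rain": {"sun": 0.5, "rain": 0.5}}
print(stationary_distribution(P))  # about 5/6 for "sun" and 1/6 for "rain"
```

An alternative is to take the eigenvector of $P^T$ for eigenvalue 1 and normalize it to sum to 1; the least-squares form is used here because it enforces the normalization directly.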