###  1. Write out the MP/MRP/MDP/Policy definitions and MRP/MDP Value Function definitions in your own style/notation (so you really internalize these concepts)

##### MP:
Markov Processes are memoryless random processes. A Markov Process can be represented by a tuple $<S,P>$, where $S$ is a finite set of states and $P$ is a state transition probability matrix. The transition probability from $s$ to $s'$ is $P_{ss'}=\mathbb{P}[S_{t+1}=s'|S_{t}=s]$.

##### MRP: 
A Markov Reward Process is a Markov Process with rewards. It can be represented by a tuple $<S,P,R,\gamma>$, where $S$ is a finite set of states, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma \in [0,1]$ is a discount factor. The reward function gives the expected immediate reward in state $s$: $R_s=\mathbb{E}[R_{t+1}|S_t=s]$.

##### MDP: 
A Markov Decision Process is a Markov Reward Process with decisions. It can be represented by a tuple $<S,A,P,R,\gamma>$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ is a reward function, and $\gamma \in [0,1]$ is a discount factor. The probability of moving from $s$ to $s'$ under action $a$ is $P_{ss'}^a=\mathbb{P}[S_{t+1}=s'|S_{t}=s,A_{t}=a]$. The expected immediate reward in state $s$ under action $a$ is $R_s^a=\mathbb{E}[R_{t+1}|S_t=s, A_t=a]$.

##### MRP value function: 
The value function of an MRP is the expected return starting from state $s$.
$$v(s) = \mathbb{E}[G_t|S_t=s]$$
where $G_t$ is the total discounted reward from time step $t$:
$$G_t = R_{t+1}+\gamma R_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$
and $R_{t+k+1}$ is the reward received $k+1$ time steps after $t$.
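
As a quick numeric illustration of the return formula, the sketch below sums a finite reward sequence with discounting. The helper name `discounted_return` and the sample rewards are made up for illustration, and a finite list necessarily truncates the infinite sum.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma^k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Accumulate backwards using G = r + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```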

##### MDP value function: 
The value function of an MDP is the expected return starting from state $s$ and then following policy $\pi$.
$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s]$$
where $\pi$ is the policy, which is a distribution over actions given the current state:
$$\pi(a|s) = \mathbb{P}[A_t=a|S_t=s]$$
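
One way to make $\pi(a|s)$ concrete is a nested dictionary from states to action probabilities. The sketch below is only an illustration: the state and action names and the helpers `is_valid_policy` and `sample_action` are hypothetical, not part of the class design in the next section.

```python
import random
from typing import Dict, Hashable

# Hypothetical stochastic policy: state -> {action: probability}
policy: Dict[Hashable, Dict[Hashable, float]] = {
    "s1": {"a": 0.1, "b": 0.9},
    "s2": {"a": 1.0},
}


def is_valid_policy(pi, tol: float = 1e-9) -> bool:
    """Check that pi(.|s) is a probability distribution for every state s."""
    return all(
        all(p >= 0.0 for p in dist.values()) and abs(sum(dist.values()) - 1.0) < tol
        for dist in pi.values()
    )


def sample_action(pi, s, rng: random.Random):
    """Draw A_t ~ pi(.|S_t = s)."""
    actions = list(pi[s].keys())
    weights = list(pi[s].values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(is_valid_policy(policy))
print(sample_action(policy, "s2", rng))  # "s2" has a single action, so this is always "a"
```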

### 2. Think about the data structures/class design to represent MP/MRP/MDP/Policy/Value Functions and implement them with clear type declarations. 
Remember - your data structure/code design must resemble the Mathematical/notational formalism as much as possible. 
Specifically the data structure/code design of MRP/MDP should be incremental (and not independent) to that of MP/MRP.

In [2]:
from typing import Mapping, Set, Sequence
from utils.generic_typevars import S, A
import numpy as np

class MP:
    """A class representing a Markov Process"""
    def __init__(self, transitions: Mapping[S, Mapping[S, float]]):
        """transitions: a dictionary of dictionaries that stores the transition matrix"""
        self.all_states_list = list(transitions.keys()) # a list that stores the names of all states
        self.transitions = transitions # a dictionary of dictionaries that stores the transition matrix
        
    def get_all_states(self):
        return self.all_states_list
        
    def get_transition(self, s: S):
        try:
            return self.transitions[s]
        except KeyError:  # a missing dictionary key raises KeyError, not ValueError
            print("Invalid state!")
            
    def get_trans_mat(self):
        length = len(self.all_states_list)
        trans_mat = np.zeros((length, length))
        for i in range(length):
            for j in range(length):
                trans_mat[i, j] = self.transitions[self.all_states_list[i]].get(self.all_states_list[j], 0)
        return trans_mat
            
    def get_stationary_dist(self):
        trans_mat = self.get_trans_mat()
        eig_val, eig_vec = np.linalg.eig(trans_mat.T)
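        # take the eigenvector for eigenvalue 1; np.linalg.eig may return complex arrays, so keep the real part
        res = np.array(np.real(eig_vec[:, np.where(np.abs(eig_val - 1.) < 1e-8)[0][0]]))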
        res /= np.sum(res)    
        return res

In [3]:
class MRP(MP):
    """A class representing a Markov Reward Process"""
    def __init__(self, transitions: Mapping[S, Mapping[S, float]], rewards: Mapping[S, float], gamma: float):
        super().__init__(transitions)
        self.rewards = rewards
        self.reward_vec = self.get_rewards_vec()
        self.gamma = gamma
        
    def get_reward(self, s: S):
        try:
            return self.rewards[s]
        except KeyError:  # a missing dictionary key raises KeyError, not ValueError
            print("Invalid state!")
            
    def get_rewards_vec(self):
        length = len(self.all_states_list)
        reward_vec = np.zeros(length)
        for i in range(length):
            reward_vec[i] = self.rewards[self.all_states_list[i]]
        return reward_vec             

    def get_value_function(self):
        trans_mat = self.get_trans_mat()
        # closed-form solution of v = R + gamma * P v, i.e. v = (I - gamma * P)^{-1} R
        return np.linalg.inv(np.eye(len(self.all_states_list)) - self.gamma * trans_mat).dot(self.reward_vec)
        
    

In [4]:
class MDP:
    """A class representing a Markov Decision Process"""
    def __init__(self, transitions: Mapping[S, Mapping[A, Mapping[S, float]]], rewards: Mapping[S, Mapping[A, float]], gamma: float):
        self.all_states_list = list(transitions.keys()) 
        self.transitions = transitions
        self.actions = self.get_actions()
        self.rewards = rewards
        self.gamma = gamma
        
    def get_all_states(self):
        return self.all_states_list
    
    def get_actions(self) ->  Mapping[S, Set[A]]:
        action_dict = {}
        for s in self.all_states_list:
            action_dict[s] = set()
        for s1, v1 in self.transitions.items():
            for a, v2 in v1.items():
                action_dict[s1].add(a)
        return action_dict
        
    def get_transition(self, s: S, a: A):
        try:
            return self.transitions[s][a]
        except KeyError:  # a missing state or action key raises KeyError, not ValueError
            print("Invalid state or action!")
            
    def get_mrp(self, policy: Mapping[S, Mapping[A, float]]):
        transitions = {}
        rewards = {}
        for s1, v1 in self.transitions.items():
            transitions[s1] = {}
            for a, p in policy[s1].items():
                for s2, v2 in v1[a].items():
                    transitions[s1][s2] = transitions[s1].get(s2, 0) + p*v2
                    
        for s1, v1 in self.rewards.items():
            rewards[s1] = 0
            for a, p in policy[s1].items():
                rewards[s1] += p*v1[a]
        
        return MRP(transitions, rewards, self.gamma)
    
    def get_value_function_dict(self, policy: Mapping[S, Mapping[A, float]]):
        mrp = self.get_mrp(policy)
        val_func_vec = mrp.get_value_function()
        length = len(mrp.all_states_list)
        val_func_dict = {}
        for s in self.all_states_list:
            val_func_dict[s] = 0
        for i in range(length):
            val_func_dict[mrp.all_states_list[i]] = val_func_vec[i]
        return val_func_dict
    
    def get_action_value_function(self, policy: Mapping[S, Mapping[A, float]]):
        val_func_dict = self.get_value_function_dict(policy)
        action_val_func_dict = {}
        for s1, v1 in self.rewards.items():
            action_val_func_dict[s1] = {}
            for a, v2 in v1.items():
                action_val_func_dict[s1][a] = v2 
                for s2, p in self.transitions[s1][a].items():                 
                    action_val_func_dict[s1][a] += self.gamma * p * val_func_dict[s2]
        return action_val_func_dict
         
        

### 3. Separately implement the $r(s,s')$ and the $R(s) = \sum_{s'} p(s,s') * r(s,s')$ definitions of MRP

See get_reward() and get_value_function() functions in class MRP.

### 4. Write code to convert/cast the $r(s,s')$ definition of MRP to the $R(s)$ definition of MRP (put some thought into code design here)

In [36]:
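def convert_Reward(reward: Mapping[S, Mapping[S, float]], trans: Mapping[S, Mapping[S, float]]) -> Mapping[S, float]:  
    """Convert r(s, s') into R(s) = sum over s' of p(s, s') * r(s, s')."""
    rs = {}
    for s1 in trans.keys():
        rs[s1] = 0
        # weight each transition reward by its transition probability
        for s2, prob in trans[s1].items():
            rs[s1] += prob * reward[s1][s2]    
    return rs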
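
The conversion can be checked by hand on a tiny example. The sketch below is self-contained and uses a hypothetical helper `expected_rewards` together with made-up two-state data; it mirrors the formula $R(s) = \sum_{s'} p(s,s') r(s,s')$ rather than reusing the notebook's type variables.

```python
from typing import Dict, Hashable

State = Hashable  # stand-in for the notebook's S type variable


def expected_rewards(
    r: Dict[State, Dict[State, float]],
    p: Dict[State, Dict[State, float]],
) -> Dict[State, float]:
    """Collapse r(s, s') into R(s) = sum over s' of p(s, s') * r(s, s')."""
    return {
        s: sum(prob * r[s][s_next] for s_next, prob in successors.items())
        for s, successors in p.items()
    }


# Hypothetical two-state example
p = {"x": {"x": 0.5, "y": 0.5}, "y": {"x": 1.0}}
r = {"x": {"x": 2.0, "y": 4.0}, "y": {"x": -1.0}}
print(expected_rewards(r, p))  # {'x': 3.0, 'y': -1.0}
```

For state `x` the result is $0.5 \cdot 2.0 + 0.5 \cdot 4.0 = 3.0$, and for state `y` it is $1.0 \cdot (-1.0) = -1.0$.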

### 5. Write code to create a MRP given a MDP and a Policy

In [53]:
transitions = {
    1: {
        'a': {1: 0.3, 2: 0.7},
        'b': {2: 0.3, 2: 0.2, 3: 0.5},  # duplicate key 2: Python keeps the last value (0.2), so these probabilities sum to 0.7
    },
    2: {
        'a': {1: 0.4, 2: 0.6},
        'b': {1: 0.3, 2: 0.3, 3: 0.4}
    },
    3: {
        'a': {1: 1.0},
        'b': {2: 0.8, 3: 0.2}
    }
}

rewards = {
    1: {
        'a': 5.2,
        'b': 3.6
    },
    2: {
        'a': 1.8,
        'b': -1.0
    },
    3: {
        'a': 0.1,
        'b': 4.0
    }
}

gamma = 0.9

mdp = MDP(transitions, rewards, gamma)


policy = {
    1: {
        'a': 0.1,
        'b': 0.9
    },
    2: {
        'a': 0.5,
        'b': 0.5
    },
    3: {
        'a': 0.4,
        'b': 0.6
    }
}

mrp = mdp.get_mrp(policy)
print("Successfully created an MRP. \n")
print("All states:")
print(mrp.get_all_states())
print("\nTransition matrix:")
print(mrp.get_trans_mat())
print("\nRewards:")
print(mrp.get_rewards_vec())
print("\nValue function:")
print(mrp.get_value_function())


Successfully created an MRP. 

All states:
[1, 2, 3]

Transition matrix:
[[0.03 0.25 0.45]
 [0.35 0.45 0.2 ]
 [0.4  0.48 0.12]]

Rewards:
[3.76 0.4  2.44]

Value function:
[11.48997194 10.52806415 12.47142781]


### 6. Write out the MDP/MRP Bellman Equations

1. Bellman Equation for $v_{\pi}$:

$$ v_{\pi}(s)=\sum_{a \in A} \pi(a|s)q_{\pi}(s,a)$$

2. Bellman Equation for $q_{\pi}$:

$$ q_{\pi}(s,a) = R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a v_{\pi}(s')$$

3.  Bellman Equation for $v_{\pi}$ (2):

$$ v_{\pi}(s)= \sum_{a \in A} \pi(a|s) \left ( R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a v_{\pi}(s') \right )$$

4. Bellman Equation for $q_{\pi}$ (2):

$$ q_{\pi}(s,a) = R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a \sum_{a' \in A} \pi(a'|s')q_{\pi}(s',a') $$
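
For completeness, the MRP counterpart can be written in the same notation. This is the standard form, added here as a supplement to the four MDP equations above:

$$ v(s) = R_s + \gamma\sum_{s' \in S}P_{ss'} v(s')$$

Stacking the values into a vector $v$ and the rewards into a vector $R$ gives the matrix form, which can be solved directly when $I - \gamma P$ is invertible (this holds for $\gamma < 1$):

$$ v = R + \gamma P v \quad \Longrightarrow \quad v = (I - \gamma P)^{-1} R $$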

### 7. Write code to calculate MRP Value Function (based on Matrix inversion method you learnt in this lecture)

See get_value_function() function in class MRP.
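
As a standalone cross-check, the sketch below plugs the transition matrix and reward vector printed in section 5 into the same linear system. It swaps in `np.linalg.solve` for the explicit inverse used by get_value_function(); this is an alternative shown for illustration, not the notebook's own code.

```python
import numpy as np

# Transition matrix, rewards and discount factor of the MRP printed in section 5
P = np.array([[0.03, 0.25, 0.45],
              [0.35, 0.45, 0.20],
              [0.40, 0.48, 0.12]])
R = np.array([3.76, 0.4, 2.44])
gamma = 0.9

# Solve (I - gamma * P) v = R without forming the inverse explicitly
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)

# The solution should satisfy the Bellman equation v = R + gamma * P v
print(np.allclose(v, R + gamma * P @ v))
```

The printed vector should agree with the value function shown in section 5 up to floating-point rounding.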

### 8. Write code to generate the stationary distribution for an MP

See get_stationary_dist() function in class MP.
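
The eigenvector approach can be tried on a small chain whose answer is easy to verify by hand. The two-state matrix below is a made-up example, and the sketch assumes the chain has a unique stationary distribution.

```python
import numpy as np

# Hypothetical two-state chain; each row sums to 1
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# A stationary distribution is a left eigenvector of P with eigenvalue 1,
# i.e. a right eigenvector of P.T
eig_val, eig_vec = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eig_val - 1.0))  # eigenvalue closest to 1
pi = np.real(eig_vec[:, idx])           # drop any zero imaginary part
pi = pi / pi.sum()                      # normalise so the entries sum to 1

print(pi)
print(np.allclose(pi @ P, pi))  # stationarity check: pi P = pi
```

Solving $\pi P = \pi$ by hand for this chain gives $\pi = (5/6, 1/6)$, which the printed vector should match up to rounding.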