### MS&E 346 Assignment 2
#### January 16

In [None]:
import numpy as np
from scipy.linalg import eig


#### Question 1
Write the Bellman equation for MRP Value Function and code to calculate MRP Value Function (based on Matrix inversion method you learnt in this lecture)

#### Answer:
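The Bellman equation for the MRP value function:
$${v(s) = R_s + \gamma \sum_{s' \in S} p_{ss'} v(s')}$$
The Bellman equation can be expressed concisely using matrices:
$${v = R + \gamma Pv}$$
Rearranging gives ${(I - \gamma P)v = R}$, so ${v = (I - \gamma P)^{-1} R}$ whenever ${I - \gamma P}$ is invertible.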

In [None]:
def solve_v(gamma, P, R):
    """Solve for the value function of a Markov reward process: v = (I - gamma * P)^-1 R."""
    a = np.identity(P.shape[0]) - gamma * P
    inv = np.linalg.inv(a)
    return np.matmul(inv, R)

In [None]:
gamma = 0.95
P = np.array([[0.5, 0.5], [1, 0]])
# each column of R is treated as a separate reward vector, so v has one column per reward vector
R = np.array([[0.6, -1], [0.3, -0.5]])
v = solve_v(gamma, P, R)
print(v)
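
As a cross-check on the matrix-inversion method, the sketch below solves the same linear system with `np.linalg.solve` and verifies that the result satisfies ${v = R + \gamma Pv}$. It assumes a one-dimensional reward vector with one entry per state; the names `solve_v_linear` and `R_vec` are illustrative and not part of the original answer.

```python
import numpy as np

def solve_v_linear(gamma: float, P: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Solve (I - gamma * P) v = R without forming the inverse explicitly."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

gamma = 0.95
P = np.array([[0.5, 0.5], [1.0, 0.0]])
R_vec = np.array([0.6, 0.3])  # assumed: one reward per state

v_check = solve_v_linear(gamma, P, R_vec)
# The solution should satisfy the Bellman equation v = R + gamma * P v.
residual = v_check - (R_vec + gamma * P @ v_check)
print(v_check, np.max(np.abs(residual)))
```

Solving the system directly avoids computing the inverse, which is generally the numerically preferable route when only ${v}$ is needed.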

#### Question 2
Write out the MDP definition, Policy definition and MDP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)

#### Answer:
A Markov Decision Process (MDP) is a Markov Reward Process with decisions: an environment in which all states are Markovian. It is defined by a tuple ${\langle S, A, P, R, \gamma \rangle}$ consisting of a finite set of states ${S}$, a finite set of actions ${A}$, a state transition probability function ${P}$, a reward function ${R}$ and a discount factor ${\gamma}$, where ${P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]}$ and ${R_s^a = E[R_{t+1} | S_t = s, A_t = a]}$.

A policy ${\pi}$ is a distribution over actions given states. It is chosen by the agent, i.e. a policy defines the agent's behavior.
$${\pi(a|s) = P[A_t = a | S_t = s]}$$  

The state-value function ${v_\pi(s)}$ is the expected return starting from state ${s}$ and then following policy ${\pi}$. In other words, if every action from the current state onward is chosen according to the given policy, the state-value function is the expected return.
$${v_\pi(s) = E_\pi[G_t | S_t = s]}$$

The action-value function ${q_\pi(s, a)}$ is the expected return starting from state ${s}$, taking action ${a}$, and then following policy ${\pi}$.
$${q_\pi(s, a) = E_\pi[G_t | S_t = s, A_t = a]}$$
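
Both value functions are expectations of the return ${G_t}$, which the definitions above use without spelling out. Assuming the standard discounted definition ${G_t = \sum_{k \geq 0} \gamma^k R_{t+k+1}}$, a minimal sketch of computing it for one sampled reward sequence follows (the function name `discounted_return` and the sample rewards are illustrative):

```python
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

sample_rewards = [1.0, 0.0, 2.0]
print(discounted_return(sample_rewards, 0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

Averaging this quantity over many episodes that start in ${s}$ and follow ${\pi}$ gives a Monte Carlo estimate of ${v_\pi(s)}$.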

#### Question 3:
Think about the data structure/class design (in Python 3) to represent MDP, Policy, Value Function, and implement them with clear type definitions. The data structure/code design of MDP should be incremental (and not independent) to that of MRP.

In [None]:
class MarkovDecisionProcess:

    def get_transition_matrix(self, data):
        # map each state name to a row/column index
        name2index = {}
        for i, name_of_state in enumerate(data):
            name2index[name_of_state] = i

        self.name2index = name2index

        # store the transition probability in a transition matrix
        P = np.zeros((len(name2index), len(name2index)))
        for name_of_state in data:
            for name_of_next_state in data[name_of_state]:
                P[name2index[name_of_state], name2index[name_of_next_state]] = data[name_of_state][name_of_next_state]
        print('The transition matrix is constructed as: ', P)
        return P

    def get_value_matrix(self, data):
        name2index = self.name2index
        V = np.zeros((len(name2index), len(name2index)))
        # the case when the reward is only associated with states, as is in MRP
        for name_of_state in data:
            for name_of_next_state in data[name_of_state]:
                V[name2index[name_of_state], name2index[name_of_next_state]] = data[name_of_state][name_of_next_state]
        return V

    def get_policy_matrix(self, policy):
        name2index = self.name2index
        # construct a dictionary mapping the name of each action to an index
        action2index = {}
        i = 0
        for name_of_state in policy:
            for name_of_action in policy[name_of_state]:
                if name_of_action not in action2index:
                    action2index[name_of_action] = i
                    i += 1
        self.action2index = action2index

        # one row per state, one column per action
        policy_matrix = np.zeros((len(name2index), len(action2index)))
        for name_of_state in policy:
            for name_of_action in policy[name_of_state]:
                policy_matrix[name2index[name_of_state], action2index[name_of_action]] = policy[name_of_state][name_of_action]
        return policy_matrix

    def link_model_to_states(self, model):
        name2index = self.name2index
        # weight each <state, action> -> next state distribution by the policy:
        # P_ss_prime[s, s'] = sum_a pi(a|s) * P[s' | S_t = s, A_t = a]
        # model is assumed to be laid out as model[state][action][next_state] = probability
        P_ss_prime = np.zeros((len(name2index), len(name2index)))
        policy = self.policy
        for state_name in policy:
            for action_name in policy[state_name]:
                p_curts_curta = policy[state_name][action_name]
                for next_state_name, p in model[state_name][action_name].items():
                    P_ss_prime[name2index[state_name], name2index[next_state_name]] += p_curts_curta * p
        return P_ss_prime

    def convert_to_MRP(self, policy, action, model, gamma, value):
        # value is a {state: {next_state: reward}} dictionary
        P_ss_prime = self.link_model_to_states(model)
        V = self.get_value_matrix(value)
        # expected one-step reward per state: r_s[s] = sum_s' P_ss_prime[s, s'] * V[s, s']
        r_s = np.sum(np.multiply(P_ss_prime, V), axis=1)
        return r_s

    def __init__(self, transition: dict, policy: dict, action: dict, model: dict, value: dict, gamma: float):
        # the transition matrix will be stored in a numpy matrix with transition probability
        self.transition = self.get_transition_matrix(transition)
        self.gamma = gamma
        self.policy = policy
        self.policy_matrix = self.get_policy_matrix(policy)
        self.model = model
        self.action = action
        value = self.convert_to_MRP(policy, action, model, gamma, value)
        self.value = value
        

* The class for the MDP is shown above. `transition` is a dictionary of transition probabilities; `policy` is a dictionary of dictionaries giving the probability distribution over actions in each state; `action` is a dictionary that can be derived from the model and the policy; and `model` is a dictionary mapping each ${\langle state, action \rangle}$ pair to a distribution over next states, written ${P[s'|S_t = s, A_t = a]}$.

* There are other ways to represent the object. In line with the MP and MRP classes, the class can be written as follows (part of the implementation is adapted from MDP-DP-RL by Ashwin Rao):

In [None]:
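from typing import Any, Callable, Generic, Mapping, Set, Tuple, TypeVar


def is_approx_eq(a: float, b: float, tol: float = 1e-8) -> bool:
    # assumed local helper: get_terminal_states below calls is_approx_eq,
    # which is not defined anywhere in this notebook
    return abs(a - b) <= tol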

S = TypeVar('S')
A = TypeVar('A')
X = TypeVar('X')
Y = TypeVar('Y')
Z = TypeVar('Z')

class MDP(Generic[S, A]):
    
    def map_reconstruct(self, d: Mapping[X, Tuple[Y, Z]])\
        -> Tuple[Mapping[X, Y], Mapping[X, Z]]:
        d1 = {k: v1 for k, (v1, _) in d.items()}
        d2 = {k: v2 for k, (_, v2) in d.items()}
        return d1, d2
    
    def get_all_states(self, info):
        # return the names of all states as a set
        return set(info.keys())
    
    def get_actions_set(self, mdp_data: Mapping[S, Mapping[A, Any]])\
        -> Mapping[S, Set[A]]:
        actions_set = {}
        for k, v in mdp_data.items():
            actions_set[k] = set(v.keys())
        return actions_set
    
    def get_actions_for_states(self, mdp_data: Mapping[S, Mapping[A, Any]])\
        -> Mapping[S, Set[A]]:
        return {k: set(v.keys()) for k, v in mdp_data.items()}
    
    def convert_to_MRP_transition(self, d1, policy):
        """
        convert_to_MRP_transition is part of the MDP class and takes two dictionaries:
        d1 maps state -> {action: {next_state: probability}} and
        policy maps state -> {action: probability of taking that action}.
        It returns a dictionary mapping state -> {next_state: probability},
        which is the transition matrix of the resulting Markov process
        """
        data = {k: {v: 0 for v in self.all_states} for k in self.all_states}
        for curt_state, mapping_s_a in d1.items():
            for action, tup in mapping_s_a.items():
                # tup is a {next_state: probability} dictionary, so iterate over its items
                for next_state, probability in tup.items():
                    data[curt_state][next_state] += probability * policy[curt_state][action]
        return data
    
    def convert_to_MRP_reward(self, d1, d2):
        """
        convert_to_MRP_reward is part of the MDP class and takes two dictionaries, d1 and d2.
        It returns a dictionary mapping state -> {next_state: reward}
        """
        # d1 is the mapping from state -> {action: {next_state: probability}}
        # d2 is the mapping from state -> {action: reward}

        # initialize the MRP with {state: {state: reward}}, with all rewards set to 0
        data = {k: {v: 0 for v in self.all_states} for k in self.all_states}
        for curt_state, mapping_s_a in d1.items():
            for action, tup in mapping_s_a.items():
                # tup is a {next_state: probability} dictionary, so iterate over its items
                for next_state, probability in tup.items():
                    data[curt_state][next_state] += probability * d2[curt_state][action]

        # a more efficient alternative is to build one (state, state) matrix
        # --- see MarkovDecisionProcess.convert_to_MRP()
        return data

    def __init__(self, info: Mapping[S, Mapping[A, Tuple[Mapping[S, float], float]]], gamma: float):
        
        # the parameter passed to __init__ is one nested dictionary with the structure
        # {state: {action: ({next_state: probability}, reward)}}
        
        d = {k: self.map_reconstruct(v) for k, v in info.items()}
        # d maps state -> ({action: {next_state: probability}}, {action: reward})
        d1, d2 = self.map_reconstruct(d)
        # after the second reconstruction there are two dictionaries:
        # d1 maps state -> {action: {next_state: probability}}
        # d2 maps state -> {action: reward}
        self.d1 = d1
        self.d2 = d2
        self.all_states: Set[S] = self.get_all_states(info)
        
        # state_action_dict maps each state to the set of actions available in that state
        self.state_action_dict: Mapping[S, Set[A]] = \
            self.get_actions_for_states(info)
        
        
        self.transitions: Mapping[S, Mapping[A, Mapping[S, float]]] = d1
        
        self.rewards: Mapping[S, Mapping[A, float]] = d2
        self.gamma: float = gamma
        self.terminal_states: Set[S] = self.get_terminal_states()

    def get_sink_states(self) -> Set[S]:
        return {k for k, v in self.transitions.items() if
                all(len(v1) == 1 and k in v1.keys() for _, v1 in v.items())
                }

    def get_terminal_states(self) -> Set[S]:

        sink = self.get_sink_states()
        return {s for s in sink if
                all(is_approx_eq(r, 0.0) for _, r in self.rewards[s].items())}


#### Question 4:
Separately implement the ${r(s,s',a)}$ and ${R(s,a) = \sum_{s'} p(s,s',a) \cdot r(s,s',a)}$ definitions of MDP. Write code to convert/cast the ${r(s,s',a)}$ definition of MDP to the ${R(s,a)}$ definition of MDP (put some thought into code design here)

#### Answer:

A fundamental property of value functions used throughout reinforcement learning and dynamic programming is that they satisfy recursive relationships. For any policy and any state s, the following consistency condition holds between the value of s and the value of its possible successor states:
$${v_\pi(s) = E_\pi[G_t | S_t = s] = \sum_a \pi(a|s)\sum_{s'}\sum_r p(s', r|s, a)\left[r + \gamma E_\pi[G_{t+1}|S_{t+1}=s']\right]}$$

In [None]:
def convert_to_MRP_reward(self, d1, d2):
    """
    convert_to_MRP_reward is part of the MDP class and takes two dictionaries, d1 and d2.
    It returns a dictionary mapping state -> {next_state: reward}
    """
    # d1 is the mapping from state -> {action: {next_state: probability}}
    # d2 is the mapping from state -> {action: reward}
    
    # initialize the MRP with {state: {state: reward}}, with all rewards set to 0
    data = {k: {v: 0 for v in self.all_states} for k in self.all_states}
    for curt_state, mapping_s_a in d1.items():
        for action, tup in mapping_s_a.items():
            # tup is a {next_state: probability} dictionary, so iterate over its items
            for next_state, probability in tup.items():
                data[curt_state][next_state] += probability * d2[curt_state][action]
    
    # a more efficient alternative is to build one (state, state) matrix
    # --- see MarkovDecisionProcess.convert_to_MRP()
    return data
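
The conversion Question 4 asks for, from ${r(s,s',a)}$ to ${R(s,a) = \sum_{s'} p(s,s',a) \cdot r(s,s',a)}$, can also be sketched as a standalone function. The input layout `{state: {action: {next_state: (probability, reward)}}}`, the alias `SASReward`, and the name `to_expected_reward` are assumptions made for this illustration and are not part of the classes above.

```python
from typing import Dict, Hashable, Tuple

# Assumed layout: {state: {action: {next_state: (probability, reward)}}}
SASReward = Dict[Hashable, Dict[Hashable, Dict[Hashable, Tuple[float, float]]]]

def to_expected_reward(info: SASReward) -> Dict[Hashable, Dict[Hashable, float]]:
    """Collapse r(s, s', a) into R(s, a) = sum_{s'} p(s, s', a) * r(s, s', a)."""
    return {
        s: {a: sum(p * r for p, r in nxt.values()) for a, nxt in actions.items()}
        for s, actions in info.items()
    }

example = {
    "s0": {"go": {"s0": (0.25, 4.0), "s1": (0.75, 0.0)}},
    "s1": {"stay": {"s1": (1.0, 2.0)}},
}
print(to_expected_reward(example))  # {'s0': {'go': 1.0}, 's1': {'stay': 2.0}}
```

Keeping the cast as a pure function of the data means the ${R(s,a)}$ form of the MDP can be built from the ${r(s,s',a)}$ form without either representation knowing about the other.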
    

#### Question 5:
Write code to create a MRP given a MDP and a Policy
#### Answer:

In [None]:
# The MarkovRewardProcess class takes in a dictionary of transition matrix and reward matrix
# The reward matrix can be constructed by the convert_to_MRP_reward method
# a similar method is required to convert a MDP into transition matrix

def convert_to_MRP_transition(self, d1, policy):
    """
    convert_to_MRP_transition is part of the MDP class and takes two dictionaries:
    d1 maps state -> {action: {next_state: probability}} and
    policy maps state -> {action: probability of taking that action}.
    It returns a dictionary mapping state -> {next_state: probability},
    which is the transition matrix of the resulting Markov process
    """
    data = {k: {v: 0 for v in self.all_states} for k in self.all_states}
    for curt_state, mapping_s_a in d1.items():
        for action, tup in mapping_s_a.items():
            # tup is a {next_state: probability} dictionary, so iterate over its items
            for next_state, probability in tup.items():
                data[curt_state][next_state] += probability * policy[curt_state][action]
    return data

In [None]:
# an MRP object can be instantiated given transition and reward objects
from MRP import MarkovRewardProcess

# pseudo objects: an empty MDP and an empty policy, for illustration only
mdp_object = MDP({}, 0.95)
policy = {}
transition = mdp_object.convert_to_MRP_transition(mdp_object.d1, policy)
reward = mdp_object.convert_to_MRP_reward(mdp_object.d1, mdp_object.d2)
mrp_object = MarkovRewardProcess(transition, reward)
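
For a concrete picture of what Question 5 asks for, here is a minimal standalone sketch that averages an MDP over a policy: ${P^\pi_{ss'} = \sum_a \pi(a|s) P_{ss'}^a}$ and ${R^\pi_s = \sum_a \pi(a|s) R_s^a}$. The function name `mdp_to_mrp`, the plain-dictionary layouts, and the toy data are assumptions for illustration; the sketch does not use the `MDP` or `MarkovRewardProcess` classes.

```python
from typing import Dict, Tuple

Transitions = Dict[str, Dict[str, Dict[str, float]]]  # state -> action -> next_state -> probability
Rewards = Dict[str, Dict[str, float]]                 # state -> action -> R(s, a)
Policy = Dict[str, Dict[str, float]]                  # state -> action -> pi(a|s)

def mdp_to_mrp(transitions: Transitions, rewards: Rewards, policy: Policy
               ) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Return (P_pi, R_pi): the policy-averaged transitions and rewards."""
    mrp_p: Dict[str, Dict[str, float]] = {s: {} for s in transitions}
    mrp_r: Dict[str, float] = {s: 0.0 for s in transitions}
    for s, actions in transitions.items():
        for a, nxt in actions.items():
            pi_sa = policy[s].get(a, 0.0)  # actions the policy never takes get weight 0
            mrp_r[s] += pi_sa * rewards[s][a]
            for s2, p in nxt.items():
                mrp_p[s][s2] = mrp_p[s].get(s2, 0.0) + pi_sa * p
    return mrp_p, mrp_r

transitions = {"s0": {"a": {"s0": 1.0}, "b": {"s1": 1.0}}, "s1": {"a": {"s1": 1.0}}}
rewards = {"s0": {"a": 0.0, "b": 2.0}, "s1": {"a": 1.0}}
policy = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 1.0}}

mrp_p, mrp_r = mdp_to_mrp(transitions, rewards, policy)
print(mrp_p)  # {'s0': {'s0': 0.5, 's1': 0.5}, 's1': {'s1': 1.0}}
print(mrp_r)  # {'s0': 1.0, 's1': 1.0}
```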

#### Question 6
Write out all 8 MDP Bellman Equations and also the transformation from Optimal Action-Value function to Optimal Policy (in LaTeX)
#### Answer:

Bellman Optimality Equation for ${v_*}$
$${v_*(s) = \max_a q_*(s, a)}$$
Bellman Optimality Equation for ${q_*}$
$${q_*(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_*(s')}$$

Bellman Optimality Equation for ${v_*}$ (2)
$${v_*(s) = \max_a \left[ R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_*(s') \right]}$$

Bellman Optimality Equation for ${q_*}$ (2)
$${q_*(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a \max_{a'} q_*(s', a')}$$
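
Because the optimality equation for ${v_*}$ is a fixed-point condition, applying its right-hand side repeatedly converges to ${v_*}$ when ${\gamma < 1}$ (value iteration). The sketch below illustrates this on a toy two-state MDP; the array layout (`P` of shape actions × states × states, `R` of shape states × actions), the function names, and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def q_from_v(P: np.ndarray, R: np.ndarray, gamma: float, v: np.ndarray) -> np.ndarray:
    """q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']."""
    return R + gamma * np.einsum("ast,t->sa", P, v)

def value_iteration(P: np.ndarray, R: np.ndarray, gamma: float,
                    tol: float = 1e-10, max_iter: int = 10_000) -> np.ndarray:
    """Iterate v <- max_a q(s, a) until successive iterates differ by less than tol."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        v_new = q_from_v(P, R, gamma, v).max(axis=1)
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    return v

# Toy MDP: action 0 stays in the current state, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],   # state 0: no reward for either action
              [1.0, 0.0]])  # state 1: reward 1 for staying
v_star = value_iteration(P, R, gamma=0.5)
print(v_star)  # approximately [1. 2.]
```

Here staying in state 1 forever is worth ${1 / (1 - 0.5) = 2}$, and the best move from state 0 is to switch, worth ${0.5 \cdot 2 = 1}$.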

#### Bellman Expectation Equation:  
The state-value function can be decomposed:
$${v_{\pi}(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1})|S_t = s]}$$  
The action-value function can be decomposed: 
$${q_\pi(s, a) = E_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a] }$$  
Bellman Expectation Equation for ${v_\pi}$
$${v_\pi(s) = \sum_{a \in A}\pi(a|s)q_{\pi}(s, a)}$$

Bellman Expectation Equation for ${q_\pi}$
$${q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^a v_{\pi}(s')}$$
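
For the transformation from the optimal action-value function to an optimal policy, the standard construction is to act greedily with respect to ${q_*}$: ${\pi_*(a|s) = 1}$ if ${a = \arg\max_{a'} q_*(s, a')}$ and ${0}$ otherwise. A minimal sketch, assuming ${q_*}$ is stored as a nested dictionary; the function name `greedy_policy` and the numbers in `q_star` are illustrative:

```python
from typing import Dict

def greedy_policy(q: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Deterministic policy putting probability 1 on an argmax action in each state."""
    policy = {}
    for s, q_s in q.items():
        best = max(q_s, key=q_s.get)  # ties resolve to the first action encountered
        policy[s] = {a: (1.0 if a == best else 0.0) for a in q_s}
    return policy

q_star = {"s0": {"stay": 0.5, "switch": 1.0}, "s1": {"stay": 2.0, "switch": 0.5}}
print(greedy_policy(q_star))
# {'s0': {'stay': 0.0, 'switch': 1.0}, 's1': {'stay': 1.0, 'switch': 0.0}}
```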