### Assignment 1.
Jan 11th

In [1]:
import numpy as np
from scipy.linalg import eig

#### Question 1
Write out the MP/MRP definitions and MRP Value Function definition (in LaTeX) in your own style/notation (so you really internalize these concepts)

#### Answer:
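The Markov property states that the future is independent of the past given the present: once the current state is observed, the earlier states carry no additional information about the next state.
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_t, S_{t-1}, \ldots, S_1]$$

A Markov Process is a memoryless random process, i.e. a sequence of states $S_1, S_2, \ldots$ with the Markov property. It is defined by a tuple $\langle S, P \rangle$, where $S$ is a finite set of states and $P$ is the state transition probability matrix, $P_{ss'} = P[S_{t+1} = s' \mid S_t = s]$.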
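To make the Markov property concrete, the sketch below samples a trajectory from a transition dictionary shaped like the ones used in Question 2 (`{state: {next_state: probability}}`). Each draw reads only the row of the current state, never the earlier history. The function name `sample_path` and its `seed` parameter are illustrative assumptions, not part of the assignment.

```python
import random
from typing import Dict, Hashable, List

State = Hashable
Transition = Dict[State, Dict[State, float]]


def sample_path(transition: Transition, start: State, steps: int, seed: int = 0) -> List[State]:
    """Hypothetical helper: sample S_1, ..., S_{steps+1} starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        # Only the current state S_t is consulted (the Markov property);
        # S_{t-1}, S_{t-2}, ... play no role in choosing S_{t+1}.
        row = transition[path[-1]]
        next_state = rng.choices(list(row.keys()), weights=list(row.values()), k=1)[0]
        path.append(next_state)
    return path


# A deterministic two-state chain alternates between its states.
print(sample_path({1: {2: 1.0}, 2: {1: 1.0}}, start=1, steps=3))
```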

A Markov Reward Process is a Markov Process with rewards. It is a tuple $\langle S, P, R, \gamma \rangle$,  
where $S$ is a finite set of states, $P$ is the state transition probability matrix, $R$ is the reward function, $R_s = E[R_{t+1} \mid S_t = s]$, and $\gamma \in [0, 1]$ is the discount factor.  
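One way to mirror the tuple $\langle S, P, R, \gamma \rangle$ directly in Python is a frozen dataclass whose fields carry the same names as the symbols. This is a minimal sketch under the assumption that $P$ is stored as nested dictionaries and $R$ as the expected reward per state; the class name `MRPTuple` and its validation rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable

State = Hashable


@dataclass(frozen=True)
class MRPTuple:
    """Hypothetical container mirroring the tuple (S, P, R, gamma)."""
    S: FrozenSet[State]
    P: Dict[State, Dict[State, float]]  # P[s][s'] = P[S_{t+1} = s' | S_t = s]
    R: Dict[State, float]               # R[s] = E[R_{t+1} | S_t = s]
    gamma: float                        # discount factor

    def __post_init__(self) -> None:
        # Check the constraints stated in the definition.
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s, row in self.P.items():
            if s not in self.S or not set(row) <= self.S:
                raise ValueError(f"row {s!r} refers to a state outside S")
            if abs(sum(row.values()) - 1.0) > 1e-8:
                raise ValueError(f"row {s!r} does not sum to 1")


mrp = MRPTuple(S=frozenset({1, 2}), P={1: {1: 0.5, 2: 0.5}, 2: {1: 1.0}},
               R={1: 1.0, 2: 0.0}, gamma=0.9)
```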

The return $G_t$ is the total discounted reward from time step $t$: $$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
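For a finite (truncated) reward sequence $R_{t+1}, R_{t+2}, \ldots$, the sum above can be evaluated backwards with the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The sketch below assumes the episode ends after the last listed reward; `discounted_return` is an illustrative name.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """G_t = sum_k gamma^k R_{t+k+1} for a finite list [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Walk backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25
```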

The value function gives the long-term value of state $s$, i.e. the expected return when starting in $s$: $$v(s) = E[G_t \mid S_t = s]$$
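Splitting $G_t = R_{t+1} + \gamma G_{t+1}$ inside the expectation gives the Bellman equation $v(s) = R_s + \gamma \sum_{s'} P_{ss'} v(s')$, or in matrix form $v = R + \gamma P v$. For $\gamma < 1$ this linear system can be solved directly as $v = (I - \gamma P)^{-1} R$. The sketch assumes dense NumPy arrays with states indexed $0, \ldots, n-1$; `mrp_value_function` is a hypothetical name.

```python
import numpy as np


def mrp_value_function(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Solve (I - gamma * P) v = R for the MRP value function v."""
    if not 0.0 <= gamma < 1.0:
        # With gamma = 1 the matrix I - P can be singular, so this direct solve is not used.
        raise ValueError("this direct solve assumes 0 <= gamma < 1")
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)


P = np.array([[0.5, 0.5], [0.5, 0.5]])
R = np.array([1.0, 1.0])
print(mrp_value_function(P, R, 0.9))  # constant reward 1 gives 1 / (1 - 0.9) in every state
```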


#### Question 2
Think about the data structures/class design (in Python 3) to represent MP/MRP and implement them with clear type declarations. Remember your data structure/code design must resemble the Mathematical/notational formalism as much as possible. Specifically the data structure/code design of MRP should be incremental (and not independent) to that of MP

In [2]:
from typing import Dict, Hashable, List, Set

State = Hashable
Row = Dict[State, float]
TransitionMatrix = Dict[State, Row]


class MarkovProcess:
    """A Markov Process <S, P>: a set of states and a transition probability matrix.

    P is stored sparsely as {s: {s': P[s, s']}}, keeping only the non-zero entries.
    """

    def __init__(self, S: TransitionMatrix) -> None:
        self.S = S

        # all_states_list holds every state name (the keys of the transition dictionary)
        self.all_states_list: List[State] = list(self.get_all_states(S))

        # transition maps each state to a dictionary {next_state: probability} for its row.
        # When no row contains near-zero entries, this is equivalent to self.transition = S.
        self.transition: TransitionMatrix = self.get_transition_matrix(S)

    def get_all_states(self, S: TransitionMatrix) -> Set[State]:
        # return the set of state names (the keys of S)
        return set(S.keys())

    def construct_row_transition_matrix(self, row: Row) -> Row:
        # following MDP-DP-RL, a probability is treated as zero when |v| <= 1e-8,
        # and such entries are dropped from the row
        return {s: v for s, v in row.items() if abs(v) > 1e-8}

    def get_transition_matrix(self, S: TransitionMatrix) -> TransitionMatrix:
        return {s: self.construct_row_transition_matrix(row) for s, row in S.items()}

    def get_stationary_distribution(self) -> Dict[State, float]:
        # build the dense matrix mat[i, j] = P(s_j | s_i)
        sz = len(self.all_states_list)
        mat = np.zeros((sz, sz))
        for i, s1 in enumerate(self.all_states_list):
            for j, s2 in enumerate(self.all_states_list):
                mat[i, j] = self.transition[s1].get(s2, 0.)

        # the stationary distribution pi satisfies pi P = pi, so it is the
        # eigenvector of P^T whose eigenvalue is (numerically closest to) 1
        eig_vals, eig_vecs = eig(mat.T)
        idx = np.argmin(np.abs(eig_vals - 1.))
        stat = np.real(eig_vecs[:, idx])
        norm_stat = stat / stat.sum()
        return {s: float(norm_stat[i]) for i, s in enumerate(self.all_states_list)}
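The class drops near-zero probabilities but never checks that each row is a valid probability distribution. A small check such as the hypothetical `non_stochastic_rows` below could be run on the input dictionary before constructing a `MarkovProcess`; the tolerance `1e-8` is chosen to match the one used in `construct_row_transition_matrix`.

```python
from typing import Dict, Hashable, List


def non_stochastic_rows(transition: Dict[Hashable, Dict[Hashable, float]],
                        tol: float = 1e-8) -> List[Hashable]:
    """Hypothetical helper: list the states whose row is not a valid distribution.

    A row is flagged if it has a negative entry, does not sum to 1 (within tol),
    or points to a next state that has no row of its own.
    """
    states = set(transition)
    bad = []
    for s, row in transition.items():
        negative = any(p < 0 for p in row.values())
        wrong_sum = abs(sum(row.values()) - 1.0) > tol
        unknown_target = not set(row) <= states
        if negative or wrong_sum or unknown_target:
            bad.append(s)
    return bad
```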
        

In [3]:
transition = {
    1: {1: 0.1, 2: 0.6, 3: 0.1, 4: 0.2},
    2: {1: 0.25, 2: 0.22, 3: 0.24, 4: 0.29},
    3: {1: 0.7, 2: 0.3},
    4: {1: 0.3, 2: 0.5, 3: 0.2}
}
a_MP_process = MarkovProcess(transition)

In [4]:
a_MP_process.transition

{1: {1: 0.1, 2: 0.6, 3: 0.1, 4: 0.2},
 2: {1: 0.25, 2: 0.22, 3: 0.24, 4: 0.29},
 3: {1: 0.7, 2: 0.3},
 4: {1: 0.3, 2: 0.5, 3: 0.2}}

In [5]:
a_MP_process.get_stationary_distribution()



{1: 0.28574421284173046,
 2: 0.38860374986906865,
 3: 0.15580810725882485,
 4: 0.16984393003037593}
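A quick sanity check on the printed result: a stationary distribution $\pi$ must satisfy $\pi P = \pi$ and sum to 1. The sketch below rebuilds $P$ from the same transition dictionary and measures the largest deviation for the values printed above; it does not depend on the class.

```python
import numpy as np

transition = {
    1: {1: 0.1, 2: 0.6, 3: 0.1, 4: 0.2},
    2: {1: 0.25, 2: 0.22, 3: 0.24, 4: 0.29},
    3: {1: 0.7, 2: 0.3},
    4: {1: 0.3, 2: 0.5, 3: 0.2},
}
# Values printed by get_stationary_distribution() above.
pi = {1: 0.28574421284173046, 2: 0.38860374986906865,
      3: 0.15580810725882485, 4: 0.16984393003037593}

states = sorted(transition)
P = np.array([[transition[s].get(t, 0.0) for t in states] for s in states])
p = np.array([pi[s] for s in states])

# Largest entry of |pi P - pi| should be at floating-point noise level.
residual = np.max(np.abs(p @ P - p))
print(residual, p.sum())
```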

*This class will also be saved in a Python file for later reuse.

#### Question 3
Separately implement the $r(s,s')$ and the $R(s) = \sum_{s'} p(s,s') \cdot r(s,s')$ definitions of MRP. Write code to convert/cast the $r(s,s')$ definition of MRP to the $R(s)$ definition of MRP (put some thought into code design here)

In [6]:
from typing import Dict, Hashable

# Two ways of representing the MRP are presented.
# In the first, the transition matrix and the reward matrix r(s, s') are passed as separate
# dictionaries. The class builds on MarkovProcess, so an MRP is literally an MP plus rewards.
class MarkovRewardProcess(MarkovProcess):
    """A Markov Reward Process: a Markov Process extended with a reward function r(s, s')."""

    def __init__(self,
                 transition: Dict[Hashable, Dict[Hashable, float]],
                 reward: Dict[Hashable, Dict[Hashable, float]]) -> None:
        # transition: {s: {s': p(s, s')}}, reward: {s: {s': r(s, s')}}
        # The parent constructor sets self.S, self.all_states_list and self.transition;
        # get_stationary_distribution and the other helpers are inherited unchanged.
        super().__init__(transition)
        self.reward = reward

In [7]:
class MarkovRewardProcess_matrix:
    """Second representation: the transition probabilities and r(s, s') are dense arrays."""
    def __init__(self, transition, reward):
        # transition[i, j] = p(s_i, s_j) and reward[i, j] = r(s_i, s_j)
        self.transition = np.asarray(transition, dtype=float)
        self.reward = np.asarray(reward, dtype=float)
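With the matrix representation, the conversion from $r(s,s')$ to $R(s)$ is an element-wise product followed by a row sum. The sketch below assumes both inputs are square arrays indexed the same way as in `MarkovRewardProcess_matrix`; `expected_reward` is a hypothetical name.

```python
import numpy as np


def expected_reward(P: np.ndarray, r: np.ndarray) -> np.ndarray:
    """R(s_i) = sum_j P[i, j] * r[i, j], computed for every row at once."""
    P = np.asarray(P, dtype=float)
    r = np.asarray(r, dtype=float)
    if P.shape != r.shape:
        raise ValueError("P and r must have the same shape")
    # Element-wise product weights each reward by its transition probability,
    # then summing over next states (axis 1) gives the expectation per state.
    return (P * r).sum(axis=1)


print(expected_reward([[0.5, 0.5], [1.0, 0.0]], [[2.0, 4.0], [10.0, 7.0]]))
```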

In [8]:
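from typing import Dict, Hashable

def convert_Reward(p_s: Dict[Hashable, Dict[Hashable, float]],
                   r_s: Dict[Hashable, Dict[Hashable, float]]) -> Dict[Hashable, float]:
    # Convert the r(s, s') form of an MRP into the R(s) form: R(s) = sum_{s'} p(s, s') * r(s, s')
    # p_s: {state: {next_state: probability}}
    # r_s: {state: {next_state: reward}}, with an entry for every next_state listed in p_s
    Reward = {}
    for state, row in p_s.items():
        Reward[state] = 0.0
        for next_state, probability in row.items():
            Reward[state] += probability * r_s[state][next_state]

    return Reward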
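The same conversion can also be written as a single dictionary comprehension. This variant makes one extra assumption that `convert_Reward` does not: a missing $r(s,s')$ entry is treated as zero reward instead of raising `KeyError`. The name `expected_reward_dict` and the two-state example are illustrative.

```python
from typing import Dict, Hashable

State = Hashable
Transition = Dict[State, Dict[State, float]]
PairReward = Dict[State, Dict[State, float]]


def expected_reward_dict(p: Transition, r: PairReward) -> Dict[State, float]:
    """R(s) = sum_{s'} p(s, s') * r(s, s'); a missing r(s, s') is assumed to be 0."""
    return {
        s: sum(prob * r.get(s, {}).get(s_next, 0.0) for s_next, prob in row.items())
        for s, row in p.items()
    }


p = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 1.0}}
r = {"a": {"a": 2.0, "b": 4.0}, "b": {"a": 10.0}}
print(expected_reward_dict(p, r))
```

Here state `a` gets $0.5 \cdot 2 + 0.5 \cdot 4 = 3$ and state `b` gets $1.0 \cdot 10 = 10$.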

#### Question 4
Write code to generate the stationary distribution for an MP

#### Answer
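See the `get_stationary_distribution` method of the `MarkovProcess` class above. It builds the dense transition matrix $P$ and returns the eigenvector of $P^T$ with eigenvalue 1 (i.e. the $\pi$ satisfying $\pi P = \pi$), normalized so that its entries sum to 1.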
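An alternative to the eigen-decomposition is power iteration: start from any distribution $\mu_0$ and repeat $\mu_{k+1} = \mu_k P$ until it stops changing. This converges to the stationary distribution when the chain is irreducible and aperiodic, which holds for the example chain (every state reaches every other, and states 1 and 2 have self-loops). The function name, the uniform starting point and the tolerances are assumptions of this sketch.

```python
from typing import Dict, Hashable

import numpy as np


def stationary_by_power_iteration(transition: Dict[Hashable, Dict[Hashable, float]],
                                  tol: float = 1e-12,
                                  max_iter: int = 100_000) -> Dict[Hashable, float]:
    """Hypothetical helper: iterate mu <- mu P from the uniform distribution."""
    states = list(transition)  # assumes every state has its own row
    P = np.array([[transition[s].get(t, 0.0) for t in states] for s in states])
    mu = np.full(len(states), 1.0 / len(states))
    for _ in range(max_iter):
        nxt = mu @ P
        converged = np.max(np.abs(nxt - mu)) < tol
        mu = nxt
        if converged:
            break
    else:
        raise RuntimeError("power iteration did not converge")
    return {s: float(x) for s, x in zip(states, mu)}


transition = {
    1: {1: 0.1, 2: 0.6, 3: 0.1, 4: 0.2},
    2: {1: 0.25, 2: 0.22, 3: 0.24, 4: 0.29},
    3: {1: 0.7, 2: 0.3},
    4: {1: 0.3, 2: 0.5, 3: 0.2},
}
print(stationary_by_power_iteration(transition))
```

Power iteration only needs matrix-vector products, so it avoids computing all eigenvectors, at the cost of a convergence rate governed by the second-largest eigenvalue modulus of $P$.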