# Heyang Huang heyangh

# P1

$G = \{(i,j)|0 \leq i < m, 0\leq j < n\}$
<br>$B=\{(i,j)|(i,j)\in G,(i,j)\text{ is an uninhabitable cell, known as a "block"}\}$
<br>
<br>$S=G-B=\{(i,j)|(i,j)\in G,(i,j)\notin B\}$ 
<br>$T=\{(i,j)|(i,j)\in S,(i,j)\in \text{goal cells}\}$ 
<br>$N=S-T=G-B-T=\{(i,j)|(i,j)\in S,(i,j)\notin T \}$ 
<br>$A = \{(0,-1), (0,1), (-1,0), (1,0)\}$
<br>Discount factor: $\gamma=1$
<br>
$Pr(s,a,s',r)=Pr((x,y),(a_i,a_j),(x',y'),r)=$
<br>
$
\begin{cases}
        \\
        1 & \text{if }r=-1,x^*=x',y^*=y',(x^*,y^*)\in T\\
        \\
        1-p_{1,y^*}-p_{2,y^*} & \text{if }r=-1, x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\in S,(x^*-1,y^*)\in S\\
        p_{1,y^*} & \text{if } r=-1,x^*-1=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\in S,(x^*-1,y^*)\in S\\
        p_{2,y^*} & \text{if } r=-1,x^*+1=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\in S,(x^*-1,y^*)\in S\\
        \\
        1-p_{1,y^*}-p_{2,y^*} & \text{if } r=-1,x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\in S,(x^*-1,y^*)\notin S\\
        p_{1,y^*} & \text{if } r=-(1+b),x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\in S,(x^*-1,y^*)\notin S\\
        p_{2,y^*} & \text{if } r=-1,x^*+1=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\in S,(x^*-1,y^*)\notin S\\
        \\
        1-p_{1,y^*}-p_{2,y^*} & \text{if } r=-1,x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\notin S,(x^*-1,y^*)\in S\\
        p_{2,y^*} & \text{if } r=-(1+b),x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\notin S,(x^*-1,y^*)\in S\\
        p_{1,y^*} & \text{if } r=-1,x^*-1=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\notin S,(x^*-1,y^*)\in S\\
        \\
        1-p_{1,y^*}-p_{2,y^*} & \text{if } r=-1,x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\notin S,(x^*-1,y^*)\notin S\\
        p_{1,y^*}+p_{2,y^*} & \text{if } r=-(1+b),x^*=x',y^*=y',(x^*,y^*)\in N,(x^*+1,y^*)\notin S,(x^*-1,y^*)\notin S\\
        \\
        0 & \text{OTW} \\
        \end{cases} 
$
<br>Here $x^*=x+a_i$ and $y^*=y+a_j$ (for conciseness of notation), $(x,y)\in N$, $p_{1,j}$ and $p_{2,j}$ are the downward and upward wind probabilities of column $j$, and $b$ is the bump cost. The wind is vertical, so it is indexed by the column $y^*$ of the cell reached by the move and it shifts the row coordinate.

In [55]:
from typing import Tuple, Sequence, Set, Mapping, Dict, Callable, Optional
from dataclasses import dataclass
from operator import itemgetter
from rl.distribution import Categorical, Choose, Constant
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.markov_decision_process import StateActionMapping
from rl.markov_decision_process import FinitePolicy
from rl.dynamic_programming import value_iteration_result, V

'''
Cell specifies (row, column) coordinate
'''
Cell = Tuple[int, int]
CellSet = Set[Cell]
Move = Tuple[int, int]
'''
WindSpec specifies a random vertical wind for each column.
Each random vertical wind is specified by a (p1, p2) pair
where p1 specifies probability of Downward Wind (could take you
one step lower in row coordinate unless prevented by a block or
boundary) and p2 specifies probability of Upward Wind (could take
you one step higher in row coordinate unless prevented by a
block or boundary). If one bumps against a block or boundary, one
incurs a bump cost and doesn't move. The remaining probability
1 - p1 - p2 corresponds to No Wind.
'''
WindSpec = Sequence[Tuple[float, float]]

possible_moves: Mapping[Move, str] = {
    (-1, 0): 'D',
    (1, 0): 'U',
    (0, -1): 'L',
    (0, 1): 'R'
}


@dataclass(frozen=True)
class WindyGrid:

    rows: int  # number of grid rows
    columns: int  # number of grid columns
    blocks: CellSet  # coordinates of block cells
    terminals: CellSet  # coordinates of goal cells
    wind: WindSpec  # spec of vertical random wind for the columns
    bump_cost: float  # cost of bumping against block or boundary

    def validate_spec(self) -> bool:
        b1 = self.rows >= 2
        b2 = self.columns >= 2
        b3 = all(0 <= r < self.rows and 0 <= c < self.columns
                 for r, c in self.blocks)
        b4 = len(self.terminals) >= 1
        b5 = all(0 <= r < self.rows and 0 <= c < self.columns and
                 (r, c) not in self.blocks for r, c in self.terminals)
        b6 = len(self.wind) == self.columns
        b7 = all(0. <= p1 <= 1. and 0. <= p2 <= 1. and p1 + p2 <= 1.
                 for p1, p2 in self.wind)
        b8 = self.bump_cost > 0.
        return all([b1, b2, b3, b4, b5, b6, b7, b8])

    def print_wind_and_bumps(self) -> None:
        for i, (d, u) in enumerate(self.wind):
            print(f"Column {i:d}: Down Prob = {d:.2f}, Up Prob = {u:.2f}")
        print(f"Bump Cost = {self.bump_cost:.2f}")
        print()

    @staticmethod
    def add_tuples(a: Cell, b: Cell) -> Cell:
        return a[0] + b[0], a[1] + b[1]

    def is_valid_state(self, cell: Cell) -> bool:
        '''
        checks if a cell is a valid state of the MDP
        '''
        return 0 <= cell[0] < self.rows and 0 <= cell[1] < self.columns \
            and cell not in self.blocks

    def get_all_nt_states(self) -> CellSet:
        '''
        returns all the non-terminal states
        '''
        return {(i, j) for i in range(self.rows) for j in range(self.columns)
                if (i, j) not in set.union(self.blocks, self.terminals)}

    def get_actions_and_next_states(self, nt_state: Cell) \
            -> Set[Tuple[Move, Cell]]:
        '''
        given a non-terminal state, returns the set of all possible
        (action, next_state) pairs
        '''
        temp: Set[Tuple[Move, Cell]] = {(a, WindyGrid.add_tuples(nt_state, a))
                                        for a in possible_moves}
        return {(a, s) for a, s in temp if self.is_valid_state(s)}

    def get_transition_probabilities(self, nt_state: Cell) \
            -> Mapping[Move, Categorical[Tuple[Cell, float]]]:
        '''
        given a non-terminal state, return a dictionary whose
        keys are the valid actions (moves) from the given state
        and the corresponding values are the associated probabilities
        (following that move) of the (next_state, reward) pairs.
        The probabilities are determined from the wind probabilities
        of the column one is in after the move. Note that if one moves
        to a goal cell (terminal state), then one ends up in that
        goal cell with 100% probability (i.e., no wind exposure in a
        goal cell).
        '''
        d: Dict[Move, Categorical[Tuple[Cell, float]]] = {}
        for a, (r, c) in self.get_actions_and_next_states(nt_state):
            if (r, c) in self.terminals:
                d[a] = Categorical({((r, c), -1.): 1.})
            else:
                # wind probabilities of the column reached by the move
                p_down, p_up = self.wind[c]
                p_none = 1. - p_down - p_up
                d1: Dict[Tuple[Cell, float], float] = {}
                up_valid = self.is_valid_state((r + 1, c))
                down_valid = self.is_valid_state((r - 1, c))
                if up_valid and down_valid:
                    # wind can push either way without bumping
                    d1[((r, c), -1.)] = p_none
                    d1[((r - 1, c), -1.)] = p_down
                    d1[((r + 1, c), -1.)] = p_up
                elif up_valid:
                    # downward wind bumps: stay put and pay the bump cost
                    d1[((r, c), -1.)] = p_none
                    d1[((r, c), -1. - self.bump_cost)] = p_down
                    d1[((r + 1, c), -1.)] = p_up
                elif down_valid:
                    # upward wind bumps: stay put and pay the bump cost
                    d1[((r, c), -1.)] = p_none
                    d1[((r, c), -1. - self.bump_cost)] = p_up
                    d1[((r - 1, c), -1.)] = p_down
                else:
                    # any wind bumps
                    d1[((r, c), -1.)] = p_none
                    d1[((r, c), -1. - self.bump_cost)] = p_down + p_up
                d[a] = Categorical(d1)
        return d

    def get_finite_mdp(self) -> FiniteMarkovDecisionProcess[Cell, Move]:
        '''
        returns the FiniteMarkovDecisionProcess object for this windy grid problem
        '''
        d1: StateActionMapping[Cell, Move] = \
            {s: self.get_transition_probabilities(s) for s in
             self.get_all_nt_states()}
        d2: StateActionMapping[Cell, Move] = {s: None for s in self.terminals}
        return FiniteMarkovDecisionProcess({**d1, **d2})

    def get_vi_vf_and_policy(self) -> Tuple[V[Cell], FinitePolicy[Cell, Move]]:
        '''
        Performs the Value Iteration DP algorithm returning the
        Optimal Value Function (as a V[Cell]) and the Optimal Policy
        (as a FinitePolicy[Cell, Move])
        '''
        return value_iteration_result(self.get_finite_mdp(), gamma=1.)

    @staticmethod
    def epsilon_greedy_action(
        nt_state: Cell,
        q: Mapping[Cell, Mapping[Move, float]],
        epsilon: float
    ) -> Move:
        '''
        given a non-terminal state, a Q-Value Function (in the form of a
        {state: {action: Expected Return}} dictionary) and epsilon, return
        an action sampled from the probability distribution implied by an
        epsilon-greedy policy that is derived from the Q-Value Function.
        '''
        action_values: Mapping[Move, float] = q[nt_state]
        greedy_action: Move = max(action_values.items(), key=itemgetter(1))[0]
        return Categorical(
            {a: epsilon / len(action_values) +
             (1 - epsilon if a == greedy_action else 0.)
             for a in action_values}
        ).sample()

    def get_states_actions_dict(self) -> Mapping[Cell, Optional[Set[Move]]]:
        '''
        Returns a dictionary whose keys are the states and the corresponding
        values are the set of actions for the state (if the key is a
        non-terminal state) or is None if the state is a terminal state.
        '''
        d1: Mapping[Cell, Optional[Set[Move]]] = \
            {s: {a for a, _ in self.get_actions_and_next_states(s)}
             for s in self.get_all_nt_states()}
        d2: Mapping[Cell, Optional[Set[Move]]] = \
            {s: None for s in self.terminals}
        return {**d1, **d2}

    def get_sarsa_vf_and_policy(
        self,
        states_actions_dict: Mapping[Cell, Optional[Set[Move]]],
        sample_func: Callable[[Cell, Move], Tuple[Cell, float]],
        episodes: int = 10000,
        step_size: float = 0.01
    ) -> Tuple[V[Cell], FinitePolicy[Cell, Move]]:
        '''
        states_actions_dict gives us the set of possible moves from
        a non-block cell.
        sample_func is a function with two inputs: state and action,
        and with output as a sampled pair of (next_state, reward).
        '''
        q: Dict[Cell, Dict[Move, float]] = \
            {s: {a: 0. for a in actions} for s, actions in
             states_actions_dict.items() if actions is not None}
        nt_states: CellSet = {s for s in q}
        uniform_states: Choose[Cell] = Choose(nt_states)
        for episode_num in range(episodes):
            epsilon: float = 1.0 / (episode_num + 1)
            state: Cell = uniform_states.sample()
            '''
            write your code here
            update the dictionary q initialized above according
            to the SARSA algorithm's Q-Value Function updates.
            '''
            # The start state is non-terminal, so an epsilon-greedy action
            # is always available.
            sampled_a = self.epsilon_greedy_action(
                nt_state=state, q=q, epsilon=epsilon)
            while True:
                # sample the (next state, reward) pair
                nxt_state, r = sample_func(state, sampled_a)
                # If the next state is terminal, Q(S_{t+1}, A_{t+1}) = 0:
                # update and end the episode.
                if nxt_state in self.terminals:
                    q[state][sampled_a] += step_size * (r - q[state][sampled_a])
                    break
                # Otherwise sample A_{t+1}, update Q(S_t, A_t), and continue.
                sampled_nxt_a = self.epsilon_greedy_action(
                    nt_state=nxt_state, q=q, epsilon=epsilon)
                q[state][sampled_a] += step_size * (
                    r + q[nxt_state][sampled_nxt_a] - q[state][sampled_a])
                state = nxt_state
                sampled_a = sampled_nxt_a
        vf_dict: V[Cell] = {s: max(d.values()) for s, d in q.items()}
        policy: FinitePolicy[Cell, Move] = FinitePolicy(
            {s: Constant(max(d.items(), key=itemgetter(1))[0])
             for s, d in q.items()}
        )
        return (vf_dict, policy)

    def get_q_learning_vf_and_policy(
        self,
        states_actions_dict: Mapping[Cell, Optional[Set[Move]]],
        sample_func: Callable[[Cell, Move], Tuple[Cell, float]],
        episodes: int = 10000,
        step_size: float = 0.01,
        epsilon: float = 0.1
    ) -> Tuple[V[Cell], FinitePolicy[Cell, Move]]:
        '''
        states_actions_dict gives us the set of possible moves from
        a non-block cell.
        sample_func is a function with two inputs: state and action,
        and with output as a sampled pair of (next_state, reward).
        '''
        q: Dict[Cell, Dict[Move, float]] = \
            {s: {a: 0. for a in actions} for s, actions in
             states_actions_dict.items() if actions is not None}
        nt_states: CellSet = {s for s in q}
        uniform_states: Choose[Cell] = Choose(nt_states)
        for episode_num in range(episodes):
            state: Cell = uniform_states.sample()
            '''
            write your code here
            update the dictionary q initialized above according
            to the Q-learning algorithm's Q-Value Function updates.
            '''
            while True:
                # Behavior policy: epsilon-greedy action at every step
                # (the current state is always non-terminal here).
                sampled_a = self.epsilon_greedy_action(
                    nt_state=state, q=q, epsilon=epsilon)
                # sample the (next state, reward) pair
                nxt_state, r = sample_func(state, sampled_a)
                # If the next state is terminal, its Q-Value is 0:
                # update and end the episode.
                if nxt_state in self.terminals:
                    q[state][sampled_a] += step_size * (r - q[state][sampled_a])
                    break
                # Target policy: greedy, i.e. the max Q-Value of the next state.
                q[state][sampled_a] += step_size * (
                    r + max(q[nxt_state].values()) - q[state][sampled_a])
                state = nxt_state

        vf_dict: V[Cell] = {s: max(d.values()) for s, d in q.items()}
        policy: FinitePolicy[Cell, Move] = FinitePolicy(
            {s: Constant(max(d.items(), key=itemgetter(1))[0])
             for s, d in q.items()}
        )
        return (vf_dict, policy)

    def print_vf_and_policy(
        self,
        vf_dict: V[Cell],
        policy: FinitePolicy[Cell, Move]
    ) -> None:
        display = "%5.2f"
        display1 = "%5d"
        vf_full_dict = {
            **{s: display % -v for s, v in vf_dict.items()},
            **{s: display % 0.0 for s in self.terminals},
            **{s: 'X' * 5 for s in self.blocks}
        }
        print("   " + " ".join([display1 % j for j in range(self.columns)]))
        for i in range(self.rows - 1, -1, -1):
            print("%2d " % i + " ".join(vf_full_dict[(i, j)]
                                        for j in range(self.columns)))
        print()
        pol_full_dict = {
            **{s: possible_moves[policy.act(s).value]
               for s in self.get_all_nt_states()},
            **{s: 'T' for s in self.terminals},
            **{s: 'X' for s in self.blocks}
        }
        print("   " + " ".join(["%2d" % j for j in range(self.columns)]))
        for i in range(self.rows - 1, -1, -1):
            print("%2d  " % i + "  ".join(pol_full_dict[(i, j)]
                                          for j in range(self.columns)))
        print()



In [56]:

if __name__ == '__main__':
    wg = WindyGrid(
        rows=5,
        columns=5,
        blocks={(0, 1), (0, 2), (0, 4), (2, 3), (3, 0), (4, 0)},
        terminals={(3, 4)},
        wind=[(0., 0.9), (0.0, 0.8), (0.7, 0.0), (0.8, 0.0), (0.9, 0.0)],
        bump_cost=4.0
    )
    valid = wg.validate_spec()
    if valid:
        wg.print_wind_and_bumps()
        vi_vf_dict, vi_policy = wg.get_vi_vf_and_policy()
        print("Value Iteration\n")
        wg.print_vf_and_policy(
            vf_dict=vi_vf_dict,
            policy=vi_policy
        )
        mdp: FiniteMarkovDecisionProcess[Cell, Move] = wg.get_finite_mdp()

        def sample_func(state: Cell, action: Move) -> Tuple[Cell, float]:
            return mdp.step(state, action).sample()

        sarsa_vf_dict, sarsa_policy = wg.get_sarsa_vf_and_policy(
            states_actions_dict=wg.get_states_actions_dict(),
            sample_func=sample_func,
            episodes=100000,
            step_size=0.01
        )
        print("SARSA\n")
        wg.print_vf_and_policy(
            vf_dict=sarsa_vf_dict,
            policy=sarsa_policy
        )

        ql_vf_dict, ql_policy = wg.get_q_learning_vf_and_policy(
            states_actions_dict=wg.get_states_actions_dict(),
            sample_func=sample_func,
            episodes=30000,
            step_size=0.01,
            epsilon=0.1
        )
        print("Q-Learning\n")
        wg.print_vf_and_policy(
            vf_dict=ql_vf_dict,
            policy=ql_policy
        )

    else:
        print("Invalid Spec of Windy Grid")

Column 0: Down Prob = 0.00, Up Prob = 0.90
Column 1: Down Prob = 0.00, Up Prob = 0.80
Column 2: Down Prob = 0.70, Up Prob = 0.00
Column 3: Down Prob = 0.80, Up Prob = 0.00
Column 4: Down Prob = 0.90, Up Prob = 0.00
Bump Cost = 4.00

Number of Iterations: 111
Value Iteration

       0     1     2     3     4
 4 XXXXX  5.25  2.02  1.10  1.00
 3 XXXXX  8.53  5.20  1.00  0.00
 2  9.21  6.90  8.53 XXXXX  1.00
 1  8.36  9.21  8.36 12.16 11.00
 0 10.12 XXXXX XXXXX 17.16 XXXXX

    0  1  2  3  4
 4  X  R  R  R  D
 3  X  R  R  R  T
 2  R  U  U  X  U
 1  R  U  L  L  U
 0  U  X  X  U  X

SARSA

       0     1     2     3     4
 4 XXXXX  5.32  2.02  1.11  1.00
 3 XXXXX  8.25  5.28  1.00  0.00
 2  9.20  6.82  8.16 XXXXX  1.00
 1  8.26  9.17  8.37 11.87 11.68
 0 10.11 XXXXX XXXXX 17.05 XXXXX

    0  1  2  3  4
 4  X  R  R  R  D
 3  X  R  R  R  T
 2  R  U  U  X  U
 1  R  U  L  L  U
 0  U  X  X  U  X

Q-Learning

       0     1     2     3     4
 4 XXXXX  5.34  2.02  1.08  1.00
 3 XXXXX  8.45  5.17  1

# P2

### a)

Given $c_t=\theta x_t$, we have $x_{t+1}=(1-\theta)x_t$, or equivalently $x_t=x_0(1-\theta)^t$.
<br><br>$V_\theta(x_0)=E[G_\theta(x_0)]=\sum_{t=0}^\infty \beta^t U(\theta x_t)=\sum_{t=0}^\infty \beta^t U(\theta x_0 (1-\theta)^t)=\sum_{t=0}^\infty \theta^{1-\gamma} \beta^t (1-\theta)^{t(1-\gamma)}U(x_0)$
<br><br>
The last step uses $U(kx)=k^{1-\gamma}U(x)$, which holds for $U(x)={x^{1-\gamma} \over 1-\gamma}$. This is the sum of an infinite geometric series with ratio $\beta(1-\theta)^{1-\gamma}$, so, provided $\beta(1-\theta)^{1-\gamma}<1$, we know that $V_\theta(x_0)={\theta^{(1-\gamma)}U(x_0) \over 1-\beta(1-\theta)^{1-\gamma}}$
<br>
Therefore, $V_\theta(x)={\theta^{(1-\gamma)}U(x) \over 1-\beta{(1-\theta)}^{1-\gamma}}={\theta^{(1-\gamma)}x^{1-\gamma} \over (1-\beta{(1-\theta)}^{1-\gamma})(1-\gamma)}$
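As a quick numerical sanity check, the closed form can be compared against a truncated version of the discounted sum. This is only a sketch: the function names, the parameter values, and the truncation horizon are illustrative assumptions, and the check is meaningful only when $\beta(1-\theta)^{1-\gamma}<1$.

```python
def crra_utility(c: float, gamma: float) -> float:
    """U(c) = c^(1-gamma) / (1-gamma), the utility implied by the last formula (gamma != 1)."""
    return c ** (1.0 - gamma) / (1.0 - gamma)


def value_closed_form(x: float, theta: float, beta: float, gamma: float) -> float:
    """Closed-form V_theta(x) from the geometric series."""
    ratio = beta * (1.0 - theta) ** (1.0 - gamma)
    if ratio >= 1.0:
        raise ValueError("the series diverges for these parameters")
    return theta ** (1.0 - gamma) * crra_utility(x, gamma) / (1.0 - ratio)


def value_truncated_sum(x: float, theta: float, beta: float, gamma: float,
                        horizon: int = 500) -> float:
    """Direct evaluation of sum_t beta^t U(theta * x_t), truncated at `horizon` terms."""
    total, wealth, discount = 0.0, x, 1.0
    for _ in range(horizon):
        total += discount * crra_utility(theta * wealth, gamma)
        wealth *= 1.0 - theta   # x_{t+1} = (1 - theta) x_t
        discount *= beta
    return total


# Illustrative parameters (assumed, not taken from the problem statement).
x0, theta, beta, gamma = 10.0, 0.3, 0.9, 0.5
print(value_closed_form(x0, theta, beta, gamma))
print(value_truncated_sum(x0, theta, beta, gamma))
```

The two printed numbers should agree to many decimal places, since the neglected tail shrinks geometrically with the horizon.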

### b)

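First-order condition: setting ${dV_\theta \over d\theta}=0$, we have:
<br>$(1-\gamma)\theta^{-\gamma}U(x)(1-\beta(1-\theta)^{1-\gamma})=\theta^{1-\gamma}U(x)\beta (1-\gamma)(1-\theta)^{-\gamma}$
<br>Dividing both sides by $(1-\gamma)\theta^{-\gamma}U(x)$ gives $1-\beta(1-\theta)^{1-\gamma}=\beta\theta(1-\theta)^{-\gamma}$, i.e. $1=\beta(1-\theta)^{-\gamma}$, so
<br>$\theta^*=1-\beta^{1\over\gamma}$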
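The stationary point can be cross-checked with a brute-force grid search over $\theta$. The sketch below uses assumed names and parameter values; grid points where the geometric series would diverge are excluded, which matters when $\gamma>1$.

```python
def value_of_theta(theta: float, beta: float, gamma: float, x: float = 1.0) -> float:
    """V_theta(x) for a constant consumption fraction theta (minus infinity if the series diverges)."""
    ratio = beta * (1.0 - theta) ** (1.0 - gamma)
    if ratio >= 1.0:
        return float("-inf")
    utility = x ** (1.0 - gamma) / (1.0 - gamma)
    return theta ** (1.0 - gamma) * utility / (1.0 - ratio)


def best_theta_on_grid(beta: float, gamma: float, grid_size: int = 1000) -> float:
    """Argmax of V_theta over the open grid {1/grid_size, ..., (grid_size-1)/grid_size}."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(grid, key=lambda th: value_of_theta(th, beta, gamma))


def theta_star(beta: float, gamma: float) -> float:
    """Closed-form maximizer from the first-order condition."""
    return 1.0 - beta ** (1.0 / gamma)


for beta, gamma in [(0.9, 0.5), (0.9, 2.0)]:  # illustrative values
    print(beta, gamma, best_theta_on_grid(beta, gamma), theta_star(beta, gamma))
```

The grid maximizer should land within one grid step of $1-\beta^{1/\gamma}$ for both parameter choices.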

### c)

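Plug $\theta^*=1-\beta^{1\over\gamma}$ back into $V_\theta(x)={\theta^{(1-\gamma)}x^{1-\gamma} \over (1-\beta{(1-\theta)}^{1-\gamma})(1-\gamma)}$ to get the maximized $V^*(x)$. Since $1-\theta^*=\beta^{1\over\gamma}$, we have $\beta(1-\theta^*)^{1-\gamma}=\beta^{1\over\gamma}$, so the factor $1-\beta(1-\theta^*)^{1-\gamma}$ in the denominator equals $\theta^*$.
<br> We have 
$V^*(x)={\theta^*}^{-\gamma}{x^{1-\gamma} \over 1-\gamma}=(1-\beta^{1\over\gamma})^{-\gamma}{x^{1-\gamma} \over 1-\gamma}$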

### d)

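<br> In our example, the Bellman Optimality Equation $V^*(s)=\max_{a \in A}\{R(s,a)+\gamma \sum_{s' \in N}P(s,a,s')V^*(s')\}\ \forall s \in N$ (where the discount factor is $\beta$ and the transitions are deterministic) can be written as: $V^*(x)=\max_{\theta \in [0,1]}\{U(\theta x)+\beta V^*((1-\theta)x)\}\ \forall x$. With the maximum attained at $\theta^*$, this reads $V^*(x)=U(\theta^* x)+\beta V^*((1-\theta^*)x)$.
<br>
Plug $\theta^*=1-\beta^{1\over\gamma}$ and $V^*(x)=(1-\beta^{1\over\gamma})^{-\gamma}{x^{1-\gamma} \over 1-\gamma}$ back into the above equation. Using $\beta(1-\theta^*)^{1-\gamma}=\beta^{1\over\gamma}=1-\theta^*$, we can observe that:
<br>
$V^*(x)-\beta V^*((1-\theta^*)x)={\theta^*}^{-\gamma}U(x)(1-(1-\theta^*))={{\theta^*}^{1-\gamma}}U(x)=U(\theta^*x)$
<br>Moreover, the right-hand side of the Bellman equation is concave in $\theta$, and its first-order condition $\theta^{-\gamma}=\beta{\theta^*}^{-\gamma}(1-\theta)^{-\gamma}$ holds at $\theta=\theta^*$, so the maximum is attained there. Therefore, our calculated Optimal Policy and Optimal Value Function satisfy the Bellman Optimality Equation. 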
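The same verification can be done numerically: maximize the right-hand side of the Bellman equation over a grid of $\theta$ values and compare with the closed-form $V^*(x)$. This is a minimal sketch with assumed function names and parameter values; the grid excludes the endpoints so that no zero is raised to a negative power.

```python
def utility(c: float, gamma: float) -> float:
    """CRRA utility U(c) = c^(1-gamma) / (1-gamma)."""
    return c ** (1.0 - gamma) / (1.0 - gamma)


def optimal_value(x: float, beta: float, gamma: float) -> float:
    """Closed-form V*(x) = (1 - beta^(1/gamma))^(-gamma) * U(x)."""
    theta_star = 1.0 - beta ** (1.0 / gamma)
    return theta_star ** (-gamma) * utility(x, gamma)


def bellman_rhs(x: float, theta: float, beta: float, gamma: float) -> float:
    """U(theta x) + beta V*((1 - theta) x) for one candidate theta."""
    return utility(theta * x, gamma) + beta * optimal_value((1.0 - theta) * x, beta, gamma)


def bellman_max(x: float, beta: float, gamma: float, grid_size: int = 1000) -> float:
    """Maximum of the Bellman right-hand side over an open grid of theta values."""
    grid = [i / grid_size for i in range(1, grid_size)]
    return max(bellman_rhs(x, th, beta, gamma) for th in grid)


for beta, gamma in [(0.9, 0.5), (0.9, 2.0)]:  # illustrative values
    print(bellman_max(10.0, beta, gamma), optimal_value(10.0, beta, gamma))
```

Each printed pair should agree up to the grid resolution, with the grid maximum never exceeding the closed form by more than rounding error.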

# P3

## a)

$S:\{(b,m,m_{nxt})|b \in [0,L],m \in [0,\infty],m_{nxt}\in [0,\infty]\}$ where $b=$current balance, $m=$current mortgage rate, and $m_{nxt}=$ new rate offered
<br>$T:\{(b,m,m_{nxt})|(b,m,m_{nxt}) \in S,b=0\}$
<br>$A:\{1,2\}$ 
where 1,2 indicates which mortgage option to choose given observation of current 
$(b,m,m_{nxt})$
<br> Discount factor:
$\gamma={1 \over 1+r/12}$, 
where we convert the annualized discount rate $r$ into the monthly rate $r/12$
<br> Assume that the distribution of new offer mortgage rate follows 
$p(m_{nxt}=m')=f(m')$
<br><br>
To simplify notation, let the principal repaid in one month be 
$\delta B(b,m')=P={bm'/12 \over (1+m'/12)^n-1}$
<br>$Pr(s,a,s',r)=Pr((b,m,m_{nxt}),a,(b',m',m_{nxt}'),r)= \\
\begin{cases}
        f(m_{nxt}') & \text{if } a=1,m'=m,b'=b-\delta B(b,m'),r=-{bm'/12 \over 1-(1+m'/12)^{-n}}\\
        f(m_{nxt}') & \text{if } a=2,m'=m_{nxt},b'=b-\delta B(b,m'),r=-(C+{bm'/12 \over 1-(1+m'/12)^{-n}})\\
        0 & \text{OTW} \\
        \end{cases} 
$


Explanation of the MDP formulation: we pause the clock at the start of each month, observe the current loan balance, the current mortgage rate, and the new rate offered, $(b,m,m_{nxt})$, and decide whether or not to accept the new loan. Since we only care about the optimal policy, the loan payment in the first month is fixed and independent of our policy, so it can be ignored. After we make the decision, everything in the next state is determined except for the rate that will be offered in the following month, which is drawn from $f$. In short, this is a discrete-time MDP with an infinite (continuous) state space and a finite action space.
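A single transition of this MDP can be sketched as follows. The sketch assumes a level-payment mortgage whose payment is computed from the monthly rate $m'/12$ (so that payment minus interest equals $\delta B$), and it treats `months_left` (the $n$ of the formulas) and `refinance_cost` (the $C$ of the formulas) as given constants. All function and parameter names are hypothetical.

```python
from typing import Tuple

State = Tuple[float, float, float]  # (balance b, current rate m, offered rate m_nxt)


def monthly_payment(balance: float, annual_rate: float, months_left: int) -> float:
    """Level monthly payment on `balance` at monthly rate annual_rate / 12."""
    i = annual_rate / 12.0
    return balance * i / (1.0 - (1.0 + i) ** (-months_left))


def principal_reduction(balance: float, annual_rate: float, months_left: int) -> float:
    """delta B: the part of the monthly payment that reduces the balance."""
    i = annual_rate / 12.0
    return balance * i / ((1.0 + i) ** months_left - 1.0)


def mortgage_step(state: State, action: int, next_offer: float,
                  months_left: int, refinance_cost: float) -> Tuple[State, float]:
    """Apply action 1 (keep current rate) or 2 (refinance at the offered rate)."""
    balance, rate, offered = state
    new_rate = rate if action == 1 else offered
    reward = -monthly_payment(balance, new_rate, months_left)
    if action == 2:
        reward -= refinance_cost
    new_balance = balance - principal_reduction(balance, new_rate, months_left)
    # next_offer plays the role of a draw from the offer distribution f
    return (new_balance, new_rate, next_offer), reward


next_state, reward = mortgage_step((300000.0, 0.06, 0.05), 2, 0.055, 360, 2000.0)
print(next_state, reward)
```

In a full simulation `next_offer` would be sampled from $f$ at every step; it is passed in explicitly here to keep the step function deterministic and easy to test.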

## b)

First, let's identify some traits of our specific MDP problem:
<br>1) Though there are terminal states, each episode can last an extremely long time. Especially when $n$ is large, an agent may choose to keep renewing the mortgage loan. 
<br>2) The state space is infinite (continuous) and the action space is small (two actions).

<br><br> Trait 1) indicates that TD-family algorithms are a better fit than MC, since they can learn from incomplete episodes.
<br> Trait 2) indicates that tabular methods are not suitable and function approximation is needed.

<br><br> Therefore, I propose to implement the DQN control algorithm. There are several aspects that we need to focus on in our implementation:
<br> 1) Batch norm or another normalization method: while $m$ and the binary action $a$ vary in a tight range across episodes, $b$ and $r$ can vary from millions down to single digits. Therefore, after we retrieve mini-batches from the replay memory, we have to normalize the raw data or clip the rewards to ensure stability.
<br> 2) Given $n$ and the distribution of the offered rates, we may want to consider early-stopping of episodes. If $n$ is extremely large, an agent may never get out of the first few episodes, and the replay memory will therefore be flooded with small-loan-balance atomic experiences.
Therefore, we may define a convergence threshold and manually terminate the first few episodes.
<br> 3) In summary, we can follow this DQN procedure:
<br> a) Given state $(b,m,m_{nxt})$, take action $a$ according to an epsilon-greedy policy extracted from the Q-network values $Q((b,m,m_{nxt}), a; w)$. Choose the epsilon value carefully, since only two actions are available in each state.
<br> b) Derive the resulting next state and reward, and store the atomic experience in the replay memory.
<br> c) Sample a random mini-batch from the replay memory, normalize or clip the batch, and update $w$ using targets computed with the frozen parameters $w^-$.
<br> d) Update $w^-$ to $w$ infrequently, after a reasonable number of time steps.
<br> e) Monitor the length of each episode and decide whether it is necessary to stop the episode early with respect to a threshold.
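Steps a) to e) can be sketched in a few dozen lines. To stay self-contained, the sketch below replaces the Q-network with a linear function of normalized features, which is a simplifying assumption rather than a real DQN. `env_step` and `sample_start` are hypothetical callables supplied by the user, and all hyperparameter defaults are illustrative.

```python
import random
from collections import deque

ACTIONS = (1, 2)  # 1 = keep the current mortgage, 2 = refinance


def features(state, balance_scale):
    """Scale (b, m, m_nxt) so that every input is of order one."""
    b, m, m_nxt = state
    return [1.0, b / balance_scale, m, m_nxt]


def q_value(weights, state, action, balance_scale):
    """Linear stand-in for the Q-network: Q(s, a; w) = w[a] . phi(s)."""
    phi = features(state, balance_scale)
    return sum(w * f for w, f in zip(weights[action], phi))


def epsilon_greedy(weights, state, epsilon, balance_scale, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_value(weights, state, a, balance_scale))


def dqn_update(weights, frozen, batch, discount, step_size,
               balance_scale, reward_scale):
    """Semi-gradient updates toward targets built from the frozen weights."""
    for state, action, reward, next_state, done in batch:
        target = reward / reward_scale  # reward normalization
        if not done:
            target += discount * max(
                q_value(frozen, next_state, a, balance_scale) for a in ACTIONS)
        td_error = target - q_value(weights, state, action, balance_scale)
        phi = features(state, balance_scale)
        weights[action] = [w + step_size * td_error * f
                           for w, f in zip(weights[action], phi)]


def train(env_step, sample_start, episodes=50, max_steps=200, batch_size=16,
          memory_size=5000, sync_every=100, epsilon=0.1, discount=0.99,
          step_size=0.01, balance_scale=1e5, reward_scale=1e3, seed=0):
    rng = random.Random(seed)
    weights = {a: [0.0] * 4 for a in ACTIONS}
    frozen = {a: list(w) for a, w in weights.items()}
    memory = deque(maxlen=memory_size)
    total_steps = 0
    for _ in range(episodes):
        state = sample_start(rng)
        for _ in range(max_steps):  # e) cap the episode length
            action = epsilon_greedy(weights, state, epsilon, balance_scale, rng)  # a)
            next_state, reward, done = env_step(state, action, rng)
            memory.append((state, action, reward, next_state, done))  # b)
            if len(memory) >= batch_size:  # c) mini-batch update
                batch = rng.sample(list(memory), batch_size)
                dqn_update(weights, frozen, batch, discount, step_size,
                           balance_scale, reward_scale)
            total_steps += 1
            if total_steps % sync_every == 0:  # d) refresh the frozen copy
                frozen = {a: list(w) for a, w in weights.items()}
            if done:
                break
            state = next_state
    return weights
```

Here `env_step(state, action, rng)` is expected to return `(next_state, reward, done)` and `sample_start(rng)` a start state. Because rewards are divided by `reward_scale`, the learned Q-values are in scaled units, which does not affect the greedy policy.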
