Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "John Ortiz"
COLLABORATORS = ""

---

<a id='top'></a>

# CSCI 3202, Spring 2021:  Assignment 5  

Shortcuts:  [top](#top) -- [1](#p1) | [1a](#p1a) | [1b](#p1b) | [1c](#p1c) | [1d](#p1d) | [1e](#p1e) | [1f](#p1f) | [1g](#p1g) -- [2](#p2) | [2a](#p2a) | [2b](#p2b) | [2c](#p2c) | [2d](#p2d) | [2e](#p2e) 

# Assignment Overview

This assignment is an exercise in implementing and analyzing Markov Decision Processes (MDPs). Problem 1 asks you to code a solution to a specific scenario, while Problem 2 is a conceptual question which asks you to describe an MDP problem of your own design.

Here's a summary of the tasks required and the associated points:

| Problem #  | Tasks                                                  | Points  |
|:---        |:---                                                    |:---:    |
| 1a         | Code: Complete implementation of `MDP` class           | 10      |
| 1b         | Code: Implement `value_iteration` and `find_policy`    | 5       |
| 1c         | Code and create: Generate and illustrate optimal path  | 5       |
| 1d         | Written: analyze policy                                | 5       |
| 1e         | Code: adjust non-terminal rewards                      | 5       |
| 1f         | Code and write: adjust terminal rewards                | 5       |
| 1g         | Written: analyze transition model                      | 5       |
| 2a         | Written: define problem                                | 4       |
| 2b         | Written: define states                                 | 4       |
| 2c         | Written: define reward                                 | 4       |
| 2d         | Written: define actions and transition                 | 4       |
| 2e         | Written: define optimal policy                         | 4       |
| Total      |                                                        | 60      |


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# add any imports you may need

<a id='p1'></a>
[Back to top](#top)

# Problem 1: Navigating an awkward situation with grace and poise

<img src='https://www.explainxkcd.com/wiki/images/5/5f/interaction.png' style="width: 600px;"/>

Suppose you are at a social event where you would like to avoid any interaction with a large number of the other attendees. It's not that you don't like them, it's just that you don't like *talking to* them. A few of your good friends are also in attendance, but they are tucked away in a corner. The rectangular room in which the event is being held spans gridcells at $x=1,2,\ldots, 6$ and $y=1,2,\ldots, 5$. At the eastern edge ($x=6$) of this first floor room, there is a balcony, with a 6-foot drop. If the event becomes unbearably awkward, you can jump off the balcony and run away. Of course, this might hurt a little bit, so we should incorporate this into our reward structure.

The terminal states and rewards associated with them are given in the diagram below. The states are represented as $(x,y)$ tuples. The available actions in non-terminal states include moving exactly 1 unit North (+y), South (-y), East (+x) or West (-x), although you should not include walking into walls, because that would be embarrassing in front of all these other people. Represent actions as one of 'N', 'S', 'E', or 'W'. For now, assume all non-terminal states have a default reward of -0.01, and use a discount factor of 0.99.

<img src="http://www.cs.colorado.edu/~tonyewong/home/resources/hw06_mdp.png" style="width: 400px;"/>

Use the following transition model for this decision process, if you are trying to move from state $s$ to state $s'$:
* you successfully move from $s$ to $s'$ with probability 0.6
* the remaining 0.4 probability is spread equally likely across state $s$ **and** all adjacent (N/S/E/W) states except for $s'$. Note that this does not necessarily mean that all adjacent states have 0.1, because some states do not have 4 adjacent states.
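
The transition model above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the name `slip_distribution` and its default grid size are hypothetical, not part of the assignment's `MDP` class, and the sketch ignores terminal states (where no move is made).

```python
def slip_distribution(state, action, ncol=6, nrow=5):
    """Return {next_state: probability} for one intended move (hypothetical helper)."""
    moves = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
    x, y = state
    # In-bounds neighbours of the current cell
    neighbours = [(x + dx, y + dy) for dx, dy in moves.values()
                  if 1 <= x + dx <= ncol and 1 <= y + dy <= nrow]
    dx, dy = moves[action]
    target = (x + dx, y + dy)
    if target not in neighbours:
        raise ValueError("action would walk into a wall")
    # The current cell plus every neighbour except the target share the 0.4
    others = [state] + [n for n in neighbours if n != target]
    dist = {target: 0.6}
    for s in others:
        dist[s] = 0.4 / len(others)
    return dist

corner = slip_distribution((1, 1), 'N')    # 0.4 shared by 2 cells
interior = slip_distribution((3, 3), 'W')  # 0.4 shared by 4 cells
```

In the corner, the leftover 0.4 is shared by only two cells (staying put and the one other neighbour), so each gets 0.2 rather than 0.1.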

<a id='p1a'></a>
[Back to top](#top)

## (1a) - 10 points

Complete the `MDP` class below. The docstring comments provide some desired specifications. You may add additional methods or attributes, if you would like.

In [3]:
class MDP:
    def __init__(self, nrow, ncol, terminal_states, default_reward, df):
        '''Create/store the following attributes:
        self.states -- a list of all the states as (x,y) tuples
        self.terminal_states -- a dictionary with terminal state keys, and rewards as values
        self.default_reward -- the reward for being in any non-terminal state
        self.df -- discount factor
        ... and anything else you decide will be useful!
        '''
        
        # Store the grid dimensions so the class works for any room size
        self.nrow = nrow
        self.ncol = ncol
        self.states = []
        for x in range(1, ncol + 1):
            for y in range(1, nrow + 1):
                self.states.append((x, y))
        self.terminal_states = terminal_states
        self.default_reward = default_reward
        self.df = df

        # All actions in the fixed order N, S, E, W (None marks a terminal state)
        self.possibleActions = ['N','S','E','W',None]

    # helper for actions
    def isInBounds(self, newX, newY):
        '''Return True if (newX, newY) lies inside the grid.'''
        return (1 <= newX <= self.ncol) and (1 <= newY <= self.nrow)
    def actions(self, state):
        '''Return a list of available actions from the given state.
        Possible actions are 'N','S','E','W'
        [None] are the actions available from a terminal state.
        '''
        # first check if it is a terminal state
        if state in self.terminal_states:
            return [None]
        # (dx, dy) offset of each move, listed in the order N, S, E, W
        offsets = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
        finalActions = []
        # keep only the actions whose destination is inside the room
        for action, (dx, dy) in offsets.items():
            if self.isInBounds(state[0] + dx, state[1] + dy):
                finalActions.append(action)
        return finalActions
    def reward(self, state):
        '''Return the reward for being in the given state'''
        
        #check if given state is terminal state
        if(state in self.terminal_states):
            #terminal reward
            reward = self.terminal_states.get(state)
        else:
            #default reward
            reward = self.default_reward
        #raise NotImplementedError()
        return reward
    def result(self, state, action):
        '''Return the resulting state (as a tuple) from doing the given
        action in the given state, without uncertainty. Uncertainty
        is incorporated into the transition method.
        state -- a tuple representing the current state
        action -- one of N, S, E or W, as a string
        '''
        assert action in self.actions(state), 'Error: action needs to be available in that state'
        assert state in self.states, 'Error: invalid state'
        if action is None:
            # terminal state: there is nowhere to move
            return None
        # (dx, dy) offset of each move: North is +y, East is +x
        offsets = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
        dx, dy = offsets[action]
        return (state[0] + dx, state[1] + dy)

    def transition(self, state, action):
        '''Return a list of (probability, next_state) associated
        with the possibilities of taking the given action from the given state.
        '''
        #you successfully move from  𝑠  to  𝑠′  with probability 0.6 the remaining 0.4 probability is 
        #spread equally likely across state  𝑠  and all adjacent (N/S/E/W) states except for  𝑠′ . 
        #Note that this does not necessarily mean that all adjacent states have 0.1, 
        #because some states do not have 4 adjacent states.
        if action is None:
            # This happens for a terminal state
            return [(0, state)]
        else:
            # Not a terminal state
            
            # YOUR CODE HERE
            #grab available actions
            avActions = self.actions(state)
            avActions.append(None)
            indexOfAction = avActions.index(action)
            p = [1]*len(avActions)
            p[indexOfAction]= .6
            evenProb = .4/(len(avActions)-1)
            
            #sanity check
            #print(avActions)
            #print(p)
            #print(evenProb)
            
            #update rest of probabilities
            for i in range(len(avActions)):
                if(i!=indexOfAction):
                    p[i]= evenProb
            #print(p)
            
            transitionList =[]
            #create new next state for each action in avActions
            for i in range(len(avActions)):
                if(avActions[i]!= None):
                    newState = self.result(state,avActions[i])
                    transitionList.append((p[i],newState))
                else:
                    transitionList.append((p[i],state))
            
            #sPrime =np.random.choice(avActions,p=p)
            #avActions.append()
            #raise NotImplementedError()
            
            #print(transitionList)
            return transitionList
    def expected_utility(self, next_states, cur_util):
        '''Return the expected utility given generated list of possible 
        next states and the current utility, which is a dictionary of the form {state : utility}
        '''
        
        sumUtil=0
        #print("Next States:"+str(next_states))
        for possibleState in next_states:
            #print("possibleState: "+str(possibleState))
            prob = possibleState[0]
            util_of_state =cur_util.get(possibleState[1])
            #print("multiplying: "+str(prob)+", "+str(util_of_state) )
            sumUtil += (prob*util_of_state)
        return sumUtil
     

### (1a) tests

Note that these are non-exhaustive, because there is some flexibility in how the `transition` method works.

In [4]:
## BEGIN TESTS
nrow = 3
ncol = 3
default_reward = -0.2
discount = 0.5
terminal = {(1,3):-1, (1,2):2}
mdp_simple = MDP(nrow, ncol, terminal, default_reward, discount)

actions1 = set(mdp_simple.actions((2,2)))
assert (actions1 == {'N','S','E','W'}), "Expected set of actions is {'N','S','E','W'}, your code returned: %s" % actions1

actions2 = set(mdp_simple.actions((1,1)))
assert (actions2 == {'N','E'}), "Expected set of actions is {'N','E'}, your code returned: %s" % actions2

actions3 = set(mdp_simple.actions((1,2)))
assert (actions3 == {None}), "Expected set of actions is {None}, your code returned: %s" % actions3

reward1 = mdp_simple.reward((1,2))
assert (reward1 == 2), "Expected reward is 2, your code returned: %f" % reward1

reward2 = mdp_simple.reward((2,2))
assert (reward2 == -0.2), "Expected reward is -0.2, your code returned: %f" % reward2

result1 = mdp_simple.result((1,1), 'N')
assert (result1 == (1,2)), "Expected result is (1,2), your code returned: %s" % (result1,)

print("All tests passed: 5 points")
## END TESTS

All tests passed: 5 points


In [5]:
## BEGIN TESTS
# set values from problem statement
nrow = 5
ncol = 6
default_reward = -0.01
discount = 0.99
terminal = {(1,3):-1, (1,4):2, (1,5):2, (2,1):-1, (3,1):-1, (3,4):-1, (3,5):1,
            (4,3):-1, (4,4):-1, (6,1):-5, (6,2):-5, (6,3):-5, (6,4):-5, (6,5):-5}
mdp_p1 = MDP(nrow, ncol, terminal, default_reward, discount)

# Find the expected utility of walking N from (1,1):
util_old = {s : s[0]+s[1] for s in mdp_p1.states}

next_states = mdp_p1.transition((1,1), 'N')
assert (len(next_states) == 3), "Expected 3 possible next states, your code returned: %d" % len(next_states)

exp_util = mdp_p1.expected_utility(next_states, util_old)
assert (exp_util == 2.8), "Expected utility should be 2.8, your code returned %f" % exp_util

print("All tests passed: 5 points")
## END TESTS

All tests passed: 5 points
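
As a check on the expected value in the test above, the arithmetic can be worked by hand, assuming the transition model from the problem statement. Walking N from $(1,1)$ reaches $(1,2)$ with probability 0.6, and the remaining 0.4 is split evenly between staying at $(1,1)$ and slipping to $(2,1)$. With `util_old` equal to $x+y$ for each state:

$$0.6\,(1+2) + 0.2\,(1+1) + 0.2\,(2+1) = 1.8 + 0.4 + 0.6 = 2.8$$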


<a id='p1b'></a>
[Back to top](#top)

## (1b) - 5 points

Implement value iteration to calculate the utilities for each state.  Also implement a function that takes as arguments an `MDP` object and a dictionary of state-utility pairs (key-value) and returns a dictionary for the optimal policy.  The optimal policy dictionary should have state tuples as keys and the optimal move (None, N, S, E or W) as values.
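
For reference, the standard value iteration update can be written as follows. The symbols are notation introduced here rather than taken from the assignment: $U_k$ is the utility estimate after $k$ sweeps, $\gamma$ is the discount factor, $A(s)$ is the set of available actions, and $P(s' \mid s, a)$ is the transition model. This is the synchronous textbook form; an implementation that updates utilities in place during a sweep converges to the same fixed point.

$$U_{k+1}(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_k(s')$$

For a terminal state there is no action to take, so the sum is empty and $U(s) = R(s)$. Sweeps repeat until $\max_s \lvert U_{k+1}(s) - U_k(s) \rvert$ falls below the tolerance. Given converged utilities $U$, the policy picks the action with the highest expected utility:

$$\pi^*(s) = \operatorname*{arg\,max}_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$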

In [6]:
def value_iteration(mdp, tol=1e-3):
    # TODO: 
    # 1. initialize utility for all states to 0
    # 2. for each state on the board
    #    2.1. calculate expected utility for each possible next state
    #    2.2. find best utility out of possible expected utilities
    #    2.3. define new utility of the state
    # 3. Repeat 2 until problem is converged
    utility = {s : 0 for s in mdp.states}
    max_change = float('inf')

    while max_change > tol:
        # Work on a copy that is updated in place as the states are swept
        snapUtility = utility.copy()
        for state in mdp.states:
            possibleUtilities = []  # expected utility of each available action
            for action in mdp.actions(state):
                if action is not None:
                    nextStates = mdp.transition(state, action)
                    possibleUtilities.append(mdp.expected_utility(nextStates, snapUtility))
                else:
                    # terminal state: no future utility
                    possibleUtilities.append(0)
            snapUtility[state] = max(possibleUtilities) * mdp.df + mdp.reward(state)
        # largest change in any state's utility during this sweep
        max_change = max(abs(snapUtility[s] - utility[s]) for s in mdp.states)
        utility = snapUtility

    return utility
    

def find_policy(mdp, utility):
    '''Return a dictionary containing the best policy for each state s,
    of the form {s : s_policy}
    '''
    bestPolicies = {s : None for s in mdp.states}
    for state in mdp.states:
        if state not in mdp.terminal_states:
            possibleActions = mdp.actions(state)
            # expected utility of each available action, in the same order
            possibleUtilities = [mdp.expected_utility(mdp.transition(state, action), utility)
                                 for action in possibleActions]
            # index() returns the first maximum, so ties go to the earlier action
            indexMaxUtil = possibleUtilities.index(max(possibleUtilities))
            bestPolicies[state] = possibleActions[indexMaxUtil]

    return bestPolicies
    

### (1b) tests

In [7]:
## BEGIN TESTS
utility = value_iteration(mdp_p1, tol=1e-6)
policy = find_policy(mdp_p1, utility)

util1 = utility[(1,5)]
assert (util1 == 2), "Expected utility of 2, your code returned %f" % util1

util2 = utility[(6,1)]
assert (util2 == -5), "Expected utility of -5, your code returned %f" % util2

util3 = round(utility[(2,5)],2)
assert (util3 == 1.74), "Expected utility of 1.74, your code returned %f" % util3

util4 = round(utility[(5,3)],2)
assert (util4 == -1.39), "Expected utility of -1.39, your code returned %f" % util4

policy1 = policy[(2,4)]
assert (policy1 == 'W'), "Expected policy is 'W', your code returned %s" % policy1

policy2 = policy[(1,1)]
assert (policy2 == 'N'), "Expected policy is 'N', your code returned %s" % policy2

print("Passed test cases: 5 points")
## END TESTS

Passed test cases: 5 points


<a id='p1c'></a>
[Back to top](#top)

## (1c) - 5 points

If we enter the room at $s_0$, what is the optimal path for us to follow? Complete the following function to generate the sequence of states along the path. If we start in state $s_0$, then your output should be in the form $[s_0, s_1, s_2, ... , s_{term}]$ where $s_{term}$ is a terminal state. Set your tolerance for value iteration to be $10^{-6}$.

Additionally, create a graphic to illustrate this policy pathway, either by generating a plot in Python or by uploading a hand-drawn image and including a link to it below.
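
Before drawing a graphic, a path can be sanity-checked with a quick text rendering of the room. The sketch below is only an illustration: `render_path` is a hypothetical helper, not part of the assignment, and the terminal states and example path are copied from the cells in this notebook.

```python
TERMINAL = {(1, 3), (1, 4), (1, 5), (2, 1), (3, 1), (3, 4), (3, 5),
            (4, 3), (4, 4), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}

def render_path(path, terminal_states, ncol=6, nrow=5):
    """Return a text grid with y=nrow on the top row.

    Cells on the path show their step number, other terminal
    states show 'T', and all remaining cells show '.'.
    """
    step_of = {state: i for i, state in enumerate(path)}
    lines = []
    for y in range(nrow, 0, -1):
        cells = []
        for x in range(1, ncol + 1):
            if (x, y) in step_of:
                cells.append(f"{step_of[(x, y)]:>3}")
            elif (x, y) in terminal_states:
                cells.append("  T")
            else:
                cells.append("  .")
        lines.append("".join(cells))
    return "\n".join(lines)

path = [(5, 3), (5, 2), (4, 2), (3, 2), (2, 2), (2, 3), (2, 4), (1, 4)]
print(render_path(path, TERMINAL))
```

Step 0 marks the starting state, and the highest step number marks the terminal state where the path ends.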

In [8]:
def find_optimal_path(mdp, state):
    '''Generate list of states visited along the optimal path given an MDP
    instance and the starting state
    '''
    utility = value_iteration(mdp, tol=1e-6)
    policy = find_policy(mdp, utility)
    optimalPath =[]
    currentState = state
    while (currentState not in mdp.terminal_states):
        bestPolicy = policy[currentState]
        optimalPath.append(currentState)
        currentState =mdp.result(currentState, bestPolicy)
    optimalPath.append(currentState)#add terminal state
    return optimalPath



In [9]:
path1 = find_optimal_path(mdp_p1, (5,3))
path1

[(5, 3), (5, 2), (4, 2), (3, 2), (2, 2), (2, 3), (2, 4), (1, 4)]

Put a link to your graphic below for the path from (5,3). You can find an [example image here](http://www.cs.colorado.edu/~tonyewong/home/resources/hw06_mdp_path.png) with the optimal path starting from (5,1). **Please include a link rather than attaching a file**. This can be a link to your file in Google Drive, with the permissions set to public. The graders will not ask for permissions! The syntax for including a link in markdown is `[link text](url.com/to/your/image)`

[Link to Optimal Path From (5,3)](https://drive.google.com/file/d/1dcGCc7FghJAyEWyM4Sn7BMH_9RRI1LPU/view?usp=sharing)

### (1c) tests

In [10]:
## BEGIN TESTS
path1 = find_optimal_path(mdp_p1, (5,3))
assert (path1 == [(5, 3), (5, 2), (4, 2), (3, 2), (2, 2), (2, 3), (2, 4), (1, 4)]), "The optimal path is [(5, 3), (5, 2), (4, 2), (3, 2), (2, 2), (2, 3), (2, 4), (1, 4)], your code generated %s" % path1

path2 = find_optimal_path(mdp_p1, (5,1))
assert (path2 == [(5, 1), (4, 1), (4, 2), (3, 2), (2, 2), (2, 3), (2, 4), (1, 4)]), "The optimal path is [(5, 1), (4, 1), (4, 2), (3, 2), (2, 2), (2, 3), (2, 4), (1, 4)], your code generated %s" % path2

path3 = find_optimal_path(mdp_p1, (1,1))
assert (path3 == [(1, 1), (1, 2), (2, 2), (2, 3), (2, 4), (1, 4)]), "The optimal path is [(1, 1), (1, 2), (2, 2), (2, 3), (2, 4), (1, 4)], your code generated %s" % path3

print("Passed tests: 3 points")
## END TESTS

Passed tests: 3 points


<a id='p1d'></a>
[Back to top](#top)

## (1d) - 5 points

From (3,2) the optimal move is to walk West. If we are trying to go talk to our friends in the Northwest corner, why would we rather do this than walk North first, then West?

In the context of the MDP, it is because there is a risk of a negative reward if we take the action North. In the context of the problem, there is a person we do not know to the North, and if we end up there we would be forced to have an anxiety-filled conversation.
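
One way to make this risk concrete is to count the negative-reward terminal states adjacent to each candidate cell, since any of them can be entered by a slip. The sketch below is a hypothetical helper (`risky_neighbours` is not part of the assignment); the terminal rewards are copied from the test cell for the problem-statement grid.

```python
TERMINAL = {(1, 3): -1, (1, 4): 2, (1, 5): 2, (2, 1): -1, (3, 1): -1,
            (3, 4): -1, (3, 5): 1, (4, 3): -1, (4, 4): -1,
            (6, 1): -5, (6, 2): -5, (6, 3): -5, (6, 4): -5, (6, 5): -5}

def risky_neighbours(state, terminal=TERMINAL):
    """List the adjacent cells that are terminal states with negative reward."""
    x, y = state
    adjacent = [(x, y + 1), (x, y - 1), (x + 1, y), (x - 1, y)]
    # Cells outside the grid are not in the dictionary, so they count as 0
    return [s for s in adjacent if terminal.get(s, 0) < 0]

print(risky_neighbours((3, 3)))  # cell reached by walking North: [(3, 4), (4, 3)]
print(risky_neighbours((2, 2)))  # cell reached by walking West:  [(2, 1)]
```

Walking North first leaves us next to two unwanted conversations, while walking West first leaves us next to only one.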

<a id='p1e'></a>
[Back to top](#top)

## (1e) - 5 points

How painfully awkward do you need to set the default reward for non-terminal states before the optimal move at (5,1) becomes jumping off the balcony immediately and running away? Implement the following function which returns the reward where the policy for (5,1) is to jump off the balcony.

<img src="http://www.cs.colorado.edu/~tonyewong/home/resources/hw06_mdp.png" style="width: 400px;"/>

In [11]:
def find_non_terminal_reward():
    nrow = 5
    ncol = 6
    default_reward = -0.01
    discount = 0.99
    terminal = {(1,3):-1, (1,4):2, (1,5):2, (2,1):-1, (3,1):-1, (3,4):-1, (3,5):1,
            (4,3):-1, (4,4):-1, (6,1):-5, (6,2):-5, (6,3):-5, (6,4):-5, (6,5):-5}
    
    for reward in np.arange(-0.01, -3, -0.01):
        mdp = MDP(nrow, ncol, terminal, reward, discount)
        path1 = find_optimal_path(mdp, (5,1))
        #print(reward,path1)
        if(path1 == [(5, 1), (6, 1)]):
            #print(reward)
            return reward

### (1e) tests

In [12]:
reward1 = find_non_terminal_reward()
assert (reward1 == -2.09), "The expected reward is -2.09, your code returned %f" % reward1

print("Test passed: 5 points")

Test passed: 5 points


<a id='p1f'></a>
[Back to top](#top)

## (1f) - 5 points

In **1e** we assumed a certain level of loss (negative reward) just for being present.  But a more realistic approach might be to instead change the reward structure for the terminal states. Consider the terminal states with -1 reward in the default model. Let $R^*$ denote the reward associated with these states. How low does $R^*$ need to be in order for us to immediately jump off the balcony and run away? Use the default non-terminal state reward of -0.01. Implement the following function to return the value of $R^*$ which leads to a policy of jumping off the balcony at (5,1). Write a few sentences interpreting your result.

In [13]:
def find_terminal_reward():
    nrow = 5
    ncol = 6
    default_reward = -0.01
    discount = 0.99
    
    for rstar in np.arange(-6, -12, -0.01):
        # TODO:
        # 1. set the reward of the terminal nodes 
        
        # 2. define MDP with appropriate parameters
        # 3. find policy for state (5,1)
        # 4. return R* if the policy for (5,1) is to jump off the balcony
        terminal = {(1,3):rstar, (1,4):2, (1,5):2, (2,1):rstar, (3,1):rstar, (3,4):rstar, (3,5):1,
            (4,3):rstar, (4,4):rstar, (6,1):-5, (6,2):-5, (6,3):-5, (6,4):-5, (6,5):-5}
        mdp = MDP(nrow, ncol, terminal, default_reward, discount)
        path1 = find_optimal_path(mdp, (5,1))
        #print(reward,path1)
        if(path1 == [(5, 1), (6, 1)]):
            #print(reward)
            return rstar


In [14]:
reward1 = round(find_terminal_reward(),2)
reward1

-11.39

### (1f) tests

In [15]:
reward1 = round(find_terminal_reward(),2)
assert (reward1 == -11.39), "Expected reward is -11.39, your code returned %f" % reward1

print("Passed test: 3 points")

Passed test: 3 points


Write a few sentences with your interpretation here:

A reward of -11.39 is very negative compared to the -5 for jumping off the balcony. In the context of the problem, one would need crippling anxiety to assign such a negative reward to talking to strangers. A reward that negative would make more sense for a particular person you specifically do not want to see, such as an ex. In terms of the MDP, the optimal policy changed once the risk of following the optimal path to a positive-reward state was no longer worth it: each move has a 0.4 probability of going off course, so running into one of these anxiety-inducing partygoers becomes more dangerous than simply taking the balcony.

<a id='p1g'></a>
[Back to top](#top)

## (1g) - 5 points

Given the problem context, write a few sentences about why this is or is not an appropriate transition model. Include an interpretation of the terminal states.

I think the transition model is appropriate, as there are a lot of moving parts at a party. When setting out to move across a room during a party, I believe the 0.4 probability of not making it to the intended next state is justified. At a party, you really cannot control a lot. There is a chance you could run into people you know, or even if you do not know them, they could know you. If one is of legal age, inebriation could also account for part of why we did not move to the intended state. I think the terminal states that represent running into someone you do not want to talk to have a well-chosen reward of -1. It would not be the end of the world to have a two-minute, slightly awkward conversation. The positive terminal states also make sense, but I do not understand why one of my friends is only +1. Do I like them less? The terminal states for jumping off the balcony, I believe, have too small a negative reward. Personally, I think jumping off the balcony should have a reward of -15 or worse, as it might have people questioning one's sanity.

---
<a id='p2'></a>
[Back to top](#top)

# Problem 2: Define your very own MDP

For this problem, you do not need to write any code, but rather communicate your ideas clearly using complete sentences and descriptions of the concepts the questions ask about. You can, of course, include some pseudocode if it helps, but that is not strictly necessary.


<a id='p2a'></a>

## (2a) - 4 points

Describe something you think would be interesting to model using a Markov decision process.  Be **creative** - do not use any examples from your homework, class, or the textbook, and if you are working with other students, please **come up with your own example**. There are so, SO many possible answers!

Imagine that you heard there would be a great snowfall this Saturday, so you set off to make the biggest snowball. On that Saturday you grab your outdoor heater to keep yourself warm; however, it is tethered by an extension cord for power. So you decide it would be best to make trips back to the heater to moderate your temperature. Let's use a Markov decision process to model it.

<a id='p2b'></a>

## (2b) - 4 points

What are the states associated with your MDP? Include a discussion of terminal/non-terminal states.

To make the biggest snowball you will have to be outside, so you will have to monitor your temperature. We can define the states you can be in as follows: {'Cold', 'Warm', 'Frozen', 'Toasty'}. There is one terminal state, 'Frozen'. In this state, you were cold for too long and passed out from it. Embarrassingly, you had to be rescued by the fire department and given hot cocoa. Hence, you could not finish making the biggest snowball. The states 'Cold', 'Warm', and 'Toasty' are non-terminal. You are 'Cold' when you are adding to the giant snowball and are not by the heater (obviously the heater would melt the snowball!). If you sit by the heater you are just 'Warm'. If you sit by the heater for too long, you become too warm. You lose inspiration, so you take a 30-minute break inside your house, which makes you 'Toasty'.

<a id='p2c'></a>

## (2c) - 4 points

What is the reward structure associated with your MDP?  Include a discussion of terminal/non-terminal states.

For this MDP, the reward for being in the state 'Cold' is +5, as you must be by the snowball, which means the only action that allows us to stay in or go to the state 'Cold' is adding a piece of snow. R('Warm') will be -1, as we must be by the heater, hence no progress was made by the last action. As discussed in 2b, the state 'Frozen' is reached if you pass out from hypothermia and have to be rescued by the fire department. Since you feel this is an embarrassing ordeal and it is a terminal state (we cannot finish the snowball), the reward is very negative: R('Frozen') = -100. For the state 'Toasty', R('Toasty') = -15, as it is our own laziness and lack of determination that is slowing us down.

<a id='p2d'></a>

## (2d) - 4 points

What are the actions and transition model associated with your MDP?

There are two actions we can take: {'Add Snow to Snowball', 'Rest by Heater'}. For our transition function, we should first define which states can be reached from which other states. You can only become 'Frozen' from the state 'Cold'. I wanted to state this first because any transition between non-adjacent states is assigned a probability of 0.

Now for the transition model: we move from a state $s$ to the intended state $s'$ with probability 0.7. The other 0.3 is distributed among the other three states. However, the 0.3 is split unevenly between the non-terminal states and the one terminal state: the non-terminal states split 0.8 of the 0.3, and the other 0.2 of the 0.3 goes to the terminal state 'Frozen'. This assumes all states are accessible from the current state. If the terminal state is not accessible from the current state, then the 0.3 is distributed between the other two non-terminal states.
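
The split described above can be worked through numerically. The sketch below is one reading of the description, not a definitive implementation: the function name `snowball_transition` is hypothetical, the intended state is assumed to be non-terminal, and the 0.8 share is assumed to be divided evenly between the two other non-terminal states.

```python
def snowball_transition(intended, frozen_reachable):
    """Return {next_state: probability} for an action aiming at `intended`.

    Assumptions (labeled here because the text leaves them open):
    `intended` is one of the non-terminal states, and the non-terminal
    share of the leftover 0.3 is split evenly.
    """
    non_terminal = ['Cold', 'Warm', 'Toasty']
    others = [s for s in non_terminal if s != intended]
    dist = {intended: 0.7}
    if frozen_reachable:
        # 0.8 of the leftover 0.3 goes to the other non-terminal states
        for s in others:
            dist[s] = 0.3 * 0.8 / len(others)
        # the remaining 0.2 of the 0.3 goes to the terminal state
        dist['Frozen'] = 0.3 * 0.2
    else:
        # 'Frozen' cannot be reached, so the 0.3 is split between the others
        for s in others:
            dist[s] = 0.3 / len(others)
    return dist

# Resting by the heater while 'Cold' (the only state that can reach 'Frozen')
from_cold = snowball_transition('Warm', frozen_reachable=True)
# Adding snow while 'Warm', where 'Frozen' is not reachable
from_warm = snowball_transition('Cold', frozen_reachable=False)
```

Under these assumptions, the first case gives roughly 0.7 to 'Warm', 0.12 each to 'Cold' and 'Toasty', and 0.06 to 'Frozen'; the second gives 0.7 to 'Cold' and 0.15 each to 'Warm' and 'Toasty'.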

<a id='p2e'></a>

## (2e) - 4 points

Interpret what an optimal policy represents in the context of your particular MDP.

In terms of the MDP, the optimal policy would be to add snow when we are in the 'Warm' state, as we would most likely end up in the 'Cold' state. From the state 'Cold', it is best to rest by the heater, as this action moves us away from the huge negative reward of the 'Frozen' state. From the state 'Toasty', the optimal policy would be to add snow, as there is a positive reward for building the snowball, whereas resting by the heater ('Warm') would produce a small negative reward instead. Again, the goal is to maximize the size of the snowball and minimize the chance of getting into the 'Frozen' terminal state.