# Markov Decision Process

Markov decision processes (MDPs) provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. An MDP is defined by:

1. States: all the possible states that the agent could be in.
2. Actions: the actions available in a given state to reach another state.
3. Rewards: the reward granted for each transition, based on the action taken to reach a state.
4. Transition: the probability of reaching state s' from state s if action a is performed.

The outcome of an action depends only on the current state and the chosen action, not on previous transitions (the Markov property).
The solution to an MDP is a policy that specifies the action to take in each state.

Rewards are a list of tuples, where each tuple consists of a state and its corresponding reward. Rewards for terminal states are defined in the combo.csv file.


In [9]:
import pandas as pd
combo = pd.read_csv("combo.csv")
combo

Unnamed: 0,Combo,Reward
0,1,-2
1,1 2,14
2,1 3,-4
3,1 4,14
4,1 5,14
5,1 2 3,8
6,1 2 4,10
7,1 2 5,20
8,1 3 4,8
9,1 3 5,8


getR() matches the input sequence against the combinations in the CSV file and returns the corresponding reward if the sequence is a terminal state (step 6). A terminal state in which nothing was bought returns -10. For any non-terminal state the reward is 0.

In [10]:
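def getR(sequence):
    # sequence is a state: (tuple of items bought, current step)
    items, step = sequence
    if step != 6:
        return 0      # only terminal states (step 6) carry a reward
    if not items:
        return -10    # terminal state with nothing bought
    # Combos are stored as space-separated item numbers, e.g. "1 2 5"
    plate_str = ' '.join(str(e) for e in items)
    data_I = combo[combo['Combo'] == plate_str]
    # Take the single matching row; calling int() on a whole Series is deprecated
    return int(data_I['Reward'].iloc[0])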
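
To make the lookup rule concrete, here is a minimal sketch of the same reward logic using a plain dictionary in place of the `combo` DataFrame. The names `reward_for` and `sample_rewards` are hypothetical and not part of the notebook; the three entries are copied from the rows of combo.csv shown above, and the -10 for an empty selection mirrors getR().

```python
# Hypothetical stand-in for combo.csv: three rows copied from the table above.
sample_rewards = {"1": -2, "1 2": 14, "1 2 5": 20}

def reward_for(state, table, terminal_step=6):
    """Mirror getR(): a reward is paid only in terminal states."""
    items, step = state
    if step != terminal_step:
        return 0           # non-terminal state
    if not items:
        return -10         # terminal state, nothing bought
    key = " ".join(str(e) for e in items)
    return table[key]

print(reward_for(((1, 2, 5), 6), sample_rewards))  # terminal, combo "1 2 5"
print(reward_for(((1, 2), 3), sample_rewards))     # non-terminal
print(reward_for(((), 6), sample_rewards))         # terminal, nothing bought
```

A dictionary lookup raises `KeyError` for a combination that is missing from the table, which makes gaps in the reward file visible early.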

The Value() function computes one part of the Bellman equation: the sum, over every state s' reachable from state s under action a, of the probability of reaching s' multiplied by the utility of s'.

# Value Iteration

In [11]:
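def Value(s,a,U):
    # Expected utility of the successor of state s when action a is taken.
    # s is (items, step); every action moves the agent from step k to step k+1.
    stay = (s[0], s[1]+1)              # items unchanged
    grow = (s[0] + (s[1],), s[1]+1)    # item number s[1] added to the items
    if(a=='dontbuy'):
        return 1.0 * U[stay]           # 'dontbuy' keeps the items with certainty
    pr = prob[s[1]]                    # probability that 'buy' adds the item
    return pr * U[grow] + (1-pr) * U[stay]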
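
The transition model inside Value() can also be written out explicitly. The sketch below is an illustration rather than the notebook's own code: `transitions` and `expected_utility` are hypothetical helper names, the `prob` values are copied from the problem definition later in the notebook, and the utilities in `U` are made-up numbers chosen only to show the arithmetic.

```python
def transitions(s, a, prob):
    """Return [(probability, next_state), ...] for action a in state s."""
    items, step = s
    stay = (items, step + 1)              # items unchanged
    grow = (items + (step,), step + 1)    # item number `step` is added
    if a == 'dontbuy':
        return [(1.0, stay)]
    pr = prob[step]
    return [(pr, grow), (1 - pr, stay)]

def expected_utility(s, a, U, prob):
    """Sum of P(s' | s, a) * U(s') over the successors of s."""
    return sum(p * U[s2] for p, s2 in transitions(s, a, prob))

prob = {1: 0.6, 2: 0.6, 3: 0.2, 4: 0.3, 5: 0.7}
U = {((1,), 3): 1.0, ((1, 2), 3): 5.0}   # made-up utilities for illustration
print(expected_utility(((1,), 2), 'buy', U, prob))      # 0.6 * 5.0 + 0.4 * 1.0
print(expected_utility(((1,), 2), 'dontbuy', U, prob))  # 1.0 * 1.0
```

Separating the transition list from the weighted sum keeps the probabilities easy to inspect, for example to check that they add up to 1 for every state and action.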

Value iteration is used to calculate the optimal policy. It applies the Bellman equation repeatedly to compute the utility of every state. According to the Bellman equation, the utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action.

The value_iteration() function calculates the utility of all states. It stops when delta, the largest change in any state's utility during one sweep, is at most epsilon * (1 - gamma) / gamma.
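
In symbols, the update performed on each sweep can be sketched as follows. The notation is the standard textbook one and is an assumption here, not something the notebook defines: $R(s)$ corresponds to `getR(s)`, the inner sum to `Value(s, a, U)`, and $\gamma$ to `gamma`.

```latex
U_{i+1}(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_i(s')

\delta = \max_{s} \left| U_{i+1}(s) - U_i(s) \right|,
\qquad \text{stop when } \delta \le \frac{\epsilon\,(1 - \gamma)}{\gamma}
```

For example, with epsilon = 0.01 and gamma = 0.4 the threshold is 0.01 * 0.6 / 0.4 = 0.015.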

In [12]:
def value_iteration(gamma, epsilon=0.01):
    "Solving an MDP by value iteration. [Fig. 17.4]"
    U1 = dict([(s, 0) for s in states])
    count = 0
    while True:
        U = U1.copy()
        delta = 0
        for s in states:
            l = []
            if(s[1]!=6):
                for a in actions:
                    l.append(Value(s,a,U))
            if len(l) > 0:
                m = max(l)
            else:
                m = 0
            U1[s] = getR(s) + gamma*m
            U1[s] = round(U1[s],4)
            delta = max(delta, abs(U1[s] - U[s]))
        count += 1
        print(U)
        if delta <= epsilon * (1 - gamma) / gamma:
            print("Number of iterations: ", count-1)
            return U

The optimal policy is obtained by choosing, for each non-terminal state, the action with the highest expected utility under the utility values from the last iteration.

In [13]:
def optimal_policy(U):
    pi = {}
    for s in states:
        if(s[1]==6):
            pi[s]='Terminal State'
        else:
            l = []
            for a in actions:
                l.append((a,Value(s,a,U)))
            if len(l) > 0 :
                pi[s] = max(l,key=lambda item:item[1])[0]
    return pi
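
As a small worked example of this greedy choice, consider a hypothetical one-item version of the problem in which step 2 is terminal. This reduced problem is an assumption made for illustration only; the terminal utilities reuse the -10 (nothing bought) and -2 (combo "1") rewards from above, and the purchase probability 0.6 and discount 0.4 match the values used later in the notebook.

```python
gamma = 0.4
prob = {1: 0.6}
# Terminal utilities of the hypothetical one-item problem (step 2 is terminal).
U = {((), 2): -10, ((1,), 2): -2}

# Expected utility of each action in the start state ((), 1).
q = {
    'buy': prob[1] * U[((1,), 2)] + (1 - prob[1]) * U[((), 2)],
    'dontbuy': 1.0 * U[((), 2)],
}
best_action = max(q, key=q.get)   # greedy choice, as in optimal_policy()
U_start = 0 + gamma * q[best_action]  # start state has reward 0

print(q)
print(best_action, U_start)
```

Here 'buy' has expected utility 0.6 * (-2) + 0.4 * (-10) = -5.2, which beats the -10 of 'dontbuy', so the greedy policy buys.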

# Policy Iteration

Policy iteration starts from an arbitrary policy and alternates policy evaluation and policy improvement until the policy no longer changes, at which point it is optimal. Policy evaluation calculates the utility values under the current policy.
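
The two alternating steps can be sketched in standard notation (an assumption, since the notebook itself uses no symbols): $\pi(s)$ is the action the current policy assigns to state $s$, $R(s)$ is the reward, and $\gamma$ is the discount factor.

```latex
U(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')
\qquad \text{(policy evaluation, repeated for } k \text{ sweeps)}

\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')
\qquad \text{(policy improvement)}
```

In the code that follows, the evaluation step updates U in place during each sweep, and the improvement step changes the action for a state only when the best action is strictly better than the current one.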

In [22]:
def policy_evaluation(policy,U,gamma,k=20):
    for i in range(k):
        for s in states:
            if(s[1]==6):
                U[s] = getR(s)
            else:
                U[s] = getR(s) + gamma*Value(s,policy[s],U)             
    #print(U)
    return U
    
def policy_iteration(policy,gamma):
    U = dict([(s, 0) for s in states])
    while True:
        # Evaluate the current policy, then try to improve it state by state
        U = policy_evaluation(policy,U,gamma)
        unchanged = True
        for s in states:
            if(s[1]==6):
                continue    # terminal states have no action to improve
            l = [(a,Value(s,a,U)) for a in actions]
            best_a, m = max(l,key=lambda item:item[1])
            # Switch only when the best action is strictly better
            if m > Value(s,policy[s],U):
                policy[s] = best_a
                unchanged = False
        if unchanged:
            print("Utility: ",U)
            return policy

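The MDP problem is defined with global variables. Each state is a pair (items, step): the tuple of items bought so far and the current step, where step 6 is terminal. prob[k] is the probability that the 'buy' action at step k adds item k.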

In [24]:
prob = dict()
states = [((),1), ((),2), ((),3), ((),4), ((),5), ((),6), ((1,),2), ((1,),3), ((1,),4), ((1,),5), ((1,),6), ((2,),3), ((2,),4), ((2,),5), ((2,),6), ((3,),4), ((3,),5), ((3,),6), ((4,),5), ((4,),6), ((5,),6),
 ((1,2),3), ((1,2),4), ((1,2),5), ((1,2),6), ((1,3),4), ((1,3),5), ((1,3),6), ((1,4),5), ((1,4),6), ((1,5),6), ((1,2,3),4), ((1,2,3),5), ((1,2,3),6), ((1,2,4),5), ((1,2,4),6), ((1,2,5),6), ((1,3,4),5), ((1,3,4),6), ((1,3,5),6), ((1,4,5),6), ((1,2,3,4),5), ((1,2,3,4),6), ((1,2,3,5),6), ((1,2,4,5),6), ((1,3,4,5),6),  
 ((1,2,3,4,5),6), ((2,3),4), ((2,3),5), ((2,3),6), ((2,4),5), ((2,4),6), ((2,5),6), ((2,3,4),5), ((2,3,4),6), ((2,3,5),6), ((2,4,5),6), ((2,3,4,5),6),
 ((3,4),5), ((3,4),6), ((3,5),6), ((3,4,5),6),
 ((4,5),6)]
print("States: ", states)
prob={1:0.6,2:0.6,3:0.2,4:0.3,5:0.7}
print(prob[1])
actions = ['buy','dontbuy']
print("Actions: ", actions)

States:  [((), 1), ((), 2), ((), 3), ((), 4), ((), 5), ((), 6), ((1,), 2), ((1,), 3), ((1,), 4), ((1,), 5), ((1,), 6), ((2,), 3), ((2,), 4), ((2,), 5), ((2,), 6), ((3,), 4), ((3,), 5), ((3,), 6), ((4,), 5), ((4,), 6), ((5,), 6), ((1, 2), 3), ((1, 2), 4), ((1, 2), 5), ((1, 2), 6), ((1, 3), 4), ((1, 3), 5), ((1, 3), 6), ((1, 4), 5), ((1, 4), 6), ((1, 5), 6), ((1, 2, 3), 4), ((1, 2, 3), 5), ((1, 2, 3), 6), ((1, 2, 4), 5), ((1, 2, 4), 6), ((1, 2, 5), 6), ((1, 3, 4), 5), ((1, 3, 4), 6), ((1, 3, 5), 6), ((1, 4, 5), 6), ((1, 2, 3, 4), 5), ((1, 2, 3, 4), 6), ((1, 2, 3, 5), 6), ((1, 2, 4, 5), 6), ((1, 3, 4, 5), 6), ((1, 2, 3, 4, 5), 6), ((2, 3), 4), ((2, 3), 5), ((2, 3), 6), ((2, 4), 5), ((2, 4), 6), ((2, 5), 6), ((2, 3, 4), 5), ((2, 3, 4), 6), ((2, 3, 5), 6), ((2, 4, 5), 6), ((2, 3, 4, 5), 6), ((3, 4), 5), ((3, 4), 6), ((3, 5), 6), ((3, 4, 5), 6), ((4, 5), 6)]
0.6
Actions:  ['buy', 'dontbuy']
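
The hand-written list above follows a regular pattern: at step k the items can be any subset of {1, ..., k-1}. As a hedged alternative, the states could be generated instead of typed out. `build_states` is a hypothetical helper, not part of the notebook, and it produces the states in a different order than the list above.

```python
from itertools import combinations

def build_states(n_items=5):
    """Enumerate every (items, step) pair with items drawn from 1..step-1."""
    result = []
    for step in range(1, n_items + 2):          # steps 1..n_items+1 (last is terminal)
        for size in range(step):                # subset sizes 0..step-1
            for items in combinations(range(1, step), size):
                result.append((items, step))
    return result

generated = build_states()
print(len(generated))
print(generated[:4])
```

Generating the list removes the risk of a typo in a single tuple and makes it easy to change the number of items.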


In [34]:
gamma = 0.4
print("Discounting: ", gamma)

print("\n Value Iteration")
utility = value_iteration(gamma)
print("Utility: ",utility)

best_policy = optimal_policy(utility)
print("Optimal Policy: ", best_policy)

print("\n Policy iteration")
policy = {((), 1): 'dontbuy', ((), 2): 'dontbuy', ((), 3): 'dontbuy', ((), 4): 'dontbuy', ((), 5): 'dontbuy', ((), 6): 'Terminal State', ((1,), 2): 'dontbuy', ((1,), 3): 'buy', ((1,), 4): 'dontbuy', ((1,), 5): 'dontbuy', ((1,), 6): 'Terminal State', ((2,), 3): 'dontbuy', ((2,), 4): 'dontbuy', ((2,), 5): 'dontbuy', ((2,), 6): 'Terminal State', ((3,), 4): 'dontbuy', ((3,), 5): 'dontbuy', ((3,), 6): 'Terminal State', ((4,), 5): 'dontbuy', ((4,), 6): 'Terminal State', ((5,), 6): 'Terminal State', ((1, 2), 3): 'buy', ((1, 2), 4): 'buy', ((1, 2), 5): 'dontbuy', ((1, 2), 6): 'Terminal State', ((1, 3), 4): 'dontbuy', ((1, 3), 5): 'dontbuy', ((1, 3), 6): 'Terminal State', ((1, 4), 5): 'buy', ((1, 4), 6): 'Terminal State', ((1, 5), 6): 'Terminal State', ((1, 2, 3), 4): 'buy', ((1, 2, 3), 5): 'buy', ((1, 2, 3), 6): 'Terminal State', ((1, 2, 4), 5): 'buy', ((1, 2, 4), 6): 'Terminal State', ((1, 2, 5), 6): 'Terminal State', ((1, 3, 4), 5): 'buy', ((1, 3, 4), 6): 'Terminal State', ((1, 3, 5), 6): 'Terminal State', ((1, 4, 5), 6): 'Terminal State', ((1, 2, 3, 4), 5): 'buy', ((1, 2, 3, 4), 6): 'Terminal State', ((1, 2, 3, 5), 6): 'Terminal State', ((1, 2, 4, 5), 6): 'Terminal State', ((1, 3, 4, 5), 6): 'Terminal State', ((1, 2, 3, 4, 5), 6): 'Terminal State', ((2, 3), 4): 'buy', ((2, 3), 5): 'dontbuy', ((2, 3), 6): 'Terminal State', ((2, 4), 5): 'buy', ((2, 4), 6): 'Terminal State', ((2, 5), 6): 'Terminal State', ((2, 3, 4), 5): 'buy', ((2, 3, 4), 6): 'Terminal State', ((2, 3, 5), 6): 'Terminal State', ((2, 4, 5), 6): 'Terminal State', ((2, 3, 4, 5), 6): 'Terminal State', ((3, 4), 5): 'buy', ((3, 4), 6): 'Terminal State', ((3, 5), 6): 'Terminal State', ((3, 4, 5), 6): 'Terminal State', ((4, 5), 6): 'Terminal State'}
print("Given Policy: ",policy)
U = dict([(s, 0) for s in states])
print("\n Policy evaluation: ",policy_evaluation(policy,U,0.8))
p = policy_iteration(policy, gamma)
print("Optimal Policy: ",p)

Discounting:  0.4

 Value Iteration
{((), 1): 0, ((), 2): 0, ((), 3): 0, ((), 4): 0, ((), 5): 0, ((), 6): 0, ((1,), 2): 0, ((1,), 3): 0, ((1,), 4): 0, ((1,), 5): 0, ((1,), 6): 0, ((2,), 3): 0, ((2,), 4): 0, ((2,), 5): 0, ((2,), 6): 0, ((3,), 4): 0, ((3,), 5): 0, ((3,), 6): 0, ((4,), 5): 0, ((4,), 6): 0, ((5,), 6): 0, ((1, 2), 3): 0, ((1, 2), 4): 0, ((1, 2), 5): 0, ((1, 2), 6): 0, ((1, 3), 4): 0, ((1, 3), 5): 0, ((1, 3), 6): 0, ((1, 4), 5): 0, ((1, 4), 6): 0, ((1, 5), 6): 0, ((1, 2, 3), 4): 0, ((1, 2, 3), 5): 0, ((1, 2, 3), 6): 0, ((1, 2, 4), 5): 0, ((1, 2, 4), 6): 0, ((1, 2, 5), 6): 0, ((1, 3, 4), 5): 0, ((1, 3, 4), 6): 0, ((1, 3, 5), 6): 0, ((1, 4, 5), 6): 0, ((1, 2, 3, 4), 5): 0, ((1, 2, 3, 4), 6): 0, ((1, 2, 3, 5), 6): 0, ((1, 2, 4, 5), 6): 0, ((1, 3, 4, 5), 6): 0, ((1, 2, 3, 4, 5), 6): 0, ((2, 3), 4): 0, ((2, 3), 5): 0, ((2, 3), 6): 0, ((2, 4), 5): 0, ((2, 4), 6): 0, ((2, 5), 6): 0, ((2, 3, 4), 5): 0, ((2, 3, 4), 6): 0, ((2, 3, 5), 6): 0, ((2, 4, 5), 6): 0, ((2, 3, 4, 5), 6): 0, ((


 Policy evaluation:  {((), 1): -3.276800000000001, ((), 2): -4.096000000000001, ((), 3): -5.120000000000001, ((), 4): -6.4, ((), 5): -8.0, ((), 6): -10, ((1,), 2): -0.9830400000000005, ((1,), 3): -1.2288000000000006, ((1,), 4): -1.2800000000000002, ((1,), 5): -1.6, ((1,), 6): -2, ((2,), 3): -1.0240000000000002, ((2,), 4): -1.2800000000000002, ((2,), 5): -1.6, ((2,), 6): -2, ((3,), 4): -1.2800000000000002, ((3,), 5): -1.6, ((3,), 6): -2, ((4,), 5): -1.6, ((4,), 6): -2, ((5,), 6): -2, ((1, 2), 3): 5.470208, ((1, 2), 4): 7.6544, ((1, 2), 5): 11.200000000000001, ((1, 2), 6): 14, ((1, 3), 4): -2.5600000000000005, ((1, 3), 5): -3.2, ((1, 3), 6): -4, ((1, 4), 5): 8.96, ((1, 4), 6): 14, ((1, 5), 6): 14, ((1, 2, 3), 4): 3.5711999999999997, ((1, 2, 3), 5): 5.28, ((1, 2, 3), 6): 8, ((1, 2, 4), 5): 5.76, ((1, 2, 4), 6): 10, ((1, 2, 5), 6): 20, ((1, 3, 4), 5): 5.28, ((1, 3, 4), 6): 8, ((1, 3, 5), 6): 8, ((1, 4, 5), 6): 10, ((1, 2, 3, 4), 5): 2.5600000000000005, ((1, 2, 3, 4), 6): 6, ((1, 2, 3, 5),