POLICY ITERATION ALGORITHM

AIM

To develop a Python program to find the optimal policy for the given MDP using the policy iteration algorithm.

PROBLEM STATEMENT

The aim of this experiment is to find the optimal policy for a given MDP using policy iteration. Policy iteration alternates between two phases: policy evaluation, which computes the state-value function of the current policy, and policy improvement, which compares the action-value function across all actions in each state and selects the greedy action to form a better policy.

POLICY ITERATION ALGORITHM

-> Step 1: Initialize a random policy for the MDP and run policy evaluation to obtain the state-value function of that policy for every state.

-> Step 2: Once policy evaluation has converged, run policy improvement to obtain the greedy policy with respect to the computed values. Repeat both steps until the previous and current policies are the same.
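
The two steps can be written as update rules. The notation below is the standard textbook form and is an assumption on my part, not taken from this repository: p(s' | s, a) and r(s, a, s') stand for the transition probabilities and rewards stored in `P`, and gamma is the discount factor. Terminal next states contribute no future value, which the code expresses by multiplying by `(not done)`.

```latex
% Step 1: iterative policy evaluation for the current policy \pi
V_{k+1}(s) = \sum_{s'} p(s' \mid s, \pi(s)) \left[ r(s, \pi(s), s') + \gamma V_k(s') \right]

% Step 2: greedy policy improvement from the converged values V^{\pi}
Q^{\pi}(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a, s') + \gamma V^{\pi}(s') \right]
\qquad
\pi'(s) = \arg\max_{a} Q^{\pi}(s, a)
```

Iteration stops when \pi'(s) = \pi(s) for every state s.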

POLICY IMPROVEMENT FUNCTION

Name : NAVEEN S

Register Number : 212222240070

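import numpy as np

def policy_improvement(V, P, gamma=1.0):
    # Q[s][a]: expected return of taking action a in state s, then following V
    Q = np.zeros((len(P), len(P[0])), dtype=np.float64)
    for s in range(len(P)):
        for a in range(len(P[s])):
            for prob, next_state, reward, done in P[s][a]:
                # Terminal transitions contribute no bootstrapped value
                Q[s][a] += prob * (reward + gamma * V[next_state] * (not done))
    # Greedy policy: pick the highest-valued action in each state (built once, after the loops)
    greedy_actions = dict(enumerate(np.argmax(Q, axis=1)))
    new_pi = lambda s: greedy_actions[s]
    return new_pi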

POLICY ITERATION FUNCTION

Name : NAVEEN S

Register Number : 212222240070

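def policy_iteration(P, gamma=1.0, theta=1e-10):
    # Start from a random deterministic policy
    random_actions = np.random.choice(tuple(P[0].keys()), len(P))
    pi = lambda s: {s: a for s, a in enumerate(random_actions)}[s]
    while True:
        # Snapshot the current policy so convergence can be detected
        old_pi = {s: pi(s) for s in range(len(P))}
        # policy_evaluation is not shown in this README
        V = policy_evaluation(pi, P, gamma, theta)
        pi = policy_improvement(V, P, gamma)
        # Stop once improvement leaves the policy unchanged
        if old_pi == {s: pi(s) for s in range(len(P))}:
            break
    return V, pi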
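
The function above calls `policy_evaluation`, which this README does not list. The following is a minimal sketch of what such a function might look like, not the author's implementation. It assumes that `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples (the same layout `policy_improvement` iterates over) and that the policy is a callable mapping a state to an action. The three-state `toy_P` is a hypothetical MDP made up purely for illustration.

```python
import numpy as np

def policy_evaluation(pi, P, gamma=1.0, theta=1e-10):
    """Iteratively estimate the state-value function of policy pi (sketch)."""
    prev_V = np.zeros(len(P), dtype=np.float64)
    while True:
        V = np.zeros(len(P), dtype=np.float64)
        for s in range(len(P)):
            # Only the action chosen by the policy is evaluated
            for prob, next_state, reward, done in P[s][pi(s)]:
                V[s] += prob * (reward + gamma * prev_V[next_state] * (not done))
        # Converged when the largest change across states falls below theta
        if np.max(np.abs(prev_V - V)) < theta:
            break
        prev_V = V.copy()
    return V

# Hypothetical deterministic chain: 0 -> 1 -> 2 (terminal); action 0 = left, 1 = right
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

# Evaluate the "always move right" policy with a discount of 0.9
V_right = policy_evaluation(lambda s: 1, toy_P, gamma=0.9)
# V_right is approximately [0.9, 1.0, 0.0]
```

Each sweep builds a fresh `V` from `prev_V` rather than updating in place, which keeps the update identical to the synchronous form of iterative policy evaluation.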

OUTPUT:

1. Policy, Value function and success rate for the Adversarial Policy

2. Policy, Value function and success rate for the Improved Policy

3. Policy, Value function and success rate after policy iteration
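
The success rates listed above could be estimated by simulating episodes. The sketch below is one possible approach under stated assumptions, not necessarily how the notebook computes them: it samples transitions directly from `P` instead of stepping an environment, and the function name `success_rate`, its parameters, and the `toy_P` MDP are all hypothetical.

```python
import numpy as np

def success_rate(pi, P, goal_state, start_state=0,
                 n_episodes=1000, max_steps=200, seed=0):
    """Fraction of simulated episodes that end in goal_state (sketch)."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_episodes):
        state = start_state
        for _ in range(max_steps):
            transitions = P[state][pi(state)]
            probs = [t[0] for t in transitions]
            # Sample one outcome according to the transition probabilities
            idx = rng.choice(len(transitions), p=probs)
            _, state, _, done = transitions[idx]
            if done:
                break
        successes += int(state == goal_state)
    return successes / n_episodes

# Hypothetical deterministic chain: 0 -> 1 -> 2 (goal); action 0 = left, 1 = right
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

rate_right = success_rate(lambda s: 1, toy_P, goal_state=2, n_episodes=50)
```

The `max_steps` cap is there so that a policy which never reaches a terminal state still ends each episode and is counted as a failure.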

RESULT:

Thus, the Python program to find the optimal policy for the given MDP using the policy iteration algorithm was executed successfully.
