# RL - CIA 1

### Run 10 episodes and observe the effect of discounting by changing the values of gamma.
 
For ease of comparison, each episode's return is reported for both gamma values side by side.

**Discounting: reducing the weight of future rewards so that immediate rewards count for more.**
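
To make the definition concrete, here is a sketch of the return that the `discount` function in this notebook computes. The indexing $r_0, \dots, r_{T-1}$ for the rewards of a $T$-step episode is an assumption chosen to mirror the code, not a convention taken from the assignment. The second line covers the special case where every step earns the step reward of -1, which is what happens when no terminal state is reached.

```latex
G = \sum_{k=0}^{T-1} \gamma^{k} r_k

% Special case: r_k = -1 for every step (geometric series)
G = -\sum_{k=0}^{T-1} \gamma^{k} = -\frac{1 - \gamma^{T}}{1 - \gamma}

% With T = 20:
%   \gamma = 0.5  gives  G \approx -2.00
%   \gamma = 0.9  gives  G \approx -8.78
```

These two values agree with the returns recorded for the episodes that ran the full 20 steps, and they suggest why returns for γ = 0.5 cluster near -2.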

- With γ = 0.5, every recorded return in this experiment is below zero.
- The returns for γ = 0.5 are much less spread out than those for γ = 0.9.
- γ = 0.9 gives more weight to future rewards than γ = 0.5. When later steps bring high rewards (the +10 goal), the return is noticeably higher; when they bring low rewards (the -10 penalty or a long run of -1 steps), the return is significantly lower.
- γ = 0.5 discounts the future so heavily that late rewards barely register, so its returns stay close to -2 instead of reaching the extremes seen with γ = 0.9.
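
The spread described in the points above can be checked with a small sketch. It is illustrative rather than part of the assignment code: `discounted_return` and `spreads` are hypothetical names, and the reward sequences are copied from complete episodes in the recorded runs of this notebook.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Return the sum of gamma**k * rewards[k] over one episode."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.dot(rewards, gamma ** np.arange(len(rewards))))

# Reward sequences taken from complete episodes in the recorded output
episodes = [
    [-1] * 20,         # no terminal state reached within 20 steps
    [-1] * 11 + [10],  # goal (0, 3) reached on step 12
    [-1] * 6 + [10],   # goal (0, 3) reached on step 7
    [-1] * 7 + [-10],  # penalty state (1, 3) reached on step 8
]

spreads = {}
for gamma in (0.9, 0.5):
    returns = [discounted_return(r, gamma) for r in episodes]
    # Spread is measured here as max minus min, one simple choice among many
    spreads[gamma] = max(returns) - min(returns)
    print(f"gamma={gamma}: returns={np.round(returns, 4)}, spread={spreads[gamma]:.4f}")
```

For these four episodes the γ = 0.9 returns range from about -10 to about 0.63, while the γ = 0.5 returns all stay within 0.25 of each other.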

In [114]:
import random
import numpy as np

In [115]:
#Given Code

ROWS = 3
COLS = 4
ACTIONS = ['N', 'S', 'E', 'W']
REWARDS = { (0, 3): 10,
 (1, 3): -10 }
STEP_REWARD = -1

# Terminal states
TERMINAL_STATES = [(0, 3), (1, 3)]
ACTION_EFFECTS = {
 'N': [(-1, 0), (0, -1), (0, 1)],
 'S': [(1, 0), (0, -1), (0, 1)],
 'E': [(0, 1), (-1, 0), (1, 0)],
 'W': [(0, -1), (-1, 0), (1, 0)] 
}
PROBS = [0.8, 0.1, 0.1]

def in_bounds(state):
    """Check if the state is inside the grid."""
    r, c = state
    return 0 <= r < ROWS and 0 <= c < COLS

def move(state, action):
    """Move from current state according to stochastic transitions."""
    effects = ACTION_EFFECTS[action]
    chosen_effect = random.choices(effects, PROBS)[0]
    new_r, new_c = state[0] + chosen_effect[0], state[1] + chosen_effect[1]
    if in_bounds((new_r, new_c)):
        return (new_r, new_c)
    return state  # If out of bounds, stay in place

def get_reward(state):
    """Return the reward for a state."""
    return REWARDS.get(state, STEP_REWARD)

In [113]:

# Discounted return: sum over k of gamma**k * rews[k]
def discount(rews, gamma):
    return np.dot(rews, gamma ** np.arange(len(rews)))  # dot product of rewards with [1, gamma, gamma**2, ...]
    
# Run the experiment
def experiment():
    # Conducting 10 episodes
    for ep in range(10):
        state = (0, 0)  # every episode starts at (0, 0)
        rews = []  # reward received at each time step
        states = [state]  # states visited, including the start state

        # Step until a terminal state is reached or the 20-step limit is hit
        for step in range(20):
            action = random.choice(ACTIONS)  # pick a uniformly random action
            state = move(state, action)  # sample the next state from the stochastic transition
            reward = get_reward(state)  # reward for entering the new state
            rews.append(reward)
            states.append(state)

            # Stop the episode once a terminal state is reached
            if state in TERMINAL_STATES:
                break
    
        G_ep_9 = discount(rews, 0.9) # Calculate the Return with gamma value = 0.9
        G_ep_5 = discount(rews, 0.5) # Calculate the Return with gamma value = 0.5


        # Print the details of the episode - States visited, Rewards earned at each step, Return with gamma 0.9 and 0.5
        print(f"Episode {ep + 1}: ")
        print(f"States visited: {states}")
        print(f"Rewards: {rews}")
        print(f"Gamma value 0.9: G_{ep}: {G_ep_9}")
        print(f"Gamma value 0.5: G_{ep}: {G_ep_5}")
        print()
        

# Running the experiment

In [99]:
experiment()

Episode 1: 
States visited: [(0, 0), (1, 0), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (0, 1), (0, 0), (0, 1), (1, 1), (0, 1), (0, 0), (0, 0), (0, 0), (1, 0), (0, 0), (0, 1), (0, 0)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
Gamma value 0.9: G_0: -8.78423345409431
Gamma value 0.5: G_0: -1.9999980926513672

Episode 2: 
States visited: [(0, 0), (0, 0), (0, 0), (1, 0), (1, 0), (2, 0), (1, 0), (0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (0, 3)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 10]
Gamma value 0.9: G_1: -3.7237880781999997
Gamma value 0.5: G_1: -1.994140625

Episode 3: 
States visited: [(0, 0), (0, 0), (0, 0), (1, 0), (1, 0), (2, 0), (1, 0), (0, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (1, 1), (1, 2), (0, 2), (0, 2), (0, 2), (0, 1), (1, 1), (1, 0)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
Gamma value 0.9: G_2: -8.78423345409431
Gamma value 0.5: G_2: -1.999998092

## I wanted to see more values

In [100]:
experiment()

Episode 1: 
States visited: [(0, 0), (1, 0), (1, 1), (1, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (1, 0), (1, 1), (0, 1), (0, 1), (1, 1), (1, 2), (0, 2), (0, 3)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 10]
Gamma value 0.9: G_0: -6.293959622296319
Gamma value 0.5: G_0: -1.99981689453125

Episode 2: 
States visited: [(0, 0), (1, 0), (1, 0), (1, 0), (0, 0), (0, 1), (0, 2), (0, 3)]
Rewards: [-1, -1, -1, -1, -1, -1, 10]
Gamma value 0.9: G_1: 0.6288200000000006
Gamma value 0.5: G_1: -1.8125

Episode 3: 
States visited: [(0, 0), (1, 0), (0, 0), (0, 0), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 2), (1, 2), (1, 1), (1, 2), (1, 1), (0, 1), (0, 0), (0, 0), (1, 0), (2, 0), (2, 0), (1, 0)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
Gamma value 0.9: G_2: -8.78423345409431
Gamma value 0.5: G_2: -1.9999980926513672

Episode 4: 
States visited: [(0, 0), (1, 0), (1, 1), (1, 2), (1, 3)]
Rewards: [-1, -1, -1, -10

In [112]:
experiment()

Episode 1: 
States visited: [(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (1, 0), (1, 1), (1, 2), (1, 3)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -10]
Gamma value 0.9: G_0: -10.0
Gamma value 0.5: G_0: -2.0625

Episode 2: 
States visited: [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (2, 3), (2, 3), (2, 3), (1, 3)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -10]
Gamma value 0.9: G_1: -10.0
Gamma value 0.5: G_1: -2.0625

Episode 3: 
States visited: [(0, 0), (0, 0), (0, 1), (0, 2), (0, 1), (0, 1), (0, 1), (0, 1), (1, 1), (2, 1), (2, 1), (1, 1), (2, 1), (2, 2), (2, 1), (2, 2), (1, 2), (0, 2), (0, 3)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 10]
Gamma value 0.9: G_2: -6.664563660066687
Gamma value 0.5: G_2: -1.999908447265625

Episode 4: 
States visited: [(0, 0), (1, 0), (2, 0), (1, 0), (2, 0), (2, 1), (1, 1), (2, 1), (1, 1), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0), (0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (1, 2), (0, 2)]
Rewards: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -