# Apprenticeship Learning via IRL - Q Learning

* This notebook applies Inverse Reinforcement Learning (IRL) to the CartPole environment from OpenAI Gym.

* The algorithm is based on the paper "Apprenticeship Learning via Inverse Reinforcement Learning" (Abbeel and Ng, 2004).

* IRL requires expert demonstrations as input. Here, the expert is a tabular Q-learning agent trained on a discretized state space.

* This implementation is intended as a demonstration of how IRL can be used to solve the CartPole problem.

## CartPole-v0

[CartPole-v0 Wiki](https://github.com/openai/gym/wiki/CartPole-v0)

In [56]:
import gym
import numpy as np
import matplotlib.pyplot as plt  # required by plot_running_avg
from tqdm import tqdm

In [7]:
env = gym.make('CartPole-v0')

nbins = 10    # discretization
GAMMA = 0.9   # discount factor
ALPHA = 0.01  # learning rate

Observations

Num | Observation | Min | Max
---|---|---|---
0 | Cart Position | -2.4 | 2.4
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°)
3 | Pole Velocity At Tip | -Inf | Inf

In [10]:
# Discretize the continuous observable state space

bins = np.zeros((4,nbins))

bins[0] = np.linspace(-2.4, 2.4, nbins)    # position
bins[1] = np.linspace(-5, 5, nbins)        # velocity
bins[2] = np.linspace(-.418, .418, nbins)  # angle
bins[3] = np.linspace(-5, 5, nbins)        # tip velocity
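
The bin edges above are later passed to `np.digitize`, which returns an index between `0` and `nbins` inclusive. That is why each dimension has `nbins + 1` possible values. The following is a minimal sketch of that behavior using the position edges; the sample values `-3.0`, `0.0`, and `3.0` are chosen only for illustration.

```python
import numpy as np

nbins = 10
edges = np.linspace(-2.4, 2.4, nbins)  # same edges as bins[0]

# Values below the first edge map to 0; values at or above the last edge map to nbins.
below = int(np.digitize(-3.0, edges))
inside = int(np.digitize(0.0, edges))
above = int(np.digitize(3.0, edges))

print(below, inside, above)  # 0 5 10
```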

In [13]:
# States

states = []

for i in range(nbins + 1):
    for j in range(nbins + 1):
        for k in range(nbins + 1):
            for l in range(nbins + 1):
                a = str(i).zfill(2) + str(j).zfill(2) + str(k).zfill(2) + str(l).zfill(2)
                states.append(a)

In [16]:
# Possible states = (nbins+1)^4

len(states)

14641

Actions

Num | Action
--- | ---
0 | Push cart to the left
1 | Push cart to the right

In [19]:
# Possible actions

env.action_space.n

2

In [53]:
# Initialize Q Table

def init_Q():
    
    Q = {}
    
    for state in states:
        Q[state] = {}
        
        for action in range(env.action_space.n):
            Q[state][action] = 0
    
    return Q

In [33]:
# Discretize observation into bins

def assign_bins(observation, bins):
    
    discretized_state = np.zeros(4)
    
    for i in range(4):
        discretized_state[i] = np.digitize(observation[i], bins[i])
        
    return discretized_state

In [35]:
# Encode the discretized state as a string so it can be used as a dictionary key

def get_string_state(state):

    string_state = ''.join(str(int(e)).zfill(2) for e in state)
    
    return string_state
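
To illustrate how the two helpers combine, here is a small standalone sketch that maps one observation to its Q-table key. The observation values are hypothetical and were picked so that some components fall outside the bin range.

```python
import numpy as np

nbins = 10
bins = np.array([
    np.linspace(-2.4, 2.4, nbins),      # position
    np.linspace(-5, 5, nbins),          # velocity
    np.linspace(-0.418, 0.418, nbins),  # angle
    np.linspace(-5, 5, nbins),          # tip velocity
])

# Hypothetical observation: centered cart, very high velocity,
# pole past the lower angle edge, very negative tip velocity.
observation = np.array([0.0, 6.0, -0.5, -6.0])

indices = [int(np.digitize(observation[i], bins[i])) for i in range(4)]
key = ''.join(str(i).zfill(2) for i in indices)

print(indices)  # [5, 10, 0, 0]
print(key)      # 05100000
```

Each index is zero-padded to two digits, so the key always has eight characters and matches one of the entries generated in `states`.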

In [54]:
# Train an episode

def train_an_episode(bins, Q, epsilon=0.5):
    """
    Simulate one episode of training.

    Parameters:
    - bins: Discretization bins for state representation.
    - Q: Q-table for the reinforcement learning agent.
    - epsilon: Exploration-exploitation trade-off parameter.

    Returns:
    - total_reward: Total reward obtained during the episode.
    - move_count: Number of moves in the episode.
    """
    
    observation = env.reset()
    done = False
    move_count = 0   # no. of moves in an episode
    state = get_string_state(assign_bins(observation, bins))
    total_reward = 0

    while not done:
        move_count += 1

        if np.random.uniform() < epsilon:
            action = env.action_space.sample()  # epsilon-greedy exploration
        else:
            action = max(Q[state].items(), key=lambda x: x[1])[0]   # action with max value

        observation, reward, done, _ = env.step(action)

        total_reward += reward

        # penalize early episode termination
        if done and move_count < 200:
            reward = -300

        new_state = get_string_state(assign_bins(observation, bins))

        # Q-learning update: bootstrap from the best action value in the next state
        max_q_s1a1 = max(Q[new_state].values())
        Q[state][action] += ALPHA * (reward + GAMMA * max_q_s1a1 - Q[state][action])
        state = new_state

    return total_reward, move_count
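
As a worked example of the update rule used in the loop, the sketch below applies a single Q-learning step to a toy two-state table. The state names `"s"` and `"s_next"` and the Q-values are made up for illustration; only `ALPHA` and `GAMMA` match the notebook.

```python
ALPHA = 0.01  # learning rate, as in the notebook
GAMMA = 0.9   # discount factor, as in the notebook

# Hypothetical Q-table with two states and two actions each
Q = {
    "s": {0: 0.0, 1: 0.0},
    "s_next": {0: 1.0, 1: 2.0},
}

reward = 1.0
action = 0

# TD target: reward + GAMMA * max_a' Q(s_next, a') = 1 + 0.9 * 2 = 2.8
td_target = reward + GAMMA * max(Q["s_next"].values())

# Move Q(s, a) a fraction ALPHA toward the target: 0 + 0.01 * 2.8 = 0.028
Q["s"][action] += ALPHA * (td_target - Q["s"][action])

print(round(Q["s"][action], 6))  # 0.028
```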

In [57]:
# Train for many episodes

def train_q_learning(bins, num_episodes=10000, print_interval=1000):
    """
    Train a Q-learning agent through multiple episodes.
    Exploration decays as epsilon = 1 / sqrt(episode + 1).
    
    Parameters:
    - bins: Discretization bins for state representation.
    - num_episodes: Number of training episodes.
    - print_interval: Interval for printing training progress.
    
    Returns:
    - episode_lengths: List of lengths for each episode.
    - episode_rewards: List of rewards for each episode.
    - Q: Trained Q-table.
    """
    
    Q = init_Q()
    episode_lengths = []
    episode_rewards = []

    for episode in tqdm(range(1, num_episodes + 1), desc="Training Episodes.."):
        epsilon = 1.0 / np.sqrt(episode + 1)

        episode_reward, episode_length = train_an_episode(bins, Q, epsilon)

        if episode % print_interval == 0:
            print(f"Episode: {episode}, Epsilon: {epsilon:.4f}, Reward: {episode_reward}")

        episode_lengths.append(episode_length)
        episode_rewards.append(episode_reward)

    env.close()
    
    return episode_lengths, episode_rewards, Q
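
The training loop sets `epsilon = 1.0 / np.sqrt(episode + 1)`, so exploration shrinks with the square root of the episode count. A quick sketch of the values this schedule produces (the helper name `epsilon_for` is introduced here only for illustration):

```python
import numpy as np

def epsilon_for(episode):
    """Exploration rate used in train_q_learning for a 1-based episode index."""
    return 1.0 / np.sqrt(episode + 1)

for episode in (1, 99, 9999):
    print(episode, round(float(epsilon_for(episode)), 4))
# 1 0.7071
# 99 0.1
# 9999 0.01
```

With the default of 10000 episodes, the agent therefore still takes a random action roughly 1% of the time near the end of training.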

In [58]:
# Plot reward curve

def plot_running_avg(total_rewards, title='Running Average Reward', save=False, filename='result'):
    """
    Plot the running average of rewards during training.

    Parameters:
    - total_rewards: List of total rewards obtained in each episode during training.
    - title: Title of the plot.
    - save: If True, save the plot as an image file.
    - filename: Name of the file if saving the plot.

    Returns:
    - None
    """
    
    fig = plt.figure()
    num_episodes = len(total_rewards)
    running_avg = np.empty(num_episodes)

    for episode in range(num_episodes):
        running_avg[episode] = np.mean(total_rewards[max(0, episode - 100):(episode + 1)])

    plt.plot(running_avg)
    plt.title(title)
    plt.xlabel("Episode")
    plt.ylabel("Running Average Reward")

    if save:
        plt.savefig(f"{filename}.png", bbox_inches='tight')
    plt.show()
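
The averaging step inside `plot_running_avg` can be checked separately from the plotting. The sketch below uses the same slicing with a configurable window; the function name `running_average` and the `window` parameter are assumptions added for this example. Note that the slice `max(0, i - window):(i + 1)` spans up to `window + 1` values.

```python
import numpy as np

def running_average(values, window=100):
    """Trailing mean over the current value and up to `window` previous values."""
    values = np.asarray(values, dtype=float)
    out = np.empty(len(values))
    for i in range(len(values)):
        out[i] = values[max(0, i - window):i + 1].mean()
    return out

avg = running_average([1, 2, 3, 4], window=1)
print(avg.tolist())  # [1.0, 1.5, 2.5, 3.5]
```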