#### Question 2: Mountain Car with Q-Learning
Problem: Use OpenAI Gym's MountainCar-v0 environment to train a Q-learning agent.

The approach mirrors the CartPole example. The Q-learning code is largely the same, adjusted for Mountain Car's two-dimensional state (position, velocity) and its three discrete actions.


In [2]:
import gym
import numpy as np

# Create the environment
env = gym.make('MountainCar-v0')

# Hyperparameters
learning_rate = 0.1
gamma = 0.99
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
episodes = 1000
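bins = (20, 20)  # Number of discretization bins for (position, velocity)

# Q-table: one value per (position bin, velocity bin, action)
q_table = np.zeros(bins + (env.action_space.n,))

In [6]:
# env.reset() returns (observation, info) in Gym 0.26+, so unpack the observation
state, _ = env.reset()

# Map a continuous observation to a tuple of bin indices
def discretize_state(state, env, bins):
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    # Keep only the interior bin edges so every index falls in 0..b-1
    edges = [np.linspace(low, high, num=b + 1)[1:-1] for low, high, b in zip(env_low, env_high, bins)]
    return tuple(int(np.digitize(s, e)) for s, e in zip(state, edges))

# Discretize the initial observation
discretized_state = discretize_state(state, env, bins)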

# Epsilon-greedy policy
def epsilon_greedy(state, epsilon):
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return np.argmax(q_table[state])
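
As a quick sanity check on the discretization, the sketch below compares two ways of building bin edges for `np.digitize`. It assumes MountainCar-v0's documented position bounds of [-1.2, 0.6] rather than querying the environment, and the helper name `bin_index` is illustrative. Using all `b` points of `np.linspace(low, high, b)` as edges yields indices from 1 to `b`, which can overflow a Q-table with `b` entries per dimension. Using only the interior edges of `b + 1` points yields indices from 0 to `b - 1`.

```python
import numpy as np

LOW, HIGH = -1.2, 0.6   # assumed MountainCar-v0 position bounds
N_BINS = 20

def bin_index(value, edges):
    """Return the np.digitize index of a scalar value for the given edges."""
    return int(np.digitize(value, edges))

# All N_BINS linspace points used as edges: indices range over 1..N_BINS
full_edges = np.linspace(LOW, HIGH, num=N_BINS)
# Interior edges only: indices range over 0..N_BINS-1
interior_edges = np.linspace(LOW, HIGH, num=N_BINS + 1)[1:-1]

full_range = (bin_index(LOW, full_edges), bin_index(HIGH, full_edges))
interior_range = (bin_index(LOW, interior_edges), bin_index(HIGH, interior_edges))
print(full_range)      # (1, 20)
print(interior_range)  # (0, 19)
```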


In [11]:
for episode in range(episodes):
    state, _ = env.reset()  # Correct handling of env.reset()
    state = discretize_state(state, env, bins)
    total_reward = 0

    for time in range(200):  # Max steps per episode
        # Choose an action
        action = epsilon_greedy(state, epsilon)

        # Step the environment
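        # Gym 0.26+ returns terminated (goal reached) and truncated (time limit) separately
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        next_state = discretize_state(next_state, env, bins)

        # Q-value update; a terminal next state contributes no future value
        q_table[state][action] += learning_rate * (
            reward + gamma * np.max(q_table[next_state]) * (not terminated) - q_table[state][action]
        )

        state = next_state
        total_reward += reward

        if terminated:  # Report only episodes that reach the goal
            print(f"Episode: {episode+1}/{episodes}, Reward: {total_reward}")
        if done:
            break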

    # Update epsilon
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay


Episode: 594/1000, Reward: -182.0
Episode: 598/1000, Reward: -194.0
Episode: 601/1000, Reward: -173.0
Episode: 603/1000, Reward: -157.0
Episode: 605/1000, Reward: -173.0
Episode: 608/1000, Reward: -163.0
Episode: 612/1000, Reward: -197.0
Episode: 725/1000, Reward: -170.0
Episode: 741/1000, Reward: -180.0
Episode: 751/1000, Reward: -165.0
Episode: 753/1000, Reward: -165.0
Episode: 757/1000, Reward: -178.0
Episode: 778/1000, Reward: -176.0
Episode: 832/1000, Reward: -164.0
Episode: 834/1000, Reward: -160.0
Episode: 852/1000, Reward: -159.0
Episode: 855/1000, Reward: -200.0
Episode: 856/1000, Reward: -175.0
Episode: 858/1000, Reward: -165.0
Episode: 859/1000, Reward: -168.0
Episode: 860/1000, Reward: -197.0
Episode: 861/1000, Reward: -172.0
Episode: 865/1000, Reward: -169.0
Episode: 867/1000, Reward: -186.0
Episode: 871/1000, Reward: -184.0
Episode: 872/1000, Reward: -173.0
Episode: 876/1000, Reward: -164.0
Episode: 886/1000, Reward: -175.0
Episode: 915/1000, Reward: -169.0


#### Exploration Dominates Early On:

Early in training, epsilon is close to 1.0, so the agent acts almost entirely at random. MountainCar-v0 gives a reward of -1 on every step until the goal is reached, so random behavior rarely reaches the flag and most early episodes end at the 200-step limit with a return of -200.

#### Inefficient Q-value Updates:

Q-values may be updating too slowly. Possible causes are a learning rate that is too small, poor exploration that leaves many state-action pairs unvisited, or a reward structure that provides little signal to learn from.
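
To make the size of a single update concrete, here is a minimal worked example of the tabular Q-learning rule using the first run's `learning_rate = 0.1` and `gamma = 0.99`. The function name `q_update`, its `terminal` flag, and the sample values are illustrative assumptions, not part of the notebook's code.

```python
def q_update(q_sa, reward, max_q_next, learning_rate, gamma, terminal=False):
    """Return the updated Q(s, a) after one tabular Q-learning step."""
    # A terminal transition has no future value to bootstrap from
    target = reward if terminal else reward + gamma * max_q_next
    return q_sa + learning_rate * (target - q_sa)

# One step with reward -1, starting from an all-zero table
q_first = q_update(0.0, -1.0, 0.0, learning_rate=0.1, gamma=0.99)

# Repeating the same update moves Q(s, a) only gradually toward the target of -1
q = 0.0
for _ in range(10):
    q = q_update(q, -1.0, 0.0, learning_rate=0.1, gamma=0.99)

print(q_first)      # -0.1
print(round(q, 4))  # -0.6513
```

With a learning rate of 0.1, each visit closes only 10% of the gap to the target, so rarely visited state-action pairs stay near their initial values for a long time.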

#### Exploration-Exploitation Balance:

The agent may still be exploring too much. With epsilon_decay = 0.995, epsilon does not reach epsilon_min = 0.01 until roughly episode 920 of 1000, so the agent has little time to exploit what it has learned.

#### Poor Reward Function:

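The reward signal is sparse: every step yields -1 regardless of how close the car is to the goal, so the agent gets no feedback about progress until it first reaches the flag. This makes good behavior hard to discover, and the agent may settle on suboptimal strategies early on.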

#### Q-table/Network Initialization:

The Q-table here is initialized to zeros. If the initial values are poorly chosen or the learning rate is not appropriate, the Q-values can be slow to converge.
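
One way to judge an initial value is to compare it with the range the true Q-values can take. MountainCar-v0 gives -1 per step and the training loop caps episodes at 200 steps, so the discounted return is bounded below. The sketch below computes that bound for the two gamma values used in this notebook; `return_lower_bound` is a hypothetical helper introduced only for illustration.

```python
def return_lower_bound(gamma, max_steps=200, step_reward=-1.0):
    """Discounted return of an episode that collects step_reward on each of max_steps steps."""
    return step_reward * sum(gamma ** t for t in range(max_steps))

bound_099 = return_lower_bound(0.99)
bound_095 = return_lower_bound(0.95)
print(round(bound_099, 2), round(bound_095, 2))
```

Every true Q-value lies between this bound and 0, so a zero-initialized table starts at the optimistic end of the range.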

In [33]:
# Initialize the environment
env = gym.make('MountainCar-v0')

# Define hyperparameters
learning_rate = 0.5  # Controls how much to adjust Q-values
gamma = 0.95  # Discount factor for future rewards
epsilon = 1.0  # Exploration rate
epsilon_min = 0.05  # Minimum exploration rate
epsilon_decay = 0.999  # Decay factor for epsilon
episodes = 7000  # Number of training episodes
max_timesteps = 200  # Maximum steps per episode

# Discretize the continuous state space so that a tabular Q-function can be used
state_space_size = (10, 10)  # Number of bins for (position, velocity)
action_space_size = env.action_space.n  # Number of possible actions
q_table = np.zeros(state_space_size + (action_space_size,))  # Q-table, initialized to zeros


In [34]:
# Q-learning functions
def discretize_state(state, bins):
    """
    Convert a continuous state to a tuple of bin indices.
    """
    env_low = env.observation_space.low
    env_high = env.observation_space.high

    # Interior bin edges for each state dimension (b bins need b - 1 edges)
    state_edges = [np.linspace(low, high, num=b + 1)[1:-1] for low, high, b in zip(env_low, env_high, bins)]

    # Map each state component to its bin index in 0..b-1
    return tuple(np.digitize(s, edges) for s, edges in zip(state, state_edges))

def epsilon_greedy(state, epsilon):
    """
    Select an action based on epsilon-greedy strategy.
    """
    if np.random.rand() < epsilon:
        return np.random.choice(action_space_size)  # Exploration
    else:
        return np.argmax(q_table[state])  # Exploitation


In [35]:
# Training loop
for episode in range(episodes):
    state, _ = env.reset()  # Reset the environment, get the initial state
    state = discretize_state(state, bins=(10, 10))  # Discretize state space
    total_reward = 0
    done = False

    for t in range(max_timesteps):  # For each time step
        # Select action based on epsilon-greedy policy
        action = epsilon_greedy(state, epsilon)

        # Step the environment
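        # Gym 0.26+ returns terminated (goal reached) and truncated (time limit) separately
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Discretize the next state
        next_state = discretize_state(next_state, bins=(10, 10))

        # Q-value update (Q-learning update rule); a terminal next state contributes no future value
        q_table[state][action] += learning_rate * (
            reward + gamma * np.max(q_table[next_state]) * (not terminated) - q_table[state][action]
        )

        # Move to next state
        state = next_state
        total_reward += reward

        if terminated:  # Report only episodes that reach the goal
            print(f"Episode: {episode+1}/{episodes}, Reward: {total_reward}, Epsilon: {epsilon:.3f}")
        if done:
            break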

    # Decay epsilon (reduce exploration) only when epsilon is greater than epsilon_min
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
        # Make sure epsilon does not go below the minimum value
        epsilon = max(epsilon, epsilon_min)


    # Print a progress summary every 100 episodes
    if episode % 100 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.3f}")

# Close the environment after training
env.close()


Episode 0, Total Reward: -200.0, Epsilon: 0.999
Episode 100, Total Reward: -200.0, Epsilon: 0.904
Episode 200, Total Reward: -200.0, Epsilon: 0.818
Episode 300, Total Reward: -200.0, Epsilon: 0.740
Episode 400, Total Reward: -200.0, Epsilon: 0.670
Episode 500, Total Reward: -200.0, Epsilon: 0.606
Episode 600, Total Reward: -200.0, Epsilon: 0.548
Episode 700, Total Reward: -200.0, Epsilon: 0.496
Episode 800, Total Reward: -200.0, Epsilon: 0.449
Episode 900, Total Reward: -200.0, Epsilon: 0.406
Episode 1000, Total Reward: -200.0, Epsilon: 0.367
Episode 1100, Total Reward: -200.0, Epsilon: 0.332
Episode 1200, Total Reward: -200.0, Epsilon: 0.301
Episode 1300, Total Reward: -200.0, Epsilon: 0.272
Episode 1400, Total Reward: -200.0, Epsilon: 0.246
Episode 1500, Total Reward: -200.0, Epsilon: 0.223
Episode 1600, Total Reward: -200.0, Epsilon: 0.202
Episode 1700, Total Reward: -200.0, Epsilon: 0.182
Episode 1800, Total Reward: -200.0, Epsilon: 0.165
Episode 1900, Total Reward: -200.0, Epsilon

Episode: 4825/7000, Reward: -163.0, Epsilon: 0.050
Episode: 4829/7000, Reward: -186.0, Epsilon: 0.050
Episode: 4831/7000, Reward: -187.0, Epsilon: 0.050
Episode: 4832/7000, Reward: -162.0, Epsilon: 0.050
Episode: 4860/7000, Reward: -189.0, Epsilon: 0.050
Episode: 4863/7000, Reward: -154.0, Epsilon: 0.050
Episode: 4865/7000, Reward: -162.0, Epsilon: 0.050
Episode: 4877/7000, Reward: -193.0, Epsilon: 0.050
Episode: 4887/7000, Reward: -188.0, Epsilon: 0.050
Episode: 4888/7000, Reward: -193.0, Epsilon: 0.050
Episode 4900, Total Reward: -200.0, Epsilon: 0.050
Episode: 4909/7000, Reward: -170.0, Epsilon: 0.050
Episode: 4928/7000, Reward: -115.0, Epsilon: 0.050
Episode: 4929/7000, Reward: -122.0, Epsilon: 0.050
Episode: 4930/7000, Reward: -194.0, Epsilon: 0.050
Episode: 4931/7000, Reward: -127.0, Epsilon: 0.050
Episode: 4940/7000, Reward: -182.0, Epsilon: 0.050
Episode: 4960/7000, Reward: -192.0, Epsilon: 0.050
Episode 5000, Total Reward: -200.0, Epsilon: 0.050
Episode: 5057/7000, Reward: -20

#### Key aspects of the performance and convergence of the reinforcement learning (RL) algorithm:

#### Rewards Remaining Suboptimal:
Every return is negative by construction, since each step costs -1. Most episodes end at -200, the value obtained when the 200-step limit is hit without reaching the goal. This suggests that the agent has difficulty learning an effective policy, or that the environment and its sparse reward structure are particularly challenging.
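
Because the per-episode print fires only when an episode ends before the step limit, it can help to record every episode's return and summarize it afterwards. The following is a minimal sketch; `summarize_returns`, the window size, and the sample list are illustrative assumptions, not values taken from the runs in this notebook.

```python
def summarize_returns(returns, step_limit=200, window=5):
    """Return (success_rate, moving_average) for a list of episode returns.

    An episode counts as a success when its return is above -step_limit,
    i.e. the goal was reached before the step limit.
    """
    successes = sum(1 for r in returns if r > -step_limit)
    success_rate = successes / len(returns)
    # Trailing moving average over at most `window` episodes
    moving_average = []
    for i in range(len(returns)):
        start = max(0, i - window + 1)
        chunk = returns[start:i + 1]
        moving_average.append(sum(chunk) / len(chunk))
    return success_rate, moving_average

# Hypothetical returns, for illustration only
sample = [-200.0, -200.0, -180.0, -200.0, -160.0, -200.0, -150.0, -200.0]
rate, avg = summarize_returns(sample)
print(rate)     # 0.375
print(avg[-1])  # -182.0
```

In a training loop, this would amount to appending `total_reward` to a list at the end of each episode and calling the helper periodically.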

#### Exploration Decay (Epsilon):
Epsilon, the exploration rate, decays steadily from 1.0 (0.367 at episode 1000, 0.165 at episode 1800). By episode 3000 it stabilizes at its floor of 0.05, indicating a shift toward exploitation (leveraging learned policies) over exploration.
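
The episode at which epsilon reaches its floor follows directly from the multiplicative decay, since epsilon after n episodes is epsilon_start * epsilon_decay^n. A small sketch (the function name `episodes_to_floor` is an assumption) applied to the two schedules used in this notebook:

```python
import math

def episodes_to_floor(epsilon_start, epsilon_min, epsilon_decay):
    """Smallest n with epsilon_start * epsilon_decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))

first_run = episodes_to_floor(1.0, 0.01, 0.995)   # 1000-episode run
second_run = episodes_to_floor(1.0, 0.05, 0.999)  # 7000-episode run
print(first_run, second_run)  # 919 2995
```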

#### Occasional Reward Improvement:
In occasional episodes the reward rises above -200 (e.g., -181, -153, -115), meaning the car reached the goal before the step limit. These instances suggest that the agent occasionally discovers promising policies but fails to sustain or generalize them.

#### Plateau After Epsilon Stabilization:
After epsilon reaches its minimum value (0.05), the agent's performance plateaus: the periodic log lines at episodes 4900 and 5000 still show -200. This could imply that the agent is stuck in a local optimum or has converged to a suboptimal policy.