**Module 8: Reinforcement Learning**

**Exercise-1**

**Title:** Building a Simple Game Environment for Reinforcement Learning Experimentation.

**Problem Statement:**
Create a simple custom game environment using the OpenAI Gym library to facilitate experimentation with reinforcement learning (RL) algorithms. This environment will be based on the classic CartPole-v1 game, where the objective is to balance a pole on a moving cart.

**Steps to be Followed:**
1.	Install Required Libraries:

    a.	Ensure numpy and gym libraries are installed.

2.	Import Necessary Modules:

    a.	Import the gym library for creating and managing the game environment.

3.	Define the Custom Environment Class:

    a.	Create a class SimpleCartPoleEnvironment to encapsulate the environment.

    b.	Initialize the environment using gym.make('CartPole-v1').

    c.	Define methods for resetting the environment, taking a step, rendering the environment, and closing the environment.

4.	Instantiate and Use the Environment:

    a.	Instantiate the custom environment.

    b.	Reset the environment to obtain the initial state.

    c.	Take random actions for a specified number of steps or until the episode is done.

    d.	Render the environment at each step.

    e.	Close the environment after the episode is done.


In [None]:
# Step 1: Install Required Libraries
# Uncomment the line below if running in an environment where gym is not installed
# !pip install numpy gym

# Step 2: Import Necessary Modules
import gym

# Step 3: Define the Custom Environment Class
class SimpleCartPoleEnvironment:
    def __init__(self):
        self.env = gym.make('CartPole-v1')
        self.state = None
        self.done = None

    def reset(self):
        # Assumes the pre-0.26 Gym API, where reset() returns only the observation
        self.state = self.env.reset()
        self.done = False
        return self.state

    def step(self, action):
        if not self.done:
            # Assumes the pre-0.26 Gym API, where step() returns four values
            self.state, reward, self.done, _ = self.env.step(action)
            return self.state, reward, self.done
        else:
            print("Episode is done. Please reset the environment.")
            return None, None, self.done

    def render(self):
        self.env.render()

    def close(self):
        self.env.close()

# Step 4: Instantiate and Use the Environment
if __name__ == "__main__":
    env = SimpleCartPoleEnvironment()

    # Reset the environment
    state = env.reset()
    print("Initial State:", state)

    # Take random actions for up to 100 steps or until the episode is done
    for _ in range(100):
        action = env.env.action_space.sample()  # Random action
        state, reward, done = env.step(action)
        if done:
            break
        env.render()

    env.close()


If you want to render in human mode, initialize the environment in this way: gym.make('EnvName', render_mode='human') and don't call the render method.
See here for more information: https://www.gymlibrary.ml/content/api/


Initial State: [-0.00852331  0.03632113 -0.04794138  0.04505791]


**Explanation of the Code:**
1.	Install Required Libraries:

    a.	The gym library is necessary to create and manage the CartPole environment. The installation command is provided but commented out.

2.	Import Necessary Modules:

    a.	The gym library is imported to access its environment creation and management functionalities.

3.	Define the Custom Environment Class:

    a.	The SimpleCartPoleEnvironment class encapsulates the CartPole environment.

    b.	The __init__ method initializes the environment using gym.make('CartPole-v1').

    c.	The reset method resets the environment and returns the initial state.

    d.	The step method takes an action in the environment, updates the state, and returns the new state, reward, and done flag.

    e.	The render method renders the environment.

    f.	The close method closes the environment.

4.	Instantiate and Use the Environment:

    a.	The environment is instantiated and reset, and random actions are taken for up to 100 steps or until the episode is done.

    b.	The environment is rendered at each step, and closed after the episode is done.
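
The wrapper above assumes the older Gym API, where reset() returns only the observation and step() returns four values. Gym 0.26 and later, as well as Gymnasium, return `(observation, info)` from reset() and the five values `(observation, reward, terminated, truncated, info)` from step(). The sketch below shows one possible way to accept both shapes. The helper names `unpack_reset` and `unpack_step` are hypothetical and not part of Gym, and the tuples at the end are stand-ins rather than real environment output.

```python
def unpack_reset(result):
    """Return only the observation from either reset() shape."""
    # Newer API: (observation, info), where info is a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: the observation itself.
    return result


def unpack_step(result):
    """Return (observation, reward, done) from either step() shape."""
    if len(result) == 5:
        obs, reward, terminated, truncated, _info = result
        # An episode is over if it terminated or was cut off by a time limit.
        return obs, reward, bool(terminated or truncated)
    obs, reward, done, _info = result
    return obs, reward, bool(done)


# Stand-in tuples (not real Gym output) to show both shapes.
old_step = ([0.0, 0.1, 0.0, -0.1], 1.0, False, {})
new_step = ([0.0, 0.1, 0.0, -0.1], 1.0, False, True, {})
print(unpack_step(old_step))
print(unpack_step(new_step))
```

Inside SimpleCartPoleEnvironment, the results of self.env.reset() and self.env.step(action) could be passed through these helpers before being stored.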


**Exercise-2**

**Title:** Implementation of Q-Learning Algorithm with Epsilon-Greedy Exploration Strategy.

**Problem Statement:**
Develop an agent that can learn optimal actions through reinforcement learning using the Q-learning algorithm. The agent will utilize the Q-value update rule and epsilon-greedy exploration strategy to balance exploration and exploitation.

**Steps to be Followed:**

1.	Initialization:

    a.	Import necessary libraries: numpy for numerical operations and random for random number generation.

    b.	Define a class QLearningAgent to encapsulate the Q-learning logic.

2.	Define the Q-Learning Agent:

    a.	The __init__ method initializes the agent with parameters such as the number of states, number of actions, learning rate (alpha), discount factor (gamma), and exploration rate (epsilon). It also initializes the Q-table with zeros.

3.	Action Selection:

    a.	The choose_action method uses the epsilon-greedy strategy to select an action. With probability epsilon, the agent explores by choosing a random action. Otherwise, it exploits by choosing the action with the highest Q-value for the current state.

4.	Q-Value Update:

    a.	The update_q_value method updates the Q-value for a given state-action pair using the Bellman equation.

5.	Simulate Episodes:

    a.	Simulate multiple episodes to allow the agent to learn from its interactions with the environment.

    b.	For each episode, initialize the current state randomly.

    c.	The agent chooses actions based on the current state and updates the Q-values based on the received rewards and transitions to new states.

    d.	Continue this process until a stopping condition is met (e.g., reaching a terminal state or a fixed number of steps).

6.	Print Results:

    a.	Print the Q-table after each episode to observe how the Q-values are being updated.
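
Step 3 can be sketched on its own. One detail worth noting is that `np.argmax` returns the first index of the maximum, so with an all-zero Q-table the greedy branch always picks action 0. The sketch below uses plain Python lists and breaks ties at random instead. The function name `epsilon_greedy` and the tie-breaking rule are illustrative assumptions, not part of the exercise code.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of Q-values."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: choose among all actions that share the highest Q-value.
    best = max(q_row)
    candidates = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(candidates)


rng = random.Random(0)  # seeded for repeatability
zero_row = [0.0, 0.0, 0.0]
picks = [epsilon_greedy(zero_row, epsilon=0.1, rng=rng) for _ in range(300)]
print(sorted(set(picks)))  # every action is tried even though all values tie

learned_row = [0.2, 0.9, -0.1]
print(epsilon_greedy(learned_row, epsilon=0.0, rng=rng))  # purely greedy pick
```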





In [None]:
import numpy as np
import random

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.n_states = n_states
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.epsilon = epsilon  # exploration rate (probability of a random action)
        self.q_table = np.zeros((n_states, n_actions))

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            # Exploration: Choose a random action
            return random.randint(0, self.n_actions - 1)
        else:
            # Exploitation: Choose the action with the highest Q-value for the current state
            return np.argmax(self.q_table[state, :])

    def update_q_value(self, state, action, reward, next_state):
        # Q-value update using the Bellman equation
        current_q = self.q_table[state, action]
        max_future_q = np.max(self.q_table[next_state, :])
        new_q = (1 - self.alpha) * current_q + self.alpha * (reward + self.gamma * max_future_q)
        self.q_table[state, action] = new_q

# Define the number of states and actions
num_states = 5
num_actions = 3

# Create a Q-learning agent
q_agent = QLearningAgent(num_states, num_actions)

# Simulate episodes and update Q-values
for episode in range(10):  # Reduced to 10 episodes for brevity
    print(f"Episode {episode+1}")
    current_state = random.randint(0, num_states-1)  # Random start state
    done = False

    while not done:
        action = q_agent.choose_action(current_state)
        next_state = random.randint(0, num_states-1)  # Simulate transition to a random next state
        reward = random.uniform(-1, 1)  # Simulate a reward between -1 and 1
        print(f"State: {current_state}, Action: {action}, Reward: {reward:.2f}, Next State: {next_state}")

        q_agent.update_q_value(current_state, action, reward, next_state)

        if next_state == num_states - 1:  # Arbitrary condition to end the episode
            done = True

        current_state = next_state

    print(f"Q-Table after episode {episode+1}:\n{q_agent.q_table}\n")

# Example of final Q-Table
print("Final Q-Table:")
print(q_agent.q_table)


Episode 1
State: 3, Action: 0, Reward: 0.69, Next State: 2
State: 2, Action: 0, Reward: 0.48, Next State: 0
State: 0, Action: 0, Reward: 0.67, Next State: 4
Q-Table after episode 1:
[[0.06736403 0.         0.        ]
 [0.         0.         0.        ]
 [0.04841038 0.         0.        ]
 [0.06883014 0.         0.        ]
 [0.         0.         0.        ]]

Episode 2
State: 3, Action: 0, Reward: 0.64, Next State: 3
State: 3, Action: 0, Reward: 0.52, Next State: 4
Q-Table after episode 2:
[[0.06736403 0.         0.        ]
 [0.         0.         0.        ]
 [0.04841038 0.         0.        ]
 [0.17115077 0.         0.        ]
 [0.         0.         0.        ]]

Episode 3
State: 0, Action: 0, Reward: 0.16, Next State: 1
State: 1, Action: 0, Reward: 0.79, Next State: 2
State: 2, Action: 0, Reward: -0.60, Next State: 4
Q-Table after episode 3:
[[ 0.07712418  0.          0.        ]
 [ 0.08305428  0.          0.        ]
 [-0.01593615  0.          0.        ]
 [ 0.17115077  0.    

**Explanation of the Code**

1.	Initialization:

    a.	The QLearningAgent class is initialized with the number of states, actions, learning rate, discount factor, and exploration rate.

    b.	The Q-table is a matrix of size (number of states) x (number of actions) initialized to zeros.

2.	Choosing Actions:

    a.	The agent uses an epsilon-greedy strategy to balance exploration (random action) and exploitation (best action based on Q-values).

3.	Updating Q-Values:

    a.	Q-values are updated with the Q-learning rule derived from the Bellman equation: the new value is (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max Q(s', a')), where the maximum is taken over the actions a' available in the next state s'.

4.	Simulating Episodes:

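    a.	For each episode, the agent starts from a random state and takes actions until it transitions to the last state (index num_states - 1), which is arbitrarily treated as terminal.

    b.	Rewards and next states are drawn at random, independent of the chosen action, so the Q-table here illustrates the update mechanics rather than a learned policy.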

5.	Output:

    a.	The Q-table is printed after each episode to show the learning progress.

    b.	The final Q-table is displayed at the end of the simulation.
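
To make the update concrete, the sketch below replays the three transitions printed for Episode 1 with alpha = 0.1 and gamma = 0.9. Because the Q-table starts at zero, each of these first updates reduces to alpha * reward. The rewards are the two-decimal values from the log, so the results only approximate the table printed after Episode 1. The helper name `q_update` is hypothetical.

```python
ALPHA, GAMMA = 0.1, 0.9


def q_update(q, state, action, reward, next_state):
    """Apply one tabular Q-learning update in place."""
    target = reward + GAMMA * max(q[next_state])
    q[state][action] = (1 - ALPHA) * q[state][action] + ALPHA * target


# 5 states x 3 actions, all zeros, as in the exercise.
q = [[0.0] * 3 for _ in range(5)]

# (state, action, rounded reward, next_state) copied from the Episode 1 log.
episode_1 = [(3, 0, 0.69, 2), (2, 0, 0.48, 0), (0, 0, 0.67, 4)]
for s, a, r, s_next in episode_1:
    q_update(q, s, a, r, s_next)

for state in (0, 2, 3):
    print(state, round(q[state][0], 3))
# Values in the log for comparison: 0.06736403, 0.04841038, 0.06883014
```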



**Exercise-3**

**Title:** Deep Q-Networks (DQNs) Implementation

**Problem Statement:**
The objective of this code is to implement a Deep Q-Network (DQN) agent to learn optimal policies for decision-making tasks in a given environment. DQNs combine Q-learning with deep neural networks, allowing the agent to handle high-dimensional state spaces effectively. The agent uses experience replay and target networks to stabilize the training process.
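
As a rough illustration of experience replay, the sketch below stores transitions in a bounded deque and draws a random minibatch from it, so that updates are not made on consecutive, highly correlated steps. The transitions are made-up placeholders, and the capacity of 5 is chosen only to make eviction visible (the exercise code uses maxlen=2000).

```python
import random
from collections import deque

memory = deque(maxlen=5)  # small capacity so eviction is easy to see

# Placeholder transitions: (state, action, reward, next_state, done).
for t in range(8):
    memory.append((t, t % 2, 1.0, t + 1, t == 7))

# The three oldest transitions (t = 0, 1, 2) have been dropped.
print([transition[0] for transition in memory])

random.seed(0)
batch_size = 3
if len(memory) >= batch_size:
    # Sample without replacement from the stored transitions.
    minibatch = random.sample(list(memory), batch_size)
    print(len(minibatch))
```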

**Steps to be Followed:**
1.	Initialization:

    a.	Import necessary libraries: tensorflow for building neural networks, numpy for numerical operations, random for random sampling, deque for memory storage, and gym for the environment.

2.	Define the DQN Agent:

    a.	The DQNAgent class is initialized with state size, action size, and various hyperparameters (e.g., learning rate, discount factor, epsilon for exploration).

    b.	Create primary and target Q-networks using a simple feedforward neural network structure.

    c.	Copy weights from the primary network to the target network.

3.	Building the Q-Network:

    a.	The build_q_network method constructs a neural network model with two hidden layers and an output layer corresponding to action size.

    b.	Compile the model with mean squared error loss and Adam optimizer.

4.	Update Target Network:

    a.	The update_target_network method copies the weights from the primary Q-network to the target Q-network.

5.	Memory Management:

    a.	The remember method stores experiences (state, action, reward, next state, done) in the agent's memory for experience replay.

6.	Action Selection:

    a.	The choose_action method uses an epsilon-greedy strategy to balance exploration and exploitation. It selects either a random action or the action with the highest predicted Q-value.

7.	Experience Replay:

    a.	The replay method samples a minibatch of experiences from memory to update the Q-values using the Bellman equation. It fits the primary Q-network to minimize the loss.

8.	Training the Agent:

    a.	The train method runs multiple episodes where the agent interacts with the environment, stores experiences, and updates Q-values using experience replay.

    b.	The method decays epsilon after every episode to reduce exploration over time and copies the primary network's weights to the target network every 10 episodes.
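
The decay schedule in step 8 can be checked with a few lines of arithmetic. Using the values set in the agent's constructor (epsilon = 1.0, epsilon_decay = 0.995, epsilon_min = 0.01) and one decay per episode, the sketch below reports epsilon after 25 episodes and how many episodes pass before the floor is reached. The helper names `epsilon_after` and `episodes_until_floor` are hypothetical.

```python
EPSILON_START = 1.0
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01


def epsilon_after(episodes):
    """Epsilon after the given number of episodes, one decay per episode."""
    epsilon = EPSILON_START
    for _ in range(episodes):
        if epsilon > EPSILON_MIN:
            epsilon *= EPSILON_DECAY
    return epsilon


def episodes_until_floor():
    """Count decays until epsilon first drops to or below EPSILON_MIN."""
    epsilon, count = EPSILON_START, 0
    while epsilon > EPSILON_MIN:
        epsilon *= EPSILON_DECAY
        count += 1
    return count


print(round(epsilon_after(25), 3))  # about 0.882
print(episodes_until_floor())       # 919
```

After 25 episodes the agent still picks a random action on roughly 88% of its steps, so a run of that length mostly reflects random behaviour.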





In [None]:
import tensorflow as tf
import numpy as np
import random
from collections import deque
import gym

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95  # discount factor
        self.epsilon = 1.0  # exploration rate (probability of a random action)
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.learning_rate = 0.001

        # Create primary and target Q-networks
        self.primary_q_network = self.build_q_network()
        self.target_q_network = self.build_q_network()
        self.update_target_network()

    def build_q_network(self):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu'),
            tf.keras.layers.Dense(24, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='linear')
        ])
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def update_target_network(self):
        self.target_q_network.set_weights(self.primary_q_network.get_weights())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

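    def choose_action(self, state):
        # Explore with probability epsilon
        if np.random.rand() <= self.epsilon:
            return np.random.choice(self.action_size)
        # Exploit: pick the action with the highest predicted Q-value
        q_values = self.primary_q_network.predict(state, verbose=0)
        return np.argmax(q_values[0])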

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return

        minibatch = random.sample(self.memory, batch_size)
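        for state, action, reward, next_state, done in minibatch:
            # Start from the current predictions so only the taken action's target changes
            target = self.primary_q_network.predict(state, verbose=0)
            if done:
                target[0][action] = reward
            else:
                # Bootstrap from the target network for stability
                next_q = self.target_q_network.predict(next_state, verbose=0)[0]
                target[0][action] = reward + self.gamma * np.max(next_q)

            self.primary_q_network.fit(state, target, epochs=1, verbose=0)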

    def train(self, env, episodes, batch_size=32):
        for episode in range(episodes):
            state = env.reset()
            state = np.reshape(state, [1, self.state_size])

            for time_step in range(500):  # or any appropriate max time step
                action = self.choose_action(state)
                next_state, reward, done, _ = env.step(action)
                next_state = np.reshape(next_state, [1, self.state_size])
                self.remember(state, action, reward, next_state, done)
                state = next_state

                if done:
                    print(f"Episode: {episode + 1}/{episodes}, Score: {time_step}")
                    break

            if len(self.memory) > batch_size:
                self.replay(batch_size)

            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay

            if episode % 10 == 0:
                self.update_target_network()

# Example Usage:
# Initialize the environment
env = gym.make('CartPole-v1')

# Extract state and action sizes
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# Create DQNAgent
dqn_agent = DQNAgent(state_size, action_size)

# Train the agent
dqn_agent.train(env, episodes=25)


Episode: 1/25, Score: 12
Episode: 2/25, Score: 27
Episode: 3/25, Score: 52
Episode: 4/25, Score: 50
Episode: 5/25, Score: 14
Episode: 6/25, Score: 16
Episode: 7/25, Score: 24
Episode: 8/25, Score: 15
Episode: 9/25, Score: 19
Episode: 10/25, Score: 25
Episode: 11/25, Score: 10
Episode: 12/25, Score: 12
Episode: 13/25, Score: 16
Episode: 14/25, Score: 26
Episode: 15/25, Score: 21
Episode: 16/25, Score: 17
Episode: 17/25, Score: 19
Episode: 18/25, Score: 19
Episode: 19/25, Score: 34
Episode: 20/25, Score: 43
Episode: 21/25, Score: 22
Episode: 22/25, Score: 12
Episode: 23/25, Score: 28
Episode: 24/25, Score: 10
Episode: 25/25, Score: 16


**Explanation of the Code:**
1.	Initialization:

    a.	The DQNAgent class sets up the agent's hyperparameters: learning rate, discount factor, exploration rate (epsilon), epsilon decay, and minimum epsilon, along with a replay memory of up to 2000 experiences.

    b.	Two neural networks (primary and target) are built to represent the Q-value functions.

2.	Building the Q-Network:

    a.	The Q-network consists of two hidden layers with ReLU activation and an output layer with linear activation to predict Q-values for each action.

3.	Updating Target Network:

    a.	The weights of the primary network are copied to the target network periodically to stabilize the training process.

4.	Memory Management:

    a.	The agent stores experiences (state, action, reward, next state, done) in a deque for experience replay.

5.	Action Selection:

    a.	The agent uses an epsilon-greedy strategy to balance exploring new actions and exploiting the best-known actions.

6.	Experience Replay:

    a.	A minibatch of experiences is sampled from memory, and a Bellman target is computed for each one using the target network. The primary network is then fitted on each sampled experience in turn.

7.	Training the Agent:

    a.	The agent interacts with the environment for multiple episodes, updating its Q-values based on the rewards received from its actions.

    b.	The epsilon value is decayed over time to reduce exploration and increase exploitation.

    c.	The target network is updated periodically to keep the training process stable.
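
The target used in experience replay can be written as a small function. In the sketch below, `next_q_values` stands in for the target network's prediction for the next state. The function name `td_target` and the sample numbers are illustrative assumptions.

```python
GAMMA = 0.95  # discount factor used by the agent


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Bellman target for the action that was taken."""
    if done:
        # Terminal transition: there is no future return to bootstrap from.
        return reward
    return reward + gamma * max(next_q_values)


# Made-up Q-values standing in for target_q_network.predict(next_state)[0].
next_q_values = [0.5, 2.0]
print(td_target(1.0, next_q_values, done=False))
print(td_target(1.0, next_q_values, done=True))
```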

**Output:**
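The script prints the episode number and the score after each episode. The score is the value of time_step when the episode ends, so it is one less than the number of steps taken. In the 25-episode run shown above, the scores fluctuate between 10 and 52 without a clear upward trend; a visible improvement generally requires many more episodes, because epsilon decays slowly and the network is updated on only one minibatch per episode.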
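
One simple way to judge progress is to compare average scores across blocks of episodes. The sketch below does this for the 25 scores in the training log above; the block size of 5 and the helper name `block_means` are arbitrary choices made for illustration.

```python
# Scores copied from the training log above.
scores = [12, 27, 52, 50, 14, 16, 24, 15, 19, 25, 10, 12, 16,
          26, 21, 17, 19, 19, 34, 43, 22, 12, 28, 10, 16]


def block_means(values, block_size):
    """Average of each consecutive block of up to block_size values."""
    means = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        means.append(sum(block) / len(block))
    return means


means = block_means(scores, 5)
print(means)
print(sum(scores) / len(scores))
```

The block averages show no upward trend over this short run.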
