<a href="https://colab.research.google.com/github/Pranav-Reddy-Pedaballe/Reinforcement-Learning/blob/main/A3C_for_Kung_Fu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A3C for Kung Fu

## Part 0 - Installing the required packages and importing the libraries

### Installing Gymnasium

In [None]:
!pip install gymnasium
!pip install "gymnasium[atari, accept-rom-license]"
!pip install ale-py
!apt-get install -y swig
!pip install gymnasium[box2d]

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
swig is already the newest version (4.0.2-1ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


### Importing the libraries

In [None]:
import cv2
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.multiprocessing as mp
import torch.distributions as distributions
from torch.distributions import Categorical
import ale_py
import gymnasium as gym
from gymnasium.spaces import Box
from gymnasium import ObservationWrapper

## Part 1 - Building the AI

### Creating the architecture of the Neural Network

In [None]:
class Network(nn.Module):

  def __init__(self, action_size):
    super(Network, self).__init__()
    self.conv1 = torch.nn.Conv2d(in_channels = 4,  out_channels = 32, kernel_size = (3,3), stride = 2)
    self.conv2 = torch.nn.Conv2d(in_channels = 32, out_channels = 32, kernel_size = (3,3), stride = 2)
    self.conv3 = torch.nn.Conv2d(in_channels = 32, out_channels = 32, kernel_size = (3,3), stride = 2)
    self.flatten = torch.nn.Flatten()
    self.fc1  = torch.nn.Linear(512, 128)
    self.fc2a = torch.nn.Linear(128, action_size)
    self.fc2s = torch.nn.Linear(128, 1)

  def forward(self, state):
    x = self.conv1(state)
    x = F.relu(x)
    x = self.conv2(x)
    x = F.relu(x)
    x = self.conv3(x)
    x = F.relu(x)
    x = self.flatten(x)
    x = self.fc1(x)
    x = F.relu(x)
    action_values = self.fc2a(x)
    state_value = self.fc2s(x).squeeze(-1)  # shape (batch,): one value per state
    return action_values, state_value

```
self.conv1 = torch.nn.Conv2d(in_channels = 4,  out_channels = 32, kernel_size = (3,3), stride = 2)
self.conv2 = torch.nn.Conv2d(in_channels = 32, out_channels = 32, kernel_size = (3,3), stride = 2)
self.conv3 = torch.nn.Conv2d(in_channels = 32, out_channels = 32, kernel_size = (3,3), stride = 2)
```
`Purpose:` Extracts spatial features from input frames (e.g., images from a game).

`in_channels = 4:` Assumes input consists of 4 stacked frames (common in RL).

`out_channels = 32:` Each convolution layer outputs 32 feature maps.

`kernel_size = (3,3):` 3x3 filter size.

`stride = 2:` Reduces spatial dimensions (downsampling)

#### Flattening and Fully Connected Layers

`self.flatten = torch.nn.Flatten()`

`self.fc1 = torch.nn.Linear(512, 128)`

`Flatten:` Converts the 3D tensor (from convolution) into a 1D vector for dense layers.

`fc1:` First fully connected (dense) layer that reduces the size to 128 neurons.
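
The `512` input size of `fc1` follows from the convolution arithmetic. A minimal sketch, assuming a 42x42 input (the size used by the preprocessing later in this notebook) and no padding, which is the `Conv2d` default; `conv_out` is a helper name introduced here only for illustration:

```python
def conv_out(size, kernel=3, stride=2, padding=0):
    # Output length of one spatial dimension after a convolution
    return (size + 2 * padding - kernel) // stride + 1

size = 42                      # assumed input height and width
for layer in range(3):         # conv1, conv2, conv3
    size = conv_out(size)
    print(f"after conv{layer + 1}: {size}x{size}")

flat_features = 32 * size * size   # 32 channels come out of conv3
print(flat_features)               # 512
```

If the input resolution or the strides change, the first argument of `fc1` would need to change with them.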

```
self.fc2a = torch.nn.Linear(128, action_size)
self.fc2s = torch.nn.Linear(128, 1)
```
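For the A3C model, the network ends in two heads that share the convolutional layers and `fc1`:

fc2a (Policy Head): Outputs one score (logit) per action (action_size). A softmax over these scores gives the probability of taking each action. ***(Actor)***

fc2s (State-Value Head): Outputs a single state value V(s) (scalar) for each state. ***(Critic)***

Why Two Heads?

State Value: Measures the overall quality of the current state.

Advantage: Measures the benefit of selecting a particular action compared to what is expected on average in that state.

Combining:

Q(s,a)=V(s)+A(s,a), so A(s,a)=Q(s,a)-V(s)

In this implementation Q(s,a) is estimated during training by the one-step target reward + discount_factor * V(next state), so the advantage is computed from the critic's outputs rather than produced by a separate stream.

The forward method is simple: the state passes through the layers in order, and the two head outputs are returned as action_values and state_value.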

## Part 2 - Training the AI

### Setting up the environment

In [None]:
class PreprocessAtari(ObservationWrapper):

  def __init__(self, env, height = 42, width = 42, crop = lambda img: img, dim_order = 'pytorch', color = False, n_frames = 4):
    super(PreprocessAtari, self).__init__(env)
    self.img_size = (height, width)
    self.crop = crop
    self.dim_order = dim_order
    self.color = color
    self.frame_stack = n_frames
    n_channels = 3 * n_frames if color else n_frames
    obs_shape = {'tensorflow': (height, width, n_channels), 'pytorch': (n_channels, height, width)}[dim_order]
    self.observation_space = Box(0.0, 1.0, obs_shape)
    self.frames = np.zeros(obs_shape, dtype = np.float32)

  def reset(self):
    self.frames = np.zeros_like(self.frames)
    obs, info = self.env.reset()
    self.update_buffer(obs)
    return self.frames, info

  def observation(self, img):
    img = self.crop(img)
    img = cv2.resize(img, self.img_size)
    if not self.color:
      if len(img.shape) == 3 and img.shape[2] == 3:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)  # Gymnasium frames are RGB, not BGR
    img = img.astype('float32') / 255.
    if self.color:
      self.frames = np.roll(self.frames, shift = -3, axis = 0)
    else:
      self.frames = np.roll(self.frames, shift = -1, axis = 0)
    if self.color:
      self.frames[-3:] = img
    else:
      self.frames[-1] = img
    return self.frames

  def update_buffer(self, obs):
    self.frames = self.observation(obs)

def make_env():
  env = gym.make("KungFuMasterDeterministic-v0", render_mode = 'rgb_array')
  env = PreprocessAtari(env, height = 42, width = 42, crop = lambda img: img, dim_order = 'pytorch', color = False, n_frames = 4)
  return env

env = make_env()

state_shape = env.observation_space.shape
number_actions = env.action_space.n
print("State shape:", state_shape)
print("Number actions:", number_actions)
print("Action names:", env.unwrapped.get_action_meanings())

State shape: (4, 42, 42)
Number actions: 14
Action names: ['NOOP', 'UP', 'RIGHT', 'LEFT', 'DOWN', 'DOWNRIGHT', 'DOWNLEFT', 'RIGHTFIRE', 'LEFTFIRE', 'DOWNFIRE', 'UPRIGHTFIRE', 'UPLEFTFIRE', 'DOWNRIGHTFIRE', 'DOWNLEFTFIRE']


### Initializing the hyperparameters

In [None]:
learning_rate = 1e-4
discount_factor = 0.99
number_environments = 10

### Implementing the A3C class

In [None]:
class Agent():

  def __init__(self, action_size):
    self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    self.action_size = action_size
    self.network = Network(action_size).to(self.device)
    self.optimizer = torch.optim.Adam(self.network.parameters(), lr = learning_rate)

  def act(self, state):
    if state.ndim == 3:
      state = [state]
    state = torch.tensor(state, dtype = torch.float32, device = self.device)
    action_values, _ = self.network(state)
    policy = F.softmax(action_values, dim = -1)
    return np.array([np.random.choice(len(p), p = p) for p in policy.detach().cpu().numpy()])

  def step(self, state, action, reward, next_state, done):
    batch_size = state.shape[0]
    state = torch.tensor(state, dtype = torch.float32, device = self.device)
    next_state = torch.tensor(next_state, dtype = torch.float32, device = self.device)
    reward = torch.tensor(reward, dtype = torch.float32, device = self.device)
    done = torch.tensor(done, dtype = torch.bool, device = self.device).to(dtype = torch.float32)
    action_values, state_value = self.network(state)
    _, next_state_value = self.network(next_state)
    target_state_value = reward + discount_factor * next_state_value * (1 - done)
    advantage = target_state_value - state_value
    probs = F.softmax(action_values, dim = -1)
    logprobs = F.log_softmax(action_values, dim = -1)
    entropy = -torch.sum(probs * logprobs, axis = -1)
    batch_idx = np.arange(batch_size)
    logp_actions = logprobs[batch_idx, action]
    actor_loss = -(logp_actions * advantage.detach()).mean() - 0.001 * entropy.mean()
    critic_loss = F.mse_loss(target_state_value.detach(), state_value)
    total_loss = actor_loss + critic_loss
    self.optimizer.zero_grad()
    total_loss.backward()
    self.optimizer.step()

```
def act(self, state):
    if state.ndim == 3:
      state = [state]
    state = torch.tensor(state, dtype = torch.float32, device = self.device)
    action_values, _ = self.network(state)
    policy = F.softmax(action_values, dim = -1)
    return np.array([np.random.choice(len(p), p = p) for p in policy.detach().cpu().numpy()])
```
`state.ndim == 3:` If the state has 3 dimensions, it is a single observation of shape (channels, height, width), i.e. one stack of 4 frames rather than a batch. Wrapping it in a list converts it to a batch of one.

`state = [state]` adds the extra batch dimension.

`state to Tensor:` Converts state to a PyTorch tensor on the same device as the model.

`network(state):` Passes the state through the neural network to get action values and state value.

`F.softmax:` Converts the action values into probabilities (the policy distribution) from which an action is sampled. `dim = -1` means the softmax is applied across the last dimension, i.e. over the actions of each state.

***Action Sampling:*** Chooses actions based on the probability distribution (policy).

For each row p in policy:

`len(p):` Number of possible actions (action_size).

`np.random.choice(len(p), p = p):`
Randomly selects an action based on the probabilities in p.

`p = p` means that the selection is weighted according to the probabilities in p.

This selects one action per state in the batch.
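
The sampling step can be tried in isolation. This is a rough sketch that uses NumPy in place of the network; the action values are made up, and `softmax` is a small helper written here as a stand-in for `F.softmax`:

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
action_values = np.array([[2.0, 0.5, 0.1],    # state 0: action 0 is most likely
                          [0.0, 0.0, 3.0]])   # state 1: action 2 is most likely
policy = softmax(action_values)               # each row sums to 1
actions = np.array([rng.choice(len(p), p=p) for p in policy])
print(policy.round(3))
print(actions)        # one sampled action index per row
```

Because the actions are sampled rather than taken with `argmax`, less likely actions are still chosen occasionally, which is what lets the agent explore.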





### Step Method

#### 1. Batch Size Calculation

`batch_size = state.shape[0]`:
Determines how many states are being processed at once (the batch size), since `state.shape[0]` is the number of samples in the batch.

#### 2. Convert Inputs to Tensors
```
state = torch.tensor(state, dtype = torch.float32, device = self.device)
next_state = torch.tensor(next_state, dtype = torch.float32, device = self.device)
reward = torch.tensor(reward, dtype = torch.float32, device = self.device)
done = torch.tensor(done, dtype = torch.bool, device = self.device).to(dtype = torch.float32)
```
Converts state, next_state, reward, and done into PyTorch tensors.

Moves tensors to the appropriate device (CPU/GPU).

done is a boolean (True/False) tensor, but it is cast to float32 (1.0 if done, 0.0 if not). This simplifies later calculations.

#### 3. Calculate Action Values and State Value
```
action_values, state_value = self.network(state)
_, next_state_value = self.network(next_state)
```
self.network(state) passes the current state through the neural network.

The network returns:

action_values – a score (logit) for each possible action, later turned into probabilities by softmax.

state_value – scalar estimate of how good the current state is.

For next_state, only the state value is needed, so _ discards the action values.

#### 4. Compute Target State Value (TD Target)

The Bellman equation provides the training target for the critic (value function): `target_state_value = reward + discount_factor * next_state_value * (1 - done)`.

If done = 1 (episode ends), next_state_value is ignored.

If done = 0 (episode continues), the future discounted state value (discount_factor * next_state_value) is added to the reward.

#### 5. Advantage Calculation

Advantage (`advantage = target_state_value - state_value`) measures how much better the chosen action turned out compared to the predicted value of the current state.

Positive advantage = the action was better than expected.

Negative advantage = the action was worse than expected.
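
A small worked example of the TD target and the advantage, using plain Python numbers. The reward and value estimates are made up for illustration, and `td_target_and_advantage` is a hypothetical helper that mirrors the two formulas in `step`:

```python
discount_factor = 0.99

def td_target_and_advantage(reward, value, next_value, done):
    # done is 1.0 when the episode ended, 0.0 otherwise
    target = reward + discount_factor * next_value * (1.0 - done)
    advantage = target - value
    return target, advantage

# Episode continues: the discounted next-state value is added to the reward
t1, a1 = td_target_and_advantage(reward=1.0, value=2.0, next_value=3.0, done=0.0)
# Episode ended: the next-state value is masked out
t2, a2 = td_target_and_advantage(reward=1.0, value=2.0, next_value=3.0, done=1.0)
print(t1, a1)   # about 3.97 and 1.97: better than the critic expected
print(t2, a2)   # 1.0 and -1.0: worse than the critic expected
```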

#### 6. Compute Action Probabilities (Policy)
```
probs = F.softmax(action_values, dim = -1)
logprobs = F.log_softmax(action_values, dim = -1)
```
`softmax` turns action values into probabilities.

`log_softmax` calculates the logarithm of the softmax probabilities. This is used for policy gradient calculations.
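
One likely reason to call `log_softmax` directly, rather than taking the log of `softmax`, is numerical stability. The sketch below illustrates the idea with NumPy; the `log_softmax` helper is written here for illustration and is not PyTorch's implementation:

```python
import numpy as np

def log_softmax(x):
    # log(softmax(x)) computed as x - logsumexp(x), shifted by the max for stability
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

x = np.array([1000.0, 1001.0, 1002.0])   # deliberately large scores
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.log(np.exp(x) / np.exp(x).sum())   # exp overflows, result is nan
stable = log_softmax(x)
print(naive)
print(stable)
```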

#### 7. Entropy Calculation (Exploration Bonus)
```
entropy = -torch.sum(probs * logprobs, axis = -1)
```
Entropy measures the randomness (exploration) in the policy.

Higher entropy = more exploration.

This acts as a regularizer to encourage exploration.
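
To get a feel for the numbers, the sketch below compares the entropy of a uniform policy over the 14 Kung Fu actions with a policy that almost always picks one action. The two distributions are made up, and `entropy` is a helper defined here for illustration:

```python
import numpy as np

def entropy(probs):
    # H(p) = -sum p * log p, skipping zero-probability entries
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

uniform = np.full(14, 1 / 14)                   # all 14 actions equally likely
peaked = np.array([0.97] + [0.03 / 13] * 13)    # almost always action 0
print(entropy(uniform))   # equals log(14), the maximum for 14 actions
print(entropy(peaked))    # much smaller
```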

#### 8. Log Probability of Chosen Actions

`logp_actions = logprobs[batch_idx, action]` gets the log probability of the action that was actually chosen (action) in each state of the batch.

This ensures the policy is only updated based on the actions that were taken.
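
The indexing `logprobs[batch_idx, action]` picks one entry per row. A minimal NumPy sketch with made-up probabilities for a batch of three states and three actions:

```python
import numpy as np

logprobs = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.1, 0.8],
                            [0.3, 0.4, 0.3]]))
action = np.array([0, 2, 1])                 # action taken in each state
batch_idx = np.arange(len(action))           # [0, 1, 2]
logp_actions = logprobs[batch_idx, action]   # picks entries (0,0), (1,2), (2,1)
print(np.exp(logp_actions))                  # probabilities 0.7, 0.8, 0.4
```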

#### 9. Actor Loss (Policy Update)
```
actor_loss = -(logp_actions * advantage.detach()).mean() - 0.001 * entropy.mean()
```
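`Policy Gradient:` The policy (actor) is updated to make actions with a positive advantage more likely and actions with a negative advantage less likely.

`advantage.detach()` treats the advantage as a constant, so the actor loss does not send gradients into the critic's value estimate.

The second term `(0.001 * entropy)` is subtracted from the loss, so minimizing the loss rewards higher entropy and encourages exploration.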
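
The sign convention can be checked with a tiny numeric example. This sketch evaluates the same expression with NumPy for a single made-up transition; `actor_loss` is a hypothetical helper and the advantage is passed in as a plain constant:

```python
import numpy as np

def actor_loss(logp_actions, advantage, entropy, entropy_coef=0.001):
    # Same form as the actor loss in step(); the advantage is treated as a constant
    return -(logp_actions * advantage).mean() - entropy_coef * entropy.mean()

advantage = np.array([2.0])   # the action turned out better than expected
entropy = np.array([1.0])
low = actor_loss(np.log(np.array([0.2])), advantage, entropy)
high = actor_loss(np.log(np.array([0.6])), advantage, entropy)
print(low, high)   # the loss is lower when the good action is more probable
```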


#### 10. Critic Loss (Value Function Update)
```
critic_loss = F.mse_loss(target_state_value.detach(), state_value)
```
The critic tries to minimize the difference between predicted value (state_value) and target value (target_state_value).

`detach()` stops gradients from flowing through the target, so the target is treated as a fixed number and only the prediction `state_value` is pulled toward it.
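
Because the loss compares target and prediction element by element, both should hold one value per state. A small sketch of how indexing and squeezing change the shape of a value-head output, with NumPy arrays standing in for tensors; the (4, 1) array is a made-up critic output for a batch of four states:

```python
import numpy as np

head_output = np.array([[0.5], [1.0], [1.5], [2.0]])   # shape (4, 1): one row per state
first_row = head_output[0]              # shape (1,): only the first state's value
per_state = head_output.squeeze(-1)     # shape (4,): one value per state
target = np.array([1.0, 1.0, 1.0, 1.0])

# Broadcasting silently stretches first_row to match the target
mse_first_row = ((target - first_row) ** 2).mean()
mse_per_state = ((target - per_state) ** 2).mean()
print(first_row.shape, per_state.shape)
print(mse_first_row, mse_per_state)
```

A shape mismatch of this kind does not raise an error, so printing the shapes of `state_value` and `target_state_value` once is a cheap sanity check.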

#### 11. Total Loss and Backpropagation

Total loss is the sum of actor and critic losses.

Gradients are cleared using `zero_grad()`.

Backpropagation computes gradients for all parameters of the shared network, covering both the actor and critic heads.

The optimizer update applies the gradients to update the network parameters.

### Initializing the A3C agent

In [None]:
agent = Agent(number_actions)

### Evaluating our A3C agent on a certain number of episodes

In [None]:
def evaluate(agent, env, n_episodes = 1):
  episodes_rewards = []
  for _ in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0
    while True:
      action = agent.act(state)
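      # Gymnasium's step returns (obs, reward, terminated, truncated, info)
      state, reward, terminated, truncated, _ = env.step(action[0])
      total_reward += reward
      if terminated or truncated: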
        break
    episodes_rewards.append(total_reward)
  return episodes_rewards

### Purpose of the evaluate Function
This function tests/evaluates the agent's performance in the environment without training.

It simulates n_episodes (by default, 1) and collects the total rewards for each episode.

The result helps to measure how well the agent performs after training.

#### 1. Initialize Rewards List
```
episodes_rewards = []
```
A list to store the total rewards collected in each episode.

#### 2. Loop Through Episodes

`for _ in range(n_episodes):`
Run the environment for the specified number of episodes (n_episodes).

_ is used because the loop variable isn’t needed.

#### 3. Reset the Environment
`env.reset()` initializes the environment at the start of each episode.

Returns the initial state and additional info (which is ignored with _).

#### 4. Initialize Reward Tracker
`total_reward = 0` starts a running total of the rewards collected in the current episode.

#### 5. Simulate the Environment
`action = agent.act(state)`

The agent picks an action based on the current state (using its policy).

`env.step(action[0])` performs the chosen action in the environment.

`action[0]:` `agent.act` returns an array with one action per state in the batch, so [0] extracts the single action for this environment.

Rewards are summed over the episode, and the loop stops when the episode finishes.

After each episode, the total reward is saved to the list.

At the end we return the total rewards for each episode.



### Managing multiple environments simultaneously

In [None]:
class EnvBatch:

  def __init__(self, n_envs = 10):
    self.envs = [make_env() for _ in range(n_envs)]

  def reset(self):
    _states = []
    for env in self.envs:
      _states.append(env.reset()[0])
    return np.array(_states)

  def step(self, actions):
    # Gymnasium's step returns (obs, reward, terminated, truncated, info)
    results = [env.step(a) for env, a in zip(self.envs, actions)]
    next_states, rewards, dones, truncateds, infos = map(np.array, zip(*results))
    for i in range(len(self.envs)):
      if dones[i] or truncateds[i]:
        next_states[i] = self.envs[i].reset()[0]
    return next_states, rewards, dones, infos

```
def __init__(self, n_envs = 10):
    self.envs = [make_env() for _ in range(n_envs)]
```
Creates Multiple Environments:

n_envs sets the number of environments (default is 10).

`make_env()` is called n_envs times to create a list of environments (self.envs).

This list stores individual environments to run in parallel.

`make_env()` is the function defined in the "Setting up the environment" section; it creates and returns one preprocessed environment instance.

#### 2. Resetting All Environments (reset method):
```
def reset(self):
    _states = []
    for env in self.envs:
      _states.append(env.reset()[0])
    return np.array(_states)
```
Purpose: Resets all environments to start a new episode.

How it Works:
Loops through each environment (self.envs) and calls env.reset().

The initial state of each environment is collected (env.reset()[0]).

_states stores the starting states of all environments.

Return:

Returns the states as a NumPy array – this allows easy batching and manipulation.

#### 3. Taking a Step in All Environments (step method):

Purpose: Simultaneously perform actions in all environments.

How it Works:

Action Execution:

`[env.step(a) for env, a in zip(self.envs, actions)]`

Loops over each environment (env) and performs the corresponding action (a).

In Gymnasium, `env.step(a)` returns `(next_state, reward, terminated, truncated, info)` for each environment.

Unpack and Batch Results:

`map(np.array, zip(*...))`

`zip(*...)` regroups the per-environment result tuples by field, so all next states end up together, all rewards together, and so on.

`map(np.array, ...)` converts each of these groups into a NumPy array for easy processing.
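
The `zip(*...)` regrouping is easier to see on a toy example. The sketch below uses three made-up step results in Gymnasium's five-field layout; the states are tiny placeholder arrays, not real Atari frames:

```python
import numpy as np

# Hypothetical step results from three environments:
# (next_state, reward, terminated, truncated, info)
results = [
    (np.zeros(2), 1.0, False, False, {}),
    (np.ones(2), 0.0, True, False, {}),
    (np.full(2, 2.0), 2.0, False, False, {}),
]
states, rewards, terminated, truncated, infos = map(np.array, zip(*results))
print(states.shape)    # (3, 2): one row per environment
print(rewards)         # one reward per environment
print(terminated)      # one flag per environment
```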

Handle Episode End (Resetting):

If an environment is done (episode ends), reset that environment immediately.

This ensures the environment continues to produce new data without interruption.

Return:

`next_states` – Next states of all environments.

`rewards` – Rewards from each environment.

`dones` – Indicates if each environment has completed its episode.

`infos` – Additional information (often unused).

### Training the A3C agent

In [None]:
import tqdm

env_batch = EnvBatch(number_environments)
batch_states = env_batch.reset()

with tqdm.trange(0, 10001) as progress_bar:
  for i in progress_bar:
    batch_actions = agent.act(batch_states)
    batch_next_states, batch_rewards, batch_dones, _ = env_batch.step(batch_actions)
    batch_rewards *= 0.01
    agent.step(batch_states, batch_actions, batch_rewards, batch_next_states, batch_dones)
    batch_states = batch_next_states
    if i % 1000 == 0:
      print("Average agent reward: ", np.mean(evaluate(agent, env, n_episodes = 10)))

  0%|          | 9/10001 [00:37<8:25:15,  3.03s/it]  

Average agent reward:  520.0


 10%|█         | 1009/10001 [01:27<2:18:34,  1.08it/s]

Average agent reward:  720.0


 20%|██        | 2008/10001 [02:10<2:10:27,  1.02it/s]

Average agent reward:  240.0


 30%|███       | 3009/10001 [02:56<1:47:16,  1.09it/s]

Average agent reward:  290.0


 40%|████      | 4008/10001 [03:38<1:24:59,  1.18it/s]

Average agent reward:  110.0


 50%|█████     | 5008/10001 [04:28<1:28:05,  1.06s/it]

Average agent reward:  1040.0


 60%|██████    | 6008/10001 [05:15<1:12:55,  1.10s/it]

Average agent reward:  290.0


 70%|███████   | 7006/10001 [06:03<1:06:25,  1.33s/it]

Average agent reward:  670.0


 80%|████████  | 8009/10001 [06:45<27:48,  1.19it/s]

Average agent reward:  150.0


 90%|█████████ | 9008/10001 [07:29<14:07,  1.17it/s]

Average agent reward:  370.0


100%|██████████| 10001/10001 [08:18<00:00, 20.08it/s]

Average agent reward:  550.0





`tqdm` provides a progress bar to visualize the training process.

env_batch is an instance of EnvBatch with number_environments parallel environments.

`batch_states = env_batch.reset()` resets all environments and returns the initial states as a batch (NumPy array).

`tqdm.trange(0, 10001)` creates a loop from 0 to 10000 with a progress bar.

The progress_bar updates every iteration to show progress during training.

i is the current training step (out of 10001 steps).

`batch_actions = agent.act(batch_states)`

The agent selects actions for all environments in parallel using the current batch of states (batch_states).

agent.act returns an array of actions (batch_actions) for each environment.

`env_batch.step(batch_actions)` applies the selected actions across all environments.

Returns:

`batch_next_states` – Next states of each environment.

`batch_rewards` – Rewards from the actions.

`batch_dones` – Flags indicating if an episode has ended in each environment.

`_` – Additional info (unused here).

`batch_rewards *= 0.01`

Rewards are scaled down by a factor of 100 (multiplied by 0.01).
This keeps the value targets and gradients small, which helps stabilize training.

The agent learns from the batch experience:

Current state (batch_states)

Actions taken (batch_actions)

Rewards (batch_rewards)

Next states (batch_next_states)

Done flags (batch_dones)

agent.step updates the policy/network using this batch of experiences.

Every 1000 steps:
The agent is evaluated by running evaluate(agent, env, n_episodes=10).

`np.mean` calculates the average reward over 10 evaluation episodes.
This tracks the agent’s performance over time.






## Part 3 - Visualizing the results

In [None]:
import glob
import io
import base64
import imageio
from IPython.display import HTML, display

def show_video_of_model(agent, env):
  state, _ = env.reset()
  done = False
  frames = []
  while not done:
    frame = env.render()
    frames.append(frame)
    action = agent.act(state)
    state, reward, terminated, truncated, _ = env.step(action[0])
    done = terminated or truncated
  env.close()
  imageio.mimsave('video.mp4', frames, fps=30)

show_video_of_model(agent, env)

def show_video():
    mp4list = glob.glob('*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

show_video()

