<a href="https://colab.research.google.com/github/Kshitij04Poojary/Iterated-Prisoners-Dilemma/blob/main/DQNIPD_Explanations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random

The following cell defines a simple feed-forward neural network with two linear layers, built with PyTorch. This is the part we need to edit: a recurrent architecture such as an RNN or an LSTM can be integrated here in place of the two-layer network.

In [None]:
class DQN(nn.Module):
  def __init__(self,input_size,hidden_size,output_size):
    super(DQN,self).__init__()
    self.fc1 = nn.Linear(input_size,hidden_size)
    self.fc2 = nn.Linear(hidden_size,output_size)

  def forward(self,x):
    x = torch.relu(self.fc1(x))     #applies the ReLU (Rectified Linear Unit) activation to the first layer's output, which adds the non-linearity the network needs
    x = self.fc2(x)                 #output layer has no activation function because Q-values are unbounded real numbers
    return x
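
A recurrent replacement for the two-layer network could look roughly like the sketch below. The class name `RecurrentDQN`, the single LSTM layer followed by a linear head, and the history length of 10 rounds are all assumptions made for illustration; none of them come from this notebook. The input is a sequence of past round encodings rather than a single state vector.

```python
import torch
import torch.nn as nn

class RecurrentDQN(nn.Module):
    """Hypothetical LSTM-based Q-network (illustrative sketch only)."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # batch_first=True means inputs are shaped (batch, seq_len, input_size)
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(x, hidden)
        # use the LSTM output at the last time step to predict the Q-values
        q_values = self.head(out[:, -1, :])
        return q_values, hidden

net = RecurrentDQN(input_size=5, hidden_size=64, output_size=2)
history = torch.zeros(1, 10, 5)   # one game, 10 past rounds, 5 features each (assumed)
q_values, _ = net(history)
print(q_values.shape)
```

With this layout, the rest of the pipeline would have to store and sample sequences of rounds instead of single transitions.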

The replay buffer is a fixed-capacity circular buffer (ring buffer): once it is full, each new entry overwrites the oldest one, much like a deque with a maximum length. It is a data structure used to store and access the transitions that the agent encounters. Each transition has four components: state (the current state), action (the action taken by the agent in the current state), reward (the reward received immediately by the agent for that action), and next_state (the state observed after the action). The whole point of the replay buffer is that it allows the agent to learn from its previous experience. Its sample function picks random transitions, as many as the batch size, and these are then used to update the parameters of the agent's neural network. Sampling at random reduces the bias toward recent, highly correlated experiences and lets the agent learn from diverse experiences.

In [None]:
class ReplayBuffer:
  def __init__(self,cap):
    self.buffer = []
    self.cap = cap      #cap is the capacity of the buffer
    self.pos = 0        #index of the slot the next transition is written to
  def push(self,state,action,reward,next_state):  #push function to store a new transition in the buffer
    if len(self.buffer) < self.cap:
      self.buffer.append(None)          #grow the list until it reaches capacity
    self.buffer[self.pos] = (state,action,reward,next_state)
    self.pos = (self.pos + 1) % self.cap   #wrap around so that the oldest transition is overwritten next
  def sample(self,batch):             #draws a random batch of transitions and regroups them into (states, actions, rewards, next_states)
    return zip(*random.sample(self.buffer,batch))
  def __len__(self):                  #lets len(replay_buffer) work
    return len(self.buffer)
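
Since the buffer behaves like a bounded deque, the same idea can be sketched with `collections.deque`, which drops the oldest item automatically once `maxlen` is reached. `DequeReplayBuffer` is a hypothetical name used only for this illustration; it is an alternative sketch, not the class used in the rest of this notebook.

```python
import random
from collections import deque

class DequeReplayBuffer:
    """Hypothetical replay buffer backed by a bounded deque (illustrative sketch)."""
    def __init__(self, cap):
        self.buffer = deque(maxlen=cap)   # oldest entries fall off when full

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch):
        # copy to a list so random.sample works on a plain sequence
        return zip(*random.sample(list(self.buffer), batch))

    def __len__(self):
        return len(self.buffer)

buf = DequeReplayBuffer(cap=3)
for i in range(5):
    buf.push([i], i % 2, float(i), [i + 1])

# only the three most recent transitions remain
print(len(buf), [t[0] for t in buf.buffer])
states, actions, rewards, next_states = buf.sample(2)
```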


In [None]:
#Hyperparameters
input_size = 5  # State representation size
hidden_size = 64  # Hidden layer size
output_size = 2  # Number of actions
batch = 64
gamma = 0.99  # Discount factor
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 0.995
target_update = 10  # Update target network every 10 steps
num_episodes = 200

The payoff matrix holds the reward each player receives for every pair of actions. In a one-shot (non-iterated) prisoner's dilemma, the highest individual payoff goes to the player who defects while the other cooperates, and the rewards are assigned with this in mind. Rows index the agent's action (player 1) and columns the opponent's action (player 2), with 0 = cooperate and 1 = defect. The values below are the ones that reward_for_action returns for each pair of actions:

Both Cooperate : Agent - 3, Opp - 3

Agent Defects, Opp Cooperates : Agent - 5, Opp - 0

Agent Cooperates, Opp Defects : Agent - 0, Opp - 5

Both Defect : Agent - 1, Opp - 1

In [None]:
class IPD:
  def __init__(self):
    self.actions = 2 #Cooperate (0) or Defect (1)
    self.payoff = np.array([[3, 0], [5, 1]])  #payoff[own_action][other_action]

  def reward_for_action(self,action1,action2):
    reward1 = self.payoff[action1][action2]   #agent's reward
    reward2 = self.payoff[action2][action1]   #opponent's reward (same matrix with the roles swapped)
    return reward1,reward2
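
To make the dilemma concrete, the sketch below takes the values that `reward_for_action` returns for actions 0 and 1 and checks which action is the best response to each opponent move. The dictionary `PAYOFF`, the helper `best_response`, and the encoding 0 = cooperate, 1 = defect are assumptions introduced for this illustration.

```python
COOPERATE, DEFECT = 0, 1   # assumed action encoding

# (agent_action, opp_action) -> (agent_reward, opp_reward)
PAYOFF = {
    (COOPERATE, COOPERATE): (3, 3),
    (COOPERATE, DEFECT): (0, 5),
    (DEFECT, COOPERATE): (5, 0),
    (DEFECT, DEFECT): (1, 1),
}

def best_response(opp_action):
    """Return the agent action with the highest agent reward against opp_action."""
    return max((COOPERATE, DEFECT), key=lambda a: PAYOFF[(a, opp_action)][0])

# best response to a cooperating and to a defecting opponent
dominant = [best_response(COOPERATE), best_response(DEFECT)]

# combined payoff of both players under mutual cooperation and mutual defection
mutual_coop = sum(PAYOFF[(COOPERATE, COOPERATE)])
mutual_defect = sum(PAYOFF[(DEFECT, DEFECT)])
print(dominant, mutual_coop, mutual_defect)
```

Defecting is the best response in both cases, yet mutual defection leaves both players worse off than mutual cooperation, which is what makes the iterated game interesting to learn.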

The first step of the DQN algorithm is to create a policy network and a target network with identical architectures. The target network starts as an exact copy of the policy network's weights.

In [None]:
policy_network = DQN(input_size,hidden_size,output_size)
target_network = DQN(input_size,hidden_size,output_size)
target_network.load_state_dict(policy_network.state_dict())
target_network.eval()    #eval mode turns off training-only behaviour such as dropout and batch-norm updates; this network has neither, so it mainly signals that the target network is never trained directly
optimizer = optim.Adam(policy_network.parameters(),lr=0.001)  #optimizer for the policy network params
replay_buffer = ReplayBuffer(cap = 1000)

The segment below is the epsilon-greedy action selection step of the DQN algorithm. Epsilon is the "exploration" parameter. We draw a random number and check whether it is less than epsilon; if it is, we explore and pick a random action. Otherwise, we exploit what the network has already learned: we compute the Q-values of all actions with the policy network and pick the action with the highest Q-value.

In [None]:
def select_action(state,epsilon):
  if np.random.rand() < epsilon:  #exploration
    return np.random.randint(output_size) #pick random action
  else:
    with torch.no_grad():   #exploitation
      q_values = policy_network(torch.tensor(state,dtype = torch.float32))  #converts state into a tensor and calculates its q values (torch.no_grad skips gradient tracking, which saves time and memory)
      return q_values.argmax().item() #argmax gives the index of the highest q value as a tensor; item() converts it to a plain Python int, which is returned as the action

The step below updates the Q-values predicted by the policy network (the target network is updated later, at fixed intervals). We sample a mini batch from the replay buffer and convert it to tensors. We then calculate the q_values (current state), the next_q_values (next state, from the target network) and the expected Q-values using the Bellman equation. We then calculate the loss as the mean squared error between the Q-values of the actions actually taken by the agent and the expected Q-values. Back propagation computes the gradient of this loss, and the optimizer uses it to adjust the parameters of the policy network in the direction that reduces the loss.

In [None]:
def update_q_values():
  if len(replay_buffer.buffer) > batch:  #checks if there are enough transitions in the replay buffer to make a mini batch
    state, action, reward, next_state = replay_buffer.sample(batch)
    state = torch.tensor(state, dtype = torch.float32)  #convert state to tensor
    action = torch.tensor(action, dtype = torch.long)   #actions are converted into a long tensor because they are integer indices (0 or 1)
    reward = torch.tensor(reward, dtype = torch.float32)
    next_state = torch.tensor(next_state, dtype = torch.float32)

    q_vals = policy_network(state)   #calculate q values of current state
    next_q_vals = target_network(next_state).max(1)[0].detach()    #maximum q value of the next state, taken from the target network
    expected_q_vals = reward + gamma * next_q_vals   #Bellman equation to find expected q values
    #action.unsqueeze(1) creates a column tensor of actions; gather(1, ...) then picks, in each row, the q value of the action actually taken
    #expected_q_vals.unsqueeze(1) creates a matching column tensor of expected q values
    loss = nn.functional.mse_loss(q_vals.gather(1,action.unsqueeze(1)),expected_q_vals.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()    #back propagation computes the gradient of the loss
    optimizer.step()   #takes one gradient step on the policy network params to reduce the loss
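
The Bellman target and the loss can be traced by hand for a single transition. The numbers below (a reward of 3, next-state Q-values of 1.5 and 2.0, and a predicted Q-value of 4.0) are made up purely for illustration.

```python
import math

gamma = 0.99                 # discount factor, same value as in the hyperparameters
reward = 3.0                 # illustrative reward for the action taken
next_q_vals = [1.5, 2.0]     # illustrative target-network q values for the next state
predicted_q = 4.0            # illustrative policy-network q value for the action taken

# Bellman target: immediate reward plus discounted best next-state value
expected_q = reward + gamma * max(next_q_vals)   # 3.0 + 0.99 * 2.0

# squared error for this one transition; mse_loss averages this over the batch
loss = (predicted_q - expected_q) ** 2
print(round(expected_q, 4), round(loss, 4))
```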

Below is the main training loop.

epsilon = epsilon_end + (epsilon_start-epsilon_end)*np.exp(-episode/epsilon_decay)

In the above formula, epsilon_end is the minimum exploration rate, the value that epsilon approaches as training goes on (0.01 here). epsilon_start is the initial value of epsilon, which represents the maximum exploration rate (1.0 here). epsilon_decay controls how quickly epsilon decreases over time, so that the network can move from exploring new actions to exploiting its previous knowledge. Because the episode number is divided by it, a larger epsilon_decay means a slower decay; with epsilon_decay = 0.995, epsilon is already close to epsilon_end after about five episodes.

np.exp(-episode/epsilon_decay): This is the exponential decay factor. Its exponent is negative and becomes more negative as the episode number increases, so the factor itself stays positive, starting at 1 and shrinking toward 0.

(epsilon_start-epsilon_end)*np.exp(-episode/epsilon_decay): This scales the decay factor by the gap between epsilon_start and epsilon_end, so the product starts at that gap and shrinks toward 0.

epsilon_end + (epsilon_start-epsilon_end)*np.exp(-episode/epsilon_decay): Adding epsilon_end shifts the result so that epsilon starts at epsilon_start, which allows plenty of exploration in the early stages of training, and decreases toward epsilon_end rather than toward zero.
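
The effect of epsilon_decay is easiest to see by tabulating the schedule. The helper `epsilon_at` and the comparison value of 50 are assumptions introduced for this sketch; only 0.995 comes from the hyperparameters in this notebook.

```python
import math

def epsilon_at(episode, eps_start=1.0, eps_end=0.01, decay=0.995):
    """Exploration rate for a given episode, using the formula from the text."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode / decay)

# compare the notebook's value with a much larger, hypothetical time constant
for decay in (0.995, 50.0):
    schedule = [round(epsilon_at(ep, decay=decay), 3) for ep in (0, 1, 5, 50, 199)]
    print(f"decay={decay}: {schedule}")
```

With 0.995 the exploration rate drops below 0.02 within five episodes, while a time constant of 50 keeps a substantial amount of exploration for roughly the first hundred episodes.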

In [None]:
ipd = IPD()  #object of our IPD env
for episode in range(num_episodes):
  state = [0,0,0,0,0]  #initialization; must have input_size (5) entries
  total_reward = 0
  epsilon = epsilon_end + (epsilon_start-epsilon_end)*np.exp(-episode/epsilon_decay)  #depends only on the episode number
  for t in range(100):
    action = select_action(state,epsilon)  #agent action selection
    opp_action = np.random.randint(2)      #opponent action selection
    reward, opp_reward = ipd.reward_for_action(action,opp_action)
    next_state = [action,opp_action,reward,opp_reward,0]
    replay_buffer.push(state,action,reward,next_state)  #push the transition into the replay buffer
    state = next_state
    total_reward = total_reward + reward
    update_q_values()
    if t % target_update == 0:
      target_network.load_state_dict(policy_network.state_dict())  #copy the policy network params to the target network periodically (synchronization), at the interval set by target_update
  print(f"Episode {episode + 1}, Total Reward: {total_reward}")  #printed once per episode

The cell below simulates games against a random opponent strategy for the IPD and tests the trained agent against it.

In [None]:
def play_IPD_random(policy_network,num_episodes):
  ipd = IPD()
  total_rewards = []
  for episode in range(num_episodes):
    state = [0,0,0,0,0]   #must have input_size (5) entries
    total_reward=0
    for t in range(100):
      epsilon = epsilon_end + (epsilon_start-epsilon_end)*np.exp(-episode/epsilon_decay)
      action = select_action(state,epsilon)   #note: select_action reads the global policy_network
      opp_action = np.random.randint(2)  #random opponent, this will change as we compare other strategies
      reward , _ = ipd.reward_for_action(action,opp_action)  #we are only concerned with the agent's rewards, not the opponent's
      total_reward = total_reward+reward
      next_state = [action,opp_action,reward,0,0]
      state = next_state
    total_rewards.append(total_reward)   #one total per episode
  return total_rewards


def test_against_random(policy_network, num_episodes):
  random_rewards = play_IPD_random(policy_network, num_episodes)
  mean_reward = np.mean(random_rewards)  #finding mean reward for agent
  print("Average reward against random strategy:", mean_reward)

test_against_random(policy_network, num_episodes=100)
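
The random opponent used above could later be swapped for a fixed strategy. As one possible example, the sketch below implements tit-for-tat, a well-known IPD strategy that cooperates first and then repeats the other player's previous move. The function name `tit_for_tat`, the history list, and the encoding 0 = cooperate, 1 = defect are assumptions for illustration and are not part of this notebook.

```python
COOPERATE, DEFECT = 0, 1   # assumed action encoding

def tit_for_tat(agent_history):
    """Cooperate on the first round, then copy the agent's previous action."""
    return COOPERATE if not agent_history else agent_history[-1]

# replay a short, made-up sequence of agent moves against the strategy
agent_moves = [DEFECT, COOPERATE, COOPERATE, DEFECT]
history, opp_moves = [], []
for move in agent_moves:
    opp_moves.append(tit_for_tat(history))   # opponent only sees earlier rounds
    history.append(move)
print(opp_moves)
```

Inside the evaluation loop, a call like this would take the place of `np.random.randint(2)`, with the agent's past actions collected in a list per episode.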

The following things need to be worked on after this:
1.  integrate an RNN/LSTM
2.  opponent strategy (in training)
3.  testing algorithms/strategies (for evaluation)
4.  hyperparameter optimization
5.  maybe integrate data?

Resources : ChatGPT


https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

https://www.youtube.com/watch?v=t3fbETsIBCY


