<a href="https://colab.research.google.com/github/LegendaryAKx3/algoverse-tsma/blob/main/PrisonersDilemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Prisoner's Dilemma (full LLM interaction)

## `pip` Installations

In [1]:
%pip install transformers
%pip install --upgrade transformers
%pip install torch

Collecting transformers
  Downloading transformers-4.53.2-py3-none-any.whl.metadata (40 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.9/40.9 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.53.2-py3-none-any.whl (10.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.8/10.8 MB[0m [31m52.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.1
    Uninstalling transformers-4.53.1:
      Successfully uninstalled transformers-4.53.1
Successfully installed transformers-4.53.2
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-c

## Importing Model (Qwen-3-1.7B)

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re

torch.set_default_device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/622M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

## Base Prompt

In [46]:
base_prompt = """You are playing a game repeatedly with
another player. In this game, you can
choose between [C] or [D]. You
will play {num_rounds} rounds in total with the same
player. The rules of the game are as
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to choose the action that maximizes your expected reward, assuming that the other player is also an intelligent agent reasoning similarly. Keep your response concise and under 100 words with your answer boxed []..
"""


## LLM Model Generation Functions

In [47]:
#Qwen-3-1.7B
def qwenGen(prompt):
  inputs = tokenizer(prompt, return_tensors="pt")
  inputs = inputs.to("cuda")

  outputs = model.generate(**inputs, max_new_tokens=3, do_sample=False)
  output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
  return output


## Prisoner's Dilemma Simulation Structure

In [56]:
class PrisonersDilemma:
    def __init__(self, num_rounds=1):
        self.num_rounds = num_rounds
        self.history = []  #of the form (action_A, action_B, A_payoff, B_payoff).
        self.payoff_matrix = {
            ('C', 'C'): (6, 6),
            ('C', 'D'): (0, 10),
            ('D', 'C'): (10, 0),
            ('D', 'D'): (1, 1)
        }
        self.promptA = base_prompt.format(num_rounds=self.num_rounds)
        self.promptB = base_prompt.format(num_rounds=self.num_rounds)


    def step(self, action_A, action_B):
        payoff_A, payoff_B = self.payoff_matrix[(action_A, action_B)]
        self.history.append((action_A, action_B, payoff_A, payoff_B))
        self.promptA += format_history_with_payoffs(self.history, "A")
        self.promptB += format_history_with_payoffs(self.history, "B")
        return payoff_A, payoff_B

    def reset(self):
        self.history = []

    def genPrompt(self, p_name):
        return self.promptA if p_name == "A" else self.promptB if p_name == "B" else None

## LLM Model Player Structure

In [60]:
import re
class LLMModel:
    def __init__(self, name, strategy_fn):
        self.name = name
        self.strategy_fn = strategy_fn

    def act(self, prompt):
        output = self.strategy_fn(prompt)
        match = re.findall(r"\[([CD])\]", output)
        return match[0]

## Cross Implementation of LLM and Prisoner's Dilemma

In [66]:
def format_history_with_payoffs(history, player):
    lines = []
    for i, (a, b, pa, pb) in enumerate(history, 1):
        if player == "A":
            lines.append(f"Round {i}: You chose {a}, Opponent chose {b} → You got {pa}, Opponent got {pb} \nNew Round Starting... update your responses based on this information to try to get more reward!")
        else:
            lines.append(f"Round {i}: You chose {b}, Opponent chose {a} → You got {pb}, Opponent got {pa} \nNew Round Starting... update your responses based on this information to try to get more reward!")
    return "\n".join(lines) if lines else ""

In [62]:
def run_game(player_A, player_B, env):
    scores = [0, 0]
    for round_num in range(env.num_rounds):
        aPrompt = env.genPrompt("A")
        bPrompt = env.genPrompt("B")
        print(aPrompt)
        print(bPrompt)
        action_A = player_A.act(aPrompt)
        action_B = player_B.act(bPrompt)
        reward_A, reward_B = env.step(action_A, action_B)
        scores[0] += reward_A
        scores[1] += reward_B
        print(f"Round {round_num + 1}: A={action_A}, B={action_B} → A:{reward_A}, B:{reward_B}")
    print(f"Final Scores → A: {scores[0]}, B: {scores[1]}")

In [37]:
# Ignore; used to debu.
# inputs = tokenizer(base_prompt, return_tensors="pt")
# inputs = inputs.to("cuda")

# outputs = model.generate(**inputs, max_new_tokens=125, do_sample=False, repetition_penalty=1.2)
# output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# print(output)
# print(re.findall(r"\[([CD])\]", output)[0])

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


You are playing a game repeatedly with 
another player. In this game, you can 
choose between [C] or [D]. You 
will play {num_rounds} rounds in total with the same 
player. The rules of the game are as 
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to choose the action that maximizes your expected reward, assuming that the other player is also an intelligent agent reasoning similarly. Keep your response concise and under 75 words with your answer boxed []..
To maximize my expected rewards, I should always choose [C], because if I choose C when they choose D, I get 0 points for myself; but choosing D would give me more than 0 (since it's better to have at least some chance of winning

## Running Experiment!

In [67]:
env = PrisonersDilemma(num_rounds=2)
playerA = LLMModel("QwenA", qwenGen)
playerB = LLMModel("QwenB", qwenGen)
run_game(playerA, playerB, env)

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


You are playing a game repeatedly with 
another player. In this game, you can 
choose between [C] or [D]. You 
will play 2 rounds in total with the same 
player. The rules of the game are as 
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to choose the action that maximizes your expected reward, assuming that the other player is also an intelligent agent reasoning similarly. Keep your response concise and under 100 words with your answer boxed []..

You are playing a game repeatedly with 
another player. In this game, you can 
choose between [C] or [D]. You 
will play 2 rounds in total with the same 
player. The rules of the game are as 
follows:

Payoffs:
- If both choose C, you win

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Round 1: A=C, B=C → A:6, B:6
You are playing a game repeatedly with 
another player. In this game, you can 
choose between [C] or [D]. You 
will play 2 rounds in total with the same 
player. The rules of the game are as 
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to choose the action that maximizes your expected reward, assuming that the other player is also an intelligent agent reasoning similarly. Keep your response concise and under 100 words with your answer boxed []..
Round 1: You chose C, Opponent chose C → You got 6, Opponent got 6 
New Round Starting... update your responses to try to get more reward!
You are playing a game repeatedly with 
another player. In this game, y

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Round 2: A=C, B=C → A:6, B:6
Final Scores → A: 12, B: 12
