<a href="https://colab.research.google.com/github/LegendaryAKx3/algoverse-tsma/blob/main/PrisonersDilemma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Prisoner's Dilemma (full LLM interaction)

## `pip` Installations

In [None]:
%pip install --upgrade transformers
%pip install torch

## Importing Model (Qwen3-1.7B)

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
# Move the weights to the selected device explicitly so they match the inputs.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B").to(device)

Loading checkpoint shards: 100%|██████████| 2/2 [00:18<00:00,  9.07s/it]


## Base Prompt

In [43]:
base_prompt = """You are playing a game repeatedly with
another player. In this game, you can
choose between [C] or [D]. You
will play {num_rounds} rounds in total with the same
player. The rules of the game are as
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to choose the action that maximizes your expected reward, assuming that the other player is also an intelligent agent reasoning similarly on your first turn and then adapting to them accordingly. Keep your response concise and under 50 words with your answer boxed [].
"""



## LLM Model Generation Functions

In [46]:
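# Qwen3-1.7B
def qwenGen(prompt):
  inputs = tokenizer(prompt, return_tensors="pt").to(device)

  outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, repetition_penalty=1.2, temperature=0.7, top_p=0.9)
  # generate() returns the prompt tokens followed by the completion. Keep only
  # the new tokens so the "[C]"/"[D]" in the prompt is not read as the answer.
  new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
  return tokenizer.decode(new_tokens, skip_special_tokens=True)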


## Prisoner's Dilemma Simulation Structure

In [20]:
class PrisonersDilemma:
    def __init__(self, num_rounds=1):
        self.num_rounds = num_rounds
        self.history = []  #of the form (action_A, action_B, A_payoff, B_payoff).
        self.payoff_matrix = {
            ('C', 'C'): (6, 6),
            ('C', 'D'): (0, 10),
            ('D', 'C'): (10, 0),
            ('D', 'D'): (1, 1)
        }
        self.promptA = base_prompt.format(num_rounds=self.num_rounds)
        self.promptB = base_prompt.format(num_rounds=self.num_rounds)


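    def step(self, action_A, action_B):
        payoff_A, payoff_B = self.payoff_matrix[(action_A, action_B)]
        self.history.append((action_A, action_B, payoff_A, payoff_B))
        # Rebuild each prompt from the base text plus the full history.
        # Appending the full history every round would repeat earlier rounds.
        base = base_prompt.format(num_rounds=self.num_rounds)
        self.promptA = base + format_history_with_payoffs(self.history, "A")
        self.promptB = base + format_history_with_payoffs(self.history, "B")
        return payoff_A, payoff_B

    def reset(self):
        # Clear the history and restore both prompts to the base text.
        self.history = []
        self.promptA = base_prompt.format(num_rounds=self.num_rounds)
        self.promptB = base_prompt.format(num_rounds=self.num_rounds)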

    def genPrompt(self, p_name):
        return self.promptA if p_name == "A" else self.promptB if p_name == "B" else None
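
The payoff matrix above can be checked for the defining property of a Prisoner's Dilemma: defecting is the best reply to either opponent action in a single round, even though mutual cooperation pays more than mutual defection. The following is a minimal sketch of that check. The names `MY_PAYOFF`, `best_response`, and `is_dominant` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
# Payoffs copied from the matrix above: (my_action, other_action) -> my points.
MY_PAYOFF = {
    ("C", "C"): 6,
    ("C", "D"): 0,
    ("D", "C"): 10,
    ("D", "D"): 1,
}

def best_response(other_action):
    """Return the action that maximizes my one-round payoff against other_action."""
    return max("CD", key=lambda mine: MY_PAYOFF[(mine, other_action)])

def is_dominant(action):
    """True if `action` is the best response to every opponent action."""
    return all(best_response(other) == action for other in "CD")

print({other: best_response(other) for other in "CD"})  # {'C': 'D', 'D': 'D'}
print("D dominant:", is_dominant("D"))                   # D dominant: True
print("Mutual C beats mutual D:", MY_PAYOFF[("C", "C")] > MY_PAYOFF[("D", "D")])  # True
```

Under these payoffs, a player reasoning about one round in isolation picks D, which is one reference point for interpreting the model's choices in the repeated game.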

## LLM Model Player Structure

In [42]:
import re
class LLMModel:
    def __init__(self, name, strategy_fn):
        self.name = name
        self.strategy_fn = strategy_fn

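    def act(self, prompt):
        output = self.strategy_fn(prompt)
        # Accept only the two legal actions and take the last boxed answer.
        matches = re.findall(r"\[([CD])\]", output)
        if not matches:
            raise ValueError(f"{self.name}: no boxed [C] or [D] found in model output")
        return matches[-1]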
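
One pitfall with regex-based parsing here: the prompt itself contains `[C]` and `[D]`, so if the text being scanned includes an echo of the prompt, the last match can come from the instructions rather than from the model's reply. The sketch below illustrates the effect with plain strings. `parse_action` and `ACTION_PATTERN` are hypothetical helpers for illustration, not part of the notebook, and the example strings are made up.

```python
import re

# Hypothetical helper, not part of the notebook: returns "C", "D", or None.
ACTION_PATTERN = re.compile(r"\[([CD])\]")

def parse_action(text, prompt=""):
    # Drop the echoed prompt, if present, so only the model's reply is scanned.
    reply = text[len(prompt):] if prompt and text.startswith(prompt) else text
    matches = ACTION_PATTERN.findall(reply)
    return matches[-1] if matches else None

prompt = "Choose between [C] or [D]."
echoed = prompt + " I think cooperating is best."

print(parse_action(echoed))          # D  (picked up from the prompt, not the reply)
print(parse_action(echoed, prompt))  # None  (the reply has no boxed answer)
print(parse_action(prompt + " My answer is [C].", prompt))  # C
```

Returning `None` instead of raising is a design choice for the sketch: the caller can then decide whether to retry generation or abort the round.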

## Cross Implementation of LLM and Prisoner's Dilemma

In [22]:
def format_history_with_payoffs(history, player):
    lines = []
    for i, (a, b, pa, pb) in enumerate(history, 1):
        if player == "A":
            lines.append(f"Round {i}: You chose {a}, Opponent chose {b} → You got {pa}, Opponent got {pb} \nNew Round Starting... update your responses based on this information to try to get more reward!")
        else:
            lines.append(f"Round {i}: You chose {b}, Opponent chose {a} → You got {pb}, Opponent got {pa} \nNew Round Starting... update your responses based on this information to try to get more reward!")
    return "\n".join(lines) if lines else ""

In [23]:
def run_game(player_A, player_B, env):
    scores = [0, 0]
    for round_num in range(env.num_rounds):
        aPrompt = env.genPrompt("A")
        bPrompt = env.genPrompt("B")
        # print(aPrompt)
        # print(bPrompt)
        action_A = player_A.act(aPrompt)
        action_B = player_B.act(bPrompt)
        reward_A, reward_B = env.step(action_A, action_B)
        scores[0] += reward_A
        scores[1] += reward_B
        print(f"Round {round_num + 1}: A={action_A}, B={action_B} → A:{reward_A}, B:{reward_B}")
    print(f"Final Scores → A: {scores[0]}, B: {scores[1]}")

In [None]:
# Ignore; used to debug.
# inputs = tokenizer(base_prompt, return_tensors="pt")
# inputs = inputs.to("cpu")

# outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False, repetition_penalty=1.2)
# output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# print(output)
# print(re.findall(r"\[([A-Z])\]", output)[-1])

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


You are playing a game repeatedly with
another player. In this game, you can
choose between [C] or [D]. You
will play {num_rounds} rounds in total with the same
player. The rules of the game are as
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to ALWAY DEFECT. Keep your response concise and under 50 words with your answer boxed [D] DONT ASK YOURSELF ANOTHER QUESTION.
Okay, so I need to figure out whether to always defect (D) or sometimes cooperate (C) in this game where we have two players choosing either C or D each round. Let me think through the payoffs first.

The problem says that if both choose C, we get 6-6 points. But if one chooses C and the other D, then the person who cho

In [45]:
print(base_prompt.format(num_rounds=10))

You are playing a game repeatedly with
another player. In this game, you can
choose between [C] or [D]. You
will play 10 rounds in total with the same
player. The rules of the game are as
follows:

Payoffs:
- If both choose C, you win 6 points and the other player wins 6 points.
- If you choose C and the other chooses D, you win 0 points and the other wins 10 points.
- If you choose D and the other chooses C, you win 10 points and the other wins 0 points.
- If both choose D, you win 1 point and the other wins 1 point.

Your objective is to choose the action that maximizes your expected reward, assuming that the other player is also an intelligent agent reasoning similarly on your first turn and then adpating to them accordingly. Keep your response concise and under 50 words with your answer boxed [].



## Running Experiment!

In [47]:
env = PrisonersDilemma(num_rounds=2)
playerA = LLMModel("QwenA", qwenGen)
playerB = LLMModel("QwenB", qwenGen)
run_game(playerA, playerB, env)

Round 1: A=D, B=D → A:1, B:1
Round 2: A=D, B=D → A:1, B:1
Final Scores → A: 2, B: 2
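
For context on the scores above, scripted baselines can provide reference points without calling the model. The sketch below is an assumption-laden illustration: `always_defect`, `tit_for_tat`, and `play` are hypothetical names that do not exist in this notebook, and the loop is a simplified stand-in for `run_game` that uses the same payoff values.

```python
# Hypothetical scripted baselines; none of these names exist in the notebook.
PAYOFFS = {
    ("C", "C"): (6, 6),
    ("C", "D"): (0, 10),
    ("D", "C"): (10, 0),
    ("D", "D"): (1, 1),
}

def always_defect(own_moves, other_moves):
    return "D"

def tit_for_tat(own_moves, other_moves):
    # Cooperate first, then copy the opponent's previous move.
    return other_moves[-1] if other_moves else "C"

def play(strategy_a, strategy_b, num_rounds):
    """Play num_rounds and return the total scores [score_a, score_b]."""
    moves_a, moves_b, scores = [], [], [0, 0]
    for _ in range(num_rounds):
        a = strategy_a(moves_a, moves_b)
        b = strategy_b(moves_b, moves_a)
        pay_a, pay_b = PAYOFFS[(a, b)]
        moves_a.append(a)
        moves_b.append(b)
        scores[0] += pay_a
        scores[1] += pay_b
    return scores

print(play(always_defect, always_defect, 2))  # [2, 2]
print(play(tit_for_tat, tit_for_tat, 2))      # [12, 12]
print(play(tit_for_tat, always_defect, 2))    # [1, 11]
```

Two always-defect players reproduce the 2-to-2 result recorded above, while two tit-for-tat players would each reach 12 over the same two rounds.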
