# Prisoner's Dilemma Background

The **Prisoner's Dilemma** is a classic example in game theory where two rational individuals may not cooperate, even though cooperation would yield a better outcome for both.

## Scenario:
Two suspects, **A** and **B**, are arrested for a crime. They are held in separate cells, unable to communicate with each other. The prosecutor offers each prisoner a deal:

- **If both prisoners remain silent (cooperate with each other)**, they each get a light sentence of 1 year.
- **If both prisoners betray each other (defect)**, they each get 3 years in prison.
- **If one betrays and the other remains silent**, the betrayer goes free (0 years), while the silent prisoner gets 5 years.

## Payoff Matrix:

|               | **B: Silent**           | **B: Betray**           |
|---------------|-------------------------|-------------------------|
| **A: Silent** | A: 1 year, B: 1 year     | A: 5 years, B: 0 years  |
| **A: Betray** | A: 0 years, B: 5 years   | A: 3 years, B: 3 years  |

## Key Points:
- **Best collective outcome:** Both remain silent, each getting 1 year.
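- **Dominant strategy (rational choice):** Each prisoner's individually rational choice is to betray, because betraying yields a shorter sentence whatever the other does: 0 years instead of 1 if the other stays silent, and 3 years instead of 5 if the other betrays.
- **Paradox:** Although both prisoners would be better off cooperating, following the dominant strategy leads both to betray and serve 3 years each, a worse outcome than the 1 year each from mutual silence.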

The Prisoner’s Dilemma highlights how individual rational choices can lead to worse outcomes for all parties involved, even when mutual cooperation is more beneficial.
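
The dominance argument can be checked mechanically from the payoff matrix. The following is a minimal sketch, not part of the experiment code: the names `SENTENCES`, `best_response_for_a`, and `is_dominant_for_a` are made up for illustration, and the numbers are the sentences (in years) from the table above.

```python
# Sentences in years for (A's move, B's move); lower is better.
# "C" = stay silent (cooperate), "D" = betray (defect).
SENTENCES = {
    ("C", "C"): (1, 1),
    ("C", "D"): (5, 0),
    ("D", "C"): (0, 5),
    ("D", "D"): (3, 3),
}


def best_response_for_a(b_move):
    """Return A's move that minimizes A's own sentence given B's move."""
    return min("CD", key=lambda a_move: SENTENCES[(a_move, b_move)][0])


def is_dominant_for_a(move):
    """A move is dominant if it is A's best response to every move by B."""
    return all(best_response_for_a(b_move) == move for b_move in "CD")


# Combined years served by both prisoners for each outcome.
total_years = {moves: sum(years) for moves, years in SENTENCES.items()}
best_collective = min(total_years, key=total_years.get)

print("Betraying is dominant for A:", is_dominant_for_a("D"))
print("Best collective outcome:", best_collective)
```

Because the game is symmetric, the same check holds for Prisoner B with the roles swapped.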


# Experiment Set-up
  1. We need an LLM to play Prisoner A and a rule-based system to play Prisoner B (a rule-based opponent avoids confounding factors, given the inherently non-deterministic nature of LLM outputs).
  2. We will use a system prompt to explain the rules of the Prisoner's Dilemma scenario to the LLM.
  3. We will use `temperature=0` to minimize confounding factors.
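
To make the idea of a rule-based Prisoner B concrete, here is a minimal sketch of deterministic opponents. It is an illustration only and does not reflect how `PlayerStrategy` in `prisoner_dilemma` is implemented: `Move`, `always_cooperate`, `always_defect`, `tit_for_tat`, and `play` are hypothetical names, and tit-for-tat is included simply as a well-known example of a rule-based strategy.

```python
from enum import Enum


class Move(Enum):
    COOPERATE = "cooperate"
    DEFECT = "defect"


# Each strategy receives the list of Prisoner A's past moves
# and returns Prisoner B's move for the current round.
def always_cooperate(a_history):
    return Move.COOPERATE


def always_defect(a_history):
    return Move.DEFECT


def tit_for_tat(a_history):
    """Cooperate in the first round, then copy A's previous move."""
    return a_history[-1] if a_history else Move.COOPERATE


def play(strategy, a_moves):
    """Replay a fixed sequence of A's moves and return B's responses."""
    b_moves = []
    for round_index in range(len(a_moves)):
        # B only sees the moves A made before the current round.
        b_moves.append(strategy(a_moves[:round_index]))
    return b_moves


a_moves = [Move.DEFECT, Move.COOPERATE, Move.DEFECT]
print([move.value for move in play(tit_for_tat, a_moves)])
```

Since these functions contain no randomness, any variation across repeated runs can be attributed to the LLM side of the game.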

## LLM of Choice

About Llama: https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf

1. `Llama-3.1-8B`: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
2. `Gemma-2-9B`: https://huggingface.co/google/gemma-2-9b-it
3. `Mistral-NeMo-12B`:
   - https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 (Mistral's repo)
   - https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct (NVIDIA's repo, which did not work; use Mistral's repo instead)

## Vast.ai Setup

In [None]:
# !pip install openai pandas python-dotenv transformers tqdm torch huggingface-hub pytest accelerate

In [None]:
# Define the path for the .env file
# env_file_path = ".env"

# # Define the environment variables
# env_variables = {
#     "OPENAI_API_KEY": "XXX",
#     "HF_ACCESS_TOKEN": "XXX"
# }

# # Create the .env file and write the environment variables
# with open(env_file_path, "w") as f:
#     for key, value in env_variables.items():
#         f.write(f"{key}={value}\n")

# print(f"{env_file_path} created with environment variables.")


In [None]:
import torch 
# Check available GPUs and print info
num_gpus = torch.cuda.device_count()
print(f"\nUsing {num_gpus} GPUs:")
for i in range(num_gpus):
	props = torch.cuda.get_device_properties(i)
	print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
	print(f"   Total Memory: {props.total_memory / 1024**3:.2f} GB")
	print(f"   CUDA Capability: {props.major}.{props.minor}")

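# Configure model loading for multi-GPU setup with proper memory formatting.
# Built once, outside the loop, so max_memory is defined even when no GPU is found.
# Each device is capped at 85% of its total memory.
max_memory = {
	i: f"{int(torch.cuda.get_device_properties(i).total_memory * 0.85 / (1024**3))}GB"
	for i in range(num_gpus)
}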

print(f"\nAllocating memory across devices: {max_memory}")


# Print model distribution across devices
# print("\nModel layer distribution:")
# for name, module in model.named_modules():
# 	if hasattr(module, "device"):
# 		print(f"{name}: {module.device}")

In [3]:
# from helper.llm_helper import LLMHelper, HuggingFaceModel
# llm_helper = LLMHelper()
# from scenario_prompts import *
# import random
# from typing import Optional, Tuple, Dict

# import os
# import torch

# # Clear GPU cache
# torch.cuda.empty_cache()

# # Set memory configuration to avoid fragmentation
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# from huggingface_hub import login

# # Use the token to log in
# login(hf_access_token)

## Setup Predefined Prisoner's Dilemma Strategy

In [None]:
# from helper.llm_helper import LLMHelper, HuggingFaceModel
# llm_helper = LLMHelper()

# llm_helper.call_llm(model_name=HuggingFaceModel.LLAMA, prompt='who are you?')

In [None]:
from prisoner_dilemma import PrisonersDilemma, PlayerStrategy
from helper.llm_helper import HuggingFaceModel

# Define any necessary parameters for the LLM
llm_params = {
    "temperature": 0.0, 
    "do_sample": False,
    "max_new_tokens": 1024
}  # Customize as needed
cot = False

# Initialize the Prisoner's Dilemma simulation
pd_game = PrisonersDilemma(
    # model_name="gpt-3.5-turbo",  # Use your LLM model name
    model_name=HuggingFaceModel.LLAMA,  # Use your LLM model name
    player2_strategy=PlayerStrategy.ALWAYS_COOPERATE,
    n_rounds=5,  # Run for 5 rounds
    n_iterations=1,  # One complete experiment
    llm_params=llm_params,
    prisoner_A_filename='llama_prisoner_a.csv'
)

# Run the simulation
pd_game.simulate(cot=cot)


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Q (gpt-3.5-turbo): 
            The following is a summary of the previous rounds:
            No prior rounds.
            What is the lowest/highest payoff prisoner A can get in a single round?
            
A (gpt-3.5-turbo): In a single round of the Iterated Prisoner's Dilemma game, the lowest payoff that Prisoner A can get is 5 years in prison if Prisoner A cooperates and Prisoner B defects. The highest payoff that Prisoner A can get is 1 year in prison if both prisoners cooperate.

Q (gpt-3.5-turbo): 
            The following is a summary of the previous rounds:
            No prior rounds.
            Which actions is prisoner A allowed to play?
            
A (gpt-3.5-turbo): Prisoner A is allowed to choose between cooperating or defecting in the current round.

Q (gpt-3.5-turbo): 
            The following is a summary of the previous rounds:
            No prior rounds.
            What is the lowest/highest payoff prisoner B can get in a single round?
            
A (gpt-3.5