# Proof of Life Overview

In this notebook, we'll demonstrate proof of life for our project by fine-tuning a small LLM to predict the next best action in a game of poker.

## Problem Statement

The goal is to fine-tune a large language model (LLM) to consistently make optimal poker decisions. Specifically, we aim to train the model to analyze a given game state (such as the board, hand strength, position, and prior actions) and predict the best possible move. The challenge lies in capturing the complex decision-making process that expert poker players use, which involves probability estimation, opponent modeling, and strategic betting patterns.

This basic formulation will allow us to establish the viability of using more complex methods.

## Mathematical Formulation

We define the problem as a sequence prediction task, where:

* The input is the structured game state and natural language instruction (e.g., "You are on the button with AK offsuit. The action folds to you. What is the best move?").
* The output is the optimal poker action (e.g., "Raise 3BB" or "Check").
* The objective is to maximize accuracy in predicting the optimal move compared to expert decisions.

Mathematically, we seek to maximize the probability of the correct action given the current game state (which includes the actions before it).
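
As a sketch of how this objective can be written down: let $s_i$ be the $i$-th prompt (game state plus instruction), $a_i = (a_{i,1}, \dots, a_{i,T_i})$ the tokens of its expert action, $N$ the number of training examples, and $p_\theta$ the model with parameters $\theta$. These symbols are our own notation for illustration, not something defined by the dataset.

$$
\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(a_i \mid s_i),
\qquad
\log p_{\theta}(a_i \mid s_i) = \sum_{t=1}^{T_i} \log p_{\theta}\left(a_{i,t} \mid s_i, a_{i,<t}\right)
$$

The second identity is the chain rule over action tokens, which is what lets a next-token model score multi-token actions such as "bet 23".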

As we are predicting the next optimal move from a set of discrete values (check/bet/raise/fold), we'll use cross entropy loss as our primary loss function.
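
To make the loss concrete, here is a minimal pure-Python sketch of cross-entropy over a small action set. The `ACTIONS` list and the logits are made-up values for illustration; in the notebook the loss is computed over vocabulary tokens by the model itself.

```python
import math

# Hypothetical label set for illustration only.
ACTIONS = ["check", "bet", "raise", "fold"]


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


target = ACTIONS.index("fold")
# A model with no preference spreads probability evenly: loss = log(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target)
# A model that favors the correct action gets a lower loss.
confident_loss = cross_entropy([0.0, 0.0, 0.0, 5.0], target)
print(uniform_loss, confident_loss)
```

The loss shrinks as the model puts more probability on the expert action, which is the behavior we want training to encourage.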

## Data Requirements

For fine-tuning, we require a high-quality dataset that consists of:

Training Data: Structured game states and corresponding optimal actions based on expert or solver-generated strategies.

The features we have in our dataset include:
* Player positions (BTN, SB, BB, etc.)
* Hole cards (e.g., "Ace of Spades, King of Diamonds")
* Board state (Flop, Turn, River)
* Bet sizes
* Opponent actions
* Labels: The correct poker decision (Fold, Call, Raise + size).

Test Data: A separate dataset to evaluate model performance on unseen game states.


## Success Metrics
The success metric in this case will be how accurately the model can predict the next best move.
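
A minimal sketch of this metric as exact-match accuracy. The helper names and the normalization (lowercasing and collapsing whitespace) are our assumptions, not something the dataset prescribes.

```python
def normalize_action(text):
    """Lowercase and collapse whitespace so 'Call ' and 'call' compare equal."""
    return " ".join(text.strip().lower().split())


def exact_match_accuracy(predictions, labels):
    """Fraction of predictions that equal their label after normalization."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(
        normalize_action(p) == normalize_action(l)
        for p, l in zip(predictions, labels)
    )
    return correct / len(labels)


# Made-up predictions and labels for illustration.
acc = exact_match_accuracy(
    ["fold", "Call ", "bet 23", "raise 10"],
    ["fold", "call", "bet 23", "check"],
)
print(acc)  # 0.75
```

Exact match is strict: a prediction of "bet 20" against a label of "bet 23" counts as wrong even though the action type is right, so a looser metric may be worth considering later.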

# Imports

In [None]:
!pip install transformers datasets accelerate torch


In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling
import pandas as pd

# Data Processing

In [None]:
ds = load_dataset("RZ412/PokerBench")


In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['instruction', 'output'],
        num_rows: 563200
    })
    test: Dataset({
        features: ['instruction', 'output'],
        num_rows: 11000
    })
})

In [None]:
train_data = ds["train"].select(range(3000))
test_data = ds["test"]

For this dataset, our input is a prompt and the output is an action.

In [None]:
# A datasets.Dataset has no .head(); convert to a pandas DataFrame first
train_data.to_pandas().head(5)

| | instruction | output |
|---|---|---|
| 0 | \n\nYou are a specialist in playing 6-handed N... | fold |
| 1 | \n\nYou are a specialist in playing 6-handed N... | call |
| 2 | \n\nYou are a specialist in playing 6-handed N... | bet 23 |
| 3 | \n\nYou are a specialist in playing 6-handed N... | check |
| 4 | \n\nYou are a specialist in playing 6-handed N... | call |


In [None]:
# load in a tokenizer to process the instructions and output
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token


Since our data is raw text, we need a tokenizer to split each prompt into subword tokens (integer IDs) that can be fed into an LLM. GPT-2 has no dedicated padding token, so we reuse the end-of-sequence token for padding.

In [None]:
# Tokenization
def tokenize_function(examples):
    inputs = examples["instruction"]
    targets = examples["output"]
    model_inputs = tokenizer(inputs, padding="max_length", truncation=True, max_length=512)

    # Tokenize expected outputs as labels for supervised fine-tuning.
    # `as_target_tokenizer` is deprecated and does nothing for GPT-2, so call the tokenizer directly.
    labels = tokenizer(targets, padding="max_length", truncation=True, max_length=512)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


train_data = train_data.map(tokenize_function, batched=True)
test_data = test_data.map(tokenize_function, batched=True)

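# Data collator
# Note: with mlm=False this collator rebuilds `labels` from `input_ids`
# (padding positions set to -100), replacing the labels computed in tokenize_function.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Standard Causal LM fine-tuning
)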



In [None]:
train_data

Dataset({
    features: ['instruction', 'output', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 3000
})
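
For reference, a common way to prepare instruction data for a causal LM is to concatenate prompt and response into one sequence and mask the prompt positions in the labels, so the loss is computed only on the response. The sketch below shows that idea on plain lists of token IDs; it is not what the cells above do. The function name, the left-truncation policy, and the toy IDs are assumptions for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_causal_lm_example(prompt_ids, response_ids, eos_id, max_length):
    """Join prompt and response; supervise only the response and EOS tokens."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]

    # If too long, drop tokens from the left so the response is always kept.
    overflow = len(input_ids) - max_length
    if overflow > 0:
        input_ids = input_ids[overflow:]
        labels = labels[overflow:]

    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Toy token IDs: a 3-token prompt, a 2-token response, and EOS id 0.
example = build_causal_lm_example([10, 11, 12], [20, 21], eos_id=0, max_length=8)
print(example["input_ids"])  # [10, 11, 12, 20, 21, 0]
print(example["labels"])     # [-100, -100, -100, 20, 21, 0]
```

Examples built this way need a collator that pads without overwriting `labels`; `DataCollatorForSeq2Seq`, which pads labels with -100, is one option worth checking against the installed transformers version.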

# Load Model

For our first model, we'll use a small LLM, specifically GPT-2.

In [None]:
# GPT-2 model

model = GPT2LMHeadModel.from_pretrained("gpt2")


Then, to get a baseline, we'll evaluate the model on the first 200 test examples before fine-tuning to get a rough idea of its accuracy.

We measure accuracy as the proportion of predictions that are correct, where a prediction is correct if the model's decoded output exactly matches the expected label.

In [None]:
def evaluate_model(model, dataset, max_new_tokens=50):
    test_results = []
    correct = 0
    for example in dataset:
        test_prompt = f"Instruction: {example['instruction']}\n### Response:"
        expected_output = example["output"]

        inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id
        )
        model_output = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

        test_results.append({
            "Prompt": test_prompt,
            "Model Output": model_output,
            "Expected Output": expected_output
        })

        if model_output == expected_output.strip():
            correct += 1

    accuracy = correct / len(dataset)
    print(f"Accuracy: {accuracy:.4f}")
    return test_results
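
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` in full yields the prompt plus the continuation. The sketch below shows one way to isolate the continuation and reduce it to a single action string. The helper names and the "first non-empty line" rule are our assumptions, and the toy IDs stand in for real tokenizer output.

```python
def continuation_ids(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return list(output_ids[prompt_length:])


def first_line_action(decoded_text):
    """Return the first non-empty line, lowercased with whitespace collapsed."""
    for line in decoded_text.splitlines():
        cleaned = " ".join(line.strip().lower().split())
        if cleaned:
            return cleaned
    return ""


# Toy example: a 3-token prompt followed by 2 generated tokens.
new_tokens = continuation_ids([101, 102, 103, 7, 8], prompt_length=3)
action = first_line_action("\n  Bet  23 \nBecause the pot is large...")
print(new_tokens, action)

# Inside evaluate_model this would look roughly like:
#   prompt_length = inputs["input_ids"].shape[1]
#   text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
#   model_output = first_line_action(text)
```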


In [None]:
print("Evaluating Pre-Finetuned Model...")
pre_finetuned_results = evaluate_model(model, test_data.select(range(200)))
# Cross-check: compare the model output against the expected label (not the prompt)
print("Accuracy:", sum(1 for result in pre_finetuned_results if result["Model Output"] == result["Expected Output"].strip()) / len(pre_finetuned_results))
print("Pre-Finetuned Model Evaluation Complete.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Evaluating Pre-Finetuned Model...
Accuracy: 0.0000
Accuracy: 0.0
Pre-Finetuned Model Evaluation Complete.


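From a first pass, we can see that the model has an accuracy of 0%. Further examination shows that the decoded output begins with the full prompt rather than a short action. For decoder-only models, `generate` returns the prompt tokens followed by the continuation, and we decode `outputs[0]` in full, so the exact-match check against a one- or two-word label cannot succeed as written.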

In [None]:
print(pre_finetuned_results[0]["Model Output"])

Instruction: 

You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is BTN, and your holding is [King of Heart and Three of Heart].
Before the flop, BTN raise 2.5 chips, and BB call. Assume that all other players that is not mentioned folded.
The flop comes Ten Of Heart, Three Of Spade, and Two Of Diamond, then BB bet 4 chips, and BTN call.
The turn comes Five Of Diamond, then BB check.
You currently have Two Pair(Two Pair, Kings and Threes with Ten kicker).

Now it is your turn to make a move.
To remind you, the current pot size is 13.0 chips, and your holding is [King of Heart and Three of Heart]. You currently have Two Pair.

Decide on an action based on the strength of your ha

To hopefully improve our accuracy, we'll set up a basic training framework for our model.

In [None]:
# Training Arguments
training_args = TrainingArguments(
    output_dir="./gpt2-poker-finetuned",
    evaluation_strategy="epoch",  # newer transformers releases name this `eval_strategy`
    save_strategy="epoch",
    per_device_train_batch_size=8,  # Adjust for Colab memory
    per_device_eval_batch_size=8,
    num_train_epochs=3,  # Tune as needed
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=500,
    save_total_limit=2,
    fp16=True,  # Mixed precision training for efficiency
    report_to="none"
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
)




# Finetune

In [None]:
# Training
trainer.train()

print("Fine-tuning complete! Model saved to './gpt2-poker-finetuned'")

Epoch,Training Loss,Validation Loss
1,No log,0.162414
2,0.234400,0.153781
3,0.101500,0.151941


Fine-tuning complete! Model saved to './gpt2-poker-finetuned'
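
The `No log` entry for epoch 1 follows from the step arithmetic: the training loss is only reported every `logging_steps` optimizer steps. A quick sketch, assuming a single device and no gradient accumulation (the helper name is ours):

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size)


num_examples = 3000   # training subset selected earlier in the notebook
batch_size = 8        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
logging_steps = 500   # logging_steps

per_epoch = steps_per_epoch(num_examples, batch_size)
total_steps = per_epoch * num_epochs

# Has at least one training-loss log been emitted by the end of each epoch?
has_log_by_epoch_end = [
    epoch * per_epoch >= logging_steps for epoch in range(1, num_epochs + 1)
]
print(per_epoch, total_steps, has_log_by_epoch_end)
# 375 1125 [False, True, True]
```

Epoch 1 ends at step 375, before the first log at step 500, which matches the table. Lowering `logging_steps` (for example to 50) would give a training-loss value for every epoch.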


# Test

After finetuning, we'll call the evaluation function again to see if our model has improved at all.

In [None]:
print("Evaluating Fine-Tuned Model...")
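test_results = evaluate_model(model, test_data.select(range(200)))
# Cross-check: compare the model output against the expected label (not the prompt)
print("Accuracy:", sum(1 for result in test_results if result["Model Output"] == result["Expected Output"].strip()) / len(test_results))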
print("Fine-Tuned Model Evaluation Complete.")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Evaluating Fine-Tuned Model...




Accuracy: 0.0000
Accuracy: 0.0
Fine-Tuned Model Evaluation Complete.


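Even after fine-tuning, our model's accuracy is still 0. The decoded output still begins with the prompt, so it never equals the short expected action. One likely contributor on the training side: `DataCollatorForLanguageModeling(mlm=False)` builds the labels from `input_ids`, so the loss was computed on the instruction text rather than on the action labels.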

In [None]:
print(test_results[1]["Model Output"])

Instruction: 

You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is CO, and your holding is [Ace of Spade and Queen of Diamond].
Before the flop, HJ raise 2.0 chips, CO raise 6.5 chips, and HJ call. Assume that all other players that is not mentioned folded.
The flop comes Eight Of Heart, Eight Of Club, and Ten Of Club, then HJ check, CO bet 7 chips, HJ raise 16 chips, and CO call.
The turn comes King Of Club, then HJ check.
You currently have One Pair(One Pair, Eights with Ace, King, Queen kickers).

Now it is your turn to make a move.
To remind you, the current pot size is 46.0 chips, and your holding is [Ace of Spade and Queen of Diamond]. You currently have One Pair.

Decid

In [None]:
print(test_results[0]["Expected Output"])

check


# Next Steps
* Our current experiments with GPT-2 have shown that it struggles with prompt comprehension and concise action generation. In contrast, larger models like GPT-4 perform significantly better in producing accurate single-word actions. To improve results, we should shift our testing to more powerful models, such as GPT-3.5, GPT-4, or other similarly scaled architectures (e.g., Falcon, Mistral, or LLaMA). This will allow us to better assess the impact of model size on performance.
* To experiment with larger models, we'll also need access to more computing resources. When making this proof of life notebook, we tried to experiment with Falcon-7B, a larger LLM with 7 billion parameters (the `gpt2` checkpoint used here has roughly 124 million; the 1.5 billion figure belongs to the largest GPT-2 variant, `gpt2-xl`). However, when we tried to load it, Colab ran out of RAM.
* We would also like to test performance on reasoning models like TinyZero and RAGEN. Because these are reasoning models, they may be better at thinking through their decisions than a standard language model like ChatGPT.
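
To size the compute request, a back-of-the-envelope estimate of the memory needed just to hold model weights is parameters times bytes per parameter. The sketch below uses that rule; the helper name is ours, the 124 million figure for the base `gpt2` checkpoint is approximate, and the estimate ignores activations, optimizer state, and any temporary copies made while loading.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="float16"):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


falcon_fp16_gb = weight_memory_gb(7e9, "float16")    # about 14 GB
falcon_int8_gb = weight_memory_gb(7e9, "int8")       # about 7 GB
gpt2_fp32_gb = weight_memory_gb(124e6, "float32")    # about 0.5 GB
print(falcon_fp16_gb, falcon_int8_gb, gpt2_fp32_gb)
```

Under these assumptions a 7 billion parameter model needs about 14 GB for 16-bit weights alone, so options such as quantized loading or a larger-memory runtime are worth exploring before scaling up.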