# Executable Demo Notebook

This Colab notebook is the executable demo for our PokerZero project. It provides the code needed to load our model weights and run inference on a sample prompt, and to load the self-play data we generated and uploaded to HuggingFace.

### Installation

In [1]:
%%capture
import os

os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm==0.8.2
else:
    # [NOTE] Do the below ONLY in Colab! Elsewhere, use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm==0.8.2
    !pip install -U datasets

In [2]:
# @title Colab Extra Install { display-mode: "form" }
%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Elsewhere, use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm
    # Skip restarting message in Colab
    import sys, re, requests

    modules = list(sys.modules.keys())
    for x in modules:
        # Drop cached PIL/google modules so the reinstalled versions are picked up
        if "PIL" in x or "google" in x:
            sys.modules.pop(x)
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get(
        "https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt"
    ).content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

In [3]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = more adapter capacity, but slower and more memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.8,  # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # Enable long context finetuning
    random_state=3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-13 07:59:53 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.5.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 79.24%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 224.
Unsloth: vLLM's KV Cache can use up to 9.38 GB. Al

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 05-13 08:00:19 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-13 08:00:19 [cuda.py:288] Using XFormers backend.
INFO 05-13 08:00:20 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-13 08:00:20 [model_runner.py:1110] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...
INFO 05-13 08:00:20 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 05-13 08:00:21 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 05-13 08:00:47 [weight_utils.py:281] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 26.605778 seconds
INFO 05-13 08:00:47 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 05-13 08:00:49 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-13 08:00:50 [model_runner.py:1146] Model loading took 2.3276 GB and 29.658315 seconds
INFO 05-13 08:01:00 [worker.py:267] Memory profiling takes 9.50 seconds
INFO 05-13 08:01:00 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.79) = 11.68GiB
INFO 05-13 08:01:00 [worker.py:267] model weights take 2.33GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.23GiB; the rest of the memory reserved for KV Cache is 8.10GiB.
INFO 05-13 08:01:01 [executor_base.py:111] # cuda blocks: 14743, # CPU blocks: 3640
INFO 05-13 08:01:01 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 230.36x
INFO 05-13 08:01:02 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. I

Capturing CUDA graph shapes: 100%|██████████| 31/31 [00:59<00:00,  1.93s/it]

INFO 05-13 08:02:02 [model_runner.py:1570] Graph capturing finished in 60 secs, took 0.63 GiB
INFO 05-13 08:02:02 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 72.11 seconds





Unsloth: Just some info: will skip parsing ['k_norm', 'pre_feedforward_layernorm', 'q_norm', 'post_feedforward_layernorm']

Unsloth 2025.5.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
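
To give a rough sense of what `lora_rank = 32` costs, the sketch below counts the adapter parameters added across the seven target modules. Each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` parameters. The layer dimensions are assumptions taken from the published Qwen2.5-3B configuration, not values stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Qwen2.5-3B dimensions (from the published model config, not this notebook)
HIDDEN = 2048         # hidden size
INTERMEDIATE = 11008  # MLP intermediate size
KV_DIM = 256          # 2 key/value heads x 128 head dim
NUM_LAYERS = 36       # agrees with "patched 36 layers" in the log above

# (in_features, out_features) for each LoRA target module
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank):
    """Count LoRA parameters: each module adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


total = lora_param_count(32)
print(f"{total:,} LoRA parameters at rank 32")
print(f"about {total * 4 / 1e6:.0f} MB if stored as float32")
```

Under these assumptions the count scales linearly with the rank, so halving `lora_rank` halves the adapter size.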


Define the system prompt that is included with every request. It tells the model to reason first and then give its poker action in a fixed tag format.

In [4]:
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
... Your reasoning here ...
</reasoning>
<answer>
... Your poker action here (fold, call, raise X). Raise is the same as bet and call is the same as check ...
</answer>
"""

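Because the system prompt fixes the response format, the action can be pulled out of a completion with a small parser. The following is a minimal sketch, not part of the original notebook: `extract_answer` is a hypothetical helper, and it assumes the model actually emits a well-formed `<answer>` block, which is not guaranteed.

```python
import re

# Capture the text between <answer> and </answer>, ignoring surrounding whitespace
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(response):
    """Return the text of the last <answer> block, or None if there is none."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


example = "<reasoning>\nTop two pair.\n</reasoning>\n<answer>\nraise 20 chips\n</answer>\n"
print(extract_answer(example))
```

Returning `None` instead of raising lets the caller decide how to treat malformed completions, for example by resampling.
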
Load in the LoRA weights from HuggingFace.

In [5]:
from huggingface_hub import snapshot_download

# Download the LoRA weights from HuggingFace
repo_path = snapshot_download(repo_id="wesleyyliu/pokerzerofinal")

# Load the LoRA weights
lora_request = model.load_lora(repo_path)

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/611 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/815 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/240M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

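Define a function that runs inference: it wraps the user prompt in the chat template together with the system prompt, then samples one completion with the LoRA adapter applied.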

In [6]:
from vllm import SamplingParams


def generate(prompt):
    """Apply the chat template to `prompt` and return one sampled completion."""
    text = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )

    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        max_tokens=1024,
    )
    generated_text = (
        model.fast_generate(
            [text], sampling_params=sampling_params, lora_request=lora_request
        )[0]
        .outputs[0]
        .text
    )

    return generated_text

Define a prompt and run inference.

In [7]:
prompt = """You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is HJ, and your holding is [King of Diamond and Jack of Spade].
Before the flop, HJ raise 2.0 chips, and BB call. Assume that all other players that is not mentioned folded.
The flop comes King Of Spade, Seven Of Heart, and Two Of Diamond, then BB check, and HJ check.
The turn comes Jack Of Club, then BB check, HJ bet 3 chips, BB raise 10 chips, and HJ call.
The river comes Seven Of Club, then BB check.


Now it is your turn to make a move.
To remind you, the current pot size is 24.0 chips, and your holding is [King of Diamond and Jack of Spade].

Decide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.
Your optimal action is:"""

print(generate(prompt))

Processed prompts: 100%|██████████| 1/1 [00:06<00:00,  6.98s/it, est. speed input: 49.48 toks/s, output: 26.96 toks/s]

<reasoning>
You have a pair of King and Jack, which is a strong hand. The board has a made high pair (K♠K♠S♠J♣) which doesn’t seem to have any potential overpairs or significant draws. The pot is now 24 chips, and you have called a bet of 3 chips, and your opponent (BB) has just raised you 10 chips. BB's raise might have been a speculative play or a continuation bet after hitting a set or a straight on the turn. However, since you are ahead with your King and Jack, and the board is mostly suited to your two cards, it’s wise to at least consider calling or even raising if you believe BB is not as strong. But given the situation and your hand, folding to a strong opponent's raise seems unwise.
</reasoning>
<answer>
raise 20 chips
</answer>
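
An answer such as `raise 20 chips` can be normalised into a structured action before it is used elsewhere. The sketch below is one possible approach, assuming answers begin with the action word; `parse_action` and the alias table are hypothetical names, and the aliases follow the system prompt's rule that bet means raise and check means call.

```python
import re

# Action word, optionally followed by a numeric size (e.g. "raise 20 chips")
ACTION_RE = re.compile(
    r"^\s*(fold|call|check|raise|bet)\b(?:\s+(\d+(?:\.\d+)?))?", re.IGNORECASE
)
# The system prompt treats bet as raise and check as call
ALIASES = {"bet": "raise", "check": "call"}


def parse_action(answer):
    """Return (action, amount) or None if the answer is not a valid action."""
    match = ACTION_RE.match(answer)
    if match is None:
        return None
    action = match.group(1).lower()
    action = ALIASES.get(action, action)
    if action != "raise":
        return action, None  # fold and call carry no size
    if match.group(2) is None:
        return None  # a raise without a size is treated as invalid
    return action, float(match.group(2))


print(parse_action("raise 20 chips"))
print(parse_action("check"))
```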





### Loading the Dataset

The code below loads the self-play dataset we uploaded to HuggingFace and prints a few samples from it.

In [12]:
from datasets import load_dataset

# Load the training split and keep the first five samples for inspection
data = load_dataset("wesleyyliu/PokerBenchExpanded")["train"]
data_truncated = data.select(range(5))
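
The self-play prompts printed below write cards as two-character codes such as `'D4'` and `'SA'`. Judging from those samples, the first character appears to be the suit and the second the rank; this reading is inferred from the printed samples, not from any documentation. Under that assumption, a hypothetical helper like `describe_card` makes the samples easier to read:

```python
# Assumed encoding, inferred from the printed samples: suit letter, then rank
SUITS = {"C": "Clubs", "D": "Diamonds", "H": "Hearts", "S": "Spades"}
RANKS = {
    "2": "Two", "3": "Three", "4": "Four", "5": "Five", "6": "Six",
    "7": "Seven", "8": "Eight", "9": "Nine", "T": "Ten",
    "J": "Jack", "Q": "Queen", "K": "King", "A": "Ace",
}


def describe_card(code):
    """Turn a code like 'D4' into 'Four of Diamonds'."""
    if len(code) != 2:
        raise ValueError(f"expected a two-character card code, got {code!r}")
    suit, rank = code[0].upper(), code[1].upper()
    if suit not in SUITS or rank not in RANKS:
        raise ValueError(f"unrecognised card code: {code!r}")
    return f"{RANKS[rank]} of {SUITS[suit]}"


print([describe_card(card) for card in ["D4", "SA"]])
```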

In [18]:
# Print each prompt followed by its reference answer
for i, sample in enumerate(data_truncated):
    print(f"---------- Sample {i} ----------")
    print(sample["prompt"])
    print("---------- Answer: ----------")
    print(sample["answer"])

---------- Sample 0 ----------
You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:
The small blind is 10 chips and the big blind is 20 chips. Everyone started with 1000 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is BB, and your holding is ['D4', 'SA'].
Before the flop, TransformerPlayer4 declared call; TransformerPlayer5 declared fold; you declared raise 30; TransformerPlayer1 declared fold; TransformerPlayer2 declared raise 40; TransformerPlayer3 declared call; TransformerPlayer4 declared call; you declared raise 50; TransformerPlayer2 declared call; TransformerPlayer3 declared call; TransformerPlayer4 declared fold.
The flop comes CT, H5, S2, then TransformerPlayer2 declared raise 20; TransformerPlayer3 declared call; you declared call.
The turn comes ['CT', 'H5', 'S2', 'HQ'], then TransformerPlayer2 dec