<a href="https://colab.research.google.com/github/weber50432/COMP0258-poker-LLM/blob/master/colab/model_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# Directories in Google Drive for models, outputs, and data
models_dir = '/content/drive/MyDrive/models'
output_dir = '/content/drive/MyDrive/outputs'
data_dir = '/content/drive/MyDrive/data'
# Create the models and outputs directories if they do not exist yet
os.makedirs(models_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)
# The data directory must already exist; report it if it is missing
if not os.path.exists(data_dir):
    print("data directory not found")

Mounted at /content/drive


In [4]:
%%capture
# Normally using pip install unsloth is enough

if True:
    # Temporarily as of Jan 31st 2025, Colab has some issues with Pytorch
    # Using pip install unsloth will take 3 minutes, whilst the below takes <1 minute:
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29 peft trl triton
    !pip install --no-deps cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [5]:
from unsloth import FastLanguageModel
from tqdm import tqdm
from transformers import TextStreamer
from datasets import load_dataset
import random
import pandas as pd
import json

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [6]:
usning_custom_dataset = True
pretrained_model_name = "lora_Qwen2.5_14B_model-5000"

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth


In [7]:
if True:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = f"{models_dir}/{pretrained_model_name}", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# The prompt template must be identical to the one used during training.
prompt = """
### Instruction:
{}

### Response:
{}"""


inputs = tokenizer(
[
    prompt.format(
        "You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.\n\nHere is a game summary:\n\nThe small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.\nThe player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.\nIn this hand, your position is CO, and your holding is [Ace of Heart and King of Heart].\nYou currently have High Card(Ace-high).\nBefore the flop, CO raise 2.3, and BB raise 13.5. Assume that all other players that is not mentioned folded.\n\nNow it is your turn to make a move.\nTo remind you, the current pot size is 16.3 chips, and your holding is [Ace of Heart and King of Heart].\n\nDecide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.\nYour optimal action is:", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
outputs = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.3.9 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.



### Instruction:
You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is CO, and your holding is [Ace of Heart and King of Heart].
You currently have High Card(Ace-high).
Before the flop, CO raise 2.3, and BB raise 13.5. Assume that all other players that is not mentioned folded.

Now it is your turn to make a move.
To remind you, the current pot size is 16.3 chips, and your holding is [Ace of Heart and King of Heart].

Decide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.
Your optimal action is:

### Response:
raise 27.4<|im_end|>
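
The decoded output above contains the full prompt, the generated action, and the end-of-sequence marker. The later cells recover the action by splitting on `### Response:` and removing the EOS token. A minimal sketch of that step as a reusable function is shown below; the name `extract_response` and the default `eos_token` value are assumptions for illustration (in the notebook the value comes from `tokenizer.eos_token`), and returning an empty string when the marker is missing is a design choice, not something the notebook does.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text, eos_token="<|im_end|>"):
    """Return the text after the first response marker, without the EOS token.

    Hypothetical helper: it follows the notebook's split on the marker, but
    returns an empty string instead of raising when the marker is absent.
    """
    _, found, tail = decoded_text.partition(RESPONSE_MARKER)
    if not found:
        return ""  # marker missing: treat as an empty prediction
    return tail.replace(eos_token, "").strip()


decoded = (
    "\n### Instruction:\nYour optimal action is:\n\n"
    "### Response:\nraise 27.4<|im_end|>"
)
print(extract_response(decoded))  # raise 27.4
```

Treating a missing marker as an empty prediction keeps a long evaluation loop running even if one generation is malformed; the empty string then simply counts as a wrong answer.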


In [8]:
prompt = """
### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    # inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, output in zip(instructions, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instruction, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

if usning_custom_dataset:
    # open a json file from data_dir
    with open(f'{data_dir}/poker-preflop/preflop_1k_test_set_prompt_and_label.json', 'r') as f:
        preflop_dataset = json.load(f)
    print(f"reading {len(preflop_dataset)} preflop data.")
    with open(f'{data_dir}/poker-postflop/postflop_10k_test_set_prompt_and_label.json', 'r') as f:
        postflop_dataset = json.load(f)
    print(f"reading {len(postflop_dataset)} postflop data.")
else:
    dataset = load_dataset("RZ412/PokerBench", split = "test")
    dataset = dataset.map(formatting_prompts_func, batched = True,)

reading 1000 preflop data.
reading 10000 postflop data.


## Testing a random sample

In [9]:
if not usning_custom_dataset:
    print("Processing hg dataset.")
    # randrange excludes the upper bound, so the index is always valid
    index = random.randrange(len(dataset))
    print("Ground truth: ",dataset[index]['output'])
    inputs = tokenizer([prompt.format(dataset[index]['instruction'],"")], return_tensors = "pt").to("cuda")
else:
    print("Processing custom dataset.")
    # randrange excludes the upper bound, so the index is always valid
    index = random.randrange(len(preflop_dataset))
    print("Ground truth: ",preflop_dataset[index]['output'])
    inputs = tokenizer([prompt.format(preflop_dataset[index]['instruction'],"")], return_tensors = "pt").to("cuda")
# text_streamer = TextStreamer(tokenizer)
# outputs = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 10)
outputs = model.generate(**inputs, max_new_tokens = 10)

generated_text = tokenizer.batch_decode(outputs)[0]
generated_text = generated_text.split("### Response:")[1].strip()
generated_text = generated_text.replace(EOS_TOKEN, "")
print("Prediction: ",generated_text) # Print the generated text

Processing custom dataset.
Ground truth:  fold
Prediction:  call
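
A single sample like the one above can be judged in more than one way: the prediction may match the label exactly, or it may match only the action type (for example `raise`) while the amount differs. The sketch below separates the two cases. The helper names `parse_action` and `compare` are hypothetical, and the assumption that every label has the form `<action>` or `<action> <amount>` is based only on the labels visible in this notebook (`fold`, `check`, `call`, `raise 2.5`, and so on).

```python
def parse_action(text):
    """Split an action string such as 'raise 27.4' into (action, amount)."""
    parts = text.strip().lower().split()
    if not parts:
        return ("", None)
    action = parts[0]
    amount = None
    if len(parts) > 1:
        try:
            amount = float(parts[1])
        except ValueError:
            amount = None  # non-numeric trailing text is ignored
    return (action, amount)


def compare(prediction, ground_truth):
    """Report whether the full decision and the action type agree."""
    pred_action, pred_amount = parse_action(prediction)
    true_action, true_amount = parse_action(ground_truth)
    return {
        "exact_match": (pred_action, pred_amount) == (true_action, true_amount),
        "action_match": pred_action == true_action,
    }


print(compare("call", "fold"))
# {'exact_match': False, 'action_match': False}
print(compare("raise 27.4", "raise 32.0"))
# {'exact_match': False, 'action_match': True}
```

Comparing parsed amounts as floats means `raise 2.5` and `raise 2.50` count as the same decision, which a plain string comparison would not.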


In [10]:
if not usning_custom_dataset:
    ground_truths = []
    predictions = []
    for index in tqdm(range(len(dataset)), desc="Processing hg dataset"):
        # print(dataset[index]['output'])
        ground_truths.append(dataset[index]['output'])
        inputs = tokenizer([prompt.format(dataset[index]['instruction'],"")], return_tensors = "pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens = 10)
        response = tokenizer.batch_decode(outputs)[0]
        response = response.split("### Response:")[1].strip()
        response = response.replace(EOS_TOKEN, "")
        # print(response)
        predictions.append(response)
        # break

    results_df = pd.DataFrame({
    "Prediction": predictions,
    "Ground Truth": ground_truths
    })
    # Save the DataFrames to CSV files
    results_df.to_csv(f"{output_dir}/{pretrained_model_name}_total_predictions.csv", index=False)

else:
    preflop_ground_truths = []
    preflop_predictions = []
    for index in tqdm(range(len(preflop_dataset)), desc="Processing custom preflop dataset"):
        # print(dataset[index]['output'])
        preflop_ground_truths.append(preflop_dataset[index]['output'])
        inputs = tokenizer([prompt.format(preflop_dataset[index]['instruction'],"")], return_tensors = "pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens = 10)
        response = tokenizer.batch_decode(outputs)[0]
        response = response.split("### Response:")[1].strip()
        response = response.replace(EOS_TOKEN, "")
        # print(response)
        preflop_predictions.append(response)
        # break
    preflop_results_df = pd.DataFrame({
    "Prediction": preflop_predictions,
    "Ground Truth": preflop_ground_truths
    })
    preflop_results_df.to_csv(f"{output_dir}/{pretrained_model_name}_preflop_predictions.csv", index=False)


    postflop_ground_truths = []
    postflop_predictions = []
    save_interval = 1000
    output_file = f"{output_dir}/{pretrained_model_name}_postflop_predictions.csv"

    for index in tqdm(range(len(postflop_dataset)), desc="Processing custom postflop dataset"):
        postflop_ground_truths.append(postflop_dataset[index]['output'])

        inputs = tokenizer([prompt.format(postflop_dataset[index]['instruction'], "")], return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=10)
        response = tokenizer.batch_decode(outputs)[0]
        response = response.split("### Response:")[1].strip()
        response = response.replace(EOS_TOKEN, "")

        postflop_predictions.append(response)

        if (index + 1) % save_interval == 0 or (index + 1) == len(postflop_dataset):
            postflop_results_df = pd.DataFrame({
                "Prediction": postflop_predictions,
                "Ground Truth": postflop_ground_truths
            })
            postflop_results_df.to_csv(output_file, index=False)
            print(f"Saved progress at iteration {index + 1}")

    print("Processing completed.")

Processing custom preflop dataset: 100%|██████████| 1000/1000 [16:06<00:00,  1.03it/s]
Processing custom postflop dataset:  10%|█         | 1000/10000 [19:35<3:06:26,  1.24s/it]

Saved progress at iteration 1000


Processing custom postflop dataset:  20%|██        | 2000/10000 [39:08<2:53:48,  1.30s/it]

Saved progress at iteration 2000


Processing custom postflop dataset:  30%|███       | 3000/10000 [58:16<2:15:46,  1.16s/it]

Saved progress at iteration 3000


Processing custom postflop dataset:  40%|████      | 4000/10000 [1:17:36<1:50:35,  1.11s/it]

Saved progress at iteration 4000


Processing custom postflop dataset:  50%|█████     | 5000/10000 [1:37:08<1:33:57,  1.13s/it]

Saved progress at iteration 5000


Processing custom postflop dataset:  60%|██████    | 6000/10000 [1:56:45<1:27:41,  1.32s/it]

Saved progress at iteration 6000


Processing custom postflop dataset:  70%|███████   | 7000/10000 [2:16:22<1:00:22,  1.21s/it]

Saved progress at iteration 7000


Processing custom postflop dataset:  80%|████████  | 8000/10000 [2:35:59<37:14,  1.12s/it]

Saved progress at iteration 8000


Processing custom postflop dataset:  90%|█████████ | 9000/10000 [2:55:39<18:45,  1.13s/it]

Saved progress at iteration 9000


Processing custom postflop dataset: 100%|██████████| 10000/10000 [3:15:19<00:00,  1.17s/it]

Saved progress at iteration 10000
Processing completed.
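
The loops above only collect predictions and write them to CSV; no score is computed. One way to summarise a run, sketched here under assumptions, is to count how often the prediction string equals the label and how often only the first word (the action type) agrees. The function name `accuracy_report` is hypothetical, and the four sample pairs are made up for illustration; they are not results from this notebook.

```python
def accuracy_report(predictions, ground_truths):
    """Return exact-match and action-level accuracy for two parallel lists."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    total = len(predictions)
    if total == 0:
        return {"total": 0, "exact_match": 0.0, "action_match": 0.0}
    exact = 0
    action = 0
    for pred, truth in zip(predictions, ground_truths):
        p = pred.strip().lower()
        t = truth.strip().lower()
        exact += p == t  # whole string agrees, including any amount
        action += p.split()[:1] == t.split()[:1]  # first word (action type) agrees
    return {
        "total": total,
        "exact_match": exact / total,
        "action_match": action / total,
    }


# Illustrative data only, not taken from the evaluation run.
sample_predictions = ["fold", "call", "raise 2.5", "raise 10.0"]
sample_ground_truths = ["fold", "fold", "raise 2.5", "raise 11.0"]
print(accuracy_report(sample_predictions, sample_ground_truths))
# {'total': 4, 'exact_match': 0.5, 'action_match': 0.75}
```

In this notebook the same function could be called with `preflop_predictions` and `preflop_ground_truths`, or with the postflop lists. Exact match here is a plain string comparison, so differently formatted amounts such as `2.5` and `2.50` would count as a mismatch.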





In [11]:
if not usning_custom_dataset:
    print(ground_truths)
    print(predictions)
else:
    print(preflop_ground_truths)
    print(preflop_predictions)
    print(postflop_ground_truths)
    print(postflop_predictions)

['fold', 'fold', 'fold', 'check', 'check', 'call', 'check', 'call', 'check', 'fold', 'fold', 'fold', 'raise 2.5', 'check', 'fold', 'check', 'raise 32.0', 'raise 2.5', 'check', 'check', 'fold', 'check', 'raise 11.0', 'fold', 'fold', 'call', 'call', 'raise 10.0', 'raise 2.3', 'raise 10.0', 'call', 'raise 11.0', 'check', 'check', 'check', 'raise 19.0', 'raise 3.0', 'check', 'fold', 'fold', 'raise 28.0', 'call', 'fold', 'fold', 'fold', 'raise 2.3', 'raise 10.0', 'call', 'call', 'raise 13.0', 'raise 21.0', 'check', 'check', 'call', 'fold', 'fold', 'raise 17.5', 'call', 'fold', 'check', 'check', 'call', 'raise 28.0', 'raise 14.0', 'raise 26.0', 'fold', 'raise 27.0', 'raise 21.0', 'fold', 'call', 'raise 20.0', 'check', 'call', 'call', 'raise 3.0', 'fold', 'fold', 'raise 2.3', 'fold', 'raise 2.3', 'call', 'raise 23.0', 'raise 13.0', 'raise 21.0', 'check', 'call', 'fold', 'call', 'fold', 'raise 3.0', 'fold', 'raise 22.9', 'raise 12.0', 'fold', 'fold', 'raise 2.5', 'check', 'check', 'fold', 'cal