# PPO Training with LoRA-Tuned LLM for Texas Hold'em Poker

This notebook smoke-tests the components of a PPO (Proximal Policy Optimization) training pipeline for a poker-playing agent built on a LoRA-tuned language model: the poker engine, the reward calculation, the agents, and the trainer.

## Content Overview
1. **Basic Poker Game Testing** - Deck dealing and hand evaluation
2. **Game State and Player Management** - Player states, actions, and LLM prompt generation
3. **Multi-Player Game Testing** - 6-player positions and blinds
4. **PPO Configuration and Rewards** - Reward calculation and training setup
5. **Model Loading and Authentication** - Hugging Face model integration
6. **Agent Testing and Action Processing** - Action legalization and random agents
7. **PPO Training Setup** - Episode loop with random agents
8. **LLM Agent Testing (Mock Setup)** - Mock LLM agent and trainer smoke test
9. **Final Model Integration Test** - Tokenizer loading

## 1. Basic Poker Game Testing

In [1]:
from poker_game import Deck, Card

deck = Deck()  # Fresh shuffled deck (no seed is passed, so the order may vary between runs)
cards = deck.deal(5)  # Deal 5 cards
print(cards)  # Prints the 5 dealt cards by full name, as in the output below

[Eight Of Club, Ace Of Spade, Eight Of Heart, Nine Of Diamond, Nine Of Spade]
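
If reproducible deals are needed, one option is to shuffle with a dedicated seeded random generator. The sketch below is a standalone illustration that does not use `poker_game`; `make_deck` and `deal_seeded` are hypothetical names, and whether `Deck` itself accepts a seed is not shown in this notebook.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def make_deck():
    """Return the 52 cards as two-character strings such as 'Ah' or 'Ts'."""
    return [rank + suit for rank in RANKS for suit in SUITS]

def deal_seeded(n, seed):
    """Shuffle a fresh deck with a private seeded RNG and deal the top n cards."""
    rng = random.Random(seed)  # private RNG: leaves the global random state untouched
    deck = make_deck()
    rng.shuffle(deck)
    return deck[:n]

first = deal_seeded(5, seed=42)
second = deal_seeded(5, seed=42)
print(first == second)  # True: the same seed yields the same deal
```

Using a private `random.Random` instance keeps the deal independent of any other code that draws from the global generator.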


In [2]:
from poker_game import HandEvaluator, Card

cards = [Card.from_string(s) for s in ['Ah', 'Ad', 'Kh', 'Kd', 'Qs']]
hand_rank, tiebreakers = HandEvaluator.evaluate_hand(cards)
description = HandEvaluator.get_hand_description(cards)
print(description)  # "Two Pair"

Two Pair
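
To illustrate why this hand is described as "Two Pair", the following standalone sketch classifies a five-card hand from rank multiplicities alone. It is a simplification, not the logic of `HandEvaluator`: `classify_by_rank_counts` is a hypothetical helper, and it deliberately ignores straights and flushes.

```python
from collections import Counter

def classify_by_rank_counts(cards):
    """Classify a 5-card hand using rank multiplicities only (no straights or flushes)."""
    ranks = [card[0] for card in cards]  # 'Ah' -> 'A'
    counts = sorted(Counter(ranks).values(), reverse=True)
    if counts[0] == 4:
        return "Four of a Kind"
    if counts[:2] == [3, 2]:
        return "Full House"
    if counts[0] == 3:
        return "Three of a Kind"
    if counts[:2] == [2, 2]:
        return "Two Pair"
    if counts[0] == 2:
        return "One Pair"
    return "High Card"

print(classify_by_rank_counts(["Ah", "Ad", "Kh", "Kd", "Qs"]))  # Two Pair
```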


## 2. Game State and Player Management

In [3]:
from poker_game import GameState, PlayerState, Action

player = PlayerState(player_id=0, stack=100.0)
player.bet(10.0)  # Place a bet
print(player.stack)  # 90.0

90.0


In [4]:
from poker_game import PokerGame, Action

# Create a game
game = PokerGame(num_players=2, starting_stack=100.0, small_blind=0.5, big_blind=1.0)

# Start a hand
state = game.reset()

# Get valid actions for current player
player = state.current_player()
valid_actions = state.get_valid_actions(player)

# Execute an action: the small blind raises by 6.0 on top of the 1.0 big blind (current bet becomes 7.0)
new_state, hand_complete, result = game.step(Action.RAISE, amount=6.0)

# Print game state
print(game.get_game_state_string())

Hand #1 - PREFLOP
Board: (no cards yet)
Pot: $8.0

Players:
  BB (P0): $99.0 (ACTIVE) [Nine Of Diamond Ace Of Diamond]
  SB (P1): $93.0 (ACTIVE) [Two Of Heart Seven Of Diamond]

Action on: BB (P0)
Current bet: $7.0
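
The numbers in this state can be checked by hand. The sketch below reproduces the pot, current bet, and stacks, assuming that `amount=6.0` is a raise-by amount on top of the big blind; that reading is inferred from the printed state rather than stated by the notebook.

```python
starting_stack = 100.0
small_blind, big_blind = 0.5, 1.0

# Blinds are posted first.
sb_committed = small_blind
bb_committed = big_blind
pot = sb_committed + bb_committed            # 1.5

# SB raises by 6.0 on top of the current bet (the big blind).
raise_by = 6.0
new_current_bet = bb_committed + raise_by    # 7.0
sb_adds = new_current_bet - sb_committed     # 6.5 extra chips from SB
sb_committed += sb_adds
pot += sb_adds

sb_stack = starting_stack - sb_committed
bb_stack = starting_stack - bb_committed
print(pot, new_current_bet, sb_stack, bb_stack)  # 8.0 7.0 93.0 99.0
```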


In [5]:
# Method 1: Use state.current_player()
current_player = state.current_player()
print(f"Current player: {current_player.player_id} ({current_player.position})")

# Method 2: Use state.current_player_idx
print(f"Current player index: {state.current_player_idx}")

# Method 3: Check if betting round is complete
if not state.is_betting_round_complete():
    print("It's someone's turn - action required")
else:
    print("Betting round is complete - no action needed")

# Method 4: Get valid actions for current player
valid_actions = state.get_valid_actions(current_player)
print(f"Valid actions for current player: {[a.value for a in valid_actions]}")

Current player: 0 (BB)
Current player index: 0
It's someone's turn - action required
Valid actions for current player: ['fold', 'call', 'raise']


In [6]:
# Test LLM prompt generation
print(f"Current betting round: {state.betting_round}")
print(f"Action history: {state.action_history}")
print()

prompt = state.get_llm_prompt(player_perspective=0)
print("LLM Prompt:")
print(prompt)

Current betting round: BettingRound.PREFLOP
Action history: [{'player_id': 1, 'action': 'bet', 'amount': 0.5, 'betting_round': 'preflop', 'pot_after': 0.5, 'is_blind': True}, {'player_id': 0, 'action': 'bet', 'amount': 1.0, 'betting_round': 'preflop', 'pot_after': 1.5, 'is_blind': True}, {'player_id': 1, 'action': 'raise', 'amount': 6.0, 'betting_round': 'preflop', 'pot_after': 8.0, 'is_blind': False}]

LLM Prompt:
You are a specialist in playing heads-up No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1.0 chips. Everyone started with 100.0 chips.
The player positions involved in this game are SB, BB.
In this hand, your position is BB, and your holding is [Nine Of Diamond and Ace Of Diamond].
Before the flop, SB raise 6.0 chips. Assume that all other players that is not mentioned folded.

Now it is your turn to make a move.
To remind you, the current pot s

## 3. Multi-Player Game Testing

In [7]:
# Test 6-player game
from poker_game import PokerGame

# Create a 6-player game
game6 = PokerGame(num_players=6, starting_stack=100.0, small_blind=0.5, big_blind=1.0)
state6 = game6.reset()

print('6-handed poker:')
print(f'Button position: {state6.button_position}')
print('Player positions:')
for i, player in enumerate(game6.players):
    print(f'  Player {i}: {player.position}')
print('Action history:')
for action in state6.action_history:
    player_pos = game6.players[action['player_id']].position
    print(f"  {action['action']} by {player_pos}")

6-handed poker:
Button position: 1
Player positions:
  Player 0: CO
  Player 1: BTN
  Player 2: SB
  Player 3: BB
  Player 4: UTG
  Player 5: HJ
Action history:
  bet by SB
  bet by BB


In [8]:
print('Direct stack check:')
for i, player in enumerate(game6.players):
    print(f'  Player {i} ({player.position}): ${player.stack:.1f}')

Direct stack check:
  Player 0 (CO): $100.0
  Player 1 (BTN): $100.0
  Player 2 (SB): $99.5
  Player 3 (BB): $99.0
  Player 4 (UTG): $100.0
  Player 5 (HJ): $100.0
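
The seat-to-position mapping above follows from walking clockwise from the button. The following standalone sketch reproduces it for a 6-handed table; `assign_positions` is a hypothetical helper, and the actual assignment logic inside `PokerGame` may differ (for example in heads-up play).

```python
SIX_MAX_ORDER = ["BTN", "SB", "BB", "UTG", "HJ", "CO"]

def assign_positions(button, num_players=6):
    """Map each seat index to a position name, walking clockwise from the button."""
    if num_players != len(SIX_MAX_ORDER):
        raise ValueError("this sketch only covers 6-handed tables")
    positions = [None] * num_players
    for offset, name in enumerate(SIX_MAX_ORDER):
        positions[(button + offset) % num_players] = name
    return positions

print(assign_positions(button=1))
# ['CO', 'BTN', 'SB', 'BB', 'UTG', 'HJ']
```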


## 4. PPO Configuration and Rewards

In [9]:
from ppo.config import PPOConfig
from ppo.rewards import RewardCalculator
from poker_game.game_state import Action
import numpy as np

cfg = PPOConfig()
rc  = RewardCalculator(big_blind=cfg.big_blind)

# Terminal reward: initial 100, ending 110, BB=1 => reward should be 10.0
r = rc.calculate_reward(player_id=0, action=Action.CALL, hand_result={},
                        initial_stack=100.0, final_stack=110.0, big_blind=cfg.big_blind)
print("terminal reward =", r)  # Expected: 10.0

# Construct a 3-step trajectory: first two steps 0, last step has terminal reward
rewards = np.array([0.0, 0.0, r])
print("Reward trajectory:", rewards)

terminal reward = 10.0
Reward trajectory: [ 0.  0. 10.]
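
The expected value of 10.0 is consistent with a terminal reward equal to the stack change measured in big blinds. The sketch below shows that calculation, plus how a sparse terminal reward propagates back through a trajectory as discounted returns. Both functions are hypothetical illustrations: the real `RewardCalculator` may add other terms, and the discount factor used here is chosen only to keep the arithmetic easy to follow.

```python
def terminal_reward_bb(initial_stack, final_stack, big_blind):
    """Stack change over the hand, expressed in big blinds."""
    return (final_stack - initial_stack) / big_blind

def discounted_returns(rewards, gamma):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

r = terminal_reward_bb(initial_stack=100.0, final_stack=110.0, big_blind=1.0)
print(r)  # 10.0
print(discounted_returns([0.0, 0.0, r], gamma=0.5))  # [2.5, 5.0, 10.0]
```

With a sparse terminal reward, earlier steps only receive credit through the discounted return, which is why the first two entries are nonzero even though their immediate rewards are 0.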


## 5. Model Loading and Authentication

## 6. Agent Testing and Action Processing

In [11]:
# A.1 Set up a minimal poker environment
from poker_game.game_logic import PokerGame
from poker_game.game_state import Action
from ppo.agents import _amount_from_label, _legalize_label

env = PokerGame(num_players=2, starting_stack=100.0, small_blind=0.5, big_blind=1.0, seed=42)
state = env.reset()
actor = state.current_player()                # Current acting player
legal = state.get_valid_actions(actor)

print("Pot =", state.pot, "Legal =", legal)

# A.2 Amount conversion (e.g. 0.5 pot bet)
amt = _amount_from_label("bet_0.50pot", state, actor)
print("bet_0.50pot -> amount =", amt)

# A.3 Legalization: the blinds are already posted preflop, so BET is not in the legal set and the label is downgraded (here to CALL)
action, final_amt = _legalize_label("bet_0.50pot", legal, state, actor)
print("legalized:", action, final_amt)

# A.4 All-in label: RAISE is legal here, so raise_allin should map to a RAISE for the actor's remaining stack
action2, amt2 = _legalize_label("raise_allin", legal, state, actor)
print("illegal label fallback ->", action2, amt2)

Pot = 1.5 Legal = [<Action.FOLD: 'fold'>, <Action.CALL: 'call'>, <Action.RAISE: 'raise'>]
bet_0.50pot -> amount = 0.75
legalized: Action.CALL 0.0
illegal label fallback -> Action.RAISE 99.5
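
The two steps exercised above, sizing a label and then falling back to a legal action, can be sketched without the poker engine. The helpers below are hypothetical stand-ins for `_amount_from_label` and `_legalize_label`; their signatures and the fallback order (check, then call, then fold) are assumptions chosen to match the printed results, not the actual implementation.

```python
import re

def amount_from_label(label, pot, stack):
    """Translate a discrete action label into a chip amount."""
    if label.endswith("_allin"):
        return stack
    match = re.fullmatch(r"(?:bet|raise)_(\d+(?:\.\d+)?)pot", label)
    if match:
        # Pot-fraction sizing, capped at the chips the player has left.
        return min(float(match.group(1)) * pot, stack)
    return 0.0  # fold / check / call carry no sizing

def legalize(kind, legal):
    """Keep the requested action kind if legal, else fall back to a passive one."""
    if kind in legal:
        return kind
    for fallback in ("check", "call", "fold"):
        if fallback in legal:
            return fallback
    raise ValueError("no legal action available")

legal = ["fold", "call", "raise"]
print(amount_from_label("bet_0.50pot", pot=1.5, stack=99.5))  # 0.75
print(legalize("bet", legal))                                 # call
print(amount_from_label("raise_allin", pot=1.5, stack=99.5))  # 99.5
print(legalize("raise", legal))                               # raise
```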


In [12]:
from ppo.agents import RandomAgent
from poker_game.game_logic import PokerGame

env = PokerGame(num_players=2, starting_stack=100.0, small_blind=0.5, big_blind=1.0, seed=7)
state = env.reset()

agent0 = RandomAgent(0, seed=1)
agent1 = RandomAgent(1, seed=2)

steps = 0
while True:
    cur = state.current_player()
    legal = state.get_valid_actions(cur)
    a, amt, info = (agent0 if cur.player_id==0 else agent1).act(state, legal)
    state, done, result = env.step(a, amt)
    steps += 1
    if done or steps > 50:
        print("Hand over. result:", result)
        break

print("Final stacks:", [p.stack for p in env.players])

Hand over. result: {'winners': [0], 'win_type': 'fold', 'pot': 1.5, 'final_board': []}
Final stacks: [100.5, 99.5]
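
A cheap sanity check after each hand is chip conservation: the stacks should sum to the same total before and after, since chips only move between players. The helper below is a hypothetical sketch using a floating-point tolerance; it assumes no rake is taken.

```python
import math

def chips_conserved(initial_stacks, final_stacks, tol=1e-9):
    """True if the total number of chips is unchanged over the hand."""
    return math.isclose(sum(initial_stacks), sum(final_stacks), abs_tol=tol)

print(chips_conserved([100.0, 100.0], [100.5, 99.5]))  # True
```

Here the small blind folded, so player 0 collects the 1.5 pot (a net gain of 0.5) and player 1 loses the 0.5 blind.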


## 7. PPO Training Setup

In [20]:
from ppo.config import PPOConfig
from ppo.rewards import RewardCalculator
from ppo.ppo_trainer import PPOTrainer
from ppo.agents import RandomAgent

cfg = PPOConfig()
cfg.num_episodes = 5
cfg.num_players  = 2  # Must match the number of agents
cfg.save_frequency = 0

rc = RewardCalculator(cfg.big_blind)
trainer = PPOTrainer(cfg, rc)

agents = [RandomAgent(0, seed=1), RandomAgent(1, seed=2)]
env = trainer._init_env()  # Initialize environment

for ep in range(1, cfg.num_episodes + 1):
    trainer.cfg.seed = 42 + ep   # Set seed
    log = trainer.play_one_episode(env=env, agents=agents)  # Play with agents
    print(f"Episode {ep} completed")
    
print("Training setup test completed!")

Episode 1 completed
Episode 2 completed
Episode 3 completed
Episode 4 completed
Episode 5 completed
Training setup test completed!
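
The episodes above only exercise the environment loop; the policy update happens inside `PPOTrainer` and is not shown in this notebook. For orientation, the standard PPO clipped surrogate for a single sample can be sketched as follows. The function name and the default `clip_eps=0.2` are illustrative assumptions, not values read from `PPOConfig`.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio inside the clip range: the objective is just ratio * advantage.
print(clipped_surrogate(ratio=1.0, advantage=2.0))  # 2.0
# Ratio above 1 + eps with a positive advantage: the gain is capped at 1.2 * A.
print(clipped_surrogate(ratio=1.5, advantage=2.0))  # 2.4
```

The clipping removes the incentive to move the new policy far from the one that collected the data within a single update.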


## 8. LLM Agent Testing (Mock Setup)

In [15]:
import types, math, torch
from ppo.agents import LLMAgent, DISCRETE_ACTIONS
from poker_game.game_logic import PokerGame

# 1) Create an LLMAgent, but we won't actually use its model/tokenizer (avoiding loading large models)
agent = LLMAgent.__new__(LLMAgent)  # Bypass __init__
agent.player_id = 0
agent.max_seq_len = 512
agent.temperature = 0.0           # Set to 0: greedy, for easier assertion
agent.top_p = 0.9
agent.use_scoring = True
agent._controller = None
agent._adapter_name = None

# Mock the minimal tokenizer/model fields needed (they won't actually be used)
agent.tokenizer = types.SimpleNamespace(pad_token_id=0, eos_token_id=0)

class _FakeModel:
    """Stand-in model exposing one real CPU parameter via parameters()."""
    def __init__(self): self._p = torch.nn.Parameter(torch.zeros(1))
    def parameters(self): return [self._p]
    def to(self, device): return self

agent.model = _FakeModel()
print("Mock LLM agent created successfully!")

Mock LLM agent created successfully!


In [19]:
# --- Smoke test: PPOTrainer viability check (without loading LLM) ---
from ppo.config import PPOConfig
from ppo.rewards import RewardCalculator
from ppo.ppo_trainer import PPOTrainer
from ppo.agents import RandomAgent

# 1) Small config + reward calculator + Trainer
cfg = PPOConfig()
cfg.num_episodes = 5
cfg.num_players = 2
cfg.steps_per_episode = 50
cfg.save_frequency = 0
rc = RewardCalculator(big_blind=cfg.big_blind)
trainer = PPOTrainer(cfg, reward_calculator=rc)

# 2) Two random agents, avoiding loading large models
agents = [RandomAgent(0, seed=1), RandomAgent(1, seed=2)]

# 3) Test episode execution
log = trainer.play_one_episode(agents=agents)
print("OK. keys:", list(log.keys()))
print("steps =", log["steps"], "stacks:", log["stacks_init"], "→", log["stacks_final"])
print("result =", log.get("result"))

OK. keys: ['steps', 'stacks_init', 'stacks_final', 'result', 'terminal_rewards']
steps = 1 stacks: [100.0, 100.0] → [101.0, 99.0]
result = {'winners': [0], 'win_type': 'showdown', 'pot': 2.0, 'pot_share': 2.0, 'winning_hand': 'Two Pair', 'final_board': ['Ace Of Club', 'Five Of Diamond', 'Two Of Spade', 'King Of Heart', 'King Of Diamond'], 'all_hands': [(0, 'Two Pair'), (1, 'Two Pair')]}


## 9. Final Model Integration Test

⚠️ **Note**: This cell contains the code that previously caused the notebook corruption due to disk space issues. Only run if you have sufficient disk space.

In [18]:
# WARNING: This cell caused the original notebook corruption due to disk quota exceeded
# Only run if you have sufficient disk space and system resources

from transformers import AutoTokenizer
from ppo.config import PPOConfig

# Create a config instance
C = PPOConfig()

# Test tokenizer loading (this should work without loading the full model)
try:
    tok = AutoTokenizer.from_pretrained(C.base_repo_or_path, trust_remote_code=True)
    print("Tokenizer loaded successfully. Vocab size:", len(tok))
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    print("Possible causes include insufficient disk space, missing Hugging Face authentication, or network issues.")

Tokenizer loaded successfully. Vocab size: 128256
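
Given the earlier disk-quota failure, it may help to check free space before downloading anything large. The sketch below uses the standard library only; `has_free_space` is a hypothetical helper, and the 20 GB threshold is a placeholder that should be replaced with the actual size of the model being loaded.

```python
import shutil

def has_free_space(path, required_gb):
    """True if the filesystem containing `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

if has_free_space(".", required_gb=20):
    print("Enough free disk space to proceed.")
else:
    print("Not enough free disk space; skipping model download.")
```

Note that this checks the current working directory; if the Hugging Face cache lives on a different filesystem, pass that directory as `path` instead.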
