# models.py
## What It Contains:

Neural Network Architectures: Defines the core RL agent architecture (e.g., BestPokerModel), which uses a dueling DQN structure with noisy linear layers and residual blocks.

Custom Layers & Blocks: Implements helper components like NoisyLinear (for exploration) and ResidualBlock (for improved gradient flow).

Checkpoint Conversion: Provides a utility function (convert_half_to_full_state_dict) to convert model checkpoints from a reduced (half-poker) encoding to the full input dimensions used by the agent.
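
The dueling structure mentioned above splits the network head into a state-value stream and an advantage stream, then recombines them into Q-values. The following is a minimal sketch of the standard aggregation in plain Python, not the project's actual layers. The name `dueling_q_values` is hypothetical, and it is an assumption that BestPokerModel subtracts the mean advantage rather than using another variant.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Example: one state value and three per-action advantages (made-up numbers).
q_values = dueling_q_values(1.0, [0.5, -0.5, 0.0])
print(q_values)
```

Subtracting the mean advantage keeps the two streams identifiable: the average of the resulting Q-values equals the state value.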


# envs.py
## What It Contains:

BaseFullPokerEnv: A base class implementing the complete poker game logic including dealing cards, betting rounds, stage progression, hand evaluation, and opponent actions.

TrainFullPokerEnv: An extension of the base environment tailored for training. It adds features like tracking all-in actions and modified reward computation.

Helper Functions: Contains a simple poker hand evaluator and functions to compute the belief state for each player.
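
To illustrate what a simple hand evaluator of the kind listed above might look like, here is a sketch that classifies exactly five cards. It is not the evaluator in envs.py: the card format (`(rank, suit)` tuples with ranks 2 to 14), the category list, and the function name `evaluate_five` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical card format: (rank, suit), with rank 2..14 and 14 = ace.
HAND_CATEGORIES = [
    "high card", "pair", "two pair", "three of a kind", "straight",
    "flush", "full house", "four of a kind", "straight flush",
]


def evaluate_five(cards):
    """Return (category_index, tiebreak_ranks) for exactly five cards."""
    ranks = sorted((rank for rank, _ in cards), reverse=True)
    counts = Counter(ranks)
    # Order ranks by multiplicity, then by rank: kings full of twos -> [13, 2].
    ordered = sorted(counts, key=lambda r: (counts[r], r), reverse=True)
    shape = sorted(counts.values(), reverse=True)

    is_flush = len({suit for _, suit in cards}) == 1
    is_straight = len(counts) == 5 and ranks[0] - ranks[4] == 4
    if ranks == [14, 5, 4, 3, 2]:  # wheel: the ace plays low
        is_straight, ordered = True, [5, 4, 3, 2, 1]

    if is_straight and is_flush:
        category = 8
    elif shape == [4, 1]:
        category = 7
    elif shape == [3, 2]:
        category = 6
    elif is_flush:
        category = 5
    elif is_straight:
        category = 4
    elif shape == [3, 1, 1]:
        category = 3
    elif shape == [2, 2, 1]:
        category = 2
    elif shape == [2, 1, 1, 1]:
        category = 1
    else:
        category = 0
    return category, ordered


hand = [(13, "h"), (13, "d"), (13, "s"), (2, "c"), (2, "h")]
category, tiebreak = evaluate_five(hand)
print(HAND_CATEGORIES[category], tiebreak)
```

Because the result is a tuple, two hands can be compared directly with `>` or `<`: the category decides first, and the ordered ranks break ties.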

# utils.py
## What It Contains:

Observation Encoders: Functions (encode_obs and encode_obs_eval) that transform the raw game observations into numerical state vectors suitable for the RL agent.

Epsilon Scheduling: Implements an epsilon_by_frame function that calculates the current epsilon value for epsilon-greedy exploration.
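
As an illustration of how such a schedule is commonly written, the sketch below uses exponential decay from a starting value toward a floor. The decay form and the constants (`eps_start`, `eps_final`, `eps_decay`) are assumptions; the actual epsilon_by_frame in utils.py may use different values or a different curve.

```python
import math


def epsilon_by_frame(frame_idx, eps_start=1.0, eps_final=0.01, eps_decay=30000):
    """Exponentially anneal epsilon from eps_start toward eps_final."""
    return eps_final + (eps_start - eps_final) * math.exp(-frame_idx / eps_decay)


for frame in (0, 30000, 300000):
    print(frame, round(epsilon_by_frame(frame), 4))
```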

Replay Buffer: A class to store and sample experience tuples (state, action, reward, next state, done) during training.
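
A replay buffer of this kind can be sketched with a fixed-size deque and uniform random sampling. This is a generic version under assumed names (`push`, `sample`); the class in utils.py may differ in its method names and in how it batches data.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity):
        # The deque discards the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, regrouped field by field.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from past experience breaks the correlation between consecutive transitions, which is the usual motivation for experience replay.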

Logging Helper: A simple logging function (log_decision) to print debug or decision-related messages when enabled.

# train_rl.py
## What It Contains:

Training Loop: The main training script for the RL agent.

RL Agent Setup: Instantiates the BestPokerModel (from models.py) and uses the training environment (TrainFullPokerEnv from envs.py).

Experience Replay and Target Network: Uses a replay buffer and periodically updates a target network to stabilize training.

Checkpointing: Saves model checkpoints and training metrics periodically to allow resuming training later.
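
The target-network mechanism described above can be sketched in plain Python as follows. The function names, the discount factor, and the sync interval are assumptions for illustration, and parameters are modeled as a plain dictionary rather than real network weights.

```python
import copy


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: r + gamma * max Q_target(s', a'), or r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def maybe_sync(online_params, target_params, step, sync_every=1000):
    """Hard update: return a fresh copy of the online parameters every
    sync_every steps, otherwise keep the existing target parameters."""
    if step % sync_every == 0:
        return copy.deepcopy(online_params)
    return target_params


online = {"w": [0.3, -0.1]}
target = {"w": [0.0, 0.0]}
target = maybe_sync(online, target, step=1000)
print(target, td_target(1.0, [0.5, 2.0], done=False, gamma=0.9))
```

With PyTorch modules, the same hard update is usually written as `target_net.load_state_dict(online_net.state_dict())`, where `target_net` and `online_net` are placeholder names.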

## How to Run:
You can start training by running the training command (either directly or via the main entry point described below). For example:

!python main.py train --checkpoint checkpoints/optional_checkpoint.pt

_(Omit --checkpoint if you want to start from scratch.)_


# simulate.py (or evaluate.py)
## What It Contains:

Simulation/Evaluation Loop: This script sets up a poker environment (BaseFullPokerEnv) for running simulation episodes.

Greedy Policy: Uses the trained agent (loaded from a checkpoint) and runs episodes using a deterministic, greedy policy (i.e., no exploration).

Output: Prints episode rewards and additional game outcome details (like winners, scores, etc.) for analysis.
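
The greedy policy described above amounts to picking the action with the highest predicted Q-value. A minimal sketch, assuming Q-values arrive as a plain list; the optional `legal_actions` mask is an assumption and may not match how simulate.py handles illegal moves.

```python
def greedy_action(q_values, legal_actions=None):
    """Return the index of the highest Q-value, optionally restricted
    to a list of legal action indices."""
    candidates = range(len(q_values)) if legal_actions is None else legal_actions
    return max(candidates, key=lambda a: q_values[a])


print(greedy_action([0.1, 0.7, 0.3]))
print(greedy_action([0.1, 0.7, 0.3], legal_actions=[0, 2]))
```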

## How to Run:
You can evaluate a trained model by running:

!python main.py simulate --checkpoint checkpoints/optional_checkpoint.pt --episodes 10

_(--checkpoint is optional; without it, no trained weights are loaded.)_


# main.py
## What It Contains:

Unified Entry Point: Acts as the central script that ties the project together. It uses subcommands to either launch the training loop or the simulation/evaluation loop.

Subcommand Handling:

The train subcommand calls the training routine from train_rl.py.

The simulate subcommand forwards arguments to the simulation script (simulate.py).
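
Subcommand dispatch of this kind is typically built with `argparse` subparsers. The sketch below only parses arguments and does not call the real training or simulation code; the default for `--episodes` and the help strings are assumptions, while the subcommand and flag names are taken from the commands shown in this document.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Both training variants accept an optional checkpoint to resume from.
    for name in ("train", "train_random"):
        sub = subparsers.add_parser(name, help=f"run the {name} loop")
        sub.add_argument("--checkpoint", default=None)

    simulate = subparsers.add_parser("simulate", help="run evaluation episodes")
    simulate.add_argument("--checkpoint", default=None)
    simulate.add_argument("--episodes", type=int, default=10)  # assumed default
    return parser


args = build_parser().parse_args(
    ["simulate", "--checkpoint", "checkpoints/optional_checkpoint.pt", "--episodes", "10"]
)
print(args.command, args.checkpoint, args.episodes)
```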

## How to Run:

### For Training:

!python main.py train --checkpoint checkpoints/optional_checkpoint.pt

_(Omit --checkpoint if you want to start from scratch.)_

### For Simulation/Evaluation:

!python main.py simulate --checkpoint checkpoints/optional_checkpoint.pt --episodes 10

# train_random.py

## What It Contains:

Training Loop:
This script sets up a full training loop for the RL agent using the BestPokerModel architecture and the TrainFullPokerEnv environment. It is designed to train the agent via an epsilon-greedy strategy, experience replay, and target network updates.

Random Opponent Policy:
A random action policy is explicitly assigned to one of the opponent agents (e.g., opponent with ID 1). This introduces stochastic behavior into the training environment, helping to simulate more varied and unpredictable opponent actions.

Experience Replay and Checkpointing:
The script uses a replay buffer to store experiences and periodically updates a target network to stabilize training. It also saves model checkpoints and training metrics to disk for future resumption or analysis.
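
The random opponent policy described above can be sketched as a function that picks uniformly among the currently legal actions. The action encoding, the `opponent_policies` mapping, and the function signature are hypothetical; the actual hook in TrainFullPokerEnv may look different.

```python
import random


def random_policy(legal_actions, rng=random):
    """Choose uniformly at random among the legal actions."""
    return rng.choice(legal_actions)


# Hypothetical wiring: map opponent ID 1 to the random policy.
opponent_policies = {1: random_policy}

legal = [0, 1, 2]  # made-up action indices, e.g. fold / call / raise
print(opponent_policies[1](legal, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the opponent's behavior reproducible across runs, which helps when debugging.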

## How to Run:

You can start training the agent with a random opponent by running:

!python main.py train_random --checkpoint checkpoints/optional_checkpoint.pt

_(Omit --checkpoint if you want to start from scratch.)_

# Performance Metrics

**Episode Reward:** The cumulative reward obtained in each episode.

**Average Reward:** Typically computed over the last 100 episodes.

**Epsilon Value:** The current exploration rate used in the epsilon-greedy policy.

These metrics are printed to the console at regular intervals (e.g., every 100 episodes) and periodically saved to a CSV file (e.g., every 10,000 episodes) in the checkpoint directory. This logging helps monitor training performance over time and facilitates analysis or debugging.
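
A small helper along these lines could track the metrics listed above. This is a sketch under assumed names (`MetricsLogger`, `record`, `save_csv`) and an assumed CSV column layout, not the logging code used by the training scripts.

```python
import csv
from collections import deque


class MetricsLogger:
    """Track per-episode reward, a moving average, and epsilon."""

    def __init__(self, window=100):
        self.recent = deque(maxlen=window)  # last `window` episode rewards
        self.rows = []

    def record(self, episode, reward, epsilon):
        self.recent.append(reward)
        avg_reward = sum(self.recent) / len(self.recent)
        self.rows.append((episode, reward, avg_reward, epsilon))
        return avg_reward

    def save_csv(self, path):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["episode", "reward", "avg_reward", "epsilon"])
            writer.writerows(self.rows)


logger = MetricsLogger(window=100)
for episode, reward in enumerate([1.0, -2.0, 4.0], start=1):
    avg = logger.record(episode, reward, epsilon=0.5)
print(f"average reward over last {len(logger.recent)} episodes: {avg:.2f}")
```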


# Stop or Kill Script

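Run the following command to stop the program. Note that on Windows it force-terminates every running python.exe process, not only this project's script.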

!taskkill /F /IM python.exe