**BLACKJACK AGENT TRAINING**

**Part 1**

* Uses a Softmax (Categorical distribution) policy head instead of a Sigmoid (Bernoulli distribution) one
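For a two-action problem the two parameterizations describe the same family of policies: a softmax over two logits gives the same "hit" probability as a sigmoid applied to the difference of those logits. The sketch below checks this numerically in plain Python; the logit values and the helper names `softmax` and `sigmoid` are illustrative assumptions, not part of the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function, i.e. the single-logit (Bernoulli) parameterization."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits for (stick, hit); chosen only for illustration.
logits = [0.3, -1.2]
p_softmax = softmax(logits)
p_hit_sigmoid = sigmoid(logits[1] - logits[0])

# Both routes give the same probability of action 1 (hit), up to rounding.
print(p_softmax[1], p_hit_sigmoid)
```

The practical difference is therefore in the code path (`Categorical` with two logits versus `Bernoulli` with one), not in what the policy can represent.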

**Results**

Took 32 minutes to run on CPU

Parameters:

epochs=2000,
learning_rate=0.0003,
batch_size=2048, # Significantly larger batch size recommended for stability
k_epochs=128,
epsilon=0.2,
beta_kl=0.01,
entropy_coeff=0.001,
log_iterations=100,
gamma=0.99

# Imports

In [18]:
import torch
import torch.nn as nn
import torch.optim as optim
# import random
# import numpy as np
from tqdm import tqdm
import gymnasium as gym

# Testing

In [19]:
# `sab=True` uses the Sutton & Barto version of the rules.
# Passing `render_mode="human"` creates a pygame popup window to analyze play.
env = gym.make("Blackjack-v1", sab=True)

In [20]:
# Reset the Environment, and get an observation
obs, _ = env.reset()

Observation Space
* player_sum: The sum of the player's cards (an integer from 4 to 21 while the hand is live; it exceeds 21 once the player busts).
* dealer_card: The value of the dealer's visible card (1–10).
* usable_ace: True if the player has a usable ace (counts as 11), otherwise False.
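How `player_sum` and `usable_ace` relate can be sketched in a few lines of plain Python. This assumes cards are encoded as integers 1 to 10 (1 for an ace, 10 for tens and face cards); `hand_value` is a hypothetical helper written for illustration, not a Gymnasium function.

```python
def hand_value(cards):
    """Return (player_sum, usable_ace) for a list of card values.

    An ace is "usable" when counting it as 11 instead of 1
    does not push the hand over 21.
    """
    total = sum(cards)  # every ace counted as 1 here
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace

print(hand_value([1, 6]))     # (17, True): the ace counts as 11
print(hand_value([1, 6, 9]))  # (16, False): the ace falls back to 1
```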

In [21]:
print(obs)

(19, 2, 0)


In [22]:
done = False
while not done:
    action = env.action_space.sample()
    obs, reward, done, truncated, info = env.step(action)

* obs: New observation after the action.
* reward: +1 for a win, 0 for a draw, -1 for a loss. Steps that do not end the episode return 0.
* done: Whether the episode has ended.
* truncated: Whether the episode was truncated (usually False here).
* info: Extra info (often empty in Blackjack).

In [23]:
print(reward)

1.0


In [24]:
env.action_space

Discrete(2)

The Blackjack action space is Discrete(2):
* 0 = Stick
* 1 = Hit

# Agent

In [25]:
class BlackJackAgent(nn.Module):
    def __init__(self, obs_size=3, hidden_size=10, output_size=2):
        super(BlackJackAgent, self).__init__()
        self.layer_1 = nn.Linear(obs_size, hidden_size)
        self.layer_2 = nn.Linear(hidden_size, output_size)
        self.action_probs_activation_layer = nn.Softmax(dim=1)
    
    def forward(self, x):
        x = torch.relu(self.layer_1(x))
        logits = self.layer_2(x)
        return logits       # later use nn.Softmax to get probabilities

    def get_action_probs(self, logits):
        """Get the probabilities of each action."""
        return self.action_probs_activation_layer(logits)
    
    def sample_best_action(self, obs):
        """Get the deterministic action with the highest probability
        for a given observation.
        
        Parameters:
            obs (torch.tensor): the agent's current observable state in the playable environment. Expected shape is either `(num_features,)` for a single observation
            or `(batch_size, num_features)` for a batch of observations.
        
        Returns:
            action (int or torch.tensor): 
                - If `obs` is a single observation (i.e., `obs.dim() == 1`), returns a scalar `int` representing the chosen action. 

                - If `obs` is a batch of observations (i.e., `obs.dim() > 1`),
                returns a `torch.Tensor` of `int`s, where each element is the
                chosen action for the corresponding observation in the batch"""
        # Ensure observation is a tensor and has a batch dimension if it's a single observation
        if obs.dim() == 1:
            obs = obs.unsqueeze(0) # Add a batch dimension if it's a single observation

        logits = self.forward(obs)
        probs = self.get_action_probs(logits)
        action = torch.argmax(probs, dim=1) 
        if obs.size(0) == 1:    # Only one observation in the batch dimension
            return action.item() # Returns a Python scalar for a single observation
        else:
            return action # Returns a tensor of actions for a batch

# Training Loop

In [26]:
def training_blackjack_agent(epochs=50, learning_rate=0.0001, batch_size=64, gamma=0.99, k_epochs=64, epsilon=0.2, beta_kl=0.01, max_grad_norm=0.5, entropy_coeff=0.01, log_iterations=10) -> BlackJackAgent:
    print(f"Training BlackJack Agent's Policy with {epochs} epochs, {learning_rate} learning rate, batch size {batch_size}, and KL beta {beta_kl}.")
    env = gym.make("Blackjack-v1", sab=True) # `sab=True` uses the Sutton & Barto version
    New_Policy = BlackJackAgent()   # STEP 1 || CREATE π_new
    optimizer = optim.Adam(params=New_Policy.parameters(), lr=learning_rate)

    # STEP 2 || FOR I ITERATION STEPS OMITTED 
    # STEP 3 || CREATE REFERENCE MODEL OMITTED
    for epoch in tqdm(range(epochs), desc=f"Main Epoch (Outer Loop)", leave=False):     # STEP 4 || FOR M ITERATION STEPS
        
        batch_trajectories = []     # Will contain a batch of trajectories

        # STEP 5 || Sample a batch D_b from D --> OMITTED 
        # STEP 6 || Update the old policy model π_old <- π_new
        Policy_Old = BlackJackAgent()
        Policy_Old.load_state_dict(New_Policy.state_dict())
        Policy_Old.eval()   # Inference mode for the frozen copy; gradient tracking is disabled separately by torch.no_grad()

        # --- STEP 7 || Collect a Batch of Experiences ---
        # Loop Agent prediction, recording trajectories to lists:
        for i in range(batch_size):
            
            # Create a local dictionary for this episode's trajectory
            episode_trajectory = {"states": [], "actions": [], "rewards": [], "log_probs": []}
            obs, _ = env.reset()
            done, truncated = False, False
            while not done and not truncated:
                obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0) # add batch dim to feed to NN; TENSOR SHAPE=(1, 3)
                with torch.no_grad():
                    logits = Policy_Old(obs_tensor)
                    dist = torch.distributions.Categorical(logits=logits) # Create a Stochastic Distribution to sample from
                    action = dist.sample() # Tensor of shape (1,)
                    log_prob = dist.log_prob(action)    # Tensor of shape (1,)
                # tuple, float, boolean, boolean, dict
                next_obs, reward, done, truncated, info = env.step(action.item())

                # Store completed episode_trajectory information
                episode_trajectory["states"].append(obs)
                episode_trajectory["actions"].append(action.item())
                episode_trajectory["rewards"].append(reward)
                episode_trajectory["log_probs"].append(log_prob)
                
                obs = next_obs  # Update the observation
                if (truncated):
                    print("Debug: EPISODE TRUNCATED")
            # Add completed episode information into Batch of Experiences 
            batch_trajectories.append(episode_trajectory)

        # These lists will hold data from ALL episodes in the current batch for Advantage Calculation
        all_states, all_actions, all_old_log_probs, all_discounted_rewards= [], [], [], []

        # STEP 8 || Calculate Discounted Rewards
        for episode_trajectory in batch_trajectories:   # Loop through all the episode trajectories
            rewards = episode_trajectory["rewards"]
            states = episode_trajectory["states"]
            actions = episode_trajectory["actions"]
            log_probs = episode_trajectory["log_probs"]
            
            discounted_reward = 0
            returns_for_episode = []
            for reward in reversed(rewards):
                discounted_reward = reward + gamma * discounted_reward
                returns_for_episode.insert(0, discounted_reward)

            # Add each trajectory information for the batch if the states list is populated
            if states:
                all_states.extend(states)
                all_actions.extend(actions)
                all_old_log_probs.extend(log_probs)
                all_discounted_rewards.extend(returns_for_episode)  # keep appending to the discounted rewards list


        # --- Debugging: Print lengths and samples of collected data ---
        # print(f"DEBUG (Epoch {epoch + 1}): Length of all_states: {len(all_states)}")
        # print(f"DEBUG (Epoch {epoch + 1}): Length of all_actions: {len(all_actions)}")
        # print(f"DEBUG (Epoch {epoch + 1}): Length of all_old_log_probs (list): {len(all_old_log_probs)}")
        # print(f"DEBUG (Epoch {epoch + 1}): Length of all_advantages (list): {len(all_discounted_rewards)}")
        # Example content check (first 2 elements if available)
        # --- End Debugging ---


        # --- IMPORTANT: Pre-tensorization checks and conversions ---
        # Check if any essential list is empty before converting to tensors.
        # This prevents RuntimeError due to dimension mismatches with empty tensors.
        if not all_states or not all_actions or not all_old_log_probs or not all_discounted_rewards:
            print(f"Warning: Epoch {epoch + 1}: Insufficient data collected for optimization. "
                f"Skipping policy update for this epoch.")
            # Print specific counts to help diagnose which list is empty
            print(f"  Counts: States={len(all_states)}, Actions={len(all_actions)}, "
                f"LogProbs={len(all_old_log_probs)}, Returns={len(all_discounted_rewards)}")
            continue # Skip to the next epoch if no meaningful data was collected


        # Convert all collected batch list data into PyTorch tensors.
        # The leading dimension is the total number of time steps in the batch, not batch_size,
        # because an episode can contain more than one step.
        all_states_tensor = torch.tensor(all_states, dtype=torch.float32)    # Shape: (total_steps, 3)
        all_actions_tensor = torch.tensor(all_actions, dtype=torch.long)    # Shape: (total_steps,)
        # Concatenate the per-step log_prob tensors (each of shape (1,)) into one 1-D tensor
        all_old_log_probs_tensor = torch.cat(all_old_log_probs) # Log probs under the 'old' policy; shape: (total_steps,)
        all_discounted_rewards_tensor = torch.tensor(all_discounted_rewards, dtype=torch.float32)    # Shape: (total_steps,)

        # STEP 9 || Calculate the Advantage of each Time Step for each Trajectory using normalization
        all_advantages_tensor = (all_discounted_rewards_tensor - all_discounted_rewards_tensor.mean()) / (all_discounted_rewards_tensor.std() + 1e-8)

        # Detach these tensors from any computation graph history
        # as they represent fixed data for the policy updates in k_epochs.
        # This prevents the "RuntimeError: Trying to backward through the graph a second time".
        all_states_tensor = all_states_tensor.detach()
        all_actions_tensor = all_actions_tensor.detach()
        all_old_log_probs_tensor = all_old_log_probs_tensor.detach()
        all_advantages_tensor = all_advantages_tensor.detach()

        New_Policy.train()  # prepare NN for updates
        loss_hist = []  # Track the loss of each optimization step

        # --- STEP 10 || GRPO Optimization ---
        for k_epoch in tqdm(range(k_epochs), desc=f"Epoch {epoch+1}/{epochs} (Inner K-Epochs)", leave=False):
            new_logits = New_Policy(all_states_tensor)
            new_dist = torch.distributions.Categorical(logits=new_logits)
            new_log_probs = new_dist.log_prob(all_actions_tensor)
            entropy = new_dist.entropy().mean() # Calculate entropy for regularization

            R1_ratios = torch.exp(new_log_probs - all_old_log_probs_tensor)  # Exponent trick

            unclipped_surrogates = R1_ratios * all_advantages_tensor 
            clipped_surrogates = torch.clamp(input=R1_ratios, min=1.0-epsilon, max=1.0+epsilon) * all_advantages_tensor

            policy_loss = -torch.min(unclipped_surrogates, clipped_surrogates).mean()

            # --- KL Divergence Calculation ---
            # Create distributions for old policies using the trajectory states
            with torch.no_grad():
                old_logits = Policy_Old(all_states_tensor)
            old_dist = torch.distributions.Categorical(logits=old_logits)

            # Calculate KL divergence per sample, then take the mean over the batch
            kl_div_per_sample = torch.distributions.kl.kl_divergence(p=new_dist, q=old_dist)
            kl_loss = kl_div_per_sample.mean() # Mean over the batch

            # Calculate Total Loss for GRPO step and store it in the loss history
            total_loss = policy_loss + beta_kl * kl_loss - entropy_coeff * entropy
            loss_hist.append(total_loss.detach())   # Store a detached copy so each step's graph can be freed

            # STEP 11 || Policy Updates
            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(New_Policy.parameters(), max_norm=max_grad_norm)
            optimizer.step()    # Update policy parameters (descent on the loss, i.e. ascent on the surrogate objective)
        
        
        # --- Logging Metrics ---
        if (epoch + 1) % log_iterations == 0:
            # 1. Concatenate all loss tensors into one tensor, ensuring they are detached  to prevent gradient tracking
            losses_tensor = torch.stack([loss.detach().cpu() for loss in loss_hist])    # Shape: (N,)

            # 2. Calculate the mean of the concatenated tensor
            mean_loss = losses_tensor.mean()

            print(f"Epoch {epoch + 1}/{epochs}, Mean Loss: {mean_loss.item():.4f}, Mean Ratio: {R1_ratios.detach().mean().item():.5f}, Entropy Term: {entropy.item():.5f}")

            avg_reward = sum(sum(ep["rewards"]) for ep in batch_trajectories) / batch_size
            print(f"Average reward per episode in batch: {avg_reward:.2f}")

    New_Policy.eval()

    env.close() # Close the environment after training
    print("Training complete.")
    return New_Policy # Return the trained policy
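
The policy term in STEP 10 can be followed one time step at a time. The sketch below is a plain-Python restatement of the `R1_ratios`, `torch.clamp`, and `torch.min` lines, leaving out the KL and entropy terms; `clipped_surrogate_loss` and the sample numbers are assumptions for illustration, not values from a training run.

```python
import math

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean negative clipped surrogate over a batch of time steps."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

# Two steps with positive advantage: one action became much more likely
# under the new policy (ratio 1.8), the other much less likely (ratio 0.2).
old_lp = [math.log(0.5), math.log(0.5)]
new_lp = [math.log(0.9), math.log(0.1)]
adv = [1.0, 1.0]

loss = clipped_surrogate_loss(new_lp, old_lp, adv)
print(loss)  # first term is capped at 1.2, second stays at 0.2
```

With a positive advantage, the clip removes any incentive to push the ratio above `1 + epsilon`, while a ratio that has dropped is left unclipped so the gradient can still pull it back up.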

In [27]:
_ = training_blackjack_agent()

Training BlackJack Agent's Policy with 50 epochs, 0.0001 learning rate, batch size 64, and KL beta 0.01.


Main Epoch (Outer Loop):  20%|██        | 10/50 [00:01<00:04,  8.74it/s]

Epoch 10/50, Mean Loss: -0.0474, Mean Ratio: 1.01407, Entropy Term: 0.67201
Average reward per episode in batch: -0.23


Main Epoch (Outer Loop):  40%|████      | 20/50 [00:02<00:03,  8.94it/s]

Epoch 20/50, Mean Loss: -0.0099, Mean Ratio: 0.99448, Entropy Term: 0.30686
Average reward per episode in batch: -0.03


Main Epoch (Outer Loop):  60%|██████    | 30/50 [00:03<00:02,  9.05it/s]

Epoch 30/50, Mean Loss: -0.0045, Mean Ratio: 1.00595, Entropy Term: 0.27528
Average reward per episode in batch: -0.16


Main Epoch (Outer Loop):  80%|████████  | 40/50 [00:04<00:01,  9.15it/s]

Epoch 40/50, Mean Loss: -0.0054, Mean Ratio: 1.00265, Entropy Term: 0.33185
Average reward per episode in batch: -0.22


                                                                        

Epoch 50/50, Mean Loss: -0.0171, Mean Ratio: 0.98943, Entropy Term: 0.38757
Average reward per episode in batch: -0.16
Training complete.
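
To read the "Entropy Term" column: it is the mean entropy (in nats) of the policy's two-action distribution over the states in the batch. A uniform policy has the maximum value, ln 2 ≈ 0.693, and a deterministic policy has 0. The sketch below computes the entropy of a single two-action distribution; `binary_entropy` is a hypothetical helper, and because the logged value is an average over many states it cannot be mapped back to one probability.

```python
import math

def binary_entropy(p_hit):
    """Entropy in nats of a two-action policy that hits with probability p_hit."""
    if p_hit in (0.0, 1.0):
        return 0.0  # a deterministic policy has zero entropy
    return -(p_hit * math.log(p_hit) + (1.0 - p_hit) * math.log(1.0 - p_hit))

for p in (0.5, 0.9, 0.99):
    print(f"p_hit={p:.2f} -> entropy={binary_entropy(p):.4f}")
```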




Training BlackJack Agent's Policy with 10 epochs, 0.0001 learning rate, batch size 4, and KL beta 0.01.
* Batch of Trajectories:
* [{'states': [(12, 10, 0)], 'actions': [0], 'rewards': [-1.0], 'log_probs': [tensor([-0.1239])]}, 
* {'states': [(20, 7, 0)], 'actions': [0], 'rewards': [1.0], 'log_probs': [tensor([-0.0815])]}, 
* {'states': [(12, 1, 0), (17, 1, 0)], 'actions': [1, 1], 'rewards': [0.0, -1.0], 'log_probs': [tensor([-1.5968]), tensor([-1.9474])]}, 
* {'states': [(6, 6, 0)], 'actions': [0], 'rewards': [-1.0], 'log_probs': [tensor([-0.2144])]}, 
* {'states': [(7, 4, 0)], 'actions': [0], 'rewards': [-1.0], 'log_probs': [tensor([-0.2734])]}, 
* {'states': [(13, 3, 1)], 'actions': [0], 'rewards': [-1.0], 'log_probs': [tensor([-0.1471])]}, 
* {'states': [(15, 10, 0)], 'actions': [0], 'rewards': [-1.0], 'log_probs': [tensor([-0.1000])]}, 
* {'states': [(12, 10, 0)], 'actions': [0], 'rewards': [1.0], 'log_probs': [tensor([-0.1239])]}, 
* {'states': [(14, 7, 0)], 'actions': [0], 'rewards': [-1.0], 'log_probs': [tensor([-0.1320])]}]
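
As a worked example of STEP 8 and STEP 9, the sketch below recomputes the discounted returns and normalized advantages for the first three episodes of the dump above, in plain Python. The helper names are assumptions; `normalize` uses the sample standard deviation (dividing by n - 1), which matches the default behaviour of `torch.std`.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns

def normalize(values, eps=1e-8):
    """Shift to zero mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]

# Rewards of the first three episodes listed above.
episode_rewards = [[-1.0], [1.0], [0.0, -1.0]]

flat_returns = [g for rewards in episode_rewards for g in discounted_returns(rewards)]
advantages = normalize(flat_returns)

print(flat_returns)  # [-1.0, 1.0, -0.99, -1.0]
print(advantages)
```

In the two-step episode the first step earns no immediate reward but still receives a return of -0.99, because the loss on the following step is discounted back to it.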

In [28]:
# Example usage (assuming you have a way to call this function, e.g., in a main block)
if __name__ == '__main__':
    # You can adjust these parameters as needed
    # Using a larger batch_size for more stable training and to reduce empty batch issues
    trained_policy = training_blackjack_agent(
        epochs=2000,
        learning_rate=0.0003,
        batch_size=2048, # Significantly larger batch size recommended for stability
        k_epochs=128,
        epsilon=0.2,
        beta_kl=0.01,
        entropy_coeff=0.001,
        log_iterations=100,
        gamma=0.99
    )

    print("\nTesting the trained policy:")
    test_env = gym.make("Blackjack-v1", sab=True)
    total_test_rewards = 0
    num_test_episodes = 1000

    for _ in range(num_test_episodes):
        obs, _ = test_env.reset()
        done = False
        truncated = False
        episode_reward = 0
        while not done and not truncated:
            obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                action = trained_policy.sample_best_action(obs_tensor)
            obs, reward, done, truncated, _ = test_env.step(action)
            episode_reward += reward
        total_test_rewards += episode_reward

    print(f"Average reward over {num_test_episodes} test episodes: {total_test_rewards / num_test_episodes:.4f}")
    test_env.close()

Training BlackJack Agent's Policy with 2000 epochs, 0.0003 learning rate, batch size 2048, and KL beta 0.01.


Main Epoch (Outer Loop):   5%|▌         | 100/2000 [01:34<37:02,  1.17s/it]

Epoch 100/2000, Mean Loss: -0.0008, Mean Ratio: 1.00103, Entropy Term: 0.15603
Average reward per episode in batch: -0.07


Main Epoch (Outer Loop):  10%|█         | 200/2000 [03:22<38:28,  1.28s/it]

Epoch 200/2000, Mean Loss: -0.0016, Mean Ratio: 1.00027, Entropy Term: 0.12882
Average reward per episode in batch: -0.06


Main Epoch (Outer Loop):  15%|█▌        | 300/2000 [05:33<38:31,  1.36s/it]

Epoch 300/2000, Mean Loss: -0.0005, Mean Ratio: 0.99999, Entropy Term: 0.10107
Average reward per episode in batch: -0.05


Main Epoch (Outer Loop):  20%|██        | 400/2000 [07:57<34:28,  1.29s/it]

Epoch 400/2000, Mean Loss: -0.0009, Mean Ratio: 0.99992, Entropy Term: 0.09868
Average reward per episode in batch: -0.06


Main Epoch (Outer Loop):  25%|██▌       | 500/2000 [10:07<33:37,  1.35s/it]

Epoch 500/2000, Mean Loss: -0.0019, Mean Ratio: 0.99884, Entropy Term: 0.09913
Average reward per episode in batch: -0.07


Main Epoch (Outer Loop):  30%|███       | 600/2000 [12:20<32:46,  1.40s/it]

Epoch 600/2000, Mean Loss: -0.0013, Mean Ratio: 0.99946, Entropy Term: 0.08959
Average reward per episode in batch: -0.04


Main Epoch (Outer Loop):  35%|███▌      | 700/2000 [14:43<32:49,  1.52s/it]

Epoch 700/2000, Mean Loss: -0.0012, Mean Ratio: 0.99922, Entropy Term: 0.08814
Average reward per episode in batch: -0.01


Main Epoch (Outer Loop):  40%|████      | 800/2000 [17:01<26:28,  1.32s/it]

Epoch 800/2000, Mean Loss: -0.0008, Mean Ratio: 0.99920, Entropy Term: 0.08933
Average reward per episode in batch: -0.03


Main Epoch (Outer Loop):  45%|████▌     | 900/2000 [19:14<24:06,  1.31s/it]

Epoch 900/2000, Mean Loss: -0.0015, Mean Ratio: 0.99796, Entropy Term: 0.07866
Average reward per episode in batch: -0.07


Main Epoch (Outer Loop):  50%|█████     | 1000/2000 [21:32<23:45,  1.43s/it]

Epoch 1000/2000, Mean Loss: -0.0006, Mean Ratio: 1.00083, Entropy Term: 0.08190
Average reward per episode in batch: -0.05


Main Epoch (Outer Loop):  55%|█████▌    | 1100/2000 [23:54<16:57,  1.13s/it]

Epoch 1100/2000, Mean Loss: -0.0017, Mean Ratio: 0.99970, Entropy Term: 0.08072
Average reward per episode in batch: -0.06


Main Epoch (Outer Loop):  60%|██████    | 1200/2000 [25:27<12:28,  1.07it/s]

Epoch 1200/2000, Mean Loss: -0.0013, Mean Ratio: 0.99900, Entropy Term: 0.07769
Average reward per episode in batch: -0.04


Main Epoch (Outer Loop):  65%|██████▌   | 1300/2000 [27:00<10:55,  1.07it/s]

Epoch 1300/2000, Mean Loss: -0.0012, Mean Ratio: 0.99976, Entropy Term: 0.07350
Average reward per episode in batch: -0.07


Main Epoch (Outer Loop):  70%|███████   | 1400/2000 [28:36<09:08,  1.09it/s]

Epoch 1400/2000, Mean Loss: -0.0015, Mean Ratio: 0.99885, Entropy Term: 0.08780
Average reward per episode in batch: -0.02


Main Epoch (Outer Loop):  75%|███████▌  | 1500/2000 [30:11<07:42,  1.08it/s]

Epoch 1500/2000, Mean Loss: -0.0008, Mean Ratio: 1.00010, Entropy Term: 0.07182
Average reward per episode in batch: -0.02


Main Epoch (Outer Loop):  80%|████████  | 1600/2000 [31:46<06:15,  1.07it/s]

Epoch 1600/2000, Mean Loss: -0.0013, Mean Ratio: 0.99853, Entropy Term: 0.07718
Average reward per episode in batch: -0.00


Main Epoch (Outer Loop):  85%|████████▌ | 1700/2000 [33:20<04:44,  1.05it/s]

Epoch 1700/2000, Mean Loss: -0.0017, Mean Ratio: 0.99882, Entropy Term: 0.08602
Average reward per episode in batch: -0.04


Main Epoch (Outer Loop):  90%|█████████ | 1800/2000 [35:00<04:16,  1.28s/it]

Epoch 1800/2000, Mean Loss: -0.0012, Mean Ratio: 0.99997, Entropy Term: 0.08081
Average reward per episode in batch: -0.02


Main Epoch (Outer Loop):  95%|█████████▌| 1900/2000 [37:16<02:21,  1.41s/it]

Epoch 1900/2000, Mean Loss: -0.0012, Mean Ratio: 0.99921, Entropy Term: 0.07588
Average reward per episode in batch: -0.02


                                                                            

Epoch 2000/2000, Mean Loss: -0.0010, Mean Ratio: 0.99996, Entropy Term: 0.07215
Average reward per episode in batch: -0.06
Training complete.

Testing the trained policy:
Average reward over 1000 test episodes: -0.0780




In [29]:
trained_policy

BlackJackAgent(
  (layer_1): Linear(in_features=3, out_features=10, bias=True)
  (layer_2): Linear(in_features=10, out_features=2, bias=True)
  (action_probs_activation_layer): Softmax(dim=1)
)

In [30]:
test_env = gym.make("Blackjack-v1", render_mode="rgb_array", sab=True)
total_test_rewards = 0

In [31]:
num_test_episodes = 10

In [32]:
print("Testing Blackjack Agent")
for episode in range(num_test_episodes):
    print(f"Resetting env for episode: {episode+1}")
    obs, _ = test_env.reset()
    done = False
    truncated = False
    episode_reward = 0
    while not done and not truncated:
        obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            action = trained_policy.sample_best_action(obs_tensor)
            print(f"obs_tensor: {obs_tensor} || Action taken: {action}")
        obs, reward, done, truncated, _ = test_env.step(action)
        episode_reward += reward
        if (truncated): print("truncated")
    print(f"Reward: {episode_reward} || Final Observation before reward: {obs}")

Testing Blackjack Agent
Resetting env for episode: 1
obs_tensor: tensor([[11., 10.,  0.]]) || Action taken: 1
obs_tensor: tensor([[15., 10.,  0.]]) || Action taken: 1
obs_tensor: tensor([[20., 10.,  0.]]) || Action taken: 0
Reward: 0.0 || Final Observation before reward: (20, 10, 0)
Resetting env for episode: 2
obs_tensor: tensor([[12.,  7.,  0.]]) || Action taken: 1
obs_tensor: tensor([[19.,  7.,  0.]]) || Action taken: 0
Reward: 1.0 || Final Observation before reward: (19, 7, 0)
Resetting env for episode: 3
obs_tensor: tensor([[15., 10.,  0.]]) || Action taken: 1
Reward: -1.0 || Final Observation before reward: (25, 10, 0)
Resetting env for episode: 4
obs_tensor: tensor([[12., 10.,  0.]]) || Action taken: 1
obs_tensor: tensor([[18., 10.,  0.]]) || Action taken: 0
Reward: -1.0 || Final Observation before reward: (18, 10, 0)
Resetting env for episode: 5
obs_tensor: tensor([[13.,  3.,  0.]]) || Action taken: 0
Reward: 1.0 || Final Observation before reward: (13, 3, 0)
Resetting env for 

In [33]:
# Run to safely close the gym environments
env.close()
test_env.close()

The output above does not show the dealer's final hand, so the reason for each reward cannot be read from it. Reading the dealer's final hand from the environment, or adding custom logging inside the environment, would supply the information needed to explain each reward.