The goal of this project is to understand and compare different Reinforcement Learning (RL) training strategies for solving the classic Blackjack card game. Blackjack is simple enough for tabular methods but still illustrates important RL trade-offs:
- On-Policy vs Off-Policy learning
- Monte Carlo vs Temporal-Difference (TD) updates
- Importance Sampling and its effect on variance
- How training time, sample efficiency, and final policy quality interact
The experiments use Gymnasium’s Blackjack-v1 environment:
- The agent learns to stick or hit to maximize its return.
- States are defined by: player sum, dealer’s visible card, and usable ace.
The agent’s goal is to approximate basic strategy, the near-optimal play for Blackjack, which realistically yields a slight player disadvantage due to the house edge.
- ✅ 1. On-Policy Monte Carlo Control (ε-greedy)
In on-policy Monte Carlo control, the agent learns about its own behavior by acting with the same policy it’s trying to improve. It follows an ε-greedy approach, which means it usually picks what it thinks is the best action but occasionally explores other options to cover more states over time. The value estimates are only updated at the end of each episode, so the agent needs to experience full trajectories to learn.
This method is simple and stable for small problems but requires a lot of episodes to see all possible states and actions.
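The two ingredients described above, ε-greedy action selection and an end-of-episode value update, can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `epsilon_greedy` and `mc_update`, the first-visit variant, and the sample-average step size are all assumptions. The action encoding (0 = stick, 1 = hit) follows Blackjack-v1.

```python
import random
from collections import defaultdict

N_ACTIONS = 2  # Blackjack-v1: 0 = stick, 1 = hit


def epsilon_greedy(Q, state, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = Q[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])


def mc_update(Q, counts, episode, gamma=1.0):
    """First-visit Monte Carlo update from one finished episode.

    episode is a list of (state, action, reward) tuples in time order.
    """
    # Remember the first time step at which each (state, action) occurs.
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)

    G = 0.0
    # Walk backwards so G accumulates the discounted return from step t.
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        G = gamma * G + r
        if first_visit[(s, a)] == t:
            counts[s][a] += 1
            # Incremental sample average of the observed returns.
            Q[s][a] += (G - Q[s][a]) / counts[s][a]


# Tables keyed by (player_sum, dealer_card, usable_ace).
Q = defaultdict(lambda: [0.0] * N_ACTIONS)
counts = defaultdict(lambda: [0] * N_ACTIONS)

# One hand: hit on 13, then stick on 18 and win.
episode = [((13, 10, False), 1, 0.0), ((18, 10, False), 0, 1.0)]
mc_update(Q, counts, episode)
```

Because nothing is updated until the hand is over, both visited state-action pairs receive the same final return of +1 in this example.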
- 🔄 2. Off-Policy Monte Carlo (Ordinary Importance Sampling)
Off-policy Monte Carlo with ordinary importance sampling uses a separate behavior policy (often fully random) to generate episodes, while learning about a different, greedy target policy. To account for the mismatch between what it does and what it wants to learn, it weights each return by an importance sampling (IS) ratio: the probability of the observed actions under the target policy divided by their probability under the behavior policy.
This gives you the flexibility to reuse any data, but it comes with very high variance, which can make training inefficient when episodes don’t match well with the target policy.
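A rough sketch of the ordinary IS update is shown below, assuming a uniformly random behavior policy over two actions (so each action has probability 0.5) and a greedy target policy. The function name `ordinary_is_update` and the every-visit, incremental-average form are assumptions made for illustration, not the repository's implementation.

```python
from collections import defaultdict

N_ACTIONS = 2  # 0 = stick, 1 = hit


def greedy_action(Q, state):
    values = Q[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])


def ordinary_is_update(Q, counts, episode, behavior_prob=0.5, gamma=1.0):
    """Every-visit off-policy MC update with ordinary importance sampling.

    episode: list of (state, action, reward) generated by a behavior policy
    that picks each action with probability behavior_prob.
    """
    G = 0.0
    rho = 1.0  # product of target/behavior ratios for the steps AFTER this one
    for s, a, r in reversed(episode):
        G = gamma * G + r
        counts[s][a] += 1
        # Ordinary IS: plain average of the weighted returns rho * G.
        Q[s][a] += (rho * G - Q[s][a]) / counts[s][a]
        # Fold this step's ratio in for the earlier time steps.
        # Greedy target: probability 1 for the greedy action, 0 otherwise.
        if a != greedy_action(Q, s):
            rho = 0.0
        else:
            rho /= behavior_prob


Q = defaultdict(lambda: [0.0] * N_ACTIONS)
counts = defaultdict(lambda: [0] * N_ACTIONS)

# One hand: hit on 13, then stick on 18 and win.
episode = [((13, 10, False), 1, 0.0), ((18, 10, False), 0, 1.0)]
ordinary_is_update(Q, counts, episode)
```

After this single episode the estimate for hitting on 13 is 2.0, even though no Blackjack hand in this setup can return more than about 1. Unbounded weighted returns like this are where the high variance comes from.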
- ⚖️ 3. Off-Policy Monte Carlo (Weighted Importance Sampling)
Weighted importance sampling works the same way as ordinary importance sampling but adds a normalization step to control the variance. The agent still collects episodes using a behavior policy and wants to learn about a better target policy, but it scales its updates so that extreme weights don’t throw off the learning process as much.
This approach reduces the variance of off-policy updates, making it more practical than ordinary IS, although it can still be less efficient than on-policy methods when you can sample directly.
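The normalization step can be sketched with the incremental weighted IS update from Sutton & Barto, again assuming a uniformly random behavior policy (probability 0.5 per action) and a greedy target policy. The names `weighted_is_update` and `C` (the cumulative weight table) are illustrative choices, not taken from the repository.

```python
from collections import defaultdict

N_ACTIONS = 2  # 0 = stick, 1 = hit


def greedy_action(Q, state):
    values = Q[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])


def weighted_is_update(Q, C, episode, behavior_prob=0.5, gamma=1.0):
    """Off-policy MC update with weighted importance sampling.

    C[s][a] accumulates the importance weights seen for (s, a); dividing by
    it turns the estimate into a weighted average of observed returns.
    """
    G = 0.0
    W = 1.0  # importance weight for the steps AFTER the current one
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[s][a] += W
        # Move Q toward G by the fraction of total weight this return carries.
        Q[s][a] += (W / C[s][a]) * (G - Q[s][a])
        if a != greedy_action(Q, s):
            break  # the target policy would not act this way: weight is zero
        W /= behavior_prob


Q = defaultdict(lambda: [0.0] * N_ACTIONS)
C = defaultdict(lambda: [0.0] * N_ACTIONS)

# Same hand as before: hit on 13, then stick on 18 and win.
episode = [((13, 10, False), 1, 0.0), ((18, 10, False), 0, 1.0)]
weighted_is_update(Q, C, episode)
```

Here the estimate for hitting on 13 is 1.0 rather than 2.0, because a weighted average can never leave the range of returns actually observed.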
- ⚡ 4. Q-Learning (Off-Policy TD Control)
Q-Learning takes a different approach by using temporal-difference updates instead of waiting until the end of an episode. It updates its Q-values step by step as it goes, bootstrapping with the estimated value of the next state’s best action. The agent learns about an optimal greedy policy while still exploring with an ε-greedy behavior.
This makes Q-Learning stable and practical, and its step-by-step updates extend more naturally to larger state spaces (continuous ones require function approximation rather than a table). It does need careful tuning of the learning rate and exploration to get the best results.
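The one-step update at the heart of Q-Learning can be sketched as below. The function name `q_learning_update` and the default values for `alpha` and `gamma` are assumptions for illustration; the project's actual hyperparameters are not stated here.

```python
from collections import defaultdict

N_ACTIONS = 2  # 0 = stick, 1 = hit


def q_learning_update(Q, state, action, reward, next_state, terminated,
                      alpha=0.1, gamma=1.0):
    """Apply one temporal-difference update after a single environment step."""
    # Bootstrap from the best action in the next state; a terminal state
    # has no future value.
    best_next = 0.0 if terminated else max(Q[next_state])
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])


Q = defaultdict(lambda: [0.0] * N_ACTIONS)
Q[(18, 10, False)] = [0.5, -0.2]

# Non-terminal step: hit on 13, land on 18, no reward yet.
q_learning_update(Q, (13, 10, False), 1, 0.0, (18, 10, False), False)

# Terminal step: stick on 18 and win.
q_learning_update(Q, (18, 10, False), 0, 1.0, None, True, alpha=0.5)
```

Unlike the Monte Carlo variants, the first update happens before the hand is finished: the value of hitting on 13 moves toward the current estimate for the best action on 18.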
Each method was trained for 100,000 episodes in the same Blackjack environment with the same state representation, so they were directly comparable. After training, I tested each learned policy by playing 50,000 new hands, this time acting fully greedy to see how well the policy performs without any exploration.
I then collected and compared several key metrics side by side: the average return during training, the final test average return, the win rate, and the total training time for each method. Finally, I compared each learned policy to a standard basic Blackjack strategy to see how closely they matched the optimal stick/hit decisions, which really shows how well each method approximates expert play.
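The greedy evaluation described above can be sketched as follows. `evaluate_greedy` is a hypothetical helper, not the contents of `test_policy.py`; it assumes an environment with the Gymnasium `reset()`/`step()` interface and a Q-table stored as a dict of per-action value lists. `StubEnv` is a toy stand-in so the snippet runs without Gymnasium installed; with the real library one would pass `gymnasium.make("Blackjack-v1")` instead.

```python
def evaluate_greedy(env, Q, n_episodes, n_actions=2):
    """Play n_episodes with the greedy policy; return (avg return, win rate)."""
    total_return = 0.0
    wins = 0
    for _ in range(n_episodes):
        state, _info = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            # Unseen states fall back to all-zero values (ties pick action 0).
            values = Q.get(state, [0.0] * n_actions)
            action = max(range(n_actions), key=lambda a: values[a])
            state, reward, terminated, truncated, _info = env.step(action)
            episode_return += reward
            done = terminated or truncated
        total_return += episode_return
        wins += episode_return > 0  # a draw (return 0) is not a win
    return total_return / n_episodes, wins / n_episodes


class StubEnv:
    """Toy stand-in for a Gymnasium env: one state, one decision per episode."""

    def reset(self):
        return (20, 10, False), {}

    def step(self, action):
        # Sticking on 20 wins, hitting busts (a deliberate simplification).
        reward = 1.0 if action == 0 else -1.0
        return (20, 10, False), reward, True, False, {}


avg_return, win_rate = evaluate_greedy(
    StubEnv(), {(20, 10, False): [0.3, -0.8]}, n_episodes=10
)
```

No exploration happens here, so the two numbers reflect only the learned policy, which is why test returns can differ sharply from training returns.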
Below you can see what each method’s learned policy looks like as a heatmap. These show where the agent decided to stick or hit across the full state space.
The training curve shows how the average return improves (or stagnates) as episodes progress.
These heatmaps show how closely each learned policy matches the standard basic Blackjack strategy. Green areas mean the agent’s decision matches the optimal play; red areas show where it chose differently.
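The match percentage behind these heatmaps boils down to a state-by-state comparison of greedy actions. The sketch below assumes the basic strategy is available as a dict mapping states to actions; `policy_match_rate` and the three-state `reference` table are illustrative only and do not reproduce `compare_to_basic.py` or the full strategy table.

```python
def policy_match_rate(Q, reference_policy, n_actions=2):
    """Fraction of reference states where the greedy action from Q agrees."""
    matches = 0
    for state, ref_action in reference_policy.items():
        values = Q.get(state, [0.0] * n_actions)
        greedy = max(range(n_actions), key=lambda a: values[a])
        matches += greedy == ref_action
    return matches / len(reference_policy)


# Toy reference: (player_sum, dealer_card, usable_ace) -> 0 = stick, 1 = hit.
reference = {
    (20, 10, False): 0,  # stick on hard 20
    (12, 10, False): 1,  # hit hard 12 against a 10
    (16, 10, False): 1,  # hit hard 16 against a 10
}

# Made-up learned values; the agent wrongly prefers to stick on 16.
learned_Q = {
    (20, 10, False): [0.6, -0.9],
    (12, 10, False): [-0.5, -0.3],
    (16, 10, False): [-0.55, -0.6],
}

match = policy_match_rate(learned_Q, reference)
```

In this toy case two of the three decisions agree, so `match` is 2/3. The disagreeing state is what would show up as a red cell in a difference heatmap.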
| Method | Train Avg Return | Test Avg Return | Win Rate | Training Time (s) | % Match w/ Basic |
|---|---|---|---|---|---|
| On-Policy MC | –0.1076 | –0.0674 | 42.60% | 123.84 | 83.89% |
| Off-Policy Ordinary IS | –0.3963 | –0.0478 | 43.25% | 123.55 | 90.00% |
| Off-Policy Weighted IS | –0.3931 | –0.0384 | 43.59% | 131.17 | 85.56% |
| Q-Learning | –0.1117 | –0.0665 | 42.18% | 134.38 | 87.22% |
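- Train Avg Return shows what the agent experienced during learning. The off-policy MC methods look much worse here because their training episodes are generated by the exploratory behavior policy, not by the greedy target policy being learned.
- Test Avg Return shows the actual performance of the final greedy policy.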
- Win Rate gives you an intuitive sense of how often the agent wins in practice, so you see how close it gets to basic Blackjack play.
- Training Time shows that all methods are comparable in this small tabular game, but Q-learning’s online updates make it more scalable for larger problems.
- % Match w/ Basic shows how closely each final policy aligns with the standard Blackjack basic strategy.
- On-Policy MC improved steadily and stayed stable throughout training. Its final policy matches basic Blackjack strategy reasonably well (83.89%), although that is the lowest match percentage of the four methods.
- Off-Policy Ordinary IS shows how high variance can waste samples, but with enough episodes it still got very close to the basic strategy, reaching about 90% match.
- Weighted IS, which is designed to keep extreme weights in check, reached the best test return of the four methods (–0.0384, versus –0.0478 for ordinary IS). Its match with basic strategy (85.56%) was nevertheless lower than ordinary IS's 90%.
- Q-Learning was stable and practical with its step-by-step updates. It finished with about an 87% match, which shows why TD methods are so reliable in small tabular problems.
Comparing the final policies to the basic strategy really highlights how the training return curve doesn’t tell the whole story. What matters is how closely the learned actions match the real strategy. All four methods got close if you give them enough data.
It’s also good to see that the final average returns stay slightly negative, which makes sense because Blackjack always has a house edge. So the agents learned realistic play that matches what an optimal human would do.
- For small tabular problems, On-Policy MC and Q-learning are practical, stable, and fast.
- Off-Policy MC is mostly useful to understand the theory of importance sampling.
- Weighted IS is a helpful improvement but doesn’t outclass simpler methods when you can explore directly.
- Install dependencies: `pip install -r requirements.txt`
- Train and compare: `python main.py`
  - Generates heatmaps and the learning curve in `results/heatmaps/` and `results/learning_curves/`.
- Evaluate policies: `python test_policy.py`
  - Saves test stats: `results/eval/policy_test_results.csv`, `results/eval/average_return_comparison.png`, `results/eval/win_rate_comparison.png`
- Compare learned policies to the basic Blackjack strategy: `python compare_to_basic.py`
  - Checks how closely each learned policy matches a standard basic strategy.
  - Saves difference heatmaps for each method: `results/heatmaps/*_vs_basic_heatmap.png`
  - Prints out the % of states where the learned policy matches the basic strategy.
- Built with Gymnasium, `numpy`, `matplotlib`, and `seaborn`.
- Reinforcement Learning foundations from Sutton & Barto, 2018.








