# Simple dice poker

The rules of the simple poker game:
For simplicity we use only 2 players.
To make it simpler still, we remove all cards from the deck that are not an ace, 2, 3, 4 or 5.
Next we define what counts as a good hand. Among single cards, the ace is the best and the 2 is the worst.
A pair is better than just having a single card,
three of a kind is better than a pair,
and four of a kind is better than three of a kind.
Between hands of the same type, the one made of higher cards is better.
A straight (as in regular poker) is, for simplicity, treated as the worst hand here.
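
The ranking rules above can be sketched as a comparison key. This is only an illustration, not the code in `src`: the names `hand_key` and `better_hand` are made up, cards are assumed to be integers where a higher number is a better card (so the ace would be encoded as the highest value), and two pairs are assumed to beat a single pair with the same top pair, which the rules above do not spell out.

```python
from collections import Counter


def hand_key(cards):
    """Return a sortable key for a hand: a larger key means a better hand.

    Assumption: cards are ints and a higher int is a better card.
    """
    counts = Counter(cards)
    # Groups of equal cards, best first: more copies win, then the higher card.
    return sorted(((n, value) for value, n in counts.items()), reverse=True)


def better_hand(cards_a, cards_b):
    """Return 0 if hand a wins, 1 if hand b wins, None on a tie."""
    key_a, key_b = hand_key(cards_a), hand_key(cards_b)
    if key_a > key_b:
        return 0
    if key_b > key_a:
        return 1
    return None


# A straight gets no special treatment, so it only counts as single cards.
print(better_hand([2, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Because the key only looks at groups of equal cards, a straight automatically ends up at the bottom together with all other hands without a pair.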

We then have 2 actions: stop or continue.
If we stop, we lose the current stake.
If we continue, we double the stake.
If we get to 5 cards, the player with the best hand wins.
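
As a small sketch of how the stake develops under these two actions: the function names and the base stake of 1 are assumptions for illustration, and it is assumed that the stake doubles once per round that is continued.

```python
def stake_after(n_continues, base_stake=1):
    """Stake after a number of continued rounds (assumed to double each round)."""
    return base_stake * 2 ** n_continues


def payoff_on_stop(n_continues, base_stake=1):
    """Payoffs (stopping player, opponent) when a player stops.

    Assumption: the stopping player loses the current stake to the opponent.
    """
    stake = stake_after(n_continues, base_stake)
    return -stake, stake


for rounds in range(4):
    print(rounds, stake_after(rounds), payoff_on_stop(rounds))
```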

In this example we take the opponent's actions into account: if we make it to the 5th card, the opponent has not stopped yet, which might mean that they have good cards.

I think we could cut down on the number of states: if we have three 4's out of 4 cards, the other cards aren't really relevant, and merging such states would give each state more examples to learn from.
But where do we cut down? With a 5 and two 3's out of 4 cards, it is still interesting whether the remaining card is a 5 or a 3.

To make it more interesting, we could distinguish between raising and just knocking (checking) to continue, for example by tracking whether the opponent just raised. That would double the number of states. But a single round could then last many actions of raising / not raising, and it could go into a loop where both players keep raising forever. We could add rules to prevent this, but that would take a lot more time, so for simplicity we don't model raising, except that it happens automatically: the stake doubles in each round.

Now for reward. 
No matter what you

### points:
You could argue for just making one state per (best) hand you have. But say you have a 9 and a 10, and there is a 9 on the board. Knowing that you hold a 10 makes it less probable that the opponent beats you with two 10's, since you already have one of them, so a separate state is necessary to model this.

For this scenario, (online) updating doesn't really matter: within one episode we can never return to the same state, since states only flow in one direction.
However, I chose to do it anyway as a learning experience. It makes the implementation slightly more difficult, but the code is easier to reuse, e.g. in another project where it does matter.
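
To make the idea of an online update concrete, here is a minimal tabular sketch. It is not the `Strategy` class from `src`: the class name `TabularValues`, its methods, and in particular the step-size rule `alpha / (1 + decay_rate * visits)` are assumptions chosen to match the parameter names used when training below.

```python
from collections import defaultdict


class TabularValues:
    """Hypothetical table of action values with a per-entry decaying step size."""

    def __init__(self, alpha=0.5, decay_rate=0.1, gamma=1.0):
        self.alpha = alpha
        self.decay_rate = decay_rate
        self.gamma = gamma
        self.q = defaultdict(float)     # (state, action) -> value estimate
        self.visits = defaultdict(int)  # (state, action) -> number of updates

    def step_size(self, key):
        # Assumed schedule: the step size shrinks as the entry is visited more.
        return self.alpha / (1.0 + self.decay_rate * self.visits[key])

    def update(self, state, action, reward, next_value=0.0):
        """Move the estimate towards reward + gamma * next_value right away."""
        key = (state, action)
        target = reward + self.gamma * next_value
        self.q[key] += self.step_size(key) * (target - self.q[key])
        self.visits[key] += 1
        return self.q[key]


values = TabularValues()
print(values.update("some state", 0, reward=4.0))
```

The update is applied after every single step instead of at the end of the episode, which is what makes it online; since no state is revisited within a game here, both variants should give the same estimates.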

Sometimes something counterintuitive happens. Say the board has 10, 10, 9 but you hold 1, 2. Here it might seem like you are doomed to have a negative value, but the action value for continuing in this state is actually positive, around 2.3. This could be due to the randomness involved, or to the algorithm having learned more from earlier games where the opponent (itself) had not yet learned. But I think it is actually because the algorithm has learned to bluff. If there is a 10, 10 on the board, the probability that the other player holds a 10 is lower. So if you bluff and the other player does not have a 10, then at worst you are even, and at best the opponent folds. In this way the algorithm seems to have learned to exploit the system. On the other hand, if the board had a 10, 9 and 8, the value for continuing is around -12, since there is a good probability that the opponent has at least one of those three cards, another pair, or at least a 3.

### What would it take to model real poker?

In real poker we would also have to model the following:
- face cards (jack, queen, king)
- different models / state spaces per number of players
- more card combinations, like straight and flush
- 5 cards laid on the board, adding even more states
- the combinations of players choosing fold, check, raise or all in
- the different possible stakes
- potentially the remaining chips of the other players

Just from noticing that several of these features are close to continuous (e.g. stakes and chip counts), this type of reinforcement learning would not be adequate in practice, given that we have finite time.

In [1]:
import math
import numpy as np
from src.game import Game, RLPlayer
from src.strategy import Strategy
import pickle


## load pre-trained model from memory (optional):

In [5]:
# load model from memory (optional)
with open("strat/strat.pkl", "rb") as f:
    strat = pickle.load(f)

## Train a new Model:

In [None]:
n_params = 2 * sum(math.comb(11, 2) * math.comb(9 + i, i) for i in range(4))
print("Total number of learnable values: ", n_params)  # the amount of distinct values
print("mean number of visits pr. state: ", 1e7 / n_params)
# parentheses added: the whole expression 1 + 0.1 * visits is the denominator
print("Alpha convergence towards :", 1 / (1 + 0.1 * (1e7 / n_params)))
print("Epsilon convergence towards: ", 0.25 * (0.95 ** (2.5e7 / 1e5)))

print(0.25 * (0.97 ** 20))

Total number of learnable values:  31460
mean number of visits pr. state:  317.86395422759057
Epsilon convergence towards:  6.742816344747503e-07
0.13594858573168675
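
The count above can be cross-checked by brute force. The reading used here is an assumption inferred from the formula, not something stated in `src`: `math.comb(11, 2)` is taken as the number of unordered 2-card hands over 10 card values (repeats allowed), `math.comb(9 + i, i)` as the number of unordered boards with `i` cards, and the factor 2 as the two actions.

```python
import math
from itertools import combinations_with_replacement

N_RANKS = 10   # assumption: comb(11, 2) == comb(N_RANKS + 1, 2)
N_ACTIONS = 2  # assumption: continue and stop


def count_by_enumeration(n_ranks=N_RANKS, max_board=3, n_actions=N_ACTIONS):
    """Count (action, hand, board) entries by listing every multiset of cards."""
    ranks = range(n_ranks)
    n_hands = sum(1 for _ in combinations_with_replacement(ranks, 2))
    n_boards = sum(
        sum(1 for _ in combinations_with_replacement(ranks, size))
        for size in range(max_board + 1)
    )
    return n_actions * n_hands * n_boards


def count_by_formula(n_ranks=N_RANKS, max_board=3, n_actions=N_ACTIONS):
    """Same count using the closed-form multiset formula."""
    n_hands = math.comb(n_ranks + 1, 2)
    n_boards = sum(math.comb(n_ranks - 1 + i, i) for i in range(max_board + 1))
    return n_actions * n_hands * n_boards


print(count_by_enumeration(), count_by_formula())
```

Both ways of counting agree with the 31460 printed above, which supports reading the states as unordered hands and boards.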


In [27]:
# Train a model 
epsilon_start = 0.25
epsilon = epsilon_start  # running value that is decayed in the loop below
strat = Strategy(n = 2, gamma = 1, alpha = 0.5, decay_rate = 0.1, epsilon = epsilon_start)

players = [RLPlayer(strat), RLPlayer(strat)] # giving them the same strat

for i in range(int(2.5e7)): # 25 million simulations ~ 40 minutes
    if i % 100_000 == 0: # decay epsilon every 100,000 games
        epsilon *= 0.95
        strat.epsilon = epsilon
    game = Game(players)
    game.simulate_game()
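
The exploration schedule in the training loop above can also be written in closed form, which makes it easy to check what epsilon is at any point in training. The helper name `epsilon_at` is made up for this sketch; the defaults mirror the numbers used in the loop.

```python
import math


def epsilon_at(step, epsilon_start=0.25, decay=0.95, every=100_000):
    """Epsilon in effect during game number `step` (0-based).

    The loop decays at step 0, every, 2 * every, ..., so `step // every + 1`
    decays have been applied by the time game `step` is played.
    """
    n_decays = step // every + 1
    return epsilon_start * decay ** n_decays


print(epsilon_at(0))           # first game
print(epsilon_at(24_999_999))  # last of the 25 million games
assert math.isclose(epsilon_at(24_999_999), 0.25 * 0.95 ** 250)
```

The value for the last game corresponds to the "Epsilon convergence towards" number printed earlier, so the policy is almost purely greedy at the end of training.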


## Save model (optional):

In [60]:
with open("strat/strat.pkl", "wb") as f:
    pickle.dump(strat, f)

## Testing

In [59]:
hand = [10,8]
board = [9, 9, 6]
#action: 0 -> continue, 1 -> fold
action = 0

state_idx = strat._get_state_idx(hand, board)
print("number of visits: ", strat.n_action_updates[len(board)][action][state_idx])
# single quotes inside the f-string so it also runs on Python versions before 3.12
print(f"action value for action {'Continue' if action == 0 else 'fold'}:", strat.action_values[len(board)][action][state_idx])


number of visits:  1406.0
action value for action Continue: -3.618537682921602
