# Flip 7

Flip 7 is a push-your-luck card game built around probabilistic risk management and tactical play. The objective is to be the first player to reach 200 points. On each turn a player chooses to *'hit'* or *'stay'*, aiming to collect as many points as possible without flipping the same number twice. A player who reveals the same number twice scores 0 points for the round. A 15-point bonus is on offer to a player who flips 7 number cards without going bust.

The deck also includes score modifier cards and three types of action card (*Freeze, Flip Three, Second Chance*).

[Rules Book](https://cdn.shopify.com/s/files/1/0611/3958/3198/files/25_FLIP_7_TB_RULES_C_ND_1.pdf?v=1734983801)
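
The round-scoring rule described above can be sketched as a small function. This is a minimal illustration that assumes only number cards are in play (modifier and action cards are ignored); `score_round` is a hypothetical helper and is not part of the environment below, which does not implement the 15-point bonus.

```python
def score_round(numbers: list[int]) -> int:
    """Score one round from the number cards a player has flipped."""
    # Flipping the same number twice is a bust: the round scores 0.
    if len(set(numbers)) != len(numbers):
        return 0

    total = sum(numbers)

    # Seven distinct number cards earn the 15-point bonus.
    if len(numbers) == 7:
        total += 15

    return total


print(score_round([3, 5, 5]))              # bust
print(score_round([1, 2, 3]))              # plain sum
print(score_round([0, 1, 2, 3, 4, 5, 6]))  # sum plus bonus
```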

# Agent v0

Agent v0 is a basic model used to test the initial setup of the environment.

- Only plays one hand, seeking to maximise the score of the one hand
- Only base number cards (0 - 12) are used 
- Only one opponent
- Opponent takes random actions
- Reward is not impacted by opponent actions
- No limitations on hand size

## Set Up

In [60]:
import random
import gymnasium as gym
from gymnasium.wrappers import FlattenObservation, RecordEpisodeStatistics
import numpy as np
import typing
import logging
import stable_baselines3 as sb3
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
import statistics
from collections import Counter

## Game Classes

### Card

In [2]:
class Card:
    def __init__(self, val: str):
        self.val = int(val)
        self.label = val

    def __repr__(self) -> str:
        return self.label

### Deck

In [3]:
_cardList = {
    "0": 1,
    "1": 1,
    "2": 2,
    "3": 3,
    "4": 4,
    "5": 5,
    "6": 6,
    "7": 7,
    "8": 8,
    "9": 9,
    "10": 10,
    "11": 11,
    "12": 12,
    # "+2": 1,
    # "+4": 1,
    # "+6": 1,
    # "+8": 1,
    # "+10": 1,
    # "x2": 1,
    # "Freeze": 3,
    # "Flip Three": 3,
    # "Second Chance": 3
}

In [4]:
class Deck:
    def __init__(self):
        self.cards = []

        for c in _cardList:
            for _ in range(_cardList[c]):
                self.cards.append(Card(c))

        self.trash = []

    def shuffle(self) -> None:
        random.shuffle(self.cards)

    def draw(self) -> Card:
        return self.cards.pop() #Last term in list is top card in deck

    def __repr__(self) -> str:
        deckString = "["

        for c in self.cards:
            deckString += c.label + " "

        deckString += "]"
        return deckString
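
Because the number of copies of each card grows with its value, high cards are also the most likely to cause a bust. The sketch below works this out for a given hand. It assumes the v0 deck composition shown above, a freshly shuffled deck, and that the only cards missing from the deck are the ones in the player's own hand; `COPIES` and `bust_probability` are hypothetical names used for illustration only.

```python
from fractions import Fraction

# Copies per number card, mirroring _cardList: one 0, one 1, two 2s, ..., twelve 12s.
COPIES = {n: max(n, 1) for n in range(13)}
DECK_SIZE = sum(COPIES.values())  # 79 cards in total


def bust_probability(hand: list[int]) -> Fraction:
    """Probability that the next draw duplicates a number already in hand."""
    unseen = DECK_SIZE - len(hand)
    # Each held number has one copy in hand, so copies - 1 remain in the deck.
    duplicates = sum(COPIES[n] - 1 for n in hand)
    return Fraction(duplicates, unseen)


print(bust_probability([12]))         # 11 of the remaining 78 cards are 12s
print(float(bust_probability([12, 11, 10])))
```

In a real hand the opponent's cards and the discard pile also change the odds, so this is a first approximation rather than the exact figure the agent faces.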

### Player

In [5]:
class Player:
    def __init__(self, deck: Deck):
        self.points = 0
        self.active = True
        self.hand = []
        self.deck = deck

    def calcVal(self) -> int:
        val = 0
        for c in self.hand:
            val += c.val
        
        return val
    
    def turn(self, choice) -> tuple[bool, int]:
        if choice == 0:
            return self.hit()
        
        else:
            return self.stay()
    
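    def hit(self) -> tuple[bool, int]:
        """Draw a card. Returns (still active, reward); a hit never scores by itself."""
        card = self.deck.draw()
        if card.label in [c.label for c in self.hand]: # Bust: duplicate number, hand scores 0
            self.active = False
            return False, 0

        self.hand.append(card)
        return True, 0

    def stay(self) -> tuple[bool, int]:
        """Bank the current hand value and end the round for this player."""
        self.active = False
        reward = self.calcVal()
        self.points += reward
        return False, reward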
    
    def discardHand(self) -> None:
        self.deck.trash += self.hand
        self.hand = []

### Opponents

#### Random Choice

In [8]:
class RCOpponent(Player):
    def __init__(self, deck: Deck, risk):
        super().__init__(deck)
        self.risk = risk

    def turn(self) -> tuple[bool, int]:
        if not self.active:
            return True, 0 # Returns True because active status does not change
        
        if self.hand:
            choice = random.uniform(0,1) 
            
        else:
            choice = 0 # No cards in hand, must hit

        if choice < self.risk: #Chooses between hit and stay with given riskiness
            return self.hit()

        else:
            return self.stay()
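
The `risk` parameter can be read as the per-turn probability of hitting once the opponent holds at least one card. Ignoring busts, the number of voluntary hits before staying is then geometric, which gives an upper bound on how many cards such an opponent draws. The sketch below is a back-of-the-envelope calculation under that assumption; `expected_draws` is a hypothetical helper, not part of the notebook.

```python
def expected_draws(risk: float) -> float:
    """Expected cards drawn by a random-choice opponent, ignoring busts.

    The first draw is forced (empty hand). After that, each turn is a hit
    with probability `risk`, so the number of extra hits is geometric with
    mean risk / (1 - risk).
    """
    if not 0 <= risk < 1:
        raise ValueError("risk must be in [0, 1)")
    return 1 + risk / (1 - risk)


# The environment samples risk from 0.4 to 0.9 in steps of 0.1.
for tenths in range(4, 10):
    risk = tenths / 10
    print(f"risk={risk:.1f}: at most {expected_draws(risk):.2f} draws on average")
```

Busts cut rounds short, so the actual average is lower, especially at high `risk`.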

## Env 

In [21]:
class Flip7(gym.Env):
    def __init__(self):
        self.deck = Deck()
        self.agent = Player(self.deck)

        self.players = 2
        self.activePlayers = self.players

        self.opponents = []

        for _ in range(self.players - 1):
            riskiness = random.randint(4,9)
            self.opponents.append(RCOpponent(self.deck, 0.1*riskiness))

        self.turnOrder = 0
        self.i = [0]

        self.observation_space = gym.spaces.Dict(
            {
                "turnOrder":gym.spaces.Box(1,8, shape=(1,), dtype=np.int64),
                "hand": gym.spaces.Box(0,2, shape=(21,), dtype = np.int64),
                "points": gym.spaces.Box(0,485, shape=(1,), dtype=np.int64),
                "oppHands": gym.spaces.Box(0,2, shape=(21,self.players - 1), dtype = np.int64),
                "oppPoints": gym.spaces.Box(0,485, shape=(self.players - 1,), dtype=np.int64),
                "trash": gym.spaces.Box(0,12, shape=(21,), dtype = np.int64)
            }
        )

        self.action_space = gym.spaces.Discrete(2) # 0 -> Hit, 1 -> Stay

    
    def _get_obs(self) -> dict[str, typing.Union[np.ndarray, int]]:
        hand = np.zeros(21, dtype=np.int64)

        for c in self.agent.hand:
            hand[list(_cardList.keys()).index(c.label)] += 1 #Array counting number of each card

        oppHands = []
        for op in self.opponents:
            opHand = np.zeros(21, dtype=np.int64)
            for c in op.hand:
                opHand[list(_cardList.keys()).index(c.label)] += 1 #Array counting number of each card

            oppHands.append(opHand)

        oppHands = np.stack(oppHands, axis=1)

        oppPoints = np.array([o.points for o in self.opponents], dtype=np.int64) #Array of opponent scores

        trash = np.zeros(21, dtype=np.int64)

        for c in self.deck.trash:
            trash[list(_cardList.keys()).index(c.label)] += 1 #Array counting number of each card
        
        return {
            "turnOrder": np.array([self.turnOrder], dtype=np.int64),
            "hand": hand,
            "points": np.array([self.agent.points], dtype=np.int64),
            "oppHands": oppHands,
            "oppPoints": oppPoints,
            "trash": trash
        }
    
    
    def _get_info(self) -> dict[str, list[str]]: # Labels of the cards currently in the agent's hand
        return {
            "Current": [c.label for c in self.agent.hand]
        }
    
    
    def opponentRound(self): # Performs opponent actions between steps
        for opp in self.opponents[self.turnOrder - 1:]: # Opponents after agent in order
            self.i[0] += 1
            activeStatus, _ = opp.turn()

            if not activeStatus:
                self.activePlayers -= 1

        for opp in self.opponents[:self.turnOrder - 1]: # Opponents before agent in order
            self.i[0] += 1
            activeStatus, _ = opp.turn()

            if not activeStatus:
                self.activePlayers -= 1
    

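    def reset(self, seed: typing.Optional[int] = None, options: typing.Optional[dict] = None) -> tuple[dict[str, typing.Union[np.ndarray, int]], dict[str, list[str]]]:
        super().reset(seed=seed) # Seeds self.np_random per the Gymnasium API; the game logic still draws from the global `random` module
        self.deck = Deck()
        self.deck.shuffle()
        self.agent = Player(self.deck)

        self.i = [0]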

        self.players = 2
        self.activePlayers = self.players

        self.opponents = []
        for _ in range(self.players - 1):
            riskiness = random.randint(4,9)
            self.opponents.append(RCOpponent(self.deck, 0.1*riskiness))

        self.turnOrder = random.randint(1, self.players) # 1 -> agent goes first, 2 -> agent goes second, ...

        # Players that are earlier in turnOrder than agent take turn before first step
        for order in range(1,self.turnOrder):
            turnPlayer = self.opponents[-order]

            activeStatus, _ = turnPlayer.turn()

            if not activeStatus:
                self.activePlayers -= 1

        observation = self._get_obs()
        info = self._get_info()

        return observation, info
    
    
    def step(self, action):
        if not self.agent.hand or action == 0:
            activeStatus, reward = self.agent.hit()
        
        else:
            activeStatus, reward = self.agent.stay()
            
            
        if not activeStatus:
            self.activePlayers -= 1


        if self.agent.active:
            self.opponentRound()

        else:
            while self.activePlayers:
                self.opponentRound()

        observation = self._get_obs()
        info = self._get_info()

        terminated = (self.activePlayers == 0)
        truncated = False

        return observation, reward, terminated, truncated, info
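
The count-vector encoding that `_get_obs` applies to the hand, the opponents' hands, and the discard pile can be illustrated in isolation. In this sketch plain lists stand in for the NumPy arrays, and `CARD_LABELS`, `OBS_SLOTS`, and `encode_hand` are hypothetical names; the slot order is assumed to follow the key order of `_cardList`, with the unused slots reserved for the modifier and action cards that v0 leaves commented out.

```python
CARD_LABELS = [str(n) for n in range(13)]  # "0" .. "12", same order as _cardList
OBS_SLOTS = 21                             # vector length used in the observation space


def encode_hand(labels: list[str]) -> list[int]:
    """Count how many of each card label appear, one slot per label."""
    counts = [0] * OBS_SLOTS
    for label in labels:
        counts[CARD_LABELS.index(label)] += 1
    return counts


vector = encode_hand(["3", "12"])
print(vector)
```

A fixed-length vector like this is what lets `FlattenObservation` concatenate every field of the observation dictionary into a single input for the policy network.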

### Register Env

In [22]:
gym.register(
    id = "barrys_zone/Flip7-v0",
    entry_point=Flip7,
)

In [81]:
#gym.pprint_registry()

## Model Training

In [70]:
env = gym.make('barrys_zone/Flip7-v0')

In [71]:
episodes = 1000000

### Random Actions - Baseline

In [72]:
rewards = []
epLens = []
for _ in range(episodes):
    obs, _ = env.reset()
    handReward = 0

    terminate = False
    truncate = False

    epLen = 0

    while not (terminate or truncate):
        epLen +=1
        action = env.action_space.sample()
        obs, reward, terminate, truncate, info = env.step(action)
        handReward += reward
    
    rewards.append(handReward)
    epLens.append(epLen)

In [None]:
print("====REWARD====")
print("Mean:", sum(rewards)/episodes)
print("Standard Deviation:", statistics.stdev(rewards))
print("Max:", max(rewards))
print("Median:", statistics.median(rewards))
print("Zero Point Hands:", round(Counter(rewards)[0]*100 / episodes,2),'%')

print("\n====EPISODE LENGTH====")
print("Mean:", sum(epLens)/episodes)
print("Standard Deviation:", statistics.stdev(epLens))
print("Max:", max(epLens))
print("Median:", statistics.median(epLens))

====REWARD====
Mean: 11.620398
Standard Deviation: 8.937530766439476
Max: 72
Median: 10.0
Zero Point Hands: 13.94 %

====Episode Length====
Mean: 2.733636
Standard Deviation: 1.012220946283589
Max: 11
Median: 2.0
Zero Point Hands: 0.0 %


### Train Agent

In [52]:
envTrain = DummyVecEnv([lambda: RecordEpisodeStatistics(FlattenObservation(Flip7()))])  

In [None]:
agent = PPO("MlpPolicy", envTrain, verbose=1)

Using cpu device


In [None]:
agent.learn(total_timesteps=episodes) # Note: total_timesteps counts environment steps, not episodes

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 2.65     |
|    ep_rew_mean     | 11.8     |
| time/              |          |
|    fps             | 1640     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 2.78        |
|    ep_rew_mean          | 10.8        |
| time/                   |             |
|    fps                  | 688         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.012596073 |
|    clip_fraction        | 0.111       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.683      |
|    explained_variance   | -0.00787    |
|    learning_rate        | 0.

<stable_baselines3.ppo.ppo.PPO at 0x1cb0db06e90>

Training Time: 31m 46.3s

In [78]:
agent.save("./models/Flip7-v0")

In [79]:
agent = sb3.PPO.load("./models/Flip7-v0")

## Test Model

In [58]:
envTest = DummyVecEnv([lambda: RecordEpisodeStatistics(FlattenObservation(Flip7()))])  

In [None]:
rewards, epLens = sb3.common.evaluation.evaluate_policy(agent, 
                                                        envTest,
                                                        n_eval_episodes = episodes,
                                                        return_episode_rewards=True)




In [None]:
print("====REWARD====")
print("Mean:", sum(rewards)/episodes)
print("Standard Deviation:", statistics.stdev(rewards))
print("Max:", max(rewards))
print("Median:", statistics.median(rewards))
print("Zero Point Hands:", round(Counter(rewards)[0]*100 / episodes,2),'%')

print("\n====EPISODE LENGTH====")
print("Mean:", sum(epLens)/episodes)
print("Standard Deviation:", np.std(epLens))
print("Max:", max(epLens))
print("Median:", statistics.median(epLens))

====REWARD====
Mean: 18.212457
Standard Deviation: 13.980860792083565
Max: 49.0
Median: 25.0
Zero Point Hands: 35.55 %

====EPISODE LENGTH====
Mean: 3.957128
Standard Deviation: 1.0859254079429213
Max: 9
Median: 4.0


## Comparison

| Metric | Random Choice | v0 |
|---|---|---|
| Reward Mean | 11.62 | 18.21 |
| Reward Std | 8.94 | 13.98 |
| Reward Max | 72 | 49 |
| Reward Median | 10 | 25 |
| Zero Reward Hands | 13.94% | 35.55% |
| Episode Length Mean | 2.73 | 3.96 |
| Episode Length Std | 1.01 | 1.09 |
| Episode Length Max | 11 | 9 |
| Episode Length Median | 2 | 4 |

The reward mean and median show that agent v0 scores higher on average than random play: the mean rises from 11.62 to 18.21 and the median from 10 to 25. In this regard v0 is an improvement on random choice.

v0 appears *riskier* than random choice, as seen in its longer episodes (*more turns played*) and its larger share of zero-reward hands (35.55% vs 13.94%). v0 has judged the extra risk to be worth the reward, leading to more busts but also a higher average score.

v0 does not achieve as high a maximum (49 vs 72). This is likely because v0 has learned that it is safer to bank a high-value hand than to push its luck further, a valuable insight in this game.
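
The trade-off between banking and pushing on can be made concrete with a one-step lookahead: compare the value of staying now with the expected value of hitting once and then staying. The sketch below does this under simplifying assumptions (the v0 deck, a fresh shuffle, and no knowledge of cards outside the player's own hand). `hit_then_stay_value` is a hypothetical helper and says nothing about how the trained policy actually decides.

```python
from fractions import Fraction

# Copies per number card, mirroring _cardList: one 0, one 1, two 2s, ..., twelve 12s.
COPIES = {n: max(n, 1) for n in range(13)}
DECK_SIZE = sum(COPIES.values())


def hit_then_stay_value(hand: list[int]) -> Fraction:
    """Expected banked score if the player hits once more and then stays."""
    remaining = DECK_SIZE - len(hand)
    current = sum(hand)
    value = Fraction(0)
    for number, copies in COPIES.items():
        if number in hand:
            continue  # drawing a duplicate busts and contributes 0
        value += Fraction(copies, remaining) * (current + number)
    return value


for hand in ([12], [12, 11, 10, 9]):
    stay = sum(hand)
    hit = hit_then_stay_value(hand)
    choice = "hit" if hit > stay else "stay"
    print(f"hand={hand}: stay={stay}, hit-then-stay={float(hit):.2f} -> {choice}")
```

Under these assumptions a single 12 is worth hitting on, while a hand of four high cards is better banked, which is consistent with the lower maximum reward observed for v0.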