In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
import sys
import torch
sys.path.append("./")

## Introduction

### State, Action & Reward Representation

- We only consider two-player (1v1) UNO.

- We use two state representations. The naive representation for tabular methods, which enumerates every possibility, yields a state space far too large for tabular methods (roughly 130M possible states, even without accounting for values). The state space therefore has to be reduced through simplification: we use 4 planes, where each entry of a plane corresponds to a given card. Three of those planes encode how many copies of a specific card the player holds (0, 1, or 2), and the last plane encodes the target (open) card.

- Motivation for using function approximation: the state space is far too large for tabular methods.

- In total, a player can take 61 different actions: playing each of the 60 different cards, or drawing a card from the deck.

- The reward is +1 for winning, −1 for losing, and 0 for all intermediate states. We also assume that a game continues until one player wins, so every final reward is either +1 or −1.
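
The plane encoding and the 61-action space described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual encoder: the function `encode_state`, the integer card ids in `range(60)`, the plane ordering, and the index chosen for the draw action are all hypothetical.

```python
NUM_CARDS = 60                 # distinct card types, as stated in the text
NUM_ACTIONS = NUM_CARDS + 1    # play any of the 60 card types, or draw
DRAW_ACTION = NUM_CARDS        # hypothetical index for the "draw" action


def encode_state(hand_counts, target_card):
    """Encode a hand and the open card as 4 binary planes of length 60.

    hand_counts: dict mapping card id -> number of copies held (0, 1 or 2).
    target_card: card id of the open (target) card.
    Plane k (k = 0, 1, 2) marks the cards held exactly k times;
    plane 3 marks the target card. The plane order is an assumption.
    """
    planes = [[0] * NUM_CARDS for _ in range(4)]
    for card in range(NUM_CARDS):
        count = hand_counts.get(card, 0)
        if count not in (0, 1, 2):
            raise ValueError(f"card {card} held {count} times; expected 0, 1 or 2")
        planes[count][card] = 1
    planes[3][target_card] = 1
    return planes


# Example: two copies of card 3, one copy of card 10, open card is card 7.
state = encode_state({3: 2, 10: 1}, target_card=7)
```

Each card appears in exactly one of the first three planes, so the input size stays fixed at 4 × 60 regardless of how many cards the player holds.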

## Neural Network Architecture & Hyperparameter Selection

### Model Architecture

Add plot later


### Hyperparameter Tuning

Add description later

## Agents & Algorithms

### Baseline: Random Policy
Random vs. Random: starting the game seems to give a slight advantage. In the run below, the random player that starts obtains an average reward of 0.0084 over 10,000 games. This is not surprising: in a game where the goal is to get rid of your cards as fast as possible, shedding a card first should result in a higher reward in the long run.

In [11]:
from tests.eval import test_trained_agents
from uno.agents.random_agent import RandomAgent
random_agent = RandomAgent(61)
test_trained_agents(random_agent, random_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
RANDOM Agent Average Reward: 0.0084
RANDOM Agent Average Reward: -0.0084

------------------------------------------------------------
Total Number of Games: 10000
RANDOM Agent wins 5042 games (RANDOM Agent win rate: 50.42%)
RANDOM Agent wins 4958 games (RANDOM Agent win rate: 49.58%)
Draws 0 games (Draw rate: 0.0%)


(0.0084, -0.0084)
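
To gauge how much weight a difference of this size can bear, one option is a normal-approximation confidence interval on the win rate. The sketch below is an illustration only: `win_rate_interval` and `average_reward` are hypothetical helpers, not part of `tests.eval`, and the interval assumes independent games with no draws.

```python
import math


def win_rate_interval(wins, games, z=1.96):
    """Return (win rate, lower bound, upper bound) of a normal-approximation interval."""
    p = wins / games
    half_width = z * math.sqrt(p * (1 - p) / games)
    return p, p - half_width, p + half_width


def average_reward(wins, games):
    """Average reward with +1 per win and -1 per loss, assuming no draws."""
    return (2 * wins - games) / games


# Figures taken from the Random vs. Random run above.
p, low, high = win_rate_interval(5042, 10000)
print(f"win rate {p:.4f}, 95% interval [{low:.4f}, {high:.4f}]")
print(f"average reward {average_reward(5042, 10000):.4f}")
```

For 5042 wins out of 10,000 games the interval spans roughly 49.4% to 51.4%, so it still contains 50%. The same helper can be applied to the other results in this notebook.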

### Algorithm I: REINFORCE

TODO: Change this plot later

TODO: Add description
<!-- Average Rewards

------------------------------------------------------------
Reinforce Agent Average Reward: 0.0038
RANDOM Agent Average Reward: -0.0038

------------------------------------------------------------
Total Number of Games: 10000
Reinforce Agent wins 5019 games (Reinforce Agent win rate: 50.19%)
RANDOM Agent wins 4981 games (RANDOM Agent win rate: 49.81%)
Draws 0 games (Draw rate: 0.0%) -->

<!-- Very inconclusive -->
![avg_reward_reinforce.png](checkpoint/REINFORCE/avg_reward_reinforce.png)

In [12]:
from tests.tests import *
from uno.agents.reinforce_agent import ReinforceAgent
reinforce_agent = CHECKPOINTS['REINFORCE Agent']
rewards = test_trained_agents(random_agent, reinforce_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
RANDOM Agent Average Reward: -0.0212
REINFORCE Agent Average Reward: 0.0212

------------------------------------------------------------
Total Number of Games: 10000
RANDOM Agent wins 4894 games (RANDOM Agent win rate: 48.94%)
REINFORCE Agent wins 5106 games (REINFORCE Agent win rate: 51.06%)
Draws 0 games (Draw rate: 0.0%)


In [13]:
rewards = test_trained_agents(reinforce_agent, random_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
REINFORCE Agent Average Reward: 0.0132
RANDOM Agent Average Reward: -0.0132

------------------------------------------------------------
Total Number of Games: 10000
REINFORCE Agent wins 5066 games (REINFORCE Agent win rate: 50.66%)
RANDOM Agent wins 4934 games (RANDOM Agent win rate: 49.34%)
Draws 0 games (Draw rate: 0.0%)


### Algorithm II: Monte Carlo On Policy Approximation

TODO: Add description

<div class='container'>
<img style="height: auto; width: 45%;" class="img" src="log/MC/mc-agent-[200000]-[0.0001]-[0.95]-[0.95]-[first].png" />
&nbsp;
&nbsp;
<img style="height: auto; width: 45%;" class="img" src="log/MC/mc-agent-[200000]-[0.0001]-[0.95]-[0.95]-[first].png" /></div>

In [14]:
from tests.tests import *
from uno.agents.mc_agent import MCAgent
mc_agent = CHECKPOINTS['MC Agent']
rewards = test_trained_agents(random_agent, mc_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
RANDOM Agent Average Reward: -0.0258
MC Agent Average Reward: 0.0258

------------------------------------------------------------
Total Number of Games: 10000
RANDOM Agent wins 4871 games (RANDOM Agent win rate: 48.71%)
MC Agent wins 5129 games (MC Agent win rate: 51.29%)
Draws 0 games (Draw rate: 0.0%)


In [15]:
rewards = test_trained_agents(mc_agent, random_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
MC Agent Average Reward: 0.0384
RANDOM Agent Average Reward: -0.0384

------------------------------------------------------------
Total Number of Games: 10000
MC Agent wins 5192 games (MC Agent win rate: 51.92%)
RANDOM Agent wins 4808 games (RANDOM Agent win rate: 48.08%)
Draws 0 games (Draw rate: 0.0%)


### Algorithm III: Double Q-Learning

TODO: Add more descriptions

#### Double-Q Network

The Double Deep Q-Network (Double DQN) is an extension of the standard Deep Q-Network (DQN) that aims to address the overestimation bias often found in Q-learning algorithms. In traditional DQN, the same network is used to both select and evaluate the best action, which can lead to an overoptimistic estimation of action values. To mitigate this issue, the Double DQN employs two separate networks: one for action selection and another for action evaluation. During each iteration, the agent plays n games using ε-greedy exploration and generates a set of trajectories (s, a, r, s'). The Double DQN update rule then utilizes both networks by selecting the action with the highest Q-value from the first network and evaluating that action using the second network. This decouples the action selection and evaluation processes, reducing the overestimation bias and improving the stability of learning.
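
The decoupling of action selection and action evaluation can be sketched for a single transition (s, a, r, s'). This is a minimal illustration on plain Python lists, not the code of `DQNAgent`: the function name, the argument names, and the default `gamma` are assumptions, and masking of illegal UNO actions is omitted.

```python
def double_dqn_target(reward, next_q_select, next_q_eval, done, gamma=0.95):
    """Compute a Double DQN learning target for one transition.

    next_q_select: Q-values of s' from the network used to select the action.
    next_q_eval:   Q-values of s' from the network used to evaluate that action.
    done:          True if s' is terminal, in which case nothing is bootstrapped.
    """
    if done:
        return reward
    # Select the greedy action with the first network...
    best_action = max(range(len(next_q_select)), key=lambda a: next_q_select[a])
    # ...and evaluate that same action with the second network.
    return reward + gamma * next_q_eval[best_action]


# The selecting network prefers action 1, which the evaluating network rates at 0.2.
target = double_dqn_target(
    reward=0.0,
    next_q_select=[0.1, 0.9, 0.3],
    next_q_eval=[0.5, 0.2, 0.8],
    done=False,
)
```

A single-network target would instead bootstrap from the largest value in `next_q_eval` (0.8 here), which is the source of the overestimation described above.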

The DQN agent was trained with several parameter settings:

- Number of episodes: 100k, 80k, 50k.
- Decaying epsilon: from 0.95 to 0.01 with decay factor 0.95, updated every 1000 episodes.
- Constant epsilon: 0.05, 0.5, 0.8.


<div class='container'>
<img style="height: auto; width: 45%;" class="img" src="log/DQN/dqn-agent-[200000]-[0.0001]-[no decay_0.08]-[0.95]-[first].png" />
&nbsp;
&nbsp;
<img style="height: auto; width: 45%;" class="img" src="log/DQN/dqn-agent-[200000]-[0.0001]-[no decay_0.08]-[0.95]-[second].png" /></div>

In [16]:
from tests.tests import *
from uno.agents.dqn_agent import DQNAgent
dqn_agent = CHECKPOINTS['DQN Agent']
rewards = test_trained_agents(dqn_agent, random_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
DQN Agent Average Reward: 0.029
RANDOM Agent Average Reward: -0.029

------------------------------------------------------------
Total Number of Games: 10000
DQN Agent wins 5145 games (DQN Agent win rate: 51.45%)
RANDOM Agent wins 4855 games (RANDOM Agent win rate: 48.55%)
Draws 0 games (Draw rate: 0.0%)


In [17]:
rewards = test_trained_agents(random_agent, dqn_agent, 10000, True)

------------------------------------------------------------
Average Rewards
------------------------------------------------------------
RANDOM Agent Average Reward: -0.0148
DQN Agent Average Reward: 0.0148

------------------------------------------------------------
Total Number of Games: 10000
RANDOM Agent wins 4926 games (RANDOM Agent win rate: 49.26%)
DQN Agent wins 5074 games (DQN Agent win rate: 50.74%)
Draws 0 games (Draw rate: 0.0%)


### Algorithm IV: SARSA

TODO: Add description

<div class='container'>
<img style="height: auto; width: 45%;" class="img" src="log/SARSA/sarsa-agent-[200000]-[0.0001]-[0.95]-[0.95]-[first]-[0].png" />
&nbsp;
&nbsp;
<img style="height: auto; width: 45%;" class="img" src="log/SARSA/sarsa-agent-[200000]-[0.0001]-[0.95]-[0.95]-[second]-[0].png" /></div>

In [4]:
import numpy as np
np.random.seed(2023)
from tests.tests import *
from uno.agents.sarsa_agent import SARSAAgent
sarsa_agent = CHECKPOINTS['SARSA Agent']
# sarsa_agent = torch.load(
#         "checkpoint/SARSA/best_agent.pt", map_location=DEVICE)
# sarsa_agent.eps = 0.01
# # For testing purpose only (remove this line later)
# setattr(sarsa_agent, "name", "SARSA Agent")
rewards = test_trained_agents(random_agent, sarsa_agent, 10000, True)

100%|██████████| 10000/10000 [01:42<00:00, 97.60it/s]


------------------------------------------------------------
Average Rewards
------------------------------------------------------------
RANDOM Agent Average Reward: -0.1204
SARSA Agent Average Reward: 0.1204

------------------------------------------------------------
Total Number of Games: 10000
RANDOM Agent wins 4398 games (RANDOM Agent win rate: 43.98%)
SARSA Agent wins 5602 games (SARSA Agent win rate: 56.02%)
Draws 0 games (Draw rate: 0.0%)


In [5]:
np.random.seed(2023)
from tests.tests import *
rewards = test_trained_agents(sarsa_agent, random_agent, 10000, True)

100%|██████████| 10000/10000 [01:26<00:00, 115.85it/s]


------------------------------------------------------------
Average Rewards
------------------------------------------------------------
SARSA Agent Average Reward: 0.141
RANDOM Agent Average Reward: -0.141

------------------------------------------------------------
Total Number of Games: 10000
SARSA Agent wins 5705 games (SARSA Agent win rate: 57.05%)
RANDOM Agent wins 4295 games (RANDOM Agent win rate: 42.95%)
Draws 0 games (Draw rate: 0.0%)


## Contests

In this section, the trained agents and the random agent play against each other.

In [12]:
import numpy as np
np.random.seed(2023)
from tests.tests import contests
from tests.eval import *
stats = contests(n=1000)
stats

| | Random Agent | SARSA Agent | MC Agent | REINFORCE Agent | DQN Agent |
|---|---|---|---|---|---|
| Random Agent | 50.70% | 45.30% | 48.60% | 51.30% | 52.60% |
| SARSA Agent | 56.30% | 52.40% | 56.30% | 56.60% | 58.20% |
| MC Agent | 51.00% | 45.20% | 51.50% | 53.60% | 52.00% |
| REINFORCE Agent | 50.50% | 42.40% | 50.50% | 51.10% | 48.50% |
| DQN Agent | 48.90% | 44.20% | 48.70% | 50.70% | 51.20% |


## Discussion & Limitations

#### Hyperparameters & Architecture

Most of the agents we trained do not seem to have any edge over the random agent, which is frustrating considering that UNO does not look that complicated. However, many parameters affect the performance of the agents we propose here, namely the architecture of our neural networks and the hyperparameters.

Most of our fine-tuning is heavily inspired by this paper (Winning UNO with reinforcement learning). In that regard, we believe a two-layer fully connected neural network should be deep enough to give good results for any of the algorithms we used. It is worth noting that the hyperparameters used in the paper give a lot of weight to exploration throughout training, even with decay: with $\epsilon = 0.95$ and $\kappa = 0.995$, decaying every 10th of the way still results in very explorative behavior, since $\epsilon \times \kappa^{10} \approx 0.90$. The discount rate is high ($\gamma = 0.95$), considering that winning quickly or slowly does not matter much in the end, and the learning rate is $\alpha = 0.0001$.
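
The arithmetic behind the decay remark can be checked with a few lines. This is only a sketch: `epsilon_after` is a hypothetical helper, and it assumes a multiplicative decay of $\epsilon$ by $\kappa$ applied a fixed number of times, with an optional floor.

```python
def epsilon_after(num_decays, eps_start=0.95, kappa=0.995, eps_min=0.0):
    """Exploration rate after `num_decays` multiplicative decay steps."""
    return max(eps_min, eps_start * kappa ** num_decays)


# Settings quoted from the paper: ten decay steps over the whole training run.
eps_paper = epsilon_after(10)

# For comparison, a decay factor of 0.95 with a floor of 0.01.
eps_faster = epsilon_after(10, kappa=0.95, eps_min=0.01)

print(f"kappa=0.995: epsilon after 10 decays = {eps_paper:.4f}")
print(f"kappa=0.95:  epsilon after 10 decays = {eps_faster:.4f}")
```

With $\kappa = 0.995$ the exploration rate barely moves from its starting value, which is why the behavior stays highly explorative for the entire run.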

The limited success of some of our agents might stem from how little some of the algorithms explore, although we are not certain of this explanation. For instance, REINFORCE has no explicit exploration mechanism. As the state space is very large, the policy might focus too much on the specific episodes and states that recurred during training.

#### Multiple players

We considered adding another random agent to make the training of our agents more robust. The drawbacks of such a solution seem obvious: training would take longer, and it is not clear why the learned policy should differ from the 1v1 case. However, adding players might increase the variety of states covered during the episodes, as there is more interaction between players. In the end, this could amount to better exploration during training and eventually to better results.

#### Two more base agents

Some basic strategies come to mind for other baseline agents. One, which is widely played, is to play a card of the same value whenever possible, whatever its color (the "value" strategy). Another is to play the target color whenever possible (the "color" strategy). The value strategy usually pans out better in the end because there is ...
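
The two heuristics can be sketched as a single function with a preference switch. This is a hypothetical illustration rather than an agent from the `uno` package: cards are modeled as `(color, value)` tuples, wild cards are ignored, and returning `None` stands for drawing from the deck.

```python
def choose_card(hand, target, prefer="value"):
    """Pick a card with the "value" or "color" heuristic.

    hand:   list of (color, value) tuples held by the player.
    target: (color, value) tuple of the open card.
    prefer: "value" tries a same-value card first, "color" a same-color card first.
    Returns the chosen card, or None if no card matches (the player must draw).
    """
    target_color, target_value = target
    same_value = [card for card in hand if card[1] == target_value]
    same_color = [card for card in hand if card[0] == target_color]
    if prefer == "value":
        candidates = same_value or same_color
    else:
        candidates = same_color or same_value
    return candidates[0] if candidates else None


hand = [("red", "5"), ("blue", "7")]
target = ("blue", "5")
value_pick = choose_card(hand, target, prefer="value")
color_pick = choose_card(hand, target, prefer="color")
```

With this hand and target, the value strategy plays the red 5 and changes the active color, whereas the color strategy plays the blue 7 and keeps it.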

We have not modeled the "saying UNO" rule. The obvious way to model it is to add a Bernoulli variable when a player is left with only one card. As this would be symmetric for all players, it would not change the average reward, and it would only make training longer because of the added variance.

#### State representation

There are many possible state representations of UNO. Choosing a good one means reducing the complexity of training as much as possible while still reflecting the dynamics of the game. We thought about randomizing colors and values (for example, every standard card could have a 1/4 chance of being of a given color each time) to get rid of these dimensions, and then building this agent on top of a baseline agent that understands how colors and values match. However, this seemed too distant from the actual game, and possibly not worth the effort.

#### Game is difficult

Despite these modifications, UNO is a very stochastic game, and agents struggle to win more than 60% of their games against random agents. This is frustrating, considering that UNO does not seem to be that difficult a game to play in the first place.









## Conclusion