# AlphaGo Zero for Connect 2

https://web.stanford.edu/~surag/posts/alphazero.html <br>
https://www.youtube.com/watch?v=62nq4Zsn8vc&ab_channel=JoshVarty <br>

### Value network and self play
1) Play a bunch of games against yourself and record whether a given board state eventually led to a win or a loss. We take a snapshot of the board as it is presented to each player, so that we can later judge how good that player's position was before they made their move. <br>

`
 BOARD STATE 
[ 0  0  0  0] # Start of the game, presented to Player 1. Player 1 plays in the first position.
[-1  0  0  0] # Player 2 is presented with this board. Player 2 plays in the third position.
[ 1  0 -1  0] # Player 1 is presented with this board. Player 1 plays in the second position.
[-1 -1  1  0] # Player 2 is presented with this board. Player 2 has lost, as Player 1 has connected 2.
`
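The sign flip between consecutive rows above can be reproduced with a small helper. This is a minimal sketch, assuming the game keeps one absolute board in which Player 1's pieces are `1` and Player 2's are `-1`; `canonical_board` is a hypothetical name for illustration and is not necessarily what `game.py` uses.

```python
def canonical_board(board, player):
    """Return the board as seen by `player` (1 or -1): own pieces become 1."""
    return [cell * player for cell in board]


# Absolute board after Player 1 plays the first position
# and Player 2 plays the third position.
absolute = [1, 0, -1, 0]
print(canonical_board(absolute, 1))   # view presented to Player 1
print(canonical_board(absolute, -1))  # view presented to Player 2
```

Presenting every state from the point of view of the player about to move means a single network can be used for both players.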

2) Create a labelled training set for the value NN by going back through all of the game states presented to Player 1 (the winner) and marking them with a reward of 1. Similarly, go through all of the states presented to Player 2 and mark them with a reward of -1. <br>
` 
  BOARD STATE   RES 
([ 0  0  0  0], 1) # Player 1 
([-1  0  0  0],-1) # Player 2 
([ 1  0 -1  0], 1) # Player 1 
([-1 -1  1  0],-1) # Player 2 
`
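The labelling step can be sketched as follows. This is an illustrative sketch rather than the code in `trainer.py`: `label_examples` is a hypothetical helper, and treating a draw as `winner = 0` (giving every state a reward of 0) is an assumption.

```python
def label_examples(history, winner):
    """Attach the final result to every recorded state.

    history: list of (board_as_seen_by_player, player), with player in {1, -1}.
    winner:  1 or -1 for the winning player (assumption: 0 for a draw).
    """
    # player * winner is 1 for the winner's states, -1 for the loser's, 0 for a draw.
    return [(board, player * winner) for board, player in history]


history = [
    ([0, 0, 0, 0], 1),
    ([-1, 0, 0, 0], -1),
    ([1, 0, -1, 0], 1),
    ([-1, -1, 1, 0], -1),
]
for board, reward in label_examples(history, winner=1):
    print(board, reward)
```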

3) After completing thousands of self play games, we shuffle the board training data and feed it into our neural network, training it to predict the eventual outcome of the game from a particular board state. The output ranges from -1 (expected loss) to 1 (expected win). This is essentially an image recognition task, which is why AlphaGo uses ResNets.

<img src="value_network.PNG" alt="drawing" width="400"/> 


### Policy Network
Takes a game state / board as input and outputs a probability distribution over the next moves, with the highest-probability move being the one the policy network thinks will do best.

<img src="policy_network.PNG" alt="drawing" width="400"/> 

Based upon the output above, the network estimates that we will win 88% of the time if we move into the first position and 10% of the time if we move into the last position. Weirdly, it also outputs a non-zero chance of winning for invalid moves, where there are already pieces. This is because we cannot force the NN to output 0 for invalid moves. Instead, we have to write code to mask out invalid moves and redistribute the remaining values so that they sum to 1.
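Masking and redistributing can be sketched in a few lines of plain Python. The helper name `mask_and_renormalise`, the uniform fallback, and the example numbers are all assumptions made for illustration; they are not taken from the notebook's model code.

```python
def mask_and_renormalise(policy, board):
    """Zero out occupied squares and rescale the rest so they sum to 1."""
    legal = [1.0 if cell == 0 else 0.0 for cell in board]
    if sum(legal) == 0:
        raise ValueError("no legal moves on this board")

    masked = [p * ok for p, ok in zip(policy, legal)]
    total = sum(masked)
    if total == 0:
        # Assumption: if the network put all its mass on illegal moves,
        # fall back to a uniform distribution over the legal ones.
        masked, total = legal, sum(legal)
    return [p / total for p in masked]


# Hypothetical network output for a board whose middle squares are occupied.
policy = [0.5, 0.2, 0.2, 0.1]
board = [0, 1, -1, 0]
print(mask_and_renormalise(policy, board))
```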

The way we train the network is to encourage it to give the same output as a Monte Carlo Tree Search (MCTS; more on that later). To do this we create a dataset that pairs each board state with what the MCTS suggested we do at that position.
`
BOARD STATE             MCTS        
([ 0  0  0  0], [0.1, 0.4, 0.4, 0.1]) 
([-1  0  0  0], [0.0, 0.3, 0.3, 0.3]) 
([ 1  0 -1  0], [0.0, 0.8, 0.0, 0.2]) 
`
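One common way to push the policy output towards the MCTS target is a cross-entropy loss. The sketch below assumes that choice; the loss actually used by `Trainer` is not shown in this notebook, and `policy_loss` is a hypothetical name.

```python
import math


def policy_loss(predicted, target, eps=1e-12):
    """Cross-entropy between the MCTS target and the predicted distribution."""
    # eps guards against log(0) when the network outputs an exact zero.
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))


mcts_target = [0.1, 0.4, 0.4, 0.1]
uniform_guess = [0.25, 0.25, 0.25, 0.25]
print(policy_loss(uniform_guess, mcts_target))  # roughly ln(4), about 1.386
print(policy_loss(mcts_target, mcts_target))    # lower: prediction matches target
```

With this loss, a prediction that matches the target scores lower than a uniform guess, which is the behaviour we want to reward during training.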

There is an interesting circular dependency here: we train our policy network to output results that match the MCTS, but we also use the policy network to choose the most promising next positions to explore with MCTS. In theory, improving the policy network will improve the MCTS and vice versa.


### AlphaGo Zero actually uses the same network to output the policy and value

<img src="policy_value.PNG" alt="drawing" width="400"/> 

### MCTS
Normally MCTS would entail doing rollouts until a particular game has run to its conclusion, and then backing up the result to update the previous states; however, with AlphaGo Zero we instead use the output of the value network to score each state. Of course, if we do happen to reach a terminal game state, then the actual result is used and backed up instead. Not having to perform complete rollouts can be hugely beneficial in a game like Go, where there are many moves until termination.
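The two ideas in this paragraph, scoring a leaf without a rollout and backing the score up the tree, can be sketched as below. This is a simplified illustration under stated assumptions: nodes are plain dicts, `value_fn` stands in for the value network, `terminal_result_fn` is a hypothetical function returning `None` while the game is still in progress, and the sign is flipped at every ply because the players alternate. Real implementations differ in their sign conventions.

```python
def evaluate_leaf(board, value_fn, terminal_result_fn):
    """Score a leaf: the real result if the game is over, else the network's estimate."""
    result = terminal_result_fn(board)
    if result is not None:
        return result
    return value_fn(board)


def backup(path, value):
    """Propagate `value` from the leaf back to the root.

    path:  list of nodes ordered root -> leaf, each a dict with
           'visits' and 'value_sum' keys.
    value: score from the point of view of the player to move at the leaf.
    """
    for node in reversed(path):
        node["visits"] += 1
        node["value_sum"] += value
        value = -value  # a good position for one player is bad for the other


path = [{"visits": 0, "value_sum": 0.0} for _ in range(3)]
leaf_value = evaluate_leaf([0, 0, 0, 0], lambda b: 0.5, lambda b: None)
backup(path, leaf_value)
print(path)
```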

In [None]:
import torch

from game import Connect2Game
from model import Connect2Model
from trainer import Trainer

In [19]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

args = {
    'batch_size': 64,
    'num_simulations': 100,                 # Number of Monte Carlo simulations for each move
    'numIters': 500,                        # Total number of training iterations
    'numEps': 100,                          # Number of full games (episodes) to run during each iteration
    'numItersForTrainExamplesHistory': 20,
    'epochs': 2,                            # Number of epochs of training per iteration
    'checkpoint_path': 'latest.pth'         # location to save latest set of weights
}

game = Connect2Game()
board_size = game.get_board_size()
action_size = game.get_action_size()

model = Connect2Model(board_size, action_size, device)

trainer = Trainer(game, model, args)
trainer.learn()

1/500

Policy Loss 1.4157464504241943
Value Loss 0.8966259807348251
Examples:
tensor([0.2723, 0.2117, 0.2783, 0.2376], device='cuda:0')
tensor([0.0400, 0.0100, 0.9300, 0.0200], device='cuda:0')

Policy Loss 1.4188497811555862
Value Loss 0.8844451904296875
Examples:
tensor([0.2717, 0.2126, 0.2786, 0.2370], device='cuda:0')
tensor([0.0400, 0.0100, 0.9300, 0.0200], device='cuda:0')
2/500

Policy Loss 1.4166783094406128
Value Loss 0.871482715010643
Examples:
tensor([0.2666, 0.2081, 0.2799, 0.2454], device='cuda:0')
tensor([0.2600, 0.3400, 0.0000, 0.4000], device='cuda:0')

Policy Loss 1.411540761590004
Value Loss 0.8693072274327278
Examples:
tensor([0.2656, 0.2087, 0.2807, 0.2450], device='cuda:0')
tensor([0.2600, 0.3400, 0.0000, 0.4000], device='cuda:0')
3/500

Policy Loss 1.4092235565185547
Value Loss 0.8500819355249405
Examples:
tensor([0.2764, 0.2219, 0.2709, 0.2309], device='cuda:0')
tensor([0.0500, 0.9500, 0.0000, 0.0000], device='cuda:0')

Policy Loss 1.404719591140747
Value Loss 0.

tensor([0.2400, 0.3700, 0.0000, 0.3900], device='cuda:0')

Policy Loss 1.2755832523107529
Value Loss 0.4786642976105213
Examples:
tensor([0.2329, 0.3019, 0.2686, 0.1966], device='cuda:0')
tensor([0.0400, 0.9600, 0.0000, 0.0000], device='cuda:0')
23/500

Policy Loss 1.2757640182971954
Value Loss 0.4857849106192589
Examples:
tensor([0.2135, 0.2235, 0.3059, 0.2571], device='cuda:0')
tensor([0.2400, 0.3700, 0.0000, 0.3900], device='cuda:0')

Policy Loss 1.2745856046676636
Value Loss 0.4816209636628628
Examples:
tensor([0.2282, 0.3114, 0.2674, 0.1930], device='cuda:0')
tensor([0.0400, 0.9600, 0.0000, 0.0000], device='cuda:0')
24/500

Policy Loss 1.275465726852417
Value Loss 0.4746265709400177
Examples:
tensor([0.2260, 0.3167, 0.2661, 0.1913], device='cuda:0')
tensor([0.0400, 0.9600, 0.0000, 0.0000], device='cuda:0')

Policy Loss 1.2649015486240387
Value Loss 0.46696967631578445
Examples:
tensor([0.2236, 0.3219, 0.2650, 0.1895], device='cuda:0')
tensor([0.0400, 0.9600, 0.0000, 0.0000], devic

tensor([0.1553, 0.2762, 0.3526, 0.2159], device='cuda:0')
tensor([0.0100, 0.0100, 0.9700, 0.0100], device='cuda:0')
44/500

Policy Loss 1.0835772156715393
Value Loss 0.13969111070036888
Examples:
tensor([0.1538, 0.2475, 0.2658, 0.3329], device='cuda:0')
tensor([0.1800, 0.4000, 0.0000, 0.4200], device='cuda:0')

Policy Loss 1.0874934494495392
Value Loss 0.13948346488177776
Examples:
tensor([0.1525, 0.2499, 0.2632, 0.3344], device='cuda:0')
tensor([0.1800, 0.4000, 0.0000, 0.4200], device='cuda:0')
45/500

Policy Loss 1.3262956142425537
Value Loss 0.14889003708958626
Examples:
tensor([0.1200, 0.3230, 0.3406, 0.2163], device='cuda:0')
tensor([0.0400, 0.0000, 0.0000, 0.9600], device='cuda:0')

Policy Loss 1.323670208454132
Value Loss 0.14345251396298409
Examples:
tensor([0.1500, 0.2498, 0.2574, 0.3427], device='cuda:0')
tensor([0.1800, 0.4100, 0.0000, 0.4100], device='cuda:0')
46/500

Policy Loss 1.0784142911434174
Value Loss 0.1206500343978405
Examples:
tensor([0.1188, 0.4537, 0.2540, 0.17

tensor([0.1100, 0.4400, 0.0000, 0.4500], device='cuda:0')
65/500

Policy Loss 0.825343668460846
Value Loss 0.020335307344794273
Examples:
tensor([0.1009, 0.3166, 0.1677, 0.4148], device='cuda:0')
tensor([0.1100, 0.4400, 0.0000, 0.4500], device='cuda:0')

Policy Loss 0.8185189366340637
Value Loss 0.01953367900568992
Examples:
tensor([0.1000, 0.3193, 0.1648, 0.4159], device='cuda:0')
tensor([0.1100, 0.4400, 0.0000, 0.4500], device='cuda:0')
66/500

Policy Loss 1.3846348822116852
Value Loss 0.016436856472864747
Examples:
tensor([0.0831, 0.3372, 0.4037, 0.1759], device='cuda:0')
tensor([0.0100, 0.0100, 0.9700, 0.0100], device='cuda:0')

Policy Loss 1.3449247032403946
Value Loss 0.018009663908742368
Examples:
tensor([0.0985, 0.3174, 0.1582, 0.4259], device='cuda:0')
tensor([0.1000, 0.4500, 0.0000, 0.4500], device='cuda:0')
67/500

Policy Loss 0.8449588119983673
Value Loss 0.019462269730865955
Examples:
tensor([0.0814, 0.3330, 0.4078, 0.1778], device='cuda:0')
tensor([0.0100, 0.0100, 0.9700,

Policy Loss 0.5906958281993866
Value Loss 0.0033229757682420313
Examples:
tensor([0.0591, 0.3616, 0.0776, 0.5017], device='cuda:0')
tensor([0.0600, 0.3900, 0.0000, 0.5500], device='cuda:0')

Policy Loss 0.5798753648996353
Value Loss 0.0033900798007380217
Examples:
tensor([0.0358, 0.2723, 0.5837, 0.1081], device='cuda:0')
tensor([0.0100, 0.0100, 0.9700, 0.0100], device='cuda:0')
87/500

Policy Loss 0.5913925766944885
Value Loss 0.0035384841030463576
Examples:
tensor([0.0050, 0.8672, 0.1042, 0.0236], device='cuda:0')
tensor([0.0100, 0.9900, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.5802256762981415
Value Loss 0.003283182595623657
Examples:
tensor([0.0342, 0.2621, 0.5994, 0.1043], device='cuda:0')
tensor([0.0100, 0.0100, 0.9700, 0.0100], device='cuda:0')
88/500

Policy Loss 0.5414046496152878
Value Loss 0.003058533649891615
Examples:
tensor([0.0046, 0.8748, 0.0991, 0.0215], device='cuda:0')
tensor([0.0100, 0.9900, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.5326973311603069
Va

Policy Loss 0.35867326706647873
Value Loss 0.0006018265994498506
Examples:
tensor([0.0117, 0.0790, 0.8728, 0.0364], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.35350507870316505
Value Loss 0.000663402744976338
Examples:
tensor([0.0255, 0.3829, 0.0263, 0.5654], device='cuda:0')
tensor([0.0200, 0.4400, 0.0000, 0.5400], device='cuda:0')
108/500

Policy Loss 0.3554718494415283
Value Loss 0.0006608959229197353
Examples:
tensor([0.0249, 0.3926, 0.0258, 0.5567], device='cuda:0')
tensor([0.0200, 0.4500, 0.0000, 0.5300], device='cuda:0')

Policy Loss 0.3397469334304333
Value Loss 0.0006371914714691229
Examples:
tensor([0.0022, 0.9666, 0.0283, 0.0030], device='cuda:0')
tensor([0.0100, 0.9900, 0.0000, 0.0000], device='cuda:0')
109/500

Policy Loss 0.37506968528032303
Value Loss 0.0006533311534440145
Examples:
tensor([0.0102, 0.0674, 0.8912, 0.0312], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.3547863289713

Policy Loss 0.35288283973932266
Value Loss 0.0006677059718640521
Examples:
tensor([0.0098, 0.4865, 0.0071, 0.4965], device='cuda:0')
tensor([0.0100, 0.4900, 0.0000, 0.5000], device='cuda:0')

Policy Loss 0.34699973091483116
Value Loss 0.0006653726813965477
Examples:
tensor([0.0129, 0.0418, 0.8717, 0.0736], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
129/500

Policy Loss 0.32382161170244217
Value Loss 0.0005503819120349362
Examples:
tensor([0.0151, 0.9183, 0.0281, 0.0385], device='cuda:0')
tensor([0.0100, 0.9900, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.33198293298482895
Value Loss 0.000533939299202757
Examples:
tensor([0.0092, 0.4892, 0.0066, 0.4950], device='cuda:0')
tensor([0.0100, 0.4900, 0.0000, 0.5000], device='cuda:0')
130/500

Policy Loss 0.3484288230538368
Value Loss 0.0005141284200362861
Examples:
tensor([0.0110, 0.0373, 0.8867, 0.0650], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.351342491805

tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
149/500

Policy Loss 0.7910537868738174
Value Loss 1.6575294012000086e-05
Examples:
tensor([0.0063, 0.0151, 0.9241, 0.0545], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.7506889849901199
Value Loss 1.6560432300138928e-05
Examples:
tensor([0.0094, 0.4958, 0.0013, 0.4935], device='cuda:0')
tensor([0.0100, 0.5000, 0.0000, 0.4900], device='cuda:0')
150/500

Policy Loss 0.7102719247341156
Value Loss 1.7593353732081596e-05
Examples:
tensor([0.0290, 0.3104, 0.3816, 0.2790], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')

Policy Loss 0.7228385731577873
Value Loss 1.514968357696489e-05
Examples:
tensor([0.0289, 0.3052, 0.3751, 0.2908], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
151/500

Policy Loss 0.7379591315984726
Value Loss 1.1613931746978778e-05
Examples:
tensor([0.0095, 0.4965, 0.0011, 0.4930], device='cuda:0')
tensor([0.0100, 0

Policy Loss 0.4082847535610199
Value Loss 2.7101419242114844e-07
Examples:
tensor([0.0161, 0.0856, 0.1029, 0.7954], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
170/500

Policy Loss 0.38253311067819595
Value Loss 2.5575818440870535e-07
Examples:
tensor([9.9785e-03, 5.0051e-01, 2.0684e-04, 4.8930e-01], device='cuda:0')
tensor([0.0100, 0.5000, 0.0000, 0.4900], device='cuda:0')

Policy Loss 0.3951420672237873
Value Loss 2.2054994630593683e-07
Examples:
tensor([0.0050, 0.0043, 0.9041, 0.0866], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
171/500

Policy Loss 0.4087851643562317
Value Loss 1.4611956800081316e-07
Examples:
tensor([0.0050, 0.0041, 0.9043, 0.0866], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.39724067598581314
Value Loss 1.4378847623675028e-07
Examples:
tensor([0.0148, 0.0684, 0.0843, 0.8324], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
172/500

Policy Loss 1.4110453873872757
Value Loss 3.824204231062112e-09
Examples:
tensor([0.0075, 0.0376, 0.0013, 0.9536], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')
190/500

Policy Loss 1.3062707483768463
Value Loss 4.837995892792435e-09
Examples:
tensor([1.1220e-02, 4.6284e-01, 2.1908e-04, 5.2572e-01], device='cuda:0')
tensor([0.0100, 0.4400, 0.0000, 0.5500], device='cuda:0')

Policy Loss 1.2195196151733398
Value Loss 5.563608573844192e-09
Examples:
tensor([1.1548e-02, 4.5094e-01, 2.4449e-04, 5.3727e-01], device='cuda:0')
tensor([0.0100, 0.4400, 0.0000, 0.5500], device='cuda:0')
191/500

Policy Loss 1.2305779159069061
Value Loss 6.543308450623897e-09
Examples:
tensor([0.0099, 0.0554, 0.0014, 0.9333], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')

Policy Loss 1.23905748128891
Value Loss 6.865587431903464e-09
Examples:
tensor([2.5764e-03, 4.9592e-04, 9.7911e-01, 1.7815e-02], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], d

tensor([1.7257e-02, 8.6007e-01, 7.8635e-05, 1.2259e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.3481181990355253
Value Loss 8.086537897633583e-10
Examples:
tensor([4.4212e-03, 9.3308e-04, 9.8965e-01, 4.9991e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
210/500

Policy Loss 0.33916059136390686
Value Loss 6.967675880709123e-10
Examples:
tensor([1.0014e-02, 4.4073e-01, 6.6712e-05, 5.4919e-01], device='cuda:0')
tensor([0.0100, 0.4400, 0.0000, 0.5500], device='cuda:0')

Policy Loss 0.3315369375050068
Value Loss 7.220660735995921e-10
Examples:
tensor([4.5950e-03, 9.7308e-04, 9.8972e-01, 4.7075e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
211/500

Policy Loss 0.3500557467341423
Value Loss 6.314384015659869e-10
Examples:
tensor([1.0050e-02, 4.4208e-01, 6.1837e-05, 5.4781e-01], device='cuda:0')
tensor([0.0100, 0.4400, 0.0000, 0.5500], device='cuda:0')

Policy Loss 0.32840570

Value Loss 3.43415879067166e-10
Examples:
tensor([3.7183e-03, 4.2873e-04, 9.8966e-01, 6.1937e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.3385726921260357
Value Loss 3.270086423867724e-10
Examples:
tensor([9.9605e-03, 5.6977e-01, 8.5002e-06, 4.2026e-01], device='cuda:0')
tensor([0.0100, 0.5700, 0.0000, 0.4200], device='cuda:0')
230/500

Policy Loss 0.34844133257865906
Value Loss 3.009526516883909e-10
Examples:
tensor([1.0008e-02, 5.6843e-01, 7.9975e-06, 4.2155e-01], device='cuda:0')
tensor([0.0100, 0.5700, 0.0000, 0.4200], device='cuda:0')

Policy Loss 0.32129889354109764
Value Loss 2.8694822212793625e-10
Examples:
tensor([0.0272, 0.0574, 0.0097, 0.9056], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
231/500

Policy Loss 0.30292368680238724
Value Loss 2.6185052737215386e-10
Examples:
tensor([0.0255, 0.0518, 0.0090, 0.9137], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')

Policy L

tensor([1.0108e-02, 3.9955e-01, 4.5613e-06, 5.9034e-01], device='cuda:0')
tensor([0.0100, 0.4000, 0.0000, 0.5900], device='cuda:0')
249/500

Policy Loss 0.543217733502388
Value Loss 2.1375314018801106e-10
Examples:
tensor([6.1993e-03, 3.0474e-04, 9.8913e-01, 4.3648e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.5270585305988789
Value Loss 2.2443403385752347e-10
Examples:
tensor([1.9873e-02, 4.8836e-01, 9.5703e-06, 4.9176e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')
250/500

Policy Loss 0.4969482943415642
Value Loss 2.4067006720862594e-10
Examples:
tensor([1.9824e-02, 5.2303e-01, 8.3856e-06, 4.5714e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.49411969259381294
Value Loss 2.3834594714555735e-10
Examples:
tensor([6.4125e-03, 3.2229e-04, 9.8930e-01, 3.9682e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
251/500

Policy Loss

Value Loss 5.113741374795922e-10
Examples:
tensor([7.1091e-03, 3.2652e-04, 9.9037e-01, 2.1941e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.4292001724243164
Value Loss 4.520671406660348e-10
Examples:
tensor([0.0963, 0.2397, 0.0015, 0.6625], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
269/500

Policy Loss 0.43067263066768646
Value Loss 2.997624232170537e-10
Examples:
tensor([0.0928, 0.2197, 0.0015, 0.6859], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')

Policy Loss 0.4282526560127735
Value Loss 2.8319220579664517e-10
Examples:
tensor([0.0888, 0.2021, 0.0015, 0.7076], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
270/500

Policy Loss 0.36378272622823715
Value Loss 2.9207491980542954e-10
Examples:
tensor([0.0844, 0.1850, 0.0015, 0.7290], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')

Policy Loss 0.3586153872311115
Value Los

tensor([1.1726e-02, 4.9408e-01, 8.5874e-07, 4.9419e-01], device='cuda:0')
tensor([0.0100, 0.4700, 0.0000, 0.5200], device='cuda:0')
288/500

Policy Loss 1.050182580947876
Value Loss 3.034116013989063e-11
Examples:
tensor([7.3834e-03, 1.0092e-04, 9.8954e-01, 2.9791e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 1.09710593521595
Value Loss 3.2379342262967015e-11
Examples:
tensor([2.0893e-02, 8.6437e-02, 1.3891e-06, 8.9267e-01], device='cuda:0')
tensor([0.0300, 0.9700, 0.0000, 0.0000], device='cuda:0')
289/500

Policy Loss 0.9913017451763153
Value Loss 3.899502942772415e-11
Examples:
tensor([2.2195e-02, 9.9509e-02, 1.2953e-06, 8.7829e-01], device='cuda:0')
tensor([0.0300, 0.9700, 0.0000, 0.0000], device='cuda:0')

Policy Loss 1.0217683985829353
Value Loss 4.1279486773238006e-11
Examples:
tensor([7.6626e-03, 1.0806e-04, 9.8938e-01, 2.8500e-03], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
290/500

Policy Loss 0.92

Policy Loss 0.3059500753879547
Value Loss 2.0714940873745036e-10
Examples:
tensor([9.0385e-03, 1.4326e-04, 9.8995e-01, 8.7254e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.3010019585490227
Value Loss 2.101262566722717e-10
Examples:
tensor([1.0192e-02, 3.8951e-01, 5.5995e-07, 6.0030e-01], device='cuda:0')
tensor([0.0100, 0.3900, 0.0000, 0.6000], device='cuda:0')
308/500

Policy Loss 0.3059111759066582
Value Loss 1.85726642398798e-10
Examples:
tensor([9.2491e-03, 1.4740e-04, 9.8978e-01, 8.2579e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.32944248616695404
Value Loss 1.9920639443560262e-10
Examples:
tensor([9.1524e-03, 9.5786e-01, 5.0508e-09, 3.2987e-02], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')
309/500

Policy Loss 0.5511059314012527
Value Loss 2.3763294110246136e-10
Examples:
tensor([1.4313e-01, 3.7562e-01, 1.2339e-04, 4.8113e-01], device='cuda:0')
tenso

tensor([1.0198e-02, 5.5557e-01, 1.8513e-07, 4.3423e-01], device='cuda:0')
tensor([0.0100, 0.5600, 0.0000, 0.4300], device='cuda:0')

Policy Loss 0.31305212527513504
Value Loss 4.358290117689734e-11
Examples:
tensor([1.9909e-02, 1.0604e-02, 8.1021e-05, 9.6941e-01], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
327/500

Policy Loss 0.3256884813308716
Value Loss 3.567819650829307e-11
Examples:
tensor([1.0054e-02, 5.5952e-01, 1.6724e-07, 4.3043e-01], device='cuda:0')
tensor([0.0100, 0.5600, 0.0000, 0.4300], device='cuda:0')

Policy Loss 0.3123810738325119
Value Loss 3.508393575657465e-11
Examples:
tensor([1.9986e-02, 8.1670e-03, 7.9648e-05, 9.7177e-01], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
328/500

Policy Loss 0.2869662120938301
Value Loss 3.044044183386774e-11
Examples:
tensor([1.9955e-02, 7.1550e-03, 7.9214e-05, 9.7281e-01], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')

Policy Loss 0.29628746

Value Loss 8.955132468901894e-11
Examples:
tensor([1.9937e-02, 5.4576e-01, 6.6779e-09, 4.3430e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')
346/500

Policy Loss 0.4792078509926796
Value Loss 8.710873689032894e-11
Examples:
tensor([1.9980e-02, 5.7811e-01, 6.0464e-09, 4.0191e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.45473162084817886
Value Loss 9.31538596260495e-11
Examples:
tensor([1.9864e-02, 6.0874e-01, 5.3580e-09, 3.7140e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')
347/500

Policy Loss 0.43757034093141556
Value Loss 9.552057755879417e-11
Examples:
tensor([1.0157e-02, 3.2004e-01, 1.9157e-07, 6.6981e-01], device='cuda:0')
tensor([0.0100, 0.3200, 0.0000, 0.6700], device='cuda:0')

Policy Loss 0.4235099069774151
Value Loss 9.606890977176263e-11
Examples:
tensor([9.3912e-03, 5.3916e-05, 9.8994e-01, 6.1357e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000

tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
365/500

Policy Loss 0.3038770481944084
Value Loss 2.5698129735296504e-10
Examples:
tensor([9.8555e-03, 5.7823e-01, 9.9622e-08, 4.1191e-01], device='cuda:0')
tensor([0.0100, 0.5800, 0.0000, 0.4100], device='cuda:0')

Policy Loss 0.3329858183860779
Value Loss 2.3847555874478843e-10
Examples:
tensor([9.9130e-03, 5.7988e-01, 9.4930e-08, 4.1020e-01], device='cuda:0')
tensor([0.0100, 0.5800, 0.0000, 0.4100], device='cuda:0')
366/500

Policy Loss 0.35400011390447617
Value Loss 2.2532445353995456e-10
Examples:
tensor([9.9815e-03, 5.7683e-01, 9.1264e-08, 4.1319e-01], device='cuda:0')
tensor([0.0100, 0.5800, 0.0000, 0.4100], device='cuda:0')

Policy Loss 0.32526999339461327
Value Loss 2.496133369445097e-10
Examples:
tensor([9.7684e-03, 5.8019e-01, 8.3197e-08, 4.1004e-01], device='cuda:0')
tensor([0.0100, 0.5800, 0.0000, 0.4100], device='cuda:0')
367/500

Policy Loss 0.330368235707283
Value Loss 2.3990237574267326e-10
Examples:
tensor([9.

Examples:
tensor([9.4488e-03, 1.6288e-05, 9.9009e-01, 4.4726e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.30234217271208763
Value Loss 1.6341449027290622e-11
Examples:
tensor([9.4469e-03, 1.5334e-05, 9.9009e-01, 4.5087e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
385/500

Policy Loss 1.291241854429245
Value Loss 1.7574899868755267e-11
Examples:
tensor([1.2603e-02, 3.9388e-02, 5.3659e-10, 9.4801e-01], device='cuda:0')
tensor([0.0300, 0.9700, 0.0000, 0.0000], device='cuda:0')

Policy Loss 1.3065101653337479
Value Loss 1.8217753694482752e-11
Examples:
tensor([9.6775e-03, 1.5715e-05, 9.8987e-01, 4.4089e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
386/500

Policy Loss 1.2731195092201233
Value Loss 2.2153570644611875e-11
Examples:
tensor([1.2407e-02, 4.5158e-01, 4.3738e-08, 5.3601e-01], device='cuda:0')
tensor([0.0100, 0.4200, 0.0000, 0.5700], device='cuda:0')

Policy Lo

Policy Loss 0.2803253848105669
Value Loss 1.378327799117507e-10
Examples:
tensor([1.0162e-02, 3.4791e-01, 4.6780e-08, 6.4193e-01], device='cuda:0')
tensor([0.0100, 0.3500, 0.0000, 0.6400], device='cuda:0')
404/500

Policy Loss 0.28654908761382103
Value Loss 1.3058225334372509e-10
Examples:
tensor([1.0058e-02, 3.5176e-01, 4.5226e-08, 6.3818e-01], device='cuda:0')
tensor([0.0100, 0.3500, 0.0000, 0.6400], device='cuda:0')

Policy Loss 0.28932565078139305
Value Loss 1.27309975062051e-10
Examples:
tensor([9.8660e-03, 1.8492e-05, 9.8995e-01, 1.6675e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
405/500

Policy Loss 0.27503708004951477
Value Loss 1.3080111993524213e-10
Examples:
tensor([9.9650e-03, 3.5513e-01, 4.4212e-08, 6.3490e-01], device='cuda:0')
tensor([0.0100, 0.3500, 0.0000, 0.6400], device='cuda:0')

Policy Loss 0.2926289662718773
Value Loss 1.2491983836238063e-10
Examples:
tensor([1.0159e-02, 3.4857e-01, 4.5699e-08, 6.4128e-01], device='cuda:0')
ten

tensor([1.9997e-02, 6.1412e-03, 4.0708e-06, 9.7386e-01], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
423/500

Policy Loss 0.2768261879682541
Value Loss 8.164174891689413e-11
Examples:
tensor([9.8943e-03, 6.0236e-01, 1.1910e-08, 3.8774e-01], device='cuda:0')
tensor([0.0100, 0.6000, 0.0000, 0.3900], device='cuda:0')

Policy Loss 0.28670370765030384
Value Loss 7.661431905559013e-11
Examples:
tensor([9.5653e-03, 2.0170e-05, 9.9004e-01, 3.7603e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
424/500

Policy Loss 0.2865273803472519
Value Loss 6.445102784802259e-11
Examples:
tensor([1.0117e-02, 5.9809e-01, 1.1272e-08, 3.9180e-01], device='cuda:0')
tensor([0.0100, 0.6000, 0.0000, 0.3900], device='cuda:0')

Policy Loss 0.29496878385543823
Value Loss 6.087056553250036e-11
Examples:
tensor([1.9992e-02, 3.4256e-03, 4.4185e-06, 9.7658e-01], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')
425/500

Policy Loss 0

Policy Loss 0.5723298862576485
Value Loss 7.234836341130091e-11
Examples:
tensor([2.0187e-02, 4.3983e-01, 6.0181e-11, 5.3998e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.5403863899409771
Value Loss 7.695776654825792e-11
Examples:
tensor([1.9552e-02, 4.8277e-01, 5.7774e-11, 4.9768e-01], device='cuda:0')
tensor([0.0200, 0.9800, 0.0000, 0.0000], device='cuda:0')
443/500

Policy Loss 0.46711229532957077
Value Loss 8.461548128835261e-11
Examples:
tensor([1.8641e-02, 5.2574e-01, 5.5237e-11, 4.5562e-01], device='cuda:0')
tensor([0.0100, 0.9900, 0.0000, 0.0000], device='cuda:0')

Policy Loss 0.4725724346935749
Value Loss 8.348943758562655e-11
Examples:
tensor([1.0068e-02, 3.8892e-01, 1.6358e-08, 6.0101e-01], device='cuda:0')
tensor([0.0100, 0.3900, 0.0000, 0.6000], device='cuda:0')
444/500

Policy Loss 0.4447357580065727
Value Loss 9.056486116598705e-11
Examples:
tensor([9.8093e-03, 1.5724e-05, 9.8999e-01, 1.8900e-04], device='cuda:0')
tensor(

tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')

Policy Loss 0.3101510629057884
Value Loss 2.2258158510757298e-10
Examples:
tensor([9.4806e-03, 5.0006e-01, 1.1868e-08, 4.9046e-01], device='cuda:0')
tensor([0.0100, 0.5000, 0.0000, 0.4900], device='cuda:0')
462/500

Policy Loss 0.2965414747595787
Value Loss 2.1910594172336317e-10
Examples:
tensor([4.3950e-02, 2.8774e-02, 9.1283e-07, 9.2728e-01], device='cuda:0')
tensor([0.0200, 0.0000, 0.0000, 0.9800], device='cuda:0')

Policy Loss 0.30445992946624756
Value Loss 2.1152440909943948e-10
Examples:
tensor([9.7404e-03, 2.1032e-05, 9.9002e-01, 2.1693e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
463/500

Policy Loss 0.3055267930030823
Value Loss 2.2094949481132886e-10
Examples:
tensor([9.8699e-03, 4.9491e-01, 1.3299e-08, 4.9522e-01], device='cuda:0')
tensor([0.0100, 0.5000, 0.0000, 0.4900], device='cuda:0')

Policy Loss 0.3030960150063038
Value Loss 2.1008893930085648e-10
Examples:
tensor([2.9725e-0

tensor([1.0264e-02, 4.6068e-01, 6.2131e-09, 5.2906e-01], device='cuda:0')
tensor([0.0100, 0.4600, 0.0000, 0.5300], device='cuda:0')
481/500

Policy Loss 0.7131358981132507
Value Loss 5.842143435685898e-11
Examples:
tensor([1.0085e-02, 4.6626e-01, 6.3406e-09, 5.2365e-01], device='cuda:0')
tensor([0.0100, 0.4600, 0.0000, 0.5300], device='cuda:0')

Policy Loss 0.6624820530414581
Value Loss 6.438731492419691e-11
Examples:
tensor([1.0350e-02, 4.5945e-01, 7.1707e-09, 5.3020e-01], device='cuda:0')
tensor([0.0100, 0.4600, 0.0000, 0.5300], device='cuda:0')
482/500

Policy Loss 0.610584169626236
Value Loss 6.753231307499163e-11
Examples:
tensor([1.0056e-02, 4.6483e-01, 7.0845e-09, 5.2511e-01], device='cuda:0')
tensor([0.0100, 0.4600, 0.0000, 0.5300], device='cuda:0')

Policy Loss 0.5803443863987923
Value Loss 7.301469845399922e-11
Examples:
tensor([9.8711e-03, 1.2071e-05, 9.8996e-01, 1.6137e-04], device='cuda:0')
tensor([0.0100, 0.0000, 0.9900, 0.0000], device='cuda:0')
483/500

Policy Loss 0.53

Policy Loss 0.3267146348953247
Value Loss 1.4201653741885423e-10
Examples:
tensor([9.8153e-03, 5.4538e-01, 8.0640e-09, 4.4480e-01], device='cuda:0')
tensor([0.0100, 0.5500, 0.0000, 0.4400], device='cuda:0')

Policy Loss 0.3204425051808357
Value Loss 1.4065287823328276e-10
Examples:
tensor([9.5000e-03, 5.5054e-01, 7.5843e-09, 4.3996e-01], device='cuda:0')
tensor([0.0100, 0.5500, 0.0000, 0.4400], device='cuda:0')
