
Improving rank feedback to user #73

Closed
sanderland opened this issue Jun 16, 2020 · 51 comments

Comments

@sanderland
Owner

Give the user feedback on their game, such as 'your opening/middle game/endgame was around 8k'.

@bale-go
Contributor

bale-go commented Jun 17, 2020

I did some coding to test the move quality prediction for consecutive segments of the game.
It seems that the predictions are in line with what we have learned from the move_rank/point_loss of users. Between moves 0 and 50, the bulk of user games are below the y=x line, meaning that memorizing joseki adds 2-3 kyu to our strength. In the middle game most games are near the y=x line, while in the late middle game/endgame users play slightly weaker than their kyu rank.
It is important to note that the mean quality of moves never shifts more than 3 kyu ranks from the y=x line for any segment.
move_quality
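
For reference, a minimal sketch of the per-segment statistic used above, assuming the policy rank of each played move has already been extracted from the analysis (the names and segment length are illustrative, not KaTrain's actual API):

```python
import numpy as np

def segment_mean_ranks(move_ranks, segment_len=50):
    """Mean policy rank of the played moves, per consecutive game segment.

    move_ranks[i] is the rank of move i in the engine's policy
    (0 = top policy move); lower means stronger play.
    """
    return [
        (start, float(np.mean(move_ranks[start:start + segment_len])))
        for start in range(0, len(move_ranks), segment_len)
    ]
```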

@sanderland
Owner Author

What's the code you are using for these histograms? I built a little NN to predict rank and would be curious how it compares. It did immediately find one cheater who was OGS 5d but played around 8k and timed out all his losses.

@sanderland
Owner Author

thanks for the code. fitting to <25k gives me:

image

@bale-go
Contributor

bale-go commented Jun 18, 2020

Pretty cool!
However, I would not give up on the move-rank-based kyu estimation.
It rests on a pretty solid mathematical basis: we simply invert the working calibration of the p-pick-rank bots to get the kyu rank (no further training/calibration is necessary); see the sketch at the end of this comment.

On the other hand, a neural network is always a black box. It can be really useful as a universal function approximator for complex functions where symbolic/analytic forms are not available, but I don't think that is the case here.
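
To make the inversion concrete, a minimal sketch: the real calibration curve is fitted inside KaTrain and is not reproduced here, so `expected_mean_rank` below is a made-up monotonic stand-in.

```python
from scipy.optimize import brentq

def expected_mean_rank(kyu):
    # Stand-in for the fitted p-pick calibration: the mean policy rank of
    # the bot's moves as a function of its kyu rank. Only monotonicity
    # matters for the inversion.
    return 1.0 + 0.35 * (kyu + 4.0)

def estimate_kyu(observed_mean_rank, lo=-4.0, hi=25.0):
    """Invert the calibration numerically: find the kyu whose expected
    mean move rank equals the observed mean rank of a player's moves."""
    return brentq(lambda k: expected_mean_rank(k) - observed_mean_rank, lo, hi)

print(estimate_kyu(4.5))  # 6.0, i.e. ~6k for this stand-in curve
```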

@sanderland
Owner Author

It was mostly to see what the difference is / how well it performs. It is only a small 15->20->10->1 NN (two hidden layers) using the histogram features as inputs, as I'm trying to avoid overfitting.
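
For concreteness, a sketch of a comparable network in scikit-learn; the 15 features are assumed to be move-rank histogram bins per (player, game), and the training data here is random filler, not the real dataset.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 15))         # 15 histogram features per (player, game)
y = rng.uniform(-4.0, 25.0, 1000)  # known ranks (filler; negative = dan)

# 15 -> 20 -> 10 -> 1: two small hidden layers to limit overfitting.
model = MLPRegressor(hidden_layer_sizes=(20, 10), max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]))        # predicted rank for one game
```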

@bale-go
Contributor

bale-go commented Jun 18, 2020

It is interesting that the NN prediction deviates from y=x similarly to the move-rank-based kyu estimation.

@sanderland
Owner Author

Let's try to get your estimate into v1.3 -- any ideas on the user interface? Just putting it in the 'info' box would be easiest, but maybe a bit hidden?

@bale-go
Contributor

bale-go commented Jun 21, 2020

Maybe a spline-connected curve next to the score and win rate (under the timer), with kyu rank on the y axis and move number on the x axis.
I would suggest using the PCHIP interpolator, since it does not allow the unrealistic maxima and minima that plague regular spline interpolation; see the sketch at the end of this comment.

And under the plot, next to Score/Win Rate/Point Loss, there could be the overall kyu rank of the game from move 1 to the current move.
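
A minimal sketch of the PCHIP idea with SciPy; the segment positions and kyu values are made up for illustration.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

move_numbers = np.array([25, 75, 125, 175, 225])     # segment midpoints
kyu_estimates = np.array([5.0, 8.2, 7.1, 9.3, 6.5])  # per-segment estimates

# PCHIP preserves monotonicity between data points, so the curve cannot
# overshoot to unrealistic extrema the way an unconstrained cubic spline can.
curve = PchipInterpolator(move_numbers, kyu_estimates)
xs = np.linspace(move_numbers[0], move_numbers[-1], 200)
ys = curve(xs)  # points to draw on the kyu-vs-move-number plot
```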

@bale-go
Contributor

bale-go commented Jun 21, 2020

I created a PR that generates the following plot.
move_quality_plot

@sanderland
Owner Author

sanderland commented Jun 21, 2020

Frankly, that looks like we could do with linear interpolation (= just save the points and let the graphics primitives deal with it -- the score graph is definitely not matplotlib based).

@bale-go
Contributor

bale-go commented Jun 21, 2020

I don't think spline is crucial for the plot either.
Spline might be closer to the underlying function, but the stdev is large anyway.

@sanderland
Owner Author

image
start of a layout

@Dontbtme
Contributor

Dontbtme commented Jun 23, 2020

What about an option to show dots according to their rank estimates rather than point losses?
Say I'm an overall 2 kyu: I could look for bad moves that were weaker than my rank, or for double-digit-kyu blunders, and so on.

@sanderland
Owner Author

@Dontbtme A single move does not have a rank estimate -- it's a statistical estimate that requires many moves to be even close.

@bale-go
Contributor

bale-go commented Jun 23, 2020

The problem is that kyu estimation is not a single-move statistic.
It needs at least 25-50 moves to get a reasonable estimate for a given segment of the game.

@Dontbtme
Contributor

Dontbtme commented Jun 23, 2020

Gotcha. That's a shame :p

@Dontbtme
Contributor

Dontbtme commented Jun 23, 2020

I'll give you one last idea before I stop wasting your time, and then I'll call it a day :p
Maybe each rank loses a given number of points per move on average, depending on the stage of the game.
Suppose 2 kyu players lose on average 2 points per move in the opening, 3 points per move in the middle game, and 1 point per move in the endgame; that would give an order of magnitude for a rank's mistakes according to the stage of the game.
That would be useful, I think, because the closer we are to the end of the game, the bigger a mistake a 3-point loss is, for example.

@bale-go
Contributor

bale-go commented Jun 23, 2020

The idea here is that average move rank is a more robust statistic than average score loss.
It seems possible to build a rank-adjustable bot on top of the p-pick bot, which chooses the best of M (where M depends on the kyu rank) out of N legal moves; see the sketch at the end of this comment.
The cool thing about it is that this method gives us a score loss distribution that is very similar to human play (see #44 (comment)).

The rank estimation "simply" inverts the method used in the p-pick bot (after removing outliers).
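
A minimal sketch of that selection rule (illustrative only, not KaTrain's actual implementation):

```python
import numpy as np

def p_pick_move(policy, m, rng=None):
    """Best-of-M-out-of-N selection: sample M of the N legal moves uniformly,
    then play the sampled move with the highest policy value.

    policy is a NumPy array of policy values over the legal moves; m comes
    from the kyu-rank calibration, and a smaller m means weaker play.
    """
    if rng is None:
        rng = np.random.default_rng()
    candidates = rng.choice(len(policy), size=min(m, len(policy)), replace=False)
    return candidates[np.argmax(policy[candidates])]
```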

@sanderland
Owner Author

The idea here is that average move rank is a more robust statistic than average score loss.

There is no particular evidence for this though; you can probably do well with score loss as well, or with both. What is definitely true though is that the 15b single-visit scoreLoss is very noisy/biased in the endgame.

@bale-go
Contributor

bale-go commented Jun 23, 2020

I am pretty sure that score loss, alone or in combination with move rank, could be used to create a human-like player that is even closer to human style than the current bot.
The advantage of using move ranks (p-pick) is that it gives us a really simple and surprisingly well-performing first model. But it is definitely not the last word in the quest for mimicking human play.

What is definitely true though is that the 15b single-visit scoreLoss is very noisy/biased in the endgame.

This is exactly what I meant by more robust: move rank works even at policy level throughout the game.

@sanderland
Owner Author

image

random 5k vs 6k game which black won by 131 points -- something definitely seems off with the rank estimate

@Dontbtme
Contributor

Dontbtme commented Jun 23, 2020

ROFL
Look at the winrate and score graphs: they're all over the place! I wouldn't worry, though. Seems like White played better overall but messed up big time at some point, so the rank graph might be working just fine but you picked the wrong game to try it on :D

@bale-go
Contributor

bale-go commented Jun 23, 2020

I cloned the v1.3 branch. I really like the way you chose to show the rank estimate.
I checked a random [sgf](sgf_ogs/katrain_power66 (7k) vs katrain-6k (4k) 2020-06-15 11 24 28_W+26.3.sgf).
analyze_rank.py does not give the same result as the generated plot: the first and last segments differed by 4-5 kyu.

@bale-go
Contributor

bale-go commented Jun 24, 2020

I found a game that shows the issue nicely.
It was played between two really strong bots on the Computer Go Server.

The first segment is only 30 moves long (15 for each player), resulting in a poor estimation of that part of the game.
Screenshot from 2020-06-24 08-59-07

I fixed it by not plotting estimates that are based on less than 75% of the full segment length (a sketch of the filter follows at the end of this comment).
Screenshot from 2020-06-24 09-38-28

Using the 20b model, the estimation for strong bots is even more accurate. Note that the calibration of the calibrated rank bot does not apply to 20b, but the trends are the same.
Screenshot from 2020-06-24 09-51-06
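
A sketch of such a filter, assuming hypothetical (start_move, moves_used, kyu_estimate) tuples per segment:

```python
def plottable_segments(segments, segment_len, min_coverage=0.75):
    """Keep only segment estimates backed by enough moves.

    segments: list of (start_move, moves_used, kyu_estimate) tuples.
    Estimates built from less than min_coverage of a full segment
    are considered too noisy to plot.
    """
    return [s for s in segments if s[1] >= min_coverage * segment_len]
```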

@bale-go bale-go mentioned this issue Jun 24, 2020
@sanderland
Owner Author

sanderland commented Jun 24, 2020

Yeah, I was a bit too aggressive in wanting the line to look nice across the whole length. Maybe we can fake it by extrapolating the first point backward ;)

@bale-go
Contributor

bale-go commented Jun 25, 2020

What do you mean by taking it all the way?
Making a histogram of the move ranks and using that to predict kyu rank might work for the full game, but I'm afraid it would not be accurate for segments. Averaging helps a lot with the uncertainty of the prediction (it divides the standard error by sqrt(N)).

@sanderland
Owner Author

Well, a coarse histogram with a few bins may be better; it's worth considering.

@bale-go
Copy link
Contributor

bale-go commented Jun 25, 2020

I did some number crunching.
I analysed 3000 sgfs (19x19, at least 100 moves).
The 2D kernel density plots show the effect of the patch pretty well.

2D kernel density plot without the move_cap_patch:
move_quality_old

2D kernel density plot with the move_cap_patch, factor used is 0.07:
move_quality_patch007

2D kernel density plot with the move_cap_patch, factor used is 0.09:
move_quality_patch009

From the results I think the 0.07 factor is a little too small, since it hinders accurate estimation of sub-15k ranks.
I also made histograms for games at different kyu levels to show the effect of the patch in another way (moves 75 to 125).
move_histogram_1

As expected, the patch does not affect SDK players too much. However, at DDK levels it really helps with the spread of the distribution.

@bale-go
Contributor

bale-go commented Jun 26, 2020

The user-data-based AI (#74 (comment)) can be used to estimate user ranks more accurately.
It works much better at higher strengths than the previous calibrated rank bot. I created a PR that includes the changes.
move_quality_patch009_usercal-compare

@bale-go
Contributor

bale-go commented Jun 27, 2020

The rank estimation of segments became much more accurate with the new user-data-based AI.
Due to the non-linearity of the "outlier-free mean of move ranks vs. number of legal moves on the board" function (#74 (comment)) for the new bot, it is not possible to simply calculate the kyu rank of the entire game. Thus I used the median of the segments' move quality to calculate the estimated game kyu rank (a sketch follows below).

I think it is pretty convincing, especially compared to the previous estimations of user kyu ranks.
move_quality_patch009_usercalib3
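
The aggregation step itself is then just the median over segments (names hypothetical):

```python
from statistics import median

def game_kyu(segment_kyus):
    # The median is robust to one or two badly estimated segments,
    # unlike the mean.
    return median(segment_kyus)
```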

@sanderland
Owner Author

It looks pretty good, and there are some impressive outputs when I try it on the OGS games. Incredible noise as well, though -- two games from the same player:

Move quality for moves 1 to 178 B: 10.4k W: 4.1k
Move quality for moves 1 to 285 B: 3.4d W: 4.6d

@bale-go
Contributor

bale-go commented Jun 28, 2020

I think a way to further decrease the noise could be to run the policy analysis a few times (3-5) and feed the rank estimation function the average of the reported move ranks.

@sanderland
Owner Author

I think a way to further decrease the noise could be to run the policy analysis a few times (3-5) and feed the rank estimation function the average of the reported move ranks.

The policy is deterministic!

@bale-go
Contributor

bale-go commented Jun 28, 2020

That is interesting.
When I run analyse_rank.py multiple times on the same sgf, I get slightly different results each time.
1st run:

  • File name: test/katrain_AI (Calibrated Rank) vs AI (Calibrated Rank) 2020-06-27 00 14 50.sgf
    Move quality for moves 1 to 324 B: 6.4k W: 6.8k
    Move quality for moves 1 to 80 B: 3.2k W: 5.0k
    Move quality for moves 41 to 120 B: 6.4k W: 7.0k
    Move quality for moves 81 to 160 B: 6.4k W: 6.8k
    Move quality for moves 121 to 200 B: 9.2k W: 4.7k
    Move quality for moves 161 to 240 B: 7.2k W: 6.1k
    Move quality for moves 201 to 280 B: 4.1k W: 8.7k
    Move quality for moves 241 to 320 B: 2.9k W: 9.0k

2nd run:

  • File name: test/katrain_AI (Calibrated Rank) vs AI (Calibrated Rank) 2020-06-27 00 14 50.sgf
    Move quality for moves 1 to 324 B: 6.7k W: 6.7k
    Move quality for moves 1 to 80 B: 3.3k W: 4.9k
    Move quality for moves 41 to 120 B: 7.2k W: 7.2k
    Move quality for moves 81 to 160 B: 7.2k W: 6.7k
    Move quality for moves 121 to 200 B: 8.6k W: 3.7k
    Move quality for moves 161 to 240 B: 6.7k W: 5.2k
    Move quality for moves 201 to 280 B: 6.1k W: 8.3k
    Move quality for moves 241 to 320 B: 3.9k W: 7.9k

3rd run:

  • File name: test/katrain_AI (Calibrated Rank) vs AI (Calibrated Rank) 2020-06-27 00 14 50.sgf
    Move quality for moves 1 to 324 B: 7.3k W: 5.9k
    Move quality for moves 1 to 80 B: 3.9k W: 3.7k
    Move quality for moves 41 to 120 B: 7.3k W: 5.4k
    Move quality for moves 81 to 160 B: 7.8k W: 5.9k
    Move quality for moves 121 to 200 B: 9.4k W: 4.4k
    Move quality for moves 161 to 240 B: 7.9k W: 5.9k
    Move quality for moves 201 to 280 B: 4.9k W: 9.7k
    Move quality for moves 241 to 320 B: 3.1k W: 7.2k

@sanderland
Owner Author

Aha, that may be because of the random rotations it does! Other than rotations it's deterministic, and there is no real way to force them currently.

@bale-go
Contributor

bale-go commented Jun 28, 2020

I see. Probably using the stronger 20b model will help in this respect.

@sanderland
Owner Author

Yes, it's more consistent in that respect

@sanderland
Owner Author

20b

lee sedol vs alphago game 4

image

alphazero vs alphago master
image
image
image

(15b) one more alphazero vs alphago master
image
image
image

alphago ddk confirmed

@bale-go
Contributor

bale-go commented Jul 2, 2020

Yes, it is weird to see so many yellow and red dots in games like that.
Then again, AlphaGo-style bots play strange moves when they are losing, as they only care about winning the game.
In all of the above-mentioned games the losing player played weaker moves.

KataGo, on the other hand, plays to maximize the score, which makes its style more similar to humans.
I checked several KataGo vs. KataGo games on CGOS and they were all 5d+.
If you look at the Lee Sedol vs. AlphaGo games that AlphaGo won, you do not see the decrease in move quality near the end of the game.

1st game (Lee Sedol black):
Screenshot from 2020-07-02 08-45-44

2nd game (Lee Sedol white):
Screenshot from 2020-07-02 08-50-10

3rd game (Lee Sedol black):
Screenshot from 2020-07-02 08-56-08

5th game (Lee Sedol black):
Screenshot from 2020-07-02 08-58-20

Only in the last game did AlphaGo dip below 5d (except for the 4th game, which it lost).
According to Michael Redmond, that is where AlphaGo missed the tombstone squeeze.

@sanderland
Owner Author

I was trying to estimate the effect of the strength parameter of score loss, and found that even a fairly low value beats the highest calibrated rank, even though on OGS I found strength=0.5 to be around 5k, maybe -- something is weird!

image

SGF: https://gokibitz.com/kifu/BJ4N3xpCI?path=341

@bale-go
Contributor

bale-go commented Jul 4, 2020

I could also reproduce the behaviour.
Interestingly, the rank estimation, which is based on the same calibration, is quite accurate for score loss.
After checking the logs of several games, I found that calibrated rank plays a quite strong game overall, but at certain moves it makes serious blunders that lose it the game. That is why the rank estimation could not take them into account: the really bad moves were treated as outliers.

I found that this almost exclusively happens when two equally good moves are available with high policy values, for example top move 50%, second best 42% (third best 3%, etc.).
The obvious-move detector does not catch this, since the top move is only at 50%; P:Pick then randomly chooses the third- or fourth-best move. See the red ellipses in the game between a 5d calibrated rank bot (B) and 0.5 score loss (W).
Screenshot from 2020-07-04 12-51-00

I decided to solve the issue by playing the top policy move whenever the sum of the two best policy values was over a certain threshold (a sketch follows at the end of this comment). After the patch, the calibrated rank bot (5d) won all of its games against ScoreLoss.
Screenshot from 2020-07-04 11-55-45

I tested the calibrated rank bot (2k) against score loss (0.5) too. The rank estimation worked, and the two bots played an even game.
Screenshot from 2020-07-04 13-56-23

I created a PR for ai.py. Unfortunately, I could not simply apply the patch to the rank estimation, since the policy value of the second-best move is not available in graph.py.
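
A minimal sketch of the override (the threshold and names are illustrative, not the exact values in the PR):

```python
def pick_with_two_best_override(policy_sorted, pick_fn, threshold=0.85):
    """policy_sorted: policy values of the legal moves, sorted descending.

    When the two best moves together dominate the policy (e.g. 50% + 42%),
    play the top move outright instead of letting P:Pick fall through to a
    near-zero third or fourth choice; otherwise use the normal pick logic.
    """
    if policy_sorted[0] + policy_sorted[1] > threshold:
        return 0  # index of the top policy move
    return pick_fn(policy_sorted)
```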

@bale-go bale-go mentioned this issue Jul 4, 2020
@sanderland
Owner Author

Let's give it a try. I'm a bit concerned about whether this eventually turns into obvious_n or something. Also, about whether this is appropriate for the lower-strength AIs, as I'm sure this case happens a lot in joseki, and the bot won't even play the second-best move!

@sanderland
Copy link
Owner Author

still looks kind of similar, strength=0.5 murders calibrated rank 5d while being judged ddk :o
image

@bale-go
Contributor

bale-go commented Jul 4, 2020

Strange. I ran 6 games; 5d calibrated rank won all of them.
I used the 15b model and 500 max visits.

@sanderland
Owner Author

I ran two more and calibrated rank won; perhaps it was a fluke.

@bale-go
Contributor

bale-go commented Jul 5, 2020

It seems that the bots became too strong (OGS ranks).
I ran several games to see the percentage of overrides.
Obvious best moves were overridden in 16% of moves. The best-two-moves patch caused an additional 12% of moves to be overridden (28% in total), which caused a significant change in strength.
I opted to set a constant, pretty high threshold for the two best moves to keep the original calibration but still remove the obvious cases. Now only 3% of moves are overridden due to two obvious moves.

@sanderland
Owner Author

Yes, particularly the weaker bots are getting a lot stronger. The effect on the higher ranks seems smaller, perhaps because policy alone becomes a worse strategy there. I updated the OGS bots; let's see them plummet.

@sanderland sanderland removed the 1.3 label Jul 7, 2020
@sanderland sanderland changed the title Rank feedback to user Improving rank feedback to user Jul 7, 2020