<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Chess</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/chess/">https://discovery.cs.illinois.edu/microproject/chess/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Lichess.org Open Database

The website [lichess.org](https://lichess.org) is a completely free platform for playing chess online. It provides a [database](https://database.lichess.org/) containing a massive number of games, puzzles, and engine evaluations. Over 5 billion games are stored in the database, with about a hundred million added each month. Since working with the full dataset is not feasible, we will use only the games from the first month: January 2013. There were 121,114 games played then (after cleaning), and we provide them in CSV format in `lichess_games_1-2013.csv`. Each row represents one game played.

Load the dataset into a DataFrame called `df`:

In [0]:
df = ...
df
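
As a minimal sketch of reading CSV data with pandas, the snippet below parses a tiny in-memory sample instead of the real file. The two sample rows and the name `sample_df` are made up for illustration; in the notebook, the same `pd.read_csv` call would take the filename `lichess_games_1-2013.csv`, assuming the file sits in the same folder as the notebook.

```python
import io
import pandas as pd

# Made-up sample with a few of the dataset's columns (not real games).
sample_csv = io.StringIO(
    "WhiteElo,BlackElo,Result,Termination\n"
    "1500,1480,1-0,Normal\n"
    "1620,1710,0-1,Time forfeit\n"
)

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(sample_csv)
print(sample_df.shape)  # (2, 4)
```

Note that this also imports pandas as `pd`, which later cells in this notebook rely on.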

If you're not familiar with chess, here's an overview of the columns:
- WhiteElo: The Elo of the person playing white (Elo, or rating, is a measure of skill level)
- BlackElo: The Elo of the person playing black
- Result: The result of the game. `"1-0"` means white won, `"0-1"` means black won, and `"1/2-1/2"` means a draw
- Termination: Whether the game ended normally (such as by checkmate) or if a player ran out of time
- Opening: The name for the opening of the game (the first few moves played)
- ECO: The code for the opening, as classified by the [Encyclopaedia of Chess Openings](https://en.wikipedia.org/wiki/Encyclopaedia_of_Chess_Openings)
- TimeControl: The amount of time each player had to play the game. For example, `"600+8"` means each player had 600 seconds (10 minutes), and 8 seconds were added after each move
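
To make the `TimeControl` format concrete, here is a small sketch that splits a value such as `"600+8"` into its two parts. The helper name `parse_time_control` is hypothetical, and the sketch assumes every value has the form `base+increment`; any other format that appears in the data would need separate handling.

```python
def parse_time_control(time_control):
    """Split a "base+increment" string into two integers (in seconds)."""
    base, increment = time_control.split("+")
    return int(base), int(increment)

base_seconds, increment_seconds = parse_time_control("600+8")
print(base_seconds / 60, increment_seconds)  # 10.0 8
```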

<hr style="color: #DD3403;">

## Puzzle 1: Conditional Probabilities

### Puzzle 1.1: Result Given Termination

Does the result of a game depend on whether it ended by timeout? For example, is a draw more or less likely to occur when a game ends by timeout?
Calculate the following probabilities for reference:
- $P(\text{Result} = \text{``1-0''})$
- $P(\text{Result} = \text{``0-1''})$
- $P(\text{Result} = \text{``1/2-1/2''})$

In [0]:
P_white_wins = ...
P_white_wins

In [0]:
P_black_wins = ...
P_black_wins

In [0]:
P_draw = ...
P_draw

You should find that white wins over half the time and that draws are by far the least common result.

Now calculate the following conditional probabilities:
- $P(\text{Result} = \text{``1-0''}\;|\;\text{Termination} = \text{Time Forfeit})$
- $P(\text{Result} = \text{``0-1''}\;|\;\text{Termination} = \text{Time Forfeit})$
- $P(\text{Result} = \text{``1/2-1/2''}\;|\;\text{Termination} = \text{Time Forfeit})$
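
The difference between a plain probability and a conditional probability can be sketched on a toy DataFrame. Everything below is made-up illustration data, and the exact string `"Time forfeit"` is an assumption; check the values that actually appear in the `Termination` column of `df`. The key idea is that a conditional probability restricts the rows first and then takes the fraction within that subset.

```python
import pandas as pd

# Made-up toy data: 5 games, 3 of which ended by time forfeit.
toy = pd.DataFrame({
    "Result": ["1-0", "0-1", "1-0", "1/2-1/2", "1-0"],
    "Termination": ["Normal", "Time forfeit", "Time forfeit", "Normal", "Time forfeit"],
})

# P(Result = "1-0"): fraction of all rows where white won.
p_white = (toy["Result"] == "1-0").mean()

# P(Result = "1-0" | Termination = "Time forfeit"):
# restrict to the time-forfeit rows first, then take the fraction.
forfeits = toy[toy["Termination"] == "Time forfeit"]
p_white_given_forfeit = (forfeits["Result"] == "1-0").mean()

print(p_white, round(p_white_given_forfeit, 4))  # 0.6 0.6667
```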

In [0]:
P_white_wins_given_time_forfeit = ...
P_white_wins_given_time_forfeit

In [0]:
P_black_wins_given_time_forfeit = ...
P_black_wins_given_time_forfeit

In [0]:
P_draw_given_time_forfeit = ...
P_draw_given_time_forfeit

Run the cell below (don't change anything) to get a summary of the results!

In [0]:
pd.DataFrame({
    "P(Result)": [P_white_wins, P_black_wins, P_draw],
    "P(Result | Time forfeit)": [P_white_wins_given_time_forfeit, P_black_wins_given_time_forfeit, P_draw_given_time_forfeit]
}, index=["White wins", "Black wins", "Draw"])

Notice that white wins slightly more often when the game ends by time forfeit, and the probability of a draw drops sharply.
This is likely because of the [unique circumstances](https://en.wikipedia.org/wiki/Draw_(chess)#:~:text=If%20only%20one%20player%20has,is%20lost%20by%20the%20player.) required for a draw to occur from a time forfeit: a draw only occurs if it's *impossible* for the player who didn't run out of time to checkmate the other player!

In [0]:
### TEST CASE for Puzzle 1: Conditional Probabilities
#
# What is this cell?
# - This cell contains test cases for the MicroProject. Even though you can modify this
#   cell, you should treat it like it's a read-only cell since it will be replaced with
#   a fresh version when your code is checked.
#
# - If this cell runs without any error in the output, you PASSED all test cases!
#   We try and make these test cases as useful and complete as possible, but there is
#   a chance your code may be incorrect even though you pass the test cases (these
#   tests should be seen as a way to give you confidence that code you believe is
#   actually correct, not as a robust check to catch all possible errors).
#
# - If this cell results in any errors, check your previous cells, make changes, and
#   RE-RUN your code and then re-run this cell.  Keep repeating this until the cell
#   passes with no errors! :)

tada = "\N{PARTY POPPER}"
import math

assert(math.isclose(P_white_wins, 0.5119887048565814)), "Your value for P_white_wins is incorrect"
assert(math.isclose(P_black_wins, 0.4552652872500289)), "Your value for P_black_wins is incorrect"
assert(math.isclose(P_draw, 0.032746007893389696)), "Your value for P_draw is incorrect"
assert(math.isclose(P_white_wins_given_time_forfeit, 0.5294521814962607)), "Your value for P_white_wins_given_time_forfeit is incorrect"
assert(math.isclose(P_black_wins_given_time_forfeit, 0.4568325361380513)), "Your value for P_black_wins_given_time_forfeit is incorrect"
assert(math.isclose(P_draw_given_time_forfeit, 0.01371528236568801)), "Your value for P_draw_given_time_forfeit is incorrect"

print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Puzzle 2: Game Result Probabilities

### Puzzle 2.1: Probability of Winning Formula

A player's Elo changes after each game, and the change depends on the Elo of the opponent. Someone who wins against a higher-rated player gains more Elo than if they win against a lower-rated player. This reflects the fact that a higher-rated player is considered more likely to win. But exactly how likely?

Suppose two players, A and B, are playing a game of chess. A common formula for the probability that A wins is:
$$P(\text{A wins}) = \frac{1}{1 + 10^{(-r/400)}}$$
where $$r = \text{A's Elo} - \text{B's Elo}$$

For example, if A has an Elo of 1000 and B has an Elo of 1200, then $r = -200$.

Therefore, $P(\text{A wins}) = \frac{1}{1 + 10^{200/400}} = 0.2403$. It makes sense that A has a low chance of winning, since they are rated 200 points lower than B.

This formula assumes that there are only two results: A wins or B wins. A [more complicated formula](https://en.wikipedia.org/wiki/Elo_rating_system#Formal_derivation_for_win/draw/loss_games) accounts for draws, but we won't use it here. This means that $$P(\text{B wins}) = 1 - P(\text{A wins})$$
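
The worked example above can be checked numerically. This is only a sketch of the arithmetic for the 1000 vs. 1200 case; the variable names `p_a` and `p_b` are chosen here for illustration. It also confirms that applying the same formula from B's side (where the sign of $r$ flips) gives the complement of A's probability.

```python
import math

# Worked example from the text: A is rated 1000, B is rated 1200.
r = 1000 - 1200                    # r = -200
p_a = 1 / (1 + 10 ** (-r / 400))   # 1 / (1 + 10^(200/400))
p_b = 1 / (1 + 10 ** (r / 400))    # same formula from B's side (r flips sign)

print(round(p_a, 4))                 # 0.2403
print(math.isclose(p_a + p_b, 1.0))  # True
```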

Create a function `p_A_wins` that takes two arguments, `elo_A` and `elo_B`, and returns the probability that player A wins.

In [0]:
...

You may have heard of the [chess cheating scandal](https://en.wikipedia.org/wiki/Carlsen%E2%80%93Niemann_controversy) from 2022 where Magnus Carlsen, the world champion at the time, accused his opponent Hans Niemann of cheating after [losing to him](https://www.chess.com/events/2022-sinquefield-cup/03/Carlsen_Magnus-Niemann_Hans_Moke). At the time, Hans Niemann had an Elo of 2688, and Magnus Carlsen had an Elo of 2861. Use your function to estimate the probability of Hans Niemann winning:

In [0]:
p_Hans_wins = ...
p_Hans_wins

Keep in mind that the formula from this section is a very rough approximation. In reality, the probability of winning is affected by many factors, such as whether the player is playing white or black, the time control, and whether their current Elo truly reflects their skill level.

In [0]:
### TEST CASE for Puzzle 2.1: Probability of Winning Formula

assert("p_A_wins" in vars()), "Make sure to define the function `p_A_wins`"
assert(math.isclose(p_A_wins(2000, 2000), 0.5)), "Your function does not return the correct probability"
assert(math.isclose(p_A_wins(1000, 1200), 0.2402530733520421)), "Your function does not return the correct probability"
assert(math.isclose(p_A_wins(1000, 2000), 0.0031523091832602115)), "Your function does not return the correct probability"
assert(math.isclose(p_A_wins(1500, 1200), 0.8490204427886767)), "Your function does not return the correct probability"

assert("p_Hans_wins" in vars()), "Make sure to define the variable `p_Hans_wins`"
assert(not math.isclose(p_Hans_wins, 0.730245413297424)), "Your value for p_Hans_wins is incorrect. Remember that your function returns the probability of the first player winning."
assert(math.isclose(p_Hans_wins, 0.269754586702576)), "Your value for p_Hans_wins is incorrect"

print(f"{tada} All Tests Passed! {tada}") 

### Puzzle 2.2: Your Elo

Let's take a look at your probabilities of winning against different opponents!
- If you have a chess Elo, put it in the variable `your_elo`.
- If you don't have an Elo, feel free to use `1200` or another reasonable value.  *(An Elo of `500` would represent someone who understands the rules of chess, but uses very little strategy; an Elo of `1200` might be someone who plays recreationally; an Elo of `2000` begins to be the range of players who play professionally.)*

In [0]:
your_elo = ...
your_elo

The lowest possible Elo on Lichess is 400. Calculate the probability of winning against a player with that Elo.

In [0]:
p_win_400 = ...
p_win_400

Now calculate the average Elo of every player in the dataset. Make sure to include both white and black players.

*Hint: There are exactly as many white players as black players in the dataset, so you can average the two means.*
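
To see why the hint works, the sketch below compares the average of the two column means with the mean of all values pooled together. The three rows are made up for illustration. The two numbers agree because both columns have the same number of entries; with unequal counts, a simple average of the two means would not match the pooled mean in general.

```python
import pandas as pd

# Made-up toy data with the same column names as the dataset.
toy = pd.DataFrame({
    "WhiteElo": [1500, 1700, 1300],
    "BlackElo": [1600, 1400, 1800],
})

# Average of the two column means.
mean_of_means = (toy["WhiteElo"].mean() + toy["BlackElo"].mean()) / 2

# Mean of all six ratings pooled into a single Series.
pooled_mean = pd.concat([toy["WhiteElo"], toy["BlackElo"]]).mean()

print(mean_of_means, pooled_mean)  # 1550.0 1550.0
```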

In [0]:
average_elo = ...
average_elo

Now calculate the probability of winning against a player with the average Elo.

In [0]:
p_win_average_elo = ...
p_win_average_elo

What about yourself? Calculate the probability of winning against someone with your own Elo.

In [0]:
p_win_yourself = ...
p_win_yourself

Congratulations, you played yourself!

The highest Elo ever reached by a human was **2882** by Magnus Carlsen in 2014. Calculate the probability of winning against Magnus Carlsen at his peak.

In [0]:
p_win_magnus = ...
p_win_magnus

Feel free to try other Elo values in a new cell!

In [0]:
### TEST CASE for Puzzle 2.2: Your Elo

_p = lambda a, b: 1 / (1 + 10 ** ((b - a) / 400))

assert("your_elo" in vars()), "Make sure to define the variable `your_elo`"
assert(100 < your_elo < 3000), "Your Elo is not in a reasonable range"
assert("p_win_400" in vars()), "Make sure to define the variable `p_win_400`"
assert(math.isclose(p_win_400, _p(your_elo, 400))), "Your value for p_win_400 is incorrect"
assert("average_elo" in vars()), "Make sure to define the variable `average_elo`"
assert(math.isclose(average_elo, 1600.8983024258137)), "Your value for average_elo is incorrect"
assert("p_win_yourself" in vars()), "Make sure to define the variable `p_win_yourself`"
assert(math.isclose(p_win_yourself, 0.5)), "Your value for p_win_yourself is incorrect"
assert("p_win_average_elo" in vars()), "Make sure to define the variable `p_win_average_elo`"
assert(math.isclose(p_win_average_elo, _p(your_elo, average_elo))), "Your value for p_win_average_elo is incorrect"
assert("p_win_magnus" in vars()), "Make sure to define the variable `p_win_magnus`"
assert(math.isclose(p_win_magnus, _p(your_elo, 2882))), "Your value for p_win_magnus is incorrect"

print(f"{tada} All Tests Passed! {tada}") 

### Puzzle 2.3: Plotting Your Winning Chances

We've looked at a few different win probabilities. Now let's plot a bunch of different ones to see how the probability changes with Elo!

Using a for loop, create a DataFrame called `df_win_prob` with the columns `Elo` and `p_win`.
- You should consider all values of Elo from 400 to 2900 (inclusive).
- The `p_win` column should contain the probability of you winning against a player with that Elo.

You can build this DataFrame much like a simulation, except that nothing here is random: each pass through the loop records the current Elo and the probability of winning against a player with that Elo.


It may be useful to use the two-argument `range` function in Python.  We usually use the one-argument `range` function, which starts at `0` and stops just before the number you specify.  For example, `range(100)` starts at `0`, stops before `100`, and includes all the numbers `0...99`.

```py
for i in range(100):
  print(i)
# Result: 0  1  2  ...  98  99
```

Using the two-argument `range` function, you specify the **start** and the **stop**.  If we want to start at `42` instead of the default of `0`, we can use the following code:

```py
for i in range(42, 100):
  print(i)
# Result: 42  43  44  ...  98  99
```

It may be useful to do something like this for your `Elo`, similar to:

```py
for Elo in range(10, 20):
  print(Elo)
# Result: 10  11  12  ...  18  19
```
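
Because `range` leaves out its stop value, an inclusive range needs a stop of one more than the last value you want. The sketch below shows that, together with one common way to build a DataFrame in a loop: collect one dictionary per row in a list and convert the list at the end. The column names `n` and `n_squared` are placeholders chosen for this example.

```py
import pandas as pd

rows = []
# range excludes its stop value, so use 5 + 1 to include 5 itself.
for n in range(2, 5 + 1):
    rows.append({"n": n, "n_squared": n ** 2})

df_squares = pd.DataFrame(rows)
print(len(df_squares))  # 4
```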


In [0]:
...

Using `df_win_prob` and `plot.line`, plot a line chart.  The x-axis should be `Elo`, and the y-axis should be `p_win`.

In [0]:
...

In [0]:
### TEST CASE for Puzzle 2.3: Plotting Your Winning Chances
_p = lambda a, b: 1 / (1 + 10 ** ((b - a) / 400))
assert("df_win_prob" in vars()), "Make sure to create the DataFrame `df_win_prob`"
assert("Elo" in df_win_prob and "p_win" in df_win_prob), "Make sure to have the columns `Elo` and `p_win` in your DataFrame"

assert(400 in df_win_prob.values ), "Your DataFrame does not have the Elo 400"
assert(2900 in df_win_prob.values ), "Your DataFrame does not have the Elo 2900"
assert(len(df_win_prob) == 2501), "Your DataFrame does not have the correct number of rows"

assert(set(df_win_prob.Elo) == set(range(400, 2900 + 1))), "Your DataFrame does not have the correct Elo values"
for i, row in df_win_prob.iterrows():
    assert(math.isclose(row.p_win, _p(your_elo, row.Elo))), f"Your DataFrame does not have the correct p_win value for Elo={row.Elo}"

print(f"{tada} All Tests Passed! {tada}") 

<hr style="color: #DD3403;">

## Puzzle 3: Probability vs. Reality

Let's take everything we've done and put it together!

- We have a function to calculate the probability of winning given two Elo ratings, and
- We have a dataset of games where we know which side won the game.

Since our function doesn't allow for draws, filter the dataset to only include decisive (no draw) games. Store the result in a new DataFrame called `df_decisive`:

In [0]:
df_decisive = ...
df_decisive

In [0]:
### TEST CASE for Puzzle 3: df_decisive
assert("df_decisive" in vars()), "Make sure to create the DataFrame `df_decisive`"
assert(set(df_decisive.Result.unique()) == {"1-0", "0-1"}), "Make sure to filter the DataFrame to only include decisive games"
print(f"{tada} All Tests Passed! {tada}") 

### Puzzle 3.1: Probability of White Winning

The code below creates a copy of your `df_decisive` DataFrame and adds a new column `p_win_white` containing the probability that white wins the game given the Elo difference.  This code uses the `p_A_wins` function you defined earlier and the `apply` function of the DataFrame.

Run the code to add the column:

In [0]:
df_decisive = df_decisive.copy(deep=True)
df_decisive["p_win_white"] = df_decisive.apply(lambda row: p_A_wins(row["WhiteElo"], row["BlackElo"]), axis=1)
df_decisive

### Puzzle 3.2: Unlikely Upsets

Let's explore games that are **unlikely upsets**.  We'll define an **unlikely upset** to be a game where:

- Black wins when white had at least a 90% probability of winning, **OR**
- White wins when black had at least a 90% probability of winning
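
A two-part "OR" definition like this one maps onto boolean masks combined with `&` (and) and `|` (or). The sketch below uses a made-up four-row DataFrame and assumes that black's win probability is one minus white's, as in the two-outcome formula from Puzzle 2. Each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Made-up toy data (not real games).
toy = pd.DataFrame({
    "Result": ["1-0", "0-1", "0-1", "1-0"],
    "p_win_white": [0.95, 0.92, 0.05, 0.08],
})

# Black wins although white had at least a 90% chance.
black_upset = (toy["Result"] == "0-1") & (toy["p_win_white"] >= 0.9)

# White wins although black had at least a 90% chance.
# Assumption: P(black wins) = 1 - P(white wins).
white_upset = (toy["Result"] == "1-0") & ((1 - toy["p_win_white"]) >= 0.9)

toy_upsets = toy[black_upset | white_upset]
print(list(toy_upsets.index))  # [1, 3]
```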

Find all the unlikely upsets in `df_decisive` and save it as a new DataFrame `df_upsets`:

In [0]:
df_upsets = ...
df_upsets

### Puzzle 3.3: Likely Wins

To see how common unlikely upsets are, we also need to find out how often **likely wins** occur.  We'll define a **likely win** to be a game where:

- White wins when white had at least a 90% probability of winning, **OR**
- Black wins when black had at least a 90% probability of winning

Find all the likely wins in `df_decisive` and save it as a new DataFrame `df_likely`:

In [0]:
df_likely = ...
df_likely

### Puzzle 3.4: Percentage of Likely Wins

If our Elo formula is an accurate representation of a player's skill, then among games where there's a 90% or greater chance for one side to win, that side should **actually win 90% of the time**.

Calculate $P(\text{likely win})$, the probability that a game in which one side had a 90% or greater chance of winning ended in a likely win:
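
One way to read this definition, assuming the games with a 90% or greater favorite are exactly the likely wins plus the unlikely upsets (the symbols $N_{\text{likely}}$ and $N_{\text{upsets}}$ are notation introduced here for the row counts of `df_likely` and `df_upsets`):

$$P(\text{likely win}) = \frac{N_{\text{likely}}}{N_{\text{likely}} + N_{\text{upsets}}}$$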

In [0]:
P_likely_win = ...
P_likely_win

In [0]:
### TEST CASE for Puzzle 3: Probability vs. Reality
assert("df_likely" in vars()), "Make sure to create the DataFrame `df_likely`"
assert(len(df_likely) == 7098)
assert(len(df_upsets) == 793)
assert(math.isclose(P_likely_win, 0.899505766062603))

print(f"{tada} All Tests Passed! {tada}") 
print()
print(f"You calculated: P(likely win) = {round(P_likely_win * 100, 2)}% among all games in a month of chess!")
print("...this is almost exactly 90%, which suggests the Elo system is working!")

<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your notebook to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and return to https://discovery.cs.illinois.edu/microproject/chess/ and complete the section **"Commit and Grade Your Notebook"**.

3. If you see a 100% grade result on your GitHub Action, you've completed this MicroProject! 🎉
