## Introducing the question

I am interested in determining how the difference between two ratings relates to the probability of winning for each player. 

For example, suppose that we meet to play a chess match. I, having been a decent player, have a rating of 1800. You just started playing the other day, so you have a rating of 1000. The idea behind a rating is to let us know that we are not evenly matched. Given my rating, you should know that I will probably crush you, and that your chance of winning is far lower than mine. So how can I quantify this intuition?

My intuition tells me that the bigger the difference, the more likely the person with the higher rating is to win.

In this notebook, we make variables to cook our models with. 

## Making new variables

Our ingredients for this cake will be the following:

1) diff_rating will quantify the difference between the player ratings (white's rating minus black's rating).

2) abs_diff_rating will simply take the absolute value of the quantity above.

3) higher_rating_won will be a binary variable taking 1 if the person with the higher rating won, and 0 otherwise.

4) higher_rating_coded will encode whether white was the higher rated player (takes value of 1, 0 otherwise).

5) result will encode whether the higher rated player lost, tied, or won (0, 1, 2).

6) white_higher_rated encodes whether the higher rated player was white or not (1, 0).

TODO Could the variable names higher_rating_coded and white_higher_rated be confusing?

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [2]:
games = pd.read_csv('games.csv')

In [3]:
games['diff_rating'] = games.white_rating - games.black_rating

In [4]:
games['abs_diff_rating'] = np.abs(games['diff_rating'])

In [6]:
games['higher_rating'] = ''
games.loc[games.diff_rating > 0, 'higher_rating'] = 'white'
games.loc[games.diff_rating < 0, 'higher_rating'] = 'black'
games.loc[games.diff_rating == 0, 'higher_rating'] = 'same'

In [7]:
games['higher_rating_won'] = 0
games.loc[games.winner == games.higher_rating, 'higher_rating_won'] = 1

In [8]:
games['higher_rating_coded'] = 0
games.loc[games.higher_rating == 'white', 'higher_rating_coded'] = 1

In [9]:
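# 0 = higher rated player lost (default), 1 = draw, 2 = higher rated player won
# Note: decisive games between equally rated players also keep the default 0.
games['result'] = 0
games.loc[games.winner == 'draw', 'result'] = 1
games.loc[games.higher_rating_won == 1, 'result'] = 2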
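
One thing worth checking here is what happens when both players have the same rating. The sketch below repeats the construction above on a tiny made-up table (the `toy` data is hypothetical, not taken from games.csv) to show that a decisive game between equally rated players ends up with result 0, the same code as "the higher rated player lost". Dropping such games, as done at the end, is only one possible way to handle them and is an assumption on my part.

```python
import pandas as pd

# Toy data (hypothetical): the second game is between equally rated players.
toy = pd.DataFrame({
    'white_rating': [1500, 1500, 1400],
    'black_rating': [1400, 1500, 1500],
    'winner':       ['white', 'black', 'draw'],
})

toy['diff_rating'] = toy.white_rating - toy.black_rating

# Same construction as in the cells above.
toy['higher_rating'] = ''
toy.loc[toy.diff_rating > 0, 'higher_rating'] = 'white'
toy.loc[toy.diff_rating < 0, 'higher_rating'] = 'black'
toy.loc[toy.diff_rating == 0, 'higher_rating'] = 'same'

toy['higher_rating_won'] = 0
toy.loc[toy.winner == toy.higher_rating, 'higher_rating_won'] = 1

toy['result'] = 0
toy.loc[toy.winner == 'draw', 'result'] = 1
toy.loc[toy.higher_rating_won == 1, 'result'] = 2

# The equal-rating game was decisive, yet it carries the "lost" code.
equal_rated = toy[toy.higher_rating == 'same']
print(equal_rated[['diff_rating', 'winner', 'higher_rating_won', 'result']])

# One possible remedy (an assumption, not a choice made in this notebook):
# leave equally rated games out before modelling.
toy_clean = toy[toy.higher_rating != 'same']
```

Whether these games matter depends on how many of them there are in the real data, which `(games.diff_rating == 0).sum()` would tell us.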

In [10]:
games['white_higher_rated'] = 0
games.loc[games.higher_rating == 'white', 'white_higher_rated'] = 1
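
Since higher_rating_coded and white_higher_rated are both set to 1 exactly where higher_rating is 'white', they should hold the same values. The following is a minimal sketch on made-up rating differences (the `toy` table is hypothetical) that checks this, and also shows a more compact way to get the same flag directly from diff_rating.

```python
import pandas as pd

# Toy rating differences (hypothetical): positive, negative, zero, positive.
toy = pd.DataFrame({'diff_rating': [120, -45, 0, 8]})

toy['higher_rating'] = ''
toy.loc[toy.diff_rating > 0, 'higher_rating'] = 'white'
toy.loc[toy.diff_rating < 0, 'higher_rating'] = 'black'
toy.loc[toy.diff_rating == 0, 'higher_rating'] = 'same'

# The two columns are built with the same rule, as in the cells above.
toy['higher_rating_coded'] = 0
toy.loc[toy.higher_rating == 'white', 'higher_rating_coded'] = 1

toy['white_higher_rated'] = 0
toy.loc[toy.higher_rating == 'white', 'white_higher_rated'] = 1

# True when every row agrees between the two columns.
identical = bool((toy.higher_rating_coded == toy.white_higher_rated).all())
print(identical)

# Equivalent one-liner: white is higher rated exactly when diff_rating > 0.
compact = (toy.diff_rating > 0).astype(int)
print(compact.tolist())
```

If the same check passes on the full games table, one of the two columns could probably be dropped.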

Now is a good time to look at the variables we just created, alongside the original winner variable.

In [11]:
# Sample 10 distinct rows (np.random.randint could pick the same row twice).
cols = ['higher_rating', 'higher_rating_coded', 'diff_rating',
        'abs_diff_rating', 'higher_rating_won', 'winner',
        'result', 'white_higher_rated']
games.sample(n=10)[cols]

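| | higher_rating | higher_rating_coded | diff_rating | abs_diff_rating | higher_rating_won | winner | result | white_higher_rated |
|---|---|---|---|---|---|---|---|---|
| 15871 | black | 0 | -4 | 4 | 0 | white | 0 | 0 |
| 6468 | white | 1 | 204 | 204 | 0 | black | 0 | 1 |
| 3136 | white | 1 | 32 | 32 | 0 | black | 0 | 1 |
| 6818 | white | 1 | 96 | 96 | 1 | white | 2 | 1 |
| 18632 | white | 1 | 323 | 323 | 1 | white | 2 | 1 |
| 14138 | white | 1 | 23 | 23 | 0 | black | 0 | 1 |
| 3635 | black | 0 | -101 | 101 | 0 | white | 0 | 0 |
| 8328 | white | 1 | 128 | 128 | 0 | black | 0 | 1 |
| 3760 | black | 0 | -15 | 15 | 1 | black | 2 | 0 |
| 1909 | black | 0 | -70 | 70 | 1 | black | 2 | 0 |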


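Looking at a random sample of our dataset, we can see that the new variables behave as intended. For example, in game 6468 white was rated 204 points higher but lost to black, so higher_rating_won is 0 and result is 0.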
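
Eyeballing ten rows is reassuring, but it only covers ten rows. Here is a minimal sketch of how the same consistency rules could be checked programmatically. The helper `check_derived_columns` is a hypothetical name introduced for illustration, and the `sample` table just re-types four of the rows shown above so that the block runs on its own.

```python
import pandas as pd

# Four rows copied from the sample shown above.
sample = pd.DataFrame(
    {
        'higher_rating': ['black', 'white', 'white', 'black'],
        'diff_rating': [-4, 204, 96, -15],
        'abs_diff_rating': [4, 204, 96, 15],
        'higher_rating_won': [0, 0, 1, 1],
        'winner': ['white', 'black', 'white', 'black'],
        'result': [0, 0, 2, 2],
        'white_higher_rated': [0, 1, 1, 0],
    },
    index=[15871, 6468, 6818, 3760],
)


def check_derived_columns(df):
    """Return named consistency checks; True means the check passed on every row."""
    won = df.higher_rating_won == 1
    return {
        # abs_diff_rating is the absolute value of diff_rating
        'abs_matches_diff': bool((df.abs_diff_rating == df.diff_rating.abs()).all()),
        # higher_rating_won is 1 exactly when the winner is the higher rated side
        'won_matches_winner': bool(((df.winner == df.higher_rating) == won).all()),
        # result is 2 exactly when the higher rated player won
        'result_2_iff_won': bool(((df.result == 2) == won).all()),
        # white_higher_rated is 1 exactly when diff_rating is positive
        'white_flag_matches_sign': bool(((df.diff_rating > 0) == (df.white_higher_rated == 1)).all()),
    }


checks = check_derived_columns(sample)
print(checks)
```

Calling `check_derived_columns(games)` on the full table should apply the same rules to every game instead of a handful.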

In [38]:
games.to_csv('games_new_vars.csv')