In [65]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
import seaborn as sns
import sklearn as sk
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [2]:
games = pd.read_csv('games.csv')

In [3]:
games.columns

Index(['id', 'rated', 'created_at', 'last_move_at', 'turns', 'victory_status',
       'winner', 'increment_code', 'white_id', 'white_rating', 'black_id',
       'black_rating', 'moves', 'opening_eco', 'opening_name', 'opening_ply'],
      dtype='object')

I am interested in determining how the difference between two ratings relates to the probability of winning for each player. 

For example, suppose that we meet to play a chess match. I have been a decent player for a while, so I have a rating of 1800. You just started playing the other day, so you have a rating of 1000. The idea behind a rating is to let us know that we are not evenly matched. Given my rating, you should know that I will probably crush you, and your chance of winning should be lower than mine. So how can I quantify this intuition?

My intuition tells me that the bigger the difference, the more likely it is that the player with the higher rating will win.
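One common way to put a number on this intuition is the classical Elo expected-score formula. The sketch below assumes that formula purely for illustration; the ratings in games.csv may come from a different rating system, and the function name `elo_expected_score` is my own. Note that an expected score counts a draw as half a win, so it is not exactly a win probability.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the classical Elo formula.

    Assumption: ratings behave like Elo ratings on the usual 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# The 1800 vs. 1000 example from the text.
strong = elo_expected_score(1800, 1000)
weak = elo_expected_score(1000, 1800)

print(f"Expected score for the 1800 player: {strong:.3f}")
print(f"Expected score for the 1000 player: {weak:.3f}")
```

Under this assumption the two expected scores always sum to 1, and an 800-point gap leaves the lower-rated player with an expected score of roughly 1%.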

Our ingredients for this cake will be a few new variables:

1) diff_rating will quantify the difference between the player ratings (white's rating minus black's rating).

2) abs_diff_rating will simply take the absolute value of the quantity above.

3) hr_won will be a binary variable taking 1 if the person with the higher rating won, and 0 if they didn't.


Now, we intend to use the abs_diff_rating to predict whether hr_won will be a 1 or a 0 using a logistic regression. 
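For reference, the model being fitted can be written as follows. The symbols $d$, $\beta_0$ and $\beta_1$ are my own notation: $d$ stands for abs_diff_rating, and the two coefficients are what the logistic regression estimates from the data.

```latex
P(\text{hr\_won} = 1 \mid d) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 d)}}
\qquad\Longleftrightarrow\qquad
\log\frac{P(\text{hr\_won} = 1 \mid d)}{1 - P(\text{hr\_won} = 1 \mid d)} = \beta_0 + \beta_1 d
```

If the intuition above is right, we would expect $\beta_1 > 0$: each extra rating point of difference raises the log-odds that the higher-rated player wins.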

In [4]:
# Create a new variable for the difference in rating between two players.
games['diff_rating'] = games.white_rating - games.black_rating

In [67]:
# Create a new variable with absolute value of differences
games['abs_diff_rating'] = np.abs(games['diff_rating'])

In [92]:
# higher_rating encodes which player has the higher rating.
# diff_rating is white minus black, so a positive difference means white is rated higher,
# whereas a negative one means black is rated higher.
# Assigning the whole column at once avoids the chained assignment
# (games.higher_rating[i] = ...) that triggers pandas' SettingWithCopyWarning.
games['higher_rating'] = np.select(
    [games['diff_rating'] > 0, games['diff_rating'] < 0],
    ['white', 'black'],
    default='same',
)

In [69]:
# Now, we encode a variable that is 1 if the higher-rated player won and 0 otherwise.
# Draws and games between equally rated players are coded 0.
games['hr_won'] = (games['higher_rating'] == games['winner']).astype(int)

In [71]:
# Also code higher_rating as a binary so that we can easily use it:
# 1 if white is the higher-rated player, 0 otherwise.
# Note that games where both players have the same rating are also coded 0 here;
# I'm not treating them separately for now.
games['higher_rating_coded'] = (games['higher_rating'] == 'white').astype(int)

Now would be a good time to show the variables we just created. Also, let's see the original winner variable. 

In [76]:
games[['higher_rating','higher_rating_coded', 'diff_rating', 'abs_diff_rating', 'hr_won', 'winner']].head(10)

   higher_rating  higher_rating_coded  diff_rating  abs_diff_rating  hr_won  winner
0          white                    1          309              309       1   white
1          white                    1           61               61       0   black
2          black                    0           -4                4       0   white
3          black                    0          -15               15       0   white
4          white                    1           54               54       1   white
5          white                    1          248              248       0    draw
6          white                    1           97               97       1   white
7          black                    0         -695              695       1   black
8          white                    1           47               47       0   black
9          white                    1          172              172       1   white


With this small output, we can see that we've done what we set out to do. If the higher_rating is equal to the winner, we get a 1 for hr_won, 0 otherwise. Note that a draw (row 5) also gets a 0, because the higher-rated player did not win.
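As a quick sanity check of the coding rules, here is a small sketch on a toy frame. Four rows are copied from the output above (rows 0, 1, 5 and 7); the fifth row, a game between equally rated players, is hypothetical and only there to exercise the 'same' case. The variables are built with np.select and a boolean comparison, which is one way to construct them.

```python
import numpy as np
import pandas as pd

# Rows 0, 1, 5 and 7 from the output above, plus one hypothetical
# game between equally rated players (diff_rating == 0).
toy = pd.DataFrame({
    'diff_rating': [309, 61, 248, -695, 0],
    'winner': ['white', 'black', 'draw', 'black', 'white'],
})

# Positive difference: white is rated higher; negative: black; zero: same.
toy['higher_rating'] = np.select(
    [toy['diff_rating'] > 0, toy['diff_rating'] < 0],
    ['white', 'black'],
    default='same',
)

# 1 only when the higher-rated side is also the winner.
toy['hr_won'] = (toy['higher_rating'] == toy['winner']).astype(int)

# Cross-tabulate who was rated higher against who actually won.
print(pd.crosstab(toy['higher_rating'], toy['winner']))
print(toy['hr_won'].tolist())
```

The draw and the equal-rating game both end up with hr_won equal to 0, which is worth keeping in mind when interpreting the model: a 0 means "the higher-rated player did not win", not "the lower-rated player won".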

Now, we will take abs_diff_rating and use it as our predictor for hr_won using a logistic regression.

In [88]:
#  Setup our logistic regression and variables to use
clf = LogisticRegression()
X = games[['abs_diff_rating']]
y = games.hr_won

In [89]:
clf.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [90]:
clf.score(X, y)

0.6278791504636554

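We need an explanation for this. clf.score returns the mean accuracy on the data it is given, so this number says that the model's predictions match hr_won for about 63% of the games it was trained on. Because it is measured on the same data used for fitting, it may overstate how well the model would predict new games. 
Also, can I get an individualized prediction? Given a difference in rating, what is the probability that the higher rating will win? 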
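To make the meaning of that score more concrete, here is a minimal sketch on synthetic data. The data is a made-up stand-in for games.csv (the coefficients 0.1 and 0.004 used to generate it are arbitrary assumptions), so the printed numbers say nothing about the real games. It shows that score is the same thing as accuracy_score on the model's predictions, measures it on held-out games, and compares it to a baseline that always predicts the more common outcome.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for abs_diff_rating and hr_won (assumption, not real data).
rng = np.random.default_rng(0)
abs_diff = rng.integers(0, 800, size=2000)
p_win = 1.0 / (1.0 + np.exp(-(0.1 + 0.004 * abs_diff)))
hr_won = (rng.random(2000) < p_win).astype(int)

X = abs_diff.reshape(-1, 1)
y = hr_won

# Hold out a quarter of the games so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression().fit(X_train, y_train)

# clf.score is the mean accuracy: the share of predictions that match y.
test_acc = clf.score(X_test, y_test)
same_acc = accuracy_score(y_test, clf.predict(X_test))

# Baseline: always predict the majority class seen in the training set.
majority = int(y_train.mean() >= 0.5)
baseline_acc = float(np.mean(y_test == majority))

print(f"held-out accuracy: {test_acc:.3f}")
print(f"accuracy_score:    {same_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

An accuracy figure is only informative relative to such a baseline: if the higher-rated player wins most games anyway, always predicting 1 already scores well.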

In [91]:
clf.predict(X)

array([1, 1, 0, ..., 1, 1, 1])
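For the individualized prediction asked about above, scikit-learn's LogisticRegression provides predict_proba, which returns a probability per class instead of a hard 0/1 label. The sketch below fits a model on synthetic data (an assumption standing in for games.csv, generated with arbitrary coefficients) and asks for the probability that the higher-rated player wins at a few rating differences.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for abs_diff_rating and hr_won (assumption, not real data).
rng = np.random.default_rng(1)
abs_diff = rng.integers(0, 800, size=2000)
p_win = 1.0 / (1.0 + np.exp(-(0.1 + 0.004 * abs_diff)))
hr_won = (rng.random(2000) < p_win).astype(int)

clf = LogisticRegression().fit(abs_diff.reshape(-1, 1), hr_won)

# Rating differences we want individual predictions for.
new_diffs = np.array([[0], [100], [400], [800]])

# Columns of predict_proba follow clf.classes_, here [0, 1],
# so column 1 is the estimated P(hr_won = 1).
proba = clf.predict_proba(new_diffs)[:, 1]

# The same numbers by hand: the logistic function applied to
# intercept + coefficient * difference.
manual = 1.0 / (1.0 + np.exp(-(clf.intercept_[0] + clf.coef_[0, 0] * new_diffs.ravel())))

for d, p in zip(new_diffs.ravel(), proba):
    print(f"abs_diff_rating = {d:3d} -> P(higher rating wins) = {p:.3f}")
```

With the model fitted in this notebook, the equivalent call would be along the lines of clf.predict_proba(pd.DataFrame({'abs_diff_rating': [800]}))[:, 1], since that model was trained on a DataFrame with that column name.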