## Simple logistic regression

Here, we will fit a few logistic regression models to predict the probability that the higher-rated player wins. 
Order of business: 

1) Perform k-fold cross validation (k = 5) for a simple logistic regression. The predictor will be the absolute value of the difference in rating (abs_diff_rating), 
and the response will be whether the higher-rated player won (higher_rating_won). 

2) Get familiar with the multiple logistic regression code before mass-producing it in a Python file. 

In [7]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import cross_val_score

In [8]:
games = pd.read_csv('games_new_vars.csv')

In [9]:
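# Double brackets keep the predictor as a 2-D DataFrame, which scikit-learn expects
predictor = games[['abs_diff_rating']]
response = games['higher_rating_won']
lg = LogisticRegression()
# Mean accuracy over the 5 cross-validation folds
score = cross_val_score(lg, predictor, response, cv=5).mean()
score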

0.6275795649700114
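
The score above is an accuracy (the default for a classifier passed to `cross_val_score`), so it is easier to read next to a baseline that always predicts the most common outcome. Below is a minimal sketch of that comparison. It runs on synthetic data because the CSV is not reproduced here: the generated values, the assumed link between rating gap and win probability, and the names `model_scores` and `baseline_scores` are all illustrative assumptions, not results from the real games.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for games_new_vars.csv (hypothetical data, not the real games).
rng = np.random.default_rng(0)
n_games = 2000
abs_diff_rating = rng.uniform(0, 600, size=n_games)
# Assumed relationship: a larger rating gap makes a win by the favourite more likely.
p_win = 1 / (1 + np.exp(-(0.1 + 0.004 * abs_diff_rating)))
higher_rating_won = (rng.uniform(size=n_games) < p_win).astype(int)

X = abs_diff_rating.reshape(-1, 1)  # scikit-learn expects a 2-D predictor array
y = higher_rating_won

# For a classifier, cross_val_score reports accuracy on each of the 5 folds.
model_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
# Baseline: always predict the most common outcome seen in the training folds.
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)

print("logistic regression:", model_scores.mean())
print("majority baseline:  ", baseline_scores.mean())
```

On the real data, the same baseline would be close to the share of games won by the higher-rated player, which is the number the 0.63 should be judged against.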

But I want more! I'm going to try adding two more variables to the logistic regression here. 

1) White is known to have an advantage over black by getting to move first. So, I intend to code a binary variable, white_higher_rated: 
1 if White had the higher rating, 0 otherwise. This may help show whether the higher-rated player's chance of winning depends on which color they play. 

2) Number of turns. I hypothesize that longer games tend to be more evenly matched, regardless of the rating difference. Perhaps by knowing the number of turns made, I can say something about the probability of the higher-rated player winning. 


In [10]:
games['white_higher_rated'] = 0
games.loc[games.higher_rating == 'white', 'white_higher_rated'] = 1
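
To check that this coding does what is intended, here is a small sketch on a made-up frame. Only the column names come from the notebook. The four rows and the `white_higher_rated_alt` column are hypothetical and exist just to compare the two-step assignment with a one-line equivalent.

```python
import pandas as pd

# Toy frame with the same column name as the notebook; the rows are made up.
toy = pd.DataFrame({"higher_rating": ["white", "black", "white", "black"]})

# Same coding as the cell above: start at 0, then set 1 where white is higher rated.
toy["white_higher_rated"] = 0
toy.loc[toy.higher_rating == "white", "white_higher_rated"] = 1

# One-line alternative: the comparison gives booleans, astype(int) maps them to 1/0.
toy["white_higher_rated_alt"] = (toy["higher_rating"] == "white").astype(int)

print(toy)
```

Both versions give 0 for any value other than 'white', so a game with no higher-rated side would be coded the same as one where black is higher rated.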

In [11]:
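lg_mult = LogisticRegression()
# higher_rating_coded is not created in this notebook; it comes from games_new_vars.csv
predictors = games[['abs_diff_rating', 'higher_rating_coded', 'turns', 'white_higher_rated']]
response = games['higher_rating_won']
# Mean accuracy over the 5 cross-validation folds
score = cross_val_score(lg_mult, predictors, response, cv=5).mean()
score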

0.6353077143538222

We get an increase of a little under one percentage point (0.628 to 0.635) by adding in more variables. 

A few days later, I remembered we should probably try normalizing our variables before fitting the model. Why? Because turns and differences in rating are on different scales (for example, most games last around 60 turns, while most differences in rating may be in the hundreds). 
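
To make the scale issue concrete, here is a minimal sketch of one common way to put variables on the same scale: z-score standardization with scikit-learn's `StandardScaler`. This is an assumption about what "normalizing" should mean here, not the call used in this notebook, and the five values per variable are made up rather than taken from the data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on the two scales described above (not taken from the data set).
turns = np.array([40.0, 55.0, 60.0, 65.0, 80.0])
abs_diff_rating = np.array([50.0, 120.0, 200.0, 310.0, 520.0])

# Rows are games, columns are variables: shape (5, 2).
X = np.column_stack([turns, abs_diff_rating])

# StandardScaler rescales each column to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

print(X.std(axis=0))          # very different spreads before scaling
print(X_scaled.mean(axis=0))  # approximately 0 for both columns
print(X_scaled.std(axis=0))   # approximately 1 for both columns
```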

In [29]:
lg_mult = LogisticRegression()
predictors = preprocessing.normalize([games.abs_diff_rating.values, 
                         games.turns.values, 
                         games.higher_rating_coded, 
                         games.white_higher_rated]).reshape(20058, 4)
response = games['higher_rating_won']
score = cross_val_score(lg_mult, predictors, response, cv = 5).sum() / 5
score

0.6158141382384883

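After normalizing, I get a decrease of about two percentage points in the score (0.635 to 0.616). Treat this number with caution, though: the list passed to `preprocessing.normalize` has shape (4, 20058), so each variable is rescaled to unit length across all games, and `reshape(20058, 4)` does not transpose that array, so the four values in each row no longer belong to the same game. 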
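
The shape handling matters for reading the result above. The sketch below first shows, on a tiny array, how `reshape` differs from a transpose, and then shows one way to scale predictors inside cross validation with a `Pipeline`. The second part uses synthetic data and `StandardScaler`, both of which are assumptions for illustration: it is not the notebook's method, and its score says nothing about the real games.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two variables for three games, stacked one variable per row: shape (2, 3).
stacked = np.array([[10, 20, 30],   # variable A for games 0, 1, 2
                    [1, 2, 3]])     # variable B for games 0, 1, 2

print(stacked.reshape(3, 2))  # rows are [10 20], [30 1], [2 3]: games and variables get mixed
print(stacked.T)              # rows are [10 1], [20 2], [30 3]: one row per game

# Synthetic stand-in for the real predictors (hypothetical values).
rng = np.random.default_rng(1)
n_games = 1000
abs_diff_rating = rng.uniform(0, 600, n_games)
turns = rng.integers(10, 150, n_games)
X = np.column_stack([abs_diff_rating, turns])  # one row per game, one column per variable
y = (rng.uniform(size=n_games) < 1 / (1 + np.exp(-0.004 * abs_diff_rating))).astype(int)

# Putting the scaler in the pipeline means it is refit on each training fold only.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

With the real data, `np.column_stack` (or simply selecting the columns from `games`) would keep each game's values on one row without needing the hard-coded 20058.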