# Goal: Find drivers of upsets in chess games and create a model to predict upsets 

* An upset is defined as a player with a lower rating winning a game against a player with a higher rating
* The model should make a prediction after having 'interviewed' each player as to their intended opening as white and what defense they intend to use against a given opening as black

### Initial Thoughts

<br>

* Going into this project I am of two minds.

<br>

**First**
* Chess is a skill-based game with no random elements (except the assignment of the first move).
* Because of this, the more skilled player will win any game not decided by variation in player performance.
* It follows that a given game will be won by the more skilled player a large majority of the time.
* If this is true, conditions under which variation in performance is highest should produce the highest likelihood of an upset.

<br>

**Second**
* It may also be the case that more skilled players maintain consistency better than less skilled players under conditions that would increase variation in their performance.
* If this is true, those conditions may make upsets less likely, as the variance would have a greater effect on the less skilled player than on the more skilled player.

<br>

**Moving Forward**
* Though these two schools of thought may point to differing conclusions, both seem grounded in reason and I am eager to see what the data can tell us.

### Initial Hypotheses About Drivers

* There will be few instances of upsets, possibly leading to an imbalanced data set
* As ratings for both players increase, the likelihood of an upset will decrease
* As the margin between player ratings increases, the likelihood of an upset will decrease
* Shorter time increments will increase the likelihood of an upset
* Unrated games will have a higher likelihood of an upset than rated games
* Games where the higher rated player has the white pieces (gaining the first-move advantage) will have a decreased likelihood of an upset
* Some opening/defense strategies may be more or less prone to upsets
* Openings/defenses that are more popular or preferred by higher rated players may be more/less prone to upsets
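
The rating-margin hypothesis can be made concrete with the classic Elo expected-score formula. This is only an illustrative sketch: Lichess rates players with Glicko-2 rather than plain Elo, and the function name `elo_expected_score` is my own and not part of this project.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5) for player A under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Expected score of a 1500-rated player against progressively stronger opponents
for gap in (0, 100, 200, 400):
    print(gap, round(elo_expected_score(1500, 1500 + gap), 3))
```

Under this model the lower rated player's expected score shrinks steadily as the gap grows, which is the pattern the hypothesis predicts for upsets.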

# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import os

from sklearn.model_selection import train_test_split
import sklearn.preprocessing

import warnings
warnings.filterwarnings("ignore")

from scipy import stats
import re

import wrangle as w

# Acquire

* Data acquired from Kaggle at https://www.kaggle.com/datasnaek/chess
* It contained 20,058 rows and 9 columns before cleaning
* Each row represents a chess game played on Lichess.org
* Each column represents a feature of those games

# Prepare

**The data was very clean initially. I performed the following steps to ensure that it was ready for exploration:**
* Removed columns that did not contain useful information\* 
* Renamed columns to promote readability\*
* Checked for nulls in the data (there were none)
* Checked that column data types were appropriate
* Removed white space from values in object columns
* There were no rows lost during preparation
* Added target column 'upset' indicating whether the lower rated player won the game
* Added additional features to investigate (columns that could be calculated one row at a time)\*
* Split data into train, validate and test (approx. 60/20/20), stratifying on 'upset'
* Added additional features to investigate (columns that required an aggregate calculation by column)\*
* Aggregate calculations were performed on train data only
* The resulting calculations were then applied to create columns in train, validate, and test data

\* See data dictionary for full list of column names
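
The target and split steps above might look roughly like the sketch below. The real logic lives in the `wrangle` module, which is not shown here, so `add_upset`, `split_60_20_20`, the seed, and the rule that draws and equal ratings are not upsets are all assumptions made for illustration. The toy table only borrows the column names from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def add_upset(df: pd.DataFrame) -> pd.DataFrame:
    """Flag games won by the lower rated player.

    Assumed rule: draws and games between equally rated players are not upsets.
    """
    out = df.copy()
    white_lower = out["white_rating"] < out["black_rating"]
    black_lower = out["black_rating"] < out["white_rating"]
    out["upset"] = ((out["winning_pieces"] == "white") & white_lower) | (
        (out["winning_pieces"] == "black") & black_lower
    )
    return out


def split_60_20_20(df: pd.DataFrame, target: str = "upset", seed: int = 42):
    """Two-step stratified split: hold out 20% as test, then 25% of the rest as validate."""
    train_validate, test = train_test_split(
        df, test_size=0.20, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed, stratify=train_validate[target]
    )
    return train, validate, test


# Toy data standing in for the real games table
toy = pd.DataFrame({
    "white_rating": [1500, 1600, 1700, 1800] * 25,
    "black_rating": [1600, 1500, 1700, 1750] * 25,
    "winning_pieces": ["white", "white", "black", "draw"] * 25,
})
toy = add_upset(toy)
toy_train, toy_validate, toy_test = split_60_20_20(toy)
print(len(toy_train), len(toy_validate), len(toy_test))
```

Stratifying on `upset` keeps the share of upsets the same in all three splits, which matters if upsets turn out to be the minority class.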

In [2]:
# acquiring, cleaning, and adding pre-split features to data
df = w.wrangle_chess_data(reprep = True)

# Splitting data into train, validate, and test
train, validate, test = w.split_my_data(df)

# Adding post split features to data
train, validate, test = w.fe_post_split(train, validate, test)
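
As a rough illustration of what a post-split feature step could look like, the sketch below learns a per-opening average rating on train only and then maps it onto all three splits. This is not the actual `fe_post_split` code: the helper name `add_opening_ave_rating`, the use of `game_rating` as the value being averaged, and the fallback for openings unseen in train are all assumptions.

```python
import pandas as pd


def add_opening_ave_rating(train, validate, test):
    """Learn the mean game rating per opening on train only, then map it onto every split."""
    per_opening = train.groupby("opening_name")["game_rating"].mean()
    fallback = train["game_rating"].mean()  # assumed fill value for openings unseen in train
    results = []
    for part in (train, validate, test):
        part = part.copy()
        part["opening_ave_rating"] = part["opening_name"].map(per_opening).fillna(fallback)
        results.append(part)
    return tuple(results)


# Toy splits, not the real data
toy_train = pd.DataFrame({
    "opening_name": ["Sicilian Defense", "Sicilian Defense", "French Defense"],
    "game_rating": [1600.0, 1800.0, 1500.0],
})
toy_validate = pd.DataFrame({"opening_name": ["French Defense"], "game_rating": [1900.0]})
toy_test = pd.DataFrame({"opening_name": ["King's Pawn Game"], "game_rating": [1400.0]})

toy_train, toy_validate, toy_test = add_opening_ave_rating(toy_train, toy_validate, toy_test)
print(toy_validate["opening_ave_rating"].iloc[0], toy_test["opening_ave_rating"].iloc[0])
```

Computing the aggregate on train alone keeps information from validate and test out of the features the model learns from.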

In [3]:
train.columns

Index(['rated', 'turns', 'ended_as', 'winning_pieces', 'time_increment',
       'white_rating', 'black_rating', 'opening_name', 'upset', 'rating_dif',
       'game_rating', 'lower_rated_white', 'time_block', 'time_control_group',
       'opening_ave_rating', 'opening_popularity_total',
       'opening_popularity_1500', 'opening_popularity_2000'],
      dtype='object')

In [5]:
for column in train.columns:
    
    print()
    print(column)
    print('--------------')
    print(train[column].value_counts())
    


rated
--------------
True     9045
False    2187
Name: rated, dtype: int64

turns
--------------
39     174
51     173
53     170
45     170
43     169
      ... 
185      1
216      1
184      1
176      1
207      1
Name: turns, Length: 199, dtype: int64

ended_as
--------------
resign       6231
mate         3576
outoftime     932
draw          493
Name: ended_as, dtype: int64

winning_pieces
--------------
white    5626
black    5087
draw      519
Name: winning_pieces, dtype: int64

time_increment
--------------
10+0     4342
15+0      721
15+15     489
5+5       405
5+8       383
         ... 
30+25       1
6+15        1
2+20        1
150+3       1
0+40        1
Name: time_increment, Length: 350, dtype: int64

white_rating
--------------
1500    466
1559     26
1547     26
1670     26
1696     25
       ... 
2028      1
2174      1
2323      1
2387      1
2049      1
Name: white_rating, Length: 1408, dtype: int64

black_rating
--------------
1500    439
1400     38
1501     33
14

# Explore

## How often do upsets occur?

In [None]:
values = [len(train.upset[train.upset == True]), len(train.upset[train.upset == False])] 
labels = ['Upset', 'Non-Upset']
plt.pie(values, labels=labels, autopct='%.0f%%')
plt.title('Games Ending in Upsets Represent 1/3 of the Train Data')
plt.show()

**About 1/3 of games will end in an upset** <br>
This is much higher than I expected and may be due to the Lichess.org matching system pairing similarly rated players for matches.

## Does first-move advantage affect upsets?

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,10))
#fig.suptitle('Upset Percentage is 4% Higher in Games Where the Lower Rated Player has the First Move')

values = [len(train.upset[(train.lower_rated_white == True) & (train.upset == True)]),
          len(train.upset[(train.lower_rated_white == True) & (train.upset == False)])]
labels = ['Upset', 'Non-Upset']

ax1.pie(values, labels=labels, autopct='%.0f%%')
ax1.title.set_text('Lower Rated Player has First Move')

values = [len(train.upset[(train.lower_rated_white == False) & (train.upset == True)]),
          len(train.upset[(train.lower_rated_white == False) & (train.upset == False)])]
labels = ['Upset', 'Non-Upset'] 

ax2.pie(values, labels=labels, autopct='%.0f%%')
ax2.title.set_text('Higher Rated Player has First Move')

plt.show()

**Upset percentage is 4% higher in games where the lower rated player has the first move.** <br>
This is lower than I expected. I will now use a chi-square test to investigate whether this pattern is likely to hold for the entire population of chess games.



**Ho: "Games ending in Upset" and "The lower ranked player having the first move" are independent of one another.** 

**Ha: "Games ending in Upset" and "The lower ranked player having the first move" are dependent on one another.**

**I will be using a confidence level of 95%, resulting in an alpha of .05.**

In [None]:
observed = pd.crosstab(train.lower_rated_white, train.upset)

chi2, p, degf, expected = stats.chi2_contingency(observed)

print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

**Because our p-value is less than our alpha, we reject the null hypothesis and conclude that "Games ending in Upset" and "The lower ranked player having the first move" are likely dependent on one another. This suggests that the difference in upsets based on having the first move that we saw in the train data reflects a real pattern in the total population of chess games, although the test itself does not guarantee that the gap will be exactly 4%. For this reason I believe that the lower rated player having the first move is a driver of upsets and would be a good feature to model on.**
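
A chi-square test speaks to whether an association exists, not to how large it is. One hedged way to put bounds on the size of the gap is a normal-approximation (Wald) confidence interval for the difference between two proportions. The helper `diff_in_proportions_ci` is my own, and the counts below are made up for illustration, not taken from the chess data.

```python
import math


def diff_in_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Wald (normal approximation) confidence interval for p1 - p2.

    x1, x2 are success counts and n1, n2 are group sizes; z=1.96 gives roughly 95%.
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Made-up counts: upsets out of games, split by who had the first move
diff, low, high = diff_in_proportions_ci(x1=350, n1=1000, x2=310, n2=1000)
print(f"difference = {diff:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With the real counts from `pd.crosstab(train.lower_rated_white, train.upset)`, an interval like this would show the range of gaps that are plausible given the sample.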

## Does a game being rated affect upsets?

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,10))

values = [len(train.upset[(train.rated == True) & (train.upset == True)]),
          len(train.upset[(train.rated == True) & (train.upset == False)])]
labels = ['Upset', 'Non-Upset']

ax1.pie(values, labels=labels, autopct='%.0f%%')
ax1.title.set_text('Game is Rated')

values = [len(train.upset[(train.rated == False) & (train.upset == True)]),
          len(train.upset[(train.rated == False) & (train.upset == False)])]
labels = ['Upset', 'Non-Upset'] 

ax2.pie(values, labels=labels, autopct='%.0f%%')
ax2.title.set_text('Game is not Rated')

plt.show()

In [None]:
observed = pd.crosstab(train.rated, train.upset)

chi2, p, degf, expected = stats.chi2_contingency(observed)

print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

### 1) Do upsets and non-upsets have a significantly different number of moves?

In [None]:
print(f"The mean number of moves for upsets is {train[train.upset == True].turns.mean()}.")

In [None]:
print(f"The mean number of moves for a non-upset is {train[train.upset == False].turns.mean()}.")

**Given the small difference in means for these two groups, it is unlikely that the two have a statistically significant difference. However, the data meets all of the conditions for a t-test, so let's see what the results are.**

**Ho: The mean number of moves in games that are upsets is not significantly different from the mean number of moves in games that are not upsets.**

**Ha: The mean number of moves in games that are upsets is significantly different from the mean number of moves in games that are not upsets.**

**I will be using a confidence level of 95%, resulting in an alpha of .05.**

In [None]:
stats.ttest_ind(train[train.upset == True].turns, train[train.upset == False].turns)

**Because the t-test resulted in a p-value that was below the alpha, we have reason to believe that the mean number of moves in games that are upsets is significantly different from the mean number of moves in games that are not upsets, even though the difference in those means is only about 2 moves.**
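
With thousands of games in each group, even a small difference in means can produce a small p-value, so an effect size is a useful companion to the t-test. The sketch below computes Cohen's d alongside Welch's t-test (`equal_var=False`, which does not assume equal variances). The helper `cohens_d` is my own, and the two samples are synthetic stand-ins, not the actual `turns` data.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Synthetic samples with means about 2 moves apart (illustration only)
rng = np.random.default_rng(0)
upset_turns = rng.normal(loc=62, scale=33, size=4000)
non_upset_turns = rng.normal(loc=60, scale=33, size=8000)

t_stat, p_value = stats.ttest_ind(upset_turns, non_upset_turns, equal_var=False)
d = cohens_d(upset_turns, non_upset_turns)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.3f}")
```

A d close to zero alongside a small p-value would suggest the difference is real but too small to be of much practical use as a predictor.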

### 2) Does the lower rated player moving the white pieces affect the likelihood of an upset?

In [None]:
train[train.lower_rated_white == True].upset.mean()

In [None]:
train[train.lower_rated_white == False].upset.mean()

In [None]:
train.upset.mean()

### Examine Object Variables

In [None]:
list(train.columns)

In [None]:
# distribution of the data
columns = ['rated',
           'ended_as',
           'winning_pieces',
           'upset',
           'lower_rated_white',
           'time_block',
          ]

for col in columns:
    
    df[col].value_counts().plot(kind='bar', title = f"{col} distribution")
    
    plt.show()

### Takeaways

* Resignations usually happen when mate is inevitable, so I see no reason to separate the two
* I wonder if running out of time has an effect on upsets?
* White does have an advantage, though it is much smaller than I thought it would be, with about 10% more wins than black
* time_code, opening_code, and opening_name have too many values to sort through at the moment and will have to be binned or pruned
* Upsets represent about 1/3 of the data, which is higher than I thought it would be

### Examine Quantitative Variables

In [None]:
# distribution of the data
cols = ['turns', 'white_rating', 'black_rating']

for col in cols:
    plt.hist(df[col])
    plt.title(col + ' distribution')
    plt.show()

### Takeaways

* Turns is slightly right skewed 
* Black and white rating distributions are roughly normal and are nearly, if not entirely, identical

In [None]:
df.time_code.value_counts()

### I'm going to try to prune the object columns by removing the values that do not have significant representation. I am setting my trial cutoff point at 50 or more occurrences.
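
A minimal sketch of that pruning step, assuming "removing the values" means dropping the rows that hold rare values. The helper `prune_rare_values` is hypothetical and the toy frame is not the real data.

```python
import pandas as pd


def prune_rare_values(df: pd.DataFrame, column: str, min_count: int = 50) -> pd.DataFrame:
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)].copy()


# Toy column: "A" appears 60 times, "B" 50 times, "C" only 5 times
toy = pd.DataFrame({"opening_name": ["A"] * 60 + ["B"] * 50 + ["C"] * 5})
pruned = prune_rare_values(toy, "opening_name", min_count=50)
print(pruned["opening_name"].value_counts())
```

An alternative that loses no rows would be to relabel the rare values as a single 'Other' category instead of dropping them.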

In [None]:
df.to_csv('games_preped.csv')
df = pd.read_csv('games_preped.csv')

In [None]:
df.drop(columns=['Unnamed: 0'], inplace = True)

In [None]:
df.head()

In [None]:
# distribution of the data
columns = ['ended_as', 'winning_pieces', 
           'time_code', 'opening_code', 
           'opening_name', 'upset']

for col in columns:
    
    df[col].value_counts().plot(kind='bar', title = f"{col} distribution")
    
    plt.show()