# Goal: 
* Discover drivers of upsets in chess games played on Lichess.org
* Use those drivers to develop a machine learning model that predicts whether a game will end in an upset

# Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import os

from sklearn.model_selection import train_test_split
import sklearn.preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")

from scipy import stats
import re

import wrangle as w
import explore as e
import modeling as m

# Acquire

* Data acquired from [Kaggle](https://www.kaggle.com/datasnaek/chess)
* It contained 20,058 rows and 9 columns before cleaning
* Each row represents a chess game played on Lichess.org
* Each column represents a feature of those games

# Prepare

**Prepare Steps:**
* Removed columns that did not contain useful information
* Renamed columns to promote readability
* Checked for nulls in the data (there were none)
* Checked that column data types were appropriate
* Removed white space from values in object columns
* Added target column 'upset' indicating whether the lower rated player won the game
* Added additional features to investigate:
    * rating_difference
    * game_rating
    * lower_rated_white
    * time_control_group
* Encoded 'Time Control Group' and 'Opening Name' as binary features, one column per category
* Split data into train, validate and test (approx. 60/25/15), stratifying on 'upset'
* Outliers have not been removed for this iteration of the project


* Split train, validate, and test into X and y dataframes
* Scaled continuous variables
* Encoded categorical variables
* Converted 'Upset' to a categorical variable with values upset or non-upset
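
The feature-engineering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `wrangle.py`: the helper names `time_control_group` and `add_features` are hypothetical, the time thresholds are taken from the data dictionary, and treating draws and equal-rating games as non-upsets is an assumption.

```python
import pandas as pd


def time_control_group(increment_code: str) -> str:
    """Map a code such as '10+5' (base minutes + increment seconds) to a group."""
    base_minutes = int(increment_code.split("+")[0])
    if base_minutes >= 60:
        return "Standard"
    if 15 <= base_minutes <= 30:
        return "Rapid"
    if 3 <= base_minutes <= 5:
        return "Blitz"
    if base_minutes <= 2:
        return "Bullet"
    return "Other"


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns to a copy of the cleaned games frame."""
    out = df.copy()
    out["rating_difference"] = (out["white_rating"] - out["black_rating"]).abs()
    out["game_rating"] = (out["white_rating"] + out["black_rating"]) / 2
    out["lower_rated_white"] = out["white_rating"] < out["black_rating"]
    # Upset: the lower rated player won. Draws and equal ratings are
    # counted as non-upsets here (an assumption of this sketch).
    white_upset = out["lower_rated_white"] & (out["winning_pieces"] == "white")
    black_upset = (out["black_rating"] < out["white_rating"]) & (
        out["winning_pieces"] == "black"
    )
    out["upset"] = white_upset | black_upset
    out["time_control_group"] = out["increment_code"].apply(time_control_group)
    return out


# Tiny made-up sample, only to exercise the function.
games = pd.DataFrame({
    "white_rating": [1500, 1800, 1200, 1600],
    "black_rating": [1700, 1600, 1250, 1600],
    "winning_pieces": ["white", "white", "draw", "black"],
    "increment_code": ["10+0", "60+0", "1+0", "15+15"],
})
features = add_features(games)
print(features[["rating_difference", "game_rating", "lower_rated_white",
                "upset", "time_control_group"]])
```

Only the first sample game is an upset: white is rated 200 points lower and wins.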

# Data Dictionary

| Feature | Definition |
|:--------|:-----------|
|Rated| True or False, The game's result is reflected in each player's rating|
|Winning Pieces| The color of pieces the winning player was moving|
|White Rating| Rating of the player moving the white pieces using the Glicko-2 rating method for games played on Lichess|
|Black Rating| Rating of the player moving the black pieces using the Glicko-2 rating method for games played on Lichess|
|Rating Difference| The difference in rating between the players in the game|
|Game Rating| The average rating of the two players in the game|
|Lower Rated White| True or False, The lower rated player is moving the white pieces|
|Opening Name| The name of the opening played in the game|
|Time Control Group| The amount of time allotted to each player to make their moves: **Standard** (60 min or more), **Rapid** (15 - 30 min), **Blitz** (3 - 5 min), **Bullet** (2 min or less), or **Other** (any other time limit)|
|Upset (Target)| True or False, The lower rated player won the game|
|Additional Features|Encoded values for 'Time Control Group' and 'Opening Name'|

In [3]:
# acquiring, cleaning, and adding features to data
df = w.wrangle_chess_data(reprep = True)

# splitting data into train, validate, and test
train, validate, test = w.split_my_data(df)

# adding scaled columns of continuous features
train, validate, test = w.scale_data(train, validate, test)

train.rating_difference_scaled.value_counts()

0.000000    55
0.010724    50
0.006032    48
0.014745    45
0.002681    44
            ..
0.469839     1
0.426944     1
0.331769     1
0.290885     1
0.297587     1
Name: rating_difference_scaled, Length: 721, dtype: int64
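
A stratified 60/25/15 split like the one described in the prepare steps could look like the sketch below. It is not the body of `w.split_my_data`, which is not shown; the function name `split_60_25_15`, the seed, and the synthetic frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_25_15(df, target="upset", seed=42):
    """Split into train/validate/test (60/25/15), stratifying on the target."""
    n_test = round(0.15 * len(df))
    n_validate = round(0.25 * len(df))
    # Carve off the test set first, keeping the upset ratio intact.
    train_validate, test = train_test_split(
        df, test_size=n_test, stratify=df[target], random_state=seed
    )
    # Then carve validate out of what remains, again stratified.
    train, validate = train_test_split(
        train_validate,
        test_size=n_validate,
        stratify=train_validate[target],
        random_state=seed,
    )
    return train, validate, test


# Synthetic stand-in for the games data (not the real data set).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "rating_difference": rng.integers(0, 500, size=1000),
    "upset": rng.random(1000) < 0.33,
})
train_demo, validate_demo, test_demo = split_60_25_15(demo)
print(len(train_demo), len(validate_demo), len(test_demo))
```

Passing integer sizes rather than fractions keeps the three partitions at exact row counts, and stratifying both splits keeps the upset rate nearly identical across them.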

In [4]:
train.columns

Index(['time_control_group', 'rated', 'winning_pieces', 'increment_code',
       'white_rating', 'black_rating', 'opening_name', 'upset',
       'rating_difference', 'game_rating', 'lower_rated_white',
       'time_control_group_Blitz', 'time_control_group_Bullet',
       'time_control_group_Other', 'time_control_group_Rapid',
       'time_control_group_Standard', 'rating_difference_scaled',
       'game_rating_scaled'],
      dtype='object')
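
The `time_control_group_*` columns listed above have the shape that one-hot encoding produces. A minimal sketch using `pandas.get_dummies` follows; whether `wrangle.py` uses this exact call is an assumption, and the sample values are made up.

```python
import pandas as pd

# Made-up sample of time control groups.
demo = pd.DataFrame({
    "time_control_group": ["Blitz", "Rapid", "Standard", "Bullet", "Other", "Blitz"],
})

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(demo["time_control_group"], prefix="time_control_group")
encoded = pd.concat([demo, dummies], axis=1)
print(encoded.columns.tolist())
```

Each row has exactly one indicator column switched on, so the five new columns together carry the same information as the original categorical column.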

In [8]:
train[['rating_difference','rating_difference_scaled']]

Unnamed: 0,rating_difference,rating_difference_scaled
3596,11,0.098525
5159,238,0.047587
14725,371,
13305,4,
14307,345,
...,...,...
5771,213,0.166890
7838,136,0.274799
8204,435,0.030161
14144,44,
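
In the output above, several scaled values are blank and the others do not rise with the raw rating difference (11 maps to 0.098525 while 238 maps to 0.047587). That pattern is consistent with scaled values being attached by index label rather than by position after the rows were shuffled in the split. The implementation of `w.scale_data` is not shown, so this is only a hypothesis; the sketch below assumes min-max scaling and uses made-up frames to reproduce the symptom and show one way to avoid it.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A few rows with a shuffled, non-contiguous index, as after a random split.
train_demo = pd.DataFrame(
    {"rating_difference": [11, 238, 371, 4]},
    index=[3596, 5159, 14725, 13305],
)

scaler = MinMaxScaler().fit(train_demo[["rating_difference"]])
scaled = scaler.transform(train_demo[["rating_difference"]])  # NumPy array

# Pitfall: a new DataFrame built from the array gets a fresh 0..n-1 index,
# so assigning its column back aligns on labels that do not match.
misaligned = pd.DataFrame(scaled, columns=["rating_difference_scaled"])
bad = train_demo.copy()
bad["rating_difference_scaled"] = misaligned["rating_difference_scaled"]

# Safer: assign the 1-D array directly, which is positional.
good = train_demo.copy()
good["rating_difference_scaled"] = scaled[:, 0]

print(bad)
print(good)
```

In this toy case no labels overlap, so every value in `bad` is missing. With a larger frame some labels would overlap by chance, which would give a mix of missing values and values belonging to other rows.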


# Explore

## How often do upsets occur?

In [None]:
# get pie chart upsets
e.get_pie_upsets(train)

* About 1/3 of the games in the training data will end in upset

## Does first-move advantage affect upsets?

In [None]:
# get pie chart lower rated white
e.get_pies_white(train)

* Upset percentage is 4% higher in games where the lower rated player makes the first move.

**I will now use a chi-square test to investigate whether "Upset" and "Lower Rated White" are related**
* Traditionally the player moving the white pieces moves first
* I will use a confidence level of 95%
* The resulting alpha is .05

**Ho: "Upset" and "Lower Rated White" are independent of one another.**<br>
**Ha: "Upset" and "Lower Rated White" are related.**

In [None]:
# get chi-square test
e.get_chi_white(train)

**The p-value is less than the alpha. Therefore, we have evidence to support that "Upset" and "Lower Rated White" are related. Based on this, and the 4% difference in upsets observed in the train data, I believe that using the "Lower Rated White" feature in modeling will likely have a small positive impact on the model's accuracy.**
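
A chi-square test of independence like the one above can be sketched with `pandas.crosstab` and `scipy.stats.chi2_contingency`. The function name `chi_square_independence` and the synthetic data are assumptions for illustration; `e.get_chi_white` may be implemented differently.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi_square_independence(df, feature, target="upset", alpha=0.05):
    """Chi-square test of independence between two categorical columns."""
    observed = pd.crosstab(df[feature], df[target])
    chi2, p_value, dof, _expected = stats.chi2_contingency(observed)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        # Reject Ho (independence) only when p is below alpha.
        "reject_null": bool(p_value < alpha),
    }


# Synthetic stand-in for the training data (not the real games).
rng = np.random.default_rng(1)
n_games = 5000
lower_rated_white = rng.random(n_games) < 0.5
upset_probability = np.where(lower_rated_white, 0.35, 0.31)
demo = pd.DataFrame({
    "lower_rated_white": lower_rated_white,
    "upset": rng.random(n_games) < upset_probability,
})

result = chi_square_independence(demo, "lower_rated_white")
print(result)
```

The decision rule is the part that matters: the null hypothesis of independence is rejected only when the p-value falls below alpha.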

## Does a game being rated affect upsets?

In [None]:
# get pie charts
e.get_pie_rated(train)

* Upset percentage is 3% higher in games that are rated

**I will now use a chi-square test to investigate whether "Upset" and "Rated" are related.**
* I will use a confidence level of 95%
* The resulting alpha is .05

**Ho: "Rated" and "Upset" are independent of one another.** <br>
**Ha: "Rated" and "Upset" are related.**

In [None]:
# get chi-square results
e.get_chi_rated(train)

**The p-value is less than the alpha. Therefore, we have evidence to support that "Upset" and "Rated" are related. Based on this and the 3% difference in upsets, observed in the train data, I believe that using the "Rated" feature in modeling will likely have a small positive impact on the model's accuracy.**

## Does player rating have an effect on upsets?
I will examine two subquestions to answer this question.

### 1) Does game rating (The average rating of both players in a game) have an effect on upsets? 

In [None]:
# get bar chart
e.get_game_rating(train)

* The average game rating for upsets is very similar to the average game rating for non-upsets

**Because the average game rating for games that end in upsets is very similar to the average game rating of games that do not end in upsets, it is not likely that "Game Rating" will be a useful feature to model on.**

### 2) Does difference in player rating have an effect on upsets?

In [None]:
# get bar chart
e.ave_diff_rating(train)

* The average difference in player rating is 82 points lower in games ending in upset.

**I will now do a t-test to test for a significant difference between the mean difference in player rating of games ending in upset and the mean difference in player rating of games ending in non-upset.**

* I will use a confidence level of 95%
* The resulting alpha is .05

**Ho: The mean difference in player rating of games ending in upset is not significantly different from the mean difference in player rating of games not ending in an upset.** <br>
**Ha: The mean difference in player rating of games ending in upset is significantly different from the mean difference in player rating of games not ending in an upset.**

In [None]:
# get t-test results
e.get_t_rating_diff(train)

**The p-value is less than the alpha. Therefore, we have evidence to support that the mean rating difference of players in games ending in upset is significantly different from the mean rating difference of players in games that end in non-upsets. Based on this, and the 82 point difference in means observed in the train data, I believe that using "Rating Difference" during modeling will provide a moderate improvement in the model's accuracy.**
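
A two-sample t-test comparing mean rating difference between upsets and non-upsets might look like the sketch below. It uses Welch's variant (`equal_var=False`), which is an assumption; the settings inside `e.get_t_rating_diff` are not shown, and the data here is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in: upsets drawn with a smaller typical rating gap.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "upset": [True] * 1000 + [False] * 2000,
    "rating_difference": np.concatenate([
        rng.exponential(scale=120, size=1000),
        rng.exponential(scale=200, size=2000),
    ]),
})

upset_diff = demo.loc[demo["upset"], "rating_difference"]
non_upset_diff = demo.loc[~demo["upset"], "rating_difference"]

# Welch's t-test does not assume the two groups share a variance.
t_stat, p_value = stats.ttest_ind(upset_diff, non_upset_diff, equal_var=False)

alpha = 0.05
print(f"mean gap (upset - non-upset): {upset_diff.mean() - non_upset_diff.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject Ho: {p_value < alpha}")
```

A negative t statistic here means the upset group has the smaller mean rating difference, which matches the direction reported above.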

## Does time control group affect upsets?

In [None]:
# Get pie charts
e.get_pie_time(train)

* In time control groups where time is very limited, such as Bullet, Blitz, and Rapid games, upset percentage ranges from 30 to 34%.
* In Standard, where time is more plentiful, upsets drop to 22%.

**I will now perform a chi-square test to determine if "Upset" and "Time Control Group" are independent.**
* I will use a confidence level of 95%
* The resulting alpha is .05

**Ho: "Upset" and "Time Control Group" are independent of one another.** <br>
**Ha: "Upset" and "Time Control Group" are related.**

In [None]:
# get chi-square test
e.get_chi_time(train)

**The p-value is less than the alpha. Therefore, we have evidence to support that "Time Control Group" and "Upset" are related. Based on this, and the differences in upset percentages among the different time groups, I believe that whether a game uses a Standard time control is a driver of upsets. Adding an encoded version of this feature to the model will likely have a moderate positive effect on the model's accuracy.**

## Does opening affect upsets?

* There are 1236 unique openings identified in the training data
* This is too many for a thorough examination of each
* I will examine the top ten, by popularity

In [None]:
# get upset distributions of top ten most popular openings
e.get_pie_open(train)

* Percentage of upsets ranges from 20% to 37% among the top openings

**I will now run a chi-square test to see if "Opening Name" and "Upset" are dependent on one another.**
* I will use a confidence level of 95%
* The resulting alpha is .05

**Ho: "Opening Name" and "Upset" are independent of one another.** <br>
**Ha: "Opening Name" and "Upset" are dependent on one another.**

In [None]:
# get chi-square results
e.get_chi_open(train)

**The p-value is less than the alpha. Therefore, we have evidence to support that "Opening Name" and "Upset" are related. However, there are 1236 unique openings in opening_name. Adding that number of encoded columns to the model would likely do more harm than good.**

# Exploration Summary

* "Lower Rated White" and "Rated" were each found to be drivers of "Upset"
    * Though the amount of influence each has is likely to be weak
* "Rating Difference" was found to be a driver of "Upset"
* "Time Control Group" was found to be a driver of "Upset"
    * Being in the Standard time control or not seemed to have a particularly strong influence.
* "Opening Name" was found to be a driver of "Upset"
    * Upset percentage ranged from 20-37%
    * Encoding all of these openings as features would add more noise than signal to the model
    * It may be possible to create groups of similar openings in order to make a more reasonable number of features

* "Game Rating" was not found to be a driver of upsets

# Features I am moving to modeling with
* "Lower Rated White" (small difference in upset percentage, but relationship to upsets is statistically significant)
* "Rated" (small difference in upset percentage, but relationship to upsets is statistically significant)
* "Time Control Standard" (moderate difference in upset percentage, and dependence is statistically significant)
* "Rating Difference" (large difference in rating observed, and difference is significant)

# Features I'm not moving to modeling with
* "Opening" (Although found to be a driver of "Upset", the encoding process would result in more noise than signal at this time)
* "Game Rating" (There is no evidence that "Game Rating" is a driver of upsets)

# Modeling
* I will use accuracy as my evaluation metric
* Non-upsets make up 67% of the data
* By guessing non-upset for every game, one could achieve an accuracy of 67%
* 67% will be the baseline accuracy I use for this project
* I will be evaluating models developed using four different model types and various hyperparameter configurations
* Models will be evaluated on train and validate data
* The model that performs the best will then be evaluated on test data 
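
The baseline described above is the accuracy of always predicting the majority class. A minimal sketch follows, using a made-up target with the same 67/33 split rather than the real training labels.

```python
import pandas as pd

# Made-up target: 67 non-upsets and 33 upsets.
y_demo = pd.Series([False] * 67 + [True] * 33, name="upset")

# Always predict the most common class and measure how often that is right.
majority_class = y_demo.mode()[0]
baseline_accuracy = (y_demo == majority_class).mean()

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any model worth keeping has to beat this number on data it was not trained on.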

In [None]:
# prep data for modeling
train_X, validate_X, test_X, train_y, validate_y, test_y = m.model_prep(train,validate,test)

## Decision Tree

In [None]:
# get decision tree results
m.get_tree(train_X, validate_X, train_y, validate_y)

* Decision Tree accuracy is about equal to the baseline

## Random Forest

In [None]:
# get random forest results
m.get_forest(train_X, validate_X, train_y, validate_y)

* Random Forest accuracy is about equal to the baseline

## Logistic Regression

In [None]:
# get logistic regression results
m.get_reg(train_X, validate_X, train_y, validate_y)

* Logistic regression accuracy is better than baseline on train, and worse than baseline on validate
* It is likely over-fit


## KNN

In [None]:
# get knn results
m.get_knn(train_X, validate_X, train_y, validate_y)

* KNN accuracy is better than baseline on train, and worse than baseline on validate
* It is likely over-fit

# Comparing Models

* All models perform at or below baseline
* The two models that perform at baseline are Decision Tree and Random Forest
* Because both are within rounding error of one another in terms of accuracy, I will proceed with the model that requires the least amount of processing to run
* I will proceed to test with a Decision Tree model
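
The comparison above amounts to fitting each model type on train, scoring it on both train and validate, and reading the gap between the two as a sign of overfitting. The sketch below shows that loop on synthetic data. The hyperparameters and feature values are illustrative assumptions, not the configurations used in `modeling.py`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic features named after the ones selected for modeling.
rng = np.random.default_rng(3)
n_games = 2000
X = pd.DataFrame({
    "rating_difference_scaled": rng.random(n_games),
    "lower_rated_white": rng.integers(0, 2, n_games),
    "rated": rng.integers(0, 2, n_games),
    "time_control_group_Standard": rng.integers(0, 2, n_games),
})
# Synthetic target: upsets become less likely as the rating gap grows.
y = pd.Series(rng.random(n_games) < (0.45 - 0.3 * X["rating_difference_scaled"]))

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3
)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=3),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=3, random_state=3),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Majority-class baseline computed from the training labels.
baseline = max(y_train.mean(), 1 - y_train.mean())

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        "model": name,
        "train_accuracy": model.score(X_train, y_train),
        "validate_accuracy": model.score(X_val, y_val),
    })

results = pd.DataFrame(rows)
# A large positive gap suggests the model memorized the training data.
results["gap"] = results["train_accuracy"] - results["validate_accuracy"]
print(f"baseline accuracy: {baseline:.3f}")
print(results.round(3))
```

Selecting on validate accuracy rather than train accuracy is what keeps an overfit model from being chosen.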

# Decision Tree on Test

In [None]:
# get test results for final model
m.get_tree_test(train_X, test_X, train_y, test_y)

### Modeling Summary

* Decision Tree and Random Forest models had an accuracy about equal to the baseline
* Logistic Regression and KNN models outperformed the baseline accuracy on train data, but they were overfit, leading to worse-than-baseline accuracy on validate data
* A Decision Tree was selected as the final model and had an accuracy of 67%, which is about equal to the baseline accuracy

# Conclusions

### Exploration

* Upsets in chess occur in about 1/3 of games
* Games in which the lower rated player makes the first move, and games that are rated have a slightly higher chance of ending in an upset 
* Games ending in upset have a much lower mean difference in player rating than games not ending in upset
* Games using shorter time control, such as Bullet, Blitz, and Rapid games, have an upset percentage that closely mirrors the overall upset percentage ranging from 30-34% while standard games have a much lower upset percentage at 22%
* Looking at the top 10 openings in terms of frequency in the data set, we can conclude that a given opening does affect the likelihood of a game ending in an upset. Upset percentages vary by opening from 20-37%
* The average rating of players in a game has no provable effect on the chance of that game ending in upset

### Modeling

**All of the models failed to outperform the baseline. Possible reasons and solutions include:**

* “Rated” and “lower rated white” each only accounted for a small difference in the percentage of upsets

* Time Control Group had one outlier in terms of percentage of upsets, which was Standard. The values of each of the other time control groups were much closer together. Perhaps modeling only on this time control would remove noise from the model.

* While “Opening Name” seemed to be a significant driver of upsets, it contains 1200+ values that, once encoded, would add an overwhelming number of features to the model. The additional noise this would add may have done more harm than good to the model.

* Finally the values in “Opening Name” seem to include a major opening name along with variations of those openings. Creating clusters of all the variations of each opening may result in a more manageable number of features.   

**Should I have occasion to revisit this project I would like to try the following:**

* Cluster together opening variants in "Opening Name" to reduce the number of features input into the model
* Run the models without "Opening Name" to see whether there is any improvement from just removing the additional noise.
* Look for other ways to describe "Opening Name", such as by popularity of the opening or average rating of players playing that opening
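
One way the opening-clustering idea could be approached is to collapse each variation into its parent opening and lump rare families together. The sketch below assumes variation names follow the parent name after a `:`, `|`, or `#` separator, which has not been verified against the full data set. The helper `opening_family`, the sample strings, and the `min_count` threshold are all hypothetical.

```python
import re

import pandas as pd


def opening_family(name: str) -> str:
    """Keep only the text before the first ':', '|' or '#' separator."""
    return re.split(r"[:|#]", name, maxsplit=1)[0].strip()


# Illustrative opening names, not rows from the data set.
openings = pd.Series([
    "Sicilian Defense: Najdorf Variation",
    "Sicilian Defense: Dragon Variation",
    "French Defense: Advance Variation",
    "French Defense | Exchange Variation",
    "Italian Game",
    "Queen's Pawn Game: Mason Attack",
])

families = openings.map(opening_family)

# Group families seen fewer than min_count times into a single bucket.
min_count = 2
counts = families.value_counts()
common = counts[counts >= min_count].index
grouped = families.where(families.isin(common), "Other")

print(grouped.value_counts())
```

Applied to the training data, the number of distinct values in `grouped` would show whether this reduces the 1236 openings to a number of features small enough to encode.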