# Predicting the Winner of Races with XGBoost

## 1. Introduction
XGBoost is an open-source software library for predictive modelling, created by Tianqi Chen in 2014. The name stands for "Extreme Gradient Boosting": it implements gradient boosting as proposed by Friedman in "Greedy Function Approximation: A Gradient Boosting Machine", while "Extreme" refers to the engineering goal of pushing the available computational resources to their limits in order to achieve high accuracy, computational efficiency and scalability. What started off as a terminal application for a research project has become a scalable end-to-end tree boosting system with integrations for scikit-learn in Python, the caret package in R, and big data frameworks like Apache Spark and Hadoop. XGBoost gained much of its popularity and attention in the community through the Higgs boson machine learning challenge and has since been used in many winning Kaggle solutions; its authors report that 17 of the 29 winning solutions published on Kaggle's blog in 2015 used it.

Tree-based algorithms are widely used in supervised machine learning. XGBoost is an implementation of gradient boosting that represents a new generation of GBM algorithms, with tweaks to traditional tree boosting and major system improvements. It has become the go-to algorithm for many machine learning engineers and practitioners in the data science community.
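
To make the idea of gradient boosting concrete, here is a minimal sketch for squared-error regression on a single feature. It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradient information and many system optimizations); the helper names `fit_stump`, `gradient_boost` and `predict` are made up for this illustration. Each round fits a one-split tree ("stump") to the current residuals and adds a damped version of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit an additive ensemble of stumps; for squared error the negative
    gradient is simply the residual y - prediction."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data: a step function that a single stump can already describe.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 5, 5, 5]
model = gradient_boost(xs, ys)
print(round(predict(model, 2), 2), round(predict(model, 5), 2))
```

The learning rate shrinks each stump's contribution, so the ensemble approaches the targets gradually over several rounds instead of fitting them in one step.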

Let's see how well XGBoost performs on the given race game data set.

In [1]:
import pandas as pd
import numpy as np

from data.transformation import compute_elo, extract_player_info
from datetime import datetime, timedelta

races = pd.read_csv("data/cleaned_data/races_cleaned.csv")
races.dropna(inplace=True)
# pd.to_datetime works across pandas versions; astype("datetime64") without a unit fails in pandas 2.x
races['race_driven'] = pd.to_datetime(races['race_driven'])
races.head()

Unnamed: 0,id,race_created,race_driven,track_id,challenger,opponent,money,fuel_consumption,winner,status,forecast,weather
0,1,2012-03-06,2012-03-06 00:00:00,12,5,2,30,0.63,5,finished,"a:4:{s:5:""sunny"";i:10;s:5:""rainy"";i:70;s:8:""th...",rainy
1,2,2012-03-06,2012-03-06 00:03:00,12,5,4,30,0.63,4,finished,"a:4:{s:5:""sunny"";i:70;s:5:""rainy"";i:15;s:8:""th...",sunny
3,4,2012-03-06,2012-03-06 00:06:00,12,5,4,30,0.63,5,finished,"a:4:{s:5:""sunny"";i:25;s:5:""rainy"";i:75;s:8:""th...",sunny
5,6,2012-03-06,2012-03-06 00:17:00,12,5,10,100,0.63,5,finished,"a:4:{s:5:""sunny"";i:30;s:5:""rainy"";i:20;s:8:""th...",snowy
8,9,2012-03-06,2012-03-06 00:08:00,3,10,4,30,0.63,4,finished,"a:4:{s:5:""sunny"";i:45;s:5:""rainy"";i:40;s:8:""th...",sunny


## 2. Feature Engineering
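We add ELO, an estimate of player strength, as an essential feature of our data. For each race we only consider the ELO of the previous day, because the rating stored for the race day itself already reflects that day's results.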
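
For readers unfamiliar with ELO, the following is a minimal sketch of the standard Elo update for a single race between two players. The notebook's own `compute_elo` helper is not shown here and may differ; the function names and the K-factor of 32 below are assumptions made for illustration only.

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=32):
    """Return both players' new ratings after one race.

    k=32 is an assumed K-factor, not a value taken from this data set.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


print(update_elo(1500, 1500, a_won=True))  # (1516.0, 1484.0)
```

Two equally rated players each have an expected score of 0.5, so the winner gains half the K-factor and the loser gives up the same amount.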

In [2]:

elo = pd.concat([pd.read_csv(f"data/processed_data/elo_part{i}.csv") for i in [1,2,3]])
elo['date'] = pd.to_datetime(elo['date'])
elo.head()

Unnamed: 0,date,player_0,player_1,player_2,player_3,player_4,player_5,player_6,player_7,player_8,...,player_14635,player_14638,player_14639,player_14641,player_14644,player_14652,player_14654,player_14656,player_14664,player_14669
0,2012-03-05,1500.0,1500.0,1500.0,1500.0,1500.0,1500,1500.0,1500.0,1500.0,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0
1,2012-03-06,1463.529574,1495.0,1600.190991,1453.646091,1640.841192,1500,1500.0,1637.462868,1503.095535,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0
2,2012-03-07,1463.529574,1495.0,1696.786584,1453.646091,1644.143286,1500,1313.14025,1730.475883,1475.660032,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0
3,2012-03-08,1459.143171,1495.0,1755.439164,1453.646091,1635.076165,1500,1302.068114,1746.221834,1452.378189,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0
4,2012-03-09,1459.143171,1495.0,1793.493717,1453.646091,1681.457812,1500,1244.702118,1756.903623,1435.426534,...,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0,1500.0


In [3]:

def lookup_elo(elo_frame, day, player_id):
    # Previous calendar day; normalize() drops the time of day so the dates match
    prev_day = pd.Timestamp(day).normalize() - timedelta(days=1)
    rating = elo_frame.loc[elo_frame.date == prev_day, f"player_{player_id}"]
    # Return a scalar (NaN if that day is missing) rather than a one-row Series
    return float(rating.iloc[0]) if len(rating) else np.nan

# Let's see if this works
lookup_elo(elo, pd.to_datetime("2012-03-06"), 1)

1500.0

Let's apply this new feature to every race.

In [4]:
races['challenger_elo'] = [lookup_elo(elo, d, p_id) for d, p_id in zip(races.race_driven, races.challenger)]
races['opponent_elo'] = [lookup_elo(elo, d, p_id) for d, p_id in zip(races.race_driven, races.opponent)]
races.head()

Unnamed: 0,id,race_created,race_driven,track_id,challenger,opponent,money,fuel_consumption,winner,status,forecast,weather,challenger_elo,opponent_elo
0,1,2012-03-06,2012-03-06 00:00:00,12,5,2,30,0.63,5,finished,"a:4:{s:5:""sunny"";i:10;s:5:""rainy"";i:70;s:8:""th...",rainy,1500.0,1500.0
1,2,2012-03-06,2012-03-06 00:03:00,12,5,4,30,0.63,4,finished,"a:4:{s:5:""sunny"";i:70;s:5:""rainy"";i:15;s:8:""th...",sunny,1500.0,1500.0
3,4,2012-03-06,2012-03-06 00:06:00,12,5,4,30,0.63,5,finished,"a:4:{s:5:""sunny"";i:25;s:5:""rainy"";i:75;s:8:""th...",sunny,1500.0,1500.0
5,6,2012-03-06,2012-03-06 00:17:00,12,5,10,100,0.63,5,finished,"a:4:{s:5:""sunny"";i:30;s:5:""rainy"";i:20;s:8:""th...",snowy,1500.0,1500.0
8,9,2012-03-06,2012-03-06 00:08:00,3,10,4,30,0.63,4,finished,"a:4:{s:5:""sunny"";i:45;s:5:""rainy"";i:40;s:8:""th...",sunny,1500.0,1500.0
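
Looking up ratings row by row scans the ELO frame once per race, which can get slow on a large data set. As a possible alternative, the sketch below reshapes the wide ELO frame into long format and attaches the previous-day rating with a single merge. The helper `attach_previous_day_elo` and the toy frames are hypothetical and only mirror the column layout shown above.

```python
import pandas as pd


def attach_previous_day_elo(races, elo, player_col, out_col):
    """Return a copy of `races` with `out_col` holding the previous-day ELO
    of the player identified by `player_col`."""
    # Only melt the rating columns; this skips helper columns such as "Unnamed: 0".
    player_cols = [c for c in elo.columns if c.startswith("player_")]
    long_elo = elo.melt(id_vars="date", value_vars=player_cols,
                        var_name="player", value_name=out_col)
    long_elo[player_col] = (long_elo["player"]
                            .str.replace("player_", "", regex=False)
                            .astype(int))
    # Shift each rating forward by one day so a race date matches the prior day's rating.
    long_elo["lookup_date"] = long_elo["date"] + pd.Timedelta(days=1)
    keyed = races.assign(lookup_date=races["race_driven"].dt.normalize())
    merged = keyed.merge(long_elo[["lookup_date", player_col, out_col]],
                         on=["lookup_date", player_col], how="left")
    return merged.drop(columns="lookup_date")


# Toy frames with made-up values, shaped like the notebook's data.
toy_elo = pd.DataFrame({
    "date": pd.to_datetime(["2012-03-05", "2012-03-06"]),
    "player_1": [1500.0, 1495.0],
    "player_2": [1500.0, 1600.0],
})
toy_races = pd.DataFrame({
    "race_driven": pd.to_datetime(["2012-03-06 00:03:00", "2012-03-07 10:00:00"]),
    "challenger": [1, 2],
})
result = attach_previous_day_elo(toy_races, toy_elo, "challenger", "challenger_elo")
print(result)
```

Races without a rating for the previous day end up with `NaN` in the new column because of the left join. Note that the merge returns a frame with a fresh index.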


## 3. XGBoost

In [5]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split

# Label is 1 when the challenger won the race, 0 when the opponent won
races['label'] = [1 if c == w else 0 for c, w in zip(races.challenger, races.winner)]

features = ["track_id", "challenger_elo", "opponent_elo", "weather", "money", "fuel_consumption"]

X_train, X_test, y_train, y_test = train_test_split(races[features], races["label"], test_size=0.33, random_state=42)
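
The `weather` column holds strings such as `sunny` and `rainy`. XGBoost expects numeric input unless its categorical support is explicitly enabled, so one common option is to one-hot encode the column before splitting. The following is a minimal sketch on a toy frame with made-up values that stands in for `races[features]`.

```python
import pandas as pd

# Toy stand-in for races[features]; the values are made up for illustration.
toy = pd.DataFrame({
    "track_id": [12, 12, 3],
    "challenger_elo": [1500.0, 1500.0, 1495.0],
    "opponent_elo": [1500.0, 1600.2, 1500.0],
    "weather": ["rainy", "sunny", "snowy"],
    "money": [30, 30, 100],
    "fuel_consumption": [0.63, 0.63, 0.63],
})

# Replace the string column with one 0/1 indicator column per weather type.
encoded = pd.get_dummies(toy, columns=["weather"], dtype=int)
print(encoded.columns.tolist())
```

The encoded frame can then be passed to `train_test_split` in place of `races[features]`. Encoding before the split keeps the indicator columns identical in the training and test sets.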