# Football match prediction (GPU version)

In this experiment we are going to use the [Kaggle football dataset](https://www.kaggle.com/hugomathien/soccer). The dataset covers more than 25,000 matches and more than 10,000 players from the top championships of 11 European countries, for seasons 2008 to 2016. It also contains player attributes sourced from EA Sports' FIFA video game series. The problem we address is predicting whether a match will end as a win, a draw or a defeat for the home team.

Part of the code used in this notebook comes from this [Kaggle kernel](https://www.kaggle.com/airback/match-outcome-prediction-in-football).

The details of the machine we used and the versions of the libraries can be found in [experiment 01](./experiments/01_airline.ipynb).

In [25]:
import os,sys
import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from libs.metrics import classification_metrics_multilabel, binarize_prediction
import xgboost as xgb
import lightgbm as lgb
from libs.loaders import load_football
from libs.football import get_fifa_data, create_feables
from libs.timer import Timer
from libs.conversion import convert_cols_categorical_to_numeric
import pickle
import pkg_resources
import json


print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))

%matplotlib inline

ImportError: cannot import name 'classification_metrics_multilabel'

### Data loading and management


In [3]:
%%time
countries, matches, leagues, teams, players = load_football()
print(countries.shape)
print(matches.shape)
print(leagues.shape)
print(teams.shape)
print(players.shape)

INFO:libs.loaders:MOUNT_POINT not found in environment. Defaulting to /fileshare


(11, 2)
(25979, 115)
(11, 3)
(299, 5)
(183978, 42)
CPU times: user 4 s, sys: 864 ms, total: 4.86 s
Wall time: 20.2 s


In [4]:
leagues

Unnamed: 0,id,country_id,name
0,1,1,Belgium Jupiler League
1,1729,1729,England Premier League
2,4769,4769,France Ligue 1
3,7809,7809,Germany 1. Bundesliga
4,10257,10257,Italy Serie A
5,13274,13274,Netherlands Eredivisie
6,15722,15722,Poland Ekstraklasa
7,17642,17642,Portugal Liga ZON Sagres
8,19694,19694,Scotland Premier League
9,21518,21518,Spain LIGA BBVA


In [5]:
matches.head()

Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,1,1,1,2008/2009,1,2008-08-17 00:00:00,492473,9987,9993,1,...,4.0,1.65,3.4,4.5,1.78,3.25,4.0,1.73,3.4,4.2
1,2,1,1,2008/2009,1,2008-08-16 00:00:00,492474,10000,9994,0,...,3.8,2.0,3.25,3.25,1.85,3.25,3.75,1.91,3.25,3.6
2,3,1,1,2008/2009,1,2008-08-16 00:00:00,492475,9984,8635,0,...,2.5,2.35,3.25,2.65,2.5,3.2,2.5,2.3,3.2,2.75
3,4,1,1,2008/2009,1,2008-08-17 00:00:00,492476,9991,9998,5,...,7.5,1.45,3.75,6.5,1.5,3.75,5.5,1.44,3.75,6.5
4,5,1,1,2008/2009,1,2008-08-16 00:00:00,492477,7947,9985,1,...,1.73,4.5,3.4,1.65,4.5,3.5,1.65,4.75,3.3,1.67


In [6]:
# Keep only matches with no missing values in the selected columns; this also reduces run time
cols = ["country_id", "league_id", "season", "stage", "date", "match_api_id", "home_team_api_id", 
        "away_team_api_id", "home_team_goal", "away_team_goal", "home_player_1", "home_player_2",
        "home_player_3", "home_player_4", "home_player_5", "home_player_6", "home_player_7", 
        "home_player_8", "home_player_9", "home_player_10", "home_player_11", "away_player_1",
        "away_player_2", "away_player_3", "away_player_4", "away_player_5", "away_player_6",
        "away_player_7", "away_player_8", "away_player_9", "away_player_10", "away_player_11"]
match_data = matches.dropna(subset = cols)
print(match_data.shape)

(21374, 115)


Now, using the information from the matches and players, we are going to create features based on the FIFA attributes. This computation is heavy, so we save the result to disk the first time we compute it and reload it on later runs.

In [7]:
%%time
fifa_data_filename = 'fifa_data.pk'
if os.path.isfile(fifa_data_filename):
    fifa_data = pd.read_pickle(fifa_data_filename)
else:
    fifa_data = get_fifa_data(match_data, players)
    fifa_data.to_pickle(fifa_data_filename)
print(fifa_data.shape)

(21374, 23)
CPU times: user 29min 27s, sys: 1min 6s, total: 30min 33s
Wall time: 31min 14s


Finally, we are going to compute the features and labels. The labels describe the result from the point of view of the team playing at home: `Win`, `Draw` or `Defeat`.

In [8]:
%%time
bk_cols = ['B365', 'BW', 'IW', 'LB', 'PS', 'WH', 'SJ', 'VC', 'GB', 'BS']
bk_cols_selected = ['B365', 'BW']      
feables = create_feables(match_data, fifa_data, bk_cols_selected, get_overall = True)
print(feables.shape)

Generating match features...
Generating match labels...
Generating bookkeeper data...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


(19673, 48)
CPU times: user 10min 44s, sys: 52.2 s, total: 11min 37s
Wall time: 11min 53s
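The labeling rule can be illustrated with a minimal sketch. The function `match_label` below is hypothetical and is not the code inside `create_feables`; it only assumes that the label is derived by comparing `home_team_goal` with `away_team_goal`.

```python
def match_label(home_team_goal, away_team_goal):
    """Return the match result from the home team's point of view.

    Hypothetical helper: assumes the label only depends on the goal counts.
    """
    if home_team_goal > away_team_goal:
        return "Win"
    if home_team_goal == away_team_goal:
        return "Draw"
    return "Defeat"


# Made-up scores, used only to show the three possible outcomes
print(match_label(2, 0))
print(match_label(1, 1))
print(match_label(0, 3))
```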


In [9]:
feables.head()

Unnamed: 0,match_api_id,home_team_goals_difference,away_team_goals_difference,games_won_home_team,games_won_away_team,games_against_won,games_against_lost,season,League_1.0,League_1729.0,...,away_player_9_overall_rating,away_player_10_overall_rating,away_player_11_overall_rating,B365_Win,B365_Draw,B365_Defeat,BW_Win,BW_Draw,BW_Defeat,label
0,493017.0,0.0,0.0,0.0,0.0,0.0,0.0,2008.0,1,0,...,70.0,68.0,63.0,0.313804,0.276886,0.40931,0.307825,0.27941,0.412765,Win
1,493025.0,0.0,0.0,0.0,0.0,0.0,0.0,2008.0,1,0,...,67.0,73.0,68.0,0.327179,0.286281,0.38654,0.290493,0.300176,0.409331,Defeat
2,493027.0,0.0,0.0,0.0,0.0,0.0,0.0,2008.0,1,0,...,55.0,58.0,64.0,0.672897,0.209346,0.117757,0.672269,0.226891,0.10084,Win
3,493034.0,1.0,2.0,1.0,1.0,0.0,0.0,2008.0,1,0,...,74.0,70.0,69.0,0.207407,0.259259,0.533333,0.192717,0.274476,0.532807,Win
4,493040.0,-2.0,0.0,0.0,0.0,0.0,0.0,2008.0,1,0,...,60.0,63.0,65.0,0.535211,0.267606,0.197183,0.565759,0.25499,0.17925,Draw
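In each row above, the three `B365_*` values (and the three `BW_*` values) sum to 1, which is consistent with bookmaker odds that have been turned into normalized probabilities. The sketch below shows one common way to do this, assuming decimal odds; `odds_to_probabilities` is a hypothetical helper and may differ from what `create_feables` actually does.

```python
def odds_to_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into probabilities that sum to 1.

    Hypothetical helper: assumes decimal odds, where 1 / odds is the
    implied probability before removing the bookmaker margin.
    """
    inverse = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(inverse)  # typically above 1 because of the bookmaker margin
    return [p / total for p in inverse]


# Odds from the BSH, BSD, BSA columns of the first row of matches.head(),
# used here only as an example (the notebook selects B365 and BW)
win, draw, defeat = odds_to_probabilities(1.73, 3.4, 4.2)
print(round(win, 3), round(draw, 3), round(defeat, 3))
```

Normalizing by the sum removes the margin, so the three values can be compared directly across bookmakers.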


Let's now split features and labels.

In [10]:
features = feables[feables.columns.difference(['match_api_id', 'label'])]
labs = feables['label']
print(features.shape)
print(labs.shape)

(19673, 46)
(19673,)


Once we have the features and labels defined, let's create the train and test set.

In [12]:
%%time
X_train, X_test, y_train, y_test = train_test_split(features, labs, test_size=0.2, random_state=42, stratify=labs)

In [24]:
X_train = X_train.astype(np.float64)
X_test = X_test.astype(np.float64)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(type(y_train))
xx = X_train.astype(np.float64)
y_train.dtypes

(15738, 46)
(15738,)
(3935, 46)
(3935,)
<class 'pandas.core.series.Series'>


dtype('O')

In [23]:
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

TypeError: a float is required
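The error above is most likely caused by the labels: `y_train` has dtype `O` because it holds the strings `Win`, `Draw` and `Defeat`, while `xgb.DMatrix` expects numeric labels. A minimal sketch of one possible fix is to map each class name to an integer id first. The names `label_to_id` and `encode_labels` are hypothetical and not part of the notebook.

```python
# Class order is an assumption; any fixed order works as long as it is
# reused when interpreting the predicted probabilities.
labels = ["Win", "Draw", "Defeat"]
label_to_id = {name: idx for idx, name in enumerate(labels)}


def encode_labels(values):
    """Map string class names to integer ids (hypothetical helper)."""
    return [label_to_id[v] for v in values]


# The five labels shown in feables.head()
y_example = ["Win", "Defeat", "Win", "Win", "Draw"]
y_encoded = encode_labels(y_example)
print(y_encoded)
```

With a pandas Series, `y_train.map(label_to_id)` should give the same encoding, and the result can then be passed as `label` to `xgb.DMatrix` or `lgb.Dataset`.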

In [None]:
lgb_train = lgb.Dataset(X_train.values, y_train.values, free_raw_data=False)
lgb_test = lgb.Dataset(X_test.values, y_test, reference=lgb_train, free_raw_data=False)

### XGBoost analysis
Once we have done the feature engineering step, we can start to train with each of the libraries. We will start with XGBoost. 

We are going to save the training and test time, as well as some metrics. 
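The timing is done with the `Timer` context manager from `libs.timer`, whose implementation is not shown in this notebook. As a rough idea of how such a helper could work, here is a minimal sketch; `SketchTimer` is a hypothetical stand-in that only assumes the `interval` attribute used later in the cells.

```python
import time


class SketchTimer:
    """Context manager that stores the elapsed wall time in `interval`.

    Hypothetical stand-in for libs.timer.Timer, not the actual class.
    """

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Elapsed seconds between entering and leaving the `with` block
        self.interval = time.perf_counter() - self._start
        return False  # do not swallow exceptions


with SketchTimer() as t_example:
    total = sum(range(100000))

print(t_example.interval)
```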

In [None]:
results_dict = dict()
num_rounds = 300
labels = ["Win", "Draw", "Defeat"]

In [None]:
params = {'max_depth':3, 
          'objective': 'multi:softprob', 
          'num_class': len(labels),
          'min_child_weight':5, 
          'learning_rate':0.1, 
          'colsample_bytree':0.8, 
          'scale_pos_weight':2, 
          'gamma':0.1, 
          'reg_lambda':1, 
          'subsample':1,
          'tree_method':'exact', 
          'updater':'grow_gpu'
          }


In [None]:
with Timer() as t_train:
    xgb_clf_pipeline = xgb.train(params, dtrain, num_boost_round=num_rounds)
    
with Timer() as t_test:
    y_prob_xgb = xgb_clf_pipeline.predict(dtest)

In [None]:
y_pred_xgb = binarize_prediction(y_prob_xgb)

In [None]:
report_xgb = classification_metrics_multilabel(y_test, y_pred_xgb, labels)
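The helpers `binarize_prediction` and `classification_metrics_multilabel` come from `libs.metrics` and are not shown here. As a minimal sketch of the idea, assuming the model returns an array of shape `(n_samples, n_classes)` and that labels are encoded as integer ids in the order of `labels`, class predictions can be taken with an argmax and scored with accuracy. The function names and the probabilities below are made up for illustration.

```python
import numpy as np


def probabilities_to_class_ids(y_prob):
    """Pick the most probable class for each row (hypothetical helper)."""
    return np.argmax(np.asarray(y_prob), axis=1)


def accuracy(y_true_ids, y_pred_ids):
    """Fraction of samples whose predicted class matches the true class."""
    return float(np.mean(np.asarray(y_true_ids) == np.asarray(y_pred_ids)))


# Made-up probabilities for three matches; columns are Win, Draw, Defeat
y_prob_example = np.array([[0.60, 0.30, 0.10],
                           [0.20, 0.30, 0.50],
                           [0.40, 0.35, 0.25]])
y_true_example = [0, 2, 1]

y_pred_example = probabilities_to_class_ids(y_prob_example)
print(y_pred_example, accuracy(y_true_example, y_pred_example))
```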

In [None]:
results_dict['xgb']={
    'train_time': t_train.interval,
    'test_time': t_test.interval,
    'performance': report_xgb 
}

In [None]:
del xgb_clf_pipeline


Now let's try XGBoost with the histogram-based tree method.

In [None]:
params = {'max_depth':0, 
          'objective': 'multi:softprob', 
          'num_class': len(labels),
          'min_child_weight':5, 
          'learning_rate':0.1, 
          'colsample_bytree':0.80, 
          'scale_pos_weight':2, 
          'gamma':0.1, 
          'reg_lambda':1, 
          'subsample':1,
          'tree_method':'hist', 
          'max_leaves':2**3, 
          'grow_policy':'lossguide', 
         }


In [None]:
with Timer() as t_train:
    xgb_hist_clf_pipeline = xgb.train(params, dtrain, num_boost_round=num_rounds)
    
with Timer() as t_test:
    y_prob_xgb_hist = xgb_hist_clf_pipeline.predict(dtest)

In [None]:
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)

In [None]:
report_xgb_hist = classification_metrics_multilabel(y_test, y_pred_xgb_hist, labels)

In [None]:
results_dict['xgb_hist']={
    'train_time': t_train.interval,
    'test_time': t_test.interval,
    'performance': report_xgb_hist
}

In [None]:
del xgb_hist_clf_pipeline

### LightGBM analysis

Now let's compare with LightGBM.

In [None]:
params = {'num_leaves': 2**3,
         'learning_rate': 0.1,
         'colsample_bytree': 0.80,
         'scale_pos_weight': 2,
         'min_split_gain': 0.1,
         'min_child_weight': 5,
         'reg_lambda': 1,
         'subsample': 1,
         'objective':'multiclass',
         'num_class': len(labels),
         'task': 'train'
         }

In [None]:
with Timer() as t_train:
    lgbm_clf_pipeline = lgb.train(params, lgb_train, num_boost_round=num_rounds)
    
with Timer() as t_test:
    y_prob_lgbm = lgbm_clf_pipeline.predict(X_test.values)

In [None]:
y_pred_lgbm = binarize_prediction(y_prob_lgbm)

In [None]:
report_lgbm = classification_metrics_multilabel(y_test, y_pred_lgbm, labels)

In [None]:
results_dict['lgbm']={
    'train_time': t_train.interval,
    'test_time': t_test.interval,
    'performance': report_lgbm 
}

In [None]:
del lgbm_clf_pipeline

Finally, the results.

In [None]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

As can be seen, in this multiclass case LightGBM is faster than both versions of XGBoost. The performance metrics are really poor, so we wouldn't recommend betting based on these models :-)