# The shape of football games

## The dataset

The original data can be found [here](https://www.kaggle.com/hugomathien/soccer). In brief, it contains the tables listed below.

## Tables

* Team: contains three id keys that link it to the other tables, plus the long and short names of the team.
* Team Attributes: historical attribute updates for each team (not used in our model).
* Player: general player information such as `name`, `birthday`, `weight` and `height`.
* Player_Attributes: historical updates of the players' attributes. This table is linked to the `Player` table by `player_fifa_api_id`.
* Match: the most important table, where each row describes a match using `date`, `season`, `league`, the ids of the two participating teams, the ids of the 22 starting players and their positions on the field.
* League and Country: contain the name of each league and its home country.
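
As a minimal sketch of how the `Player` and `Player_Attributes` tables can be related, the snippet below joins two toy DataFrames on `player_fifa_api_id` and keeps the most recent attribute update per player. The data values are made up for illustration, and the `date` and `overall_rating` columns are assumed to be present in `Player_Attributes`; this is not necessarily how the `Database` class performs the join.

```python
import pandas as pd

# Toy stand-ins for the Player and Player_Attributes tables (made-up values).
player = pd.DataFrame({
    "player_fifa_api_id": [1, 2],
    "name": ["Player A", "Player B"],
})
player_attributes = pd.DataFrame({
    "player_fifa_api_id": [1, 1, 2],
    "date": pd.to_datetime(["2014-09-01", "2015-03-01", "2014-09-01"]),
    "overall_rating": [80, 82, 75],
})

# One row per attribute update, enriched with the player's general information.
history = player_attributes.merge(player, on="player_fifa_api_id", how="left")

# Latest known attributes per player: sort by date, keep the last update of each player.
latest = history.sort_values("date").groupby("player_fifa_api_id").tail(1)
```

Keeping the full history rather than only the latest row makes it possible to pick the attributes that were valid at the date of a given match.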

<img src="FootballTDA.png"> 

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from database import Database 
from cross_validation import extract_features_for_prediction
import pandas as pd
import numpy as np
from numpy import random
import soccer_basics 
from random import expovariate, gauss
from sklearn.ensemble import RandomForestClassifier
from utils import read_pickle
from notebook_functions import *

## Load the tables

The `Database` class manages the tables and lets us modify the teams.

In [None]:
database = Database()

## Modify teams

The method `hire_player` moves your favorite player to a selected team, so that we can simulate how the championship would have gone. You only need to select the team the player should join and then the player to be replaced. The list of teams is sorted by the total number of points each team collected during the championship. Players are sorted by the number of appearances they made that year.
Let's see how things would have gone.

**Note**: the higher the number of appearances of the player to be replaced, the greater the impact of the hired player!

In [None]:
new_player_df = database.hire_player()

In [None]:
new_player_df.head()

Get the team ids, which will be used later.

In [None]:
team_ids = get_team_ids(new_player_df)

We want to make sure that the column order is the same as in the training set.

In [None]:
new_players_df_stats = get_useful_cols(new_player_df)

In [None]:
new_players_df_stats.head()

## Feature selection

To decide how to group the player attributes, we computed their correlation matrix. It showed two large groups of attributes that were strongly correlated with each other. We therefore split the attributes into two groups: one summarising the attacking characteristics of a player, the other the defensive ones.
Finally, since goalkeepers have completely different statistics from the other players, we only take their overall rating into account.
The features used for each player are listed below:
* **Attack**: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
* **Defense**: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
* **Goalkeeper**: "overall_rating"

From this set of features, the next step is to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.

Then, for each team in a given match, we compute the mean and the standard deviation of the attack and defense scores of the team's players, as well as the best attack and the best defense.

In this way a match is described by 14 features, 7 per team (GK overall rating, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match to a point in feature space according to the characteristics of the two teams.
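
The aggregation described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `team_features` and `match_features` are hypothetical, the input numbers are made up, and the population standard deviation (NumPy's default) is assumed, which may differ from what the project's own code uses.

```python
import numpy as np

def team_features(gk_rating, attack, defense):
    """Return the 7 features of one team.

    attack / defense: per-player mean attack and defense scores of the
    outfield players (the goalkeeper is described by gk_rating only).
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        gk_rating,
        attack.max(), attack.std(), attack.mean(),     # best, std, mean attack
        defense.max(), defense.std(), defense.mean(),  # best, std, mean defense
    ])

def match_features(home, away):
    """Concatenate home and away team features into one 14-dimensional vector."""
    return np.concatenate([team_features(*home), team_features(*away)])

# Made-up example with three outfield players per team, for brevity.
home = (80, [70, 75, 80], [60, 65, 70])
away = (78, [65, 72, 74], [68, 70, 75])
features = match_features(home, away)
```

Each match thus becomes a single point in a 14-dimensional space, with the first 7 coordinates describing the home team and the last 7 the away team.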

## Feature extraction

The aim of TDA is to capture the structure of the space underlying the data. In our project we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. We therefore explored the data space looking for this kind of correlation.

In [None]:
best_pipeline_params, best_model_feat_params = get_best_params()

In [None]:
pipeline = get_pipeline(best_pipeline_params)

In [None]:
x_train, y_train = load_dataset()

In [None]:
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

In [None]:
rf_model = RandomForestClassifier(**best_model_feat_params)

In [None]:
rf_model.fit(x_train, y_train)

In [None]:
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)

In [None]:
matches_probabilities.head()

In [None]:
compute_final_standings(matches_probabilities, 'premier league')
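
One plausible way to turn per-match outcome probabilities into a league table is to sum expected points (3 for a win, 1 for a draw). The sketch below does this on made-up data. It is an assumption for illustration only: the column names are hypothetical, and `compute_final_standings` in `notebook_functions` may work differently.

```python
import pandas as pd

# Hypothetical layout: one row per match with the three outcome probabilities.
matches = pd.DataFrame({
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "C", "C"],
    "p_home": [0.5, 0.4, 0.6],
    "p_draw": [0.3, 0.3, 0.2],
    "p_away": [0.2, 0.3, 0.2],
})

def expected_standings(matches):
    """Rank teams by expected points: 3 * P(win) + 1 * P(draw) per match."""
    home_pts = 3 * matches["p_home"] + matches["p_draw"]
    away_pts = 3 * matches["p_away"] + matches["p_draw"]
    points = pd.concat([
        home_pts.groupby(matches["home_team"]).sum(),
        away_pts.groupby(matches["away_team"]).sum(),
    ])
    # A team appears once as home and once as away: add the two totals.
    return points.groupby(level=0).sum().sort_values(ascending=False)

standings = expected_standings(matches)
```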

## Messi in each team
Below we show the effect that Messi would have had on the final standings of the Premier League 2014/2015. The results are obtained by running 20 different simulations, each one with Messi replacing the player with the highest number of appearances in one of the teams.

In [None]:
teams_with_messi.set_index(np.arange(1, 21), drop=True)

# Benchmarks: Market's odds and Elo ratings

While performance is not our main goal, we nevertheless set up two simple benchmarks to make sure our (topological) model is a reasonable approximation of reality.

The task we chose is the ternary match outcome prediction: will the home team win, will the away team win, or will there be a draw?

The first benchmark is obtained from the market's probabilities for the three outcomes, which are computed by simply inverting the odds (see soccer_basics.py for details).
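
As a rough sketch of what inverting the odds can look like, the snippet below converts decimal odds into probabilities and picks the most likely outcome. The function name `odds_to_probabilities` is hypothetical, the odds are made up, and the normalisation step (dividing by the sum so the probabilities add up to 1) is an assumption; the actual computation in soccer_basics.py may differ.

```python
import numpy as np

def odds_to_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds into probabilities that sum to 1."""
    inverse = 1.0 / np.array([home_odds, draw_odds, away_odds], dtype=float)
    # Raw inverse odds sum to more than 1 because of the bookmaker's margin;
    # normalising removes that margin (assumption, see the text above).
    return inverse / inverse.sum()

probs = odds_to_probabilities(2.0, 3.5, 4.0)

# Ternary prediction: 1 = home win, X = draw, 2 = away win.
prediction = ["1", "X", "2"][int(np.argmax(probs))]
```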

The second benchmark uses Elo ratings instead, a standard tool for assessing the strengths of teams or players: <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo rating system</a>. For the related World Football Elo Ratings, see <a href="https://www.eloratings.net/about">National teams Elo rating</a>. For a deeper mathematical discussion of the concept, see <a href="https://www.stat.berkeley.edu/~aldous/Papers/me-Elo-SS.pdf">Elo's rating mathematics</a>.
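
For readers unfamiliar with Elo, here is a minimal sketch of the standard rating update after a single match. The values `k=20.0` and `home_advantage=100.0` are illustrative assumptions, and the function names are hypothetical; this is not taken from `soccer_basics.get_elo`, whose parameters and details may differ.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(home, away, home_score, k=20.0, home_advantage=100.0):
    """Return the updated (home, away) ratings after one match.

    home_score: 1.0 for a home win, 0.5 for a draw, 0.0 for an away win.
    """
    expected_home = expected_score(home + home_advantage, away)
    delta = k * (home_score - expected_home)
    # Zero-sum update: what the home team gains, the away team loses.
    return home + delta, away - delta

new_home, new_away = update_elo(1500.0, 1500.0, 1.0)
```

Because the update is zero-sum, the total rating in the league stays constant, and an unexpected result moves the ratings more than an expected one.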

We calculate the benchmarks on the Premier League dataset.

Our model reaches an accuracy of 0.531, which is comparable with the market's performance.

In [None]:
probabilities_with_odds = get_dataset(42198).get_data(dataset_format='dataframe')[0]

In [None]:
probabilities_with_odds.head()

In [None]:
soccer_basics.useful_updates1(probabilities_with_odds)
soccer_basics.get_elo(probabilities_with_odds, 20, 100)
soccer_basics.useful_updates2(probabilities_with_odds, 100)

Market's ternary prediction (1, X or 2):



In [None]:
print('market prediction, all data and 2014-2015 season')
# Accuracy: fraction of matches where the market's prediction equals the actual result
acc1 = (probabilities_with_odds['result'] == probabilities_with_odds['market_prediction']).mean()
df = probabilities_with_odds.reset_index()

print(np.round(acc1, 3))

Elo based ternary prediction:



In [None]:
print('Elo based prediction, all data and 2015, with 30 matches quarantine')
soccer_basics.ternary_prediction(probabilities_with_odds, 30)