# Description of Data

**FIFA World Ranking:** A key component of the project is a comparison against a baseline model that relies on FIFA rankings. We obtained the rankings from [FIFA international men teams' rankings from August 1993 to June 2018](https://www.kaggle.com/tadhgfitzgerald/fifa-international-soccer-mens-ranking-1993now). The rankings were introduced in December 1992, and the ranking system has since been revamped several times in response to criticism of how well its calculation measures the relative strength of national teams. The current version of the FIFA World Ranking, introduced after World Cup 2018, is based on the Elo rating system used in chess and Go. The previous version was used from 2006 to 2018, and this period forms the timeframe of the data analysis in our project. That version accumulates points from international matches, where each match grants ranking points calculated as follows:

$$\textrm{Ranking points}=\textrm{Result points}\times\textrm{Match status}\times\textrm{Opposition strength}\times\textrm{Regional strength}$$

A win, draw, or loss earns a different number of result points, and a win or loss decided by a penalty shootout is scored differently from one decided in regular play. A multiplier based on the type of match (for example, a friendly versus a tournament match) is then applied to capture the varying significance of a match. Intuitively, this reflects the different stakes of each match: a team may not field its full strength every time. The remaining factors capture the strength of the opponent, since a win against a stronger opponent should count for more than a win against a weaker one. Regional strength is included as a further multiplier on the strength of the opponent.
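
To make the formula concrete, here is a minimal sketch of the product in Python. The specific multiplier values and the form of `opposition_strength` are assumptions chosen for illustration and should be checked against FIFA's published procedure; only the four-factor product itself comes from the formula above.

```python
# Illustrative multipliers only: these values are assumptions for the example,
# not figures taken from our data.
RESULT_POINTS = {
    "win": 3.0,
    "shootout_win": 2.0,
    "draw": 1.0,
    "shootout_loss": 1.0,
    "loss": 0.0,
}
MATCH_STATUS = {
    "friendly": 1.0,
    "qualifier": 2.5,
    "confederation_final": 3.0,
    "world_cup": 4.0,
}


def opposition_strength(opponent_rank):
    """Assumed form: a better-ranked opponent gives a larger multiplier."""
    if opponent_rank == 1:
        return 200.0
    return float(max(200 - opponent_rank, 50))


def ranking_points(result, status, opponent_rank, regional_strength):
    """Product of the four factors in the ranking points formula."""
    return (
        RESULT_POINTS[result]
        * MATCH_STATUS[status]
        * opposition_strength(opponent_rank)
        * regional_strength
    )


# Same result against the same opponent, but with different stakes.
wc_win = ranking_points("win", "world_cup", opponent_rank=10, regional_strength=1.0)
friendly_win = ranking_points("win", "friendly", opponent_rank=10, regional_strength=1.0)
print(wc_win, friendly_win)
```

Under these assumed values, the same win is worth four times as much in a World Cup match as in a friendly, which is the "different stakes" effect described above.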

**Match statistics:** The dataset of national team outcomes between 1872 and 2018 provided by the staff is used to construct the dependent variables for our training and validation sets. Similarly, we used [World Cup 2018 Stats](https://gitlab.com/djh_or/2018-world-cup-stats/blob/master/world_cup_2018_stats.csv) to construct our test set results. These datasets contain only very basic match data, mainly the scores and the location of each match. We looked for more advanced match statistics, such as team possession and fouls committed. However, as mentioned previously, these were not readily available for free. Nonetheless, we engineered features from these match statistics to extract as much information from them as possible. This match-based dataset forms the basis of our analysis.

**Player and team statistics:** Player- and team-based statistics, such as ratings, positional data and wages for the former and playing style for the latter, were scraped from [sofifa](https://sofifa.com). sofifa is a website that collects data from the FIFA game databases and has historical data going back to FIFA 2007, released on September 25, 2006. Earlier FIFA versions were released without frequent updates. Starting from FIFA 2013, however, the game was updated frequently throughout the year to account for player and team improvements. As such, prior to 2013 we do not have multiple data points for each year. This is not a major issue, as in-game data do not vary much within a specific game edition. It is thus reasonable to extrapolate a single snapshot throughout the year.
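
One way to apply a snapshot throughout the year is an "as of" lookup: for each match date, take the most recent in-game snapshot on or before that date. The sketch below illustrates the idea with the standard library; the snapshot dates (other than the first) and ratings are hypothetical, and this is not necessarily how our pipeline performs the join.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical snapshots for one team, sorted by date (values are made up).
snapshot_dates = [date(2006, 9, 25), date(2007, 9, 27), date(2008, 10, 2)]
snapshot_ratings = [78, 80, 81]


def rating_as_of(match_date):
    """Return the rating from the latest snapshot on or before match_date.

    Returns None when the match predates every snapshot.
    """
    idx = bisect_right(snapshot_dates, match_date)
    if idx == 0:
        return None
    return snapshot_ratings[idx - 1]


print(rating_as_of(date(2007, 6, 1)))  # uses the first snapshot
print(rating_as_of(date(2006, 1, 1)))  # no snapshot available yet
```

With frequent updates (FIFA 2013 onwards) the same lookup simply has more snapshot dates to choose from.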

![](http://www.fifaworldcupnews.com/wp-content/uploads/2018/04/FIFA-Video-Game-Series-history-cover.jpg)

The sofifa dataset also does not include all national teams, and a different set of national teams is included in each FIFA game edition. This does not pose a problem for 'better', more popular teams, as they are generally included in all games by demand. However, a lot of national teams are not included because they are not deemed 'popular' enough to be selected by gamers. As such, we lose a lot of matches (where a match is an observation in our full dataset) when merging the sofifa dataset with our match-based dataset. This is again not a major problem because we still end up with more than 1,800 matches, which we deemed sufficient for our analysis. We noted that a small number of prior World Cup matches were removed because some of the teams in prior World Cups were not included in the game (e.g. Ghana and Slovakia in World Cup 2010). Thankfully, a World Cup 2018 update to the FIFA 2018 game edition ensured that we have the necessary data for our test set.
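
The effect of the merge can be sketched as an inner join: a match survives only if both teams appear in the game data. The records and field names below are hypothetical, with the home/away labels assigned arbitrarily; they only illustrate why matches involving teams such as Ghana or Slovakia drop out.

```python
# Hypothetical match records and the set of teams covered by a game edition.
matches = [
    {"home": "Germany", "away": "Ghana", "year": 2010},
    {"home": "Germany", "away": "England", "year": 2010},
    {"home": "Slovakia", "away": "Italy", "year": 2010},
]
teams_in_game = {"Germany", "England", "Italy"}

# Keep a match only when both sides have in-game statistics.
kept = [
    m for m in matches
    if m["home"] in teams_in_game and m["away"] in teams_in_game
]
dropped = len(matches) - len(kept)
print(f"kept {len(kept)} matches, dropped {dropped}")
```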

Moreover, even for the national teams that are included, the list of squad players is not particularly accurate. The sofifa dataset roughly approximates the list of players who are often called up by their national teams. Even the World Cup 2018 game update did not have the exact World Cup roster for all the teams!

![](https://www.thesun.co.uk/wp-content/uploads/2018/05/wrong-6.jpg?strip=all&quality=100&w=742&h=417&crop=1)

It might seem that feature engineering on individual player statistics, as opposed to using the aggregated team statistics, would give us greater flexibility. We therefore tried to match player names with match squad data. However, because the datasets use different naming conventions for players' names, it would be difficult to match player names accurately without extensive manual curation. Moreover, this would yield minimal benefit for many of the features according to our EDA (see EDA section for details). As such, we relied on a mix of existing team statistics and aggregated summary statistics of individual players' data for each team to approximate the talent available in each team. Although the in-game squad data is not particularly accurate, this is not a major issue since we are already approximating the strength of a team through aggregation. Moreover, as we shall see, our approximation does reasonably well.

While the sofifa dataset has its limitations, it is the most comprehensive readily available dataset on soccer statistics that we found. As such, it is the main complement to our match-based data.

We also experimented with data from other sources, such as [European matches](https://www.kaggle.com/hugomathien/soccer).

**Country statistics:** Inspired by the prior World Cup prediction model by Andreas Groll and his team at the University of Dortmund that we reviewed, we also scraped some country information from Wikipedia, such as [GDP PPP](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)) and [population](https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)).


## Feature Engineering

1. **Representation of home and away team:** As a soccer match involves two teams, we needed our models to be agnostic to which team is listed as home or away, so that swapping the two teams does not change our predictions. The simplest way to do so is to take the difference of team features. As such, we can think of our predictors as differences in the skills/abilities of the two teams.


2. **Missing value representation:** Some of the data that we rely on are based upon the FIFA in-game statistics stored on sofifa.com. As they only have statistics from FIFA 2007 (released in 2006) onwards, our training data starts from 2006. This matches the timeframe of the FIFA World Ranking model that we are comparing against. Importantly, we note that the value and wage data are missing for less renowned players. In the soccer job market, top players in European leagues earn substantially more than other soccer players. As such, it makes sense to impute the missing values and wages as 0 to capture the disparity in wages. For the remaining missing values, we also imputed 0: since our data is represented as differences between two teams, 0 encodes a lack of information that neither advantages nor disadvantages either team.


3. **Momentum:** In sports, we often notice that a team tends to keep winning or keep losing as it gains momentum due to morale, conditions and other factors. This can be represented by taking into account the team's performance in past games. We engineered several momentum-based statistics by looking at the number of wins, goals scored, and goals conceded in a team's past few games. We also experimented with weighting some of these momentum-based statistics by the rating of the opponents the team faced. This is somewhat similar to the FIFA World Rankings and is our attempt to capture similar information from our data.


4. **Statistics of players:** We aggregate players' statistics into measures of a team's offensive and defensive abilities. One exception is the goalkeepers' ratings: we believe that defense is pivotal for any good team, and a good goalkeeper can potentially swing the outcome of a match. As such, we single out goalkeepers' ratings and include them as separate features.

<table><tr>    
<td> <img src="https://www.straitstimes.com/sites/default/files/styles/article_pictrure_780x520_/public/articles/2018/06/29/ST_20180629_FDKOREANS290A8P_4097009.jpg?itok=a3Ffr9IV&timestamp=1530212009" width="50%"/> </td>
<td> <img src="https://66.media.tumblr.com/5bef2a2e7230d97d4753cc251289ee95/tumblr_pazvs9HVL91tovmb9o1_500.jpg" width="50%"/> </td>
</tr></table>


5. **GDP within the same continent:** Rather than considering the overall GDP difference, the GDP difference within a confederation seems a better indicator of how well a team is doing against another. Confederations tend to align with continents, so we compare the GDPs of countries on the same continent. When the two teams are not on the same continent, the feature is again encoded as 0 to indicate a lack of information that neither advantages nor disadvantages either team.
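
The difference representation, the two kinds of zero imputation, and the same-continent GDP rule can be sketched together as follows. This is a minimal illustration under stated assumptions, not our exact pipeline: the team dictionaries, their keys, and the `same_continent_gdp_diff` name are hypothetical, and any transformation applied to the GDP difference (such as a log) is omitted.

```python
def diff_or_zero(team1_value, team2_value):
    """Team 1 minus team 2; 0.0 when either side is missing (no information)."""
    if team1_value is None or team2_value is None:
        return 0.0
    return float(team1_value) - float(team2_value)


def build_diff_features(team1, team2):
    # Missing market value is imputed as 0 *before* differencing, so a squad
    # without listed values looks cheaper than a squad of listed stars.
    value1 = team1.get("value_euros_millions") or 0.0
    value2 = team2.get("value_euros_millions") or 0.0
    same_continent = team1["continent"] == team2["continent"]
    return {
        "overall_diff": diff_or_zero(team1.get("overall"), team2.get("overall")),
        "value_euros_millions_diff": value1 - value2,
        # GDP is only compared when both teams are on the same continent.
        "same_continent_gdp_diff": (
            diff_or_zero(team1.get("gdp"), team2.get("gdp")) if same_continent else 0.0
        ),
    }


# Hypothetical teams with made-up numbers.
team_a = {"overall": 85, "value_euros_millions": 500.0, "gdp": 4000.0, "continent": "Europe"}
team_b = {"overall": 80, "value_euros_millions": None, "gdp": 3000.0, "continent": "Europe"}
team_c = {"overall": None, "value_euros_millions": 120.0, "gdp": 2500.0, "continent": "Asia"}

forward = build_diff_features(team_a, team_b)
backward = build_diff_features(team_b, team_a)
cross = build_diff_features(team_a, team_c)
print(forward, backward, cross, sep="\n")
```

Swapping the two teams negates every feature, so the same match is described consistently regardless of which team is listed first.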

Our final training dataset consists of 1897 rows, while our test dataset consists of the 64 matches of the 2018 World Cup. Below is a table summarizing the features we used in the end.

| Feature Name        | Description           | Data Type  |
| ------------- |:---------------------------:| -----:|
| is_home      | indicates whether a team has home team advantage (host or same continent as host) | Categorical (-1, 0, 1) |
| attack_diff      | difference in team attack ratings | Float |
| bup_dribbling_diff      | difference in quality of dribbling | Float |
| bup_passing_diff      | difference in quality of passing distance and support from teammates | Float |
| bup_speed_diff      | difference in speed in which attacks are put together | Float |
| cc_crossing_diff      | difference in frequency of crosses into the box | Float |
| cc_passing_diff      | difference in amount of risk taken in pass decision and run support | Float |
| cc_shooting_diff      | difference in frequency of shots taken | Float |
| d_aggresion_diff      | difference in intensity taken in tackling the ball possessor | Float |
| d_pressure_diff      | difference in how high the pitch the team starts pressuring | Float |
| d_width_diff      | difference in how narrow or wide a team shape is set up without possession | Float |
| defence_diff      | difference in team defense ratings | Float |
| goalkeeeper_overall_diff      | difference in goalkeepers' ratings | Float |
| growth_diff      | difference of the difference in overall and potential (measure of potential of a player) ratings | Float |
| midfield_diff      | difference in team midfield ratings | Float |
| overall_diff      | difference in team overall ratings | Float |
| prestige_diff      | difference in teams' prestige | Float |
| full_age_diff      | difference in average age of full squad | Float |
| start_age_diff      | difference in average age of starting squad | Float |
| value_euros_millions_diff      | difference in estimated worth of all players in the team in millions of euros | Float |
| wage_euros_thousands_diff      | difference in estimated wage of all players in the team in thousands of euros | Float |
| attack_home_defence_away_diff      | Team 1's attack ratings - Team 2's defense ratings | Float |
| attack_away_defence_home_diff      | Team 2's attack ratings - Team 1's defense ratings | Float |
| rank_diff      | difference in teams' FIFA rankings | Float |
| gdp_diff      | log of difference in GDP if within the same continent | Float |
| raw_gdp_diff      | difference between GDP | Float |
| win_momentum_past_1_games_diff     | difference in win momentum of the past game | Float |
| win_momentum_past_2_games_diff      | difference in win momentum of the past 2 games | Float |
| win_momentum_past_3_games_diff       | difference in win momentum of the past 3 games | Float |
| win_momentum_past_4_games_diff       | difference in win momentum of the past 4 games | Float |
| win_momentum_past_5_games_diff       | difference in win momentum of the past 5 games | Float |
| lose_momentum_past_1_games_diff      | difference in lose momentum of the past game | Float |
| lose_momentum_past_2_games_diff    | difference in lose momentum of the past 2 games | Float |
| lose_momentum_past_3_games_diff     | difference in lose momentum of the past 3 games | Float |
| lose_momentum_past_4_games_diff     | difference in lose momentum of the past 4 games | Float |
| lose_momentum_past_5_games_diff      | difference in lose momentum of the past 5 games | Float |
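
As an illustration of how the momentum columns in the table could be derived, the sketch below counts wins and losses over each team's most recent `k` results and takes the difference between the two teams. Treating momentum as a plain count is an assumption made for this example (the opponent-weighted variant is not shown), and the result histories are made up.

```python
def count_recent(past_results, outcome, k):
    """Count `outcome` ("W", "D" or "L") among the k most recent results.

    `past_results` is ordered from oldest to newest and excludes the match
    being predicted.
    """
    recent = past_results[-k:] if k > 0 else []
    return sum(1 for r in recent if r == outcome)


def momentum_features(team1_history, team2_history, max_games=5):
    features = {}
    for k in range(1, max_games + 1):
        features[f"win_momentum_past_{k}_games_diff"] = (
            count_recent(team1_history, "W", k) - count_recent(team2_history, "W", k)
        )
        features[f"lose_momentum_past_{k}_games_diff"] = (
            count_recent(team1_history, "L", k) - count_recent(team2_history, "L", k)
        )
    return features


# Hypothetical histories, oldest result first.
team1_history = ["W", "D", "W", "W", "L"]
team2_history = ["L", "L", "W", "D", "W"]

momentum = momentum_features(team1_history, team2_history)
print(momentum["win_momentum_past_1_games_diff"])
print(momentum["win_momentum_past_5_games_diff"])
```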



Below is our response variable.

| Column Name        | Description           | Data Type  |
| ------------- |:---------------------------:| -----:|
| home_win      | 1 if team 1 wins; -1 if team 2 wins; 0 for a tie | Categorical (-1, 0, 1) |
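
For illustration, this encoding can be derived from a scoreline as sketched below. The function name is hypothetical, and the sketch uses only the goals in the recorded score; how matches decided by a penalty shootout should be labelled is left as an open assumption.

```python
def encode_home_win(team1_goals, team2_goals):
    """Return 1 if team 1 wins, -1 if team 2 wins, and 0 for a tie."""
    if team1_goals > team2_goals:
        return 1
    if team1_goals < team2_goals:
        return -1
    return 0


print(encode_home_win(2, 1), encode_home_win(0, 3), encode_home_win(1, 1))
```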