# Overview and Motivation

While soccer is one of the most popular sports in the world, its analytics are lackluster compared to those of other popular sports such as baseball, American football, hockey, and basketball. This is particularly evident for the FIFA World Cup: the amount of analytics done on it is minuscule relative to its scale, given that it is one of the most popular and most widely televised sporting events in the world! This can be attributed in part to the general resistance of the sport (or FIFA) towards technology. Even though numerous other sports already use video technology to review plays, the 2018 FIFA World Cup was the first World Cup to use the Video Assistant Referee (VAR).

![](https://pics.me.me/si-fifa-fortnite-var-room-fifaworld-cup-russia-2018-use-34316423.png)
<center>Source: https://pics.me.me/si-fifa-fortnite-var-room-fifaworld-cup-russia-2018-use-34316423.png</center>

Another possible explanation is its uniqueness as a sporting event of global scale. The only comparable event is the Olympics. As a result, it is logistically difficult for international teams to play each other frequently! In fact, soccer clubs are often very resistant to releasing their players for international friendlies, especially in the middle of the club season.

![](https://pbs.twimg.com/media/CREXjPkWIAA1_OG.jpg)
<center>Source: https://pbs.twimg.com/media/CREXjPkWIAA1_OG.jpg</center>

As countries do not compete against each other at the highest level frequently, there is a lack of high-quality data that can be immediately fed into models to perform predictions. This is one of the biggest limitations of our project. While we would have liked to construct a model based on historical match-based statistics, such as possession rate or the number of shots on target, these data are not readily available for free. Most of these statistics for international matches seem to have been recorded only since the start of the 2010s. Even then, only matches at the highest level, during tournaments or featuring top teams, have these statistics available. As such, we rely on a mix of simple match-based statistics and team and individual data from the popular FIFA video games for our World Cup predictions.

The World Cup is composed of 64 matches in total: 48 matches in the group stage and 16 matches in the knockout stage (15 + 1 for third place). We plan to predict the outcome of each of the 64 matches independently instead of predicting which teams proceed in each round. This strategy allows our results to be comparable across models. Framed this way, each game can be treated as a multi-class classification problem with three outcomes: a win for the home team (or team 1, indicated with 1), a win for the away team (or team 2, indicated with -1), or a draw (indicated with 0). In the knockout rounds, we limit the outcome to a win for the home team (or team 1) or a win for the away team (or team 2), as draws are not allowed. Note that while we refer to teams as 'home' or 'away', we are merely abusing the terminology to distinguish between teams. There is only one true 'home' team at a World Cup.
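The label encoding described above can be sketched in a few lines. This is a minimal illustration, not the project's actual preprocessing code: the function name `encode_outcome` and the toy scorelines are assumptions made for the example.

```python
def encode_outcome(team1_goals: int, team2_goals: int) -> int:
    """Encode a result as 1 (team 1 / "home" wins), -1 (team 2 / "away" wins), or 0 (draw)."""
    if team1_goals > team2_goals:
        return 1
    if team1_goals < team2_goals:
        return -1
    return 0


# Hypothetical scorelines, written as (team 1 goals, team 2 goals).
scorelines = [(2, 1), (0, 0), (1, 3)]
labels = [encode_outcome(g1, g2) for g1, g2 in scorelines]
print(labels)
```

For knockout matches, the same encoding applies to the final result, where the value 0 no longer occurs.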

To validate how accurate FIFA rankings are, we use a baseline model that leverages FIFA rankings and some other simple team predictors to predict the World Cup results. We then create a more advanced model that does not rely on FIFA rankings. Instead, our advanced model is based on features that we collected and engineered ourselves. The feature engineering process is a follow-up to our initial EDA, where we identified features that could possibly impact match results. Ultimately, our analysis attempts to create a model that predicts the World Cup results as accurately as possible, while offering insight into the features that are helpful in soccer analytics.

# Literature Review and Related Works

Even though soccer analytics is not a well-established field, there have been a number of attempts to model the 2018 World Cup using machine learning techniques.

For example, Andreas Groll and his team at the University of Dortmund utilized [Poisson regression, random forest, and ranking methods](https://arxiv.org/abs/1806.03208) to simulate the World Cup 100,000 times. Features that Groll used "include economic factors such as a country’s GDP and population, FIFA’s ranking of national teams, and the properties of the teams themselves, such as their average age, the number of Champions League players they have, whether they have home advantage, and so on." (https://www.technologyreview.com/s/611397/machine-learning-predicts-world-cup-winner/) Through these simulations, Groll's model predicted Spain as the most likely winner of the World Cup followed by Germany. We all know how that turned out...

![](https://i.ytimg.com/vi/ITlBUIWlsXQ/maxresdefault.jpg)
<center>Source: https://i.ytimg.com/vi/ITlBUIWlsXQ/maxresdefault.jpg</center>

Similarly, Gerald Muriuki utilized [logistic regression](https://scotch.io/@itsmuriuki/predicting-fifa-world-cup-2018-using-machine-learning) to simulate the World Cup. Using a historical match dataset as his training set, Muriuki trained only on matches in which both teams were playing in the 2018 World Cup. The only features he used were one-hot encodings of the teams, essentially representing a team's past performance against the other teams as a feature. The model predicted Brazil as the most likely winner.

Finally, Rodrigo Nader utilized [SVM](https://towardsdatascience.com/using-machine-learning-to-simulate-world-cup-matches-959e24d0731) to simulate the World Cup. Nader used team ratings (Atk, Mid, Def, ovr) scraped from FIFA Index to build his model. The model predicted Spain as the most likely winner, defeating Brazil in the final.

As we can see, there were many attempts to predict the World Cup before it took place, with Spain or Brazil prevailing in most cases. However, we did not find any existing work that modelled World Cup 2018 after it had ended. While this may result in some hindsight bias (which we actively tried to avoid), there is still great value in reflecting on the actual results of the World Cup. This would not only give us some insight into the statistics that actually matter in determining the outcome of a soccer match, but also provide us with a model that could possibly be applied to future soccer matches.

Moreover, as mentioned previously, we model the World Cup based on individual matches for comparability. This is different from the simulation models built previously, which simulate the World Cup sequentially. Our existing model can be extended to predict matches sequentially, although there is not much value in doing so since World Cup 2018 has ended.

That said, the data we chose to use and the features we chose to engineer were inspired by the related works. We hope to combine the features used individually by these related works to make a model of our own.

# Exploratory Data Analysis

# Baseline Model

Our most basic model would be to just predict the majority class every time. In this case, `home_win` = 1 is the majority class. Doing this "prediction" on our train set results in an accuracy of 43.8%, while doing it on our test set results in an accuracy of 42.19%. This is pretty decent given that we have three classes. Any model we build should beat this test accuracy of just guessing the majority class.
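The majority-class baseline can be sketched as follows. This is an illustrative sketch only: `majority_class_accuracy` is a hypothetical helper and the label lists are toy data, not the project's train and test sets.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Find the most common training label and score it as a constant prediction."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)


# Toy labels (assumed for illustration): 1 = home win, 0 = draw, -1 = home loss.
train_labels = [1, 1, 1, 0, -1, 1, 0, -1]
test_labels = [1, 0, -1, 1]

majority, accuracy = majority_class_accuracy(train_labels, test_labels)
print(majority, accuracy)
```

The majority class is always determined on the training labels, so the test accuracy reflects how often that same constant guess is right on unseen matches.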

Our baseline model was pretty simple. As features, we used the differences between the two teams in FIFA rankings, offense ratings, defense ratings, midfield ratings, and overall ratings, along with whether the home team is actually playing at home. We split the original train set into a train set and a validation set. We then fitted several different classification algorithms on these features, using cross-validation to train the models.
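A minimal sketch of how such difference features could be built for one match is shown below. The key names in `RATING_KEYS`, the function `difference_features`, the `team1_is_host` flag, and the rating values are all assumptions for illustration; the actual column names in the project's datasets may differ.

```python
# Hypothetical per-team fields; the real dataset's column names may differ.
RATING_KEYS = ("fifa_rank", "attack", "midfield", "defense", "overall")


def difference_features(team1, team2, team1_is_host):
    """Build baseline-style features as team 1 minus team 2, plus a true-home flag."""
    features = {f"{key}_diff": team1[key] - team2[key] for key in RATING_KEYS}
    features["team1_is_host"] = int(team1_is_host)
    return features


# Toy ratings, not real data.
team_a = {"fifa_rank": 5, "attack": 85, "midfield": 84, "defense": 82, "overall": 84}
team_b = {"fifa_rank": 20, "attack": 78, "midfield": 77, "defense": 79, "overall": 78}

features = difference_features(team_a, team_b, team1_is_host=False)
swapped = difference_features(team_b, team_a, team1_is_host=False)
print(features)
```

Using differences rather than raw values means that swapping the two teams simply flips the sign of every rating feature, which fits the fact that the 'home' and 'away' designations are mostly arbitrary.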

![Baseline Model Results](img/baseline_results.png)


Ultimately, we selected the model with the highest validation accuracy, which was logistic regression in this case. We then evaluated how the model performs on the test set. Due to the elimination format of the World Cup playoffs, there are no draws in the playoff matches. We attempted three different approaches to predicting the outcome of these playoff matches.

The first approach is to predict the outcome at 90 minutes (when a regular game ends, so that our train set and test set are more "similar" to each other), before accounting for extra time and penalty kicks. This allows the model trained on the training set to also predict draws in the playoff matches of the World Cup.

In the second approach, we predict the outcome at the end of the match. For a playoff match, if the model predicts a draw as the most likely outcome, we instead predict the second most likely outcome of that match. We call this approach the "Softmax Approach".
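The fallback rule of the Softmax Approach can be sketched as below. This is a minimal illustration under stated assumptions: `predict_outcome` is a hypothetical helper, and the class probabilities are made-up numbers standing in for a fitted classifier's output.

```python
def predict_outcome(class_probs, is_knockout):
    """Pick the most likely class; in knockout matches, skip a predicted draw.

    class_probs maps each label (-1, 0, 1) to its predicted probability.
    """
    ranked = sorted(class_probs, key=class_probs.get, reverse=True)
    if is_knockout and ranked[0] == 0:
        # A draw is not a valid final result, so fall back to the runner-up class.
        return ranked[1]
    return ranked[0]


# Made-up probabilities where a draw is the single most likely outcome.
probs = {-1: 0.25, 0: 0.40, 1: 0.35}
group_prediction = predict_outcome(probs, is_knockout=False)
knockout_prediction = predict_outcome(probs, is_knockout=True)
print(group_prediction, knockout_prediction)
```

With these numbers the same match is called a draw in the group stage but a home win in a playoff, since a home win is the second most likely class.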

The third approach also attempts to predict the outcome at the end of the match. Instead of predicting the second most likely outcome for a playoff match, we train another model specifically on the final outcomes of past World Cup playoff matches. We then use this model to predict the playoff matches, while the model trained on the training set predicts the group-stage matches. Since we have a limited number of past World Cup playoff matches (only 24, because some of them were removed to accommodate the team ratings dataset), we trained a basic logistic regression model for this purpose in the baseline case. We call this approach the "WC Playoff Model".
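The routing between the two models in the WC Playoff Model approach can be sketched as follows. Everything here is a stand-in: `predict_tournament`, the `stage` and `features` keys, and the two toy rule-based "models" are assumptions for illustration and are not the fitted logistic regressions used in the project.

```python
def predict_tournament(matches, general_model, playoff_model):
    """Send knockout matches to the playoff model and all others to the general model."""
    predictions = []
    for match in matches:
        model = playoff_model if match["stage"] == "knockout" else general_model
        predictions.append(model(match["features"]))
    return predictions


# Stand-in predictors (assumed): real models would be fitted classifiers.
def general_model(features):
    diff = features["overall_diff"]
    if abs(diff) < 2:
        return 0  # close ratings: predict a draw
    return 1 if diff > 0 else -1


def playoff_model(features):
    # Trained only on final playoff results, so it never outputs a draw.
    return 1 if features["overall_diff"] >= 0 else -1


matches = [
    {"stage": "group", "features": {"overall_diff": 1}},
    {"stage": "knockout", "features": {"overall_diff": 1}},
    {"stage": "knockout", "features": {"overall_diff": -4}},
]
predictions = predict_tournament(matches, general_model, playoff_model)
print(predictions)
```

The design choice is that the playoff model only ever sees two labels during training, so no post-hoc adjustment of its predictions is needed.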

![Baseline Test Results](img/baseline_test_results.png)

Impressive! The baseline model gave us an idea of what accuracy our more advanced model should hope to achieve. In this case the Softmax approach did better than the WC Playoff Model approach; this might be due to the small training set for the playoff matches, the small number of features, or just pure luck.

We did some basic analysis to see what exactly the model is getting wrong by plotting some confusion matrices. 

![Baseline Confusion Matrices](img/baseline_cm.png)

Our model does not predict draws at all. This is a bit concerning, but it makes sense: in our EDA of these simple features, we could not really distinguish draws from home wins and home losses for any of the features. For most features, the distribution for draws was "sandwiched" between the two other distributions. Maybe our more advanced models will be able to better predict draws.

We see that across the train set and test set, the proportions in each entry of the confusion matrices are approximately the same, which is good. This might be an indication that our train and test sets are similar.

It may seem that the 90-minute approach is better because its accuracies on home losses and home wins are higher than those of the Softmax approach. However, the true labels differ between the former approach and the latter two: all the draws in the playoff matches of the test set become either home losses or home wins. As a result, the overall accuracy of the Softmax approach is still higher than that of the 90-minute approach despite its lower home-loss and home-win accuracies, so we can only compare overall accuracies between the former approach and the latter two. We can still compare the home-loss and home-win accuracies between the latter two approaches, since they use the same test labels. There we see that while the WC Playoff Model is better at predicting home losses, it is worse at predicting home wins by a larger margin, resulting in lower overall accuracy.
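The effect of changing the true labels can be reproduced on toy data. The sketch below is illustrative only: the helper names and all labels are made up, and the predictions are deliberately from a model that never predicts draws.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred, classes=(-1, 0, 1)):
    """Fraction of matches of each true class that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in classes if totals[c]}


def overall_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: the same predictions scored against two label sets.
y_pred = [1, 1, -1, -1, 1, -1]
y_true_90 = [1, 0, -1, 0, 1, -1]     # outcome at 90 minutes, draws allowed
y_true_final = [1, 1, -1, 1, 1, -1]  # final outcome, both draws became home wins

print(per_class_accuracy(y_true_90, y_pred), overall_accuracy(y_true_90, y_pred))
print(per_class_accuracy(y_true_final, y_pred), overall_accuracy(y_true_final, y_pred))
```

In this toy case the home-win accuracy drops once the draws are relabeled as home wins, yet the overall accuracy rises, which is why per-class accuracies are not comparable across different label sets.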

More importantly, we were curious how important each feature is, especially the FIFA ranking, the one feature we are trying to replace. The feature importances of a random forest let us examine exactly that, and since random forest performs similarly to logistic regression on the train/validation set, we can use it for this purpose.

![Baseline Feature Importance](img/baseline_feature_importance.png)


It seems like the FIFA rankings are not that important a feature! Hopefully we can make a better model than the baseline model.


# Beyond Baseline

# Bonus: Neural Network on Small Training Set

Out of plain curiosity, we wondered how a simple neural network would perform on our problem (everyone wants to try deep learning nowadays). Because our training set is so small, we did not expect the neural network to outperform any of our models from the previous part. In fact, it might overfit to the training set and perform worse than our other models. This notebook is just an experiment to see how a neural network performs on a small dataset. We only look at the outcome at 90 minutes and the Softmax approach on the test set, since it would be pretty complicated to come up with a scheme for the WC Playoff Model configuration, especially since there are so few WC playoff matches in our dataset.

Our simple neural network had a total of 3 hidden layers, with 15 nodes in each layer. We added regularization on each layer and also added dropout layers to try to prevent overfitting.

![Neural Net Design](img/neural_net_design.png)


![Neural Net Results](img/neural_net_results.png)


Surprisingly, the neural network actually did better than we thought it would. This most likely stems from the regularization we added as well as the dropout layers. However, the neural network still did not perform as well as our best model.

This suggests that a neural network does not really help improve accuracy on this small dataset, and that we do not need that complex a model for this problem.




# Summary and Future Work

In this project we explored the idea of predicting the 2018 World Cup through a data science approach. We collected data related to matches, team ratings, and FIFA rankings. Our training set consists of matches played by national teams starting from 2006. However, due to limitations of the team ratings dataset, we had to drop many matches when merging the matches dataset with the team ratings dataset. The training set has three outcomes: home team loses, draw, and home team wins. Although the dataset designates one team as the home team, in most cases this does not mean anything and one can switch the home team and away team. As a result, any feature we use has to be symmetric in nature.

Due to the elimination format of the World Cup playoffs, there are no draws in the playoff matches. We attempted different approaches to predicting the outcome of these playoff matches. The first approach is to predict the outcome at 90 minutes (when a regular game ends, so that our train set and test set are more "similar" to each other), before accounting for extra time and penalty kicks, allowing the model trained on the training set to also predict draws in the playoff matches of the World Cup. In the second approach, we predicted the outcome at the end of the match; for a playoff match, if the model predicts a draw as most likely, we instead predict the second most likely outcome of that match. The third approach also attempts to predict the outcome at the end of the match. Instead of predicting the second most likely outcome for a playoff match, we train another model specifically on the final outcomes of past World Cup playoff matches; we then use this model to predict the playoff matches, while the model trained on the training set predicts the group-stage matches.

We wanted to see just how reliable FIFA rankings are in determining the outcome of a match, so we built a baseline model consisting of just FIFA rankings, attack ratings, defense ratings, midfield ratings, and overall ratings of the teams. We tried out a variety of classification methods such as logistic regression, linear discriminant analysis, quadratic discriminant analysis, random forest, and XGBoost, using cross-validation to train the models and ultimately selecting the model with the highest validation accuracy. With this model, we achieved decent predictions on the 2018 World Cup. The baseline model also showed us that FIFA rankings do not have that much of an impact on the outcome of the matches, suggesting that a model without FIFA rankings could achieve better results than the baseline model if we engineer good features.

Through reviewing past works and exploratory data analysis, we realized that features capturing aspects of a team not covered by the baseline model matter. These include the income and age of the players, more specific player statistics than a single rating each for attack, defense, etc. (such as dribbling and passing skills), and momentum over the past few games. Furthermore, not surprisingly, the wealth of the country matters as well; we compared the GDP of countries on the same continent as a proxy for comparing the GDP of teams in the same confederation.

Similar to the baseline model, we first tried a variety of classification methods and used cross-validation to train them, selecting the model with the highest validation accuracy. With this model, we achieved impressive increases in our test accuracies. We then tried stacking, but that did not result in any significant improvement. Due to the multicollinearity of our features, we used dimension reduction techniques, namely Principal Component Regression and Partial Least Squares Discriminant Analysis. In the end we saw that Partial Least Squares Discriminant Analysis was comparable to the full model, and better in the case of the second approach.

![Model Results](img/model_results.png)

We also tried out a simple neural network just for educational purposes, and while it performed better than the baseline model, it did not really perform as well as the more advanced models, which is not surprising due to our small training set. 

As shown, the models we built already give decent results, but of course there is always room for improvement. The immediate next step is to think of more complex features. We would also like to explore the impact of individual players on the match; while team ratings may capture an aggregate view of the players, we would like to be even more granular to see if individual players can "carry" their teams. We attempted to do so at the beginning of this project, but one of the main constraints we encountered was that it was basically impossible to match individual players on the national stage. We believe that this problem will be resolved as the field becomes more developed and centralized; eventually a website or database like https://www.baseball-reference.com/ will be built, but for soccer. In applying data science to any field, we believe that domain knowledge about the field is the most necessary requirement to improve the model. As such, another crucial next step would be to acquire domain knowledge to recognize what makes a team "good", considering that none of us have much knowledge about soccer.

This project suggests that there is a lot of potential for the field to expand. We expect that in the future teams will rely heavily on analytics to make decisions, just as baseball, basketball, and American football teams do today.