## Introduction

Soccer is the most popular sport in the world, and predicting the outcomes of matches has always been of great interest to sports enthusiasts, analysts, and betting companies. Traditional methods of prediction have relied on simple statistical models, but with the increasing availability of data and advancements in machine learning, more sophisticated models can be developed to improve prediction accuracy.

## Model Architecture

In this work, we propose a `transformer-based`(@zerveas_transformer-based_2021) model for `multivariate time series` representation learning to predict `soccer` match outcomes from leagues around the world.
This framework is inspired by the results attained through unsupervised pre-training of transformer models. Transformers were introduced in the paper "Attention Is All You Need" (@vaswani2017attention) by Vaswani et al. in 2017 and have since become the backbone of many state-of-the-art natural language processing models.

The key innovation of `transformers` is their use of self-attention mechanisms, which allow the model to selectively focus on different parts of the input sequence when processing it. This is in contrast to traditional `recurrent neural networks (RNNs)` and `convolutional neural networks (CNNs)`, which process the input sequence sequentially or with a fixed-size window, respectively.

Zerveas et al. (@zerveas_transformer-based_2021) brought this approach to multivariate `time series` analysis with a methodology built on a transformer-based model. It leverages unlabeled data by training a transformer encoder to extract dense vector representations of multivariate time series, which can then be applied to downstream tasks such as regression, classification, imputation, and forecasting.
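The self-attention idea can be illustrated with a minimal sketch of single-head scaled dot-product attention. This is not the model's implementation: for brevity it assumes identity query/key/value projections (a real transformer learns these), and the `matches` feature values are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections); a trained model would learn W_q, W_k, W_v.
    """
    d = len(seq[0])
    outputs, weights = [], []
    for q in seq:
        # Similarity of this time step to every time step in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, seq)) for i in range(d)])
    return outputs, weights


# Hypothetical per-match features for one team: [goals scored, goals conceded]
matches = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
context, attn = self_attention(matches)
```

Each row of `attn` is a probability distribution over all matches in the sequence, so every position can draw on any other position directly rather than through a sequential state or a fixed-size window.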

Our model ingests team feature inputs, each containing the `statistical performance` recorded in each game played. The architecture consists of a `positional encoder`, which creates a dense representation of the input sequence, followed by an N-layer stack of `self-attention` and `feedforward` networks. A `batch normalization` layer is applied to the output of each multi-head attention and feedforward layer to improve the stability and speed of training.
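As a rough illustration of the normalization step, the sketch below standardizes each feature across a batch. It is a simplification under stated assumptions: it omits the learned scale and shift parameters and the running statistics that a full batch normalization layer keeps, and the `batch` values are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature to zero mean and unit variance across the batch.

    Assumption: no learned scale/shift and no running statistics,
    so this corresponds to training-mode behaviour only.
    """
    n = len(batch)
    d = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
        for row in batch
    ]


# Hypothetical sub-layer outputs for a batch of three samples, two features each.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch)
```

After normalization both features share the same scale even though the raw values differ by a factor of ten, which is what helps keep training stable.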

The embedding outputs (heads), that is, the "Z" vectors produced by the feedforward network for the `home` and `away` teams, are concatenated into a single vector containing information on both teams. This vector is then passed through a `multitask` learning `linear layer`, which produces two kinds of output: `goal difference` and `1x2` outcome probabilities.
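The concatenation and the two output heads can be sketched as follows. Everything numeric here is an assumption made for illustration: the embedding size `D_MODEL`, the clipped goal-difference bins `GOAL_DIFFS`, the example "Z" vectors, and the randomly initialized (untrained) weights.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(n_in, n_out, rng):
    """Randomly initialized weights and zero bias (untrained, for illustration)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, weights, bias):
    """Affine map y = Wx + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)
D_MODEL = 4                      # hypothetical embedding size
GOAL_DIFFS = list(range(-3, 4))  # hypothetical bins: home goals minus away goals

# Hypothetical "Z" vectors produced by the encoder for each team.
z_home = [0.5, -0.2, 0.1, 0.9]
z_away = [0.3, 0.4, -0.7, 0.0]
z = z_home + z_away              # concatenation into a single vector

# Two task-specific heads share the same concatenated input.
outcome_head = init_linear(2 * D_MODEL, 3, rng)             # 1, x, 2
diff_head = init_linear(2 * D_MODEL, len(GOAL_DIFFS), rng)  # goal difference

p_1x2 = softmax(linear(z, *outcome_head))
p_diff = softmax(linear(z, *diff_head))
```

Because both heads read the same vector `z`, the gradients from both tasks would flow back into one shared encoder during training, which is the usual motivation for a multitask setup.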

Once the goal difference probabilities are available, the `exact score` can be generated in one of two ways: from empirical score probabilities based on previous outcomes per league, or from the `Poisson` distribution. The `Poisson` method requires two additional linear layer outputs, each passed through an `exponential function` so that it is positive and can serve as a `Poisson` rate.
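A minimal sketch of the `Poisson` option is shown below. It assumes that the two extra linear outputs are log-rates for home and away goals and that the two goal counts are independent; neither assumption is stated in the text, and the rates 1.6 and 1.1 are made-up example values. The grid is truncated at `max_goals`.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def score_matrix(log_rate_home, log_rate_away, max_goals=10):
    """Grid of exact-score probabilities, grid[h][a] = P(home=h, away=a).

    Assumptions: the inputs are raw linear-layer outputs, exponentiated
    to obtain positive rates, and home/away goals are independent.
    """
    lam_home = math.exp(log_rate_home)
    lam_away = math.exp(log_rate_away)
    return [
        [poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away) for a in range(max_goals + 1)]
        for h in range(max_goals + 1)
    ]


# Hypothetical linear-layer outputs corresponding to rates of 1.6 and 1.1 goals.
grid = score_matrix(math.log(1.6), math.log(1.1))
cells = [(h, a) for h in range(len(grid)) for a in range(len(grid))]

# 1x2 probabilities follow by summing regions of the score grid.
p_home = sum(grid[h][a] for h, a in cells if h > a)
p_draw = sum(grid[h][a] for h, a in cells if h == a)
p_away = sum(grid[h][a] for h, a in cells if h < a)

# Most likely exact score.
best_score = max(cells, key=lambda s: grid[s[0]][s[1]])
```

The same grid also yields goal-difference probabilities by summing along its diagonals, so the `Poisson` outputs can be checked for consistency against the goal-difference head.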


![Model architecture](figures/model_architecture.png)