# Natural Language Processing Task

After completing our first task, which was to extract previews from the Guardian website, we began our second task, which was to reproduce the results of the <cite id="cfvjb"><a href="#zotero|11983139/J4T53NKQ">(Beal et al., 2020)</a></cite> article by processing and analyzing the texts that we had extracted.
The article's authors proceeded as follows:

- <b>Information extraction</b>: they extracted the main features of each sentence in the article's text.

- <b>Allocation of Text Context</b>: each sentence is assigned to a team.

- <b>Text Vectorisation</b>: they converted the sentences into vectors using a Count Vectorizer technique to have a numerical representation of the words in a sentence.

- <b>Prediction</b>: Once the feature set for each game is formed, they trained a Random Forest model using historic data and the numerical representation of the words in the sentence.

<br>Carrying out these operations required an understanding of the underlying concepts and of the spaCy library.
<br>spaCy is an open-source Python library for natural language processing that can be used to extract information from text.

<br>The following techniques were used in this task:

- Named Entity Recognition

- spaCy Entity Ruler

- Count vectorization

## Named-Entity Recognition (NER)

The task of identifying and categorizing key information in text is known as Named Entity Recognition (NER). It is also known as entity extraction or identification.
Each detected entity is assigned to a predefined category. An NER model, for example, may detect the word "Mark" in a text and classify it as a "Person."

Example:

![title](images/ner.png)

![title](images/ner2.png)

In our case, identifying the team names in the previews is critical because it lets us extract the main features of the text.
<br>The standard spaCy NER model is not reliable enough for this: it makes entity detection errors, such as labelling a team name as a person and vice versa.
<br>To address this issue, we decided to train a model that detects our own custom entities.

## Train a model to detect custom entities

To train the model that automatically detects team names, we fed it a training dataset whose labels were produced with an external text annotation tool.
<br>These annotations mark the custom entities that the model learns during training.

![title](images/ner3.png)
![title](images/ner4.png)

After training, we test again:

![title](images/ner5.png)
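
As a rough illustration of how annotated examples could be turned into training data, the sketch below assumes spaCy v3 and a hypothetical annotation export made of a text plus character spans carrying a custom `TEAM` label. The sample sentence, the label name, and the variable names are assumptions; the annotation tool and training format actually used are not specified here.

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical export from an annotation tool: text plus (start, end, label) spans.
ANNOTATIONS = [
    ("Spurs host Arsenal on Sunday.", [(0, 5, "TEAM"), (11, 18, "TEAM")]),
]

nlp = spacy.blank("en")  # tokenizer only; no pretrained model is needed here
doc_bin = DocBin()

for text, spans in ANNOTATIONS:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # None means the offsets do not align with token boundaries
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)

# Entities stored on the last processed document.
labelled = [(ent.text, ent.label_) for ent in doc.ents]
print(labelled)
```

A `DocBin` built this way can be written to disk with `doc_bin.to_disk(...)` and used as input for spaCy's training command.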

## Entity ruler 

Using token-based rules or exact phrase matches, the entity ruler allows us to add spans to the Doc entities. It can be used in conjunction with the statistical EntityRecognizer to improve accuracy, or it can be used on its own to implement a rule-based entity recognition system.

We took advantage of our dataset of teams and their alternative names: each name or nickname of a team is linked to its main entity.

![title](images/ner6.png)

For example, the nickname Spurs is now detected in the text and linked to the Tottenham Hotspur entity.

![title](images/ner7.png)
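
The alias linking described above can be sketched with spaCy's entity ruler. This is a minimal example assuming spaCy v3, a blank English pipeline, and a hypothetical alias table; the `TEAM` label and the aliases listed are illustrative and do not reproduce our actual dataset.

```python
import spacy

# Hypothetical alias table: canonical team name -> names and nicknames.
TEAM_ALIASES = {
    "Tottenham Hotspur": ["Tottenham Hotspur", "Tottenham", "Spurs"],
    "Manchester United": ["Manchester United", "Man Utd", "United"],
}

nlp = spacy.blank("en")  # tokenizer only, no statistical NER
ruler = nlp.add_pipe("entity_ruler")

# One exact-phrase pattern per alias; "id" carries the canonical team name.
patterns = [
    {"label": "TEAM", "pattern": alias, "id": canonical}
    for canonical, aliases in TEAM_ALIASES.items()
    for alias in aliases
]
ruler.add_patterns(patterns)

doc = nlp("Spurs host Manchester United on Sunday.")
found = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(found)
```

Storing the canonical name in the pattern `id` means every alias resolves to the same entity identifier, which is what later steps need when replacing team mentions.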

## Get the names of the coaches

We also noticed that most previews mention the managers by name but not the teams. To make sure this information is still extracted, we used a database containing the history of managers for each team.<br>As a result, it is now easier to identify which part of the text refers to each of the two teams.

Example of the dataset


![title](images/coach1.png)

The final output: for each preview, we have the coaches of each team.



![title](images/coach2.png)
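
A lookup of this kind could be sketched as follows. The history rows, team names, and function names are hypothetical placeholders, and the real database may be structured differently.

```python
from datetime import date

# Hypothetical manager history rows: (team, manager, first day, last day or None).
MANAGER_HISTORY = [
    ("Team A", "Manager One", date(2018, 7, 1), date(2020, 6, 30)),
    ("Team A", "Manager Two", date(2020, 7, 1), None),
    ("Team B", "Manager Three", date(2019, 1, 15), None),
]

def manager_on(team, match_day):
    """Return the manager in charge of `team` on `match_day`, or None."""
    for row_team, manager, start, end in MANAGER_HISTORY:
        in_charge = start <= match_day and (end is None or match_day <= end)
        if row_team == team and in_charge:
            return manager
    return None

def coaches_for_match(home, away, match_day):
    """Pair each side of a fixture with its manager on the match date."""
    return {
        "homecoach": manager_on(home, match_day),
        "awaycoach": manager_on(away, match_day),
    }

coaches = coaches_for_match("Team A", "Team B", date(2021, 3, 6))
print(coaches)
```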

## Previews Preprocessing

Preprocessing starts with tokenization, which splits the text into individual tokens. We then removed stop words such as "the", "a", "an", "so", and "what", as well as all punctuation, because neither carries much information for our task. Finally, we applied lemmatization, which reduces each word to its base form, for example (runs, running, ran) => run.

Example of text


![title](images/text1.png)

![title](images/token.png)
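
To make these steps concrete, here is a toy sketch in plain Python. It is not the spaCy pipeline used in this work: the stop word set and the lemma lookup are tiny hypothetical stand-ins, whereas spaCy ships full versions of both.

```python
import re

# Tiny illustrative stand-ins for a real stop word list and lemmatizer.
STOP_WORDS = {"the", "a", "an", "so", "what", "is", "and", "to", "of"}
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def preprocess(text):
    """Lowercase, tokenize, drop punctuation and stop words, then lemmatize."""
    # Keeping only runs of letters, digits, and apostrophes discards punctuation.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]

cleaned = preprocess("The striker runs, and so the defence ran!")
print(cleaned)
```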

We continue the normalization with the next step: detecting the names of the two teams and of their coaches and replacing them with the placeholders hometeam, awayteam, homecoach, and awaycoach. This lets the model's predictions generalize across teams rather than depend on specific names.
<br>We also noticed that the words 'hosts', 'home side', and 'visitors', which refer to the home and away teams, are used frequently in the previews, so we replaced them in the same way.

We take the same example:

![title](images/text2.png)

Preview after cleaning

![title](images/preview.png)

Preview after normalization

![title](images/preview2.png)
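
The replacement step can be approximated with a simple substitution table. In this sketch the coach names "Smith" and "Jones" are hypothetical, and the table is written by hand; in practice it would be built for each match from the detected entities and the manager history.

```python
import re

# Hypothetical per-match table: surface form -> placeholder.
REPLACEMENTS = {
    "Tottenham Hotspur": "hometeam",
    "Spurs": "hometeam",
    "Arsenal": "awayteam",
    "Smith": "homecoach",   # hypothetical home coach
    "Jones": "awaycoach",   # hypothetical away coach
    "hosts": "hometeam",
    "home side": "hometeam",
    "visitors": "awayteam",
}

def normalize_preview(text, replacements):
    """Replace team and coach mentions with generic placeholders."""
    # Longest names first, so a full name is replaced before any shorter alias.
    for name in sorted(replacements, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, replacements[name], text, flags=re.IGNORECASE)
    return text

raw = ("Spurs welcome Arsenal. Smith wants the hosts to press, "
       "while Jones expects the visitors to counter.")
normalized = normalize_preview(raw, REPLACEMENTS)
print(normalized)
```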

## Allocation of texts

This step assigns each sentence to the appropriate team. In a preview, the journalist may discuss team A, team B, or both. As a result, we obtain three columns: one for sentences about team A, another for sentences about team B, and a third for sentences about both teams at the same time.
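
One possible way to implement this allocation on normalized text is sketched below. It assumes the placeholders hometeam, awayteam, homecoach, and awaycoach are already in place, uses a naive sentence splitter, and ignores sentences that mention neither side; that last choice is an assumption of this sketch.

```python
import re

HOME_MARKERS = {"hometeam", "homecoach"}
AWAY_MARKERS = {"awayteam", "awaycoach"}

def allocate(preview):
    """Group the sentences of a normalized preview by the team they mention."""
    buckets = {"home": [], "away": [], "both": []}
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", preview) if s.strip()]
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        has_home = bool(words & HOME_MARKERS)
        has_away = bool(words & AWAY_MARKERS)
        if has_home and has_away:
            buckets["both"].append(sentence)
        elif has_home:
            buckets["home"].append(sentence)
        elif has_away:
            buckets["away"].append(sentence)
    return buckets

allocated = allocate(
    "hometeam are unbeaten in five. awaycoach has injury concerns. "
    "hometeam beat awayteam last season. Kick-off is at three."
)
print(allocated)
```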

## Modeling

### Vectorization

When the text processing and allocation phases are completed, it is time to begin the modeling phase.
<br>However, our model cannot process raw text, so we must convert it into vectors, which are numerical representations of these character strings. The goal here is to extract textual features that the model can learn from.
<br>Among the vectorization techniques, we highlight the bag of words: a very simple technique that builds the vector of a text from the frequency of each vocabulary word.
It is simple to interpret because each component is just the number of times a vocabulary word occurs in a given document.
<br>As a result, articles, prepositions, and conjunctions, which contribute little to meaning, weigh just as much as adjectives or verbs.
<br>An example for the count vectorization technique:

![title](images/countvector.png)
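
The same idea can be written in a few lines of plain Python. The two sample documents are made up for illustration; each vector simply counts how often every vocabulary word occurs in a document.

```python
from collections import Counter

docs = ["hometeam win again", "awayteam lose again again"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of word counts aligned with the vocabulary.
vectors = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
print(vectors)
```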

<br>Other techniques address this issue and generally work better in machine learning models, such as TF-IDF (term frequency-inverse document frequency).
<br>The idea behind TF-IDF is that words that appear frequently in an individual document but rarely across all documents contribute more to classification.
These terms can be calculated as follows:

![title](images/tfidf.png)
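
As a worked illustration, the sketch below uses one common textbook variant, where tf is the relative frequency of a word in a document and idf is log(N / df). Libraries such as scikit-learn apply smoothed variants, so their numbers differ; the sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, with idf = log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(["hometeam strong attack", "awayteam weak attack"])
print(weights[0])
```

In this toy corpus "attack" appears in every document, so its weight is zero, while words specific to one document keep a positive weight.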

Note that for this work we use the count vectorizer approach to vectorize the preview texts.
<br>The vectorizer has hyperparameters that must be tuned to find the combination that gives the best vectors.
<br>Among these hyperparameters, we can find:
- stop_words: CountVectorizer provides a predefined set of stop words; in our case, we can specify 'english'.
- ngram_range: the range of n-gram sizes to consider; for example, (1,1) takes only single tokens, whereas (1,2) considers both unigrams (single words) and bigrams (combinations of 2 words).
- min_df: stands for minimum document frequency; it disregards words with a document frequency that is strictly lower than the specified threshold.
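
The effect of these three hyperparameters can be seen on a toy corpus. This sketch assumes scikit-learn's `CountVectorizer`; the sample previews and the chosen values are illustrative, not the settings selected in this work.

```python
from sklearn.feature_extraction.text import CountVectorizer

previews = [
    "the hometeam defence is strong",
    "the awayteam defence is weak",
    "hometeam defence concedes rarely",
]

vectorizer = CountVectorizer(
    stop_words="english",  # drop common English words such as "the" and "is"
    ngram_range=(1, 2),    # keep unigrams and bigrams
    min_df=2,              # keep only terms found in at least 2 documents
)
X = vectorizer.fit_transform(previews)

# Feature names ordered by their column index in X.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
print(X.toarray())
```

Only terms present in at least two previews survive, including the bigram "hometeam defence", which shows how `min_df` and `ngram_range` interact.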

### Get target values

To enable our model to train and make predictions, we must first provide the target values (the outputs of the matches).

We have set two target values:

- The outcome of a match: home win, away win, draw
    
- The goal difference: the difference in goals scored

![title](images/final_data.png)
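
Both targets can be derived from the final score. In this sketch the goal difference is taken as home goals minus away goals, which is an assumption about the sign convention; the function name is hypothetical.

```python
def match_targets(home_goals, away_goals):
    """Return (outcome, goal_difference) for a final score."""
    goal_difference = home_goals - away_goals  # assumed convention: home minus away
    if goal_difference > 0:
        outcome = "home win"
    elif goal_difference < 0:
        outcome = "away win"
    else:
        outcome = "draw"
    return outcome, goal_difference

print(match_targets(3, 1))
print(match_targets(0, 2))
print(match_targets(1, 1))
```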

### The proportions of the results

It is worth noting that the class distribution of the English Premier League games we have, from 2009 to 2022, is 45% home wins, 25% draws, and 30% away wins.

![title](images/stats.png)

### Split previews into train and test dataset

Before setting up a machine learning model, we must divide our previews into train and test data. The train dataset is used to fit the machine learning model, and the test dataset, which contains data the model has never seen, is used to assess that fit. To accomplish this, we split the data into 70% train and 30% test without shuffling, so that the temporal order of the matches is preserved.
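
A chronological split of this kind could look like the sketch below. The row structure, the `date` key, and the function name are hypothetical; the only point is that rows are ordered by date and cut once, never shuffled.

```python
def chronological_split(rows, train_fraction=0.7):
    """Sort rows by date and cut once, so the test set is strictly later."""
    ordered = sorted(rows, key=lambda row: row["date"])  # ISO dates sort correctly as text
    cut = int(round(len(ordered) * train_fraction))
    return ordered[:cut], ordered[cut:]

# Ten hypothetical previews, deliberately listed in reverse date order.
rows = [{"date": f"2021-01-{day:02d}", "preview": f"preview {day}"} for day in range(10, 0, -1)]
train, test = chronological_split(rows)
print(len(train), len(test))
```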

### The classifier

Machine learning offers various methods for classification and regression problems that would be worth trying.
<br>In this work, we use a random forest classifier that takes the vectorized texts as input to predict the outcomes of football matches (Home win, Away win, Draw).
<br>The fundamental idea of a random forest is to aggregate a large number of individual decision trees into a single model that functions as an ensemble. The predictions of all individual trees are pooled, and the class with the most votes becomes the model's prediction.
<br>In addition, we can experiment with several hyperparameters of the random forest classifier to improve model performance, such as:
- n_estimators: the number of trees in the forest.
- max_depth: the maximum depth of each tree, that is, the longest allowed path from the root node to a leaf node.
- min_samples_split: the minimum number of observations required to split a node.
- criterion: the function that measures the quality of a split; we can experiment with the values gini and entropy.
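
Putting the vectorizer and the classifier together, a minimal sketch with scikit-learn might look like this. The six training texts, their labels, and the hyperparameter values are made up for illustration and are far too small to give meaningful predictions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical normalized previews and their match outcomes.
train_texts = [
    "hometeam strong attack", "hometeam unbeaten run",
    "awayteam strong attack", "awayteam unbeaten run",
    "both sides cautious", "both sides injured",
]
train_labels = ["home win", "home win", "away win", "away win", "draw", "draw"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_depth=5,          # maximum depth of each tree
    min_samples_split=2,  # minimum samples needed to split a node
    criterion="gini",     # split quality measure ("entropy" is the alternative)
    random_state=0,       # fixed seed so the run is repeatable
)
clf.fit(X_train, train_labels)

# The new text must be transformed with the same fitted vectorizer.
X_new = vectorizer.transform(["hometeam strong run"])
probabilities = clf.predict_proba(X_new)[0]
prediction = clf.predict(X_new)[0]
print(dict(zip(clf.classes_, probabilities)), prediction)
```

Keeping the class probabilities from `predict_proba`, rather than only the predicted label, is what makes probability-based evaluation metrics applicable later.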

## Model Evaluation

Assessing the true performance of a model is key to its validation. It allows the modeller to anticipate how well the model will generalise and whether it will keep a predictive power similar to what was observed in the training/validation phase.<br>Predicting the outcome of a football game is no exception, and the usual steps for validating any classification model can be followed.<br>That said, predicting the outcome of a football game has two particular aspects:
- A solid benchmark that produces predictions exists: the betting market
- Predictions can be used directly in an investment strategy whose economic outcome can be simulated or observed

Because accuracy and precision do not always reflect a model's capability, other criteria are more effective for measuring model performance for our purpose.
In this regard, we have created an R package that enables us to compute the following metrics:
- Log loss: The negative average of the log of the probabilities assigned to the outcomes that actually occurred. It accounts for how confident each prediction is: each estimated probability is compared to the actual class value (0 or 1), and a score is computed that penalizes the probability based on the distance between the expected and actual values. The penalty is logarithmic, with a low score for small differences (0.1 or 0.2) and a high score for large differences.
![title](images/LOG.png)
- Brier Score: An evaluation metric used to check the quality of predicted probabilities. It is very similar to the mean squared error, but it is applied only to predicted probabilities, whose values range from 0 to 1. It is also similar to log loss, except that it penalizes inaccurate predictions more gently. For a binary outcome, the best score is 0.0 and the worst is 1.0.
![title](images/BR.png)
- Residual diagnosis: Residuals are the discrepancies between the observed and estimated values. They are a diagnostic tool for evaluating the quality of a model and help visualize the distribution of errors.
- Calibration Plot: For most classification tasks, we predict the class that has the highest probability of being the true label. However, there are situations where we need the estimated probability of an instance belonging to each class. A calibration plot helps us assess how reliable these probabilities are, that is, how 'sure' a model is when predicting a class label and whether that confidence is justified. The curve of a perfectly calibrated model is the straight diagonal line from (0, 0) to (1, 1).
- Tailored scoring rules: The key notion is that scoring better than your benchmark (the market) is not enough. You must outperform it by a margin comfortable enough to make a profit. In other words, we compare the model's forecasts to those of the bookmakers.
- Trading simulation strategy: We provide the investment tools needed to evaluate our model's success. To that end, we provide functions for calculating the amount invested in each transaction as well as the expected return on each transaction.
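
To make the first metrics concrete, here is a small Python illustration of log loss, a binary Brier score, and the table behind a calibration plot. It does not reproduce the R package: the function names, the two sample forecasts, and the choice to score the "home win" event for the Brier score and calibration table are all assumptions of this sketch.

```python
import math

def log_loss(prob_rows, actual, eps=1e-15):
    """Negative mean log of the probability given to the outcome that occurred."""
    total = 0.0
    for probs, outcome in zip(prob_rows, actual):
        p = min(max(probs[outcome], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(actual)

def brier_score(event_probs, event_happened):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    gaps = [(p - o) ** 2 for p, o in zip(event_probs, event_happened)]
    return sum(gaps) / len(gaps)

def calibration_table(event_probs, event_happened, n_bins=5):
    """Per non-empty bin: (mean predicted probability, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(event_probs, event_happened):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(o for _, o in members) / len(members)
            table.append((mean_pred, observed, len(members)))
    return table

# Two hypothetical forecasts and the results that followed.
forecasts = [
    {"home win": 0.5, "draw": 0.3, "away win": 0.2},
    {"home win": 0.2, "draw": 0.3, "away win": 0.5},
]
results = ["home win", "draw"]

home_probs = [f["home win"] for f in forecasts]
home_happened = [1 if r == "home win" else 0 for r in results]

ll = log_loss(forecasts, results)
bs = brier_score(home_probs, home_happened)
table = calibration_table(home_probs, home_happened)
print(ll, bs, table)
```

For a well-calibrated model, the mean predicted probability and the observed frequency in each bin of the table would be close to each other, which corresponds to points lying near the diagonal of the calibration plot.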

## Reference

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|11983139/J4T53NKQ"></i>Beal, R., Middleton, S. E., Norman, T. J., &#38; Ramchurn, S. D. (2020). Combining Machine Learning and Human Experts to Predict Match Outcomes in Football: A Baseline Model. <i>arXiv:2012.04380 [Cs]</i>. http://arxiv.org/abs/2012.04380</div>
</div>
<!-- BIBLIOGRAPHY END -->