
# Analysis of reproducibility
### Solène Lemonnier & Pauline Roches

This project is based on the paper "[COVID and Home Advantage in Football: An Analysis of Results and xG Data in European Leagues](https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/)" by Mathieu Acher. This study was published on May 23, 2021. 

## Analysis of the study

After reviewing the study, we were able to outline the context and understand its stakes. The goal of the study is to assess the impact of playing at home considering the presence of supporters in a football stadium. The initial hypothesis is that their presence has a positive effect on team performance, both in terms of points and goals scored, as well as in relation to expected points and goals.\
\
The COVID-19 pandemic provided data that allowed for comparisons between matches played in empty and filled stadiums. However, it is important to note that the impact extended beyond the spectators and also affected the teams themselves, potentially disrupting training sessions or sidelining players due to illness.\
\
To conduct this analysis, six European football leagues were studied from 2014 to 2021. It is worth noting that the crisis was managed differently across leagues, which makes comparisons between them more challenging.




### Methodology 

The data were sourced from the website "[Understat](https://understat.com/)". The variables used include the number of matches, goals, expected goals (xG), expected goals conceded (xGA), points, and expected points (xPoints).\
\
To analyze the data, the following methods were used:
- The **Wilcoxon signed-rank test**: a non-parametric test used to compare two paired sets of values, often before and after a treatment. The aim is to determine whether the two measurements differ significantly without assuming that the differences are normally distributed. The test yields a **p-value**: the probability of observing results at least as extreme as those in the sample under the null hypothesis that there is no effect or difference. If the p-value is small (< 0.05), the null hypothesis is rejected, indicating an effect.
- **Cohen's d**: a measure of effect size, used to determine whether an observed difference between two groups is practically significant, beyond just statistical significance.
- The **Mann-Whitney U test**: a non-parametric test used to compare two independent groups to assess whether their distributions differ significantly.
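
To make the signed-rank idea concrete, here is a minimal pure-Python sketch of how the Wilcoxon statistic is built from paired values. The function name `signed_rank_sums` and the sample numbers are illustrative assumptions, not code or data from the study; in practice a library routine such as `scipy.stats.wilcoxon` would also supply the p-value.

```python
def signed_rank_sums(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped and tied absolute differences
    receive their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    # Indices of the differences, ordered by absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical home and away points for four teams.
home_points = [40, 35, 30, 28]
away_points = [30, 31, 31, 20]
print(signed_rank_sums(home_points, away_points))  # prints (9.0, 1.0)
```

The smaller of the two sums is the usual test statistic: a value close to zero means that almost all teams moved in the same direction.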

### Key findings

The study observed that, generally, there is a notable advantage in terms of points gained when playing at home. However, during the COVID-19 seasons, this advantage diminished significantly or even reversed.

## Reproduction of the study

### Data collection

To begin our data collection process for each league and season, we utilize the [*Understat* Python package](https://understat.readthedocs.io/en/latest/). Understat is a specialized library designed to interact with the statistical data provided by the website *Understat.com*. This package allows us to programmatically fetch and analyze data related to various leagues, seasons, teams, players, and matches.

### Table

We first reproduced the table using `./reproduction/reproduce_diff_points.py`, which shows the differences in points and xPoints between seasons for all leagues. The resulting table is saved as `./reproduction/results/diff_points_xpoints.png`.

In [21]:
from IPython.display import display, HTML

# Build the HTML that displays the two images side by side
html_code = """
<div style="display: flex; justify-content: space-between;">
    <img src="reproduction/results/diff_points_xpoints.png" style="max-width: 48%; height: auto;" />
    <img src="results_acherm/diff_points_xpoints_acherm.png" style="max-width: 48%; height: 50%;" />
</div>
"""

# Display the images side by side
display(HTML(html_code))


Comparing our table (left) with Mathieu Acher's (right), we obtain the same results once the rounding is handled carefully. Since xPoints are floats, the values must be rounded to the nearest integer; by default, they were rounded down.\
\
Only one season in one league (La Liga, 2019) differs from Mathieu Acher's values: the Diff xPoints is 187 with the Understat library instead of 188. This may be caused by the way the package retrieves the data.\
\
We can see a clear home advantage in the different leagues, which diminishes during the COVID period.
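
The rounding issue mentioned above can be illustrated with a short sketch. It assumes the xPoints difference is a float that is converted to an integer for display; the helper `round_half_up` and the sample values are hypothetical and do not come from `reproduce_diff_points.py`.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value):
    """Round to the nearest integer, with .5 always rounding away from zero."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Hypothetical per-team xPoints totals for one league season.
home_xpoints = [31.4, 27.9, 22.4]
away_xpoints = [24.1, 20.3, 18.6]
diff_xpoints = sum(home_xpoints) - sum(away_xpoints)  # about 18.7

print(int(diff_xpoints))            # truncation drops the fractional part
print(round_half_up(diff_xpoints))  # conventional rounding to the nearest integer
```

Python's built-in `round` uses round-half-to-even, so `round(2.5)` gives 2; the `decimal` module is used here to make the half-up convention explicit.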

### Graphs

We then reproduced the graphs using `graphs_par_ligue.py`, which shows the evolution of points earned and expected points, both at home and away, for all leagues from 2014 to 2020. The outputs are stored in the `evolutions_par_ligue` folder, with one graph per league.

In [4]:
from IPython.display import display, HTML

# Build the HTML that displays each pair of images side by side on one row
html_code = """
<div style="display: flex; justify-content: space-between; flex-wrap: wrap;">

    <!-- Bundesliga -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./reproduction/results/evolutions_par_ligue/evolution_points_Bundesliga.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_Bundesliga_acherm.png" width="40%" />
    </div>

    <!-- EPL -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./reproduction/results/evolutions_par_ligue/evolution_points_EPL.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_EPL_acherm.png" width="40%" />
    </div>

    <!-- La liga -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./reproduction/results/evolutions_par_ligue/evolution_points_La_liga.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_La_liga_acherm.png" width="40%" />
    </div>

    <!-- Ligue 1 -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./reproduction/results/evolutions_par_ligue/evolution_points_Ligue_1.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_Ligue_1_acherm.png" width="40%" />
    </div>

    <!-- RFPL -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./reproduction/results/evolutions_par_ligue/evolution_points_RFPL.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_RFPL_acherm.png" width="40%" />
    </div>

    <!-- Serie A -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./reproduction/results/evolutions_par_ligue/evolution_points_Serie_A.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_Serie_A_acherm.png" width="40%" />
    </div>

</div>
"""

# Display the images side by side
display(HTML(html_code))


Comparing our graphs (left) with Mathieu Acher's (right), we observe the same averages and, therefore, the same variations. The curves in the left and right graphs are identical.\
\
The loss of the home advantage is particularly noticeable during the COVID seasons.

### Statistical tests

#### Non-parametric Wilcoxon signed-rank test

We reproduced the Wilcoxon test with the *wilcoxon* function from a Python library, and Cohen's d with a function that we wrote ourselves. The code is in `wilcoxon_with_understat.py` and the results are saved in the file `reproduction_wilcoxon.png`.

In [3]:
from IPython.display import display, HTML

# Build the HTML for our reproduced table
html_code = """
<img src="reproduction/results/reproduction_wilcoxon.png" style="height: auto;" />
"""

# Display our table
display(HTML(html_code))

# Build the HTML for the table from the original study
html_code_img = """
<img src="results_acherm/result-sTest-Wilco.png" style="height: auto;" />
"""

# Display the original table below ours
display(HTML(html_code_img))

Comparing our results (first table) with those of the study (second table), we observe the same values for the Wilcoxon test and Cohen's d.
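
The Cohen's d helper itself is not shown in this notebook. As an illustration only, a common formulation divides the difference in means by the pooled standard deviation; whether `wilcoxon_with_understat.py` or the original study uses exactly this variant is an assumption, and the function name `cohens_d` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_variance = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_variance)


# Toy values: means 4 and 3, both with a standard deviation of 2.
print(cohens_d([2, 4, 6], [1, 3, 5]))  # prints 0.5
```

By the usual rule of thumb, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects.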

## Replicability criteria:
- Dataset: Python library, web scraping, or another dataset
- Seasons: extend to 2021, 2022, and 2023
- Statistical tests
- Python version for the Mann-Whitney U test
- Tests on matches or on teams

## Replication of the study

### 1st change - Data collection process: Web scraping

In our first replication step, we changed the data collection process and used web scraping instead of the Understat library. This method extracts the data directly from the Understat website, which gives us finer control over what is collected and how.

We added the `./replicabilite/web_scraping/scrap.py` file, which allows us to retrieve all team data from *2014* to *2022* for both *home* and *away* games. The results are stored in the CSV file `./replicabilite/web_scraping/understat_team_stats_home_away.csv`.
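
As an illustration of the scraping approach, the sketch below extracts a JSON payload embedded in a page's JavaScript. It assumes the page stores its data in a call of the form `var teamsData = JSON.parse('...')` with hex-escaped characters; this layout, the variable name `teamsData`, and the helper `extract_embedded_json` are assumptions made for the example and may not match `scrap.py` or the current Understat pages.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Decode the JSON passed to JSON.parse('...') for a given JS variable."""
    pattern = rf"{re.escape(var_name)}\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, html)
    if match is None:
        raise ValueError(f"variable {var_name!r} not found in page")
    # The payload is escaped as \xNN sequences; turn it back into plain text.
    decoded = match.group(1).encode("utf-8").decode("unicode_escape")
    return json.loads(decoded)


# Synthetic page fragment standing in for a downloaded league page.
sample_html = r"<script>var teamsData = JSON.parse('\x7B\x22id\x22\x3A\x2289\x22\x7D');</script>"
print(extract_embedded_json(sample_html, "teamsData"))  # prints {'id': '89'}
```

In a real run, `sample_html` would be replaced by the text of an HTTP response, and the decoded dictionary would then be flattened into the home and away rows written to the CSV file.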

We first reproduced the table using `./replicabilite/web_scraping/reproduce_diff_points.py`, which shows the differences in points and xPoints between seasons for all leagues. The resulting table is saved as `./replicabilite/web_scraping/results/diff_points_xpoints.png`.

In [17]:
from IPython.display import display, HTML

# Build the HTML that displays the two images side by side
html_code = """
<div style="display: flex; justify-content: space-between;">
    <img src="replicabilite/web_scraping/results/diff_points_xpoints.png" style="max-width: 48%; height: auto;" />
    <img src="results_acherm/diff_points_xpoints_acherm.png" style="max-width: 48%; height: 50%;" />
</div>
"""

# Display the images side by side
display(HTML(html_code))


Comparing our table (left) with Mathieu Acher's (right), we observe exactly the same results. However, this table differs from the one obtained in our reproduction with the Understat library: since the reproduction matched the original everywhere except for the La Liga 2019 Diff xPoints (187 instead of 188), that value is where the two data collection methods disagree.


### 2nd change - New statistical method: Repeated Measures ANOVA

The Repeated Measures ANOVA (Analysis of Variance) is a statistical method used to evaluate whether there are significant differences in performance metrics across various conditions, such as matches with and without spectators. This method is particularly suitable when the same teams are observed under different conditions, because it separates the variability between teams from the error term, so that each team effectively serves as its own control.

So here, the primary goal is to assess the impact of playing conditions on football team performance, specifically examining differences between home and away matches. This analysis focuses on performance metrics such as points, goals scored, expected goals (xG), and expected goals against (xGA).

The method involves the following steps:
- Collecting performance data for teams across different seasons.
- Applying Repeated Measures ANOVA to determine whether there are statistically significant differences in performance based on playing conditions (home vs. away).
- Analyzing results for various leagues and seasons to identify patterns and potential impacts of specific factors, such as the COVID-19 pandemic.

The structure of the results includes:
- League: The football league being analyzed.
- Season: The corresponding season.
- anova-F: The F-statistic from the ANOVA test.
- anova-pvalue: The p-value associated with the test, indicating statistical significance.
- anova-eta-sq: The effect size (eta-squared), representing the magnitude of the observed differences.
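
To show how the F-statistic and the effect size in these columns can be obtained, here is a minimal pure-Python sketch of a one-way repeated measures ANOVA. The function name `repeated_measures_anova`, the use of partial eta-squared, and the sample values are assumptions made for illustration; the project's own script may rely on a library and a different eta-squared variant.

```python
def repeated_measures_anova(rows):
    """One-way repeated measures ANOVA.

    rows: one tuple per team, one value per condition (e.g. home, away).
    Returns (F, df_conditions, df_error, partial_eta_squared).
    """
    n = len(rows)      # number of subjects (teams)
    k = len(rows[0])   # number of conditions
    grand_mean = sum(sum(r) for r in rows) / (n * k)
    condition_means = [sum(r[j] for r in rows) / n for j in range(k)]
    subject_means = [sum(r) / k for r in rows]

    ss_total = sum((x - grand_mean) ** 2 for r in rows for x in r)
    ss_conditions = n * sum((m - grand_mean) ** 2 for m in condition_means)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    # Variability left once condition and team effects are removed.
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
    partial_eta_sq = ss_conditions / (ss_conditions + ss_error)
    return f_stat, df_conditions, df_error, partial_eta_sq


# Hypothetical (home points, away points) for three teams in one season.
season = [(30, 20), (25, 22), (28, 21)]
f_stat, df1, df2, eta_sq = repeated_measures_anova(season)
print(f"F({df1}, {df2}) = {f_stat:.2f}, partial eta-squared = {eta_sq:.2f}")
# The p-value is the upper tail of the F distribution with (df1, df2)
# degrees of freedom, e.g. scipy.stats.f.sf(f_stat, df1, df2).
```

With only two conditions (home and away), this F-statistic equals the square of the paired t-statistic, so the test is equivalent to a paired t-test on the home minus away differences.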

In [2]:
from IPython.display import display, HTML

# Build the HTML that displays the ANOVA results table
html_code = """
<img src="replicabilite/new_statistical_method/results/reproduction_anova.png" style="height: auto;" />
"""

# Display the image
display(HTML(html_code))


#### Findings from the Analysis:

The results of the ANOVA across several major European football leagues from 2014 to 2020 reveal important insights into the home advantage and its variability over time.

Ligue 1:
Most seasons show statistically significant differences between home and away performances (p-value < 0.05). However, the 2020 season is an outlier with a p-value of 0.9529, indicating no significant difference. This suggests that home advantage was negligible in 2020, likely due to the absence of spectators caused by the pandemic.

La Liga:
Every season exhibits significant differences, with all p-values below 0.05, suggesting a consistent home advantage. Even in 2020, despite pandemic-related disruptions, the differences remained significant, implying that home advantage persisted.

English Premier League (EPL):
All seasons except 2020 show significant differences. The 2020 season has a p-value of 0.6022, indicating no significant difference. This suggests the pandemic had a strong impact, reducing the home advantage.

Bundesliga:
Significant differences are observed in all seasons except for 2019 and 2020, with p-values of 0.6049 and 0.0276 respectively. The lack of significance in 2019 suggests variability even before the pandemic, while 2020 results highlight the pandemic's effect.

Serie A:
Significant differences are found in all seasons except 2019 (p = 0.2219) and 2020 (p = 0.1086). This indicates that home advantage was not significant in these years, especially during the pandemic.

Russian Premier League (RFPL):
Several seasons, including 2015, 2017, 2019, and 2020, show no significant differences. This variability suggests that the home advantage fluctuated, with the pandemic further diminishing it in 2020.

#### Conclusion from the new statistical method:
The analysis demonstrates that the impact of COVID-19 on home advantage was not uniform across all leagues. Some leagues, like La Liga, maintained significant differences even during the pandemic. However, others, such as Serie A, RFPL, and the Bundesliga, exhibited a clear reduction in home advantage in 2020. The presence of non-significant results in 2019 for some leagues suggests that home advantage can vary for reasons beyond the pandemic. Nevertheless, the data indicates that COVID-19 had a substantial impact on football matches in 2020, neutralizing the traditional home advantage in several cases.

While the new findings largely support the previous conclusion that home advantage diminished during the COVID-19 seasons, they also introduce important nuances. The previous observation suggested a general trend of reduced home advantage across the board, but the current analysis reveals that this was not uniformly the case for all leagues. For example, La Liga maintained significant differences in home and away performances even in 2020, suggesting that the impact of the pandemic on home advantage was not as pronounced there. This nuanced view indicates that while the pandemic did influence home advantage in many leagues, the extent of its impact varied, highlighting league-specific factors and suggesting that the generalization of diminished home advantage might need reconsideration.

### 3rd change -