# Analysis of reproducibility
### Solène Lemonnier & Pauline Roches

This project is based on the paper "[COVID and Home Advantage in Football: An Analysis of Results and xG Data in European Leagues](https://blog.mathieuacher.com/FootballAnalysis-xG-COVIDHome/)" by Mathieu Acher. This study was published on May 23, 2021. 

## Analysis of the study

After reviewing the study, we outlined its context and what is at stake. The study aims to assess home advantage in football and how it depends on the presence of supporters in the stadium. The initial hypothesis is that their presence has a positive effect on team performance, both in terms of points and goals scored and in terms of expected points and expected goals.\
\
The COVID-19 pandemic provided data that allowed for comparisons between matches played in empty and filled stadiums. However, it is important to note that the impact extended beyond the spectators and also affected the teams themselves, potentially disrupting training sessions or sidelining players due to illness.\
\
To conduct this analysis, six European football leagues were studied from 2014 to 2021. It is worth noting that the crisis was managed differently across leagues, which makes comparisons between them more challenging.




### Methodology 

The data were sourced from the website "[Understat](https://understat.com/)". The variables used include the number of matches, goals, expected goals (xG), expected goals conceded (xGA), points, and expected points (xPoints).\
\
To analyze the data, the following methods were used:
- The **Wilcoxon signed-rank test**: a non-parametric test used to compare two paired sets of values, often before and after a treatment. The aim is to determine whether the two measurements differ significantly without assuming that the differences are normally distributed. The test yields a **p-value**: the probability of observing results at least as extreme as those in the sample under the null hypothesis that there is no effect or difference. If the p-value is small (< 0.05), the null hypothesis is rejected, which indicates an effect.
- **Cohen's d**: a measure of effect size, used to determine whether an observed difference between two groups is practically significant, beyond mere statistical significance.
- The **Mann-Whitney U test**: a non-parametric test used to compare two independent groups and assess whether their distributions differ significantly.
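
As an illustration of the three methods listed above, here is a minimal sketch using `scipy.stats`. The `home` and `away` values are made-up numbers, not data from the study, and `cohens_d` is a hypothetical helper based on the pooled sample standard deviation, which is one common definition among several. Passing `alternative` explicitly to `mannwhitneyu` avoids relying on the library's default, which may differ between SciPy versions.

```python
from math import sqrt
from statistics import mean, variance

from scipy.stats import mannwhitneyu, wilcoxon


def cohens_d(a, b):
    """Cohen's d with the pooled sample standard deviation (one common definition)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Made-up points per game for six hypothetical teams (not data from the study).
home = [1.9, 1.7, 2.1, 1.5, 1.8, 2.05]
away = [1.2, 1.3, 1.1, 1.4, 1.0, 1.5]

# Paired comparison: each team's home value against its own away value.
w_stat, w_p = wilcoxon(home, away)

# Independent comparison: the two samples are treated as unrelated groups.
u_stat, u_p = mannwhitneyu(home, away, alternative="two-sided")

# Effect size of the home/away difference.
d = cohens_d(home, away)

print(f"Wilcoxon p-value: {w_p:.4f}")
print(f"Mann-Whitney U p-value: {u_p:.4f}")
print(f"Cohen's d: {d:.2f}")
```

The Wilcoxon test fits when the same teams are measured in both conditions, whereas the Mann-Whitney U test fits when the two samples are not matched.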

### Key findings

The study observed that, generally, there is a notable advantage in terms of points gained when playing at home. However, during the COVID-19 seasons, this advantage diminished significantly or even reversed.

## Reproduction of the study

We begin by collecting data for each league and season using web scraping.

We added the `scrap.py` file, which allows us to retrieve all team data from *2014* to *2022* for both *home* and *away* games. The results are stored in the CSV file `understat_team_stats_home_away.csv`.

### Table

We then reproduced the table using `reproduce_diff_points.py`, which shows the differences in points and xPoints between seasons for all leagues. The resulting image is `diff_points_xpoints.png`.
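
The kind of aggregation behind such a table can be sketched as follows. This is not the content of `reproduce_diff_points.py`: the column names (`league`, `season`, `team`, `venue`, `pts`, `xpts`), the sample rows, and the helper `home_away_differences` are all assumptions made for illustration, and the real CSV may be organized differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column names in the real CSV may differ.
SAMPLE_CSV = """league,season,team,venue,pts,xpts
EPL,2019,TeamA,home,30,28.4
EPL,2019,TeamA,away,22,24.1
EPL,2019,TeamB,home,25,26.0
EPL,2019,TeamB,away,27,23.5
"""


def home_away_differences(csv_text):
    """Return {(league, season): (points diff, xPoints diff)}, home minus away."""
    # For each league and season, accumulate [points, xPoints] per venue.
    totals = defaultdict(lambda: {"home": [0.0, 0.0], "away": [0.0, 0.0]})
    for row in csv.DictReader(io.StringIO(csv_text)):
        bucket = totals[(row["league"], row["season"])][row["venue"]]
        bucket[0] += float(row["pts"])
        bucket[1] += float(row["xpts"])
    return {
        key: (t["home"][0] - t["away"][0], t["home"][1] - t["away"][1])
        for key, t in totals.items()
    }


diffs = home_away_differences(SAMPLE_CSV)
for (league, season), (pts_diff, xpts_diff) in diffs.items():
    print(league, season, round(pts_diff, 1), round(xpts_diff, 1))
```

A positive difference means that teams collected more points (or expected points) at home than away in that league and season.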

In [2]:
from IPython.display import display, HTML

# Build the HTML that displays the images side by side
html_code = """
<div style="display: flex; justify-content: space-between;">
    <img src="results/diff_points_xpoints.png" style="max-width: 48%; height: auto;" />
    <img src="results_acherm/diff_points_xpoints_acherm.png" style="max-width: 48%; height: 50%;" />
</div>
"""

# Display the images side by side
display(HTML(html_code))


Comparing our table (left) with Mathieu Acher's (right), we obtain the same results once the rounding is handled carefully. Since xPoints are floats, the values must be rounded conventionally rather than rounded down, which was the default behavior.\
\
We can see a clear home advantage in the different leagues, which diminishes during the COVID period.
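
The rounding issue mentioned above can be illustrated with a short sketch. It assumes that "conventional" rounding means rounding to the nearest integer with halves rounded up; `round_half_up` is a hypothetical helper and `52.5` is a made-up xPoints value. Note that Python's built-in `round` rounds halves to the nearest even integer, so it does not always match that convention either.

```python
from decimal import ROUND_HALF_UP, Decimal


def round_half_up(value):
    """Round to the nearest integer, with ties rounded away from zero."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


xpoints = 52.5  # made-up xPoints value

truncated = int(xpoints)          # drops the fractional part (rounds down here)
builtin = round(xpoints)          # built-in rounding: ties go to the even integer
half_up = round_half_up(xpoints)  # assumed "conventional" rounding

print(truncated, builtin, half_up)  # prints: 52 52 53
```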

### Graphs

We then reproduced the graphs using `graphs_par_ligue.py`, which allows us to observe the evolution of points earned and expected points both at home and away for all leagues from 2014 to 2020. The outputs are stored in the `evolutions_par_ligue` folder, with one graph per league.

In [3]:
from IPython.display import display, HTML

# Build the HTML that displays each pair of images side by side on one row
html_code = """
<div style="display: flex; justify-content: space-between; flex-wrap: wrap;">

    <!-- Bundesliga -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./results/evolutions_par_ligue/evolution_points_Bundesliga.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_Bundesliga_acherm.png" width="40%" />
    </div>

    <!-- EPL -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./results/evolutions_par_ligue/evolution_points_EPL.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_EPL_acherm.png" width="40%" />
    </div>

    <!-- La liga -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./results/evolutions_par_ligue/evolution_points_La_liga.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_La_liga_acherm.png" width="40%" />
    </div>

    <!-- Ligue 1 -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./results/evolutions_par_ligue/evolution_points_Ligue_1.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_Ligue_1_acherm.png" width="40%" />
    </div>

    <!-- RFPL -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./results/evolutions_par_ligue/evolution_points_RFPL.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_RFPL_acherm.png" width="40%" />
    </div>

    <!-- Serie A -->
    <div style="display: flex; flex-direction: row; align-items: center; justify-content: space-between; margin-bottom: 20px">
        <img src="./results/evolutions_par_ligue/evolution_points_Serie_A.png" width="45%" />
        <img src="./results_acherm/evolutions_par_ligue_acherm/evolution_points_Serie_A_acherm.png" width="40%" />
    </div>

</div>
"""

# Display the images side by side
display(HTML(html_code))


Comparing our graphs (left) with Mathieu Acher's (right), we observe the same averages and, therefore, the same variations. The curves in the left and right graphs are identical.\
\
The loss of the home advantage is particularly noticeable during the COVID seasons.

### Statistical tests

Replicability criteria:
- dataset: Python library or scraping, or another dataset
- seasons: extend to 2021, 2022 and 2023
- statistical tests
- Python version used for the Mann-Whitney U test
- tests on matches or on teams