#### Rating and Ranking with Markov's Method:

#### The Premier League case

This is the __exercise__ notebook.

We will analyse Premier League results for four seasons. The results have been downloaded from [www.footballwebpages.co.uk](https://www.footballwebpages.co.uk/premier-league) for the following seasons:

- [2021-22 season](https://www.footballwebpages.co.uk/premier-league/match-grid/2021-2022);
- [2022-23 season](https://www.footballwebpages.co.uk/premier-league/match-grid/2022-2023);
- [2023-24 season](https://www.footballwebpages.co.uk/premier-league/match-grid/2023-2024), and
- [2024-25 season](https://www.footballwebpages.co.uk/premier-league/match-grid/2024-2025).



#### Import the needed Python modules

In [1]:
import requests
import io

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

### Set pandas and numpy options for printing results

In [2]:
np.set_printoptions(linewidth=1000)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.colheader_justify', 'center')

### Premier League winners:

*   **2021-22**: Manchester City (1-point gap from Liverpool, who finished second; Arsenal finished 5th)

*   **2022-23**: Manchester City (5-point gap from Arsenal)

*   **2023-24**: Manchester City (2-point gap from Arsenal)

*   **2024-25**: Liverpool (10 points above Arsenal)

In [3]:
N_TEAMS = 20

### Data file URLs from DSTA GitHub repository

In [4]:
DATA_FOLDER_URL = "https://raw.githubusercontent.com/ale66/learn-datascience/main/week-9/Ranking_premier_league/data"

# League table GitHub URLs
league_table_2021_2022 = f"{DATA_FOLDER_URL}/2021_2022_LeagueTable.csv"
league_table_2022_2023 = f"{DATA_FOLDER_URL}/2022_2023_LeagueTable.csv"
league_table_2023_2024 = f"{DATA_FOLDER_URL}/2023_2024_LeagueTable.csv"
league_table_2024_2025 = f"{DATA_FOLDER_URL}/2024_2025_LeagueTable.csv"

# Match grid GitHub URLs
match_grid_2021_2022 = f"{DATA_FOLDER_URL}/2021_2022_MatchGrid.csv"
match_grid_2022_2023 = f"{DATA_FOLDER_URL}/2022_2023_MatchGrid.csv"
match_grid_2023_2024 = f"{DATA_FOLDER_URL}/2023_2024_MatchGrid.csv"
match_grid_2024_2025 = f"{DATA_FOLDER_URL}/2024_2025_MatchGrid.csv"

# Massey & Keener results
results_2023_2024 = f"{DATA_FOLDER_URL}/2023_2024_MergedResults.csv"

### Utility functions

In [5]:
def read_league_tbl(url: str) -> pd.DataFrame:
    """
    Read a season league table CSV from a GitHub URL.
    The first row is skipped, as it groups information in
    Home, Away and Total, which is not needed.

    :param url: CSV GitHub URL.
    :return: League table as dataframe.
    """
    response = requests.get(url)
    response.raise_for_status()  # Fail early on HTTP errors
    content = response.content.decode("utf-8")
    league_tbl = pd.read_csv(io.StringIO(content), skiprows=1)

    # Add actual team ranking
    league_tbl["Actual_Ranking"] = np.arange(1, N_TEAMS + 1)
    return league_tbl


def read_match_grid(url: str) -> pd.DataFrame:
    """
    Read a season match grid CSV from a GitHub URL.

    Each match entry has the format ="GH-GA" (except for NaN on the diagonal).
    GH are goals scored by the home team, and GA are goals scored
    by the away team. The function reads the match grid CSV,
    removes '=' and '"', and adds "0-0" on the diagonal to remove NaNs.

    :param url: CSV GitHub URL.
    :return: Match grid as dataframe.
    """
    response = requests.get(url)
    response.raise_for_status()  # Fail early on HTTP errors
    content = response.content.decode("utf-8")

    return (
        pd.read_csv(io.StringIO(content), dtype=str, index_col=0)
        .replace('"', '', regex=True)
        .replace('=', '', regex=True)
        .fillna("0-0")
    )

### Set current working data files and next season files

Hint: Change these variables in case you would like to rate / rank teams based on a different season and check the estimates against the actual rankings of the following season.

In [6]:
# Current (working) season
current_season = "2023 - 2024"
curr_league_tbl = league_table_2023_2024
curr_match_grid = match_grid_2023_2024

# Next season
coming_season = "2024 - 2025"
next_league_tbl = league_table_2024_2025
next_match_grid = match_grid_2024_2025

### Markov's method

#### For Markov, we need the match grid

In [None]:
match_grid = read_match_grid(curr_match_grid)
match_grid

#### Create Markov's V matrix


Recall that $V$ is an $n \times n$ matrix where $v_{ij}$ is the total number of goals conceded by team $i$ to team $j$.


Here $n$ is the number of teams in the league.

#### Create Markov's S matrix

Below is another refresher.

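$S$ is an $n \times n$ matrix where $s_{ij}$ is the proportion of goals team $i$ conceded to team $j$ over the total goals team $i$ conceded, i.e. $s_{ij} = v_{ij} / \sum_{k} v_{ik}$, so every row of $S$ sums to 1.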
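
As a minimal sketch of the row normalisation, the snippet below turns a small matrix of goals conceded into proportions. The 3-team matrix `V` and its numbers are hypothetical and only illustrate the arithmetic; they are not taken from the match grid.

```python
import numpy as np

# Hypothetical goals-conceded matrix for 3 teams:
# V[i, j] = goals conceded by team i to team j.
V = np.array([
    [0, 2, 4],
    [1, 0, 1],
    [3, 3, 0],
], dtype=float)

# Total goals conceded by each team (one total per row).
row_totals = V.sum(axis=1, keepdims=True)

# Divide every row by its total, so each row becomes a set of proportions.
S = V / row_totals

print(S)
print(S.sum(axis=1))  # every row should sum to 1
```

This sketch assumes every team conceded at least one goal; a row of zeros would cause a division by zero and would need separate handling.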

#### Exercise 1: Complete the code to calculate Markov's V and S matrices

#### Step-by-step:

1.   Parse scores. Example: "3-2". The home team scored 3 goals and the away team 2.

   Hint: Pandas [applymap documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html). Note that `applymap` is deprecated since pandas 2.1 in favour of `DataFrame.map`.

2.   Match every team's home match with the respective away match against the same opponent.

   Hint: The home match of team *i* against *j* is element *ij*. The respective away match is element *ji* - row and column indexes are swapped...
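
Step 1 can be illustrated on a single score string. The helper `parse_score` below is a hypothetical name introduced for this sketch, not part of the notebook; it only shows how one "GH-GA" entry splits into two integers.

```python
def parse_score(score: str) -> tuple[int, int]:
    """Split a "GH-GA" score string into (home_goals, away_goals)."""
    home_goals, away_goals = score.split("-")
    return int(home_goals), int(away_goals)


home, away = parse_score("3-2")
print(home, away)  # 3 2
```

A function of this kind can then be applied to every cell of the match grid, which is where the element-wise hint above comes in.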

In [8]:
# TODO: Write your code below.
# Save the output in variables V_dataframe and S_dataframe

# Parse score and get goals of a home match


# Parse score and get goals of an away match
# The grid is transposed to match every team's respective
# home and away matches

#### Create transition and counter dictionaries

In [None]:
# Dictionary with teams as keys and lists of probabilities as values
# Each list holds the probabilities of moving from the current team
# to each team of the league (fair-weather fan logic)
transit_dict = S_dataframe.T.to_dict(orient="list")

teams = S_dataframe.columns.tolist()

# Dictionary with teams as keys and number of visits as values
counter_dict = {team: 0 for team in teams}

#### Markov's simulation of the fair-weather fan

By repeatedly moving towards the team that appears strongest at the moment, the fair-weather fan will end up indicating, through the share of time spent supporting each team, the overall best team.
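
For a well-behaved chain, the long-run visit frequencies of such a random walk approximate the stationary distribution $\pi$ with $\pi = \pi S$. The sketch below computes $\pi$ by power iteration as a possible cross-check for a simulation. The 3-team matrix `S` and the function name `stationary_distribution` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical row-stochastic matrix for 3 teams (each row sums to 1).
S = np.array([
    [0.00, 0.50, 0.50],
    [0.75, 0.00, 0.25],
    [0.60, 0.40, 0.00],
])


def stationary_distribution(S, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi = pi @ S by repeated multiplication."""
    n = S.shape[0]
    pi = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        new_pi = pi @ S
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi


pi = stationary_distribution(S)
print(pi)           # rating of each team
print(pi.sum())     # should be (numerically) 1
```

Power iteration of this kind relies on the chain being irreducible and aperiodic; whether that holds for a real season's matrix is something to verify rather than assume.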

In [None]:
# How many iterations to run?
N = 100_000

In [None]:
# Initialize process by randomly selecting a team
curr_team = np.random.choice(teams)
counter_dict[curr_team] += 1

# Run the simulation
for i in range(N):
    probs = transit_dict[curr_team]
    curr_team = np.random.choice(teams, p=probs)
    counter_dict[curr_team] += 1

# Get the ratings
ratings = [count / (N + 1) for count in counter_dict.values()]

markov_df = (
    pd.DataFrame(ratings, index=teams, columns=["Markov_Rating"])
    .sort_values(by="Markov_Rating", ascending=False)
    )

#### Import the league table to get actual rankings and points scored

In [None]:
league_table = read_league_tbl(curr_league_tbl)

league_table

#### We keep only teams, actual ranking and points.

In [None]:
required_cols = ["Unnamed: 1", "Pts", "Actual_Ranking"]

renaming = {"Unnamed: 1": "Teams", "Pts": "Points"}

# Make a copy of the league table, keeping only the necessary columns renamed
# The index is set to the team names for the table join below
league_table = (
    league_table
    .loc[ :, required_cols]
    .copy()
    .rename(columns=renaming)
    .set_index("Teams")
)

league_table

In [None]:
# Tidy up team names: strip trailing markers of the form " (-N)"
league_table.index = league_table.index.str.replace(r"\s\(-\d{1,}\)", "", regex=True)

#### Use a MinMaxScaler to scale Markov ratings for plotting.

Please see the relevant [sklearn MinMaxScaler documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).
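
To make the transformation concrete, here is a NumPy-only sketch of the min-max formula applied to a target range, the same formula the MinMaxScaler documentation describes for `feature_range`. The function name `min_max_scale`, the three ratings and the range (16, 91) are hypothetical values chosen for illustration.

```python
import numpy as np


def min_max_scale(values, new_min, new_max):
    """Linearly map values so that min -> new_min and max -> new_max."""
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    unit = (values - old_min) / (old_max - old_min)  # now in [0, 1]
    return unit * (new_max - new_min) + new_min


# Hypothetical ratings, scaled to a hypothetical points range.
ratings = np.array([0.02, 0.05, 0.08])
scaled = min_max_scale(ratings, new_min=16, new_max=91)
print(scaled)  # approximately [16.0, 53.5, 91.0]
```

The scaling is linear, so it changes the units of the ratings but not the order of the teams.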

In [None]:
# Scale the ratings between the weakest and strongest teams' actual points.
# MinMaxScaler accepts a tuple (min, max) as input to define the range.
min_max_scaler = MinMaxScaler((league_table.Points.min(), league_table.Points.max()))

markov_df["Markov_Scaled_Rating"] = min_max_scaler.fit_transform(
    markov_df
        .loc[:, "Markov_Rating"]
        .values
        .reshape(-1, 1)
)

#### Add Markov ranking.

In [None]:
markov_df["Markov_Ranking"] = np.arange(1, N_TEAMS + 1)

markov_df

#### Join Markov results with the league table and the actual ratings based on team names

In [None]:
markov_df = markov_df.join(league_table)

markov_df

#### Keep Markov rating and ranking from the match grid

In [None]:
cols_to_keep = [
    "Markov_Rating",
    "Markov_Scaled_Rating",
    "Markov_Ranking"
    ]

# Data needed from Markov output - sort by actual ranking first
data_to_keep = (
    markov_df
    .sort_values("Actual_Ranking", ascending=True)
    .loc[:, cols_to_keep]
    .copy()
    )

#### Import merged data with Massey and Keener results

In [None]:
# Use Teams column as index to join it later with Markov
results_df = pd.read_csv(results_2023_2024, index_col="Teams")

#### Merge Markov results with Massey and Keener results

In [None]:
results_df = results_df.join(data_to_keep)

results_df

#### Plot Markov's scaled rating and ranking side by side with actual ranking and points scored

Documentation for [matplotlib.pyplot horizontal bar plots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.barh.html)

In [None]:
# Initialize grid of plots
figure, axis = plt.subplots(nrows=1, ncols=2, figsize=(12, 4), dpi=160)

# Plot Markov scaled rating - plot 0, row 0
axis[0].barh(
    markov_df["Markov_Ranking"],
    markov_df["Markov_Scaled_Rating"],
    height=0.6, align="center"
    )

# Configure y axis
axis[0].set_yticks(
    markov_df["Markov_Ranking"],
    labels=markov_df.index,
    fontsize=7
    )

axis[0].invert_yaxis()  # labels read top-to-bottom

# X-axis and title
axis[0].tick_params(axis="x", labelsize=6)
axis[0].set_xlabel('Markov Scaled Rating', fontsize=8)
axis[0].set_title(f'Season {current_season} Markov Scaled Rating', fontsize=8)

# Plot actual ranking and points scored - plot 1, row 0
axis[1].barh(
    markov_df["Actual_Ranking"],
    markov_df["Points"],
    height=0.6, align='center'
    )

# Configure y axis
axis[1].set_yticks(
    markov_df["Actual_Ranking"],
    labels=markov_df.index,
    fontsize=7
    )

axis[1].invert_yaxis()  # labels read top-to-bottom

# X-axis and title
axis[1].tick_params(axis="x", labelsize=6)
axis[1].set_xlabel('Actual Points', fontsize=8)
axis[1].set_title(f'Season {current_season} Points Scored', fontsize=8)

# Use 'tight_layout' to avoid overlapping text
plt.tight_layout()
plt.show()

### Get rankings from all methods in a new table

In [None]:
rankings = [
    "Actual_Ranking",
    "Massey_Ranking",
    "Keener_Ranking",
    "Markov_Ranking"
    ]

ranks_df = results_df.loc[:, rankings].copy()
ranks_df

In [None]:
ranks_df.corr()
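
`DataFrame.corr()` uses the Pearson coefficient by default. When the columns are rankings without ties, Pearson on the ranks coincides with Spearman's rank correlation. The sketch below checks this on two hypothetical 5-team rankings; the function name `spearman_no_ties` and the data are assumptions for illustration, and whether the real ranking columns are tie-free should be checked.

```python
import numpy as np


def spearman_no_ties(rank_a, rank_b):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    a = np.asarray(rank_a, dtype=float)
    b = np.asarray(rank_b, dtype=float)
    n = len(a)
    d_squared = ((a - b) ** 2).sum()
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings of 5 teams by two methods.
actual = [1, 2, 3, 4, 5]
estimated = [2, 1, 3, 5, 4]

rho = spearman_no_ties(actual, estimated)
pearson_on_ranks = np.corrcoef(actual, estimated)[0, 1]

print(rho)               # 0.8, up to floating-point rounding
print(pearson_on_ranks)  # the same value, up to floating-point rounding
```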

#### Import the table of the subsequent season to check the estimated rankings

In [None]:
next_league_table = read_league_tbl(next_league_tbl)

# Uncomment if you want to see the raw table
# next_league_table

#### Keep necessary columns and rename them

In [None]:
required_cols = ["Unnamed: 1", "P.2", "W.2", "D.2", "L.2", "F",
                 "A", "+/-", "Pts", "Actual_Ranking"]

renaming = {
    "Unnamed: 1": "Teams",
    "P.2": "Total_Matches_Played",
    "W.2": "Total_Wins",
    "D.2": "Total_Draws",
    "L.2": "Total_Losses",
    "F": "Goals_Scored",
    "A": "Goals_Conceded",
    "+/-": "Goal_Difference",
    "Pts": "Points"
    }

# Make a copy of the league table, keeping only the necessary columns renamed
next_league_table = (
    next_league_table
    .loc[:, required_cols]
    .copy()
    .rename(columns = renaming)
)

next_league_table

#### Recall estimated rankings from Massey, Keener and Markov

In [None]:
ranks_df