<a href="https://colab.research.google.com/github/HimalKarkal/netball-analysis/blob/master/Glicko_Rating.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [52]:
# Cloning the data from GitHub

! git clone 'https://github.com/HimalKarkal/netball-analysis.git'

fatal: destination path 'netball-analysis' already exists and is not an empty directory.


# Creating the Rating System

The following block of code defines the Glicko rating function and its dependent functions. The rating function called rate() accepts the following inputs:

1. **ratingP**: Player's current rating
2. **ratingO**: Opponent's current rating. In this implementation, the opponent is the league's rating with respect to the player. The idea of rating the league is to get an idea of how hard the league feels for any particular player.
3. **margin**: This is the difference between the player's score for any given statistic and the positional average for that statistic. The statistics on which a player is rated depend on her position in a game.
4. **positiveStat**: This is a boolean (True or False) input that informs the rating system whether the statistic the players are being rated on is positive or negative. For example, while scoring more goals (positive statistic) should earn a player more rating points, causing more fouls (negative statistic) should cause a player to lose points.
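
As a rough illustration of how the last two inputs interact, the sketch below squashes a margin onto the interval (0, 1) with a logistic function and flips the sign of the margin for negative statistics. The name `outcome_from_margin` is a placeholder chosen for this example; the notebook's own version is the o() function in the code cell that follows.

```python
import math

def outcome_from_margin(margin, positive_stat=True):
    """Map a margin to a score in (0, 1); negative statistics flip the sign."""
    signed_margin = margin if positive_stat else -margin
    return 1 / (1 + math.exp(-signed_margin))

# A margin of 0 (exactly the positional average) always maps to 0.5
print(outcome_from_margin(0.0))

# The same above-average margin is rewarded for a positive statistic
# and penalised for a negative one
print(outcome_from_margin(1.5, positive_stat=True))
print(outcome_from_margin(1.5, positive_stat=False))
```

The two scores for the same margin always add up to 1, so a player who is above average on a negative statistic is treated exactly like a player who is equally far below average on a positive one.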

The rate() function returns a GlickoRating object containing the player's updated rating and rating deviation. These two numbers are stored as attributes of the object:
1. **Rating** = object.r
2. **Rating deviation** = object.rd

The initial rating and rating deviation for players are set to 1500 and 350 respectively, as suggested in the paper by Mark Glickman.

The rate() function will need to be called twice for each player and statistic: once to update the player's rating, and a second time to update the relativeLeagueRating.
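
For reference, the single-opponent update that the code cell below implements can be written out as follows. Here $r$ and $RD$ are the player's rating and rating deviation, $r_o$ and $RD_o$ are the opponent's, $m$ is the margin, and $s$ is the outcome. The logistic mapping from $m$ to $s$ is this notebook's adaptation; the original Glicko system takes $s$ directly from the game result. The code also adds a small constant ($10^{-9}$) inside the inverse that defines $d^2$ to avoid division by zero.

$$
\begin{aligned}
q &= \frac{\ln 10}{400}, \qquad
g(RD) = \frac{1}{\sqrt{1 + 3q^2 RD^2 / \pi^2}} \\
E &= \frac{1}{1 + 10^{-g(RD_o)\,(r - r_o)/400}} \\
d^2 &= \left[\, q^2\, g(RD_o)^2\, E\,(1 - E) \,\right]^{-1} \\
s &= \begin{cases} \dfrac{1}{1 + e^{-m}} & \text{positive statistic} \\[2ex] \dfrac{1}{1 + e^{m}} & \text{negative statistic} \end{cases} \\
r' &= r + \frac{q}{1/RD^2 + 1/d^2}\; g(RD_o)\,(s - E) \\
RD' &= \sqrt{\left( \frac{1}{RD^2} + \frac{1}{d^2} \right)^{-1}}
\end{aligned}
$$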

In [53]:
# CREATING THE RATING SYSTEM

import math

class GlickoRating:
    def __init__(self, r=1500, rd=350):
        self.r = r
        self.rd = rd

q = math.log(10) / 400

def g(rd):
  return 1 / math.sqrt(1 + (3 * q**2 * rd**2) / math.pi**2)

def E(r, ro, rdo):
    return 1 / (1 + 10**(-g(rdo) * (r - ro) / 400))

def o(margin, positiveStat = True):
  if positiveStat == True:
    return 1 / (1 + math.exp(-margin))
  else:
    return 1 / (1 + math.exp(margin))

def d2(r, ro, rdo):
  return (q**2 * g(rdo)**2 * E(r, ro, rdo) * (1 - E(r, ro, rdo)) + 1e-9)**-1 #Added epsilon to prevent error.

def rate(ratingP, ratingO, margin, positiveStat=True):
  r = ratingP.r
  rd = ratingP.rd
  ro = ratingO.r
  rdo = ratingO.rd
  e = E(r, ro, rdo)

  # o() already handles both cases, so the flag can be passed straight through
  outcome = o(margin, positiveStat)

  r_updated = r + (q / ((1 / rd**2) + 1 / d2(r, ro, rdo))) * g(rdo) * (outcome - e)
  rd_updated = math.sqrt(1 / ((1 / rd**2) + 1 / d2(r, ro, rdo)))

  return GlickoRating(r_updated, rd_updated)

# Processing data to rate players

Before we can implement the rating system, we will need to process the data into a format that allows us to apply the rate() function.

The data from GitHub is organised by match. Each match folder contains several files with various data. The files that are of interest to us for the purpose of rating players are:

1. **Substitutions**: This file contains information on all the substitutions in the match. This allows us to calculate the time a player spends in each position. A player is assigned to the position she spends the most time in during a match. Consequently, her performance in the match will be judged based on statistics deemed most important for her main position.

2. **Player Stats**: This file contains various player statistics for the match. For example, *goals*, *assists*, *blocks* etc. By combining the Player Stats data from all four matches in a gameweek and sorting by position, we can calculate positional averages for each statistic. This is then used along with individual player stats to calculate the margin required for the rate() function.

3. **Team Stats**: The purpose of this rating system is to rate individual player performances so that players in different positions can be compared fairly, but there is no standard against which the system can be evaluated. In the absence of a better evaluation strategy, we compare each team's end-of-season average Glicko Rating with its end-of-season points on the ranking table.

## The Load Files Function

This function uses the *glob* package to search for files that match a wildcard pattern and store their paths as a list.

We will use this function to create lists containing the file paths for all **Substitutions** and **Player Stats** files.

The loadFiles() function will accept the following inputs:

1. **tournament**: The GitHub data contains data on multiple tournaments like SSN, ANC, and CONCUP. This *string* input specifies the particular tournament we are interested in rating.

2. **season**: This *string* input specifies the season (year) we are interested in rating.

3. **file_name**: This *string* input specifies the file_name we are interested in. For example, substitutions or playerStats.

The function will output a list of filepaths for the specified file name sorted in ascending order of match number.

NOTE: The output of the loadFiles function should contain 56 file-paths for any 'SSN' season. While an entire season of SSN Netball will have 60 matches, the last 4 matches are finals and hence will not be played by all teams. Hence, these four games are excluded from the paths list.
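
To see which file names the wildcard pattern in loadFiles() accepts, the file-name part of the pattern can be tried on a single name with the standard-library `fnmatch` module, which glob relies on for matching. This is only a sketch: the helper name `matches_pattern` is hypothetical, and the example file name is one of the 2022 SSN substitutions files.

```python
from fnmatch import fnmatchcase

def matches_pattern(candidate, file_name, season, tournament):
    # Same wildcard layout as the file-name part of the glob pattern in loadFiles()
    pattern = f"*{file_name}*{season}*{tournament}*.csv"
    return fnmatchcase(candidate, pattern)

example = "116650101_substitutions_2022_SSN_11665_r1_g1.csv"

print(matches_pattern(example, "substitutions", "2022", "SSN"))  # True
print(matches_pattern(example, "playerStats", "2022", "SSN"))    # False
print(matches_pattern(example, "substitutions", "2021", "SSN"))  # False
```

Because the pattern is built from wildcards, the three inputs must appear in the file name in the order file name, season, tournament for a file to be picked up.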

In [54]:
# DEFINING THE loadFiles() FUNCTION

import glob

def loadFiles(tournament, season, file_name):

  file_paths = glob.glob(f"/content/netball-analysis/data/matchCentre/processed/*/*{file_name}*{season}*{tournament}*.csv")
  file_paths.sort()
  file_paths = file_paths[:-4]

  return file_paths

In [55]:
# TESTING THE loadFiles() FUNCTION

tournament = 'SSN'
season = '2022'
file_name = 'substitutions'

paths = loadFiles(tournament, season, file_name)

print(paths)
print()
print(f'The paths list has {len(paths)} file-paths')

del paths, tournament, season, file_name

['/content/netball-analysis/data/matchCentre/processed/116650101_2022_SSN_11665_r1_g1/116650101_substitutions_2022_SSN_11665_r1_g1.csv', '/content/netball-analysis/data/matchCentre/processed/116650102_2022_SSN_11665_r1_g2/116650102_substitutions_2022_SSN_11665_r1_g2.csv', '/content/netball-analysis/data/matchCentre/processed/116650103_2022_SSN_11665_r1_g3/116650103_substitutions_2022_SSN_11665_r1_g3.csv', '/content/netball-analysis/data/matchCentre/processed/116650104_2022_SSN_11665_r1_g4/116650104_substitutions_2022_SSN_11665_r1_g4.csv', '/content/netball-analysis/data/matchCentre/processed/116650201_2022_SSN_11665_r2_g1/116650201_substitutions_2022_SSN_11665_r2_g1.csv', '/content/netball-analysis/data/matchCentre/processed/116650202_2022_SSN_11665_r2_g2/116650202_substitutions_2022_SSN_11665_r2_g2.csv', '/content/netball-analysis/data/matchCentre/processed/116650203_2022_SSN_11665_r2_g3/116650203_substitutions_2022_SSN_11665_r2_g3.csv', '/content/netball-analysis/data/matchCentre/pro

## The Classify Players Function

This function will accept the **Substitutions** dataframe and classify all players to the position they spent the most time in during the match excluding the bench. Players who spend the entire match on the bench are dropped. The output is a dataframe containing *playerId*, *time spent in each of the 8 positions*, and *position* for all players in the substitutions dataframe.

In the main implementation below, instead of passing each match's Substitutions dataframe into the function one by one, a concatenated substitutions dataframe containing all the data for a particular gameweek will be passed.

NOTE: Pandas will need to be imported to read the Substitution file as a dataframe. This function accepts a pandas dataframe and not the csv.
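
The same classification can also be sketched without an explicit loop by using a pivot table. The example below is an alternative illustration on made-up data, not the implementation used in this notebook. It assumes the column names *playerId*, *startingPos* and *duration*, and it drops players with no on-court time at all rather than comparing bench time with 3600.

```python
import pandas as pd

# Toy substitutions data (hypothetical players and durations)
toy_subs = pd.DataFrame({
    "playerId":    [1, 1, 1, 2, 2, 3],
    "startingPos": ["GS", "GA", "S", "S", "C", "S"],
    "duration":    [1000, 2000, 600, 3000, 600, 3600],
})

# One row per player, one column per position, summed time in each cell
time_in_position = toy_subs.pivot_table(
    index="playerId", columns="startingPos", values="duration",
    aggfunc="sum", fill_value=0,
)

# Ignore the bench (S) and drop players who never took the court
on_court = time_in_position.drop(columns="S")
on_court = on_court[on_court.sum(axis=1) > 0]

# The main position is the on-court position with the most time
main_position = on_court.idxmax(axis=1)
print(main_position)
```

In this toy data, player 1 is classified as GA, player 2 as C, and player 3 is dropped because she spent the whole match on the bench.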

In [56]:
# DEFINING THE classifyPlayers() FUNCTION

import pandas as pd

def classifyPlayers(df_subs):

    # Creating a nested dictionary to store the time spent by a player in each position
    position_dict = {'GS': {}, 'GA': {}, 'WA': {}, 'C': {}, 'WD': {}, 'GD': {}, 'GK': {}, 'S': {}}

    # Looping through each row of the substitutions dataframe to populate the nested dictionary (position_dict)
    for i, row in df_subs.iterrows():

        player, position, duration = row['playerId'], row['startingPos'], row['duration']

        if player in position_dict[position]:
            position_dict[position][player] += duration

        else:
            position_dict[position][player] = duration

    # Converting the nested dictionary (position_dict) to a pandas dataframe, filling Nan's with 0, resetting the index, and renaming the index column to 'playerId'
    df_time_in_position = pd.DataFrame(position_dict).fillna(0).reset_index().rename(columns={'index': 'playerId'})

    # Dropping players that spent the entire match on the bench
    df_time_in_position = df_time_in_position[df_time_in_position['S'] != 3600].copy()  # .copy() avoids pandas' SettingWithCopyWarning when 'position' is added below

    # Identifying the position, other than the bench (S), where the player spent most of her time during the game
    df_time_in_position['position'] = df_time_in_position.drop(columns=['playerId', 'S']).idxmax(axis=1)

    return df_time_in_position

In [57]:
# TESTING THE classifyPlayers() FUNCTION

df_subs = pd.read_csv('/content/netball-analysis/data/matchCentre/processed/111080703_2020_SSN_11108_r7_g3/111080703_substitutions_2020_SSN_11108_r7_g3.csv')

df_time_in_position = classifyPlayers(df_subs)

print(df_time_in_position.head())
print()
print(df_time_in_position.info())

   playerId      GS      GA   WA    C   WD   GD   GK       S position
0    999128  1887.0  1689.0  0.0  0.0  0.0  0.0  0.0    24.0       GS
1   1001944  1713.0     0.0  0.0  0.0  0.0  0.0  0.0  1887.0       GS
2   1001357  2513.0   674.0  0.0  0.0  0.0  0.0  0.0   413.0       GS
3   1014128  1087.0     0.0  0.0  0.0  0.0  0.0  0.0  2513.0       GS
4   1001711     0.0  1911.0  0.0  0.0  0.0  0.0  0.0  1689.0       GA

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 0 to 19
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   playerId  20 non-null     int64  
 1   GS        20 non-null     float64
 2   GA        20 non-null     float64
 3   WA        20 non-null     float64
 4   C         20 non-null     float64
 5   WD        20 non-null     float64
 6   GD        20 non-null     float64
 7   GK        20 non-null     float64
 8   S         20 non-null     float64
 9   position  20 non-null     object 
dtypes: float

## The Calculate Means Function

The Glicko rating system was developed to address the shortcomings of the Elo system. Both systems are designed to rate one-on-one, zero-sum games, so there is no way to apply them directly to each player. Instead, we apply the rating system by pitting a player against her weekly positional average for any given statistic. Then, depending on the position she played in, a weighted average of her ratings for the statistics thought to be most relevant to that position is taken to produce an absolute rating.

Since our use of the Glicko rating system relies on the margin between a player's statistic and the positional average for that statistic, all statistics must be scaled so that they are comparable. We achieve this by using z-scores instead of absolute margins. In summary, we need two pieces of data before we can proceed with rating players: first, the **positional means** for each statistic, and second, the **positional standard deviations** for each statistic.

The *calculateMeans()* function, as defined below, accepts the **Player Stats** and **Time in Position** pandas dataframes and returns three dataframes:
1. **df_playerStats**: The processed version of the input df_playerStats. A new feature, 'feedWithoutAttempt', has been calculated and players with fewer than 10 minutes of game time have been dropped.
2. **df_means**: The means for each statistic for each positional group.
3. **df_stds**: The standard deviations for each statistic for each positional group.

Since not all players play the same amount of time on court in a match, comparing them on their absolute statistics will be unfair. Hence, each statistic will be divided by the time (in minutes) a player spent on court to produce time-normalised statistics.

Once again, instead of running this function on the *Player Stats* and *Time in Position* files for each match, it is more efficient to concatenate the two dataframes for all four matches in each week and run it once.
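
The time normalisation and z-score steps described above can be illustrated on a small made-up table. This is a sketch under stated assumptions: the four players and their numbers are invented, and only the column names *position*, *minutesPlayed* and *goal1* are taken from the data.

```python
import pandas as pd

# Toy player statistics (hypothetical values)
toy = pd.DataFrame({
    "playerId":      [1, 2, 3, 4],
    "position":      ["GS", "GS", "GA", "GA"],
    "minutesPlayed": [60, 30, 60, 40],
    "goal1":         [30, 12, 18, 8],
})

# Step 1: time-normalise the statistic (goals per minute on court)
toy["goal1_per_min"] = toy["goal1"] / toy["minutesPlayed"]

# Step 2: positional mean and standard deviation, broadcast back to each row
grouped = toy.groupby("position")["goal1_per_min"]
mean = grouped.transform("mean")
std = grouped.transform("std")  # sample standard deviation, as with DataFrame.std()

# Step 3: z-score, falling back to 0 when the standard deviation is zero or undefined
toy["z"] = ((toy["goal1_per_min"] - mean) / std).where(std > 0, 0.0)
print(toy[["playerId", "position", "goal1_per_min", "z"]])
```

Player 1 scores more goals in total than player 2, but what the z-score compares is the per-minute rate (0.5 against 0.4), which is why the normalisation has to come first.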


In [58]:
# DEFINING THE calculateMeans() FUNCTION

def calculateMeans(df_playerStats, df_time_in_position):

  # Getting rid of unnecessary columns from df_playerStats
  df_playerStats = df_playerStats.iloc[:,1:50]
  df_playerStats = df_playerStats.drop(columns = ['oppSquadId'])

  # Using the merge function to add positional classification to df_playerStats
  df_playerStats = pd.merge(df_playerStats, df_time_in_position[['playerId', 'position']], on='playerId', how = 'left')

  # Since players who spent all the time on the bench will not feature in df_time_in_position, there is the possibility of NaNs in the 'position' column. We will drop these.
  df_playerStats = df_playerStats.dropna()

  # Creating a new feature called feedWithoutAttempt for use later
  df_playerStats['feedWithoutAttempt'] = df_playerStats['feeds'] - df_playerStats['feedWithAttempt']

  # Normalising all stats by minutesPlayed
  exclude_list = ['squadId', 'playerId', 'position', 'minutesPlayed']

  for col in df_playerStats.columns:
    if col not in exclude_list:
      df_playerStats[col] = df_playerStats[col] / df_playerStats['minutesPlayed']


  # Dropping players who have played less than 10 minutes because their time normalised statistics are not as reliable
  df_playerStats = df_playerStats[df_playerStats['minutesPlayed'] >= 10]

  # Dropping playerId and grouping by position to calculate mean and standard deviation

  df_means = df_playerStats.drop(columns = ['playerId']).groupby('position').mean().round(2).reset_index()
  df_stds = df_playerStats.drop(columns = ['playerId']).groupby('position').std().round(2).reset_index()

  return df_playerStats, df_means, df_stds

In [59]:
# TESTING THE calculateMeans() FUNCTION

df_playerStats = pd.read_csv('/content/netball-analysis/data/matchCentre/processed/111080703_2020_SSN_11108_r7_g3/111080703_playerStats_2020_SSN_11108_r7_g3.csv')
# df_time_in_position is reused from the classifyPlayers() test cell above

df_playerStats, df_means, df_stds = calculateMeans(df_playerStats, df_time_in_position)

print('Positional Means')
print(df_means)
print()
print('Positional Standard Deviations')
print(df_stds)
print()
print('Processed Player Stats')
print(df_playerStats.head())

Positional Means
  position  squadId  attempt1  attempt2  attempt_from_zone1  \
0        C   4461.5      0.00      0.00                0.00   
1       GA   4461.5      0.40      0.03                0.40   
2       GD   3243.0      0.00      0.00                0.00   
3       GK   4461.5      0.00      0.00                0.00   
4       GS   4461.5      0.68      0.04                0.68   
5       WA   4461.5      0.00      0.00                0.00   
6       WD   4461.5      0.00      0.00                0.00   

   attempt_from_zone2  badHands  badPasses  blocked  blocks  ...  passes  \
0                0.00      0.01       0.02      0.0     0.0  ...    0.63   
1                0.03      0.02       0.01      0.0     0.0  ...    0.28   
2                0.00      0.00       0.00      0.0     0.0  ...    0.00   
3                0.00      0.00       0.00      0.0     0.0  ...    0.00   
4                0.04      0.00       0.00      0.0     0.0  ...    0.04   
5                0.00 

# Rating Players Functions

Now that we have functions to help us process data, we are ready to use them to rate players. The idea is to iterate through the rows of **df_playerStats** for each gameweek, rating each player for each statistic against their respective positional averages.

As was briefly mentioned in the text accompanying the Calculate Means Function, the ratings for each statistic will be combined through a weighted average to produce a single **absoluteRating** for each player for each week. The logic here is that different positions have different roles, and hence cannot be judged evenly on each statistic. For instance, in Netball, only the Goal Shooter and Goal Attack positions can score goals. Therefore, it would not make sense to rate positions other than these on *goals* and *attempts*. To enable a fair comparison of Glicko Ratings across positions, we will define a dictionary of weights called **DICT_WEIGHTS** containing subjectively chosen weights for every statistic deemed to be relevant to each position. This allows us to calculate an absolute rating for each player regardless of position with the assumption that these absolute ratings are comparable across positions.

Once we define DICT_WEIGHTS, we will define a function called **processWeek()** that updates ratings for each player for each gameweek. Finally, we will incorporate the processWeek() function into the *masterFunction()* that can implement the rating system over multiple seasons.

## The Dictionary of Weights

This section defines DICT_WEIGHTS, the dictionary containing subjectively chosen weights for the statistics thought to be relevant to each position.

The block also defines NEGATIVE_STATISTICS_LIST. This list contains all statistics within DICT_WEIGHTS that are negative stats. This helps us decide whether a statistic is positive or negative for the *positiveStat* boolean input into the rate() function.
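
Since the absolute rating is meant to be a weighted average, it can be worth checking that the weights for each position add up to 1. The helper below is a hypothetical sanity check, not part of the original notebook. It is shown on the goalkeeper weights plus a deliberately broken made-up position called "XX".

```python
import math

def positions_with_bad_weights(dict_weights, tol=1e-9):
    """Return {position: total} for every position whose weights do not sum to 1."""
    bad = {}
    for position, weights in dict_weights.items():
        total = sum(weights.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            bad[position] = total
    return bad

example_weights = {
    "GK": {"deflections": 0.35, "penalties": 0.25, "gain": 0.15, "blocks": 0.1,
           "rebounds": 0.1, "generalPlayTurnovers": 0.03, "pickups": 0.02},
    # Hypothetical position whose weights deliberately sum to 0.9
    "XX": {"gain": 0.7, "pickups": 0.2},
}

print(positions_with_bad_weights(example_weights))
```

Only "XX" is reported. Running the same check on the full DICT_WEIGHTS after editing any weight is a cheap way to catch arithmetic slips.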

In [60]:
DICT_WEIGHTS = {
'GS':{
    "goal1": 0.35,
    "goalMisses": 0.2,
    "goal2": 0.15,
    "generalPlayTurnovers": 0.1,
    "rebounds": 0.07,
    "blocked": 0.05,
    "penalties": 0.05,
    "feedWithAttempt": 0.01,
    "feedWithoutAttempt": 0.01,
    "pickups": 0.01
},

'GA':{
    "goal1": 0.25,
    "goalMisses": 0.15,
    "goal2": 0.1,
    "feedWithAttempt": 0.1,
    "generalPlayTurnovers": 0.1,
    "centrePassReceives": 0.07,
    "feedWithoutAttempt": 0.07,
    "blocked": 0.05,
    "rebounds": 0.05,
    "penalties": 0.04,
    "gain": 0.01,
    "pickups": 0.01
},

'WA':{
    "centrePassReceives": 0.3,
    "feedWithAttempt": 0.25,
    "generalPlayTurnovers": 0.15,
    "feedWithoutAttempt": 0.1,
    "penalties": 0.07,
    "gain": 0.05,
    "pickups": 0.05,
    "deflections": 0.03
},

'C': {
    "generalPlayTurnovers": 0.35,
    "feedWithAttempt": 0.3,
    "feedWithoutAttempt": 0.15,
    "penalties": 0.07,
    "pickups": 0.07,
    "gain": 0.05,
    "deflections": 0.01
},

'WD': {
    "deflections": 0.3,
    "penalties": 0.25,
    "gain": 0.2,
    "generalPlayTurnovers": 0.1,
    "centrePassReceives": 0.07,
    "pickups": 0.05,
    "feedWithAttempt": 0.02,
    "feedWithoutAttempt": 0.01
},

'GD': {
    "deflections": 0.35,
    "penalties": 0.2,
    "gain": 0.15,
    "blocks": 0.1,
    "rebounds": 0.1,
    "generalPlayTurnovers": 0.05,
    "centrePassReceives": 0.03,
    "pickups": 0.02
},

'GK': {
    "deflections": 0.35,
    "penalties": 0.25,
    "gain": 0.15,
    "blocks": 0.1,
    "rebounds": 0.1,
    "generalPlayTurnovers": 0.03,
    "pickups": 0.02
}
}

NEGATIVE_STATISTICS_LIST = ['blocked', 'goalMisses', 'generalPlayTurnovers', 'penalties']

## The Process Week Function

The aim of this function is to process an entire gameweek, updating the ratings for all players. The ratings for each player will be stored in a nested dictionary called *DICT_PLAYER_RATINGS* that has the following structure: {Statistic -> PlayerId -> List of ratings}. Another similarly structured dictionary called *DICT_LEAGUE_RATINGS* will store the league rating relative to each player. A third dictionary called *DICT_ABSOLUTE_RATINGS* will store the absolute rating of each player in each match-week in a list.

The *processWeek()* function will accept the following inputs:
1. **df_playerStats**: The concatenated dataframe of player statistics for the gameweek.
2. **df_substitutions**: The concatenated dataframe of substitutions for the gameweek.
3. **DICT_WEIGHTS**: The dictionary containing subjectively assigned weights for each statistic by position.
4. **DICT_PLAYER_RATINGS**: The dictionary containing lists of player ratings for each statistic.
5. **DICT_LEAGUE_RATINGS**: The dictionary containing lists of league ratings for each statistic relative to each player.
6. **DICT_ABSOLUTE_RATINGS**: The dictionary containing lists of absolute ratings for each player week-by-week.

The processWeek() function will output the following:
1. **df_playerStats**: Input df_playerStats dataframe with an additional column for Glicko Rating.
2. **DICT_PLAYER_RATINGS**: Updated DICT_PLAYER_RATINGS
3. **DICT_LEAGUE_RATINGS**: Updated DICT_LEAGUE_RATINGS
4. **DICT_ABSOLUTE_RATINGS**: Updated DICT_ABSOLUTE_RATINGS

NOTE: The processWeek() function utilises the helper functions defined earlier. The cells defining those functions need to be run before using the function. Additionally, it also requires numpy to be imported as np.
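
The nested dictionary layout and the weighted sum behind the absolute rating can be seen on a tiny made-up example. In this sketch the ratings are plain floats rather than GlickoRating objects, and the weights, player ID and helper name `absolute_rating` are all hypothetical.

```python
# {statistic -> playerId -> list of ratings}, newest rating last
player_ratings = {
    "deflections": {101: [1500.0, 1530.0]},
    "penalties":   {101: [1500.0, 1480.0]},
}

# Hypothetical weights for the player's position (they sum to 1)
weights = {"deflections": 0.6, "penalties": 0.4}

def absolute_rating(player_id, weights, player_ratings):
    """Weighted sum of the player's latest rating for each relevant statistic."""
    return sum(
        weight * player_ratings[statistic][player_id][-1]
        for statistic, weight in weights.items()
    )

print(absolute_rating(101, weights, player_ratings))
```

Here the result is 0.6 × 1530 + 0.4 × 1480 = 1510, up to floating-point rounding.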


In [61]:
# DEFINING THE processWeek() FUNCTION

import numpy as np

def processWeek(df_playerStats, df_substitutions, DICT_WEIGHTS, DICT_PLAYER_RATINGS, DICT_LEAGUE_RATINGS, DICT_ABSOLUTE_RATINGS):

  # Classifying players to their positions
  df_time_in_position = classifyPlayers(df_substitutions)

  # Processing df_playerStats and calculating Means and Standard Deviations
  df_playerStats, df_means, df_stds = calculateMeans(df_playerStats, df_time_in_position)

  # Iterating df_playerStats row-by-row to rate players
  for i, player in df_playerStats.iterrows():

    # Getting player position, ID, means, and stds
    position = player['position']
    playerId = player['playerId']
    positional_means = df_means[df_means['position'] == position]
    positional_stds = df_stds[df_stds['position'] == position]

    # Looping through relevant statistics for the player's position in DICT_WEIGHTS and rating her for each
    for statistic in DICT_WEIGHTS[position]:

      # Creating a dictionary for statistic within the rating dictionaries, DICT_PLAYER_RATINGS and DICT_LEAGUE_RATINGS, if it does not already exist
      if statistic not in DICT_PLAYER_RATINGS:
        DICT_PLAYER_RATINGS[statistic] = {}
        DICT_LEAGUE_RATINGS[statistic] = {}

      # Creating a new list with initial rating for the player within DICT_PLAYER_RATINGS[statistic] and DICT_LEAGUE_RATINGS[statistic] if it does not already exist
      if playerId not in DICT_PLAYER_RATINGS[statistic]:
        DICT_PLAYER_RATINGS[statistic][playerId] = [GlickoRating()] # This method of initialising a rating object was defined while creating the rating system
        DICT_LEAGUE_RATINGS[statistic][playerId] = [GlickoRating()] # This method of initialising a rating object was defined while creating the rating system

      # Getting data required for rating player
      playerStatisticValue = player[statistic] # The value of the player's time-normalised statistic for the current statistic
      playerRating = DICT_PLAYER_RATINGS[statistic][playerId][-1] # The player's current rating for the statistic will be the last element in the list
      leagueRating = DICT_LEAGUE_RATINGS[statistic][playerId][-1] # The current league rating for the statistic relative to the player will be the last element in the list
      mean = positional_means.iloc[0][statistic] # Getting the positional mean for the current statistic
      standardDeviation = positional_stds.iloc[0][statistic] # Getting the positional standard deviation for the current statistic

      # Calculating margin as the z-score of playerStatisticValue
      margin = (playerStatisticValue - mean) / standardDeviation if standardDeviation != 0 else 0

      # Calculating new player and league ratings for statistic
      updatedPlayerRating = rate(playerRating, leagueRating, margin, positiveStat=(statistic not in NEGATIVE_STATISTICS_LIST))
      updatedLeagueRating = rate(leagueRating, playerRating, 0, positiveStat=(statistic not in NEGATIVE_STATISTICS_LIST)) # The margin is 0 because the z-score of the mean is 0.

      # Updating DICT_PLAYER_RATINGS and DICT_LEAGUE_RATINGS with the new ratings
      DICT_PLAYER_RATINGS[statistic][playerId].append(updatedPlayerRating)
      DICT_LEAGUE_RATINGS[statistic][playerId].append(updatedLeagueRating)

    # Updating DICT_ABSOLUTE_RATINGS:
    # Creating a new list with an initial absolute rating of 1500 for the player in DICT_ABSOLUTE_RATINGS if it does not already exist
    if playerId not in DICT_ABSOLUTE_RATINGS:
      DICT_ABSOLUTE_RATINGS[playerId] = [1500]

    # Calculating the player's absolute rating for the week
    absoluteRating = sum([DICT_WEIGHTS[position][statistic] * DICT_PLAYER_RATINGS[statistic][playerId][-1].r for statistic in DICT_WEIGHTS[position]])

    # Updating DICT_ABSOLUTE_RATINGS with the new absolute rating
    DICT_ABSOLUTE_RATINGS[playerId].append(absoluteRating)

    # Updating df_playerStats by adding the player's absolute Glicko Rating to a new column called 'glickoRating'
    df_playerStats.at[i, 'glickoRating'] = absoluteRating

  return df_playerStats, DICT_PLAYER_RATINGS, DICT_LEAGUE_RATINGS, DICT_ABSOLUTE_RATINGS

In [62]:
# TESTING THE processWeek() FUNCTION

df_playerStats = pd.read_csv('/content/netball-analysis/data/matchCentre/processed/111080703_2020_SSN_11108_r7_g3/111080703_playerStats_2020_SSN_11108_r7_g3.csv')
df_substitutions = pd.read_csv('/content/netball-analysis/data/matchCentre/processed/111080703_2020_SSN_11108_r7_g3/111080703_substitutions_2020_SSN_11108_r7_g3.csv')
DICT_PLAYER_RATINGS = {}
DICT_LEAGUE_RATINGS = {}
DICT_ABSOLUTE_RATINGS = {}

df_playerStats, DICT_PLAYER_RATINGS, DICT_LEAGUE_RATINGS, DICT_ABSOLUTE_RATINGS = processWeek(df_playerStats, df_substitutions, DICT_WEIGHTS, DICT_PLAYER_RATINGS, DICT_LEAGUE_RATINGS, DICT_ABSOLUTE_RATINGS)

print('df_playerStats')
print(df_playerStats.head())
print()
print('DICT_PLAYER_RATINGS')
print(DICT_PLAYER_RATINGS)
print()
print('DICT_LEAGUE_RATINGS')
print(DICT_LEAGUE_RATINGS)
print()
print('DICT_ABSOLUTE_RATINGS')
print(DICT_ABSOLUTE_RATINGS)

df_playerStats
   squadId  playerId  attempt1  attempt2  attempt_from_zone1  \
0      806     80439  0.000000  0.000000            0.000000   
1      806     80574  0.000000  0.000000            0.000000   
2      806    998404  0.000000  0.000000            0.000000   
3      806    999128  0.402685  0.050336            0.402685   
4      806   1001711  0.470958  0.000000            0.470958   

   attempt_from_zone2  badHands  badPasses  blocked  blocks  ...   pickups  \
0            0.000000  0.000000   0.000000      0.0     0.0  ...  0.078560   
1            0.000000  0.000000   0.016667      0.0     0.0  ...  0.050000   
2            0.000000  0.000000   0.000000      0.0     0.0  ...  0.000000   
3            0.050336  0.000000   0.016779      0.0     0.0  ...  0.016779   
4            0.000000  0.031397   0.000000      0.0     0.0  ...  0.156986   

     points  possessionChanges  possessions  quartersPlayed  rebounds  \
0  0.000000           0.019640     0.176759        0.07856

## Fixing DICT_ABSOLUTE_RATINGS

The two functions below handle the following situations, which would otherwise leave the lists in DICT_ABSOLUTE_RATINGS misaligned across gameweeks:

1. **fix_1()**: A player was rated in an earlier gameweek but is not rated in the current one, so her DICT_ABSOLUTE_RATINGS list is not updated this week.
2. **fix_2()**: A player is not rated in the first few gameweeks but is rated in a subsequent week, so her list is shorter than those of players rated from the start.

The solution for problem 1 is to look for playerIds that exist in DICT_ABSOLUTE_RATINGS but not in df_playerStats. Such players have not been rated for the week, so a NaN is appended to their DICT_ABSOLUTE_RATINGS lists for that week.

The solution for problem 2 is to find the length of the longest list in DICT_ABSOLUTE_RATINGS. A list of that length belongs to a player who was first rated in the first gameweek and therefore has a full absolute ratings list. Then, each player's absolute ratings list is padded at the front with NaNs until it reaches that length.

NOTE: In the final implementation, fix_1 will be called for each gameweek while fix_2 will be called at the end of all seasons.
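
To make the alignment problem concrete, the sketch below replays three made-up gameweeks for two hypothetical players, "A" and "B", applying a fix_1-style step after every week and a fix_2-style step at the end. The ratings are invented, and the logic is condensed for illustration rather than copied from the functions defined below.

```python
import math

# Hypothetical absolute ratings produced in each gameweek
weekly_ratings = [
    {"A": 1510.0},               # week 1: only A is rated
    {"B": 1495.0},               # week 2: only B is rated
    {"A": 1520.0, "B": 1505.0},  # week 3: both are rated
]

history = {}
for rated in weekly_ratings:
    # Every newly seen player starts from an initial rating of 1500
    for player_id, rating in rated.items():
        history.setdefault(player_id, [1500]).append(rating)

    # fix_1-style step: known players who were not rated this week get a NaN
    for player_id in history.keys() - rated.keys():
        history[player_id].append(math.nan)

# fix_2-style step: pad late starters at the front so all lists align
longest = max(len(ratings) for ratings in history.values())
history = {
    player_id: [math.nan] * (longest - len(ratings)) + ratings
    for player_id, ratings in history.items()
}

print(history)
```

After both steps each list has four entries (the initial rating plus three gameweeks): A has a NaN in the week 2 slot, and B has a NaN at the front because she was first rated in week 2.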


In [63]:
# DEFINING fix_1()

def fix_1(DICT_ABSOLUTE_RATINGS, df_playerStats):

  set_playerStats_playerIds = set(df_playerStats['playerId'])
  set_DICT_ABSOLUTE_RATINGS_playerIds = set(DICT_ABSOLUTE_RATINGS.keys())

  list_players_not_rated = set_DICT_ABSOLUTE_RATINGS_playerIds - set_playerStats_playerIds

  for playerId in list_players_not_rated:
    DICT_ABSOLUTE_RATINGS[playerId].append(np.nan)

  return DICT_ABSOLUTE_RATINGS

# Defining fix_2()

def fix_2(DICT_ABSOLUTE_RATINGS):

  longest_list_length = max(len(DICT_ABSOLUTE_RATINGS[playerId]) for playerId in DICT_ABSOLUTE_RATINGS)

  for player_id, ratings_list in DICT_ABSOLUTE_RATINGS.items():
    DICT_ABSOLUTE_RATINGS[player_id] = [np.nan] * (longest_list_length - len(ratings_list)) + ratings_list

  return DICT_ABSOLUTE_RATINGS

# Rating players between 2020 and 2023

The following block of code uses all the functions defined above to rate players across the 2020 to 2023 seasons of the SSN tournament.
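
The loop relies on the sorted path lists holding four matches per gameweek, so the files for a given week sit at indices 4 × (week − 1) to 4 × (week − 1) + 3. The sketch below shows that arithmetic on stand-in file names. The helper `week_slice` and the `match_XX.csv` names are hypothetical.

```python
GAMES_PER_WEEK = 4   # four matches per gameweek
WEEKS = 14           # 56 regular-season matches / 4 matches per week

def week_slice(paths, week):
    """Return the four paths that belong to a 1-indexed gameweek."""
    start = GAMES_PER_WEEK * (week - 1)
    return paths[start:start + GAMES_PER_WEEK]

# Stand-ins for the 56 sorted file paths returned by loadFiles()
paths = [f"match_{i:02d}.csv" for i in range(1, 57)]

print(week_slice(paths, 1))    # first four matches
print(week_slice(paths, 14))   # last four regular-season matches
```

If a season ever had a different number of matches per round, this indexing would silently mix gameweeks, so the count of 56 paths is worth verifying before running the loop.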

In [67]:
df_teamRatings = pd.DataFrame()

# Creating empty dictionaries

DICT_PLAYER_RATINGS = {}
DICT_LEAGUE_RATINGS = {}
DICT_ABSOLUTE_RATINGS = {}
seasons = ['2020', '2021', '2022', '2023']

# Looping through each season
for season in seasons:

  # Collecting file-paths
  playerStats_paths = loadFiles('SSN', season, 'playerStats')
  substitutions_paths = loadFiles('SSN', season, 'substitutions')

  # Looping through each match-week and rating
  for week in range(1, 15):

    # Concatenating df_subs and df_playerStats for all 4 games in the week
    df_substitutions = pd.concat([pd.read_csv(substitutions_paths[4 * (week - 1) + game]) for game in range(4)])
    df_playerStats = pd.concat([pd.read_csv(playerStats_paths[4 * (week - 1) + game]) for game in range(4)])

    # Processing Week
    df_playerStats, DICT_PLAYER_RATINGS, DICT_LEAGUE_RATINGS, DICT_ABSOLUTE_RATINGS = processWeek(df_playerStats, df_substitutions, DICT_WEIGHTS, DICT_PLAYER_RATINGS, DICT_LEAGUE_RATINGS, DICT_ABSOLUTE_RATINGS)

    # Applying fix_1() to DICT_ABSOLUTE_RATINGS
    DICT_ABSOLUTE_RATINGS = fix_1(DICT_ABSOLUTE_RATINGS, df_playerStats)

  # Calculating end of season team ratings
  df_seasonTeamRatings = df_playerStats[['squadId', 'glickoRating']].groupby('squadId').mean().reset_index()
  df_seasonTeamRatings['season'] = int(season)

  # Concatenating df_seasonTeamRatings to df_teamRatings
  df_teamRatings = pd.concat([df_teamRatings, df_seasonTeamRatings]).reset_index(drop=True)

# Applying fix_2() to DICT_ABSOLUTE_RATINGS
DICT_ABSOLUTE_RATINGS = fix_2(DICT_ABSOLUTE_RATINGS)

In [68]:
df_teamRatings

   squadId  glickoRating  season
0      801   1509.914021    2020
1      804   1513.067538    2020
2      806   1518.148623    2020
3      807   1499.169731    2020
4      810   1529.196158    2020
5     8117   1514.757987    2020
6     8118   1486.794825    2020
7     8119   1460.442613    2020
8      801    1491.74587    2021
9      804   1479.843119    2021
