In this notebook, we build a baseline algorithm that predicts the results of professional snooker matches from the players' rankings in the previous year.

In [1]:
import pandas as pd
import numpy as np
import random

In [2]:
rankings = pd.read_csv('World_Rankings.csv')

In [3]:
rankings.head()

Unnamed: 0,Year,First_Name,Surnames,Ranking,Points
0,1981,Bill,Werbeniuk,9,0
1,1981,Eddie,Charlton,8,0
2,1981,David,Taylor,7,0
3,1981,Doug,Mountjoy,6,0
4,1981,Dennis,Taylor,5,0


In [4]:
#Look up a player's ranking for a given year and full name; returns None if the player is not found
#We will use this function in our baseline model
def find_rankings(year, name):
    ranking = -1
    last_name = name.split(' ')[-1]
    df1 = rankings[rankings['Year'] == year]
    df2 = df1[df1['Surnames'] == last_name]
    for ind in df2.index:
        full_name = df2.loc[ind, 'First_Name'].strip() + ' ' + df2.loc[ind, 'Surnames'].strip()
        if full_name == name:
            ranking = df2.loc[ind, 'Ranking']
    
    if ranking == -1:
        return None
    else:
        return ranking


In [5]:
find_rankings(2010, 'Judd Trump')

'27'
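
The lookup returns the string '27' rather than the integer 27, which suggests that the Ranking column was read as text. Comparing such values directly is lexicographic, so '9' > '27' is true. The following is a minimal sketch of a numeric conversion on a toy frame. The toy data and the Ranking_num column are made up for illustration, and whether the real file contains non-numeric entries is not known from this notebook.

```python
import pandas as pd

# Toy data standing in for World_Rankings.csv (values are illustrative only)
toy = pd.DataFrame({'Surnames': ['A', 'B', 'C'],
                    'Ranking': ['9', '27', 'n/a']})

# String comparison is lexicographic, which gives the wrong order for ranks
print('9' > '27')

# errors='coerce' turns anything non-numeric into NaN instead of raising
toy['Ranking_num'] = pd.to_numeric(toy['Ranking'], errors='coerce')
print(toy)
```

With a numeric column, comparisons such as ranking1 < ranking2 order players as intended, and missing rankings show up as NaN.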

In [6]:
matches = pd.read_csv('/Users/tliu/Desktop/Erdos Project/Data/matches.csv')

In [7]:
matches.head()

Unnamed: 0,player1,player2,score1,score2,best_of,tournament_id,date,year
0,Colin Roscoe,Jackie Rea,9,6,17,753,,1982
1,Tommy Murphy,Clive Everton,9,4,17,753,,1982
2,Vic Harris,Marcus Owen,9,4,17,753,,1982
3,Bob Harris,Graham Cripsey,9,6,17,753,,1982
4,Geoff Foulds,Matt Gibson,9,3,17,753,,1982


In [8]:
from sklearn.metrics import accuracy_score
scores = np.zeros((3, 5))

# There are 126 matches in the 750th tournament. They happened in 2013.
# There are 126 matches in the 800th tournament. They happened in 2014.
# There are 142 matches in the 1080th tournament. They happened in 2025.
nth = [750, 800, 1080]

tournaments = list(matches.tournament_id.unique())
for k, n in enumerate(nth):
    for j in range(5):
        past_tournaments = tournaments[:n]
        current = tournaments[n]
        past_index = matches['tournament_id'].isin(past_tournaments)

        matches_past = matches[past_index]
        matches_future = matches[matches['tournament_id'] == current]

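        #Store the actual result: 0 if player1 won the match, 1 if player2 won
        y = (matches_future['score2'] > matches_future['score1']).astype(int).to_numpy()

        #Use last year's ranking to predict match result
        year = matches_future.loc[matches_future.index[0], 'year']

        #We only have rankings up to 2020. For matches after 2021, we use the ranking of 2020 to predict the results.
        if year > 2021:
            year = 2021

        #Predictions use the same encoding as y: 0 means player1 wins, 1 means player2 wins
        prediction = np.zeros(len(matches_future))
        not_found = 0
        for i, ind in enumerate(matches_future.index):
            match = matches_future.loc[ind]
            player1, player2 = match['player1'], match['player2']
            ranking1, ranking2 = find_rankings(year-1, player1), find_rankings(year-1, player2)

            if ranking1 is None and ranking2 is None:
                #Neither player is ranked, so guess at random
                prediction[i] = random.randint(0,1)
                not_found += 1

            elif ranking1 is None:
                #Only player2 is ranked, so predict player2
                prediction[i] = 1

            elif ranking2 is None:
                #Only player1 is ranked, so predict player1
                prediction[i] = 0

            #find_rankings returns strings such as '27', so compare as integers
            #(assumes every stored ranking is a numeric string); a lower number is a better ranking
            elif int(ranking1) < int(ranking2):
                prediction[i] = 0

            else:
                prediction[i] = 1

        scores[k,j] = accuracy_score(y, prediction)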
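
The decision rule used in the loop can be isolated as a small function, which makes it easier to test and to seed the random fallback. This is a minimal sketch rather than the notebook's own code: predict_winner and its rng parameter are hypothetical names, and it assumes that a lower ranking number means a stronger player and that an unranked player is the underdog.

```python
import random

def predict_winner(ranking1, ranking2, rng=random):
    """Return 'player1' or 'player2' (hypothetical helper, not from the notebook).

    Rankings may be None (player not found) or anything int() accepts.
    Assumes a lower ranking number means a stronger player.
    """
    if ranking1 is None and ranking2 is None:
        # No information on either player: guess uniformly at random
        return rng.choice(['player1', 'player2'])
    if ranking1 is None:
        return 'player2'
    if ranking2 is None:
        return 'player1'
    # Convert to int so that '9' vs '27' is compared numerically, not as text
    return 'player1' if int(ranking1) < int(ranking2) else 'player2'

rng = random.Random(0)  # fixed seed so repeated runs give the same guesses
print(predict_winner('9', '27'))
print(predict_winner(None, '27'))
print(predict_winner(None, None, rng))
```

Passing a seeded random.Random instance keeps the random guesses reproducible from one run to the next.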
            
        


In [9]:
print('The prediction accuracies for the 750th, 800th, and 1080th tournaments are:')
print(scores)

The prediction accuracies for the 750th, 800th, and 1080th tournaments are:
[[0.56349206 0.56349206 0.55555556 0.56349206 0.55555556]
 [0.5952381  0.5952381  0.5952381  0.5952381  0.5952381 ]
 [0.46478873 0.52112676 0.53521127 0.5        0.45774648]]


In [10]:
np.mean(scores, axis = 1)

array([0.56031746, 0.5952381 , 0.49577465])

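The mean prediction accuracies for the 750th and 800th tournaments are about 0.56 and 0.595. The accuracy for the 1080th tournament is not consistent: it varies between about 0.46 and 0.54 across the five runs and averages about 0.50, which is no better than chance. This is expected, because that tournament takes place in 2025 while our rankings only go up to 2020, and matches in which neither player has a ranking are predicted with a random number generator.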
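
To make the run-to-run variation concrete, the mean and standard deviation of each row can be computed from the scores printed above. The sketch below only copies those printed values into an array; the names means and stds are introduced here for illustration.

```python
import numpy as np

# Accuracy values copied from the printed scores array (rows: 750th, 800th, 1080th)
scores = np.array([
    [0.56349206, 0.56349206, 0.55555556, 0.56349206, 0.55555556],
    [0.5952381, 0.5952381, 0.5952381, 0.5952381, 0.5952381],
    [0.46478873, 0.52112676, 0.53521127, 0.5, 0.45774648],
])

means = scores.mean(axis=1)
stds = scores.std(axis=1)  # population standard deviation over the 5 runs

for name, m, s in zip(['750th', '800th', '1080th'], means, stds):
    print(f'{name}: mean={m:.3f}, std={s:.3f}')
```

A standard deviation of zero for the 800th tournament suggests that few or no matches there fell back to a random guess, while the 1080th tournament shows the largest spread.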