## Table Tennis Player Winning Rate Prediction Model

Develop a program that predicts match outcomes by analyzing players' historical performance. Specifically, the program should forecast the points each player is likely to score, from which a win probability for each player can be derived. The program must not use data that is only available after the match. For instance, Player 1 is expected to score 60 points, while Player 2 is projected to score 50 points; consequently, Player 1 has a 70% chance of winning, whereas Player 2 has a 30% chance. Are you interested in taking up this project? If so, I'd be glad to discuss further details with you. The dataset includes the train and test splits; the Excel file is attached after this message. The goal is to improve the model's performance beyond a log loss of 0.67. The model should account for the variability of each player's performance.
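
For context on the 0.67 target: a forecast that always assigns 50% to either player scores a log loss of ln 2 ≈ 0.693, so 0.67 is only slightly better than guessing. A minimal sketch of that baseline, using made-up outcomes and a hypothetical helper `binary_log_loss` that is not part of the notebook:

```python
import numpy as np

def binary_log_loss(y_true, p_home, eps=1e-15):
    """Mean negative log-likelihood of the observed outcomes."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) can never occur.
    p = np.clip(np.asarray(p_home, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical outcomes (1 = Player 1 wins); illustrative values only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
coin_flip = [0.5] * len(y_true)

baseline = binary_log_loss(y_true, coin_flip)
print(round(baseline, 4))  # 0.6931, i.e. ln(2)
```

Any model worth keeping should land clearly below this value on matches it has not seen during training.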

In [1]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Load data
TT_data = pd.read_excel("TT.xlsx")

# Preprocess data
TT_data['Event Date and Time'] = pd.to_datetime(TT_data['Event Date and Time'])
TT_data = TT_data.dropna().sort_values(by='Event Date and Time')
df = TT_data

columns = ['Home Service points won', 'Away Service points won', 'Home Receiver points won', 'Away Receiver points won']
for col in columns:
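    # Raw-string regex; convert "won/total" to a ratio without calling eval
    extracted_data = TT_data[col].str.extract(r'(\d+/\d+)')[0].map(lambda x: int(x.split('/')[0]) / int(x.split('/')[1]) if isinstance(x, str) else x)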
    TT_data = TT_data.assign(**{col: extracted_data})
    
# Generate lagged features for every player based on scores
def generate_lagged_features_for_player(player_data, columns):
    for col in columns:
        player_data[col + "_lag"] = player_data[col].shift(1)
    return player_data

# Columns after Spieler2 to lag
lag_columns = TT_data.columns[TT_data.columns.get_loc("Spieler2") + 1:]

home_lagged_data = TT_data.groupby("Spieler1").apply(lambda x: generate_lagged_features_for_player(x, lag_columns))
away_lagged_data = TT_data.groupby("Spieler2").apply(lambda x: generate_lagged_features_for_player(x, lag_columns))

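# Merge the lagged data
# NOTE: both assignments write to the same column, so the Spieler2-based lag
# replaces the Spieler1-based lag and only the second one is kept.
for col in lag_columns:
    TT_data[col + "_lag"] = home_lagged_data[col + "_lag"]
    TT_data[col + "_lag"] = away_lagged_data[col + "_lag"]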
    
# Extract the last match played by each player from the original data
last_match_per_player1 = df.groupby('Spieler1').last().reset_index()
last_match_per_player2 = df.groupby('Spieler2').last().reset_index()

# Combine both extracted matches
combined_last_matches = pd.concat([last_match_per_player1, last_match_per_player2], axis=0, ignore_index=True)
for col in columns:
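    # Raw-string regex; convert "won/total" to a ratio without calling eval
    extracted_data = combined_last_matches[col].str.extract(r'(\d+/\d+)')[0].map(lambda x: int(x.split('/')[0]) / int(x.split('/')[1]) if isinstance(x, str) else x)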
    combined_last_matches = combined_last_matches.assign(**{col: extracted_data})

# Add "_lag" to the variable names for the combined last matches
combined_last_matches = combined_last_matches.add_suffix('_lag')
combined_last_matches.rename(columns={'Spieler1_lag': 'Spieler1', 
                                      'Spieler2_lag': 'Spieler2',
                                      'Id_lag' : 'Id',
                                     'Event Date and Time_lag':'Event Date and Time'}, inplace=True)

# Add columns that exist in TT_data but not in combined_last_matches, filled with 0
for column in TT_data.columns:
    if column not in combined_last_matches.columns:
        combined_last_matches[column] = 0

combined_last_matches = combined_last_matches.drop_duplicates()

# Bind this data to TT_data
TT_data = pd.concat([TT_data, combined_last_matches], axis=0, ignore_index=True, sort=False)
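
The fraction columns are parsed twice above with the same expression. A small helper keeps the parsing in one place and avoids both `eval` and division by zero. This is a sketch only: the name `fraction_to_float` is hypothetical, and the assumed cell format (`"23/40 (58%)"`) is a guess at what the Excel file contains.

```python
import pandas as pd

def fraction_to_float(series: pd.Series) -> pd.Series:
    """Turn strings such as '23/40 (58%)' into 0.575; unparseable cells become NaN."""
    parts = series.str.extract(r"(\d+)/(\d+)")
    numerator = pd.to_numeric(parts[0], errors="coerce")
    denominator = pd.to_numeric(parts[1], errors="coerce")
    # A zero denominator is mapped to NaN instead of raising an error.
    return numerator / denominator.where(denominator != 0)

example = pd.Series(["23/40 (58%)", "0/0", None, "7/10"])
ratios = fraction_to_float(example)
print(ratios.tolist())
```

With such a helper, each of the two loops would reduce to one assignment per column, for example `TT_data[col] = fraction_to_float(TT_data[col])`.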

In [2]:
# Handle missing values
TT_data = TT_data.dropna()

# Feature engineering
TT_data["Home Score Ratio"] = TT_data["Spiele1_lag"] / (TT_data["Spiele1_lag"] + TT_data["Spiele2_lag"])
TT_data["Away Score Ratio"] = TT_data["Spiele2_lag"] / (TT_data["Spiele1_lag"] + TT_data["Spiele2_lag"])

rolling_avg_window = 5
player_rolling_avg = {}
home_rolling_avgs = []
away_rolling_avgs = []

for index, row in TT_data.iterrows():
    player1_name = row["Spieler1"]
    player2_name = row["Spieler2"]

    if player1_name not in player_rolling_avg:
        player_rolling_avg[player1_name] = [row["Home Score Ratio"]]
    else:
        player_rolling_avg[player1_name].append(row["Home Score Ratio"])

    if player2_name not in player_rolling_avg:
        player_rolling_avg[player2_name] = [row["Away Score Ratio"]]
    else:
        player_rolling_avg[player2_name].append(row["Away Score Ratio"])

    if len(player_rolling_avg[player1_name]) <= rolling_avg_window:
        home_rolling_avgs.append(np.mean(player_rolling_avg[player1_name]))
    else:
        home_rolling_avgs.append(np.mean(player_rolling_avg[player1_name][-rolling_avg_window:]))

    if len(player_rolling_avg[player2_name]) <= rolling_avg_window:
        away_rolling_avgs.append(np.mean(player_rolling_avg[player2_name]))
    else:
        away_rolling_avgs.append(np.mean(player_rolling_avg[player2_name][-rolling_avg_window:]))

TT_data["Home Rolling Avg Score"] = home_rolling_avgs
TT_data["Away Rolling Avg Score"] = away_rolling_avgs
TT_data["Rolling Avg Score Diff"] = TT_data["Home Rolling Avg Score"] - TT_data["Away Rolling Avg Score"]
TT_data["Is Home"] = 1
TT_data["Home Rolling Avg Margin Set"] = abs(TT_data["Satz1_lag"] - TT_data["Satz2_lag"])
TT_data["Home Rolling Avg Margin Lead"] = abs(TT_data["Spiele1_lag"] - TT_data["Spiele2_lag"])

# Calculate rolling average for receiver and serving points
rolling_avg_receiver_points = []
rolling_avg_serving_points = []
player_receiver_points = {}
player_serving_points = {}

for index, row in TT_data.iterrows():
    player1_name = row["Spieler1"]

    if player1_name not in player_receiver_points:
        player_receiver_points[player1_name] = [row["Home Receiver points won"]]
        player_serving_points[player1_name] = [row["Home Service points won"]]
    else:
        player_receiver_points[player1_name].append(row["Home Receiver points won"])
        player_serving_points[player1_name].append(row["Home Service points won"])

    if len(player_receiver_points[player1_name]) <= rolling_avg_window:
        rolling_avg_receiver_points.append(np.mean(player_receiver_points[player1_name]))
        rolling_avg_serving_points.append(np.mean(player_serving_points[player1_name]))
    else:
        rolling_avg_receiver_points.append(np.mean(player_receiver_points[player1_name][-rolling_avg_window:]))
        rolling_avg_serving_points.append(np.mean(player_serving_points[player1_name][-rolling_avg_window:]))

TT_data["Home Rolling Avg Receiver Points"] = rolling_avg_receiver_points
TT_data["Home Rolling Avg Serving Points"] = rolling_avg_serving_points

# Interaction terms
TT_data['Score Ratio Interaction'] = TT_data['Home Rolling Avg Score'] * TT_data['Away Rolling Avg Score']
TT_data['Margin Set Lead Interaction'] = TT_data['Home Rolling Avg Margin Set'] * TT_data['Home Rolling Avg Margin Lead']
TT_data['Receiver Serving Interaction'] = TT_data['Home Rolling Avg Receiver Points'] * TT_data['Home Rolling Avg Serving Points']

# Polynomial features
TT_data['Home Rolling Avg Score^2'] = TT_data['Home Rolling Avg Score']**2
TT_data['Away Rolling Avg Score^2'] = TT_data['Away Rolling Avg Score']**2
TT_data['Rolling Avg Score Diff^2'] = TT_data['Rolling Avg Score Diff']**2
TT_data['Home Rolling Avg Margin Set^2'] = TT_data['Home Rolling Avg Margin Set']**2
TT_data['Home Rolling Avg Margin Lead^2'] = TT_data['Home Rolling Avg Margin Lead']**2
TT_data['Home Rolling Avg Receiver Points^2'] = TT_data['Home Rolling Avg Receiver Points']**2
TT_data['Home Rolling Avg Serving Points^2'] = TT_data['Home Rolling Avg Serving Points']**2


In [5]:
# Keep only rows with a real result; copy so new columns can be added safely
TT_data1 = TT_data[TT_data['Spiele1'] != 0].copy()

# Create the extended feature matrix
X_extended = TT_data1[[
    "Home Rolling Avg Score", "Away Rolling Avg Score", "Rolling Avg Score Diff", 
    "Home Rolling Avg Margin Set", "Home Rolling Avg Margin Lead", 
    "Home Rolling Avg Receiver Points", "Home Rolling Avg Serving Points",
    'Score Ratio Interaction', 'Margin Set Lead Interaction', 
    'Receiver Serving Interaction', 'Home Rolling Avg Score^2', 
    'Away Rolling Avg Score^2', 'Rolling Avg Score Diff^2', 
    'Home Rolling Avg Margin Set^2', 'Home Rolling Avg Margin Lead^2',
    'Home Rolling Avg Receiver Points^2', 'Home Rolling Avg Serving Points^2'
]]

# Scale the features
scaler = StandardScaler()
X_extended_scaled = scaler.fit_transform(X_extended)

# Split the data
X_train, X_val, y_train_home, y_val_home, y_train_away, y_val_away= train_test_split(
    X_extended_scaled, TT_data1["Spiele1"], TT_data1["Spiele2"], test_size=0.2, random_state=42
)

# Train the Gradient Boosting models (one for the home score, one for the away score)
gb_home = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, max_features='sqrt', random_state=42)
gb_away = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, subsample=1.0, max_features='sqrt', random_state=42)

gb_home.fit(X_train, y_train_home)
gb_away.fit(X_train, y_train_away)

TT_data1["Predicted Home Score"] = gb_home.predict(X_extended_scaled)
TT_data1["Predicted Away Score"] = gb_away.predict(X_extended_scaled)

# Calculate winning probabilities
total_predicted_scores = TT_data1["Predicted Home Score"] + TT_data1["Predicted Away Score"]
TT_data1["Home Win Probability"] = TT_data1["Predicted Home Score"] / total_predicted_scores
TT_data1["Away Win Probability"] = TT_data1["Predicted Away Score"] / total_predicted_scores

# Create column for true outcomes and calculate log loss
TT_data1["Home Win"] = (TT_data1["Spiele1"] > TT_data1["Spiele2"]).astype(int)
TT_data1["Away Win"] = (TT_data1["Spiele2"] > TT_data1["Spiele1"]).astype(int)

TT_data1["Log Loss"] = TT_data1.apply(
    lambda row: log_loss([row["Home Win"]], [[row["Away Win Probability"], row["Home Win Probability"]]], labels=[0, 1]), 
    axis=1
)

# Calculate average log loss (over all rows, training and validation alike)
average_log_loss = round(TT_data1["Log Loss"].mean(), 4)

print('Log loss :', average_log_loss)
TT_data1.to_csv('Data with prediction.csv', index=False)

Log loss : 0.6677
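
Two caveats apply to this figure. The cell predicts on every row of `X_extended_scaled`, so the average mixes training and validation rows. In addition, `train_test_split` shuffles the matches, so later matches can end up in the training set while earlier ones are scored. A minimal sketch of a chronological hold-out, assuming a date column as in the notebook; the helper `chronological_split` and the toy frame are hypothetical:

```python
import pandas as pd

def chronological_split(df, date_col, val_fraction=0.2):
    """Use the most recent `val_fraction` of matches as the validation set."""
    ordered = df.sort_values(date_col)
    n_val = max(1, round(len(ordered) * val_fraction))
    cut = len(ordered) - n_val
    return ordered.iloc[:cut], ordered.iloc[cut:]

toy = pd.DataFrame({
    "Event Date and Time": pd.date_range("2023-06-01", periods=10, freq="D"),
    "Home Win": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})
train, val = chronological_split(toy, "Event Date and Time")
print(len(train), len(val))  # 8 2
```

The models would then be fitted on `train` and the log loss reported on `val` only, which is closer to how the predictions are used on future games.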


In [202]:
TT_data.head()

| Unnamed: 0 | Id | Event Date and Time | Spieler1 | Spieler2 | Satz1 | Satz2 | Spiele1 | Spiele2 | Home Biggest lead | Away Biggest lead | ... | Home Rolling Avg Margin Lead^2 | Home Rolling Avg Receiver Points^2 | Home Rolling Avg Serving Points^2 | Predicted Home Score | Predicted Away Score | Home Win Probability | Away Win Probability | Home Win | Away Win | Log Loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16398 | 621 | 2023-06-26 20:00:00 | Biolek M. | Zaskodny M. | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 64.0 | 0.074798 | 0.122592 | -0.333422 | 0.311278 | 15.057133 | -14.057133 | 0 | 0 | 36.04365 |
| 16399 | 1515 | 2023-10-02 23:30:00 | Tuma D. | Zatecka L. | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 196.0 | 0.121913 | 0.078486 | 1.842166 | 6.135768 | 0.230908 | 0.769092 | 0 | 0 | 0.2625442 |
| 16400 | 328 | 2023-06-28 20:00:00 | Stach J. | Zientek Z. | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 144.0 | 0.076808 | 0.139864 | -0.348443 | 0.047794 | 1.158971 | -0.158971 | 0 | 0 | 36.04365 |
| 16401 | 816 | 2023-06-25 11:30:00 | Jaksa M. | Zika M. | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 16.0 | 0.071881 | 0.111469 | 0.517831 | 0.6606 | 0.439424 | 0.560576 | 0 | 0 | 0.5787905 |
| 16402 | 309 | 2023-06-28 17:30:00 | Fnukal R. | Zivny P. | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 81.0 | 0.01058 | 0.010548 | -0.112661 | 0.651365 | -0.209134 | 1.209134 | 0 | 0 | 2.220446e-16 |
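
Several rows above show win probabilities outside the range 0 to 1 (for example 15.06 and -14.06). This happens when a predicted score is negative, because the ratio of a score to the sum of both scores is then no longer a valid probability. One possible alternative, which is a different mapping from the one used in this notebook, is a logistic transform of the predicted score difference. The helper `win_probability` and its `scale` parameter are hypothetical; `scale` would have to be tuned on held-out data.

```python
import numpy as np

def win_probability(pred_home, pred_away, scale=1.0):
    """Logistic transform of the predicted score difference; always in (0, 1)."""
    diff = np.asarray(pred_home, dtype=float) - np.asarray(pred_away, dtype=float)
    return 1.0 / (1.0 + np.exp(-diff / scale))

# Predicted scores taken from the first two rows of the output above.
p_home = win_probability([-0.333422, 1.842166], [0.311278, 6.135768])
p_away = 1.0 - p_home
print(p_home, p_away)
```

Equal predicted scores map to exactly 0.5, and the two probabilities always sum to 1, so the log loss stays finite for every row.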


## Predict future values

In [6]:
# Load the new Excel file
new_matches = pd.read_excel("Games.xlsx")
feature_cols = X_extended.columns.tolist()

# Join on Spieler1 for home player data, keeping only each player's last row
# so that the merge returns exactly one row per new match
home_features = TT_data.drop_duplicates('Spieler1', keep='last')[['Spieler1'] + feature_cols]
home_data = new_matches.merge(home_features, on='Spieler1', how='left')

# Extract features, scale and predict using the home model
X_home = scaler.transform(home_data[feature_cols])
home_data["Predicted Home Score"] = gb_home.predict(X_home)

# Join on Spieler2 for away player data (again one row per player)
away_features = TT_data.drop_duplicates('Spieler2', keep='last')[['Spieler2'] + feature_cols]
away_data = new_matches.merge(away_features, on='Spieler2', how='left')

# Extract features, scale and predict using the away model
X_away = scaler.transform(away_data[feature_cols])
away_data["Predicted Away Score"] = gb_away.predict(X_away)

# Consolidate predictions (row order matches new_matches after the one-to-one merges)
new_matches["Predicted Home Score"] = home_data["Predicted Home Score"].values
new_matches["Predicted Away Score"] = away_data["Predicted Away Score"].values

# Calculate winning probabilities
total_predicted_scores = new_matches["Predicted Home Score"] + new_matches["Predicted Away Score"]
new_matches["Home Win Probability"] = new_matches["Predicted Home Score"] / total_predicted_scores
new_matches["Away Win Probability"] = new_matches["Predicted Away Score"] / total_predicted_scores


new_matches[['Spieler1', 'Spieler2',
             "Predicted Home Score","Predicted Away Score",
             "Home Win Probability","Away Win Probability"]].to_csv('Games prediction.csv', index=False)

I compute rolling averages of players' score ratios, serving points, and receiver points, which serve as indicators of recent performance. These rolling averages, together with margins from previous matches and their interaction and squared terms, are standardized and fed into two gradient boosting regressors (GradientBoostingRegressor), one for the home player and one for the away player. The "Is Home" flag is constant and is not among the model inputs. The models are trained to predict the scores of both players in upcoming matches.

Once the models provide the predicted scores, I use them to calculate each player's probability of winning: each probability is the player's predicted score divided by the sum of both predicted scores, so the player with the higher predicted score is assigned the higher probability. I then compare these probabilities with the actual match outcomes to calculate the log loss, a measure of prediction accuracy. A lower log loss indicates better predictive performance. The algorithm was reported to achieve an average log loss of approximately 0.5907, although the run shown above printed 0.6677.
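
Written out, the two quantities computed in the notebook are as follows, where the symbols are introduced here for illustration: $\hat{s}_{\text{home}}$ and $\hat{s}_{\text{away}}$ are the predicted scores, $y_i \in \{0, 1\}$ indicates a home win in match $i$, and $N$ is the number of matches.

```latex
p_{\text{home}} = \frac{\hat{s}_{\text{home}}}{\hat{s}_{\text{home}} + \hat{s}_{\text{away}}},
\qquad
p_{\text{away}} = \frac{\hat{s}_{\text{away}}}{\hat{s}_{\text{home}} + \hat{s}_{\text{away}}} = 1 - p_{\text{home}}

\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[\, y_i \ln p_{\text{home},i} + (1 - y_i) \ln\bigl(1 - p_{\text{home},i}\bigr) \Bigr]
```

The ratio is a valid probability only when both predicted scores are positive; a negative prediction from either regressor pushes $p_{\text{home}}$ outside the interval from 0 to 1.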