The Bradley-Terry model for a tournament considers each match to be independent, and
\begin{equation}
    \Pr(\text{Player $a$ wins a match against Player $b$}) = \frac{\exp(\beta_a - \beta_b)}{1 + \exp(\beta_a - \beta_b)},
\end{equation}
for a parameter vector $\beta$ whose length equals the number of players. This model was first invented as a way to rank chess players.

The response variable is the outcome of a specific match. Since there are only two possible outcomes for each match, and the matches are considered independent, the response follows a Bernoulli distribution. Denoting $\pi = \Pr(a \text{ wins})$ and the linear predictor $\eta = \beta_a - \beta_b$, we can rearrange the equation to obtain
\begin{equation}
    \log\left(\frac{\pi}{1 - \pi}\right) = \eta.
\end{equation}
Therefore, the link function connecting the mean response to the linear predictor is the logit link function
\begin{equation}
    g(\pi) = \operatorname{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right).
\end{equation}
Let $N$ be the number of matches and $p$ the number of unique players, so that the parameter vector $\beta$ has length $p$. The design matrix $X$ is an $N \times p$ matrix. In the row for a match in which Player $a$ is listed first and Player $b$ second, so that $\eta = \beta_a - \beta_b$:

*   The column corresponding to Player $a$ takes the value $1$.
*   The column corresponding to Player $b$ takes the value $-1$.
*   All other columns are $0$.

The design matrix above is rank-deficient by $1$, because every row sums to zero, so a constraint such as $\beta_1 = 0$ or $\sum_i \beta_i = 0$ is usually imposed. If we exchange the order of the players, we instead calculate the probability that Player $b$ beats Player $a$: the linear predictor changes sign, and the probability formula gives $\Pr(b \text{ wins}) = 1 - \Pr(a \text{ wins})$.
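
As a minimal sketch of this construction, the following builds $X$ for a toy tournament and checks the rank claim. The player names and match list are invented for illustration and are not taken from the tennis data. The full matrix has rank $p - 1$, and deleting the column of a reference player (equivalent to fixing that player's coefficient at $0$) restores full column rank.

```python
import numpy as np

# Hypothetical toy data: (first player, second player) for each match.
matches = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A")]
players = sorted({name for match in matches for name in match})
col = {name: j for j, name in enumerate(players)}

# +1 in the first player's column, -1 in the second player's column.
X = np.zeros((len(matches), len(players)))
for i, (a, b) in enumerate(matches):
    X[i, col[a]] = 1.0
    X[i, col[b]] = -1.0

# Every row sums to zero, so the columns are linearly dependent.
rank_full = np.linalg.matrix_rank(X)

# Imposing beta_A = 0 amounts to deleting Player A's column.
X_ref = np.delete(X, col["A"], axis=1)
rank_ref = np.linalg.matrix_rank(X_ref)

print(rank_full, rank_ref)
```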

In [179]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/MyDrive/Advanced Computational Maths/Statistics and Machine Learning/Tennis Modelling'
df = pd.read_csv(path + '/mensResults.csv', encoding='latin1')
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%y')



In [180]:
def preprocess_data(df):
    '''
    Splits data into training (2000-2014) and testing (2015-2016) sets.
    Assumes df['Date'] has already been parsed to datetime.
    '''
    train_mask = (df['Date'].dt.year >= 2000) & (df['Date'].dt.year <= 2014)
    test_mask = (df['Date'].dt.year >= 2015) & (df['Date'].dt.year <= 2016)

    return df.loc[train_mask].copy(), df.loc[test_mask].copy()

def create_player_map(df, ref_player):
    '''
    Identifies all unique players in the training set and creates a dictionary
    mapping player names to column indices. The reference player is excluded
    from the map (coefficient fixed at 0).
    '''
    # Get all unique players
    unique_players = set(df['Winner'].unique()) | set(df['Loser'].unique())

    # Remove reference player so they don't get a column in X
    if ref_player in unique_players:
        unique_players.remove(ref_player)

    # Sort for consistency and create map
    sorted_players = sorted(list(unique_players))
    return {player: i for i, player in enumerate(sorted_players)}

def build_design_matrix(df, player_map, ref_player):
    '''
    Constructs the design matrix X and response vector y.
    X entries: +1 for Winner, -1 for Loser, 0 otherwise.
    y entries: Always 1 (representing that the 'Winner' won).

    Filters out matches involving players not present in player_map
    (unless they are the reference player).
    '''
    N_matches = len(df)
    N_features = len(player_map)

    winners = df['Winner'].values
    losers = df['Loser'].values

    valid_indices = []
    X_list = []

    for i in range(N_matches):
        w, l = winners[i], losers[i]

        # Check if players are "known" (in the map OR are the reference player)
        w_is_ref = (w == ref_player)
        l_is_ref = (l == ref_player)
        w_in_map = (w in player_map)
        l_in_map = (l in player_map)

        # Proceed only if we can account for both players
        if (w_in_map or w_is_ref) and (l_in_map or l_is_ref):
            row = np.zeros(N_features)

            if w_in_map:
                row[player_map[w]] = 1
            if l_in_map:
                row[player_map[l]] = -1

            X_list.append(row)
            valid_indices.append(i)

    X = np.array(X_list)
    y = np.ones(len(X)) # Response is always 1

    return X, y

def train_bradley_terry(X_train, y_train):
    '''
    Fits the Bradley-Terry model using Logistic Regression without an intercept.
    '''
    # sm.Logit does not add an intercept by default.
    model = sm.Logit(y_train, X_train)
    result = model.fit(method='ncg', disp=0)
    return result

def calculate_logistic_loss(result, X, y):
    '''
    Calculates the average negative log-likelihood (Logistic Loss).
    Formula: - (1/N) * sum( log(predicted_probability) )
    '''
    # Predict probability that y=1 (Winner wins)
    preds = result.predict(X)

    # Clip probabilities slightly to prevent log(0) error
    epsilon = 1e-15
    preds = np.clip(preds, epsilon, 1 - epsilon)

    # Calculate Mean Negative Log Likelihood
    loss = -np.mean(np.log(preds))
    return loss

In [181]:
df_train, df_test = preprocess_data(df)

print(f"Training Samples: {len(df_train)}, Test Samples: {len(df_test)}")
REF_PLAYER = "Agassi A."
player_map = create_player_map(df_train, REF_PLAYER)
print(f"Number of predictors (players excluding ref): {len(player_map)}")

X_train, y_train = build_design_matrix(df_train, player_map, REF_PLAYER)
X_test, y_test = build_design_matrix(df_test, player_map, REF_PLAYER)
model_result = train_bradley_terry(X_train, y_train)

train_loss = calculate_logistic_loss(model_result, X_train, y_train)
test_loss = calculate_logistic_loss(model_result, X_test, y_test)
print(f"\nLogistic Loss (Train 2000-2014): {train_loss:.5f}")
print(f"Logistic Loss (Test 2015-2016):  {test_loss:.5f}")

Training Samples: 4489, Test Samples: 420
Number of predictors (players excluding ref): 59

Logistic Loss (Train 2000-2014): 0.59937
Logistic Loss (Test 2015-2016):  0.55010


The choice of reference player does not affect the fitted values (the predicted probabilities of winning). In the Bradley-Terry model, the probability that Player $i$ beats Player $j$ depends only on the difference between their coefficients, $\beta_i - \beta_j$. Choosing a different reference player shifts all coefficients by the same constant $c$, so every difference remains unchanged.
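
The invariance can be checked numerically. The sketch below uses made-up coefficients (not the fitted tennis values) and a hypothetical helper `win_prob`; it re-bases the coefficients on a different reference player and compares all pairwise win probabilities.

```python
import numpy as np

def win_prob(beta_i, beta_j):
    """Bradley-Terry probability that player i beats player j."""
    return 1.0 / (1.0 + np.exp(-(beta_i - beta_j)))

# Hypothetical coefficients with the first player as reference (beta = 0).
beta = np.array([0.0, 0.8, -0.3])

# Switching the reference to the second player subtracts beta[1] from everyone.
beta_shifted = beta - beta[1]

# All pairwise win probabilities under each parameterisation.
P = win_prob(beta[:, None], beta[None, :])
P_shifted = win_prob(beta_shifted[:, None], beta_shifted[None, :])

print(np.allclose(P, P_shifted))
```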

To find a confidence interval for a probability in a logistic regression model, we first calculate the interval for the linear predictor and then transform its endpoints using the logistic function. This approach is preferred because the sampling distribution of the estimated linear predictor is closer to normal than that of the probability itself, and because the transformed endpoints always lie in $(0, 1)$.

The linear predictor for Federer beating Murray is:
\begin{equation}
    \eta = \hat{\beta}_{\text{Federer}} - \hat{\beta}_{\text{Murray}}.
\end{equation}
Since $\hat{\beta}_{\text{Federer}}$ and $\hat{\beta}_{\text{Murray}}$ are correlated estimates from the same model, the variance of their difference is
\begin{equation}
    \text{Var}(\eta) = \text{Var}(\hat{\beta}_F) + \text{Var}(\hat{\beta}_M) - 2\text{Cov}(\hat{\beta}_F, \hat{\beta}_M).
\end{equation}
A $68\%$ confidence interval corresponds approximately to one standard error ($z \approx 1$) on either side of the estimate under a normal approximation:
\begin{equation}
    \text{SE}(\eta) = \sqrt{\text{Var}(\eta)}, \quad \text{CI}_{\eta} = [\eta - \text{SE}(\eta),\ \eta + \text{SE}(\eta)].
\end{equation}
Finally, apply the inverse link (the logistic function) to the point estimate and to both endpoints:
\begin{equation}
    \Pr(\text{Federer wins}) = \frac{e^{\eta}}{1 + e^{\eta}}.
\end{equation}
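
The same calculation can be written with a contrast vector $c$ that has $+1$ in one player's position and $-1$ in the other's, since $c^\top \Sigma c = \text{Var}(\hat{\beta}_i) + \text{Var}(\hat{\beta}_j) - 2\text{Cov}(\hat{\beta}_i, \hat{\beta}_j)$. The sketch below is illustrative only: the helper name `contrast_ci`, the estimates, and the covariance matrix are all hypothetical.

```python
import numpy as np

def contrast_ci(beta_hat, cov, i, j, z=1.0):
    """Interval for P(i beats j): build it on eta, then map through the logistic."""
    c = np.zeros(len(beta_hat))
    c[i], c[j] = 1.0, -1.0
    eta = c @ beta_hat
    # c' Sigma c equals Var_i + Var_j - 2 * Cov_ij
    se = np.sqrt(c @ cov @ c)

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    return logistic(eta), logistic(eta - z * se), logistic(eta + z * se)

# Hypothetical estimates and covariance matrix, for illustration only.
beta_hat = np.array([1.2, 0.7, -0.1])
cov = np.array([[0.040, 0.010, 0.005],
                [0.010, 0.050, 0.004],
                [0.005, 0.004, 0.030]])

p, lo, hi = contrast_ci(beta_hat, cov, 0, 1)
print(lo < p < hi)
```

Because the logistic function is increasing, the transformed interval always contains the point estimate, although it is generally not symmetric around it.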

In [182]:
def calculate_federer_murray_ci(model_result, player_map):
    p1 = "Federer R."
    p2 = "Murray A."

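    # Fail loudly if either player has no coefficient in the model
    # (this includes the reference player, who is excluded from player_map).
    # Returning None here would break the tuple unpacking at the call site.
    if p1 not in player_map or p2 not in player_map:
        raise KeyError(f"Could not find {p1} or {p2} in the model.")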

    # Retrieve Coefficients
    idx_p1 = player_map[p1]
    idx_p2 = player_map[p2]
    beta_p1 = model_result.params[idx_p1]
    beta_p2 = model_result.params[idx_p2]

    # Linear predictor (eta)
    eta = beta_p1 - beta_p2

    # Calculate Variance and Standard Error
    cov_matrix = model_result.cov_params()

    var_p1 = cov_matrix[idx_p1, idx_p1]
    var_p2 = cov_matrix[idx_p2, idx_p2]
    cov_p1_p2 = cov_matrix[idx_p1, idx_p2]

    # Var(A - B) = Var(A) + Var(B) - 2*Cov(A,B)
    var_diff = var_p1 + var_p2 - 2 * cov_p1_p2
    se_diff = np.sqrt(var_diff)

    # Calculate 68% CI for the linear predictor (Z approx 1.0)
    z_score = 1.0

    eta_lower = eta - (z_score * se_diff)
    eta_upper = eta + (z_score * se_diff)

    # Transform using the logistic function
    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    prob_est = logistic(eta)
    prob_lower = logistic(eta_lower)
    prob_upper = logistic(eta_upper)

    return eta, prob_est, prob_lower, prob_upper

eta, prob_est, prob_lower, prob_upper = calculate_federer_murray_ci(model_result, player_map)
print(f"Probability Federer beats Murray: {prob_est:.4f}")
print(f"68% Confidence Interval: [{prob_lower:.4f}, {prob_upper:.4f}]")

Probability Federer beats Murray: 0.6526
68% Confidence Interval: [0.6122, 0.6910]


It is well known that certain players enjoy an advantage on specific surfaces. We suggest a new model with
\begin{equation}
    \Pr(\text{Player $a$ wins a match against Player $b$ on surface $s$}) = \frac{\exp(\beta_a + \beta_{a,s} - \beta_b - \beta_{b,s})}{1 + \exp(\beta_a + \beta_{a,s} - \beta_b - \beta_{b,s})},
\end{equation}
where $\beta_{a,s}$ can be interpreted as the advantage of Player $a$ on surface $s$ compared to a baseline fitness $\beta_a$.

In [183]:
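def build_surface_design_matrix(data, feature_map, ref_player, ref_surface):
    '''
    Constructs the design matrix X and response vector y for the surface model.
    Baseline columns: +1 for Winner, -1 for Loser (reference player has no column).
    Interaction columns "{player}_{surface}": +1 / -1 on non-reference surfaces.

    Unlike build_design_matrix, no rows are dropped: a player or interaction
    missing from feature_map simply contributes 0 to that row.
    '''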
    N = len(data)
    P = len(feature_map)
    X = np.zeros((N, P))

    winners = data['Winner'].values
    losers = data['Loser'].values
    surf = data['Surface'].values

    for i in range(N):
        w, l, s = winners[i], losers[i], surf[i]

        # Baseline Part (Beta_w - Beta_l)
        if w != ref_player and w in feature_map:
            X[i, feature_map[w]] = 1
        if l != ref_player and l in feature_map:
            X[i, feature_map[l]] = -1

        # Interaction Part (Beta_w,s - Beta_l,s)
        if s != ref_surface:
            w_int = f"{w}_{s}"
            l_int = f"{l}_{s}"

            # We must check if the interaction key exists
            if w_int in feature_map:
                X[i, feature_map[w_int]] = 1
            if l_int in feature_map:
                X[i, feature_map[l_int]] = -1

    return X, np.ones(N)

In [184]:
REF_PLAYER = "Agassi A."
REF_SURFACE = "Hard"

# Identify all players and surfaces from training data
players = sorted(list(set(df_train['Winner']) | set(df_train['Loser'])))
surfaces = sorted(list(df_train['Surface'].unique()))
# Baseline: All players except Agassi
baseline_vars = [p for p in players if p != REF_PLAYER]
# Interactions: All players (including Agassi) on non-Hard surfaces
interaction_vars = []
for p in players:
    for s in surfaces:
        if s != REF_SURFACE:
            interaction_vars.append(f"{p}_{s}")

all_feature_names = baseline_vars + interaction_vars
feature_map = {name: i for i, name in enumerate(all_feature_names)}

X_train_full, y_train = build_surface_design_matrix(df_train, feature_map, REF_PLAYER, REF_SURFACE)
# Identify columns that are all zeros (unused interactions)
active_cols_mask = np.abs(X_train_full).sum(axis=0) > 0
X_train_active = X_train_full[:, active_cols_mask]

print(f"Original features: {X_train_full.shape[1]}")
print(f"Active features:   {np.sum(active_cols_mask)}")

model_surf_result = train_bradley_terry(X_train_active, y_train)
X_test_full, y_test = build_surface_design_matrix(df_test, feature_map, REF_PLAYER, REF_SURFACE)
X_test_active = X_test_full[:, active_cols_mask]

train_loss = calculate_logistic_loss(model_surf_result, X_train_active, y_train)
test_loss = calculate_logistic_loss(model_surf_result, X_test_active, y_test)

print(f"\nSurface Model Train Loss: {train_loss:.5f}")
print(f"Surface Model Test Loss:  {test_loss:.5f}")

Original features: 239
Active features:   233

Surface Model Train Loss: 0.56377
Surface Model Test Loss:  0.65420


The surface model has a lower training loss than the simple model (0.56377 versus 0.59937): the additional surface-specific parameters increase model complexity and let it fit the historical data more closely. On the test data, however, the surface model performs worse (0.65420 versus 0.55010). This suggests overfitting: the extra parameters capture noise in the training set, and the model fails to generalise to new data.

To formally compare the nested models where the simple model is a restricted version of the surface model, we use the likelihood ratio test.

The null hypothesis is $H_0$: all additional surface interaction parameters are zero. The alternative hypothesis is $H_1$: at least one surface interaction parameter is non-zero. Our test statistic is
\begin{equation}
    D = -2 \log\left( \frac{\mathcal{L}_{\text{simple}}}{\mathcal{L}_{\text{surface}}} \right)  = 2(\ell_{\text{surface}} - \ell_{\text{simple}}),
\end{equation}
where $\ell$ is the log-likelihood of the fitted model. Under $H_0$, the statistic $D$ asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models.

If we reject $H_0$, the improvement in training fit is statistically significant, that is, larger than would be expected by chance from the extra parameters alone. Statistical significance on the training data does not by itself imply better out-of-sample prediction: the test's conclusion agrees with the held-out evaluation only if the surface model also attains a lower test loss than the simple model, which is not the case here (0.65420 versus 0.55010).



In [185]:
from scipy.stats import chi2

def perform_likelihood_ratio_test(model_simple, model_complex):
    # Retrieve Log-Likelihoods
    ll_simple = model_simple.llf
    ll_complex = model_complex.llf

    # Calculate Test Statistic (Deviance)
    deviance = 2 * (ll_complex - ll_simple)

    # Calculate Degrees of Freedom Difference
    df_diff = model_complex.df_model - model_simple.df_model

    # Calculate p-value (Survival function of Chi-Squared distribution)
    p_value = chi2.sf(deviance, df_diff)

    print(f"Likelihood Ratio Test Results:")
    print(f"Simple Model LL:  {ll_simple:.4f}")
    print(f"Surface Model LL: {ll_complex:.4f}")
    print(f"Statistic (D):    {deviance:.4f}")
    print(f"df difference:    {int(df_diff)}")
    print(f"p-value:          {p_value:.4e}\n")

    alpha = 0.01
    if p_value < alpha:
        print(f"Reject the null hypothesis at {alpha*100}% level.")
        print("The Surface Model is statistically significantly better.")
    else:
        print(f"Fail to reject the null hypothesis at {alpha*100}% level.")
        print("The Simple Model is sufficient.")

perform_likelihood_ratio_test(model_result, model_surf_result)

Likelihood Ratio Test Results:
Simple Model LL:  -2690.5587
Surface Model LL: -2530.7623
Statistic (D):    319.5929
df difference:    171
p-value:          4.3545e-11

Reject the null hypothesis at 1.0% level.
The Surface Model is statistically significantly better.


Currently, our response variable is binary (Win/Loss). This ignores how close the match was. We can use the game counts W1-W5 and L1-L5 to create a more granular model. We sum the games won by the winner and the loser to get the total score
\begin{align}
    y_i &= \sum_{k=1}^5 W_k && \text{(total games won by the match winner)}, \\
    n_i &= \sum_{k=1}^5 (W_k + L_k) && \text{(total games played in the match)}.
\end{align}
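
A small sketch of these two sums follows. The set scores are invented, and it assumes that a set which was not played is stored as NaN; that assumption should be checked against the actual W1-W5 and L1-L5 columns.

```python
import numpy as np

# Hypothetical set scores for two matches; NaN marks a set that was not played.
W = np.array([[6, 6, np.nan, np.nan, np.nan],
              [6, 3, 7, 6, np.nan]], dtype=float)
L = np.array([[4, 3, np.nan, np.nan, np.nan],
              [4, 6, 6, 2, np.nan]], dtype=float)

y = np.nansum(W, axis=1)          # games won by the match winner
n = y + np.nansum(L, axis=1)      # total games played in the match

print(y, n)
```

With a pandas DataFrame, summing the five columns row-wise with `sum(axis=1)` should give the same totals, since pandas skips NaN values by default.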
Instead of a Bernoulli distribution, we now use a binomial distribution to model the number of games won out of the games played. We take $Y_i \sim \text{Binomial}(n_i, \pi_i)$ and again use the logit link function
\begin{equation}
    \eta = \log\left(\frac{\pi}{1-\pi}\right),
\end{equation}
where the linear predictor is $\eta = \beta_a - \beta_b$. Here $\pi_i$ is the probability that Player $a$ wins an individual game against Player $b$, so the coefficients $\beta$ now represent a player's ability to win games rather than matches. The binomial model treats the games within a match as independent with a common win probability. Under that assumption, each game contributes to the likelihood, which effectively increases the sample size and allows the model to distinguish between close and one-sided matches.

The design matrix $X$ remains exactly the same as before. The response vector is altered so that instead of passing a vector of $1$'s, we instead pass the count of successes $y_i$ and trials $n_i$ to the GLM.