# 2. Modeling – Predicting Whether the Stronger Player Wins

In this notebook, we build machine learning models to predict whether the **stronger player (higher-ranked)** wins a tennis match.

**Target variable:**

- `stronger_win = 1` → higher-ranked player wins (expected outcome)
- `stronger_win = 0` → lower-ranked player wins (upset)

We use the cleaned dataset generated in the EDA notebook: `clean_matches_tennis.csv`.
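
As a minimal sketch of how the target relates to the raw rankings, the snippet below rebuilds `stronger_win` from `winner_rank` and `loser_rank` for three of the sample matches shown in section 2.1. It assumes that a lower rank number means a stronger player. The actual definition lives in the EDA notebook and may treat ties or missing ranks differently.

```python
import pandas as pd

# Three matches taken from the sample rows of the cleaned dataset.
toy = pd.DataFrame({
    "winner_rank": [9.0, 16.0, 239.0],
    "loser_rank": [16.0, 239.0, 31.0],
})

# Assumption: a lower rank number means a stronger player.
toy["stronger_win"] = (toy["winner_rank"] < toy["loser_rank"]).astype(int)

print(toy["stronger_win"].tolist())  # [1, 1, 0]
```

The third match is an upset: the winner was ranked 239 and the loser 31.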


In [20]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# If xgboost is not installed, run this in a notebook cell first: %pip install xgboost
from xgboost import XGBClassifier

pd.set_option("display.max_columns", 100)


## 2.1 Load Cleaned Dataset

We load the cleaned dataset `clean_matches_tennis.csv` created in the EDA notebook.


In [21]:
DATA_PATH = "./clean_matches_tennis.csv"  # adjust to your layout; if the file is one level up, use ../Data/...

if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Cannot find {DATA_PATH}. Make sure you ran the EDA notebook and saved this file.")

df = pd.read_csv(DATA_PATH)
print("Data shape:", df.shape)
df.head()


Data shape: (2760, 24)


| | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_name | winner_id | winner_rank | winner_rank_points | winner_age | loser_name | loser_id | loser_rank | loser_rank_points | loser_age | stronger_rank | weaker_rank | rank_gap_abs | points_diff | age_diff | best_of | stronger_win |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 300 | Kei Nishikori | 105453 | 9.0 | 3590.0 | 29.0 | Daniil Medvedev | 106421 | 16.0 | 1977.0 | 22.8 | 9.0 | 16.0 | 7.0 | 1613.0 | 6.2 | 3 | 1 |
| 1 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 299 | Daniil Medvedev | 106421 | 16.0 | 1977.0 | 22.8 | Jo-Wilfried Tsonga | 104542 | 239.0 | 200.0 | 33.7 | 16.0 | 239.0 | 223.0 | 1777.0 | -10.9 | 3 | 1 |
| 2 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 298 | Kei Nishikori | 105453 | 9.0 | 3590.0 | 29.0 | Jeremy Chardy | 104871 | 40.0 | 1050.0 | 31.8 | 9.0 | 40.0 | 31.0 | 2540.0 | -2.8 | 3 | 1 |
| 3 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 297 | Jo-Wilfried Tsonga | 104542 | 239.0 | 200.0 | 33.7 | Alex De Minaur | 200282 | 31.0 | 1298.0 | 19.8 | 31.0 | 239.0 | 208.0 | -1098.0 | 13.9 | 3 | 0 |
| 4 | 2019-M020 | Brisbane | Hard | 32 | A | 20181231 | 296 | Daniil Medvedev | 106421 | 16.0 | 1977.0 | 22.8 | Milos Raonic | 105683 | 18.0 | 1855.0 | 28.0 | 16.0 | 18.0 | 2.0 | 122.0 | -5.2 | 3 | 1 |


## 2.2 Target Variable: `stronger_win`

We confirm the distribution of the target:

- 1 → stronger player wins  
- 0 → upset (weaker player wins)


In [22]:
if "stronger_win" not in df.columns:
    raise KeyError("Column 'stronger_win' not found. Make sure you used the EDA notebook definition.")

df["stronger_win"].value_counts(normalize=True)


stronger_win
1    0.614493
0    0.385507
Name: proportion, dtype: float64

In [23]:
df["stronger_win"].value_counts()


stronger_win
1    1696
0    1064
Name: count, dtype: int64
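
The class counts above already fix a reference point for every model in this notebook: a rule that always predicts "stronger player wins" is right whenever the label is 1. The short calculation below is only arithmetic on the reported counts, not an additional model.

```python
# Class counts reported above.
n_stronger_wins = 1696
n_upsets = 1064

n_total = n_stronger_wins + n_upsets
majority_baseline = max(n_stronger_wins, n_upsets) / n_total

print(n_total)                      # 2760
print(round(majority_baseline, 4))  # 0.6145
```

Any classifier has to beat roughly 0.61 accuracy to add value over this rule.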

## 2.3 Feature Selection

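We focus on features that are available **before** a match:

**Numeric features:**
- `rank_gap_abs` – absolute ranking gap (weaker_rank − stronger_rank)
- `age_diff` – age difference, computed as winner minus loser in the cleaned data (see the sample rows in 2.1). Since it is defined relative to the winner rather than the stronger player, its sign is not strictly known before the match.

`points_diff` (ranking points, also winner minus loser) is present in the dataset but is not passed to the models below. Its sign would largely reveal whether the higher-ranked player won.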

**Categorical features:**
- `surface` – court surface
- `tourney_level` – tournament category
- `best_of` – number of sets (3 or 5)

**Target:**
- `y = stronger_win`
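
Difference features in the cleaned data are oriented winner minus loser, which is only known after the match. One possible alternative, sketched below on two of the sample rows, is to orient the difference as stronger minus weaker instead. The column name `age_diff_sw` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Two sample matches: one expected result and one upset.
toy = pd.DataFrame({
    "winner_rank": [9.0, 239.0],
    "loser_rank": [16.0, 31.0],
    "winner_age": [29.0, 33.7],
    "loser_age": [22.8, 19.8],
})

# Assumption: a lower rank number means a stronger player.
winner_is_stronger = toy["winner_rank"] < toy["loser_rank"]

# Stronger-minus-weaker age difference (hypothetical column name).
toy["age_diff_sw"] = np.where(
    winner_is_stronger,
    toy["winner_age"] - toy["loser_age"],
    toy["loser_age"] - toy["winner_age"],
)

print(toy["age_diff_sw"].round(1).tolist())  # [6.2, -13.9]
```

For the upset row the sign flips relative to the stored `age_diff` of 13.9, because the stronger player (the loser) is the younger one.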


In [24]:
# Numeric and categorical feature names.
# points_diff is left out: it is winner minus loser, so its sign would leak the outcome.
numeric_features = ["rank_gap_abs", "age_diff"]
categorical_features = ["surface", "tourney_level", "best_of"]

for col in numeric_features + categorical_features + ["stronger_win"]:
    if col not in df.columns:
        raise KeyError(f"Missing required column: {col}")

X = df[numeric_features + categorical_features].copy()
y = df["stronger_win"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

X_train.shape, X_test.shape


((2208, 5), (552, 5))

## 2.4 Preprocessing and Pipelines

We use a `ColumnTransformer` to:

- Pass numeric features as-is (tree-based models do not require scaling).
- One-hot encode categorical features.

Then we wrap this preprocessor with different classifiers:

- Decision Tree
- Random Forest
- XGBoost


In [25]:
preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)


## 2.5 Baseline Models: Decision Tree & Random Forest


In [26]:
# Decision Tree
dt_clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", DecisionTreeClassifier(max_depth=None, random_state=42)),
    ]
)

dt_clf.fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)

acc_dt = accuracy_score(y_test, y_pred_dt)
cm_dt = confusion_matrix(y_test, y_pred_dt)

print("=== Decision Tree ===")
print("Accuracy:", acc_dt)
print("Confusion matrix:\n", cm_dt)
print(classification_report(y_test, y_pred_dt))


# Random Forest
rf_clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(
            n_estimators=200,
            max_depth=None,
            n_jobs=-1,
            random_state=42
        )),
    ]
)

rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

acc_rf = accuracy_score(y_test, y_pred_rf)
cm_rf = confusion_matrix(y_test, y_pred_rf)

print("\n=== Random Forest ===")
print("Accuracy:", acc_rf)
print("Confusion matrix:\n", cm_rf)
print(classification_report(y_test, y_pred_rf))


=== Decision Tree ===
Accuracy: 0.5253623188405797
Confusion matrix:
 [[ 77 136]
 [126 213]]
              precision    recall  f1-score   support

           0       0.38      0.36      0.37       213
           1       0.61      0.63      0.62       339

    accuracy                           0.53       552
   macro avg       0.49      0.49      0.49       552
weighted avg       0.52      0.53      0.52       552


=== Random Forest ===
Accuracy: 0.5670289855072463
Confusion matrix:
 [[ 79 134]
 [105 234]]
              precision    recall  f1-score   support

           0       0.43      0.37      0.40       213
           1       0.64      0.69      0.66       339

    accuracy                           0.57       552
   macro avg       0.53      0.53      0.53       552
weighted avg       0.56      0.57      0.56       552



## 2.6 XGBoost – Baseline Model

We now build a baseline XGBoost model using the same preprocessed features.


In [27]:
xgb_base = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", XGBClassifier(
            objective="binary:logistic",
            eval_metric="logloss",
            tree_method="hist",   # fast on CPU
            random_state=42,
        )),
    ]
)

xgb_base.fit(X_train, y_train)
y_pred_xgb_base = xgb_base.predict(X_test)

acc_xgb_base = accuracy_score(y_test, y_pred_xgb_base)
cm_xgb_base = confusion_matrix(y_test, y_pred_xgb_base)

print("=== XGBoost (baseline) ===")
print("Accuracy:", acc_xgb_base)
print("Confusion matrix:\n", cm_xgb_base)
print(classification_report(y_test, y_pred_xgb_base))


=== XGBoost (baseline) ===
Accuracy: 0.5416666666666666
Confusion matrix:
 [[ 58 155]
 [ 98 241]]
              precision    recall  f1-score   support

           0       0.37      0.27      0.31       213
           1       0.61      0.71      0.66       339

    accuracy                           0.54       552
   macro avg       0.49      0.49      0.49       552
weighted avg       0.52      0.54      0.52       552



## 2.7 XGBoost – Hyperparameter Tuning

We perform hyperparameter tuning for XGBoost using `RandomizedSearchCV`:

- `n_estimators`
- `max_depth`
- `learning_rate`
- `subsample`
- `colsample_bytree`
- `min_child_weight`
- `gamma`

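The search samples 20 parameter combinations and scores each one by 3-fold cross-validated accuracy on the training set. The test set is kept aside for the final evaluation.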
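
For a sense of scale, the snippet below counts how many distinct combinations the search grid in this section contains and what share of them a 20-iteration random search visits. It only restates the grid values used in the following cell.

```python
import math

# Same candidate values as the search space used in this section.
param_dist = {
    "model__n_estimators": [200, 400, 600, 800],
    "model__max_depth": [3, 4, 5, 6, 8],
    "model__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "model__subsample": [0.7, 0.8, 0.9, 1.0],
    "model__colsample_bytree": [0.7, 0.8, 0.9, 1.0],
    "model__min_child_weight": [1, 3, 5, 7],
    "model__gamma": [0, 0.1, 0.2],
}

n_combinations = math.prod(len(values) for values in param_dist.values())
n_iter = 20

print(n_combinations)                    # 15360
print(f"{n_iter / n_combinations:.2%}")  # 0.13%
```

A random search therefore explores only a small sample of the grid, which is why the reported best parameters should be read as "best among those tried".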


In [28]:
xgb_clf = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", XGBClassifier(
            objective="binary:logistic",
            eval_metric="logloss",
            tree_method="hist",
            random_state=42,
        )),
    ]
)

param_dist = {
    "model__n_estimators":    [200, 400, 600, 800],
    "model__max_depth":       [3, 4, 5, 6, 8],
    "model__learning_rate":   [0.01, 0.05, 0.1, 0.2],
    "model__subsample":       [0.7, 0.8, 0.9, 1.0],
    "model__colsample_bytree":[0.7, 0.8, 0.9, 1.0],
    "model__min_child_weight":[1, 3, 5, 7],
    "model__gamma":           [0, 0.1, 0.2],
}

xgb_search = RandomizedSearchCV(
    estimator=xgb_clf,
    param_distributions=param_dist,
    n_iter=20,           # number of sampled configurations; lower or raise to fit your time budget
    scoring="accuracy",
    cv=3,
    verbose=1,
    n_jobs=-1,
    random_state=42,
)

xgb_search.fit(X_train, y_train)

print("Best CV accuracy:", xgb_search.best_score_)
print("Best params:", xgb_search.best_params_)


Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best CV accuracy: 0.6100543478260869
Best params: {'model__subsample': 0.7, 'model__n_estimators': 400, 'model__min_child_weight': 7, 'model__max_depth': 4, 'model__learning_rate': 0.01, 'model__gamma': 0, 'model__colsample_bytree': 1.0}


## 2.8 Evaluate Best XGBoost Model on Test Set


In [29]:
best_xgb = xgb_search.best_estimator_

y_pred_best = best_xgb.predict(X_test)
acc_best = accuracy_score(y_test, y_pred_best)
cm_best = confusion_matrix(y_test, y_pred_best)

print("=== Best XGBoost (tuned) on Test Set ===")
print("Test accuracy:", acc_best)
print("Confusion matrix:\n", cm_best)
print(classification_report(y_test, y_pred_best))


=== Best XGBoost (tuned) on Test Set ===
Test accuracy: 0.6177536231884058
Confusion matrix:
 [[ 39 174]
 [ 37 302]]
              precision    recall  f1-score   support

           0       0.51      0.18      0.27       213
           1       0.63      0.89      0.74       339

    accuracy                           0.62       552
   macro avg       0.57      0.54      0.51       552
weighted avg       0.59      0.62      0.56       552



## 2.9 Save the Best Model

We save the tuned XGBoost pipeline to disk so that it can be used later in the **prediction demo** notebook.


In [30]:
import joblib

MODEL_DIR = "./Model"
os.makedirs(MODEL_DIR, exist_ok=True)

MODEL_PATH = os.path.join(MODEL_DIR, "best_tennis_xgb.pkl")
joblib.dump(best_xgb, MODEL_PATH)

print("Saved best XGBoost model to:", MODEL_PATH)


Saved best XGBoost model to: ./Model\best_tennis_xgb.pkl
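
The sketch below shows how a saved pipeline might be reloaded with `joblib.load` and applied to a new match. So that it runs without the real model file, it trains a tiny stand-in pipeline (a decision tree on four made-up rows) and writes it to a temporary directory. The file name `stand_in_model.pkl` and all values are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Stand-in training data so this snippet runs without the real model file.
train = pd.DataFrame({
    "rank_gap_abs": [7.0, 223.0, 31.0, 208.0],
    "age_diff": [6.2, -10.9, -2.8, 13.9],
    "surface": ["Hard", "Hard", "Clay", "Grass"],
    "tourney_level": ["A", "A", "G", "M"],
    "best_of": [3, 3, 5, 3],
})
labels = [1, 1, 1, 0]

stand_in = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", "passthrough", ["rank_gap_abs", "age_diff"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["surface", "tourney_level", "best_of"]),
    ])),
    ("model", DecisionTreeClassifier(random_state=42)),
])
stand_in.fit(train, labels)

# Save and reload, mirroring the dump call above.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "stand_in_model.pkl")
    joblib.dump(stand_in, model_path)
    loaded = joblib.load(model_path)

# New matches must carry the same raw columns the pipeline was trained on.
new_match = pd.DataFrame([{
    "rank_gap_abs": 12.0, "age_diff": 1.5,
    "surface": "Hard", "tourney_level": "A", "best_of": 3,
}])
prediction = loaded.predict(new_match)
print(prediction.shape)  # (1,)
```

Because the preprocessing is part of the saved pipeline, the caller passes raw feature columns and does not need to repeat the one-hot encoding.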



### Model Comparison

| Model                             | Test Accuracy |
|----------------------------------|---------------|
| Baseline (always stronger wins)  | **0.614**     |
| Decision Tree                    | 0.525         |
| Random Forest                    | 0.567         |
| XGBoost (baseline)               | 0.542         |
| **XGBoost (tuned)**              | **0.618**     |
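
The accuracies in the table can be re-derived from the confusion matrices printed earlier, as a quick consistency check. The baseline row follows from the test-set class counts: always predicting class 1 is correct on every true class-1 match.

```python
import numpy as np

# Confusion matrices as printed above (rows: true label, columns: prediction).
confusion_matrices = {
    "Decision Tree": [[77, 136], [126, 213]],
    "Random Forest": [[79, 134], [105, 234]],
    "XGBoost (baseline)": [[58, 155], [98, 241]],
    "XGBoost (tuned)": [[39, 174], [37, 302]],
}

accuracies = {}
for name, cm in confusion_matrices.items():
    cm = np.asarray(cm)
    accuracies[name] = np.trace(cm) / cm.sum()

# Always predicting class 1 is right on every true class-1 match.
first = np.asarray(confusion_matrices["Decision Tree"])
accuracies["Baseline (always stronger wins)"] = first[1].sum() / first.sum()

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

All four matrices share the same row sums (213 upsets, 339 expected results), so the baseline is the same for every model.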


## 2.X Model Evaluation – Metrics and Error Analysis

### 2.X.1 Metrics (tuned XGBoost)

For the final model, we use the tuned XGBoost classifier with the following hyperparameters:

- `n_estimators = 400`
- `max_depth = 4`
- `learning_rate = 0.01`
- `subsample = 0.7`
- `colsample_bytree = 1.0`
- `min_child_weight = 7`
- `gamma = 0`

On the held-out test set (20% of the data), the model achieves:

- **Accuracy:** ≈ **0.618**

The confusion matrix on the test set is:

$$
\begin{bmatrix}
39 & 174 \\
37 & 302
\end{bmatrix}
$$

where rows are the **true labels** and columns are the **predicted labels**:

- Row 0: `stronger_win = 0`  (upset: lower-ranked player wins)  
- Row 1: `stronger_win = 1`  (normal: higher-ranked player wins)

Treating `stronger_win = 1` as the positive class, the entries are:

- **True Negative (TN) = 39**  
  - True label = 0 (upset), prediction = 0  
  - Matches where an upset occurs and the model correctly predicts an upset.

- **False Positive (FP) = 174**  
  - True label = 0 (upset), prediction = 1  
  - Matches where an upset occurs, but the model incorrectly predicts that the stronger player will win.  
  - These are **missed upsets**.

- **False Negative (FN) = 37**  
  - True label = 1 (stronger player wins), prediction = 0  
  - Matches where the stronger player actually wins, but the model incorrectly predicts an upset.  
  - These are **false upset alarms**.

- **True Positive (TP) = 302**  
  - True label = 1 (stronger player wins), prediction = 1  
  - Matches where the stronger player wins and the model predicts this correctly.

From the classification report, we also observe:

- For class 1 (**stronger player wins**):  
  - Recall ≈ **0.89** → the model correctly identifies most matches where the stronger player wins.  
- For class 0 (**upset**):  
  - Recall ≈ **0.18** → the model only detects a small fraction of actual upsets.
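
The per-class figures in the classification report follow directly from the four confusion-matrix entries. The calculation below reproduces them, with class 1 as the positive class.

```python
# Confusion-matrix entries for the tuned XGBoost model (positive class = 1).
tn, fp, fn, tp = 39, 174, 37, 302

recall_1 = tp / (tp + fn)      # share of stronger-player wins that are found
precision_1 = tp / (tp + fp)   # share of "stronger wins" predictions that are right
recall_0 = tn / (tn + fp)      # share of upsets that are found
precision_0 = tn / (tn + fn)   # share of "upset" predictions that are right

f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))  # 0.63 0.89 0.74
print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))  # 0.51 0.18 0.27
print(round(accuracy, 3))                                         # 0.618
```

The model raises an upset flag for only 76 of 552 matches (TN + FN), and about half of those flags are correct.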

---

### 2.X.2 Explanation of Model Errors (False Positives & False Negatives)

In this problem, the positive class is:

- `stronger_win = 1` → higher-ranked player wins (the **expected** outcome)

and the negative class is:

- `stronger_win = 0` → an upset occurs (lower-ranked player wins)

From this perspective:

#### False Positives (FP = 174) – Missed Upsets

- These are matches where **an upset actually happened**, but the model still predicted that the stronger player would win.
- In other words, the model is **too optimistic about the stronger player**, and fails to anticipate that the underdog will win.
- This is not surprising, because:
  - The EDA showed that, as the ranking gap increases, the stronger player’s win rate can easily reach 70–90%.  
  - Most of the features (ranking gap, age difference, surface, level) are more informative for “normal wins” than for upsets, which are the minority class (about 39% of matches).
- In practical terms, FPs mean that the model **misses surprising matches**.  
  If we used this model for betting or upset prediction, these are exactly the matches where we fail to see the upset coming.

#### False Negatives (FN = 37) – False Upset Alarms

- These are matches where the **stronger player actually won**, but the model predicted an upset.
- Compared to FP, the number of FN is much smaller (37 vs 174), consistent with the high recall for class 1 (≈ 0.89).
- Although this was not checked match by match here, FNs are most plausible when:
  - The ranking gap is small (almost equal strength),  
  - Or contextual factors (surface, tournament level) suggest higher upset risk.
- In practice, FNs correspond to cases where the model is **too pessimistic about the stronger player**, flagging a potential upset that does not occur.
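
The balance between missed upsets (FP) and false upset alarms (FN) depends on the decision threshold. This notebook uses the default 0.5 cut-off of `predict`. One way to explore the trade-off, not something done in this notebook, is to threshold the class-1 probability from `predict_proba` at a different value. The sketch below illustrates the idea on made-up labels and probabilities. The helper `counts_at` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted probabilities of class 1 (stronger player wins).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
proba_stronger = np.array([0.45, 0.55, 0.62, 0.58, 0.70, 0.52, 0.90, 0.65])


def counts_at(threshold):
    """Return (tn, fp, fn, tp) when class 1 is predicted for proba >= threshold."""
    y_pred = (proba_stronger >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(tn), int(fp), int(fn), int(tp)


print(counts_at(0.5))  # (1, 3, 0, 4)
print(counts_at(0.6))  # (2, 2, 2, 2)
```

Raising the threshold converts some missed upsets into caught upsets, at the cost of more false upset alarms. Whether that is worthwhile depends on how the predictions are used.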

---

### 2.X.3 Interpretation

Overall, the tuned XGBoost model:

- Achieves a test accuracy of **about 0.618** (341 of 552 matches), only marginally above the naive baseline that always predicts the stronger player to win (≈ 0.614, 339 of 552). A two-match difference is too small to count as a real improvement.
- Correctly classifies most matches where the stronger player wins (high recall for class 1).
- Struggles to predict **upsets**, which is reflected by the large number of **false positives (missed upsets)** and the low recall for class 0.
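
To put the gap between the tuned model and the naive baseline in context, the sketch below computes a rough 95% interval for the model's test accuracy. It uses the normal approximation for a single proportion, which is a simplifying assumption: it ignores that both accuracies come from the same 552 test matches.

```python
import math

n_test = 552
n_correct_model = 39 + 302   # TN + TP of the tuned XGBoost model
n_correct_baseline = 339     # all true class-1 matches in the test set

acc_model = n_correct_model / n_test
acc_baseline = n_correct_baseline / n_test

# Normal-approximation 95% interval for a single proportion.
std_err = math.sqrt(acc_model * (1 - acc_model) / n_test)
ci_low = acc_model - 1.96 * std_err
ci_high = acc_model + 1.96 * std_err

print(f"model: {acc_model:.3f}, baseline: {acc_baseline:.3f}")
print(f"approx. 95% interval for the model: [{ci_low:.3f}, {ci_high:.3f}]")
```

The baseline accuracy falls well inside this interval, so the test set is too small to separate the two.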

This behavior is consistent with the EDA findings:

- Ranking gap is a very strong predictor of match outcomes.
- Professional tennis is inherently dominated by stronger players. Upsets are the minority outcome (about 39% of matches in this dataset) and are hard to forecast even with historical data.

In summary, the model is very good at confirming what we already know (the stronger player usually wins),  
and provides limited but non-zero ability to flag potential upsets.
