# 3. Prediction Demo – Using the Tuned XGBoost Model

In this notebook, we demonstrate how to:

1. Load a cleaned tennis match dataset (`clean_matches_tennis.csv`)
2. Create / load a small **sample CSV**
3. Load the **best XGBoost model** saved from the modeling notebook
4. Run predictions on the sample data
5. Display predictions and provide a short summary and key observations

The target variable is:

- `stronger_win = 1`: the higher-ranked player wins (expected outcome)  
- `stronger_win = 0`: the lower-ranked player wins (an upset)
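
The EDA notebook that builds this label is not shown here, so the following is only a sketch of how `stronger_win` and the related rank columns could be derived from `winner_rank` and `loser_rank`. It assumes that a smaller ranking number means a stronger player, which is consistent with the sample rows displayed later in this notebook. The DataFrame name `matches` is hypothetical.

```python
import pandas as pd

# Three matches copied from the sample table shown in section 3.1.
matches = pd.DataFrame(
    {
        "winner_name": ["Brayden Schnur", "Gael Monfils", "John Millman"],
        "winner_rank": [154.0, 17.0, 44.0],
        "loser_rank": [49.0, 146.0, 87.0],
    }
)

# Assumption: the "stronger" player is the one with the smaller ranking number.
matches["stronger_rank"] = matches[["winner_rank", "loser_rank"]].min(axis=1)
matches["weaker_rank"] = matches[["winner_rank", "loser_rank"]].max(axis=1)
matches["rank_gap_abs"] = (matches["winner_rank"] - matches["loser_rank"]).abs()

# 1 when the winner was the better-ranked player, 0 when the match was an upset.
matches["stronger_win"] = (matches["winner_rank"] < matches["loser_rank"]).astype(int)

print(matches)
```

The first row (Schnur, ranked 154, beating a player ranked 49) comes out as `stronger_win = 0`, matching the label in the sample table.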


In [12]:
import os
import numpy as np
import pandas as pd
import joblib

from sklearn.metrics import accuracy_score, confusion_matrix

pd.set_option("display.max_columns", 100)

print("Working directory:", os.getcwd())

# File paths (adjust these if your directory layout differs)
CLEAN_PATH = "./clean_matches_tennis.csv"
SAMPLE_PATH = "./sample_clean_matches.csv"
MODEL_PATH = "./Model/best_tennis_xgb.pkl"


Working directory: C:\Users\j'j


## 3.1 Load Cleaned Data and Create a Sample CSV

We use the cleaned dataset exported from the EDA notebook.
From this dataset, we either:

- Load an existing `sample_clean_matches.csv`, or
- Randomly sample a small subset (e.g. 30 rows) and save it as the sample CSV.


In [19]:
# 1. Load cleaned dataset
if not os.path.exists(CLEAN_PATH):
    raise FileNotFoundError(f"Cannot find {CLEAN_PATH}. Make sure you ran the EDA notebook and saved this file.")

df_clean = pd.read_csv(CLEAN_PATH)
print("Clean data shape:", df_clean.shape)

# 2. Load or create sample CSV
if os.path.exists(SAMPLE_PATH):
    print(f"Found existing sample CSV at: {SAMPLE_PATH}")
    df_sample = pd.read_csv(SAMPLE_PATH)
else:
    print("Sample CSV not found. Creating a new sample...")
    
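    # Cap the sample size so a dataset with fewer than 30 rows does not raise a ValueError
    df_sample = df_clean.sample(n=min(30, len(df_clean)), random_state=42)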
    df_sample.to_csv(SAMPLE_PATH, index=False)
    print("Saved sample CSV to:", SAMPLE_PATH)

print("Sample data shape:", df_sample.shape)
df_sample.head()


Clean data shape: (2760, 24)
Sample CSV not found. Creating a new sample...
Saved sample CSV to: ./sample_clean_matches.csv
Sample data shape: (30, 24)


Unnamed: 0,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_name,winner_id,winner_rank,winner_rank_points,winner_age,loser_name,loser_id,loser_rank,loser_rank_points,loser_age,stronger_rank,weaker_rank,rank_gap_abs,points_diff,age_diff,best_of,stronger_win
367,2019-0424,New York,Hard,32,A,20190211,298,Brayden Schnur,111790,154.0,367.0,23.6,Sam Querrey,105023,49.0,940.0,31.3,49.0,154.0,105.0,-573.0,-7.7,3,0
1293,2019-520,Roland Garros,Clay,128,G,20190527,1306,Gael Monfils,104792,17.0,1965.0,32.7,Antoine Hoang,126156,146.0,388.0,23.5,17.0,146.0,129.0,1577.0,9.2,5,1
2106,2019-560,US Open,Hard,128,G,20190826,169,Nikoloz Basilashvili,105932,18.0,1985.0,27.5,Jenson Brooksby,202385,394.0,88.0,18.8,18.0,394.0,376.0,1897.0,8.7,5,1
2330,2019-0329,Tokyo,Hard,32,A,20190930,273,Lloyd Harris,144750,99.0,574.0,22.5,Alex De Minaur,200282,25.0,1520.0,20.6,25.0,99.0,74.0,-946.0,1.9,3,0
521,2019-M004,Acapulco,Hard,32,A,20190225,290,John Millman,105357,44.0,1008.0,29.7,Peter Gojowczyk,105376,87.0,640.0,29.6,44.0,87.0,43.0,368.0,0.1,3,1


## 3.2 Load the Best XGBoost Model

We now load the tuned XGBoost pipeline that was saved in the modeling notebook.
This pipeline already includes the preprocessing steps (ColumnTransformer + OneHotEncoder).
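
The modeling notebook is not reproduced here, so the following is only a rough sketch of how a pipeline with this shape could be assembled. The step names `preprocess`, `model`, `num`, and `cat` match the loaded pipeline shown below, but two things are assumptions: the numeric columns are simply passed through (the actual `num` transformer is not visible), and `LogisticRegression` is used as a stand-in so the example runs without XGBoost installed. In the real notebook the `model` step is the tuned XGBoost classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_features = ["rank_gap_abs", "age_diff"]
categorical_features = ["surface", "tourney_level", "best_of"]
feature_cols = numeric_features + categorical_features

# Assumption: numeric columns pass through unchanged; categoricals are one-hot encoded.
preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# Stand-in estimator: replace with the XGBoost classifier to mirror the real pipeline.
sketch_pipeline = Pipeline(
    steps=[("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))]
)

# Five rows copied from the sample table in section 3.1.
toy = pd.DataFrame(
    {
        "rank_gap_abs": [105.0, 129.0, 376.0, 74.0, 43.0],
        "age_diff": [-7.7, 9.2, 8.7, 1.9, 0.1],
        "surface": ["Hard", "Clay", "Hard", "Hard", "Hard"],
        "tourney_level": ["A", "G", "G", "A", "A"],
        "best_of": [3, 5, 5, 3, 3],
        "stronger_win": [0, 1, 1, 0, 1],
    }
)

sketch_pipeline.fit(toy[feature_cols], toy["stronger_win"])
proba = sketch_pipeline.predict_proba(toy[feature_cols])[:, 1]
print(proba.round(3))
```

Because preprocessing lives inside the pipeline, raw columns such as `surface` can be passed straight to `predict` and `predict_proba` without encoding them by hand.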


In [20]:
if not os.path.exists(MODEL_PATH):
    raise FileNotFoundError(f"Cannot find model at {MODEL_PATH}. Make sure you saved best_tennis_xgb.pkl.")

best_model = joblib.load(MODEL_PATH)
print("Loaded model type:", type(best_model))
best_model


Loaded model type: <class 'sklearn.pipeline.Pipeline'>


Pipeline
parameter,value
steps,"[('preprocess', ...), ('model', ...)]"
transform_input,
memory,
verbose,False

ColumnTransformer (`preprocess` step)
parameter,value
transformers,"[('num', ...), ('cat', ...)]"
remainder,'drop'
sparse_threshold,0.3
n_jobs,
transformer_weights,
verbose,False
verbose_feature_names_out,True
force_int_remainder_cols,'deprecated'

OneHotEncoder
parameter,value
categories,'auto'
drop,
sparse_output,True
dtype,<class 'numpy.float64'>
handle_unknown,'ignore'
min_frequency,
max_categories,
feature_name_combiner,'concat'

XGBoost classifier (`model` step)
parameter,value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,1.0
device,
early_stopping_rounds,
enable_categorical,False


## 3.3 Prepare Features and Run Predictions

We use the same feature set as in the modeling notebook:

- Numeric features: `rank_gap_abs`, `age_diff`
- Categorical features: `surface`, `tourney_level`, `best_of`


In [21]:
# Use the same features as in the modeling notebook
numeric_features = ["rank_gap_abs", "age_diff"]
categorical_features = ["surface", "tourney_level", "best_of"]
feature_cols = numeric_features + categorical_features

# Make sure every feature column is present in the sample
for col in feature_cols:
    if col not in df_sample.columns:
        raise KeyError(f"Sample is missing required feature column: {col}")

# Drop rows that have NaN in any feature column
df_sample_model = df_sample.dropna(subset=feature_cols).copy()
print("After dropping rows with NaN in feature columns:", df_sample_model.shape)

X_demo = df_sample_model[feature_cols]

# Predicted probabilities (column index 1 holds P(stronger_win = 1))
proba_demo = best_model.predict_proba(X_demo)[:, 1]
pred_demo = best_model.predict(X_demo)

df_sample_model["pred_stronger_win"] = pred_demo
df_sample_model["pred_prob_stronger_win"] = proba_demo

# Check whether a true label is available for comparison
has_true_label = "stronger_win" in df_sample_model.columns
if has_true_label:
    demo_acc = accuracy_score(df_sample_model["stronger_win"], df_sample_model["pred_stronger_win"])
    print(f"Demo subset accuracy: {demo_acc:.4f}")


After dropping rows with NaN in feature columns: (30, 24)
Demo subset accuracy: 0.7333


## 3.4 Show Prediction Table

We now display a few key columns for inspection:

- Player names (winner/loser)
- Ranking information (stronger vs weaker, ranking gap)
- Match context (surface, level, best_of)
- Model predictions:
  - `pred_stronger_win` (0 = upset, 1 = stronger wins)
  - `pred_prob_stronger_win` (probability that the stronger player wins)
  - (optional) `stronger_win` as the true label if available


In [22]:
cols_to_show = []

# Show a set of informative columns (each is added only if it exists)
for col in [
    "winner_name", "winner_rank", "loser_name", "loser_rank",
    "stronger_rank", "weaker_rank", "rank_gap_abs",
    "surface", "tourney_level", "best_of",
]:
    if col in df_sample_model.columns:
        cols_to_show.append(col)

# Add the prediction columns
cols_to_show += ["pred_stronger_win", "pred_prob_stronger_win"]

# Also show the true label if it is available
if has_true_label:
    cols_to_show.append("stronger_win")

print("Columns to display:", cols_to_show)

df_sample_model[cols_to_show].head(15)


Columns to display: ['winner_name', 'winner_rank', 'loser_name', 'loser_rank', 'stronger_rank', 'weaker_rank', 'rank_gap_abs', 'surface', 'tourney_level', 'best_of', 'pred_stronger_win', 'pred_prob_stronger_win', 'stronger_win']


Unnamed: 0,winner_name,winner_rank,loser_name,loser_rank,stronger_rank,weaker_rank,rank_gap_abs,surface,tourney_level,best_of,pred_stronger_win,pred_prob_stronger_win,stronger_win
367,Brayden Schnur,154.0,Sam Querrey,49.0,49.0,154.0,105.0,Hard,A,3,1,0.712666,0
1293,Gael Monfils,17.0,Antoine Hoang,146.0,17.0,146.0,129.0,Clay,G,5,1,0.820383,1
2106,Nikoloz Basilashvili,18.0,Jenson Brooksby,394.0,18.0,394.0,376.0,Hard,G,5,1,0.817964,1
2330,Lloyd Harris,99.0,Alex De Minaur,25.0,25.0,99.0,74.0,Hard,A,3,1,0.748153,0
521,John Millman,44.0,Peter Gojowczyk,87.0,44.0,87.0,43.0,Hard,A,3,1,0.565487,1
817,Benoit Paire,69.0,Jo-Wilfried Tsonga,116.0,69.0,116.0,47.0,Clay,A,3,1,0.644292,1
322,Fernando Verdasco,26.0,Marius Copil,56.0,26.0,56.0,30.0,Hard,A,3,1,0.609305,1
1628,Richard Gasquet,50.0,Dennis Novak,121.0,50.0,121.0,71.0,Clay,A,3,1,0.711491,1
365,Reilly Opelka,89.0,Brayden Schnur,154.0,89.0,154.0,65.0,Hard,A,3,1,0.730883,1
2553,Grigor Dimitrov,27.0,David Goffin,14.0,14.0,27.0,13.0,Hard,M,3,1,0.51829,0


## 3.5 Summary & Key Observations

In this demo:

- We loaded a small sample of matches from `clean_matches_tennis.csv`.
- We applied the **tuned XGBoost model** to predict whether the **stronger player (higher-ranked)** would win each match.
- For each match, the model outputs:
  - a binary prediction `pred_stronger_win`  
    - `1` → model expects the stronger player to win  
    - `0` → model predicts an upset (the weaker player wins)
  - a probability score `pred_prob_stronger_win` indicating how confident the model is that the stronger player will win.
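
To make the link between the two outputs concrete, here is a small sketch that turns probabilities into class labels with an explicit cutoff. It assumes the usual 0.5 decision threshold for a binary classifier; the helper `predict_with_threshold` and the 0.6 cutoff are hypothetical and only illustrate how the labels depend on the threshold. The probabilities are the ten values shown in the prediction table in section 3.4.

```python
import numpy as np

# pred_prob_stronger_win for the ten rows displayed in section 3.4.
proba = np.array(
    [0.712666, 0.820383, 0.817964, 0.748153, 0.565487,
     0.644292, 0.609305, 0.711491, 0.730883, 0.518290]
)


def predict_with_threshold(p, threshold=0.5):
    """Return 1 (stronger player wins) where p >= threshold, else 0 (upset)."""
    return (p >= threshold).astype(int)


default_pred = predict_with_threshold(proba)                 # assumed default cutoff
stricter_pred = predict_with_threshold(proba, threshold=0.6)  # hypothetical cutoff

print("Predicted stronger wins at 0.5:", int(default_pred.sum()), "of", len(proba))
print("Predicted stronger wins at 0.6:", int(stricter_pred.sum()), "of", len(proba))
```

Every displayed probability is above 0.5, which is why `pred_stronger_win` is 1 for all of these rows. A higher cutoff would flag the least confident matches as upsets, at the cost of possible false alarms.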

If the true label `stronger_win` is available in the sample:

- We can compute the demo subset accuracy to see how well the model performs on this small subset.
- We can also manually inspect cases where:
  - `stronger_win = 0` but `pred_stronger_win = 1` (missed upsets)
  - `stronger_win = 1` but `pred_stronger_win = 0` (false upset alarms)
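
A minimal sketch of that manual inspection, using boolean masks on a small frame. The frame `results` is a hypothetical stand-in for `df_sample_model`; its rows are copied from the prediction table in section 3.4.

```python
import pandas as pd

# Five rows copied from the prediction table in section 3.4.
results = pd.DataFrame(
    {
        "winner_name": ["Brayden Schnur", "Gael Monfils", "Lloyd Harris",
                        "John Millman", "Grigor Dimitrov"],
        "loser_name": ["Sam Querrey", "Antoine Hoang", "Alex De Minaur",
                       "Peter Gojowczyk", "David Goffin"],
        "stronger_win": [0, 1, 0, 1, 0],
        "pred_stronger_win": [1, 1, 1, 1, 1],
        "pred_prob_stronger_win": [0.712666, 0.820383, 0.748153, 0.565487, 0.518290],
    }
)

# Missed upsets: the weaker player won, but the model predicted the stronger player.
missed_upsets = results[
    (results["stronger_win"] == 0) & (results["pred_stronger_win"] == 1)
]

# False upset alarms: the stronger player won, but the model predicted an upset.
false_alarms = results[
    (results["stronger_win"] == 1) & (results["pred_stronger_win"] == 0)
]

print("Missed upsets:")
print(missed_upsets[["winner_name", "loser_name", "pred_prob_stronger_win"]])
print("False upset alarms:", len(false_alarms))
```

Sorting `missed_upsets` by `pred_prob_stronger_win` is one way to see which upsets the model was most confident would not happen.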

Overall, this notebook demonstrates how to:
1. Load a saved model,
2. Read a CSV file,
3. Run predictions, and
4. Present the results in a clear, interpretable table.


In [23]:
n_rows = len(df_sample_model)
n_pred_stronger = int(df_sample_model["pred_stronger_win"].sum())
avg_prob = float(df_sample_model["pred_prob_stronger_win"].mean())

print("=== Demo Summary on Sample CSV ===")
print(f"Number of matches in sample: {n_rows}")
print(f"Model predicts stronger player will win in {n_pred_stronger} / {n_rows} matches "
      f"({n_pred_stronger / n_rows:.2%})")
print(f"Average predicted probability that stronger player wins: {avg_prob:.3f}")

if has_true_label:
    acc = accuracy_score(df_sample_model["stronger_win"], df_sample_model["pred_stronger_win"])
    cm = confusion_matrix(df_sample_model["stronger_win"], df_sample_model["pred_stronger_win"])
    print(f"\nAccuracy on this sample (with true labels): {acc:.3f}")
    print("Confusion matrix on sample:\n", cm)


=== Demo Summary on Sample CSV ===
Number of matches in sample: 30
Model predicts stronger player will win in 30 / 30 matches (100.00%)
Average predicted probability that stronger player wins: 0.670

Accuracy on this sample (with true labels): 0.733
Confusion matrix on sample:
 [[ 0  8]
 [ 0 22]]


On this particular 30-match demo sample, the model predicts that the stronger player
wins in all 30 cases (`pred_stronger_win = 1` for every row), with an average
predicted probability of about 0.67.

In the ground truth, 22 out of these 30 matches are indeed won by the stronger player,
while 8 are upsets. This yields:

- Demo accuracy: **22 / 30 ≈ 0.733**
- Confusion matrix (rows: true class 0 and 1; columns: predicted class 0 and 1):

\[
\begin{bmatrix}
0 & 8 \\
0 & 22
\end{bmatrix}
\]

This means that, on this small sample, the model behaves exactly like the naive
baseline that always predicts the stronger player to win:
it correctly predicts all 22 matches that the stronger player actually wins,
but misses all 8 upsets.

This is consistent with the overall behavior observed in the full test set:
the model is very good at recognizing matches where the stronger player will win,
but predicting rare upsets (class 0) remains very challenging.
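
The comparison with the naive baseline can be checked directly from the confusion matrix reported above. This sketch assumes the scikit-learn layout, where rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the demo sample: rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[0, 8],
               [0, 22]])

total = cm.sum()

# Model accuracy: correct predictions sit on the diagonal.
accuracy = np.trace(cm) / total

# Naive baseline that always predicts class 1: it is right whenever the true class is 1.
baseline_accuracy = cm[1].sum() / total

# Per-class recall: the share of each true class that the model identifies.
upset_recall = cm[0, 0] / cm[0].sum()
stronger_recall = cm[1, 1] / cm[1].sum()

print(f"Model accuracy:         {accuracy:.3f}")
print(f"Always-1 baseline:      {baseline_accuracy:.3f}")
print(f"Recall for upsets (0):  {upset_recall:.3f}")
print(f"Recall for class 1:     {stronger_recall:.3f}")
```

On this sample the two accuracies coincide at 22 / 30, and recall for upsets is 0, so accuracy alone does not separate the model from the baseline here.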
