# Slaying Dragons: How Objective Control Predicts Pro League of Legends Wins

**Name(s)**: Indrani Vairagare (A17410404) 

**Website Link**: https://indraniiii.github.io/lol-dragon-analysis/

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

import plotly.express as px
pd.options.plotting.backend = 'plotly'

from dsc80_utils import * # Feel free to uncomment and use this.

## Step 1: Introduction

In this project, I am analyzing the League of Legends esports dataset from Oracle’s Elixir (2024 season). My central question is:

**How does neutral objective control (especially dragons) relate to a team’s probability of winning a match in professional League of Legends?**

This dataset includes detailed team-level and player-level data from thousands of matches, including kills, objectives, gold differences, and game length. Because each match contains both player rows and team-level summary rows, I will focus only on team-level rows (`position == "team"`).

My goal in Steps 1–4 is to understand the structure of the dataset, explore its features, and test whether winning teams consistently secure more dragons than losing teams.

## Step 2: Data Cleaning and Exploratory Data Analysis

In [2]:
# Loading dataset
path = "2024_LoL_esports_match_data_from_OraclesElixir.csv"
lol = pd.read_csv(path)

# Cleaning
# Keeping only team-level rows 
teams = lol[lol["position"] == "team"].copy()

# Casting indicator columns to booleans
bool_cols = ["firstblood", "firstdragon", "firsttower", "result"]

for col in bool_cols:
    if col in teams.columns:
        teams[col] = teams[col].astype(bool)

# Creating game length in minutes
teams["gamelength_min"] = teams["gamelength"] / 60

teams.head()


Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.



| | gameid | datacompleteness | url | league | ... | opp_killsat25 | opp_assistsat25 | opp_deathsat25 | gamelength_min |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 10660-10660_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=10660 | DCup | ... | NaN | NaN | NaN | 31.43 |
| 11 | 10660-10660_game_1 | partial | https://lpl.qq.com/es/stats.shtml?bmid=10660 | DCup | ... | NaN | NaN | NaN | 31.43 |
| 22 | 10660-10660_game_2 | partial | https://lpl.qq.com/es/stats.shtml?bmid=10660 | DCup | ... | NaN | NaN | NaN | 31.85 |
| 23 | 10660-10660_game_2 | partial | https://lpl.qq.com/es/stats.shtml?bmid=10660 | DCup | ... | NaN | NaN | NaN | 31.85 |
| 34 | 10660-10660_game_3 | partial | https://lpl.qq.com/es/stats.shtml?bmid=10660 | DCup | ... | NaN | NaN | NaN | 22.07 |
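
Everything after this point assumes that each game contributes exactly two team-level rows, one per side. A minimal sketch of a sanity check for that assumption is below. It runs on a small made-up frame standing in for `lol`, and the helper `check_two_team_rows` is hypothetical rather than part of this notebook.

```python
import pandas as pd

# Toy stand-in for the raw data: two games, each mixing player rows and team rows.
toy = pd.DataFrame({
    "gameid":   ["g1"] * 4 + ["g2"] * 4,
    "position": ["top", "team", "mid", "team"] * 2,
    "result":   [1, 1, 0, 0, 0, 0, 1, 1],
})

def check_two_team_rows(df):
    """Return the gameids that do NOT have exactly two team-level rows."""
    team_rows = df[df["position"] == "team"]
    counts = team_rows.groupby("gameid").size()
    return counts[counts != 2].index.tolist()

bad_games = check_two_team_rows(toy)
print(bad_games)
```

An empty list means every game has the expected pair of rows. Any game IDs that are returned would be worth inspecting before running the tests in later steps.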


In [3]:
teams.columns

Index(['gameid', 'datacompleteness', 'url', 'league', 'year', 'split',
       'playoffs', 'date', 'game', 'patch',
       ...
       'golddiffat25', 'xpdiffat25', 'csdiffat25', 'killsat25', 'assistsat25',
       'deathsat25', 'opp_killsat25', 'opp_assistsat25', 'opp_deathsat25',
       'gamelength_min'],
      dtype='object', length=165)

In [4]:
import plotly.io as pio
pio.renderers.default = "iframe_connected"
import os
os.makedirs("assets", exist_ok=True)

In [5]:
# Distribution of game length
fig = px.histogram(
    teams,
    x="gamelength_min",
    nbins=40,
    title="Distribution of Game Length in 2024 Professional League of Legends Matches",
    labels={"gamelength_min": "Game Length (minutes)"}
)
fig.update_layout(
    template="plotly_white",
    bargap=0.05
)
fig.show()

# Average number of dragons by league
agg_table = teams.groupby('league')["dragons"].mean().reset_index().rename(columns={'dragons':'avg_dragons'})
agg_table.head()

# Average number of dragons by match result
dragons_by_result = teams.groupby('result')["dragons"].mean().reset_index()
fig2 = px.bar(dragons_by_result, x='result', y='dragons',
              title='Average Dragons Secured by Match Result',
              labels={'result':'Won Match','dragons':'Average Dragons'})
fig2.update_layout(template='plotly_white')
fig2.show()
# Save the figure to the assets folder
fig2.write_html('assets/avg_dragons_by_result.html', include_plotlyjs='cdn')

In [6]:
league_dragons = teams.groupby("league")["dragons"].mean().sort_values(ascending=False)
league_dragons

league
LRS      2.39
CBLOL    2.38
VCS      2.36
         ... 
CDF      1.86
PRMP     1.73
IC       1.72
Name: dragons, Length: 51, dtype: float64

## Step 3: Assessment of Missingness

In [7]:
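# Missingness per column (only columns with at least one missing value)
missing_cols = teams.columns[teams.isna().any()]
missing_counts = teams[missing_cols].isna().sum().sort_values(ascending=False)
# Reindex the fractions by the sorted counts so both columns stay aligned row by row
missing_fraction = teams[missing_cols].isna().mean().loc[missing_counts.index]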

missingness_summary = pd.DataFrame({
    "column": missing_counts.index,
    "num_missing": missing_counts.values,
    "frac_missing": missing_fraction.values
})

missingness_summary.head(15)

| | column | num_missing | frac_missing |
|---|---|---|---|
| 0 | atakhans | 19608 | 1.00 |
| 1 | opp_atakhans | 19608 | 1.00 |
| 2 | playername | 19608 | 1.00 |
| ... | ... | ... | ... |
| 12 | monsterkillsownjungle | 16826 | 0.86 |
| 13 | monsterkillsenemyjungle | 16826 | 0.86 |
| 14 | url | 16826 | 0.86 |


In [8]:
miss_col = "goldat25"

teams_miss = teams.copy()
teams_miss["gamelength_min"] = teams_miss["gamelength"] / 60
teams_miss["goldat25_missing"] = teams_miss[miss_col].isna()

teams_miss["goldat25_missing"].mean()  

np.float64(0.20093839249286005)

In [9]:
# Comparing avg game length when goldat25 is missing vs not missing
group_means = teams_miss.groupby("goldat25_missing")["gamelength_min"].mean()
obs_diff_gamelength = group_means[False] - group_means[True]
obs_diff_gamelength

np.float64(2.9996872908969436)

In [10]:
n_permutations = 1000
perm_stats_len = []

for _ in range(n_permutations):
    shuffled = teams_miss["goldat25_missing"].sample(frac=1, replace=False).values
    means = teams_miss.groupby(shuffled)["gamelength_min"].mean()
    perm_stats_len.append(means[False] - means[True])

perm_stats_len = np.array(perm_stats_len)

p_value_len = np.mean(perm_stats_len >= obs_diff_gamelength)
obs_diff_gamelength, p_value_len

(np.float64(2.9996872908969436), np.float64(0.0))
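
The missingness tests in this step all run the same shuffle-and-compare loop, so it may help to see that loop as one function. The sketch below is a hypothetical helper (`permutation_test_diff_means` is not defined anywhere in this notebook). It runs on synthetic data, and it uses the same "not missing minus missing" difference of means and the same one-sided comparison as the cell above.

```python
import numpy as np

def permutation_test_diff_means(values, labels, n_permutations=1000, seed=0):
    """One-sided permutation test for mean(values | ~labels) - mean(values | labels)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    rng = np.random.default_rng(seed)

    def diff(lab):
        # Same orientation as the notebook: group False minus group True
        return values[~lab].mean() - values[lab].mean()

    observed = diff(labels)
    # Shuffling the labels breaks any link between the labels and the values
    perm_stats = np.array([diff(rng.permutation(labels)) for _ in range(n_permutations)])
    p_value = np.mean(perm_stats >= observed)
    return observed, p_value, perm_stats

# Synthetic example: "missing" rows come from games about 3 minutes shorter on average.
rng = np.random.default_rng(1)
is_missing = np.arange(1000) < 200
game_length = np.where(is_missing, 30.0, 33.0) + rng.normal(0, 5, size=1000)

observed, p_value, perm_stats = permutation_test_diff_means(game_length, is_missing)
print(observed, p_value)
```

Fixing the seed makes the p-value reproducible between runs, which the `sample(frac=1)` calls in the notebook do not guarantee.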

In [11]:
fig = px.histogram(
    teams_miss,
    x="gamelength_min",
    color="goldat25_missing",
    barmode="overlay",
    nbins=40,
    labels={"gamelength_min": "Game Length (minutes)", "goldat25_missing": "goldat25 is missing"},
    title="Game Length vs Missingness of goldat25"
)
fig.update_layout(template="plotly_white")
fig.show()

In [12]:
# Comparing win rate when goldat25 is missing vs not
group_win = teams_miss.groupby("goldat25_missing")["result"].mean()
obs_diff_win = group_win[False] - group_win[True]
obs_diff_win

np.float64(0.0005076142131979489)

In [13]:
perm_stats_win = []

for _ in range(n_permutations):
    shuffled = teams_miss["goldat25_missing"].sample(frac=1, replace=False).values
    means = teams_miss.groupby(shuffled)["result"].mean()
    perm_stats_win.append(means[False] - means[True])

perm_stats_win = np.array(perm_stats_win)
p_value_win = np.mean(perm_stats_win >= obs_diff_win)
obs_diff_win, p_value_win

(np.float64(0.0005076142131979489), np.float64(0.494))

## Step 4: Hypothesis Testing

In [14]:
# Hypothesis Test #1:
# Do winning teams secure more dragons?

# Observed statistic: mean dragons for winners minus mean dragons for losers
mean_dragons = teams.groupby("result")["dragons"].mean()
obs_diff_dragons = mean_dragons[True] - mean_dragons[False]

# Permutation test
n_permutations = 5000
stats = []

for _ in range(n_permutations):
    shuffled = teams["result"].sample(frac=1, replace=False).values
    # Group once per shuffle instead of twice
    shuffled_means = teams.groupby(shuffled)["dragons"].mean()
    stats.append(shuffled_means[True] - shuffled_means[False])

p_value_dragons = np.mean(np.array(stats) >= obs_diff_dragons)

obs_diff_dragons, p_value_dragons


(np.float64(1.4662258048957366), np.float64(0.0))
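
A p-value of exactly 0.0 here means that none of the 5000 shuffles produced a difference as large as the observed one. It does not mean the probability is literally zero. One common convention is to add one to both the numerator and the denominator, which bounds the estimate away from zero. The sketch below illustrates that convention with a made-up null distribution. `fake_null` is not the `stats` array from this notebook, and `corrected_p_value` is a hypothetical helper.

```python
import numpy as np

def corrected_p_value(perm_stats, observed):
    """(count of permuted stats >= observed, plus 1) / (number of permutations, plus 1)."""
    perm_stats = np.asarray(perm_stats)
    n_extreme = np.sum(perm_stats >= observed)
    return (n_extreme + 1) / (len(perm_stats) + 1)

# Made-up null distribution centered at 0, far below an observed difference near 1.47
rng = np.random.default_rng(0)
fake_null = rng.normal(0, 0.03, size=5000)

p = corrected_p_value(fake_null, observed=1.47)
print(p)
```

Under this convention the result above would be reported as roughly 1/5001, or simply as p < 0.001.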

In [15]:
# Hypothesis Test #2:
# Do teams that get first dragon win more often?

# Observed statistic: win rate with first dragon minus win rate without it
win_rate_by_fd = teams.groupby("firstdragon")["result"].mean()
obs_diff_fd = win_rate_by_fd[True] - win_rate_by_fd[False]

# Permutation test
stats_fd = []

for _ in range(5000):
    shuffled_fd = teams["firstdragon"].sample(frac=1, replace=False).values
    # Group once per shuffle instead of twice
    shuffled_rates = teams.groupby(shuffled_fd)["result"].mean()
    stats_fd.append(shuffled_rates[True] - shuffled_rates[False])

p_value_fd = np.mean(np.array(stats_fd) >= obs_diff_fd)

obs_diff_fd, p_value_fd


(np.float64(0.1259055751407503), np.float64(0.0))
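
The observed statistic of about 0.126 is a difference between two conditional win rates, and a row-normalized contingency table shows both rates at once. The sketch below uses a tiny invented frame with 0/1 columns, not the `teams` data, so its numbers are purely illustrative.

```python
import pandas as pd

# Invented data: 1 = took first dragon / won, 0 = did not
toy = pd.DataFrame({
    "firstdragon": [1, 1, 1, 1, 0, 0, 0, 0],
    "result":      [1, 1, 1, 0, 0, 0, 1, 0],
})

# Each row sums to 1: P(result | firstdragon)
win_rates = pd.crosstab(toy["firstdragon"], toy["result"], normalize="index")

# Win rate with first dragon minus win rate without it
diff = win_rates.loc[1, 1] - win_rates.loc[0, 1]
print(win_rates)
print(diff)
```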

## Step 5: Framing a Prediction Problem

In [16]:
# Predicting whether a team wins (result) from post-game stats

predictors = ['dragons', 'barons', 'towers', 'kills', 'assists', 'gamelength_min']

X_all = teams[predictors].fillna(0)
y_all = teams["result"]

from sklearn.model_selection import train_test_split

# We split the INDEX once and reuse the same split for baseline + final models
train_idx, test_idx = train_test_split(
    X_all.index,
    test_size=0.2,
    random_state=42,
    stratify=y_all
)

X_train = X_all.loc[train_idx]
X_test = X_all.loc[test_idx]
y_train = y_all.loc[train_idx]
y_test = y_all.loc[test_idx]

X_train.head(), y_train.head()

(        dragons  barons  towers  kills  assists  gamelength_min
 9659        4.0     1.0     9.0     19       48           39.93
 51755       2.0     1.0     5.0      9       18           32.58
 104795      5.0     2.0    11.0     22       71           37.07
 50411       4.0     1.0     7.0      9       25           30.73
 52091       2.0     0.0     2.0      5       11           31.03,
 9659       True
 51755     False
 104795     True
 50411      True
 52091     False
 Name: result, dtype: bool)
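
One caveat with a row-level split is that the two team rows from a single game can land on opposite sides of the train/test boundary. Those rows share game-level values such as `gamelength_min`, and their outcomes are mirror images of each other. The sketch below shows a possible alternative that keeps both rows of a game together by grouping on `gameid`. This is an assumption about how one might restructure the split, not what this notebook does, and the data is a toy frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: 10 games, two team rows each
toy = pd.DataFrame({
    "gameid": np.repeat([f"g{i}" for i in range(10)], 2),
    "dragons": [3, 1, 2, 2, 4, 0, 1, 3, 2, 1, 4, 1, 0, 3, 2, 2, 3, 1, 1, 4],
    "result": [True, False] * 10,
})

# test_size is interpreted as a share of groups (games), not of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_pos, test_pos = next(splitter.split(toy, toy["result"], groups=toy["gameid"]))

train_games = set(toy.iloc[train_pos]["gameid"])
test_games = set(toy.iloc[test_pos]["gameid"])
print(sorted(test_games))
```

`GroupShuffleSplit` returns positional indices, so they are used with `.iloc` here rather than with `.loc` as in the cell above.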

## Step 6: Baseline Model

In [17]:
# Baseline non-ML model

# Determining most frequent outcome (True or False) in y_all
majority_class = y_all.mode()[0]

# Predict this class for every row
baseline_pred = np.repeat(majority_class, len(y_all))

# Computing baseline accuracy
baseline_accuracy = (baseline_pred == y_all).mean()
baseline_accuracy

np.float64(0.5001019991840066)
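
The baseline sits at almost exactly 0.5 because every game has one winner and one loser, so neither class can dominate. The same majority-class rule is available in scikit-learn as `DummyClassifier`. The sketch below applies it to a synthetic, perfectly balanced label vector rather than to `y_all`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic balanced labels: one winner and one loser per "game"
y = np.array([True, False] * 50)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
acc = dummy.score(X, y)
print(acc)
```

Because it follows the estimator API, a dummy model like this could be fit on the training split and scored on the test split in the same way as the later models.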

In [18]:
# Baseline ML model: simple LogisticRegression on a small set of features

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

baseline_features = predictors  # reuse the same list from Step 5

X_baseline_train = X_train[baseline_features]
X_baseline_test = X_test[baseline_features]

baseline_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(max_iter=1000))
])

baseline_pipe.fit(X_baseline_train, y_train)

baseline_train_accuracy = baseline_pipe.score(X_baseline_train, y_train)
baseline_test_accuracy = baseline_pipe.score(X_baseline_test, y_test)

baseline_train_accuracy, baseline_test_accuracy

(0.9741807981639679, 0.9747577766445691)
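
Since the pipeline standardizes every feature, the fitted logistic regression coefficients are on a comparable scale and can hint at which inputs drive the predictions. The sketch below shows one way to pull them out of a pipeline. It is fit on synthetic data where the label depends mostly on `towers` by construction, so the ranking it prints says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "towers": rng.integers(0, 12, size=n),
    "dragons": rng.integers(0, 6, size=n),
    "noise": rng.normal(size=n),
})
# Synthetic label: driven mainly by towers, partly by dragons, never by noise
y = (X["towers"] + 0.5 * X["dragons"] + rng.normal(0, 1, size=n)) > 7

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem
coefs = pd.Series(pipe.named_steps["log_reg"].coef_[0], index=X.columns)
coefs = coefs.sort_values(key=np.abs, ascending=False)
print(coefs)
```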

## Step 7: Final Model

In [19]:
# Adding engineered features + hyperparameter tuning

from sklearn.model_selection import GridSearchCV

# Creating engineered features on the full teams DataFrame 

teams["dragon_diff"] = teams["dragons"] - teams["opp_dragons"]
teams["baron_diff"] = teams["barons"] - teams["opp_barons"]
teams["objective_count"] = teams["dragons"] + teams["barons"] + teams["heralds"] + teams["void_grubs"]
teams["early_gold_diff_15"] = teams["golddiffat15"]

final_features = baseline_features + [
    "dragon_diff",
    "baron_diff",
    "objective_count",
    "early_gold_diff_15",
]

X_final_all = teams[final_features].fillna(0)
y_all = teams["result"]

# Reusing the SAME train/test split as the baseline model
X_final_train = X_final_all.loc[train_idx]
X_final_test = X_final_all.loc[test_idx]
y_final_train = y_all.loc[train_idx]
y_final_test = y_all.loc[test_idx]

final_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(max_iter=1000))
])

param_grid = {
    "log_reg__C": [0.1, 1.0, 10.0]
}

grid = GridSearchCV(
    final_pipe,
    param_grid,
    cv=3,
    scoring="accuracy"
)

grid.fit(X_final_train, y_final_train)

best_final_model = grid.best_estimator_
best_params = grid.best_params_

final_train_accuracy = best_final_model.score(X_final_train, y_final_train)
final_test_accuracy = best_final_model.score(X_final_test, y_final_test)

best_params, final_train_accuracy, final_test_accuracy

({'log_reg__C': 10.0}, 0.9760295805176591, 0.9780724120346762)
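
The selected value `C = 10.0` is the largest value in the grid, so it is worth checking how much the cross-validated scores actually differ across candidates. `GridSearchCV` stores them in `cv_results_`. The sketch below shows how to tabulate them, using a synthetic dataset from `make_classification` in place of the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"log_reg__C": [0.1, 1.0, 10.0]}, cv=3, scoring="accuracy")
grid.fit(X, y)

# One row per candidate value of C
cv_table = pd.DataFrame(grid.cv_results_)[
    ["param_log_reg__C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(cv_table)
```

If the mean scores differ by less than their standard deviations, the choice of `C` is unlikely to matter much. When the best value sits on the boundary, widening the grid is a reasonable next step.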

## Step 8: Fairness Analysis

In [20]:
# Does the final model perform differently on major vs. non-major leagues?

# Defining major regions vs others
major_leagues = ["LCK", "LPL", "LEC", "LCS"]
teams["league_group"] = np.where(teams["league"].isin(major_leagues), "major", "other")

# Building DataFrame of test results using the FINAL model
test_results = pd.DataFrame({
    "y_true": y_final_test,
    "y_pred": best_final_model.predict(X_final_test),
})

test_results["league_group"] = teams.loc[test_results.index, "league_group"]

def accuracy(df):
    return (df["y_true"] == df["y_pred"]).mean()

acc_by_group = test_results.groupby("league_group").apply(accuracy)
acc_by_group

league_group
major    0.98
other    0.98
dtype: float64
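
Accuracy per group is easier to interpret next to the number of test rows in each group, since a small group produces a noisier estimate. The sketch below reports both with a named aggregation. It uses invented predictions, not `test_results`.

```python
import pandas as pd

# Invented predictions for two league groups
toy_results = pd.DataFrame({
    "y_true": [True, False, True, True, False, False],
    "y_pred": [True, False, False, True, False, False],
    "league_group": ["major", "major", "major", "other", "other", "other"],
})

# A row is "correct" when the prediction matches the true label
toy_results["correct"] = toy_results["y_true"] == toy_results["y_pred"]

# Group size and accuracy side by side
summary = toy_results.groupby("league_group")["correct"].agg(n="size", accuracy="mean")
print(summary)
```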

In [21]:
obs_diff_acc = acc_by_group["other"] - acc_by_group["major"]  # other - major

n_permutations = 1000
perm_stats = []

for _ in range(n_permutations):
    shuffled_groups = np.random.permutation(test_results["league_group"].values)
    temp = test_results.copy()
    temp["shuffled_group"] = shuffled_groups
    acc_perm = temp.groupby("shuffled_group").apply(accuracy)
    perm_stats.append(acc_perm["other"] - acc_perm["major"])

perm_stats = np.array(perm_stats)

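# One-sided p-value: share of shuffled differences (other - major) at or below the observed one,
# i.e. a test of whether the model is less accurate on "other" leagues than on major leagues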
p_value_fairness = np.mean(perm_stats <= obs_diff_acc)

obs_diff_acc, p_value_fairness

(np.float64(0.0033554750640065745), np.float64(0.751))
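
The question of whether the model performs differently across groups has no built-in direction, so a two-sided version of the test is a natural companion to the one-sided one above. The sketch below compares absolute differences in accuracy. The helper `two_sided_accuracy_perm_test` is hypothetical and the data is synthetic, so the printed numbers are not results for the real model.

```python
import numpy as np

def two_sided_accuracy_perm_test(correct, groups, group_a, group_b,
                                 n_permutations=1000, seed=0):
    """Permutation test on |accuracy(group_a) - accuracy(group_b)|."""
    correct = np.asarray(correct, dtype=float)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)

    def diff(g):
        return correct[g == group_a].mean() - correct[g == group_b].mean()

    observed = diff(groups)
    perm = np.array([diff(rng.permutation(groups)) for _ in range(n_permutations)])
    # Two-sided: count shuffles at least as far from zero as the observed difference
    p_value = np.mean(np.abs(perm) >= abs(observed))
    return observed, p_value

# Synthetic example: both groups are about 98% accurate, so no real gap exists
rng = np.random.default_rng(2)
groups = np.array(["major"] * 300 + ["other"] * 700)
correct = rng.random(1000) < 0.98

observed, p_value = two_sided_accuracy_perm_test(correct, groups, "other", "major")
print(observed, p_value)
```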