# League of Legends Competitive Match Data
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict if a team will win or lose a game.
    * Predict which role (top-lane, jungle, support, etc.) a player played given their post-game data.
    * Predict how long a game will take before it happens.
    * Predict which team will get the first Baron.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
Last time, I analyzed the League of Legends esports matching data using hypothesis test, to see if the Cloud Dragon Soul is as powerful as other Dragon Souls. This time, I will coninue the analysis of these esports data, but try some different topics of this game. One of the most common suspense people have when watching the esports games, or even all sports games, is that who is going to be the winner? In this article, I want to try to **predict the winner given the statistics of the matches at the in-game time 15th minute** (it is extremely rare that a match takes under 15 minutes, and I will cover this in the following sections). Since the winner of a match is a categorical data, this predition is a **classification** task.

As a reminder, each row of the dataset I used stores the statistics of a single player or a team of only one match, and I will predict the winner of the game with only the team statistics. The target variable I am going to predict it the `'result'` column, which represents whether the team/player of this row wins this match. There's only two possible values, `1` means that the team wins, while `0` means the team loses.

I would like to use **accuracy** as the evaluation metric of my prediction. Since the dataset has the statistics of both teams, there must be exactly one winner and one loser of each match. Therefore the results of matches should be **perfectly balanced**, exactly `50%/50%`, and accuracy should be a good metric here. I didn't choose recall or precesion because neither false negative or false positive is more important than the other here. The model is good as long as it has a good accuracy.

### Baseline Model
Again, before start doing the baseline model, I need to do some data cleaning here. As mentioned above, I would only use the team statistics, so firstly I removed all the rows where the column `'position'` is not `'team'`. There are `24418` rows remaining, and each match is stored as two rows for each team, meaning there are `12209` matches at total. Then I found that there are one match (two rows) that has `'gamelength'` less than 15 minutes (107 seconds to be exact). It is not reasonable that a game takes such a short time. There must be some problem that stopped the match, so I decided to remove these two rows. Also, I found that there are three matches that none of the two teams has the `'result'` column to be `1`, meaning there's no winner. This is also abnormal so I removed them. And finally, I removed all the irrelevant columns, and only kept the basic information of the match (`'result','patch','league','side'`) and the statistics at 15th minutes (columns ending with `'at15'`).

I also noticed that there's some data missing. The statistics at 15 minutes is entirely missing as long as the rows have the value of column `'league'` to be `'LPL'` or `'LDL'` (thus they are **Missing by Design**). And very few statistics are also missing for the rows with `'league'` to be `'WCS'`. The imputation methods I've learned so far do not make too much sense in this situation, since they are not representing the real data and could possibly lower the evaluation metrics of my predictions. I decided to ignore them, and only train and test the model with the rest of the dataset.

So my baseline model officially starts here. I used the Decision Tree Classifier (`sklearn.tree.DecisionTreeClassifier`) for the baseline model, and there are 7 features used: `'league', 'killsat15', 'assistsat15', 'deathsat15', 'opp_killsat15', 'opp_assistsat15', 'opp_deathsat15'`. The feature `'league'` represents the name of the esports association in a region. For example, the league in NA is called `'LCS'`. Different league has different gaming style. Some league might be doing better on reversing the early game disadvantage, thus influence the results. Since this feature has the data type string, and there's no inherent ordering between leagues, it should be a **nomial** data. I did an **one-hot-encoding** on this feature.

The rest 6 features are all related to the kills and deaths on the team and its opponent team, which are the most intuitive statistics to see which team is more advantageous at the 15th minutes. If one team has more kills and less deaths than its opponents, then we can say this team is somewhat playing better. These features are all floats (**numeric**), so I **leave them as is**. In addition, I did not use any hyperparameter, all the parameters of the `DecisionTreeClassifier` are the default values.

I splitted the dataset using `sklearn.model_selection.train_test_split` to make a 75% train set and a 25% test set. The accuracy I got from the baseline model is about `0.603`, with some random error every time the `train_test_split` runs.
As mentioned the result column of this dataset is perfectly balanced, if I guess all the results to be 0 or 1, I would get a `0.5` accuracy. So the accuracy of this basline model, around `0.6`, is not so good in my opinion.

### Final Model
For the baseline model I only used the most intuitive statistics (kills, assists, and deaths) that represents the advantage of a team at 15th minutes. For the final model, I added some other statistics, `'golddiffat15','xpdiffat15','csdiffat15'`, which are more slight advantages we can observe in the game. These three statistics are the ones that every good League of Legends players pursuing for in every game. I also added the features `'patch'` (game version) and `'side'` (the team is at upper-right side or bottom-left side of the game map), since the game pace changes in every game patch, and in some patches, one side is more advantageous than the other side. 

For the final model, I am still using the DecisionTreeClassifier. But this time, I used the `sklearn.model_selection.GridSearchCV` to search for good hyperparameters of the Decision Tree. I listed several values for each of the parameters of Decision Tree `max_depth, min_samples_split`, and `criterion`. The parameters that performed best are: `criterion: 'gini', max_depth: 4`, and `min_samples_split: 2`. The accuracy ended up with `0.746` for both the train set and test set.

### Fairness Evaluation
The league `LCK`, located in South Korea, winned the Final Worlds Championship in 2022. When the gaming level of a team is better, it has a better chance to reverse the disadvantage in the early game. And I noticed that my final model has an accuracy of around `0.72` when predicting the matches made in `LCK`. Is my model more unfair when doing predictions on `LCK` than those on other leagues? I will do a permutation test on this.

First I have the null and alternative hypothesis:
- Null Hypothesis: my model is fair; the accuracy for my two subsets are roughly the same

- Alternative Hypothesis: my model is unfair; the accuracy for the LCK subset is lower than the other leagues subset

I picked the parity measure to be accuracy parity again, because we want to show the predictions are independent of `'league'`. I shuffled the column `'league'` for `500` times. For each shuffle, I splitted the dataset based on the column `'league'` to be `LCK` or others, made predictions with my final model, substract the other subset accuracy from `LCK` subset accuracy, and record it in a numpy array. After comparing the simulated accuracy difference and the observed difference, **P-value** ended up to be `0.258`. Despite it is small, but compared to the **standard 0.05 P-value**, it is not small enough for us to reject the null hypothesis. So the conclusion is that we **failed to reject the null hypothesis**, this observation might be happening by random chance. We cannot confidently say that my model is unfair.

# Code

### Imports

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

### Data Cleaning

In [2]:
# Remove all the player rows, keeping only the team statistics rows
full_matches = pd.read_csv('2022_LoL_esports_match_data_from_OraclesElixir_20221108.csv')
team_stat = full_matches[full_matches['position'] == 'team']
team_stat.head()

  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,gameid,datacompleteness,url,league,year,split,playoffs,date,game,patch,...,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
10,ESPORTSTMNT01_2690210,complete,,LCK CL,2022,Spring,0,2022-01-10 07:44:08,1,12.01,...,510.0,107.0,-1617.0,-23.0,5.0,10.0,6.0,6.0,18.0,5.0
11,ESPORTSTMNT01_2690210,complete,,LCK CL,2022,Spring,0,2022-01-10 07:44:08,1,12.01,...,487.0,-107.0,1617.0,23.0,6.0,18.0,5.0,5.0,10.0,6.0
22,ESPORTSTMNT01_2690219,complete,,LCK CL,2022,Spring,0,2022-01-10 08:38:24,1,12.01,...,555.0,-1763.0,-906.0,-22.0,1.0,1.0,3.0,3.0,3.0,1.0
23,ESPORTSTMNT01_2690219,complete,,LCK CL,2022,Spring,0,2022-01-10 08:38:24,1,12.01,...,533.0,1763.0,906.0,22.0,3.0,3.0,1.0,1.0,1.0,3.0
34,8401-8401_game_1,partial,https://lpl.qq.com/es/stats.shtml?bmid=8401,LPL,2022,Spring,0,2022-01-10 09:24:26,1,12.01,...,,,,,,,,,,


In [3]:
abnormal_matches = set()

# Matches that take less than 15 minutes
abnormal_matches.update(team_stat[team_stat['gamelength'] <= 15*60]['gameid'].values)
# Matches that have no winner
no_winner = team_stat.groupby('gameid')['result'].apply(sum)
abnormal_matches.update(no_winner[no_winner == 0].index)

abnormal_matches

{'ESPORTSTMNT03_2788015', 'ESPORTSTMNT04_2170436', 'ESPORTSTMNT05_2980802'}

In [4]:
# Remove the abnormal matches
for id in abnormal_matches:
    team_stat = team_stat[team_stat['gameid'] != id]

In [5]:
# Keep only the basic information columns, and the statistics at 15th minute
stats_15 = team_stat[[
    'result','patch','league','side', # basic information of the match
    'goldat15','xpat15','csat15','opp_goldat15','opp_xpat15','opp_csat15',
    'golddiffat15','xpdiffat15','csdiffat15',
    'killsat15','assistsat15','deathsat15',
    'opp_killsat15','opp_assistsat15','opp_deathsat15']]

# Remove the rows where the statistics at 15th minute is missing
stats_15 = stats_15[pd.notna(stats_15['goldat15'])]
stats_15.head()

Unnamed: 0,result,patch,league,side,goldat15,xpat15,csat15,opp_goldat15,opp_xpat15,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
10,0,12.01,LCK CL,Blue,24806.0,28001.0,487.0,24699.0,29618.0,510.0,107.0,-1617.0,-23.0,5.0,10.0,6.0,6.0,18.0,5.0
11,1,12.01,LCK CL,Red,24699.0,29618.0,510.0,24806.0,28001.0,487.0,-107.0,1617.0,23.0,6.0,18.0,5.0,5.0,10.0,6.0
22,0,12.01,LCK CL,Blue,23522.0,28848.0,533.0,25285.0,29754.0,555.0,-1763.0,-906.0,-22.0,1.0,1.0,3.0,3.0,3.0,1.0
23,1,12.01,LCK CL,Red,25285.0,29754.0,555.0,23522.0,28848.0,533.0,1763.0,906.0,22.0,3.0,3.0,1.0,1.0,1.0,3.0
46,1,12.01,LCK CL,Blue,24795.0,31342.0,560.0,23604.0,29044.0,545.0,1191.0,2298.0,15.0,3.0,8.0,1.0,1.0,1.0,3.0


### Baseline Model

In [6]:
# Columns with numeric features
stats_15_col = ['killsat15', 'assistsat15', 'deathsat15', 'opp_killsat15', 'opp_assistsat15', 'opp_deathsat15']

# Splitting the test and train sets
X = stats_15.drop(['result'], axis=1)
y = stats_15['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# The pre-processing Transformer
preproc = ColumnTransformer([
    # One hot encode the nomial feature
    ('one-hot', OneHotEncoder(), ['league']),
    # Leave the numeric features as is
    ('numeric_cols', FunctionTransformer(lambda x: x,validate=False), stats_15_col)
    ], 
    # Drop the rest columns
    remainder='drop') 

# The Pipeline that do the prediction
pl = Pipeline([
    ('pre-processing', preproc),
    ('dtc', DecisionTreeClassifier())
])

# Making predictions
pl.fit(X_train,y_train)
pred = pl.predict(X_test)
print('The accuracy score is:', accuracy_score(y_true=y_test, y_pred=pred))

The accuracy score is: 0.603019877675841


### Final Model

In [7]:
# The hyperparameters I want to test with
hyperparameters = {
    'dtc__max_depth': [2, 3, 4, 5, 7, 10, 13, 15, 18, None], 
    'dtc__min_samples_split': [2, 3, 5, 7, 10, 15, 20],
    'dtc__criterion': ['gini', 'entropy']
}

In [8]:
# The numeric features used (statistics at 15th minute)
stats_15_col = [
    'golddiffat15','xpdiffat15','csdiffat15',
    'killsat15','assistsat15','deathsat15',
    'opp_killsat15','opp_assistsat15','opp_deathsat15']

# Split the train and test sets again 
X = stats_15.drop(['result'], axis=1)
y = stats_15['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# The pre-processing Transformer
preproc = ColumnTransformer([
    # Standarize the numeric features
    ('numeric_cols', StandardScaler(), stats_15_col),
    # One-hot encode the nomial features
    ('ont-hot', OneHotEncoder(), ['league','patch','side'])
    ],
    # Drop the rest of the columns
    remainder='drop')

# The Pipeline that do the prediction
pl = Pipeline([
    ('pre-processing', preproc),
    ('dtc', DecisionTreeClassifier())
])

# Making prediction without using any hyperparameters
pl.fit(X_train,y_train)
pred = pl.predict(X_test)
accuracy_score(y_true=y_test, y_pred=pred)

0.6620795107033639

In [9]:
# Search for the best hyperparameters using GridSearchCV
grids = GridSearchCV(pl, param_grid=hyperparameters, return_train_score=True)
grids.fit(X_train, y_train)
grids.best_params_

{'dtc__criterion': 'gini', 'dtc__max_depth': 4, 'dtc__min_samples_split': 2}

In [10]:
# The accuracy score on the train set using the best hyperparameter combination
grids.score(X_train, y_train)

0.752420998980632

In [11]:
# The accuracy score on the test set using the best hyperparameter combination
grids.score(X_test, y_test)

0.7356651376146789

### Fairness Evaluation

In [12]:
# Splits the dataset into two subsets

# The LCK subset
lck = stats_15[stats_15['league'] == 'LCK']
X_lck = lck.drop(['result'], axis=1)
y_lck = lck['result']

# The other leagues subset
other = stats_15[stats_15['league'] != 'LCK']
X_other = other.drop(['result'], axis=1)
y_other = other['result']

# Observed statistics
observed_stat = grids.score(X_lck, y_lck) - grids.score(X_other, y_other)
grids.score(X_lck, y_lck), grids.score(X_other, y_other), observed_stat

(0.7205567451820128, 0.7495248574572372, -0.028968112275224334)

In [13]:
# Number of repetitions
n_repetitions = 500

accuracy_diffs = []
for _ in range(n_repetitions):
    
    # Step 1: Shuffle the league column
    shuffled_league = (
        stats_15['league']
        .sample(frac=1)
        .reset_index(drop=True)
    )
    
    # Step 2: Put the shuffled column in a DataFrame
    shuffled = (
        stats_15
        .assign(**{'shuffled_league': shuffled_league})
    )
    
    # Step 3: Compute the accuracy difference of the predictions on each subset
    lck = shuffled[shuffled['shuffled_league'] == 'LCK']
    X_lck = lck.drop(['result'], axis=1)
    y_lck = lck['result']

    other = shuffled[shuffled['shuffled_league'] != 'LCK']
    X_other = other.drop(['result'], axis=1)
    y_other = other['result']

    accuracy_diff = grids.score(X_lck, y_lck) - grids.score(X_other, y_other)
    
    # Step 4: Store the result
    accuracy_diffs.append(accuracy_diff)

In [14]:
# Calculate the P-Value
pval = np.mean(accuracy_diffs <= observed_stat)
pval

0.258