# Exploring external data for LPL games

In [2]:
import pandas as pd
import json

**Disclaimer:**

This notebook explores data from [Oracle's Elixir](https://oracleselixir.com), a community stats platform created and maintained by former 100 Thieves data scientist Tim Sevenhuysen. The resources are provided freely within Riot Games's terms and policies.

Taking [the data from Oracle's Elixir](https://oracleselixir.com/tools/downloads) as a starting point spares us from scraping another open data repository, [Leaguepedia](https://lol.fandom.com/wiki/League_of_Legends_Esports_Wiki). Should his data prove insufficient, web scraping will be required to get a minimum viable dataset.

## Oracle's Elixir

We will load Tim's data and see if it has the fields that will help us.

In [3]:
data_year = 2020
filename = f'external-data/{data_year}_LoL_esports_match_data_from_OraclesElixir.csv'
df_2020 = pd.read_csv(filename)
df_2020.head()

Unnamed: 0,gameid,datacompleteness,url,league,year,split,playoffs,date,game,patch,...,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
0,ESPORTSTMNT03/1241318,complete,http://matchhistory.na.leagueoflegends.com/en/...,KeSPA,2020,,0,2020-01-03 07:33:26,1,9.24,...,118.0,165.0,166.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0
1,ESPORTSTMNT03/1241318,complete,http://matchhistory.na.leagueoflegends.com/en/...,KeSPA,2020,,0,2020-01-03 07:33:26,1,9.24,...,98.0,-399.0,150.0,-7.0,0.0,0.0,0.0,1.0,0.0,0.0
2,ESPORTSTMNT03/1241318,complete,http://matchhistory.na.leagueoflegends.com/en/...,KeSPA,2020,,0,2020-01-03 07:33:26,1,9.24,...,140.0,-409.0,-1837.0,-11.0,0.0,0.0,1.0,0.0,1.0,0.0
3,ESPORTSTMNT03/1241318,complete,http://matchhistory.na.leagueoflegends.com/en/...,KeSPA,2020,,0,2020-01-03 07:33:26,1,9.24,...,135.0,51.0,-401.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ESPORTSTMNT03/1241318,complete,http://matchhistory.na.leagueoflegends.com/en/...,KeSPA,2020,,0,2020-01-03 07:33:26,1,9.24,...,28.0,-233.0,257.0,-8.0,0.0,0.0,0.0,0.0,1.0,0.0


First look: extremely promising. Now let's see what leagues it has.

In [4]:
df_2020['league'].unique()

array(['KeSPA', 'LPL', 'LPLOL', 'GLL', 'BL', 'LHE', 'DL', 'PRM', 'LFL',
       'PGN', 'SLO', 'LMF', 'LEC', 'LCSA', 'CBLOL', 'LCS', 'HM', 'EBL',
       'UL', 'DDH', 'OPL', 'VCS', 'TCL', 'TAL', 'BRCC', 'UKLC', 'UPL',
       'LCK', 'BM', 'CK', 'LJL', 'LLA', 'OTBLX', 'LCL', 'PCS', 'NEXO',
       'EUM', 'LDL', 'BIG', 'OCS', 'Riot', 'MSC', 'RCL', 'NLC', 'HC',
       'CU', 'WLDs', 'NEST', 'NASG', 'CT', 'LAS', 'DCup'], dtype=object)

In [5]:
lpl_2020_df = df_2020[df_2020['league']=='LPL'].copy()
lpl_2020_df

Unnamed: 0,gameid,datacompleteness,url,league,year,split,playoffs,date,game,patch,...,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
120,5655-7249,complete,https://lpl.qq.com/es/stats.shtml?bmid=5655,LPL,2020,Spring,0,2020-01-13 09:22:00,1,10.01,...,84.0,1635.0,1816.0,53.0,2.0,1.0,1.0,2.0,3.0,0.0
121,5655-7249,complete,https://lpl.qq.com/es/stats.shtml?bmid=5655,LPL,2020,Spring,0,2020-01-13 09:22:00,1,10.01,...,72.0,97.0,123.0,20.0,0.0,2.0,1.0,1.0,3.0,1.0
122,5655-7249,complete,https://lpl.qq.com/es/stats.shtml?bmid=5655,LPL,2020,Spring,0,2020-01-13 09:22:00,1,10.01,...,168.0,-2007.0,-2190.0,-51.0,0.0,2.0,2.0,2.0,3.0,0.0
123,5655-7249,complete,https://lpl.qq.com/es/stats.shtml?bmid=5655,LPL,2020,Spring,0,2020-01-13 09:22:00,1,10.01,...,138.0,-1199.0,-1497.0,-32.0,0.0,3.0,1.0,1.0,3.0,1.0
124,5655-7249,complete,https://lpl.qq.com/es/stats.shtml?bmid=5655,LPL,2020,Spring,0,2020-01-13 09:22:00,1,10.01,...,20.0,515.0,773.0,4.0,1.0,2.0,1.0,0.0,4.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102499,6702-8987,complete,https://lpl.qq.com/es/stats.shtml?bmid=6702,LPL,2020,,0,2020-08-30 12:58:43,4,10.16,...,128.0,188.0,569.0,21.0,0.0,2.0,0.0,0.0,1.0,1.0
102500,6702-8987,complete,https://lpl.qq.com/es/stats.shtml?bmid=6702,LPL,2020,,0,2020-08-30 12:58:43,4,10.16,...,135.0,659.0,282.0,10.0,0.0,2.0,0.0,1.0,0.0,1.0
102501,6702-8987,complete,https://lpl.qq.com/es/stats.shtml?bmid=6702,LPL,2020,,0,2020-08-30 12:58:43,4,10.16,...,18.0,486.0,305.0,5.0,1.0,1.0,2.0,0.0,1.0,1.0
102502,6702-8987,complete,https://lpl.qq.com/es/stats.shtml?bmid=6702,LPL,2020,,0,2020-08-30 12:58:43,4,10.16,...,546.0,-1757.0,-206.0,-16.0,2.0,3.0,3.0,3.0,6.0,2.0


In [10]:
print(lpl_2020_df.columns.to_list())

['gameid', 'datacompleteness', 'url', 'league', 'year', 'split', 'playoffs', 'date', 'game', 'patch', 'participantid', 'side', 'position', 'playername', 'playerid', 'teamname', 'teamid', 'champion', 'ban1', 'ban2', 'ban3', 'ban4', 'ban5', 'gamelength', 'result', 'kills', 'deaths', 'assists', 'teamkills', 'teamdeaths', 'doublekills', 'triplekills', 'quadrakills', 'pentakills', 'firstblood', 'firstbloodkill', 'firstbloodassist', 'firstbloodvictim', 'team kpm', 'ckpm', 'firstdragon', 'dragons', 'opp_dragons', 'elementaldrakes', 'opp_elementaldrakes', 'infernals', 'mountains', 'clouds', 'oceans', 'chemtechs', 'hextechs', 'dragons (type unknown)', 'elders', 'opp_elders', 'firstherald', 'heralds', 'opp_heralds', 'firstbaron', 'barons', 'opp_barons', 'firsttower', 'towers', 'opp_towers', 'firstmidtower', 'firsttothreetowers', 'turretplates', 'opp_turretplates', 'inhibitors', 'opp_inhibitors', 'damagetochampions', 'dpm', 'damageshare', 'damagetakenperminute', 'damagemitigatedperminute', 'wards

**How delightful!**

I don't think we need to go further than this. With Tim Sevenhuysen giving us permission to use his data, I will use his LPL records as a basis for future measurements in that league.

That said, I will also need to take a random sample of his data to train and test a "model score converter" of sorts, to bridge the gap between Riot's data and ours.

Because Riot's data will be used separately to calculate performance scores, I want the same scoring method to apply here. However, the data points available from Riot are far more detailed than the ones available here, so I would like to know how error-prone scores computed from this data will end up being.

For now, let's take a look at Tim's data.

In [15]:
lpl_2020_df.iloc[1,:].to_dict()

{'gameid': '5655-7249',
 'datacompleteness': 'complete',
 'url': 'https://lpl.qq.com/es/stats.shtml?bmid=5655',
 'league': 'LPL',
 'year': 2020,
 'split': 'Spring',
 'playoffs': 0,
 'date': '2020-01-13 09:22:00',
 'game': 1,
 'patch': 10.01,
 'participantid': 2,
 'side': 'Blue',
 'position': 'jng',
 'playername': 'Ning',
 'playerid': 'oe:player:9b22cace0315e520c50f1b8f8ac434c',
 'teamname': 'Invictus Gaming',
 'teamid': 'oe:team:53a258f289c26d94431c0496a54e151',
 'champion': 'Qiyana',
 'ban1': 'Pantheon',
 'ban2': 'Nautilus',
 'ban3': 'Elise',
 'ban4': 'Gangplank',
 'ban5': 'Mordekaiser',
 'gamelength': 2640,
 'result': 1,
 'kills': 0,
 'deaths': 7,
 'assists': 11,
 'teamkills': 28,
 'teamdeaths': 29,
 'doublekills': 0.0,
 'triplekills': 0.0,
 'quadrakills': 0.0,
 'pentakills': 0.0,
 'firstblood': 0.0,
 'firstbloodkill': 0.0,
 'firstbloodassist': 0.0,
 'firstbloodvictim': 0.0,
 'team kpm': 0.6364,
 'ckpm': 1.2955,
 'firstdragon': nan,
 'dragons': nan,
 'opp_dragons': nan,
 'elementaldr

Side commentary: of all the games to choose, I chose one where King Ning ran it down hard.

## Structure of an Oracle's Elixir game entry

- 'date': string, contains YYYY-MM-DD HH:MM:SS layout.
- 'split': string, 'Spring' or 'Summer', or missing (NaN). A missing value denotes regionals.
- 'playoffs': 0 or 1 flag; a smart way of weighting a game's importance.
- 'game': game number within the series; also a smart way of doing that.
- 'patch': equivalent to gameVersion
- 'participantid': equivalent to participantID
- 'position': has five possible values for players (plus 'team' on team rows), can help readjust participant IDs.
- 'side': 'Blue' or 'Red' - note the capitalization
- 'playername': the player's name (e.g. 'Ning'), not to be confused with 'playerid'.
- 'teamname': the team's full name; run a query on team data to map it back to a team ID.
- 'gamelength': also in seconds. Not as precise on the dusty bits like 2640.934, but very serviceable and more economical on memory (int vs float)
- 'result': 0 or 1 - 1 is win in this case. Ning ran it down, but he won... wow.
- 'teamkills': team's entire kills
- 'teamdeaths': team's entire deaths
- 'damagetochampions': that player's damage
- 'wardsplaced': is it Ning alone, or the entire team?
- 'wardskilled': same
- 'controlwardsbought': same
- 'visionscore': same
- 'totalgold': only Ning's
- 'monsterkillsenemyjungle': worth looking at
- 'monsterkillsownjungle': also worth looking at if I can fix my ETL
- 'goldat10'
- 'xpat10'
- 'golddiffat10'
- 'xpdiffat10'
- 'goldat15'
- 'xpat15'
- 'golddiffat15'

The datapoints above are reliably logged. The same cannot be said of these:

- 'dragons': nan
- 'opp_dragons': nan
- 'elementaldrakes': nan
- 'opp_elementaldrakes': nan
- 'infernals': nan
- 'mountains': nan
- 'clouds': nan
- 'oceans': nan
- 'chemtechs': nan
- 'hextechs': nan
- 'dragons (type unknown)': nan
- 'elders': nan
- 'opp_elders': nan
- 'firstherald': nan
- 'heralds': nan
- 'opp_heralds': nan
- 'firstbaron': nan
- 'barons': nan
- 'opp_barons': nan
- 'firsttower': nan
- 'towers': nan
- 'opp_towers': nan
- 'firstmidtower': nan
- 'firsttothreetowers': nan
- 'turretplates': nan
- 'opp_turretplates': nan
- 'inhibitors': nan
- 'opp_inhibitors': nan

But they are interesting datapoints nevertheless.
Let's check the other players.

In [25]:
lpl_2020_df.iloc[10,:].to_dict()

{'gameid': '5655-7249',
 'datacompleteness': 'complete',
 'url': 'https://lpl.qq.com/es/stats.shtml?bmid=5655',
 'league': 'LPL',
 'year': 2020,
 'split': 'Spring',
 'playoffs': 0,
 'date': '2020-01-13 09:22:00',
 'game': 1,
 'patch': 10.01,
 'participantid': 100,
 'side': 'Blue',
 'position': 'team',
 'playername': nan,
 'playerid': nan,
 'teamname': 'Invictus Gaming',
 'teamid': 'oe:team:53a258f289c26d94431c0496a54e151',
 'champion': nan,
 'ban1': 'Pantheon',
 'ban2': 'Nautilus',
 'ban3': 'Elise',
 'ban4': 'Gangplank',
 'ban5': 'Mordekaiser',
 'gamelength': 2640,
 'result': 1,
 'kills': 28,
 'deaths': 29,
 'assists': 75,
 'teamkills': 28,
 'teamdeaths': 29,
 'doublekills': 2.0,
 'triplekills': 0.0,
 'quadrakills': 0.0,
 'pentakills': 0.0,
 'firstblood': 0.0,
 'firstbloodkill': nan,
 'firstbloodassist': nan,
 'firstbloodvictim': nan,
 'team kpm': 0.6364,
 'ckpm': 1.2955,
 'firstdragon': 0.0,
 'dragons': 2.0,
 'opp_dragons': 4.0,
 'elementaldrakes': nan,
 'opp_elementaldrakes': nan,
 '

In [26]:
lpl_2020_df.iloc[11,:].to_dict()

{'gameid': '5655-7249',
 'datacompleteness': 'complete',
 'url': 'https://lpl.qq.com/es/stats.shtml?bmid=5655',
 'league': 'LPL',
 'year': 2020,
 'split': 'Spring',
 'playoffs': 0,
 'date': '2020-01-13 09:22:00',
 'game': 1,
 'patch': 10.01,
 'participantid': 200,
 'side': 'Red',
 'position': 'team',
 'playername': nan,
 'playerid': nan,
 'teamname': 'FunPlus Phoenix',
 'teamid': 'oe:team:33d17f3717f58e12a3da80b377221fb',
 'champion': nan,
 'ban1': 'Aphelios',
 'ban2': 'Akali',
 'ban3': 'Lucian',
 'ban4': 'Varus',
 'ban5': 'Xayah',
 'gamelength': 2640,
 'result': 0,
 'kills': 29,
 'deaths': 28,
 'assists': 78,
 'teamkills': 29,
 'teamdeaths': 28,
 'doublekills': 3.0,
 'triplekills': 1.0,
 'quadrakills': 0.0,
 'pentakills': 0.0,
 'firstblood': 1.0,
 'firstbloodkill': nan,
 'firstbloodassist': nan,
 'firstbloodvictim': nan,
 'team kpm': 0.6591,
 'ckpm': 1.2955,
 'firstdragon': 1.0,
 'dragons': 4.0,
 'opp_dragons': 2.0,
 'elementaldrakes': nan,
 'opp_elementaldrakes': nan,
 'infernals': 2

Tim's data is laid out so that each game takes 12 rows in a batch: ten player rows plus two team rows. The team rows carry 'participantid' 100 (Blue) or 200 (Red) and 'position' set to 'team', which gives some flexibility and marks them as team stats. That way, we can directly query the dataset to extract those juicy tidbits.
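The layout described above can be queried along these lines. This is a minimal sketch, not the notebook's own code: the helper names `split_player_and_team_rows` and `rows_per_game` are made up for illustration, and it assumes team rows are always marked with 'position' equal to 'team', as in the two rows shown.

```python
import pandas as pd


def split_player_and_team_rows(games: pd.DataFrame):
    """Separate player rows from team rows.

    Assumption: team rows are the ones whose 'position' is 'team'
    (they also carry 'participantid' 100 or 200 in the rows inspected above).
    """
    is_team = games["position"] == "team"
    return games[~is_team].copy(), games[is_team].copy()


def rows_per_game(games: pd.DataFrame) -> pd.Series:
    """Count rows per 'gameid'; the layout described above implies 12 each."""
    return games.groupby("gameid").size()
```

For instance, `players_df, teams_df = split_player_and_team_rows(lpl_2020_df)` would give two frames to query separately, and `rows_per_game(lpl_2020_df).eq(12).all()` would check whether every game really has 12 rows.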

That said, can we use Riot's mappings data to refer to his data? <if game not found in esportsmappings, fetch data from Tim's>

I'd prefer bringing it to 2023.

In [29]:
data_year = 2023
filename = f'external-data/{data_year}_LoL_esports_match_data_from_OraclesElixir.csv'
df_2023 = pd.read_csv(filename)
lpl_2023_df = df_2023[df_2023['league']=='LPL'].copy()
lpl_2023_df.head()

Unnamed: 0,gameid,datacompleteness,url,league,year,split,playoffs,date,game,patch,...,opp_csat15,golddiffat15,xpdiffat15,csdiffat15,killsat15,assistsat15,deathsat15,opp_killsat15,opp_assistsat15,opp_deathsat15
204,9691-9691_game_1,partial,https://lpl.qq.com/es/stats.shtml?bmid=9691,LPL,2023,Spring,0,2023-01-14 07:23:06,1,13.01,...,,,,,,,,,,
205,9691-9691_game_1,partial,https://lpl.qq.com/es/stats.shtml?bmid=9691,LPL,2023,Spring,0,2023-01-14 07:23:06,1,13.01,...,,,,,,,,,,
206,9691-9691_game_1,partial,https://lpl.qq.com/es/stats.shtml?bmid=9691,LPL,2023,Spring,0,2023-01-14 07:23:06,1,13.01,...,,,,,,,,,,
207,9691-9691_game_1,partial,https://lpl.qq.com/es/stats.shtml?bmid=9691,LPL,2023,Spring,0,2023-01-14 07:23:06,1,13.01,...,,,,,,,,,,
208,9691-9691_game_1,partial,https://lpl.qq.com/es/stats.shtml?bmid=9691,LPL,2023,Spring,0,2023-01-14 07:23:06,1,13.01,...,,,,,,,,,,


Dear lord this is a massacre.

In [31]:
lpl_2023_df.iloc[5,:].to_dict()

{'gameid': '9691-9691_game_1',
 'datacompleteness': 'partial',
 'url': 'https://lpl.qq.com/es/stats.shtml?bmid=9691',
 'league': 'LPL',
 'year': 2023,
 'split': 'Spring',
 'playoffs': 0,
 'date': '2023-01-14 07:23:06',
 'game': 1,
 'patch': 13.01,
 'participantid': 6,
 'side': 'Red',
 'position': 'top',
 'playername': 'Biubiu',
 'playerid': 'oe:player:445af96e9e883e18105c9c6072d97a1',
 'teamname': 'Team WE',
 'teamid': 'oe:team:62c1cd9465dc63824593ee5046f5aa8',
 'champion': 'Renekton',
 'ban1': 'Zeri',
 'ban2': 'Ryze',
 'ban3': 'Yuumi',
 'ban4': 'Ahri',
 'ban5': 'Fiora',
 'gamelength': 1836,
 'result': 1,
 'kills': 0,
 'deaths': 0,
 'assists': 6,
 'teamkills': 16,
 'teamdeaths': 6,
 'doublekills': nan,
 'triplekills': nan,
 'quadrakills': nan,
 'pentakills': nan,
 'firstblood': nan,
 'firstbloodkill': 0.0,
 'firstbloodassist': nan,
 'firstbloodvictim': nan,
 'team kpm': 0.5229,
 'ckpm': 0.719,
 'firstdragon': nan,
 'dragons': nan,
 'opp_dragons': nan,
 'elementaldrakes': nan,
 'opp_ele

The 10- and 15-minute mark measurements are not available for individual players. Ooh wee.

The same applies to teams, to my dismay. This situation highlights some of the limitations of extracting LPL data nowadays.
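Rather than reading dictionaries row by row, one could list the columns that are empty across all "partial" rows. The sketch below is only an illustration: `fully_missing_columns` is a hypothetical helper, and it assumes the 'datacompleteness' column is the right way to tell the two kinds of rows apart.

```python
import pandas as pd


def fully_missing_columns(games: pd.DataFrame, completeness: str) -> list:
    """Return the columns that are NaN in every row with the given
    'datacompleteness' value (e.g. 'partial' or 'complete').

    Note: if no row matches, every column counts as fully missing,
    because an "all" over an empty selection is True.
    """
    subset = games[games["datacompleteness"] == completeness]
    all_nan = subset.isna().all()
    return all_nan[all_nan].index.tolist()
```

Calling `fully_missing_columns(lpl_2023_df, "partial")` would then show which fields cannot be used at all for partial games.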

In [30]:
lpl_2023_df.iloc[11,:].to_dict()

{'gameid': '9691-9691_game_1',
 'datacompleteness': 'partial',
 'url': 'https://lpl.qq.com/es/stats.shtml?bmid=9691',
 'league': 'LPL',
 'year': 2023,
 'split': 'Spring',
 'playoffs': 0,
 'date': '2023-01-14 07:23:06',
 'game': 1,
 'patch': 13.01,
 'participantid': 200,
 'side': 'Red',
 'position': 'team',
 'playername': nan,
 'playerid': nan,
 'teamname': 'Team WE',
 'teamid': 'oe:team:62c1cd9465dc63824593ee5046f5aa8',
 'champion': nan,
 'ban1': 'Zeri',
 'ban2': 'Ryze',
 'ban3': 'Yuumi',
 'ban4': 'Ahri',
 'ban5': 'Fiora',
 'gamelength': 1836,
 'result': 1,
 'kills': 16,
 'deaths': 6,
 'assists': 42,
 'teamkills': 16,
 'teamdeaths': 6,
 'doublekills': nan,
 'triplekills': nan,
 'quadrakills': nan,
 'pentakills': nan,
 'firstblood': 1.0,
 'firstbloodkill': nan,
 'firstbloodassist': nan,
 'firstbloodvictim': nan,
 'team kpm': 0.5229,
 'ckpm': 0.719,
 'firstdragon': nan,
 'dragons': 4.0,
 'opp_dragons': 0.0,
 'elementaldrakes': nan,
 'opp_elementaldrakes': nan,
 'infernals': nan,
 'mounta

Tim's data could be categorized into two sections:

"datacompleteness" == "complete": may function the way our data does.

"datacompleteness" == "partial": this is where we need a model: measure the differences in how each game is scored, fit a linear regression against our "complete" game scores, and establish a margin of error.

From there, we can use our model to port those scores to our rating system, with the caveat that the margin of error carries over. So, LPL ratings may be over- or under-inflated.
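A rough sketch of what such a converter could look like follows. It rests on an assumption that is mine, not something the data confirms: that every "complete" game can be scored twice, once with all fields (the "full" score) and once using only the fields that survive in "partial" rows (the "reduced" score). The function names are hypothetical, and the residual spread is just one possible way to express the margin of error.

```python
import numpy as np


def fit_score_converter(reduced_scores, full_scores):
    """Fit full_score ~ slope * reduced_score + intercept by least squares.

    Returns (slope, intercept, margin), where margin is the standard
    deviation of the residuals (ddof=2 because two parameters are fitted,
    so at least three games are needed).
    """
    x = np.asarray(reduced_scores, dtype=float)
    y = np.asarray(full_scores, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    margin = residuals.std(ddof=2)
    return slope, intercept, margin


def convert_score(reduced_score, slope, intercept):
    """Port a score computed on partial data onto the full-score scale."""
    return slope * reduced_score + intercept
```

A converted LPL score would then be reported as `convert_score(s, slope, intercept)` give or take `margin`, which makes the possible inflation explicit.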

The good thing is that our summer split data could serve as a rigorous testing set. So could a random sample of the other games that we have.
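When drawing that random sample, it probably makes sense to sample whole games rather than individual rows, so that all the rows of a game stay together. A minimal sketch, with a hypothetical helper name and an arbitrary default seed:

```python
import pandas as pd


def sample_games(games: pd.DataFrame, fraction: float, seed: int = 0) -> pd.DataFrame:
    """Pick a random fraction of games (by 'gameid') and return all their rows."""
    game_ids = pd.Series(games["gameid"].unique())
    chosen = game_ids.sample(frac=fraction, random_state=seed)
    return games[games["gameid"].isin(chosen)].copy()
```

For example, `sample_games(df_2020, fraction=0.1)` would hold out roughly a tenth of the 2020 games, player and team rows included.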