# League of Legends : Role identification by Machine Learning

## Introduction

Accurate role identification in League of Legends has long been a concern for analysts working on match data. The Riot Games API offers a role identification, but it has proven quite unreliable.

We will explore some of the role attributions returned by the API, which identifies a role through two fields:
 * Role, which can be "SOLO", "DUO", "DUO_CARRY", "DUO_SUPPORT" or "NONE"
 * Lane, which can be "TOP", "MIDDLE", "BOTTOM", "JUNGLE" or "NONE"

The metagame has defined 5 major roles in a League of Legends game:
 * The Toplaner is in charge of the toplane, is often a tank or a fighter, serves as the frontline of the team or as a splitpusher, and is usually tagged as "SOLO" and "TOP"
 * The Midlaner is in charge of the midlane, is often a mage or an assassin, is the team's source of magic damage and/or the nuker that kills strong but squishy enemies, and is tagged as "SOLO" and "MIDDLE"
 * The Jungler is in charge of the jungle but also usually helps the other laners during the game, is usually a fighter with good engage capacity and sustain, and is tagged as "NONE" and "JUNGLE"
 * The AD Carry, or Marksman, shares the botlane with the Support, is weak early in the game and needs a lot of farm to be useful to the team, and is tagged as "DUO_CARRY" and "BOTTOM"
 * The Support shares the botlane with the AD Carry, is a tank or a mage with high utility (sometimes a burst mage), has to protect the AD Carry and offer utility (vision, crowd control) to the team, and is tagged as "DUO_SUPPORT" and "BOTTOM"
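
The mapping described in this list can be written down as a small lookup table. This is only a sketch: the names `META_ROLES` and `combined_label` are hypothetical, and the `LANE_ROLE` label format is the one built by the exploration code in this notebook.

```python
# Metagame role -> (lane, role) tags, as described in the list above.
META_ROLES = {
    "Toplaner": ("TOP", "SOLO"),
    "Midlaner": ("MIDDLE", "SOLO"),
    "Jungler": ("JUNGLE", "NONE"),
    "AD Carry": ("BOTTOM", "DUO_CARRY"),
    "Support": ("BOTTOM", "DUO_SUPPORT"),
}


def combined_label(lane, role):
    """Join the two API fields into a single label such as 'TOP_SOLO'."""
    return lane + "_" + role


# The composition a team gets when every player is tagged according to the metagame.
standard_composition = sorted(
    combined_label(lane, role) for lane, role in META_ROLES.values()
)
print(standard_composition)
```

Sorting the labels makes two teams comparable regardless of the order in which their participants are listed.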

Let's explore a sample of games to see which role combinations appear.

In [1]:
import pymongo

client = pymongo.MongoClient()
db = client.game
gamesTable = db.gameData_92_1

roles = []
for g in gamesTable.find({ "gameDuration": { "$gt": 900 }}).limit(1000):
    
    teamRoles = {
        100:[],
        200:[]
    }
    
    for p in g["participants"]:
        teamRoles[p['teamId']].append(p['timeline']['lane']+"_"+p['timeline']['role'])
    
    roles.append(sorted(teamRoles[100]))
    roles.append(sorted(teamRoles[200]))
    

    
import json
roleCount = {json.dumps(x):roles.count(x) for x in roles}
roleCount

{'["BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "MIDDLE_SOLO", "TOP_SOLO"]': 1284,
 '["BOTTOM_SOLO", "JUNGLE_NONE", "MIDDLE_DUO_CARRY", "MIDDLE_DUO_SUPPORT", "TOP_SOLO"]': 54,
 '["JUNGLE_NONE", "MIDDLE_DUO", "MIDDLE_DUO", "MIDDLE_DUO_SUPPORT", "TOP_SOLO"]': 40,
 '["NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT"]': 116,
 '["NONE_DUO", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT"]': 96,
 '["BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "JUNGLE_NONE", "MIDDLE_SOLO"]': 58,
 '["BOTTOM_DUO", "BOTTOM_DUO", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "JUNGLE_NONE"]': 1,
 '["BOTTOM_SOLO", "JUNGLE_NONE", "MIDDLE_DUO", "MIDDLE_DUO_SUPPORT", "MIDDLE_DUO_SUPPORT"]': 4,
 '["BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "MIDDLE_DUO", "MIDDLE_DUO"]': 70,
 '["JUNGLE_NONE", "MIDDLE_DUO_SUPPORT", "MIDDLE_DUO_SUPPORT", "MIDDLE_DUO_SUPPORT", "MIDDLE_DUO_SUPPORT"]': 3,
 '["BOTTOM_DUO_CARRY", "BOTTOM

Some examples: for game 3906939349, the red team has this composition: ["NONE_DUO", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT"]

http://timeline.canisback.com/game/EUW1/3906939349

https://matchhistory.euw.leagueoflegends.com/fr/#match-details/EUW1/3906939349?tab=overview

The game went so badly that the API's role identification can barely decide on a role and cannot find the lane at all. The blue team also suffers from misclassification, with a composition of "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT", "NONE_DUO_SUPPORT"

For game 3907056015, blue botlane quickly regrouped at the midlane, hence a composition of ["JUNGLE_NONE", "MIDDLE_DUO", "MIDDLE_DUO", "MIDDLE_DUO_SUPPORT", "TOP_SOLO"]

One of the most common misclassifications can be found in game 3903980933, where the jungler is considered a midlane or toplane support.

Not every game strictly follows the metagame, so not every team should get the usual ["BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "MIDDLE_SOLO", "TOP_SOLO"] composition for the classification to be accurate. But as shown, about one third of the teams are not classified as following the metagame, which is not realistic given the current state of the game.
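
The share of teams tagged with the standard composition can be computed directly from a list of team compositions. The sketch below uses a made-up toy sample rather than the dataset, and `standard_share` is a hypothetical helper name.

```python
from collections import Counter

STANDARD = (
    "BOTTOM_DUO_CARRY",
    "BOTTOM_DUO_SUPPORT",
    "JUNGLE_NONE",
    "MIDDLE_SOLO",
    "TOP_SOLO",
)


def standard_share(team_compositions):
    """Fraction of teams whose sorted labels match the standard composition."""
    counts = Counter(tuple(sorted(team)) for team in team_compositions)
    total = sum(counts.values())
    return counts[STANDARD] / total if total else 0.0


# Made-up toy sample (not taken from the dataset): two standard teams, one broken one.
toy_teams = [
    ["TOP_SOLO", "JUNGLE_NONE", "MIDDLE_SOLO", "BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT"],
    ["MIDDLE_SOLO", "TOP_SOLO", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "BOTTOM_DUO_CARRY"],
    ["NONE_DUO_SUPPORT"] * 5,
]
print(standard_share(toy_teams))
```

Using a `Counter` on tuples also avoids calling `list.count` once per team, which gets slow on larger samples.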

## Feature selection

A key part of the work here is to select features, from the data we can obtain of a game, that will help to classify each player role in the game, based on some expert knowledge of League of Legends.

### Map position

The most useful data we can get from the API is the position of each player on the map every minute. Given that most roles have an assigned lane and that players tend to stay on their lane during the first minutes of the match, this is probably the most helpful information for our task.

We start by defining the area of each lane, using the map data available here: https://developer.riotgames.com/game-constants.html and some manual work to select the areas.

In [4]:
import numpy as np
import matplotlib.path as mplPath

midlane = mplPath.Path(np.array([[4200 ,3500],[11300 ,10500],[13200 ,13200],[10500 ,11300],[3300 ,4400],[1600,1600]]))
toplane = mplPath.Path(np.array([[-120 ,1600],[-120 ,14980],[13200 ,14980],[13200 ,13200],[4000 ,13200],[1600,11000],[1600 ,1600]]))
botlane = mplPath.Path(np.array([[1600 ,-120],[14870,-120],[14870,13200],[13200,13200],[13270,4000],[10500,1700],[1600,1600]]))
jungle1 = mplPath.Path(np.array([[1600,5000],[1600,11000],[4000 ,13200],[9800 ,13200],[10500 ,11300],[3300 ,4400]]))
jungle2 = mplPath.Path(np.array([[5000,1700],[4200 ,3500],[11300 ,10500],[13270,9900],[13270,4000],[10500,1700]]))
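
To make the lane areas more concrete, here is a dependency-free sketch of a point-in-polygon test based on the classic ray-casting rule. It illustrates the question that a call such as `midlane.contains_point(coord)` answers; it is not matplotlib's implementation, and `point_in_polygon` is a hypothetical helper. The vertices are the midlane ones defined above.

```python
# Midlane vertices, copied from the polygon definition above.
MIDLANE = [
    (4200, 3500), (11300, 10500), (13200, 13200),
    (10500, 11300), (3300, 4400), (1600, 1600),
]


def point_in_polygon(point, polygon):
    """Ray casting: count how many edges a horizontal ray going right crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge is relevant only if it straddles the horizontal line through y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


print(point_in_polygon((7400, 7400), MIDLANE))   # around the middle of the map
print(point_in_polygon((1000, 13000), MIDLANE))  # top-left corner of the map
```

An odd number of crossings means the point is inside the polygon, an even number means it is outside.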

Then, using those areas, we can create a function that lists the position of each player for every early-game frame.

In [5]:
def getPositions(timeline):
    frames = timeline['frames']
    
    participantsPositions = {1:[],2:[],3:[],4:[],5:[],6:[],7:[],8:[],9:[],10:[]}
    # First 10 frames, skipping the very first one (frame 0), when everybody is still at the spawn
    for i in range(1,11):
        # For each participant
        for k in frames[i]['participantFrames']:
            position = None
            
            # Position on the map
            coord = [frames[i]['participantFrames'][k]['position']['x'],frames[i]['participantFrames'][k]['position']['y']]
            # Check where is the position
            if jungle1.contains_point(coord) or jungle2.contains_point(coord):
                position = "jungle"
            elif midlane.contains_point(coord):
                position = "mid"
            elif toplane.contains_point(coord):
                position = "top"
            elif botlane.contains_point(coord):
                position = "bot"
            # Save the position for the participant
            participantsPositions[int(k)].append(position)
    return participantsPositions

In [16]:
for i,j in getPositions(g["timeline"]).items():
    print(str(i)+" : "+str(j))

1 : ['bot', 'bot', 'bot', 'bot', 'bot', 'bot', 'bot', 'bot', 'bot', 'bot']
2 : ['jungle', 'jungle', 'jungle', 'jungle', 'jungle', 'mid', 'bot', 'jungle', 'jungle', 'jungle']
3 : ['jungle', 'bot', 'bot', 'bot', 'jungle', 'bot', 'bot', 'bot', 'bot', 'bot']
4 : ['jungle', 'mid', 'mid', 'mid', 'mid', 'mid', 'mid', 'jungle', 'mid', 'mid']
5 : ['top', 'top', 'top', 'top', 'bot', 'top', 'top', 'jungle', 'top', 'jungle']
6 : ['mid', 'mid', 'mid', 'mid', 'mid', 'mid', 'mid', 'jungle', 'jungle', 'mid']
7 : ['jungle', 'bot', 'bot', 'bot', 'bot', 'bot', 'bot', 'bot', 'jungle', 'bot']
8 : ['jungle', 'bot', 'bot', 'bot', 'bot', 'bot', 'jungle', 'jungle', 'bot', 'bot']
9 : ['jungle', 'jungle', 'jungle', 'jungle', 'jungle', 'jungle', 'jungle', 'jungle', 'jungle', 'jungle']
10 : ['jungle', 'top', 'top', 'top', 'top', 'jungle', 'top', 'top', 'top', 'top']


We could feed the results of this function directly into the dataset, but summarizing them as the most visited lane in the early game might make the learning task easier.

In [27]:
def getMostFrequentLane(participantsPositions):
    mostFrequentLane = {}
    for participant in participantsPositions:
        laneFrequency = {"mid":0,"top":0,"bot":0,"jungle":0}
        
        for lane in participantsPositions[participant]:
            if lane is not None:
                laneFrequency[lane] += 1
        
        mostFrequentLane[participant] = max(laneFrequency, key=laneFrequency.get)
    return mostFrequentLane

In [28]:
getMostFrequentLane(getPositions(g["timeline"]))

{1: 'bot',
 2: 'jungle',
 3: 'bot',
 4: 'mid',
 5: 'top',
 6: 'mid',
 7: 'bot',
 8: 'bot',
 9: 'jungle',
 10: 'top'}

This way, we have easy access to the early-game positions and to the most frequent position of each player in the game.
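
One detail worth keeping in mind: `max` returns the first key holding the highest count, so ties (and players with no recorded lane at all) fall back to the order of the keys in the dictionary. The sketch below reproduces the per-player logic on made-up position lists; `most_frequent_lane` is a hypothetical name.

```python
def most_frequent_lane(lanes):
    """Most visited lane for one player; ties are resolved by dictionary order."""
    lane_frequency = {"mid": 0, "top": 0, "bot": 0, "jungle": 0}
    for lane in lanes:
        if lane is not None:
            lane_frequency[lane] += 1
    return max(lane_frequency, key=lane_frequency.get)


# Tie between "mid" and "jungle": "mid" wins because it comes first in the dictionary.
print(most_frequent_lane(["jungle", "mid", "mid", "jungle"]))
# No recorded lane at all: every count is 0, so the first key is returned.
print(most_frequent_lane([None, None]))
```

If this default matters for a given analysis, returning `None` when all counts are zero would be a reasonable variation.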

### Items

Another very discriminating aspect of the game regarding roles is the players' builds. Some items are mostly bought by only a few roles. However, as there are hundreds of items in the game and a player can buy several of them, a simple one-hot encoding would require too many features to be efficient. Instead, we will analyze games and list item trends by role.

For this analysis, we will only select the teams showing the usual ["BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE", "MIDDLE_SOLO", "TOP_SOLO"] composition, as the API labels are quite reliable in that case.
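
The selection rule can be sketched as a small predicate. Since a team has five players, matching the set of five distinct labels implies that each role appears exactly once. The participant dictionaries below are made up and only contain the fields this check reads; `is_standard_team` is a hypothetical name.

```python
ROLE_COMPOSITION = {
    "JUNGLE_NONE", "TOP_SOLO", "MIDDLE_SOLO",
    "BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT",
}


def is_standard_team(participants, team_id):
    """True when the labels of the given team are exactly the standard composition."""
    labels = [
        p["timeline"]["lane"] + "_" + p["timeline"]["role"]
        for p in participants
        if p["teamId"] == team_id
    ]
    return set(labels) == ROLE_COMPOSITION


def make_participant(team_id, lane, role):
    """Build a minimal, made-up participant record for the example."""
    return {"teamId": team_id, "timeline": {"lane": lane, "role": role}}


toy_participants = [
    make_participant(100, "TOP", "SOLO"),
    make_participant(100, "JUNGLE", "NONE"),
    make_participant(100, "MIDDLE", "SOLO"),
    make_participant(100, "BOTTOM", "DUO_CARRY"),
    make_participant(100, "BOTTOM", "DUO_SUPPORT"),
] + [make_participant(200, "NONE", "DUO_SUPPORT") for _ in range(5)]

print(is_standard_team(toy_participants, 100), is_standard_team(toy_participants, 200))
```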

In [30]:
#Initializing the target composition
role_composition = set(["JUNGLE_NONE","TOP_SOLO","MIDDLE_SOLO","BOTTOM_DUO_CARRY","BOTTOM_DUO_SUPPORT"])

import requests
#Initializing the itemsPlayed array
itemsPlayed = {}

r = requests.get('http://ddragon.leagueoflegends.com/cdn/9.2.1/data/en_US/item.json')
dataItems = r.json()
for r in role_composition:
    itemsPlayed[r] = {}
    for i in dataItems['data']:
        itemsPlayed[r][int(i)] = 0


for g in gamesTable.find({ "gameDuration": { "$gt": 900 }, "mapId":11 }).limit(10000):
    
    
    #Get teams that have a perfect metagame composition
    positionsByTeam = {100:[],200:[]}
    for p in g['participants']:
        positionsByTeam[p['teamId']].append(p['timeline']['lane']+"_"+p['timeline']['role'])
    teamOK = {}
    teamOK[100] = role_composition == set(positionsByTeam[100])
    teamOK[200] = role_composition == set(positionsByTeam[200])
    
    #Get items used
    for p in g['participants']:
        if teamOK[p['teamId']]:
            
            #For all item slots
            for i in range(0,7):
                #If there is an item in this slot
                if p['stats']['item'+str(i)] > 0:
                    #Increment item count for the specific role
                    itemsPlayed[p['timeline']['lane']+"_"+p['timeline']['role']][p['stats']['item'+str(i)]] += 1

In [33]:
import pandas as pd
dfItemsFrequency = pd.DataFrame(itemsPlayed).T
dfItemsFrequency

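| role | 1001 | 1004 | 1006 | 1011 | 1026 | 1027 | 1028 | 1029 | 1031 | 1033 | ... | 4105 | 4201 | 4202 | 4203 | 4204 | 4301 | 4302 | 4401 | 4402 | 4403 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MIDDLE_SOLO | 274 | 23 | 2 | 32 | 1790 | 21 | 504 | 142 | 77 | 207 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BOTTOM_DUO_CARRY | 238 | 54 | 0 | 25 | 214 | 37 | 117 | 73 | 79 | 259 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TOP_SOLO | 320 | 10 | 61 | 395 | 658 | 48 | 1230 | 545 | 513 | 455 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| JUNGLE_NONE | 473 | 3 | 33 | 645 | 408 | 50 | 1543 | 715 | 702 | 417 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BOTTOM_DUO_SUPPORT | 364 | 511 | 48 | 92 | 505 | 45 | 1190 | 1082 | 481 | 1017 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |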


Then we sort items into categories for each role. If a role accounts for more than 90% of an item's purchases, we label the item as "over used" for that role; for more than 60%, as "mostly used"; and for less than 10%, as "under used".
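
As a worked example of these thresholds, the sketch below applies them to the counts of item 1004 shown in the table above. `usage_category` is a hypothetical helper; the thresholds are the ones stated in the text.

```python
def usage_category(count, total):
    """Label one role's share of an item's purchases (None when no label applies)."""
    if total == 0:
        return None
    share = count / total
    if share > 0.9:
        return "over used"
    if share > 0.6:
        return "mostly used"
    if share < 0.1:
        return "under used"
    return None


# Purchase counts of item 1004 per role, taken from the frequency table above.
item_1004 = {
    "MIDDLE_SOLO": 23,
    "BOTTOM_DUO_CARRY": 54,
    "TOP_SOLO": 10,
    "JUNGLE_NONE": 3,
    "BOTTOM_DUO_SUPPORT": 511,
}
total = sum(item_1004.values())
categories = {role: usage_category(count, total) for role, count in item_1004.items()}
print(categories)
```

Here the support accounts for 511 of the 601 purchases (about 85%), so the item is "mostly used" by supports and "under used" by every other role.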

In [43]:
overUsedItems = {
    "BOTTOM_DUO_SUPPORT":[],
    "JUNGLE_NONE":[],
    "TOP_SOLO":[],
    "BOTTOM_DUO_CARRY":[],
    "MIDDLE_SOLO":[]
}

mostlyUsedItems = {
    "BOTTOM_DUO_SUPPORT":[],
    "JUNGLE_NONE":[],
    "TOP_SOLO":[],
    "BOTTOM_DUO_CARRY":[],
    "MIDDLE_SOLO":[]
}

underUsedItems = {
    "BOTTOM_DUO_SUPPORT":[],
    "JUNGLE_NONE":[],
    "TOP_SOLO":[],
    "BOTTOM_DUO_CARRY":[],
    "MIDDLE_SOLO":[]
}

for itemId in dfItemsFrequency:
    
    for role,j in enumerate(dfItemsFrequency[itemId]):
        
        if j>(dfItemsFrequency[itemId].sum() * 0.9):
            overUsedItems[dfItemsFrequency[itemId].index[role]].append(str(itemId))
            
        elif j>(dfItemsFrequency[itemId].sum() * 0.6):
            mostlyUsedItems[dfItemsFrequency[itemId].index[role]].append(str(itemId))
            
        elif j<(dfItemsFrequency[itemId].sum() * 0.1):
            underUsedItems[dfItemsFrequency[itemId].index[role]].append(str(itemId))

### Dataset creation

Using the previously created functions, we can build a dataset from the match data.

In [88]:
participants = []

#Find all games longer than 15 minutes and on the map Summoner's Rift
for g in gamesTable.find({ "gameDuration": { "$gt": 900 }, "mapId":11 }).limit(10000):
    
    #Get teams that have a perfect metagame composition
    positionsByTeam = {100:[],200:[]}
    for p in g['participants']:
        positionsByTeam[p['teamId']].append(p['timeline']['lane']+"_"+p['timeline']['role'])
    teamOK = {}
    teamOK[100] = role_composition == set(positionsByTeam[100])
    teamOK[200] = role_composition == set(positionsByTeam[200])
    
    #Get the players positions from the timeline
    playersPositions = getPositions(g['timeline'])
    
    playerMostFrequentPosition = getMostFrequentLane(playersPositions)
    
    for p in g['participants']:
        
        duration = g["gameDuration"]
        
        #If the participant is not in a perfect metagame team, ignore it
        if not teamOK[p['teamId']]:
            continue
            
            
        participant = {}
        
        participantId = p["participantId"]
        
        lanes = ["jungle","top","mid","bot"]
        
        #one hot encode positions
        for lane in lanes:
            participant['most-frequent-'+lane] = 0
            
            for i in range(0,10):
                participant['position-'+lane+'-'+str(i)] = 0
        
        participant['most-frequent-'+ playerMostFrequentPosition[participantId]] = 1
        
        for t,lane in enumerate(playersPositions[p['participantId']]):
            if lane is not None:
                participant['position-'+lane+'-'+str(t)] = 1
                
        #Init items lists 
        for role in role_composition:
            participant["has-item-overUsed-"+role] = 0
            participant["has-item-mostlyUsed-"+role] = 0
            participant["has-item-underUsed-"+role] = 0
        
        #Get the item ID for each of the 7 slots and check if they are in one of the three items lists
        for i in range(0,7):
            
            #Check if there is an item for the slot
            if p['stats']['item'+str(i)] > 0:
                for role in role_composition:

                    if str(p['stats']['item'+str(i)]) in overUsedItems[role]:
                        participant["has-item-overUsed-"+role] = 1

                    if str(p['stats']['item'+str(i)]) in mostlyUsedItems[role]:
                        participant["has-item-mostlyUsed-"+role] = 1

                    if str(p['stats']['item'+str(i)]) in underUsedItems[role]:
                        participant["has-item-underUsed-"+role] = 1
                        
        
        #Player position, what we are looking for
        participant['position'] = p['timeline']['lane']+"_"+p['timeline']['role']
        
        #game identification features
        participant["participantId"] = participantId
        participant["gameId"] = g["gameId"]
        
        participants.append(participant)

Finally, we put all the participants in a dataframe and export it to a CSV file.

In [112]:
df = pd.DataFrame(participants)
df.to_csv("dataset_participants.csv")

## Learning

### Decision tree

The decision tree is a model I especially appreciate for its ability to handle complex (i.e. non-linear) problems and for its interpretability. It will be the first algorithm we use.

First, we prepare the dataset by separating the label from the data and dropping the identification columns that must not be used for training.

In [116]:
target = df["position"]
data = df.drop(["position","participantId","gameId"], axis=1)

To get reliable results, we evaluate the model on 10 random splits using scikit-learn's StratifiedShuffleSplit, a merge of stratified k-fold and shuffle split. Each split trains on 90% of the data and tests on the remaining 10%.

In [117]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

shuffle_split = StratifiedShuffleSplit(train_size=0.9, n_splits=10)

c = DecisionTreeClassifier(criterion="entropy")

accs = []

for train_index, test_index in shuffle_split.split(data,target):
    c.fit(data.iloc[train_index], target.iloc[train_index])
    accs.append(accuracy_score(target.iloc[test_index], c.predict(data.iloc[test_index])))

print(accs)
sum(accs) / len(accs)



[0.9938438438438438, 0.9938438438438438, 0.9948948948948949, 0.9941441441441441, 0.9947447447447447, 0.9947447447447447, 0.9945945945945946, 0.9954954954954955, 0.9941441441441441, 0.9936936936936936]


0.9944144144144144

The model reaches a mean accuracy of about 99.4%, with very little variance between runs. At least on this dataset, the decision tree reproduces the match data classifier very well.

We can visualize the model with graphviz:
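
The spread between runs can be summarized with the standard library. The sketch below uses the ten accuracies printed above. Note, as a caveat, that the test sets of shuffle splits can overlap from one run to the next, so the standard deviation is only a rough indication of stability.

```python
from statistics import mean, stdev

# Accuracies of the ten runs, copied from the output above.
accs = [
    0.9938438438438438, 0.9938438438438438, 0.9948948948948949,
    0.9941441441441441, 0.9947447447447447, 0.9947447447447447,
    0.9945945945945946, 0.9954954954954955, 0.9941441441441441,
    0.9936936936936936,
]

print(f"accuracy: {mean(accs):.4f} +/- {stdev(accs):.4f}")
print(f"worst run: {min(accs):.4f}, best run: {max(accs):.4f}")
```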

In [124]:
from sklearn import tree 
tree.export_graphviz(c,out_file='tree.dot', feature_names=data.columns.values) 

import graphviz
with open("./tree.dot") as f :
    dot_graph = f.read()
graphviz.Source(dot_graph)
graphviz.Source(dot_graph, format="png").render("tree")

dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.728593 to fit


'tree.png'

We can also interpret the model and check for the feature importance : 

In [87]:
# Pair every feature name with its importance instead of hard-coding the number of features
fi = list(zip(data.columns, c.feature_importances_))
sorted(fi, key=lambda tup: tup[1], reverse=True)

[('has-item-overUsed-BOTTOM_DUO_SUPPORT', 0.21924977079479513),
 ('has-item-overUsed-JUNGLE_NONE', 0.20103462498582575),
 ('most-frequent-top', 0.19196424695194564),
 ('most-frequent-mid', 0.18822603291901113),
 ('has-item-overUsed-BOTTOM_DUO_CARRY', 0.14688478210171443),
 ('most-frequent-bot', 0.025444734621574535),
 ('has-item-underUsed-BOTTOM_DUO_CARRY', 0.008412281431664849),
 ('position-mid-1', 0.0037084826027131007),
 ('most-frequent-jungle', 0.0020229627798552803),
 ('position-jungle-1', 0.001879738970058399),
 ('position-bot-2', 0.001600135013586653),
 ('position-top-1', 0.001403842946801677),
 ('position-bot-1', 0.0013473374625705844),
 ('position-jungle-2', 0.001330852818218696),
 ('position-mid-2', 0.0012586572047082914),
 ('position-mid-3', 0.0008247429455926458),
 ('has-item-mostlyUsed-BOTTOM_DUO_CARRY', 0.0007037492782240552),
 ('position-bot-6', 0.0005477429021549349),
 ('has-item-underUsed-MIDDLE_SOLO', 0.00050717967610481),
 ('position-top-2', 0.00026723050334414233),


This way, we can see that the most useful features are the most frequent lanes and some of the over used and under used items by role.
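
To see how the importance is shared between feature families, the values can be summed by name prefix. The sketch below uses the twenty importances listed above, rounded to four decimals; `feature_family` is a hypothetical helper, and the remaining features, which are not shown in the listing, are ignored.

```python
from collections import defaultdict

# Importances copied from the listing above, rounded to four decimals.
importances = [
    ("has-item-overUsed-BOTTOM_DUO_SUPPORT", 0.2192),
    ("has-item-overUsed-JUNGLE_NONE", 0.2010),
    ("most-frequent-top", 0.1920),
    ("most-frequent-mid", 0.1882),
    ("has-item-overUsed-BOTTOM_DUO_CARRY", 0.1469),
    ("most-frequent-bot", 0.0254),
    ("has-item-underUsed-BOTTOM_DUO_CARRY", 0.0084),
    ("position-mid-1", 0.0037),
    ("most-frequent-jungle", 0.0020),
    ("position-jungle-1", 0.0019),
    ("position-bot-2", 0.0016),
    ("position-top-1", 0.0014),
    ("position-bot-1", 0.0013),
    ("position-jungle-2", 0.0013),
    ("position-mid-2", 0.0013),
    ("position-mid-3", 0.0008),
    ("has-item-mostlyUsed-BOTTOM_DUO_CARRY", 0.0007),
    ("position-bot-6", 0.0005),
    ("has-item-underUsed-MIDDLE_SOLO", 0.0005),
    ("position-top-2", 0.0003),
]


def feature_family(name):
    """Map a feature name to its family, based on its prefix."""
    if name.startswith("has-item-"):
        return "items"
    if name.startswith("most-frequent-"):
        return "most frequent lane"
    return "per-minute position"


totals = defaultdict(float)
for name, importance in importances:
    totals[feature_family(name)] += importance

for family, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{family}: {total:.3f}")
```

On these values, the item features and the most frequent lane carry almost all of the importance, while the forty per-minute position features contribute very little.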

### Exploring the model

The intuition is that the match data classifier is not fully reliable, even for the "perfect" compositions we have been using for training. Since the number of misclassifications is limited, we can manually check the errors using some match report tools:

https://matchhistory.euw.leagueoflegends.com/fr/#match-details/EUW1/{gameId}

http://timeline.canisback.com/game/EUW1/{gameId}

In [148]:
# Copy the test rows of the last split so that adding columns does not touch df
dataTest = df.iloc[test_index].copy()

dataTest["position_prediction"] = c.predict(data.iloc[test_index])
dataTest["correct"] = dataTest["position"] == dataTest["position_prediction"]
# Keep only the misclassified participants
dfTest = dataTest[~dataTest["correct"]][["gameId","participantId","position","position_prediction"]].copy()
dfTest

| index | gameId | participantId | position | position_prediction |
|---|---|---|---|---|
| 29721 | 3906773310 | 2 | MIDDLE_SOLO | TOP_SOLO |
| 28815 | 3908502948 | 6 | MIDDLE_SOLO | TOP_SOLO |
| 51506 | 3903834705 | 2 | TOP_SOLO | JUNGLE_NONE |
| 10502 | 3904691330 | 8 | TOP_SOLO | MIDDLE_SOLO |
| 20091 | 3905988984 | 2 | BOTTOM_DUO_CARRY | BOTTOM_DUO_SUPPORT |
| 9763 | 3906055368 | 9 | MIDDLE_SOLO | TOP_SOLO |
| 42375 | 3905147178 | 6 | TOP_SOLO | JUNGLE_NONE |
| 22832 | 3906085185 | 3 | MIDDLE_SOLO | JUNGLE_NONE |
| 51001 | 3907362555 | 7 | MIDDLE_SOLO | JUNGLE_NONE |
| 55982 | 3904105855 | 8 | TOP_SOLO | MIDDLE_SOLO |


Let's check the first game, with gameId 3906773310 (https://matchhistory.euw.leagueoflegends.com/fr/#match-details/EUW1/3906773310?tab=overview , http://timeline.canisback.com/game/EUW1/3906773310 ): the match data classifies Akali as a midlane player, while our classifier predicts a toplaner. By checking the data and using the tree visualization, we can find where the problem is.


In [162]:
df[((df["gameId"] == 3906773310) & (df["participantId"] == 2))].iloc[0].to_dict()

{'gameId': 3906773310,
 'has-item-mostlyUsed-BOTTOM_DUO_CARRY': 0,
 'has-item-mostlyUsed-BOTTOM_DUO_SUPPORT': 0,
 'has-item-mostlyUsed-JUNGLE_NONE': 0,
 'has-item-mostlyUsed-MIDDLE_SOLO': 0,
 'has-item-mostlyUsed-TOP_SOLO': 0,
 'has-item-overUsed-BOTTOM_DUO_CARRY': 0,
 'has-item-overUsed-BOTTOM_DUO_SUPPORT': 0,
 'has-item-overUsed-JUNGLE_NONE': 0,
 'has-item-overUsed-MIDDLE_SOLO': 0,
 'has-item-overUsed-TOP_SOLO': 0,
 'has-item-underUsed-BOTTOM_DUO_CARRY': 1,
 'has-item-underUsed-BOTTOM_DUO_SUPPORT': 1,
 'has-item-underUsed-JUNGLE_NONE': 1,
 'has-item-underUsed-MIDDLE_SOLO': 0,
 'has-item-underUsed-TOP_SOLO': 0,
 'most-frequent-bot': 0,
 'most-frequent-jungle': 0,
 'most-frequent-mid': 1,
 'most-frequent-top': 0,
 'participantId': 2,
 'position': 'MIDDLE_SOLO',
 'position-bot-0': 0,
 'position-bot-1': 0,
 'position-bot-2': 0,
 'position-bot-3': 0,
 'position-bot-4': 0,
 'position-bot-5': 0,
 'position-bot-6': 0,
 'position-bot-7': 0,
 'position-bot-8': 0,
 'position-bot-9': 0,
 'positi

Following the decision path, we can see that the tree overgrows on the position features. Given that Akali was not midlane from minutes 6 to 9, that she has no overused midlane item, and that she has no underused item for either midlane or toplane, the confusion is understandable. On the other hand, she never went toplane, so predicting a toplaner is a clear mistake from the model. This points to better feature selection: instead of one feature per lane and minute, using the lane presence frequency would reduce the number of features and should improve the accuracy.

But first we should investigate the other misclassification cases.

In [160]:
dfTest["truth"] = "None"
dfTest.loc[29721,"truth"] = "MIDDLE_SOLO"
dfTest.loc[28815,"truth"] = "MIDDLE_SOLO"
dfTest.loc[51506,"truth"] = "TOP_SOLO"
dfTest.loc[10502,"truth"] = "MIDDLE_SOLO"
dfTest.loc[20091,"truth"] = "BOTTOM_DUO_CARRY"
dfTest.loc[9763,"truth"] = "MIDDLE_SOLO"
dfTest.loc[42375,"truth"] = "TOP_SOLO"
dfTest.loc[22832,"truth"] = "MIDDLE_SOLO"
dfTest.loc[51001,"truth"] = "MIDDLE_SOLO"
dfTest.loc[55982,"truth"] = "MIDDLE_SOLO"
dfTest.loc[18934,"truth"] = "MIDDLE_SOLO"
dfTest.loc[50491,"truth"] = "MIDDLE_SOLO"
dfTest.loc[12706,"truth"] = "MIDDLE_SOLO"
dfTest.loc[23483,"truth"] = "TOP_SOLO"
dfTest.loc[52531,"truth"] = "MIDDLE_SOLO"
dfTest.loc[5765,"truth"] = "TOP_SOLO"
dfTest.loc[16152,"truth"] = "MIDDLE_SOLO"
dfTest.loc[39456,"truth"] = "TOP_SOLO"
dfTest.loc[58789,"truth"] = "TOP_SOLO"
dfTest.loc[43366,"truth"] = "MIDDLE_SOLO"
dfTest.loc[20711,"truth"] = "MIDDLE_SOLO"
dfTest.loc[16724,"truth"] = "MIDDLE_SOLO"
dfTest.loc[33402,"truth"] = "TOP_SOLO"
dfTest.loc[37800,"truth"] = "TOP_SOLO"
dfTest.loc[37527,"truth"] = "BOTTOM_DUO_SUPPORT"
dfTest.loc[8727,"truth"] = "TOP_SOLO"
dfTest.loc[5199,"truth"] = "BOTTOM_DUO_SUPPORT"
dfTest.loc[61240,"truth"] = "MIDDLE_SOLO"
dfTest.loc[41281,"truth"] = "TOP_SOLO"
dfTest.loc[55869,"truth"] = "BOTTOM_DUO_SUPPORT"
dfTest.loc[9273,"truth"] = "MIDDLE_SOLO"
dfTest.loc[13732,"truth"] = "BOTTOM_DUO_SUPPORT"
dfTest.loc[45762,"truth"] = "TOP_SOLO"
dfTest.loc[58323,"truth"] = "TOP_SOLO"
dfTest.loc[39853,"truth"] = "TOP_SOLO"
dfTest.loc[33473,"truth"] = "MIDDLE_SOLO"
dfTest.loc[4723,"truth"] = "TOP_SOLO"
dfTest.loc[66532,"truth"] = "JUNGLE_NONE"
dfTest.loc[19157,"truth"] = "MIDDLE_SOLO"
dfTest.loc[28049,"truth"] = "MIDDLE_SOLO"
dfTest.loc[64694,"truth"] = "MIDDLE_SOLO"
dfTest.loc[61827,"truth"] = "TOP_SOLO"

dfTest

| index | gameId | participantId | position | position_prediction | truth |
|---|---|---|---|---|---|
| 29721 | 3906773310 | 2 | MIDDLE_SOLO | TOP_SOLO | MIDDLE_SOLO |
| 28815 | 3908502948 | 6 | MIDDLE_SOLO | TOP_SOLO | MIDDLE_SOLO |
| 51506 | 3903834705 | 2 | TOP_SOLO | JUNGLE_NONE | TOP_SOLO |
| 10502 | 3904691330 | 8 | TOP_SOLO | MIDDLE_SOLO | MIDDLE_SOLO |
| 20091 | 3905988984 | 2 | BOTTOM_DUO_CARRY | BOTTOM_DUO_SUPPORT | BOTTOM_DUO_CARRY |
| 9763 | 3906055368 | 9 | MIDDLE_SOLO | TOP_SOLO | MIDDLE_SOLO |
| 42375 | 3905147178 | 6 | TOP_SOLO | JUNGLE_NONE | TOP_SOLO |
| 22832 | 3906085185 | 3 | MIDDLE_SOLO | JUNGLE_NONE | MIDDLE_SOLO |
| 51001 | 3907362555 | 7 | MIDDLE_SOLO | JUNGLE_NONE | MIDDLE_SOLO |
| 55982 | 3904105855 | 8 | TOP_SOLO | MIDDLE_SOLO | MIDDLE_SOLO |


Notable exceptions: a Mordekaiser played as AD Carry in the botlane with a support item, a roaming Lulu mid with Ardent Censer (a support item) considered as a support, and many midlaners considered as junglers because they roam a lot.

Possible solutions: adding summoner spells (junglers virtually always take Smite) and adding income stats (supports usually don't farm much, hence a low income).
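
The summoner spell idea can be sketched as a single boolean feature. This assumes that Smite has the key 11 in the Data Dragon summoner spell data and that participants expose `spell1Id` and `spell2Id`; `has_smite` is a hypothetical helper and the participants below are made up.

```python
SMITE_SPELL_ID = 11  # assumption: Data Dragon key of the Smite summoner spell


def has_smite(participant):
    """True when one of the two summoner spells of the participant is Smite."""
    return SMITE_SPELL_ID in (participant["spell1Id"], participant["spell2Id"])


# Made-up participants carrying only the two fields the helper reads.
toy_jungler = {"spell1Id": 4, "spell2Id": 11}
toy_midlaner = {"spell1Id": 14, "spell2Id": 4}

print(has_smite(toy_jungler), has_smite(toy_midlaner))
```

Such a feature would mostly help with roaming midlaners predicted as junglers, since it does not depend on the positions at all.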

Case 33402 is also quite interesting. Its path in the model is short: the tree only considers that a player who is mostly top but was midlane at position 2 is a midlaner. Let's investigate:

In [161]:
df[((df["gameId"] == 3903868851) & (df["participantId"] == 3))].iloc[0].to_dict()

{'gameId': 3903868851,
 'has-item-mostlyUsed-BOTTOM_DUO_CARRY': 0,
 'has-item-mostlyUsed-BOTTOM_DUO_SUPPORT': 0,
 'has-item-mostlyUsed-JUNGLE_NONE': 0,
 'has-item-mostlyUsed-MIDDLE_SOLO': 0,
 'has-item-mostlyUsed-TOP_SOLO': 1,
 'has-item-overUsed-BOTTOM_DUO_CARRY': 0,
 'has-item-overUsed-BOTTOM_DUO_SUPPORT': 0,
 'has-item-overUsed-JUNGLE_NONE': 0,
 'has-item-overUsed-MIDDLE_SOLO': 0,
 'has-item-overUsed-TOP_SOLO': 0,
 'has-item-underUsed-BOTTOM_DUO_CARRY': 1,
 'has-item-underUsed-BOTTOM_DUO_SUPPORT': 1,
 'has-item-underUsed-JUNGLE_NONE': 1,
 'has-item-underUsed-MIDDLE_SOLO': 1,
 'has-item-underUsed-TOP_SOLO': 0,
 'most-frequent-bot': 0,
 'most-frequent-jungle': 0,
 'most-frequent-mid': 0,
 'most-frequent-top': 1,
 'participantId': 3,
 'position': 'TOP_SOLO',
 'position-bot-0': 0,
 'position-bot-1': 0,
 'position-bot-2': 0,
 'position-bot-3': 0,
 'position-bot-4': 0,
 'position-bot-5': 0,
 'position-bot-6': 0,
 'position-bot-7': 0,
 'position-bot-8': 0,
 'position-bot-9': 0,
 'position-

In [157]:
df[((df["most-frequent-top"] == 1) & (df["position-mid-2"] == 1))][["gameId","participantId","position"]]

| index | gameId | participantId | position |
|---|---|---|---|
| 3955 | 3905473192 | 6 | MIDDLE_SOLO |
| 4585 | 3908342739 | 6 | MIDDLE_SOLO |
| 5682 | 3903535288 | 8 | MIDDLE_SOLO |
| 15957 | 3904274379 | 8 | MIDDLE_SOLO |
| 25927 | 3903906376 | 8 | MIDDLE_SOLO |
| 33402 | 3903868851 | 3 | TOP_SOLO |
| 33615 | 3904561713 | 6 | MIDDLE_SOLO |
| 37566 | 3904456963 | 7 | MIDDLE_SOLO |
| 41021 | 3904870841 | 7 | MIDDLE_SOLO |
| 62005 | 3906654151 | 1 | MIDDLE_SOLO |


 * Case 3955: Diana midlane, swapped to top at minute 7.
 * Case 4585: Riven midlane, swapped to top at minute 5.
 * Case 5682: Urgot midlane, swapped to top at minute 5.
 * Case 15957: Akali midlane, swapped to top at minute 5.
 * Case 25927: Ryze toplane, possibly ganked midlane (hence the midlane position at index 2), so this one is a match data misclassification.
 * Case 33615: Vladimir midlane, swapped to top at minute 5, went back midlane at minute 8 and returned toplane afterwards.
 * Case 37566: Yasuo midlane, swapped to top at minute 5.
 * Case 41021: Lissandra midlane, swapped to top at minute 5.
 * Case 62005: Viktor midlane, swapped to top at minute 6.
 * Case 62011: Zed midlane, swapped to top at minute 6 (same game, followed Viktor).

Here the culprit is the lane swap. It seems that the match data classifier only takes the positions before minute 5 into account to make its decision.
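
One possible way to flag such games is to compare the dominant lane before and after a split point. This is only a sketch: `dominant_lane` and `detect_lane_swap` are hypothetical helpers, the split at 5 frames mirrors the observation above, and the position lists are made up in the format returned by `getPositions`.

```python
from collections import Counter


def dominant_lane(lanes):
    """Most frequent non-None lane in the list, or None when nothing was recorded."""
    counts = Counter(lane for lane in lanes if lane is not None)
    return counts.most_common(1)[0][0] if counts else None


def detect_lane_swap(lanes, split=5):
    """Return (early lane, late lane, swapped) for one player's position list."""
    early = dominant_lane(lanes[:split])
    late = dominant_lane(lanes[split:])
    swapped = early is not None and late is not None and early != late
    return early, late, swapped


# Made-up player who starts midlane and moves toplane around the fifth frame.
swapping_player = ["mid", "mid", "mid", "mid", "top", "top", "top", "top", "top", "top"]
# Made-up player who never leaves the botlane.
steady_player = ["bot"] * 10

print(detect_lane_swap(swapping_player))
print(detect_lane_swap(steady_player))
```

Whether a swapped player should keep the early label or take the late one remains a modeling decision; the flag only makes those games easy to isolate.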

To sum up, the improvements to make to the feature selection are:
 * modify the position features
 * add summoner spells
 * add income stats such as CS or gold per minute. Relating the number of monsters killed to the number of minions killed could also help with jungler misclassification.

There is also a decision to make on how to handle lane swaps.

## Improving feature selection

As seen before, using raw per-minute positions may lead to errors. Instead, we will now use the lane presence frequencies.

In [164]:
def getLaneFrequencies(participantsPositions):
    laneFrequencies = {}
    for participant in participantsPositions:
        laneFrequency = {"mid":0,"top":0,"bot":0,"jungle":0}
        
        for lane in participantsPositions[participant]:
            if lane is not None:
                laneFrequency[lane] += 1
        
        laneFrequencies[participant] = laneFrequency
    return laneFrequencies

In [165]:
getLaneFrequencies(getPositions(g["timeline"]))

{1: {'mid': 6, 'top': 0, 'bot': 0, 'jungle': 3},
 2: {'mid': 0, 'top': 0, 'bot': 9, 'jungle': 1},
 3: {'mid': 0, 'top': 0, 'bot': 9, 'jungle': 1},
 4: {'mid': 0, 'top': 7, 'bot': 0, 'jungle': 3},
 5: {'mid': 0, 'top': 0, 'bot': 2, 'jungle': 8},
 6: {'mid': 4, 'top': 0, 'bot': 0, 'jungle': 4},
 7: {'mid': 0, 'top': 8, 'bot': 0, 'jungle': 2},
 8: {'mid': 0, 'top': 0, 'bot': 9, 'jungle': 1},
 9: {'mid': 1, 'top': 0, 'bot': 1, 'jungle': 8},
 10: {'mid': 0, 'top': 0, 'bot': 9, 'jungle': 1}}
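
Since some frames fall outside every lane polygon, the counts of a player do not always add up to 10 (participant 1 above only has 9 recorded positions). A possible refinement, not used in the rest of this notebook, is to turn the counts into proportions; `lane_proportions` is a hypothetical helper.

```python
def lane_proportions(lane_frequency):
    """Convert lane counts into shares of the recorded positions."""
    total = sum(lane_frequency.values())
    if total == 0:
        # No recorded position at all: return zeros rather than dividing by zero.
        return {lane: 0.0 for lane in lane_frequency}
    return {lane: count / total for lane, count in lane_frequency.items()}


# Counts of participant 1, copied from the output above.
print(lane_proportions({"mid": 6, "top": 0, "bot": 0, "jungle": 3}))
```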

For the income stats, we will rely on the stats at 10 minutes, which should give a reliable overview for the early game.

In [168]:
def getStatsAt10(timeline):
    participantFrame = timeline['frames'][10]['participantFrames']
    
    p = {}
    for k in participantFrame:
        minions = participantFrame[k]["minionsKilled"]
        jungleMinions = participantFrame[k]["jungleMinionsKilled"]
        total = minions + jungleMinions
        row = {
            "minionsKilled":minions,
            "jungleMinionsKilled":jungleMinions,
            # Guard on the total, so that a player with only jungle kills gets a ratio of 1 instead of 0
            "jungleMinionRatio":jungleMinions/total if total > 0 else 0
        }
        p[k] = row
    return p

In [169]:
getStatsAt10(g["timeline"])

{'1': {'minionsKilled': 80,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0},
 '2': {'minionsKilled': 10,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0},
 '3': {'minionsKilled': 64,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0},
 '4': {'minionsKilled': 59,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0},
 '5': {'minionsKilled': 2,
  'jungleMinionsKilled': 56,
  'jungleMinionRatio': 0.9655172413793104},
 '6': {'minionsKilled': 76,
  'jungleMinionsKilled': 4,
  'jungleMinionRatio': 0.05},
 '7': {'minionsKilled': 68,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0},
 '8': {'minionsKilled': 72,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0},
 '9': {'minionsKilled': 8,
  'jungleMinionsKilled': 44,
  'jungleMinionRatio': 0.8461538461538461},
 '10': {'minionsKilled': 4,
  'jungleMinionsKilled': 0,
  'jungleMinionRatio': 0.0}}

As a last step before recreating the dataset, we have to initialize the one-hot encoding of the summoner spells.

In [174]:

r = requests.get('http://ddragon.leagueoflegends.com/cdn/9.2.1/data/en_US/summoner.json')
dataSummoner = r.json()
spells = {}
for r in dataSummoner["data"]:
    if "CLASSIC" in dataSummoner["data"][r]["modes"]:
        spells["spell-"+dataSummoner["data"][r]["key"]] = 0
spells

{'spell-21': 0,
 'spell-1': 0,
 'spell-14': 0,
 'spell-3': 0,
 'spell-4': 0,
 'spell-6': 0,
 'spell-7': 0,
 'spell-11': 0,
 'spell-12': 0}
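
The all-zero dictionary above acts as a template that is copied for each participant. The sketch below shows one way to fill it; `encode_spells` is a hypothetical helper, and ignoring spell ids that are missing from the template is an assumption made here to keep the set of columns fixed.

```python
# Template copied from the output above: one key per summoner spell of the CLASSIC mode.
SPELL_TEMPLATE = {
    "spell-21": 0, "spell-1": 0, "spell-14": 0, "spell-3": 0, "spell-4": 0,
    "spell-6": 0, "spell-7": 0, "spell-11": 0, "spell-12": 0,
}


def encode_spells(spell1_id, spell2_id, template=SPELL_TEMPLATE):
    """One-hot encode the two summoner spells of a participant."""
    encoded = dict(template)  # copy, so that the template itself stays all-zero
    for spell_id in (spell1_id, spell2_id):
        key = "spell-" + str(spell_id)
        if key in encoded:  # assumption: ids outside the template are ignored
            encoded[key] = 1
    return encoded


encoded = encode_spells(4, 11)
print([key for key, value in encoded.items() if value == 1])
```

Without the membership check, an unexpected spell id would silently create an extra column for a single participant.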

## Retraining

With these new features, we can recreate a dataset.

In [237]:
participants = []

#Find all games longer than 15 minutes and on the map Summoner's Rift
for g in gamesTable.find({ "gameDuration": { "$gt": 900 }, "mapId":11 }).limit(10000):
    
    #Get teams that have a perfect metagame composition
    positionsByTeam = {100:[],200:[]}
    for p in g['participants']:
        positionsByTeam[p['teamId']].append(p['timeline']['lane']+"_"+p['timeline']['role'])
    teamOK = {}
    teamOK[100] = role_composition == set(positionsByTeam[100])
    teamOK[200] = role_composition == set(positionsByTeam[200])
    
    #Get the players positions from the timeline
    playersPositions = getPositions(g['timeline'])
    
    playerMostFrequentLane = getMostFrequentLane(playersPositions)
    playerLaneFrequencies = getLaneFrequencies(playersPositions)
    
    playerStats = getStatsAt10(g["timeline"])
    
    for p in g['participants']:
        
        #If the participant is not in a perfect metagame team, ignore it
        if not teamOK[p['teamId']]:
            continue
            
            
        participant = {}
        
        participantId = p["participantId"]
        
        lanes = ["jungle","top","mid","bot"]
        
        #one hot encode positions
        for lane in lanes:
            participant['most-frequent-'+lane] = 0
        
        participant['most-frequent-'+ playerMostFrequentLane[participantId]] = 1
        
        #Lane frequency
        participant = {**participant, **{"lane-frequency-"+k:v for k,v in playerLaneFrequencies[participantId].items()} }
        
                
        #Init items lists 
        for role in role_composition:
            participant["has-item-overUsed-"+role] = 0
            participant["has-item-mostlyUsed-"+role] = 0
            participant["has-item-underUsed-"+role] = 0
        
        #Get the item ID for each of the 7 slots and check if they are in one of the three items lists
        for i in range(0,7):
            
            #Check if there is an item for the slot
            if p['stats']['item'+str(i)] > 0:
                for role in role_composition:

                    if str(p['stats']['item'+str(i)]) in overUsedItems[role]:
                        participant["has-item-overUsed-"+role] = 1

                    if str(p['stats']['item'+str(i)]) in mostlyUsedItems[role]:
                        participant["has-item-mostlyUsed-"+role] = 1

                    if str(p['stats']['item'+str(i)]) in underUsedItems[role]:
                        participant["has-item-underUsed-"+role] = 1
        
        
        #Summoner spells
        participant = {**participant, **spells}
        participant["spell-"+str(p["spell1Id"])] = 1
        participant["spell-"+str(p["spell2Id"])] = 1
        
        #Player stats
        participant = {**participant, **playerStats[str(participantId)]}
                        
        
        #Player position, what we are looking for
        participant['position'] = p['timeline']['lane']+"_"+p['timeline']['role']
        
        #game identification features
        participant["participantId"] = participantId
        participant["gameId"] = g["gameId"]
        
        participants.append(participant)
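
The item features are the most repetitive part of this loop, and the same block is needed again when building the verification set. As a sketch only, the item step could be pulled into a helper. The name `item_flags`, its signature and the item IDs in the example are assumptions made for illustration, not part of the original notebook.

```python
def item_flags(stats, roles, over_used, mostly_used, under_used):
    """Return the has-item-* flags for one participant (hypothetical helper).

    `stats` is the participant's `stats` dict, where item0..item6 hold item
    IDs and 0 means an empty slot. The three *_used arguments map each role
    to a collection of item IDs stored as strings.
    """
    # Item IDs held in the 7 slots, converted to strings
    held = set()
    for slot in range(7):
        item_id = stats.get("item" + str(slot), 0)
        if item_id > 0:
            held.add(str(item_id))

    flags = {}
    for role in roles:
        # A flag is 1 as soon as one held item appears in the role's list
        flags["has-item-overUsed-" + role] = int(any(i in over_used[role] for i in held))
        flags["has-item-mostlyUsed-" + role] = int(any(i in mostly_used[role] for i in held))
        flags["has-item-underUsed-" + role] = int(any(i in under_used[role] for i in held))
    return flags

# Placeholder roles and item IDs, chosen arbitrarily for the example
roles = ["JUNGLE_NONE", "TOP_SOLO"]
over = {"JUNGLE_NONE": {"1400"}, "TOP_SOLO": set()}
mostly = {"JUNGLE_NONE": set(), "TOP_SOLO": {"3071"}}
under = {"JUNGLE_NONE": {"3071"}, "TOP_SOLO": {"1400"}}
stats = {"item0": 1400, "item1": 0, "item2": 0, "item3": 0,
         "item4": 0, "item5": 0, "item6": 3340}

flags = item_flags(stats, roles, over, mostly, under)
print(flags)
```

With such a helper, the training loop and the verification loop would share the same feature logic, which reduces the risk of the two drifting apart.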

In [384]:
df = pd.DataFrame(participants)
df.to_csv("dataset_participants2.csv", index=False)

In [291]:
target = df["position"]
data = df.drop(["position","participantId","gameId"], axis=1)

c2 = DecisionTreeClassifier(criterion="gini", min_samples_split=6)

accs = []

for train_index, test_index in shuffle_split.split(data,target):
    c2.fit(data.iloc[train_index], target.iloc[train_index])
    accs.append(accuracy_score(target.iloc[test_index], c2.predict(data.iloc[test_index])))

print(accs)
sum(accs) / len(accs)

[0.9954954954954955, 0.9950450450450451, 0.995945945945946, 0.9963963963963964, 0.9965465465465465, 0.9962462462462462, 0.9956456456456456, 0.9957957957957958, 0.995945945945946, 0.9956456456456456]


0.9958708708708709

In [286]:
tree.export_graphviz(c2,out_file='tree2.dot', feature_names=data.columns.values, class_names=c2.classes_) 

with open("./tree2.dot") as f :
    dot_graph = f.read()
graphviz.Source(dot_graph)
graphviz.Source(dot_graph, format="png").render("tree2")

dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.888211 to fit


'tree2.png'

Now that we have a set of manually verified games, we can use this verification set to assess how well our classifier really performs.

In [329]:
dfVerification = pd.read_csv("verification_set.csv")

roleTranslate = {
    "midlaner":"MIDDLE_SOLO",
    "toplaner":"TOP_SOLO",
    "jungler":"JUNGLE_NONE",
    "carry":"BOTTOM_DUO_CARRY",
    "support":"BOTTOM_DUO_SUPPORT",
    "undefined":"undefined",
    "swaplaner":"swaplaner"
}



participantEntries = []
#Using matches from another dataset on the same patch: no game from the training set is used in the verification set.
for g in db.gameData_92_2.find({"gameId":{"$in":[int(i) for i in list(dfVerification["gameId"].values)]}}):
    
    #Get the players positions from the timeline
    playersPositions = getPositions(g['timeline'])
    
    playerMostFrequentLane = getMostFrequentLane(playersPositions)
    playerLaneFrequencies = getLaneFrequencies(playersPositions)
    
    playerStats = getStatsAt10(g["timeline"])
    
    for p in g['participants']:
            
        participant = {}
        
        participantId = p["participantId"]
        
        lanes = ["jungle","top","mid","bot"]
        
        #one hot encode positions
        for lane in lanes:
            participant['most-frequent-'+lane] = 0
        
        participant['most-frequent-'+ playerMostFrequentLane[participantId]] = 1
        
        #Lane frequency
        participant = {**participant, **{"lane-frequency-"+k:v for k,v in playerLaneFrequencies[participantId].items()} }
        
                
        #Init items lists 
        for role in role_composition:
            participant["has-item-overUsed-"+role] = 0
            participant["has-item-mostlyUsed-"+role] = 0
            participant["has-item-underUsed-"+role] = 0
        
        #Get the item ID for each of the 7 slots and check if they are in one of the three items lists
        for i in range(0,7):
            
            #Check if there is an item for the slot
            if p['stats']['item'+str(i)] > 0:
                for role in role_composition:

                    if str(p['stats']['item'+str(i)]) in overUsedItems[role]:
                        participant["has-item-overUsed-"+role] = 1

                    if str(p['stats']['item'+str(i)]) in mostlyUsedItems[role]:
                        participant["has-item-mostlyUsed-"+role] = 1

                    if str(p['stats']['item'+str(i)]) in underUsedItems[role]:
                        participant["has-item-underUsed-"+role] = 1
        
        
        #Summoner spells
        participant = {**participant, **spells}
        participant["spell-"+str(p["spell1Id"])] = 1
        participant["spell-"+str(p["spell2Id"])] = 1
        
        #Player stats
        participant = {**participant, **playerStats[str(participantId)]}
                        
        
        #Player position, what we are looking for
        participant['position'] = p['timeline']['lane']+"_"+p['timeline']['role']
        
        #game identification features
        participant["participantId"] = participantId
        participant["gameId"] = g["gameId"]
        
        participant["position_verified"] = roleTranslate[ dfVerification[dfVerification["gameId"] == g["gameId"]][str(p["participantId"])].iloc[0] ]
        
        #Ignoring undefined positions and swaplaners
        if participant["position_verified"] not in ["undefined","swaplaner"]:
            participantEntries.append(participant)

In [330]:
dfEntries = pd.DataFrame(participantEntries)

dfEntries["position_prediction"] = c2.predict(dfEntries.drop(["position","participantId","gameId","position_verified"], axis=1))

dfWorkingEntries = dfEntries[["gameId","participantId","position","position_prediction","position_verified"]]

In [331]:
dfWorkingEntries.shape[0]

5868

In [332]:
dfWorkingEntries[dfWorkingEntries["position_prediction"] != dfWorkingEntries["position_verified"]].shape[0]

12

In [333]:
dfWorkingEntries[dfWorkingEntries["position"] != dfWorkingEntries["position_verified"]].shape[0]

738

Out of our 5868 examples from the verification set, only 12 are wrong, compared to 738 for the original classification. Let's take a look at these mistakes.
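
To put these counts in perspective, they can be turned into error rates. The short sketch below only uses the three numbers printed by the cells above; the variable names are ours.

```python
total = 5868        # verified participants (cell 331)
model_errors = 12   # decision tree vs. verified labels (cell 332)
api_errors = 738    # API labels vs. verified labels (cell 333)

model_error_rate = model_errors / total
api_error_rate = api_errors / total

print(f"decision tree: {model_error_rate:.2%} wrong")
print(f"API labels:    {api_error_rate:.2%} wrong")
print(f"errors divided by about {api_errors / model_errors:.1f}")
```

This gives roughly 0.20% of wrong roles for the decision tree against about 12.58% for the labels coming from the API.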

In [334]:
dfWorkingEntries[dfWorkingEntries["position_prediction"] != dfWorkingEntries["position_verified"]]

| | gameId | participantId | position | position_prediction | position_verified |
|---|---|---|---|---|---|
| 203 | 3912854276 | 4 | BOTTOM_DUO | BOTTOM_DUO_CARRY | BOTTOM_DUO_SUPPORT |
| 513 | 3913087102 | 4 | TOP_SOLO | MIDDLE_SOLO | TOP_SOLO |
| 941 | 3913453901 | 2 | TOP_SOLO | BOTTOM_DUO_SUPPORT | TOP_SOLO |
| 2137 | 3914743235 | 8 | TOP_SOLO | MIDDLE_SOLO | TOP_SOLO |
| 2349 | 3914926377 | 2 | NONE_DUO_SUPPORT | MIDDLE_SOLO | JUNGLE_NONE |
| 2440 | 3914993472 | 3 | TOP_DUO_CARRY | MIDDLE_SOLO | TOP_SOLO |
| 2461 | 3915023859 | 4 | MIDDLE_DUO_SUPPORT | MIDDLE_SOLO | TOP_SOLO |
| 2935 | 3915465270 | 8 | BOTTOM_DUO | BOTTOM_DUO_SUPPORT | BOTTOM_DUO_CARRY |
| 3085 | 3915637683 | 8 | NONE_DUO_SUPPORT | JUNGLE_NONE | TOP_SOLO |
| 3404 | 3915938526 | 7 | MIDDLE_DUO | MIDDLE_SOLO | TOP_SOLO |


Out of these 12 cases, we have : 
 * A support farming instead of supporting (vision score at 8?) -> 3912854276
 * A late lane swap at the 8th minute -> 3913087102
 * AFK players -> 3913453901 3915465270 3915637683 3915944893
 * Toplaners classified as midlaners, due to a fairly consistent misclassification in the noisy match labels -> 3914743235 3914993472 3915023859 3915938526
     * Note : 3 of them are tied at 50/50 between mid and top
 * Someone trying to carry the game despite a feeder -> 3914926377

In [337]:
def getEntryStats(gameId, participantId):
    return dfEntries[((dfEntries["gameId"]==gameId) & (dfEntries["participantId"]==participantId))].iloc[0].to_dict()

def getEntryPredict(gameId, participantId):
    return c2.predict(dfEntries[((dfEntries["gameId"]==gameId) & (dfEntries["participantId"]==participantId))].drop(["position","participantId","gameId","position_prediction","position_verified"], axis=1))

def getEntryPredictProba(gameId, participantId):
    return c2.predict_proba(dfEntries[((dfEntries["gameId"]==gameId) & (dfEntries["participantId"]==participantId))].drop(["position","participantId","gameId","position_prediction","position_verified"], axis=1))

In [338]:
getEntryStats(3912854276,4)

{'gameId': 3912854276,
 'has-item-mostlyUsed-BOTTOM_DUO_CARRY': 0,
 'has-item-mostlyUsed-BOTTOM_DUO_SUPPORT': 0,
 'has-item-mostlyUsed-JUNGLE_NONE': 0,
 'has-item-mostlyUsed-MIDDLE_SOLO': 1,
 'has-item-mostlyUsed-TOP_SOLO': 0,
 'has-item-overUsed-BOTTOM_DUO_CARRY': 0,
 'has-item-overUsed-BOTTOM_DUO_SUPPORT': 1,
 'has-item-overUsed-JUNGLE_NONE': 0,
 'has-item-overUsed-MIDDLE_SOLO': 0,
 'has-item-overUsed-TOP_SOLO': 0,
 'has-item-underUsed-BOTTOM_DUO_CARRY': 1,
 'has-item-underUsed-BOTTOM_DUO_SUPPORT': 1,
 'has-item-underUsed-JUNGLE_NONE': 1,
 'has-item-underUsed-MIDDLE_SOLO': 1,
 'has-item-underUsed-TOP_SOLO': 1,
 'jungleMinionRatio': 0.0,
 'jungleMinionsKilled': 0,
 'lane-frequency-bot': 7,
 'lane-frequency-jungle': 2,
 'lane-frequency-mid': 1,
 'lane-frequency-top': 0,
 'minionsKilled': 33,
 'most-frequent-bot': 1,
 'most-frequent-jungle': 0,
 'most-frequent-mid': 0,
 'most-frequent-top': 0,
 'participantId': 4,
 'position': 'BOTTOM_DUO',
 'position_verified': 'BOTTOM_DUO_SUPPORT',
 '

In [339]:
getEntryPredict(3913379618,6)

array(['MIDDLE_SOLO'], dtype=object)

In [341]:
getEntryPredictProba(3914743235,8)

array([[0. , 0. , 0. , 0.5, 0.5]])

In [342]:
getEntryPredictProba(3914993472,3)

array([[0. , 0. , 0. , 0.5, 0.5]])

In [343]:
getEntryPredictProba(3915023859,4)

array([[0., 0., 0., 1., 0.]])

In [344]:
getEntryPredictProba(3915938526,7)

array([[0. , 0. , 0. , 0.5, 0.5]])

In [313]:
c2.classes_

array(['BOTTOM_DUO_CARRY', 'BOTTOM_DUO_SUPPORT', 'JUNGLE_NONE',
       'MIDDLE_SOLO', 'TOP_SOLO'], dtype=object)
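
The columns returned by `predict_proba` follow the order of `c2.classes_`, so `[0, 0, 0, 0.5, 0.5]` is an even split between `MIDDLE_SOLO` and `TOP_SOLO`. The sketch below assumes the predicted class is the arg-max of these probabilities, which is our understanding of how scikit-learn's tree classifiers work. Under that assumption, it shows why such a tie ends up as `MIDDLE_SOLO`.

```python
import numpy as np

# Class order reported by c2.classes_ above
classes = np.array(["BOTTOM_DUO_CARRY", "BOTTOM_DUO_SUPPORT", "JUNGLE_NONE",
                    "MIDDLE_SOLO", "TOP_SOLO"])

# Probabilities returned for the tied toplaners
tied_proba = np.array([[0.0, 0.0, 0.0, 0.5, 0.5]])

# np.argmax returns the first index holding the maximum, so a tie between
# columns 3 and 4 resolves to column 3
winner = classes[np.argmax(tied_proba, axis=1)]
print(winner)

# Pair each class with its probability to read the output more easily
for name, p in zip(classes, tied_proba[0]):
    if p > 0:
        print(name, p)
```

In other words, the tied cases are not confidently wrong: the leaf they fall into contains as many midlaners as toplaners, and the class order alone decides the outcome.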

In [274]:
data[((data["most-frequent-bot"] == 0) & (data["most-frequent-top"]==0) & (data["most-frequent-mid"]==1) & (data["minionsKilled"]>39)  & (data["jungleMinionsKilled"]<14.5)  & (data["spell-21"]==1)    )]
data[((data["jungleMinionsKilled"]<=14.5)  & (data["lane-frequency-bot"]<=2.5)  & (data["lane-frequency-top"]<=2.5) & (data["lane-frequency-top"]>0)  & (data["lane-frequency-mid"]<=6.5)  & (data["minionsKilled"]>40) & (data["minionsKilled"]<94)  & (data["spell-21"]==1)  & (data["spell-6"]==0) & (data["has-item-mostlyUsed-BOTTOM_DUO_CARRY"]==1)& (data["has-item-mostlyUsed-BOTTOM_DUO_SUPPORT"]==0) & (data["jungleMinionRatio"]<=0.055)    )]

| | has-item-mostlyUsed-BOTTOM_DUO_CARRY | has-item-mostlyUsed-BOTTOM_DUO_SUPPORT | has-item-mostlyUsed-JUNGLE_NONE | has-item-mostlyUsed-MIDDLE_SOLO | has-item-mostlyUsed-TOP_SOLO | has-item-overUsed-BOTTOM_DUO_CARRY | has-item-overUsed-BOTTOM_DUO_SUPPORT | has-item-overUsed-JUNGLE_NONE | has-item-overUsed-MIDDLE_SOLO | has-item-overUsed-TOP_SOLO | ... | most-frequent-top | spell-1 | spell-11 | spell-12 | spell-14 | spell-21 | spell-3 | spell-4 | spell-6 | spell-7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5180 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 21406 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 51929 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 57738 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |


In [308]:
df[((df["lane-frequency-top"]==8) & (df["jungleMinionRatio"]<=0.015) & (data["minionsKilled"]==63) & (data["has-item-mostlyUsed-BOTTOM_DUO_CARRY"]==0) & (data["has-item-overUsed-BOTTOM_DUO_CARRY"]==0) & (data["has-item-underUsed-JUNGLE_NONE"]==0) )]["participantId"]

3853     9 
9835     1 
10807    3 
20513    9 
26166    7 
27624    5 
30264    10
39996    2 
40238    4 
40477    8 
40659    10
41061    7 
45559    10
56381    7 
59094    5 
62938    9 
Name: participantId, dtype: int64

## Gradient Boosting and Random Forests

In [363]:

from sklearn.ensemble import GradientBoostingClassifier

target = df["position"]
data = df.drop(["position","participantId","gameId"], axis=1)

shuffle_split = StratifiedShuffleSplit(train_size=0.9, n_splits=10)

c = GradientBoostingClassifier(min_samples_split=5)


accs = []

for train_index, test_index in shuffle_split.split(data,target):
    c.fit(data.iloc[train_index], target.iloc[train_index])
    accs.append(accuracy_score(target.iloc[test_index], c.predict(data.iloc[test_index])))

print(accs)
sum(accs) / len(accs)



[0.9962462462462462, 0.996996996996997, 0.998048048048048, 0.998048048048048, 0.9972972972972973, 0.9977477477477478, 0.996996996996997, 0.9966966966966967, 0.998048048048048, 0.9977477477477478]


0.9973873873873874

In [364]:
dfEntries = pd.DataFrame(participantEntries)

dfEntries["position_prediction"] = c.predict(dfEntries.drop(["position","participantId","gameId","position_verified"], axis=1))

dfWorkingEntries = dfEntries[["gameId","participantId","position","position_prediction","position_verified"]]

In [365]:
dfWorkingEntries[dfWorkingEntries["position_prediction"] != dfWorkingEntries["position_verified"]].shape[0]

3

In [366]:
dfWorkingEntries[dfWorkingEntries["position_prediction"] != dfWorkingEntries["position_verified"]]

| | gameId | participantId | position | position_prediction | position_verified |
|---|---|---|---|---|---|
| 2349 | 3914926377 | 2 | NONE_DUO_SUPPORT | MIDDLE_SOLO | JUNGLE_NONE |
| 3414 | 3915944893 | 7 | JUNGLE_NONE | TOP_SOLO | BOTTOM_DUO_CARRY |
| 5066 | 3917774255 | 9 | JUNGLE_NONE | TOP_SOLO | JUNGLE_NONE |


In [369]:
dfEntries[((dfEntries["gameId"]==3917774255) & (dfEntries["participantId"]==9))].drop(["position","participantId","gameId","position_verified"], axis=1)

| | has-item-mostlyUsed-BOTTOM_DUO_CARRY | has-item-mostlyUsed-BOTTOM_DUO_SUPPORT | has-item-mostlyUsed-JUNGLE_NONE | has-item-mostlyUsed-MIDDLE_SOLO | has-item-mostlyUsed-TOP_SOLO | has-item-overUsed-BOTTOM_DUO_CARRY | has-item-overUsed-BOTTOM_DUO_SUPPORT | has-item-overUsed-JUNGLE_NONE | has-item-overUsed-MIDDLE_SOLO | has-item-overUsed-TOP_SOLO | ... | spell-1 | spell-11 | spell-12 | spell-14 | spell-21 | spell-3 | spell-4 | spell-6 | spell-7 | position_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5066 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | TOP_SOLO |


In [371]:
c.predict_proba(dfEntries[((dfEntries["gameId"]==3917774255) & (dfEntries["participantId"]==9))].drop(["position","participantId","gameId","position_verified","position_prediction"], axis=1))

array([[5.43948616e-002, 4.66187486e-002, 3.79666290e-297,
        3.93422642e-001, 5.05563747e-001]])

In [372]:
c.classes_

array(['BOTTOM_DUO_CARRY', 'BOTTOM_DUO_SUPPORT', 'JUNGLE_NONE',
       'MIDDLE_SOLO', 'TOP_SOLO'], dtype=object)

In [376]:
3.79666290e-297 > 5.05563747e-001

False

🠑 Strange case where the jungler prediction probability is around e-297, which is effectively zero. I suppose it's a numerical precision effect of some sort.
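
A value like e-297 does not have to be an error. Assuming the multiclass gradient boosting model turns per-class raw scores into probabilities with a softmax (our understanding of scikit-learn's implementation, not something checked in this notebook), a class whose raw score sits several hundred units below the best one gets a tiny but still representable probability. The raw scores below are made up for the illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores: the third class is about 683 units below the best one
raw_scores = [-2.2, -2.4, -683.0, -0.25, 0.0]
proba = softmax(raw_scores)

print(proba)
print(proba[2] > 0.0)     # tiny, but it has not underflowed to zero
print(proba[2] < 1e-290)  # far below anything that matters in practice
```

Read this way, the model is simply extremely confident that this player is not a jungler, which is consistent with the wrong `TOP_SOLO` prediction.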

In [385]:
dfEntries = pd.DataFrame(participantEntries)

In [387]:
dfEntries.to_csv("dataset_verification.csv", index=False)

In [398]:
from sklearn.ensemble import RandomForestClassifier

target = df["position"]
data = df.drop(["position","participantId","gameId"], axis=1)

shuffle_split = StratifiedShuffleSplit(train_size=0.9, n_splits=10)

c = RandomForestClassifier(min_samples_split=5, n_estimators=100)


accs = []

for train_index, test_index in shuffle_split.split(data,target):
    c.fit(data.iloc[train_index], target.iloc[train_index])
    accs.append(accuracy_score(target.iloc[test_index], c.predict(data.iloc[test_index])))

print(accs)
sum(accs) / len(accs)



[0.9975975975975976, 0.996996996996997, 0.996996996996997, 0.9983483483483484, 0.998048048048048, 0.9978978978978978, 0.9974474474474474, 0.9972972972972973, 0.9978978978978978, 0.9975975975975976]


0.9976126126126126

In [399]:
dfEntries = pd.DataFrame(participantEntries)

dfEntries["position_prediction"] = c.predict(dfEntries.drop(["position","participantId","gameId","position_verified"], axis=1))

dfWorkingEntries = dfEntries[["gameId","participantId","position","position_prediction","position_verified"]]

In [400]:
dfWorkingEntries[dfWorkingEntries["position_prediction"] != dfWorkingEntries["position_verified"]]

| | gameId | participantId | position | position_prediction | position_verified |
|---|---|---|---|---|---|
| 2461 | 3915023859 | 4 | MIDDLE_DUO_SUPPORT | MIDDLE_SOLO | TOP_SOLO |
| 3414 | 3915944893 | 7 | JUNGLE_NONE | MIDDLE_SOLO | BOTTOM_DUO_CARRY |


Two errors: the jungler trying to fix the midlane and the weird Neeko toplane.

## Exporting the model

In [403]:
import joblib  #sklearn.externals.joblib has been removed from recent scikit-learn versions
joblib.dump(c, "roleml/role_identification_model.sav")

['roleml/role_identification_model.sav']
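
To reuse the exported model, it can be loaded back with `joblib.load`. The sketch below is a round trip on a toy model, not the real one: the two features, the four samples and the temporary path are placeholders chosen for the example.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real feature matrix: [minionsKilled, jungleMinionsKilled]
X = [[70, 0], [65, 2], [3, 50], [5, 45]]
y = ["TOP_SOLO", "TOP_SOLO", "JUNGLE_NONE", "JUNGLE_NONE"]

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Dump the model to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "role_identification_model.sav")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give the same predictions as the original
same = list(restored.predict(X)) == list(model.predict(X))
print(same)
```

With the real model, the features have to be given in the same column order as during training. A file saved with one scikit-learn version may also fail to load cleanly with another, so it is worth noting the version next to the exported file.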