<img src="https://d33jl3tgfli0fm.cloudfront.net/helix/images/games/league-of-legends/background.jpg">
<center>
<h1>
Learning League of Legends</h1>
<h2>
CMSC320 Final Project</h2>
<h4>
Group members: Maria Furman, Jacqueline Chen
</h4>
</center>

## Introduction

League of Legends is one of the most popular online games in the world, with the highest monthly players than any other video game in 2014 (when they last released statistics). It's a free to play MOBA (multiplayer online battle arena) created by Riot where two teams of 5 battle it out to destroy each other, and eventually the other team's Nexus. While it's just a game to many, League of Legends is a full-blown career for professional players and streamers, and the game has carved out a huge esports scene with millions of viewers tuning in just to watch the pros play. 

Given its popularity, it is only natural that League of Legends will continue to draw in a constant stream of new players. However, the game itself is quite complex, and the learning curve for new players can be quite steep. Despite this, new players should not be discouraged from trying the game, and one goal of this tutorial is to shed some insight into gameplay strategies that will potentially help newer players while. This tutorial will also walk you step-by-step through the data pipeline of using League of Legends champion data to determine gameplay trends and strategies. Furthermore, we hope that the steps outlined in this tutorial will encourage other people to analyze game data in similar ways, as analysis of such statistics can lead to some rather interesting conclusions about gameplay. This tutorial is written in Python but can be extended to other languages such as R.

Import the required python libraries:

In [15]:
from bs4 import BeautifulSoup 
from IPython.display import display
import requests 
import pandas as pd
import html5lib
import numpy as np
import matplotlib
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import warnings
import json
from copy import deepcopy
import statsmodels
from statsmodels import sandbox
from statsmodels.sandbox import stats
from statsmodels.sandbox.stats import runs
import sklearn
from sklearn import *
from sklearn.model_selection import KFold
import mpld3
from mpld3 import plugins
from mpld3.utils import get_id
warnings.filterwarnings('ignore')

## Scraping the Data

Use the requests package to get the HTML from champion.gg. This will allow us to obtain data from the current game patch (7.9) from Riot's API (it is updated twice daily). We are using data from leagues that are Platinum or higher, which are more skilled players. The data from these players will be less influenced by cheap tricks that are easily worked around by more experienced players.

Use BeautifulSoup to extract the contents.

In [16]:
r = requests.get("http://champion.gg/statistics/?league=platplus")
root = BeautifulSoup(r.content,  "html")


Use findAll() to get all the contents inside the script tags. You may have to do a little bit of searching to find the script tag that includes the relevant champion data. We'll be using the matchup data, though other data is available.

In [17]:
scripts =  root.findAll("script")
matchup_data = scripts[21]
matchup_data = str(matchup_data)

Extract the data from the script tags. Formulate the data into proper JSON format, and use json.loads() to parse it. The particular JSON data we are using includes a nested object that we want to include in a single row of the dataframe, so we must first flatten the JSON so that the dataframe creation step is easier.

In [18]:
value = '{%s}' % (matchup_data.split('{', 1)[1].rsplit('}', 1)[0],)
value= '[' + value + ']'
value = value.replace('\'', "");
value = json.loads(value)
flat_value = deepcopy(value)
for i in range(len(value)):
    for k,v in value[i].items():
        if k == "general":
            del flat_value[i][k]
            flat_value[i] = {**flat_value[i], **v}

Create the dataframe. Make sure that all labels and values from the original JSON data are present.

In [19]:
df = pd.DataFrame(data = flat_value)
display(df.head())

Unnamed: 0,assists,banRate,deaths,experience,goldEarned,key,kills,largestKillingSpree,minionsKilled,neutralMinionsKilledEnemyJungle,neutralMinionsKilledTeamJungle,overallPosition,overallPositionChange,playPercent,role,title,totalDamageDealtToChampions,totalDamageTaken,totalHeal,winPercent
0,8.303429,0.033988,5.581528,10.413124,11093,LeeSin,6.363754,9,35.69965,29.622818,47.504617,2,-1.0,0.409331,Jungle,Lee Sin,13370,28760,7098,0.504096
1,7.062964,0.014404,5.124371,7.83174,12404,Lucian,6.732241,9,195.223267,4.367864,5.742724,2,-1.0,0.36878,ADC,Lucian,21054,21589,3815,0.508247
2,7.301804,0.008465,5.427994,7.201414,12482,Caitlyn,6.073345,9,200.910707,5.409103,7.010473,3,3.0,0.330641,ADC,Caitlyn,22116,18175,3549,0.517319
3,14.1526,0.002187,5.038351,8.667023,8817,Thresh,2.074492,4,32.944024,0.221874,0.085267,2,3.0,0.305594,Support,Thresh,7705,19526,3568,0.528844
4,7.318412,0.019519,5.020842,6.727117,11992,Ahri,6.734794,9,185.523593,1.726803,3.363092,1,0.0,0.20915,Middle,Ahri,21858,18054,3773,0.534117


The goal of the code below is to a create a plot of win rates vs. bane rates for all the champions in League of Legends. Champions that are included more than once in the dataframe (due to playing different roles) are only plotted once with their averaged win and ban rate. First, generate a random color for each champion. Then, iterate through all the champion names, and keep track of their respective win and ban rates using a data structure of your choice (we're using a simple array for both). Plot this data using matplotlib, and make sure to use mpld3 in order to create a tooltip with the champion name that appears when hovering over a specific point in the plot. This plot displays a lot of data, so the tooltip is essential for increased understanding.

In [20]:
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'), figsize = (14,8))

win_percents = []
ban_rates = []

champion_names = df["key"]
champion_names = champion_names.tolist();
colors = cm.rainbow(np.linspace(0, 1, len(champion_names)))

for champ in champion_names:
    champ_df = df['key'] == champ
    champ_df = df[champ_df]
    win_percent = champ_df['winPercent']
    ban_rate = champ_df['banRate']
    win_percent = win_percent.tolist();
    ban_rate = ban_rate.tolist();
    if (len(win_percent) > 1):
        win_percent = np.mean(win_percent)
        ban_rate = np.mean(ban_rate)
    else:
        win_percent = win_percent[0];
        ban_rate = ban_rate[0];
        
    win_percents.append(win_percent)
    ban_rates.append(ban_rate)

scatter = ax.scatter(win_percents,
                     ban_rates,
                     c=colors,
                     alpha=0.3,
                     cmap=plt.cm.jet)

ax.grid(color='white', linestyle='solid')

ax.set_title("Ban Rate vs. Winning Percentage", size=20)
ax.set_ylabel('Ban rate')
ax.set_xlabel('Winning percentage')

tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=champion_names)
mpld3.plugins.connect(fig, tooltip)

mpld3.enable_notebook()
mpld3.display()

In the game, you can ban 3 characters per team. Certain characters are considered "OP" (over powered) and are banned quite frequently, but is ban rate actually positively correlated with their win rate? The graph reveals that the majority of characters are not considered threatening enough to ban at all, and the win rates for most champions cluster around 50%, as they should. However, there are many outliers with high ban rates and a wide range of win percentages. Some examples include Lulu and Fizz, who both have average winning percentages (around 50%), but appear to be banned more often than other champions. Another interesting outlier is Ivern, who appears to win more often than other champions (57% win rate).

<img src="http://riot-web-static.s3.amazonaws.com/images/news/November_2012/2012_11_13_Dragon_Trainer_Lulu_Bioforge_Darius/Dragon_Trainer_Lulu_splash.jpg"  width="877" height="500">
<center> *Dragon Tamer Lulu* </center>

Every player has a score for the number of kills, deaths, and assists (with kills) they get. We will now plot kills vs. deaths to see if there are any trends among roles. There are 5 roles in League of Legends: Top, Mid, Jungle, Support, and ADC. Top champions tend to be tanky and include high health but lower damage. Mid champions are high damage. Jungle champions don't have a specific lane and can range from high damage to high health. ADCs (Attack Damage Carry) have high sustained damage and go in the bottom lane with Supports. Supports have high utility and low damage.

Iterate through the dataframe and plot kills and deaths with a different color based on roles. The code below is very similar to the code used in order to create the ban rate vs. winning percentage plot above, but is slightly different in the sense that a champion can be included in the plot multiple times (as different roles for the same champion result in relatively large differences in number of kills and deaths). The plot below includes an interactive legend that allows you to select and de-select data points based on champion role if you click the color in the legend.

In [21]:
fig2, ax2 = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'), figsize = (12,8))

labels = {}
kills = {}
deaths = {}

for index, row in df.iterrows():
    if row['role'] not in labels:
        labels[row['role']] = []
        kills[row['role']] = []
        deaths[row['role']] = []
        
    labels[row['role']].append(row["key"] + ", " + row["role"])
    kills[row['role']].append(row['kills'])
    deaths[row['role']].append(row['deaths'])

ax2.grid(color='white', linestyle='solid')

plot1 = ax2.scatter(kills['Support'], deaths['Support'],c = '#66e8c5', alpha=0.8)
plot2 = ax2.scatter(kills['ADC'], deaths['ADC'],c = '#f4e258', alpha=0.8)
plot3 = ax2.scatter(kills['Jungle'], deaths['Jungle'],c = '#9ae055', alpha=0.8)
plot4 = ax2.scatter(kills['Middle'], deaths['Middle'],c = '#c673c6', alpha=0.8)
plot5 = ax2.scatter(kills['Top'], deaths['Top'],c = '#ff6d91', alpha=0.8)


ax2.set_title("Kills vs. Deaths", size=20)
ax2.set_ylabel('Deaths')
ax2.set_xlabel('Kills')

tooltip = mpld3.plugins.PointLabelTooltip(plot1, labels=labels['Support'])
tooltip2 = mpld3.plugins.PointLabelTooltip(plot2, labels=labels['ADC'])
tooltip3 = mpld3.plugins.PointLabelTooltip(plot3, labels=labels['Jungle'])
tooltip4 = mpld3.plugins.PointLabelTooltip(plot4, labels=labels['Middle'])
tooltip5 = mpld3.plugins.PointLabelTooltip(plot5, labels=labels['Top'])

interactive_legends = mpld3.plugins.InteractiveLegendPlugin([plot1,plot2,plot3,plot4,plot5],["S","A","J","M","T"], ax = 0)
mpld3.plugins.connect(fig2, tooltip, interactive_legends)
mpld3.plugins.connect(fig2, tooltip2, interactive_legends)
mpld3.plugins.connect(fig2, tooltip3, interactive_legends)
mpld3.plugins.connect(fig2, tooltip4, interactive_legends)
mpld3.plugins.connect(fig2, tooltip5, interactive_legends)


mpld3.enable_notebook()
mpld3.display()

From the graph, the Support role is easily differentiated from other roles by its low number of kills, which makes sense because they're primarily supposed to get assists. Diving deeper, you can see that some champions that were designed for Mid (high damage) but get played as Supports such as Brand or Annie often have much higher kills and deaths. Additionally, the champions with the highest kills overall like Akali and Katarina have abilities where their cooldowns get reduced dramatically if they get a kill, allowing them to go on larger killstreaks. There's a lot of interesting patterns that can be discovered by creating visual representations of data that can inspire further research. Looking back at the roles, Top appears to have a relatively low amount of kills, which also aligns with their stereotypical role of being tanky (high health and defense but lower damage). Mid, ADC, and Jungle start to get a bit intertwined. It appears that ADC is more clustered and Mid and Jungle have more range of kills and deaths. This raises the following question: can we determine a champion's role based on other available information about them? We explore this problem next with machine learning. 

<img src="http://www.leagueoflegendsskins.com/images/champions/splash/Katarina_5.jpg">
<center> *High Commander Katarina* </center>

## Machine Learning

The goal of this machine learning is to infer the role of each champion based on features such as assists, kills, ban rate, etc. Ideally, if we were to obtain the average playstyle data for a player, we would be able to use the learned classifiers in order to predict which role they fit best into. Obviously, this is less useful for top tier players, but beginners may benefit from having a sense of what champions they are better fit to play. The roster of characters in League of Legends is quite large, and providing beginners with a general idea of what role they would be best at can help guide them in determining which characters they want to purchase (as most characters are not available to a new player at the beginning of the game).

Since we have a limited amount of data available to us (the data we scraped provides us with averages for each champion, rather than all possible player data), we are going to use K-Fold cross validation in order to figure out the best hyperparameters for our models. We are also going to use this type of cross validation in order to get the final error estimates for our fully trained models.

### Splitting the training data for k-fold cross validation

In [22]:
X = df.as_matrix(['assists', 'banRate', 'deaths', 'experience', 'goldEarned', 'kills', 'largestKillingSpree', 'minionsKilled', 'neutralMinionsKilledEnemyJungle', 'neutralMinionsKilledTeamJungle', 'playPercent', 'totalDamageDealtToChampions', 'totalDamageTaken', 'totalHeal', 'winPercent'])
y = df.as_matrix(["role"])
kf = KFold(n_splits=10, shuffle=True)
kf.get_n_splits(X)
print(kf)  

KFold(n_splits=10, random_state=None, shuffle=True)


### K-Nearest Neighbors

First, we'll use K-Fold cross validation in order to figure out the ideal numbers of nearest neighbors to use for K-Nearest Neighbors (KNN). K-nearest neighbors uses each feature of the data as a dimension and then calculates the closest K neighbors to a data point. The predicted label for a given set of features is chosen based on the label of the most neighbors. It is also possible to weight the vote of each neighbor based on how far away it is from the point being classified, but we decided to use uniform weights for each neighbor because it resulted in a better accuracy.

In [23]:
k = 1
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = neighbors.KNeighborsClassifier(k, weights='uniform')
    clf.fit(X_train, y_train)
    k = k + 1
    Z = clf.predict(X_test)
    y_test = np.ravel(y_test)
    print(sklearn.metrics.zero_one_loss(y_test,Z))

0.45
0.421052631579
0.368421052632
0.263157894737
0.157894736842
0.368421052632
0.263157894737
0.526315789474
0.263157894737
0.315789473684


Based on the testing done above, it appears as though 9 neighbors tends to produce the least amount of error for our data (though it does vary depending on how exactly the splits are generated). Now that you know this information, you can train your classifier (using this hyperparameter) in order to obtain a more accurate loss value. Since we have a limited amount of data, we are using K-Fold cross validation in order to determine our total test error.

In [24]:
k = 9
total_loss = 0
knn_clf = neighbors.KNeighborsClassifier(k, weights='uniform')
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    knn_clf = neighbors.KNeighborsClassifier(k, weights='uniform')
    knn_clf.fit(X_train, y_train)
    Z = clf.predict(X_test)
    y_test = np.ravel(y_test)
    total_loss = total_loss + sklearn.metrics.zero_one_loss(y_test, Z);
print(total_loss/10)    

0.266842105263


### Decision Tree Classifier

In the code below, we are using K-Fold cross validation in order to figure out the best depth (hyperparameter) for our decision tree classifier. At each level of the decision tree, a feature is chosen to split on. Training data is then split into different branches based on the value that it has for the feature that was chosen on that level. The leaves of the tree give the classifications for each example.

In [25]:
k = 1
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = sklearn.tree.DecisionTreeClassifier(max_depth=k)
    clf.fit(X_train, y_train)
    k = k + 1
    Z = clf.predict(X_test)
    y_test = np.ravel(y_test)
    print(sklearn.metrics.zero_one_loss(y_test,Z))

0.6
0.473684210526
0.210526315789
0.0526315789474
0.263157894737
0.157894736842
0.210526315789
0.315789473684
0.157894736842
0.263157894737


Based on this cross validation it appears as though both a depth of 9 and a depth of 10 appear to result in the smallest loss values (just like KNN the best depth tends to vary based on the particular splits that were generated for cross validation). Since shorter trees tend to result in less overfitting of the training data, we decided to go with a depth of 9 for our decision tree classifier.

In [26]:
k = 9
total_loss = 0
dt_clf = sklearn.tree.DecisionTreeClassifier(max_depth=k)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    dt_clf = sklearn.tree.DecisionTreeClassifier(max_depth=k)
    dt_clf.fit(X_train, y_train)
    Z = clf.predict(X_test)
    y_test = np.ravel(y_test)
    total_loss = total_loss + sklearn.metrics.zero_one_loss(y_test, Z);
print(total_loss/10)    

0.0260526315789


The decision tree classifier had a much lower loss than KNN. The reason why this is occuring is because decision tree classifiers are better than KNNs at ignoring noisy features that don't contribute to the classification. KNNs take all the features into account, while lower depth decision trees can place emphasis on more important features (and ignore the less important ones). For example, the win_percent, total_heal, and play_percent columns are probably not indicative of a champions role, and the decision tree classifier is able to ignore them in order to achieve a more accurate classification.

In [27]:
rss1 = 0
y_hats = []
ys = []
for index, row in df.iterrows():
    y_hat = knn_clf.predict(row.as_matrix(['assists', 'banRate', 'deaths', 'experience', 'goldEarned', 'kills', 'largestKillingSpree', 'minionsKilled', 'neutralMinionsKilledEnemyJungle', 'neutralMinionsKilledTeamJungle', 'playPercent', 'totalDamageDealtToChampions', 'totalDamageTaken', 'totalHeal', 'winPercent']))
    y = row["role"]
    y_hats.append(y_hat)
    ys.append(ys)
    rss1 += sklearn.metrics.zero_one_loss([y], y_hat)
print("Residual sum of squares for first model:")
print(rss1)

rss2 = 0
y_hats2 = []
for index, row in df.iterrows():
    y_hat = dt_clf.predict(row.as_matrix(['assists', 'banRate', 'deaths', 'experience', 'goldEarned', 'kills', 'largestKillingSpree', 'minionsKilled', 'neutralMinionsKilledEnemyJungle', 'neutralMinionsKilledTeamJungle', 'playPercent', 'totalDamageDealtToChampions', 'totalDamageTaken', 'totalHeal', 'winPercent']))
    y = row["role"]
    rss2 += sklearn.metrics.zero_one_loss([y], y_hat)
    y_hats2.append(y_hat)
print("Residual sum of squares for second model:")
print (rss2)


Residual sum of squares for first model:
53.0
Residual sum of squares for second model:
6.0


In [28]:
statsmodels.sandbox.stats.runs.mcnemar(x=y_hats, y=y_hats2)

(array([11]), array([  3.58879161e-05]))

The difference between the decision tree classifier and the KNN classifier is statistically significant, since the p-value we obtained is less than 0.05. From the residual sum of squares, we can also see that the second model (the decision tree classifier) outperforms the KNN classifier (and that it outperforms it in a significant way).

## Conclusion

This project can be expanded further to analyze data from other leagues and other patches to see how the game differs between players of different skill level and across time. This would be an interesting path for future exploration because it would allow us to see how statistical trends in gameplay change based on other factors.

This is just one example of how to use game data in order to explore interesting trends in gameplay. This data science pipeline can be applied to many other games, as well as other sources of data. For example, you could collect Hearthstone data in order to determine which matchups and decks are more favorable and result in the most frequent number of wins.

<img src="https://i.ytimg.com/vi/WDaqFf_ulJw/maxresdefault.jpg">