# League of Legends Diamond Classification Problem
Hi all, I am new to the data science community on all platforms and would appreciate any help and guidance you can provide in my projects.

League of Legends. Possibly the biggest online game of all time and a life choice for some people, but is it possible to predict the outcome of a game based on the statistics in the first 10 minutes?

According to <a href='https://leagueoflegends.com'>leagueoflegends.com</a>:
> League of Legends is a team-based strategy game where two teams of five powerful champions face off to destroy the other’s base. Choose from over 140 champions to make epic plays, secure kills, and take down towers as you battle your way to victory.

## [Game Basics](#1)
I imagine the fact that you are reading a post on League of Legends suggests you are more than familiar with the rules, and have a much more in-depth understanding of strategies and influences than I do. However, I will briefly explain some of the basics. Feel free to skip this part.


- Players accumulate gold and experience from a mixture of killing minions, monsters, other players and towers. 
       More gold -> better items -> easier killing.
       More experience -> higher levels -> easier killing.
- Wards provide map vision so we can see people coming to kill us.
       More wards -> better vision -> less deathing.
- The main objective of the game is to destroy a number of towers, leading to the destruction of the opponent's base.
       Kill towers -> kill base -> win game.

In [None]:
# Import data analysis libraries
import pandas as pd
import numpy as np

# Import libraries for visualisation
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('mode.chained_assignment', None)

print('Libraries Imported!')

## [Data Exploration](#2)
The aim of this project is to predict the class of **blueWins**, which records the outcome of the game. We can do this by visualising the features in the following dataframe and subsequently using machine learning techniques to find the best predictions.  

Begin by viewing the different features we have available in the dataset provided.

In [None]:
# Read into dataframe
df = pd.read_csv('../input/league-of-legends-diamond-ranked-games-10-min/high_diamond_ranked_10min.csv')
df.head()

In [None]:
# Get column names
cols = df.columns
print(cols)

#### Target Variable "blueWins"
Let's separate the target variable from the dataframe into a new variable "y". "gameId" can also be dropped, as it is randomised and carries no information about the outcome.

In [None]:
# Separate target variable from dataframe
y = df.blueWins

# Drop target and unnecessary features
drop_cols = ['gameId','blueWins']
x = df.drop(drop_cols, axis=1)

x.head()

Good news! Our dataset is split almost 50/50 between the two outcomes of the target variable, which means there is no class imbalance to correct for.

In [None]:
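# Visualise blueWins using countplot
ax = sns.countplot(x=y, palette='RdBu')

# Look counts up by label (0 = red win, 1 = blue win) instead of relying on sort order
counts = y.value_counts()
R, B = counts.loc[0], counts.loc[1]

print('Red Wins: {} ({}%), Blue Wins: {} ({}%)'.format(R,round(100*R/(R+B),3),B,round(100*B/(R+B),3)))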
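As a rough cross-check of the class balance, `value_counts(normalize=True)` returns the proportions directly. The sketch below uses a made-up toy series (`y_demo` is a hypothetical stand-in for the notebook's `y`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `y` (1 = blue win, 0 = red win)
y_demo = pd.Series([1, 0, 0, 1, 1, 0, 0, 1, 0, 1], name='blueWins')

# Proportions indexed by class label, so the lookup does not depend on sort order
balance = y_demo.value_counts(normalize=True).sort_index()
red_share, blue_share = balance.loc[0], balance.loc[1]

print('Red wins: {:.1%}, Blue wins: {:.1%}'.format(red_share, blue_share))
```

Swapping `y_demo` for `y` should give the same percentages as the cell above.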

### Analysis of basic statistics
The numeric features in our dataset have very different ranges. This could affect the effectiveness of some machine learning models, because features with larger values are implicitly given more weight, so we standardise the features before modelling.

**Discrete Data**
- Blue/red wards placed/destroyed have a very wide range and a large standard deviation.
- Blue/red elite monsters equal the sum of dragons and heralds; dragons are the more common kill.
- Blue/red total gold and minions killed have a low standard deviation (<10% of the mean).
- Blue gold diff / experience diff is the exact negative of red gold diff / experience diff.

**Binary Data**
- Blue/red first blood is a yes/no value with an approximately 50/50 split.

In [None]:
x.describe()

In [None]:
# Drop unnecessary features (same as blueFirstBlood, blueDeaths etc.)
drop_cols = ['redFirstBlood','redKills','redDeaths'
             ,'redGoldDiff','redExperienceDiff', 'blueCSPerMin',
            'blueGoldPerMin','redCSPerMin','redGoldPerMin']
x.drop(drop_cols, axis=1, inplace=True)
x.head()
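The redundancy that motivates this drop can be verified rather than assumed. The sketch below uses a toy frame with invented values (only the column names come from the dataset) to show the two checks: an exact negative and an exact copy.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the dataset
demo = pd.DataFrame({
    'blueGoldDiff': [643, -2908, 1172],
    'redGoldDiff': [-643, 2908, -1172],
    'blueKills': [9, 5, 7],
    'redDeaths': [9, 5, 7],
})

# A red column is redundant if it is the exact negative or an exact copy of a blue one
gold_mirrored = (demo['blueGoldDiff'] == -demo['redGoldDiff']).all()
kills_duplicated = demo['blueKills'].equals(demo['redDeaths'])

print(gold_mirrored, kills_duplicated)
```

Running the same comparisons on `df` before dropping would confirm that no information is lost.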

### Violin and Box Plots
Violin plots let us visualise the distribution of each feature and separate the data points by the final outcome of the game.

**Observations from plots**
- Blue kills appears to have a large positive impact on winning the game.
- Similarly, blue deaths has a large negative impact on winning the game (i.e. positive on losing).
- Blue assists show a similar plot to blue kills; a kill is needed to earn an assist, so assists scale with kills.
- First blood is positively correlated with outcome but also mirrors blue kills.
- Gold and experience differences have major influence.
- Minions and Jungle minions do not have much impact.

In [None]:
# Standardise the feature matrix once (z-scores)
data_std = (x - x.mean()) / x.std()

fig, ax = plt.subplots(1,2,figsize=(15,5))

# Violin plot of the first nine features
data = pd.concat([y, data_std.iloc[:, 0:9]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data, split=True,
               inner='quart', ax=ax[0], palette='Blues')

# Violin plot of the next nine features
data = pd.concat([y, data_std.iloc[:, 9:18]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data, split=True,
               inner='quart', ax=ax[1], palette='Blues')

# Rotate the x tick labels on both axes
fig.autofmt_xdate(rotation=45)

plt.show()
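The `pd.melt` step in the cell above is what turns the wide feature matrix into the long format that the `x='Features'`, `y='Values'`, `hue='blueWins'` arguments expect. A minimal sketch on invented data (the values are hypothetical) makes the reshaping visible:

```python
import pandas as pd

# Hypothetical wide-format data: one row per game, one column per feature
wide = pd.DataFrame({
    'blueWins': [1, 0],
    'blueKills': [9, 5],
    'blueDeaths': [6, 5],
})

# Long format: one row per (game, feature) pair, keeping blueWins as the identifier
long_df = pd.melt(wide, id_vars='blueWins', var_name='Features', value_name='Values')
print(long_df)
```

Each game now contributes one row per feature, which is why the plots can place every feature on a shared axis.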

In [None]:
plt.figure(figsize=(18,14))
sns.heatmap(round(x.corr(),2), cmap='Blues', annot=True)
plt.show()

In [None]:
# Drop unnecessary features
drop_cols = ['redAvgLevel','blueAvgLevel']
x.drop(drop_cols, axis=1, inplace=True)
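Reading strongly correlated pairs off a large heatmap by eye is error-prone, so a short helper that lists them can be useful. The sketch below is an illustration only: `correlated_pairs` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, correlation) for pairs with |r| >= threshold."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    strong = stacked[stacked.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]

# Toy data: average level is built as a linear function of total experience
demo = pd.DataFrame({
    'blueTotalExperience': [17000, 18000, 16500, 19000, 17500],
    'blueAvgLevel': [6.6, 7.0, 6.4, 7.4, 6.8],
    'blueWardsPlaced': [28, 12, 15, 43, 75],
})
pairs = correlated_pairs(demo)
print(pairs)
```

Calling `correlated_pairs(x)` on the notebook's feature matrix would list candidates for removal alongside the heatmap.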

### Ward Data
We can see that the separation of data points is close to random in the plot of ward data below. From my knowledge of the game, I would suggest that ward placing and destruction in Diamond is quite systematic, so there isn't much variance in the data, as the violin plots above suggest.  

With this in mind, I will not use ward data in my learning model.

In [None]:
sns.set(style='whitegrid', palette='muted')

x['wardsPlacedDiff'] = x['blueWardsPlaced'] - x['redWardsPlaced']
x['wardsDestroyedDiff'] = x['blueWardsDestroyed'] - x['redWardsDestroyed']

data = x[['blueWardsPlaced','blueWardsDestroyed','wardsPlacedDiff','wardsDestroyedDiff']].sample(1000)
data_std = (data - data.mean()) / data.std()
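# Inner join keeps only the sampled rows; the default outer join adds a NaN row for every unsampled game
data = pd.concat([y, data_std], axis=1, join='inner')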
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

In [None]:
# Drop unnecessary features
drop_cols = ['blueWardsPlaced','blueWardsDestroyed','wardsPlacedDiff',
            'wardsDestroyedDiff','redWardsPlaced','redWardsDestroyed']
x.drop(drop_cols, axis=1, inplace=True)

### Kills, Assists and Deaths
The distributions of kills, deaths and assists appear similar. Assists of course scale with kills (and red assists with blue deaths), so the histograms are as expected.  

In [None]:
# blueDeaths mirrors redKills, so this is blue kills minus red kills
x['killsDiff'] = x['blueKills'] - x['blueDeaths']
x['assistsDiff'] = x['blueAssists'] - x['redAssists']

x[['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists']].hist(figsize=(12,10), bins=20)
plt.show()

The influence of each feature on the outcome of a game is pictured below. Although the outcome isn't determined solely by these features, there is a clear correlation.  

Include **killsDiff** and **assistsDiff** in modelling.

In [None]:
sns.set(style='whitegrid', palette='muted')

data = x[['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists']].sample(1000)
data_std = (data - data.mean()) / data.std()
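# Inner join keeps only the sampled rows; the default outer join adds a NaN row for every unsampled game
data = pd.concat([y, data_std], axis=1, join='inner')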
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

In [None]:
data = pd.concat([y, x], axis=1).sample(500)

sns.pairplot(data, vars=['blueKills','blueDeaths','blueAssists','killsDiff','assistsDiff','redAssists'], 
             hue='blueWins')

plt.show()

In [None]:
data = pd.concat([y, x], axis=1)

fig, ax = plt.subplots(1,2, figsize=(15,6))
sns.scatterplot(x='killsDiff', y='assistsDiff', hue='blueWins', data=data, ax=ax[0])

sns.scatterplot(x='blueKills', y='blueAssists', hue='blueWins', data=data, ax=ax[1])
plt.show()

In [None]:
# Drop unnecessary features
drop_cols = ['blueFirstBlood','blueKills','blueDeaths','blueAssists','redAssists']
x.drop(drop_cols, axis=1, inplace=True)

### Elite Monsters
Including all three of **blueEliteMonsters**, **blueDragons** and **blueHeralds** would be inadvisable, since the first of these is the sum of the other two. Grouping the data below shows that a dragon advantage is worth more than a herald advantage.  

The dragon group shows a 64% chance of winning if blue kills the dragon before 10 minutes, 50% if the teams are equal on dragons, and 37% if the opposing team has killed the dragon.

Dragons have more influence than heralds on the outcome of the game, so I choose to include **heralds** and **dragons** individually in my machine learning model.

In [None]:
x['dragonsDiff'] = x['blueDragons'] - x['redDragons']
x['heraldsDiff'] = x['blueHeralds'] - x['redHeralds']
x['eliteDiff'] = x['blueEliteMonsters'] - x['redEliteMonsters']

data = pd.concat([y, x], axis=1)

eliteGroup = data.groupby(['eliteDiff'])['blueWins'].mean()
dragonGroup = data.groupby(['dragonsDiff'])['blueWins'].mean()
heraldGroup = data.groupby(['heraldsDiff'])['blueWins'].mean()

fig, ax = plt.subplots(1,3, figsize=(15,4))

eliteGroup.plot(kind='bar', ax=ax[0])
dragonGroup.plot(kind='bar', ax=ax[1])
heraldGroup.plot(kind='bar', ax=ax[2])

print(eliteGroup)
print(dragonGroup)
print(heraldGroup)

plt.show()

In [None]:
# Drop unnecessary features
drop_cols = ['blueEliteMonsters','blueDragons','blueHeralds',
            'redEliteMonsters','redDragons','redHeralds']
x.drop(drop_cols, axis=1, inplace=True)

### Towers
Towers are a major objective for each team, so we should expect them to heavily influence the outcome of the game.

The plots below show that, although it is unlikely that any towers will be destroyed in the first ten minutes of the game, destroying a tower gives a team a great advantage. I will therefore include it in my model as **towerDiff**.

In [None]:
x['towerDiff'] = x['blueTowersDestroyed'] - x['redTowersDestroyed']

data = pd.concat([y, x], axis=1)

towerGroup = data.groupby(['towerDiff'])['blueWins']
print(towerGroup.count())
print(towerGroup.mean())

fig, ax = plt.subplots(1,2,figsize=(15,5))

towerGroup.mean().plot(kind='line', ax=ax[0])
ax[0].set_title('Proportion of Blue Wins')
ax[0].set_ylabel('Proportion')

towerGroup.count().plot(kind='line', ax=ax[1])
ax[1].set_title('Count of Towers Destroyed')
ax[1].set_ylabel('Count')

In [None]:
# Drop unnecessary features
drop_cols = ['blueTowersDestroyed','redTowersDestroyed']
x.drop(drop_cols, axis=1, inplace=True)

### Gold and Experience

In [None]:
data = pd.concat([y, x], axis=1)

data[['blueGoldDiff','blueExperienceDiff']].hist(figsize=(15,5))
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='blueExperienceDiff', y='blueGoldDiff', hue='blueWins', data=data)

In [None]:
# Drop unnecessary features
drop_cols = ['blueTotalGold','blueTotalExperience','redTotalGold','redTotalExperience']
x.drop(drop_cols, axis=1, inplace=True)

x.rename(columns={'blueGoldDiff':'goldDiff', 'blueExperienceDiff':'expDiff'}, inplace=True)

### Minions and Jungle Minions

In [None]:
data = pd.concat([y, x], axis=1)

data[['blueTotalMinionsKilled','blueTotalJungleMinionsKilled',
      'redTotalMinionsKilled','redTotalJungleMinionsKilled']].hist(figsize=(15,10))
plt.show()

In [None]:
sns.set(style='whitegrid', palette='muted')

data = x[['blueTotalMinionsKilled','blueTotalJungleMinionsKilled',
      'redTotalMinionsKilled','redTotalJungleMinionsKilled']].sample(1000)
data_std = (data - data.mean()) / data.std()
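# Inner join keeps only the sampled rows; the default outer join adds a NaN row for every unsampled game
data = pd.concat([y, data_std], axis=1, join='inner')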
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')

plt.figure(figsize=(10,6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

In [None]:
# Drop unnecessary features
drop_cols = ['blueTotalMinionsKilled','blueTotalJungleMinionsKilled',
      'redTotalMinionsKilled','redTotalJungleMinionsKilled']
x.drop(drop_cols, axis=1, inplace=True)

## [Machine Learning](#3)
Below I use some machine learning algorithms from the Scikit-Learn library to see how effective the features I have selected above are for predicting the outcome of a Diamond League of Legends match.

In [None]:
# Import libraries for machine learning models
from sklearn import preprocessing, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

print('Machine Learning Libraries Imported!')

In [None]:
print(x.shape,y.shape)
x.head()

In [None]:
X = preprocessing.StandardScaler().fit(x).transform(x.astype(float))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
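The cell above fits the scaler on every row before splitting, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic data rather than the notebook's features, is to wrap the scaler and the model in a scikit-learn `Pipeline` so the scaler is fitted on the training rows only. The names `X_demo`, `y_demo` and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=4)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# The scaler is fitted on the training rows only, then reused on the test rows
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, solver='liblinear'))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print('Test accuracy:', round(test_accuracy, 3))
```

On a dataset of this size the difference is likely to be small, but the pipeline form also makes cross-validation straightforward.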

In [None]:
from prettytable import PrettyTable
table = PrettyTable()
table.field_names = ['Algorithm', 'Accuracy', 'Recall', 'Precision', 'F-Score']

In [None]:
def get_confusion_matrix(algorithm, y_pred, y_actual):
    # Create confusion matrix; for labels [0, 1] scikit-learn returns [[tn, fp], [fn, tp]],
    # so a blue win (1) is treated as the positive class
    con = confusion_matrix(y_actual, y_pred)
    tn, fp, fn, tp = con.ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = (2 * precision * recall) / (recall + precision)
    return algorithm, accuracy, recall, precision, f_score
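scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, with labels in sorted order, so for 0/1 labels the flattened matrix reads `tn, fp, fn, tp`. The sketch below uses invented labels to cross-check hand-computed metrics against the built-in scorers.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = blue win (positive class), 0 = red win
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

# For labels [0, 1] the matrix is laid out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_recall = tp / (tp + fn)
manual_precision = tp / (tp + fp)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# The built-in scorers use label 1 as the positive class by default
print(manual_accuracy == accuracy_score(y_true_demo, y_pred_demo))
print(manual_recall == recall_score(y_true_demo, y_pred_demo))
print(manual_precision == precision_score(y_true_demo, y_pred_demo))
```

If the hand-computed values disagree with the scorers, the matrix cells have probably been unpacked in the wrong order.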

### K-Nearest Neighbours 

In [None]:
# Test different values of k
Ks = 10
mean_acc = np.zeros((Ks-1))
for n in range(1,Ks):
    kneigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    y_pred = kneigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, y_pred)

# Use most accurate k value to predict test values
k = mean_acc.argmax()+1
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
y_pred = neigh.predict(X_test)
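Choosing k by its accuracy on the test set means the reported test score is slightly optimistic. A possible alternative, shown here as a sketch on synthetic data (not the notebook's features), is to pick k with cross-validation on the training set via `GridSearchCV` and keep the test set for the final score only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# 5-fold cross-validation on the training set picks k; the test set stays untouched
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 10))},
                      cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

best_k = search.best_params_['n_neighbors']
held_out_accuracy = search.score(X_te, y_te)
print('Best k:', best_k, 'held-out accuracy:', round(held_out_accuracy, 3))
```

The range 1 to 9 mirrors the loop above; the fold count of 5 is an assumption.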

In [None]:
# Call confusion matrix and accuracy
algorithm, accuracy, recall, precision, f_score = get_confusion_matrix('KNN', y_pred, y_test)

# Add values to table
table.add_row([algorithm, round(accuracy,5), round(recall,5),
               round(precision,5), round(f_score,5)])

### Decision Trees

In [None]:
# Initialise Decision Tree classifier and predict
decision_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
decision_tree.fit(X_train,y_train)
y_pred = decision_tree.predict(X_test)

In [None]:
# Call confusion matrix and accuracy
algorithm, accuracy, recall, precision, f_score = get_confusion_matrix('Decision', y_pred, y_test)

# Add values to table
table.add_row([algorithm, round(accuracy,5), round(recall,5),
               round(precision,5), round(f_score,5)])

### Logistic Regression

In [None]:
# Train and predict logistic regression model
LR = LogisticRegression(C=0.01, solver='liblinear')
y_pred = LR.fit(X_train,y_train).predict(X_test)

In [None]:
# Call confusion matrix and accuracy
algorithm, accuracy, recall, precision, f_score = get_confusion_matrix('LR', y_pred, y_test)

# Add values to table
table.add_row([algorithm, round(accuracy,5), round(recall,5),
               round(precision,5), round(f_score,5)])

### Support Vector Machines

In [None]:
clf = svm.SVC(kernel='rbf')
y_pred = clf.fit(X_train, y_train).predict(X_test)

In [None]:
# Call confusion matrix and accuracy
algorithm, accuracy, recall, precision, f_score = get_confusion_matrix('SVM', y_pred, y_test)

# Add values to table
table.add_row([algorithm, round(accuracy,5), round(recall,5),
               round(precision,5), round(f_score,5)])

### Naive Bayes

In [None]:
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

In [None]:
# Call confusion matrix and accuracy
algorithm, accuracy, recall, precision, f_score = get_confusion_matrix('Bayes', y_pred, y_test)

# Add values to table
table.add_row([algorithm, round(accuracy,5), round(recall,5),
               round(precision,5), round(f_score,5)])

### Random Forest

In [None]:
# Instantiate Random Forest Classifier and predict values
clf = RandomForestClassifier(max_depth=2, random_state=0)
y_pred = clf.fit(X_train, y_train).predict(X_test)

In [None]:
# Call confusion matrix and accuracy
algorithm, accuracy, recall, precision, f_score = get_confusion_matrix('R Forest', y_pred, y_test)

# Add values to table
table.add_row([algorithm, round(accuracy,5), round(recall,5),
               round(precision,5), round(f_score,5)])

### Evaluation of Machine Learning Models
**Logistic Regression** had the highest accuracy of the machine learning models used, with a prediction accuracy of 74.646%. Since we are trying to predict the outcome of a game, there isn't much risk attached to false positives or false negatives, so we use accuracy to choose the best predictive model.

In [None]:
print(table)
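A single train/test split gives one accuracy figure per model, and small gaps between models can come down to the particular split. As a hedged sketch on synthetic data (two of the models above, with hypothetical variable names), `cross_val_score` reports a mean and spread across folds instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=4)

models = {
    'LR': LogisticRegression(C=0.01, solver='liblinear'),
    'Bayes': GaussianNB(),
}

cv_results = {}
for name, estimator in models.items():
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())
    print('{}: {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))
```

If the fold-to-fold spread is larger than the gap between two models, the ranking in the table should be read with some caution.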