# Overview

For the machine learning section of the Capstone, three classifiers will be compared to find which performs best. These classifiers were chosen because they handle larger datasets with many features well (this dataset has ~5k features and ~7.5k samples). The classifiers are as follows:

- Random Forest
- Support Vector Machine
- Deep Neural Network
 
The goal of the classifiers is to predict which team will win given certain features/indicators within the game. The classifiers will be evaluated based on how many game outcomes they predict correctly. These results will be displayed using a confusion matrix.
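
To make the evaluation criterion concrete, here is a minimal sketch of how accuracy follows from a confusion matrix. The outcome lists are made up for illustration, and the encoding (1 = Blue team win, 0 = Red team win) is an assumption about `bResult`, not something taken from the dataset itself.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs for binary labels 0 and 1."""
    counts = Counter(zip(y_true, y_pred))
    # Rows are actual classes, columns are predicted classes
    return [[counts[(0, 0)], counts[(0, 1)]],
            [counts[(1, 0)], counts[(1, 1)]]]

def accuracy(matrix):
    """Fraction of predictions that fall on the diagonal (correct ones)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical outcomes: 1 = Blue team win, 0 = Red team win (assumed encoding)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)            # [[3, 1], [1, 3]]
print(accuracy(matrix))  # 0.75
```

The diagonal holds the correctly predicted games, so accuracy is the diagonal sum divided by the total number of games.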

In [2]:
# Import standard packages
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

ImportError: cannot import name 'rcParams' from 'matplotlib' (unknown location)

In [None]:
# Load wrangled dataframe
league_wrangled = pd.read_csv("../Dataset/Wrangled_LeagueofLegends.csv")

In [None]:
# View dataframe
league_wrangled.head()

In [None]:
# Separate labels from training data
result = league_wrangled.pop('bResult')

In [None]:
league_wrangled.head()

## Random Forest 

Random forest classifiers are an ensemble of decision trees. Decision trees work by separating the data such that the homogeneity of each split is maximized. Each decision tree returns a classification, and the classification that appears most frequently is taken as the final classification of the random forest.
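
The two ideas above, homogeneity of a split and voting across trees, can be sketched in a few lines. This is an illustration only: Gini impurity is used as the homogeneity measure because it is scikit-learn's default criterion, and the hard majority vote is the textbook version (scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead). All helper names and the toy labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a group of labels; 0.0 means perfectly homogeneous."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted impurity of a candidate split; lower is better."""
    total = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / total

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

print(gini_impurity([1, 1, 0, 0]))      # 0.5, an evenly mixed group
print(split_impurity([1, 1], [0, 0]))   # 0.0, a perfectly homogeneous split
print(majority_vote([1, 0, 1, 1, 0]))   # 1
```

A tree chooses, at each node, the split with the lowest weighted impurity, and the forest combines the resulting per-tree predictions.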

The random forest algorithm can be implemented easily using the scikit-learn package. Because we are trying to predict between two classes (Blue team win or Red team win), the RandomForestClassifier class will be used.

In [None]:
# First split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(league_wrangled, result, 
                                                    test_size=0.33, random_state=42)


print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

In [None]:
# Import the Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)

In [None]:
# Evaluate Random Forest 
rf_acc = rf.score(X_test, y_test)

print(f"Mean accuracy of random forest: {rf_acc:0.4}")

In [None]:
# Display classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

predictions = rf.predict(X_test)

# Classification Report
print(classification_report(y_test, predictions))

# Confusion Matrix
conf_mat = confusion_matrix(y_test, predictions)
fig, ax = plt.subplots(figsize=(8,5))
sns.heatmap(conf_mat, annot=True, annot_kws={'size':15}, fmt='g', linewidth=0.75, cbar=False, cmap="Blues",
            xticklabels=['Red Team Win', 'Blue Team Win'])

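ax.set_title("Confusion Matrix", size=15)
# confusion_matrix puts actual classes on the rows and predicted classes on the columns
ax.set_xlabel("Predicted outcome")
ax.set_ylabel("Actual outcome")
plt.yticks([0.5,1.5], ['Red Team Win', 'Blue Team Win'], va='center');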

### Feature Importance

In order to see which features were the most influential in determining who would win, the variable importances will be inspected.

In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 5)) for feature, importance in zip(league_wrangled.columns, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the 35 most important features and their importances
for feature, importance in feature_importances[:35]:
    print(f'Variable: {feature:20} Importance: {importance}')

Based on the feature importances, the most important features are the number of inhibitors taken by each team, followed by the inner and base towers. The importance values also suggest that games are not decided by a single feature alone, but rather by a combination of multiple features.

## Support Vector Machine (SVM)

An SVM is a classifier that separates the data points using a hyperplane. The hyperplane is chosen by maximizing the margin, that is, the distance between the hyperplane and the closest data points (the support vectors). SVMs take advantage of the kernel trick in order to model nonlinear decision boundaries. Their advantages include:

- Performs well for high-dimensional data
- Useful when classes are separable
- Suited for binary classification

All of these advantages are applicable to the wrangled League of Legends dataset. The SVM's disadvantages include:

- Requires a large amount of processing time for large datasets
- Does not perform well on overlapping classes
- Selecting the proper kernel function can be difficult
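
As a rough illustration of the kernel trick mentioned above, the sketch below evaluates an RBF kernel (the default kernel of scikit-learn's `SVC`) and uses it in a kernelized decision function. The support vectors, coefficients, intercept, and `gamma` are made-up values chosen for readability, not values learned from the League of Legends data.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def decision_function(x, support_vectors, dual_coefs, intercept, gamma):
    """Kernel SVM decision value: sum_i coef_i * K(sv_i, x) + intercept."""
    weighted = sum(coef * rbf_kernel(sv, x, gamma)
                   for sv, coef in zip(support_vectors, dual_coefs))
    return weighted + intercept

# Hypothetical model: one support vector per class
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
dual_coefs = [-1.0, 1.0]   # sign encodes the class of each support vector
intercept = 0.0
gamma = 0.5

for point in [(0.1, 0.0), (1.9, 2.0)]:
    score = decision_function(point, support_vectors, dual_coefs, intercept, gamma)
    print(point, "->", 1 if score > 0 else 0)
```

The kernel replaces an explicit nonlinear feature mapping with a similarity score, so the decision boundary can be nonlinear in the original feature space while the optimization stays the same.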

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Split data into training and testing sets before scaling
X_train, X_test, y_train, y_test = train_test_split(league_wrangled, result, 
                                                    test_size=0.33, random_state=42)

In [None]:
# Fit the scaler on the training set only so that no test-set statistics leak into training,
# then apply the same transformation to the test set
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

In [None]:
# Import SVM
from sklearn.svm import SVC

svc = SVC(gamma="scale") 
svc.fit(X_train, y_train)

In [None]:
svc_acc = svc.score(X_test, y_test)

print(f"Mean Accuracy of SVC: {svc_acc}")

## Deep Neural Network 

Neural networks are modeled after the way neurons are structured. They are able to learn complex relationships and functions through the adjustment of weights by gradient descent. Simply put, they learn by reducing the error between the predicted value/class and the actual value/class.
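
The idea of learning by reducing the error can be sketched with a single sigmoid neuron trained by gradient descent on binary cross-entropy. This is a toy example under stated assumptions: one input feature, made-up data, and an arbitrary learning rate and epoch count. It is not the network built later in this notebook.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mean_cross_entropy(xs, ys, w, b):
    """Mean binary cross-entropy between predictions and true labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def train_neuron(xs, ys, learning_rate=0.5, epochs=200):
    """Fit weight w and bias b by repeatedly stepping against the gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for each sample: predicted probability minus label
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical one-feature dataset: negative inputs are class 0, positive are class 1
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b = train_neuron(xs, ys)
print("loss before training:", mean_cross_entropy(xs, ys, 0.0, 0.0))
print("loss after training: ", mean_cross_entropy(xs, ys, w, b))
```

Each update moves the weight and bias in the direction that lowers the loss, which is the same principle the `adam` optimizer applies, with adaptive step sizes, to every weight in a larger network.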

To keep the model simple, a neural network with a single hidden layer will be constructed. The hidden layer will have half as many neurons as there are input features, followed by a final single output neuron. Additionally, dropout will be used.
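
For readers unfamiliar with dropout, the following is a minimal sketch of the "inverted dropout" scheme that Keras' `Dropout` layer uses: during training, each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at inference the activations pass through unchanged. The rate of 0.5 and the activation values are assumptions for illustration; the text above does not specify a rate.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout over a list of activations."""
    if not training or rate == 0.0:
        # At inference time dropout is a no-op
        return list(activations)
    keep_prob = 1.0 - rate
    # Keep each unit with probability keep_prob and rescale it so the
    # expected value of the layer output stays the same
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                     # seeded for repeatability
hidden = [0.2, 1.5, 0.0, 0.7, 2.1, 0.9]    # hypothetical hidden-layer outputs

print(dropout(hidden, rate=0.5, training=True, rng=rng))
print(dropout(hidden, rate=0.5, training=False, rng=rng))
```

Randomly silencing units during training discourages the network from relying on any single hidden neuron, which tends to reduce overfitting when the hidden layer is wide relative to the number of samples.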

In [None]:
# Transform inputs 
X_train_array = np.asarray(X_train)
y_train_array = np.asarray(y_train)

X_test_array = np.asarray(X_test)
y_test_array = np.asarray(y_test)

In [None]:
import tensorflow as tf
from tensorflow import keras 
from tensorflow.keras import layers

input_size = len(X_train_array[0])
inputs = keras.Input(shape=(input_size, )) 
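# Hidden layer with half as many units as input features (units must be an integer)
dense = layers.Dense(input_size // 2, activation='relu')(inputs)
# A single output neuron needs a sigmoid; softmax over one unit always outputs 1.0
output = layers.Dense(1, activation='sigmoid')(dense)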

model = keras.Model(inputs=inputs, outputs=output, name='base_league_model')

In [None]:
model.summary()

In [None]:
keras.utils.plot_model(model, 'base_league_model.png', show_shapes=True)

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train_array, y_train_array, 
          batch_size=64, 
          epochs=100)

test_scores = model.evaluate(X_test_array, y_test_array, verbose=2)

print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])

In [None]:
len(X_train_array[0])