# League of Legends: Diamond Ranked Games (10 min) Analysis and Model

## Background:

League of Legends is a multiplayer online battle arena video game for Windows and macOS. In each game, two teams of five players face off, with each player filling one of five defined roles. The map consists of three main lanes and a jungle area, and the goal of the match is to destroy the enemy base ('Nexus').

#### Data:

From the dataset description: "This dataset contains the first 10min. stats of approx. 10k ranked games (SOLO QUEUE) from a high ELO (DIAMOND I to MASTER). Players have roughly the same level". A few notes here: 1) The games span the ranks from Diamond I to Master. 2) Players being around the same level implies that they have about equivalent experience on their current accounts.

#### Data Terms:
    Warding totem: An item that a player can put on the map to reveal the nearby area. Very useful for map/objectives control.
    Minions: NPC that belong to both teams. They give gold when killed by players.
    Jungle minions: NPC that belong to NO TEAM. They give gold and buffs when killed by players.
    Elite monsters: Monsters with high hp/damage that give a massive bonus (gold/XP/stats) when killed by a team.
    Dragons: Elite monster which gives team bonus when killed. The 4th dragon killed by a team gives a massive stats bonus. The 5th dragon (Elder Dragon) offers a huge advantage to the team.
    Herald: Elite monster which gives stats bonus when killed by the player. It helps to push a lane and destroys structures.
    Towers: Structures you have to destroy to reach the enemy Nexus. They give gold.
    Level: Champion level. Start at 1. Max is 18.


In [None]:
# Load Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import Data
data = pd.read_csv('/home/high_diamond_ranked_10min.csv')
data.head()

In [None]:
data.info() # Check data type for all columns

Observations:  
    - There are 39 columns and no missing data  
    - All columns are already in numeric format  
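
The two observations above can also be checked programmatically instead of being read off the `info()` printout. The sketch below is only an illustration: `summarize_frame` is a hypothetical helper and `demo` is a tiny made-up frame standing in for `data`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Tiny stand-in frame with made-up values so the snippet runs on its own;
# in the notebook, pass `data` instead.
demo = pd.DataFrame({
    "gameId": [1, 2, 3],
    "blueWins": [0, 1, 1],
    "blueKills": [4, 9, 6],
    "blueCSPerMin": [19.5, 22.1, 20.3],
})

def summarize_frame(df):
    """Return (n_rows, n_columns, n_missing_values, non_numeric_columns)."""
    n_missing = int(df.isna().sum().sum())
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    return df.shape[0], df.shape[1], n_missing, non_numeric

print(summarize_frame(demo))
```

A result with zero missing values and an empty list of non-numeric columns would confirm both observations.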

In [None]:
data.loc[:, data.columns != 'gameId'].describe().T

In [None]:
data['blueWins'].value_counts()

Observations:  
    - Even distribution of outcome variable 

In [None]:
# Extract binary columns and check distribution with outcome feature
bool_cols = [col for col in data 
             if np.isin(data[col].dropna().unique(), [0, 1]).all()]
binary_data = data[bool_cols]

for col in binary_data:
    if col == 'blueWins':
        continue  # Skip crossing the outcome with itself
    print(pd.crosstab(binary_data['blueWins'], binary_data[col]))

Observations:  
    - All binary features seem to be evenly distributed across both outcomes
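
Raw counts can be hard to compare by eye, so one possible follow-up is to normalize each crosstab by column and read off the win rate per feature value. This is a minimal sketch on made-up rows; the column names mirror the dataset, but the values do not come from it.

```python
import pandas as pd

# Made-up rows for illustration only; in the notebook, use `binary_data`.
demo = pd.DataFrame({
    "blueWins":       [1, 1, 1, 0, 0, 0, 1, 0],
    "blueFirstBlood": [1, 1, 0, 0, 0, 1, 1, 0],
})

# normalize="columns" makes every column sum to 1, so the blueWins == 1 row
# holds the blue win rate for each value of the feature.
rates = pd.crosstab(demo["blueWins"], demo["blueFirstBlood"], normalize="columns")
print(rates)
```

If the win rates in the two columns are close, the feature is roughly balanced across outcomes. A large gap suggests the feature carries signal.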

In [None]:
# All continuous data (non-binary columns)
cont_data = data.drop(bool_cols, axis = 1 )

# Correlation Heatmap 
plt.figure(figsize=(18,13))
sns.heatmap(data.loc[:, data.columns != 'blueWins'].corr(),cbar=True,fmt =' .2f', cmap='coolwarm')

Observations:  
    - Certain high correlations are expected. Red team deaths are highly correlated with blue team kills and vice versa. Additionally, for each team, the kills, gold and experience features are highly correlated. This is expected since each kill rewards the team with gold and experience while also making it easier to push lanes. Minions, monsters and turrets also provide the team with experience and gold.
    - A lot of these columns are providing the same info in different formats such as blueKills/redDeaths, blueGoldDiff/redGoldDiff, blueExperienceDiff/redExperienceDiff. So these columns can be removed before proceeding to the modeling phase. 
    - Based on game background, a few other columns could be removed. csPerMin and totalMinionsKilled convey the same information, but one is a rate and the other is a count. The same applies to goldPerMin and totalGold. Additionally, goldDiff is the difference between the two teams' gold totals, so it depends heavily on the other gold features in the dataset. The same applies to experienceDiff.
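
Redundant pairs like the ones described above can also be listed programmatically rather than read off the heatmap. The sketch below assumes a hypothetical helper, `high_corr_pairs`, and an arbitrary cutoff of 0.9. The demo values are made up.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, corr) for column pairs with |corr| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Only visit the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Made-up values; in the notebook, call high_corr_pairs(data.drop(columns="blueWins")).
demo = pd.DataFrame({
    "blueKills": [3, 5, 7, 2, 9],
    "redDeaths": [3, 5, 7, 2, 9],
    "blueGoldDiff": [100, -250, 400, -50, 300],
    "redGoldDiff": [-100, 250, -400, 50, -300],
    "blueWardsPlaced": [20, 14, 31, 18, 16],
})
print(high_corr_pairs(demo))
```

Pairs with a correlation of exactly 1 or -1 carry identical information, so one column of each pair is a candidate for removal.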

In [None]:
# Remove unnecessary columns 
extra_col = ['gameId', 'redGoldDiff', 'redExperienceDiff', 'redKills', 'redDeaths', 'blueAvgLevel', 'redAvgLevel', 'redFirstBlood', 'blueGoldPerMin', 'redGoldPerMin', 'redTotalMinionsKilled', 'blueTotalMinionsKilled', 'blueTotalExperience', 'blueTotalGold', 'redTotalGold', 'redTotalExperience', 'redDragons']

data_clean = data.drop(extra_col, axis=1)

# Redo Correlation Heatmap with excluded data 
plt.figure(figsize=(18,13))
sns.heatmap(data_clean.loc[:, data_clean.columns != 'blueWins'].corr(),cbar=True,fmt =' .2f', cmap='coolwarm')

In [None]:
# Check columns for correlation with blueWins
corr_list = data_clean.corr()['blueWins'].abs()
corr_list = corr_list.sort_values(kind='quicksort')
print(corr_list)

Observations:  
    - Certain features have very little correlation with the outcome so those could be removed. 
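
Instead of typing the list of weak features by hand, the columns can be selected with an explicit cutoff. This is a minimal sketch: `low_corr_features` is a hypothetical helper, the 0.20 cutoff is the one used in this notebook, and the demo rows are made up.

```python
import pandas as pd

def low_corr_features(df, target="blueWins", threshold=0.20):
    """Return features whose absolute correlation with the target is below threshold."""
    corr = df.corr()[target].drop(target).abs()
    return sorted(corr[corr < threshold].index)

# Made-up rows; in the notebook, call low_corr_features(data_clean).
demo = pd.DataFrame({
    "blueWins":        [0, 0, 1, 1, 0, 1],
    "blueGoldDiff":    [-500, -200, 300, 800, -100, 400],
    "blueWardsPlaced": [15, 20, 15, 20, 18, 18],
})
print(low_corr_features(demo))
```

The returned list can be passed directly to `drop(..., axis=1)`, which keeps the cutoff and the dropped columns in sync if the data changes.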

In [None]:
corr_cols = ['blueWardsPlaced', 'redWardsPlaced', 'blueWardsDestroyed', 'redWardsDestroyed', 'blueHeralds', 'redHeralds', 'redTowersDestroyed', 'redTotalJungleMinionsKilled', 'blueTowersDestroyed', 'blueTotalJungleMinionsKilled']

# Drop features with correlation under 0.20 with the outcome
final_data = data_clean.drop(corr_cols, axis=1)
final_data.info()

## Data Modeling

In [None]:
# Get data ready for modeling
from sklearn.model_selection import train_test_split

x = final_data.drop('blueWins', axis=1)
y = final_data['blueWins']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state=15)

accuracy_results = {} # Store all model accuracy ratings

In [None]:
# Logistic Regression 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logistic_model = LogisticRegression(max_iter=400)
logistic_model.fit(x_train, y_train)
lm_predict_labels = logistic_model.predict(x_test)

lm_accuracy = accuracy_score(lm_predict_labels, y_test)
print(lm_accuracy)

accuracy_results['Logistic Regression'] = lm_accuracy

In [None]:
# Normalizing continuous variables
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range = (0,1))

scaler.fit(x_train)
x_train_normalized = scaler.transform(x_train)
x_test_normalized = scaler.transform(x_test)

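# Note: this refits the same logistic_model object, so from here on its
# coefficients come from the normalized data
logistic_model.fit(x_train_normalized, y_train)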
lm_predict_labels_normalized = logistic_model.predict(x_test_normalized)
lm_accuracy_normalized = accuracy_score(lm_predict_labels_normalized, y_test)
print(lm_accuracy_normalized)

accuracy_results['Logistic Regression Normalized'] = lm_accuracy_normalized

Normalizing the data did not improve the model's accuracy. The next step could be to tune the hyperparameters or try a different model.
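
One way to keep the scaled and unscaled models separate is to wrap the scaler and the classifier in a scikit-learn pipeline. The scaler is then fit on the training split only, and the original model is not overwritten. The sketch below uses synthetic data from `make_classification` as a stand-in for the game data, and the name `scaled_logit` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; in the notebook, use x and y instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=15)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15
)

# The pipeline fits MinMaxScaler on the training data only, then fits the model
# on the scaled values; predict/score apply the same scaling to new data.
scaled_logit = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=400))
scaled_logit.fit(X_train, y_train)

test_accuracy = scaled_logit.score(X_test, y_test)
print(test_accuracy)
```

The fitted classifier is available as `scaled_logit.named_steps["logisticregression"]` if its coefficients are needed later.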

In [None]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=1) 
dt_model.fit(x_train, y_train)
dt_predictions = dt_model.predict(x_test)

dt_accuracy = accuracy_score(dt_predictions, y_test)
print(dt_accuracy)

accuracy_results['Decision Tree'] = dt_accuracy

The decision tree model also had lower accuracy than the original logistic regression model.

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=5, random_state=1, min_samples_leaf=2)
rf_model.fit(x_train, y_train)
rf_predictions = rf_model.predict(x_test)

rf_accuracy = accuracy_score(rf_predictions, y_test)
print(rf_accuracy)

accuracy_results['Random Forest'] = rf_accuracy

In [None]:
# Tuning Random Forest Parameters
rf_model = RandomForestClassifier(n_estimators=75, random_state=1, min_samples_leaf=2)
rf_model.fit(x_train, y_train)
rf_predictions = rf_model.predict(x_test)

rf_accuracy_tuned = accuracy_score(rf_predictions, y_test)
print(rf_accuracy_tuned)

accuracy_results['Random Forest Tuned'] = rf_accuracy_tuned

Tuning the Random Forest model (increasing n_estimators from 5 to 75) improved accuracy and provided results closer to the logistic regression model.
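
The manual change of `n_estimators` could also be turned into a small cross-validated grid search. This is only a sketch: the grid values are assumptions chosen around the settings tried above, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook, use x_train and y_train instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# Hypothetical grid built around the values tried manually above.
param_grid = {
    "n_estimators": [5, 25, 75],
    "min_samples_leaf": [1, 2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Because the search only sees the training data, the held-out test split remains available for a final, unbiased accuracy check.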

In [None]:
# XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(x_train, y_train)
xgb_predictions = xgb.predict(x_test)
xgb_accuracy = accuracy_score(xgb_predictions, y_test)
print(xgb_accuracy)

accuracy_results['XGBoost'] = xgb_accuracy

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train, y_train)
gnb_predictions = gnb.predict(x_test)
gnb_accuracy = accuracy_score(gnb_predictions, y_test)
print(gnb_accuracy)

accuracy_results['Naive Bayes'] = gnb_accuracy

In [None]:
# Check results of models
print(pd.DataFrame.from_dict(accuracy_results, orient='index'))

Logistic Regression, Random Forest and Naive Bayes have the best accuracy scores among the models tested. Moving forward, Logistic Regression will be the model used.
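
When several models score close together on a single train/test split, the ranking can depend on how the split happened to fall. A possible cross-check, sketched here on synthetic data with a hypothetical `models` dictionary, is to compare mean accuracy across cross-validation folds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook, use x and y instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=15)

models = {
    "Logistic Regression": LogisticRegression(max_iter=400),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over 5 folds for each model, sorted from best to worst.
cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
ranking = pd.Series(cv_means).sort_values(ascending=False)
print(ranking)
```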

## Predictive Model Details

In [None]:
# Extract coefficients from the model (logistic_model was last fit on the normalized data)
inferential_table = np.concatenate((logistic_model.coef_, np.exp(logistic_model.coef_)),axis=0)
infer_col = final_data.loc[:, final_data.columns != 'blueWins'].columns
inferential_data = pd.DataFrame(data=inferential_table, columns=infer_col).T.reset_index().rename(columns={'index': 'Features', 0: 'Coefficient', 1: 'Odds Ratio'})
print(inferential_data)

Observations:  
    - blueGoldDiff and blueExperienceDiff were the strongest predictors of the odds of winning the game. blueKills is also associated with high odds of winning.
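
For reference, the "Odds Ratio" column is the exponential of each logistic regression coefficient. With $p = P(\text{blueWins} = 1)$ and features $x_j$, the standard model and the resulting odds ratio are:

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j} \beta_j x_j,
\qquad
\mathrm{OR}_j = e^{\beta_j}
```

$\mathrm{OR}_j$ is the factor by which the odds of a blue win are multiplied when $x_j$ increases by one unit and the other features are held fixed. If the coefficients come from a fit on min-max scaled features, one unit corresponds to the full observed range of that feature in the training data. Each odds ratio then compares the feature's maximum against its minimum, not a one-point change in gold or experience.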

In [None]:
# Get confusion matrix of the predictions 
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, lm_predict_labels)
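
The confusion matrix can be unpacked into per-class metrics. The sketch below uses made-up labels, not the model's predictions, and relies on scikit-learn's binary layout in which `ravel()` yields the counts in the order tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; in the notebook, use y_test and lm_predict_labels.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes and columns are predicted classes, so for binary
# labels ravel() returns tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted blue wins that were correct
recall = tp / (tp + fn)     # share of actual blue wins that were predicted
print(precision, recall)
```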

## Conclusion

**Logistic Regression** was chosen as the final predictive model for the dataset. Further tuning of the hyperparameters of other models, such as Random Forest or Naive Bayes, might yield similar accuracy scores. blueGoldDiff and blueExperienceDiff seem to be the strongest predictors of game wins.