# COGS 118A - Final Project

# Predicting Winner of LoL Match Using EDA

## Group members

- Samuel Kweon  
- Nicholas Azpeitia 
- Gyujin Hong

# Abstract 

Our goal is to use EDA to determine which classifier model, with which features, predicts the winner of a League of Legends game with the best accuracy. The data represents competitive League of Legends matches from 2015 to 2017 in the NALCS, EULCS, LMS, CBLoL, and LCK. We will apply logistic regression and random forest to predictor features of our choice from the dataset and evaluate each model by how accurately it classifies the winners of matches that are not in the training data. A confusion matrix visualization will help us understand the results of each model in more detail.
After analyzing the accuracy scores, we will be able to conclude which model and feature are best for predicting the winner of a match. 


# Background

League of Legends, an online multiplayer game, has emerged as a significant area of interest for data scientists and machine learning enthusiasts due to its complex and multifaceted gameplay. Two substantial contributions have been made to this field of research by Diego Angulo Quintana and Yingshan Li, who have both utilized diverse data and machine learning models to predict the outcome of the matches.

Diego Angulo Quintana's study stands out in the sense that it focuses primarily on the objectives achieved by each team as predictive parameters<a name="quintana"></a>[<sup>[1]</sup>](#quintananote). His methodology implies that the various objectives throughout the game, such as securing Dragon or Baron Nashor, have a decisive role in determining the outcome. Quintana's research offers crucial insights into the importance of in-game objectives, serving as an informative resource for those interested in identifying the key performance indicators in a League of Legends match.

On the other hand, Yingshan Li's research<a name="li"></a>[<sup>[2]</sup>](#linote) tackles the issue from a different perspective by studying the early stage of the game. Li utilizes statistics from the first 10 minutes of ranked matches to predict match outcomes. This approach underlines the significance of early game performance in shaping the rest of the match, suggesting that strategies and actions in the initial phases can potentially affect the game's outcome. Furthermore, Li compares different machine learning models to assess their accuracy, contributing valuable information to the ongoing dialogue around the most suitable predictive algorithms for such complex tasks.

# Problem Statement




The problem we are trying to solve is the accurate prediction of the winner in League of Legends matches using data from competitive matches played between 2015 and 2017. The problem is well-defined and can be quantified, measured, and replicated.
<br>

The problem is quantifiable because the predictions we make with the logistic regression model are based on probabilities. It is also measurable through the dataset, as every match records the objective scores, champion bans and picks, and gold differences along a timeline. The gold difference column we will be using holds a series of integer values per match, so we can perform mathematical operations on it and obtain specific measurements for every match. Moreover, the problem is replicable in that many League of Legends games can be played and analyzed to validate the accuracy of our prediction models.
<br>

The gold difference between teams (i.e., the difference in total gold earned by each team at any given point in the match) can often indicate which team is in the lead or has the upper hand. A significant gold difference can highlight a major discrepancy in team strengths, often translating into one team having better items and thus a higher likelihood of winning.

Another feature we chose for our prediction models is the KDA difference between the blue and red teams in each match. This could be a good indicator of a team's chance of winning because the more a team snowballs, the stronger its champions become relative to the other team's champions, leading to a higher chance of winning team fights and therefore more kills than the team with the lower chance of winning. Because we subtract the red team's champion kills from the blue team's, we expect the KDA difference to be a positive integer rather than a negative one whenever the blue team is labeled as the winner of the match. 

The third feature we apply our prediction models to is the difference in champion win rates between the teams. A higher average win rate might indicate that a team has stronger champions and thus an upper hand, which could lead to a higher likelihood of winning.

After comparing the performance of the logistic regression model and the random forest classifier on each of the three features, we will be able to determine which feature has better metrics for predicting the winner of a League of Legends game.

# Data


### Information on the dataset and libraries used

The dataset used for this project was obtained from a Kaggle dataset post<a name="kaggledata"></a>[<sup>[3]</sup>](#Kaggle).  The dataset consists of 57 variables and 7620 observations.
Below is a list of the variables from the dataset provided. 

```python

# out of all these variables, we felt like the golddiff would be the most useful
# it contained the gold difference between the blue and red team throughout the game
# applying mean and std dev ops on the column gave us the appropriate values
variables = ['League', 'Year', 'Season', 'Type', 'blueTeamTag', 'bResult', 'rResult',
       'redTeamTag', 'gamelength', 'golddiff', 'goldblue', 'bKills', 'bTowers',
       'bInhibs', 'bDragons', 'bBarons', 'bHeralds', 'goldred', 'rKills',
       'rTowers', 'rInhibs', 'rDragons', 'rBarons', 'rHeralds', 'blueTop',
       'blueTopChamp', 'goldblueTop', 'blueJungle', 'blueJungleChamp',
       'goldblueJungle', 'blueMiddle', 'blueMiddleChamp', 'goldblueMiddle',
       'blueADC', 'blueADCChamp', 'goldblueADC', 'blueSupport',
       'blueSupportChamp', 'goldblueSupport', 'blueBans', 'redTop',
       'redTopChamp', 'goldredTop', 'redJungle', 'redJungleChamp',
       'goldredJungle', 'redMiddle', 'redMiddleChamp', 'goldredMiddle',
       'redADC', 'redADCChamp', 'goldredADC', 'redSupport', 'redSupportChamp',
       'goldredSupport', 'redBans', 'Address']
```

Each observation in the dataset represents a single game of League of Legends, with various statistics recorded throughout the game for each team. Each row is one observation, so the dataset covers 7620 matches. <br>
The dataset provides information on different aspects of the game, including gold differentials and objectives taken by the teams.

The dataset has already undergone cleaning, so no further cleaning or transformations were necessary. The preprocessing steps involved extracting relevant features from the original data, such as computing gold differences and extracting objective control information. 

The libraries imported are as follows:

```python
# logistic regression libraries
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np
import ast
import matplotlib.pyplot as plt
import seaborn as sns
```

```python
# random forest libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import numpy as np
import ast
import seaborn as sns
import matplotlib.pyplot as plt
```

### Data Preprocessing

#### Gold Difference Feature Predictor
We preprocessed the dataset by extracting the gold difference from the "golddiff" column. Using a custom function, we extracted the final_gold_diff, avg_gold_diff, and std_dev_gold_diff features which capture different aspects of the gold difference per match labeled with a win or loss for the blue team. These features serve as predictors in our model.
The 'bResult' column is the target variable that holds the win or loss labels as 1 or 0, respectively. 

```python
# gold difference 
def process_gold_diff(gold_diff_str):
    gold_diff = ast.literal_eval(gold_diff_str)
    final_gold_diff = gold_diff[-1]
    avg_gold_diff = np.mean(gold_diff)
    std_dev_gold_diff = np.std(gold_diff)
    return final_gold_diff, avg_gold_diff, std_dev_gold_diff

# applying the function 
# parentheses let the assignment continue onto the next line
data['final_gold_diff'], data['avg_gold_diff'], data['std_dev_gold_diff'] = (
    zip(*data['golddiff'].map(process_gold_diff)))

# features and target
X = data[['final_gold_diff', 'avg_gold_diff', 'std_dev_gold_diff']]
y = data['bResult']
```
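To make the three gold difference features concrete, here is a minimal sketch on a made-up timeline. The `toy_golddiff` string and the helper name `gold_diff_features` are hypothetical. The sketch uses the standard library's `statistics` module instead of NumPy; `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default.

```python
import ast
import statistics

def gold_diff_features(gold_diff_str):
    """Parse a stringified gold difference timeline and summarize it."""
    gold_diff = ast.literal_eval(gold_diff_str)
    final_gold_diff = gold_diff[-1]                   # value at the end of the match
    avg_gold_diff = statistics.fmean(gold_diff)       # mean over the whole timeline
    std_dev_gold_diff = statistics.pstdev(gold_diff)  # population std, like np.std
    return final_gold_diff, avg_gold_diff, std_dev_gold_diff

# Hypothetical timeline (not taken from the dataset).
toy_golddiff = "[0, 100, -50, 250]"
final, avg, std = gold_diff_features(toy_golddiff)
print(final, avg, round(std, 2))
```

If the column is blue gold minus red gold (an assumption here, since the sign convention is not stated), a positive final value means the blue team ended the match ahead.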

#### Kill Death Assist Difference Feature Predictor
Although we call this feature the KDA difference, it is built from kill counts only. The blue and red teams' champion kill columns each hold a list of kills per match, so we took the length of each list as that team's total champion kills and subtracted the red team's total from the blue team's. We compared that difference against the blue team's win label. <br>
<br>
This feature needed an extra step: checking the extracted values for infinities and excessively large values, because skipping this check caused errors when fitting our models. 
We set a threshold to flag values too large for calculation and prediction, and afterwards applied StandardScaler to normalize the values. After doing so, we ran the training and test data through the prediction model. 

```python
# preprocess and extract features
# kill count per team (the basis of the kda difference)
def process_kda(row):
    # each cell is a string holding a list with one entry per champion kill
    kda = ast.literal_eval(row)
    kill_count = len(kda)
    return kill_count

# apply the function above to the 'bKills' and 'rKills' columns
data['bkills'] = data['bKills'].apply(process_kda)
data['rkills'] = data['rKills'].apply(process_kda)

data['kda_diff'] = data['bkills'] - data['rkills']

# Check for infinity values in 'kda_diff'
is_inf = np.isinf(data['kda_diff'].values)

# Calculate absolute values of 'kda_diff'
abs_diff = np.abs(data['kda_diff'].values)

# Define threshold for excessively large values
threshold = 100.0

# Create boolean mask for large values
is_large = abs_diff > threshold

data_d = data['kda_diff']

# Print rows with infinity values or large values
print("Rows with Infinity Values:")
print(data_d[is_inf])

print("\nRows with Large Values:")
print(data_d[is_large])

# splitting data to training and test data
X = data['kda_diff'].values.reshape(-1,1)
y = data['bResult']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

# normalizing values
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
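As a small illustration of the kill count step, the sketch below parses two made-up cells and takes their difference. The cell contents are hypothetical stand-ins; the only assumption is that each cell is a stringified list with one entry per champion kill, which is what taking the list length relies on.

```python
import ast

def kill_count(kills_str):
    """Count champion kills from a stringified list with one entry per kill."""
    return len(ast.literal_eval(kills_str))

# Hypothetical cells: each inner list stands in for one recorded kill.
toy_bkills = "[[5.2, 'victim_a'], [9.8, 'victim_b'], [14.1, 'victim_c']]"
toy_rkills = "[[7.5, 'victim_d']]"

# Blue kills minus red kills, matching the sign convention of 'kda_diff'.
kda_diff = kill_count(toy_bkills) - kill_count(toy_rkills)
print(kda_diff)
```

A team with no kills has an empty list, so its count is 0, and the difference is positive whenever blue has more kills than red.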

#### Champion Winrate Difference Feature Predictor
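A new column called "wrdiff" was also created. It holds the average win rate of the champions on the blue side minus the average win rate of the champions on the red side. Each champion's win rate is computed for its role across all matches in the dataset, counting games played on either side.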

```python
# Create the empty column
data['wrdiff'] = pd.Series(data = None, dtype = float)

# Run for all rows in the data
for match in range(len(data)):
    # Check for blue champions
    color = 'blue'
    bluewrs = []
    # Run for each role
    for role in ['Top', 'Jungle', 'Middle', 'ADC', 'Support']:
        # Get the champion
        champ = data.at[match, color + role + 'Champ']
        # Get champion's average winrate
        bluechampgames = data[data['blue' + role + 'Champ'] == champ]
        redchampgames = data[data['red' + role + 'Champ'] == champ]
        bluewins = len(bluechampgames[bluechampgames['bResult'] == 1])
        redwins = len(redchampgames[redchampgames['rResult'] == 1])
        champwr = (bluewins+redwins) / (len(bluechampgames) + len(redchampgames))
        # Append champion's average winrate
        bluewrs.append(champwr)
    
    # Check for red champions
    color = 'red'
    redwrs = []
    for role in ['Top', 'Jungle', 'Middle', 'ADC', 'Support']:
        # Get the champion
        champ = data.at[match, color + role + 'Champ']
        # Get champion's average winrate
        bluechampgames = data[data['blue' + role + 'Champ'] == champ]
        redchampgames = data[data['red' + role + 'Champ'] == champ]
        bluewins = len(bluechampgames[bluechampgames['bResult'] == 1])
        redwins = len(redchampgames[redchampgames['rResult'] == 1])
        champwr = (bluewins+redwins) / (len(bluechampgames) + len(redchampgames))
        # Append champion's average winrate
        redwrs.append(champwr)
    
    # Append the subtracted team's averages
    data.at[match, 'wrdiff'] = np.mean(bluewrs) - np.mean(redwrs)
```
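The loop above filters the full DataFrame ten times per match, which can be slow on thousands of rows. The sketch below is one possible vectorized alternative, not the code we ran: it computes each (role, champion) win rate once with a `groupby` and then maps it onto every match. The helper names `champion_winrates` and `winrate_diff` and the `toy` DataFrame are hypothetical. Like the loop, it pools blue-side and red-side games and uses the whole dataset.

```python
import pandas as pd

ROLES = ['Top', 'Jungle', 'Middle', 'ADC', 'Support']

def champion_winrates(data):
    """Win rate of each (role, champion) pair, pooling blue and red games."""
    picks = []
    for side, result_col in [('blue', 'bResult'), ('red', 'rResult')]:
        for role in ROLES:
            picks.append(pd.DataFrame({
                'role': role,
                'champ': data[side + role + 'Champ'],
                'win': data[result_col],
            }))
    picks = pd.concat(picks, ignore_index=True)
    return picks.groupby(['role', 'champ'])['win'].mean()

def winrate_diff(data):
    """Mean blue champion win rate minus mean red champion win rate."""
    winrates = champion_winrates(data)
    side_means = {}
    for side in ('blue', 'red'):
        per_role = [
            data[side + role + 'Champ'].map(winrates.xs(role, level='role'))
            for role in ROLES
        ]
        side_means[side] = pd.concat(per_role, axis=1).mean(axis=1)
    return side_means['blue'] - side_means['red']

# Hypothetical three-match dataset with the same champion in every role.
toy = pd.DataFrame({'bResult': [1, 0, 1], 'rResult': [0, 1, 0]})
for role in ROLES:
    toy['blue' + role + 'Champ'] = ['A', 'B', 'A']
    toy['red' + role + 'Champ'] = ['B', 'A', 'C']

print(winrate_diff(toy).tolist())
```

In the toy data champion A wins every game it appears in and B and C win none, so matches where blue fields A get a difference of 1 and the match where red fields A gets -1.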


### Dataset Split to Training and Test 

After preprocessing the data, we split it into training and test sets to evaluate each model's performance, using a ratio of 80% training data to 20% test data. Logistic regression was a suitable base model because it is commonly used for binary classification. We trained each model on the training data.  

```python
# although the code snippet below includes both logistic and random forest
# in our actual eda process, they were done on separate notebooks
# therefore the var names may overlap, but actual code rendering was not affected

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

# create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
 
# create and train the random forest classifier model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# make predictions on the test set
y_pred = model.predict(X_test)
```
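Because the snippet above reuses the name `model` for both classifiers, running it in a single notebook would leave only the random forest's predictions in `y_pred`. A minimal sketch of one way to keep the two models apart follows. The data is synthetic and stands in for any one of our features, and fixing `random_state` on the forest is an added assumption for repeatability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not from the dataset): one noisy feature whose
# sign mostly determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12)

# One entry per classifier, so neither fitted model overwrites the other.
models = {
    'logistic_regression': LogisticRegression(),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=12),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # fit on the 80% training split
    y_pred = model.predict(X_test)     # predict the held-out 20%
    accuracies[name] = accuracy_score(y_test, y_pred)

print(accuracies)
```

Both models see exactly the same split here, which makes their accuracy scores directly comparable.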

# Proposed Solution

Our solution is to build predictive models and identify the best feature for determining the outcome of League of Legends matches, considering the gold difference, KDA difference, and champion win rate difference. We will compare logistic regression, as a base model, with the Random Forest classifier. Each model is judged by how accurately it predicts the win or loss of held-out test matches after being fit on the training data.

Evaluating on held-out data provides an unbiased estimate of each model's performance and shows whether it generalizes to new, unseen data. We will then calculate the accuracy score to assess how well each model predicts the outcome of League of Legends matches from each feature.

While no model can guarantee 100% accuracy due to the inherent uncertainty and complexity of League of Legends matches, with the proposed approach, we should be able to create a highly reliable predictive model. This will allow us to accurately predict match outcomes based on the game data available.

Additionally, we will plot a confusion matrix to visually represent the model's performance.

After running logistic regression, we will also apply the Random Forest classifier to the same features. The Random Forest algorithm is an ensemble method that combines multiple decision trees to make predictions and improve robustness. 

By comparing the accuracy scores of both models, we can determine which model and feature performs better in predicting the outcome of League of Legends matches.

# Evaluation Metrics

The main evaluation metric is accuracy, since we have a binary classification task. In this case, accuracy measures how often we correctly predict the winner of a League of Legends match. It is calculated with the formula:

Accuracy = (True Positives + True Negatives) / (All Predictions)

However, if the dataset is imbalanced, the F1 score, recall, or precision might be a better evaluation metric. The formulas for each are below:

F1 Score = (2 * Precision * Recall) / (Precision + Recall)

Recall = (True Positives) / (True Positives + False Negatives)

Precision = (True Positives) / (True Positives + False Positives)
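The four formulas above can be checked with a few lines of plain Python. This is a minimal sketch; the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, not results from our models.
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts, precision is higher than recall because there are fewer false positives than false negatives, and F1 falls between the two.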


These metrics will be used to measure the performance of our proposed solution versus a benchmark model.

### Initial Findings

We looked at the correlation, F1 score, precision, and recall to see how the models and features predicted wins.

## kda_diff with forest model:

![someimg](../kdatree.png)

## kda_diff with log model:

![someimg2](../kdalog.png)

## gold_diff with forest model:

![someimg3](../golddifftree.png)

## gold_diff with log model:

![someimg4](../golddifflog.png)

## wrdiff with forest model:

![someimg5](../wrdifftree.png)

## wrdiff with log model:

![someimg6](../wrdifflog.png)

From these findings, it appears that the KDA and gold differences may be good predictors of which team wins the match, while wrdiff seems quite poor.

### Diving Deeper

Although the correlation may give some weight to what feature predicts wins the best, we need to look at the accuracy of the models to get a better look.

The accuracy score was used as the measure of performance for the classifier models. We compared the accuracy scores of logistic regression and random forest to determine which model and features performed better in predicting the winner of League of Legends games based on the kda difference, gold difference, and wrdiff features separately.

For consistency, the random seed was set to 43 for our logistic regression models for all three features and set to 12 for our random forest models. 

```python
# evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))

# Output for each feature and model:

# Logistic Regression, Gold Difference
# Accuracy: 0.9665354330708661

# Random Forest Classifier, Gold Difference
# Accuracy: 0.9625984251968503

# Logistic Regression, KDA Difference
# Accuracy: 0.94750656167979

# Random Forest Classifier, KDA Difference
# Accuracy: 0.9488188976377953

# Logistic Regression, wrdiff
# Accuracy: 0.5912073490813649

# Random Forest Classifier, wrdiff
# Accuracy: 0.5190288713910761
```
```
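To put these accuracy scores in terms of matches, the sketch below converts each score into a count of correctly predicted test matches. It assumes that all 7620 rows were used and that the 20% test split therefore held 1524 matches; neither is confirmed by the output itself.

```python
# Assumption: 20% of the 7620 matches were held out for testing.
n_matches = 7620
n_test = n_matches // 5

# Accuracy scores as reported above, keyed by (feature, model).
accuracies = {
    ('gold_diff', 'logistic_regression'): 0.9665354330708661,
    ('gold_diff', 'random_forest'): 0.9625984251968503,
    ('kda_diff', 'logistic_regression'): 0.94750656167979,
    ('kda_diff', 'random_forest'): 0.9488188976377953,
    ('wrdiff', 'logistic_regression'): 0.5912073490813649,
    ('wrdiff', 'random_forest'): 0.5190288713910761,
}

# accuracy = correct / n_test, so correct = accuracy * n_test.
correct = {key: round(acc * n_test) for key, acc in accuracies.items()}

for (feature, model), count in correct.items():
    print(f"{feature:10s} {model:20s} {count} / {n_test}")
```

Under that assumption, the two gold difference models differ by only 6 test matches (1473 versus 1467) and the two KDA difference models by 2 (1444 versus 1446), so the gaps between models on those features are small.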

### Reaching The Bottom

After obtaining the accuracy scores, we analyzed the results obtained from the classifier models in predicting the outcome of League of Legends matches.  <br>
This included the confusion matrix that helped visualize the true positive, true negative, false positive, and false negative predictions.

```python
cm = confusion_matrix(y_test, y_pred)

# Create a DataFrame from the confusion matrix
cm_df = pd.DataFrame(cm, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])

# Visualize the confusion matrix using Seaborn's heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(cm_df, annot=True, cmap='Greens', fmt='g')
plt.title('Confusion Matrix')
```
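For reference, scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, which is why the DataFrame above labels its index "Actual" and its columns "Predicted". The sketch below shows the layout on a handful of made-up labels; `y_true` and `y_hat` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = blue team win, 0 = blue team loss.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)

# For binary labels the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total number of test matches.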

## Confusion Matrix for kda_diff with forest model:

![someimg](../ckdatree.png)

## Confusion Matrix for kda_diff with log model:

![someimg2](../ckdalog.png)

## Confusion Matrix for gold_diff with forest model:

![someimg3](../cgolddifftree.png)

## Confusion Matrix for gold_diff with log model:

![someimg4](../cgolddifflog.png)

## Confusion Matrix for wrdiff with forest model:

![someimg5](../cwrdifftree.png)

## Confusion Matrix for wrdiff with log model:

![someimg6](../cwrdifflog.png)




From looking at the accuracies and confusion matrices of the models separated by feature, the gold difference via logistic regression yielded the best results.

# Discussion

### Interpreting the result


Main Point: 

Gold difference is the best feature at determining the winner of a League of Legends match.
Across all models and all features, the two highest accuracies both came from the gold difference models. Gold difference via logistic regression yielded a very high accuracy of approximately 0.967.

Sub Point: 

There was no clear best model for all features. The two models' accuracies were close for gold difference and KDA difference; only for wrdiff did they differ noticeably, by about 7 percentage points. Logistic regression performed better for gold difference and wrdiff, while random forest was better for KDA difference.


Sub Point: 

Preliminary results do not tell the whole story. KDA difference had the highest correlation, so one might have assumed that it would also have the highest accuracy. It did not. Recall and precision were a better indicator of accuracy, with gold difference having the highest precision and recall of the three features.

### Limitations

The dataset was somewhat limited in that it required substantial preprocessing before its features could be used. Additionally, it did not track other potentially useful features like damage and damage share. Our question is quite open-ended, and it would have been good to explore how the models performed with the features combined. We think this would have given a more definitive answer on which model was best for this task regardless of feature.

### Ethics & Privacy

There is a privacy concern in that the players in these League of Legends matches may not have consented to their data being shared. While this could be considered an ethical issue, the matches are broadcast publicly, so players know the data is publicly available, and no sensitive data is disclosed.

One issue could be that someone uses this data to gain the upper hand in the match by leveraging statistics. This could be considered an ethical issue since it might be an unfair advantage. Our team will address these issues by informing players of what data is being collected, how it might impact them, where to find it, and how to use it.

### Conclusion


Gold difference with a logistic regression model is the best at determining the winner of a League of Legends match. Our results support this through that model's high accuracy, recall, and precision, and the confusion matrix strengthens the conclusion further. Gold difference also had the best statistics of the three features on every measure except correlation. Finding which features and models work best for a prediction task is very important in machine learning and should be considered in every machine learning problem. By improving feature and model selection, one can obtain better predictions. In the future, combining the features to find a definitive best overall model would be ideal.

# Footnotes

<a name="quintananote"></a>1.[^](#quintana): Quintana, Diego A. (30 Aug 2019) Predicting Wins in League of Legends. *R Studio Pubs*. <br>https://rstudio-pubs-static.s3.amazonaws.com/533941_5ecc18a32e0b4603bd3fa2c46f05a174.html<br>
<a name="linote"></a>2.[^](#li): Li, Yingshan. Predicting Wins Using First Ten Minutes. *R Studio Pubs*. <br>https://rstudio-pubs-static.s3.amazonaws.com/912026_793d82d1630641bfad6ce05cb5c18385.html<br>
<a name="Kaggle"></a>3.[^](#kaggledata): Ephron, Chuck. League of Legends. *Kaggle Datasets.* <br>https://www.kaggle.com/datasets/chuckephron/leagueoflegends<br>