# Importing packages

In [1]:
!pip install pandas-profiling


In [2]:
import logging
import math
import warnings

import numpy as np
import pandas as pd
import plotly.express as px
from pandas_profiling import ProfileReport
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from statsmodels.discrete.discrete_model import BinaryResultsWrapper, Logit
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [3]:
logging.basicConfig(level=logging.ERROR)
warnings.filterwarnings('ignore')

# Loading dataset and first visualization

In [4]:
df = pd.read_csv("../input/league-of-legends-diamond-ranked-games-10-min/high_diamond_ranked_10min.csv")
df.head()

Unnamed: 0,gameId,blueWins,blueWardsPlaced,blueWardsDestroyed,blueFirstBlood,blueKills,blueDeaths,blueAssists,blueEliteMonsters,blueDragons,...,redTowersDestroyed,redTotalGold,redAvgLevel,redTotalExperience,redTotalMinionsKilled,redTotalJungleMinionsKilled,redGoldDiff,redExperienceDiff,redCSPerMin,redGoldPerMin
0,4519157822,0,28,2,1,9,6,11,0,0,...,0,16567,6.8,17047,197,55,-643,8,19.7,1656.7
1,4523371949,0,12,1,0,5,5,5,0,0,...,1,17620,6.8,17438,240,52,2908,1173,24.0,1762.0
2,4521474530,0,15,0,0,7,11,4,1,1,...,0,17285,6.8,17254,203,28,1172,1033,20.3,1728.5
3,4524384067,0,43,1,0,4,5,5,1,0,...,0,16478,7.0,17961,235,47,1321,7,23.5,1647.8
4,4436033771,0,75,4,0,6,6,6,0,0,...,0,17404,7.0,18313,225,67,1004,-230,22.5,1740.4


In [5]:
profile = ProfileReport(df, explorative=False, interactions={'continuous': False})
profile



According to the report, many variables are highly correlated.

For example, `redCSPerMin` is highly correlated with `redTotalMinionsKilled`, which makes sense: at a fixed 10-minute mark, the CS per minute is simply the total number of minions killed divided by 10.

We will therefore probably face a multicollinearity problem.

From a League of Legends player's perspective, a few other points stand out:
- On average, red side takes the first drake more often (41.3% vs 36.2%). This is surprising, as it is supposed to be easier for blue side both to ward the drake pit and to steal the drake.
- On average, blue side takes the first herald more often (18.8% vs 16%).
- The win probability seems to be independent of the side, which is a bit surprising, as the side is supposed to have a significant (though small) impact on the win probability because:
    - Blue side picks first, which can guarantee an "OP" champion, while red side can counterpick in response. The draft is not captured in this data, but we assume it has little impact on the win probabilities here: we are analyzing SoloQ data, whereas the draft effect is mostly observable in competitive matches, where the level of play is far above Diamond.
    - Because of the way the map is laid out, the odds of securing objectives vary slightly from one side to the other.
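
One way to check whether the side really matters is to test the observed blue win rate against 50%. The sketch below is a minimal illustration using only the standard library and a normal approximation; the function name `win_rate_z_test` and the example counts are hypothetical and are not taken from the dataset.

```python
import math


def win_rate_z_test(wins: int, games: int, expected_rate: float = 0.5) -> tuple:
    """Two-sided z-test of an observed win rate against `expected_rate`.

    Uses the normal approximation to the binomial distribution, which is
    reasonable when `games` is large.
    """
    observed_rate = wins / games
    standard_error = math.sqrt(expected_rate * (1 - expected_rate) / games)
    z = (observed_rate - expected_rate) / standard_error
    # Two-sided p-value: P(|Z| >= |z|) for a standard normal Z
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only (not the real dataset figures)
z, p_value = win_rate_z_test(wins=5050, games=10000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

With the notebook's data, `wins` would be `df["blueWins"].sum()` and `games` would be `len(df)`. A large p-value would be consistent with the side having no detectable effect on the win rate.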
    

# Feature Selection

The goal of this section is to select the features we will use to predict the winner of the game.
We first rely on the correlation matrix, then slim down the feature set by checking that the remaining features do not have a high VIF.

In [6]:
correlation = df.corr()["blueWins"]

In [7]:
# Keep only the variables whose absolute correlation with blueWins is at least 0.2
THRESHOLD = 0.2
# Series.items() replaces Series.iteritems(), which was removed in pandas 2.0
features = [feature for feature, corr_coef in correlation.items() if abs(corr_coef) >= THRESHOLD]

In [8]:
# Drop the variables that mirror a blue-side variable, e.g. blueKills == redDeaths
symetric_features = ["redFirstBlood", "redKills", "redDeaths", "redGoldDiff", "redExperienceDiff"]
# Drop the redundant variables: the total amount of gold earned matters less than the difference with the enemy team
redundant_features = ["redTotalGold", "redAvgLevel", "redTotalExperience", "blueTotalGold", "blueAvgLevel", "blueTotalExperience"]
# Drop highly correlated features, e.g. redCSPerMin is highly correlated with redTotalMinionsKilled
multicolinear_features = ["redCSPerMin", "blueCSPerMin", "blueEliteMonsters", "redEliteMonsters", "blueGoldPerMin", "redGoldPerMin"]

features = set(features) - set(symetric_features) - set(redundant_features) - set(multicolinear_features)
features.remove("blueWins")  # Remove the target variable from the features

In [9]:
# Recent pandas versions reject a set as a column indexer, so convert it to a list first
df_features = df[list(features)]

In [10]:
df_features.head()

Unnamed: 0,blueGoldDiff,redTotalMinionsKilled,blueDragons,blueTotalMinionsKilled,redDragons,redAssists,blueExperienceDiff,blueFirstBlood,blueKills,blueDeaths,blueAssists
0,643,197,0,195,0,8,-8,1,9,6,11
1,-2908,240,0,174,1,2,-1173,0,5,5,5
2,-1172,203,1,186,0,14,-1033,0,7,11,4
3,-1321,235,0,201,0,10,-7,0,4,5,5
4,-1004,225,0,210,1,7,230,0,6,6,6


In [11]:
def check_colinearity_from_correlation(dataframe: pd.DataFrame, threshold: float) -> None:
    """Print each pair of columns whose correlation is at least `threshold`."""
    correlation = dataframe.corr()
    analyzed_pair = set()  # unordered pairs already visited, so each pair is reported once
    for index, serie in correlation.iterrows():
        for column, value in serie.items():
            pair = tuple(sorted((index, column)))
            # Note: only positive correlations are flagged (no abs() is applied)
            if index != column and value >= threshold and pair not in analyzed_pair:
                print(f"There may be colinearity between {index} and {column}. Correlation : {value}")
            analyzed_pair.add(pair)


def check_colinearity_from_vif(dataframe: pd.DataFrame, threshold: float) -> None:
    """Print each column whose variance inflation factor is at least `threshold`."""
    variance_inflation = [variance_inflation_factor(dataframe.values, i) for i in range(dataframe.shape[1])]
    for feature, vif in zip(dataframe.columns, variance_inflation):
        if vif >= threshold:
            print(f"{feature} may be colinear. VIF : {vif}")

In [12]:
print("Multicolinearity check using correlation :\n")
check_colinearity_from_correlation(df_features, threshold=0.7)
print("\nMulticolinearity check using VIF :\n")
check_colinearity_from_vif(df_features, 10)

Multicolinearity check using correlation :

There may be colinearity between blueGoldDiff and blueExperienceDiff. Correlation : 0.8947294549589994
There may be colinearity between redAssists and blueDeaths. Correlation : 0.8040234763151517
There may be colinearity between blueKills and blueAssists. Correlation : 0.8136672492803579

Multicolinearity check using VIF :

blueGoldDiff may be colinear. VIF : 15.547928173629156
redTotalMinionsKilled may be colinear. VIF : 122.44331515672044
blueTotalMinionsKilled may be colinear. VIF : 121.97734882305433
redAssists may be colinear. VIF : 11.468409676363617
blueKills may be colinear. VIF : 33.912320715337245
blueDeaths may be colinear. VIF : 33.516306789199746
blueAssists may be colinear. VIF : 11.991397444271854


Despite our first attempt to handle collinearity manually during feature selection, some multicollinearity remains.

We will first check whether this initial selection of variables achieves good performance; if not, we will look for ways to remove the remaining multicollinearity.
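
To read the VIF values above, it helps to recall the textbook definition VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. The sketch below simply inverts that formula. The helper names are ours, and it assumes this standard definition; the exact R² used by a given library may differ, for instance when the design matrix has no intercept column.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """Variance inflation factor implied by the R^2 of a feature regressed on the others."""
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif: float) -> float:
    """Inverse of the formula above: the R^2 implied by a given VIF."""
    return 1.0 - 1.0 / vif


# VIF values copied (rounded) from the output above
for feature, vif in [
    ("blueGoldDiff", 15.55),
    ("blueKills", 33.91),
    ("redTotalMinionsKilled", 122.44),
]:
    print(f"{feature}: VIF = {vif} -> implied R^2 = {r_squared_from_vif(vif):.3f}")
```

Under this definition, the threshold of 10 used above corresponds to an R² of 0.9, i.e. 90% of a feature's variance being explained by the other features.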

# Logistic Regression Fitting

In [13]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df_features, df["blueWins"], test_size=0.2, random_state=42)

model = Logit(y_train, X_train).fit()

Optimization terminated successfully.
         Current function value: 0.530349
         Iterations 6


In [14]:
model.summary()

Dep. Variable:,blueWins,No. Observations:,7903.0
Model:,Logit,Df Residuals:,7892.0
Method:,MLE,Df Model:,10.0
Date:,"Mon, 08 Nov 2021",Pseudo R-squ.:,0.2349
Time:,21:39:18,Log-Likelihood:,-4191.3
converged:,True,LL-Null:,-5477.9
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
blueGoldDiff,0.0005,4.54e-05,10.024,0.000,0.000,0.001
redTotalMinionsKilled,0.0042,0.001,3.116,0.002,0.002,0.007
blueDragons,0.3181,0.071,4.453,0.000,0.178,0.458
blueTotalMinionsKilled,-0.0042,0.001,-3.126,0.002,-0.007,-0.002
redDragons,-0.2294,0.070,-3.284,0.001,-0.366,-0.093
redAssists,0.0079,0.012,0.672,0.502,-0.015,0.031
blueExperienceDiff,0.0003,3.45e-05,7.585,0.000,0.000,0.000
blueFirstBlood,0.0916,0.058,1.583,0.113,-0.022,0.205
blueKills,-0.0256,0.023,-1.098,0.272,-0.071,0.020


Several variables that we expected to matter for the probability of winning turn out to be statistically insignificant.

Among them:
- blueDeaths
- blueKills
- blueFirstBlood
- blueAssists
- redAssists

This is less unexpected than it may seem at first.

For example, `redAssists` and `blueAssists` are clearly insignificant, but we saw earlier that they are largely redundant, being highly correlated with `blueDeaths` and `blueKills` respectively.

Apart from `blueAssists`, `blueKills` is not highly correlated with any other variable in our feature set. However, its high VIF (about 34) suggests that it is collinear with a combination of other variables; we suspect `blueGoldDiff`, since more kills means more gold and thus a higher `blueGoldDiff`.

`blueDeaths` also has a high p-value, probably for the mirror reason: more deaths means more gold for the enemy team, which lowers `blueGoldDiff`.

In [15]:
predictions = model.predict(X_test)
predictions = [1 if x >= 0.5 else 0 for x in predictions]

In [16]:
def compute_metrics(y_true: np.array, y_pred: np.array) -> None:
    print(f"Accuracy = {accuracy_score(y_true, y_pred)}")
    print(f"F1 = {f1_score(y_true, y_pred)}")
    print(f"Recall = {recall_score(y_true, y_pred)}")
    print(f"Precision = {precision_score(y_true, y_pred)}")

In [17]:
compute_metrics(y_test, predictions)

Accuracy = 0.7348178137651822
F1 = 0.738
Recall = 0.743202416918429
Precision = 0.7328699106256207


Despite the presence of some redundant variables, the regression reaches a fairly good accuracy (about 73.5%), even though we only consider the first 10 minutes of the game.
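
For reference, the four metrics printed above can be derived by hand from the confusion counts. The sketch below is a minimal pure-Python version on toy labels; the function name `classification_metrics` and the toy data are ours, and the results should agree with the scikit-learn functions for binary labels where 1 is the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probabilities = [0.9, 0.6, 0.4, 0.2, 0.7, 0.1, 0.8, 0.3]
# Same 0.5 threshold as in the notebook
y_pred = [1 if p >= 0.5 else 0 for p in probabilities]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```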

In [18]:
def filter_significant_variables(model: BinaryResultsWrapper, pvalue_threshold: float) -> list:
    return [variable for variable, pvalue in model.pvalues.items() if pvalue <= pvalue_threshold]

def plot_variable_impact_on_odds_ratio(model: BinaryResultsWrapper, pvalue_threshold: float = 0.05) -> None:
    significant_variables = filter_significant_variables(model, pvalue_threshold)
    coefficients = {variable: coef for variable, coef in model.params.items() if variable in significant_variables}
    # exp(coef) is the odds ratio, so (exp(coef) - 1) * 100 is the percent change in odds per unit (any sign)
    odds = [(variable, (math.exp(coef) - 1) * 100) for variable, coef in coefficients.items()]
    df = pd.DataFrame(odds, columns=["Variable", "Impact on odds of winning (in %)"])
    fig = px.bar(df, x="Impact on odds of winning (in %)", y="Variable", orientation='h')
    fig.show()

In [19]:
significant_variables = filter_significant_variables(model, 0.05)
subset_df = df[significant_variables]

print("Multicolinearity check using correlation :\n")
check_colinearity_from_correlation(subset_df, threshold=0.7)
print("\nMulticolinearity check using VIF :\n")
check_colinearity_from_vif(subset_df, 10)

Multicolinearity check using correlation :

There may be colinearity between blueGoldDiff and blueExperienceDiff. Correlation : 0.8947294549589994

Multicolinearity check using VIF :

redTotalMinionsKilled may be colinear. VIF : 87.71166884499301
blueTotalMinionsKilled may be colinear. VIF : 87.44882379275072


Some variables still appear collinear, but restricting the set to the significant variables reduced the number of flagged variables and lowered the VIF of `blueTotalMinionsKilled` and `redTotalMinionsKilled` from about 122 to about 87.

In [20]:
plot_variable_impact_on_odds_ratio(model)

Simple interpretation of the results (each effect holds all the other variables fixed):
- When the blue team gets the first dragon, its winning odds increase by 37%.
- When the red team gets the first dragon, the blue team's winning odds decrease by 20.4%.
- For each additional minion killed by the red team, the blue team's winning odds increase by 0.42%.
- For each additional minion killed by the blue team, the blue team's winning odds decrease by 0.42%.
- For each additional experience point of lead over the red team, the blue team's winning odds increase by 0.02%.
- For each additional gold of lead over the red team, the blue team's winning odds increase by 0.04%.
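
These percentages come from converting each logit coefficient into a percent change in odds, `(exp(coef) - 1) * 100`. A minimal sketch of that conversion, using the rounded coefficients from the summary table (so the results can differ slightly from those obtained with the unrounded fitted values), could be:

```python
import math


def percent_change_in_odds(coef: float) -> float:
    """Percent change in the odds for a one-unit increase of the variable."""
    return (math.exp(coef) - 1) * 100


# Rounded coefficients copied from the model summary above
coefficients = {
    "blueDragons": 0.3181,
    "redDragons": -0.2294,
    "redTotalMinionsKilled": 0.0042,
    "blueTotalMinionsKilled": -0.0042,
}

for variable, coef in coefficients.items():
    print(f"{variable}: {percent_change_in_odds(coef):+.2f}%")
```

A positive coefficient multiplies the odds by a factor above 1, a negative one by a factor below 1, which is why the same formula covers both signs.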

## Things to notice

Within the first 10 minutes of the game, the dragon has a large impact on the odds of winning. However, this impact is not symmetric: when the blue team gets the first dragon, its odds of winning increase by 37%, almost twice the size of the 20.4% decrease observed when the red team gets it. On the log-odds scale the gap is narrower (coefficients of 0.318 and -0.229) but still present.

Paradoxically, the more minions a team kills, the lower its odds of winning. This seems counterintuitive, but it can be explained: the more minions a team kills, the more it pushes its lanes, which brings the enemy team close to its own turrets and makes it much harder for the pushing team to get kills than when it is the one being pushed. Keep in mind that this coefficient is estimated with `blueGoldDiff` held fixed and that both minion counts have very high VIFs, so it should be read with caution.

Regarding the small per-unit increases for `blueGoldDiff` and `blueExperienceDiff`: one unit of either is very little. Their standard deviations are 2453 and 1920 respectively. As an illustration, if the blue team leads by 2453 gold or by 1920 experience points, a linear approximation gives an increase in its odds of winning of about 98% (2453 * 0.04) or 38.4% (1920 * 0.02) respectively. Because per-unit effects on odds compound multiplicatively rather than add up, the exact increases are larger: about 167% (1.0004^2453 - 1) and about 47% (1.0002^1920 - 1).
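
Multiplying a per-unit effect by a number of units is only a first-order approximation, since per-unit changes in odds compound multiplicatively. A sketch comparing both computations, taking the per-unit percentages and standard deviations quoted in the text as given (the function names are ours), could look like this:

```python
def linear_percent_change(per_unit_percent: float, units: float) -> float:
    """First-order approximation: per-unit effect times the number of units."""
    return per_unit_percent * units


def compounded_percent_change(per_unit_percent: float, units: float) -> float:
    """Exact change in odds: the per-unit odds ratio raised to the number of units."""
    per_unit_ratio = 1 + per_unit_percent / 100
    return (per_unit_ratio ** units - 1) * 100


# (variable, per-unit change in odds in %, one standard deviation), as quoted in the text
cases = [("blueGoldDiff", 0.04, 2453), ("blueExperienceDiff", 0.02, 1920)]

for variable, per_unit_percent, std in cases:
    linear = linear_percent_change(per_unit_percent, std)
    compounded = compounded_percent_change(per_unit_percent, std)
    print(f"{variable}: linear = {linear:.1f}%, compounded = {compounded:.1f}%")
```

The compounded figure is always the larger of the two for a positive effect, and the gap widens as the lead grows.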

Despite their small coefficients, the two most important factors, at least in the first 10 minutes of a League of Legends game, are as expected the gold generated and the experience acquired by the team relative to its opponent.

# Attempt to add some relevant yet uncorrelated features to the Logistic Regression

We filtered out many features by keeping only those whose absolute correlation with the target variable was at least 0.2. As LoL players, however, we think some features were unfairly removed, among them:
- blueWardsPlaced: more wards means more skirmishes (supposedly positively correlated with `blueKills` and thus with `blueGoldDiff`) and fewer deaths (supposedly also raising `blueGoldDiff`, since fewer deaths means less gold for the enemy team)
- blueTowersDestroyed: more towers destroyed means more gold and more vision

Remark: the herald could also be considered. However, since its main purpose is to destroy turrets, we do not include it in the following regression, as it may introduce multicollinearity with `blueTowersDestroyed`.

In [21]:
features = significant_variables + ["blueWardsPlaced", "blueTowersDestroyed"]
df_features = df[features]

In [22]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df_features, df["blueWins"], test_size=0.2, random_state=42)

model = Logit(y_train, X_train).fit()

Optimization terminated successfully.
         Current function value: 0.530699
         Iterations 6


In [23]:
model.summary()

Dep. Variable:,blueWins,No. Observations:,7903.0
Model:,Logit,Df Residuals:,7895.0
Method:,MLE,Df Model:,7.0
Date:,"Mon, 08 Nov 2021",Pseudo R-squ.:,0.2344
Time:,21:39:22,Log-Likelihood:,-4194.1
converged:,True,LL-Null:,-5477.9
Covariance Type:,nonrobust,LLR p-value:,0.0

,coef,std err,z,P>|z|,[0.025,0.975]
blueGoldDiff,0.0004,2.68e-05,15.492,0.000,0.000,0.000
redTotalMinionsKilled,0.0032,0.001,2.761,0.006,0.001,0.005
blueDragons,0.3174,0.071,4.453,0.000,0.178,0.457
blueTotalMinionsKilled,-0.0032,0.001,-2.795,0.005,-0.005,-0.001
redDragons,-0.2230,0.070,-3.197,0.001,-0.360,-0.086
blueExperienceDiff,0.0003,3.22e-05,7.870,0.000,0.000,0.000
blueWardsPlaced,-0.0010,0.001,-0.700,0.484,-0.004,0.002
blueTowersDestroyed,0.0034,0.141,0.024,0.981,-0.272,0.279


The two newly introduced variables are highly insignificant. This is unexpected from a player's perspective but understandable from a statistical point of view.

Indeed, as presented earlier, both variables are indirectly related to the amount of gold generated, which directly drives `blueGoldDiff`, the key variable identified above.
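
The `P>|z|` column behind this conclusion follows directly from the `z` column. Assuming the usual Wald test with a standard normal reference distribution, the p-value is the two-sided tail probability of `z`; a minimal standard-library sketch (the function name is ours) is:

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value of a z statistic under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# z statistics copied from the summary table above
for variable, z in [("blueWardsPlaced", -0.700), ("blueTowersDestroyed", 0.024)]:
    print(f"{variable}: z = {z:+.3f}, p-value = {two_sided_p_value(z):.3f}")
```

Both values are far above the 0.05 threshold used in `filter_significant_variables`, so neither variable would be kept.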

Another interesting point is the coefficient of `blueWardsPlaced`. Although it is insignificant, it is negative, which would mean that the more wards placed, the lower the odds of winning.

This seems counterintuitive, but it could be explained as follows:
- More wards placed potentially means more wards destroyed, which generates gold for the enemy team and thus reduces `blueGoldDiff`.
- More wards placed means more wards bought, i.e. more gold spent on things that do not increase the champions' fighting power, which probably tends to reduce the number of kills and/or increase the number of deaths and, *in fine*, reduces `blueGoldDiff`.

# Conclusion

The main goal of this first analysis based on **Logistic Regression** was to identify the key factors that can, within the first 10 minutes of a LoL game, impact a team's odds of winning.

This analysis confirmed that both the **gold** generated and the **experience** acquired are key factors in determining which team wins.

We also learnt that the dragon has a non-negligible effect on the odds of winning, but that this effect is asymmetric, as the change in the odds of winning when a team gets the first dragon is much greater for the blue team than for the red team.

With a Logistic Regression that has not been fine-tuned and a probability threshold set at **0.5**, we managed to predict the outcome of a LoL game with 73.5% accuracy from data about the first 10 minutes only. This is pretty good, keeping in mind that this study could not capture the effects of:
- Tilt: This happens when players get angry and frustrated after a succession of bad events during the game, which hurts their performance. It matters because even when every indicator points towards a win, a few minutes of tilt are enough to drastically decrease a team's odds of winning.
- Nashor: As we only analyze the first 10 minutes, we cannot take into account the bonus provided by the Nashor, which can completely change the situation in some games.
- AFK: We do not have any information about whether a player left the game. Even if everything is going well and we predict high odds of winning, having one player leave the game should cause a large drop in those odds.
- Champions: We do not take into account which champions are played. All else being equal, a large lead on a champion considered "early" could lead to fast snowballing of the game.
- Distribution of the gold: We know how much gold a team earned, but not how it is distributed among its players. Having gold evenly distributed within a team should increase the odds of winning compared with a situation in which the `blueGoldDiff` is mostly concentrated on a single player.
