In [2]:
import numpy as np
import pandas as pd
import easydatascience as eds

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from xgboost import XGBClassifier
import keras

from sklearn.metrics import accuracy_score

vanilla_data = pd.read_csv('vanilla_data.csv')
diff_team_data = pd.read_csv('diff_team_data.csv')
diff_data = pd.read_csv('diff_data.csv')
compact_data = pd.read_csv('compact_data.csv')
compact_diff_data = pd.read_csv('compact_diff_data.csv')
bool_team_data = pd.read_csv('bool_team_data.csv')

data = {'vanilla': vanilla_data,
        'diff_team': diff_team_data,
        'diff': diff_data,
        'compact': compact_data,
        'compact_diff': compact_diff_data,
        'bool': bool_team_data}

## Boolean Data
The first dataset I want to examine contains only boolean values. Because it is so simple, it has several advantages:
- It doesn't require much computational power, so it can be used on any machine.
- Since the data is simple, the model can't be too complicated either (easy to understand).
- It can be good for some simple early predictions.
<br><br>
&emsp;Of course, everything mentioned above is a working assumption (as will be the case for every dataset I examine) and should not be taken as true until proven.

In [25]:
X_train, X_cv, y_train, y_cv = train_test_split(data['bool'].iloc[:, 1:],
                                     data['bool']['Blue_Won'], test_size=0.1)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.1)

lr = LogisticRegression()
lr.fit(X_train, y_train)

print('Cross Validation Accuracy:', round(accuracy_score(y_cv, lr.predict(X_cv)), 4))

Cross Validation Accuracy: 0.9812


&emsp;My best guess is that the model is already saturated by similar, highly correlated features. Remember, our goal is not to predict the winner from the data of a game that has already finished; we want to see the probabilities during the game. So let's first simplify the model and then run an experiment/simulation.
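
&emsp;One way to check the guess about correlated features is to look at the pairwise correlation matrix. The sketch below uses a small synthetic frame instead of the real `bool_team_data.csv`; the column values, the perfectly complementary inhibitor columns, and the 0.8 threshold are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the boolean dataset (values are made up).
rng = np.random.default_rng(0)
n = 1000
blue_first_inhib = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'Blue_FirstInhib': blue_first_inhib,
    # Toy assumption: exactly one team takes the first inhibitor.
    'Purp_FirstInhib': 1 - blue_first_inhib,
    'Blue_FirstDragon': rng.integers(0, 2, size=n),
})

corr = toy.corr()

# Collect feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.8
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

On the real data, the same loop over `data['bool'].iloc[:, 1:].corr()` would list the candidate redundant pairs.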

In [142]:
bool1 = data['bool'].iloc[:, :12]
X1_train, X1_test, y1_train, y1_test = train_test_split(bool1.iloc[:, 1:],
                                         bool1['Blue_Won'], test_size=0.1)
lr = LogisticRegression()
lr.fit(X1_train, y1_train)

print('New data accuracy:', round(accuracy_score(y1_test, lr.predict(X1_test)), 4))

sample_data = pd.DataFrame(0.5, index=range(1), columns=bool1.columns[1:])

def print_sample():
    print('\nSample Data:')
    display(sample_data)
    
print_sample()

def proba_byclass():
    print('---------------------\nProbability by class:')
    for class_, proba in zip(lr.classes_, lr.predict_proba(sample_data)[0]):
        print(str(class_)+':', round(proba, 4))
    print('---------------------')
        
proba_byclass()

New data accuracy: 0.9502

Sample Data:


| | Blue_FirstBlood | Blue_FirstTower | Blue_FirstInhib | Blue_FirstBaron | Blue_FirstDragon | Blue_FirstHerald | Purp_FirstInhib | Purp_FirstBaron | Purp_FirstHerald | Blue_KDA | Blue_VisionScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |


---------------------
Probability by class:
0: 0.5193
1: 0.4807
---------------------


&emsp;These default results are expected, since the Purple team is slightly favored. Let's see how this plays out. __The Blue team gets the First Blood__.

In [143]:
sample_data.loc[0, 'Blue_KDA'] = 1
sample_data.loc[0, 'Blue_FirstBlood'] = 1

print_sample()
proba_byclass()


Sample Data:


| | Blue_FirstBlood | Blue_FirstTower | Blue_FirstInhib | Blue_FirstBaron | Blue_FirstDragon | Blue_FirstHerald | Purp_FirstInhib | Purp_FirstBaron | Purp_FirstHerald | Blue_KDA | Blue_VisionScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 |


---------------------
Probability by class:
0: 0.1063
1: 0.8937
---------------------


&emsp;This looks a bit oversaturated. The problem is probably the KDA ratio: the model was trained on data from finished games, and KDA biases the prediction heavily. We should keep that in mind from now on, but for now let's just train the same model without that feature. The coefficients show the same thing:

In [144]:
print('---------------------\nLog. Reg. Coefficients:\n')
for ftr, coef in zip(sample_data.columns, lr.coef_[0]):
    print(ftr+':', np.round(coef, 4))
print('---------------------')

---------------------
Log. Reg. Coefficients:

Blue_FirstBlood: -0.1043
Blue_FirstTower: 0.556
Blue_FirstInhib: 1.718
Blue_FirstBaron: 0.4588
Blue_FirstDragon: 0.6051
Blue_FirstHerald: -0.02
Purp_FirstInhib: -1.3735
Purp_FirstBaron: -0.7782
Purp_FirstHerald: -0.0991
Blue_KDA: 4.5183
Blue_VisionScore: 0.3196
---------------------
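
&emsp;To see how much of the jump comes from KDA alone, the arithmetic can be redone by hand. This is a minimal sketch that only uses numbers already printed above (the two coefficients and the 0.4807 baseline probability); the intercept is never read from the model, it is implied by the baseline. Moving a feature from 0.5 to 1 adds half of its coefficient to the logit.

```python
import math

# Coefficients copied from the printed Logistic Regression output.
coef = {
    'Blue_FirstBlood': -0.1043,
    'Blue_KDA': 4.5183,
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# Baseline: every feature at 0.5, P(Blue wins) = 0.4807 as printed earlier.
baseline_logit = logit(0.4807)

# Moving a feature from 0.5 to 1 shifts the logit by 0.5 * coefficient.
kda_shift = 0.5 * coef['Blue_KDA']
first_blood_shift = 0.5 * coef['Blue_FirstBlood']

p_after = sigmoid(baseline_logit + kda_shift + first_blood_shift)
print('KDA shift:', round(kda_shift, 4))
print('First Blood shift:', round(first_blood_shift, 4))
print('P(Blue wins):', round(p_after, 4))
```

The result should land close to the 0.8937 reported for the First Blood sample, with almost all of the shift coming from `Blue_KDA`.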


In [155]:
# No KDA
bool2 = data['bool'].iloc[:, :12].drop('Blue_KDA', axis=1)
X2_train, X2_test, y2_train, y2_test = train_test_split(bool2.iloc[:, 1:],
                                         bool2['Blue_Won'], test_size=0.1)
lr.fit(X2_train, y2_train)
print('Accuracy:', round(accuracy_score(y2_test, lr.predict(X2_test)), 4), '\n')

sample_data = pd.DataFrame(0.5, index=range(1), columns=bool2.columns[1:])

proba_byclass()

sample_data.loc[0, 'Blue_FirstBlood'] = 1

print_sample()
proba_byclass()

Accuracy: 0.9061 

---------------------
Probability by class:
0: 0.5241
1: 0.4759
---------------------

Sample Data:


| | Blue_FirstBlood | Blue_FirstTower | Blue_FirstInhib | Blue_FirstBaron | Blue_FirstDragon | Blue_FirstHerald | Purp_FirstInhib | Purp_FirstBaron | Purp_FirstHerald | Blue_VisionScore |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |


---------------------
Probability by class:
0: 0.4902
1: 0.5098
---------------------


&emsp;Well, that looks much better! The initial probability of the Blue team winning got a bit lower, and I can't explain that right now. It obviously has something to do with the KDA, but we are going to leave it for now.
<br><br>
&emsp;On the other hand, if you've been paying attention, you might have noticed that Blue_FirstBlood has a negative coefficient in the Logistic Regression model for some reason (Blue_FirstHerald does not look great either). I am not going to explore that in detail, so let's just say for now that the model might not be right for the job or that the data doesn't contain enough information.
<br><br>
&emsp;With that said, __The Purple team uses its jungle vision pressure to take the First Dragon, but the Blue team quickly counters by taking the First Tower. The Blue team also takes the First Herald and applies even more pressure, even though the Purple team has more map control through Vision. The Blue team snowballs fast, taking the First Inhibitor and eventually the First Baron too. By now they have naturally taken over the Vision, and they end the game rather fast, by the 25-minute mark. GG!__

In [156]:
sample_data.loc[0, 'Blue_FirstDragon'] = 0
print('First Dragon taken by Purple.')
proba_byclass()

sample_data.loc[0, 'Blue_FirstTower'] = 1
print('\nFirst Tower taken by Blue.')
proba_byclass()

sample_data.loc[0, 'Blue_FirstHerald'] = 1
sample_data.loc[0, 'Purp_FirstHerald'] = 0
sample_data.loc[0, 'Blue_VisionScore'] = 0
print('\nFirst Herald taken by Blue.')
print('Purple controls the Vision.')
proba_byclass()
print_sample()

sample_data.loc[0, 'Blue_FirstInhib'] = 1
sample_data.loc[0, 'Purp_FirstInhib'] = 0
print('\nFirst Inhib taken by Blue.')
proba_byclass()

sample_data.loc[0, 'Blue_FirstBaron'] = 1
sample_data.loc[0, 'Purp_FirstBaron'] = 0
sample_data.loc[0, 'Blue_VisionScore'] = 1
print('\nFirst Baron taken by Blue.')
print('Blue controls the Vision.')
proba_byclass()
print_sample()

First Dragon taken by Purple.
---------------------
Probability by class:
0: 0.5802
1: 0.4198
---------------------

First Tower taken by Blue.
---------------------
Probability by class:
0: 0.4725
1: 0.5275
---------------------

First Herald taken by Blue.
Purple controls the Vision.
---------------------
Probability by class:
0: 0.6064
1: 0.3936
---------------------

Sample Data:


| | Blue_FirstBlood | Blue_FirstTower | Blue_FirstInhib | Blue_FirstBaron | Blue_FirstDragon | Blue_FirstHerald | Purp_FirstInhib | Purp_FirstBaron | Purp_FirstHerald | Blue_VisionScore |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.5 | 0.5 | 0.0 | 1.0 | 0.5 | 0.5 | 0.0 | 0.0 |



First Inhib taken by Blue.
---------------------
Probability by class:
0: 0.1828
1: 0.8172
---------------------

First Baron taken by Blue.
Blue controls the Vision.
---------------------
Probability by class:
0: 0.0269
1: 0.9731
---------------------

Sample Data:


| | Blue_FirstBlood | Blue_FirstTower | Blue_FirstInhib | Blue_FirstBaron | Blue_FirstDragon | Blue_FirstHerald | Purp_FirstInhib | Purp_FirstBaron | Purp_FirstHerald | Blue_VisionScore |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
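
&emsp;The step-by-step walkthrough above repeats the same pattern: update a few columns, then print the probabilities. A possible way to package that is sketched below. The `simulate` helper, the synthetic training data, and its weights are all hypothetical and only stand in for the real `lr` model and `sample_data` frame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; feature weights below are invented for the demo.
rng = np.random.default_rng(0)
features = ['Blue_FirstTower', 'Blue_FirstInhib', 'Blue_FirstBaron']
X = pd.DataFrame(rng.integers(0, 2, size=(2000, 3)), columns=features)
true_logit = (-2.5 + 1.0 * X['Blue_FirstTower']
              + 2.5 * X['Blue_FirstInhib'] + 1.5 * X['Blue_FirstBaron'])
y = (rng.random(2000) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = LogisticRegression().fit(X, y)

def simulate(model, start_row, events):
    """Apply (label, updates) events in order; record P(class 1) after each."""
    row = start_row.copy()
    history = []
    for label, updates in events:
        for column, value in updates.items():
            row.loc[0, column] = value
        p_blue = model.predict_proba(row)[0, 1]
        history.append((label, round(float(p_blue), 4)))
    return history

# Start from the "unknown" state, with every feature at 0.5.
start = pd.DataFrame(0.5, index=range(1), columns=features)
events = [
    ('First Tower taken by Blue', {'Blue_FirstTower': 1}),
    ('First Inhib taken by Blue', {'Blue_FirstInhib': 1}),
    ('First Baron taken by Blue', {'Blue_FirstBaron': 1}),
]
history = simulate(model, start, events)
for label, p in history:
    print(f'{label}: {p}')
```

Keeping the events in a list makes it easy to replay a different game script against the same model.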


&emsp;The other binarized numeric value, Vision, has also taken its toll by having too much weight put on it. Leading by 0.1 KDA is not the same as leading by 5 KDA, and the same goes for Vision; we can't reduce either of them to a single boolean. Besides that, the Bool Data gave us a good overview of how the probabilities could move during the game.
<br><br>
&emsp;Remember that even though we are training the model on discrete binary variables, our final goal is still to estimate a probability __during__ the game, and we have no way of evaluating that. So the model's performance during the final testing is going to be purely subjective. All we can do is take care when training on the hard values and avoid mistakes like the ones seen with the binarized KDA and Vision.
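
&emsp;The point about binarized KDA can be made concrete with two made-up games. The sketch assumes the boolean was derived as "Blue's value is higher than Purple's", which may not match how `bool_team_data.csv` was actually built; the KDA numbers are hypothetical.

```python
import numpy as np

# Hypothetical KDA values for two games (not taken from the dataset).
blue_kda = np.array([3.1, 8.0])
purp_kda = np.array([3.0, 3.0])

# Binarized feature: 1 if Blue has the higher KDA, else 0.
blue_kda_bool = (blue_kda > purp_kda).astype(int)

# Numeric alternative: keep the size of the lead.
kda_diff = blue_kda - purp_kda

print(blue_kda_bool)  # both games get the same boolean value
print(kda_diff)       # the size of the lead differs a lot
```

A lead of roughly 0.1 and a lead of 5 collapse to the same input once binarized, which is why the model cannot tell them apart.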

In [None]:
# Don't forget to scale
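
For the non-boolean datasets, the scaling reminder could look roughly like the sketch below. It uses synthetic numbers on two very different scales instead of the real CSV files, and the column meanings in the comments are assumptions. The scaler is fitted on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features on very different scales (hypothetical columns).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(3.0, 1.0, size=500),    # e.g. a KDA-like ratio
    rng.normal(60.0, 15.0, size=500),  # e.g. a vision-score-like count
])
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Fit on the training split only, so no test statistics leak into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```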