# Dota 2: Win Probability Prediction
https://www.kaggle.com/c/dota-2-win-probability-prediction/discussion
<br>
https://www.coursera.org/learn/vvedenie-mashinnoe-obuchenie/peer/J1SH8/proiekt-priedskazaniia-pobieditielia-v-onlain-ighrie

## Part 1

In [1]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt

Load the training and test data and select the target variable.

In [2]:
X_train = pd.read_csv('features.csv', index_col='match_id')
y_train = X_train.radiant_win
X_test = pd.read_csv('features_test.csv', index_col='match_id')

The function MissingValuesTable reports how many values are missing in each column of a DataFrame, both as a count and as a percentage of all rows.

In [3]:
def MissingValuesTable(df):
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        by='% of Total Values', ascending=False).round(1)
        print ("Selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There're " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        return mis_val_table_ren_columns

In [4]:
MissingValuesTable(X_train)

Selected dataframe has 108 columns.
There're 12 columns that have missing values.


| Feature | Missing Values | % of Total Values |
|---|---|---|
| first_blood_player2 | 43987 | 45.2 |
| radiant_flying_courier_time | 27479 | 28.3 |
| dire_flying_courier_time | 26098 | 26.8 |
| first_blood_time | 19553 | 20.1 |
| first_blood_team | 19553 | 20.1 |
| first_blood_player1 | 19553 | 20.1 |
| dire_bottle_time | 16143 | 16.6 |
| radiant_bottle_time | 15691 | 16.1 |
| radiant_first_ward_time | 1836 | 1.9 |
| dire_first_ward_time | 1826 | 1.9 |


Drop the features that describe the outcome of the match rather than its first 5 minutes. They "look into the future" and would leak the result into the model.

In [5]:
X_train.drop(['duration', 'radiant_win', 'tower_status_radiant',
           'tower_status_dire', 'barracks_status_radiant','barracks_status_dire'], axis=1, inplace=True)
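One quick way to double-check which columns carry outcome information is to compare the column sets of the training and test data. The sketch below uses two small made-up DataFrames as stand-ins for features.csv and features_test.csv, and it assumes that the test file simply omits the outcome columns; the column values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real files (hypothetical values).
train = pd.DataFrame({
    'start_time': [1, 2],
    'r1_gold': [500, 700],
    'duration': [2400, 1800],
    'radiant_win': [1, 0],
})
test = pd.DataFrame({
    'start_time': [3],
    'r1_gold': [650],
})

# Columns present only in the training data are candidates for leakage.
train_only = sorted(set(train.columns) - set(test.columns))
print(train_only)
```

Applied to the real data, the same set difference should list the columns that are dropped above, if the assumption about the test file holds.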

### Simple Gradient Boosting classifier (the straightforward approach)


In [6]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer

Fill missing values with a very large number.
<br>
A tree can then send objects with missing values down a separate branch, which can improve classification.

In [7]:
imp = SimpleImputer(missing_values=np.nan, fill_value=1e9, strategy='constant')
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)
#X_train.fillna(value=1e9, inplace=True)
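To illustrate why a huge fill value helps tree-based models, here is a minimal sketch on made-up data in which the label is 1 exactly when the feature is missing. The arrays and the one-split tree (`stump`) are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

# Toy feature with missing values; the label is 1 exactly where the value is missing.
x = np.array([[1.0], [2.0], [np.nan], [3.0], [np.nan]])
y = np.array([0, 0, 1, 0, 1])

# Replace NaN with a value far outside the observed range.
x_filled = SimpleImputer(strategy='constant', fill_value=1e9).fit_transform(x)

# A single split is enough to isolate the rows that had missing values.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x_filled, y)
print(stump.tree_.threshold[0])   # split point lies between 3 and 1e9
print(stump.predict(x_filled))
```

Because 1e9 lies far from every real value, one threshold separates "missing" from "observed", which is the effect described above.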

The function CrossValGBC runs cross-validation for a given number of trees (n_estimators).
<br>
It returns the array of scores produced by cross_val_score and the elapsed time.

In [8]:
def CrossValGBC(n):
    start_time = datetime.datetime.now()
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    clf = GradientBoostingClassifier(n_estimators=n, random_state=42)
    quality = cross_val_score(estimator=clf,X=X_train,y=y_train,cv=kf, scoring='roc_auc', n_jobs=-1)
    end_time = datetime.datetime.now()
    return quality, end_time-start_time


Let's test n = {10, 20, 30, 100} and store the results in a dictionary whose key is $n_i$ and whose value is a tuple of the score array and the elapsed time.

In [None]:
n_trees = [10, 20, 30, 100]
res = {}
for n in n_trees:
    res[n] = CrossValGBC(n)

In [11]:
for key, val in res.items():
    minutes, seconds = divmod(val[1].total_seconds(), 60)
    print("Count of trees: {0:3}, score on cross validation: {1:.10f}, spent time:{2:2.0f} mins and {3:.2f} sec"
          .format(key, val[0].mean(), minutes, seconds))

Count of trees:  10, score on cross validation: 0.6660340265, spent time: 0 mins and 49.23 sec
Count of trees:  20, score on cross validation: 0.6834568680, spent time: 1 mins and 13.86 sec
Count of trees:  30, score on cross validation: 0.6899489423, spent time: 1 mins and 46.87 sec
Count of trees: 100, score on cross validation: 0.7067004675, spent time: 5 mins and 40.64 sec


## Part 2

Load the training and test data again and select the target variable.

In [102]:
X_train = pd.read_csv('features.csv', index_col='match_id')
y_train = X_train.radiant_win
X_test = pd.read_csv('features_test.csv', index_col='match_id')

Drop the features that describe the outcome of the match rather than its first 5 minutes. They "look into the future" and would leak the result into the model.

In [103]:
X_train.drop(['duration', 'radiant_win', 'tower_status_radiant',
           'tower_status_dire', 'barracks_status_radiant','barracks_status_dire'], axis=1, inplace=True)

### Logistic Regression

In [104]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

Fill missing values with 0.0. This is a reasonable strategy for linear methods like this one.

In [105]:
X_train.fillna(value=0.0, inplace=True)
X_test.fillna(value=0.0, inplace=True)

The function CrossValLogReg evaluates a given regularization parameter C = $\frac{1}{\lambda}$; the best value is then chosen from the grid {1e-5, 1e-4, ..., 10, 100}.
<br>
For small values of C the regularization is strong, so simple models are built that may underfit the data. For large values of C it is the other way around: the model is allowed to increase its complexity and can therefore overfit the data.
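The effect of C can be seen directly in the size of the fitted weights. The sketch below is only an illustration on synthetic data from `make_classification` (not the Dota 2 features); the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (not the competition data).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for C in [1e-4, 1e-2, 1.0]:
    # The default penalty is L2; a smaller C means stronger regularization.
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    model.fit(X_toy, y_toy)
    norms[C] = np.linalg.norm(model.coef_)
    print("C = {0:g}, ||w|| = {1:.4f}".format(C, norms[C]))
```

The weight norm grows with C: strong regularization shrinks the weights towards zero (a simpler model), while weak regularization lets them grow.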

In [106]:
def CrossValLogReg(X, n):
    start_time = datetime.datetime.now()
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    logreg = LogisticRegression(penalty='l2', C=n, solver='lbfgs', random_state=42)
    quality = cross_val_score(estimator=logreg, X=X, y=y_train, cv=kf, scoring='roc_auc', n_jobs=-1)
    end_time = datetime.datetime.now()
    return quality, end_time-start_time

The function ScaleData standardizes the training and test data. The scaler is fitted on the training data only and then applied to both sets.

In [107]:
def ScaleData(X_train, X_test):
    sc = StandardScaler()
    X_train_sc = sc.fit_transform(X_train.astype(float))
    X_test_sc = sc.transform(X_test.astype(float))
    return X_train_sc, X_test_sc
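As a small worked example of what this does, the sketch below standardizes a made-up one-column test set with statistics learned from a made-up training set. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_toy = np.array([[1.0], [2.0], [3.0]])   # mean 2, population std sqrt(2/3)
test_toy = np.array([[2.0], [4.0]])

sc = StandardScaler().fit(train_toy)          # statistics come from the training data only
test_scaled = sc.transform(test_toy)          # (x - train_mean) / train_std
print(sc.mean_, sc.scale_)
print(test_scaled)
```

The test value 2.0 maps to 0 because it equals the training mean, not the test mean; this keeps the test data in the same coordinate system the model was trained in.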

The function PrintScoreTime_C prints the mean cross-validation score and the elapsed time for each value of C, followed by the best C.

In [108]:
def PrintScoreTime_C(result_C):
    getbest_C = []
    for key, val in result_C.items():
        getbest_C.append((key, val[0].mean()))
        minutes, seconds = divmod(val[1].total_seconds(), 60)
        print("C: {0:9.5f}, score on cross validation: {1:.10f}, spent time: {2:.0f} mins and {3:4.2f} sec"
              .format(key, val[0].mean(), minutes, seconds))
    getbest_C.sort(key=lambda x:x[1], reverse=True)
    print("\nBest C: {0} with score: {1}".format(getbest_C[0][0], getbest_C[0][1]))

### Ignoring categorical features
Just treat them as numerical features.
<br>
Scale the data and find the best parameter C in this case.
The following sections try to improve the cross-validation score.

In [109]:
X_train_scaled, X_test_scaled = ScaleData(X_train, X_test)

In [110]:
array_C = np.power(10.0, np.arange(-5, 3))
result_C = {}
for n in array_C:
    result_C[n] = CrossValLogReg(X_train_scaled, n)

In [111]:
PrintScoreTime_C(result_C)

C:   0.00001, score on cross validation: 0.6951552810, spent time: 0 mins and 11.07 sec
C:   0.00010, score on cross validation: 0.7113523612, spent time: 0 mins and 2.28 sec
C:   0.00100, score on cross validation: 0.7163630590, spent time: 0 mins and 3.28 sec
C:   0.01000, score on cross validation: 0.7165498862, spent time: 0 mins and 4.32 sec
C:   0.10000, score on cross validation: 0.7165269452, spent time: 0 mins and 4.59 sec
C:   1.00000, score on cross validation: 0.7165221088, spent time: 0 mins and 4.78 sec
C:  10.00000, score on cross validation: 0.7165218906, spent time: 0 mins and 4.51 sec
C: 100.00000, score on cross validation: 0.7165218101, spent time: 0 mins and 4.97 sec

Best C: 0.01 with score: 0.7165498862352037


### Dropping categorical features entirely
Check whether the cross-validation score improves when the categorical features are removed.

In [112]:
categorical_features = ['lobby_type'] + [name for name in X_train.columns if "hero" in name]

Drop categorical features and scale the data.

In [113]:
X_train_nocateg = X_train.copy()
X_test_nocateg = X_test.copy()
X_train_nocateg.drop(categorical_features, inplace=True, axis=1)
X_test_nocateg.drop(categorical_features, inplace=True, axis=1)
X_train_nocateg_sc, X_test_nocateg_sc = ScaleData(X_train_nocateg, X_test_nocateg)

Find the best parameter C using cross_val_score on the data without categorical features.

In [114]:
array_C = np.power(10.0, np.arange(-5, 3))
result_C = {}
for n in array_C:
    result_C[n] = CrossValLogReg(X_train_nocateg_sc, n)

In [115]:
PrintScoreTime_C(result_C)

C:   0.00001, score on cross validation: 0.6950920646, spent time: 0 mins and 2.03 sec
C:   0.00010, score on cross validation: 0.7113327497, spent time: 0 mins and 2.32 sec
C:   0.00100, score on cross validation: 0.7163758697, spent time: 0 mins and 3.30 sec
C:   0.01000, score on cross validation: 0.7165592000, spent time: 0 mins and 3.87 sec
C:   0.10000, score on cross validation: 0.7165338145, spent time: 0 mins and 4.00 sec
C:   1.00000, score on cross validation: 0.7165303444, spent time: 0 mins and 4.09 sec
C:  10.00000, score on cross validation: 0.7165305329, spent time: 0 mins and 3.94 sec
C: 100.00000, score on cross validation: 0.7165304164, spent time: 0 mins and 3.96 sec

Best C: 0.01 with score: 0.7165592000076536


### Using a slightly modified one-hot encoding for categorical features
The features `rk_hero` and `dk_hero`, k = 1..5, are important: different heroes have different characteristics, and this affects the outcome of the match. So let's use one-hot encoding to try to improve the cross-validation score.

In [130]:
X_train_OHE = X_train.copy()
X_test_OHE = X_test.copy()


Count how many distinct hero ids exist.

In [131]:
categorical_features = [name for name in X_train.columns if "hero" in name]
uniqueId_heroes = np.unique(X_train_OHE[categorical_features])
categorical_features += ['lobby_type']
len(uniqueId_heroes)


108

Let's encode the categorical features `rk_hero` and `dk_hero`, k = 1..5.
<br>
The encoding works as follows: feature $j$ is $0$ if hero $j$ did not take part in the match, $1$ if hero $j$ played for Radiant and $-1$ if hero $j$ played for Dire.
<br>
We need the maximum hero id (this is $N$) to allocate matrices of the right width for the encoding described above.

In [134]:
def EncodeData(X, typedata):
    original_data = X_test if typedata == 'test' else X_train
    for i, match_id in enumerate(original_data.index):
        for p in range(5):
            X[i, original_data.loc[match_id, 'r%d_hero' % (p+1)]-1] = 1
            X[i, original_data.loc[match_id, 'd%d_hero' % (p+1)]-1] = -1
    return X
    

In [135]:
N = max(uniqueId_heroes) 
X_train_enc = np.zeros((X_train.shape[0], N))
X_test_enc = np.zeros((X_test.shape[0], N))

X_train_enc = EncodeData(X_train_enc, 'train')
X_test_enc = EncodeData(X_test_enc, 'test')
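The row-by-row loop is easy to read but slow on a large number of matches. The same encoding can be sketched with NumPy fancy indexing; this is an alternative formulation, not the code used for the results below. The toy frame uses two players per side and made-up hero ids to keep it short, and `picks`, `n_heroes`, `radiant_cols` and `dire_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy matches with two players per side (hypothetical hero ids, 1-based).
picks = pd.DataFrame({
    'r1_hero': [1, 3],
    'r2_hero': [2, 4],
    'd1_hero': [4, 1],
    'd2_hero': [5, 2],
}, index=pd.Index([100, 101], name='match_id'))

n_heroes = 5
radiant_cols = ['r1_hero', 'r2_hero']
dire_cols = ['d1_hero', 'd2_hero']

encoded = np.zeros((len(picks), n_heroes))
# Column vector of row positions; it broadcasts against the (matches, players) id arrays.
rows = np.arange(len(picks))[:, None]
encoded[rows, picks[radiant_cols].to_numpy() - 1] = 1    # Radiant heroes
encoded[rows, picks[dire_cols].to_numpy() - 1] = -1      # Dire heroes
print(encoded)
```

Since a hero cannot appear on both teams in one match, the two assignments never overwrite each other.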

Convert these matrices (X_train_enc, X_test_enc) to DataFrames and rename their index and columns to match the original data. This makes their contents easier to read and allows concatenating them with the original data.

In [136]:
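# String column names: recent scikit-learn versions reject DataFrames that mix int and str column labels
id_hero = {i: 'hero_%d' % (i + 1) for i in range(N)}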

row_train = {i:x for i, x in enumerate(X_train_OHE.index)}
encoded_heroes_train = pd.DataFrame(X_train_enc).rename(index=row_train, columns=id_hero)

row_test = {i:x for i, x in enumerate(X_test_OHE.index)}
encoded_heroes_test = pd.DataFrame(X_test_enc).rename(index=row_test, columns=id_hero)

In [143]:
X_train_OHE_enc = pd.concat([X_train_OHE, encoded_heroes_train], axis=1)
X_test_OHE_enc = pd.concat([X_test_OHE, encoded_heroes_test], axis=1)

Drop extra features that were changed using one-hot encoding and scale the data. 

In [138]:
X_train_OHE_enc.drop(categorical_features, axis=1, inplace=True)
X_test_OHE_enc.drop(categorical_features, axis=1, inplace=True)
X_train_OHE_enc_scaled, X_test_OHE_enc_scaled = ScaleData(X_train_OHE_enc, X_test_OHE_enc)

Find the best parameter C using cross_val_score on the newly built data (X_train_OHE_enc_scaled).

In [139]:
array_C = np.power(10.0, np.arange(-5, 3))
result_C = {}
for n in array_C:
    result_C[n] = CrossValLogReg(X_train_OHE_enc_scaled, n)

In [140]:
PrintScoreTime_C(result_C)

C:   0.00001, score on cross validation: 0.7147781039, spent time: 0 mins and 13.82 sec
C:   0.00010, score on cross validation: 0.7427271252, spent time: 0 mins and 5.19 sec
C:   0.00100, score on cross validation: 0.7516116548, spent time: 0 mins and 7.99 sec
C:   0.01000, score on cross validation: 0.7519639456, spent time: 0 mins and 10.54 sec
C:   0.10000, score on cross validation: 0.7519296127, spent time: 0 mins and 11.37 sec
C:   1.00000, score on cross validation: 0.7519244691, spent time: 0 mins and 11.07 sec
C:  10.00000, score on cross validation: 0.7519241578, spent time: 0 mins and 10.47 sec
C: 100.00000, score on cross validation: 0.7519241366, spent time: 0 mins and 11.82 sec

Best C: 0.01 with score: 0.7519639455861148


### Predict probabilities on test sample

The best model so far is logistic regression with $C = 0.01$ trained on the preprocessed data with encoded hero features: its cross-validation score of about 0.752 exceeds both the other logistic regression variants (about 0.717) and gradient boosting with 100 trees (about 0.707). So let's take it and predict probabilities on the test sample X_test, which was preprocessed in the same way as X_train.

In [141]:
logreg = LogisticRegression(penalty='l2', C=0.01, solver='lbfgs', random_state=42, n_jobs=-1)
logreg.fit(X_train_OHE_enc_scaled, y_train)
prob_pred = logreg.predict_proba(X_test_OHE_enc_scaled)[:,1]
print(prob_pred)

[0.8227909  0.75216683 0.18906701 ... 0.2379355  0.6283895  0.42762756]
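The column index `1` in `predict_proba(...)[:, 1]` relies on the order of `classes_`. A minimal sketch on made-up data shows how that order can be checked explicitly; the toy arrays and the names `pos_col` and `p_win` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature problem with labels 0 and 1.
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])

model = LogisticRegression(C=0.01, solver='lbfgs').fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)

# Columns of predict_proba follow model.classes_, so look up the positive class.
print(model.classes_)
pos_col = list(model.classes_).index(1)
p_win = proba[:, pos_col]
print(p_win)
```

With labels 0 and 1 the positive class ends up in column 1, which matches the indexing used above for radiant_win.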


Check the smallest and largest values in prob_pred.

In [142]:
print(min(prob_pred), max(prob_pred))

0.00850599069841149 0.9962795503882215
