# Dota2 win prediction - Solution

The best score I got is 0.825 (ROC AUC) with a CatBoost classifier; see the details below or jump to the last section.

1. [Top 3 classifiers](#Top-3-classifiers)
2. [New features from raw dataset](#New-features-from-raw-dataset)
3. [Feature engineering](#Feature-engineering)
4. [Feature selection](#Feature-selection)
5. [Hyperparams tuning](#Hyperparams-tuning)
6. [Prediction on test dataset and submission](#Prediction-on-test-dataset-and-submission)





In [2]:
import os
import numpy as np
import json
import pandas as pd
import datetime
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score
import time
from boruta import BorutaPy
from sklearn.model_selection import GridSearchCV

import xgboost
import lightgbm
import catboost
from sklearn.ensemble import (RandomForestClassifier,
                              ExtraTreesClassifier,
                              VotingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [3]:
SEED = 10801
sns.set_style(style="whitegrid")
plt.rcParams["figure.figsize"] = 12, 8
warnings.filterwarnings("ignore")

In [4]:
PATH_TO_DATA = "./input/bi-2021-ml-competitions-dota2"

df_train_features = pd.read_csv(os.path.join(PATH_TO_DATA,
                                             "train_data.csv"),
                                    index_col="match_id_hash")
df_train_targets = pd.read_csv(os.path.join(PATH_TO_DATA,
                                            "train_targets.csv"),
                                   index_col="match_id_hash")

In [5]:
X = df_train_features
y = df_train_targets["radiant_win"]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=SEED)

## Top 3 classifiers

Our experiments showed that CatBoost (CAT), random forest (RF) and logistic regression (LR) are the top three classifiers, so we continue working with them.


In [6]:
# Helper to try different models on different datasets:
# prints hold-out accuracy and mean cross-validated ROC AUC
def try_model(X, y, name, model):
    start = time.time()

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y,
                                                      test_size=0.3,
                                                      random_state=SEED)

    cv_scores = cross_val_score(model, X_scaled, y, cv=cv, scoring="roc_auc")

    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_valid)[:, 1]
    accuracy = accuracy_score(y_valid, y_pred > 0.5)
    print(f"{name:>10}: accuracy={accuracy:.3f} mean roc_auc_score={cv_scores.mean():.6f} {time.time() - start:.0f} sec")
    return name, accuracy, cv_scores.mean()

In [7]:
rf = RandomForestClassifier(n_estimators=300, max_depth=7, random_state=SEED)
cat = catboost.CatBoostClassifier(verbose=False, random_seed=SEED)
lr = LogisticRegression(solver='liblinear', max_iter=10000)

base_models = [
    ("RF", rf),
    ("CAT", cat),
    ("LR", lr),
]

In [8]:
%%time
for name, model in base_models:
    try_model(X, y, name, model)

So this is what we start from: a baseline for this notebook.
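
The two numbers that `try_model` prints answer different questions. A toy sketch (made-up labels and probabilities, not competition data) shows how they can disagree: ROC AUC depends only on how the predicted probabilities rank the matches, while accuracy also depends on the 0.5 cut-off.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels and predicted probabilities (made up for illustration).
y_true = np.array([0, 0, 1, 1])
y_proba = np.array([0.6, 0.7, 0.8, 0.9])

# ROC AUC only looks at the ranking of the probabilities.
auc = roc_auc_score(y_true, y_proba)

# Accuracy depends on the 0.5 cut-off, the same one try_model uses.
acc = accuracy_score(y_true, (y_proba > 0.5).astype(int))

print(f"roc_auc={auc:.2f} accuracy={acc:.2f}")
```

Here every positive is ranked above every negative, yet all four examples are predicted as wins, so half of the thresholded predictions are wrong. The comparisons in this notebook rely on the ROC AUC column.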

## New features from raw dataset

I add new features from the raw dataset (as discussed in the experiments), counting the entries each player has in:
- ability_upgrades
- purchase_log
- item_uses
- ability_uses
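
As a minimal sketch of the idea, assuming a heavily simplified record layout (the two toy matches below are made up; real records have 10 players and many more fields), each raw field is reduced to its length per player:

```python
import io
import json

# Two made-up matches in JSON Lines form, one JSON object per line.
toy_jsonl = io.StringIO(
    '{"match_id_hash": "a", "players": [{"purchase_log": [1, 2, 3]}, {"purchase_log": []}]}\n'
    '{"match_id_hash": "b", "players": [{"purchase_log": [7]}, {"purchase_log": [8, 9]}]}\n'
)

counts = {}
for line in toy_jsonl:
    match = json.loads(line)
    # The feature is simply the number of entries a player has in the field.
    counts[match["match_id_hash"]] = [len(p["purchase_log"]) for p in match["players"]]

print(counts)
```

Reading the file line by line keeps only one match in memory at a time, which is the same reason `read_matches` below is written as a generator.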


In [9]:
try:
    import ujson as json
except ModuleNotFoundError:
    import json
    print("Consider installing ujson to work with JSON objects faster")

try:
    from tqdm.notebook import tqdm
except ModuleNotFoundError:
    # Fallback with a compatible signature, so tqdm(fin, total=...) still works
    tqdm = lambda x, **kwargs: x
    print("Consider installing tqdm to track progress")


def read_matches(matches_file, total_matches=31698, n_matches_to_read=None):

    if n_matches_to_read is None:
        n_matches_to_read = total_matches

    c = 0
    with open(matches_file) as fin:
        for line in tqdm(fin, total=total_matches):
            if c >= n_matches_to_read:
                break
            else:
                c += 1
                yield json.loads(line)

In [10]:
def add_new_features(df, matches_file):
    # Raw fields to count per player (renamed from `vars`, which shadows a builtin)
    raw_fields = ["ability_upgrades", "purchase_log", "item_uses", "ability_uses"]

    for match in read_matches(matches_file, total_matches=df.shape[0]):
        match_id_hash = match['match_id_hash']

        for var in raw_fields:

            for idx, player in enumerate(match["players"][:5], start=1):
                df.loc[match_id_hash, f"r{idx}_{var}"] = len(player[var])

            for idx, player in enumerate(match["players"][5:], start=1):
                df.loc[match_id_hash, f"d{idx}_{var}"] = len(player[var])


In [11]:
df_test_features = pd.read_csv(os.path.join(PATH_TO_DATA,
                                             "test_data.csv"),
                                    index_col="match_id_hash")

In [12]:

add_new_features(df_train_features, os.path.join(PATH_TO_DATA, "train_raw_data.jsonl"))

Let's do the same for the _test_ dataset; I will use it later for prediction:

In [13]:
add_new_features(df_test_features, os.path.join(PATH_TO_DATA, "test_raw_data.jsonl"))

I save `df_train_features` and `df_test_features` to CSV to avoid re-processing the raw JSON files, which takes a long time.

In [14]:
df_train_features.to_csv("train_features_ex.csv")
df_test_features.to_csv("test_features_ex.csv")

Let's see what our three classifiers are capable of on that extended dataset:

In [15]:
%%time
for name, model in base_models:
    try_model(df_train_features, y, name, model)

Scores are slightly better, it's fine ;)

## Feature engineering

KDA was mentioned as a good feature, so let's add it to the datasets.
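
KDA here is (kills + assists) / deaths, with deaths clamped to at least 1, which is the formula the `add_kda` cell uses. A small worked example on made-up stats for three players (the unprefixed column names are only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stats for three players.
toy = pd.DataFrame({"kills": [4, 2, 0], "assists": [6, 1, 3], "deaths": [2, 0, 1]})

# Clamp deaths to at least 1 so a deathless player does not divide by zero.
toy["kda"] = (toy["kills"] + toy["assists"]) / np.maximum(1, toy["deaths"])

print(toy["kda"].tolist())  # [5.0, 3.0, 3.0]
```

The second player has no deaths, so the clamp makes their KDA equal to kills + assists.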

In [16]:
df_train_features = pd.read_csv("train_features_ex.csv", index_col="match_id_hash")
df_test_features = pd.read_csv("test_features_ex.csv", index_col="match_id_hash")

In [17]:
def add_kda(df):
    for t in "rd":
        for p in "12345":
            df[f"{t}{p}_kda"] = (df[f"{t}{p}_kills"] + df[f"{t}{p}_assists"]) / np.maximum(1, df[f"{t}{p}_deaths"])
    return df

In [18]:
add_kda(df_train_features)

In [19]:
add_kda(df_test_features);

In [20]:
df_test_features.shape, df_train_features.shape

We had 245 features at the beginning, then added 4 new features from the raw dataset and KDA for each of the 10 players, so 245 + (4 + 1) * 10 = 295 features.

Now I aggregate the players' variables into team variables (mean and std) and calculate the difference between Radiant and Dire for each variable's mean and std.

Here I decided to aggregate all features (except _hero_id_): if some aggregated feature turns out to have low importance, we will just drop it at the feature selection stage, not a big problem.
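
To make the aggregation concrete, here is a minimal sketch on one made-up match with a single variable and only two players per team (the real data has five). It mirrors the mean and std differences computed by `add_team_diff_std` below:

```python
import pandas as pd

# One made-up match; values are illustrative only.
toy = pd.DataFrame(
    {"r1_gold": [1000], "r2_gold": [3000], "d1_gold": [1500], "d2_gold": [1500]},
    index=["match_a"],
)

r_team = ["r1_gold", "r2_gold"]
d_team = ["d1_gold", "d2_gold"]

# Positive values mean Radiant is ahead on average.
toy["gold"] = toy[r_team].mean(axis=1) - toy[d_team].mean(axis=1)
# Difference in spread: how unevenly the variable is distributed inside each team.
toy["gold_std"] = toy[r_team].std(axis=1) - toy[d_team].std(axis=1)

print(toy[["gold", "gold_std"]])
```

Radiant leads by 500 gold on average, and its gold is spread much less evenly than Dire's, where both players have the same amount (std of 0).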

In [21]:
def add_team_diff_std(df):
    variables_to_aggregate = [
        "kills", "deaths", "assists", "denies", "gold", "lh", "xp", "health",
        "max_health", "max_mana", "level", "x", "y", "stuns", "creeps_stacked",
        "camps_stacked", "rune_pickups", "firstblood_claimed", "teamfight_participation",
        "towers_killed", "roshans_killed", "obs_placed", "sen_placed",
        "ability_upgrades", "purchase_log", "item_uses", "ability_uses", "kda",
    ]

    for name in variables_to_aggregate:
        r_team = [f"r{p}_{name}" for p in "12345"]
        d_team = [f"d{p}_{name}" for p in "12345"]

        df[name] = df[r_team].mean(axis=1) - df[d_team].mean(axis=1)
        df[f"{name}_std"] = df[r_team].std(axis=1) - df[d_team].std(axis=1)

    return df

In [22]:
add_team_diff_std(df_train_features)

In [23]:
add_team_diff_std(df_test_features);

Let's see what this feature engineering (now we have 351 features) gives us:
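
The feature count can be checked with a little arithmetic. The sketch below only restates numbers from this notebook: 245 starting features, 10 players, and 28 entries in `variables_to_aggregate`, each of which adds a mean difference and a std difference.

```python
n_initial = 245            # features in the initial training data
n_players = 10             # 5 Radiant + 5 Dire
n_new_per_player = 4 + 1   # four raw-data counts plus KDA
n_team_vars = 28           # length of variables_to_aggregate

n_after_player_features = n_initial + n_new_per_player * n_players
# Each aggregated variable adds two columns: mean difference and std difference.
n_after_team_features = n_after_player_features + 2 * n_team_vars

print(n_after_player_features, n_after_team_features)  # 295 351
```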

In [24]:
%%time
for name, model in base_models:
    try_model(df_train_features, y, name, model)

These are good results (mean ROC AUC) and an improvement over the initial dataset:
- RF: 0.773139
- CAT: 0.801168
- LR: 0.808886

Again, saving datasets along the way, to avoid waiting for processing or repeating some steps from above:

In [25]:
df_train_features.to_csv("my_train_features.csv")
df_test_features.to_csv("my_test_features.csv")

## Feature selection

Not all features are useful; some are noisy, and we can drop them.

NOTE: there are several feature selection methods: from [sklearn](https://scikit-learn.org/stable/modules/feature_selection.html) (SelectFromModel, SequentialFeatureSelector, recursive feature elimination), from [eli5](https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html) (PermutationImportance), and Boruta.

SelectFromModel is the simplest one (it selects features based on model's feature importance) and it's fast.
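
A minimal sketch of how `SelectFromModel` behaves, on synthetic data where only the first column carries signal (the data, the small forest and the 0.2 threshold are assumptions chosen for illustration, not the settings used in this notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
# Only the first column determines the label; the other four are pure noise.
y_toy = (X_toy[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# prefit=True reuses the already fitted model; features whose importance
# is below the threshold are dropped.
selector = SelectFromModel(forest, threshold=0.2, prefit=True)
mask = selector.get_support()
X_selected = X_toy[:, mask]

print(forest.feature_importances_.round(3), mask)
```

The boolean mask from `get_support()` is what the `select_features_*` helpers below use to pick columns from the full DataFrame.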

In [26]:
df_train_features = pd.read_csv("my_train_features.csv", index_col="match_id_hash")
df_test_features = pd.read_csv("my_test_features.csv", index_col="match_id_hash")

Let's look at feature importance of each model:

In [27]:
from sklearn.feature_selection import SelectFromModel
import eli5
from IPython.display import display

### RF

In [28]:
display(eli5.show_weights(rf, feature_names=list(df_train_features), top=50))

In [29]:
def select_features_rf(full_df, threshold=0.005):
    sel_rf = SelectFromModel(rf, threshold=threshold, prefit=True)
    print(f"{sel_rf.get_support().sum()} features left")
    _rf = RandomForestClassifier(n_estimators=300, max_depth=7, random_state=SEED)
    df = full_df[full_df.columns[sel_rf.get_support()]]
    try_model(df, y, "selected RF", _rf)
    return sel_rf.get_support()


In [30]:
select_features_rf(df_train_features, threshold=0.005);

In [31]:
select_features_rf(df_train_features, threshold=0.009);

### CAT

In [32]:
display(eli5.show_weights(cat, feature_names=list(df_train_features), top=50))

In [33]:
def select_features_cat(full_df, threshold=0.005):
    sel_cat = SelectFromModel(cat, threshold=threshold, prefit=True)
    print(f"{sel_cat.get_support().sum()} features left")
    _cat = catboost.CatBoostClassifier(verbose=False, random_seed=SEED)
    df = full_df[full_df.columns[sel_cat.get_support()]]
    try_model(df, y, "selected CAT", _cat)
    return sel_cat.get_support()

In [34]:
select_features_cat(df_train_features, threshold=0.1);

In [35]:
select_features_cat(df_train_features, threshold=0.3);

In [36]:
select_features_cat(df_train_features, threshold=0.5);

### LR

In [37]:
display(eli5.show_weights(lr, feature_names=list(df_train_features), top=50))

In [38]:
def select_features_lr(full_df, threshold=0.005):
    sel_lr = SelectFromModel(lr, threshold=threshold, prefit=True)
    print(f"{sel_lr.get_support().sum()} features left, threshold={threshold}")
    _lr = LogisticRegression(solver='liblinear', max_iter=10000)
    df = full_df[full_df.columns[sel_lr.get_support()]]
    try_model(df, y, "selected LR", _lr)
    return sel_lr.get_support()

In [39]:
select_features_lr(df_train_features, threshold=0.005);
select_features_lr(df_train_features, threshold=0.01);
select_features_lr(df_train_features, threshold=0.04);
select_features_lr(df_train_features, threshold=0.05);
select_features_lr(df_train_features, threshold=0.075);
select_features_lr(df_train_features, threshold=0.1);

### Results of feature selection

For all three models, _gold, xp and level_ (team variables) are the top three features.

We dropped noisy features (with a simple method) without losing model score, and gained a bit of fitting speed.

CAT is the winner, with roc_auc=0.821.

Now we transform _train_ and _test_ datasets and save them to csv:

In [40]:
sel_cat = SelectFromModel(cat, threshold=0.3, prefit=True)
print(f"{sel_cat.get_support().sum()} features left")

def transform(full_df, sel_cat):
    return full_df[full_df.columns[sel_cat.get_support()]]

In [41]:
transform(df_train_features, sel_cat).to_csv("selected_train_features.csv")
transform(df_test_features, sel_cat).to_csv("selected_test_features.csv")

## Hyperparams tuning

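Let's tune the CatBoost hyperparameters with a grid search over `depth` and `l2_leaf_reg` (`learning_rate` is fixed at 0.03):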
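
To get a feel for the cost of the search, the grid can be enumerated with scikit-learn's `ParameterGrid`. This sketch copies the grid from the search cell below and assumes the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not passed (scikit-learn 0.22 or newer):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell.
param_grid = {
    "learning_rate": [0.03],
    "depth": [5, 6],
    "l2_leaf_reg": [1, 3, 5],
}

candidates = list(ParameterGrid(param_grid))
n_folds = 5  # assumed default for GridSearchCV when cv is not given
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)  # 6 30
```

That is 6 candidate settings and 30 model fits, plus one final refit of the best candidate on the full training data.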

In [42]:
df_train_features = pd.read_csv("selected_train_features.csv", index_col="match_id_hash")
df_test_features = pd.read_csv("selected_test_features.csv", index_col="match_id_hash")

In [43]:
%%time
cat_search = GridSearchCV(estimator=catboost.CatBoostClassifier(verbose=False, random_seed=SEED),
                          param_grid={
                              'learning_rate': [0.03],  #[0.01, 0.03, 0.1],
                              'depth': [5, 6],
                              'l2_leaf_reg': [1, 3, 5]
                          }, scoring="roc_auc")
cat_search.fit(df_train_features, y)
cat_search.best_params_


In [44]:
cat_search.best_score_

**Yay! Before it was 0.821, here we have 0.825**

## Prediction on test dataset and submission

In [46]:
from datetime import datetime

y_test_pred = cat_search.best_estimator_.predict_proba(df_test_features.values)[:, 1]

df_submission = pd.DataFrame({"radiant_win_prob": y_test_pred},
                             index=df_test_features.index)
file_name = "submission-{}.csv".format(datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))

df_submission.to_csv(file_name)

![](https://i.imgflip.com/2hzv20.jpg)