## General information

In this kernel I train a CatBoost classifier and demonstrate several plots from the SHAP explainer.<br>
For EDA, see [this kernel](https://www.kaggle.com/vchulski/dota-2-eda-and-simple-models-comparing).

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, ShuffleSplit, KFold, cross_val_score
from sklearn.metrics import roc_auc_score

from catboost import CatBoostClassifier, Pool, cv

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

import time
import datetime

try:
    import ujson as json  # faster JSON parser, if installed
except ImportError:
    import json  # fall back to the standard library

from tqdm import tqdm_notebook

import warnings
warnings.filterwarnings("ignore")

Reading data

In [None]:
%%time
PATH_TO_DATA = '../input/'

sample_submission = pd.read_csv(os.path.join(PATH_TO_DATA, 'sample_submission.csv'), 
                                    index_col='match_id_hash')
df_train_features = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_features.csv'), 
                                    index_col='match_id_hash')
df_train_targets = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_targets.csv'), 
                                   index_col='match_id_hash')
df_test_features = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_features.csv'), 
                                   index_col='match_id_hash')

In [None]:
print('Shape of Training set: {0}\nShape of Test set: {1}'.format(df_train_features.shape,df_test_features.shape))

Reading raw match data (helper functions from Yorko's kernel)

In [None]:
# a helper function, used in the next cell
def read_matches(matches_file):
    
    MATCHES_COUNT = {
        'test_matches.jsonl': 10000,
        'train_matches.jsonl': 39675,
    }
    _, filename = os.path.split(matches_file)
    total_matches = MATCHES_COUNT.get(filename)
    
    with open(matches_file) as fin:
        for line in tqdm_notebook(fin, total=total_matches):
            yield json.loads(line)

In [None]:
def add_new_features(df_features, matches_file):
    
    # Process raw data and add new features
    for match in read_matches(matches_file):
        match_id_hash = match['match_id_hash']

        # Count destroyed towers for both teams
        radiant_tower_kills = 0
        dire_tower_kills = 0
        for objective in match['objectives']:
            if objective['type'] == 'CHAT_MESSAGE_TOWER_KILL':
                if objective['team'] == 2:
                    radiant_tower_kills += 1
                if objective['team'] == 3:
                    dire_tower_kills += 1

        # Write new features
        df_features.loc[match_id_hash, 'radiant_tower_kills'] = radiant_tower_kills
        df_features.loc[match_id_hash, 'dire_tower_kills'] = dire_tower_kills
        df_features.loc[match_id_hash, 'diff_tower_kills'] = radiant_tower_kills - dire_tower_kills
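
To make the counting logic concrete, here is a minimal sketch that applies it to a single hand-made match record. The `toy_match` dictionary and the `count_tower_kills` helper are hypothetical and only mimic the fields that the function above reads (`objectives`, `type`, `team`); real records contain many more keys.

```python
def count_tower_kills(match):
    """Return (radiant_tower_kills, dire_tower_kills) for one match record."""
    radiant, dire = 0, 0
    for objective in match.get('objectives', []):
        if objective.get('type') != 'CHAT_MESSAGE_TOWER_KILL':
            continue
        # Same team codes as in add_new_features: 2 counts for Radiant, 3 for Dire
        if objective.get('team') == 2:
            radiant += 1
        elif objective.get('team') == 3:
            dire += 1
    return radiant, dire


# Hypothetical record, for illustration only
toy_match = {
    'match_id_hash': 'toy',
    'objectives': [
        {'type': 'CHAT_MESSAGE_TOWER_KILL', 'team': 2},
        {'type': 'CHAT_MESSAGE_FIRSTBLOOD'},
        {'type': 'CHAT_MESSAGE_TOWER_KILL', 'team': 3},
        {'type': 'CHAT_MESSAGE_TOWER_KILL', 'team': 2},
    ],
}

radiant_kills, dire_kills = count_tower_kills(toy_match)
print(radiant_kills, dire_kills, radiant_kills - dire_kills)
```

Objectives of any other type are skipped, so only tower kills feed the three new columns.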

In [None]:
%%time
# copy the dataframe with features
df_train_features_extended = df_train_features.copy()
df_test_features_extended = df_test_features.copy()

# add new features
add_new_features(df_train_features_extended, os.path.join(PATH_TO_DATA, 'train_matches.jsonl'))
add_new_features(df_test_features_extended, os.path.join(PATH_TO_DATA, 'test_matches.jsonl'))

In [None]:
df_train_features_extended.info() #no categorical data here 

In [None]:
X = df_train_features_extended.values
y = df_train_targets['radiant_win'].map({True: 1, False: 0}).values
X_test = df_test_features_extended.values
print('Shape of Training set: ', X.shape, ' shape of target: ', y.shape, ' shape of test set: ', X_test.shape)

## CatBoost implementation

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=17)  # hold out 30% of the data for validation

In [None]:
model = CatBoostClassifier(iterations=400,random_seed=42,eval_metric='AUC',logging_level='Silent')

Run this cell yourself to see a great real-time interactive graph!
If anyone knows how to save it so that it shows up in the rendered kernel, please let me know!

In [None]:
%%time
#https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb
model.fit(X_train, y_train,
    eval_set=(X_valid, y_valid),
    #logging_level='Verbose',  # uncomment for text output
    plot=True  # draws an interactive real-time learning curve in a live notebook
);

The graph shows that the best iteration was 347, with an AUC of 0.8060119476 on the holdout set.
Note that you can also switch the graph to Logloss, where the learn curve is also plotted.
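
As a reminder of what the holdout number measures: ROC AUC is the probability that a randomly chosen positive example (here, a Radiant win) receives a higher score than a randomly chosen negative one, with ties counted as one half. Below is a minimal pure-Python sketch on made-up labels and scores. `pairwise_auc` is a hypothetical helper written only for illustration; in practice `roc_auc_score` from scikit-learn (imported above) computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(pairwise_auc(toy_labels, toy_scores))
```

Because only the ordering of the scores matters, AUC does not change if the predicted probabilities are rescaled monotonically.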

In [None]:
%%time
cv_params = model.get_params()
print('cv params: ', cv_params)
cv_data = cv(
    Pool(X, y),
    cv_params,
    seed=17,
    fold_count=5,
    plot=True  # slower to render, but shows how learning went on each fold
)

In [None]:
print('Best validation AUC score: {:.4f}±{:.4f} on step {}'.format(
    np.max(cv_data['test-AUC-mean']),
    cv_data['test-AUC-std'][np.argmax(cv_data['test-AUC-mean'])],
    np.argmax(cv_data['test-AUC-mean'])
))

On cross-validation, it looks like we may need to increase the number of iterations.
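
One rough way to check this is to see whether the best mean score sits at the very end of the curve: if it does, the curve probably has not plateaued yet. The sketch below assumes the per-iteration means and standard deviations are available as plain lists (for example `list(cv_data['test-AUC-mean'])`). The `best_cv_step` helper, its `tail` threshold, and the toy numbers are hypothetical choices, not part of CatBoost.

```python
def best_cv_step(mean_scores, std_scores, tail=10):
    """Return (best_step, best_mean, best_std, near_end).

    near_end is True when the best step falls within the last `tail`
    iterations, a hint that more iterations might still help.
    """
    best_step = max(range(len(mean_scores)), key=lambda i: mean_scores[i])
    near_end = best_step >= len(mean_scores) - tail
    return best_step, mean_scores[best_step], std_scores[best_step], near_end


# Made-up CV curve that is still improving at the last iteration
toy_means = [0.70, 0.75, 0.78, 0.79, 0.80]
toy_stds = [0.010, 0.008, 0.007, 0.006, 0.006]

step, mean, std, near_end = best_cv_step(toy_means, toy_stds, tail=2)
print(step, mean, std, near_end)
```

This is only a heuristic: a best step near the end suggests trying more iterations, but the gain may still be smaller than the fold-to-fold standard deviation.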

## SHAP explainer

In [None]:
import shap
shap.initjs()

In [None]:
explainer = shap.TreeExplainer(model)
# pass the DataFrame rather than X (= df.values) so that the plots show feature names
shap_values = explainer.shap_values(df_train_features_extended)

# visualize the first prediction's explanation
shap.force_plot(explainer.expected_value, shap_values[0,:], df_train_features_extended.iloc[0,:])
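
The force plot relies on the additivity of SHAP values: the base value plus the sum of one row's contributions equals the model's raw output for that row. For a binary classifier like this one, the raw output is, as far as I know, on the log-odds scale, so a sigmoid turns it into a probability. The sketch below illustrates this with made-up numbers; `probability_from_shap` is a hypothetical helper, not part of the `shap` package.

```python
import math


def probability_from_shap(base_value, shap_row):
    """Rebuild a predicted probability from a base value and one row of SHAP values.

    Assumes the explained output is in log-odds, which is an assumption
    of this sketch rather than something checked against the model.
    """
    raw_output = base_value + sum(shap_row)      # additivity of SHAP values
    return 1.0 / (1.0 + math.exp(-raw_output))   # sigmoid: log-odds -> probability


# Made-up base value and contributions, for illustration only
toy_base_value = 0.1
toy_shap_row = [0.8, -0.3, 0.4]
print(round(probability_from_shap(toy_base_value, toy_shap_row), 4))
```

In the notebook, `base_value` would correspond to `explainer.expected_value` and `shap_row` to `shap_values[0, :]`; comparing the result with `model.predict_proba` for the same row is a quick sanity check.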

In [None]:
# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("d3_gold", shap_values, df_train_features_extended)

In [None]:
# summarize the effects of all the features
shap.summary_plot(shap_values, df_train_features_extended)
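
The ordering of features in a summary plot is based on the overall magnitude of their SHAP values. A simple way to get a comparable ranking as numbers is to average the absolute SHAP value of each feature over all rows. Below is a minimal sketch with a made-up matrix; `rank_features_by_mean_abs_shap` and the toy feature names and values are hypothetical.

```python
def rank_features_by_mean_abs_shap(shap_rows, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least influential."""
    n_rows = len(shap_rows)
    importance = {}
    for j, name in enumerate(feature_names):
        importance[name] = sum(abs(row[j]) for row in shap_rows) / n_rows
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Made-up SHAP values: two rows (matches) by three features
toy_feature_names = ['diff_tower_kills', 'd3_gold', 'r1_deaths']
toy_shap_rows = [
    [0.50, -0.10, 0.20],
    [-0.70, 0.30, -0.10],
]

ranking = rank_features_by_mean_abs_shap(toy_shap_rows, toy_feature_names)
for name, score in ranking:
    print(name, round(score, 3))
```

Taking absolute values matters: a feature that pushes some predictions up and others down would otherwise average out to roughly zero and look unimportant.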

I am really impressed by the capabilities of the CatBoost library and the SHAP explainer.
Tower kills, deaths and gold are clearly strong features.

With this tool you can see at a glance how much each newly added feature contributes.

## Submission

Just as in my other [kernel](https://www.kaggle.com/vchulski/dota-2-eda-and-simple-models-comparing) for this competition, the submission uses a simple model without any hyperparameter tuning.

In [None]:
model = CatBoostClassifier(iterations=400,random_seed=42,eval_metric='AUC',logging_level='Silent')
model.fit(X, y)

y_test_pred = model.predict_proba(X_test)[:, 1]
df_submission = pd.DataFrame({'radiant_win_prob': y_test_pred}, 
                                 index=df_test_features.index)
submission_filename = 'catboost_{}.csv'.format(
    datetime.datetime.now().strftime('%Y-%m-%d_%H-%M'))
df_submission.to_csv(submission_filename)
print('Submission saved to {}'.format(submission_filename))

In [None]:
df_submission.head()  # just to check that everything is all right

Feel free to discuss anything about this kernel in the comments, and upvote if it was useful.