### Setting up a prediction problem

This notebook sets up the problem of predicting a match outcome from the history of each player involved in the match. I walk through my thought process as I try to avoid leaks.



In [1]:
import pandas as pd
import numpy as np
from sklearn import ensemble 
from sklearn import metrics

# this is meant to be a simple example so only matches and players are used
matches = pd.read_csv('../input/match.csv', index_col=0)
players = pd.read_csv('../input/players.csv')

test_labels = pd.read_csv('../input/test_labels.csv', index_col=0)
test_players = pd.read_csv('../input/test_player.csv')

train_labels = matches['radiant_win'].astype(int)

### Predicting Match Outcome

In this problem we are asking the question: which team will win? It is important to consider when the question is being asked. Most frequently it is asked before the match starts, but it could also be asked after the match has been running for 10 or 15 minutes. It could be asked before hero selection, when all that is known is the identity of the competitors. It could also be asked after hero selection, in which case the hero composition of each team would be something to consider. An additional case would be predicting the outcome based only on the heroes involved, without accounting for the players' identities.

The important point is that a time and a set of conditions need to be picked before trying to solve the problem. Here we will try to predict the outcome of a match when only the player identities are known, before hero selection or any gameplay starts.

Any information only available after we ask the question is off limits. This means any details at all about events in the match should be excluded as well as any information about future matches.

In [2]:
# take a look at the match data
matches.head()

match_id,start_time,duration,tower_status_radiant,tower_status_dire,barracks_status_dire,barracks_status_radiant,first_blood_time,game_mode,radiant_win,negative_votes,positive_votes,cluster
0,1446750112,2375,1982,4,3,63,1,22,True,0,1,155
1,1446753078,2582,0,1846,63,0,221,22,False,0,2,154
2,1446764586,2716,256,1972,63,48,190,22,False,0,0,132
3,1446765723,3085,4,1924,51,3,40,22,False,0,0,191
4,1446796385,1887,2047,0,0,63,58,22,True,0,0,156


Of these variables, only game_mode, cluster, and perhaps start_time can be known before the match starts. None of them seem useful if the goal is to use players' past performance to predict the match outcome.

radiant_win is the target variable we are trying to predict. A time-based split is probably the best choice for validation here: by holding out future matches we reduce the likelihood of accidentally introducing leakage.
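
A time-based split could look something like the sketch below. The toy frame reuses the five start_time values shown above; the helper name `time_based_split` and its `holdout_frac` parameter are made up for illustration and are not part of this notebook.

```python
import pandas as pd

# Toy stand-in for the matches table; only start_time and radiant_win matter here.
toy_matches = pd.DataFrame(
    {
        "start_time": [1446750112, 1446753078, 1446764586, 1446765723, 1446796385],
        "radiant_win": [True, False, False, False, True],
    },
    index=pd.Index([0, 1, 2, 3, 4], name="match_id"),
)

def time_based_split(df, time_col="start_time", holdout_frac=0.2):
    """Return (earlier, later) frames, with the most recent matches held out."""
    ordered = df.sort_values(time_col)
    n_holdout = max(1, int(round(len(ordered) * holdout_frac)))
    return ordered.iloc[:-n_holdout], ordered.iloc[-n_holdout:]

fit_part, valid_part = time_based_split(toy_matches, holdout_frac=0.4)

# Every validation match starts after every match used for fitting.
print(fit_part.index.tolist(), valid_part.index.tolist())
```

The point of sorting before slicing is that nothing from the held-out period can influence the rows used for fitting, which a random split would not guarantee.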

In [3]:
# since this is a simple example I will use very basic features which are probably not very good.
feature_columns = players.iloc[:3,4:17].columns.tolist()
feature_columns

['gold',
 'gold_spent',
 'gold_per_min',
 'xp_per_min',
 'kills',
 'deaths',
 'assists',
 'denies',
 'last_hits',
 'stuns',
 'hero_damage',
 'hero_healing',
 'tower_damage']

In [4]:
player_groups = players.groupby('account_id')

# These are just the means of the above values, one row for each account
feature_components = player_groups[feature_columns].mean()

In [5]:
# account_id 0 is included even though it represents more than one account:
# it holds the average stats of all players who hide their account ids
feature_components.head()

account_id,gold,gold_spent,gold_per_min,xp_per_min,kills,deaths,assists,denies,last_hits,hero_damage,hero_healing,tower_damage
0,1800.798735,13955.154883,407.672621,447.691653,7.436487,8.029696,11.644845,4.403066,124.977535,12227.711667,427.988298,1232.203666
1,8642.5,21200.0,627.5,667.5,20.5,1.5,13.5,8.0,242.0,31304.5,0.0,2256.0
2,1756.333333,20576.666667,537.666667,520.0,10.0,7.333333,16.666667,2.333333,277.0,14060.666667,1066.666667,3525.666667
3,3307.0,23825.0,613.0,762.0,20.0,3.0,17.0,13.0,245.0,33740.0,243.0,1833.0
4,763.5,12597.5,381.0,480.0,5.5,8.5,10.0,6.0,146.5,11819.0,0.0,324.5


In [6]:
# now to construct match_level features from the components
# account_id is needed to join with feature_components
train_ids = players[['match_id','account_id']]
test_ids = test_players[['match_id','account_id']]

In [7]:
# add player component data to full match and player data
# note if a player is not in the train set but appears in the test set they will have 
# nan values inserted

train_feat_comp = pd.merge(train_ids, feature_components,
                           how='left', left_on='account_id' ,
                           right_index=True)

test_feat_comp = pd.merge(test_ids, feature_components, 
                          how='left', left_on='account_id',
                          right_index=True)

In [8]:
# this is no longer needed now that the join is done 
train_feat_comp.drop(['account_id'], axis=1, inplace=True)
test_feat_comp.drop(['account_id'], axis=1, inplace=True)

# this flattens an entire match into one row, removes the redundant match_ids
# (the first 10 values, one per player), and then drops the unneeded multi-index
# is there a better way to do this?
def unstack_simplify(df):
    return df.unstack().iloc[10:].reset_index(drop=True)

In [9]:
# group by match then combine data for all players in a match into one row
test_feat_group = test_feat_comp.groupby('match_id')
test_feats = test_feat_group.apply(unstack_simplify)

In [10]:
train_feat_group = train_feat_comp.groupby('match_id')
train_feats = train_feat_group.apply(unstack_simplify)

In [11]:
test_feats.head()

match_id,0,1,2,3,4,5,6,7,8,9,...,110,111,112,113,114,115,116,117,118,119
50000,623.5,,,6420.0,1588.25,1879.833333,774.0,3360.0,812.0,2866.916667,...,2814.5,,,286.0,177.5,956.333333,57.5,313.0,960.0,1356.916667
50001,2250.222222,1800.798735,1800.798735,,2358.0,,296.0,1800.798735,,1800.798735,...,725.111111,1232.203666,1232.203666,,3282.0,,1679.5,1232.203666,,1232.203666
50002,1133.0,,2587.272727,2935.0,1800.798735,1800.798735,1800.798735,1800.798735,,1800.798735,...,291.0,,816.090909,1807.0,1232.203666,1232.203666,1232.203666,1232.203666,,1232.203666
50003,,1800.798735,1800.798735,2002.140351,77.0,1800.798735,1800.798735,,,1800.798735,...,,1232.203666,1232.203666,1605.280702,219.0,1232.203666,1232.203666,,,1232.203666
50004,1800.798735,2944.5,1800.798735,1800.798735,1800.798735,521.0,1800.798735,1858.5,1821.736842,1800.798735,...,1232.203666,251.0,1232.203666,1232.203666,1232.203666,345.0,1232.203666,2256.0,2468.947368,1232.203666


In [12]:
for i in range(0,40, 10):
    print(test_feats.iloc[0,i:i+10],'\n')

0     623.500000
1            NaN
2            NaN
3    6420.000000
4    1588.250000
5    1879.833333
6     774.000000
7    3360.000000
8     812.000000
9    2866.916667
Name: 50000, dtype: float64 

10    21967.500000
11             NaN
12             NaN
13    15990.000000
14     9633.750000
15    11432.916667
16     9182.500000
17    20710.000000
18    22040.000000
19    14770.833333
Name: 50000, dtype: float64 

20    653.000000
21           NaN
22           NaN
23    507.000000
24    302.500000
25    358.166667
26    271.000000
27    511.000000
28    455.000000
29    459.500000
Name: 50000, dtype: float64 

30    637.000000
31           NaN
32           NaN
33    505.000000
34    332.250000
35    407.000000
36    294.500000
37    549.000000
38    536.000000
39    478.083333
Name: 50000, dtype: float64 



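Unstack interleaves the data of different players: columns 0-9 hold the first feature for each of the 10 players, columns 10-19 hold the second feature, and so on. The printout above is a visual check that the NaNs show up in a regular pattern, to make sure I didn't make a mistake.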
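
To make the layout concrete, here is a small sketch on a toy match with 3 players and 2 features instead of 10 and 12. It uses the same unstack/iloc/reset_index chain as `unstack_simplify`; the `names` list is a hypothetical labelling added only to show which position belongs to which feature and player slot.

```python
import pandas as pd

# One toy "match" with 3 players and 2 feature columns.
toy = pd.DataFrame(
    {
        "match_id": [7, 7, 7],
        "gold": [100.0, 200.0, 300.0],
        "kills": [1.0, 2.0, 3.0],
    }
)
n_players = len(toy)
features = ["gold", "kills"]

# Same steps as unstack_simplify: flatten column by column, drop the
# leading block of repeated match_ids, then discard the multi-index.
flat = toy.unstack().iloc[n_players:].reset_index(drop=True)

# Position in the flat row is feature_index * n_players + player_slot.
names = [f"{feat}_p{slot}" for feat in features for slot in range(n_players)]

print(dict(zip(names, flat.tolist())))
```

So a player missing from the train set shows up as a NaN at the same offset inside every block of 10, which is the pattern visible in the printout above.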

Below you can see that most matches in the test set include players who are not in the train set, and this does not account for hidden account_ids.

In [13]:
row_nans = test_feats.isnull().sum(axis=1)
nan_counts = row_nans.value_counts()
nan_counts = nan_counts.reset_index()

nan_counts.columns = ['num_missing_players','count']
# each player who didn't exist in the train set contributes 12 NaNs (one per feature),
# so dividing the NaN count by 12 gives the number of players missing from a match
nan_counts.loc[:, 'num_missing_players'] = (nan_counts.loc[:,'num_missing_players']/12).astype(int)
nan_counts

,num_missing_players,count
0,2,21243
1,3,20322
2,1,16113
3,4,16046
4,5,10239
5,0,6897
6,6,5480
7,7,2422
8,8,896
9,9,272


In [14]:
rf = ensemble.RandomForestClassifier(n_estimators=150, n_jobs=-1)
rf.fit(train_feats,train_labels) 


# this is a bad way to deal with missing values 
test_feats.replace(np.nan, 0, inplace=True)

test_probs = rf.predict_proba(test_feats)
test_preds = rf.predict(test_feats)
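
One alternative to the zero fill, sketched here on toy data, is to fill each missing value with the training mean of its column and keep a count of how many values were missing. Whether this actually helps on this dataset is untested; the frames `toy_train` and `toy_test` and the variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy feature frames with integer column labels, like train_feats / test_feats.
toy_train = pd.DataFrame({0: [100.0, 300.0], 1: [10.0, 30.0]})
toy_test = pd.DataFrame({0: [np.nan, 250.0], 1: [20.0, np.nan]})

# Column means are computed on the training features only.
train_means = toy_train.mean()

# Record how many values were missing in each row before filling.
n_missing = toy_test.isnull().sum(axis=1)

# fillna with a Series fills each column using the value at the matching label.
filled = toy_test.fillna(train_means)

print(filled)
print(n_missing.tolist())
```

A zero tells the model that an unknown player earned no gold and got no kills, while a training mean treats them as an average player; the missing count could be added as an extra feature so the model can still see that information was absent.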

In [15]:
metrics.log_loss(test_labels.values.ravel(), test_probs[:,1])

0.77003894887615809

In [16]:
metrics.roc_auc_score(test_labels.values, test_probs[:,1])

0.50435543650989145

In [17]:
print(metrics.classification_report(test_labels.values, test_preds))

             precision    recall  f1-score   support

          0       0.49      0.45      0.46     48139
          1       0.52      0.56      0.54     51861

avg / total       0.50      0.51      0.50    100000



Having mostly just competed on Kaggle, now I have to think about what the metrics mean ;) With an AUC of about 0.50 the model is barely better than guessing, which is nowhere near as good as I would like, but with the features I used that is to be expected.
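
One way to put the log loss in context is to compare it with what a constant prediction would score. The sketch below uses the class counts from the classification report above; the helper `constant_log_loss` is a name made up for this example.

```python
import math

def constant_log_loss(p, positive_rate):
    """Log loss of always predicting probability p for the positive class."""
    return -(positive_rate * math.log(p) + (1 - positive_rate) * math.log(1 - p))

# Support values from the classification report: class 1 and class 0.
n_pos, n_neg = 51861, 48139
positive_rate = n_pos / (n_pos + n_neg)

# Always predicting 0.5 gives ln(2), roughly 0.693, whatever the class balance.
coin_flip = constant_log_loss(0.5, positive_rate)

# Always predicting the observed positive rate does slightly better than that.
base_rate = constant_log_loss(positive_rate, positive_rate)

model_log_loss = 0.77003894887615809  # value reported above

print(coin_flip, base_rate, model_log_loss)
```

Both constant baselines come out below 0.770, so by this metric the random forest's probabilities are worse than not looking at the features at all.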

I am more concerned about whether this is the right approach to predicting match outcomes from player histories (or whether I have a bug). Given the number of missing players in the test set, it also seems likely that a larger dataset would be useful.
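
One thing that may be worth checking, offered as an untested guess and not a diagnosis: feature_components averages each player's stats over every training match, so the features for a training match include stats from that same match. A version that only looks backwards could be sketched as follows. The helper `prior_mean` is hypothetical, and the sketch assumes match_id increases with start_time, which holds for the five rows shown earlier but is not verified for the full table.

```python
import pandas as pd

# Toy stand-in for the players table: one row per player per match.
toy_players = pd.DataFrame(
    {
        "match_id":   [0, 0, 1, 1, 2, 2],
        "account_id": [11, 12, 11, 13, 11, 12],
        "kills":      [4.0, 8.0, 6.0, 2.0, 5.0, 10.0],
    }
)

def prior_mean(df, value_col, player_col="account_id", order_col="match_id"):
    """Per-row mean of value_col over the same player's earlier matches only."""
    ordered = df.sort_values(order_col)
    groups = ordered.groupby(player_col)
    # Running total per player, minus the current row, leaves the earlier rows.
    prior_sum = groups[value_col].cumsum() - ordered[value_col]
    # cumcount is the number of earlier rows for that player (0 for the first).
    prior_count = groups.cumcount()
    # A player's first match has no history, so it is left as NaN.
    result = prior_sum / prior_count.where(prior_count > 0)
    return result.reindex(df.index)

toy_players["kills_prior_mean"] = prior_mean(toy_players, "kills")
print(toy_players)
```

With features built this way, a training row sees the same kind of information a test row does: only what happened before the match. The shared account_id 0 would probably need separate handling.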

There are other tasks besides predicting match outcomes, such as predicting win rate, which should be reasonably easy to set up.