# 0. Project Overview

# 1. Data Collection  
Data collection is not done in this notebook because it can take a long time, so it runs in a separate script. Furthermore, the API used by the Python code isn't supported by Anaconda. To collect the dataset, run the following commands:
1. pip3 install -r ../dota2/requirements.txt
2. python3 ../dota2/src/api_wrapper.py  

The dataset will be saved in *../dota2/src/data/*  

## 1.1 Files  
1. match.json
2. heroes.json
3. items.json

# 2. Data Discovery  
The data is already well structured in *.json* format, so we start by inspecting its contents. Python's built-in json module reads *.json* files, so no additional package needs to be installed.

Next we read the contents of the JSON files. We start with *match.json*, since this is what the study focuses on.

In [1]:
import json
with open('data/match.json') as f:
    raw_match_data = json.load(f)

# print first value
for keys in raw_match_data:
    print(raw_match_data[keys])
    break

{'game_mode': 22, 'tower_status_dire': 1828, 'match_id': 3933472768, 'negative_votes': 0, 'picks_bans': [{'order': 0, 'hero_id': 32, 'team': 0, 'is_pick': False}, {'order': 1, 'hero_id': 44, 'team': 0, 'is_pick': False}, {'order': 2, 'hero_id': 105, 'team': 0, 'is_pick': False}], 'players': [{'gold_per_min': 382, 'leaver_status_name': 'NONE', 'item_5_name': 'Assault Cuirass', 'assists': 18, 'backpack_1': 0, 'item_2_name': 'Urn of Shadows', 'scaled_tower_damage': 58, 'last_hits': 119, 'backpack_2': 0, 'account_id': 167415941, 'backpack_0': 46, 'item_1_name': 'Dust of Appearance', 'item_0': 127, 'hero_name': 'Spirit Breaker', 'hero_healing': 280, 'level': 25, 'item_3_name': 'Power Treads', 'item_0_name': 'Blade Mail', 'item_4_name': "Heaven's Halberd", 'denies': 2, 'ability_upgrades': [{'ability': 5353, 'time': 276, 'level': 1}, {'ability': 5355, 'time': 462, 'level': 2}, {'ability': 5353, 'time': 629, 'level': 3}, {'ability': 5355, 'time': 730, 'level': 4}, {'ability': 5354, 'time': 897

We notice that there is a huge amount of information. Most of it comes from the **players** key of the dictionary.

## 2.1 Breaking down the matches dictionary  
### 2.1.1  Analyzing the player key
We first separate the **players** key from the rest of the data.

In [2]:
for keys in raw_match_data:
    print(raw_match_data[keys]['players'][0])
    break

{'gold_per_min': 382, 'leaver_status_name': 'NONE', 'item_5_name': 'Assault Cuirass', 'assists': 18, 'backpack_1': 0, 'item_2_name': 'Urn of Shadows', 'scaled_tower_damage': 58, 'last_hits': 119, 'backpack_2': 0, 'account_id': 167415941, 'backpack_0': 46, 'item_1_name': 'Dust of Appearance', 'item_0': 127, 'hero_name': 'Spirit Breaker', 'hero_healing': 280, 'level': 25, 'item_3_name': 'Power Treads', 'item_0_name': 'Blade Mail', 'item_4_name': "Heaven's Halberd", 'denies': 2, 'ability_upgrades': [{'ability': 5353, 'time': 276, 'level': 1}, {'ability': 5355, 'time': 462, 'level': 2}, {'ability': 5353, 'time': 629, 'level': 3}, {'ability': 5355, 'time': 730, 'level': 4}, {'ability': 5354, 'time': 897, 'level': 5}, {'ability': 5356, 'time': 1041, 'level': 6}, {'ability': 5353, 'time': 1152, 'level': 7}, {'ability': 5353, 'time': 1200, 'level': 8}, {'ability': 5354, 'time': 1255, 'level': 9}, {'ability': 5932, 'time': 1293, 'level': 10}, {'ability': 5355, 'time': 1574, 'level': 11}, {'abil

In [3]:
import pandas as pd
for keys in raw_match_data:
    try:
        del raw_match_data[keys]['players'][0]['ability_upgrades']
        player_0 = raw_match_data[keys]['players'][0]
        player = pd.DataFrame(player_0, index=[0])        
    except KeyError:
        pass
    break

In [4]:
player.iloc[0]

account_id                                    167415941
assists                                              18
backpack_0                                           46
backpack_1                                            0
backpack_2                                            0
deaths                                                8
denies                                                2
gold                                                833
gold_per_min                                        382
gold_spent                                        16355
hero_damage                                       22012
hero_healing                                        280
hero_id                                              71
hero_name                                Spirit Breaker
item_0                                              127
item_0_name                                  Blade Mail
item_1                                               40
item_1_name                          Dust of App

### 2.1.2  Analyzing the rest of the keys 
Next we look at the remaining keys.

In [5]:
for keys in raw_match_data:
    for k in raw_match_data[keys]:
        if k == 'players':
            pass
        else:
            print(k + ": " + str(raw_match_data[keys][k]))
    break

game_mode: 22
tower_status_dire: 1828
match_id: 3933472768
negative_votes: 0
picks_bans: [{'order': 0, 'hero_id': 32, 'team': 0, 'is_pick': False}, {'order': 1, 'hero_id': 44, 'team': 0, 'is_pick': False}, {'order': 2, 'hero_id': 105, 'team': 0, 'is_pick': False}]
flags: 1
radiant_score: 36
barracks_status_dire: 63
engine: 1
start_time: 1528135893
duration: 2765
barracks_status_radiant: 0
game_mode_name: Ranked All Pick
leagueid: 0
positive_votes: 0
pre_game_duration: 90
cluster_name: Southeast Asia
match_seq_num: 3412777443
cluster: 153
tower_status_radiant: 0
lobby_type: 0
radiant_win: False
first_blood_time: 83
lobby_name: Public matchmaking
dire_score: 48
human_players: 10


Having inspected the structure, we can start creating the tables. We omit the **ability_upgrades** key since it varies in length and is not that impactful. We also separate the player data from the rest of the match data, as the two do not have the same dimensions.
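
The split described above can be sketched on a toy record. This is only an illustration under assumptions: `split_match` and `toy_match` are hypothetical names that do not appear in the notebook, and the toy record keeps just a few of the real keys.

```python
def split_match(match_id, match):
    """Return (meta, players) for one match record without mutating it."""
    # Everything except the nested player list is match-level metadata.
    meta = {k: v for k, v in match.items() if k != "players"}
    players = []
    for player in match.get("players", []):
        # Drop the variable-length ability_upgrades list from each player row.
        row = {k: v for k, v in player.items() if k != "ability_upgrades"}
        # Keep a reference back to the match so the two tables can be joined.
        row["match_id"] = match_id
        players.append(row)
    return meta, players


# Toy record (hypothetical, heavily shortened).
toy_match = {
    "match_id": 1,
    "radiant_win": False,
    "players": [
        {"hero_name": "Spirit Breaker", "kills": 9,
         "ability_upgrades": [{"ability": 5353, "time": 276, "level": 1}]},
        {"hero_name": "Slark", "kills": 8},
    ],
}

meta, players = split_match("1", toy_match)
print(meta)
print(players)
```

Building new dictionaries instead of deleting keys in place leaves the loaded JSON untouched, so the file would not need to be reloaded before the player table is built.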

## 2.2 Analyzing the items dictionary

In [6]:
with open('data/items.json') as f:
    raw_item_data = json.load(f)

In [7]:
for key in raw_item_data:
    print(key)

items
status


In [8]:
print(type(raw_item_data['items']))

<class 'list'>


In [9]:
print(raw_item_data['items'][0])

{'id': 1, 'localized_name': 'Blink Dagger', 'name': 'item_blink', 'recipe': 0, 'secret_shop': 0, 'url_image': 'http://cdn.dota2.com/apps/dota2/images/items/blink_lg.png', 'cost': 2250, 'side_shop': 1}


## 2.3 Analyzing the heroes dictionary

In [10]:
with open('data/heroes.json') as f:
    raw_hero_data = json.load(f)
pd.DataFrame(raw_hero_data).head()

Unnamed: 0,count,heroes,status
0,115,{'url_small_portrait': 'http://cdn.dota2.com/a...,200
1,115,{'url_small_portrait': 'http://cdn.dota2.com/a...,200
2,115,{'url_small_portrait': 'http://cdn.dota2.com/a...,200
3,115,{'url_small_portrait': 'http://cdn.dota2.com/a...,200
4,115,{'url_small_portrait': 'http://cdn.dota2.com/a...,200


In [11]:
for key in raw_hero_data:
    print(key)

heroes
status
count


In [12]:
print(raw_hero_data['count'])

115


In [13]:
print(type(raw_hero_data['heroes']))

<class 'list'>


In [14]:
print(raw_hero_data['heroes'][0])

{'url_small_portrait': 'http://cdn.dota2.com/apps/dota2/images/heroes/antimage_sb.png', 'id': 1, 'name': 'npc_dota_hero_antimage', 'localized_name': 'Anti-Mage', 'url_large_portrait': 'http://cdn.dota2.com/apps/dota2/images/heroes/antimage_lg.png', 'url_full_portrait': 'http://cdn.dota2.com/apps/dota2/images/heroes/antimage_full.png', 'url_vertical_portrait': 'http://cdn.dota2.com/apps/dota2/images/heroes/antimage_vert.jpg'}


## 2.4 Analyzing game complexity

In [15]:
import math
item_count = len(raw_item_data['items'])
heroes_count = len(raw_hero_data['heroes'])
heroes_per_game = 10
items_per_hero = 6

# hero combinations without repetition
numerator = math.factorial(heroes_count)
denominator = math.factorial(heroes_per_game) * math.factorial(heroes_count - heroes_per_game)
hero_combinations = numerator // denominator

# item combinations with repetition
numerator = math.factorial(item_count + items_per_hero - 1)
denominator = math.factorial(item_count - 1) * math.factorial(items_per_hero)
item_combinations = numerator // denominator

In [16]:
print(hero_combinations)
print(item_combinations)

74540394223878
594115882360


In [17]:
# Theorycraft combinations
theory_craft_combination = hero_combinations * item_combinations
print(theory_craft_combination)
'{:.2e}'.format(theory_craft_combination)

44285632085781525350992080


'4.43e+25'
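
The two counts computed above are binomial coefficients: C(n, k) = n! / (k! (n - k)!) for choices without repetition and C(n + k - 1, k) for choices with repetition. As a cross-check, the same values can be obtained with `math.comb`, which is available from Python 3.8 onward. The helper names below are hypothetical and only serve the illustration.

```python
import math


def combinations_without_repetition(n, k):
    """Number of ways to choose k distinct elements out of n."""
    return math.comb(n, k)


def combinations_with_repetition(n, k):
    """Number of multisets of size k drawn from n element types."""
    return math.comb(n + k - 1, k)


heroes_count = 115  # value of raw_hero_data['count'] printed earlier
print(combinations_without_repetition(heroes_count, 10))

# Small sanity checks that are easy to verify by hand.
print(combinations_without_repetition(5, 2))  # 10 pairs out of 5 elements
print(combinations_with_repetition(3, 2))     # 6 multisets: aa ab ac bb bc cc
```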

## 2.5 Reducing game complexity with data classification

### 2.5.1 Analyzing hero roles

In [18]:
with open('data/hero_role.txt', 'r') as f:
    raw_hero_role_data = f.read()
print(raw_hero_role_data[:1000])

{| class="wikitable"
!colspan=8| ■■■ Carry
|-
|{{HeroIcon|am}}{{HeroIcon|arc}}{{HeroIcon|ck}}{{HeroIcon|gyro}}{{HeroIcon|medusa}}{{HeroIcon|morph}}{{HeroIcon|naga}}{{HeroIcon|pa}}<br>{{HeroIcon|sniper}}{{HeroIcon|spectre}}{{HeroIcon|tb}}{{HeroIcon|tiny}}{{HeroIcon|troll}}
|-
!colspan=8| ■■ Carry
|-
|{{HeroIcon|alch}}{{HeroIcon|bb}}{{HeroIcon|dk}}{{HeroIcon|huskar}}{{HeroIcon|ls}}{{HeroIcon|lycan}}{{HeroIcon|mk}}{{HeroIcon|slardar}}<br>{{HeroIcon|sven}}{{HeroIcon|wk}}{{HeroIcon|clinkz}}{{HeroIcon|drow}}{{HeroIcon|ember}}{{HeroIcon|void}}{{HeroIcon|jugg}}{{HeroIcon|ld}}<br>{{HeroIcon|luna}}{{HeroIcon|meepo}}{{HeroIcon|pl}}{{HeroIcon|razor}}{{HeroIcon|riki}}{{HeroIcon|sf}}{{HeroIcon|slark}}{{HeroIcon|ta}}<br>{{HeroIcon|ursa}}{{HeroIcon|weaver}}{{HeroIcon|od}}{{HeroIcon|storm}}{{HeroIcon|pangolier}}
|-
!colspan=8| ■ Carry
|-
|{{HeroIcon|abaddon}}{{HeroIcon|brew}}{{HeroIcon|doom}}{{HeroIcon|kunkka}}{{HeroIcon|lc}}{{HeroIcon|ns}}{{HeroIcon|sb}}{{HeroIcon|bs}}<br>{{HeroIcon|brood}}{{HeroIcon|

In [19]:
delimiter_start = "■■■ "
delimiter_end = "\n|-\n|"
delimited_roles = raw_hero_role_data.split(delimiter_start)
roles_list = []
for dr in delimited_roles:
    roles_list.append(dr.split(delimiter_end)[0])
roles_list = roles_list[1:]
print(roles_list)

['Carry', 'Nuker', 'Initiator', 'Disabler', 'Escape', 'Support', 'Pusher', 'Jungler']


### 2.5.2 Analyzing hero complexity

In [20]:
with open('data/hero_complexity.txt') as f:
    raw_hero_complexity_data = f.read()
print(raw_hero_complexity_data[500:1000])

y styles.

{| class="wikitable"
!colspan=8| ■■■ Complexity
|-
|{{HeroIcon|brewmaster}}{{HeroIcon|earth spirit}}{{HeroIcon|io}}{{HeroIcon|aw}}{{HeroIcon|ld}}{{HeroIcon|meepo}}{{HeroIcon|morph}}{{HeroIcon|chen}}<br>{{HeroIcon|invoker}}{{HeroIcon|oracle}}{{HeroIcon|rubick}}{{HeroIcon|storm}}{{HeroIcon|visage}}
|-
!colspan=8| ■■ Complexity
|-
|{{HeroIcon|beastmaster}}{{HeroIcon|clock}}{{HeroIcon|doom}}{{HeroIcon|earthshaker}}{{HeroIcon|et}}{{HeroIcon|kunkka}}{{HeroIcon|lifestealer}}{{HeroIcon|lycan}


# 3. Cleaning data & Structuring

## 3.1 Cleaning hero roles

In [21]:
updated_hero_role_data = ""
updated_hero_role_data = raw_hero_role_data.replace("{{HeroIcon|", "|")
updated_hero_role_data = updated_hero_role_data.replace("}}", "| ")
updated_hero_role_data = updated_hero_role_data.replace("|-\n", "")
updated_hero_role_data = updated_hero_role_data.replace("<br>", "")
print(updated_hero_role_data[:500])

{| class="wikitable"
!colspan=8| ■■■ Carry
||am| |arc| |ck| |gyro| |medusa| |morph| |naga| |pa| |sniper| |spectre| |tb| |tiny| |troll| 
!colspan=8| ■■ Carry
||alch| |bb| |dk| |huskar| |ls| |lycan| |mk| |slardar| |sven| |wk| |clinkz| |drow| |ember| |void| |jugg| |ld| |luna| |meepo| |pl| |razor| |riki| |sf| |slark| |ta| |ursa| |weaver| |od| |storm| |pangolier| 
!colspan=8| ■ Carry
||abaddon| |brew| |doom| |kunkka| |lc| |ns| |sb| |bs| |brood| |mirana| |viper| |dp| |invoker| |leshrac| |lina| |np| |n


In [22]:
import re
# Raw string so that "\|" is passed to the regex engine as a literal pipe
hero_regex = re.compile(r"\|.*\|")
hero_regex_matches = re.findall(hero_regex, updated_hero_role_data)
for i in range(len(hero_regex_matches)):
    hero_regex_matches[i] = hero_regex_matches[i].replace("|", ", ")
print(hero_regex_matches)

[', , am,  , arc,  , ck,  , gyro,  , medusa,  , morph,  , naga,  , pa,  , sniper,  , spectre,  , tb,  , tiny,  , troll, ', ', , alch,  , bb,  , dk,  , huskar,  , ls,  , lycan,  , mk,  , slardar,  , sven,  , wk,  , clinkz,  , drow,  , ember,  , void,  , jugg,  , ld,  , luna,  , meepo,  , pl,  , razor,  , riki,  , sf,  , slark,  , ta,  , ursa,  , weaver,  , od,  , storm,  , pangolier, ', ', , abaddon,  , brew,  , doom,  , kunkka,  , lc,  , ns,  , sb,  , bs,  , brood,  , mirana,  , viper,  , dp,  , invoker,  , leshrac,  , lina,  , np,  , necro,  , qop,  , silencer,  , tinker,  , wr, ', ', , phoenix,  , timber,  , sf,  , invoker,  , lesh,  , lina,  , lion,  , oracle,  , qop,  , skywrath,  , techies,  , tinker,  , zeus, ', ', , earth spirit,  , sk,  , tiny,  , luna,  , meepo,  , nyx,  , cm,  , jakiro,  , kotl,  , lich,  , necro,  , ogre,  , od,  , puck,  , pugna,  , shaman,  , storm,  , visage,  , wd,  , pangolier,  , dark willow, ', ', , alch,  , beastmaster,  , brew,  , bb,  , centaur,  ,
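
As an alternative to the chain of replacements and the greedy pattern above, the abbreviations could be captured straight from the wiki markup with a single capture group. This is a sketch, not the notebook's method: `HERO_ICON` and `sample_row` are hypothetical names, and the sample row is assembled from icons that appear in the raw text shown earlier.

```python
import re

# Capture whatever sits between "{{HeroIcon|" and the closing "}}".
HERO_ICON = re.compile(r"\{\{HeroIcon\|([^}]+)\}\}")

# Toy row in the same markup as data/hero_role.txt (shortened).
sample_row = ("|{{HeroIcon|am}}{{HeroIcon|arc}}<br>"
              "{{HeroIcon|sniper}}{{HeroIcon|earth spirit}}")

abbreviations = HERO_ICON.findall(sample_row)
print(abbreviations)  # ['am', 'arc', 'sniper', 'earth spirit']
```

Because the capture group excludes `}`, names that contain spaces such as `earth spirit` come out whole, and the `<br>` separators never need to be stripped.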

In [23]:
with open("data/hero_abbreviations.txt", 'r') as f:
    sample = f.read()
print(sample[:100])

Abaddon
Alch = Alchemist
AA = Ancient Apparition
AM = Anti-Mage
Arc = Arc Warden
Axe
Bane
Bat = Batr


In [24]:
file = open("data/hero_abbreviations.txt", 'r')
hero_abbr = {}
for line in file:
    txt = line.lower()
    key = txt.split(' =')[0]
    try:
        value = txt.split('= ')[1]
        value = value.replace("\n", "")        
    except IndexError:
        key = key.replace("\n", "")
        value = key        
    hero_abbr[key] = value
    hero_abbr[value] = value
file.close()
print(hero_abbr)

{'abaddon': 'abaddon', 'alch': 'alchemist', 'alchemist': 'alchemist', 'aa': 'ancient apparition', 'ancient apparition': 'ancient apparition', 'am': 'anti-mage', 'anti-mage': 'anti-mage', 'arc': 'arc warden', 'arc warden': 'arc warden', 'axe': 'axe', 'bane': 'bane', 'bat': 'batrider', 'batrider': 'batrider', 'bm': 'beastmaster', 'beastmaster': 'beastmaster', 'beast': 'beastmaster', 'bs': 'bloodseeker', 'bloodseeker': 'bloodseeker', 'bh': 'bounty hunter', 'bounty hunter': 'bounty hunter', 'brew': 'brewmaster', 'brewmaster': 'brewmaster', 'bb': 'bristleback', 'bristleback': 'bristleback', 'brood': 'broodmother', 'broodmother': 'broodmother', 'centaur': 'centaur warrunner', 'centaur warrunner': 'centaur warrunner', 'cent': 'centaur warrunner', 'ck': 'chaos knight', 'chaos knight': 'chaos knight', 'chen': 'chen', 'clinkz': 'clinkz', 'clock': 'clockwerk', 'clockwerk': 'clockwerk', 'cm': 'crystal maiden', 'crystal maiden': 'crystal maiden', 'ds': 'dark seer', 'dark seer': 'dark seer', 'dazzle

In [25]:
cleaned_hero_names = []
for i in range(len(hero_regex_matches)):
    line = hero_regex_matches[i].split(", ")        
    tmp_list = []
    for j in range(len(line)):
        if line[j] == ' ' or line[j] == '':
            pass
        else:
            tmp_list.append(hero_abbr[line[j]])
    cleaned_hero_names.append(tmp_list)
print(cleaned_hero_names)

[['anti-mage', 'arc warden', 'chaos knight', 'gyrocopter', 'medusa', 'morphling', 'naga siren', 'phantom assassin', 'sniper', 'spectre', 'terrorblade', 'tiny', 'troll warlord'], ['alchemist', 'bristleback', 'dragon knight', 'huskar', 'lifestealer', 'lycan', 'monkey king', 'slardar', 'sven', 'wraith king', 'clinkz', 'drow ranger', 'ember spirit', 'faceless void', 'juggernaut', 'lone druid', 'luna', 'meepo', 'phantom lancer', 'razor', 'riki', 'shadow fiend', 'slark', 'templar assassin', 'ursa', 'weaver', 'outworld devourer', 'storm spirit', 'pangolier'], ['abaddon', 'brewmaster', 'doom', 'kunkka', 'legion commander', 'night stalker', 'spirit breaker', 'bloodseeker', 'broodmother', 'mirana', 'viper', 'death prophet', 'invoker', 'leshrac', 'lina', "nature's prophet", 'necrophos', 'queen of pain', 'silencer', 'tinker', 'windranger'], ['phoenix', 'timbersaw', 'shadow fiend', 'invoker', 'leshrac', 'lina', 'lion', 'oracle', 'queen of pain', 'skywrath mage', 'techies', 'tinker', 'zeus'], ['eart

In [26]:
cleaned_roles_dict = {}
role_index = -1
for i in range(len(cleaned_hero_names)):    
    role_strength = i % 3    

    if role_strength == 0:
        role_index += 1
        hard_role = 'hard-' + roles_list[role_index].lower()
        cleaned_roles_dict[hard_role] = cleaned_hero_names[i]
        
    elif role_strength == 1:
        semi_role = 'semi-' + roles_list[role_index].lower()
        cleaned_roles_dict[semi_role] = cleaned_hero_names[i]
    
    elif role_strength == 2:
        weak_role = 'weak-' + roles_list[role_index].lower()        
        cleaned_roles_dict[weak_role] = cleaned_hero_names[i]    
print(cleaned_roles_dict)

{'hard-carry': ['anti-mage', 'arc warden', 'chaos knight', 'gyrocopter', 'medusa', 'morphling', 'naga siren', 'phantom assassin', 'sniper', 'spectre', 'terrorblade', 'tiny', 'troll warlord'], 'semi-carry': ['alchemist', 'bristleback', 'dragon knight', 'huskar', 'lifestealer', 'lycan', 'monkey king', 'slardar', 'sven', 'wraith king', 'clinkz', 'drow ranger', 'ember spirit', 'faceless void', 'juggernaut', 'lone druid', 'luna', 'meepo', 'phantom lancer', 'razor', 'riki', 'shadow fiend', 'slark', 'templar assassin', 'ursa', 'weaver', 'outworld devourer', 'storm spirit', 'pangolier'], 'weak-carry': ['abaddon', 'brewmaster', 'doom', 'kunkka', 'legion commander', 'night stalker', 'spirit breaker', 'bloodseeker', 'broodmother', 'mirana', 'viper', 'death prophet', 'invoker', 'leshrac', 'lina', "nature's prophet", 'necrophos', 'queen of pain', 'silencer', 'tinker', 'windranger'], 'hard-nuker': ['phoenix', 'timbersaw', 'shadow fiend', 'invoker', 'leshrac', 'lina', 'lion', 'oracle', 'queen of pain

In [27]:
role_bits = {
    'role': {}
}

index = 0
for role in cleaned_roles_dict:
    role_bits['role'][role] = index
    index += 1
role_bits_df = pd.DataFrame(role_bits)     
role_bits_df.head()

Unnamed: 0,role
hard-carry,0
hard-disabler,9
hard-escape,12
hard-initiator,6
hard-jungler,21


In [28]:
with open('data/role_bits.json', 'w') as outfile:
    json.dump(role_bits, outfile)

In [29]:
hero_roles_dict = {}
for key in cleaned_roles_dict:
    for hero in cleaned_roles_dict[key]:
        if hero in hero_roles_dict:
            hero_roles_dict[hero]['score'] += 2**role_bits['role'][key]            
        else:
            hero_roles_dict[hero] = {
                'score': 2**role_bits['role'][key],
                'hard-carry': 0, 
                'hard-disabler': 0, 
                'hard-escape': 0, 
                'hard-initiator': 0,
                'hard-jungler': 0, 
                'hard-nuker': 0, 
                'hard-pusher': 0, 
                'hard-support': 0,
                'semi-carry': 0, 
                'semi-disabler': 0, 
                'semi-escape': 0, 
                'semi-initiator': 0,
                'semi-jungler': 0, 
                'semi-nuker': 0, 
                'semi-pusher': 0, 
                'semi-support': 0,
                'weak-carry': 0, 
                'weak-disabler': 0, 
                'weak-escape': 0, 
                'weak-initiator': 0,
                'weak-jungler': 0, 
                'weak-nuker': 0, 
                'weak-pusher': 0, 
                'weak-support': 0
            }
        hero_roles_dict[hero][key] = 1
hero_scores_df = pd.DataFrame(hero_roles_dict).transpose()
hero_scores_df = hero_scores_df.reset_index()
hero_scores_df = hero_scores_df.rename({'index':'hero_name'}, axis='columns')
hero_scores_df.head()

Unnamed: 0,hero_name,hard-carry,hard-disabler,hard-escape,hard-initiator,hard-jungler,hard-nuker,hard-pusher,hard-support,score,...,semi-pusher,semi-support,weak-carry,weak-disabler,weak-escape,weak-initiator,weak-jungler,weak-nuker,weak-pusher,weak-support
0,abaddon,0,0,0,0,0,0,0,0,65540,...,0,1,1,0,0,0,0,0,0,0
1,alchemist,0,0,0,0,0,0,0,0,133410,...,0,0,0,1,0,1,0,1,0,1
2,ancient apparition,0,0,0,0,0,0,0,0,67616,...,0,1,0,1,0,0,0,1,0,0
3,anti-mage,1,0,1,0,0,0,0,0,4129,...,0,0,0,0,0,0,0,1,0,0
4,arc warden,1,0,1,0,0,0,0,0,4129,...,0,0,0,0,0,0,0,1,0,0
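
The **score** column packs a hero's roles into one integer: each role label owns one bit, and the score is the sum of 2 raised to those bit positions. The sketch below decodes a score back into labels. It assumes the bit order implied by the outputs above (roles in the order of the roles list, each split into hard, semi and weak); `ROLE_BITS` and `decode_score` are hypothetical names.

```python
# Role order reconstructed from the roles list printed in section 2.5.1.
ROLES = ["carry", "nuker", "initiator", "disabler",
         "escape", "support", "pusher", "jungler"]
STRENGTHS = ["hard", "semi", "weak"]

# Bit position of every label, e.g. 'hard-carry' -> 0, 'semi-carry' -> 1.
ROLE_BITS = {
    f"{strength}-{role}": 3 * r + s
    for r, role in enumerate(ROLES)
    for s, strength in enumerate(STRENGTHS)
}


def decode_score(score):
    """Return the role labels whose bit is set in score, lowest bit first."""
    return [label for label, bit in ROLE_BITS.items() if (score >> bit) & 1]


print(decode_score(65540))  # abaddon: ['weak-carry', 'semi-support']
print(decode_score(4129))   # anti-mage: ['hard-carry', 'weak-nuker', 'hard-escape']
```

The decoded labels agree with the one-hot columns in the table above: 65540 = 2^16 + 2^2 and 4129 = 2^12 + 2^5 + 2^0.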


In [30]:
with open('data/heroes.json') as f:
    raw_hero_data = json.load(f)
    
hero_list = raw_hero_data['heroes']

# Lowercase the display names so they match the hero_name values in hero_scores_df
for dictionary in hero_list:
    dictionary['localized_name'] = dictionary['localized_name'].lower()
hero_roles_df = pd.DataFrame(hero_list) 

hero_roles_df = hero_roles_df.reset_index()
hero_roles_df = hero_roles_df.rename({'localized_name':'hero_name','id': 'hero_id'}, axis='columns')
drop_columns = [
    'name',
    'url_full_portrait',
    'url_large_portrait',
    'url_small_portrait',
    'url_vertical_portrait',
    'index'
]
hero_roles_df = hero_roles_df.drop(columns=drop_columns)
hero_roles_df.head()

Unnamed: 0,hero_id,hero_name
0,1,anti-mage
1,2,axe
2,3,bane
3,4,bloodseeker
4,5,crystal maiden


In [31]:
hero_roles_df = pd.merge(hero_roles_df, hero_scores_df, on='hero_name')
hero_roles_df.head()

Unnamed: 0,hero_id,hero_name,hard-carry,hard-disabler,hard-escape,hard-initiator,hard-jungler,hard-nuker,hard-pusher,hard-support,...,semi-pusher,semi-support,weak-carry,weak-disabler,weak-escape,weak-initiator,weak-jungler,weak-nuker,weak-pusher,weak-support
0,1,anti-mage,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,2,axe,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,bane,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
3,4,bloodseeker,0,0,0,0,0,0,0,0,...,0,0,1,1,0,1,1,1,0,0
4,5,crystal maiden,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0


In [32]:
with open('data/hero_roles.json', 'w') as outfile:
    outfile.write(hero_roles_df.to_json(orient='records', lines=True))
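
With `orient='records', lines=True` the file holds one JSON object per line (the JSON Lines layout) rather than a single JSON document, so `json.load` on the whole file would fail. A minimal sketch of reading such a file back with the standard library follows; the two toy records are hypothetical and keep only three of the real columns.

```python
import io
import json

# Stand-in for open('data/hero_roles.json'): two toy records, one per line.
toy_file = io.StringIO(
    '{"hero_id":1,"hero_name":"anti-mage","hard-carry":1}\n'
    '{"hero_id":2,"hero_name":"axe","hard-carry":0}\n'
)

# Parse each non-empty line as its own JSON object.
records = [json.loads(line) for line in toy_file if line.strip()]

hero_id_by_name = {rec["hero_name"]: rec["hero_id"] for rec in records}
print(hero_id_by_name)  # {'anti-mage': 1, 'axe': 2}
```

In pandas the same file can be loaded with `pd.read_json(path, lines=True)`.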

## 3.2 Cleaning Match data  
### 3.2.1 Match meta data

In [33]:
for key in raw_match_data:    
    try:
        del raw_match_data[key]['players']        
    except KeyError:
        pass    
    try:
        del raw_match_data[key]['picks_bans']
    except KeyError:
        pass
            

match_df = pd.DataFrame(raw_match_data)
match_df = match_df.transpose()
match_df.head()

Unnamed: 0,barracks_status_dire,barracks_status_radiant,cluster,cluster_name,dire_captain,dire_score,duration,engine,first_blood_time,flags,...,match_seq_num,negative_votes,positive_votes,pre_game_duration,radiant_captain,radiant_score,radiant_win,start_time,tower_status_dire,tower_status_radiant
3933472394,63,0,133,Europe West,,38,1620,1,70,1,...,3412750547,0,0,60,,29,False,1528135877,1958,0
3933472395,63,0,227,,,35,2300,1,157,1,...,3412770903,0,0,90,,31,False,1528135882,1974,0
3933472396,63,0,183,Russia,,5,781,1,142,1,...,3412736252,0,0,60,,9,,1528135876,2039,0
3933472397,0,63,185,Russia,,27,2277,1,129,1,...,3412770514,0,0,90,,54,True,1528135877,0,1975
3933472398,0,55,187,Russia,,46,2291,1,50,1,...,3412766523,0,0,90,,44,True,1528135875,0,1926


In [34]:
match_df.columns

Index(['barracks_status_dire', 'barracks_status_radiant', 'cluster',
       'cluster_name', 'dire_captain', 'dire_score', 'duration', 'engine',
       'first_blood_time', 'flags', 'game_mode', 'game_mode_name',
       'human_players', 'leagueid', 'lobby_name', 'lobby_type', 'match_id',
       'match_seq_num', 'negative_votes', 'positive_votes',
       'pre_game_duration', 'radiant_captain', 'radiant_score', 'radiant_win',
       'start_time', 'tower_status_dire', 'tower_status_radiant'],
      dtype='object')

In [35]:
drop_list = [
    'dire_captain', 
    'cluster', 
    'cluster_name', 
    'engine', 
    'match_seq_num', 
    'negative_votes', 
    'positive_votes',
    'lobby_type',
    'start_time',
    'radiant_captain',
    'pre_game_duration',    
    'lobby_name',
    'leagueid',
    'flags',
    'game_mode_name',
    'game_mode',
    'first_blood_time',
    'match_id'
]
match_df = match_df.drop(columns = drop_list)
match_df = match_df.dropna(subset = ['radiant_win'])
match_df = match_df.reset_index()
match_df = match_df.rename({'index':'match_id'}, axis='columns')
match_df.head()

Unnamed: 0,match_id,barracks_status_dire,barracks_status_radiant,dire_score,duration,human_players,radiant_score,radiant_win,tower_status_dire,tower_status_radiant
0,3933472394,63,0,38,1620,10,29,False,1958,0
1,3933472395,63,0,35,2300,10,31,False,1974,0
2,3933472397,0,63,27,2277,10,54,True,0,1975
3,3933472398,0,55,46,2291,10,44,True,0,1926
4,3933472399,0,51,48,2861,10,50,True,0,1798


In [36]:
with open('data/match.json') as f:
    raw_match_data = json.load(f)

player_list = []

for keys in raw_match_data:
    for i in range(len(raw_match_data[keys]['players'])):        
        try:
            del raw_match_data[keys]['players'][i]['ability_upgrades']                
        except KeyError:
            pass 
        raw_match_data[keys]['players'][i]['match_id'] = keys        
        player_list.append(raw_match_data[keys]['players'][i])
    
players_df = pd.DataFrame(player_list)

drop_columns = [
    'leaver_status_description',
    'leaver_status_name',    
    'item_0_name',
    'item_1_name',
    'item_2_name',
    'item_3_name',
    'item_4_name',
    'item_5_name',
    'item_0',
    'item_1',
    'item_2',
    'item_3',
    'item_4',
    'item_5',
    'backpack_0',
    'backpack_1',
    'backpack_2',
    'hero_damage',
    'hero_healing',
    'tower_damage',
    'additional_units'
]

players_df = players_df.drop(columns = drop_columns)
players_df = players_df.dropna(subset = ['account_id'])
players_df = players_df.dropna(subset = ['hero_name'])
players_df['account_id'] = players_df['account_id'].astype(int)
players_df['hero_name'] = players_df['hero_name'].str.lower()
players_df.head()

Unnamed: 0,account_id,assists,deaths,denies,gold,gold_per_min,gold_spent,hero_id,hero_name,kills,last_hits,leaver_status,level,match_id,player_slot,scaled_hero_damage,scaled_hero_healing,scaled_tower_damage,xp_per_min
0,167415941,18,8,2,833,382,16355,71,spirit breaker,9,119,0.0,25,3933472768,0,12976,152,58,648
1,4294967295,9,9,7,2498,377,13655,6,drow ranger,1,243,0.0,22,3933472768,1,4790,0,194,469
2,4294967295,12,11,3,157,326,14470,88,nyx assassin,5,37,0.0,21,3933472768,2,7562,0,161,413
3,314390062,5,7,10,316,595,26815,106,ember spirit,12,377,0.0,25,3933472768,3,17834,0,2011,779
4,255288415,11,13,8,4782,428,14675,93,slark,8,217,0.0,25,3933472768,4,12834,0,110,612


# 4. Data Enriching

## 4.1 Enriching Hero Role

In [37]:
with open('data/hero_monthly_stats.json') as f:
    hero_monthly_stats = json.load(f)
hero_monthly_stats_df = pd.DataFrame(hero_monthly_stats).transpose()
hero_monthly_stats_df = hero_monthly_stats_df.reset_index()
hero_monthly_stats_df = hero_monthly_stats_df.rename({
    'index':'hero_name', 
    'games_played':'hero_games_played',
    'kda':'hero_kda',
    'win_ratio':'hero_win_ratio'
    }, axis='columns')
hero_monthly_stats_df['hero_name'] = hero_monthly_stats_df['hero_name'].str.lower()
hero_monthly_stats_df.head()

Unnamed: 0,hero_name,hero_games_played,hero_kda,pick_rate,hero_win_ratio
0,abaddon,813552,2.7662,3.8269,53.147
1,alchemist,1002710,2.1533,4.7167,47.2532
2,ancient apparition,1444245,2.9395,6.7937,52.3461
3,anti-mage,2937612,2.4398,13.8186,49.2443
4,arc warden,393944,2.5933,1.8531,47.1554


In [38]:
hero_df = pd.merge(hero_monthly_stats_df, hero_roles_df, on='hero_name')
hero_df.head()

Unnamed: 0,hero_name,hero_games_played,hero_kda,pick_rate,hero_win_ratio,hero_id,hard-carry,hard-disabler,hard-escape,hard-initiator,...,semi-pusher,semi-support,weak-carry,weak-disabler,weak-escape,weak-initiator,weak-jungler,weak-nuker,weak-pusher,weak-support
0,abaddon,813552,2.7662,3.8269,53.147,102,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
1,alchemist,1002710,2.1533,4.7167,47.2532,73,0,0,0,0,...,0,0,0,1,0,1,0,1,0,1
2,ancient apparition,1444245,2.9395,6.7937,52.3461,68,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,anti-mage,2937612,2.4398,13.8186,49.2443,1,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0
4,arc warden,393944,2.5933,1.8531,47.1554,113,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0


## 4.2 Enriching Player Game Performance

In [39]:
with open('data/player_stats.json') as f:
    raw_player_stats = json.load(f)

player_stats_df = pd.DataFrame(raw_player_stats).transpose()
player_stats_df = player_stats_df.reset_index()
player_stats_df = player_stats_df.rename({'index':'account_id'}, axis='columns')
player_stats_df.columns = map(str.lower, player_stats_df.columns)
player_stats_df.fillna(0, inplace=True)
player_stats_df.head()

Unnamed: 0,account_id,abaddon,alchemist,ancient apparition,anti-mage,arc warden,axe,bane,batrider,beastmaster,...,venomancer,viper,visage,warlock,weaver,windranger,winter wyvern,witch doctor,wraith king,zeus
0,100090157,"{'win_ratio': '42.857143', 'games_played': '14...","{'win_ratio': '25.0', 'games_played': '12', 'k...","{'win_ratio': '40.0', 'games_played': '10', 'k...","{'win_ratio': '50.0', 'games_played': '6', 'kd...","{'win_ratio': '0.0', 'games_played': '1', 'kda...","{'win_ratio': '42.105263', 'games_played': '19...","{'win_ratio': '50.0', 'games_played': '8', 'kd...","{'win_ratio': '51.724136', 'games_played': '29...","{'win_ratio': '62.5', 'games_played': '8', 'kd...",...,"{'win_ratio': '46.153847', 'games_played': '13...","{'win_ratio': '77.77778', 'games_played': '9',...","{'win_ratio': '66.66667', 'games_played': '18'...","{'win_ratio': '75.0', 'games_played': '4', 'kd...","{'win_ratio': '41.463413', 'games_played': '41...","{'win_ratio': '72.72727', 'games_played': '11'...","{'win_ratio': '33.333336', 'games_played': '6'...","{'win_ratio': '58.333332', 'games_played': '36...","{'win_ratio': '66.66667', 'games_played': '18'...","{'win_ratio': '14.285715', 'games_played': '7'..."
1,10011865,"{'win_ratio': '50.0', 'games_played': '24', 'k...","{'win_ratio': '38.235294', 'games_played': '34...","{'win_ratio': '33.333336', 'games_played': '3'...","{'win_ratio': '35.185184', 'games_played': '54...","{'win_ratio': '100.0', 'games_played': '1', 'k...","{'win_ratio': '44.444447', 'games_played': '11...","{'win_ratio': '65.625', 'games_played': '32', ...","{'win_ratio': '11.111112', 'games_played': '9'...","{'win_ratio': '40.0', 'games_played': '25', 'k...",...,"{'win_ratio': '44.897957', 'games_played': '49...","{'win_ratio': '50.0', 'games_played': '50', 'k...","{'win_ratio': '28.57143', 'games_played': '14'...","{'win_ratio': '53.846157', 'games_played': '26...","{'win_ratio': '54.166668', 'games_played': '48...","{'win_ratio': '52.63158', 'games_played': '19'...","{'win_ratio': '50.0', 'games_played': '8', 'kd...","{'win_ratio': '52.63158', 'games_played': '19'...","{'win_ratio': '49.36709', 'games_played': '79'...","{'win_ratio': '55.88235', 'games_played': '34'..."
2,1002045,"{'win_ratio': '50.0', 'games_played': '2', 'kd...","{'win_ratio': '33.333336', 'games_played': '3'...","{'win_ratio': '0.0', 'games_played': '1', 'kda...",0,"{'win_ratio': '0.0', 'games_played': '2', 'kda...","{'win_ratio': '50.0', 'games_played': '4', 'kd...",0,0,0,...,0,"{'win_ratio': '50.0', 'games_played': '4', 'kd...",0,"{'win_ratio': '100.0', 'games_played': '1', 'k...",0,"{'win_ratio': '50.0', 'games_played': '4', 'kd...","{'win_ratio': '60.000004', 'games_played': '5'...","{'win_ratio': '0.0', 'games_played': '1', 'kda...","{'win_ratio': '75.0', 'games_played': '4', 'kd...","{'win_ratio': '55.555557', 'games_played': '9'..."
3,100233112,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,100235084,"{'win_ratio': '57.14286', 'games_played': '42'...","{'win_ratio': '58.69565', 'games_played': '46'...","{'win_ratio': '45.901638', 'games_played': '61...","{'win_ratio': '52.941177', 'games_played': '68...","{'win_ratio': '33.333336', 'games_played': '3'...","{'win_ratio': '43.90244', 'games_played': '41'...","{'win_ratio': '68.115944', 'games_played': '69...","{'win_ratio': '64.81481', 'games_played': '54'...","{'win_ratio': '47.826088', 'games_played': '23...",...,"{'win_ratio': '47.058823', 'games_played': '34...","{'win_ratio': '42.857143', 'games_played': '14...","{'win_ratio': '66.66667', 'games_played': '18'...","{'win_ratio': '50.0', 'games_played': '12', 'k...","{'win_ratio': '55.813957', 'games_played': '12...","{'win_ratio': '42.937855', 'games_played': '17...","{'win_ratio': '63.88889', 'games_played': '36'...","{'win_ratio': '47.61905', 'games_played': '42'...","{'win_ratio': '47.5', 'games_played': '40', 'k...","{'win_ratio': '62.162163', 'games_played': '37..."


In [40]:
player_stats_df['abaddon'].isnull().sum()

0

In [41]:
player_stats_df.shape[0]

4307
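
A null count of 0 does not by itself show that every player has statistics for a hero: in the table preview above, missing statistics appear as the placeholder `0` rather than as `NaN`. The sketch below shows one way to tell the two cases apart. It runs on a small made-up frame (`toy_stats_df` and its values are assumptions for illustration, not the real data).

```python
import pandas as pd

# Toy stand-in for player_stats_df; the values are illustrative, not real data.
toy_stats_df = pd.DataFrame({
    'account_id': ['1', '2', '3'],
    'abaddon': [{'win_ratio': '50.0', 'games_played': '24', 'kda': '2.5'}, 0, 0],
})

# isnull() reports nothing missing because the placeholder 0 is not null.
null_count = toy_stats_df['abaddon'].isnull().sum()

# Flag the rows that actually hold a dict of statistics.
has_stats = toy_stats_df['abaddon'].apply(lambda value: isinstance(value, dict))

print(null_count, has_stats.sum(), (~has_stats).sum())
```

Applied to the real frame, the same `isinstance` check would give the number of players with recorded statistics for a given hero.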

In [42]:
# Per-player statistics for the hero played in each match
hero_win_ratio = []
hero_kda = []
hero_games_played = []

for index, row in players_df.iterrows():
    # account_id is matched as a string against player_stats_df
    account_id = str(row['account_id'])
    hero_name = row['hero_name']
    # Select this player's entry in the column of the hero they played
    player_stats = player_stats_df.loc[player_stats_df['account_id'] == account_id, hero_name]

    try:
        player_stats = player_stats.iloc[0]
    except IndexError:
        # No row for this account_id in player_stats_df
        player_stats = None

    # An entry is either a dict of statistics or the placeholder 0
    if isinstance(player_stats, dict):
        win_ratio = player_stats['win_ratio']
        kda = player_stats['kda']
        games_played = player_stats['games_played']
    else:
        win_ratio = None
        kda = None
        games_played = None

    hero_win_ratio.append(win_ratio)
    hero_kda.append(kda)
    hero_games_played.append(games_played)

# Assign the lists directly so values are placed by position,
# independent of the index of players_df
players_df['win_ratio'] = hero_win_ratio
players_df['kda'] = hero_kda
players_df['games_played'] = hero_games_played
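
The loop above filters `player_stats_df` once per player row. A possible alternative, sketched here on made-up frames, builds a dictionary keyed by `account_id` once and then does constant-time lookups. The names `toy_players_df`, `toy_stats_df`, `stats_by_account` and `lookup_hero_stats` are hypothetical, and the sketch assumes each `account_id` appears at most once in the statistics frame (`to_dict(orient='index')` raises an error otherwise).

```python
import pandas as pd

# Toy stand-ins for players_df and player_stats_df (illustrative values only).
toy_players_df = pd.DataFrame({
    'account_id': [1, 2, 3],
    'hero_name': ['slark', 'slark', 'abaddon'],
})
toy_stats_df = pd.DataFrame({
    'account_id': ['1', '2'],
    'slark': [{'win_ratio': '40.0', 'kda': '2.0', 'games_played': '10'}, 0],
    'abaddon': [0, 0],
})

# Build the lookup once: account_id -> {hero_name: stats dict or 0}.
stats_by_account = toy_stats_df.set_index('account_id').to_dict(orient='index')


def lookup_hero_stats(row):
    """Return (win_ratio, kda, games_played) for one player row, or Nones."""
    stats = stats_by_account.get(str(row['account_id']), {}).get(row['hero_name'])
    if isinstance(stats, dict):
        return stats['win_ratio'], stats['kda'], stats['games_played']
    return None, None, None


looked_up = [lookup_hero_stats(row) for _, row in toy_players_df.iterrows()]
stats_cols = pd.DataFrame(looked_up,
                          columns=['win_ratio', 'kda', 'games_played'],
                          index=toy_players_df.index)
toy_players_df = toy_players_df.join(stats_cols)
print(toy_players_df)
```

The result has the same shape as the loop's output: one `win_ratio`, `kda` and `games_played` value per player row, with missing values where no statistics dict exists.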

In [43]:
players_df.head()

Unnamed: 0,account_id,assists,deaths,denies,gold,gold_per_min,gold_spent,hero_id,hero_name,kills,...,level,match_id,player_slot,scaled_hero_damage,scaled_hero_healing,scaled_tower_damage,xp_per_min,win_ratio,kda,games_played
0,167415941,18,8,2,833,382,16355,71,spirit breaker,9,...,25,3933472768,0,12976,152,58,648,57.14286,4.3773584,7.0
1,4294967295,9,9,7,2498,377,13655,6,drow ranger,1,...,22,3933472768,1,4790,0,194,469,,,
2,4294967295,12,11,3,157,326,14470,88,nyx assassin,5,...,21,3933472768,2,7562,0,161,413,,,
3,314390062,5,7,10,316,595,26815,106,ember spirit,12,...,25,3933472768,3,17834,0,2011,779,51.315792,2.778559,76.0
4,255288415,11,13,8,4782,428,14675,93,slark,8,...,25,3933472768,4,12834,0,110,612,41.025642,2.0392158,39.0


# 5. Merging the data

In [49]:
hero_df = hero_df.drop(columns=['hero_name'])
hero_df.head()

Unnamed: 0,hero_games_played,hero_kda,pick_rate,hero_win_ratio,hero_id,hard-carry,hard-disabler,hard-escape,hard-initiator,hard-jungler,...,semi-pusher,semi-support,weak-carry,weak-disabler,weak-escape,weak-initiator,weak-jungler,weak-nuker,weak-pusher,weak-support
0,813552,2.7662,3.8269,53.147,102,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
1,1002710,2.1533,4.7167,47.2532,73,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,1
2,1444245,2.9395,6.7937,52.3461,68,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,2937612,2.4398,13.8186,49.2443,1,1,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,393944,2.5933,1.8531,47.1554,113,1,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0


In [45]:
match_df.head()

Unnamed: 0,match_id,barracks_status_dire,barracks_status_radiant,dire_score,duration,human_players,radiant_score,radiant_win,tower_status_dire,tower_status_radiant
0,3933472394,63,0,38,1620,10,29,False,1958,0
1,3933472395,63,0,35,2300,10,31,False,1974,0
2,3933472397,0,63,27,2277,10,54,True,0,1975
3,3933472398,0,55,46,2291,10,44,True,0,1926
4,3933472399,0,51,48,2861,10,50,True,0,1798


In [47]:
players_df.head()

Unnamed: 0,account_id,assists,deaths,denies,gold,gold_per_min,gold_spent,hero_id,hero_name,kills,...,level,match_id,player_slot,scaled_hero_damage,scaled_hero_healing,scaled_tower_damage,xp_per_min,win_ratio,kda,games_played
0,167415941,18,8,2,833,382,16355,71,spirit breaker,9,...,25,3933472768,0,12976,152,58,648,57.14286,4.3773584,7.0
1,4294967295,9,9,7,2498,377,13655,6,drow ranger,1,...,22,3933472768,1,4790,0,194,469,,,
2,4294967295,12,11,3,157,326,14470,88,nyx assassin,5,...,21,3933472768,2,7562,0,161,413,,,
3,314390062,5,7,10,316,595,26815,106,ember spirit,12,...,25,3933472768,3,17834,0,2011,779,51.315792,2.778559,76.0
4,255288415,11,13,8,4782,428,14675,93,slark,8,...,25,3933472768,4,12834,0,110,612,41.025642,2.0392158,39.0


In [48]:
player_df = pd.merge(players_df, match_df, on='match_id')
player_df.head()

Unnamed: 0,account_id,assists,deaths,denies,gold,gold_per_min,gold_spent,hero_id,hero_name,kills,...,games_played,barracks_status_dire,barracks_status_radiant,dire_score,duration,human_players,radiant_score,radiant_win,tower_status_dire,tower_status_radiant
0,167415941,18,8,2,833,382,16355,71,spirit breaker,9,...,7.0,63,0,48,2765,10,36,False,1828,0
1,4294967295,9,9,7,2498,377,13655,6,drow ranger,1,...,,63,0,48,2765,10,36,False,1828,0
2,4294967295,12,11,3,157,326,14470,88,nyx assassin,5,...,,63,0,48,2765,10,36,False,1828,0
3,314390062,5,7,10,316,595,26815,106,ember spirit,12,...,76.0,63,0,48,2765,10,36,False,1828,0
4,255288415,11,13,8,4782,428,14675,93,slark,8,...,39.0,63,0,48,2765,10,36,False,1828,0
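
`pd.merge` performs an inner join by default, so any player row whose `match_id` is absent from `match_df` is dropped silently. A minimal sketch of how this could be checked, using made-up frames (`toy_players_df` and `toy_match_df` are assumptions for illustration): a left join with `indicator=True` marks the rows that found no match.

```python
import pandas as pd

# Toy stand-ins for players_df and match_df (illustrative values only).
toy_players_df = pd.DataFrame({'match_id': [10, 10, 11, 12],
                               'kills': [9, 1, 5, 12]})
toy_match_df = pd.DataFrame({'match_id': [10, 11],
                             'radiant_win': [False, True]})

# how='left' keeps every player row; indicator=True adds a '_merge' column
# whose value is 'both' or 'left_only' for each row.
checked = pd.merge(toy_players_df, toy_match_df, on='match_id',
                   how='left', indicator=True)

# Player rows without a corresponding match.
unmatched = checked[checked['_merge'] == 'left_only']
print(len(toy_players_df), len(checked), len(unmatched))
```

If `unmatched` is empty on the real data, the inner join used above loses no player rows.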


In [50]:
# Merge onto player_df (not players_df) so the match columns added above are kept
player_df = pd.merge(player_df, hero_df, on='hero_id')
player_df.head()

Unnamed: 0,account_id,assists,deaths,denies,gold,gold_per_min,gold_spent,hero_id,hero_name,kills,...,semi-pusher,semi-support,weak-carry,weak-disabler,weak-escape,weak-initiator,weak-jungler,weak-nuker,weak-pusher,weak-support
0,167415941,18,8,2,833,382,16355,71,spirit breaker,9,...,0,0,1,0,1,0,0,0,0,0
1,4294967295,9,13,0,616,313,7610,71,spirit breaker,6,...,0,0,1,0,1,0,0,0,0,0
2,133377036,28,7,8,6444,418,11400,71,spirit breaker,12,...,0,0,1,0,1,0,0,0,0,0
3,4294967295,11,12,1,507,208,5025,71,spirit breaker,2,...,0,0,1,0,1,0,0,0,0,0
4,257152852,8,8,0,553,281,8645,71,spirit breaker,4,...,0,0,1,0,1,0,0,0,0,0
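
To confirm that a merged frame carries columns from all three sources and that no player row is duplicated, the two merges can be chained with `validate='many_to_one'`, which raises an error if a key is repeated on the right-hand side. The sketch below uses made-up frames; `toy_players_df`, `toy_match_df` and `toy_hero_df` are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for players_df, match_df and hero_df (illustrative values only).
toy_players_df = pd.DataFrame({'match_id': [10, 10, 11],
                               'hero_id': [71, 6, 71]})
toy_match_df = pd.DataFrame({'match_id': [10, 11],
                             'duration': [2765, 1620]})
toy_hero_df = pd.DataFrame({'hero_id': [6, 71],
                            'pick_rate': [4.7, 3.8]})

# Each player row should match exactly one match row and one hero row;
# validate='many_to_one' raises MergeError if a right-hand key is duplicated.
merged = (
    toy_players_df
    .merge(toy_match_df, on='match_id', validate='many_to_one')
    .merge(toy_hero_df, on='hero_id', validate='many_to_one')
)

print(merged.columns.tolist())
print(len(merged) == len(toy_players_df))
```

The row count staying equal to that of the player frame indicates that neither merge dropped or duplicated rows.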
