# Vectorization of the data to be used

## Map heroes to a vector
First, we want to create a representation of the heroes. For this, we will create a vector of length n, where n is the number of heroes in the game, and map each hero to a single position (index) of that vector.

In [2]:
import pandas as pd
import json
import numpy as np

In [3]:
with open('../Preprocessing/hero_names.json', 'r') as file:
    heroes = json.load(file)
heroes = heroes.keys()
n_heroes = len(heroes)
print("There are {} heroes".format(n_heroes))
heroes

There are 84 heroes


dict_keys(['Abathur', 'Alarak', "Anub'arak", 'Artanis', 'Arthas', 'Auriel', 'Azmodan', 'Brightwing', 'Cassia', 'Chen', 'Cho', 'Chromie', 'D.Va', 'Dehaka', 'Diablo', 'E.T.C.', 'Falstad', 'Gall', 'Garrosh', 'Gazlowe', 'Genji', 'Greymane', "Gul'dan", 'Illidan', 'Jaina', 'Johanna', "Kael'thas", 'Kerrigan', 'Kharazim', 'Leoric', 'Li Li', 'Li-Ming', 'Lt. Morales', 'Lucio', 'Lunara', 'Malfurion', 'Malthael', 'Medivh', 'Muradin', 'Murky', 'Nazeebo', 'Nova', 'Probius', 'Ragnaros', 'Raynor', 'Rehgar', 'Rexxar', 'Samuro', 'Sgt. Hammer', 'Sonya', 'Stitches', 'Stukov', 'Sylvanas', 'Tassadar', 'The Butcher', 'The Lost Vikings', 'Thrall', 'Tracer', 'Tychus', 'Tyrael', 'Tyrande', 'Uther', 'Valeera', 'Valla', 'Varian', 'Xul', 'Zagara', 'Zarya', 'Zeratul', "Zul'jin", "Kel'Thuzad", 'Ana', 'Junkrat', 'Alexstrasza', 'Hanzo', 'Blaze', 'Maiev', 'Fenix', 'Deckard', 'Yrel', 'Whitemane', 'Mephisto', "Mal'Ganis", 'Orphea'])

In [4]:
def map_generator():
    i = 0
    while True:
        yield i
        i += 1
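
The generator above simply yields 0, 1, 2, and so on. As a minimal sketch, the same name-to-index mapping can be built with the built-in `enumerate`. The three-hero list `sample_heroes` is a hypothetical stand-in for the keys loaded from `hero_names.json`.

```python
# Hypothetical mini-roster standing in for the keys of hero_names.json
sample_heroes = ['Abathur', 'Alarak', "Anub'arak"]

# enumerate yields (index, name) pairs, so no hand-written counter is needed
sample_vectorization = {name: index for index, name in enumerate(sample_heroes)}

print(sample_vectorization)
```

Both approaches assign indices in iteration order, so the result depends on the order of the keys in the JSON file.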

In [5]:
generator = map_generator()
vectorization = {key:next(generator) for key in heroes}
vectorization

{'Abathur': 0,
 'Alarak': 1,
 "Anub'arak": 2,
 'Artanis': 3,
 'Arthas': 4,
 'Auriel': 5,
 'Azmodan': 6,
 'Brightwing': 7,
 'Cassia': 8,
 'Chen': 9,
 'Cho': 10,
 'Chromie': 11,
 'D.Va': 12,
 'Dehaka': 13,
 'Diablo': 14,
 'E.T.C.': 15,
 'Falstad': 16,
 'Gall': 17,
 'Garrosh': 18,
 'Gazlowe': 19,
 'Genji': 20,
 'Greymane': 21,
 "Gul'dan": 22,
 'Illidan': 23,
 'Jaina': 24,
 'Johanna': 25,
 "Kael'thas": 26,
 'Kerrigan': 27,
 'Kharazim': 28,
 'Leoric': 29,
 'Li Li': 30,
 'Li-Ming': 31,
 'Lt. Morales': 32,
 'Lucio': 33,
 'Lunara': 34,
 'Malfurion': 35,
 'Malthael': 36,
 'Medivh': 37,
 'Muradin': 38,
 'Murky': 39,
 'Nazeebo': 40,
 'Nova': 41,
 'Probius': 42,
 'Ragnaros': 43,
 'Raynor': 44,
 'Rehgar': 45,
 'Rexxar': 46,
 'Samuro': 47,
 'Sgt. Hammer': 48,
 'Sonya': 49,
 'Stitches': 50,
 'Stukov': 51,
 'Sylvanas': 52,
 'Tassadar': 53,
 'The Butcher': 54,
 'The Lost Vikings': 55,
 'Thrall': 56,
 'Tracer': 57,
 'Tychus': 58,
 'Tyrael': 59,
 'Tyrande': 60,
 'Uther': 61,
 'Valeera': 62,
 'Valla': 63,

## Transform a draft into a single vector
Now, we will transform a complete draft by two teams, 10 picked heroes and (hopefully) 6 banned heroes, into a single vector. For this, we will use four vectors of length n (the number of heroes in the game). They will represent the heroes picked by the first team, the heroes banned by the first team, the heroes picked by the second team, and the heroes banned by the second team, respectively.

In addition to that, we will insert all of this into a dataset that also carries the `winner` and `map` columns. `winner` has a value of 1 if team one won and a value of 0 if team two won. `map` is the map on which the game was played, encoded as a vector in the same way as the heroes.
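
To make the layout concrete, here is a minimal sketch of the encoding on a toy draft. The five-hero roster `toy_index` and the helper `multi_hot` are hypothetical and only serve the illustration; the notebook itself works with all 84 heroes.

```python
import numpy as np

# Hypothetical five-hero roster; the real notebook uses all 84 heroes
toy_index = {'Abathur': 0, 'Alarak': 1, 'Artanis': 2, 'Arthas': 3, 'Auriel': 4}

def multi_hot(names, index):
    """Return a vector with a 1 at the position of every listed name."""
    vec = np.zeros(len(index))
    for name in names:
        vec[index[name]] = 1
    return vec

picks_one = multi_hot(['Alarak', 'Arthas'], toy_index)
bans_one = multi_hot(['Abathur'], toy_index)
picks_two = multi_hot(['Artanis'], toy_index)
bans_two = multi_hot(['Auriel'], toy_index)

# Same block order as described above: picks 1, bans 1, picks 2, bans 2
toy_draft = np.concatenate([picks_one, bans_one, picks_two, bans_two])
print(toy_draft.shape)
```

Each block has one slot per hero, so the same hero can be marked independently as a pick or a ban for either team.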

### Import the dataset
For this, we first need to load our dataset.

In [6]:
with open('../Preprocessing/data/data.csv', 'r') as file:
    dataset = pd.read_csv(file)
dataset

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,map,game_type,winner,ban1,ban2,ban3,ban4,...,level1,level2,level3,level4,level5,level6,level7,level8,level9,level10
0,0,0,1,Battlefield of Eternity,TeamLeague,1,Mal'Ganis,,Artanis,Orphea,...,20,12,16,20,5,3,10,20,17,1
1,1,1,2,Blackheart's Bay,UnrankedDraft,1,Mal'Ganis,Xul,Genji,Orphea,...,1,10,15,14,20,7,19,5,4,20
2,2,2,3,Sky Temple,TeamLeague,0,Garrosh,Azmodan,Mal'Ganis,Orphea,...,5,10,6,20,8,9,16,9,18,6
3,3,3,4,Towers of Doom,TeamLeague,1,Whitemane,Orphea,Azmodan,Mal'Ganis,...,20,20,8,20,20,20,20,20,11,20
4,4,4,5,Volskaya Foundry,TeamLeague,0,Mal'Ganis,Orphea,Azmodan,Muradin,...,12,20,20,11,11,12,13,20,16,15
5,5,5,6,Infernal Shrines,HeroLeague,0,Mal'Ganis,Garrosh,Orphea,Genji,...,18,20,13,20,10,9,14,13,7,18
6,6,6,7,Volskaya Foundry,TeamLeague,1,Genji,Azmodan,Orphea,Kael'thas,...,7,10,20,20,20,6,20,5,20,16
7,7,7,8,Towers of Doom,TeamLeague,1,Mal'Ganis,Diablo,Orphea,Maiev,...,11,14,20,20,16,20,20,14,20,10
8,8,8,9,Dragon Shire,HeroLeague,1,Mal'Ganis,Azmodan,Orphea,Kerrigan,...,10,20,8,20,19,9,20,20,11,16
9,9,9,10,Alterac Pass,TeamLeague,0,Orphea,Nazeebo,Genji,Zagara,...,1,20,16,20,13,20,20,10,8,20


Some of these columns are of no interest to us. The `Unnamed: x` columns were added when the data from different computers was merged, and the `id` column gives no useful information. Also, since we only collected data from the game types we care about (game types with a draft), the `game_type` column doesn't give any useful information either.

In addition to this, for now, we won't be considering the level of each of the heroes picked, so we will just drop those columns along with the others.

In [7]:
dataset = dataset.drop(['Unnamed: 0', 'Unnamed: 0.1', 'id',
                        'game_type', 'level1', 'level2', 'level3',
                        'level4', 'level5', 'level6', 'level7',
                        'level8', 'level9', 'level10'],
                       axis=1)
dataset

Unnamed: 0,map,winner,ban1,ban2,ban3,ban4,ban5,ban6,pick1,pick2,pick3,pick4,pick5,pick6,pick7,pick8,pick9,pick10
0,Battlefield of Eternity,1,Mal'Ganis,,Artanis,Orphea,Deckard,Kael'thas,Hanzo,Tracer,Thrall,Diablo,Varian,Muradin,Kharazim,Li-Ming,Rehgar,Fenix
1,Blackheart's Bay,1,Mal'Ganis,Xul,Genji,Orphea,Illidan,Uther,Alarak,Maiev,Garrosh,Malfurion,Artanis,Ana,Kael'thas,Jaina,Zagara,Nazeebo
2,Sky Temple,0,Garrosh,Azmodan,Mal'Ganis,Orphea,Stukov,Genji,Kael'thas,Diablo,Deckard,Muradin,Junkrat,Raynor,Zagara,Li Li,Thrall,Chromie
3,Towers of Doom,1,Whitemane,Orphea,Azmodan,Mal'Ganis,Ana,Cho,Kael'thas,Diablo,Deckard,Raynor,Johanna,Varian,Nazeebo,Fenix,Kharazim,Chromie
4,Volskaya Foundry,0,Mal'Ganis,Orphea,Azmodan,Muradin,Auriel,Diablo,D.Va,Tracer,Kael'thas,Raynor,Stitches,Arthas,Rehgar,Valeera,Zeratul,Artanis
5,Infernal Shrines,0,Mal'Ganis,Garrosh,Orphea,Genji,E.T.C.,Stukov,Ana,Zagara,Mephisto,Kael'thas,Kerrigan,Diablo,Cassia,Blaze,Tassadar,Li Li
6,Volskaya Foundry,1,Genji,Azmodan,Orphea,Kael'thas,Li-Ming,Gul'dan,Mal'Ganis,Diablo,Li Li,Arthas,Valla,Jaina,Sonya,Alexstrasza,Kel'Thuzad,Zeratul
7,Towers of Doom,1,Mal'Ganis,Diablo,Orphea,Maiev,Garrosh,Kel'Thuzad,Genji,Ana,Varian,Hanzo,Deckard,Blaze,Kael'thas,Anub'arak,Alarak,Zeratul
8,Dragon Shire,1,Mal'Ganis,Azmodan,Orphea,Kerrigan,Mephisto,Sylvanas,Diablo,Leoric,Zul'jin,Varian,Falstad,Li Li,Genji,Alexstrasza,Cassia,E.T.C.
9,Alterac Pass,0,Orphea,Nazeebo,Genji,Zagara,Sylvanas,Arthas,Mal'Ganis,Azmodan,Hanzo,Li Li,Li-Ming,Tyrande,Leoric,Zul'jin,Varian,Maiev
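
Listing every column by hand works, but the unwanted columns can also be selected by name pattern. The sketch below assumes the leftover columns always start with `Unnamed` or `level`; the small frame `toy` is a hypothetical miniature of the merged CSV.

```python
import pandas as pd

# Hypothetical miniature of the merged CSV, including leftover index columns
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'Unnamed: 0.1': [0, 1],
    'id': [1, 2],
    'map': ['Sky Temple', 'Dragon Shire'],
    'winner': [0, 1],
    'level1': [5, 10],
})

# Select columns by name pattern instead of listing each one
unnamed = [c for c in toy.columns if c.startswith('Unnamed')]
levels = [c for c in toy.columns if c.startswith('level')]
cleaned = toy.drop(columns=unnamed + levels + ['id'])

print(list(cleaned.columns))
```

A pattern-based drop also catches any extra `Unnamed` column that a later merge might introduce.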


### Create a map of maps to a vector

In [8]:
maps = dataset['map'].unique()
n_maps = len(maps)
print('There are {} different maps'.format(n_maps))

There are 14 different maps


In [9]:
generator2 = map_generator()
map_vectorization = {map_name: next(generator2) for map_name in maps}
map_vectorization

{'Battlefield of Eternity': 0,
 "Blackheart's Bay": 1,
 'Sky Temple': 2,
 'Towers of Doom': 3,
 'Volskaya Foundry': 4,
 'Infernal Shrines': 5,
 'Dragon Shire': 6,
 'Alterac Pass': 7,
 'Tomb of the Spider Queen': 8,
 'Warhead Junction': 9,
 'Braxis Holdout': 10,
 'Cursed Hollow': 11,
 'Hanamura Temple': 12,
 'Garden of Terror': 13}
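
Indices assigned in order of first appearance, as above, are also what `pandas.factorize` produces. The following is a sketch on a hypothetical four-game `toy_maps` column, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical 'map' column with a repeated map name
toy_maps = pd.Series(['Sky Temple', 'Dragon Shire', 'Sky Temple', 'Alterac Pass'])

# codes holds one integer per row; uniques holds names in order of first appearance
codes, uniques = pd.factorize(toy_maps)
toy_map_vectorization = {name: index for index, name in enumerate(uniques)}

print(codes)
print(toy_map_vectorization)
```

The `codes` array gives the index for every row directly, which can save a dictionary lookup per game.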

### Create each vector

In [10]:
def create_vector(row):
    # Create the needed vectors
    picks_team_one = np.zeros(n_heroes)
    picks_team_two = np.zeros(n_heroes)
    bans_team_one = np.zeros(n_heroes)
    bans_team_two = np.zeros(n_heroes)
    chosen_map = np.zeros(n_maps)
    
    # Set the entries at the matching indices to one
    for i in ['pick1', 'pick4', 'pick5', 'pick8', 'pick9']:
        picks_team_one[vectorization[row[i]]] = 1
    for i in ['pick2', 'pick3', 'pick6', 'pick7', 'pick10']:
        picks_team_two[vectorization[row[i]]] = 1
    for i in ['ban1', 'ban3', 'ban6']:
        if pd.isnull(row[i]):
            continue
        bans_team_one[vectorization[row[i]]] = 1
    for i in ['ban2', 'ban4', 'ban5']:
        if pd.isnull(row[i]):
            continue
        bans_team_two[vectorization[row[i]]] = 1
    chosen_map[map_vectorization[row['map']]] = 1
    
    
    # Concatenate the vectors
    return np.concatenate([picks_team_one, bans_team_one, picks_team_two, bans_team_two, chosen_map])

Testing the function. With four hero vectors of length 84 and one map vector of length 14, we should get a vector of length 84 \* 4 + 14 = 350.

In [11]:
vector = create_vector(dataset.iloc[1])
print(len(vector))
vector

350


array([0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [12]:
vector[:84]  # Array that represents the picks of the first team

array([0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
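
Slicing by hand works for the first block; for the others it can help to name the block boundaries once. The sketch below assumes the block order returned by `create_vector` and the sizes reported earlier (84 heroes, 14 maps). The `blocks` dictionary and the `demo` vector are hypothetical helpers.

```python
import numpy as np

n_heroes, n_maps = 84, 14  # sizes reported earlier in the notebook

# Block boundaries in the order used by create_vector
names = ['picks_team_one', 'bans_team_one', 'picks_team_two', 'bans_team_two']
blocks = {name: slice(i * n_heroes, (i + 1) * n_heroes)
          for i, name in enumerate(names)}
blocks['map'] = slice(4 * n_heroes, 4 * n_heroes + n_maps)

# Hypothetical draft vector with a few entries switched on
demo = np.zeros(4 * n_heroes + n_maps)
demo[[1, 3]] = 1              # two picks by team one
demo[4 * n_heroes + 2] = 1    # the map with index 2

# Recover the hero indices and the map index from the flat vector
picked = np.flatnonzero(demo[blocks['picks_team_one']])
map_id = int(np.argmax(demo[blocks['map']]))
print(picked, map_id)
```

Inverting the `vectorization` dictionary would turn the recovered indices back into hero names.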

### Iterate over the dataset, creating our new, vectorized, dataset

In [13]:
vectorized_dataset = []
wins = []
for index, row in dataset.iterrows():
    vectorized_dataset.append(create_vector(row))
    wins.append(row['winner'])
print(len(vectorized_dataset))
print(len(wins))

45952
45952
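
`iterrows` is easy to read but visits the rows one at a time. As a hedged alternative, the sketch below encodes whole columns at once with NumPy fancy indexing. The helper `multi_hot_columns`, the four-hero `toy_index`, and the `toy` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical four-hero roster and two-game dataset
toy_index = {'Abathur': 0, 'Alarak': 1, 'Artanis': 2, 'Arthas': 3}
toy = pd.DataFrame({
    'pick1': ['Alarak', 'Abathur'],
    'pick2': ['Arthas', 'Artanis'],
    'ban1': ['Abathur', np.nan],  # a missing ban, like the empty ban in the real data
})

def multi_hot_columns(frame, columns, index):
    """Multi-hot encode several name columns at once; missing values are skipped."""
    out = np.zeros((len(frame), len(index)))
    for col in columns:
        codes = frame[col].map(index)        # NaN where the name is missing
        present = codes.notna().to_numpy()
        rows = np.flatnonzero(present)
        out[rows, codes[present].astype(int).to_numpy()] = 1
    return out

picks = multi_hot_columns(toy, ['pick1', 'pick2'], toy_index)
bans = multi_hot_columns(toy, ['ban1'], toy_index)
print(picks)
print(bans)
```

Unlike `create_vector`, this sketch silently skips names that are not in the index instead of raising a `KeyError`, so it assumes the hero names have already been validated.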


In [14]:
vectorized_dataset[1]

array([0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

## Export the data

In [15]:
vectorized_dataset = np.array(vectorized_dataset)
wins = np.array(wins)
print(len(vectorized_dataset))
print(len(wins))

45952
45952


In [16]:
np.save('dataset.npy', vectorized_dataset, allow_pickle=True)
np.save('classifications.npy', wins, allow_pickle=True)
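
As a quick sanity check, the saved files can be read back with `np.load`. The sketch below uses a temporary directory and small placeholder arrays (`features`, `labels`) instead of the real 45952-row data, so it can be run without touching the notebook's output files.

```python
import os
import tempfile

import numpy as np

# Placeholder arrays with the same column count as the real dataset
features = np.zeros((3, 350))
labels = np.array([1, 0, 1])

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, 'dataset.npy'), features)
    np.save(os.path.join(tmp, 'classifications.npy'), labels)

    # Plain numeric arrays load without needing allow_pickle
    X = np.load(os.path.join(tmp, 'dataset.npy'))
    y = np.load(os.path.join(tmp, 'classifications.npy'))

print(X.shape, y.shape)
```

Checking that both arrays have the same number of rows after loading guards against mismatched feature and label files.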