# DOTA 2 Data - What models can I build with data from my own DOTA 2 matches?

This is a notebook exploring the data already pulled from the OpenDota API (see data_import.py)

We currently have 2 files:

- profile.csv
- matches.csv
    
Let's explore what's in them.

In [15]:
import pandas as pd
from pathlib import Path

# Prefer pathlib when handling paths - it builds paths with the right separator for the OS you're on,
# which is very useful when sharing code with others
# It is also usually more concise to write than chaining os.path calls
base_directory = Path('C:/Users/Ryan/dota2-classification')

profile_df = pd.read_csv(base_directory / 'profile.csv')

profile_df.head()

Unnamed: 0,competitive_rank,leaderboard_rank,mmr_estimate.estimate,profile.account_id,profile.avatar,profile.avatarfull,profile.avatarmedium,profile.cheese,profile.is_contributor,profile.last_login,profile.loccountrycode,profile.name,profile.personaname,profile.plus,profile.profileurl,profile.steamid,rank_tier,solo_competitive_rank,tracked_until
0,4202,,3761,45576964,https://steamcdn-a.akamaihd.net/steamcommunity...,https://steamcdn-a.akamaihd.net/steamcommunity...,https://steamcdn-a.akamaihd.net/steamcommunity...,0,False,2019-03-18T20:45:06.869Z,,,Smooth Jazz <3,True,https://steamcommunity.com/profiles/7656119800...,76561198005842692,63,4538,1556024495


Seems to just be a single row of information about my account. Can't see much immediate use for this so will ignore for now.

Let's check out matches.csv instead. This should be a table of all matches I have played before 1st January 2019.

In [16]:
df = pd.read_csv(base_directory / 'matches.csv')

df.head()

Unnamed: 0,assists,deaths,duration,game_mode,hero_id,kills,leaver_status,lobby_type,match_id,party_size,player_slot,radiant_win,skill,start_time,version,hero_name,start_datetime
0,18,4,2533,22,27,7,0,0,3587558457,,2,True,3.0,1511879974,,Shadow Shaman,2017-11-28 14:39:34
1,5,6,2375,22,18,5,0,0,3434506769,2.0,128,True,3.0,1504985730,20.0,Sven,2017-09-09 19:35:30
2,1,2,2112,22,31,1,1,0,3434469722,2.0,129,True,3.0,1504984250,20.0,Lich,2017-09-09 19:10:50
3,19,5,3074,5,75,15,0,0,3433444310,2.0,0,True,3.0,1504953922,20.0,Silencer,2017-09-09 10:45:22
4,16,1,2816,22,29,3,0,0,3432206454,2.0,131,True,3.0,1504897425,20.0,Tidehunter,2017-09-08 19:03:45


Seems to be exactly that. Let's check out a bit more information about the data and what each column represents.

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3819 entries, 0 to 3818
Data columns (total 17 columns):
assists           3819 non-null int64
deaths            3819 non-null int64
duration          3819 non-null int64
game_mode         3819 non-null int64
hero_id           3819 non-null int64
kills             3819 non-null int64
leaver_status     3819 non-null int64
lobby_type        3819 non-null int64
match_id          3819 non-null int64
party_size        67 non-null float64
player_slot       3819 non-null int64
radiant_win       3819 non-null bool
skill             95 non-null float64
start_time        3819 non-null int64
version           328 non-null float64
hero_name         3819 non-null object
start_datetime    3819 non-null object
dtypes: bool(1), float64(3), int64(11), object(2)
memory usage: 481.2+ KB


We can see that we've got 3819 matches here - a shamefully large, but useful, dataset.

However some columns have a lot of nulls - I'll try and understand what they represent before removing them.
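
A quick way to quantify this is to compute the share of missing values per column. From the `df.info()` output above, party_size, skill and version are roughly 98%, 98% and 91% null respectively. The sketch below runs on a small made-up frame (the values and the 0.5 threshold are assumptions for illustration, not taken from matches.csv):

```python
import pandas as pd

# Toy frame standing in for matches.csv; the values are illustrative only.
sample = pd.DataFrame({
    "kills": [7, 5, 1, 15],
    "party_size": [None, 2.0, None, None],
    "skill": [3.0, 2.0, None, None],
})

# Fraction of missing values per column, highest first.
null_share = sample.isna().mean().sort_values(ascending=False)
print(null_share)

# Columns where at least half of the rows are missing (threshold is arbitrary).
sparse_columns = null_share[null_share >= 0.5].index.tolist()
print(sparse_columns)
```

The same two lines should work on the real `df` to list candidate columns for removal.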

In [18]:
df['party_size'].unique()

array([nan,  2.,  3., 10.])

party_size seems to be the number of players in my party (friends I have grouped up with) during the match. However, I know most of my matches were played with friends, yet only 67 of the 3819 rows have a value, so this looks like a column the API no longer fills in reliably.

In [19]:
df['skill'].unique()

array([ 3.,  2., nan])

In [20]:
df['version'].unique()

array([nan, 20., 17., 16., 15., 12., 11., 10.,  7.,  6.,  5.,  4.])

Similarly, skill and version are mostly null (95 and 328 non-null values respectively), so they probably once represented something but have since been deprecated. I'll remove these 3 columns from the dataset.

In [21]:
# columns= already targets columns, so axis is not needed
df = df.drop(columns = ['party_size', 'skill', 'version'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3819 entries, 0 to 3818
Data columns (total 14 columns):
assists           3819 non-null int64
deaths            3819 non-null int64
duration          3819 non-null int64
game_mode         3819 non-null int64
hero_id           3819 non-null int64
kills             3819 non-null int64
leaver_status     3819 non-null int64
lobby_type        3819 non-null int64
match_id          3819 non-null int64
player_slot       3819 non-null int64
radiant_win       3819 non-null bool
start_time        3819 non-null int64
hero_name         3819 non-null object
start_datetime    3819 non-null object
dtypes: bool(1), int64(11), object(2)
memory usage: 391.7+ KB


All nulls gone. Let's take a bit of a harder look at the remaining data, and see if we can find something fun to model.

In [23]:
df.head()

Unnamed: 0,assists,deaths,duration,game_mode,hero_id,kills,leaver_status,lobby_type,match_id,player_slot,radiant_win,start_time,hero_name,start_datetime
0,18,4,2533,22,27,7,0,0,3587558457,2,True,1511879974,Shadow Shaman,2017-11-28 14:39:34
1,5,6,2375,22,18,5,0,0,3434506769,128,True,1504985730,Sven,2017-09-09 19:35:30
2,1,2,2112,22,31,1,1,0,3434469722,129,True,1504984250,Lich,2017-09-09 19:10:50
3,19,5,3074,5,75,15,0,0,3433444310,0,True,1504953922,Silencer,2017-09-09 10:45:22
4,16,1,2816,22,29,3,0,0,3432206454,131,True,1504897425,Tidehunter,2017-09-08 19:03:45
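
Looking at the rows above, start_time looks like a Unix timestamp (seconds since 1970) and start_datetime like the same moment in readable form. A small check, assuming the timestamps are in UTC, using two values copied from the table:

```python
import pandas as pd

# Two start_time values copied from the first two rows of the table.
start_time = pd.Series([1511879974, 1504985730])

# Interpret the integers as seconds since the Unix epoch (UTC).
parsed = pd.to_datetime(start_time, unit="s")
formatted = parsed.dt.strftime("%Y-%m-%d %H:%M:%S").tolist()
print(formatted)  # ['2017-11-28 14:39:34', '2017-09-09 19:35:30']

# Once parsed, parts of the date are easy to pull out as possible features.
print(parsed.dt.hour.tolist())
print(parsed.dt.day_name().tolist())
```

Both results match the start_datetime column, so one of the two columns is redundant for modelling.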


We can see there's a column called radiant_win ('radiant' is the name of one of the two sides in a match of DOTA; the other is called the 'dire'). This represents whether or not radiant won the match. There's also a column called player_slot, which I think will tell us whether I was on radiant or dire.

I think it would be quite interesting to try and predict whether I won or lost each match, based on the other data present.

## Data cleaning

In order to predict a win/loss, I'll need a boolean column describing whether each match was won or lost. Let's look into the radiant_win and player_slot columns a bit further.

In [26]:
df['player_slot'].unique()

array([  2, 128, 129,   0, 131, 130,   1, 132,   4,   3], dtype=int64)

This seems to be 2 different ranges of 5 numbers; 0-4 and 128-132. Since DOTA is a 5vs5 game, I am guessing these represent which side you were on.

A brief check of my recent games confirms this: 0-4 represents the radiant side, and 128-132 the dire side.
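
As far as I understand the Dota 2 match data format, player_slot is an 8-bit value where the top bit marks the team (set means dire) and the lowest three bits give the position within the team, which would explain the two ranges. A minimal sketch of that decoding (the helper name `decode_player_slot` is my own, not part of any API):

```python
from typing import Tuple


def decode_player_slot(player_slot: int) -> Tuple[str, int]:
    """Split a player_slot value into (team, position within the team)."""
    # Top bit (value 128) set means dire, otherwise radiant.
    team = "dire" if player_slot & 0x80 else "radiant"
    # Lowest three bits hold the position within the team (0-4).
    position = player_slot & 0x07
    return team, position


for slot in (0, 2, 128, 131):
    print(slot, decode_player_slot(slot))
```

For the values seen in this dataset, this gives the same split as checking whether the slot is below 128.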

Let's make some new columns to make this information a bit clearer.

In [28]:
# Slots below 128 are radiant (0-4 here), slots from 128 up are dire (128-132 here)
df['team'] = df['player_slot'].apply(lambda x: 'radiant' if x < 128 else 'dire')

df.head()

Unnamed: 0,assists,deaths,duration,game_mode,hero_id,kills,leaver_status,lobby_type,match_id,player_slot,radiant_win,start_time,hero_name,start_datetime,team
0,18,4,2533,22,27,7,0,0,3587558457,2,True,1511879974,Shadow Shaman,2017-11-28 14:39:34,radiant
1,5,6,2375,22,18,5,0,0,3434506769,128,True,1504985730,Sven,2017-09-09 19:35:30,dire
2,1,2,2112,22,31,1,1,0,3434469722,129,True,1504984250,Lich,2017-09-09 19:10:50,dire
3,19,5,3074,5,75,15,0,0,3433444310,0,True,1504953922,Silencer,2017-09-09 10:45:22,radiant
4,16,1,2816,22,29,3,0,0,3432206454,131,True,1504897425,Tidehunter,2017-09-08 19:03:45,dire


We now have a team column, showing which team I was on for each match. Let's now combine the radiant_win column with the team column to make a win column, which will be the target for the model.

In [29]:
def player_win(row):
    # I won if the side I was on is the side that won the match
    if row['team'] == 'radiant' and row['radiant_win']:
        return 1
    if row['team'] == 'dire' and not row['radiant_win']:
        return 1
    return 0

df['win'] = df.apply(player_win, axis = 1)

df.head()

Unnamed: 0,assists,deaths,duration,game_mode,hero_id,kills,leaver_status,lobby_type,match_id,player_slot,radiant_win,start_time,hero_name,start_datetime,team,win
0,18,4,2533,22,27,7,0,0,3587558457,2,True,1511879974,Shadow Shaman,2017-11-28 14:39:34,radiant,1
1,5,6,2375,22,18,5,0,0,3434506769,128,True,1504985730,Sven,2017-09-09 19:35:30,dire,0
2,1,2,2112,22,31,1,1,0,3434469722,129,True,1504984250,Lich,2017-09-09 19:10:50,dire,0
3,19,5,3074,5,75,15,0,0,3433444310,0,True,1504953922,Silencer,2017-09-09 10:45:22,radiant,1
4,16,1,2816,22,29,3,0,0,3432206454,131,True,1504897425,Tidehunter,2017-09-08 19:03:45,dire,0


Nice! We now have a win column, with 1s for wins and 0s for losses. I'll save this as a new CSV to build models on.

In [31]:
df.to_csv(base_directory  / 'dataset.csv', index = False)
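
As a cross-check on the win column built above, the same result can be computed without a row-wise apply: I won whenever "I was on radiant" and "radiant won" are either both true or both false. A sketch using the five rows shown in the tables above (only player_slot and radiant_win are copied over):

```python
import pandas as pd

# player_slot and radiant_win for the first five matches in the tables above.
sample = pd.DataFrame({
    "player_slot": [2, 128, 129, 0, 131],
    "radiant_win": [True, True, True, True, True],
})

# True where I was on the radiant side (slots below 128).
on_radiant = sample["player_slot"] < 128

# A win is when my side and the winning side agree.
sample["win"] = (on_radiant == sample["radiant_win"]).astype(int)
print(sample["win"].tolist())  # [1, 0, 0, 1, 0]

# Share of wins in the sample, a quick look at how balanced the target is.
print(sample["win"].mean())
```

The output matches the win column in the table, and on the full dataset the mean gives the overall win rate, which is worth knowing before modelling.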

### In 'DOTA 2 Model Building.ipynb', I have a go at predicting wins and losses