## Analyzing Dota Data
Isaac Haberman (Ihaberma) and Adin Adler (aadler)



## Introduction

For our final project we will be analyzing the video game Defense of the Ancients 2 (DOTA for short). For a full explanation of the game and further resources, go to http://dota2.gamepedia.com/Dota_2. DOTA is a game played by ten players, five per team, who compete in matches that generally last about forty-five minutes. Before each match, each player chooses a unique hero, without replacement, from a pool of one hundred thirteen heroes. Throughout the match players earn gold and experience, which help them win. Gold and experience are reset after each match.

We will be analyzing the data from two CSV files from Feedless, `MatchOverview` and `MatchDetail`. 

`MatchOverview` contains 23875 rows, each row corresponding to a unique match. The data frame contains 12 columns, grouped by the match id.

* The first column identifies the match, usually a large number like 2503037971.

* The next ten columns contain the heroes picked in the match: `hero_1` through `hero_5` represent team zero, and `hero_6` through `hero_10` represent team one. The columns can contain any integer 1-113 inclusive.

* The final column is a boolean variable representing the winning team; 0 if team zero won and 1 if team one won.

`MatchDetail` contains 202319 rows and 23 columns containing data on the experience and gold for a player at 5 minute intervals up to 45 minutes.  For example `e_5` represents experience of a player at the 5 minute mark and `g_5` represents gold of a player at the 5 minute mark. 

* The first 3 columns identify the match (`match_id`), the hero (`hero_id`) and the player (`player_id`).

* The next 10 columns detail the experience values at 5 minute intervals, from 0 minutes to 45 minutes inclusive.

* The final 10 columns detail the gold values at 5 minute intervals, from 0 minutes to 45 minutes inclusive.
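
The layout described above can be written down as a quick sanity check. This is only a minimal sketch: the helper name `expected_detail_columns` is our own and not part of the original analysis, and it assumes the id columns are named `match_id`, `hero_id` and `player_id` and that the time columns follow the `e_5` / `g_5` naming pattern.

```python
def expected_detail_columns(step=5, last=45):
    """Build the column names described above: three id columns,
    then experience (e_*) and gold (g_*) at each time mark."""
    marks = range(0, last + step, step)        # 0, 5, ..., 45
    ids = ['match_id', 'hero_id', 'player_id']
    exp = ['e_%d' % m for m in marks]          # e_0 ... e_45
    gold = ['g_%d' % m for m in marks]         # g_0 ... g_45
    return ids + exp + gold

columns = expected_detail_columns()
```

Comparing this list against `list(detail.columns)` after loading the file would flag a schema mismatch early.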


## Our Goal

Our goal for this project is to identify the best model for predicting victory in a DOTA match. 

In [3]:
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
from ggplot import *


We begin our analysis by loading the data and printing some information on each of the data sets.

In [4]:
def load_data(file1,file2):
    overview = pd.read_csv(file1)
    detail = pd.read_csv(file2)
    return overview,detail

overview,detail = load_data("MatchOverview.csv","MatchDetail.csv")

print overview.dtypes
print "---------"
print detail.dtypes

print overview.head()
print "---------"
print detail.head()
    

match_id         int64
hero_1           int64
hero_2         float64
hero_3         float64
hero_4         float64
hero_5         float64
hero_6         float64
hero_7         float64
hero_8         float64
hero_9         float64
hero_10        float64
first_5_won     object
dtype: object
---------
match_id       int64
hero_id      float64
player_id    float64
e_0          float64
e_5          float64
e_10         float64
e_15         float64
e_20         float64
e_25         float64
e_30         float64
e_35         float64
e_40         float64
e_45         float64
g_0          float64
g_5          float64
g_10         float64
g_15         float64
g_20         float64
g_25         float64
g_30         float64
g_35         float64
g_40         float64
g_45         float64
dtype: object
     match_id  hero_1  hero_2  hero_3  hero_4  hero_5  hero_6  hero_7  hero_8  \
0  2488245920      34    83.0    29.0   102.0    12.0    88.0   107.0    10.0   
1  2488233366     102    63.0    34.0   1

In order to better analyze the data, we merged the two data frames into one data frame, aptly named `df`. Using pandas's `merge`, we joined the two data sets on `match_id`, which by default keeps only the match ids that have a counterpart in the other data set. The new data set has one row per match, player and hero. Because `MatchDetail` has one row for each player in a match, the `MatchOverview` columns for that match are repeated on every one of those rows.
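
To make the effect of the merge concrete, here is a small sketch on toy data. The frames `toy_overview` and `toy_detail` and their values are made up for illustration; only the `match_id` join key mirrors the real files. The `how` and `validate` arguments are optional extras that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the two files (values are invented for illustration)
toy_overview = pd.DataFrame({'match_id': [1, 2, 3],
                             'first_5_won': [True, False, True]})
toy_detail = pd.DataFrame({'match_id': [1, 1, 2, 4],
                           'player_id': [0, 1, 0, 0]})

# how='inner' is the default for merge; it is spelled out here for clarity.
# validate='one_to_many' raises an error if match_id repeats in the left frame.
toy_df = toy_overview.merge(toy_detail, on='match_id', how='inner',
                            validate='one_to_many')

n_rows = len(toy_df)                       # one row per player record
n_matches = toy_df['match_id'].nunique()   # matches present in both frames
```

Match 3 (overview only) and match 4 (detail only) are dropped, and the overview columns for match 1 appear once per player row. Note that `len(toy_df)` counts player rows, while `nunique()` counts matches.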

In [5]:
#print "this is the number of unique match_ids in the overview dataframe."
#print len(overview['match_id'].unique())

#print "this is the number of unique match_ids in the detail dataframe."
#print len(detail['match_id'].unique())

#print "in this case the intersect happens to be the components of the detail match_id column."
#print len(np.intersect1d(detail['match_id'].unique(),overview['match_id'].unique()))

df = overview.merge(detail,on="match_id")
#print "once we merge the two columns we only have matches that exist in both dataframes."
print "Rows in merged data frame:", len(df)
print "df.dtypes:", df.dtypes
print "----------"
#print "notice how a lot of columns are repeated. For each player in a game they have their own row in the detail csv."
#print "as a result the lines of the overview csv are repeated."
print df.tail(5)
print len(df['match_id'].unique())




Rows in merged data frame: 202319
df.dtypes: match_id         int64
hero_1           int64
hero_2         float64
hero_3         float64
hero_4         float64
hero_5         float64
hero_6         float64
hero_7         float64
hero_8         float64
hero_9         float64
hero_10        float64
first_5_won     object
hero_id        float64
player_id      float64
e_0            float64
e_5            float64
e_10           float64
e_15           float64
e_20           float64
e_25           float64
e_30           float64
e_35           float64
e_40           float64
e_45           float64
g_0            float64
g_5            float64
g_10           float64
g_15           float64
g_20           float64
g_25           float64
g_30           float64
g_35           float64
g_40           float64
g_45           float64
dtype: object
----------
          match_id  hero_1  hero_2  hero_3  hero_4  hero_5  hero_6  hero_7  \
202314  2502981287     104    83.0    88.0    53.0    15.0    39.0    31.0   


We then cleaned our data by:

* Tallying how many wins each team had. When we split our data set into training and testing data, we can then preserve the ratio of wins to losses for each team. This avoids the rare case where our training set consists entirely of one team winning, which would ruin predictions.
* Casting the hero columns (`hero_1` through `hero_10`, and `hero_id`) to strings, so they will be interpreted as factors and not as quantitative variables.
* Removing the match that does not have a full roster of ten players.
* Replacing experience and gold values recorded as zero after the match has ended with the value from the last time mark before it ended. This removes zeros that may invalidate our correlations and regressions.
* Adding a `team` column derived from `player_id` (0 for player ids below 5, 1 otherwise), a binary variable representing team, as we are interested in team performances and not individual performances.
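
The first point, preserving the win ratio across a train/test split, can be sketched as follows. This is an illustration under stated assumptions rather than the split used in this project: the helper `stratified_match_split` is hypothetical, the `True`/`False` labels are assumed (the notebook only shows that `first_5_won` has dtype `object`), and `groupby(...).sample` needs a reasonably recent pandas (1.1 or later). The split is done on one-row-per-match data so that all ten player rows of a match would land on the same side.

```python
import pandas as pd

def stratified_match_split(matches, label='first_5_won', frac=0.8, seed=0):
    """Split one-row-per-match data so each outcome keeps the same share."""
    # Sample the same fraction from every outcome group
    train = matches.groupby(label).sample(frac=frac, random_state=seed)
    # Everything not drawn for training becomes the test set
    test = matches.drop(train.index)
    return train, test

# Toy data: 6 wins for the first team, 4 for the second (invented values)
toy_matches = pd.DataFrame({
    'match_id': range(10),
    'first_5_won': [True] * 6 + [False] * 4,
})
train, test = stratified_match_split(toy_matches, frac=0.5)
```

With `frac=0.5`, both halves keep the 6:4 ratio of the toy data (three wins and two losses each), and no match appears on both sides.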

In [9]:
def clean_data(tempdf):
    newdf = tempdf
    #Player ids below 5 belong to team 0, the rest to team 1
    newdf = newdf.assign(team = newdf.apply(lambda x: 0 if x.player_id < 5 else 1,axis=1))
    #Cast the hero columns to strings so they are treated as factors
    hero_cols = ['hero_1','hero_2','hero_3','hero_4','hero_5','hero_6','hero_7','hero_8','hero_9','hero_10','hero_id']
    newdf[hero_cols] = newdf[hero_cols].astype(str)
    #on this line we can check to make sure every game has a full roster of players (having 10 players).
    #print "We want to make sure each id occurs the same number of times"
    #Use the local frame here rather than the global df
    print newdf['player_id'].value_counts()
    #When we run the above line we see that there is at least 1 game without a full roster!
    C = Counter(newdf['match_id'])
    less = [x for x in C.keys() if C[x] < 10]
    #print "So it turns out we only have %d ID that doesn't show up 10 times" % (len(less))
    #print "Previously our dataframe had %d lines" % len(newdf)
    #Drop every match with fewer than 10 rows, not only the first one found
    newdf = newdf[~newdf['match_id'].isin(less)]
    #Once we eliminate these rows we have only full game data!
    #print "now we have %d lines" % len(newdf)
    
    #Now we want to change all values of 0 after the match ends into the last non-zero value
    newdf = newdf.apply(set_zero,axis=1)
    
    #We want to count the number of games each team wins
    #(each match contributes 10 rows, so these are row counts)
    counts = newdf['first_5_won'].value_counts()
    newdf.to_csv("output.csv")
    return newdf,counts

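#There has to be a better way to do this but I have not thought it up
def set_zero(row):
    #Walk the time marks from 10 to 45 minutes in order. An experience
    #value of 0 at a mark is treated as "the match had already ended",
    #so both experience and gold at that mark are copied from the
    #previous mark.
    value = 5
    while value <= 40:
        e1 = 'e_' + str(value+5)
        e2 = 'e_' + str(value)
        g1 = 'g_' + str(value+5)
        g2 = 'g_' + str(value)
        if row[e1] == 0:
            row[e1] = row[e2]
            row[g1] = row[g2]
        value += 5
    return row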

overview,detail = load_data("MatchOverview.csv","MatchDetail.csv")
df = overview.merge(detail,on="match_id")
df,counts = clean_data(df) 

7.0    20232
6.0    20232
5.0    20232
4.0    20232
3.0    20232
2.0    20232
1.0    20232
0.0    20232
9.0    20231
8.0    20231
Name: player_id, dtype: int64
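
The comment above `set_zero` asks whether there is a better way than a row-by-row loop. One possible alternative is sketched below: mark the positions where experience is zero from the 10 minute mark onward, blank out experience and gold at those positions, and forward-fill along each row. This is a sketch, not the method used in this notebook. The function name `fill_after_match_end` and the toy frame are our own, and `to_numpy` requires a newer pandas than the notebook appears to use.

```python
import numpy as np
import pandas as pd

MINUTES = list(range(0, 50, 5))
E_COLS = ['e_%d' % m for m in MINUTES]
G_COLS = ['g_%d' % m for m in MINUTES]

def fill_after_match_end(frame):
    """Vectorized counterpart of set_zero: where experience is 0 at the
    10..45 minute marks, carry experience and gold forward from the
    previous mark."""
    out = frame.copy()
    # Same test as set_zero: only experience decides whether a mark is filled
    ended = (out[E_COLS[2:]] == 0).to_numpy()
    for cols in (E_COLS, G_COLS):
        vals = out[cols].to_numpy(dtype=float, copy=True)
        vals[:, 2:][ended] = np.nan            # blank the marks to be filled
        filled = pd.DataFrame(vals, index=out.index, columns=cols)
        out[cols] = filled.ffill(axis=1)       # carry the last value forward
    return out

# Toy rows (invented values): a match that ended after 15 minutes,
# and a match that ran the full 45 minutes
demo = pd.DataFrame(
    [[0, 100, 250, 400, 0, 0, 0, 0, 0, 0] + [0, 80, 200, 310, 0, 0, 0, 0, 0, 0],
     list(range(0, 100, 10)) + list(range(0, 50, 5))],
    columns=E_COLS + G_COLS)
filled_demo = fill_after_match_end(demo)
```

In the first toy row the values at 15 minutes are carried through to 45 minutes, while the second row is left unchanged. On the full data frame this avoids calling a Python function once per row.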
