# Sourcing and converting SC2 replays

StarCraft II replay files are a dime a dozen.

In this notebook we dedicate ourselves to sourcing some of these files, and converting them to a tractable format.

Our priorities are:

    - the 420+ pro replays of the most recent SC2 world championship.
    
    - the 7200 pro replays available at http://lotv.spawningtool.com/
    - the 16,000+ grand-master and master replays readily available at www.gamereplays.org
    
    - the 25,000+ mixed-skill replays at http://lotv.spawningtool.com/
    - the 65,000+ mixed-skill replays at www.gamereplays.org

It is also worth noting that Blizzard (in partnership with Google DeepMind) recently released 35,000 anonymized replay files for the purposes of A.I. research, and that they intend for this dataset to grow to 500,000 by the end of the year. However, their process of anonymizing these files has made them incompatible with our parser. If time allows we will seek to remedy this, but 100k+ replays may well be enough.

### Sourcing the 420+ pro replays of the most recent SC2 world championship.

This one is not difficult, as a download link is readily available:

http://www.mediafire.com/file/4er2bk8k5d65bb4/IEM+XI+-+World+Championship+-+StarCraft+II+Replays.rar

### Converting these 420+ pro replays to dictionaries:

In [2]:
import sc2reader
import pickle

from Scripts.replay_to_dict import replay_to_dict

In [3]:
path_to_games = './../../../sc2games'

In [None]:
iem_replays = [replay_to_dict(replay)
               for replay in sc2reader.load_replays(
                   path_to_games + '/IEM XI - World Championship - StarCraft II Replays/',
                   load_level=3)]

In [None]:
# with open(path_to_games + '/PickledGames/iem_replays.p','wb') as iem_file:
#     pickle.dump(iem_replays, iem_file)

### Sourcing the 7200 pro replays available at http://lotv.spawningtool.com

This is easily done. Using XPath we discovered that lotv.spawningtool.com allows the download of a zip file of 25 replays by visiting a URL of the form:

    http://lotv.spawningtool.com/zip/? + <details>
    
With some further tinkering we discovered the following settings of interest:

    pro_only=on
    tag= <120 to 219> (relates to the labeled build)
    
However, trying to classify the build order chosen by the players is not within the scope of our project. This would be an interesting area for further study. For the time being we scrape games of the form:
    
    http://lotv.spawningtool.com/zip/?pro_only=on&p=<result page>

We use a random delay between requests (averaging roughly 8.6 seconds between queries). This is partly a courtesy to the lotv.spawningtool.com IT team, and partly to avoid saturating our internet connection or being classed as spammers.

The code we used to do this was:

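    import requests, zipfile, io, time, numpy

    for i in range(1, 291):
        # Each result page is served as a single zip archive of replays.
        r = requests.get('http://lotv.spawningtool.com/zip/?pro_only=on&p={}'.format(i))
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall()
        # Wait for the larger of two N(7, 3) draws, but never less than 1 second.
        time.sleep(max(max(numpy.random.normal(7, 3, 2)), 1))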
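
As a sanity check on the average delay quoted above, the waiting time can be simulated offline. The sketch below is an illustration rather than part of the original scraper: it uses the standard library's `random` module in place of numpy, and the helper name `polite_delay` is made up for this example. It compares the simulated mean with the closed-form mean of the larger of two independent normal draws, which is `mean + sd / sqrt(pi)` when the 1-second floor is ignored.

```python
import math
import random


def polite_delay(rng, mean=7.0, sd=3.0, floor=1.0):
    """Larger of two normal draws, never shorter than `floor` seconds."""
    return max(rng.gauss(mean, sd), rng.gauss(mean, sd), floor)


rng = random.Random(0)  # fixed seed so the simulation is repeatable
samples = [polite_delay(rng) for _ in range(100_000)]
simulated_mean = sum(samples) / len(samples)

# Ignoring the floor, E[max(X1, X2)] = mean + sd / sqrt(pi) for two
# independent draws from the same normal distribution.
analytic_mean = 7.0 + 3.0 / math.sqrt(math.pi)

print(round(simulated_mean, 2), round(analytic_mean, 2))
```

The floor is hit only when both draws fall below 1 second, which is rare enough that the two numbers should agree closely.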

### Converting these 7000+ pro replays to dictionaries:
Here we run into some data quality issues: sc2reader fails to parse some of the replay files.

To circumvent this as much as possible we wrap the parsing of each replay file in a try-except control structure. This gives us the parsed replays for every file that sc2reader can load, along with a list of the files it could not.

##### Part 1: Loading games 1-1239

In [14]:
import os
root, _, filenames =  list(os.walk(path_to_games+'/LotV SpawingTool Replays/Pro Replays/Zips 1 to 50'))[0]

def carefully_load_games(root, filenames):
    """Parse each replay, collecting the paths of the files sc2reader cannot load."""
    successfully_parssed_games = []
    errors = []
    for replay_file in [os.path.join(root, filename) for filename in filenames]:
        try:
            game = sc2reader.load_replay(replay_file, load_level=3)
            successfully_parssed_games.append(game)
        except Exception:  # record the failure, but let KeyboardInterrupt through
            errors.append(replay_file)

    return (successfully_parssed_games, errors)
        
successfully_parssed_games, errors = carefully_load_games(root, filenames)

successes = len(successfully_parssed_games)
print('Successfully parsed games: {} of {}'.format(successes,successes+len(errors)))

Successfully parsed games: 697 of 1239


It is unfortunate that 542 of our 1239 games (about 44%) were simply unreadable by sc2reader. This is a regrettable data quality issue, but resolving it does not fall within the scope of this project. We proceed with the successfully loaded games.
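
Should anyone wish to look into these failures later, a natural first step would be to record why each file failed rather than only which file failed. The following is a minimal sketch under stated assumptions: `load_with_reasons` and `fake_loader` are hypothetical names introduced here, and `fake_loader` merely stands in for `sc2reader.load_replay` so that the example runs without any replay files.

```python
from collections import Counter


def load_with_reasons(paths, loader):
    """Return the parsed games plus a {path: exception name} map of failures."""
    games, failures = [], {}
    for path in paths:
        try:
            games.append(loader(path))
        except Exception as exc:
            failures[path] = type(exc).__name__
    return games, failures


def fake_loader(path):
    """Stand-in parser (hypothetical): rejects any file whose name ends in 'bad.SC2Replay'."""
    if path.endswith("bad.SC2Replay"):
        raise ValueError("unsupported build")
    return {"path": path}


games, failures = load_with_reasons(
    ["a.SC2Replay", "bad.SC2Replay", "c.SC2Replay"], fake_loader)

# Tally the failures by exception type to see which problems dominate.
reason_counts = Counter(failures.values())
print(len(games), dict(reason_counts))
```

With the real parser one would pass something like `lambda p: sc2reader.load_replay(p, load_level=3)` as `loader`; a tally dominated by a single exception type would suggest one underlying cause.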

In [9]:
lovt_games_part1 = [replay_to_dict(replay) 
                       for replay in successfully_parssed_games]

In [11]:
# with open(path_to_games + '/PickledGames/lovt_games_part1.p','wb') as lovt_part1_file:
#     pickle.dump(lovt_games_part1, lovt_part1_file)

##### Part 2: Loading games 1240 - 2436

In [15]:
root, _, filenames =  list(os.walk(path_to_games+'/LotV SpawingTool Replays/Pro Replays/Zips 51 to 100'))[0]
successfully_parssed_games, errors = carefully_load_games(root, filenames)

successes = len(successfully_parssed_games)
print('Successfully parsed games: {} of {}'.format(successes,successes+len(errors)))

Successfully parsed games: 1195 of 1196


On this second set of games we observe a much higher success rate when loading the replays. It may be that the first few results from our query to lotv.spawningtool.com were from a game version or event that sc2reader was not prepared to handle.

It should be mentioned that sc2reader is, internally, an absolute mess of control structures. When updating the game, Blizzard has often had no qualms about altering the binary representation of objects in the replay file. Because of this, sc2reader has to be updated for each new patch, and when it is tasked with loading a file it must inspect the raw bytes to figure out which version of its parsing logic to use in order to parse the file successfully.
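
The general pattern described here, choosing a parser from a version number stored in the file, can be illustrated with a toy sketch. None of the names or build numbers below come from sc2reader; they are invented purely to show the idea of version dispatch.

```python
import bisect


def parse_old(payload):
    """Hypothetical parser for the older file layout."""
    return {"format": "old", "size": len(payload)}


def parse_new(payload):
    """Hypothetical parser for the newer file layout."""
    return {"format": "new", "size": len(payload)}


# (first build that uses this layout, parser), sorted by build number.
# The build numbers are made up for illustration.
PARSERS = [(0, parse_old), (50000, parse_new)]
FIRST_BUILDS = [first for first, _ in PARSERS]


def parse_replay(build, payload):
    """Dispatch to the newest parser whose first build is <= `build`."""
    index = bisect.bisect_right(FIRST_BUILDS, build) - 1
    if index < 0:
        raise ValueError("no parser for build {}".format(build))
    return PARSERS[index][1](payload)


print(parse_replay(49999, b"abc"), parse_replay(50000, b"abcd"))
```

Every patch that changes the layout adds another entry to such a table, which is one way a parser ends up needing an update for each new game version.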

In [18]:
lovt_games_part2 = [replay_to_dict(replay) 
                       for replay in successfully_parssed_games]

In [19]:
# with open(path_to_games + '/PickledGames/lovt_games_part2.p','wb') as lovt_part2_file:
#    pickle.dump(lovt_games_part2, lovt_part2_file)

##### Part 3: Loading games 2437 - 3608

In [20]:
root, _, filenames =  list(os.walk(path_to_games+'/LotV SpawingTool Replays/Pro Replays/Zips 101 to 150'))[0]
successfully_parssed_games, errors = carefully_load_games(root, filenames)

successes = len(successfully_parssed_games)
print('Successfully parsed games: {} of {}'.format(successes,successes+len(errors)))

Successfully parsed games: 1171 of 1171


In [21]:
lovt_games_part3 = [replay_to_dict(replay) 
                       for replay in successfully_parssed_games]

In [24]:
# with open(path_to_games + '/PickledGames/lovt_games_part3.p','wb') as lovt_part3_file:
#     pickle.dump(lovt_games_part3, lovt_part3_file)

##### Part 4: Loading games 3609 - 4753

In [25]:
root, _, filenames =  list(os.walk(path_to_games+'/LotV SpawingTool Replays/Pro Replays/Zips 151 to 200'))[0]
successfully_parssed_games, errors = carefully_load_games(root, filenames)

successes = len(successfully_parssed_games)
print('Successfully parsed games: {} of {}'.format(successes,successes+len(errors)))

Successfully parsed games: 1144 of 1144


In [26]:
lovt_games_part4 = [replay_to_dict(replay) 
                       for replay in successfully_parssed_games]

In [28]:
# with open(path_to_games + '/PickledGames/lovt_games_part4.p','wb') as lovt_part4_file:
#    pickle.dump(lovt_games_part4, lovt_part4_file)

##### Part 5: Loading games 4754 - 5990

In [29]:
root, _, filenames =  list(os.walk(path_to_games+'/LotV SpawingTool Replays/Pro Replays/Zips 201 to 250'))[0]
successfully_parssed_games, errors = carefully_load_games(root, filenames)

successes = len(successfully_parssed_games)
print('Successfully parsed games: {} of {}'.format(successes,successes+len(errors)))

Successfully parsed games: 1237 of 1237


In [30]:
lovt_games_part5 = [replay_to_dict(replay) 
                       for replay in successfully_parssed_games]

In [32]:
# with open(path_to_games + '/PickledGames/lovt_games_part5.p','wb') as lovt_part5_file:
#    pickle.dump(lovt_games_part5, lovt_part5_file)

##### Part 6: Loading games 5991 - 6911

In [33]:
root, _, filenames =  list(os.walk(path_to_games+'/LotV SpawingTool Replays/Pro Replays/Zips 251 to 289'))[0]
successfully_parssed_games, errors = carefully_load_games(root, filenames)

successes = len(successfully_parssed_games)
print('Successfully parsed games: {} of {}'.format(successes,successes+len(errors)))

Successfully parsed games: 921 of 921


In [34]:
lovt_games_part6 = [replay_to_dict(replay) 
                       for replay in successfully_parssed_games]

In [37]:
# with open(path_to_games + '/PickledGames/lovt_games_part6.p','wb') as lovt_part6_file:
#    pickle.dump(lovt_games_part6, lovt_part6_file)

It is odd that the 289 pages of results (25 results per page, so 7225 expected) yielded only 6908 replay files in total. This may be due to repeated games across pages (same file name) that got overwritten when the zip files were extracted. We have not had time to explore this issue further, nor does it influence our intended analysis.
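
One way to test the overwriting hypothesis would be to count member names across the downloaded archives before extracting them. The sketch below is an assumption-laden illustration: it builds two small in-memory zips with made-up file names instead of reading the real downloads, and the helper names `count_member_names` and `make_zip` are hypothetical.

```python
import io
import zipfile
from collections import Counter


def count_member_names(zip_blobs):
    """Count how often each member name occurs across several zip archives."""
    counts = Counter()
    for blob in zip_blobs:
        with zipfile.ZipFile(io.BytesIO(blob)) as archive:
            counts.update(archive.namelist())
    return counts


def make_zip(names):
    """Build a small in-memory zip holding placeholder files (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"placeholder")
    return buffer.getvalue()


blobs = [make_zip(["game_a.SC2Replay", "game_b.SC2Replay"]),
         make_zip(["game_b.SC2Replay", "game_c.SC2Replay"])]

counts = count_member_names(blobs)
repeated = sorted(name for name, n in counts.items() if n > 1)
# Files that would vanish if every archive were extracted into one folder.
lost_to_overwrites = sum(counts.values()) - len(counts)
print(repeated, lost_to_overwrites)
```

If the number of lost files computed this way over the real archives matched the shortfall, the overwriting explanation would be confirmed.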

All in all we successfully parsed

    697 + 1195 + 1171 + 1144 + 1237 + 921 = 6365
    
professional StarCraft II replays.

### Sourcing the 16k+ pro replays available at https://www.gamereplays.org

This was slightly more involved. Here replay files may be downloaded one at a time by visiting a URL of the form

    https://www.gamereplays.org/starcraft2/replays...
    
where the URL carries an `id=` parameter, and the game's id is a unique identifier within gamereplays.org.

No clear index existed for these ids, but it was easy enough to:

- obtain the raw HTML of the various result pages using requests,
- parse the HTML for the ids using the lxml library and XPath expressions,
- collate the ids into a CSV file to serve as an index.
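
The parsing and collating steps above can be sketched as follows. To keep the example free of dependencies it swaps in the standard library's `html.parser` for lxml and XPath, and it works on a made-up HTML snippet rather than a page fetched with requests. The markup, the `example.org` URLs and the helper names are all assumptions for illustration, not the real structure of gamereplays.org.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class ReplayLinkParser(HTMLParser):
    """Collect every anchor href that carries an `id` query parameter."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "id" in parse_qs(urlparse(href).query):
            self.links.append(href)


def collect_links(html_pages):
    """Parse all pages and return the unique links in first-seen order."""
    parser = ReplayLinkParser()
    for page in html_pages:
        parser.feed(page)
    return list(dict.fromkeys(parser.links))


def write_index(links, handle):
    """Write (row number, link) rows, the layout the index CSV is read with."""
    writer = csv.writer(handle)
    for i, link in enumerate(links):
        writer.writerow([i, link])


# Hypothetical markup standing in for one page of search results.
sample_page = (
    '<a href="https://example.org/replays?show=download&amp;id=101">one</a>'
    '<a href="/about">about</a>'
    '<a href="https://example.org/replays?show=download&amp;id=102">two</a>'
)

links = collect_links([sample_page, sample_page])  # same page twice: no duplicates
buffer = io.StringIO()
write_index(links, buffer)
print(buffer.getvalue())
```

Writing the row number in the first column and the link in the second matches how the index is read back with `pd.read_csv(..., header=None)[1]`.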

In [None]:
import pandas as pd
game_links = pd.read_csv('./Resources/links_to_games_in_GameReplay.csv', header = None)[1]
game_links.tail()

In [None]:
game_links[0]

At this point it is just a matter of iterating through the 16284 links using requests.

We introduce a delay of 1 second between calls to avoid the wrath of their I.T. staff.

In [None]:
file_names = [str(a)+'_'+b.split('id=')[-1] for a,b in game_links.items()]

In [None]:
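import time
import requests

for i in range(len(game_links)):  # 16284 links
    id_of_game = game_links.iloc[i].split('id=')[1]
    r = requests.get(game_links.iloc[i], allow_redirects=True)
    # Open the file only after the download, so a failed request leaves no empty file behind.
    with open('./../../../sc2games/GameReplayOrg/'+str(i)+'_'+id_of_game+'.SC2Replay', 'wb') as destination:
        destination.write(r.content)
    if i%100 == 0: print(i, end=';')
    time.sleep(1)  # a single 1-second pause per request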

We have had no time to scale our analysis to these 16k+ files. Analyzing these replays is an obvious area for further study.