# Parsing Data

As it stands, we have 50 summary files and 20 timeline files, each containing 1,000 matches.

We parse the data into a single place so we can work with it without having to hold 20 GB of data in memory. (We could explore using NoSQL for the timeline data, but it may turn out that we do not need a key-value store.)

For each of summary and timeline data we will need to:

- Decide on a table structure
- Create a class to parse the data

## Summary data parsing

In [68]:
import pandas as pd 
import json
import ast

import requests
import time


In [48]:
summary = pd.read_csv('./match-data/summaries/summary_0.csv')


In [49]:
summary.head()


Unnamed: 0,info,metadata
0,"{'gameCreation': 1635958972000, 'gameDuration'...","{'dataVersion': '2', 'matchId': 'EUW1_55368067..."
1,"{'gameCreation': 1635958972000, 'gameDuration'...","{'dataVersion': '2', 'matchId': 'EUW1_55368067..."
2,"{'gameCreation': 1642718390000, 'gameDuration'...","{'dataVersion': '2', 'matchId': 'EUW1_56803270..."
3,"{'gameCreation': 1642718390000, 'gameDuration'...","{'dataVersion': '2', 'matchId': 'EUW1_56803270..."
4,"{'gameCreation': 1639481435000, 'gameDuration'...","{'dataVersion': '2', 'matchId': 'EUW1_56083649..."


In [367]:
for i, item in summary.head().iterrows():
    print(item['matchId'])

EUW1_5536806773
EUW1_5536806773
EUW1_5680327054
EUW1_5680327054
EUW1_5608364901


In [61]:
# we know that in the previous notebook we introduced duplicates, so let's remove them.
# we need the matchId from the metadata column.
# the column holds Python dict strings (single-quoted), which are not valid JSON, so we parse them with ast.literal_eval.

summary['matchId'] = summary['metadata'].apply(lambda m: ast.literal_eval(m)['matchId'])

In [63]:
# deduplicated summary
summary_dedup = summary.drop_duplicates(subset='matchId').reset_index(drop=True).drop(columns='metadata')
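
The choice of `ast.literal_eval` over `json.loads` can be illustrated with a small sketch. The metadata string below is a made-up stand-in in the same shape as the `metadata` column (the match ID is hypothetical). Single-quoted keys are not valid JSON, so the JSON parser rejects the string, while `ast.literal_eval` reads it as a Python literal.

```python
import ast
import json

# Made-up stand-in for one entry of the 'metadata' column (hypothetical match ID)
raw = "{'dataVersion': '2', 'matchId': 'EUW1_0000000001'}"

# json.loads expects double-quoted strings, so this input is rejected
try:
    json.loads(raw)
    parsed_with_json = True
except json.JSONDecodeError:
    parsed_with_json = False

# ast.literal_eval safely evaluates the string as a Python dict literal
metadata = ast.literal_eval(raw)
match_id = metadata['matchId']

print(parsed_with_json, match_id)
```

`ast.literal_eval` only accepts literals (dicts, lists, strings, numbers, booleans, `None`), so it does not execute arbitrary code the way `eval` would.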

In [66]:
# now we need to decide what to do with the info entry

ast.literal_eval(summary_dedup['info'][0])

{'gameCreation': 1635958972000,
 'gameDuration': 1012,
 'gameEndTimestamp': 1635960026182,
 'gameId': 5536806773,
 'gameMode': 'CLASSIC',
 'gameName': 'teambuilder-match-5536806773',
 'gameStartTimestamp': 1635959013202,
 'gameType': 'MATCHED_GAME',
 'gameVersion': '11.22.406.3587',
 'mapId': 11,
 'participants': [{'assists': 0,
   'baronKills': 0,
   'bountyLevel': 1,
   'champExperience': 8547,
   'champLevel': 11,
   'championId': 157,
   'championName': 'Yasuo',
   'championTransform': 0,
   'consumablesPurchased': 2,
   'damageDealtToBuildings': 332,
   'damageDealtToObjectives': 2079,
   'damageDealtToTurrets': 332,
   'damageSelfMitigated': 9081,
   'deaths': 3,
   'detectorWardsPlaced': 0,
   'doubleKills': 0,
   'dragonKills': 0,
   'firstBloodAssist': False,
   'firstBloodKill': True,
   'firstTowerAssist': False,
   'firstTowerKill': False,
   'gameEndedInEarlySurrender': False,
   'gameEndedInSurrender': True,
   'goldEarned': 6934,
   'goldSpent': 5925,
   'individualPosit

## Outline of method of parsing

Though this data is heavily nested in a key-value structure, any classical clustering or supervised regression task needs it in a flat, tabular form. It is therefore necessary to reduce the information to a relational (SQL-like) structure.

We are lucky to have mostly numeric and boolean data here, which means modelling takes significantly fewer steps than it would with free text.

We know that a player's performance depends on the other players in the game, so is it enough to measure a player's performance in isolation? Strictly, better results could be obtained by modelling interactions between players. However, due to time and scope considerations, we will limit ourselves to viewing each player's performance independently of the other players. This keeps the table at a reasonable dimension and limits overall complexity.

Therefore we will organise the summary data into a table as follows (match metadata is repeated on every row of the same match):

| MatchId | Player Name | Player stats (105 features) | Match metadata (length, time started, etc.) |
| ---- | ---- | ---- | ---- |
| EUW_134235 | ApplySunday | 7 0 5..etc | 20min |
| EUW_134235 | IamPlayerTwo | 2 3 5..etc | 20min |
| EUW_134235 | MovingOnwards | 9 1 5..etc | 20min |
| ... | ... | ... | ... |
| EUW_948582 | DifferentGamePlayerOne | 3 9 3..etc | 40min |

We should therefore have a dataframe with 10 rows per match: about 500,000 rows in total (50 files × 1,000 matches × 10 players), or fewer once duplicate matches are removed.
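
As a rough sanity check on this layout, the sketch below counts rows per match and computes the expected upper bound on the table size. The rows are hypothetical placeholders (only `matchId` and `participantId` are filled in), and the 10-players-per-match figure is the assumption stated above.

```python
from collections import Counter

PLAYERS_PER_MATCH = 10  # assumption from the text: 10 rows per match

# Hypothetical parsed rows: one dict per player per match (stats omitted)
rows = [
    {'matchId': match_id, 'participantId': participant_id}
    for match_id in ('EUW1_0000000001', 'EUW1_0000000002')
    for participant_id in range(1, PLAYERS_PER_MATCH + 1)
]

# Count rows per match and flag any match without exactly 10 rows
rows_per_match = Counter(row['matchId'] for row in rows)
incomplete = [m for m, n in rows_per_match.items() if n != PLAYERS_PER_MATCH]

# Upper bound on the full table: 50 files x 1,000 matches x 10 players
expected_rows = 50 * 1000 * PLAYERS_PER_MATCH

print(len(rows), incomplete, expected_rows)
```

The same per-match count could be run on the real parsed output to spot matches with missing participants.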


In [111]:
# let's write a function to build a dataframe with this required structure per entry in summary_dedup

ast.literal_eval(summary_dedup['info'][0])

# note that the styles and perks fields are nested JSON objects - we will not use these as they are very specific

{'gameCreation': 1635958972000,
 'gameDuration': 1012,
 'gameEndTimestamp': 1635960026182,
 'gameId': 5536806773,
 'gameMode': 'CLASSIC',
 'gameName': 'teambuilder-match-5536806773',
 'gameStartTimestamp': 1635959013202,
 'gameType': 'MATCHED_GAME',
 'gameVersion': '11.22.406.3587',
 'mapId': 11,
 'participants': [{'assists': 0,
   'baronKills': 0,
   'bountyLevel': 1,
   'champExperience': 8547,
   'champLevel': 11,
   'championId': 157,
   'championName': 'Yasuo',
   'championTransform': 0,
   'consumablesPurchased': 2,
   'damageDealtToBuildings': 332,
   'damageDealtToObjectives': 2079,
   'damageDealtToTurrets': 332,
   'damageSelfMitigated': 9081,
   'deaths': 3,
   'detectorWardsPlaced': 0,
   'doubleKills': 0,
   'dragonKills': 0,
   'firstBloodAssist': False,
   'firstBloodKill': True,
   'firstTowerAssist': False,
   'firstTowerKill': False,
   'gameEndedInEarlySurrender': False,
   'gameEndedInSurrender': True,
   'goldEarned': 6934,
   'goldSpent': 5925,
   'individualPosit

In [147]:
import typing as t

# let's define a JSON type so we can include it as a type hint in our function
JSON = t.Union[str, int, float, bool, None, t.Mapping[str, 'JSON'], t.List['JSON']]

def parse_summary(match_info: JSON) -> pd.DataFrame:
    
    ''' Takes a single row of the matches summary data and outputs 10 rows, 1 per player'''
    
    # perhaps not the most efficient function, but let's make a dataframe for every json and return that
    # (the argument is called match_info so that it does not shadow the imported json module)
    game_start = match_info['gameStartTimestamp']
    game_duration = match_info['gameDuration']
    
    features = []
    
    for participant in match_info['participants']:
        
        # copy everything except the nested 'perks' object, leaving the input untouched
        row = {key: value for key, value in participant.items() if key != 'perks'}
        
        row['gameStart'] = game_start
        row['gameDuration'] = game_duration
        
        features.append(row)
    
    df = pd.DataFrame(features)
    
    return df

In [345]:
print('working as expected!')

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(parse_summary(ast.literal_eval(summary_dedup['info'][0])).head())


working as expected!


Unnamed: 0,assists,baronKills,bountyLevel,champExperience,champLevel,championId,championName,championTransform,consumablesPurchased,damageDealtToBuildings,damageDealtToObjectives,damageDealtToTurrets,damageSelfMitigated,deaths,detectorWardsPlaced,doubleKills,dragonKills,firstBloodAssist,firstBloodKill,firstTowerAssist,firstTowerKill,gameDuration,gameEndedInEarlySurrender,gameEndedInSurrender,gameStart,goldEarned,goldSpent,individualPosition,inhibitorKills,inhibitorTakedowns,inhibitorsLost,item0,item1,item2,item3,item4,item5,item6,itemsPurchased,killingSprees,kills,lane,largestCriticalStrike,largestKillingSpree,largestMultiKill,longestTimeSpentLiving,magicDamageDealt,magicDamageDealtToChampions,magicDamageTaken,neutralMinionsKilled,nexusKills,nexusLost,nexusTakedowns,objectivesStolen,objectivesStolenAssists,participantId,pentaKills,physicalDamageDealt,physicalDamageDealtToChampions,physicalDamageTaken,profileIcon,puuid,quadraKills,riotIdName,riotIdTagline,role,sightWardsBoughtInGame,spell1Casts,spell2Casts,spell3Casts,spell4Casts,summoner1Casts,summoner1Id,summoner2Casts,summoner2Id,summonerId,summonerLevel,summonerName,teamEarlySurrendered,teamId,teamPosition,timeCCingOthers,timePlayed,totalDamageDealt,totalDamageDealtToChampions,totalDamageShieldedOnTeammates,totalDamageTaken,totalHeal,totalHealsOnTeammates,totalMinionsKilled,totalTimeCCDealt,totalTimeSpentDead,totalUnitsHealed,tripleKills,trueDamageDealt,trueDamageDealtToChampions,trueDamageTaken,turretKills,turretTakedowns,turretsLost,unrealKills,visionScore,visionWardsBoughtInGame,wardsKilled,wardsPlaced,win
0,0,0,1,8547,11,157,Yasuo,0,2,332,2079,332,9081,3,0,0,0,False,True,False,False,1012,False,True,1635959013202,6934,5925,TOP,0,0,0,1054,3006,6673,1037,0,0,3340,16,1,4,NONE,361,3,1,613,5485,875,1552,4,0,0,0,0,0,1,0,69083,6973,8955,5117,NcEr3O6jwNB8ggIy9pjYhOk170z2yZ7bk2ZUx7v0wToJlS...,0,,,DUO,0,130,4,66,3,1,12,2,4,vPRE7cm6AKEeQeMhTR99bwwY8eBF6IAlGJJ-LfGrvQa3Gkc,260,kkrco,False,100,TOP,6,1012,74569,7849,0,10731,289,0,137,87,90,1,0,0,0,224,0,0,4,0,6,0,1,4,False
1,2,0,0,5881,9,76,Nidalee,0,3,0,5383,0,5124,3,2,0,0,True,False,False,False,1012,False,True,1635959013202,5246,5025,JUNGLE,0,0,0,4636,2031,2055,3020,0,0,3364,12,0,1,NONE,0,0,1,424,53081,3411,1848,90,0,0,0,0,0,2,0,9319,504,10755,4930,N01CDHKiy6zOPCMlG2RlF5xgiq9K55z1Sc91VzA8o482jv...,0,,,SUPPORT,0,137,134,79,150,2,4,10,11,lqTquo_VSH6BDuK2j79Ioiz28y1Nj-2k0-3ff6bF6BqFuaM,456,Sajnadz,False,100,JUNGLE,2,1012,73014,4114,0,12720,6877,696,13,70,56,5,0,10613,199,116,0,0,4,0,17,3,3,6,False
2,2,0,0,5521,9,23,Tryndamere,0,1,746,1271,746,7523,8,0,0,0,False,False,False,False,1012,False,True,1635959013202,4029,3575,MIDDLE,0,0,0,1055,6670,1001,1018,1037,0,3340,9,0,0,NONE,347,0,0,186,0,0,1283,0,0,0,0,0,0,3,0,33178,5233,13406,4893,GlK-rDoDz-Mqkqo4opBDsK4LOotYy0_6uxPHo7Z0uL61AC...,0,,,SUPPORT,0,17,11,41,5,3,6,2,4,p9HJ_g-TunwL9NMHEgpTFgrpZXS1a-TgvsopzSg-IaMbNo4,274,Swove,False,100,MIDDLE,4,1012,33178,5233,0,15197,1826,0,61,27,147,1,0,0,0,506,0,0,4,0,9,0,2,5,False
3,0,0,0,5618,9,22,Ashe,0,2,0,0,0,2897,3,1,0,0,False,False,False,False,1012,False,True,1635959013202,5277,4575,BOTTOM,0,0,0,1055,6673,1042,1001,0,0,3340,10,0,1,NONE,185,0,1,446,149,149,1101,0,0,0,0,0,0,4,0,32053,3792,5456,3870,nIZvsxyS2eXjOv0pvKhZiBgQPNcNvMYsiswlMPbutgHOpF...,0,,,SUPPORT,0,10,29,3,3,2,4,1,7,kvWGrk0tFq4fhffgyIvc6LRWja5NQvgaGYqAmnjls7EArsM,232,The Knax,False,100,BOTTOM,13,1012,36493,4147,0,6589,270,135,105,278,57,2,0,4289,204,32,0,0,4,0,4,1,0,4,False
4,2,0,0,4144,8,63,Brand,0,3,0,0,0,2648,4,1,0,0,False,False,False,False,1012,False,True,1635959013202,4040,3125,UTILITY,0,0,0,3853,3916,2031,1001,3802,0,3364,12,0,0,NONE,0,0,0,272,13163,3403,1267,0,0,0,0,0,0,5,0,1216,416,6623,3799,-sDvVzdopLUp8biXPBDeUxeeyD5YOPvO6g_CCJl-3lV0Sc...,0,,,SUPPORT,0,20,34,21,3,3,4,0,14,4AOA2A8nqFrnP6B0AUWsyhyJxzdQ-dl94IjA25_UPBiv05A,485,B0oted,False,100,UTILITY,4,1012,14538,3977,0,8016,0,0,21,8,61,0,0,157,157,126,0,0,4,0,17,1,1,11,False


# Now that we understand our process, let's make a class which will accept three arguments: 

- the path of the folder holding the source summary csv data
- the output folder for the parsed summary data
- the name of the output file

**Note it is possible to clean the summary data and save to a single file due to the relatively small filesizes. Total parsed summary data should be a maximum of 1.5GB. This will not be possible with the timeline data - but we will get to that next.**
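
For data that is too large to hold in memory, one option is to append each parsed chunk to the output file as soon as it is produced, rather than accumulating everything first. The sketch below shows that idea with the standard-library `csv` module. The helper `append_rows`, the toy chunks and the column names are all hypothetical, and this is not what the class below does.

```python
import csv
import os
import tempfile

def append_rows(path, rows, fieldnames):
    """Append rows to a CSV, writing the header only when the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

# Hypothetical chunks standing in for the parsed output of two source files
chunks = [
    [{'matchId': 'EUW1_0000000001', 'kills': 4}],
    [{'matchId': 'EUW1_0000000002', 'kills': 1}],
]

out_path = os.path.join(tempfile.mkdtemp(), 'parsed.csv')
for chunk in chunks:
    # only one chunk is held in memory at a time
    append_rows(out_path, chunk, fieldnames=['matchId', 'kills'])

# read the file back to confirm both chunks landed under a single header
with open(out_path, newline='') as handle:
    written = list(csv.DictReader(handle))

print(written)
```

The trade-off is that deduplication across files can no longer be done in memory at the end, so it would need to happen per chunk or in a later pass.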

In [473]:
import pandas as pd
import glob
import ast
import typing as t


class SummaryParser():
    
    ''' 
    
    Takes a set of summary files and outputs them to a single file in a specified location
    
    Attributes
    -----------
    
    summary_data_folder : str
        The path (relative or absolute) of the folder the data is coming from
    output_folder : str
        The path of the folder the parsed data is being sent to   
    output_filename : str
        The name of what to save the parsed data as. Note you must include .csv at the end
    
    Methods
    ----------
    
    parse_all_files(self):
        The only function necessary to call this class. All other functions are helpers.
        Takes all files from specified folder, parses them in series (storing data in memory),
        and saves the data to the specified path and name.
    
    
    '''
    
    JSON = t.Union[str, int, float, bool, None, t.Mapping[str, 'JSON'], t.List['JSON']]


    def __init__(self, summary_data_folder: str, output_folder: str, output_filename: str):
        
        self.summary_data_folder = summary_data_folder
        self.output_folder = output_folder
        
        # a bare comparison never raises, so check the extension explicitly
        if not output_filename.endswith('.csv'):
            raise ValueError(f"{output_filename} does not end with '.csv'. Please include!")
        
        self.output_filename = output_filename
        
        
    def parse_all_files(self) -> str:
        
        ''' Takes the input folder path and outputs a csv with the folder and name'''
        
        parsed_files = []
        
        for filepath in self._get_csv_paths():
            parsed_files.append(self._parse_summary_file(filepath))
            
            print(f'{filepath} parsed successfully')
        
        # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
        output = pd.concat(parsed_files, ignore_index=True)
        output.to_csv(self.output_folder + self.output_filename, index=False)
        
        # for memory management (since file is ~1.5GB)
        del output
        
        return f'File uploaded to {self.output_folder} named {self.output_filename}'
        
    def _get_csv_paths(self) -> t.List[str]:
        
        # match only csv files so stray files in the folder are not parsed
        return glob.glob(self.summary_data_folder + '*.csv')
    
    def _parse_summary_file(self, filepath: str) -> pd.DataFrame:
        
        ''' Takes a single file and returns summary data for the file in relational format'''
        
        file = pd.read_csv(filepath)
        file = file[['info', 'metadata']] # remove any spurious columns
        file.dropna(subset=['info'], inplace=True)
        file.reset_index(drop=True, inplace=True)
        file['matchId'] = [ast.literal_eval(file['metadata'][i])['matchId'] for i in range(len(file))]
        file.drop_duplicates(subset='matchId', inplace=True)
        file.drop(columns=['metadata'], inplace=True)
        
        parsed_rows = []
        
        for i, item in file.iterrows():
            
            json = ast.literal_eval(item['info'])
            json_as_df = self._parse_summary_row(json)
            json_as_df['matchId'] = item['matchId']
            parsed_rows.append(json_as_df)
            
        # DataFrame.append was removed in pandas 2.0, so concatenate the per-match frames once
        return pd.concat(parsed_rows, ignore_index=True)
            
        
    def _parse_summary_row(self, match_info: JSON) -> pd.DataFrame:

        ''' Takes a single row of the matches summary data and outputs 10 rows, 1 per player'''

        # perhaps not the most efficient function, but let's make a dataframe for every json and return that
        game_start = match_info['gameStartTimestamp']
        game_duration = match_info['gameDuration']

        features = []

        for participant in match_info['participants']:

            # copy everything except the nested 'perks' object, leaving the input untouched
            row = {key: value for key, value in participant.items() if key != 'perks'}

            row['gameStart'] = game_start
            row['gameDuration'] = game_duration
            features.append(row)

        df = pd.DataFrame(features)

        return df

In [369]:
instance = SummaryParser('./match-data/summaries/', './match-data/parsed-summaries/', 'summary_output.csv')

In [370]:
instance.parse_all_files()

./match-data/summaries/summary_6000.csv parsed successfully
./match-data/summaries/summary_13000.csv parsed successfully
./match-data/summaries/summary_25000.csv parsed successfully
./match-data/summaries/summary_46000.csv parsed successfully
./match-data/summaries/summary_37000.csv parsed successfully
./match-data/summaries/summary_29000.csv parsed successfully
./match-data/summaries/summary_11000.csv parsed successfully
./match-data/summaries/summary_35000.csv parsed successfully
./match-data/summaries/summary_48000.csv parsed successfully
./match-data/summaries/summary_39000.csv parsed successfully
./match-data/summaries/summary_44000.csv parsed successfully
./match-data/summaries/summary_27000.csv parsed successfully
./match-data/summaries/summary_4000.csv parsed successfully
./match-data/summaries/summary_8000.csv parsed successfully
./match-data/summaries/summary_40000.csv parsed successfully
./match-data/summaries/summary_23000.csv parsed successfully
./match-data/summaries/summ

'File uploaded to ./match-data/parsed-summaries/ named summary_output.csv'

In [375]:
# let's read in that data and see what it looks like!
# The good news is that the file is just 274 MB - easy to work with!

with pd.option_context('display.max_rows', 10, 'display.max_columns', None):
    display(pd.read_csv('./match-data/parsed-summaries/summary_output.csv').head())

Unnamed: 0,assists,baronKills,bountyLevel,champExperience,champLevel,championId,championName,championTransform,consumablesPurchased,damageDealtToBuildings,damageDealtToObjectives,damageDealtToTurrets,damageSelfMitigated,deaths,detectorWardsPlaced,doubleKills,dragonKills,firstBloodAssist,firstBloodKill,firstTowerAssist,firstTowerKill,gameDuration,gameEndedInEarlySurrender,gameEndedInSurrender,gameStart,goldEarned,goldSpent,individualPosition,inhibitorKills,inhibitorTakedowns,inhibitorsLost,item0,item1,item2,item3,item4,item5,item6,itemsPurchased,killingSprees,kills,lane,largestCriticalStrike,largestKillingSpree,largestMultiKill,longestTimeSpentLiving,magicDamageDealt,magicDamageDealtToChampions,magicDamageTaken,matchId,neutralMinionsKilled,nexusKills,nexusLost,nexusTakedowns,objectivesStolen,objectivesStolenAssists,participantId,pentaKills,physicalDamageDealt,physicalDamageDealtToChampions,physicalDamageTaken,profileIcon,puuid,quadraKills,riotIdName,riotIdTagline,role,sightWardsBoughtInGame,spell1Casts,spell2Casts,spell3Casts,spell4Casts,summoner1Casts,summoner1Id,summoner2Casts,summoner2Id,summonerId,summonerLevel,summonerName,teamEarlySurrendered,teamId,teamPosition,timeCCingOthers,timePlayed,totalDamageDealt,totalDamageDealtToChampions,totalDamageShieldedOnTeammates,totalDamageTaken,totalHeal,totalHealsOnTeammates,totalMinionsKilled,totalTimeCCDealt,totalTimeSpentDead,totalUnitsHealed,tripleKills,trueDamageDealt,trueDamageDealtToChampions,trueDamageTaken,turretKills,turretTakedowns,turretsLost,unrealKills,visionScore,visionWardsBoughtInGame,wardsKilled,wardsPlaced,win
0,0,0,0,7766,11,86,Garen,0,2,817.0,817,817,11881,3,0,0,0,False,False,False,False,1061,False,True,1642620163907,5567,5100,TOP,0,0.0,0.0,6631,1054,0,3006,0,0,3340,13,1,2,NONE,0,2,1,438,0,0,4532,EUW1_5677892892,0,0,0.0,0.0,0,0,1,0,45056,4882,8047,5212,LbNsNJcAUBOpDuV9vXj0OZJaZdl1KCn7-cqM9854d-ryhZ...,0,,,SUPPORT,0,40,18,35,2,1,12,1,4,j4QBUANVrv1WpTrg_VTGYqMx-wY4O_TUbFcvSNo4DB_DBgo,431,sandyio,False,100,TOP,10,1061,48589,5695,0,12580,361,0,112,61,81,1,0,3533,813,0,0,0.0,4.0,0,8,0,0,6,False
1,3,0,0,7691,11,120,Hecarim,0,3,0.0,5026,0,10093,3,3,0,0,False,False,False,False,1061,False,True,1642620163907,6358,5975,JUNGLE,0,0.0,0.0,3133,3070,6664,3158,0,0,3364,17,0,2,NONE,138,0,1,405,23022,1113,6451,EUW1_5677892892,106,0,0.0,0.0,0,0,2,0,67796,4293,9953,4630,82nYwIs5IY0L_-yS5a4q73E8IlUerZXr3hFcBTZR3OqFpM...,0,,,SUPPORT,0,172,40,31,3,4,6,11,11,jZb9aNNy07A_APJv3iN2WStWxnbP7yacihMeNDvwuCZu3aze,326,why do i pIay,False,100,JUNGLE,9,1061,96429,6169,0,16563,8004,0,16,125,71,1,0,5611,763,158,0,0.0,4.0,0,17,3,3,4,False
2,0,0,0,6429,10,45,Veigar,0,1,0.0,0,0,3238,8,0,0,0,False,False,False,False,1061,False,True,1642620163907,4784,4675,MIDDLE,0,0.0,0.0,1082,6656,2033,2055,3158,0,3340,11,0,1,NONE,0,0,1,269,33859,3475,7406,EUW1_5677892892,0,0,0.0,0.0,0,0,3,0,5587,1311,2779,1391,nUIXB26-mpQpeOHuQ46jtkXpyNrYSmaLX0zEZ25nkSoj2f...,0,,,SUPPORT,0,65,36,11,3,3,4,2,12,-rG6rBK2ejhceTs9PqKI3cyl8Aym8_Lpj0DWsNvBm1vca8c,245,Korvish,False,100,MIDDLE,1,1061,39447,4786,0,10210,1400,0,101,31,135,1,0,0,0,24,0,0.0,4.0,0,4,1,0,5,False
3,2,0,0,5948,9,222,Jinx,0,8,2328.0,3163,2328,2358,4,4,0,0,False,False,False,False,1061,False,True,1642620163907,6360,6200,BOTTOM,0,0.0,0.0,2031,0,6672,1036,1036,3006,3363,21,0,1,NONE,323,0,1,706,146,146,5938,EUW1_5677892892,2,0,0.0,0.0,0,0,4,0,57269,4161,2353,7,bEyq9XCl1X8mzMSga8pgG12YwxbgYkN7UOXjD_RAP325vM...,0,,,DUO,0,74,31,9,3,3,4,3,7,FouVwcASV7gKCnZkg6ht77N9RPzE_4a9QuHTqqdr8u6Ni4...,45,XBredi Jr,False,100,BOTTOM,5,1061,58840,4408,0,8292,801,345,135,21,115,3,0,1424,100,0,1,1.0,4.0,0,15,4,2,9,False
4,3,0,0,5016,8,26,Zilean,0,5,993.0,1208,993,1847,3,2,0,0,False,False,True,False,1061,False,True,1642620163907,4283,3525,UTILITY,0,0.0,0.0,2003,3851,2055,2422,4644,0,3364,12,0,0,NONE,0,0,0,490,11281,2637,2742,EUW1_5677892892,0,0,0.0,0.0,0,0,5,0,1763,349,2932,22,r-GyqrnH_abnEg2ZbQHHnfiQZwTFt5jaRWuCeFi7fEffNi...,0,,,SUPPORT,0,78,30,45,2,2,4,1,14,zUE6QEPd5O3078NC82VnWj1HC7GTzQZXZEZy9I_nd-hf420,373,plusma,False,100,UTILITY,13,1061,13196,3139,0,5675,744,0,11,87,53,1,0,152,152,0,0,1.0,4.0,0,14,3,2,8,False


## Timeline data Parsing

In [281]:
import pandas as pd 
import json
import ast

import requests
import time

In [323]:
# each of these files is 1.1GB, so we will have to parse them individually (and maybe send them to a sql db)

timeline = pd.read_csv('./match-data/timelines/timeline_0.csv')

In [325]:
timeline.dropna(subset=['info'], inplace=True)
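# use apply rather than positional lookups: after dropna the index may have gaps, so timeline['metadata'][i] could raise a KeyError
timeline['matchId'] = timeline['metadata'].apply(lambda m: ast.literal_eval(m)['matchId'])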
timeline.drop(columns='metadata', inplace=True)
timeline.drop_duplicates(subset='matchId', keep='last', inplace=True)
timeline.reset_index(drop=True, inplace=True)

timeline.head(5)

Unnamed: 0,info,matchId
0,"{'frameInterval': 60000, 'frames': [{'events':...",EUW1_5536806773
1,"{'frameInterval': 60000, 'frames': [{'events':...",EUW1_5680327054
2,"{'frameInterval': 60000, 'frames': [{'events':...",EUW1_5608364901
3,"{'frameInterval': 60000, 'frames': [{'events':...",EUW1_5677694404
4,"{'frameInterval': 60000, 'frames': [{'events':...",EUW1_5690444007


In [328]:
# check to see if we can identify the participant... it appears we can! We will need to keep this as a key

ast.literal_eval(timeline['info'][0])['participants']#['frames'][0]['participantFrames']['1']

[{'participantId': 1,
  'puuid': '69caSlX99d_QHBv7f9vATk_-czAJIaPHSuGzMyawAx7bGtDShXKVA2neaZ-8p37jmHzXeIpblam1Fw'},
 {'participantId': 2,
  'puuid': 'cL-Ls-gbd7o7nkbxGNy8Yv-YLsdMWdWeJ-a2YgQfIO2XNcaG_nHrbttxLJQC8aNULK0bCoOYEDbYYg'},
 {'participantId': 3,
  'puuid': 'aQcWEnior3rZabWuMn431Orus8TLnM9aHLt5OLgA5e9s38dwWNMSBXg_enauCIlGjVshWualeNeOxg'},
 {'participantId': 4,
  'puuid': 'MYP92oVmh6hXwi9iVP0ZpVQwLd7Eqlv5jfWw4TdKA1mXjap1GLO6pmS7F7bG9tu6LQ96l_8_6Zrhdg'},
 {'participantId': 5,
  'puuid': 'jMGvN-kPE8EWflrO7annfBkosYKpS3RPlAcWgbc2CZlCIH_xuqtwv3pOPnwf7d3xOsGvTXUfJDSxGA'},
 {'participantId': 6,
  'puuid': '1CgTb2q5yfrvEtmAGjx2e2qgewpyRS9PNTyQNHiW2dQP-07Y4EeUz4MLfJIpOHGw7kwmWiymJsKRcw'},
 {'participantId': 7,
  'puuid': 'jNS7id02ru9-LxObWiNqsE9Jq0CPP9NwPAzC0JK7EWtWQMvcBGbJnuNRjS_UpNzb3Ng5KujWjk3wiA'},
 {'participantId': 8,
  'puuid': '1QrwjZglk1BtcqF_8YmnYMZ7soDf5c771w5Z6LWkY8xYkT9-Ab5iuq3_VLav98tQ-98bPesl_Vvh_w'},
 {'participantId': 9,
  'puuid': 'TA2fyQvprsjy_W_Vecr4Hw33ZOWq6hr-nUUUAe

In [342]:
# 18 frames for 18 minutes of the game. 
len(ast.literal_eval(timeline['info'][0])['frames'])#[0]['participantFrames']['1']

18

In [344]:
# Let's look at minute 10
ast.literal_eval(timeline['info'][0])['frames'][10]

{'events': [{'itemId': 1042,
   'participantId': 4,
   'timestamp': 542435,
   'type': 'ITEM_PURCHASED'},
  {'killerId': 0,
   'laneType': 'BOT_LANE',
   'position': {'x': 10504, 'y': 1029},
   'teamId': 100,
   'timestamp': 547190,
   'type': 'TURRET_PLATE_DESTROYED'},
  {'itemId': 1036,
   'participantId': 7,
   'timestamp': 551450,
   'type': 'ITEM_DESTROYED'},
  {'itemId': 3044,
   'participantId': 7,
   'timestamp': 551450,
   'type': 'ITEM_PURCHASED'},
  {'itemId': 2055,
   'participantId': 7,
   'timestamp': 553399,
   'type': 'ITEM_PURCHASED'},
  {'itemId': 2055,
   'participantId': 7,
   'timestamp': 553531,
   'type': 'ITEM_PURCHASED'},
  {'itemId': 1029,
   'participantId': 7,
   'timestamp': 555612,
   'type': 'ITEM_PURCHASED'},
  {'creatorId': 6,
   'timestamp': 559443,
   'type': 'WARD_PLACED',
   'wardType': 'YELLOW_TRINKET'},
  {'bounty': 175,
   'killStreakLength': 1,
   'killerId': 8,
   'position': {'x': 6082, 'y': 6307},
   'timestamp': 560500,
   'type': 'CHAMPION_

## This is a very detailed dataset! It describes the game unfolding, grouped into minutes. Each minute is called a 'frame' in this large JSON.

### From what I can tell, there are two main kinds of data in here:

**Events**

A number of different JSON objects are nested within the field 'events'. An event can happen at any time; each one carries a timestamp and is grouped into the next frame. These cover:

- Item purchased/destroyed by a player (assume destroyed means sold or used)
- Kills/Deaths with a breakdown of: what damage was dealt to the player to cause their death, where they died, who killed them and the id of the killed player
- Placing down wards (a ward is an in-game item granting vision of the map)
- Players levelling up in-game, when they levelled up
- Destroying a tower (where, when, which team)
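
The "grouped into the next frame" behaviour can be sketched as a ceiling division of the event timestamp by the frame interval. This is inferred only from the sample shown above (an event at 542435 ms sits in `frames[10]`, and the event at 0 ms sits in `frames[0]`), not from any API documentation. The helper name `frame_index` is hypothetical, and how an event landing exactly on a frame boundary is handled is an assumption.

```python
def frame_index(timestamp_ms: int, frame_interval_ms: int = 60000) -> int:
    """Index of the frame an event is grouped into (ceiling division, assumed)."""
    # -(-a // b) is integer ceiling division for non-negative a and positive b
    return -(-timestamp_ms // frame_interval_ms)

# timestamps taken from the sample outputs in this notebook
print(frame_index(0))       # PAUSE_END event at the start of the game
print(frame_index(163872))  # first champion kill
print(frame_index(542435))  # item purchase seen in frames[10]
```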

**participantFrames**

Every frame contains a section called participantFrames. This is a snapshot of the current status of each of the players in-game. It includes statistics that are similar, but not identical, to those in the summary data.

### Knowing this, how can we turn this into a useful relational-like structure, without using thousands of columns?

We want to extract information which may give insights in a way the summary dataset cannot give, while also keeping the overall dimensionality of the extracted dataset as low as possible.

The question is, if I wanted to differentiate between a professional and an amateur player, what information would I want to use?

- The participantFrames section is similar to the summaries data, though with a more granular breakdown. To keep complexity down, let us not use this information.
- Events seem more interesting: they describe what is happening in a way the summaries data does not. But how can we use this information intelligently?

Let's try to come up with a list of things that would (intuitively) be useful for clustering/supervised classification and that do not appear, in any way, in the summaries data.

- Where/when players are killed, and where/when they kill other players
- Where/when towers are destroyed and who by

If we limit ourselves to just these two types of events, we can create a small relational table with only a few column headers (e.g. player, position, time, event_type, matchId). Considering how to join this with the summaries data (and the fact we have just 20,000 timelines and 50,000 summaries) will be a challenge to tackle in the modelling step!
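
One way to pin down such a table is to write the row shape out as a small typed record. The sketch below is only illustrative: the class `EventRow` and its field names are hypothetical, and the position is split into separate `x` and `y` columns as an assumption. The example event is the first-blood event and match ID shown in this notebook's outputs.

```python
from typing import NamedTuple

class EventRow(NamedTuple):
    """One row of the reduced events table (hypothetical column names)."""
    match_id: str
    player: int
    x: int
    y: int
    timestamp: int
    event_type: str

# raw event as it appears in the timeline JSON
raw_event = {
    'killType': 'KILL_FIRST_BLOOD',
    'killerId': 1,
    'position': {'x': 3479, 'y': 12917},
    'timestamp': 163872,
    'type': 'CHAMPION_SPECIAL_KILL',
}

# flatten the nested event into the fixed set of columns
row = EventRow(
    match_id='EUW1_5536806773',
    player=raw_event['killerId'],
    x=raw_event['position']['x'],
    y=raw_event['position']['y'],
    timestamp=raw_event['timestamp'],
    event_type=raw_event['type'],
)

print(row)
```

Fields that only exist for some event types, such as `killType`, are dropped so that every row has the same columns.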

In [372]:
# steps:

# take each row and: generate a df of events in a standardised format. 
# Add in matchId and puuid (some joining will need to be done)
# Match puuid to summonerId and Player Name

# Parse for one file
# Parse for entire dataset

In [None]:
# let's do some data exploration to figure out how we'd build this parser

In [373]:
# look at the full info entry (the output below starts at 'frameInterval', so it is the whole dict)
ast.literal_eval(timeline['info'][0])

{'frameInterval': 60000,
 'frames': [{'events': [{'realTimestamp': 1635959013095,
     'timestamp': 0,
     'type': 'PAUSE_END'}],
   'participantFrames': {'1': {'championStats': {'abilityHaste': 0,
      'abilityPower': 0,
      'armor': 30,
      'armorPen': 0,
      'armorPenPercent': 0,
      'attackDamage': 25,
      'attackSpeed': 100,
      'bonusArmorPenPercent': 0,
      'bonusMagicPenPercent': 0,
      'ccReduction': 0,
      'cooldownReduction': 0,
      'health': 490,
      'healthMax': 490,
      'healthRegen': 0,
      'lifesteal': 0,
      'magicPen': 0,
      'magicPenPercent': 0,
      'magicResist': 32,
      'movementSpeed': 345,
      'omnivamp': 0,
      'physicalVamp': 0,
      'power': 115,
      'powerMax': 115,
      'powerRegen': 0,
      'spellVamp': 0},
     'currentGold': 500,
     'damageStats': {'magicDamageDone': 0,
      'magicDamageDoneToChampions': 0,
      'magicDamageTaken': 0,
      'physicalDamageDone': 0,
      'physicalDamageDoneToChampions': 0,

In [377]:
info, matchId = timeline['info'][0], timeline['matchId'][0]

In [381]:
participants = ast.literal_eval(info)['participants']
frames = ast.literal_eval(info)['frames']

In [397]:
# let's get a list of all event types in a game.

types = set()

for frame in frames:
    for event in frame['events']:
        types.add(event['type'])

In [398]:
types

{'BUILDING_KILL',
 'CHAMPION_KILL',
 'CHAMPION_SPECIAL_KILL',
 'ELITE_MONSTER_KILL',
 'GAME_END',
 'ITEM_DESTROYED',
 'ITEM_PURCHASED',
 'ITEM_UNDO',
 'LEVEL_UP',
 'PAUSE_END',
 'SKILL_LEVEL_UP',
 'TURRET_PLATE_DESTROYED',
 'WARD_KILL',
 'WARD_PLACED'}
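
Beyond the set of distinct types, it can help to know how often each type occurs before deciding what to keep. A minimal sketch with `collections.Counter` follows. The two-frame `toy_frames` list is a hypothetical stand-in for the real `frames` object, containing only the `type` field.

```python
from collections import Counter

# Hypothetical two-frame stand-in for the `frames` list
toy_frames = [
    {'events': [{'type': 'PAUSE_END'}]},
    {'events': [
        {'type': 'ITEM_PURCHASED'},
        {'type': 'ITEM_PURCHASED'},
        {'type': 'CHAMPION_KILL'},
    ]},
]

# tally every event type across all frames
type_counts = Counter(
    event['type'] for frame in toy_frames for event in frame['events']
)
most_common_type, most_common_count = type_counts.most_common(1)[0]

print(type_counts)
```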

In [400]:
# We want to keep only: CHAMPION_KILL, BUILDING_KILL

# what is a champion special kill?


for i in range(len(frames)):
    for j in range(len(frames[i]['events'])):
        if frames[i]['events'][j]['type'] =='CHAMPION_SPECIAL_KILL':
            print(frames[i]['events'][j])
            
# ['killerId'], ['position']['x'], ['position']['y'], ['timestamp'], ['type']. No victimId!

{'killType': 'KILL_FIRST_BLOOD', 'killerId': 1, 'position': {'x': 3479, 'y': 12917}, 'timestamp': 163872, 'type': 'CHAMPION_SPECIAL_KILL'}


In [484]:
# it's the first kill of the game. We will want to keep this too.

events = []

for i in range(len(frames)): #len(frames)
    for j in range(len(frames[i]['events'])):
        if frames[i]['events'][j]['type'] in ['TURRET_PLATE_DESTROYED']:
            events.append(frames[i]['events'][j])

            
events

# ['killerId'], ['position']['x'], ['position']['y'], ['timestamp'], ['type']

[{'killerId': 0,
  'laneType': 'BOT_LANE',
  'position': {'x': 10504, 'y': 1029},
  'teamId': 100,
  'timestamp': 357368,
  'type': 'TURRET_PLATE_DESTROYED'},
 {'killerId': 0,
  'laneType': 'MID_LANE',
  'position': {'x': 8955, 'y': 8510},
  'teamId': 200,
  'timestamp': 465504,
  'type': 'TURRET_PLATE_DESTROYED'},
 {'killerId': 0,
  'laneType': 'BOT_LANE',
  'position': {'x': 10504, 'y': 1029},
  'teamId': 100,
  'timestamp': 483343,
  'type': 'TURRET_PLATE_DESTROYED'},
 {'killerId': 0,
  'laneType': 'MID_LANE',
  'position': {'x': 5846, 'y': 6396},
  'teamId': 100,
  'timestamp': 526049,
  'type': 'TURRET_PLATE_DESTROYED'},
 {'killerId': 0,
  'laneType': 'MID_LANE',
  'position': {'x': 5846, 'y': 6396},
  'teamId': 100,
  'timestamp': 533612,
  'type': 'TURRET_PLATE_DESTROYED'},
 {'killerId': 0,
  'laneType': 'BOT_LANE',
  'position': {'x': 10504, 'y': 1029},
  'teamId': 100,
  'timestamp': 547190,
  'type': 'TURRET_PLATE_DESTROYED'},
 {'killerId': 0,
  'laneType': 'TOP_LANE',
  'pos

In [421]:
kills = []

for i in range(4): #len(frames)
    for j in range(len(frames[i]['events'])):
        if frames[i]['events'][j]['type'] in ['CHAMPION_KILL']:
            kills.append(frames[i]['events'][j])

# these are the features for a kill
# ['killerId'], ['position']['x'], ['position']['y'], ['timestamp'], ['type'], ['victimId']

print(kills[0]['killerId'])
print(kills[0]['position']['x'])
print(kills[0]['position']['y'])
print(kills[0]['timestamp'])
print(kills[0]['type'])
print(kills[0]['victimId'])

# top-level keys to keep; 'position' is a nested dict holding 'x' and 'y'
features = ['killerId', 'position', 'timestamp', 'type', 'victimId']

{key: value for key, value in kills[0].items() if key in features}

1
3773
12987
163872
CHAMPION_KILL
6


In [None]:
# We will need to turn a single champion kill into two rows, with a new type called 'CHAMPION_DEATH' 
# so that we can have an entirely consistent schema across the entire timeline dataset.

In [430]:
for i in range(len(frames)):
    
    for j in range(len(frames[i]['events'])):
        
        if frames[i]['events'][j]['type'] in ['CHAMPION_SPECIAL_KILL', 'BUILDING_KILL']:
            reduced_event = {}



In [444]:
# this is an ugly way of standardising the data - but since this only has to be done once it saves time!

reduced_events = []

for frame in frames:
    for event in frame['events']:
        
        if event['type'] in ['CHAMPION_SPECIAL_KILL', 'BUILDING_KILL']:
            
            temp_event = {}
            temp_event['playerId'] = event['killerId']
            temp_event['x'] = event['position']['x']
            temp_event['y'] = event['position']['y']
            temp_event['timestamp'] = event['timestamp']
            temp_event['type'] = event['type']
            
            reduced_events.append(temp_event)
            
        if event['type'] == 'CHAMPION_KILL':
            
            temp_events = []
            temp_event_1 = {}
            temp_event_2 = {}
            
            temp_event_1['playerId'] = event['killerId']
            temp_event_1['x'] = event['position']['x']
            temp_event_1['y'] = event['position']['y']
            temp_event_1['timestamp'] = event['timestamp']
            temp_event_1['type'] = event['type']
            temp_events.append(temp_event_1)
            
            temp_event_2['playerId'] = event['victimId']
            temp_event_2['x'] = event['position']['x']
            temp_event_2['y'] = event['position']['y']
            temp_event_2['timestamp'] = event['timestamp']
            temp_event_2['type'] = 'CHAMPION_DEATH'
            temp_events.append(temp_event_2)
            
            reduced_events = reduced_events + temp_events
        
pd.DataFrame(reduced_events)
        

Unnamed: 0,playerId,timestamp,type,x,y
0,1,163872,CHAMPION_KILL,3773,12987
1,6,163872,CHAMPION_DEATH,3773,12987
2,1,163872,CHAMPION_SPECIAL_KILL,3479,12917
3,8,165755,CHAMPION_KILL,7571,7354
4,3,165755,CHAMPION_DEATH,7571,7354
5,7,193932,CHAMPION_KILL,6645,7531
6,2,193932,CHAMPION_DEATH,6645,7531
7,9,272604,CHAMPION_KILL,12880,2158
8,5,272604,CHAMPION_DEATH,12880,2158
9,7,304676,CHAMPION_KILL,6594,7248
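
The two branches above repeat the same five assignments, so the standardisation can also be written around one small helper. This is a minimal sketch rather than the code used in this notebook: `flatten_event` and `standardise_events` are hypothetical names, and the sample frame is made up to mirror the event structure shown earlier.

```python
SINGLE_ROW_TYPES = {'CHAMPION_SPECIAL_KILL', 'BUILDING_KILL'}


def flatten_event(event: dict, player_key: str, event_type: str) -> dict:
    """Reduce a raw timeline event to the five standard columns."""
    return {
        'playerId': event[player_key],
        'x': event['position']['x'],
        'y': event['position']['y'],
        'timestamp': event['timestamp'],
        'type': event_type,
    }


def standardise_events(frames: list) -> list:
    """Flatten every relevant event; a CHAMPION_KILL yields a kill row and a death row."""
    rows = []
    for frame in frames:
        for event in frame['events']:
            if event['type'] in SINGLE_ROW_TYPES:
                rows.append(flatten_event(event, 'killerId', event['type']))
            elif event['type'] == 'CHAMPION_KILL':
                rows.append(flatten_event(event, 'killerId', 'CHAMPION_KILL'))
                rows.append(flatten_event(event, 'victimId', 'CHAMPION_DEATH'))
    return rows


# made-up frame with the same shape as the real timeline data
sample_frames = [{'events': [
    {'type': 'CHAMPION_KILL', 'killerId': 1, 'victimId': 6,
     'position': {'x': 3773, 'y': 12987}, 'timestamp': 163872},
    {'type': 'ITEM_PURCHASED', 'participantId': 2, 'timestamp': 170000},
]}]

standardised = standardise_events(sample_frames)
print(standardised)
```

Passing the returned rows to `pd.DataFrame` should give a table with the same columns as the loop above.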


In [486]:
# great now we just need to match the playerId to the puuid, add in the matchId and we're done!
# Let's collect some of the previous steps and turn into a function

def parse_timeline_row(info: str, matchId: str) -> pd.DataFrame:
    
    info_json = ast.literal_eval(info)
    
    reduced_events = []

    for frame in info_json['frames']:
        for event in frame['events']:

            if event['type'] in ['CHAMPION_SPECIAL_KILL', 'BUILDING_KILL', 'TURRET_PLATE_DESTROYED']:

                temp_event = {}
                temp_event['playerId'] = event['killerId']
                temp_event['x'] = event['position']['x']
                temp_event['y'] = event['position']['y']
                temp_event['timestamp'] = event['timestamp']
                temp_event['type'] = event['type']

                reduced_events.append(temp_event)

            if event['type'] == 'CHAMPION_KILL':

                temp_events = []
                temp_event_1 = {}
                temp_event_2 = {}

                temp_event_1['playerId'] = event['killerId']
                temp_event_1['x'] = event['position']['x']
                temp_event_1['y'] = event['position']['y']
                temp_event_1['timestamp'] = event['timestamp']
                temp_event_1['type'] = event['type']
                temp_events.append(temp_event_1)

                temp_event_2['playerId'] = event['victimId']
                temp_event_2['x'] = event['position']['x']
                temp_event_2['y'] = event['position']['y']
                temp_event_2['timestamp'] = event['timestamp']
                temp_event_2['type'] = 'CHAMPION_DEATH'
                temp_events.append(temp_event_2)

                reduced_events = reduced_events + temp_events

    reduced_events_df = pd.DataFrame(reduced_events)
    reduced_events_df['matchId'] = matchId
    reduced_events_df = reduced_events_df.loc[reduced_events_df['playerId'] != 0]
    reduced_events_df.reset_index(drop=True, inplace=True)
    
    participants_df = pd.DataFrame(info_json['participants'])
    
    final = pd.merge(reduced_events_df, participants_df, 
             how='left', left_on='playerId', right_on='participantId')
    
    final.drop(columns=['playerId', 'participantId'], inplace=True)
    
    return final

In [488]:
parse_timeline_row(timeline.iloc[0,0], timeline.iloc[0,1]).head(5)

Unnamed: 0,timestamp,type,x,y,matchId,puuid
0,163872,CHAMPION_KILL,3773,12987,EUW1_5536806773,69caSlX99d_QHBv7f9vATk_-czAJIaPHSuGzMyawAx7bGt...
1,163872,CHAMPION_DEATH,3773,12987,EUW1_5536806773,1CgTb2q5yfrvEtmAGjx2e2qgewpyRS9PNTyQNHiW2dQP-0...
2,163872,CHAMPION_SPECIAL_KILL,3479,12917,EUW1_5536806773,69caSlX99d_QHBv7f9vATk_-czAJIaPHSuGzMyawAx7bGt...
3,165755,CHAMPION_KILL,7571,7354,EUW1_5536806773,1QrwjZglk1BtcqF_8YmnYMZ7soDf5c771w5Z6LWkY8xYkT...
4,165755,CHAMPION_DEATH,7571,7354,EUW1_5536806773,aQcWEnior3rZabWuMn431Orus8TLnM9aHLt5OLgA5e9s38...


### Great - now we can do the same as before and build a class to parse all the timeline data and write it to a single file.

There is a lot of overlap with the SummaryParser so let's inherit from that class to re-use the functions we've already made
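
To make clear what the subclass gets for free, here is a minimal sketch of the inheritance pattern. It assumes the parent class stores three constructor arguments and exposes a `_get_csv_paths` helper, which is inferred from how `TimelineParser` is used below; `BaseParserSketch`, `TimelineParserSketch`, `data_folder` and `describe` are hypothetical names, not the notebook's actual classes.

```python
import glob
import os
import tempfile


class BaseParserSketch:
    """Hypothetical stand-in for the parent parser: only the parts a subclass reuses."""

    def __init__(self, data_folder: str, output_folder: str, output_filename: str):
        self.data_folder = data_folder
        self.output_folder = output_folder
        self.output_filename = output_filename

    def _get_csv_paths(self) -> list:
        # every csv file in the input folder, in a stable order
        return sorted(glob.glob(os.path.join(self.data_folder, '*.csv')))


class TimelineParserSketch(BaseParserSketch):
    """Inherits the constructor and the path helper; adds only its own logic."""

    def describe(self) -> str:
        n_files = len(self._get_csv_paths())
        return f'{n_files} files -> {self.output_folder}{self.output_filename}'


# demo on a throwaway folder containing two empty csv files
with tempfile.TemporaryDirectory() as folder:
    for name in ('timeline_0.csv', 'timeline_1000.csv'):
        open(os.path.join(folder, name), 'w').close()
    summary_line = TimelineParserSketch(folder, './out/', 'timeline_final.csv').describe()

print(summary_line)
```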

In [512]:
import pandas as pd
import glob
import ast
import typing as t
import warnings
import gc


class TimelineParser(SummaryParser):
    
    ''' 
    
    Takes a set of timeline files and outputs them to a single file in a specified location
    
    Attributes
    -----------
    
    timeline_data_folder : str
        The path (relative or absolute) of the folder the data is coming from
    output_folder : str
        The path of the folder the parsed data is being sent to   
    output_filename : str
        The name of what to save the parsed data as. Note you must include .csv at the end
    
    Methods
    ----------
    
    parse_timeline_files(self):
        The only method you need to call on this class. All other methods are helpers.
        Takes all files from specified folder, parses them in series (storing data in memory),
        and saves the data to the specified path and name.
    
    
    '''
    
    def parse_timeline_files(self) -> str:
        
        ''' Takes the input folder path and outputs a csv with the folder and name'''
        
        parsed_files = []
        
        for filepath in self._get_csv_paths():
            parsed_files.append(self._parse_timeline_file(filepath))
            
            print(f'{filepath} parsed successfully')
        
        # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
        output = pd.concat(parsed_files, ignore_index=True) if parsed_files else pd.DataFrame({})
        
        output.to_csv(self.output_folder + self.output_filename, index=False)
        
        return f'File uploaded to {self.output_folder} named {self.output_filename}'
    
    def _parse_timeline_file(self, filepath: str) -> pd.DataFrame:
        
        ''' Takes a single file and returns timeline data for the file in relational format'''
        
        file = pd.read_csv(filepath)
        file = file[['info', 'metadata']] # remove any spurious columns
        file.dropna(subset=['info'], inplace=True)
        file.reset_index(drop=True, inplace=True)
        file['matchId'] = [ast.literal_eval(file['metadata'][i])['matchId'] for i in range(len(file))]
        file.drop_duplicates(subset='matchId', keep='last', inplace=True)
        file.drop(columns=['metadata'], inplace=True)
        
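        parsed_rows = []
        
        for i, item in file.iterrows():
            
            try:
                parsed_rows.append(self._parse_timeline_row(i, item['info'], item['matchId']))
            except Exception as err:
                # skip rows that fail to parse, but report them instead of failing silently
                warnings.warn(f'Row {i} of {filepath} could not be parsed: {err}')
            
            
        # memory management - only one 1GB file at a time!
        del file
        gc.collect()
        
        if not parsed_rows:
            return pd.DataFrame({})
            
        # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
        return pd.concat(parsed_rows, ignore_index=True)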
    
    
    
    def _parse_timeline_row(self, i: int, info: str, matchId: str) -> pd.DataFrame:

        info_json = ast.literal_eval(info)

        reduced_events = []

        for frame in info_json['frames']:
            for event in frame['events']:

                if event['type'] in ['CHAMPION_SPECIAL_KILL', 'BUILDING_KILL', 'TURRET_PLATE_DESTROYED']:

                    temp_event = {}
                    temp_event['playerId'] = event['killerId']
                    temp_event['x'] = event['position']['x']
                    temp_event['y'] = event['position']['y']
                    temp_event['timestamp'] = event['timestamp']
                    temp_event['type'] = event['type']

                    reduced_events.append(temp_event)

                if event['type'] == 'CHAMPION_KILL':

                    temp_events = []
                    temp_event_1 = {}
                    temp_event_2 = {}

                    temp_event_1['playerId'] = event['killerId']
                    temp_event_1['x'] = event['position']['x']
                    temp_event_1['y'] = event['position']['y']
                    temp_event_1['timestamp'] = event['timestamp']
                    temp_event_1['type'] = event['type']
                    temp_events.append(temp_event_1)

                    temp_event_2['playerId'] = event['victimId']
                    temp_event_2['x'] = event['position']['x']
                    temp_event_2['y'] = event['position']['y']
                    temp_event_2['timestamp'] = event['timestamp']
                    temp_event_2['type'] = 'CHAMPION_DEATH'
                    temp_events.append(temp_event_2)

                    reduced_events = reduced_events + temp_events

        reduced_events_df = pd.DataFrame(reduced_events)
        reduced_events_df['matchId'] = matchId
        reduced_events_df = reduced_events_df.loc[reduced_events_df['playerId'] != 0]
        reduced_events_df.reset_index(drop=True, inplace=True)
        
   
        try:
        
            participants_df = pd.DataFrame(info_json['participants'])
            final = pd.merge(reduced_events_df, participants_df, 
                     how='left', left_on='playerId', right_on='participantId')
            final.drop(columns=['playerId', 'participantId'], inplace=True)
            
            return final
        
        except:
            warnings.warn(f"Join didn't work - no events in game. Returning empty dataframe for row {i}")
            
            return pd.DataFrame({})
    
        
     

In [513]:
instance = TimelineParser('./match-data/timelines/', './match-data/parsed-timelines/', 'timeline_final.csv')

In [514]:
instance.parse_timeline_files()



./match-data/timelines/timeline_12000.csv parsed successfully
./match-data/timelines/timeline_9000.csv parsed successfully
./match-data/timelines/timeline_5000.csv parsed successfully
./match-data/timelines/timeline_7000.csv parsed successfully
./match-data/timelines/timeline_10000.csv parsed successfully
./match-data/timelines/timeline_3000.csv parsed successfully
./match-data/timelines/timeline_18000.csv parsed successfully
./match-data/timelines/timeline_14000.csv parsed successfully
./match-data/timelines/timeline_16000.csv parsed successfully
./match-data/timelines/timeline_0.csv parsed successfully
./match-data/timelines/timeline_1000.csv parsed successfully
./match-data/timelines/timeline_8000.csv parsed successfully
./match-data/timelines/timeline_4000.csv parsed successfully
./match-data/timelines/timeline_13000.csv parsed successfully
./match-data/timelines/timeline_11000.csv parsed successfully
./match-data/timelines/timeline_6000.csv parsed successfully
./match-data/timelines/timeline_15000.csv parsed successfully
./match-data/timelines/timeline_2000.csv parsed successfully
./match-data/timelines/timeline_17000.csv parsed successfully


'File uploaded to ./match-data/parsed-timelines/ named timeline_final.csv'
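
The parser above keeps every parsed file in memory until the final write. If that ever becomes a problem, each parsed file can instead be appended to the output as soon as it is ready. The following is a rough sketch of that idea using only the standard library `csv` module; `append_rows` and the two batches are hypothetical, made-up stand-ins for parsed timeline files.

```python
import csv
import os
import tempfile

COLUMNS = ['timestamp', 'type', 'x', 'y', 'matchId', 'puuid']


def append_rows(output_path: str, rows: list) -> None:
    """Append parsed rows to output_path, writing the header only for a new file."""
    is_new_file = not os.path.exists(output_path)
    with open(output_path, 'a', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if is_new_file:
            writer.writeheader()
        writer.writerows(rows)


# two made-up batches standing in for two parsed timeline files
batch_1 = [{'timestamp': 163872, 'type': 'CHAMPION_KILL', 'x': 3773, 'y': 12987,
            'matchId': 'MATCH_A', 'puuid': 'player-1'}]
batch_2 = [{'timestamp': 165755, 'type': 'CHAMPION_DEATH', 'x': 7571, 'y': 7354,
            'matchId': 'MATCH_B', 'puuid': 'player-2'}]

with tempfile.TemporaryDirectory() as folder:
    output_path = os.path.join(folder, 'timeline_final.csv')
    for batch in (batch_1, batch_2):
        append_rows(output_path, batch)  # only one batch is held in memory at a time
    with open(output_path, newline='') as handle:
        written = list(csv.DictReader(handle))

print(len(written))
```

With pandas the same effect should be achievable by calling `to_csv` with `mode='a'` and writing the header only for the first file.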

# We need to add a flag to our datasets to define whether a player is a professional or not.


I have collected a list of 100 professional player account names (found on the internet) and saved them to a file named 'Pro_Account_Names.csv'.

All we need to do is to join this to the existing 'players' table, and we will then have a map of pro players to ids, which we will be able to use to add a flag to our summaries and timeline data.

In [518]:
pros = pd.read_csv('Pro_Account_Names.csv')
players = pd.read_csv('./players.csv')

pro_ids = pd.merge(pros, players, how='inner', left_on='Pro_Account_Names', right_on='name')[['name', 'id', 'puuid']]

# let's save this file 
pro_ids.to_csv('pro_player_ids.csv', index=False)

**Finally, we can read in the data we just created and join in this table. Null values from the join will be turned into 0 (not a pro player) and non-null values into 1 (pro player).**

In [None]:
# for summaries:

In [528]:
summaries = pd.read_csv('./match-data/parsed-summaries/summaries_final.csv')
timelines = pd.read_csv('./match-data/parsed-timelines/timeline_final.csv')

In [536]:
summaries_joined = pd.merge(summaries, pro_ids[['id']], 
         left_on='summonerId', right_on='id', how='left')

In [540]:
summaries_joined['pro_flag'] = 0
summaries_joined.loc[summaries_joined['id'].notnull(), 'pro_flag'] = 1
summaries_joined.shape

(519230, 109)

In [541]:
summaries_joined.loc[summaries_joined['pro_flag'] == 1].shape

(10573, 109)

In [544]:
# we have 10,573 pro-player entries - close to the ~10,000 we would expect for 100 pro players, each playing 100 games (our set)
summaries_joined.drop(columns=['id'], inplace=True)
summaries_joined.to_csv('./match-data/parsed-summaries/summary.csv', index=False)
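
As an aside, the same flag can be built without a join. The sketch below uses `Series.isin` on made-up data; `summaries_demo` and `pro_ids_demo` are hypothetical stand-ins for the real tables. A membership test cannot add rows if an id happens to appear twice in the pro list, and it leaves no helper column to drop afterwards.

```python
import pandas as pd

# made-up stand-ins for the summaries table and the pro_ids table
summaries_demo = pd.DataFrame({
    'summonerId': ['a1', 'b2', 'c3', 'a1'],
    'kills': [3, 7, 1, 5],
})
pro_ids_demo = pd.DataFrame({'id': ['a1', 'z9']})

# 1 where the summonerId appears in the pro list, 0 otherwise
summaries_demo['pro_flag'] = summaries_demo['summonerId'].isin(pro_ids_demo['id']).astype(int)

print(summaries_demo)
```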

In [550]:
# for timelines:

#timelines.drop(columns=['Unnamed: 0'], inplace=True)

timelines_joined = pd.merge(timelines, pro_ids[['id', 'name']], 
         left_on='id', right_on='id', how='left', suffixes=('', '_pro'))

timelines_joined['pro_flag'] = 0
timelines_joined.loc[timelines_joined['name_pro'].notnull(), 'pro_flag'] = 1
timelines_joined.shape


(635133, 10)

In [551]:
timelines_joined.loc[timelines_joined['pro_flag'] == 1].shape

(72829, 10)

In [None]:
# interestingly the pro players have a much larger share of events compared to their share in the summaries data
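
To put numbers on that comparison, the shares can be computed directly from the row counts reported by the `.shape` outputs above. This is only a quick arithmetic sketch; the variable names are illustrative.

```python
# row counts taken from the .shape outputs above
summary_rows, summary_pro_rows = 519230, 10573
timeline_rows, timeline_pro_rows = 635133, 72829

summary_share = summary_pro_rows / summary_rows
timeline_share = timeline_pro_rows / timeline_rows

print(f'summaries: {summary_share:.1%} of rows belong to pro players')
print(f'timelines: {timeline_share:.1%} of rows belong to pro players')
```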

In [553]:
timelines_joined.drop(columns=['name_pro'], inplace=True)
timelines_joined.to_csv('./match-data/parsed-timelines/timeline.csv', index=False)