## LD2L data project

#### Introduction

In order to pull interesting data from ld2l.gg I have created a basic set of pulls to gather match data and parse it into a dataframe. 
This dataframe is exported to a csv and can be used with any BI software or can be read programatiicaly with dataframe packages for exploration

In [1]:
#import 

import pandas as pd
import numpy as np
import requests
import json
import bs4
import time
import os

# You will need your own API key from opendota 
# and a config.py file that sets the variable api_key to your key
from config import api_key

In [2]:
# display full dataframe
pd.set_option('display.max_columns', None)

#### Pulling season info

Basic match data is found on ld2l.gg/seasons/##/matches. My outline here will use season 37 for prototyping. 

Using BeautifulSoup (BS) (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) the ld2l matches page is parsed to get the ld2l match id. 
The ld2l match id does not match the dota match id

A cache file is created, unless it already exists, to avoid re-parsing saved info.

Note that seasons on the website do not relate to the ticket or season that would be listed in Dota/OpenDota api.
This method also reduces adding in matches that ticket holders use for scrims or other reasons that aren't official games.
One limitation is that this method can't pull unticketed data, even if entered in completly on the ld2l website.



In [6]:
#set ld2l season webpage
url = 'https://ld2l.gg/seasons/13/matches'
season = url.split('/')[-2]

soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
matches = []

# This is a directory string to help manage the data of different seasons
save_dir = f'match_data_{url.split("/")[-2]}'

for a in soup.find_all('a', href=True):
    if 'match' in a['href'] and 'season' not in a['href']:
        matches.append('https://ld2l.gg' + a['href'])

#sort matches by ID
matches.sort(key=lambda x: int(x.split('/')[-1]))

# create a folder to store match data if it doesn't exist
if not os.path.exists(save_dir):
    os.mkdir(save_dir)

# matches text file location to variable
matches_file = f'{save_dir}/matches_{season}.txt'

# create a matches text file to store match IDs if it doesn't exist

if not os.path.exists(matches_file):
    with open(matches_file, 'w') as f:
        f.write('')

len(matches)


7

#### Converting to OpenDota links

After gathering the match data, each match page is opened via BS. From here the Match ID is extracted from the 
OpenDota link and a correctly formatted OpenDota API link is added to a list.

This section skips over matches that have been parsed already by checking the matches.txt file created in the last secion

In [4]:
# below code is for getting opendota links

od_matches = []

for match in matches:
    #check if match is already in file matches.txt to prevent re-scraping and angry butterygreg
    if match in open(matches_file).read():
        pass
    else:
        #write match to file
        with open(matches_file, 'a') as f:
            f.write(match + '\n')


        soup = bs4.BeautifulSoup(requests.get(match).text, 'html.parser')
        for a in soup.find_all('a', href=True):
            if 'opendota' in a['href']:
                if 'matches/0' in a['href']:
                    break
            # get match id from end of url
                match_id = a['href'].split('/')[-1]

                # here an opendota link is created and appended to the list, note the api key is required
                od_matches.append(f"https://api.opendota.com/api/matches/{match_id}?api_key={api_key}")
                break

len(od_matches)

24

#### Pulling OpenDota Jsons

In [5]:
# hold list of file names
file_names = []

for files in os.listdir(save_dir):
    if files.endswith('.json'):
        file_names.append(files)

for i, match in enumerate(od_matches):

    file_path = f'{save_dir}/match_{match.split("/")[-1]}.json'

    # get match id
    match_id = match.split('/')[-1].split('?')[0]

    #check if file already exists
    if os.path.isfile(f'{file_path}'):
        pass
    else:
        if match_id == '0':
            pass
        # get json of match and save to json file
        else:
            match_json = requests.get(match).json()
            with open(f'{save_dir}/match_{match_id}.json', 'w') as f:
                json.dump(match_json, f)
                file_names.append(f'{save_dir}/match_{match_id}.json')

#### DataFrame formatting and basic cleaning

Below a blank dataframe is created with the selected features from the players section in the read json files. 
As with earlier sections, if a cached match_data.csv exists, new items will be concatenated instead of a new creations, saving time and resources.

In [None]:
# create an empty dataframe to hold all match data if it doesn't exist

if not os.path.exists('match_data.csv'):
    match_data = pd.DataFrame(columns=['match_id', 'date', 'week', 'account_id', 'personaname', 'teamID', 'rank_tier', 'kills', 'assists',
       'deaths', 'kills_per_min', 'kda', 'denies', 'gold', 'gold_per_min', 'gold_spent', 'hero_damage', 'damage_taken',
       'hero_healing', 'hero_id', 'item_0', 'item_1', 'item_2', 'item_3',
       'item_4', 'item_5', 'item_neutral', 'last_hits', 'level',
       'net_worth', 'tower_damage', 'xp_per_min', 'radiant_win',
       'duration', 'patch', 'isRadiant', 'win', 'lose',
       'total_gold', 'total_xp', 'obs_placed', 'sen_placed', 'rune_pickups', 'camps_stacked', 'stuns', 'creeps_stacked',
       'firstblood_claimed', 'pings', 'teamfight_participation', 'roshans_killed'])
    match_data.to_csv('match_data.csv')
else:
    match_data = pd.read_csv('match_data.csv', index_col=None)

In [None]:
for  i, file in enumerate(file_names):

    # read first json file as a dictionary
    with open(file) as f:
        data = json.load(f)

    # get match id
    match_id = data['match_id']

    # if match id is already in matches_df, skip
    if match_id in match_data['match_id'].values:
        pass
    else:

        rad_team_id = data['radiant_team_id']
        dire_team_id = data['dire_team_id']
        
    # read player from data into a dataframe

        df = pd.DataFrame(data['players'])

        # damage taken needs to be transformed. it is a nested dictionary and should be replaced with the sum of the values

        df['damage_taken'] = df['damage_taken'].apply(lambda x: sum(x.values()))

        #convert start_time from unix time to datetime using
        df['start_time'] = pd.to_datetime(df['start_time'], unit='s')
        df['date'] = df['start_time'].dt.date

        #games are played weekly. create a column for the week of the game. Week 1 starts on 2023-01-22, using isocalendar
        df['week'] = df['start_time'].dt.isocalendar().week - 2

        #drop start_time
        df.drop('start_time', axis=1, inplace=True)

        # if isRadiant is true, set teamID to radiant team ID, else set to dire team ID

        df['teamID'] = df['isRadiant'].apply(lambda x: rad_team_id if x == True else dire_team_id)

        new_order = match_data.columns.tolist()

        df = df[new_order]

        # append to main df via concat

        match_data = pd.concat([match_data, df], axis=0)

        # replace NaN with 0
        match_data.fillna(0, inplace=True)

        

        


        # save to csv every loop
        match_data.to_csv('match_data.csv', index=False)

#### Preview

Below will be a dataframe preview. Please note that some assumptions may cause the wrong week to display.

In [None]:
match_data.head(20)

In [14]:
# prototype for function to get match data from opendota, will test on a different season than the prototype

def get_ld2l_matches(season):
    """
    This function reads ld2l.gg to get the specific match IDs that are ticketed and registered on OpenDota into a list. 
    This list is saved as a text file to prevent re-scraping of data.

    This function also creates a folder to store the match data in and will be the directory for the matches.txt file and later the match data.

    inputs: season (int) - the season of the matches to be pulled as listed on ld2l.gg matches page, different than the opendota season due to ticketing issues
    Returns: 
        list of ld2l.gg match IDs, 
        save_dir (str) - directory string to help manage the data of different seasons to pass to the next function
        matches.txt directory for future use in following functions

    output: text file that matches.txt that another function will use to check if the match data has already been pulled and write to.
    """
    url = f'https://ld2l.gg/seasons/{season}/matches'
    soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
    matches = []

    # This is a directory string to help manage the data of different seasons
    save_dir = f'match_data_{url.split("/")[-2]}'

    for a in soup.find_all('a', href=True):
        if 'match' in a['href'] and 'season' not in a['href']:
            matches.append('https://ld2l.gg' + a['href'])

    #sort matches by ID
    matches.sort(key=lambda x: int(x.split('/')[-1]))

    # create a folder to store match data if it doesn't exist
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)

    # save match file location to a text file

    matches_file = f"{save_dir}/matches_{season}.txt"

    # create a matches text file to store match IDs if it doesn't exist
    if not os.path.exists(f'{matches_file}'):
        with open(f'{matches_file}', 'w') as f:
            f.write('')

    return matches, save_dir, matches_file

def get_od_matches(matches, save_dir, matches_file):
    """
    This function takes the list of ld2l.gg match IDs and converts them to opendota match IDs.

    input: 
        list of ld2l.gg match IDs, ideally from the get_ld2l_matches function 
        the save_dir string from that function
        the matches_file string from that function

    Returns: 
        list of opendota match IDs, 
        save_dir (str) - the directory string for the match data to pass to the next function

    output: writes match IDs to matches.txt file to prevent re-scraping of data
    """
    
    od_matches = []

    for match in matches:
        #check if match is already in matches.txt
        if match in open(matches_file).read():
            pass
        else:
            #write to matches file
            with open(matches_file, 'a') as f:
                f.write(match + '\n')

            # reads html from ld2l.gg match page and finds the opendota match ID
            soup = bs4.BeautifulSoup(requests.get(match).text, 'html.parser')
            for a in soup.find_all('a', href=True):
                if 'opendota' in a['href']:
                    # if match id is 0, it means the match wasnt ticketed and can be ignored
                    if "matches/0" in a['href']:
                        break
                    else:
                        match_id = a['href'].split('/')[-1]
                        od_matches.append(f'https://api.opendota.com/api/matches/{match_id}')
                
    return od_matches, save_dir

def get_match_data(od_matches, save_dir):
    """
    This function pulls the raw json data for each match and saves as a json file in the save_dir folder.

    Inputs:
        list of opendota match IDs, ideally from the get_od_matches function
        the save_dir string from that function

    Returns: save_dir (str) - the directory string for the match data to pass to the next function
    """

    file_names = []

    for files in os.listdir(save_dir):
        if files.endswith('.json'):
            file_names.append(files)

    for i, match in enumerate(od_matches):

        file_path = f'{save_dir}/match_{match.split("/")[-1]}.json'

        # get match id
        match_id = match.split('/')[-1].split('?')[0]

        #check if file already exists
        if os.path.isfile(f'{file_path}'):
            pass
        else:
            if match_id == '0':
                pass
            # get json of match and save to json file
            else:
                match_json = requests.get(match).json()
                with open(f'{save_dir}/match_{match_id}.json', 'w') as f:
                    json.dump(match_json, f)
                    file_names.append(f'{save_dir}/match_{match_id}.json')

    return file_names, save_dir

def create_season_dataframe(file_names, save_dir):
    """
    This function reads the json files in the save_dir folder and creates a dataframe of the match data.
    The dataframe is saved as a csv file for processing outside of this script. 

    If the csv file already exists, it will be read and returned with new data appended to the end.

    Inputs: 
        file_names (list) - list of file names in the save_dir folder
        save_dir (str) - the directory string for the match data. Ideally passed from previous function.
    Returns: match_data (pd.DataFrame) - the dataframe of match data
    Output: match_data.csv file in save_dir folder
    """

    # create an empty dataframe to hold all match data if it doesn't exist in format
    # match_data_{season}.csv

    # csv filepath to variable

    csv_path = f'{save_dir}/match_data_{save_dir.split("_")[-1]}.csv'

    if not os.path.exists(f'{csv_path}'):
        match_data = pd.DataFrame(columns=[
            'match_id', 'date', 'week', 'account_id', 'personaname', 'teamID', 
            'rank_tier', 'kills', 'assists''deaths', 'kills_per_min', 'kda', 
            'denies', 'gold', 'gold_per_min', 'gold_spent', 'hero_damage', 'damage_taken',
            'hero_healing', 'hero_id', 'item_0', 'item_1', 'item_2', 'item_3',
            'item_4', 'item_5', 'item_neutral', 'last_hits', 'level',
            'net_worth', 'tower_damage', 'xp_per_min', 'radiant_win', 
            'duration', 'patch', 'isRadiant', 'win', 'lose', 'total_gold', 
            'total_xp', 'obs_placed', 'sen_placed', 'rune_pickups', 'camps_stacked', 
            'stuns', 'creeps_stacked', 'firstblood_claimed', 'pings', 'teamfight_participation', 
            'roshans_killed']
        )
        match_data.to_csv(f'{csv_path}', index=False)
    else:
        match_data = pd.read_csv(f'{csv_path}')
    



    # read json files in save_dir and append to match_data

    for i, file in enumerate(file_names):

        # read json file
        with open(file) as f:
            match = json.load(f)
        
        # get match id
        match_id = match['match_id']

        # check if match is already in match_data, passes if parsed already
        if match_id in match_data['match_id'].values:
            pass
        else:

            # get team ids
            radiant_team_id = match['radiant_team']['team_id']
            dire_team_id = match['dire_team']['team_id']

            # read player data into a temporary dataframe
            # the dataframe will be concatenated to the match_data dataframe

            df = pd.DataFrame(data['players'])

            #flattening the damage_taken feater

            df['damage_taken'] = df['damage_taken'].apply(lambda x: sum(x.values()))

            # convert unix time to datetime

            df['start_time'] = pd.to_datetime(df['start_time'], unit='s')
            df['date'] = df['start_time'].dt.date

            # games are played weekly. create a column for the week of the game. 
            # This will need to be transformed to the correct week number for each season
            # If games are played early or late in the week, this will need to be adjusted
            # It will also not work for seasons that go over a year boundary
            df['week'] = df['start_time'].dt.isocalendar().week

            # drop unix start time
            df.drop('start_time', axis=1, inplace=True)

            # assign team id to each player

            df['teamID'] = df['isRadiant'].apply(lambda x: rad_team_id if x == True else dire_team_id)

            # make columns the same as the match_data dataframe

            new_order = match_data.columns

            df = df[new_order]

            # append to match_data

            match_data = match_data.append(df, ignore_index=True)

            # fill in missing values
            # replace NaN with 0
            match_data.fillna(0, inplace=True)

            # save match_data to csv every loop
            match_data.to_csv(f'{csv_path}', index=False)


    return match_data

In [16]:
create_season_dataframe(*get_match_data(*get_od_matches(*get_ld2l_matches(13))))

FileNotFoundError: [Errno 2] No such file or directory: 'match_4699692683.json'