# Arkhamdb Decklists Data Loading 
Created: 2021-10-29  
Updated: 2021-10-30   
Author: Spencer Simon

## Overview

This notebook downloads data from [arkhamdb.com](https://arkhamdb.com) using their [public API](https://arkhamdb.com/api/).  

The data downloaded includes all decklists created and published on the site.  

This data is downloaded and prepared for use in creating racing bar charts of investigator popularity over time, as well as additional analysis.  

Data is exported as a CSV.

## Setup

### Install and import necessary libraries

In [3]:
import pandas as pd
import urllib.request, json 
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

### Define global variables

In [37]:
base_url = 'https://arkhamdb.com'
api_deck_by_date = '/api/public/decklists/by_date/'

In [38]:
# Get current year and month as ints and strings of numbers:
current_yr = datetime.now().year
current_yr_string = str(current_yr)
current_month = datetime.now().month

# Make sure month string is 2 digits (zero-padded, e.g. 9 -> '09')
current_month_string = f'{current_month:02d}'

In [39]:
dates_old = ['2016-09-02', '2016-10-12', '2016-10-15', '2016-10-18', '2016-10-19', 
             '2016-10-22', '2016-10-23', '2016-10-24', '2016-10-27', '2016-10-29', 
             '2016-10-30',]

In [40]:
urls_old = [base_url + api_deck_by_date + i for i in dates_old]

### Define functions

In [41]:
def get_urls(year, month):
    """
    year: string (YYYY)
    month: string (MM)
    
    Returns a list of the arkhamdb API urls for all decklists from all days in the input year and month.
    Also includes URLs for day numbers that do not exist in shorter months (e.g. February 30th).
    """
    
    # Check that input parameters are strings, and month is length 2
    if not isinstance(year, str) or not isinstance(month, str):
        print("Error: year and month must be strings")
        return
    elif len(month) != 2:
        print("Error: input month must be 2 characters long (e.g. 02) \n")
        return
    
    urls = []
    # Build one URL per day number 1-31, zero-padding the day to two digits
    for i in range(1, 32):
        url_temp = base_url + api_deck_by_date + f'{year}-{month}-{i:02d}'
        urls.append(url_temp)
    
    return urls
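
As the docstring notes, `get_urls()` builds a URL for every day number from 1 to 31, so shorter months get a few dates that do not exist. The following is a minimal sketch of one way to avoid that, using `calendar.monthrange` from the standard library. The name `get_urls_valid_days` and the upper-case constants are hypothetical and only mirror `base_url` and `api_deck_by_date` so the example runs on its own.

```python
import calendar

# Copies of the notebook's globals so this sketch is self-contained
BASE_URL = 'https://arkhamdb.com'
API_DECK_BY_DATE = '/api/public/decklists/by_date/'


def get_urls_valid_days(year, month):
    """Hypothetical variant of get_urls() that skips nonexistent dates.

    year: string (YYYY)
    month: string (MM)
    """
    # monthrange returns (weekday of the first day, number of days in the month)
    _, days_in_month = calendar.monthrange(int(year), int(month))
    return [
        f'{BASE_URL}{API_DECK_BY_DATE}{year}-{month}-{day:02d}'
        for day in range(1, days_in_month + 1)
    ]


urls_feb = get_urls_valid_days('2021', '02')
print(len(urls_feb))   # 28
print(urls_feb[-1])    # https://arkhamdb.com/api/public/decklists/by_date/2021-02-28
```

This only builds the URL strings; it makes no requests, so it can be checked offline.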

In [42]:
def fill_df_month(year, month):
    """
    year: string (YYYY)
    month: string (MM)
    
    Returns a list of dataframes, one per day that returned data, with the arkhamdb decklists retrieved from the API for the given month.
    """
    
    # Check that input parameters are strings, and month is length 2
    if not isinstance(year, str) or not isinstance(month, str):
        print("Error: year and month must be strings")
        return
    elif len(month) != 2:
        print("Error: input month must be 2 characters long (e.g. 02) \n")
        return
    
    small_dfs = []
    
    # Set a special case for September and October 2016, when very few decks were published
    if year == '2016' and month == '09':         # Fill using the first hard-coded date in urls_old, the only day in September
        url = urls_old[0]                        # Define url so the error message below can use it
        try:
            small_dfs.append(pd.read_json(url))
        except Exception:
            print(f"Error for Date {url[-10:]} \n")
    elif year == '2016' and month == '10':
        for url in urls_old[1:]:             # Fill using rest of urls_old, which is October
            try:
                small_dfs.append(pd.read_json(url))
            except Exception:
                print(f"Error for Date {url[-10:]} \n")
    else:
        for url in get_urls(year, month):    # Else, fill in list of decklists for the month using get_urls()
            try:
                small_dfs.append(pd.read_json(url))
            except Exception:
                print(f"Error for Date {url[-10:]} \n")

    return small_dfs
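
The loops above catch errors broadly and print only the date, so a failed request and an unparseable response look the same in the output. Below is a rough sketch of a fetch helper that separates the two cases, using the `urllib.request` and `json` modules the notebook already imports. The name `fetch_day` and its `opener` parameter are hypothetical; `opener` exists only so the function can be exercised without a network connection. Unlike `pd.read_json`, it returns plain parsed JSON rather than a dataframe.

```python
import json
import urllib.error
import urllib.request


def fetch_day(url, opener=urllib.request.urlopen):
    """Hypothetical helper: return the parsed JSON for one URL, or None on failure."""
    try:
        with opener(url) as response:
            return json.loads(response.read().decode())
    except urllib.error.URLError as err:
        # Network problems and HTTP error statuses (HTTPError subclasses URLError)
        print(f"Request failed for date {url[-10:]}: {err}")
    except ValueError as err:
        # Body was not valid JSON (json.JSONDecodeError subclasses ValueError)
        print(f"Could not parse response for date {url[-10:]}: {err}")
    return None
```

Calling `fetch_day(url)` with the default `opener` would make a real request. Whether a day without decklists produces an HTTP error or an empty body is not shown in this notebook, so both branches are kept.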

In [43]:
def fill_df_full():
    """
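    Returns a dataframe with all decklist data from the arkhamdb API, from September 2016 through the end of the current year.
    Days whose request fails are skipped. Stopping at the previous month is not implemented yet (see Notes & To-Do's).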
    """
    month_lists = [] # Initialize list of decklists
    
    # Loop through all years 2016 to present
    for yr in range(2016, current_yr+1):
        # Set start month to 9 in 2016, as that is the first month a decklist is published
        if yr == 2016:
            month_iter = ['09', '10', '11', '12']
        else:
            month_iter = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
            
        # Print when starting a new year to update progress:
        print(f" ------------------ \n Starting year {yr} \n")
        
        # Loop through every month in month_iter.
        # Note: for the current year this still includes the current month (partial data) and
        # later months; stopping at the month before the current month is not implemented yet.
        # For each month, get all decklists for the month. Append that to overall months list.
        for m in month_iter:
            temp_list = fill_df_month(str(yr), m)
            month_lists.append(temp_list)         # Append to lists here, then convert to df later
    
    # flatten list so I can pd.concat to make df
    flatter_list = [item for sublist in month_lists for item in sublist] 
    
    df = pd.concat(flatter_list, ignore_index=True)    # Create dataframe from the list above
    df.set_index(['id'], inplace=True)                 # set df index to id column
    
    return df

## Query Arkhamdb to get data

In [44]:
# Initially: took ~25 mins to run. Look at speed improvements, multi-processing, etc.
df_decklists_raw = fill_df_full()

---------------- 
 Starting year 2016 

Error for Date 2016-11-02 

Error for Date 2016-11-04 

Error for Date 2016-11-09 

Error for Date 2016-11-27 

Error for Date 2016-12-08 

Error for Date 2016-12-09 

Error for Date 2016-12-24 

Error for Date 2016-12-25 

Error for Date 2016-12-28 

Error for Date 2016-12-31 

---------------- 
 Starting year 2017 

Error for Date 2017-01-09 

Error for Date 2017-01-11 

Error for Date 2017-01-13 

Error for Date 2017-02-19 

Error for Date 2017-04-31 

Error for Date 2017-05-01 

Error for Date 2017-07-28 

Error for Date 2017-07-29 

Error for Date 2017-08-26 

Error for Date 2017-12-23 

---------------- 
 Starting year 2018 

Error for Date 2018-05-07 

---------------- 
 Starting year 2019 

Error for Date 2019-12-13 

Error for Date 2019-12-29 

---------------- 
 Starting year 2020 

---------------- 
 Starting year 2021 

Error for Date 2021-03-05 

Error for Date 2021-04-18 

Error for Date 2021-10-31 

Error for Date 2021-11-01 

[output truncated]
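
Most of the dates in the output above are ordinary calendar dates, but at least one (2017-04-31) is not, because the URL builder generates day numbers 1 through 31 for every month. The output does not say why each request failed, so the sketch below only separates real dates from nonexistent ones. It checks a handful of dates copied from the output; `is_real_date` is a hypothetical helper name.

```python
from datetime import datetime

# A few dates copied from the error output above
logged_dates = ['2016-11-02', '2017-04-31', '2021-10-31', '2021-11-01']


def is_real_date(date_string):
    """Return True if date_string (YYYY-MM-DD) exists on the calendar."""
    try:
        datetime.strptime(date_string, '%Y-%m-%d')
        return True
    except ValueError:
        # strptime rejects out-of-range days such as April 31st
        return False


for d in logged_dates:
    print(d, 'real date' if is_real_date(d) else 'not a real date')
```

Of these four, only 2017-04-31 is reported as not a real date. The others need a different explanation, such as no decklists being published that day or the date lying in the future when the notebook ran.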

In [47]:
# df_decklists_raw.head()
df_decklists_raw.shape

(30222, 19)

## Clean Data

## Export Data

In [48]:
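# The cleaning step above is still empty, so export the raw dataframe for now
# (df_decklists is never defined in this notebook)
df_decklists_raw.to_csv('decklists.csv')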

## Notes & To-Do's:

- `fill_df_full()` takes a long time to run (~25 mins). Use multithreading or other improvements to make it faster
- Clean df
- Export clean df to csv
- Write list of known errors/improvements
- Add a check in the month loop of `fill_df_full()` to exit the loop once the year and month reach the current year and month
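
For the last item, one possible approach is to generate the (year, month) pairs up front and stop before the current month, instead of breaking out of the nested loops. This is a minimal sketch under that assumption; `months_through_previous` is a hypothetical helper, and its default start of September 2016 comes from the first-decklist month noted in `fill_df_full()`.

```python
from datetime import date


def months_through_previous(today, first=(2016, 9)):
    """Hypothetical helper: (YYYY, MM) string pairs from `first` up to the month before `today`."""
    months = []
    year, month = first
    # Tuple comparison orders by year first, then month
    while (year, month) < (today.year, today.month):
        months.append((str(year), f'{month:02d}'))
        # Step to the next month, rolling over into January
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


months = months_through_previous(date(2021, 10, 30))
print(months[0], months[-1], len(months))
# ('2016', '09') ('2021', '09') 61
```

Each pair already has the string format that `fill_df_month(year, month)` expects, so a loop could call `fill_df_month(*pair)` for every entry.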

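# From searching: example of running requests concurrently with ThreadPoolExecutor
import requests  # third-party library, not imported in the Setup cell

def get_url(url):
    return requests.get(url)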

list_of_urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2"]*10

with ThreadPoolExecutor(max_workers=2) as pool:
    response_list = list(pool.map(get_url,list_of_urls))

for response in response_list:
    print(response)
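
The same pattern could be adapted to the decklist URLs. The sketch below is one hedged way to do it: `fetch_all` is a hypothetical helper that takes the fetch function as a parameter, and `fake_fetch` is a stand-in so the example runs without contacting arkhamdb. The `max_workers` value is an arbitrary choice, not a tested setting.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Hypothetical helper: call fetch(url) for every URL on a thread pool.

    Returns (url, result) pairs in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, regardless of which call finishes first
        return list(zip(urls, pool.map(fetch, urls)))


# Stand-in for a real request so the example runs offline
def fake_fetch(url):
    return {'date': url[-10:]}


demo_urls = [
    f'https://arkhamdb.com/api/public/decklists/by_date/2021-09-{d:02d}'
    for d in range(1, 4)
]
for url, result in fetch_all(demo_urls, fake_fetch):
    print(result['date'])
```

An exception raised inside `fetch` propagates out of `pool.map` when its results are consumed, so a real fetch function should handle its own errors and return a placeholder such as `None` for failed days.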

In [None]:
#with urllib.request.urlopen(url_deck) as url:
#    data = json.loads(url.read().decode())
#    #print(json.dumps(data, indent=1))