Downloading CrossFit Open data 
=====

Author: [Ray Bell](https://github.com/raybellwaves)

#### The class `Cfopendata` is used to download CrossFit Open data from the [main site](https://games.crossfit.com/leaderboard/open/2017?division=1&region=0&scaled=0&sort=0&occupation=0&page=1).

In [1]:
import cfanalytics as cfa

The parameters for `Cfopendata` are as follows (in bold):

__year__ : 2011 - 2018

Let's grab last year's results (only tested on 2017):

In [2]:
year = 2017
print(year)

2017


__division__

The CrossFit Open has divisions, each identified by a numerical value:
    
1. = Men
2. = Women
3. = Men 45-49
4. = Women 45-49
5. = Men 50-54
6. = Women 50-54
7. = Men 55-59
8. = Women 55-59
9. = Men 60+
10. = Women 60+
11. = Team
12. = Men 40-44
13. = Women 40-44
14. = Boys 14-15
15. = Girls 14-15
16. = Boys 16-17
17. = Girls 16-17
18. = Men 35-39
19. = Women 35-39

Let's grab the division with the fewest entries (Girls 14-15) to make sure we can download some data.

In [3]:
division = 15
print(division)

15
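The division table above can be kept as a small lookup so the numeric code is easier to read back. This is only a sketch: `DIVISIONS` is a hypothetical name used for illustration and is not part of `cfanalytics`.

```python
# Division codes transcribed from the list above.
# DIVISIONS is an illustrative name, not a cfanalytics object.
DIVISIONS = {
    1: "Men", 2: "Women",
    3: "Men 45-49", 4: "Women 45-49",
    5: "Men 50-54", 6: "Women 50-54",
    7: "Men 55-59", 8: "Women 55-59",
    9: "Men 60+", 10: "Women 60+",
    11: "Team",
    12: "Men 40-44", 13: "Women 40-44",
    14: "Boys 14-15", 15: "Girls 14-15",
    16: "Boys 16-17", 17: "Girls 16-17",
    18: "Men 35-39", 19: "Women 35-39",
}

division = 15
print(DIVISIONS[division])
```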


__scaled__ : 0 or 1

0 indicates the prescribed (Rx) version of the workout; 1 indicates the scaled (Sc) version.

Let's grab the Rx data.

In [4]:
scaled = 0
print(scaled)

0


These parameters are used to request data that have been uploaded to the CrossFit Games server. To look at the raw JSON data for these parameters, take a look at this [url](https://games.crossfit.com/competitions/api/v1/competitions/open/2017/leaderboards?page=1&competition=1&year=2017&division=15&scaled=0&sort=0&fittest=1&fittest1=0&occupation=0), which corresponds to the scores displayed on the [main website](https://games.crossfit.com/leaderboard/open/2017?division=15&region=0&scaled=0&sort=0&occupation=0&page=1).
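To make the link between the parameters and the JSON url concrete, here is a minimal sketch that rebuilds that url from `year`, `division` and `scaled`. The helper `leaderboard_api_url` is hypothetical (it is not a `cfanalytics` function), the remaining query values are simply copied from the url above, and the endpoint may have changed since this was written.

```python
from urllib.parse import urlencode

def leaderboard_api_url(year, division, scaled, page=1):
    """Rebuild the JSON url shown above (illustrative helper only)."""
    base = ("https://games.crossfit.com/competitions/api/v1/"
            "competitions/open/{}/leaderboards".format(year))
    # Query values other than year/division/scaled/page are copied
    # verbatim from the example url; their meaning is not documented here.
    params = {
        "page": page,
        "competition": 1,
        "year": year,
        "division": division,
        "scaled": scaled,
        "sort": 0,
        "fittest": 1,
        "fittest1": 0,
        "occupation": 0,
    }
    return base + "?" + urlencode(params)

url = leaderboard_api_url(2017, 15, 0)
print(url)
```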

__batchpages__

To speed up the download, the pages are requested *asynchronously* (as concurrent tasks). Here's a nice [blog post](https://hackernoon.com/asyncio-for-the-working-python-developer-5c468e6e2e8e) explaining [asyncio](https://docs.python.org/3.4/library/asyncio.html). This image is a nice visualization of synchronous vs. asynchronous requests: <img src="files/HTTP_pipelining2.svg"> from [Wikipedia](https://en.wikipedia.org/wiki/HTTP_pipelining#/media/File:HTTP_pipelining2.svg).

This parameter allows us to choose how many pages to call at once.

There is a trade-off between speed and stability, so this parameter is adjustable.

There are 33 pages of data for our parameters above, so we will set the value at 10.

In [5]:
batchpages = 10
print(batchpages)

10
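As a rough illustration of what batching means here, the sketch below splits 33 pages into batches of 10 and runs each batch concurrently with `asyncio.gather`. It is not how `cfanalytics` is implemented: `make_batches`, `fetch_page` and `fetch_all` are hypothetical names, and `fetch_page` is a stand-in that makes no network request.

```python
import asyncio

def make_batches(n_pages, batchpages):
    """Split page numbers 1..n_pages into lists of at most batchpages."""
    pages = list(range(1, n_pages + 1))
    return [pages[i:i + batchpages] for i in range(0, n_pages, batchpages)]

async def fetch_page(page):
    # Stand-in for a real HTTP request; it only yields control
    # to the event loop and returns the page number.
    await asyncio.sleep(0)
    return page

async def fetch_all(n_pages, batchpages):
    results = []
    for batch in make_batches(n_pages, batchpages):
        # Pages within a batch run concurrently; batches run one after another.
        results.extend(await asyncio.gather(*(fetch_page(p) for p in batch)))
    return results

batches = make_batches(33, 10)
print([len(b) for b in batches])

pages = asyncio.run(fetch_all(33, 10))
print(len(pages))
```

Inside a notebook an event loop is already running, so `asyncio.run` raises an error there; with a recent IPython you can write `pages = await fetch_all(33, 10)` instead.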


__ddir__ is the data directory where we want the data downloaded to.

I'm running this inside my local directory of the [examples folder](https://github.com/raybellwaves/cfanalytics/tree/master/examples) so I will simply generate a *Data* folder and use that.

In [6]:
import os
ddir = os.path.join(os.getcwd(), 'Data')
# Create the folder if it does not already exist
os.makedirs(ddir, exist_ok=True)
print(ddir)

/Volumes/SAMSUNG/WORK/CFanalytics_2017/test_GitHub_folder/test_examples/Data


Now I will download the data using `Cfopendata`.

In [7]:
%time cfa.Cfopendata(year, division, scaled, batchpages, ddir)

CPU times: user 5.16 s, sys: 94.8 ms, total: 5.25 s
Wall time: 9.67 s


<cfanalytics.core.cfopendata.Cfopendata at 0x105dd8198>

Two files now exist in the Data folder: __Girls_14-15_Rx_2017_raw__, which holds a `pandas.DataFrame`, and __Girls_14-15_Rx_2017_raw.csv__.

Open the CSV file by either double-clicking it (which may open it in Excel) or, better yet, try the new [JupyterLab](https://github.com/jupyterlab/jupyterlab).
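If you would rather peek at the CSV from Python, here is a minimal sketch using only the standard library. `preview_csv` is a hypothetical helper, and it makes no assumption about the column names because they are not listed here.

```python
import csv
import os

def preview_csv(path, n=5):
    """Return the header row and the first n data rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows

# Assumes the notebook was run from the folder that contains Data.
path = os.path.join(os.getcwd(), "Data", "Girls_14-15_Rx_2017_raw.csv")
if os.path.exists(path):
    header, rows = preview_csv(path)
    print(header)
    print(len(rows))
```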