### Introduction

Our goal is to write a simple script that uses [ClinicalTrials.gov](https://www.clinicaltrials.gov/api/gui)'s API.

The script will take a beginning date and ending date and return all trials that were posted to the site during that time frame (inclusive).

The script will save a CSV file containing the downloaded data.

Let's call the script `ctscrape` for "ClinicalTrials scrape".

For `ctscrape`, we could use Python's built-in `sys` library to collect command-line arguments, or a dedicated argument parser such as the standard library's `argparse`.

Instead of getting ahead of ourselves, let's settle for a function (which will become the script's main function) that takes the arguments above.

In [None]:
def ctscrape(begin_date, end_date, pth='.', filename='ctscrape'):
    """
    Given a `begin_date` and `end_date`, query the ClinicalTrials.gov API
    and save trials published during that timeframe (inclusive) to 
    `filename` as a CSV located at `pth`.
    
    Requirements:
    * Do not overwrite data -- append timestamp to `filename`.
    """
    
    ## Pseudocode:
    # 
    #
    # Step 1: Query the API
    # Step 2: Convert response data to a dataframe
    # Step 3: Save to CSV
    #
    #
    
    return None

Let's take a look at the API.

### Examine the API

<!-- ![screenshot](assets/ClinicalTrials.gov.png) -->

<div>
<img src="assets/ClinicalTrials.gov1.png" width="1000"/>
</div>

We see right away that there are three types of query URLs. From reading their descriptions, it seems like one of the first two will meet our needs. "Full Studies" might work out of the box since more information is better than less.

Scrolling back up and reading the "API URLs" page, we can learn that the API will return results in XML format, but that we can use the parameter `fmt=JSON` to change this to JSON. This will likely be easier to manipulate in Python.
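
As a minimal sketch of what such a request URL might look like, the snippet below appends `fmt=json` to a Full Studies query. The helper name `build_query_url` is hypothetical, and the endpoint is the one used later in this notebook; nothing here contacts the API.

```python
from urllib.parse import urlencode

# Full Studies endpoint as used later in this notebook.
API_URL = "https://www.clinicaltrials.gov/api/query/full_studies"


def build_query_url(expr, fmt="json"):
    """Hypothetical helper: build a query URL that asks for JSON output."""
    # urlencode percent-encodes the values and joins them with '&'.
    params = {"expr": expr, "fmt": fmt}
    return f"{API_URL}?{urlencode(params)}"


url = build_query_url("heart attack")
print(url)
```

Using `urlencode` rather than joining strings by hand means spaces and brackets in the search expression are escaped for us.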

<div>
<img src="assets/ClinicalTrials.gov2.png" width="1000"/>
</div>

We can safely disregard all the Info URLs for now, but might come back to these as the need arises. As foreshadowed when we looked at the homepage, we can focus on query URLs. Let's read more about these.

After reading in more detail about Full Studies and Study Fields URLs, the Full Studies URL seems most suited to our purpose.

The idea for this script is to run it frequently to catch studies as they're posted, perhaps as often as daily. Using Full Studies, we can cast a wide net to get as much information as possible, then narrow down to the subset of information needed by the client (Blossom).

Let's construct an example query on the Full Studies URL to get all studies posted during a timeframe to be determined as we build. We could assume the period will be one week, but this may return more than the URL's 100 result limit.

Looking at the ClinicalTrials.gov website, it looks like we want our `*_date` arguments to filter on the "First Posted" field:

<div>
<img src="assets/ClinicalTrials.gov3.png" width="400"/>
</div>

Looking back at the API documentation, we can use their "[Crosswalk](https://www.clinicaltrials.gov/api/gui/ref/crosswalks#keyRecordDates)" 
page to identify the API parameter name we need to use, `StudyFirstPostDate`:

<div>
<img src="assets/ClinicalTrials.gov4.png" width="1000"/>
</div>

Looking further at the [Search Operations](https://www.clinicaltrials.gov/api/gui/ref/expr) page and searching for "date", we can see two functions that will be useful: `RANGE` and `TILT`.
* `RANGE` will take two dates in what looks to be mm/dd/yyyy form (or `MIN`/`MAX`).
* `TILT` will sort the results based on the given field, including date.

Here's the relevant example syntax:

```
AREA[ResultsFirstPostDate]RANGE[01/01/2015, MAX]
TILT[StudyFirstPostDate]"heart attack"
```
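
To make the syntax concrete, here is a small sketch that assembles a `RANGE` expression on `StudyFirstPostDate` from two Python dates. The function name `build_expr` is hypothetical, and the mm/dd/yyyy format is an assumption based on the documentation's example above.

```python
from datetime import date


def build_expr(begin, end, field="StudyFirstPostDate"):
    """Hypothetical helper: build an AREA[...]RANGE[...] search expression."""
    # Assumed date format: mm/dd/yyyy, as in the documentation's example.
    lo = begin.strftime("%m/%d/%Y")
    hi = end.strftime("%m/%d/%Y")
    return f"AREA[{field}]RANGE[{lo}, {hi}]"


expr = build_expr(date(2021, 11, 17), date(2021, 11, 18))
print(expr)  # AREA[StudyFirstPostDate]RANGE[11/17/2021, 11/18/2021]
```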

Going back to the [Field Values](https://www.clinicaltrials.gov/api/gui/demo/simple_field_values) URL for a moment, I can see that 80 studies were posted in the past week (query run on November 18):

<div>
<img src="assets/ClinicalTrials.gov5.png" width="500"/>
</div>

So we should definitely be running this script at least weekly, if not two or more times per week, to ensure we don't exceed the 100-result limit.

In fact, let's think now about how we should deal with that. To start, let's see what the Full Studies URL returns if we request too much. ... Huh, that's odd. The documentation says it will return up to 100, but I was able to get 101 after completing a CAPTCHA. In fact, I can get it to return 150. I think this might be a quirk of running it in their sandbox.

Here's the error message returned when `max_rnk - min_rnk >= 100`:

`{"Error":"Invalid query: Requesting too many FullStudies - limit MaxRank-MinRank < 100"}`

We should have general error-handling, but should check for this specific error.
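
One way to check for this specific error is sketched below. It assumes the error arrives as a top-level `Error` key in the decoded JSON, as in the message above; the names `TooManyStudiesError` and `check_response` are hypothetical.

```python
import json


class TooManyStudiesError(Exception):
    """Hypothetical exception for a rank window that is too wide."""


def check_response(payload):
    """Return the payload unchanged, or raise if it is an API error."""
    error = payload.get("Error")
    if error is None:
        return payload
    # Match on the distinctive part of the message shown above.
    if "Requesting too many FullStudies" in error:
        raise TooManyStudiesError(error)
    # Any other error falls through to a general exception.
    raise RuntimeError(f"API error: {error}")


sample = json.loads(
    '{"Error":"Invalid query: Requesting too many FullStudies'
    ' - limit MaxRank-MinRank < 100"}'
)
try:
    check_response(sample)
except TooManyStudiesError as exc:
    print("Window too wide:", exc)
```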

Okay, after playing around with the API, I think I have a decent algorithm for addressing this problem:
* Step 1: Make an initial request to the API with the full date range and `min_rnk=1`, `max_rnk=100`.
* Step 2: Make another request with `min_rnk, max_rnk = (101, 200)`.
* Step 3: And so on, until `max_rnk >= n_studies`.

`n_studies` is given by the API as `NStudiesFound` in each result.
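
The paging arithmetic can be sketched on its own, without touching the API. The generator below (the name `rank_windows` is hypothetical) yields non-overlapping rank windows that each satisfy `max_rnk - min_rnk < 100`; the last window may extend past `n_studies`.

```python
def rank_windows(n_studies, page_size=100):
    """Hypothetical helper: yield (min_rnk, max_rnk) pairs covering 1..n_studies."""
    min_rnk = 1
    while min_rnk <= n_studies:
        # Each window spans page_size ranks, so max_rnk - min_rnk == page_size - 1.
        yield min_rnk, min_rnk + page_size - 1
        min_rnk += page_size


print(list(rank_windows(283)))  # [(1, 100), (101, 200), (201, 300)]
```

Note the `<=` in the loop condition: with exactly 201 studies, a strict `<` would skip the final window and drop the last study.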

### `ctscrape` Draft 1

In [145]:
import requests
from datetime import datetime
import os
import re
import urllib.parse
import pandas as pd

DATE_FORMAT = '%m/%d/%Y'
API_URL = 'https://www.clinicaltrials.gov/api/query/full_studies'

def query_api(qparams):
    # Construct query URL
    qstring = '&'.join(['='.join([k,str(v)]) for k,v in qparams.items()])
    qurl = API_URL + '?' + qstring

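    try:
        r = requests.get(qurl)
        # Treat HTTP error statuses (4xx/5xx) as failures too
        r.raise_for_status()
    except requests.RequestException as e:
        raise RuntimeError(f'Error running API query with parameters {qparams}') from e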

    resp = r.json()['FullStudiesResponse']
    
    return resp

def ctscrape_1(begin=None, end=None, pth='.', filename='ctscrape'):
    """
    Given `begin` and `end` dates, query the ClinicalTrials.gov API
    and save trials published during that timeframe (inclusive) to
    `filename` as a CSV located at `pth`.
    
    `begin`: str, date in format mm/dd/yyyy
    `end`: str, date in format mm/dd/yyyy
    `pth`: str, the directory where you'd like to store the file
    `filename`: str, the filename you'd like to use. Will be appended with timestamp and .csv.
    
    Requirements:
    * Do not overwrite data -- append timestamp to `filename`.
    """
    
    # Argument checking
    if begin is None or end is None:
        raise ValueError('Enter `begin` and `end`.')

    try:
        datetime.strptime(begin, DATE_FORMAT)
        datetime.strptime(end, DATE_FORMAT)
    except ValueError:
        raise ValueError('Enter date in mm/dd/yyyy format.')
    
    if not os.path.isdir(pth):
        raise ValueError('`pth` is not a directory')
        
    pat = re.compile(r'[A-Za-z0-9_]+')
    if not re.fullmatch(pat, filename):
        raise ValueError('`filename` must be alphanumeric or underscore. Will be appended with timestamp and .csv')
        
    expr = f'TILT[StudyFirstPostDate]AREA[StudyFirstPostDate]RANGE[{begin}, {end}]'
    expr = urllib.parse.quote_plus(expr)
    
    # Store query results here
    res = []
        
    # Construct our request
    qparams = dict(
        expr = expr,
        min_rnk = 1,
        max_rnk = 100,
        fmt = 'json'
    )
    
    # Hit the query once to set n_studies and get the first result set
    resp = query_api(qparams)
    n_studies = int(resp['NStudiesFound'])
    res.extend(resp['FullStudies'])
    qparams['min_rnk'] += 100
    qparams['max_rnk'] += 100

    # Continue to query the API until all studies are retrieved
    while qparams['min_rnk'] <= n_studies:
        resp = query_api(qparams)
        res.extend(resp['FullStudies'])
        qparams['min_rnk'] += 100
        qparams['max_rnk'] += 100
            
    # Convert to DF and remove prefix in column names
    studies_df = pd.json_normalize(res)
    studies_df.columns = [c.split('.')[-1] for c in studies_df.columns]
    
    # Write to file
    epoch = str(datetime.now().timestamp()).replace('.', '')
    filename = filename + f'_{epoch}' + '.csv'
    out = os.path.join(pth, filename)
    studies_df.to_csv(out, index=False)
    
    return studies_df

ctscrape_1(begin='11/17/2021', end='11/18/2021', pth='./data')

Unnamed: 0,Rank,NCTId,OrgStudyId,OrgFullName,OrgClass,BriefTitle,OfficialTitle,Acronym,StatusVerifiedDate,OverallStatus,...,GenderBased,GenderDescription,IPDSharingURL,ExpandedAccessNCTId,ExpandedAccessStatusForNCTId,TargetDuration,OrgStudyIdType,OrgStudyIdLink,SubmissionInfo,WhyStopped
0,1,NCT05126264,2019-000736-25,Universidad Complutense de Madrid,OTHER,Efficacy of Chronoterapy in Oral Surgery,Efficacy of the Dosage of a Non-steroidal Anti...,ECOS,February 2022,Completed,...,,,,,,,,,,
1,2,NCT05126251,GZY-KJS-2021-006,Beijing University of Chinese Medicine,OTHER,Tangningtongluo Tablet for People With Prediab...,Efficacy and Safety of Tangningtongluo Tablet ...,,March 2022,Recruiting,...,,,,,,,,,,
2,3,NCT05126238,BINOS,Negovsky Reanimatology Research Institute,OTHER_GOV,A Lithium-Based Medication to Improve Neurolog...,A Lithium-Based Medication to Improve Neurolog...,,March 2022,Recruiting,...,,,,,,,,,,
3,4,NCT05126225,18-001769,"University of California, Los Angeles",OTHER,Buddhist Understanding and Reduction of Myanma...,Buddhist Understanding and Reduction of Myanma...,,June 2022,Not yet recruiting,...,,,,,,,,,,
4,5,NCT05126212,2021-127,"University Hospital, Angers",OTHER_GOV,Evaluation of the User Experience of an Innova...,Hospi'Senior : Evaluation of the User Experien...,HospiSenior-UX,October 2021,Not yet recruiting,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,280,NCT05122637,DEV-0213,"RapidPulse, Inc",INDUSTRY,Feasibility Study of RapidPulseTM Aspiration S...,Feasibility Study of RapidPulseTM Aspiration S...,RapidPulseFS,October 2022,"Active, not recruiting",...,,,,,,,,,,
280,281,NCT05122624,R21AI161301,Johns Hopkins Bloomberg School of Public Health,OTHER,A Clinical Risk Score for Early Management of ...,PredicTB: Validating a Clinical Risk Score for...,PredicTB,January 2022,Recruiting,...,,,,,,,U.S. NIH Grant/Contract,https://reporter.nih.gov/quickSearch/R21AI161301,,
281,282,NCT05122611,RFMRI01,FUSMobile Inc.,INDUSTRY,Post Lumbar Radiofrequency Neurotomy Imaging,Post Lumbar Radiofrequency Neurotomy Imaging,,August 2021,Completed,...,,,,,,,,,,
282,283,NCT05122598,COTAG061981,Evon Medics LLC,INDUSTRY,Development and Evaluation of Computerized Olf...,Development and Evaluation of Computerized Olf...,,May 2022,Recruiting,...,,,,,,,,,,
