# BioThings 
My personal walkthrough of the BioThings Studio tutorial documentation, written while learning the BioThings API in order to contribute to the Su and Wu labs at Scripps Research.  
  
    
**Input Data:** https://www.pharmgkb.org/
      
**Links**     
[biothings_tutorial](https://docs.biothings.io/en/latest/doc/studio_tutorial.html)  


## Parser  
To ingest this data and make it available as an API, we first need to write a parser. The data is pretty simple (tab-separated files), and we’ll make it even simpler by using the pandas Python library. The first version of this parser is available in the branch pharmgkb_v1 at https://github.com/sirloon/pharmgkb/blob/pharmgkb_v1/parser.py. After some boilerplate code at the beginning for dependencies and initialization, the main logic is the following:

  
import csv
import os
import re

import pandas

from biothings import config
from biothings.utils.dataload import dict_convert
logging = config.logger


"""
Parsing function - load_annotations 
"""
def load_annotations(data_folder):
    
    """
    Our parsing function is named ```load_annotations```, it could be named anything else,
    but it has to take a folder path ```data_folder``` containing the downloaded data. This path is
    automatically set by the Hub and points to the latest version available. More on this later.  
    
    It is the responsibility of the parser to select, within that folder, the **file(s) of interest**.
    *Here we need data from a file named var_drug_ann.tsv.* Following the motto “don’t assume it, prove it”,
    we make sure that file exists.
    """
    infile = os.path.join(data_folder,"var_drug_ann.tsv")
    assert os.path.exists(infile)
    # Read the TSV into a list of dicts, one per row, keyed by column name
    dat = pandas.read_csv(infile,sep="\t",quoting=csv.QUOTE_NONE).to_dict(orient='records')
    results = {}
    
    """
    We then open and read the TSV file using the pandas.read_csv() function. At this point,
    each record rec is a dict keyed by the column names of the TSV file.
    """
    for rec in dat:

        if not rec["Gene"] or pandas.isna(rec["Gene"]):
            logging.warning("No gene information for annotation ID '%s'", rec["Annotation ID"])
            continue
        # The document ID is the identifier found in parentheses in the Gene field
        _id = re.match(r".* \((.*?)\)",rec["Gene"]).groups()[0]
        # We'll remove spaces in keys to make queries easier. Also, lowercase is preferred
        # for a BioThings API. We'll use a helper function from the BioThings SDK.
        process_key = lambda k: k.replace(" ","_").lower()
        rec = dict_convert(rec,keyfn=process_key)
        results.setdefault(_id,[]).append(rec)
        
    for _id,docs in results.items():
        doc = {"_id": _id, "annotations" : docs}
        yield doc
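
To see the grouping logic in isolation, here is a minimal sketch that runs without the Hub, pandas, or the BioThings SDK. It is not the tutorial's parser: it reads an in-memory TSV with the standard `csv` module, and `normalize_keys` is a hypothetical stand-in for `dict_convert(rec, keyfn=process_key)`. The sample rows, gene names, and identifiers are made up for illustration; only the column names `Annotation ID` and `Gene` come from the parser above.

```python
import csv
import io
import re

# Hypothetical sample rows: "Drug Name" and every value below are made up.
SAMPLE_TSV = (
    "Annotation ID\tGene\tDrug Name\n"
    "1\tGENEA (ID001)\tdrug one\n"
    "2\tGENEA (ID001)\tdrug two\n"
    "3\t\tdrug three\n"
    "4\tGENEB (ID002)\tdrug four\n"
)

# Captures the identifier inside the parentheses of the Gene column.
GENE_ID = re.compile(r".* \((.*?)\)")


def normalize_keys(rec):
    """Stand-in for dict_convert: lowercase the keys and replace spaces with underscores."""
    return {key.replace(" ", "_").lower(): value for key, value in rec.items()}


def group_annotations(tsv_text):
    """Yield one document per gene ID, each holding the list of its annotations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    results = {}
    for rec in reader:
        match = GENE_ID.match(rec["Gene"] or "")
        if match is None:
            # Assumption of this sketch: skip rows whose Gene field is empty
            # or does not contain an identifier in parentheses.
            continue
        results.setdefault(match.group(1), []).append(normalize_keys(rec))
    for _id, docs in results.items():
        yield {"_id": _id, "annotations": docs}


docs = list(group_annotations(SAMPLE_TSV))
for doc in docs:
    print(doc["_id"], len(doc["annotations"]))
```

The two rows that share a gene identifier end up in a single document, and the row without gene information is dropped, which mirrors the shape of the documents the parser yields.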

---