# Process HUGO Gene Nomenclature Committee Data

Jupyter Notebook that downloads and preprocesses the source files before they are transformed to BioLink RDF.

## Download files

The download can be defined:
* in this Jupyter Notebook using Python
* as a Bash script in the `download/download.sh` file, and executed using `d2s download hgnc`
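
As an illustration of the first option, here is a minimal sketch of a download done with the Python standard library only. `download_file` is a hypothetical helper and not part of d2s. Unlike `wget -N`, it does not compare timestamps and always fetches the file again.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url, target_dir="."):
    """Stream the resource at `url` into `target_dir` and return the local path."""
    # Derive the local file name from the last segment of the URL path
    filename = Path(urlparse(url).path).name
    target = Path(target_dir) / filename
    # Copy in chunks so the whole file never has to fit in memory
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_file(url, input_folder)` for each entry of a URL list would place the files in the dataset input folder. Archive extraction is not handled by this sketch.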



In [9]:
import os
import glob
import requests
import functools
import shutil
import pandas as pd 

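# Convert a TSV file to CSV with pandas (loads the whole file in memory)
def convert_tsv_to_csv(tsv_file):
    table = pd.read_table(tsv_file, sep='\t')
    # Write next to the input file, replacing its extension with .csv
    table.to_csv(os.path.splitext(tsv_file)[0] + '.csv', index=False)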

# Variables and path for the dataset
dataset_id = 'hgnc'
dsri_flink_pod_id = 'flink-jobmanager-###'
input_folder = '/notebooks/workspace/input/' + dataset_id
mapping_folder = '/notebooks/datasets/' + dataset_id + '/mapping'
os.makedirs(input_folder, exist_ok=True)

In [None]:
# Use input folder as working folder
os.chdir(input_folder)

files_to_download = [
    'https://raw.githubusercontent.com/MaastrichtU-IDS/d2s-scripts-repository/master/resources/cohd-sample/concepts.tsv'
]

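# Download each file and uncompress it if needed
# Uses shell tools (wget, tar, unzip), chosen as faster and more reliable than Python here
for download_url in files_to_download:
    # -N only downloads again when the remote file is newer than the local copy
    os.system('wget -N ' + download_url)
    # Raw strings keep the backslashes intact for the shell
    os.system(r'find . -name "*.tar.gz" -exec tar -xzvf {} \;')
    os.system(r'unzip -o \*.zip')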

# Rename .txt to .tsv
listing = glob.glob('*.txt')
for filename in listing:
    os.rename(filename, filename[:-4] + '.tsv')

    
# Convert TSV to CSV so it can be processed with the RMLStreamer
# Option 1: pandas (loads the file in memory)
convert_tsv_to_csv('concepts.tsv')
# Option 2: Bash (sed), left commented out
# cmd_convert_csv = """sed -e 's/"/\\"/g' -e 's/\t/","/g' -e 's/^/"/' -e 's/$/"/'  -e 's/\r//' concepts.tsv > concepts.csv"""
# os.system(cmd_convert_csv)
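
A possible middle ground between the pandas and Bash conversions above is to stream the file row by row with the standard `csv` module. The sketch below is an assumption about what such a variant could look like, not code from the original notebook; `convert_tsv_to_csv_streaming` is a hypothetical name. It assumes the TSV does not use quoting, so double quotes are treated as ordinary characters on input and doubled on output.

```python
import csv


def convert_tsv_to_csv_streaming(tsv_file, csv_file):
    """Rewrite a tab-separated file as CSV one row at a time."""
    with open(tsv_file, newline="", encoding="utf-8") as src, \
         open(csv_file, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: double quotes in the TSV are read as plain characters
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        # QUOTE_ALL quotes every field, like the commented-out sed command intends
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in reader:
            writer.writerow(row)
```

Because every value stays a string, this variant avoids the type inference pandas applies on load (for example, an integer column with empty cells being written back as floats), at the cost of producing a CSV in which all fields are quoted.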

## Process and load concepts

We use CWL workflows to integrate the data with SPARQL queries. The structured data is first converted to generic RDF that mirrors the structure of the input, then mapped to BioLink using SPARQL. The SPARQL queries are defined in `.rq` files and can be [accessed on GitHub](https://github.com/MaastrichtU-IDS/d2s-project-template/tree/master/datasets/hgnc/mapping).
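
To make the two-step idea concrete, the sketch below shows what a "generic RDF" conversion of tabular data could look like: each row becomes a subject, each column name a predicate, and each cell a literal. The namespaces, the function names and the query string are hypothetical and chosen for illustration only; the actual d2s workflows use their own vocabulary, and the real mapping queries are the `.rq` files linked above. Column names are assumed to be safe to use in an IRI.

```python
import csv
import io

# Hypothetical namespaces, for illustration only
DATA_NS = "https://example.org/data/"
MODEL_NS = "https://example.org/model/"


def escape_literal(value):
    """Escape a string for use as an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_generic_triples(csv_text, dataset_id):
    """Yield one N-Triples line per non-empty cell of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = f"<{DATA_NS}{dataset_id}/row/{index}>"
        for column, value in row.items():
            if value:  # skip empty cells
                predicate = f"<{MODEL_NS}{column}>"
                yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Hypothetical shape of a mapping query (step 2), not one of the real .rq files
MAPPING_QUERY_SKETCH = """
PREFIX bl: <https://w3id.org/biolink/vocab/>
PREFIX m: <https://example.org/model/>
INSERT { ?gene a bl:Gene ; bl:name ?name . }
WHERE {
  ?row m:symbol ?symbol ; m:name ?name .
  BIND(IRI(CONCAT("https://example.org/gene/", ?symbol)) AS ?gene)
}
"""
```

The point of the split is that the first step needs no knowledge of the target model, so all dataset-specific decisions live in the SPARQL mapping.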

Start the required services (here on our server, selected with the `-d trek` argument):

```bash
d2s start tmp-virtuoso drill -d trek
```

Run one of the following d2s commands in the d2s-project folder:

```bash
d2s run csv-virtuoso.cwl hgnc
d2s run xml-virtuoso.cwl hgnc
```

[HCLS metadata](https://www.w3.org/TR/hcls-dataset/) can be computed for the hgnc graph:

```bash
d2s run compute-hcls-metadata.cwl hgnc
```

## Load the BioLink model

Load the [BioLink model ontology as Turtle](https://github.com/biolink/biolink-model/blob/master/biolink-model.ttl) into the graph `https://w3id.org/biolink/biolink-model` in the triplestore.
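
One way to do this, if the triplestore accepts SPARQL 1.1 Update requests, is a `LOAD ... INTO GRAPH` statement. The sketch below only builds the request string; `build_load_update` is a hypothetical helper, and the source URL in the usage note is a placeholder. Whether `LOAD` is permitted depends on the triplestore configuration, and a bulk loader may be the better choice on some setups.

```python
def build_load_update(source_url, graph_iri):
    """Return a SPARQL 1.1 Update request loading `source_url` into `graph_iri`."""
    for iri in (source_url, graph_iri):
        # Reject characters that may not appear inside <...> in SPARQL
        if any(ch in iri for ch in '<>"{}|^`\\ '):
            raise ValueError(f"not a valid IRI: {iri!r}")
    return f"LOAD <{source_url}> INTO GRAPH <{graph_iri}>"
```

For example, `build_load_update("https://example.org/biolink-model.ttl", "https://w3id.org/biolink/biolink-model")` returns the statement to send to the update endpoint. The source URL must serve the Turtle file itself; the GitHub link above points to an HTML page, so the raw file URL would be needed instead.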
