In [8]:
import pandas as pd 
import hashlib
from datetime import date
from urllib.parse import quote_plus
import json
from os import listdir, path, remove
import gzip
import shutil
import requests
import time
from IPython.display import Markdown, display

# Ingesting LINCS Data into the C2M2

*Created by Sherry Xie (LINCS DCC, Ma'ayan Lab @ Icahn School of Medicine at Mount Sinai)*

This notebook walks through an example of building C2M2 metadata tables from a subset of LINCS data. It is **not** meant to be used directly as a pipeline: each DCC's data is unique, must be processed individually, and will look different from what is presented here. It does, however, provide a step-by-step guide to how the C2M2 may be used in practice.

## Initial Steps

First, make sure you have the latest published C2M2 model schema. The current version (and earlier versions) can be found [here](https://osf.io/c63aw/). 
You may also want to download the latest ontology reference files and term-builder script [here](https://osf.io/bq6k9/). 

The table names, and the fields of each table, can be read directly from the schema.

In [2]:
def extract_tables(): 
    # list every table name defined in the C2M2 datapackage schema
    with open('C2M2_datapackage.json', 'r') as fh:
        f = json.load(fh)
    return [obj['name'] for obj in f['resources']]
print(extract_tables())

['file', 'biosample', 'subject', 'dcc', 'project', 'project_in_project', 'collection', 'collection_in_collection', 'file_describes_collection', 'collection_defined_by_project', 'file_in_collection', 'biosample_in_collection', 'subject_in_collection', 'file_describes_biosample', 'file_describes_subject', 'biosample_from_subject', 'biosample_disease', 'subject_disease', 'biosample_substance', 'subject_substance', 'biosample_gene', 'subject_race', 'subject_role_taxonomy', 'assay_type', 'ncbi_taxonomy', 'anatomy', 'file_format', 'data_type', 'disease', 'compound', 'substance', 'gene', 'id_namespace']


In [3]:
def extract_cols(tablename): 
    # get columns for {tablename} table directly from C2M2 datapackage schema
    with open('C2M2_datapackage.json', 'r') as fh:
        f = json.load(fh)
    for obj in f['resources']: 
        if obj['name'] == tablename: 
            return [field['name'] for field in obj['schema']['fields']]
    raise Exception(f"'{tablename}' is not a valid C2M2 table.")

print("FILE.TSV:", extract_cols('file'), "\n")
print("BIOSAMPLE.TSV:", extract_cols('biosample'), "\n")
print("SUBJECT.TSV:", extract_cols('subject'), "\n")

FILE.TSV: ['id_namespace', 'local_id', 'project_id_namespace', 'project_local_id', 'persistent_id', 'creation_time', 'size_in_bytes', 'uncompressed_size_in_bytes', 'sha256', 'md5', 'filename', 'file_format', 'compression_format', 'data_type', 'assay_type', 'mime_type', 'bundle_collection_id_namespace', 'bundle_collection_local_id'] 

BIOSAMPLE.TSV: ['id_namespace', 'local_id', 'project_id_namespace', 'project_local_id', 'persistent_id', 'creation_time', 'assay_type', 'anatomy'] 

SUBJECT.TSV: ['id_namespace', 'local_id', 'project_id_namespace', 'project_local_id', 'persistent_id', 'creation_time', 'granularity', 'sex', 'ethnicity', 'age_at_enrollment'] 
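
`extract_cols` re-reads the datapackage on every call. As a minimal sketch (not part of the original notebook), the schema could instead be parsed once and cached. The helper names `load_schema` and `cols_for` and the two-table toy datapackage are hypothetical stand-ins; the toy file only mimics the `resources` / `schema` / `fields` layout used above.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_schema(schema_path):
    """Parse the datapackage once and map table name -> list of column names."""
    with open(schema_path, 'r') as fh:
        package = json.load(fh)
    return {
        res['name']: [field['name'] for field in res['schema']['fields']]
        for res in package['resources']
    }


def cols_for(tablename, schema_path):
    """Return the columns of one table, reusing the cached schema."""
    tables = load_schema(schema_path)
    if tablename not in tables:
        raise KeyError(f"'{tablename}' is not a valid C2M2 table.")
    return tables[tablename]


# Toy stand-in for C2M2_datapackage.json (hypothetical, two tables only)
toy_package = {
    'resources': [
        {'name': 'id_namespace',
         'schema': {'fields': [{'name': 'id'}, {'name': 'abbreviation'},
                               {'name': 'name'}, {'name': 'description'}]}},
        {'name': 'project_in_project',
         'schema': {'fields': [{'name': 'parent_project_id_namespace'},
                               {'name': 'parent_project_local_id'},
                               {'name': 'child_project_id_namespace'},
                               {'name': 'child_project_local_id'}]}},
    ]
}
toy_path = str(Path(tempfile.mkdtemp()) / 'toy_datapackage.json')
with open(toy_path, 'w') as fh:
    json.dump(toy_package, fh)

id_namespace_cols = cols_for('id_namespace', toy_path)
p_in_p_cols = cols_for('project_in_project', toy_path)  # served from the cache
print(id_namespace_cols)
```

With the real schema, `toy_path` would simply be replaced by `'C2M2_datapackage.json'`.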



## LINCS Data

The LINCS data used in this notebook is a subset of the L1000 CRISPR KO and chemical perturbation gene expression profiles from the LINCS L1000 2021 data release. As an optional pre-processing step, we split the original files into sets of replicates and renamed all files so that the names are more descriptive. 

The first step in building the C2M2 tables is to gather all of the LINCS metadata. The anatomy, gene, chemical, and disease information will all be needed in later steps.

### LINCS L1000 data

The majority of LINCS data consists of files containing L1000 gene expression profiles, which are stored in Amazon S3 buckets. The link to each file doubles as the persistent ID for that file. 

The replicate profiles in each file correspond to a single perturbational signature: gene expression measured at a specific timepoint after a specific treatment is applied to a given cell line, at a specified dose where applicable. Each perturbational signature is treated as its own biosample, and the corresponding file generally holds 1-3 replicate profiles. Subjects are the cell lines to which the perturbations are applied. 

The filenames reflect the perturbation performed. This is intentional: it makes the files easier to read and simplifies the mapping between files, biosamples, and subjects.

In [13]:
display(Markdown("**L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4.tsv**"))
display(Markdown("CRISPR knockout perturbation of CDK4 in the A549 cell line"))
pd.read_csv('lincs_data/L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4.tsv.gz', sep='\t', index_col=0).head()

**L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4.tsv**

CRISPR knockout perturbation of CDK4 in the A549 cell line

Unnamed: 0_level_0,HAHN001_A549_96H_X1_B29:A15,HAHN001_A549_96H_X2_B29:A15
symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
DDR1,6.2896,6.036925
PAX8,4.693225,4.2071
GUCA1A,5.41065,5.48125
EPHB3,7.586,7.6845
ESRRA,7.395875,7.823425


In [14]:
display(Markdown("**L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_10uM.tsv**"))
display(Markdown("Treatment of the A375 cell line with 10uM of afatinib"))
pd.read_csv('lincs_data/L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_10uM.tsv.gz', sep='\t', index_col=0).head()

**L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_10uM.tsv**

Treatment of the A375 cell line with 10uM of afatinib

Unnamed: 0_level_0,ABY001_A375_XH_X1_B15:A13
symbol,Unnamed: 1_level_1
NAT2,5.7191
ADA,7.4924
CDH2,6.56405
AKT3,9.1052
MED6,7.451
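
The filename convention shown in the two examples above can be sketched as a small parser. This is an illustration only: the function name `parse_l1000_filename` and the field labels (`batch`, `timepoint`, `well`) are assumptions read off the example filenames, and the sketch assumes that the first four components never contain underscores.

```python
PREFIX = 'L1000_LINCS_DCIC_'


def parse_l1000_filename(fname):
    """Split a renamed L1000 filename into its underscore-separated parts."""
    stem = fname.split('.tsv')[0]
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected filename: {fname}")
    biosample_id = stem[len(PREFIX):]
    # The first four fields are fixed; the remainder is the perturbagen name,
    # followed by a dose for chemical perturbations.
    batch, cell_line, timepoint, well, perturbation = biosample_id.split('_', 4)
    return {
        'file_id': stem,               # used as the file local_id
        'biosample_id': biosample_id,  # used as the biosample local_id
        'batch': batch,
        'cell_line': cell_line,        # used as the subject local_id
        'timepoint': timepoint,
        'well': well,
        'perturbation': perturbation,
    }


crispr = parse_l1000_filename('L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4.tsv.gz')
chem = parse_l1000_filename('L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_10uM.tsv.gz')
print(crispr['cell_line'], crispr['perturbation'])
print(chem['cell_line'], chem['perturbation'])
```

Keeping the perturbagen and dose together in one trailing field avoids guessing where one ends and the other begins, since doses such as `100ug_per_ml` contain underscores themselves.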


### LINCS signature metadata

In [15]:
lincs_meta = pd.read_csv(
    'https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/siginfo_beta.txt', sep='\t', # LINCS metadata
    usecols=['sig_id', 'pert_id', 'cmap_name', 'det_wells', 'pert_idose', 'pert_type', 'cell_iname'], low_memory=False
) 

In [16]:
# limit to only CRISPR and Chemical perturbations for now
lincs_meta = lincs_meta[lincs_meta['pert_type'].isin(['trt_cp', 'trt_xpr'])].copy()

In [17]:
def get_pert_name(row):
    '''
    Get name of LINCS perturbagen. 
    '''
    if pd.isnull(row.cmap_name):
        name = row.pert_id 
    else:
        name = row.cmap_name
    return name.replace('/', '').replace(' ','')

def rename(row): 
    pert_name = get_pert_name(row)
    try:
        if pd.isnull(row.pert_idose):
            rep_id = '_'.join([row.sig_id.split(':')[0], row.det_wells.split('|')[0], pert_name])
        else:
            rep_id = '_'.join([row.sig_id.split(':')[0], row.det_wells.split('|')[0], pert_name, row.pert_idose.replace(' ', '').replace('/', '_per_')])
    except AttributeError:  # a missing (NaN) sig_id, det_wells, or dose has no string methods
        rep_id = ''
    return rep_id

In [18]:
lincs_meta['file_id'] = lincs_meta.apply(rename, axis=1)
lincs_meta.head()

Unnamed: 0,pert_idose,pert_id,sig_id,pert_type,cell_iname,det_wells,cmap_name,file_id
0,100 ug/ml,BRD-U44432129,MET001_N8_XH:BRD-U44432129:100:336,trt_cp,NAMEC8,H05|H06|H07|H08,BRD-U44432129,MET001_N8_XH_H05_BRD-U44432129_100ug_per_ml
1,10 uM,BRD-K81418486,ABY001_A549_XH:BRD-K81418486:10:3,trt_cp,A549,L04|L08|L12,vorinostat,ABY001_A549_XH_L04_vorinostat_10uM
2,2.5 uM,BRD-K70511574,ABY001_HT29_XH:BRD-K70511574:2.5:24,trt_cp,HT29,E18|E22,HMN-214,ABY001_HT29_XH_E18_HMN-214_2.5uM
3,10 uM,BRD-K81418486,LTC002_HME1_3H:BRD-K81418486:10,trt_cp,HME1,F19,vorinostat,LTC002_HME1_3H_F19_vorinostat_10uM
4,10 uM,BRD-A61304759,ABY001_H1975_XH:BRD-A61304759:10:3,trt_cp,H1975,P01|P05|P09,tanespimycin,ABY001_H1975_XH_P01_tanespimycin_10uM


### LINCS assay, file format, data type mappings

In [19]:
lincs_assay = pd.read_csv('lincs_metadata/lincs_assay_mappings.tsv', sep='\t', index_col=0)
lincs_assay.head()

Unnamed: 0_level_0,lincs_technologies,bao_assay_id,obi_assay_id,edam_format_id,edam_data_type_id,mime_type
lincs_assayname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Aggregated small molecule biochemical target activity,,BAO:0010050,OBI:0001632,,,
Aggregated small molecule biochemical target activity,Bead-based immunoassay,BAO:0010050,OBI:0002970,format:3475,data:2603,text/tab-separated-values
ATAC-seq epigenetic profiling assay,ATAC-seq,BAO:0010038,OBI:0002039,format:3612,data:3002,
Bead-based immunoassay for protein state,Bead-based immunoassay,BAO:0010050,OBI:0002970,format:3475,data:2603,text/tab-separated-values
ELISA protein secretion profiling assay,ELISA,BAO:0000134,OBI:0000661,format:3475,data:2603,text/tab-separated-values


In [20]:
lincs_assay['isL1000'] = lincs_assay.index.map(lambda x: 'L1000' in x)
lincs_assay[lincs_assay['isL1000']]

Unnamed: 0_level_0,lincs_technologies,bao_assay_id,obi_assay_id,edam_format_id,edam_data_type_id,mime_type,isL1000
lincs_assayname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
L1000 mRNA profiling assay,L1000,BAO:0010046,OBI:0002965,format:3475,data:0928,text/tab-separated-values,True


In [21]:
l1000_format = lincs_assay.loc['L1000 mRNA profiling assay', 'edam_format_id'] # TSV
l1000_dtype = lincs_assay.loc['L1000 mRNA profiling assay', 'edam_data_type_id'] # gene expression profile
l1000_assay = lincs_assay.loc['L1000 mRNA profiling assay', 'obi_assay_id'] # L1000 mRNA profiling assay
l1000_mime = lincs_assay.loc['L1000 mRNA profiling assay', 'mime_type'] # TSV 

l1000_compression = 'format:3989' # GZIP

### LINCS cell line (disease/anatomy) metadata

Cell line metadata (disease and tissue names) was obtained from the LDP API.

In [22]:
lincs_cell_disease = pd.read_csv('lincs_metadata/lincs_disease_ontology_mappings.tsv', sep='\t')
lincs_cell_disease.head()

Unnamed: 0,cell_line,disease,doid
0,PC3,prostate adenocarcinoma,DOID:2526
1,A375,melanoma,DOID:1909
2,A549,lung cancer,DOID:1324
3,H1975,lung cancer,DOID:1324
4,HEPG2,carcinoma,DOID:305


In [23]:
lincs_meta.shape

(861161, 8)

In [24]:
lincs_meta = pd.merge(left=lincs_meta, left_on='cell_iname', right=lincs_cell_disease, right_on='cell_line', how='left')
lincs_meta.head()

Unnamed: 0,pert_idose,pert_id,sig_id,pert_type,cell_iname,det_wells,cmap_name,file_id,cell_line,disease,doid
0,100 ug/ml,BRD-U44432129,MET001_N8_XH:BRD-U44432129:100:336,trt_cp,NAMEC8,H05|H06|H07|H08,BRD-U44432129,MET001_N8_XH_H05_BRD-U44432129_100ug_per_ml,,,
1,10 uM,BRD-K81418486,ABY001_A549_XH:BRD-K81418486:10:3,trt_cp,A549,L04|L08|L12,vorinostat,ABY001_A549_XH_L04_vorinostat_10uM,A549,lung cancer,DOID:1324
2,2.5 uM,BRD-K70511574,ABY001_HT29_XH:BRD-K70511574:2.5:24,trt_cp,HT29,E18|E22,HMN-214,ABY001_HT29_XH_E18_HMN-214_2.5uM,HT29,colon adenocarcinoma,DOID:1520
3,10 uM,BRD-K81418486,LTC002_HME1_3H:BRD-K81418486:10,trt_cp,HME1,F19,vorinostat,LTC002_HME1_3H_F19_vorinostat_10uM,HME1,leukemia,DOID:1240
4,10 uM,BRD-A61304759,ABY001_H1975_XH:BRD-A61304759:10:3,trt_cp,H1975,P01|P05|P09,tanespimycin,ABY001_H1975_XH_P01_tanespimycin_10uM,H1975,lung cancer,DOID:1324
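
The left merge above keeps signatures whose cell line has no disease mapping (such as NAMEC8) and fills the new columns with missing values. As a hedged sketch on toy data, the `validate` and `indicator` options of `pd.merge` can confirm that the row count is preserved and list the unmapped cell lines. The `sig_id` values below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for lincs_meta and lincs_cell_disease
signatures = pd.DataFrame({
    'sig_id': ['s1', 's2', 's3'],
    'cell_iname': ['A549', 'NAMEC8', 'A375'],
})
cell_disease = pd.DataFrame({
    'cell_line': ['A549', 'A375'],
    'doid': ['DOID:1324', 'DOID:1909'],
})

merged = pd.merge(
    left=signatures, left_on='cell_iname',
    right=cell_disease, right_on='cell_line',
    how='left',
    validate='many_to_one',  # raises if a cell line appears twice on the right
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A many-to-one left merge must not add or drop signature rows
assert len(merged) == len(signatures)

unmapped = sorted(merged.loc[merged['_merge'] == 'left_only', 'cell_iname'].unique())
print(unmapped)
```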


### LINCS chemical/substance metadata

BRD-ID to PubChem SID mappings were obtained via bulk download from PubChem.

In [25]:
lincs_chem = pd.read_csv('https://s3.amazonaws.com/macchiato.clue.io/builds/LINCS2020/compoundinfo_beta.txt', sep='\t')
lincs_chem.head()

Unnamed: 0,pert_id,cmap_name,target,moa,canonical_smiles,inchi_key,compound_aliases
0,BRD-A08715367,L-theanine,,,CCNC(=O)CCC(N)C(O)=O,DATAGRPVKZEWHA-UHFFFAOYSA-N,l-theanine
1,BRD-A12237696,L-citrulline,,,NC(CCCNC(N)=O)C(O)=O,RHGKLRLOHDJJDR-UHFFFAOYSA-N,l-citrulline
2,BRD-A18795974,BRD-A18795974,,,CCCN(CCC)C1CCc2ccc(O)cc2C1,BLYMJBIZMIGWFK-UHFFFAOYSA-N,7-hydroxy-DPAT
3,BRD-A27924917,BRD-A27924917,,,NCC(O)(CS(O)(=O)=O)c1ccc(Cl)cc1,WBSMZVIMANOCNX-UHFFFAOYSA-N,2-hydroxysaclofen
4,BRD-A35931254,BRD-A35931254,,,CN1CCc2cccc-3c2C1Cc1ccc(O)c(O)c-31,VMWNQDUVQKEIOC-UHFFFAOYSA-N,r(-)-apomorphine


In [26]:
lincs_brd_sid = pd.read_csv('lincs_metadata/lincs_pubchemSID_brdID.tsv', sep='\t')
lincs_brd_sid.head()

Unnamed: 0,pubchem_sid,brd_id
0,376252234,BRD-K99992083
1,376252233,BRD-K99866568
2,376252232,BRD-K95094737
3,376252231,BRD-K94510310
4,376252230,BRD-K93627375


In [27]:
lincs_chem_sid = pd.merge(left=lincs_chem, left_on='pert_id', right=lincs_brd_sid, right_on='brd_id')
lincs_chem_sid = lincs_chem_sid[['cmap_name', 'pert_id', 'pubchem_sid']]
lincs_chem_sid.head()

Unnamed: 0,cmap_name,pert_id,pubchem_sid
0,BRD-A02726762,BRD-A02726762,376251578
1,BRD-A18573497,BRD-A18573497,376251581
2,BRD-A26627799,BRD-A26627799,376251583
3,BRD-A35430844,BRD-A35430844,376251589
4,BRD-A41228941,BRD-A41228941,376251591


In [28]:
lincs_meta = pd.merge(left=lincs_meta, left_on='pert_id', right=lincs_chem_sid, right_on='pert_id', how='left')
lincs_meta.head()

Unnamed: 0,pert_idose,pert_id,sig_id,pert_type,cell_iname,det_wells,cmap_name_x,file_id,cell_line,disease,doid,cmap_name_y,pubchem_sid
0,100 ug/ml,BRD-U44432129,MET001_N8_XH:BRD-U44432129:100:336,trt_cp,NAMEC8,H05|H06|H07|H08,BRD-U44432129,MET001_N8_XH_H05_BRD-U44432129_100ug_per_ml,,,,,
1,10 uM,BRD-K81418486,ABY001_A549_XH:BRD-K81418486:10:3,trt_cp,A549,L04|L08|L12,vorinostat,ABY001_A549_XH_L04_vorinostat_10uM,A549,lung cancer,DOID:1324,,
2,2.5 uM,BRD-K70511574,ABY001_HT29_XH:BRD-K70511574:2.5:24,trt_cp,HT29,E18|E22,HMN-214,ABY001_HT29_XH_E18_HMN-214_2.5uM,HT29,colon adenocarcinoma,DOID:1520,,
3,10 uM,BRD-K81418486,LTC002_HME1_3H:BRD-K81418486:10,trt_cp,HME1,F19,vorinostat,LTC002_HME1_3H_F19_vorinostat_10uM,HME1,leukemia,DOID:1240,,
4,10 uM,BRD-A61304759,ABY001_H1975_XH:BRD-A61304759:10:3,trt_cp,H1975,P01|P05|P09,tanespimycin,ABY001_H1975_XH_P01_tanespimycin_10uM,H1975,lung cancer,DOID:1324,,


### LINCS gene data

In [29]:
def get_ensembl(gene):
    ensembl_url = f"https://rest.ensembl.org/lookup/symbol/homo_sapiens/{gene}?content-type=application/json"
    try:
        time.sleep(2)
        resp = requests.get(ensembl_url).json()
        if "error" in resp.keys(): 
            return ''
        else: 
            return resp['id']
    except requests.exceptions.ConnectionError:
        print('connection lost')
        return ''  # return an empty ID rather than the exception object

In [30]:
# for efficiency, keep only the metadata rows for the files that we are processing
_filelist = [x.replace('L1000_LINCS_DCIC_', '').split('.tsv')[0] for x in listdir('lincs_data')]
lincs_meta = lincs_meta[lincs_meta['file_id'].isin(_filelist)]
lincs_meta.head()

Unnamed: 0,pert_idose,pert_id,sig_id,pert_type,cell_iname,det_wells,cmap_name_x,file_id,cell_line,disease,doid,cmap_name_y,pubchem_sid
9194,,BRDN0001054908,HAHN001_A549_96H:A03,trt_xpr,A549,A03,AKT1,HAHN001_A549_96H_A03_AKT1,A549,lung cancer,DOID:1324,,
17496,,HAHN-000191,HAHN001_A549_96H:A18,trt_xpr,A549,A18,MAPK1,HAHN001_A549_96H_A18_MAPK1,A549,lung cancer,DOID:1324,,
21260,10 uM,BRD-K19687926,ABY001_A375_XH:BRD-K19687926:10:24,trt_cp,A375,A16|A20|A24,lapatinib,ABY001_A375_XH_A16_lapatinib_10uM,A375,melanoma,DOID:1909,,
25498,,BRDN0001053869,HAHN001_A549_96H:A15,trt_xpr,A549,A15,CDK4,HAHN001_A549_96H_A15_CDK4,A549,lung cancer,DOID:1324,,
25940,10 uM,BRD-K66175015,ABY001_A375_XH:BRD-K66175015:10:24,trt_cp,A375,A13|A17|A21,afatinib,ABY001_A375_XH_A13_afatinib_10uM,A375,melanoma,DOID:1909,,


In [31]:
lincs_meta['ensembl'] = lincs_meta.apply(lambda x: get_ensembl(x['cmap_name_x']) if x['pert_type'] == 'trt_xpr' else '', axis=1)
lincs_meta.head()

Unnamed: 0,pert_idose,pert_id,sig_id,pert_type,cell_iname,det_wells,cmap_name_x,file_id,cell_line,disease,doid,cmap_name_y,pubchem_sid,ensembl
9194,,BRDN0001054908,HAHN001_A549_96H:A03,trt_xpr,A549,A03,AKT1,HAHN001_A549_96H_A03_AKT1,A549,lung cancer,DOID:1324,,,ENSG00000142208
17496,,HAHN-000191,HAHN001_A549_96H:A18,trt_xpr,A549,A18,MAPK1,HAHN001_A549_96H_A18_MAPK1,A549,lung cancer,DOID:1324,,,ENSG00000100030
21260,10 uM,BRD-K19687926,ABY001_A375_XH:BRD-K19687926:10:24,trt_cp,A375,A16|A20|A24,lapatinib,ABY001_A375_XH_A16_lapatinib_10uM,A375,melanoma,DOID:1909,,,
25498,,BRDN0001053869,HAHN001_A549_96H:A15,trt_xpr,A549,A15,CDK4,HAHN001_A549_96H_A15_CDK4,A549,lung cancer,DOID:1324,,,ENSG00000135446
25940,10 uM,BRD-K66175015,ABY001_A375_XH:BRD-K66175015:10:24,trt_cp,A375,A13|A17|A21,afatinib,ABY001_A375_XH_A13_afatinib_10uM,A375,melanoma,DOID:1909,,,
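
Because each lookup waits two seconds, it may help to query every distinct gene symbol only once and map the results back onto the rows. The following is a sketch under stated assumptions: `fake_lookup` is a hypothetical offline stand-in for `get_ensembl`, and the toy frame reuses two symbol-to-ID pairs from the output above.

```python
import pandas as pd

calls = []


def fake_lookup(symbol):
    """Hypothetical stand-in for get_ensembl that records each call."""
    calls.append(symbol)
    known = {'AKT1': 'ENSG00000142208', 'CDK4': 'ENSG00000135446'}
    return known.get(symbol, '')


meta = pd.DataFrame({
    'cmap_name_x': ['AKT1', 'CDK4', 'AKT1', 'afatinib'],
    'pert_type': ['trt_xpr', 'trt_xpr', 'trt_xpr', 'trt_cp'],
})

# Only CRISPR perturbations name a gene
is_crispr = meta['pert_type'] == 'trt_xpr'

# One lookup per distinct symbol instead of one per row
symbol_to_id = {
    symbol: fake_lookup(symbol)
    for symbol in meta.loc[is_crispr, 'cmap_name_x'].unique()
}

# Mask non-CRISPR rows, map the remaining symbols, and fill the gaps with ''
meta['ensembl'] = meta['cmap_name_x'].where(is_crispr).map(symbol_to_id).fillna('')
print(meta)
```

In the notebook itself, `fake_lookup` would be replaced by `get_ensembl`; the number of requests then scales with the number of distinct genes rather than the number of signatures.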


### LINCS cell line anatomy

In [32]:
lincs_anat = pd.read_csv('lincs_metadata/lincs_cell_anatomy.tsv', sep='\t')
lincs_anat.head()

Unnamed: 0,cell,anatomy
0,A375,UBERON:0002097
1,A549,UBERON:0002048
2,H1975,UBERON:0002048
3,HEPG2,UBERON:0002107
4,HT29,UBERON:0000160


In [33]:
lincs_meta = pd.merge(left=lincs_meta, left_on='cell_iname', right=lincs_anat, right_on='cell', how='left')
lincs_meta.head()

Unnamed: 0,pert_idose,pert_id,sig_id,pert_type,cell_iname,det_wells,cmap_name_x,file_id,cell_line,disease,doid,cmap_name_y,pubchem_sid,ensembl,cell,anatomy
0,,BRDN0001054908,HAHN001_A549_96H:A03,trt_xpr,A549,A03,AKT1,HAHN001_A549_96H_A03_AKT1,A549,lung cancer,DOID:1324,,,ENSG00000142208,A549,UBERON:0002048
1,,HAHN-000191,HAHN001_A549_96H:A18,trt_xpr,A549,A18,MAPK1,HAHN001_A549_96H_A18_MAPK1,A549,lung cancer,DOID:1324,,,ENSG00000100030,A549,UBERON:0002048
2,10 uM,BRD-K19687926,ABY001_A375_XH:BRD-K19687926:10:24,trt_cp,A375,A16|A20|A24,lapatinib,ABY001_A375_XH_A16_lapatinib_10uM,A375,melanoma,DOID:1909,,,,A375,UBERON:0002097
3,,BRDN0001053869,HAHN001_A549_96H:A15,trt_xpr,A549,A15,CDK4,HAHN001_A549_96H_A15_CDK4,A549,lung cancer,DOID:1324,,,ENSG00000135446,A549,UBERON:0002048
4,10 uM,BRD-K66175015,ABY001_A375_XH:BRD-K66175015:10:24,trt_cp,A375,A13|A17|A21,afatinib,ABY001_A375_XH_A13_afatinib_10uM,A375,melanoma,DOID:1909,,,,A375,UBERON:0002097


In [34]:
lincs_meta = lincs_meta.set_index('file_id')

## Build DCC-relevant tables

### ID Namespace

In [24]:
extract_cols('id_namespace')

['id', 'abbreviation', 'name', 'description']

In [25]:
idn = pd.DataFrame(
    [
        [
            'https://www.lincsproject.org', 
             'LINCS', 
             'Library of Integrated Network-Based Cellular Signatures', 
             'A library that catalogs changes that occur when different types of cells are exposed to a variety of agents that disrupt normal cellular functions'
        ]
    ], 
    columns=extract_cols('id_namespace')
)
idn.head()

Unnamed: 0,id,abbreviation,name,description
0,https://www.lincsproject.org,LINCS,Library of Integrated Network-Based Cellular S...,A library that catalogs changes that occur whe...


### Project, Project in Project

In [26]:
print(extract_cols('project'))

['id_namespace', 'local_id', 'persistent_id', 'creation_time', 'abbreviation', 'name', 'description']


In [27]:
proj_table = pd.DataFrame(
    [
        [
            'https://www.lincsproject.org', 
            'LINCS', 
            'https://www.lincsproject.org', 
            '', 
            'LINCS', 
            'Library of Integrated Network-Based Cellular Signatures', 
            'The Library of Integrated Network-Based Cellular Signatures (LINCS) Program aims to create a network-based understanding of biology'
        ], 
        [
            'https://www.lincsproject.org', 
            'LINCS-2021', 
            'https://clue.io/data/CMap2020#LINCS2020', 
            date(2020, 11, 20), 
            'LINCS-2021', 
            'LINCS 2021 Data Release', 
            'The 2021 beta release of the CMap LINCS gene expression resource'
        ]
    ], 
    columns=extract_cols('project')
)
proj_table.head()

Unnamed: 0,id_namespace,local_id,persistent_id,creation_time,abbreviation,name,description
0,https://www.lincsproject.org,LINCS,https://www.lincsproject.org,,LINCS,Library of Integrated Network-Based Cellular S...,The Library of Integrated Network-Based Cellul...
1,https://www.lincsproject.org,LINCS-2021,https://clue.io/data/CMap2020#LINCS2020,2020-11-20,LINCS-2021,LINCS 2021 Data Release,The 2021 beta release of the CMap LINCS gene e...


In [28]:
print(extract_cols('project_in_project'))

['parent_project_id_namespace', 'parent_project_local_id', 'child_project_id_namespace', 'child_project_local_id']


In [29]:
p_in_p = pd.DataFrame(
    [
        [
            'https://www.lincsproject.org', 
            'LINCS', 
            'https://www.lincsproject.org', 
            'LINCS-2021'
        ]
    ],
    columns=extract_cols('project_in_project')
)
p_in_p.head()

Unnamed: 0,parent_project_id_namespace,parent_project_local_id,child_project_id_namespace,child_project_local_id
0,https://www.lincsproject.org,LINCS,https://www.lincsproject.org,LINCS-2021


### DCC

In [30]:
print(extract_cols('dcc'))

['id', 'dcc_name', 'dcc_abbreviation', 'dcc_description', 'contact_email', 'contact_name', 'dcc_url', 'project_id_namespace', 'project_local_id']


In [31]:
dcc = pd.DataFrame(
    [ 
        [
            'cfde_registry_dcc:lincs', 
            'Library of Integrated Network-based Cellular Signatures', 
            'LINCS', 
            'The LINCS Program aims to create a network-based understanding of biology by cataloging changes in gene expression', 
            'avi.maayan@mssm.edu', 
            'Avi Ma\'ayan', 
            'https://www.lincsproject.org/', 
            'https://www.lincsproject.org', 
            'LINCS'
        ]
    ], 
    columns=extract_cols('dcc')
)
dcc.head()

Unnamed: 0,id,dcc_name,dcc_abbreviation,dcc_description,contact_email,contact_name,dcc_url,project_id_namespace,project_local_id
0,cfde_registry_dcc:lincs,Library of Integrated Network-based Cellular S...,LINCS,The LINCS Program aims to create a network-bas...,avi.maayan@mssm.edu,Avi Ma'ayan,https://www.lincsproject.org/,https://www.lincsproject.org,LINCS


### Collection

In [32]:
coll = pd.DataFrame(columns=extract_cols('collection'))

coll.head()

Unnamed: 0,id_namespace,local_id,persistent_id,creation_time,abbreviation,name,description


## Build core entity tables

### File

In [33]:
def get_sizes(fname, fdir):
    comp = path.getsize(f"{fdir}/{fname}")
    with gzip.open(f"{fdir}/{fname}", 'rb') as f: 
        with open(f"{fdir}/{fname.replace('.gz', '')}", 'wb') as g:
            shutil.copyfileobj(f, g)
    uncomp = path.getsize(f"{fdir}/{fname.replace('.gz', '')}")
    remove(f"{fdir}/{fname.replace('.gz', '')}")
    return comp, uncomp

def get_hashes(fname, fdir): 
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(f"{fdir}/{fname}", 'rb') as data:
        for chunk in iter(lambda: data.read(4096), b""):
            md5.update(chunk)
            sha256.update(chunk)
    md5_hash = md5.hexdigest()
    sha256_hash = sha256.hexdigest()
    return md5_hash, sha256_hash
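
`get_sizes` writes a decompressed copy of each file to disk and then deletes it. As an alternative sketch (not the method used in this notebook), the uncompressed size can be counted while streaming through the gzip file, alongside the hashes of the compressed bytes. The function name `describe_gzip` and the demo payload are hypothetical.

```python
import gzip
import hashlib
import os
import tempfile


def describe_gzip(filepath, chunk_size=1 << 20):
    """Return sizes and hashes for a .gz file without writing a temporary copy."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    # Hash the compressed bytes, since that is the file being distributed
    with open(filepath, 'rb') as raw:
        for chunk in iter(lambda: raw.read(chunk_size), b''):
            md5.update(chunk)
            sha256.update(chunk)
    # Count the decompressed bytes chunk by chunk
    uncompressed = 0
    with gzip.open(filepath, 'rb') as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b''):
            uncompressed += len(chunk)
    return {
        'size_in_bytes': os.path.getsize(filepath),
        'uncompressed_size_in_bytes': uncompressed,
        'md5': md5.hexdigest(),
        'sha256': sha256.hexdigest(),
    }


# Demo on a small throwaway file
payload = b'symbol\tX1\nDDR1\t6.2896\n' * 1000
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tsv.gz')
with gzip.open(demo_path, 'wb') as fh:
    fh.write(payload)

info = describe_gzip(demo_path)
print(info['size_in_bytes'], info['uncompressed_size_in_bytes'])
```

The returned keys match the corresponding column names of the `file` table, so the dictionary can be unpacked straight into a row.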


In [34]:
file_cols = extract_cols('file')
print(file_cols)

['id_namespace', 'local_id', 'project_id_namespace', 'project_local_id', 'persistent_id', 'creation_time', 'size_in_bytes', 'uncompressed_size_in_bytes', 'sha256', 'md5', 'filename', 'file_format', 'compression_format', 'data_type', 'assay_type', 'mime_type', 'bundle_collection_id_namespace', 'bundle_collection_local_id']


In [35]:
def build_file(file_dir, proj_id): 
    file_table = []
    id_namespace = 'https://www.lincsproject.org'
    s3_base = 'https://lincs-dcic.s3.amazonaws.com/LINCS-data-2020'
    files = listdir(file_dir)
    for f in files: 
        if f.startswith('.DS'): continue
        comp_size, uncomp_size = get_sizes(f, file_dir)
        file_md5, file_sha256 = get_hashes(f, file_dir)
        file_table.append([
            id_namespace, 
            f.split('.tsv')[0], 
            id_namespace, 
            proj_id, 
            quote_plus(f"{s3_base}/{f}", safe='/:'), 
            date(2021, 11, 23), 
            comp_size, 
            uncomp_size, 
            file_sha256, 
            file_md5,
            f, 
            l1000_format, 
            l1000_compression, 
            l1000_dtype,
            l1000_assay, 
            l1000_mime, 
            '', 
            ''
        ])
    return pd.DataFrame(file_table, columns=file_cols)

In [37]:
file_table = build_file('lincs_data', 'LINCS-2021')
file_table

Unnamed: 0,id_namespace,local_id,project_id_namespace,project_local_id,persistent_id,creation_time,size_in_bytes,uncompressed_size_in_bytes,sha256,md5,filename,file_format,compression_format,data_type,assay_type,mime_type,bundle_collection_id_namespace,bundle_collection_local_id
0,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A16_lapatinib_...,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,126356,311009,186e6229f384fab25fa9a3262e6d6ec3ec5fd1a67f97a7...,0b6dd79642c9696babbcabe076a61acb,L1000_LINCS_DCIC_ABY001_A375_XH_A16_lapatinib_...,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
1,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A18_MAPK1,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,215803,532038,b3ae124e7db77624995dcda39e48d414cd32d4235f2c00...,c8048aa201b9e1f0f181031d9441f3f7,L1000_LINCS_DCIC_HAHN001_A549_96H_A18_MAPK1.ts...,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
2,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A15_neratinib_...,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,126319,310974,09c49b55c1004eaa11a273d3f68e315a85a2f39d5ffc92...,b7ab9b3f789da9ca7b5aa0c8c0af2f56,L1000_LINCS_DCIC_ABY001_A375_XH_A15_neratinib_...,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
3,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A09_BRAF,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,215559,532320,2980dae80a07745d99877cc8081604e0f09131327bf803...,e88a206d4357ed16fa979619fdfd1e29,L1000_LINCS_DCIC_HAHN001_A549_96H_A09_BRAF.tsv.gz,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
4,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A03_AKT1,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,215916,532392,682ecf32c329c34823f7155aa592101146da45317a5032...,92dcdeed5b40d912d3bfe079585ed3c0,L1000_LINCS_DCIC_HAHN001_A549_96H_A03_AKT1.tsv.gz,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
5,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_10uM,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,126379,310990,d397db54e4f8bd22462c9c231b396607d61f9458cb54fc...,ba7879c362600d3b18223b66c5d9399b,L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_1...,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
6,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,216074,532378,d44ff23417a925c5ba78a7ef0800280eae09bbae408922...,49995a9b7487af33511502fbee8b447b,L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4.tsv.gz,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,
7,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A14_erlotinib_...,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,126244,311037,454f3429230837ff6fa123145d559811bab55e6d9437de...,d8a99604cd08f26847dd678e42d578f2,L1000_LINCS_DCIC_ABY001_A375_XH_A14_erlotinib_...,format:3475,format:3989,data:0928,OBI:0002965,text/tab-separated-values,,


### Biosample

In [38]:
def f2b(fname): 
    return fname.replace('L1000_LINCS_DCIC_', '')

def f2s(fname):
    return fname.split('_')[4]

def b2s(bname):
    return bname.split('_')[1]

In [39]:
print(extract_cols('biosample'))

['id_namespace', 'local_id', 'project_id_namespace', 'project_local_id', 'persistent_id', 'creation_time', 'assay_type', 'anatomy']


In [40]:
def build_biosample(file_df): 
    bio = file_df.copy()
    bio['local_id'] = bio['local_id'].apply(f2b)
    bio['anatomy'] = bio['local_id'].apply(lambda x: lincs_meta.loc[x, 'anatomy'])
    return bio[extract_cols('biosample')]

bio_table = build_biosample(file_table)
bio_table

Unnamed: 0,id_namespace,local_id,project_id_namespace,project_local_id,persistent_id,creation_time,assay_type,anatomy
0,https://www.lincsproject.org,ABY001_A375_XH_A16_lapatinib_10uM,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002097
1,https://www.lincsproject.org,HAHN001_A549_96H_A18_MAPK1,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002048
2,https://www.lincsproject.org,ABY001_A375_XH_A15_neratinib_10uM,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002097
3,https://www.lincsproject.org,HAHN001_A549_96H_A09_BRAF,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002048
4,https://www.lincsproject.org,HAHN001_A549_96H_A03_AKT1,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002048
5,https://www.lincsproject.org,ABY001_A375_XH_A13_afatinib_10uM,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002097
6,https://www.lincsproject.org,HAHN001_A549_96H_A15_CDK4,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002048
7,https://www.lincsproject.org,ABY001_A375_XH_A14_erlotinib_10uM,https://www.lincsproject.org,LINCS-2021,https://lincs-dcic.s3.amazonaws.com/LINCS-data...,2021-11-23,OBI:0002965,UBERON:0002097


### Subject

In [41]:
print(extract_cols('subject'))

['id_namespace', 'local_id', 'project_id_namespace', 'project_local_id', 'persistent_id', 'creation_time', 'granularity', 'sex', 'ethnicity', 'age_at_enrollment']


In [42]:
def build_subject(bio_df):
    sub = bio_df.copy()
    sub['local_id'] = sub['local_id'].apply(b2s)
    sub['project_local_id'] = 'LINCS'
    sub['persistent_id'] = ''
    sub['granularity'] = 'cfde_subject_granularity:4'
    sub['sex'] = '' 
    sub['ethnicity'] = '' 
    sub['age_at_enrollment'] = '' 
    # several biosamples share a cell line, so keep one row per subject ID
    sub = sub.drop_duplicates(subset=['id_namespace', 'local_id'])
    return sub[extract_cols('subject')]

sub_table = build_subject(bio_table)
sub_table

Unnamed: 0,id_namespace,local_id,project_id_namespace,project_local_id,persistent_id,creation_time,granularity,sex,ethnicity,age_at_enrollment
0,https://www.lincsproject.org,A375,https://www.lincsproject.org,LINCS,,2021-11-23,cfde_subject_granularity:4,,,
1,https://www.lincsproject.org,A549,https://www.lincsproject.org,LINCS,,2021-11-23,cfde_subject_granularity:4,,,


## Build core entity association tables

The easiest way to build association tables between core entities is to copy the ID fields from the entity tables. For the LINCS data, the file, biosample, and subject IDs can all be derived from the filenames, so columns can be reused and entity IDs mapped to one another with simple string operations.

Below are examples of how these tables were built for the LINCS data.

### File Describes Biosample

In [43]:
print(extract_cols('file_describes_biosample'))

['file_id_namespace', 'file_local_id', 'biosample_id_namespace', 'biosample_local_id']


In [44]:
def build_fdb(file_df, bio_df): 
    fdb = file_df[['id_namespace', 'local_id']].copy()
    fdb = fdb.rename(columns={'id_namespace': 'file_id_namespace', 'local_id': 'file_local_id'})
    fdb['biosample_id_namespace'] = fdb['file_id_namespace']
    fdb['biosample_local_id'] = fdb['file_local_id'].apply(f2b)
    for row in fdb.itertuples():
        assert row.biosample_local_id in bio_df['local_id'].tolist()
    return fdb[extract_cols('file_describes_biosample')]

build_fdb(file_table, bio_table)

Unnamed: 0,file_id_namespace,file_local_id,biosample_id_namespace,biosample_local_id
0,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A16_lapatinib_...,https://www.lincsproject.org,ABY001_A375_XH_A16_lapatinib_10uM
1,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A18_MAPK1,https://www.lincsproject.org,HAHN001_A549_96H_A18_MAPK1
2,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A15_neratinib_...,https://www.lincsproject.org,ABY001_A375_XH_A15_neratinib_10uM
3,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A09_BRAF,https://www.lincsproject.org,HAHN001_A549_96H_A09_BRAF
4,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A03_AKT1,https://www.lincsproject.org,HAHN001_A549_96H_A03_AKT1
5,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A13_afatinib_10uM,https://www.lincsproject.org,ABY001_A375_XH_A13_afatinib_10uM
6,https://www.lincsproject.org,L1000_LINCS_DCIC_HAHN001_A549_96H_A15_CDK4,https://www.lincsproject.org,HAHN001_A549_96H_A15_CDK4
7,https://www.lincsproject.org,L1000_LINCS_DCIC_ABY001_A375_XH_A14_erlotinib_...,https://www.lincsproject.org,ABY001_A375_XH_A14_erlotinib_10uM
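
When an ID check like the one in `build_fdb` fails, it helps to see which IDs are unmatched. The following is a small sketch of such a report. `missing_references` is a hypothetical helper, and the two frames are toy data (the `NOT_IN_TABLE` ID is made up for illustration).

```python
import pandas as pd


def missing_references(assoc_df, assoc_col, entity_df, entity_col="local_id"):
    """Return the sorted IDs in assoc_df[assoc_col] that are absent from entity_df[entity_col]."""
    known = set(entity_df[entity_col])
    return sorted(set(assoc_df[assoc_col]) - known)


# Toy association and entity tables (hypothetical data).
toy_assoc = pd.DataFrame(
    {"biosample_local_id": ["HAHN001_A549_96H_A18_MAPK1", "NOT_IN_TABLE"]}
)
toy_bio = pd.DataFrame(
    {"local_id": ["HAHN001_A549_96H_A18_MAPK1", "HAHN001_A549_96H_A09_BRAF"]}
)

missing = missing_references(toy_assoc, "biosample_local_id", toy_bio)
print(missing)
```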


### Biosample From Subject

In [45]:
print(extract_cols('biosample_from_subject'))

['biosample_id_namespace', 'biosample_local_id', 'subject_id_namespace', 'subject_local_id', 'age_at_sampling']


In [46]:
def build_bfs(bio_df, sub_df): 
    bfs = bio_df[['id_namespace', 'local_id']].copy()
    bfs = bfs.rename(columns={'id_namespace': 'biosample_id_namespace', 'local_id': 'biosample_local_id'})
    bfs['subject_id_namespace'] = bfs['biosample_id_namespace']
    bfs['subject_local_id'] = bfs['biosample_local_id'].apply(b2s)
    bfs['age_at_sampling'] = ''
    # every derived subject ID must exist in the subject table
    assert bfs['subject_local_id'].isin(sub_df['local_id']).all()
    return bfs[extract_cols('biosample_from_subject')]

build_bfs(bio_table, sub_table)

Unnamed: 0,biosample_id_namespace,biosample_local_id,subject_id_namespace,subject_local_id,age_at_sampling
0,https://www.lincsproject.org,ABY001_A375_XH_A16_lapatinib_10uM,https://www.lincsproject.org,A375,
1,https://www.lincsproject.org,HAHN001_A549_96H_A18_MAPK1,https://www.lincsproject.org,A549,
2,https://www.lincsproject.org,ABY001_A375_XH_A15_neratinib_10uM,https://www.lincsproject.org,A375,
3,https://www.lincsproject.org,HAHN001_A549_96H_A09_BRAF,https://www.lincsproject.org,A549,
4,https://www.lincsproject.org,HAHN001_A549_96H_A03_AKT1,https://www.lincsproject.org,A549,
5,https://www.lincsproject.org,ABY001_A375_XH_A13_afatinib_10uM,https://www.lincsproject.org,A375,
6,https://www.lincsproject.org,HAHN001_A549_96H_A15_CDK4,https://www.lincsproject.org,A549,
7,https://www.lincsproject.org,ABY001_A375_XH_A14_erlotinib_10uM,https://www.lincsproject.org,A375,


In [54]:
bfs_table = build_bfs(bio_table, sub_table)

## Build other association tables

As with the core entity association tables above, the remaining association tables are simple to build if all metadata has been mapped beforehand.

In particular, reusing the core entity association tables is helpful here.
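
"Mapped beforehand" here means having a lookup table keyed by entity ID. The sketch below uses a toy stand-in for such a table, filled with three rows taken from the outputs in this section. The name `toy_meta` is hypothetical; the `lincs_meta` table used in the cells below plays this role in the notebook and may have more columns.

```python
import pandas as pd

# Toy lookup table indexed by biosample local ID (values copied from this notebook's outputs).
toy_meta = pd.DataFrame(
    {
        "doid": ["DOID:1909", "DOID:1324", "DOID:1324"],
        "ensembl": ["", "ENSG00000100030", "ENSG00000157764"],
    },
    index=pd.Index(
        [
            "ABY001_A375_XH_A16_lapatinib_10uM",
            "HAHN001_A549_96H_A18_MAPK1",
            "HAHN001_A549_96H_A09_BRAF",
        ],
        name="biosample_local_id",
    ),
)

# Look up one metadata column for a list of biosample IDs.
biosample_ids = pd.Series(
    ["HAHN001_A549_96H_A18_MAPK1", "ABY001_A375_XH_A16_lapatinib_10uM"]
)
diseases = biosample_ids.map(toy_meta["doid"])
print(diseases.tolist())
```

Note that `Series.map` yields `NaN` for IDs missing from the lookup, whereas `.loc` raises a `KeyError`; which behavior is preferable depends on whether a missing mapping should stop the build.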

### Biosample Disease

In [47]:
print(extract_cols('biosample_disease'))

['biosample_id_namespace', 'biosample_local_id', 'disease']


In [48]:
def build_bd(bio_df): 
    bd = bio_df[['id_namespace', 'local_id']].copy()
    bd = bd.rename(columns={'id_namespace': 'biosample_id_namespace', 'local_id': 'biosample_local_id'})
    bd['disease'] = bd['biosample_local_id'].apply(lambda x: lincs_meta.loc[x, 'doid'])
    return bd[extract_cols('biosample_disease')]

build_bd(bio_table)

Unnamed: 0,biosample_id_namespace,biosample_local_id,disease
0,https://www.lincsproject.org,ABY001_A375_XH_A16_lapatinib_10uM,DOID:1909
1,https://www.lincsproject.org,HAHN001_A549_96H_A18_MAPK1,DOID:1324
2,https://www.lincsproject.org,ABY001_A375_XH_A15_neratinib_10uM,DOID:1909
3,https://www.lincsproject.org,HAHN001_A549_96H_A09_BRAF,DOID:1324
4,https://www.lincsproject.org,HAHN001_A549_96H_A03_AKT1,DOID:1324
5,https://www.lincsproject.org,ABY001_A375_XH_A13_afatinib_10uM,DOID:1909
6,https://www.lincsproject.org,HAHN001_A549_96H_A15_CDK4,DOID:1324
7,https://www.lincsproject.org,ABY001_A375_XH_A14_erlotinib_10uM,DOID:1909


In [56]:
bd_table = build_bd(bio_table)

### Biosample Gene

In [49]:
print(extract_cols('biosample_gene'))

['biosample_id_namespace', 'biosample_local_id', 'gene']


In [52]:
def build_bg(bio_df): 
    bg = bio_df[['id_namespace', 'local_id']].copy()
    bg = bg.rename(columns={'id_namespace': 'biosample_id_namespace', 'local_id': 'biosample_local_id'})
    bg['gene'] = bg['biosample_local_id'].apply(lambda x: lincs_meta.loc[x, 'ensembl'])
    # remove empty values, aka non-genetic perturbations
    bg = bg[bg['gene'] != '']
    return bg[extract_cols('biosample_gene')]

build_bg(bio_table)

Unnamed: 0,biosample_id_namespace,biosample_local_id,gene
1,https://www.lincsproject.org,HAHN001_A549_96H_A18_MAPK1,ENSG00000100030
3,https://www.lincsproject.org,HAHN001_A549_96H_A09_BRAF,ENSG00000157764
4,https://www.lincsproject.org,HAHN001_A549_96H_A03_AKT1,ENSG00000142208
6,https://www.lincsproject.org,HAHN001_A549_96H_A15_CDK4,ENSG00000135446
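
The `!= ''` filter in `build_bg` assumes that missing gene IDs are stored as empty strings. If the metadata were instead loaded with missing values as `NaN` or `None`, a filter along the following lines would cover both cases. This is only a sketch: `drop_blank` is a hypothetical helper and the frame is toy data.

```python
import pandas as pd


def drop_blank(df, col):
    """Keep rows whose `col` is neither missing (NaN/None) nor blank after stripping whitespace."""
    keep = df[col].fillna("").astype(str).str.strip().ne("")
    return df[keep]


# Toy table with one real gene ID, one empty string, and one missing value.
toy_genes = pd.DataFrame(
    {
        "biosample_local_id": ["HAHN001_A549_96H_A18_MAPK1", "sample_b", "sample_c"],
        "gene": ["ENSG00000100030", "", None],
    }
)

filtered = drop_blank(toy_genes, "gene")
print(filtered)
```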


### Subject Disease

Subject disease is an example where we reuse both the biosample from subject and biosample disease tables to build a new table with little extra work.

In [55]:
print(extract_cols('subject_disease'))

['subject_id_namespace', 'subject_local_id', 'disease']


In [57]:
def build_sd(bfs_df, bd_df): 
    sd = bfs_df[['biosample_local_id', 'subject_id_namespace', 'subject_local_id']].copy()
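    # build the biosample -> disease lookup once rather than once per row
    disease_by_biosample = bd_df.set_index('biosample_local_id')['disease']
    sd['disease'] = sd['biosample_local_id'].apply(lambda x: disease_by_biosample.loc[x])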
    sd = sd[extract_cols('subject_disease')] # remove biosample column
    sd = sd.drop_duplicates(subset=['subject_local_id'])
    return sd

build_sd(bfs_table, bd_table)

Unnamed: 0,subject_id_namespace,subject_local_id,disease
0,https://www.lincsproject.org,A375,DOID:1909
1,https://www.lincsproject.org,A549,DOID:1324
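
The same reuse of two association tables can also be expressed as a join. The sketch below is an alternative under stated assumptions, not the notebook's method: `subject_disease_via_merge` is a hypothetical name, and the two input frames are toy excerpts of the biosample_from_subject and biosample_disease outputs above.

```python
import pandas as pd


def subject_disease_via_merge(bfs_df, bd_df):
    """Join biosample->subject with biosample->disease, then keep unique subject/disease pairs."""
    merged = bfs_df.merge(
        bd_df, on=["biosample_id_namespace", "biosample_local_id"], how="inner"
    )
    cols = ["subject_id_namespace", "subject_local_id", "disease"]
    return merged[cols].drop_duplicates().reset_index(drop=True)


ns = "https://www.lincsproject.org"
toy_bfs = pd.DataFrame(
    {
        "biosample_id_namespace": [ns, ns, ns],
        "biosample_local_id": [
            "ABY001_A375_XH_A16_lapatinib_10uM",
            "HAHN001_A549_96H_A18_MAPK1",
            "ABY001_A375_XH_A15_neratinib_10uM",
        ],
        "subject_id_namespace": [ns, ns, ns],
        "subject_local_id": ["A375", "A549", "A375"],
    }
)
toy_bd = pd.DataFrame(
    {
        "biosample_id_namespace": [ns, ns, ns],
        "biosample_local_id": toy_bfs["biosample_local_id"].tolist(),
        "disease": ["DOID:1909", "DOID:1324", "DOID:1909"],
    }
)

toy_sd = subject_disease_via_merge(toy_bfs, toy_bd)
print(toy_sd)
```

Deduplicating on all three columns, rather than on `subject_local_id` alone, would keep every row for a subject that maps to more than one disease. For the cell lines shown here each subject has a single disease, so both approaches give the same two rows.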


## Build Controlled Vocabularies

Using the term-builder script provided by the CFDE-CC, the controlled vocabularies can be built once the required tables exist (file, biosample, biosample_disease, biosample_gene, biosample_substance, subject_disease, subject_role_taxonomy, subject_substance). 