<a href="https://colab.research.google.com/github/HamidrezaKmK/2times2048/blob/master/data_documentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Documentation

This notebook documents the data used in this project. All the data is stored in a directory referred to as `data_dir`.

The following cell sets up Google Drive and the git repository. Run it before proceeding.

In [123]:
%%writefile sys_setup.py
#!/usr/bin/env python

import sys
from google.colab import drive
import subprocess
import os

data_dir = None

if __name__ == '__main__':
  env = os.environ.copy()
  print("Mounting drive...")
  drive.mount('/content/drive')
  print("Mount complete!")

  data_dir = input("Enter data directory: ")
  if len(data_dir) == 0:
    # default value
    data_dir = '/content/drive/MyDrive/Drugs/Data/polished'
  with open(".dir", 'w') as f:
    f.write(data_dir)
  # Ask whether to clone the repository or pull the latest changes
  while True:
    opt = input("What are you trying to do? [clone/pull] ")
    if opt == 'clone':
      addr = "https://[TOKEN]@github.com/HamidrezaKmK/DrugCombination"
      print(f"Trying to connect to {addr}")
      token = input("Enter token: ")
      addr = addr.replace('[TOKEN]', token)
      # Pass the argument list directly; with shell=True only 'git' would be executed
      subprocess.run(['git', 'clone', addr], env=env)
      break
    elif opt == 'pull':
      path = os.path.join('/content', 'DrugCombination')
      # Run git inside the repository; a separate `cd` subprocess would not affect later calls
      subprocess.run(['git', 'pull'], cwd=path, env=env)
      break


Overwriting sys_setup.py


In [40]:
%run sys_setup.py 
with open('.dir', 'r') as f:
  data_dir = f.read()
print(data_dir)

Mounting drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Mount complete!
Enter data directory: 
What are you trying to do? [clone/pull] pull
/content/drive/MyDrive/Drugs/Data/polished


In [59]:
import os
import pandas as pd
import json
import pickle
from tqdm import tqdm

## Combo databases

The `data_dir` contains a subfolder `combo` that holds multiple csv files. Each row of these files describes one drug combination experiment: the drugs involved, the cell line, the drug doses, and the response observed when the combination is applied. The columns of these `csv` files all follow this scheme:
* `drug{i}`: The name of the `i`th drug, for each `i` up to the number of drugs in the combination. This field may be empty. For example, in a dataset of drug pairs, a row in which one of these fields is empty describes a monotherapy.
* `cid{i}`: The compound id of `drug{i}`. Compound ids can be looked up through the PubChem interface.
* `smiles{i}`: The SMILES string of `drug{i}`.
* `conc{i}`: The concentration of `drug{i}`, in **micromolar (uM)**.
* `cell_id`: The id of the cell that is targeted in this combination. This is a local id created for the project; to check what kind of sample an id represents, see the `cell_id_to_names.json` file.
* `cell_name`: One of the names of the cell.
* `response`: The response measured when the drugs are applied at the given concentrations.
* `study_name`: The study name associated with the drug combination, e.g., `ALMANAC`, `ONEIL`, etc.
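
The monotherapy rule described for `drug{i}` can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's own code: the helper names `drug_names` and `is_monotherapy` are hypothetical, the first sample row is taken from the ALMANAC preview in this notebook (with the SMILES and cid columns omitted), and the second row, including its response value, is made up to show an empty `drug2`. It uses only the standard library so that it runs without access to `data_dir`.

```python
import csv
import io

# First row: from the ALMANAC preview in this notebook (some columns omitted).
# Second row: hypothetical, invented only to show an empty drug2 field.
SAMPLE = """drug1,drug2,conc1,conc2,cell_name,response,cell_id,study_name
NSC 102816,NSC105014,0.03,0.01,NCI-H460,100.99,20,ALMANAC
NSC 102816,,0.03,,NCI-H460,95.0,20,ALMANAC
"""

def drug_names(row):
    """Return the non-empty drug{i} values of a row, ordered by i."""
    names = []
    i = 1
    while f"drug{i}" in row:
        value = (row[f"drug{i}"] or "").strip()
        if value:
            names.append(value)
        i += 1
    return names

def is_monotherapy(row):
    """A row is treated as a monotherapy when exactly one drug field is filled."""
    return len(drug_names(row)) == 1

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flags = [is_monotherapy(r) for r in rows]
print(flags)  # [False, True]
```

When the files are loaded with pandas instead, empty fields typically appear as `NaN`, so a check such as `df['drug2'].isna()` would play the same role.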

Run the following code to see two examples of the combo datasets we have.


In [80]:
df = pd.read_csv(os.path.join(data_dir, 'combo', 'ALMANAC2017.csv'))
df.head()

Unnamed: 0,drug1,drug2,conc1,conc2,cell_name,response,cell_id,smiles1,smiles2,cid1,cid2,study_name
0,NSC 102816,NSC105014,0.03,0.01,NCI-H460,100.99,20,C1=NC(=NC(=O)N1C2C(C(C(O2)CO)O)O)N,C1C(C(OC1N2C=NC3=C(N=C(N=C32)Cl)N)CO)O,9444.0,1546.0,ALMANAC
1,NSC 102816,NSC105014,0.03,0.1,NCI-H460,92.35,20,C1=NC(=NC(=O)N1C2C(C(C(O2)CO)O)O)N,C1C(C(OC1N2C=NC3=C(N=C(N=C32)Cl)N)CO)O,9444.0,1546.0,ALMANAC
2,NSC 102816,NSC105014,0.03,1.0,NCI-H460,60.92,20,C1=NC(=NC(=O)N1C2C(C(C(O2)CO)O)O)N,C1C(C(OC1N2C=NC3=C(N=C(N=C32)Cl)N)CO)O,9444.0,1546.0,ALMANAC
3,NSC 102816,NSC105014,0.3,0.01,NCI-H460,75.2,20,C1=NC(=NC(=O)N1C2C(C(C(O2)CO)O)O)N,C1C(C(OC1N2C=NC3=C(N=C(N=C32)Cl)N)CO)O,9444.0,1546.0,ALMANAC
4,NSC 102816,NSC105014,0.3,0.1,NCI-H460,75.48,20,C1=NC(=NC(=O)N1C2C(C(C(O2)CO)O)O)N,C1C(C(OC1N2C=NC3=C(N=C(N=C32)Cl)N)CO)O,9444.0,1546.0,ALMANAC


In [125]:
df = pd.read_csv(os.path.join(data_dir, 'combo', 'DrugComboDB_ONEIL_ALMANAC.csv'))
df.head()

Unnamed: 0,drug1,drug2,cell_name,conc1,conc2,response,study_name,cid1,cid2,smiles1,smiles2,cell_id
0,5-Fluorouracil,Veliparib,A2058,0.0,0.0,100.000626,ONEIL,3385,11960529,C1=C(C(=O)NC(=O)N1)F,CC1(CCCN1)C2=NC3=C(C=CC=C3N2)C(=O)N,0
1,5-Fluorouracil,Veliparib,A2058,0.0,0.35,101.993009,ONEIL,3385,11960529,C1=C(C(=O)NC(=O)N1)F,CC1(CCCN1)C2=NC3=C(C=CC=C3N2)C(=O)N,0
2,5-Fluorouracil,Veliparib,A2058,0.0,1.08,100.957673,ONEIL,3385,11960529,C1=C(C(=O)NC(=O)N1)F,CC1(CCCN1)C2=NC3=C(C=CC=C3N2)C(=O)N,0
3,5-Fluorouracil,Veliparib,A2058,0.0,3.25,99.930372,ONEIL,3385,11960529,C1=C(C(=O)NC(=O)N1)F,CC1(CCCN1)C2=NC3=C(C=CC=C3N2)C(=O)N,0
4,5-Fluorouracil,Veliparib,A2058,0.0,10.0,98.861202,ONEIL,3385,11960529,C1=C(C(=O)NC(=O)N1)F,CC1(CCCN1)C2=NC3=C(C=CC=C3N2)C(=O)N,0


## Mappings and json files

The `data_dir/mappings` folder contains multiple `json` files that are explained below:

* `cell_id_to_names.json`: This file contains a dictionary mapping each cell id to a set of names. These names may include the `ACH-id` (the CCLE id), the `COSMIC-id` (the COSMIC id), and multiple conventional names and aliases for the cell.
* `drug_id_to_names.json`: This file contains a mapping between the compound id of drugs and their names and aliases. These names might contain formats such as `NSC-no.`, `DrugName`, etc.
* `gene_id_to_names.json`: This file contains a mapping between the gene ids and their names. These names might represent their transcript id, gene id, or other conventional names. 
* `cell_id_to_expressions.json`: Each cell id is mapped to a dictionary that maps each type of screening to two lists: the first contains the gene names and the second contains their expression values.
* `gene_id_to_ppi_embedding.json`: Each gene is mapped to an embedding of fixed length. This embedding is obtained by running Node2Vec on the Protein-Protein Interaction (PPI) network of the LINCS landmark genes.
* `cell_id_to_tissue.json`: This is a simple map from each cell id to the associated tissue name. It comes in handy when evaluating results with respect to their corresponding tissue.

It is good practice to sanity check the json files for possible errors in the mappings. The files were generated using a combination of manual search and automated APIs, so manually correcting any errors you find is highly recommended.
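
Two such sanity checks could look like the sketch below. It rests on assumptions that are not confirmed by the files themselves: that each screening type maps to a two-element list `[gene_names, expression_values]`, and that the mappings are keyed by the cell id. The function names and the toy data (including the `screen_a` label and gene names) are hypothetical; only the pairs `0`/`A2058` and `20`/`NCI-H460` come from the previews in this notebook.

```python
def check_expression_lists(cell_to_expression):
    """Return (cell_id, screening_type) pairs whose two lists differ in length."""
    problems = []
    for cell_id, by_screen in cell_to_expression.items():
        for screen, lists in by_screen.items():
            # Assumed layout: lists == [gene_names, expression_values]
            if len(lists) != 2 or len(lists[0]) != len(lists[1]):
                problems.append((cell_id, screen))
    return problems

def missing_cell_ids(cell_ids, cell_id_to_names):
    """Return the cell ids (as strings) that have no entry in the name mapping."""
    # JSON object keys are always strings, so integer ids from a csv file
    # have to be compared as str(cell_id).
    return sorted({str(c) for c in cell_ids} - set(cell_id_to_names))

# Hypothetical toy data, for illustration only
toy_expression = {
    "0": {"screen_a": [["GENE1", "GENE2"], [1.5, 0.2]]},
    "20": {"screen_a": [["GENE1", "GENE2"], [0.7]]},  # lengths do not match
}
toy_names = {"0": ["A2058"], "20": ["NCI-H460"]}

print(check_expression_lists(toy_expression))    # [('20', 'screen_a')]
print(missing_cell_ids([0, 20, 7], toy_names))   # ['7']
```

Any entry reported by either check is a candidate for manual inspection.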

Using the dictionaries above, one can fetch all the data needed to build a high-dimensional representation of a cell and a number of drugs.

In [122]:
with open(os.path.join(data_dir, 'mappings', 'cell_id_to_names.json'), 'r') as f:
  cell_id_to_names = json.load(f)
with open(os.path.join(data_dir, 'mappings', 'drug_id_to_names.json'), 'r') as f:
  drug_id_to_names = json.load(f)
with open(os.path.join(data_dir, 'mappings', 'gene_id_to_names.json'), 'r') as f:
  gene_id_to_names = json.load(f)
with open(os.path.join(data_dir, 'mappings', 'cell_id_to_expressions.json'), 'r') as f:
  cell_id_to_expressions = json.load(f)
with open(os.path.join(data_dir, 'mappings', 'gene_id_to_ppi_embedding.json'), 'r') as f:
  gene_id_to_ppi_embedding = json.load(f)
with open(os.path.join(data_dir, 'mappings', 'cell_id_to_tissue.json'), 'r') as f:
  cell_id_to_tissue = json.load(f)
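
Once the mappings are loaded, gathering the data for one cell could be sketched as below. This is an illustration on hypothetical toy dictionaries rather than the real files. It assumes that each screening type maps to `[gene_names, expression_values]` and that the gene identifiers in the expression lists match the keys of the PPI embedding mapping; in practice a translation through `gene_id_to_names` may be needed. The helper names, the `screen_a` label, the gene names, and the tissue value are all made up.

```python
def expression_by_gene(cell_id, screening, cell_id_to_expressions):
    """Pair up the two parallel lists into a {gene: expression} dictionary."""
    genes, values = cell_id_to_expressions[str(cell_id)][screening]
    return dict(zip(genes, values))

def embed_genes(gene_ids, gene_id_to_ppi_embedding):
    """Return the embeddings of the genes that have one, skipping the rest."""
    return {g: gene_id_to_ppi_embedding[g]
            for g in gene_ids if g in gene_id_to_ppi_embedding}

# Hypothetical toy mappings, shaped like the assumed json layout
toy_expressions = {"20": {"screen_a": [["GENE1", "GENE2"], [1.5, 0.2]]}}
toy_embedding = {"GENE1": [0.1, -0.3, 0.8]}
toy_tissue = {"20": "example_tissue"}

cell_id = 20  # integer, as it would appear in a combo csv row
expr = expression_by_gene(cell_id, "screen_a", toy_expressions)
emb = embed_genes(expr, toy_embedding)
tissue = toy_tissue.get(str(cell_id))

print(expr)    # {'GENE1': 1.5, 'GENE2': 0.2}
print(emb)     # {'GENE1': [0.1, -0.3, 0.8]}
print(tissue)  # example_tissue
```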
