# Creating the "FEATURELESS" dataset 

Version 2.0.0 (May 29th 2023). Run this notebook before running the other two. It compiles a matrix A of drug-disease matchings which is the basis for the PREDICT and TRANSCRIPT datasets. Ensure that you are connected to the Internet before running this notebook.

## Libraries

In [1]:
import numpy as np
import pandas as pd
import subprocess as sb
import os

from time import sleep
import requests
import pickle

from joblib import Parallel, delayed
import multiprocessing
njobs=multiprocessing.cpu_count()-1

import sys
sys.path.insert(0, "../src/")

import utils
import paths_global

## Local paths

In [2]:
## Where database files are stored
print('root_folder="%s"' % paths_global.root_folder)
## Where intermediary files are stored
print('data_folder="%s"' % paths_global.data_folder)

root_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/"
data_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/RECeSS/cfdr/data/"


In [3]:
featureless_folder = paths_global.data_folder+"FEATURELESS/"
os.makedirs(featureless_folder, exist_ok=True)
## Where FEATURELESS dataset files are stored
print('featureless_folder="%s"' % featureless_folder)

featureless_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/RECeSS/cfdr/data/FEATURELESS/"


## Drug and disease identifiers

Drug identifiers are DrugBank identifiers when they exist, otherwise PubChem CIDs. Disease identifiers are OMIM identifiers when they exist, otherwise MedGen concept identifiers ([CUI](https://www.ncbi.nlm.nih.gov/medgen/docs/help/)).
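The fallback rule for drug identifiers can be sketched as a dictionary lookup with a default. This is only an illustration: the function name `resolve_drug_id`, the toy mapping and the numeric placeholder standing in for a PubChem CID are assumptions, not part of the notebook's code.

```python
def resolve_drug_id(drug_name, pubchem_cid, drugbank_name_to_id):
    """Prefer the DrugBank identifier of a drug; fall back to its PubChem CID."""
    return drugbank_name_to_id.get(drug_name, pubchem_cid)

# Toy mapping (drug name -> DrugBank identifier), for illustration only
drugbank_name_to_id = {"Bivalirudin": "DB00006"}

# 111 and 222 are placeholders, not real PubChem CIDs
known = resolve_drug_id("Bivalirudin", 111, drugbank_name_to_id)
unknown = resolve_drug_id("Unlisted compound", 222, drugbank_name_to_id)
print(known, unknown)
```

The same pattern would apply to diseases, with OMIM identifiers taking precedence over concept identifiers.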

#### Database access

To use the code in the notebooks *PREDICT_dataset.ipynb* and *TRANSCRIPT_dataset.ipynb*, you need to register with [LINCS L1000](https://clue.io/developer-resources#apisection), [STRING](https://string-db.org/) and [DisGeNET](https://www.disgenet.org/). Registration with these databases is free (it often only requires an academic e-mail address) and does not expire, but it is mandatory.

##### a. Registration to the DisGeNET database

Click on [this link](https://www.disgenet.org/signup/) to sign up to DisGeNET. Once you are registered, open an empty .TXT file, and write down

    on the first line: the e-mail address used for registration
    on the second line: the chosen password

Save the file, and replace the corresponding path in *paths_global.py*

In [4]:
## Where DisGeNet credentials are stored
print('disgenet_file="%s"' % paths_global.disgenet_file)

disgenet_file="../credentials/credentials_DISGENET.txt"
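A credentials file with this layout (one field per line) could be read with a small helper such as the one below. This is a minimal sketch: `read_credentials` is a hypothetical name, and the demonstration writes a throwaway file with made-up values rather than touching real credentials.

```python
import os
import tempfile
from pathlib import Path

def read_credentials(path, n_fields=2):
    """Return the first `n_fields` non-empty lines of a credentials file."""
    lines = [line.strip() for line in Path(path).read_text().splitlines()]
    lines = [line for line in lines if line]
    if len(lines) < n_fields:
        raise ValueError("expected %d lines in %s, found %d" % (n_fields, path, len(lines)))
    return lines[:n_fields]

# Demonstration on a temporary file with made-up values
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "credentials_demo.txt")
    with open(demo_file, "w") as f:
        f.write("user@example.org\nnot-a-real-password\n")
    email, password = read_credentials(demo_file)
print(email)
```

With `n_fields=3`, the same helper would also cover files with a third line, such as a user key.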


##### b. Registration to the STRING database

The STRING database requires that the person sending requests identify themselves. Write your e-mail address on the first line of an empty .TXT file, and replace the corresponding path in *paths_global.py*

In [5]:
## Where STRING credentials are stored
print('string_file="%s"' % paths_global.string_file)

string_file="../credentials/credentials_STRING.txt"


##### c. Registration to CLUE.io to access the LINCS L1000 database

Click on [this link](https://clue.io/lincs) to sign up to CLUE.io. Once you are registered, open an empty .TXT file, and write down

    on the first line: the e-mail address used for registration
    on the second line: the chosen password
    on the third line: the user key you were assigned

Save the file, and replace the corresponding path in *paths_global.py*

In [6]:
## Where LINCS credentials are stored
print('lincs_file="%s"' % paths_global.lincs_file)

lincs_file="../credentials/credentials_LINCS.txt"


##### d. Registration to the [DrugBank](https://go.drugbank.com/) database and file downloading (v. 5.1.8.)

You first need to request access to the DrugBank database at *info@drugbank.com* (free for 6 months for academic purposes). Write your credentials (first line: login, second line: password) in the file located at `drugbank_file`.

In [7]:
## Where DrugBank credentials are stored
print('drugbank_file="%s"' % paths_global.drugbank_file)
## Where DrugBank files are stored
print('drugbank_folder="%s"' % paths_global.drugbank_folder)

## File download
get_str = lambda cmd : str(sb.check_output(cmd, shell=True).decode("utf-8").split("\n")[0])
login, pwd = [get_str(cmd+" -n1 "+paths_global.drugbank_file) for cmd in ["head", "tail"]]
base_url = "https://go.drugbank.com/releases/5-1-8/downloads/"
filenames = {"COMPLETE DATABASE": ["all-full-database", "drugbank_all_full_database.xml.zip"],
            "STRUCTURES/Structure External Links": ["all-structure-links", "drugbank_all_structure_links.csv.zip"]}
for section in filenames:
    folder = paths_global.drugbank_folder+section+"/"
    fname = folder+filenames[section][-1]
    if (not os.path.exists(fname)):
        ## Section names contain spaces, so arguments are passed as a list instead of a shell string
        os.makedirs(folder, exist_ok=True)
        sb.call(["curl", "-Lfv", "-o", fname, "-u", login+":"+pwd, base_url+filenames[section][0]])

drugbank_file="../credentials/DrugBank_credentials.txt"
drugbank_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/DrugBank/"


###### d.a [DrugBank](https://go.drugbank.com/) identifiers for drug components

In [8]:
if (not os.path.exists(paths_global.data_folder+"drugbankid2drugname.pck")):
    ## Find drug names from DrugBank identifiers (using file from DrugBank after registration)
    path=paths_global.drugbank_folder+"COMPLETE DATABASE"+"/"
    drugbank_db_file = path+"full database.xml"
    if (not os.path.exists(drugbank_db_file)):
        ## The path contains a space, so arguments are passed as a list; extract next to the archive
        sb.call(["unzip", path+"drugbank_all_full_database.xml.zip", "-d", path])
    cmd_file="cat \'"+drugbank_db_file+"\'"
    cmd_drugbank_names="grep -e '^  <name>' | sed 's/  <name>//g' | sed 's/<\/name>//g'"
    cmd_drugbank_ids="grep -e '^  <drugbank-id primary=\"true\">' | sed 's/<\/drugbank-id>//g' | sed 's/<drugbank-id primary=\"true\">//g' | sed 's/ //g'"
    drugbank_ids=sb.check_output(cmd_file+" | "+cmd_drugbank_ids, shell=True).decode("utf-8").split("\n")
    drugbank_names=sb.check_output(cmd_file+" | "+cmd_drugbank_names, shell=True).decode("utf-8").split("\n")
    di_drugbankid2drugname = dict(zip(drugbank_ids, drugbank_names))
    ## Manually retrieved from DrugBank website
    new_names = {"DB00510": "Divalproex sodium", 
        "DB01402": "Bismuth", 
        "DBCAT004271": 'Asparaginase',
        "DB00371" : "Meprobamate",
        "DB00394": "Beclomethasone dipropionate",
        "DB00422": "Methylphenidate",
        "DB00462": "Methscopolamine bromide",
        "DB00464": "Sodium tetradecyl sulfate",
        "DB00525": "Tolnaftate",
        "DB00527": "Cinchocaine",
        "DB00563": "Methotrexate",
        "DB05381": "Histamine",
        "DB00931": "Metacycline",
        "DB00326": "Calcium glucoheptonate",
        "DB14520": "Tetraferric tricitrate decahydrate",
        "DB00717": "Norethisterone",
        "DB01258": "Aliskiren",
        "DB00006" : "Bivalirudin",
    }
    di_drugbankid2drugname.update(new_names)
    with open(paths_global.data_folder+"drugbankid2drugname.pck", "wb") as f:
        pickle.dump(di_drugbankid2drugname, f)
else:
    with open(paths_global.data_folder+"drugbankid2drugname.pck", "rb") as f:
        di_drugbankid2drugname = pickle.load(f)

##### e. [OMIM](https://www.omim.org/) identifiers for diseases

These are OMIM identifiers prefixed with the letter "D", which is needed to obtain valid column names. You need to request a (one-year) registration to download the files (free for academic purposes). A link to these files is then sent by e-mail.
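As a small illustration of this convention, the sketch below converts a MIM number into a "D"-prefixed identifier and normalizes the case of a title in the same way as the cell that follows. The helper names and the example title are made up for the illustration.

```python
def to_column_id(mim_number):
    """Turn a MIM number (possibly parsed as a float) into a 'D'-prefixed identifier."""
    return "D" + str(int(mim_number))

def normalize_title(title):
    """Keep the first character as is and lower-case the rest of the title."""
    return title[0] + title[1:].lower()

print(to_column_id(114480.0))                            # D114480
print(normalize_title("EXAMPLE DISEASE, TYPE 1; EXD1"))  # Example disease, type 1; exd1
```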

In [9]:
## Where OMIM files are stored
print('omim_folder="%s"' % paths_global.omim_folder)

if (not os.path.exists(paths_global.data_folder+"omimid2diseasename.pck")):
    disease_file=paths_global.omim_folder+"mimTitles.txt"
    disease_df = pd.read_csv(disease_file, sep="\t", header=2, index_col=1)[["Preferred Title; symbol"]].dropna()
    disease_df.index = ["D"+str(int(x)) for x in disease_df.index]
    disease_df["Preferred Title; symbol"] = [x[0]+x[1:].lower() for x in disease_df["Preferred Title; symbol"]]
    di_omimid2diseasename = dict(zip(list(disease_df.index), list(disease_df["Preferred Title; symbol"])))
    with open(paths_global.data_folder+"omimid2diseasename.pck", "wb") as f:
        pickle.dump(di_omimid2diseasename, f)
else:
    with open(paths_global.data_folder+"omimid2diseasename.pck", "rb") as f:
        di_omimid2diseasename = pickle.load(f)

omim_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/OMIM/"


##### f. [SIDER](http://sideeffects.embl.de/) database of drug side effects

Access to this database is free and requires no registration. Files are available on this [page](http://sideeffects.embl.de/download/).

In [10]:
## Where SIDER files are stored
print('sider_folder="%s"' % paths_global.sider_folder)

base_url = "http://sideeffects.embl.de/media/download/"
filenames = ["drug_names.tsv", "meddra_all_se.tsv.gz"]

for fname in filenames:
    fm = paths_global.sider_folder+(fname if (fname[-3:]!=".gz") else fname[:-3])
    if (not os.path.exists(fm)):
        sb.call("mkdir -p "+paths_global.sider_folder, shell=True)
        sb.call("wget -N -O "+paths_global.sider_folder+fname+" "+base_url+fname, shell=True)
        if (".gz" == fname[-3:]):
            sb.call("gzip -d "+paths_global.sider_folder+fname+" && rm -f "+paths_global.sider_folder+fname, shell=True)

sider_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/SIDER/"


#### All dictionaries for identifiers

In [11]:
assert os.path.exists(paths_global.data_folder+"drugbankid2drugname.pck")
with open(paths_global.data_folder+"drugbankid2drugname.pck", "rb") as f:
    di_drugbankid2drugname = pickle.load(f)
    
assert os.path.exists(paths_global.data_folder+"omimid2diseasename.pck")
with open(paths_global.data_folder+"omimid2diseasename.pck", "rb") as f:
    di_omimid2diseasename = pickle.load(f)
    
cids_file = paths_global.data_folder+"medgenid2diseasename.pck"
if (not os.path.exists(cids_file)):
    di_medgenid2diseasename = {}
else:
    with open(cids_file, "rb") as f:
        di_medgenid2diseasename = pickle.load(f)
        
pubchem_file = paths_global.data_folder+"pubchemid2drugname.pck"
if (not os.path.exists(pubchem_file)):
    di_pubchemid2drugname = {}
else:
    with open(pubchem_file, "rb") as f:
        di_pubchemid2drugname = pickle.load(f)

## Build matrix A : $N_S \times N_D$ of drug-disease associations

### 1. [RepoDB](http://apps.chiragjpgroup.org/repoDB/) dataset

Drug identifiers are [DrugBank](https://go.drugbank.com/) identifiers. Disease identifiers are Concept ID ([CUI](https://www.ncbi.nlm.nih.gov/medgen/docs/help/)) identifiers. The full dataset should be downloaded from [this website](http://apps.chiragjpgroup.org/repoDB/) (Tab "Download", then button "Download the full repoDB Dataset"). Then enter the path to this file in *paths_global.py*.

In [12]:
print('repodb_folder="%s"' % paths_global.repodb_folder)
assert os.path.exists(paths_global.repodb_folder+"full.csv")

repodb_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/RepoDB/"


In [13]:
## Download the full database on the website
if (not os.path.exists(featureless_folder+"RepoDB_set.csv")):
    dataset = pd.read_csv(paths_global.repodb_folder+"full.csv", header=0)

    ## Select late phase trials
    late_phase_trials=(dataset["phase"]=="Phase 3")|(dataset["phase"]=="Phase 2/Phase 3")|(dataset["phase"]=="Phase 2")|pd.isnull(dataset["phase"])
    dataset = dataset.loc[late_phase_trials]

    ## Remove suspended/withdrawn trials
    dataset = dataset.query("status!='Suspended'").query("status!='Withdrawn'")
    test_outcome = lambda out, outs : all([not (o in out.lower()) for o in outs])
    test_all_outcomes = lambda out_ls, outs : list(map(lambda x : str(x)=="nan" or test_outcome(x, outs), list(out_ls)))

    ## Remove unspecified outcomes in terminated trials
    dataset = dataset.loc[(dataset["status"]!="Terminated")|(~pd.isnull(dataset["DetailedStatus"]))]
    unspecified_outcomes = ["detailed description", "study completed per investigator", "study closed by the nci", "the first phase was completed", "administrative reasons", "study terminated", "study has finished", "administratively complete", "data collection complete", "no records are available"]
    unspecified_outcomes += ["interim analysis indicated study should be terminated", "trial stopped on sept 24, 2007", "irb study closure"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], unspecified_outcomes))]

    ## Remove trials with slow/low accrual/enrollment/recruitment
    slow_outcomes = [adj+" "+noun for adj in ["slow", "diminished", "difficult", "low", "poor", "bad", "lack of", "inadequate", "recruitable", "insufficient"] for noun in ["accural", "recruitement", "inclusion", "subject", "accrual", "rectruitment", "participant", "patient", "enrollment", "enrolment", "recruitment"]]
    slow_outcomes += [a+" "+b for a in ["unable", "failure", "difficult", "inability"] for b in ["to recruit", "to enroll"]]
    slow_outcomes += [a+" "+b for a in ["enrolling", "enrollment", "recruitment", "difficulty", "difficulty of", "recruitment of"] for b in ["recrutement", "participant", "difficulties", "challenge", "in enrolling"]]
    slow_outcomes += ["recruitment", "sample size", "recruiting", "not enrolled completely", "lack of eligible patients", "enough patient", "enrollment number"]
    slow_outcomes += ["inclusion rate", "target enrollment", "feasibility", "discontinued", "enrollee", "recruit new patient"]
    slow_outcomes += ["accrual was not optimized", "low rate of accrual", "short of participants", "accrual goal for interventional part not achievable"]
    slow_outcomes += ["enrollment issues", 'difficulty to recruit patients', "enrollment very slow", "accrual was too slow", "enrollment was much slower than anticipated", "only 2 subjects enrolled", "recruitement did not meet expectations", 'closed to enrollment', "difficulty in accruing subjects", "no more eligible patients"]
    slow_outcomes += ["5 subjects could be enrolled", "subject recuitment halted", "accrual was very low", "not be able to reach stated accrual"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], slow_outcomes))]

    ## Remove trials which were ended by sponsor/lack of funding/access to drug
    sponsor_outcomes = ["medication supply issue", "due study team travel restrictions", "could not receive the support from the national medical insurance", "lack of experimental medication", "withdrawal of support from our collaborator", "financial resource limitations", "terminated study due to lack of funds", "unavailibility of methylnaltrexone ", "insurance companies to cover", "stopped drug delivery", "change in the national policy of medications", "sponsor", "limited availability of drug", "drug supply", "study medication expired", "not able to obtain the study drug", "registration of the medicine is no longer being pursued", "company decision", "competitor study", "drugs unavailable", "unable to secure drug", "access to study drug", "competition", "competing", "funding", "business", "supply study drug", "drug availability", "drug no longer available", "insurance coverage"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], sponsor_outcomes))]

    ## Remove trials which were ended for scientific reasons
    scientific_outcomes = ["changes in standard care", "hri no longer conducting research", "to be compliant with the timelines", "was determined not feasible", "changing aetiology of squamous cell carcinoma", "preliminary analysis", "change in development plan", "change in standard of care", "h1n1 pandemic is now over", "medical/ethical reasons", "original investigator left", "pi ", "this study will not be written up", "h1n1 pandemic concluded", "fda placed a clinical hold", "on hold at the request of the fda", "trial not progressing toward scientific goals", "pi's ", "some of the researchers finished their participation in the study", 'fda hold may 2007', "major revisions needed", "protocol modification", "design changes were needed", "non-compliance"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], scientific_outcomes))]

    ## Remove trials (lack of clear outcome)
    other_outcomes = ['safety concern of active control drug', 'few delirious patients were enrolled', 'trial design contingent on RFA optimization', 'efficacy interim analysis as per protocol', 'interim analysis showed that the primary endpoint would not be met', 'study was never initiated under new location/provider group', 'to complete the study in an appropriate time frame', 'malaria prev. fell in the study area, so we cannot evaluate the primary endpoint', 'sufficient number to reach the primary endpoint and as planned', 'sufficient number of subjects accrued to conduct analysis', 'aoi pharma terminated the license agreement', 'dsmb stopped study because placebo arm had more adverse events', 'drug was no longer available', 'pfizer has terminated the execution of this protocol', "clinical trial terminated due to results from recent nonclinical studies", "data from the c08 study and avant study", "technical/operational issues", "study closed and subject follow-up completed following analysis of blinded study data"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], other_outcomes))]
    dataset = dataset.loc[(dataset["DetailedStatus"]!="Completed")&(dataset["DetailedStatus"]!="Terminated")]

    ## Create positive and negative outcome sets
    test_outcome = lambda out, outs : any([(o in out.lower()) for o in outs])
    test_all_outcomes = lambda out_ls, outs : list(map(lambda x : str(x)=="nan" or test_outcome(x, outs), list(out_ls)))
    detailed_status = dataset['DetailedStatus']
    ## drug_id are DrugBank identifiers
    dataset = dataset[['drug_name', 'drug_id', 'ind_name', 'ind_id', 'status']]
    dataset["status"] = "Negative"
    positive_outcomes = ["publishing the results", "demonstrating efficacy", 'had already been prescribed Cymbalta', 'drug now on market', 'PXD101-CLN-19']
    ## Single .loc call to avoid chained assignment
    dataset.loc[test_all_outcomes(detailed_status, positive_outcomes), "status"] = "Positive"
    
    ## Populate drug and disease identifiers
    for idx in dataset.index:
        drug_id = dataset.loc[idx]["drug_id"]
        drug_name = dataset.loc[idx]["drug_name"]
        ind_id = dataset.loc[idx]["ind_id"]
        ind_name = dataset.loc[idx]["ind_name"]
        di_drugbankid2drugname.setdefault(drug_id, drug_name)
        di_medgenid2diseasename.setdefault(ind_id, ind_name)
    
    with open(paths_global.data_folder+"drugbankid2drugname.pck", "wb") as f:
        pickle.dump(di_drugbankid2drugname, f)
    with open(paths_global.data_folder+"medgenid2diseasename.pck", "wb") as f:
        pickle.dump(di_medgenid2diseasename, f)
        
    dataset_repodb = pd.DataFrame([], index=dataset.index)
    dataset_repodb["ind_id"] = dataset["ind_id"]
    dataset_repodb["drug_id"] = dataset["drug_id"]
    dataset_repodb["rating"] = [int(x=="Positive")-int(x=="Negative") for x in dataset["status"]]
    dataset_repodb.to_csv(featureless_folder+"RepoDB_set.csv")
    
dataset_repodb = pd.read_csv(featureless_folder+"RepoDB_set.csv", index_col=0)
utils.print_dataset(dataset_repodb, "ind_id", "drug_id", "rating")
dataset_repodb.T

Ndrugs=1531	Ndiseases=1287
6686 positive	190 negative	1963521 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9987,9988,10099,10116,10117,10164,10179,10250,10283,10396
ind_id,C0272275,C0585362,C3163899,C1319317,C0280324,C0007102,C0010674,C0079773,C0003873,C3495559,...,C0026764,C0026764,C0948780,C0021359,C0021359,C0019196,C0277554,C0346976,C0023283,C0039445
drug_id,DB00001,DB00002,DB00002,DB00002,DB00002,DB00002,DB00003,DB00004,DB00005,DB00005,...,DB00773,DB00987,DB00641,DB00783,DB06825,DB00715,DB00682,DB00441,DB00196,DB00112
rating,1,1,1,1,1,1,1,1,1,1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
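The successive filters in the cell above share one pattern: a trial is kept if its detailed status is missing or mentions none of the excluded keywords, and the final status is then encoded as +1 or -1. The dependency-free sketch below restates that pattern on toy inputs. The helper names are hypothetical, and a missing status is represented by `None` here rather than by a NaN value.

```python
def mentions_any(text, keywords):
    """True if the lower-cased text contains at least one keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def keep_trial(detailed_status, excluded_keywords):
    """Keep trials without a detailed status, or whose status matches no keyword."""
    if detailed_status is None:
        return True
    return not mentions_any(detailed_status, excluded_keywords)

def to_rating(status):
    """Encode 'Positive' as 1, 'Negative' as -1 and anything else as 0."""
    return int(status == "Positive") - int(status == "Negative")

# Toy statuses, for illustration only
excluded = ["slow accrual", "funding"]
statuses = [None, "Slow accrual of patients", "Lack of FUNDING", "Drug now on market"]
kept = [s for s in statuses if keep_trial(s, excluded)]
print(kept)
```

Keywords have to be written in lower case in this sketch, because they are compared against the lower-cased status.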


### 2. [Gottlieb](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159979/) dataset

In [14]:
A_gottlieb = utils.load_dataset("Gottlieb", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_gottlieb = utils.matrix2ratings(A_gottlieb, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_gottlieb))+"%")
utils.print_dataset(dataset_gottlieb, "ind_id", "drug_id", "rating")
dataset_gottlieb.T

Sparsity = 1.0414365682698576%
Ndrugs=593	Ndiseases=313
1933 positive	0 negative	183676 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932
ind_id,D131200,D150699,D175510,D176807,D600082,D601518,D604416,D604416,D114480,D131200,...,D110100,D157700,D161900,D608622,D126200,D102500,D166710,D606842,D144700,D605839
drug_id,DB00007,DB00007,DB00007,DB00007,DB00007,DB00007,DB00007,DB00010,DB00014,DB00014,...,DB04861,DB04861,DB04861,DB04861,DB05259,DB06285,DB06285,DB06285,DB06287,DB06287
rating,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
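The helpers `utils.matrix2ratings` and `utils.compute_sparsity` are defined outside this notebook. The sketch below shows, under stated assumptions, what such helpers might compute: it assumes one row per drug and one column per disease, uses plain lists instead of arrays, and the names `matrix_to_ratings` and `known_percentage` are hypothetical. The sparsity printed above for Gottlieb is consistent with 100 × (number of non-zero entries) / (number of entries).

```python
def matrix_to_ratings(matrix, drug_ids, disease_ids):
    """List (ind_id, drug_id, rating) triplets for all non-zero entries.

    `matrix` is a list of rows, one per drug, with one column per disease.
    """
    return [
        (disease_ids[j], drug_ids[i], value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]

def known_percentage(matrix):
    """Percentage of non-zero (known) entries in the matrix."""
    n_entries = sum(len(row) for row in matrix)
    n_known = sum(1 for row in matrix for value in row if value != 0)
    return 100.0 * n_known / n_entries

# Toy excerpt, for illustration only
drug_ids = ["DB00007", "DB00010"]
disease_ids = ["D131200", "D150699", "D604416"]
toy_matrix = [[1, 1, 1],
              [0, 0, 1]]
ratings = matrix_to_ratings(toy_matrix, drug_ids, disease_ids)
print(ratings)
print(known_percentage(toy_matrix))

# Check against the Gottlieb counts printed above: 1933 known entries, 593 drugs, 313 diseases
gottlieb_check = 100.0 * 1933 / (593 * 313)
print(gottlieb_check)
```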


### 3. [CDataset](https://academic.oup.com/bioinformatics/article/34/11/1904/4820334) dataset

In [15]:
A_cdataset = utils.load_dataset("Cdataset", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_cdataset = utils.matrix2ratings(A_cdataset, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_cdataset))+"%")
utils.print_dataset(dataset_cdataset, "ind_id", "drug_id", "rating")
dataset_cdataset.T

Sparsity = 0.9337419376251535%
Ndrugs=663	Ndiseases=409
2532 positive	0 negative	268635 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2522,2523,2524,2525,2526,2527,2528,2529,2530,2531
ind_id,D114480,D131200,D176807,D601518,D125700,D134430,D134500,D193400,D277480,D304900,...,D125852,D222100,D300136,D600319,D601208,D601318,D601666,D601941,D601942,D603266
drug_id,DB00014,DB00014,DB00014,DB00014,DB00035,DB00035,DB00035,DB00035,DB00035,DB00035,...,DB08907,DB08907,DB08907,DB08907,DB08907,DB08907,DB08907,DB08907,DB08907,DB08907
rating,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


### 4. [Indep](https://github.com/bioinfomaticsCSU/MBiRW) dataset

In [16]:
A_indep = utils.load_dataset("indep", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_indep = utils.matrix2ratings(A_indep, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_indep))+"%")
utils.print_dataset(dataset_indep, "ind_id", "drug_id", "rating")
dataset_indep.T

Sparsity = 0.4000555632726768%
Ndrugs=115	Ndiseases=45
144 positive	0 negative	5031 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,134,135,136,137,138,139,140,141,142,143
ind_id,D134500,D227300,D256370,D540000,D166710,D166710,D277440,D607499,D140600,D165720,...,D266600,D140600,D165720,D607850,D603165,D603165,D603165,D190300,D603165,D603165
drug_id,DB00035,DB00035,DB00091,DB00125,DB00136,DB00153,DB00169,DB00176,DB00193,DB00193,...,DB01250,DB01283,DB01283,DB01283,DB01620,DB08799,DB08802,DB08824,DB08835,DB08906
rating,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


### 5. [Fdataset and DNdataset](http://bioinformatics.csu.edu.cn/resources/softs/DrugRepositioning/DRRS/index.html) datasets

The "F dataset" is the "Gottlieb" dataset.

In [17]:
A_fdataset = utils.load_dataset("Fdataset", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_fdataset = utils.matrix2ratings(A_fdataset, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_fdataset))+"%")
utils.print_dataset(dataset_fdataset, "ind_id", "drug_id", "rating")
dataset_fdataset.T

Sparsity = 1.0414365682698576%
Ndrugs=593	Ndiseases=313
1933 positive	0 negative	183676 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1923,1924,1925,1926,1927,1928,1929,1930,1931,1932
ind_id,D131200,D150699,D175510,D176807,D600082,D601518,D604416,D604416,D114480,D131200,...,D110100,D157700,D161900,D608622,D126200,D102500,D166710,D606842,D144700,D605839
drug_id,DB00007,DB00007,DB00007,DB00007,DB00007,DB00007,DB00007,DB00010,DB00014,DB00014,...,DB04861,DB04861,DB04861,DB04861,DB05259,DB06285,DB06285,DB06285,DB06287,DB06287
rating,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [18]:
A_dndataset = utils.load_dataset("DNdataset", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_dndataset = utils.matrix2ratings(A_dndataset, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_dndataset))+"%")
utils.print_dataset(dataset_dndataset, "ind_id", "drug_id", "rating")
dataset_dndataset.T

Sparsity = 0.0149802937802058%
Ndrugs=550	Ndiseases=360
1008 positive	0 negative	196992 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,998,999,1000,1001,1002,1003,1004,1005,1006,1007
ind_id,1043,3261,1141,449,2324,3819,817,4119,2200,23,...,827,1723,176,2037,2549,1997,2154,3819,3108,3278
drug_id,3,6,10,16,16,17,18,24,26,33,...,1469,1470,1471,1471,1473,1476,1476,1481,1484,1487
rating,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


### 6. [Covid-19](https://github.com/vikram-s-narayan/collaborative-filtering-for-drug-repurposing-COVID-V3) dataset

Every drug that has been tested for Covid-19 seems to be rated "1" in this dataset, so official drug recommendations (as of May $17^{th}$ $2023$) are used to refine the ratings. This dataset also provides other indications, but they are ignored, since there is no documentation of how the dataset was built.

Current (up to May $2023$) therapeutic recommendations from the NIH (USA), FDA (USA), NHS (UK) and HAS (FR) (which conform to those of [WHO](https://www.who.int/teams/health-care-readiness/covid-19/therapeutics), [EMA](https://www.ema.europa.eu/en/human-regulatory/overview/public-health-threats/coronavirus-disease-covid-19/treatments-vaccines/covid-19-treatments), and [CDC](https://www.cdc.gov/coronavirus/2019-ncov/your-health/treatments-for-severe-illness.html)):

- [NIH guidelines](https://www.covid19treatmentguidelines.nih.gov/special-populations/pregnancy/pregnancy-lactation-and-covid-19-therapeutics/) (April $20$, $2023$): Baricitinib (DB11817), Dexamethasone (DB01234), Heparin (DB01109), Molnupiravir (DB15661), Remdesivir (DB14761), Nirmatrelvir (DB16691)+Ritonavir (DB00503), Tocilizumab (DB06273)

- [FDA guidelines](https://www.fda.gov/drugs/emergency-preparedness-drugs/coronavirus-covid-19-drugs) (May $12^{th}$, $2023$): Tocilizumab (DB06273), Remdesivir (DB14761), Baricitinib (DB11817), Nirmatrelvir (DB16691)+Ritonavir (DB00503), Molnupiravir (DB15661), Anakinra (DB00026), Vilobelimab (DB16416)

- [NHS guidelines](https://www.nhs.uk/conditions/covid-19/treatments-for-covid-19/) (March $21^{st}$, $2023$): Nirmatrelvir (DB16691)+Ritonavir (DB00503), Remdesivir (DB14761), Molnupiravir (DB15661), Sotrovimab (DB16355)

- [HAS guidelines](https://www.has-sante.fr/jcms/p_3303843/fr/medicaments-dans-le-cadre-de-la-covid-19) (February $9^{th}$, $2023$): Sotrovimab (DB16355), Dexamethasone (DB01234), Tocilizumab (DB06273), Nirmatrelvir (DB16691)+Ritonavir (DB00503), Tixagevimab (DB16394)+Cilgavimab (DB16393), Molnupiravir (DB15661), Casirivimab (DB15941)+Imdevimab (DB15940), Remdesivir (DB14761)

All other cited drugs are assumed to be negative matchings for Covid-19.
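The rule stated above (recommended drugs are positive, all other cited drugs are negative) can be written as a set membership test. This is a sketch of the rule as described in the text, not of the notebook's own code. The set gathers the DrugBank identifiers listed in the four guidelines, and `covid_rating` is a hypothetical helper name.

```python
# DrugBank identifiers of the drugs listed in the guidelines above
RECOMMENDED = {
    "DB11817",  # Baricitinib
    "DB01234",  # Dexamethasone
    "DB01109",  # Heparin
    "DB15661",  # Molnupiravir
    "DB14761",  # Remdesivir
    "DB16691",  # Nirmatrelvir
    "DB00503",  # Ritonavir
    "DB06273",  # Tocilizumab
    "DB00026",  # Anakinra
    "DB16416",  # Vilobelimab
    "DB16355",  # Sotrovimab
    "DB16394",  # Tixagevimab
    "DB16393",  # Cilgavimab
    "DB15941",  # Casirivimab
    "DB15940",  # Imdevimab
}

def covid_rating(drug_id):
    """Return 1 for a recommended drug, -1 for any other cited drug."""
    return 1 if drug_id in RECOMMENDED else -1

print(covid_rating("DB14761"))  # Remdesivir
print(covid_rating("DB01611"))  # Hydroxychloroquine
```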

In [19]:
## Download the associated file on GitHub
print('covid_folder="%s"' % paths_global.covid_folder)
covid_file = paths_global.covid_folder+"approved_COVID.csv"
if (not os.path.exists(covid_file)):
    covid_url="https://raw.githubusercontent.com/vikram-s-narayan/collaborative-filtering-for-drug-repurposing-COVID-V3/master/approved_COVID.csv"
    os.makedirs(paths_global.covid_folder, exist_ok=True)
    ## Blocking call, so that the download is finished before the assertion below is checked
    sb.call(["wget", "-O", covid_file, covid_url])
assert os.path.exists(paths_global.covid_folder+"approved_COVID.csv")

covid_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/collaborative-filtering-for-drug-repurposing-COVID-V3/"


In [20]:
if (not os.path.exists(featureless_folder+"Covid_set.csv")):
    ## According to the notebook, this dataset is the same as RepoDB except
    ## for the last 15 lines
    approved_Covid = pd.read_csv(paths_global.covid_folder+"approved_COVID.csv").iloc[-15:,:]
    approved_Covid["rating"] = -1   
    approved_Covid["ind_name"] = "Coronavirus disease 2019"
    approved_Covid.index = range(15)
    positive_ratings = {
        'DB11817': 'Baricitinib',
        'DB01234': 'Dexamethasone',
        'DB01109': 'Heparin',
        'DB15661': 'Molnupiravir',
        'DB14761': 'Remdesivir',
        'DB16691': 'Nirmatrelvir',
        'DB00503': 'Ritonavir',
        'DB06273': 'Tocilizumab',
        'DB00026': 'Anakinra',
        'DB16416': 'Vilobelimab',
        'DB16355': 'Sotrovimab',
        'DB16394': 'Tixagevimab',
        'DB16393': 'Cilgavimab',
        'DB15941': 'Casirivimab',
        'DB15940': 'Imdevimab',      
    }
    add_ratings = [[drug, "Coronavirus disease 2019", 1] for drug in positive_ratings.values()]
    add_ratings += [ 
        ["Sarilumab", "Coronavirus disease 2019", -1], 
        ["Bamlanivimab", "Coronavirus disease 2019", -1], 
        ["Atovaquone", "Coronavirus disease 2019", -1], 
        ["Omeprazole", "Coronavirus disease 2019", -1], 
        ["Famotidine", "Coronavirus disease 2019", -1],
        ["Oseltamivir", "Coronavirus disease 2019", -1], 
        ["Favipiravir", "Coronavirus disease 2019", -1], 
        ['Emtricitabine', "Coronavirus disease 2019", -1], 
        ["Tenofovir", "Coronavirus disease 2019", -1], 
        ["Atazanavir", "Coronavirus disease 2019", -1], 
        ["Hydroxychloroquine", "Coronavirus disease 2019", -1], 
        ["Lopinavir", "Coronavirus disease 2019", -1], 
        ["Ritonavir", "Coronavirus disease 2019", -1]
    ]
    for drug, ind, rating in add_ratings:
        if (drug not in list(approved_Covid["drug_name"])):
            approved_Covid.loc[len(approved_Covid.index)] = [drug, ind, rating]
    covid_drug_ids = {
        "DB15941": "Casirivimab",
        "DB00927": "Famotidine",
        "DB15718": "Bamlanivimab",
        "DB00503": "Ritonavir",
        "DB11817": "Baricitinib",
        "DB01234": "Dexamethasone",
        "DB00207": "Azithromycin",
        "DB14126": "Tenofovir", 
        "DB00338": "Omeprazole",
        "DB01117": "Atovaquone",
        "DB00879": "Emtricitabine",
        "DB00198": "Oseltamivir", 
        "DB06273": "Tocilizumab",
        "DB12466": "Favipiravir", 
        "DB11767": "Sarilumab",
        "DB14761": "Remdesivir",
        "DB01611": "Hydroxychloroquine",
        "DB01072": "Atazanavir",
        "DB15940": "Imdevimab",
        "DB01601": "Lopinavir",
    }
    covid_drug_ids.update(positive_ratings)
    covid_disease_ids = {
        "C5203670": "Coronavirus disease 2019",
        "CN294793": "Asymptomatic COVID-19 infection",
        "CN294811": "COVID-19–associated multisystem inflammatory syndrome in children",
        "CN294802": "Critical COVID-19 infection",
        "CN294794": "Mild COVID-19 infection",
        "CN294795": "Moderate COVID-19 infection",
        "CN294804": "Presymptomatic COVID-19 infection",
        "CN294796": "Severe COVID-19 infection", 
    }
    di_drugbankid2drugname.update(covid_drug_ids)
    di_medgenid2diseasename.update(covid_disease_ids)

    with open(paths_global.data_folder+"drugbankid2drugname.pck", "wb") as f:
        pickle.dump(di_drugbankid2drugname, f)
    with open(paths_global.data_folder+"medgenid2diseasename.pck", "wb") as f:
        pickle.dump(di_medgenid2diseasename, f)
        
    approved_Covid["ind_name"] = ["C5203670"]*len(approved_Covid.index)
    di = {}
    for k in covid_drug_ids:
        di.setdefault(covid_drug_ids[k], k)
    approved_Covid["drug_name"] = [di[idx] for idx in approved_Covid["drug_name"]]
    approved_Covid.columns = ["drug_id", "ind_id", "rating"]
    dataset_covid = approved_Covid[["ind_id", "drug_id", "rating"]]
    dataset_covid.to_csv(featureless_folder+"Covid_set.csv")
    
dataset_covid = pd.read_csv(featureless_folder+"Covid_set.csv", index_col=0)
utils.print_dataset(dataset_covid, "ind_id", "drug_id", "rating")
dataset_covid.T

Ndrugs=28	Ndiseases=1
12 positive	17 negative	-1 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
ind_id,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,...,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670,C5203670
drug_id,DB14761,DB01611,DB00207,DB06273,DB01601,DB00503,DB01117,DB00338,DB00927,DB00198,...,DB16691,DB00026,DB16416,DB16355,DB16394,DB16393,DB15941,DB15940,DB11767,DB15718
rating,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,1,1,1,1,1,1,1,1,-1,-1


### 7. Epilepsy dataset

This dataset was compiled by Dr. Baptiste Porte at Inserm Unit 1141 (Neurodiderot) in January 2021 (DOI:). It can be downloaded from [this Zenodo page](https://zenodo.org/record/7974586).

In [21]:
print('epilepsy_folder="%s"' % paths_global.epilepsy_folder)

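## Download and extract the associated archive from Zenodo
## (assumes that "unzip" is installed and that the archive contains the two CSV files listed below)
if (not os.path.exists(paths_global.epilepsy_folder)):
    zip_file = paths_global.epilepsy_folder.rstrip("/")+".zip"
    ## sb.call waits for the command to finish, whereas sb.Popen returns immediately
    sb.call(["wget", "-O", zip_file, "https://zenodo.org/record/7974586/files/Porte_epilepsy_dataset.zip"])
    sb.call(["unzip", "-j", zip_file, "-d", paths_global.epilepsy_folder])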

epilepsy_files = [paths_global.epilepsy_folder+fname for fname in ["Liste anticonvu.csv", "Liste proconvu.csv"]]
for fname in epilepsy_files:
    assert os.path.exists(fname)

epilepsy_folder="../epilepsy_data/"


In [22]:
if (not os.path.exists(featureless_folder+"Epilepsy_set.csv")):
    ep_matrices = [pd.read_csv(fname, index_col=0)[['drug_name', 'verif 2']] for fname in epilepsy_files]
    dataset_epilepsy=pd.concat(tuple(ep_matrices), axis=0)
    
    pubchem_cid = dict(zip(list(dataset_epilepsy.index), list(dataset_epilepsy["drug_name"])))
    di_pubchemid2drugname.update(pubchem_cid)
    with open(paths_global.data_folder+"pubchemid2drugname.pck", "wb+") as f:
        pickle.dump(di_pubchemid2drugname, f)

    di = {di_drugbankid2drugname[k]:k for k in di_drugbankid2drugname}
    dataset_epilepsy["drug_id"] = [di.get(dataset_epilepsy["drug_name"].loc[idx], idx) for idx in dataset_epilepsy.index]
    dataset_epilepsy["rating"] = list(dataset_epilepsy["verif 2"])
    dataset_epilepsy["ind_id"] = ["C0014544"]*len(dataset_epilepsy.index)
    dataset_epilepsy = dataset_epilepsy[["ind_id", "drug_id", "rating"]]
    dataset_epilepsy.index = range(dataset_epilepsy.shape[0])

    di_medgenid2diseasename.update({"C0014544": "Epilepsy"})
    with open(paths_global.data_folder+"medgenid2diseasename.pck", "wb") as f:
        pickle.dump(di_medgenid2diseasename, f)
        
    dataset_epilepsy.to_csv(featureless_folder+"Epilepsy_set.csv")
        
dataset_epilepsy = pd.read_csv(featureless_folder+"Epilepsy_set.csv", index_col=0)
utils.print_dataset(dataset_epilepsy, "ind_id", "drug_id", "rating")
dataset_epilepsy.T

Ndrugs=69	Ndiseases=1
34 positive	35 negative	0 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,59,60,61,62,63,64,65,66,67,68
ind_id,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,...,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544
drug_id,DB00425,DB00514,DB00874,DB01144,DB00908,DB09210,DB00356,DB00404,DB00423,DB00475,...,DB01165,DB01330,DB01393,DB06774,DB00268,DB06151,DB09462,DB01026,DB06148,DB00291
rating,1,1,1,1,1,1,1,1,1,1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


### 8. Merge all datasets

A rating is inconsistent when it is contradicted by another rating (in {0,1,-1}) for the same drug-disease pair in the union of ratings of all considered datasets. We resolve inconsistent outcomes as follows:
- If at least one negative outcome is reported, then the outcome is negative (in order to be conservative with respect to drug recommendations).
- If at least one positive outcome and no negative outcome is reported, then the outcome is positive.
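
The rule above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the implementation of `utils.merge_ratings`: the helper name `merge_ratings_sketch` and the toy identifiers are hypothetical.

```python
import pandas as pd

def merge_ratings_sketch(dfs, ind_col="ind_id", drug_col="drug_id", rating_col="rating"):
    """Hypothetical helper: resolve conflicting ratings across several datasets."""
    ratings = pd.concat(dfs, axis=0, ignore_index=True)

    def resolve(values):
        # Any negative report wins (conservative choice)
        if (values == -1).any():
            return -1
        # Otherwise, a positive report wins over an unknown (0) one
        if (values == 1).any():
            return 1
        return 0

    merged = ratings.groupby([ind_col, drug_col], sort=False)[rating_col].agg(resolve)
    return merged.reset_index()

# Toy data, made up for illustration only
df1 = pd.DataFrame({"ind_id": ["C1", "C1", "C2"],
                    "drug_id": ["DB1", "DB2", "DB1"],
                    "rating": [1, 1, 0]})
df2 = pd.DataFrame({"ind_id": ["C1", "C2"],
                    "drug_id": ["DB1", "DB1"],
                    "rating": [-1, 1]})
merged = merge_ratings_sketch([df1, df2])
print(merged)
```

In this toy example, the pair (C1, DB1) is reported as both positive and negative and ends up negative, whereas (C2, DB1) is reported as unknown and positive and ends up positive.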

In [23]:
if (not os.path.exists(featureless_folder+"all_ratings_merged.csv")):
    dfs = [dataset_repodb, dataset_gottlieb, dataset_cdataset, 
       dataset_indep, dataset_covid, dataset_epilepsy]
    ratings_A = utils.merge_ratings(dfs, "ind_id", "drug_id", "rating")
    
    utils.print_dataset(ratings_A, "ind_id", "drug_id", "rating")
    
    #A = utils.ratings2matrix(ratings_A, "ind_id", "drug_id", "rating")
    #A.columns = [x.split(".")[0] for x in A.columns]
    #A.to_csv(featureless_folder+"all_ratings_merged.csv")
    ratings_A.to_csv(featureless_folder+"all_ratings_merged.csv")

#A = pd.read_csv(featureless_folder+"all_ratings_merged.csv", index_col=0)
ratings_A = pd.read_csv(featureless_folder+"all_ratings_merged.csv", index_col=0)

A = utils.ratings2matrix(ratings_A, "ind_id", "drug_id", "rating")
print("Sparsity = "+str(utils.compute_sparsity(A))+"%")

utils.print_dataset(ratings_A, "ind_id", "drug_id", "rating")
ratings_A.T

Sparsity = 0.34634607963172265%
Ndrugs=1622	Ndiseases=1703
11339 positive	244 negative	2750683 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11573,11574,11575,11576,11577,11578,11579,11580,11581,11582
ind_id,C0272275,C0585362,C3163899,C1319317,C0280324,C0007102,C0010674,C0079773,C0003873,C3495559,...,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544
drug_id,DB00001,DB00002,DB00002,DB00002,DB00002,DB00002,DB00003,DB00004,DB00005,DB00005,...,DB01165,DB01330,DB01393,DB06774,DB00268,DB06151,DB09462,DB01026,DB06148,DB00291
rating,1,1,1,1,1,1,1,1,1,1,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [24]:
assert all([x[0] in ["C", "D"] and x[:2]!="DB" for x in ratings_A["ind_id"].unique()])
assert all([str(x)[0] not in ["C", "D"] or str(x)[:2]=="DB" for x in ratings_A["drug_id"].unique()])

### I. Convert all disease/drug ids to the same notation

Convert all OMIM/RepoDB disease ids into MedGen concept ids.

In [25]:
ratings_A = pd.read_csv(featureless_folder+"all_ratings_merged.csv", index_col=0)

disease_id_list = list(ratings_A["ind_id"].unique())
disease_list = utils.disease_id2name(disease_id_list, paths_global.data_folder, True)
for ix, x in enumerate(disease_list):
    if ("Moved to " in x):
        disease = "D"+"".join(x.split("Moved to ")[-1])
        disease_list[ix] = utils.disease_id2name([disease], paths_global.data_folder, True)[0]
        
convert_dnames = {
    "Metastatic Renal Cell Cancer": "Renal Cell Cancer",
    "Bladder muscle dysfunction - overactive": "Overactive bladder",
    "urinary tract infection": "URINARY TRACT INFECTION",
    "candidal peritonitis": "peritonitis",
    "Accidental poisoning by lead and its compounds and fumes": "Lead poisoning",
    "Sprains and Strains": "Ehler-Danlos",
    "Meningitis due to Bacteroides": "Bacteroides infectious disease", 
    "Endocarditis haemophilus": "Bacterial endocarditis", 
    "Accidental poisoning by methyl alcohol": "Alcohol dependence",
    "Generalized glycogen storage disease of infants": "Glycogen storage disease, type II",
    "Arsenic Poisoning": "Arsenic",
    "Thallium poisoning": "Thallium",
    "Antimetabolite overdose": "Antimetabolite adverse reaction",
    "Arthropod bite wound": "wound", 
    "colon cancer liver metastasis": "Colorectal Cancer",
    "PYELONEPHRITIS E COLI": "Pyelonephritis",  
    "Head and neck cancer metastatic": "Head and neck carcinoma",
    "Carcinoma breast stage IV": "Breast carcinoma",
    "Cystitis escherichia": "Bacterial cystitis",
    "Central retinal vein occlusion with macular edema": "Central retinal vein occlusion",
    "Cicatrix, Hypertrophic": "Cicatrix",
    "Contact dermatitis due to Rhus diversiloba": "Toxicodendron dermatitis",
    "Haemophilus parainfluenzae pneumonia": "Haemophilus influenzae pneumonia",
    "Escherichia coli septicemia": "Escherichia coli meningitis",
    "Diarrhoea predominant irritable bowel syndrome": "irritable bowel syndrome",
    "Malignant neoplasm of stomach stage IV": "Gastric cancer",
    "Tinea corporis (disorder)": "Tinea corporis",
    "Urinary tract infection fungal": "Fungal infectious disease",
    "Pseudomonas aeruginosa meningitis": "Pseudomonas aeruginosa infectious disease",
    "Humoral hypercalcemia of malignancy (disorder)": "Humoral hypercalcemia of malignancy",
    "Staphylococcal bacteraemia": "bacteremia",
    "Septicemia due to anaerobes": "Anaerobic bacteria infectious disease",
    "Unspecified Abortion": "Spontaneous abortion",
    "Pinta": "Pinta disease",
    "Poisoning by acetaminophen": "Acetaminophen response",
    "anticholinergic toxicity": "Anticholinergic Syndrome",
    "Toxic effect of cyanide": "cyanide",
    "Poisoning by sulfadiazine": "Sulfadiazine adverse reaction",
    "Poisoning by pyrimethamine": "Pyrimethamine adverse reaction",
    "Poisoning by digitalis glycoside": "Adverse reaction to Digitalis glycoside",
    "Septicemia due to Bacteroides": "Sepsis caused by Bacteroides",
    "Nutritional deficiency associated with AIDS": "Nutritional deficiency with AIDS",
    "Malignant lymphoma, lymphocytic, intermediate differentiation, diffuse": "Lymphoma",
    "Branch retinal vein occlusion with macular edema": "Branch retinal vein occlusion",
    "Moraxella catarrhalis pneumonia": "Acute Moraxella catarrhalis bronchitis",
    "Septicemia candida": "candida infection",
    "Ethylene glycol poisoning": "Ethylene glycol", 
    "Ethylene glycol poisoning (disorder)": "Ethylene glycol", 
    "Heparin overdose": "Heparin-induced thrombocytopenia",
    "Proteus septicemia": "Proteus infectious disease",
    "Gastric spasm (disorder)": "Gastric spasm",
    "Septicemia due to enterococcus": "Enterococcus faecalis infection",
    "Septic arthritis haemophilus": "Septic arthritis",
    "Renal disease with edema NOS": "Nephrotic syndrome, type 3",
    "Citrobacter sepsis": "Infection caused by Citrobacter",
    "URINARY TRACT INFECTION CITROBACTER": "Infection caused by Citrobacter",
    "Acute ST segment elevation myocardial infarction (disorder)": "ST-elevation myocardial infarction",
    "Toxoplasmosis associated with AIDS": "AIDS-related Toxoplasmosis",
    "Bacteroides empyema": "Bacteroides",
    "pharyngitis due to Haemophilus influenzae": "Haemophilus influenzae", 
    "transvaginal ultrasound: length of cervix": "Length of uterine cervix",
    "Fungal septicemia": "Sepsis due to fungus",
    "Streptococcal endocarditis": "Subacute bacterial endocarditis",
    "Lipodystrophy due to Human immunodeficiency virus infection and antiretroviral therapy": "HIV Lipodystrophy",
    "Burn injury":"Scarring",
    "Poisoning by opiate analgesic drug":"opiate analgesic",
}

def get_cid_job(did, dname):
    ## Return dname unchanged if it is already a concept id ("C" followed by a positive integer)
    try:
        if ((dname[0]=="C") and (int(dname[1:].split(".")[0])>0)):
            return dname
    except (ValueError, IndexError, TypeError):
        pass
    ## Otherwise, query the concept id associated with the disease name
    try:
        return utils.get_concept_id(dname, delay=0)
    except Exception:
        ## Fall back on the manually curated synonym, if any
        if ((dname in convert_dnames) and (convert_dnames[dname] is not None)):
            return get_cid_job(did, convert_dnames[dname])
        raise ValueError("did=%s dname=%s" % (did,dname))

if (not os.path.exists(featureless_folder+"disease_list_names.csv")):
    disease_cid_list = Parallel(n_jobs=njobs, backend='loky')(delayed(get_cid_job)(did, dname) for did, dname in enumerate(disease_list))
    pd.DataFrame(disease_list, index=disease_cid_list, columns=["disease_name"]).to_csv(featureless_folder+"disease_list_names.csv")
disease_cid_list = list(pd.read_csv(featureless_folder+"disease_list_names.csv", index_col=0).index)

assert len(disease_cid_list)==len(disease_id_list)==len(disease_list)

Add the (concept id, disease name) pairs to the dictionary of disease identifiers.

In [26]:
ratings_A = pd.read_csv(featureless_folder+"all_ratings_merged.csv", index_col=0)

di = dict(zip([x.split(".")[0] for x in disease_cid_list], [x.split(".")[0] for x in disease_list]))
di_medgenid2diseasename.update(di)
with open(cids_file, "wb+") as f:
    pickle.dump(di_medgenid2diseasename, f)
ids_none = [i for i, d in enumerate(disease_cid_list) if (d is not None)]
A = A[A.columns[ids_none]]
A.columns = [disease_cid_list[i].split(".")[0] for i in ids_none]

A.to_csv(featureless_folder+"all_ratings_converted_diseases.csv")

for x in list(di_medgenid2diseasename.keys()):
    if (len(x.split("."))>1):
        di_medgenid2diseasename.setdefault(x.split(".")[0], di_medgenid2diseasename[x].split(".")[0])
        di_medgenid2diseasename.pop(x)
    elif (str(x) in ["None", "nan"]):
        di_medgenid2diseasename.pop(x)

A

drug_id,C0272275,C0585362,C3163899,C1319317,C0280324,C0007102,C0010674,C0079773,C0003873,CN263340,...,C0011854,C1848042,C1838261,C1832605,C1832474,C1866519,C1866041,C1866040,C1864068,C5203670
104999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
442021,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
442872,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5917,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
DB16355,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16393,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16394,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16416,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
assert all([((c[:2]=="CN") and (int(c[2:])>0)) or ((c[0]=="C") and (int(c[1:])>0))  for c in A.columns])

### II. Convert all drug ids to DrugBank ids, get PubChem CIDs

Convert the remaining PubChem drug ids into DrugBank ids whenever possible. Otherwise, use the PubChem CID prefixed by "CID".
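
As a rough sketch of this fallback, assuming the PubChem-to-DrugBank correspondence is available as a plain dictionary (in the notebook it is obtained through `utils.get_pubchem_drugbank`), the conversion could look as follows. The helper name `to_drugbank_or_cid` and the `lookup` table are hypothetical.

```python
def to_drugbank_or_cid(drug_ids, pubchem2drugbank):
    """Hypothetical helper: map PubChem CIDs to DrugBank ids, else prefix them with "CID"."""
    converted = []
    for d in drug_ids:
        d = str(d)
        if d.startswith("DB"):
            # Already a DrugBank id: keep it unchanged
            converted.append(d)
        else:
            # Use the DrugBank id if one is known, else the prefixed PubChem CID
            converted.append(pubchem2drugbank.get(d) or "CID" + d)
    return converted

# Toy lookup table for illustration only
lookup = {"5917": "DB13415"}
ids = to_drugbank_or_cid(["104999", "5917", "DB00001"], lookup)
print(ids)
```

The "CID" prefix keeps the two identifier families distinguishable by their first characters, which is what the assertion on the index of the final matrix relies on.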

In [28]:
A = pd.read_csv(featureless_folder+"all_ratings_converted_diseases.csv", index_col=0)
pubchem_ids = [k for k in A.index if (k[:2]!="DB")]
drugbank_ids = utils.get_pubchem_drugbank(pubchem_ids, paths_global.data_folder)
for k, d in enumerate(drugbank_ids):
    if (d is None):
        drugbank_ids[k] = "CID"+pubchem_ids[k]
drugbank_di = dict(zip(pubchem_ids, drugbank_ids))
A.index = [drugbank_di.get(k,k) for k in A.index]

A.columns = [a.split(".")[0] for a in A.columns]
A = A.iloc[:,~A.columns.duplicated()]
A.to_csv(featureless_folder+"all_ratings.csv")
A

Unnamed: 0,C0272275,C0585362,C3163899,C1319317,C0280324,C0007102,C0010674,C0079773,C0003873,CN263340,...,C1865810,C1848042,C1838261,C1832605,C1832474,C1866519,C1866041,C1866040,C1864068,C5203670
CID104999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CID442021,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CID442872,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB13415,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
DB16355,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16393,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16394,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16416,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Populate drug names associated with DrugBank identifiers

In [29]:
A = pd.read_csv(featureless_folder+"all_ratings.csv", index_col=0)
drugbank_ids = [a if (a[:2]=="DB") else None for a in list(A.index)]
with open(paths_global.data_folder+"drugbankid2drugname.pck", "rb") as f:
    di = pickle.load(f)
drugnames_from_drugbank = [di.get(d) for d in drugbank_ids]

Populate drug names associated with PubChem identifiers

In [30]:
A = pd.read_csv(featureless_folder+"all_ratings_converted_diseases.csv", index_col=0)
pubchem_ids = [int(a) if (a[:2]!="DB") else None for a in list(A.index)]
drugbank_ids = [a if (a[:2]=="DB") else None for a in list(A.index)]
drugnames_from_pubchem = [utils.get_pubchem_name(idx) if (idx is not None) else idx for idx in pubchem_ids]
drugnames_from_pubchem = [None if ((x is None) or (len(x)==0)) else x[0] for x in drugnames_from_pubchem]
print(len([x for x in drugnames_from_pubchem if (x is not None)]))

4


In [31]:
drugbank_ids = [a if (a[:2]=="DB") else None for a in list(A.index)]
drugbank_notnone = [i for i, x in enumerate(drugbank_ids) if (x is not None)]

pubchem_ids = utils.get_pubchem_id([x for x in drugbank_ids if (x is not None)]) 
df = pd.DataFrame(pubchem_ids, columns=["drug_name"], index=[drugbank_ids[i] for i in drugbank_notnone])
di = df.dropna().astype(int).to_dict()["drug_name"]
di_pubchem = {di[k]: k for k in di}
di_pubchemid2drugname.update(di_pubchem)
with open(pubchem_file, "wb+") as f:
    pickle.dump(di_pubchemid2drugname, f)

In [32]:
A = pd.read_csv(featureless_folder+"all_ratings.csv", index_col=0)
assert all([i[:2] in ["DB", "CI"] for i in A.index])

### III. Final matrix

In [33]:
A = utils.load_dataset("FEATURELESS", save_folder=paths_global.data_folder)["ratings_mat"]
A

Unnamed: 0,C0272275,C0585362,C3163899,C1319317,C0280324,C0007102,C0010674,C0079773,C0003873,CN263340,...,C1865810,C1848042,C1838261,C1832605,C1832474,C1866519,C1866041,C1866040,C1864068,C5203670
CID104999,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CID442021,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CID442872,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB13415,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB00001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
DB16355,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16393,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16394,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DB16416,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
ratings_A = utils.matrix2ratings(A, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A))+"%")
utils.print_dataset(ratings_A, "ind_id", "drug_id", "rating")
ratings_A.T

Sparsity = 0.33728805072386664%
Ndrugs=1600	Ndiseases=1576
8397 positive	225 negative	2512978 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8612,8613,8614,8615,8616,8617,8618,8619,8620,8621
ind_id,C0006840,C0006840,C0006840,C0006840,C0149893,C0035235,C0339170,C0042510,C0149782,C0279639,...,C1851649,C1851649,C1851649,C1851649,C1851649,C1851649,C1851649,C1851649,C1851649,C1851649
drug_id,CID104999,CID442021,CID442872,DB13415,DB00001,DB00002,DB00002,DB00002,DB00002,DB00002,...,DB14761,DB15661,DB15718,DB15940,DB15941,DB16355,DB16393,DB16394,DB16416,DB16691
rating,-1,-1,-1,-1,1,1,1,1,1,1,...,-1,1,-1,1,1,1,1,1,1,1
