# Creating the "FEATURELESS" dataset 

Version 1.0.0 (December 28th 2022). Run this notebook before running the other two. It compiles a matrix A of drug-disease matchings which is the basis for the PREDICT and TRANSCRIPT datasets. Ensure that you are connected to the Internet before running this notebook.

## Libraries

In [1]:
import numpy as np
import pandas as pd
import subprocess as sb
import os

from time import sleep
import requests
import pickle

from joblib import Parallel, delayed
import multiprocessing
njobs=multiprocessing.cpu_count()-1

import sys
sys.path.insert(0, "../../src/")

import utils
import paths_global

## Local paths

In [2]:
## Where database files are stored
print('root_folder="%s"' % paths_global.root_folder)
## Where intermediary files are stored
print('data_folder="%s"' % paths_global.data_folder)

root_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/"
data_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/RECeSS/cfdr/data/"


In [3]:
featureless_folder = paths_global.data_folder+"FEATURELESS_v1.0.0/"
os.makedirs(featureless_folder, exist_ok=True)
## Where FEATURELESS dataset files are stored
print('featureless_folder="%s"' % featureless_folder)

featureless_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/RECeSS/cfdr/data/FEATURELESS_v1.0.0/"


## Drug and disease identifiers

Drug identifiers are DrugBank identifiers when they exist, otherwise PubChem CIDs. Disease identifiers are OMIM identifiers when they exist, otherwise MedGen concept IDs ([CUI](https://www.ncbi.nlm.nih.gov/medgen/docs/help/)).
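
As an illustration of these conventions, the sketch below guesses the source of an identifier from its shape. The function name `identifier_kind` and the regular expressions are assumptions based only on the identifier values that appear in this notebook (e.g. `DB00001`, `DBCAT004271`, `104999`, `D131200`, `C0014544`, `CN294793`); it is not part of `utils`.

```python
import re

def identifier_kind(identifier):
    """Guess the source database of an identifier from its shape (hypothetical helper)."""
    s = str(identifier)
    # DrugBank: "DB" followed by digits; "DBCAT..." identifiers also appear in this notebook
    if re.fullmatch(r"DB\d+", s) or s.startswith("DBCAT"):
        return "DrugBank"
    # PubChem CIDs are plain integers
    if re.fullmatch(r"\d+", s):
        return "PubChem"
    # OMIM numbers are prefixed with "D" in this dataset
    if re.fullmatch(r"D\d+", s):
        return "OMIM"
    # Concept IDs look like "C0014544" or "CN294793"
    if re.fullmatch(r"CN?\d+", s):
        return "MedGen"
    return "unknown"

kinds = [identifier_kind(x) for x in ["DB00001", 104999, "D131200", "C0014544", "CN294793"]]
```

Checking `DB...` before `D...` matters, since a DrugBank identifier also starts with the letter "D".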

#### Database access

To use the code in the notebooks *PREDICT_dataset.ipynb* and *TRANSCRIPT_dataset.ipynb*, you need to register with [LINCS L1000](https://clue.io/developer-resources#apisection), [STRING](https://string-db.org/) and [DisGeNET](https://www.disgenet.org/). Registration with these databases is free (it often only requires an academic e-mail address) and not time-limited, but it is mandatory.

##### a. Registration to the DisGeNET database

Click on [this link](https://www.disgenet.org/signup/) to sign up to DisGeNET. Once you are registered, open an empty .TXT file, and write down

- on the first line: the e-mail address used for registration
- on the second line: the chosen password

Save the file, and replace the corresponding path in *paths_global.py*.
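
A minimal sketch of how such a credentials file could be read back in Python. The helper `read_credentials` and the sample values are hypothetical (they are not part of `utils` or `paths_global`); the only assumption is the one-field-per-line layout described above.

```python
import os
import tempfile
from pathlib import Path

def read_credentials(path, n_fields=2):
    """Return the first n_fields non-empty lines of a credentials file (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    fields = [line.strip() for line in lines if line.strip()]
    if len(fields) < n_fields:
        raise ValueError("expected %d lines in %s, found %d" % (n_fields, path, len(fields)))
    return tuple(fields[:n_fields])

# Demonstration on a temporary file with made-up values
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("jane.doe@example.org\nnot-a-real-password\n")
email, password = read_credentials(tmp.name)
os.remove(tmp.name)
```

The same helper would cover the three-line LINCS file with `n_fields=3`.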

In [4]:
## Where DisGeNet credentials are stored
print('disgenet_file="%s"' % paths_global.disgenet_file)

disgenet_file="../credentials/credentials_DISGENET.txt"


##### b. Registration to the STRING database

The STRING database requires identification of the person sending requests. Write your e-mail address on the first line of an empty .TXT file, and replace the corresponding path in *paths_global.py*.

In [5]:
## Where STRING credentials are stored
print('string_file="%s"' % paths_global.string_file)

string_file="../credentials/credentials_STRING.txt"


##### c. Registration to the LINCS L1000 database access CLUE.io

Click on [this link](https://clue.io/lincs) to sign up to CLUE.io. Once you are registered, open an empty .TXT file, and write down

- on the first line: the e-mail address used for registration
- on the second line: the chosen password
- on the third line: the user key you were assigned

Save the file, and replace the corresponding path in *paths_global.py*.

In [6]:
## Where LINCS credentials are stored
print('lincs_file="%s"' % paths_global.lincs_file)

lincs_file="../credentials/credentials_LINCS.txt"


##### d. Registration to the [DrugBank](https://go.drugbank.com/) database and file downloading (v. 5.1.8.)

You first need to request access to the DrugBank database at *info@drugbank.com* (free for 6 months for academic purposes). Store your credentials (first line: login, second line: password) in the file located at `drugbank_file`.

In [7]:
## Where DrugBank credentials are stored
print('drugbank_file="%s"' % paths_global.drugbank_file)
## Where DrugBank files are stored
print('drugbank_folder="%s"' % paths_global.drugbank_folder)

## File download
get_str = lambda cmd : str(sb.check_output(cmd, shell=True).decode("utf-8").split("\n")[0])
login, pwd = [get_str(cmd+" -n1 "+paths_global.drugbank_file) for cmd in ["head", "tail"]]
base_url = "https://go.drugbank.com/releases/5-1-8/downloads/"
filenames = {"COMPLETE DATABASE": ["all-full-database", "drugbank_all_full_database.xml.zip"],
            "STRUCTURES/Structure External Links": ["all-structure-links", "drugbank_all_structure_links.csv.zip"]}
for section in filenames:
    fname = paths_global.drugbank_folder+section+"/"+filenames[section][-1]
    if (not os.path.exists(fname)):
        ## Section names contain spaces, so paths are quoted in the shell commands
        sb.call("mkdir -p '"+paths_global.drugbank_folder+section+"/'", shell=True)
        sb.call("curl -Lfv -o '"+fname+"' -u '"+login+":"+pwd+"' "+base_url+filenames[section][0], shell=True)

drugbank_file="../credentials/DrugBank_credentials.txt"
drugbank_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/DrugBank/"


###### d.a [DrugBank](https://go.drugbank.com/) identifiers for drug components

In [8]:
if (not os.path.exists(paths_global.data_folder+"drugbankid2drugname.pck")):
    ## Find drug names from DrugBank identifiers (using file from DrugBank after registration)
    path=paths_global.drugbank_folder+"COMPLETE DATABASE"+"/"
    drugbank_db_file = path+"full database.xml"
    if (not os.path.exists(drugbank_db_file)):
        ## Quote the path (it contains a space) and extract next to the archive
        sb.call("unzip '"+path+"drugbank_all_full_database.xml.zip' -d '"+path+"'", shell=True)
    cmd_file="cat \'"+drugbank_db_file+"\'"
    cmd_drugbank_names="grep -e '^  <name>' | sed 's/  <name>//g' | sed 's/<\/name>//g'"
    cmd_drugbank_ids="grep -e '^  <drugbank-id primary=\"true\">' | sed 's/<\/drugbank-id>//g' | sed 's/<drugbank-id primary=\"true\">//g' | sed 's/ //g'"
    drugbank_ids=sb.check_output(cmd_file+" | "+cmd_drugbank_ids, shell=True).decode("utf-8").split("\n")
    drugbank_names=sb.check_output(cmd_file+" | "+cmd_drugbank_names, shell=True).decode("utf-8").split("\n")
    di_drugbankid2drugname = dict(zip(drugbank_ids, drugbank_names))
    ## Manually retrieved from DrugBank website
    new_names = {"DB00510": "Divalproex sodium", 
        "DB01402": "Bismuth", 
        "DBCAT004271": 'Asparaginase',
        "DB00371" : "Meprobamate",
        "DB00394": "Beclomethasone dipropionate",
        "DB00422": "Methylphenidate",
        "DB00462": "Methscopolamine bromide",
        "DB00464": "Sodium tetradecyl sulfate",
        "DB00525": "Tolnaftate",
        "DB00527": "Cinchocaine",
        "DB00563": "Methotrexate",
        "DB05381": "Histamine",
        "DB00931": "Metacycline",
        "DB00326": "Calcium glucoheptonate",
        "DB14520": "Tetraferric tricitrate decahydrate",
        "DB00717": "Norethisterone",
        "DB01258": "Aliskiren",
        "DB00006" : "Bivalirudin",
    }
    di_drugbankid2drugname.update(new_names)
    with open(paths_global.data_folder+"drugbankid2drugname.pck", "wb") as f:
        pickle.dump(di_drugbankid2drugname, f)
else:
    with open(paths_global.data_folder+"drugbankid2drugname.pck", "rb") as f:
        di_drugbankid2drugname = pickle.load(f)
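
The `grep`/`sed` pipeline above depends on the indentation of the XML file. As an alternative, here is a hedged sketch that streams the file with the standard-library parser. It assumes only the element names visible in the patterns above (`drugbank-id` with `primary="true"`, and `name`) nested in top-level `drug` elements; the function `drugbank_ids_to_names` and the toy XML snippet are illustrative, not part of the original pipeline.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def drugbank_ids_to_names(source):
    """Map primary DrugBank identifiers to names for top-level <drug> elements."""
    mapping, depth = {}, 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # depth == 1 here means elem is a direct child of the root element
        if depth == 1 and local(elem.tag) == "drug":
            primary_id, name = None, None
            for child in elem:
                if local(child.tag) == "drugbank-id" and child.get("primary") == "true":
                    primary_id = child.text
                elif local(child.tag) == "name" and name is None:
                    name = child.text
            if primary_id:
                mapping[primary_id.strip()] = (name or "").strip()
            elem.clear()  # free memory held by the processed entry
    return mapping

# Toy snippet for demonstration only; the nested <name> must be ignored
toy_xml = b"""<drugbank xmlns="http://www.drugbank.ca">
  <drug type="biotech">
    <drugbank-id primary="true">DB00001</drugbank-id>
    <drugbank-id>SECONDARY01</drugbank-id>
    <name>Lepirudin</name>
    <drug-interactions>
      <drug-interaction><drugbank-id>DB06605</drugbank-id><name>Apixaban</name></drug-interaction>
    </drug-interactions>
  </drug>
</drugbank>"""
toy_mapping = drugbank_ids_to_names(io.BytesIO(toy_xml))
```

Tracking the nesting depth plays the role of the `^  <name>` anchor in the shell version: names that belong to nested entries (such as interactions) are skipped.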

##### e. [OMIM](https://www.omim.org/) identifiers for diseases

These are OMIM identifiers prefixed with the letter "D", which is needed to obtain valid column names. You need to request registration (valid for one year, free for academic purposes) to download the files. A link to these files is then sent by e-mail.

In [9]:
## Where OMIM files are stored
print('omim_folder="%s"' % paths_global.omim_folder)

if (not os.path.exists(paths_global.data_folder+"omimid2diseasename.pck")):
    disease_file=paths_global.omim_folder+"mimTitles.txt"
    disease_df = pd.read_csv(disease_file, sep="\t", header=2, index_col=1)[["Preferred Title; symbol"]].dropna()
    disease_df.index = ["D"+str(int(x)) for x in disease_df.index]
    disease_df["Preferred Title; symbol"] = [x[0]+x[1:].lower() for x in disease_df["Preferred Title; symbol"]]
    di_omimid2diseasename = dict(zip(list(disease_df.index), list(disease_df["Preferred Title; symbol"])))
    with open(paths_global.data_folder+"omimid2diseasename.pck", "wb") as f:
        pickle.dump(di_omimid2diseasename, f)
else:
    with open(paths_global.data_folder+"omimid2diseasename.pck", "rb") as f:
        di_omimid2diseasename = pickle.load(f)

omim_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/OMIM/"


##### f. [SIDER](http://sideeffects.embl.de/) database of drug side effects

The access to this database is free and without registration. Files are located at this [page](http://sideeffects.embl.de/download/).

In [10]:
## Where SIDER files are stored
print('sider_folder="%s"' % paths_global.sider_folder)

base_url = "http://sideeffects.embl.de/media/download/"
filenames = ["drug_names.tsv", "meddra_all_se.tsv.gz"]

for fname in filenames:
    fm = paths_global.sider_folder+(fname if (fname[-3:]!=".gz") else fname[:-3])
    if (not os.path.exists(fm)):
        sb.call("mkdir -p "+paths_global.sider_folder, shell=True)
        sb.call("wget -N -O "+paths_global.sider_folder+fname+" "+base_url+fname, shell=True)
        if (".gz" == fname[-3:]):
            sb.call("gzip -d "+paths_global.sider_folder+fname+" && rm -f "+paths_global.sider_folder+fname, shell=True)

sider_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/SIDER/"


#### All dictionaries for identifiers

In [11]:
assert os.path.exists(paths_global.data_folder+"drugbankid2drugname.pck")
with open(paths_global.data_folder+"drugbankid2drugname.pck", "rb") as f:
    di_drugbankid2drugname = pickle.load(f)
    
assert os.path.exists(paths_global.data_folder+"omimid2diseasename.pck")
with open(paths_global.data_folder+"omimid2diseasename.pck", "rb") as f:
    di_omimid2diseasename = pickle.load(f)
    
cids_file = paths_global.data_folder+"medgenid2diseasename.pck"
if (not os.path.exists(cids_file)):
    di_medgenid2diseasename = {}
else:
    with open(cids_file, "rb") as f:
        di_medgenid2diseasename = pickle.load(f)
        
pubchem_file = paths_global.data_folder+"pubchemid2drugname.pck"
if (not os.path.exists(pubchem_file)):
    di_pubchemid2drugname = {}
else:
    with open(pubchem_file, "rb") as f:
        di_pubchemid2drugname = pickle.load(f)

## Build matrix A : $N_S \times N_D$ of drug-disease associations

### 1. [RepoDB](http://apps.chiragjpgroup.org/repoDB/) dataset

Drug identifiers are [DrugBank](https://go.drugbank.com/) identifiers. Disease identifiers are Concept ID ([CUI](https://www.ncbi.nlm.nih.gov/medgen/docs/help/)) identifiers. The full dataset should be downloaded from [this website](http://apps.chiragjpgroup.org/repoDB/) (Tab "Download", then button "Download the full repoDB Dataset"). Then enter the path to this file in *paths_global.py*.

In [12]:
print('repodb_folder="%s"' % paths_global.repodb_folder)
## Download the full database on the website
assert os.path.exists(paths_global.repodb_folder+"full.csv")

repodb_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/RepoDB/"


In [13]:
if (not os.path.exists(featureless_folder+"RepoDB_set.csv")):
    dataset = pd.read_csv(paths_global.repodb_folder+"full.csv", header=0)

    ## Select late phase trials
    late_phase_trials=(dataset["phase"]=="Phase 3")|(dataset["phase"]=="Phase 2/Phase 3")|(dataset["phase"]=="Phase 2")|pd.isnull(dataset["phase"])
    dataset = dataset.loc[late_phase_trials]

    ## Remove suspended/withdrawn trials
    dataset = dataset.query("status!='Suspended'").query("status!='Withdrawn'")
    test_outcome = lambda out, outs : all([not (o in out.lower()) for o in outs])
    test_all_outcomes = lambda out_ls, outs : list(map(lambda x : str(x)=="nan" or test_outcome(x, outs), list(out_ls)))

    ## Remove unspecified outcomes in terminated trials
    dataset = dataset.loc[(dataset["status"]!="Terminated")|(~pd.isnull(dataset["DetailedStatus"]))]
    unspecified_outcomes = ["detailed description", "study completed per investigator", "study closed by the nci", "the first phase was completed", "administrative reasons", "study terminated", "study has finished", "administratively complete", "data collection complete", "no records are available"]
    unspecified_outcomes += ["interim analysis indicated study should be terminated", "trial stopped on sept 24, 2007", "irb study closure"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], unspecified_outcomes))]

    ## Remove trials with slow/low accrual/enrollment/recruitment
    slow_outcomes = [adj+" "+noun for adj in ["slow", "diminished", "difficult", "low", "poor", "bad", "lack of", "inadequate", "recruitable", "insufficient"] for noun in ["accural", "recruitement", "inclusion", "subject", "accrual", "rectruitment", "participant", "patient", "enrollment", "enrolment", "recruitment"]]
    slow_outcomes += [a+" "+b for a in ["unable", "failure", "difficult", "inability"] for b in ["to recruit", "to enroll"]]
    slow_outcomes += [a+" "+b for a in ["enrolling", "enrollment", "recruitment", "difficulty", "difficulty of", "recruitment of"] for b in ["recrutement", "participant", "difficulties", "challenge", "in enrolling"]]
    slow_outcomes += ["recruitment", "sample size", "recruiting", "not enrolled completely", "lack of eligible patients", "enough patient", "enrollment number"]
    slow_outcomes += ["inclusion rate", "target enrollment", "feasibility", "discontinued", "enrollee", "recruit new patient"]
    slow_outcomes += ["accrual was not optimized", "low rate of accrual", "short of participants", "accrual goal for interventional part not achievable"]
    slow_outcomes += ["enrollment issues", 'difficulty to recruit patients', "enrollment very slow", "accrual was too slow", "enrollment was much slower than anticipated", "only 2 subjects enrolled", "recruitement did not meet expectations", 'closed to enrollment', "difficulty in accruing subjects", "no more eligible patients"]
    slow_outcomes += ["5 subjects could be enrolled", "subject recuitment halted", "accrual was very low", "not be able to reach stated accrual"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], slow_outcomes))]

    ## Remove trials which were ended by sponsor/lack of funding/access to drug
    sponsor_outcomes = ["medication supply issue", "due study team travel restrictions", "could not receive the support from the national medical insurance", "lack of experimental medication", "withdrawal of support from our collaborator", "financial resource limitations", "terminated study due to lack of funds", "unavailibility of methylnaltrexone ", "insurance companies to cover", "stopped drug delivery", "change in the national policy of medications", "sponsor", "limited availability of drug", "drug supply", "study medication expired", "not able to obtain the study drug", "registration of the medicine is no longer being pursued", "company decision", "competitor study", "drugs unavailable", "unable to secure drug", "access to study drug", "competition", "competing", "funding", "business", "supply study drug", "drug availability", "drug no longer available", "insurance coverage"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], sponsor_outcomes))]

    ## Remove trials which were ended for scientific reasons
    scientific_outcomes = ["changes in standard care", "hri no longer conducting research", "to be compliant with the timelines", "was determined not feasible", "changing aetiology of squamous cell carcinoma", "preliminary analysis", "change in development plan", "change in standard of care", "h1n1 pandemic is now over", "medical/ethical reasons", "original investigator left", "pi ", "this study will not be written up", "h1n1 pandemic concluded", "fda placed a clinical hold", "on hold at the request of the fda", "trial not progressing toward scientific goals", "pi's ", "some of the researchers finished their participation in the study", 'fda hold may 2007', "major revisions needed", "protocol modification", "design changes were needed", "non-compliance"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], scientific_outcomes))]

    ## Remove trials (lack of clear outcome)
    other_outcomes = ['safety concern of active control drug', 'few delirious patients were enrolled', 'trial design contingent on RFA optimization', 'efficacy interim analysis as per protocol', 'interim analysis showed that the primary endpoint would not be met', 'study was never initiated under new location/provider group', 'to complete the study in an appropriate time frame', 'malaria prev. fell in the study area, so we cannot evaluate the primary endpoint', 'sufficient number to reach the primary endpoint and as planned', 'sufficient number of subjects accrued to conduct analysis', 'aoi pharma terminated the license agreement', 'dsmb stopped study because placebo arm had more adverse events', 'drug was no longer available', 'pfizer has terminated the execution of this protocol', "clinical trial terminated due to results from recent nonclinical studies", "data from the c08 study and avant study", "technical/operational issues", "study closed and subject follow-up completed following analysis of blinded study data"]
    dataset = dataset.loc[(test_all_outcomes(dataset["DetailedStatus"], other_outcomes))]
    dataset = dataset.loc[(dataset["DetailedStatus"]!="Completed")&(dataset["DetailedStatus"]!="Terminated")]

    ## Create positive and negative outcome sets
    test_outcome = lambda out, outs : any([(o in out.lower()) for o in outs])
    test_all_outcomes = lambda out_ls, outs : list(map(lambda x : str(x)=="nan" or test_outcome(x, outs), list(out_ls)))
    detailed_status = dataset['DetailedStatus']
    ## Work on a copy to avoid chained-assignment issues
    dataset = dataset[['drug_name', 'drug_id', 'ind_name', 'ind_id', 'status']].copy()
    dataset["status"] = "Negative"
    positive_outcomes = ["publishing the results", "demonstrating efficacy", 'had already been prescribed Cymbalta', 'drug now on market', 'PXD101-CLN-19']
    dataset.loc[test_all_outcomes(detailed_status, positive_outcomes), "status"] = "Positive"
    print(dataset["status"].value_counts())
    
    dataset["rating"] = [int(x=="Positive")-int(x=="Negative") for x in dataset["status"]]
    dataset_repodb = dataset[["ind_id", "drug_id", "drug_name", "ind_name", "rating"]]
    dataset_repodb.to_csv(featureless_folder+"RepoDB_set.csv")

dataset = pd.read_csv(featureless_folder+"RepoDB_set.csv", index_col=0)
dataset_repodb = dataset[["ind_id", "drug_id", "rating"]]
utils.print_dataset(dataset_repodb, "ind_id", "drug_id", "rating")
dataset_repodb.T

Ndrugs=1531	Ndiseases=1287
6686 positive	190 negative	1963521 unknown matchings


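| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 9987 | 9988 | 10099 | 10116 | 10117 | 10164 | 10179 | 10250 | 10283 | 10396 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ind_id | C0272275 | C0585362 | C3163899 | C1319317 | C0280324 | C0007102 | C0010674 | C0079773 | C0003873 | C3495559 | ... | C0026764 | C0026764 | C0948780 | C0021359 | C0021359 | C0019196 | C0277554 | C0346976 | C0023283 | C0039445 |
| drug_id | DB00001 | DB00002 | DB00002 | DB00002 | DB00002 | DB00002 | DB00003 | DB00004 | DB00005 | DB00005 | ... | DB00773 | DB00987 | DB00641 | DB00783 | DB06825 | DB00715 | DB00682 | DB00441 | DB00196 | DB00112 |
| rating | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |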


Populate drug and disease identifiers

In [14]:
for idx in dataset.index:
    drug_id = dataset.loc[idx]["drug_id"]
    drug_name = dataset.loc[idx]["drug_name"]
    ind_id = dataset.loc[idx]["ind_id"]
    ind_name = dataset.loc[idx]["ind_name"]
    di_drugbankid2drugname.setdefault(drug_id, drug_name)
    di_medgenid2diseasename.setdefault(ind_id, ind_name)
    
with open(paths_global.data_folder+"drugbankid2drugname.pck", "wb") as f:
    pickle.dump(di_drugbankid2drugname, f)
with open(paths_global.data_folder+"medgenid2diseasename.pck", "wb") as f:
    pickle.dump(di_medgenid2diseasename, f)

### 2. [Gottlieb](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159979/) dataset

In [15]:
A_gottlieb = utils.load_dataset("Gottlieb", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_gottlieb = utils.matrix2ratings(A_gottlieb, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_gottlieb))+"%")
utils.print_dataset(dataset_gottlieb, "ind_id", "drug_id", "rating")
dataset_gottlieb.T

Sparsity = 1.0414365682698576%
Ndrugs=593	Ndiseases=313
1933 positive	0 negative	183676 unknown matchings


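| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1923 | 1924 | 1925 | 1926 | 1927 | 1928 | 1929 | 1930 | 1931 | 1932 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ind_id | D131200 | D150699 | D175510 | D176807 | D600082 | D601518 | D604416 | D604416 | D114480 | D131200 | ... | D110100 | D157700 | D161900 | D608622 | D126200 | D102500 | D166710 | D606842 | D144700 | D605839 |
| drug_id | DB00007 | DB00007 | DB00007 | DB00007 | DB00007 | DB00007 | DB00007 | DB00010 | DB00014 | DB00014 | ... | DB04861 | DB04861 | DB04861 | DB04861 | DB05259 | DB06285 | DB06285 | DB06285 | DB06287 | DB06287 |
| rating | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |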
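
The sparsity value printed above is consistent with the percentage of known (non-zero) entries in the matrix: 1933 / (593 × 313) ≈ 1.0414%. The sketch below reproduces that arithmetic on a toy matrix. It assumes this is what `utils.compute_sparsity` reports; the helper `sparsity_percent` is hypothetical and its implementation is not taken from `utils`.

```python
def sparsity_percent(matrix):
    """Percentage of non-zero (known) entries in a matrix given as a list of rows."""
    n_cells = sum(len(row) for row in matrix)
    n_known = sum(1 for row in matrix for value in row if value != 0)
    return 100.0 * n_known / n_cells

# Toy matrix: 3 known matchings out of 12 cells
toy = [
    [1, 0, 0, 0],
    [0, -1, 0, 0],
    [0, 0, 0, 1],
]
toy_sparsity = sparsity_percent(toy)

# Same arithmetic with the counts printed for the Gottlieb dataset
gottlieb_check = 100.0 * 1933 / (593 * 313)
```

The count of unknown matchings follows the same way: 593 × 313 − 1933 = 183676.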


### 3. [CDataset](https://academic.oup.com/bioinformatics/article/34/11/1904/4820334) dataset

In [16]:
A_cdataset = utils.load_dataset("Cdataset", save_folder=paths_global.data_folder)["ratings_mat"]
dataset_cdataset = utils.matrix2ratings(A_cdataset, "ind_id", "drug_id", "rating")

print("Sparsity = "+str(utils.compute_sparsity(A_cdataset))+"%")
utils.print_dataset(dataset_cdataset, "ind_id", "drug_id", "rating")
dataset_cdataset.T

Sparsity = 0.9337419376251535%
Ndrugs=663	Ndiseases=409
2532 positive	0 negative	268635 unknown matchings


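| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2522 | 2523 | 2524 | 2525 | 2526 | 2527 | 2528 | 2529 | 2530 | 2531 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ind_id | D114480 | D131200 | D176807 | D601518 | D125700 | D134430 | D134500 | D193400 | D277480 | D304900 | ... | D125852 | D222100 | D300136 | D600319 | D601208 | D601318 | D601666 | D601941 | D601942 | D603266 |
| drug_id | DB00014 | DB00014 | DB00014 | DB00014 | DB00035 | DB00035 | DB00035 | DB00035 | DB00035 | DB00035 | ... | DB08907 | DB08907 | DB08907 | DB08907 | DB08907 | DB08907 | DB08907 | DB08907 | DB08907 | DB08907 |
| rating | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |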


### 4. [Covid-19](https://github.com/vikram-s-narayan/collaborative-filtering-for-drug-repurposing-COVID-V3) dataset

It seems that every drug which has been tested for Covid-19 is rated "1" in this dataset, so we use the literature and current recommendations to refine the ratings. This dataset also provides other indications, but since the way it was built is not documented, we do not use them.

As of February 2nd, 2021: this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7520874/) might be interesting but is inconclusive about a good treatment for Covid-19. Hence, we use the current therapeutic recommendations from the NIH (USA), FDA (USA), NHS (UK) and HAS (FR):
- [baricitinib and remdesivir](https://www.fda.gov/news-events/press-announcements/coronavirus-covid-19-update-fda-authorizes-drug-combination-treatment-covid-19)
- [remdesivir alone](https://www.fda.gov/emergency-preparedness-and-response/coronavirus-disease-2019-covid-19/covid-19-frequently-asked-questions), [other ref](https://www.nice.org.uk/advice/es27/chapter/Key-messages)
- [hydroxychloroquine](https://www.fda.gov/drugs/drug-safety-and-availability/fda-cautions-against-use-hydroxychloroquine-or-chloroquine-covid-19-outside-hospital-setting-or) (lack of efficacy)
- [azithromycin alone](https://www.fda.gov/news-events/press-announcements/coronavirus-covid-19-update-daily-roundup-may-11-2020)
- [tocilizumab,sarilumab](https://www.covid19treatmentguidelines.nih.gov/immune-based-therapy/immunomodulators/interleukin-6-inhibitors/) (lack of efficacy)
- [dexamethasone,bamlanivimab,casirivimab+imdevimab](https://www.covid19treatmentguidelines.nih.gov/therapeutic-management/)
- [lopinavir,ritonavir](https://www.who.int/fr/news/item/04-07-2020-who-discontinues-hydroxychloroquine-and-lopinavir-ritonavir-treatment-arms-for-covid-19) (lack of efficacy)
- [ibuprofen,acetylsalicylic acid](https://www.nice.org.uk/advice/es23/chapter/Key-messages) (no evidence either way)
- [anakinra](https://www.nice.org.uk/advice/es26/chapter/Key-messages) (no evidence either way)
- [atovaquone] (no official info was found)
- [omeprazole] (no official info was found)
- [famotidine] (no official info was found)
- [oseltamivir](https://www.fda.gov/drugs/information-drug-class/influenza-flu-antiviral-drugs-and-related-information) (antiviral targeting influenza, which belongs to a different family of viruses)
- [favipiravir](https://www.who.int/fr/news-room/q-a-detail/coronavirus-disease-covid-19-hiv-and-antiretrovirals) (ditto)
- [tenofovir,emtricitabine](https://www.who.int/fr/news-room/q-a-detail/coronavirus-disease-covid-19-hiv-and-antiretrovirals) (no conclusive official info was found)
- [atazanavir] (no official info was found)
- (https://www.nice.org.uk/covid-19)

In [17]:
## Download the associated file on GitHub
print('covid_folder="%s"' % paths_global.covid_folder)
covid_file = paths_global.covid_folder+"approved_COVID.csv"
if (not os.path.exists(covid_file)):
    covid_url="https://raw.githubusercontent.com/vikram-s-narayan/collaborative-filtering-for-drug-repurposing-COVID-V3/master/approved_COVID.csv"
    ## sb.call waits for each command to finish, so the file exists before the assertion
    sb.call(["mkdir", "-p", paths_global.covid_folder])
    sb.call(["wget", "-O", covid_file, covid_url])
assert os.path.exists(covid_file)

covid_folder="/media/kali/1b80f30d-2803-4260-a792-9ae206084252/Code/M30/data/collaborative-filtering-for-drug-repurposing-COVID-V3/"


In [18]:
if (not os.path.exists(featureless_folder+"Covid_set.csv")):
    ## According to the source notebook, this dataset is the same as RepoDB except
    ## for the last 15 lines
    approved_Covid = pd.read_csv(covid_file).iloc[-15:,:]
    approved_Covid.index = range(15)
    add_ratings=[
        ["Baricitinib", "Coronavirus disease 2019", 1], 
        ["Sarilumab", "Coronavirus disease 2019", -1], 
        ["Dexamethasone", "Coronavirus disease 2019", 1], 
        ["Bamlanivimab", "Coronavirus disease 2019", 1], 
        ["Casirivimab", "Coronavirus disease 2019", 1], 
        ["Imdevimab", "Coronavirus disease 2019", 1], 
        ["Atovaquone", "Coronavirus disease 2019", 0], 
        ["Omeprazole", "Coronavirus disease 2019", 0], 
        ["Famotidine", "Coronavirus disease 2019", 0],
        ["Oseltamivir", "Coronavirus disease 2019", -1], 
        ["Favipiravir", "Coronavirus disease 2019", -1], 
        ['Emtricitabine', "Coronavirus disease 2019", 0], 
        ["Tenofovir", "Coronavirus disease 2019", 0], 
        ["Atazanavir", "Coronavirus disease 2019", 0], 
        ["Hydroxychloroquine", "Coronavirus disease 2019", -1], 
        ["Tocilizumab", "Coronavirus disease 2019", -1], 
        ["Lopinavir", "Coronavirus disease 2019", -1], 
        ["Ritonavir", "Coronavirus disease 2019", -1]
    ]
    for drug, ind, rating in add_ratings:
        if (drug in approved_Covid["drug_name"]):
            covid_ind = (approved_Covid["ind_name"]=="COVID-19")
            drug_id = (approved_Covid["drug_name"]==drug)
            approved_Covid.loc[covid_ind&drug_id] = [drug, ind, rating]
        else:
            approved_Covid.loc[len(approved_Covid.index)] = [drug, ind, rating]
            
    covid_drug_ids = {
        "DB15941": "Casirivimab",
        "DB00927": "Famotidine",
        "DB15718": "Bamlanivimab",
        "DB00503": "Ritonavir",
        "DB11817": "Baricitinib",
        "DB01234": "Dexamethasone",
        "DB00207": "Azithromycin",
        "DB14126": "Tenofovir", 
        "DB00338": "Omeprazole",
        "DB01117": "Atovaquone",
        "DB00879": "Emtricitabine",
        "DB00198": "Oseltamivir", 
        "DB06273": "Tocilizumab",
        "DB12466": "Favipiravir", 
        "DB11767": "Sarilumab",
        "DB14761": "Remdesivir",
        "DB01611": "Hydroxychloroquine",
        "DB01072": "Atazanavir",
        "DB15940": "Imdevimab",
        "DB01601": "Lopinavir",
    }

    covid_disease_ids = {
        "C5203670": "Coronavirus disease 2019",
        "CN294793": "Asymptomatic COVID-19 infection",
        "CN294811": "COVID-19–associated multisystem inflammatory syndrome in children",
        "CN294802": "Critical COVID-19 infection",
        "CN294794": "Mild COVID-19 infection",
        "CN294795": "Moderate COVID-19 infection",
        "CN294804": "Presymptomatic COVID-19 infection",
        "CN294796": "Severe COVID-19 infection", 
    }

    di_drugbankid2drugname.update(covid_drug_ids)
    di_medgenid2diseasename.update(covid_disease_ids)

    with open(paths_global.data_folder+"drugbankid2drugname.pck", "wb") as f:
        pickle.dump(di_drugbankid2drugname, f)
    with open(paths_global.data_folder+"medgenid2diseasename.pck", "wb") as f:
        pickle.dump(di_medgenid2diseasename, f)
        
    approved_Covid["ind_name"] = ["C5203670"]*len(approved_Covid.index)
    di = {}
    for k in covid_drug_ids:
        di.setdefault(covid_drug_ids[k], k)
    approved_Covid["drug_name"] = [di[idx] for idx in approved_Covid["drug_name"]]
    approved_Covid.columns = ["drug_id", "ind_id", "rating"]
    dataset_covid = approved_Covid
    dataset_covid.to_csv(featureless_folder+"Covid_set.csv")
    
dataset_covid = pd.read_csv(featureless_folder+"Covid_set.csv", index_col=0)
utils.print_dataset(dataset_covid, "ind_id", "drug_id", "rating")
dataset_covid.T

Ndrugs=20	Ndiseases=1
20 positive	7 negative	6 unknown matchings


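| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| drug_id | DB14761 | DB01611 | DB00207 | DB06273 | DB01601 | DB00503 | DB01117 | DB00338 | DB00927 | DB00198 | ... | DB00927 | DB00198 | DB12466 | DB00879 | DB14126 | DB01072 | DB01611 | DB06273 | DB01601 | DB00503 |
| ind_id | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | ... | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 | C5203670 |
| rating | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | -1 | -1 | 0 | 0 | 0 | -1 | -1 | -1 | -1 |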


### 5. Epilepsy dataset

Privately shared dataset on antiepileptic and proconvulsant drugs compiled by Dr. Baptiste PORTE <baptiste.porte@inserm.fr>.

In [19]:
fname="../epilepsy_dataset/epilepsy_scores_verif.csv"
assert os.path.exists(fname)

In [20]:
if (not os.path.exists(featureless_folder+"Epilepsy_set.csv")):

    dataset_epilepsy=pd.read_csv(fname, index_col=0)

    pubchem_cid = dict(zip(dataset_epilepsy.index, dataset_epilepsy["drug_name"]))
    di_pubchemid2drugname.update(pubchem_cid)
    with open(paths_global.data_folder+"pubchemid2drugname.pck", "wb+") as f:
        pickle.dump(di_pubchemid2drugname, f)

    di = {}
    for k in di_drugbankid2drugname:
        di.setdefault(di_drugbankid2drugname[k], k)

    dataset_epilepsy["drug_name"] = [di.get(dataset_epilepsy["drug_name"].loc[idx], idx) for idx in dataset_epilepsy.index]
    dataset_epilepsy["score"] = [-1 if ('-1' in dataset_epilepsy.loc[dataset_epilepsy.index[i]]["verification"]) else (1 if ('1' in dataset_epilepsy.loc[dataset_epilepsy.index[i]]["verification"]) else x) for i, x in enumerate(dataset_epilepsy["score"])]
    dataset_epilepsy = dataset_epilepsy[["drug_name","score"]]
    dataset_epilepsy.columns = ["drug_id", "rating"]
    dataset_epilepsy["ind_id"] = ["C0014544"]*len(dataset_epilepsy.index)

    di_medgenid2diseasename.update({"C0014544": "Epilepsy"})
    with open(paths_global.data_folder+"medgenid2diseasename.pck", "wb") as f:
        pickle.dump(di_medgenid2diseasename, f)
        
    dataset_epilepsy.to_csv(featureless_folder+"Epilepsy_set.csv")
        
dataset_epilepsy = pd.read_csv(featureless_folder+"Epilepsy_set.csv", index_col=0)
utils.print_dataset(dataset_epilepsy, "ind_id", "drug_id", "rating")
dataset_epilepsy.T

Ndrugs=509	Ndiseases=1
37 positive	138 negative	334 unknown matchings


| | 3767 | 2310 | 104999 | 4375 | 3352 | 3373 | 5917 | 442872 | 442021 | 33741 | ... | 5284627 | 4843 | 3291 | 34312 | 1986 | 3016 | 3446 | 5665 | 5284583 | 5734 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| drug_id | DB00951 | DB13740 | 104999 | 4375 | 3352 | DB01205 | 5917 | 442872 | 442021 | DBCAT004271 | ... | DB00273 | DB09210 | DB00593 | DB00776 | DB00819 | DB00829 | DB00996 | DB01080 | DB01202 | DB00557 |
| rating | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ind_id | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | ... | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 | C0014544 |


### 6. Merge all datasets

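An outcome is inconsistent when the same drug-disease pair receives different ratings (in {0,1,-1}) across the datasets being merged. Unknown matchings (rating 0) are discarded before merging, so a conflict is always between a positive and a negative rating. We correct inconsistent outcomes as follows: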
- If there is at least one negative outcome reported, then it is a negative outcome (in order to be conservative with respect to drug recommendations).
- If there is at least one positive outcome and no negative outcome reported, then it is a positive outcome.
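
The rule above can be sketched on plain tuples, independently of pandas. This is a minimal illustration rather than the `merge_ratings` function defined in the next cell; the function name `resolve_conflicts` and the toy identifiers are assumptions made for the example.

```python
def resolve_conflicts(triples):
    """Merge (disease, drug, rating) triples: any -1 wins, otherwise any 1 wins."""
    merged = {}
    for disease, drug, rating in triples:
        if rating == 0:
            continue  # unknown matchings carry no information
        key = (disease, drug)
        # min() over {-1, 1} keeps -1 as soon as one negative outcome is seen
        merged[key] = min(merged.get(key, rating), rating)
    return merged

# Toy identifiers, made up for the illustration
toy = [
    ("C0000001", "DB00001", 1),
    ("C0000001", "DB00001", -1),  # contradicts the previous triple
    ("C0000002", "DB00002", 1),
    ("C0000002", "DB00002", 1),   # consistent duplicate
    ("C0000003", "DB00003", 0),   # unknown, dropped
]
merged = resolve_conflicts(toy)
print(merged)
```

Taking the minimum is enough here because only -1 and 1 remain once the zeros are dropped.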

In [21]:
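#' @param rating_dfs list of Pandas DataFrames of ratings
#' @param user_col name of the user (disease) id column
#' @param item_col name of the item (drug) id column
#' @param rating_col name of the rating column (values in {-1,0,1})
#' @return merged Pandas DataFrame of nonzero ratings, one row per (user, item) pair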
def merge_ratings(rating_dfs, user_col, item_col, rating_col):
    dfs = tuple(rating_dfs)
    ratings = pd.concat(dfs, axis=0)
    ratings = ratings.loc[ratings[rating_col]!=0]
    ratings = ratings[[user_col, item_col, rating_col]]
    ratings.index = ["-".join(list(map(str,ratings.iloc[idx][[user_col,item_col]]))) for idx in range(len(ratings.index))]
    if (ratings.index.duplicated().any()):
        duplicated_idxs = np.unique(ratings.index[ratings.index.duplicated()])
        for idx in duplicated_idxs:
            user, item = idx.split("-")
            ratings_ = list(ratings.loc[idx][rating_col])
            if (len(np.unique(ratings_))==1):
                rating = ratings_[0]
            else:
                print("/!\\ Inconsistent outcomes! "+str(np.unique(ratings_))+"\tuser="+str(user)+"\titem="+str(item))
                ## If at least one negative outcome reported => negative (-1)
                ## If at least one positive outcome and no negative outcome reported => positive (1)
                if ((np.array(ratings_)==-1).any()):
                    rating = -1
                elif ((np.array(ratings_)==1).any()):
                    rating = 1
                else:
                    rating = 0
            ratings = ratings.loc[ratings.index!=idx]
            ratings.loc[idx] = [user, item, rating]
    return ratings

if (not os.path.exists(featureless_folder+"all_ratings_merged.csv")):

    dfs = [dataset_gottlieb, dataset_repodb, dataset_covid, dataset_epilepsy]
    ratings_A = merge_ratings(dfs, "ind_id", "drug_id", "rating")
    A = utils.ratings2matrix(ratings_A, "ind_id", "drug_id", "rating").fillna(0.)

    ## Deal with inconsistent outcomes
    #user=C0014544	item=DB00273 https://go.drugbank.com/drugs/DB00273
    #user=C0014544	item=DB09210 https://go.drugbank.com/drugs/DB09210
    #user=C5203670	item=DB00198 https://go.drugbank.com/drugs/DB00198
    #user=C5203670	item=DB00503 https://go.drugbank.com/drugs/DB00503
    #user=C5203670	item=DB01601 https://go.drugbank.com/drugs/DB01601
    #user=C5203670	item=DB01611 https://go.drugbank.com/drugs/DB01611
    #user=C5203670	item=DB06273 https://go.drugbank.com/drugs/DB06273
    #user=C5203670	item=DB12466 https://go.drugbank.com/drugs/DB12466
    inconsistent_outs = {
        ("C0014544","DB00273"): 1,
        ("C0014544","DB09210"): 1,
        ("C5203670","DB00198"): -1,
        ("C5203670","DB00503"): -1,
        ("C5203670","DB01601"): -1,
        ("C5203670","DB01611"): -1,
        ("C5203670","DB06273"): -1,
        ("C5203670","DB12466"): -1, 
    }
    for (disease, drug), rating in inconsistent_outs.items():
        ## Single .loc call: chained indexing (A.loc[drug][disease]) may write to a copy
        A.loc[drug, disease] = rating

    A.to_csv(featureless_folder+"all_ratings_merged.csv")

A = pd.read_csv(featureless_folder+"all_ratings_merged.csv", index_col=0)
ratings_A = utils.matrix2ratings(A, user_col="ind_id", item_col="drug_id", rating_col="rating")

print("Sparsity = "+str(utils.compute_sparsity(A))+"%")

utils.print_dataset(ratings_A, "ind_id", "drug_id", "rating")

Sparsity = 0.35070326199346175%
Ndrugs=1599	Ndiseases=1599
8658 positive	320 negative	2547823 unknown matchings


In [22]:
assert all([x[0] in ["C", "D"] and x[:2]!="DB" for x in ratings_A["ind_id"].unique()])
assert all([str(x)[0] not in ["C", "D"] or str(x)[:2]=="DB" for x in ratings_A["drug_id"].unique()])
assert all([x[0] in ["C", "D"] and x[:2]!="DB" for x in A.columns])
assert all([str(x)[0] not in ["C", "D"] or str(x)[:2]=="DB" for x in A.index])

### I. Convert all disease/drug ids to the same notation

Convert all OMIM/RepoDB disease ids into MedGen concept ids.
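
The conversion cell that follows queries a concept id by disease name and, when the query fails, retries with a manually chosen substitute name. That retry pattern can be sketched in a table-driven way, as below. This is an illustration only: `resolve_concept_id`, `toy_lookup` and the `fallbacks` table are hypothetical, and the toy lookup is assumed to raise `KeyError` on failure, which may differ from how `utils.get_concept_id` behaves.

```python
def resolve_concept_id(name, lookup, fallbacks):
    """Try `name`, then its registered fallback names, until `lookup` succeeds."""
    seen = set()
    while name is not None and name not in seen:
        seen.add(name)  # guards against cycles in the fallback table
        try:
            return lookup(name)
        except KeyError:
            name = fallbacks.get(name)  # None when no fallback is registered
    return None

# Toy stand-in for the remote query (assumption: failures raise KeyError)
known = {"Epilepsy": "C0014544"}

def toy_lookup(name):
    return known[name]

# Hypothetical fallback table: failing name -> substitute name
fallbacks = {"Epilepsy (disorder)": "Epilepsy"}

found = resolve_concept_id("Epilepsy (disorder)", toy_lookup, fallbacks)
missing = resolve_concept_id("Thallium poisoning", toy_lookup, fallbacks)
print(found, missing)
```

Keeping the substitutions in a dictionary makes each manual decision easy to audit, at the cost of expressing pattern-based rules (such as stripping a " (disorder)" suffix) separately.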

In [23]:
if (not os.path.exists(featureless_folder+"all_ratings_converted_diseases.csv")):

    A = pd.read_csv(featureless_folder+"all_ratings_merged.csv", index_col=0)

    disease_id_list = list(A.columns)
    disease_list = utils.disease_id2name(disease_id_list, id_folder=paths_global.data_folder)
    for ix, x in enumerate(disease_list):
        if ("Moved to " in x):
            disease_list[ix] = utils.disease_id2name(["D"+"".join(x.split("Moved to ")[-1])], id_folder=paths_global.data_folder)[0]

    def get_cid_job(did, dname):
        try:
            if (dname[0]!="C"):
                dcid = utils.get_concept_id(dname, delay=0)
            else:
                dcid = dname
        except Exception:
            if ("Metastatic Renal Cell Cancer" in dname):
                dcid = get_cid_job(did, "Renal Cell Cancer")
            elif ("Unspecified Abortion" == dname):
                dcid = get_cid_job(did, "Spontaneous abortion")
            elif ("Poisoning by " in dname):
                dcid = get_cid_job(did, dname.split("Poisoning by ")[-1])
            elif ("Malignant lymphoma, lymphocytic, intermediate differentiation, diffuse" == dname):
                dcid = get_cid_job(did, "Malignant lymphoma, lymphocytic, intermediate differentiation")
            elif ("Septicemia candida" == dname):
                dcid = get_cid_job(did, "candida infection")
            elif ("Toxic effect of cyanide" == dname):
                dcid = get_cid_job(did, "cyanide")
            elif ("Toxoplasmosis associated with AIDS" == dname):
                dcid = get_cid_job(did, "AIDS-related Toxoplasmosis")
            elif ("Moraxella catarrhalis pneumonia" == dname):
                dcid = get_cid_job(did, "Acute Moraxella catarrhalis bronchitis")
            elif ("Bladder muscle dysfunction - overactive" == dname):
                dcid = get_cid_job(did, "Overactive bladder")
            elif ("Renal disease with edema NOS" == dname):
                dcid = get_cid_job(did, "Nephrotic syndrome, type 3")
            elif (" (disorder)" in dname):
                dcid = get_cid_job(did, dname.split(" (disorder)")[0])
            elif ("Malignant neoplasm of stomach stage IV" == dname):
                dcid = get_cid_job(did, "Malignant neoplasm of stomach")
            elif ("Escherichia coli septicemia" == dname):
                dcid = get_cid_job(did, "Escherichia coli infection")   
            elif ("URINARY TRACT INFECTION".lower() in dname.lower()):
                dcid = get_cid_job(did, "URINARY TRACT INFECTION")  
            elif ("candidal peritonitis" == dname):
                dcid = get_cid_job(did, "peritonitis") 
            elif ("pharyngitis due to Haemophilus influenzae" == dname):
                dcid = get_cid_job(did, "Haemophilus influenzae") 
            elif ("Streptococcal endocarditis" == dname):
                dcid = get_cid_job(did, "Subacute bacterial endocarditis")
            elif ("Pseudomonas aeruginosa meningitis" == dname):
                dcid = get_cid_job(did, "Bacterial meningitis")
            elif ("Staphylococcal bacteraemia" == dname):
                dcid = get_cid_job(did, "bacteraemia")
            elif ("Haemophilus parainfluenzae pneumonia" == dname):
                dcid = get_cid_job(did, "bacterial pneumonia")
            elif ("Accidental poisoning by lead and its compounds and fumes" == dname):
                dcid = get_cid_job(did, "Lead poisoning")
            elif ("Sprains and Strains" == dname):
                dcid = get_cid_job(did, "Ehlers-Danlos syndrome")
            elif ("Bacteroides empyema" == dname):
                dcid = get_cid_job(did, "Bacteroides infectious disease")
            elif ("Meningitis due to Bacteroides" == dname):
                dcid = get_cid_job(did, "Meningitis")
            elif ("Diarrhoea predominant irritable bowel syndrome" == dname):
                dcid = get_cid_job(did, "irritable bowel syndrome")   
            elif ("Endocarditis haemophilus" == dname):
                dcid = get_cid_job(did, "Bacterial endocarditis")  
            elif ("Septicemia due to anaerobes" == dname):
                dcid = get_cid_job(did, "Sepsis")  
            elif ("Septic arthritis haemophilus" == dname):
                dcid = get_cid_job(did, "Septic arthritis")
            elif ("Accidental poisoning by methyl alcohol" == dname):
                dcid = get_cid_job(did, "Alcohol dependence")
            elif ("Ethylene glycol poisoning" == dname):
                dcid = get_cid_job(did, "Ethylene glycol")
            elif ("Branch retinal vein occlusion with macular edema" == dname):
                dcid = get_cid_job(did, "Branch retinal vein occlusion with no neovascularization")
            elif ("Generalized glycogen storage disease of infants" == dname):
                dcid = get_cid_job(did, "Glycogen storage disease, type II")
            elif ("Arsenic Poisoning" == dname):
                dcid = get_cid_job(did, "Arsenic Trioxide response")
            elif ("Thallium poisoning" == dname):
                ## not present in Medgen
                dcid = None
            elif ("Lipodystrophy due to Human immunodeficiency virus infection and antiretroviral therapy" == dname):
                dcid = get_cid_job(did, "HIV Lipodystrophy")
            elif ("Heparin overdose" == dname):
                dcid = get_cid_job(did, "Heparin-induced thrombocytopenia")
            elif ("Antimetabolite overdose" == dname):
                dcid = get_cid_job(did, "Antimetabolite adverse reaction")
            elif ("Arthropod bite wound" == dname):
                dcid = get_cid_job(did, "Arbovirus infection")
            elif ("Burn injury" == dname):
                ## unclear
                dcid = None
            elif ("transvaginal ultrasound: length of cervix" == dname):
                dcid = get_cid_job(did, "Length of uterine cervix")
            elif ("colon cancer liver metastasis" == dname):
                dcid = get_cid_job(did, "Colorectal Cancer pM1 TNM Finding v6 and v7")   
            elif ("PYELONEPHRITIS E COLI" == dname):
                dcid = get_cid_job(did, "Pyelonephritis")   
            elif ("Proteus septicemia" == dname):
                dcid = get_cid_job(did, "Proteus infectious disease") 
            elif ("Septicemia due to enterococcus" == dname):
                dcid = get_cid_job(did, "Enterococcus faecalis infection") 
            elif ("Head and neck cancer metastatic" == dname):
                dcid = get_cid_job(did, "Head and neck carcinoma") 
            else:
                raise ValueError("did=%s dname=%s" % (did,dname))
        return dcid

    disease_cid_list = Parallel(n_jobs=njobs, backend='loky')(delayed(get_cid_job)(di_id, di_cid) for di_id, di_cid in enumerate(disease_list))

    ## Add (cid, disease_name) pairs to the MedGen id -> disease name dictionary
    di = dict(zip(disease_cid_list, disease_list))
    di_medgenid2diseasename.update(di)
    with open(cids_file, "wb+") as f:
        pickle.dump(di_medgenid2diseasename, f)

    ids_none = [i for i, d in enumerate(disease_cid_list) if (str(d)!="None")]
    A = A[A.columns[ids_none]]
    A.columns = [disease_cid_list[i] for i in ids_none]

    replace_in_A_cols = {}
    for x in list(di_medgenid2diseasename.keys()):
        try:
            int(x[2:])
        except Exception:
            if (str(x)=="None"):
                di_medgenid2diseasename.pop(x)
                continue
            if (x == "Carcinoma breast stage IV"):
                xx = "Breast carcinoma"
            elif (x == "Cystitis escherichia"):
                xx = "Bacterial cystitis"
            elif (x == "Central retinal vein occlusion with macular edema"):
                xx = "Central retinal vein occlusion"
            elif (x == "Cicatrix, Hypertrophic"):
                xx = "Cicatrix"
            elif (x == "Contact dermatitis due to Rhus diversiloba"):
                xx = "Toxicodendron dermatitis"
            else:
                xx = x
            dcid = utils.get_concept_id(xx, delay=0)
            print((dcid, x))
            di_medgenid2diseasename.update(dict(zip([dcid],[x])))
            di_medgenid2diseasename.pop(x)
            replace_in_A_cols.setdefault(x, dcid)

    with open(cids_file, "wb+") as f:
        pickle.dump(di_medgenid2diseasename, f)
    A.columns = [replace_in_A_cols[x] if (x in replace_in_A_cols) else x for x in A.columns]

    A.to_csv(featureless_folder+"all_ratings_converted_diseases.csv")

A = pd.read_csv(featureless_folder+"all_ratings_converted_diseases.csv", index_col=0)

### II. Convert all drug ids to DrugBank ids, get PubChem CIDs

Convert the remaining DrugBank drug ids into PubChem ids whenever possible. Otherwise, keep the DrugBank id.
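
A minimal sketch of that batch conversion with fallback, assuming a converter that takes a list of ids and returns one result per id, with `None` for ids it cannot match. The names `convert_in_chunks` and `toy_convert` are hypothetical, and the sketch is not the cell below, which relies on `utils.get_pubchem_id`.

```python
def convert_in_chunks(ids, convert_batch, chunksize=100):
    """Convert ids batch by batch; keep the original id when no match is returned."""
    converted = []
    for start in range(0, len(ids), chunksize):
        batch = ids[start:start + chunksize]
        results = convert_batch(batch)  # assumed: one result (or None) per input id
        converted.extend(
            new if new is not None else old for old, new in zip(batch, results)
        )
    return converted

# Toy converter standing in for the remote query; the two pairs are those
# shown in the epilepsy table earlier in this notebook
toy_map = {"DB00273": 5284627, "DB09210": 4843}

def toy_convert(batch):
    return [toy_map.get(drug_id) for drug_id in batch]

result = convert_in_chunks(["DB00273", "DB00010", "DB09210"], toy_convert, chunksize=2)
print(result)
```

Sending ids in chunks bounds the size of each request; the chunk size of 100 mirrors the value used in the cell below.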

In [24]:
if (not os.path.exists(featureless_folder+"all_ratings.csv")):

    A = pd.read_csv(featureless_folder+"all_ratings_converted_diseases.csv", index_col=0)
    drug_id_list = list(A.index)
    drug_cid_list = [None]*A.shape[0]
    drugbank_ids = []
    drugbank_ids_id = []
    for dg_id, dg_cid in enumerate(drug_id_list):
        if (str(dg_cid)[:2]=="DB"):
            drugbank_ids.append(dg_cid)
            drugbank_ids_id.append(dg_id)
        else:
            drug_cid_list[dg_id] = dg_cid
            
    ## Convert drug id in DrugBank into PubChem id
    def get_cid_job(did,ls):
        return utils.get_pubchem_id(ls)

    chunksize=100
    pubchem_ids = [get_cid_job(di_ls,drugbank_ids[di_ls:di_ls+chunksize]) for di_ls in range(0, len(drugbank_ids), chunksize)]
    
    pubchem_ids_ = [x for ls in pubchem_ids for x in ls]
    for ii, i in enumerate(drugbank_ids_id):
        drug_cid_list[i] = pubchem_ids_[ii]
    drug_names = utils.drug_id2name(drug_id_list, id_folder=paths_global.data_folder)
    
    ## add (cid, drug_name) to dictionary
    di = dict(zip(pubchem_ids_, drug_names))
    di_pubchemid2drugname.update(di)
    with open(pubchem_file, "wb+") as f:
        pickle.dump(di_pubchemid2drugname, f)
        
    ids_none = [i for i, c in enumerate(pubchem_ids_) if (str(c)!="None")]
    A = A.loc[A.index[ids_none]]
    A.index = [pubchem_ids_[i] for i in ids_none]
    ids = [di_drugbankid2drugname.get(d, d) for d in A.index]
    inv_map = {v: k for k, v in di_pubchemid2drugname.items()}
    A.index = [inv_map.get(d, list(A.index)[ii]) for ii, d in enumerate(ids)]
    A.to_csv(featureless_folder+"all_ratings.csv")
    
A = pd.read_csv(featureless_folder+"all_ratings.csv", engine="python", index_col=0)
print("Sparsity "+str(utils.compute_sparsity(A))+"%")
ratings_A = utils.matrix2ratings(A, user_col="ind_id", item_col="drug_id", rating_col="rating")
utils.print_dataset(ratings_A, user_col="ind_id", item_col="drug_id", rating_col="rating")
ratings_A.T

Sparsity 0.35070326199346175%
Ndrugs=1599	Ndiseases=1599
8658 positive	320 negative	2547823 unknown matchings


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8968,8969,8970,8971,8972,8973,8974,8975,8976,8977
ind_id,C1851649,C0042133,C5193005,C2676676,C1704272,C4722327,C1858361,C0034013,C0014175,C1858361,...,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C0014544,C5203670
drug_id,657181,657181,657181,657181,657181,657181,657181,657181,657181,DB00010,...,2310,492405,4375,3352,3373,5917,442872,442021,4843,DB12466
rating,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0


### III. Final matrix

In [25]:
A = pd.read_csv(featureless_folder+"all_ratings.csv", engine="python", index_col=0)
A

Unnamed: 0,C1851649,C0042133,C5193005,C2676676,C1704272,C4722327,C1858361,C2676676.1,C4310232,C0029456,...,C0242770,C1880129,C0022661,C0236792,C1135191,C0149516,C1835407,C0016667,C0039445,C5203670
657181,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB00010,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5311128,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DB00017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5311065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5917,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
442872,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
442021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4843,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
