## Extracting graph edges and nodes from data

This first section builds the graph structure only, from an extracted dataset that has no node features.
The data is in /data/drug_interactions.tsv and has the following fields:

- drug_interaction_id: id of drug A
- name: name of drug A
- description: interaction info of drug A with drug B
- drugbank_id: id of drug B

Now we want to extract a graph whose nodes are the drugs, with one edge for each drug_interaction_id-drugbank_id pair.
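
To make the target structure concrete, here is a minimal sketch on two in-memory rows, using plain Scala collections rather than Spark. The Interaction case class and the parseLine helper are illustrative names that do not appear in the notebook.

```scala
// Toy rows in the same four-column, tab-separated layout as drug_interactions.tsv.
val sampleLines = Seq(
  "drug_interaction_id\tname\tdescription\tdrugbank_id",
  "DB06605\tApixaban\tApixaban may increase the anticoagulant activities of Lepirudin.\tDB00001",
  "DB06695\tDabigatran etexilate\tDabigatran etexilate may increase the anticoagulant activities of Lepirudin.\tDB00001"
)

// Hypothetical record type for one interaction row.
final case class Interaction(drugA: String, name: String, description: String, drugB: String)

// Parse one line; rows that do not have exactly four fields are skipped.
def parseLine(line: String): Option[Interaction] =
  line.split("\t", -1) match {
    case Array(a, n, d, b) => Some(Interaction(a, n, d, b))
    case _                 => None
  }

// drop(1) skips the header row.
val interactions = sampleLines.drop(1).flatMap(parseLine)

// Nodes: every drug id seen on either side. Edges: one (drug A, drug B) pair per row.
val nodes = interactions.flatMap(i => Seq(i.drugA, i.drugB)).distinct.sorted
val edges = interactions.map(i => (i.drugA, i.drugB))

println(nodes)
println(edges)
```

The rest of this section does the same thing at scale, with the extra step of replacing the string ids by numeric node ids.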

In [None]:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

In [76]:
// read in data 
val lines = sc.textFile("/home/jovyan/work/data/drug_interactions.tsv")
// skip header
val header = lines.first() // extract header
val data = lines.filter(row => row != header) // filter out header

lines: org.apache.spark.rdd.RDD[String] = /home/jovyan/work/data/drug_interactions.tsv MapPartitionsRDD[283] at textFile at <console>:38
header: String = drug_interaction_id	name	description	drugbank_id
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[284] at filter at <console>:41


In [2]:
data.take(2)

res0: Array[String] = Array(DB06605	Apixaban	Apixaban may increase the anticoagulant activities of Lepirudin.	DB00001, DB06695	Dabigatran etexilate	Dabigatran etexilate may increase the anticoagulant activities of Lepirudin.	DB00001)


In [3]:
val lines = data.map(line => line.split("\t"))
lines.take(4)

lines: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:27
res1: Array[Array[String]] = Array(Array(DB06605, Apixaban, Apixaban may increase the anticoagulant activities of Lepirudin., DB00001), Array(DB06695, Dabigatran etexilate, Dabigatran etexilate may increase the anticoagulant activities of Lepirudin., DB00001), Array(DB01254, Dasatinib, The risk or severity of bleeding and hemorrhage can be increased when Dasatinib is combined with Lepirudin., DB00001), Array(DB01609, Deferasirox, The risk or severity of gastrointestinal bleeding can be increased when Lepirudin is combined with Deferasirox., DB00001))


In [4]:
// read the full drug list: drug_interactions.tsv only contains drugs that have interactions
val linesDrugs = sc.textFile("/home/jovyan/work/data/drug_features.csv")
// skip header
val header = linesDrugs.first() // extract header
val data = linesDrugs.filter(row => row != header) // filter out header
linesDrugs.take(4)

linesDrugs: org.apache.spark.rdd.RDD[String] = /home/jovyan/work/data/drug_features.csv MapPartitionsRDD[5] at textFile at <console>:30
header: String = drug_id	type	group	interactions
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at filter at <console>:33
res2: Array[String] = Array(drug_id	type	group	interactions, DB00001	biotech	[approved]	[DB06605, DB06695, DB01254, DB01609, DB01586, DB02123, DB02659, DB02691, DB03619, DB04348, DB05990, DB06777, DB08833, DB08834, DB08857, DB11622, DB11789, DB09075, DB09053, DB08935, DB06228, DB06206, DB09070, DB00932, DB00013, DB00163, DB09030, DB01381, DB01181, DB00468, DB00908, DB00675, DB00539, DB00806, DB00686, DB00583, DB00255, DB00269, DB00286, DB0...


In [2]:
var all_drugs = spark.read.options(Map("inferSchema"->"true","delimiter"->"\t", "header"->"true"))
  .csv("/home/jovyan/work/data/drug_features.csv")
all_drugs.show(3)

+-------+-------+----------+--------------------+
|drug_id|   type|     group|        interactions|
+-------+-------+----------+--------------------+
|DB00001|biotech|[approved]|[DB06605, DB06695...|
|DB00002|biotech|[approved]|[DB00012, DB00016...|
|DB00003|biotech|[approved]|                null|
+-------+-------+----------+--------------------+
only showing top 3 rows



all_drugs: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 2 more fields]


In [3]:
val distinct_drugs = all_drugs.select("drug_id").distinct()
distinct_drugs.count()

distinct_drugs: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [drug_id: string]
res2: Long = 13580


In [4]:
var drugs = distinct_drugs.select("drug_id").rdd
                    .map(x => x(0).toString) // unwrap each Row into a plain String
drugs.take(4)

drugs: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at map at <console>:27
res3: Array[String] = Array(DB00194, DB00741, DB00846, DB00912)


In [5]:
drugs.count()

res4: Long = 13580


We produce a drug -> node map that pairs each drug id with a numeric node id. These node ids will be used as input to our graph.

In [6]:
val drug2NodeMap = drugs.zipWithIndex()

drug2NodeMap: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[33] at zipWithIndex at <console>:26


In [7]:
drug2NodeMap.take(5)

res5: Array[(String, Long)] = Array((DB00194,0), (DB00741,1), (DB00846,2), (DB00912,3), (DB01357,4))


In [8]:
drug2NodeMap.count()

res6: Long = 13580
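
Note that zipWithIndex numbers elements in whatever order the RDD currently holds them, and we would not rely on that order being identical after distinct() on another run or cluster. That is one reason to save the map rather than recompute it. As a hedged sketch in plain Scala collections (buildNodeMap is a hypothetical helper, not part of the notebook), sorting the ids first makes the numbering depend only on the set of ids:

```scala
// Assign node ids from the sorted distinct ids, so the result does not depend on input order.
def buildNodeMap(drugIds: Seq[String]): Map[String, Long] =
  drugIds.distinct.sorted.zipWithIndex.map { case (id, idx) => id -> idx.toLong }.toMap

// Two "runs" that see the same ids in a different order (and with a duplicate).
val runA = buildNodeMap(Seq("DB00741", "DB00194", "DB00846", "DB00194"))
val runB = buildNodeMap(Seq("DB00846", "DB00194", "DB00741"))

println(runA)
println(runA == runB) // true
```

In Spark the analogous step would presumably be drugs.sortBy(id => id).zipWithIndex(), at the cost of an extra sort. The node ids shown in this notebook were produced without it.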


In [9]:
val df_drug_node_map = spark.createDataFrame(drug2NodeMap).toDF("drug_id", "node_id")

df_drug_node_map: org.apache.spark.sql.DataFrame = [drug_id: string, node_id: bigint]


In [10]:
// save map for later use
df_drug_node_map
   .repartition(1)
   .write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/home/jovyan/work/data/drug2NodeMap.csv")

Now we want to map the drugs in our drug interactions dataset to these node IDs.

In [11]:
var interactions_df = spark.read.options(Map("inferSchema"->"true","delimiter"->"\t", "header"->"true"))
  .csv("/home/jovyan/work/data/drug_interactions.tsv")
interactions_df.show(5)

+-------------------+--------------------+--------------------+-----------+
|drug_interaction_id|                name|         description|drugbank_id|
+-------------------+--------------------+--------------------+-----------+
|            DB06605|            Apixaban|Apixaban may incr...|    DB00001|
|            DB06695|Dabigatran etexilate|Dabigatran etexil...|    DB00001|
|            DB01254|           Dasatinib|The risk or sever...|    DB00001|
|            DB01609|         Deferasirox|The risk or sever...|    DB00001|
|            DB01586|Ursodeoxycholic acid|The risk or sever...|    DB00001|
+-------------------+--------------------+--------------------+-----------+
only showing top 5 rows



interactions_df: org.apache.spark.sql.DataFrame = [drug_interaction_id: string, name: string ... 2 more fields]


In [12]:
var interactions_rdd = interactions_df.rdd.zipWithIndex()
interactions_rdd.take(1)

interactions_rdd: org.apache.spark.rdd.RDD[(org.apache.spark.sql.Row, Long)] = ZippedWithIndexRDD[58] at zipWithIndex at <console>:26
res9: Array[(org.apache.spark.sql.Row, Long)] = Array(([DB06605,Apixaban,Apixaban may increase the anticoagulant activities of Lepirudin.,DB00001],0))


In [13]:
val interactions_rdd2 = interactions_rdd.map(x => (x._1(0).toString, x._1(1).toString, x._1(2).toString, x._1(3).toString, x._2))
interactions_rdd2.take(1)

interactions_rdd2: org.apache.spark.rdd.RDD[(String, String, String, String, Long)] = MapPartitionsRDD[59] at map at <console>:26
res10: Array[(String, String, String, String, Long)] = Array((DB06605,Apixaban,Apixaban may increase the anticoagulant activities of Lepirudin.,DB00001,0))


In [14]:
interactions_df = spark.createDataFrame(interactions_rdd2).toDF("drug_interaction_id", "name", "description", "drugbank_id", "row_number")
interactions_df.show(3)

+-------------------+--------------------+--------------------+-----------+----------+
|drug_interaction_id|                name|         description|drugbank_id|row_number|
+-------------------+--------------------+--------------------+-----------+----------+
|            DB06605|            Apixaban|Apixaban may incr...|    DB00001|         0|
|            DB06695|Dabigatran etexilate|Dabigatran etexil...|    DB00001|         1|
|            DB01254|           Dasatinib|The risk or sever...|    DB00001|         2|
+-------------------+--------------------+--------------------+-----------+----------+
only showing top 3 rows



interactions_df: org.apache.spark.sql.DataFrame = [drug_interaction_id: string, name: string ... 3 more fields]


In [15]:
val drugAs = interactions_df.select("drug_interaction_id", "row_number")
drugAs.show(2)

+-------------------+----------+
|drug_interaction_id|row_number|
+-------------------+----------+
|            DB06605|         0|
|            DB06695|         1|
+-------------------+----------+
only showing top 2 rows



drugAs: org.apache.spark.sql.DataFrame = [drug_interaction_id: string, row_number: bigint]


In [16]:
drugAs.count()

res13: Long = 2668185


In [17]:
var drugAWithNodeIDs = drugAs.join(df_drug_node_map, $"drug_interaction_id" === $"drug_id", "left")
drugAWithNodeIDs.count()

drugAWithNodeIDs: org.apache.spark.sql.DataFrame = [drug_interaction_id: string, row_number: bigint ... 2 more fields]
res14: Long = 2668185


In [18]:
drugAWithNodeIDs = drugAWithNodeIDs.withColumnRenamed("drug_interaction_id","drug_A_id")
           .withColumnRenamed("node_id","drug_A_node_id").drop("drug_id")
drugAWithNodeIDs.show(3)

+---------+----------+--------------+
|drug_A_id|row_number|drug_A_node_id|
+---------+----------+--------------+
|  DB00194|    315697|             0|
|  DB00741|      1855|             1|
|  DB00741|     13266|             1|
+---------+----------+--------------+
only showing top 3 rows



drugAWithNodeIDs: org.apache.spark.sql.DataFrame = [drug_A_id: string, row_number: bigint ... 1 more field]


In [19]:
val drugBs = interactions_df.select("drugbank_id", "row_number")
var drugBWithNodeIDs = drugBs.join(df_drug_node_map, $"drugbank_id" === $"drug_id", "left")
                        .withColumnRenamed("node_id","drug_B_node_id").drop("drug_id")
drugBWithNodeIDs.count()

drugBs: org.apache.spark.sql.DataFrame = [drugbank_id: string, row_number: bigint]
drugBWithNodeIDs: org.apache.spark.sql.DataFrame = [drugbank_id: string, row_number: bigint ... 1 more field]
res16: Long = 2668185


In [20]:
drugBWithNodeIDs.show(1)

+-----------+----------+--------------+
|drugbank_id|row_number|drug_B_node_id|
+-----------+----------+--------------+
|    DB00194|     77879|             0|
+-----------+----------+--------------+
only showing top 1 row



In [21]:
// join drugAWithNodeIDs and drugBWithNodeIDs to get the edges
var edgesData = drugAWithNodeIDs.join(drugBWithNodeIDs, Seq("row_number"), "left")
edgesData.count()

edgesData: org.apache.spark.sql.DataFrame = [row_number: bigint, drug_A_id: string ... 3 more fields]
res18: Long = 2668185


In [25]:
edgesData = edgesData.drop("row_number")
edgesData.show(3)

+---------+--------------+-----------+--------------+
|drug_A_id|drug_A_node_id|drugbank_id|drug_B_node_id|
+---------+--------------+-----------+--------------+
|  DB09030|         10315|    DB00001|          9529|
|  DB00468|          4259|    DB00001|          9529|
|  DB00056|            61|    DB00001|          9529|
+---------+--------------+-----------+--------------+
only showing top 3 rows
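
The row_number column above serves to line the drug A and drug B lookups back up. A possible alternative, sketched here on toy data, joins the node map twice under side-specific column names and needs no row index. The names session, toyNodeMap, toyInteractions, mapA, mapB and toyEdges are illustrative, and the node ids are made up.

```scala
import org.apache.spark.sql.SparkSession

// In the notebook a session already exists; getOrCreate() then simply returns it.
val session = SparkSession.builder().master("local[*]").appName("edge-join-sketch").getOrCreate()

// Toy stand-ins for df_drug_node_map and for the two id columns of interactions_df.
val toyNodeMap = session.createDataFrame(Seq(("DB00001", 0L), ("DB06605", 1L), ("DB06695", 2L)))
  .toDF("drug_id", "node_id")
val toyInteractions = session.createDataFrame(Seq(("DB06605", "DB00001"), ("DB06695", "DB00001")))
  .toDF("drug_interaction_id", "drugbank_id")

// Rename the map's columns once per side so each join key and output column is unambiguous.
val mapA = toyNodeMap.toDF("drug_interaction_id", "drug_A_node_id")
val mapB = toyNodeMap.toDF("drugbank_id", "drug_B_node_id")

// Left joins keep every interaction row, as in the notebook.
val toyEdges = toyInteractions
  .join(mapA, Seq("drug_interaction_id"), "left")
  .join(mapB, Seq("drugbank_id"), "left")
  .select("drug_interaction_id", "drug_A_node_id", "drugbank_id", "drug_B_node_id")

toyEdges.show()
```

Both approaches should give the same edge list; this one simply avoids the round trip through an RDD to add the index.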



In [26]:
// save edges data for later use
edgesData
   .repartition(1)
   .write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/home/jovyan/work/data/edges.csv")
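
The edge list can then be turned into a GraphX graph. The following is a sketch on toy rows, assuming the node id columns are longs as in edgesData. The names session, toyEdges and edgeTuples are illustrative and the ids are made up. Graph.fromEdgeTuples only creates vertices that appear in at least one edge, so drugs without interactions would have to be added separately.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.SparkSession

// In the notebook a session already exists; getOrCreate() then simply returns it.
val session = SparkSession.builder().master("local[*]").appName("graphx-sketch").getOrCreate()

// Toy rows with the same columns as edgesData.
val toyEdges = session.createDataFrame(Seq(
  ("DB06605", 1L, "DB00001", 0L),
  ("DB06695", 2L, "DB00001", 0L)
)).toDF("drug_A_id", "drug_A_node_id", "drugbank_id", "drug_B_node_id")

// Keep only the numeric endpoints as (source, destination) pairs.
val edgeTuples = toyEdges
  .select("drug_A_node_id", "drug_B_node_id")
  .na.drop() // left joins can leave null node ids; GraphX needs concrete Longs
  .rdd
  .map(row => (row.getLong(0), row.getLong(1)))

// Every vertex gets the default attribute 1; edge attributes are Int counts.
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)

println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
```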

## Graph extraction with Features

To use GCNs for link prediction, we need to extract some features for each node (drug). 
For this part, we use a dataset /data/drug_features.csv with the following fields:

- drug_id: id of the drug
- type: type of the drug (biotech or small molecule)
- group: list of groups the drug belongs to (e.g. approved, investigational)
- target_info: list of info extracted directly from xml, needed for extracting target gene name(s)
- enzyme_info: list of info extracted directly from xml, needed for extracting enzyme gene name(s)
- interactions: list of all drug ids this drug interacts with

Now we want to extract a graph whose nodes are the drugs, each carrying its features, with an edge for each drug_interaction_id-drugbank_id pair.
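
As a rough picture of what "features" means here, the sketch below encodes one drug as a fixed-length 0/1 vector in plain Scala. The Drug case class and the featureVector helper are hypothetical; the feature order (two type flags, then the seven group labels used later in this section) is our own choice for illustration.

```scala
// The seven group labels used in this section, in a fixed order.
val groupLabels = Seq("withdrawn", "illicit", "vet_approved", "investigational",
                      "approved", "experimental", "nutraceutical")

// Hypothetical in-memory view of one drug; drugType is None when the type is missing.
final case class Drug(drugId: String, drugType: Option[String], groups: Set[String])

// Encode a drug as [is_biotech, is_small_molecule, one flag per group label].
def featureVector(drug: Drug): Vector[Double] = {
  def flag(b: Boolean): Double = if (b) 1.0 else 0.0
  val typeFlags = Vector(
    flag(drug.drugType.contains("biotech")),
    flag(drug.drugType.contains("small molecule"))
  )
  typeFlags ++ groupLabels.map(label => flag(drug.groups.contains(label)))
}

val sample = Drug("DB00001", Some("biotech"), Set("approved"))
println(featureVector(sample)) // Vector(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0)
```

Stacking one such vector per node, ordered by node id, would give the feature matrix a GCN expects alongside the edge list.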

In [27]:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD


In [28]:
var df = spark.read.options(Map("inferSchema"->"true","delimiter"->"\t", "header"->"true"))
  .csv("/home/jovyan/work/data/drug_features.csv")
df.show(8)

+-------+--------------+--------------------+--------------------+
|drug_id|          type|               group|        interactions|
+-------+--------------+--------------------+--------------------+
|DB00001|       biotech|          [approved]|[DB06605, DB06695...|
|DB00002|       biotech|          [approved]|[DB00012, DB00016...|
|DB00003|       biotech|          [approved]|                null|
|DB00004|       biotech|[approved, invest...|[DB00012, DB00016...|
|DB00005|       biotech|[approved, invest...|[DB01281, DB00026...|
|DB00006|small molecule|[approved, invest...|[DB06605, DB06695...|
|DB00007|small molecule|[approved, invest...|[DB09066, DB09083...|
|DB00008|       biotech|[approved, invest...|[DB06643, DB00005...|
+-------+--------------+--------------------+--------------------+
only showing top 8 rows



df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 2 more fields]


In [38]:
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col // used below to build the feature columns

#### 1. type feature

In [40]:
// let's see what values we have here
df.select("type").distinct().show()

+--------------+
|          type|
+--------------+
|          null|
|       biotech|
|small molecule|
+--------------+



In [30]:
// add boolean is_biotech and is_small_molecule columns (both are null where type is null)
df = df.withColumn("is_biotech", col("type") === "biotech")
df = df.withColumn("is_small_molecule", col("type") === "small molecule")
df.show(8)

+-------+--------------+--------------------+--------------------+----------+-----------------+
|drug_id|          type|               group|        interactions|is_biotech|is_small_molecule|
+-------+--------------+--------------------+--------------------+----------+-----------------+
|DB00001|       biotech|          [approved]|[DB06605, DB06695...|      true|            false|
|DB00002|       biotech|          [approved]|[DB00012, DB00016...|      true|            false|
|DB00003|       biotech|          [approved]|                null|      true|            false|
|DB00004|       biotech|[approved, invest...|[DB00012, DB00016...|      true|            false|
|DB00005|       biotech|[approved, invest...|[DB01281, DB00026...|      true|            false|
|DB00006|small molecule|[approved, invest...|[DB06605, DB06695...|     false|             true|
|DB00007|small molecule|[approved, invest...|[DB09066, DB09083...|     false|             true|
|DB00008|       biotech|[approved, inves

df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 4 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 4 more fields]
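
If the downstream model expects numeric 0/1 inputs rather than booleans, the flags can be cast. This is a sketch on toy rows, assuming a missing type should count as 0 for both flags (an assumption, not something the notebook decides). The names session, toyDrugs and withFlags, and the id DB99999, are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// In the notebook a session already exists; getOrCreate() then simply returns it.
val session = SparkSession.builder().master("local[*]").appName("type-flags-sketch").getOrCreate()

// Toy rows: Option models the null type seen in the distinct() output above.
val toyDrugs = session.createDataFrame(Seq(
  ("DB00001", Option("biotech")),
  ("DB00006", Option("small molecule")),
  ("DB99999", Option.empty[String]) // hypothetical id with a missing type
)).toDF("drug_id", "type")

// coalesce turns the null comparison result into false before the cast to 0/1.
val withFlags = toyDrugs
  .withColumn("is_biotech", coalesce(col("type") === "biotech", lit(false)).cast("int"))
  .withColumn("is_small_molecule", coalesce(col("type") === "small molecule", lit(false)).cast("int"))

withFlags.show()
```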


#### 2. group feature
This feature is a list of values from: ['withdrawn', 'illicit', 'vet_approved', 'investigational', 'approved', 'experimental', 'nutraceutical'] (extracted using pandas)
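
Because the column is read from the file as a plain string such as [approved, investigational], a substring test for one label can also hit a longer label that contains it (approved inside vet_approved). A small sketch of the difference in plain Scala, assuming the bracketed, comma-separated format shown in the outputs above (parseGroups is a hypothetical helper):

```scala
// Parse a bracketed list such as "[approved, vet_approved]" into a set of exact labels.
// A null or empty input yields an empty set.
def parseGroups(raw: String): Set[String] =
  Option(raw)
    .map(_.trim.stripPrefix("[").stripSuffix("]"))
    .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)
    .getOrElse(Set.empty[String])

val raw = "[vet_approved]"

println(raw.contains("approved"))              // true: substring match
println(parseGroups(raw).contains("approved")) // false: exact label match
```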

In [31]:
// group is a plain string such as "[approved, investigational]", so each label is tested by text match
df = df.withColumn("withdrawn", col("group").contains("withdrawn"))
df = df.withColumn("illicit", col("group").contains("illicit"))
df = df.withColumn("vet_approved", col("group").contains("vet_approved"))
df = df.withColumn("investigational", col("group").contains("investigational"))
// "approved" is also a substring of "vet_approved"; the word boundaries keep the match to the standalone label
df = df.withColumn("approved", col("group").rlike("\\bapproved\\b"))
df = df.withColumn("experimental", col("group").contains("experimental"))
df = df.withColumn("nutraceutical", col("group").contains("nutraceutical"))
df.show(8)

+-------+--------------+--------------------+--------------------+----------+-----------------+---------+-------+------------+---------------+--------+------------+-------------+
|drug_id|          type|               group|        interactions|is_biotech|is_small_molecule|withdrawn|illicit|vet_approved|investigational|approved|experimental|nutraceutical|
+-------+--------------+--------------------+--------------------+----------+-----------------+---------+-------+------------+---------------+--------+------------+-------------+
|DB00001|       biotech|          [approved]|[DB06605, DB06695...|      true|            false|    false|  false|       false|          false|    true|       false|        false|
|DB00002|       biotech|          [approved]|[DB00012, DB00016...|      true|            false|    false|  false|       false|          false|    true|       false|        false|
|DB00003|       biotech|          [approved]|                null|      true|            false|    false|

df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]
df: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 11 more fields]


#### Create Nodes Dataset

In [32]:
// val df2 = df.withColumn("interactions", explode(array(col("interactions"))))
// df2.show(8)


In [33]:
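// no inferSchema here, so node_id is read back as a string; cast it to long before using it as a vertex id
val df_drug_node_map = spark.read.options(Map("delimiter"->",", "header"->"true"))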
  .csv("/home/jovyan/work/data/drug2NodeMap.csv")
df_drug_node_map.show(7)

+-------+-------+
|drug_id|node_id|
+-------+-------+
|DB00194|      0|
|DB00741|      1|
|DB00846|      2|
|DB00912|      3|
|DB01357|      4|
|DB01460|      5|
|DB01979|      6|
+-------+-------+
only showing top 7 rows



df_drug_node_map: org.apache.spark.sql.DataFrame = [drug_id: string, node_id: string]


In [34]:
df.count()

res28: Long = 13608


In [35]:
val node_features = df.join(df_drug_node_map, Seq("drug_id"), "left")
node_features.show(5)

+-------+-------+--------------------+--------------------+----------+-----------------+---------+-------+------------+---------------+--------+------------+-------------+-------+
|drug_id|   type|               group|        interactions|is_biotech|is_small_molecule|withdrawn|illicit|vet_approved|investigational|approved|experimental|nutraceutical|node_id|
+-------+-------+--------------------+--------------------+----------+-----------------+---------+-------+------------+---------------+--------+------------+-------------+-------+
|DB00001|biotech|          [approved]|[DB06605, DB06695...|      true|            false|    false|  false|       false|          false|    true|       false|        false|   9529|
|DB00002|biotech|          [approved]|[DB00012, DB00016...|      true|            false|    false|  false|       false|          false|    true|       false|        false|   3685|
|DB00003|biotech|          [approved]|                null|      true|            false|    false|  

node_features: org.apache.spark.sql.DataFrame = [drug_id: string, type: string ... 12 more fields]


In [36]:
node_features.count()

res30: Long = 13608


In [37]:
node_features
   .repartition(1)
   .write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/home/jovyan/work/data/node_features.csv")