# **USECASE: Building a disease-disease similarity network**

* DBRetina is an efficient tool for building a similarity network for a set of items by pairwise calculation of their shared features using a linear-time algorithm.
* [DisGeNET](https://www.disgenet.org/) has one of the largest collections of genes associated with human diseases.
* In this tutorial, we will use DBRetina to build a disease-disease similarity network based on the number of genes that each pair of diseases shares in the DisGeNET database.
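
To make "shared features" concrete, here is a minimal sketch that counts the features shared by each pair of items in a tiny made-up associations table. It is not DBRetina's implementation. The file name `toy_pairs.asc` and the item and feature labels are hypothetical, and the sketch assumes each item-feature pair appears only once.

```shell
# Toy associations: item <TAB> feature (hypothetical data, not from DisGeNET)
printf 'item\tfeature\nD1\tG1\nD1\tG2\nD1\tG3\nD2\tG2\nD2\tG3\nD3\tG4\n' > toy_pairs.asc

# Sort by feature (then item) so all items of one feature are adjacent,
# then count every pair of items that co-occur under the same feature
shared=$(tail -n+2 toy_pairs.asc | sort -t$'\t' -k2,2 -k1,1 | awk 'BEGIN{FS=OFS="\t"}
  {
    if ($2 != prev) { n = 0; prev = $2 }              # new feature: reset its item list
    for (i = 1; i <= n; i++) cnt[items[i] OFS $1]++   # pair with earlier items of this feature
    items[++n] = $1
  }
  END { for (p in cnt) print p, cnt[p] }' | sort)
echo "$shared"
```

In this toy table only D1 and D2 have features in common (G2 and G3), so the single output row reports a shared count of 2 for that pair. D3 shares nothing and does not appear.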

In [None]:
%%bash
## Download the most recent version of disease gene associations from DisGeNET
if [ ! -f all_gene_disease_associations.tsv ];then
  wget -N http://www.disgenet.org/static/disgenet_ap1/files/downloads/all_gene_disease_associations.tsv.gz
  gunzip all_gene_disease_associations.tsv.gz
else echo "all_gene_disease_associations.tsv file exists in the disgenet DB";fi

In [None]:
%%bash
head -n3 all_gene_disease_associations.tsv

In [None]:
%%bash
## Transform the data table into the DBRetina format
## DBRetina expects 2 files. Both are tab-separated files with two columns, and both must have a header line
## 1) Associations file: The 1st column holds the "items" and the 2nd holds their associated "features".
## 2) Super-association file: The 1st column holds the "items" and the 2nd holds their "aliases".
##    Use the 2nd column to rename an item or to pool multiple items together as one super item;
##    otherwise the 2nd column should be a copy of the 1st column
## In addition, we will filter the input list to keep trusted disease-gene associations only (DisGeNET score > 0.3)
cat all_gene_disease_associations.tsv | sed -e 's/^[ \t]*//' | awk 'BEGIN{FS=OFS="\t";}{if($10>0.3)print $6,$2}' > disgenet.asc
echo "item alias" | tr ' ' '\t' > disgenet.names
tail -n+2 disgenet.asc | awk 'BEGIN{FS=OFS="\t";}{print $1,$1}' | sort | uniq >> disgenet.names

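The same transformation can be tried on a small mock table before running it on the full download. This is only a sketch: the rows, the file names (`toy_gda.*`) and the header labels are made up, and the only assumption taken from the command above is that the gene is in column 2, the disease name in column 6 and the score in column 10. The sketch keeps the header row explicitly with `NR==1` rather than relying on how awk compares a non-numeric header field with 0.3.

```shell
# Mock input with 10 tab-separated columns (made-up rows); only columns 2, 6 and 10 matter here
{
  printf 'c1\tgene\tc3\tc4\tc5\tdisease\tc7\tc8\tc9\tscore\n'
  printf '1\tGENE_A\t.\t.\t.\tDisease X\t.\t.\t.\t0.70\n'
  printf '2\tGENE_B\t.\t.\t.\tDisease X\t.\t.\t.\t0.10\n'
  printf '3\tGENE_A\t.\t.\t.\tDisease Y\t.\t.\t.\t0.45\n'
} > toy_gda.tsv

# Associations file: keep the header plus rows scoring above 0.3, as "disease <TAB> gene"
awk 'BEGIN{FS=OFS="\t"} NR==1 || $10 > 0.3 {print $6, $2}' toy_gda.tsv > toy_gda.asc

# Super-association file: every item is its own alias
echo "item alias" | tr ' ' '\t' > toy_gda.names
tail -n+2 toy_gda.asc | awk 'BEGIN{FS=OFS="\t"}{print $1,$1}' | sort -u >> toy_gda.names

cat toy_gda.asc toy_gda.names
```

The low-scoring GENE_B row is dropped, so each disease keeps one gene, and the names file lists each disease once with itself as the alias.
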
In [None]:
%%bash
## Let us explore the format of the prepared files
echo "DisGeNET input file"
wc -l all_gene_disease_associations.tsv
echo "==================="
echo "Associations file"
wc -l disgenet.asc
head -n3 disgenet.asc
echo "==================="
echo "Super-associations file"
wc -l disgenet.names
head -n3 disgenet.names

In [None]:
%%bash
## Now we can run DBRetina
kPro_index="disgenetDBR"
DBRetina items_indexing -i disgenet.asc -n disgenet.names -p ${kPro_index}
DBRetina pairwise -i ${kPro_index}

In [None]:
%%bash
kPro_index="disgenetDBR"
# How many pairwise comparisons were reported?
wc -l ${kPro_index}_kSpider_pairwise.tsv

In [None]:
%%bash
kPro_index="disgenetDBR"
# What does the output look like?
echo "The table of pairwise combinations"
head ${kPro_index}_kSpider_pairwise.tsv

In [None]:
%%bash
kPro_index="disgenetDBR"
# The items (i.e. diseases) are encoded as numerical IDs
# A separate file maps each item to its ID
head ${kPro_index}.namesMap

In [None]:
%%bash
kPro_index="disgenetDBR"
# Another output file holds a further important piece of information: the number of genes associated with each disease
head ${kPro_index}_kSpider_seqToKmersNo.tsv

In [None]:
%%bash
kPro_index="disgenetDBR"
# Now let us merge the items' names, IDs and numbers of associated features into one output file
paste <(tail -n+2 ${kPro_index}.namesMap |cut -d" " -f1)  <(tail -n+2 ${kPro_index}.namesMap |cut -d" " -f2-) > ${kPro_index}.namesMap.tmp
echo "node_id node_name size" | tr ' ' '\t' > ${kPro_index}_nodes_size.tsv
awk 'BEGIN{FS=OFS="\t";}FNR==NR{a[$2]=$3;next;}{if(a[$1]!="")print $0,a[$1]}' ${kPro_index}_kSpider_seqToKmersNo.tsv ${kPro_index}.namesMap.tmp >> ${kPro_index}_nodes_size.tsv
rm ${kPro_index}.namesMap.tmp*
head ${kPro_index}_nodes_size.tsv

In [None]:
%%bash
kPro_index="disgenetDBR"
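# Let us build one final output
# Meanwhile, we will calculate two percentages for each pair and filter out pairs with minimal similarity:
#   jDist  = shared / union of the two gene sets (the Jaccard index, i.e. a similarity despite the column name)
#   smPerc = shared / size of the smaller gene set (the containment ratio)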
echo ":START_ID-Features|shared_count:int|jDist:float|smPerc:float|:END_ID-Features" > ${kPro_index}_relations.csv
awk 'BEGIN{FS="\t";S="|";}FNR==NR{a[$1]=$3;b[$1]=$2"-"$3;next;}{
   g1=a[$2]; g2=a[$3]; min=g1;min=(min < g2 ? min : g2); 
   jDist=$4*100/(g1+g2-$4); smPerc=$4*100/min; 
   if(jDist>1 || smPerc>10)printf("%s%s%s%s%.1f%s%.1f%s%s\n", b[$2],S,$4,S,jDist,S,smPerc,S,b[$3])}' ${kPro_index}_nodes_size.tsv <(tail -n+2 ${kPro_index}_kSpider_pairwise.tsv) >> ${kPro_index}_relations.csv

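As a worked example of the two measures computed above, the sketch below applies the same formulas to one hypothetical pair of nodes. The numbers (10 and 4 features, 2 shared) are made up for illustration. Note that the value written to the `jDist` column is the Jaccard index expressed as a percentage, so larger values mean more similar nodes.

```shell
# Hypothetical pair: node A has 10 features, node B has 4, and they share 2
result=$(awk -v g1=10 -v g2=4 -v shared=2 'BEGIN{
  min  = (g1 < g2 ? g1 : g2)                 # size of the smaller set
  jacc = shared * 100 / (g1 + g2 - shared)   # shared / union, as a percentage
  cont = shared * 100 / min                  # shared / smaller set, as a percentage
  printf("%.1f|%.1f\n", jacc, cont)
}')
echo "$result"
```

This prints `16.7|50.0`: the pair shares 2 of the 12 distinct features (16.7%), and those 2 cover half of the smaller node. With the thresholds used above (`jDist>1 || smPerc>10`), this pair would be kept.
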
In [None]:
%%bash
kPro_index="disgenetDBR"
# Let us have a look
head ${kPro_index}_relations.csv