# **Learning a signed graph from the GSD dataset**

In this notebook, we learn a signed graph from the GSD dataset, which is taken from [1] and is also
studied in the paper. We start by importing the necessary packages.

In [8]:
import os

import pandas as pd # to read the GSD dataset
import numpy as np

import project_path # for this notebook to be able to import from pysrc folder
from pysrc.graphlearning import learn_signed_graph
from pysrc.evaluation import auc # to evaluate inference with auprc/auroc

# Project folder
parent_dir = os.path.abspath(os.path.join(os.pardir)) 

Next, we read the gene expression data and the reference network.

In [2]:
# Data files
expression_file = os.path.join(parent_dir, "data/inputs/GSD/ExpressionData.csv")
ref_net_file = os.path.join(parent_dir, "data/inputs/GSD/refNetwork.csv")

# Read data files
expression_df = pd.read_csv(expression_file, index_col=0) 
ref_net_df = pd.read_csv(ref_net_file)

Now, we infer a signed graph from the expression data, which is a $p\times n$ matrix, where $p$ is
the number of genes and $n$ is the number of cells. This requires two hyperparameters, $\alpha_1$ and
$\alpha_2$, which control the densities of the positive and negative parts of the learned signed
graph. Thus, instead of choosing the hyperparameters directly, we can set the desired densities and
then search for the $\alpha_1$ and $\alpha_2$ values that produce them. The current implementation
uses binary search to find $\alpha_1$ and $\alpha_2$ values whose densities are close to the desired
ones. Binary search can sometimes fail to find values that approximate the desired densities closely.
In such cases, one can choose the hyperparameters manually. As future work, we intend to find an
exact relation between the hyperparameters and the densities of the positive and negative parts, as
done in some recent works in the graph signal processing literature.
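The density search can be illustrated with a minimal bisection sketch. This is not the code used inside `learn_signed_graph`. The names `search_alpha` and `toy_density` are hypothetical, and the sketch assumes that density is monotonically non-increasing in a single hyperparameter, which may not hold for the actual learner.

```python
import numpy as np

def search_alpha(density_fn, target, lo=0.0, hi=1.0, tol=0.01, max_iter=50):
    """Bisection on one scalar hyperparameter (hypothetical helper).

    Assumes density_fn(alpha) is non-increasing in alpha on [lo, hi].
    Returns (alpha, density, converged).
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        d = density_fn(mid)
        if abs(d - target) <= tol:
            return mid, d, True
        if d > target:
            lo = mid  # graph too dense: move towards larger alpha
        else:
            hi = mid  # graph too sparse: move towards smaller alpha
    return mid, d, False  # no alpha within tolerance was found

# Toy stand-in for "learn a graph and measure its density":
# an edge is kept when its random score exceeds alpha.
rng = np.random.default_rng(0)
scores = rng.random(1000)

def toy_density(alpha):
    return float(np.mean(scores > alpha))

alpha, density, converged = search_alpha(toy_density, target=0.45)
print(alpha, density, converged)
```

When `converged` is `False`, the search did not reach the tolerance, which corresponds to the case where the hyperparameters are chosen manually.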

For this example, we infer a signed graph with the correlation kernel and set both the desired
positive and negative edge densities to 0.45, which is approximately the value we found for both
densities using the surrogate data approach described in the paper.
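To make the notion of edge density concrete, the sketch below computes positive and negative densities from a list of edge weights. It assumes an undirected graph without self-loops, so that $p$ genes give $p(p-1)/2$ possible edges. The helper `signed_densities` and the toy weights are hypothetical and are not part of `pysrc`.

```python
import numpy as np

def signed_densities(weights, n_genes):
    """Fraction of possible gene pairs carrying a positive / negative edge.

    Assumption: undirected graph without self-loops, i.e.
    n_genes * (n_genes - 1) / 2 possible edges.
    """
    weights = np.asarray(weights, dtype=float)
    n_pairs = n_genes * (n_genes - 1) // 2
    return {
        "+": np.count_nonzero(weights > 0) / n_pairs,
        "-": np.count_nonzero(weights < 0) / n_pairs,
    }

# Hypothetical edge weights for a 5-gene graph (10 possible pairs)
toy_weights = [0.8, 0.3, 0.1, -0.5, -0.2, 0.4]
densities = signed_densities(toy_weights, n_genes=5)
print(densities)  # {'+': 0.4, '-': 0.2}
```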

In [21]:
G = learn_signed_graph(expression_df.to_numpy(), pos_density=0.45, neg_density=0.45,
                       assoc="correlation", gene_names=np.array(expression_df.index))

G is a dataframe in which each row represents an edge between two genes. Each edge has a weight,
which is positive or negative depending on the sign of the edge. We evaluate the inferred graph
using signed AUPRC and AUROC:

In [22]:
auprc, auroc, auprc_ratio, auroc_ratio = auc.signed(ref_net_df, G, "directed")
print(auprc)
print(auroc)

{'+': 0.37632317608807087, '-': 0.16571191138881608}
{'+': 0.7545796475586822, '-': 0.7256319811875367}
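The output reports one score per edge sign. How `auc.signed` computes these is not shown in this notebook, so the sketch below is only one plausible reading: each sign is scored separately, ranking candidate edges by weight for `'+'` and by negated weight for `'-'`. The functions `average_precision` and `signed_average_precision`, and the toy data, are hypothetical. Ties between scores are not handled specially.

```python
import numpy as np

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (numpy only)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    labels = np.asarray(labels, dtype=float)[order]
    if labels.sum() == 0:
        return float("nan")                      # no true edges of this sign
    tp = np.cumsum(labels)                       # true positives at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    # Average the precision at the rank of every true edge
    return float(np.sum(precision * labels) / labels.sum())

def signed_average_precision(ref_signs, weights):
    """Score positive and negative edges separately (assumed scheme)."""
    ref_signs = np.asarray(ref_signs)
    weights = np.asarray(weights, dtype=float)
    return {
        "+": average_precision(ref_signs == 1, weights),
        "-": average_precision(ref_signs == -1, -weights),
    }

# Toy example: 4 candidate edges with reference signs (+1, 0 = no edge, -1)
toy_ref = [1, 0, -1, 1]
toy_weights = [0.9, 0.2, -0.7, -0.1]
toy_scores = signed_average_precision(toy_ref, toy_weights)
print(toy_scores)
```

In the toy example the single negative reference edge is ranked first among negative candidates, so its score is 1.0, while one positive reference edge is ranked third, giving 5/6 for the positive part.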


### **References**

[1] Pratapa, Aditya, et al. "Benchmarking algorithms for gene regulatory network inference from 
single-cell transcriptomic data." Nature Methods 17.2 (2020): 147-154.