# Tutorial for CuNA 

CuNA has two parts:
1. computing redescription groups from cumulants, and
2. building a network from the redescription groups and performing network analysis on it.
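
To give a sense of the quantity behind step 1, the sketch below computes a third-order joint cumulant of three variables from its textbook definition (the mean of the product of the centered variables). It is an illustration only, not the implementation in `geno4sd`; the function name `joint_cumulant3` and the simulated data are assumptions made for this example.

```python
import numpy as np

def joint_cumulant3(x, y, z):
    """Third-order joint cumulant: E[(X - EX)(Y - EY)(Z - EZ)]."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    return np.mean((x - x.mean()) * (y - y.mean()) * (z - z.mean()))

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)
c_indep = rng.normal(size=5000)  # unrelated to a and b
c_joint = a * b                  # depends on a and b jointly

k_indep = joint_cumulant3(a, b, c_indep)  # expected to be close to 0
k_joint = joint_cumulant3(a, b, c_joint)  # expected to be clearly non-zero
print(k_indep, k_joint)
```

A cumulant near zero suggests no three-way interaction among the variables, while a large one flags a group of features worth carrying into the network step.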

In [1]:
import pandas as pd 
import numpy as np 
import os, sys, time, random, math
from geno4sd.topology.CuNA import cumulants, CuNA

### Read data
We use sample data from the TCGA breast cancer study. From this data we have selected a subset of mRNAs, miRNAs and proteins that are associated with breast cancer.

In [6]:
fname = '../sample_data/CuNA_TCGA_sample_data.csv'
df = pd.read_csv(fname)
print("Number of individuals: ", df.shape[0])
print("Number of features: ", df.shape[1])

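# Randomly keep 25 columns (fixed seed for reproducibility)
df = df.sample(n=25, axis='columns', random_state=123)
# Record the name of the first sampled column, then drop that column
ids = df.columns[0]
df.drop(df.columns[0], axis=1, inplace=True)
# Input file name without its extension
i_fname = os.path.basename(fname).split('.')[0]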

Number of individuals:  150
Number of features:  44
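
Because `df.sample(axis='columns')` picks columns at random, the first column after sampling is not necessarily the first column of the file. If the first CSV column holds sample identifiers (an assumption; the file layout is not described here), one way to keep them out of the random draw is to set them aside before sampling. The sketch below uses a small made-up table; the column names `sample_id` and `f0` to `f5` are hypothetical.

```python
import pandas as pd

# Toy table standing in for the real data (all names and values are made up)
toy = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3"],  # hypothetical identifier column
    **{f"f{i}": [i, i + 1, i + 2] for i in range(6)},
})

ids = toy.iloc[:, 0]        # keep the identifiers separately
features = toy.iloc[:, 1:]  # everything else is a candidate feature

# Sample only among the feature columns
subset = features.sample(n=3, axis="columns", random_state=123)
print(subset.columns.tolist())
```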


### Computing Cumulants

In [7]:
beg_time = time.time()
cumulants_df = cumulants.getCumulants(df)
print("Time spent computing cumulants (mins): ", (time.time() - beg_time)/60)

Time spent computing cumulants (mins):  0.14253523747126262


#### The p-value input should be a list of p-values. 

In [8]:
# p-value thresholds
p = [1e-5, 1e-6, 1e-7]

# percentage thresholds from 0.9 down to 0.1
cutofflist = np.linspace(0.9,0.1,17)
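
For reference, `np.linspace(0.9, 0.1, 17)` yields 17 evenly spaced thresholds that step down by 0.05 from 0.9 to 0.1. A quick check:

```python
import numpy as np

cutofflist = np.linspace(0.9, 0.1, 17)
step = cutofflist[1] - cutofflist[0]  # spacing between consecutive thresholds

print(len(cutofflist), round(float(step), 2))
print(np.round(cutofflist, 2))
```

Note that `cutofflist` is defined here but is not passed to `CuNA.get_network` in the call used in this tutorial.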

### Computing CuNA (Cumulant-based network analysis)
CuNA returns the following:
1. A dataframe with edges in its rows and the connected vertices in its columns, along with the statistical significance (p-value) of each edge from the Fisher exact test.
2. The **count**, or weight, of each edge.
3. A dataframe with the community membership of all vertices. Rows vary in length, and empty fields contain None.
4. A dataframe with the node rank: a score indicating the importance of each vertex across different centrality measures. A lower score means higher importance.
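
As background for the p-values in item 1, the sketch below computes a two-sided Fisher exact test for a 2×2 contingency table using only the standard library. How CuNA builds the table for each edge is not described here, so the counts are invented and `fisher_exact_2x2` is a hypothetical helper, not part of `geno4sd`.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, col2, n = a + b, a + c, b + d, a + b + c + d
    total = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell
        return comb(col1, k) * comb(col2, row1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, row1 - col2), min(row1, col1)
    # Add up every table that is no more likely than the observed one
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

p_value = fisher_exact_2x2(8, 2, 1, 5)  # made-up counts
print(round(p_value, 4))
```

A small p-value means the observed association between the two margins of the table would be unlikely under independence.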

In [9]:
beg_time = time.time()
interactions, nodes, communities, noderank = CuNA.get_network(cumulants_df, 0, p, verbose=0)
print("Time spent computing CuNA network (mins): ", (time.time() - beg_time)/60)

Time spent computing CuNA network (mins):  0.47827444871266683


Communities in the network

In [10]:
print(communities)

                       0           1       2             3           4      5  \
Community0   hsa-mir-20a  hsa-mir-93      PR  hsa-mir-106a  hsa-mir-17  CSRP2   
Community1  hsa-mir-130b     C4orf34  SEMA3C          FUT8      MED13L     AR   
Community2         MEX3A        JNK2   PREX1        INPP4B      ZNF552   E2F1   

                      6            7             8  
Community0      SLC43A3  hsa-mir-186  hsa-mir-1301  
Community1  hsa-mir-505         ASNS         CCNA2  
Community2         None         None          None  
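
Since the communities table pads shorter rows with `None`, it can be handy to flatten it into a node-to-community lookup. A minimal sketch on a toy table shaped like the output above (the node names `a` to `e` are placeholders):

```python
import pandas as pd

# Toy stand-in for the `communities` dataframe; None marks an empty field
toy_communities = pd.DataFrame(
    [["a", "b", "c"], ["d", "e", None]],
    index=["Community0", "Community1"],
)

# Map each node to its community, skipping the padding
membership = {
    node: community
    for community, row in toy_communities.iterrows()
    for node in row.dropna()
}

# Number of members per community
sizes = toy_communities.notna().sum(axis=1)
print(membership)
print(sizes)
```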


Top 10 ranked nodes in the network

In [11]:
noderank.sort_values(by='Score')[:10]

           Node  Score
9         PREX1    3.8
6       C4orf34    4.6
3       SLC43A3    4.8
7          JNK2    5.0
1         CCNA2    5.4
21       INPP4B    8.2
0        ZNF552   10.0
5            PR   10.4
13  hsa-mir-186   10.8
15       SEMA3C   11.0
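
One plausible way to arrive at a score of this kind is to rank the nodes under each centrality measure and average the ranks, so that a node near the top under every measure gets a low score. This is an assumption about the scoring, not a description of CuNA's code; the centrality values, node names and column names below are made up.

```python
import pandas as pd

# Made-up centrality values for three nodes
centralities = pd.DataFrame(
    {
        "degree":      [0.9, 0.5, 0.1],
        "betweenness": [0.7, 0.8, 0.0],
        "closeness":   [0.6, 0.4, 0.2],
    },
    index=["n1", "n2", "n3"],
)

# Rank within each measure (rank 1 = most central), then average across measures
ranks = centralities.rank(ascending=False)
score = ranks.mean(axis=1).sort_values()
print(score)
```

Sorting the averaged ranks in ascending order, as `noderank.sort_values(by='Score')` does above, puts the most important nodes first.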
