# PyGNA Workflow

### The workflow involves the following three steps

1. Generate GMT files from CSV files, in case a GMT file isn't available
2. Generate the matrices
3. Perform the analysis for single or multiple genesets, and get the results as PDF or PNG figures


## Data Loading

### Generating GMT files from a table

This applies when you have tabular data from a CSV file or from DESeq. The following utility can be used to generate GMT files from table data.

In [None]:
$ pygna geneset-from-table <filename>.csv <setname> <filename>.gmt --name-column <gene_names_column> --filter-column <filter-col> <'less'> --threshold <th> --descriptor <descriptor string>
$ pygna geneset-from-table <deseq>.csv diff_exp <deseq>.gmt --descriptor deseq  # for a table from DESeq
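As a rough illustration of what such a conversion involves, the sketch below filters the rows of a table and writes one GMT line (set name, descriptor, then tab-separated gene names). It is not PyGNA's implementation: the function `table_to_gmt_line`, the column names `gene` and `padj`, and the example values are assumptions made for illustration.

```python
import csv
import io


def table_to_gmt_line(rows, setname, descriptor, name_col, filter_col, threshold):
    """Keep rows whose filter_col value is below threshold and return one GMT line."""
    genes = [row[name_col] for row in rows if float(row[filter_col]) < threshold]
    # GMT layout: set name, descriptor, then one gene per tab-separated field
    return "\t".join([setname, descriptor] + genes)


# Hypothetical table with a gene-name column and an adjusted p-value column
table = io.StringIO("gene,padj\nTP53,0.001\nBRCA1,0.2\nEGFR,0.004\n")
rows = list(csv.DictReader(table))

gmt_line = table_to_gmt_line(rows, "diff_exp", "deseq", "gene", "padj", 0.01)
print(gmt_line)
```

Only the genes passing the filter end up in the set; everything else in the table is ignored.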


### Merging different Genesets

It is also possible to merge different setnames into a single GMT file with the function *generate-group-gmt*, which generates a GMT file containing multiple setnames. From the table file, it groups the names in the group_col (the column you want to use to group them) and prints the genes in the name_col. You can override the default parameters to match the columns in your table, and set the descriptor according to your needs. Alternatively, you could simply concatenate all the GMT files.
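The grouping idea can be sketched in a few lines of Python: rows are bucketed by a grouping column and each bucket becomes one GMT line. This is only a sketch under stated assumptions; `group_to_gmt`, the column names `group` and `gene`, and the sample rows are hypothetical and do not reproduce PyGNA's code.

```python
from collections import defaultdict


def group_to_gmt(rows, group_col, name_col, descriptor):
    """Group gene names by group_col and return one GMT line per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[name_col])
    # One line per setname: setname, descriptor, then the genes of that group
    return [
        "\t".join([setname, descriptor] + genes)
        for setname, genes in sorted(groups.items())
    ]


# Hypothetical table rows: each gene is assigned to a group
rows = [
    {"group": "up", "gene": "TP53"},
    {"group": "down", "gene": "MYC"},
    {"group": "up", "gene": "EGFR"},
]

gmt_lines = group_to_gmt(rows, "group", "gene", "example")
print("\n".join(gmt_lines))
```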

## Computing the RWR and SP matrices

In [None]:
$ pygna build-distance-matrix <network> <network_sp>.hdf5
$ pygna build-rwr-diffusion <network> --output-file <network_rwr>.hdf5
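To give an intuition for what the two matrices contain, here is a minimal pure-Python sketch that computes shortest-path lengths by breadth-first search and random-walk-with-restart scores by power iteration on a toy graph. The function names, the `restart` value, and the toy network are assumptions for illustration; PyGNA's own computation and parameters may differ.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def rwr_scores(adj, seed, restart=0.5, n_iter=200):
    """Random walk with restart from one seed node, solved by power iteration."""
    p = {node: 0.0 for node in adj}
    p[seed] = 1.0
    for _ in range(n_iter):
        nxt = {node: 0.0 for node in adj}
        for node, mass in p.items():
            # the walker moves to a uniformly chosen neighbour ...
            for nb in adj[node]:
                nxt[nb] += (1.0 - restart) * mass / len(adj[node])
        # ... or jumps back to the seed with probability `restart`
        nxt[seed] += restart
        p = nxt
    return p


# Toy path graph a - b - c (hypothetical data)
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
dist = shortest_path_lengths(adj, "a")
scores = rwr_scores(adj, "a")
print(dist)
print(scores)
```

Running either function once per node would fill one row of the corresponding matrix, which is why precomputing and storing the matrices pays off when many tests reuse them.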

## Topology Tests

In [None]:
$ pygna test-topology-module <network> <geneset> <table_results_test>_topology_module.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-rwr <network> <geneset> <network_rwr>.hdf5 <table_results_test>_topology_rwr.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-internal-degree <network> <geneset> <table_results_test>_topology_internal_degree.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-sp <network> <geneset> <network_sp>.hdf5 <table_results_test>_topology_sp.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-total-degree <network> <geneset> <table_results_test>_topology_total_degree.csv --number-of-permutations 100 --cores 4
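All of these tests rely on a permutation scheme: a statistic is computed on the geneset and compared with the same statistic on random node sets of equal size. The sketch below illustrates this with the number of internal edges as the statistic. It is an illustrative stand-in, not PyGNA's code: the function names, the toy network, the direction of the comparison, and the `(k + 1) / (n + 1)` p-value convention are assumptions.

```python
import random


def internal_edges(adj, nodes):
    """Number of edges with both endpoints inside `nodes`."""
    nodes = set(nodes)
    return sum(1 for u in nodes for v in adj[u] if v in nodes) // 2


def empirical_pvalue(adj, geneset, n_permutations=100, seed=0):
    """Compare the observed statistic with random node sets of the same size."""
    rng = random.Random(seed)
    universe = sorted(adj)
    observed = internal_edges(adj, geneset)
    n_extreme = 0
    for _ in range(n_permutations):
        sample = rng.sample(universe, len(geneset))
        if internal_edges(adj, sample) >= observed:
            n_extreme += 1
    # add-one correction so the p-value is never exactly zero
    return observed, (n_extreme + 1) / (n_permutations + 1)


# Toy network: a 4-node clique (0..3) attached to a chain 3-4-...-19
adj = {i: set() for i in range(20)}
for u in range(4):
    for v in range(u + 1, 4):
        adj[u].add(v)
        adj[v].add(u)
for u in range(3, 19):
    adj[u].add(u + 1)
    adj[u + 1].add(u)

observed, pvalue = empirical_pvalue(adj, [0, 1, 2, 3], n_permutations=100)
print(observed, pvalue)
```

With this convention the smallest attainable p-value is 1 / (n_permutations + 1), so the number of permutations bounds the resolution of the test.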

## Association tests

If only A_geneset_file is passed, the analysis is run on all pairs of sets in that file. If both A_geneset_file and B_geneset_file are passed, the setnames can be specified for each; when a file contains a single geneset, the corresponding setname can be omitted. If both sets are in the same file, B_geneset_file can be left out, but the setnames are then required.
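The pairing logic can be pictured with a short sketch. It assumes that a single file leads to every unordered pair of distinct sets and that two files lead to every A-versus-B combination; `pairs_to_test` and the set names are hypothetical, and whether PyGNA also tests a set against itself is not covered here.

```python
from itertools import combinations, product


def pairs_to_test(sets_a, sets_b=None):
    """Return the (setname_a, setname_b) pairs an association run would cover."""
    if sets_b is None:
        # only one geneset file: every unordered pair of sets within it
        return list(combinations(sets_a, 2))
    # two files: every set in A against every set in B
    return list(product(sets_a, sets_b))


within_pairs = pairs_to_test(["s1", "s2", "s3"])
between_pairs = pairs_to_test(["s1", "s2"], ["p1", "p2", "p3"])
print(within_pairs)
print(between_pairs)
```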

In [None]:
pygna test-association-rwr [-h] [--setname-a SETNAME_A] [--file-geneset-b FILE_GENESET_B] [--setname-b SETNAME_B] [--size-cut SIZE_CUT] [-k] [-c CORES] [-i]
                                [--number-of-permutations NUMBER_OF_PERMUTATIONS] [--n-bins N_BINS] [--results-figure RESULTS_FIGURE]
                                network-file file-geneset-a rwr-matrix-filename output-table

    Performs comparison of network location analysis.

    It computes a p-value for the shortest path distance
    between two genesets being smaller than expected by chance.

    If only A_geneset_file is passed the analysis is run on all the pair of sets in the file, if both
    A_geneset_file and B_geneset_file are passed, one can specify the setnames for both, if there is only one
    geneset in the file, setname_X can be omitted, if both sets are in the same file, B_geneset_file can be not
    specified, but setnames are needed.


positional arguments:
network-file          network file
file-geneset-a        GMT geneset file
rwr-matrix-filename   .hdf5 file with the RWR matrix obtained by pygna
output-table          output results table, use .csv extension

optional arguments:
-h, --help            show this help message and exit
--setname-a SETNAME_A
                        Geneset A to analyse (default: -)
--file-geneset-b FILE_GENESET_B
                        GMT geneset file (default: -)
--setname-b SETNAME_B
                        Geneset B to analyse (default: -)
--size-cut SIZE_CUT   removes all genesets with a mapped length < size_cut (default: 20)
-k, --keep            if true, keeps the geneset B unpermuted (default: False)
-c CORES, --cores CORES
                        Number of cores for the multiprocessing (default: 1)
-i, --in-memory       set if you want the large matrix to be read in memory (default: False)
--number-of-permutations NUMBER_OF_PERMUTATIONS
                        number of permutations for computing the empirical pvalue (default: 500)
--n-bins N_BINS       if >1 applies degree correction by binning the node degrees and sampling according to geneset distribution (default: 1)
--results-figure RESULTS_FIGURE
                        heatmap of results (default: -)

In [None]:
$ pygna test-association-sp <network> <geneset> <network_sp>.hdf5 <table_results_test>_association_sp.csv --file-geneset-b <geneset_pathways> --keep --number-of-permutations 100 --cores 4
$ pygna test-association-rwr <network> <geneset> <network_rwr>.hdf5 <table_results_test>_association_rwr.csv --file-geneset-b <geneset_pathways> --keep --number-of-permutations 100 --cores 4


### Visualisation

In [None]:
Usage: pygna paint-datasets-stats [-h] [-a ALTERNATIVE] table-filename output-file  # GNT barplot
Usage: pygna paint-summary-gnt [-h] [-s SETNAME] [-t THRESHOLD] [-c COLUMN_FILTER] [--larger] [--less-tests LESS_TESTS] output-figure [input_tables [input_tables ...]]  # GNT summary
Usage: pygna paint-comparison-matrix [-h] [-r] [-s] [-a] table-filename output-file  # heatmap
Usage: pygna paint-volcano-plot [-h] [-r] [-i ID_COL] [--threshold-x THRESHOLD_X] [--threshold-y THRESHOLD_Y] [-a] table-filename output-file  # volcano plot



### Snakemake Workflow

1) Install Snakemake
2) Edit the config file and the rules files as needed (paths, parameters, etc.)
3) Run the analysis

With Snakemake, all the steps above boil down to one or two commands.


In [None]:
snakemake --use-conda -n  # dry run


In [None]:
snakemake --snakefile Snakefile_paper --configfile config_paper --use-conda --cores $N  # to replicate the results of the paper

To obtain all the results for the single-geneset analysis (skip the first command, which only touches the existing output files, if you want every file to be fully regenerated):

In [None]:
snakemake --snakefile Snakefile_paper single_all --configfile config_paper_single.yaml -t

snakemake --snakefile Snakefile_paper single_all --configfile config_paper_single.yaml --use-conda

To obtain the results for the multi-geneset analysis:

In [None]:
snakemake --snakefile Snakefile_paper multi_all --configfile config_paper_multi.yaml -t

snakemake --snakefile Snakefile_paper multi_all --configfile config_paper_multi.yaml

## Paper Use Case

### Using Commandline

Since the distance matrices are already built and the merged geneset (GMT) is already available, the topology and association analyses can be carried out directly.



#### **Topology Analysis**

In [None]:
#file names: biogrid_3.168_filtered.tsv merged.gmt goslim.gmt interactome_RWR.hdf5 interactome_SP.hdf5

In [None]:
cd /home/gee3/Documents/PyGNA/data_tcga_workflow/external/

In [None]:
! pygna test-topology-module biogrid_3.168_filtered.tsv merged.gmt table_topology_module.csv --number-of-permutations 100 --cores 2





In [None]:
! pygna test-topology-rwr biogrid_3.168_filtered.tsv merged.gmt interactome_RWR.hdf5 tableresults_topology_rwr.csv --number-of-permutations 10 --cores 3

In [None]:
! pygna test-topology-internal-degree biogrid_3.168_filtered.tsv merged.gmt table_topology_internal_degree.csv --number-of-permutations 10 --cores 3

In [None]:
! pygna test-topology-sp biogrid_3.168_filtered.tsv merged.gmt interactome_SP.hdf5 table_topology_sp.csv --number-of-permutations 10 --cores 2


In [None]:
! pygna test-topology-total-degree biogrid_3.168_filtered.tsv merged.gmt table_topology_total_degree.csv --number-of-permutations 100 --cores 4

#### **Association Tests**

In a GNA, two genesets are tested for their association. When testing a single geneset against many pathways, it is recommended to use the --keep flag. This way, during resampling only geneset A is randomly permuted, while geneset B is kept as it is. This strategy is more conservative and helps in testing whether the tested geneset is more strongly connected to the pathway (or any other geneset of interest) than expected by chance.
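The effect of keeping geneset B fixed can be sketched as follows: in each null sample, geneset A is replaced by a random set of the same size, while geneset B is either left untouched or resampled as well. The helper `resample_pair`, the gene identifiers, and the uniform sampling are assumptions for illustration, not PyGNA's implementation.

```python
import random


def resample_pair(universe, geneset_a, geneset_b, keep, rng):
    """Draw one null sample: A is always permuted, B only when keep is False."""
    null_a = rng.sample(universe, len(geneset_a))
    null_b = list(geneset_b) if keep else rng.sample(universe, len(geneset_b))
    return null_a, null_b


rng = random.Random(0)
universe = [f"g{i}" for i in range(50)]   # hypothetical network nodes
geneset_a = ["g1", "g2", "g3"]            # tested geneset
pathway_b = ["g10", "g11", "g12", "g13"]  # pathway kept as it is

null_a, null_b = resample_pair(universe, geneset_a, pathway_b, keep=True, rng=rng)
print(null_a, null_b)
```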

In [None]:
! pygna test-association-rwr biogrid_3.168_filtered.tsv merged.gmt interactome_RWR.hdf5 table_association_rwr.csv --file-geneset-b goslim_entrez.gmt --keep --number-of-permutations 100 --cores 4

If you don't include the --results-figure flag at the comparison step, you can plot the matrix afterwards as follows:

In [None]:
! pygna paint-comparison-matrix table_association_rwr.csv heatmap_association_rwr.png --rwr --annotate

If setname B is not passed, the analysis is run between each pair of setnames in the geneset file, as follows. (This is the only difference between the single-geneset and multiple-geneset cases: there is no within comparison in the multi case.)

In [None]:
! pygna test-association-rwr biogrid_3.168_filtered.tsv merged.gmt interactome_RWR.hdf5 table_within_comparison_rwr.csv --number-of-permutations 100 --cores 2

! pygna paint-comparison-matrix table_within_comparison_rwr.csv heatmap_within_comparison_rwr.png --rwr --single-geneset

In [None]:
! pygna test-association-sp biogrid_3.168_filtered.tsv merged.gmt interactome_SP.hdf5 table_association_SP.csv --file-geneset-b goslim_entrez.gmt --keep --number-of-permutations 2 --cores 1
! pygna paint-comparison-matrix table_association_SP.csv heatmap_association_sp.png --annotate  # default heatmap

INFO:root:geneset_a contains 6 sets
INFO:root:geneset_b contains 139 sets
INFO:root:Results file = table_association_SP.csv
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 504 genes out of 541 from genesetB
INFO:root:n_proc = 1, each computing 2 permutations 
INFO:root:Observed: 0.0722265 p-value: 1
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 2379 genes out of 2659 from genesetB
INFO:root:n_proc = 1, each computing 2 permutations 
INFO:root:Observed: 0.0167353 p-value: 0.666667
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 2031 genes out of 2275 from genesetB
INFO:root:n_proc = 1, each computing 2 permutations 
INFO:root:Observed: 0.0571095 p-value: 1
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 938 genes out of 995 from genesetB
INFO:root:n_proc = 1, each computing 2 permutations 
INFO:root:Observed: -0.0402866 p-value: 0.333333
[... remaining output truncated: the same four log lines repeat for every geneset A / geneset B pair ...]


In [None]:
! pygna test-association-sp biogrid_3.168_filtered.tsv merged.gmt interactome_SP.hdf5 table_within_comparison_sp.csv --number-of-permutations 2 --cores 2
! pygna paint-comparison-matrix table_within_comparison_sp.csv heatmap_within_comparison_sp.png --single-geneset

#### **Diagnostic**

**Distribution plot**

When running a statistical test, one might want to visually assess the null distribution. By passing -d \<diagnostic_folder\>/ on the command line, a distribution plot of the empirical null is shown for each test.

In [None]:
! pygna test-topology-total-degree biogrid_3.168_filtered.tsv merged.gmt diagnostic_total_degree.csv -d "diagnostic/" --number-of-permutations 2 --cores 2

### **Visualisation**

There are four main types of figures currently implemented in PyGNA to visualize GNT and GNA results: bar plots, point plots, heatmaps, and volcano plots.

Barplots are used to plot the GNT results for a single statistic. For each geneset, a red bar represents the observed statistic, whereas a blue one represents the average of the empirical null distribution. Conversely, a dot plot can be used to summarize multiple tests for the same geneset. In order to show all the results in the same figure, the observed values are transformed into absolute normalized z-scores, such that all significant tests have a z-score > 0 and are marked with a red dot.
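A z-score of this kind can be obtained by standardising the observed statistic against the empirical null distribution. The sketch below shows only that basic step; the function `null_zscore` and the numbers are hypothetical, and the exact normalisation PyGNA applies before plotting is not reproduced here.

```python
from statistics import mean, stdev


def null_zscore(observed, null_values):
    """Standardise an observed statistic against its empirical null distribution."""
    return (observed - mean(null_values)) / stdev(null_values)


# Hypothetical observed statistic and null samples from a permutation test
z = null_zscore(5.0, [1.0, 2.0, 3.0])
abs_z = abs(z)
print(z, abs_z)
```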

GNA results can instead be visualised on heatmaps, with the color gradients used to report the strength of association between two genesets. When an all-vs-all test is conducted, a lower triangular matrix is shown, with stars denoting significance. If, instead, an M-vs-N test is conducted, a complete heatmap is shown.

Alternatively, volcano plots can be used to visualize one-vs-many GNA results, for testing a geneset against a large number of datasets (e.g. gene ontologies). The plot shows the normalized z-score on the x-axis and the −log10 of the p-value adjusted to control the False Discovery Rate (FDR) on the y-axis. Significant results are shown with red crosses, whereas non-significant associations are represented by blue dots. The plot can also be annotated to show the top 5 terms.
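The y-axis values of such a plot can be sketched as below. The text does not say which FDR procedure is used, so the Benjamini-Hochberg adjustment here is an assumption, as are the function name and the example p-values.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from a one-vs-many association test
pvalues = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(pvalues)
y_values = [-math.log10(p) for p in adjusted]
print(adjusted)
print(y_values)
```

Smaller adjusted p-values map to larger y values, so the most significant associations end up at the top of the plot.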

In [None]:
! pygna paint-datasets-stats table_topology_module.csv gnt_tm.png #GNT barplot
! pygna paint-summary-gnt dotplt.png #GNT Summary
! pygna paint-comparison-matrix table_association_SP.csv withncomp_sp.pdf #heatmap
! pygna paint-volcano-plot table_association_SP.csv volcno_sp.png #volcanoplot

### **Benchmarking**

#### GNT and GNA benchmarking using SBM 



In [10]:
! pygna generate-gnt-sbm "benchmarking/gnt_sbm.tsv" 'benchmarking/gnt_sbm.gmt'
! pygna generate-gna-sbm "benchmarking/gna_sbm.tsv" 'benchmarking/gna_sbm.gmt'


[50, 50, 50, 50, 50, 50, 700]
INFO:root:Network written on benchmarking/gnt_sbm.tsv
[50, 50, 50, 50, 50, 50, 50, 50, 600]
INFO:root:Network written on benchmarking/gna_sbm.tsv
Generatedbenchmarking/gna_sbm.tsv
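For readers unfamiliar with stochastic block models (SBM), the following sketch generates a small one: nodes are split into blocks, and an edge is drawn with a higher probability inside a block than between blocks. The block sizes, probabilities, and the function `generate_sbm` are assumptions for illustration and are much smaller than the block sizes printed above; this is not the generator PyGNA uses.

```python
import random


def generate_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected SBM: edge probability p_in within blocks, p_out across."""
    rng = random.Random(seed)
    # block_of[i] is the block index of node i
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block_of[u] == block_of[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return block_of, edges


block_of, edges = generate_sbm([5, 5, 20], p_in=0.8, p_out=0.05)
intra = sum(1 for u, v in edges if block_of[u] == block_of[v])
inter = len(edges) - intra
print(len(block_of), intra, inter)
```

Because the blocks are denser inside than outside, a geneset drawn from one block gives a known positive case for a topology test, and a pair of blocks gives a known case for an association test.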


#### GNT and GNA benchmarking using HDN

In [11]:
! pygna generate-hdn-network benchmarking/ hdn_network

INFO:root:Reject=True
INFO:root:Reject=True
INFO:root:Nodes: 1000, in LCC: 999
INFO:root:Reject=True
INFO:root:Nodes: 1000, in LCC: 999
INFO:root:Reject=True
INFO:root:Nodes: 1000, in LCC: 999
INFO:root:Reject=False
INFO:root:Nodes: 1000, in LCC: 1000
INFO:root:Network written on benchmarking/hdn_network_s_0_network.tsv
INFO:root:Reject=True
INFO:root:Reject=False
INFO:root:Nodes: 1000, in LCC: 1000
INFO:root:Network written on benchmarking/hdn_network_s_1_network.tsv
INFO:root:Reject=True
INFO:root:Reject=False
INFO:root:Nodes: 1000, in LCC: 1000
INFO:root:Network written on benchmarking/hdn_network_s_2_network.tsv
INFO:root:Reject=True
INFO:root:Reject=True
INFO:root:Nodes: 1000, in LCC: 999
INFO:root:Reject=False
INFO:root:Nodes: 1000, in LCC: 1000
INFO:root:Network written on benchmarking/hdn_network_s_3_network.tsv
INFO:root:Reject=True
INFO:root:Reject=False
INFO:root:Nodes: 1000, in LCC: 1000
INFO:root:Network written on benchmarking/hdn_network_s_4_network.tsv
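The `Reject` and `in LCC` lines in the log suggest that candidate networks are resampled until, among other possible criteria, the largest connected component (LCC) covers all nodes. That reading is an assumption. The sketch below illustrates such a rejection loop on a plain random graph, which stands in for the HDN generator; all names and parameters are hypothetical.

```python
import random


def largest_component_size(n, edges):
    """Size of the largest connected component of an undirected graph on n nodes."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = set()
    best = 0
    for start in range(n):
        if start in seen:
            continue
        # depth-first traversal of the component containing `start`
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


def sample_connected_network(n, p, seed=0, max_tries=1000):
    """Resample a random graph until every node lies in the largest component."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        edges = [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]
        if largest_component_size(n, edges) == n:
            return edges, attempt
    raise RuntimeError("no connected network found")


edges, attempts = sample_connected_network(30, 0.2)
print(len(edges), attempts)
```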


#### Given the generated network and the node list of HDNs, novel genesets made of mixtures of the two can be generated

Starting from the original network with a number of HDNs, the partial, extended, and branching genesets can then be generated.


**Adding Extended Genesets**

Creates new genesets from the vip list; the number of genesets and the portion of genes can be specified by input. The final new geneset is going to be formed by: percentage ev * HDN_total + ratio * percentage ev * vips total.

In [None]:
!pygna hdn-add-extended input-geneset-file  # Genesets are input to identify

**Adding Partial Genesets**

Creates new genesets from the vip list; the number of genesets and the portion of genes can be specified by input.

In [None]:
!pygna hdn-add-partial input-geneset-file

**Adding Branching Genesets**

Creates new genesets from the vip list; new genesets are created by adding 1-step nodes to the vips. The new genes are created as branches.

In [None]:
!pygna hdn-add-branching input-geneset-file