# Pathway analysis with scBONITA


## To perform pathway analysis, scBONITA uses the rules generated in Step 2. 

In addition, scBONITA requires:

* a **metadata file** specifying the treatments/experimental variables for each cell and 
* a **contrast file** specifying the pairs of treatments to be compared.


## The pathway analysis script has the following arguments:

* **dataFile** 
    
    Specify the name of the file containing processed scRNA-seq data


* **conditions**
    
    Specify the name of the file containing cell-wise condition labels, i.e., metadata for each cell in the training dataset. The columns are condition variables and the rows are cells. The first column must contain cell names that correspond to the columns of the training data file. The column names must include the conditions named in the contrast file (see the `--help` entry for **contrast** for more information).


* **contrast**
    
    A text file in which each line contains the two conditions (corresponding to column labels in the conditions file) that are to be compared during pathway analysis.


* **conditions_separator**
    
    Separator for the conditions file. Must be one of , (comma), \s (space) or \t (tab).


* **contrast_separator**
    
    Separator for the contrast file. Must be one of , (comma), \s (space) or \t (tab).
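The requirements on the conditions and contrast files can be checked before running the analysis. The following is a minimal sketch and not part of scBONITA: the helper `check_inputs` is hypothetical, the toy values are made up, and it assumes that the data file has genes as rows (gene names in the first column) and that condition membership is coded as 0/1.

```python
import io
import pandas as pd

def check_inputs(data, conditions, contrasts):
    """Return a list of problems found; an empty list means the inputs agree.

    data       : DataFrame with genes as rows and cells as columns
    conditions : DataFrame whose first column holds the cell names
    contrasts  : list of [condition_a, condition_b] pairs
    """
    problems = []
    # Every cell (column) in the data should have a row in the conditions file.
    cell_names = set(conditions.iloc[:, 0].astype(str))
    missing_cells = set(map(str, data.columns)) - cell_names
    if missing_cells:
        problems.append(f"{len(missing_cells)} cell(s) in the data have no row in the conditions file")
    # Every condition named in a contrast should be a column of the conditions file.
    condition_columns = set(conditions.columns[1:])
    for pair in contrasts:
        if len(pair) != 2:
            problems.append(f"contrast {pair} does not name exactly two conditions")
            continue
        for name in pair:
            if name not in condition_columns:
                problems.append(f"condition '{name}' is not a column of the conditions file")
    return problems

# Toy inputs (made-up values) standing in for the three files.
data = pd.read_csv(io.StringIO("gene,cell1,cell2\nGENE_A,0,1\nGENE_B,1,1\n"), index_col=0)
conditions = pd.read_csv(io.StringIO("cell,control,treatment\ncell1,1,0\ncell2,0,1\n"), sep=",")
contrasts = [line.split("\t") for line in "control\ttreatment".splitlines()]

print(check_inputs(data, conditions, contrasts))
```

With real files, the `io.StringIO(...)` arguments would be replaced by the file paths, and `sep` by the separators passed as `--conditions_separator` and `--contrast_separator`.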



## Example usage with the provided example files in the `data` folder:

> `python3.6 pathwayAnalysis.py --dataFile "data/trainingData.csv" --conditions "data/conditions.txt" --contrast "data/contrast.txt" --conditions_separator "," --contrast_separator "\t"`

## Output files from scBONITA Pathway Analysis

1. A comma-separated (CSV) file named as

> **pvalues_ + contrast[0] + _vs_ + contrast[1] + .csv**

For example, if the conditions to be compared (as specified in the contrast file) are 'control' and 'treatment', the output file of scBONITA pathway analysis will be:

> **pvalues_control_vs_treatment.csv**
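As a small illustration of this naming scheme, the expected file name for a contrast can be built as follows. The helper `pvalues_filename` is hypothetical and simply mirrors the example above; it is not an scBONITA function.

```python
def pvalues_filename(contrast):
    """Build the expected p-values file name for one contrast (a pair of condition names)."""
    first, second = contrast
    return f"pvalues_{first}_vs_{second}.csv"

print(pvalues_filename(["control", "treatment"]))  # pvalues_control_vs_treatment.csv
```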

2. For each network, a file ending in **_importanceScores.csv**

This file contains a table with the following columns:

* **Node**: Gene name
* **Importance Score**: Importance score for the node in the network, calculated using the provided training dataset
* **ObsERS**: Observed size of the equivalent rule set or ERS. This is the number of possible equally valid Boolean rules for this node.
* **MaxERS**: Maximum possible size of the ERS for that node. This is 2^(2^n - 1) - 1, where n is the number of incoming edges for the node in the network: 7 for n = 2 and 127 for n = 3, the values used in the plots below.

3. For each network, a GraphML file ending in "_IS"
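The columns of the importance score table can be combined in a few lines of pandas. The sketch below uses a toy table with made-up values in the column layout described above; the `ERS fraction` column is an illustrative derived quantity, not something scBONITA writes out.

```python
import pandas as pd

# Toy table with made-up values in the documented column layout.
importanceScores = pd.DataFrame({
    "Node": ["GENE_A", "GENE_B", "GENE_C"],
    "Importance Score": [0.9, 0.2, 0.5],
    "ObsERS": [1, 7, 40],
    "MaxERS": [1, 7, 127],
})

# Fraction of the possible rules that remain equally valid for each node
# (1.0 means no candidate rule could be excluded).
importanceScores["ERS fraction"] = importanceScores["ObsERS"] / importanceScores["MaxERS"]

# Rank nodes from most to least important.
ranked = importanceScores.sort_values("Importance Score", ascending=False)
print(ranked[["Node", "Importance Score", "ERS fraction"]])
```

A low fraction suggests that the training data narrowed down the rule for that node, while a fraction of 1.0 suggests it did not.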

## View and analyze the output of scBONITA

#### Load required packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

#### Example pvalues file

In [None]:
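pvalues = pd.read_csv("pvalues_control_vs_treatment.csv")  # file name from the example above; adjust the path for your contrast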
pvalues.head()

#### Make bubbleplots using generated pvalues

In [None]:
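# makeBubblePlots is not defined in this notebook; it must already be defined or imported in the session.
makeBubblePlots(pvalues=pvalues,  # the p-values table loaded above
                adjPValueThreshold=0.05,
                wrap=25,
                height=8,
                width=10,
                palette="colorblind",
                saveAsPDF=True,
                outputFile="example_PA_bubbleplot.pdf")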

#### Example importance scores file

In [None]:
importanceScores = pd.read_csv("<network>_importanceScores.csv")  # replace <network> with the name of your network
importanceScores.head()

#### Plot the equivalent rule set sizes for this network

In [None]:
importanceScores.loc[importanceScores["MaxERS"] == 127].hist(column='ObsERS')
plt.xlabel('Observed ERS')
plt.ylabel('Frequency')
plt.title('ERS of nodes with in-degree >= 3')
plt.show()
plt.clf()

In [None]:
importanceScores.loc[importanceScores["MaxERS"] == 7].hist(column='ObsERS')
plt.xlabel('Observed ERS')
plt.ylabel('Frequency')
plt.title('ERS of nodes with in-degree = 2')
plt.show()
plt.clf()

#### Visualize this network in external software such as Cytoscape or Gephi

In [None]:
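graph = nx.read_graphml("path/to/network_IS.graphml")  # placeholder path; point this at one of the GraphML files described above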

In [None]:
pd.DataFrame.from_dict(dict(graph.nodes(data=True)), orient='index')

In [None]:
nx.to_pandas_edgelist(graph)  # one row per edge, with edge attributes as columns