Commit

manual content
shraddhapai committed Jan 27, 2020
1 parent e4f049e commit d7a9bff
Showing 9 changed files with 558 additions and 0 deletions.
20 changes: 20 additions & 0 deletions docs/Clique_Filtering.md
@@ -0,0 +1,20 @@
# Clique filtering

Clique filtering is a technique for removing "random-like" networks when working with binary similarity and very sparse data. A typical data type is copy number variation, where each patient has only a few events.

*Motivation section TBA*

## Output format of `cliqueFilterNets.R`
A `data.frame` with per-network statistics on clique filtering:

* NETWORK: network name
* orig_pp: number of (+,+) interactions
* orig_rest: number of (+,-) and (-,-) interactions
* ENR: enrichment or bias of (+,+) interactions relative to other interactions, defined as `(orig_pp-orig_rest)/(orig_pp+orig_rest)`. Ranges from -1 (no (+,+) interactions) to +1 (all (+,+) interactions).
* TOTAL_INT: orig_pp+orig_rest (log10-transformed)
* numPerm: number of permutations done in clique filtering
* shuf_mu: mean ENR of permuted networks (i.e. mean null ENR)
* shuf_sigma: standard deviation of ENR of permuted networks (i.e. s.d. of null ENR)
* Z: Z-score of ENR in the real network, relative to the null distribution
* pctl: percentile of ENR in the real network, relative to the null distribution. Also used as the p-value.
* Q: Benjamini-Hochberg corrected p-value
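
The columns above can be illustrated with a small sketch in base R. This is not the code in `cliqueFilterNets.R`: the helper `enr()`, the toy counts, and the simulated null distribution are all assumptions made for illustration. Real permutations would shuffle the patient data rather than draw binomial counts.

```r
# Hypothetical helper (not part of netDx): ENR as defined in the list above.
enr <- function(pp, rest) (pp - rest) / (pp + rest)

set.seed(42)                       # reproducible toy example
orig_pp   <- 30                    # number of (+,+) interactions
orig_rest <- 10                    # number of (+,-) and (-,-) interactions
ENR       <- enr(orig_pp, orig_rest)
TOTAL_INT <- log10(orig_pp + orig_rest)

# Toy null distribution (assumption): simulate the (+,+) count of each
# permuted network instead of actually permuting patient data.
numPerm  <- 100
total    <- orig_pp + orig_rest
shuf_pp  <- rbinom(numPerm, size = total, prob = 0.5)
shuf_enr <- enr(shuf_pp, total - shuf_pp)

shuf_mu    <- mean(shuf_enr)       # mean null ENR
shuf_sigma <- sd(shuf_enr)         # s.d. of null ENR
Z          <- (ENR - shuf_mu) / shuf_sigma

# Benjamini-Hochberg correction across networks (toy p-values).
pvals <- c(net1 = 0.001, net2 = 0.02, net3 = 0.5)
Q     <- p.adjust(pvals, method = "BH")
```

With these toy counts, `ENR` is `(30 - 10) / (30 + 10) = 0.5`, meaning (+,+) interactions outnumber the rest three to one.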
164 changes: 164 additions & 0 deletions docs/Create_PSN.md
@@ -0,0 +1,164 @@
# Create Patient Similarity Networks (or Feature Design)

<a id="overview"></a>
## Overview
Defining meaningful patient similarity networks (or features) is the key to building a high-performing classifier. It is also crucial to using netDx for clinical discovery. When designing patient similarity networks, two considerations are the level at which individual variables should be grouped, and the measure used to define patient similarity.

## Grouping Variables: What makes an input network?
In a given datatype, not all measured variables are equally predictive. Moreover, some groupings of variables may be more informative about mechanism, for example because they draw on prior knowledge. Such groupings are more interpretable, as they reflect a process of clinical or biological relevance. For each datatype, the user needs to decide which level of variable grouping to apply.

This table provides some examples for different types of data. Consider the lung cancer example from the [Introduction](Introduction), and assume we measure: 34 clinical variables; 20,000 genes; 16 metabolites; and genetic mutations from whole-genome sequencing.

<table cellspacing="0" border=1>
<tr>
<th>Type of data</th>
<th>1 variable per network</th>
<th>All variables in one network</th>
<th>Subset of variables per network</th>
</tr>
<tr>
<td style="spec">Clinical data</td>
<td style="">Age or sex or smoking frequency <i>(34 networks)</i></td>
<td style="">Clinical data <i>(1 network)</i></td>
<td style="">A set of 5 variables measuring lung condition <i>(2-27 networks)</i></td>
</tr>
<tr>
<td style="spec">Gene expression</td>
<td style="">Top 10 known high-risk genes, one gene per network <i>(10 networks)</i></td>
<td style="">All gene expression data <i>(1 network)</i></td>
<td style="">Genes grouped by pathways <i>(~2,000 networks)</i></td>
</tr>
<tr>
<td style="spec">Metabolic data</td>
<td style="">3 metabolites identified through other statistical analyses <i>(3 networks)</i></td>
<td style="">All metabolite data <i>(1 network)</i></td>
<td style="">Metabolites grouped by biological process they affect <i>(3-4 networks)</i></td>
</tr>
<tr>
<td style="spec">Genetic data</td>
<td style="">Top genes from GWAS studies <i>(~10 networks)</i></td>
<td style="">All genetic data <i>(1 network)</i></td>
<td style="">Genes grouped by pathways <i>(~2,000 networks)</i></td>
</tr>
</table>
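
The three grouping levels in the table can be sketched as named lists in R, assuming (as the `groupList` and `myGroup` arguments on this page suggest) that a grouping maps each network name to its member variables. All variable, gene, and pathway names below are made up for illustration.

```r
# Toy variable names; illustrative only.
clinVars <- c("age", "sex", "smoking_freq")
genes    <- c("TP53", "KRAS", "EGFR", "ALK")

# One variable per network: each clinical variable forms its own group.
oneVarPerNet <- setNames(as.list(clinVars), clinVars)

# All variables in one network: a single group holding every gene.
allInOne <- list(all_genes = genes)

# Subset of variables per network: genes grouped by (made-up) pathways.
byPathway <- list(
  pathway_A = c("TP53", "KRAS"),
  pathway_B = c("EGFR", "ALK", "KRAS")
)

# Number of networks each design would produce.
numNets <- sapply(
  list(oneVar = oneVarPerNet, allInOne = allInOne, byPathway = byPathway),
  length
)
```

Each list element becomes one network, so the length of the list is the number of networks a design yields. A variable may appear in more than one group, as `KRAS` does here.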

## Choosing a similarity metric
Patient similarity can be measured in different ways for different types of input data. The figure below gives a recommended set of metrics to start with, chosen by the type of data and by the number of variables grouped into a given network.

![sim_metrics.png](./_static/images/Create_PSN/sim_metrics.png)

<a id="summary"> </a>
## Summary table
<table cellspacing="0" border=1>
<tr>
<th>Type of data</th>
<th>Example</th>
<th>Similarity measure</th>
<th>Example call</th>
</tr>
<tr> <th class="spec">Continuous, over 5 vars </th>
<td class="">Gene expression</td>
<td class="">Pearson correlation</td>
<td class="code">makePSN_NamedMatrix(dat,dat_names,
groupList,outDir,<b>writeProfiles=TRUE</b>)</td>
</tr>
<tr> <th class="spec">Continuous, 2-5 vars </th>
<td class="">Gene expression</td>
<td class="">Average normalized difference (custom)</td>
<td class="code">makePSN_NamedMatrix(dat,dat_names,myGroup,
outDir,simMetric="custom",customFunc=normDiff_avg,
sparsify=TRUE)</td>
</tr>
<tr> <th class="spec">Discrete, mutation data</th>
<td class="">Gene mutations</td>
<td class="">Co-occurrence in same unit (e.g. gene or pathway)</td>
<td class="code"><b>makePSN_RangeSets</b>(mutation_GR, pathway_GRList,outDir)</td>
</tr>
</table>

### Defining custom similarity functions
netDx is agnostic to the choice of similarity metric. To provide a user-defined metric, set the `simMetric` argument of `makePSN_NamedMatrix` to `"custom"` and pass the function through `customFunc`. Be aware that the choice of similarity metric could increase false positives or false negatives during feature selection, so build in controls to guard against these.

<a id="pearson"></a>
## Expression data
This case applies to data with more than 5 continuous-valued measures per patient. An example is gene expression data, with ~20,000 measures per patient. Another is proteomic data, with ~20 measures per patient.

Suggested metric: *Pearson correlation*.
This is the default similarity metric for `makePSN_NamedMatrix()` so no special specification is required.

**Note:** Be sure to set `writeProfiles=TRUE`.
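
To make the metric concrete, here is a minimal sketch of patient-by-patient Pearson correlation using base R's `cor()`. It only illustrates the similarity measure itself and is not a description of what `makePSN_NamedMatrix()` does internally. The toy matrix and its names are assumptions, and it follows the layout used elsewhere on this page (variables in rows, patients in columns).

```r
set.seed(1)  # reproducible toy data

# Toy expression matrix: 10 genes (rows) x 4 patients (columns).
dat <- matrix(rnorm(40), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("P", 1:4)))

# cor() correlates columns, so the result is a 4 x 4 patient-by-patient
# similarity matrix with values in [-1, 1].
sim <- cor(dat, method = "pearson")
```

Each entry `sim[i, j]` is the correlation between the expression profiles of patients `i` and `j`. The diagonal is 1 because every patient correlates perfectly with itself.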

<a id="avg_normdiff"></a>
## Fewer than 5 datapoints
Pearson correlation is not a stable measure of similarity when there are very few variables per patient (2-5, as in the summary table above). An alternative similarity measure in such a situation is the **average normalized difference for each variable**.

![avg_normDiff](./_static/images/Create_PSN/avg_normDiff.png)


```{r}
#' Similarity by average of normalized difference
#'
#' @details Similarity measure for network with 2-5 variables.
#' Defined as the average of normalized difference for each of the
#' member variables
#' @param x (matrix) rows are component variables, columns are patients
#' @return (matrix) patient-by-patient similarity, averaged over variables
normDiff_avg <- function(x) {
    # normalized difference for a single variable (1 x N matrix)
    normDiff <- function(x) {
        nm <- colnames(x)
        x <- as.numeric(x)
        n <- length(x)
        rngX <- max(x, na.rm=TRUE) - min(x, na.rm=TRUE)
        out <- matrix(NA, nrow=n, ncol=n)
        # weight between i and j is
        # wt(i,j) = 1 - (abs(x[i]-x[j])/(max(x)-min(x)))
        for (j in 1:n) out[,j] <- 1 - (abs((x - x[j]) / rngX))
        rownames(out) <- nm; colnames(out) <- nm
        out
    }

    # accumulate per-variable similarity, then average over variables
    sim <- matrix(0, nrow=ncol(x), ncol=ncol(x))
    for (k in 1:nrow(x)) {
        tmp <- normDiff(x[k,,drop=FALSE])
        sim <- sim + tmp
        rownames(sim) <- rownames(tmp)
        colnames(sim) <- colnames(tmp)
    }
    sim <- sim / nrow(x)
    sim
}
```

Use in `netDx`: Given
* `dat`: matrix with 5 variables (5xN matrix, where N is the number of patients)
* `dat_names`: vector with variable names
* `myGroup`: named list with the group name as key and the member variables as value,

then the call to create networks would be:
```
makePSN_NamedMatrix(dat,dat_names,myGroup,
outDir,simMetric="custom",customFunc=normDiff_avg,
sparsify=TRUE)
```
**Note:**
* `simMetric="custom"`
* `customFunc` points to the custom function definition
* `writeProfiles=FALSE`
* `sparsify=TRUE` *keep only the strongest edges for efficient memory use*
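
As a worked example of the average normalized difference, the sketch below applies the formula `wt(i,j) = 1 - abs(x[i]-x[j])/(max(x)-min(x))` to a toy matrix. `normDiffSketch` is a hypothetical compact variant written only for this illustration, and the data values are made up. It uses the same layout as above: variables in rows, patients in columns.

```r
# Toy data: 2 variables (rows) x 3 patients (columns); values are made up.
dat <- rbind(age   = c(P1 = 20, P2 = 40, P3 = 60),
             score = c(P1 = 1,  P2 = 1,  P3 = 3))

# Hypothetical compact version of the measure, for illustration only.
normDiffSketch <- function(x) {
  perVar <- lapply(seq_len(nrow(x)), function(k) {
    v <- x[k, ]
    # pairwise |v[i] - v[j]|, scaled by the range of the variable
    1 - abs(outer(v, v, "-")) / (max(v, na.rm = TRUE) - min(v, na.rm = TRUE))
  })
  # average the per-variable similarity matrices
  Reduce("+", perVar) / nrow(x)
}

sim <- normDiffSketch(dat)
```

For patients P1 and P2, the per-variable similarities are `1 - 20/40 = 0.5` for `age` and `1 - 0/2 = 1` for `score`, so `sim["P1", "P2"]` is their average, `0.75`. Note that a variable with identical values for all patients has a range of zero and would produce a division by zero, so such variables should be dropped first.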

<a id="binary_nets"></a>
## Range-based data (genetic mutations)
Creating patient similarity networks from genomic events, such as genetic mutations or DNA copy number polymorphisms, requires a different design.

*This section coming soon - 170927*


<a id="howto_emap"></a>
## Setting up to get an enrichment map




3 changes: 3 additions & 0 deletions docs/Installation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Installation

Detailed install instructions are provided on the [netDx GitHub repo page](https://github.com/BaderLab/netDx/#install-netdx).
7 changes: 7 additions & 0 deletions docs/Interpreting_Output.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Interpreting Output

This page will tell you how to:
* Look at enrichment map data.
* Set cutoffs for consistently high-scoring networks.
* Read the integrated PSN, including computing Dijkstra shortest-path distances to evaluate group separation.

29 changes: 29 additions & 0 deletions docs/Introduction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
# Introduction

[**netDx**](https://github.com/BaderLab/netDx) is a **patient classifier** algorithm that can integrate several types of patient data into a single model. It specializes in the use of genomic data, which is distinct in the number and correlation structure of measures (e.g. 20,000 genes). netDx integrates data by converting each type into a view of patient similarity, i.e. a graph in which more similar patients are tightly linked, while less similar patients are not so tightly linked.

## Motivation

In this example, we try to predict which patients are at high risk for lung cancer. We have four types of data: relevant clinical variables, including smoking frequency; gene expression data; genetic mutations; and metabolomic data. netDx converts the data into 4 views of patient similarity (edge strength
![psn_intro.png](./_static/images/Introduction/psn_intro.png)

In the graphs above, the nodes are patients and the edges are weighted by similarity for that particular datatype. It is evident that the high-risk patients form a strongly interconnected cluster based on smoking frequency (red network) but that the clustering is less evident for gene expression data (green network).

## How netDx works
The conceptual workflow for netDx is shown below. netDx starts with patient data as above. It allows users to define similarity for each of the input datatypes and creates the resulting patient similarity networks. It then uses machine learning to identify which of the input features were predictive for each class. Finally, it uses the predictive features to classify new patients of unknown type.

![workflow.png](./_static/images/Introduction/workflow.png)

An important aspect of the predictor is the score associated with each input feature. This score indicates the frequency with which cross-validation identified a particular network as predictive for a patient label, and is a measure of predictive power. A threshold can be applied to this score; networks that pass it are called "feature-selected".
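
The scoring idea can be sketched with toy data. The exact scoring scheme used by netDx may differ. Here the score is simply assumed to be the number of cross-validation rounds in which a network was picked, and the network names, the `selected` matrix, and the cutoff are all hypothetical.

```r
# Toy matrix: rows are networks, columns are cross-validation rounds;
# TRUE means the network was identified as predictive in that round.
selected <- rbind(
  smoking    = c(TRUE, TRUE,  TRUE, TRUE,  TRUE),
  expression = c(TRUE, FALSE, TRUE, FALSE, FALSE),
  metabolite = c(TRUE, TRUE,  TRUE, FALSE, TRUE)
)

# Assumed score: number of rounds in which each network was picked.
netScore <- rowSums(selected)

# Hypothetical cutoff: keep networks picked in at least 4 of 5 rounds.
cutoff <- 4
featureSelected <- names(netScore)[netScore >= cutoff]
```

In this toy case the `smoking` and `metabolite` networks pass the cutoff, while `expression`, picked in only 2 of 5 rounds, does not. A stricter cutoff gives fewer but more consistently predictive networks.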

## Output
netDx broadly has two purposes. First, it serves as a classifier that can integrate heterogeneous datatypes. Second, it serves as a tool for clinical discovery and research, as identified features may provide mechanistic insight into the condition under study or identify new biomarkers.

netDx therefore provides several types of output that allow the user to examine the nature of the predictor:
* **Predicted labels** for test patients. If [nested cross-validation](Predictor_Designs.md#nestedcv) is used, labels for all iterations are provided, along with individual-level classification accuracy.
* **Summary network scores:** Network-level scores for all cross-validation folds. Applying a cutoff to these scores yields the "feature-selected" networks.
* Detailed output: All **[intermediate results](Output_Files.md)**, showing network rankings across cross-validation
* An **overall patient similarity network** created by integrating feature-selected networks
* Where applicable, a network visualization of selected features (also called an EnrichmentMap) is generated. This view shows the major themes present in feature-selected variables.

![outputs.png](./_static/images/Introduction/outputs.png)
20 changes: 20 additions & 0 deletions docs/Key_Concepts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# Concepts used in netDx

*In progress: 170905*

<a name="psn"></a>
### Patient similarity network
<a name="emap"></a>
### Enrichment map
<a name="feat"></a>
### Feature
<a name="featsel"></a>
### Feature selection
<a name="gm"></a>
### GeneMANIA
An algorithm that integrates similarity networks and ranks patients by similarity to a query (e.g. "rank patients by similarity to those at high risk for lung cancer").
**Cite original papers**
<a name="nestedcv"></a>
### Nested cross-validation

