<a href="https://colab.research.google.com/github/Bix4UMD/BIOI611_lab/blob/main/docs/BIOI611_DAVID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DAVID Tool Suite Overview

## Introduction

The original version of DAVID introduced a tool suite designed primarily for batch gene annotation and GO term enrichment analysis, enabling researchers to identify the most relevant biological processes associated with a given gene list. While the core enrichment algorithm has remained consistent across all versions, the annotation coverage has significantly expanded. Initially limited to GO terms, DAVID now supports a wide range of annotation categories, including:

* Gene Ontology (GO) terms

* Protein–protein interactions

* Protein functional domains

* Disease associations

* Biological pathways (e.g., KEGG, BioCarta)

* Sequence features

* Functional summaries

* Tissue expression

* Literature references, and more

This expanded coverage allows researchers to explore their gene lists from multiple biological perspectives — all within a single platform. Results can be viewed as individual annotation chart reports or as combined summary reports, offering flexibility depending on the analysis goals.

A notable feature of DAVID is its ability to accept custom gene backgrounds, which is rarely available in other web-based enrichment tools. This option enables more tailored and accurate analyses, particularly when comparing against a relevant experimental or platform-specific background rather than the whole genome.

## A Typical Analysis Flow

Load Gene List → View Summary Page → Explore details through Chart Report, Table Report, Clustering Report, etc. → Export and Save Results.

## Fisher’s Exact Test in DAVID

When observations from two independent groups fall into two mutually exclusive categories, Fisher’s Exact Test can be used to determine whether the proportions differ between the groups. In DAVID, this test is applied to assess gene enrichment in annotation terms.

The p-value is computed by summing the probabilities of all contingency tables that are as extreme as or more extreme than the observed table:

$$p = \sum_{T \in A} P(T)$$

where $A$ is the set of those tables and $P(T)$ is the probability of table $T$.

For a 2 × 2 contingency table, the one-sided p-value is based on the frequency in the (1,1) cell (first row, first column), denoted as n₁₁. Under the right-sided alternative hypothesis, set A includes all tables where the (1,1) cell frequency is greater than or equal to n₁₁.

A small right-sided p-value indicates that the observed frequency in the (1,1) cell is larger than would be expected under the null hypothesis of independence, suggesting an association between the row and column variables.
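
For reference, the right-sided p-value can be written out explicitly. With the row and column totals held fixed, the (1,1) cell count follows a hypergeometric distribution under the null hypothesis. The symbols $X$ (the random (1,1) cell count), $n_{1+}$ and $n_{2+}$ (row totals), $n_{+1}$ (first column total), and $n$ (grand total) are notation introduced here for illustration:

$$P(X = k) = \frac{\binom{n_{1+}}{k}\,\binom{n_{2+}}{n_{+1} - k}}{\binom{n}{n_{+1}}}, \qquad p_{\text{right}} = \sum_{k = n_{11}}^{\min(n_{1+},\, n_{+1})} P(X = k)$$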

## Hypothetical Example

Consider the human genome as the background, containing 30,000 genes: this is the Population Total (PT). Among these genes, 40 are known to participate in the p53 signaling pathway, referred to as the Population Hits (PH).

Now assume that, in your experimental gene list of 300 genes — the List Total (LT) — three genes (List Hits, LH) are found to be associated with the p53 signaling pathway.

The question is:

➡️ Is the proportion 3/300 in our list significantly higher than the background proportion of 40/30,000?
In other words, is this enrichment more than what would be expected by random chance?

Fisher’s Exact Test is used in DAVID to statistically evaluate whether the observed enrichment of p53-related genes in the list is significant compared with the genomic background.
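
Before running the test, it can help to see how large the enrichment is in plain numbers. The sketch below computes the two proportions, the number of pathway genes expected in the list by chance, and the fold enrichment. The variable names `list_prop`, `background_prop`, `expected_hits`, and `fold_enrichment` are chosen here for illustration.

```r
# Values from the hypothetical example
LH <- 3      # List Hits
LT <- 300    # List Total
PH <- 40     # Population Hits
PT <- 30000  # Population Total

list_prop       <- LH / LT                      # proportion of pathway genes in the list
background_prop <- PH / PT                      # proportion of pathway genes in the background
expected_hits   <- LT * PH / PT                 # hits expected in the list by chance
fold_enrichment <- list_prop / background_prop  # how many times higher than background

cat(sprintf("List proportion:       %.4f\n", list_prop))
cat(sprintf("Background proportion: %.4f\n", background_prop))
cat(sprintf("Expected hits:         %.1f\n", expected_hits))
cat(sprintf("Fold enrichment:       %.1f\n", fold_enrichment))
```

The list contains 3 pathway genes where about 0.4 would be expected by chance, a 7.5-fold enrichment. Whether that difference is statistically significant is the question the test answers.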

<img width="406" height="164" alt="Image" src="https://github.com/user-attachments/assets/9eeefb7e-5246-4f75-a96c-e50683a7e728" />

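In [None]:
LH <- 3      # List Hits: genes in the list that belong to the pathway
LT <- 300    # List Total
PH <- 40     # Population Hits: genes in the background that belong to the pathway
PT <- 30000  # Population Total

# Contingency table for Fisher's test
# Rows:    genes in the list / remaining background genes
# Columns: in the pathway / not in the pathway
table_fisher <- matrix(c(LH,
                         LT - LH,
                         PH - LH,
                         PT - LT - (PH - LH)),
                       nrow = 2,
                       byrow = TRUE)

# Fisher's Exact Test (right-sided)
fisher.test(table_fisher, alternative = "greater")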




	Fisher's Exact Test for Count Data

data:  table_fisher
p-value = 0.007443
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 2.105361      Inf
sample estimates:
odds ratio 
  8.096268 
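
The right-sided p-value is the upper tail of a hypergeometric distribution: the probability of drawing at least LH pathway genes when LT genes are sampled at random from the background. As a cross-check, the sketch below computes that tail directly with `phyper()` and `dhyper()` and compares it with `fisher.test()`. The names `p_tail`, `p_sum`, and `p_fisher` are illustrative.

```r
LH <- 3; LT <- 300; PH <- 40; PT <- 30000

# P(X >= LH): X is the number of pathway genes among LT genes drawn
# at random from PT genes, of which PH belong to the pathway
p_tail <- phyper(LH - 1, PH, PT - PH, LT, lower.tail = FALSE)

# The same tail as an explicit sum over all tables at least as extreme
p_sum <- sum(dhyper(LH:min(LT, PH), PH, PT - PH, LT))

# Reference value from fisher.test() on the same 2 x 2 table
tab <- matrix(c(LH, LT - LH,
                PH - LH, PT - LT - (PH - LH)),
              nrow = 2, byrow = TRUE)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(phyper = p_tail, dhyper_sum = p_sum, fisher = p_fisher))
all.equal(p_tail, p_fisher)
```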


## What About the EASE Score? (Modified Fisher’s Test)

The EASE Score is a more conservative variant of Fisher’s Exact Test used by DAVID.

It works by subtracting one gene from the List Hits (LH) before computing the p-value.

Why subtract one?

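It penalizes weak evidence, especially when enrichment is supported by only one or a few genes.

This reduces false positives among terms supported by very few genes. In the example here, the p-value rises from 0.0074 (Fisher’s Exact Test) to 0.061 (EASE Score), so the term is no longer significant at the 0.05 level.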

<img width="406" height="164" alt="Image" src="https://github.com/user-attachments/assets/891a4490-b2a4-4fc3-875c-2ceeee6eb2a8" />


In [None]:
# Adjusted List Hits for EASE Score
LH_ease <- LH - 1

# Contingency table for EASE Score
table_ease <- matrix(c(LH_ease,
                       LT - LH_ease,
                       PH - LH_ease,
                       PT - LT - (PH - LH_ease)),
                     nrow = 2,
                     byrow = TRUE)

# EASE Score = Modified Fisher’s Exact Test
fisher.test(table_ease, alternative = "greater")



	Fisher's Exact Test for Count Data

data:  table_ease
p-value = 0.06063
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.8956195       Inf
sample estimates:
odds ratio 
  5.238625 
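
Because the adjusted table keeps the same row and column totals, the EASE p-value computed above equals the hypergeometric probability of seeing at least LH - 1 pathway genes in the list. The sketch below wraps both tests as functions. `fisher_right` and `ease_score` are hypothetical helper names, and the code mirrors the table construction used in this notebook; it is not DAVID's internal implementation.

```r
# Right-sided Fisher p-value: P(X >= LH)
fisher_right <- function(LH, LT, PH, PT) {
  phyper(LH - 1, PH, PT - PH, LT, lower.tail = FALSE)
}

# EASE Score under this notebook's construction: P(X >= LH - 1)
ease_score <- function(LH, LT, PH, PT) {
  phyper(LH - 2, PH, PT - PH, LT, lower.tail = FALSE)
}

p_fisher <- fisher_right(3, 300, 40, 30000)
p_ease   <- ease_score(3, 300, 40, 30000)
print(c(fisher = p_fisher, ease = p_ease))

# A term supported by a single gene can never pass the EASE filter
ease_single <- ease_score(1, 300, 40, 30000)
print(ease_single)
```

Under this construction a term hit by only one gene (LH = 1) always receives an EASE Score of 1, which shows how the adjustment suppresses single-gene hits.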


## Functional Annotation Summary

<img width="730" height="360" alt="Image" src="https://github.com/user-attachments/assets/2067a7d9-ca83-4016-84de-77d28c34e2be" />

## Functional Annotation Chart Report

<img width="730" height="508" alt="Image" src="https://github.com/user-attachments/assets/cd82e373-0d43-4a6b-9d0a-1499d931c6fd" />



## Functional Annotation Clustering


Because many annotation terms are biologically related, the Functional Annotation Chart often shows multiple similar annotations repeatedly. This redundancy can make it difficult to focus on the key biological themes.

To address this issue, DAVID provides the Functional Annotation Clustering feature. Instead of listing terms individually, this tool groups similar annotations together, allowing researchers to interpret results more clearly and at a higher biological level than with a traditional chart report.

The clustering method is based on the principle that similar annotation terms tend to share similar gene members. DAVID uses:

* Kappa statistics to quantify the similarity between annotation terms by measuring the degree of shared genes.

* Fuzzy heuristic clustering (previously used in the Gene Functional Classification Tool) to group annotation terms according to their Kappa values.

In essence, the more genes two annotations share, the more likely they are to be placed in the same cluster.
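
To make the Kappa idea concrete, the sketch below computes Cohen's kappa for two annotation terms, each represented as a 0/1 membership vector over the same genes. It assumes that the Kappa statistic here is Cohen's kappa on binary gene membership. The function name `kappa_terms` and the eight-gene example are hypothetical.

```r
# Cohen's kappa between two annotation terms.
# term_a, term_b: 0/1 vectors over the same genes (1 = gene has the term)
kappa_terms <- function(term_a, term_b) {
  stopifnot(length(term_a) == length(term_b))
  observed <- mean(term_a == term_b)  # observed agreement
  p_a <- mean(term_a)                 # fraction of genes annotated with term A
  p_b <- mean(term_b)                 # fraction of genes annotated with term B
  expected <- p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
  (observed - expected) / (1 - expected)
}

# Hypothetical example: 8 genes, each term covers 3 genes, 2 genes are shared
term_a <- c(1, 1, 1, 0, 0, 0, 0, 0)
term_b <- c(1, 1, 0, 1, 0, 0, 0, 0)
k <- kappa_terms(term_a, term_b)
print(k)
```

A kappa of 1 means the two terms annotate exactly the same genes, and a value near 0 means they overlap no more than expected by chance. The example gives 7/15, about 0.47.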

### Interpretation of Results

* The p-values for each annotation term within a cluster are identical to those shown in the regular Functional Annotation Chart (e.g., Fisher’s Exact Test / EASE Score).

* Each annotation cluster is assigned a Group Enrichment Score: the geometric mean of the member p-values, expressed on the -log₁₀ scale (equivalently, the arithmetic mean of the members' -log₁₀ p-values).

* A higher Enrichment Score means that the member terms of the cluster have lower p-values overall, that is, the group as a whole is more strongly enriched.
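
The Group Enrichment Score can be sketched in a few lines. The function name `group_enrichment_score` and the three p-values are hypothetical; the calculation follows the definition given in the list above.

```r
# Group Enrichment Score: geometric mean of member p-values on the -log10 scale
group_enrichment_score <- function(p_values) {
  mean(-log10(p_values))
}

# Hypothetical cluster with three member terms
p <- c(1e-2, 1e-4, 1e-6)
score <- group_enrichment_score(p)

# Same value via the geometric mean of the p-values
geo_mean <- exp(mean(log(p)))
score_check <- -log10(geo_mean)

print(c(score = score, via_geometric_mean = score_check))
```

Here the geometric mean of the p-values is 1e-4, so the score is 4. A score of about 1.3 corresponds to a geometric-mean p-value of 0.05.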

<img width="730" height="716" alt="Image" src="https://github.com/user-attachments/assets/b125529b-49e7-42a8-a425-bdc20dd5f6c2" />

## Reference

https://davidbioinformatics.nih.gov/helps/functional_annotation.html

