# W4 practical

## 📝 Learning goals of this practical

- You can describe how GO enrichment is applied in combination with clustering

- You can list how GO enrichment can be used to propose new experiments

## Setup

In [None]:
%pip install pyvis goatools

In [None]:
import sys
if "google.colab" in sys.modules:
    %pip install git+https://github.com/CropXR/EduXR.git
else:
    %load_ext autoreload
    %autoreload 2
from dsplantbreeding.clustering import read_and_cluster_expression_data, show_genes_per_cluster
from goatools.obo_parser import GODag
from goatools.anno.gaf_reader import GafReader
from dsplantbreeding.go_enrichment import show_annotations_of_gene, perform_go_enrichment

In [None]:
# Expression data we looked at in week 2
!wget https://raw.githubusercontent.com/CropXR/EduXR/refs/heads/main/data/biotic_transcriptomics.txt
# Gene Ontology annotations
!wget https://raw.githubusercontent.com/CropXR/EduXR/refs/heads/main/data/go-basic.obo
!wget https://raw.githubusercontent.com/CropXR/EduXR/refs/heads/main/data/tair.gaf

### ❓Questions

- What does this dataset represent? (It's the same one we looked at in week 2.)
- Let's look at the clustering again. What number of clusters would you say is appropriate? You can change both the number of clusters and the linkage method (e.g. `'complete'`, `'average'`, or `'single'`).
- Which of these clusters shows an expression pattern that looks promising for studying resistance to this biotic stress?

In [None]:
# Starting point: change n_clusters and linkage_method to explore the clustering
clustered_df = read_and_cluster_expression_data('biotic_transcriptomics.txt', n_clusters=1, linkage_method='complete')
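
To make the effect of the linkage method more concrete, here is a minimal sketch of agglomerative clustering in plain Python. It is not the implementation behind `read_and_cluster_expression_data`; the function names `cluster_distance` and `agglomerative` and the toy `profiles` are assumptions made up for illustration. The only thing that changes between linkage methods is how the distance between two clusters is computed.

```python
from itertools import combinations
from math import dist


def cluster_distance(a, b, linkage):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [dist(p, q) for p in a for q in b]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, linkage):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: one expression value per gene (real profiles have many values)
profiles = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (7.0,)]

for method in ("single", "complete", "average"):
    print(method, agglomerative(profiles, n_clusters=2, linkage=method))
```

On this toy data, single linkage chains the first five genes together and leaves the most distant gene on its own, while complete linkage produces two more balanced groups. This is why it is worth trying more than one linkage method before interpreting the clusters.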

Now we can see the genes that are in each cluster:

In [None]:
show_genes_per_cluster(clustered_df)

But how do we find out more about these genes than just their names?

Annotation data is available for Arabidopsis genes. Let's load this data.

In [None]:
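# Load the GO hierarchy (the terms and the relationships between them)
godag = GODag("go-basic.obo")
# Load the Arabidopsis (TAIR) gene-to-GO-term annotations
tair_gaf = GafReader("tair.gaf")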

Pick one of the genes from the list above, and let's see what's known about it. Since GO is an 'ontology', we can plot the hierarchy of this gene's annotations as well.

In [None]:
show_annotations_of_gene("GENE_ID_HERE", tair_gaf, godag)
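
To illustrate what "hierarchy" means here, the sketch below builds a tiny toy ontology and walks it upward. The term names and the `toy_parents` dictionary are hypothetical and greatly simplified; they are not read from `go-basic.obo`, and the helper functions `ancestors` and `depth` are assumptions for illustration, not part of goatools or `dsplantbreeding`.

```python
# Toy ontology: each term maps to its direct parents (hypothetical, simplified terms)
toy_parents = {
    "root_process": [],
    "response_to_stimulus": ["root_process"],
    "response_to_stress": ["response_to_stimulus"],
    "response_to_biotic_stimulus": ["response_to_stimulus"],
    "defense_response": ["response_to_stress"],
    "defense_response_to_fungus": ["defense_response", "response_to_biotic_stimulus"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def depth(term, parents):
    """Longest path from the term up to a root: a rough proxy for specificity."""
    if not parents[term]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[term])


term = "defense_response_to_fungus"
print(sorted(ancestors(term, toy_parents)))
print(depth(term, toy_parents), depth("response_to_stimulus", toy_parents))
```

A gene annotated with a specific term is implicitly also annotated with all of that term's ancestors, and a term can have more than one parent. Terms deeper in the hierarchy tend to say more about what a gene does than terms close to the root.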

### ❓Questions
- What functional annotations does your gene have? Is each GO term equally specific?
- What is a disadvantage of looking only at annotations of a single gene? Do you think all the GO terms you find for your gene make sense?


Now let's do an enrichment study on the cluster you are interested in.

In [None]:
perform_go_enrichment(clustered_df, cluster_id=WHICH_CLUSTER, tair_gaf=tair_gaf, godag=godag)

In the resulting plot, GO terms are colored by p-value:
- p-value < 0.005 (light red)
- p-value < 0.01 (light orange)
- p-value < 0.05 (yellow)
- p-value > 0.05 (grey): study terms that are not statistically significant
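
As a rough sketch of where such a p-value can come from, the example below computes a one-sided hypergeometric probability: the chance of seeing at least the observed number of annotated genes in a cluster if the cluster had been drawn at random from all genes. The function names and the gene counts are assumptions for illustration. `perform_go_enrichment` may compute its p-values differently (for example, with a correction for multiple testing), so treat this as a way to build intuition only.

```python
from math import comb


def enrichment_pvalue(study_hits, study_size, pop_hits, pop_size):
    """P(X >= study_hits) when drawing study_size genes at random from the population."""
    upper = min(study_size, pop_hits)
    total = comb(pop_size, study_size)
    tail = sum(
        comb(pop_hits, k) * comb(pop_size - pop_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total


def legend_color(pval):
    """Map a p-value to the color categories listed in the legend above."""
    if pval < 0.005:
        return "light red"
    if pval < 0.01:
        return "light orange"
    if pval < 0.05:
        return "yellow"
    return "grey"


# Hypothetical numbers: 1000 genes in total, 50 of them annotated with some GO term.
# A cluster of 40 genes would contain about 2 annotated genes by chance.
for hits in (2, 10):
    p = enrichment_pvalue(study_hits=hits, study_size=40, pop_hits=50, pop_size=1000)
    print(hits, p, legend_color(p))
```

With these made-up numbers, finding 2 annotated genes in the cluster is unremarkable, while finding 10 would be very unlikely by chance alone and would fall in the light red category.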

### ❓Questions
- Do you think the enriched GO terms of the cluster make sense, given the cluster's expression profile in the expression matrix?
- What is the goal of GO enrichment analysis?
- How would you use these enriched GO terms to inform subsequent experiments? For example, what would you like to measure?
- Can you use this methodology to explain what seems to cause the gene expression changes in the plant at 1 hour after infection?

If you have extra time, you can change the clustering and/or run GO enrichment on other clusters to see whether you can gain more insight from this data.