<a href="https://colab.research.google.com/github/LACDR-CDS/SCDR_RNAseq/blob/main/Session3and4/group7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 3 & 4
In the last session, you learned the basic processing steps for RNA sequencing data. Now you will carry out the next steps in the analysis of transcriptomics data to investigate the transcriptional impact of a GATA4 mutation on human cardiomyocyte differentiation.

## Background
Mutations in **GATA4**, a key transcription factor in heart development, are linked to **congenital heart defects** and **cardiomyopathy**.  
To investigate the molecular basis, you will compare **isogenic wildtype** and **GATA4-G296S mutant** cells at different stages of cardiomyocyte differentiation using RNA-seq.  

## Objectives

1. Explore whether the GATA4 mutation alters global gene expression patterns (PCA).  
2. Identify differentially expressed genes (DEGs) between mutant and WT at each stage.  
3. Visualize DEGs with volcano plots and heatmaps.  
4. Perform GO enrichment to determine affected biological pathways.  
5. Interpret whether the mutation leads to loss of cardiomyocyte identity and/or gain of non-cardiac programs.  

## Setup

Run the following cells to set up the necessary packages and download the data. If you wish to use a package which is not in the list below, you will need to install and import it yourself.

In [None]:
#Install packages which are not in the default environment
%pip install scanpy
%pip install pydeseq2
%pip install gseapy

In [None]:
#Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scanpy as sc
import anndata as ad
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
import pickle
import gseapy
import os
import warnings
warnings.filterwarnings("ignore")

In [None]:
group_number = 7

In [None]:
#Make data directory if it does not exist
os.makedirs("data", exist_ok=True)
os.makedirs("plots", exist_ok=True)

#Download datasets in the data folder
!wget https://raw.githubusercontent.com/LACDR-CDS/SCDR_Bioinformatics_Practical/refs/heads/main/Session3and4/data/group{group_number}_counts.txt -O data/group{group_number}_counts.txt
!wget https://raw.githubusercontent.com/LACDR-CDS/SCDR_Bioinformatics_Practical/refs/heads/main/Session3and4/data/group{group_number}_metadata.csv -O data/group{group_number}_metadata.csv

## Data import
Read the count matrix.
- How many samples and genes do you have?

<details>
<summary>💡 Show solution</summary>
    
There are 4 samples (2 wildtype, 2 mutant) and 42833 genes.
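A minimal sketch of reading a count matrix with pandas. The layout (tab-separated, genes as rows, one column per sample) and the toy values are assumptions for illustration, so check the real file before relying on them. For the downloaded data you would pass the file path instead of the in-memory buffer.

```python
import io
import pandas as pd

# Toy stand-in for the counts file; layout and values are assumed, not real data.
toy_counts_txt = (
    "gene\tWT_1\tWT_2\tMUT_1\tMUT_2\n"
    "GENE_A\t120\t98\t110\t105\n"
    "GENE_B\t300\t280\t40\t35\n"
    "GENE_C\t5\t8\t210\t190\n"
)

# sep="\t" reads tab-separated text; index_col=0 uses the gene names as row labels.
counts = pd.read_csv(io.StringIO(toy_counts_txt), sep="\t", index_col=0)

# Rows are genes and columns are samples in this layout.
n_genes, n_samples = counts.shape
print(f"{n_samples} samples, {n_genes} genes")
```

If your table turns out to have samples as rows, `counts.T` swaps the orientation.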

Read the corresponding metadata table.
- Which differentiation day do the samples belong to?

<details>
<summary>💡 Show solution</summary>
    
The samples belong to differentiation day 7.
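One way to inspect the metadata is sketched below. The column names `genotype` and `day` are hypothetical, and the toy table only mimics the kind of information you would expect, so adapt the names to what the real file contains.

```python
import io
import pandas as pd

# Toy stand-in for the metadata file; column names and values are assumed.
toy_metadata_csv = (
    "sample,genotype,day\n"
    "WT_1,wildtype,7\n"
    "WT_2,wildtype,7\n"
    "MUT_1,mutant,7\n"
    "MUT_2,mutant,7\n"
)

# index_col=0 uses the sample names as row labels, matching the count matrix columns.
metadata = pd.read_csv(io.StringIO(toy_metadata_csv), index_col=0)

# Which differentiation day(s) are present, and how many samples per genotype?
days = metadata["day"].unique()
samples_per_genotype = metadata["genotype"].value_counts()
print(days)
print(samples_per_genotype)
```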

## Filtering
Filter the data to remove genes with fewer than 10 reads in total across all samples.
- How many genes are left in the count table after filtering?

<details>
<summary>💡 Show solution</summary>
    
There are 21074 genes left after filtering.
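The filter can be sketched as follows, assuming genes are rows and samples are columns. The toy counts and gene names are made up for illustration.

```python
import pandas as pd

# Toy count matrix (genes x samples); values are illustrative only.
counts = pd.DataFrame(
    {
        "WT_1": [120, 0, 3],
        "WT_2": [98, 1, 2],
        "MUT_1": [110, 0, 2],
        "MUT_2": [105, 2, 2],
    },
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Total reads per gene, summed over all samples.
total_reads = counts.sum(axis=1)

# Keep genes with at least 10 reads in total (removes genes with fewer than 10).
filtered_counts = counts.loc[total_reads >= 10]
print(f"{filtered_counts.shape[0]} genes left after filtering")
```

Here GENE_B (3 reads) and GENE_C (9 reads) are dropped, while GENE_A is kept.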

## Normalization: counts per million
Perform counts-per-million (CPM) normalization to account for differences in sequencing depth among samples.
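As a small worked example, counts per million divides each count by the total reads of its sample and multiplies by one million. The sketch assumes genes as rows and samples as columns, and the numbers are invented.

```python
import pandas as pd

# Toy count matrix (genes x samples); values are illustrative only.
counts = pd.DataFrame(
    {"WT_1": [100, 300, 600], "MUT_1": [400, 600, 1000]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Library size = total number of reads per sample (column sums).
library_sizes = counts.sum(axis=0)

# Divide every column by its library size, then scale to one million reads.
cpm = counts.div(library_sizes, axis=1) * 1e6
print(cpm)
```

After normalization every sample sums to one million, so samples sequenced at different depths become comparable.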

## Principal component analysis (PCA)

Now explore your data with PCA like you did in the previous session. Plot the first two principal components using the seaborn package.
- Does the mutation have any effect on gene expression or do all samples cluster together?
- Do you think the samples of the isogenic wildtype are perfect replicates? What about the missense mutant samples?

<details>
<summary>💡 Show solution</summary>
    
The mutation has an effect on gene expression: in the PCA plot, the wildtype and mutant samples cluster separately. However, there is also a substantial transcriptional difference between the two wildtype samples, which makes them less reliable as replicates. The missense mutant samples cluster much closer together, which makes them better replicates.
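A possible PCA workflow is sketched below on simulated counts. The log2(CPM + 1) transform, the scaling step, and the names `genotype`, `WT_1`, `MUT_1` and so on are assumptions for illustration, not a prescribed solution.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated counts (200 genes x 4 samples); the first 20 genes are higher in mutants.
rng = np.random.default_rng(0)
samples = ["WT_1", "WT_2", "MUT_1", "MUT_2"]
genotype = ["wildtype", "wildtype", "mutant", "mutant"]
lam = np.repeat(rng.poisson(100, size=(200, 1)), 4, axis=1).astype(float)
lam[:20, 2:] *= 4
counts = pd.DataFrame(rng.poisson(lam), columns=samples)

# CPM normalization followed by a log transform to dampen highly expressed genes.
cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
log_cpm = np.log2(cpm + 1)

# PCA expects samples as rows, so transpose; then standardize each gene.
scaled = StandardScaler().fit_transform(log_cpm.T)
pca = PCA(n_components=2)
pcs = pca.fit_transform(scaled)

pca_df = pd.DataFrame(pcs, columns=["PC1", "PC2"], index=samples)
pca_df["genotype"] = genotype

# Scatter plot of the first two components, colored by genotype.
ax = sns.scatterplot(data=pca_df, x="PC1", y="PC2", hue="genotype", s=100)
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%} variance)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%} variance)")
```

Labeling the axes with the explained variance helps you judge how much of the overall variation each component captures.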

## Differential gene expression

Now you can perform differential gene expression analysis using the DESeq2 method. This method allows you to find genes with significant expression differences (up- or downregulated) between two conditions.
- Your goal is to investigate the effect of the mutation on cardiomyocyte differentiation. Which samples do you need to compare?
- Build your DESeq2 object(s) and run the analysis. *Make sure you use the raw filtered counts for this (not the counts-per-million normalized counts used for PCA), since DESeq2 performs its own internal normalization.*
- Look at the results of your analysis. What information do the rows and columns of the DESeq2 result contain?

<details>
<summary>💡 Show solution</summary>
    
We need to compare the missense mutant samples with the isogenic wildtype samples. Each row is a gene and each column is a statistic from the differential expression test:
- baseMean: mean normalized expression across all samples.
- log2FoldChange: log₂ of the estimated fold change between conditions.
- lfcSE: standard error of the log2FoldChange estimate.
- stat: Wald test statistic (log2FoldChange / lfcSE).
- pvalue: p-value for H₀: log2FoldChange = 0.
- padj: p-value adjusted for multiple testing (FDR, e.g. Benjamini–Hochberg).
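To make the relationship between these columns concrete, the sketch below recomputes `stat`, `pvalue` and `padj` by hand from invented `log2FoldChange` and `lfcSE` values. It is a simplified illustration, not the pydeseq2 implementation: the helper `benjamini_hochberg` is hypothetical, and the real pipeline may apply extra steps such as independent filtering before adjusting p-values.

```python
import math
import numpy as np
import pandas as pd

# Invented effect sizes and standard errors for four hypothetical genes.
res = pd.DataFrame(
    {"log2FoldChange": [-3.0, 2.5, 0.2, -0.1], "lfcSE": [0.5, 0.5, 0.4, 0.5]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

# Wald statistic: the effect size expressed in units of its standard error.
res["stat"] = res["log2FoldChange"] / res["lfcSE"]

# Two-sided p-value under a standard normal: P(|Z| >= |stat|) = erfc(|stat| / sqrt(2)).
res["pvalue"] = [math.erfc(abs(z) / math.sqrt(2)) for z in res["stat"]]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper for illustration)."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


res["padj"] = benjamini_hochberg(res["pvalue"])
print(res)
```

Genes with a large effect relative to its standard error get a large absolute `stat` and therefore a small `pvalue`, and `padj` is never smaller than `pvalue`.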

- Now extract the significantly upregulated and downregulated genes from this dataframe, keeping genes with an absolute log2 fold change greater than 2 (log2FoldChange > 2 for upregulated, < -2 for downregulated). Are there more genes that are upregulated or downregulated? _Tip: to get an overview in one figure, you can plot the upregulated and downregulated genes in a volcano plot._

<details>
<summary>💡 Show solution</summary>

There are more genes upregulated (166) than downregulated (89).
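The selection and a basic volcano plot could look like the sketch below. The significance cutoff `padj < 0.05` is an assumption (the exercise only fixes the fold-change threshold), and the toy `res_df` values are invented.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy results table; values are invented for illustration.
res_df = pd.DataFrame(
    {
        "log2FoldChange": [3.1, -2.6, 0.4, 2.8, -0.2],
        "padj": [1e-8, 1e-4, 0.6, 0.2, np.nan],
    },
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"],
)

padj_cutoff = 0.05  # assumed significance threshold
lfc_cutoff = 2      # log2 fold change threshold from the exercise

# Comparisons with NaN evaluate to False, so genes without padj are excluded.
significant = res_df["padj"] < padj_cutoff
up = res_df[significant & (res_df["log2FoldChange"] > lfc_cutoff)]
down = res_df[significant & (res_df["log2FoldChange"] < -lfc_cutoff)]
print(f"{len(up)} upregulated, {len(down)} downregulated")

# Volcano plot: effect size on the x-axis, significance on the y-axis.
plot_df = res_df.dropna(subset=["padj"])
colors = np.where(
    plot_df.index.isin(up.index), "red",
    np.where(plot_df.index.isin(down.index), "blue", "grey"),
)
fig, ax = plt.subplots()
ax.scatter(plot_df["log2FoldChange"], -np.log10(plot_df["padj"]), c=colors)
ax.axvline(lfc_cutoff, linestyle="--", color="black")
ax.axvline(-lfc_cutoff, linestyle="--", color="black")
ax.axhline(-np.log10(padj_cutoff), linestyle="--", color="black")
ax.set_xlabel("log2 fold change (mutant vs wildtype)")
ax.set_ylabel("-log10(padj)")
```

Note that GENE_D has a large fold change but is not selected, because its adjusted p-value is above the cutoff.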

- Take the dataframe of your DESeq2 results (res_df) and sort it by adjusted p-value (padj), from smallest to largest. Then take the names of the top 10 genes from the sorted dataframe (_hint: top10 = res_df.head(10).index_). What are the top 3 gene names?

<details>
<summary>💡 Show solution</summary>
    
The top 3 differentially expressed genes are CD34, TBX5 and RPS4Y1.
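A short sketch of the sorting step on an invented results table. Keep in mind that `sort_values` returns a new dataframe, so the sorted result has to be assigned to a variable before calling `head(10)`.

```python
import numpy as np
import pandas as pd

# Toy results table with 12 hypothetical genes; padj values are invented.
res_df = pd.DataFrame(
    {"padj": [0.5, 1e-10, 0.03, np.nan, 1e-4, 0.2, 1e-6, 0.9, 0.01, 0.7, 1e-3, 0.04]},
    index=[f"GENE_{i:02d}" for i in range(1, 13)],
)

# Ascending sort puts the most significant genes first; NaN values go last.
res_sorted = res_df.sort_values("padj")

# Names of the 10 most significant genes, then the top 3 of those.
top10 = res_sorted.head(10).index
top3 = list(top10[:3])
print(top3)
```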

- Now plot the top 10 differentially expressed genes. Which gene is downregulated the most by the GATA4 mutation? Can you find support in the literature for a connection between this gene and GATA4?
<details>
<summary>💡 Show solution</summary>

TBX5 is a cardiac transcription factor that causes septal defects when mutated. GATA4 normally co-occupies cardiac enhancers with TBX5. The G296S mutation disrupts TBX5 recruitment, especially at super-enhancers.
- https://doi.org/10.1016/j.cell.2016.11.033
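One simple way to plot the top genes is a bar chart of their log2 fold changes, sketched here with invented values and hypothetical gene names. A heatmap of normalized expression would be an equally reasonable choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Invented log2 fold changes for a handful of hypothetical top genes.
top_df = pd.DataFrame(
    {"log2FoldChange": [4.2, -5.1, 3.3, -2.4, 2.9]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"],
)

# The most downregulated gene has the smallest (most negative) log2 fold change.
most_down = top_df["log2FoldChange"].idxmin()
print(f"Most downregulated: {most_down}")

# Horizontal bars sorted by effect size; blue = down, red = up in the mutant.
ordered = top_df.sort_values("log2FoldChange")
bar_colors = np.where(ordered["log2FoldChange"] < 0, "blue", "red")
fig, ax = plt.subplots()
ax.barh(ordered.index, ordered["log2FoldChange"], color=bar_colors)
ax.axvline(0, color="black")
ax.set_xlabel("log2 fold change (mutant vs wildtype)")
```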

## Gene Ontology enrichment

Now you can start interpreting the gene lists from the DESeq2 analysis and learn whether any biological pathways are affected by the GATA4 mutation.
- Use the cheat sheet to perform a GO enrichment analysis on the upregulated and downregulated genes. Which biological processes are downregulated in mutant vs WT on this differentiation day?
- Can you find scientific evidence in the literature that these processes are regulated by GATA4?

<details>
<summary>💡 Show solution</summary>

Biological processes that are downregulated are mostly related to cardiac conduction, in which TBX5 plays an essential role. The G296S mutation disrupts TBX5 recruitment, leading to downregulation of cardiac programs (e.g., contraction, calcium handling, septation) and aberrant activation of endothelial/endocardial pathways.

- https://doi.org/10.1038/nature01827
- https://doi.org/10.1016/bs.ctdb.2016.08.008
- https://doi.org/10.1016/j.cell.2016.11.033
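To build intuition for what an enrichment p-value means, the sketch below computes a one-sided hypergeometric test for a single GO term, which is the kind of over-representation test such tools typically rely on. It is not the gseapy implementation: the function `hypergeom_sf` is a hypothetical helper and all numbers are invented.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(overlap >= k) when n genes are drawn from N, of which K carry the term.

    Hypothetical helper for illustration; k, N, K and n are toy inputs.
    """
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probabilities of every overlap size from k up to the maximum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Toy numbers: 1000 background genes, 50 annotated to the term,
# 20 downregulated genes, 5 of which are annotated to the term.
background_size = 1000
term_size = 50
gene_list_size = 20
overlap = 5

# Overlap expected by chance if the gene list were a random draw.
expected_overlap = gene_list_size * term_size / background_size
p_value = hypergeom_sf(overlap, background_size, term_size, gene_list_size)
print(f"expected overlap: {expected_overlap:.1f}, observed: {overlap}, p = {p_value:.4f}")
```

A term is called enriched when the observed overlap is much larger than expected by chance. Because many terms are tested at once, the resulting p-values also need multiple-testing correction.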