# Pandas RNA-seq exercise

## Overview

## Learning Objectives
In this exercise, you will use a real RNA-seq dataset to practice the following Python skills:
- Loading and inspecting data.
- Data cleaning (handling duplicates/NaN values).
- Subsetting and reshaping.
- Visualizing gene expression patterns (violin plots, PCA scatterplots).

**You may not yet know how to write the code for every step, but you should be able to follow which methods are being used!**

## Prerequisites
- Pandas & NumPy modules
- Some [background on RNA-seq](https://pmc.ncbi.nlm.nih.gov/articles/PMC6096346/) would be very helpful

## Getting Started
- <mark>This exercise needs to be performed in a cloud environment. The files are too big for PCs</mark>
- Run the next code box to install required tools

In [None]:
%pip install jupyterquiz
import pandas as pd
import numpy as np
import os
from jupyterquiz import display_quiz

## Pandas Exercise: Analyzing RNA-seq Data
Scenario: You have obtained RNA-seq data from NCBI, and you need to analyze it using Pandas. The dataset contains gene expression levels across different samples. Your tasks involve data cleaning, normalization, and exploratory analysis.

*You should be able to use other datasets, including your own!*

**Recommended Dataset:**
GEO Series Accession: GSE198050
Title: Transcriptome (RNA-Seq) of HEK293T and HEK293TΔALKBH5ΔFTO cells
Organism: Homo sapiens

<p style="background:blue;color:white;font-family:arial"> This dataset investigates the impact of knocking out two key RNA demethylases (ALKBH5 and FTO) on the transcriptome. These enzymes remove the m6A modification, a widely studied RNA methylation mark that regulates RNA stability, splicing, and translation. By comparing normal HEK293T cells with double-knockout (ΔALKBH5ΔFTO) cells, you can explore how epitranscriptomic regulation affects gene expression, connecting Pandas data analysis with biological insights </p>

A full RNA-seq processing workflow is more involved than what we do here, and other tutorials cover it in detail. Remember, we are mainly practicing Python (Pandas) tools in **steps 1-4.**

Summary of Steps and some tools you'll practice

1. Load RNA-seq Data: Import raw gene counts and metadata.      *pd.read_csv()*
2. Clean Data: Remove duplicate rows and handle NaN values.     *df.drop_duplicates(), df.dropna()*
3. Normalize: Convert raw counts to Counts Per Million (CPM).   *iterating, merge, df.sum() and df.div() math functions*
4. Filter Genes: Remove lowly expressed genes with CPM < 0.5.   *df.mask()*
5. Encode Groups: Map sample groups and perform dummy coding.   *df.melt(), Series.map()*
6. PCA Analysis: Perform PCA and visualize sample clustering.   *df.T, PCA, making dataframes, plotting*
7. Differential Expression: Test for differential expression using linear regression or GLMs.
8. Adjust for Multiple Testing: Apply FDR correction to identify significant genes.

Some later steps will use techniques we have not yet covered in the tutorials.

If you get stuck, all the required coding can be found at the end. BUT, you should focus on **trying it yourself.**


## Step 1: Get a data set
**Task:** 
Download the file into a pandas dataframe. Inspect the first few rows to understand the structure and identify column names corresponding to gene identifiers and sample counts.

The dataset includes a TSV file of gene counts. It is compressed (*.gz), but Pandas can read it directly (see the code starter in the next box).

File Name: [GSE198050_HEK293T_dALKBH5dFTO_genecounts.tsv.gz](https://ftp.ncbi.nlm.nih.gov/geo/series/GSE198nnn/GSE198050/suppl/GSE198050_HEK293T_dALKBH5dFTO_genecounts.tsv.gz)

The file is available in the Datasets folder as GSE198050_HEK293T_dALKBH5dFTO_genecounts.tsv.gz. Alternatively, you can pass the link above directly to `pd.read_csv()`.

In [None]:
# Read a .gz file directly into a DataFrame
df = pd.read_csv('path/to/your/file.tsv.gz', compression='gzip', sep='\t')  # replace the file name!
print(df.head())

In [None]:
# Read a .gz file directly into a DataFrame
# Forward slashes work on every operating system and avoid backslash escape problems
df = pd.read_csv('./Datasets/GSE198050_HEK293T_dALKBH5dFTO_genecounts.tsv.gz', compression='gzip', sep='\t')

print(df.head())
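
Beyond `head()`, a few other attributes help with the "inspect" part of this task. The sketch below does not touch the real file: it builds a tiny gzip-compressed TSV in memory (the gene and sample names are made up for illustration) and reads it with the same `pd.read_csv()` arguments used above.

```python
import gzip
import io

import pandas as pd

# Build a tiny gzip-compressed TSV in memory (gene and sample names are made up)
raw_text = "Gene\tsample_1\tsample_2\nGENE_A\t10\t20\nGENE_B\t0\t5\n"
compressed = io.BytesIO(gzip.compress(raw_text.encode("utf-8")))

# Same call pattern as for the real file: tab-separated, gzip-compressed
toy = pd.read_csv(compressed, compression="gzip", sep="\t")

print(toy.shape)             # (number of rows, number of columns)
print(toy.dtypes)            # which columns hold text and which hold numbers
print(toy.columns.tolist())  # column names: gene identifier first, then samples
```

Running the same three `print` calls on your own `df` should show which column holds the gene identifiers and how many sample columns there are.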

## Step 2: Data Cleaning

1. Remove duplicate rows from the DataFrame.
2. Remove rows with NaN values in any column.


In [None]:
# Remove duplicate rows



# Remove rows with NaN values


print(f"Cleaned data shape: {df.shape}")    #did the total number of genes change?



In [None]:
# Remove duplicate rows

df = df.drop_duplicates()


# Remove rows with NaN values
df = df.dropna()


print(f"Cleaned data shape: {df.shape}")    #did the total number of genes change?
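
Before removing anything, it can help to count how many rows would be affected. The following is a minimal sketch on a toy table (hypothetical gene and sample names, not the GEO data) that counts duplicates and missing values and then applies the same two cleaning calls.

```python
import numpy as np
import pandas as pd

# Toy count table (hypothetical names) with one duplicated row and one missing value
toy = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_A", "GENE_B", "GENE_C"],
    "sample_1": [10, 10, 3, np.nan],
    "sample_2": [20, 20, 4, 7],
})

print(toy.duplicated().sum())  # number of rows that repeat an earlier row exactly
print(toy.isna().sum())        # number of missing values in each column

# Same cleaning calls as in the exercise, chained together
cleaned = toy.drop_duplicates().dropna()
print(f"{len(toy)} rows before cleaning, {len(cleaned)} rows after")
```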



## Step 3. Normalize the Data

A standard normalization technique for RNA-seq is to use counts-per-million:

$CPM = \dfrac{\text{Raw Counts}}{\text{Total Counts for the Sample}} \times 10^6$

Since the columns are samples and the rows are genes, this normalization can be done in two simple steps. 
1. Create a variable to hold the sum of each column (df.sum())
2. Divide the data (axis=1) by these count values & multiply by 1 million

The only challenge is that the first column (check whether your file has more than one) holds gene names rather than counts. Set the gene name column aside, do the math on the count columns, then attach the gene names back onto the dataframe.

In [None]:
# Exclude the 'Gene' column & copy the rest of the information into a dataframe for normalization

# Calculate the total counts per sample (column sums)


# Normalize to CPM in a new dataframe by using the df.div() function

# Add the 'Gene' column back on the left by merging  
df_cpm = None  # replace None with your normalized dataframe
print("CPM Normalized Data:")
print(df_cpm)

In [None]:
# Exclude the 'Gene' column for normalization
count_data = df.iloc[:, 1:]

# Calculate the total counts per sample (column sums)
total_counts = count_data.sum(axis=0)

# Normalize to CPM
cpm_data = count_data.div(total_counts, axis=1) * 1e6

# Add the 'Gene' column back
df_cpm = pd.concat([df[['Gene']], cpm_data], axis=1)

print("CPM Normalized Data:")
print(df_cpm)
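
A quick way to check a CPM calculation is that every sample column should add up to one million afterwards. Here is a small worked sketch on toy counts (the gene and sample names are invented) using the same `sum()` and `div()` steps.

```python
import numpy as np
import pandas as pd

# Toy raw counts; names are invented for illustration
toy = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B", "GENE_C"],
    "sample_1": [10, 30, 60],
    "sample_2": [5, 5, 10],
})

counts = toy.set_index("Gene")                 # keep gene names as the index
library_sizes = counts.sum(axis=0)             # total counts per sample
cpm = counts.div(library_sizes, axis=1) * 1e6  # divide each column by its own total

print(cpm)
print(cpm.sum(axis=0))  # sanity check: each sample should total 1,000,000
```

In this sketch the gene names are kept as the index instead of being removed and re-attached, which is an alternative to the concat approach in the cell above, not a replacement for it.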

## Step 4: Filter out genes with low expression
Remove genes with an average expression (across samples) below a threshold of 0.5 CPM.

In the pandas tutorial, we accomplished this by replacing values with NaN, then dropping those rows (genes).


In [None]:
# Remove rows (genes) whose average expression across samples is <0.5 CPM



In [None]:
# Remove rows (genes) whose mean across samples is <0.5 CPM
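
One possible way to approach this step is sketched below on a toy CPM table (the gene and sample names are hypothetical). It shows two options side by side: filtering on the mean CPM of each gene, and the mask-then-dropna pattern, which is stricter because it drops a gene as soon as any single sample falls below the threshold.

```python
import pandas as pd

# Toy CPM table; gene and sample names are hypothetical
df_cpm = pd.DataFrame({
    "Gene": ["geneA", "geneB", "geneC"],
    "s1": [10.0, 0.2, 0.0],
    "s2": [12.0, 0.9, 0.1],
})
threshold = 0.5
values = df_cpm.set_index("Gene")

# Option 1: keep genes whose mean CPM across samples reaches the threshold
keep = values.mean(axis=1) >= threshold
filtered_mean = values[keep].reset_index()

# Option 2 (mask + dropna): replace every value below the threshold with NaN,
# then drop any gene that has at least one NaN
filtered_mask = values.mask(values < threshold).dropna().reset_index()

print(filtered_mean["Gene"].tolist())
print(filtered_mask["Gene"].tolist())
```

Here geneB survives the mean filter (its mean is 0.55) but not the mask filter (one sample is at 0.2), so decide which rule matches the task before applying it to the real data.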



## Advanced steps
The rest of the steps are provided below. You will need to adjust your variable names (or the ones below) to enable you to demonstrate the remaining steps!

### Principal component analysis

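PCA requires samples as rows and genes as columns, so we'll transpose the data before performing the PCA. Note that the cell below also uses the `metadata` dataframe, which is loaded in Step 5, so run the Step 5 cell first.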

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# Transpose the CPM data (samples as rows, genes as columns)
pca_data = df_cpm.set_index('Gene').T

# Perform PCA
pca = PCA(n_components=2)  # Reduce to 2 principal components
pca_result = pca.fit_transform(pca_data)

# Create a PCA DataFrame
pca_df = pd.DataFrame(pca_result, columns=['PC1', 'PC2'], index=pca_data.index)

# Add experimental groups to PCA DataFrame
pca_df['Group'] = metadata.set_index('Sample').loc[pca_df.index, 'Group']

print("\nPCA Results:")
print(pca_df.head())


Now, we can use a scatterplot to visualize the first two principal components, with points colored by experimental group.

In [None]:
# Plot PCA results
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Group', data=pca_df, palette='Set2')
plt.title('PCA: RNA-seq (CPM Normalized)')
plt.xlabel(f'Principal Component 1 ({pca.explained_variance_ratio_[0]*100:.2f}% variance)')
plt.ylabel(f'Principal Component 2 ({pca.explained_variance_ratio_[1]*100:.2f}% variance)')
plt.legend(title='Group')
plt.show()


### Step 5: Add Experimental Group Information
To analyze differential expression, you'll need to encode the experimental groups.

The dataset contains two groups: "Control" and "Knockout".
Use/create a metadata file (metadata.tsv) containing sample IDs in a `Sample` column and their corresponding groups in a `Group` column.

In [None]:
# Load metadata with sample groups
metadata = pd.read_csv('metadata.tsv', sep='\t')
print("\nMetadata:")
print(metadata)

# Map sample groups to columns in the data
sample_to_group = dict(zip(metadata['Sample'], metadata['Group']))

# Add the group information to the CPM DataFrame
df_long = df_cpm.melt(id_vars='Gene', var_name='Sample', value_name='CPM')
df_long['Group'] = df_long['Sample'].map(sample_to_group)

print("\nLong Format Data with Group Information:")
print(df_long.head())
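
If you do not have a metadata file yet, one way to create it is to build it from the sample column names. The sketch below is only an illustration: the sample names and the rule that names starting with "ko" are knockouts are assumptions, so replace them with the real column names and group assignments for your data.

```python
import io

import pandas as pd

# Hypothetical sample columns; in the exercise these would come from df.columns[1:]
sample_names = ["ctrl_rep1", "ctrl_rep2", "ko_rep1", "ko_rep2"]

# Assumed naming rule for this sketch: names starting with "ko" are knockouts
groups = ["Knockout" if name.startswith("ko") else "Control" for name in sample_names]
metadata = pd.DataFrame({"Sample": sample_names, "Group": groups})

# With no path, to_csv returns the text; pass 'metadata.tsv' instead to save a file
tsv_text = metadata.to_csv(sep="\t", index=False)

# Read it back the same way the exercise reads metadata.tsv
reloaded = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(reloaded)
```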

### Step 6: Dummy Coding for Statistical Analysis
Dummy code the Group variable for regression-based differential expression analysis.

In [None]:
# Dummy coding (0 for Control, 1 for Knockout)
df_long['Group_Code'] = df_long['Group'].map({'Control': 0, 'Knockout': 1})

print("\nDummy-Coded Data:")
print(df_long.head())
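
Two things are worth checking after dummy coding: that no group label was left unmapped, and that the codes match what you expect. The sketch below uses a toy table (hypothetical sample names) and also shows `pd.get_dummies()` as an alternative that builds the 0/1 column without a hand-written dictionary.

```python
import pandas as pd

# Toy long-format table; sample names are hypothetical
toy_long = pd.DataFrame({
    "Sample": ["s1", "s2", "s3", "s4"],
    "Group": ["Control", "Control", "Knockout", "Knockout"],
})

# Dictionary mapping, as in the exercise; labels missing from the dict become NaN
mapped = toy_long["Group"].map({"Control": 0, "Knockout": 1})
print(mapped.isna().sum())  # should be 0 if every label was mapped

# Alternative: get_dummies makes one column per label; drop_first removes the
# alphabetically first label ("Control"), leaving a single "Knockout" column
dummies = pd.get_dummies(toy_long["Group"], drop_first=True, dtype=int)
toy_long["Group_Code"] = dummies["Knockout"]
print(toy_long)
```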


### Step 7: Perform differential expression analysis on all genes

We will use **statsmodels.api**, a Python library for statistical modeling and hypothesis testing. It provides tools to perform a wide range of statistical analyses, including linear models and t-tests. 

The steps here are:
1. Iterate Over Genes: For each gene, perform linear regression using statsmodels to compare expression between groups.
2. Store Results: Collect the p-values, coefficients, and other statistics for all genes in a new DataFrame.
3. Adjust for Multiple Testing: Apply False Discovery Rate (FDR) correction to the p-values to identify significant genes.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

# Example data: Long format with columns ['Gene', 'Sample', 'CPM', 'Group', 'Group_Code']
# df_long is already created in previous steps

# Initialize lists to store results
results = []

# Iterate over each unique gene
for gene in df_long['Gene'].unique():
    # Subset data for the current gene
    gene_data = df_long[df_long['Gene'] == gene]

    # Define X (independent variable: Group_Code) and y (dependent variable: CPM)
    X = sm.add_constant(gene_data['Group_Code'])  # Add intercept
    y = gene_data['CPM']

    # Fit the linear model
    model = sm.OLS(y, X).fit()

    # Extract statistics
    pval = model.pvalues['Group_Code']
    coef = model.params['Group_Code']
    results.append({'Gene': gene, 'pval': pval, 'coef': coef})

# Convert results to a DataFrame
results_df = pd.DataFrame(results)

# Adjust p-values for multiple testing (FDR correction)
results_df['adjusted_pval'] = multipletests(results_df['pval'], method='fdr_bh')[1]

# Add a significance column
results_df['significant'] = results_df['adjusted_pval'] < 0.05

print("\nDifferential Expression Results (Top Genes):")
print(results_df.head(n=50))


The resulting results_df DataFrame will have the following columns:

- Gene: The gene name.
- pval: The raw p-value from the linear regression test.
- coef: The effect size (difference in mean CPM between groups, Knockout minus Control).
- adjusted_pval: The FDR-corrected p-value.
- significant: A boolean indicating whether the gene is significantly differentially expressed (adjusted p-value < 0.05).
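
To see what the FDR correction does to the p-values, here is a minimal NumPy sketch of the Benjamini-Hochberg procedure, which is the method selected by `method='fdr_bh'`. The function name `benjamini_hochberg` and the toy p-values are made up for illustration; in the exercise itself, `multipletests` does this work for you.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # positions that sort p ascending
    ranked = p[order] * m / np.arange(1, m + 1)  # p * (number of tests) / rank
    # Enforce monotonicity: each value becomes the minimum of itself
    # and every value at a larger rank
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0, 1)      # put values back in input order
    return adjusted


toy_pvals = [0.01, 0.04, 0.03, 0.005]  # invented p-values for four genes
adjusted = benjamini_hochberg(toy_pvals)
print(adjusted)
```

Each adjusted value is at least as large as the raw p-value, which is why fewer genes pass the 0.05 cutoff after correction than before it.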

### Final steps: Visualization
It is common to show differential expression with a volcano plot, with non-significant genes plotted in gray and significant genes in red. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Volcano plot: Effect size (coef) vs -log10(adjusted p-value)
plt.figure(figsize=(8, 6))
results_df['log_pval'] = -np.log10(results_df['adjusted_pval'])

# Plot significant and non-significant genes
plt.scatter(results_df['coef'], results_df['log_pval'], c='gray', label='Not Significant')
plt.scatter(
    results_df[results_df['significant']]['coef'],
    results_df[results_df['significant']]['log_pval'],
    c='red', label='Significant'
)

plt.axhline(-np.log10(0.05), color='blue', linestyle='--', label='p=0.05 (adjusted)')
plt.title('Volcano Plot of Differential Expression')
plt.xlabel('Effect Size (Coef)')
plt.ylabel('-log10(Adjusted p-value)')
plt.legend()
plt.show()


## Conclusion

This exercise demonstrated some of the power of Python libraries (NumPy, Pandas, matplotlib, and statsmodels) to perform complex bioinformatics analyses and visualizations. After this guided exercise, it is time to tackle the [submodule 2 bioinformatics project](./Submodule_2_Tutorial_6_Project.ipynb) using the many skills you've learned in this module (or jump to the [solved version](./Submodule_2_Tutorial_7_ProjectSolutions.ipynb) of the project)!

### Clean up

Remember to stop your Jupyter Notebook compute instance to avoid unnecessary charges.