# RNA-seq Practice Notebook
_This notebook documents what I learned during my self-study of RNA-seq data analysis using PyDESeq2._

## Part 1: Original Protocol Exploration
_These cells follow the original protocol step by step, for comprehension._



### 1.1 Overview

* WHAT does DESeq2 do in simple words?
  * It is a tool for analyzing RNA sequencing data that helps you figure out which genes show significant changes in expression levels between two or more groups of samples.
* HOW does it do that?
  * It compares how many RNA reads (gene transcripts) are present in each sample.
    * Counts tell you how "active" each gene is.
  * It normalizes the data and runs statistical tests to find genes that are consistently more or less active in one group compared to another.
* WHY does differential expression analysis matter?
  * It tells you how gene expression changes in response to different conditions (like drug treatments, disease states, stress, or development).
    * This can reveal important biological mechanisms, potential drug targets, or biomarkers.
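
To make the "compare counts between groups" idea concrete for myself, here is a toy sketch. It is NOT how DESeq2 works internally (DESeq2 fits a statistical model per gene); it only computes a naive log2 fold change between two groups. The table `demo_counts`, the labels in `demo_condition`, and the pseudocount of 1 are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: 4 samples (rows) x 3 genes (columns).
demo_counts = pd.DataFrame(
    {
        "gene1": [10, 12, 40, 44],    # higher in group B
        "gene2": [100, 90, 95, 105],  # roughly unchanged
        "gene3": [8, 8, 2, 2],        # lower in group B
    },
    index=["s1", "s2", "s3", "s4"],
)
# Hypothetical group label for each sample.
demo_condition = pd.Series(["A", "A", "B", "B"], index=demo_counts.index)

# Mean count per gene within each group.
group_means = demo_counts.groupby(demo_condition).mean()

# Naive log2 fold change of B relative to A.
# The pseudocount of 1 is an arbitrary choice to avoid dividing by zero.
log2_fc = np.log2((group_means.loc["B"] + 1) / (group_means.loc["A"] + 1))
print(log2_fc.round(2))
```

A positive value means the gene looks more active in group B, a negative value means less active, and a value near zero means little change. The real tool adds normalization and a statistical test on top of this basic comparison.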


### 1.2 Required Import Packages

In [None]:
# Install PyDESeq2 in Colab (only needed once per session)
!pip install pydeseq2

# Import required packages
import os
import pickle as pkl

import numpy as np

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

# Prepare for saving files
SAVE = False  # whether to save the outputs of this notebook

if SAVE:
    # Replace this with the path to the directory where results should be saved
    OUTPUT_PATH = "../output_files/synthetic_example"
    os.makedirs(OUTPUT_PATH, exist_ok=True)  # Create path if it doesn't exist

### 1.3 Data Loading

* PyDESeq2 requires **two types of inputs** to perform differential expression analysis (DEA):
  * Count matrix
    * shape: '# of samples' x '# of genes'
    * contains read counts (non-negative integers)
  * Metadata
    * shape: '# of samples' x '# of variables'
    * contains sample annotations that will be used to split the data into cohorts
  * **BOTH should be provided as pandas DataFrames**
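
To picture the two inputs, here is a minimal sketch that builds them by hand with pandas. The sample names, gene names, and the `batch` column are invented for illustration; only the shapes and the "non-negative integers" rule come from the notes above.

```python
import pandas as pd

# Hypothetical count matrix: 4 samples (rows) x 3 genes (columns).
toy_counts = pd.DataFrame(
    [[5, 0, 120], [7, 1, 98], [30, 0, 110], [28, 2, 101]],
    index=["sample1", "sample2", "sample3", "sample4"],
    columns=["geneA", "geneB", "geneC"],
)

# Hypothetical metadata: 4 samples (rows) x 2 variables (columns).
toy_metadata = pd.DataFrame(
    {"condition": ["A", "A", "B", "B"], "batch": ["X", "Y", "X", "Y"]},
    index=toy_counts.index,
)

# Sanity checks on the inputs.
assert toy_counts.index.equals(toy_metadata.index)  # same samples, same order
assert (toy_counts >= 0).all().all()                # no negative counts
assert all(pd.api.types.is_integer_dtype(t) for t in toy_counts.dtypes)  # integer counts

print(toy_counts.shape)    # (4, 3)
print(toy_metadata.shape)  # (4, 2)
```

Both tables share the same row index (the sample names), which is what lets the metadata describe the rows of the count matrix.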

In [None]:
# Load an example dataset

counts_df = load_example_data(
    modality="raw_counts",
    dataset="synthetic",
    debug=False,
)

metadata = load_example_data(
    modality="metadata",
    dataset="synthetic",
    debug=False,
)

### 1.4 Data Preprocessing

#### Cleaning - Removing missing or corrupted values

In [None]:
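# Keep only samples whose "condition" annotation is present (not NaN)
samples_to_keep = ~metadata.condition.isna()
# Apply the same mask to both tables so they stay aligned
counts_df = counts_df.loc[samples_to_keep]
metadata = metadata.loc[samples_to_keep]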
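
To check my understanding of the mask, here is the same cleaning step on a tiny made-up example. The tables `toy_meta` and `toy_cts` are hypothetical; one sample is missing its condition label on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata where sample "s2" has no condition recorded.
toy_meta = pd.DataFrame({"condition": ["A", np.nan, "B"]}, index=["s1", "s2", "s3"])
# Hypothetical counts for the same three samples.
toy_cts = pd.DataFrame({"geneA": [5, 9, 30], "geneB": [0, 3, 1]}, index=toy_meta.index)

# Boolean mask: True where the condition label is present.
keep = ~toy_meta["condition"].isna()

# Apply the same mask to both tables.
toy_cts_clean = toy_cts.loc[keep]
toy_meta_clean = toy_meta.loc[keep]

print(list(toy_cts_clean.index))  # ['s1', 's3']
```

Sample `s2` is dropped from both tables, so the counts and the metadata still describe the same samples.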

#### Filtering - Removing low-count genes or outlier samples

In [None]:
# Keep only genes with at least 10 reads in total across all samples
genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]
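
Here is the same filter on a small made-up table so I can see which genes survive. The table `toy` and its values are hypothetical; the threshold of 10 total reads is the one used in the cell above.

```python
import pandas as pd

# Hypothetical counts: 3 samples (rows) x 3 genes (columns).
toy = pd.DataFrame(
    {"geneA": [5, 7, 30], "geneB": [0, 1, 2], "geneC": [4, 3, 3]},
    index=["s1", "s2", "s3"],
)

# Total reads per gene, summed over all samples: geneA=42, geneB=3, geneC=10.
gene_totals = toy.sum(axis=0)

# Keep genes whose total is at least 10 (geneC sits exactly on the threshold).
kept = toy.columns[gene_totals >= 10]
toy_filtered = toy[kept]

print(list(toy_filtered.columns))  # ['geneA', 'geneC']
```

Note that the filter looks at the total across samples, not at each sample separately, so a gene with modest counts everywhere can still pass.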

#### Normalization - Adjusting for sequencing depth
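
PyDESeq2 handles normalization inside its own pipeline, so I do not need to do this by hand. To build intuition, here is a rough sketch of the median-of-ratios idea that DESeq2-style normalization is based on. The function name `median_of_ratios_size_factors` and the toy table are my own, and this is a simplified illustration under those assumptions, not the library's code.

```python
import numpy as np
import pandas as pd


def median_of_ratios_size_factors(counts: pd.DataFrame) -> pd.Series:
    """Sketch of median-of-ratios size factors (samples in rows, genes in columns)."""
    # Use only genes with no zero counts so every log is finite.
    nonzero_genes = counts.columns[(counts > 0).all(axis=0)]
    log_counts = np.log(counts[nonzero_genes].astype(float))
    # Log of the per-gene geometric mean across samples (the reference sample).
    log_geo_means = log_counts.mean(axis=0)
    # Log ratio of each sample to the reference, gene by gene.
    log_ratios = log_counts.sub(log_geo_means, axis=1)
    # Per-sample median of the log ratios, converted back to the count scale.
    return np.exp(log_ratios.median(axis=1))


# Hypothetical data: the "deep" sample was sequenced twice as deeply as "shallow".
toy = pd.DataFrame(
    {"geneA": [10, 20], "geneB": [100, 200], "geneC": [30, 60], "geneD": [0, 5]},
    index=["shallow", "deep"],
)

size_factors = median_of_ratios_size_factors(toy)
normalized = toy.div(size_factors, axis=0)  # divide each sample by its size factor

print(size_factors)
print(normalized.round(2))
```

After dividing by the size factors, genes A to C have the same normalized value in both samples, which is the point: the twofold difference was sequencing depth, not biology.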

#### Type conversion

#### Metadata merging - Linking sample annotations with count data
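
A minimal sketch of what I mean by linking: make sure the counts and the metadata cover the same samples in the same order, then optionally join them for inspection. The tables below are hypothetical, and using the shared index is just one reasonable way to align them.

```python
import pandas as pd

# Hypothetical counts with samples in a different order than the metadata.
toy_cts = pd.DataFrame({"geneA": [5, 30, 7]}, index=["s1", "s3", "s2"])
# Hypothetical metadata that also contains a sample ("s4") with no counts.
toy_meta = pd.DataFrame(
    {"condition": ["A", "A", "B", "B"]}, index=["s1", "s2", "s3", "s4"]
)

# Samples present in both tables.
shared = toy_cts.index.intersection(toy_meta.index)

# Select the shared samples from both tables using the same index,
# so the rows line up one-to-one.
toy_cts_aligned = toy_cts.loc[shared]
toy_meta_aligned = toy_meta.loc[shared]

# Optional: one combined table (annotations + counts) for eyeballing.
combined = toy_meta_aligned.join(toy_cts_aligned)
print(combined)
```

The combined table is only for looking at the data; for the analysis itself the counts and metadata stay as two separate DataFrames.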

#### Log transformation - Stabilizing variance for visualization or testing
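
A simple stand-in for this step is `log2(count + 1)`, sketched below on made-up normalized counts. This is an assumption on my part for plotting purposes: DESeq2 has its own variance-stabilizing transformations, and as far as I understand the statistical test itself works on the raw counts, not on log values.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized counts: 2 samples (rows) x 2 genes (columns).
toy_norm = pd.DataFrame(
    {"geneA": [0.0, 3.0], "geneB": [1023.0, 255.0]},
    index=["s1", "s2"],
)

# log2(x + 1): the +1 keeps zero counts defined (log2(1) = 0).
log_counts = np.log2(toy_norm + 1)

print(log_counts)
```

On the raw scale geneB is hundreds of times larger than geneA; on the log scale the values are 0 and 2 for geneA versus 10 and 8 for geneB, which is much easier to show on one plot.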

### 1.5 Alignment


### 1.6 Quantification

### 1.7 Differential Expression

## Part 2: My Brain-Friendly Rewrites and Tinkering

### 2.1 Starter Notes

* I am using PyDESeq2 because I am comfortable with Python. I will lean on that comfort as a crutch _only while I learn the logic_. The goal is to understand how DESeq2 works.
* Once I understand the logic and how it works in Python, I will take the time to learn R so that I can apply to labs and follow established pipelines.

### 2.2 Refactored Preprocessing

### 2.3 Notes on Troubleshooting

### 2.4 Annotated Visualization Tweaks

# Resources


## Links
* Step-by-step PyDESeq2 workflow - https://pydeseq2.readthedocs.io/en/stable/auto_examples/plot_step_by_step.html
* A simple PyDESeq2 workflow - https://pydeseq2.readthedocs.io/en/latest/auto_examples/plot_minimal_pydeseq2_pipeline.html


## References
* Differential Expression with DEseq2 - https://genviz.org/module-04-expression/0004/02/01/DifferentialExpression/

## Glossary of Terms