# Module 6: Analyzing RNA Sequencing Data


> **⚠️ CAUTION:** This is a very computationally demanding module and primarily serves as a reference. The processed files will be made available in Module 7.

## Experimental Design: SARS-CoV-2 Infection in Cell Lines

This data comes from a pivotal study ([Blanco-Melo et al., Cell 2020](https://doi.org/10.1016/j.cell.2020.04.026)) examining SARS-CoV-2 infection responses in cell culture models. This seminal paper, "Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19", was one of the first to characterize the unique host response to SARS-CoV-2 and has been cited over 2,500 times. The experiment uses A549 cells, a human lung cancer cell line commonly used in respiratory virus research.

### Sample Groups
The experiment includes two main series of comparisons:

#### Series 5: Standard A549 Cells
* **Treatment Group**: SARS-CoV-2 infected (n=3)
 - GSM4462339, GSM4462340, GSM4462341
* **Control Group**: Mock infected (n=3)
 - GSM4462336, GSM4462337, GSM4462338

#### Series 16: Modified A549-ACE2 Cells
These cells were engineered to express the ACE2 receptor, which SARS-CoV-2 uses to enter cells.
* **Treatment Group**: SARS-CoV-2 infected (n=3)
 - GSM4486160, GSM4486161, GSM4486162
* **Control Group**: Mock infected (n=3)
 - GSM4486157, GSM4486158, GSM4486159

### Experimental Significance
* A549 cells normally express very low levels of ACE2, the receptor SARS-CoV-2 uses to enter cells
* By comparing standard A549 cells with ACE2-expressing A549 cells, researchers can:
 - Understand how the virus affects cells when it can and cannot efficiently enter
 - Identify cellular responses specific to viral entry versus exposure
 - Study how ACE2 expression levels impact infection outcomes

### Data Structure and Storage
* Each sample has a unique GEO accession (GSM) number for metadata tracking
* Each GSM is associated with one or more SRA run accessions (SRR numbers)
* SRR files are stored in a specialized `.sra` format:
 - Binary format that efficiently stores sequencing reads
 - Contains both sequence data and quality scores
 - Requires special tools (SRA Toolkit) to convert to standard FASTQ format
 - More space-efficient than raw FASTQ files
 - Can represent both single-end and paired-end sequencing data
 - Acts as an archive format for sequence repositories

### Data Processing Flow
1. GSM accessions identify the biological samples
2. SRR numbers link to the actual sequencing data
3. `.sra` files are downloaded using `prefetch`
4. `.sra` files are converted to FASTQ using `fasterq-dump`
5. FASTQ files are then processed for quality control and analysis
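
Steps 3 and 4 can be sketched as a pair of command lines assembled in Python. This is a minimal sketch rather than the notebook's own helper: `build_fetch_commands` is a hypothetical name, nothing is executed, and it assumes that `prefetch --output-directory` places each run at `<sra_dir>/<SRR>/<SRR>.sra`.

```python
from pathlib import Path

def build_fetch_commands(srr, sra_dir="sra", fastq_dir="fastq", threads=4):
    """Return the prefetch and fasterq-dump argument lists for one run (nothing is executed)."""
    # Assumed prefetch layout: one subdirectory per run accession
    sra_file = Path(sra_dir) / srr / f"{srr}.sra"
    prefetch_cmd = ["prefetch", srr, "--output-directory", str(sra_dir)]
    convert_cmd = [
        "fasterq-dump", "--split-files", str(sra_file),
        "-O", str(fastq_dir), "--threads", str(threads),
    ]
    return prefetch_cmd, convert_cmd

prefetch_cmd, convert_cmd = build_fetch_commands("SRR11517674")
print(" ".join(prefetch_cmd))
print(" ".join(convert_cmd))
```

Keeping the commands as argument lists means they can be passed to `subprocess.run` without `shell=True` if you decide to execute them.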


This notebook demonstrates a cleaner, more modular approach to:
1. Installing required bioinformatics tools
2. Fetching SRA reads
3. Running quality control with `fastp` + `MultiQC`
4. Building a Salmon index
5. Running Salmon quantification

---

## 1. Install Tools 

## RNA-seq Pipeline Package Installation

The code cell below installs essential bioinformatics tools for RNA-seq analysis through conda.

### Package Details and Purposes

#### Data Access and Download
* [**entrez-direct**](https://www.ncbi.nlm.nih.gov/books/NBK179288/) - Command-line tools for querying and downloading from NCBI databases. Essential for programmatically accessing sequence data from repositories like SRA and GenBank. It allows automated batch downloads and queries of NCBI's vast biological databases without manual web interface interaction.
* [**sra-tools**](https://github.com/ncbi/sra-tools) - Official NCBI toolkit for working with Sequence Read Archive data. Required for downloading raw sequencing files. It handles the conversion of SRA's specialized format into standard FASTQ files that downstream tools can process.
* [**parallel-fastq-dump**](https://github.com/rvalieris/parallel-fastq-dump) - Parallelized wrapper around fastq-dump that speeds up the conversion of SRA data to FASTQ. While the standard fastq-dump processes a file sequentially, this tool splits the work across multiple CPU cores. Note that the helper functions in this notebook use `fasterq-dump` from sra-tools for the conversion step.

#### Quality Control and Preprocessing
* [**fastp**](https://github.com/OpenGene/fastp) - Modern all-in-one FASTQ preprocessor that handles quality control, adapter trimming, and filtering with exceptional speed. Unlike older tools that perform single tasks, fastp integrates multiple preprocessing steps including base quality control, adapter removal, quality filtering, and per-base quality pruning in a single efficient pass through the data.
* [**FastQC**](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) - Industry standard tool for generating detailed quality control reports of sequencing data. It provides comprehensive visualizations and statistics about sequence quality scores, GC content, adapter contamination, overrepresented sequences, and other potential quality issues that might affect downstream analysis.
* [**MultiQC**](https://multiqc.info/) - Aggregates bioinformatics analysis results across multiple samples into a single comprehensive report. This is particularly valuable for large-scale projects, as it combines reports from various tools (FastQC, Salmon, etc.) across many samples into an interactive HTML report, making it easier to spot batch effects or systematic issues.

#### Quantification
* [**Salmon**](https://combine-lab.github.io/salmon/) - A cutting-edge tool for transcript quantification that uses lightweight mapping (quasi-mapping) instead of traditional alignment. Unlike traditional aligners (like STAR or HISAT2) that generate BAM files with exact read positions, Salmon uses probabilistic models and lightweight mapping to estimate transcript abundances directly. This approach is substantially faster and more memory-efficient while maintaining high accuracy. It's particularly good at handling technical issues like multi-mapping reads and GC bias.

#### Utility
* [**pigz**](https://zlib.net/pigz/) - Parallel implementation of gzip for efficient compression of large sequencing files. While standard gzip uses a single CPU core, pigz parallelizes compression across multiple cores, significantly speeding up file operations on modern multi-core systems. Decompression remains largely single-threaded because the gzip format cannot be decompressed in parallel.

The installation uses specific conda channels:
* `conda-forge`: General scientific computing packages, known for its comprehensive collection of well-maintained scientific software
* `bioconda`: Specialized bioinformatics software channel, maintained by the bioinformatics community with standardized builds and dependencies

The `-y` flag automates the installation by accepting all prompts, allowing for unattended installation of all packages.




In [None]:
!conda install -c conda-forge -c bioconda \
  entrez-direct fastp fastqc multiqc salmon parallel-fastq-dump sra-tools pigz -y
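
After installation, it can be worth confirming that the executables are on the `PATH` before starting long downloads. A minimal sketch, assuming the packages expose the executable names listed in `EXPECTED_TOOLS` (both that list and the `missing_tools` helper are illustrative and not part of the original notebook):

```python
import shutil

# Executable names assumed to be provided by the packages installed above
EXPECTED_TOOLS = [
    "esearch", "efetch", "xtract",   # entrez-direct
    "prefetch", "fasterq-dump",      # sra-tools
    "fastp", "fastqc", "multiqc", "salmon", "pigz",
]

def missing_tools(tools=EXPECTED_TOOLS):
    """Return the tool names that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
print("Missing tools:", ", ".join(missing) if missing else "none")
```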


## Helper Functions for RNA-seq Data Processing

This section contains essential Python functions that help automate and streamline the RNA-seq analysis pipeline. While the implementation details are complex, each function serves a specific purpose in the workflow:

### System Setup Functions
* `get_cpu_count()` - Intelligently determines how many CPU cores to use for processing, leaving one core free for system operations to prevent overloading.
* `setup_directories()` - Creates and organizes the necessary folder structure for storing downloaded and processed sequencing data.

### Data Download and Verification
* `verify_paired_end()` - Checks whether sequencing data is paired-end (two reads per fragment) or single-end by querying NCBI's metadata.
* `check_sra_exists()` - Verifies if sequencing data has already been downloaded in SRA format to prevent redundant downloads.
* `verify_fastq_exists()` - Checks if sequencing data has already been converted to FASTQ format to avoid duplicate processing.
* `get_srr_dict()` - Retrieves SRA run accessions (SRR numbers) for each sample without downloading data. Creates a mapping between sample groups and their sequencing data.
* `get_and_fetch_srr()` - Comprehensive function that:
 - Downloads raw sequencing data from NCBI
 - Converts it from SRA to FASTQ format 
 - Compresses files to save space
 - Handles both single and paired-end data automatically

### Data Processing and Quality Control
* `run_fastp_with_concat()` - Processes raw sequencing data by:
 - Combining multiple sequencing runs for the same sample
 - Trimming low-quality bases
 - Removing adapter sequences
 - Generating quality reports
 - Creating cleaned, high-quality reads for downstream analysis

### Quantification
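* `run_salmon_quant()` and `run_salmon_quant_new()` - Perform transcript quantification using Salmon by:
 - Processing cleaned reads (`run_salmon_quant()` expects one trimmed file per run; `run_salmon_quant_new()` expects the single concatenated file per sample written by `run_fastp_with_concat()` and is the version called later in this notebook)
 - Mapping reads to a reference transcriptome
 - Generating transcript-level abundance estimates
 - Creating output compatible with downstream differential expression analysis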

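The functions work together to form a pipeline from raw data download through expression quantification. Error handling is uneven: the fastp step stops on a failed command (`check=True`), the Salmon functions print the error output, and the download helpers do not check exit codes, so inspect their output if files are missing.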

In [None]:
import os
import subprocess
import shutil
from pathlib import Path
import multiprocessing
import pandas as pd

########################
# Helper Functions
########################

def get_cpu_count():
    """Get the number of available CPU cores (minus 1)."""
    try:
        return max(1, multiprocessing.cpu_count() - 1)
    except NotImplementedError:  # raised when the core count cannot be determined
        return 1

def setup_directories(base_dir=".", sra_dir="sra", fastq_dir="fastq"):
    """Create and return paths for SRA and fastq storage."""
    base_path = Path(base_dir).resolve()
    sra_path = base_path / sra_dir
    fastq_path = base_path / fastq_dir
    temp_path = fastq_path / "temp"
    for path in [sra_path, fastq_path, temp_path]:
        path.mkdir(parents=True, exist_ok=True)
    return base_path, sra_path, fastq_path, temp_path

def verify_paired_end(srr):
    """Check if SRR is paired-end by querying SRA metadata."""
    # SRA: runinfo => layout
    cmd1 = f'esearch -db sra -query "{srr}" | efetch -format runinfo | cut -d "," -f 16'
    layout = subprocess.run(cmd1, shell=True, capture_output=True, text=True).stdout.strip()

    # Alternatively: docsum => read_spec
    cmd2 = (
        f'esearch -db sra -query "{srr}" | efetch -format docsum ' 
        f'| xtract -pattern DocumentSummary -element spots.read_spec'
    )
    read_spec = subprocess.run(cmd2, shell=True, capture_output=True, text=True).stdout.strip()

    # The runinfo output includes a header row, so test for the value rather than exact equality
    is_paired = ("PAIRED" in layout.upper()) or ("2 reads per spot" in read_spec)
    return is_paired

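def check_sra_exists(srr, sra_dir):
    """Check if the .sra file already exists locally (flat, or in prefetch's per-run subdirectory)."""
    sra_dir = Path(sra_dir)
    return (sra_dir / f"{srr}.sra").exists() or (sra_dir / srr / f"{srr}.sra").exists()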

def verify_fastq_exists(srr, output_dir, is_paired):
    """Check if corresponding FASTQ already exists."""
    output_dir = Path(output_dir)
    if is_paired:
        expected_files = [f"{srr}_1.fastq.gz", f"{srr}_2.fastq.gz"]
    else:
        expected_files = [f"{srr}.fastq.gz", f"{srr}_1.fastq.gz"]
    return any((output_dir / f).exists() for f in expected_files)

def get_srr_dict(gsm_list):
    """
    Get SRR numbers for each GSM, without downloading.
    Returns a dict: { group_name: [SRR1, SRR2, ...], ... }.
    """
    results = {}
    for gsm, group in gsm_list:
        cmd1 = (
            f'esearch -db gds -query "{gsm}" | efetch -format docsum | '
            f'xtract -pattern ExtRelation -element TargetObject | grep ^SRX'
        )
        srx = subprocess.run(cmd1, shell=True, capture_output=True, text=True).stdout.strip()

        cmd2 = (
            f'esearch -db sra -query "{srx}" | efetch -format runinfo | '
            f'cut -d "," -f 1 | grep ^SRR'
        )
        srr_list = subprocess.run(cmd2, shell=True, capture_output=True, text=True).stdout.split()
        results[group] = srr_list
    return results

def get_and_fetch_srr(gsm_list, base_dir=".", sra_dir="sra", output_dir="fastq"):
    """
    Fetch SRRs for each GSM => prefetch => convert to FASTQ => compress.
    Returns a dict: { group_name: [SRR1, SRR2, ...], ...}.
    """
    base_path, sra_path, fastq_path, temp_path = setup_directories(base_dir, sra_dir, output_dir)
    results = {}
    cpu_count = get_cpu_count()

    for gsm, group in gsm_list:
        # Get SRX
        cmd_srx = (
            f'esearch -db gds -query "{gsm}" | efetch -format docsum | '
            f'xtract -pattern ExtRelation -element TargetObject | grep ^SRX'
        )
        srx = subprocess.run(cmd_srx, shell=True, capture_output=True, text=True).stdout.strip()

        # Get SRRs
        cmd_srr = (
            f'esearch -db sra -query "{srx}" | efetch -format runinfo | '
            f'cut -d "," -f 1 | grep ^SRR'
        )
        srr_list = subprocess.run(cmd_srr, shell=True, capture_output=True, text=True).stdout.split()
        results[group] = srr_list

        for srr in srr_list:
            is_paired = verify_paired_end(srr)
            if verify_fastq_exists(srr, fastq_path, is_paired):
                continue
            
            # Prefetch if missing
            if not check_sra_exists(srr, sra_path):
                subprocess.run(f'prefetch --progress {srr} --output-directory {sra_path}', shell=True)
            
            # Fix: Account for the subdirectory structure where SRA files are saved
            sra_file_path = sra_path / srr / f"{srr}.sra"
            
            # Convert to fastq using the correct path
            cmd_fastq = (
                f'fasterq-dump --split-files "{sra_file_path}" -O {fastq_path} '
                f'-p --threads {cpu_count} --temp {temp_path}'
            )
            subprocess.run(cmd_fastq, shell=True)

            # Compress
            potential_files = list(fastq_path.glob(f"{srr}*.fastq"))
            for fq in potential_files:
                subprocess.run(f'pigz -p {cpu_count} {fq}', shell=True)

    # Cleanup
    shutil.rmtree(temp_path)
    return results



def run_fastp_with_concat(
    srr_dict,
    fastq_dir="fastq",
    trimmed_dir="trimmed_reads",
    report_dir="fastp_reports",
    phred_cutoff=15,
    min_length=36,
    cpu_count=4
):
    """
    Concatenate all FASTQs for each experiment (group) and run fastp trimming.
    
    Args:
        srr_dict (dict): Dictionary mapping { experiment_name: [SRR1, SRR2, ...], ... }
        fastq_dir (str): Directory containing raw FASTQ files (*.fastq.gz)
        trimmed_dir (str): Output directory for trimmed FASTQs
        report_dir (str): Output directory for fastp JSON/HTML reports
        phred_cutoff (int): Phred quality cutoff for trimming
        min_length (int): Minimum length required after trimming
        cpu_count (int): Number of threads to use for fastp
    """
    fastq_dir = Path(fastq_dir)
    trimmed_dir = Path(trimmed_dir)
    report_dir = Path(report_dir)
    
    trimmed_dir.mkdir(parents=True, exist_ok=True)
    report_dir.mkdir(parents=True, exist_ok=True)
    
    for experiment_name, srr_list in srr_dict.items():
        print(f"\nProcessing experiment: {experiment_name}")
        
        # Collect all FASTQ files for this experiment.
        # For single-end data, you might find each SRR as {SRR}_1.fastq.gz or {SRR}.fastq.gz
        # Adjust logic as needed for your naming convention.
        fastq_paths = []
        for srr in srr_list:
            # Potential naming patterns (adjust to your needs):
            possible_paths = [
                fastq_dir / f"{srr}.fastq.gz",
                fastq_dir / f"{srr}_1.fastq.gz"
            ]
            # Collect whichever actually exists
            existing = [p for p in possible_paths if p.exists()]
            if not existing:
                print(f"  Warning: No FASTQ file found for {srr}")
            else:
                fastq_paths.extend(existing)
        
        if not fastq_paths:
            print(f"  No FASTQs to process for {experiment_name}. Skipping.")
            continue
        
        # Concatenate them into a single file: e.g., "my_experiment_untrimmed.fastq.gz"
        # We'll zcat each input into one combined FASTQ so that fastp only sees one file.
        combined_fastq = trimmed_dir / f"{experiment_name}_untrimmed.fastq.gz"
        
        ncores = get_cpu_count()  # Reuse our CPU counting function

        if len(fastq_paths) == 1:
            # If there's only one file, just copy it (instead of doing zcat).
            shutil.copy(fastq_paths[0], combined_fastq)
            print(f"  Copied single FASTQ => {combined_fastq.name}")
        else:
            # Multiple FASTQs => merge them using pigz for parallel compression
            cmd_concat = (
                f"zcat {' '.join(str(p) for p in fastq_paths)} | "
                f"pigz -p {ncores} > {combined_fastq}"
            )
            print(f"  Merging {len(fastq_paths)} FASTQs into => {combined_fastq.name}")
            subprocess.run(cmd_concat, shell=True, check=True)
        
        # Now run fastp on the combined FASTQ
        trimmed_fastq = trimmed_dir / f"{experiment_name}.trimmed.fastq.gz"
        json_report = report_dir / f"{experiment_name}.json"
        html_report = report_dir / f"{experiment_name}.html"
        
        cmd_fastp = [
            "fastp",
            "--in1", str(combined_fastq),
            "--out1", str(trimmed_fastq),
            "--json", str(json_report),
            "--html", str(html_report),
            "--thread", str(cpu_count),
            "--qualified_quality_phred", str(phred_cutoff),
            "--length_required", str(min_length),
            "--cut_right",
            "--cut_right_window_size", "4",
            "--cut_right_mean_quality", str(phred_cutoff)
        ]
        
        print(f"  Running fastp => {experiment_name}.trimmed.fastq.gz")
        subprocess.run(cmd_fastp, check=True)
        
        # (Optional) remove the untrimmed file if you don't need it anymore
        if combined_fastq.exists():
            combined_fastq.unlink()
    
    print("\nDone! Now you can run MultiQC on the fastp_reports/ directory.")


def run_salmon_quant(srr_dict, salmon_index="salmon/salmon_index", output_dir="salmon_quant"):
    """
    Run Salmon quantification on all samples, combining multiple fastqs per group.
    Expects trimmed FASTQs in `trimmed_reads/`.
    """
    ncores = get_cpu_count()
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for exp_name, srr_list in srr_dict.items():
        fastq_files = []
        for srr in srr_list:
            # Adjust naming as needed
            fastq_path = f"trimmed_reads/{srr}_1.trimmed.fastq.gz"
            if os.path.exists(fastq_path):
                fastq_files.append(fastq_path)
            else:
                print(f"Warning: {fastq_path} not found.")

        if fastq_files:
            cmd = [
                "salmon", "quant",
                "-i", salmon_index,
                "-l", "SR",  # stranded single-end, reads from the reverse strand
                "-p", str(ncores),
                "--validateMappings",
                "-o", f"{output_dir}/{exp_name}"
            ]
            # Add each FASTQ with -r
            for fastq in fastq_files:
                cmd.extend(["-r", fastq])

            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                print(f"Error processing {exp_name}")
                print(result.stderr)
            else:
                print(f"Salmon quant finished for {exp_name}!")
        else:
            print(f"No FASTQ files found for {exp_name}")
    print("\nAll Salmon quantifications are done!")

def run_salmon_quant_new(srr_dict, salmon_index="salmon/salmon_index", output_dir="salmon_quant"):
    """
    Run Salmon quantification on all samples.
    Expects trimmed FASTQs in `trimmed_reads/` named by experiment.
    """
    ncores = get_cpu_count()
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for exp_name in srr_dict.keys():
        # Look for the concatenated, trimmed fastq created by run_fastp_with_concat
        fastq_path = f"trimmed_reads/{exp_name}.trimmed.fastq.gz"
        
        if os.path.exists(fastq_path):
            cmd = [
                "salmon", "quant",
                "-i", salmon_index,
                "-l", "SR",  # stranded single-end, reads from the reverse strand
                "-p", str(ncores),
                "--validateMappings",
                "-o", f"{output_dir}/{exp_name}",
                "-r", fastq_path  # Single concatenated file per experiment
            ]

            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                print(f"Error processing {exp_name}")
                print(result.stderr)
            else:
                print(f"Salmon quant finished for {exp_name}!")
        else:
            print(f"Warning: No trimmed FASTQ found for {exp_name} at {fastq_path}")
    
    print("\nAll Salmon quantifications are done!")
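
`verify_paired_end()` reads the library layout from a fixed column position in the runinfo CSV. A sketch of a position-independent alternative parses the CSV by header name instead. The sample text below is made up for illustration (only three of the real columns are shown), and `layouts_from_runinfo` is a hypothetical helper, not part of the pipeline above.

```python
import csv
import io

def layouts_from_runinfo(runinfo_csv):
    """Map each run accession to its library layout using the CSV header names."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    layouts = {}
    for row in reader:
        run = row.get("Run") or ""
        if run.startswith("SRR"):
            layouts[run] = (row.get("LibraryLayout") or "").upper()
    return layouts

# Illustrative stand-in for `efetch -format runinfo` output (not real query results)
example = "Run,spots,LibraryLayout\nSRR0000001,1000,SINGLE\nSRR0000002,2000,PAIRED\n"
print(layouts_from_runinfo(example))
```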


## 3. Example: Fetch SRA Reads
Below is an example of how you might call the functions to fetch reads by GSM accession. Adjust `gsm_data` to your dataset. The fetched SRA files are stored in the `sra` folder and converted to FASTQ files in the `fastq` directory.

In [None]:
gsm_data = [
    ("GSM4462339", "Series5_A549_SARS-CoV-2_1"),
    ("GSM4462340", "Series5_A549_SARS-CoV-2_2"),
    ("GSM4462341", "Series5_A549_SARS-CoV-2_3"),
    ("GSM4462336", "Series5_A549_Mock_1"),
    ("GSM4462337", "Series5_A549_Mock_2"),
    ("GSM4462338", "Series5_A549_Mock_3"),
    ("GSM4486162", "Series16_A549-ACE2_SARS-CoV-2_3"),
    ("GSM4486161", "Series16_A549-ACE2_SARS-CoV-2_2"),
    ("GSM4486160", "Series16_A549-ACE2_SARS-CoV-2_1"),
    ("GSM4486159", "Series16_A549-ACE2_Mock_3"),
    ("GSM4486158", "Series16_A549-ACE2_Mock_2"),
    ("GSM4486157", "Series16_A549-ACE2_Mock_1")
]



# Get the SRR numbers and fetch the files
srr_dict = get_and_fetch_srr(gsm_data, output_dir="fastq")

## Understanding FASTQ Files in RNA-seq

FASTQ files are the standard format for storing raw sequencing data in RNA-seq experiments. Each sequencing read in a FASTQ file consists of four lines:

### FASTQ File Structure
1. **Sequence Identifier** (starts with '@')
  - Contains instrument ID, run information, and coordinates
  - Example: `@SRR10971381.1 HWUSI-EAS100R:6:73:941:1973/1`

2. **Raw Sequence**
  - The nucleotide sequence using A, C, G, T, and N for bases that could not be called (RNA-seq reads are reported as cDNA, so T appears rather than U)
  - Example: `ATCGTAGCTACGATCGACTGACTGACTACG`

3. **Description Line** (starts with '+')
  - Often just a '+' or repeats the identifier
  - Separates sequence from quality scores

4. **Quality Scores** (Phred scores)
  - ASCII-encoded quality values for each base
  - Higher scores indicate more reliable base calls
  - Example: `@@CCFFFFFHHHHHJJJJJJJJJJJJJJJJ`

### Example FASTQ Entry

```
@SRR10971381.1 HWUSI-EAS100R:6:73:941:1973/1
ATCGTAGCTACGATCGACTGACTGACTACG
+
@@CCFFFFFHHHHHJJJJJJJJJJJJJJJJ
```
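
The four-line structure is simple enough to parse by hand. The following is a minimal sketch, assuming unwrapped records (exactly four lines each); `read_fastq` is an illustrative helper and the record used to exercise it is invented.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header}")
        yield header[1:], sequence, quality

record = "@read1 example\nACGTACGT\n+\nIIIIFFFF\n"
for name, seq, qual in read_fastq(io.StringIO(record)):
    print(name, len(seq), qual)
```

For real analyses a maintained parser is usually a better choice; the point here is only to show how the four lines map onto one read.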

## FASTQ Quality Scores: ASCII Encoding

Quality scores in FASTQ files are encoded using ASCII characters to save space and improve readability. Each base's quality score is represented by a single ASCII character.

### How It Works
* Quality scores (Phred scores) range from 0 to 40+
* These numbers are converted to ASCII characters by adding an offset
* Two common encoding schemes:
  - Phred+33: Offset of 33 (modern standard)
  - Phred+64: Offset of 64 (older Illumina format)

### Phred+33 Quality Score Examples

| Quality Score (Q) | ASCII Value | ASCII Character | Error Probability |
|------------------|-------------|-----------------|------------------|
| 40               | 73          | I              | 0.01%           |
| 30               | 63          | ?              | 0.1%            |
| 20               | 53          | 5              | 1%              |
| 10               | 43          | +              | 10%             |
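
The conversion behind such a table follows from the definition of the Phred score, Q = -10 · log10(P), where P is the probability that the base call is wrong. A minimal sketch for Phred+33 data (the function names are illustrative):

```python
def phred33_to_q(char):
    """Convert one Phred+33 quality character to its numeric quality score."""
    return ord(char) - 33

def error_probability(q):
    """Probability that a base call with quality q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

for ch in "I?5+":
    q = phred33_to_q(ch)
    print(ch, q, f"{error_probability(q):.4%}")
```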

### Common Quality Ranges

| ASCII Character | Quality Range        | Meaning                            |
|-----------------|----------------------|------------------------------------|
| !               | Worst (Q=0)          | Definite read error                |
| " to 4          | Poor (Q1-19)         | Low quality, consider trimming     |
| 5 to ?          | Good (Q20-30)        | Good quality for most applications |
| @ to I          | Excellent (Q31-40)   | Very high quality reads            |

So when you see a quality line like:

```
@@CCFFFFFHHHHHJJJJJJJJJJJJJJJJ
```

Each character represents the quality of the corresponding base in the sequence, with higher ASCII characters indicating better quality.

### File Processing
* FASTQ files are typically compressed (`.fastq.gz`) due to their large size
* A typical RNA-seq experiment can generate millions of reads
* Quality control tools like FastQC and fastp analyze these files to:
  - Identify poor quality regions
  - Detect adapter contamination
  - Find potential sequencing problems
  - Guide preprocessing decisions
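
Compressed FASTQ files can be read directly without unpacking them to disk. The sketch below counts reads in a `.fastq.gz` file, assuming four lines per record; `count_reads` is an illustrative helper, and the demo writes a tiny invented file to a temporary directory so that it runs on its own.

```python
import gzip
import os
import tempfile

def count_reads(fastq_gz_path):
    """Count records in a gzip-compressed FASTQ, assuming four lines per record."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4

# Self-contained demo with two invented reads
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n")
    print(count_reads(demo_path))
```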

## Building a Transcriptome Index with Salmon

[Salmon RNA-seq](https://github.com/COMBINE-lab/salmon) - 🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

Salmon uses a novel indexing and quantification strategy that makes it both fast and accurate for RNA-seq analysis. Unlike traditional aligners that create BAM files with exact read positions, Salmon builds an index optimized for quasi-mapping and transcript quantification.

### The Reference Transcriptome
* We need a transcriptome FASTA file (not a genome)
 - Usually downloaded from Ensembl, GENCODE, or RefSeq
 - Contains sequences for all known transcripts
 - Each sequence represents a mature, spliced transcript
 - Includes both protein-coding and non-coding RNAs

### Salmon Index Features
* **Quasi-mapping** approach:
 - Creates a minimal perfect hash of k-mers from transcripts
 - Enables ultra-fast mapping without full alignment
 - Reduces computational overhead significantly
* **Dual-phase indexing**:
 - First phase builds a sequence dictionary
 - Second phase creates efficient search structures
* **Built-in sequence deduplication**:
 - Identifies and handles repeated sequences
 - Important for gene families and paralogs
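
To build intuition for k-mer based lookup, here is a toy illustration that uses a plain Python dictionary. It is not Salmon's actual index structure, and the transcript names and sequences are invented. It shows only the general idea: a read's k-mers point to candidate transcripts without a base-by-base alignment.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k=5):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_transcripts(read, index, k=5):
    """Count, per transcript, how many of the read's k-mers it shares."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        for name in index.get(read[i:i + k], ()):
            hits[name] += 1
    return dict(hits)

transcripts = {"tx_A": "ACGTACGGTTCA", "tx_B": "TTTTGGGGCCCA"}  # invented sequences
index = build_kmer_index(transcripts)
print(candidate_transcripts("GTACGGTT", index))
```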



## Downloading Reference Data from GENCODE

This code block downloads essential reference files for human genome analysis from GENCODE, a high-quality gene annotation database.

### Command Breakdown

1. **Create Directory**
* `mkdir -p genome`
* Creates a new directory called `genome`
* `-p` flag creates parent directories if needed and prevents errors if directory exists

2. **Download Reference Genome**
* Downloads the human reference genome sequence (GRCh38)
* `primary_assembly` means it includes:
  - Standard chromosomes (1-22, X, Y, M) plus unlocalized and unplaced scaffolds
  - No alternate haplotypes or patches
* `.fa.gz` indicates a compressed FASTA file

3. **Download Transcriptome**
* Contains sequences of mature transcripts
* Includes all known transcript variants
* Used for RNA-seq quantification with Salmon
* Release 47 was among the most recent GENCODE releases when this module was written

4. **Download Gene Annotations**
* GTF file containing gene structure information
* `basic` set includes:
  - Verified, well-supported gene models
  - Simplified version of full annotation
  - Recommended for most analyses
* Contains information about:
  - Gene locations
  - Exon boundaries
  - Transcript structures
  - Gene names and IDs

### File Usage
* Genome (.fa.gz) → Source of decoy sequences for the Salmon index
* Transcriptome (.fa.gz) → Input for Salmon index
* Annotations (.gtf.gz) → Gene/transcript information for downstream analysis

In [None]:
!mkdir -p genome
!wget -P genome ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz
!wget -P genome ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
!wget -P genome ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.basic.annotation.gtf.gz
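
If you want to switch releases without editing three URLs by hand, the paths can be generated from the release number. This is a sketch that assumes other human releases follow the same naming pattern as release 47; check the GENCODE site before relying on it. `gencode_urls` is a hypothetical helper.

```python
GENCODE_BASE = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human"

def gencode_urls(release=47):
    """Build the three download URLs used in this module for a given GENCODE release."""
    root = f"{GENCODE_BASE}/release_{release}"
    return {
        "genome": f"{root}/GRCh38.primary_assembly.genome.fa.gz",
        "transcripts": f"{root}/gencode.v{release}.transcripts.fa.gz",
        "annotation": f"{root}/gencode.v{release}.basic.annotation.gtf.gz",
    }

for label, url in gencode_urls(47).items():
    print(label, url)
```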

## Building a Salmon Index with Decoy Sequences

The following commands build a Salmon index that includes decoy sequences to improve mapping accuracy.

### Process Overview

1. **Create Salmon Directory**
* Makes a directory to store index and related files

2. **Extract Decoy Sequences**
* Extracts chromosome names from the genome FASTA
* These will serve as "decoy" sequences to prevent false transcript mappings
* Command breakdown:
 - `gunzip -c` decompresses and outputs to stdout
 - `grep "^>"` finds all FASTA headers
 - `cut -d " " -f 1` extracts just the sequence names
 - Result saved in `decoys.txt`

3. **Clean Decoy List**
* Removes the '>' characters from FASTA headers
* Creates a clean list of chromosome names
* Keeps backup with `.bak` extension

4. **Create Gentrome**
* Concatenates transcriptome and genome files
* Called "gentrome" (genome + transcriptome)
* Helps Salmon distinguish between:
 - True transcript mappings
 - Genomic DNA contamination
 - Incomplete splicing

5. **Build Salmon Index**
* `-t gentrome.fa.gz`: Combined transcriptome and genome
* `-d decoys.txt`: List of genomic decoy sequences
* `-p 60`: Use 60 threads (adjust based on your system)
* `-i salmon_index`: Output directory
* `--gencode`: Split GENCODE transcript names at the first `|` so that only the transcript ID is kept

### Why Use Decoys?
* Improves accuracy by reducing false mappings
* Helps handle:
 - Genomic DNA contamination
 - Unspliced RNA
 - Incomplete transcript annotations
* Particularly important for:
 - Novel transcript discovery
 - Alternative splicing analysis
 - Low-quality RNA samples
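
The decoy-list steps (extract FASTA headers, keep the first field, strip the `>`) can also be expressed in Python, which may help clarify what the shell pipeline produces. A minimal sketch with invented example lines; `decoy_names` is an illustrative helper and not a replacement for the commands used in this notebook.

```python
def decoy_names(fasta_lines):
    """Return sequence names from FASTA headers: drop '>' and anything after the first space."""
    return [
        line[1:].split(" ")[0].strip()
        for line in fasta_lines
        if line.startswith(">")
    ]

# Invented lines in the style of a genome FASTA file
example_fasta = [">chr1 1\n", "ACGTNNNN\n", ">chrM MT\n", "GATTACA\n"]
print(decoy_names(example_fasta))
```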

In [None]:
# Build the decoy-aware Salmon index (adjust -p to the number of threads available)
!mkdir -p salmon
# Decoy names: genome FASTA headers reduced to the sequence name
!gunzip -c genome/GRCh38.primary_assembly.genome.fa.gz | grep "^>" | cut -d " " -f 1 > salmon/decoys.txt
!sed -i.bak -e 's/>//g' salmon/decoys.txt
# Transcripts must come first, followed by the genome sequences that act as decoys
!cat genome/gencode.v47.transcripts.fa.gz genome/GRCh38.primary_assembly.genome.fa.gz > salmon/gentrome.fa.gz
!cd salmon && salmon index -t gentrome.fa.gz -d decoys.txt -p 60 -i salmon_index --gencode

## Quality Control with fastp and MultiQC

Before quantification, raw sequencing reads need quality control and preprocessing. We use two powerful tools: fastp for read processing and MultiQC for quality reporting.

### fastp: Modern FASTQ Preprocessing
`fastp` performs multiple essential steps in a single, efficient pass through the data:

1. **Quality Control Metrics**
  * Base quality distribution
  * Sequence content analysis
  * GC content distribution
  * Sequence length distribution
  * Duplication rate analysis

2. **Read Trimming Actions**
  * Removes low-quality bases (below Phred score threshold)
  * Trims adapter sequences automatically
  * Discards reads that contain too many N bases
  * Filters reads by length and complexity
  * Performs sliding window quality trimming

3. **Key Parameters We Use**
  * `--qualified_quality_phred 15`: Phred score at or above which a base counts as qualified; reads with too many unqualified bases are discarded
  * `--length_required 36`: Minimum read length after trimming
  * `--cut_right`: Enable sliding window trimming
  * `--cut_right_window_size 4`: Window size for quality checking
  * `--thread`: Utilize multiple CPU cores
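
The sliding-window idea behind `--cut_right` can be sketched in a few lines: move a window from the start of the read toward the end, and at the first window whose mean quality falls below the threshold, drop that window and everything after it. This is a simplified illustration and not fastp's implementation; the `cut_right` function and the quality values are invented for the example.

```python
def cut_right(quals, window=4, mean_q=15):
    """Return how many leading bases are kept after sliding-window trimming."""
    for start in range(len(quals) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < mean_q:
            return start  # drop this window and everything to its right
    return len(quals)     # no window fell below the threshold

qualities = [30] * 10 + [2] * 6   # invented read: good start, poor tail
kept = cut_right(qualities)
print(f"keep {kept} of {len(qualities)} bases")
```

With `--length_required 36`, a read trimmed to fewer than 36 bases in this way would then be discarded entirely.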

### MultiQC: Aggregating Quality Reports

MultiQC collects reports from multiple samples and tools to create a comprehensive quality overview:

1. **Data Collection**
  * Scans directories recursively
  * Finds reports from various bioinformatics tools
  * Recognizes fastp JSON reports automatically

2. **Generated Reports Include**
  * Interactive HTML dashboard
  * Sample comparison plots
  * Quality metrics across all samples
  * Flags potential quality issues

3. **Key Visualizations**
  * Per-base sequence quality
  * Sequence length distributions
  * Adapter content
  * Duplication rates
  * GC content
  * Quality score distributions

### Quality Assessment Flow
1. Raw reads → fastp → Trimmed reads
2. fastp generates per-sample reports
3. MultiQC combines all reports
4. Review MultiQC dashboard:
  * Look for systematic quality issues
  * Compare samples for batch effects
  * Identify potential outliers
  * Verify trimming effectiveness

The combination of fastp's efficient processing and MultiQC's comprehensive reporting provides a robust quality control pipeline for RNA-seq data.

In [None]:
# Suppose you already have a dictionary: srr_dict = { "exp1": ["SRR123", "SRR124"], "exp2": ["SRR999"] }
# And you have raw FASTQ files in 'fastq/' named SRR123.fastq.gz, SRR124.fastq.gz, SRR999.fastq.gz, etc.

# Then just do:
run_fastp_with_concat(
    srr_dict,
    fastq_dir="fastq",
    trimmed_dir="trimmed_reads",
    report_dir="fastp_reports",
    phred_cutoff=15,
    min_length=36,
    cpu_count=16
)




In [None]:
!multiqc fastp_reports --filename multiqc_fastp_report --title "Fastp QC Report" -o fastp_reports
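
Alongside the MultiQC dashboard, the per-sample fastp JSON reports can be inspected programmatically. The sketch below computes the fraction of reads that survived filtering. It assumes the report stores read counts under `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads`; verify these keys against one of your own reports. The example report contains invented numbers.

```python
import json

def reads_retained(report):
    """Fraction of reads that survived filtering, from a parsed fastp JSON report."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    return after / before if before else 0.0

# Invented numbers in the assumed report layout
example_report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
print(f"{reads_retained(example_report):.1%}")
```

For a real report, load it with `json.load(open(path))` from the `fastp_reports/` directory and pass the resulting dictionary to the function.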


## Salmon Quantification and Expression Estimation

Salmon performs transcript quantification through a two-phase process that produces accurate abundance estimates without traditional alignment.

### Understanding Salmon Quantification

1. **Mapping Phase**
* Uses lightweight mapping (selective alignment when `--validateMappings` is in effect) instead of traditional genome alignment
* Maps reads to transcripts using the pre-built index
* Much faster than traditional aligners
* No BAM files are generated

2. **Quantification Phase**
* Employs an expectation-maximization (EM) algorithm
* Resolves multi-mapping reads probabilistically
* Can correct for positional, sequence-specific, and GC biases when the corresponding flags are supplied (none are enabled in this notebook)
* Estimates transcript abundances in TPM and counts

### Key Parameters Used
* `-i salmon_index`: Path to pre-built Salmon index
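* `-l SR`: Library type (`SR` = stranded single-end library with reads from the reverse strand; `A` asks Salmon to detect the type automatically)
* `-p`: Number of threads for parallel processing
* `--validateMappings`: Enable selective alignment, which scores candidate mappings; this is the default behaviour in Salmon 1.0 and later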
* `-r`: Input trimmed FASTQ file(s)



In [None]:
# Example usage:
# Suppose we have a dictionary of groups -> SRR IDs from get_and_fetch_srr:
# srr_dict = {
#   "Series5_A549_SARS-CoV-2_1": ["SRR11517677"],
#   "Series5_A549_SARS-CoV-2_2": ["SRR11517678"],
#   "Series5_A549_SARS-CoV-2_3": ["SRR11517679"],
#   "Series5_A549_Mock_1"     : ["SRR11517674"],
#   "Series5_A549_Mock_2"     : ["SRR11517675"],
#   "Series5_A549_Mock_3"     : ["SRR11517676"]
# }
#
# Quantify the concatenated, trimmed FASTQ of each experiment
run_salmon_quant_new(srr_dict, salmon_index="salmon/salmon_index", output_dir="salmon_quant")

### Output Files (.sf format)
Each quantification produces a `quant.sf` file containing:
* `Name`: Transcript identifier
* `Length`: Transcript length
* `EffectiveLength`: Length adjusted for fragment size distribution
* `TPM`: Transcripts Per Million
* `NumReads`: Estimated number of reads mapping to transcript
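
The relationship between these columns can be made concrete: TPM is each transcript's read rate (`NumReads / EffectiveLength`) divided by the sum of all rates and scaled to one million. The sketch below recomputes TPM from a tiny tab-separated table in the `quant.sf` layout. The two transcripts and their values are invented, and `tpm_from_quant` is an illustrative helper.

```python
import csv
import io

def tpm_from_quant(quant_text):
    """Recompute TPM from the NumReads and EffectiveLength columns of quant.sf-style text."""
    rows = list(csv.DictReader(io.StringIO(quant_text), delimiter="\t"))
    rates = {
        row["Name"]: float(row["NumReads"]) / float(row["EffectiveLength"])
        for row in rows
    }
    total = sum(rates.values())
    return {name: rate / total * 1e6 for name, rate in rates.items()}

# Invented two-transcript example in the quant.sf column layout
example_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx_A\t350\t100.0\t500000.0\t10\n"
    "tx_B\t550\t300.0\t500000.0\t30\n"
)
print(tpm_from_quant(example_quant))
```

Although `tx_B` has three times as many reads, it is also three times as long, so both transcripts receive the same TPM.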

### Sample Output Structure

In [None]:
!head salmon_quant/Series5_A549_Mock_1/quant.sf

In [None]:
import pandas as pd

# Convert dictionary to dataframe
df = pd.DataFrame({
    'sample_id': list(srr_dict.keys()),
    'srr_numbers': list(srr_dict.values())
})

# Extract information from sample_id
df['condition'] = df['sample_id'].str.extract('_(Mock|SARS-CoV-2)_')
df['cell_type'] = df['sample_id'].str.extract('_(A549-ACE2|A549)_')

# Sort to match your desired order
df = df.sort_values(['cell_type', 'condition', 'sample_id'])

# Select and order columns
df_final = df[['sample_id', 'condition', 'cell_type']]

# Save to CSV
df_final.to_csv('sample_info.csv', index=False)

# Print preview
print(df_final)
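
The sample names used in this module encode series, cell type, condition, and replicate. As an alternative to two separate `str.extract` calls, a single anchored regular expression can parse all four fields and fail loudly on unexpected names. This is a sketch that assumes the naming scheme used in `gsm_data`; `parse_sample_id` is a hypothetical helper.

```python
import re

# A549-ACE2 is listed before A549 so that the longer name is tried first
SAMPLE_PATTERN = re.compile(r"^(Series\d+)_(A549-ACE2|A549)_(Mock|SARS-CoV-2)_(\d+)$")

def parse_sample_id(sample_id):
    """Split a sample name into series, cell type, condition, and replicate number."""
    match = SAMPLE_PATTERN.match(sample_id)
    if match is None:
        raise ValueError(f"Unexpected sample name: {sample_id}")
    series, cell_type, condition, replicate = match.groups()
    return {
        "series": series,
        "cell_type": cell_type,
        "condition": condition,
        "replicate": int(replicate),
    }

print(parse_sample_id("Series16_A549-ACE2_SARS-CoV-2_3"))
```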

In [None]:
srr_dict