# Week 5: Simple single-cell RNA-seq analysis

In this assignment, you will:

1. fetch some single-cell data and create a cell-gene expression matrix and the associated AnnData object;
2. cluster the data to understand the cell structure; and
3. annotate the data with biologically relevant information.

All steps are based on the Single-cell Best Practices book (https://www.sc-best-practices.org/). Feel free to use it! You will still need to automate the process and ensure that it runs end to end.

## Step 0: Setup

### Install Dependencies

In [1]:
from pathlib import Path
import urllib.request
import subprocess
import gzip
import shutil
import pyroe
import numpy as np
import tarfile


### Set up working directories

In [2]:
notebook_dir = Path.cwd()
data_dir = notebook_dir / "data"
data_dir.mkdir(parents=True, exist_ok=True)
print(f"Working directory: {notebook_dir}")
print(f"Data directory: {data_dir}")

Working directory: /mnt/c/Users/Nick/documents/github/fall25-csc-bioinf/week6/code
Data directory: /mnt/c/Users/Nick/documents/github/fall25-csc-bioinf/week6/code/data


## Step 1:

Get the sample data here: https://app.box.com/s/lx2xownlrhz3us8496tyu9c4dgade814. This data contains the single-cell FASTQ files, as well as the reference genome (chr5) and the GTF file with transcript information.

The list of whitelist barcodes is available [here](https://github.com/f0t1h/3M-february-2018/raw/refs/heads/master/3M-february-2018.txt.gz).

In [3]:
whitelist_url = "https://raw.githubusercontent.com/f0t1h/3M-february-2018/master/3M-february-2018.txt.gz"
whitelist_gz = data_dir / "3M-february-2018.txt.gz"
whitelist_path = data_dir / "whitelist.txt"

if not whitelist_path.exists():
    print("Downloading 10x whitelist...")
    urllib.request.urlretrieve(whitelist_url, whitelist_gz)
    print("Extracting whitelist...")
    with gzip.open(whitelist_gz, 'rb') as f_in:
        with open(whitelist_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    whitelist_gz.unlink()
    print(f"Whitelist saved to: {whitelist_path}")
else:
    print(f"Whitelist already exists: {whitelist_path}")

Whitelist already exists: /mnt/c/Users/Nick/documents/github/fall25-csc-bioinf/week6/code/data/whitelist.txt


In [4]:
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)
toy_dataset_gz = data_dir / "toy_read_ref_set.tar.gz"

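# The archive is not downloaded by this notebook; fetch it from the Box link above first.
if not toy_dataset_gz.exists():
    raise FileNotFoundError(f"Dataset archive not found: {toy_dataset_gz}")

print("Extracting dataset")

with tarfile.open(toy_dataset_gz, 'r:gz') as tar:
    tar.extractall(path=data_dir)
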
toy_ref_read = data_dir / "toy_ref_read"
fastq_dir = toy_ref_read / "toy_read_fastq"
ref_dir = toy_ref_read / "toy_human_ref"
r1_fastq = fastq_dir / "selected_R1_reads.fastq"
r2_fastq = fastq_dir / "selected_R2_reads.fastq"
genome_fa = ref_dir / "fasta" / "genome.fa"
genes_gtf = ref_dir / "genes" / "genes.gtf"

files_ok = True

for filepath, description in [
    (r1_fastq, "R1 reads"),
    (r2_fastq, "R2 reads"),
    (genome_fa, "Genome FASTA"),
    (genes_gtf, "GTF annotation")
]:
    exists = filepath.exists()
    print(f"{description}: {filepath}")
    if not exists:
        files_ok = False

if files_ok:
    print("Files successfully extracted")
else:
    print("Some files are missing")

Extracting dataset
R1 reads: data/toy_ref_read/toy_read_fastq/selected_R1_reads.fastq
R2 reads: data/toy_ref_read/toy_read_fastq/selected_R2_reads.fastq
Genome FASTA: data/toy_ref_read/toy_human_ref/fasta/genome.fa
GTF annotation: data/toy_ref_read/toy_human_ref/genes/genes.gtf
Files successfully extracted


## Step 2:

Use Alevin-fry to map the reads to the reference and quantify gene expression. See [here](https://www.sc-best-practices.org/introduction/raw_data_processing.html#a-real-world-example) for the installation steps.
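
One possible way to automate this step is sketched below. It follows the command sequence from the linked Best Practices example (`pyroe make-splici`, `salmon index`, `salmon alevin --sketch`, then the `alevin-fry` sub-commands) and assumes that `pyroe`, `salmon`, and `alevin-fry` are on the `PATH`. The helper names (`make_splici_cmd`, `quant_cmds`, `run_pipeline`), the output directory names, the read length of 90, the `--chromiumV3` chemistry, and the thread count are illustrative assumptions, not requirements of the assignment.

```python
import subprocess
from pathlib import Path


def make_splici_cmd(genome_fa, genes_gtf, out_dir, read_length=90):
    """Command that builds a spliced+intronic (splici) reference with pyroe."""
    return ["pyroe", "make-splici", str(genome_fa), str(genes_gtf),
            str(read_length), str(out_dir)]


def quant_cmds(splici_fa, t2g_3col, r1, r2, whitelist, work_dir, threads=8):
    """Commands for indexing, mapping, and alevin-fry quantification, in order."""
    work_dir = Path(work_dir)
    index = work_dir / "salmon_index"        # hypothetical directory names
    rad = work_dir / "salmon_alevin"
    gpl = work_dir / "alevin_fry_gpl"
    quant = work_dir / "alevin_fry_quant"
    t = str(threads)
    return [
        ["salmon", "index", "-t", str(splici_fa), "-i", str(index), "-p", t],
        # Chemistry is assumed to be 10x Chromium v3 (matches the 3M whitelist).
        ["salmon", "alevin", "-i", str(index), "-l", "ISR",
         "-1", str(r1), "-2", str(r2), "-p", t, "-o", str(rad),
         "--chromiumV3", "--sketch"],
        ["alevin-fry", "generate-permit-list", "-u", str(whitelist),
         "-d", "fw", "-i", str(rad), "-o", str(gpl)],
        ["alevin-fry", "collate", "-i", str(gpl), "-r", str(rad), "-t", t],
        ["alevin-fry", "quant", "-r", "cr-like", "-m", str(t2g_3col),
         "-i", str(gpl), "-o", str(quant), "-t", t],
    ]


def run_pipeline(genome_fa, genes_gtf, r1, r2, whitelist, work_dir):
    """Run every command in order, stopping at the first failure."""
    splici_dir = Path(work_dir) / "splici_ref"
    subprocess.run(make_splici_cmd(genome_fa, genes_gtf, splici_dir), check=True)
    # Output file names depend on the read length, so locate them by pattern.
    splici_fa = next(splici_dir.glob("*.fa"))
    t2g_3col = next(splici_dir.glob("*3col.tsv"))
    for cmd in quant_cmds(splici_fa, t2g_3col, r1, r2, whitelist, work_dir):
        subprocess.run(cmd, check=True)
    return Path(work_dir) / "alevin_fry_quant"
```

With the variables defined in Step 1, a call could look like `quant_dir = run_pipeline(genome_fa, genes_gtf, r1_fastq, r2_fastq, whitelist_path, data_dir)`. The resulting directory can then be loaded as an AnnData object, for example with `pyroe.load_fry(str(quant_dir), output_format="scRNA")`; which output format is appropriate depends on how you want spliced, unspliced, and ambiguous counts combined.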

## Step 3: 

Perform cell clustering with the Leiden community-detection algorithm, then output the clustering plot.
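
A minimal sketch of this step with scanpy follows, assuming `scanpy`, `leidenalg`, and `igraph` are installed and that `adata` holds raw counts in `adata.X`. The function names, the resolution of 1.0, and the output file name are illustrative choices; quality-control filtering and highly-variable-gene selection are left out to keep the example short.

```python
def safe_n_comps(n_obs, n_vars, requested=50):
    """Keep the PCA dimension below the matrix size (relevant for toy data)."""
    return max(1, min(requested, n_obs - 1, n_vars - 1))


def cluster_and_plot(adata, resolution=1.0, plot_path="umap_leiden.png"):
    """Normalize, embed, cluster with Leiden, and save a UMAP coloured by cluster."""
    import scanpy as sc  # imported lazily so the helper above has no dependencies

    adata.layers["counts"] = adata.X.copy()       # keep the raw counts around
    sc.pp.normalize_total(adata, target_sum=1e4)  # 10,000 counts per cell
    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=safe_n_comps(adata.n_obs, adata.n_vars))
    sc.pp.neighbors(adata)                        # k-nearest-neighbour graph on the PCs
    sc.tl.leiden(adata, resolution=resolution, key_added="leiden")
    sc.tl.umap(adata)
    ax = sc.pl.umap(adata, color="leiden", show=False)
    ax.figure.savefig(plot_path, dpi=150, bbox_inches="tight")
    return adata
```

Normalizing to 10,000 counts per cell followed by `log1p` is chosen here because CellTypist expects input in that form, so the same object can be reused in Step 4.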

## Step 4:

Perform automatic cell annotation via CellTypist. Annotate the plot with the cell types.
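
The sketch below shows one way this could look, assuming `celltypist` and `scanpy` are installed, that `adata.X` is log1p-normalized to 10,000 counts per cell, and that a UMAP embedding has already been computed. CellTypist models are keyed by gene symbol, while the quantification step typically yields gene IDs, so the example first maps IDs to symbols using a two-column table (for instance one derived from `genes.gtf`). The model name `Immune_All_Low.pkl` is an assumption; pick the model that matches your tissue. All function names and the output file name are hypothetical.

```python
def read_id_to_name(tsv_path):
    """Read a two-column, tab-separated gene_id -> gene_name table."""
    mapping = {}
    with open(tsv_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:  # skip malformed lines
                mapping[fields[0]] = fields[1]
    return mapping


def to_gene_symbols(gene_ids, id_to_name):
    """Map gene IDs to symbols, keeping the ID when no symbol is known."""
    return [id_to_name.get(g) or g for g in gene_ids]


def annotate_cell_types(adata, id_to_name, model="Immune_All_Low.pkl",
                        plot_path="umap_celltypes.png"):
    """Predict cell types with CellTypist and save a UMAP coloured by label."""
    import celltypist
    import scanpy as sc

    adata.var_names = to_gene_symbols(adata.var_names, id_to_name)
    adata.var_names_make_unique()
    celltypist.models.download_models(model=model)  # cached after the first call
    predictions = celltypist.annotate(adata, model=model, majority_voting=True)
    annotated = predictions.to_adata()  # adds predicted_labels, majority_voting, ...
    ax = sc.pl.umap(annotated, color="majority_voting", show=False)
    ax.figure.savefig(plot_path, dpi=150, bbox_inches="tight")
    return annotated
```

Colouring by `majority_voting` rather than `predicted_labels` gives one label per local cluster, which usually produces a cleaner plot; either column can be used.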

## Time Estimate: 8 Hours