# Day 1: Introduction & Quality Control (Total 4 hours)

## (30 min) Welcome & Course Overview

Introduction to the course structure, goals, and evaluation criteria.

Speaker: Joan Camuñas-Soler.

***

## (1h) Lecture: Introduction to Single-Cell Genomics

Topics:

Applications in precision medicine

Experimental workflows

Overview of common datasets

Challenges in single-cell analysis

***

# Workshop 1: Quality Control (2.5 h)

## 1. Data Import and Initial Exploration

In [None]:
import scanpy as sc
import numpy as np

# Load the dataset
adata = sc.read_h5ad('data.h5ad')

# Overview of the data
print(adata)
adata.var_names_make_unique()
adata.obs_names_make_unique()

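# Summary statistics
# np.asarray(...).ravel() works whether adata.X is sparse or dense
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata.obs['n_genes_by_counts'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
# Adds total_counts, n_genes_by_counts, and related metrics to adata.obs and adata.var
sc.pp.calculate_qc_metrics(adata, inplace=True)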

### Discussion:

What do the summary statistics tell us about the data?

Are there cells with extremely high or low counts?
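One way to make "extremely high or low" concrete is to flag cells that lie several median absolute deviations (MADs) from the median. The following is a minimal sketch on made-up numbers, not part of the workshop notebook; the helper name `is_outlier` and the threshold of 5 MADs are illustrative assumptions.

```python
import numpy as np

def is_outlier(values, n_mads=5.0):
    """Flag entries more than n_mads median absolute deviations from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad

# Hypothetical per-cell total counts; the last two cells are deliberately extreme.
total_counts = np.array([4800, 5200, 5100, 4950, 5050, 150, 60000])

# Work on the log scale because count depth is strongly right-skewed.
flags = is_outlier(np.log1p(total_counts), n_mads=5.0)
print(flags)
```

In a real analysis the same function could be applied to `adata.obs['total_counts']`, and the threshold would need to be checked against the violin and scatter plots rather than taken as given.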

## 2. Identifying and Annotating Mitochondrial, Ribosomal, and Hemoglobin Genes

In [None]:
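# The prefixes below assume human gene symbols (mouse symbols use 'mt-', 'Rps', 'Rpl')
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['ribo'] = adata.var_names.str.startswith(('RPS', 'RPL'))
# Hemoglobin genes: 'HB' followed by any character other than P, E, or S
adata.var['hb'] = adata.var_names.str.contains(r'^HB[^PES]')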

# Calculate the percentage of mitochondrial, ribosomal, and hemoglobin counts;
# this adds pct_counts_mt, pct_counts_ribo, and pct_counts_hb to adata.obs
sc.pp.calculate_qc_metrics(
    adata, qc_vars=['mt', 'ribo', 'hb'], percent_top=None, log1p=False, inplace=True)

### Discussion:

What is the expected proportion of mitochondrial and ribosomal content in different cell types?
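Before trusting the percentages, it can help to check the gene-family patterns on a handful of symbols and to work one percentage by hand. This sketch uses plain NumPy and `re` on a toy matrix; the gene list and counts are invented for illustration, and the hemoglobin pattern is written as a character class that excludes P, E, and S, as in the workshop cell.

```python
import re
import numpy as np

# Hypothetical gene symbols chosen to exercise each pattern.
genes = ["MT-CO1", "RPS6", "RPL13", "HBB", "HBA1", "HBP1", "HBEGF", "HBS1L", "ACTB"]

mt = np.array([g.startswith("MT-") for g in genes])
ribo = np.array([g.startswith(("RPS", "RPL")) for g in genes])
# re.match anchors at the start: 'HB' followed by any character except P, E, or S.
hb = np.array([re.match(r"HB[^PES]", g) is not None for g in genes])

# Toy counts: 2 cells x 9 genes, each cell summing to 100 for easy checking.
counts = np.array([
    [10, 5, 5, 0, 0, 1, 1, 1, 77],
    [50, 2, 3, 20, 5, 0, 0, 0, 20],
])
totals = counts.sum(axis=1)
pct_mt = counts[:, mt].sum(axis=1) / totals * 100
pct_hb = counts[:, hb].sum(axis=1) / totals * 100

print("hemoglobin matches:", [g for g, m in zip(genes, hb) if m])
print("pct_mt:", pct_mt, "pct_hb:", pct_hb)
```

Note that excluding "E" also drops HBE1, which is a genuine hemoglobin gene, so the pattern is worth reviewing against the gene names in the actual dataset.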

## 3. Visualization of QC Metrics

In [None]:
# Assumes adata.obs contains a 'sample' column identifying each sample
sc.pl.violin(adata, ['n_genes_by_counts', 'n_counts', 'pct_counts_mt', 'pct_counts_ribo', 'pct_counts_hb'],
             jitter=0.4, groupby='sample', rotation=45)
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', color='pct_counts_mt')

### Discussion:

How are the different quality control measures correlated?

Why do some cells have high mitochondrial or ribosomal content?
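The correlation question can also be answered numerically with a correlation matrix of the QC metrics. The sketch below runs on synthetic metrics whose relationships are assumptions built into the simulation (gene count rising with depth, mitochondrial percentage falling with depth); real data may look different, and in practice the inputs would be the corresponding columns of `adata.obs`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500

# Synthetic metrics (assumption): deeper cells detect more genes and
# shallow cells carry a higher mitochondrial percentage.
total_counts = rng.lognormal(mean=8.5, sigma=0.5, size=n_cells)
n_genes = 40 * np.sqrt(total_counts) * rng.normal(1.0, 0.05, size=n_cells)
pct_mt = np.clip(
    30 - 3 * np.log(total_counts) + rng.normal(0, 1.5, size=n_cells), 0, 100
)

# np.corrcoef treats each row as one variable.
metrics = np.vstack([np.log1p(total_counts), np.log1p(n_genes), pct_mt])
corr = np.corrcoef(metrics)

labels = ["log_total_counts", "log_n_genes", "pct_mt"]
for label, row in zip(labels, np.round(corr, 2)):
    print(f"{label:>17}: {row}")
```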

## 4. Filtering Low-Quality Cells and Genes

In [None]:
# Keep cells with at least 200 detected genes
sc.pp.filter_cells(adata, min_genes=200)
# Keep genes detected in at least 3 cells
sc.pp.filter_genes(adata, min_cells=3)
print(f"Remaining cells: {adata.n_obs}, Remaining genes: {adata.n_vars}")

### Discussion:

Why are minimum thresholds for genes and cells important?

How does filtering affect downstream analyses?
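To see how the thresholds interact, the two filters can be re-implemented on a small synthetic matrix and run at several cutoffs. This is a rough NumPy sketch, not the scanpy implementation; `filter_counts` is a hypothetical helper and the simulated counts are not course data.

```python
import numpy as np

def filter_counts(counts, min_genes, min_cells):
    """Drop cells with fewer than min_genes detected genes,
    then drop genes detected in fewer than min_cells remaining cells."""
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    kept = counts[keep_cells]
    keep_genes = (kept > 0).sum(axis=0) >= min_cells
    return kept[:, keep_genes]

rng = np.random.default_rng(1)
# Synthetic counts: 300 cells x 1000 genes (assumption).
counts = rng.poisson(lam=0.3, size=(300, 1000))
# Make the first 30 cells nearly empty to mimic low-quality barcodes.
counts[:30] = rng.poisson(lam=0.02, size=(30, 1000))

for min_genes in (50, 200, 300):
    filtered = filter_counts(counts, min_genes=min_genes, min_cells=3)
    print(f"min_genes={min_genes}: {filtered.shape[0]} cells, {filtered.shape[1]} genes")
```

The order matters: removing cells first can push additional genes below the `min_cells` threshold, which is why the gene filter is applied to the already filtered matrix.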

## 5. Identifying Highly Expressed Genes

In [None]:
sc.pl.highest_expr_genes(adata, n_top=20)

### Discussion:

How can very highly expressed genes such as MALAT1 affect data interpretation?
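The quantity behind this kind of plot is each gene's share of a cell's total counts, averaged over cells. The following sketch computes that ranking with NumPy on a toy matrix; `top_genes_by_fraction` is a hypothetical helper and the counts are invented so that MALAT1 dominates.

```python
import numpy as np

def top_genes_by_fraction(counts, gene_names, n_top=3):
    """Rank genes by their mean share of each cell's total counts.
    Assumes every cell has at least one count."""
    counts = np.asarray(counts, dtype=float)
    fractions = counts / counts.sum(axis=1, keepdims=True)
    mean_fraction = fractions.mean(axis=0)
    order = np.argsort(mean_fraction)[::-1][:n_top]
    return [(gene_names[i], float(mean_fraction[i])) for i in order]

# Toy matrix (assumption): 3 cells x 4 genes.
gene_names = ["MALAT1", "ACTB", "CD3E", "MS4A1"]
counts = np.array([
    [60, 20, 15, 5],
    [50, 30, 0, 20],
    [70, 10, 20, 0],
])

top = top_genes_by_fraction(counts, gene_names, n_top=2)
for gene, frac in top:
    print(f"{gene}: {frac:.0%} of counts on average")
```

When a single gene takes up a large share of every cell's counts, total-count normalization scales all other genes down accordingly, which is one reason to inspect this ranking before normalizing.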

## 6. Filtering Based on Mitochondrial and Ribosomal Content

In [None]:
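# Keep cells with < 20% mitochondrial and > 5% ribosomal counts;
# .copy() turns the filtered view into an independent AnnData object
keep = (adata.obs['pct_counts_mt'] < 20) & (adata.obs['pct_counts_ribo'] > 5)
adata = adata[keep, :].copy()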
print(f"Remaining cells after mito/ribo filtering: {adata.n_obs}")

### Discussion:

How do we decide on appropriate cutoffs for mitochondrial and ribosomal content?

What are the trade-offs of aggressive filtering?
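A simple way to explore the trade-off is to sweep the cutoff and count how many cells survive at each value. The sketch below does this for a mitochondrial cutoff on ten made-up percentages; `retained_fraction` is a hypothetical helper, and in practice the input would be `adata.obs['pct_counts_mt']`.

```python
import numpy as np

def retained_fraction(pct_mt, cutoffs):
    """Fraction of cells whose mitochondrial percentage is strictly below each cutoff."""
    pct_mt = np.asarray(pct_mt, dtype=float)
    return {c: float((pct_mt < c).mean()) for c in cutoffs}

# Hypothetical mitochondrial percentages for ten cells.
pct_mt = [2.1, 3.5, 4.0, 6.2, 7.8, 9.9, 12.4, 18.0, 25.3, 41.0]

kept = retained_fraction(pct_mt, cutoffs=(5, 10, 20))
for cutoff, frac in kept.items():
    print(f"cutoff {cutoff:>2}%: keep {frac:.0%} of cells")
```

A cutoff at which the retained fraction changes sharply deserves a closer look, since it may be removing a genuine cell population rather than damaged cells.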

## 7. Normalization of Data

In [None]:
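# Scale each cell to 10,000 total counts so that cells of different depth are comparable
sc.pp.normalize_total(adata, target_sum=1e4)
# Log-transform the scaled values: log(1 + x)
sc.pp.log1p(adata)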

### Discussion:

Why is normalization critical for downstream analysis?

What methods exist besides log normalization?
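The arithmetic of total-count scaling followed by a log transform is short enough to write out directly. This NumPy sketch mirrors those two steps on a toy matrix; `normalize_log1p` is a hypothetical helper, not the scanpy implementation, and it assumes every cell has at least one count.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / totals * target_sum)

# Two toy cells with the same composition but a 10x difference in depth.
counts = np.array([
    [10, 30, 60],
    [100, 300, 600],
])

normalized = normalize_log1p(counts)
print(np.round(normalized, 3))
print("rows identical after normalization:", np.allclose(normalized[0], normalized[1]))
```

The two cells differ only in sequencing depth, so they become indistinguishable after scaling; this is the effect normalization is meant to achieve before cells are compared.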

## 8. Save the Preprocessed Data

In [None]:
adata.write('processed_data.h5ad')

***
## Key Takeaways:

Understanding the importance of quality control in single-cell data

Practical steps for identifying and removing low-quality cells

Interpretation of quality control metrics and visualizations

### Further Reading:

Amezquita et al., "Orchestrating single-cell analysis with Bioconductor" (2019)

Luecken & Theis, "Current best practices in single-cell RNA-seq analysis" (2019)

