An Atlas of Cells in the Human Tonsil

Palatine tonsils are secondary lymphoid organs (SLOs) representing the first line of immunological defense against inhaled or ingested pathogens. We generated an atlas of the human tonsil composed of >556,000 cells profiled across five different data modalities, including single-cell transcriptome, epigenome, proteome, and immune repertoire sequencing, as well as spatial transcriptomics. This census identified 121 cell types and states, defined developmental trajectories, and enabled an understanding of the functional units of the tonsil. Exemplarily, we stratified myeloid slan-like subtypes, established a BCL6 enhancer as locally active in follicle-associated T and B cells, and identified SIX5 as putative transcriptional regulator of plasma cell maturation. Analyses of a validation cohort confirmed the presence, annotation, and markers of tonsillar cell types and provided evidence of age-related compositional shifts. We demonstrate the value of this resource by annotating cells from B cell-derived mantle cell lymphomas, linking transcriptional heterogeneity to normal B cell differentiation states of the human tonsil.

This repository contains all the scripts, notebooks and reports to reproduce all analysis from our manuscript entitled An Atlas of Cells in the Human Tonsil, published in Immunity in 2024. Here, we describe how to access the data, document the most important packages and versions used, and explain how to navigate the directories and files in this repository.

Data

The data has been deposited in five levels of organization, from raw to processed data:

Level 1: raw data. All fastq files for all data modalities have been deposited at ArrayExpress under accession id E-MTAB-13687.
Level 2: matrices. All data modalities correspond to different technologies from 10X Genomics. As such, they were mapped with different flavors of CellRanger (CR). The most important files in the ‘‘outs’’ folder of every CR run (including all matrices) have been deposited in Zenodo.
Level 3: Seurat Objects. All data was analyzed within the Seurat ecosystem. We have archived in Zenodo all Seurat Objects that contain the raw and processed counts, dimensionality reductions (PCA, Harmony, UMAP), and metadata needed to reproduce all figures from this manuscript.
Level 4: to allow for programmatic and modular access to the whole tonsil atlas dataset, we developed HCATonsilData, available on BioConductor. HCATonsilData provides a vignette which documents how to navigate and understand the data. It also provides access to the glossary to traceback all annotations in the atlas. In addition, we will periodically update the annotations as we refine it with suggestions from the community.
Level 5: interactive mode. Our tonsil atlas has been included as a reference in Azimuth, which allows interactive exploration of cell type markers on the web.

We refer to the READMEs in the Zenodo repositories for an explanation of how to access the matrices and Seurat objects. Users can follow these steps to download the fastq files:

Download a tsv file that lists all the fastq files available in the repository and filter them to the modality of choice. We will exemplify it here with scRNA-seq. Note: the URL was copied from the section "Download report" from secondary section id ERP156759 of ArrayExpress.

PATH_TSV="https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB75429&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,submitted_ftp,sra_ftp,bam_ftp&format=tsv&download=true&limit=0"
wget -O fastq_list_tonsil_atlas.tsv "$PATH_TSV"
grep "scRNA-seq" fastq_list_tonsil_atlas.tsv > fastq_list_tonsil_atlas_scRNA.tsv

Create metadata using the names of the fastq files. Inspired by The Cancer Genome Atlas (TCGA), we have encoded all the metadata for a single fastq file in the name itself:

[TECHNOLOGY].[DONOR_ID].[SUBPROJECT].[GEM_ID].[LIBRARY_ID].[LIBRARY_TYPE].[PAIR_ID].[READ].fastq.gz

Here's a more detailed description of each field:

TECHNOLOGY: scRNA-seq, scATAC-seq, Multiome, CITE-seq+scVDJ-seq, and spatial transcriptomics (Visium). We also include the fastq files associated with the multiome experiments performed on two mantle cell lymphoma patients (MCL).
DONOR_ID: identifier for each of the 17 patients included in the cohort. We provide the donor-level metadata in the file "tonsil_atlas_donor_metadata.csv", including the hospital, sex, age, age group, cause for tonsillectomy and cohort type for every donor.
SUBPROJECT: each subproject corresponds to one run of the 10x Genomics Chromium™ Chip.
GEM_ID: each run of the 10x Genomics Chromium™ Chip consists of up to 8 "GEM wells" (see 10X documentation): a set of partitioned cells (Gel Beads-in-emulsion) from a single 10x Genomics Chromium™ Chip channel. We give a unique identifier to each of these channels.
LIBRARY_ID: one or more sequencing libraries can be derived from a GEM well. For instance, multiome yields two libraries (ATAC and RNA) and CITE-seq+scVDJ yields 4 libraries (RNA, ADT, BCR, TCR).
LIBRARY_TYPE: the type of library for each library_id. Note that we used cell hashing () for a subset of the scRNA-seq libraries, and thus the library_type can be "not_hashed", "hashed_cdna" (RNA expression) or "hashed_hto" (the hashtag oligonucleotides).
PAIR_ID: to increase sequencing depth, each library was sequenced in more than one lane. This id identifies the pairs of fastq files (R1, R2) for a given subproject. It is used in the scripts below to assign the lane (e.g. L001).
READ: for scATAC-seq we have three reads (R1, R2 or R3), see cellranger-atac's documentation. While we find these names to be the most useful, they need to be changed to follow cellranger's conventions.

cut -f8 fastq_list_tonsil_atlas_scRNA.tsv | cut -d'/' -f6 | cut -d';' -f1 > fastq_names_scRNA.txt
echo "technology,donor_id,subproject,gem_id,library_id,library_type,lane,read" > tonsil_atlas_fastq_metadata.csv
sed 's/\.fastq\.gz//g' fastq_names_scRNA.txt | sed 's/\./,/g' >> tonsil_atlas_fastq_metadata.csv

Each line of "fastq_list_tonsil_atlas_scRNA.tsv" contains both read 1 (R1) and read 2 (R2). With the code above we are subsetting R1.

IMPORTANT NOTE: all fastq files derived from a single GEM well need to be mapped together because they correspond to the same set of cells and hence are given the same set of cell barcodes.

We will exemplify the following steps with both kind of libraries (hashed and not_hashed), whichg will ensure that we understand the intricacies of both types. We will focus on two GEM_IDs (one of each type), but users should be able to parallelize the same procedure across all gem_ids using different jobs in a high performance computer (HPC, aka cluster).

Download cellranger. As of 4/2/2024, the most recent version is v8.0.1:

# Download cellranger 8.0.1
curl -o cellranger-8.0.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-8.0.1.tar.gz?Expires=1717561298&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=iZkk2k4-4RDbLSuDzbjBzW4C71N2TvyPy0NTYq1l9T2SxqMNeGtLiNuVE9O12OY72G~DStTUjQwB4oa1sawTbO0eOoiXK0SpM8lNuHJYdhhsP~sgmLfSVDJHfMNm0gyWBTx0sjv~A22sHtZ1Tdpp369z9vnbzZbvpLmrP7XYPvfyGMa98DWsRXSABVHAFMrJiuuyWU2JmooG-fbNA4Gs6BhLy9YbFqm-bhcR6znfKtQm76PqmB0M1Y22BioPgh1UHCp85Q1uDN5Pel73zhgDuac0Xz7Ri~AkLD1Tz3Z0ZX37gIFajH5-nCyLpak83UF9ViMfnb3~mo8lDQ4aMkLlbw__"
tar -xzf cellranger-8.0.1.tar.gz


# Download human reference
curl -O "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz"
tar -xzf refdata-gex-GRCh38-2024-A.tar.gz

Not hashed

3.1. Download fastqs for gem_id y7qn780g_p6jkgk63:

gem_id="y7qn780g_p6jkgk63"
fastqs_urls=$(grep $gem_id fastq_list_tonsil_atlas_scRNA.tsv | cut -f8)
mkdir -p $gem_id/fastqs
for url in $fastqs_urls;
do
    url_fastq1=$(echo $url | cut -d';' -f1)
    echo $url_fastq1
    wget -P $gem_id/fastqs $url_fastq1
    url_fastq2=$(echo $url | cut -d';' -f2)
    echo $url_fastq2
    wget -P $gem_id/fastqs $url_fastq2
done

3.2. Rename fastq files following cellranger conventions.

# Create an associative array to hold the mapping from pair_id to lane_id
declare -A lane_dict
counter=1
pair_ids=$(ls $gem_id/fastqs | cut -d'.' -f7 | sort | uniq)
for pair_id in $pair_ids; do
    echo $pair_id
    lane_dict[$pair_id]=$counter
    ((counter++))
done


# Change fastq names to cellranger-friendly
for fastq in $(ls $gem_id/fastqs); do
    read=$(echo $fastq | cut -d'.' -f8)
    pair_id=$(echo $fastq | cut -d'.' -f7)
    lane=${lane_dict[$pair_id]}
    new_name="${gem_id}_S1_L00${lane}_${read}_001.fastq.gz"
    echo "Creating symbolic link for $fastq as $new_name"
    ln -s "$fastq" "$gem_id/fastqs/$new_name"
done

3.3. Run cellranger count:

cellranger-8.0.1/cellranger count --id="run_count_${gem_id}" \
   --fastqs=$gem_id/fastqs \
   --sample=$gem_id \
   --transcriptome=refdata-gex-GRCh38-2024-A \
   --include-introns false \
   --create-bam false

Hashed

Step 3.1 is common for cell-hashed and not hashed samples, but steps 3.2 and 3.3 is different. For step 3.2 we need to declare which fastqs are cDNA and which are hashtag oligonucleotides (HTO). For step 3.3, we need to specify the arguments --feature-ref and --libraries arguments (see cellranger count --help for more info). In addition, we need to download the feature-barcoding metadata to specify the --feature-ref argument. Let us exemplify it with gem_id dvdzq8et_eot75su8

gem_id="dvdzq8et_eot75su8"
working_dir=$(pwd)

# Download feature barcoding
URL_HASHING_METADATA=https://www.ebi.ac.uk/biostudies/files/E-MTAB-13687/cell_hashing_metadata.csv
wget -O cell_hashing_metadata.csv "$URL_HASHING_METADATA"


# Rename fastqs cDNA
mkdir -p $gem_id/fastqs/cDNA
fastqs_dna=$(ls -ath ${gem_id}/fastqs | grep ".fastq.gz" | grep cdna)
declare -A lane_dict_cdna
counter=1
pair_ids_cdna=$(ls $gem_id/fastqs | grep ".fastq.gz" | grep cdna | cut -d'.' -f7 | sort | uniq)
for pair_id in $pair_ids_cdna; do
    echo $pair_id
    lane_dict_cdna[$pair_id]=$counter
    ((counter++))
done
for fastq in $fastqs_dna; do
    read=$(echo $fastq | cut -d'.' -f8)
    pair_id=$(echo $fastq | cut -d'.' -f7)
    lane=${lane_dict_cdna[$pair_id]}
    new_name="${gem_id}_S1_L00${lane}_${read}_001.fastq.gz"
    echo "Creating symbolic link for $fastq as $new_name"
    ln -s "$working_dir/$gem_id/fastqs/$fastq" "$working_dir/$gem_id/fastqs/cDNA/$new_name"
done


# Rename fastqs HTO
mkdir -p $gem_id/fastqs/hto
fastqs_hto=$(ls -ath ${gem_id}/fastqs | grep ".fastq.gz" | grep hto)
declare -A lane_dict_hto
counter=1
pair_ids_hto=$(ls $gem_id/fastqs | grep ".fastq.gz" | grep hto | cut -d'.' -f7 | sort | uniq)
for pair_id in $pair_ids_hto; do
    echo $pair_id
    lane_dict_hto[$pair_id]=$counter
    ((counter++))
done
for fastq in $fastqs_hto; do
    read=$(echo $fastq | cut -d'.' -f8)
    pair_id=$(echo $fastq | cut -d'.' -f7)
    lane=${lane_dict_hto[$pair_id]}
    new_name="${gem_id}_S1_L00${lane}_${read}_001.fastq.gz"
    echo "Creating symbolic link for $fastq as $new_name"
    ln -s "$working_dir/$gem_id/fastqs/$fastq" "$working_dir/$gem_id/fastqs/hto/$new_name"
done


# Rename fastqs and create libraries.csv file
echo "fastqs,sample,library_type" > $working_dir/$gem_id/libraries.csv
echo "${working_dir}/${gem_id}/fastqs/cDNA/,${gem_id},Gene Expression" >> $working_dir/$gem_id/libraries.csv
echo "${working_dir}/${gem_id}/fastqs/hto/,${gem_id},Antibody Capture" >> $working_dir/$gem_id/libraries.csv


# Create feature_reference.csv
head -n1 cell_hashing_metadata.csv | cut -d, -f3-8 > $working_dir/$gem_id/feature_reference.csv
grep $gem_id cell_hashing_metadata.csv | cut -d, -f3-8 >> $working_dir/$gem_id/feature_reference.csv


# Run cellranger
cellranger-8.0.1/cellranger count --id="run_count_${gem_id}" \
   --libraries=$working_dir/$gem_id/libraries.csv \
   --feature-ref=$working_dir/$gem_id/feature_reference.csv \
   --transcriptome=refdata-gex-GRCh38-2024-A \
   --include-introns false \
   --create-bam false \
   --output-dir=$gem_id

Package versions

You can check the versions of other packages at the "Session Information" section of each html report. To visualize one of the html reports online, you can copy&paste the URL of the report directly into the HTML GitHub viewer.

File system and name scheme

Although each technology requires specific analysis, they also share a similar pre-processing pipeline. We have strived to harmonize these pipelines into similar naming schemes so that it is easy for users to navigate this repo. Likewise, we have tried to code in a shared style. These are the most important steps:

1-cellranger_mapping: scripts used to run cellranger in our cluster. It also contains QC metrics for different sequencing runs.
2-QC: quaity control for the sequencing and mapping of raw data, filtering of poor-quality cells and genes, normalization, doublet detection, and batch effect correction.
3-clustering: we followed a top-down clustering approach (see methods of our manuscript). Thus, the clustering is organized by levels, in which we move from general cellular compartments to granular cell types and states in a hierarchical and recursive fashion. In this notebooks we have also included the annotation, which was established in collaboration with the annotation team.
4-revision: we include one folder with the analysis performed to answer each of the major reviews during the revision of the manuscript. Other more focused analysis include:
Gene regulatory inference using pySCENIC
Gene Set Enrichment analysis for different slancyte subsets
Cell-cell interactions with cellPhoneDB
Label transfer from RNA to ATAC
Motif analysis with chromVAR
Clonotype analysis with scirpy
Spot deconvolution with SPOTlight

In addition, the "figures_and_scripts" folder contains the scripts used to generate most of the figures in the manuscript. Finally, the "bin" folder contains functions and utilities used throughout many scripts.

Getting the code

You can download a copy of all the files in this repository by cloning the git repository:

git clone https://github.com/Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas.git

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

An Atlas of Cells in the Human Tonsil

Data

Not hashed

Hashed

Package versions

File system and name scheme

Getting the code

Relevant literature

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
ArrayExpress		ArrayExpress
CITE-seq		CITE-seq
MCL		MCL
Zenodo		Zenodo
data		data
multiome		multiome
scATAC-seq		scATAC-seq
scRNA-seq		scRNA-seq
spatial_transcriptomics		spatial_transcriptomics
LICENSE		LICENSE
README.md		README.md
current.Rproj		current.Rproj

License

Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas

Folders and files

Latest commit

History

Repository files navigation

An Atlas of Cells in the Human Tonsil

Data

Not hashed

Hashed

Package versions

File system and name scheme

Getting the code

Relevant literature

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages