Skip to content

Scripts and notebooks to reproduce all analysis from Massoni-Badosa et al. (2022)

License

Notifications You must be signed in to change notification settings

Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas

Repository files navigation

An Atlas of Cells in the Human Tonsil

Palatine tonsils are secondary lymphoid organs (SLOs) representing the first line of immunological defense against inhaled or ingested pathogens. We generated an atlas of the human tonsil composed of >556,000 cells profiled across five different data modalities, including single-cell transcriptome, epigenome, proteome, and immune repertoire sequencing, as well as spatial transcriptomics. This census identified 121 cell types and states, defined developmental trajectories, and enabled an understanding of the functional units of the tonsil. Exemplarily, we stratified myeloid slan-like subtypes, established a BCL6 enhancer as locally active in follicle-associated T and B cells, and identified SIX5 as putative transcriptional regulator of plasma cell maturation. Analyses of a validation cohort confirmed the presence, annotation, and markers of tonsillar cell types and provided evidence of age-related compositional shifts. We demonstrate the value of this resource by annotating cells from B cell-derived mantle cell lymphomas, linking transcriptional heterogeneity to normal B cell differentiation states of the human tonsil.

This repository contains all the scripts, notebooks and reports to reproduce all analysis from our manuscript entitled An Atlas of Cells in the Human Tonsil, published in Immunity in 2024. Here, we describe how to access the data, document the most important packages and versions used, and explain how to navigate the directories and files in this repository.

Data

The data has been deposited in five levels of organization, from raw to processed data:

  • Level 1: raw data. All fastq files for all data modalities have been deposited at ArrayExpress under accession id E-MTAB-13687.
  • Level 2: matrices. All data modalities correspond to different technologies from 10X Genomics. As such, they were mapped with different flavors of CellRanger (CR). The most important files in the ‘‘outs’’ folder of every CR run (including all matrices) have been deposited in Zenodo.
  • Level 3: Seurat Objects. All data was analyzed within the Seurat ecosystem. We have archived in Zenodo all Seurat Objects that contain the raw and processed counts, dimensionality reductions (PCA, Harmony, UMAP), and metadata needed to reproduce all figures from this manuscript.
  • Level 4: to allow for programmatic and modular access to the whole tonsil atlas dataset, we developed HCATonsilData, available on BioConductor. HCATonsilData provides a vignette which documents how to navigate and understand the data. It also provides access to the glossary to traceback all annotations in the atlas. In addition, we will periodically update the annotations as we refine it with suggestions from the community.
  • Level 5: interactive mode. Our tonsil atlas has been included as a reference in Azimuth, which allows interactive exploration of cell type markers on the web.

We refer to the READMEs in the Zenodo repositories for an explanation of how to access the matrices and Seurat objects. Users can follow these steps to download the fastq files:

  1. Download a tsv file that lists all the fastq files available in the repository and filter them to the modality of choice. We will exemplify it here with scRNA-seq. Note: the URL was copied from the section "Download report" from secondary section id ERP156759 of ArrayExpress.
PATH_TSV="https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB75429&result=read_run&fields=study_accession,sample_accession,experiment_accession,run_accession,tax_id,scientific_name,fastq_ftp,submitted_ftp,sra_ftp,bam_ftp&format=tsv&download=true&limit=0"
wget -O fastq_list_tonsil_atlas.tsv "$PATH_TSV"
grep "scRNA-seq" fastq_list_tonsil_atlas.tsv > fastq_list_tonsil_atlas_scRNA.tsv
  1. Create metadata using the names of the fastq files. Inspired by The Cancer Genome Atlas (TCGA), we have encoded all the metadata for a single fastq file in the name itself:

[TECHNOLOGY].[DONOR_ID].[SUBPROJECT].[GEM_ID].[LIBRARY_ID].[LIBRARY_TYPE].[PAIR_ID].[READ].fastq.gz

Here's a more detailed description of each field:

  • TECHNOLOGY: scRNA-seq, scATAC-seq, Multiome, CITE-seq+scVDJ-seq, and spatial transcriptomics (Visium). We also include the fastq files associated with the multiome experiments performed on two mantle cell lymphoma patients (MCL).
  • DONOR_ID: identifier for each of the 17 patients included in the cohort. We provide the donor-level metadata in the file "tonsil_atlas_donor_metadata.csv", including the hospital, sex, age, age group, cause for tonsillectomy and cohort type for every donor.
  • SUBPROJECT: each subproject corresponds to one run of the 10x Genomics Chromium™ Chip.
  • GEM_ID: each run of the 10x Genomics Chromium™ Chip consists of up to 8 "GEM wells" (see 10X documentation): a set of partitioned cells (Gel Beads-in-emulsion) from a single 10x Genomics Chromium™ Chip channel. We give a unique identifier to each of these channels.
  • LIBRARY_ID: one or more sequencing libraries can be derived from a GEM well. For instance, multiome yields two libraries (ATAC and RNA) and CITE-seq+scVDJ yields 4 libraries (RNA, ADT, BCR, TCR).
  • LIBRARY_TYPE: the type of library for each library_id. Note that we used cell hashing () for a subset of the scRNA-seq libraries, and thus the library_type can be "not_hashed", "hashed_cdna" (RNA expression) or "hashed_hto" (the hashtag oligonucleotides).
  • PAIR_ID: to increase sequencing depth, each library was sequenced in more than one lane. This id identifies the pairs of fastq files (R1, R2) for a given subproject. It is used in the scripts below to assign the lane (e.g. L001).
  • READ: for scATAC-seq we have three reads (R1, R2 or R3), see cellranger-atac's documentation. While we find these names to be the most useful, they need to be changed to follow cellranger's conventions.
cut -f8 fastq_list_tonsil_atlas_scRNA.tsv | cut -d'/' -f6 | cut -d';' -f1 > fastq_names_scRNA.txt
echo "technology,donor_id,subproject,gem_id,library_id,library_type,lane,read" > tonsil_atlas_fastq_metadata.csv
sed 's/\.fastq\.gz//g' fastq_names_scRNA.txt | sed 's/\./,/g' >> tonsil_atlas_fastq_metadata.csv

Each line of "fastq_list_tonsil_atlas_scRNA.tsv" contains both read 1 (R1) and read 2 (R2). With the code above we are subsetting R1.

IMPORTANT NOTE: all fastq files derived from a single GEM well need to be mapped together because they correspond to the same set of cells and hence are given the same set of cell barcodes.

We will exemplify the following steps with both kind of libraries (hashed and not_hashed), whichg will ensure that we understand the intricacies of both types. We will focus on two GEM_IDs (one of each type), but users should be able to parallelize the same procedure across all gem_ids using different jobs in a high performance computer (HPC, aka cluster).

  1. Download cellranger. As of 4/2/2024, the most recent version is v8.0.1:
# Download cellranger 8.0.1
curl -o cellranger-8.0.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-8.0.1.tar.gz?Expires=1717561298&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=iZkk2k4-4RDbLSuDzbjBzW4C71N2TvyPy0NTYq1l9T2SxqMNeGtLiNuVE9O12OY72G~DStTUjQwB4oa1sawTbO0eOoiXK0SpM8lNuHJYdhhsP~sgmLfSVDJHfMNm0gyWBTx0sjv~A22sHtZ1Tdpp369z9vnbzZbvpLmrP7XYPvfyGMa98DWsRXSABVHAFMrJiuuyWU2JmooG-fbNA4Gs6BhLy9YbFqm-bhcR6znfKtQm76PqmB0M1Y22BioPgh1UHCp85Q1uDN5Pel73zhgDuac0Xz7Ri~AkLD1Tz3Z0ZX37gIFajH5-nCyLpak83UF9ViMfnb3~mo8lDQ4aMkLlbw__"
tar -xzf cellranger-8.0.1.tar.gz


# Download human reference
curl -O "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz"
tar -xzf refdata-gex-GRCh38-2024-A.tar.gz

Not hashed

3.1. Download fastqs for gem_id y7qn780g_p6jkgk63:

gem_id="y7qn780g_p6jkgk63"
fastqs_urls=$(grep $gem_id fastq_list_tonsil_atlas_scRNA.tsv | cut -f8)
mkdir -p $gem_id/fastqs
for url in $fastqs_urls;
do
    url_fastq1=$(echo $url | cut -d';' -f1)
    echo $url_fastq1
    wget -P $gem_id/fastqs $url_fastq1
    url_fastq2=$(echo $url | cut -d';' -f2)
    echo $url_fastq2
    wget -P $gem_id/fastqs $url_fastq2
done

3.2. Rename fastq files following cellranger conventions.

# Create an associative array to hold the mapping from pair_id to lane_id
declare -A lane_dict
counter=1
pair_ids=$(ls $gem_id/fastqs | cut -d'.' -f7 | sort | uniq)
for pair_id in $pair_ids; do
    echo $pair_id
    lane_dict[$pair_id]=$counter
    ((counter++))
done


# Change fastq names to cellranger-friendly
for fastq in $(ls $gem_id/fastqs); do
    read=$(echo $fastq | cut -d'.' -f8)
    pair_id=$(echo $fastq | cut -d'.' -f7)
    lane=${lane_dict[$pair_id]}
    new_name="${gem_id}_S1_L00${lane}_${read}_001.fastq.gz"
    echo "Creating symbolic link for $fastq as $new_name"
    ln -s "$fastq" "$gem_id/fastqs/$new_name"
done

3.3. Run cellranger count:

cellranger-8.0.1/cellranger count --id="run_count_${gem_id}" \
   --fastqs=$gem_id/fastqs \
   --sample=$gem_id \
   --transcriptome=refdata-gex-GRCh38-2024-A \
   --include-introns false \
   --create-bam false

Hashed

Step 3.1 is common for cell-hashed and not hashed samples, but steps 3.2 and 3.3 is different. For step 3.2 we need to declare which fastqs are cDNA and which are hashtag oligonucleotides (HTO). For step 3.3, we need to specify the arguments --feature-ref and --libraries arguments (see cellranger count --help for more info). In addition, we need to download the feature-barcoding metadata to specify the --feature-ref argument. Let us exemplify it with gem_id dvdzq8et_eot75su8

gem_id="dvdzq8et_eot75su8"
working_dir=$(pwd)

# Download feature barcoding
URL_HASHING_METADATA=https://www.ebi.ac.uk/biostudies/files/E-MTAB-13687/cell_hashing_metadata.csv
wget -O cell_hashing_metadata.csv "$URL_HASHING_METADATA"


# Rename fastqs cDNA
mkdir -p $gem_id/fastqs/cDNA
fastqs_dna=$(ls -ath ${gem_id}/fastqs | grep ".fastq.gz" | grep cdna)
declare -A lane_dict_cdna
counter=1
pair_ids_cdna=$(ls $gem_id/fastqs | grep ".fastq.gz" | grep cdna | cut -d'.' -f7 | sort | uniq)
for pair_id in $pair_ids_cdna; do
    echo $pair_id
    lane_dict_cdna[$pair_id]=$counter
    ((counter++))
done
for fastq in $fastqs_dna; do
    read=$(echo $fastq | cut -d'.' -f8)
    pair_id=$(echo $fastq | cut -d'.' -f7)
    lane=${lane_dict_cdna[$pair_id]}
    new_name="${gem_id}_S1_L00${lane}_${read}_001.fastq.gz"
    echo "Creating symbolic link for $fastq as $new_name"
    ln -s "$working_dir/$gem_id/fastqs/$fastq" "$working_dir/$gem_id/fastqs/cDNA/$new_name"
done


# Rename fastqs HTO
mkdir -p $gem_id/fastqs/hto
fastqs_hto=$(ls -ath ${gem_id}/fastqs | grep ".fastq.gz" | grep hto)
declare -A lane_dict_hto
counter=1
pair_ids_hto=$(ls $gem_id/fastqs | grep ".fastq.gz" | grep hto | cut -d'.' -f7 | sort | uniq)
for pair_id in $pair_ids_hto; do
    echo $pair_id
    lane_dict_hto[$pair_id]=$counter
    ((counter++))
done
for fastq in $fastqs_hto; do
    read=$(echo $fastq | cut -d'.' -f8)
    pair_id=$(echo $fastq | cut -d'.' -f7)
    lane=${lane_dict_hto[$pair_id]}
    new_name="${gem_id}_S1_L00${lane}_${read}_001.fastq.gz"
    echo "Creating symbolic link for $fastq as $new_name"
    ln -s "$working_dir/$gem_id/fastqs/$fastq" "$working_dir/$gem_id/fastqs/hto/$new_name"
done


# Rename fastqs and create libraries.csv file
echo "fastqs,sample,library_type" > $working_dir/$gem_id/libraries.csv
echo "${working_dir}/${gem_id}/fastqs/cDNA/,${gem_id},Gene Expression" >> $working_dir/$gem_id/libraries.csv
echo "${working_dir}/${gem_id}/fastqs/hto/,${gem_id},Antibody Capture" >> $working_dir/$gem_id/libraries.csv


# Create feature_reference.csv
head -n1 cell_hashing_metadata.csv | cut -d, -f3-8 > $working_dir/$gem_id/feature_reference.csv
grep $gem_id cell_hashing_metadata.csv | cut -d, -f3-8 >> $working_dir/$gem_id/feature_reference.csv


# Run cellranger
cellranger-8.0.1/cellranger count --id="run_count_${gem_id}" \
   --libraries=$working_dir/$gem_id/libraries.csv \
   --feature-ref=$working_dir/$gem_id/feature_reference.csv \
   --transcriptome=refdata-gex-GRCh38-2024-A \
   --include-introns false \
   --create-bam false \
   --output-dir=$gem_id

Package versions

You can check the versions of other packages at the "Session Information" section of each html report. To visualize one of the html reports online, you can copy&paste the URL of the report directly into the HTML GitHub viewer.

File system and name scheme

Although each technology requires specific analysis, they also share a similar pre-processing pipeline. We have strived to harmonize these pipelines into similar naming schemes so that it is easy for users to navigate this repo. Likewise, we have tried to code in a shared style. These are the most important steps:

In addition, the "figures_and_scripts" folder contains the scripts used to generate most of the figures in the manuscript. Finally, the "bin" folder contains functions and utilities used throughout many scripts.

Getting the code

You can download a copy of all the files in this repository by cloning the git repository:

git clone https://github.com/Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas.git

Relevant literature

About

Scripts and notebooks to reproduce all analysis from Massoni-Badosa et al. (2022)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages