Skip to content

StatPhysBio/covid-bcr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

About

This repository is used to conduct the analysis for "Dynamics of B-cell repertoires and emergence of cross-reactive responses in COVID-19 patients with different disease severity." It investigates receptor compositions depending on disease severity, expansion of BCR clonal lineages over time, sharing of BCRs among individuals, and the emergence of cross-reactivity from a SARS-CoV-2 response to a SARS response.

Dependencies

Most dependences are listed in env.yml. You can set up a conda environment with

$ conda env create -f env.yml

and activate the new environment (including all dependencies) with

$ conda activate covid-bcr

Other necessary software includes:

Pipeline and preparing data for analyses

Raw sequence processing

scripts/presto_pipeline.sh processes the data with pRESTO. Pairs of raw sequences are assembled and filtered with a QScore of 30, V primers are masked and the C primer is cut, and finally sequences are deduplicated. For example,

bash presto_pipeline.sh -s SAMPLENAME-REPLICATE -afmc

where -s SAMPLENAME-REPLICATE specifies the sample, e.g. -s S1-0, and -afmc will (a)ssemble the paired-end reads, (f)ilter sequences for quality, (m)ask primers, and (c)ollapse sequences.

Assemble FASTA file for annotating

Use scripts/assemble_fastas_for_abstar.py to annotate replicate, timepoint, severity, and patient information in headers and combine sample FASTAs into one larger FASTA for each patient. This script takes in the directories where the data is stored and a directory to which the combined FASTAs will be saved.

python assemble_fastas_for_abstar.py --dirs /PATH/TO/DIR1 /PATH/TO/DIR2 /PATH/TO/DIR3 \
                                     --save_dir /PATH/TO/SAVE/DIR \
                                     --bcellinfo /PATH/TO/INFO.csv

The output FASTAs will be called PATIENTID.fasta, where PATIENTID is an integer. B cell info files are found in the csvs directory: csvs/plasma_b_cell_info.csv and csvs/bulk_b_cell_info.csv. These files are used to mark from which patient the sample came. If fastas are from plasma repertoires, toggle --plasma.

Annotating sequences

abstar should have been installed when creating the conda environment. To annotate sequences, simply use

abstar -i /PATH/TO/INPUT_FILE.fasta -o /PATH/TO/OUTPUT_DIRECTORY -t /PATH/TO/TEMP_DIRECTORY

for each file or

abstar -i /PATH/TO/INPUT_DIRECTORY -o /PATH/TO/OUTPUT_DIRECTORY -t /PATH/TO/TEMP_DIRECTORY

to run abstar iteratively over all files in a directory.

Error correction, data filtering, creating lineages, and creating input for SONIA

Use scripts/abstar_pipeline.py to error correct and filter sequences

python abstar_pipeline.py --annotations PATH/TO/ABSTAR_OUTPUT.json PATH/TO/SAVE/FILTERED_ANNOTATIONS.json

create lineages

python abstar_pipeline.py --lineages PATH/TO/FILTERED_ANNOTATIONS.json PATH/TO/SAVE/LINEAGES.json

and create input for SONIA

python abstar_pipeline.py --sonia PATH/TO/LINEAGES.json PATH/TO/SAVE/SONIA_INPUT.csv

See the code or Methods in Montague et al., Dynamics of B-cell repertoires and emergence of cross-reactive responses in COVID-19 patients with different disease severity for more details.

Create input for IGoR

Use scripts/assemble_fasta_for_igor.py to create the FASTA file used as input for IGoR.

python assemble_fasta_for_igor.py --infiles PATH/TO/LINEAGES/* --outfile PATH/TO/SAVE/IGOR_INPUT.fasta

Prepare input for expansion analysis

Use scripts/wrangle_lineages.py to create the .csv file used as input for the expansion analysis.

python wrangle_lineages.py --lineages PATH/TO/LINEAGE_FILE.json --outfile PATH/TO/SAVE/LINEAGE_COUNTS.csv

Obtain HCDR3 anchors and genomic references

All genomic references and HCDR3 anchors for IGoR and SONIA are in igor_input/ and sonia_input/. If you would like to prepare your own HCDR3 anchors and genomic references, use scripts/cdr3_anchors_and_references.py.

python cdr3_anchors_and_references.py --indir igor_input

The output will also be in igor_input.

Analysis

IGoR model

Genomic references and HCDR3 anchors for IGoR are available in igor_input/. Use scripts/run_igor.sh to make the IGoR model.

bash run_igor.sh PATH/TO/WORKING_DIRECTORY PATH/TO/IGOR_INPUT.fasta BATCHNAME

SONIA models

IGoR model output and HCDR3 anchors for SONIA are available in sonia_input/. To assemble all the individual patient data into cohorts quickly, cat them together accordingly (e.g. cat 1_sonia_input.csv 2_sonia_input.csv 3_sonia_input.csv > healthy_sonia_input.csv) and then delete the unnecessary header lines that are not the first line inside the resultant csv.

To create a SONIA model, execute

sonia-infer --set_custom_model_VDJ PATH/TO/SONIA_INPUT.csv --sonia_model leftright --epochs 150 \
            --independent genes --seq_index 0 --v_mask_index 1 --j_mask_index 2 \
            --infile PATH/TO/SONIA_INPUT.csv -o PATH/TO/SAVE/SONIA_MODEL --lines_to_skip 1 \
            --n_gen_seqs 500000

If you want to add some determinism and replicability when creating a SONIA model, you can additionally use the --seed option in sonia-infer which will lead to SONIA models with nearly identical selection factors.

To generate sequences, execute

sonia-generate --set_custom_model_VDJ PATH/TO/SONIA_MODEL --sonia_model leftright --ppost \
               --N NUM_SEQS_TO_GENERATE --delimiter_out , --outfile PATH/TO/SAVE/GENERATED_SEQS.csv

To evaluate sequences from data, execute

sonia-evaluate --set_custom_model_VDJ PATH/TO/SONIA_MODEL --sonia_model leftright --ppost \
               --infile PATH/TO/SONIA_INPUT.csv --lines_to_skip 1 --delimiter_out , \
               --outfile PATH/TO/SAVE/DATA_EVALUATIONS.csv

To evaluate generated sequences, execute

sonia-evaluate --set_custom_model_VDJ PATH/TO/SONIA_MODEL --sonia_model leftright --ppost \
               --infile PATH/TO/INPUT.csv --delimiter_out , --outfile PATH/TO/SAVE/GEN_EVALUATIONS.csv

Differential sequence features

Use scripts/abstar_stats.py to obtain statistics of gene usage, HCDR3 length, and insertion and deletion profiles of nonsingletons and progenitors in productive lineages.

python abstar_stats.py --infile PATH/TO/LINEAGES.json --outfile PATH/TO/SAVE/STATISTICS.json

For plotting these statistics and performing ANOVA (with boxplot visualizations), see notebooks/sequence_features_plotting.ipynb. Methods to aid in plotting and performing statistical tests are in scripts/plotting_helper.py.

Expansion analysis, sharing analysis, and overlap with known Abs

See notebooks/R_expansion_analysis.ipynb for the false positive rate analysis. See notebooks/sharing_null_hypothesis_bounds.nb for obtaining bounds for sharing analysis. Otherwise, all analyses are conducted in notebooks/covid_dynamics.ipynb

Extra healthy patients

Because we have only three healthy individuals, we supplement our healthy dataset using the Great Repertoire Project. Obtain the processed .json files from their associated locations using wget, e.g.

wget http://burtonlab.s3.amazonaws.com/sequencing-data/hiseq_2016-supplement/316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_json_071817.tar.gz

These are big files, so we resort to shell scripting to wrangle the files a bit. We use only the first three biological replicates. Uncompressing a tarball file can be done using only one core; however, you can use multiple cores to uncompress multiple tarball files. For example,

for i in {1..3}; do srun -n 1 -c 1 tar -xf 316188_HNCHNBCXY_consensus_UID18-cdr3nt-90_json_071817.tar.gz consensus-cdr3nt-90_json/${i}_consensus.json & done

Because our analysis uses only IgG B cells, we select from the uncompressed tarballs only those B cells.

grep -h "IgG" 1_consensus.json 2_consensus.json 3_consensus.json > 316188_IgG.json

Further, we separate out the productive sequences and out-of-frame sequences.

grep '"prod": "yes", "junction_in_frame": "yes"' 316188_IgG.json > productive_316188_IgG.json
grep '"junction_in_frame": "no"' 316188_IgG.json > unproductive_316188_IgG.json

Finally, we use load the files and remove any sequences with N's or sequences missing D genes, create lineages, and obtain their statistics and wrangle the lineages for further processing.

References

  1. Montague et al., Dynamics of B-cell repertoires and emergence of cross-reactive responses in COVID-19 patients with different disease severity, (2020)
  2. Vander Heiden, J.A., Yaari, G., Uduman, M., Stern, J.N., O'Connor, K.C., Hafler, D.A., Vigneault, F., and Kleinstein, S.H. (2014). pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 30, 1930-1932.
  3. Briney, B., and Burton, D.R. (2018). Massively scalable genetic analysis of antibody repertoires. bioRxiv 10.1101/447813, 447813.
  4. Marcou, Q., Mora, T., and Walczak, A.M. (2018). High-throughput immune repertoire analysis with IGoR. Nat Commun 9, 561.
  5. Sethna, Z., Isacchini, G., Dupic, T., Mora, T., Walczak, A.M., and Elhanati, Y. (2020). Population variability in the generation and thymic selection of T-cell repertoires. bioRxiv 10.1101/2020.01.08.899682, 2020.2001.2008.899682.

Contact

Any issues or questions should be addressed to me.

License

Free use of this analysis is granted under the terms of the GNU General Public License version 3 (GPLv3).

About

Analysis of BCR repertoires from covid-19 infections

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages