EpitopeScan

EpitopeScan is a Python3 toolset developed to facilitate the tracking of mutation patterns in the known SARS-CoV-2 antigenic peptides. This information can be used in immunity and vaccine research.

The EpitopeScan command-line tool takes peptide epitopes of interest and a SARS-CoV-2 genome alignment and counts mutations in the supplied epitopes. The EpitopeScan graphical user interface app then lets you analyse and summarise the mutation data for each epitope of interest.

Contents:

Repository contents
Download and set-up
EpitopeScan usage
Advice on constructing Multiple Sequence Alignment
Questions?

1. Repository contents

EpitopeScan package folder contains:

EpitopeScan.py is a command-line tool that reports mutations from a SARS-CoV-2 multiple genome alignment and calculates general mutation statistics.
EpitopeScanGUI.py is a Graphical User Interface (GUI) application developed using Streamlit. It uses the system's default web browser to provide an interactive interface for the analysis of mutation data generated by EpitopeScan.py.
utils folder contains supporting Python3 code with functions and classes for command-line and GUI tools.
reference_sequences folder contains reference SARS-CoV-2 data listed from GISAID. This consists of the reference genome EPI_ISL_402124.fasta, a table with Open Reading Frames information ORFs_reference.txt (ORF name, genome start and end coordinates, name of translated protein, length of translated protein), and reference protein sequences protein_sequences_reference.fasta.

The repository also contains:

test folder with example data for analysis (example_data), subfolders with test data (T#_...), and the script run_EpitopeScan_test.py for testing the command-line tool.
requirements.txt and setup.py files relevant for installation (see Section 2).

2. Download and set-up

Create a new Python3 environment using venv or conda:

# create and activate an environment using venv
python3 -m venv EpitopeScan_env
source EpitopeScan_env/bin/activate

# alternatively, create and activate an environment using conda
conda create --name EpitopeScan_env
conda activate EpitopeScan_env
conda install python pip

More tutorials on pip and virtual environments and conda.

Make sure that the new environment is activated for the next steps. Clone this repository to your local system:

git clone git@github.com:Aleksandr-biochem/EpitopeScan.git

# move to repository folder
cd EpitopeScan

Install EpitopeScan as a package using pip:

pip install .

Now you should be able to call EpitopeScan and EpitopeScanGUI:

$ EpitopeScan -h

usage: EpitopeScan [-h] {scan,stat} ...

EpitopeScan. Scan and analyse SARS-CoV-2 genome Multiple Sequence Alignment for peptide mutations

options:
  -h, --help   show this help message and exit

Mode:
  {scan,stat}
    scan       Scan genome MSA file for peptide mutations
    stat       Read and stat preexisting output

# the following will start the app in your default web-browser
$ EpitopeScanGUI

Note: file requirements.txt lists the exact versions of the EpitopeScan dependencies used during development. If you experience problems with the last step of installation, you may benefit from reproducing package versions from requirements.txt as follows:

# EpitopeScan was developed with Python 3.10.9
# then download required packages using pip
pip install -r requirements.txt

# or conda
conda install -c bioconda --file requirements.txt

# after that run
pip install .

Before using the package, you may run the test script as described in Section 3.4 to check that the EpitopeScan command-line tool works correctly.

3. EpitopeScan usage

3.1 Input data

EpitopeScan requires the following inputs:

a) Peptide(s) for analysis can be provided as a sequence OR as parent protein name + start and end residue indices. Multiple sequences can be provided as a file (see usage examples below for more).

b) Genome Multiple Sequence Alignment (MSA) in FASTA format. EpitopeScan was originally developed to analyse genome alignments from COG-UK. The tool itself does not perform genome alignments. If you want to prepare your set of genomes for analysis, you should use EPI_ISL_402124.fasta sequence as the reference with any aligner of your choice. You can find further advice and help in Section 4.

c) Metadata on sample dates and lineages in CSV format. Analysis can be performed without this input, but then it is only possible to count mutations, without any insight from sampling date and lineage. EpitopeScan was originally configured to deal with metadata from COG-UK. If you wish to use a custom table, prepare a CSV table with the following columns: sequence_name, sample_date, epi_week, usher_lineage (use test/example_data/example_metadata.csv as a guide).
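
If you prepare a custom metadata table, it may help to check its header before starting a scan. The following is a minimal sketch, not part of EpitopeScan: the function name missing_metadata_columns is hypothetical, the example rows are made up, and only the four column names come from the description above.

```python
import csv
import io

# Column names taken from the input description above.
REQUIRED_COLUMNS = ("sequence_name", "sample_date", "epi_week", "usher_lineage")


def missing_metadata_columns(handle):
    """Return the required columns that are absent from a CSV header."""
    reader = csv.reader(handle)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Illustrative in-memory table; the row values are made up.
example = io.StringIO(
    "sequence_name,sample_date,usher_lineage\n"
    "sample_1,2020-03-01,B.1\n"
)
print(missing_metadata_columns(example))  # ['epi_week']
```

For a real file, open it with open("my_metadata.csv", newline="") and pass the handle to the function; an empty list means all four columns are present.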

3.2 Running command line tool

EpitopeScan operates in two modes:

scan to perform MSA file analysis and generate mutation data
stat to generate brief mutation statistics from preexisting mutation data

3.2.1 Scan mode

Scan mode accepts epitope(s), a genome MSA and a metadata file to generate mutation data. Terminal log messages report the run configuration and progress (for example, how epitope(s) are mapped onto the reference genome and how many genomes have been processed so far).

At the end of the run, a mutation summary is printed to terminal stdout. It includes the number of mutations for each analysed epitope and the list of discovered mutations with their counts and BLOSUM scores.

Access help section to navigate flags:

$ EpitopeScan scan -h

usage: EpitopeScan scan [-h] [-e EPITOPE] [-f FILE] --msa MSA [--metadata METADATA] [-o OUT] [-t TAG] [-q QUALITY_FILTER]
                        [-n AMBIGUITY_THRESHOLD] [-b BLOSUM] [-s {0,1}] [-a {0,1}] [--stat_with_metadata]

options:
  -h, --help            show this help message and exit
  -e EPITOPE, --epitope EPITOPE
                        Peptide epitope. Name and sequence (comma-separated S1,VGYWA) OR name, parent protein name, first and last residue
                        indeces in parent protein (indexing starts with 1, for example S1,S,130,145)
  -f FILE, --file FILE  Alternatively, path to file with multiple input peptide sequences in FASTA format or coordinate inputs
                        '>S1,S,130,145'
  --msa MSA             Path to input MSA fasta file
  --metadata METADATA   Path to metadata csv file to merge with mutaion data
  -o OUT, --out OUT     Output directory name
  -t TAG, --tag TAG     Sample name pattern to filter
  -q QUALITY_FILTER, --quality_filter QUALITY_FILTER
                        Max threshold for N bases proportion in genome. Recommended 0.05
  -n AMBIGUITY_THRESHOLD, --ambiguity_threshold AMBIGUITY_THRESHOLD
                        Max proportion of ambiguous residues in peptide sequence regarded as sufficient coverage. Defalut 1/3
  -b BLOSUM, --blosum BLOSUM
                        BLOSUM matrix version for mutation scoring. Default 90
  -s {0,1}, --sort {0,1}
                        Sort mutations summary by count(0) or score(1). Default 0
  -a {0,1}, --stat {0,1}
                        Stat individual mutations(0) or combinations(1). Default 0
  --stat_with_metadata  Only stat samples with metadata

Required arguments (provide the epitope input with either -e or -f, together with --msa):

-e (--epitope) Single input peptide epitope. Can be specified as name and sequence, comma-separated ("S1,VLLPL"). Or as a peptide name, name of the parent protein with the indices of first and last residue ("S1,S,6,10" is equal to the previous input, protein indexing starts with 1). SARS-CoV-2 protein names can be looked up in reference_sequences/protein_sequences_reference.fasta.
-f (--file) A path to file with multiple input peptides. Peptide sequences should be provided in FASTA format. File can also include alternative inputs via residue indices. For example:

>S1
VLLPL
>S2,NSP12,7,20
>S3
DYKHYTPSFK

--msa Path to input MSA fasta file
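
To make the two epitope input forms concrete, here is a small sketch of how such an input string could be interpreted. It is an illustration under assumptions, not EpitopeScan's own parser: parse_epitope and slice_protein are hypothetical names, and the short protein string is only the first few residues of the S protein, used as a stand-in.

```python
def parse_epitope(spec):
    """Split an epitope input into a dict describing either input form.

    "name,SEQUENCE"            -> sequence input
    "name,protein,first,last"  -> coordinate input (1-based, inclusive)
    """
    # A leading '>' is tolerated so FASTA-style header lines also work.
    fields = [field.strip() for field in spec.lstrip(">").split(",")]
    if len(fields) == 2:
        name, sequence = fields
        return {"name": name, "sequence": sequence.upper()}
    if len(fields) == 4:
        name, protein, first, last = fields
        first, last = int(first), int(last)
        if not 1 <= first <= last:
            raise ValueError(f"invalid residue range in {spec!r}")
        return {"name": name, "protein": protein, "first": first, "last": last}
    raise ValueError(f"cannot interpret epitope input {spec!r}")


def slice_protein(protein_sequence, first, last):
    """Cut residues first..last (1-based, inclusive) out of a protein."""
    return protein_sequence[first - 1:last]


spec = parse_epitope("S1,S,6,10")
# Stand-in for the start of the S protein; a real run would read it from
# reference_sequences/protein_sequences_reference.fasta.
print(slice_protein("MFVFLVLLPLVSSQ", spec["first"], spec["last"]))  # VLLPL
```

Because protein indexing starts with 1 and the last residue is included, residues 6 to 10 correspond to the Python slice [5:10], which is why "S1,S,6,10" and "S1,VLLPL" describe the same peptide.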

Optional arguments:

--metadata Metadata .csv file to merge with mutation data (absence of metadata limits insights from mutation data)
-o (--out) Output directory name. Optional, default directory name with timestamp is generated automatically
-t (--tag) Sample name pattern to filter. This input string will be compiled as Python regex
-q (--quality_filter) Upper threshold of max N bases proportion in genome. This proportion is calculated as: count('N' bases in genome without '-' symbols)/length(genome without '-' symbols). Recommended value 0.05
-n (--ambiguity_threshold) Maximum proportion of ambiguous residues in a sample peptide sequence that is still regarded as sufficient coverage. Default 1/3. If (count(ambiguous residues in peptide) / length(peptide)) > threshold, the sample is reported as having insufficient coverage.
-b (--blosum) BLOSUM matrix version for mutation scoring. Default 90. See blosum python package documentation for available options.
-s (--sort), options: {0,1}, Sort mutations summary table by count(0) or score(1). Default 0
-a (--stat), options: {0,1}, Stat individual mutations (0) or combinations(1). Default 0
--stat_with_metadata Only stat samples with metadata in final summary
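
The two proportions behind -q and -n can be written out as a short sketch. This follows the formulas given above but is not the tool's own code: the function names are hypothetical, and treating "X" as the ambiguous residue symbol is an assumption made for illustration.

```python
def n_proportion(aligned_genome):
    """Share of 'N' bases in a genome after removing '-' gap symbols."""
    ungapped = aligned_genome.replace("-", "").upper()
    if not ungapped:
        return 0.0  # nothing left to assess once gaps are removed
    return ungapped.count("N") / len(ungapped)


def has_sufficient_coverage(peptide, threshold=1 / 3, ambiguous="X"):
    """True unless the ambiguous share of the peptide exceeds the threshold.

    'ambiguous' lists the symbols counted as ambiguous (assumed to be 'X').
    """
    if not peptide:
        return False
    share = sum(peptide.count(symbol) for symbol in ambiguous) / len(peptide)
    return share <= threshold


print(n_proportion("ACGTN--NNA"))                      # 0.375
print(has_sufficient_coverage("VLXPL"))                # True
print(has_sufficient_coverage("VLXPL", threshold=0.0)) # False
```

With a threshold of 0.0, any ambiguous residue in the epitope region marks the sample as insufficient coverage, while the default of 1/3 tolerates up to a third of the peptide being ambiguous.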

Basic scan run to generate mutation data for 1 peptide and save output to an automatically named folder:

EpitopeScan scan -e S1,LTGIAVEQDK --msa test/example_data/example_genomes.fasta

Perform analysis with peptide file input and combine mutation data with samples metadata, save to a new folder with custom name:

EpitopeScan scan -f test/example_data/example_epitope.fasta --msa test/example_data/example_genomes.fasta --metadata test/example_data/example_metadata.csv -o My_EpitopeScan_Output

Add genome quality threshold of 0.07 (7%) and mark any sample with ambiguous bases in epitope's region as insufficient coverage sample:

EpitopeScan scan -f test/example_data/example_epitope.fasta --msa test/example_data/example_genomes.fasta --metadata test/example_data/example_metadata.csv -q 0.07 -n 0.0

Only keep samples with "England" or "Scotland" mentioned in sequence name:

EpitopeScan scan -f test/example_data/example_epitope.fasta --msa test/example_data/example_genomes.fasta --metadata test/example_data/example_metadata.csv -t "England|Scotland"
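
Since the -t value is compiled as a Python regex, you can preview which sample names a pattern would keep before running a scan. The sketch below is an illustration only: filter_by_tag is a hypothetical helper, the sample names are made up, and matching anywhere in the name (re.search rather than re.match) is an assumption about how the tool applies the pattern.

```python
import re


def filter_by_tag(sequence_names, tag):
    """Keep the names in which the regex 'tag' matches anywhere."""
    pattern = re.compile(tag)
    return [name for name in sequence_names if pattern.search(name)]


# Made-up sample names for illustration.
names = [
    "England/SAMPLE-A/2020",
    "Wales/SAMPLE-B/2020",
    "Scotland/SAMPLE-C/2020",
]
print(filter_by_tag(names, "England|Scotland"))
# ['England/SAMPLE-A/2020', 'Scotland/SAMPLE-C/2020']
```

Remember to quote the pattern on the command line, as in the example above, so that the shell does not interpret the | character as a pipe.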

3.2.2 Stat mode

The stat mode accepts preexisting scan output and prints summary report to terminal stdout with specified options. There is a help section to navigate flags:

$ EpitopeScan stat -h

usage: EpitopeScan stat [-h] -i INPUT [-b BLOSUM] [-s {0,1}] [-a {0,1}] [--stat_with_metadata] [--start_date START_DATE]
                        [--end_date END_DATE]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Direcory with scan output
  -b BLOSUM, --blosum BLOSUM
                        BLOSUM matrix version for mutation scoring. Default 90
  -s {0,1}, --sort {0,1}
                        Sort mutations summary by count(0) or score(1). Default 0
  -a {0,1}, --stat {0,1}
                        Stat individual mutations(0) or combinations(1). Default 0
  --stat_with_metadata  Only stat samples with metadata
  --start_date START_DATE
                        Stat samples after this date, dd/mm/yyyy
  --end_date END_DATE   Stat samples before this date, dd/mm/yyyy

Argument description:

-i (--input) Path to directory with EpitopeScan scan output
-b (--blosum) BLOSUM version for mutation scoring. Default 90
-s (--sort), options: {0,1}, Sort mutations summary by count(0) or score(1). Default 0
-a (--stat), options: {0,1}, Stat individual mutations (0) or combinations(1). Default 0
--stat_with_metadata Only stat samples with metadata
--start_date Subset after this date, dd/mm/yyyy
--end_date Subset before this date, dd/mm/yyyy

You can generate output from example_data to try out the stat mode. A basic run can be launched as follows:

EpitopeScan stat -i path/to/EpitopeScan_output_dir

To stat all occurring combinations instead of individual mutations in a desired date range, and sort summary table by score:

EpitopeScan stat -i path/to/EpitopeScan_output_dir -s 1 -a 1 --start_date 15/01/2020 --end_date 17/04/2020
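
The date options expect day-first dates in dd/mm/yyyy format. The following sketch shows how such strings can be parsed and used to subset samples. It is not the tool's own code: the function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from datetime import datetime

# Day-first format used by --start_date and --end_date.
DATE_FORMAT = "%d/%m/%Y"


def parse_date(text):
    """Parse a dd/mm/yyyy string into a date; raises ValueError otherwise."""
    return datetime.strptime(text, DATE_FORMAT).date()


def in_date_range(sample_date, start=None, end=None):
    """True if sample_date lies within [start, end]; None means no bound."""
    if start is not None and sample_date < start:
        return False
    if end is not None and sample_date > end:
        return False
    return True


start = parse_date("15/01/2020")
end = parse_date("17/04/2020")
print(in_date_range(parse_date("01/03/2020"), start, end))  # True
print(in_date_range(parse_date("01/05/2020"), start, end))  # False
```

A month-first string such as 01/15/2020 fails to parse in this format, which is a quick way to catch swapped day and month values before a run.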

3.2.3 Mutation data output

For each peptide, 3 tables are generated and named according to the following templates:

EpitopeName_mutation_data.tsv contains columns: sequence_name, NA_mutations (list of nucleic acid substitutions, comma-separated), AA_mutations (list of amino acid substitutions, comma-separated), one-hot-encoding columns detailing presence of discovered AA mutations in each sample, sample_date, epi_week (epidemic week), usher_lineage (viral lineage), has_metadata (0 if False and 1 if True). If no metadata was provided, the last columns will be filled with NaNs.
EpitopeName_AA_mutation_matrix.csv is a count matrix. The index corresponds to AA residues in the reference epitope sequence. Columns contain counts of each possible residue (including translation stop) and deletions (Δ) at the corresponding peptide position.
EpitopeName_NA_mutation_matrix.csv is a count matrix for the coding nucleic acid sequence, organised in the same manner as the AA matrix.

When multiple peptides are provided, the output folder will contain separate subfolders with tables for each peptide.
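
If you want to process the mutation data table in your own scripts, the comma-separated AA_mutations column can be tallied with the standard library. This is a minimal sketch under assumptions: count_aa_mutations is a hypothetical helper, the mutation labels in the example are placeholders rather than real EpitopeScan output, and how samples without mutations are represented in the file (empty cell or NaN) is assumed.

```python
import csv
import io
from collections import Counter


def count_aa_mutations(handle):
    """Count how often each entry of the AA_mutations column occurs."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        cell = (row.get("AA_mutations") or "").strip()
        # Skip samples without mutations (assumed: empty cell or 'NaN').
        if not cell or cell.lower() == "nan":
            continue
        counts.update(m.strip() for m in cell.split(",") if m.strip())
    return counts


# Placeholder rows; the mutation labels are not real tool output.
example = io.StringIO(
    "sequence_name\tNA_mutations\tAA_mutations\n"
    "sample_1\tmutA\tmut1\n"
    "sample_2\tmutA,mutB\tmut1,mut2\n"
    "sample_3\t\t\n"
)
print(count_aa_mutations(example).most_common())
# [('mut1', 2), ('mut2', 1)]
```

For a real table, open EpitopeName_mutation_data.tsv with open(path, newline="") and pass the handle to the function.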

Note: if renamed, the output files will not be suitable for analysis with EpitopeScan stat or EpitopeScanGUI.

3.3 Running graphical user-interface tool

EpitopeScan GUI runs locally using the default web browser as an interface. To launch it, simply type EpitopeScanGUI and wait for the browser window to pop up:

EpitopeScanGUI

Upload mutation data files (one peptide at a time) and explore the interactive summary.

3.4 Running test scripts

Folder test contains run_EpitopeScan_test.py script and test data subfolders. This script will perform the analysis for test epitopes and compare EpitopeScan output to the reference mutation data. Run the test script after installing EpitopeScan to verify the correct performance of the tool:

./test/run_EpitopeScan_test.py

4. Advice on constructing Multiple Sequence Alignment

EpitopeScan operates on aligned genomes, meaning that you need to align your genomes with some other tools before analysing them with EpitopeScan. Here, we provide some advice and options on how to approach this task. First, you should use EPI_ISL_402124.fasta from EpitopeScan/reference_sequences as a reference for alignment.

As for the choice of software for Multiple Sequence Alignment:

Option 1: Graphical User Interface software

In case you are most comfortable with Graphical User Interface software options:

There are some online interfaces providing access to sequence alignment tools (for example, this or this). However, online tools may have limited capacity when it comes to large amounts of data.
Alternatively, download and use the free Unipro UGENE software, which integrates multiple alignment algorithms to choose from. It is also more convenient if you wish to extend an alignment, as you can open a preexisting alignment file and append new sequences to it. However, a graphical application could crash if you are dealing with millions of sequences; in this case, consider the terminal option described below.

Option 2: Use terminal tools

There are many different alignment tools with their strengths and drawbacks. Some popular options are listed here. We give an example using MAFFT aligner, which is a fast and well-established option:

# probably the easiest way to install MAFFT is with bioconda

# create a separate environment for the tool
conda create --name mafft_env
conda activate mafft_env

# install
conda install bioconda::mafft

# see MAFFT usage guide
mafft -h

# align genomes onto reference
mafft --auto --keeplength --addfragments unaligned_genomes.fasta path/to/EPI_ISL_402124.fasta > sequences_aln.fasta

# append new sequences to alignment file
mafft --auto --keeplength --addfragments more_unaligned_genomes.fasta sequences_aln.fasta > new_sequences_aln.fasta
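
If you prefer to drive the same MAFFT call from Python, the command can be assembled and run with the standard subprocess module. This is only a sketch: the function names are hypothetical, the flags are copied from the commands above, and it assumes the mafft executable is available on your PATH.

```python
import subprocess


def build_mafft_command(new_sequences, existing_alignment):
    """Assemble the MAFFT call shown above as an argument list."""
    return [
        "mafft", "--auto", "--keeplength",
        "--addfragments", new_sequences,
        existing_alignment,
    ]


def run_mafft(new_sequences, existing_alignment, output_path):
    """Run MAFFT and write the alignment it prints to output_path."""
    command = build_mafft_command(new_sequences, existing_alignment)
    # MAFFT prints the alignment to stdout, so stdout is sent to the file.
    with open(output_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


print(" ".join(build_mafft_command("unaligned_genomes.fasta",
                                   "path/to/EPI_ISL_402124.fasta")))
```

As with the shell redirection, write the result to a new file rather than to the alignment being read, because the output file is truncated as soon as it is opened.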

We also provide a script, which will help you to update your alignment with new sequences, especially if it's a frequent operation and you have a lot of files with genomes.

Imagine that you add files with new sequences in FASTA format to the folder genome_sequences. You can append the sequences from the folder to an alignment file genome_alignment.fasta using the script update_msa.py. The script will automatically check what sequences in each file in the folder are already in the alignment and will append the missing entries.

./update_msa.py -h

usage: update_msa.py [-h] -s SEQUENCES_DIR -a ALIGNMENT_FILE

Update MSA with MAFFT aligner

options:
  -h, --help            show this help message and exit
  -s SEQUENCES_DIR, --sequences_dir SEQUENCES_DIR
                        Path to directory with genome sequences to align
  -a ALIGNMENT_FILE, --alignment_file ALIGNMENT_FILE
                        Alignment file to append sequences to

./update_msa.py -s path/to/genome_sequences -a path/to/genome_alignment.fasta
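
The idea of appending only the entries that are missing from the alignment can be sketched as follows. This does not reproduce update_msa.py: the function names are hypothetical, and comparing records by their full FASTA header line is an assumption made for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def missing_records(new_lines, alignment_lines):
    """Return records from new_lines whose names are not in the alignment."""
    known = {name for name, _ in read_fasta(alignment_lines)}
    return [(n, s) for n, s in read_fasta(new_lines) if n not in known]


# Toy data; real inputs would be lines read from FASTA files.
alignment = [">reference", "ACGT", ">sample_1", "AC-T"]
new_file = [">sample_1", "ACT", ">sample_2", "ACGA"]
print(missing_records(new_file, alignment))  # [('sample_2', 'ACGA')]
```

Only the records returned here would then need to be passed to the aligner, which avoids re-aligning genomes that are already present.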

Note: if you are constructing MSA for the first time, just use the mafft command from above to align sequences to a reference genome.

You can tune the MAFFT algorithm options or choose another aligner that is convenient for you. Note that for a large number of sequences, alignment can be a lengthy process taking hours or even days of computation.

5. Questions?

If you have a question that does not seem to be answered in this manual, or if you want to report an issue, please do so in the Issues tab on the GitHub page of this repository. You may also want to check whether a similar issue has been reported before. Your feedback is very much appreciated!
