# SARS-ARENA: Structure-based identification of SARS-derived peptides with potential to induce broad protective immunity

## *Workflow New_w1A* - Sequence Alignment and Peptide Selection

Welcome to the Sequence Alignment and Peptide Selection **Workflow 1A**. This notebook allows you to search for and download SARS-CoV-2 proteins from NCBI within a specified range of deposition dates. You can then align these protein sequences and search for conserved regions according to your specifications. The resulting peptide list can be used as input for the subsequent workflow (**Workflow 2: Peptide-HLA Prediction for Conserved SARS-CoV-2 Peptides**).

This workflow consists of five steps:

1. Fetch dataset from NCBI
2. Extract and filter the sequence file
3. Multiple Sequence Alignment
4. Computing conservation score
5. Computing conserved peptides
    
If you need to align **more than 20000 proteins**, we recommend using [Workflow 1B](http://127.0.0.1:8888/notebooks/ProjectDevelopment/Peptide_Extraction_Workflow_1B.ipynb).

**To run a cell, click on it and press Shift+Enter; the code inside the cell will then be executed. Note that the content of a cell can be executed as Code or Markdown. Inside the cells you may also find comments that explain a specific command. These comments are marked with "#".**

### Step 1) Fetch dataset from NCBI Virus

In this first part, you will download the SARS-CoV-2 protein dataset directly from NCBI, using the Python package for calling the NCBI Datasets API. You can find more information [here](https://github.com/ncbi/datasets).

In [1]:
# curl is needed to download the sequences and is missing from the current Docker image,
# so it is installed here (TODO: ask Anja to add curl to the image that includes 3pHLA)
!apt-get update -y
!apt-get install curl -y

Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]      
Get:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]    
Get:5 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1474 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2252 kB]
Get:7 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages [806 kB]
Get:8 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2596 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [29.0 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [840 kB]
Get:11 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [21.1 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [3035
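
The installation above is only needed when `curl` is missing from the image. As a minimal sketch using only the standard library, the check below reports whether `curl` is already on the `PATH`; the helper name `tool_available` is hypothetical and not part of SARS-Arena.

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if tool_available("curl"):
    print("curl is already installed; the apt-get cell can be skipped.")
else:
    print("curl not found; run the apt-get cell above.")
```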

#### 1.1. Necessary imports:
Run this cell to make the necessary imports. It only needs to be run once per session; run it again only if you close the session and reopen it.

In [1]:
# System-based imports
import os

# Subprocess module
from subprocess import PIPE, run
import multiprocessing

# Data processing
import pandas as pd

# Visualization
import seaborn as sns
sns.set(rc={'figure.figsize':(18, 8)}) # Use seaborn style defaults and set the default figure size

# For utility functions used within the code
from SARS_Arena import *

_ColormakerRegistry()

#### 1.2. Setting a working directory:
Choose an appropriate directory for storing all the files or use the default (*Peptide_Extraction_Workflow_1A*).

In [2]:
dir_of_workflow_1 = "Peptide_Extraction_Workflow_1A"

In [3]:
os.makedirs(dir_of_workflow_1, exist_ok=True)
os.chdir(dir_of_workflow_1)

#### 1.3 Arguments for the API:
The protein sequences will be retrieved from online databases. Change these arguments according to your preference. These are the options available (*default values are shown*):
<br/>
<br/>

    Protein = 'nucleocapsid phosphoprotein'
This workflow focuses on the N protein. However, you can choose another protein to analyze from the following list: ORF1ab, ORF1a, nsp1, nsp2, nsp3, nsp4, nsp5, nsp6, nsp7, nsp8, nsp9, nsp10, nsp11, nsp13, nsp14, nsp15, nsp16, RdRp, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N, ORF10


In [4]:
tab = create_tab(os.getcwd())
dataset_selection(tab)

Tab(children=(Accordion(children=(Dropdown(description='Virus:', index=1, options=('Sars-CoV', 'Sars-CoV-2', '…

Validate the selection you made above before proceeding.

**WARNING**: Do not skip this cell after you have specified your arguments above, as it needs to be run to update the values!

In [9]:
Virus_Type, Protein, Completeness, Host, Refseq, Geographic_region, Isolation_source, Pangolin_lineage, Released_Dates = extract_elements(tab)

print("The following is the selection made using the UI above. Return to it to make corrections if needed!")
print('-------')

# Virus
print("Virus type: " + Virus_Type)

# Protein
print("Protein type: " + Protein)

# Sequence type
print("Sequence type: " + Refseq)

# Completeness
print("Completeness type: " + Completeness)

# Host
print("Host type: " + Host)

# Geography
print("Geographic type: " + Geographic_region[0])
print("Geographic selection: " + str(Geographic_region[1]))

# Isolation source
print("Isolation source: " + str(Isolation_source))

# Pangolin lineage
print("Pangolin lineage: " + str(Pangolin_lineage))

# Dates
print("From " + str(Released_Dates[0]) + " to " + str(Released_Dates[1]))

The following is the selection made using the UI above. Return to it to make corrections if needed!
-------
Virus type: Sars-CoV-2
Protein type: nucleocapsid phosphoprotein
Sequence type: RefSeq
Completeness type: Complete
Host type: Human
Geographic type: Continent
Geographic selection: ()
Isolation source: ()
Pangolin lineage: ()
From 2019-12-01 to 2022-02-25


### Step 2) Extract and filter the sequence file:

Now that you have defined which sequences should be analyzed, it is time to extract them. In this section, Python code automatically downloads the .fasta file that contains the protein sequences.

#### 2.1 Download the .fasta sequence file from NCBI Virus:

In [10]:
sequence_file = call_ncbi_virus(Virus_Type, Protein, Completeness, Host, Refseq, Geographic_region, 
                                     Isolation_source, Pangolin_lineage, Released_Dates)
print(sequence_file)

q=*:*&fq={!tag=SeqType_s}SeqType_s:("Protein")&fq=VirusLineageId_ss:(2697049)&fq={!tag=QualNum_i}QualNum_i:([0 TO 0])&fq={!tag=ProtNames_ss}ProtNames_ss:("nucleocapsid phosphoprotein")&fq={!tag=SourceDB_s}SourceDB_s:("RefSeq")&fq={!tag=Completeness_s}Completeness_s:("complete")&fq=HostLineageId_ss:(9606)&fq={!tag=CreateDate_dt}CreateDate_dt:([2019-12-01T00:00:00.00Z TO 2022-02-25T00:00:00.00Z])&cmd=download&sort=SourceDB_s desc,CreateDate_dt desc,id asc&dlfmt=fasta&fl=AccVer_s,Definition_s,Protein_seq
/tmp/sequences.fasta
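
The first line of the output above appears to be the query string built from your selection: a series of `fq` filters joined by `&`. To inspect which filters were applied, the string can be split with the standard library. This is a sketch on a shortened copy of that query and it does not contact NCBI. The taxonomy IDs 2697049 and 9606 denote SARS-CoV-2 and *Homo sapiens*, respectively.

```python
from urllib.parse import parse_qs

# Shortened copy of the query printed above (only some of the filters)
query = (
    'q=*:*&fq={!tag=SeqType_s}SeqType_s:("Protein")'
    "&fq=VirusLineageId_ss:(2697049)"
    "&fq=HostLineageId_ss:(9606)"
    "&dlfmt=fasta"
)

# parse_qs groups repeated keys, so all "fq" filters end up in one list
params = parse_qs(query)
filters = params["fq"]

for number, query_filter in enumerate(filters, start=1):
    print(f"filter {number}: {query_filter}")
print("download format:", params["dlfmt"][0])
```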


#### 2.2 Check the total number of protein sequences

In [11]:
no_of_sequences = count_sequences(sequence_file)

Total number of sequences: 1
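
If you want to double-check the count independently, the number of records in a FASTA file equals the number of header lines starting with `>`. The sketch below illustrates this on an in-memory example; `count_fasta_records` is a hypothetical helper and not necessarily how `count_sequences` is implemented.

```python
import io


def count_fasta_records(handle) -> int:
    """Count FASTA records: each record starts with a '>' header line."""
    return sum(1 for line in handle if line.startswith(">"))


# Two toy records; the first one spans two sequence lines
example = io.StringIO(
    ">seq1 example\nMSDNGPQNQR\nNALRITFGGP\n>seq2 example\nMSDNGPQNQR\n"
)
print("Total number of sequences:", count_fasta_records(example))
```

For a file on disk, the same function can be applied to the handle returned by `open(sequence_file)`.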


### Step 3) Multiple Sequence Alignment

In this step you will perform the Multiple Sequence Alignment based on the sequences you have chosen. This is performed with the software [MAFFT](https://mafft.cbrc.jp/alignment/software/). 

#### 3.1 Run in parallel (optional)
In case your machine has multiple cores, you can select a specific number of cores to run the alignment. If your machine has a single core, you can move to step 3.2.

Run the cell below in case you don't know how many cores your machine has.

In [12]:
print("Number of cores :", multiprocessing.cpu_count())

Number of cores : 12


Now, define the number of cores to be used (`ncores`).

In [11]:
ncores = 8
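
To avoid requesting more cores than the machine has, the chosen value can be clamped to the available count. This is an optional sketch; `choose_ncores` is a hypothetical helper and not part of SARS-Arena.

```python
import multiprocessing


def choose_ncores(requested: int) -> int:
    """Clamp the requested core count to the range 1..available cores."""
    available = multiprocessing.cpu_count()
    return max(1, min(requested, available))


ncores = choose_ncores(8)
print("Cores that will be used:", ncores)
```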

#### 3.2 Run the Multiple Sequence Alignment 
Running this cell produces a consensus sequence for the alignment. First choose a threshold for calculating the consensus: at positions where the most frequent amino acid occurs with a frequency below this threshold, the consensus shows an unknown amino acid.

**WARNING:** The more sequences you have, the longer MAFFT takes to finish. If you only want to run the conservation analysis on sequences that are already aligned, use Workflows 1B and 1C instead!

In [12]:
threshold = 0.5
consensus_sequence = run_msa(sequence_file, ncores, threshold)
print(consensus_sequence)

MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLPNNTASWFTALTQHGKEDLKFPRGQGVPINTNSSPDDQIGYYRRATRRIRGGDGKMKDLSPRWYFYYLGTGPEAGLPYGANKDGIIWVATEGALNTPKDHIGTRNPANNAAIVLQLPQGTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSKRTSPARMAGNGGDAALALLLLDRLNQLESKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYNVTQAFGRRGPEQTQGNFGDQELIRQGTDYKHWPQIAQFAPSASAFFGMSRIGMEVTPSGTWLTYTGAIKLDDKDPNFKDQVILLNKHIDAYKTFPPTEPKKDKKKKADETQALPQRQKKQQTVTLLPAADLDDFSKQLQQSMSSADSTQA
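
To illustrate how the threshold shapes the consensus, here is a minimal majority-rule sketch on a toy alignment. It is an assumption about the general technique, not the code inside `run_msa`: the function name `threshold_consensus` and the symbol `X` for an unknown amino acid are hypothetical, and gaps are treated like any other symbol.

```python
from collections import Counter


def threshold_consensus(aligned, threshold=0.5, unknown="X"):
    """Majority-rule consensus of equal-length aligned sequences.

    A column keeps its most frequent symbol only if that symbol's
    frequency is at least `threshold`; otherwise `unknown` is used.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        frequency = count / len(column)
        consensus.append(residue if frequency >= threshold else unknown)
    return "".join(consensus)


toy_alignment = ["MSDNG", "MSDNG", "MTDKG", "MADQG"]
print(threshold_consensus(toy_alignment, threshold=0.5))  # MSDNG
print(threshold_consensus(toy_alignment, threshold=0.6))  # MXDXG
```

In the toy alignment, columns 2 and 4 have a top frequency of exactly 0.5, so they survive a threshold of 0.5 but become unknown at 0.6.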


**Alignment Scoring**: After the *MAFFT* algorithm performs the alignment, the **aligned.faa** file will contain all the aligned sequences.
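
For reference, an alignment like this could also be run outside the notebook by calling MAFFT directly. The sketch below builds a command with MAFFT's `--auto` and `--thread` options and redirects the alignment, which MAFFT writes to standard output, into a file. The exact options used by `run_msa` are not shown in this notebook, so treat the choice of flags here as an assumption; `build_mafft_command` and `run_mafft` are hypothetical helpers.

```python
import subprocess


def build_mafft_command(input_fasta: str, ncores: int = 1) -> list:
    """Build a MAFFT command line (the option choice is an assumption)."""
    return ["mafft", "--auto", "--thread", str(ncores), input_fasta]


def run_mafft(input_fasta: str, output_fasta: str = "aligned.faa", ncores: int = 1) -> None:
    """Run MAFFT and store the alignment it prints to stdout in a file."""
    with open(output_fasta, "w") as out:
        subprocess.run(
            build_mafft_command(input_fasta, ncores),
            stdout=out,
            stderr=subprocess.PIPE,  # MAFFT reports progress on stderr
            check=True,
        )


# Only print the command here; call run_mafft(...) to actually align
print(" ".join(build_mafft_command("/tmp/sequences.fasta", ncores=8)))
```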

### Step 4) Computing conservation score

Using the file with the aligned sequences, you can now score each position in terms of conservation. We offer four scoring method options:

- *Jensen-Shannon divergence score* (used as 'js_divergence') (**Recommended**)
- *Shannon Entropy* (used as 'shannon_entropy')
- *Property entropy* (used as 'property_entropy')
- *Von Neumann entropy* (used as 'vn_entropy')

For details on the scoring method options, please consult [Capra & Singh (2007)](https://academic.oup.com/bioinformatics/article/23/15/1875/203579).
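
To give an intuition for two of these scores, the sketch below computes a Shannon-entropy-based conservation value and a Jensen-Shannon divergence for single alignment columns. It is deliberately simplified and is not the implementation used by `conservation_analysis`: it assumes a uniform background distribution and ignores gaps, and it omits refinements such as sequence weighting, pseudocounts, or a background derived from a substitution matrix. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column):
    """Relative frequency of each amino acid in a column (gaps ignored)."""
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    if not residues:
        return None
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


def shannon_conservation(column):
    """1 - normalized Shannon entropy: 1.0 for an invariant column."""
    p = column_distribution(column)
    if p is None:
        return 0.0
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    return 1.0 - entropy / math.log2(len(AMINO_ACIDS))


def js_divergence(column):
    """Jensen-Shannon divergence (base 2) from a uniform background."""
    p = column_distribution(column)
    if p is None:
        return 0.0
    q = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)  # assumed background
    r = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, r) + 0.5 * kl(q, r)


conserved_column = "MMMMMMMM"
variable_column = "MSDNGPQA"
for name, column in [("conserved", conserved_column), ("variable", variable_column)]:
    print(name, round(shannon_conservation(column), 3), round(js_divergence(column), 3))
```

With both scores, the invariant column receives a higher value than the variable one, which is the behavior the conservation analysis relies on.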

For scoring matrices you can choose one of the BLOSUM options:
- BLOSUM62 (**Recommended**)
- BLOSUM35
- BLOSUM40
- BLOSUM45
- BLOSUM50
- BLOSUM80
- BLOSUM100

In [13]:
scoring_method = 'js_divergence'

In [14]:
# The scoring matrix only applies to methods that use one to calculate conservation
# (e.g. JS divergence); it is ignored otherwise (e.g. Shannon entropy)
scoring_matrix = 'blosum62'

Now that the arguments have been defined, run the conservation analysis and store the conservation results in the `conservation_file` variable:

In [15]:
conservation_file = conservation_analysis(scoring_method, scoring_matrix)

Scoring Completed!
Results written to conservation.csv
 


### Step 5) Computing conserved peptides

In the final step of this workflow you will be able to compute the conservation of peptides based on residue conservation.

#### 5.1 Retrieve residue conservation information:

In [16]:
conservation_df = pd.read_csv(filepath_or_buffer = conservation_file,
                              header = 0,
                              names = ['Position', 'Score', 'Alignment'],
                              converters={'Score': lambda x: max(float(x)*100, 0)})

In [17]:
conservation_df #Show the conservation by residue

Unnamed: 0,Position,Score,Alignment
0,0,91.8892,MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM...
1,1,83.6737,SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS...
2,2,85.1613,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD...
3,3,86.8314,NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...
4,4,82.8658,GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
...,...,...,...
414,414,84.5292,DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD...
415,415,83.5952,SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS...
416,416,84.5164,TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT...
417,417,89.3111,QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ...
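
The `converters` argument used when loading the file rescales each raw score to a percentage and clamps negative values to 0. The sketch below applies the same expression to a few sample strings; the sample values are illustrative, and the assumption that negative scores act as a sentinel (for example for gap-rich columns) is not confirmed by this notebook.

```python
def to_percent_score(raw: str) -> float:
    """Same expression as the `converters` lambda: scale to percent, clamp at 0."""
    return max(float(raw) * 100, 0)


# Illustrative raw values, including a negative one that gets clamped
for raw in ["0.5", "0.25", "-1000"]:
    print(raw, "->", to_percent_score(raw))
```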


Now load the aligned sequences from which the peptides will be extracted:

In [18]:
aligned_sequences_df = pd.read_csv(filepath_or_buffer = "aligned.csv",
                                   header = 0,
                                   names = ['Sequence_ID', 'Aligned_Sequences'])

In [19]:
aligned_sequences_df #Show the aligned sequences

Unnamed: 0,Sequence_ID,Aligned_Sequences
0,1,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
1,2,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
2,3,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
3,4,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
4,5,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
...,...,...
420,421,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
421,422,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
422,423,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...
423,424,MSDNGPQNQRNALRITFGGPSDSTGSNQNG---GARSKQRRPQGLP...


Now, before using the interactive plot to filter the peptides by conservation, we will pre-compute all the peptides in the sequences. For that, define the peptide length boundaries you want to analyze and extract the peptides:

In [20]:
max_len = 10 #Maximum length of the peptide
min_len = 8 #Minimum length of the peptide
extracted_peptides_from_sequences = extract_peptides(min_len, max_len, aligned_sequences_df)

Extracting all peptides from sequences


HBox(children=(FloatProgress(value=0.0, max=425.0), HTML(value='')))


Post-processing for all peptide lengths


HBox(children=(FloatProgress(value=0.0, max=2.0), HTML(value='')))
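
Conceptually, peptide extraction is a sliding window over each aligned sequence for every length between `min_len` and `max_len`. The sketch below shows the idea on a single toy sequence. It assumes that windows containing a gap character are skipped, which may differ from what `extract_peptides` does; `extract_windows` is a hypothetical helper.

```python
def extract_windows(aligned_sequence, min_len, max_len, gap="-"):
    """Return {(start, length): peptide} for every gap-free window."""
    peptides = {}
    for length in range(min_len, max_len + 1):
        for start in range(len(aligned_sequence) - length + 1):
            window = aligned_sequence[start:start + length]
            if gap not in window:  # assumption: skip windows spanning a gap
                peptides[(start, length)] = window
    return peptides


toy_sequence = "MSDNG---GARSKQRR"
for (start, length), peptide in extract_windows(toy_sequence, 8, 10).items():
    print(f"start={start} length={length} peptide={peptide}")
```

Keeping the alignment start position with each peptide makes it possible to look up the conservation scores of the columns it covers.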




#### 5.2 Choose the peptides based on conservation values:
Use the sliders that appear below the cell after you run it to set the following parameters:

- *Conservation threshold (CV_cutoff)*: The conservation score cutoff used to select peptides.
- *Rolling Median Window length (RMW_cutoff)*: Because conservation values vary from position to position, regions can be smoothed with a rolling median of this window length. Set it to 1 to use the conservation values as they are.
- *Peptide Length (Pep_length)*: The length of the peptides to fetch for post-processing.

In [21]:
interactive_plot_selection(conservation_df, extracted_peptides_from_sequences, min_len, max_len)

interactive(children=(FloatSlider(value=83.49687326968973, continuous_update=False, description='CV_cutoff', m…
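
The interplay of the three sliders can be sketched with a rolling median followed by a cutoff. This is an illustration under stated assumptions, not the logic of `interactive_plot_selection`: it assumes an odd, centered window that shrinks at the sequence ends, and it keeps a peptide only if every smoothed position reaches the cutoff. The function names and the toy scores are hypothetical.

```python
from statistics import median


def rolling_median(scores, window):
    """Centered rolling median; the window shrinks at the sequence ends."""
    half = window // 2
    return [
        median(scores[max(0, i - half): i + half + 1])
        for i in range(len(scores))
    ]


def conserved_starts(scores, pep_length, cv_cutoff, rmw=1):
    """Start positions of windows whose smoothed scores all reach the cutoff."""
    smoothed = rolling_median(scores, rmw)
    return [
        start
        for start in range(len(scores) - pep_length + 1)
        if min(smoothed[start:start + pep_length]) >= cv_cutoff
    ]


toy_scores = [90, 88, 60, 89, 91, 92, 87, 90]
print(conserved_starts(toy_scores, pep_length=3, cv_cutoff=85, rmw=1))  # [3, 4, 5]
print(conserved_starts(toy_scores, pep_length=3, cv_cutoff=85, rmw=3))  # [0, 1, 2, 3, 4, 5]
```

With a window of 1 the isolated dip at position 2 excludes every peptide that covers it, while a window of 3 smooths the dip away and lets those peptides through.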

#### 5.3 Print the peptides:

Print the peptide sequences selected with the thresholds set above.

In [23]:
# Read one peptide per line; the with-block closes the file automatically
with open("peptides.list", "r") as peptide_file:
    peptide_list = [peptide.strip() for peptide in peptide_file]
print(peptide_list)

['SPRWYFYYL', 'TDYKHWPQI', 'DYKHWPQIA', 'YKHWPQIAQ', 'KHWPQIAQF', 'AYKTFPPTE', 'AYKTFPPTQ']
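
Before passing the list on, it can be useful to confirm that every peptide has a length within the chosen bounds and contains only the 20 standard amino acid letters. The sketch below runs such a check on the peptides printed above; `check_peptides` is a hypothetical helper and not part of SARS-Arena.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_peptides(peptides, min_len, max_len):
    """Return the peptides with an unexpected length or non-standard letters."""
    return [
        peptide
        for peptide in peptides
        if not (min_len <= len(peptide) <= max_len)
        or not set(peptide) <= STANDARD_AMINO_ACIDS
    ]


example_peptides = ['SPRWYFYYL', 'TDYKHWPQI', 'DYKHWPQIA', 'YKHWPQIAQ',
                    'KHWPQIAQF', 'AYKTFPPTE', 'AYKTFPPTQ']
problems = check_peptides(example_peptides, min_len=8, max_len=10)
print("Peptides failing the check:", problems)  # []
```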


<font size="+2"><center><b>This is the end of Workflow 1A</b></center></font>


You will find a file named *peptides.list* in your folder; it can be used as input for [Workflow 2](Peptide-HLA_Binding_Prediction_Workflow_2.ipynb).