# SARS-ARENA: Structure-based identification of SARS-derived peptides with potential to induce broad protective immunity

## *Workflow 1C* - Sequence Alignment and Peptide Selection

Welcome to the Sequence Alignment and Peptide Selection **Workflow 1C**. This notebook lets you retrieve pre-aligned beta-coronavirus protein sequences and search them for conserved regions according to your specifications. In the first part, you will recover the SARS-CoV-2 consensus sequence from a pre-computed alignment. The protein sequences are realigned on a weekly basis, so the information stays up to date. After that, you will be able to align the consensus with the other beta-coronavirus proteins in your input and recover conserved peptides. Just like Workflows 1A and 1B, the output peptide list can be used in the subsequent workflow (**Workflow 2: Peptide-HLA Prediction for Conserved SARS-CoV-2 Peptides**).

This workflow consists of four steps: 
    1. Fetch Pre-computed Multiple Sequence Alignment (MSA) from SARS-CoV-2,
    2. Multiple Sequence Alignment,
    3. Computing conservation score,
    4. Computing conserved peptides.
    
**In order to run a cell, first click on the cell, then press shift-enter. The code inside the cell will then be executed. Note that the content of the cell can be executed as Code or Markdown. Also, inside the cell you may find comments to explain a specific command. These comments are marked with "#"**

### Step 1) Fetch Pre-computed Multiple Sequence Alignment (MSA) from SARS-CoV-2

#### 1.1. Necessary imports:
Run this cell to make the necessary imports. This cell should be run only one time, unless you close this session and open it again.

In [1]:
# System-based imports
import os
import glob

# Data processing
import pandas as pd

# Subprocess module
from subprocess import PIPE, run
import multiprocessing

#For visualization and interaction purposes
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import *
sns.set(rc={'figure.figsize':(18, 8)}) # Use seaborn style defaults and set the default figure size

# For utility functions used within the code
from SARS_Arena import *

_ColormakerRegistry()

#### 1.2. Setting a working directory:
Choose an appropriate directory for storing all the files, or use the default (*Peptide_Extraction_Workflow_1C*).

In [2]:
dir_of_workflow = "./Peptide_Extraction_Workflow_1C"

In [3]:
os.makedirs(dir_of_workflow, exist_ok=True)
os.chdir(dir_of_workflow)

#### 1.3. Setting month/year to recover sequences:

Set the month and year from which you want to draw sequences; sequences from that date up to today are retrieved. Use the 3-letter abbreviation for the month (Jan, Feb, ...).

In [4]:
month = "Jan"
year = "2021"
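
Since the month must be given as a 3-letter abbreviation, a quick check before fetching can catch typos early. The sketch below is only an illustration: `check_month_year` is a hypothetical helper, not part of `SARS_Arena`, and it assumes the year is passed as a 4-digit string as in the cell above.

```python
# Hypothetical helper (not part of SARS_Arena): validate the month/year inputs.
VALID_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def check_month_year(month, year):
    """Return (month_number, year_number) or raise ValueError on bad input."""
    if month not in VALID_MONTHS:
        raise ValueError(f"month must be one of {VALID_MONTHS}, got {month!r}")
    if not (str(year).isdigit() and len(str(year)) == 4):
        raise ValueError(f"year must be a 4-digit value, got {year!r}")
    return VALID_MONTHS.index(month) + 1, int(year)


print(check_month_year("Jan", "2021"))
```

A full month name such as `"January"` raises a `ValueError` instead of being passed on to the fetch step.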

Fetch the sequences and print the consensus sequence:

In [5]:
fetch_precomputed_sequences(year, month)

'aligned.faa'

In [6]:
print("Consensus sequence:")
with open('Consensus_sequence.txt', 'r') as f:
    cons = f.read().replace('\n', '')
    print(cons)

Consensus sequence:
-MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTASWFTALTQHGKEDLKFPRGQGVPINTNSSPDDQIGYYRRATRRIRGGDGKMKDLSPRWYFYYLGTGPEAGLPYGANKDGIIWVATEGALNTPKDHIGTRNPANNAAIVLQLPQGTTLPKGFY-AEGSRGGSQASSRSSSRSRNSSRNSTPG-SSKRTSPARMAGNGGDAALALLLLDRLNQLESKMS---------GKG-QQQQ---GQTVTK---KSAAEASKKPRQKRTATKAYN--VTQAFGRRGPEQTQGNFGDQELIRQGTDYKHWPQIAQFAPSASAFFGMSRIGMEVTPSGTWLTYTGAIKLDDKDPNFKDQVILLNKHIDAYKTFPPTEPKKD-K--------KKKADETQALPQRQKKQQ----TVT-LLPAADLDDFSKQLQQSMSSADSTQA


Combine the SARS-CoV-2 consensus sequence with the other beta-coronavirus sequences:

In [7]:
filtered_sequence_file = "protein_refs.faa"

In [8]:
os.system("cp ../Coronaviruses_sequences.fasta " + filtered_sequence_file)

cons = '\n'.join(''.join(filter(lambda n: n is not None, chunk)) for chunk in helper_grouper(str(cons), 70))

with open(filtered_sequence_file, "a") as file_object:
    file_object.write("\n>gi||cons|| N gene product [SARS-CoV-2]\n")
    file_object.write(cons)
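
The cell above wraps the consensus sequence into 70-character lines before appending it as a FASTA record. As a rough illustration of that wrapping step, the sketch below uses plain string slicing; `wrap_sequence` is a hypothetical name, and it is assumed here that `helper_grouper` chunks the sequence in fixed-size groups in a comparable way.

```python
# Hypothetical equivalent of the wrapping step, using plain slicing.
def wrap_sequence(sequence, width=70):
    """Split a sequence into lines of at most `width` characters."""
    return "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))


toy_sequence = "ACDEFGHIKL" * 15  # toy 150-residue sequence, not real data
wrapped = wrap_sequence(toy_sequence)
print([len(line) for line in wrapped.split("\n")])
```

Joining the wrapped lines back together gives the original sequence, so no residues are lost.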

### Step 2) Multiple Sequence Alignment

In this step you will perform the Multiple Sequence Alignment based on the sequences you have chosen. This is performed with the software [MAFFT](https://mafft.cbrc.jp/alignment/software/). 
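
For orientation, here is a minimal sketch of how MAFFT can be called from Python. It is not the implementation of `run_msa`, whose internals are not shown in this notebook; `build_mafft_command` and `run_mafft` are hypothetical helpers, and the sketch assumes the `mafft` executable is on your `PATH`. MAFFT writes the alignment to standard output, so the sketch redirects it to a file.

```python
import subprocess


# Hypothetical helpers (not part of SARS_Arena).
def build_mafft_command(input_fasta, nthreads=1):
    """Build a MAFFT command line with automatic strategy selection."""
    return ["mafft", "--auto", "--thread", str(nthreads), input_fasta]


def run_mafft(input_fasta, output_fasta, nthreads=1):
    """Run MAFFT and write the alignment (its standard output) to a file."""
    cmd = build_mafft_command(input_fasta, nthreads)
    with open(output_fasta, "w") as out:
        # check=True raises CalledProcessError if MAFFT exits with an error
        subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, check=True)


# Only print the command here; call run_mafft(...) once MAFFT is installed.
print(" ".join(build_mafft_command("protein_refs.faa", nthreads=2)))
```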

#### 2.1 Run in parallel (optional)
In case your machine has multiple cores, you can select a specific number of cores to run the alignment. If your machine has a single core, you can move to step 2.2.

Run the cell below in case you don't know how many cores your machine has.

In [9]:
print("Number of cores :", multiprocessing.cpu_count())

Number of cores : 12


Now, define the number of cores to be used (`nthreads`).

In [10]:
nthreads = 2

#### 2.2 Run the Multiple Sequence Alignment 
After running this cell, you will see a consensus sequence for this alignment. First choose a threshold for calculating the consensus sequence: positions where no residue reaches this frequency are reported as an unknown amino acid (`X`).

In [11]:
threshold = 0.5
consensus_sequence = run_msa(filtered_sequence_file, nthreads, threshold)

M------------A-----------XXVXX-----XDXX-------XXXXXRGRX----------------------XPX-XXXXXXXSWFXXLXXXXKXXXXXXXXGXGVPXXXGXXXXXQXGYWXRXXR--XXXXXGXXXXLXPXWXFYYLGTGPXAXLXXGXX--------XXGVXWVAXXGAXTXXXX--XXGXRXPX--XXXXXXXXFXXGXXLPXGFXXXXXX-----XSXXXSRXXSRXX---XXSRXXSX----------------------------------XSRXXS---XXRX-X--------------XX-XXXXXXXXXXXLXXXXX---XXXX-XX-----------------XPXXXXKXXAX----------------XXXXXXXXKXXXKRTPXKX--XXVXQXFGXRXXXX---NFGDXXXXKXGXXDPXXPXXAELXPXXXAXLFGSXXXXXX--------------XX-D-XX------------XLTYXXXXXXXXXXPXXXXXXXXXXXXXXAYXX----------XPXX-XXKXXXXX--------XXXXXXXX-------XX------------------------------------------------XPXXXXXX---------------------------XXXXXX----XXXDXXXX------------------XXX
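
To make the role of the threshold concrete, the sketch below computes a threshold-based consensus on a toy alignment. It is an illustration under stated assumptions, not the code behind `run_msa`: `threshold_consensus` is a hypothetical function, and it assumes gaps (`-`) are counted like any other symbol, which is consistent with the gaps visible in the output above but not confirmed.

```python
from collections import Counter


# Hypothetical function (not part of SARS_Arena).
def threshold_consensus(aligned_sequences, threshold=0.5, unknown="X"):
    """Keep the most frequent symbol per column if its frequency reaches the threshold."""
    consensus = []
    for column in zip(*aligned_sequences):
        symbol, count = Counter(column).most_common(1)[0]
        consensus.append(symbol if count / len(column) >= threshold else unknown)
    return "".join(consensus)


# Toy alignment of four equal-length sequences (not real data)
toy_alignment = ["MA-K", "MS-K", "MT-K", "MVGR"]
print(threshold_consensus(toy_alignment, threshold=0.5))
```

In the toy alignment, the second column has four different residues, so none reaches the 0.5 threshold and the position becomes `X`.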


### Step 3) Computing conservation score

Using the file with the aligned sequences, you can now score each position in terms of conservation. We offer four scoring method options:

- *Jensen-Shannon divergence score* (used as 'js_divergence') (**Recommended**)
- *Shannon Entropy* (used as 'shannon_entropy')
- *Property entropy* (used as 'property_entropy')
- *Von Neumann entropy* (used as 'vn_entropy')

For details on the scoring method options, please consult [Capra & Singh (2007)](https://academic.oup.com/bioinformatics/article/23/15/1875/203579).
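
As a small illustration of what an entropy-based score measures, the sketch below scores a single alignment column with Shannon entropy. It is a simplified stand-in, not the scoring code used by `conservation_analysis`: `shannon_conservation` is a hypothetical function, gaps are simply ignored, and the entropy is normalized by the logarithm of a 20-letter amino acid alphabet (both are assumptions made here for brevity).

```python
import math
from collections import Counter


# Hypothetical function (not part of SARS_Arena).
def shannon_conservation(column, alphabet_size=20):
    """Return 1 - normalized Shannon entropy of a column (1.0 = fully conserved)."""
    residues = [r for r in column if r != "-"]  # assumption: gaps are ignored
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(residues).values())
    return 1.0 - entropy / math.log(alphabet_size)


print(shannon_conservation("MMMMMMMM"))   # identical residues in every sequence
print(shannon_conservation("AASSTTVV"))   # four residues, equally frequent
```

A column with a single residue type scores 1.0, and the score decreases as the residues in the column become more varied.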

For scoring matrices you can choose one of the BLOSUM options:
- BLOSUM62 (**Recommended**)
- BLOSUM35
- BLOSUM40
- BLOSUM45
- BLOSUM50
- BLOSUM80
- BLOSUM100

In [12]:
scoring_method = 'js_divergence'

In [13]:
scoring_matrix = 'blosum62' #This only applies to methods that actually use a scoring matrix for calculating conservation, like JS-divergence, else, it is ignored (e.g. Shannon Entropy)

Now that the arguments have been defined, run conservation analysis, and store the conservation results in the `conservation_file` variable:

In [14]:
conservation_file = conservation_analysis(scoring_method, scoring_matrix)

Scoring Completed!
Results written to conservation.csv
 


### Step 4) Computing conserved peptides

In the final step of this workflow you will be able to compute the conservation of peptides based on residue conservation.

#### 4.1 Retrieve per-residue conservation information:

In [15]:
conservation_df = pd.read_csv(filepath_or_buffer = conservation_file,
                              header = 0,
                              names = ['Position', 'Score', 'Alignment'],
                              converters={'Score': lambda x: max(float(x)*100, 0)})

In [16]:
conservation_df #Show the conservation by residue

Unnamed: 0,Position,Score,Alignment
0,0,91.8892,MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM...
1,1,0.0000,-------SS----LS-----SS----------------S--SS---...
2,2,0.0000,-------YF----EF-----FF----------------H--FF---...
3,3,0.0000,-------TT----VT-----VT----------------T--VV---...
4,4,0.0000,-------PP-----P-----PP----------------P--PP---...
...,...,...,...
704,704,0.0000,D--E-GDEDD-H--DG-G--DD----GG--G-------G--DDN--...
705,705,0.0000,SD-W-ASDTS-E--TS-E--DT----DD--E-------S--DDI-E...
706,706,0.0000,TSSS-ETSST-S--SS-N--SS----SS--NN-NN---S--SSS-T...
707,707,0.0000,QEEE-RQVEQ-F--EE-E--NE----VE--ET-TT---E--NNEEF...


Now load the aligned sequences from which the peptides will be extracted:

In [17]:
aligned_sequences_df = pd.read_csv(filepath_or_buffer = "aligned.csv",
                                   header = 0,
                                   names = ['Sequence_ID', 'Aligned_Sequences'])

In [18]:
aligned_sequences_df #Show the aligned sequences

Unnamed: 0,Sequence_ID,Aligned_Sequences
0,1,M------------TDNG-QSNSRNAPRITF---GVSDTSD----NN...
1,2,M------------ATPAPP------RAVVF-----ANDNE----TP...
2,3,M------------ATPAAP------RTISF-----ADNND----NQ...
3,4,M------------ASTSGKGKNPADKSVKF----------------...
4,5,M------------A------------TVNW-----GDAVE------...
...,...,...
60,61,M------------A--------------------------------...
61,62,M------------T--------------------------------...
62,63,M------------A--------------------------------...
63,64,M------------A--------------------------------...


Now, before using the interactive plot to filter the peptides by conservation, we will pre-compute all the peptides in the sequences. For that, define the peptide length boundaries you want to analyze and extract the peptides:

In [19]:
max_len = 10 #Maximum length of the peptide
min_len = 8 #Minimum length of the peptide

In [20]:
extracted_peptides_from_sequences = extract_peptides(min_len, max_len, aligned_sequences_df)

Extracting all peptides from sequences




Post-processing for all peptide lengths


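
The extraction step can be pictured as a sliding window over each aligned sequence. The sketch below is a simplified illustration, not the implementation of `extract_peptides`: `sliding_peptides` is a hypothetical function, it treats both length bounds as inclusive, and it skips any window containing a gap, all of which may differ from what the toolkit does.

```python
# Hypothetical function (not part of SARS_Arena).
def sliding_peptides(aligned_sequence, min_len, max_len):
    """Map each peptide length to (start, peptide) pairs for gap-free windows."""
    peptides = {}
    for length in range(min_len, max_len + 1):  # assumption: bounds are inclusive
        windows = []
        for start in range(len(aligned_sequence) - length + 1):
            window = aligned_sequence[start:start + length]
            if "-" not in window:  # assumption: windows spanning gaps are skipped
                windows.append((start, window))
        peptides[length] = windows
    return peptides


toy_peptides = sliding_peptides("MSD-NGPQNQR", min_len=3, max_len=4)
for length, windows in toy_peptides.items():
    print(length, windows)
```

Keeping the start position alongside each peptide makes it possible to look up the conservation scores of the alignment columns it covers.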




#### 4.2 Choose the peptides based on conservation values:
After running the cell, use the sliders below it to set the following parameters:

- *Conservation threshold (CV_cutoff)*: Conservation degree of the peptides.
- *Rolling Median Window length (RMW_cutoff)*: Conservation values vary from position to position, so regions can be smoothed with a rolling median of this window length. Alternatively, set it to 1 to use the conservation values as they are.
- *Peptide Length (Pep_length)*: Fetch peptides of desired length for post-processing.
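
To see how the three parameters above interact, the sketch below smooths a toy score profile with a rolling median and keeps the windows whose smoothed scores all reach the cutoff. It is only an illustration: `conserved_windows` is a hypothetical function, the toy scores are made up, and requiring every position in the window to reach the cutoff is an assumption about the selection rule, not a description of `interactive_plot_selection`.

```python
import pandas as pd


# Hypothetical function (not part of SARS_Arena).
def conserved_windows(scores, cv_cutoff, rmw, pep_length):
    """Return start positions of windows whose smoothed scores all reach cv_cutoff."""
    smoothed = pd.Series(scores).rolling(window=rmw, center=True, min_periods=1).median()
    starts = []
    for start in range(len(smoothed) - pep_length + 1):
        # assumption: every position in the window must reach the cutoff
        if (smoothed.iloc[start:start + pep_length] >= cv_cutoff).all():
            starts.append(start)
    return starts


toy_scores = [90, 85, 5, 88, 92, 91, 10, 12, 8, 95]  # made-up conservation scores
smoothed_starts = conserved_windows(toy_scores, cv_cutoff=80, rmw=3, pep_length=3)
raw_starts = conserved_windows(toy_scores, cv_cutoff=80, rmw=1, pep_length=3)
print("With smoothing (window 3):", smoothed_starts)
print("Without smoothing (window 1):", raw_starts)
```

With a window of 3, the isolated low score at position 2 is smoothed over, so more windows pass the cutoff than with a window of 1.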

In [22]:
interactive_plot_selection(conservation_df, extracted_peptides_from_sequences, min_len, max_len)

[Interactive widget output with the CV_cutoff, RMW_cutoff, and Pep_length sliders]

#### 4.3 Print the peptides:

Print the peptide sequences selected with the thresholds set above.

In [27]:
# Read the selected peptides (one per line); the file is closed automatically
with open("peptides.list", "r") as peptide_file:
    peptide_list = [peptide.strip() for peptide in peptide_file]
print(peptide_list)

['PQNQRNAPR', 'QRRPQGLPN', 'RRPQGLPNN', 'RPQGLPNNT', 'PQGLPNNTA', 'QGLPNNTAS', 'GLPNNTASW', 'SPRWYFYYL', 'TDYKHWPQI', 'DYKHWPQIA', 'YKHWPQIAQ', 'KHWPQIAQF', 'DAYKTFPPT', 'AYKTFPPTE', 'YKTFPPTEP', 'LPQRQKKQQ', 'PQRQKKQQT']


<center><font size="+2"><b>This is the end of Workflow 1C</b></font></center>


You will find a file named *peptides.list* in your working directory; it can be used as input for [Workflow 2](Peptide-HLA_Binding_Prediction_Workflow.ipynb).