# SARS-ARENA: Structure-based identification of SARS-derived peptides with potential to induce broad protective immunity

## *Workflow 1B* - Sequence Alignment and Peptide Selection

Welcome to the Sequence Alignment and Peptide Selection **Workflow 1B**. This notebook is designed to handle a large number of SARS-CoV-2 protein sequences (**more than 50,000 proteins**). It lets you retrieve protein sequences that are already aligned and search for conserved regions according to your specifications. The protein sequences are aligned on a weekly basis, so the data stay up to date. As in Workflow 1A, the resulting peptide list can be used as input for the subsequent workflow (**Workflow 2: Peptide-HLA Prediction for Conserved SARS-CoV-2 Peptides**).

This workflow consists of three steps:

1. Fetching the pre-computed MSA dataset
2. Computing the conservation score
3. Computing conserved peptides
    
**To run a cell, click on it and press Shift+Enter; the code inside the cell will then be executed. Note that a cell can contain either Code or Markdown. Inside code cells you may also find comments explaining a specific command. These comments are marked with "#".**

### Step 1) Fetch Pre-computed Multiple Sequence Alignment (MSA)

#### 1.1. Necessary imports:
Run this cell to make the necessary imports. You only need to run it once per session; run it again only if you close the session and reopen it.

In [None]:
# System-based imports
import os
import glob

# Data processing
import pandas as pd

# For visualization and interaction purposes
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import *
sns.set(rc={'figure.figsize':(18, 8)}) # Use seaborn style defaults and set the default figure size

# For utility functions used within the code
from SARS_Arena import *

#### 1.2. Setting a working directory:
Choose an appropriate directory for storing all the files or use the default (*Peptide_Extraction_Workflow_1B*).

In [None]:
dir_of_workflow = "./Peptide_Extraction_Workflow_1B"

In [None]:
os.makedirs(dir_of_workflow, exist_ok=True)
os.chdir(dir_of_workflow)

#### 1.3. Setting month/year to recover sequences:

Set the month and year from which you want to retrieve sequences; sequences from that date up to today will be fetched. Use the 3-letter abbreviation for the month (Jan, Feb, ...).

In [None]:
month = "Jul"
year = "2021"

Fetch the sequences:

In [None]:
fetch_precomputed_sequences(year, month)

### Step 2) Computing conservation score

Using the file with the aligned sequences, you can now score each position in terms of conservation. We offer four scoring method options:

- *Jensen-Shannon divergence score* (used as 'js_divergence') (**Recommended**)
- *Shannon Entropy* (used as 'shannon_entropy')
- *Property entropy* (used as 'property_entropy')
- *Von Neumann entropy* (used as 'vn_entropy')

For details on the scoring method options, please consult [Capra & Singh (2007)](https://academic.oup.com/bioinformatics/article/23/15/1875/203579).
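
To give an intuition for what a per-position conservation score measures, here is a minimal sketch of a Shannon-entropy-based score on a toy alignment. It is not the code run by `conservation_analysis`: the function name `shannon_conservation`, the gap handling, and the normalization by the 20-letter amino acid alphabet are all assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_conservation(column, alphabet_size=20):
    """Score one alignment column as 1 - normalized Shannon entropy.

    Gaps ('-') are ignored; a fully conserved column scores 1.0.
    """
    residues = [res for res in column if res != '-']
    if not residues:
        return 0.0  # Assumption: all-gap columns are treated as unconserved
    total = len(residues)
    entropy = 0.0
    for count in Counter(residues).values():
        freq = count / total
        entropy -= freq * math.log(freq)
    # Normalize by the maximum entropy of a 20-letter alphabet (assumption)
    return 1.0 - entropy / math.log(alphabet_size)

# Toy alignment (hypothetical sequences, for illustration only)
toy_alignment = ["MKV", "MKI", "MRL", "M-V"]
columns = list(zip(*toy_alignment))  # Transpose rows into columns
toy_scores = [shannon_conservation(col) for col in columns]
print(toy_scores)
```

The first column (all `M`) gets the maximum score, while the more variable columns score lower. The Jensen-Shannon divergence method additionally compares each column against a background distribution, which this sketch does not attempt.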

For scoring matrices you can choose one of the BLOSUM options:
- BLOSUM62 (**Recommended**)
- BLOSUM35
- BLOSUM40
- BLOSUM45
- BLOSUM50
- BLOSUM80
- BLOSUM100

In [None]:
scoring_method = 'js_divergence'

In [None]:
scoring_matrix = 'blosum62'  # Only used by methods that rely on a scoring matrix (e.g. JS divergence); ignored otherwise (e.g. Shannon entropy)

Now that the arguments have been defined, run the conservation analysis and store the conservation results in the `conservation_file` variable:

In [None]:
conservation_file = conservation_analysis(scoring_method, scoring_matrix)

### Step 3) Computing conserved peptides

In the final step of this workflow you will be able to compute the conservation of peptides based on residue conservation.

#### 3.1 Retrieve residue conservation information:

In [None]:
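# Read the per-position conservation scores. The converter multiplies each
# score by 100 and clamps negative values to 0.
conservation_df = pd.read_csv(filepath_or_buffer = conservation_file,
                              header = 0,
                              names = ['Position', 'Score', 'Alignment'],
                              converters={'Score': lambda x: max(float(x)*100, 0)})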

In [None]:
conservation_df

Now fetch the aligned sequences from which the peptides will be extracted:

In [None]:
aligned_sequences_df = pd.read_csv(filepath_or_buffer = "aligned.csv",
                                   header = 0,
                                   names = ['Aligned_Sequences', 'Sequence_ID'])

In [None]:
aligned_sequences_df #Show the aligned sequences

Now, before using the interactive plot to filter the peptides by conservation, we will pre-compute all the peptides in the sequences. For that, define the peptide length boundaries you want to analyze and extract the peptides:

In [None]:
max_len = 15 #Maximum length of the peptide
min_len = 8 #Minimum length of the peptide

In [None]:
extracted_peptides = extract_peptides(min_len, max_len, aligned_sequences_df)
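
The internals of `extract_peptides` are not shown in this notebook. As a rough illustration of the idea, the sketch below enumerates every window between `min_len` and `max_len` residues over a single aligned sequence. The helper name `sliding_peptides` and the choice to skip windows containing gap characters are assumptions, not the behavior of the actual function.

```python
def sliding_peptides(sequence, min_len, max_len):
    """Return (start, peptide) pairs for every gap-free window of each length."""
    peptides = []
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            if '-' not in window:  # Assumption: windows spanning gaps are skipped
                peptides.append((start, window))
    return peptides

# Hypothetical aligned sequence with one gap, using short lengths for readability
toy_peptides = sliding_peptides("MKV-LA", 2, 3)
print(toy_peptides)
```

Keeping the start position next to each peptide makes it possible to look up the conservation scores of the positions it covers later on.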

#### 3.2 Choose the peptides based on conservation values:
Use the sliders that appear below the cell once it has run to set the following parameters:

- *Conservation threshold (CV_cutoff)*: Conservation degree of the peptides.
- *Rolling Median Window length (RMW_cutoff)*: As conservation values are not homogeneous across positions, the regions can be smoothed with this filter. Alternatively, set it to 1 to use the conservation values as they are.
- *Peptide Length (Pep_length)*: Fetch peptides of desired length for post-processing.
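
The sketch below shows one way these three parameters could interact. It is an assumption about the general approach, not the logic inside `interactive_plot_selection`: the helper names, the centered window, and the rule that every position of a peptide must reach the cutoff are all illustrative choices.

```python
from statistics import median

def rolling_median(scores, window):
    """Centered rolling median; windows are truncated at the sequence ends.

    Assumes an odd window length (a window of 1 returns the raw scores).
    """
    half = window // 2
    return [median(scores[max(0, i - half):i + half + 1])
            for i in range(len(scores))]

def conserved_window_starts(scores, cv_cutoff, rmw_cutoff, pep_length):
    """Return start positions whose whole window reaches the cutoff."""
    smoothed = rolling_median(scores, rmw_cutoff)
    return [start for start in range(len(scores) - pep_length + 1)
            if all(s >= cv_cutoff for s in smoothed[start:start + pep_length])]

# Hypothetical per-position scores with one isolated dip at index 2
toy_conservation = [90, 95, 20, 92, 94, 96, 30, 35, 40, 91]
raw_hits = conserved_window_starts(toy_conservation, cv_cutoff=80, rmw_cutoff=1, pep_length=4)
smoothed_hits = conserved_window_starts(toy_conservation, cv_cutoff=80, rmw_cutoff=3, pep_length=4)
print(raw_hits, smoothed_hits)
```

With a window of 1 the isolated dip excludes every 4-residue window, whereas a window of 3 smooths it out and lets the windows over the first conserved stretch through. A larger window therefore tolerates single variable positions at the cost of hiding them.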

In [None]:
interactive_plot_selection(conservation_df, extracted_peptides, min_len, max_len)

#### 3.3 Print the peptides:

Print the peptide sequences selected with the thresholds set above.

In [None]:
# Read the selected peptides, one per line; the file is closed automatically
with open("peptides.list", "r") as peptide_file:
    peptide_list = [peptide.strip() for peptide in peptide_file]
print(peptide_list)

<font size="+2"><center><b>This is the end of Workflow 1B</b></center></font>


You will find a file named *peptides.list* in your working directory that can be used as input for [Workflow 2](Peptide-HLA_Binding_Prediction_Workflow.ipynb).