# SARS-ARENA: Structure-based identification of SARS-derived peptides with potential to induce broad protective immunity

## *Workflow 1A* - Sequence Alignment and Peptide Selection

Welcome to the Sequence Alignment and Peptide Selection **Workflow 1A**. This notebook lets you search for and download SARS-CoV-2 proteins from NCBI that were deposited within a specified date range. You can then align these protein sequences and search for conserved regions according to your specifications. The resulting peptide list can be used as input for the subsequent workflow (**Workflow 2: Peptide-HLA Prediction for Conserved SARS-CoV-2 Peptides**).

This workflow consists of five steps:

1. Fetch dataset from NCBI Virus,
2. Extract and filter the sequence file,
3. Multiple Sequence Alignment,
4. Computing conservation score, and
5. Computing conserved peptides.
    
If you need to align **more than 50,000 proteins**, we recommend using [Workflow 1B](http://127.0.0.1:8888/notebooks/Peptide_Extraction_Workflow_1B.ipynb) instead.

**In order to run a cell, first click on the cell, then press shift-enter. The code inside the cell will then be executed. Note that the content of the cell can be executed as Code or Markdown. Also, inside the cell you may find comments to explain a specific command. These comments are marked with "#"**

### Step 1) Fetch dataset from NCBI Virus

In this first part, you will be able to download the SARS-CoV-2 protein dataset directly from NCBI Virus.

In [None]:
# Mauricio: This is necessary given the Docker image that we have. We can ask Anja, who
# uploaded the latest version of the image (the one that included 3pHLA), to also add curl; we need it to download the sequences
!apt-get update -y
!apt-get install curl -y

#### 1.1. Necessary imports:
Run this cell to make the necessary imports. It only needs to be run once per session; run it again if you close the session and reopen it.

In [None]:
# System-based imports
import os

# Subprocess module
from subprocess import PIPE, run
import multiprocessing

# Data processing
import pandas as pd

# Visualization
import seaborn as sns
sns.set(rc={'figure.figsize':(18, 8)}) # Use seaborn style defaults and set the default figure size

# For utility functions used within the code
from SARS_Arena import *

#### 1.2. Setting a working directory:
Choose an appropriate directory for storing all the files or use the default (*Peptide_Extraction_Workflow_1A*).

In [None]:
dir_of_workflow_1 = "Peptide_Extraction_Workflow_1A"

In [None]:
os.makedirs(dir_of_workflow_1, exist_ok=True)
os.chdir(dir_of_workflow_1)

#### 1.3 Arguments for the API:
The protein sequences will be retrieved from online databases. For each tab, change the arguments according to your preferences:

In [None]:
tab = create_tab(os.getcwd())
dataset_selection(tab)

Validate the selection you made above before proceeding.

**WARNING**: Do not skip this cell after you have specified your arguments above, as it needs to be run to update the values!

In [None]:
Virus_Type, Protein, Completeness, Host, Refseq, Geographic_region, Isolation_source, Pangolin_lineage, Released_Dates = extract_elements(tab)

print("The following is the selection made using the UI above. Return to it to make corrections if needed!")
print('-------')

# Virus
print("Virus type: " + Virus_Type)

# Protein
print("Protein type: " + Protein)

# Sequence type (RefSeq)
print("Sequence type: " + Refseq)

# Completeness
print("Completeness type: " + Completeness)

# Host
print("Host type: " + Host)

# Geography
print("Geographic type: " + Geographic_region[0])
print("Geographic selection: " + str(Geographic_region[1]))

# Isolation source
print("Isolation source: " + str(Isolation_source))

# Pangolin lineage
print("Pangolin lineage: " + str(Pangolin_lineage))

# Dates
print("From " + str(Released_Dates[0]) + " to " + str(Released_Dates[1]))

### Step 2) Extract and filter the sequence file:

Now that you have defined which sequences should be analyzed, it is time to extract them. In this section, Python code automatically extracts the .fasta file that contains the protein sequences.

#### 2.1 Download the .fasta sequence file from NCBI Virus:

In [None]:
sequence_file = call_ncbi_virus(Virus_Type, Protein, Completeness, Host, Refseq, Geographic_region, 
                                     Isolation_source, Pangolin_lineage, Released_Dates)
print(sequence_file)

#### 2.2 Check the total number of protein sequences

In [None]:
no_of_sequences = count_sequences(sequence_file)
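As a rough sanity check on the number reported above, FASTA records can be counted by counting header lines, since each record starts with a line beginning with `>`. The sketch below is not the implementation of `count_sequences`; the helper name `count_fasta_records` and the toy sequences are hypothetical and used only for illustration.

```python
from io import StringIO

def count_fasta_records(handle):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))

# Toy FASTA content, invented for illustration only
toy_fasta = StringIO(
    ">seq1\nMFVFLVLLPLVSSQ\n"
    ">seq2\nMFVFLVLLPLVSSQ\n"
    ">seq3\nMFVFFVLLPLVSSQ\n"
)
n_records = count_fasta_records(toy_fasta)
print(n_records)
```

To apply the same idea to a real file, you could pass an open file handle, for example `with open(sequence_file) as handle: count_fasta_records(handle)`, assuming `sequence_file` holds the path to a plain-text .fasta file.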

### Step 3) Multiple Sequence Alignment

In this step you will perform the Multiple Sequence Alignment based on the sequences you have chosen. This is performed with the software [MAFFT](https://mafft.cbrc.jp/alignment/software/). 

#### 3.1 Run in parallel (optional)
In case your machine has multiple cores, you can select a specific number of cores to run the alignment. If your machine has a single core, you can move to step 3.2.

Run the cell below in case you don't know how many cores your machine has.

In [None]:
print("Number of cores :", multiprocessing.cpu_count())

Now, define the number of cores to be used (`ncores`).

In [None]:
ncores = 8
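If you are unsure whether the value you chose exceeds what the machine offers, one option is to cap it at the detected core count. This is a minimal sketch; the names `requested_cores` and `safe_ncores` are hypothetical and not part of the workflow.

```python
import multiprocessing

requested_cores = 8  # the value you would like to use
available_cores = multiprocessing.cpu_count()

# Never request more cores than the machine has, and always use at least one
safe_ncores = max(1, min(requested_cores, available_cores))
print(safe_ncores)
```

You could then assign `ncores = safe_ncores` before running the alignment.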

#### 3.2 Run the Multiple Sequence Alignment 
After running this cell, you will be able to see a consensus sequence for this alignment. Choose a threshold for calculating the consensus sequence: positions where the most frequent amino acid falls below this threshold are reported as an unknown amino acid.

**WARNING:** Be aware that the more sequences you have in total, the longer MAFFT will take to finish. As suggested at the start of the workflow, if you only want to run the conservation analysis on sequences that are already aligned, use Workflows 1B and 1C instead!

In [None]:
threshold = 0.5
consensus_sequence = run_msa(sequence_file, ncores, threshold)
print(consensus_sequence)
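To illustrate how a frequency threshold shapes a consensus sequence, here is a minimal sketch on a toy alignment. It is not the code behind `run_msa`: the function `threshold_consensus`, the use of `X` as the unknown symbol, the `>=` comparison, and the toy sequences are all assumptions made for illustration, and gaps are not treated specially.

```python
from collections import Counter

def threshold_consensus(aligned, threshold=0.5, unknown="X"):
    """Keep the most frequent symbol in each column if its frequency
    reaches the threshold; otherwise emit the unknown symbol."""
    consensus = []
    for column in zip(*aligned):
        symbol, count = Counter(column).most_common(1)[0]
        consensus.append(symbol if count / len(column) >= threshold else unknown)
    return "".join(consensus)

# Toy alignment, invented for illustration only
toy_alignment = ["MKTA", "MKSA", "MRGA", "MKCA"]
toy_consensus = threshold_consensus(toy_alignment, threshold=0.5)
print(toy_consensus)
```

In this toy case the third column has four different residues, each at a frequency of 0.25, so it falls below the 0.5 threshold and is reported as unknown. Raising the threshold makes the consensus stricter and produces more unknown positions.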

**Alignment Scoring**: After the *MAFFT* algorithm performs the alignment, the **aligned.faa** file will contain all the aligned sequences.

### Step 4) Computing conservation score

Using the file with the aligned sequences, you can now score each position in terms of conservation. We offer four scoring method options:

- *Jensen-Shannon divergence score* (used as 'js_divergence') (**Recommended**)
- *Shannon Entropy* (used as 'shannon_entropy')
- *Property entropy* (used as 'property_entropy')
- *Von Neumann entropy* (used as 'vn_entropy')

For details on the scoring method options, please consult [Capra & Singh (2007)](https://academic.oup.com/bioinformatics/article/23/15/1875/203579).
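As an intuition for what an entropy-based score measures, the sketch below computes a normalized Shannon entropy conservation value for each column of a toy alignment. It is a simplified illustration, not the scoring code used by this workflow: the function name `shannon_conservation`, the normalization by a 20-letter alphabet, and the toy sequences are assumptions, and published implementations may additionally handle gaps, sequence weighting, or pseudocounts.

```python
import math
from collections import Counter

def shannon_conservation(column, alphabet_size=20):
    """Return 1 - H / log2(alphabet_size) for one alignment column, so that
    1.0 means fully conserved and lower values mean more variable."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)

# Toy alignment, invented for illustration only
toy_alignment = ["MKTA", "MKSA", "MRGA", "MKCA"]
toy_scores = [shannon_conservation(col) for col in zip(*toy_alignment)]
print([round(score, 3) for score in toy_scores])
```

Columns with a single residue score 1.0, and the score drops as the residue distribution in a column becomes more even.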

For scoring matrices you can choose one of the BLOSUM options:
- BLOSUM62 (**Recommended**)
- BLOSUM35
- BLOSUM40
- BLOSUM45
- BLOSUM50
- BLOSUM80
- BLOSUM100

In [None]:
scoring_method = 'js_divergence'

In [None]:
# The scoring matrix only applies to methods that actually use one for calculating
# conservation (e.g. JS divergence); otherwise it is ignored (e.g. Shannon entropy).
scoring_matrix = 'blosum62'

Now that the arguments have been defined, run the conservation analysis and store the conservation results in the `conservation_file` variable:

In [None]:
conservation_file = conservation_analysis(scoring_method, scoring_matrix)

### Step 5) Computing conserved peptides

In the final step of this workflow you will be able to compute the conservation of peptides based on residue conservation.

#### 5.1 Retrieve information on residue conservation:

In [None]:
conservation_df = pd.read_csv(filepath_or_buffer = conservation_file,
                              header = 0,
                              names = ['Position', 'Score', 'Alignment'],
                              converters={'Score': lambda x: max(float(x)*100, 0)})

In [None]:
conservation_df #Show the conservation by residue

Now fetch the aligned sequences, from which the peptides will be extracted:

In [None]:
aligned_sequences_df = pd.read_csv(filepath_or_buffer = "aligned.csv",
                                   header = 0,
                                   names = ['Sequence_ID', 'Aligned_Sequences'])

In [None]:
aligned_sequences_df #Show the aligned sequences

Now, before using the interactive plot to filter the peptides by conservation, we will pre-compute all the peptides in the sequences. For that, define the peptide length boundaries you want to analyze and extract the peptides:

In [None]:
max_len = 10 #Maximum length of the peptide
min_len = 8 #Minimum length of the peptide
extracted_peptides_from_sequences = extract_peptides(min_len, max_len, aligned_sequences_df)
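Conceptually, pre-computing the peptides amounts to sliding windows of every length between `min_len` and `max_len` over each sequence. The sketch below shows that idea on a single toy sequence. It is not the implementation of `extract_peptides`: the function `sliding_peptides`, the choice to skip windows containing a gap character, and the toy sequence are assumptions made for illustration.

```python
def sliding_peptides(sequence, min_len, max_len, gap="-"):
    """Return (start, peptide) pairs for every gap-free window of
    min_len..max_len residues; start is the 0-based alignment column."""
    peptides = []
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            if gap not in window:
                peptides.append((start, window))
    return peptides

# Toy aligned sequence with one gap, invented for illustration only
toy_sequence = "MFVF-LVLLPLV"
toy_peptides = sliding_peptides(toy_sequence, min_len=4, max_len=5)
print(toy_peptides)
```

Keeping the start column alongside each peptide makes it possible to look up the conservation scores of the positions the peptide covers.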

#### 5.2 Choose the peptides based on conservation values:
After running the cell, use the sliders that appear below it to set the following parameters:

- *Conservation threshold (CV_cutoff)*: Conservation degree of the peptides.
- *Rolling Median Window length (RMW_cutoff)*: As conservation values are different and not homogeneous for each position, the regions can be smoothed based on this filter. Alternatively, you can set to 1 to take conservation as it is. 
- *Peptide Length (Pep_length)*: Fetch peptides of desired length for post-processing.
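The effect of the rolling median window and the conservation threshold can be previewed with pandas on a toy score column. This is a sketch of the general technique, not the code behind the interactive plot: the variable names `rmw_length` and `cv_cutoff`, the centered window, and the toy scores are assumptions made for illustration.

```python
import pandas as pd

# Toy per-position scores on a 0-100 scale, invented for illustration only
toy_df = pd.DataFrame({
    "Position": range(8),
    "Score": [95.0, 90.0, 20.0, 92.0, 94.0, 40.0, 35.0, 30.0],
})

rmw_length = 3    # rolling median window length (stand-in for RMW_cutoff)
cv_cutoff = 80.0  # conservation threshold (stand-in for CV_cutoff)

# Centered rolling median; min_periods=1 keeps the edge positions defined
toy_df["Smoothed"] = (
    toy_df["Score"].rolling(window=rmw_length, center=True, min_periods=1).median()
)
toy_df["Conserved"] = toy_df["Smoothed"] >= cv_cutoff
print(toy_df)
```

Here the isolated low score at position 2 is smoothed over by its conserved neighbors, whereas with `rmw_length = 1` the raw scores are used unchanged and that position would fall below the threshold.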

In [None]:
interactive_plot_selection(conservation_df, extracted_peptides_from_sequences, min_len, max_len)

#### 5.3 Print the peptides:

Print the peptide sequences selected with the thresholds set above.

In [None]:
# Read the selected peptides, one per line; the with-block closes the file automatically
with open("peptides.list", "r") as peptide_file:
    peptide_list = [peptide.strip() for peptide in peptide_file]
print(peptide_list)

<center><font size="+2"><b>This is the end of Workflow 1A</b></font></center>


You will find a file named *peptides.list* in your working directory; it can be used as input for [Workflow 2](Peptide-HLA_Binding_Prediction_Workflow_2.ipynb).