# <span style="color:gray">Pangolin SARS-CoV-2 Pipeline Notebook</span>

We are going to run a standard SARS-CoV-2 bioinformatics pipeline using the Pangolin workflow, following the [Pangolin usage docs](https://cov-lineages.org/resources/pangolin/usage.html).

### Required software

In [17]:
#change this depending on how many threads are available in your notebook
CPU=4
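
If you are not sure how many threads your notebook instance has, a minimal sketch like the one below could set the value for you. `detect_cpus` is a hypothetical helper name (it is not part of the original notebook) and it relies only on Python's standard `os.cpu_count()`.

```python
import os

def detect_cpus(default=1):
    """Return the number of CPUs reported by the OS, or `default` if unknown."""
    count = os.cpu_count()  # may return None on some platforms
    return count if count else default

# Hypothetical replacement for the hard-coded value above
CPU = detect_cpus()
print("threads available:", CPU)
```

You may still want to set `CPU` by hand if you share the machine with other jobs.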

In [1]:
%%bash

#install Mamba
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge
rm Mamba*

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   161  100   161    0     0   1047      0 --:--:-- --:--:-- --:--:--  1052
100   665  100   665    0     0   2923      0 --:--:-- --:--:-- --:--:--  2923
100 88.9M  100 88.9M    0     0  57.4M      0  0:00:01  0:00:01 --:--:-- 80.3M
ERROR: File or directory already exists: '/home/jupyter/mambaforge'
If you want to update an existing installation, use the -u option.


In [2]:
%%bash

#move mamba executable to your path
cp /home/jupyter/mambaforge/bin/mamba /opt/conda/bin

In [3]:
%%bash

#install biopython to import packages below
pip install biopython



Now we want to create a conda/mamba env that has all of our necessary dependencies

In [4]:
#you can look at the yaml file that specifies which programs we want to install
#you can also pin specific versions (for example, - sra-tools=2.11.0);
#here we just use the latest conda version
!cat pangolin.yaml

name: pangolin
channels:
  - bioconda
  - conda-forge
  - defaults
  - eaton-lab
  
dependencies:
  - sra-tools
  - ipyrad
  - toytree
  - pangolin
  - iqtree


In [5]:
#create the environment. Here we use mamba because it is faster than conda
!mamba env create -f pangolin.yaml


CondaValueError: prefix already exists: /home/jupyter/mambaforge/envs/pangolin



In [6]:
#give it the whole path to the env because otherwise it can't find the env
#if you want to play with it add a cell and type 'conda activate pangolin' 
#or 'source activate pangolin'
#note: each '!' command runs in its own subshell, so this activation does not carry over to later cells
!source activate /home/jupyter/mambaforge/envs/pangolin

In [7]:
!iqtree

IQ-TREE multicore version 2.1.4-beta COVID-edition for Linux 64-bit built Jun 24 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Command-line examples (replace 'iqtree2 ...' by actual path to executable):

1. Infer maximum-likelihood tree from a sequence alignment (example.phy)
   with the best-fit model automatically selected by ModelFinder:
     iqtree2 -s example.phy

2. Perform ModelFinder without subsequent tree inference:
     iqtree2 -s example.phy -m MF
   (use '-m TEST' to resemble jModelTest/ProtTest)

3. Combine ModelFinder, tree search, ultrafast bootstrap and SH-aLRT test:
     iqtree2 -s example.phy --alrt 1000 -B 1000

4. Perform edge-linked proportional partition model (example.nex):
     iqtree2 -s example.phy -p example.nex
   (replace '-p' by '-Q' for edge-unlinked model)

5. Find best partition scheme by possibly merging partitions:
     iqtree2 -s example.phy -p example.nex -m MF

In [25]:
#import libraries
import os
import pandas as pd
from Bio import SeqIO
from Bio import Entrez
import toytree
import ipyrad.analysis as ipa

### Set up your directory structure and remove files from previous runs if they exist

In [9]:
cd /home/jupyter/cloud-lab-training/GCP/notebooks/pangolin/

/home/jupyter/cloud-lab-training/GCP/notebooks/pangolin


In [10]:
if not os.path.exists('pangolin_analysis'):
    os.mkdir('pangolin_analysis')
os.chdir('pangolin_analysis')

In [11]:
if os.path.exists('sarscov2_seqs.fasta'):
    os.remove('sarscov2_seqs.fasta')
!rm sarscov2_*
!rm lineage_report.csv

rm: cannot remove 'sarscov2_*': No such file or directory
rm: cannot remove 'lineage_report.csv': No such file or directory
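
The `rm` messages above are harmless; they only mean there was nothing to delete. If you would rather avoid them, here is a minimal sketch that does the same cleanup from Python. `remove_matching` is a hypothetical helper name, and the patterns are the same ones used in the cell above.

```python
import glob
import os

def remove_matching(patterns):
    """Delete every file matching the given glob patterns; return the removed names."""
    removed = []
    for pattern in patterns:
        for path in glob.glob(pattern):
            if os.path.isfile(path):  # skip directories that happen to match
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Calling `remove_matching(["sarscov2_*", "lineage_report.csv"])` returns an empty list when nothing matches, instead of printing an error.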


### Fetch viral sequences using a list of accession IDs

In [12]:
#give a list of accession numbers for SARS-CoV-2 sequences
acc_nums=['NC_045512','LR757995','LR757996','OL698718','OL677199','OL672836','MZ914912','MZ916499','MZ908464','MW580573','MW580574','MW580576','MW991906','MW931310','MW932027','MW424864','MW453109','MW453110']
print('the number of sequences we will analyze = ',len(acc_nums))

the number of sequences we will analyze =  18


Let this cell finish before you run the next one; otherwise NCBI may return an error about too many requests. If that happens, restart your kernel and rerun everything (except the software installation).

In [13]:
#use the Bio.Entrez toolkit within biopython to download each accession
#and save all of the sequences to a single fasta file
Entrez.email = "email@example.com"  # Always tell NCBI who you are
filename = "sarscov2_seqs.fasta"
if not os.path.isfile(filename):
    # Downloading...
    with open(filename, "a") as out_handle:
        for acc in acc_nums:
            net_handle = Entrez.efetch(
                db="nucleotide", id=acc, rettype="fasta", retmode="text"
            )
            out_handle.write(net_handle.read())
            net_handle.close()
            print("Saved",acc)

Saved NC_045512
Saved LR757995
Saved LR757996
Saved OL698718
Saved OL677199
Saved OL672836
Saved MZ914912
Saved MZ916499
Saved MZ908464
Saved MW580573
Saved MW580574
Saved MW580576
Saved MW991906
Saved MW931310
Saved MW932027
Saved MW424864
Saved MW453109
Saved MW453110
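
If a download does fail partway through (for example with a "too many requests" error), one option is to retry the failing accession after a short pause rather than restarting the kernel. The sketch below is only an illustration: `fetch_with_retry` is a hypothetical helper, and `fetch` stands for any function you supply that takes an accession and returns its FASTA text.

```python
import time

def fetch_with_retry(fetch, acc, retries=3, delay=1.0):
    """Call fetch(acc); on failure wait and try again, up to `retries` attempts in total."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(acc)
        except Exception as err:  # e.g. an HTTP error raised by the fetch function
            if attempt == retries:
                raise  # give up and show the original error
            print(f"attempt {attempt} for {acc} failed ({err}); retrying")
            time.sleep(delay * attempt)  # wait a little longer each time
```

In this notebook, `fetch` could be a small wrapper around the `Entrez.efetch(...)` call shown above that reads and returns the text.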


In [14]:
#make sure our fasta file has the same number of seqs as the acc_nums list
print('the number of seqs in our fasta file: ')
!grep '>' sarscov2_seqs.fasta | wc -l

the number of seqs in our fasta file: 
18
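
Counting headers tells us how many records we have, but not which ones. As a rough sketch, the helpers below (hypothetical names, standard library only) list any accession from `acc_nums` that did not make it into the FASTA file. It assumes each header starts with the accession followed by a version suffix such as `.2`, as in the NCBI output.

```python
def fasta_accessions(path):
    """Return the accession (without version suffix) of every record in a FASTA file."""
    accessions = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # ignore empty headers
                    accessions.append(fields[0].split(".")[0])  # 'NC_045512.2' -> 'NC_045512'
    return accessions

def missing_accessions(expected, path):
    """List the expected accessions that have no record in the FASTA file."""
    found = set(fasta_accessions(path))
    return [acc for acc in expected if acc not in found]
```

For example, `missing_accessions(acc_nums, "sarscov2_seqs.fasta")` should return an empty list when every download succeeded.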


In [15]:
#let's peek at our new fasta file
!head sarscov2_seqs.fasta

>NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC
GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG
CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT
GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT


### Run pangolin to identify lineages and output alignment
Here we call pangolin, give it our input sequences and the number of threads. We also tell it to output the alignment. The full list of pangolin parameters can be found in the [docs](https://cov-lineages.org/resources/pangolin/usage.html).

In [18]:
!pangolin sarscov2_seqs.fasta --alignment --threads $CPU

All dependencies satisfied.
The query file is:/home/jupyter/cloud-lab-training/GCP/notebooks/pangolin/pangolin_analysis/sarscov2_seqs.fasta
** Running sequence QC **
Number of sequences detected: 18
Total passing QC: 18

Data files found:
Trained model:	/opt/conda/lib/python3.7/site-packages/pangoLEARN/data/decisionTree_v1.joblib
Header file:	/opt/conda/lib/python3.7/site-packages/pangoLEARN/data/decisionTreeHeaders_v1.joblib
Designated hash:	/opt/conda/lib/python3.7/site-packages/pangoLEARN/data/lineages.hash.csv
Job stats:
job                     count    min threads    max threads
--------------------  -------  -------------  -------------
add_failed_seqs             1              1              1
align_to_reference          1              1              1
all                         1              1              1
generate_report             1              1              1
get_constellations          1              1      

In [36]:
df = pd.read_csv('lineage_report.csv')
df

Unnamed: 0,taxon,lineage,conflict,ambiguity_score,scorpio_call,scorpio_support,scorpio_conflict,version,pangolin_version,pangoLEARN_version,pango_version,status,note
0,NC_045512.2_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_Wuhan-Hu-1__complete_genome,B,,,,,,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,Assigned from designation hash.
1,LR757995.1_Severe_acute_respiratory_syndrome_coronavirus_2_genome_assembly__chromosome:_whole_genome,A,,,,,,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,Assigned from designation hash.
2,LR757996.1_Severe_acute_respiratory_syndrome_coronavirus_2_genome_assembly__chromosome:_whole_genome,B,,,,,,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,Assigned from designation hash.
3,MW580573.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MD-MDH-0830/2021_ORF1ab_polyprotein_(ORF1ab)__ORF1a_polyprotein_(ORF1ab)__surface_glycoprotein_(S)__ORF3a_protein_(ORF3a)__envelope_protein_(E)__membrane_glyc...,B.1.351,,,Beta (B.1.351-like),0.857,0.0,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,scorpio call: Alt alleles 12; Ref alleles 0; Amb alleles 0; Oth alleles 2
4,MW580574.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MD-MDH-0831/2021__complete_genome,B.1.351,,,Beta (B.1.351-like),0.857,0.0,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,scorpio call: Alt alleles 12; Ref alleles 0; Amb alleles 0; Oth alleles 2
5,MW580576.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MD-MDH-0833/2021__complete_genome,B.1.351,,,Beta (B.1.351-like),0.857,0.0,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,scorpio call: Alt alleles 12; Ref alleles 0; Amb alleles 0; Oth alleles 2
6,MW991906.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/CA-CDC-FG-021330/2021_ORF1ab_polyprotein_(ORF1ab)_and_ORF1a_polyprotein_(ORF1ab)_genes__partial_cds;_surface_glycoprotein_(S)__ORF3a_protein_(ORF3a)__envelope...,B.1.617.2,,,Delta (B.1.617.2-like),1.0,0.0,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,scorpio call: Alt alleles 13; Ref alleles 0; Amb alleles 0; Oth alleles 0
7,MW931310.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/IN-CDC-STM-000045992/2021__complete_genome,B.1.617.2,,,Delta (B.1.617.2-like),0.923,0.077,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,scorpio call: Alt alleles 12; Ref alleles 1; Amb alleles 0; Oth alleles 0
8,MW932027.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MA-CDC-STM-000044850/2021_ORF1ab_polyprotein_(ORF1ab)__ORF1a_polyprotein_(ORF1ab)__surface_glycoprotein_(S)__ORF3a_protein_(ORF3a)__envelope_protein_(E)__memb...,B.1.617.2,,,Delta (B.1.617.2-like),0.923,0.077,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,scorpio call: Alt alleles 12; Ref alleles 1; Amb alleles 0; Oth alleles 0
9,MW424864.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/CA-LACPHL-AF00051/2020__complete_genome,B.1.427,,,,,,PANGO-v1.2.97,3.1.16,2021-11-18,v1.2.97,passed_qc,Assigned from designation hash.


In [37]:
df=df[['taxon','lineage','scorpio_call']]
df.head()

Unnamed: 0,taxon,lineage,scorpio_call
0,NC_045512.2_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_Wuhan-Hu-1__complete_genome,B,
1,LR757995.1_Severe_acute_respiratory_syndrome_coronavirus_2_genome_assembly__chromosome:_whole_genome,A,
2,LR757996.1_Severe_acute_respiratory_syndrome_coronavirus_2_genome_assembly__chromosome:_whole_genome,B,
3,MW580573.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MD-MDH-0830/2021_ORF1ab_polyprotein_(ORF1ab)__ORF1a_polyprotein_(ORF1ab)__surface_glycoprotein_(S)__ORF3a_protein_(ORF3a)__envelope_protein_(E)__membrane_glyc...,B.1.351,Beta (B.1.351-like)
4,MW580574.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MD-MDH-0831/2021__complete_genome,B.1.351,Beta (B.1.351-like)


In [43]:
#let's take a quick look at what lineages our dataset includes
lineages=df.lineage.unique()
call = df.scorpio_call.unique()
call, lineages

(array([nan, 'Beta (B.1.351-like)', 'Delta (B.1.617.2-like)',
        'Omicron (B.1.1.529-like)'], dtype=object),
 array(['B', 'A', 'B.1.351', 'B.1.617.2', 'B.1.427', 'B.1.1.529',
        'B.1.1.70'], dtype=object))
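
`unique()` shows which lineages are present but not how many sequences fall into each one. The sketch below counts them straight from the report using only the standard library; `count_lineages` is a hypothetical helper, and the `lineage` column name is taken from the report shown above.

```python
import csv
from collections import Counter

def count_lineages(report_path, column="lineage"):
    """Count how many sequences were assigned to each value in `column`."""
    with open(report_path, newline="") as handle:
        return Counter(row[column] for row in csv.DictReader(handle))
```

Something like `count_lineages("lineage_report.csv").most_common()` would then list the lineages from most to least frequent. Passing `column="scorpio_call"` does the same for the scorpio calls.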

You can view pangolin's output file, lineage_report.csv (in the pangolin_analysis folder), by double-clicking it or by right-clicking and downloading it. What lineages are present in the dataset? Is Omicron in there?

### Run iqtree to estimate maximum likelihood tree for our sequences
IQ-TREE can find the best-fit nucleotide substitution model for the data, but here we assign a model (HKY) to save time and estimate the phylogeny without any bootstrap support values.

In [44]:
#run iqtree with threads = $CPU; if you leave out -m, it will run a
#phylogenetic model search before the tree search
!iqtree -s sequences.aln.fasta -nt $CPU -m HKY --prefix sarscov2_tree --redo-tree

IQ-TREE multicore version 2.1.4-beta COVID-edition for Linux 64-bit built Jun 24 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    cloud-lab-notebook (AVX2, FMA3, 14 GB RAM)
Command: iqtree -s sequences.aln.fasta -nt 4 -m HKY --prefix sarscov2_tree --redo-tree
Seed:    667446 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Tue Dec  7 22:34:48 2021
Kernel:  AVX+FMA - 4 threads (4 CPU cores detected)

Reading alignment file sequences.aln.fasta ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 18 sequences with 29903 columns, 193 distinct patterns
109 parsimony-informative, 33 singleton sites, 29761 constant sites
LR757995.1_Severe_acute_respiratory_syndrome_coronavirus_2_genome_assembly__chromosome:_whole_genome -> LR757995.1_Severe_acute_respiratory_syndrome_coronavirus_2_genome_assembly__chromosome__whole_genome
LR757996.1_Severe_acute_
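
The log above shows IQ-TREE rewriting the very long taxon names. If you prefer short tip labels, one option is to trim each header to its accession before building the tree. This is only a sketch: `short_name` and `shorten_fasta_headers` are hypothetical helpers, and the pattern assumes every header starts with a GenBank or RefSeq style accession such as `LR757995.1` or `NC_045512.2`.

```python
import re

# Assumed header shape: letters, optional underscore, digits, optional '.version'
ACCESSION = re.compile(r"^([A-Z]+_?\d+(?:\.\d+)?)")

def short_name(header):
    """Return the leading accession of a FASTA header, or the header unchanged."""
    match = ACCESSION.match(header)
    return match.group(1) if match else header

def shorten_fasta_headers(src, dst):
    """Copy a FASTA file, replacing each header with its leading accession."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(">" + short_name(line[1:].strip()) + "\n")
            else:
                fout.write(line)  # sequence lines are copied as-is
```

Keep in mind that shortened tip labels would no longer match the `taxon` column of the pangolin report exactly, so any lookup between the two would need the same trimming.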

### Visualize the tree with toytree

In [45]:
#Define the tree file
tre = toytree.tree('sarscov2_tree.treefile')

In [52]:
#let's see if any of our samples are Omicron
omicron=df[df['scorpio_call'] == ('Omicron (B.1.1.529-like)')]
omicron

Unnamed: 0,taxon,lineage,scorpio_call
12,OL698718.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/USA/MN-MDH-18236/2021_ORF1ab_polyprotein_(ORF1ab)__ORF1a_polyprotein_(ORF1ab)__surface_glycoprotein_(S)__ORF3a_protein_(ORF3a)__envelope_protein_(E)__membrane_gly...,B.1.1.529,Omicron (B.1.1.529-like)
13,OL677199.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/CAN/ON-NML-249359/2021_ORF1ab_polyprotein_(ORF1ab)__ORF1a_polyprotein_(ORF1ab)__surface_glycoprotein_(S)__ORF3a_protein_(ORF3a)__envelope_protein_(E)__membrane_gl...,B.1.1.529,Omicron (B.1.1.529-like)
14,OL672836.1_Severe_acute_respiratory_syndrome_coronavirus_2_isolate_SARS-CoV-2/human/BEL/rega-20174/2021__complete_genome,B.1.1.529,Omicron (B.1.1.529-like)


In [46]:
#draw the tree
rtre = tre.root(wildcard="OL")
rtre.draw(tip_labels_align=True);
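
To make the Omicron samples easier to spot, you could color their tip labels. The helper below is a hypothetical, plain-Python sketch that builds one color per tip name. It assumes, as in the rooting step above, that the Omicron accessions are the ones starting with `OL`.

```python
def tip_colors(tip_names, highlight_prefixes, highlight="red", default="black"):
    """Return one color per tip: `highlight` if the name starts with any prefix."""
    prefixes = tuple(highlight_prefixes)
    return [highlight if name.startswith(prefixes) else default for name in tip_names]
```

With toytree, the resulting list could probably be passed to the drawing call, along the lines of `rtre.draw(tip_labels_colors=tip_colors(rtre.get_tip_labels(), ["OL"]))`. Treat that call as an assumption and check the argument names and tip ordering against the toytree version you have installed.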

### Based on this tree, what other lineages would you say the Omicron samples evolved from? What is their closest relative?

You can also visualize the tree by downloading it and opening it in FigTree.

And that is all! You now know how to run workflows in notebooks in Cloud Lab.