# <span style="color:gray">Pangolin SARS-CoV-2 Pipeline Notebook</span>

## Overview

We will run a standard SARS-CoV-2 bioinformatics pipeline using the [Pangolin workflow](https://cov-lineages.org/resources/pangolin/usage.html), entirely within this Jupyter environment.

## Learning Objectives
+ Learn how to run a simple bioinformatics workflow in a Jupyter environment

## Prerequisites
You only need access to a Vertex AI environment.

## Get Started

### Install packages

In [None]:
#change this depending on how many threads are available in your notebook
%env CPU=4
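
If you are unsure how many threads your notebook instance has, Python's standard library can report it. This is a minimal sketch; the variable name `available_threads` is only illustrative and is not used elsewhere in this notebook.

```python
import os

# os.cpu_count() reports the number of logical CPUs on the machine.
# It can return None on unusual platforms, so fall back to 1.
available_threads = os.cpu_count() or 1
print("logical CPUs:", available_threads)
```

Set `CPU` in the cell above to this number or lower.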

In [None]:
#install biopython to import packages below
! pip install biopython

### Install mambaforge
You can also use the default installed conda, but mamba is so much faster! 

In [None]:
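#Mambaforge installers are no longer published; the Miniforge3 installer also ships mamba
#the install prefix stays $HOME/mambaforge so the PATH cell below still works
! curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
! bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge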

In [None]:
#add to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

In [None]:
! mamba install -y -c conda-forge -c bioconda -c etetoolkit sra-tools pangolin ete3 minimap2

In [None]:
#import libraries
import os
from Bio import SeqIO
from Bio import Entrez

### Set up your directory structure and remove files from previous runs if they exist

In [None]:
#create the working directory if it does not exist yet, then move into it
os.makedirs('pangolin_analysis', exist_ok=True)
os.chdir('pangolin_analysis')

In [None]:
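#remove the fasta file from a previous run (the name must match the download cell below)
if os.path.exists('sarscov2_seqs.fasta'):
    os.remove('sarscov2_seqs.fasta')
#-f keeps rm from raising an error when there is nothing to delete
!rm -f sarscov2_*
!rm -f lineage_report.csv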

### Fetch viral sequences from a list of accession IDs using the Bio.Entrez toolkit

In [None]:
#list of accession numbers for the SARS-CoV-2 sequences to analyze
acc_nums=['NC_045512','LR757995','LR757996','OL698718','OL677199','OL672836','MZ914912','MZ916499','MZ908464','MW580573','MW580574','MW580576','MW991906','MW931310','MW932027','MW424864','MW453109','MW453110']
print('the number of sequences we will analyze = ',len(acc_nums))

Let the next cell finish before you run anything else; otherwise you may get an error about too many requests. If that happens, restart your kernel and rerun everything (except the software installation).
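
NCBI documents a limit of three requests per second for clients without an API key. If a request is still rejected, one option is to retry it after a pause instead of restarting the whole notebook. The sketch below is not part of the original workflow: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for any function that downloads a single accession.

```python
import time

def fetch_with_retry(fetch, acc, retries=3, delay=1.0):
    """Call fetch(acc); on failure, wait and retry, doubling the wait each time."""
    for attempt in range(retries):
        try:
            return fetch(acc)
        except Exception as err:
            # Give up and re-raise once the last allowed attempt has failed.
            if attempt == retries - 1:
                raise
            print(f"{acc}: attempt {attempt + 1} failed ({err}); retrying in {delay}s")
            time.sleep(delay)
            delay *= 2
```

In this notebook, `fetch` could be a small function that wraps the `Entrez.efetch(...)` call from the download cell and returns the text it reads.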

In [None]:
#use the Bio.Entrez toolkit within Biopython to download each accession
#and save the sequences to a single fasta file
Entrez.email = "email@example.com"  # tell NCBI who you are (use your own address)
filename = "sarscov2_seqs.fasta"
if not os.path.isfile(filename):
    # download each record and append it to the fasta file
    with open(filename, "a") as out_handle:
        for acc in acc_nums:
            net_handle = Entrez.efetch(
                db="nucleotide", id=acc, rettype="fasta", retmode="text"
            )
            out_handle.write(net_handle.read())
            net_handle.close()
            print("Saved", acc)

In [None]:
#make sure our fasta file has the same number of seqs as the acc_nums list
print('the number of seqs in our fasta file: ')
! grep '>' sarscov2_seqs.fasta | wc -l

In [None]:
#let's peek at our new fasta file
! head sarscov2_seqs.fasta
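
The same record count can be cross-checked in plain Python, which makes it easy to compare against the accession list programmatically. This is a minimal sketch using only the standard library; `count_fasta_records` is a hypothetical helper name, and it assumes every record starts with a `>` header line.

```python
import os

def count_fasta_records(path):
    """Count FASTA records by counting header lines, which start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

fasta_path = "sarscov2_seqs.fasta"  # file written by the download cell
if os.path.exists(fasta_path):
    print("records in FASTA:", count_fasta_records(fasta_path))
```

If the printed number differs from `len(acc_nums)`, the download was probably interrupted and the download cell should be rerun after the cleanup cell.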

### Run pangolin to identify lineages and output alignment
Here we call pangolin, give it our input sequences and the number of threads. We also tell it to output the alignment. The full list of pangolin parameters can be found in the [docs](https://cov-lineages.org/resources/pangolin/usage.html).

In [None]:
! pangolin --help

In [None]:
! pangolin sarscov2_seqs.fasta --alignment --threads $CPU

You can view pangolin's output file, lineage_report.csv (in the pangolin_analysis folder), by double-clicking it in the file browser or by right-clicking and downloading it. What lineages are present in the dataset? Is Omicron in there?
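
To answer these questions without leaving the notebook, you can tally the report with Python's `csv` module. This is a sketch, assuming the report has a column named `lineage` (check the header row of your own file); `lineage_counts` is a hypothetical helper name.

```python
import csv
import os
from collections import Counter

def lineage_counts(report_path):
    """Return a Counter of the values in the report's 'lineage' column."""
    with open(report_path, newline="") as handle:
        return Counter(row["lineage"] for row in csv.DictReader(handle))

report_path = "lineage_report.csv"  # pangolin's default output file name
if os.path.exists(report_path):
    # Print lineages from most to least frequent.
    for lineage, n in lineage_counts(report_path).most_common():
        print(lineage, n)
```

Omicron is designated B.1.1.529, and its sublineages carry the BA alias (for example BA.1), so look for those names in the output.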

## Conclusions
Here you learned how to run a workflow locally on a virtual machine. For more advanced workflows that submit jobs to managed Batch environments, see our other tutorials.

## Clean Up
When you are finished, feel free to delete your virtual machine and any cloud storage buckets you created.