GOAL: alignment skills
* BioPython
* Python
* MUSCLE
* algorithm development
* Computing a conservation score

[Large-Scale Sequence Analysis of Hemagglutinin of Influenza A Virus Identifies Conserved Regions Suitable for Targeting an Anti-Viral Response](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0009268)

They are aligning proteins, not RNA.

1. Download full-length HA sequences
2. Separate files for each subtype (H1, H2, etc.)
3. Align each subtype using MUSCLE 3.6, filtering out frameshifts and partials
4. Align the profile of each multiple alignment to the other profiles using the profile-profile aligner COMPASS
5. Compute a conservation score for each column of the final alignment
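
To get a feel for step 5, here is a minimal sketch of a per-column conservation score. The notes above do not record which scoring scheme the paper used, so this is an assumption on my part: a normalized Shannon entropy, scaled by the fraction of non-gap residues in the column. The function name `column_conservation` and the `alphabet_size=20` default (protein alphabet) are my own choices for illustration.

```python
import math
from collections import Counter


def column_conservation(aligned_seqs, alphabet_size=20, gap_chars="-."):
    """Score each alignment column from 0.0 (variable or all gaps) to 1.0 (fully conserved).

    Assumed scheme (not taken from the paper): 1 - H / log2(alphabet_size),
    multiplied by the fraction of sequences that are not gapped in the column.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(alphabet_size)
    scores = []
    for col in range(length):
        residues = [seq[col].upper() for seq in aligned_seqs if seq[col] not in gap_chars]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        occupancy = total / len(aligned_seqs)
        scores.append((1.0 - entropy / max_entropy) * occupancy)
    return scores


# Toy protein alignment: column 0 is invariant, column 1 has one substitution,
# column 3 is invariant but gapped in one sequence.
toy_alignment = ["MKAI", "MKTI", "MRA-"]
toy_scores = column_conservation(toy_alignment)
print([round(s, 3) for s in toy_scores])
```

For nucleotide alignments, `alphabet_size=4` would be the matching choice.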

[Identification of potential conserved RNA secondary structure throughout influenza A coding regions](https://rnajournal.cshlp.org/content/17/6/991.full_)

1. Interesting that this paper aligns first based on protein, then RNA
2. six full influenza sets were downloaded from human, avian, and swine hosts
3. divide each by segment
4. coding regions translated with BioEdit
5. protein sequences aligned with ClustalW
6. aligned sequences turned back into nt, submitted to RNAz 2.0 for conserved structures
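
Step 6 (turning the aligned proteins back into nucleotides) can be sketched as follows. This is my own minimal illustration, not the tool the paper used: it assumes the unaligned coding sequence is in frame, starts at the first codon, and has at least one codon per aligned residue. `thread_codons` is a hypothetical helper name.

```python
def thread_codons(aligned_protein, nucleotide_cds, gap="-"):
    """Map the gaps of an aligned protein back onto its unaligned coding sequence.

    Each residue is replaced by its codon and each protein gap by three
    nucleotide gaps. Codons beyond the last residue (for example a trailing
    stop codon) are dropped.
    """
    if len(nucleotide_cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [nucleotide_cds[i:i + 3] for i in range(0, len(nucleotide_cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(codons) < n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")

    codon_iter = iter(codons)
    pieces = []
    for aa in aligned_protein:
        pieces.append(gap * 3 if aa == gap else next(codon_iter))
    return "".join(pieces)


# Toy example: protein "MK" aligned as "M-K", coding sequence ATG AAA.
codon_alignment = thread_codons("M-K", "ATGAAA")
print(codon_alignment)
```

Aligning on protein first keeps gaps on codon boundaries, so the resulting nucleotide alignment never introduces a frameshift.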

Now that I've downloaded all the records I need and gotten an idea of what they hold, it's time to look at sequences in an attempt to answer question 3: how similar were the recommended vaccine strains to the reported sequences during the 17-18 season?

How do I compare the differences between the sequences? I will have to look into BioPython to see what is available in that library for comparing genetic sequences.

First I'm going to attempt to download the vaccine candidate sequences from NCBI.

First one is A/Michigan/45/2015 (H1N1) pdm09-like virus

Retrieve that from IRD

Some ideas:
A script that automatically determines the % identity (essentially a BLAST?) of a sequence to its matching component in the vaccine.
Researcher uploads a sequence to the database, then gets an output of all the similarities in one dataframe, maybe represented in a table.
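
A rough sketch of the % identity idea, assuming the sequences have already been aligned to the same length. There are several conventions for percent identity; here I count identical columns over all columns where at least one sequence has a residue, which is not necessarily what BLAST reports. The names `percent_identity` and `identity_table`, and the toy strain labels, are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared columns in which two aligned sequences carry the same residue.

    Columns where both sequences have a gap are skipped; a gap opposite a
    residue counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_table(reference_name, reference_seq, queries):
    """Build one row per query sequence, ready to hand to pd.DataFrame(rows)."""
    return [
        {
            "query": name,
            "reference": reference_name,
            "percent_identity": round(percent_identity(reference_seq, seq), 2),
        }
        for name, seq in queries.items()
    ]


# Toy aligned sequences standing in for a vaccine component and two reported strains.
rows = identity_table("vaccine", "GAACT", {"strain_1": "GA--T", "strain_2": "GAACT"})
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give the single summary table described above.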


In [2]:
# Import data analysis libraries 
import numpy as np
import pandas as pd

# Import visualization libraries
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('paper', font_scale=1.)
import matplotlib.pyplot as plt
%matplotlib inline 

In [3]:
#Load up the vaccine candidate sequences into a dataframe, first with H1N1

AM42_H1N1_db = pd.read_csv('../A-michigan-45-2015.tsv', sep='\t')
AM42_H1N1_db.columns

Index(['Strain Name', 'Complete Genome', 'Subtype', 'Collection Date', 'Host',
       'Country', 'State/Province', 'Geographic Grouping', 'Flu Season',
       'Submission Date', 'Passage History', 'Specimen Source Health Status',
       '1 PB2', '2 PB1', '3 PA', '4 HA', '5 NP', '6 NA', '7 MP', '8 NS', 'Age',
       'Gender', 'M2 31N', 'M2 26F', 'M2 27A', 'M2 30T', 'M2 34E',
       'NA 275Y N1', 'NA 292K N2', 'NA 119V N2', 'NA 294S N2', 'PB1-F2 66S',
       'PB2 E627K', 'PB2 D701N', 'PB2 A199S', 'PB2 A661T', 'PB2 V667I',
       'PB2 K702R', 'PA S409N', 'NP L136M', 'M2 A16G', 'M2 C55F', 'NS1 T92E',
       'RERRRKKR', 'Sensitive Drug', 'Resistant Drug', 'Submission Date.1',
       'NCBI Taxon ID', 'pH1N1-like', 'US Swine H1 Clade',
       'Global Swine H1 Clade test', 'H5 Clade', 'Unnamed: 52'],
      dtype='object')

In [4]:
AM42_H1N1_db.head(3)

Unnamed: 0,Strain Name,Complete Genome,Subtype,Collection Date,Host,Country,State/Province,Geographic Grouping,Flu Season,Submission Date,...,RERRRKKR,Sensitive Drug,Resistant Drug,Submission Date.1,NCBI Taxon ID,pH1N1-like,US Swine H1 Clade,Global Swine H1 Clade test,H5 Clade,Unnamed: 52
0,A/Michigan/45/2015,Yes,H1N1,09/07/2015,Human,USA,Michigan,North America,-N/A-,2017-08-24,...,No,-N/A-,-N/A-,08/24/2017,1777792,Mixed Positive and Negative Segments,npdm,1A.3.3.2,-N/A-,


OK, let's look at just the HA segment - column named '4 HA'

In [5]:
AM42_H1N1_db['4 HA']

0    KY117023,KY090610,KU933493,KU509703
Name: 4 HA, dtype: object

In [6]:
# Put the accession numbers into a list so I can iterate over it and query NCBI.
# Selecting a column yields a Series, not a list, so call .tolist() to convert.

acc_list = AM42_H1N1_db['4 HA'].tolist()
acc_list = acc_list[0].split(',')
print(acc_list)

['KY117023', 'KY090610', 'KU933493', 'KU509703']


So this gives me the accession numbers in a list. I'm going to iterate over this list to query NCBI, download these sequences, and put them into a dataframe.

In [7]:
# Using the Bio.Entrez module, retrieve the sequence data for the four accession numbers,
# then add it to a new dataframe for storage.

from Bio import Entrez

Entrez.email = "adriana@dranalytics.co"
df = pd.DataFrame(columns=['accession', 'FASTA'])


for i, accession in enumerate(acc_list):
    # efetch returns a handle: read the FASTA text, then close the handle
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text")
    FASTA = handle.read()
    handle.close()

    df.loc[i] = {'accession': accession, 'FASTA': FASTA}

df.head(4)


Unnamed: 0,accession,FASTA
0,KY117023,>KY117023.1 Influenza A virus (A/Michigan/45/2...
1,KY090610,>KY090610.1 Influenza A virus (A/Michigan/45/2...
2,KU933493,>KU933493.1 Influenza A virus (A/Michigan/45/2...
3,KU509703,>KU509703.1 Influenza A virus (A/Michigan/45/2...


Now I'm going to run a pairwise sequence alignment, starting with two short test sequences:

In [8]:
from Bio import Align
aligner = Align.PairwiseAligner()

seq1 = "GAACT"
seq2 = "GAT"

alignments = aligner.align(seq1, seq2)
print(aligner)

Pairwise sequence aligner with parameters
  match score: 1.000000
  mismatch score: 0.000000
  target open gap score: 0.000000
  target extend gap score: 0.000000
  target left open gap score: 0.000000
  target left extend gap score: 0.000000
  target right open gap score: 0.000000
  target right extend gap score: 0.000000
  query open gap score: 0.000000
  query extend gap score: 0.000000
  query left open gap score: 0.000000
  query left extend gap score: 0.000000
  query right open gap score: 0.000000
  query right extend gap score: 0.000000
  mode: global
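
To check my understanding of these defaults (match 1, mismatch 0, every gap score 0, global mode), here is a small dynamic-programming sketch of the global alignment score. It is my own simplified version with a single linear gap penalty, not Biopython's implementation, and `global_alignment_score` is a hypothetical name. With these defaults the score is just the number of matched positions in the best alignment.

```python
def global_alignment_score(seq1, seq2, match=1.0, mismatch=0.0, gap=0.0):
    """Needleman-Wunsch style global alignment score with a linear gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0.0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq2
                score[i][j - 1] + gap,       # gap in seq1
            )
    return score[-1][-1]


default_score = global_alignment_score("GAACT", "GAT")
print(default_score)  # 3.0: G, A and T can all be matched
```

Because gaps are free under the defaults, many different alignments share the top score; setting negative gap and mismatch scores would narrow that down.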



In [9]:
import Bio.Align.Applications
dir(Bio.Align.Applications)

['ClustalOmegaCommandline',
 'ClustalwCommandline',
 'DialignCommandline',
 'MSAProbsCommandline',
 'MafftCommandline',
 'MuscleCommandline',
 'PrankCommandline',
 'ProbconsCommandline',
 'TCoffeeCommandline',
 '_ClustalOmega',
 '_Clustalw',
 '_Dialign',
 '_MSAProbs',
 '_Mafft',
 '_Muscle',
 '_Prank',
 '_Probcons',
 '_TCoffee',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__']

In [10]:
from Bio.Align.Applications import ClustalOmegaCommandline
help(ClustalOmegaCommandline)

Help on class ClustalOmegaCommandline in module Bio.Align.Applications._ClustalOmega:

class ClustalOmegaCommandline(Bio.Application.AbstractCommandline)
 |  ClustalOmegaCommandline(cmd='clustalo', **kwargs)
 |  
 |  Command line wrapper for clustal omega.
 |  
 |  http://www.clustal.org/omega
 |  
 |  Notes
 |  -----
 |  Last checked against version: 1.2.0
 |  
 |  References
 |  ----------
 |  Sievers F, Wilm A, Dineen DG, Gibson TJ, Karplus K, Li W, Lopez R,
 |  McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011).
 |  Fast, scalable generation of high-quality protein multiple
 |  sequence alignments using Clustal Omega.
 |  Molecular Systems Biology 7:539 https://doi.org/10.1038/msb.2011.75
 |  
 |  Examples
 |  --------
 |  >>> from Bio.Align.Applications import ClustalOmegaCommandline
 |  >>> in_file = "unaligned.fasta"
 |  >>> out_file = "aligned.fasta"
 |  >>> clustalomega_cline = ClustalOmegaCommandline(infile=in_file, outfile=out_file, verbose=True, auto=True)
 | 

In [11]:
df['FASTA'][0]

'>KY117023.1 Influenza A virus (A/Michigan/45/2015(H1N1)) segment 4 hemagglutinin (HA) gene, complete cds\nGGAAAAACAAAAGCAACAAAAATGAAGGCAATACTAGTAGTTCTGCTATATACATTTACAACCGCAAATG\nCAGACACATTATGTATAGGTTATCATGCGAACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAA\nTGTAACAGTAACACACTCTGTTAACCTTCTGGAAGACAAGCATAACGGAAAACTATGCAAACTAAGAGGG\nGTAGCCCCATTGCATTTGGGTAAATGTAACATTGCTGGCTGGATCCTGGGAAATCCAGAGTGTGAATCAC\nTCTCCACAGCAAGTTCATGGTCCTACATTGTGGAAACATCTAATTCAGACAATGGAACGTGTTACCCAGG\nAGATTTCATCAATTATGAGGAGCTAAGAGAGCAATTGAGCTCAGTGTCATCATTTGAAAGGTTTGAGATA\nTTCCCCAAGACAAGTTCATGGCCCAATCATGACTCGAACAAAGGTGTAACGGCAGCATGTCCTCACGCTG\nGAGCAAAAAGCTTCTACAAAAACTTGATATGGCTAGTTAAAAAAGGAAATTCATACCCAAAGCTTAACCA\nATCCTACATTAATGATAAAGGGAAAGAAGTCCTCGTGCTGTGGGGCATTCACCATCCATCTACTACTGCT\nGACCAACAAAGTCTCTATCAGAATGCAGATGCATATGTTTTTGTGGGGACATCAAGATACAGCAAGAAGT\nTCAAGCCGGAAATAGCAACAAGACCCAAAGTGAGGGATCAAGAAGGGAGAATGAACTATTACTGGACACT\nAGTAGAGCCGGGAGACAAAATAACATTCGAAGCAACTGGAAATCTAGTGGTACCGAGATATGCATTCACA\nATGGAAAGAAATGCTGGATCTGGTATTAT

In [12]:
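# Write all FASTA records into one file.
# The with-block closes (and flushes) the file once every record is written;
# apply returns the number of characters written for each record.
with open('1.FASTA', 'w') as file:
    chars_written = df['FASTA'].apply(file.write)

chars_written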

0    1885
1    1832
2    1884
3    1832
Name: FASTA, dtype: int64

In [29]:
from Bio.Align.Applications import ClustalOmegaCommandline
in_file = "1.fa"
out_file = "aligned.aln"
clustalomega_cline = ClustalOmegaCommandline(infile=in_file, outfile=out_file, verbose=True, auto=True, force=True)
clustalomega_cline()
 


('Using 1 threads\nRead 4 sequences (type: DNA) from 1.fa\nnot more sequences (4) than cluster-size (100), turn off mBed\nSetting options automatically based on input sequence characteristics (might overwrite some of your options).\nAuto settings: Enabling mBed.\nAuto settings: Setting iteration to 1.\nUsing 3 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 4 sequences)\nCalculating pairwise ktuple-distances...\nKtuple-distance calculation progress: 0 % (0 out of 9)\nKtuple-distance calculation progress: 33 % (3 out of 9)\nKtuple-distance calculation progress: 55 % (5 out of 9)\nKtuple-distance calculation progress done. CPU time: 0.02u 0.00s 00:00:00.02 Elapsed: 00:00:00\nmBed created 1 cluster/s (with a minimum of 1 and a soft maximum of 100 sequences each)\nDistance calculation within sub-clusters: 0 % (0 out of 1)\nDistance calculation within sub-clusters done. CPU time: 0.02u 0.01s 00:00:00.03 Elapsed: 00:00:00\nGuide-tree computation (mBed) d
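
As far as I know, recent Biopython releases mark the `Bio.Align.Applications` wrappers as deprecated, so here is a hedged sketch of the same Clustal Omega call through `subprocess`. The flags (`-i`, `-o`, `--auto`, `-v`, `--force`) are meant to mirror the keyword arguments used in the cell above; `build_clustalo_command` is a hypothetical helper, and the call only runs if `clustalo` is on the PATH and the input file exists.

```python
import os
import shutil
import subprocess


def build_clustalo_command(in_file, out_file, executable="clustalo"):
    """Assemble the Clustal Omega command line as a list of arguments."""
    return [executable, "-i", in_file, "-o", out_file, "--auto", "-v", "--force"]


command = build_clustalo_command("1.fa", "aligned.aln")

if shutil.which(command[0]) is not None and os.path.exists("1.fa"):
    # check=True raises CalledProcessError if clustalo exits with an error
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    print(result.stdout)
else:
    print("Not running clustalo; the command would be:", " ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with file names.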

In [30]:
ls

1.FASTA                                 aligned.aln
1.fa                                    aligned2.FASTA
Influ_Seq_1205.ipynb                    aligned2.fa
Influenza_US_Maps.ipynb                 fluA_strains.tsv*
Influenza_inital_data_assessment.ipynb  fluB_strains.tsv*
README.md                               inflA.fasta
aligned.FASTA                           new_file.fa


In [36]:
from Bio import AlignIO

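# Clustal Omega writes FASTA by default, so parse the .aln file as 'fasta'.
# Passing the filename lets AlignIO open and close the file itself.
align = AlignIO.read('aligned.aln', 'fasta')
print(format(align, 'clustal'))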

CLUSTAL X (1.81) multiple sequence alignment


KY117023.1                          GGAAAAACAAAAGCAACAAAAATGAAGGCAATACTAGTAGTTCTGCTATA
KY090610.1                          ---------------------ATGAAGGCAATACTAGTAGTTCTGCTATA
KU933493.1                          -GAAAAACAAAAGCAACAAAAATGAAGGCAATACTAGTAGTTCTGCTATA
KU509703.1                          ---------------------ATGAAGGCAATACTAGTAGTTCTGCTATA

KY117023.1                          TACATTTACAACCGCAAATGCAGACACATTATGTATAGGTTATCATGCGA
KY090610.1                          TACATTTACAACCGCAAATGCAGACACATTATGTATAGGTTATCATGCGA
KU933493.1                          TACATTTACAACCGCAAATGCAGACACATTATGTATAGGTTATCATGCGA
KU509703.1                          TACATTTACAACCGCAAATGCAGACACATTATGTATAGGTTATCATGCGA

KY117023.1                          ACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAATGTAACAGTA
KY090610.1                          ACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAATGTAACAGTA
KU933493.1                          ACAATTCAACAGACACTGTAGACACAGTACTAGAAAAGAATGTAA