<a href="https://colab.research.google.com/github/Angel030331/SARS-CoV-2-sequence-exploratory-analysis/blob/main/Sequence_Alignment_summary_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarising the differences in sequences

After the sequences are aligned with MAFFT on the Linux command line, AliView is used to visualise the differences and to translate the nucleotide sequences into amino acid sequences.

* Input file in AliView: aligned multiple sequences from MAFFT
* Input files in Python:
  * Unaligned sequences downloaded from GISAID
  * Aligned sequences from MAFFT
  * Aligned and translated amino acid sequences from Aliview
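
The MAFFT step mentioned above could also be launched from Python. The following is a minimal sketch under stated assumptions, not the command actually used for this notebook: it assumes MAFFT is installed and on the `PATH`, and it uses MAFFT's `--auto` option to let the program choose an alignment strategy. The helper names `build_mafft_command` and `run_mafft` are hypothetical.

```python
import subprocess


def build_mafft_command(input_fasta):
    """Return the argument list for a basic MAFFT run.

    Assumption: --auto (automatic strategy selection) is adequate here.
    """
    return ["mafft", "--auto", input_fasta]


def run_mafft(input_fasta, output_fasta):
    """Run MAFFT and save the alignment it prints on stdout to output_fasta."""
    with open(output_fasta, "w") as out:
        subprocess.run(build_mafft_command(input_fasta), stdout=out, check=True)


# Example call (needs MAFFT installed, so it is left commented out):
# run_mafft("gisaid_hcov_19.fasta", "gisaid_hcov_19_aligned.fasta")
print(build_mafft_command("gisaid_hcov_19.fasta"))
```

MAFFT writes the alignment to standard output, which is why the sketch redirects `stdout` to the output file.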

The following demonstration summarises the differences in the nucleotide sequences and the amino acid sequences.

Files used in this demonstration:
* GISAID raw file: gisaid_hcov_19.fasta
* GISAID aligned file: gisaid_hcov_19_aligned.fasta
* GISAID translated file: gisaid_hcov_19_aligned.translated.fasta

## Importing Biopython packages

In [2]:
!pip install biopython
from Bio.Seq import Seq
from Bio import AlignIO
from Bio import SeqIO

Collecting biopython
  Downloading biopython-1.83-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
     3.1/3.1 MB 16.8 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.83


## Summarising nucleotide sequence differences

### Use AlignIO.read to read the aligned file

In [3]:
hcov_19_aligned = AlignIO.read('gisaid_hcov_19_aligned.fasta', 'fasta')

for seq in hcov_19_aligned:
    print(seq)

ID: hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
Name: hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
Description: hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
Number of features: 0
Seq('tcccaggtaacaaaccaaccaactttcgatctcttgtagatctgttctctaaac...aaa')
ID: hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
Name: hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
Description: hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
Number of features: 0
Seq('--------------ccaaccaactttcgatctcttgtagatctgttctctaaac...---')
ID: hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
Name: hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
Description: hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
Number of features: 0
Seq('---------------------actttcgatctcttgtagatctgttctctaaac...---')
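
The record IDs printed above appear to pack three fields separated by `|`: a virus name, an `EPI_ISL_` accession, and a collection date. The sketch below splits one ID into those parts. It assumes every ID has exactly three `|`-separated fields, which holds for the three records shown here but is not guaranteed for other downloads; `parse_gisaid_id` is a hypothetical helper name.

```python
def parse_gisaid_id(record_id):
    """Split 'virus name|accession|collection date' into a dict.

    Assumption: the ID contains exactly three '|'-separated fields.
    """
    virus_name, accession, collection_date = record_id.split("|")
    return {
        "virus_name": virus_name,
        "accession": accession,
        "collection_date": collection_date,
    }


example_id = "hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05"
info = parse_gisaid_id(example_id)
print(info["accession"], info["collection_date"])
```

With the alignment loaded, the same helper could be applied to `seq.id` for each record in `hcov_19_aligned`.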


### Explore the sequence structure
Once the FASTA file has been read with AlignIO.read(), the alignment has a row-and-column structure similar to the view shown by bioinformatics tools: each row is one sequence and each column is one aligned position.

In [4]:
# Print the first 10 alignment columns; each column holds one character per sequence

for i in range(10):
    characters = hcov_19_aligned[:, i]
    print(characters)

print(hcov_19_aligned)

t--
c--
c--
c--
a--
g--
g--
t--
a--
a--
Alignment with 3 rows and 29856 columns
tcccaggtaacaaaccaaccaactttcgatctcttgtagatctg...aaa hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
--------------ccaaccaactttcgatctcttgtagatctg...--- hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
---------------------actttcgatctcttgtagatctg...--- hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
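
The row-and-column view can be pictured with plain strings. The toy example below is only an illustration with made-up rows, not part of the original analysis: it rebuilds the columns of three equal-length rows with `zip`, which mirrors how `hcov_19_aligned[:, i]` returns column `i` as a string with one character per sequence.

```python
# Toy alignment: three rows of equal length (made-up data, not the GISAID sequences)
rows = [
    "tccc",
    "--cc",
    "---c",
]

# zip(*rows) walks through the rows in parallel, yielding one column at a time
columns = ["".join(column) for column in zip(*rows)]

for i, column in enumerate(columns):
    print(i, column)
```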


### Summarise nucleotide differences

In [10]:
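length = hcov_19_aligned.get_alignment_length()
diff_pos = {}  # 0-based column index -> non-gap characters in that column

for i in range(length):
    # Take column i and drop the gap characters
    characters = hcov_19_aligned[:, i]
    characters = [nt for nt in characters if nt != '-']

    # Keep the column if the remaining characters are not all identical
    if len(set(characters)) > 1:
        diff_pos[i] = characters

print(diff_pos)

for k, v in diff_pos.items():
    print(f'position: {k}: {v}')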

{328: ['c', 'c', 't'], 3290: ['g', 'g', 'a'], 5126: ['t', 't', 'c'], 7513: ['a', 'a', 'g'], 7731: ['t', 't', 'c'], 9147: ['c', 't', 'c'], 9593: ['c', 'c', 't'], 9839: ['t', 'c', 'c'], 10011: ['c', 't', 't'], 11065: ['g', 'g', 't'], 12466: ['c', 'c', 't'], 12906: ['c', 'c', 't'], 14274: ['c', 'c', 't'], 16869: ['t', 'c', 'c'], 20114: ['t', 't', 'c'], 21582: ['t', 'g', 'g'], 21591: ['c', 'c', 't'], 21828: ['c', 'c', 't'], 21962: ['t', 't', 'c'], 21972: ['t', 't', 'a'], 22307: ['t', 't', 'n'], 22308: ['y', 'c', 'n'], 22309: ['t', 't', 'n'], 22310: ['t', 't', 'n'], 22311: ['c', 'c', 'n'], 22312: ['a', 'a', 'n'], 22477: ['a', 'a', 'n'], 22478: ['a', 'a', 'n'], 22479: ['t', 't', 'n'], 22480: ['c', 'c', 'n'], 22481: ['t', 't', 'n'], 22482: ['a', 'a', 'n'], 22483: ['t', 't', 'n'], 22484: ['c', 'c', 'n'], 22485: ['a', 'a', 'n'], 22486: ['a', 'a', 'n'], 22487: ['a', 'a', 'n'], 22488: ['c', 'c', 'n'], 22489: ['t', 't', 'n'], 22490: ['t', 't', 'n'], 22491: ['c', 'c', 'n'], 22492: ['t', 't', 'n'], 
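
Many of the positions reported above differ only because one sequence carries the ambiguity code `n`. If those positions are not of interest, the same column scan could skip `n` as well as gaps. The sketch below is one possible variation working on plain strings with made-up rows; the function name `variable_positions` and the `ignore` parameter are hypothetical, and it assumes lowercase sequences as in the output above. Other ambiguity codes such as `y` would still be reported unless they are added to `ignore`.

```python
def variable_positions(rows, ignore="-n"):
    """Return {0-based column index: kept characters} for columns that vary.

    Characters listed in `ignore` are dropped before the comparison.
    """
    result = {}
    for i, column in enumerate(zip(*rows)):
        kept = [ch for ch in column if ch not in ignore]
        if len(set(kept)) > 1:
            result[i] = kept
    return result


# Made-up rows: column 3 differs only by 'n', columns 1 and 4 differ by real bases
toy_rows = ["acgtt", "atgnt", "-cgta"]
print(variable_positions(toy_rows))              # gaps and 'n' ignored
print(variable_positions(toy_rows, ignore="-"))  # only gaps ignored
```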

### Write nucleotide differences to a CSV file
Source code: https://blog.finxter.com/write-python-dict-to-csv-columns-keys-first-values-second-column/

In [11]:
import csv

with open('nt_mismatch.csv', 'w', newline = '') as f:

  # Create a csv writer object
  writer = csv.writer(f)

  # Write the header row
  writer.writerow(['Position', 'Nucleotides variations'])
  # Write one key-value tuple per row
  for k, v in diff_pos.items():
    writer.writerow([k, v])
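
Note that `writer.writerow([k, v])` stores the text form of the Python list (for example `['c', 'c', 't']`) in the second column. If a plainer cell is preferred, the characters could be joined into a single string and a 1-based position added alongside the 0-based one. The sketch below is one possible variant, not the notebook's method; `write_mismatches` and the column names are hypothetical, and it writes to an in-memory buffer so that it can be run without creating a file.

```python
import csv
import io


def write_mismatches(diff, handle, value_header):
    """Write one row per position: 0-based index, 1-based index, joined characters."""
    writer = csv.writer(handle)
    writer.writerow(["Position (0-based)", "Position (1-based)", value_header])
    for pos, chars in diff.items():
        writer.writerow([pos, pos + 1, "/".join(chars)])


# Two entries copied from the diff_pos output above
example = {328: ["c", "c", "t"], 3290: ["g", "g", "a"]}

buffer = io.StringIO()
write_mismatches(example, buffer, "Nucleotide variations")
print(buffer.getvalue())
```

To write a real file, a handle opened with `open('nt_mismatch.csv', 'w', newline='')` could be passed in place of the buffer.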

## Summarising amino acid sequence differences

### Use AlignIO.read to read the translated file

In [13]:
hcov_19_translated = AlignIO.read('gisaid_hcov_19_aligned.translated.fasta', 'fasta')

for seq in hcov_19_translated:
    print(seq)

# Print the first 10 columns of the translated alignment
for i in range(10):
    characters = hcov_19_translated[:, i]
    print(characters)

print(hcov_19_translated)

ID: hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
Name: hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
Description: hCoV-19/Italy/VEN-IZSVe-21RS1571-1_VI/2021|EPI_ISL_2927997|2021-06-05
Number of features: 0
Seq('SQVTNQPTFDLLXICSLNELXNLCGCHSAACLVHSRSIINNXLLSLTGHEXLVY...XXK')
ID: hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
Name: hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
Description: hCoV-19/Italy/VEN-IZSVe-21RS8150-1_VI/2021|EPI_ISL_4968925|2021-09-22
Number of features: 0
Seq('----XQPTFDLLXICSLNELXNLCGCHSAACLVHSRSIINNXLLSLTGHEXLVY...---')
ID: hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
Name: hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
Description: hCoV-19/Italy/VEN-IZSVe-21RS1721-7_VI/2021|EPI_ISL_3006795|2021-06-28
Number of features: 0
Seq('-------TFDLLXICSLNELXNLCGCHSAACLVHSRSIINNXLLSLTGHEXLVY...---')
S--
Q--
V--
T--
NX-
QQ-
PP-
TTT

### Summarise amino acid differences

In [14]:
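length_aa = hcov_19_translated.get_alignment_length()
diff_pos_aa = {}  # 0-based column index -> non-gap residues in that column

for i in range(length_aa):
    # Take column i and drop the gap characters
    residue = hcov_19_translated[:, i]
    residue = [aa for aa in residue if aa != '-']

    # Keep the column if the remaining residues are not all identical
    if len(set(residue)) > 1:
        diff_pos_aa[i] = residue

print(diff_pos_aa)

for k, v in diff_pos_aa.items():
    print(f'position: {k}: {v}')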

{4: ['N', 'X'], 109: ['S', 'S', 'L'], 1096: ['W', 'W', 'X'], 2504: ['N', 'N', 'S'], 2577: ['X', 'X', 'Q'], 3049: ['P', 'S', 'P'], 3337: ['P', 'S', 'S'], 3688: ['C', 'C', 'F'], 4155: ['S', 'S', 'L'], 4758: ['P', 'P', 'S'], 5623: ['X', 'Q', 'Q'], 7194: ['F', 'V', 'V'], 7324: ['X', 'F', 'X'], 7325: ['X', 'I', 'X'], 7422: ['X', 'X', 'Y'], 7427: ['X', 'X', 'E'], 7435: ['L', 'L', 'X'], 7436: ['X', 'L', 'X'], 7437: ['Q', 'Q', 'X'], 7492: ['E', 'E', 'X'], 7493: ['S', 'S', 'X'], 7494: ['I', 'I', 'X'], 7495: ['K', 'K', 'X'], 7496: ['L', 'L', 'X'], 7497: ['L', 'L', 'X'], 7498: ['T', 'T', 'X'], 7499: ['L', 'L', 'X'], 7500: ['E', 'E', 'X'], 7501: ['S', 'S', 'X'], 7502: ['N', 'N', 'X'], 7503: ['Q', 'Q', 'X'], 7504: ['Q', 'Q', 'X'], 7505: ['N', 'N', 'X'], 7506: ['L', 'L', 'X'], 7507: ['L', 'L', 'X'], 7508: ['L', 'L', 'X'], 7589: ['D', 'N', 'D'], 7665: ['K', 'K', 'Q'], 8916: ['T', 'T', 'S'], 9418: ['I', 'K', 'K'], 9447: ['I', 'M', 'M'], 9622: ['X', 'E', 'E']}
position: 4: ['N', 'X']
position: 109: ['S
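
The output above contains runs of neighbouring positions (for example 7492 to 7508). For a more compact summary, consecutive positions could be collapsed into ranges. The sketch below shows one way to do this on a list of integers; `group_consecutive` is a hypothetical helper, and the example positions are a subset copied from the output above.

```python
def group_consecutive(positions):
    """Collapse integers into (start, end) tuples, one per run of consecutive values."""
    ranges = []
    for pos in sorted(positions):
        if ranges and pos == ranges[-1][1] + 1:
            # Extend the current run by one position
            ranges[-1] = (ranges[-1][0], pos)
        else:
            # Start a new run
            ranges.append((pos, pos))
    return ranges


example_positions = [7435, 7436, 7437, 7492, 7493, 7494, 7589]
print(group_consecutive(example_positions))
```

The same helper could be called with `diff_pos_aa.keys()` or `diff_pos.keys()`.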

### Write amino acid differences to a CSV file

In [15]:
import csv

with open('AA_mismatch.csv', 'w', newline = '') as f:

  # Create a csv writer object
  writer = csv.writer(f)

  # Write the header row
  writer.writerow(['Position', 'Amino Acids variations'])
  # Write one key-value tuple per row
  for k, v in diff_pos_aa.items():
    writer.writerow([k, v])

### Remarks
All position numbers start from 0 (because indexing in Python starts from 0). Please feel free to modify the code for more customised usage.
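
To relate the two summaries, an amino acid position can be mapped back to nucleotide columns. The sketch below assumes the translated file was produced by translating the nucleotide alignment in the first reading frame starting at column 0, so that amino acid column `i` covers nucleotide columns `3*i` to `3*i + 2`. This assumption is not stated in the notebook, but it is consistent with the outputs above, where amino acid position 109 and nucleotide position 328 both vary. `codon_columns` is a hypothetical helper name.

```python
def codon_columns(aa_index):
    """Return the three 0-based nucleotide columns behind a 0-based amino acid column.

    Assumption: translation in frame 1 starting at nucleotide column 0.
    """
    start = 3 * aa_index
    return (start, start + 1, start + 2)


print(codon_columns(109))
print(codon_columns(3049))
```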