## Topic - Analyzing Coronavirus

Let's start by getting the complete genome of Coronavirus. Source: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512

> **Basic Information:** Coronavirus is a single-stranded RNA virus (DNA, by contrast, is double-stranded). RNA polymers are made up of nucleotides. Each nucleotide has three parts: 1) a five-carbon ribose sugar, 2) a phosphate group and 3) one of four nitrogenous bases: adenine (a), guanine (g), cytosine (c) or uracil (u) / thymine (t).

<img src="./images/parts-of-nucleotide.jpg" width="480">

> Thymine is found in DNA and uracil in RNA. For the following analysis, you can treat (u) and (t) as interchangeable.
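Since the analysis treats (u) and (t) as interchangeable, a small helper can switch between the two alphabets. This is only a minimal sketch; the function names `rna_to_dna` and `dna_to_rna` are illustrative and not part of the notebook.

```python
def rna_to_dna(seq: str) -> str:
    """Replace uracil (u) with thymine (t), preserving case."""
    return seq.replace("u", "t").replace("U", "T")


def dna_to_rna(seq: str) -> str:
    """Replace thymine (t) with uracil (u), preserving case."""
    return seq.replace("t", "u").replace("T", "U")


print(rna_to_dna("AUGGCU"))  # ATGGCT
print(dna_to_rna("atggct"))  # auggcu
```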

In [2]:
with open('sars_cov2_genome.txt', 'r') as file:
    corona = file.read()

# print(corona)

Now we only want the bases {a, c, t, g} in corona, so we need to remove the newlines, spaces and numbers. Let's do that:

In [3]:
for s in "\n0123456789 ":
    corona = corona.replace(s, "")

# corona

In [4]:
len(corona)

29903
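A quick sanity check after this kind of cleanup is to confirm that only the four base letters remain. The sketch below uses a short made-up fragment in GenBank-style layout (not the real genome), and `clean_genbank_sequence` is a hypothetical helper rather than something the notebook defines.

```python
from collections import Counter


def clean_genbank_sequence(raw: str) -> str:
    """Drop digits and whitespace, keeping only the sequence letters."""
    return "".join(ch for ch in raw if ch.isalpha()).lower()


# Made-up fragment laid out like a GenBank ORIGIN block (assumption, not real data)
sample = """        1 acgtacgtac ggttaaccgg
       21 ttaaccggtt"""

seq = clean_genbank_sequence(sample)
counts = Counter(seq)

# Anything outside a, c, g, t would indicate leftover formatting or ambiguity codes
assert set(counts) <= set("acgt"), "unexpected characters in sequence"
print(len(seq), dict(counts))
```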

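### Q. 1 - What is the 'Kolmogorov complexity' of the Coronavirus?

This question is simply asking - how many bytes of information does the genome contain? Strictly speaking, Kolmogorov complexity is the length of the shortest program that outputs the sequence. It cannot be computed exactly, but the size of a compressed copy gives an upper bound.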

In [5]:
import zlib
len(zlib.compress(corona.encode("utf-8")))

8858

The above result means that the RNA of Coronavirus can be stored in 8858 bytes. This is just an upper bound: the genome cannot contain more than 8858 bytes of information. Let's see if we can compress it a little more.

In [6]:
import lzma
lzc = lzma.compress(corona.encode("utf-8"))
len(lzc)

8408

This is a better compression. So we can conclude that the Coronavirus genome contains at most about 8.4 kB (8408 bytes) of information.
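For comparison, a four-letter alphabet needs at most 2 bits per base, so simply packing the bases gives another upper bound without any general-purpose compressor. The sketch below is an illustration that assumes the sequence contains only a, c, g and t; `pack_2bit` and `CODE` are hypothetical names.

```python
import math

CODE = {"a": 0, "c": 1, "g": 2, "t": 3}  # 2 bits per base


def pack_2bit(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


print(len(pack_2bit("acgtac")))   # 2
print(math.ceil(29903 * 2 / 8))   # 7476 bytes for a 29903-base genome
```

By this count, plain 2-bit packing (7476 bytes) is already smaller than the lzma result above, which suggests that general-purpose compressors find little repeated structure in the sequence. Both numbers remain upper bounds.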

### Q. 2 - What type of information does this genome contain? How can we extract it?

The genome contains the information about the proteins it can make. These proteins determine the characteristics of the cell in which they are produced. So we need to extract information about the proteins. To extract this info, we must know - how proteins are formed from the genetic material i.e. DNA/RNA.

> **Basic Information:** DNA and RNA carry the instructions for making proteins. This is how proteins are formed from DNA. In DNA, A pairs with T (with U in RNA) and G pairs with C. These pairs form because of the chemical structure of the bases: A and T are held together by 2 hydrogen bonds, and G and C by 3 hydrogen bonds. A-C and G-T can't form such stable bonds.

<img src="./images/AT-GC.jpg" width="480">

> What happens during protein formation is:

<img src="./images/transcript-translate-cell.jpg">

> An enzyme called 'RNA polymerase' breaks these hydrogen bonds over a small region, takes one strand of DNA and forms its corresponding paired RNA. This process happens inside the nucleus of the cell. We call this RNA 'mRNA' or 'messenger RNA' because it leaves the nucleus and acts as a message to the ribosome, which generates proteins accordingly. This process of generating mRNA is called **Transcription.** The ribosome then reads the mRNA in sets of 3 bases. Each set of 3 bases is called a codon. Codons determine the amino acids. Depending on the codon read by the ribosome, a tRNA (transfer RNA) brings the appropriate amino acid. These amino acids are then linked by peptide bonds to form a chain called a *polypeptide chain*. Once it has delivered its amino acid, the tRNA is released and can pick up another amino acid.

> *Note:* Amino acids are organic compounds that contain amine (-NH2) and carboxyl (-COOH) functional groups. There are 20 standard amino acids and 2 non-standard ones. Of the 20 standard amino acids, nine (His, Ile, Leu, Lys, Met, Phe, Thr, Trp and Val) are called essential amino acids because the human body cannot synthesize them from other compounds at the level needed for normal growth, so they must be obtained from food. Here is the table of codons and their corresponding amino acids. 'Met' is usually the starting amino acid, i.e. 'AUG' marks the start of the protein-coding region of the mRNA. Hence 'AUG' is called the *start codon.* 'UAA', 'UGA' and 'UAG' are *stop codons*: they mark the end of a polypeptide chain.

<img src="./images/genetic-code-table.jpg" width="600">

> This process of generating chains of amino acids is called **Translation.** A long chain of amino acids is called a *protein.* In summary, we can understand the process as:

<img src="./images/transcription-translation.png" width="600">
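The two steps can be illustrated with a toy example. This is a sketch only: the template strand is invented, `transcribe` and `translate_mrna` are hypothetical names, and the codon table lists just the few codons the example needs.

```python
PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}  # DNA base -> paired RNA base


def transcribe(template_dna: str) -> str:
    """Build the mRNA that pairs with a DNA template strand (template read 3'->5')."""
    return "".join(PAIR[base] for base in template_dna)


# Deliberately tiny table: only the codons used below (assumption for illustration)
MINI_CODONS = {"AUG": "Met", "GCU": "Ala", "UGG": "Trp", "UAA": "STOP"}


def translate_mrna(mrna: str) -> list:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    chain = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = MINI_CODONS[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        chain.append(amino_acid)
    return chain


mrna = transcribe("TACCGAACCATT")
print(mrna)                  # AUGGCUUGGUAA
print(translate_mrna(mrna))  # ['Met', 'Ala', 'Trp']
```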


Since Coronavirus has only RNA, we can skip the DNA-to-mRNA transcription step and go straight to translation. So what we now need to write is *a translation function*, which takes corona's genome as input and gives back all the polypeptide chains that could be formed from that genome. For that, we first need a dictionary of codons. The following codon string is copied from the 'Genetic code' article on Wikipedia.

In [7]:
# Asn or Asp / B	AAU, AAC; GAU, GAC
# Gln or Glu / Z	CAA, CAG; GAA, GAG
# START	AUG

codons = """Ala / A	GCU, GCC, GCA, GCG
Ile / I	AUU, AUC, AUA
Arg / R	CGU, CGC, CGA, CGG; AGA, AGG, AGR;
Leu / L	CUU, CUC, CUA, CUG; UUA, UUG, UUR;
Asn / N	AAU, AAC
Lys / K	AAA, AAG
Asp / D	GAU, GAC
Met / M	AUG
Phe / F	UUU, UUC
Cys / C	UGU, UGC
Pro / P	CCU, CCC, CCA, CCG
Gln / Q	CAA, CAG
Ser / S	UCU, UCC, UCA, UCG; AGU, AGC;
Glu / E	GAA, GAG
Thr / T	ACU, ACC, ACA, ACG
Trp / W	UGG
Gly / G	GGU, GGC, GGA, GGG
Tyr / Y	UAU, UAC
His / H	CAU, CAC
Val / V	GUU, GUC, GUA, GUG
STOP	UAA, UGA, UAG""".strip()

for t in codons.split('\n'):
    t.split('\t')

In [8]:
dec = {}  # Decoder dictionary

for t in codons.split('\n'):
    k, v = t.split('\t')
    if '/' in k:
        k = k.split('/')[-1].strip()
    k = k.replace("STOP", "*")
    v = v.replace(",", "").replace(";", "").lower().replace("u", "t").split(" ")
    for vv in v:
        if vv in dec:
            print("duplicate", vv)
        dec[vv] = k

# dec

In [9]:
len(set(dec.values()))  # 21 symbols in our decoder: 20 amino acids plus the stop symbol '*'

21

Now, the genome can be decoded in three possible ways, depending on whether we start reading at the first, second or third base. These 3 ways are called 'reading frames'.

<img src="./images/reading-frames.png" width="480">
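As a small illustration of the three forward reading frames, the sketch below splits a short made-up sequence into codons starting at offsets 0, 1 and 2. `codons_in_frame` is a hypothetical helper, not part of the notebook.

```python
def codons_in_frame(seq: str, frame: int) -> list:
    """Split seq into complete 3-base codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


seq = "atggcttaa"  # made-up 9-base example
for frame in range(3):
    print(frame, codons_in_frame(seq, frame))
# 0 ['atg', 'gct', 'taa']
# 1 ['tgg', 'ctt']
# 2 ['ggc', 'tta']
```

Each frame produces a completely different list of codons from the same bases, which is why all three have to be translated.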

In [10]:
def translation(x, isProtein = False):
    aa = []
    for i in range(0, len(x)-2, 3):
        aa.append(dec[x[i:i+3]])
    aa = ''.join(aa)
    if isProtein:
        if aa[0] != "M" or aa[-1] != "*":
            print("BAD PROTEIN!")
            return None
        aa = aa[:-1]
    return aa

aa = translation(corona[0:]) + translation(corona[1:]) + translation(corona[2:])

In [11]:
polypeptides = aa.split("*")
# polypeptides

In [12]:
len(polypeptides)

1777

In [13]:
long_polypep_chains = list(filter(lambda x: len(x) > 100, aa.split("*")))
# long_polypep_chains

In [14]:
len(long_polypep_chains)

10

This is the genome organisation of SARS-CoV-2. _(Genome organisation is the linear order of genetic material (DNA/RNA) and its division into segments performing some specific function.)_

> Note: ORF stands for 'Open Reading Frame': a stretch of a reading frame whose translated protein starts with M and ends with *.
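One way to apply this definition to a translated frame is to search for stretches that begin with M and run up to the next *. The sketch below uses a regular expression on a made-up amino-acid string; `find_orfs` is a hypothetical helper, and it ignores details such as nested or overlapping ORFs.

```python
import re


def find_orfs(aa: str, min_len: int = 1) -> list:
    """Return peptides that start at an M and end just before the next '*'."""
    return [
        match.group(1)
        for match in re.finditer(r"(M[^*]*)\*", aa)
        if len(match.group(1)) >= min_len
    ]


# Made-up translated frame: the trailing 'MG' has no stop, so it is not reported
print(find_orfs("LLMKV*QQ*AMAT*MG"))  # ['MKV', 'MAT']
```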

Source: https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2#Phylogenetics_and_taxonomy

<img src="./images/SARS-CoV-2-genome.png" width="900">

Let's see if we can extract all the segments as mentioned here. We will refer to the following source again. Source: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512

Also, if you look at the following genome organisation of SARS-CoV (the old coronavirus), you will notice that its structure is very similar to that of SARS-CoV-2. _(Ignore the finer details shown in the structure.)_

<img src="./images/SARS-CoV-1-genome.png" width="800">

In [15]:
# https://www.ncbi.nlm.nih.gov/protein/1802476803 -  
# Orf1a polyprotein, found in Sars-Cov-2 (new Covid 19)
orf1a_v2 = translation(corona[265:13483], True)

# orf1a_v2

In [15]:
# https://www.uniprot.org/uniprot/A7J8L3
# Orf1a polyprotein, found in Sars-Cov
orf1a_v1 = """MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHLKNGTCGLVELEKGV
LPQLEQPYVFIKRSDALSTNHGHKVVELVAEMDGIQYGRSGITLGVLVPHVGETPIAYRN
VLLRKNGNKGAGGHSYGIDLKSYDLGDELGTDPIEDYEQNWNTKHGSGALRELTRELNGG
AVTRYVDNNFCGPDGYPLDCIKDFLARAGKSMCTLSEQLDYIESKRGVYCCRDHEHEIAW
FTERSDKSYEHQTPFEIKSAKKFDTFKGECPKFVFPLNSKVKVIQPRVEKKKTEGFMGRI
RSVYPVASPQECNNMHLSTLMKCNHCDEVSWQTCDFLKATCEHCGTENLVIEGPTTCGYL
PTNAVVKMPCPACQDPEIGPEHSVADYHNHSNIETRLRKGGRTRCFGGCVFAYVGCYNKR
AYWVPRASADIGSGHTGITGDNVETLNEDLLEILSRERVNINIVGDFHLNEEVAIILASF
SASTSAFIDTIKSLDYKSFKTIVESCGNYKVTKGKPVKGAWNIGQQRSVLTPLCGFPSQA
AGVIRSIFARTLDAANHSIPDLQRAAVTILDGISEQSLRLVDAMVYTSDLLTNSVIIMAY
VTGGLVQQTSQWLSNLLGTTVEKLRPIFEWIEAKLSAGVEFLKDAWEILKFLITGVFDIV
KGQIQVASDNIKDCVKCFIDVVNKALEMCIDQVTIAGAKLRSLNLGEVFIAQSKGLYRQC
IRGKEQLQLLMPLKAPKEVTFLEGDSHDTVLTSEEVVLKNGELEALETPVDSFTNGAIVG
TPVCVNGLMLLEIKDKEQYCALSPGLLATNNVFRLKGGAPIKGVTFGEDTVWEVQGYKNV
RITFELDERVDKVLNEKCSVYTVESGTEVTEFACVVAEAVVKTLQPVSDLLTNMGIDLDE
WSVATFYLFDDAGEENFSSRMYCSFYPPDEEEEDDAECEEEEIDETCEHEYGTEDDYQGL
PLEFGASAETVRVEEEEEEDWLDDTTEQSEIEPEPEPTPEEPVNQFTGYLKLTDNVAIKC
VDIVKEAQSANPMVIVNAANIHLKHGGGVAGALNKATNGAMQKESDDYIKLNGPLTVGGS
CLLSGHNLAKKCLHVVGPNLNAGEDIQLLKAAYENFNSQDILLAPLLSAGIFGAKPLQSL
QVCVQTVRTQVYIAVNDKALYEQVVMDYLDNLKPRVEAPKQEEPPNTEDSKTEEKSVVQK
PVDVKPKIKACIDEVTTTLEETKFLTNKLLLFADINGKLYHDSQNMLRGEDMSFLEKDAP
YMVGDVITSGDITCVVIPSKKAGGTTEMLSRALKKVPVDEYITTYPGQGCAGYTLEEAKT
ALKKCKSAFYVLPSEAPNAKEEILGTVSWNLREMLAHAEETRKLMPICMDVRAIMATIQR
KYKGIKIQEGIVDYGVRFFFYTSKEPVASIITKLNSLNEPLVTMPIGYVTHGFNLEEAAR
CMRSLKAPAVVSVSSPDAVTTYNGYLTSSSKTSEEHFVETVSLAGSYRDWSYSGQRTELG
VEFLKRGDKIVYHTLESPVEFHLDGEVLSLDKLKSLLSLREVKTIKVFTTVDNTNLHTQL
VDMSMTYGQQFGPTYLDGADVTKIKPHVNHEGKTFFVLPSDDTLRSEAFEYYHTLDESFL
GRYMSALNHTKKWKFPQVGGLTSIKWADNNCYLSSVLLALQQLEVKFNAPALQEAYYRAR
AGDAANFCALILAYSNKTVGELGDVRETMTHLLQHANLESAKRVLNVVCKHCGQKTTTLT
GVEAVMYMGTLSYDNLKTGVSIPCVCGRDATQYLVQQESSFVMMSAPPAEYKLQQGTFLC
ANEYTGNYQCGHYTHITAKETLYRIDGAHLTKMSEYKGPVTDVFYKETSYTTTIKPVSYK
LDGVTYTEIEPKLDGYYKKDNAYYTEQPIDLVPTQPLPNASFDNFKLTCSNTKFADDLNQ
MTGFTKPASRELSVTFFPDLNGDVVAIDYRHYSASFKKGAKLLHKPIVWHINQATTKTTF
KPNTWCLRCLWSTKPVDTSNSFEVLAVEDTQGMDNLACESQQPTSEEVVENPTIQKEVIE
CDVKTTEVVGNVILKPSDEGVKVTQELGHEDLMAAYVENTSITIKKPNELSLALGLKTIA
THGIAAINSVPWSKILAYVKPFLGQAAITTSNCAKRLAQRVFNNYMPYVFTLLFQLCTFT
KSTNSRIRASLPTTIAKNSVKSVAKLCLDAGINYVKSPKFSKLFTIAMWLLLLSICLGSL
ICVTAAFGVLLSNFGAPSYCNGVRELYLNSSNVTTMDFCEGSFPCSICLSGLDSLDSYPA
LETIQVTISSYKLDLTILGLAAEWVLAYMLFTKFFYLLGLSAIMQVFFGYFASHFISNSW
LMWFIISIVQMAPVSAMVRMYIFFASFYYIWKSYVHIMDGCTSSTCMMCYKRNRATRVEC
TTIVNGMKRSFYVYANGGRGFCKTHNWNCLNCDTFCTGSTFISDEVARDLSLQFKRPINP
TDQSSYIVDSVAVKNGALHLYFDKAGQKTYERHPLSHFVNLDNLRANNTKGSLPINVIVF
DGKSKCDESASKSASVYYSQLMCQPILLLDQALVSDVGDSTEVSVKMFDAYVDTFSATFS
VPMEKLKALVATAHSELAKGVALDGVLSTFVSAARQGVVDTDVDTKDVIECLKLSHHSDL
EVTGDSCNNFMLTYNKVENMTPRDLGACIDCNARHINAQVAKSHNVSLIWNVKDYMSLSE
QLRKQIRSAAKKNNIPFRLTCATTRQVVNVITTKISLKGGKIVSTCFKLMLKATLLCVLA
ALVCYIVMPVHTLSIHDGYTNEIIGYKAIQDGVTRDIISTDDCFANKHAGFDAWFSQRGG
SYKNDKSCPVVAAIITREIGFIVPGLPGTVLRAINGDFLHFLPRVFSAVGNICYTPSKLI
EYSDFATSACVLAAECTIFKDAMGKPVPYCYDTNLLEGSISYSELRPDTRYVLMDGSIIQ
FPNTYLEGSVRVVTTFDAEYCRHGTCERSEVGICLSTSGRWVLNNEHYRALSGVFCGVDA
MNLIANIFTPLVQPVGALDVSASVVAGGIIAILVTCAAYYFMKFRRVFGEYNHVVAANAL
LFLMSFTILCLVPAYSFLPGVYSVFYLYLTFYFTNDVSFLAHLQWFAMFSPIVPFWITAI
YVFCISLKHCHWFFNNYLRKRVMFNGVTFSTFEEAALCTFLLNKEMYLKLRSETLLPLTQ
YNRYLALYNKYKYFSGALDTTSYREAACCHLAKALNDFSNSGADVLYQPPQTSITSAVLQ
SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIR
KSNHSFLVQAGNVQLRVIGHSMQNCLLRLKVDTSNPKTPKYKFVRIQPGQTFSVLACYNG
SPSGVYQCAMRPNHTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGK
FYGPFVDRQTAQAAGTDTTITLNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYE
PLTQDHVDILGPLSAQTGIAVLDMCAALKELLQNGMNGRTILGSTILEDEFTPFDVVRQC
SGVTFQGKFKKIVKGTHHWMLLTFLTSLLILVQSTQWSLFFFVYENAFLPFTLGIMAIAA
CAMLLVKHKHAFLCLFLLPSLATVAYFNMVYMPASWVMRIMTWLELADTSLSGYRLKDCV
MYASALVLLILMTARTVYDDAARRVWTLMNVITLVYKVYYGNALDQAISMWALVISVTSN
YSGVVTTIMFLARAIVFVCVEYYPLLFITGNTLQCIMLVYCFLGYCCCCYFGLFCLLNRY
FRLTLGVYDYLVSTQEFRYMNSQGLLPPKSSIDAFKLNIKLLGIGGKPCIKVATVQSKMS
DVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQG
AVDINRLCEEMLDNRATLQAIASEFSSLPSYAAYATAQEAYEQAVANGDSEVVLKKLKKS
LNVAKSEFDRDAAMQRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDND
ALNNIINNARDGCVPLNIIPLTTAAKLMVVVPDYGTYKNTCDGNTFTYASALWEIQQVVD
ADSKIVQLSEINMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTAC
TDDNALAYYNNSKGGRFVLALLSDHQDLKWARFPKSDGTGTIYTELEPPCRFVTDTPKGP
KVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDPAKAYKDY
LASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFC
DLKGKYVQIPTTCANDPVGFTLRNTVCTVCGMWKGYGCSCDQLREPLMQSADASTFLNGF
AV""".replace("\n", "")

In [16]:
len(orf1a_v1), len(orf1a_v2)

(4382, 4405)

Usually orf1b is not studied alone but together with orf1a, so we need to look at 'orf1ab'. But just to confirm that the length of orf1b is 2595, here we compute the length of orf1b in SARS-CoV-2.

In [17]:
# For orf1b_v1, refer - https://www.uniprot.org/uniprot/A0A0A0QGJ0
orf1b_v2 = translation(corona[13467:21555])

# Length calculated from the first 'M'. The last symbol is '*', so subtract an extra 1 for that.
len(orf1b_v2) - orf1b_v2.find('M') - 1   

2595

In [18]:
# https://www.ncbi.nlm.nih.gov/protein/1796318597 - 
# Orf1ab polyprotein - found in Sars-cov-2
orf1ab_v2 = translation(corona[265:13468]) + translation(corona[13467:21555])

In [19]:
# https://www.uniprot.org/uniprot/A7J8L2
# Orf1ab polyprotein - found in Sars-cov

orf1ab_v1 = """MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHLKNGTCGLVELEKGV
LPQLEQPYVFIKRSDALSTNHGHKVVELVAEMDGIQYGRSGITLGVLVPHVGETPIAYRN
VLLRKNGNKGAGGHSYGIDLKSYDLGDELGTDPIEDYEQNWNTKHGSGALRELTRELNGG
AVTRYVDNNFCGPDGYPLDCIKDFLARAGKSMCTLSEQLDYIESKRGVYCCRDHEHEIAW
FTERSDKSYEHQTPFEIKSAKKFDTFKGECPKFVFPLNSKVKVIQPRVEKKKTEGFMGRI
RSVYPVASPQECNNMHLSTLMKCNHCDEVSWQTCDFLKATCEHCGTENLVIEGPTTCGYL
PTNAVVKMPCPACQDPEIGPEHSVADYHNHSNIETRLRKGGRTRCFGGCVFAYVGCYNKR
AYWVPRASADIGSGHTGITGDNVETLNEDLLEILSRERVNINIVGDFHLNEEVAIILASF
SASTSAFIDTIKSLDYKSFKTIVESCGNYKVTKGKPVKGAWNIGQQRSVLTPLCGFPSQA
AGVIRSIFARTLDAANHSIPDLQRAAVTILDGISEQSLRLVDAMVYTSDLLTNSVIIMAY
VTGGLVQQTSQWLSNLLGTTVEKLRPIFEWIEAKLSAGVEFLKDAWEILKFLITGVFDIV
KGQIQVASDNIKDCVKCFIDVVNKALEMCIDQVTIAGAKLRSLNLGEVFIAQSKGLYRQC
IRGKEQLQLLMPLKAPKEVTFLEGDSHDTVLTSEEVVLKNGELEALETPVDSFTNGAIVG
TPVCVNGLMLLEIKDKEQYCALSPGLLATNNVFRLKGGAPIKGVTFGEDTVWEVQGYKNV
RITFELDERVDKVLNEKCSVYTVESGTEVTEFACVVAEAVVKTLQPVSDLLTNMGIDLDE
WSVATFYLFDDAGEENFSSRMYCSFYPPDEEEEDDAECEEEEIDETCEHEYGTEDDYQGL
PLEFGASAETVRVEEEEEEDWLDDTTEQSEIEPEPEPTPEEPVNQFTGYLKLTDNVAIKC
VDIVKEAQSANPMVIVNAANIHLKHGGGVAGALNKATNGAMQKESDDYIKLNGPLTVGGS
CLLSGHNLAKKCLHVVGPNLNAGEDIQLLKAAYENFNSQDILLAPLLSAGIFGAKPLQSL
QVCVQTVRTQVYIAVNDKALYEQVVMDYLDNLKPRVEAPKQEEPPNTEDSKTEEKSVVQK
PVDVKPKIKACIDEVTTTLEETKFLTNKLLLFADINGKLYHDSQNMLRGEDMSFLEKDAP
YMVGDVITSGDITCVVIPSKKAGGTTEMLSRALKKVPVDEYITTYPGQGCAGYTLEEAKT
ALKKCKSAFYVLPSEAPNAKEEILGTVSWNLREMLAHAEETRKLMPICMDVRAIMATIQR
KYKGIKIQEGIVDYGVRFFFYTSKEPVASIITKLNSLNEPLVTMPIGYVTHGFNLEEAAR
CMRSLKAPAVVSVSSPDAVTTYNGYLTSSSKTSEEHFVETVSLAGSYRDWSYSGQRTELG
VEFLKRGDKIVYHTLESPVEFHLDGEVLSLDKLKSLLSLREVKTIKVFTTVDNTNLHTQL
VDMSMTYGQQFGPTYLDGADVTKIKPHVNHEGKTFFVLPSDDTLRSEAFEYYHTLDESFL
GRYMSALNHTKKWKFPQVGGLTSIKWADNNCYLSSVLLALQQLEVKFNAPALQEAYYRAR
AGDAANFCALILAYSNKTVGELGDVRETMTHLLQHANLESAKRVLNVVCKHCGQKTTTLT
GVEAVMYMGTLSYDNLKTGVSIPCVCGRDATQYLVQQESSFVMMSAPPAEYKLQQGTFLC
ANEYTGNYQCGHYTHITAKETLYRIDGAHLTKMSEYKGPVTDVFYKETSYTTTIKPVSYK
LDGVTYTEIEPKLDGYYKKDNAYYTEQPIDLVPTQPLPNASFDNFKLTCSNTKFADDLNQ
MTGFTKPASRELSVTFFPDLNGDVVAIDYRHYSASFKKGAKLLHKPIVWHINQATTKTTF
KPNTWCLRCLWSTKPVDTSNSFEVLAVEDTQGMDNLACESQQPTSEEVVENPTIQKEVIE
CDVKTTEVVGNVILKPSDEGVKVTQELGHEDLMAAYVENTSITIKKPNELSLALGLKTIA
THGIAAINSVPWSKILAYVKPFLGQAAITTSNCAKRLAQRVFNNYMPYVFTLLFQLCTFT
KSTNSRIRASLPTTIAKNSVKSVAKLCLDAGINYVKSPKFSKLFTIAMWLLLLSICLGSL
ICVTAAFGVLLSNFGAPSYCNGVRELYLNSSNVTTMDFCEGSFPCSICLSGLDSLDSYPA
LETIQVTISSYKLDLTILGLAAEWVLAYMLFTKFFYLLGLSAIMQVFFGYFASHFISNSW
LMWFIISIVQMAPVSAMVRMYIFFASFYYIWKSYVHIMDGCTSSTCMMCYKRNRATRVEC
TTIVNGMKRSFYVYANGGRGFCKTHNWNCLNCDTFCTGSTFISDEVARDLSLQFKRPINP
TDQSSYIVDSVAVKNGALHLYFDKAGQKTYERHPLSHFVNLDNLRANNTKGSLPINVIVF
DGKSKCDESASKSASVYYSQLMCQPILLLDQALVSDVGDSTEVSVKMFDAYVDTFSATFS
VPMEKLKALVATAHSELAKGVALDGVLSTFVSAARQGVVDTDVDTKDVIECLKLSHHSDL
EVTGDSCNNFMLTYNKVENMTPRDLGACIDCNARHINAQVAKSHNVSLIWNVKDYMSLSE
QLRKQIRSAAKKNNIPFRLTCATTRQVVNVITTKISLKGGKIVSTCFKLMLKATLLCVLA
ALVCYIVMPVHTLSIHDGYTNEIIGYKAIQDGVTRDIISTDDCFANKHAGFDAWFSQRGG
SYKNDKSCPVVAAIITREIGFIVPGLPGTVLRAINGDFLHFLPRVFSAVGNICYTPSKLI
EYSDFATSACVLAAECTIFKDAMGKPVPYCYDTNLLEGSISYSELRPDTRYVLMDGSIIQ
FPNTYLEGSVRVVTTFDAEYCRHGTCERSEVGICLSTSGRWVLNNEHYRALSGVFCGVDA
MNLIANIFTPLVQPVGALDVSASVVAGGIIAILVTCAAYYFMKFRRVFGEYNHVVAANAL
LFLMSFTILCLVPAYSFLPGVYSVFYLYLTFYFTNDVSFLAHLQWFAMFSPIVPFWITAI
YVFCISLKHCHWFFNNYLRKRVMFNGVTFSTFEEAALCTFLLNKEMYLKLRSETLLPLTQ
YNRYLALYNKYKYFSGALDTTSYREAACCHLAKALNDFSNSGADVLYQPPQTSITSAVLQ
SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIR
KSNHSFLVQAGNVQLRVIGHSMQNCLLRLKVDTSNPKTPKYKFVRIQPGQTFSVLACYNG
SPSGVYQCAMRPNHTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGK
FYGPFVDRQTAQAAGTDTTITLNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYE
PLTQDHVDILGPLSAQTGIAVLDMCAALKELLQNGMNGRTILGSTILEDEFTPFDVVRQC
SGVTFQGKFKKIVKGTHHWMLLTFLTSLLILVQSTQWSLFFFVYENAFLPFTLGIMAIAA
CAMLLVKHKHAFLCLFLLPSLATVAYFNMVYMPASWVMRIMTWLELADTSLSGYRLKDCV
MYASALVLLILMTARTVYDDAARRVWTLMNVITLVYKVYYGNALDQAISMWALVISVTSN
YSGVVTTIMFLARAIVFVCVEYYPLLFITGNTLQCIMLVYCFLGYCCCCYFGLFCLLNRY
FRLTLGVYDYLVSTQEFRYMNSQGLLPPKSSIDAFKLNIKLLGIGGKPCIKVATVQSKMS
DVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQG
AVDINRLCEEMLDNRATLQAIASEFSSLPSYAAYATAQEAYEQAVANGDSEVVLKKLKKS
LNVAKSEFDRDAAMQRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDND
ALNNIINNARDGCVPLNIIPLTTAAKLMVVVPDYGTYKNTCDGNTFTYASALWEIQQVVD
ADSKIVQLSEINMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTAC
TDDNALAYYNNSKGGRFVLALLSDHQDLKWARFPKSDGTGTIYTELEPPCRFVTDTPKGP
KVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDPAKAYKDY
LASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFC
DLKGKYVQIPTTCANDPVGFTLRNTVCTVCGMWKGYGCSCDQLREPLMQSADASTFLNRV
CGVSAARLTPCGTGTSTDVVYRAFDIYNEKVAGFAKFLKTNCCRFQEKDEEGNLLDSYFV
VKRHTMSNYQHEETIYNLVKDCPAVAVHDFFKFRVDGDMVPHISRQRLTKYTMADLVYAL
RHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQSLLKTVQ
FCDAMRDAGIVGVLTLDNQDLNGNWYDFGDFVQVAPGCGVPIVDSYYSLLMPILTLTRAL
AAESHMDADLAKPLIKWDLLKYDFTEERLCLFDRYFKYWDQTYHPNCINCLDDRCILHCA
NFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKEL
LVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKE
GSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINAN
QVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAK
NRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVET
PHLMGWDYPKCDRAMPNMLRIMASLVLARKHNTCCNLSHRFYRLANECAQVLSEMVMCGG
SLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECL
YRNRDVDHEFVDEFYAYLRKHFSMMILSDDAVVCYNSNYAAQGLVASIKNFKAVLYYQNN
VFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTD
GTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDN
TSRYWEPEFYEAMYTPHTVLQAVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSH
KLVLSVNPYVCNAPGCDVTDVTQLYLGGMSYYCKSHKPPISFPLCANGQVFGLYKNTCVG
SDNVTDFNAIATCDWTNAGDYILANTCTERLKLFAAETLKATEETFKLSYGIATVREVLS
DRELHLSWEVGKPRPPLNRNYVFTGYRVTKNSKVQIGEYTFEKGDYGDAVVYRGTTTYKL
NVGDYFVLTSHTVMPLSAPTLVPQEHYVRITGLYPTLNISDEFSSNVANYQKVGMQKYST
LQGPPGTGKSHFAIGLALYYPSARIVYTACSHAAVDALCEKALKYLPIDKCSRIIPARAR
VECFDKFKVNSTLEQYVFCTVNALPETTADIVVFDEISMATNYDLSVVNARLRAKHYVYI
GDPAQLPAPRTLLTKGTLEPEYFNSVCRLMKTIGPDMFLGTCRRCPAEIVDTVSALVYDN
KLKAHKDKSAQCFKMFYKGVITHDVSSAINRPQIGVVREFLTRNPAWRKAVFISPYNSQN
AVASKILGLPTQTVDSSQGSEYDYVIFTQTTETAHSCNVNRFNVAITRAKIGILCIMSDR
DLYDKLQFTSLEIPRRNVATLQAENVTGLFKDCSKIITGLHPTQAPTHLSVDIKFKTEGL
CVDIPGIPKDMTYRRLISMMGFKMNYQVNGYPNMFITREEAIRHVRAWIGFDVEGCHATR
DAVGTNLPLQLGFSTGVNLVAVPTGYVDTENNTEFTRVNAKPPPGDQFKHLIPLMYKGLP
WNVVRIKIVQMLSDTLKGLSDRVVFVLWAHGFELTSMKYFVKIGPERTCCLCDKRATCFS
TSSDTYACWNHSVGFDYVYNPFMIDVQQWGFTGNLQSNHDQHCQVHGNAHVASCDAIMTR
CLAVHECFVKRVDWSVEYPIIGDELRVNSACRKVQHMVVKSALLADKFPVLHDIGNPKAI
KCVPQAEVEWKFYDAQPCSDKAYKIEELFYSYATHHDKFTDGVCLFWNCNVDRYPANAIV
CRFDTRVLSNLNLPGCDGGSLYVNKHAFHTPAFDKSAFTNLKQLPFFYYSDSPCESHGKQ
VVSDIDYVPLKSATCITRCNLGGAVCRHHANEYRQYLDAYNMMISAGFSLWIYKQFDTYN
LWNTFTRLQSLENVAYNVVNKGHFDGHAGEAPVSIINNAVYTKVDGIDVEIFENKTTLPV
NVAFELWAKRNIKPVPEIKILNNLGVDIAANTVIWDYKREAPAHVSTIGVCTMTDIAKKP
TESACSSLTVLFDGRVEGQVDLFRNARNGVLITEGSVKGLTPSKGPAQASVNGVTLIGES
VKTQFNYFKKVDGIIQQLPETYFTQSRDLEDFKPRSQMETDFLELAMDEFIQRYKLEGYA
FEHIVYGDFSHGQLGGLHLMIGLAKRSQDSPLKLEDFIPMDSTVKNYFITDAQTGSSKCV
CSVIDLLLDDFVEIIKSQDLSVISKVVKVTIDYAEISFMLWCKDGHVETFYPKLQASQAW
QPGVAMPNLYKMQRMLLEKCDLQNYGENAVIPKGIMMNVAKYTQLCQYLNTLTLAVPYNM
RVIHFGAGSDKGVAPGTAVLRQWLPTGTLLVDSDLNDFVSDADSTLIGDCATVHTANKWD
LIISDMYDPRTKHVTKENDSKEGFFTYLCGFIKQKLALGGSIAVKITEHSWNADLYKLMG
HFSWWTAFVTNVNASSSEAFLIGANYLGKPKEQIDGYTMHANYIFWRNTNPIQLSSYSLF
DMSKFPLKLRGTAVMSLKENQINDMIYSLLEKGRLIIRENNRVVVSSDILVNN""".replace("\n", "")

In [20]:
len(orf1ab_v2), len(orf1ab_v1)

(7097, 7073)

So by now, we have translated the Orf1a and Orf1b RNA segments into their polyproteins.

In [21]:
# https://www.ncbi.nlm.nih.gov/protein/1796318598
# Spike glycoprotein - found in Sars-cov-2

spike_v2 = translation(corona[21562:25384], True)

In [22]:
# https://www.uniprot.org/uniprot/P59594
# Spike glycoprotein - found in Sars-cov

spike_v1 = """MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFL
PFYSNVTGFHTINHTFGNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNS
TNVVIRACNFELCDNPFFAVSKPMGTQTHTMIFDNAFNCTFEYISDAFSLDVSEKSGNFK
HLREFVFKNKDGFLYVYKGYQPIDVVRDLPSGFNTLKPIFKLPLGINITNFRAILTAFSP
AQDIWGTSAAAYFVGYLKPTTFMLKYDENGTITDAVDCSQNPLAELKCSVKSFEIDKGIY
QTSNFRVVPSGDVVRFPNITNLCPFGEVFNATKFPSVYAWERKKISNCVADYSVLYNSTF
FSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQIAPGQTGVIADYNYKLPDDFMGCV
LAWNTRNIDATSTGNYNYKYRYLRHGKLRPFERDISNVPFSPDGKPCTPPALNCYWPLND
YGFYTTTGIGYQPYRVVVLSFELLNAPATVCGPKLSTDLIKNQCVNFNFNGLTGTGVLTP
SSKRFQPFQQFGRDVSDFTDSVRDPKTSEILDISPCSFGGVSVITPGTNASSEVAVLYQD
VNCTDVSTAIHADQLTPAWRIYSTGNNVFQTQAGCLIGAEHVDTSYECDIPIGAGICASY
HTVSLLRSTSQKSIVAYTMSLGADSSIAYSNNTIAIPTNFSISITTEVMPVSMAKTSVDC
NMYICGDSTECANLLLQYGSFCTQLNRALSGIAAEQDRNTREVFAQVKQMYKTPTLKYFG
GFNFSQILPDPLKPTKRSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGL
TVLPPLLTDDMIAAYTAALVSGTATAGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYE
NQKQIANQFNKAISQIQESLTTTSTALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLN
DILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSK
RVDFCGKGYHLMSFPQAAPHGVVFLHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFN
GTSWFITQRNFFSPQIITTDNTFVSGNCDVVIGIINNTVYDPLQPELDSFKEELDKYFKN
HTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYVWL
GFIAGLIAIVMVTILLCCMTSCCSCLKGACSCGSCCKFDEDDSEPVLKGVKLHYT""".replace("\n", "")

In [23]:
len(spike_v2), len(spike_v1)

(1273, 1255)

In [24]:
# https://www.ncbi.nlm.nih.gov/gene/43740569
# orf3a protein found in Sars-cov-2.

orf3a_v2 = translation(corona[25392:26220], True)

In [25]:
# https://www.uniprot.org/uniprot/J9TEM7

orf3a_v1 = """MDLFMRFFTLXSITAQPVKIDNASXASTVHATATIPLQASLPFGWLVIGVAFLAVFQSAT
KIIALNKRWQLALYKGFQFICNLLLLFVTIYSHLLLVAAGMEAQFLYLYALIYFLQCINA
CRIIMRCWLCWKCKSKNPLLYDANYFVCWHTHNYDYCIPYNSVTDTIVVTEGDGISTPKL
KEDYQIGGYSEDRHSGVKDYVVVHGYFTEVYYQLESTQITTDTGIENATFFIFNKLVKDP
PNVQIHTIDGSSGVANPAMDPIYDEPTTTTSVPL""".replace("\n", "")

In [26]:
len(orf3a_v2), len(orf3a_v1)

(275, 274)

By now, you must have seen that there is very little difference between the corresponding protein lengths of SARS-CoV and SARS-CoV-2. So, **can we say there isn't much difference between the proteins of the two viruses?** The answer is **NO!**

This is because the length of a protein is not an accurate measure of how dissimilar two proteins are. So now we have another question.

### Q. 3 - How different are the proteins of this novel coronavirus from those of the older one?

The answer is - **The Edit Distance.** In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T. 

Source: https://en.wikipedia.org/wiki/Edit_distance
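A minimal sketch of the classic dynamic-programming (Levenshtein) form of edit distance, counting insertions, deletions and substitutions at unit cost. The function name `edit_distance` is illustrative. For sequences as long as a full genome this pure-Python version would be slow, and an optimized library would likely be preferable.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("acgt", "aggt"))       # 1
```

Two strings of equal length can still be far apart under this measure, which is why it is more informative than comparing lengths alone.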

Let's calculate the edit distance between the genomes of the two coronaviruses.

Source of complete genome of old coronavirus: https://www.ncbi.nlm.nih.gov/nuccore/30271926

In [27]:
old_corona = """1 atattaggtt tttacctacc caggaaaagc caaccaacct cgatctcttg tagatctgtt
       61 ctctaaacga actttaaaat ctgtgtagct gtcgctcggc tgcatgccta gtgcacctac
      121 gcagtataaa caataataaa ttttactgtc gttgacaaga aacgagtaac tcgtccctct
      181 tctgcagact gcttacggtt tcgtccgtgt tgcagtcgat catcagcata cctaggtttc
      241 gtccgggtgt gaccgaaagg taagatggag agccttgttc ttggtgtcaa cgagaaaaca
      301 cacgtccaac tcagtttgcc tgtccttcag gttagagacg tgctagtgcg tggcttcggg
      361 gactctgtgg aagaggccct atcggaggca cgtgaacacc tcaaaaatgg cacttgtggt
      421 ctagtagagc tggaaaaagg cgtactgccc cagcttgaac agccctatgt gttcattaaa
      481 cgttctgatg ccttaagcac caatcacggc cacaaggtcg ttgagctggt tgcagaaatg
      541 gacggcattc agtacggtcg tagcggtata acactgggag tactcgtgcc acatgtgggc
      601 gaaaccccaa ttgcataccg caatgttctt cttcgtaaga acggtaataa gggagccggt
      661 ggtcatagct atggcatcga tctaaagtct tatgacttag gtgacgagct tggcactgat
      721 cccattgaag attatgaaca aaactggaac actaagcatg gcagtggtgc actccgtgaa
      781 ctcactcgtg agctcaatgg aggtgcagtc actcgctatg tcgacaacaa tttctgtggc
      841 ccagatgggt accctcttga ttgcatcaaa gattttctcg cacgcgcggg caagtcaatg
      901 tgcactcttt ccgaacaact tgattacatc gagtcgaaga gaggtgtcta ctgctgccgt
      961 gaccatgagc atgaaattgc ctggttcact gagcgctctg ataagagcta cgagcaccag
     1021 acacccttcg aaattaagag tgccaagaaa tttgacactt tcaaagggga atgcccaaag
     1081 tttgtgtttc ctcttaactc aaaagtcaaa gtcattcaac cacgtgttga aaagaaaaag
     1141 actgagggtt tcatggggcg tatacgctct gtgtaccctg ttgcatctcc acaggagtgt
     1201 aacaatatgc acttgtctac cttgatgaaa tgtaatcatt gcgatgaagt ttcatggcag
     1261 acgtgcgact ttctgaaagc cacttgtgaa cattgtggca ctgaaaattt agttattgaa
     1321 ggacctacta catgtgggta cctacctact aatgctgtag tgaaaatgcc atgtcctgcc
     1381 tgtcaagacc cagagattgg acctgagcat agtgttgcag attatcacaa ccactcaaac
     1441 attgaaactc gactccgcaa gggaggtagg actagatgtt ttggaggctg tgtgtttgcc
     1501 tatgttggct gctataataa gcgtgcctac tgggttcctc gtgctagtgc tgatattggc
     1561 tcaggccata ctggcattac tggtgacaat gtggagacct tgaatgagga tctccttgag
     1621 atactgagtc gtgaacgtgt taacattaac attgttggcg attttcattt gaatgaagag
     1681 gttgccatca ttttggcatc tttctctgct tctacaagtg cctttattga cactataaag
     1741 agtcttgatt acaagtcttt caaaaccatt gttgagtcct gcggtaacta taaagttacc
     1801 aagggaaagc ccgtaaaagg tgcttggaac attggacaac agagatcagt tttaacacca
     1861 ctgtgtggtt ttccctcaca ggctgctggt gttatcagat caatttttgc gcgcacactt
     1921 gatgcagcaa accactcaat tcctgatttg caaagagcag ctgtcaccat acttgatggt
     1981 atttctgaac agtcattacg tcttgtcgac gccatggttt atacttcaga cctgctcacc
     2041 aacagtgtca ttattatggc atatgtaact ggtggtcttg tacaacagac ttctcagtgg
     2101 ttgtctaatc ttttgggcac tactgttgaa aaactcaggc ctatctttga atggattgag
     2161 gcgaaactta gtgcaggagt tgaatttctc aaggatgctt gggagattct caaatttctc
     2221 attacaggtg tttttgacat cgtcaagggt caaatacagg ttgcttcaga taacatcaag
     2281 gattgtgtaa aatgcttcat tgatgttgtt aacaaggcac tcgaaatgtg cattgatcaa
     2341 gtcactatcg ctggcgcaaa gttgcgatca ctcaacttag gtgaagtctt catcgctcaa
     2401 agcaagggac tttaccgtca gtgtatacgt ggcaaggagc agctgcaact actcatgcct
     2461 cttaaggcac caaaagaagt aacctttctt gaaggtgatt cacatgacac agtacttacc
     2521 tctgaggagg ttgttctcaa gaacggtgaa ctcgaagcac tcgagacgcc cgttgatagc
     2581 ttcacaaatg gagctatcgt tggcacacca gtctgtgtaa atggcctcat gctcttagag
     2641 attaaggaca aagaacaata ctgcgcattg tctcctggtt tactggctac aaacaatgtc
     2701 tttcgcttaa aagggggtgc accaattaaa ggtgtaacct ttggagaaga tactgtttgg
     2761 gaagttcaag gttacaagaa tgtgagaatc acatttgagc ttgatgaacg tgttgacaaa
     2821 gtgcttaatg aaaagtgctc tgtctacact gttgaatccg gtaccgaagt tactgagttt
     2881 gcatgtgttg tagcagaggc tgttgtgaag actttacaac cagtttctga tctccttacc
     2941 aacatgggta ttgatcttga tgagtggagt gtagctacat tctacttatt tgatgatgct
     3001 ggtgaagaaa acttttcatc acgtatgtat tgttcctttt accctccaga tgaggaagaa
     3061 gaggacgatg cagagtgtga ggaagaagaa attgatgaaa cctgtgaaca tgagtacggt
     3121 acagaggatg attatcaagg tctccctctg gaatttggtg cctcagctga aacagttcga
     3181 gttgaggaag aagaagagga agactggctg gatgatacta ctgagcaatc agagattgag
     3241 ccagaaccag aacctacacc tgaagaacca gttaatcagt ttactggtta tttaaaactt
     3301 actgacaatg ttgccattaa atgtgttgac atcgttaagg aggcacaaag tgctaatcct
     3361 atggtgattg taaatgctgc taacatacac ctgaaacatg gtggtggtgt agcaggtgca
     3421 ctcaacaagg caaccaatgg tgccatgcaa aaggagagtg atgattacat taagctaaat
     3481 ggccctctta cagtaggagg gtcttgtttg ctttctggac ataatcttgc taagaagtgt
     3541 ctgcatgttg ttggacctaa cctaaatgca ggtgaggaca tccagcttct taaggcagca
     3601 tatgaaaatt tcaattcaca ggacatctta cttgcaccat tgttgtcagc aggcatattt
     3661 ggtgctaaac cacttcagtc tttacaagtg tgcgtgcaga cggttcgtac acaggtttat
     3721 attgcagtca atgacaaagc tctttatgag caggttgtca tggattatct tgataacctg
     3781 aagcctagag tggaagcacc taaacaagag gagccaccaa acacagaaga ttccaaaact
     3841 gaggagaaat ctgtcgtaca gaagcctgtc gatgtgaagc caaaaattaa ggcctgcatt
     3901 gatgaggtta ccacaacact ggaagaaact aagtttctta ccaataagtt actcttgttt
     3961 gctgatatca atggtaagct ttaccatgat tctcagaaca tgcttagagg tgaagatatg
     4021 tctttccttg agaaggatgc accttacatg gtaggtgatg ttatcactag tggtgatatc
     4081 acttgtgttg taataccctc caaaaaggct ggtggcacta ctgagatgct ctcaagagct
     4141 ttgaagaaag tgccagttga tgagtatata accacgtacc ctggacaagg atgtgctggt
     4201 tatacacttg aggaagctaa gactgctctt aagaaatgca aatctgcatt ttatgtacta
     4261 ccttcagaag cacctaatgc taaggaagag attctaggaa ctgtatcctg gaatttgaga
     4321 gaaatgcttg ctcatgctga agagacaaga aaattaatgc ctatatgcat ggatgttaga
     4381 gccataatgg caaccatcca acgtaagtat aaaggaatta aaattcaaga gggcatcgtt
     4441 gactatggtg tccgattctt cttttatact agtaaagagc ctgtagcttc tattattacg
     4501 aagctgaact ctctaaatga gccgcttgtc acaatgccaa ttggttatgt gacacatggt
     4561 tttaatcttg aagaggctgc gcgctgtatg cgttctctta aagctcctgc cgtagtgtca
     4621 gtatcatcac cagatgctgt tactacatat aatggatacc tcacttcgtc atcaaagaca
     4681 tctgaggagc actttgtaga aacagtttct ttggctggct cttacagaga ttggtcctat
     4741 tcaggacagc gtacagagtt aggtgttgaa tttcttaagc gtggtgacaa aattgtgtac
     4801 cacactctgg agagccccgt cgagtttcat cttgacggtg aggttctttc acttgacaaa
     4861 ctaaagagtc tcttatccct gcgggaggtt aagactataa aagtgttcac aactgtggac
     4921 aacactaatc tccacacaca gcttgtggat atgtctatga catatggaca gcagtttggt
     4981 ccaacatact tggatggtgc tgatgttaca aaaattaaac ctcatgtaaa tcatgagggt
     5041 aagactttct ttgtactacc tagtgatgac acactacgta gtgaagcttt cgagtactac
     5101 catactcttg atgagagttt tcttggtagg tacatgtctg ctttaaacca cacaaagaaa
     5161 tggaaatttc ctcaagttgg tggtttaact tcaattaaat gggctgataa caattgttat
     5221 ttgtctagtg ttttattagc acttcaacag cttgaagtca aattcaatgc accagcactt
     5281 caagaggctt attatagagc ccgtgctggt gatgctgcta acttttgtgc actcatactc
     5341 gcttacagta ataaaactgt tggcgagctt ggtgatgtca gagaaactat gacccatctt
     5401 ctacagcatg ctaatttgga atctgcaaag cgagttctta atgtggtgtg taaacattgt
     5461 ggtcagaaaa ctactacctt aacgggtgta gaagctgtga tgtatatggg tactctatct
     5521 tatgataatc ttaagacagg tgtttccatt ccatgtgtgt gtggtcgtga tgctacacaa
     5581 tatctagtac aacaagagtc ttcttttgtt atgatgtctg caccacctgc tgagtataaa
     5641 ttacagcaag gtacattctt atgtgcgaat gagtacactg gtaactatca gtgtggtcat
     5701 tacactcata taactgctaa ggagaccctc tatcgtattg acggagctca ccttacaaag
     5761 atgtcagagt acaaaggacc agtgactgat gttttctaca aggaaacatc ttacactaca
     5821 accatcaagc ctgtgtcgta taaactcgat ggagttactt acacagagat tgaaccaaaa
     5881 ttggatgggt attataaaaa ggataatgct tactatacag agcagcctat agaccttgta
     5941 ccaactcaac cattaccaaa tgcgagtttt gataatttca aactcacatg ttctaacaca
     6001 aaatttgctg atgatttaaa tcaaatgaca ggcttcacaa agccagcttc acgagagcta
     6061 tctgtcacat tcttcccaga cttgaatggc gatgtagtgg ctattgacta tagacactat
     6121 tcagcgagtt tcaagaaagg tgctaaatta ctgcataagc caattgtttg gcacattaac
     6181 caggctacaa ccaagacaac gttcaaacca aacacttggt gtttacgttg tctttggagt
     6241 acaaagccag tagatacttc aaattcattt gaagttctgg cagtagaaga cacacaagga
     6301 atggacaatc ttgcttgtga aagtcaacaa cccacctctg aagaagtagt ggaaaatcct
     6361 accatacaga aggaagtcat agagtgtgac gtgaaaacta ccgaagttgt aggcaatgtc
     6421 atacttaaac catcagatga aggtgttaaa gtaacacaag agttaggtca tgaggatctt
     6481 atggctgctt atgtggaaaa cacaagcatt accattaaga aacctaatga gctttcacta
     6541 gccttaggtt taaaaacaat tgccactcat ggtattgctg caattaatag tgttccttgg
     6601 agtaaaattt tggcttatgt caaaccattc ttaggacaag cagcaattac aacatcaaat
     6661 tgcgctaaga gattagcaca acgtgtgttt aacaattata tgccttatgt gtttacatta
     6721 ttgttccaat tgtgtacttt tactaaaagt accaattcta gaattagagc ttcactacct
     6781 acaactattg ctaaaaatag tgttaagagt gttgctaaat tatgtttgga tgccggcatt
     6841 aattatgtga agtcacccaa attttctaaa ttgttcacaa tcgctatgtg gctattgttg
     6901 ttaagtattt gcttaggttc tctaatctgt gtaactgctg cttttggtgt actcttatct
     6961 aattttggtg ctccttctta ttgtaatggc gttagagaat tgtatcttaa ttcgtctaac
     7021 gttactacta tggatttctg tgaaggttct tttccttgca gcatttgttt aagtggatta
     7081 gactcccttg attcttatcc agctcttgaa accattcagg tgacgatttc atcgtacaag
     7141 ctagacttga caattttagg tctggccgct gagtgggttt tggcatatat gttgttcaca
     7201 aaattctttt atttattagg tctttcagct ataatgcagg tgttctttgg ctattttgct
     7261 agtcatttca tcagcaattc ttggctcatg tggtttatca ttagtattgt acaaatggca
     7321 cccgtttctg caatggttag gatgtacatc ttctttgctt ctttctacta catatggaag
     7381 agctatgttc atatcatgga tggttgcacc tcttcgactt gcatgatgtg ctataagcgc
     7441 aatcgtgcca cacgcgttga gtgtacaact attgttaatg gcatgaagag atctttctat
     7501 gtctatgcaa atggaggccg tggcttctgc aagactcaca attggaattg tctcaattgt
     7561 gacacatttt gcactggtag tacattcatt agtgatgaag ttgctcgtga tttgtcactc
     7621 cagtttaaaa gaccaatcaa ccctactgac cagtcatcgt atattgttga tagtgttgct
     7681 gtgaaaaatg gcgcgcttca cctctacttt gacaaggctg gtcaaaagac ctatgagaga
     7741 catccgctct cccattttgt caatttagac aatttgagag ctaacaacac taaaggttca
     7801 ctgcctatta atgtcatagt ttttgatggc aagtccaaat gcgacgagtc tgcttctaag
     7861 tctgcttctg tgtactacag tcagctgatg tgccaaccta ttctgttgct tgaccaagct
     7921 cttgtatcag acgttggaga tagtactgaa gtttccgtta agatgtttga tgcttatgtc
     7981 gacacctttt cagcaacttt tagtgttcct atggaaaaac ttaaggcact tgttgctaca
     8041 gctcacagcg agttagcaaa gggtgtagct ttagatggtg tcctttctac attcgtgtca
     8101 gctgcccgac aaggtgttgt tgataccgat gttgacacaa aggatgttat tgaatgtctc
     8161 aaactttcac atcactctga cttagaagtg acaggtgaca gttgtaacaa tttcatgctc
     8221 acctataata aggttgaaaa catgacgccc agagatcttg gcgcatgtat tgactgtaat
     8281 gcaaggcata tcaatgccca agtagcaaaa agtcacaatg tttcactcat ctggaatgta
     8341 aaagactaca tgtctttatc tgaacagctg cgtaaacaaa ttcgtagtgc tgccaagaag
     8401 aacaacatac cttttagact aacttgtgct acaactagac aggttgtcaa tgtcataact
     8461 actaaaatct cactcaaggg tggtaagatt gttagtactt gttttaaact tatgcttaag
     8521 gccacattat tgtgcgttct tgctgcattg gtttgttata tcgttatgcc agtacataca
     8581 ttgtcaatcc atgatggtta cacaaatgaa atcattggtt acaaagccat tcaggatggt
     8641 gtcactcgtg acatcatttc tactgatgat tgttttgcaa ataaacatgc tggttttgac
     8701 gcatggttta gccagcgtgg tggttcatac aaaaatgaca aaagctgccc tgtagtagct
     8761 gctatcatta caagagagat tggtttcata gtgcctggct taccgggtac tgtgctgaga
     8821 gcaatcaatg gtgacttctt gcattttcta cctcgtgttt ttagtgctgt tggcaacatt
     8881 tgctacacac cttccaaact cattgagtat agtgattttg ctacctctgc ttgcgttctt
     8941 gctgctgagt gtacaatttt taaggatgct atgggcaaac ctgtgccata ttgttatgac
     9001 actaatttgc tagagggttc tatttcttat agtgagcttc gtccagacac tcgttatgtg
     9061 cttatggatg gttccatcat acagtttcct aacacttacc tggagggttc tgttagagta
     9121 gtaacaactt ttgatgctga gtactgtaga catggtacat gcgaaaggtc agaagtaggt
     9181 atttgcctat ctaccagtgg tagatgggtt cttaataatg agcattacag agctctatca
     9241 ggagttttct gtggtgttga tgcgatgaat ctcatagcta acatctttac tcctcttgtg
     9301 caacctgtgg gtgctttaga tgtgtctgct tcagtagtgg ctggtggtat tattgccata
     9361 ttggtgactt gtgctgccta ctactttatg aaattcagac gtgtttttgg tgagtacaac
     9421 catgttgttg ctgctaatgc acttttgttt ttgatgtctt tcactatact ctgtctggta
     9481 ccagcttaca gctttctgcc gggagtctac tcagtctttt acttgtactt gacattctat
     9541 ttcaccaatg atgtttcatt cttggctcac cttcaatggt ttgccatgtt ttctcctatt
     9601 gtgccttttt ggataacagc aatctatgta ttctgtattt ctctgaagca ctgccattgg
     9661 ttctttaaca actatcttag gaaaagagtc atgtttaatg gagttacatt tagtaccttc
     9721 gaggaggctg ctttgtgtac ctttttgctc aacaaggaaa tgtacctaaa attgcgtagc
     9781 gagacactgt tgccacttac acagtataac aggtatcttg ctctatataa caagtacaag
     9841 tatttcagtg gagccttaga tactaccagc tatcgtgaag cagcttgctg ccacttagca
     9901 aaggctctaa atgactttag caactcaggt gctgatgttc tctaccaacc accacagaca
     9961 tcaatcactt ctgctgttct gcagagtggt tttaggaaaa tggcattccc gtcaggcaaa
    10021 gttgaagggt gcatggtaca agtaacctgt ggaactacaa ctcttaatgg attgtggttg
    10081 gatgacacag tatactgtcc aagacatgtc atttgcacag cagaagacat gcttaatcct
    10141 aactatgaag atctgctcat tcgcaaatcc aaccatagct ttcttgttca ggctggcaat
    10201 gttcaacttc gtgttattgg ccattctatg caaaattgtc tgcttaggct taaagttgat
    10261 acttctaacc ctaagacacc caagtataaa tttgtccgta tccaacctgg tcaaacattt
    10321 tcagttctag catgctacaa tggttcacca tctggtgttt atcagtgtgc catgagacct
    10381 aatcatacca ttaaaggttc tttccttaat ggatcatgtg gtagtgttgg ttttaacatt
    10441 gattatgatt gcgtgtcttt ctgctatatg catcatatgg agcttccaac aggagtacac
    10501 gctggtactg acttagaagg taaattctat ggtccatttg ttgacagaca aactgcacag
    10561 gctgcaggta cagacacaac cataacatta aatgttttgg catggctgta tgctgctgtt
    10621 atcaatggtg ataggtggtt tcttaataga ttcaccacta ctttgaatga ctttaacctt
    10681 gtggcaatga agtacaacta tgaacctttg acacaagatc atgttgacat attgggacct
    10741 ctttctgctc aaacaggaat tgccgtctta gatatgtgtg ctgctttgaa agagctgctg
    10801 cagaatggta tgaatggtcg tactatcctt ggtagcacta ttttagaaga tgagtttaca
    10861 ccatttgatg ttgttagaca atgctctggt gttaccttcc aaggtaagtt caagaaaatt
    10921 gttaagggca ctcatcattg gatgctttta actttcttga catcactatt gattcttgtt
    10981 caaagtacac agtggtcact gtttttcttt gtttacgaga atgctttctt gccatttact
    11041 cttggtatta tggcaattgc tgcatgtgct atgctgcttg ttaagcataa gcacgcattc
    11101 ttgtgcttgt ttctgttacc ttctcttgca acagttgctt actttaatat ggtctacatg
    11161 cctgctagct gggtgatgcg tatcatgaca tggcttgaat tggctgacac tagcttgtct
    11221 ggttataggc ttaaggattg tgttatgtat gcttcagctt tagttttgct tattctcatg
    11281 acagctcgca ctgtttatga tgatgctgct agacgtgttt ggacactgat gaatgtcatt
    11341 acacttgttt acaaagtcta ctatggtaat gctttagatc aagctatttc catgtgggcc
    11401 ttagttattt ctgtaacctc taactattct ggtgtcgtta cgactatcat gtttttagct
    11461 agagctatag tgtttgtgtg tgttgagtat tacccattgt tatttattac tggcaacacc
    11521 ttacagtgta tcatgcttgt ttattgtttc ttaggctatt gttgctgctg ctactttggc
    11581 cttttctgtt tactcaaccg ttacttcagg cttactcttg gtgtttatga ctacttggtc
    11641 tctacacaag aatttaggta tatgaactcc caggggcttt tgcctcctaa gagtagtatt
    11701 gatgctttca agcttaacat taagttgttg ggtattggag gtaaaccatg tatcaaggtt
    11761 gctactgtac agtctaaaat gtctgacgta aagtgcacat ctgtggtact gctctcggtt
    11821 cttcaacaac ttagagtaga gtcatcttct aaattgtggg cacaatgtgt acaactccac
    11881 aatgatattc ttcttgcaaa agacacaact gaagctttcg agaagatggt ttctcttttg
    11941 tctgttttgc tatccatgca gggtgctgta gacattaata ggttgtgcga ggaaatgctc
    12001 gataaccgtg ctactcttca ggctattgct tcagaattta gttctttacc atcatatgcc
    12061 gcttatgcca ctgcccagga ggcctatgag caggctgtag ctaatggtga ttctgaagtc
    12121 gttctcaaaa agttaaagaa atctttgaat gtggctaaat ctgagtttga ccgtgatgct
    12181 gccatgcaac gcaagttgga aaagatggca gatcaggcta tgacccaaat gtacaaacag
    12241 gcaagatctg aggacaagag ggcaaaagta actagtgcta tgcaaacaat gctcttcact
    12301 atgcttagga agcttgataa tgatgcactt aacaacatta tcaacaatgc gcgtgatggt
    12361 tgtgttccac tcaacatcat accattgact acagcagcca aactcatggt tgttgtccct
    12421 gattatggta cctacaagaa cacttgtgat ggtaacacct ttacatatgc atctgcactc
    12481 tgggaaatcc agcaagttgt tgatgcggat agcaagattg ttcaacttag tgaaattaac
    12541 atggacaatt caccaaattt ggcttggcct cttattgtta cagctctaag agccaactca
    12601 gctgttaaac tacagaataa tgaactgagt ccagtagcac tacgacagat gtcctgtgcg
    12661 gctggtacca cacaaacagc ttgtactgat gacaatgcac ttgcctacta taacaattcg
    12721 aagggaggta ggtttgtgct ggcattacta tcagaccacc aagatctcaa atgggctaga
    12781 ttccctaaga gtgatggtac aggtacaatt tacacagaac tggaaccacc ttgtaggttt
    12841 gttacagaca caccaaaagg gcctaaagtg aaatacttgt acttcatcaa aggcttaaac
    12901 aacctaaata gaggtatggt gctgggcagt ttagctgcta cagtacgtct tcaggctgga
    12961 aatgctacag aagtacctgc caattcaact gtgctttcct tctgtgcttt tgcagtagac
    13021 cctgctaaag catataagga ttacctagca agtggaggac aaccaatcac caactgtgtg
    13081 aagatgttgt gtacacacac tggtacagga caggcaatta ctgtaacacc agaagctaac
    13141 atggaccaag agtcctttgg tggtgcttca tgttgtctgt attgtagatg ccacattgac
    13201 catccaaatc ctaaaggatt ctgtgacttg aaaggtaagt acgtccaaat acctaccact
    13261 tgtgctaatg acccagtggg ttttacactt agaaacacag tctgtaccgt ctgcggaatg
    13321 tggaaaggtt atggctgtag ttgtgaccaa ctccgcgaac ccttgatgca gtctgcggat
    13381 gcatcaacgt ttttaaacgg gtttgcggtg taagtgcagc ccgtcttaca ccgtgcggca
    13441 caggcactag tactgatgtc gtctacaggg cttttgatat ttacaacgaa aaagttgctg
    13501 gttttgcaaa gttcctaaaa actaattgct gtcgcttcca ggagaaggat gaggaaggca
    13561 atttattaga ctcttacttt gtagttaaga ggcatactat gtctaactac caacatgaag
    13621 agactattta taacttggtt aaagattgtc cagcggttgc tgtccatgac tttttcaagt
    13681 ttagagtaga tggtgacatg gtaccacata tatcacgtca gcgtctaact aaatacacaa
    13741 tggctgattt agtctatgct ctacgtcatt ttgatgaggg taattgtgat acattaaaag
    13801 aaatactcgt cacatacaat tgctgtgatg atgattattt caataagaag gattggtatg
    13861 acttcgtaga gaatcctgac atcttacgcg tatatgctaa cttaggtgag cgtgtacgcc
    13921 aatcattatt aaagactgta caattctgcg atgctatgcg tgatgcaggc attgtaggcg
    13981 tactgacatt agataatcag gatcttaatg ggaactggta cgatttcggt gatttcgtac
    14041 aagtagcacc aggctgcgga gttcctattg tggattcata ttactcattg ctgatgccca
    14101 tcctcacttt gactagggca ttggctgctg agtcccatat ggatgctgat ctcgcaaaac
    14161 cacttattaa gtgggatttg ctgaaatatg attttacgga agagagactt tgtctcttcg
    14221 accgttattt taaatattgg gaccagacat accatcccaa ttgtattaac tgtttggatg
    14281 ataggtgtat ccttcattgt gcaaacttta atgtgttatt ttctactgtg tttccaccta
    14341 caagttttgg accactagta agaaaaatat ttgtagatgg tgttcctttt gttgtttcaa
    14401 ctggatacca ttttcgtgag ttaggagtcg tacataatca ggatgtaaac ttacatagct
    14461 cgcgtctcag tttcaaggaa cttttagtgt atgctgctga tccagctatg catgcagctt
    14521 ctggcaattt attgctagat aaacgcacta catgcttttc agtagctgca ctaacaaaca
    14581 atgttgcttt tcaaactgtc aaacccggta attttaataa agacttttat gactttgctg
    14641 tgtctaaagg tttctttaag gaaggaagtt ctgttgaact aaaacacttc ttctttgctc
    14701 aggatggcaa cgctgctatc agtgattatg actattatcg ttataatctg ccaacaatgt
    14761 gtgatatcag acaactccta ttcgtagttg aagttgttga taaatacttt gattgttacg
    14821 atggtggctg tattaatgcc aaccaagtaa tcgttaacaa tctggataaa tcagctggtt
    14881 tcccatttaa taaatggggt aaggctagac tttattatga ctcaatgagt tatgaggatc
    14941 aagatgcact tttcgcgtat actaagcgta atgtcatccc tactataact caaatgaatc
    15001 ttaagtatgc cattagtgca aagaatagag ctcgcaccgt agctggtgtc tctatctgta
    15061 gtactatgac aaatagacag tttcatcaga aattattgaa gtcaatagcc gccactagag
    15121 gagctactgt ggtaattgga acaagcaagt tttacggtgg ctggcataat atgttaaaaa
    15181 ctgtttacag tgatgtagaa actccacacc ttatgggttg ggattatcca aaatgtgaca
    15241 gagccatgcc taacatgctt aggataatgg cctctcttgt tcttgctcgc aaacataaca
    15301 cttgctgtaa cttatcacac cgtttctaca ggttagctaa cgagtgtgcg caagtattaa
    15361 gtgagatggt catgtgtggc ggctcactat atgttaaacc aggtggaaca tcatccggtg
    15421 atgctacaac tgcttatgct aatagtgtct ttaacatttg tcaagctgtt acagccaatg
    15481 taaatgcact tctttcaact gatggtaata agatagctga caagtatgtc cgcaatctac
    15541 aacacaggct ctatgagtgt ctctatagaa atagggatgt tgatcatgaa ttcgtggatg
    15601 agttttacgc ttacctgcgt aaacatttct ccatgatgat tctttctgat gatgccgttg
    15661 tgtgctataa cagtaactat gcggctcaag gtttagtagc tagcattaag aactttaagg
    15721 cagttcttta ttatcaaaat aatgtgttca tgtctgaggc aaaatgttgg actgagactg
    15781 accttactaa aggacctcac gaattttgct cacagcatac aatgctagtt aaacaaggag
    15841 atgattacgt gtacctgcct tacccagatc catcaagaat attaggcgca ggctgttttg
    15901 tcgatgatat tgtcaaaaca gatggtacac ttatgattga aaggttcgtg tcactggcta
    15961 ttgatgctta cccacttaca aaacatccta atcaggagta tgctgatgtc tttcacttgt
    16021 atttacaata cattagaaag ttacatgatg agcttactgg ccacatgttg gacatgtatt
    16081 ccgtaatgct aactaatgat aacacctcac ggtactggga acctgagttt tatgaggcta
    16141 tgtacacacc acatacagtc ttgcaggctg taggtgcttg tgtattgtgc aattcacaga
    16201 cttcacttcg ttgcggtgcc tgtattagga gaccattcct atgttgcaag tgctgctatg
    16261 accatgtcat ttcaacatca cacaaattag tgttgtctgt taatccctat gtttgcaatg
    16321 ccccaggttg tgatgtcact gatgtgacac aactgtatct aggaggtatg agctattatt
    16381 gcaagtcaca taagcctccc attagttttc cattatgtgc taatggtcag gtttttggtt
    16441 tatacaaaaa cacatgtgta ggcagtgaca atgtcactga cttcaatgcg atagcaacat
    16501 gtgattggac taatgctggc gattacatac ttgccaacac ttgtactgag agactcaagc
    16561 ttttcgcagc agaaacgctc aaagccactg aggaaacatt taagctgtca tatggtattg
    16621 ccactgtacg cgaagtactc tctgacagag aattgcatct ttcatgggag gttggaaaac
    16681 ctagaccacc attgaacaga aactatgtct ttactggtta ccgtgtaact aaaaatagta
    16741 aagtacagat tggagagtac acctttgaaa aaggtgacta tggtgatgct gttgtgtaca
    16801 gaggtactac gacatacaag ttgaatgttg gtgattactt tgtgttgaca tctcacactg
    16861 taatgccact tagtgcacct actctagtgc cacaagagca ctatgtgaga attactggct
    16921 tgtacccaac actcaacatc tcagatgagt tttctagcaa tgttgcaaat tatcaaaagg
    16981 tcggcatgca aaagtactct acactccaag gaccacctgg tactggtaag agtcattttg
    17041 ccatcggact tgctctctat tacccatctg ctcgcatagt gtatacggca tgctctcatg
    17101 cagctgttga tgccctatgt gaaaaggcat taaaatattt gcccatagat aaatgtagta
    17161 gaatcatacc tgcgcgtgcg cgcgtagagt gttttgataa attcaaagtg aattcaacac
    17221 tagaacagta tgttttctgc actgtaaatg cattgccaga aacaactgct gacattgtag
    17281 tctttgatga aatctctatg gctactaatt atgacttgag tgttgtcaat gctagacttc
    17341 gtgcaaaaca ctacgtctat attggcgatc ctgctcaatt accagccccc cgcacattgc
    17401 tgactaaagg cacactagaa ccagaatatt ttaattcagt gtgcagactt atgaaaacaa
    17461 taggtccaga catgttcctt ggaacttgtc gccgttgtcc tgctgaaatt gttgacactg
    17521 tgagtgcttt agtttatgac aataagctaa aagcacacaa ggataagtca gctcaatgct
    17581 tcaaaatgtt ctacaaaggt gttattacac atgatgtttc atctgcaatc aacagacctc
    17641 aaataggcgt tgtaagagaa tttcttacac gcaatcctgc ttggagaaaa gctgttttta
    17701 tctcacctta taattcacag aacgctgtag cttcaaaaat cttaggattg cctacgcaga
    17761 ctgttgattc atcacagggt tctgaatatg actatgtcat attcacacaa actactgaaa
    17821 cagcacactc ttgtaatgtc aaccgcttca atgtggctat cacaagggca aaaattggca
    17881 ttttgtgcat aatgtctgat agagatcttt atgacaaact gcaatttaca agtctagaaa
    17941 taccacgtcg caatgtggct acattacaag cagaaaatgt aactggactt tttaaggact
    18001 gtagtaagat cattactggt cttcatccta cacaggcacc tacacacctc agcgttgata
    18061 taaagttcaa gactgaagga ttatgtgttg acataccagg cataccaaag gacatgacct
    18121 accgtagact catctctatg atgggtttca aaatgaatta ccaagtcaat ggttacccta
    18181 atatgtttat cacccgcgaa gaagctattc gtcacgttcg tgcgtggatt ggctttgatg
    18241 tagagggctg tcatgcaact agagatgctg tgggtactaa cctacctctc cagctaggat
    18301 tttctacagg tgttaactta gtagctgtac cgactggtta tgttgacact gaaaataaca
    18361 cagaattcac cagagttaat gcaaaacctc caccaggtga ccagtttaaa catcttatac
    18421 cactcatgta taaaggcttg ccctggaatg tagtgcgtat taagatagta caaatgctca
    18481 gtgatacact gaaaggattg tcagacagag tcgtgttcgt cctttgggcg catggctttg
    18541 agcttacatc aatgaagtac tttgtcaaga ttggacctga aagaacgtgt tgtctgtgtg
    18601 acaaacgtgc aacttgcttt tctacttcat cagatactta tgcctgctgg aatcattctg
    18661 tgggttttga ctatgtctat aacccattta tgattgatgt tcagcagtgg ggctttacgg
    18721 gtaaccttca gagtaaccat gaccaacatt gccaggtaca tggaaatgca catgtggcta
    18781 gttgtgatgc tatcatgact agatgtttag cagtccatga gtgctttgtt aagcgcgttg
    18841 attggtctgt tgaataccct attataggag atgaactgag ggttaattct gcttgcagaa
    18901 aagtacaaca catggttgtg aagtctgcat tgcttgctga taagtttcca gttcttcatg
    18961 acattggaaa tccaaaggct atcaagtgtg tgcctcaggc tgaagtagaa tggaagttct
    19021 acgatgctca gccatgtagt gacaaagctt acaaaataga ggaactcttc tattcttatg
    19081 ctacacatca cgataaattc actgatggtg tttgtttgtt ttggaattgt aacgttgatc
    19141 gttacccagc caatgcaatt gtgtgtaggt ttgacacaag agtcttgtca aacttgaact
    19201 taccaggctg tgatggtggt agtttgtatg tgaataagca tgcattccac actccagctt
    19261 tcgataaaag tgcatttact aatttaaagc aattgccttt cttttactat tctgatagtc
    19321 cttgtgagtc tcatggcaaa caagtagtgt cggatattga ttatgttcca ctcaaatctg
    19381 ctacgtgtat tacacgatgc aatttaggtg gtgctgtttg cagacaccat gcaaatgagt
    19441 accgacagta cttggatgca tataatatga tgatttctgc tggatttagc ctatggattt
    19501 acaaacaatt tgatacttat aacctgtgga atacatttac caggttacag agtttagaaa
    19561 atgtggctta taatgttgtt aataaaggac actttgatgg acacgccggc gaagcacctg
    19621 tttccatcat taataatgct gtttacacaa aggtagatgg tattgatgtg gagatctttg
    19681 aaaataagac aacacttcct gttaatgttg catttgagct ttgggctaag cgtaacatta
    19741 aaccagtgcc agagattaag atactcaata atttgggtgt tgatatcgct gctaatactg
    19801 taatctggga ctacaaaaga gaagccccag cacatgtatc tacaataggt gtctgcacaa
    19861 tgactgacat tgccaagaaa cctactgaga gtgcttgttc ttcacttact gtcttgtttg
    19921 atggtagagt ggaaggacag gtagaccttt ttagaaacgc ccgtaatggt gttttaataa
    19981 cagaaggttc agtcaaaggt ctaacacctt caaagggacc agcacaagct agcgtcaatg
    20041 gagtcacatt aattggagaa tcagtaaaaa cacagtttaa ctactttaag aaagtagacg
    20101 gcattattca acagttgcct gaaacctact ttactcagag cagagactta gaggatttta
    20161 agcccagatc acaaatggaa actgactttc tcgagctcgc tatggatgaa ttcatacagc
    20221 gatataagct cgagggctat gccttcgaac acatcgttta tggagatttc agtcatggac
    20281 aacttggcgg tcttcattta atgataggct tagccaagcg ctcacaagat tcaccactta
    20341 aattagagga ttttatccct atggacagca cagtgaaaaa ttacttcata acagatgcgc
    20401 aaacaggttc atcaaaatgt gtgtgttctg tgattgatct tttacttgat gactttgtcg
    20461 agataataaa gtcacaagat ttgtcagtga tttcaaaagt ggtcaaggtt acaattgact
    20521 atgctgaaat ttcattcatg ctttggtgta aggatggaca tgttgaaacc ttctacccaa
    20581 aactacaagc aagtcaagcg tggcaaccag gtgttgcgat gcctaacttg tacaagatgc
    20641 aaagaatgct tcttgaaaag tgtgaccttc agaattatgg tgaaaatgct gttataccaa
    20701 aaggaataat gatgaatgtc gcaaagtata ctcaactgtg tcaatactta aatacactta
    20761 ctttagctgt accctacaac atgagagtta ttcactttgg tgctggctct gataaaggag
    20821 ttgcaccagg tacagctgtg ctcagacaat ggttgccaac tggcacacta cttgtcgatt
    20881 cagatcttaa tgacttcgtc tccgacgcag attctacttt aattggagac tgtgcaacag
    20941 tacatacggc taataaatgg gaccttatta ttagcgatat gtatgaccct aggaccaaac
    21001 atgtgacaaa agagaatgac tctaaagaag ggtttttcac ttatctgtgt ggatttataa
    21061 agcaaaaact agccctgggt ggttctatag ctgtaaagat aacagagcat tcttggaatg
    21121 ctgaccttta caagcttatg ggccatttct catggtggac agcttttgtt acaaatgtaa
    21181 atgcatcatc atcggaagca tttttaattg gggctaacta tcttggcaag ccgaaggaac
    21241 aaattgatgg ctataccatg catgctaact acattttctg gaggaacaca aatcctatcc
    21301 agttgtcttc ctattcactc tttgacatga gcaaatttcc tcttaaatta agaggaactg
    21361 ctgtaatgtc tcttaaggag aatcaaatca atgatatgat ttattctctt ctggaaaaag
    21421 gtaggcttat cattagagaa aacaacagag ttgtggtttc aagtgatatt cttgttaaca
    21481 actaaacgaa catgtttatt ttcttattat ttcttactct cactagtggt agtgaccttg
    21541 accggtgcac cacttttgat gatgttcaag ctcctaatta cactcaacat acttcatcta
    21601 tgaggggggt ttactatcct gatgaaattt ttagatcaga cactctttat ttaactcagg
    21661 atttatttct tccattttat tctaatgtta cagggtttca tactattaat catacgtttg
    21721 gcaaccctgt catacctttt aaggatggta tttattttgc tgccacagag aaatcaaatg
    21781 ttgtccgtgg ttgggttttt ggttctacca tgaacaacaa gtcacagtcg gtgattatta
    21841 ttaacaattc tactaatgtt gttatacgag catgtaactt tgaattgtgt gacaaccctt
    21901 tctttgctgt ttctaaaccc atgggtacac agacacatac tatgatattc gataatgcat
    21961 ttaattgcac tttcgagtac atatctgatg ccttttcgct tgatgtttca gaaaagtcag
    22021 gtaattttaa acacttacga gagtttgtgt ttaaaaataa agatgggttt ctctatgttt
    22081 ataagggcta tcaacctata gatgtagttc gtgatctacc ttctggtttt aacactttga
    22141 aacctatttt taagttgcct cttggtatta acattacaaa ttttagagcc attcttacag
    22201 ccttttcacc tgctcaagac atttggggca cgtcagctgc agcctatttt gttggctatt
    22261 taaagccaac tacatttatg ctcaagtatg atgaaaatgg tacaatcaca gatgctgttg
    22321 attgttctca aaatccactt gctgaactca aatgctctgt taagagcttt gagattgaca
    22381 aaggaattta ccagacctct aatttcaggg ttgttccctc aggagatgtt gtgagattcc
    22441 ctaatattac aaacttgtgt ccttttggag aggtttttaa tgctactaaa ttcccttctg
    22501 tctatgcatg ggagagaaaa aaaatttcta attgtgttgc tgattactct gtgctctaca
    22561 actcaacatt tttttcaacc tttaagtgct atggcgtttc tgccactaag ttgaatgatc
    22621 tttgcttctc caatgtctat gcagattctt ttgtagtcaa gggagatgat gtaagacaaa
    22681 tagcgccagg acaaactggt gttattgctg attataatta taaattgcca gatgatttca
    22741 tgggttgtgt ccttgcttgg aatactagga acattgatgc tacttcaact ggtaattata
    22801 attataaata taggtatctt agacatggca agcttaggcc ctttgagaga gacatatcta
    22861 atgtgccttt ctcccctgat ggcaaacctt gcaccccacc tgctcttaat tgttattggc
    22921 cattaaatga ttatggtttt tacaccacta ctggcattgg ctaccaacct tacagagttg
    22981 tagtactttc ttttgaactt ttaaatgcac cggccacggt ttgtggacca aaattatcca
    23041 ctgaccttat taagaaccag tgtgtcaatt ttaattttaa tggactcact ggtactggtg
    23101 tgttaactcc ttcttcaaag agatttcaac catttcaaca atttggccgt gatgtttctg
    23161 atttcactga ttccgttcga gatcctaaaa catctgaaat attagacatt tcaccttgcg
    23221 cttttggggg tgtaagtgta attacacctg gaacaaatgc ttcatctgaa gttgctgttc
    23281 tatatcaaga tgttaactgc actgatgttt ctacagcaat tcatgcagat caactcacac
    23341 cagcttggcg catatattct actggaaaca atgtattcca gactcaagca ggctgtctta
    23401 taggagctga gcatgtcgac acttcttatg agtgcgacat tcctattgga gctggcattt
    23461 gtgctagtta ccatacagtt tctttattac gtagtactag ccaaaaatct attgtggctt
    23521 atactatgtc tttaggtgct gatagttcaa ttgcttactc taataacacc attgctatac
    23581 ctactaactt ttcaattagc attactacag aagtaatgcc tgtttctatg gctaaaacct
    23641 ccgtagattg taatatgtac atctgcggag attctactga atgtgctaat ttgcttctcc
    23701 aatatggtag cttttgcaca caactaaatc gtgcactctc aggtattgct gctgaacagg
    23761 atcgcaacac acgtgaagtg ttcgctcaag tcaaacaaat gtacaaaacc ccaactttga
    23821 aatattttgg tggttttaat ttttcacaaa tattacctga ccctctaaag ccaactaaga
    23881 ggtcttttat tgaggacttg ctctttaata aggtgacact cgctgatgct ggcttcatga
    23941 agcaatatgg cgaatgccta ggtgatatta atgctagaga tctcatttgt gcgcagaagt
    24001 tcaatggact tacagtgttg ccacctctgc tcactgatga tatgattgct gcctacactg
    24061 ctgctctagt tagtggtact gccactgctg gatggacatt tggtgctggc gctgctcttc
    24121 aaataccttt tgctatgcaa atggcatata ggttcaatgg cattggagtt acccaaaatg
    24181 ttctctatga gaaccaaaaa caaatcgcca accaatttaa caaggcgatt agtcaaattc
    24241 aagaatcact tacaacaaca tcaactgcat tgggcaagct gcaagacgtt gttaaccaga
    24301 atgctcaagc attaaacaca cttgttaaac aacttagctc taattttggt gcaatttcaa
    24361 gtgtgctaaa tgatatcctt tcgcgacttg ataaagtcga ggcggaggta caaattgaca
    24421 ggttaattac aggcagactt caaagccttc aaacctatgt aacacaacaa ctaatcaggg
    24481 ctgctgaaat cagggcttct gctaatcttg ctgctactaa aatgtctgag tgtgttcttg
    24541 gacaatcaaa aagagttgac ttttgtggaa agggctacca ccttatgtcc ttcccacaag
    24601 cagccccgca tggtgttgtc ttcctacatg tcacgtatgt gccatcccag gagaggaact
    24661 tcaccacagc gccagcaatt tgtcatgaag gcaaagcata cttccctcgt gaaggtgttt
    24721 ttgtgtttaa tggcacttct tggtttatta cacagaggaa cttcttttct ccacaaataa
    24781 ttactacaga caatacattt gtctcaggaa attgtgatgt cgttattggc atcattaaca
    24841 acacagttta tgatcctctg caacctgagc ttgactcatt caaagaagag ctggacaagt
    24901 acttcaaaaa tcatacatca ccagatgttg atcttggcga catttcaggc attaacgctt
    24961 ctgtcgtcaa cattcaaaaa gaaattgacc gcctcaatga ggtcgctaaa aatttaaatg
    25021 aatcactcat tgaccttcaa gaattgggaa aatatgagca atatattaaa tggccttggt
    25081 atgtttggct cggcttcatt gctggactaa ttgccatcgt catggttaca atcttgcttt
    25141 gttgcatgac tagttgttgc agttgcctca agggtgcatg ctcttgtggt tcttgctgca
    25201 agtttgatga ggatgactct gagccagttc tcaagggtgt caaattacat tacacataaa
    25261 cgaacttatg gatttgttta tgagattttt tactcttaga tcaattactg cacagccagt
    25321 aaaaattgac aatgcttctc ctgcaagtac tgttcatgct acagcaacga taccgctaca
    25381 agcctcactc cctttcggat ggcttgttat tggcgttgca tttcttgctg tttttcagag
    25441 cgctaccaaa ataattgcgc tcaataaaag atggcagcta gccctttata agggcttcca
    25501 gttcatttgc aatttactgc tgctatttgt taccatctat tcacatcttt tgcttgtcgc
    25561 tgcaggtatg gaggcgcaat ttttgtacct ctatgccttg atatattttc tacaatgcat
    25621 caacgcatgt agaattatta tgagatgttg gctttgttgg aagtgcaaat ccaagaaccc
    25681 attactttat gatgccaact actttgtttg ctggcacaca cataactatg actactgtat
    25741 accatataac agtgtcacag atacaattgt cgttactgaa ggtgacggca tttcaacacc
    25801 aaaactcaaa gaagactacc aaattggtgg ttattctgag gataggcact caggtgttaa
    25861 agactatgtc gttgtacatg gctatttcac cgaagtttac taccagcttg agtctacaca
    25921 aattactaca gacactggta ttgaaaatgc tacattcttc atctttaaca agcttgttaa
    25981 agacccaccg aatgtgcaaa tacacacaat cgacggctct tcaggagttg ctaatccagc
    26041 aatggatcca atttatgatg agccgacgac gactactagc gtgcctttgt aagcacaaga
    26101 aagtgagtac gaacttatgt actcattcgt ttcggaagaa acaggtacgt taatagttaa
    26161 tagcgtactt ctttttcttg ctttcgtggt attcttgcta gtcacactag ccatccttac
    26221 tgcgcttcga ttgtgtgcgt actgctgcaa tattgttaac gtgagtttag taaaaccaac
    26281 ggtttacgtc tactcgcgtg ttaaaaatct gaactcttct gaaggagttc ctgatcttct
    26341 ggtctaaacg aactaactat tattattatt ctgtttggaa ctttaacatt gcttatcatg
    26401 gcagacaacg gtactattac cgttgaggag cttaaacaac tcctggaaca atggaaccta
    26461 gtaataggtt tcctattcct agcctggatt atgttactac aatttgccta ttctaatcgg
    26521 aacaggtttt tgtacataat aaagcttgtt ttcctctggc tcttgtggcc agtaacactt
    26581 gcttgttttg tgcttgctgc tgtctacaga attaattggg tgactggcgg gattgcgatt
    26641 gcaatggctt gtattgtagg cttgatgtgg cttagctact tcgttgcttc cttcaggctg
    26701 tttgctcgta cccgctcaat gtggtcattc aacccagaaa caaacattct tctcaatgtg
    26761 cctctccggg ggacaattgt gaccagaccg ctcatggaaa gtgaacttgt cattggtgct
    26821 gtgatcattc gtggtcactt gcgaatggcc ggacactccc tagggcgctg tgacattaag
    26881 gacctgccaa aagagatcac tgtggctaca tcacgaacgc tttcttatta caaattagga
    26941 gcgtcgcagc gtgtaggcac tgattcaggt tttgctgcat acaaccgcta ccgtattgga
    27001 aactataaat taaatacaga ccacgccggt agcaacgaca atattgcttt gctagtacag
    27061 taagtgacaa cagatgtttc atcttgttga cttccaggtt acaatagcag agatattgat
    27121 tatcattatg aggactttca ggattgctat ttggaatctt gacgttataa taagttcaat
    27181 agtgagacaa ttatttaagc ctctaactaa gaagaattat tcggagttag atgatgaaga
    27241 acctatggag ttagattatc cataaaacga acatgaaaat tattctcttc ctgacattga
    27301 ttgtatttac atcttgcgag ctatatcact atcaggagtg tgttagaggt acgactgtac
    27361 tactaaaaga accttgccca tcaggaacat acgagggcaa ttcaccattt caccctcttg
    27421 ctgacaataa atttgcacta acttgcacta gcacacactt tgcttttgct tgtgctgacg
    27481 gtactcgaca tacctatcag ctgcgtgcaa gatcagtttc accaaaactt ttcatcagac
    27541 aagaggaggt tcaacaagag ctctactcgc cactttttct cattgttgct gctctagtat
    27601 ttttaatact ttgcttcacc attaagagaa agacagaatg aatgagctca ctttaattga
    27661 cttctatttg tgctttttag cctttctgct attccttgtt ttaataatgc ttattatatt
    27721 ttggttttca ctcgaaatcc aggatctaga agaaccttgt accaaagtct aaacgaacat
    27781 gaaacttctc attgttttga cttgtatttc tctatgcagt tgcatatgca ctgtagtaca
    27841 gcgctgtgca tctaataaac ctcatgtgct tgaagatcct tgtaaggtac aacactaggg
    27901 gtaatactta tagcactgct tggctttgtg ctctaggaaa ggttttacct tttcatagat
    27961 ggcacactat ggttcaaaca tgcacaccta atgttactat caactgtcaa gatccagctg
    28021 gtggtgcgct tatagctagg tgttggtacc ttcatgaagg tcaccaaact gctgcattta
    28081 gagacgtact tgttgtttta aataaacgaa caaattaaaa tgtctgataa tggaccccaa
    28141 tcaaaccaac gtagtgcccc ccgcattaca tttggtggac ccacagattc aactgacaat
    28201 aaccagaatg gaggacgcaa tggggcaagg ccaaaacagc gccgacccca aggtttaccc
    28261 aataatactg cgtcttggtt cacagctctc actcagcatg gcaaggagga acttagattc
    28321 cctcgaggcc agggcgttcc aatcaacacc aatagtggtc cagatgacca aattggctac
    28381 taccgaagag ctacccgacg agttcgtggt ggtgacggca aaatgaaaga gctcagcccc
    28441 agatggtact tctattacct aggaactggc ccagaagctt cacttcccta cggcgctaac
    28501 aaagaaggca tcgtatgggt tgcaactgag ggagccttga atacacccaa agaccacatt
    28561 ggcacccgca atcctaataa caatgctgcc accgtgctac aacttcctca aggaacaaca
    28621 ttgccaaaag gcttctacgc agagggaagc agaggcggca gtcaagcctc ttctcgctcc
    28681 tcatcacgta gtcgcggtaa ttcaagaaat tcaactcctg gcagcagtag gggaaattct
    28741 cctgctcgaa tggctagcgg aggtggtgaa actgccctcg cgctattgct gctagacaga
    28801 ttgaaccagc ttgagagcaa agtttctggt aaaggccaac aacaacaagg ccaaactgtc
    28861 actaagaaat ctgctgctga ggcatctaaa aagcctcgcc aaaaacgtac tgccacaaaa
    28921 cagtacaacg tcactcaagc atttgggaga cgtggtccag aacaaaccca aggaaatttc
    28981 ggggaccaag acctaatcag acaaggaact gattacaaac attggccgca aattgcacaa
    29041 tttgctccaa gtgcctctgc attctttgga atgtcacgca ttggcatgga agtcacacct
    29101 tcgggaacat ggctgactta tcatggagcc attaaattgg atgacaaaga tccacaattc
    29161 aaagacaacg tcatactgct gaacaagcac attgacgcat acaaaacatt cccaccaaca
    29221 gagcctaaaa aggacaaaaa gaaaaagact gatgaagctc agcctttgcc gcagagacaa
    29281 aagaagcagc ccactgtgac tcttcttcct gcggctgaca tggatgattt ctccagacaa
    29341 cttcaaaatt ccatgagtgg agcttctgct gattcaactc aggcataaac actcatgatg
    29401 accacacaag gcagatgggc tatgtaaacg ttttcgcaat tccgtttacg atacatagtc
    29461 tactcttgtg cagaatgaat tctcgtaact aaacagcaca agtaggttta gttaacttta
    29521 atctcacata gcaatcttta atcaatgtgt aacattaggg aggacttgaa agagccacca
    29581 cattttcatc gaggccacgc ggagtacgat cgagggtaca gtgaataatg ctagggagag
    29641 ctgcctatat ggaagagccc taatgtgtaa aattaatttt agtagtgcta tccccatgtg
    29701 attttaatag cttcttagga gaatgacaaa aaaaaaaaaa aaaaaaaaaa a"""

In [28]:
# Strip newlines, position numbers and spaces so that only the bases remain
for s in "\n0123456789 ":
    old_corona = old_corona.replace(s, "")

In [29]:
import lzma
lzc_v1 = lzma.compress(old_corona.encode("utf-8"))
len(lzc_v1)

8412

In [30]:
len(lzc_v1) - len(lzc)

4

Judging only by the Kolmogorov complexity estimates, you might conclude that the two coronaviruses differ by about 4 bytes of information. That is misleading: the gap between two compressed sizes does not measure how different the sequences themselves are. The edit distance tells the real story:

(My PC hangs when executing the following command, so here is a screenshot of it running on Google Colab.)

![](./images/edit-distance.png)
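
For readers who cannot view the screenshot, here is a minimal sketch of how an edit distance (Levenshtein distance) can be computed in pure Python. It is an illustration only: the function name `edit_distance` is hypothetical, and the command in the screenshot may well have used a different library or algorithm. The method fills a table of size len(a) x len(b), so two genomes of 29903 and 29751 bases need roughly 890 million cell updates, which is a plausible reason for a pure-Python run to appear to hang.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string along the row to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Toy sequences only; the real genomes are far too long for a quick demo.
print(edit_distance("gattaca", "gatttaca"))
print(edit_distance("acgt", "aggt"))
```

Only two rows of the table are kept at any time, so memory stays small even though the running time is still quadratic.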

And here is what happens when you compare the lengths:

In [31]:
len(corona) - len(old_corona)

152

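From this, we can see that the novel coronavirus differs from the old coronavirus far more than the 4-byte gap suggested: the lengths alone differ by 152 bases, so at least 152 insertions or deletions separate the two genomes. Now that we know the difference between two DNA/RNA sequences is measured by the edit distance, we can complete the extraction of the remaining proteins.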
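
As a small synthetic illustration of why the 4-byte gap in compressed sizes computed earlier says so little, the sketch below compresses two unrelated random sequences. The sequences, their length and the seed are made up for the example and have nothing to do with the real genomes. Their compressed sizes come out close to each other even though the sequences disagree at most positions.

```python
import lzma
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two independent random sequences over the four bases (synthetic data)
seq_a = "".join(rng.choice("acgt") for _ in range(3000))
seq_b = "".join(rng.choice("acgt") for _ in range(3000))

size_a = len(lzma.compress(seq_a.encode("utf-8")))
size_b = len(lzma.compress(seq_b.encode("utf-8")))
size_gap = abs(size_a - size_b)

# Number of positions where the two equal-length sequences disagree
mismatches = sum(x != y for x, y in zip(seq_a, seq_b))

print("gap between compressed sizes:", size_gap)
print("positions that differ:", mismatches)
```

Compressed size estimates how much information each sequence holds on its own; it does not tell us how much of that information the two sequences share.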

In [32]:
# https://www.ncbi.nlm.nih.gov/gene/43740570 - Envelope protein in Cov-2
envelope_v2 = translation(corona[26244:26472], True)

In [33]:
len(envelope_v2)

75

In [34]:
# https://www.ncbi.nlm.nih.gov/gene/43740571 - Membrane Glycoprotein in Cov-2
membrane_v2 = translation(corona[26522:27191], True)

In [35]:
len(membrane_v2)

222

In [36]:
# https://www.ncbi.nlm.nih.gov/gene/43740572 - Orf6 in Cov-2
orf6_v2 = translation(corona[27201:27387], True)

In [37]:
len(orf6_v2)

61

In [38]:
# https://www.ncbi.nlm.nih.gov/gene/43740573 - orf7a in Cov-2
orf7a = translation(corona[27393:27759], True)

In [39]:
len(orf7a)

121

In [40]:
# https://www.ncbi.nlm.nih.gov/gene/43740574 - orf7b in Cov-2
orf7b = translation(corona[27755:27887], True)

In [42]:
len(orf7b)

43

In [43]:
# https://www.ncbi.nlm.nih.gov/gene/43740577 - orf8 in Cov-2
orf8 = translation(corona[27893:28259], True)

In [44]:
len(orf8)

121

In [45]:
# https://www.ncbi.nlm.nih.gov/gene/43740576 - orf10 in Cov-2
orf10 = translation(corona[29557:29674], True)

In [46]:
len(orf10)

38
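
The reported lengths can be sanity-checked from the slice bounds alone. This is a sketch under one assumption: each slice ends with a stop codon that `translation` does not count, so a slice of `n` bases should give `n / 3 - 1` amino acids. The helper name `expected_protein_length` is hypothetical; the coordinates and lengths are copied from the cells above.

```python
# name: (slice start, slice end, protein length reported above)
regions = {
    "envelope_v2": (26244, 26472, 75),
    "membrane_v2": (26522, 27191, 222),
    "orf6_v2": (27201, 27387, 61),
    "orf7a": (27393, 27759, 121),
    "orf7b": (27755, 27887, 43),
    "orf8": (27893, 28259, 121),
    "orf10": (29557, 29674, 38),
}


def expected_protein_length(start: int, end: int) -> int:
    """Amino acids expected from corona[start:end], assuming the slice
    is a whole number of codons and the last codon is a stop codon."""
    n_bases = end - start
    if n_bases % 3 != 0:
        raise ValueError("slice length is not a multiple of 3")
    return n_bases // 3 - 1  # the stop codon is not part of the protein


checks = {
    name: expected_protein_length(start, end) == reported
    for name, (start, end, reported) in regions.items()
}
print(checks)
```

A mismatch here would point to an off-by-one in the slice bounds, which is easy to introduce because the NCBI coordinates are 1-based while Python slices are 0-based.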