## Using Biopython to explore *APOL1*



[Biopython](https://biopython.org/) is a Python package for biological computation, including sequence alignment, population genetics and sequence analysis. Sequences are handled through Biopython's Seq object, which stores a string of letters such as A, T, C and G and adds biological methods such as transcription, translation and more. Biopython can also parse FASTA files, a file format commonly used in bioinformatics.


Spliceosomes are complexes of small nuclear RNAs (snRNAs) and proteins, assembled mainly from small nuclear ribonucleoproteins (snRNPs), that process heterogeneous nuclear RNA (hnRNA) into mRNA. They do this by recognizing splice sites and cutting the introns out of the transcript.
Major splice sites usually have a GU donor site and an AG acceptor site, whereas minor splice sites have AU donor and AC acceptor sites.


If either of these sites is mutated, the spliceosome can cut within an exon of the gene, changing the structure of the protein, potentially stopping it from functioning and causing disease.
With this in mind, Biopython can be used to check whether these sites are present and to count how many times they occur in a gene.
In this notebook, I explore the gene *APOL1* (apolipoprotein L1), which encodes a protein involved in the transport of cholesterol, to see how many candidate splice sites it has using Biopython.


In [2]:
#setting the working directory 
import os 
os.chdir("C:/Users/smart/Documents/Bioinformatics/Python code")

In [3]:
#importing Biopython's Seq object and SeqIO
from Bio.Seq import Seq 
from Bio import SeqIO

The code below parses the *APOL1* FASTA file and prints the FASTA header, which contains the ID NG_023228.1, the accession number assigned to the gene in NCBI's Nucleotide database.

The sequence is 21,461 bp long.
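For readers new to the format, a FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal plain-Python reader that illustrates that layout; it is not how SeqIO is implemented, and the function name `read_fasta` and the toy record are made up for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical toy record: the ID is the one quoted above, the bases are
# only the first 20 bases of the sequence printed further down.
example = StringIO(">NG_023228.1 toy excerpt\nATTACTCTCA\nTAACATAACA\n")
records = list(read_fasta(example))
for header, sequence in records:
    # The ID is the first whitespace-separated word of the header
    print(header.split()[0], len(sequence))
```

In practice SeqIO.parse is the better choice, since it also handles edge cases and wraps each record in a SeqRecord object.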


In [4]:
#parsing the APOL1 fasta file
for seq_record in SeqIO.parse("test.fa", "fasta"):
    print("fasta header: ", seq_record.id)
    print(repr(seq_record.seq))
    print("seq length", len(seq_record))

FileNotFoundError: [Errno 2] No such file or directory: 'APOL1.FASTA'

Here you can see the full sequence of the APOL1 gene

In [5]:
APOL1 = seq_record.seq
print(APOL1)

ATTACTCTCATAACATAACACAGGCTCTGCTGACGTTAGATACCCAGAGTCCAGCCACAGAGTCTTTGTCTTATGATCTTATCACCTCATGGTTTAATCAACTCCCAGACACGTGGAGAAACTTTGTATCTAGCGTAATTGCAATTGCGTTCATAATCCTTTGTTGAGGTTAGTGTTGCTGCTGGGGTGTCTGGCTGCAATGCTCTTCCACAAAACTGGCCCCAGGAAAATCACAGAAAAACAGTGTGATGAGATTGTGAGGGCCATTCAAGGCTATGCAGGGGTGTGGAGAGCATGGTGGGCATGTCCCGTCAGACAGCAATAACTGAAGCATCCCCTAAGAATAACCCTACATTCCTTGGTGAATTTTTGCTCAGTGTTCCAAGGTATGGATCCCGGGAATGGCCAATCCAGATGTTTACATCATACTTATGAAGAACTCTGTTCCTTGGATCAGAGGTTGTGCAAGGAATCAAGACCTTTTGTTTTGGGCTAGGTGGAGGTTTCCTGGCAGAGGTGCTAAGTGGAGGTTGCTCTGGGAAGAATATCATATAACCTGCATGCATTTGACAAACAGGAGGGGATTTCTTGTCTTGCCTGCTGCTCCTGGGCAACCTGTACATAAGCTCCCTGAATAAAGCTTATGTCTCACCTGCTGTCTCCAGGTCTTTTCTTCTGTCTCTCCAAGTCTGTGTCCTGTCAGCTCATTAGCATAGGGGTCCAGCATGACAAATATAAAACGGTGTTTAACATTCTTTGGATTATGTTTATATAAATGTGTTATTAATATGTGTTCCAAAATTGTACGAGATTCTATAGGTCTAATATGTCTTGGCATATGTTATCAGAAACTATTATTATTATTATGTTAAGTAGTTGTTTGCCACAGAAATAAAGTAATTTCCTTGTCGAATGTGTCTTTACCAATGCTGTTCTCATACTTTTGTTATCTGCAAAAAGTGTTTACTTTACATCTCAAAAAACAGTTTATGATCAGCTA

With the sequence read into Python, the splice sites can now be located. The major splice sites on hnRNA are GU (donor) and AG (acceptor), so the DNA equivalents would be GT and AG. However, I am treating this strand as the template strand, so I will search for the complementary sequences, CA and TC.
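The reasoning above can be written out as a small plain-Python sketch: convert the RNA motif to DNA, then complement it base by base. The helper name `rna_motif_to_template_dna` is hypothetical, and the sketch assumes, as the text does, that the motifs are complemented without being reversed. If the template strand were instead read 5' to 3', the motifs would come out reversed (AC and CT), so this assumption is worth checking against the strand orientation of the record.

```python
# Translation table that swaps each DNA base for its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rna_motif_to_template_dna(rna_motif):
    """Convert an RNA motif to DNA (U -> T), then complement it base by base."""
    coding_dna = rna_motif.replace("U", "T")
    return coding_dna.translate(COMPLEMENT)


donor = rna_motif_to_template_dna("GU")     # major donor site
acceptor = rna_motif_to_template_dna("AG")  # major acceptor site
print(donor, acceptor)
```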

In [6]:
APOL1.find("CA")

8

In [7]:
APOL1.find("TC")

5

The .find() method returns the zero-based index of the first occurrence of each sequence. The next step is to count how many CAs and TCs there are in the *APOL1* gene sequence using the .count() method.

In [8]:
APOL1.count("CA")

1577

In [9]:
APOL1.count("TC")

1318
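.find() reports only the first hit and .count() only the total, so neither says where the remaining hits are. A minimal sketch of one way to list every position is shown below on a plain string holding the first 24 bases of the sequence above. The helper `find_all` is hypothetical; the same loop should also work on a Seq object, since Seq.find accepts a start position just like str.find.

```python
def find_all(sequence, motif):
    """Return the zero-based start index of every occurrence of motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume the search one base after the previous hit
        start = sequence.find(motif, start + 1)
    return positions


# First 24 bases of the APOL1 sequence printed earlier
prefix = "ATTACTCTCATAACATAACACAGG"
ca_positions = find_all(prefix, "CA")
tc_positions = find_all(prefix, "TC")
print(ca_positions, tc_positions)
```

The first entries, 8 and 5, agree with the .find() results above. Because a dinucleotide such as CA cannot overlap with itself, the length of each list equals what .count() returns for the same string.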

Another feature of Biopython is that it can produce the complement of a given sequence. There are two methods for this:

.complement() replaces each base with its complement and keeps the order, so if the original is written 5' to 3', the result reads 3' to 5'.

.reverse_complement() complements the sequence and then reverses it, so the result is the opposite strand written 5' to 3'.

In [10]:
#The complement method produces the complement of the sequence
print(APOL1.complement())

TAATGAGAGTATTGTATTGTGTCCGAGACGACTGCAATCTATGGGTCTCAGGTCGGTGTCTCAGAAACAGAATACTAGAATAGTGGAGTACCAAATTAGTTGAGGGTCTGTGCACCTCTTTGAAACATAGATCGCATTAACGTTAACGCAAGTATTAGGAAACAACTCCAATCACAACGACGACCCCACAGACCGACGTTACGAGAAGGTGTTTTGACCGGGGTCCTTTTAGTGTCTTTTTGTCACACTACTCTAACACTCCCGGTAAGTTCCGATACGTCCCCACACCTCTCGTACCACCCGTACAGGGCAGTCTGTCGTTATTGACTTCGTAGGGGATTCTTATTGGGATGTAAGGAACCACTTAAAAACGAGTCACAAGGTTCCATACCTAGGGCCCTTACCGGTTAGGTCTACAAATGTAGTATGAATACTTCTTGAGACAAGGAACCTAGTCTCCAACACGTTCCTTAGTTCTGGAAAACAAAACCCGATCCACCTCCAAAGGACCGTCTCCACGATTCACCTCCAACGAGACCCTTCTTATAGTATATTGGACGTACGTAAACTGTTTGTCCTCCCCTAAAGAACAGAACGGACGACGAGGACCCGTTGGACATGTATTCGAGGGACTTATTTCGAATACAGAGTGGACGACAGAGGTCCAGAAAAGAAGACAGAGAGGTTCAGACACAGGACAGTCGAGTAATCGTATCCCCAGGTCGTACTGTTTATATTTTGCCACAAATTGTAAGAAACCTAATACAAATATATTTACACAATAATTATACACAAGGTTTTAACATGCTCTAAGATATCCAGATTATACAGAACCGTATACAATAGTCTTTGATAATAATAATAATACAATTCATCAACAAACGGTGTCTTTATTTCATTAAAGGAACAGCTTACACAGAAATGGTTACGACAAGAGTATGAAAACAATAGACGTTTTTCACAAATGAAATGTAGAGTTTTTTGTCAAATACTAGTCGAT

In [11]:
#The reverse_complement method complements the sequence
#and then reverses it,
#giving the opposite strand written 5' to 3'
print(APOL1.reverse_complement())

AGTGGGAGACTTTAACACCCCACTGTCAATATTAGACAGGTCAACGAGACAGAAAATTAACAAGGATGTTCAGGACTTGAACTCAGCCCTGGACTGAGTGGACCTAACAGACATCTACAGAACTCTCCGTCCCAAATCAACAGAATATACTTTCTTCTCAGCACCGTATAACACTTATTCTAAAGTCGACCACATAATTGGAAGTAAAACATTCCTCAGCAAATGCAAAAGAATGGAAATCATAACAAACAGTGTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGATTAAGAAACTCACTAAAAACCGCACAACTACATGGAAATTGAACAACGTGTTCCTGAATGATTACTGGGTAAATAACAAAATTAAGGCAGAAATCAAGAAGTTCTTTGAAACCAATGAGAACAAAGACACAATGTACCAGAACCCCTGGGACACAGCTAAAGCAGTGGATAAAGGCCTCCATGTCTCCTCCCTGGAAACAAAGATTCTGAGGACGCTCCTTGAAGGCACCTCAGAAGGTCTCATTGCCCATATGGATGGCCCATTCAAAAACTTCTTGTATCAACGTTTCCTCCTTCCCTCCTGGATCCTCAGTTCTGCTCCTAGAATCACTTCCTAGAAACCCTCTGCACCAGAACGTTGCCTCTGGCTCTTCCTTCACAGGACCCCAGGCCGCAACAGGCTACTTTAGGCAGCATCATGGCCTACATCTTTGTGCTAAGAGTAATGGGAAGCCATTGAATGGTCAGAAGTGGGTGCTCGATCAAATCTATGTTCTTAGGTGGGCACGGTGGCTCGTGTCTGTAATTGCAGCCCATTGGAATGCCAAGGCGAACAGATCACCTGAACTCAGGAGTTTGAGACCAGCCTGGGCAATGTGGCACACACGGATGGTCCCAGCTACTCAGGGGGTCAAGGTGGAAGGGTAGTTTGAGGCCGGGAGGTCCAAGATGCAGTGAGCCATGCTCATGCCACGGGACGCCAGCCTCAC
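The difference between the two methods is easier to see on a short sequence. The sketch below reproduces both operations on plain strings, using the first ten bases of the sequence above. The function names mirror the Biopython methods, but these are stand-alone illustrations and not Biopython's own code.

```python
# Translation table that swaps each DNA base for its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Complement each base, keeping the left-to-right order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement each base, then reverse the result."""
    return complement(dna)[::-1]


first_ten = "ATTACTCTCA"  # first ten bases of the APOL1 sequence above
comp = complement(first_ten)
rev_comp = reverse_complement(first_ten)
print(comp)
print(rev_comp)
```

The complement matches the first ten bases of the .complement() output above. The .reverse_complement() output starts differently because it begins with the complement of the last bases of the gene.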

Biopython can also transcribe the gene sequence into mRNA and back-transcribe an mRNA sequence into DNA, as shown in the next two cells. Note that .transcribe() simply replaces T with U, so it treats the sequence as the coding strand.

In [11]:
#Transcribes the DNA into RNA 
APOL1_mrna = APOL1.transcribe()
print(APOL1_mrna)



AUUACUCUCAUAACAUAACACAGGCUCUGCUGACGUUAGAUACCCAGAGUCCAGCCACAGAGUCUUUGUCUUAUGAUCUUAUCACCUCAUGGUUUAAUCAACUCCCAGACACGUGGAGAAACUUUGUAUCUAGCGUAAUUGCAAUUGCGUUCAUAAUCCUUUGUUGAGGUUAGUGUUGCUGCUGGGGUGUCUGGCUGCAAUGCUCUUCCACAAAACUGGCCCCAGGAAAAUCACAGAAAAACAGUGUGAUGAGAUUGUGAGGGCCAUUCAAGGCUAUGCAGGGGUGUGGAGAGCAUGGUGGGCAUGUCCCGUCAGACAGCAAUAACUGAAGCAUCCCCUAAGAAUAACCCUACAUUCCUUGGUGAAUUUUUGCUCAGUGUUCCAAGGUAUGGAUCCCGGGAAUGGCCAAUCCAGAUGUUUACAUCAUACUUAUGAAGAACUCUGUUCCUUGGAUCAGAGGUUGUGCAAGGAAUCAAGACCUUUUGUUUUGGGCUAGGUGGAGGUUUCCUGGCAGAGGUGCUAAGUGGAGGUUGCUCUGGGAAGAAUAUCAUAUAACCUGCAUGCAUUUGACAAACAGGAGGGGAUUUCUUGUCUUGCCUGCUGCUCCUGGGCAACCUGUACAUAAGCUCCCUGAAUAAAGCUUAUGUCUCACCUGCUGUCUCCAGGUCUUUUCUUCUGUCUCUCCAAGUCUGUGUCCUGUCAGCUCAUUAGCAUAGGGGUCCAGCAUGACAAAUAUAAAACGGUGUUUAACAUUCUUUGGAUUAUGUUUAUAUAAAUGUGUUAUUAAUAUGUGUUCCAAAAUUGUACGAGAUUCUAUAGGUCUAAUAUGUCUUGGCAUAUGUUAUCAGAAACUAUUAUUAUUAUUAUGUUAAGUAGUUGUUUGCCACAGAAAUAAAGUAAUUUCCUUGUCGAAUGUGUCUUUACCAAUGCUGUUCUCAUACUUUUGUUAUCUGCAAAAAGUGUUUACUUUACAUCUCAAAAAACAGUUUAUGAUCAGCUA

In [12]:
#Transcribing APOL1 RNA back to DNA
print(APOL1_mrna.back_transcribe())

ATTACTCTCATAACATAACACAGGCTCTGCTGACGTTAGATACCCAGAGTCCAGCCACAGAGTCTTTGTCTTATGATCTTATCACCTCATGGTTTAATCAACTCCCAGACACGTGGAGAAACTTTGTATCTAGCGTAATTGCAATTGCGTTCATAATCCTTTGTTGAGGTTAGTGTTGCTGCTGGGGTGTCTGGCTGCAATGCTCTTCCACAAAACTGGCCCCAGGAAAATCACAGAAAAACAGTGTGATGAGATTGTGAGGGCCATTCAAGGCTATGCAGGGGTGTGGAGAGCATGGTGGGCATGTCCCGTCAGACAGCAATAACTGAAGCATCCCCTAAGAATAACCCTACATTCCTTGGTGAATTTTTGCTCAGTGTTCCAAGGTATGGATCCCGGGAATGGCCAATCCAGATGTTTACATCATACTTATGAAGAACTCTGTTCCTTGGATCAGAGGTTGTGCAAGGAATCAAGACCTTTTGTTTTGGGCTAGGTGGAGGTTTCCTGGCAGAGGTGCTAAGTGGAGGTTGCTCTGGGAAGAATATCATATAACCTGCATGCATTTGACAAACAGGAGGGGATTTCTTGTCTTGCCTGCTGCTCCTGGGCAACCTGTACATAAGCTCCCTGAATAAAGCTTATGTCTCACCTGCTGTCTCCAGGTCTTTTCTTCTGTCTCTCCAAGTCTGTGTCCTGTCAGCTCATTAGCATAGGGGTCCAGCATGACAAATATAAAACGGTGTTTAACATTCTTTGGATTATGTTTATATAAATGTGTTATTAATATGTGTTCCAAAATTGTACGAGATTCTATAGGTCTAATATGTCTTGGCATATGTTATCAGAAACTATTATTATTATTATGTTAAGTAGTTGTTTGCCACAGAAATAAAGTAATTTCCTTGTCGAATGTGTCTTTACCAATGCTGTTCTCATACTTTTGTTATCTGCAAAAAGTGTTTACTTTACATCTCAAAAAACAGTTTATGATCAGCTA
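Comparing the two outputs above shows that transcription here amounts to swapping T for U, and back-transcription to swapping U for T. A plain-string sketch of that round trip, using the first ten bases of the sequence, looks like this (the function names are illustrative stand-ins, not Biopython's implementation):

```python
def transcribe(dna):
    """Mimic transcription of a coding-strand sequence by replacing T with U."""
    return dna.replace("T", "U")


def back_transcribe(rna):
    """Reverse the operation by replacing U with T."""
    return rna.replace("U", "T")


first_ten = "ATTACTCTCA"  # first ten bases of the APOL1 sequence above
rna = transcribe(first_ten)
round_trip = back_transcribe(rna)
print(rna)
print(round_trip)
```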

Lastly, both mRNA and DNA sequences can be translated into their protein sequence. At first I was confused about why there were asterisks in the sequences, but it turns out that each asterisk marks a stop codon.

In [16]:
#producing the protein sequence with the RNA sequence 
print(APOL1_mrna.translate())

ITLIT*HRLC*R*IPRVQPQSLCLMILSPHGLINSQTRGETLYLA*LQLRS*SFVEVSVAAGVSGCNALPQNWPQENHRKTV**DCEGHSRLCRGVESMVGMSRQTAITEASPKNNPTFLGEFLLSVPRYGSREWPIQMFTSYL*RTLFLGSEVVQGIKTFCFGLGGGFLAEVLSGGCSGKNII*PACI*QTGGDFLSCLLLLGNLYISSLNKAYVSPAVSRSFLLSLQVCVLSAH*HRGPA*QI*NGV*HSLDYVYINVLLICVPKLYEIL*V*YVLAYVIRNYYYYYVK*LFATEIK*FPCRMCLYQCCSHTFVICKKCLLYISKNSL*SATDCAIGLRKK*IFQDCNSKVDVVMKIANPISSGRRLLHGNELMENKNKFLMTFCLIYD*FFCFVFLSQENFFFPLGSM*LLTIE*SILL*AKFEAYFFLTI*FLQNLETICEYS*FMAIELFA*VQLESVFFHNRTQLETLVILPRL*LAWHISTCT*AALSNGGWLIEPMKASLKNFPQTLPIQFLYRVPVLW*VKNVTF*QAQEAQVTLGP*GERNSPNIYRHLQANINHWLSLRF*KAYSEISYEKENIVAKSISKRSPYAK*VFLLHFIQMIRPRTIRLKLLLQIN*PYYDLSLVEIQD*REKRYVSKETRECLLLDSSFLHCFLAVLFVHSLSGLHTEFFPGHKSPN*YFAFLFLSFF*LGITRNPNCAFLKVLQTEA*KLCYSEGRERPSHIIFILFCTQYLF*EKTTRK*NQRQAARRQARNQAWARLA*TH*LKINL*FRSRCYS*IPYIM*KNIVKLPALFCSPLTTCVCSPCHVPPACSNQS*PFRVKSLVL*ALKRDRNCALRELGF*GSSLPMLPAE*SPSFYYSVSERFCLQLVLLQ*NSDQTRDQRLICFYSVLSKRC*RTARRSGSRL*SQRFRRPRQADHNAKRSRPCWPRW*NAVCTKNTKISWAWWCVPVVPATWEAEAGESLEPGRWRLQ*AEITTLHSSLATEQESV

In [21]:
#producing the protein sequence with the DNA sequence 
print(APOL1.translate())

ITLIT*HRLC*R*IPRVQPQSLCLMILSPHGLINSQTRGETLYLA*LQLRS*SFVEVSVAAGVSGCNALPQNWPQENHRKTV**DCEGHSRLCRGVESMVGMSRQTAITEASPKNNPTFLGEFLLSVPRYGSREWPIQMFTSYL*RTLFLGSEVVQGIKTFCFGLGGGFLAEVLSGGCSGKNII*PACI*QTGGDFLSCLLLLGNLYISSLNKAYVSPAVSRSFLLSLQVCVLSAH*HRGPA*QI*NGV*HSLDYVYINVLLICVPKLYEIL*V*YVLAYVIRNYYYYYVK*LFATEIK*FPCRMCLYQCCSHTFVICKKCLLYISKNSL*SATDCAIGLRKK*IFQDCNSKVDVVMKIANPISSGRRLLHGNELMENKNKFLMTFCLIYD*FFCFVFLSQENFFFPLGSM*LLTIE*SILL*AKFEAYFFLTI*FLQNLETICEYS*FMAIELFA*VQLESVFFHNRTQLETLVILPRL*LAWHISTCT*AALSNGGWLIEPMKASLKNFPQTLPIQFLYRVPVLW*VKNVTF*QAQEAQVTLGP*GERNSPNIYRHLQANINHWLSLRF*KAYSEISYEKENIVAKSISKRSPYAK*VFLLHFIQMIRPRTIRLKLLLQIN*PYYDLSLVEIQD*REKRYVSKETRECLLLDSSFLHCFLAVLFVHSLSGLHTEFFPGHKSPN*YFAFLFLSFF*LGITRNPNCAFLKVLQTEA*KLCYSEGRERPSHIIFILFCTQYLF*EKTTRK*NQRQAARRQARNQAWARLA*TH*LKINL*FRSRCYS*IPYIM*KNIVKLPALFCSPLTTCVCSPCHVPPACSNQS*PFRVKSLVL*ALKRDRNCALRELGF*GSSLPMLPAE*SPSFYYSVSERFCLQLVLLQ*NSDQTRDQRLICFYSVLSKRC*RTARRSGSRL*SQRFRRPRQADHNAKRSRPCWPRW*NAVCTKNTKISWAWWCVPVVPATWEAEAGESLEPGRWRLQ*AEITTLHSSLATEQESV
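To see where the asterisks come from, the sketch below translates the first 24 bases of the sequence with a hand-built standard codon table and then splits the result at the stop codons. This is a plain-Python illustration under the assumption that the standard genetic code applies; the names `CODON_TABLE` and `translate` are defined here and are not Biopython's.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate DNA or RNA from the first base, ignoring a trailing partial codon."""
    dna = sequence.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


prefix = "ATTACTCTCATAACATAACACAGG"  # first 24 bases of the APOL1 sequence
protein = translate(prefix)
fragments = protein.split("*")  # pieces between stop codons
print(protein)
print(fragments)
```

The result matches the start of the output above. Stop codons appear this often because the whole genomic sequence, introns included, is translated from its first base instead of from the start of the coding sequence.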

## Conclusion

To conclude, the Seq object is quite versatile when it comes to exploring biological sequences. Together with SeqIO, it lets users work with FASTA files, convert sequences between DNA, RNA and protein, find short sequences and count how many times they appear in a gene, transcript or protein.