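Theory Basics:

1. FASTA and GenBank (GB) are two file formats used to store DNA and protein sequences. Both are plain-text formats: a FASTA record is a header line starting with ">" followed by the sequence, while a GenBank record also carries annotations such as features, references, and the source organism.

2. Both FASTA and GB files are widely used in computational biology. FASTA files are the usual input for sequence-only tasks such as sequence alignment and database search, while GB files are used when the annotations matter, for example when inspecting the genes and other features annotated along a genome.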
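
To make the FASTA layout concrete, here is a minimal sketch that parses FASTA text using only the standard library. The function name parse_fasta and the two sample records are made up for illustration; in practice Bio.SeqIO does this job and handles many more edge cases.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # sequence lines may be wrapped, so collect and join them
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example (not real database entries)
sample = """>seq1 example DNA
ATGGCC
ATTGTA
>seq2 example protein
MAIV
"""

records = list(parse_fasta(sample))
for header, sequence in records:
    print(header, len(sequence))
```

The header keeps everything after ">" on one line, and the sequence is the concatenation of all lines up to the next header.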

In [41]:
from Bio import Entrez

# Databases that we use:
# PMC, Nucleotide

# PMC -> PubMed Central -> a full-text archive of biomedical and life-sciences literature
# Nucleotide -> nucleotide sequence records

Entrez.email = "tintin6892@gmail.com"

# search the PMC database for "biopython" and return up to 10 results
record = Entrez.read(Entrez.esearch(db="pmc", term="biopython", retmax=10))
id_list = record["IdList"]

# NOTE: these IDs come from PMC but are passed to the nucleotide database,
# so the record that comes back is unrelated to the search term;
# searching db="nucleotide" instead would keep the two steps consistent
fetch = Entrez.efetch(db='nucleotide', id=id_list[1], rettype='fasta', retmode='text')

fasta_value = fetch.read()

with open('custom_fasta.fasta', 'w') as data:
    data.write(fasta_value)
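
Entrez.esearch and Entrez.efetch are wrappers around NCBI's E-utilities web service. As a rough sketch of what such calls correspond to, the snippet below only builds the query URLs and sends no request. The helper name eutils_url and the placeholder ID are assumptions for illustration, and Biopython attaches further parameters (such as the email address) that are left out here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(utility, **params):
    """Build the URL for one E-utility (hypothetical helper, no request is sent)."""
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(params)}"


# Search step: which database, which term, how many IDs to return
search_url = eutils_url("esearch", db="pmc", term="biopython", retmax=10)

# Fetch step: the ID is a placeholder, not a real accession
fetch_url = eutils_url(
    "efetch", db="nucleotide", id="PLACEHOLDER_ID", rettype="fasta", retmode="text"
)

print(search_url)
print(fetch_url)
```

Seeing the two URLs side by side makes it clear that the db parameter is set independently for each step, so the IDs returned by a search only make sense in the database that was searched.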


In [42]:
# Read From Local Directory

from Bio import SeqIO 
from Bio.Seq import Seq

records = SeqIO.parse('./custom_fasta.fasta', 'fasta')
print(type(records)) # records is an iterator

for record in records:
    print(record)

<class 'Bio.SeqIO.FastaIO.FastaIterator'>
ID: BE803957.1
Name: BE803957.1
Description: BE803957.1 sr81c10.y1 Gm-c1052 Glycine max cDNA clone GENOME SYSTEMS CLONE ID: Gm-c1052-2155 5' similar to TR:P93169 P93169 EARLY LIGHT-INDUCED PROTEIN, mRNA sequence
Number of features: 0
Seq('GGTGAGATACACAGTAGTTAGGTAGCAAGCATAAGTGTCTAACCTTAGAAACAT...GGT')


In [43]:
# Read a single record

from Bio.SeqUtils import seq3

record = SeqIO.read("./custom_fasta.fasta", 'fasta')

# translate the DNA directly (reading frame 1); '*' marks a stop codon
# and the letter T in the result stands for threonine
protein = record.seq.translate()
protein_name = seq3(protein)            # three-letter amino acid codes
protein_sequence = protein.split('*')   # fragments between stop codons

print(f'Protein : {protein}')
print(f'Protein Name : {protein_name}')
print(f'Protein Sequence : {protein_sequence}')

Protein : GEIHSS*VASISV*P*KHICYISCAFWNCATKILFIYGCLILCYAINPCKPYDPHFQRV*GEPFWCSCLSHEKECWPES*VHG*GRATKSACDPSYTTTSRTQTTASLYSFTKGEHQVF*CVGVQWAST*EDQWKVGHDCVCGCNGSGSSHRATCV*TNIQWWYPVVPGNKCGSYLASLIHSRKDHCGV*SKG
Protein Name : GlyGluIleHisSerSerTerValAlaSerIleSerValTerProTerLysHisIleCysTyrIleSerCysAlaPheTrpAsnCysAlaThrLysIleLeuPheIleTyrGlyCysLeuIleLeuCysTyrAlaIleAsnProCysLysProTyrAspProHisPheGlnArgValTerGlyGluProPheTrpCysSerCysLeuSerHisGluLysGluCysTrpProGluSerTerValHisGlyTerGlyArgAlaThrLysSerAlaCysAspProSerTyrThrThrThrSerArgThrGlnThrThrAlaSerLeuTyrSerPheThrLysGlyGluHisGlnValPheTerCysValGlyValGlnTrpAlaSerThrTerGluAspGlnTrpLysValGlyHisAspCysValCysGlyCysAsnGlySerGlySerSerHisArgAlaThrCysValTerThrAsnIleGlnTrpTrpTyrProValValProGlyAsnLysCysGlySerTyrLeuAlaSerLeuIleHisSerArgLysAspHisCysGlyValTerSerLysGly
Protein Sequence : [Seq('GEIHSS'), Seq('VASISV'), Seq('P'), Seq('KHICYISCAFWNCATKILFIYGCLILCYAINPCKPYDPHFQRV'), Seq('GEPFWCSCLSHEKECWPES'), Seq('VHG'), Seq('GRATKSACDPSYTTTSRTQTTASLYSFTKGEHQVF'), Seq('

In [44]:
# Show the fragments as a table with a pandas DataFrame

# DataFrame -> pandas' tabular data structure (can be built from a dictionary of columns)
import pandas as pd

df_protein = pd.DataFrame({"Protein" : protein_sequence})
df_protein['Sequence Length'] = df_protein["Protein"].str.len()

# Show the first rows
# df_protein.head()

# Show the 10 longest fragments
df_protein.nlargest(10, "Sequence Length")

    Protein                                             Sequence Length
3   (K, H, I, C, Y, I, S, C, A, F, W, N, C, A, T, ...   43
6   (G, R, A, T, K, S, A, C, D, P, S, Y, T, T, T, ...   35
9   (T, N, I, Q, W, W, Y, P, V, V, P, G, N, K, C, ...   32
8   (E, D, Q, W, K, V, G, H, D, C, V, C, G, C, N, ...   26
4   (G, E, P, F, W, C, S, C, L, S, H, E, K, E, C, ...   19
7   (C, V, G, V, Q, W, A, S, T)                         9
0   (G, E, I, H, S, S)                                  6
1   (V, A, S, I, S, V)                                  6
5   (V, H, G)                                           3
10  (S, K, G)                                           3
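
The translation and stop-codon splitting used above can be sketched without Biopython to show what happens at each step. This is an illustration under stated assumptions rather than a replacement for Seq.translate: it hard-codes the standard genetic code, reads frame 1 only, and ignores ambiguous bases. The DNA string is the first 54 bases of the record printed earlier; the names CODON_TABLE and translate are introduced here.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in reading frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First 54 bases of the record shown above
dna = "GGTGAGATACACAGTAGTTAGGTAGCAAGCATAAGTGTCTAACCTTAGAAACAT"

protein = translate(dna)
fragments = protein.split("*")  # pieces between stop codons
longest_first = sorted(fragments, key=len, reverse=True)

print(protein)
print(longest_first)
print(translate("ACT"))  # a single threonine codon
```

In a protein string the letter T is the one-letter code for threonine, not a DNA base, which is why the nucleotide sequence is translated as it is and the result is not transcribed afterwards.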


In [55]:
# PDB (Protein Data Bank)

# The PDB format is a standard text format for storing the three-dimensional structures of proteins and other biological molecules.

import warnings
from Bio.PDB.PDBParser import PDBParser

# ignore parser warnings; the filter has to be set before parsing to take effect
warnings.simplefilter('ignore')

parser = PDBParser()
structure = parser.get_structure('7XQK', './7xqk.pdb1')

# hierarchy: structure -> model -> chain -> residue -> atom
model = structure[0]
for chain in model:
    print(f'Chain : {chain}')
    for residue in chain:
        print(f'Residue : {residue}')
        for atom in residue:
            print(f'Atom : {atom}')

Chain : <Chain id=A>
Residue : <Residue MET het=  resseq=3 icode= >
Atom : <Atom N>
Atom : <Atom CA>
Atom : <Atom C>
Atom : <Atom O>
Atom : <Atom CB>
Atom : <Atom CG>
Atom : <Atom SD>
Atom : <Atom CE>
Residue : <Residue PHE het=  resseq=4 icode= >
Atom : <Atom N>
Atom : <Atom CA>
Atom : <Atom C>
Atom : <Atom O>
Atom : <Atom CB>
Atom : <Atom CG>
Atom : <Atom CD1>
Atom : <Atom CD2>
Atom : <Atom CE1>
Atom : <Atom CE2>
Atom : <Atom CZ>
Residue : <Residue GLN het=  resseq=5 icode= >
Atom : <Atom N>
Atom : <Atom CA>
Atom : <Atom C>
Atom : <Atom O>
Atom : <Atom CB>
Atom : <Atom CG>
Atom : <Atom CD>
Atom : <Atom OE1>
Atom : <Atom NE2>
Residue : <Residue LYS het=  resseq=6 icode= >
Atom : <Atom N>
Atom : <Atom CA>
Atom : <Atom C>
Atom : <Atom O>
Atom : <Atom CB>
Atom : <Atom CG>
Atom : <Atom CD>
Atom : <Atom CE>
Atom : <Atom NZ>
Residue : <Residue VAL het=  resseq=7 icode= >
Atom : <Atom N>
Atom : <Atom CA>
Atom : <Atom C>
Atom : <Atom O>
Atom : <Atom CB>
Atom : <Atom CG1>
Atom : <Atom CG2>
Res
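
The chain, residue, and atom hierarchy printed above is rebuilt by the parser from flat ATOM lines with fixed column positions. The sketch below writes a few such lines and reads them back by column slices. The helpers format_atom and parse_atom are hypothetical, the atom names are those of MET 3 from the output above, and the coordinates are placeholders rather than values from 7XQK. Atom names are left-aligned for simplicity, whereas real files align them by element.

```python
from collections import defaultdict


def format_atom(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Write one ATOM line using the fixed column widths of the PDB format."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4}{'':1}{res_name:>3} {chain:1}{res_seq:>4}{'':4}"
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


def parse_atom(line):
    """Read the fields back by slicing on column positions (0-based slices)."""
    return {
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "coord": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),   # columns 77-78
    }


# Atom names of MET 3 as printed above; coordinates are placeholders, not real data
met_atoms = ["N", "CA", "C", "O", "CB", "CG", "SD", "CE"]
lines = [
    format_atom(i + 1, name, "MET", "A", 3, float(i), 0.0, 0.0, name[0])
    for i, name in enumerate(met_atoms)
]

# Rebuild the chain -> residue -> atom hierarchy from the flat lines
hierarchy = defaultdict(lambda: defaultdict(list))
for line in lines:
    if line.startswith("ATOM"):
        atom = parse_atom(line)
        residue_key = (atom["res_name"], atom["res_seq"])
        hierarchy[atom["chain"]][residue_key].append(atom["name"])

print(lines[0])
print(dict(hierarchy["A"]))
```

Bio.PDB handles much more than this (HETATM records, alternate locations, multiple models), so the sketch is only meant to show where the printed names and numbers come from.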

In [54]:
# 3D protein visualization (the 'pdb:7xqk' query downloads the entry, so this cell needs network access)

import py3Dmol


view = py3Dmol.view('pdb:7xqk')
view.setStyle({"cartoon": {'color': "spectrum"}})
view.show()