## Protein Sequence Analysis

The Biopython suite provides tools for protein sequence analysis. This section demonstrates the `ProteinAnalysis` class from the `Bio.SeqUtils.ProtParam` module. Given an amino acid sequence, it can count how many times each amino acid occurs, compute the molecular weight of the protein, and estimate several other composition-based properties.

The amino acid sequence of epidermal growth factor receptor (EGFR) isoform X1, found in *Homo sapiens*, will be used for this demonstration.

In [1]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


In [2]:
import numpy as np

In [3]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis
egfr_seq = """
MFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVE
                     RIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPA
                     LCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKI
                     ICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYN
                     PTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKC
                     EGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPL
                     DPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNI
                     TSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATG
                     QVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPE
                     CLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHL
                     CHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKR
                     TLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEG
                     EKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLM
                     PFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHV
                     KITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMT
                     FGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIE
                     FSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQGFFS
                     SPSTSRTPLLSSLSATSNNSTVACIDRNGLQSCPIKEDSFLQRYSSDPTGALTEDSID
                     DTFLPVPEYINQSVPKRPAGSVQNPVYHNQPLNPAPSRDPHYQDPHSTAVGNPEYLNT
                     VQPTCVNSTFDSPAHWAQKGSHQISLDNPDYQQDFFPKEAKPNGIFKGSTAENAEYLR
                     VAPQSSEFIGA
                     """
egfr_seq = ''.join(egfr_seq.split())

In [4]:
egfr_analysed = ProteinAnalysis(egfr_seq)
egfr_analysed.molecular_weight()  # Molecular weight of the protein, in daltons

128678.59110000057

In [5]:
# The .gravy() function is used to calculate the protein's GRAVY value.
# GRAVY stands for Grand Average of Hydropathicity.
# It is calculated as the sum of the hydropathy values of all amino acids, divided by the number of residues in the sequence.
# Hydropathicity is a measure of the hydrophobic or hydrophilic nature of amino acids in a protein.
# The hydropathicity of a protein segment can be calculated using the Kyte-Doolittle scale.
# Amino acids with non-polar side chains tend to be hydrophobic, while those with polar or charged side chains tend to be hydrophilic.

egfr_analysed.gravy() # A negative value indicates that the protein is hydrophilic. A positive score indicates a hydrophobic protein.

-0.3321521175453757

The GRAVY score for EGFR isoform X1 is negative, which indicates that the protein is hydrophilic overall. This is plausible for a transmembrane receptor: only a short segment is embedded in the lipid membrane, while the much larger extracellular and intracellular domains are exposed to aqueous environments.
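To make the definition of GRAVY concrete, the sketch below recomputes it without Biopython. It assumes the standard Kyte-Doolittle hydropathy values (typed in by hand here, so worth checking against a reference table) and uses the residue counts reported by `count_amino_acids()` later in this section. The names `KD` and `aa_counts` are illustrative only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
# (assumed standard values, entered by hand for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residue counts copied from the count_amino_acids() output in this section.
aa_counts = {
    "W": 13, "M": 24, "H": 30, "F": 34, "Y": 36, "Q": 46, "R": 57,
    "C": 58, "D": 60, "T": 60, "K": 63, "A": 64, "N": 65, "I": 69,
    "V": 69, "P": 73, "E": 74, "S": 80, "G": 81, "L": 101,
}

# GRAVY = (sum of hydropathy values over all residues) / (number of residues).
length = sum(aa_counts.values())
gravy = sum(KD[aa] * n for aa, n in aa_counts.items()) / length

print(length, round(gravy, 4))
```

The result should agree with the value returned by `.gravy()` to within floating-point rounding, since both are the same weighted average taken over the same composition.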

In [6]:
aa_counts = egfr_analysed.count_amino_acids()  # To count the number of times each amino acid is repeated in the sequence.
keys = list(aa_counts.keys())
values = list(aa_counts.values())
sorted_value_index = np.argsort(values)

sorted_aa_counts = {keys[i] : values[i] for i in sorted_value_index}
print(sorted_aa_counts)

{'W': 13, 'M': 24, 'H': 30, 'F': 34, 'Y': 36, 'Q': 46, 'R': 57, 'C': 58, 'D': 60, 'T': 60, 'K': 63, 'A': 64, 'N': 65, 'I': 69, 'V': 69, 'P': 73, 'E': 74, 'S': 80, 'G': 81, 'L': 101}


From the output above, the most common amino acid in this sequence is leucine (101 residues), an uncharged, non-polar amino acid. Some of these residues may lie in the segment of the receptor that spans the plasma membrane, since the lipid interior of the membrane is hydrophobic as well. Leucine is also among the most abundant residues in proteins generally, so its high count is weak evidence on its own.
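One hedged way to probe the membrane-spanning hypothesis is a sliding-window hydropathy scan. The sketch below is pure Python and makes several assumptions: the Kyte-Doolittle values are entered by hand, the 19-residue window and the 1.6 cutoff are the values commonly quoted for this scale, and `segment` is a short excerpt copied from the EGFR sequence above rather than the full protein. A strongly hydrophobic window is only suggestive of a membrane-spanning stretch, not proof.

```python
# Kyte-Doolittle hydropathy values (assumed standard values, entered by hand).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Excerpt copied from the EGFR sequence used in this section.
segment = "PKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKR"
WINDOW = 19      # window length commonly used for this kind of scan (assumption)
THRESHOLD = 1.6  # commonly quoted cutoff for a candidate membrane-spanning window

def window_means(seq, window):
    """Mean hydropathy of every full-length window along seq."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

means = window_means(segment, WINDOW)
best_start = max(range(len(means)), key=means.__getitem__)
best_window = segment[best_start:best_start + WINDOW]

print(best_window, round(means[best_start], 2))
print("leucines in best window:", best_window.count("L"))
print("above threshold:", means[best_start] > THRESHOLD)
```

Running the same scan over the full `egfr_seq` would show whether this stretch stands out from the rest of the protein. Biopython also exposes a `protein_scale` method on `ProteinAnalysis` for this purpose.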

In [7]:
egfr_analysed.get_amino_acids_percent() # To compute the total fraction of the sequence that a particular amino acid comprises.

{'A': 0.055315471045808126,
 'C': 0.05012964563526361,
 'D': 0.05185825410544512,
 'E': 0.06395851339671564,
 'F': 0.029386343993085567,
 'G': 0.07000864304235091,
 'H': 0.02592912705272256,
 'I': 0.059636992221261884,
 'K': 0.05445116681071737,
 'L': 0.08729472774416594,
 'M': 0.020743301642178046,
 'N': 0.056179775280898875,
 'P': 0.06309420916162489,
 'Q': 0.03975799481417459,
 'R': 0.049265341400172864,
 'S': 0.06914433880726016,
 'T': 0.05185825410544512,
 'V': 0.059636992221261884,
 'W': 0.011235955056179775,
 'Y': 0.03111495246326707}

In [8]:
egfr_analysed.length # The total number of amino acids in the sequence.

1157

In [9]:
#The aromaticity of a protein is the relative frequency of aromatic amino acids in the protein sequence. These are Phe, Trp and Tyr.

egfr_analysed.aromaticity()

0.07173725151253241

Aromatic residues (Phe, Trp and Tyr) make up about 7.2% of EGFR isoform X1, that is, 83 of its 1157 residues.
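The aromaticity value can be checked by hand from the residue counts reported earlier. The sketch below assumes only the definition given in the comment above (the relative frequency of Phe, Trp and Tyr); the variable names are illustrative.

```python
# Counts of the three aromatic residues, copied from the
# count_amino_acids() output in this section.
aromatic_counts = {"F": 34, "W": 13, "Y": 36}

# Sequence length, as reported by egfr_analysed.length.
length = 1157

# Aromaticity = (Phe + Trp + Tyr) / total number of residues.
n_aromatic = sum(aromatic_counts.values())
aromaticity = n_aromatic / length

print(n_aromatic, round(aromaticity, 4))
```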

In [13]:
# This function returns the fraction of amino acids that tend to be in
# alpha-helices, beta-turns or beta-sheets, in that order.

egfr_analysed.secondary_structure_fraction()

(0.2817631806395851, 0.31028522039758, 0.3301642178046672)

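The largest of the three fractions is the one for beta-sheets, although all three are close (about 0.28, 0.31 and 0.33). These values reflect the proportion of residues that tend to occur in each structure type, based on composition alone. They are not a prediction of the actual fold, so they offer only a rough hint about the structure of this protein.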
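The sketch below shows how such fractions can arise from composition alone. The residue groupings are an assumption: they were chosen because they reproduce the tuple above from the reported counts, and the groupings used by Biopython may differ between releases. The names `HELIX`, `TURN` and `SHEET` are illustrative.

```python
# Residue counts copied from the count_amino_acids() output in this section.
aa_counts = {
    "W": 13, "M": 24, "H": 30, "F": 34, "Y": 36, "Q": 46, "R": 57,
    "C": 58, "D": 60, "T": 60, "K": 63, "A": 64, "N": 65, "I": 69,
    "V": 69, "P": 73, "E": 74, "S": 80, "G": 81, "L": 101,
}

# Assumed residue groupings (not taken from the original notebook).
HELIX = "EMALK"
TURN = "NPGSD"
SHEET = "VIYFWLT"

length = sum(aa_counts.values())

def fraction(residues):
    """Fraction of the sequence made up of the given residue types."""
    return sum(aa_counts[aa] for aa in residues) / length

helix, turn, sheet = fraction(HELIX), fraction(TURN), fraction(SHEET)
print(round(helix, 4), round(turn, 4), round(sheet, 4))

# The three fractions need not sum to 1: under these groupings leucine is
# counted in two groups, and C, H, Q and R are counted in none.
print(round(helix + turn + sheet, 4))
```

Under these assumed groupings, leucine contributes to both the helix and the sheet fractions, which is one reason the ranking of the three values should be interpreted cautiously.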

The amino acid sequence was obtained from the [NCBI database](https://www.ncbi.nlm.nih.gov/nuccore/XM_054357418.1).

The documentation for the packages used was obtained from the [Biopython website](https://biopython.org/wiki/ProtParam).