# Biopython SeqUtils Cookbook

This cookbook explains how to use the Biopython module `SeqUtils`.

This module is a package of assorted helper functions. The official documentation describes it as *"Miscellaneous functions for dealing with sequences."*

![Biopython](images/biopython.png)

## Installation

`SeqUtils` is part of Biopython, so you need to install the full `biopython` package with the following command:



In [73]:
pip install biopython


The following command must be run outside of the IPython shell:

    $ pip install biopython

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more information on how to install packages:

    https://docs.python.org/3/installing/



## Use of the functions

The package is divided into several submodules:

- SeqUtils
- CheckSum
- CodonUsage
- ProtParam
- LCC

The following sections give examples for most of these submodules.

### SeqUtils

To read a FASTA file, we need to import the module `SeqIO` and use its function `read`, which expects a file containing exactly one record.

In [2]:
import Bio.SeqIO as bsio

seq_adn_tyclv = bsio.read("data/TYCLVgenome.txt", "fasta").seq
seq_adn_llactis = bsio.read("data/LLactis.txt", "fasta").seq
seq_mrna = bsio.read("data/mRNA.fasta", "fasta").seq
seq_aa = bsio.read("data/aa.fasta", "fasta").seq
seq_adn_elegans = bsio.read("data/CElegans_chrX.fasta", "fasta").seq
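
To make clear what `read` expects, here is a minimal pure-Python sketch of a single-record FASTA reader. It is not Biopython's implementation: the function name `read_single_fasta` and the in-memory sample text are hypothetical, and the sketch ignores details such as comments or alternative formats. Like `SeqIO.read`, it raises `ValueError` when the input does not contain exactly one record.

```python
import io


def read_single_fasta(handle):
    """Return (header, sequence) from a FASTA handle holding exactly one record."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                raise ValueError("More than one record found in handle")
            header = line[1:]
        elif header is None:
            raise ValueError("Sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("No records found in handle")
    return header, "".join(chunks)


# Hypothetical in-memory FASTA text standing in for a file on disk.
example = io.StringIO(">demo sequence\nACGTAC\nGTTA\n")
header, sequence = read_single_fasta(example)
print(header, sequence)
```

For files with several records, Biopython offers `SeqIO.parse`, which iterates over the records instead of returning a single one.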

Now we can apply different functions to these sequences, for example `GC`, `nt_search`, and `six_frame_translations`.

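The function `GC` computes the percentage of G and C nucleotides in a sequence. (In recent Biopython releases `GC` has been superseded by `gc_fraction`, which returns a fraction between 0 and 1 instead of a percentage.)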

In [4]:
import Bio.SeqUtils as bsu

tyclv_gc = bsu.GC(seq_adn_tyclv)
print("GC % of TYCLV: " + str(tyclv_gc))
llactis_gc = bsu.GC(seq_adn_llactis)
print("GC % of LLactis: " + str(llactis_gc))
ce_gc = bsu.GC(seq_adn_elegans)
print("GC % of CElegans chr X: " + str(ce_gc))

GC % of TYCLV: 40.95649047105358
GC % of LLactis: 35.22896903161049
GC % of CElegans chr X: 35.20328696826255
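
The calculation behind a GC percentage can be sketched in a few lines of plain Python. This is a simplified illustration, not Biopython's code: the name `gc_percent` is hypothetical, and the sketch counts only the letters G and C, ignoring any ambiguity codes.

```python
def gc_percent(seq):
    """Return the percentage of G and C letters in seq (0.0 for an empty sequence)."""
    seq = str(seq).upper()
    if not seq:
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# Rebuild a sequence with the base counts reported further down in the
# six_frame_translations output (a:51 t:58 g:53 c:48). The order of the
# letters does not matter for GC content.
composition = "a" * 51 + "t" * 58 + "g" * 53 + "c" * 48
print(round(gc_percent(composition), 2))  # prints 48.1
```

The result agrees with the `48.10 %GC` shown in that output, since (53 + 48) / 210 is roughly 0.481.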


The function `six_frame_translations` translates a nucleotide sequence in all six reading frames (three on each strand) and returns the result as a formatted string.

In [76]:
import Bio.SeqUtils as bsu

translation = bsu.six_frame_translations(seq_mrna)
print("The translations produce: " + translation[0:550])

The translations produce: GC_Frame: a:51 t:58 g:53 c:48 
Sequence: aggactactt ... aggcgttgcc, 210 nt, 48.10 %GC


1/1
  D  Y  L  H  D  Q  S  Q  D  G  R  Y  V  L  L  E  F  K  V  S
 G  L  L  A  *  P  K  P  R  W  Q  V  R  P  F  R  I  *  G  *
R  T  T  C  M  T  K  A  K  M  A  G  T  S  F  *  N  L  R  L
aggactacttgcatgaccaaagccaagatggcaggtacgtccttttagaatttaaggtta   41 %
tcctgatgaacgtactggtttcggttctaccgtccatgcaggaaaatcttaaattccaat
V  V  Q  M  V  L  A  L  I  A  P  V  D  K  *  F  K  L  N  T 
 P  S  S  A  H  G  F  G  L  H  C  T  R  G  K  L  I  *  P  *
  S  *  K  C  S  W  L  W  S  
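
To show where the six frames come from, the sketch below splits a sequence into codons for each frame without translating them. It is an illustration under stated assumptions, not Biopython's implementation: the names `reverse_complement`, `six_frames`, and `codons` are hypothetical, only unambiguous DNA letters are handled, and the input is the first 18 nucleotides of the sequence shown in the output above.

```python
# Complement table for unambiguous DNA letters (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame numbers (1, 2, 3, -1, -2, -3) to whole-codon substrings."""
    frames = {}
    reverse = reverse_complement(seq)
    for offset in range(3):
        for sign, strand in ((1, seq), (-1, reverse)):
            sub = strand[offset:]
            sub = sub[: len(sub) - len(sub) % 3]  # drop an incomplete final codon
            frames[sign * (offset + 1)] = sub
    return frames


def codons(frame_seq):
    """Split a whole-codon string into a list of triplets."""
    return [frame_seq[i:i + 3] for i in range(0, len(frame_seq), 3)]


frames = six_frames("aggactacttgcatgacc")
for number in (1, 2, 3, -1, -2, -3):
    print(number, codons(frames[number]))
```

In frame 1 the codons are `agg act act tgc atg acc`, which the standard genetic code reads as R, T, T, C, M, T, matching the start of the third protein row printed above the sequence.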


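Another important function is `nt_search`, which looks for a specific subsequence in a sequence. It returns a list whose first element is the search pattern, followed by the zero-based start position of each match.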

In [77]:
print(bsu.nt_search(str(seq_adn_llactis), "TTTTACACATTA"))

['TTTTACACATTA', 16]
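
The return format is easier to understand with a small re-implementation. The sketch below is a simplified stand-in, not Biopython's code: the name `search_sketch` and the reduced table of IUPAC ambiguity codes are assumptions made for illustration, and the sketch upper-cases both inputs for convenience.

```python
import re

# Reduced table of IUPAC nucleotide codes (an assumption for this sketch).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}


def search_sketch(seq, subseq):
    """Return [pattern, pos1, pos2, ...] with zero-based match positions."""
    pattern = ""
    for letter in subseq.upper():
        bases = IUPAC[letter]
        # Ambiguous letters become a regular-expression character class.
        pattern += bases if len(bases) == 1 else "[" + bases + "]"
    # A lookahead lets overlapping matches be reported as well.
    matches = re.finditer("(?=" + pattern + ")", seq.upper())
    return [pattern] + [m.start() for m in matches]


print(search_sketch("ACGTACGT", "ACG"))  # ['ACG', 0, 4]
print(search_sketch("ACAT", "AN"))       # ['A[ACGT]', 0, 2]
```

Read this way, the output `['TTTTACACATTA', 16]` above means that the pattern occurs once, starting at index 16 of the sequence.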


There are also some functions meant only for proteins, such as `seq3` and `seq1`, which convert between one-letter and three-letter amino acid codes.

In [78]:
seq_aa_3_letters = bsu.seq3(seq_aa)
seq_aa_1_letters = bsu.seq1(seq_aa_3_letters)
print("Three letter: " + seq_aa_3_letters[0:50])
print("One letter: " + seq_aa_1_letters[0:40])


Three letter: MetAspAspAspIleAlaAlaLeuValValAspAsnGlySerGlyMetCy
One letter: MDDDIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGRPRH
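
The conversion itself is a table lookup, as the following sketch shows. It is a simplified illustration and not Biopython's implementation: the names `to_three` and `to_one` are hypothetical, only the 20 standard amino acids are covered, and unknown letters raise a `KeyError` instead of being mapped to a placeholder.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three(seq):
    """Convert a one-letter protein string to concatenated three-letter codes."""
    return "".join(ONE_TO_THREE[aa] for aa in seq)


def to_one(seq3):
    """Convert concatenated three-letter codes back to one-letter codes."""
    return "".join(THREE_TO_ONE[seq3[i:i + 3]] for i in range(0, len(seq3), 3))


three = to_three("MDDDIAALVV")  # first ten residues of the output above
print(three)
print(to_one(three))
```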


### CheckSum

The next submodule, `CheckSum`, provides functions that compute checksum values for a sequence, such as `crc32` and `seguid`.

In [5]:
# Use a separate alias so that `bsu` keeps pointing at Bio.SeqUtils.
import Bio.SeqUtils.CheckSum as bsc

print("CRC32 for LLactis: " + str(bsc.crc32(seq_adn_llactis)))
print("CRC32 for CElegans: " + str(bsc.crc32(seq_adn_elegans)))
print("Seguid for LLactis: " + str(bsc.seguid(seq_adn_llactis)))
print("Seguid for CElegans: " + str(bsc.seguid(seq_adn_elegans)))


CRC32 for LLactis: 334704398
CRC32 for CElegans: 3075742384
Seguid for LLactis: OIxZqNTVQ1YI9sL0A04G6n7oQGA
Seguid for CElegans: GPWMB3w97ix0bh5mhwoF+VGYKIU
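
Both checksums can be approximated with the Python standard library. The sketch below assumes that a SEGUID is the SHA-1 digest of the upper-case sequence, Base64-encoded with the trailing padding removed, and that the CRC32 is the standard one from `zlib`. The function names are hypothetical, and the results are not guaranteed to match Biopython's for every input type.

```python
import base64
import hashlib
import zlib


def crc32_sketch(seq):
    """Standard CRC32 of the sequence text, as an unsigned integer."""
    return zlib.crc32(str(seq).encode("ascii"))


def seguid_sketch(seq):
    """SHA-1 of the upper-case sequence, Base64-encoded without '=' padding."""
    digest = hashlib.sha1(str(seq).upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


print(crc32_sketch("ACGT"))
print(seguid_sketch("ACGT"))
```

A SHA-1 digest is 20 bytes long, so the unpadded Base64 text always has 27 characters, which is the length of the two SEGUID values printed above.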


### ProtParam

This submodule works only with protein sequences. Its `ProteinAnalysis` class provides the basic analyses for this type of biological sequence, for example:

- `count_amino_acids`
- `molecular_weight` (the module-level function `Bio.SeqUtils.molecular_weight` also handles nucleotide sequences)
- `aromaticity`
- `flexibility`
- `gravy`
- `isoelectric_point`
- `secondary_structure_fraction`

The following script shows several of these functions in use:


In [80]:
from Bio.SeqUtils import ProtParam as pp

prot = pp.ProteinAnalysis(str(seq_aa))

print("AA count:")
print(prot.count_amino_acids())
print("Molecular weight: " + str(prot.molecular_weight()))
print("Aromaticity: " + str(prot.aromaticity()))
print("Secondary structure (helix, turn, sheet):")
print(prot.secondary_structure_fraction())
print("Isoelectric point: " + str(prot.isoelectric_point()))

AA count:
{'A': 29, 'C': 6, 'D': 23, 'E': 26, 'F': 13, 'G': 28, 'H': 9, 'I': 28, 'K': 19, 'L': 27, 'M': 17, 'N': 9, 'P': 19, 'Q': 12, 'R': 18, 'S': 25, 'T': 26, 'V': 22, 'W': 4, 'Y': 15}
Molecular weight: 41736.287900000054
Aromaticity: 0.08533333333333333
Secondary structure (helix, turn, sheet):
(0.2906666666666667, 0.21599999999999997, 0.264)
Isoelectric point: 5.29046630859375
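
Several of these values follow directly from the amino acid counts. The sketch below recomputes them from the dictionary printed above. Aromaticity is the relative frequency of Phe, Trp, and Tyr. The residue groups used for helix, turn, and sheet are an assumption: they were chosen because they reproduce the fractions printed above, and other Biopython versions may group residues differently. The helper name `fraction` is hypothetical.

```python
# Amino acid counts copied from the output above.
counts = {
    "A": 29, "C": 6, "D": 23, "E": 26, "F": 13, "G": 28, "H": 9,
    "I": 28, "K": 19, "L": 27, "M": 17, "N": 9, "P": 19, "Q": 12,
    "R": 18, "S": 25, "T": 26, "V": 22, "W": 4, "Y": 15,
}
total = sum(counts.values())  # protein length


def fraction(residues):
    """Relative frequency of the given residues in the protein."""
    return sum(counts[r] for r in residues) / total


aromaticity = fraction("FWY")
helix = fraction("VIYFWL")  # assumed residue group
turn = fraction("NPGS")     # assumed residue group
sheet = fraction("EMAL")    # assumed residue group

print(total)
print(aromaticity)
print((helix, turn, sheet))
```

The protein has 375 residues, 32 of which are aromatic, giving 32 / 375, or about 0.0853, in agreement with the printed aromaticity.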


### LCC

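The last submodule, `lcc`, computes local composition complexity (LCC). Its functions are documented for unambiguous DNA sequences, so the value printed below for the amino acid sequence should be interpreted with caution.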

Local compositional complexity is a numerical measure of repetitiveness of sequences of symbols from a finite alphabet. Highly repetitive sequences are considered simple, whereas highly nonrepetitive sequences are considered complex. 

In [7]:
from Bio.SeqUtils.lcc import lcc_simp

print("Example of sequence of aa: " + str(lcc_simp(seq_aa)))
print("Example of sequence of dna: " + str(lcc_simp(seq_adn_elegans)))

Example of sequence of aa: 0.9274876521248023
Example of sequence of dna: 1.9358698575432178
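
The idea of composition complexity can be illustrated with a Shannon entropy over the four DNA letters. This is a sketch under stated assumptions and not necessarily the formula or normalization used by `lcc_simp`: the name `composition_entropy` is hypothetical, and the choice of a base-2 logarithm is an assumption. With that choice the value ranges from 0 for a sequence made of a single letter to 2 for a sequence in which all four letters are equally frequent.

```python
import math


def composition_entropy(seq, alphabet="ACGT"):
    """Shannon entropy (in bits) of the letter composition of seq."""
    seq = str(seq).upper()
    length = len(seq)
    entropy = 0.0
    for symbol in alphabet:
        count = seq.count(symbol)
        if count:  # skip absent letters to avoid log(0)
            p = count / length
            entropy -= p * math.log2(p)
    return entropy


print(composition_entropy("AAAAAAAA"))  # repetitive: 0.0
print(composition_entropy("AACCAACC"))  # two letters, equally frequent: 1.0
print(composition_entropy("ACGTACGT"))  # all four letters, equally frequent: 2.0
```

Because the measure depends only on letter frequencies, it treats `ACGTACGT` as maximally complex even though the sequence is a short repeat. Applying it to short sliding windows is one way to make it sensitive to local repetitiveness.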


![Python](images/python.png)