# BioPython

BioPython is another popular and useful library, full of methods that will appeal to anyone getting started in genomics or bioinformatics!

I am not going to exhaustively list all of the features of BioPython. Instead, I am going to pull out a few items from the official BioPython tutorial (http://biopython.org/DIST/docs/tutorial/Tutorial.html) that focus on setting up a Seq object. Seq objects are the fundamental data type of the BioPython package, and appreciating their syntax and flexibility will allow you to follow the many online tutorials that apply BioPython Seq objects to bioinformatics problems.

There are a few popular ways to use BioPython:
1. Sequence handling and parsing from major databases
    * BLAST, NCBI (Entrez), GenBank, FASTA files
2. Alignment, phylogenetics and basic population genetics (GenePop)
    * alignment and phylogenetics (i.e. a pipeline that feeds directly into ClustalW, MUSCLE, etc.)
    * linkage disequilibrium, Hardy-Weinberg, Fst, migration estimates
3. 3D structure (but I don't know much about that field)

Some of you will find BioPython to be efficient and maybe even indispensable. I found a tutorial online that I think does a straightforward job of working through how to use BioPython to retrieve sequence information from NCBI and how to use BLAST via BioPython. It even takes you through a few examples (for instance, there is one that takes you through diagnosing sickle cell anemia genetically). The tutorial is found here:
https://krother.gitbooks.io/biopython-tutorial/content/

A more advanced tutorial takes you through population genetics simulations, cluster analysis, supervised learning, etc. Once again: what you find interesting and useful about all of the features of BioPython will be dictated by the kind of work you do.

https://nbviewer.jupyter.org/github/tiagoantao/biopython-notebook/blob/master/notebooks/00%20-%20Tutorial%20-%20Index.ipynb

BioPython focuses on tools that make bioinformatics and genomics easy.

Of particular importance: parsing. Parsing has a particular meaning in computer science, but you can generally think of it as taking information that is in one format and translating it into the BioPython format so that you can manipulate the data.

BioPython works seamlessly with the common formats (FASTA) and the common databases (GenBank, SwissProt, BLAST, Entrez, microarray, etc.).
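To make the idea of parsing concrete, here is a minimal sketch that reads FASTA-formatted text into BioPython records with `SeqIO.parse`. The two records (`seq1`, `seq2`) and their sequences are made up for illustration, and the text is read from an in-memory string rather than a real file so that the example runs on its own.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA text; normally this would live in a .fasta file on disk.
fasta_text = """>seq1 hypothetical example record
ATGCGTTGCC
>seq2 another hypothetical record
ATGGCCATTGTAATGGGCCGCTGA
"""

# SeqIO.parse yields one SeqRecord per FASTA entry; StringIO makes the
# string behave like an open file handle.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # record.id is the first word of the header line, record.seq is a Seq object
    print(record.id, len(record.seq), record.seq)
```

With a real file you would pass the file name (or an open handle) to `SeqIO.parse` in place of the `StringIO` object.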

* Seq objects, like strings, are immutable in BioPython. This protects you from accidentally 'writing over' a sequence object:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec17
* there are mutable versions of Seq objects called MutableSeq objects
* you can create one directly by using MutableSeq() instead of Seq() when you build the object from a sequence string, or you can convert an existing immutable Seq object to a mutable one (older BioPython releases provide a .tomutable() method for this; newer releases have you pass the sequence to MutableSeq() instead)
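The difference between the two object types can be sketched as follows. This is an illustrative example rather than anything taken from the official tutorial; it converts through `str()` on the assumption that this is the form most likely to work in both older and newer BioPython releases.

```python
from Bio.Seq import MutableSeq, Seq

dna = Seq("ATGCGTTGCC")

# A Seq behaves like a string: item assignment is not allowed.
try:
    dna[0] = "T"
except TypeError as err:
    print("Seq is immutable:", err)

# Build a mutable copy from the sequence string and edit it in place.
mutable_dna = MutableSeq(str(dna))
mutable_dna[0] = "T"
print(mutable_dna)

# Convert back to an immutable Seq once the editing is done.
edited_dna = Seq(str(mutable_dna))
print(edited_dna)
```

Note that the original `dna` object is untouched by the edit; only the mutable copy changes.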

The website that lists the methods and attributes of Seq objects can be found here: https://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html
You will undoubtedly notice that there are methods such as "complement", "reverse_complement", "transcribe", "translate", and many other methods that we wrote out as functions in Intro to Python I.

In [None]:
# import Biopython once you have installed it (e.g. in your Anaconda environment),
# especially the powerful Seq object
from Bio.Seq import Seq
# note: older Biopython releases also let you pass an IUPAC alphabet
# (from Bio.Alphabet import IUPAC) as a second argument to Seq();
# Bio.Alphabet was removed in Biopython 1.78, so it is omitted here

# create a Seq object of DNA:
DNA_seq = Seq("ATGCGTTGCC")
# try to print out the DNA_seq object:
print(DNA_seq)
print("~"*20)

# create a Seq object of RNA:
RNA_seq = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
print(RNA_seq)
print("*"*20)

# let's try to do the same type of manipulations that we did in earlier problem sets
print(DNA_seq.reverse_complement())
print(RNA_seq.translate())
# SO VERY MUCH EASIER, RIGHT?

# you can also treat seq objects like strings in terms of slicing and other methods
print(DNA_seq[4:-1:2])
print(len(DNA_seq))
# and there are methods like join, upper, lower and the usual string methods
# here is an example of the join method taken from the biopython tutorial
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N"*10)
spacer.join(contigs)

ATGCGTTGCC
~~~~~~~~~~~~~~~~~~~~
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
********************
GGCAACGCAT
MAIVMGR*KGAR*
GTC
10


Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')
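The cell above covers `reverse_complement` and `translate`. As a short additional sketch, the `complement` and `transcribe` methods mentioned earlier can be chained in the same way. The DNA sequence here is made up for illustration (it is the DNA version of the first part of the RNA sequence above), and `to_stop=True` is an optional argument that ends the translation at the first stop codon.

```python
from Bio.Seq import Seq

# Hypothetical coding-strand DNA, used only for illustration
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# complement: swap each base for its partner without reversing the sequence
complement_dna = coding_dna.complement()
print(complement_dna)

# transcribe: DNA coding strand -> mRNA (T becomes U)
messenger_rna = coding_dna.transcribe()
print(messenger_rna)

# translate: mRNA -> protein, stopping at the first stop codon
protein = messenger_rna.translate(to_stop=True)
print(protein)
```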

# In-class Question:
We want your object to include the following attributes: species name, gene accession name, and nucleotide sequence. We want to be able to manipulate these objects with the following methods: determine the length of the sequence, translate the user-provided DNA into an amino acid sequence and print it, print the complement of the DNA sequence, and determine the GC content of the gene. Ensure that your object has default values. Your marker will also input their own sequence data, and your code will still need to work!
Here are the default values that you should use (they will be replaced when specified by the user):
* Sequence = "ACTGATCGTTACGTACGAGTCAT"
* Species = "Drosophila melanogaster"
* Gene name = "ABC1"
