# Installing modules

Up to now, every Python library we've needed, such as Pandas, NumPy, and Matplotlib, came included with our Anaconda Python distribution. However, many useful Python libraries are not included in Anaconda by default, or are not directly accessible via the Anaconda package manager.

In class we'll review three ways to install modules:

### Installing modules using the Anaconda GUI

1. Run the Anaconda Navigator program
2. Select the "Environments" tab on the left
3. Select the environment you want to install a package into -- "base" by default
4. Select "All" in the package pull-down menu in the right pane
5. Search for the package of interest -- e.g. "biopython"
6. Click the checkbox next to each package you wish to install, then click the "Apply" button

### Installing modules using the `conda` command line tool

1. Search for the package of interest -- `conda search biopython`
2. Install the package of interest -- `conda install biopython`

### Installing modules using pip

1. Search for packages of interest on [PyPI](https://pypi.org/). The command-line equivalent, e.g. `pip search gff`, may no longer work because PyPI has disabled the search API it relied on
2. Install via `pip install` command -- e.g. `pip install gffpandas`
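
After installing with any of the three routes above, it can be handy to confirm from inside Python that the module is actually importable. The sketch below uses only the standard library; the helper names `is_installed` and `installed_version` are illustrative and not part of conda or pip. Note that the name you install (`biopython`) can differ from the name you import (`Bio`).

```python
import importlib.util
from importlib import metadata


def is_installed(module_name):
    """Return True if the top-level module `module_name` can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The import name ("Bio") and the distribution name ("biopython") differ.
print(is_installed("Bio"), installed_version("biopython"))
```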



## Introducing Biopython

Biopython is a library that contains a wide variety of functions and classes for working with bioinformatics data. Nucleotide and protein sequence information is particularly well supported, but Biopython also has tools for many other tasks, such as running automated database searches over the internet, working with 3D structural data, and running population genetic simulations. Today we'll focus primarily on working with sequence data and associated metadata.

In [1]:
import Bio  # base library, this is a check to see if we installed it correctly

### How do I start to learn a new library?

1. Find the documentation and look for a tutorial
2. Read, test, and extend code examples illustrating how the library works
3. Learn how to use API documentation effectively
4. Learn how to query Python objects in an interactive session
5. Read the source code

We'll illustrate all of these steps today as we start to get acquainted with Biopython.
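
Steps 3 and 4 often come down to reading docstrings and signatures from inside the interpreter. Here is a minimal sketch using the standard `inspect` module. `describe` is a hypothetical helper, and `gc_content` is a toy function included only so the example has something to inspect.

```python
import inspect


def gc_content(seq):
    """Return the fraction of G and C characters in a sequence string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def describe(obj):
    """Return a one-line summary: name, signature, and first docstring line."""
    doc = inspect.getdoc(obj) or ""
    first_line = doc.splitlines()[0] if doc else "(no docstring)"
    return f"{obj.__name__}{inspect.signature(obj)}: {first_line}"


print(describe(gc_content))
```

In an interactive session, `help(obj)` shows the full docstring; a helper like this is mainly useful when you want to scan many functions at once.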

## CLASS TODO
1. Find the Biopython home page
2. Find the link to the Biopython documentation
3. Go to the API (application programming interface) documentation

### Creating Seq objects

In [2]:
from Bio.Seq import Seq

In [3]:
s1 = Seq("ATGCGCGATGA")

In [4]:
s1

Seq('ATGCGCGATGA')

In [5]:
s1[0]  # indexing similar to strings

'A'

In [6]:
s1[0:6]  # slicing similar to strings

Seq('ATGCGC')

In [7]:
s1[-1]

'A'

In [8]:
str(s1)  # convert the Seq object to a plain Python string

'ATGCGCGATGA'

In [9]:
s1.count('A')  # returns the number of times a subsequence occurs in the Seq (an int)

3

In [10]:
s1.count('a')

0

In [11]:
s1.count('AT')

2
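
The counts above behave like Python's `str.count`, which is case-sensitive and counts non-overlapping matches. The plain-string sketch below (no Biopython required) contrasts that with an overlapping count. `count_overlapping` is an illustrative helper, not the library's implementation.

```python
def count_overlapping(seq, sub):
    """Count occurrences of `sub` in `seq`, allowing matches to overlap."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Resume one position after the previous match start so overlaps are seen.
        start = seq.find(sub, start + 1)
    return count


dna = "ATGCGCGATGA"
print(dna.count("GCG"))               # non-overlapping: 1
print(count_overlapping(dna, "GCG"))  # overlapping: 2
```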

In [12]:
ms1 = s1.tomutable()  # convert to a MutableSeq, which can be edited in place (newer Biopython releases may require MutableSeq(s1) instead)
ms1

MutableSeq('ATGCGCGATGA')

In [13]:
ms1[2] = "T"
ms1

MutableSeq('ATTCGCGATGA')

In [14]:
type(ms1)

Bio.Seq.MutableSeq

In [15]:
type(s1)

Bio.Seq.Seq
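
The distinction between `Seq` and `MutableSeq` mirrors the one between Python's immutable `str` and its mutable containers. The sketch below performs the same edit on a plain string, using a list of characters as a stand-in for the mutable type.

```python
s = "ATGCGCGATGA"

# Strings are immutable: item assignment raises TypeError.
try:
    s[2] = "T"
except TypeError as err:
    print("cannot edit in place:", err)

# Convert to a mutable container, edit, and convert back.
chars = list(s)
chars[2] = "T"
edited = "".join(chars)
print(edited)  # ATTCGCGATGA
```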

### Python tools for introspection -- type, dir

In [16]:
type(s1) # Seq objects are string like, but are not strings

Bio.Seq.Seq

In [17]:
dir(s1)  # dir lists the names of all attributes (data fields and methods) of an object

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 '_get_seq_str_and_check_alphabet',
 'alphabet',
 'back_transcribe',
 'complement',
 'count',
 'count_overlap',
 'encode',
 'endswith',
 'find',
 'index',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'rfind',
 'rindex',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

In [18]:
s1.complement()  # get the nucleotide complement of the sequence

Seq('TACGCGCTACT')

In [19]:
import Bio.Seq
# In an interactive session, type "Bio.Seq." and press Tab to list the names the module provides

In [None]:
[i for i in dir(s1) if not i.startswith("_")]  # get the attributes, hiding the "dunders"

In [None]:
# we can even wrap this up in a function
def object_attributes(o):
    return [i for i in dir(o) if not i.startswith("_")]

In [None]:
object_attributes(s1)

In [None]:
# We can even write a couple of functions to automatically query methods and attributes

import types

def methods_of(o):
    # public names whose values are bound methods
    return [i for i in dir(o) if not i.startswith("_") and isinstance(getattr(o, i), types.MethodType)]

def attributes_of(o):
    # public names whose values are not bound methods
    return [i for i in dir(o) if not i.startswith("_") and not isinstance(getattr(o, i), types.MethodType)]

In [None]:
methods_of(s1)

In [None]:
attributes_of(s1)
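
To see what helpers like these return without Biopython installed, here is a small self-contained variant applied to a toy class. It uses `callable` rather than a check against `types.MethodType`, which is a design choice of this sketch (it also catches built-in methods). `Probe`, `public_methods`, and `public_data` are hypothetical names.

```python
class Probe:
    """Toy class with one data attribute and one method."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"probe {self.name}"


def public_methods(obj):
    """Names of public attributes that are callable."""
    return [n for n in dir(obj) if not n.startswith("_") and callable(getattr(obj, n))]


def public_data(obj):
    """Names of public attributes that are not callable."""
    return [n for n in dir(obj) if not n.startswith("_") and not callable(getattr(obj, n))]


p = Probe("p1")
print(public_methods(p))  # ['greet']
print(public_data(p))     # ['name']
```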

## CLASS TODO

1. Find the Bio.Seq page in API docs
2. Skim the documentation for the non-dunder methods to get a sense of what sort of built-in functionality Seq objects have

### Examples of methods on Bio.Seq objects

In [20]:
s1.complement()

Seq('TACGCGCTACT')

In [21]:
s1.reverse_complement()

Seq('TCATCGCGCAT')
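
To make clear what these two methods compute, here is a plain-string sketch for unambiguous DNA. It is not Biopython's implementation (which also handles ambiguity codes), and the function names are illustrative.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Complement each base, keeping the original left-to-right order."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Complement each base and reverse, giving the opposite strand read 5' to 3'."""
    return complement(dna)[::-1]


print(complement("ATGCGCGATGA"))          # TACGCGCTACT
print(reverse_complement("ATGCGCGATGA"))  # TCATCGCGCAT
```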

In [22]:
s1.transcribe()

Seq('AUGCGCGAUGA', RNAAlphabet())

In [23]:
s1.translate()



Seq('MRD', ExtendedIUPACProtein())
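
Transcription and translation can likewise be sketched on plain strings. The example assumes the standard genetic code (NCBI translation table 1), written in the compact TCAG ordering, and it simply ignores a trailing partial codon, which is why the 11-nucleotide sequence yields three amino acids. The names are illustrative, not Biopython's.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons of the standard genetic code, in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate complete codons; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(transcribe("ATGCGCGATGA"))  # AUGCGCGAUGA
print(translate("ATGCGCGATGA"))   # MRD
```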

### Parsing sequence records from a FASTA file

## CLASS TODO

1. Read the Bio.SeqIO.parse docs and short examples

In [24]:
from Bio import SeqIO

In [32]:
# use a for loop to iterate over fasta records in a file
for rec in SeqIO.parse("/Users/cleve/OneDrive/Documents/Notes/Notebooks Linked Materials/Duke University/Bio 208 Computing on the Genome/covid-S-and-E.fsa", format="fasta"):
    print(rec.name)

YP_009724390.1
YP_009724392.1
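
A FASTA file is a series of records, each consisting of a header line starting with `>` followed by one or more sequence lines. The sketch below parses that layout in plain Python to show what `SeqIO.parse` is iterating over. It is a simplified stand-in, not Biopython's parser, and the two short records are made-up sample data.

```python
import io


def _record(header, chunks):
    """Build an (id, description, sequence) tuple from a header and its sequence lines."""
    words = header.split()
    ident = words[0] if words else ""
    return ident, header, "".join(chunks)


def parse_fasta(handle):
    """Yield (id, description, sequence) tuples from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


SAMPLE = """>seq1 first toy record
MFVF
LVLL
>seq2 second toy record
MYSF
"""

records = list(parse_fasta(io.StringIO(SAMPLE)))
for ident, description, sequence in records:
    print(ident, len(sequence))
```

As in the output above, the record id is taken to be the first whitespace-delimited word of the header line.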


In [36]:
# use a list comprehension to get all the fasta records out of a file and store them in a list
recs = [rec for rec in SeqIO.parse("/Users/cleve/OneDrive/Documents/Notes/Notebooks Linked Materials/Duke University/Bio 208 Computing on the Genome/covid-S-and-E.fsa","fasta")]

In [37]:
recs

[SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT', SingleLetterAlphabet()), id='YP_009724390.1', name='YP_009724390.1', description='YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[]),
 SeqRecord(seq=Seq('MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKP...LLV', SingleLetterAlphabet()), id='YP_009724392.1', name='YP_009724392.1', description='YP_009724392.1 envelope protein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])]

In [38]:
len(recs)

2

In [39]:
recs[0]

SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT', SingleLetterAlphabet()), id='YP_009724390.1', name='YP_009724390.1', description='YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])

In [40]:
type(recs[0])

Bio.SeqRecord.SeqRecord

In [41]:
r0 = recs[0]

## CLASS TODO

1. Find the SeqRecord page in API docs
2. What are the non-method attributes associated with SeqRecords?
3. What are the methods associated with SeqRecords?

In [42]:
r0.id

'YP_009724390.1'

In [43]:
recs[0].name

'YP_009724390.1'

In [44]:
recs[0].description

'YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]'

In [45]:
r0.seq.find('PLV')

8
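
`find` returns the 0-based index of the first match, following the same convention as `str.find` (including `-1` when nothing matches). The plain-string sketch below checks this against the first residues shown in the record above and adds a hypothetical `find_all` helper for locating every occurrence of a motif.

```python
def find_all(seq, motif):
    """Return the 0-based start index of every (possibly overlapping) match."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


# First residues of the surface glycoprotein, copied from the record shown above.
prefix = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL"
print(prefix.find("PLV"))      # 8
print(prefix.find("WWW"))      # -1 (not found)
print(find_all(prefix, "LV"))  # [4, 9]
```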

### Parsing records from a GenBank file

In [46]:
filename = "/Users/cleve/OneDrive/Documents/Notes/Notebooks Linked Materials/Duke University/Bio 208 Computing on the Genome/NC_045512.gb"
covidrecs = [rec for rec in SeqIO.parse(filename, format="genbank")]

In [47]:
covidrecs

[SeqRecord(seq=Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA()), id='NC_045512.2', name='NC_045512', description='Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', dbxrefs=['BioProject:PRJNA485481'])]

In [48]:
len(covidrecs)

1

In [49]:
covidref = covidrecs[0]

In [56]:
covidref.seq

Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA())

In [57]:
len(covidref.seq)

29903

In [58]:
covidref.name, covidref.description

('NC_045512',
 'Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome')

In [59]:
len(covidref.features)

57

In [60]:
covidref.features[0]

SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29903), strand=1), type='source')

In [61]:
covidref.features[:3]

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29903), strand=1), type='source'),
 SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(265), strand=1), type="5'UTR"),
 SeqFeature(FeatureLocation(ExactPosition(265), ExactPosition(21555), strand=1), type='gene')]

In [62]:
type(covidref.features[0])

Bio.SeqFeature.SeqFeature

## CLASS TODO

1. Read the SeqFeature docs
2. What are the non-method attributes associated with SeqFeatures?
3. What are the methods associated with SeqFeatures?

In [64]:
covidref.features[0].location.start, covidref.features[0].location.end

(ExactPosition(0), ExactPosition(29903))
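
Biopython reports feature locations in Python's 0-based, end-exclusive convention, so for the `source` feature `end - start` equals the genome length (29903). GenBank flat files instead write 1-based, inclusive coordinates. The sketch below converts between the two. The helper names are illustrative, and the `266..21555` input is an assumption about how the GenBank file writes the gene that appears above as 265 to 21555.

```python
def genbank_to_python(start, end):
    """Convert 1-based inclusive coordinates to a 0-based half-open (start, end) pair."""
    return start - 1, end


def span_length(start, end):
    """Length of a 0-based half-open interval."""
    return end - start


py_start, py_end = genbank_to_python(266, 21555)
print(py_start, py_end)               # 265 21555
print(span_length(py_start, py_end))  # 21290
print(span_length(0, 29903))          # 29903
```

The half-open convention means a location can be used directly as a Python slice, `seq[start:end]`.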

In [65]:
covidref.features[0].qualifiers

OrderedDict([('organism', ['Severe acute respiratory syndrome coronavirus 2']),
             ('mol_type', ['genomic RNA']),
             ('isolate', ['Wuhan-Hu-1']),
             ('host', ['Homo sapiens']),
             ('db_xref', ['taxon:2697049']),
             ('country', ['China']),
             ('collection_date', ['Dec-2019'])])

In [66]:
covidref.features[1].qualifiers

OrderedDict()

In [67]:
covidref.features[2].qualifiers

OrderedDict([('gene', ['ORF1ab']),
             ('locus_tag', ['GU280_gp01']),
             ('db_xref', ['GeneID:43740578'])])
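
Qualifier values are lists even when there is only one value, and not every feature carries every key (the 5'UTR feature above has no qualifiers at all). Here is a minimal sketch of defensive access, using plain dictionaries copied from the outputs above. `first_qualifier` is a hypothetical helper, not part of Biopython.

```python
def first_qualifier(qualifiers, key, default=None):
    """Return the first value stored under `key`, or `default` if missing or empty."""
    values = qualifiers.get(key) or []
    return values[0] if values else default


gene_quals = {
    "gene": ["ORF1ab"],
    "locus_tag": ["GU280_gp01"],
    "db_xref": ["GeneID:43740578"],
}
utr_quals = {}

print(first_qualifier(gene_quals, "gene"))        # ORF1ab
print(first_qualifier(utr_quals, "gene", "n/a"))  # n/a
```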

In [68]:
covidref.features[2].location

FeatureLocation(ExactPosition(265), ExactPosition(21555), strand=1)

In [69]:
covidref.features[40].qualifiers

OrderedDict([('gene', ['M']),
             ('locus_tag', ['GU280_gp05']),
             ('note', ['ORF5; structural protein']),
             ('codon_start', ['1']),
             ('product', ['membrane glycoprotein']),
             ('protein_id', ['YP_009724393.1']),
             ('db_xref', ['GeneID:43740571']),
             ('translation',
              ['MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLWLLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRSMWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKDLPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSDNIALLVQ'])])

In [72]:
covidref.features[0].type

'source'

In [73]:
covidref.features[40].type

'CDS'

In [74]:
genefeatures = [ftr for ftr in covidref.features if ftr.type == "gene"]
genefeatures

[SeqFeature(FeatureLocation(ExactPosition(265), ExactPosition(21555), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(21562), ExactPosition(25384), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(25392), ExactPosition(26220), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(26244), ExactPosition(26472), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(26522), ExactPosition(27191), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(27201), ExactPosition(27387), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(27393), ExactPosition(27759), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(27755), ExactPosition(27887), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(27893), ExactPosition(28259), strand=1), type='gene'),
 SeqFeature(FeatureLocation(ExactPosition(28273), ExactPosition(29533), strand=1), type='gene'),
 SeqFeature(FeatureLocation(Exac

In [75]:
len(genefeatures)

11
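
Before filtering on a single type, it can help to tally how many features of each type a record has. The sketch below does this with `collections.Counter`. It uses `SimpleNamespace` stand-ins with a `.type` attribute so that it runs without the GenBank file; with a real record you would pass `covidref.features`. The toy list is made up and does not reproduce the real counts.

```python
from collections import Counter
from types import SimpleNamespace


def count_feature_types(features):
    """Tally features by their `.type` attribute."""
    return Counter(ftr.type for ftr in features)


# Stand-in features; only the `.type` attribute is modeled.
toy_types = ["source", "5'UTR", "gene", "CDS", "gene", "CDS"]
toy_features = [SimpleNamespace(type=t) for t in toy_types]

counts = count_feature_types(toy_features)
print(counts["gene"], counts["CDS"], counts["source"])  # 2 2 1
```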

In [76]:
ftrsbyname = {ftr.qualifiers["gene"][0]: ftr for ftr in genefeatures}  # map gene name -> feature

In [77]:
type(ftrsbyname)

dict

In [78]:
ftrsbyname.keys()

dict_keys(['ORF1ab', 'S', 'ORF3a', 'E', 'M', 'ORF6', 'ORF7a', 'ORF7b', 'ORF8', 'N', 'ORF10'])

In [79]:
ftrsbyname["S"].location

FeatureLocation(ExactPosition(21562), ExactPosition(25384), strand=1)

In [80]:
ftrsbyname["S"].extract(covidref)  # extract the part of the genome sequence that corresponds to the gene

SeqRecord(seq=Seq('ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTT...TAA', IUPACAmbiguousDNA()), id='NC_045512.2', name='NC_045512', description='Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', dbxrefs=[])

In [81]:
ftrsbyname["S"].extract(covidref).translate()

SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...YT*', HasStopCodon(ExtendedIUPACProtein(), '*')), id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=[])
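
Conceptually, `extract` slices the parent sequence using the feature's location and reverse-complements the result for minus-strand features. The plain-string sketch below illustrates that idea for a simple (non-compound) location. It is not Biopython's implementation, and the toy sequence is made up.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract(parent, start, end, strand=1):
    """Return parent[start:end], reverse-complemented when strand is -1."""
    piece = parent[start:end]
    if strand == -1:
        piece = piece.translate(_COMPLEMENT)[::-1]
    return piece


toy_genome = "CCATGAAATAGGG"
print(extract(toy_genome, 2, 11))      # ATGAAATAG
print(extract(toy_genome, 2, 11, -1))  # CTATTTCAT
```

The trailing `*` in the translated output above marks the stop codon at the end of the extracted gene.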