# Reading *An Introduction to Applied Bioinformatics*

**Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet).** Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.

*An **I**ntroduction to **A**pplied **B**ioinformatics*, or **IAB**, is an open source, interactive bioinformatics text. **It introduces readers to the core concepts of bioinformatics in the context of their implementation and application to real-world problems and data.** IAB is closely tied to the [scikit-bio](http://www.scikit-bio.org) Python package, which provides production-ready implementations of core bioinformatics algorithms and data structures. Readers therefore learn the concepts in the context of tools they can use to develop their own bioinformatics software and pipelines, enabling them to rapidly get started on their own projects. While some theory is discussed, the focus of IAB is on what readers need to know to be effective, practicing bioinformaticians.

IAB is interactive, being **based on IPython Notebooks**, which can be installed on a reader’s computer or viewed statically online. As readers learn a concept, for example pairwise sequence alignment, they are presented with its scikit-bio implementation directly in the text. scikit-bio code is well annotated (adhering to the [pep8](https://www.python.org/dev/peps/pep-0008/) and [numpydoc](https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt) conventions), so readers can use it to assist with their understanding of the concept. And, because IAB is presented as an IPython Notebook, readers can execute the code directly in the text. For example, when learning pairwise alignment, readers can align sequences provided in IAB (or their own sequences) and modify parameters (or even the algorithm itself) to see how changes affect the resulting alignments.
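
To give a flavor of that kind of experimentation, here is a minimal, self-contained sketch of global pairwise alignment (the Needleman-Wunsch algorithm with a linear gap penalty). It is not scikit-bio's implementation: the function name `global_align`, its parameters, and the default scores are assumptions made up for this illustration, and the code assumes Python 3.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Illustrative sketch only. Returns (aligned_seq1, aligned_seq2, score).
    """
    n, m = len(seq1), len(seq2)
    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two characters
                              score[i - 1][j] + gap,      # gap in seq2
                              score[i][j - 1] + gap)      # gap in seq1

    # Trace back from the bottom-right cell to recover one optimal alignment
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned1.append(seq1[i - 1])
                aligned2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append('-')
            i -= 1
        else:
            aligned1.append('-')
            aligned2.append(seq2[j - 1])
            j -= 1
    return ''.join(reversed(aligned1)), ''.join(reversed(aligned2)), score[n][m]


# Same sequences, two different gap penalties
for gap in (-1, -5):
    a1, a2, s = global_align("ACGTT", "CGTTA", gap=gap)
    print(gap, a1, a2, s)
# -1 ACGTT- -CGTTA 2
# -5 ACGTT CGTTA -3
```

With a cheap gap penalty the algorithm prefers to open two gaps and line up four matching characters. With an expensive one it falls back to an ungapped alignment full of mismatches. This is the sort of parameter sensitivity that is much easier to appreciate by running the code than by reading about it.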

IAB is **completely open access**, with all software being BSD-licensed, and all text being licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (i.e., CC BY-NC-SA 4.0). All development and publication is coordinated under [public revision control on GitHub](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics).

IAB is also an **electronic-only resource**. There are currently no plans to commercialize it or to create a print version. This means that, unlike printed bioinformatics texts which are generally out of date before the ink dries, IAB can be updated as the field changes. 

**The life cycle of IAB is more like a software package than a book.** There will be development and release versions of IAB, where the release versions are more polished but won't always contain the latest content, and the development versions will contain all of the latest materials, but won't necessarily be copy-edited and polished.

We are in the process of developing a **project status page** that will detail the plans for IAB. This will include the full table of contents and the stage you can expect each chapter to have reached at different times. You can track progress on this at [IAB #97](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/issues/97).

[My](http://github.com/gregcaporaso) goal for IAB is for it to make bioinformatics as accessible as possible to students from varied backgrounds, and to get more people excited about this hugely exciting field. I'm very interested in hearing from readers and instructors who are using IAB, so get in touch if you have corrections, suggestions for how to improve the content, or any other thoughts or comments on the text. In the spirit of openness, I'd prefer to be contacted via the [IAB issue tracker](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/issues/). I'll respond to direct e-mail as well, but I'm always backlogged (just ask my students), so responses are likely to be slower.

I hope you find IAB useful, and that you enjoy reading it!

## Who should read IAB?

IAB is useful for scientists, software developers, and students interested in understanding and applying bioinformatics methods, and ultimately in developing their own bioinformatics software and analysis pipelines. 

IAB was initially developed for an undergraduate course cross-listed in computer science and biology with no prerequisites. It therefore assumes little background in biology or computer science; however, some basic background is very helpful. For example, an understanding of the roles of and relationship between DNA and protein in a cell, and the ability to read and follow well-annotated Python code, are both helpful (but not necessary) to get started.

In the sections below I provide some suggestions for other texts that will help you to get started.

## How to read IAB

There are two ways to read *An Introduction To Applied Bioinformatics*:

* The *recommended* way is to install it and work with it interactively.
* The *easiest* way is to view the static notebooks online using [nbviewer](http://nbviewer.ipython.org/). You should:
    * [start here to view the latest release version](http://nbviewer.ipython.org/github/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/blob/0.1.0/Index.ipynb), or
    * [start here to view the development version](http://nbviewer.ipython.org/github/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/blob/master/Index.ipynb) (which will have the latest content, but will be less polished and possibly buggy).

## Installation

The [project website](http://caporasolab.us/An-Introduction-To-Applied-Bioinformatics/) has instructions for installing and running *An Introduction To Applied Bioinformatics*.

## Using the IPython Notebook

IAB is built using the IPython Notebook, an interactive HTML-based computing environment. The main source for information about the IPython Notebook is the [IPython Notebook website](http://ipython.org/notebook). You can find information there on how to use the IPython Notebook, and also on how to set up and run an IPython Notebook server (for example, if you'd like to make one available to your students).

Most of the code that is used in IAB comes from the [scikit-bio](http://scikit-bio.org) package, or from other Python scientific computing tools. You can access these in the same way that you would in a Python script. For example:

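In [1]:

    # __future__ imports must come first; this makes print a function on Python 2
    from __future__ import print_function

    import skbio
    from IPython.core import page

    # Send pager output (e.g., from %psource) to print so it appears inline
    page.page = print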

We can then access functions, variables, and classes from these modules.

In [2]:

    print(skbio.title)
    print(skbio.art)


    *                                                    *
                   _ _    _ _          _     _
                  (_) |  (_) |        | |   (_)
          ___  ___ _| | ___| |_ ______| |__  _  ___
         / __|/ __| | |/ / | __|______| '_ \| |/ _ \
         \__ \ (__| |   <| | |_       | |_) | | (_) |
         |___/\___|_|_|\_\_|\__|      |_.__/|_|\___/

    *                                                    *



               Opisthokonta
                       \  Amoebozoa
                        \ /
                         *    Euryarchaeota
                          \     |_ Crenarchaeota
                           \   *
                            \ /
                             *
                            /
                           /
                          /
                         *
                        / \
                       /   \
            Proteobacteria  \
                           Cyanobacteria



We'll inspect a lot of source code in IAB as we explore bioinformatics algorithms. If you're ever interested in seeing the source code for some functionality that we're using, you can do that using IPython's ``psource`` magic.

In [3]:

    from skbio.alignment import Alignment

    %psource Alignment.position_entropies

    [0;32mdef[0m [0mposition_entropies[0m[0;34m([0m[0mself[0m[0;34m,[0m [0mbase[0m[0;34m=[0m[0mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m                           [0mnan_on_non_standard_chars[0m[0;34m=[0m[0mTrue[0m[0;34m)[0m[0;34m:[0m[0;34m[0m
[0;34m[0m        [0;34m"""Return Shannon entropy of positions in Alignment[0m
[0;34m[0m
[0;34m        Parameters[0m
[0;34m        ----------[0m
[0;34m        base : float, optional[0m
[0;34m            log base for entropy calculation. If not passed, default will be e[0m
[0;34m            (i.e., natural log will be computed).[0m
[0;34m        nan_on_non_standard_chars : bool, optional[0m
[0;34m            if True, the entropy at positions containing characters outside of[0m
[0;34m            the first sequence's `iupac_standard_characters` will be `np.nan`.[0m
[0;34m            This is useful, and the default behavior, as it's not clear how a[0m
[0;34m            gap or degenerate character sh
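
The docstring above describes a per-position Shannon entropy. As a rough, standalone illustration of that idea, here is a small sketch that computes the entropy of each column in a set of aligned sequences. It is not the scikit-bio source: `column_entropies` is a hypothetical name, the code assumes Python 3, and, unlike the behavior described in the docstring, it treats gaps and degenerate characters as ordinary symbols rather than returning `nan`.

```python
import math
from collections import Counter


def column_entropies(seqs, base=None):
    """Shannon entropy of each column of equal-length aligned sequences.

    Illustrative sketch only. If base is None, the natural log is used.
    """
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    entropies = []
    for column in zip(*seqs):
        total = len(column)
        h = 0.0
        for count in Counter(column).values():
            p = count / total      # observed frequency of this character
            h -= p * math.log(p)   # accumulate -p * ln(p)
        if base is not None:
            h /= math.log(base)    # convert from nats to the requested base
        entropies.append(h)
    return entropies


aligned = ["ACGT",
           "ACGA",
           "ACTA",
           "AGTA"]
print([round(h, 3) for h in column_entropies(aligned, base=2)])
# [0.0, 0.811, 1.0, 0.811]
```

A perfectly conserved column (the first one here) has an entropy of zero, while a column split evenly between two characters has an entropy of one bit.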

The documentation for scikit-bio is also very extensive (though the package itself is still in early development). You can view the documentation for the `Alignment` object, for example, [here](http://scikit-bio.org/docs/0.2.3/generated/skbio.alignment.Alignment.html). These documents will be invaluable for learning how to use the objects.

## Getting started with Biology

If you're new to biology, these are some books and resources that will help you get started.

* [The Processes of Life](http://www.amazon.com/Processes-Life-Introduction-Molecular-Biology/dp/0262013053) by Lawrence Hunter. *An introduction to biology for computer scientists.*


* [Molecular Biology of the Cell](http://www.amazon.com/Molecular-Biology-Cell-Bruce-Alberts/dp/0815341059/ref=sr_1_1?s=books&ie=UTF8&qid=1314225305&sr=1-1) by Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, Peter Walter. *One of the best texts on molecular biology. This is fairly advanced (it's generally used in upper division molecular biology courses) so it may not be the best place to start. You'll find it invaluable though if you plan to go on in Bioinformatics. This book is available via the NIH Bookshelf (for example, from Chapter 1: [The Universal Features of Cells on Earth](http://www.ncbi.nlm.nih.gov/books/NBK26864/) and [The Diversity of Genomes and the Tree of Life](http://www.ncbi.nlm.nih.gov/books/NBK26866/)).*


* [Brock Biology of Microorganisms](http://www.amazon.com/Brock-Biology-Microorganisms-Michael-Madigan/dp/032164963X/ref=dp_ob_title_bk) by Michael T. Madigan, John M. Martinko, David Stahl, David P. Clark. *One of the best textbooks on microbiology. This is also fairly advanced, but if you're interested in microbial ecology or other aspects of microbiology it will likely be extremely useful.*


* The [NIH Bookshelf](http://www.ncbi.nlm.nih.gov/books/) *A lot of free biology texts, some obviously better than others.*

## Getting started with Computer Science and programming

If you're new to Computer Science and programming, these are some books and resources that will help you get started.

* [Software Carpentry](http://www.software-carpentry.org) *Online resources for learning scientific computing skills, and regular in-person workshops all over the world. Taking a Software Carpentry workshop **will** pay off for biology students interested in a career in research.*


* [Practical Computing for Biologists](http://www.amazon.com/Practical-Computing-Biologists-Steven-Haddock/dp/0878933913) by Steven Haddock and Casey Dunn. *A great introduction to many computational skills that are required of modern biologists. I *highly* recommend this book to all Biology undergraduate and graduate students.*


* [Practical Programming: An Introduction to Computer Science Using Python](http://www.amazon.com/Practical-Programming-Introduction-Pragmatic-Programmers/dp/1934356271) by Jennifer Campbell, Paul Gries, Jason Montojo, Greg Wilson. *An introduction to the Python programming language and basic computer science. This is a great first programming book for people interested in bioinformatics or scientific computing in general.*


* [Learn Python the Hard Way](http://learnpythonthehardway.org/) by Zed Shaw. *Another good Python introduction. This one is very focused on exercises and is great for practicing Python. My students have complained that it doesn't provide enough background information (i.e., `what` you're doing and `why` it works) and for that reason I recommend using this in conjunction with `Practical Programming`. Beware that the two don't follow each other exactly. One strategy that some students use is to work through these exercises in order and use `Practical Programming` as a reference.*

## Need help?

If you're having issues getting *An Introduction to Applied Bioinformatics* running on your computer, or you have corrections or suggestions on the content, you should get in touch through the [IAB issue tracker](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/issues). This will generally be much faster than e-mailing the author directly, as there are multiple people who monitor the issue tracker. It also helps us manage our technical support load if we can consolidate all requests and responses in one place.

## Contributing to IAB

If you're interested in contributing content or features to IAB, I'd love to hear from you. You should start by reviewing [CONTRIBUTING.md](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/blob/master/CONTRIBUTING.md) which provides guidelines on how to get involved.

## About the author

<div style="float: right; margin-left: 20px; margin-bottom: 15px; width: 200px"><img title="@gregcaporaso, circa 2015" style="float: right;margin-left: 30px;" src="https://raw.github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/master/images/ponytail.png" align=right height=175/></div>

My name is Greg Caporaso. I'm the primary author of IAB, but there are [other contributors](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/graphs/contributors) as well and I hope that list will grow. 

I have degrees in Computer Science (B.S., Univ of Colorado, 2001) and Biochemistry (B.A., Univ of Colorado, 2004; Ph.D., Univ of Colorado, 2009). Following my formal training, I joined the [Rob Knight Laboratory](http://knightlab.ucsd.edu), then at the University of Colorado, for approximately 2 years as a post-doctoral scholar. In 2011, I joined the faculty at [Northern Arizona University (NAU)](http://www.nau.edu) where I'm an Assistant Professor in the Biological Sciences department. I [teach](http://www.caporasolab.us/teaching/) one course per year in bioinformatics for graduate and undergraduate students of Biology and Computer Science. I also run a [research lab](http://www.caporasolab.us/) in the [Center for Microbial Genetics and Genomics](http://www.mggen.nau.edu/), which is focused on developing bioinformatics software and studying the human microbiome.

I'm very active in open source bioinformatics software development. I'm not the world expert on the topics that I present in IAB, but I have a passion for bioinformatics, open source software, writing, and education. When I'm learning a new bioinformatics concept, for example an algorithm like pairwise alignment or a statistical technique like Monte Carlo simulation, implementing it is usually the best way for me to wrap my head around it. This led me to start developing IAB, as I found that my implementations helped my students learn the concepts too. I think that one of my strongest skills is the ability to break complex ideas into accessible components. I do this well for bioinformatics because I remember (and still regularly experience) the challenges of learning it, so I can relate to newcomers in the field.

I am most widely known for my involvement in the development of the [QIIME software package](http://www.qiime.org), and more recently for leading the development of [scikit-bio](http://www.scikit-bio.org). I am also involved in many other bioinformatics software projects (see my [GitHub page](http://github.com/gregcaporaso)). IAB is one of the projects that I'm currently most excited about, so I truly hope that it's as useful for you as it is fun for me.

For updates on IAB and various other things, you should [follow me on Twitter](https://twitter.com/gregcaporaso).