# Understanding how: BLAST

## Goal

**Use [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) to identify genes and their function in a sequence of DNA from a non-model organism.**

## Learning Objectives
After completing this tutorial, learners should be able to:

- Explain fundamental BLAST concepts including:
    - Similarity search
    - Homology
    - Sequence databases
    - Expectation value (e-value)
- Execute a BLAST search using Biopython (command line)
- Interpret the results of a BLAST search to determine homology
- Use Biopython to retrieve sequence information from NCBI Entrez databases including:
    - Conserved protein domains and functional annotation
    - Protein-coding gene structural annotation 
- Explain the advantages of running BLAST at the command line

## Time and prior knowledge

**Level:** Intermediate

This notebook is appropriate for a learner with:
- Basic knowledge of molecular biology/genetics (2nd-4th year undergraduate)
- Introductory knowledge of the command line (basic Linux commands, e.g. cd, ls, mkdir)
- Introductory knowledge of Python (e.g. basic syntax, saving variables, for loops)

**Timing:**
We estimate that you can work through the exercises in these notebooks in the following times:
- Notebook 0 Intro: 5 minutes
- Notebook 1 [Title]: 30 minutes
- Notebook 2 [Title]: 15 minutes
- Notebook 3 [Title]: 20 minutes
- Notebook 4 [Title]: 20 minutes


## Recommended readings

These are some useful background readings and resources:
- [An Introduction to Sequence Similarity (“Homology”) Searching](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/pdf/nihms519883.pdf)
- [NCBI BLAST manual](https://www.ncbi.nlm.nih.gov/books/NBK153387/)

## Prerequisites and using your own data

In this tutorial, we will use the following sample data. You can substitute your own similarly formatted data to run these analyses. Keep in mind that analysis times for several steps will increase with the size of your data (i.e. file sizes and number of files).

|Input(s)|Sample data|Notes|
|:------:|:---------:|:---:|
|One or more DNA sequence files in [FASTA](https://en.wikipedia.org/wiki/FASTA_format#Format) format|yakuba.fa|This example file is an 11kb sample of *Drosophila yakuba* DNA. In principle, you could extend this workflow to analyze one or more contigs from any genomic or transcriptomic sequence.|
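
If you substitute your own data, it can help to confirm that the file really is FASTA-formatted before running BLAST. The sketch below is one minimal way to do that in plain Python. The `read_fasta` helper and the toy record are hypothetical names and values made up for illustration (they are not the contents of yakuba.fa). In practice, Biopython's `SeqIO.parse(path, "fasta")` does the same job.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across rows
    if header is not None:
        yield header, "".join(chunks)


# Toy record for illustration only; replace with open("yakuba.fa") for real data.
toy = io.StringIO(">toy_contig example\nATGCGT\nACGTAA\n")
records = list(read_fasta(toy))
for header, seq in records:
    print(header, len(seq))
```

Each record prints its header and sequence length, which is a quick way to see how many sequences a file holds and roughly how long the downstream steps may take.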

## Introduction

In this tutorial we will use one of the most important tools in bioinformatics, BLAST. According to the [BLAST homepage](https://blast.ncbi.nlm.nih.gov/Blast.cgi):

> "Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families."

These lessons will increase your understanding of how BLAST works and teach you how to run BLAST with Biopython at the command line. The advantage of a command-line BLAST is that you can create a more reproducible, scalable workflow.
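
As a small illustration of what a scripted workflow makes possible, the sketch below filters BLAST hits by e-value. It assumes results saved in the default BLAST+ tabular layout (`-outfmt 6`), whose twelve columns are listed in `COLUMNS`. The `significant_hits` helper, the `1e-5` cutoff, and the three hit rows are all assumptions made up for this example, not real results or a recommended threshold.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def significant_hits(handle, max_evalue=1e-5):
    """Return hits with e-value <= max_evalue, sorted from best to worst."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h["evalue"])


# Made-up rows for illustration only.
toy_output = io.StringIO(
    "query1\tsubjectA\t98.5\t400\t6\t0\t1\t400\t101\t500\t1e-150\t700\n"
    "query1\tsubjectB\t80.0\t50\t10\t0\t10\t59\t1\t50\t0.5\t30.1\n"
    "query1\tsubjectC\t91.0\t200\t18\t0\t50\t249\t1\t200\t2e-60\t250\n"
)
hits = significant_hits(toy_output)
print([h["sseqid"] for h in hits])
```

Because the cutoff lives in code rather than in a web form, the same filter can be rerun unchanged on one result file or on hundreds.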

## Problems to solve
We have an 11kb sequence of DNA from the fly *Drosophila yakuba*. This is a non-model organism, but fortunately it is closely related to the well-studied *D. melanogaster*. Some of the questions we want to address include:

- What gene orthologs can we identify in the *D. yakuba* sequence?
- For identified gene(s), what structural and functional annotation information can we identify by homology?
- Given our BLAST results, how can we develop our own annotation of the protein-coding regions of our *D. yakuba* sequence?

### On to Notebook 1