# Big Data for Biologists: Decoding Genomic Function- Class 1

## What is a gene and how can we read gene sequences into a computer program?

##  Learning Objectives
 ***Students should be able to***
 <ol>
 <li>Access Jupyter Notebooks </li>
 <li>Explain what a gene is </li>
 <li>Describe what a command line is and start Python from the command line</li>
 <li>Find and/or set a working directory</li>
 <li>Identify the absolute or relative path for a file or directory</li>
 <li>Download a gene sequence from a genome database</li>
 <li>Read a gene sequence into Python and determine the length of the sequence</li>
 </ol>

**Note:** For additional background on DNA and the Central Dogma, see the [Khan Academy video on DNA](https://www.khanacademy.org/science/biology/classical-genetics/molecular-basis-of-genetics-tutorial/v/dna-deoxyribonucleic-acid) or the [Khan Academy video on the Central Dogma](https://www.khanacademy.org/test-prep/mcat/biomolecules/amino-acids-and-proteins1/v/central-dogma-of-molecular-biology-2). 


## **What is a gene**?

If you open a science news website, there's a good chance that you'll find an article about the latest study having to do with genes. Headlines like:

 *"For Coffee Drinkers, the Buzz May Be in Your Genes" - New York Times*

or 

 *"Gene Tests Identify Breast Cancer Patients Who Can Skip Chemotherapy, Study Says" - New York Times*

seem to be a product of living in the post-genomic era (see below if you're wondering what "post-genomic era" means). 
 
But what is a gene? 

To define the term gene, we first need to define DNA. **DNA** is the molecule in cells that enables the transmission of genetic information from one generation to the next. 

Pioneering work by James Watson, Francis Crick and Rosalind Franklin in the 1950s unraveled the structure of DNA and elucidated the mechanism by which DNA can be reliably replicated when one cell divides to become two. 

DNA is made of four nucleotides (or bases): 
* **Adenine (A)**   
* **Cytosine (C)** 
* **Guanine (G)** 
* **Thymine (T)**   
      
Inside a cell, DNA nucleotides link together in two strands that form a double helix. In the figure below you can see how the base pairs come together: 

* **Adenine (A) pairs with Thymine (T)**
* **Guanine (G) pairs with Cytosine (C)**

<img src="../Images/1-DNA Structure.png" style="width: 40%; height: 50%" align="center"//>
 
 
**Genes** are segments of DNA that code for RNA and proteins, which are critical molecules for carrying out cellular function. 

The typical flow of information in cells is shown in the figure below, along with the names of the processes for each step in the information flow.  

In addition to ultimately coding for proteins, DNA can code for RNA that has a function on its own and is not translated into protein. That RNA is referred to as functional RNA. 

Unraveling the function of **non-coding DNA**, the parts of DNA that are not genes, is an active area of research. You will learn a lot more about the mysteries of "non-coding DNA" later in this class. 

<img src="../Images/1-CentralDogma.png" style="width: 40%; height: 50%" align="center">



## **Using the command line and getting started with Python** 

Being able to work with gene or DNA sequences in a computer program is a core skill that you will need for this class.  

Before we can start using a computer to analyze DNA and gene sequences, we first need to get started with a programming language. For this course we are primarily going to use **Python**, which is one of the programming languages commonly used by biologists. 

Python is what is known as a "scripting language," which means that it is ready to use as soon as it is installed: you can start entering commands one at a time, or you can write a set of commands in what is known as **"a script"**. 

A lot of the principles we are introducing will also help you if you need to use other types of programs that are commonly used by biologists and in biomedical research such as **R** or **MATLAB**. 

One of the first steps in using a programming language is getting it installed. Part of your homework will be to install Python as well as a text editing program, but for today in class we can start to explore Python using these Jupyter Notebooks. 

Unlike the windows or graphical user interface (GUI) environments that you may be used to, Python and other scripting languages have what is known as a "command line".  

A **"command line"** is a place where you can type code to give a computer program an instruction. 

When you are first starting, it is often helpful to use the command line. As you become more comfortable using programs, you can assemble your code into scripts, like we mentioned above, that run a series of commands all at once without the user having to enter each one. 

In programming classes one of the first commands that you often learn is to ask the computer to write out the phrase "Hello World". For example: 


In [1]:
print('Hello World')

Hello World


Since this is a biology course, we are going to have the computer print out a phrase from biology instead. 

Try running the command in the next box by clicking on the box and pressing Enter while holding down the Shift key. 


In [2]:
print(DNA makes RNA makes Protein)

SyntaxError: invalid syntax (<ipython-input-2-49cc9a7653fb>, line 1)

Why didn't that work? 

One of the things that beginners at coding quickly learn is that computers are very literal. Details such as quotation marks, spaces, and capital versus lower case letters all matter. 

The first lesson from this exercise is to pay attention to quotes, spaces and caps when you are coding. 

Fortunately, many coding interfaces help give you feedback (or error messages) to help you debug your code. 

Try editing the line above to fix the errors. 
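
To see how literal Python is about capitalization, here is a small sketch. The variable name `bases` is our own, chosen only for illustration. The code defines a name in lower case and then deliberately misspells it with a capital letter. The `try` and `except` lines simply catch the resulting error so that the rest of the box can keep running.

```python
# Python treats names that differ only in capitalization as different names
bases = 'ACGT'

try:
    print(Bases)  # capital B: Python does not know this name
except NameError as error:
    message = str(error)
    print('Error message:', message)

print(bases)  # lower case b: matches the name defined above
```

Reading the error message closely is usually the fastest way to find this kind of mistake, because it names the exact word that Python did not recognize.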


## How do I find and/or set my working directory?

Scripts or computer programs can get very complex, but at its most basic, a computer program reads an input, does something with the input and then writes an output.

<img src="../Images/1-Working Directories.png" style="width: 40%; height: 50%" align="center"//>

The question in red in the figure above represents a key part of writing a functioning program. 

How you organize your input and output files is also essential for building well-organized functioning projects. We'll have some more tips on that later.  

When you start Python the directory where you start the program becomes, by default, what is known as your working directory. 

The **working directory** is where a computer program will look for inputs and write outputs unless you instruct it to do so otherwise. 

It's sometimes helpful to be able to identify the working directory. You can find the working directory that you are using with the series of commands below. 

In the commands, import os loads a module for working with features that depend on your operating system (i.e., whether you are using a Mac, a PC or some other environment). 

The getcwd command stands for "get current working directory". 

In [13]:
import os
os.chdir('../')  # move up one level, to the parent directory
os.getcwd()      # report the current working directory

'/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio/vptl-course-master'

## How can I find the absolute or relative path of a directory or file?

The output of the os.getcwd() command gives what is known as the **absolute path** of the current working directory, or the exact location of the current working directory on the computer.  

When you are telling a program where to find inputs or to write outputs it is essential to be able to tell the computer where they are. You can do this by telling the computer either the absolute path or relative path. 

The **relative path** indicates where a file or directory is with respect to the working directory. 

A helpful trick for writing a path that goes up one directory is to use the "../" notation. 
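
To make the difference between relative and absolute paths concrete, here is a minimal sketch. It uses `os.path.abspath`, which converts a relative path into the absolute path it refers to, starting from the current working directory. The variable names are our own, and the paths printed will depend on where you run it.

```python
import os

# '.' is the current working directory and '..' is its parent directory
current_directory = os.path.abspath('.')
parent_directory = os.path.abspath('..')

print('Relative path .  ->', current_directory)
print('Relative path .. ->', parent_directory)
```

Notice that nothing is changed here: `os.path.abspath` only reports where a relative path points, which makes it a safe way to check a path before using it.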

For example, if you want to change your working directory to:

/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio/

You could use the command with the absolute path: 
os.chdir('/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio/')

Or, you can run the command below.    

In [1]:
import os
os.chdir('../') 
print(os.getcwd())

/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio/vptl-course-master


If you want to go back down to a subdirectory, you can just give the name of the directory. 

For example to go from: 

/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio

Back to: 

/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio/JupyterNtbks

In [21]:
import os
os.chdir('JupyterNtbks')
print(os.getcwd())
#go back! 
# .. stands for the parent directory
os.chdir("..")


/home/ubuntu/vptl/JupyterNtbks
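
When you move between directories, it is easy to lose track of where you started. One possible habit, sketched below, is to save the absolute path of your starting directory in a variable before moving so that you can always return to it. The name `start_directory` is our own choice.

```python
import os

start_directory = os.getcwd()   # remember the absolute path of where we are
os.chdir('..')                  # move up to the parent directory
print('Now in:', os.getcwd())
os.chdir(start_directory)       # return using the saved absolute path
print('Back in:', os.getcwd())
```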


It can also be helpful to list the names of the files that are in the working directory.

Try it below and see what's there. 

In [22]:
import os 
# a single period (.)  stands for the current directory 
os.listdir('.')


['JupyterNtbks',
 'class_4',
 '.ipynb_checkpoints',
 'Unix_Basics.ipynb',
 '.git',
 'ComputationalThinkingforHumanBiologySyllabus.pdf',
 'class_1',
 'class_5',
 'class_2',
 'Images']
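
The list that `os.listdir` returns mixes files and directories together. As a rough sketch, you can sort the names and use `os.path.isdir` to tell the two apart. The list names `directories` and `files` are our own, and what gets printed will depend on your working directory.

```python
import os

# Separate the entries in the current directory into directories and files
directories = []
files = []
for name in sorted(os.listdir('.')):
    if os.path.isdir(name):
        directories.append(name)
    else:
        files.append(name)

print('Directories:', directories)
print('Files:', files)
```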

What command could you use if you want to change the directory to: 

'/Users/annettesalmeen/Box Sync/2016-2017/Faculty College/CompBio/JupyterNtbks/Class 1'

Without looking below, try writing the code using both the absolute and relative paths in the boxes below. 

We've given you some guidance in comment lines which are denoted by a #. You'll see comment lines used a lot throughout the class. Adding comments is a helpful way to help you and others follow your code! 

In [23]:
#Write the commands to change the current working directory using absolute paths
#Print the working directory so you can check your work. 

In [24]:
#Write the commands to change the current working directory using relative paths
#Print the working directory so you can check your work. 

In [25]:
import os
os.chdir('/home/ubuntu/vptl/class_1')
print(os.getcwd())

/home/ubuntu/vptl/class_1
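
For the relative path version, one option is to build the path with `os.path.join` instead of typing the separators yourself. The sketch below only builds and prints the path, using the directory names from the exercise. It does not change the working directory, so it runs even if those directories do not exist on your computer.

```python
import os

# Build the relative path piece by piece; os.path.join inserts the separator
# that is right for your operating system
relative_path = os.path.join('JupyterNtbks', 'Class 1')
print(relative_path)

# os.path.abspath shows the absolute path this would resolve to from the
# current working directory (the directory itself does not have to exist)
print(os.path.abspath(relative_path))
```

If your working directory were CompBio, passing `relative_path` to `os.chdir` would be one way to complete the relative path part of the exercise.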


With these basic navigation commands we are now ready to apply what we've learned to an example from biology. 

For the rest of the class we are going to apply what we've covered to this point and show how to use it to read a gene sequence into a program and calculate the length of the sequence.

<img src="../Images/1-ReadWriteGeneSequence.png" style="width: 40%; height: 50%" align="center"//>


## **How can you find a gene sequence in a genome database**?


The first step for our project today is finding a gene sequence. 

Many, many gene sequences have been collected in publicly available on-line databases from genome sequencing research projects or smaller scale research projects to determine the sequence of single genes or sets of genes.  

Three commonly used databases to obtain gene or genome sequence information are:
   
   [NCBI Gene](https://www.ncbi.nlm.nih.gov/gene)  
   [Ensembl](http://uswest.ensembl.org/index.html)  
   [UCSC Genome Browser](https://genome.ucsc.edu/)  

We aren't going to go into detail now about the differences between these three sites, but an important point is that there are several large scale collaborative efforts that have created very organized sites to collect genome sequences from the research community. 

These sites include genome sequences not only from humans (or, more technically, *Homo sapiens*) but also from organisms ranging from bacteria like *E. coli* to large animals like elephants (*Loxodonta africana*). 

To see what a gene sequence in a genome browser looks like, visit the entry in the NCBI Gene database for [human insulin](https://www.ncbi.nlm.nih.gov/nuccore/NG_007114.1?from=4986&to=6416&report=fasta)

If you have extra time, click on the NCBI Gene link above and search for any gene of interest to see what you find. 


## How can I download a gene sequence?

For today, we are going to use a gene sequence from the NCBI database. 

Depending on the computer program that you are using or writing, you might need gene sequences in a particular format.  You will want to make sure that the format for your input file matches the format that you need for the program. 

You'll see more examples of sequence formats later in the class, but the format we will use today is known as "FASTA". You can download sequences in FASTA format directly from the NCBI database.   

In **FASTA format** the first line of the file starts with a > followed by an identifier describing the nucleotide sequence. You can see an example of what the FASTA sequence for a gene in the NCBI database looks like below. 

<img src="../Images/1-Gene Sequence.png" style="width: 60%; height: 75%" align="center"//>
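
To make the structure of a FASTA record concrete, here is a minimal sketch that splits one into its header and its sequence. The text is the header and first two sequence lines of the human insulin entry used later in this notebook. The variable names are our own.

```python
# A short FASTA record: a header line starting with '>' followed by sequence lines
fasta_text = """>NG_007114.1:4986-6416 Homo sapiens insulin (INS), RefSeqGene on chromosome 11
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
"""

lines = fasta_text.splitlines()   # split the text into separate lines
header = lines[0]                 # the first line describes the sequence
sequence = ''.join(lines[1:])     # the remaining lines are the sequence itself

identifier = header[1:].split()[0]  # text after '>' up to the first space
print('Identifier:', identifier)
print('Number of bases:', len(sequence))
```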

To get the sequence of the gene of interest into a file, you can copy it into a text file using a text editor and save the file. As you become more advanced and more comfortable with programming, you can also write code that instructs a computer program to access the web directly and "scrape" information from a website to load into a computer.
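
As a rough illustration of getting a sequence with code, the sketch below does not scrape the web page. Instead it uses NCBI's E-utilities web service (the `efetch` endpoint), which is designed for programs. The function names `build_fasta_url` and `download_fasta` are our own, and the parameters reflect our understanding of the service, so check NCBI's documentation and usage guidelines before relying on it. The download step needs an internet connection and is therefore left as a comment.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

def build_fasta_url(accession, start, stop):
    """Build an E-utilities URL for part of a nucleotide record in FASTA format."""
    parameters = {
        'db': 'nuccore',       # the nucleotide database
        'id': accession,       # the record identifier
        'rettype': 'fasta',    # ask for FASTA format
        'retmode': 'text',     # ask for plain text
        'seq_start': start,    # first base to include
        'seq_stop': stop,      # last base to include
    }
    return EFETCH_URL + '?' + urlencode(parameters)

def download_fasta(accession, start, stop):
    """Download the record and return it as text (needs an internet connection)."""
    with urlopen(build_fasta_url(accession, start, stop)) as response:
        return response.read().decode('utf-8')

insulin_url = build_fasta_url('NG_007114.1', 4986, 6416)
print(insulin_url)
# To actually download: fasta_text = download_fasta('NG_007114.1', 4986, 6416)
```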

Well-organized procedures for storing data and analysis scripts are **ESSENTIAL** for successful projects and for the ability to communicate within research teams and to others who may need to reproduce all or parts of your analysis, so we will give you guidance on ways to organize projects, particularly in the early parts of this course. 

Note that there isn't one correct way to organize a project and name files, but there are best practices, such as using clear, well-organized file names, that can be immensely helpful both to you and to others following your work. 

For this activity, we are going to follow the procedure below, starting in the working directory that we set up above (CompBio/Class1/): 

1. Create a directory called data (where data is stored)
2. Create a directory called src (where your scripts are stored) 
3. Create a directory called results (where your results (or output) are stored).  
4. Visit the entry in the NCBI Gene database for [human insulin](https://www.ncbi.nlm.nih.gov/nuccore/NG_007114.1?from=4986&to=6416&report=fasta)
5. Open up a text editor and paste the sequence into a file. 
6. Save the human insulin sequence into a file called Human-Insulin-NG_007114.1.txt in the CompBio/Class1/data directory. 



In [26]:
import os
# Create each project directory; exist_ok=True skips any that already exist
for directory in ['data', 'src', 'results']:
    os.makedirs(directory, exist_ok=True)
os.listdir('data')

['Human-Insulin NM_000207.2.txt', 'Human-Insulin-NG_007114.1.txt']

Now that we are set up with the data that we need as input, we are ready to start writing the program to read the sequence into Python and calculate the length of the sequence and the number of A/T and G/C pairs. 

Think for a moment about how you might set this up and then we'll look at the example we've provided below. 

In [27]:
#Read in the sequence and print it 
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')
print(FASTAgenesequence.read())

>NG_007114.1:4986-6416 Homo sapiens insulin (INS), RefSeqGene on chromosome 11
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT
GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG
TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC
CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG
GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG
CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT
GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC
CCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAG
TTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTG
TTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTCGCCCCTCAAACAAA
TGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCT
GGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGA
GGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGA

Give yourself a chance to think about the code above. 
What do you think the first line did? What do you think the second line did?

If you want to calculate the length of the sequence what do you think you will need to do next?

There is a function, "len", that calculates the length of a variable like the one you read in above. 

However, if you use len on the contents of the file as is, what would be the problem with that?

In this class you will see a lot of examples when it may be helpful to look at only part of an input file. 

For example, here it would be helpful to calculate the number of letters in the file with the exception of the first line. 

Fortunately, Python (and other scripting languages) make that a pretty easy task as you'll see in the next example. 

In [28]:
#Read in the sequence and trim the first line
#Note in Python the numbering of lines or characters starts with 0
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
print(genesequence)

['AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT\n', 'GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG\n', 'TGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC\n', 'CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTG\n', 'GCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAG\n', 'CTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCT\n', 'GCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGGCCGCCCCCAGCCACCCCCTGCTCCTGGCGCTCCCAC\n', 'CCAGCATGGGCAGAAGGGGGCAGGAGGCTGCCACCCAGCAGGGGGTCAGGTGCACTTTTTTAAAAAGAAG\n', 'TTCTCTTGGTCACGTCCTAAAAGTGACCAGCTCCCTGTGGCCCAGTCAGAATCTCAGCCTGAGGACGGTG\n', 'TTGGCTTCGGCAGCCCCGAGATACATCAGAGGGTGGGCACGCTCCTCCCTCCACTCGCCCCTCAAACAAA\n', 'TGCCCCGCAGCCCATTTCTCCACCCTCATTTGATGACCGCAGATTCAAGTGTTTTGTTAAGTAAAGTCCT\n', 'GGGTGACCTGGGGTCACAGGGTGCCCCACGCTGCCTGCCTCTGGGCGAACACCCCATCACGCCCGGAGGA\n', 'GGGCGTGGCTGCCTGCCTGAGTGGGCCAGACCCCTGTCGCCAGGCCTCACGGCAGCTCCATAGTCAGGAG\n', 'ATGGGGAAGA

What are all those weird \n? 

The \n are linebreaks and in this case get in the way of calculating the length of the genesequence.
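
A tiny example shows why this matters. Although `\n` is typed as two characters, Python stores it as a single line break character, and `len` counts it along with the bases. The variable name `line` is our own.

```python
line = 'ACGT\n'

print(repr(line))         # repr shows the hidden line break
print(len(line))          # the line break is counted as one extra character
print(len(line.strip()))  # strip removes it, leaving only the four bases
```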

We can get rid of them, however, using the join and replace commands. 

In [11]:
#Read in the sequence and trim the first line
FASTAgenesequence=open('data/Human-Insulin-NG_007114.1.txt','r')
genesequence=(FASTAgenesequence.readlines()[1:])
genesequence=''.join(genesequence)
genesequence=genesequence.replace('\n','')
print(len(genesequence))

1430
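
The steps above can also be collected into reusable functions. The sketch below is one possible way to do that, not the only one. The function names `read_fasta_sequence` and `count_bases` are our own. It uses a `with` statement so the file is closed automatically, and it also counts the bases, which gives the number of A/T and G/C pairs mentioned earlier. So that it runs anywhere, it first writes a small demonstration file containing the header and first two lines of the insulin sequence.

```python
import os
import tempfile

def read_fasta_sequence(path):
    """Return the sequence in a single-record FASTA file as one string."""
    with open(path, 'r') as fasta_file:      # the file is closed automatically
        lines = fasta_file.readlines()[1:]   # skip the header line
    return ''.join(line.strip() for line in lines)

def count_bases(sequence):
    """Count how many times each of the four bases appears."""
    return {base: sequence.count(base) for base in 'ACGT'}

# Small demonstration file: the header plus the first two sequence lines
example_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(example_path, 'w') as example_file:
    example_file.write('>NG_007114.1:4986-6416 Homo sapiens insulin (INS), RefSeqGene on chromosome 11\n')
    example_file.write('AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGT\n')
    example_file.write('GGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCG\n')

example_sequence = read_fasta_sequence(example_path)
counts = count_bases(example_sequence)
print('Length:', len(example_sequence))
print('A/T pairs:', counts['A'] + counts['T'])
print('G/C pairs:', counts['G'] + counts['C'])
```

To use it on the full insulin file, you could call `read_fasta_sequence('data/Human-Insulin-NG_007114.1.txt')` from the working directory set up above.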


Congratulations! We've covered a lot today, but you'll get a lot more practice with everything we've covered as the quarter continues. 

See you next time, when we'll start looking at transcription, the next step in gene expression!  