# Module 1, Tutorial 4: Bioinformatics Functions
-------------------------------------------------------------------------------

## Learning Objectives
In this lesson, you will:
* Learn the structure of Python functions
* Call a function
* Utilize bioinformatics functions from Biopython

## Prerequisites

- Submodule 1 - Tutorial 1: Python Overview
- Submodule 1 - Tutorial 2: Variables
- Submodule 1 - Tutorial 3: Data Structures

## Getting Started
Run the code box below to install and import the required libraries.

In [None]:
#To install required packages
%pip install jupyterquiz
from jupyterquiz import display_quiz
import os
print("Done installing required packages")

## Functions

In Python, a **function** is a block of reusable code that performs a specific task. It takes input, processes it, and optionally returns a value. It only runs when you "call" it.

Let’s try using a common bioinformatics task to illustrate the structure of **functions** in Python.

They are created with the keyword `def` (you are **defining** the function). The function may be named in any way that makes logical sense to you.

Parentheses surround the variables that will be provided by the user of the function. Within the function, many different tasks can be performed, including calculations, and possibly returning a value—such as what you've already seen with the `len(string)` function, which returns the length of the string passed to it.

In the Python code box below, we define a function called `count_base`, which needs two pieces of information: a sequence and the base to be counted in that sequence. Calling this function will return a number representing how many times that base appears in the sequence you provide. (`count` is a built-in Python string method.)

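The last line uses the keyword `return` to hand the result of the function's operation back to wherever the function was called. In a notebook, the returned value is displayed when the call is the last line of the code box.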

<div class="alert alert-block alert-info"> <b>Tip:</b> Try changing the base or making the "base" multiple letters (e.g., "aaa") and running the Python code box again.</div>


In [1]:
def count_base(dna, base): #the function is named count_base and takes 2 inputs- a sequence string and the letter to look for
    return dna.count(base)

seq=('tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgc') #this string is lowercase, so the base should be too
count_base(seq, 'g')


17
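
Returning a value and printing it are easy to confuse in a notebook, because Jupyter displays the value of the last expression in a code box. The sketch below uses two hypothetical helper functions (`count_returned` and `count_printed`, which are not part of this tutorial) and a made-up sequence to show that only a returned value can be stored and reused.

```python
def count_returned(dna, base):
    # hands the count back to the caller
    return dna.count(base)

def count_printed(dna, base):
    # only displays the count; the function itself returns None
    print(dna.count(base))

short_seq = "gattaca"  # made-up example sequence

stored = count_returned(short_seq, "a")   # the count is kept in a variable
nothing = count_printed(short_seq, "a")   # the count is shown, but nothing is kept

print(stored + 1)   # a returned value can be used in later calculations
print(nothing)      # None
```

If `count_printed` were used inside another calculation, Python would raise an error, because `None` cannot be added or divided.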

<div class="alert alert-block alert-info"> <b>Tip:</b> Instead of just returning the number, edit the function so that it returns "g=17" or "In seq, g=17". </div>

Since sequences can be provided in either uppercase or lowercase letters, and Python is case-sensitive, a more robust function removes the risk of forgetting which case the sequence uses when you ask for a base. Below, we use other tools we already know to add extra steps that run inside `count_base`.

In [None]:
def count_base(dna, base): #the function is named count_base and takes 2 inputs- a sequence string and the letter to look for
    dna=dna.upper()   #convert all letters in the string to uppercase
    base=base.upper() #convert the letter provided to uppercase
    return dna.count(base)

count_base(seq, 'C') #seq was already created and you might have forgotten it was lowercase

Did you notice that the first string we provide to `count_base` does not have to be called "dna"? That name is used only within the function. It takes on the value that we send when we invoke `count_base(x, y)`. Try editing the code box above so that it does not use `seq` in `count_base` but passes a string directly, as we did with 'C'.
<br>
To use functions in Python, you need to know what information a function must be given (here, a string and the character to look for in the string).
<br>
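
The point about parameter names can be sketched with a made-up sequence stored under a different variable name (`my_gene` is an assumption for illustration only). The same function accepts a variable, a literal string, or keyword arguments that name each parameter explicitly.

```python
def count_base(dna, base):
    dna = dna.upper()     # convert the sequence to uppercase
    base = base.upper()   # convert the search letter to uppercase
    return dna.count(base)

my_gene = "atgcgcta"   # made-up sequence; the name differs from the parameter name

from_variable = count_base(my_gene, "c")     # a variable is passed in
from_literal = count_base("atgcgcta", "c")   # a literal string is passed directly

# with positional arguments the order matters;
# keyword arguments make the role of each value explicit
by_keyword = count_base(base="c", dna=my_gene)

print(from_variable, from_literal, by_keyword)
```

All three calls give the same count, because inside the function the value is always known as `dna`, whatever it was called outside.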

### Test your knowledge

In [None]:
from jupyterquiz import display_quiz
ttt_quiz="PythonQuizQuestions/ttt_qz.json"
display_quiz(ttt_quiz)

Functions can also call other functions, though to reuse them routinely across notebooks you will need to learn how to save them. For now, let's write another function that calls our `count_base` function to calculate the GC%.

In [None]:
def gcPercent(dna):
    gc_total =count_base(dna, "C") + count_base(dna, "G")
    percent = gc_total/(len(dna)) * 100
    return percent

print("The GC% is: ", gcPercent(seq))

def g_in_pair(dna):
    ag_total =count_base(dna, "AG") 
    g_total  =count_base(dna, "G")
    fraction = ag_total/g_total *100
    return fraction

print(f"{int(g_in_pair(seq))}% of G's are in an AG dinucleotide sequence") #int() drops the decimals

Can you write your own tool that calculates what percentage of all guanines are found in the pairing AG?
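
Percentage functions like these divide by a length or a count, so an empty sequence (or one with no G at all) would raise a `ZeroDivisionError`. Below is a minimal sketch of one way to guard against that. The helper `percent_of` and the name `gc_percent_safe` are hypothetical, and returning 0.0 for an empty input is a design choice rather than the only reasonable answer.

```python
def count_base(dna, base):
    return dna.upper().count(base.upper())

def percent_of(part, whole):
    # hypothetical helper: return 0.0 instead of dividing by zero
    if whole == 0:
        return 0.0
    return part / whole * 100

def gc_percent_safe(dna):
    # one function calling two others
    gc_total = count_base(dna, "C") + count_base(dna, "G")
    return percent_of(gc_total, len(dna))

print(gc_percent_safe("ATGC"))   # 2 of the 4 bases are G or C
print(gc_percent_safe(""))       # empty input no longer raises an error
```

Splitting the division into its own small function means the same guard can be reused by any other percentage calculation.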

# Biopython

There are many ways we might want to manipulate, align, and evaluate bioinformatic data sets—such as FASTA sequences, both DNA and protein. Fortunately, many standard functions for these tasks have already been written and are freely available through **Biopython**: *“A set of Python tools for computational molecular biology.”* (biopython.org)  
<br>
We will begin by using tools developed for **sequence input and output** (`SeqIO`).  
<br>
We import that specific set of functions and tools (also called "objects") from the full Biopython toolkit using the following syntax:  
<br>
`from Bio import SeqIO`  
<br>
We’ll use this to examine a provided file, **`glut_human.fasta`**, which contains four different protein FASTA sequences. Analyzing a file like this manually would be quite challenging for a novice Python programmer—Biopython makes it much easier.


In [None]:
import os # imports tools for directory & folder operations

myfile="." + os.sep + "bioDataSets" + os.sep+ "glut_human.fasta" #used since this dataset is in a subdirectory. See the Input/output module for more on os

from Bio import SeqIO
for record in SeqIO.parse(myfile, "fasta"):
  print(record.id)

You should see the 4 different protein identifiers in the file, in this case PDB ID numbers.
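
To appreciate what `SeqIO.parse` does for you, here is a simplified sketch of reading FASTA text by hand in plain Python. It is not Biopython's implementation: the function `read_fasta` and the two-entry FASTA text are made up for illustration, and real files have edge cases (such as empty header lines) that this sketch does not handle.

```python
import io

def read_fasta(handle):
    """Yield (identifier, description, sequence) for each FASTA entry."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # a new entry begins
            if header is not None:
                yield header.split()[0], header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)           # sequence may span several lines
    if header is not None:                # do not forget the last entry
        yield header.split()[0], header, "".join(chunks)

# made-up FASTA text standing in for a file on disk
fake_fasta = io.StringIO(
    ">seq1 first example\nMKTA\nYIAK\n>seq2 second example\nGATTACA\n"
)

entries = list(read_fasta(fake_fasta))
for identifier, description, sequence in entries:
    print(identifier, len(sequence))
```

As in Biopython, the identifier here is the first word of the header line and the description is the whole header.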

Each of these records holds more information than just the ID, but the individual pieces are not yet convenient to access. We can load all of that information into a single variable, here called `record_glut`, stored as a Python list.
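
Why wrap the parser in `list()`? `SeqIO.parse` returns an iterator, which hands out records one at a time and is used up after a single pass. The sketch below uses a plain Python generator (`fake_parse`, a hypothetical stand-in with made-up identifiers) to show the difference between an iterator and a list.

```python
def fake_parse():
    # hypothetical stand-in for a parser that yields one record at a time
    for identifier in ["rec_A", "rec_B", "rec_C"]:
        yield identifier

one_pass = fake_parse()
first_loop = [item for item in one_pass]    # this pass consumes the iterator
second_loop = [item for item in one_pass]   # nothing is left the second time

kept = list(fake_parse())   # a list can be indexed and reused as often as needed

print(first_loop, second_loop)
print(kept[1], len(kept))
```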

In [None]:
record_glut = list(SeqIO.parse(myfile, "fasta"))
print(record_glut[1])


Printing a whole record is not that useful. We will want to work with parts of the record (such as the protein sequence) or get a more informative description.

Individual elements of a list can be accessed by using the name of the list followed by the item's position in brackets, counting from 0. Biopython divides the information in a FASTA record into its parts (sequence, id, name, and description). Once we make a variable to contain just one record, we can look at its elements.
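
The two steps described here, picking one item out of a list and then reading its parts by name, can be sketched without any files. `FakeRecord` is a hypothetical stand-in with made-up contents. It is not Biopython's `SeqRecord`, but it is accessed with the same dot notation.

```python
from typing import NamedTuple

class FakeRecord(NamedTuple):
    # hypothetical stand-in for a parsed record
    id: str
    description: str
    seq: str

records = [
    FakeRecord("rec_A", "rec_A made-up protein one", "MKTAYIAK"),
    FakeRecord("rec_B", "rec_B made-up protein two", "MSLLG"),
]

first = records[0]    # indexing starts at 0
last = records[-1]    # negative indexes count from the end

print(first.id, len(first.seq))
print(last.description)
```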

In [None]:
first_record=record_glut[0]
print(first_record.description)     
print(first_record.name)

How long is that protein sequence? 

In [None]:
len(first_record.seq)

We wrote a function above (`count_base`) that now comes in handy for determining how many of any amino acid are present in that sequence. Although we conceived of it as a nucleotide counter, the function accepts whatever information we submit to it.
```
def count_base(dna, base):
    return dna.count(base)
```
We can send it the FASTA sequence of the GLUT protein and count an amino acid rather than a base. The function takes any sequence and counts the letter you give it in quotes. This helps us see how these functions "think" about the material you provide to them.
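
A short sketch of that flexibility, using made-up data: the function only requires that its first argument has a `.count` method, so a protein string works, and so does a list of codons.

```python
def count_base(dna, base):
    return dna.count(base)

peptide = "MKTAYIAKQR"                   # made-up protein fragment
lysines = count_base(peptide, "K")       # counts a one-letter amino acid code

codons = ["ATG", "AAA", "TGA", "AAA"]    # a list also has a .count method
aaa_codons = count_base(codons, "AAA")   # counts whole list items

print(lysines, aaa_codons)
```

Note that this simple version is case-sensitive, so `count_base(peptide, "k")` finds nothing.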

In [None]:
count_base(first_record.seq, "A")

Edit the code above to determine the number of D (aspartic acid) residues in the second record (7WSN_1|Chain).

In [None]:
from jupyterquiz import display_quiz
wsnquiz="PythonQuizQuestions/7wsn_quiz.json"
display_quiz(wsnquiz)

A common bioinformatics task is to align two sequences. Biopython has several tools for this activity. We will use `PairwiseAligner` from `Bio.Align`, the successor to the older `pairwise2` module (https://biopython.org/docs/1.76/api/Bio.pairwise2.html). Looping over the alignments it returns is also a chance to practice iteration.

In [None]:
from Bio import Align
seq1 = 'GATTACAGC' 
seq2 = 'GTATTAAT'
aligner=Align.PairwiseAligner(match_score=1.0, mode="local") #try mode global to see differences
alignments = aligner.align(seq1, seq2)
for alignment in alignments:
    print(alignment, alignment.score)

This can also align the glucose transporter FASTA sequences we used earlier. These two GLUT family transporters are *not* similar in sequence. 

In [None]:
second_record=record_glut[1] 
glut_align=aligner.align(first_record.seq[1:50], second_record.seq[1:50])
print(glut_align[0])
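
Where does an alignment score come from? The sketch below computes the best global alignment score with the classic Needleman-Wunsch dynamic programming idea. It is a teaching sketch, not Biopython's implementation: the function name `global_score` is hypothetical, the scoring values (match 1, mismatch -1, gap -1) are illustrative assumptions rather than the aligner's settings, and it returns only the score, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score for sequences a and b."""
    # score of aligning an empty prefix of a with each prefix of b
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]   # prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # a[i-1] sits opposite a gap
                current[j - 1] + gap,     # b[j-1] sits opposite a gap
            ))
        previous = current
    return previous[-1]

print(global_score("GATTACA", "GATTACA"))   # identical sequences
print(global_score("ACGT", "AGT"))          # one base missing
```

Changing the three scoring values changes which alignment is considered best, which is why the same pair of sequences can produce different scores under different settings.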

Now it's your turn. Try this quiz to check your coding (re-coding) skills!
What is the alignment score for the first global alignment between GLUT4 (first in the FASTA list) and GLUT1 (4th sequence) protein sequences? Use the WHOLE sequence. 


In [None]:
from jupyterquiz import display_quiz
glutAlignquiz="PythonQuizQuestions/glut_align_quiz.json"
display_quiz(glutAlignquiz)

## Fetching Records from NCBI Using Biopython

The public databases of bioinformatics data have built-in ways to access their extensive files **programmatically**, without needing to use graphical user interfaces (GUIs). This allows us to efficiently collect data for analysis and comparison in bioinformatics tasks.  
<br>
In Biopython, the modules for accessing these databases are found in **Entrez**. We must import both `Entrez` (for data fetching) and `SeqIO` (for reading and parsing sequence files).  
<br>
The commands below will **fetch and parse** a GenBank RefSeq file for the human *insulin receptor 2* protein.  
<br>
🔗 You can learn more about database names and how to use `efetch` from this [NCBI book chapter](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch).  
<br>
**Note:** To run these commands, you must provide a valid email address to `Entrez`.  
<br>
The data is conventionally read into a variable named `handle`, but any valid variable name can be used. Once the data is read, **you must close the connection** with:  
`handle.close()`
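
One common way to make sure a handle is always closed, even if reading fails partway through, is a `try`/`finally` block. The sketch below uses an in-memory `io.StringIO` object with made-up text as a stand-in for a network handle, so it runs without an internet connection. Applying the same pattern to an `Entrez` handle is an assumption based on that handle having a `close()` method.

```python
import io

# stand-in for a handle returned by a database request
handle = io.StringIO("LOCUS  made-up record text\n")

try:
    text = handle.read()   # work with the data while the handle is open
finally:
    handle.close()         # runs whether or not reading raised an error

print(handle.closed)
```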

In [None]:
from Bio import Entrez, SeqIO

Entrez.email = "jrchase@nnu.edu" #always tell Entrez who you are

handle = Entrez.efetch(db="protein", id="NP_001073285.1", rettype="gb")

humInsR2 = SeqIO.read(handle, "gb") #creates a variable to hold all of the record

handle.close()

print(type(humInsR2)) 

#print(humInsR2.seq)


Reading this GenBank file creates a variable of **class `SeqRecord`** (i.e., a *sequence record*), which behaves somewhat like a list—but also includes useful attributes such as an ID, a sequence, and other identifying information.  
<br>
We can explore what components are included in the file by requesting a **directory of all the available attributes** using the `dir()` function.

In this lesson, we'll focus on just a few key parts of the `SeqRecord`, highlighting the attributes most commonly used in bioinformatics workflows.  
<br>
For example, to access the description field of the record, you can write:

```python
humInsR2.description
```
You can replace `.description` with any other attribute listed in the output of `dir(humInsR2)`, although only a few are typically useful for common tasks.
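
The output of `dir()` includes many names that begin with an underscore, which are used internally by Python. A minimal sketch of filtering those out and reading each remaining attribute by name is shown here with a hypothetical stand-in object (`FakeRecord`, with made-up contents, not Biopython's `SeqRecord`).

```python
class FakeRecord:
    # hypothetical stand-in object for illustration
    def __init__(self):
        self.id = "rec_A"
        self.description = "rec_A made-up protein"
        self.seq = "MKTAYIAK"

record = FakeRecord()

# keep only the names that do not start with an underscore
public_names = [name for name in dir(record) if not name.startswith("_")]
print(public_names)

# getattr() reads an attribute whose name is stored in a string
for name in public_names:
    print(name, "=", getattr(record, name))
```

On a real record the filtered list will also contain methods, so not every name will hold a simple value.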

In [None]:
print(dir(humInsR2)[-13:-1]) #print() is needed because only the last line of a code box is displayed automatically
humInsR2.description

We can use the tools from earlier in this tutorial to evaluate the sequence portion (length, amino acid content, etc.). See what you can evaluate or measure from this protein.


In [None]:
def count_AA(seq, letter):
    return seq.count(letter) #counts the frequency in whatever seq was provided

print(len(humInsR2.seq))
count_AA(humInsR2.seq, "LG")
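
Counting one letter at a time works, but a full amino acid composition can be gathered in a single step with `collections.Counter` from the Python standard library. The sketch below uses a made-up peptide. For a fetched record, converting the sequence with `str(humInsR2.seq)` first is a safe way to hand it to `Counter`.

```python
from collections import Counter

peptide = "MKTAYIAKQR"           # made-up protein fragment

composition = Counter(peptide)   # maps each letter to how often it occurs
print(composition["K"])          # count for a single amino acid
print(composition.most_common(3))

# fraction of the sequence made up of alanine
fraction_A = composition["A"] / len(peptide)
print(fraction_A)
```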

### Test your knowledge

Now, it's your turn to import a protein sequence from the NCBI. This time, you should fetch a similar protein (AAA39318.1).

The quizzes will ask you about the protein sequence information.

In [None]:
from jupyterquiz import display_quiz
who_quiz="PythonQuizQuestions/who_quiz2.json"
display_quiz(who_quiz)

# Conclusion

In this tutorial, you have imported, used, and even made a function to carry out bioinformatics tasks that would be very time-consuming without Python.
<br>
You are ready to wrap up this unit and do a project where you use bioinformatics data obtained from databases in the [Sequence Project](./Submodule_1_Tutorial5_Project.ipynb).

## Clean up
Remember to shut down your Jupyter Notebook instance when you are done for the day to avoid unnecessary charges. You can do this by stopping the notebook instance from the Cloud console.