# Introduction to Bioinformatics

## 01 - Working with sequence information

###  This is our first hands-on session in the world of Bioinformatics!

Bioinformatics is an interdisciplinary field that combines biology and informatics. Why? Because a great deal of the work we do in biology is about data, and computers make processing that data easier and far more robust.

Our mission is to gain biological insight from our data. We ask questions, design experiments to answer them, collect the pertinent data, and analyze it. In the end, we infer new hypotheses from these results. This scheme is the scientific method at work and is as valid here as in any other context. To do bioinformatics is to do experiments with a computer. Of course, we first need to learn how to use a computer in a more rigorous and involved manner. This means knowing how to tell the computer to do our work for us or, better said, learning how to command it through programming.

Programming may seem daunting at the beginning of this course, but it is extremely rewarding in the end. Learning to program is learning a new language and, like any language, it opens new doors in how we think about and operate in our world. More specifically, and perhaps more importantly, programming allows us to carry out jobs that would be humanly impossible due to their time-consuming nature. A quote often attributed to Albert Einstein puts it this way:

"Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination."

In this first notebook, we will learn some basic things about Python, a general-purpose programming language that is extremely useful and ubiquitous nowadays. We will do so by learning how to work with sequence information. Much of the work in Bioinformatics is about sequences. Remember that the primary structures of DNA, RNA, and proteins can be represented as sequences of letters, so most of the bio-macromolecules in the living cell can be thought of as linear strings of code (quite reductionist, but this is just one dimension of the issue). We will discuss this further as we progress with the course; for now, let's focus on our baby steps.

### Before moving on, we need to have clarity about the following concepts:

- Python data types
- Python functions

### Working with files

Most data will be in text files. These files are generic and contain lines of plain text. To work with the data inside these files, we need to tell Python their locations to open them. The following code opens a text file and prints its content:

In [1]:
# Read file content
text = open('input/example.txt')

# Iterate over the file content (or lines)
for line in text:
    print(line) # print one line at a time
    
# Close the file to release it once we are done
text.close()

The text file "example.txt", inside the "input" folder, contains three lines of text that are printed in the notebook. Note how we passed the location of the file to the "open()" function and assigned the returned object to a variable called "text". Then, we iterated over "text" to print every "line" inside it. Finally, we closed the "text" object. Note the syntax used to iterate over an object: a "for ... in ...:" statement followed by an indented block.
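Each line read from a file keeps its trailing line break, and "print()" adds one of its own, so a loop like the one above produces double-spaced output. The following is a minimal sketch of one way to avoid this. It uses "io.StringIO" as an in-memory stand-in for an opened file, and the three lines of text are made up for illustration:

```python
import io

# In-memory stand-in for an opened text file (contents are made up)
text = io.StringIO("first line\nsecond line\nthird line\n")

cleaned = []
for line in text:
    # Each line still ends with "\n"; rstrip removes it before printing
    cleaned.append(line.rstrip("\n"))
    print(cleaned[-1])

text.close()
```

Passing "end=''" to "print()" is another common way to get the same result.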

The previous code is equivalent to:

In [2]:
cat = open('input/example.txt')

for mouse in cat:
    print(mouse)    
    
cat.close()

Python does not care what you name your variables, but we usually give them logical names to make human sense of what we are doing. Another important habit is commenting. Note that any text after a "#" symbol won't be interpreted as Python code. We write comments to remember what we are doing. With time, as you become more proficient with the language, comments should reflect the general ideas of the code. Documenting is fundamental if you share your code with others, so they know what you intended with it. Comments will also help the future you understand what you did. You'll be surprised how often you review old code and don't know what it is doing, even if you were the one who wrote it in the first place.

An alternative to the code above is the following:

In [3]:
with open('input/example.txt') as text_lines:
    for line in text_lines:
        print(line)    

In programming, writing the same code in a different way is known as alternative syntax. As in other languages, there are many ways to say the same thing, but some are more convenient than others. In this case, note that we did not close our "text_lines" object ourselves: the "with" statement closes the file automatically as soon as we leave its indented block.
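To see this behavior directly, we can inspect the "closed" attribute of the file object inside and outside the "with" block. This is only a small sketch: it writes a throwaway temporary file (not part of the course repository) so that it can run on its own:

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line one\nline two\n')
    path = tmp.name

with open(path) as text_lines:
    n_lines = sum(1 for _ in text_lines)
    inside = text_lines.closed   # False: still open inside the block

outside = text_lines.closed      # True: closed on leaving the block
print(n_lines, inside, outside)

# Remove the throwaway file
os.remove(path)
```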

### How does sequence data look?

The first thing we need to know when working with sequences is where to get them. We will use the UniProt database to look for the sequence of the human protein "ubiquitin" with code "P0CG48" (What does this protein do?). Let's enter the database and get the sequence in a format called fasta. Save this fasta file in the repository's "input" folder.

[UniProt](https://www.uniprot.org/)

Now we can read and print its content:

In [4]:
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:
        print(l)

The format is quite clear. First, a line starting with ">" contains the name and other data about the sequence. Then, the following lines contain the sequence itself.
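As a small illustration of this layout, the sketch below separates header lines from sequence lines. The record is made up for this example (the ID "X00000" and the sequence are not real UniProt data):

```python
# A made-up FASTA record, used only to illustrate the layout
record = """>sp|X00000|EXAMPLE_PROT Hypothetical example protein
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQ
"""

header_lines = []
sequence_lines = []
for line in record.splitlines():
    if line.startswith('>'):
        header_lines.append(line)
    else:
        sequence_lines.append(line)

print(len(header_lines), 'header line')
print(len(sequence_lines), 'sequence lines')
```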

We learned how to print the file's content, but how do we save and manipulate this information?

### Storing sequence data

To save the information, we need to create a variable to contain our data and then append the info to this variable at every iteration:

In [5]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:
        ubq_sequence = ubq_sequence+l
print(ubq_sequence)

Now we can access our file's content through "ubq_sequence" without reopening the file every time. However, our "ubq_sequence" string contains the fasta file data as-is, header and line breaks included, which is not very useful for most practical purposes. We want our variable to hold only the sequence, without the header line or any line breaks:

In [6]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:
        # Only work with lines that do not begin with ">"
        if not l.startswith('>'):
            # The strip method deletes line breaks and spaces at the beginning or the end of the string
            ubq_sequence = ubq_sequence+l.strip()
print(ubq_sequence)
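Adding strings together inside a loop works well for small files like this one. A common alternative, sketched here, is to collect the pieces in a list and join them once at the end. The lines below are made up and stand in for the lines read from a file:

```python
# Made-up FASTA lines standing in for the lines read from a file
fasta_lines = ['>sp|X00000|EXAMPLE_PROT Hypothetical protein\n',
               'MQIFVKTLTG\n',
               'KTITLEVEPS\n']

chunks = []
for l in fasta_lines:
    if not l.startswith('>'):
        chunks.append(l.strip())

# Join all chunks at once, with nothing in between
sequence = ''.join(chunks)
print(sequence)
```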

This code starts to look better, but when we dropped the line starting with ">" we lost the reference information about the sequence. If we were working with many sequences, this would rapidly become chaotic because we could not tell which sequence is which.

There are many ways to track information in Python. Here we will use Python dictionaries. A dictionary is a particular Python object with "keys" as entries and "values" as outputs, very similar to a standard dictionary. We use this Python object to store the sequence together with its reference name. First, let's see how we could get the UniProt reference ID as a separate string.

In [7]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:        
        if l.startswith('>'):
            
            ## This will be printed to explain what we are doing here ##
            
            print('1:')
            print(l)
            print('2:')
            print(l.strip())
            print('3:')
            print(l.split())
            print('4:')
            print(l.split('|'))
            print('5:')
            print(l.split('|')[1])
            
            ubq_uniprot_id = l.split('|')[1]
            
print('6:')
print(ubq_uniprot_id)
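The same splitting steps can be tried on a header string without opening the file. The header below is an assumption about what a typical UniProt header line for this entry looks like; the exact text in your downloaded file may differ:

```python
# Assumed example of a UniProt-style header line (the real file may differ)
header = '>sp|P0CG48|UBC_HUMAN Polyubiquitin-C OS=Homo sapiens OX=9606 GN=UBC PE=1 SV=3\n'

# Remove the line break, then cut the line at every "|"
fields = header.strip().split('|')

database = fields[0].lstrip('>')     # text before the first "|", without ">"
uniprot_id = fields[1]               # text between the two "|" symbols
entry_name = fields[2].split()[0]    # first word after the second "|"

print(database, uniprot_id, entry_name)
```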

In [8]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:        
        if l.startswith('>'):
            ubq_uniprot_id = l.split('|')[1]
        # Every line that did not match the if statement above is a sequence line
        else:
            ubq_sequence = ubq_sequence+l.strip()

# Declare dictionary to store id + sequence
sequences = {ubq_uniprot_id : ubq_sequence}

In [9]:
print(sequences)

Nice! This code looks much better.
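Before moving on, here is a short sketch of the most common things we do with a dictionary: looking up a value by its key, adding an entry, checking whether a key exists, and iterating over its items. The IDs and sequences are made up for illustration:

```python
# Made-up IDs and short sequences, for illustration only
sequences_demo = {'ID_A': 'MQIFVK', 'ID_B': 'MKTAYIAK'}

# Look up a value by its key
print(sequences_demo['ID_A'])

# Add a new entry
sequences_demo['ID_C'] = 'GGV'

# Check whether a key exists before using it
print('ID_Z' in sequences_demo)

# Iterate over keys and values together
lengths = {}
for seq_id, seq in sequences_demo.items():
    lengths[seq_id] = len(seq)
print(lengths)
```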

### Calculating over our sequence data 

Let's ask some questions of our sequence. One simple question is its amino acid composition, which we can now calculate very easily. First, we get the non-redundant set of letters in our sequence together with a list containing the sequence as individual elements:

In [10]:
# Store letters inside a set
letters_set = set()
# Store letters inside a list
letters_list = []

for l in sequences['P0CG48']:
    letters_set.add(l)
    letters_list.append(l)
    
print('Letters set:')
print(letters_set)
print('Letters list:')
print(letters_list)

Can you tell the difference between a list and a set object?

Now let's calculate our composition and store it in a dictionary as percentages:

In [11]:
composition = {}
for l in letters_set:
    # Store composition for each amino acid
    composition[l] = letters_list.count(l)/len(letters_list)*100.0 # Composition formula
print(composition)

{'R': 5.255474452554744, 'Y': 1.313868613138686, 'L': 11.824817518248175, 'E': 7.883211678832117, 'P': 3.9416058394160585, 'V': 5.401459854014599, 'K': 9.197080291970803, 'G': 7.883211678832117, 'F': 2.627737226277372, 'T': 9.197080291970803, 'N': 2.627737226277372, 'M': 1.313868613138686, 'H': 1.313868613138686, 'A': 2.627737226277372, 'S': 3.9416058394160585, 'D': 6.569343065693431, 'Q': 7.883211678832117, 'I': 9.197080291970803}


This code would be even better if we printed our composition sorted by percentage value:

In [12]:
print('Ubiquitin composition:')
# This is how you sort a dictionary by values in reverse order
for aa, percentage in sorted(composition.items(), key=lambda item: item[1], reverse=True):
    print(aa, percentage)

Ubiquitin composition:
L 11.824817518248175
K 9.197080291970803
T 9.197080291970803
I 9.197080291970803
E 7.883211678832117
G 7.883211678832117
Q 7.883211678832117
D 6.569343065693431
V 5.401459854014599
R 5.255474452554744
P 3.9416058394160585
S 3.9416058394160585
F 2.627737226277372
N 2.627737226277372
A 2.627737226277372
Y 1.313868613138686
M 1.313868613138686
H 1.313868613138686
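The counting and sorting steps above can also be done with "collections.Counter" from the Python standard library. This is an alternative sketch, not a replacement for the cells above, and it uses a short sequence fragment so it can run on its own:

```python
from collections import Counter

# Short fragment used only for illustration
demo_sequence = 'MQIFVKTLTGKTITLEVEPS'

# Counter counts every letter in one pass
counts = Counter(demo_sequence)
total = len(demo_sequence)

# most_common() returns (letter, count) pairs sorted from most to least frequent
composition_demo = [(aa, n / total * 100.0) for aa, n in counts.most_common()]
for aa, percentage in composition_demo:
    print(aa, percentage)
```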


Can we calculate the theoretical mass of our protein? 

Of course, but first, we need the mass of each amino acid:

In [13]:
# Dictionary with approximate amino acid residue masses in daltons (Da)
aa_masses = {'A': 71.0, 'C': 103.0, 'D': 114.0, 'E': 128.0, 'F': 147.0, 
             'G': 57.0, 'H': 138.0, 'I': 113.0, 'K': 128.0, 'L': 113.0, 
             'M': 131.0, 'N': 114.0, 'P': 97.0, 'Q': 128.0, 'R': 157.0, 
             'S': 87.0, 'T': 101.0, 'V': 99.0, 'W': 186.0, 'Y': 163.0}

protein_mass = 0

for aa in sequences['P0CG48']:
    protein_mass += aa_masses[aa]
    
print('Ubiquitin mass:')
print(protein_mass, 'Da')

Ubiquitin mass:
76878.0 Da


[Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence](https://web.expasy.org/compute_pi/)

Why do you think these values differ?
How would you improve our calculator?
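One possible direction, sketched here under stated assumptions: the table above holds rounded masses of amino acid residues (the amino acid as it appears inside a chain), so a free chain is heavier by one water molecule, and using more decimals in the table reduces rounding error. The function name "chain_mass" and the three-entry mass table are made up for this example:

```python
WATER_MASS = 18.015  # average mass of one water molecule, in Da

def chain_mass(sequence, residue_masses):
    """Sum residue masses and add one water for the free chain ends."""
    return sum(residue_masses[aa] for aa in sequence) + WATER_MASS

# Average residue masses for three amino acids (Da), for illustration only
demo_masses = {'G': 57.05, 'A': 71.08, 'V': 99.13}
print(round(chain_mass('GAV', demo_masses), 3))
```

This is only a starting point; other effects, such as modifications made to the protein in the cell, are not covered by this sketch.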

Can you calculate the mass-averaged composition of our protein?

The P0CG48 entry is a polyprotein: several copies of ubiquitin joined head to tail in a single chain. Can you find the largest repeated motif?

To answer this question, we need to learn how to index and slice strings. We show this with examples:

In [14]:
my_sequence = sequences['P0CG48']
# Print full sequence
print(my_sequence)

# Index from the beginning (the first position is 0)
print(my_sequence[0])
print(my_sequence[19])

# Index from the end
print(my_sequence[-1])
print(my_sequence[-20])

# Slice in a range
print(my_sequence[0:19])
print(my_sequence[-20:-1])

MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGV
M
S
V
D
MQIFVKTLTGKTITLEVEP
DYNIQKESTLHLVLRLRGG
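Note that a slice includes its start position but excludes its end position, which is why the last slice above ends in "GG" and leaves out the final "V". The short sketch below summarizes these rules on a made-up ten-letter string:

```python
peptide = 'MQIFVKTLTG'  # short example string

first = peptide[0]          # indexing: one character, counting from 0
last = peptide[-1]          # negative indices count from the end
head = peptide[0:4]         # slice: start included, end excluded
tail = peptide[-4:]         # omit the end to reach the last character
every_other = peptide[::2]  # a third number sets the step

print(first, last, head, tail, every_other)
```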


We can use this to get how many times a sequence segment is contained in the sequence:

In [15]:
motif = my_sequence[0:19]
n_times = my_sequence.count(motif)
print(n_times, motif)

9 MQIFVKTLTGKTITLEVEP


In [16]:
# Count the number of times each prefix of the sequence is repeated
# Note that this only considers motifs that start at the beginning of the sequence
# Keys are counts; each key keeps the longest prefix found with that count
motifs = {}
for i in range(len(sequences['P0CG48'])):
    motif = sequences['P0CG48'][:i]
    count = sequences['P0CG48'].count(motif)
    motifs[count] = motif
print(motifs)

{686: '', 9: 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG', 4: 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG', 3: 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG', 2: 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG', 1: 'MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFA

Can you get the most repeated four-letter motif inside the sequence?

In [17]:
motifs = {}
# Stop at the last position where a full four-letter window fits
for i in range(len(sequences['P0CG48'])-3):
    motif = sequences['P0CG48'][i:i+4]
    count = sequences['P0CG48'].count(motif)
    motifs[motif] = count
print(motifs)

{'MQIF': 9, 'QIFV': 9, 'IFVK': 9, 'FVKT': 9, 'VKTL': 9, 'KTLT': 9, 'TLTG': 9, 'LTGK': 9, 'TGKT': 9, 'GKTI': 9, 'KTIT': 9, 'TITL': 9, 'ITLE': 9, 'TLEV': 9, 'LEVE': 9, 'EVEP': 9, 'VEPS': 9, 'EPSD': 9, 'PSDT': 9, 'SDTI': 9, 'DTIE': 9, 'TIEN': 9, 'IENV': 9, 'ENVK': 9, 'NVKA': 9, 'VKAK': 9, 'KAKI': 9, 'AKIQ': 9, 'KIQD': 9, 'IQDK': 9, 'QDKE': 9, 'DKEG': 9, 'KEGI': 9, 'EGIP': 9, 'GIPP': 9, 'IPPD': 9, 'PPDQ': 9, 'PDQQ': 9, 'DQQR': 9, 'QQRL': 9, 'QRLI': 9, 'RLIF': 9, 'LIFA': 9, 'IFAG': 9, 'FAGK': 9, 'AGKQ': 9, 'GKQL': 9, 'KQLE': 9, 'QLED': 9, 'LEDG': 9, 'EDGR': 9, 'DGRT': 9, 'GRTL': 9, 'RTLS': 9, 'TLSD': 9, 'LSDY': 9, 'SDYN': 9, 'DYNI': 9, 'YNIQ': 9, 'NIQK': 9, 'IQKE': 9, 'QKES': 9, 'KEST': 9, 'ESTL': 9, 'STLH': 9, 'TLHL': 9, 'LHLV': 9, 'HLVL': 9, 'LVLR': 9, 'VLRL': 9, 'LRLR': 9, 'RLRG': 9, 'LRGG': 9, 'RGGM': 8, 'GGMQ': 8, 'GMQI': 8, 'RGGV': 1}
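One detail to keep in mind is that the string method "count()" only counts occurrences that do not overlap. For motifs that can overlap with themselves, counting every window gives a different answer. The sketch below shows the difference with a hypothetical helper, "count_kmers", and a made-up string chosen to contain overlapping repeats:

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every window of length k, including overlapping ones."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

demo = 'AAAAGAAAA'  # made-up string with overlapping repeats
kmers = count_kmers(demo, 2)

# Most frequent two-letter window, counting overlaps
print(kmers.most_common(1))
# Non-overlapping count from the string method
print(demo.count('AA'))
```

Here the window count for "AA" is 6, while "count()" reports 4.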


### Multiple sequence data

Most of the time, we will want to compare multiple sequences to know how they relate to each other. When these sequences are evolutionarily related, they are known as homologous sequences. Now that we are familiar with single sequences, let's try to read a multiple sequence fasta file into a dictionary.

To obtain the data, we will [BLAST](https://en.wikipedia.org/wiki/BLAST_(biotechnology)) our sequence and download the resulting fasta file. We will run the BLAST algorithm with the P0CG48 code on the UniProt website:

[UniProt BLAST page](https://www.uniprot.org/blast/)

Now we put the fasta file into a dictionary:

In [18]:
sequences = {}
sequence = None # None marks that no header line has been read yet

# Open the multiple sequences fasta file
with open('input/P0CG48_BLAST_sequences.fasta') as ff:
    for l in ff:        
        # Gather sequence lines only
        if not l.startswith('>'):
            # Merge these lines into a single string
            sequence += l.strip()
        # Match fasta name lines (starting with >)
        else:
            # Skip at the first header, when there is no previous entry to save
            if sequence is not None:
                # Save the previous entry in the dictionary
                sequences[uniprot_id] = sequence
            uniprot_id = l.split('|')[1]
            # Empty the sequence string 
            sequence = ''
            
    # After the final iteration the last entry is still missing from the dictionary
    sequences[uniprot_id] = sequence

# Print the dictionary entries
print(len(sequences))
print(sequences.keys())

50
dict_keys(['P0CG61', 'P0CG64', 'P0CG48', 'L8IDB8', 'A0A4W2CUP5', 'I5AMR3', 'P0CG69', 'A0A0N8P0W8', 'A4V1F9', 'B3M7Z7', 'Q63429', 'P0CG50', 'A0A672GE88', 'A0A3Q1HML6', 'F1LML2', 'A0A2C9F3G2', 'A0A5G2QDJ2', 'F7E910', 'E9GJF3', 'Q17E99', 'A0A6I8T8L2', 'A0A3R7Q158', 'A0A423TTK6', 'A0A3Q0JF84', 'A0A6I8VHL7', 'A0A4C1XI51', 'A0A4C1UZ89', 'A0A2A4JQD5', 'A0A0Q9XB42', 'A0A2A4JQG3', 'B0W973', 'F7IUE6', 'A0A182G3H1', 'B4KXU3', 'D6WQT5', 'A0A3S1BV23', 'P0CH28', 'B4HTV7', 'A0A653D7M7', 'A0A673BED4', 'A0A3B5PX76', 'A0A665TEB5', 'A0A667ZZN4', 'T1FVF6', 'A0A0B1PSR6', 'A0A0T6B4N3', 'A0A0R3WDD8', 'A0A3Q3FL84', 'A0A498MSM4', 'A0A671YE85'])


### Declaring python functions

Note that the previous code is sufficiently generic to be applied to any fasta file. You can reuse the code as many times as you want if you wrap it as a Python function. If the functions you write are generalizable, they can be applied in many different contexts, saving us a lot of typing!

We will convert the above code into a function. This function will get as input the path to the fasta file, and it will return a dictionary containing the sequences. Also, the function must have proper documentation. The documentation should include the definition and usage of the function. In this way, the function can be understood and used by others.

To declare a function, we begin with the word "def", followed by the function's name and, in parentheses, its inputs and keywords. The indented block below this line is the body of the function. When the function reaches a "return" statement, it stops and hands back the given value, which you can then store in a variable.

```
def myfunction(inputs, keywords=None):
    """
    Here goes the documentation
    """
    
    # Here goes the code
    
    # Here goes the return statement
    
```  
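As a concrete instance of this template, here is a small sketch that wraps the composition calculation from earlier in a function. The name "sequence_composition" is made up for this example:

```python
def sequence_composition(sequence):
    """
    Calculates the percentage of each letter in a sequence.

    Parameters
    ----------
    sequence : str
        Sequence as a string of one-letter codes.

    Returns
    -------
    composition : dict
        Dictionary with letters as keys and percentages as values.
    """
    composition = {}
    for letter in set(sequence):
        composition[letter] = sequence.count(letter) / len(sequence) * 100.0
    return composition

# The returned dictionary is stored in a variable
result = sequence_composition('GGAV')
print(result['G'])
```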
     
Let's declare a function to read fasta files as dictionaries:

In [19]:
def readFasta(input_fasta):
    """
    Reads a fasta file and returns the sequence data as a dictionary.
    
    Parameters
    ----------
    input_fasta : str
        Path to the input fasta file
    
    Returns
    -------
    sequences : dict
        Dictionary containing the IDs and sequences in the fasta file.
    """
    sequences = {}
    sequence = None
    with open(input_fasta) as ff:
        for l in ff:        
            if not l.startswith('>'):
                sequence += l.strip()
            else:
                if sequence is not None:
                    sequences[uniprot_id] = sequence
                uniprot_id = l.split('|')[1]
                sequence = ''
        sequences[uniprot_id] = sequence
    
    return sequences

Once the function has been created, we can call it over and over to process fasta files:

In [20]:
sequences = readFasta('input/P0CG48_BLAST_sequences.fasta')
print(sequences.keys())

dict_keys(['P0CG61', 'P0CG64', 'P0CG48', 'L8IDB8', 'A0A4W2CUP5', 'I5AMR3', 'P0CG69', 'A0A0N8P0W8', 'A4V1F9', 'B3M7Z7', 'Q63429', 'P0CG50', 'A0A672GE88', 'A0A3Q1HML6', 'F1LML2', 'A0A2C9F3G2', 'A0A5G2QDJ2', 'F7E910', 'E9GJF3', 'Q17E99', 'A0A6I8T8L2', 'A0A3R7Q158', 'A0A423TTK6', 'A0A3Q0JF84', 'A0A6I8VHL7', 'A0A4C1XI51', 'A0A4C1UZ89', 'A0A2A4JQD5', 'A0A0Q9XB42', 'A0A2A4JQG3', 'B0W973', 'F7IUE6', 'A0A182G3H1', 'B4KXU3', 'D6WQT5', 'A0A3S1BV23', 'P0CH28', 'B4HTV7', 'A0A653D7M7', 'A0A673BED4', 'A0A3B5PX76', 'A0A665TEB5', 'A0A667ZZN4', 'T1FVF6', 'A0A0B1PSR6', 'A0A0T6B4N3', 'A0A0R3WDD8', 'A0A3Q3FL84', 'A0A498MSM4', 'A0A671YE85'])


This is a very powerful way of programming. By making functions you can reuse any code previously made and put it to good use in different contexts.

To get help for a function you just call:

In [21]:
help(readFasta)

Help on function readFasta in module __main__:

readFasta(input_fasta)
    Reads a fasta file and returns the sequence data as a dictionary.
    
    Parameters
    ----------
    input_fasta : str
        Path to the input fasta file
    
    Returns
    -------
    sequences : dict
        Dictionary containing the IDs and sequences in the fasta file.



This prints out the documentation of the function. Good documentation should give you all you need to understand how the function is used.
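One way to check a function like this without needing a file on disk is to feed it lines from memory. The sketch below uses a hypothetical variant, "parse_fasta_lines", that takes any iterable of lines instead of a path, together with two made-up records. It stores each entry as soon as its header is read, so the last record needs no special handling:

```python
import io

def parse_fasta_lines(lines):
    """Hypothetical variant of readFasta that takes any iterable of lines."""
    sequences = {}
    uniprot_id = None
    for l in lines:
        if l.startswith('>'):
            # Start a new, empty entry for this header
            uniprot_id = l.split('|')[1]
            sequences[uniprot_id] = ''
        elif uniprot_id is not None:
            # Append sequence lines to the current entry
            sequences[uniprot_id] += l.strip()
    return sequences

# Two made-up records held in memory instead of a file
fake_file = io.StringIO('>sp|ID_A|NAME_A first\nMQIF\nVKTL\n>sp|ID_B|NAME_B second\nGGV\n')
parsed = parse_fasta_lines(fake_file)
print(parsed)
```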

### Wrapping up

In this first practice session, we learned:

- How to use Python to read files.
- How to iterate the content of a file.
- How sequence data is organized in fasta files.
- How to employ sets, lists, and dictionaries to store data.
- How to make calculations using Python strings.
- How to write a Python function.