# Introduction to Bioinformatics

## 01 - Working with sequence information

###  This is our first hands-on session into the world of Bioinformatics!

Bioinformatics is an interdisciplinary field that combines Biology and Informatics. Why? Because a great deal of the work we do in biology is about data, and computers make processing this data easier and far more robust.

Our mission is to gain biological insight from our data. First, we ask questions, then design how to answer them, collect the pertinent data, and analyze it. Finally, we infer new hypotheses from it. This scheme is the scientific method at work, and it is as valid here as in any other scientific field. To do bioinformatics is to do experiments with a computer. Of course, we first need to learn how to use a computer in a more rigorous and involved manner. This means knowing how to talk to the computer so it can understand and carry out our workload for us or, more correctly, learning how to command it through programming.

Programming may be a daunting task at the beginning, but it is extremely rewarding in the end. Learning to program is learning a new language and, like any language, it opens new doors on how we think about our world. More specifically, and perhaps more importantly, programming allows us to carry out jobs that would be humanly impossible due to their time-consuming nature. A quote often attributed to Albert Einstein says:

"Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination."

In this first notebook, we will learn some basic things about Python, a general-purpose programming language that is nowadays extremely useful and ubiquitous. We will do so by learning how to work with sequence information. Much of the work in Bioinformatics is about sequences. Remember that DNA, RNA, and proteins can be represented as strings of letters, so most of the bio-macromolecules in the living cell can be thought of as linear sequences (quite reductionist, but this is just one dimension of the issue). We will discuss this further later in the course; for now, let's focus on our baby steps.

### Before moving on, we need to have clarity about the following concepts:

- Python data
- Data types
- Python function
- Python method

- Line break
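
As a quick orientation, the sketch below touches each of these concepts once. The variable names and the short example string are made up for illustration and are not part of the course data.

```python
# A made-up piece of data: a short string
dna = 'ATGC'

# Every piece of data has a type
print(type(dna))      # <class 'str'>
print(type(4))        # <class 'int'>
print(type(4.0))      # <class 'float'>

# A function is called with the data as its argument
n_letters = len(dna)
print(n_letters)

# A method is a function attached to the data, called with a dot
lowercase = dna.lower()
print(lowercase)

# A line break is the special character '\n' inside a string
two_lines = 'ATGC\nGGCC'
print(two_lines)
```

Note that "len()" takes the string as an argument, while "lower()" is called on the string itself.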

### Working with files

Most data will be in text files. These are generic files that contain lines of plain text. To work with the data inside these files, we need to tell Python their location and how to open them. The following code opens a text file and prints its content:

In [None]:
# Open the file for reading
text = open('input/example.txt')

# Iterate over the file content, one line at a time
for line in text:
    print(line) # print one line at a time

# Close the file to release it back to the operating system
text.close()

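The text file "example.txt", inside the "input" folder, contains three lines of text, which are printed in the notebook. Note how the location of the file was passed to the "open()" function and that the output returned by that function is assigned to a variable called "text". Then, the "text" variable is iterated over to print every "line" inside it. Finally, the "text" object is closed. Note the syntax used to iterate over an object: the "for ... in ...:" statement ends with a colon, and the code to repeat is indented below it. Also note that the printed lines are separated by empty lines, because each line read from the file keeps its own line break and "print()" adds another one.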

This is equivalent to:

In [None]:
cat = open('input/example.txt')

for mouse in cat:
    print(mouse)    
    
cat.close()

Python does not care what you call your variables, but we usually give them logical names to make human sense of what we are doing. Another important thing is comments. Note that any text after a "#" symbol won't be interpreted as Python code. We write comments to remember what we are doing. With time, as you become more proficient with the language, comments should reflect the general ideas of what you are doing. This is fundamental if your code is going to be shared with others, so they know what you intended with the code. Comments will also help the future you to know what you did. You'll be surprised how many times you review old code and don't know what it is doing, even if you were the one who wrote it in the first place.

An alternative to the code above is the following:

In [None]:
with open('input/example.txt') as text_lines:
    for line in text_lines:
        print(line)    

This is what in programming is called an alternative syntax. As in other languages, there are many ways to say the same thing, but some ways are more convenient. In this case, note that we did not close our "text_lines" object ourselves. The "with" statement closes the file object automatically as soon as we leave its indented block.
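
A minimal sketch to check that the file really is closed once the "with" block ends. It assumes a throwaway file created on the fly (the name "demo.txt" and its content are hypothetical) so that it runs without the course files. File objects have a "closed" attribute that tells whether the file is still open:

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere
# (the file name and its content are made up for this sketch)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

with open(path) as text_lines:
    for line in text_lines:
        # end='' avoids printing a second line break after each line
        print(line, end='')
    inside = text_lines.closed   # checked while still inside the block

outside = text_lines.closed      # checked after leaving the block
print(inside, outside)
```

The last line prints "False True": the file is open inside the block and closed after it.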

### What does sequence data look like?

The first thing we need to know when working with sequences is where to get them. We will use the UniProt database to look for the sequence of the human protein "ubiquitin" with accession code "P0CG48" (what does this protein do?). Let's enter the database and get the sequence in a format called FASTA. Save this FASTA file in our "input" folder.

[UniProt](https://www.uniprot.org/)

Now we read and print its content:

In [None]:
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:
        print(l)

The format is quite clear: first comes a line that starts with ">", containing the name and other data about the sequence, followed by all the lines containing the sequence itself.

This is OK for displaying the contents of the file, but how do we save and manipulate this information?

### Storing sequence data

For this we need to create a variable to contain our data and append the information at every iteration:

In [None]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:
        ubq_sequence = ubq_sequence+l
print(ubq_sequence)

Now the content of our file can be accessed through "ubq_sequence" without the need to reopen the file every time. However, we would like to get this in a more convenient way. Our "ubq_sequence" string variable contains the data exactly as it is in the FASTA file, which is not very useful for many practical purposes. We would like our variable to contain only the sequence, without the header or any line breaks:

In [None]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:
        # Only work with lines that do not begin with ">"
        if not l.startswith('>'):
            # The strip method removes line breaks and spaces at the beginning and the end of the string
            ubq_sequence = ubq_sequence+l.strip()
print(ubq_sequence)

This starts to look better, but it would be even better not to lose the reference information of the sequence. When we dropped the line with ">", we lost the information that identifies our sequence. If we were working with many sequences, this would rapidly become chaotic, because we could not keep track of which sequence is which.

There are many ways to track information in Python; here we will use what is called a dictionary. This is a special Python object that has "keys" as entries and "values" as outputs, very similar to a standard dictionary. We use this kind of object to store the sequence together with its reference. First, let's see how we could get the UniProt reference ID as a separate string:

In [None]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:        
        if l.startswith('>'):
            
            ## This will be printed to explain what we are doing here ##
            
            print('1:')
            print(l)
            print('2:')
            print(l.strip())
            print('3:')
            print(l.split())
            print('4:')
            print(l.split('|'))
            print('5:')
            print(l.split('|')[1])
            
            ubq_uniprot_id = l.split('|')[1]
            
print('6:')
print(ubq_uniprot_id)

In [None]:
ubq_sequence = ''
with open('input/P0CG48.fasta') as fasta:
    for l in fasta:        
        if l.startswith('>'):
            ubq_uniprot_id = l.split('|')[1]
        # This matches every line that did not match the if statement above
        else:
            ubq_sequence = ubq_sequence+l.strip()

# Declare dictionary to store id + sequence
sequences = {ubq_uniprot_id : ubq_sequence}

In [None]:
print(sequences)

Nice! This looks much better.
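
Since dictionaries will come up again, here is a minimal sketch of the basic operations on one. The IDs and sequences below are invented placeholders, not real UniProt entries, and the name "toy_sequences" is hypothetical.

```python
# Made-up IDs and sequences, for illustration only
toy_sequences = {'ID0001': 'MKV'}

# Add a second entry: the key goes in square brackets
toy_sequences['ID0002'] = 'MQIF'

# Look up a value by its key
print(toy_sequences['ID0001'])

# Check whether a key is present
print('ID0002' in toy_sequences)
print('ID0003' in toy_sequences)

# Iterate over keys and values together
for seq_id, seq in toy_sequences.items():
    print(seq_id, len(seq))
```

Each key can appear only once, so assigning to an existing key replaces its value instead of adding a second entry.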

### Calculating over our sequence data 

Let's ask some questions about our sequence. One basic question is its amino acid composition. We can calculate this now very easily. First, we get the non-redundant set of letters in our sequence, together with a list containing the sequence as individual letters:

In [None]:
# Store letters inside a set
letters_set = set()
# Store letters inside a list
letters_list = []

for l in sequences['P0CG48']:
    letters_set.add(l)
    letters_list.append(l)
    
print('Letters set:')
print(letters_set)
print('Letters list:')
print(letters_list)

Can you tell the difference between a list and a set object?
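
One way to explore this question is to build both objects from the same short string and compare them. The string below is made up for illustration, and the variable names are hypothetical.

```python
toy = 'MKKLLM'

toy_list = list(toy)   # one element per letter
toy_set = set(toy)     # built from the same letters

# Compare the sizes and contents of both objects
print(len(toy_list), toy_list)
print(len(toy_set), sorted(toy_set))

# A list can be indexed by position and can count repeated elements
print(toy_list[0], toy_list.count('K'))
```

The set is passed through "sorted()" before printing because a set has no fixed order.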

Now let's calculate our composition and store it in a dictionary as percentages:

In [None]:
composition = {}
for l in letters_set:
    # Store composition for each amino acid
    composition[l] = letters_list.count(l)/len(letters_list)*100.0 # Composition formula
print(composition)

This would be even better if we printed our composition sorted by percentage value:

In [None]:
print('Ubiquitin composition:')
# This is how you sort a dictionary by values in reverse order
for aa, percentage in sorted(composition.items(), key=lambda item: item[1], reverse=True):
    print(aa, percentage)

Can we calculate the theoretical mass of our protein? 

Of course, we just need the mass of each amino acid:

In [None]:
# Dictionary with approximate amino acid residue masses in Dalton
aa_masses = {'A': 71.0, 'C': 103.0, 'D': 115.0, 'E': 129.0, 'F': 147.0, 
             'G': 57.0, 'H': 137.0, 'I': 113.0, 'K': 128.0, 'L': 113.0, 
             'M': 131.0, 'N': 114.0, 'P': 97.0, 'Q': 128.0, 'R': 156.0, 
             'S': 87.0, 'T': 101.0, 'V': 99.0, 'W': 186.0, 'Y': 163.0}

protein_mass = 0

for aa in sequences['P0CG48']:
    protein_mass += aa_masses[aa]
    
print('Ubiquitin mass:')
print(protein_mass, 'Da')

Compare this value with the one obtained from a prediction server:

[Compute pI/Mw for Swiss-Prot/TrEMBL entries or a user-entered sequence](https://web.expasy.org/compute_pi/)

Why do you think these values differ?
How would you improve our calculator?

Can you calculate the mass-averaged composition of our protein?

The ubiquitin entry we downloaded is a polyprotein. Can you find the longest repeated motif?

To answer this question, we need to learn how to index and slice strings. We show this with examples:

In [None]:
my_sequence = sequences['P0CG48']
# Print full sequence
print(my_sequence)

# Index from the beginning (the first position is 0)
print(my_sequence[0])
print(my_sequence[19])

# Index from the end (the last position is -1)
print(my_sequence[-1])
print(my_sequence[-20])

# Slice a range (the start is included, the end is excluded)
print(my_sequence[0:19])
print(my_sequence[-20:-1])

We can use this to count how many times a sequence segment is contained in the sequence:

In [None]:
motif = my_sequence[0:19]
n_times = my_sequence.count(motif)
print(n_times, motif)

In [None]:
# Count the number of times a motif appears in the sequence
# Note that this only tests motifs that start at the beginning of the sequence
motifs = {}
# Start at 1 to skip the empty motif
for i in range(1, len(sequences['P0CG48'])+1):
    motif = sequences['P0CG48'][:i]
    n_times = sequences['P0CG48'].count(motif)
    # Longer motifs overwrite shorter ones that have the same count
    motifs[n_times] = motif
print(motifs)

Can you get the most repeated four-letter motif inside the sequence?

In [None]:
motifs = {}
# Stop three positions before the end so that every motif has four letters
for i in range(len(sequences['P0CG48'])-3):
    motif = sequences['P0CG48'][i:i+4]
    n_times = sequences['P0CG48'].count(motif)
    motifs[motif] = n_times
print(motifs)

### Multiple sequence data

Most of the time, we will want to compare multiple sequences to know how they are related to each other. When these sequences are evolutionarily related, they are known as homologous sequences. Now that we are familiar with working with a single sequence, let's try to read a multiple-sequence FASTA file into a dictionary.
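
As a starting point, one possible sketch extends the single-sequence loop so that every ">" line opens a new dictionary entry. It assumes UniProt-style headers, with the ID between the first two "|" characters, and it reads a made-up two-record text held in memory instead of a real file. The names "toy_fasta", "toy_sequences" and "current_id" are hypothetical, as are the IDs and sequences.

```python
import io

# Made-up two-record FASTA text standing in for a real file;
# the header layout mimics the UniProt header used earlier
toy_fasta = io.StringIO(
    '>sp|ID0001|first toy entry\n'
    'MKVL\n'
    'AAGT\n'
    '>sp|ID0002|second toy entry\n'
    'MQIF\n'
)

toy_sequences = {}
current_id = None
for line in toy_fasta:
    if line.startswith('>'):
        # A header line starts a new record
        current_id = line.split('|')[1]
        toy_sequences[current_id] = ''
    elif current_id is not None:
        # Sequence lines are appended to the most recent record
        toy_sequences[current_id] += line.strip()

print(toy_sequences)
```

With a real file, "io.StringIO(...)" would be replaced by "open()" inside a "with" block, as in the earlier cells.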