# Lesson 2 - 2024/10/17

## Functions

A function is a group of statements that performs a specific task and is executed only when called.

In [1]:
def valid_sequence(sequence):
    for c in sequence:
        if c not in ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']:
            return False

    return True

The `return` statement ends the function and hands a value back to the caller.
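
As a small illustration of how `return` behaves, the sketch below uses a hypothetical helper, `first_invalid`, that is not part of the lesson. It suggests two things: `return` stops the function immediately, and a function that reaches its end without a `return` gives back `None`.

```python
def first_invalid(sequence, valid_characters):
    """Return the first character not in valid_characters (hypothetical helper)."""
    for c in sequence:
        if c not in valid_characters:
            return c  # the function stops here; the rest of the loop is skipped
    # No explicit return: Python returns None when the end is reached


print(first_invalid('MNKM0LV', 'ACDEFGHIKLMNPQRSTVWY'))  # 0
print(first_invalid('MNKMDLV', 'ACDEFGHIKLMNPQRSTVWY'))  # None
```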

In [2]:
print('MNKMDLV:', valid_sequence('MNKMDLV'))
print('MNKM0LV:', valid_sequence('MNKM0LV'))

MNKMDLV: True
MNKM0LV: False


In [3]:
print('AUG:', valid_sequence('AUG'))

AUG: False


This function is specific to the alphabet used. How could it be improved?

In [4]:
def valid_sequence(sequence, valid_characters):
    for c in sequence:
        if c not in valid_characters:
            return False

    return True

print('AUG:', valid_sequence('AUG', ['A', 'U', 'G', 'C']))

AUG: True


This way, the same function can be used to validate DNA, RNA, and protein sequences.
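
A minimal sketch of that reuse, assuming the alphabets are stored once under illustrative constant names (`DNA`, `RNA`, `PROTEIN` are not defined in the lesson). A plain string works as the alphabet here because `in` also tests single characters against a string.

```python
# Alphabets defined once and reused (the constant names are illustrative)
DNA = 'ACGT'
RNA = 'ACGU'
PROTEIN = 'ACDEFGHIKLMNPQRSTVWY'


def valid_sequence(sequence, valid_characters):
    for c in sequence:
        if c not in valid_characters:
            return False

    return True


print(valid_sequence('AGGAGG', DNA))       # True
print(valid_sequence('AUG', RNA))          # True
print(valid_sequence('AUG', DNA))          # False: U is not a DNA base
print(valid_sequence('MNKMDLV', PROTEIN))  # True
```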

In [5]:
print('aug:', valid_sequence('aug', ['A', 'U', 'G', 'C']))

aug: False


The check is case-sensitive, so a lowercase sequence is rejected. There are several solutions. For example:

In [6]:
def valid_sequence_1(sequence, valid_characters):
    for c in sequence:
        if c.upper() not in valid_characters:
            return False

    return True

def valid_sequence_2(sequence, valid_characters):
    for c in sequence.upper():
        if c not in valid_characters:
            return False

    return True

print('aug:', valid_sequence_1('aug', ['A', 'U', 'G', 'C']))
print('aug:', valid_sequence_2('aug', ['A', 'U', 'G', 'C']))

aug: True
aug: True


Which of the two implementations is better?

In [7]:
%timeit valid_sequence_1('auggcgagca' * 10000000, ['A', 'U', 'G', 'C'])

4.75 s ± 73.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
%timeit valid_sequence_2('auggcgagca' * 10000000, ['A', 'U', 'G', 'C'])

2.55 s ± 79.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


The second implementation is faster, but...

In [12]:
# !pip3 install memory_profiler
%load_ext memory_profiler

In [13]:
%memit valid_sequence_1('auggcgagca' * 10000000, ['A', 'U', 'G', 'C'])

peak memory: 178.30 MiB, increment: 96.47 MiB


In [14]:
%memit valid_sequence_2('auggcgagca' * 10000000, ['A', 'U', 'G', 'C'])

peak memory: 273.74 MiB, increment: 190.75 MiB


... it requires more memory.

During development, there are often trade-offs between implementations; here, speed versus memory.
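
One possible middle ground, offered as a sketch and not as a measured result: accept both cases in the alphabet itself, so that neither a per-character `upper()` call nor an uppercase copy of the whole sequence is needed. The name `valid_sequence_3` and the use of a `set` are assumptions, not part of the lesson, and the approach is only meant for plain ASCII letters such as these alphabets. Its actual time and memory would still have to be checked with `%timeit` and `%memit`.

```python
def valid_sequence_3(sequence, valid_characters):
    # Build the set of accepted characters once, in both cases
    allowed = set()
    for c in valid_characters:
        allowed.add(c.upper())
        allowed.add(c.lower())

    for c in sequence:
        if c not in allowed:
            return False

    return True


print('aug:', valid_sequence_3('aug', ['A', 'U', 'G', 'C']))  # aug: True
print('AuG:', valid_sequence_3('AuG', ['A', 'U', 'G', 'C']))  # AuG: True
print('atg:', valid_sequence_3('atg', ['A', 'U', 'G', 'C']))  # atg: False
```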

### Libraries

A library is a collection of files (called modules) that contains functions.

Use `import` to load a library module into a program’s memory.

In [9]:
import random

random.randrange(1, 10)

2

`randrange` chooses a random integer (between 1 and 9 in this example).
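
The upper bound is excluded, as with `range`. A quick way to convince yourself is to draw many values and look at the extremes; the seed below is only there to make the run repeatable.

```python
import random

random.seed(0)  # fixed seed, only to make the run repeatable

draws = [random.randrange(1, 10) for _ in range(1000)]

print(min(draws) >= 1)  # True
print(max(draws) <= 9)  # True: 10 is never drawn
```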

It is possible to load only specific items from a library module.

In [10]:
from random import randrange

randrange(1, 10)

9

Use `as` to give a library, or an item imported from it, an alias.

In [11]:
from random import randrange as r

r(1, 10)

3

Let's organize a tiny library. Take a look at [`validation.py`](../data/validation.py).

In [12]:
import validation

ModuleNotFoundError: No module named 'validation'

In [14]:
import sys

sys.path.append('../data')

The `sys` module provides access to variables used or maintained by the interpreter and to functions that interact with the interpreter.
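
To see why the first `import validation` failed, it can help to look at `sys.path`, the list of directories Python searches for modules. A minimal sketch follows; the printed paths depend on the machine, so no output is shown.

```python
import sys

# sys.path is an ordinary list of directory names
for directory in sys.path:
    print(directory)

# Appending a directory makes the modules inside it importable
if '../data' not in sys.path:
    sys.path.append('../data')
```

The relative path `'../data'` is resolved against the current working directory, so it only works when the notebook is started from the expected folder.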

In [15]:
import validation

validation.validate_dna('AGGAGG') # Shine–Dalgarno sequence

True

In [16]:
from validation import validate_protein

validate_protein('HSQGTFTSDYSKYLDSRRAQDFVQWLMNT') # Glucagon

True

In [17]:
from validation import validate_rna as is_rna_ok

is_rna_ok('AUG')

True
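
The contents of `validation.py` are not reproduced in these notes. As a rough idea only, a module exposing the three functions used above might look like the sketch below. The alphabets, the helper `_is_valid`, and the choice to accept lowercase input are all assumptions; the real file may differ.

```python
# validation.py (hypothetical sketch, not the actual course file)

DNA_ALPHABET = 'ACGT'
RNA_ALPHABET = 'ACGU'
PROTEIN_ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'


def _is_valid(sequence, valid_characters):
    # Shared check used by the three public functions
    for c in sequence.upper():
        if c not in valid_characters:
            return False

    return True


def validate_dna(sequence):
    return _is_valid(sequence, DNA_ALPHABET)


def validate_rna(sequence):
    return _is_valid(sequence, RNA_ALPHABET)


def validate_protein(sequence):
    return _is_valid(sequence, PROTEIN_ALPHABET)
```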

## Files

To open a file, use the built-in `open()` function.

The `open()` function returns a file object that allows you to perform various operations on the file.

In [18]:
f = open('../data/P04439.fasta')
print(f.read())
f.close()

>sp|P04439|HLAA_HUMAN HLA class I histocompatibility antigen, A alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=2
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF
DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ
IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL
RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT
WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL
TACKV



Using the `with` statement, the file is closed automatically at the end of the block, even if an error occurs:

In [19]:
with open('../data/P04439.fasta') as f:
    print(f.read())

>sp|P04439|HLAA_HUMAN HLA class I histocompatibility antigen, A alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=2
MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF
DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ
IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL
RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT
WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL
SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL
TACKV
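
The `with` form is roughly equivalent to wrapping the work in `try`/`finally` so that `close()` always runs. The sketch below writes a tiny temporary FASTA file with made-up content, standing in for `../data/P04439.fasta`, so that it can run anywhere.

```python
import os
import tempfile

# Create a small stand-in file; the record is invented for illustration
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write('>example\nMAVMAPRT\n')
    path = tmp.name

# Manual version: close() runs even if read() raises
f = open(path)
try:
    content_manual = f.read()
finally:
    f.close()

# Equivalent version with the context manager
with open(path) as f:
    content_with = f.read()

print(f.closed)                        # True
print(content_manual == content_with)  # True

os.remove(path)  # clean up the temporary file
```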



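There are several ways to read a file: all lines at once with `readlines()`, one line at a time with `readline()`, or by iterating over the file object.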

In [20]:
with open('../data/P04439.fasta') as f:
    list_of_lines = f.readlines()

list_of_lines

['>sp|P04439|HLAA_HUMAN HLA class I histocompatibility antigen, A alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=2\n',
 'MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF\n',
 'DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ\n',
 'IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL\n',
 'RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT\n',
 'WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL\n',
 'SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL\n',
 'TACKV\n']

In [21]:
with open('../data/P04439.fasta') as f:
    line = f.readline()
    while line:
        print(line)
        line = f.readline()

>sp|P04439|HLAA_HUMAN HLA class I histocompatibility antigen, A alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=2

MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF

DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ

IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL

RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT

WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL

SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL

TACKV



When the end of the file (`EOF`) is reached, the `readline()` method returns an empty string.
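
Two details of the loop above can be checked in isolation. The sketch uses `io.StringIO` as an in-memory stand-in for a real file, an assumption made only so the example is self-contained. `readline()` keeps the trailing `\n` of each line, which is why `print(line)` shows a blank line after every line, and it returns `''` once the data is exhausted.

```python
import io

f = io.StringIO('>example\nMAVM\nTACKV\n')  # made-up content

print(repr(f.readline()))  # '>example\n'
print(repr(f.readline()))  # 'MAVM\n'
print(repr(f.readline()))  # 'TACKV\n'
print(repr(f.readline()))  # '' -> end of file, so a while loop would stop

f.seek(0)  # go back to the beginning
for line in f:
    print(line, end='')  # end='' avoids the extra blank line
```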

In [23]:
with open('../data/P04439.fasta') as f:
    for line in f:
        print(line)

>sp|P04439|HLAA_HUMAN HLA class I histocompatibility antigen, A alpha chain OS=Homo sapiens OX=9606 GN=HLA-A PE=1 SV=2

MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF

DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ

IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL

RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT

WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL

SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL

TACKV



The string method `strip()` returns a copy of the string with leading and trailing characters removed: whitespace by default, or the characters listed in the string argument, if one is passed.

In [24]:
list_of_lines = []

with open('../data/P04439.fasta') as f:
    f.readline()
    
    for line in f:
        list_of_lines.append(line.strip('\n'))

list_of_lines

['MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF',
 'DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ',
 'IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL',
 'RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT',
 'WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL',
 'SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL',
 'TACKV']

The `lstrip()` and `rstrip()` methods return a copy of the string with, respectively, leading and trailing characters removed (based on the string argument passed).
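
A short comparison of the three methods on a made-up line. Note that the argument is treated as a set of characters to remove, not as a prefix or suffix.

```python
line = '  >header line\n'

print(repr(line.strip()))       # '>header line'   (whitespace, both sides)
print(repr(line.rstrip('\n')))  # '  >header line' (only the trailing newline)
print(repr(line.lstrip()))      # '>header line\n' (only the leading spaces)

# The argument is a set of characters, not a prefix or suffix
print('xxyAUGyx'.strip('xy'))   # AUG
```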

To obtain the entire sequence:

In [25]:
sequence_hla_a = ''.join(list_of_lines)

sequence_hla_a

'MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQIMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQLRAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSLTACKV'

The `join()` method returns a string by joining all the elements of an **iterable**, separated by the specified string separator.

In [26]:
'*********'.join(list_of_lines)

'MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRF*********DSDAASQRMEPRAPWIEQEGPEYWDQETRNVKAQSQTDRVDLGTLRGYYNQSEAGSHTIQ*********IMYGCDVGSDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMAAQITKRKWEAAHEAEQL*********RAYLDGTCVEWLRRYLENGKETLQRTDPPKTHMTHHPISDHEATLRCWALGFYPAEITLT*********WQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEL*********SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAASSDSAQGSDVSL*********TACKV'
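
`join()` is not limited to lists: tuples and generator expressions work as well, as long as every element is a string. A few small cases, with made-up values:

```python
# Any iterable of strings works, not only lists
print('-'.join(('A', 'U', 'G')))             # A-U-G
print(''.join(c.upper() for c in 'aug'))     # AUG

# Non-string elements must be converted first
positions = [1, 2, 3]
print(', '.join(str(p) for p in positions))  # 1, 2, 3

# ', '.join(positions) would raise TypeError
```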

## Dictionary (I)

Dictionaries are objects made of **pairs of elements**. The two elements of each pair are called the key and the value.

In [27]:
aa_3L_to_1L = {} # Empty dictionary

aa_3L_to_1L = {
    'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
    'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
    'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
    'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M',
}

print('{} --> {}'.format('GLU', aa_3L_to_1L['GLU']))

GLU --> E
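
A few more common operations on the same dictionary, as a sketch. The residue list and the `'X'` placeholder for unknown codes are assumptions chosen for illustration.

```python
aa_3L_to_1L = {
    'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
    'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
    'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
    'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M',
}

residues = ['MET', 'ALA', 'VAL', 'MET']  # made-up input

# Look up each three-letter code and join the one-letter codes
one_letter = ''.join(aa_3L_to_1L[r] for r in residues)
print(one_letter)  # MAVM

# get() returns a default instead of raising KeyError for unknown keys
print(aa_3L_to_1L.get('SEC', 'X'))  # X

# The in operator tests for the presence of a key
print('GLU' in aa_3L_to_1L)  # True
```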
