# Lecture 6 - From sequences to structures

During the lecture you learned about two main types of conserved structures inside a protein that are related to its function: **motifs** and **domains**.

## Sequence motifs

Motifs are (typically short) highly conserved sequences of nucleotides or amino acids. 

Here is an example of a small DNA motif:

In [None]:
from Bio import motifs

alignment = [
    "TACAAGGG",
    "TACAAGGG",
    "TACGCGGT",
    "TACACTGG",
    "TACACTGG",
    "TACCCGGG",
    "AACCCGGA",
    "AATGCAGG",
    "AATGCCGG",
    "AATGCCGG"
]

motif = motifs.create(alignment)

Once the motif has been built from the sequence alignment, we can print its frequency table (the count of each nucleotide at each position):

In [None]:
print(motif.counts)
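
To make it clearer what the frequency table contains, here is a minimal sketch that counts the nucleotides column by column using only the Python standard library. It is not how Biopython computes the table internally; the name `column_counts` is made up for this illustration.

```python
from collections import Counter

alignment = [
    "TACAAGGG",
    "TACAAGGG",
    "TACGCGGT",
    "TACACTGG",
    "TACACTGG",
    "TACCCGGG",
    "AACCCGGA",
    "AATGCAGG",
    "AATGCCGG",
    "AATGCCGG",
]

# zip(*alignment) yields one tuple of letters per alignment column
column_counts = [Counter(column) for column in zip(*alignment)]

# Print one row per nucleotide and one number per position
for letter in "ACGT":
    row = [counts[letter] for counts in column_counts]
    print(letter, row)
```

Each row should match the corresponding row of `motif.counts`, since both simply tally how often a letter appears at each position.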

The best way to visualize a sequence motif is to create a [*sequence logo*](https://en.wikipedia.org/wiki/Sequence_logo):

In [None]:
from IPython.display import Image

# weblogo() calls the online WebLogo service, so this needs an internet connection
motif.weblogo('files/motif.png')
Image(filename='files/motif.png')
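
In a sequence logo, the total height of the letter stack at each position reflects how conserved that position is, measured in bits of information. The sketch below computes this value for a single DNA column. It assumes a four-letter alphabet and leaves out the small-sample correction that logo generators may apply, so the numbers can differ slightly from the heights in the image. The function name `information_content` is made up for this illustration.

```python
import math
from collections import Counter


def information_content(column):
    """Bits of information in one DNA alignment column (no small-sample correction)."""
    total = len(column)
    # Shannon entropy of the observed letter frequencies
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(column).values()
    )
    # Maximum entropy for 4 nucleotides is log2(4) = 2 bits
    return math.log2(4) - entropy


print(information_content("AAAAAAAAAA"))  # fully conserved column
print(information_content("ACGT"))        # all four letters equally frequent
```

A fully conserved column carries the maximum of 2 bits, while a column in which all four nucleotides are equally frequent carries none.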

We can also find the most representative (i.e. consensus) sequence for that motif:

In [None]:
print(motif.consensus)
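
A consensus can be thought of as the most frequent letter in each column. Here is a minimal sketch of that idea on a toy alignment. The helper `simple_consensus` is hypothetical, and it breaks ties by taking the letter seen first in the column, which may differ from what Biopython does.

```python
from collections import Counter


def simple_consensus(alignment):
    """Most frequent letter per column; ties go to the letter seen first."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*alignment)
    )


toy_alignment = ["ACGT", "ACGA", "ACCT"]
print(simple_consensus(toy_alignment))
```

Ties matter in practice: in the DNA alignment above, A and G each appear four times at the fourth position, so the consensus letter there depends on the tie-breaking rule.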

### Exercise 1

Here is a sequence logo from a *"secret"* motif:

![secret motif](files/secret.png)

Can you create a sequence alignment that reproduces this motif?

> Note: This will require some trial and error. Don't worry about making it perfect.

In [None]:
# delete the sequences below and create your own

my_alignment = [
    "AAAAAAA",
    "AAAAAAA",
    "AAAAAAA",
]

my_motif = motifs.create(my_alignment)
my_motif.weblogo('files/my_motif.png')
Image(filename='files/my_motif.png') 

## Protein Domains

Domains are (typically large) regions in a protein with a conserved 3D structure that is associated with a given role or function.

![SPIKE COV2](files/spike_cov2.png)

The figure above shows the location of four domains identified in the spike protein of the SARS-CoV-2 virus.

### Exercise 2

In this exercise we will try to identify these domains using a FASTA file with the protein sequence. 

#### 2.1 

Let's begin by loading the file (*files/P0DTC2.faa*) using Biopython:

In [None]:
# type your code here...

Click the cell below to show the solution.

In [None]:
from Bio.SeqIO import parse

sequences = list(parse('files/P0DTC2.faa', 'fasta'))
# parse() yields one record per FASTA entry, so we take the first (and only) one
sequence = sequences[0]
print(sequence)

We can use [ScanProsite](https://prosite.expasy.org/scanprosite/) to search for protein domains:

In [None]:
from Bio.ExPASy.ScanProsite import scan, read

# scan() queries the online ScanProsite server, so this needs an internet connection
domains = read(scan(sequence.seq))

for domain in domains:
    print(domain)

#### 2.2

Create a loop to iterate over the results above and, for each domain, print the domain identifier (*signature_ac*) followed by the respective amino acid sequence.

In [None]:
# type your code here...

Click the cell below to show the solution.

In [None]:
for domain in domains:
    identifier = domain['signature_ac']
    start = domain['start'] - 1   # remember: reported positions start at 1, but Python is 0-indexed
    end = domain['stop'] - 1
    domain_seq = sequence.seq[start:end+1]  # remember: a slice excludes its last position, hence +1
    print(identifier)
    print(domain_seq)
    print()
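
The index arithmetic in the solution can be checked on a short toy string. The sketch below assumes, as the solution does, that the reported start and stop positions are 1-based and inclusive. The helper `slice_one_based` and the toy sequence are made up for this illustration.

```python
def slice_one_based(seq, start, stop):
    """Return the residues from start to stop, both 1-based and inclusive."""
    # start - 1 converts to a 0-based index; stop needs no change because
    # Python slices already exclude their end position
    return seq[start - 1:stop]


toy = "MFVFLVLLPL"
print(slice_one_based(toy, 2, 4))  # residues 2, 3 and 4
```

This is equivalent to `seq[start:end+1]` after subtracting 1 from both positions, just written with fewer steps.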