# Lecture 6 - From sequences to structures

In this lecture you learned how amino acid sequences lead to protein structures and how conserved regions in a protein, like **motifs** and **domains**, are related to protein function. In this tutorial we will explore some of these elements.

### Learning objectives:

- Interpret motifs represented as sequence logos
- Search for domains in a given sequence
- Visualize 3D protein structures

## Sequence motifs

Motifs are (typically short) highly conserved sequences of nucleotides or amino acids. 

Here is an example of a short (hypothetical) DNA motif:

In [None]:
from Bio import motifs

alignment = [
    "TACAAGGG",
    "TACAAGGG",
    "TACGCGGT",
    "TACACTGG",
    "TACACTGG",
    "TACCCGGG",
    "AACCCGGA",
    "AATGCAGG",
    "AATGCCGG",
    "AATGCCGG"
]

motif = motifs.create(alignment)

The `motifs.create` call above builds a motif from the sequence alignment. We can now print its frequency table:

In [None]:
print(motif.counts)
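
To see where these numbers come from, here is a minimal sketch that rebuilds the same kind of table by hand using only the standard library. It is an illustration, not BioPython's internal implementation, and the name `column_counts` is our own.

```python
from collections import Counter

alignment = [
    "TACAAGGG",
    "TACAAGGG",
    "TACGCGGT",
    "TACACTGG",
    "TACACTGG",
    "TACCCGGG",
    "AACCCGGA",
    "AATGCAGG",
    "AATGCCGG",
    "AATGCCGG",
]

# zip(*alignment) walks through the alignment column by column
column_counts = [Counter(column) for column in zip(*alignment)]

for position, counts in enumerate(column_counts):
    # a Counter returns 0 for letters that never occur at this position
    print(position, {base: counts[base] for base in "ACGT"})
```

Each row of the output corresponds to one position of the motif, so you can compare it column by column with the table printed by `motif.counts`.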

The best way to visualize a sequence motif is to create a [*sequence logo*](https://en.wikipedia.org/wiki/Sequence_logo):

In [None]:
motif.weblogo('files/motif.png', format='png_print', show_fineprint=False, show_xaxis=False, show_yaxis=False, show_errorbars=False)

from IPython.display import Image
Image(filename='files/motif.png', width=500)

> **Note:** If the code above doesn't work, you can do this instead:
> - `print(motif)`
> - copy-paste the output [here](https://weblogo.berkeley.edu/logo.cgi)
> - press *create logo* at the bottom of the page
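
The height of each letter stack in a sequence logo reflects how conserved that position is, measured in bits. As a rough sketch (assuming a DNA alphabet of four letters and ignoring the small-sample correction that logo tools may apply), the information content of a column is 2 minus its Shannon entropy. The function name `information_content` is our own.

```python
from collections import Counter
from math import log2

alignment = [
    "TACAAGGG",
    "TACAAGGG",
    "TACGCGGT",
    "TACACTGG",
    "TACACTGG",
    "TACCCGGG",
    "AACCCGGA",
    "AATGCAGG",
    "AATGCCGG",
    "AATGCCGG",
]

def information_content(column):
    """Return 2 - H (in bits) for one alignment column of DNA letters."""
    total = len(column)
    # Shannon entropy over the letters that actually occur in the column
    entropy = -sum((n / total) * log2(n / total) for n in Counter(column).values())
    return 2.0 - entropy

bits = [information_content(column) for column in zip(*alignment)]

for position, value in enumerate(bits):
    print(position, round(value, 2))
```

Fully conserved positions reach the maximum of 2 bits, while positions with a mix of letters score lower and appear as shorter stacks in the logo.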

We can also find the most representative (i.e. consensus) sequence for that motif:

In [None]:
print(motif.consensus)

### Exercise 1

Here is a sequence logo from a *"secret"*(*) motif:

![secret motif](files/secret.png)

Can you create a sequence alignment that reproduces this motif?

> Note: This will require some trial and error. Don't worry about making it perfect.

In [None]:
# delete the sequences below and create your own 

my_alignment = [
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA",
    "AAAAAAAAAAAA"
]

# this code will generate and print the sequence logo

my_motif = motifs.create(my_alignment)
my_motif.weblogo('files/my_motif.png', format='png_print', show_fineprint=False, show_xaxis=False, show_yaxis=False, show_errorbars=False)

Image(filename='files/my_motif.png', width=500) 

(*) Actually it is not a secret, it is the binding site of the *Rox1* transcription factor of *S. cerevisiae*. 
You can see the original alignment in this [publication](https://www.nature.com/articles/nbt0406-423).

## Protein Domains

Domains are (typically large) regions in a protein with a conserved 3D structure that are associated with a given role or function.

![SPIKE COV2](files/spike_cov2.png)

The figure above shows the location of four domains identified in the spike protein of the SARS-CoV-2 virus.

### Exercise 2: 

In this exercise we will try to identify these domains using a FASTA file with the protein sequence. 

#### 2.1 

Begin by loading the file (under: *files/P0DTC2.faa*) using BioPython and reading the protein sequence. If necessary, use the [documentation](https://biopython.org/wiki/SeqIO) to refresh your memory.

In [None]:
# type your code here...

Click the cell below to show the solution.

In [None]:

from Bio.SeqIO import parse

sequences = list(parse('files/P0DTC2.faa', 'fasta'))
sequence = sequences[0]  # parse() yields every record in the file, so we take the first (and only) one
print(sequence)

[ScanProsite](https://prosite.expasy.org/scanprosite/) is a web-based tool that can search motifs and domains in a protein sequence.

We can also run ScanProsite directly from BioPython: 

In [None]:
from Bio.ExPASy.ScanProsite import scan, read

# here the sequence object is the result from the previous exercise 
domains = read(scan(sequence.seq))

for domain in domains:
    print(domain)

> Note: If the code above fails with Error 308, BioPython is using an outdated ScanProsite URL. In that case, please run the code below instead:

In [None]:

def scan(seq):
    from urllib.request import urlopen
    from urllib.parse import urlencode

    parameters = {"seq": seq, "output": 'xml'}
    command = urlencode(parameters)
    url = f"https://prosite.expasy.org/cgi-bin/prosite/scanprosite/PSScan.cgi?{command}"

    return urlopen(url)

domains = read(scan(sequence.seq))

for domain in domains:
    print(domain)
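
As an aside on the fallback `scan` function: `urlencode` simply turns the parameter dictionary into the query string that is appended to the URL. The short sketch below shows this with a made-up peptide (not the spike protein) and does not contact the server.

```python
from urllib.parse import urlencode

# a short made-up peptide, used only to illustrate the encoding
parameters = {"seq": "MKTAYIAK", "output": "xml"}

# key=value pairs are joined with '&' in the order they appear in the dictionary
query = urlencode(parameters)
print(query)
```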

You can see that each domain is a dictionary with some information like:
- the start and stop position of each domain along the sequence
- the identifier of that domain in the ProSite database
- a confidence score for the domain match. 
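
The sketch below shows how such a dictionary can be used. The hit is made up for illustration (the values are not real results), and the key names and value types are assumed to match what the loop above printed, so check them against your own output.

```python
# a made-up hit with the keys we expect from ScanProsite (values are NOT real)
example_hit = {
    "sequence_ac": "USERSEQ1",
    "start": 16,
    "stop": 98,
    "signature_ac": "PS00000",
    "score": "8.873",
    "level": "0",
}

# start and stop are both included, hence the +1
length = example_hit["stop"] - example_hit["start"] + 1

# the score is assumed to arrive as text, so convert it before comparing numbers
score = float(example_hit["score"])

print(f"{example_hit['signature_ac']}: residues {example_hit['start']}-{example_hit['stop']} ({length} aa), score {score}")
```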

We can also get a more detailed description of each domain identifier:

In [None]:
from Bio import ExPASy
from Bio.ExPASy import Prosite

for domain in domains:

    record = Prosite.read(ExPASy.get_prosite_raw(domain['signature_ac']))
    print(record.accession + ': ' + record.description)

#### 2.2

Let's practice our Python skills a bit with a simple exercise. Create a loop to iterate over the domains and, for each domain, print the domain identifier (*signature_ac*) followed by the respective amino acid sequence.

> Note: You can use the start and stop positions to *"crop"* the sequence of the domain from the original protein sequence:

In [None]:
# type your code here...

Click the cell below to show the solution.

In [None]:

for domain in domains:
    identifier = domain['signature_ac']
    start = domain['start'] - 1   # remember: python is 0-indexed
    end = domain['stop'] - 1 
    domain_seq = sequence.seq[start:end+1]  # remember: when we slice a sequence, the end position is excluded
    print(identifier)
    print(domain_seq)
    print()
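
The index arithmetic in the solution can be checked on a toy string where the answer is easy to verify by eye. This is a minimal sketch; the helper name `crop` is our own and is not part of BioPython.

```python
def crop(seq, start, stop):
    """Return residues start..stop, where both positions are 1-based and included."""
    # subtract 1 from start to move to 0-based indexing;
    # stop stays as it is because the end of a slice is excluded
    return seq[start - 1:stop]

toy = "ABCDEFGHIJ"
print(crop(toy, 3, 5))  # CDE
```

The same helper works on a BioPython sequence, because slicing it follows the same rules as slicing a string.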

## Protein structures

Now let's load a protein structure for the spike protein. As you can imagine, this has been a very well-studied protein and you can find thousands of experimentally measured structures for this protein on [PDB](https://www.rcsb.org/). 

For this exercise we will use entry [**6VSB**](https://www.rcsb.org/structure/6VSB).

> Check the website for more details on how the structure was measured.

![6VSB](files/6vsb_cartoon.png)

### Exercise 3.1

Run the code below to import and view the protein using [NGLViewer](https://nglviewer.org/nglview/latest/). 

- Rotate the protein until it looks like the figure above.

In [None]:
import nglview as nv
view = nv.NGLWidget(height='500px')
view.add_pdbid('6VSB')
view.clear_representations()
view.add_cartoon('protein', color='residueindex', color_scale='RdYlBu', color_reverse=True)
view

**NGLViewer** is a very flexible and powerful library. For instance, we can try different visualization styles and highlight specific parts of the protein:

In [None]:
view = nv.NGLWidget(height='500px')
view.add_pdbid('6VSB')
view.clear_representations()
view.add_spacefill(selection='protein', color='grey', opacity=0.3)
view.add_spacefill(selection='protein and 1-1000', color='red', opacity=0.6)
view

### Exercise 3.2

The code above highlighted the first 1000 residues of the protein...

- Go back to **exercise 2.1** where you printed the position of the different domains 
- What are the start and stop positions of the S1 C-terminal domain (the domain that binds to the human receptor)?
- Modify the code above to highlight only that part.
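
As a hint, the selection string used in the highlighting example is just text of the form `protein and 1-1000`, so it can be assembled from two numbers. The sketch below uses a hypothetical helper, `residue_selection`, and reproduces the range from the example rather than the answer to this exercise.

```python
def residue_selection(start, stop, keyword="protein"):
    """Build a selection string such as 'protein and 1-1000'."""
    return f"{keyword} and {start}-{stop}"

selection = residue_selection(1, 1000)
print(selection)
```

You could then pass the result as the `selection` argument of `view.add_spacefill`, replacing the numbers with the positions you found.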

🤔 Does the location of the domain make sense?

> Remember: The spike protein is located on the surface of the virus and is oriented like a cauliflower.

![sars-cov-2](files/sars_cov_2.png)

## Wrap-up

Not a lot of coding today... 😉 

Hopefully you now have a better understanding of the interconnection between the **sequence** and **structure** of proteins. 