# <b><font color='#009e74'>Functional site model: step-by-step MSA preparation for the GEMME web server </font></b>

Colaboratory supporting notebook for creating an MSA compatible with the GEMME web server.\
Manuscript: **Cagiada, et al.,** [Discovering functionally important sites in proteins](https://www.biorxiv.org/content/10.1101/2022.07.14.500015v1.full).\
Source code is available on the project [GitHub](https://github.com/KULL-Centre/papers/tree/main/2022/functional-sites-cagiada-et-al) page.

If you use this notebook, remember to cite the software and packages used, as well as the current version of our manuscript.

### 1. <b><font color='#56b4e9'>Retrieve your sequence in FASTA format</font></b>

This can be done using [UniProt](https://www.uniprot.org/):
1. Search for the protein ID
2. Go to the Sequence tab
3. Click the Download button and copy the sequence from the new page that opens

![](https://raw.githubusercontent.com/KULL-Centre/_2022_functional-sites-cagiada/main/images_colab/Screenshot%202022-09-10%20at%2014.18.49.png)
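
The same sequence can also be retrieved programmatically. The following is a minimal sketch, not part of this notebook: it assumes the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, and the helper names and the accession `P12345` are placeholders chosen for illustration.

```python
from urllib.request import urlopen

# Assumed UniProt REST pattern for downloading one entry as FASTA
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def uniprot_fasta_url(accession):
    """Build the download URL for a UniProt accession (hypothetical helper)."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip())


def parse_single_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; keep only the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def fetch_uniprot_fasta(accession, timeout=30):
    """Download the FASTA text for one accession (needs network access)."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (commented out because it needs network access;
# "P12345" is a placeholder accession):
# header, sequence = parse_single_fasta(fetch_uniprot_fasta("P12345"))
```

The returned header and sequence can then be pasted into the HHblits form as a two-line FASTA record.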

### 2. <b><font color='#56b4e9'>Generation of the HHblits alignment</font></b>

Go to https://toolkit.tuebingen.mpg.de/tools/hhblits:

1. Paste your sequence in FASTA format into the input box
2. Select UniRef30 as the database

![](https://raw.githubusercontent.com/KULL-Centre/_2022_functional-sites-cagiada/main/images_colab/Screenshot%202022-09-10%20at%2015.44.37.png)

3. Go to the Parameters tab and set the E-value threshold to 1e-20 (use a larger, less stringent value if the output does not contain enough sequences)
4. Request at least 2000 sequences in the output (increase this number for a stronger signal)
5. Fill in the job name field and submit the job


![](https://raw.githubusercontent.com/KULL-Centre/_2022_functional-sites-cagiada/main/images_colab/Screenshot%202022-09-10%20at%2013.38.53.png)

6. Once the run has finished, go to the output panel and click on the Query MSA tab

7. Download the full A3M MSA

![](https://raw.githubusercontent.com/KULL-Centre/_2022_functional-sites-cagiada/main/images_colab/download_MSA_hhblits.png)
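
Before moving on, it can be useful to check how many sequences the downloaded alignment actually contains, since too few hits suggest relaxing the E-value threshold. A minimal sketch, assuming one header line starting with `>` per sequence; the function name and the file name in the usage comment are illustrative only:

```python
def count_a3m_sequences(path):
    """Count records (header lines starting with '>') in an A3M or FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example usage ("my_protein.a3m" is a placeholder file name):
# n_sequences = count_a3m_sequences("my_protein.a3m")
# print(f"{n_sequences} sequences in the alignment (query included)")
```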



### 3. <b><font color='#56b4e9'> Convert your MSA to the appropriate GEMME format</font></b>

Upload the downloaded MSA using the code cells below; it will be converted to the input MSA format required by GEMME:

  1. The MSA is converted from A3M to FASTA
  2. All columns that are not part of the query sequence (positions where the query has a gap) are removed from the MSA
  3. All sequences with more gaps than the selected threshold are removed. The threshold (as a percentage) can be set in the conversion cell below (default 50%)

The GEMME-formatted MSA is downloaded at the end of the process.
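
For readers who want to see what these three steps amount to, the following pure-Python sketch reproduces the same logic on a toy alignment. It is an illustration under stated assumptions, not the code run by this notebook (which relies on `convertseqs.pl` and `MSAFILTERexe`): it assumes the usual A3M convention that lowercase letters and `.` mark insertions relative to the query, that the query is the first record, and all function names are hypothetical.

```python
def read_a3m(text):
    """Parse A3M/FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def a3m_to_query_columns(records):
    """Steps 1-2: drop insertions (lowercase letters and '.') so all rows
    share the same columns, then drop columns where the query is a gap."""
    if not records:
        return []
    aligned = [(h, "".join(c for c in s if not c.islower() and c != "."))
               for h, s in records]
    if len({len(s) for _, s in aligned}) != 1:
        raise ValueError("rows differ in length after removing insertions")
    query = aligned[0][1]  # assumption: the query is the first record
    keep = [i for i, c in enumerate(query) if c != "-"]
    return [(h, "".join(s[i] for i in keep)) for h, s in aligned]


def filter_by_gaps(records, max_gap_fraction=0.5):
    """Step 3: keep the query plus rows whose gap fraction is within the threshold."""
    if not records:
        return []
    kept = [records[0]]
    for header, seq in records[1:]:
        if seq.count("-") / max(len(seq), 1) <= max_gap_fraction:
            kept.append((header, seq))
    return kept


# Toy alignment: hit1 carries one insertion ('a'), hit2 is mostly gaps
example = ">query\nMKTL\n>hit1\nMKaT-\n>hit2\n---L\n"
filtered = filter_by_gaps(a3m_to_query_columns(read_a3m(example)), max_gap_fraction=0.5)
for header, seq in filtered:
    print(f">{header}\n{seq}")
```

In this toy example `hit2` is dropped because three of its four columns are gaps, which exceeds the 50% threshold, while `hit1` is kept with its insertion removed.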

\

### EXECUTABLE CODE STARTS HERE

****

In [None]:
#@title 2.1 Upload alignment
#@markdown Run this cell to upload the <b>HHblits A3M alignment file</b>.
#@markdown
#@markdown  <b>N.B.:</b> To perform a new alignment, first restart the notebook by clicking `Runtime`->  `Disconnect and delete runtime` 

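from google.colab import files
import os, shutil

# Remove the work folder left over from a previous run, if any
if 'input_path' in locals():
  shutil.rmtree(input_path, ignore_errors=True)

input_path = '/content/work_folder'
os.makedirs(input_path, exist_ok=True)

output_name = ''
uploaded_msa = files.upload()
for fn in uploaded_msa.keys():
    print(fn)
    # Name the output after the uploaded file, whatever its extension
    output_name = f'{os.path.splitext(fn)[0]}_filtered.fas'
    print(output_name)
    # Move the upload into the work folder under a fixed name
    os.rename(fn, f"{input_path}/input.a3m")
    print('-->MSA uploaded')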

In [None]:
#@title 2.2 Convert and filter the uploaded MSA
#@markdown Run this cell to convert and filter the uploaded MSA for use in the GEMME web server.
#@markdown
#@markdown  <b>N.B.:</b> To perform a new alignment, first restart the notebook by clicking `Runtime`->  `Disconnect and delete runtime` 

%%bash
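# -O sets the output file (lowercase -o would set wget's log file instead)
wget -q https://github.com/KULL-Centre/_2022_functional-sites-cagiada/raw/main/additional_msafilters.zip -O /content/additional_msafilters.zip
# -o overwrites files from a previous run without prompting
unzip -o /content/additional_msafilters.zip > /dev/null 2>&1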
rm /content/additional_msafilters.zip > /dev/null 2>&1
perl /content/convertseqs.pl a3m fas /content/work_folder/input.a3m /content/work_folder/input.fas
chmod +x /content/MSAFILTERexe 
/content/MSAFILTERexe /content/work_folder/input.fas /content/work_folder/output 0.5 > /dev/null 2>&1

In [4]:
#@title 2.3 Download the GEMME-ready MSA
#@markdown Run this cell to download the converted MSA.
os.rename(f'{input_path}/output.fasta', f'/content/{output_name}')
files.download(f"/content/{output_name}")

### EXECUTABLE CODE ENDS HERE

****

### 4. <b><font color='#56b4e9'>Run the GEMME web server</font></b>

Go to http://www.lcqb.upmc.fr/GEMME/submit.html:

1. Upload the MSA you have generated and filtered
2. Set the protein name

![](https://raw.githubusercontent.com/KULL-Centre/_2022_functional-sites-cagiada/main/images_colab/gemme_1.png)

3. Set the number of iterations to 5
4. Enter your email address and press the Submit button

![](https://raw.githubusercontent.com/KULL-Centre/_2022_functional-sites-cagiada/main/images_colab/gemme_2.png)

5. Once the job has finished, download the GEMME results file