## * General considerations
#### Get the information of a method / function
We can always see the documentation of a method or function by writing a "?" at the end of its name. This retrieves the function's docstring, typically a general explanation of what the function does and its options. In the example below, we take the "models" variable, where we initialised the proteinModels class with our "models_folder", and look up the *removeTerminiByConfidenceScore* function. To see its documentation, we add a "?" at the end of the function's name and execute the cell.

In [1]:
models.removeTerminiByConfidenceScore?

Object `models.removeTerminiByConfidenceScore` not found.
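
Outside IPython or Jupyter, the "?" suffix is not valid syntax. A minimal sketch of a plain-Python equivalent uses the standard `inspect` module; the function `trim_example` below is a hypothetical stand-in written only for this illustration and is not part of prepare_proteins:

```python
import inspect


def trim_example(confidence_threshold=70.0):
    """Hypothetical stand-in function used only to demonstrate docstring lookup.

    Parameters
    ----------
    confidence_threshold : float
        Illustrative option; it has no effect here.
    """
    return confidence_threshold


# inspect.getdoc returns the cleaned docstring of the object
doc = inspect.getdoc(trim_example)
print(doc)

# inspect.signature lists the parameters and their default values
sig = inspect.signature(trim_example)
print(sig)
```

The built-in `help(trim_example)` prints similar information interactively.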


# 01 - Preparing your proteins

## Introduction

The prepare_proteins library was written to deal with the high throughput setup of protein systems. It can handle many PDB files simultaneously to set up general optimizations that prepare the systems for specific calculations and simulations.

In this document, we show an example of the general workflow that can be followed to accomplish these objectives. We will work with several glutathione peroxidase (GPX) sequences, from building their models (with AlphaFold) to setting up PELE simulations.

## 1. What modules and libraries do we need?

First, we need to import the main library **"prepare_proteins"**. 

Second, we will also import an additional library to help us send calculations to the different BSC clusters. The **"bsc_calculations"** library sets up the calculation files, folders and slurm scripts for efficiently launching jobs to the clusters. 

In [2]:
import prepare_proteins
import bsc_calculations

We will also load other common Python libraries to help us in our setup:

In [3]:
import os
import shutil

## 2. Preparing sequences - starting from a FASTA file

In this case, we are starting from protein sequences, so we need to model their protein structures. We will set up AlphaFold calculations from a FASTA file ("gpx_sequences.fasta") containing five GPX sequences. 

The first step is to initialise the *sequenceModels* class with the path to our fasta file. We assigned the initialised class to the variable *sequences*:

In [4]:
sequences = prepare_proteins.sequenceModels('gpx_sequences.fasta')
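
To make the input format concrete, here is a minimal standard-library sketch of reading a FASTA file into a dictionary of sequences. It is not the *sequenceModels* implementation; the function name `read_fasta` and the example records are assumptions made for illustration:

```python
import os
import tempfile


def read_fasta(path):
    """Return a dict mapping each record name (header without '>') to its sequence."""
    sequences = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # Use the first word of the header line as the record name
                header = line[1:].split()
                name = header[0] if header else ''
                sequences[name] = ''
            elif name is not None:
                # Sequence lines may be wrapped, so concatenate them
                sequences[name] += line
    return sequences


# Write a tiny made-up FASTA file and read it back
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, 'example.fasta')
    with open(fasta_path, 'w') as handle:
        handle.write('>protein_A\nMKTAYIAK\nQRQISFVK\n>protein_B\nMSLNE\n')
    records = read_fasta(fasta_path)

print(records)
```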

Now we can use the class method *setUpAlphaFold* to create all the files, folders and commands to launch AlphaFold. Its only parameter is the name of the folder in which we want to put our calculation files. The method returns a list of the commands that must be executed to run the job. We store that list in a variable called jobs:

In [5]:
jobs = sequences.setUpAlphaFold('alphafold')

Finally, we can create a slurm script to launch the AlphaFold jobs using the **"bsc_calculations"** library. Since the job will be run on the Minotauro cluster, we call a method inside the corresponding sub-class from the library:

In [6]:
bsc_calculations.minotauro.jobArrays(jobs, job_name='AF_sequences', partition='bsc_ls', program='alphafold')

The *jobArrays* method needs the list of commands to generate the slurm script file. We have specified the 'bsc_ls' partition to run the calculations, and with the keyword "program", we tell the script to load all necessary libraries to run AlphaFold in this cluster.
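
For orientation, a slurm job array script built from a list of commands could look roughly like the output of the sketch below. This is not the file that *jobArrays* writes: the function `make_array_script`, the script layout and the example commands are assumptions, and only standard `#SBATCH` options are used:

```python
def make_array_script(jobs, job_name, partition):
    """Build the text of a slurm array script that runs one command per array task."""
    lines = [
        '#!/bin/bash',
        f'#SBATCH --job-name={job_name}',
        f'#SBATCH --partition={partition}',
        f'#SBATCH --array=1-{len(jobs)}',
        f'#SBATCH --output={job_name}_%a.out',
        '',
    ]
    # Each array task runs only the command whose index matches its task ID
    for index, command in enumerate(jobs, start=1):
        lines.append(f'if [ "$SLURM_ARRAY_TASK_ID" -eq {index} ]; then')
        lines.append(f'    {command}')
        lines.append('fi')
    return '\n'.join(lines) + '\n'


# Made-up placeholder commands; the real ones come from setUpAlphaFold
example_jobs = ['echo "model 1"', 'echo "model 2"', 'echo "model 3"']
script = make_array_script(example_jobs, job_name='example', partition='debug')
print(script)
```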

To launch the calculations you will need to upload the 'AF_sequences' folder and the 'slurm_array.sh' script to the cluster and then launch it with: 

    sbatch slurm_array.sh

After all the AlphaFold calculations have finished, we need to retrieve the output protein structures from the cluster. Since AlphaFold generates very large output files, we are only interested in grabbing the PDB files to load them into our library. This can be easily done with a command like this:

    tar cvf AF_sequences.tar AF_sequences/output_models/*/relaxed_model_*pdb

The tar file contains only the relaxed PDB outputs but maintains the folder structure of our AlphaFold calculation.
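
A similar selective archive can be built from Python with the standard `glob` and `tarfile` modules. The sketch below runs on a throwaway directory with made-up file names that mimic the folder layout used in the command above; the function name `archive_relaxed_models` is hypothetical:

```python
import glob
import os
import tarfile
import tempfile


def archive_relaxed_models(job_folder, tar_path):
    """Add only relaxed_model_*pdb files to a tar archive, keeping relative paths."""
    pattern = os.path.join(job_folder, 'output_models', '*', 'relaxed_model_*pdb')
    members = sorted(glob.glob(pattern))
    with tarfile.open(tar_path, 'w') as tar:
        for member in members:
            tar.add(member)
    return members


with tempfile.TemporaryDirectory() as tmp:
    previous_dir = os.getcwd()
    os.chdir(tmp)
    try:
        # Build a fake output tree: one relaxed PDB and one file to leave behind
        model_dir = os.path.join('AF_sequences', 'output_models', 'protein_A')
        os.makedirs(model_dir)
        for name in ('relaxed_model_1.pdb', 'features.pkl'):
            with open(os.path.join(model_dir, name), 'w') as handle:
                handle.write('placeholder\n')
        archive_relaxed_models('AF_sequences', 'AF_sequences.tar')
        with tarfile.open('AF_sequences.tar') as tar:
            archived = tar.getnames()
    finally:
        os.chdir(previous_dir)

print(archived)
```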

### 2.2. Preparing models - taking PDB files

After we get our AlphaFold results from the cluster, we need to copy them into a single folder, renaming each file with its corresponding protein name. To do that we run the following code:

In [7]:
# Create the structures folder if it does not exist
os.makedirs('structures', exist_ok=True)

# Copy each AlphaFold output model (of a specific rank) into the structures folder,
# naming each file after its model folder
rank = 0
for model in os.listdir('alphafold/output_models/'):
    ranked_pdb = os.path.join('alphafold/output_models', model, 'ranked_'+str(rank)+'.pdb')
    if os.path.exists(ranked_pdb):
        shutil.copyfile(ranked_pdb, os.path.join('structures', model+'.pdb'))

Now we can initialise the *proteinModels* class with our PDB files from the structures folder:

In [8]:
models = prepare_proteins.proteinModels('structures')

The library reads all PDB files as [biopython Bio.PDB.Structure()](https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ) objects and stores them in the structures attribute. This attribute is a dictionary whose keys are the protein model names and whose values are the Bio.PDB objects. The proteinModels object can be iterated to get one protein model name at each iteration:

In [9]:
for model in models:
    print(models.structures[model])

<Structure id=GPX_Bacillus-subtilis>
<Structure id=GPX_Gossypium-hirsutum>
<Structure id=GPX_Lactococcus-lactis>
<Structure id=GPX_Neisseria-meningitidis>
<Structure id=GPX_Staphylococcus-aureus>
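
The iteration pattern used above (looping over the object yields names, which are then used as dictionary keys) can be illustrated with a toy container. This is only a sketch of the pattern, not the *proteinModels* class: `ModelCollection`, the sorted ordering and the placeholder values are assumptions:

```python
class ModelCollection:
    """Toy container whose iteration yields model names."""

    def __init__(self, structures):
        # Dictionary mapping model names to structure objects (plain strings here)
        self.structures = structures

    def __iter__(self):
        # Iterating the container yields the model names in sorted order
        return iter(sorted(self.structures))


toy = ModelCollection({'model_B': 'structure B', 'model_A': 'structure A'})
names = [name for name in toy]
print(names)

# Each name is a key of the structures dictionary
values = [toy.structures[name] for name in toy]
print(values)
```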


## 3. System preparation

### 3.1 Removing low confidence regions from AlphaFold models at the protein termini

AlphaFold models can contain structural regions predicted with low confidence. Since these can represent large structural domains or segments, we are interested in removing them, especially if they are found at the N- and C-termini.

The **prepare_proteins** library has a method to remove terminal segments from AlphaFold structures using the confidence score stored in the B-factor column of the PDBs:

In [10]:
models.removeTerminiByConfidenceScore(confidence_threshold=90)

The confidence_threshold keyword indicates the confidence score at which to stop the trimming of terminal regions.
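
The idea behind terminal trimming can be sketched on a plain list of per-residue confidence scores. This is an assumption about the logic, not the library's code: residues are dropped from each end until one reaches the threshold, and low-confidence residues in the interior are left in place. The function name `trim_termini` and the scores are made up:

```python
def trim_termini(scores, confidence_threshold):
    """Return (start, end) so that scores[start:end] excludes low-confidence termini."""
    start = 0
    end = len(scores)
    # Advance from the N-terminus until a residue reaches the threshold
    while start < end and scores[start] < confidence_threshold:
        start += 1
    # Retreat from the C-terminus until a residue reaches the threshold
    while end > start and scores[end - 1] < confidence_threshold:
        end -= 1
    return start, end


# Made-up per-residue scores: weak termini and one weak internal residue
scores = [35.2, 61.0, 92.4, 95.1, 70.3, 96.8, 91.0, 55.7, 40.1]
start, end = trim_termini(scores, confidence_threshold=90)
kept = scores[start:end]
print(start, end, kept)
```

In this sketch the internal residue scoring 70.3 is kept, because only the termini are trimmed.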

At any point while modifying our proteins, it is a good idea to check that the structural changes have been carried out as expected. The library has a method for writing all the structures into a folder so we can visualise the state of our setup:

In [11]:
models.saveModels('trimed_models')

We can open the PDB files with any external program to check what the previous code did.

In the current state of the library, after some modifications on the structures, we need to re-initialise the *proteinModels* class using the models written to a folder with the saveModels() method. For this we simply repeat:

In [12]:
models = prepare_proteins.proteinModels('trimed_models')

### 3.2 Align structures to a reference PDB

When comparing related proteins, it is a good idea to align them so that they share a common structural framework. The library helps you align the proteins with the method alignModelsToReferencePDB(). We need to give a reference PDB (which, as in this case, can be any PDB from our models), then a folder where the aligned structures will be written, and the index (or indexes) of the chains to align (see the documentation inside the function for details on how the chain_indexes are given). For now we set the index to the first chain in the structure (chain_indexes=0) and run the alignment:

In [13]:
models.alignModelsToReferencePDB('trimed_models/GPX_Bacillus-subtilis.pdb', 'aligned_models', chain_indexes=0)

outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.490
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
    0 / 2
done.

Progressive alignment ... 
STEP     1 /1 
done.
tbfast (aa) Version 7.490
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

    0 / 2
Segment   1/  1    1- 159
done 001-001-1  identical.   
dvtditr (aa) Version 7.490
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
0 thread(s)


Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterative refinement me

We will continue working with the aligned structures. For that, we reload our output models into the library:

In [14]:
models = prepare_proteins.proteinModels('aligned_models/')



## 4. Prepwizard optimizations

After our protein models are correctly trimmed and aligned, we can continue with the Prepwizard optimization of the structures. We create this setup by calling the method setUpPrepwizardOptimization():

In [16]:
jobs = models.setUpPrepwizardOptimization('prepwizard')

Again, the method needs a folder name in which to put all input files for the calculations. The method returns the commands to be executed to run the Prepwizard optimization on a machine with a Schrödinger software license. The commands can be passed to the **bsc_calculations** library to generate scripts that facilitate their execution:

In [15]:
bsc_calculations.local.parallel(jobs, cpus=min([40, len(jobs)]))

We define the number of cpus we want to use beforehand, so the library will create one script for each CPU to be used. In our case, we are working with 5 files, so only 5 script files were created. 
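
One simple way to spread a list of commands over a fixed number of scripts is round-robin assignment. The sketch below illustrates that idea only; it is an assumption and not necessarily how **bsc_calculations** distributes the commands, and `split_jobs` is a hypothetical helper:

```python
def split_jobs(jobs, cpus):
    """Distribute commands over at most `cpus` groups in round-robin order."""
    n_groups = min(cpus, len(jobs))
    groups = [[] for _ in range(n_groups)]
    for index, command in enumerate(jobs):
        groups[index % n_groups].append(command)
    return groups


# Five placeholder commands, capped at 40 CPUs as in the call above
example_jobs = [f'echo "job {i}"' for i in range(1, 6)]
groups = split_jobs(example_jobs, cpus=min([40, len(example_jobs)]))
print(len(groups))

# With fewer CPUs than jobs, some groups receive more than one command
two_groups = split_jobs(example_jobs, cpus=2)
print([len(group) for group in two_groups])
```

With five commands and a cap of 40, the sketch produces five groups of one command each, which matches the five script files mentioned above.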