In this practice we will introduce the different tools needed to do a simple homology modelling experiment.
The steps include looking for template structures similar to our target sequences, doing a multiple alignment of the target and template sequences, and finally using this alignment and the template PDB files in Modeller to produce different models to be analyzed.
Let us start by importing some useful modules from Biopython.

0) Get Biopython first:
```
conda install biopython
```

In [1]:
from Bio.Blast import NCBIWWW 
from Bio import SeqIO

1) Let us start by reading all sequences from a FASTA file. In this case, we first download the file from http://predictioncenter.org/download_area/CASP13/sequences/ (it contains the query sequences of CASP13). More information is available at http://predictioncenter.org/casp13/targetlist.cgi


In [20]:
!wget http://predictioncenter.org/download_area/CASP13/sequences/casp13.seq.txt -O files/all_targets.fasta
flh = open('files/all_targets.fasta')
fasta_sequences = SeqIO.parse(flh,'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    print(name,sequence)
flh.close()

--2020-03-13 22:37:17--  http://predictioncenter.org/download_area/CASP13/sequences/casp13.seq.txt
Resolving predictioncenter.org... 128.120.136.155
Connecting to predictioncenter.org|128.120.136.155|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32102 (31K) [text/plain]
Saving to: 'files/all_targets.fasta'


2020-03-13 22:37:18 (155 KB/s) - 'files/all_targets.fasta' saved [32102/32102]

H0953 GALGSASIAIGDNDTGLRWGGDGIVQIVANNAIVGGWNSTDIFTEAGKHITSNGNLNQWGGGAIYCRDLNVS
H0953 MAVQGPWVGSSYVAETGQNWASLAANELRVTERPFWISSFIGRSKEEIWEWTGENHSFNKDWLIGELRNRGGTPVVINIRAHQVSYTPGAPLFEFPGDLPNAYITLNIYADIYGRGGTGGVAYLGGNPGGDCIHNWIGNRLRINNQGWICGGGGGGGGFRVGHTEAGGGGGRPLGAGGVSSLNLNGDNATLGAPGRGYQLGNDYAGNGGDVGNPGSASSAEMGGGAAGRAVVGTSPQWINVGNIAGSWL
H0957 SNSFEVSSLPDANGKNHITAVKGDAKIPVDKIELYMRGKASGDLDSLQAEYNSLKDARISSQKEFAKDPNNAKRMEVLEKQIHNIERSQDMARVLEQAGIVNTASNNSMIMDKLLDSAQGATSANRKTSVVVSGPNGNVRIYATWTILPDGTKRLSTVTGTFK
H0957 SNAMINVNSTAKDIEGLESYLANGYVEANSFNDPEDDALECLSNLLVKDSRGGLSFCKKILNSNNIDGVFIK
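
Note that the listing above repeats some identifiers (`H0953` and `H0957` each appear twice, apparently one record per subunit), so a plain dictionary keyed on the ID would keep only one sequence per name, and `SeqIO.to_dict` raises a `ValueError` on duplicate keys. A minimal sketch of one way to keep every sequence is shown here; `group_by_id` is a hypothetical helper, and the stand-in records only mimic the `.id` and `.seq` attributes of the records yielded by `SeqIO.parse`.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_id(records):
    """Map each record id to the list of sequences that share it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.id].append(str(record.seq))
    return dict(grouped)


# Stand-in records so the sketch runs without the downloaded file; with the
# real data you would pass SeqIO.parse(flh, 'fasta') instead.
demo_records = [
    SimpleNamespace(id="H0953", seq="GALGSASIAI"),  # truncated placeholder
    SimpleNamespace(id="H0953", seq="MAVQGPWVGS"),  # truncated placeholder
    SimpleNamespace(id="H0957", seq="SNSFEVSSLP"),  # truncated placeholder
]

for name, seqs in group_by_id(demo_records).items():
    print(name, len(seqs), "sequence(s)")
```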

2) Now, let us run BLAST for one particular query sequence, identified by `idname`, and keep the XML file.

In [17]:
idname = 'T0951'

In [18]:
with open('files/all_targets.fasta') as flh:
    for record in SeqIO.parse(flh, "fasta"):
        if record.id == idname:
            print("Found match for " + idname)
            filename = "files/" + idname + "_blast.xml"
            print("Calling BLAST against the PDB, and saving the XML file " + filename)
            # Remote blastp search against the PDB database (needs network access)
            result_handle = NCBIWWW.qblast("blastp", "pdb", record.seq, hitlist_size=100)
            # Save the raw XML so it can be parsed later without querying NCBI again
            with open(filename, "w") as blast_xml_fh:
                blast_xml_fh.write(result_handle.read())
            result_handle.close()
            print("Done!")

Found match for T0951
Calling BLAST against the PDB, and saving the XML file files/T0951_blast.xml
Done!
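
A remote BLAST search can take a while, so it may be worth skipping the call when the XML file already exists. The sketch below is one possible caching pattern, not part of Biopython: `fetch_cached` and `fake_blast` are hypothetical names, and the stand-in fetch function replaces the real `qblast` call so the example runs offline.

```python
import os
import tempfile


def fetch_cached(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only if it is missing."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path) as fh:
            return fh.read()
    text = fetch()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as fh:
        fh.write(text)
    return text


# Stand-in for the real search. With Biopython the fetch function could be
#   lambda: NCBIWWW.qblast("blastp", "pdb", record.seq, hitlist_size=100).read()
calls = []


def fake_blast():
    calls.append(1)
    return "<BlastOutput/>"


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "files", "demo_blast.xml")
    first = fetch_cached(demo_path, fake_blast)   # runs the (fake) search
    second = fetch_cached(demo_path, fake_blast)  # read back from disk

print("searches run:", len(calls))
```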


3) Once we have the XML file containing all the results from the BLAST search, we need to parse it.
Biopython provides [NCBIXML](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc97) for this.
We read the file created in the previous step with
```
result_handle = open("files/"+idname+"_blast.xml")
```
because the `result_handle` returned by `qblast` was already read and closed in the previous step, so it cannot be reused directly.

In [19]:
from Bio.Blast import NCBIXML
result_handle = open("files/"+idname+"_blast.xml")
blast_records=list(NCBIXML.parse(result_handle)) #putting the results into a list is convenient 
                                                 #to do some extra work with them

E_VALUE_THRESH = 0.001
for blast_record in blast_records:
    print(blast_record)
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print("\nALIGNMENT\n=========\n")
                print(alignment.title)
                print("E value:",hsp.expect)
                print(hsp.query[0:75]+"...")
                print(hsp.match[0:75]+"...")
                print(hsp.sbjct[0:75]+"...")

<Bio.Blast.Record.Blast object at 0x1067de290>

ALIGNMENT

pdb|5Z82|A Chain A, Hyposensitive to light 7 [Striga hermonthica] >pdb|5Z8P|A Chain A, Hyposensitive to light 7 [Striga hermonthica] >pdb|5Z95|A Chain A, Hyposensitive to light 7 [Striga hermonthica]
E value: 0.0
GPLGSMSSIGLAHNVTILGSGETTVVLGHGYGTDQSVWKLLVPYLVDDYKVLLYDHMGAGTTNPDYFDFDRYSSL...
GPLGSMSSIGLAHNVTILGSGETTVVLGHGYGTDQSVWKLLVPYLVDDYKVLLYDHMGAGTTNPDYFDFDRYSSL...
GPLGSMSSIGLAHNVTILGSGETTVVLGHGYGTDQSVWKLLVPYLVDDYKVLLYDHMGAGTTNPDYFDFDRYSSL...

ALIGNMENT

pdb|5Z89|A Chain A, Hyposensitive to light 7 [Striga hermonthica]
E value: 0.0
GPLGSMSSIGLAHNVTILGSGETTVVLGHGYGTDQSVWKLLVPYLVDDYKVLLYDHMGAGTTNPDYFDFDRYSSL...
GPLGSMSSIGLAHNVTILGSGETTVVLGHGYGTDQSVWKLLVPYLVDDYKVLLYDHMGAGTTNPDYFDFDRYSSL...
GPLGSMSSIGLAHNVTILGSGETTVVLGHGYGTDQSVWKLLVPYLVDDYKVLLYDHMGAGTTNPDYFDFDRYSSL...

ALIGNMENT

pdb|6A9D|A Chain A, Hyposensitive to light 7 [Striga hermonthica] >pdb|6A9D|B Chain B, Hyposensitive to light 7 [Striga hermonthica]
E value: 0.0
MSSIG
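
The hits printed above can also be collected into a small table for later use. The following is a sketch under some assumptions: it relies on the `title`, `hsps`, `expect`, `identities` and `align_length` attributes of the Biopython BLAST record objects, it assumes titles of the form `pdb|5Z82|A Chain A, ...` as in the output above, and `parse_pdb_title` and `significant_hits` are hypothetical helper names. The stand-in record with invented numbers only lets the example run without the XML file.

```python
from types import SimpleNamespace


def parse_pdb_title(title):
    """Return (pdb_id, chain) from a title such as 'pdb|5Z82|A Chain A, ...'."""
    first = title.split(">")[0]  # a title can list several identical entries
    _db, pdb_id, rest = first.split("|", 2)
    return pdb_id, rest.split()[0]


def significant_hits(blast_record, e_value_thresh=0.001):
    """List (pdb_id, chain, e_value, identity) for hits below the threshold."""
    hits = []
    for alignment in blast_record.alignments:
        if not alignment.hsps:
            continue
        best = min(alignment.hsps, key=lambda hsp: hsp.expect)
        if best.expect < e_value_thresh:
            pdb_id, chain = parse_pdb_title(alignment.title)
            identity = best.identities / best.align_length
            hits.append((pdb_id, chain, best.expect, identity))
    # Lowest E-value first, highest identity breaks ties
    return sorted(hits, key=lambda hit: (hit[2], -hit[3]))


# Stand-in record with invented numbers; with real data use blast_records[0].
demo_record = SimpleNamespace(alignments=[
    SimpleNamespace(
        title="pdb|1ABC|B Chain B, hypothetical weak hit",
        hsps=[SimpleNamespace(expect=0.5, identities=20, align_length=80)],
    ),
    SimpleNamespace(
        title="pdb|5Z82|A Chain A, Hyposensitive to light 7 >pdb|5Z8P|A Chain A",
        hsps=[SimpleNamespace(expect=0.0, identities=270, align_length=270)],
    ),
])

for hit in significant_hits(demo_record):
    print(hit)
```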

4) Now try to decide on a way to select the correct protein.

QUESTION: Should the selection be E-value based? A suggestion: use pairwise alignments and get the score using the `pairwise2` module in Biopython. Try
```
from Bio import pairwise2
```
and play with the module following the instructions in [this page](https://towardsdatascience.com/pairwise-sequence-alignment-using-biopython-d1a9d0ba861f).
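
To make clear what such a pairwise score measures, here is a plain-Python sketch of a global (Needleman-Wunsch) alignment score with a linear gap penalty. It is a teaching aid rather than a replacement for the Biopython module: the match, mismatch and gap values are arbitrary assumptions, real protein work would normally use a substitution matrix, and the candidate sequences are invented variants of the start of the target shown above.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


target = "GPLGSMSSIGLAHNV"
# Invented candidates, used only to illustrate the ranking
candidates = {
    "identical": "GPLGSMSSIGLAHNV",
    "one_change": "GPLGSMSSIGLAHNA",
    "truncated": "SMSSIGLAHNV",
}

scores = {name: global_alignment_score(target, seq) for name, seq in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Because a global alignment penalizes length differences, a local alignment may be the better choice when a template covers only part of the target.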

Write the selected sequences in FASTA format to a file; you will need it for the MSA.
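
One possible way to write that file is sketched below. `write_fasta` is a hypothetical helper, the record names and sequences are placeholders, and gap characters are stripped on the assumption that the sequences may come from `hsp.sbjct`, which holds only the aligned region of the hit and not necessarily the whole chain.

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Write (name, sequence) pairs to `path` in FASTA format."""
    with open(path, "w") as fh:
        for name, sequence in records:
            sequence = sequence.replace("-", "")  # drop alignment gaps
            fh.write(">" + name + "\n")
            for start in range(0, len(sequence), width):
                fh.write(sequence[start:start + width] + "\n")


# Placeholder records; in the practice these would be the target and the
# templates you selected from the BLAST hits.
selected = [("T0951", "GPLGSMSSIGLAHNV"), ("5Z82_A", "GPLGS-MSSIG")]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "selected.fasta")
    write_fasta(selected, demo_path, width=10)
    with open(demo_path) as fh:
        written = fh.read()

print(written)
```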

5) Next you will need Clustal Omega to run an MSA. You can get it through these instructions:
```
conda install -c bioconda clustalo 
conda install -c bioconda/label/cf201901 clustalo
```

QUESTION: Study the example below, then run an MSA for your selected sequences.

In [32]:
from Bio import AlignIO 
from Bio.Align.Applications import ClustalOmegaCommandline 

# Use these lines to ensure clustalo can be found. This works for my installation using
# the above conda instructions. expanduser is used because a literal '~' in PATH
# is not reliably expanded.
import os
os.environ['PATH'] += os.pathsep + os.path.expanduser('~/miniconda3/bin')

# This is an example using the complete collection of sequences. You should try it with your collection
# obtained from the BLAST calculation
infile = 'files/all_targets.fasta'
outfile = 'files/all_targets_aligned.fasta'
cline = ClustalOmegaCommandline(infile=infile, outfile=outfile, verbose=True, auto=True, force=True)
stdout, stderr = cline()

In [33]:
print(stdout)

Using 1 threads
Read 90 sequences (type: Protein) from files/all_targets.fasta
not more sequences (90) than cluster-size (100), turn off mBed
Setting options automatically based on input sequence characteristics (might overwrite some of your options).
Auto settings: Enabling mBed.
Auto settings: Setting iteration to 1.
Using 42 seeds (chosen with constant stride from length sorted seqs) for mBed (from a total of 90 sequences)
Calculating pairwise ktuple-distances...
Ktuple-distance calculation progress: 0 % (0 out of 2919)
Ktuple-distance calculation progress: 3 % (89 out of 2919)
Ktuple-distance calculation progress: 6 % (177 out of 2919)
Ktuple-distance calculation progress: 9 % (264 out of 2919)
Ktuple-distance calculation progress: 11 % (350 out of 2919)
Ktuple-distance calculation progress: 14 % (435 out of 2919)
Ktuple-distance calculation progress: 17 % (519 out of 2919)
Ktuple-distance calculation progress: 20 % (602 out of 2919)
Ktuple-distance calculation progress: 23 % (684 
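
Recent Biopython releases mark the `Bio.Align.Applications` wrappers as deprecated, so it can be useful to know how to call the program directly. The sketch below builds what should be the equivalent `clustalo` command with the standard `subprocess` module; `clustalo_command` is a hypothetical helper, and the command is only executed when both the program and the input file are found.

```python
import os
import shutil
import subprocess


def clustalo_command(infile, outfile, executable="clustalo"):
    """Build a Clustal Omega call mirroring the wrapper options used above."""
    # -i/-o: input and output files, --auto: automatic settings,
    # --force: overwrite an existing output file, -v: verbose
    return [executable, "-i", infile, "-o", outfile, "--auto", "--force", "-v"]


cmd = clustalo_command("files/all_targets.fasta", "files/all_targets_aligned.fasta")
print(" ".join(cmd))

# Only run when the program and the input file are actually available.
if shutil.which(cmd[0]) and os.path.exists(cmd[2]):
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    print(result.stdout)
else:
    print("clustalo or the input file was not found; command not executed")
```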

In [34]:
# you can check what you created:
!cat files/all_targets_aligned.fasta

>H0953 Phage S16 long tail fiber adhesin tip, Salmonella phage vB_SenMS16, subunit 1, 72 residues;
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------GALGSASIAIGDNDTGLRWGG
DGIVQIVA----------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--------------------------------

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------DYK--------------DDDDGAPKET
RGYGGDAPF-CTR--------------LNHSYTGMWAPERSAEARGNLTRPPGSGEDCGS
VSVAFPITM----------------------------------------------LLTGF
VGNAL-----------------------AMLLVSRS---------------YRRR-----
----------------------ESKRKKSFLLCI--GWL----ALTDLVGQL--------
--------

Once you have the list of PDB files and the alignment, you are ready to use MODELLER.
Check the [SaliLab web site](https://salilab.org) to register for the program and obtain the license key it requires.
```
conda config --add channels salilab
conda install modeller
```
On the same site you can find examples of use.
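
MODELLER reads alignments in PIR format rather than FASTA, so the aligned sequences need to be converted first. The following sketch formats one aligned sequence as a PIR entry; `pir_entry` is a hypothetical helper, the header layout (ten colon-separated fields, sequence terminated by `*`) follows the examples in the MODELLER tutorial as far as I recall them, and the default `0.00` values are assumptions, so check the manual before relying on it. Template entries additionally need the real first and last residue numbers and chain of the PDB file, which are deliberately left for you to fill in.

```python
def pir_entry(code, aligned_seq, kind="sequence", pdb_code="",
              start="", start_chain="", end="", end_chain="",
              resolution="0.00", r_factor="0.00", width=75):
    """Format one aligned sequence as a PIR entry of the kind MODELLER reads."""
    fields = [
        kind,                # 'sequence' for the target, 'structureX' for a template
        pdb_code or code,    # PDB code or file name
        str(start), start_chain,
        str(end), end_chain,
        "", "",              # protein name and source, left empty here
        resolution, r_factor,
    ]
    body = aligned_seq + "*"  # PIR sequences end with '*'
    lines = [body[i:i + width] for i in range(0, len(body), width)]
    return ">P1;" + code + "\n" + ":".join(fields) + "\n" + "\n".join(lines) + "\n"


# Target entry built from a short placeholder fragment of an alignment row
entry = pir_entry("T0951", "GPLGS--MSS")
print(entry)
```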

QUESTION: Can you now work out how to obtain your models with MODELLER?


Now you are ready to check the quality of your models with [SAVES](https://servicesn.mbi.ucla.edu/SAVES/)