<img src="https://raw.githubusercontent.com/RuneROe/git_color_by_similarity/master/logo.png" height="200" align="right" style="height:240px">

## <b><font color='#ff3333'>SIMalign </font></b>

SIMalign is a program that structurally compares and aligns protein structures based on similarity: conserved regions are used to superimpose the structures, while less conserved residues are excluded from the alignment. The program can use foldseek to find homologs of your protein of interest, but it can also accept user-specified homologs. Based on the structures and sequences of the homologs of your favorite protein structure, it can predict single mutations that might increase stability without interfering with the structure.

**Quick run:**
1.   Press "Runtime" -> "Run all".
2.   A button labeled "Choose Files" will appear. Press it and choose all the structures that you want to analyse (at least 3 structures, or use foldseek).
3.   If you upload more than one file, type in the name of the structure you want to use as the reference structure and press Enter.

In [None]:
# TO DO
# - Download thermophilic database -> DONE
#       - Check that it works and test other proteins
# - Change it so that the conserved part of the protein is mutated, rather than the non-exposed part

#@title Importing files

#@markdown Import at least 3 files or activate foldseek.


from google.colab import files
import sys
import os



# Removing old uploads
OK_files = {"tmp","foldseek","foldseek_output",".config", "condacolab_install.log", "__pycache__", "SIMalign_READY", "ThermoDB_READY", "thermoDB", "foldseek_output", "sample_data","findsurfaceatoms.py","foldseek_search.py","hotspot_finder.py","SIMalign.py","visualization.py"}
for file in os.listdir():
    # Keep whitelisted entries and names starting with "DB"; skip directories, which os.remove cannot delete
    if file not in OK_files and not file.startswith("DB") and os.path.isfile(file):
        os.remove(file)


# Checking if imported files are OK
infiles = files.upload()
infilenames = list(infiles.keys())


for i, file in enumerate(infilenames):
    if " " in file:
        new_key = file.replace(" ", "_")
        os.rename(file, new_key)  # os.rename copes with spaces, unlike an unquoted shell mv
        infilenames[i] = new_key # Removing white spaces
    if len(file) > 19:
        # Keep the first 15 characters of the name plus the file extension
        new_key = infilenames[i].split(".")[0][:15]+"."+infilenames[i].split(".")[-1]
        os.rename(infilenames[i], new_key)
        infilenames[i] = new_key # Making long names short
        


ref_structure = infilenames[0]
print("Choose a reference structure:")

# Prompt the user to choose a file
infile_set = set(infilenames)
if len(infilenames) > 1:
    while True:
        choice = input("Reference: ").lower()
        number = 0
        for file in infilenames:
            if file.lower().startswith(choice):
                number += 1
                ref_structure = file
        if number == 1:
            break
        elif number > 1:
            print("Not unique choice. Please enter full file name or remove files of similar names.")
        else:
            print("Invalid choice. Please enter a name of a file.")
print(f'Selected reference structure: {ref_structure}')
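
The file-name clean-up and the reference prompt above can also be written as two small pure functions, which makes the rules easy to try out without uploading anything. This is only a sketch: `clean_name` and `match_reference` are hypothetical helper names that do not exist in the SIMalign scripts, and the limits (19 and 15 characters) are copied from the cell above.

```python
def clean_name(name, max_len=19, stem_len=15):
    """Replace spaces with underscores and shorten long file names."""
    name = name.replace(" ", "_")
    if len(name) > max_len:
        parts = name.split(".")
        # Keep the first stem_len characters before the first dot, plus the extension
        name = parts[0][:stem_len] + "." + parts[-1]
    return name


def match_reference(choice, names):
    """Return every file name that starts with the typed text (case-insensitive)."""
    choice = choice.lower()
    return [name for name in names if name.lower().startswith(choice)]


names = [clean_name("my protein.pdb"), clean_name("a_very_long_structure_name.pdb")]
print(names)
print(match_reference("MY", names))
```

A choice is accepted only when `match_reference` returns exactly one name. Note that shortening can map two different long names to the same short name, so it may be worth checking the cleaned names for duplicates.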

In [None]:
#@title Download options
download_pymol = True #@param {type: "boolean"}
#@markdown  - `download_pymol` allows you to download a PyMOL session file of your aligned structures.
outfile_name = "outfile" #@param {type:"string"}

download_alignment = False #@param {type: "boolean"}
#@markdown  - `download_alignment` allows you to download an alignment file of your aligned structures in sequence format.
alignment_file_name = "alignment" #@param {type:"string"}

download_hotspot_file = False

download_foldseek_log = False


# Remove spaces from the output file name and add .pse
outfile_name = "_".join(outfile_name.split(" "))+".pse"

# Remove spaces from the alignment file name and add .aln
alignment_file_name = "_".join(alignment_file_name.split(" "))+".aln"

In [None]:
#@title PyMOL visualization options
color_mode = "similarity" #@param ["similarity", "hotspot", "none"] {type:"string"}
#@markdown  - `color_mode` specifies how the structures should be colored.
structure_format = "spheres-sticks" #@param ["spheres-sticks","cartoon","spheres","sticks"] {type:"string"}
#@markdown   - `structure_format` specifies how the structures should be shown in PyMOL.
show_in_pymol = "only_not_conserved" #@param ["only_not_conserved_core","only_core", "only_not_conserved","entire_chain_A","everything"] {type:"string"}
#@markdown   - `show_in_pymol` specifies which part of the structures will be shown in PyMOL.
color_by_element = True #@param {type: "boolean"}
#@markdown   - If `color_by_element` is ON, atoms will be colored by element in PyMOL.

In [None]:
#@title SIMalign options
max_iterations = 3 #@param {type:"integer"}
#@markdown  - `max_iterations` is the maximum number of alignment iterations. A high number can lead to slow runtime. Minimum 1.
min_aligned_aa = 100 #@param {type:"integer"}
#@markdown  - `min_aligned_aa` is the minimum number of amino acids that should be used for the alignment. A low number can lead to overfitting.
max_dist = 6 #@param {type:"integer"}
#@markdown  - `max_dist` is the maximum distance between two amino acids before the pair is considered a gap in the alignment. Too low a value can lead to false gaps, and too high a value can lead to false positives.
max_initial_rmsd = 5 #@param {type:"number"}
#@markdown  - `max_initial_rmsd` is the maximum allowed RMSD when a template structure is first superimposed onto the reference structure.
# remove_chain_duplicate = True #@param {type:"boolean"}
# For now it only takes chain A
#  #@markdown If `remove_chain_duplicate` is true, then chain duplicates are removed from the structure.
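
To make the roles of `max_dist` and `min_aligned_aa` concrete, here is a minimal sketch of a single filtering step: after a superposition, residue pairs further apart than `max_dist` are dropped, and the result is rejected if fewer than `min_aligned_aa` pairs remain. `filter_pairs` is a hypothetical helper and the coordinates are made up. How `SIMalign.run` actually applies these parameters is not shown in this notebook, so treat this as an illustration of the idea rather than its implementation.

```python
import math


def filter_pairs(ref_coords, mobile_coords, max_dist=6, min_aligned_aa=100):
    """Return indices of residue pairs within max_dist, or None if too few remain."""
    close = [
        i
        for i, (ref_xyz, mob_xyz) in enumerate(zip(ref_coords, mobile_coords))
        if math.dist(ref_xyz, mob_xyz) <= max_dist
    ]
    if len(close) < min_aligned_aa:
        return None  # hypothetical choice: too few pairs to trust the alignment
    return close


# Made-up coordinates for four paired residues (one atom per residue)
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
mobile = [(0.5, 0.0, 0.0), (4.0, 1.0, 0.0), (7.6, 9.0, 0.0), (11.0, 0.0, 2.0)]

kept = filter_pairs(ref, mobile, max_dist=6, min_aligned_aa=3)
print(kept)
```

In an iterative scheme, the kept pairs would be used for the next superposition, repeated at most `max_iterations` times.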

In [None]:
#@title Foldseek options
foldseek = True #@param {type:"boolean"}
#@markdown  - Activate foldseek by setting foldseek ON.
foldseek_database = "Thermophilic_DB" #@param ["Alphafold/UniProt50-minimal","Alphafold/Swiss-Prot","PDB","Thermophilic_DB"] {type:"string"}
#@markdown  - `foldseek_database` specifies which database should be used for the foldseek search.
foldseek_variable_tresshold = "number_of_structures" #@param ["number_of_structures","evalue","pident","fident","nident","alnlen","bits","mismatch","qcov","tcov","lddt","qtmscore","ttmscore","alntmscore","rmsd","prob"]
#@markdown  - `foldseek_variable_tresshold` specifies which foldseek variable should be used as the cutoff for structures. The default is "number_of_structures", which lets you specify how many of the top-performing structures should be downloaded.
foldseek_value_tresshold = 20  #@param {type:"number"}
#@markdown  - `foldseek_value_tresshold` specifies the cutoff value for the given foldseek variable.
foldseek_search_against = "ref_structure" #@param ["ref_structure","all_structures"]
#@markdown  - `foldseek_search_against` specifies whether only the reference structure or all structures should be used in the foldseek search.

#@markdown Read more about foldseek on https://github.com/steineggerlab/foldseek
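
The two threshold options can be pictured as a filter over a list of search hits. The sketch below is an assumption about how such a cutoff could work, not the code in `foldseek_search.py`: `select_hits` is a hypothetical helper, the hits are made-up dictionaries keyed by the variable names listed above, the list is assumed to be sorted best-first, and the choice of which variables count as "lower is better" is mine.

```python
# Assumption: for these variables a smaller value means a better hit
LOWER_IS_BETTER = {"evalue", "rmsd", "mismatch"}


def select_hits(hits, variable, value):
    """Keep the hits that pass the cutoff for the chosen variable."""
    if variable == "number_of_structures":
        return hits[: int(value)]  # top N hits, relying on best-first order
    if variable in LOWER_IS_BETTER:
        return [hit for hit in hits if hit[variable] <= value]
    return [hit for hit in hits if hit[variable] >= value]


# Made-up hits, sorted best-first
hits = [
    {"target": "A", "evalue": 1e-30, "qtmscore": 0.9},
    {"target": "B", "evalue": 1e-10, "qtmscore": 0.7},
    {"target": "C", "evalue": 0.5, "qtmscore": 0.3},
]

print([hit["target"] for hit in select_hits(hits, "number_of_structures", 2)])
print([hit["target"] for hit in select_hits(hits, "qtmscore", 0.8)])
```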

In [None]:
#@title Hotspot finding options
find_hotspots = True #@param {type: "boolean"}
#@markdown  - If `find_hotspots` is true, the program will find amino acids in the structure that can be mutated to potentially alter the stability of the protein.
print_hospots_from_structure = "ref_structure" #@param ["ref_structure","all_structures"]
#@markdown  - `print_hospots_from_structure` specifies whether the hotspots from only the reference structure or from all structures should be printed.
discard_exposed = True #@param {type: "boolean"}
#hotspot_min_size = 2 #@param {type: "integer"}
# For now, only single mutations are found
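
One common way to propose single mutations from a set of aligned homologs is a consensus approach: wherever most homologs agree on a residue that differs from the reference, that substitution becomes a candidate. The sketch below illustrates this generic technique only; it is not necessarily what `hotspot_finder.run` does, and `consensus_mutations`, `min_fraction` and the toy sequences are all hypothetical.

```python
from collections import Counter


def consensus_mutations(ref_seq, homolog_seqs, min_fraction=0.6):
    """Suggest (residue_number, reference_aa, consensus_aa) from aligned sequences."""
    suggestions = []
    residue_number = 0
    for column, ref_aa in enumerate(ref_seq):
        if ref_aa == "-":
            continue  # gap in the reference: nothing to mutate
        residue_number += 1
        observed = [seq[column] for seq in homolog_seqs if seq[column] != "-"]
        if not observed:
            continue
        consensus_aa, count = Counter(observed).most_common(1)[0]
        if consensus_aa != ref_aa and count / len(observed) >= min_fraction:
            suggestions.append((residue_number, ref_aa, consensus_aa))
    return suggestions


# Toy alignment: all sequences have the same aligned length
reference = "MK-AL"
homologs = ["MRGAL", "MRGAL", "MK-AL"]
print(consensus_mutations(reference, homologs))
```

With `discard_exposed` switched on, candidates at exposed positions would additionally be dropped; that step needs structural information and is left out here.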

In [None]:
#@title Install dependencies

import os
def get_script(script):
    raw_script = f"https://raw.githubusercontent.com/RuneROe/git_color_by_similarity/master/{script}.py"
    local_script_path = f"/content/{script}.py"
    os.system(f"wget {raw_script} -O {local_script_path}")

scripts = {"findsurfaceatoms","foldseek_search","hotspot_finder","SIMalign","visualization"}
if not os.path.isfile("SIMalign_READY"):
    print("installing pymol...")
    os.system("apt-get install pymol")
    print("installing py3Dmol...")
    os.system("pip install py3Dmol")    
    for s in scripts:
        get_script(s)
    print("installing biopython...")
    os.system("pip install biopython")
    print("installing foldseek...")
    os.system("wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz")
    os.system("tar xvzf foldseek-linux-avx2.tar.gz")
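    # An export inside os.system only affects that subshell, so set PATH for this process instead
    os.environ["PATH"] = os.path.join(os.getcwd(), "foldseek", "bin") + os.pathsep + os.environ["PATH"]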
    os.system("touch SIMalign_READY")
    print("Done")
else:
    print("Dependencies already installed.")
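
A shell `export` run through `os.system` changes the environment of a short-lived subshell only, so later commands will not see it. A minimal sketch of making a directory visible to the notebook process and its child processes is to edit `os.environ` directly. `prepend_to_path` is a hypothetical helper, and the `foldseek/bin` location is taken from the install cell above.

```python
import os


def prepend_to_path(directory):
    """Add directory to the front of PATH unless it is already listed."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        os.environ["PATH"] = os.pathsep.join([directory] + parts)
    return os.environ["PATH"]


foldseek_bin = os.path.join(os.getcwd(), "foldseek", "bin")
prepend_to_path(foldseek_bin)
prepend_to_path(foldseek_bin)  # calling it twice does not add a duplicate entry
```

Because the change lives in the running process, it has to be repeated after a runtime restart even if the downloaded files are still present.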

In [None]:
#@title Run prediction
import SIMalign


if len(infilenames) < 3 and not foldseek:
    print("ERROR: Import at least 3 files or activate foldseek.")
    sys.exit(1)
if max_iterations < 1:
    print("ERROR: max_iterations has to be 1 or higher")
    sys.exit(1)

if foldseek:
    import foldseek_search
    infilenames = foldseek_search.run(foldseek_database,foldseek_variable_tresshold,foldseek_value_tresshold,foldseek_search_against,ref_structure,infilenames)
    if len(infilenames) < 3:
        print("ERROR: At least 3 structures are needed. Try with less restrictive criteria for the foldseek search.")
        sys.exit(1)
len_ref_structure, score_list, structure_list, core_selection = SIMalign.run(ref_structure, infilenames, max_iterations, min_aligned_aa, max_dist, alignment_file_name, max_initial_rmsd)

if find_hotspots:
    import hotspot_finder
    hotspot_list, exposed_list = hotspot_finder.run(structure_list,alignment_file_name,discard_exposed)
    hotspot_finder.print_hotspot(hotspot_list,structure_list,print_hospots_from_structure)
else:
    hotspot_list, exposed_list = None, None


if color_mode == "hotspot" and not find_hotspots:
    print("ERROR: unable to color hotspots without finding them!")
else:
    import visualization
    visualization.run(color_mode,hotspot_list,score_list,structure_list,core_selection,exposed_list,structure_format,show_in_pymol,color_by_element)

from pymol import cmd
cmd.save(outfile_name)
if download_pymol:
    files.download(outfile_name)
if download_alignment:
    files.download(alignment_file_name)
else:
    print("Done")
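
The input checks at the top of this cell could also be gathered in one function that raises an exception carrying the message, which keeps the rules in one place and makes them testable. This is a sketch under that assumption: `validate_inputs` is a hypothetical name, and the messages are adapted from the checks above.

```python
def validate_inputs(n_structures, use_foldseek, max_iterations):
    """Raise ValueError if the settings cannot produce an alignment."""
    if n_structures < 3 and not use_foldseek:
        raise ValueError("Import at least 3 files or activate foldseek.")
    if max_iterations < 1:
        raise ValueError("max_iterations has to be 1 or higher")


# One uploaded structure is fine as long as foldseek supplies the homologs
validate_inputs(n_structures=1, use_foldseek=True, max_iterations=3)
```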

In [None]:
#@title Display reference structure
#@markdown The reference structure needs to be a PDB file in order to be visualized.

import visualization
view = visualization.show_pdb(ref_structure,color_mode,score_list,len_ref_structure,hotspot_list)
view.zoomTo()