<h1>Step 1: ligand and protein selection and preparation</h1>

**Choosing the protein**

For this project we chose the protein structure with PDB ID 5CNO, which can be downloaded at: <br>
https://www.rcsb.org/structure/5CNO <br>
The protein is saved in this repo as 5cno.pdb.
Using PyMOL, we removed solvent and organic molecules and added hydrogens.
The prepared structure is saved in this repo as 5cno_prepared.pdb.
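
The manual PyMOL steps above can also be scripted so that the preparation is reproducible. The sketch below is an assumption about how this could be automated, not a record of what was actually run: the function name `prepare_protein` is hypothetical, and the code relies on PyMOL's Python API (`pymol.cmd`) being installed.

```python
def prepare_protein(input_pdb="5cno.pdb", output_pdb="5cno_prepared.pdb"):
    """Strip solvent and organic molecules, add hydrogens, save the result."""
    # Imported lazily so the function can be defined without PyMOL installed.
    from pymol import cmd

    cmd.reinitialize()               # start from an empty session
    cmd.load(input_pdb, "protein")   # "protein" is an arbitrary object name
    cmd.remove("solvent")            # water molecules
    cmd.remove("organic")            # organic molecules such as bound ligands
    cmd.h_add("protein")             # add hydrogens
    cmd.save(output_pdb, "protein")
    return output_pdb
```

Calling `prepare_protein()` with the default arguments would read 5cno.pdb and write 5cno_prepared.pdb in the current directory.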

**Choosing the ligand library**

We have chosen the ligand library available for download at: <br>
https://enamine.net/compound-libraries/targeted-libraries/kinase-library <br>
The zipped ligand library is saved in this repo as Enamine_Kinase_Library_plated.zip.
Because of the limitations of our personal hardware, we randomly sampled 10,000 ligands with the sampling.py script
and saved the output as sampled_ligands.sdf.
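
The sampling.py script is not reproduced in this notebook. As an illustration only, random sampling of an SDF file can be done without a chemistry toolkit, because each record ends with a line containing `$$$$`. The function name `sample_sdf_records`, its parameters, and the fixed seed are assumptions made for this sketch, not the contents of the actual script.

```python
import random

def sample_sdf_records(sdf_text, n, seed=0):
    """Return n randomly chosen SDF records from sdf_text as a single string."""
    # Split the text into records; each record ends with a '$$$$' line.
    records, current = [], []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    chosen = rng.sample(records, min(n, len(records)))
    return "".join(chosen)
```

For a file on disk, the text would be read with `open(path).read()`, passed to the function, and the returned string written to sampled_ligands.sdf. For libraries too large to hold in memory, a streaming approach would be preferable.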

**Importing necessary libraries**

In [8]:
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors
from rdkit.Chem.rdMolAlign import GetBestRMS
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker
from rdkit.ML.Cluster import Butina
from rdkit import DataStructs
import random
import numpy as np
import json

Loading the sampled ligands and clustering them by the Tanimoto similarity of their MACCS fingerprints (Butina clustering), then adding hydrogens and Gasteiger charges to every ligand. Because of the same hardware limitations, we first dock only one representative of each cluster. The remaining members of the clusters whose representatives perform best will be docked later.
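
For reference, the Tanimoto similarity of two bit-vector fingerprints is the number of bits set in both, divided by the number of bits set in at least one of them. The distance used for clustering is one minus this value. Below is a minimal pure-Python illustration that represents each fingerprint as a set of on-bit indices. The function name `tanimoto` and the toy fingerprints are made up for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
similarity = tanimoto(fp1, fp2)   # 2 shared bits / 5 distinct bits = 0.4
distance = 1 - similarity         # 0.6
```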

In [9]:
# Loading the .sdf file
def load_ligands(sdf_file):
    suppl = Chem.SDMolSupplier(sdf_file)
    return [mol for mol in suppl if mol is not None]

# Computing MACCS fingerprints
def get_fingerprints(mols):
    return [rdMolDescriptors.GetMACCSKeysFingerprint(mol) for mol in mols]

# Computing the Tanimoto similarity matrix (helper for inspection; not used by process_ligands)
def tanimoto_similarity(fp_list):
    size = len(fp_list)
    sim_matrix = np.eye(size)  # a fingerprint has similarity 1 with itself
    for i in range(size):
        for j in range(i + 1, size):
            sim_matrix[i, j] = DataStructs.TanimotoSimilarity(fp_list[i], fp_list[j])
            sim_matrix[j, i] = sim_matrix[i, j]
    return sim_matrix

# Butina clustering on Tanimoto distances (1 - similarity)
def cluster_ligands(fp_list, cutoff=0.5):
    dists = []
    size = len(fp_list)
    for i in range(1, size):
        for j in range(i):
            dists.append(1 - DataStructs.TanimotoSimilarity(fp_list[i], fp_list[j]))
    clusters = Butina.ClusterData(dists, size, cutoff, isDistData=True)
    return clusters

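# Adding hydrogens and Gasteiger charges
def prepare_ligands(mols):
    prepared_mols = []
    for mol in mols:
        # addCoords=True places the new hydrogens using the existing conformer
        mol = Chem.AddHs(mol, addCoords=True)
        # Charges are stored per atom in the "_GasteigerCharge" property
        AllChem.ComputeGasteigerCharges(mol)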
        prepared_mols.append(mol)
    return prepared_mols

# Main function
def process_ligands(sdf_file):
    mols = load_ligands(sdf_file)
    fps = get_fingerprints(mols)
    clusters = cluster_ligands(fps)
    prepared_mols = prepare_ligands(mols)
    return clusters, prepared_mols

sdf_file = "sampled_ligands.sdf"
clusters, prepared_ligands = process_ligands(sdf_file)
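
To make the role of `cutoff` concrete, here is a simplified pure-Python sketch of the idea behind Butina clustering. It counts each item's neighbours within the distance cutoff, visits items from most to fewest neighbours, and lets each still-unassigned item become a centroid that claims its unassigned neighbours. This is an illustration under simplifying assumptions (a full distance matrix, ties broken by index), not RDKit's implementation, and `butina_sketch` is a hypothetical name.

```python
def butina_sketch(dist, cutoff):
    """Cluster items given a full symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Neighbours of i: all other items within the distance cutoff.
    neighbours = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff]
        for i in range(n)
    ]
    # Visit items with the most neighbours first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = [False] * n
    result = []
    for i in order:
        if assigned[i]:
            continue
        # i becomes a centroid and claims its still-unassigned neighbours.
        members = [i] + [j for j in neighbours[i] if not assigned[j]]
        for j in members:
            assigned[j] = True
        result.append(tuple(members))
    return result

# Toy example: items 0, 1, 2 are mutually close; item 3 is far from all.
toy = [
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.3, 0.4, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
toy_clusters = butina_sketch(toy, cutoff=0.5)  # [(0, 1, 2), (3,)]
```

A smaller cutoff yields more, tighter clusters: with `cutoff=0.25` the same toy matrix splits into `[(0, 1), (2,), (3,)]`. As in the tuples returned by `Butina.ClusterData`, the first index of each cluster in this sketch is its centroid.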


Number of clusters:

In [10]:
len(clusters)

43

Saving the clusters (tuples of ligand indices) as clusters.json.

In [11]:
def save_clusters_to_json(clusters, filename):
    # Convert tuple of tuples to list of lists for JSON serialization
    clusters_list = [list(cluster) for cluster in clusters]
    with open(filename, "w") as f:
        json.dump(clusters_list, f, indent=4)

save_clusters_to_json(clusters, "clusters.json")

# To load the clusters again, run:
# def load_clusters_from_json(filename):
#     with open(filename, "r") as f:
#         clusters_list = json.load(f)
#     # Convert each inner list back to a tuple (optional)
#     clusters = tuple(tuple(cluster) for cluster in clusters_list)
#     return clusters

Choosing a random ligand from each cluster and saving the selection as chosen_ligands.sdf.

In [12]:
def save_chosen_ligands_to_sdf(clusters, ligands, output_filename):
    writer = Chem.SDWriter(output_filename)
    for cluster in clusters:
        if cluster:  # Ensure the cluster is not empty
            chosen_index = random.choice(cluster)
            ligand = ligands[chosen_index]
            writer.write(ligand)
    writer.close()

save_chosen_ligands_to_sdf(clusters, prepared_ligands, "chosen_ligands.sdf")
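
Because `random.choice` draws from the global random state, rerunning the cell above can produce a different chosen_ligands.sdf each time. One way to make the selection reproducible is to use a dedicated, seeded generator. The helper name `pick_representatives` and the seed value are assumptions for this sketch. It only selects indices and leaves writing the SDF file to the existing function.

```python
import random

def pick_representatives(clusters, seed=42):
    """Pick one member index per non-empty cluster, reproducibly."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.choice(cluster) for cluster in clusters if cluster]

# Toy clusters of ligand indices, in the same shape as the Butina output.
example_clusters = ((0, 3, 5), (1,), (2, 4))
picks = pick_representatives(example_clusters)
```

An alternative that avoids randomness altogether is to take `cluster[0]`, which `Butina.ClusterData` reports as the centroid of each cluster.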

Saving all prepared ligands as prepared_ligands.sdf.

In [13]:
def save_prepared_ligands_to_sdf(prepared_ligands, output_filename):
    writer = Chem.SDWriter(output_filename)
    for ligand in prepared_ligands:
        writer.write(ligand)
    writer.close()

save_prepared_ligands_to_sdf(prepared_ligands, "prepared_ligands.sdf")