# Clustering and Downstream Analysis

In [2]:
import glob
import time
import numpy as np
import pandas as pd
import pickle
from contextlib import closing
from multiprocessing import Pool
import multiprocessing
from rdkit.Chem import AllChem
from rdkit import DataStructs
from rdkit import Chem
from functools import partial
import argparse
import os
import chemfp

Because we have a large number of molecules to cluster (3 million), we cannot use traditional Butina clustering with RDKit. Following https://www.macinchem.org/reviews/clustering/clustering.php, we can cluster the molecules with chemfp, which can handle larger libraries. We use the chemfp 1.x development line, which is the non-commercial version. Note that chemfp 1.x is **not compatible with Python 3**, so we have to create a separate environment that runs the code in **Python 2.7**. The steps to create the environment, install a compatible RDKit (a version before 2019), and finally install chemfp are shown below.

In [11]:
# conda create -y -n DD_protocol_py27 python=2.7
# conda activate DD_protocol_py27
# conda install -c rdkit rdkit=2018.09.1
# pip install chemfp

Now, to create compatible fingerprints from the SMILES of the molecules we want to cluster, we can run:

In [12]:
# sbatch --account=VENDRUSCOLO-SL3-CPU --partition=skylake --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 --wrap="rdkit2fps extracted_smiles.smi --fpSize 1024 --morgan --radius 2 --useChirality 1 > extracted_smiles.fps"
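To inspect the resulting `.fps` file without chemfp, a minimal sketch of a reader is shown here. It assumes the usual FPS text layout: header lines starting with `#`, then one record per line consisting of a hex-encoded fingerprint, a tab, and the identifier, with bit 0 stored as the lowest-order bit of the first byte. The function name `read_fps` and the two sample records (`mol_a`, `mol_b`) are made up for illustration and do not come from the real data.

```python
import io


def read_fps(handle):
    """Yield (identifier, set_of_on_bits) for each record in an FPS text stream."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/metadata lines
        hex_fp, identifier = line.split("\t")[:2]
        raw = bytearray.fromhex(hex_fp)
        # Assumption: bit 0 is the lowest-order bit of the first byte.
        on_bits = set(
            8 * byte_index + bit
            for byte_index, byte in enumerate(raw)
            for bit in range(8)
            if (byte >> bit) & 1
        )
        yield identifier, on_bits


# Hypothetical 16-bit example; the real file uses 1024-bit fingerprints.
sample = io.StringIO(u"#FPS1\n#num_bits=16\n0300\tmol_a\n0180\tmol_b\n")
records = list(read_fps(sample))
```

For real work, `chemfp.load_fingerprints` is the better choice, since it also validates the header and builds a searchable arena.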

And to get the clusters:

In [13]:
# sbatch --account=VENDRUSCOLO-SL3-CPU --partition=skylake --nodes=1 --ntasks=1 --cpus-per-task=15 --time=10:30:00 --wrap="python ../scripts_3/taylor_butina.py --profile --threshold 0.78 extracted_smiles.fps -o extracted_smiles_clusters.txt"
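For intuition about what the clustering step does, here is a minimal pure-Python sketch of the Taylor-Butina idea on a toy data set. It is not the `taylor_butina.py` script used above and it ignores refinements such as reassigning false singletons; the function names and the four toy fingerprints (sets of on-bit positions) are assumptions made for illustration. It builds the full neighbor list in memory, so it is only suitable for small inputs.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on bits."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0


def taylor_butina(fingerprints, threshold):
    """Cluster a dict {id: set_of_on_bits}; returns a list of (centroid, members)."""
    ids = sorted(fingerprints)
    # 1. Find, for every molecule, its neighbors at or above the threshold.
    neighbors = {i: set() for i in ids}
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    # 2. Visit molecules from most to fewest neighbors (ties keep sorted-id order).
    order = sorted(ids, key=lambda i: len(neighbors[i]), reverse=True)
    # 3. Each unassigned molecule becomes a centroid and claims its unassigned neighbors.
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = neighbors[centroid] - assigned
        assigned.add(centroid)
        assigned |= members
        clusters.append((centroid, sorted(members)))
    return clusters


# Toy fingerprints (hypothetical): a is similar to b and c; d is unrelated.
toy = {
    "a": set(range(1, 11)),
    "b": set(range(1, 10)) | {11},
    "c": set(range(1, 9)) | {10, 12},
    "d": {20, 21, 22},
}
clusters = taylor_butina(toy, threshold=0.78)
```

Here `a` becomes the centroid of a cluster containing `b` and `c`, while `d` has no neighbors and ends up as a singleton (a centroid with an empty member list).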

## Processing clustering results

1. Process clusters and singletons, separating isomers (which have "_" in their ID) from non-isomers

In [2]:
!cat process_clusters_and_singletons.sh

#!/bin/bash

# PROCESS SINGLETONS AND PREPARE THEIR IDS INTO A SEPARATE FILE

sed '5q;d' extracted_smiles_clusters_1024_full.txt > extracted_smiles_clusters_1024_singletons.txt
while IFS=" " read -r -a line; do printf "%s\n" "${line[@]}"; done < extracted_smiles_clusters_1024_singletons.txt > extracted_smiles_clusters_1024_singletons_clean_with_header.txt
tail -n +2 extracted_smiles_clusters_1024_singletons_clean_with_header.txt  > extracted_smiles_clusters_1024_singletons_clean.txt
rm extracted_smiles_clusters_1024_singletons_clean_with_header.txt
rm extracted_smiles_clusters_1024_singletons.txt
mv extracted_smiles_clusters_1024_singletons_clean.txt extracted_smiles_clusters_1024_singletons.txt

# SEPARATE MOLECULES TO ONES THAT ARE ISOMER AND THE ONES THAT ARE NOT
mkdir clustering_results/clusters_and_singletons
grep -v "_" clustering_results/extracted_smiles_clusters_1024.txt > clustering_results/clusters_and_singletons/clusters-no-isomers.txt
grep -v "_" clustering_results/extracte
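The isomer/non-isomer split done with `grep` above can be sketched in Python as follows. This is an illustration only: it assumes the same rule as the script (a line belongs to the isomer set when it contains "_"), and the function name `split_by_isomer` and the example IDs are hypothetical.

```python
def split_by_isomer(lines):
    """Split lines into (isomers, non_isomers) based on the presence of '_'."""
    isomers = [line for line in lines if "_" in line]
    non_isomers = [line for line in lines if "_" not in line]  # same rule as grep -v "_"
    return isomers, non_isomers


# Hypothetical IDs, one per line, as they might appear in a singletons file.
example_ids = ["ZINC000000000001", "ZINC000000000002_1", "ZINC000000000003"]
isomers, non_isomers = split_by_isomer(example_ids)
```

Like `grep`, this works on whole lines, so a line listing several IDs is treated as an isomer line as soon as any one of them contains "_".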

2. Prepare selected molecules (download or create them, depending on whether they are isomers or non-isomers)

In [3]:
!cat prepare_selected_molecules.sh

#!/bin/bash

folder=$1
n_cpus_per_node=$2
name_cpu_partition=$3
account_name=$4

# Create directory where to store ligands. Directory is called pdbqt despite us downloading SDFs as we are going to convert
# them later.
pdbqt_directory="pdbqt"
mkdir -p ${folder}/$pdbqt_directory  || { echo 'Error creating directory' ; exit 1; }


#### DO DOWNLOAD FOR ALL MOLECULES THAT ARE NOT HAVING ISOMERS ####
# For each file of form (*-no-isomers.txt) [* = clusters/singletons] perform 
# the download of sdfs for all ZINC IDs contained in them
echo "Downloading ligands for molecules that do not have geometric isomers"
for f in ${folder}/singletons-no-isomers.txt
do
   tmp="$f"
   filename="${tmp##*/}"
   set_type="${filename%%-*}" # clusters/singletons
   
   mkdir -p ${folder}/${pdbqt_directory}/${set_type}_download || { echo 'Error creating directory' ; exit 1; }
   mkdir -p ${folder}/${set_type}_set_scripts || { echo 'Error creating directory' ; exit 1; }
   
   # Create scripts to download SDFs o

## Useful bash commands

1. Get the line(s) that contain a given string (here "singletons")

In [None]:
# grep -hnr "singletons" extracted_smiles_clusters_1024_full.txt

## Testing fingerprints

In [3]:
m = Chem.MolFromSmiles("Cc1nc(on1)c2ccc(nc2)NCc3ccc(cc3)N4CCCC4")


In [5]:
# fp = AllChem.GetHashedMorganFingerprint(m, 2, nBits=2048)
# array = np.zeros((0,), dtype=np.int8)
# DataStructs.ConvertToNumpyArray(fp, array)
# print(array[array.nonzero()])

In [6]:
fp2 = AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024, useChirality=True)
array2 = np.zeros((0, ), dtype=np.int8)
DataStructs.ConvertToNumpyArray(fp2, array2)
print(array2.nonzero())

(array([   4,    8,   12,   23,   33,   36,   75,   80,  102,  128,  136,
        233,  248,  255,  265,  310,  356,  378,  381,  392,  407,  428,
        439,  456,  463,  511,  518,  607,  638,  656,  680,  687,  698,
        726,  730,  801,  831,  836,  849,  896,  897,  926,  935,  967,
        974,  980, 1023]),)


In [8]:
# arena = chemfp.load_fingerprints("clustering/testing_chemfp/test_smiles-isomers_1024.fps")

In [10]:

# Compare benzene with pyridine (identical SMILES would trivially give a similarity of 1.0)
# bz = Chem.MolFromSmiles('c1ccccc1')
# fp_bz = AllChem.GetMorganFingerprintAsBitVect(bz,radius=2,nBits=1024)
# pyr = Chem.MolFromSmiles('c1ccncc1')
# fp_pyr = AllChem.GetMorganFingerprintAsBitVect(pyr,radius=2,nBits=1024)
# print("Similarity:",DataStructs.TanimotoSimilarity(fp_bz,fp_pyr))

# print("intersection count:",(fp_bz&fp_pyr).GetNumOnBits())
# print("union count:",(fp_bz|fp_pyr).GetNumOnBits())
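The relation between the intersection count, the union count, and the Tanimoto similarity printed by the cell above can be checked without RDKit. The sketch below stores each fingerprint as a Python integer used as a bit mask, so `&` and `|` play the same role as on RDKit bit vectors. The helper names and the two sets of bit positions are chosen arbitrarily for illustration.

```python
def bits_to_int(on_bits):
    """Pack a list of on-bit positions into a single integer bit mask."""
    value = 0
    for bit in on_bits:
        value |= 1 << bit
    return value


def popcount(value):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(value).count("1")


# Two hypothetical fingerprints sharing bits 4 and 8.
fp_a = bits_to_int([4, 8, 12, 23])
fp_b = bits_to_int([4, 8, 33, 36])

intersection = popcount(fp_a & fp_b)        # bits set in both fingerprints
union = popcount(fp_a | fp_b)               # bits set in at least one fingerprint
similarity = float(intersection) / union    # Tanimoto = |A and B| / |A or B|
```

With these inputs the two fingerprints share 2 of 6 distinct on bits, so the similarity is 1/3.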