# SMILES → Tanimoto score computed on all fingerprints used in MolForge
Computes the Tanimoto coefficient between two columns of **SMILES** using the 15 **fingerprints** used in the MolForge paper.

**Input**: CSV with columns `SMILES_input` and `SMILES_output_ECFP4`.

**Output**: CSV with one row of mean Tanimoto scores in columns `MACCS`, `Avalon`, `RDK4`, `RDK4-L`, `HashAP`, `TT`, `HashTT`, `ECFP0`, `ECFP2`, `ECFP4`, `FCFP2`, `FCFP4`, `AEs`, `ECFP2*`, `ECFP4*`, plus the summary columns `Avg_Tc`, `Invalid (%)` and `String-exacts (%)`.
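
For reference, the Tanimoto coefficient of two fingerprints with on-bit sets A and B is |A ∩ B| / |A ∪ B|. A minimal pure-Python sketch of that definition follows. The function name `tanimoto_on_bits` and the example bit indices are illustrative only and are not part of the MolForge code; the notebook itself relies on RDKit's `DataStructs.TanimotoSimilarity`.

```python
def tanimoto_on_bits(bits_a, bits_b):
    """Tanimoto coefficient between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention assumed in this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices, chosen only for illustration.
fp_in = [25, 54, 80, 249]
fp_out = [25, 54, 80, 181]
print(tanimoto_on_bits(fp_in, fp_out))  # 3 shared bits out of 5 distinct -> 0.6
```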

## Imports

In [1]:
# To set the project root
import os

# pandas for the DataFrames
import pandas as pd
import numpy as np # for NaN

# Silence RDKit warnings
from rdkit import RDLogger
from rdkit import DataStructs
RDLogger.DisableLog("rdApp.*")

## Inputs (section to edit)

Project root

In [2]:
os.chdir("/export/home/ddiestre/MolForge_Testing")

In [3]:
# Import smiles_to_fingerprint from our source code, which uses RDKit
from src.smiles_to_fp import smiles_to_fingerprint, get_supported_fingerprints
from src.fingerprints import FpSimilarity

Fingerprint type into which the SMILES are converted

In [4]:
fp_type = "ECFP4"

Preprocessed fingerprints file (path relative to MolForge_Testing/)

In [5]:
# input_path = "data/MolForge_output/MolForge_MFoutput_1.csv"
input_path = "data/MolForge_output/CoCoGraph_MFoutput_1.csv"

SMILES_in_col_name = "SMILES_input"
SMILES_out_col_name = "SMILES_output_" + fp_type

File in which to save the output (path relative to MolForge_Testing/)

In [6]:
# output_path = "data/analysis_output/MolForge_MF_Analysis_1.csv"
output_path = "data/analysis_output/CoCoGraph_MF_Analysis_1.csv"

# Column that will hold the fingerprint version of the MolForge output
fp_out_col_name = "fingerprints_output_" + fp_type

## 1. Reading the file

In [7]:
# Read the input file
df = pd.read_csv(input_path, sep = ',', index_col = 0)
df.head(5)

id,SMILES_input,fingerprints_input_ECFP4,SMILES_output_ECFP4
1,CCCCCN(CCCCc1c2cc3sc2nn13)C(=O)c1ccc(OCC)c(Cl)c1,70 80 85 94 162 237 294 347 366 378 412 425 51...,CCCCCC1=C2C=C3C=C(C(=NN3C(=C2C=C4N1N=C5N4N=C6C...
2,O=C(Nc1n[nH]c2ccc(Br)cc12)N1CCCC1,74 91 119 122 218 328 369 378 650 708 728 766 ...,C1CCN(C1)C(=O)NC2=NNC3=C2C=C(C=C3)Br
3,CCN(C(=O)N(C)c1ccccc1OC)C(=Cc1ccc(OC)cc1)c1ccc...,25 54 80 249 252 294 322 376 431 650 694 695 7...,CCN(C1=CC=CC=C1OC)C(=O)N(C)C2=CC=CC=C2/C(=C/C3...
4,CC1(Cl)OC(=O)C1(C)Cl,99 113 314 650 656 667 1041 1057 1060 1135 127...,CC1(C2(C(C(=O)O1)(C(C(=O)O2)(Cl)Cl)Cl)C)C
5,CCc1cc(C=CC(C)=Cc2cc(OC)ccc2C)ccc1C,25 31 80 135 294 322 517 650 694 695 718 781 8...,CCC1=C(C=CC(=C1)/C=C/C(=C/C2=C(C=CC(=C2)OC)C)/C)C


## 2. SMILES_output column → fingerprints column

This part of the code serves only to view MolForge's results as fingerprints and to identify hallucinations.

In [8]:
# Apply the SMILES -> fingerprint converter
fingerprints = df[SMILES_out_col_name].apply(
    lambda s: smiles_to_fingerprint(s, fp_type=fp_type, n_bits=2048, return_bits=True)
)

# Store the fingerprints in the same format as the MolForge input
df[fp_out_col_name] = fingerprints.apply(
    lambda lst: " ".join(str(x) for x in lst) if isinstance(lst, list) else lst
)

# Display the new dataframe
df.head(12)

id,SMILES_input,fingerprints_input_ECFP4,SMILES_output_ECFP4,fingerprints_output_ECFP4
1,CCCCCN(CCCCc1c2cc3sc2nn13)C(=O)c1ccc(OCC)c(Cl)c1,70 80 85 94 162 237 294 347 366 378 412 425 51...,CCCCCC1=C2C=C3C=C(C(=NN3C(=C2C=C4N1N=C5N4N=C6C...,38 80 84 85 95 158 191 237 294 378 410 438 518...
2,O=C(Nc1n[nH]c2ccc(Br)cc12)N1CCCC1,74 91 119 122 218 328 369 378 650 708 728 766 ...,C1CCN(C1)C(=O)NC2=NNC3=C2C=C(C=C3)Br,74 91 119 122 218 328 369 378 650 708 728 766 ...
3,CCN(C(=O)N(C)c1ccccc1OC)C(=Cc1ccc(OC)cc1)c1ccc...,25 54 80 249 252 294 322 376 431 650 694 695 7...,CCN(C1=CC=CC=C1OC)C(=O)N(C)C2=CC=CC=C2/C(=C/C3...,25 54 80 181 249 294 322 376 557 650 694 695 7...
4,CC1(Cl)OC(=O)C1(C)Cl,99 113 314 650 656 667 1041 1057 1060 1135 127...,CC1(C2(C(C(=O)O1)(C(C(=O)O2)(Cl)Cl)Cl)C)C,86 113 149 256 314 438 488 650 656 726 780 915...
5,CCc1cc(C=CC(C)=Cc2cc(OC)ccc2C)ccc1C,25 31 80 135 294 322 517 650 694 695 718 781 8...,CCC1=C(C=CC(=C1)/C=C/C(=C/C2=C(C=CC(=C2)OC)C)/C)C,25 31 80 135 294 322 517 650 694 695 718 781 8...
6,CCc1ccc2ccc3[nH]c(=O)n(C)c(=O)[nH]c(=O)c4ccc(c...,2 80 119 145 203 294 310 314 345 437 552 564 6...,CCC1=CC2=C(C=C1)NC3=C2C=CC4=C3C5=C(C=C4)C6=C(C...,80 119 145 203 223 294 310 314 504 540 564 569...
7,CC(C)(C)C1C2CC3(CCCCC3)C3C(N2)C(C)(C)C2CCCCC2N13,2 16 101 114 293 322 380 392 431 519 524 549 5...,CC1([C@@H]2CCCC[C@H]2N3[C@H]1[C@H]4C5(CCCCC5)C...,2 16 85 237 261 293 299 380 519 549 579 663 67...
8,CC1(C)C(=O)NC(=O)N1c1ccc(C(F)(F)F)cc1Cl,114 228 314 366 379 561 650 713 809 875 935 94...,CC1(C(=O)NC(=O)N1C2=C(C=C(C=C2)C(F)(F)F)Cl)C,114 228 314 366 379 561 650 713 809 875 935 94...
9,c1ccc2c(c1)c(OCCN1CCCCC1)c1c3sc(nc32)S1,2 13 80 131 162 243 286 326 333 378 533 561 67...,C1CCN(CC1)CCOC2=C3C(=NC4=C2SC5=C4C=CC=C5C6=C(C...,2 13 31 80 131 137 162 277 333 352 378 461 533...
10,COc1ccc(OCC2c3ccccc3C(=O)N2C(C)=O)cc1,58 80 102 123 314 322 352 371 460 650 666 695 ...,CC(=O)N1C(C2=CC=CC=C2C1=O)COC3=CC=C(C=C3)OC,58 80 102 123 314 322 352 371 460 650 666 695 ...
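
The `fingerprints_*` columns store each fingerprint as a space-separated string of on-bit indices. A small plain-Python sketch of that round trip is given below. The helper names `bits_to_string` and `string_to_bits` are hypothetical, and the sketch assumes that failed conversions are kept as the literal string `"InvalidSMILE"`, as the check later in this notebook suggests.

```python
INVALID = "InvalidSMILE"  # assumed marker for SMILES that could not be parsed


def bits_to_string(bits):
    """Join a list of on-bit indices as 'i j k'; pass non-lists through unchanged."""
    if isinstance(bits, list):
        return " ".join(str(b) for b in bits)
    return bits


def string_to_bits(text):
    """Parse 'i j k' back into a sorted list of ints; return None for invalid rows."""
    if not isinstance(text, str) or text == INVALID:
        return None
    return sorted(int(tok) for tok in text.split())


# Hypothetical on-bit indices, for illustration only.
encoded = bits_to_string([74, 91, 119, 122])
print(encoded)                  # 74 91 119 122
print(string_to_bits(encoded))  # [74, 91, 119, 122]
print(string_to_bits(INVALID))  # None
```

Parsing the stored strings back into integer lists makes it possible to compare the input and output fingerprint columns without recomputing them from SMILES.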


## 3. Computing the Tanimoto coefficient between each pair of SMILES for all fingerprints

In [9]:
for col in get_supported_fingerprints():

    # Convert the input SMILES to fingerprints of type `col`
    fingerprints_input = df[SMILES_in_col_name].apply(
        lambda s: smiles_to_fingerprint(s, fp_type=col, n_bits=2048, return_bits=False)
    )

    # Convert the output SMILES to fingerprints of type `col`
    fingerprints_output = df[SMILES_out_col_name].apply(
        lambda s: smiles_to_fingerprint(s, fp_type=col, n_bits=2048, return_bits=False)
    )

    tanimotos = [
        DataStructs.TanimotoSimilarity(fp_in, fp_out)
        # Store NaN when the output fingerprint is missing or invalid
        # (only the output side is checked here)
        if (not pd.isna(fp_out) and fp_out != "InvalidSMILE")
        else np.nan
        for fp_in, fp_out in zip(fingerprints_input, fingerprints_output)
    ]

    df[col] = tanimotos

In [10]:
avg_tanimotos = {col: df[col].mean() for col in get_supported_fingerprints()}
Tanimoto_df = pd.DataFrame([avg_tanimotos], index=[fp_type])

Tanimoto_df

,MACCS,Avalon,RDK4,RDK4-L,HashAP,TT,HashTT,ECFP0,ECFP2,ECFP4,FCFP2,FCFP4,AEs,ECFP2*,ECFP4*
ECFP4,0.870685,0.756638,0.709819,0.710086,0.673509,0.68915,0.690776,0.798794,0.752878,0.68608,0.752126,0.692485,0.752878,0.818765,0.68306


## 4. Computing other parameters of interest

### Average Tanimoto score

In [11]:
Tanimoto_df["Avg_Tc"] = Tanimoto_df.mean(axis=1)
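
As a sanity check, `Avg_Tc` can be reproduced by hand from the 15 per-fingerprint means in the summary table above. The sketch below uses only those printed values, in plain Python rather than pandas. Note that the row-wise mean has to be taken before the `Invalid (%)` and `String-exacts (%)` columns are added, otherwise they would be averaged in as well.

```python
# Mean Tanimoto per fingerprint, copied from the summary table above.
mean_tc = {
    "MACCS": 0.870685, "Avalon": 0.756638, "RDK4": 0.709819,
    "RDK4-L": 0.710086, "HashAP": 0.673509, "TT": 0.68915,
    "HashTT": 0.690776, "ECFP0": 0.798794, "ECFP2": 0.752878,
    "ECFP4": 0.68608, "FCFP2": 0.752126, "FCFP4": 0.692485,
    "AEs": 0.752878, "ECFP2*": 0.818765, "ECFP4*": 0.68306,
}

# Unweighted mean across the fingerprint columns.
avg_tc = sum(mean_tc.values()) / len(mean_tc)
print(len(mean_tc), round(avg_tc, 6))  # 15 0.735849
```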

### Percentage of MolForge hallucinations

Note that MolForge hallucinations (outputs that are not valid SMILES) do not lower the average Tanimoto score: the coefficient cannot be computed for them, so they are stored as NaN and left out of the mean. Keeping track of the percentage of these errors is therefore important. This figure is also useful for evaluating the overall quality of MolForge.

In [12]:
Tanimoto_df["Invalid (%)"] = (df[fp_out_col_name] == "InvalidSMILE").mean() * 100
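
To make the effect concrete: pandas' `mean()` skips NaN by default, so an invalid row is left out of the average rather than counted as zero. The sketch below mimics that behaviour in plain Python on made-up scores (the values and variable names are hypothetical) and contrasts it with a stricter mean that scores invalid rows as 0.

```python
import math

# Hypothetical per-molecule Tanimoto scores; NaN marks an invalid output SMILES.
scores = [1.0, 0.5, math.nan, 0.75, math.nan]

valid = [s for s in scores if not math.isnan(s)]

mean_skipping_nan = sum(valid) / len(valid)   # what a NaN-skipping mean reports
mean_penalised = sum(valid) / len(scores)     # invalid rows counted as 0
invalid_pct = 100 * (len(scores) - len(valid)) / len(scores)

print(mean_skipping_nan)  # 0.75
print(mean_penalised)     # 0.45
print(invalid_pct)        # 40.0
```

Reporting the invalid percentage next to the NaN-skipping mean keeps both pieces of information visible.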

### Percentage of string-exacts

In [13]:
Tanimoto_df["String-exacts (%)"] = (df["ECFP4"] == 1).mean() * 100

## 5. Saving the dataframe

In [14]:
Tanimoto_df

,MACCS,Avalon,RDK4,RDK4-L,HashAP,TT,HashTT,ECFP0,ECFP2,ECFP4,FCFP2,FCFP4,AEs,ECFP2*,ECFP4*,Avg_Tc,Invalid (%),String-exacts (%)
ECFP4,0.870685,0.756638,0.709819,0.710086,0.673509,0.68915,0.690776,0.798794,0.752878,0.68608,0.752126,0.692485,0.752878,0.818765,0.68306,0.735849,0.0,40.0


In [15]:
Tanimoto_df.to_csv(output_path)