# Fingerprints to SMILES with MolForge (run in the MolForge-env environment)

1. Read the preprocessed fingerprint file, which has the columns "id" and "fingerprints".
2. Run MolForge on the fingerprints and store the results in a new "SMILES" column.
3. Save the results in `data/MolForge_output/`.

## Imports

In [1]:
# To set the project root
import os

# Pandas for the dataframes
import pandas as pd

# MolForge
from MolForge import main as molforge_main

# To run MolForge and read its output
import subprocess
import sys
import io
from contextlib import redirect_stdout

## Inputs

Project root

In [2]:
os.chdir("/export/home/ddiestre/MolForge_Testing")

Preprocessed fingerprint file (path relative to MolForge_Testing/)

In [3]:
input_path = "data/MolForge_input/test_1.csv"

File in which to save the output (path relative to MolForge_Testing/)

In [4]:
output_path = "data/MolForge_output/test_1_output.csv"

MolForge parameters

In [5]:
FP_NAME = "ECFP4"
MODEL_TYPE = "smiles"  # ["smiles", "selfies"]
DECODE = "greedy"  # ["greedy", "beam"]
CHECKPOINT_NAME = "ECFP4_smiles_checkpoint.pth"
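
Since each MolForge call is slow, it can help to catch a mistyped parameter before the loop starts. The following is a minimal sketch, assuming the only valid options are the ones listed in the comments of the cell above; `check_params` and the two `ALLOWED_*` sets are hypothetical helpers, not part of MolForge.

```python
# Allowed values, copied from the comments in the parameter cell (assumption:
# these lists are complete)
ALLOWED_MODEL_TYPES = {"smiles", "selfies"}
ALLOWED_DECODERS = {"greedy", "beam"}


def check_params(model_type, decode):
    """Raise ValueError if a parameter is outside the documented options."""
    if model_type not in ALLOWED_MODEL_TYPES:
        raise ValueError(
            f"MODEL_TYPE must be one of {sorted(ALLOWED_MODEL_TYPES)}, got {model_type!r}"
        )
    if decode not in ALLOWED_DECODERS:
        raise ValueError(
            f"DECODE must be one of {sorted(ALLOWED_DECODERS)}, got {decode!r}"
        )
    return True


# The values used in this notebook pass the check
check_params("smiles", "greedy")
```

In the notebook this would be called as `check_params(MODEL_TYPE, DECODE)` right after the parameter cell.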

## 1. Reading the file

In [6]:
# Read the file, using the first column ("id") as the index
df = pd.read_csv(input_path, sep=",", na_filter=False, index_col=0)
df.head(5)

id,fingerprints
1,1 80 94 114 237 241 255 294 392 411 425 695 74...
2,97 101 314 378 389 442 501 650 728 817 896 909...
3,9 45 78 89 145 203 322 548 586 650 695 718 760...
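
Each value in the "fingerprints" column appears to be a string of space-separated integers. A quick sanity check before running MolForge is to parse one row into a list of integers. The sketch below assumes that format; `parse_fingerprint` is a hypothetical helper, and the notebook itself passes the raw string to MolForge unchanged.

```python
def parse_fingerprint(fp_string):
    """Split a space-separated fingerprint string into a list of ints.

    Raises ValueError if any token is not an integer.
    """
    return [int(token) for token in fp_string.split()]


# First values of row 1 as displayed above (the full row is truncated in the output)
example = "1 80 94 114 237 241"
print(parse_fingerprint(example))  # [1, 80, 94, 114, 237, 241]
```

Applied to the whole column, `df["fingerprints"].map(parse_fingerprint)` would raise on the first malformed row.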


## 2. Running MolForge

In [26]:
def run_for_fp(fp, fp_name, model_type, checkpoint_name, decode="greedy"):
    """Run predict.py on one fingerprint; return the predicted string, or NaN if none."""
    # Pass the arguments as a list so that no shell quoting is needed
    terminal_call_molforge = [
        "conda", "run", "-n", "MolForge_env", "python", "predict.py",
        f"--input={fp}", f"--fp={fp_name}", f"--model_type={model_type}",
        f"--checkpoint={checkpoint_name}", f"--decode={decode}",
    ]
    molforge_output = subprocess.check_output(terminal_call_molforge, text=True)

    # The prediction is on the line starting with "Result:"; remove the spaces in it
    for line in molforge_output.splitlines():
        line = line.strip()
        if line.startswith("Result:"):
            return line.split("Result:", 1)[1].strip().replace(" ", "")

    return float('nan')
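
The parsing step of `run_for_fp` can be tried without launching MolForge. The sketch below isolates it in a hypothetical helper, `extract_result`, and feeds it made-up text; the sample output is an assumption for illustration only, not a real MolForge log.

```python
import math


def extract_result(output_text):
    """Return the text after "Result:" with spaces removed, or NaN if absent."""
    for line in output_text.splitlines():
        line = line.strip()
        if line.startswith("Result:"):
            return line.split("Result:", 1)[1].strip().replace(" ", "")
    return float("nan")


# Hypothetical output: one unrelated line followed by a "Result:" line
sample_output = "some other line\nResult: C C O\n"
print(extract_result(sample_output))  # CCO

# With no "Result:" line, the function falls through to NaN
print(math.isnan(extract_result("nothing useful here")))  # True
```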

In [27]:
smiles_out = []  # this will become our SMILES column
len_df = len(df)
# Iterate over the values directly so the loop does not depend on the ids being 1..N
for i, fp in enumerate(df['fingerprints'], start=1):
    s = run_for_fp(fp, FP_NAME, MODEL_TYPE, CHECKPOINT_NAME, DECODE)
    smiles_out.append(s)
    print(f"\r[{i}/{len_df}]", end="", flush=True)  # progress indicator

[3/3]
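
Because `run_for_fp` returns NaN when no "Result:" line is found, it is worth counting how many fingerprints produced no prediction before saving. This is a minimal sketch using only the standard library; `count_failures` is a hypothetical helper and the example list is made up.

```python
import math


def count_failures(smiles_list):
    """Count entries that are NaN, meaning no prediction was returned."""
    return sum(1 for s in smiles_list if isinstance(s, float) and math.isnan(s))


# Made-up list with one missing prediction
example = ["CCO", float("nan"), "c1ccccc1"]
print(count_failures(example))  # 1
```

In the notebook, `count_failures(smiles_out)` would report the number of rows to inspect.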

## 3. Saving the output

In [25]:
# Add the predictions to the dataframe as a new column
df['SMILES'] = smiles_out
df.head(5)

id,fingerprints,SMILES
1,1 80 94 114 237 241 255 294 392 411 425 695 74...,CCOC1=C(C=C(C=C1)C(C(C)(C)C)N)OCC
2,97 101 314 378 389 442 501 650 728 817 896 909...,C1=CC=C(C=C1)C2C=C(NC(=O)C2C3=NC(=S)NC(=O)C34C...
3,9 45 78 89 145 203 322 548 586 650 695 718 760...,COC1=CC=C(C=C1)C(=O)C2=C(C(=C3N2C4=CC=CC=C4C=C...


In [10]:
df.to_csv(output_path)
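
`df.to_csv` does not create missing directories, so the save fails if `data/MolForge_output/` does not exist yet. The sketch below shows one way to guard against that; `ensure_parent_dir` is a hypothetical helper, and the demonstration writes into a temporary directory so that it has no side effects on the project.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return it."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return parent


# Demonstration in a throwaway directory that mirrors the notebook's layout
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "MolForge_output", "test_1_output.csv")
    created = ensure_parent_dir(target)
    print(os.path.isdir(created))  # True
```

In the notebook, calling `ensure_parent_dir(output_path)` just before `df.to_csv(output_path)` would be enough.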