<a href="https://colab.research.google.com/github/DelWow/RippenApril2025/blob/main/Main_DeepMol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Important**

Before running any of the code, I recommend taking a quick look at the guide I have started writing for this notebook. Make sure your runtime is set to a T4 GPU, because Colab defaults to a CPU runtime.

**First Cells**

In [None]:
#### This is very important: do not restart the runtime while this cell is still running. Restart ONLY after the cell has finished, even if Colab offers a restart option before then.
# This cell is still messy and needs cleanup, but it is enough to start generating molecules and featurizing them
# Setup
!pip install numpy
!pip install pandas==2.1.4
!pip install rdkit-pypi
!pip install deepmol
!pip install deepmol scikit-learn
!pip install git+https://github.com/samoturk/mol2vec#egg=mol2vec
!pip install biosynfoni
!pip install -r requirements.txt
!pip install --upgrade --force-reinstall "numpy<2" "pandas<=2.0.3" "scikit-learn<1.6" "rdkit==2023.9.6" "deepmol"
!pip install dill
#Restart runtime after this cell is done running
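
After the restart, it can help to confirm which versions actually ended up installed before running the later cells. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the pins in the cell above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names mirror the pins in the install cell above.
report = installed_versions(["numpy", "pandas", "scikit-learn", "rdkit", "deepmol"])
for name, ver in report.items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```

If a version does not match the pin, rerunning the install cell and restarting the runtime again is the first thing to try.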

In [None]:
# Run the top cell first
# I do not recommend running this cell more than once per runtime, because it can cause errors with the REINVENT4 file paths
# Install REINVENT4
!git clone --depth 1 https://github.com/MolecularAI/REINVENT4.git
%cd REINVENT4

#CPU or GPU install
#!python install.py cpu -d none #This installs for the cpu
!python install.py cu124 -d none #This installs for the gpu



**Second Cells**

In [None]:
# These 3 imports are the most likely to cause failures
import numpy as np
import pandas as pd
from pandas.io.formats import format as pandas_format

# Without this check, RDKit does not work for some reason
# RDKit only needs the attribute to exist; the return value is never used for the featurization. A proper fix for this error will be implemented soon.
if not hasattr(pandas_format, 'get_adjustment'):
    def get_adjustment(self, col, cell):
        return None
    pandas_format.get_adjustment = get_adjustment

#Continue imports
from rdkit import Chem
from rdkit.Chem import AllChem
import deepmol
from deepmol.loaders import CSVLoader
from deepmol.models import SklearnModel
from deepmol.compound_featurization import MorganFingerprint
from rdkit import DataStructs


# Load dataset (assuming a CSV file with SMILES and labels)
loader = CSVLoader(dataset_path='/content/enumerated_smiles (1).csv', smiles_field="Enumerated_SMILES") # This is for our existing database
dataset = loader.create_dataset()
genLoader = CSVLoader(dataset_path='/content/generated.csv', smiles_field="SMILES") #This is meant for newly generated SMILES
genDataset = genLoader.create_dataset()
print(dataset.get_shape()) #Print shape, used for debugging purposes
print(genDataset.get_shape()) #Print shape, used for debugging purposes
%cd /content

# Apply molecular featurization
featurizer = MorganFingerprint(radius=2, size=128) # Morgan radius 2 corresponds to ECFP4 (diameter 4); radius and size can be changed
featurizer.featurize(dataset, inplace=True)
featurizer.featurize(genDataset, inplace=True)

#Creates a new csv file and transfers featurized smiles into it
features_df = pd.DataFrame(dataset.X)
features_df['SMILES'] = dataset.smiles
features_df.to_csv('featurized_dataset.csv', index=False)
print("featurized_dataset.csv has been generated.")

## The next portion computes similarities (potentially wrong, needs review). This code can take a long time to run, depending on the fingerprint bit length

#genDataset.X and dataset.X are DeepMol-featurized (binary fingerprints)
def tanimoto_np(fp1, fp2):
    intersection = np.dot(fp1, fp2)
    return intersection / (np.sum(fp1) + np.sum(fp2) - intersection + 1e-10)  # 1e-10 is added to avoid division by zero; it is not part of the Tanimoto formula

# Compute pairwise Tanimoto similarity
similarities = []
for gen_fp in genDataset.X:
    row = [tanimoto_np(gen_fp, db_fp) for db_fp in dataset.X]
    similarities.append(row)

# Writes the similarities to a new CSV
similarities_df = pd.DataFrame(similarities)
similarities_df.to_csv("tanimoto_similarities.csv", index=False)
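
The nested Python loop above evaluates one pair at a time, which is why it gets slow as the datasets grow. Below is a sketch of the same calculation done with matrix operations, assuming both inputs are 2-D arrays of 0/1 fingerprints with the same bit length. The function name `tanimoto_matrix` is hypothetical and not part of DeepMol; the `eps` term plays the same role as the `1e-10` in `tanimoto_np`.

```python
import numpy as np

def tanimoto_matrix(gen_fps, db_fps, eps=1e-10):
    """Pairwise Tanimoto similarity; rows index gen_fps, columns index db_fps."""
    a = np.asarray(gen_fps, dtype=np.float64)
    b = np.asarray(db_fps, dtype=np.float64)
    intersection = a @ b.T                 # shared on-bits for every pair
    bits_a = a.sum(axis=1)[:, None]        # on-bits per generated fingerprint
    bits_b = b.sum(axis=1)[None, :]        # on-bits per database fingerprint
    return intersection / (bits_a + bits_b - intersection + eps)

# Toy example with 4-bit fingerprints
gen = [[1, 1, 0, 0], [0, 0, 1, 1]]
db = [[1, 1, 0, 0], [1, 0, 1, 0]]
sim = tanimoto_matrix(gen, db)
best_match = sim.max(axis=1)   # highest similarity to the database per generated molecule
print(sim.shape, best_match)
```

In the notebook this would be called as `tanimoto_matrix(genDataset.X, dataset.X)`. The result holds one float per pair, so memory grows with the product of the two dataset sizes.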



**Third Cell**

In [None]:
### Note: this cell will probably fail the first time it runs but should run properly the second time. It takes about 3-5 minutes with the default parameters.
#Needed imports
import pandas as pd, textwrap, subprocess, os, sys

SMILES_PATH = "/content/enumerated_smiles (1).csv"    # change if your file lives elsewhere
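LIMIT = 10000  # number of entries to read; set LIMIT = None to read the whole file

# The file has a header row (the second cell loads it with smiles_field="Enumerated_SMILES"), so select the column by name
smiles = pd.read_csv(SMILES_PATH, nrows=LIMIT)["Enumerated_SMILES"].astype(str)
# Write into the REINVENT4 folder so the relative "train.smi" path in the config below resolves
smiles.to_csv("/content/REINVENT4/train.smi", index=False, header=False)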
print(f"Prepared train.smi with {len(smiles)} molecules")
%cd /content/REINVENT4

# Write a small transfer-learning config (GPU). NOTE: refer to the user guide to understand how epochs and batch size work
tl_toml = textwrap.dedent("""
    run_type = "transfer_learning"
    device   = "cuda:0"

    [parameters]
    num_epochs  = 10
    batch_size  = 64
    input_model_file       = "priors/reinvent.prior"
    smiles_file            = "train.smi"
    validation_smiles_file = "train.smi"
    output_model_file      = "tiny.model"
""")
with open("my_tl.toml", "w") as f:
    f.write(tl_toml)


# Fine-tune the RNN
os.chdir('/content/REINVENT4')
# The next line can cause a fair number of errors, usually because the 'reinvent' file is not being read properly.
# Rerunning this cell usually solves the problem. A proper fix is still needed, but right now I do not know why this error occurs.
subprocess.run(["reinvent", "-l", "tl.log", "my_tl.toml"], check=True)


# Sampling config (GPU). Note: change num_smiles to the number of SMILES you want generated
sample_toml = textwrap.dedent("""
    run_type = "sampling"
    device   = "cuda:0"

    [parameters]
    model_file   = "tiny.model"
    output_file  = "generated.csv"
    num_smiles   = 500
    unique_molecules = true
    randomize_smiles = true
""")
with open("sample.toml", "w") as f:
    f.write(sample_toml)

# generate new molecules
subprocess.run(["reinvent", "-l", "sample.log", "sample.toml"], check=True)

print("\n  Done! 500 molecules saved to generated.csv") # The file is at /content/REINVENT4/generated.csv; refer to the user guide to understand how the NLL works
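
The sampling run writes `generated.csv` inside `/content/REINVENT4`, while the featurization and validity cells read `/content/generated.csv`. One way to bridge the two is a plain copy, sketched here with the standard library. The helper name `copy_output` is hypothetical, and the sketch assumes that overwriting an older copy at the destination is acceptable.

```python
import shutil
from pathlib import Path

def copy_output(src, dst_dir):
    """Copy src into dst_dir and return the new path; raise if src is missing."""
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; did the sampling run finish?")
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dst_dir / src.name))

# In the notebook this would be:
# copy_output("/content/REINVENT4/generated.csv", "/content")
```

Raising early on a missing source file gives a clearer message than a later `read_csv` failure.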


**Validity Reports on SMILES**

In [None]:
# This cell is optional; it only checks whether our SMILES are valid
# It validates each SMILES with RDKit and writes 3 CSVs: one with all SMILES and their validity, one with only the valid SMILES, and one with only the invalid SMILES

df = pd.read_csv('/content/generated.csv')
def isValid(smi):
  return Chem.MolFromSmiles(smi) is not None

df['Valid'] = df['SMILES'].apply(isValid)
df.to_csv("SmilesValidity.csv", index = False)

df[df['Valid']].to_csv("SmilesThatAreValid.csv", index = False)
df[~df['Valid']].to_csv("SmilesThatAreInvalid.csv", index = False)
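
Alongside the three CSVs, a one-line summary of how many SMILES passed can be handy when comparing runs. The sketch below is independent of RDKit: the validator is passed in as an argument, so in the notebook it would be `isValid` from the cell above. The function name `validity_summary` is hypothetical.

```python
def validity_summary(smiles, is_valid):
    """Count valid and invalid entries using the supplied validator function."""
    flags = [bool(is_valid(s)) for s in smiles]
    total = len(flags)
    valid = sum(flags)
    return {
        "total": total,
        "valid": valid,
        "invalid": total - valid,
        "valid_fraction": valid / total if total else 0.0,
    }

# Toy validator for illustration only; in the notebook use:
# validity_summary(df['SMILES'], isValid)
summary = validity_summary(["CCO", "bad", "c1ccccc1", "bad"], lambda s: s != "bad")
print(summary)
```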


**Troubleshooting**

The following are some of the common troubleshooting methods I used to find errors in my code. Refer to the user guide to see how they are used.

In [None]:
%cd /content/REINVENT4
!python install.py cu124 -d none     # or cu121 if cu124 fails

In [None]:
# Check that you are running on a GPU
import torch
assert torch.cuda.is_available(), "still no GPU!"

In [None]:
# Move a misplaced file or folder into the REINVENT4 directory (source and destination are separated by a space)
%mv /content/YOURFILEORFOLDER /content/REINVENT4/


In [None]:
# Checks whether you are on a GPU and whether the nvidia-smi table prints
import torch, os, subprocess, platform
print("torch cuda available:", torch.cuda.is_available())
print("CUDA version       :", torch.version.cuda)
subprocess.run(["nvidia-smi"], check=False)   # prints a table on GPU runtimes


In [None]:
# Prints the first 50 lines of tl.log (run from /content/REINVENT4, where the log is written)
!head -n 50 tl.log
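
The command above uses `head`, which prints the beginning of the file. To see the most recent entries instead, a small Python sketch follows; the helper name `tail_lines` is hypothetical, and it assumes the log is a plain text file.

```python
from collections import deque

def tail_lines(path, n=50):
    """Return the last n lines of a text file without their trailing newlines."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the final n lines while streaming the file
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]

# In the notebook this would be:
# print("\n".join(tail_lines("/content/REINVENT4/tl.log", 50)))
```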

In [None]:
!which reinvent
!reinvent --help
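
The same check can be done from Python before calling `subprocess.run`, which gives a clearer message than a failed subprocess. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_executable` is hypothetical. It only reports whether the executable is on `PATH` and does not explain why an install might be missing.

```python
import shutil

def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)

path = find_executable("reinvent")
if path is None:
    print("'reinvent' is not on PATH; rerun the REINVENT4 install cell, then try again.")
else:
    print(f"'reinvent' found at {path}")
```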