# Prepare datasets


This notebook prepares the datasets from the data available at:<br>
https://github.com/gfm-collab/chemprop-IR?tab=readme-ov-file

In this case, we compute the SELFIES representation of each SMILES string and add it to the datasets.

For this purpose, we use the `selfies` library:<br>
https://selfies.readthedocs.io/en/latest/index.html

## Libraries


In [4]:
# !pip install -U transformers
# !pip install -U datasets
# !pip install -U huggingface_hub
# !pip install selfies

In [5]:
import ast
import os

import datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import selfies as sf
from sklearn.model_selection import train_test_split

## Load data

The CSV files can be downloaded from: https://github.com/gfm-collab/chemprop-IR?tab=readme-ov-file#data

The cells below load the train, validation, and test splits that were previously saved to disk as `.hf` datasets.


In [6]:
data_folder = "/storage/smiles2spec_data"  # path to the dataset folder

# computed spectra ("comp") or experimental spectra ("exp")
suffix = "exp"

In [8]:
train_hf = datasets.load_from_disk(os.path.join(data_folder, f"train_{suffix}.hf"))
val_hf = datasets.load_from_disk(os.path.join(data_folder, f"val_{suffix}.hf"))
test_hf = datasets.load_from_disk(os.path.join(data_folder, f"test_{suffix}.hf"))
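
The three splits follow one naming pattern, `{split}_{suffix}.hf`, so the paths can be built by a small helper that also rejects an unexpected `suffix`. The sketch below is an illustration only: `split_path`, `VALID_SUFFIXES`, and the `with_selfies` flag are hypothetical names, not part of the original notebook.

```python
import os

# Allowed values for the spectra type: computed or experimental.
VALID_SUFFIXES = ("comp", "exp")


def split_path(data_folder, split, suffix, with_selfies=False):
    """Build the on-disk path of one dataset split (hypothetical helper)."""
    if suffix not in VALID_SUFFIXES:
        raise ValueError(f"suffix must be one of {VALID_SUFFIXES}, got {suffix!r}")
    # The saved output of this notebook uses the "_with_selfies" infix.
    tag = "_with_selfies" if with_selfies else ""
    return os.path.join(data_folder, f"{split}{tag}_{suffix}.hf")


# Paths of the three input splits for the experimental spectra.
paths = {
    split: split_path("/storage/smiles2spec_data", split, "exp")
    for split in ("train", "val", "test")
}
```

Each resulting path could then be passed to `datasets.load_from_disk`, as in the cell above.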

## Add SELFIES

In [9]:
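def add_selfies(sample):
    # Encode the SMILES string as SELFIES; mark failed conversions with the
    # sentinel "error" so they can be counted and filtered out afterwards.
    try:
        sample["selfies"] = sf.encoder(sample["smiles"])
    except Exception:
        sample["selfies"] = "error"

    return sample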
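
The sentinel pattern used by `add_selfies` can be exercised without the full dataset by injecting the encoder. The following is a minimal sketch under stated assumptions: `make_add_selfies`, `toy_encoder`, and `ERROR_TOKEN` are hypothetical names, and `toy_encoder` is a stand-in that does not produce real SELFIES. In the notebook itself, `sf.encoder` would be passed instead.

```python
ERROR_TOKEN = "error"  # same sentinel string as in the notebook


def make_add_selfies(encoder):
    """Return a per-sample function that stores encoder(smiles) or the sentinel."""
    def add_selfies_with(sample):
        try:
            sample["selfies"] = encoder(sample["smiles"])
        except Exception:
            # Any encoder failure is recorded rather than raised.
            sample["selfies"] = ERROR_TOKEN
        return sample
    return add_selfies_with


def toy_encoder(smiles):
    """Hypothetical stand-in for sf.encoder: wraps each character in brackets."""
    if not smiles:
        raise ValueError("empty SMILES")
    return "".join(f"[{ch}]" for ch in smiles)


add_selfies_toy = make_add_selfies(toy_encoder)
ok = add_selfies_toy({"smiles": "CCO"})
bad = add_selfies_toy({"smiles": ""})
```

Here `ok["selfies"]` holds the toy encoding and `bad["selfies"]` holds the sentinel, which mirrors how failed conversions are flagged before filtering.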

In [10]:
train_hf = train_hf.map(add_selfies)
val_hf = val_hf.map(add_selfies)
test_hf = test_hf.map(add_selfies)


In [17]:
# checking number of failed SMILES -> SELFIES conversions

train_hf["selfies"].count("error"), val_hf["selfies"].count("error"), test_hf["selfies"].count("error")

(0, 0, 0)
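
The check above counts the `"error"` sentinel in each split. A minimal sketch of the same count written against any iterable of strings follows; `count_failures` and the toy columns are hypothetical and only illustrate the idea.

```python
def count_failures(selfies_column, sentinel="error"):
    """Count entries equal to the sentinel in an iterable of strings."""
    return sum(1 for value in selfies_column if value == sentinel)


# Toy columns standing in for the "selfies" column of each split.
toy_columns = {
    "train": ["[C][C][O]", "error", "[C]"],
    "val": ["[C][=O]"],
    "test": [],
}
failures = {name: count_failures(column) for name, column in toy_columns.items()}
```

Because the helper only iterates, it does not depend on the column being a Python list.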

In [18]:
len(train_hf), len(val_hf), len(test_hf)

(48000, 6000, 6000)

In [19]:
# remove samples whose SMILES -> SELFIES conversion failed

train_hf = train_hf.filter(lambda sample: sample['selfies'] != "error")
val_hf = val_hf.filter(lambda sample: sample['selfies'] != "error")
test_hf = test_hf.filter(lambda sample: sample['selfies'] != "error")


In [20]:
len(train_hf), len(val_hf), len(test_hf)

(48000, 6000, 6000)

In [17]:
print(val_hf[123]["smiles"], type(val_hf[123]["smiles"]))
print(val_hf[123]["selfies"], type(val_hf[123]["selfies"]))

COC(=O)c1cncc(N2CCN3CCCCC3C2)n1 <class 'str'>
[C][O][C][=Branch1][C][=O][C][=C][N][=C][C][Branch1][#C][N][C][C][N][C][C][C][C][C][Ring1][=Branch1][C][Ring1][#Branch2][=N][Ring1][S] <class 'str'>
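
As the printed sample shows, a SELFIES string is a sequence of bracketed symbols. A rough sanity check on that shape can be sketched with a regular expression. This is an assumption-laden illustration, not the `selfies` library's own tokenizer: `TOKEN_RE`, `split_tokens`, and `looks_tokenized` are hypothetical names.

```python
import re

# One bracketed symbol, e.g. "[C]" or "[=Branch1]".
TOKEN_RE = re.compile(r"\[[^\[\]]+\]")


def split_tokens(selfies_str):
    """Return the bracketed symbols of a SELFIES-like string, in order."""
    return TOKEN_RE.findall(selfies_str)


def looks_tokenized(selfies_str):
    """True if the string is non-empty and consists only of bracketed symbols.

    Dots between fragments are tolerated; anything else left over fails the check.
    """
    leftover = TOKEN_RE.sub("", selfies_str).replace(".", "")
    return bool(selfies_str) and leftover == ""


sample = "[C][O][C][=Branch1][C][=O]"
tokens = split_tokens(sample)
```

A check like `looks_tokenized` would also reject the `"error"` sentinel, since it contains no bracketed symbols.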


## Save datasets

In [22]:
train_hf.save_to_disk(os.path.join(data_folder, f"train_with_selfies_{suffix}.hf"))
test_hf.save_to_disk(os.path.join(data_folder, f"test_with_selfies_{suffix}.hf"))
val_hf.save_to_disk(os.path.join(data_folder, f"val_with_selfies_{suffix}.hf"))
