# Introduction to DeepChem DataLoaders

This tutorial introduces the `DataLoader` class that DeepChem uses to load and prepare data for machine learning. DataLoaders automate the process of reading files, converting molecules to numerical features, handling missing values, and managing memory efficiently. Understanding DataLoaders is essential for working with your own datasets in DeepChem.

## Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EKssZcHHf59GkvaZlXzuEZ6jwV7laDpf?usp=sharing)

In [12]:
!pip install --pre deepchem



In [13]:
import deepchem as dc
import pandas as pd
import numpy as np

print(f"DeepChem version: {dc.__version__}")

DeepChem version: 2.8.1.dev


## What is a DataLoader?

A DataLoader is DeepChem's tool for preparing raw data files for machine learning. It handles the entire pipeline from file to ML-ready dataset:

```
Raw File (CSV/SDF/JSON) → DataLoader → Dataset → Model Training
```

When working with molecular data, you typically need to load chemical structures from files, convert them to numerical features, extract labels, handle missing or invalid data, and manage memory for large datasets. Writing this code manually is time-consuming and error-prone. DataLoaders automate all of these steps.

In this tutorial, we'll explore how to use DataLoaders effectively with your own data.

## CSVLoader: The Basics

The most commonly used DataLoader is `CSVLoader`, which handles tabular data. Let's start with a simple example.

In [14]:
# Create a simple molecular dataset
data = {
    'compound_id': ['mol1', 'mol2', 'mol3', 'mol4', 'mol5'],
    'smiles': ['CCO', 'CC', 'C', 'CCCC', 'CCC'],
    'solubility': [-0.77, -1.38, -0.33, -1.69, -1.00],
    'toxicity': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)
df.to_csv('example_data.csv', index=False)

print("Example dataset:")
print(df)

Example dataset:
  compound_id smiles  solubility  toxicity
0        mol1    CCO       -0.77         0
1        mol2     CC       -1.38         1
2        mol3      C       -0.33         0
3        mol4   CCCC       -1.69         1
4        mol5    CCC       -1.00         0


In [15]:
# Create a featurizer to convert SMILES to numerical features
featurizer = dc.feat.CircularFingerprint(size=1024)

# Create the CSVLoader
loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],  # Columns to use as prediction targets (can be multiple)
    feature_field='smiles',            # Column containing molecules
    featurizer=featurizer,             # How to convert molecules to features
    id_field='compound_id'             # Column containing identifiers
)

# Load the data
dataset = loader.create_dataset('example_data.csv')

print(f"Dataset created with {len(dataset)} samples")
print(f"Tasks: {dataset.tasks}")
print(f"Feature shape: {dataset.X.shape}")  # (5, 1024) - 5 molecules, 1024 features each
print(f"Label shape: {dataset.y.shape}")     # (5, 2) - 5 molecules, 2 tasks
print(f"Weight shape: {dataset.w.shape}")    # (5, 2) - weights for each label

Dataset created with 5 samples
Tasks: ['solubility' 'toxicity']
Feature shape: (5, 1024)
Label shape: (5, 2)
Weight shape: (5, 2)




The DataLoader automatically performed several operations:

1. **Featurization**: Converted SMILES strings into 1024-dimensional fingerprint vectors using the `CircularFingerprint` featurizer (extended-connectivity fingerprints, ECFP)
2. **Label extraction**: Extracted the values from both task columns (solubility and toxicity) as labels
3. **Weight creation**: Created a weight array with the same shape as the labels, where a nonzero weight marks a valid label
4. **ID preservation**: Kept the compound IDs for reference
5. **Data cleaning**: Rows with invalid SMILES are filtered out, and missing label values are set to 0 with a weight of 0, which tells the model to ignore them during training. The example file above contains neither, so all five rows are kept
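
The label and weight convention from item 5 can be illustrated without DeepChem. The sketch below is plain Python and is not the library's implementation; the helper name `labels_and_weights` and the toy rows are hypothetical. It shows a missing label becoming a 0.0 placeholder with weight 0.0, while every label that is present gets weight 1.0.

```python
import csv
import io

# Toy CSV with two missing labels (hypothetical data, not from the tutorial file).
RAW = """compound_id,smiles,solubility,toxicity
mol1,CCO,-0.77,0
mol2,CC,,1
mol3,C,-0.33,
"""


def labels_and_weights(rows, tasks):
    """Return (y, w) where a missing label is stored as 0.0 with weight 0.0."""
    y, w = [], []
    for row in rows:
        y_row, w_row = [], []
        for task in tasks:
            value = row[task].strip()
            if value == "":
                # Placeholder label; the zero weight marks it as "ignore me".
                y_row.append(0.0)
                w_row.append(0.0)
            else:
                y_row.append(float(value))
                w_row.append(1.0)
        y.append(y_row)
        w.append(w_row)
    return y, w


rows = list(csv.DictReader(io.StringIO(RAW)))
y, w = labels_and_weights(rows, ["solubility", "toxicity"])
print(y)
print(w)
```

A loss that multiplies each per-label error by its weight then receives no contribution from the placeholder zeros, which is why a 0 label with a 0 weight is safe to store.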

The `tasks` parameter accepts a list of column names. You can specify one task for single-task learning or multiple tasks to train a model that predicts several properties simultaneously. Let's examine the data:

In [16]:
print("First molecule's fingerprint (first 20 values):")
print(dataset.X[0][:20])

print("\nLabels for all molecules:")
print(dataset.y)
print("(Each row: [solubility, toxicity])")

print("\nMolecule IDs:")
print(dataset.ids)

First molecule's fingerprint (first 20 values):
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Labels for all molecules:
[[-0.77  0.  ]
 [-1.38  1.  ]
 [-0.33  0.  ]
 [-1.69  1.  ]
 [-1.    0.  ]]
(Each row: [solubility, toxicity])

Molecule IDs:
['mol1' 'mol2' 'mol3' 'mol4' 'mol5']


## Working with Real Datasets

Let's load a real molecular dataset to see how DataLoaders work at scale.

In [17]:
import os
import urllib.request

# Download the Delaney solubility dataset
dataset_file = 'delaney-processed.csv'

if not os.path.exists(dataset_file):
    url = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv'
    urllib.request.urlretrieve(url, dataset_file)

df_delaney = pd.read_csv(dataset_file)
print(f"Delaney dataset: {len(df_delaney)} molecules")
print(f"\nColumns: {df_delaney.columns.tolist()}")
print(f"\nFirst few rows:")
print(df_delaney.head())

Delaney dataset: 1128 molecules

Columns: ['Compound ID', 'ESOL predicted log solubility in mols per litre', 'Minimum Degree', 'Molecular Weight', 'Number of H-Bond Donors', 'Number of Rings', 'Number of Rotatable Bonds', 'Polar Surface Area', 'measured log solubility in mols per litre', 'smiles']

First few rows:
  Compound ID  ESOL predicted log solubility in mols per litre  \
0   Amigdalin                                           -0.974   
1    Fenfuram                                           -2.885   
2      citral                                           -2.579   
3      Picene                                           -6.618   
4   Thiophene                                           -2.232   

   Minimum Degree  Molecular Weight  Number of H-Bond Donors  Number of Rings  \
0               1           457.432                        7                3   
1               1           201.225                        1                2   
[output truncated]

In [18]:
loader_delaney = dc.data.CSVLoader(
    tasks=['measured log solubility in mols per litre'],
    feature_field='smiles',
    featurizer=dc.feat.CircularFingerprint(size=1024)
)

dataset_delaney = loader_delaney.create_dataset(dataset_file)

print(f"Loaded {len(dataset_delaney)} molecules")
print(f"Feature shape: {dataset_delaney.X.shape}")
print(f"Task: {dataset_delaney.tasks}")



Loaded 1128 molecules
Feature shape: (1128, 1024)
Task: ['measured log solubility in mols per litre']




## Memory Management: Sharding

For very large datasets that don't fit in memory, DataLoaders use a technique called "sharding". The data is processed in chunks (shards), and each shard is saved to disk separately. This allows you to work with datasets much larger than your available RAM.

In [19]:
# Control shard size for memory management
dataset_sharded = loader_delaney.create_dataset(
    dataset_file,
    shard_size=50  # Process 50 molecules at a time
)

print(f"Total molecules: {len(dataset_sharded)}")
print(f"Number of shards: {dataset_sharded.get_number_shards()}")
print(f"Shard size: {dataset_sharded.get_shard_size()}")
print(f"Data directory: {dataset_sharded.data_dir}")



Total molecules: 1128
Number of shards: 23
Shard size: 50
Data directory: /tmp/tmpsqh6wwr0




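The default shard size is 8192 rows. With 1128 molecules and `shard_size=50`, the loader writes 23 shards: 22 full shards of 50 molecules plus a final shard of 28. Use smaller values if you encounter memory issues, or larger values if you have plenty of memory and want faster processing.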
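
The shard count reported above follows from simple chunking arithmetic. The sketch below is plain Python rather than DeepChem code, and the helper `iter_shards` is a hypothetical name; it only mimics the idea of cutting a dataset into consecutive chunks of at most `shard_size` items.

```python
import math


def iter_shards(items, shard_size):
    """Yield consecutive chunks holding at most shard_size items each."""
    for start in range(0, len(items), shard_size):
        yield items[start:start + shard_size]


n_molecules = 1128  # size of the Delaney dataset used above
shard_size = 50

shards = list(iter_shards(list(range(n_molecules)), shard_size))

print(len(shards))                          # number of chunks produced
print(math.ceil(n_molecules / shard_size))  # same count, computed directly
print(len(shards[0]), len(shards[-1]))      # first and last chunk sizes
```

Only one chunk needs to be held in memory at a time, which is the property that lets sharded datasets exceed available RAM.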

## Other DataLoader Types

While CSVLoader is the most common, DeepChem provides specialized loaders for other file formats:

- **SDFLoader**: For 3D molecular structures in SDF format (commonly used in drug discovery)
- **InMemoryLoader**: For data already loaded in Python (lists, arrays, DataFrames)
- **JsonLoader**: For JSON files with nested data structures
- **ImageLoader**: For image data (microscopy, cell images)
- **FASTALoader, FASTQLoader**: For DNA/protein sequences

Let's see how to load data from a different format. Here's an example using SDFLoader with an actual molecular dataset:

In [20]:
# Download a dataset in SDF format
sdf_file = 'example_mols.sdf'

if not os.path.exists(sdf_file):
    url = 'https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/gdb1k.sdf'
    urllib.request.urlretrieve(url, sdf_file)

# Load using SDFLoader
sdf_loader = dc.data.SDFLoader(
    tasks=[],  # No prediction tasks for this example
    featurizer=dc.feat.CircularFingerprint(size=1024),
    sanitize=True  # Clean and validate molecular structures
)

dataset_sdf = sdf_loader.create_dataset(sdf_file)

print(f"Loaded {len(dataset_sdf)} molecules from SDF file")
print(f"Feature shape: {dataset_sdf.X.shape}")

[17:30:05] Explicit valence for atom # 4 N, 4, is greater than permitted
[17:30:05] ERROR: Could not sanitize molecule ending on line 6814
[17:30:05] ERROR: Explicit valence for atom # 4 N, 4, is greater than permitted
[17:30:05] Explicit valence for atom # 2 N, 4, is greater than permitted
[17:30:05] ERROR: Could not sanitize molecule ending on line 6940
[17:30:05] ERROR: Explicit valence for atom # 2 N, 4, is greater than permitted
[17:30:05] Explicit valence for atom # 4 N, 4, is greater than permitted
[17:30:05] ERROR: Could not sanitize molecule ending on line 7543
[17:30:05] ERROR: Explicit valence for atom # 4 N, 4, is greater than permitted
[17:30:05] Explicit valence for atom # 4 N, 4, is greater than permitted
[17:30:05] ERROR: Could not sanitize molecule ending on line 12488
[17:30:05] ERROR: Explicit valence for atom # 4 N, 4, is greater than permitted
[17:30:05] Explicit valence for atom # 3 N, 4, is greater than permitted
[output truncated]

Loaded 991 molecules from SDF file
Feature shape: (991, 1024)
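
The RDKit messages above indicate that some records failed sanitization, so the loaded count can be smaller than the number of records in the file. One way to check is to count the `$$$$` lines that terminate each SDF record and compare that with `len(dataset_sdf)`. The sketch below is plain Python with a toy string; the record bodies are placeholders rather than valid molblocks, and `count_sdf_records` and `n_loaded` are hypothetical names.

```python
# Toy SDF-like text: three records, each terminated by a "$$$$" line.
# The bodies are placeholders, not chemically valid molblocks.
TOY_SDF = """\
mol_a
placeholder molblock lines
M  END
$$$$
mol_b
placeholder molblock lines
M  END
$$$$
mol_c
placeholder molblock lines
M  END
$$$$
"""


def count_sdf_records(text):
    """Count records by counting the '$$$$' terminator lines."""
    return sum(1 for line in text.splitlines() if line.strip() == "$$$$")


n_records = count_sdf_records(TOY_SDF)
n_loaded = 2  # hypothetical: pretend one record failed sanitization
print(f"{n_records} records in file, {n_records - n_loaded} skipped")
```

For a real file you would read its contents with `open(sdf_file).read()` and use `len(dataset_sdf)` in place of `n_loaded`.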




## DataLoaders in MoleculeNet

In previous tutorials, you may have seen code like `tasks, datasets, transformers = dc.molnet.load_delaney()` followed by `train, valid, test = datasets`. MoleculeNet functions use DataLoaders internally. When you call `dc.molnet.load_delaney()`, it's using CSVLoader behind the scenes to load and process the data.

You can use MoleculeNet for convenience with standard datasets, or use DataLoaders directly when you need custom processing or are working with your own data.

In [21]:
# MoleculeNet approach (convenient)
tasks_mol, datasets_mol, transformers = dc.molnet.load_delaney(
    featurizer='ECFP',
    splitter='random'
)
train_mol, valid_mol, test_mol = datasets_mol

print("MoleculeNet approach:")
print(f"Training set: {len(train_mol)} molecules")

# Manual DataLoader approach (more control)
loader = dc.data.CSVLoader(
    tasks=['measured log solubility in mols per litre'],
    feature_field='smiles',
    featurizer=dc.feat.CircularFingerprint(size=1024)
)

full_dataset = loader.create_dataset(dataset_file)
splitter = dc.splits.RandomSplitter()
train_manual, valid_manual, test_manual = splitter.train_valid_test_split(full_dataset)

print("\nManual DataLoader approach:")
print(f"Training set: {len(train_manual)} molecules")



MoleculeNet approach:
Training set: 902 molecules

Manual DataLoader approach:
Training set: 902 molecules
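
Both approaches report 902 training molecules because both use an 80/10/10 random split of the 1128 molecules. The sketch below reproduces that arithmetic in plain Python; it assumes the 80/10/10 fractions and truncating cutoffs, and `random_split_indices` is a hypothetical helper rather than DeepChem's `RandomSplitter`.

```python
import random


def random_split_indices(n, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle 0..n-1 and cut it into train/valid/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Cutoffs are truncated to integers, so leftovers fall into later splits.
    train_cutoff = int(frac_train * n)
    valid_cutoff = int((frac_train + frac_valid) * n)
    return (
        indices[:train_cutoff],
        indices[train_cutoff:valid_cutoff],
        indices[valid_cutoff:],
    )


train_idx, valid_idx, test_idx = random_split_indices(1128)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Matching sizes do not mean the two training sets contain the same molecules: each random split draws its own shuffle unless the same seed is used.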


## Summary

DataLoaders automate the process of preparing molecular data for machine learning:

- **CSVLoader** is the most commonly used loader and handles tabular data
- Automatic **featurization** converts molecules to numerical representations
- **Missing values** and invalid molecules are handled automatically
- **Sharding** enables processing of large datasets that don't fit in memory
- **Multi-task** learning is supported by specifying multiple task columns
- MoleculeNet uses DataLoaders internally for standard benchmarks

For most applications, you'll use CSVLoader with your own datasets or rely on MoleculeNet for standard benchmarks.