# Working With Datasets

- The `Dataset` class
 - stores and manages data.
 - provides simple but powerful tools for working efficiently with large amounts of data.
 - is designed to interoperate easily with NumPy, Pandas, TensorFlow, and PyTorch.

- Reference: [DeepChem tutorials](https://github.com/deepchem/deepchem/tree/master/examples/tutorials)

In [1]:
!pip install --pre deepchem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepchem
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Installing collected packages: rdkit-pypi, deepchem
Successfully installed deepchem-2.6.1 rdkit-pypi-2022.3.5


In [2]:
import deepchem as dc
dc.__version__

'2.6.1'

# Anatomy of a Dataset

- load the Delaney dataset of molecular solubilities

In [3]:
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

In [6]:
print(test_dataset)

<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['c1cc2ccc3cccc4ccc(c1)c2c34' 'Cc1cc(=O)[nH]c(=S)[nH]1'
 'Oc1ccc(cc1)C2(OC(=O)c3ccccc23)c4ccc(O)cc4 ' ...
 'c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43' 'Cc1occc1C(=O)Nc2ccccc2'
 'OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O '], task_names: ['measured log solubility in mols per litre']>


- `Dataset` is an abstract class. It has a few subclasses that correspond to different ways of storing data.
- `DiskDataset` is a dataset that has been saved to disk.  The data is stored in a way that can be efficiently accessed, even if the total amount of data is far larger than your computer's memory.
- `NumpyDataset` is an in-memory dataset that holds all the data in NumPy arrays.  It is a useful tool when manipulating small to medium sized datasets that can fit entirely in memory.
- `ImageDataset` stores some or all of the data in image files on disk. It is useful when working with models that have images as their inputs or outputs.

- Every Dataset stores a list of *samples*.  Very roughly speaking, a sample is a single data point.  
 - In this case, each sample is a molecule.  In other datasets a sample might correspond to an experimental assay, a cell line, an image, or many other things.  For every sample the dataset stores the following information.

- The *features*, referred to as `X`.  
- The *labels*, referred to as `y`.  
- The *weights*, referred to as `w`. These can be used to indicate that some data values are more important than others.
- An *ID*, which is a unique identifier for the sample.  This can be anything as long as it is unique.  Sometimes it is just an integer index, but in this dataset the ID is a SMILES string describing the molecule.

Notice that `X`, `y`, and `w` all have 113 as the size of their first dimension, so this dataset contains 113 samples.

- `task_names` lists the tasks in the dataset. Some datasets contain multiple pieces of information for each sample. For example, if a sample represents a molecule, the dataset might record the results of several different experiments on that molecule.
 - This dataset has only a single task: "measured log solubility in mols per litre".  Also notice that `y` and `w` each have shape (113, 1).  The second dimension of these arrays usually matches the number of tasks.
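
To make these shapes concrete, here is a minimal NumPy sketch, independent of DeepChem, of labels and weights for a hypothetical two-task dataset. All values are made up, and using a zero weight to mark a missing measurement is an assumption made for this illustration.

```python
import numpy as np

# Hypothetical labels for 4 samples and 2 tasks: shape (n_samples, n_tasks).
y = np.array([[0.5, 1.0],
              [1.5, 0.0],   # second task not measured; 0.0 is only a placeholder
              [2.5, 3.0],
              [3.5, 5.0]])

# One weight per label. A zero weight (assumed convention) drops that label.
w = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0]])

n_samples, n_tasks = y.shape

# Weighted mean of each task; the zero-weight entry does not contribute.
task_means = (w * y).sum(axis=0) / w.sum(axis=0)

print(n_samples, n_tasks)  # 4 2
print(task_means)          # task means: 2.0 and 3.0
```

The first dimension indexes samples and the second indexes tasks, which is why `y` and `w` in the single-task Delaney test set have shape (113, 1).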


In [7]:
# The `y` property returns all labels as a NumPy array.
test_dataset.y

array([[-1.60114461],
       [ 0.20848251],
       [-0.01602738],
       [-2.82191713],
       [-0.52891635],
       [ 1.10168349],
       [-0.88987406],
       [-0.52649706],
       [-0.76358725],
       [-0.64020358],
       [-0.38569452],
       [-0.62568785],
       [-0.39585553],
       [-2.05306753],
       [-0.29666474],
       [-0.73213651],
       [-1.27744393],
       [ 0.0081655 ],
       [ 0.97588054],
       [-0.10796031],
       [ 0.59847167],
       [-0.60149498],
       [-0.34988907],
       [ 0.34686576],
       [ 0.62750312],
       [ 0.14848418],
       [ 0.02268122],
       [-0.85310089],
       [-2.72079091],
       [ 0.42476682],
       [ 0.01300407],
       [-2.4851523 ],
       [-2.15516147],
       [ 1.00975056],
       [ 0.82588471],
       [-0.90390593],
       [-0.91067993],
       [-0.82455329],
       [ 1.26909819],
       [-1.14825397],
       [-2.1343556 ],
       [-1.15744727],
       [-0.1045733 ],
       [ 0.53073162],
       [-1.22567118],
       [-1

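- Accessing `X`, `y`, `w`, or `ids` directly is convenient, but it loads the data for every sample into memory at once, which can be a problem for large datasets. A better approach is to iterate over the dataset. That lets it load just a little data at a time, process it, then free the memory before loading the next bit.
 - Use the `itersamples()` method to iterate over the samples one at a time.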

In [None]:
for X, y, w, sample_id in test_dataset.itersamples():
    print(y, sample_id)

[-1.70654087] C1c2ccccc2c3ccc4ccccc4c13
[0.2911162] COc1ccccc1Cl
[-1.42724759] COP(=S)(OC)Oc1cc(Cl)c(Br)cc1Cl
[-0.92546642] ClC(Cl)CC(=O)NC2=C(Cl)C(=O)c1ccccc1C2=O
[-1.95269767] ClC(Cl)C(c1ccc(Cl)cc1)c2ccc(Cl)cc2 
[1.35148394] COC(=O)C=C
[-0.85919344] CN(C)C(=O)Nc2ccc(Oc1ccc(Cl)cc1)cc2
[-0.65090692] N(=Nc1ccccc1)c2ccccc2
[-0.32900957] CC(C)c1ccc(C)cc1
[0.60827977] Oc1c(Cl)cccc1Cl
[1.82959618] OCC2OC(OC1(CO)OC(CO)C(O)C1O)C(O)C(O)C2O 
[1.62130966] OC1C(O)C(O)C(O)C(O)C1O
[1.37515286] Cn2c(=O)n(C)c1ncn(CC(O)CO)c1c2=O
[0.45632528] OCC(NC(=O)C(Cl)Cl)C(O)c1ccc(cc1)N(=O)=O
[1.05325552] CCC(O)(CC)CC
[-1.10535024] CC45CCC2C(CCC3CC1SC1CC23C)C4CCC5O
[-0.20119739] Brc1ccccc1Br
[0.34792162] Oc1c(Cl)cc(Cl)cc1Cl
[-0.98700562] CCCN(CCC)c1c(cc(cc1N(=O)=O)S(N)(=O)=O)N(=O)=O
[-0.816116] C2c1ccccc1N(CCF)C(=O)c3ccccc23 
[0.84023521] CC(C)C(=O)C(C)C
[0.22815687] O=C1NC(=O)NC(=O)C1(C(C)C)CC=C(C)C
[0.06247441] c1c(O)C2C(=O)C3cc(O)ccC3OC2cc1(OC)
[1.04094768] Cn1cnc2n(C)c(=O)n(C)c(=O)c12
[-0.51978109] CC(=O)SC4C

- use `iterbatches()` to iterate over batches of samples.

In [8]:
for X, y, w, ids in test_dataset.iterbatches(batch_size=50):
    print(y.shape)

(50, 1)
(50, 1)
(13, 1)
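
The three shapes printed above follow from simple arithmetic: 113 samples split into batches of 50 leave a final, smaller batch of 13. The helper `batch_sizes` below is a hypothetical name used only for this sketch and is not part of DeepChem; it assumes the last batch is not padded.

```python
import math


def batch_sizes(n_samples, batch_size):
    """Return the size of each batch when the last batch is not padded."""
    n_batches = math.ceil(n_samples / batch_size)
    return [min(batch_size, n_samples - i * batch_size) for i in range(n_batches)]


sizes = batch_sizes(113, 50)
print(sizes)  # [50, 50, 13]
```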


- `iterbatches(batch_size=100, epochs=10, deterministic=False)`
 - iterates over the complete dataset ten times in batches of up to 100 samples, each time with the samples in a different random order.
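
A plain-NumPy sketch of what these arguments mean, assuming that `deterministic=False` reshuffles the sample order at the start of every epoch. The generator `iter_epochs` and its `seed` parameter are hypothetical, written only for this illustration; this is not DeepChem's implementation.

```python
import numpy as np


def iter_epochs(n_samples, batch_size, epochs, deterministic, seed=0):
    """Yield one array of sample indices per batch, epoch after epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        # Fixed order when deterministic, otherwise a fresh shuffle each epoch.
        if deterministic:
            order = np.arange(n_samples)
        else:
            order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            yield order[start:start + batch_size]


batches = list(iter_epochs(n_samples=113, batch_size=100, epochs=10,
                           deterministic=False))
print(len(batches))  # 20: each epoch yields a batch of 100 and a batch of 13
```

Every epoch still visits each sample exactly once; only the order changes between epochs.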

- Datasets can also expose data using the standard interfaces for TensorFlow and PyTorch.  
 - To get a `tensorflow.data.Dataset`, call `make_tf_dataset()`.  
 - To get a `torch.utils.data.IterableDataset`, call `make_pytorch_dataset()`. 

- `to_dataframe()` returns the data as a Pandas DataFrame. It loads all of the data into memory at once, so use it only with small datasets.

In [9]:
test_dataset.to_dataframe()

| | X | y | w | ids |
|---|---|---|---|---|
| 0 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -1.601145 | 1.0 | `c1cc2ccc3cccc4ccc(c1)c2c34` |
| 1 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.208483 | 1.0 | `Cc1cc(=O)[nH]c(=S)[nH]1` |
| 2 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.016027 | 1.0 | `Oc1ccc(cc1)C2(OC(=O)c3ccccc23)c4ccc(O)cc4` |
| 3 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -2.821917 | 1.0 | `c1ccc2c(c1)cc3ccc4cccc5ccc2c3c45` |
| 4 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.528916 | 1.0 | `C1=Cc2cccc3cccc1c23` |
| ... | ... | ... | ... | ... |
| 108 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -1.656304 | 1.0 | `ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl` |
| 109 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.743629 | 1.0 | `c1ccsc1` |
| 110 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -2.420799 | 1.0 | `c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43` |
| 111 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.209570 | 1.0 | `Cc1occc1C(=O)Nc2ccccc2` |


# Creating Datasets

- You can also create your own datasets. Creating a `NumpyDataset` is very simple: pass it the arrays that hold the data.

In [10]:
import numpy as np

X = np.random.random((10, 5))
y = np.random.random((10, 2))
dataset = dc.data.NumpyDataset(X=X, y=y)
print(dataset)

<NumpyDataset X.shape: (10, 5), y.shape: (10, 2), w.shape: (10, 1), ids: [0 1 2 3 4 5 6 7 8 9], task_names: [0 1]>


- We did not specify weights or IDs. These are optional, as is `y`; only `X` is required.
- Since we left them out, `NumpyDataset` automatically built the `w` and `ids` arrays, setting all weights to 1 and the IDs to integer indices.

In [11]:
dataset.to_dataframe()

| | X1 | X2 | X3 | X4 | X5 | y1 | y2 | w | ids |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.762896 | 0.716692 | 0.236962 | 0.741224 | 0.698493 | 0.651243 | 0.603753 | 1.0 | 0 |
| 1 | 0.230394 | 0.098218 | 0.666193 | 0.697988 | 0.568199 | 0.771893 | 0.313024 | 1.0 | 1 |
| 2 | 0.751387 | 0.351718 | 0.577777 | 0.735805 | 0.22322 | 0.974115 | 0.242085 | 1.0 | 2 |
| 3 | 0.370109 | 0.300955 | 0.443929 | 0.397283 | 0.978975 | 0.62307 | 0.091307 | 1.0 | 3 |
| 4 | 0.36675 | 0.480301 | 0.630271 | 0.304236 | 0.633707 | 0.353764 | 0.038945 | 1.0 | 4 |
| 5 | 0.214706 | 0.072903 | 0.943132 | 0.35341 | 0.824539 | 0.972111 | 0.679871 | 1.0 | 5 |
| 6 | 0.114787 | 0.651963 | 0.995025 | 0.730898 | 0.804485 | 0.344875 | 0.778179 | 1.0 | 6 |
| 7 | 0.285158 | 0.451345 | 0.44541 | 0.407898 | 0.686841 | 0.244473 | 0.291853 | 1.0 | 7 |
| 8 | 0.40886 | 0.564216 | 0.516479 | 0.039808 | 0.115541 | 0.734804 | 0.514615 | 1.0 | 8 |
| 9 | 0.609884 | 0.520608 | 0.684718 | 0.093286 | 0.600643 | 0.234372 | 0.271084 | 1.0 | 9 |


- Creating a `DiskDataset`
 - If you have the data in NumPy arrays, you can call `DiskDataset.from_numpy()` to save it to disk.

In [12]:
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    disk_dataset = dc.data.DiskDataset.from_numpy(X=X, y=y, data_dir=data_dir)
    print(disk_dataset)

<DiskDataset X.shape: (10, 5), y.shape: (10, 2), w.shape: (10, 1), ids: [0 1 2 3 4 5 6 7 8 9], task_names: [0 1]>


- What if you have some huge files on disk containing data on hundreds of millions of molecules, too much to hold in memory as NumPy arrays?
- Fortunately, DeepChem's `DataLoader` framework can automate most of the work of creating a `DiskDataset` from such files.