# Loading example datasets

This tutorial provides the code needed to load a few standard datasets that serve as useful benchmarks when trying out new methods.

All datasets will be loaded as [PyTorch dataset objects](https://pytorch.org/docs/stable/data.html).

In [1]:
import torch
from torch.utils.data import Dataset

## Real data

First, we look at how to load standard datasets for chemistry and materials science.

### QM9

The QM9 dataset, which contains energies and other properties of small organic molecules, is included as an example dataset in PyTorch Geometric.
Beware: running this code will download the entire dataset to the specified directory (unless it has already been downloaded).

In [2]:
from torch_geometric import datasets
qm9 = datasets.QM9("~/qm9")

Taking a look at an entry in the dataset:

In [6]:
gdb1 = next(iter(qm9))
gdb1

Data(edge_attr=[8, 4], edge_index=[2, 8], idx=[1], name="gdb_1", pos=[5, 3], x=[5, 11], y=[1, 19], z=[5])

This entry is, in fact, methane. This information is hidden in the "x" attribute, which contains the atom properties for the five atoms:

In [11]:
gdb1["x"]

tensor([[0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

See, in the sixth column, "6, 1, 1, 1, 1"? Those are the atomic numbers of carbon and four hydrogens.
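As a minimal sketch, the sixth column can be decoded into element symbols. The rows below are copied from the output above; in a live session, `gdb1["x"].tolist()` should give the same nested list. The `SYMBOLS` table and the `ATOMIC_NUMBER_COLUMN` name are hypothetical helpers introduced only for this illustration.

```python
# Rows copied from the printed gdb1["x"] tensor above.
x_rows = [
    [0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.],
    [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
    [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
    [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
    [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
]

# Hypothetical lookup table covering the elements that occur in QM9.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O", 9: "F"}

# The sixth column has index 5 when counting from zero.
ATOMIC_NUMBER_COLUMN = 5

atomic_numbers = [int(row[ATOMIC_NUMBER_COLUMN]) for row in x_rows]
symbols = [SYMBOLS[z] for z in atomic_numbers]

print(atomic_numbers)
print(symbols)
```

One carbon and four hydrogens: CH4, that is, methane.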

The properties to be predicted are contained in the "y" attribute. There are 19 of them, and their meanings are explained [in the documentation](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.QM9).

QM9 is often analyzed using graph neural networks, which operate on the connectivity of the atoms. However, the dataset also contains the positions of the atoms, not just their connectivity. This is the meaning of the "pos" attribute, which for this molecule is a 5x3 tensor (5 atoms, 3 dimensions).
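To illustrate what positions add beyond connectivity, the sketch below computes all pairwise interatomic distances from a 5x3 array of coordinates. The coordinates are an idealized tetrahedral methane geometry with an assumed C-H bond length of 1.09 Å; they are not the values stored in the dataset, and a plain nested list stands in for the "pos" tensor.

```python
import math
from itertools import combinations

# Assumed C-H bond length in angstroms (idealized, not taken from QM9).
BOND = 1.09
s = BOND / math.sqrt(3)

# Idealized tetrahedral geometry: carbon at the origin, hydrogens on
# alternating corners of a cube. Same 5x3 shape as the "pos" attribute.
pos = [
    [0.0, 0.0, 0.0],  # C
    [s, s, s],        # H
    [s, -s, -s],      # H
    [-s, s, -s],      # H
    [-s, -s, s],      # H
]
labels = ["C", "H", "H", "H", "H"]

# Euclidean distance for every unordered pair of atoms.
distances = {
    (i, j): math.dist(pos[i], pos[j])
    for i, j in combinations(range(len(pos)), 2)
}

for (i, j), d in distances.items():
    print(f"{labels[i]}{i}-{labels[j]}{j}: {d:.3f}")
```

The four C-H distances come out equal to the assumed bond length, while the H-H distances are longer. This kind of geometric information is not available from the bond graph alone.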

### Materials Project

The Materials Project is an online database of crystal structures and computed properties.

Thankfully, materials can be downloaded one at a time, unlike molecules in QM9. However, you will need to make an account on the [Materials Project website](https://materialsproject.org/), and then click the "API" button on the top menu to obtain an API key.

Then, the following class can be used to create PyTorch datasets containing Materials Project entries:

In [17]:
from pymatgen.ext.matproj import MPRester

# Enter your API key here
downloader = MPRester("")

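class MPData(Dataset):
    """Dataset of Materials Project entries, downloaded at construction."""

    def __init__(self, ids, downloader):
        # Fetch every entry up front, including its final structure
        # in the conventional unit cell.
        self.entries = [
            downloader.get_entry_by_material_id(
                material_id,
                inc_structure="final",
                conventional_unit_cell=True,
            )
            for material_id in ids
        ]

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, i):
        entry = self.entries[i]
        structure = entry.structure
        # Return the species, the fractional coordinates, and the energy.
        return (structure.species, structure.frac_coords, entry.energy)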


To create a dataset, you need a list of Materials Project IDs. You can obtain these from a Materials Project search, either through their Python API or just by [using their search interface in a web browser](https://materialsproject.org/#search/materials/). Below, I create a dataset containing a single entry just so we can see how it is structured:

In [21]:
example_dataset = MPData(["mp-22862"], downloader)
mp22862 = next(iter(example_dataset))
mp22862

([Element Na,
  Element Na,
  Element Na,
  Element Na,
  Element Cl,
  Element Cl,
  Element Cl,
  Element Cl],
 array([[0. , 0. , 0. ],
        [0. , 0.5, 0.5],
        [0.5, 0. , 0.5],
        [0.5, 0.5, 0. ],
        [0.5, 0. , 0. ],
        [0.5, 0.5, 0.5],
        [0. , 0. , 0.5],
        [0. , 0.5, 0. ]]),
 -27.10511488)

This entry is, in fact, table salt.
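One way to check this is to reduce the species list to an empirical formula. The sketch below uses plain strings as stand-ins for the pymatgen `Element` objects in the output above; in a live session, something like `[el.symbol for el in structure.species]` should produce such a list. The variable names are illustrative only.

```python
from collections import Counter
from functools import reduce
from math import gcd

# Plain strings standing in for the Element objects printed above.
species = ["Na", "Na", "Na", "Na", "Cl", "Cl", "Cl", "Cl"]

# Count the atoms of each element in the conventional unit cell.
counts = Counter(species)

# Divide by the greatest common divisor to get the smallest integer ratio.
divisor = reduce(gcd, counts.values())
reduced = {element: n // divisor for element, n in counts.items()}

# Build the formula string, omitting a count of 1.
formula = "".join(
    element if n == 1 else f"{element}{n}" for element, n in reduced.items()
)

print(formula)
```

The conventional cell holds four sodium and four chlorine atoms, which reduces to NaCl. pymatgen can also report this directly through the structure's composition, which is likely the more convenient route in practice.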

## Simple exercises

Intro text

### A sine wave

Explanation

In [None]:
Code

### A 4x4 Ising model

Explanation

In [None]:
Code