# Dataset Basics

In this example, we will work with an optimization dataset. However, the concepts apply to all other dataset types.

In [None]:
import qcportal as ptl

In [None]:
# Guest access
client = ptl.PortalClient("https://qcademo.molssi.org")

## Retrieving a dataset

We can list the datasets available on the server with `list_datasets`.

In [None]:
print(client.list_datasets())

Get the dataset by its type and name (see also `get_dataset_by_id`).

In [None]:
ds = client.get_dataset('optimization', 'Diatomic Molecule Opt')

We can check the status of the calculations with the `status()` function. Note that the status is always computed on the server and does not use any locally cached records.

In [None]:
ds.status()

## Specifications and Entries

Each dataset is composed of specifications (how computations are run) and entries (typically the molecules that you are working with).
In the case of an `OptimizationDataset`, each specification is an `OptimizationSpecification`, while each entry
is an `OptimizationDatasetEntry` that contains the initial molecule.

Specifications can be viewed with the `specifications` property, which returns a dictionary.

In [None]:
ds.specifications

For entries, we can get a list of entry names with `entry_names`.

In [None]:
ds.entry_names

To get the full information about an entry, use [get_entry()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.get_entry). This function will fetch from the server as needed.

In [None]:
ds.get_entry('H2')

We can iterate over all the entries with
[iterate_entries()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.iterate_entries).
This function returns a Python generator and automatically fetches entry information as needed.

In [None]:
for entry in ds.iterate_entries():
    print(entry.name, entry.initial_molecule.get_hash())
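The "fetch as needed" behavior of a generator can be illustrated without a server. The following is a minimal sketch, not the QCPortal implementation: `iterate_lazily`, `fake_fetch`, the batch size, and the entry name `O2` are all hypothetical and exist only to show that nothing is requested until the consumer asks for an item.

```python
from typing import Callable, Iterator, List, Sequence


def iterate_lazily(
    names: Sequence[str],
    fetch_batch: Callable[[List[str]], List[dict]],
    batch_size: int = 2,
) -> Iterator[dict]:
    """Yield items one at a time, fetching each batch only when it is reached."""
    for start in range(0, len(names), batch_size):
        batch = list(names[start:start + batch_size])
        # The fetch for this batch runs only once the consumer gets this far
        for item in fetch_batch(batch):
            yield item


# Hypothetical stand-in for a server call; it records which batches were requested
requested_batches: List[List[str]] = []


def fake_fetch(batch: List[str]) -> List[dict]:
    requested_batches.append(batch)
    return [{"name": name} for name in batch]


entries = iterate_lazily(["H2", "N2", "O2"], fake_fetch)
first = next(entries)                      # triggers only the first batch
batches_after_first = len(requested_batches)
remaining = [e["name"] for e in entries]   # consuming the rest triggers the second batch
```

Because the work is deferred, stopping the loop early avoids fetching the batches that were never reached.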

## Getting and iterating over records

Records are indexed by the entry name and the specification name. Similar to entries, a single `OptimizationRecord` can be obtained with [get_record()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.get_record).

In [None]:
rec = ds.get_record("N2", "hf/sto-3g")
print(rec.id)
print(rec.final_molecule)
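One way to picture this two-part index is a mapping keyed by an (entry name, specification name) pair. The sketch below uses invented record ids and a hypothetical `lookup` helper; it says nothing about how QCPortal stores records internally, and the second specification name is an assumption made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in data: record ids keyed by (entry name, specification name)
record_ids: Dict[Tuple[str, str], int] = {
    ("H2", "hf/sto-3g"): 101,
    ("N2", "hf/sto-3g"): 102,
    ("N2", "b3lyp/def2-svp"): 103,
}


def lookup(entry_name: str, spec_name: str) -> Optional[int]:
    """Return the record id for the pair, or None if the pair is unknown (a choice of this sketch)."""
    return record_ids.get((entry_name, spec_name))


n2_hf = lookup("N2", "hf/sto-3g")
missing = lookup("H2", "b3lyp/def2-svp")
```

The same entry can appear under several specifications, so both names are needed to identify one record.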

When you need information about many records, you can iterate over all of them with
[iterate_records()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.iterate_records).

This function returns a generator that yields tuples of three values: the entry name, the specification name, and the record.
It also automatically fetches record information as needed.

[iterate_records()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.iterate_records) has some useful additional arguments, such as the ability to iterate only over records with a particular status. This is helpful here because some computations have not finished or have errored.

In [None]:
# Get the final bond length
# Molecule.measure() takes a tuple of atom indices. With only 2 indices, it computes the distance between those atoms
for entry_name, spec_name, record in ds.iterate_records(status='complete'):
    print(entry_name, spec_name, record.final_molecule.measure((0, 1)))
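The loop above combines two ideas: skipping records that are not complete, and measuring the distance between atoms 0 and 1. A server-free sketch of both follows. The rows, the entry name `CO`, and the coordinates are invented for illustration and are not computed results; `distance` and `iterate_complete` are hypothetical helpers, not QCPortal or QCElemental functions. The distance is returned in whatever unit the coordinates use.

```python
import math
from typing import Iterator, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]

# Hypothetical stand-in rows: (entry name, specification name, status, final geometry)
rows: List[Tuple[str, str, str, Optional[List[Point]]]] = [
    ("H2", "hf/sto-3g", "complete", [(0.0, 0.0, 0.0), (0.0, 0.0, 1.5)]),
    ("N2", "hf/sto-3g", "error", None),  # no final geometry available
    ("CO", "hf/sto-3g", "complete", [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]),
]


def distance(geometry: Sequence[Point], i: int, j: int) -> float:
    """Euclidean distance between atoms i and j, in the units of the geometry."""
    return math.dist(geometry[i], geometry[j])


def iterate_complete(data) -> Iterator[Tuple[str, str, List[Point]]]:
    """Yield only the rows whose status is 'complete'."""
    for entry_name, spec_name, status, geometry in data:
        if status == "complete":
            yield entry_name, spec_name, geometry


bond_lengths = {
    (entry_name, spec_name): distance(geometry, 0, 1)
    for entry_name, spec_name, geometry in iterate_complete(rows)
}
```

Filtering first matters because a record that has not finished has no final geometry to measure.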

## Compiling a pandas dataframe

One common task is to create a pandas dataframe with values that you have computed. For this, you can use
[compile_values()](../api/qcportal.datasets.rst#qcportal.datasets.models.BaseDataset.compile_values).

With that, we can create a pandas dataframe of bond distances for all the entries and all the specifications.

The first argument of this function is a callable that is applied to each completed record and returns the value to store in the dataframe. The function iterates over all records, applies the callable,
and creates the pandas dataframe for you.

In [None]:
df = ds.compile_values(lambda r: r.final_molecule.measure((0, 1)), "bond distance")

In [None]:
print(df)
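To see roughly what such a table looks like, here is a small pandas-only sketch that builds a dataframe by hand from (entry, specification, value) triples. The numbers are invented, the second specification name is hypothetical, and the layout (one row per entry, one column per specification) is an assumption; the dataframe returned by `compile_values` may be organized differently.

```python
import pandas as pd

# Hypothetical stand-in values: (entry name, specification name, bond distance)
rows = [
    ("H2", "hf/sto-3g", 1.5),
    ("N2", "hf/sto-3g", 2.0),
    ("H2", "b3lyp/def2-svp", 1.25),
]

# Long format: one row per (entry, specification) pair
long_df = pd.DataFrame(rows, columns=["entry", "specification", "bond distance"])

# Wide format: entries as rows, specifications as columns
wide_df = long_df.pivot(index="entry", columns="specification", values="bond distance")
print(wide_df)
```

A pair with no completed record, such as `N2` with `b3lyp/def2-svp` here, shows up as `NaN` in the wide table.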