# Working with datasets 📊

In this notebook, we'll cover how to create and work with datasets.

## 0. Set up

Before anything else, we need to call `LLMObs.enable()` to initialize the LLM Observability library and specify the project to track (required).

In [None]:
import os

from dotenv import load_dotenv
# Load environment variables from the .env file.
load_dotenv(override=True)

from ddtrace.llmobs import LLMObs

LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    project_name="Onboarding",
)
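
Since `os.getenv` returns `None` for a variable that is not set, it can help to check the environment before enabling the library. The following is a minimal sketch, not part of the SDK: `missing_env_vars` is a hypothetical helper, and the two variable names are the ones read in the setup cell above.

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = ("DD_API_KEY", "DD_APPLICATION_KEY")


def missing_env_vars(names, env=None):
    """Return the names in `names` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Run this after `load_dotenv(override=True)` so that values from the `.env` file are taken into account.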

## 1. Creating a Dataset
In this example, we'll define a dataset programmatically by creating a list of `records`. Each record contains an `input_data` field and an optional `expected_output`. The records below also include a `metadata` field.

Alternatively, you can create or upload datasets directly within our product. We encourage you to explore this workflow after completing the notebook.
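
To catch malformed records before they are sent anywhere, the record shape described above can be checked locally. This is a rough sketch under stated assumptions: `validate_record` is a hypothetical helper, and the set of allowed keys (`input_data`, `expected_output`, `metadata`) is inferred from the examples in this notebook, so the library itself may accept other shapes.

```python
def validate_record(record):
    """Return a list of problems found in one record (empty if none).

    Assumed shape, inferred from this notebook's examples:
    'input_data' is required; 'expected_output' and 'metadata' are optional.
    """
    if not isinstance(record, dict):
        return ["record must be a dict"]

    problems = []
    if "input_data" not in record:
        problems.append("missing required key 'input_data'")

    allowed = {"input_data", "expected_output", "metadata"}
    for key in sorted(set(record) - allowed, key=str):
        problems.append(f"unexpected key {key!r}")

    if "metadata" in record and not isinstance(record["metadata"], dict):
        problems.append("'metadata' should be a dict")
    return problems


example = {
    "input_data": {"question": "What is the capital of China?"},
    "expected_output": "Beijing",
    "metadata": {"difficulty": "easy"},
}
print(validate_record(example))
```

An empty list means the record matches the assumed shape.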

In [None]:
dataset = LLMObs.create_dataset(
    name="capitals-of-the-world",
    description="a list of inputs and outputs describing capitals of the world",
    records=[
        {
            "input_data": {"question": "What is the capital of China?"},
            "expected_output": "Beijing",
            "metadata": {"difficulty": "easy"}
        },
        {
            "input_data": {"question": "Which city serves as the capital of South Africa?"},
            "expected_output": "Pretoria",
            "metadata": {"difficulty": "medium"}
        }
    ]
)

`create_dataset` automatically pushes the records to Datadog; you can still manipulate the dataset's records locally afterward. Dataset names *must* be unique at creation time.
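
Because names must be unique, re-running the creation cell with the same name may conflict with the dataset created earlier. One possible workaround while experimenting is to derive a fresh name on each run. The helper below is a hypothetical sketch: `unique_dataset_name` and its timestamp-plus-random-suffix scheme are assumptions made for illustration, not an SDK feature.

```python
import uuid
from datetime import datetime, timezone


def unique_dataset_name(base, now=None):
    """Append a UTC timestamp and a short random suffix to `base`.

    Hypothetical naming scheme; adjust it to your own conventions.
    """
    now = datetime.now(timezone.utc) if now is None else now
    stamp = now.strftime("%Y%m%dT%H%M%S")
    suffix = uuid.uuid4().hex[:6]  # 6 random hex characters
    return f"{base}-{stamp}-{suffix}"


print(unique_dataset_name("capitals-of-the-world"))
```

The result can be passed as the `name` argument of `LLMObs.create_dataset`. Keep a note of the generated name if you intend to pull the dataset again later.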

You can use the `url` property to see the dataset in Datadog (it may take a few seconds to be accessible).

In [None]:
dataset.url

## 2. Displaying the Dataset as a DataFrame
You can display the dataset as a pandas DataFrame for easier visualization.

In [None]:
# Display the dataset as a pandas dataframe
dataset.as_dataframe()

## 3. Modifying a Dataset
You can modify the dataset in the Datadog UI, or locally using methods such as `append()`, `update()`, and `delete()`. Once your modifications are complete, you can call `push()` to update the remote state of the dataset.

You can access dataset entries using index notation:

In [None]:
print('Record at index 0 ->', dataset[0])
print('Records from index 1 up to (not including) 5 ->', dataset[1:5])

Appending a record to the dataset:

In [None]:
dataset.append({
    "input_data": {"question": "Which city serves as the capital of Canada?"},
    "expected_output": "Ottawa",
    "metadata": {"difficulty": "easy"}
})

dataset.as_dataframe()

Modifying a record in the dataset:

In [None]:
dataset.update(1, {
    "input_data": {"question": "What's the capital city of Chad?"},
    "expected_output": "N'Djamena",
    "metadata": {"difficulty": "hard"}
})

dataset.as_dataframe()

Deleting a record in the dataset:

In [None]:
dataset.delete(0)

dataset.as_dataframe()
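
To recap the three edits above, here is the same sequence replayed on a plain Python list, with each record represented only by its expected output. This is an illustrative stand-in, not the SDK's implementation; it assumes that `update` replaces the supplied fields at the given index and that `delete` shifts later records down by one, as a list does.

```python
# Plain-list stand-in for the dataset, one label per record.
records = ["Beijing", "Pretoria"]

# Mirrors dataset.append(...): the new record goes to the end.
records.append("Ottawa")

# Mirrors dataset.update(1, ...): index 1 (Pretoria) becomes N'Djamena.
records[1] = "N'Djamena"

# Mirrors dataset.delete(0): Beijing is removed and later indexes shift down.
del records[0]

print(records)
```

After the deletion, index 0 refers to the Chad record and index 1 to the Canada record, so any later `update()` or `delete()` call should use the shifted indexes.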

Once you are done modifying your dataset, you can call `push()` to update the dataset in Datadog.

In [None]:
dataset.push()

See the updated dataset in Datadog:

In [None]:
dataset.url

## 4. Pulling an Existing Dataset
To pull a dataset from Datadog, use the `LLMObs.pull_dataset()` method with the dataset's name.

In [None]:
pulled_dataset = LLMObs.pull_dataset(name="capitals-of-the-world")

In [None]:
pulled_dataset.as_dataframe()