<div style="overflow: hidden;">
    <img src="images/DREGS_logo_v2.png" width="300" style="float: left; margin-right: 10px;">
</div>

# Getting started: Part 2 - Simple queries

Here we continue our getting started tutorial, introducing queries.

### What we cover in this tutorial

In this tutorial we will learn how to:

1) Perform a simple query with a single filter
2) Perform a simple query with multiple filters
3) Query for all datasets tagged with a given keyword

### Before we begin

To run this tutorial interactively, first work through the [getting setup](https://lsstdesc.org/dataregistry/tutorial_setup.html) page from the documentation, if you haven't done so already.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [None]:
# Come up with a random owner name to avoid clashes
from random import randint
import os
# Fall back to "unknown" if USER is not set, so the concatenation cannot fail on None
OWNER = "tutorial_" + os.environ.get('USER', 'unknown') + '_' + str(randint(0, int(1e6)))

import dataregistry
print(f"Working with dataregistry version: {dataregistry.__version__} as random owner {OWNER}")

> **Note:** some of the cells below may fail, especially if run multiple times. The most likely cause is a clash with the unique constraints within the database (hopefully the error output is informative). If this happens, either (1) rerun the cell above to pick a new random owner and then reconnect to the database, or (2) manually change the conflicting database column(s) that are clashing during registration.

## 1) Querying the data registry with a single filter

Now that we've covered the basics of dataset registration, let's explore how to query entries in the database.  

In the previous tutorial, we learned how to connect to the DESC data registry using the `DataRegistry` class. Now, we'll reconnect to the tutorial namespace:

In [None]:
from dataregistry import DataRegistry

# Connect to the tutorial namespace and set the default owner to `OWNER`
datareg = DataRegistry(namespace="tutorial", owner=OWNER)

### Constructing the query 

Queries are built using one or more Boolean logic **filters**, which correspond to SQL `WHERE` clauses in the underlying code.  

For example, to filter for all datasets in the registry with the name `"nersc_tutorial:my_first_desc_dataset"`, you would create the following filter:  

In [None]:
# Create a filter that queries on the dataset name
f = datareg.Query.gen_filter('dataset.name', '==', 'nersc_tutorial:my_first_desc_dataset')

In a query filter:  
- The first argument is the column name being searched.  
- The second argument is the comparison operator.  
- The third argument is the value to compare against.  

Like in SQL, column names can be referenced:  
- Explicitly: Including the table name (e.g., `dataset.name`).  
- Implicitly: Without the table name (e.g., `name`).  

However, implicit column references are only valid if the column name is **unique across all tables**—which `name` is not. Therefore, we strongly recommend always including the table name in filters.  
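To see why an unqualified name such as `name` cannot be resolved, consider the minimal sketch below. It is not the registry's own code: the `SCHEMA` dictionary and the `resolve_column` helper are hypothetical and exist only to illustrate the ambiguity check.

```python
# Hypothetical, heavily simplified schema used only for this illustration.
SCHEMA = {
    "dataset": ["dataset_id", "name", "relative_path", "owner_type"],
    "execution": ["execution_id", "name"],
}


def resolve_column(column, schema=SCHEMA):
    """Return a fully qualified `table.column` name, or raise if that is not possible."""
    if "." in column:
        # Explicit reference: check the table really has that column
        table, col = column.split(".", 1)
        if col not in schema.get(table, []):
            raise ValueError(f"Unknown column: {column}")
        return column

    # Implicit reference: find every table that has a column with this name
    matches = [f"{table}.{column}" for table, cols in schema.items() if column in cols]
    if not matches:
        raise ValueError(f"Unknown column: {column}")
    if len(matches) > 1:
        raise ValueError(f"Ambiguous column '{column}', candidates: {matches}")
    return matches[0]


print(resolve_column("dataset.name"))    # explicit reference is always unambiguous
print(resolve_column("relative_path"))   # implicit, but unique in this toy schema

try:
    resolve_column("name")               # present in both tables
except ValueError as err:
    print(err)
```

In this toy schema `relative_path` resolves cleanly, while `name` is rejected because two tables define it.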

The following comparison operators are supported:  
`==`, `!=`, `<`, `<=`, `>`, `>=`  

#### Wildcard Queries  

A special operator, `~=`, allows for wildcard queries, where `*` serves as the wildcard character. This is useful when:  
- You only know part of a dataset name.  
- You want to find all datasets following a specific naming pattern.

For example, the filter

In [None]:
# Create a filter that queries on the dataset name with a wildcard
f = datareg.Query.gen_filter('dataset.name', '~=', 'nersc_tutorial:*')

will match all datasets whose name begins with `nersc_tutorial:`. The `~=` operator is case insensitive; for case-sensitive wildcard searches, use the `~==` operator.
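The difference between the two wildcard operators can be illustrated without a database connection. The sketch below is not how the registry evaluates wildcards internally. The `wildcard_match` helper is hypothetical, and all dataset names except the first are made up for the example.

```python
import re


def wildcard_match(value, pattern, case_sensitive=False):
    """Return True if `value` matches `pattern`, where `*` stands for any run of characters."""
    # Escape every literal piece of the pattern and join the pieces with the regex `.*`
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(regex, value, flags) is not None


names = [
    "nersc_tutorial:my_first_desc_dataset",
    "NERSC_TUTORIAL:Other",
    "my_dataset",
]

# Case-insensitive matching, in the spirit of `~=`
insensitive = [n for n in names if wildcard_match(n, "nersc_tutorial:*")]

# Case-sensitive matching, in the spirit of `~==`
sensitive = [n for n in names if wildcard_match(n, "nersc_tutorial:*", case_sensitive=True)]

print(insensitive)
print(sensitive)
```

The case-insensitive match keeps both names that start with the prefix, whatever their capitalisation, while the case-sensitive match keeps only the lower-case one.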

### Performing the query

Now, we can use this filter in a query by passing it to the `Query` extension of the `DataRegistry` class, as shown below:  

In [None]:
# Query the database
results = datareg.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.relative_path'], [f])

The `find_datasets` function takes:

- A list of column names to return (e.g., `dataset.dataset_id`, `dataset.name`, and `dataset.relative_path`).
- A list of filter objects to apply to the query (in this case, just `f`).

We can look at the results like so:

In [None]:
print(results)

### Query return formats

Two return formats are supported, selected via the optional `return_format` argument of the `find_datasets` function:

- `return_format="property_dict"` : a dictionary with keys in the format `<table_name>.<column_name>` (default)
- `return_format="dataframe"` : a pandas DataFrame with columns named in the format `<table_name>.<column_name>`
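As a rough illustration of working with the dictionary format, the sketch below uses a made-up stand-in for a query result. It assumes that each key maps to a list holding one value per matching dataset. The IDs, the second dataset name and the paths are placeholders, not real registry entries.

```python
# Made-up stand-in for a "property_dict" style result.
# Assumption: each key maps to a list with one value per matching dataset.
results = {
    "dataset.dataset_id": [1, 2],
    "dataset.name": [
        "nersc_tutorial:my_first_desc_dataset",
        "nersc_tutorial:my_second_desc_dataset",
    ],
    "dataset.relative_path": ["path/one", "path/two"],
}

# Turn the column-oriented dictionary into one record per dataset
columns = list(results)
rows = [dict(zip(columns, values)) for values in zip(*results.values())]

for row in rows:
    print(row["dataset.dataset_id"], row["dataset.name"], row["dataset.relative_path"])
```

If the result really has this shape, looping over `rows` is a convenient way to handle one dataset at a time without needing pandas.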

To get a list of all columns in the database, along with what table they belong to, you can use the `Query.get_all_columns()` function, i.e.,

In [None]:
print(datareg.Query.get_all_columns())

## 2) Querying the data registry with multiple filters

Queries are not limited to a single filter—we can combine multiple conditions to refine our search.  

For example, suppose we want to retrieve all datasets in the registry that:  
- Have a specific `owner_type`.  
- Were registered after a certain date.  

We also want the results returned as a pandas DataFrame.  

To achieve this, we create two filter objects: 

In [None]:
# Create a filter that queries on the owner type
f = datareg.Query.gen_filter('dataset.owner_type', '==', 'group')

# Create a 2nd filter that queries on the creation date
f2 = datareg.Query.gen_filter('dataset.creation_date', '>', '01-01-2024')

Then we query the database as before:

In [None]:
# Query the database
results = datareg.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.owner',
                                       'dataset.relative_path', 'dataset.creation_date', 'dataset.owner_type'],
                                      [f,f2],
                                      return_format="dataframe")

and print the results:

In [None]:
print(results)
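To make the link between filters and SQL `WHERE` clauses concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module and an in-memory table. It assumes that multiple filters are combined with `AND`. The `build_where` helper, the toy table and its rows are hypothetical and not the registry's implementation.

```python
import sqlite3

# Filters as (column, operator, value) triples, mirroring the arguments of `gen_filter`
filters = [
    ("owner_type", "==", "group"),
    ("creation_date", ">", "2024-01-01"),
]

# Translate the tutorial's operators into SQL; anything else is rejected
SQL_OPERATORS = {"==": "=", "!=": "!=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}
ALLOWED_COLUMNS = {"dataset_id", "name", "owner_type", "creation_date"}


def build_where(filters):
    """Build a parameterized WHERE clause that ANDs all filters together."""
    clauses, params = [], []
    for column, op, value in filters:
        if column not in ALLOWED_COLUMNS or op not in SQL_OPERATORS:
            raise ValueError(f"Unsupported filter: {(column, op, value)}")
        clauses.append(f"{column} {SQL_OPERATORS[op]} ?")
        params.append(value)
    return " AND ".join(clauses), params


# Toy in-memory table with made-up rows; dates are stored as ISO strings
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (dataset_id INTEGER, name TEXT, owner_type TEXT, creation_date TEXT)"
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?)",
    [
        (1, "a", "group", "2024-03-01"),
        (2, "b", "user", "2024-03-01"),
        (3, "c", "group", "2023-06-15"),
    ],
)

where, params = build_where(filters)
rows = conn.execute(f"SELECT dataset_id, name FROM dataset WHERE {where}", params).fetchall()
conn.close()

print(where)
print(rows)
```

Only the row that satisfies both conditions is returned, which is the behaviour expected when each additional filter narrows the search.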

More querying examples can be found in the advanced querying tutorial.