![Banner logo](https://raw.githubusercontent.com/CitrineInformatics/community-tools/master/templates/fig/citrine_banner_2.png)

# PyCC Search Client Tutorial

*Authors: Enze Chen, Max Hutchinson, Chris Borg*

In this notebook, we will cover how to use the [Citrination API](http://citrineinformatics.github.io/python-citrination-client/) to search for and return PIF records on Citrination. The query language is quite sophisticated and allows users to apply complex sets of criteria; consequently, this guide will only cover a subset of its capabilities.

## Table of contents
1. [Learning outcomes](#Learning-outcomes)
1. [Background knowledge](#Background-knowledge)
1. [Imports](#Python-package-imports)
1. [Initialization](#Initialize-the-SearchClient)
1. [Query structure](#Query-structure)
1. [Filters](#Filters)
1. [PIFs](#PIF-search)
1. [Datasets](#Dataset-search)
1. [Conclusion](#Conclusion)
1. [Additional resources](#Additional-resources)

## Learning outcomes

[Back to ToC](#Table-of-contents)

By the end of this tutorial, you will know how to:
* Initialize the [`SearchClient`](http://citrineinformatics.github.io/python-citrination-client/modules/search/citrination_client.search.html) and search for datasets and PIF records.
* Nest the various [`Query`](http://citrineinformatics.github.io/python-citrination-client/modules/search/core_query.html) objects to apply a set of search criteria.
* Use [`Filter`](http://citrineinformatics.github.io/python-citrination-client/modules/search/core_query.html#module-citrination_client.search.core.query.filter) objects to match against data fields.

## Background knowledge

[Back to ToC](#Table-of-contents)

In order to get the most out of this tutorial, you should already be familiar with the following:
* The Physical Information File (PIF) schema. 
  * [Documentation](http://citrineinformatics.github.io/pif-documentation/schema_definition/index.html)
  * [Publication](https://www.cambridge.org/core/journals/mrs-bulletin/article/beyond-bulk-single-crystals-a-data-format-for-all-materials-structurepropertyprocessing-relationships/AADBAEDA62B0391D708CF02269989E8B)
  * [Example](../tutorial_sequence/AdvancedPif.ipynb)
* What the search [front-end UI](https://citrination.com/search/simple) looks like.

## Python package imports

[Back to ToC](#Table-of-contents)

In [None]:
# Standard packages
import os

# Third-party packages
from citrination_client import *
from pypif import pif

## Initialize the SearchClient

[Back to ToC](#Table-of-contents)


In [None]:
# Initialize the base CitrinationClient
site = "https://citrination.com" # site you want to access; we'll use the public site
client = CitrinationClient(api_key=os.environ.get('CITRINATION_API_KEY'), 
                           site=site)

# Access the SearchClient from the attribute
search_client = client.search
search_client # reveal the methods

In this notebook, we will focus on the `pif_search()` and `dataset_search()` methods.
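
Note that `os.environ.get` returns `None` when the variable is unset, so a missing key can go unnoticed until a request fails. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper written for this tutorial, not part of `citrination_client`.

```python
import os

def require_api_key(env=None, name="CITRINATION_API_KEY"):
    """Return the API key from `env`, or raise a descriptive error if it is unset or empty.

    Hypothetical helper for illustration; not part of citrination_client.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(
            "Set the {} environment variable before creating the client.".format(name))
    return key

# Usage with an explicit mapping, so the example runs without touching the real environment
print(require_api_key({"CITRINATION_API_KEY": "my-secret-key"}))
```

In the cell above, the result could be passed as `api_key=require_api_key()` when constructing the `CitrinationClient`.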

## Query structure

[Back to ToC](#Table-of-contents)

Before we discuss the specifics of each method, we'll provide a high-level discussion about the structure of [`Query`](https://github.com/CitrineInformatics/python-citrination-client/tree/64aab061500811fae4767491e5b069bb4a4af068/citrination_client/search/core/query) objects. There are two generic types of queries used by the `SearchClient`:

1. `ReturningQuery` objects that actually return specific objects with data (e.g. PIFs, datasets).
    * These are inputs to the search methods listed above.


2. Other `Query` objects that just match for specific fields (e.g. datasets, formulas).
    * Roughly speaking, there is a `Query` object corresponding to each PIF object ([see here](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_query_core.html)).

### Example
![Query structure](../fig/query_structure.png "Query structure")

At the top level, we have a `ReturningQuery` object that takes a variety of input parameters such as:
* `size`: Total number of hits to return.
* `query`: One or more [`DataQuery`](http://citrineinformatics.github.io/python-citrination-client/modules/search/core_query.html#module-citrination_client.search.core.query.data_query) objects with the query to run.

The `DataQuery` object then contains more fine-grained fields for selecting specific `dataset`(s) and `system`(s), each with their specific [`DatasetQuery`](http://citrineinformatics.github.io/python-citrination-client/modules/search/dataset_query.html) and [`PifSystemQuery`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_query.html) objects. Query objects are in orange and black in the above image.
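
The nesting described above can be pictured with plain Python dictionaries. This is only an illustration of the hierarchy: the key names mirror the parameters discussed in this section, the values are sample inputs, and the dictionary is not the client's actual request format.

```python
# Plain-dict picture of the query hierarchy (illustrative only; not the client's wire format)
query_outline = {
    "PifSystemReturningQuery": {
        "size": 10,
        "query": {  # DataQuery
            "dataset": {"id": {"equal": "172242"}},  # DatasetQuery -> Filter
            "system": {  # PifSystemQuery -> ChemicalFieldQuery -> ChemicalFilter
                "chemical_formula": {"filter": {"equal": "As2S3"}}
            },
        },
    }
}

def leaf_paths(node, prefix=()):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield ".".join(path), value

for path, value in leaf_paths(query_outline):
    print(path, "=", value)
```

Reading the printed paths from left to right follows the same route as the figure: returning query, then data query, then the dataset or system sub-query, and finally a filter.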

## Filters

[Back to ToC](#Table-of-contents)

In the above example, you'll notice that each sub-query field ends with a [`Filter`](http://citrineinformatics.github.io/python-citrination-client/modules/search/core_query.html#module-citrination_client.search.core.query.filter) object highlighted in blue. The purpose of these objects is to contain the matching phrase (`equal`), along with any logic (`logic`, `exists`) and range (`min`, `max`) parameters. When constructing your own queries, remember to use a `Filter` when limiting the scope of each field.

Note that the `chemical_formula` field takes a specialized `ChemicalFieldQuery` which has its own [`ChemicalFilter`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_chemical_query.html#module-citrination_client.search.pif.query.chemical.chemical_filter) object.
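
To make the roles of these parameters concrete, here is a local stand-in written in plain Python. The function `matches` is hypothetical and only approximates the idea behind `equal`, `min`, `max`, and `exists`; the server-side matching rules may differ, for example in how text is compared.

```python
def matches(value, equal=None, min=None, max=None, exists=None):
    """Local stand-in for a Filter: True if `value` satisfies every criterion that is set.

    The parameter names deliberately mirror the Filter keywords (and so shadow the
    built-in min/max inside this function only).
    """
    if exists is not None and (value is not None) != exists:
        return False
    if value is None:
        # A missing value cannot satisfy equality or range criteria
        return equal is None and min is None and max is None
    if equal is not None and str(value) != str(equal):
        return False
    if min is not None and float(value) < float(min):
        return False
    if max is not None and float(value) > float(max):
        return False
    return True

band_gaps = [1.1, 3.4, 5.5, 9.0, None]
print([g for g in band_gaps if matches(g, min=3.0, max=6.0)])  # [3.4, 5.5]
```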

### `extract_as`

`extract_as` is a powerful keyword that facilitates the aggregation of data from multiple sources. It takes a `string` giving the alias under which to save a field, which is useful when different datasets use slightly different names for the same property.

Each hit then reports the matching field under that single alias, regardless of what the source dataset called it. This flattens the data from the hierarchical PIF format to facilitate analysis. [See here](../tutorial_sequence/3_IntroQueries.ipynb) for an example and discussion.
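
The aliasing idea can be sketched locally. Below, two made-up records name the same property differently, and a small helper collects both under one alias. The records, the property names, and `extract_property` are hypothetical; in a real query the same effect is obtained through `extract_as`.

```python
# Two made-up records whose datasets name the same property differently
records = [
    {"formula": "GaN", "properties": [{"name": "Band gap", "value": 3.4}]},
    {"formula": "ZnO", "properties": [{"name": "Bandgap energy", "value": 3.3}]},
]

def extract_property(record, accepted_names, alias):
    """Return a flat dict with the formula and the first matching property stored under `alias`."""
    flat = {"formula": record["formula"]}
    for prop in record["properties"]:
        if prop["name"] in accepted_names:
            flat[alias] = prop["value"]
            break
    return flat

rows = [extract_property(r, {"Band gap", "Bandgap energy"}, "band_gap") for r in records]
print(rows)
```

Both rows now share the key `band_gap`, which is what makes the flattened output convenient for downstream analysis.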

## PIF search

[Back to ToC](#Table-of-contents)

The [`PifSystemReturningQuery`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_query.html#module-citrination_client.search.pif.query.pif_system_returning_query) object in the example above is exactly the input for the [`pif_search()`](http://citrineinformatics.github.io/python-citrination-client/modules/search/citrination_client.search.html#citrination_client.search.client.SearchClient.pif_search) method. This method returns a [`PifSearchResult`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_result.html#module-citrination_client.search.pif.result.pif_search_result) object with the following attributes:
* `took`: Number of milliseconds the query took.
* `total_num_hits`: The total number of PIF hits.
* `hits`: List of [`PifSearchHit`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_result.html#module-citrination_client.search.pif.result.pif_search_hit) objects.

This method is useful when we want to obtain actual PIF data. For example, we can use it to retrieve the PIFs in the example dataset from this tutorial sequence:

In [None]:
dataset_id = 172242 # change this to be your dataset id
print("The dataset URL is: {}/datasets/{}".format(site, dataset_id))

system_query = PifSystemReturningQuery(
    size=500,   # Maximum number of hits to return.
    query=DataQuery(
        dataset=DatasetQuery(
            id=Filter(
                equal=str(dataset_id)))))

search_result = search_client.pif_search(system_query)
print("Found {} PIFs in dataset {}.".format(search_result.total_num_hits, dataset_id))

Each [`PifSearchHit`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_result.html#module-citrination_client.search.pif.result.pif_search_hit) object has `id` and `system` attributes to extract the ID and System data of the PIF record.

In [None]:
print("The first PIF record is {}".format(search_result.hits[0].id))
print(pif.dumps(search_result.hits[0].system, indent=4))

### Example: Filter a range of values
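Whereas the `.system` attribute above returned the entire PIF, the `.extracted` attribute returns only the fields tagged with `extract_as` in the query. Here we also use the `min` and `max` parameters of `Filter` to keep only band gap values between 3.0 and 6.0.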

In [None]:
system_query = PifSystemReturningQuery(
    size=500,
    query=DataQuery(
        dataset=DatasetQuery(id=Filter(equal=str(dataset_id))),  # pass the ID as a string, as in the other cells
        system=PifSystemQuery(
            chemical_formula=ChemicalFieldQuery(
                extract_as='Chemical formula',
                filter=ChemicalFilter(equal='?x?y')),
            properties=PropertyQuery(
                name=FieldQuery(
                    filter=Filter(equal='Band gap')),
                value=FieldQuery(
                    filter=Filter(min=3.0, max=6.0),
                    extract_as='Band gap')))))
                    

search_result = search_client.pif_search(system_query)
print("Found {} PIFs in dataset {}.".format(search_result.total_num_hits, dataset_id))
print([x.extracted for x in search_result.hits][:2])
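
Assuming each `.extracted` value is a flat dictionary keyed by the `extract_as` aliases, the hits convert easily into a table. A minimal sketch using only the standard library follows; `extracted_rows` is made-up sample data standing in for `[x.extracted for x in search_result.hits]`, so the example runs without an API key.

```python
import csv
import io

# Made-up sample data in the shape assumed for [x.extracted for x in search_result.hits]
extracted_rows = [
    {"Chemical formula": "GaN", "Band gap": "3.4"},
    {"Chemical formula": "ZnS", "Band gap": "3.6"},
]

def rows_to_csv(rows, columns):
    """Serialize a list of flat dicts to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(rows_to_csv(extracted_rows, ["Chemical formula", "Band gap"]))
```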

### Example: Logic
Filters can be combined through the `logic` parameter. Here we search for materials that `SHOULD` be oxides but `MUST_NOT` have exactly one oxygen atom in the formula.

In [None]:
query_size = 5
query_logical = PifSystemReturningQuery(
    size=query_size,
    query=DataQuery(
        dataset=DatasetQuery(
            id=Filter(equal=str(dataset_id))),
        system=PifSystemQuery(
            chemical_formula=ChemicalFieldQuery(
                extract_as='formula',
                filter=[ChemicalFilter(equal='?xOy', logic="SHOULD"),
                        ChemicalFilter(equal='?xO1', logic="MUST_NOT")]))))

search_result = search_client.pif_search(query_logical)
shown = search_result.hits[:query_size]  # slicing avoids an IndexError if fewer hits come back
print("{} total hits, the first {} of which are:".format(search_result.total_num_hits, len(shown)))
for hit in shown:
    print(pif.dumps(hit.extracted))
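
The combination of `SHOULD` and `MUST_NOT` can be mimicked locally on a handful of formulas. This sketch assumes Boolean semantics in which a record is kept when at least one `SHOULD` condition holds and no `MUST_NOT` condition holds. The formula parser and both predicates are simplified, hypothetical stand-ins for the `?xOy` and `?xO1` patterns, not the server's implementation.

```python
import re

def element_counts(formula):
    """Parse a simple formula such as 'Al2O3' into {'Al': 2, 'O': 3} (no parentheses or hydrates)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def is_binary_oxide(counts):
    # Stand-in for '?xOy': exactly one other element plus oxygen
    return len(counts) == 2 and "O" in counts

def has_single_oxygen(counts):
    # Stand-in for '?xO1': a binary oxide with exactly one oxygen atom
    return is_binary_oxide(counts) and counts["O"] == 1

def keep(formula, should, must_not):
    """Keep a formula if any SHOULD predicate holds and no MUST_NOT predicate holds."""
    counts = element_counts(formula)
    return any(p(counts) for p in should) and not any(p(counts) for p in must_not)

formulas = ["MgO", "Al2O3", "TiO2", "NaCl", "Cu2O"]
print([f for f in formulas if keep(f, [is_binary_oxide], [has_single_oxygen])])  # ['Al2O3', 'TiO2']
```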

## Dataset search

[Back to ToC](#Table-of-contents)

In other instances, we might be interested in knowing which datasets contain the information we want. While this can technically be done with a PIF search and then parsing through the dataset fields, there's also a [`DatasetReturningQuery`](http://citrineinformatics.github.io/python-citrination-client/modules/search/dataset_query.html#module-citrination_client.search.dataset.query.dataset_returning_query) that can be directly input into the `dataset_search()` method. The method returns a [`DatasetSearchResult`](http://citrineinformatics.github.io/python-citrination-client/modules/search/dataset_result.html#module-citrination_client.search.dataset.result.dataset_search_result) object with the following attributes:
* `took`: Number of milliseconds the query took.
* `total_num_hits`: The total number of dataset hits.
* `hits`: List of [`DatasetSearchHit`](http://citrineinformatics.github.io/python-citrination-client/modules/search/dataset_result.html#module-citrination_client.search.dataset.result.dataset_search_hit) objects.

As an example, we'll search for all datasets that contain a PIF with the chemical formula $\text{As}_{2}\text{S}_{3}$. We randomize the returned results by setting `random_results=True` on the `DatasetReturningQuery`.

In [None]:
dataset_query = DatasetReturningQuery(
    size=100,
    random_results=True,
    query=DataQuery(
        system=PifSystemQuery(
            chemical_formula=ChemicalFieldQuery(
                filter=ChemicalFilter(
                    equal='As2S3')))))

search_result = search_client.dataset_search(dataset_query)
print('{} datasets matched this query.'.format(search_result.total_num_hits))

Each [`DatasetSearchHit`](http://citrineinformatics.github.io/python-citrination-client/modules/search/dataset_result.html#module-citrination_client.search.dataset.result.dataset_search_hit) object has many attributes that provide more context.

In [None]:
first = search_result.hits[0]
print('A matching dataset is "{}" with ID {}.\nIt is owned by {} and was last updated at {}.'.format(
    first.name, first.id, first.owner, first.updated_at))
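
When a search returns many datasets, it can help to summarize the hits before inspecting them one by one. A small sketch that counts datasets per owner follows; `FakeHit` is a hypothetical stand-in carrying the `name`, `id`, and `owner` attributes used above, and its values are made up so the example runs offline.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for DatasetSearchHit, limited to attributes used in this notebook
FakeHit = namedtuple("FakeHit", ["name", "id", "owner"])

hits = [
    FakeHit("Chalcogenide glasses", "101", "A. Researcher"),
    FakeHit("Arsenic sulfide optics", "102", "B. Scientist"),
    FakeHit("Sulfide band gaps", "103", "A. Researcher"),
]

def datasets_per_owner(hits):
    """Count how many hits belong to each owner, largest count first."""
    return Counter(hit.owner for hit in hits).most_common()

print(datasets_per_owner(hits))  # [('A. Researcher', 2), ('B. Scientist', 1)]
```

With real results, the same function could be applied to `search_result.hits`.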

### Example: `get_datasets_by_owner`
We will write and demonstrate a wrapper function that takes a `SearchClient` object and an owner name (`string`) and returns up to 1000 datasets owned by that person.

This example uses the `dataset_search()` method to obtain a list of datasets. It builds a [`FieldQuery`](http://citrineinformatics.github.io/python-citrination-client/modules/search/pif_query_core.html#module-citrination_client.search.pif.query.core.field_query) object to match against the owner's name.

In [None]:
def get_datasets_by_owner(client, owner_name):
    owner_query = FieldQuery(filter=Filter(equal=owner_name))
    dataset_query = DatasetQuery(owner=owner_query)
    query = DataQuery(dataset=dataset_query)
    datasets = client.dataset_search(DatasetReturningQuery(query=query, size=1000))
    return datasets

owner = 'Enze Chen' # You can change the name here
print('{} has {} datasets.'.format(owner, get_datasets_by_owner(search_client, owner).total_num_hits))

## Conclusion

[Back to ToC](#Table-of-contents)

To recap, this notebook discussed how to search for data on Citrination using the `SearchClient`. The topics covered included:
* How to properly initialize the `SearchClient`.
* How to construct PIF queries.
* How to construct dataset queries.
* How to use `Filter` objects to match values.

## Additional resources

[Back to ToC](#Table-of-contents)

Some other topics that might interest you include:
* Other examples on [learn-citrination](https://github.com/CitrineInformatics/learn-citrination), including [Intro](../tutorial_sequence/3_IntroQueries.ipynb) and [Advanced](../tutorial_sequence/AdvancedQueries.ipynb) queries.
* [DataClient](http://citrineinformatics.github.io/python-citrination-client/tutorial/data_examples.html) - This allows you to create datasets and upload PIF data (only) using the API.
  * There is also a corresponding [tutorial](1_data_client_api_tutorial.ipynb).