# Advanced Queries
*Authors: Max Hutchinson, Carena Church, Enze Chen*

In this notebook, we will continue our discussion of the query language and the `SearchClient` API.

## Background knowledge
To get the most out of this tutorial, you should be familiar with:
* Everything in the [Intro to Queries tutorial](IntroQueries.ipynb).

## Python package imports

In [7]:
# Standard packages
from os import environ

# Third-party packages
from citrination_client import CitrinationClient
from citrination_client import PifSystemReturningQuery, PifSystemQuery, FieldQuery, ValueQuery
from citrination_client import PropertyQuery, DataQuery, DatasetQuery, ChemicalFieldQuery, ChemicalFilter, Filter
from pypif import pif

## Query structure mirrors PIF structure
You can query subsystems, processing steps, properties, conditions of properties, and so on by building a query whose object hierarchy mirrors the part of the PIF you want to match.

### Flattening the PIF structure
`extract_as` creates a flattened dictionary that maps user-supplied keys to the objects in the PIF matched by the query.

`extract_all` is an option used alongside `extract_as` that returns a list of all objects matching the query at that level of the hierarchy.

Let's search for the "Enthalpy of Formation" property:

In [8]:
# Initialize the SearchClient
client = CitrinationClient(environ['CITRINATION_API_KEY'], 'https://citrination.com')
search_client = client.search

dataset_id = 150675
query_size = 10

system_query = PifSystemReturningQuery(
            size=query_size,
            query=DataQuery(
                dataset=DatasetQuery(
                    id=[Filter(equal=str(dataset_id))]),
                chemical_formula=ChemicalFieldQuery(
                    extract_as='formula',
                    filter=ChemicalFilter(
                        equal='CdTe')),
                system=PifSystemQuery(
                    properties=PropertyQuery(
                        extract_all=True,
                        name=FieldQuery(
                            filter=[Filter(equal="Enthalpy of Formation")]),
                        value=FieldQuery(
                            extract_as="formation_enthalpy",
                            extract_all=True)))))

query_result = search_client.pif_search(system_query)
print("{} hits.\n".format(query_result.total_num_hits))
print("Extracted fields:")
for i in range(2):
    print(pif.dumps(query_result.hits[i].extracted, indent=4))

69640 hits.

Extracted fields:
{
    "formation_enthalpy": [
        "0.0"
    ]
}
{
    "formation_enthalpy": [
        "0.1074050600000005"
    ]
}
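
The extracted values above come back as lists of strings rather than numbers. The following is a minimal sketch of converting them for numerical work. The `extracted_hits` list is copied by hand from the output above so the snippet runs without an API key, and `to_floats` is a hypothetical helper, not part of `citrination_client`.

```python
# Extracted fields copied from the output above. With a live client these
# would be [hit.extracted for hit in query_result.hits].
extracted_hits = [
    {"formation_enthalpy": ["0.0"]},
    {"formation_enthalpy": ["0.1074050600000005"]},
]


def to_floats(extracted, key):
    """Return the values stored under `key` as floats (hypothetical helper).

    A missing key yields an empty list instead of raising KeyError.
    """
    return [float(value) for value in extracted.get(key, [])]


enthalpies = [to_floats(hit, "formation_enthalpy") for hit in extracted_hits]
print(enthalpies)
```

Using `dict.get` with a default keeps the helper safe for hits where a property was not extracted.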


## Chemical formula search
Citrine has developed specialized search functionality specifically for chemical formulas. The analyzer parses the search string and recognizes chemical entities such as elements and stoichiometries to find chemically relevant results.

1. You can use `generate_simple_chemical_query()` with a simple search string like "PbSe", or
2. You can structure a `PifSystemReturningQuery` with more detailed elemental and stoichiometric strings.

Let's search over the Materials Project dataset using `mp_dataset_id = 150675` as the dataset ID. We set the query size of the simple chemical query to 10000, which equals the `max_query_size` and therefore avoids triggering a warning.

In [9]:
mp_dataset_id = 150675
query_size = 10000
simple_query = search_client.generate_simple_chemical_query(chemical_formula="PbSe", include_datasets=[mp_dataset_id],
                                                            size=query_size)
search_result = search_client.pif_search(simple_query)
num_pifs = search_result.total_num_hits
print("{} total hits.\n".format(num_pifs))

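# Iterate over the returned hits directly; total_num_hits can exceed the
# number of hits actually returned when it is larger than the query size.
for hit in search_result.hits:
    print(pif.dumps(hit.extracted))
    print("\n")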

4 total hits.

{"property_units": "g/cm$^3$", "name": "Lead selenide - HP", "property_value": "8.767862185821011", "chemical_formula": "PbSe", "property_name": "Density"}


{"property_units": "g/cm$^3$", "name": "Lead selenide", "property_value": "4.062929739915243", "chemical_formula": "PbSe", "property_name": "Density"}


{"property_units": "g/cm$^3$", "name": "Clausthalite", "property_value": "7.872521935843158", "chemical_formula": "PbSe", "property_name": "Density"}


{"property_units": "g/cm$^3$", "name": "Nickel lead selenide (3/2/2)", "property_value": "9.207453481307569", "chemical_formula": "Ni3(PbSe)2", "property_name": "Density"}
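
To make "recognizing elements and stoichiometries" concrete, here is a toy formula parser written with the standard library only. It is an illustrative sketch, not Citrine's analyzer: the name `parse_formula` is hypothetical, and it only handles integer counts and parentheses.

```python
import re
from collections import Counter

# One token is either an element symbol with an optional count, an opening
# parenthesis, or a closing parenthesis with an optional group multiplier.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")


def parse_formula(formula):
    """Return a dict of element -> atom count for a formula like 'Ni3(PbSe)2'."""
    stack = [Counter()]  # one Counter per open parenthesis level
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError("Unexpected character at position {}".format(pos))
        element, count, open_paren, close_paren, group_count = match.groups()
        if element:
            stack[-1][element] += int(count or 1)
        elif open_paren:
            stack.append(Counter())
        else:
            if len(stack) == 1:
                raise ValueError("Unbalanced ')' in formula")
            group = stack.pop()
            multiplier = int(group_count or 1)
            for symbol, number in group.items():
                stack[-1][symbol] += number * multiplier
        pos = match.end()
    if len(stack) != 1:
        raise ValueError("Unbalanced '(' in formula")
    return dict(stack[0])


for formula in ["PbSe", "Ni3(PbSe)2"]:
    print(formula, parse_formula(formula))
```

The two formulas are taken from the hits above; the parser expands the `(PbSe)2` group into two Pb and two Se atoms.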




### `formula_filter`
Now let's explore the different filters we can apply to chemical formulas. We will first write a helper function that builds a query applying a `ChemicalFilter` to the `chemical_formula` field of a `PifSystemQuery`. If the flag matches none of the cases, no chemical filter is applied. Note that the wildcard character `?` can match any element.

In [15]:
def formula_filter(filter_flag):
    chemical_filter = None
    if filter_flag == 'Gallium':
        chemical_filter = ChemicalFilter(equal='Ga')
    elif filter_flag == 'Ternary':
        chemical_filter = ChemicalFilter(equal='?x?y?z')
    elif filter_flag == 'Oxides':
        chemical_filter = ChemicalFilter(equal='?xOy')
    elif filter_flag == 'Single oxides':
        chemical_filter = ChemicalFilter(equal='?1O1')
        
    query_filter = PifSystemReturningQuery(
                    size=query_size,
                    random_results=True,
                    query=DataQuery(
                        dataset=DatasetQuery(
                            id=[Filter(equal=str(mp_dataset_id))]),
                        system=PifSystemQuery(
                            chemical_formula=ChemicalFieldQuery(
                                extract_all=True,
                                extract_as='formula',
                                filter=chemical_filter))))
    return query_filter
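
As a design alternative, the flag-to-pattern branches in `formula_filter` could be written as a lookup table. The sketch below is standard-library Python only; `FORMULA_PATTERNS` and `pattern_for` are hypothetical names introduced for illustration, and the pattern strings are the ones used in the helper above.

```python
# Flag-to-pattern table mirroring the if/elif branches of formula_filter.
FORMULA_PATTERNS = {
    "Gallium": "Ga",
    "Ternary": "?x?y?z",
    "Oxides": "?xOy",
    "Single oxides": "?1O1",
}


def pattern_for(filter_flag):
    """Return the pattern string for a flag, or None if it is unrecognized."""
    return FORMULA_PATTERNS.get(filter_flag)


for flag in ["Gallium", "Single oxides", "None"]:
    print(flag, "->", pattern_for(flag))
```

Inside the helper, a non-`None` pattern would then be wrapped as `ChemicalFilter(equal=pattern)`, while an unrecognized flag such as `'None'` leaves the filter unset.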

### #nofilter—Return all materials in the Materials Project
First we will filter only on the dataset ID to query within the Materials Project database on Citrination. Because the flag `'None'` matches none of the cases in `formula_filter`, no chemical filter is applied.

In [17]:
query_size = 5
query_filter_none = formula_filter('None')
search_result = search_client.pif_search(query_filter_none)
print("{} total hits, the first 5 of which are:".format(search_result.total_num_hits))
for i in range(5):
    print(pif.dumps(search_result.hits[i].extracted))

69640 total hits, the first 5 of which are:
{"formula": ["Sc29Fe6"]}
{"formula": ["La2ZnO4"]}
{"formula": ["Mg(VS2)2"]}
{"formula": ["V2Co(PO5)2"]}
{"formula": ["CeAl3"]}


### Filter for Gallium-containing compounds
Next we will apply a filter for the `chemical_formula` to select the subset of materials containing the element Gallium.

In [18]:
query_filter_Ga = formula_filter('Gallium')
search_result = search_client.pif_search(query_filter_Ga)
print("{} total hits, the first 5 of which are:".format(search_result.total_num_hits))
for i in range(5):
    print(pif.dumps(search_result.hits[i].extracted))

23 total hits, the first 5 of which are:
{"formula": ["Ga(MoSe2)4"]}
{"formula": ["GaS"]}
{"formula": ["Ga(IO3)3"]}
{"formula": ["Ga"]}
{"formula": ["Ga(SbCl)4"]}


### Filter for ternary compounds
Next, we will apply a different filter to the `chemical_formula` to select the subset of materials with ternary composition $A_xB_yC_z$.

In [19]:
query_filter_ternary = formula_filter('Ternary')
search_result = search_client.pif_search(query_filter_ternary)
print("{} total hits, the first 5 of which are:".format(search_result.total_num_hits))
for i in range(5):
    print(pif.dumps(search_result.hits[i].extracted))

33017 total hits, the first 5 of which are:
{"formula": ["BaSbAu"]}
{"formula": ["PrErTl2"]}
{"formula": ["Cs2ZnCl4"]}
{"formula": ["Ag2SnO3"]}
{"formula": ["La3Ge3Cl2"]}
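
A quick client-side sanity check on these results is to count the distinct element symbols in each returned formula. The sketch below uses only the standard library, with the formulas copied from the output above; `distinct_elements` is a hypothetical helper and relies on the assumption that every element symbol is one uppercase letter optionally followed by one lowercase letter.

```python
import re


def distinct_elements(formula):
    """Return the set of element symbols that appear in a formula string."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


# Formulas copied from the ternary query output above.
formulas = ["BaSbAu", "PrErTl2", "Cs2ZnCl4", "Ag2SnO3", "La3Ge3Cl2"]

for formula in formulas:
    elements = distinct_elements(formula)
    print(formula, sorted(elements), len(elements) == 3)
```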


### Filter for oxides
Next, we will apply a different filter to the `chemical_formula` to select the subset of materials that are oxides.

In [20]:
query_filter_oxides = formula_filter('Oxides')
search_result = search_client.pif_search(query_filter_oxides)
print("{} total hits, the first 5 of which are:".format(search_result.total_num_hits))
for i in range(5):
    print(pif.dumps(search_result.hits[i].extracted))

1577 total hits, the first 5 of which are:
{"formula": ["SeO3"]}
{"formula": ["RuO2"]}
{"formula": ["Ce4DyO9"]}
{"formula": ["WO3"]}
{"formula": ["Fe2O3"]}


### Filter for single-O oxides
Lastly, we will apply a filter to the `chemical_formula` to select the subset of materials that are oxides with only 1 oxygen atom.

In [21]:
query_filter_single_oxides = formula_filter('Single oxides')
search_result = search_client.pif_search(query_filter_single_oxides)
print("{} total hits, the first 5 of which are:".format(search_result.total_num_hits))
for i in range(5):
    print(pif.dumps(search_result.hits[i].extracted))

121 total hits, the first 5 of which are:
{"formula": ["RbO"]}
{"formula": ["FeO"]}
{"formula": ["CO"]}
{"formula": ["CuO"]}
{"formula": ["NpO"]}


## Logical operations

We can also combine queries with the following logical operations, set through the `logic` argument: `SHOULD`, `MUST`, `OPTIONAL`, and `MUST_NOT`.
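
Before running these against the server, it may help to see the idea on local data. The following toy filter is a sketch under a stated assumption: it takes the conventional boolean-query reading in which a `MUST` clause has to match and a `MUST_NOT` clause has to fail, and it ignores `SHOULD` and `OPTIONAL` entirely. It does not reproduce how Citrination evaluates queries; `element_counts` and `keep` are hypothetical helpers.

```python
import re


def element_counts(formula):
    """Count atoms per element in a flat formula (no parentheses)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def keep(formula, clauses):
    """Apply (logic, predicate) clauses; assumed MUST / MUST_NOT semantics."""
    counts = element_counts(formula)
    for logic, predicate in clauses:
        matched = predicate(counts)
        if logic == "MUST" and not matched:
            return False
        if logic == "MUST_NOT" and matched:
            return False
    return True


clauses = [
    ("MUST", lambda counts: "O" in counts),             # is an oxide
    ("MUST_NOT", lambda counts: counts.get("O") == 1),  # exactly one O atom
]

formulas = ["SiO2", "FeO", "VO2", "GaS"]
kept = [formula for formula in formulas if keep(formula, clauses)]
print(kept)
```

Here `FeO` is dropped by the `MUST_NOT` clause and `GaS` by the `MUST` clause, leaving `['SiO2', 'VO2']`.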

In [22]:
query_size = 3
print('Logical operation 1: Oxides that MUST NOT have only 1 O atom.')
query_logical1 = PifSystemReturningQuery(
            size=query_size,
            random_results=True,
            query=DataQuery(
                dataset=DatasetQuery(
                    id=[Filter(equal=str(mp_dataset_id))]),
                system=PifSystemQuery(
                    chemical_formula=[
                        ChemicalFieldQuery(
                            extract_as='formula',
                            filter=ChemicalFilter(
                                equal='?1O1'),
                            logic="MUST_NOT"),
                        ChemicalFieldQuery(
                            extract_as='formula',
                            filter=ChemicalFilter(
                                equal='?xOy'))]
                )))

search_result = search_client.pif_search(query_logical1)
print("{} total hits, the first {} of which are:".format(search_result.total_num_hits, query_size))
for i in range(query_size):
    print(pif.dumps(search_result.hits[i].extracted))

print("\nLogical operation 2: All compounds that MUST have 'Enthalpy of Formation' and 'Band gap', and SHOULD have 'Crystal System'")
query_logical2 = PifSystemReturningQuery(
            size=query_size,
            random_results=True,
            query=DataQuery(
                dataset=DatasetQuery(
                    id=[Filter(equal=str(mp_dataset_id))]
                ),
                system=PifSystemQuery(
                    chemical_formula=ChemicalFieldQuery(
                        extract_as='formula'
                    ),
                    properties=[
                        PropertyQuery(
                            name=FieldQuery(
                                filter=[Filter(equal="Enthalpy of Formation")]),
                            value=FieldQuery(
                                extract_as="H_f",
                                logic="MUST")
                        ),
                        PropertyQuery(
                            name=FieldQuery(
                                filter=[Filter(equal="Band Gap")]),
                            value=FieldQuery(
                                filter=[Filter(min=1E-5)],
                                extract_as="bandgap",
                                logic="MUST")
                        ),
                        PropertyQuery(
                            name=FieldQuery(
                                filter=[Filter(equal="Crystal System")]),
                            value=FieldQuery(
                                extract_as="crystal system",
                                logic="SHOULD")
                        )]
                )))

search_result = search_client.pif_search(query_logical2)
print("{} total hits, the first {} of which are:".format(search_result.total_num_hits, query_size))
for i in range(query_size):
    print(pif.dumps(search_result.hits[i].extracted, indent=4))

Logical operation 1: Oxides that MUST NOT have only 1 O atom.
1456 total hits, the first 3 of which are:
{"formula": "Zr6O11"}
{"formula": "SiO2"}
{"formula": "VO2"}

Logical operation 2: All compounds that MUST have 'Enthalpy of Formation' and 'Band gap', and SHOULD have 'Crystal System'
69640 total hits, the first 3 of which are:
{
    "crystal system": "hexagonal",
    "formula": "CuSe",
    "H_f": "0.09619559236979214"
}
{
    "crystal system": "monoclinic",
    "bandgap": "2.531",
    "formula": "Gd3Si2Cl5O6",
    "H_f": "-3.2142637892999995"
}
{
    "crystal system": "triclinic",
    "bandgap": "2.9164999999999996",
    "formula": "NaCa8SmTi10(SiO5)10",
    "H_f": "-3.4735250649270832"
}
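
Note that the first hit above has no `bandgap` key, so code that consumes `extracted` should not assume every key is present. The sketch below normalizes hits into uniform rows. Two of the extracted dictionaries are copied by hand from the output above, and `to_row` is a hypothetical helper, not part of `citrination_client`.

```python
# Two extracted dictionaries copied from the output above; the first one
# has no "bandgap" entry.
extracted_hits = [
    {"crystal system": "hexagonal", "formula": "CuSe",
     "H_f": "0.09619559236979214"},
    {"crystal system": "monoclinic", "bandgap": "2.531",
     "formula": "Gd3Si2Cl5O6", "H_f": "-3.2142637892999995"},
]


def to_row(extracted):
    """Build a uniform row, using None for any key that was not extracted."""
    def number(key):
        value = extracted.get(key)
        return float(value) if value is not None else None

    return {
        "formula": extracted.get("formula"),
        "crystal_system": extracted.get("crystal system"),
        "H_f": number("H_f"),
        "bandgap": number("bandgap"),
    }


rows = [to_row(hit) for hit in extracted_hits]
for row in rows:
    print(row)
```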


## Conclusion
This concludes the second tutorial on the query language and the `SearchClient`. We discussed how to use `extract_as` and `extract_all` to return flattened PIFs. Then we discussed how to filter searches by matching specific patterns of chemical formulas. Finally, we gave examples of logical operations that can be used to combine different queries. At this point, you should have a good grasp of the search capabilities enabled by the Python Citrination Client.