# OPTIMADE and *pymatgen*

# What is *pymatgen*?

[*pymatgen*](https://pymatgen.org) is a materials science analysis code written in the Python programming language. It helps power the [Materials Project](https://materialsproject.org)'s high-throughput DFT workflows. It supports integration with a wide variety of simulation codes and can perform many analysis tasks such as the generation of phase diagrams or diffraction patterns.

# The motivation behind this tutorial

**This tutorial is aimed either at:**

* People who are already familiar with using *pymatgen* or the Materials Project
    * In particular, anyone already using the Materials Project API through the `MPRester`, and who would like to start using the OPTIMADE API in a similar way

* People who like using Python and think they might appreciate an interface like the one provided by *pymatgen*.
    * *pymatgen* provides a lot of input/output routines (such as conversion to CIF, POSCAR, etc.) and analysis tools (such as determination of symmetry, analysis of possible bonds, etc.) that can be performed directly on structures retrieved from OPTIMADE providers.

**What this tutorial is not:**

* This is not necessarily the way everyone should be accessing OPTIMADE providers!
    * This tool may be useful to you, or it may not be. There are a lot of good tools available in our community. You are encouraged to try out different tools and find the one that's most useful for your own work.

* It is not currently the best way to access OPTIMADE APIs for advanced users.
    * It is still under development.
    * It is unit tested against several OPTIMADE providers but **some do not work yet**.
    * It only currently supports information retrieval from `/v1/structures/` routes.

# Pre-requisites

This tutorial is aimed at people who already have a basic understanding of Python, including how to import modules, the use of basic data structures like dictionaries and lists, and how to instantiate and use objects.

If you do not have this understanding of Python, this tutorial may help you become familiar, but you are highly encouraged to follow a dedicated Python course such as those provided by [Software Carpentry](https://software-carpentry.org).

# Install pymatgen

This tutorial uses the Python programming language. It can be run on any computer with Python installed. For convenience, here we are running in Google's "Colaboratory" notebook environment.

Before we begin, we must install the `pymatgen` package:

In [None]:
!pip install 'pymatgen>=2023.2.22' pybtex

Next, let us **verify the correct version of *pymatgen* is installed**. This is good practice to do before starting out! For this tutorial we need version 2023.2.22 or above. We also need the `pybtex` package installed.

In [None]:
try:
    from importlib.metadata import version  # standard library on Python 3.8+
except ImportError:
    from importlib_metadata import version  # backport for older Python versions

In [None]:
version("pymatgen")
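To make the check explicit rather than reading the version by eye, a comparison along these lines may help. This is only a sketch: `parse_version` and `meets_minimum` are hypothetical helpers, not part of *pymatgen*, and they assume date-style version strings such as `2023.2.22`.

```python
import re


def parse_version(text):
    """Turn '2023.2.22' into (2023, 2, 22).

    Keeps the leading digits of each dot-separated part and stops at the
    first part that has none (for example a 'post1' suffix).
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="2023.2.22"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2023.2.22"))
print(meets_minimum("2022.11.7"))
```

In the notebook, `meets_minimum(version("pymatgen"))` would apply the same check to the installed package.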

# Import and learn about the `OptimadeRester`

The `OptimadeRester` is a class that is designed to retrieve data from an OPTIMADE provider and automatically convert the data into *pymatgen* `Structure` objects. These `Structure` objects are designed as a good intermediate format for crystallographic structure analysis, transformation and input/output.

You can read documentation on the `OptimadeRester` here: https://pymatgen.org/pymatgen.ext.optimade.html

In [None]:
from pymatgen.ext.optimade import OptimadeRester

The first step is to inspect the **documentation** for the `OptimadeRester`. We can run:

In [None]:
OptimadeRester?

# Understanding "aliases" as shortcuts for accessing given providers

In [None]:
OptimadeRester.aliases

These aliases are useful since they can provide a quick shorthand for a given database without having to remember a full URL.

This list of aliases is updated periodically. However, new OPTIMADE providers can be made available and will be listed at https://providers.optimade.org. The `OptimadeRester` can query the OPTIMADE providers list to refresh the available aliases.

You can do this as follows, but be aware this might take a few moments:

In [None]:
opt = OptimadeRester()
opt.refresh_aliases()

# Connecting to one or more OPTIMADE providers

Let's begin by connecting to the Materials Project (`mp`) and 2DMatPedia (`twodmatpedia`) databases.
By default, *pymatgen* expects a server to reply within 5 seconds; some servers, however, may need up to several minutes to process a query.
You can therefore set the timeout to a different value (in seconds) if you get a "Read timed out" error.

In [None]:
opt = OptimadeRester(["mp", "twodmatpedia"], timeout=10)

We can find more information about the OPTIMADE providers we are connected to using the `describe()` method.

In [None]:
print(opt.describe())

# Query for materials: binary nitrides case study

`OptimadeRester` provides a `get_structures` method. **It does not support all features of OPTIMADE filters** but is a good place to get started.

For this case study, we will search for materials containing nitrogen and that have two elements.

In [None]:
results = opt.get_structures(elements=["N"], nelements=2)

We see that the `OptimadeRester` does some of the hard work for us: it automatically retrieves multiple pages of results when many results are available, and also gives us a progress bar.

Let us inspect the `results`:

In [None]:
type(results)  # this method returns a dictionary, so let's examine the keys of this dictionary...

In [None]:
results.keys()  # we see that the results dictionary is keyed by provider/alias

In [None]:
results['mp'].keys()  # and these are then keyed by that database's unique identifier
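Because the results are a dictionary of dictionaries, a quick count per provider can be a useful sanity check before going further. The sketch below uses a small stand-in dictionary so that it runs without a network connection; `count_by_provider` is a hypothetical helper, and the placeholder identifiers and values are made up for illustration.

```python
def count_by_provider(results):
    """Map each provider alias to the number of structures it returned."""
    return {provider: len(structures) for provider, structures in results.items()}


# Stand-in for the real `results`; the values would normally be Structure objects.
fake_results = {
    "mp": {"mp-804": "<Structure>", "placeholder-id-1": "<Structure>"},
    "twodmatpedia": {"placeholder-id-2": "<Structure>"},
}

print(count_by_provider(fake_results))
```

Calling `count_by_provider(results)` on the real query results should work the same way, since only the dictionary layout matters.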

So let us inspect one structure as an example:

In [None]:
example_structure = results['mp']['mp-804']
print(example_structure)

We can then use *pymatgen* to further manipulate these `Structure` objects, for example to calculate the spacegroup or to convert to a CIF:

In [None]:
example_structure.get_space_group_info()

In [None]:
print(example_structure.to(fmt="cif", symprec=0.01))

# Data analysis

In this section, I will use some code I prepared earlier to summarize the `results` in a tabular format (a `DataFrame`).

In [None]:
import pandas as pd

In [None]:
records = []
for provider, structures in results.items():
    for identifier, structure in structures.items():
        records.append({
            "provider": provider,
            "identifier": identifier,
            "formula": structure.composition.reduced_formula,
            "spacegroup": structure.get_space_group_info()[0],
            "a_lattice_param": structure.lattice.a,
            "volume": structure.volume,
        })
df = pd.DataFrame(records)

In [None]:
df

To pick one specific formula as an example, we can use tools from `pandas` to show the spacegroups present for that formula:

In [None]:
df[df["formula"] == "GaN"].spacegroup

Here, we see that there are a few common high-symmetry spacegroups (such as $P6_3mc$), but there are also many low-symmetry structures ($P1$).

I know that in this instance, this is because the $P1$ structures are actually amorphous and not crystalline. This highlights the importance of doing appropriate **data cleaning** on retrieved data.
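One crude way to act on this is to set aside every entry reported as $P1$ before further analysis. The sketch below works on plain dictionaries shaped like the `records` list built earlier. `split_by_spacegroup` is a hypothetical helper, the sample rows are placeholders rather than real query results, and excluding all $P1$ entries is an assumption that could also discard genuine triclinic crystals.

```python
def split_by_spacegroup(records, excluded=("P1",)):
    """Split records into (kept, dropped) according to their spacegroup symbol."""
    kept, dropped = [], []
    for record in records:
        target = dropped if record["spacegroup"] in excluded else kept
        target.append(record)
    return kept, dropped


# Placeholder rows using some of the keys from the tutorial's `records` list.
sample_records = [
    {"identifier": "a", "formula": "GaN", "spacegroup": "P6_3mc"},
    {"identifier": "b", "formula": "GaN", "spacegroup": "P1"},
    {"identifier": "c", "formula": "GaN", "spacegroup": "F-43m"},
]

kept, dropped = split_by_spacegroup(sample_records)
print(len(kept), "kept;", len(dropped), "set aside for inspection")
```

Keeping the dropped rows, rather than deleting them, makes it easier to check by hand whether they really are amorphous. On the `DataFrame` itself, a comparable selection would be `df[df["spacegroup"] != "P1"]`.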

### Plotting data

As a quick example, we can also plot information in our table:

In [None]:
import plotly.express as px

In [None]:
px.bar(df, x="spacegroup", facet_row="provider")

**Remember, there is no single "best database" to use. Every database might be constructed for a specific purpose, subject to different biases, with different data qualities and sources.**

The ideal database for one scientist with one application in mind may be different to the ideal database for another scientist with a different application.

**The power of OPTIMADE is that you can query across multiple databases!**

# Advanced usage: querying using the OPTIMADE filter grammar

You can also query using an OPTIMADE filter as defined in the OPTIMADE specification and publication.

**This is recommended** for advanced queries to use the full power of OPTIMADE.

For example, the above query could have equally been performed as:

In [None]:
results = opt.get_structures_with_filter('(elements HAS ALL "N") AND (nelements=2)')
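To see how the keyword arguments relate to the filter string, here is a sketch that assembles the same filter from Python values. `build_filter` is a hypothetical helper written for illustration, not the code *pymatgen* uses internally, and it covers only the two fields used in this case study.

```python
def build_filter(elements=None, nelements=None):
    """Assemble a simple OPTIMADE filter string from keyword arguments."""
    clauses = []
    if elements:
        # OPTIMADE string values are double-quoted and comma-separated.
        quoted = ", ".join(f'"{symbol}"' for symbol in elements)
        clauses.append(f"(elements HAS ALL {quoted})")
    if nelements is not None:
        clauses.append(f"(nelements={nelements})")
    return " AND ".join(clauses)


print(build_filter(elements=["N"], nelements=2))
# prints: (elements HAS ALL "N") AND (nelements=2)
```

The resulting string could then be passed to `get_structures_with_filter`.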

# Advanced usage: retrieving provider-specific property information

The OPTIMADE specification allows for providers to include database-specific information in the returned data, prefixed by namespace.

To access this information with *pymatgen* we have to request "snls" (`StructureNL`) instead of "structures". A `StructureNL` is a `Structure` with additional metadata included, such as the URL it was downloaded from and any of this additional database-specific information.

In [None]:
results_snls = OptimadeRester("odbx").get_snls(nelements=2, additional_response_fields=["_odbx_thermodynamics"])

In [None]:
example_snl = results_snls['odbx']['odbx/2']

In [None]:
example_snl.data['_optimade']['_odbx_thermodynamics']

This extra data differs from database to database, and sometimes from material to material, so some exploration is required!
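Since provider-specific fields share a namespace prefix, one way to explore them is to filter the keys by that prefix. The sketch below assumes the metadata is a plain dictionary, as `example_snl.data['_optimade']` appears to be above. `provider_fields` is a hypothetical helper, and the sample keys and values are placeholders rather than real provider data.

```python
def provider_fields(optimade_data, prefix):
    """Return only the entries whose key starts with the given namespace prefix."""
    return {
        key: value
        for key, value in optimade_data.items()
        if key.startswith(prefix)
    }


# Placeholder metadata; real keys and values come from the provider's response.
sample_data = {
    "_odbx_thermodynamics": {"placeholder": None},
    "some_standard_field": 2,
}

print(sorted(provider_fields(sample_data, "_odbx_")))
```

With real data, `provider_fields(example_snl.data['_optimade'], "_odbx_")` would list whatever namespaced fields that particular entry carries.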

# When Things Go Wrong and How to Get Help

Bugs may be present! The `OptimadeRester` is still fairly new.

If it does not work it is likely because of either:

* A bug in the *pymatgen* code. This may be reported directly to Matthew Horton at mkhorton@lbl.gov or an issue can be opened in the *pymatgen* code repository. Matt apologises in advance if this is the case! 

* An issue with a provider. This may be because the provider does not yet fully follow the OPTIMADE specification, because the provider is suffering an outage, or because the filters are not yet optimized with that provider.

    * If this happens, you may try to first increase the `timeout` value to something larger. The default is too low for some providers.

    * Otherwise, you may want to contact the provider directly, or create a post at the OPTIMADE discussion forum: https://matsci.org/optimade

# How to Get Involved

New developers are very welcome to add code to *pymatgen*! If you want to get involved, help fix bugs or add new features, your help would be very much appreciated. *pymatgen* can only exist and be what it is today thanks to the many efforts of its [development team](https://pymatgen.org/team.html).