# **Hands-on-session1 (Querying data from materials databases)**

# **Necessary Libraries**

In [None]:
!pip install matminer[citrine]
!pip install pyyaml
!pip install mp_api
!pip install pandas==2.2.2

# **Reading data from the excel or csv file**

In [None]:
## Loading sample data file of "california_housing_train.csv" from folder "sample_data" as a dataframe using pandas
## To see the manual, type "?pd" or "?pd.read_csv" after importing pandas
## The detailed explanation can be found in "https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html"
import pandas as pd

file_path = "/content/sample_data/california_housing_train.csv"
df = pd.read_csv(file_path)

In [None]:
## Checking the loaded dataframe
df.head()

In [None]:
# Summary statistics of the DataFrame
# To see what you can do with this class, just put "." and wait to see the list of available functions and properties
df.describe()

In [None]:
## tuple representing the dimensionality of the DataFrame
df.shape

In [None]:
## Confirm axis labels to see features
df.columns

In [None]:
## Filtering the rows where the population is over 1000
df_pop = df[df['population'] > 1000]
df_pop.describe()

In [None]:
## Filtering with two conditions: population over 1000 and total bedrooms below 500
df_pop_bed = df[(df['population'] > 1000) & (df['total_bedrooms'] < 500)]
df_pop_bed.describe()
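
The two filters above rely on boolean masks. As a minimal sketch, the same pattern is shown below on a small made-up table (the values are placeholders, not rows from the housing file), next to the equivalent `DataFrame.query` form:

```python
import pandas as pd

# Made-up placeholder rows; only the column names match the cells above.
toy = pd.DataFrame(
    {
        "population": [800, 1200, 3500, 1500],
        "total_bedrooms": [150, 300, 900, 450],
    }
)

# Each comparison yields a boolean Series; "&" combines them element-wise.
# The parentheses are required because "&" binds tighter than ">" and "<".
mask = (toy["population"] > 1000) & (toy["total_bedrooms"] < 500)
by_mask = toy[mask]

# The same condition written as a query string.
by_query = toy.query("population > 1000 and total_bedrooms < 500")

print(by_mask)
```

Both forms keep the original row labels, so the filtered frame can still be matched back to the full table.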

In [None]:
## Generate a new boolean column flagging the rows with a population of at least 3000
df['High_population'] = df['population'] >= 3000
df.head(10)

In [None]:
## Save to a CSV file
df.to_csv("new_df.csv")
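
By default `to_csv` also writes the row index as an unnamed first column, which shows up as an extra column when the file is read back. The sketch below illustrates this with an in-memory buffer and made-up placeholder values, so no file is written:

```python
import io

import pandas as pd

# Made-up placeholder rows for illustration only.
toy = pd.DataFrame({"population": [800, 3500], "High_population": [False, True]})

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

Passing `index=False` is usually the simpler choice when the index is just the default row counter.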

# **Materials Project**

**Getting Data**

*Revised from Joseph Montoya's notebook*

This notebook demonstrates a few basic examples from matminer's data retrieval features. Matminer supports data retrieval from the following sources.

*   Materials Project (https://materialsproject.org/)

This notebook was last updated 11/15/18 for version 0.4.5 of matminer.

Each resource has a corresponding object in matminer designed for retrieving data and preprocessing it into a pandas dataframe. In addition, matminer can also access and aggregate data from your own mongo database, if you have one.


**Data retrieval**

The Materials Project data retrieval tool, matminer.data_retrieval.retrieve_MP.MPDataRetrieval, is initialized with an API key, which you can find on your personal dashboard page on materialsproject.org once you have created an account. If you have set your API key via pymatgen (e.g. pmg config --add PMG_MAPI_KEY YOUR_API_KEY_HERE), the data retrieval tool may be initialized without an input argument.

**We need our own MP API Key**

In [None]:
## Materials Project API client: https://docs.materialsproject.org/downloading-data/using-the-api/getting-started
## Loading module of MP API client
from mp_api.client import MPRester
import pandas as pd

In [None]:
?MPRester

In [None]:
## Put your own API key
my_api_key = ""
mpr = MPRester(my_api_key)
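
To avoid typing the key into a shared notebook, it can be read from an environment variable instead. The following is a minimal sketch: `load_api_key` is a hypothetical helper, and the variable name `MP_API_KEY` is an assumption made for this example.

```python
import os


def load_api_key(env_var="MP_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Materials Project API key."
        )
    return key


# Intended use (not executed here, since it needs a real key):
# mpr = MPRester(load_api_key())
```

Failing early with a clear message is easier to debug than an authentication error returned by a later query.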

In [None]:
## See the summary of available fields for crystalline materials
## How about molecular materials?
list_of_available_fields = mpr.materials.summary.available_fields
print(list_of_available_fields)

In [None]:
## Querying Data from MP (https://docs.materialsproject.org/downloading-data/using-the-api/querying-data)
## Load data for target properties
docs = mpr.materials.summary.search(formula="LiCoO2", fields=list_of_available_fields)
docs

In [None]:
## Transforming the list of documents into dictionaries and then into a dataframe
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
df_full.head(10)

In [None]:
## In case you cannot load the data from MP, load the data from the Excel file instead
df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_0")

In [None]:
## Selecting the properties you want to use
df = df_full.loc[:, ["formula_pretty", "material_id", "formation_energy_per_atom", "energy_above_hull", "band_gap"]]
df.head(10)

In [None]:
## https://docs.materialsproject.org/downloading-data/using-the-api/tips-for-large-downloads
## Before requesting data, use the has_props key to find which materials have data for your desired property.
## One source of wasted queries occurs when data is requested for materials that are either nonexistent or do not contain the property of interest.
## You should instead first determine what materials have the data you are looking for.
## For example, below is a query to get all of the material ID values for entries that have dielectric and density of states data:

docs = mpr.materials.summary.search(has_props=["dielectric", "dos"], fields=["material_id"])
docs[0]

In [None]:
## Load data for target properties
mat = mpr.materials.summary.search(material_ids="mp-28967")
mat

In [None]:
## Sort the values in ascending order for a specific property (ex. formation_energy_per_atom)
sdf = df.sort_values(by="formation_energy_per_atom")
sdf.head(10)
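
A short sketch of how `sort_values` treats missing values, using made-up placeholder identifiers and energies rather than Materials Project data:

```python
import pandas as pd

# Made-up placeholder rows; "id-b" has no formation energy.
toy = pd.DataFrame(
    {
        "material_id": ["id-a", "id-b", "id-c", "id-d"],
        "formation_energy_per_atom": [-1.2, None, -2.5, -0.3],
    }
)

# Ascending sort: missing values are placed last by default.
ascending = toy.sort_values(by="formation_energy_per_atom")

# nsmallest returns only the n lowest values and skips missing ones.
lowest_two = toy.nsmallest(2, "formation_energy_per_atom")

# Descending sort with missing values moved to the top.
descending = toy.sort_values(
    by="formation_energy_per_atom", ascending=False, na_position="first"
)

print(ascending)
```

When only the most stable few entries are needed, `nsmallest` avoids sorting and then slicing the whole table.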

In [None]:
## Load the same materials' data in a different way
docs = mpr.materials.summary.search(chemsys="Li-Co-O", formula="ABC2", fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
df = df_full.loc[:, ["formula_pretty", "material_id", "formation_energy_per_atom", "energy_above_hull", "band_gap"]]
sdf = df.sort_values(by="formation_energy_per_atom", axis=0)
sdf.head(10)

Getting a dataframe from the Materials Project is essentially equivalent to using the MPRester's query method (see https://api.materialsproject.org/docs). The inputs are the search criteria and fields, a list of the supported properties to return. See the MAPI documentation (https://docs.materialsproject.org/downloading-data).
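
The examples in this notebook repeat the same three steps: convert each document to a dictionary, build a dataframe, and select columns. A minimal sketch of a helper that bundles these steps is given here. `docs_to_dataframe` and `FakeDoc` are hypothetical names, and the sketch assumes each document exposes either `.dict()` (as used in this notebook) or `.model_dump()` (the pydantic v2 name).

```python
import pandas as pd


def docs_to_dataframe(docs, columns=None):
    """Convert document-like objects into a DataFrame (hypothetical helper)."""
    records = []
    for doc in docs:
        if isinstance(doc, dict):
            records.append(doc)
        elif hasattr(doc, "model_dump"):  # pydantic v2 style
            records.append(doc.model_dump())
        else:  # older style used in this notebook
            records.append(doc.dict())
    frame = pd.DataFrame(records)
    if columns is not None:
        # reindex keeps the requested order and fills absent columns with NaN
        # instead of raising a KeyError.
        frame = frame.reindex(columns=list(columns))
    return frame


class FakeDoc:
    """Stand-in for an API document; all values below are placeholders."""

    def __init__(self, **fields):
        self._fields = fields

    def dict(self):
        return dict(self._fields)


fake_docs = [
    FakeDoc(formula_pretty="X2O3", material_id="id-1", band_gap=1.0),
    FakeDoc(formula_pretty="Y2O3", material_id="id-2", band_gap=2.5),
]
df_demo = docs_to_dataframe(fake_docs, columns=["formula_pretty", "band_gap", "density"])
print(df_demo)
```

With a real client, the result of `mpr.materials.summary.search(...)` would be passed in place of `fake_docs`.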

#### Example 1: Get various properties of binary oxide materials with "A2O3" formula


In [None]:
docs = mpr.materials.summary.search(chemsys="*-O", formula="A2B3", fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_1")
df = df_full.loc[:, ["formula_pretty", "material_id", "formation_energy_per_atom", "energy_above_hull", "band_gap"]]
sdf = df.sort_values(by="formation_energy_per_atom", axis=0)
sdf.head(10)

#### Example 2: Get materials containing only "Fe" and "O"

In [None]:
docs = mpr.materials.summary.search(chemsys="*-*", elements=["Fe", "O"], fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_2")
df = df_full.loc[:, ["formula_pretty", "material_id", "energy_above_hull"]]
df.head()

#### Example 3: Get all bandgaps larger than 6.0 eV

In [None]:
docs = mpr.materials.summary.search(band_gap=(6,None), fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_3")
df = df_full.loc[:, ["formula_pretty", "material_id", "band_gap"]]
df.head()
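
When working offline from the Excel fallback, the same kind of range filter can be applied locally. The sketch below is a local pandas analogue of the `(lower, upper)` tuple with `None` for an open side; it is not how the API implements the filter, and treating the bounds as inclusive is an assumption. `filter_range` is a hypothetical helper and the rows are made-up placeholders.

```python
import pandas as pd


def filter_range(frame, column, bounds):
    """Keep rows whose value lies within (lower, upper); None leaves that side open."""
    lower, upper = bounds
    mask = frame[column].notna()
    if lower is not None:
        mask &= frame[column] >= lower
    if upper is not None:
        mask &= frame[column] <= upper
    return frame[mask]


# Made-up placeholder rows; "id-d" has no band gap value.
toy = pd.DataFrame(
    {
        "material_id": ["id-a", "id-b", "id-c", "id-d"],
        "band_gap": [0.0, 6.0, 7.4, None],
    }
)

wide_gap = filter_range(toy, "band_gap", (6, None))
narrow_gap = filter_range(toy, "band_gap", (None, 1))
print(wide_gap)
```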

Get binary compounds with bandgaps larger than 6.0 eV

In [None]:
docs = mpr.materials.summary.search(chemsys="*-*", band_gap=(6, None), fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_3_2")
df = df_full.loc[:, ["formula_pretty", "material_id", "band_gap"]]
df.head()

Get ternary lithium oxide compounds with bandgaps larger than 6.0 eV

In [None]:
docs = mpr.materials.summary.search(num_elements=3, elements=["Li", "O"], band_gap=(6,None), fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_3_3")
df = df_full.loc[:, ["formula_pretty", "material_id", "band_gap"]]
df.head()

#### Example 4: Get the bulk moduli of the ternary lithium oxides that have the "elasticity" property

In [None]:
docs = mpr.materials.summary.search(chemsys="Li-*-O", has_props=["elasticity"], fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_4")
df = df_full.loc[:, ["formula_pretty", "material_id", "bulk_modulus"]]
df.head()

In [None]:
## Inspecting the dataframe to see whether any data is missing
df

In [None]:
df.describe()

In [None]:
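## Drop the rows where at least one element is missing.
## Resetting the index keeps the rows aligned with the json_normalize output that is joined in later cells.
df_clean = df.dropna().reset_index(drop=True)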
df_clean.describe()
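
Before dropping rows, it can help to count how many values are missing in each column, and to drop only on the columns that matter. A minimal sketch with made-up placeholder values:

```python
import pandas as pd

# Made-up placeholder rows with gaps in two different columns.
toy = pd.DataFrame(
    {
        "material_id": ["id-a", "id-b", "id-c"],
        "energy_above_hull": [0.0, None, 0.02],
        "band_gap": [1.0, 2.0, None],
    }
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# dropna() removes a row if any column is missing.
strict = toy.dropna()

# subset= restricts the check to the listed columns.
targeted = toy.dropna(subset=["energy_above_hull"])

print(missing_per_column)
```

Restricting the check with `subset` keeps rows that are complete for the property of interest even if an unrelated column is empty.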

In [None]:
## Expanding the dictionary-type data into new columns
## Voigt value: Upper bound of bulk modulus, Reuss value: Lower bound of bulk modulus, VRH: Average of two values
df_expanded = pd.json_normalize(df_clean['bulk_modulus'])
df_expanded.head()

In [None]:
df_drop = df_clean.drop(columns=['bulk_modulus'])
df_drop.head(10)

In [None]:
df_join = df_drop.join(df_expanded)
sdf = df_join.sort_values(by="vrh", axis=0)
sdf.head(10)

In [None]:
## Simple code for same procedure
df_expanded = pd.json_normalize(df_clean['bulk_modulus'])
sdf = df_clean.drop(columns=['bulk_modulus']).join(df_expanded).sort_values(by="vrh", axis=0)
sdf.head(10)
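
One detail worth checking in this procedure is index alignment: `json_normalize` on a list of dictionaries returns a frame indexed 0 to n-1, and `join` matches rows by index, so the left frame needs the same index. The sketch below uses made-up placeholder values and assumes the dictionary keys are `voigt`, `reuss` and `vrh`.

```python
import pandas as pd

# Made-up placeholder rows; "id-b" has no elasticity data.
toy = pd.DataFrame(
    {
        "material_id": ["id-a", "id-b", "id-c"],
        "bulk_modulus": [
            {"voigt": 100.0, "reuss": 90.0, "vrh": 95.0},
            None,
            {"voigt": 60.0, "reuss": 50.0, "vrh": 55.0},
        ],
    }
)

# After dropna the surviving labels are 0 and 2; resetting gives 0 and 1,
# which matches the fresh index of the normalized frame.
clean = toy.dropna().reset_index(drop=True)
expanded = pd.json_normalize(clean["bulk_modulus"].tolist())
joined = clean.drop(columns=["bulk_modulus"]).join(expanded)

# The VRH value is the average of the Voigt and Reuss bounds.
hill_average = (joined["voigt"] + joined["reuss"]) / 2

print(joined)
```

Without the index reset, the row labelled 2 would find no partner in the normalized frame and its new columns would be filled with NaN.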

Now let us do a more sophisticated query and ask for more properties such as "bandstructure" and "phase diagram".

Let's look at the band structure of some of the stable compounds that contain Pb and Te, which are interesting for thermoelectric applications:

In [None]:
## Querying band structure data for the stable compounds that contain Pb and Te
docs = mpr.materials.summary.search(elements=["Pb", "Te"], energy_above_hull = (0,1e-6), fields=list_of_available_fields)
results = [doc.dict() for doc in docs]
df_full = pd.DataFrame(results)
# df_full = pd.read_excel("/content/Hands_on_session1_data.xlsx", sheet_name="example_advanced")
df = df_full.loc[:, ["formula_pretty", "material_id", "energy_above_hull", "bandstructure", "dos"]]
df.head()

In [None]:
## Loading modules for plotting band structures
from pymatgen.electronic_structure.bandstructure import BandStructureSymmLine
from pymatgen.electronic_structure.plotter import BSPlotter
import matplotlib.pyplot as plt

In [None]:
## Querying the band structures of specific material with its MP ID (mp-20740)
band_structure = mpr.get_bandstructure_by_material_id("mp-20740")

In [None]:
## Using the BSPlotter class to plot its band structure
plotter = BSPlotter(band_structure)
plot = plotter.get_plot()

In [None]:
## More general procedure
## Check that the band structure is an instance of the BandStructureSymmLine class.
## If it is, the code for plotting the band structure proceeds.
band_structure = mpr.get_bandstructure_by_material_id("mp-560090")
if isinstance(band_structure, BandStructureSymmLine):
    plotter = BSPlotter(band_structure)
    plot = plotter.get_plot()
##    plt.show()
else:
    print("The band structure is not of type BandStructureSymmLine.")

Let's look at the phase diagram for Ni-Co-Mn

(This code was written with reference to the Jupyter notebook by Materials Virtual Lab(https://matgenb.materialsvirtuallab.org/))

In [None]:
## Loading functions for generating and plotting phase diagram from phase_diagram module of Pymatgen's analysis
from pymatgen.analysis.phase_diagram import PhaseDiagram, PDPlotter

In [None]:
# Querying all compounds in the Ni-Co-Mn compositional space using the "get_entries_in_chemsys" function
entries = mpr.get_entries_in_chemsys(elements=["Ni", "Co", "Mn"], additional_criteria={"thermo_types": ["GGA_GGA+U"]})

In [None]:
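# Construct phase diagram
# (named phase_diagram rather than pd so that the pandas alias "pd" is not overwritten)
phase_diagram = PhaseDiagram(entries)

In [None]:
# Plot phase diagram
plotter = PDPlotter(phase_diagram, backend="matplotlib")
plotter.show()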