# Foundry

Foundry is an easy-to-use API that gives the user access to a range of materials science datasets. The data can be loaded efficiently and with little setup. Like the deepchem_pubchempy and MP_API notebooks, this notebook focuses on showing how to access and explore the datasets.

## Setup

### Imports

In [52]:
from foundry import Foundry
import pandas as pd

## Data Loading 

First, we need to create an instance of Foundry to use with the API. Passing use_globus=False means that we won't use the Globus integration that Foundry offers. The integration is optional, and this notebook does not use it.

In [53]:
f = Foundry(use_globus=False)

Next, we can load our dataset. There are a few ways to find one. First, you can use f.list() to list all the available datasets. The other option is to browse the website at https://foundry-ml.org/#/datasets or https://www.materialsdatafacility.org/portal, which has a very nice UI for finding them.

Once you know which dataset you want, you can either copy and paste the DOI from the website or search for the dataset within Python.

If you do not know the DOI of the dataset, you can find it by searching for the dataset's name, as shown. For this notebook we will use the 'Predicting the thermodynamic stability of perovskite oxides using machine learning models' dataset.

In [54]:
datasets = f.search('Predicting the')
datasets

Unnamed: 0,dataset_name,title,year,DOI
0,perovskite_stability_v1.1,Predicting the thermodynamic stability of pero...,root=2022,10.18126/qe5y-2dnz


A search does not always return a single dataset; it returns a table with information about every dataset that matches the query. Most importantly, this table contains both the source_id (the dataset_name column) and the DOI for the dataset we want. We can use either of these to load the data. Alternatively, you can index the FoundryDataset object directly out of the datasets variable.

In [55]:
# loading with the source_id
data = f.get_dataset('perovskite_stability_v1.1')

In [56]:
# loading with the DOI
data = f.get_dataset('10.18126/qe5y-2dnz')

In [57]:
# straight indexing
data = datasets.iloc[0].FoundryDataset

After loading the dataset object, we have to assign its contents to variables, which triggers the download. As a warning, some of these datasets are quite large (300 MB+), so it's worth checking the dataset's size on the website before downloading it. This dataset is only 8.29 MB.

In [58]:
X_mp, y_mp = data.get_as_dict()['train']

Now that we've loaded our data, we can inspect it to see what it contains.

In [59]:
X_mp.describe()

Unnamed: 0,Material Composition
count,1929
unique,1929
top,Ba1Sr7V8O24
freq,1


This dataset contains only one input value (the formula), but we can featurize it to get more inputs to train on. This is a very simple dataset (one input, one output), but the available datasets can get quite large.
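To make "featurizing" a formula concrete, here is a minimal sketch that turns a flat composition string such as the one shown above into atomic fractions per element. The helper name element_fractions is hypothetical and not part of Foundry, and the parser assumes formulas without parentheses or charges. Dedicated featurization libraries offer much richer descriptors; this only illustrates the idea.

```python
import re
from collections import defaultdict

# Matches an element symbol followed by an optional count, e.g. "Ba1", "O24", or "V".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*(?:\.\d+)?)")


def element_fractions(formula):
    """Return {element: atomic fraction} for a flat formula such as 'Ba1Sr7V8O24'.

    Hypothetical helper for illustration; it does not handle parentheses.
    """
    counts = defaultdict(float)
    for symbol, amount in _TOKEN.findall(formula):
        # A missing count means a single atom of that element.
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


fractions = element_fractions("Ba1Sr7V8O24")
print(fractions)
```

Applying a function like this to every entry in the Material Composition column would give one numeric column per element, which a model can train on.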

## Try It Yourself!

- Use the Foundry API to grab the 'Charting the complete elastic properties of inorganic crystalline compounds' dataset
- Load the data and inspect what it contains
- Featurize the formula column and create a dataframe with those features, nsites, space group, and volume
- Assign the target variable to be the average bulk modulus (K_Voigt)
- Create train/test splits, standardize the data, and train a random forest model predicting the average bulk modulus
- Score it using mean squared error, mean absolute error, and R2
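
As a reminder of what the three scores in the last step measure, the sketch below computes them by hand on a few made-up numbers. The values are invented for illustration and are not taken from any Foundry dataset, and regression_scores is a hypothetical helper. In the exercise itself you would more likely use a library's built-in metric functions.

```python
def regression_scores(y_true, y_pred):
    """Compute MSE, MAE, and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n        # mean squared error
    mae = sum(abs(e) for e in errors) / n       # mean absolute error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)         # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                  # coefficient of determination
    return {"mse": mse, "mae": mae, "r2": r2}


# Made-up target values and predictions, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
scores = regression_scores(y_true, y_pred)
print(scores)
```

Lower MSE and MAE are better, while R2 approaches 1 as the predictions explain more of the variance in the target.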