# Create Training Sets for Chemical Interpolation Test
The goal of this test is to determine whether machine learning models can infer the interactions between elements when those interactions are absent from the training set. Specifically, we exclude a single quaternary system (together with all of its constituent subsystems) from the OQMD dataset, train a model on the remaining data, and then evaluate the performance of that model on the excluded data. In this notebook, we identify which systems have the most entries and are therefore the most interesting to study, and output their data in a format compatible with Magpie.

In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from pymatgen.core import Composition  # recent pymatgen releases no longer expose Composition at the package root
from itertools import product
import pandas as pd
import os
import shutil

## Read in the OQMD dataset
We want only the lowest-energy entry at each composition

In [2]:
oqmd_data = pd.read_csv(os.path.join('..', 'oqmd_all.txt'), sep=r'\s+')  # whitespace-delimited; delim_whitespace is deprecated in recent pandas
print('Read %d entries'%len(oqmd_data))
oqmd_data.head()

Read 506114 entries




Unnamed: 0,comp,energy_pa,volume_pa,magmom_pa,bandgap,delta_e,stability
0,Li1,-1.892,17.8351,,0.0,0.015186,0.0151862666667
1,Mg1,-1.5396,22.9639,,0.0,0.002912,0.0029123775
2,Kr1,0.011256,41.4146,,7.367,0.015315,0.015314775
3,Na1,-1.2991,32.9826,,0.0,0.00378,0.00377956333333
4,Pd1,-5.15853,15.2088,,0.0,0.018186,0.0181856433333


Make every column except comp numeric; values that cannot be parsed become NaN

In [3]:
for col in oqmd_data.columns:
    if col == 'comp': continue
    oqmd_data[col] = pd.to_numeric(oqmd_data[col], errors='coerce')

Eliminate entries with unphysical formation enthalpies, keeping only entries with delta_e strictly between -20 and 5

In [4]:
oqmd_data.query('delta_e > -20 and delta_e < 5', inplace=True)
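The two cleaning steps above can be illustrated on a toy frame. This is only a sketch: the compositions and values are hypothetical, not OQMD data, and only `pd.to_numeric` and `DataFrame.query` are used as in the cells above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the OQMD)
toy = pd.DataFrame({
    'comp': ['A1', 'B1', 'C1', 'D1'],
    'delta_e': ['-0.5', 'None', '12.0', '0.1'],
})

# Unparseable strings such as 'None' become NaN instead of raising an error
toy['delta_e'] = pd.to_numeric(toy['delta_e'], errors='coerce')

# NaN fails both comparisons, so the range filter also drops the coerced row
cleaned = toy.query('delta_e > -20 and delta_e < 5')
print(cleaned['comp'].tolist())  # ['A1', 'D1']
```

Because NaN never satisfies a comparison, the range filter doubles as a missing-value filter for delta_e.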

Generate the composition object of each entry

In [5]:
oqmd_data['comp_obj'] = oqmd_data['comp'].apply(lambda x: Composition(x))

In [6]:
oqmd_data['pretty_comp'] = oqmd_data['comp_obj'].apply(lambda x: x.reduced_formula)



Get only the lowest-energy entry at each composition

In [7]:
oqmd_data.sort_values('delta_e', ascending=True, inplace=True)
oqmd_data.drop_duplicates('pretty_comp', keep='first', inplace=True)
print('Reduced dataset to %d entries'%len(oqmd_data))

Reduced dataset to 275701 entries
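As a quick sanity check of the sort-then-deduplicate idiom, the sketch below applies it to a toy frame (hypothetical formulas and energies) and compares the result with a groupby-based minimum, which should pick the same energies.

```python
import pandas as pd

# Toy data (hypothetical values): two entries each for NaCl and MgO
toy = pd.DataFrame({
    'pretty_comp': ['NaCl', 'NaCl', 'MgO', 'MgO', 'Fe'],
    'delta_e': [-2.0, -2.1, -3.0, -2.5, 0.0],
})

# Sort so the most negative delta_e comes first, then keep the first row per formula
lowest = (toy.sort_values('delta_e', ascending=True)
             .drop_duplicates('pretty_comp', keep='first'))

# Cross-check against the per-formula minimum
by_sort = lowest.set_index('pretty_comp')['delta_e'].to_dict()
by_group = toy.groupby('pretty_comp')['delta_e'].min().to_dict()
print(by_sort == by_group)  # True
```

The sort-based version keeps the whole row of the lowest-energy entry, whereas the groupby minimum returns only the energy, which is presumably why the notebook uses the former.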


## Identify the systems with large numbers of entries
We want to find a system with a large amount of testing data

In [8]:
oqmd_data['nelems'] = oqmd_data['comp_obj'].apply(lambda x: len(x))

In [9]:
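# Sort the symbols so the label does not depend on element order and matches the sorted keys built in get_all_data
oqmd_data['system'] = oqmd_data['comp_obj'].apply(lambda x: "-".join(sorted(y.symbol for y in x)))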
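To make the system label concrete without requiring pymatgen, here is a minimal stand-in. The helper `system_key` is hypothetical and uses a simple regular expression instead of `Composition`; it assumes formulas written like the comp column (element symbols followed by integer amounts).

```python
import re

def system_key(formula):
    """Return a sorted, hyphen-joined element list, e.g. 'Fe1Na5O4' -> 'Fe-Na-O'."""
    # An element symbol is one uppercase letter optionally followed by one lowercase letter
    symbols = re.findall(r'[A-Z][a-z]?', formula)
    # Sorting makes the key independent of the order of elements in the formula
    return '-'.join(sorted(set(symbols)))

print(system_key('Fe1Na5O4'))  # Fe-Na-O
print(system_key('Mn1Na5O4'))  # Mn-Na-O
```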

Get the top-10 most frequent systems

In [10]:
oqmd_data['system'].value_counts()[:10]

Mn-Na-O    20
O-Ti       18
O-V        18
Li-O-V     17
Fe-Na-O    17
H-O-V      17
Li-Mn-O    16
C-H-N-O    16
Al-Mg      16
Na-O-V     16
Name: system, dtype: int64

*Finding*: Mn-Na-O (20 entries) and Fe-Na-O (17 entries) are among the most common ternaries, so we choose the Na-Fe-Mn-O quaternary as the hold-out system.

In [11]:
my_system = ["Na", "Fe", "Mn", "O"]

In [12]:
def get_all_data(elems):
    """Get the data that is in any of the phase diagrams that are subsets of a certain system
    
    Ex: For Na-Fe-O, these are Na-Fe-O, Na-Fe, Na-O, Fe-O, Na, Fe, O
    
    :param elems: iterable of strs, phase diagram of interest
    :return: subset of OQMD in the constituent systems"""
    
    # Generate the constituent systems
    systems = set()
    for comb in product(*[elems,]*len(elems)):
        sys = "-".join(sorted(set(comb)))
        systems.add(sys)
    
    # Query for the data
    return oqmd_data.query(' or '.join('system=="%s"'%s for s in systems))
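The constituent systems are the non-empty subsets of the element list, so an N-element system has 2^N - 1 of them. The sketch below enumerates them with `itertools.combinations` and checks that the result matches the product-based enumeration used in get_all_data. The function names are illustrative and not part of the notebook.

```python
from itertools import combinations, product

def constituent_systems(elems):
    """All non-empty subsets of elems, written as sorted hyphen-joined strings."""
    elems = sorted(set(elems))
    return {
        '-'.join(subset)
        for size in range(1, len(elems) + 1)
        for subset in combinations(elems, size)
    }

def constituent_systems_product(elems):
    """The product-based enumeration from get_all_data, reproduced for comparison."""
    return {'-'.join(sorted(set(comb))) for comb in product(*[elems] * len(elems))}

systems = constituent_systems(['Na', 'Fe', 'Mn', 'O'])
print(len(systems))  # 15
print(systems == constituent_systems_product(['Na', 'Fe', 'Mn', 'O']))  # True
```

Both approaches give the same set; `combinations` visits each subset once, while `product` visits N^N tuples and relies on the set to discard repeats.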

In [13]:
test_set = get_all_data(my_system)
print('Gathered a test set with %d entries'%len(test_set))
test_set.sample(5)

Gathered a test set with 96 entries


Unnamed: 0,comp,energy_pa,volume_pa,magmom_pa,bandgap,delta_e,stability,comp_obj,pretty_comp,nelems,system
395430,Fe1Na5O4,-4.664897,14.0727,0.499999,2.06,-1.652827,0.009442,"(Fe, Na, O)",Na5FeO4,3,Fe-Na-O
332987,Mn1Na5O4,-4.771923,13.4157,0.40003,1.555,-1.669564,-0.000754,"(Mn, Na, O)",Na5MnO4,3,Mn-Na-O
312050,Fe1Na4O3,-4.625395,14.6756,0.500132,1.435,-1.573075,-0.008985,"(Fe, Na, O)",Na4FeO3,3,Fe-Na-O
338574,Fe3Mn1,-8.475788,10.5063,0.131962,0.0,0.011862,0.011862,"(Fe, Mn)",MnFe3,2,Fe-Mn
341790,Fe1Na3,-2.403104,25.1112,0.731388,0.0,0.583714,0.583714,"(Fe, Na)",Na3Fe,2,Fe-Na


Remove these entries from the full dataset to form the training set

In [14]:
train_set = oqmd_data.loc[oqmd_data.index.difference(test_set.index)]
print('Training set size is %d entries'%len(train_set))

Training set size is 275605 entries
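A hold-out test is only meaningful if the two sets do not overlap. The following sketch repeats the index-difference split on a toy frame (hypothetical rows) and checks that the subsets are disjoint and together cover the full data. It assumes a unique index, as a default `read_csv` index would be.

```python
import pandas as pd

# Toy data (hypothetical rows) with a unique integer index
full = pd.DataFrame({'system': ['Fe-Na-O', 'Li-O', 'Fe-Mn', 'Al-Mg', 'O']},
                    index=[10, 11, 12, 13, 14])
held_out = {'Fe-Na-O', 'Fe-Mn', 'O'}

test = full[full['system'].isin(held_out)]
train = full.loc[full.index.difference(test.index)]

# The subsets should share no rows and should together cover the full frame
no_overlap = len(train.index.intersection(test.index)) == 0
full_coverage = len(train) + len(test) == len(full)
print(no_overlap, full_coverage)  # True True
```

The same arithmetic holds for the sizes reported above: 275701 - 96 = 275605.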


## Save the data in Magpie-friendly format
We will be using Magpie to generate features

In [15]:
def save_magpie(data, path):
    """Save a dataframe in a magpie-friendly format
    
    :param data: pd.DataFrame, data to be saved
    :param path: str, output path"""
    
    data[['comp','delta_e']].to_csv(path, index=False, sep=' ')

In [16]:
os.makedirs('datasets', exist_ok=True)  # make sure the output directory exists before writing
save_magpie(test_set, os.path.join('datasets', '%s_test_set.data'%(''.join(my_system))))

In [17]:
save_magpie(train_set, os.path.join('datasets', '%s_train_set.data'%(''.join(my_system))))
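To show what the saved files look like, the sketch below writes a toy frame (hypothetical rows) to an in-memory buffer with the same `to_csv` call as save_magpie and reads it back. The resulting layout is a header line followed by one space-separated composition and delta_e per row.

```python
import io
import pandas as pd

# Toy data (hypothetical rows); bandgap is included to show that extra columns are dropped
toy = pd.DataFrame({
    'comp': ['Fe1Na5O4', 'Fe3Mn1'],
    'delta_e': [-1.652827, 0.011862],
    'bandgap': [2.06, 0.0],
})

# Same call as in save_magpie, but writing to memory instead of a file
buffer = io.StringIO()
toy[['comp', 'delta_e']].to_csv(buffer, index=False, sep=' ')
lines = buffer.getvalue().splitlines()
print(lines)

# Read the text back to confirm that only the two selected columns were written
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep=' ')
print(list(round_trip.columns))  # ['comp', 'delta_e']
```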