# Find Interesting Predictions
Out of the $500M$ predictions, some are more interesting than others. This notebook keeps the most stable prediction per chemical system and then screens those by chemistry.

In [1]:
%matplotlib inline
from pymatgen.core import Composition, Element
from pymatgen.analysis.hhi.hhi import HHIModel
from pymatgen.util.string import latexify
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import silhouette_score
import itertools
import os
import re
import pandas as pd
import numpy as np

## Load in the Data
Load in the deep learning predictions. 

In [2]:
%%time
def load_DL_predictions(path):
    """Loads in the predictions from Dipendra, and renames the `delta_e_predicted` column to `delta_e` to match the `oqmd_data`
    
    Also generates a `Composition` object for each entry, stored in the `comp_obj` column
    """
    output = pd.read_csv(path, sep=' ')
    output.rename(columns={'delta_e_predicted': 'delta_e'}, inplace=True)
    output['comp_obj'] = output['composition'].apply(lambda x: Composition(x))
    return output
dl_predictions = dict([(x, load_DL_predictions(os.path.join('new-datasets', '%s_stable-0.2.data.gz'%x)))
     for x in ['binary', 'ternary', 'quaternary']
     ])

CPU times: user 9min 3s, sys: 4.17 s, total: 9min 8s
Wall time: 9min 8s


## Define Utility Operations
These will be useful for finding which compounds to evaluate

In [3]:
elem_re = re.compile('[A-Z][a-z]?')
def get_elems(s):
    return ''.join(sorted(set(elem_re.findall(s))))
assert get_elems('AlFeFe2') == 'AlFe'
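
To illustrate what `get_elems` returns, the sketch below re-declares the same regular expression and applies it to a few formulas. The example formulas are illustrative and not taken from the datasets. Note that the pattern assumes well-formed element symbols and does not check them against the periodic table.

```python
import re

# Same pattern as above: one capital letter, optionally followed by one lowercase letter
elem_re = re.compile('[A-Z][a-z]?')

def get_elems(s):
    """Return the sorted, de-duplicated element symbols of a formula as one string."""
    return ''.join(sorted(set(elem_re.findall(s))))

# Stoichiometric coefficients are ignored; only the symbols survive
examples = ['NaCl', 'Fe2O3', 'K2W2N5', 'ClNa']
systems = [get_elems(f) for f in examples]
print(systems)
```

Because the symbols are sorted, `NaCl` and `ClNa` map to the same system label, which is what makes the label usable as a grouping key.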

In [4]:
%%time
for data in dl_predictions.values():
    data['system'] = data['composition'].apply(get_elems)

CPU times: user 30.8 s, sys: 273 ms, total: 31 s
Wall time: 31.1 s


## Get the Single Most-Stable Entry per System
Keep only the single most-stable entry per system, which makes the later searches faster

In [5]:
%%time
def get_most_stable(data):
    """From a dataset, get only the most-stable entry for each system
    
    :param data: DataFrame, DL predictions with `system` and `stability_predicted` columns
    :return: DataFrame, one row per system"""
    
    return data.sort_values('stability_predicted', ascending=True).drop_duplicates('system', keep='first')
dl_best = dict((k,get_most_stable(v)) for k,v in dl_predictions.items())

CPU times: user 3.56 s, sys: 381 ms, total: 3.94 s
Wall time: 3.94 s
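
The sort-then-deduplicate idiom used in `get_most_stable` can be checked on a small table. The sketch below uses made-up compositions and stability values; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data: column names mirror the notebook, values are made up for illustration
toy = pd.DataFrame({
    'composition': ['NaCl', 'Na2Cl', 'FeO', 'Fe2O3'],
    'system': ['ClNa', 'ClNa', 'FeO', 'FeO'],
    'stability_predicted': [-0.10, 0.05, 0.02, -0.30],
})

# Sort so the lowest (most stable) value comes first, then keep the first row per system
best = toy.sort_values('stability_predicted', ascending=True).drop_duplicates('system', keep='first')
print(best)
```

The result stays sorted by stability, so calling `head(n)` on it later returns the `n` most stable entries.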


In [6]:
for k,v in dl_best.items():
    print(k, len(v))
print('total', sum([len(v) for v in dl_best.values()]))

binary 502
ternary 22796
quaternary 551340
total 574638


Convert each composition to an integer formula so that it renders cleanly

In [7]:
%%time
for data in dl_best.values():
    data['composition'] = data['comp_obj'].apply(lambda x: x.get_integer_formula_and_factor()[0])

CPU times: user 1min 4s, sys: 53.4 ms, total: 1min 4s
Wall time: 1min 4s


## Get Predictions for Different Sets
This part of the notebook picks out different classes of compounds from the most-stable predictions

### Defining Element Lists
Useful when coming up with search spaces later

In [8]:
noble_gases = ['He', 'Ne', 'Ar', 'Kr', 'Xe']
alkali_metals = ['Li', 'Na', 'K'] # , 'Rb', 'Cs'] - Only do the common ones
threed_tms = ['Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn']
actinides = ['Ac', 'Th', 'Pa', 'U', 'Np', 'Pu'] # VASP only has these
lanthanides = set([Element.from_Z(x).symbol for x in range(57, 72)])
chalcogens = ['O', 'S', 'Se', 'Te']
pnictides = ['N', 'P', 'As', 'Sb']
halogens = ['F', 'Cl', 'Br', 'I']
tms = set([Element.from_Z(x).symbol for x in range(1,102) if Element.from_Z(x).is_transition_metal > 0])
metals = tms.union({'Li','Na','K'}).union({'Al', 'Ga', 'In', 'Sn', 'Pb', 'Bi'})
metals_no_highHHI = set([e for e in metals if HHIModel().get_hhi_production(e) is not None
                         and  HHIModel().get_hhi_production(e) < 5000])

Assemble a list of all elements found in our datasets

In [9]:
element_list = set()
dl_predictions['ternary']['composition'].apply(lambda x: element_list.update(elem_re.findall(x)))
print('Number of elements:', len(element_list))

Number of elements: 89


Remove noble gases, lanthanides, and actinides

In [10]:
element_list.difference_update(noble_gases)
element_list.difference_update(actinides)
element_list.difference_update(lanthanides)
print('Number of elements:', len(element_list))

Number of elements: 63


### Scanning Different Sets

In [11]:
def assemble_list_of_systems(order):
    """Create a DataFrame of all possible systems with a certain number of elements"""
    output = pd.DataFrame()
    output['elements'] = list(itertools.combinations(element_list, order))
    output['system'] = [''.join(sorted(s)) for s in output['elements']]
    return output
binary_systems = assemble_list_of_systems(2)
print('Generated %d binary systems'%len(binary_systems))

Generated 1953 binary systems


In [12]:
ternary_systems = assemble_list_of_systems(3)
print('Generated %d ternary systems'%len(ternary_systems))

Generated 39711 ternary systems


In [13]:
quaternary_systems = assemble_list_of_systems(4)
print('Generated %d quaternary systems'%len(quaternary_systems))

Generated 595665 quaternary systems
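
As a sanity check, the three system counts above should equal the binomial coefficients for choosing 2, 3, or 4 elements out of the 63 that remain in `element_list`. A minimal sketch using only the standard library (requires Python 3.8+ for `math.comb`):

```python
from math import comb

n_elements = 63  # number of elements left after the exclusions above

# Number of unordered element combinations of each order
counts = {order: comb(n_elements, order) for order in (2, 3, 4)}
print(counts)
```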


Define a filtering routine, then use it to find the systems that contain a common alkali metal

In [14]:
def run_filter(f, systems, compounds, ntop=2):
    """Report the number of "best" compounds that pass a certain filter
    
    :param f: func, filter to run on systems
    :param systems: DataFrame, list of systems to be evaluated
    :param compounds: DataFrame, list of compounds to screen
    :param ntop: int, number of top compositions to print"""
    possible_systems = set(systems[systems['elements'].apply(f)]['system'])
    results = compounds[compounds['system'].apply(lambda x: x in possible_systems)]
    top = ' '.join([latexify(x) for x in results.head(ntop)['composition']])
    print('Found %d matches. Top %d: %s'%(len(results), ntop, top))
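
The filtering logic can be tried on a miniature search space. The sketch below is a variant, not the notebook's function: `filter_compounds` is a hypothetical helper that returns the matching rows instead of printing them, and it uses `Series.isin` for the membership test. The element list and compounds are made up.

```python
import itertools
import pandas as pd

# Hypothetical miniature search space; the notebook uses the 63-element list instead
toy_elements = ['Cl', 'Fe', 'Na', 'O']
systems = pd.DataFrame({'elements': list(itertools.combinations(toy_elements, 2))})
systems['system'] = [''.join(sorted(s)) for s in systems['elements']]

# Made-up compounds, assumed to be sorted from most to least stable
compounds = pd.DataFrame({
    'composition': ['Fe2O3', 'NaCl', 'Na2O', 'FeCl3'],
    'system': ['FeO', 'ClNa', 'NaO', 'ClFe'],
})

def filter_compounds(f, systems, compounds):
    """Return the compounds whose system passes the filter `f` (hypothetical helper)."""
    possible_systems = set(systems.loc[systems['elements'].apply(f), 'system'])
    return compounds[compounds['system'].isin(possible_systems)]

contains_na = lambda els: 'Na' in els
matches = filter_compounds(contains_na, systems, compounds)
print(matches['composition'].tolist())
```

The filter is applied to the tuple of elements for each system rather than to each compound, so it runs once per system and the compounds are then matched by their system label.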

#### [Li,K,Na]-Containing Compounds

In [15]:
f = lambda els: any([e in ['Li', 'Na', 'K'] for e in els])

In [16]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 814 matches. Top 2: KSc$_{2}$Br$_{7}$ KHfBr$_{5}$
Found 21375 matches. Top 2: CsNa$_{2}$CdF$_{4}$ Na$_{2}$CrPbF$_{5}$


#### [Li,K,Na]-Containing Compounds w/o Halogens

In [17]:
f = lambda els: any([e in ['Li', 'Na', 'K'] for e in els]) and not any([e in halogens for e in els])

In [18]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 457 matches. Top 2: K$_{2}$W$_{2}$N$_{5}$ LiTi$_{4}$N$_{5}$
Found 10959 matches. Top 2: Ba$_{3}$NaPtO$_{4}$ K$_{2}$P(WN$_{2}$)$_{2}$


#### Chalcohalides

In [19]:
f = lambda els: any([e in chalcogens for e in els]) and any([e in halogens for e in els])

In [20]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 578 matches. Top 2: Sc$_{2}$SeBr$_{5}$ Sc$_{3}$SBr$_{6}$
Found 18835 matches. Top 2: Sr$_{3}$Cu$_{2}$IO$_{4}$ Zr$_{6}$RhIO$_{2}$


#### Oxides

In [21]:
f = lambda els: any([e == 'O' for e in els])

In [22]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 817 matches. Top 2: Hf$_{2}$Br$_{6}$O Sc$_{3}$Br$_{6}$O
Found 19113 matches. Top 2: Sr$_{3}$Cu$_{2}$IO$_{4}$ Zr$_{6}$RhIO$_{2}$


#### Metal Oxides

In [23]:
f = lambda els: any([e == 'O' for e in els]) and sum([e in metals for e in els]) == (len(els) - 1)

In [24]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 242 matches. Top 2: K$_{2}$OsO$_{5}$ AgRuO$_{3}$
Found 3155 matches. Top 2: YAlV$_{2}$O$_{6}$ TiMnSnO$_{5}$
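
The metal-oxide filter relies on a counting argument: oxygen is not in the metal list, so requiring `len(els) - 1` metals alongside oxygen means every non-oxygen element is a metal. The sketch below checks this on a few systems, using a shortened, hypothetical metal set in place of the `metals` set built above.

```python
# Hypothetical, shortened metal list for illustration; the notebook builds `metals` with pymatgen
toy_metals = {'Al', 'Fe', 'K', 'Ti'}

def is_metal_oxide(els):
    """True if the system contains O and every other element is a metal."""
    return any(e == 'O' for e in els) and sum(e in toy_metals for e in els) == len(els) - 1

results = {
    ('Fe', 'O', 'Ti'): is_metal_oxide(('Fe', 'O', 'Ti')),    # O plus two metals
    ('Cl', 'Fe', 'O'): is_metal_oxide(('Cl', 'Fe', 'O')),    # Cl is not a metal
    ('Al', 'Fe', 'Ti'): is_metal_oxide(('Al', 'Fe', 'Ti')),  # no oxygen
}
print(results)
```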


#### $3d$ Metal Oxides

In [25]:
f = lambda els: any([e == 'O' for e in els]) and sum([e in threed_tms for e in els]) == (len(els) - 1)

In [26]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 8 matches. Top 2: TiCrO$_{3}$ TiMnO$_{3}$
Found 4 matches. Top 2: Ti$_{2}$MnCrO$_{6}$ ScTiCr$_{2}$O$_{6}$


#### Intermetallics

In [27]:
f = lambda els: all([e in metals for e in els])

In [28]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 152 matches. Top 2: HfAl$_{5}$Ir$_{3}$ YAl$_{4}$Ir$_{3}$
Found 462 matches. Top 2: Sc$_{5}$NiSn$_{3}$Mo ZrAl$_{5}$OsRh


#### Intermetallics (No high HHI)

In [29]:
f = lambda els: all([e in metals_no_highHHI for e in els])

In [30]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 23 matches. Top 2: TiAl$_{5}$Rh$_{4}$ ZrAl$_{5}$Rh$_{4}$
Found 60 matches. Top 2: TiAl$_{5}$NiRh$_{3}$ Al$_{6}$CrCoRh$_{2}$


#### Intermetallics (No high HHI) w/ at least 2 $3d$ metals

In [31]:
f = lambda els: all([e in metals_no_highHHI for e in els]) and sum([e in threed_tms for e in els]) > 1

In [32]:
run_filter(f, ternary_systems, dl_best['ternary'])
run_filter(f, quaternary_systems, dl_best['quaternary'])

Found 1 matches. Top 2: Ti$_{2}$In$_{3}$Ni$_{4}$
Found 15 matches. Top 2: TiAl$_{5}$NiRh$_{3}$ Al$_{6}$CrCoRh$_{2}$
