# Machine Learning Algorithms for Ionic Conductivity of LLZO-type Garnets

### Authorship and credits

<b> nanoHUB tools by: </b>  <i>Juan Carlos Verduzco</i> and <i>Alejandro Strachan</i>, Materials Engineering, Purdue University <br>
<b> Database curated by: </b> <i>Juan Carlos Verduzco</i>, Materials Engineering, Purdue University <br>


## Overview

This notebook contains the additional figures prepared for the publication "An active learning approach for the design of doped LLZO ceramic garnets for battery applications" (submitted).

<br>

**Outline:**

1. Querying / Processing Data <br>
2. Obtaining features/descriptors from Matminer <br>
3. Regression Models <br>
    3.1 Neural Network <br>
    3.2 Random Forests with Sample Uncertainties - Residual Analysis <br>
    3.3 Random Forests with Sample Uncertainties - Model Predictions  <br>
4. Ta Analysis <br>
5. Active Learning Approach <br>
6. Active Learning - 30 Oldest Points <br>
7. Active Learning - 30 Random Points <br>
8. Garnet Predictor for codoped LLZO <br>

Notes: This notebook uses tools from [Citrination](https://citrination.com/) and requires an account with an API key.

## Libraries

This notebook requires several libraries to be installed. They are separated in blocks depending on their usage.

In [None]:
# PLOTTING (MATPLOTLIB)
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
from lolopy.learners import RandomForestRegressor

# PYTHON

import pandas as pd
import numpy as np
import os
import sys
import random

# MACHINE LEARNING
import tensorflow as tf
from tensorflow import keras
from keras import initializers, regularizers
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping
print(keras.__version__)

# CITRINATION / MATMINER

from matminer.data_retrieval.retrieve_Citrine import CitrineDataRetrieval
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from sklearn.model_selection import KFold
from pymatgen.core import Composition  # newer pymatgen releases no longer export Composition at the top level
from scipy.stats import norm

# PLOTTING (PLOTLY)
import plotly 
import plotly.graph_objs as go
from plotly.offline import iplot
plotly.offline.init_notebook_mode(connected=True)


# This snippet reads the Citrine key that was saved from the main page of the tool. If you are running this notebook by itself, please comment it out and write your Citrine key in the cell below.
with open(os.path.expanduser('~/.citrinetools.txt'), "r") as file:
    apikey = file.readline().strip()  # strip the trailing newline, if any

---
## 1. Querying a Database from Citrination

Matminer offers API tools to facilitate querying of databases like the Materials Project and Citrination. An individual **Citrine Key** is required for the query command <i>CitrineDataRetrieval</i>.

Data is stored in a pandas DataFrame, and the list of properties that can be queried is printed by setting the `print_properties_options` parameter to **True**.

In [None]:
cdr = CitrineDataRetrieval(apikey) # Citrine Key

data = cdr.get_dataframe(criteria={'data_set_id': 184812}, print_properties_options=False) # LLZO Database
property_interest = 'Ionic Conductivity' # Property to be queried

display(data.head(n=10))

This is a utility function that will transform the <i>chemicalFormula</i> column into pymatgen `Composition` objects, which Matminer will then use to extract features.

In [None]:
def get_composition(c): # Function to get compositions from chemical formula using pymatgen
    try:
        return Composition(c)
    except Exception:  # formulas that cannot be parsed are mapped to None
        return None

We will use the utility function to transform the <i>chemicalFormula</i> column, and we'll typecast relevant columns into numeric types.
<br>
For this specific application, we'll introduce some filters for the dataframe. We are interested in measurements on cubic structures taken at room temperature (defined as 18°C < T < 30°C).

In [None]:
data['composition'] = data['chemicalFormula'].apply(get_composition) # Transformation of the chemicalFormula string into a pymatgen Composition
data['Measuring Temperature'] = pd.to_numeric(data['Measuring Temperature'], errors='coerce') # Conversion of the Measuring Temperature column from <str> to a numeric type (unparsable entries become NaN)
data[property_interest] = pd.to_numeric(data[property_interest], errors='coerce') # Conversion of our property of interest column from <str> to a numeric type
data["Year Published"] = pd.to_numeric(data["Year Published"], errors='coerce') # Conversion of the Year Published column from <str> to a numeric type

data = data[data['Crystallographic Structure'] == 'Cubic'] # Keep only cubic structures
data = data[data['Measuring Temperature']<30] # Filter out high temperature measurements (above room temperature)
data = data[data['Measuring Temperature']>18] # Filter out low temperature measurements (below room temperature)

data.reset_index(drop=True, inplace=True) # Reindexing of dataframe rows
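
The effect of `errors='coerce'` combined with the temperature window can be seen on a small example. The sketch below uses a made-up table (only the column names are taken from the notebook) and is meant as an illustration of the filtering logic, not as a replacement for the cell above.

```python
import pandas as pd

# Toy table; the column names mirror the notebook, the rows are made up.
toy = pd.DataFrame({
    "Crystallographic Structure": ["Cubic", "Cubic", "Tetragonal", "Cubic", "Cubic"],
    "Measuring Temperature": ["25", "room temperature", "25", "60", "18"],
})

# Unparsable strings become NaN instead of raising an error.
toy["Measuring Temperature"] = pd.to_numeric(toy["Measuring Temperature"], errors="coerce")

temperature = toy["Measuring Temperature"]
is_cubic = toy["Crystallographic Structure"] == "Cubic"
is_room_temperature = (temperature > 18) & (temperature < 30)  # NaN compares as False

filtered = toy[is_cubic & is_room_temperature].reset_index(drop=True)
print(filtered)
```

Only the first row survives: the free-text temperature is coerced to NaN and dropped by the comparison, and the bounds are strict, so a measurement at exactly 18°C is excluded as well.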

Before removing duplicates, we will store all the experimental values for the Ta-doped compositions (together with the four-element compositions that contain Zr and no additional dopant) for later comparison.

In [None]:
ta_indexes = []
for i in range(len(data)):
    comp = data['composition'][i]
    if ("Zr" in comp and "Ta" in comp and len(comp)==5): # Five-element compositions containing Zr and Ta
        ta_indexes.append(i)
    elif ("Zr" in comp and len(comp)==4): # Four-element compositions containing Zr (no additional dopant)
        ta_indexes.append(i)
        
ta_dataframe = data[data.index.isin(ta_indexes)]
#display(ta_dataframe)        
        
x_values_exp = [_["Ta"] for _ in list(ta_dataframe['composition'])]
y_values_exp = list(ta_dataframe['Ionic Conductivity'])

In order to reduce noise in the neural network and deal with the inconsistencies in the data, we will filter repeated composition values from different measurements and replace the value for ionic conductivity with the median of the values. Similar approaches have been implemented in this [paper](https://iopscience.iop.org/article/10.1088/1361-651X/aaf8ca).

In [None]:
dup_indexes = data[data.duplicated(subset = data.columns.tolist()[0], keep=False)].index.tolist()

dup_dataframe =data[data.duplicated(subset = data.columns.tolist()[0], keep=False)]

duplicates = [[dup_dataframe.iloc[x][0], dup_dataframe.iloc[x][4], dup_indexes[x], dup_dataframe.iloc[x][-2]] for x in range(len(dup_dataframe.index))]
duplicate_compositions = {k: [] for k in set([dup_dataframe.iloc[x][0] for x in range(len(dup_dataframe.index))])}
duplicate_indexes = {k: [] for k in set([dup_dataframe.iloc[x][0] for x in range(len(dup_dataframe.index))])}
duplicate_years = {k: [] for k in set([dup_dataframe.iloc[x][0] for x in range(len(dup_dataframe.index))])}

for _ in duplicates:
    duplicate_compositions[_[0]].append(_[1])
    duplicate_indexes[_[0]].append(_[2]) 
    duplicate_years[_[0]].append(_[3])     

for k in duplicate_compositions:
    
    duplicate_compositions[k] = np.median(duplicate_compositions[k])
    data.at[duplicate_indexes[k][0], 'Ionic Conductivity'] = duplicate_compositions[k]  

    duplicate_years[k] = np.min(duplicate_years[k])
    data.at[duplicate_indexes[k][0], 'Year Published'] = duplicate_years[k]      
    data = data.drop(duplicate_indexes[k][1:], axis = 0)

data = data.reset_index()
data = data.drop(['index'], axis = 1)
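
The median replacement can also be written compactly with a pandas `groupby`. The following is a minimal sketch on made-up measurements, assuming only the three columns shown; unlike the cell above, it does not carry over the remaining columns of the first occurrence.

```python
import pandas as pd

# Made-up measurements: the first formula was reported three times.
toy = pd.DataFrame({
    "chemicalFormula": ["Li7La3Zr2O12", "Li7La3Zr2O12", "Li7La3Zr2O12", "Li6.5La3Zr1.5Ta0.5O12"],
    "Ionic Conductivity": [2.0, 3.0, 10.0, 8.0],
    "Year Published": [2011, 2007, 2015, 2013],
})

# One row per formula: median conductivity and earliest publication year.
deduplicated = (
    toy.groupby("chemicalFormula", as_index=False, sort=False)
       .agg({"Ionic Conductivity": "median", "Year Published": "min"})
)
print(deduplicated)
```

The median makes the outlying report of 10.0 irrelevant for the repeated composition, which is the reason for preferring it to the mean when measurements disagree.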

After removing duplicates, we extract the same tantalum compositions again. Their ionic conductivity now holds the median of the experimental values.

In [None]:
ta_indexes = []
for i in range(len(data)):
    comp = data['composition'][i]
    if ("Zr" in comp and "Ta" in comp and len(comp)==5): # Five-element compositions containing Zr and Ta
        ta_indexes.append(i)
    elif ("Zr" in comp and len(comp)==4): # Four-element compositions containing Zr (no additional dopant)
        ta_indexes.append(i)
        
ta_dataframe = data[data.index.isin(ta_indexes)]
#display(ta_dataframe)

x_values_dupmed = [_["Ta"] for _ in list(ta_dataframe['composition'])]
y_values_dupmed = list(ta_dataframe['Ionic Conductivity'])

sort_dupmed = list(zip(x_values_dupmed, y_values_dupmed))
sort_dupmed = sorted(sort_dupmed, key = lambda t: t[0])

x_values_dupmed = [item[0] for item in sort_dupmed ] 
y_values_dupmed = [item[1] for item in sort_dupmed ] 

The next cell produces a breakdown of the number of elements in each oxide composition and a count of how often each element appears in the dataset.

In [None]:
import collections

freq = data["composition"]
list_freq = []

for _ in freq:
    a = [str(x) for x in _]
    list_freq.append(a)
    
list_freq_flat = [item for sublist in list_freq for item in sublist]  
listfreqctr = collections.Counter(list_freq_flat)
print(listfreqctr)
    
lengths = list(map(len,list_freq))
lenctr = collections.Counter(lengths)

print(lenctr)
# print(type(freq[0]))
# print(freq[0])
# print(list(freq[0])[0])

# print(type(list(freq[0])[0]))
# print(str(list(freq[0])[0]))

---

## 2. Matminer Descriptors

In [None]:
f =  MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"), cf.ValenceOrbital(props=['avg']), cf.ElementFraction()]) # Featurizers

X = np.array(f.featurize_many(data['composition'], ignore_errors=True)) # Array to store such features

measuring_temp_array = np.array(data['Measuring Temperature']).reshape(-1,1) # Here we are stacking the Measuring temperature numpy array into the features previously calculated to add it as a descriptor. 
X = np.hstack((X,measuring_temp_array))

y = data[property_interest].values # Separate the value we want to predict to use as labels.
years = data["Year Published"].values

# This code is to drop columns with std = 0. 
x_df = pd.DataFrame(X)
x_df = x_df.loc[:, x_df.std() != 0]
print(x_df.shape) # This shape is (#Entries, #Descriptors per entry)

# Keep an unfiltered copy of the features so the same columns can be dropped later from new feature sets.
x_df_prior = pd.DataFrame(X)
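
The reason for keeping an unfiltered copy is that the decision about which columns to drop has to come from the training data alone. A minimal NumPy sketch with made-up numbers (the array names are illustrative, not the notebook's):

```python
import numpy as np

train_features = np.array([
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 3.5],
    [4.0, 0.0, 1.0],
])
new_features = np.array([
    [3.0, 0.0, 2.0],
    [5.0, 0.0, 2.0],
])

# Mask computed once, on the training features only.
keep = train_features.std(axis=0) != 0

train_reduced = train_features[:, keep]
# Same columns are kept for new data, even though its last column is constant.
new_reduced = new_features[:, keep]

print(keep, train_reduced.shape, new_reduced.shape)
```

Recomputing the mask on the new data would drop a different set of columns and the model would receive descriptors in the wrong positions.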

--- 

## 3. Regression Models

We will start by creating regression models with all these entries and descriptors.

### 3.1 Neural Networks


We set the architecture of the sequential feed-forward neural network we'll test. Weights are initialized with a random normal distribution and biases are initialized at zero.

The model is trained with an early-stopping callback that monitors the validation mean absolute error (MAE). The validation set is 10% of the training data.
<br>
A figure of training MAE vs validation MAE is shown. Overfitting occurs when the validation MAE starts to increase, so training stops once it has not improved for a set number of epochs and we revert the weights to those of the best epoch.

In [None]:
from sklearn.utils import shuffle

# EXTRACTION OF THE DATA FOR THE DESCRIPTORS

all_values = [list(x_df.iloc[x]) for x in range(len(x_df.index))]
all_values = np.array(all_values, dtype = float) 
all_labels = y.copy()

# SPLIT INDICATION FOR TRAIN/TEST SETS
train_percent = 0.90
index_split_at = int (train_percent * len(all_labels))


# EARLY STOPPING CRITERIA
#mae_es= keras.callbacks.EarlyStopping(monitor='mean_squared_error', min_delta=1e-8, patience=200, verbose=1, mode='auto', restore_best_weights=True)
valmae_es= keras.callbacks.EarlyStopping(monitor='val_mean_absolute_error', min_delta=1e-10, patience=1000, verbose=1, mode='auto', restore_best_weights=True)

# EPOCH REAL TIME COUNTER CLASS
class PrintEpNum(keras.callbacks.Callback): # Callback class that prints the current epoch and training loss
    def on_epoch_end(self, epoch, logs):
        sys.stdout.flush()
        sys.stdout.write("Current Epoch: " + str(epoch+1) + " Training Loss: " + "%4f" %logs.get('loss') + '                                       \r') # Updates current Epoch Number

EPOCHS = 10000 # Number of EPOCHS

# NETWORK INITIALIZERS

kernel_init = initializers.RandomNormal(seed=30)
bias_init = initializers.Zeros()
optimizer = tf.train.AdamOptimizer()


# DATA SPLIT AND NORMALIZATION
all_values, all_labels = shuffle(all_values, all_labels, random_state=4)

train_values, test_values = np.split(all_values, [index_split_at])
train_labels, test_labels = np.split(all_labels, [index_split_at])

feature_mean = np.mean(train_values, axis=0)
feature_std = np.std(train_values, axis=0)

train_values = (train_values - feature_mean)/ (feature_std)
test_values = (test_values - feature_mean)/ (feature_std)

# NETWORK ARCHITECTURE

neuralnetwork_model = Sequential()
neuralnetwork_model.add(Dense(60, activation='relu', use_bias = True, input_shape=(train_values.shape[1], ), kernel_initializer=kernel_init, bias_initializer=bias_init))
neuralnetwork_model.add(Dropout(0.2))
neuralnetwork_model.add(Dense(120, activation='relu', use_bias = True, kernel_initializer=kernel_init, bias_initializer=bias_init ))
neuralnetwork_model.add(Dropout(0.2))
neuralnetwork_model.add(Dense(60, activation='relu', use_bias = True, kernel_initializer=kernel_init, bias_initializer=bias_init ))
neuralnetwork_model.add(Dropout(0.2))
neuralnetwork_model.add(Dense(1, activation='relu', use_bias = True, kernel_initializer=kernel_init, bias_initializer=bias_init ))

neuralnetwork_model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])

history = neuralnetwork_model.fit(train_values, train_labels, batch_size=90, validation_split=0.1, shuffle=False, epochs=EPOCHS, verbose = False, callbacks=[PrintEpNum(), valmae_es]) #  

[loss, mae] = neuralnetwork_model.evaluate(test_values, test_labels, verbose=0)
    
print(mae)
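
The stopping rule can be illustrated without Keras. The function below is a simplified sketch of patience-based early stopping on a made-up validation curve; `early_stopping_epoch` is a hypothetical helper written for this illustration, not part of any library.

```python
def early_stopping_epoch(val_errors, patience, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation errors.

    Training stops once `patience` consecutive epochs fail to improve on the
    best error by more than `min_delta`; the weights from `best_epoch` would
    then be restored. Epochs are counted from 0.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error - min_delta:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_errors) - 1


# Made-up validation MAE curve: improves, then starts to overfit.
val_mae = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.66, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_mae, patience=3)
print(best_epoch, stop_epoch)
```

With a patience of 3, training stops at epoch 6 and the weights of epoch 3 are the ones kept. The cell above uses a much larger patience (1000 epochs), which tolerates long plateaus in the validation error.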

Our model can now make predictions for our entry values. In this match plot we compare the experimental value of each label with the prediction of the trained model. Values that lie on the match line at y = x are accurately predicted.

In [None]:
test_predictions = neuralnetwork_model.predict(test_values).flatten() # Prediction of the test set

values = np.concatenate((train_values, test_values), axis=0) # This line joins the values together to evaluate all of them
all_predictions = neuralnetwork_model.predict(values).flatten()

fig = plt.figure(figsize = (10,10))

plt.errorbar(all_labels, all_predictions, color='green', marker='o', markersize=12, linestyle='None', label='Training Data')
plt.errorbar(test_labels, test_predictions, color='red', marker='o', markersize=12, linestyle='None',label='Testing Data')
plt.plot([-1, 20], [-1, 20], linestyle='dashed', color='black')
plt.xticks(np.linspace(0,20,11),fontsize=26)
plt.yticks(np.linspace(0,20,11), fontsize=26)
plt.xlim([-1,20])
plt.ylim([-1,20])
plt.grid()
plt.legend(loc=2, fontsize=22)
plt.ylabel('Predicted Conductivity x10$^{-4}$ (S/cm)', fontsize=26)
plt.xlabel('Experimental Conductivity x10$^{-4}$ (S/cm)', fontsize=26) 
plt.title('Artificial Neural Network', fontsize=26)

fig.show()

### 3.2 Random Forests with Sample Uncertainties (FUELS Framework) -- Residual Analysis

In [None]:
# We'll re-process the data

all_values = [list(x_df.iloc[x]) for x in range(len(x_df.index))]
all_values = np.array(all_values, dtype = float) 
all_labels = y.copy()

from lolopy.learners import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

model = RandomForestRegressor(num_trees=500)

# KFOLD CROSS VALIDATION TO CALCULATE RESIDUALS

y_resid = []
y_uncer = []
for train_id, test_id in KFold(10, shuffle=True).split(all_values):
    model.fit(all_values[train_id], all_labels[train_id])
    yf_pred, yf_std = model.predict(all_values[test_id], return_std=True)
    y_resid.extend(yf_pred - all_labels[test_id])
    y_uncer.extend(yf_std)
    
resid = np.divide(y_resid, y_uncer)

x_ev = np.linspace(-5, 5, 50)

# PLOTTING

fig = plt.figure(figsize = (10,10))

plt.hist(resid, density=True, label='Residual \nDensity', bins=10)
plt.plot(x_ev, norm.pdf(x_ev), color='black', linewidth=4, linestyle='-', label='Normal \nDistribution') # NORMAL DISTRIBUTION
plt.xticks(fontsize=26)
plt.yticks(fontsize=26)
plt.xlim([-5,5])
plt.grid()
plt.legend(fontsize=22)
plt.ylabel('Probability Density', fontsize=26)
plt.xlabel('RF Normalized Residual', fontsize=26)

fig.show()
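
The cell above divides each cross-validation residual by the uncertainty predicted for that point. If the uncertainties are well calibrated, these normalized residuals should follow a standard normal distribution, which is why the histogram is compared with `norm.pdf`. The sketch below checks this on synthetic numbers (not LLZO data) whose errors are drawn with exactly the stated standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the cross-validation outputs.
true_values = rng.uniform(0.0, 10.0, size=5000)
predicted_std = rng.uniform(0.5, 2.0, size=5000)
# Predictions whose errors really have the stated standard deviation.
predictions = true_values + rng.normal(0.0, predicted_std)

normalized_residual = (predictions - true_values) / predicted_std

print("mean:", normalized_residual.mean())
print("std:", normalized_residual.std())
print("fraction within 1 sigma:", np.mean(np.abs(normalized_residual) < 1.0))
```

A standard deviation well above 1 in the real data would indicate overconfident uncertainties, and a value well below 1 would indicate uncertainties that are too conservative.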

### 3.3 Random Forests with Sample Uncertainties (FUELS Framework) -- Model Predictions

In [None]:
# We'll re-process the data

from lolopy.learners import RandomForestRegressor
from lolopy.metrics import root_mean_squared_error

history_table = pd.DataFrame(columns=['Composition','Experimental',"Prediction"])

compositions_examined = np.array(data["chemicalFormula"], dtype = "object") # Storing the chemical formulas so they can be shuffled together with the values and labels

all_values = [list(x_df.iloc[x]) for x in range(len(x_df.index))]
all_values = np.array(all_values, dtype = float) 
all_labels = y.copy()


# SHUFFLE AND SPLITTING

all_values, all_labels, compositions_examined = shuffle(all_values, all_labels, compositions_examined, random_state=6)

train_values, test_values = np.split(all_values, [index_split_at])
train_labels, test_labels = np.split(all_labels, [index_split_at])
comp_train, comp_test = np.split(compositions_examined, [index_split_at])

# MODEL AND TRAINING

randomforest_model = RandomForestRegressor(num_trees=500)
randomforest_model.fit(train_values, train_labels)

test_pred, test_std = randomforest_model.predict(test_values, return_std=True)
all_pred, all_std = randomforest_model.predict(all_values, return_std=True)

inner_df = pd.DataFrame()
inner_df["Composition"] = comp_test
inner_df["Experimental"] = test_labels
inner_df["Predictions"] = test_pred
inner_df["residual"] = test_pred - test_labels
inner_df["STD"] = test_std
inner_df = inner_df.sort_values(by='Experimental')
display(inner_df)

mean_rf = mean_absolute_error(test_labels, test_pred)
print(mean_rf)

In [None]:
fig = plt.figure(figsize = (10,10))

plt.errorbar(all_labels, all_pred, color='green', marker='o', markersize=12, linestyle='None', label='Training Data')
plt.errorbar(test_labels, test_pred, yerr=test_std, color='red', marker='o', markersize=12, linestyle='None',label='Testing Data')
plt.plot([-1, 20], [-1, 20], linestyle='dashed', color='black')
plt.xticks(np.linspace(0,20,11),fontsize=26)
plt.yticks(np.linspace(0,20,11), fontsize=26)
plt.xlim([-1,20])
plt.ylim([-1,20])
plt.grid()
plt.legend(loc=2, fontsize=22)
plt.ylabel('Predicted Conductivity x10$^{-4}$ (S/cm)', fontsize=26)
plt.xlabel('Experimental Conductivity x10$^{-4}$ (S/cm)', fontsize=26)
plt.title('Random Forest', fontsize=26)

fig.show()

## 4. Ta Analysis

In [None]:
ta_line_comps = []
ta_array = np.linspace(0,1,101) # X-axis granularity to generate nominal stoichiometries
x_values_pred = np.linspace(0,1,101) # X-axis granularity for plotting


# Generating stoichiometric nominal compositions doped with Ta
for stg in ta_array:
    base_equation_string = "Li" + (('%f'%(np.around(7 - stg, decimals=3, out=None))).rstrip('0').rstrip('.')) + "La3" + "Zr" + (('%f'%(np.around(2 - stg, decimals=3, out=None))).rstrip('0').rstrip('.')) + ("Ta" + (('%f'%(np.around(stg, decimals=3, out=None))).rstrip('0').rstrip('.')) if stg > 0 else '') + "O12"
    ta_line_comps.append(base_equation_string)


# Getting descriptors from the chemical formula
    
ta_composition_test_set_dataframe = pd.DataFrame(ta_line_comps, columns=['chemicalFormula'])
ta_composition_test_set_dataframe['composition'] = ta_composition_test_set_dataframe['chemicalFormula'].apply(get_composition) # Transformation of chemicalformula string into Matminer composition
ta_feat_test_set= np.array(f.featurize_many(ta_composition_test_set_dataframe['composition'], ignore_errors=True)) # Array to store such features

ta_temp_array = np.array([25.0 for _ in range(ta_feat_test_set.shape[0])]).reshape(-1,1) # Array of simulated measuring temperatures
ta_feat_test_set = np.hstack((ta_feat_test_set, ta_temp_array))


# We need to drop the same columns that were dropped from the original training set so that this map has the same number of descriptors

# This code is to drop columns with std = 0. 
ta_parsed_features_test_set = pd.DataFrame(ta_feat_test_set)
ta_parsed_features_test_set_2 = ta_parsed_features_test_set.loc[:,  x_df_prior.std() != 0] # Dropping same columns that were dropped on the training data

# Turning these values into an array
ta_values = [list(ta_parsed_features_test_set_2.iloc[x]) for x in range(len(ta_parsed_features_test_set_2.index))]
ta_values = np.array(ta_values, dtype = float) 


# Making predictions
y_values_pred, err_values_pred = randomforest_model.predict(ta_values, return_std=True)


# Plotting
fig = plt.figure(figsize =(10,10))
plt.plot(x_values_exp,y_values_exp, color='blue', marker='o', fillstyle='none', markersize=12, linestyle = "None",  label = "Experiment") # This is from the list of Ta-compositions before dropping duplicates
plt.plot(x_values_dupmed,y_values_dupmed, color='black',  marker='x', markersize=14, linestyle = "None", label = "Median") # This is from the list of Ta-compositions after dropping duplicates

plt.plot(x_values_pred,y_values_pred, color='red', linestyle='solid',label='Prediction') # Predictions
plt.fill_between(np.array(x_values_pred), np.array(y_values_pred)-np.array(err_values_pred), np.array(y_values_pred)+np.array(err_values_pred), facecolor='#EBECF0') # Uncertainty regions

plt.grid()
plt.legend(fontsize=22)
plt.xlabel("Ta Content", fontsize =26)
plt.ylabel("Ionic Conductivity x10$^{-4}$ (S/cm)", fontsize =26)
plt.xticks(fontsize=26)
plt.yticks(fontsize=26)
fig.show()
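
The formula strings built in the loop above follow the nominal stoichiometry Li(7-x)La3Zr(2-x)Ta(x)O12. The same construction can be sketched with two small helpers; `format_amount` and `ta_doped_formula` are hypothetical names used only in this example.

```python
def format_amount(value):
    """Format a stoichiometric amount with up to 3 decimals and no trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def ta_doped_formula(ta_content):
    """Nominal formula Li(7-x)La3Zr(2-x)Ta(x)O12; the Ta term is omitted at x = 0."""
    ta_term = f"Ta{format_amount(ta_content)}" if ta_content > 0 else ""
    return (
        f"Li{format_amount(7 - ta_content)}La3"
        f"Zr{format_amount(2 - ta_content)}{ta_term}O12"
    )


for x in (0.0, 0.25, 1.0):
    print(x, ta_doped_formula(x))
```

Omitting the Ta term at x = 0 avoids a formula such as `Ta0`, so the undoped end member is written as plain `Li7La3Zr2O12`.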

---
## 5. Active Learning Approach

Active learning uses the model not for regression as an end in itself, but to decide which 'experiment' to run next, so that the maximum value of the property is reached in as few experiments as possible. Even though the model does make predictions for specific materials, its main task is to select the candidate most likely to be the global maximum. This is the approach introduced in the paper by Julia Ling et al.
<br>
<br>
We will select an initial set of 10 entries, and we'll make sure the highest value is not in it.

In [None]:
X = all_values.copy()
y = all_labels.copy()

model = RandomForestRegressor()

entry_number_init = 10

in_train = np.zeros(len(data), dtype=bool)  # np.bool was removed from recent NumPy versions
in_train[np.random.choice(len(data), entry_number_init, replace=False)] = True
print('Picked {} training entries'.format(in_train.sum()))
assert not np.isclose(max(y), max(y[in_train]))
print(max(y[in_train]))

We will then train the model with this initial set and make predictions:

In [None]:
model.fit(X[in_train], y[in_train])
y_pred, y_std = model.predict(X[~in_train], return_std=True)

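For this approach, we will query the next material to sample using four different acquisition functions $u(x)$. In each case the untested candidate with the largest $u(x)$ is selected. Here $f(x)_{pred}$ and $\sigma(x)$ are the model prediction and its uncertainty for candidate $x$, $f_{max\ train}$ is the highest value already in the training set, and $K$ weights the uncertainty term (the code below uses $K = 1$).
<center>
    
### MEI

<br>
$$ u(x) = f(x)_{pred} - f_{max\ train} $$
<br>

### MLI

<br>
$$ u(x) = \frac{f(x)_{pred} - f_{max\ train}}{\sigma(x)} $$
<br>

### MU
<br>
$$ u(x) = \sigma(x) $$
<br>

### UCB
<br>
$$ u(x) = f(x)_{pred} + K \, \sigma(x) $$
<br>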
</center>

In [None]:
mei_selection = np.argmax(y_pred)
mli_selection = np.argmax(np.divide(y_pred - np.max(y[in_train]), y_std))
mu_selection = np.argmax(y_std)
ucb_selection = np.argmax([sum(x) for x in zip(y_pred, y_std)])
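
To see how the four criteria can disagree, the sketch below applies them to made-up predictions for four candidates. The numbers and the variable `best_in_training` are illustrative assumptions; `K = 1` matches the plain sum used in the cell above.

```python
import numpy as np

# Made-up predictions and uncertainties for four untested candidates.
y_pred = np.array([3.0, 5.0, 4.0, 4.4])
y_std = np.array([0.5, 0.4, 0.1, 1.5])
best_in_training = 3.5  # highest measured value among the training entries
K = 1.0                 # weight of the uncertainty term in UCB

improvement = y_pred - best_in_training

selections = {
    "MEI": int(np.argmax(improvement)),            # largest predicted value
    "MLI": int(np.argmax(improvement / y_std)),    # improvement relative to uncertainty
    "MU": int(np.argmax(y_std)),                   # most uncertain candidate
    "UCB": int(np.argmax(y_pred + K * y_std)),     # optimistic estimate
}
print(selections)
```

MEI picks the candidate with the highest prediction, MLI prefers a smaller but more certain improvement, and MU and UCB are both drawn to the candidate with the largest uncertainty. Subtracting `best_in_training` does not change the MEI ranking, which is why the cell above can take the `argmax` of `y_pred` directly.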

In [None]:
print('Predicted ' + property_interest + ' of material #{} selected based on MEI: {:.6f} +/- {:.6f}'.format(mei_selection, y_pred[mei_selection], y_std[mei_selection]))
print('Predicted ' + property_interest + ' of material #{} selected based on MLI: {:.6f} +/- {:.6f}'.format(mli_selection, y_pred[mli_selection], y_std[mli_selection]))
print('Predicted ' + property_interest + ' of material #{} selected based on MU: {:.6f} +/- {:.6f}'.format(mu_selection, y_pred[mu_selection], y_std[mu_selection]))
print('Predicted ' + property_interest + ' of material #{} selected based on UCB: {:.6f} +/- {:.6f}'.format(ucb_selection, y_pred[ucb_selection], y_std[ucb_selection]))

Here we run the full campaigns. Starting from the same initial set, each approach repeatedly selects the next point to query, adds it to its own training set, and retrains the model, so we can test whether it reaches the sample with the highest value. Running every step allows us to track the entire course of the experiments. The following cell takes about 2-3 minutes to run.

In [None]:
n_steps = 90
all_inds = set(range(len(y)))

random_train = [list(set(np.where(in_train)[0].tolist()))]
mei_train = [list(set(np.where(in_train)[0].tolist()))]
mli_train = [list(set(np.where(in_train)[0].tolist()))]
mu_train = [list(set(np.where(in_train)[0].tolist()))]
ucb_train = [list(set(np.where(in_train)[0].tolist()))]
random_train_inds = []
mei_train_inds = []
mli_train_inds = []
ucb_train_inds = []


for i in range(n_steps):

    # RANDOM
    
    random_train_inds = random_train[-1].copy()
    
    random_search_inds = list(all_inds.difference(random_train_inds))
    
    model.fit(X[random_train_inds], y[random_train_inds])
    random_y_pred = model.predict(X[random_search_inds])
    
    random_train_inds.append(np.random.choice(random_search_inds))
    random_train.append(random_train_inds)
    
    # Maximum Expected Improvement
    
    mei_train_inds = mei_train[-1].copy()    
    mei_search_inds = list(all_inds.difference(mei_train_inds))
    
    # Pick the entry with the largest predicted value
    model.fit(X[mei_train_inds], y[mei_train_inds])
    mei_y_pred = model.predict(X[mei_search_inds])

    mei_train_inds.append(mei_search_inds[np.argmax(mei_y_pred)])
    mei_train.append(mei_train_inds)

    # Maximum Likelihood of Improvement
    
    mli_train_inds = mli_train[-1].copy()  # Last iteration
    mli_search_inds = list(all_inds.difference(mli_train_inds))
    
    # Pick the entry with the largest likelihood of improvement
    model.fit(X[mli_train_inds], y[mli_train_inds])
    mli_y_pred, mli_y_std = model.predict(X[mli_search_inds], return_std=True)
    mli_train_inds.append(mli_search_inds[np.argmax(np.divide(mli_y_pred - np.max(y[mli_train_inds]), mli_y_std))])
    mli_train.append(mli_train_inds)
    
    # Maximum Uncertainty
    
    mu_train_inds = mu_train[-1].copy()  # Last iteration
    mu_search_inds = list(all_inds.difference(mu_train_inds))
    
    # Pick the entry with the largest predicted uncertainty
    model.fit(X[mu_train_inds], y[mu_train_inds])
    mu_y_pred, mu_y_std = model.predict(X[mu_search_inds], return_std=True)
    mu_train_inds.append(mu_search_inds[np.argmax(mu_y_std)])
    mu_train.append(mu_train_inds)
    
    # Upper Conf Bound
    
    ucb_train_inds = ucb_train[-1].copy()  # Last iteration
    ucb_search_inds = list(all_inds.difference(ucb_train_inds))
    
    # Pick the entry with the largest upper confidence bound (prediction + uncertainty)
    model.fit(X[ucb_train_inds], y[ucb_train_inds])
    ucb_y_pred, ucb_y_std = model.predict(X[ucb_search_inds], return_std=True)
    ucb_train_inds.append(ucb_search_inds[np.argmax([sum(x) for x in zip(ucb_y_pred, ucb_y_std)])])
    ucb_train.append(ucb_train_inds)
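
The structure of one campaign can be reduced to a short loop. The sketch below runs a UCB campaign on a toy one-descriptor problem. Since lolopy is not assumed to be available here, `bootstrap_linear_predict` is a hypothetical stand-in that estimates uncertainty from bootstrapped linear fits; it is not the random forest used in this notebook and is only meant to make the loop runnable with NumPy alone.

```python
import numpy as np


def bootstrap_linear_predict(X_train, y_train, X_query, rng, n_models=25):
    """Hypothetical stand-in for a forest with uncertainties.

    Fits least-squares lines to bootstrap resamples of the training data and
    returns the mean and standard deviation of their predictions.
    """
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_query = np.column_stack([X_query, np.ones(len(X_query))])
    predictions = []
    for _ in range(n_models):
        sample = rng.integers(0, len(y_train), size=len(y_train))
        weights = np.linalg.lstsq(A_train[sample], y_train[sample], rcond=None)[0]
        predictions.append(A_query @ weights)
    predictions = np.array(predictions)
    return predictions.mean(axis=0), predictions.std(axis=0)


def run_ucb_campaign(X, y, initial_indices, n_steps, rng, K=1.0):
    """Sequential search: at each step add the candidate with the highest mean + K * std."""
    train = list(initial_indices)
    best_so_far = [y[train].max()]
    for _ in range(n_steps):
        search = [i for i in range(len(y)) if i not in train]
        if not search:
            break
        mean, std = bootstrap_linear_predict(X[train], y[train], X[search], rng)
        train.append(search[int(np.argmax(mean + K * std))])
        best_so_far.append(y[train].max())
    return train, best_so_far


rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)  # one made-up descriptor
y = 4.0 * X[:, 0] * (1.0 - X[:, 0])           # toy property with its maximum near the middle
initial = [0, 1, 2]
train_order, best_so_far = run_ucb_campaign(X, y, initial, n_steps=17, rng=rng)
print(best_so_far)
```

`best_so_far` plays the role of the curves plotted in the first panel of the figure: it can only stay flat or increase, and the number of steps needed to reach the maximum of `y` is the quantity used to compare acquisition functions.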

In this notebook we present a static version of the plot from the main notebook, for publication purposes.

In [None]:
fig, ax = plt.subplots(2,3,figsize=(18,14))
ax = ax.flatten()
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.5, hspace=0.4)

# FIRST PLOT, Approach lines

random_line, = ax[0].plot([], [], color='green', label='Random')
mei_line, = ax[0].plot([], [], color='blue', label='MEI')
mli_line, = ax[0].plot([], [], color='red', label='MLI')
mu_line, = ax[0].plot([], [], color='purple', label='MU')
ucb_line, = ax[0].plot([], [], color='orange', label='UCB')
max_line, = ax[0].plot(range(n_steps), [max(y) for m in range(n_steps)], '--', color='black', label='Maximum Value')

random_chk, = ax[0].plot([], [], markersize=10, marker='*', linestyle='None', color='green')
mei_chk, = ax[0].plot([], [], markersize=10, marker='*', linestyle='None', color='blue')
mli_chk, = ax[0].plot([], [], markersize=10, marker='*', linestyle='None', color='red')
mu_chk, = ax[0].plot([], [], markersize=10, marker='*', linestyle='None', color='purple')
ucb_chk, = ax[0].plot([], [], markersize=10, marker='*', linestyle='None', color='orange')

# ax0leg = ax[0].legend(loc=4, prop={'size': 12})
# ax0leg.get_frame().set_edgecolor('k')


ax[0].grid()
ax[0].set_xlabel("Number of Experiments", fontsize=24)
ax[0].set_ylabel("Maximum "+ property_interest, fontsize=24)
mli_line.axes.axis([0, n_steps-1, 0, 1.1*max(y)])

mli_line.axes.get_yaxis().set_tick_params(labelsize=20)
mli_line.axes.get_xaxis().set_tick_params(labelsize=20)


mei_line.axes.get_yaxis().set_tick_params(labelsize=20)
mu_line.axes.get_yaxis().set_tick_params(labelsize=20)
random_line.axes.get_yaxis().set_tick_params(labelsize=20)
ucb_line.axes.get_yaxis().set_tick_params(labelsize=20)

# SECOND PLOT, Random Prediction

all_values_samples = ax[5].plot(list(all_inds), y, marker='o', alpha=0.2, color='gray', linestyle='None', markersize=10, label='Values')

random_reallabel = [y[index] for index in random_train_inds]

random_initial_set = ax[5].plot(random_train_inds[:entry_number_init], random_reallabel[:entry_number_init], color='black', marker='o', linestyle= 'None',  markersize=10, label = 'Initial Set')


# ax5leg = ax[5].legend(prop={'size': 12})
# ax5leg.get_frame().set_edgecolor('k')

ax[5].grid()
ax[5].set_title("Random", fontsize=26)
ax[5].set_xlabel("Test Candidates", fontsize=24)
ax[5].set_ylabel(property_interest, fontsize=24)


#THIRD PLOT, MEI Prediction

all_values_samples = ax[1].plot(list(all_inds), y, marker='o', alpha=0.2, color='gray', linestyle='None', markersize=10, label='Values')

mei_reallabel = [y[index] for index in mei_train_inds]

mei_initial_set = ax[1].plot(mei_train_inds[:entry_number_init], mei_reallabel[:entry_number_init], color='black', marker='o',linestyle= 'None',   markersize=10, label = 'Initial Set')

# ax1leg = ax[1].legend(prop={'size': 12})
# ax1leg.get_frame().set_edgecolor('k')

ax[1].grid()
ax[1].set_title("Maximum Expected \n Improvement (MEI)", fontsize=26)
ax[1].set_xlabel("Test Candidates", fontsize=24)
ax[1].set_ylabel(property_interest, fontsize=24)

# 4th PLOT, MLI Prediction

all_values_samples = ax[3].plot(list(all_inds), y, marker='o', alpha=0.2, color='gray', linestyle='None', markersize=10, label='Values')

mli_reallabel = [y[index] for index in mli_train_inds]

mli_initial_set = ax[3].plot(mli_train_inds[:entry_number_init], mli_reallabel[:entry_number_init], color='black', marker='o', linestyle= 'None',  markersize=10, label = 'Initial Set')

# ax3leg = ax[3].legend(prop={'size': 12})
# ax3leg.get_frame().set_edgecolor('k')

ax[3].grid()
ax[3].set_title("Maximum Likelihood \n of Improvement (MLI)", fontsize=26)
ax[3].set_xlabel("Test Candidates", fontsize=24)
ax[3].set_ylabel(property_interest, fontsize=24)


# 5th plot, MU Prediction

all_values_samples = ax[4].plot(list(all_inds), y, marker='o', alpha=0.2, color='gray', linestyle='None', markersize=10, label='Values')

mu_reallabel = [y[index] for index in mu_train_inds]

mu_initial_set = ax[4].plot(mu_train_inds[:entry_number_init], mu_reallabel[:entry_number_init], color='black', marker='o', linestyle= 'None',  markersize=10, label = 'Initial Set')



# ax4leg = ax[4].legend(prop={'size': 12})
# ax4leg.get_frame().set_edgecolor('k')

ax[4].grid()
ax[4].set_title("Maximum \n Uncertainty (MU)", fontsize=26)
ax[4].set_xlabel("Test Candidates", fontsize=24)
ax[4].set_ylabel(property_interest, fontsize=24)

# 6th plot, UCB Prediction

all_values_samples = ax[2].plot(list(all_inds), y, marker='o', alpha=0.2, color='gray', linestyle='None', markersize=10, label='Values')

ucb_reallabel = [y[index] for index in ucb_train_inds]

ucb_initial_set = ax[2].plot(ucb_train_inds[:entry_number_init], ucb_reallabel[:entry_number_init], color='black', marker='o', linestyle= 'None',  markersize=10, label = 'Initial Set')


# ax2leg = ax[2].legend(prop={'size': 12})
# ax2leg.get_frame().set_edgecolor('k')

ax[2].grid()
ax[2].set_title("Upper Confidence \n Bound (UCB)", fontsize=26)
ax[2].set_xlabel("Test Candidates", fontsize=24)
ax[2].set_ylabel(property_interest, fontsize=24)

#################################################

num=90


import matplotlib.pylab as pl

if num > 0:

    random_graph = [max(y[list(t)]) for t in random_train[:num]]
    chk_index = [i for i, j in enumerate(random_graph) if j == max(y)][0]
    random_line.set_data(np.arange(len(random_train))[:chk_index+1], [max(y[list(t)]) for t in random_train[:chk_index+1]])
    random_chk.set_data([chk_index], [max(random_graph)])  # set_data expects sequences
    
    a = list(enumerate(random_train_inds[entry_number_init:entry_number_init+chk_index]))
    a = [_[0] for _ in a]
    
    n = len(a)
    colors = np.array(pl.cm.Greens(np.linspace(0,1,n)))
 
    random_sample_real = ax[5].scatter(random_train_inds[entry_number_init:entry_number_init+chk_index], random_reallabel[entry_number_init:entry_number_init+chk_index], c=colors, marker='o', s=100,linestyle= 'None', label = 'Tests')
    random_sample_real.axes.axis([0, len(y), 0, 1.1*max(y)])
    random_sample_real.axes.get_xaxis().set_ticks([])
    random_sample_real.axes.get_yaxis().set_tick_params(labelsize=20)
    
    
    mei_graph = [max(y[list(t)]) for t in mei_train[:num]]
    chk_index = [i for i, j in enumerate(mei_graph) if j == max(y)][0]
    mei_line.set_data(np.arange(len(mei_train))[:chk_index+1], [max(y[list(t)]) for t in mei_train[:chk_index+1]])
    mei_chk.set_data([chk_index], [max(mei_graph)])  # set_data expects sequences
    
    a = list(enumerate(mei_train_inds[entry_number_init:entry_number_init+chk_index]))
    a = [_[0] for _ in a]
    
    n = len(a)
    colors = np.array(pl.cm.Blues(np.linspace(0,1,n)))    
    
    mei_sample_real = ax[1].scatter(mei_train_inds[entry_number_init:entry_number_init+chk_index], mei_reallabel[entry_number_init:entry_number_init+chk_index], c=colors, marker='o', s=100,linestyle= 'None', label = 'Tests')
    mei_sample_real.axes.axis([0, len(y), 0, 1.1*max(y)])
    mei_sample_real.axes.get_xaxis().set_ticks([])
    mei_sample_real.axes.get_yaxis().set_tick_params(labelsize=20)
    
    

    mli_graph = [max(y[list(t)]) for t in mli_train[:num]]
    chk_index = [i for i, j in enumerate(mli_graph) if j == max(y)][0]
    mli_line.set_data(np.arange(len(mli_train))[:chk_index+1], [max(y[list(t)]) for t in mli_train[:chk_index+1]])
    mli_chk.set_data([chk_index], [max(mli_graph)])  # set_data expects sequences

    
    a = list(enumerate(mli_train_inds[entry_number_init:entry_number_init+chk_index]))
    a = [_[0] for _ in a]
    
    n = len(a)
    colors = np.array(pl.cm.Reds(np.linspace(0,1,n)))    
    
    mli_sample_real = ax[3].scatter(mli_train_inds[entry_number_init:entry_number_init+chk_index], mli_reallabel[entry_number_init:entry_number_init+chk_index], c=colors, marker='o', s=100,linestyle= 'None', label = 'Tests') 
    mli_sample_real.axes.axis([0, len(y), 0, 1.1*max(y)])
    mli_sample_real.axes.get_xaxis().set_ticks([])
    mli_sample_real.axes.get_yaxis().set_tick_params(labelsize=20)    
    

    mu_graph = [max(y[list(t)]) for t in mu_train[:num]]
    chk_index = [i for i, j in enumerate(mu_graph) if j == max(y)][0]
    mu_line.set_data(np.arange(len(mu_train))[:chk_index+1], [max(y[list(t)]) for t in mu_train[:chk_index+1]])
    mu_chk.set_data([chk_index], [max(mu_graph)])  # set_data expects sequences
    
    a = list(enumerate(mu_train_inds[entry_number_init:entry_number_init+chk_index]))
    a = [_[0] for _ in a]
    
    n = len(a)
    colors = np.array(pl.cm.Purples(np.linspace(0,1,n)))     
    
    mu_sample_real = ax[4].scatter(mu_train_inds[entry_number_init:entry_number_init+chk_index], mu_reallabel[entry_number_init:entry_number_init+chk_index], c=colors, marker='o', s=100,linestyle= 'None', label = 'Tests') 
    mu_sample_real.axes.axis([0, len(y), 0, 1.1*max(y)])
    mu_sample_real.axes.get_xaxis().set_ticks([])
    mu_sample_real.axes.get_yaxis().set_tick_params(labelsize=20)    
    
    
    ucb_graph = [max(y[list(t)]) for t in ucb_train[:num]]
    chk_index = [i for i, j in enumerate(ucb_graph) if j == max(y)][0]
    ucb_line.set_data(np.arange(len(ucb_train))[:chk_index+1], [max(y[list(t)]) for t in ucb_train[:chk_index+1]])
    ucb_chk.set_data([chk_index], [max(ucb_graph)])  # set_data expects sequences
    
    
    a = list(enumerate(ucb_train_inds[entry_number_init:entry_number_init+chk_index]))
    a = [_[0] for _ in a]
    
    n = len(a)
    colors = np.array(pl.cm.YlOrBr(np.linspace(0,1,n)))     
    
    ucb_sample_real = ax[2].scatter(ucb_train_inds[entry_number_init:entry_number_init+chk_index], ucb_reallabel[entry_number_init:entry_number_init+chk_index], c=colors, marker='o', s=100,linestyle= 'None', label = 'Tests') 
    ucb_sample_real.axes.axis([0, len(y), 0, 1.1*max(y)])
    ucb_sample_real.axes.get_xaxis().set_ticks([])
    ucb_sample_real.axes.get_yaxis().set_tick_params(labelsize=20)    

fig.show()
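
The `*_graph` lists and `chk_index` values in the cell above track the best value found so far and the first step at which the global maximum enters the training set. A minimal sketch of that bookkeeping on made-up numbers follows; the list `sampled` and the helper `first_hit` are hypothetical and not part of the notebook.

```python
import numpy as np

def first_hit(sampled_values, target):
    """Return the running best-so-far curve and the first step that reaches `target`.

    The step is None when the target is never reached, whereas taking `[0]`
    of an empty list would raise an IndexError.
    """
    best_so_far = np.maximum.accumulate(np.asarray(sampled_values, dtype=float))
    hits = np.flatnonzero(np.isclose(best_so_far, target))
    step = int(hits[0]) if hits.size else None
    return best_so_far, step

# Hypothetical measured values, in the order they were sampled
sampled = [2.0, 5.0, 3.0, 9.0, 4.0]
curve, step = first_hit(sampled, target=9.0)
print(curve, step)
```

Returning `None` for a strategy that never reaches the optimum within the allotted steps is one possible design choice; a fixed penalty value is another.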

---

## 6. 30-Trial Run using the OLDEST data points

In [None]:
X = all_values.copy()
y = all_labels.copy()

entry_number_init = 10
in_train = np.zeros(len(data), dtype=bool)  # np.bool was removed from recent NumPy releases
oldest = np.argpartition(years, entry_number_init)
display(data.loc[oldest,['chemicalFormula', 'Year Published']].head(n=10))
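
`np.argpartition(years, k)` places the indices of the `k` smallest values in the first `k` positions, in no particular order. The sketch below illustrates this on made-up publication years; `years_demo` and `k` are hypothetical names, not the notebook's data.

```python
import numpy as np

# Hypothetical publication years (not the notebook's data)
years_demo = np.array([2015, 2007, 2019, 2011, 2009, 2013, 2017])
k = 3

part = np.argpartition(years_demo, k)
oldest_k = part[:k]  # indices of the k smallest years, in arbitrary order

# Optional: order those k indices by year
oldest_sorted = oldest_k[np.argsort(years_demo[oldest_k])]

print(years_demo[oldest_sorted])
```

When several entries share the cutoff year, which of them land in the first `k` positions is arbitrary.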

In [None]:
random_trials = []
mei_trials = []
mli_trials = []
ucb_trials = []
mu_trials = []

np.random.seed(2) # Random Seed

trial = 0

while trial < 30:
    
    model = RandomForestRegressor(num_trees=500)

    entry_number_init = 10
    
    # -----
    
    in_train = np.zeros(len(data), dtype=bool)  # np.bool was removed from recent NumPy releases
    oldest = np.argpartition(years, entry_number_init)
    in_train[oldest[:entry_number_init]] = True
    print('Picked {} training entries'.format(in_train.sum()))

    # -----    
    
#     in_train = np.zeros(len(data), dtype=bool)
#     in_train[np.random.choice(len(data), entry_number_init, replace=False)] = True
#     print('Picked {} training entries'.format(in_train.sum()))
    
    if not (np.isclose(max(y), max(y[in_train]))):
        trial += 1
    else: 
        continue

    model.fit(X[in_train], y[in_train])
    y_pred, y_std = model.predict(X[~in_train], return_std=True)

    print(trial)
    n_steps = 90
    all_inds = set(range(len(y)))

    random_train = [list(set(np.where(in_train)[0].tolist()))]
    mei_train = [list(set(np.where(in_train)[0].tolist()))]
    mli_train = [list(set(np.where(in_train)[0].tolist()))]
    mu_train = [list(set(np.where(in_train)[0].tolist()))]
    ucb_train = [list(set(np.where(in_train)[0].tolist()))]
    random_train_inds = []
    mei_train_inds = []
    mli_train_inds = []
    ucb_train_inds = []
    mu_train_inds = []


    for i in range(n_steps):

        # RANDOM

        random_train_inds = random_train[-1].copy()

        random_search_inds = list(all_inds.difference(random_train_inds))

        model.fit(X[random_train_inds], y[random_train_inds])
        random_y_pred = model.predict(X[random_search_inds])

        random_train_inds.append(np.random.choice(random_search_inds))
        random_train.append(random_train_inds)

        # Maximum Expected Improvement

        mei_train_inds = mei_train[-1].copy()    
        mei_search_inds = list(all_inds.difference(mei_train_inds))

        # Pick the candidate with the largest predicted value
        model.fit(X[mei_train_inds], y[mei_train_inds])
        mei_y_pred = model.predict(X[mei_search_inds])

        mei_train_inds.append(mei_search_inds[np.argmax(mei_y_pred)])
        mei_train.append(mei_train_inds)

        # Maximum Likelihood of Improvement

        mli_train_inds = mli_train[-1].copy()  # Last iteration
        mli_search_inds = list(all_inds.difference(mli_train_inds))

        # Pick the candidate with the largest (prediction - current best) / uncertainty
        model.fit(X[mli_train_inds], y[mli_train_inds])
        mli_y_pred, mli_y_std = model.predict(X[mli_search_inds], return_std=True)
        mli_train_inds.append(mli_search_inds[np.argmax(np.divide(mli_y_pred - np.max(y[mli_train_inds]), mli_y_std))])
        mli_train.append(mli_train_inds)

        # Maximum Uncertainty

        mu_train_inds = mu_train[-1].copy()  # Last iteration
        mu_search_inds = list(all_inds.difference(mu_train_inds))

        # Pick the candidate with the largest predicted uncertainty
        model.fit(X[mu_train_inds], y[mu_train_inds])
        mu_y_pred, mu_y_std = model.predict(X[mu_search_inds], return_std=True)
        mu_train_inds.append(mu_search_inds[np.argmax(mu_y_std)])
        mu_train.append(mu_train_inds)

        # Upper Conf Bound

        ucb_train_inds = ucb_train[-1].copy()  # Last iteration
        ucb_search_inds = list(all_inds.difference(ucb_train_inds))

        # Pick the candidate with the largest prediction + uncertainty
        model.fit(X[ucb_train_inds], y[ucb_train_inds])
        ucb_y_pred, ucb_y_std = model.predict(X[ucb_search_inds], return_std=True)
        ucb_train_inds.append(ucb_search_inds[np.argmax([sum(x) for x in zip(ucb_y_pred, ucb_y_std)])])
        ucb_train.append(ucb_train_inds)

    if np.max(y[random_train_inds]) == np.max(y):
        random_trials.append(np.argmax(y[random_train_inds][10:])+1)
    else:
        random_trials.append(n_steps)

    if np.max(y[mei_train_inds]) == np.max(y):
        mei_trials.append(np.argmax(y[mei_train_inds][10:])+1)
    else:
        mei_trials.append(n_steps)

    if np.max(y[mli_train_inds]) == np.max(y):
        mli_trials.append(np.argmax(y[mli_train_inds][10:])+1)
    else:
        mli_trials.append(n_steps)

    if np.max(y[ucb_train_inds]) == np.max(y):
        ucb_trials.append(np.argmax(y[ucb_train_inds][10:])+1)
    else:
        ucb_trials.append(n_steps)

    if np.max(y[mu_train_inds]) == np.max(y):
        mu_trials.append(np.argmax(y[mu_train_inds][10:])+1)
    else:
        mu_trials.append(n_steps)
        
print("Random", random_trials)
print("MEI", mei_trials)
print("MLI", mli_trials)
print("UCB", ucb_trials)
print("MU", mu_trials)
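
Each strategy in the loop above ranks the untested candidates with a different score built from the model's prediction and its uncertainty. A minimal NumPy sketch of the four rules as written in that cell, on made-up numbers, could look like this; the arrays `pred` and `std`, the value `best`, and the helper `pick` are hypothetical.

```python
import numpy as np

def pick(pred, std, best_so_far, strategy):
    """Return the index of the candidate selected by the given strategy."""
    pred = np.asarray(pred, dtype=float)
    std = np.asarray(std, dtype=float)
    if strategy == "MEI":      # largest predicted value
        score = pred
    elif strategy == "MLI":    # largest (prediction - current best) / uncertainty
        score = (pred - best_so_far) / std
    elif strategy == "MU":     # largest uncertainty
        score = std
    elif strategy == "UCB":    # largest prediction + uncertainty
        score = pred + std
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return int(np.argmax(score))

# Hypothetical predictions and uncertainties for five untested candidates
pred = [4.0, 6.0, 5.8, 3.0, 5.0]
std = [0.5, 1.0, 0.1, 3.5, 2.5]
best = 5.5  # best value already in the training set

picks = {name: pick(pred, std, best, name) for name in ("MEI", "MLI", "MU", "UCB")}
print(picks)
```

Like the cell above, this sketch does not guard against a zero uncertainty in the MLI ratio.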

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt


# The values below are hard-coded because the previous cell takes about 30 minutes to run. Comment them out to use the variables computed by the cell above.

random_trials = [17, 39, 23, 41, 65, 23, 65, 32, 62, 36, 14, 76, 11, 59, 56, 79, 50, 33, 61, 19, 81, 46, 51, 71, 69, 6, 79, 39, 5, 6]
mei_trials =  [13, 11, 13, 13, 9, 12, 9, 11, 12, 8, 13, 12, 11, 8, 5, 13, 13, 13, 12, 9, 9, 13, 13, 13, 9, 13, 8, 16, 16, 9]
mli_trials =  [3, 12, 2, 17, 12, 12, 3, 13, 3, 12, 17, 13, 3, 14, 12, 12, 3, 15, 16, 13, 3, 11, 14, 17, 3, 15, 4, 13, 3, 12]
ucb_trials =  [14, 14, 4, 13, 3, 10, 13, 12, 4, 10, 4, 11, 4, 14, 11, 15, 10, 14, 11, 15, 15, 10, 4, 11, 4, 14, 10, 14, 14, 11]
mu_trials = [9, 17, 24, 24, 33, 29, 27, 10, 30, 45, 28, 25, 31, 11, 33, 26, 9, 23, 25, 28, 24, 29, 31, 25, 31, 23, 11, 25, 22, 25]

objects = ('MLI', 'UCB', 'MEI', 'MU', 'Random')
y_pos = np.arange(len(objects))

performance = [np.mean(mli_trials),np.mean(ucb_trials),np.mean(mei_trials),np.mean(mu_trials),np.mean(random_trials)]
uncertainty = [np.std(mli_trials),np.std(ucb_trials), np.std(mei_trials), np.std(mu_trials),np.std(random_trials)]
uncertainty_adj = [x/np.sqrt(30) for x in uncertainty]

fig = plt.figure(figsize = (10,10))

plt.bar(y_pos, performance, yerr=uncertainty_adj, color=['red', 'orange', 'blue', 'purple', 'green'], align='center', alpha=0.5)
plt.axhline(y=45 , linestyle= 'dashed' , linewidth=3, color='gray')
plt.xticks(y_pos, objects, fontsize=28)
plt.yticks(fontsize=28)
plt.ylim([0,60])
plt.ylabel('Number of Experiments', fontsize=28)
#plt.title('Information Acquisition Functions', fontsize=18)

plt.show()
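
The error bars in the plot above are the standard deviation of each list divided by the square root of the number of trials, i.e. the standard error of the mean. A small sketch of that calculation follows; the helper `mean_and_sem` is hypothetical. `np.std` defaults to the population form (`ddof=0`), which is what the cell above uses; `ddof=1` gives the sample form.

```python
import numpy as np

def mean_and_sem(trials, ddof=0):
    """Return the mean and the standard error of the mean of a list of trial counts."""
    trials = np.asarray(trials, dtype=float)
    return trials.mean(), trials.std(ddof=ddof) / np.sqrt(trials.size)

demo_trials = [13, 11, 13, 13, 9, 12]  # first six MEI values listed above

mean, sem = mean_and_sem(demo_trials)                   # population form, as in the cell above
_, sem_sample = mean_and_sem(demo_trials, ddof=1)       # sample form, slightly larger
print(mean, sem, sem_sample)
```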

---

## 7. 30-Trial Run using the RANDOM starting data points

In [None]:
random_trials = []
mei_trials = []
mli_trials = []
ucb_trials = []
mu_trials = []

np.random.seed(2) # Random Seed

trial = 0

while trial < 30:
    
    model = RandomForestRegressor(num_trees=200)

    entry_number_init = 10
    
    # -----
    
#     in_train = np.zeros(len(data), dtype=bool)
#     oldest = np.argpartition(years, entry_number_init)
#     in_train[oldest[:entry_number_init]] = True
#     print('Picked {} training entries'.format(in_train.sum()))

    # -----    
    
    in_train = np.zeros(len(data), dtype=bool)  # np.bool was removed from recent NumPy releases
    in_train[np.random.choice(len(data), entry_number_init, replace=False)] = True
    print('Picked {} training entries'.format(in_train.sum()))
    
    if not (np.isclose(max(y), max(y[in_train]))):
        trial += 1
    else: 
        continue

    model.fit(X[in_train], y[in_train])
    y_pred, y_std = model.predict(X[~in_train], return_std=True)

    print(trial)
    n_steps = 90
    all_inds = set(range(len(y)))

    random_train = [list(set(np.where(in_train)[0].tolist()))]
    mei_train = [list(set(np.where(in_train)[0].tolist()))]
    mli_train = [list(set(np.where(in_train)[0].tolist()))]
    mu_train = [list(set(np.where(in_train)[0].tolist()))]
    ucb_train = [list(set(np.where(in_train)[0].tolist()))]
    
    print("Initial Set:", random_train[0])
    
    random_train_inds = []
    mei_train_inds = []
    mli_train_inds = []
    ucb_train_inds = []
    mu_train_inds = []


    for i in range(n_steps):

        # RANDOM

        random_train_inds = random_train[-1].copy()

        random_search_inds = list(all_inds.difference(random_train_inds))

        model.fit(X[random_train_inds], y[random_train_inds])
        random_y_pred = model.predict(X[random_search_inds])

        random_train_inds.append(np.random.choice(random_search_inds))
        random_train.append(random_train_inds)

        # Maximum Expected Improvement

        mei_train_inds = mei_train[-1].copy()    
        mei_search_inds = list(all_inds.difference(mei_train_inds))

        # Pick the candidate with the largest predicted value
        model.fit(X[mei_train_inds], y[mei_train_inds])
        mei_y_pred = model.predict(X[mei_search_inds])

        mei_train_inds.append(mei_search_inds[np.argmax(mei_y_pred)])
        mei_train.append(mei_train_inds)

        # Maximum Likelihood of Improvement

        mli_train_inds = mli_train[-1].copy()  # Last iteration
        mli_search_inds = list(all_inds.difference(mli_train_inds))

        # Pick the candidate with the largest (prediction - current best) / uncertainty
        model.fit(X[mli_train_inds], y[mli_train_inds])
        mli_y_pred, mli_y_std = model.predict(X[mli_search_inds], return_std=True)
        mli_train_inds.append(mli_search_inds[np.argmax(np.divide(mli_y_pred - np.max(y[mli_train_inds]), mli_y_std))])
        mli_train.append(mli_train_inds)

        # Maximum Uncertainty

        mu_train_inds = mu_train[-1].copy()  # Last iteration
        mu_search_inds = list(all_inds.difference(mu_train_inds))

        # Pick the candidate with the largest predicted uncertainty
        model.fit(X[mu_train_inds], y[mu_train_inds])
        mu_y_pred, mu_y_std = model.predict(X[mu_search_inds], return_std=True)
        mu_train_inds.append(mu_search_inds[np.argmax(mu_y_std)])
        mu_train.append(mu_train_inds)

        # Upper Conf Bound

        ucb_train_inds = ucb_train[-1].copy()  # Last iteration
        ucb_search_inds = list(all_inds.difference(ucb_train_inds))

        # Pick the candidate with the largest prediction + uncertainty
        model.fit(X[ucb_train_inds], y[ucb_train_inds])
        ucb_y_pred, ucb_y_std = model.predict(X[ucb_search_inds], return_std=True)
        ucb_train_inds.append(ucb_search_inds[np.argmax([sum(x) for x in zip(ucb_y_pred, ucb_y_std)])])
        ucb_train.append(ucb_train_inds)

    if np.max(y[random_train_inds]) == np.max(y):
        random_trials.append(np.argmax(y[random_train_inds][10:])+1)
    else:
        random_trials.append(n_steps)

    if np.max(y[mei_train_inds]) == np.max(y):
        mei_trials.append(np.argmax(y[mei_train_inds][10:])+1)
    else:
        mei_trials.append(n_steps)

    if np.max(y[mli_train_inds]) == np.max(y):
        mli_trials.append(np.argmax(y[mli_train_inds][10:])+1)
    else:
        mli_trials.append(n_steps)

    if np.max(y[ucb_train_inds]) == np.max(y):
        ucb_trials.append(np.argmax(y[ucb_train_inds][10:])+1)
    else:
        ucb_trials.append(n_steps)

    if np.max(y[mu_train_inds]) == np.max(y):
        mu_trials.append(np.argmax(y[mu_train_inds][10:])+1)
    else:
        mu_trials.append(n_steps)
        
print("Random", random_trials)
print("MEI", mei_trials)
print("MLI", mli_trials)
print("UCB", ucb_trials)
print("MU", mu_trials)

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# The values below are hard-coded because the previous cell takes about 30 minutes to run. Comment them out to use the variables computed by the cell above.

random_trials =  [11, 12, 26, 36, 57, 60, 3, 65, 81, 83, 55, 90, 54, 68, 36, 87, 52, 74, 12, 42, 62, 44, 29, 68, 52, 8, 74, 15, 38, 15]
mei_trials =  [8, 9, 13, 7, 4, 9, 9, 6, 14, 20, 1, 14, 28, 4, 9, 2, 13, 9, 10, 15, 11, 4, 11, 6, 10, 18, 5, 2, 10, 18]
mli_trials = [12, 6, 22, 13, 3, 14, 23, 1, 15, 30, 2, 7, 14, 15, 19, 1, 5, 4, 11, 11, 11, 4, 16, 8, 12, 16, 11, 1, 4, 9]
ucb_trials =  [1, 8, 21, 9, 3, 13, 23, 1, 12, 28, 2, 4, 16, 2, 13, 1, 10, 12, 10, 11, 9, 12, 14, 4, 11, 17, 14, 2, 6, 7]
mu_trials = [11, 37, 36, 31, 20, 23, 37, 13, 3, 25, 26, 14, 33, 2, 32, 17, 22, 38, 19, 37, 10, 13, 16, 19, 10, 42, 3, 35, 38, 35]


objects = ('MLI', 'UCB', 'MEI', 'MU', 'Random')
y_pos = np.arange(len(objects))

performance = [np.mean(mli_trials),np.mean(ucb_trials),np.mean(mei_trials),np.mean(mu_trials),np.mean(random_trials)]
uncertainty = [np.std(mli_trials),np.std(ucb_trials), np.std(mei_trials), np.std(mu_trials),np.std(random_trials)]
uncertainty_adj = [x/np.sqrt(30) for x in uncertainty]

fig = plt.figure(figsize = (10,10))

plt.bar(y_pos, performance, yerr=uncertainty_adj, color=['red', 'orange', 'blue', 'purple', 'green'], align='center', alpha=0.5)
plt.axhline(y=45 , linestyle= 'dashed' , linewidth=3, color='gray')
plt.xticks(y_pos, objects, fontsize=28)
plt.yticks(fontsize=28)
plt.ylim([0,60])
plt.ylabel('Number of Experiments', fontsize=28)
#plt.title('Information Acquisition Functions', fontsize=18)

plt.show()

## 8. Garnet Predictor for codoped LLZO

In this section we build a garnet predictor trained on all of our data. These predictions cover an untested region of candidates. Because the data are limited and random forests cannot extrapolate, they should be read with those caveats in mind and not taken as all-encompassing predictions. Results in these figures may vary because of the internal random initialization of the random forest model; their purpose is to show the limited predictive capability of the models on their own, which motivates the use of active learning approaches.
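
The extrapolation limit mentioned above comes from how tree ensembles predict: every leaf returns an average of training targets, so no prediction can exceed the largest target seen in training. The toy sketch below illustrates this with a small bagged ensemble of 1-D regression trees written in NumPy. It is a stand-in for a random forest, not the lolopy implementation used in this notebook, and all names and data in it are hypothetical.

```python
import numpy as np

def fit_tree(x, y, depth=3):
    """Fit a tiny 1-D regression tree; each leaf stores the mean of its targets."""
    xs = np.unique(x)
    if depth == 0 or xs.size < 2:
        return float(np.mean(y))
    best_sse, best_t = np.inf, None
    for t in (xs[:-1] + xs[1:]) / 2.0:  # midpoints between neighboring x values
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_t = sse, t
    mask = x <= best_t
    return (best_t,
            fit_tree(x[mask], y[mask], depth - 1),
            fit_tree(x[~mask], y[~mask], depth - 1))

def predict_tree(tree, x0):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x0 <= t else right
    return tree

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 50)
y_train = 2.0 * x_train  # made-up linear trend

# Bagging: each tree is fit on a bootstrap resample of the training data
forest = []
for _ in range(25):
    idx = rng.integers(0, x_train.size, x_train.size)
    forest.append(fit_tree(x_train[idx], y_train[idx]))

def predict_forest(x0):
    """Average the predictions of all trees in the toy ensemble."""
    return float(np.mean([predict_tree(tree, x0) for tree in forest]))

inside = predict_forest(0.5)   # within the training range
outside = predict_forest(2.0)  # the linear trend would give 4.0 here
print(inside, outside, y_train.max())
```

The prediction at `x = 2.0` stays at or below the largest training target even though the underlying trend keeps rising, which is why predictions far from the training compositions should be treated with caution.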

In [None]:
# Predict the conductivity over a grid of co-doped compositions and draw it as a heatmap, given the model, the two dopants with their ranges, and the Li/La/Zr substitution coefficients for each dopant.

def prediction_surface_dualdoped(model, first_dopant, first_dopant_range, second_dopant, second_dopant_range, first_lithium_replace, first_lanthanum_replace, first_zirconium_replace, second_lithium_replace, second_lanthanum_replace, second_zirconium_replace):

    # Grid of compositions
    st_first = np.around(np.linspace(0,first_dopant_range,81), decimals=3, out=None)
    st_second = np.around(np.linspace(0,second_dopant_range,81), decimals=3, out=None)

    first_grid, second_grid = np.meshgrid(st_first, st_second)

    formula_test_set = []
    pairs = []

    # Creation of nominal stoichiometries
    
    for stb in st_first:
        for stg in st_second:
            base_equation_string = "Li" + (('%f'%(np.around(7+first_lithium_replace*stb + second_lithium_replace*stg, decimals=3, out=None))).rstrip('0').rstrip('.')) + "La" + (('%f'%(np.around(3+(first_lanthanum_replace*stb) + (second_lanthanum_replace*stg), decimals=3, out=None))).rstrip('0').rstrip('.')) + "Zr" + (('%f'%(np.around(2+(first_zirconium_replace*stb) + (second_zirconium_replace*stg), decimals=3, out=None))).rstrip('0').rstrip('.')) + (str(first_dopant) + (('%f'%(np.around(stb, decimals=3, out=None))).rstrip('0').rstrip('.')) if stb > 0 else '') + (str(second_dopant) + (('%f'%(np.around(stg, decimals=3, out=None))).rstrip('0').rstrip('.')) if stg > 0 else '') + "O12"
            formula_test_set.append(base_equation_string)
            pairs.append((stb, stg))

    # Descriptors through features
    
    composition_test_set_dataframe = pd.DataFrame(formula_test_set, columns=['chemicalFormula'])

    composition_test_set_dataframe['composition'] = composition_test_set_dataframe['chemicalFormula'].apply(get_composition) # Transformation of chemicalformula string into Matminer composition
    feat_test_set= np.array(f.featurize_many(composition_test_set_dataframe['composition'], ignore_errors=True)) # Array to store such features

    temp_array = np.array([25.0 for _ in range(feat_test_set.shape[0])]).reshape(-1,1)
    feat_test_set = np.hstack((feat_test_set, temp_array))

    # This code is to drop columns with std = 0. 
    parsed_features_test_set = pd.DataFrame(feat_test_set)
    parsed_features_test_set_2 = parsed_features_test_set.loc[:,  x_df_prior.std() != 0] # Dropping same columns that were dropped on the training data

    values = [list(parsed_features_test_set_2.iloc[x]) for x in range(len(parsed_features_test_set_2.index))]
    values = np.array(values, dtype = float) 

    # Normalize the features when the model is the neural network
    
    if model == neuralnetwork_model:
        values = (values - feature_mean)/ (feature_std)

    # Predictions
    predictions = model.predict(values)#.flatten() # Prediction of the test set # Z
    
    
    # Plotting
    
    Z = np.array([np.array(predictions.reshape(81,81)[:,x]) for x in range(81)])
    
    colors = []
    
    color_scale_colors =[[0.0, "rgb(165,0,38)"],
                [0.1111111111111111, "rgb(215,48,39)"],
                [0.2222222222222222, "rgb(244,109,67)"],
                [0.3333333333333333, "rgb(253,174,97)"],
                [0.4444444444444444, "rgb(254,224,144)"],
                [0.5555555555555556, "rgb(224,243,248)"],
                [0.6666666666666666, "rgb(171,217,233)"],
                [0.7777777777777778, "rgb(116,173,209)"],
                [0.8888888888888888, "rgb(69,117,180)"],
                [1.0, "rgb(49,54,149)"]]
    
    
    
    known_points = []
    cross_test_points = []
    
    test_forset = inner_df["Composition"].apply(get_composition)
    
    for _ in composition_test_set_dataframe['composition']: # Marking diamonds for the values in the entire dataset
        if _ in list(data['composition'].values):
            dic_ = dict(_.as_dict()) 
            if first_dopant in dic_.keys():
                x = dic_[first_dopant]
            else:
                x = 0.0
            if second_dopant in dic_.keys():
                y = dic_[second_dopant]
            else:
                y = 0.0              
            z = float(data[data['composition'] == _]["Ionic Conductivity"])
            known_points.append([x,y,z])
        if _ in list(test_forset): # Marking crosses for the values in the test dataset
            dic_ = dict(_.as_dict()) 
            if first_dopant in dic_.keys():
                x = dic_[first_dopant]
            else:
                x = 0.0
            if second_dopant in dic_.keys():
                y = dic_[second_dopant]
            else:
                y = 0.0              
            cross_test_points.append([x,y])
            
    
    known_points = np.array(known_points)   
    known_scatter = go.Scatter(x=known_points[:,0], y=known_points[:,1], customdata=known_points[:,2], mode='markers',
    marker=dict(
        symbol = "diamond",
        size=20,
        color=known_points[:,2],
        cmin = 0,
        cmax = 18,
        opacity=1,  colorbar=dict(thickness=20, title=dict(text='Lithium ion conductivity 10<sup> - 4</sup> S/cm',font = dict(family='Times New Roman',size=32)), titleside = 'right'), line=dict(width=2, color ='white')
    ), hovertemplate='D1:%{x:.2f} <br>D2:%{y:.3f} <br>IC:%{customdata:.3f}')
    

    layout = go.Layout(
        width = 800,
        height = 600,
        font = dict(family='Times New Roman',size=32),
        xaxis= dict(title= first_dopant + ' content',zeroline= False, gridwidth= 2),
        yaxis= dict(title= second_dopant + ' content',zeroline= False, gridwidth= 2),
        showlegend=False
    )
    
    trace_heatmap = go.Heatmap(x=st_first, y = st_second, z=Z, showscale=False, connectgaps=True, zsmooth='best', zauto = False, zmin = 0, zmax = 18,  colorbar=dict(tickfont=dict(size=20)))
                              
    fig = go.Figure(data=[trace_heatmap,known_scatter], layout=layout)
    

    
    if cross_test_points != []:
        cross_test_points = np.array(cross_test_points)   
        cross_test_scatter = go.Scatter(x=cross_test_points[:,0], y=cross_test_points[:,1], mode='markers',
        marker=dict(
            symbol = "x",
            size=14,
            color="white"))
        
        fig.add_trace(cross_test_scatter)
    
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))    
    iplot(fig)
    
    return first_dopant, second_dopant, st_first, st_second, pairs, colors, predictions
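
The nominal formula strings built inside the function above combine a base Li7La3Zr2O12 stoichiometry with per-dopant substitution coefficients. A standalone sketch of that string construction for a single composition follows; the helpers `fmt` and `codoped_formula` and their default arguments are hypothetical and only mirror the formatting logic of the cell above.

```python
def fmt(value):
    """Format a coefficient with at most 3 decimals and no trailing zeros."""
    return ('%f' % round(value, 3)).rstrip('0').rstrip('.')

def codoped_formula(dopant_1, x1, dopant_2, x2,
                    li=(-1, -1), la=(0, 0), zr=(-1, -1)):
    """Build a nominal Li-La-Zr-O garnet formula for two dopants.

    `li`, `la`, and `zr` hold, for each dopant, the change in that element's
    coefficient per unit of dopant (the *_replace arguments in the cell above).
    """
    parts = [
        "Li" + fmt(7 + li[0] * x1 + li[1] * x2),
        "La" + fmt(3 + la[0] * x1 + la[1] * x2),
        "Zr" + fmt(2 + zr[0] * x1 + zr[1] * x2),
    ]
    if x1 > 0:
        parts.append(dopant_1 + fmt(x1))
    if x2 > 0:
        parts.append(dopant_2 + fmt(x2))
    parts.append("O12")
    return "".join(parts)

ta_nb = codoped_formula("Ta", 0.5, "Nb", 0.25)
bi_only = codoped_formula("Bi", 0.5, "Ga", 0.0, li=(-1, -3), zr=(-1, 0))
print(ta_nb)
print(bi_only)
```

A dopant with zero content is left out of the string, as in the original one-line expression.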

In [None]:
first_dopant, second_dopant, first_element_range, second_element_range, pairs, colors, other_predictions = prediction_surface_dualdoped(randomforest_model, "Ta", 1 , "Nb", 1, -1, 0, -1, -1, 0, -1)

In [None]:
first_dopant, second_dopant, first_element_range, second_element_range, pairs, colors, other_predictions = prediction_surface_dualdoped(randomforest_model, "Bi", 1 , "Ga", 0.5, -1, 0, -1, -3, 0, 0)

In [None]:
first_dopant, second_dopant, first_element_range, second_element_range, pairs, colors, other_predictions = prediction_surface_dualdoped(randomforest_model, "Sc", 0.5, "Ga", 0.5 , 1, 0, -1,-3, 0, 0)