# Jupyter Notebook Demo for *Inorganic materials synthesis planning with literature-trained neural networks*

This notebook provides a tutorial for the word embeddings and machine learning models in this repository.

## First: download additional resources

### FastText
- [Trained model](https://figshare.com/s/70455cfcd0084a504745)

### ELMo
- [Vocabulary file](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt)
- [Weights and config](https://figshare.com/s/ec677e7db3cf2b7db4bf)
- [bilm-tf Library](https://github.com/allenai/bilm-tf)

For the `bilm-tf` library, follow the setup instructions in its repository. In practice, we've found that ELMo outperforms FastText on a variety of tasks (e.g., named entity recognition), but it has a much higher computational cost and requires GPUs.

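Place the FastText model files and the ELMo weights/options in the same directory as this notebook, or adjust the file paths in the cells below (which point to `../synthesis-generation/bin/`) to match where you saved them.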

## 1. Importing libraries

Import the following libraries, and `pip install` or `conda install` anything that you don't have. Some of these libraries are used in the `models` folder and not directly in this notebook, but it's easier to check these dependencies first.

In [12]:
import gensim
import keras
import tensorflow
import numpy
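
If you would rather check for missing dependencies before running the imports, a minimal sketch along these lines may help. The helper name `missing_packages` is hypothetical and not part of this repository; it only relies on the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The four libraries imported in the cell above.
required = ["gensim", "keras", "tensorflow", "numpy"]

missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only confirms that each package can be located; it does not check versions.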

## 2. FastText Embeddings for Materials Science

FastText embeddings support out-of-vocabulary inference by using sub-word information. Our FastText model was trained on lowercase text only, so lowercase any query string before looking it up.

In [2]:
fasttext = gensim.models.keyedvectors.KeyedVectors.load("../synthesis-generation/bin/fasttext_embeddings-MINIFIED.model")

# Need to set this when loading from saved file
fasttext.bucket = 2000000

In [3]:
# Examples that should have reasonably high similarity

print(fasttext.similarity("batio3", "bifeo3"))
print(fasttext.similarity("rinse", "wash"))
print(fasttext.similarity("grind", "mill"))
print(fasttext.similarity("nanotube", "nanosphere"))

0.7407776617325483
0.8053306072499058
0.7119846415681294
0.7841045911308067
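
For reference, gensim's `similarity` returns the cosine similarity between the two word vectors. The sketch below shows that calculation with NumPy on small made-up vectors; the function name and the toy inputs are illustrative assumptions, not values from the trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 2-D vectors (hypothetical, for illustration only).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # parallel vectors
diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])         # 45 degrees apart
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors

print(same_direction, diagonal, orthogonal)
```

Values near 1 mean the vectors point in nearly the same direction, which is why related terms such as "rinse" and "wash" score highly above.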


In [4]:
# Out-of-vocabulary inference
# A new vector is inferred for an unseen material and
# reasonable similarity estimates are produced

print("licoo2" in fasttext.vocab)
print("lini(1–x)coxo2" in fasttext.vocab)

print(fasttext["lini(1–x)coxo2"].shape)

print(fasttext.similarity("licoo2", "lini(1–x)coxo2"))
print(fasttext.similarity("mno2", "lini(1–x)coxo2"))

True
False
(100,)
0.6120179766407703
0.46709642006346813
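
The out-of-vocabulary behaviour above comes from FastText's sub-word design: a word is broken into character n-grams, each n-gram is hashed into a fixed number of buckets, and the vectors for those buckets are combined. The following is a simplified sketch of that idea, not gensim's exact implementation. The n-gram range (3 to 6), the FNV-1a hash, the tiny bucket count, and the random table are all assumptions chosen for illustration.

```python
import numpy as np


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of a string."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % 2**32
    return h


def infer_vector(word, ngram_table):
    """Average the bucket vectors selected by the word's hashed n-grams."""
    rows = [fnv1a_32(gram) % len(ngram_table) for gram in char_ngrams(word)]
    return ngram_table[rows].mean(axis=0)


# Hypothetical stand-in for the trained n-gram table: 1000 buckets, 8 dimensions.
rng = np.random.default_rng(0)
toy_table = rng.normal(size=(1000, 8))

grams = char_ngrams("mno2", min_n=3, max_n=4)
vec = infer_vector("lini(1–x)coxo2", toy_table)

print(grams)
print(vec.shape)
```

Because any string can be split into n-grams, a vector can always be produced, even for a formula the model never saw during training.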


## 3. ELMo Embeddings for Materials Science

ELMo embeddings are context-sensitive and fully character-based. This means that out-of-vocabulary inference is possible at prediction time, and both upper- and lowercase letters are supported.

Support for ELMo embeddings is provided through our custom `TokenClassifier` object (which can also perform NER, once trained).

In [5]:
from models import token_classifier
token_classifier = token_classifier.TokenClassifier(
    vocab="../synthesis-generation/bin/vocab.txt", 
    options="../synthesis-generation/bin/elmo_options.json", 
    weights="../synthesis-generation/bin/elmo_weights.hdf5"
)

Instructions for updating:
Use the `axis` argument instead
USING SKIP CONNECTIONS


In [6]:
example_sentences = [
    ["The", "silica", "nanoparticles", "were", "heated", "."],
    ["The", "precursor", "was", "sputtered", "onto", "the", "silica", "substrate", "."]
]

feature_matrix = token_classifier.featurize_elmo_list(example_sentences)

In [11]:
print(example_sentences[0][1])
print(example_sentences[1][6])

silica
silica


In [7]:
# Since ELMo is context-sensitive, it can produce different embeddings for the same word ("silica")

silica_embedding_1 = feature_matrix[0][1]
silica_embedding_2 = feature_matrix[1][6]

print(numpy.linalg.norm(silica_embedding_1 - silica_embedding_2))

7.7241073
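
The same comparison can be generalized to every occurrence of a token across a batch of sentences. The sketch below assumes only the `feature_matrix[sentence][token]` indexing used above; the helper names are hypothetical, and the random 4-dimensional features are a stand-in for real ELMo output so that the example runs without the model weights.

```python
from itertools import combinations

import numpy as np


def token_positions(sentences, token):
    """Return (sentence_index, token_index) pairs for each occurrence of a token."""
    return [
        (i, j)
        for i, sentence in enumerate(sentences)
        for j, tok in enumerate(sentence)
        if tok == token
    ]


def pairwise_distances(features, positions):
    """Euclidean distance between the embeddings at every pair of positions."""
    return {
        (p, q): float(
            np.linalg.norm(
                np.asarray(features[p[0]][p[1]]) - np.asarray(features[q[0]][q[1]])
            )
        )
        for p, q in combinations(positions, 2)
    }


sentences = [
    ["The", "silica", "nanoparticles", "were", "heated", "."],
    ["The", "precursor", "was", "sputtered", "onto", "the", "silica", "substrate", "."],
]

# Hypothetical stand-in for the ELMo feature matrix: one (n_tokens, 4) array per sentence.
rng = np.random.default_rng(0)
toy_features = [rng.normal(size=(len(sentence), 4)) for sentence in sentences]

positions = token_positions(sentences, "silica")
distances = pairwise_distances(toy_features, positions)

print(positions)
print(distances)
```

With real ELMo features in place of `toy_features`, a nonzero distance between two occurrences of the same word reflects the influence of the surrounding context.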


## 4. Loading and building variational autoencoder models

The code samples below show how to load and inspect the Keras models included in this repository.

In [8]:
from models import action_generator, material_generator

action_generator = action_generator.ActionGenerator()
material_generator = material_generator.MaterialGenerator()

In [9]:
action_generator.build_nn_model()
action_generator.vae.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
ops_in (InputLayer)             (None, 20, 50)       0                                            
__________________________________________________________________________________________________
conv_enc_1 (Conv1D)             (None, 16, 128)      32128       ops_in[0][0]                     
__________________________________________________________________________________________________
conv_enc_2 (Conv1D)             (None, 12, 128)      82048       conv_enc_1[0][0]                 
__________________________________________________________________________________________________
conv_enc_3 (Conv1D)             (None, 8, 128)       82048       conv_enc_2[0][0]                 
__________________________________________________________________________________________________
flatten_1 

In [10]:
material_generator.build_nn_model()
material_generator.vae.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
material_in (InputLayer)        (None, 10, 50)       0                                            
__________________________________________________________________________________________________
conv_enc_1 (Conv1D)             (None, 8, 64)        9664        material_in[0][0]                
__________________________________________________________________________________________________
conv_enc_2 (Conv1D)             (None, 6, 64)        12352       conv_enc_1[0][0]                 
__________________________________________________________________________________________________
conv_enc_3 (Conv1D)             (None, 4, 64)        12352       conv_enc_2[0][0]                 
__________________________________________________________________________________________________
flatten_2 
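
The `Conv1D` rows in the two summaries above can be checked by hand. For a 1-D convolution with stride 1 and no padding, the output length is `input_length - kernel_size + 1` and the parameter count is `kernel_size * in_channels * filters + filters`. The sketch below applies those formulas. The kernel sizes (5 for the action model, 3 for the material model), the stride, and the padding are not stated in this notebook; they are assumptions inferred from the printed shapes and parameter counts.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of an unpadded ('valid') 1-D convolution."""
    return (input_length - kernel_size) // stride + 1


def conv1d_param_count(in_channels, filters, kernel_size):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def trace_encoder(input_shape, filters, kernel_size, n_layers=3):
    """Return ((length, channels), params) for each stacked Conv1D layer."""
    length, channels = input_shape
    rows = []
    for _ in range(n_layers):
        params = conv1d_param_count(channels, filters, kernel_size)
        length = conv1d_output_length(length, kernel_size)
        channels = filters
        rows.append(((length, channels), params))
    return rows


# Assumed hyperparameters, inferred from the summaries above.
action_rows = trace_encoder((20, 50), filters=128, kernel_size=5)
material_rows = trace_encoder((10, 50), filters=64, kernel_size=3)

for name, rows in [("action", action_rows), ("material", material_rows)]:
    for shape, params in rows:
        print(name, shape, params)
```

Under these assumptions the computed shapes and parameter counts match the `conv_enc_1` through `conv_enc_3` rows of both summaries.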