
## MLUMD Colombia Tutorial: Information Extraction from Scientific Text
Today, we will be using named entity recognition models and text mining to extract information on the synthesis of materials. 

This notebook is based on Kim, Edward, et al. "Inorganic materials synthesis planning with literature-trained neural networks." Journal of Chemical Information and Modeling 60.3 (2020): 1194-1201. 
doi.org/10.1021/acs.jcim.9b00995

More information and resources are available at synthesisproject.org.
Feel free to reach out to me at zjensen@mit.edu with questions or collaboration ideas. 

Please make a copy of this notebook and run the code from the copy.


# Extraction Workflow

Steps marked with ** are the focus of this tutorial.
1. Obtain literature corpus relevant to your area of study.
    - Search engines such as Scopus, Engineering Village, Crossref, Web of Science, etc.
    - Text and data mining agreements with publishers.
    - We have used this pipeline on corpora ranging from hundreds to hundreds of thousands of articles.
2. Parse and clean articles
    - We only use HTML and XML formats (PDFs are very difficult)
    - Published repository parses for many of the popular publishers (https://github.com/CederGroupHub/LimeSoup)
3. Classify paragraph type to determine relevant sections
    - Introduction, synthesis, characterization, results, conclusion, etc.
    - Use a hybrid rule-based/classifier approach
    - Model is available at https://github.com/olivettigroup/materials-synthesis-generative-models.git
4. **Named Entity Recognition (NER) and text mining to extract interesting entities
    - We care about synthesis information (targets, precursors, operations, etc)
    - Models are available at https://github.com/olivettigroup/materials-synthesis-generative-models.git
5. **Associating Entities 
    - For example, precursors with their target material
    - Many techniques exist, including proximity-based matching and dependency parsing
6. **Data Mining and Machine Learning
    - Visualize trends in the synthesis data
    - Use machine learning models on the extracted data

# Imports:
Load all the necessary libraries

We need to download and install the bilm library from GitHub. For the code to work on Colab, first run the three cells below, then restart the runtime so that bilm is recognized as an installed package. https://stackoverflow.com/questions/57838013/modulenotfounderror-after-successful-pip-install-in-google-colaboratory

In [None]:
!git clone https://github.com/allenai/bilm-tf.git

Cloning into 'bilm-tf'...
remote: Enumerating objects: 292, done.
remote: Total 292 (delta 0), reused 0 (delta 0), pack-reused 292
Receiving objects: 100% (292/292), 588.40 KiB | 22.63 MiB/s, done.
Resolving deltas: 100% (137/137), done.


In [None]:
%cd bilm-tf/
!python setup.py install

/content/bilm-tf
running install
running bdist_egg
running egg_info
creating bilm.egg-info
writing bilm.egg-info/PKG-INFO
writing dependency_links to bilm.egg-info/dependency_links.txt
writing requirements to bilm.egg-info/requires.txt
writing top-level names to bilm.egg-info/top_level.txt
writing manifest file 'bilm.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'bilm.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/bilm
copying bilm/data.py -> build/lib/bilm
copying bilm/model.py -> build/lib/bilm
copying bilm/__init__.py -> build/lib/bilm
copying bilm/training.py -> build/lib/bilm
copying bilm/elmo.py -> build/lib/bilm
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/bilm
copying build/lib/bilm/data.py -> build/bdist.linux-x86_64/egg/bilm
copying build/lib/bilm/model.py -> build/

In [None]:
%pip install 'h5py==2.10.0' --force-reinstall

Collecting h5py==2.10.0
  Downloading h5py-2.10.0-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)

------------------------------------------------------------------

***Restart the runtime now so that the newly installed bilm library can be imported.
Select Runtime --> Restart runtime, then continue with the next cell.

------------------------------------------------------------------

In [None]:
%tensorflow_version 1.x
import tensorflow as tf
import numpy as np
import json
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import pickle
import logging
logging.getLogger('tensorflow').disabled = True #OPTIONAL - to disable outputs from Tensorflow

TensorFlow 1.x selected.


# Get the data
We need to get the pretrained ELMo embedding model as well as the public GitHub NER repository.

NER Repository

In [None]:
!git clone https://github.com/olivettigroup/materials-synthesis-generative-models.git

Cloning into 'materials-synthesis-generative-models'...
remote: Enumerating objects: 678, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (125/125), done.
remote: Total 678 (delta 125), reused 95 (delta 52), pack-reused 501
Receiving objects: 100% (678/678), 1.12 MiB | 15.13 MiB/s, done.
Resolving deltas: 100% (141/141), done.


In [None]:
%cd materials-synthesis-generative-models/

/content/bilm-tf/materials-synthesis-generative-models


ELMO Embeddings

In [None]:
# Get the vocab file for Elmo, we use the default 
!wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt

--2021-07-29 16:59:03--  https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.224.128
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.224.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7430437 (7.1M) [text/plain]
Saving to: ‘vocab-2016-09-10.txt’


2021-07-29 16:59:04 (9.52 MB/s) - ‘vocab-2016-09-10.txt’ saved [7430437/7430437]



In [None]:
# Download the pretrained-Elmo weights and config file from https://figshare.com/s/ec677e7db3cf2b7db4bf
!wget https://ndownloader.figshare.com/files/13773791?private_link=ec677e7db3cf2b7db4bf

--2021-07-29 16:59:04--  https://ndownloader.figshare.com/files/13773791?private_link=ec677e7db3cf2b7db4bf
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 52.16.102.173, 54.217.124.219, 2a05:d018:1f4:d000:b283:27aa:b939:8ed4, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|52.16.102.173|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/13773791/elmo_finetuned_matsci.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8791f5b142ba7b293ce5993c638216cfd76d10469418f29ca1f3dc5e11b425c6&X-Amz-Date=20210729T165905Z&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20210729/eu-west-1/s3/aws4_request [following]
--2021-07-29 16:59:05--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/13773791/elmo_finetuned_matsci.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8791f5b142ba7b293ce5993c638

In [None]:
# Unzip the weights and config file
!tar -xvf '13773791?private_link=ec677e7db3cf2b7db4bf'

elmo_options.json
elmo_weights.hdf5


NER Model

In [None]:
!wget https://ndownloader.figshare.com/files/25506038

--2021-07-29 16:59:20--  https://ndownloader.figshare.com/files/25506038
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 54.217.124.219, 52.16.102.173, 2a05:d018:1f4:d000:b283:27aa:b939:8ed4, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|54.217.124.219|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/25506038/token_classifier_elmo.model?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=550edef9e315603cc4f3e79f01f8fba81a75ba633a520a74bcb67b113f1bf754&X-Amz-Date=20210729T165921Z&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20210729/eu-west-1/s3/aws4_request [following]
--2021-07-29 16:59:21--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/25506038/token_classifier_elmo.model?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=550edef9e315603cc4f3e79f01f8fba81a75ba633a520a74bcb67b113f1bf7

ELMO Featurized Annotations

In [None]:
!wget https://ndownloader.figshare.com/files/25636313
!wget https://ndownloader.figshare.com/files/25636334
!wget https://ndownloader.figshare.com/files/25636421

--2021-07-29 16:59:40--  https://ndownloader.figshare.com/files/25636313
Resolving ndownloader.figshare.com (ndownloader.figshare.com)... 54.217.124.219, 52.16.102.173, 2a05:d018:1f4:d003:1c8b:1823:acce:812, ...
Connecting to ndownloader.figshare.com (ndownloader.figshare.com)|54.217.124.219|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/25636313/dev_elmo.p?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=75bc4cf38ab2cccf395b929aa4164cdb73a500b306eb5ed7b70faf398bfbf7a4&X-Amz-Date=20210729T165941Z&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20210729/eu-west-1/s3/aws4_request [following]
--2021-07-29 16:59:41--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/25636313/dev_elmo.p?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=75bc4cf38ab2cccf395b929aa4164cdb73a500b306eb5ed7b70faf398bfbf7a4&X-Amz-Date=20210729T165941Z&X-Am

In [None]:
!mv 25506038 ner.model
!mv 25636313 dev_elmo.p
!mv 25636334 test_elmo.p
!mv 25636421 train_elmo.p

# Load the NER model and make predictions



Import the code from the Git repository we cloned. It handles the interaction with Keras and TensorFlow for us behind the scenes.

In [None]:
from models import token_classifier

Create an NER classifier instance using the ELMo files we downloaded. We use the CPU and pre-featurized data for the sake of the tutorial.

In [None]:
token_classifier = token_classifier.TokenClassifier(
    vocab="vocab-2016-09-10.txt", 
    options="elmo_options.json", 
    weights="elmo_weights.hdf5",
    use_cpu=True, load_data=False
)

Load the pre-trained NER model we downloaded earlier. 

In [None]:
token_classifier.load('ner.model')

Visualize the Model

In [None]:
token_classifier.model.summary()

Load our pre-featurized ELMO data. For this tutorial, we are using the test set from training the model as our data set. 

In [None]:
with open('test_elmo.p', 'rb') as f:  # the context manager closes the file after loading
  token_classifier.X_test = pickle.load(f)
print('Featurized Data Shape (should be (302,100,1024)):', token_classifier.X_test.shape)

Load the text data from the repository

In [None]:
with open('data/ner_annotations_split.json', 'r') as f:
  data = json.load(f)
print('Data Type:', type(data))
print('Data Keys:', data.keys())
print('Number of Papers:', data['total_annotation_files'])
print('Data Type for each paper:', type(data['data'][0]))
print('Keys for each paper:', data['data'][0].keys())

Get all the test sentences from the data

In [None]:
test_sentences = []
for paper in data['data']:
  if paper['split'] == 'test':
    for tokens in paper['tokens'][1:]:
      test_sentences.append(tokens)
print('Number of Sentences in data set:', len(test_sentences))
print('Example Sentence', test_sentences[0])

Example sentences. If your data is raw text, then you need to convert it to ELMo features using "featurize_elmo_list". For time reasons, we will not do that beyond this example.

In [None]:
example_sentences = [
    "The BaCO3 and TiO2 were mixed to make BaTiO3 .".split(),
    "The SiO2 was heated at 700 degC .".split()
]
feature_matrix = token_classifier.featurize_elmo_list(example_sentences)
print("ELMO Features shape (should be (2,100,1024)):", feature_matrix.shape)
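
The expected shape `(2, 100, 1024)` means each sentence occupies a fixed block of 100 token slots with one 1024-dimensional vector per slot. The sketch below illustrates that layout with random stand-in vectors and zero padding. `pad_embeddings` and the random vectors are hypothetical, and the repository's `featurize_elmo_list` may pad or truncate differently.

```python
import numpy as np

def pad_embeddings(sentence_vectors, max_len=100, dim=1024):
    """Stack variable-length (n_tokens, dim) arrays into one (n_sentences, max_len, dim) array."""
    batch = np.zeros((len(sentence_vectors), max_len, dim), dtype=np.float32)
    for i, vectors in enumerate(sentence_vectors):
        n = min(len(vectors), max_len)  # truncate sentences longer than max_len
        batch[i, :n] = vectors[:n]      # remaining slots stay zero (padding)
    return batch

rng = np.random.default_rng(0)
# Stand-in embeddings for a 10-token and an 8-token sentence (not real ELMo output)
fake_vectors = [rng.normal(size=(10, 1024)), rng.normal(size=(8, 1024))]
padded = pad_embeddings(fake_vectors)
print(padded.shape)
```

Because every sentence ends up with the same shape, a whole batch can be passed to the model in one call.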

In [None]:
raw_predictions = token_classifier.model.predict(feature_matrix)
print('Example Sentences and Predictions:')
for sentence, predictions in zip(example_sentences, raw_predictions):
  print('---')
  for word, prediction in zip(sentence, predictions):
    print('  ', word, token_classifier.token_classes[np.argmax(prediction)])

Get the raw predictions (class probabilities) using Keras

In [None]:
raw_test_predictions = token_classifier.model.predict(token_classifier.X_test)
print('Raw Prediction Shape (should be (302, 100, 4)):', raw_test_predictions.shape)

Format the predictions taking the most likely class as the prediction

In [None]:
test_predictions = []
for predictions in raw_test_predictions:
  curr_predictions = []
  for prediction in predictions:
    curr_predictions.append(token_classifier.token_classes[np.argmax(prediction)])
  test_predictions.append(curr_predictions)
print('Number of Predictions (sentences) in data set:', len(test_predictions))
print('Example Predictions:', test_predictions[0])
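
The nested loop above can also be written as a single vectorized step. This is a minimal sketch on toy probabilities. The class order shown is an assumption made for illustration, and the real order comes from `token_classifier.token_classes`.

```python
import numpy as np

# Hypothetical class order for illustration only
classes = np.array(['null', 'precursor', 'target', 'operation'])

# Toy class probabilities for one sentence of three tokens, shape (1, 3, 4)
raw = np.array([[[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.6, 0.2, 0.1],
                 [0.2, 0.1, 0.1, 0.6]]])

# argmax over the last axis picks the most probable class per token;
# indexing the class array with the result maps indices back to label names
labels = classes[raw.argmax(axis=-1)].tolist()
print(labels)  # [['null', 'precursor', 'operation']]
```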

# Text and Data Mining From Predictions

## Tally up targets, precursors, and operations
For the first data mining example, we are going to look at the most common targets, precursors, and operations in the data set. This gives us a sense of what type of information our set contains, whether we have the correct data set for our task, and what types of noise we expect in the data. 

In [None]:
all_targets, all_precursors, all_operations = [],[],[]
for sentence, labels in zip(test_sentences, test_predictions):
  prev_label = ''
  for sent, label in zip(sentence, labels):
    if label == 'target':
      if prev_label == 'target':
        all_targets[-1] = all_targets[-1]+' '+sent  # We need to combine multiword targets to help with tokenization noise effects
      else:
        all_targets.append(sent)
    elif label == 'precursor':
      if prev_label == 'precursor':
        all_precursors[-1] = all_precursors[-1]+' '+sent
      else:
        all_precursors.append(sent)
    elif label == 'operation':
      if prev_label == 'operation':
        all_operations[-1] = all_operations[-1]+' '+sent
      else:
        all_operations.append(sent)
    prev_label = label
print('Total Number of Targets:',  len(all_targets))
print('Total Number of Precursors:', len(all_precursors))
print('Total Number of Operations:', len(all_operations))
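
The merging of adjacent tokens that share a label can be expressed more compactly with `itertools.groupby`. The sketch below is one possible alternative, not the repository's method. `merge_entities` and the toy sentence with its labels are hypothetical.

```python
from itertools import groupby

def merge_entities(tokens, labels, wanted=('target', 'precursor', 'operation')):
    """Join runs of adjacent tokens that share a label into one entity string."""
    entities = {label: [] for label in wanted}
    # groupby yields one group per run of consecutive (token, label) pairs with the same label
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label in entities:
            entities[label].append(' '.join(token for token, _ in run))
    return entities

tokens = ['Barium', 'titanate', 'was', 'made', 'from', 'BaCO3', 'and', 'TiO2', '.']
labels = ['target', 'target', 'null', 'operation', 'null', 'precursor', 'null', 'precursor', 'null']
merged = merge_entities(tokens, labels)
print(merged)
```

Here "Barium titanate" is returned as a single target, while "BaCO3" and "TiO2" stay separate precursors because a `null` token sits between them.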

We will make a simple visualization showing the ten most common targets, precursors, and operations across the entire data set. Since this is a small, randomly selected data set, we do not see overly common targets or precursors. It is also hard to see any connections between targets and precursors since the materials domains vary greatly.  

In [None]:
fig, ax = plt.subplots(1,3, figsize=(18,6))
plt.subplots_adjust(wspace=.55)
panels = [(all_targets, 'All Targets'), (all_precursors, 'All Precursors'), (all_operations, 'All Operations')]
for axis, (entities, title) in zip(ax, panels):
  top_counts = pd.Series(entities).value_counts().head(10)  # ten most frequent entities, counted once
  positions = np.arange(len(top_counts))  # one bar per entity actually found
  axis.barh(positions, top_counts.values)
  axis.invert_yaxis()
  axis.set_yticks(positions)
  axis.set_yticklabels(top_counts.index.tolist())
  axis.set_title(title)
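
The ranking behind each bar chart is a plain frequency count. As a minimal sketch, the same ranking can be obtained with the standard library. `top_entities` and the toy list are hypothetical.

```python
from collections import Counter

def top_entities(entities, n=10):
    """Return the n most frequent entity strings with their counts, most frequent first."""
    return Counter(entities).most_common(n)

toy_operations = ['heated', 'dried', 'heated', 'stirred', 'dried', 'heated']
print(top_entities(toy_operations, n=2))  # [('heated', 3), ('dried', 2)]
```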

Next, we will associate precursors and operations with specific targets. This is trickier than it may appear and requires us to make assumptions about the associations between targets, precursors, and operations within a paper. To keep things simple for the tutorial, we assume any precursor or operation within the same paper as our specified target is being used in the synthesis of that target. For the tutorial, we only look at three of the more common targets we found above: Carbon Nanotubes (CNT), Bi2Te3, and Tetraphenylporphyrin (TPP).

In [None]:
targets_dict = {}
curr_index = 0
targets = ['CNT', 'Bi2Te3', 'TPP']
for t in targets:
  targets_dict[t] = {}
  targets_dict[t]['operations'], targets_dict[t]['precursors'], targets_dict[t]['names'] = [],[],[]
for paper in data['data']:
  if paper['split'] == 'test':
    curr_targets, curr_precursors, curr_operations = [],[],[]
    for tokens in paper['tokens'][1:]:
      prev_label = ''
      for token, label in zip(tokens, test_predictions[curr_index]):
        if label == 'target':
          if prev_label == 'target':
            curr_targets[-1] = curr_targets[-1]+' '+token
          else:
            curr_targets.append(token)
        elif label == 'precursor':
          if prev_label == 'precursor':
            curr_precursors[-1] = curr_precursors[-1]+' '+token
          else:
            curr_precursors.append(token)
        elif label == 'operation':
          if prev_label == 'operation':
            curr_operations[-1] = curr_operations[-1]+' '+token
          else:
            curr_operations.append(token)
        prev_label = label  # remember the previous label so multiword entities are merged
      curr_index+=1
    for t in targets:
      for c in curr_targets:
        if t in c:
          targets_dict[t]['names'].append(c)
          targets_dict[t]['precursors'].extend(curr_precursors)
          targets_dict[t]['operations'].extend(curr_operations)
for t in targets:
  targets_dict[t]['names'] = list(np.unique(targets_dict[t]['names']))
  targets_dict[t]['precursors'] = list(np.unique(targets_dict[t]['precursors']))
  targets_dict[t]['operations'] = list(np.unique(targets_dict[t]['operations']))
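
The paper-level assumption can be seen in isolation on toy data. The sketch below is a simplified restatement rather than the cell above. `associate_by_paper` and the per-paper entity lists are hypothetical and are not taken from the data set.

```python
def associate_by_paper(papers, target_names):
    """Attach every precursor and operation in a paper to each requested target found in that paper."""
    associations = {name: {'precursors': set(), 'operations': set()} for name in target_names}
    for paper in papers:
        for name in target_names:
            # Substring match so that variants such as "MWCNT" count as "CNT"
            if any(name in found for found in paper['targets']):
                associations[name]['precursors'].update(paper['precursors'])
                associations[name]['operations'].update(paper['operations'])
    return associations

# Hypothetical per-paper extraction results
toy_papers = [
    {'targets': ['MWCNT'], 'precursors': ['ethanol'], 'operations': ['heated']},
    {'targets': ['Bi2Te3 nanowires'], 'precursors': ['BiCl3'], 'operations': ['stirred', 'dried']},
]
result = associate_by_paper(toy_papers, ['CNT', 'Bi2Te3'])
print(sorted(result['Bi2Te3']['operations']))  # ['dried', 'stirred']
```

The substring match is what makes the "names" field useful: it records which surface forms were counted as each target, so false matches are easy to spot.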

Now, we can examine the precursors and operations that are used for each target. The "names" field keeps track of all the different target variants that were found. 

In [None]:
for t in targets_dict:
  print(t)
  print('   Names:', [name for name in targets_dict[t]['names'] if any(c.isalpha() for c in name)])
  print('   Precursors:', [name for name in targets_dict[t]['precursors'] if any(c.isalpha() for c in name)])
  print('   Operations:', [name for name in targets_dict[t]['operations'] if any(c.isalpha() for c in name)])

## Temperature Text Mining
For the second data mining activity, we will extract synthesis temperatures and build in levels of detail around the temperatures. 

First we will just extract all the temperatures from the data. This can be done relatively easily by taking the words before the token "degC". This should give us a relatively accurate and precise set of temperatures.

In [None]:
all_temperatures = []
for paper in data['data']:
  if paper['split'] == 'test':
    for tokens in paper['tokens'][1:]:
      prev_token = ''
      for token in tokens:
        if token == 'degC':
          try:
            all_temperatures.append(float(prev_token))
          except ValueError:
            pass  # the token before "degC" is not a plain number
        prev_token = token
print('Number of Temperatures Found:', len(all_temperatures))
print('Minimum Temperature:', np.min(all_temperatures))
print('Maximum Temperature:', np.max(all_temperatures))
print('Mean Temperature:', round(np.mean(all_temperatures),1))
print('Median Temperature:', np.median(all_temperatures))
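
The "token before degC" rule is easy to check on a single sentence. In this minimal sketch, `extract_temperatures` and the example sentence are hypothetical.

```python
def extract_temperatures(tokens, unit='degC'):
    """Return (position, value) for each number that directly precedes the unit token."""
    found = []
    for i in range(1, len(tokens)):
        if tokens[i] == unit:
            try:
                found.append((i - 1, float(tokens[i - 1])))
            except ValueError:
                pass  # e.g. "several degC": the preceding token is not a number
    return found

sentence = "The powder was dried at 80 degC and calcined at 700 degC .".split()
print(extract_temperatures(sentence))  # [(5, 80.0), (10, 700.0)]
```

Returning the position alongside the value makes it possible to relate each temperature to nearby tokens afterwards.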

We visualize the temperature distribution with a violin plot, which extends naturally to the operation and target splits we add later.

In [None]:
temp_data = pd.DataFrame({'Temperature':all_temperatures})
sns.violinplot(data=temp_data, y='Temperature')

Next, we want to add levels of nuance to the temperature data. We will split the temperatures based on the operation, or synthesis step, in which each one occurs. To determine that, we take whichever operation is closest to the temperature within the sentence.

In [None]:
temperatures, operations = [],[]
label_index_count = 0
for paper in data['data']:
  if paper['split'] == 'test':
    for tokens in paper['tokens'][1:]:
      labels = test_predictions[label_index_count]
      # Positions of every token predicted as an operation in this sentence
      op_indexes = [i for i, label in enumerate(labels[:len(tokens)]) if label == 'operation']
      for temp_index in range(len(tokens) - 1):
        if tokens[temp_index+1] != 'degC':
          continue
        try:
          curr_temp = float(tokens[temp_index])
        except ValueError:
          continue  # the token before "degC" is not a plain number
        if len(op_indexes) > 0:
          # Use the actual position of the number so repeated values are not confused
          closest_index = min(op_indexes, key=lambda i: abs(temp_index-i))
          temperatures.append(curr_temp)
          operations.append(tokens[closest_index])
      label_index_count+=1
print('Number of Temperatures and Operations:', len(temperatures), len(operations))
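
The proximity rule can be illustrated on one sentence. This is a minimal sketch with hand-written labels. `closest_operation` and the labels are hypothetical rather than model output.

```python
def closest_operation(tokens, labels, temp_index):
    """Return the operation token nearest to position temp_index, or None if the sentence has none."""
    op_indexes = [i for i, label in enumerate(labels) if label == 'operation']
    if not op_indexes:
        return None
    return tokens[min(op_indexes, key=lambda i: abs(i - temp_index))]

tokens = "The powder was dried at 80 degC and calcined at 700 degC .".split()
# Hypothetical labels: only "dried" (position 3) and "calcined" (position 8) are operations
labels = ['operation' if i in (3, 8) else 'null' for i in range(len(tokens))]
print(closest_operation(tokens, labels, 5))   # dried
print(closest_operation(tokens, labels, 10))  # calcined
```

Token distance is only a heuristic. A sentence such as "calcined after drying at 80 degC" could pair a temperature with the wrong step, which is one source of noise to expect in the plots.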

To make the visualization cleaner, we only take the top five most common operations.

In [None]:
most_common_ops = pd.Series(operations).value_counts().index.tolist()[:5]
common_temps, common_ops = [],[]
for t, o in zip(temperatures, operations):
  if o in most_common_ops:
    common_temps.append(t)
    common_ops.append(o)
print('Number of Temperatures from top five operations:', len(common_temps), len(common_ops))

We plot the temperature data split by operations. We see calcination has a much higher spread in temperatures than the rest of the operations, while operations like drying and stirring have a much more consistent temperature across the data set. 

In [None]:
temp_op_data = pd.DataFrame({'Temperature':common_temps, 'Operation':common_ops})
sns.violinplot(data=temp_op_data, y='Temperature', x='Operation')

Finally, we add a target dimension to the temperature extraction. We follow the same assumption as before: all operations and temperatures occurring in the same paper as the target are part of making that target.

In [None]:
temperatures, operations, targets = [],[],[]
label_index_count = 0
curr_index = 0
for paper in data['data']:
  if paper['split'] == 'test':
    curr_targets = []
    for tokens in paper['tokens'][1:]:
      prev_label = ''
      for token, label in zip(tokens, test_predictions[curr_index]):
        if label == 'target':
          if prev_label == 'target':
            curr_targets[-1] = curr_targets[-1]+' '+token
          else:
            curr_targets.append(token)
        prev_label = label  # remember the previous label so multiword targets are merged
      curr_index+=1
    if len(curr_targets) == 0:
      label_index_count = label_index_count + len(paper['tokens'][1:])
      continue
    for tokens in paper['tokens'][1:]:
      labels = test_predictions[label_index_count]
      # Positions of every token predicted as an operation in this sentence
      op_indexes = [i for i, label in enumerate(labels[:len(tokens)]) if label == 'operation']
      for temp_index in range(len(tokens) - 1):
        if tokens[temp_index+1] != 'degC':
          continue
        try:
          curr_temp = float(tokens[temp_index])
        except ValueError:
          continue  # the token before "degC" is not a plain number
        if len(op_indexes) > 0:
          closest_index = min(op_indexes, key=lambda i: abs(temp_index-i))
          for c in curr_targets:
            if len(c) > 1:
              temperatures.append(curr_temp)
              operations.append(tokens[closest_index])
              targets.append(c)
      label_index_count+=1
print('Number of temperatures, operations, and targets:', len(temperatures), len(operations), len(targets))

We clean the data first by filtering down to only two targets, nickel oxides and carbon nanotubes. We then combine similar operations to give us more data to work with. 

In [None]:
small_temperatures, small_operations, small_targets = [],[],[]
for t, o, targ in zip(temperatures, operations, targets):
  if 'NiO' in targ:
    small_temperatures.append(t)
    small_operations.append(o)
    small_targets.append('NiO')
  elif 'CNT' in targ:
    small_temperatures.append(t)
    small_operations.append(o)
    small_targets.append('CNT')
cleaned_small_temperatures, cleaned_small_operations, cleaned_small_targets = [],[],[]
for t, o, targ in zip(small_temperatures, small_operations, small_targets):
  if o == 'dried' or o == 'drying':
    cleaned_small_temperatures.append(t)
    cleaned_small_operations.append('dry')
    cleaned_small_targets.append(targ)
  elif o == 'calcined' or o == 'calcination':
    cleaned_small_temperatures.append(t)
    cleaned_small_operations.append('calcine')
    cleaned_small_targets.append(targ)
  elif o == 'heated' or o == 'held' or o == 'set' or o == 'crystallization':
    cleaned_small_temperatures.append(t)
    cleaned_small_operations.append('heat')
    cleaned_small_targets.append(targ)
  elif o == 'stirred':
    cleaned_small_temperatures.append(t)
    cleaned_small_operations.append('stir')
    cleaned_small_targets.append(targ)
print('Final cleaned numbers of temperatures, operations, and targets:', len(cleaned_small_temperatures), len(cleaned_small_operations), len(cleaned_small_targets))
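
Combining similar operations amounts to a lookup from surface form to canonical name. The sketch below restates the groupings from the cell above as a dictionary. `CANONICAL_OPERATION`, `normalize_operations`, and the toy records are hypothetical names and data.

```python
# Lookup table mirroring the groupings used in the cell above
CANONICAL_OPERATION = {
    'dried': 'dry', 'drying': 'dry',
    'calcined': 'calcine', 'calcination': 'calcine',
    'heated': 'heat', 'held': 'heat', 'set': 'heat', 'crystallization': 'heat',
    'stirred': 'stir',
}

def normalize_operations(records):
    """Keep (temperature, operation, target) records whose operation has a canonical form."""
    return [(temp, CANONICAL_OPERATION[op], target)
            for temp, op, target in records if op in CANONICAL_OPERATION]

toy_records = [(80.0, 'dried', 'NiO'), (500.0, 'calcination', 'NiO'), (25.0, 'washed', 'CNT')]
print(normalize_operations(toy_records))  # [(80.0, 'dry', 'NiO'), (500.0, 'calcine', 'NiO')]
```

Keeping the mapping in one table makes it easy to see which operations are dropped and to extend the groupings for a larger corpus.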

The final visualization shows the temperature data split by operation and targets. We see that nickel oxides have a wide range of calcination temperatures whereas carbon nanotubes seem to have only a single calcination temperature. 

In [None]:
temp_op_data = pd.DataFrame({'Temperature':cleaned_small_temperatures, 'Operation':cleaned_small_operations, 'Target':cleaned_small_targets})
sns.violinplot(data=temp_op_data, y='Temperature', x='Operation', hue='Target', split=True)

# End of tutorial

# Additional Demonstrations

## Setup Environment

We need to download and install the bilm library from GitHub. For the code to work on Colab, first run the two cells below, then restart the runtime so that bilm is recognized as an installed package.
https://stackoverflow.com/questions/57838013/modulenotfounderror-after-successful-pip-install-in-google-colaboratory


In [None]:
!git clone https://github.com/allenai/bilm-tf.git

In [None]:
%cd bilm-tf/
!python setup.py install

******Restart the runtime*******

In [None]:
%tensorflow_version 1.x
import tensorflow as tf
import numpy as np
import json
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import pickle
import logging
logging.getLogger('tensorflow').disabled = True #OPTIONAL - to disable outputs from Tensorflow

In [None]:
!git clone https://github.com/olivettigroup/materials-synthesis-generative-models.git

In [None]:
%cd materials-synthesis-generative-models/

In [None]:
!wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt

In [None]:
# Download the pretrained-Elmo weights and config file from https://figshare.com/s/ec677e7db3cf2b7db4bf
!wget https://ndownloader.figshare.com/files/13773791?private_link=ec677e7db3cf2b7db4bf

In [None]:
# Unzip the weights and config file
!tar -xvf '13773791?private_link=ec677e7db3cf2b7db4bf'

In [None]:
from models import token_classifier

In [None]:
token_classifier = token_classifier.TokenClassifier(
    vocab="vocab-2016-09-10.txt", 
    options="elmo_options.json", 
    weights="elmo_weights.hdf5",
    use_cpu=True, load_data=False
)


## Featurize Data and Train NER Model

Load in the data from the repository 

In [None]:
with open('data/ner_annotations_split.json', 'r') as f:
  data = json.load(f)
print(data.keys())

In [None]:
train_sentences, dev_sentences, test_sentences = [],[],[]
train_labels, dev_labels, test_labels = [],[],[]
for paper in data['data']:
  if paper['split'] == 'train':
    train_sentences.extend(paper['tokens'][1:]) # first "sentence" is the title which we don't want right now
    train_labels.extend(paper['labels'][1:])
  elif paper['split'] == 'dev':
    dev_sentences.extend(paper['tokens'][1:])
    dev_labels.extend(paper['labels'][1:])
  else:
    test_sentences.extend(paper['tokens'][1:])
    test_labels.extend(paper['labels'][1:])
print(len(train_sentences), len(dev_sentences), len(test_sentences))
print(len(train_labels), len(dev_labels), len(test_labels))

Featurize the sentences into arrays of ELMO embeddings

In [None]:
train_elmo_features = token_classifier.featurize_elmo_list(train_sentences)
print('Train Input shape:', train_elmo_features.shape)
dev_elmo_features = token_classifier.featurize_elmo_list(dev_sentences)
print('Dev Input shape:', dev_elmo_features.shape)
test_elmo_features = token_classifier.featurize_elmo_list(test_sentences)
print('Test Input shape:', test_elmo_features.shape)

One-hot encode the annotation labels

In [None]:
def one_hot_encode(label_lists):
  # Returns an array of shape (n_sentences, seq_maxlen, n_classes); padded positions stay all-zero
  max_len = token_classifier._seq_maxlen
  n_classes = len(token_classifier.token_classes)
  encoded = np.zeros(shape=(len(label_lists), max_len, n_classes))
  for i, labels in enumerate(label_lists):
    for j, label in enumerate(labels[:max_len]):
      if label not in ['precursor', 'target', 'operation']:
        label = 'null'  # collapse every other annotation type into the null class
      encoded[i, j, token_classifier.inv_token_classes[label]] = 1.0
  return encoded

y_train = one_hot_encode(train_labels)
y_dev = one_hot_encode(dev_labels)
y_test = one_hot_encode(test_labels)
print('Train Output Shape:', y_train.shape)
print('Dev Output Shape:', y_dev.shape)
print('Test Output Shape:', y_test.shape)
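
To see what the encoding produces for a single sentence, here is a minimal sketch on toy labels. The label-to-index map and the short `max_len` are assumptions for illustration. The notebook itself uses `token_classifier.inv_token_classes` and `token_classifier._seq_maxlen`.

```python
import numpy as np

# Hypothetical label-to-index map for illustration only
label_to_index = {'null': 0, 'precursor': 1, 'target': 2, 'operation': 3}

def one_hot_sentence(labels, max_len=5):
    """Encode one sentence as a (max_len, n_classes) array; padded rows stay all-zero."""
    encoded = np.zeros((max_len, len(label_to_index)))
    for j, label in enumerate(labels[:max_len]):
        encoded[j, label_to_index.get(label, 0)] = 1.0  # unknown labels fall back to 'null'
    return encoded

example = one_hot_sentence(['target', 'null', 'operation'])
print(example)
```

Each labeled token gets exactly one 1.0 in its row, and the two padding rows at the end contain only zeros.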

Set variables

In [None]:
token_classifier.X_train = train_elmo_features
token_classifier.X_dev = dev_elmo_features
token_classifier.X_test = test_elmo_features
token_classifier.Y_train = y_train
token_classifier.Y_dev = y_dev
token_classifier.Y_test = y_test

Build the model in Keras

In [None]:
token_classifier.build_nn_model()

Train the model using early stopping to prevent overfitting

In [None]:
token_classifier.train(stop_early=True)
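
Early stopping halts training once the validation loss stops improving. How `train(stop_early=True)` implements this is not shown in this notebook. The sketch below assumes a common patience-based rule, and `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop, given per-epoch validation losses."""
    best, waited = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0   # improvement: reset the counter
        else:
            waited += 1              # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses) - 1       # rule never triggered: all epochs ran

# Hypothetical validation losses: improvement stalls after the third epoch
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.61]))  # 4
```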

Save model

In [None]:
token_classifier.save("bin/model_name.model")