## Snowball Relationship Extraction

The new Relex package is a toolkit for performing probabilistic chemical relationship extraction based on semi-supervised online learning. The aim is to train parse expressions probabilistically, removing the need for creating parsers with trial and error.

This overview focuses on how to use the code. For a detailed explanation of the algorithm, please see the associated paper: https://www.nature.com/articles/sdata2018111


In general, chemical relationships can consist of any number of entities, that is, the elements of a relationship that are linked together to uniquely define it. Here we will focus on a simple Curie Temperature relationship that consists of the following entities:
- A compound
- A specifier
- A value
- A unit

Together these form a quaternary relationship. The algorithm is general, so any number of entities can be specified. You can even make some entities more important than others.

First define a new model, as usual:

In [20]:
from chemdataextractor.relex import Snowball
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound, TemperatureModel
import re
from chemdataextractor.parse import R, I, W, Optional, merge, join, OneOrMore, Any, ZeroOrMore, Start
from chemdataextractor.doc import Sentence

In [21]:
class CurieTemperature(TemperatureModel):
    specifier_expression = ((I('Curie') + I('temperature')) | I('Tc')).add_action(join)
    specifier = StringType(parse_expression=specifier_expression, required=True, contextual=True, updatable=True)
    compound = ModelType(Compound, required=True, contextual=True, updatable=False)

We are now ready to start Snowballing.

# Training

Create a ```Snowball``` object to use on our relationship and point to a path for training.

Here we will use the default parameters:
- Tc = 0.95, the minimum confidence required for a new relationship to be accepted
- Tsim = 0.95, the minimum similarity between phrases for them to be clustered together
- learning_rate = 0.5, how quickly the system updates the confidences based on new information
- prefix_length = 1, number of tokens in the phrase prefix
- suffix_length = 1, number of tokens in the phrase suffix
- prefix_weight = 0.1, the weight of the prefix in determining similarity
- middles_weight = 0.8, the weight of the middles in determining similarity
- suffix_weight = 0.1, the weight of the suffix in determining similarity
- max_candidate_combinations = 400, maximum number of candidate,name combinations to retrieve (memory constrained)
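
To make the role of the three weights and of Tsim concrete, here is a minimal sketch of a weighted phrase similarity. It is not the package's implementation: the function names, the dictionary layout of a phrase, and the bag-of-words cosine measure are all assumptions made for illustration. Only the default weights are taken from the list above.

```python
from collections import Counter
from math import sqrt


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists treated as bags of words."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def phrase_similarity(p, q, prefix_weight=0.1, middles_weight=0.8, suffix_weight=0.1):
    """Weighted sum of the prefix, middles and suffix similarities (hypothetical helper)."""
    return (prefix_weight * cosine(p["prefix"], q["prefix"])
            + middles_weight * cosine(p["middles"], q["middles"])
            + suffix_weight * cosine(p["suffix"], q["suffix"]))


# Two phrases that differ only in their one-token prefix (illustrative data).
p = {"prefix": ["a"], "middles": ["exhibiting", "a", "high"], "suffix": ["("]}
q = {"prefix": ["the"], "middles": ["exhibiting", "a", "high"], "suffix": ["("]}

tsim = 0.95
print(phrase_similarity(p, q) >= tsim)  # would these two phrases be clustered together?
```

With these weights a mismatch in the prefix alone costs a tenth of the score, so the two phrases score about 0.9 and fall just short of the default Tsim. Most of the similarity is decided by the middles.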

Note that increasing Tc and Tsim yields more extraction patterns but imposes stricter rules on new relations.

Increasing the learning rate increases how much we trust new information relative to what was learned during training.

Increasing the prefix/suffix length increases the likelihood of getting overlapping relationships.
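
The effect of the learning rate can be pictured as a blend between an old and a new confidence. The sketch below assumes a simple linear blending rule. The exact update used by the package is described in the paper, so treat `update_confidence` as a hypothetical helper rather than the library's API.

```python
def update_confidence(old_confidence, new_confidence, learning_rate=0.5):
    """Blend a previous confidence with a newly computed one (assumed update rule)."""
    return (1.0 - learning_rate) * old_confidence + learning_rate * new_confidence


# With the default learning rate, old and new information are trusted equally.
print(update_confidence(1.0, 0.5))

# A learning rate of 0 ignores new information; a learning rate of 1 discards the old value.
print(update_confidence(0.8, 0.2, learning_rate=0.0))
print(update_confidence(0.8, 0.2, learning_rate=1.0))
```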


The training process is online. This means that the user can train the system on as many papers as they like, and it will continue to update the knowledge base. For each paper, the sentences are scanned for any matches to the parse phrase, and if a sentence matches, candidate relationships are formed. There can be many candidate relationships in a single sentence, so the output provides the user with all available candidates.
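
To see why one sentence can produce several candidates, the following sketch enumerates every combination of one entity per type. It is an illustration only, not the package's candidate generation: `candidate_relationships` and its input layout are hypothetical. The `(text, tag, start, end)` tuples mirror the candidate format printed during training, and the cap reuses the `max_candidate_combinations` default.

```python
from itertools import product


def candidate_relationships(entities_by_type, max_candidate_combinations=400):
    """Form one candidate per combination of entities, one entity from each type."""
    types = sorted(entities_by_type)
    candidates = []
    for combo in product(*(entities_by_type[t] for t in types)):
        if len(candidates) >= max_candidate_combinations:
            break  # cap the number of combinations to bound memory use
        # Order the entities by their start token, as in the printed candidates.
        candidates.append(tuple(sorted(combo, key=lambda entity: entity[2])))
    return candidates


# Entities as (text, tag, start, end); a simplified, illustrative sentence.
entities = {
    "compound_names": [("Cobalt", "compound_names", 0, 1),
                       ("transition metal", "compound_names", 4, 6)],
    "specifier": [("Curie temperature", "specifier", 9, 11)],
    "raw_value": [("1388", "raw_value", 12, 13)],
    "raw_units": [("K", "raw_units", 13, 14)],
}

for number, candidate in enumerate(candidate_relationships(entities)):
    print("Candidate", number, candidate)
```

Two compound mentions with a single specifier, value and unit give two candidates. Every extra mention of any entity type multiplies the count, which is why a cap on combinations is useful.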

The user accepts relationships by typing the number (or numbers) of the candidates they wish to accept. For example, to accept candidate 0 only, type '0' and press enter. To accept 0 and 3, type '0,3' and press enter. To accept none, enter anything else, e.g. 'n' or 'no'.
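
The input convention described above can be sketched as a small parsing function. This is a hypothetical helper written for illustration. The package's own prompt handling may differ in details such as how out-of-range numbers are treated.

```python
def parse_selection(response, n_candidates):
    """Turn a response such as '0' or '0,3' into a list of accepted candidate indices.

    Anything that is not a comma-separated list of integers (for example 'n')
    is treated as accepting nothing.
    """
    try:
        indices = [int(part) for part in response.split(",")]
    except ValueError:
        return []
    # Silently drop indices that do not correspond to a displayed candidate (an assumption).
    return [i for i in indices if 0 <= i < n_candidates]


print(parse_selection("0", 4))
print(parse_selection("0,3", 4))
print(parse_selection("n", 4))
```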


In [3]:
snowball = Snowball(CurieTemperature)

In [4]:
snowball.save_file_name = 'curie_temperatures'

In [5]:
snowball.train(corpus='../tests/data/relex/curie_training_set/')

1/6:   c2jm33712f.html



Cobalt is a ferromagnetic transition metal exhibiting a high Curie temperature of 1388 K (ferromagnetic–paramagnetic transition) and a high saturation magnetization (1422 emu cm−3) at room temperature.2 The technological applications of Co nanoparticles in the field of ultrahigh-density data recording and data storage are well documented in the literature.1 Recently, Co has also been used in MRI agents,3 a field which has primarily been dominated by iron oxides (Fe3O4) because of their stability and biocompatibility albeit the oxides show much less saturation magnetization (84 emu cm−3)4 in comparison to cobalt.


Candidate 0 <(Cobalt,compound_names,0,1), (Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14)>

Candidate 1 <(transition metal,compound_names,4,6), (Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14)>

Candidate 2 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), 

...: n



The EuS nanocrystals have low coercive fields HC (at 2 K, HC = 90 Oe) and low remanence Mr, this rapidly approaches zero as temperature increases, being zero above TC (Fig. 7).


Candidate 0 <(EuS,compound_names,1,2), (2,raw_value,10,11), (K,raw_units,11,12), (TC,specifier,34,35)>

...: n



The Arrot plots shown in Fig. 6a, right, and obtained from the M(H) curves, confirm that the Curie temperature is around 15 K.


Candidate 0 <(Curie temperature,specifier,20,22), (15,raw_value,24,25), (K,raw_units,25,26)>

...: n



In what concerns the other range (>40 K), we observe a maximum at 60 K (Fig. 6b), probably related to a EuO-like environment located at the surface of the nanocrystals, since EuO has a Tc at 69 K.8 The AC susceptibility curves show a high temperature region Tir > 11 K for which the in-phase susceptibility χ′ is not frequency dependent and the out-of-phase susceptibility χ″ is close to zero.


Candidate 0 <(K,raw_units,9,10), (60,raw_value,17,18), (EuO,compound

This training process automatically clusters the sentences you accept and updates the knowledge base. You can check what has been learned by looking in the relex/data folder.

You can always stop training and start again, or come back to the same training process later. To resume, simply load an existing Snowball system using ```Snowball.load()```.

# Finding New Relationships
We can now use the Snowball object just like any other parser. Simply add the trained Snowball parser to your models.

In [22]:
CurieTemperature.parsers += [snowball]

Now we can parse a completely new sentence and get the associated records.

In [25]:
import pprint
s = Sentence('BiFeO3 is a ferromagnetic transition metal exhibiting a Curie temperature of 1103 K (')
s.add_models([CurieTemperature])
pprint.pprint(s.records.serialize())

[{'Compound': {'names': ['BiFeO3']}},
 {'Compound': {'names': ['transition metal']}},
 {'CurieTemperature': {'compound': {'Compound': {'names': ['BiFeO3']}},
                       'raw_units': 'K',
                       'raw_value': '1103',
                       'specifier': 'Curie temperature',
                       'units': 'Kelvin^(1.0)',
                       'value': [1103.0]}},
 {'CurieTemperature': {'compound': {'Compound': {'names': ['BiFeO3']}},
                       'confidence': 0.9802186932726803,
                       'raw_units': 'K',
                       'raw_value': '1103',
                       'specifier': 'Curie temperature',
                       'units': 'Kelvin^(1.0)',
                       'value': [1103.0]}}]


As shown, the Snowball parser correctly picks out the relation and adds a confidence score to the output. The record without the confidence score is the one picked up by the AutoSentenceParser().
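
Since only the Snowball records carry a `confidence` key, the serialized output can be separated with plain Python. The sketch below works on a list shaped like the output above, with some fields trimmed for brevity. `split_by_confidence` and its `min_confidence` argument are hypothetical names introduced here, not part of the package.

```python
def split_by_confidence(serialized_records, model_name="CurieTemperature", min_confidence=0.0):
    """Separate records that carry a confidence score from those that do not."""
    with_confidence, without_confidence = [], []
    for record in serialized_records:
        fields = record.get(model_name)
        if fields is None:
            continue  # skip records of other models, e.g. bare Compound records
        if "confidence" not in fields:
            without_confidence.append(fields)
        elif fields["confidence"] >= min_confidence:
            with_confidence.append(fields)
    return with_confidence, without_confidence


# A trimmed copy of the serialized records shown above.
records = [
    {"Compound": {"names": ["BiFeO3"]}},
    {"CurieTemperature": {"compound": {"Compound": {"names": ["BiFeO3"]}},
                          "raw_units": "K", "raw_value": "1103",
                          "specifier": "Curie temperature"}},
    {"CurieTemperature": {"compound": {"Compound": {"names": ["BiFeO3"]}},
                          "confidence": 0.9802186932726803,
                          "raw_units": "K", "raw_value": "1103",
                          "specifier": "Curie temperature"}},
]

snowball_records, other_records = split_by_confidence(records, min_confidence=0.95)
print(len(snowball_records), len(other_records))
```

Raising `min_confidence` is one way to trade recall for precision when post-processing a large batch of extracted records.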