# Using QuantityParser

By writing parsers that subclass `QuantityParser` and models that subclass `QuantityModel`, we can extract the values and units of properties not as strings, as was the case with `BaseParser` and `BaseModel`, but as (a list of) floats and a class representing the units. This makes it easy to convert and compare extracted properties, no matter how the original document wrote them. In this example, we rewrite the parser from extracting_a_custom_property using these classes.

In [1]:
from chemdataextractor import Document
from chemdataextractor.physicalmodels import Compound
from chemdataextractor.doc import Paragraph, Heading

## Example Document

Let's create a simple example document with a single heading followed by a single paragraph:

In [2]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

Here is what it looks like:

In [4]:
d

## Default Parsers

By default, ChemDataExtractor doesn't extract boiling points:

In [5]:
d.records.serialize()

[{'labels': ['3a'], 'names': ['2,4,6-trinitrotoluene'], 'roles': ['product']}]

## Defining a New QuantityModel

The first task is to define the schema of the new property and add it to the `Compound` model. Because `TemperatureModel`, a subclass of `QuantityModel`, is already defined, and every subclass of `QuantityModel` has a value and a unit, we don't need to define those fields ourselves. An example of how to define new types of quantities can be seen in `chemdataextractor.units.temperature.py`.

In [11]:
from chemdataextractor.units.temperatures import TemperatureModel, Temperature, Kelvin
from chemdataextractor.model import ListType, ModelType

class BoilingPoint(TemperatureModel):
    pass

Compound.boiling_points = ListType(ModelType(BoilingPoint))
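
The reason an empty subclass is enough can be illustrated with ordinary Python inheritance. The sketch below is only an analogy: `QuantityLike`, `TemperatureLike` and `BoilingPointLike` are hypothetical stand-ins and not ChemDataExtractor classes, whose real models declare their fields differently.

```python
# Hypothetical stand-ins; these are NOT ChemDataExtractor classes.
class QuantityLike(object):
    """Base class that owns the shared fields."""

    def __init__(self, value=None, units=None):
        self.value = value
        self.units = units


class TemperatureLike(QuantityLike):
    """Adds a fixed dimension label on top of the shared fields."""
    dimension = 'temperature'


class BoilingPointLike(TemperatureLike):
    pass  # nothing to declare: value, units and dimension are inherited


example = BoilingPointLike(value=240.0, units=u'°C')
print(example.value, example.units, example.dimension)
```

The subclass exists only to give the property its own name, which is the role `BoilingPoint` plays above.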

## Writing a New Parser

Next, write parsing rules that specify how to interpret text and convert it into the model. Once we define the dimensions in a `QuantityParser` subclass, it should be able to automatically parse the units it finds. If you encounter units that are missing, they can always be added; an example can be seen in `chemdataextractor.units.temperature.py`.

In [8]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge

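# R matches a token against a regex, I matches a word case-insensitively, W matches a word exactly.
# Raw strings keep the regex backslashes from being read as string escapes.
prefix = (R(r'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
# The degree sign and the optional unit letter are merged into a single 'units' string.
units = (W(u'°') + Optional(R(r'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(r'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')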
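
To make the rules above more concrete, here is a rough standard-library sketch of the same token-by-token matching using only `re`. It does not use ChemDataExtractor: `match_bp` is a hypothetical helper, the token list is an assumed tokenization of the example paragraph, and only the abbreviated `b.p.` prefix is handled.

```python
import re

# Assumed tokenization of "(b.p. 240 °C)"; the real tokenizer may differ.
tokens = [u'(', u'b.p.', u'240', u'°', u'C', u')']

prefix_re = re.compile(r'^b\.?p\.?$', re.I)
value_re = re.compile(r'^\d+(\.\d+)?$')
unit_letter_re = re.compile(r'^[CFK]\.?$')


def match_bp(tokens):
    """Return (value, units) strings for the first prefix/value/units run, else None."""
    for i in range(len(tokens) - 2):
        if not prefix_re.match(tokens[i]):
            continue
        if not value_re.match(tokens[i + 1]):
            continue
        if tokens[i + 2] != u'°':
            continue
        units = u'°'
        # The letter after the degree sign is optional, as in the rule above.
        if i + 3 < len(tokens) and unit_letter_re.match(tokens[i + 3]):
            units += tokens[i + 3]  # mimics merging adjacent tokens into one string
        return tokens[i + 1], units
    return None


result = match_bp(tokens)
print(result)
```

Each pattern is anchored with `^` and `$` because it is tested against one whole token at a time rather than against the full sentence.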

In [9]:
from chemdataextractor.parse.quantity import QuantityParser
from chemdataextractor.utils import first

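class BpParser(QuantityParser):
    # The parse rule this parser matches against the text.
    root = bp
    # The dimensions tell the parser which units it should expect to find.
    dimensions = Temperature()

    def interpret(self, result, start, end):
        # extract_value and extract_units turn the matched strings into
        # numeric values and a units object instead of leaving them as text.
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=self.extract_value(first(result.xpath('./value/text()'))),
                    units=self.extract_units(first(result.xpath('./units/text()')))
                )
            ]
        )
        yield compound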



In [10]:
Paragraph.parsers = [BpParser()]

## Running the New Parser

The parser can then be run on the document. Because `BoilingPoint` is a subclass of `QuantityModel`, we can then, for example, convert the result to kelvin.

In [13]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

for record in d.records:
    print(record.names[0])
    print('Boiling point was found to be:', record.boiling_points[0])
    kelvin_value = record.boiling_points[0].convert_to(Kelvin())
    print('Boiling point in Kelvins is:', kelvin_value)

2,4,6-trinitrotoluene
Boiling point was found to be: Quantity with Temperature, Celsius^(1.0) and a value of 240.0
Boiling point in Kelvins is: 513.15
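
The converted value can be checked by hand, and the same arithmetic shows why numeric values with units are easier to compare than strings. This is a plain-Python sketch that does not use ChemDataExtractor; `to_kelvin` and its single-letter unit codes are hypothetical and exist only for this illustration.

```python
# Plain-Python check; no ChemDataExtractor API is used here.
def to_kelvin(value, unit):
    """Convert a temperature to kelvin; `unit` is 'C', 'F' or 'K'."""
    if unit == 'C':
        return value + 273.15
    if unit == 'F':
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == 'K':
        return value
    raise ValueError('unknown unit: %r' % unit)


# The boiling point from the example paragraph, converted by hand.
kelvin_reading = to_kelvin(240.0, 'C')
print(round(kelvin_reading, 2))

# The same temperature as another document might write it.
as_strings = (u'240 °C', u'464 °F')
strings_equal = as_strings[0] == as_strings[1]

# Once both readings are numbers in a common unit, they compare as equal.
numbers_equal = abs(to_kelvin(240.0, 'C') - to_kelvin(464.0, 'F')) < 1e-9
print(strings_equal, numbers_equal)
```

Comparing the raw strings fails even though both describe the same temperature, while the converted numbers agree.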
