### Getting started
#### Week 24/01/2022

In [1]:
from batterydataextractor.doc.document import Document, Paragraph

In [2]:
doc = Document('UV-vis spectrum of 5,10,15,20-Tetra(4-carboxyphenyl)porphyrin in Tetrahydrofuran (THF).',
              'The mechanism of lithium intercalation in the so-called ‘soft’ anodes, i.e. graphite or graphitable carbons, is well known: it develops through well-identified, reversible stages, corresponding to progressive intercalation within discrete graphene layers, to reach the formation of LiC6 with a maximum theoretical capacity of 372 ± 2.4 mAh g−1.')

In [3]:
doc.elements

[Paragraph(id=None, references=[], text='UV-vis spectrum of 5,10,15,20-Tetra(4-carboxyphenyl)porphyrin in Tetrahydrofuran (THF).'),
 Paragraph(id=None, references=[], text='The mechanism of lithium intercalation in the so-called ‘soft’ anodes, i.e. graphite or graphitable carbons, is well known: it develops through well-identified, reversible stages, corresponding to progressive intercalation within discrete graphene layers, to reach the formation of LiC6 with a maximum theoretical capacity of 372 ± 2.4 mAh g−1.')]

In [4]:
para = doc.elements[0]
para

In [5]:
para.sentences

[Sentence('UV-vis spectrum of 5,10,15,20-Tetra(4-carboxyphenyl)porphyrin in Tetrahydrofuran (THF).', 0, 87)]

In [6]:
para.tokens

[[Token('UV', 0, 2),
  Token('-', 2, 3),
  Token('vis', 3, 6),
  Token('spectrum', 7, 15),
  Token('of', 16, 18),
  Token('5,10,15,20-Tetra(4-carboxyphenyl)porphyrin', 19, 61),
  Token('in', 62, 64),
  Token('Tetrahydrofuran', 65, 80),
  Token('(', 81, 82),
  Token('THF', 82, 85),
  Token(')', 85, 86),
  Token('.', 86, 87)]]

In [8]:
doc.cems

AttributeError: 'NoneType' object has no attribute 'start'

In [9]:
p = Paragraph(u'Dye-sensitized solar cells (DSSCs) with ZnTPP = Zinc tetraphenylporphyrin.')
p.abbreviation_definitions

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[(['DSSCs'], ['Dye', '-', 'sensitized', 'solar', 'cells'], None),
 (['ZnTPP'], ['Zinc', 'tetraphenylporphyrin'], 'CM')]
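
The tuples returned above pair abbreviation tokens with long-form tokens and an entity tag. A minimal sketch of turning them into a lookup table follows; `build_abbreviation_lookup` is a hypothetical helper, not part of batterydataextractor, and the hyphen re-joining is a naive assumption about how the tokens should be detokenized.

```python
# Tuples copied from the `abbreviation_definitions` output above:
# (abbreviation tokens, long-form tokens, entity tag)
definitions = [
    (['DSSCs'], ['Dye', '-', 'sensitized', 'solar', 'cells'], None),
    (['ZnTPP'], ['Zinc', 'tetraphenylporphyrin'], 'CM'),
]


def build_abbreviation_lookup(defs):
    """Map each abbreviation string to its long form (hypothetical helper)."""
    lookup = {}
    for short_tokens, long_tokens, _tag in defs:
        short = " ".join(short_tokens)
        # Naive detokenization: re-attach hyphens that were split into their own tokens.
        long_form = " ".join(long_tokens).replace(" - ", "-")
        lookup[short] = long_form
    return lookup


lookup = build_abbreviation_lookup(definitions)
print(lookup)
```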

In [10]:
s = p.sentences[0]

In [11]:
s.tokens[0]

Token('Dye', 0, 3)

In [12]:
s.tokens[0].lex.normalized

'Dye'

In [13]:
s.tokens[0].lex.is_hyphenated

False

### Parser

In [4]:
from batterydataextractor.nlp import BertCemTagger, CemTagger
from transformers import pipeline
dt = BertCemTagger()
ct = CemTagger()

In [5]:
question_answerer = pipeline("question-answering", "batterydata/test1")

In [6]:
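def new_parser(name, context):
    # Ask the QA model for the named item and print it with its confidence score.
    result = question_answerer(question="What is the {}?".format(name), context=context)
    # Default to None so `answer` is always defined, even if the score check fails.
    answer = result['answer'] if result['score'] > 0 else None
    print("{}:".format(name.capitalize()), answer, " Confidence score:", result['score'])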

In [7]:
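def new_property_only_parser(name, context):
    # Ask the QA model for the property value only (no material lookup).
    result = question_answerer(question="What is the value of {}?".format(name), context=context)
    # Default to None so `answer` is always defined, even if the score check fails.
    answer = result['answer'] if result['score'] > 0 else None
    print("Property value:", answer, " Confidence score:", result['score'])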

In [8]:
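def new_property_parser(name, context):
    # Ask for the property value, then look for a chemical mention in the same context.
    result = question_answerer(question="What is the value of {}?".format(name), context=context)
    # Default to None so `answer` is always defined, even if the score check fails.
    answer = result['answer'] if result['score'] > 0 else None
    # Whitespace tokenization with a dummy 'NN' POS tag for every token.
    ner = [(token, 'NN') for token in context.split(" ")]
    material = None  # stays None if no chemical mention is tagged
    for tagged in ct.tag(ner):
        # Keep the first token of the last 'B-CM' mention, as in the original loop.
        if tagged[-1] == 'B-CM':
            material = tagged[0][0]
    print("Material:", material, " Property value:", answer, " Confidence score:", result['score'])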

#### Demo

In [9]:
property_name = "mp"
context = '4-Acetylamino-3-chloro-6-(4-cyano-2,6-difluoro-3-methoxyphenyl)pyridine-2-carboxylic acid: mp 146-147 °C.'
new_property_parser(property_name, context)

Material: 4-Acetylamino-3-chloro-6-(4-cyano-2,6-difluoro-3-methoxyphenyl)pyridine-2-carboxylic  Property value: 146-147 °C  Confidence score: 0.4786563217639923
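
The material above stops at the first whitespace-separated token, because `new_property_parser` keeps only the token tagged `B-CM`. A minimal sketch of merging `B-CM`/`I-CM` tags into full mentions follows. It assumes the tagger output has the shape `((token, pos), tag)` implied by the indexing in `new_property_parser`; `merge_bio_mentions` is a hypothetical helper and the tags in the example are made up for illustration, not real tagger output.

```python
def merge_bio_mentions(tagged, label="CM"):
    """Join B-/I- tagged tokens into mention strings (hypothetical helper)."""
    mentions, current = [], []
    for (token, _pos), tag in tagged:
        if tag == "B-" + label:
            # A new mention starts; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-" + label and current:
            # Continuation of the open mention.
            current.append(token)
        else:
            # Any other tag ends the open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Illustrative tags only (assumed, not produced by CemTagger).
tagged = [
    (("The", "NN"), None),
    (("lithium", "NN"), "B-CM"),
    (("iron", "NN"), "I-CM"),
    (("phosphate", "NN"), "I-CM"),
    (("cathode", "NN"), None),
    (("and", "NN"), None),
    (("graphite", "NN"), "B-CM"),
]
mentions = merge_bio_mentions(tagged)
print(mentions)
```

Whether the real tagger marks trailing words such as "acid:" as `I-CM` would need to be checked against its actual output.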


In [10]:
property_name = "glass transition temperature"
context = 'The poly(azide) shows a glass transition temperature of 282.6 °C.'
new_property_parser(property_name, context)

# ChemDataExtractor (CDE) result: only the value 282.6, no material

Material: poly(azide)  Property value: 282.6  Confidence score: 0.757524847984314


In [11]:
context = 'The four-armed compd. (ANTH-OXA6t-OC12) with the dodecyloxy surface group is a high glass transition temp. (Tg:  211°) material and exhibits good soly.'
new_property_parser(property_name, context)

Material: dodecyloxy  Property value: Tg:  211°  Confidence score: 0.6190184354782104


In [12]:
property_name = 'uvvis'
context = 'λabs/nm 320, 380, 475, 529;'
new_property_only_parser(property_name, context)

Property value: λabs/nm 320, 380, 475, 529  Confidence score: 1.7138758039436652e-06
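
The score above is about 1.7e-06, yet it still passes the `score > 0` check used in the parser functions, so that check accepts practically every answer. A minimal sketch of an explicit threshold follows. It works on plain dicts with the `answer` and `score` keys printed above; `accept_answer` and the default threshold of 0.1 are hypothetical and untuned.

```python
def accept_answer(result, threshold=0.1):
    """Return the answer text if its score clears the threshold, else None.

    `result` is a dict with 'answer' and 'score' keys, the shape printed above.
    The default threshold is an arbitrary illustration, not a tuned value.
    """
    if result["score"] >= threshold:
        return result["answer"]
    return None


# Results copied from the demo outputs above.
uvvis = {"answer": "λabs/nm 320, 380, 475, 529", "score": 1.7138758039436652e-06}
tg = {"answer": "282.6", "score": 0.757524847984314}

print(accept_answer(uvvis))
print(accept_answer(tg))
```

A threshold trades recall for precision: the UV-vis answer looks plausible despite its near-zero score, so a cut-off like this would discard it.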


In [13]:
property_name = 'capacity'
context = 'The mechanism of lithium intercalation in the so-called ‘soft’ anodes, i.e. graphite or graphitable carbons, is well known: it develops through well-identified, reversible stages, corresponding to progressive intercalation within discrete graphene layers, to reach the formation of LiC6 with a maximum theoretical capacity of 372 ± 2.4 mAh g−1.'
new_property_parser(property_name, context)

Material: LiC6  Property value: 372 ± 2.4 mAh g−1  Confidence score: 0.34231090545654297


TODO: just definition, material, etc.

In [14]:
property_name = 'apparatus'
context1 = 'The photoluminescence quantum yield (PLQY) was measured using a HORIBA Jobin Yvon FluoroMax-4 spectrofluorimeter'
context2 = '1H NMR spectra were recorded on a Varian MR-400 MHz instrument.'
new_parser(property_name, context1)
new_parser(property_name, context2)

Apparatus: FluoroMax-4 spectrofluorimeter  Confidence score: 0.05023842304944992
Apparatus: Varian MR-400 MHz instrument  Confidence score: 0.5510933995246887


### Springer reader, Parser, Model in BDE
#### Week 09/02/2022


In [17]:
from batterydataextractor.doc import Document
spr = Document.from_file(r"tests/testpapers/spr_test1.xml")
spr.elements
# records = spr.records
# print(spr.records.serialize())

[{'title': 'Nano and Battery Anode: A Review', 'authors': ['Majdi', 'Latipov', 'Borisov', 'Yuryevna', 'Kadhim', 'Suksatan', 'Khlewee', 'Kianfar'], 'publisher': 'Springer US', 'journal': 'Nanoscale Research Letters', 'date': '20211211', 'volume': '16', 'issue': '1', 'doi': '10.1186/s11671-021-03631-x', 'abstract': 'Improving the anode properties, including increasing its capacity, is one of the basic necessities to improve battery performance. In this paper, high-capacity anodes with alloy performance are introduced, then the problem of fragmentation of these anodes and its effect during the cyclic life is stated. Then, the effect of reducing the size to the nanoscale in solving the problem of fragmentation and improving the properties is discussed, and finally the various forms of nanomaterials are examined. In this paper, electrode reduction in the anode, which is a nanoscale phenomenon, is described. The negative effects of this phenomenon on alloy anodes are expressed and how to eli

In [16]:
from batterydataextractor.doc import Document
doc = Document("The cathode of Li-ion battery is LiFePO4. The voltage is 3.3V for NaCl. However, the capacity of LiFePO4 is 3 mAh/g.")
doc.add_models_by_names(["capacity", "voltage"])
record = doc.records
print(record)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[<PropertyData>, <PropertyData>]


In [18]:
for r in record:
    print(r.serialize())

{'PropertyData': {'value': [3.3], 'units': 'V', 'specifier': 'voltage', 'material': 'NaCl'}}
{'PropertyData': {'value': [3.0], 'units': 'mAh / g', 'specifier': 'capacity', 'material': 'LiFePO4'}}
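
Each serialized record above is a dict with a single key naming the record type. A minimal sketch of flattening such records into rows and writing them as JSON Lines follows, as one possible starting point for the "save to a database" TODO. `flatten` is a hypothetical helper and the input is copied from the output above rather than produced by batterydataextractor.

```python
import json

# Serialized records copied from the output above.
serialized = [
    {'PropertyData': {'value': [3.3], 'units': 'V', 'specifier': 'voltage', 'material': 'NaCl'}},
    {'PropertyData': {'value': [3.0], 'units': 'mAh / g', 'specifier': 'capacity', 'material': 'LiFePO4'}},
]


def flatten(record):
    """Turn {'Type': {...fields}} into {'type': 'Type', ...fields} (hypothetical helper)."""
    # Each record is assumed to hold exactly one top-level key.
    (record_type, fields), = record.items()
    return {"type": record_type, **fields}


rows = [flatten(r) for r in serialized]
# One JSON object per line; ensure_ascii=False keeps symbols such as '°' readable.
lines = "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)
print(lines)
```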


In [19]:
print(doc.cems)

[Span('LiFePO4', 33, 40), Span('NaCl', 66, 70), Span('LiFePO4', 97, 104)]
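
The spans above carry character offsets into the document text. A quick sanity check, sketched here with plain tuples instead of the library's `Span` objects, is to slice the original string with each offset pair and compare the result with the reported text.

```python
# Document text and span offsets copied from the cells above.
text = ("The cathode of Li-ion battery is LiFePO4. The voltage is 3.3V for NaCl. "
        "However, the capacity of LiFePO4 is 3 mAh/g.")
spans = [("LiFePO4", 33, 40), ("NaCl", 66, 70), ("LiFePO4", 97, 104)]

# Collect any span whose offsets do not reproduce its text.
mismatches = [(label, start, end)
              for label, start, end in spans
              if text[start:end] != label]
print(mismatches)
```

An empty list means every offset pair is end-exclusive and points at the expected mention.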


In [20]:
property_name = "mp"
context = '4-Acetylamino-3-chloro-6-(4-cyano-2,6-difluoro-3-methoxyphenyl)pyridine-2-carboxylic acid: mp 146-147 °C.'
doc2 = Document(context)
doc2.add_models_by_names([property_name])
print(doc2.records.serialize())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'PropertyData': {'value': [146.0, 147.0], 'units': '° C', 'specifier': 'mp', 'material': 'pyridine-2-carboxylic acid'}}]


In [21]:
property_name = "glass transition temperature"
context = 'The poly(azide) shows a glass transition temperature of 282.6 °C.'
doc3 = Document(context)
doc3.add_models_by_names([property_name])
print(doc3.records.serialize())
# ChemDataExtractor (CDE) result: only the value 282.6, no material

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'PropertyData': {'value': [282.6], 'units': '° C', 'specifier': 'glass transition temperature', 'material': 'poly(azide)'}}]


In [24]:
property_name = 'capacity'
context = 'The mechanism of lithium intercalation in the so-called ‘soft’ anodes, i.e. graphite or graphitable carbons, is well known: it develops through well-identified, reversible stages, corresponding to progressive intercalation within discrete graphene layers, to reach the formation of LiC6 with a maximum theoretical capacity of 372 mAh g−1.'
doc4 = Document(context)
doc4.add_models_by_names([property_name])
print(doc4.records.serialize())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'PropertyData': {'value': [372.0, 1.0], 'specifier': 'capacity', 'material': 'LiC6'}}]
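
The record above reports `'value': [372.0, 1.0]` and no units, which suggests that the exponent in `g−1` was read as a second value. A minimal regex sketch follows that reads only the leading number (with an optional range or ± uncertainty) and treats everything after it as units. `parse_quantity` is a hypothetical helper and not how batterydataextractor parses values.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# Leading value, optional range end ("146-147"), optional uncertainty ("± 2.4"),
# then everything else is kept verbatim as the units string.
QUANTITY = re.compile(
    rf"^\s*(?P<low>{NUMBER})"
    rf"(?:\s*[-–]\s*(?P<high>{NUMBER}))?"
    rf"(?:\s*±\s*(?P<err>{NUMBER}))?"
    rf"\s*(?P<units>.*?)\s*$"
)


def parse_quantity(answer):
    """Split an answer string into values, uncertainty and units (hypothetical helper)."""
    match = QUANTITY.match(answer)
    if match is None:
        return None
    values = [float(match["low"])]
    if match["high"] is not None:
        values.append(float(match["high"]))
    error = float(match["err"]) if match["err"] is not None else None
    return {"value": values, "error": error, "units": match["units"]}


print(parse_quantity("372 mAh g−1"))
print(parse_quantity("372 ± 2.4 mAh g−1"))
print(parse_quantity("146-147 °C"))
```

Because digits inside the units string are never parsed as values, an exponent such as the `−1` in `g−1` stays part of the units.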


In [28]:
s = 'The lithium iron phosphate battery (LiFePO4 battery) or LFP battery (lithium ferrophosphate), is a type ' \
            'of lithium-ion battery using lithium iron phosphate (LiFePO4) as the cathode material, and a graphitic ' \
            'carbon electrode with a metallic backing as the anode.'
p = Paragraph(s)
p.add_general_models(["anode", "cathode"])
record = p.records
for r in record:
    print(r.serialize())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'GeneralInfo': {'answer': 'graphitic carbon electrode with a metallic backing', 'specifier': 'anode'}}
{'GeneralInfo': {'answer': 'lithium iron phosphate', 'specifier': 'cathode'}}


In [27]:
s = '1H NMR spectra were recorded on a Varian MR-400 MHz instrument.'
p = Paragraph(s)
p.add_general_models(["apparatus"])
record = p.records
for r in record:
    print(r.serialize())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'GeneralInfo': {'answer': 'Varian MR - 400 MHz instrument', 'specifier': 'apparatus'}}


TODO:

- ~~Define relations between compound and property.~~
- Handle more complicated cases (e.g. a long paragraph containing multiple properties or materials).
- Incorporate the "Unit Model" (maybe compare performance; it should increase precision but reduce recall).
- ~~General Parser (Non-property parser): Anode/Cathode/Electrolyte; Apparatus~~
- ~~Definition?~~
- Option to save results to a database
- Bootstrap option?
- Fix compound names that span two words