# Extracting information from a paragraph
---
Now that we have our paragraphs, let's see what kind of information we can get out of them! The first step is to import some libraries, including `synparagraph`, which I wrote for this specific purpose.

In [1]:
import os
import sys
import matplotlib.pyplot as plt

try:
    from synoracle.synparagraph import SynParagraph
except ModuleNotFoundError:
    module_path = os.path.abspath(os.path.join('..'))
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.synparagraph import SynParagraph

In [2]:
import pandas as pd
import numpy as np
import pint
ureg = pint.UnitRegistry()
Q_ = ureg.Quantity

from glob import glob
from tqdm.notebook import tqdm, trange

def li_iterate(li):
    """Yield the items of `li` in order while showing a notebook progress bar."""
    l = iter(li)
    for _ in trange(len(li)):
        yield next(l)

## Finding the papers
---
Next, we need to find our papers, which we do using the helpful library `glob` for wildcard searching. We search for the `.txt` files corresponding to extracted paragraphs.

In [3]:
# Strip the directory and the '.txt' extension; os.path works with both '\\' and '/' separators
files = [os.path.splitext(os.path.basename(x))[0] for x in glob('./majed/*.txt')]
files_iter = iter(files)
print(files)

['acsaccounts5b00165.0', 'anie200504114.100', 'B206698J.13', 'B306504A.8', 'B810295C.12', 'B915273C.21', 'cm049398e.85', 'cm8012733.120', 'cm8012733.132', 'cm8012733.89', 'cm8012733.91', 'cm801411y.107', 'cm801411y.70', 'cm801411y.71', 'cm801411y.72', 'es000990o.103', 'es000990o.67', 'es000990o.68', 'es000990o.70', 'es000990o.71', 'es000990o.97', 'es000990o.99', 'ie0705047.65', 'ie0705047.69', 'ja0559911.70', 'ja974025i.129', 'ja974025i.299', 'ja974025i.78', 'ja974025i.79', 'jctbv9612.142', 'jctbv9612.158', 'jjcis201309023.54', 'jp014280w.105', 'jp014280w.77', 'jp021964a.72', 'jp021964a.93', 'jp044538t.96', 'jp044538t.97', 'jp044538t.98', 'la035834k', 'la902239m.63', 'la902239m.64', 'la902239m.88', 'nature02529.108', 'S0009261400013853.30', 'S0045653505011185', 'S0169131711000895.239', 'S0169131711000895.39', 'S0169131711000895.40', 'S0169131711000895.43', 'S0169131711000895.59', 'S0304389408003610.104', 'S0304389408003610.312', 'S0304389408003610.81', 'S0304389408003610.83', 'S0304389
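
The stems printed above appear to follow a `<paper id>.<paragraph number>` pattern (for example `cm8012733.120`), with a few bare identifiers such as `la035834k`. Assuming that reading is right, here is a small sketch for grouping paragraphs by paper. `group_by_paper` is a hypothetical helper, not part of `synoracle`.

```python
from collections import defaultdict

def group_by_paper(stems):
    """Group paragraph stems by paper id (hypothetical helper).

    Assumes each stem is '<paper id>.<paragraph number>'; a stem without a
    numeric suffix is treated as a paper id with no paragraph number.
    """
    groups = defaultdict(list)
    for stem in stems:
        paper, sep, suffix = stem.rpartition('.')
        if sep and suffix.isdigit():
            groups[paper].append(int(suffix))
        else:
            groups[stem].append(None)
    return dict(groups)

# A few stems copied from the listing above
sample = ['cm8012733.120', 'cm8012733.89', 'la035834k', 'B206698J.13']
grouped = group_by_paper(sample)
print(grouped)
```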

## Picking a random paper and processing the information
---
We then instantiate a `SynParagraph` object, which does our data extraction for us. This loads in the paper, but doesn't go through the data extraction just yet.

In [277]:
working = next(files_iter)
print(working)
test_syn = SynParagraph(working, source_directory='./majed/', chemtagger_dir = '../')

StopIteration: 
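
The `StopIteration` above is what `next` raises once `files_iter` has been exhausted, which happens when the cell has been re-run more times than there are files. Two ways around it are sketched below with a stand-in list rather than the real `files`: pass a default to `next`, or, since the goal is a random paper, draw one with `random.choice`.

```python
import random

# Stand-in for the real list of stems
files = ['cm8012733.120', 'la035834k', 'nature02529.108']

files_iter = iter(files)
for _ in files:
    next(files_iter)  # consume everything, as repeated cell runs would

# next() with a default returns None instead of raising StopIteration
working = next(files_iter, None)
if working is None:
    files_iter = iter(files)  # start over from the first file
    working = next(files_iter)

# Alternatively, pick a paper at random; a seeded generator makes the pick repeatable
rng = random.Random(0)
random_pick = rng.choice(files)
print(working, random_pick)
```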

## Looking at the text classification
---
Now that our object is successfully instantiated, we can read the text (`raw_text`) and see how `ChemDataExtractor` and `ChemicalTagger` interpreted it. `cde_text` underlines identified chemicals, and `xml_text` colour-codes action phrases as well.

In [274]:
test_syn.load_xml()
print(test_syn.xml_para_annotate(test_syn.working_xml))

Sodium hydroxide 0.88 g ( 0.22 mol ) was dissolved in 25 ml of distilled water . Sodium aluminate 1.05 g ( 0.0128 mol ) and 196 ml ( 4.841 mol ) of methanol as a solvent were added and the mixture was stirred for 30 min . Tetraethoxysilane 41.9 ml ( 0.1882 mol ) was added drop-wise and stirred for an hour . Finally 15 g of
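
The annotated string is styled with ANSI SGR escape sequences (for instance `4` for underline, `1` for bold, and `91` to `96` for bright foreground colours), which is what produces the underlining and colour in a terminal or notebook. If the plain text is needed, for saving or diffing, the sequences can be removed with a regular expression. This is a minimal sketch: `strip_ansi` is a hypothetical helper, and the sample string is made up in the same style as the output.

```python
import re

# Matches SGR sequences such as ESC[4;96m, ESC[0m and the bare ESC[m
ANSI_SGR = re.compile(r'\x1b\[[0-9;]*m')

def strip_ansi(text):
    """Remove ANSI SGR styling codes, leaving the plain text (hypothetical helper)."""
    return ANSI_SGR.sub('', text)

styled = (
    '\x1b[4;96mSodium\x1b[0m \x1b[4;96mhydroxide\x1b[0m '
    '\x1b[1;96m0.88\x1b[0m \x1b[1;96mg\x1b[0m'
)
plain = strip_ansi(styled)
print(plain)
```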

## Extracting a sequence
---
Finally, the sequence dataframe shows what was added when, letting us recreate the sequence of events described in the paragraph. 

In [275]:
test_syn.raw_synthesis

|   | name | text | new_chemicals | temp | time | prepphrase | apparatus | step number |
|---|---|---|---|---|---|---|---|---|
| 0 | Dissolve | Sodium hydroxide 0.88 g ( 0.22 mol ) was disso... | [{'name': 'Sodium hydroxide', 'mass': '0.88 g'... | [] | [] | [in 25 ml of distilled water] | [] | 0 |
| 1 | Add | Sodium aluminate 1.05 g ( 0.0128 mol ) and 196... | [{'name': 'Sodium aluminate', 'mass': '1.05 g'... | [] | [] | [] | [] | 1 |
| 2 | Stir | the mixture was stirred for 30 min | [] | [] | [for 30 min] | [] | [] | 2 |
| 3 | Add | Tetraethoxysilane 41.9 ml ( 0.1882 mol ) was a... | [{'name': 'Tetraethoxysilane', 'mass': nan, 'o... | [] | [] | [] | [] | 3 |
| 4 | Stir | stirred for an hour | [] | [] | [for an hour] | [] | [] | 4 |
| 5 | Add | Finally 15 g of the seeding gel was added | [{'name': 'unknown', 'mass': '15 g', 'other_am... | [] | [] | [] | [] | 5 |
| 6 | Stir | the mixture was stirred for 1 h | [] | [] | [for 1 h] | [] | [] | 6 |
| 7 |  | The pH of the mixture was 10.2-11.0 . | [] | [] | [] | [of the mixture] | [] | 7 |
| 8 | Stir | stirred at 230-250 °C for 4-10 h | [] | [at 230-250 °C] | [for 4-10 h] | [] | [] | 8 |
| 9 |  | The total mixture was put into 600 ml Parr aut... | [] | [] | [] | [into 600 ml Parr autoclave] | [] | 9 |
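
The `time` column holds free-text phrases such as `for 30 min`, `for an hour` and `for 4-10 h`, so comparing steps numerically needs one more parsing pass. Below is a rough sketch of that pass using only the standard library. `parse_duration` and the unit table are hypothetical and cover just the phrasings seen in this table, not everything the extractor may emit.

```python
import re

# Minutes per unit; limited to units seen above plus obvious spellings (an assumption)
UNIT_MINUTES = {'min': 1, 'mins': 1, 'minute': 1, 'minutes': 1,
                'h': 60, 'hr': 60, 'hour': 60, 'hours': 60}

PATTERN = re.compile(
    r'for\s+(?P<low>an?|\d+(?:\.\d+)?)'     # "a", "an" or a number
    r'(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?'  # optional upper bound of a range
    r'\s*(?P<unit>[a-zA-Z]+)'               # unit word, e.g. "min" or "h"
)

def parse_duration(phrase):
    """Return (low, high) in minutes, or None if the phrase is not recognised."""
    m = PATTERN.search(phrase)
    if m is None or m['unit'].lower() not in UNIT_MINUTES:
        return None
    factor = UNIT_MINUTES[m['unit'].lower()]
    low = 1.0 if m['low'] in ('a', 'an') else float(m['low'])
    high = float(m['high']) if m['high'] else low
    return (low * factor, high * factor)

for phrase in ['for 30 min', 'for an hour', 'for 1 h', 'for 4-10 h']:
    print(phrase, '->', parse_duration(phrase))
```

A single duration comes back with equal bounds, so downstream code can treat every entry as a range.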


In [276]:
test_syn.raw_synthesis.to_json(f'./majed/{working}.json')
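
By default, `DataFrame.to_json` writes column-oriented JSON of the form `{column: {row label: value}}`. The sketch below shows one way to turn that shape back into an ordered list of steps with only the standard library. The two rows are abbreviated from the table above, and `rows_from_columns_json` is a hypothetical helper.

```python
import json

# Stand-in for the saved file's contents, in pandas' default 'columns' orientation
saved = json.dumps({
    'name': {'0': 'Dissolve', '1': 'Add'},
    'text': {'0': 'Sodium hydroxide 0.88 g ...', '1': 'Sodium aluminate 1.05 g ...'},
    'step number': {'0': 0, '1': 1},
})

def rows_from_columns_json(text):
    """Convert column-oriented JSON into row dicts sorted by 'step number'."""
    columns = json.loads(text)
    labels = next(iter(columns.values())).keys()
    rows = [{col: values[label] for col, values in columns.items()}
            for label in labels]
    return sorted(rows, key=lambda row: row['step number'])

steps = rows_from_columns_json(saved)
for step in steps:
    print(step['step number'], step['name'])
```

With pandas available, `pd.read_json` on the same file reads it straight back into a dataframe.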