# Extracting information from a paragraph
---
Now that we have our paragraphs, let's see what kind of information we can get out of them! The first thing we need to do is import some libraries, including `synparagraph`, the module I wrote for this specific purpose.

In [1]:
import os
import sys
import matplotlib.pyplot as plt

try:
    from synoracle.synparagraph import SynParagraph
except ModuleNotFoundError:
    # synoracle is not importable yet: add the parent directory to the import path and retry
    module_path = os.path.abspath('..')
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.synparagraph import SynParagraph

In [2]:
import pandas as pd
import numpy as np
import pint
ureg = pint.UnitRegistry()
Q_ = ureg.Quantity

from glob import glob
from tqdm.notebook import tqdm, trange
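def li_iterate(li):
    """Yield the items of `li` in order while showing a notebook progress bar."""
    # tqdm takes the total from len(li) itself, so no manual counter is needed
    yield from tqdm(li)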

## Picking a random paper and processing the information
---
We then instantiate a `SynParagraph` object, which does our data extraction for us. This loads in the paper, but doesn't go through the data extraction just yet.

In [4]:

test_syn = SynParagraph('S1385894723007039.90', source_directory='./', chemtagger_dir='../')

./


## Looking at the text classification
---
Now that our object is successfully instantiated, we can read the text (`raw_text`) and see how `ChemDataExtractor` and `ChemicalTagger` interpreted the information. `cde_text` underlines identified chemicals, and `xml_text` colour-codes action phrases too.

In [5]:
test_syn.load_xml()
print(test_syn.xml_para_annotate(test_syn.working_xml))

0.730 g of Zn(NO3)2·6H2O was dissolved in 40 mL methanol and sonicated for 10 min to form solution A . Similarly , 3.285 g of 2-methylimidazole was dissolved in 40 mL methanol and sonicated for 10 min to form solution B . The two solutions were then mixed and stirred vigorously for 3 h at 25 ± 2 °C . Subsequently , the turbid mixture was separated by centrifugation
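
The underlining and colour in this annotation only show up in a terminal or notebook, because they are carried by ANSI escape sequences wrapped around each token. Below is a minimal sketch of how such a string could be post-processed, assuming the annotated text uses standard SGR sequences (`ESC[...m`) and resets with `ESC[0m` after every token. `strip_ansi` and `underlined_tokens` are hypothetical helpers, not part of `synoracle`, and the sample string is a hand-built fragment rather than real library output.

```python
import re

# Matches one ANSI SGR sequence such as ESC[1;96m, ESC[0m or ESC[m
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")
# Matches one styled token: opening SGR sequence, token text, reset sequence
STYLED_TOKEN = re.compile(r"\x1b\[([0-9;]*)m(.*?)\x1b\[0m")


def strip_ansi(text):
    """Remove every SGR sequence, leaving only the plain tokens."""
    return ANSI_SGR.sub("", text)


def underlined_tokens(text):
    """Return the tokens whose style includes SGR code 4 (underline)."""
    return [
        token
        for params, token in STYLED_TOKEN.findall(text)
        if "4" in params.split(";")
    ]


# Hand-built fragment in the assumed format (not real library output)
sample = (
    "\x1b[1;96m0.730\x1b[0m \x1b[1;96mg\x1b[0m "
    "\x1b[96mof\x1b[0m \x1b[4;96mZn(NO3)2·6H2O\x1b[0m"
)

plain = strip_ansi(sample)
chemicals = underlined_tokens(sample)
print(plain)
print(chemicals)
```

Since underlining marks the identified chemicals, filtering on code 4 is one way to pull them out of the annotated string without touching the XML.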

## Extracting a sequence
---
Finally, the sequence dataframe (`raw_synthesis`) shows what was added when, one row per step, letting us recreate the sequence of events described in the paragraph.

In [6]:
test_syn.raw_synthesis

| Unnamed: 0 | name | text | new_chemicals | temp | time | prepphrase | apparatus | step number |
|---|---|---|---|---|---|---|---|---|
| 0 | Dissolve | 0.730 g of Zn(NO3)2·6H2O was dissolved in 40 m... | [{'name': 'Zn(NO3)2·6H2O', 'mass': '0.730 g', ... | [] | [] | [in 40 mL methanol] | [] | 0 |
| 1 | Wait | sonicated for 10 min | [] | [] | [for 10 min] | [] | [] | 1 |
| 2 | Yield | to form solution A | [{'name': 'A', 'mass': nan, 'other_amount': na... | [] | [] | [] | [] | 2 |
| 3 | Dissolve | Similarly , 3.285 g of 2-methylimidazole was d... | [{'name': '2-methylimidazole', 'mass': '3.285 ... | [] | [] | [in 40 mL methanol] | [] | 3 |
| 4 | Wait | sonicated for 10 min | [] | [] | [for 10 min] | [] | [] | 4 |
| 5 | Yield | to form solution B | [{'name': 'B', 'mass': nan, 'other_amount': na... | [] | [] | [] | [] | 5 |
| 6 | Add | The two solutions were then mixed | [] | [] | [] | [] | [] | 6 |
| 7 | Stir | stirred vigorously for 3 h at 25 ± 2 °C | [] | [] | [for 3 h] | [at 25 ± 2 °C] | [] | 7 |
| 8 | Partition | Subsequently , the turbid mixture was separate... | [] | [] | [] | [by centrifugation ( 10000 rpm )] | [] | 8 |
| 9 | Yield | yielding white | [] | [] | [] | [] | [] | 9 |
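
To read the steps back as a numbered protocol, the rows can be walked in `step number` order. The following is a minimal sketch and not part of `synoracle`: `format_sequence` is a hypothetical helper, and the rows are plain dictionaries copied by hand from the output above (the same shape `DataFrame.to_dict('records')` would give), so it runs without the paper files.

```python
def format_sequence(rows):
    """Return one 'N. Action: text' line per step, ordered by step number."""
    ordered = sorted(rows, key=lambda row: row["step number"])
    return [
        f"{row['step number']}. {row['name']}: {row['text']}" for row in ordered
    ]


# A few rows copied from the dataframe output, deliberately out of order
rows = [
    {"name": "Stir", "text": "stirred vigorously for 3 h at 25 ± 2 °C", "step number": 7},
    {"name": "Wait", "text": "sonicated for 10 min", "step number": 1},
    {"name": "Add", "text": "The two solutions were then mixed", "step number": 6},
    {"name": "Yield", "text": "to form solution A", "step number": 2},
]

protocol = format_sequence(rows)
for line in protocol:
    print(line)
```

Sorting on `step number` rather than relying on row order keeps the protocol correct even if the dataframe has been filtered or shuffled beforehand.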


In [7]:
test_syn.raw_synthesis.to_json('./S1385894723007039.90.json')
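
With its default settings, `DataFrame.to_json` writes a column-oriented object of the form `{column: {index: value}}`, with the index labels stored as strings. Below is a minimal sketch of turning such content back into ordered rows using only the standard library. `rows_from_columns_json` is a hypothetical helper, the sample string is a two-row stand-in built from the first rows shown above rather than the real file, and the integer index labels are an assumption based on that dataframe.

```python
import json


def rows_from_columns_json(text):
    """Convert column-oriented JSON ({column: {index: value}}) into a list of row dicts."""
    columns = json.loads(text)
    # Collect every index label, then sort numerically (assumes integer-like labels)
    labels = sorted(
        {label for column in columns.values() for label in column}, key=int
    )
    return [
        {name: column.get(label) for name, column in columns.items()}
        for label in labels
    ]


# Stand-in for the file contents, in the default column orientation
sample = '{"name": {"0": "Dissolve", "1": "Wait"}, "step number": {"0": 0, "1": 1}}'

rows = rows_from_columns_json(sample)
for row in rows:
    print(row)
```

When pandas is available, `pd.read_json` on the saved file is the more direct route; the sketch is mainly useful for consumers that should not depend on pandas.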