# Extracting information from a paragraph
---
Now that we have our paragraphs, let's see what kind of information we can get out of them! The first thing we need to do is import some libraries, including `synparagraph`, which I wrote for this specific purpose.

In [1]:
import os
import sys
import matplotlib.pyplot as plt

try:
    from synoracle.synparagraph import SynParagraph
except ModuleNotFoundError:
    module_path = os.path.abspath(os.path.join('..'))
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.synparagraph import SynParagraph

In [2]:
import pandas as pd
import numpy as np
import pint
ureg = pint.UnitRegistry()
Q_ = ureg.Quantity

from glob import glob
from tqdm.notebook import tqdm, trange
def li_iterate(li):
    """Yield the items of `li` in order while showing a notebook progress bar."""
    l = iter(li)
    for _ in trange(len(li)):
        yield next(l)

## Picking a paper and processing the information
---
We then instantiate a `SynParagraph` object, which does our data extraction for us. This loads the paper but doesn't run the data extraction just yet.

In [2]:

test_syn = SynParagraph('S2590123022000482.92', source_directory='./', chemtagger_dir='../')

./


## Looking at the text classification
---
Now that our object is successfully instantiated, we can read the text (`raw_text`) and see how `ChemDataExtractor` and `ChemicalTagger` interpreted it. `cde_text` underlines identified chemicals, and `xml_text` also colour-codes action phrases.

In [3]:
test_syn.load_xml()
print(test_syn.xml_para_annotate(test_syn.working_xml))

All chemicals were purchased from commercial sources ( Aldrich and VWR ) and used as received without further purification . The solvents used for the synthesis were of analytical reagent grade . The synthesis of ZIF-8 was assisted by microwave irradiation and ultrasound . Three mixtures of solvent were used ( deionized water / dimethylformamide ) ( W/D ) , ( deionized water / methanol ) ( W/M ) and ( deionized water/deionized water ) (

In the rendered output, the underlined tokens are VWR, ZIF-8, water, dimethylformamide, W/D, methanol and W/M, and the colour-highlighted tokens are "as received without further purification" and "The synthesis of ZIF-8".
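The underlining and colouring in this output appear to come from ANSI terminal escape sequences, which only render in a terminal or notebook. If you want the plain text back (for logging or diffing, say), they can be stripped with a regular expression. The snippet below is a minimal sketch, assuming the standard SGR codes (`4` for underline, `95` for bright magenta, `0` or empty for reset). `strip_ansi` is a hypothetical helper for illustration, not part of `synoracle`.

```python
import re

# SGR sequences have the form ESC [ <numbers separated by ';'> m
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text):
    """Remove ANSI SGR styling codes (hypothetical helper, not from synoracle)."""
    return ANSI_SGR.sub("", text)

# Standard SGR codes, assumed to match the ones used in the annotated output
BRIGHT_MAGENTA = "\x1b[95m"
RESET = "\x1b[0m"

# A small hand-built sample in the same style as the annotated paragraph
sample = (
    f"{BRIGHT_MAGENTA}synthesis{RESET} "
    f"{BRIGHT_MAGENTA}of{RESET} "
    f"\x1b[4;95mZIF-8{RESET}"
)
plain = strip_ansi(sample)
print(plain)
```

In principle the same helper could be applied to the string returned by `xml_para_annotate`, though that has not been checked here.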

## Extracting a sequence
---
Finally, the sequence dataframe shows what was added when, letting us recreate the sequence of events described in the paragraph. The cells below drop the `text` column from `raw_synthesis` and export the result to JSON.

In [10]:
test_syn.raw_synthesis = test_syn.raw_synthesis.drop('text', axis=1)

In [12]:
test_syn.raw_synthesis.to_json('./S2590123022000482.92.json', indent=2)
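To read the exported file without pandas, it helps to know its layout. Assuming `raw_synthesis` is a pandas `DataFrame`, `to_json` with its default `orient` writes column-oriented JSON of the form `{column: {row_label: value}}`. The sketch below turns that layout into one dictionary per row. The column names `step` and `action` are made-up placeholders rather than the real columns of `raw_synthesis`, and `columns_to_records` is a hypothetical helper.

```python
import json

def columns_to_records(column_json):
    """Convert {column: {row_label: value}} into a list of per-row dicts."""
    # Collect row labels in first-seen order across all columns
    row_labels = []
    for values in column_json.values():
        for label in values:
            if label not in row_labels:
                row_labels.append(label)
    # Build one dict per row; a column missing that row gets None
    return [
        {column: values.get(label) for column, values in column_json.items()}
        for label in row_labels
    ]

# Hypothetical stand-in for the contents of the exported JSON file
example = json.loads(
    '{"step": {"0": 1, "1": 2}, "action": {"0": "Add", "1": "Stir"}}'
)
records = columns_to_records(example)
print(records)
```

For the real file, `example` would be replaced by `json.load(open(path))` on the exported path.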