# Building data tables from a corpus of papers

Once a synthesis protocol has been parsed into a basic sequence and stored as a `.json` file, we need to convert all the parsed information into usable formats for further analysis.
In this notebook, we will go through all of the data parsing for a single paper to demonstrate the data structures available.

In [1]:
import os
import sys
import matplotlib.pyplot as plt

try:
    from synoracle.interpret import sequence, ingredients
except ModuleNotFoundError:
    module_path = os.path.abspath(os.path.join('..'))
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.interpret import sequence, ingredients

In [2]:
from glob import glob
from tqdm.notebook import tqdm, trange
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import json

def li_iterate(li):
    """Yield the items of `li` in order while showing a tqdm progress bar."""
    items = iter(li)
    for _ in trange(len(li)):
        yield next(items)

## Importing the sequence data frame into pandas
First, we instantiate a `Sequence` object, which holds the synthesis steps as a pandas DataFrame. Initially, the steps are exposed through the `raw_synthesis` attribute.

In [3]:
cq = sequence.Sequence.from_json('./RSC_mcm/c5nj03147h/sequence.12.json')
cq.raw_synthesis

Unnamed: 0,name,text,new_chemicals,temp,time,prepphrase,apparatus,step number
0,Synthesize,The ZIF-8 nanoparticles were synthesized,[],[],[],[],[],0
1,Dissolve,"Zn(NO3)2·6H2O ( 3 g , 10 mmol ) in 200 mL of m...","[{'name': 'Zn(NO3)2·6H2O', 'mass': '3 g', 'oth...",[],[],[],[],1
2,Add,was rapidly poured into an equal volume methan...,"[{'name': 'methanol', 'mass': None, 'other_amo...",[],[],[into an equal volume methanol solution of Hmi...,[],2
3,Stir,with vigorous stirring at room temperature,[],[at room temperature],[],[],[],3
4,,following the method reported by Cravillon et ...,[],[],[],"[by Cravillon et, al.29]",[],4
5,Stir,After stirring for 1 h,[],[],[for 1 h],[],[],5
6,Recover,the resulting ZIF-8 nanoparticles were collect...,[],[],[for 8 min],"[by centrifugation, at 6010 g]",[],6
7,Wash,washing three times with methanol,"[{'name': 'methanol', 'mass': None, 'other_amo...",[],[],[with methanol],[],7
8,,The nanoparticles were lyophilized before use .,[],[],[],[before use],[],8
9,,The as-synthesized ZIF-8 had a crystal size of...,[],[],[],[of ∼50 nm],[],9


### Pulling out chemical information
The first thing we'll do is extract information about the chemicals involved in the synthesis procedure.
We'll start by building a `chemical_list` of the mentioned materials, with quantities sorted by type (mass, volume, concentration, and moles, which are stored as `other_amount`).

In [4]:
cq.extract_chemicals()
cq.chemical_list.chemical_list

Unnamed: 0,name,mass,other_amount,volume,percent,concentration,aliases,Units used
0,Zn(NO3)2·6H2O,3 g,10 mmol,,,,[Zn(NO3)2·6H2O],other_amount
1,methanol,,,200 mL,,,[methanol],volume
2,methanol,,,,,,[methanol],
3,methanol,,,,,,[methanol],
4,Zn(NO3)2·6H2O,,,,,,[Zn(NO3)2·6H2O],


From this, we'll build a `BillOfMaterials`, in which chemicals are grouped by identity and the total number of moles of each is recorded. This is the basis for later analysis of, for example, reaction stoichiometry.

In [5]:
ingreds_bom = cq.chemical_list.produce_bill_of_mats()

In [6]:
ingreds_bom.bill_of_materials

pubchem id,name,moles
887,[methanol],5.140863
15865313,[Zn(NO3)2·6H2O],0.01
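
To illustrate the grouping step, here is a minimal pandas sketch. It is not the `synoracle` implementation: the toy rows, the `TO_MOLES` table, and the `amount_to_moles` helper are assumptions made for illustration. It only handles amounts already given in moles; converting a volume such as `200 mL` would additionally need a density and a molar mass, which is omitted here.

```python
import pandas as pd

# Toy rows mirroring the chemical_list output above (typed by hand, not produced by synoracle).
chemicals = pd.DataFrame({
    "name": ["Zn(NO3)2·6H2O", "methanol", "methanol", "methanol", "Zn(NO3)2·6H2O"],
    "other_amount": ["10 mmol", None, None, None, None],
})

# Hypothetical unit table: factor that converts each unit to moles.
TO_MOLES = {"mol": 1.0, "mmol": 1e-3, "umol": 1e-6}


def amount_to_moles(amount):
    """Convert a string such as '10 mmol' to moles; a missing amount contributes 0.0."""
    if pd.isna(amount):
        return 0.0
    value, unit = amount.split()
    return float(value) * TO_MOLES[unit]


chemicals["moles"] = chemicals["other_amount"].map(amount_to_moles)

# Group repeated mentions of the same chemical and total their moles.
bill = chemicals.groupby("name")["moles"].sum()
print(bill)
```

In this sketch methanol totals zero because its only stated quantity is a volume. The real bill of materials above reports a non-zero value for it.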


## Extracting time and temperature information from the sequence
We then process reaction conditions such as times and temperatures in a similar way, converting them to minutes and kelvin respectively.
We can then analyse the total synthesis time and the set of temperatures used, for later comparison between different protocols.

In [7]:
cq.extract_conditions()
cq.conditions.time_temp

Unnamed: 0,step number,time,temp,T (K),Time (min)
0,0,[],[],,
1,1,[],[],,
2,2,[],[],,
3,3,[],[at room temperature],[298.15],
4,4,[],[],,
5,5,[for 1 h],[],,[60.0]
6,6,[for 8 min],[],,[8.0]
7,7,[],[],,
8,8,[],[],,
9,9,[],[],,
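
The conversion from text phrases to numbers can be sketched roughly as follows. This is an illustrative assumption about how such parsing might work, not the code `synoracle` uses: the regular expressions, the `MINUTES_PER_UNIT` table, and both helper functions are hypothetical. The room-temperature value of 298.15 K is taken from the table above.

```python
import re

# Hypothetical conversion factors from a time unit to minutes.
MINUTES_PER_UNIT = {"s": 1 / 60, "min": 1.0, "h": 60.0, "d": 1440.0}
ROOM_TEMPERATURE_K = 298.15  # value shown in the T (K) column above


def phrase_to_minutes(phrase):
    """Return the duration in minutes for phrases like 'for 1 h', or None if no time is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(min|h|s|d)\b", phrase)
    if match is None:
        return None
    return float(match.group(1)) * MINUTES_PER_UNIT[match.group(2)]


def phrase_to_kelvin(phrase):
    """Return the temperature in kelvin for phrases like 'at 80 °C', or None if none is found."""
    if "room temperature" in phrase.lower():
        return ROOM_TEMPERATURE_K
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°?\s*(C|K)\b", phrase)
    if match is None:
        return None
    value = float(match.group(1))
    return value + 273.15 if match.group(2) == "C" else value


print(phrase_to_minutes("for 1 h"), phrase_to_minutes("for 8 min"))
print(phrase_to_kelvin("at room temperature"))
```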


In [8]:
sum(cq.conditions.time_temp['Time (min)'][cq.conditions.time_temp['Time (min)'].notna()].sum())

68.0

In [9]:
set(cq.conditions.time_temp['T (K)'][cq.conditions.time_temp['T (K)'].notna()].sum())

{298.15}
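
The two cells above rely on `Series.sum()` concatenating the per-step lists before the values are totalled. An arguably more explicit alternative is `Series.explode()`, sketched here on a hand-built stand-in for `cq.conditions.time_temp` (the toy frame is an assumption. Only the column names come from the output above).

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a few rows of cq.conditions.time_temp.
time_temp = pd.DataFrame({
    "step number": [3, 5, 6, 7],
    "T (K)": [[298.15], np.nan, np.nan, np.nan],
    "Time (min)": [np.nan, [60.0], [8.0], np.nan],
})

# explode() gives each list element its own row; steps without a value stay NaN and are dropped.
total_minutes = time_temp["Time (min)"].explode().dropna().astype(float).sum()
temperatures = set(time_temp["T (K)"].explode().dropna())

print(total_minutes, temperatures)
```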

### Analysing the sequence of actions itself
Finally, we can investigate the sequence of steps itself to analyse how complex the synthesis is, and break down the ingredients and conditions by reaction step.
We condense the synthesis procedure into "blocks", each with its own chemicals and conditions.
This gives us the added opportunity to perform like-for-like analysis on subsets of a reaction.

In [10]:
cq.condense_to_supertypes()
cq.condensed_sequence

(1, 3)
(5, 5)
(6, 7)
(10, 10)


Step supertype,new_chemicals,temp,time,Condensed steps
add,"[{'name': 'Zn(NO3)2·6H2O', 'mass': '3 g', 'oth...",[at room temperature],[],3
add,[],[],[for 1 h],1
remove,"[{'name': 'methanol', 'mass': None, 'other_amo...",[],[for 8 min],2
remove,[],[],[],1
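
The idea of condensing consecutive steps into blocks can be sketched with `itertools.groupby`. The `SUPERTYPE` mapping and the `condense` helper below are assumptions made for illustration. The actual rules used by `condense_to_supertypes` may differ, and this sketch reproduces only the first three blocks printed above.

```python
from itertools import groupby

# Hypothetical mapping from step name to supertype; names not listed have no supertype.
SUPERTYPE = {
    "Dissolve": "add", "Add": "add", "Stir": "add",
    "Recover": "remove", "Wash": "remove",
}

# (step number, name) pairs copied from the raw_synthesis table; None marks an unnamed step.
steps = [
    (0, "Synthesize"), (1, "Dissolve"), (2, "Add"), (3, "Stir"), (4, None),
    (5, "Stir"), (6, "Recover"), (7, "Wash"), (8, None), (9, None),
]


def condense(steps):
    """Merge runs of consecutive steps that share a supertype into single blocks."""
    blocks = []
    for supertype, run in groupby(steps, key=lambda step: SUPERTYPE.get(step[1])):
        run = list(run)
        if supertype is None:
            continue  # steps without a supertype separate blocks but are not kept
        blocks.append({
            "supertype": supertype,
            "first": run[0][0],
            "last": run[-1][0],
            "n_steps": len(run),
        })
    return blocks


blocks = condense(steps)
for block in blocks:
    print(block)
```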


In [12]:

cq.extract_chemicals(
    partial_sequence=pd.DataFrame(cq.condensed_sequence.reset_index().loc[0]).T
)
ingredients_sub_selection = cq.chemical_list.produce_bill_of_mats()

ingredients_sub_selection.bill_of_materials

pubchem id,name,moles
887,[methanol],5.140863
15865313,[Zn(NO3)2·6H2O],0.01


In [13]:
cq.extract_conditions(
    partial_sequence=pd.DataFrame(cq.condensed_sequence.reset_index().loc[2]).T
)
cq.conditions.time_temp

Unnamed: 0,time,temp,T (K),Time (min)
2,[for 8 min],[],,[8.0]
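
The `pd.DataFrame(... .loc[i]).T` idiom used in the last two cells turns a single row into a one-row DataFrame. The sketch below, on a hand-built stand-in for `cq.condensed_sequence` (an assumption, not real output), shows what the idiom produces and compares it with `.iloc[[i]]`, which selects the same row but preserves the column dtypes. Whether `partial_sequence` accepts either form equally is not something this notebook confirms.

```python
import pandas as pd

# Hand-built stand-in for cq.condensed_sequence with two of its columns.
condensed = pd.DataFrame(
    {"time": [[], ["for 1 h"], ["for 8 min"]], "Condensed steps": [3, 1, 2]},
    index=pd.Index(["add", "add", "remove"], name="Step supertype"),
)
flat = condensed.reset_index()

# Idiom used in the notebook: row as a Series, wrapped in a DataFrame, then transposed.
via_transpose = pd.DataFrame(flat.loc[2]).T

# Alternative: a list inside iloc keeps the result a DataFrame and keeps dtypes.
via_iloc = flat.iloc[[2]]

print(via_transpose)
print(via_iloc.dtypes)
```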
