# Extracting structured information from synthesis sequences


Once a synthesis protocol has been parsed into a basic sequence and stored as a `.json` file, we need to convert all the parsed information into usable formats for further analysis. 
In this notebook, we will go through all of the data parsing for a single paper to demonstrate the data structures available.

In [22]:
import os
import sys
import matplotlib.pyplot as plt
import math

try:
    from synoracle.sequence import Sequence
except ModuleNotFoundError:
    module_path = os.path.abspath(os.path.join('..'))
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.sequence import Sequence

In [2]:
from glob import glob
from tqdm.notebook import tqdm, trange
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import json
def li_iterate(li):
    l = iter(li)
    for _ in trange(len(li)):
        yield next(l)

## Importing the sequence data

First, we import the raw synthesis sequence from `json` format into a `Sequence` object. 
Inside the `Sequence` object the raw synthesis information is stored as a pandas `DataFrame` under the attribute `raw_synthesis`. 
Using this, we can check the information which has been gathered, as well as reference the original text to manually check the fidelity of the previous steps, if required.

In terms of data which will be used for later processing, the `new_chemicals`, `temp`, and `time` columns contain chemical information, temperatures, and times respectively. 
Each of these will be processed to generate structured information from the synthesis. Accordingly, synthesis steps containing no information about any of these three aspects are discarded to create a `clean_synthesis` (not shown here).
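
As a rough illustration of that filtering step, the sketch below keeps only the steps that carry at least one non-empty entry among the three columns. It uses plain dictionaries as a stand-in for DataFrame rows, and the helper name `drop_uninformative_steps` is hypothetical rather than part of `synoracle`.

```python
# Hypothetical stand-in for rows of `raw_synthesis`: one dict per step.
raw_steps = [
    {"step number": 0, "name": "Purify", "new_chemicals": [], "temp": [], "time": []},
    {"step number": 6, "name": "Add",
     "new_chemicals": [{"name": "ZnO", "mass": "0,2 g"}], "temp": [], "time": []},
    {"step number": 9, "name": "Stir", "new_chemicals": [], "temp": [], "time": ["for 10 min"]},
]

# The three columns used for later processing.
INFO_COLUMNS = ("new_chemicals", "temp", "time")


def drop_uninformative_steps(steps):
    """Keep only steps with at least one non-empty information column."""
    return [step for step in steps if any(step[col] for col in INFO_COLUMNS)]


clean_steps = drop_uninformative_steps(raw_steps)
print([step["step number"] for step in clean_steps])
```

Here step 0 is dropped because all three of its columns are empty, while steps 6 and 9 are kept.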


In [3]:
cq = Sequence.from_json('./S2590123022000482.92.json')
cq.raw_synthesis

Unnamed: 0,name,new_chemicals,temp,time,prepphrase,apparatus,step number
0,Purify,[],[],[],[without further purification],[],0
1,,"[{'name': 'VWR', 'mass': None, 'other_amount':...",[],[],[from commercial sources ( Aldrich and VWR )],[],1
2,,[],[],[],"[for the synthesis, of analytical reagent grade]",[],2
3,Synthesize,"[{'name': 'ZIF-8', 'mass': None, 'other_amount...",[],[],[of ZIF-8],[],3
4,,[],[],[],[by microwave irradiation and ultrasound],[],4
5,,"[{'name': 'water / dimethylformamide', 'mass':...",[],[],[of solvent],[],5
6,Add,"[{'name': 'ZnO', 'mass': '0,2 g', 'other_amoun...",[],[],[],[],6
7,Partition,"[{'name': 'water', 'mass': None, 'other_amount...",[],[],[into 15 mL of deionized water ( W ) and 15 mL...,[],7
8,Add,"[{'name': 'zinc oxide', 'mass': None, 'other_a...",[],[],[to Hmim solution],[],8
9,Stir,[],[],[for 10 min],[in ultrasonic bath],[ultrasonic bath],9


### Processing chemical information

Once a `clean_synthesis` has been generated, the first structured information to extract is which chemicals are present and in what quantity. 
As chemicals can be added multiple times during synthesis or mentioned using different names in different studies, and their quantity can be reported in a number of units, the following steps need to be carried out:
1. Identify chemical names
2. Determine which units have been used to measure each one

These steps are carried out by the `Sequence.extract_chemicals()` method, which produces a `ChemicalList` object under the attribute `chemical_list`, containing information about each mentioned chemical with quantities sorted by type (mass, volume, concentration, moles (`other_amount`)).
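
The two steps can be pictured with a small sketch, which is not the library's implementation. It walks over the `new_chemicals` entries of each step, collects every name once, and records which quantity fields were filled in. The function names and the input layout are assumptions made for illustration; the field names follow the columns of the chemical list.

```python
# Quantity fields, named after the columns of the chemical list.
QUANTITY_FIELDS = ("mass", "other_amount", "volume", "percent", "concentration")


def units_used(chemical):
    """Return the quantity fields of one chemical entry that hold a value."""
    return [field for field in QUANTITY_FIELDS if chemical.get(field) is not None]


def summarise_chemicals(steps):
    """Map each chemical name to the set of quantity types it was reported in."""
    summary = {}
    for step in steps:
        for chemical in step["new_chemicals"]:
            summary.setdefault(chemical["name"], set()).update(units_used(chemical))
    return summary


# Hypothetical input mirroring a few entries from this synthesis.
steps = [
    {"new_chemicals": [{"name": "ZIF-8", "mass": None}]},
    {"new_chemicals": [{"name": "ZnO", "mass": "0,2 g"}]},
    {"new_chemicals": [{"name": "2-methylimidazole", "mass": "0.8 g"}]},
]
summary = summarise_chemicals(steps)
print(summary)
```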

In [4]:
cq.extract_chemicals()
cq.chemical_list.chemical_list

Unnamed: 0,name,mass,other_amount,volume,percent,concentration,aliases,Units used
0,VWR,,,,,,[VWR],
1,ZIF-8,,,,,,[ZIF-8],
2,water / dimethylformamide,,,,,,[water / dimethylformamide],
3,W/D,,,,,,[W/D],
4,water / methanol,,,,,,[water / methanol],
5,W/M,,,,,,[W/M],
6,water,,,,,,[water],
7,W / W,,,,,,[W / W],
8,ZnO,"0,2 g",,,,,[ZnO],mass
9,2-methylimidazole,0.8 g,,,,,[2-methylimidazole],mass


Once a `ChemicalList` has been generated, this can be further processed into an itemised bill of materials for a synthesis, containing unique identifiers for each chemical and the total quantity used throughout the synthesis.
The steps to convert a `ChemicalList` into a `BillOfMaterials` are contained within the `ChemicalList.produce_bill_of_mats` method, which performs the following actions:
1. Groups all chemicals together with the same name 
2. Searches the online PubChem database for the chemical's name, taking the database's chemical ID number as a unique identifier
3. Uses the PubChem entry to extract key information about the compound, such as molecular weight
4. Estimates the compound's density using the ChEDL database and the COSTALD method, if required
5. Calculates the total number of moles present of the compound
6. Groups multiple instances of the same chemical together, to show the total bill of materials present
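
A minimal sketch of steps 5 and 6 is given below, assuming the identifier and molecular weight are already known. The `PUBCHEM_STUB` dictionary is a hypothetical stand-in for the online PubChem query (no network call is made), and accepting a decimal comma in `mass_to_grams` is a choice made for this sketch, not confirmed library behaviour.

```python
from collections import defaultdict

# Hypothetical stand-in for the PubChem lookup:
# name -> (PubChem compound ID, molecular weight in g/mol).
PUBCHEM_STUB = {
    "2-methylimidazole": (12749, 82.10),
    "ZnO": (14806, 81.38),
    "zinc oxide": (14806, 81.38),
}


def mass_to_grams(text):
    """Parse a mass string such as '0.8 g'; a decimal comma ('0,2 g') is accepted too."""
    value, unit = text.replace(",", ".").split()
    if unit != "g":
        raise ValueError(f"unsupported mass unit: {unit!r}")
    return float(value)


def bill_of_materials(chemicals):
    """Group entries by compound ID and total the moles computed from their masses."""
    totals = defaultdict(lambda: {"names": [], "moles": 0.0})
    for chem in chemicals:
        cid, mol_weight = PUBCHEM_STUB[chem["name"]]
        entry = totals[cid]
        if chem["name"] not in entry["names"]:
            entry["names"].append(chem["name"])
        if chem.get("mass"):
            # moles = mass (g) / molecular weight (g/mol)
            entry["moles"] += mass_to_grams(chem["mass"]) / mol_weight
    return dict(totals)


bom = bill_of_materials([
    {"name": "2-methylimidazole", "mass": "0.8 g"},
    {"name": "ZnO", "mass": "0,2 g"},
    {"name": "zinc oxide", "mass": None},
])
print(round(bom[12749]["moles"], 6))  # 0.009744
```

For 2-methylimidazole this reproduces 0.8 g / 82.10 g/mol ≈ 0.009744 mol. Because the sketch accepts the decimal comma, ZnO receives a non-zero amount here, whereas the notebook's own output lists 0.0 mol for ZnO.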


In [5]:
ingreds_bom = cq.chemical_list.produce_bill_of_mats()

 VWR
 W / W
 W/D
 W/M
 ZIF-8-WM-(US
 water / dimethylformamide
 water / methanol


In [6]:
ingreds_bom.bill_of_materials

pubchem_id,name,moles
887,[methanol],0.0
962,[water],1.652506
6228,[DMF],0.0
12749,[2-methylimidazole],0.009744
14806,"[ZnO, zinc oxide]",0.0
15245636,[ZIF-8],0.0


By standardising the format of the bill of materials, both in terms of chemical identity and quantity units, we are able to seamlessly compare between different synthesis protocols. 
In this way, statistics on how common a certain chemical is or what quantity of a certain chemical is used can be easily calculated.
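
As a sketch of such a comparison, the snippet below counts how many protocols use each PubChem ID and takes the median amount of one compound. The `bills` dictionary is a hypothetical layout: the `this_paper` values are copied from the output above, while `other_paper` is invented purely for illustration.

```python
from collections import Counter
from statistics import median

# Hypothetical bills of materials: protocol label -> {PubChem ID: moles}.
# "this_paper" reuses values from this notebook; "other_paper" is made up.
bills = {
    "this_paper": {962: 1.652506, 12749: 0.009744},
    "other_paper": {962: 0.555, 887: 1.2, 12749: 0.02},
}


def chemical_frequency(bills):
    """Count the number of protocols in which each compound ID appears."""
    return Counter(cid for bom in bills.values() for cid in bom)


def median_moles(bills, cid):
    """Median amount of one compound across the protocols that use it."""
    return median(bom[cid] for bom in bills.values() if cid in bom)


frequency = chemical_frequency(bills)
print(frequency[962], frequency[887])
print(median_moles(bills, 12749))
```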

## Extracting time and temperature information from the sequence
We then apply similar processing to reaction conditions such as times and temperatures, converting them to minutes and kelvin respectively.
We can then analyse the total synthesis time and the set of temperatures used, for later comparison between different protocols.
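
The unit normalisation can be sketched with two small regular-expression helpers. This is an illustration under stated assumptions and not the parser behind `extract_conditions()`: the patterns, the helper names `to_minutes` and `to_kelvin`, and the example phrases other than "for 10 min" are all hypothetical.

```python
import re

# First letter of the matched time unit -> minutes per unit.
MINUTES_PER_UNIT = {"s": 1 / 60, "m": 1.0, "h": 60.0, "d": 1440.0}

TIME_PATTERN = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(seconds?|sec|s|minutes?|min|hours?|hr|h|days?|d)\b",
    re.IGNORECASE,
)
TEMP_PATTERN = re.compile(r"(-?\d+(?:[.,]\d+)?)\s*(°\s*C|K)\b")


def to_minutes(phrase):
    """Return the duration in minutes, or None if no number-unit pair is found."""
    match = TIME_PATTERN.search(phrase)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    return value * MINUTES_PER_UNIT[match.group(2)[0].lower()]


def to_kelvin(phrase):
    """Return the temperature in kelvin, or None if none is found."""
    match = TEMP_PATTERN.search(phrase)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    return value if match.group(2) == "K" else value + 273.15


phrases = ["for 10 min", "min", "for 2 h"]
minutes = [to_minutes(p) for p in phrases]
total_minutes = sum(m for m in minutes if m is not None)
print(minutes, total_minutes)
print(to_kelvin("at 65 °C"))
```

A phrase without a number, such as a bare "min", returns `None` in this sketch instead of raising an error, so it can simply be skipped when totalling.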

In [7]:
cq.extract_conditions()
cq.conditions.time_temp

Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "min", original was "min"
IndexError list index out of range


Unnamed: 0,step number,time,temp,T (K),Time (min)
0,0,[],[],,
1,1,[],[],,
2,2,[],[],,
3,3,[],[],,
4,4,[],[],,
5,5,[],[],,
6,6,[],[],,
7,7,[],[],,
8,8,[],[],,
9,9,[for 10 min],[],,[10.0]


In [28]:
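# Each 'Time (min)' cell holds a list, so .sum() concatenates the lists from all steps
time_col = cq.conditions.time_temp['Time (min)']
times = time_col[time_col.notna()].sum()
# Total synthesis time in minutes, skipping any NaN entries
sum([x for x in times if not math.isnan(x)])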

50.0

In [9]:
set(cq.conditions.time_temp['T (K)'][cq.conditions.time_temp['T (K)'].notna()].sum())

{338.15}

### Analysing the sequence of actions itself
Finally, we can investigate the sequence of steps itself to analyse how complex the synthesis is, and break down the ingredients and conditions by reaction step. 
We condense the synthesis procedure into "blocks", each with its own chemicals and conditions. 
This gives us the added opportunity to perform like-for-like analysis on subsets of a reaction.
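
The general idea of condensing can be sketched as merging adjacent steps that share a supertype. The exact rules used by `condense_to_supertypes()` are not shown in this notebook and may well differ, so the `SUPERTYPE` mapping and the `condense` helper below are assumptions for illustration only.

```python
from itertools import groupby

# Assumed mapping from action name to supertype; the real mapping may differ.
SUPERTYPE = {"Add": "add", "Stir": "add", "Partition": "remove", "Wash": "remove"}


def condense(steps):
    """Merge runs of adjacent steps with the same supertype into single blocks."""
    blocks = []
    for supertype, run in groupby(steps, key=lambda s: SUPERTYPE[s["name"]]):
        run = list(run)
        blocks.append({
            "supertype": supertype,
            # Block name is the concatenation of its step names, e.g. "AddStir".
            "name": "".join(s["name"] for s in run),
            "new_chemicals": [c for s in run for c in s["new_chemicals"]],
            "time": [t for s in run for t in s["time"]],
            "condensed_steps": len(run),
        })
    return blocks


# Hypothetical, simplified steps based on part of this synthesis.
steps = [
    {"name": "Add", "new_chemicals": ["ZnO"], "time": []},
    {"name": "Partition", "new_chemicals": ["water"], "time": []},
    {"name": "Add", "new_chemicals": ["zinc oxide"], "time": []},
    {"name": "Stir", "new_chemicals": [], "time": ["for 10 min"]},
]
blocks = condense(steps)
for block in blocks:
    print(block["supertype"], block["name"], block["condensed_steps"])
```

In this example the final `Add` and `Stir` steps merge into one `AddStir` block that carries both the chemical and the stirring time.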

In [29]:
cq.condense_to_supertypes()
cq.condensed_sequence

Step supertype,name,new_chemicals,temp,time,Condensed steps
remove,Purify,[],[],[],1
add,Add,"[{'name': 'ZnO', 'mass': '0,2 g', 'other_amoun...",[],[],1
remove,Partition,"[{'name': 'water', 'mass': None, 'other_amount...",[],[],1
add,AddStir,"[{'name': 'zinc oxide', 'mass': None, 'other_a...",[],[for 10 min],2
remove,Yield,[],[],[],1
remove,PartitionWash,"[{'name': 'water', 'mass': None, 'other_amount...",[],[],2


In [31]:

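# Re-run chemical extraction on a single condensed block (row 1 after resetting the index)
cq.extract_chemicals(
    partial_sequence=pd.DataFrame(cq.condensed_sequence.reset_index().loc[1]).T
)
# Build the bill of materials for that block only, using local lookup caches
ingredients_sub_selection = cq.chemical_list.produce_bill_of_mats(
    identifier_cache_location='./id_cache.json',
    property_cache_location='./prop_cache.json',
)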

ingredients_sub_selection.bill_of_materials

pubchem_id,name,moles
12749,[2-methylimidazole],0.009744
14806,[ZnO],0.0
