# Literature Extraction Demo

In this notebook, we will show you how the extraction works.

Along with this notebook, we also show you a DemoDatabase.

## read the item from the database
In the SnyMOF database, we store general information of the structure inside an HTML file.
Since HTML files can be viewed by Web Browser (i.e., Google Chrome, Mozilla Firefox and Internet Explorer)
the general information can be accessed by almost every computer users.


Here we choose 2 strucutres as an example:
NUCPIM_clean
NUCPEI_clean

We choose them since they come from a same paper, and that paper is publicly accessible:

>Liu, Bo, et al. "Selective CO 2 adsorption in a microporous metal–organic framework with suitable pore sizes and open metal sites." Inorganic Chemistry Frontiers 2.6 (2015): 550-557.

You can find the general information of these two structures inside:
```
DemoDatabase/data_base/[name of the structure]
```

We first some information from the database

In [1]:
import cd_tools
from cd_tools.osvalkyrie import project_path
import os
lgrd = cd_tools.lgrd

data_dict = {}
for item in ['NUCPEI_clean','NUCPIM_clean']:
    reader = cd_tools.from_html(os.path.join(project_path(),r"DemoDatabase/data_base"),item)

    print(f'{item}:')
    print(f'\tmetal record inside CoRE MOF Database: {reader.core_item("core_All_Metals")}')
    print(f'\tDeposition Number inside CSD: {reader.csd_item("csd_dn")}')
    print()

    data_dict[item] = {'metal': reader.core_item("core_Open_Metal_Sites"),
                       'dn':    reader.csd_item("csd_dn")}


NUCPEI_clean:
	metal record inside CoRE MOF Database: Zn
	Deposition Number inside CSD: 1033728

NUCPIM_clean:
	metal record inside CoRE MOF Database: Zn
	Deposition Number inside CSD: 1033729



##  Literature Processing
You can find the full content of the paper (DOI: 10.1039/C5QI00025D) inside
```
Code\DemoDatabase\Pulication\P3535\full_page.html
```
It is downloaded from the publisher's website.

### paragraph classification
First we should read the content and choose the paragraphs that describe the synthesis.

> if something wrong happended, maybe check whether you already run:
> ```
> cde data download
> ```
> in advance

In [2]:
import chemdataextractor as cder
import cd_lib.chose_para as para

content_root = os.path.join(project_path(),r"DemoDatabase\Pulication\P3535")

# reading the content as a object
item_html_cde = cder.Document.from_file(os.path.join(content_root,'full_page.html'))
item = para.cde_paras(item_html_cde)

# search every paragraphs insides the contents to find whether it contains some synthesis paragraph or not?
if item.sny_sele():

    sny_para_list = item.sny_para_str()
    pot_sny_para_no = 0

    print(f'We find {len(sny_para_list)} synthesis paragraph(s), which is (are):')
    while pot_sny_para_no < len(sny_para_list):
        print(f'{pot_sny_para_no+1})')
        print(f'{sny_para_list[pot_sny_para_no]})')
        with open(os.path.join(content_root, 'pot_sny_para' + str(pot_sny_para_no) + '.txt'), 'w',
                  encoding='utf-8') as f:
            f.write(sny_para_list[pot_sny_para_no])
        pot_sny_para_no += 1
        print()


We find 2 synthesis paragraph(s), which is (are):
1)

A mixture of Zn(NO3)2·6H2O (0.1 mmol, 0.030 g), H5L (0.04 mmol, 0.018 g), DMF (1.5 mL) and water (0.5 mL) was placed in a screw-capped vial, then the vial was capped and placed in an oven at 105 °C for 72 h. The resulting block crystals were washed with DMF three times to give 1·DMF. The yield was ∼24.0 mg (72.6% based on H5L). Anal. Calcd for C65H74N7O27.5Zn4: C, 47.17; H, 4.51; N, 5.92. Found: C, 47.02; H, 4.78; N, 5.63. IR (cm−1): 3424m, 2965w, 2932w, 2807w, 2492w, 2026w, 1660s, 1628s, 1579s, 1450w, 1435w, 1390s, 1255w, 1163w, 1104m, 1061w, 1020w, 920w, 892w, 853w, 781s, 724s, 665w, 579w, 477w.
)

2)

A mixture of Zn(NO3)2·6H2O (0.1 mmol, 0.030 g), H5L (0.04 mmol, 0.018 g), DMA (1.5 mL) and water (1.0 mL) was placed in a screw-capped vial, then the vial was capped and placed in an oven at 105 °C for 72 h. The resulting block crystals were washed with DMA three times to give 1·DMA. The yield was ∼20.6 mg (60.4% based on H5L). Anal

Here you may find these synthesis paragraphs very similar to eah other, but they actually descripe the synthesis of 2 structures.

### Build relations between the structure and synthesis paragraph
To achieve this, we must know how many strucutures regesitered under this paper (DOI: 10.1039/C5QI00025D) in CSD database. you can get it from CSD python API, which I will not show you here. (Considering many institutes do not purchase this function)

Also, you can go to CCDC and search it (https://www.ccdc.cam.ac.uk/structures/Search?Doi=10.1039%2FC5QI00025D&DatabaseToSearch=Published)

In [3]:
print('Inside CSD, the paper (DOI: 10.1039/C5QI00025D) only contain 2 structures.')

# to check if number of extracted synthesis paragraphs matches the number of the structures

if len(data_dict) == 2:
    # Sort the structures in descending order of its Deposition Number
    sorted(data_dict.items(), key=lambda item:int(item[1]['dn']))

Inside CSD, the paper (DOI: 10.1039/C5QI00025D) only contain 2 structures.


Since the extracted paragraphs are arranged in the order of appearance in the text, so the relationship of each structure has been established:
>data_dict[0] => pot_sny_para0.txt
>
>data_dict[1] => pot_sny_para1.txt

### Use ChemicalTagger to prase the result and extract syntesis information inside
> We use python to call terminal to use ChemicalTagger
> If you have something wrong with the java, you also can excute these to use ChemicalTagger:
```
java -jar "[root]\Code\_CommonRedist\chemicalTagger-1.6-SNAPSHOT-jar-with-dependencies-file.jar" "[root]\Code\DemoDatabase\Pulication\P3535\mod_pot_sny_para0.txt" "[root]\Code\DemoDatabase\Pulication\P3535\chemtg0.xml"
java -jar "[root]\Code\_CommonRedist\chemicalTagger-1.6-SNAPSHOT-jar-with-dependencies-file.jar" "[root]\Code\DemoDatabase\Pulication\P3535\mod_pot_sny_para1.txt" "[root]\Code\DemoDatabase\Pulication\P3535\chemtg1.xml"
```

> You can find all the result inside:
```
Code\DemoDatabase\Pulication\P3535\full_page.html
```

In [4]:
import cd_lib.chetg as ctg
chemtg_location = os.path.join(project_path(),r'_CommonRedist\chemicalTagger-1.6-SNAPSHOT-jar-with-dependencies-file.jar')

for ind in range(len(data_dict)):
    text_loc = os.path.join(content_root, 'pot_sny_para' + str(ind) + '.txt')
    xml_type = ctg.chemtgp(text_loc, chemtg_location= chemtg_location, opt_put='chemtg' + str(ind))
    data_dict[list(data_dict.keys())[ind]]['xml'] = xml_type

Then we read the result and extract synthesis information insides

In [5]:
from cd_lib import onlystr
from cd_lib.pcplib import metal_table, nonmetal_table

for i in range(len(data_dict)):

    item_name = list(data_dict.keys())[i]

    # read the result
    item_file = ctg.ctg_xml_par(xml_type)

    # the process to select condition
    csv_cont = [item_name, '']

    def csv_con_bp(chemical_name):
        bp_man = ['calc', 'TGA', 'Teflon', 'plate']
        breakpoint_list = metal_table + nonmetal_table + bp_man
        for i in chemical_name.strip().split('-'):
            if i in breakpoint_list:
                return True
        else:
            return False

    try:
        tt_ele_cla = ctg.tt_classifer(item_file.ope_list())
        tt_re = ctg.ht_ident(tt_ele_cla)
    except ctg.NoTemTimeError as e:
        lgrd.warning('{}_{}:{}'.format(item_name, repr(e), str(e)))
        csv_cont[1] = csv_cont[1] + 'T'
        tt_re = {'temp': '', 'temp_u': '', 'time': '', 'time_u': ''}
    except (TypeError, ValueError) as e:
        lgrd.warn('{}_{}:{}'.format(item_name, repr(e), str(e)))
        csv_cont[1] = csv_cont[1] + 'T'
        tt_re = {'temp': '', 'temp_u': '', 'time': '', 'time_u': ''}

    for i in ['temp', 'temp_u', 'time', 'time_u']:
        csv_cont.append(onlystr(tt_re[i]))

    try:
        item_yield = ctg.yield_out(item_file.yield_list())
    except Exception as err:
        lgrd.warn('{}_{}:{}'.format(item_name, repr(err), str(err)))
        item_yield = None
    finally:
        if item_yield is None:
            csv_cont.append('')
        else:
            csv_cont.append(item_yield)
    # 到这, 已经前面的坑全部占了

    item_metal = data_dict[item_name]['metal'].split(',')

    item_metal = set([x.strip() for x in item_metal])

    metal_cont = []
    chem_cont = []


    try:
        item_chemical_list_row = ctg.cc_i(item_file.molecular_list())
        item_chemical_list = ctg.mixture_iden_no_qua(item_chemical_list_row)
        item_chemical_out = ctg.cc_out_no_qua(item_chemical_list)
        item_chemical_out = ctg.cc_table_no_qua(item_chemical_out)
    except (ctg.ChemicalListError, ctg.NullChemicalNameError, ValueError, AssertionError, TypeError) as e:
        csv_cont[1] = csv_cont[1] + 'C'
        lgrd.warning('{}_{}:{}'.format(item_name, repr(e), str(e)))
        item_chemical_out = []

    if len(item_chemical_out) > 2:

        for i in item_chemical_out:
            if len(i) == 3 and i['name'] != "":
                table = ['name', 'cid', 'chem_role']
                chemicals_info = [i[x] for x in table]
            else:
                csv_cont[1] = csv_cont[1] + 'B'
                chemicals_info = []

            if len(chemicals_info) == 3:
                if csv_con_bp(chemicals_info[0]):
                    break
                else:
                    if chemicals_info[-1].strip() != "" and (chemicals_info[-1].strip() in item_metal) and (len(metal_cont) < 15):
                        metal_cont = metal_cont + chemicals_info
                    else:
                        chem_cont = chem_cont + chemicals_info
            else:
                csv_cont[1] = csv_cont[1] + 'C'
    else:
        csv_cont[1] = csv_cont[1] + 'S'
    lgrd.info(f'{item_name}:{metal_cont}\t{chem_cont}')

    if len(metal_cont) == 0:
        csv_cont[1] = csv_cont[1] + 'M'

    while len(metal_cont) < 15:
        metal_cont = metal_cont + [""]

    if len(chem_cont) > 30:
        chem_cont = chem_cont[:30]
        csv_cont[1] = csv_cont[1] + 'L'
    csv_cont = csv_cont + metal_cont + chem_cont
    csv_cont = [onlystr(x) for x in csv_cont]
    data_dict[item_name]['metal_extracted'] = metal_cont
    data_dict[item_name]['condition_extracted'] = csv_cont
    data_dict[item_name]['chemical_extracted'] = chem_cont

[2021-08-02 13:42:04,819] - [C&C] - [<ipython-input-5-1204d41b7484> file line:90] - INFO: NUCPEI_clean:['Zn(NO3)2·6H2O', None, 'Zn', 'C64H85N6O31.5Zn4', None, 'Zn']	['H5L', None, '', 'Dimethylacetamide', 31374, 'Sol', 'water', 962, 'Sol', '1·DMA', None, '', 'H5L', None, '']
[2021-08-02 13:42:10,058] - [C&C] - [<ipython-input-5-1204d41b7484> file line:90] - INFO: NUCPIM_clean:['Zn(NO3)2·6H2O', None, 'Zn', 'C64H85N6O31.5Zn4', None, 'Zn']	['H5L', None, '', 'Dimethylacetamide', 31374, 'Sol', 'water', 962, 'Sol', '1·DMA', None, '', 'H5L', None, '']


The last 2 step:
1. Check with the metal information in CoRE MOF Data  to confirm the information
2. Generate a table

In [6]:
import pandas as pd
pd_dict = []
for i in range(len(data_dict)):

    item_name = list(data_dict.keys())[i]

    # if not all the Open Metal Sites extracted in the literature, we skip it.
    extracted_metal_list = set(data_dict[item_name]['metal_extracted'][2::3])
    for metal in data_dict[item_name]['metal']:
        if metal not in extracted_metal_list:
            continue
    temp_dic= {}
    temp_dic['filename'] = item_name
    temp_dic['temperature'] =data_dict[item_name]['condition_extracted'][2]
    temp_dic['temperature_unit'] =data_dict[item_name]['condition_extracted'][3]
    temp_dic['time']=data_dict[item_name]['condition_extracted'][4]
    temp_dic['unit']=data_dict[item_name]['condition_extracted'][5]
    solves = []
    additves = []
    ind = 1
    while ind < len(data_dict[item_name]['chemical_extracted']):
        if data_dict[item_name]['chemical_extracted'][ind+1] == "Sol":
            solves.append(data_dict[item_name]['chemical_extracted'][ind])
        elif data_dict[item_name]['chemical_extracted'][ind+1] == "Addi":
            solves.append(data_dict[item_name]['chemical_extracted'][ind])
        ind += 3
    while len(solves)<5:
        solves.append('')
    while len(additves)<5:
        additves.append('')

    for p in range(5):
        temp_dic[f'solvent{p+1}'] = solves[p]

    for p in range(5):
        temp_dic[f'additive{p+1}'] = additves[p]

    pd_dict.append(temp_dic)
df = pd.DataFrame(pd_dict)

In the end, we get the synthesis information about the structure:

In [7]:
df



Unnamed: 0,filename,temperature,temperature_unit,time,unit,solvent1,solvent2,solvent3,solvent4,solvent5,additive1,additive2,additive3,additive4,additive5
0,NUCPEI_clean,105.0,°C,72,h,31374,962,,,,,,,,
1,NUCPIM_clean,105.0,°C,72,h,31374,962,,,,,,,,
