# MOF ChemUnity Matching 

This notebook uses our tools to match CSD reference codes (refcodes) to the MOF names and co-references found in the corresponding synthesis papers.

### Preparation of CSD Data

First, the CSD data must be prepared for injection into the prompt. For each DOI we wish to process, we gather the relevant information for every CSD code associated with that DOI.

Over 20 000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in the CSD
- Are also found in either the QMOF or CoRE databases

This way, every MOF in our database already has computed properties available from QMOF or CoRE, and these properties can be merged into our database at the end.
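The selection rule above can be sketched with pandas. This is a minimal illustration, not the filtering code actually used: the refcodes, DOIs, and the `qmof_codes` / `core_codes` sets are hypothetical placeholders, and the real QMOF and CoRE tables may identify structures differently.

```python
import pandas as pd

# Hypothetical toy CSD table; refcodes and DOIs are placeholders, not real entries.
csd = pd.DataFrame({
    "CSD code": ["REFAAA", "REFBBB", "REFCCC"],
    "DOI": ["10.0000/example.1", "10.0000/example.2", "10.0000/example.1"],
})

# Hypothetical sets of refcodes assumed to be present in each external database.
qmof_codes = {"REFAAA"}
core_codes = {"REFCCC"}

# Keep CSD entries that also appear in QMOF or CoRE (set union of the two).
in_qmof_or_core = csd["CSD code"].isin(qmof_codes | core_codes)
selected = csd[in_qmof_or_core]

# The DOIs to text mine are the unique DOIs of the selected structures.
selected_dois = sorted(selected["DOI"].unique())
```

Here two structures from the same paper survive the filter, so a single DOI is queued for text mining.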

In [8]:
# Imports
import pandas as pd
from src.MOF_ChemUnity.utils.DataPrep import Data_Prep

In [2]:
# Define path to all CSD info extracted from CSD Python API
csd_info_path = 'data/Benchmark_set_2/Ground Truth/CoRE_QMOF_expanded_w_synonyms.csv'

In [3]:
# Define path to the folder containing all papers to be text mined
doi_folder_path = '/home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/PDF'

In [4]:
# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]
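To illustrate how a feature list like this could be used, the sketch below keeps only the listed columns and groups the rows by DOI, so that each paper carries the records of its associated CSD codes. It is an assumption about the general idea, not the internals of `Data_Prep`; the helper `records_by_doi`, the shortened `features` list, and the `Extra column` field are hypothetical.

```python
import pandas as pd

# Shortened feature list for illustration; the notebook's full list has more columns.
features = ["CSD code", "DOI", "Space group", "Metal types"]

# Small table built from two rows shown in this notebook, plus a hypothetical extra column.
master = pd.DataFrame({
    "CSD code": ["GENGEL", "DIKPUJ"],
    "DOI": ["10.1107/S0108270187012125", "10.1107/S0108270107034646"],
    "Space group": ["P21/a", "P21/c"],
    "Metal types": ["K", "Cd"],
    "Extra column": ["dropped", "dropped"],
})

def records_by_doi(df, features):
    """Return {DOI: [record, ...]} using only the requested feature columns."""
    subset = df[features]
    return {
        doi: group.drop(columns="DOI").to_dict(orient="records")
        for doi, group in subset.groupby("DOI")
    }

grouped = records_by_doi(master, features)
```

Each value in `grouped` is a list of plain dictionaries, which is straightforward to serialize into prompt text.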

In [5]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(doi_folder_path, csd_info_path, feature_list)

In [6]:
# Collect the selected CSD fields for each paper found in the folder
publication_data = Prepare_Data.gather_info()


Missing DOIs: {'10.1107/S0108270198006660', '10.1107/S1600536811000419', '10.1107/S0108270108026760', '10.1107/S0108270104026502', '10.1107/S0108270185004498', '10.1107/S1600536806040360', '10.1107/S1600536806042899', '10.1107/S1600536809005212', '10.1107/S1600536809007879', '10.1107/S1600536806010841', '10.1107/S1600536803006871', '10.1107/S0567740871005958', '10.1107/S1600536804010402', '10.1107/S1600536806018733', '10.1107/S1600536811001814', '10.1107/S0108270101004231', '10.1107/S1600536805017150', '10.1107/S1600536810041590', '10.1107/S1600536810010536', '10.1107/S1600536809051381', '10.1107/S0108270103026568', '10.1107/S1600536807035726', '10.1107/S0108270191009484', '10.1107/S0108270191004341', '10.1107/S0108270100007435', '10.1107/S0108270112025577', '10.1107/S1600536810001182', '10.1107/S1600536810045903', '10.1107/S0108270104011011', '10.1107/S0108270188014271', '10.1107/S1600536803021445', '10.1107/S1600536811010099', '10.1107/S1600536811014759', '10.1107/S0108270101001615'
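The file names in this notebook follow the pattern `10.1107_S0108270187012125.pdf` for the DOI `10.1107/S0108270187012125`. A check like the one reported above could be sketched as follows. This assumes that "missing" means DOIs expected from the CSD table with no matching file in the folder, which the notebook does not state; `doi_from_filename` is a hypothetical helper, not part of `Data_Prep`.

```python
from pathlib import Path

def doi_from_filename(file_name):
    """Recover a DOI from a file name of the form '<prefix>_<suffix>.pdf'."""
    stem = Path(file_name).stem               # drop the '.pdf' extension
    prefix, _, suffix = stem.partition("_")   # split on the first underscore only
    return f"{prefix}/{suffix}"

# DOIs expected from the CSD table (both appear in this notebook's output).
expected_dois = {"10.1107/S0108270187012125", "10.1107/S0108270198006660"}

# File names assumed to be present in the paper folder.
file_names = ["10.1107_S0108270187012125.pdf"]

found_dois = {doi_from_filename(name) for name in file_names}
missing_dois = expected_dois - found_dois
```

Splitting on the first underscore only matters because a DOI suffix may itself contain underscores.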

In [10]:
publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1107/S0108270187012125,10.1107_S0108270187012125.pdf,pdf,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",GENGEL,catena-((μ5-Dihydrogen glutarato)-(μ3-dihydrog...,P21/a,K,C40H60K4O32,[],9.392,12.782,11.147
1,10.1107/S0108270107034646,10.1107_S0108270107034646.pdf,pdf,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",DIKPUJ,"catena-(bis(μ2-6-Methylpicolinato-N,O,O)-cadmi...",P21/c,Cd,C56Cd4H48N8O16,[],6.8284,10.6888,18.5505
2,10.1107/S1600536809011593,10.1107_S1600536809011593.pdf,pdf,/home/tom-pruyn/Documents/TDM Papers/Processin...,Journal(Acta Crystallographica Section E: Stru...,WOQXIJ03,catena-[diaqua-(μ3-succinato)-cadmium(ii)],P21/c,Cd,C16Cd4H32O24,[],7.713,12.231,8.056
3,10.1107/S0108270105039259,10.1107_S0108270105039259.pdf,pdf,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",MAZVAL,"catena-((μ2-Fluoro)-(μ2-squarato-O,O')-diaqua-...",P21/n,V,C8F2H8O12V2,[],3.779,11.207,7.841
4,10.1107/S0108270106025625,10.1107_S0108270106025625.pdf,pdf,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",PELTEG,catena-((μ4-(N-(phosphonatomethyl)ammonio)acet...,Cc,Cd,C6Cd2H20N2O14P2,['catena-((μ4-N-(phosphonatomethyl)glycinato)-...,9.827,4.9326,16.795
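The `Synonyms` column displays values such as `[]` and `['catena-...`. If these are stored as strings after a CSV round trip (an assumption; the notebook does not show the column's dtype), they can be turned back into Python lists with `ast.literal_eval`. The rows below are hypothetical placeholders.

```python
import ast
import pandas as pd

# Hypothetical rows with list-like strings, as they would look after reading a CSV.
df = pd.DataFrame({
    "CSD code": ["REFAAA", "REFBBB"],
    "Synonyms": ["[]", "['name A', 'name B']"],
})

# literal_eval safely parses Python literals without executing arbitrary code.
df["Synonyms"] = df["Synonyms"].apply(ast.literal_eval)

# Count how many synonyms each structure has.
df["n_synonyms"] = df["Synonyms"].apply(len)
```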
