This is the repository for PatCID: an open-access dataset of chemical structures in patent documents.
Create a virtual environment.
conda create -n patcid python=3.11
conda activate patcid
Install dependencies.
pip install -e .
The PatCID dataset is available on Zenodo.
wget "https://zenodo.org/records/10572870/files/patcid.zip?download=1" -O patcid.zip
unzip patcid.zip -d ./data/patcid/
(Download size: 5.7 GB, file format: .jsonl)
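Since the files are in .jsonl format, they can be read one line at a time without loading a whole file into memory. The following is a minimal sketch, not part of the repository: the helper name `iter_jsonl_records` is hypothetical, and it assumes that each non-empty line holds one JSON object and that the archive was extracted to ./data/patcid/ as above. The record fields are not described here, so the example only prints the field names of the first record.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_records(data_dir: str) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of every .jsonl file under data_dir."""
    # rglob also finds files in subdirectories, in case the archive is nested.
    for path in sorted(Path(data_dir).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # Inspect the first record to discover which fields the dataset provides.
    first = next(iter_jsonl_records("./data/patcid/"), None)
    if first is None:
        print("No .jsonl records found under ./data/patcid/")
    else:
        print(sorted(first.keys()))
```

Because the function is a generator, it streams records lazily, which matters for a multi-gigabyte dataset.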
Run the notebook ./examples/molecule_query.ipynb to use PatCID to retrieve documents referencing a molecule of interest.
Run the notebook ./examples/patent_query.ipynb to use PatCID to retrieve molecules displayed in a given patent document.
user_interface.mp4
To request access to the above user interface, please contact the IBM Deep Search team at deepsearch-core@zurich.ibm.com.
The benchmark datasets D2C-UNI and D2C-RND are available on Zenodo.
The code repositories used to build and evaluate PatCID are available:
For segmenting chemical-structure images from documents, we use DECIMER Segmentation from K. Rajan, H. O. Brinkhaus, M. Sorokina, A. Zielesny and C. Steinbeck.
The model weights are available on Hugging Face:
- The classification model
- The recognition model
The training datasets are available on Zenodo and Hugging Face:
To test our processing pipeline outside its main application domain, we process a scientific publication published on ChemRxiv. ./data/extra/scientific_paper_example/ contains the pages of the document (page_*.png) annotated with the segmentation and classification predictions. For pages containing molecules, the predicted molecules are provided in page_*_molecules.txt.
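The prediction files in this example directory can be collected with a short script. This is a sketch under assumptions: the helper name `read_predicted_molecules` is hypothetical, and because the internal layout of the page_*_molecules.txt files is not described here, each non-empty line is kept as an opaque string rather than parsed.

```python
from pathlib import Path


def read_predicted_molecules(example_dir: str) -> dict[str, list[str]]:
    """Map each page_*_molecules.txt file name to its non-empty lines."""
    predictions = {}
    for path in sorted(Path(example_dir).glob("page_*_molecules.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        # Keep lines as raw strings; their format is an unknown here.
        predictions[path.name] = [line.strip() for line in lines if line.strip()]
    return predictions


if __name__ == "__main__":
    results = read_predicted_molecules("./data/extra/scientific_paper_example/")
    for file_name, entries in results.items():
        print(f"{file_name}: {len(entries)} line(s)")
```

Pages without molecules have no matching text file, so they do not appear in the returned dictionary.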
If you find this repository useful, please consider citing:
@Article{Morin2024,
author={Morin, Lucas
and Weber, Val{\'e}ry
and Meijer, Gerhard Ingmar
and Yu, Fisher
and Staar, Peter W. J.},
title={PatCID: an open-access dataset of chemical structures in patent documents},
journal={Nature Communications},
year={2024},
month={Aug},
day={02},
volume={15},
number={1},
pages={6532},
issn={2041-1723},
doi={10.1038/s41467-024-50779-y},
url={https://doi.org/10.1038/s41467-024-50779-y}
}