This repository contains the official implementation of DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science Articles. The model is trained on 4,408 distantly supervised tables from published materials science research papers to extract the compositions reported in those tables. These tables yield a total of 38,799 tuples in the training set.
The tuples are of the form $(id, c_k^{id}, p_k^{id}, u_k^{id})$, where:
- $id$ is the id of the material composition reported in the tables
- $c_k^{id}$ is the k-th chemical element in the material composition
- $p_k^{id}$ is the percentage of the k-th chemical element in the material composition
- $u_k^{id}$ is the unit of $p_k^{id}$ (either mole % or weight %)
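To make the tuple format concrete, here is a minimal Python sketch of how one extracted composition could be represented and grouped by its id. It is an illustration only, not the data format used in this repository: the class name `CompositionTuple`, the helper `group_by_composition`, the unit strings, and the sample values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CompositionTuple:
    """One extracted tuple (hypothetical representation)."""
    comp_id: str       # id: identifier of the material composition
    element: str       # c_k^{id}: k-th chemical element
    percentage: float  # p_k^{id}: percentage of that element
    unit: str          # u_k^{id}: "mol%" or "wt%" (assumed spellings)


def group_by_composition(tuples):
    """Collect tuples into {comp_id: {element: (percentage, unit)}}."""
    grouped = defaultdict(dict)
    for t in tuples:
        grouped[t.comp_id][t.element] = (t.percentage, t.unit)
    return dict(grouped)


# Made-up sample: a single composition "A1" with two elements.
example = [
    CompositionTuple("A1", "Cu", 60.0, "wt%"),
    CompositionTuple("A1", "Zn", 40.0, "wt%"),
]

grouped = group_by_composition(example)
total = sum(pct for pct, _ in grouped["A1"].values())
```

Grouping by `comp_id` makes simple sanity checks possible, such as confirming that the percentages of one composition add up to roughly 100.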
The following figure represents the architecture of the model proposed in our work.
- The code directory contains the file for training the models reported in the paper.
- The data directory contains the dataset for training the models reported in the paper.
- The notebooks directory contains a Jupyter notebook to visualise the dataset.
- The respective directories and sub-directories contain task-specific README files.
If you find this repository useful, please cite our work as follows:
@inproceedings{gupta-etal-2023-discomat,
title = "{D}i{SC}o{M}a{T}: Distantly Supervised Composition Extraction from Tables in Materials Science Articles",
author = "Gupta, Tanishq and
Zaki, Mohd and
Khatsuriya, Devanshi and
Hira, Kausik and
Krishnan, N M Anoop and
-, Mausam",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.753",
pages = "13465--13483",
}