This repository contains the code for training and inference of a MatBERT-CRF model for Named Entity Recognition (NER) on CO2 capture scientific literature.
We developed an NLP pipeline that automatically extracts key performance indicators from CO2 capture research papers. The model combines MatBERT (a BERT model pretrained on materials science text) with a CRF layer that enforces valid IOBES tag sequences. It recognizes eight entity types:
| Entity | Description | Example |
|---|---|---|
| CONCENTRATION | CO2 feed concentration | "15 vol%", "400 ppm" |
| TEMPERATURE | Operating/regeneration temperature | "120°C", "40-60°C" |
| PRESSURE | System pressure | "1 bar", "0.1 MPa" |
| MATERIAL | Sorbents, absorbents, membranes | "MEA", "zeolite 13X", "PVDF" |
| REMOVAL_EFFICIENCY | CO2 capture rate/capacity | "90%", "3.5 mol/kg" |
| COST | Capture or avoidance cost | "$50/tCO2" |
| ENERGY_SRD | Thermal energy (Specific Reboiler Duty) | "3.5 GJ/tCO2" |
| ENERGY_SEC | Electrical energy (Specific Energy Consumption) | "250 kWh/tCO2" |
```
co2-capture-ner/
├── models/
│   ├── base_ner_model.py          # Base NER model class with training/evaluation
│   ├── bert_model.py              # BERT-CRF model implementation
│   └── crf.py                     # CRF layer with IOBES constraints
├── utils/
│   ├── data.py                    # Data loading and preprocessing
│   └── metrics.py                 # Evaluation metrics (P/R/F1)
├── run_train.py                   # Training script
├── run_inference_ensemble.py      # Ensemble inference script
├── data_split.json                # DOI-based train/val/test split
└── README.md
```
Requirements (listed in `requirements.txt`):

```
torch>=1.9.0
transformers>=4.5.0
numpy
pandas
tqdm
seqeval
chemdataextractor
```
```
git clone https://github.com/yourusername/co2-capture-ner.git
cd co2-capture-ner
pip install -r requirements.txt
```

Download the MatBERT weights and place them in `model_matbert/matbert-base-uncased/`.
```
python run_train.py
```

Configuration (modify in `run_train.py`):

```python
device = "cuda:0"                        # GPU device
n_epochs = 30                            # Maximum epochs
batch_size = 16                          # Batch size
learning_rate = 2e-4                     # Learning rate
datafile = "./dataset/your_data.csv"     # Training data path
```

Output:
- Model checkpoints: `matbert_IOBES_{date}_Best/seed{k}/0/best.pt`
- Training logs: `loss.txt`, `summary.txt`
The ensemble inference script averages logits from 5 models (trained with different seeds) and applies CRF decoding for consistent IOBES predictions.
```
python run_inference_ensemble.py
```

Configuration (modify paths in the script):

```python
TRAINED_MODEL_DIR = 'path/to/trained/models'
INPUT_DATA_DIR = 'path/to/input/csv/files'
OUTPUT_DIR = 'path/to/output'
```

Input CSV columns:

| Column | Description |
|---|---|
| file_name | DOI or document identifier |
| x_train | List of tokens (as string) |
| y_train | List of IOBES labels (as string) |
Example:

```
file_name,x_train,y_train
10.1016_j.cej.2021.130362,"['The', 'CO2', 'concentration', 'was', '15', 'vol%']","['O', 'O', 'O', 'O', 'B-CONCENTRATION', 'E-CONCENTRATION']"
```

Alternatively, pre-tokenized text can be supplied with one token per line:
```
The
CO2
concentration
was
15
vol%
```
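Because the CSV stores the token and label lists as strings, they need to be parsed back into Python lists when loading. A minimal sketch using the example row above (`ast.literal_eval` is a safe way to do this):

```python
import ast

# Example row from the CSV: x_train/y_train are stringified Python lists.
row = {
    "file_name": "10.1016_j.cej.2021.130362",
    "x_train": "['The', 'CO2', 'concentration', 'was', '15', 'vol%']",
    "y_train": "['O', 'O', 'O', 'O', 'B-CONCENTRATION', 'E-CONCENTRATION']",
}

# Safely evaluate the stored strings back into Python lists.
tokens = ast.literal_eval(row["x_train"])
labels = ast.literal_eval(row["y_train"])

assert len(tokens) == len(labels)  # one IOBES label per token
```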
```
Input Text
    ↓
MatBERT Encoder (768-dim)
    ↓
Dropout (0.1)
    ↓
Linear Classifier (768 → 33 tags)
    ↓
CRF Layer (IOBES constraints)
    ↓
Output Tags
```
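The classifier step can be sketched in NumPy: each 768-dimensional encoder state is projected to 33 tag scores (8 entity types × B/I/E/S, plus O). Random arrays stand in for the real MatBERT outputs and trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden, n_tags = 6, 768, 33        # 33 = 8 entity types x 4 (B/I/E/S) + O
encoder_out = rng.normal(size=(seq_len, hidden))  # stand-in for MatBERT hidden states

# Linear classifier: one score per tag for every token (illustrative random weights).
W = rng.normal(size=(hidden, n_tags))
b = np.zeros(n_tags)
emissions = encoder_out @ W + b

assert emissions.shape == (seq_len, n_tags)  # per-token tag scores fed to the CRF
```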
- B-: Beginning of entity
- I-: Inside of entity
- O: Outside (not an entity)
- E-: End of entity
- S-: Single-token entity
The CRF layer enforces valid transitions (e.g., B-MATERIAL can only be followed by I-MATERIAL or E-MATERIAL).
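These constraints can be illustrated with a small helper in plain Python (hypothetical, for illustration only; the repo's CRF layer enforces the equivalent rules internally via transition scores):

```python
def valid_transition(prev: str, curr: str) -> bool:
    """Return True if `curr` may follow `prev` under the IOBES scheme."""
    p, c = prev[0], curr[0]
    # B- and I- must continue the *same* entity with I- or E-.
    if p in ("B", "I"):
        return c in ("I", "E") and prev[2:] == curr[2:]
    # O, E-, and S- close a segment: the next tag must start fresh.
    return c in ("O", "B", "S")
```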
- Data Split: DOI-based stratified split (197 train / 24 val / 25 test)
- Seeds: 5 random seeds for statistical reliability
- Early Stopping: Based on validation F1 score (patience=100)
- Optimizer: AdamW with linear warmup scheduler
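The linear-warmup schedule can be sketched as a small function: the learning rate ramps up linearly from zero, then decays linearly back to zero (the step counts below are illustrative assumptions, not values from the repo):

```python
def linear_warmup_lr(step: int, base_lr: float = 2e-4,
                     warmup_steps: int = 100, total_steps: int = 1000) -> float:
    """Linear warmup followed by linear decay (illustrative step counts)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up to base_lr
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```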
The ensemble approach:
- Load 5 models trained with different random seeds
- For each input, collect logits from all models
- Average the logits element-wise
- Apply CRF Viterbi decoding to averaged logits
- Output consistent IOBES tag sequences
This method improves robustness and maintains valid tag sequences through CRF constraints.
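The steps above can be sketched with NumPy (random arrays stand in for the five models' real logits, and a zero matrix stands in for the trained CRF transition scores):

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Standard Viterbi decode: emissions is (seq_len, n_tags);
    transitions[i, j] is the score of moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous tag for each current tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):  # follow back-pointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
n_models, seq_len, n_tags = 5, 6, 33
# Stand-ins for the per-model emission scores (logits) of one sentence.
all_logits = rng.normal(size=(n_models, seq_len, n_tags))

avg = all_logits.mean(axis=0)              # element-wise average over the 5 models
transitions = np.zeros((n_tags, n_tags))   # illustrative; invalid moves would be -inf
tags = viterbi(avg, transitions)

assert len(tags) == seq_len
```

With an all-zero transition matrix the decode reduces to a per-token argmax; the trained CRF's transition scores are what rule out invalid IOBES sequences.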
If you use this code, please cite:
```bibtex
@article{your_paper,
  title={Automated Information Extraction from CO2 Capture Literature using MatBERT-CRF},
  author={Your Name},
  journal={Journal Name},
  year={2025}
}
```

This project is licensed under the MIT License.
- MatBERT - Materials Science BERT
- Hugging Face Transformers
- pytorch-crf