This repository contains the code for training and inference of a MatBERT-CRF model for Named Entity Recognition (NER) on CO2 capture scientific literature.
We developed an NLP pipeline that automatically extracts key performance indicators from CO2 capture research papers. The model combines MatBERT (a BERT model pretrained on materials science text) with a CRF layer that enforces valid IOBES tag sequences. It recognizes eight entity types:
| Entity | Description | Example |
|---|---|---|
| CONCENTRATION | CO2 feed concentration | "15 vol%", "400 ppm" |
| TEMPERATURE | Operating/regeneration temperature | "120°C", "40-60°C" |
| PRESSURE | System pressure | "1 bar", "0.1 MPa" |
| MATERIAL | Sorbents, absorbents, membranes | "MEA", "zeolite 13X", "PVDF" |
| REMOVAL_EFFICIENCY | CO2 capture rate/capacity | "90%", "3.5 mol/kg" |
| COST | Capture or avoidance cost | "$50/tCO2" |
| ENERGY_SRD | Thermal energy (Specific Reboiler Duty) | "3.5 GJ/tCO2" |
| ENERGY_SEC | Electrical energy (Specific Energy Consumption) | "250 kWh/tCO2" |
```
co2-capture-ner/
├── models/
│   ├── base_ner_model.py          # Base NER model class with training/evaluation
│   ├── bert_model.py              # BERT-CRF model implementation
│   └── crf.py                     # CRF layer with IOBES constraints
├── utils/
│   ├── data.py                    # Data loading and preprocessing
│   └── metrics.py                 # Evaluation metrics (P/R/F1)
├── run_train.py                   # Training script
├── run_inference_ensemble.py      # Ensemble inference script
├── data_split.json                # DOI-based train/val/test split
└── README.md
```
Requirements (listed in `requirements.txt`):

```
torch>=1.9.0
transformers>=4.5.0
numpy
pandas
tqdm
seqeval
chemdataextractor
```
```
git clone https://github.com/yourusername/co2-capture-ner.git
cd co2-capture-ner
pip install -r requirements.txt
```

Download the MatBERT weights and place them in `model_matbert/matbert-base-uncased/`.
```
python run_train.py
```

Configuration (modify in `run_train.py`):

```python
device = "cuda:0"                        # GPU device
n_epochs = 30                            # Maximum epochs
batch_size = 16                          # Batch size
learning_rate = 2e-4                     # Learning rate
datafile = "./dataset/your_data.csv"     # Training data path
```

Output:
- Model checkpoints: `matbert_IOBES_{date}_Best/seed{k}/0/best.pt`
- Training logs: `loss.txt`, `summary.txt`
The ensemble inference script averages logits from 5 models (trained with different seeds) and applies CRF decoding for consistent IOBES predictions.
```
python run_inference_ensemble.py
```

Configuration (modify paths in the script):

```python
TRAINED_MODEL_DIR = 'path/to/trained/models'
INPUT_DATA_DIR = 'path/to/input/csv/files'
OUTPUT_DIR = 'path/to/output'
```

Input CSV columns:

| Column | Description |
|---|---|
| file_name | DOI or document identifier |
| x_train | List of tokens (as string) |
| y_train | List of IOBES labels (as string) |
Example:

```
file_name,x_train,y_train
10.1016_j.cej.2021.130362,"['The', 'CO2', 'concentration', 'was', '15', 'vol%']","['O', 'O', 'O', 'O', 'B-CONCENTRATION', 'E-CONCENTRATION']"
```

Alternatively, pre-tokenized text can be supplied with one token per line:
```
The
CO2
concentration
was
15
vol%
```
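Because the CSV stores the token and label lists as strings, they need to be parsed back into Python lists when loading. A minimal sketch using the example row above (`ast.literal_eval` is a safe way to do this):

```python
import ast

# Example row from the CSV: x_train/y_train are stringified Python lists.
row = {
    "file_name": "10.1016_j.cej.2021.130362",
    "x_train": "['The', 'CO2', 'concentration', 'was', '15', 'vol%']",
    "y_train": "['O', 'O', 'O', 'O', 'B-CONCENTRATION', 'E-CONCENTRATION']",
}

# Safely evaluate the stored strings back into Python lists.
tokens = ast.literal_eval(row["x_train"])
labels = ast.literal_eval(row["y_train"])

assert len(tokens) == len(labels)  # one IOBES label per token
```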
```
Input Text
    ↓
MatBERT Encoder (768-dim)
    ↓
Dropout (0.1)
    ↓
Linear Classifier (768 → 33 tags)
    ↓
CRF Layer (IOBES constraints)
    ↓
Output Tags
```
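The classifier step can be sketched in NumPy: each 768-dimensional encoder state is projected to 33 tag scores (8 entity types × B/I/E/S, plus O). Random arrays stand in for the real MatBERT outputs and trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden, n_tags = 6, 768, 33        # 33 = 8 entity types x 4 (B/I/E/S) + O
encoder_out = rng.normal(size=(seq_len, hidden))  # stand-in for MatBERT hidden states

# Linear classifier: one score per tag for every token (illustrative random weights).
W = rng.normal(size=(hidden, n_tags))
b = np.zeros(n_tags)
emissions = encoder_out @ W + b

assert emissions.shape == (seq_len, n_tags)  # per-token tag scores fed to the CRF
```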
- B-: Beginning of entity
- I-: Inside of entity
- O: Outside (not an entity)
- E-: End of entity
- S-: Single-token entity
The CRF layer enforces valid transitions (e.g., B-MATERIAL can only be followed by I-MATERIAL or E-MATERIAL).
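These constraints can be illustrated with a small helper in plain Python (hypothetical, for illustration only; the repo's CRF layer enforces the equivalent rules internally via transition scores):

```python
def valid_transition(prev: str, curr: str) -> bool:
    """Return True if `curr` may follow `prev` under the IOBES scheme."""
    p, c = prev[0], curr[0]
    # B- and I- must continue the *same* entity with I- or E-.
    if p in ("B", "I"):
        return c in ("I", "E") and prev[2:] == curr[2:]
    # O, E-, and S- close a segment: the next tag must start fresh.
    return c in ("O", "B", "S")
```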
- Data Split: DOI-based stratified split (197 train / 24 val / 25 test)
- Seeds: 5 random seeds for statistical reliability
- Early Stopping: Based on validation F1 score (patience=100)
- Optimizer: AdamW with linear warmup scheduler
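The linear-warmup schedule can be sketched as a small function: the learning rate ramps up linearly from zero, then decays linearly back to zero (the step counts below are illustrative assumptions, not values from the repo):

```python
def linear_warmup_lr(step: int, base_lr: float = 2e-4,
                     warmup_steps: int = 100, total_steps: int = 1000) -> float:
    """Linear warmup followed by linear decay (illustrative step counts)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up to base_lr
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```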
The ensemble approach:
- Load 5 models trained with different random seeds
- For each input, collect logits from all models
- Average the logits element-wise
- Apply CRF Viterbi decoding to averaged logits
- Output consistent IOBES tag sequences
This method improves robustness and maintains valid tag sequences through CRF constraints.
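The steps above can be sketched with NumPy (random arrays stand in for the five models' real logits, and a zero matrix stands in for the trained CRF transition scores):

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
    """Standard Viterbi decode: emissions is (seq_len, n_tags);
    transitions[i, j] is the score of moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)   # best previous tag for each current tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):  # follow back-pointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
n_models, seq_len, n_tags = 5, 6, 33
# Stand-ins for the per-model emission scores (logits) of one sentence.
all_logits = rng.normal(size=(n_models, seq_len, n_tags))

avg = all_logits.mean(axis=0)              # element-wise average over the 5 models
transitions = np.zeros((n_tags, n_tags))   # illustrative; invalid moves would be -inf
tags = viterbi(avg, transitions)

assert len(tags) == seq_len
```

With an all-zero transition matrix the decode reduces to a per-token argmax; the trained CRF's transition scores are what rule out invalid IOBES sequences.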
If you use this code, please cite:
```bibtex
@article{your_paper,
  title={Automated Information Extraction from CO2 Capture Literature using MatBERT-CRF},
  author={Your Name},
  journal={Journal Name},
  year={2025}
}
```

This project is licensed under the MIT License.
- MatBERT - Materials Science BERT
- Hugging Face Transformers
- pytorch-crf