
MatBERT-CRF for CO2 Capture Named Entity Recognition

This repository contains the code for training and inference of a MatBERT-CRF model for Named Entity Recognition (NER) on CO2 capture scientific literature.

Overview

We developed an NLP pipeline to automatically extract key performance indicators from CO2 capture research papers. The model uses MatBERT (Materials Science BERT) with a CRF layer to ensure valid IOBES tag sequences.

Entity Types (8 classes)

| Entity | Description | Example |
|---|---|---|
| CONCENTRATION | CO2 feed concentration | "15 vol%", "400 ppm" |
| TEMPERATURE | Operating/regeneration temperature | "120°C", "40-60°C" |
| PRESSURE | System pressure | "1 bar", "0.1 MPa" |
| MATERIAL | Sorbents, absorbents, membranes | "MEA", "zeolite 13X", "PVDF" |
| REMOVAL_EFFICIENCY | CO2 capture rate/capacity | "90%", "3.5 mol/kg" |
| COST | Capture or avoidance cost | "$50/tCO2" |
| ENERGY_SRD | Thermal energy (Specific Reboiler Duty) | "3.5 GJ/tCO2" |
| ENERGY_SEC | Electrical energy (Specific Energy Consumption) | "250 kWh/tCO2" |

Repository Structure

├── models/
│   ├── base_ner_model.py    # Base NER model class with training/evaluation
│   ├── bert_model.py        # BERT-CRF model implementation
│   └── crf.py               # CRF layer with IOBES constraints
├── utils/
│   ├── data.py              # Data loading and preprocessing
│   └── metrics.py           # Evaluation metrics (P/R/F1)
├── run_train.py             # Training script
├── run_inference_ensemble.py # Ensemble inference script
├── data_split.json          # DOI-based train/val/test split
└── README.md

Requirements

torch>=1.9.0
transformers>=4.5.0
numpy
pandas
tqdm
seqeval
chemdataextractor

Installation

git clone https://github.com/yourusername/co2-capture-ner.git
cd co2-capture-ner
pip install -r requirements.txt

MatBERT Model

Download the MatBERT weights and place them in model_matbert/matbert-base-uncased/.

Usage

Training

python run_train.py

Configuration (modify in run_train.py):

device = "cuda:0"           # GPU device
n_epochs = 30               # Maximum epochs
batch_size = 16             # Batch size
learning_rate = 2e-4        # Learning rate
datafile = "./dataset/your_data.csv"  # Training data path

Output:

  • Model checkpoints: matbert_IOBES_{date}_Best/seed{k}/0/best.pt
  • Training logs: loss.txt, summary.txt

Inference (Ensemble)

The ensemble inference script averages logits from 5 models (trained with different seeds) and applies CRF decoding for consistent IOBES predictions.

python run_inference_ensemble.py

Configuration (modify paths in the script):

TRAINED_MODEL_DIR = 'path/to/trained/models'
INPUT_DATA_DIR = 'path/to/input/csv/files'
OUTPUT_DIR = 'path/to/output'

Data Format

Training Data (CSV)

| Column | Description |
|---|---|
| file_name | DOI or document identifier |
| x_train | List of tokens (as string) |
| y_train | List of IOBES labels (as string) |

Example:

file_name,x_train,y_train
10.1016_j.cej.2021.130362,"['The', 'CO2', 'concentration', 'was', '15', 'vol%']","['O', 'O', 'O', 'O', 'B-CONCENTRATION', 'E-CONCENTRATION']"
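Because the x_train and y_train columns store Python-style lists serialized as strings, they need to be parsed back into lists after loading. A minimal sketch (the inline CSV row is just the example above, not a real data file):

```python
import ast
from io import StringIO

import pandas as pd

# The example row from above, inlined for illustration.
csv_text = '''file_name,x_train,y_train
10.1016_j.cej.2021.130362,"['The', 'CO2', 'concentration', 'was', '15', 'vol%']","['O', 'O', 'O', 'O', 'B-CONCENTRATION', 'E-CONCENTRATION']"
'''

df = pd.read_csv(StringIO(csv_text))

# The list-valued columns are stored as strings; parse them back into lists.
df["x_train"] = df["x_train"].apply(ast.literal_eval)
df["y_train"] = df["y_train"].apply(ast.literal_eval)

tokens, labels = df.loc[0, "x_train"], df.loc[0, "y_train"]
assert len(tokens) == len(labels)  # every token has exactly one IOBES label
print(list(zip(tokens, labels))[:2])  # [('The', 'O'), ('CO2', 'O')]
```

ast.literal_eval is used rather than eval because it only accepts Python literals, which is safer for data loaded from files.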

Inference Data (CSV)

Pre-tokenized text with one token per line:

The
CO2
concentration
was
15
vol%

Model Architecture

Input Text
    ↓
MatBERT Encoder (768-dim)
    ↓
Dropout (0.1)
    ↓
Linear Classifier (768 → 33 tags)
    ↓
CRF Layer (IOBES constraints)
    ↓
Output Tags
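The 33-way classifier output in the diagram follows directly from the tag inventory: 8 entity classes × 4 positional prefixes (B/I/E/S) plus the O tag. A quick sanity check (the tag ordering here is illustrative and may differ from the repo's actual label map):

```python
# 8 entity classes x 4 positional prefixes + the O tag = 33 labels.
ENTITY_TYPES = [
    "CONCENTRATION", "TEMPERATURE", "PRESSURE", "MATERIAL",
    "REMOVAL_EFFICIENCY", "COST", "ENERGY_SRD", "ENERGY_SEC",
]
PREFIXES = ["B", "I", "E", "S"]

tags = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in PREFIXES]
print(len(tags))  # 33, matching the linear classifier's output dimension
```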

IOBES Tagging Scheme

  • B-: Beginning of entity
  • I-: Inside of entity
  • O: Outside (not an entity)
  • E-: End of entity
  • S-: Single-token entity

The CRF layer enforces valid transitions (e.g., B-MATERIAL can only be followed by I-MATERIAL or E-MATERIAL).
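These constraints can be sketched as a simple predicate (the actual crf.py likely encodes them as a transition-score mask inside the CRF layer; this is an illustration, not the repo's code):

```python
def is_valid_transition(prev_tag: str, next_tag: str) -> bool:
    """Sketch of the IOBES transition constraints the CRF layer enforces."""
    prev_prefix, _, prev_type = prev_tag.partition("-")
    next_prefix, _, next_type = next_tag.partition("-")

    if prev_prefix in ("O", "E", "S"):
        # Any entity is closed (or we are outside one):
        # only O, a new B-, or a single-token S- may follow.
        return next_prefix in ("O", "B", "S")
    if prev_prefix in ("B", "I"):
        # Inside an open entity: it must continue (I-) or end (E-)
        # with the same entity type.
        return next_prefix in ("I", "E") and next_type == prev_type
    return False

print(is_valid_transition("B-MATERIAL", "E-MATERIAL"))  # True
print(is_valid_transition("B-MATERIAL", "O"))           # False
```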

Training Strategy

  • Data Split: DOI-based stratified split (197 train / 24 val / 25 test)
  • Seeds: 5 random seeds for statistical reliability
  • Early Stopping: Based on validation F1 score (patience=100)
  • Optimizer: AdamW with linear warmup scheduler
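The linear-warmup schedule used with AdamW can be written out explicitly: the learning rate ramps linearly from zero to the base rate during warmup, then decays linearly back to zero. The warmup and total step counts below are illustrative assumptions, not the repo's settings:

```python
def linear_warmup_lr(step, base_lr=2e-4, warmup_steps=300, total_steps=3000):
    """Linear warmup followed by linear decay (illustrative step counts)."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

print(linear_warmup_lr(0))     # 0.0
print(linear_warmup_lr(300))   # 0.0002 (peak, end of warmup)
print(linear_warmup_lr(3000))  # 0.0
```

In practice this shape is what transformers' get_linear_schedule_with_warmup produces when stepped once per batch alongside the AdamW optimizer.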

Ensemble Inference

The ensemble approach:

  1. Load 5 models trained with different random seeds
  2. For each input, collect logits from all models
  3. Average the logits element-wise
  4. Apply CRF Viterbi decoding to averaged logits
  5. Output consistent IOBES tag sequences

This method improves robustness and maintains valid tag sequences through CRF constraints.
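The five steps above can be sketched with NumPy (function and argument names here are illustrative; the repo applies the CRF layer's own Viterbi decoder rather than this standalone version):

```python
import numpy as np

def ensemble_decode(per_model_logits, transitions):
    """Average logits across models, then Viterbi-decode with CRF
    transition scores. per_model_logits: list of (seq_len, n_tags) arrays;
    transitions: (n_tags, n_tags) array of prev -> next scores."""
    avg = np.mean(per_model_logits, axis=0)  # element-wise logit average
    seq_len, n_tags = avg.shape

    # Viterbi forward pass over the averaged emission scores.
    score = avg[0].copy()
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        cand = score[:, None] + transitions + avg[t][None, :]  # (prev, next)
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)

    # Follow back-pointers from the best final tag.
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Toy example: 2 models, 2 tokens, 2 tags, no transition preferences.
toy = [np.array([[1., 0.], [0., 1.]]), np.array([[3., 0.], [0., 3.]])]
print(ensemble_decode(toy, np.zeros((2, 2))))  # [0, 1]
```

Averaging logits before decoding (rather than voting on decoded tags) lets the single Viterbi pass guarantee a valid IOBES sequence for the ensemble output.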

Citation

If you use this code, please cite:

@article{your_paper,
  title={Automated Information Extraction from CO2 Capture Literature using MatBERT-CRF},
  author={Your Name},
  journal={Journal Name},
  year={2025}
}

License

This project is licensed under the MIT License.

Acknowledgments
