BERT for uncertainty detection

Bachelor's thesis available via this link.

Project description

Dataset

The models are trained to detect uncertain words and sentences in text, along with the class of uncertainty:

Dynamic. Indicates necessity, dispositions, external circumstances, wishes, intents, plans, and desires. Example: I have to go.
Doxastic. Expresses the speaker’s beliefs. Example: He believes that the Earth is flat.
Investigation. Propositions whose truth value cannot be stated until further analysis is done. Example: We examined the role of NF-kappa B in protein activation.
Conditional. Used for conditionals. Example: If it rains, we’ll stay in.
Epistemic. Uncertainty for which it is known that the proposition is neither true nor false. Example: It may be raining.
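
To make the relation between word-level and sentence-level detection concrete, here is a minimal sketch. It assumes each token carries either one of the five class names above or an "O" tag for non-cue tokens, and that a sentence counts as uncertain when it contains at least one cue. The label strings and helper names are hypothetical and may differ from those in labels.txt and the repository code.

```python
# The five uncertainty classes described above. "O" marks a token that is
# not an uncertainty cue. These label strings are an assumption and may
# differ from the ones stored in the repository's labels.txt.
CLASSES = ["Dynamic", "Doxastic", "Investigation", "Conditional", "Epistemic"]
OUTSIDE = "O"


def is_uncertain(token_tags):
    """A sentence counts as uncertain if at least one token is a cue."""
    return any(tag != OUTSIDE for tag in token_tags)


def sentence_labels(token_tags):
    """Return the sorted set of uncertainty classes present in a sentence."""
    return sorted({tag for tag in token_tags if tag != OUTSIDE})


# Hand-written example based on the Epistemic sentence above.
tokens = ["It", "may", "be", "raining", "."]
tags = ["O", "Epistemic", "O", "O", "O"]
print(is_uncertain(tags), sentence_labels(tags))  # True ['Epistemic']
```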

The dataset used here is the re-annotated CoNLL-2010 shared task dataset for uncertainty detection (Szeged Uncertainty Corpus). It's available for download here. The dataset consists of two main parts:

Wikipedia (WikiWeasel).
Biological (BioScope).

Goal

The goal of this project was to compare the performance of two different BERT training procedures:

Train a domain-specific model (SciBERT or BioBERT) on the biological part of the dataset.
Train the general-domain BERT on the Wikipedia part and perform transfer learning.

For each setup, 20 models are trained with different random seeds, and their F1-scores are compared using statistical tests.
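
This README does not say which statistical tests were used. As one possible illustration of comparing two sets of per-seed F1-scores, the sketch below runs a two-sided permutation test on the difference of means using only the standard library. The function name, the choice of test, and the example scores are assumptions for illustration, not the thesis procedure or its results.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(scores_a, scores_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Returns an estimated p-value for the null hypothesis that both score
    lists come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(_mean(scores_a) - _mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled scores to the two groups.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Placeholder numbers only; these are NOT results from this project.
f1_setup_a = [0.91, 0.92, 0.93, 0.94, 0.95]
f1_setup_b = [0.11, 0.12, 0.13, 0.14, 0.15]
p_value = permutation_test(f1_setup_a, f1_setup_b)
print(f"p-value: {p_value:.4f}")
```

In practice each list would hold the 20 per-seed F1-scores of one training procedure. A permutation test makes no normality assumption, which is one reason to consider it for small samples of this kind.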

Results

I showed that, on this dataset, it is not possible to conclude which approach yields better results. Most importantly, SciBERT almost always outperforms BioBERT and benefits less from additional Wikipedia data. If you decide to train a domain-specific language model, train it from a random initialization with a domain-specific vocabulary rather than starting from the BERT initialization.

Model

The SciBERT model for uncertainty detection on biological texts is available on my Google Drive.

Demo

The demo lets you experiment with the model and annotate arbitrary text for uncertainty.

Instructions to run the demo:

Clone the repository:

git clone https://github.com/PeterZhizhin/BERTUncertaintyDetection

Create a virtualenv and install dependencies:

python -m venv .env
source .env/bin/activate
pip install -U spacy
python -m spacy download en_core_web_sm
pip install aiohttp jinja2 aiohttp-jinja2 transformers torch torchvision

Go to the demo folder:

cd demo

Download the model, extract the archive, and note the path to the model folder.
Run the server:

python demo_server.py --model_path [PATH TO FOLDER WITH THE MODEL] --labels_path ../labels.txt

Train the models yourself

All models were trained on the Slurm cluster of the National Research University Higher School of Economics, so by default all training scripts require a Slurm cluster with GPUs. To train models without a Slurm cluster, modify the training scripts.

Clone the repo:

git clone https://github.com/PeterZhizhin/BERTUncertaintyDetection

Install dependencies:

python -m venv .env
source .env/bin/activate
pip install -U spacy
python -m spacy download en_core_web_sm
pip install aiohttp jinja2 aiohttp-jinja2 transformers torch torchvision lxml

Download the dataset, extract it, and place it in the uncertainty_dataset folder.
Make all shell scripts executable:

chmod +x *.sh
chmod +x huggingface_models/*.sh

Create the datasets for training and evaluation:

./create_biomedical_ner_dataset_train_test.sh
./create_biomedical_classification_dataset_train_test.sh
./create_wiki_classification_dataset_train_test.sh
./create_wiki_ner_dataset_train_test.sh

Train all models:

cd huggingface_models
sbatch --wait ./train_all_models_on_wiki_and_bio_slurm.sh
sbatch --wait ./transfer_all_models_on_wiki_and_bio_slurm.sh
sbatch --wait ./train_all_classification_models_on_wiki_and_bio_slurm.sh
sbatch --wait ./transfer_all_classification_models_on_wiki_and_bio_slurm.sh

All models are now available in ner_experiments and classification_experiment.
