SlovakBERT based Named Entity Recognition Model implementation for the Slovak language

Raychani1/Text_Parsing_Methods_Using_NLP

Text Parsing Methods using NLP

About The Project

The main goal of this project is to develop a deep learning model for Named Entity Recognition (NER) in Slovak. The model is based on Gerulata/SlovakBERT and fine-tuned on web-scraped Slovak news articles. The finished model supports the following IOB-tagged entity categories: Person, Organisation, Location, Date, Time, Money and Percentage.
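As an illustration of what IOB tagging over these categories looks like, here is a minimal sketch in plain Python. It is not taken from the project's code: the label spellings, the `iob_to_spans` helper and the sample sentence are assumptions made for this example, and the model's real label names may differ.

```python
# Entity categories as listed in the project description.
CATEGORIES = ["Person", "Organisation", "Location", "Date", "Time", "Money", "Percentage"]

# IOB label set: "O" for tokens outside any entity, plus B-/I- tags per category.
# (Assumed spelling; the trained model may name its labels differently.)
LABELS = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]


def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (category, text) entity spans."""
    spans = []
    current_cat, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, cat = tag.partition("-")
        if prefix == "B" or (prefix == "I" and cat != current_cat):
            # A B- tag (or a stray I- tag of another category) starts a new entity.
            if current_tokens:
                spans.append((current_cat, " ".join(current_tokens)))
            current_cat, current_tokens = cat, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                spans.append((current_cat, " ".join(current_tokens)))
            current_cat, current_tokens = None, []
    if current_tokens:
        spans.append((current_cat, " ".join(current_tokens)))
    return spans


# Hand-made example sentence, not model output.
tokens = ["Peter", "Novák", "pracuje", "v", "Bratislave"]
tags = ["B-Person", "I-Person", "O", "O", "B-Location"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

The B- prefix marks the first token of an entity and I- marks its continuation, so two adjacent entities of the same category can still be told apart.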


Related Work

Thesis

Hugging Face Model


Built With

Python 3.10 NumPy Pandas Seaborn Plotly Datasets Transformers Ray WandB Scikit PyTorch


Best Model Training Parameters

Parameter Value
per_device_train_batch_size 4
per_device_eval_batch_size 4
learning_rate 5e-05
adam_beta1 0.9
adam_beta2 0.999
adam_epsilon 1e-08
num_train_epochs 15
lr_scheduler_type linear
seed 42
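To make the `lr_scheduler_type linear` setting concrete, the sketch below computes the learning rate a linear decay schedule would give at a given optimizer step. It rests on assumptions not stated in the table: no warmup steps, and a schedule planned over all 15 epochs at 70 steps per epoch (the step count reported in the training history). The `linear_lr` helper is hypothetical and written only for this illustration.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


STEPS_PER_EPOCH = 70    # taken from the training history table
NUM_TRAIN_EPOCHS = 15   # num_train_epochs
total_steps = STEPS_PER_EPOCH * NUM_TRAIN_EPOCHS

# Learning rate at the start, at the end of epoch 8, and at the final step.
lr_start = linear_lr(0, total_steps)
lr_epoch_8 = linear_lr(8 * STEPS_PER_EPOCH, total_steps)
lr_end = linear_lr(total_steps, total_steps)
print(lr_start, lr_epoch_8, lr_end)
```

Under these assumptions the rate at the end of epoch 8 is 5e-05 × 490 / 1050, a little under half of its initial value.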

Best Model Training History

The best model results were reached in the 8th training epoch.

Training Loss Epoch Step Validation Loss Precision Recall F1 Accuracy
0.6721 1.0 70 0.2214 0.6972 0.7308 0.7136 0.9324
0.1849 2.0 140 0.1697 0.8056 0.8365 0.8208 0.952
0.0968 3.0 210 0.1213 0.882 0.8622 0.872 0.9728
0.0468 4.0 280 0.1107 0.8372 0.907 0.8708 0.9684
0.0415 5.0 350 0.1644 0.8059 0.8782 0.8405 0.9615
0.0233 6.0 420 0.1255 0.8576 0.8878 0.8724 0.9716
0.0198 7.0 490 0.1383 0.8545 0.8846 0.8693 0.9703
0.0133 8.0 560 0.1241 0.884 0.9038 0.8938 0.9735
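The table can be checked with a few lines of plain Python. This sketch recomputes F1 from the logged precision and recall and picks the best epoch, assuming that "best" means highest F1; the README does not state the selection criterion, so that is an assumption.

```python
# (epoch, training loss, validation loss, precision, recall, f1, accuracy),
# copied from the training history table.
HISTORY = [
    (1, 0.6721, 0.2214, 0.6972, 0.7308, 0.7136, 0.9324),
    (2, 0.1849, 0.1697, 0.8056, 0.8365, 0.8208, 0.9520),
    (3, 0.0968, 0.1213, 0.8820, 0.8622, 0.8720, 0.9728),
    (4, 0.0468, 0.1107, 0.8372, 0.9070, 0.8708, 0.9684),
    (5, 0.0415, 0.1644, 0.8059, 0.8782, 0.8405, 0.9615),
    (6, 0.0233, 0.1255, 0.8576, 0.8878, 0.8724, 0.9716),
    (7, 0.0198, 0.1383, 0.8545, 0.8846, 0.8693, 0.9703),
    (8, 0.0133, 0.1241, 0.8840, 0.9038, 0.8938, 0.9735),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Epoch with the highest logged F1 (assumed selection criterion).
best_by_f1 = max(HISTORY, key=lambda row: row[5])
# Epoch with the lowest validation loss, for comparison.
lowest_val_loss = min(HISTORY, key=lambda row: row[2])

print("best F1 epoch:", best_by_f1[0])
print("lowest validation loss epoch:", lowest_val_loss[0])
```

Note that the lowest validation loss falls in epoch 4, while F1 and accuracy peak in epoch 8, so the choice of criterion matters when selecting a checkpoint.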

Best Model Results

Dataset distribution for final evaluation:

NER Tag Number of Tokens
O 6568
B-Person 96
I-Person 83
B-Organization 583
I-Organization 585
B-Location 59
I-Location 15
B-Date 113
I-Date 87
Time 5
B-Money 44
I-Money 74
B-Percentage 57
I-Percentage 54
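The distribution is heavily skewed toward tokens outside any entity. The following sketch merges the B- and I- counts from the table per category and computes each category's share of all tokens. The tag spellings and the `category_of` helper are choices made for this example, and `Time` is kept without a prefix exactly as the table lists it.

```python
from collections import Counter

# Token counts copied from the dataset distribution table.
TOKEN_COUNTS = {
    "O": 6568,
    "B-Person": 96, "I-Person": 83,
    "B-Organization": 583, "I-Organization": 585,
    "B-Location": 59, "I-Location": 15,
    "B-Date": 113, "I-Date": 87,
    "Time": 5,
    "B-Money": 44, "I-Money": 74,
    "B-Percentage": 57, "I-Percentage": 54,
}


def category_of(tag):
    """Strip the B-/I- prefix; tags without a prefix are returned unchanged."""
    return tag.split("-", 1)[1] if "-" in tag else tag


# Merge B- and I- counts per entity category.
per_category = Counter()
for tag, count in TOKEN_COUNTS.items():
    per_category[category_of(tag)] += count

total = sum(TOKEN_COUNTS.values())
share = {cat: count / total for cat, count in per_category.items()}

for cat, count in per_category.most_common():
    print(f"{cat:<13}{count:>6}{share[cat]:>8.2%}")
```

With classes this unbalanced, token-level accuracy is dominated by the outside tag, which is one reason to read the macro-averaged metrics alongside the overall ones.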

Confusion Matrix of the final evaluation: image


Evaluation metrics of the final evaluation:

Precision Macro-Precision Recall Macro-Recall F1 Macro-F1 Accuracy
0.9897 0.9715 0.9897 0.9433 0.9895 0.9547 0.9897
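The macro columns are lower than the overall ones, which is typical for unbalanced classes. The README does not say which averaging produces the non-macro columns, so the sketch below is only a generic illustration on made-up labels: it contrasts a macro average (every class counts equally) with a support-weighted average (every token counts equally). The helper names are hypothetical.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, support)} for token-level labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores[label] = (precision, recall, actual)
    return scores


def macro_and_weighted(scores, index):
    """Average column `index` (0 = precision, 1 = recall) two ways."""
    total_support = sum(s[2] for s in scores.values())
    macro = sum(s[index] for s in scores.values()) / len(scores)
    weighted = sum(s[index] * s[2] for s in scores.values()) / total_support
    return macro, weighted


# Toy data (NOT the project's evaluation set): one rare class, half of it missed.
y_true = ["O"] * 8 + ["B-Time"] * 2
y_pred = ["O"] * 8 + ["B-Time", "O"]

scores = per_class_scores(y_true, y_pred)
macro_recall, weighted_recall = macro_and_weighted(scores, index=1)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro_recall, weighted_recall, accuracy)
```

In this toy case a single missed token of the rare class pulls macro recall well below the weighted value, while the support-weighted recall coincides with plain accuracy.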

Model Prediction Output Example

prediction_output


Getting Started

To get a local copy up and running follow these simple steps.


Prerequisites

  • Python 3.10.x - It may already be installed on your Linux distribution. On other operating systems, you can get it from the official website, the Microsoft Store, or through Windows Subsystem for Linux (WSL).

Setup and Usage

  1. Clone the repo and navigate to the Project folder

    git clone https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP
    cd Text_Parsing_Methods_Using_NLP
  2. Create a new Python Virtual Environment

    python -m venv venv
  3. Activate the Virtual Environment

    On Linux:

    source ./venv/bin/activate

    On Windows:

    .\venv\Scripts\Activate.ps1
  4. Install Project dependencies

    pip install -r requirements.txt
  5. Update Weights & Biases configuration (Optional)

    WAND_ENV_VARIABLES = {
        'WANDB_API_KEY': 'YOUR-WANDB-API-KEY',
        'WANDB_PROJECT': 'YOUR-WANDB-PROJECT',
        'WANDB_LOG_MODEL': 'true',
        'WANDB_WATCH': 'false'
    }
  6. Run main script (with prepared use-cases)

    python main.py
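For the optional Weights & Biases configuration in step 5, the keys in the dictionary are standard W&B environment variable names. One way such a dictionary could be applied is by exporting it into the process environment before any W&B run is started. The `apply_env` helper below is hypothetical; how the project itself consumes `WAND_ENV_VARIABLES` is not shown in this README.

```python
import os

# Same structure as in step 5; replace the placeholders with your own values.
WAND_ENV_VARIABLES = {
    'WANDB_API_KEY': 'YOUR-WANDB-API-KEY',
    'WANDB_PROJECT': 'YOUR-WANDB-PROJECT',
    'WANDB_LOG_MODEL': 'true',
    'WANDB_WATCH': 'false'
}


def apply_env(variables):
    """Export each key/value pair as an environment variable of this process."""
    for key, value in variables.items():
        # Environment values must be strings.
        os.environ[key] = str(value)


# Assumed usage: call this before training starts so the logger can read the values.
apply_env(WAND_ENV_VARIABLES)
print(os.environ['WANDB_PROJECT'])
```

Keeping the real API key out of version control (for example by reading it from your shell environment instead of editing the file) avoids committing credentials by accident.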

License

Distributed under the MIT License. See LICENSE for more information.


Acknowledgments

Gerulata / SlovakBERT (Hugging Face Model)

Crabz / SlovakBERT-NER (Hugging Face Model)

Rohan Paul / YT_Fine_tuning_BERT_NER_v1 (Tutorial)
