Text Parsing Methods using NLP

About The Project

The main goal of this project is to develop a deep learning model for Named Entity Recognition (NER) in Slovak. The model is based on Gerulata/SlovakBERT and fine-tuned on web-scraped Slovak news articles. The finished model supports the following IOB-tagged entity categories: Person, Organisation, Location, Date, Time, Money and Percentage.
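To illustrate what IOB tagging means in practice, here is a minimal sketch that merges per-token tags into entity spans. It is not taken from the project code: the function name `group_iob_entities` and the sample sentence are made up for illustration, and the sketch simply drops an `I-` tag that does not continue an open entity.

```python
from typing import List, Optional, Tuple


def group_iob_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (entity_text, category) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_category: Optional[str] = None

    def close_entity() -> None:
        # Store the entity collected so far (if any) and reset the buffer.
        nonlocal current_tokens, current_category
        if current_tokens and current_category is not None:
            entities.append((" ".join(current_tokens), current_category))
        current_tokens, current_category = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            close_entity()
            current_tokens, current_category = [token], tag[2:]
        elif tag.startswith("I-") and current_category == tag[2:]:
            # An I- tag continues the entity of the same category.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity.
            close_entity()

    close_entity()
    return entities


# Hypothetical example sentence and tags.
tokens = ["Peter", "Novák", "pracuje", "v", "Bratislave"]
tags = ["B-Person", "I-Person", "O", "O", "B-Location"]
print(group_iob_entities(tokens, tags))
# [('Peter Novák', 'Person'), ('Bratislave', 'Location')]
```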

Related Work

Built With

Best Model Training Parameters

Parameter	Value
per_device_train_batch_size	4
per_device_eval_batch_size	4
learning_rate	5e-05
adam_beta1	0.9
adam_beta2	0.999
adam_epsilon	1e-08
num_train_epochs	15
lr_scheduler_type	linear
seed	42
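The parameter names in the table match keyword arguments of the Hugging Face `transformers.TrainingArguments` class, so they could be passed along the following lines. This is only a sketch: the helper `build_training_arguments` and its `output_dir` argument are hypothetical, the project may set further options not listed here, and the 0.999 value is taken to be `adam_beta2`, the name `TrainingArguments` uses for the second Adam coefficient.

```python
# Values copied from the table above.
BEST_MODEL_TRAINING_PARAMS = {
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 4,
    "learning_rate": 5e-05,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "num_train_epochs": 15,
    "lr_scheduler_type": "linear",
    "seed": 42,
}


def build_training_arguments(output_dir: str):
    """Create TrainingArguments from the table values (hypothetical helper)."""
    # Imported lazily so the dictionary is usable without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **BEST_MODEL_TRAINING_PARAMS)
```

Calling `build_training_arguments("output")` requires the `transformers` package; the resulting object can then be handed to a `Trainer`.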

Best Model Training History

Best model results are reached in the 8th training epoch.

Training Loss	Epoch	Step	Validation Loss	Precision	Recall	F1	Accuracy
0.6721	1.0	70	0.2214	0.6972	0.7308	0.7136	0.9324
0.1849	2.0	140	0.1697	0.8056	0.8365	0.8208	0.952
0.0968	3.0	210	0.1213	0.882	0.8622	0.872	0.9728
0.0468	4.0	280	0.1107	0.8372	0.907	0.8708	0.9684
0.0415	5.0	350	0.1644	0.8059	0.8782	0.8405	0.9615
0.0233	6.0	420	0.1255	0.8576	0.8878	0.8724	0.9716
0.0198	7.0	490	0.1383	0.8545	0.8846	0.8693	0.9703
0.0133	8.0	560	0.1241	0.884	0.9038	0.8938	0.9735
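The choice of the best epoch can be reproduced from the table. The sketch below uses only the epoch, validation loss and F1 columns; the `EpochResult` type is made up for illustration, and the source does not state which metric the project used for model selection. Selecting by F1 gives epoch 8, while the lowest validation loss occurs in epoch 4.

```python
from typing import List, NamedTuple


class EpochResult(NamedTuple):
    epoch: int
    validation_loss: float
    f1: float


# Epoch, validation loss and F1 copied from the training history table.
HISTORY: List[EpochResult] = [
    EpochResult(1, 0.2214, 0.7136),
    EpochResult(2, 0.1697, 0.8208),
    EpochResult(3, 0.1213, 0.8720),
    EpochResult(4, 0.1107, 0.8708),
    EpochResult(5, 0.1644, 0.8405),
    EpochResult(6, 0.1255, 0.8724),
    EpochResult(7, 0.1383, 0.8693),
    EpochResult(8, 0.1241, 0.8938),
]

best_by_f1 = max(HISTORY, key=lambda result: result.f1)
best_by_loss = min(HISTORY, key=lambda result: result.validation_loss)

print(best_by_f1.epoch)    # 8
print(best_by_loss.epoch)  # 4
```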

Best Model Results

Dataset distribution for final evaluation:

NER Tag	Number of Tokens
O	6568
B-Person	96
I-Person	83
B-Organization	583
I-Organization	585
B-Location	59
I-Location	15
B-Date	113
I-Date	87
Time	5
B-Money	44
I-Money	74
B-Percentage	57
I-Percentage	54
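The distribution is heavily skewed towards the outside tag, which matters when reading the accuracy figures. As a rough illustration (not part of the project code), the following sketch sums the token counts from the table and computes the share of outside tokens, which comes to about 78%. A model that tagged every token as outside would reach roughly that accuracy on this set.

```python
# Token counts copied from the table above.
TOKEN_COUNTS = {
    "O": 6568,
    "B-Person": 96,
    "I-Person": 83,
    "B-Organization": 583,
    "I-Organization": 585,
    "B-Location": 59,
    "I-Location": 15,
    "B-Date": 113,
    "I-Date": 87,
    "Time": 5,
    "B-Money": 44,
    "I-Money": 74,
    "B-Percentage": 57,
    "I-Percentage": 54,
}

total_tokens = sum(TOKEN_COUNTS.values())
outside_share = TOKEN_COUNTS["O"] / total_tokens

print(total_tokens)             # 8423
print(round(outside_share, 4))  # 0.7798
```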

Confusion Matrix of the final evaluation:

Evaluation metrics of the final evaluation:

Precision	Macro-Precision	Recall	Macro-Recall	F1	Macro-F1	Accuracy
0.9897	0.9715	0.9897	0.9433	0.9895	0.9547	0.9897
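The macro columns are lower than the overall ones because macro averaging gives every tag the same weight, so rare tags with weaker scores pull the average down. The sketch below shows the difference on a small made-up example. It is an assumption about the general technique, not the project's evaluation code, and the source does not state which averaging the non-macro columns use.

```python
from collections import Counter
from typing import Dict, List


def averaged_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute token accuracy plus macro-averaged precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label

    def safe_div(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    precisions = [safe_div(tp[label], tp[label] + fp[label]) for label in labels]
    recalls = [safe_div(tp[label], tp[label] + fn[label]) for label in labels]

    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        # Macro averages: unweighted mean over labels.
        "macro_precision": safe_div(sum(precisions), len(labels)),
        "macro_recall": safe_div(sum(recalls), len(labels)),
    }


# Made-up toy data: one of two B-Person tokens is missed.
y_true = ["O", "O", "O", "O", "B-Person", "B-Person", "B-Date"]
y_pred = ["O", "O", "O", "O", "B-Person", "O", "B-Date"]
scores = averaged_scores(y_true, y_pred)
```

In this toy case a single missed `B-Person` token lowers accuracy only slightly (6 of 7 tokens are correct), while the recall of 0.5 for `B-Person` brings macro recall down to 2.5 / 3.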

Model Prediction Output Example

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Python 3.10.x - It may already be installed on your Linux distribution. On other operating systems you can get it from the official website, the Microsoft Store, or through Windows Subsystem for Linux (WSL).
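If you are unsure which interpreter is active, a quick check like the one below can help. It is only a convenience sketch; the function name `is_supported_python` is made up, and the project itself may not enforce the version.

```python
import sys
from typing import Sequence


def is_supported_python(version_info: Sequence[int] = sys.version_info) -> bool:
    """Return True when the major and minor version are 3.10."""
    return tuple(version_info[:2]) == (3, 10)


if not is_supported_python():
    print("Warning: this project lists Python 3.10.x as a prerequisite.")
```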

Setup and Usage

Clone the repo and navigate to the Project folder

git clone https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP
cd Text_Parsing_Methods_Using_NLP

Create a new Python Virtual Environment
```
python -m venv venv
```

Activate the Virtual Environment

On Linux:

source ./venv/bin/activate

On Windows:

.\venv\Scripts\Activate.ps1

Install Project dependencies
```
pip install -r requirements.txt
```

Update Weights & Biases configuration (Optional)

WAND_ENV_VARIABLES = {
    'WANDB_API_KEY': 'YOUR-WANDB-API-KEY',
    'WANDB_PROJECT': 'YOUR-WANDB-PROJECT',
    'WANDB_LOG_MODEL': 'true',
    'WANDB_WATCH': 'false'
}
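The keys in this dictionary are environment variable names read by Weights & Biases and by the Hugging Face `Trainer` integration. How the project applies them is not shown here; one plausible way, sketched under that assumption, is to export them into the process environment before training starts. The helper name `apply_wandb_environment` is hypothetical.

```python
import os
from typing import Dict

# Placeholder values as in the configuration above; replace with your own.
WAND_ENV_VARIABLES = {
    "WANDB_API_KEY": "YOUR-WANDB-API-KEY",
    "WANDB_PROJECT": "YOUR-WANDB-PROJECT",
    "WANDB_LOG_MODEL": "true",
    "WANDB_WATCH": "false",
}


def apply_wandb_environment(variables: Dict[str, str]) -> None:
    """Export the settings so that libraries started later can read them."""
    os.environ.update(variables)
```

Calling `apply_wandb_environment(WAND_ENV_VARIABLES)` before the training run makes the values visible to any library that reads them from the environment.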

Run main script (with prepared use-cases)
```
python main.py
```

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

Gerulata / SlovakBERT (Hugging Face Model)

Crabz / SlovakBERT-NER (Hugging Face Model)

Rohan Paul / YT_Fine_tuning_BERT_NER_v1 (Tutorial)
