A solution for analyzing large numbers of documents containing information about German top managers
The Hawk project helps you work with text documents by finding the necessary information (tags).
Here are the main reasons to use this solution:
- Hawk analyzes hundreds of documents in a few minutes and highlights the important information
- It is reliable according to the metrics on the validation dataset
- The neural network saves a great deal of money on the routine analysis of documents
Of course, Hawk will need additional training after some time. As with every NN project, it needs new data to remain a state-of-the-art solution.
Hawk was created using the following technologies: Python, spaCy, and PyTorch with the Hugging Face Transformers library.
You can find all steps in data analysis and modeling at analysis/Hawk_modeling.ipynb
- The large spaCy German pipeline (de_core_news_lg) is used for tagging the texts.
- All special symbols (including umlauts) are preprocessed.
- The DataFrame is transformed into two lists (sentences and tags) with the get_sents_and_tags function. The first list contains the texts split into sentences; the second list contains the tags for each word in the sentence.
- The tag2idx and idx2tag dictionaries are created once and reused after training. They map each tag to a number, which the model will predict. idx2tag.json is used in src/predict.py to transform predicted numbers back into tags.
- The text is tokenized with the BERT tokenizer, and each word piece is labeled using the BILUO scheme.
- Texts with labels are padded to a maximum length of 512 tokens.
- Train and Validation Tensor Datasets and DataLoaders are created for training the model.
- I used the BertForTokenClassification model with bert-base-german-cased pretrained weights.
- The train_model() function trains the model and returns the history of loss function values; I used the AdamW optimizer. eval_model() runs the model on the validation dataset and returns the accuracy score, F1 score, and classification report.
- predict() prints the given sentence and highlights the important information. The function in src/predict.py is modified for everyday usage.
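The tag-to-index mapping and padding steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function names `build_tag_maps` and `pad_ids` and the exact tag set are assumptions, and only the BILUO label ids (not the BERT token ids) are shown.

```python
# Sketch of the tag2idx / idx2tag creation and the padding step described above.
# Function names and the tag set are illustrative assumptions.

MAX_LEN = 512  # BERT's maximum sequence length


def build_tag_maps(tag_lists):
    """Create tag2idx / idx2tag once; idx2tag is what gets saved as idx2tag.json."""
    tags = sorted({t for sent in tag_lists for t in sent})
    tag2idx = {t: i for i, t in enumerate(tags)}
    idx2tag = {i: t for t, i in tag2idx.items()}
    return tag2idx, idx2tag


def pad_ids(ids, pad_value, max_len=MAX_LEN):
    """Pad (or truncate) a sequence of ids to exactly max_len entries."""
    return (ids + [pad_value] * max_len)[:max_len]


# Example sentence "Angela Merkel besuchte Berlin" with BILUO labels
sent_tags = [["B-PER", "L-PER", "O", "U-LOC"]]
tag2idx, idx2tag = build_tag_maps(sent_tags)
label_ids = [tag2idx[t] for t in sent_tags[0]]
padded = pad_ids(label_ids, pad_value=tag2idx["O"])
```

After this step every sentence contributes a fixed-length vector of label ids, which is what the Tensor Datasets and DataLoaders consume.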
To make Hawk work, you need to install the necessary prerequisites.
The prerequisites are listed in src/requirements.txt. All you need to do is follow the installation steps.
- Clone Hawk github repository:
git clone https://github.com/ReAlex1902/Hawk.git
- Go to the Hawk repository and install the necessary libraries with the following command:
cd Hawk
python -m pip install -r src/requirements.txt
- Download HAWK model (434 MB):
gdown --id 1_IWXvjsV3uU0D93loeVUuK_miA24dt8b --output src/HAWK_3.0.pth
- Run src/main.py script.
- Enter the text of the document; example_document.txt is provided as a sample.
- Enjoy the result!
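The highlighting in the final step can be sketched as a small helper that marks every token whose BILUO tag is not `O`. This is a hypothetical rendering; the actual output format of src/predict.py may differ.

```python
def highlight(tokens, tags):
    """Wrap tokens carrying a non-'O' BILUO tag in [[...]] markers (hypothetical format)."""
    out = []
    for tok, tag in zip(tokens, tags):
        out.append(f"[[{tok}]]" if tag != "O" else tok)
    return " ".join(out)


print(highlight(["Angela", "Merkel", "besuchte", "Berlin"],
                ["B-PER", "L-PER", "O", "U-LOC"]))
# → [[Angela]] [[Merkel]] besuchte [[Berlin]]
```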
As mentioned above, the metrics on the validation set are high. Here is the report:
Distributed under the MIT License. See LICENSE
for more information.
Alex Malkhasov
E-mail: ReAlex1902@gmail.com
Telegram
LinkedIn