Hawk

A solution for analyzing large numbers of documents containing information about German top managers

Table of Contents
  1. About The Project
  2. Analysis
  3. Getting Started
  4. Usage
  5. Metrics
  6. License
  7. Contact

About The Project

The Hawk project helps you work with text documents by finding the necessary information (tags).

Here are the main reasons to use this solution:

  • Hawk analyzes hundreds of documents in a few minutes and highlights the important information
  • It is reliable, according to its metrics on the validation dataset
  • The neural network saves a great deal of money on routine document analysis

Of course, Hawk will eventually need additional training. As with every neural-network project, it needs new data to remain a state-of-the-art solution.

Built With

Hawk was created using the following technologies: Python, PyTorch, Hugging Face Transformers, and spaCy.

Analysis

You can find all data-analysis and modeling steps in analysis/Hawk_modeling.ipynb

Data preprocessing steps:

  1. The large German pipeline (de_core_news_lg) is used for tagging the texts.
  2. All special symbols (including umlauts) are preprocessed.
  3. The DataFrame is transformed into two lists (sentences and tags) with the get_sents_and_tags function. The first list contains the texts divided into sentences; the second contains the tags for each word in a sentence.
  4. The tag2idx and idx2tag dictionaries are created once and reused after training. They map each tag to a number, which is what the model predicts. idx2tag.json is used in src/predict.py to transform a predicted number back into its tag.
  5. The text is tokenized with the BERT tokenizer, and a label is assigned to each word piece following the BILUO scheme.
  6. Texts and labels are padded to a maximum length of 512 tokens.
  7. Train and validation TensorDatasets and DataLoaders are created for training the model (a sketch of steps 4-7 follows this list).
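
Here is a minimal sketch of steps 4-7, assuming get_sents_and_tags returns word-split sentences with per-word BILUO tags; the "O" padding tag, the variable names, and the batch size are illustrative assumptions rather than the notebook's exact code:

import json
import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer

MAX_LEN = 512
tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")

# sentences, tags = get_sents_and_tags(df)   # as produced in the notebook

# Build tag2idx / idx2tag once; idx2tag.json is reused by src/predict.py.
unique_tags = sorted({t for sent in tags for t in sent})
tag2idx = {t: i for i, t in enumerate(unique_tags)}
idx2tag = {i: t for t, i in tag2idx.items()}
with open("idx2tag.json", "w") as f:
    json.dump(idx2tag, f)

def encode(sentence, sent_tags):
    # Tokenize each word into word pieces and repeat its BILUO tag per piece.
    pieces, piece_tags = [], []
    for word, tag in zip(sentence, sent_tags):
        wp = tokenizer.tokenize(word)
        pieces.extend(wp)
        piece_tags.extend([tag] * len(wp))
    # Convert to ids, then pad/truncate everything to MAX_LEN.
    ids = tokenizer.convert_tokens_to_ids(pieces)[:MAX_LEN]
    labels = [tag2idx[t] for t in piece_tags][:MAX_LEN]
    mask = [1] * len(ids)
    pad = MAX_LEN - len(ids)
    return ids + [0] * pad, labels + [tag2idx["O"]] * pad, mask + [0] * pad

encoded = [encode(s, t) for s, t in zip(sentences, tags)]
input_ids, label_ids, masks = map(torch.tensor, zip(*encoded))
dataset = TensorDataset(input_ids, masks, label_ids)
loader = DataLoader(dataset, batch_size=8, shuffle=True)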

Modeling steps:

  1. I used the pretrained BertForTokenClassification model (bert-base-german-cased).
  2. The train_model() function trains the model and returns the history of loss-function values; I used the AdamW optimizer. eval_model() runs the model on the validation dataset and returns the accuracy score, F1 score, and classification report.
  3. predict() prints the given sentence and highlights the important information. The version in src/predict.py is modified for everyday use. A condensed sketch of the training setup is shown below.
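
The following is a condensed sketch of that training setup; the learning rate, epoch count, and device handling are assumptions, and the notebook's actual train_model() and eval_model() are more complete:

import torch
from torch.optim import AdamW
from transformers import BertForTokenClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForTokenClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(tag2idx)
).to(device)
optimizer = AdamW(model.parameters(), lr=3e-5)  # assumed learning rate

def train_model(loader, epochs=3):
    # Train and return the history of loss-function values.
    history = []
    model.train()
    for _ in range(epochs):
        for input_ids, mask, labels in loader:
            optimizer.zero_grad()
            out = model(input_ids.to(device),
                        attention_mask=mask.to(device),
                        labels=labels.to(device))
            out.loss.backward()
            optimizer.step()
            history.append(out.loss.item())
    return history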

Getting Started

To make Hawk work well, you need to install all the necessary prerequisites.

Prerequisites

The prerequisites are listed in src/requirements.txt. All you need to do is follow the installation steps below.

Installation

  1. Clone the Hawk GitHub repository:
git clone https://github.com/ReAlex1902/Hawk.git
  2. Go to the Hawk repository and install all necessary libraries with the following commands:
cd Hawk
python -m pip install -r src\requirements.txt
  3. Download the HAWK model (434 MB):
gdown --id 1_IWXvjsV3uU0D93loeVUuK_miA24dt8b --output src\HAWK_3.0.pth

Usage

  1. Run the src/main.py script.
  2. Enter the text of the document; example_document.txt is provided as an example.
  3. Enjoy the result!
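
For reference, a typical session from the repository root (assuming the model was downloaded as in the installation step) is just:

python src\main.py

Then paste the document text when prompted.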

Metrics

As mentioned above, the metrics are high on the validation set. The classification report is generated by eval_model() in analysis/Hawk_modeling.ipynb.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Alex Malkhasov

E-mail: ReAlex1902@gmail.com
Telegram
LinkedIn