A solution for analyzing large numbers of documents containing information about German top managers
The Hawk project helps you work with text documents by finding the necessary information (tags).
Here are the main reasons to use this solution:
- Hawk analyzes hundreds of documents in a few minutes and highlights the important information
- It is reliable according to the metrics on the validation dataset
- The neural network saves a great deal of money on the routine analysis of documents
Of course, Hawk will need additional training after some time. As with every NN project, it needs new data to remain a state-of-the-art solution.
Hawk was created using the following technologies: Python, spaCy, and PyTorch with the Hugging Face Transformers library.
You can find all steps in data analysis and modeling at analysis/Hawk_modeling.ipynb
- The large spaCy German pipeline (de_core_news_lg) is used for tagging the texts.
- All special symbols (including umlauts) are preprocessed.
- The DataFrame is transformed into two lists (sentences and tags) with the get_sents_and_tags function. The first list contains the texts split into sentences; the second list contains the tags for each word in the sentence.
- The tag2idx and idx2tag dictionaries are created once and reused after training. They map each tag to a number, which the model will predict. idx2tag.json is used in src/predict.py to transform predicted numbers back into tags.
- The text is tokenized with the BERT tokenizer, and each word piece is labeled using the BILUO scheme.
- Texts with labels are padded to a maximum length of 512 tokens.
- Train and Validation Tensor Datasets and DataLoaders are created for training the model.
- I used the BertForTokenClassification model with bert-base-german-cased pretrained weights.
- The train_model() function trains the model and returns the history of loss function values; I used the AdamW optimizer. eval_model() runs the model on the validation dataset and returns the accuracy score, F1 score, and classification report.
- predict() prints the given sentence and highlights the important information. The function in src/predict.py is modified for everyday usage.
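The tag-to-index mapping and padding steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function names `build_tag_maps` and `pad_ids` and the exact tag set are assumptions, and only the BILUO label ids (not the BERT token ids) are shown.

```python
# Sketch of the tag2idx / idx2tag creation and the padding step described above.
# Function names and the tag set are illustrative assumptions.

MAX_LEN = 512  # BERT's maximum sequence length


def build_tag_maps(tag_lists):
    """Create tag2idx / idx2tag once; idx2tag is what gets saved as idx2tag.json."""
    tags = sorted({t for sent in tag_lists for t in sent})
    tag2idx = {t: i for i, t in enumerate(tags)}
    idx2tag = {i: t for t, i in tag2idx.items()}
    return tag2idx, idx2tag


def pad_ids(ids, pad_value, max_len=MAX_LEN):
    """Pad (or truncate) a sequence of ids to exactly max_len entries."""
    return (ids + [pad_value] * max_len)[:max_len]


# Example sentence "Angela Merkel besuchte Berlin" with BILUO labels
sent_tags = [["B-PER", "L-PER", "O", "U-LOC"]]
tag2idx, idx2tag = build_tag_maps(sent_tags)
label_ids = [tag2idx[t] for t in sent_tags[0]]
padded = pad_ids(label_ids, pad_value=tag2idx["O"])
```

After this step every sentence contributes a fixed-length vector of label ids, which is what the Tensor Datasets and DataLoaders consume.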
To make Hawk work, you need to install the necessary prerequisites.
The prerequisites are listed in src/requirements.txt. All you need to do is follow the installation steps.
- Clone Hawk github repository:
git clone https://github.com/ReAlex1902/Hawk.git
- Go to the Hawk repository and install the necessary libraries with the following command:
cd Hawk
python -m pip install -r src/requirements.txt
- Download HAWK model (434 MB):
gdown --id 1_IWXvjsV3uU0D93loeVUuK_miA24dt8b --output src/HAWK_3.0.pth
- Run src/main.py script.
- Enter the text of the document; example_document.txt is provided as a sample.
- Enjoy the result!
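The highlighting in the final step can be sketched as a small helper that marks every token whose BILUO tag is not `O`. This is a hypothetical rendering; the actual output format of src/predict.py may differ.

```python
def highlight(tokens, tags):
    """Wrap tokens carrying a non-'O' BILUO tag in [[...]] markers (hypothetical format)."""
    out = []
    for tok, tag in zip(tokens, tags):
        out.append(f"[[{tok}]]" if tag != "O" else tok)
    return " ".join(out)


print(highlight(["Angela", "Merkel", "besuchte", "Berlin"],
                ["B-PER", "L-PER", "O", "U-LOC"]))
# → [[Angela]] [[Merkel]] besuchte [[Berlin]]
```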
As mentioned above, the metrics on the validation set are high. Here is the report:
Distributed under the MIT License. See LICENSE
for more information.
Alex Malkhasov
E-mail: ReAlex1902@gmail.com
Telegram
LinkedIn