This repository has been archived by the owner on Feb 2, 2022. It is now read-only.

Financial-Fraud-Detection-Using-Text-Mining

Groupmates: Vickie Chang, Chris Yeung

Applied a BERT-based model to extract relations from 29 annual reports of listed companies and related news. Used the spaCy library and a BERT model for named-entity recognition and relation extraction, and generated a network graph that summarises the key relations.

Methodology

29 sets of annual reports and news articles from Reuters are fed into the trained spaCy pipeline to identify entities, each of which comes with a label classifying its nature. Each paragraph is then split into sentences. Sentences containing fewer than two entities are removed, as they contain no valid relations. For sentences containing three or more entities, all pairs of entities are generated with `itertools.combinations`, so that each resulting example contains exactly two entities. The relation between the two entities in each sentence is then manually labelled. The data are split into train and test sets for training the BERT model from plkmo/BERT-Relation-Extraction. We then applied our model to data related to Tencent as a case study.
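The pair-expansion step above can be sketched as follows. This is a minimal illustration, not our exact preprocessing code: the function and variable names are introduced here, and entity extraction is assumed to have already been done by the trained spaCy pipeline.

```python
import itertools

def expand_to_pairs(sentence, entities):
    """Turn a sentence with n >= 2 entities into one example per entity
    pair, so every training example contains exactly two entities."""
    if len(entities) < 2:
        return []  # fewer than two entities: no valid relation
    return [(sentence, e1, e2)
            for e1, e2 in itertools.combinations(entities, 2)]

sent = "Ma Huateng founded Tencent in Shenzhen."
pairs = expand_to_pairs(sent, ["Ma Huateng", "Tencent", "Shenzhen"])
# Three entities yield C(3, 2) = 3 two-entity examples.
```

Sentences with exactly two entities pass through unchanged, since `combinations` returns the single pair.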

Training Data

We labelled a total of 2,973 sentences with the following labels:

| Label | Count | Percentage |
| --- | --- | --- |
| Colleague | 475 | 15.98% |
| Relative | 61 | 2.05% |
| Employee-Company | 407 | 13.69% |
| Educated-Institute | 83 | 2.79% |
| Founder-Company | 51 | 1.72% |
| Shareholder-Company | 143 | 4.81% |
| Within-Same-Company-Group | 78 | 2.62% |
| Cooperate-Partner | 76 | 2.56% |
| Subsidary-ParentCompany | 98 | 3.30% |
| Same-Entity | 97 | 3.26% |
| Other | 1404 | 47.23% |

Example of data

Input:

Subsidary-ParentCompany(e1,e2)
Sentence:  As [E1]Advance Data Services Limited[/E1] is wholly-owned by [E2]Ma[/E2] Huateng, Mr Ma has an interest in these shares as disclosed under the section of “Directors’ Interests in Securities”.

Output:

Predicted:  Subsidary-ParentCompany(e1,e2)
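The `[E1]`/`[E2]` markers in the input above can be inserted with a small helper like this. It is a simplified sketch of the preprocessing (the actual tagging follows plkmo/BERT-Relation-Extraction); `mark_entities` is a name we introduce here, and it assumes each entity span occurs once in the sentence.

```python
def mark_entities(sentence, e1, e2):
    """Wrap the two entity spans with the [E1]/[E2] markers expected by
    the relation-extraction model. Replaces the first occurrence only."""
    marked = sentence.replace(e1, f"[E1]{e1}[/E1]", 1)
    marked = marked.replace(e2, f"[E2]{e2}[/E2]", 1)
    return marked

print(mark_entities(
    "Advance Data Services Limited is wholly-owned by Ma Huateng.",
    "Advance Data Services Limited", "Ma"))
# → [E1]Advance Data Services Limited[/E1] is wholly-owned by [E2]Ma[/E2] Huateng.
```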

Model Performance and Visualization

We chose to train the model for 11 epochs based on the training accuracy, losses, and F1 score:

| Parameter at Epoch 11 | Value |
| --- | --- |
| Train accuracy | 0.8767857 |
| Losses | 0.3696946 |
| Test F1 score | 0.2857143 |
(Plots: training losses, F1 score, and training accuracy per epoch.)

Confusion Matrix
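Given gold and predicted labels for the test set, the confusion matrix can be tallied as follows. The label lists here are illustrative placeholders, not our actual results.

```python
from collections import Counter

# Hypothetical gold and predicted labels for a few test sentences.
gold = ["Colleague", "Other", "Colleague", "Relative"]
pred = ["Colleague", "Colleague", "Other", "Relative"]

# Confusion counts keyed by (gold_label, predicted_label);
# diagonal entries (x, x) are correct predictions.
confusion = Counter(zip(gold, pred))
```

Each cell of the matrix is then `confusion[(gold_label, predicted_label)]`.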

Network Graph

(Figure: network graph of key relations centred on Tencent.)
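The network graph is built from the model's predicted (entity, relation, entity) triples. A minimal sketch using a plain adjacency map, with illustrative triples drawn from the example above:

```python
from collections import defaultdict

# Hypothetical predicted triples (entity1, relation, entity2); in practice
# these come from the relation-extraction model's output.
triples = [
    ("Advance Data Services Limited", "Subsidary-ParentCompany", "Ma"),
    ("Ma", "Founder-Company", "Tencent"),
]

# Adjacency map: each entity maps to its outgoing labelled relations.
graph = defaultdict(list)
for e1, rel, e2 in triples:
    graph[e1].append((rel, e2))
```

A graph library such as networkx can then lay out and draw the nodes and labelled edges.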

Acknowledgement

We would like to thank Dr. K. P. Chow's research team for sharing their research data. We do not own any of the data. We also referred to various tutorials throughout the project and do not own that code; links are stated where their code is adopted.
