Financial-Fraud-Detection-Using-Text-Mining

Groupmates: Vickie Chang, Chris Yeung

Applied a BERT-based model to extract relations from 29 annual reports of listed companies and news articles. Used the spaCy library and a BERT model for named-entity recognition and relation extraction, and generated a network graph that summarises the key relations.

Methodology

The 29 sets of annual reports and Reuters news are fed into a trained spaCy pipeline to identify entities. Each entity comes with a label that classifies its type. Each paragraph is then split into sentences. Sentences containing fewer than two entities are removed because they cannot express a relation between two entities. For sentences containing three or more entities, every combination of two entities is generated with combinations from the itertools module, so that each resulting sentence instance marks exactly two entities. The relation between the two entities in each instance is then labelled manually. The data are split into training and test sets to train the BERT model from plkmo/BERT-Relation-Extraction. Finally, we applied the model to data related to Tencent as a case study.
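The pairing and splitting steps above can be sketched as follows. This is a minimal illustration, not the code in the notebooks: the function names, the character-offset span format, the toy sentence and the 80/20 split ratio are all assumptions, and entity detection with spaCy is replaced by hand-written spans so the example stays self-contained.

```python
from itertools import combinations
import random


def tag_pair(sentence, span_a, span_b):
    """Mark two (start, end) character spans with [E1]/[E2] tags.

    Assumption: the span that occurs first in the text becomes E1.
    """
    (s1, e1), (s2, e2) = sorted([span_a, span_b])
    return (
        sentence[:s1] + "[E1]" + sentence[s1:e1] + "[/E1]"
        + sentence[e1:s2] + "[E2]" + sentence[s2:e2] + "[/E2]"
        + sentence[e2:]
    )


def make_pair_sentences(sentence, spans):
    """Return one tagged copy of the sentence per pair of entities."""
    if len(spans) < 2:
        return []  # fewer than two entities: no relation to label
    return [tag_pair(sentence, a, b) for a, b in combinations(spans, 2)]


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle and split; the 0.2 test fraction is an assumed value."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Toy sentence with three hand-marked entities (hypothetical data).
sentence = "Alice and Bob work at Acme."
names = ["Alice", "Bob", "Acme"]
spans = [(sentence.index(n), sentence.index(n) + len(n)) for n in names]

pairs = make_pair_sentences(sentence, spans)
for p in pairs:
    print(p)

train, test = split_train_test(range(10))
```

Three entities yield three tagged copies of the sentence, each of which would then receive its own manually assigned relation label.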

data_preprocessing.ipynb contains the code for data collection and data preprocessing.
Tencent_RE BERT model.ipynb contains the code that applies the plkmo/BERT-Relation-Extraction GitHub repository.
Confusion_Matrix.ipynb and RelationGenerator.ipynb generate the confusion matrix and the network graph, respectively.

Training Data

We labelled a total of 2973 sentences with the following labels:

Labels	Number	Percentage
Colleague	475	15.98%
Relative	61	2.05%
Employee-Company	407	13.69%
Educated-Institute	83	2.79%
Founder-Company	51	1.72%
Shareholder-Company	143	4.81%
Within-Same-Company-Group	78	2.62%
Cooperate-Partner	76	2.56%
Subsidary-ParentCompany	98	3.30%
Same-Entity	97	3.26%
Other	1404	47.23%
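As a quick check on the table, the percentages can be recomputed from the counts. The sketch below copies the counts from the table (label spellings kept exactly as they appear there); the variable names are illustrative only.

```python
# Label counts copied from the table above.
counts = {
    "Colleague": 475,
    "Relative": 61,
    "Employee-Company": 407,
    "Educated-Institute": 83,
    "Founder-Company": 51,
    "Shareholder-Company": 143,
    "Within-Same-Company-Group": 78,
    "Cooperate-Partner": 76,
    "Subsidary-ParentCompany": 98,
    "Same-Entity": 97,
    "Other": 1404,
}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label}\t{n}\t{shares[label]:.2f}%")

# The most frequent label gives the accuracy of a trivial
# "always predict the majority class" baseline.
majority = max(counts, key=counts.get)
print(f"Total: {total}, majority class: {majority}")
```

The distribution is heavily skewed towards Other, so accuracy alone can look better than the per-class behaviour really is.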

Example of data

Input:

Subsidary-ParentCompany(e1,e2)
Sentence:  As [E1]Advance Data Services Limited[/E1] is wholly-owned by [E2]Ma[/E2] Huateng, Mr Ma has an interest in these shares as disclosed under the section of “Directors’ Interests in Securities”.

Output:

Predicted:  Subsidary-ParentCompany(e1,e2)
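For readers who want to work with this format, the sketch below parses a label and a tagged sentence into their parts using regular expressions. The function name and the returned dictionary layout are assumptions made for illustration; they are not taken from the notebooks or from plkmo/BERT-Relation-Extraction.

```python
import re

E1_RE = re.compile(r"\[E1\](.*?)\[/E1\]")
E2_RE = re.compile(r"\[E2\](.*?)\[/E2\]")
# A label looks like "Relation-Name(e1,e2)".
LABEL_RE = re.compile(r"^(?P<relation>[\w-]+)\((?P<head>e[12]),(?P<tail>e[12])\)$")


def parse_example(label, sentence):
    """Split a labelled example into relation name, direction and entities."""
    m = LABEL_RE.match(label.strip())
    e1 = E1_RE.search(sentence)
    e2 = E2_RE.search(sentence)
    if not (m and e1 and e2):
        raise ValueError("example is not in the expected tagged format")
    return {
        "relation": m["relation"],
        "direction": (m["head"], m["tail"]),
        "e1": e1.group(1),
        "e2": e2.group(1),
    }


# The example shown above, copied verbatim.
label = "Subsidary-ParentCompany(e1,e2)"
sentence = (
    "As [E1]Advance Data Services Limited[/E1] is wholly-owned by "
    "[E2]Ma[/E2] Huateng, Mr Ma has an interest in these shares as disclosed "
    "under the section of “Directors’ Interests in Securities”."
)

parsed = parse_example(label, sentence)
print(parsed)
```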

Model Performance and Visualization

We chose to train the model for 11 epochs, based on the training accuracy, losses and F1 score.

Metric at Epoch 11	Value
Train accuracy	0.8767857
Losses	0.3696946
Test F1 score	0.2857143

[Figures: losses and F1 score; training accuracy]

Confusion Matrix
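A confusion matrix and the per-class F1 scores derived from it can be computed as in the sketch below. The gold and predicted labels are made-up toy data, not results from this project, and averaging the per-class scores (macro F1) is an assumption: the averaging used for the reported test F1 score is not stated here.

```python
from collections import Counter


def confusion_matrix(gold, pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def per_class_f1(matrix, labels):
    """F1 = 2*TP / (2*TP + FP + FN) for each label."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the gold row
        fp = sum(row[i] for row in matrix) - tp   # rest of the predicted column
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data for illustration only.
labels = ["Colleague", "Relative", "Other"]
gold = ["Colleague", "Colleague", "Relative", "Other", "Other", "Other"]
pred = ["Colleague", "Other", "Relative", "Other", "Other", "Colleague"]

matrix = confusion_matrix(gold, pred, labels)
f1 = per_class_f1(matrix, labels)
macro_f1 = sum(f1.values()) / len(f1)

for label, row in zip(labels, matrix):
    print(label, row)
print(f1, macro_f1)
```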

Network Graph
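One way to turn predicted relations into a graph structure is sketched below, using only the standard library. It is not the code in RelationGenerator.ipynb: the function name is hypothetical, dropping Other predictions is an assumed design choice, and apart from the first triple (taken from the example above) the entities are placeholders.

```python
from collections import defaultdict


def build_graph(triples, skip=("Other",)):
    """Build an adjacency map: node -> set of (relation, neighbour) edges."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        if relation in skip:
            continue  # assumed: uninformative relations are not drawn
        graph[head].add((relation, tail))
        graph.setdefault(tail, set())  # make sure the tail exists as a node
    return dict(graph)


# (e1, predicted relation, e2) triples; the last two are placeholders.
triples = [
    ("Advance Data Services Limited", "Subsidary-ParentCompany", "Ma"),
    ("Person A", "Colleague", "Person B"),
    ("Person A", "Other", "Company C"),
]

graph = build_graph(triples)
for node, edges in graph.items():
    for relation, neighbour in sorted(edges):
        print(f"{node} --{relation}--> {neighbour}")
```

The resulting node and edge lists can be passed to any graph drawing library to render the network.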

Acknowledgement

We would like to thank Dr. K. P. Chow's research team for sharing their research data. We do not own any of the data. We also referred to various tutorials throughout the project and do not own that code. Links are given wherever code was adopted.
