HUSTAIA 2023 Spring undergraduate graduation project
Build a cancer-related Knowledge Graph from open-source papers.
The final result contains 949,007 concept nodes and 4,766,994 triples, and the concept nodes are linked to 1,023,966 paper nodes.
The building process is divided into five steps:
- data preparation: download data from xrxiv and pubmed and filter cancer-related papers using keywords and titles
- ner: train NER models and run them on the prepared data
- re: use the OpenNRE package to extract relations between entities
- term clustering: cluster similar entities into concepts
- postprocess: generate KG triples and build the visualization
- 20230416: finish viz🎉🎊🎉
- 20230413: finish postprocess
- 20230407: finish OpenNRE re
- 20230324: finish term clustering using CODER++
- 20230317: finish UIE test
- 20230310: finish bioBERT multiNERHead finetuning and apply it to my own corpus.
- 20230224: finish data preparation.
The KG is built fully automatically from pubmed and xrxiv papers.
At the paper collection stage, I collect ~1M cancer-related papers from pubmed/chemrxiv/medrxiv/biorxiv/arxiv.
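As a rough illustration of the keyword/title filter, here is a minimal sketch. The keyword list, file names, and metadata fields are assumptions, not the exact ones used in this repo.

```python
# Minimal sketch of the keyword/title filter (illustrative only: the keyword
# list, file names, and metadata fields are assumptions).
import json

CANCER_KEYWORDS = {"cancer", "tumor", "tumour", "carcinoma", "leukemia",
                   "lymphoma", "melanoma", "metastasis", "oncology"}

def is_cancer_related(title: str, abstract: str = "") -> bool:
    """Keep a paper if its title or abstract mentions any cancer keyword."""
    text = f"{title} {abstract}".lower()
    return any(kw in text for kw in CANCER_KEYWORDS)

def filter_papers(in_path: str, out_path: str) -> None:
    """Read one JSON record per line and keep only cancer-related papers."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            paper = json.loads(line)
            if is_cancer_related(paper.get("title", ""), paper.get("abstract", "")):
                fout.write(line)

# filter_papers("data/pubmed_raw.jsonl", "data/pubmed_cancer.jsonl")  # hypothetical paths
```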
I use a distantly supervised method to train the NER models; the NER datasets are the same as bioBERT's NER datasets. I train two NER models: a multiNerHead version of bioBERT and UIE. The multiNerHead bioBERT performs much better than UIE, so it is used as the final NER model.
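For illustration, here is a minimal sketch of NER inference with a fine-tuned BioBERT checkpoint through the Hugging Face `transformers` pipeline. The checkpoint path is hypothetical, and the real model here is a custom multiNerHead bioBERT rather than a single-head pipeline model.

```python
# Minimal sketch of NER inference with a fine-tuned BioBERT checkpoint.
# The checkpoint path is hypothetical; the project uses a custom
# multi-NER-head BioBERT, not this exact setup.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/finetuned-biobert-ner",  # hypothetical local checkpoint
    aggregation_strategy="simple",          # merge word pieces into full entity spans
)

sentence = "EGFR mutations are common in non-small cell lung cancer."
for ent in ner(sentence):
    print(ent["word"], ent["entity_group"], round(ent["score"], 3))
```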
Then I use the OpenNRE package to extract relations between entity pairs. This step is resource-intensive: I split the data into many shards and run inference on 48 V100 GPUs, which takes ~11 hours in total. (This is probably the most costly step of the whole project.)
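For reference, here is a minimal single-sentence sketch of OpenNRE inference. The checkpoint name below is one of OpenNRE's public pretrained models and is only a placeholder for the model actually used here; the data sharding and multi-GPU setup are not shown.

```python
# Minimal sketch of relation extraction with the OpenNRE package.
# "wiki80_bert_softmax" is a public OpenNRE checkpoint used here as a
# placeholder; the project's actual checkpoint may differ.
import opennre

model = opennre.get_model("wiki80_bert_softmax")

text = "EGFR mutations are common in non-small cell lung cancer."

def span(s: str, sub: str):
    """Return (start, end) character offsets of `sub` inside `s`."""
    start = s.index(sub)
    return (start, start + len(sub))

head = {"pos": span(text, "EGFR")}
tail = {"pos": span(text, "non-small cell lung cancer")}

relation, score = model.infer({"text": text, "h": head, "t": tail})
print(relation, round(score, 3))
```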
Then I use the CODER++ model pulled from Hugging Face for term clustering. This groups entities such as "cancers" and "cancer" into a single concept cluster, which makes the KG more precise. A word cloud of the clustered terms is shown after the sketch below.
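Here is a minimal sketch of CODER++-style term clustering, assuming the Hugging Face model id `GanjinZero/coder_eng_pp` and a cosine-similarity threshold; the real pipeline does large-scale nearest-neighbour search over ~1M terms rather than this greedy loop.

```python
# Minimal sketch of term clustering with a CODER++-style encoder.
# The model id and similarity threshold are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "GanjinZero/coder_eng_pp"  # assumed CODER++ checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(terms):
    """Encode terms with the [CLS] vector and L2-normalise for cosine similarity."""
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    cls = model(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(cls, dim=-1)

def cluster(terms, threshold=0.8):
    """Greedy clustering: attach each term to the first cluster whose
    representative is similar enough, otherwise start a new cluster."""
    emb = embed(terms)
    reps, clusters = [], []
    for i, term in enumerate(terms):
        for j, rep in enumerate(reps):
            if float(emb[i] @ emb[rep]) >= threshold:
                clusters[j].append(term)
                break
        else:
            reps.append(i)
            clusters.append([term])
    return clusters

print(cluster(["cancer", "cancers", "lung carcinoma", "EGFR"]))
```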
The remaining steps are post-processing and visualization with Streamlit, which need little further explanation.
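Still, as a rough illustration, here is a minimal sketch of a Streamlit page that queries the Neo4j database for a concept and lists its outgoing triples. The connection settings, node label, and property names are assumptions; see `viz/databaseKGviz.py` for the real app.

```python
# Minimal sketch of a Streamlit page backed by Neo4j. The URI, credentials,
# node label `Concept`, and property `name` are assumptions, not the real schema.
import streamlit as st
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

st.title("Cancer KG browser")
concept = st.text_input("Concept name", "lung cancer")

if concept:
    with driver.session() as session:
        records = session.run(
            "MATCH (c:Concept {name: $name})-[r]->(t:Concept) "
            "RETURN type(r) AS relation, t.name AS tail LIMIT 25",
            name=concept,
        )
        rows = [dict(r) for r in records]
    st.dataframe(rows)
```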
- create env using conda

  ```bash
  conda env create -f env.yaml
  conda activate cancerGraph
  python -m spacy download en_core_web_sm
  ```

- use gdown to download graphData into the `data/graphData` folder

  ```bash
  gdown --folder https://drive.google.com/drive/folders/1NPG-61qI8IoUQAqdHGAUJUrdpvM1IR30?usp=sharing --output data/graphData
  ```

- pull the neo4j docker image and build the database (see more details in `viz/import.md`)

- use streamlit to viz the data

  ```bash
  cd viz
  streamlit run databaseKGviz.py
  ```

