Biomedical clustering evaluation

Evaluation of clustering algorithms on biomedical abstracts using internal metrics (Silhouette, Calinski-Harabasz, Davies-Bouldin).

Different combinations of the following methods were evaluated.

Step	Methods
Text embedding	Bag-of-words, Word2vec, GloVe, fastText, BioWordVec, MiniLM, BioBERT, PubMedBERT
Dimensionality reduction	PCA, LSA, NMF, UMAP
Clustering	K-means, Bisecting k-means, Agglomerative, DBSCAN, OPTICS, HDBSCAN
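
To make the three steps and the internal metrics concrete, the sketch below chains one possible combination (bag-of-words, LSA, K-means) with scikit-learn and scores the result. It is an illustration only and is not taken from eval.py: the toy sentences, the `evaluate` helper, and all parameter values are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy stand-ins for abstracts (invented for illustration, not PubMed data).
abstracts = [
    "insulin resistance and glucose control in type 2 diabetes",
    "glucose monitoring improves insulin dosing in diabetes patients",
    "diabetes patients show impaired insulin and glucose response",
    "airway inflammation and inhaled corticosteroids in asthma",
    "asthma exacerbations reduced by inhaled corticosteroids",
    "airway inflammation drives asthma symptoms in children",
]


def evaluate(texts, n_clusters=2, n_components=2, seed=0):
    """Embed, reduce, cluster, then score with the three internal metrics."""
    # Step 1: text embedding (bag-of-words term counts).
    bow = CountVectorizer(stop_words="english").fit_transform(texts)
    # Step 2: dimensionality reduction (LSA, i.e. truncated SVD of the term matrix).
    reduced = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(bow)
    # Step 3: clustering.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    # Internal metrics need only the vectors and the predicted labels.
    return {
        "silhouette": silhouette_score(reduced, labels),                # higher is better, in [-1, 1]
        "calinski_harabasz": calinski_harabasz_score(reduced, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(reduced, labels),        # lower is better
    }


scores = evaluate(abstracts)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because these metrics are internal, no ground-truth disease labels are needed to compute them. Swapping the vectorizer, the reducer, or the clustering estimator gives the other combinations in the table.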

Dataset creation

The evaluation dataset is created using PubMed abstracts. Each entry of the dataset contains articles related to a disease.

pmids.json contains the PMIDs of the articles in a predefined dataset.
To fetch the actual abstracts, run:

python dataset_fetch.py --pmids=pmids.json --output=<output-path>

To create a custom dataset with different diseases, create a file listing the desired diseases (one per line) and run:

python dataset_pmids.py --diseases-list=<path-to-list> --output=<output-path>

The output contains the PMIDs of the dataset, which can be passed to dataset_fetch.py.
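
The two-stage workflow for a custom dataset can be sketched as follows. This is a minimal helper under stated assumptions: the function names and the file names (diseases.txt, custom_pmids.json, custom_dataset) are hypothetical, and only the script names and flags come from the commands above. The sketch builds the command lines without running them.

```python
import tempfile
from pathlib import Path


def write_disease_list(diseases, path):
    """Write one disease name per line, skipping blank entries."""
    cleaned = [d.strip() for d in diseases if d.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def dataset_commands(list_path, pmids_path, abstracts_path):
    """Return the two commands in order: collect PMIDs, then fetch abstracts."""
    return [
        ["python", "dataset_pmids.py", f"--diseases-list={list_path}", f"--output={pmids_path}"],
        ["python", "dataset_fetch.py", f"--pmids={pmids_path}", f"--output={abstracts_path}"],
    ]


# Demo in a temporary directory; all paths below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "diseases.txt"
    write_disease_list(["asthma", "  type 2 diabetes ", ""], list_path)
    print(list_path.read_text(encoding="utf-8"), end="")
    for cmd in dataset_commands(list_path, "custom_pmids.json", "custom_dataset"):
        print(" ".join(str(part) for part in cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the step.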

Evaluation

To start the evaluation, run:

python eval.py --dataset=<path-to-dataset> --embedding=<embedding>

Available values of <embedding> are bow, word2vec, glove, fasttext, biowordvec, minilm, biobert, pubmedbert.
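
To compare embeddings, the evaluation can be launched once per value. The sketch below is one way to do that; `eval_command` and `run_all` are hypothetical helpers, not part of the repository, and they assume eval.py is run from the repository root with only the two flags shown above.

```python
import subprocess

# Embedding identifiers as listed above.
EMBEDDINGS = [
    "bow", "word2vec", "glove", "fasttext",
    "biowordvec", "minilm", "biobert", "pubmedbert",
]


def eval_command(dataset_path, embedding):
    """Build the eval.py command line for one embedding."""
    if embedding not in EMBEDDINGS:
        raise ValueError(f"unknown embedding: {embedding!r}")
    return ["python", "eval.py", f"--dataset={dataset_path}", f"--embedding={embedding}"]


def run_all(dataset_path):
    """Run the evaluation for every embedding, stopping at the first failure."""
    for embedding in EMBEDDINGS:
        subprocess.run(eval_command(dataset_path, embedding), check=True)
```

Calling `run_all("<path-to-dataset>")` with a real dataset path runs the eight evaluations sequentially. Validating the name up front avoids starting a run with a mistyped embedding.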
