NotXia/biomed-clustering-eval


Biomedical clustering evaluation

Work done for my Bachelor's thesis.

Evaluation of clustering algorithms on biomedical abstracts using internal metrics (Silhouette, Calinski-Harabasz, Davies-Bouldin).
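As a rough illustration of what these three internal metrics measure, the sketch below scores a K-means clustering with scikit-learn. It is not taken from the thesis code: the synthetic data from `make_blobs` stands in for real abstract embeddings, and the cluster count and other parameters are assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for document embeddings (assumption, not the real data):
# 300 points in 8 dimensions, drawn around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, n_features=8, random_state=0)

# Cluster the points; internal metrics need only the data and the labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

scores = {
    # Ranges over [-1, 1]; higher means tighter, better separated clusters.
    "silhouette": silhouette_score(X, labels),
    # Non-negative ratio of between- to within-cluster dispersion; higher is better.
    "calinski_harabasz": calinski_harabasz_score(X, labels),
    # Non-negative average similarity of each cluster to its closest one; lower is better.
    "davies_bouldin": davies_bouldin_score(X, labels),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

None of the three metrics requires ground-truth labels, which is what makes them internal. Note that Davies-Bouldin is the only one of the three where a lower value indicates a better clustering.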

Different combinations of the following methods were evaluated.

| Step | Methods |
| --- | --- |
| Text embedding | Bag-of-words, Word2vec, GloVe, fastText, BioWordVec, MiniLM, BioBERT, PubMedBERT |
| Dimensionality reduction | PCA, LSA, NMF, UMAP |
| Clustering | K-means, Bisecting k-means, Agglomerative, DBSCAN, OPTICS, HDBSCAN |
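To give a sense of how large the search space is, the methods listed above can be enumerated as a grid. This is only a sketch: it assumes each run picks exactly one method per step, whereas the thesis may have evaluated a subset of combinations or skipped dimensionality reduction for some of them.

```python
from itertools import product

# Method names copied from the list above; the grid structure is an assumption.
EMBEDDINGS = [
    "Bag-of-words", "Word2vec", "GloVe", "fastText",
    "BioWordVec", "MiniLM", "BioBERT", "PubMedBERT",
]
REDUCTIONS = ["PCA", "LSA", "NMF", "UMAP"]
CLUSTERINGS = [
    "K-means", "Bisecting k-means", "Agglomerative",
    "DBSCAN", "OPTICS", "HDBSCAN",
]

# Every (embedding, reduction, clustering) triple, one method per step.
combinations = list(product(EMBEDDINGS, REDUCTIONS, CLUSTERINGS))

print(len(combinations))  # 8 * 4 * 6 = 192
for embedding, reduction, clustering in combinations[:3]:
    print(f"{embedding} -> {reduction} -> {clustering}")
```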

Dataset creation

The evaluation dataset is built from PubMed abstracts. Each entry of the dataset contains the articles related to a single disease.

pmids.json contains the PMIDs of the articles in a predefined dataset.
To fetch the actual abstracts, run:

python dataset_fetch.py --pmids=pmids.json --output=<output-path>

To create a custom dataset with different diseases, create a file listing the desired diseases (one per line) and run:

python dataset_pmids.py --diseases-list=<path-to-list> --output=<output-path>

The output contains the PMIDs of the dataset, which can be passed to dataset_fetch.py.
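The two steps for a custom dataset can be chained as in the sketch below. It only writes the disease list and prints the two commands; it does not run them. The disease names, file names, and temporary directory are hypothetical, and whether dataset_fetch.py writes a file or a directory is not stated here, so the output path is left without an extension.

```python
import shlex
import tempfile
from pathlib import Path

# Hypothetical disease names, for illustration only.
diseases = ["asthma", "type 2 diabetes", "tuberculosis"]

# Hypothetical working directory and file names.
workdir = Path(tempfile.mkdtemp())
diseases_path = workdir / "diseases.txt"
pmids_path = workdir / "custom_pmids.json"
abstracts_path = workdir / "custom_dataset"

# The list file holds one disease per line.
diseases_path.write_text("\n".join(diseases) + "\n", encoding="utf-8")

# Step 1 collects the PMIDs; step 2 fetches the abstracts for those PMIDs.
commands = [
    ["python", "dataset_pmids.py",
     f"--diseases-list={diseases_path}", f"--output={pmids_path}"],
    ["python", "dataset_fetch.py",
     f"--pmids={pmids_path}", f"--output={abstracts_path}"],
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell from the repository root; `shlex.join` quotes any path that contains spaces.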

Evaluation

To start the evaluation, run:

python eval.py --dataset=<path-to-dataset> --embedding=<embedding>

Available values of <embedding> are bow, word2vec, glove, fasttext, biowordvec, minilm, biobert, pubmedbert.
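To evaluate every embedding in turn, the command can be generated once per value. The sketch below is a convenience wrapper, not part of the repository: the `eval_command` helper and the `dataset.json` path are hypothetical, and the commands are printed rather than executed.

```python
import shlex

# Values accepted by --embedding, as listed above.
EMBEDDINGS = [
    "bow", "word2vec", "glove", "fasttext",
    "biowordvec", "minilm", "biobert", "pubmedbert",
]


def eval_command(dataset_path: str, embedding: str) -> list[str]:
    """Build the eval.py command line for one embedding (hypothetical helper)."""
    if embedding not in EMBEDDINGS:
        raise ValueError(f"unknown embedding: {embedding!r}")
    return [
        "python", "eval.py",
        f"--dataset={dataset_path}",
        f"--embedding={embedding}",
    ]


dataset_path = "dataset.json"  # hypothetical path to a fetched dataset
for embedding in EMBEDDINGS:
    print(shlex.join(eval_command(dataset_path, embedding)))
```

Rejecting unknown names up front catches typos before a long evaluation run starts.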
