
Embedding Quality Evaluation for Health-related Data

This repository contains the analysis and evaluation of embedding quality for health-related data using various embedding models. The goal was to assess the effectiveness of different models in capturing the semantic meaning and context of health-related keywords and clinical notes.

Directory Structure

  • Embedding Analysis1.ipynb: Notebook containing a comparative study across multiple pretrained and fine-tuned models.
  • model_tuning.ipynb: Notebook containing the model architectures and parameters for fine-tuning the ClinicalBERT model.
  • preprocess_data.ipynb: Notebook containing code for preprocessing the ClinicNotes and Medical Keyword files.
  • Detailed Report on Embedding Analysis: A detailed PDF report on this case study.
  • charts: Contains all the charts generated for this project.
  • data: Contains all the data files on which the models are analyzed or fine-tuned.

Feel free to explore the notebooks and report for more information.

Evaluation Approaches

Three main approaches were used to assess the embedding quality:

  1. Cosine Similarity between Medical Keyword Pairs: Cosine similarity was calculated for pairs of medical keywords; a higher score indicates greater semantic similarity between the keywords.

  2. Scatter Plot of Clinic Notes Embeddings: The clinic notes were transformed into embeddings using different models, and the embeddings were visualized using scatter plots. The goal was to observe if similar categories of clinic notes formed distinct clusters in the plot.

  3. K-Means Clustering of Clinic Notes Embeddings: K-Means clustering was performed on the clinic notes embeddings to check which categories of data (gastroenterology, cardiovascular, neurology) produce similar embeddings; in other words, to see where each model confuses data points from different categories. A minimal sketch of all three approaches follows this list.
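
The sketch below illustrates all three evaluation approaches end to end. It is a minimal, hedged example: the encoder checkpoint (all-MiniLM-L6-v2), the keyword pairs, and the note texts are illustrative placeholders standing in for the models and data actually evaluated in this repository.

```python
# Minimal sketch of the three evaluation approaches. Model name, keyword
# pairs, and notes are illustrative placeholders, not the repo's exact data.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

# 1. Cosine similarity between medical keyword pairs
pairs = [("heart attack", "myocardial infarction"), ("heart attack", "migraine")]
for a, b in pairs:
    ea, eb = model.encode([a, b])
    print(a, "|", b, "->", cosine_similarity([ea], [eb])[0, 0])

# 2. Scatter plot of clinic notes embeddings (2-D projection via PCA)
notes = ["...gastro note...", "...cardio note...", "...neuro note..."]
labels = ["gastroenterology", "cardiovascular", "neurology"]
emb = model.encode(notes)
xy = PCA(n_components=2).fit_transform(emb)
for (x, y), lab in zip(xy, labels):
    plt.scatter(x, y, label=lab)
plt.legend()
plt.show()

# 3. K-Means clustering: compare cluster assignments with true categories
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(emb)
print(list(zip(labels, clusters)))  # mismatches show where a model is confused
```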

Embedding Models

Several embedding models were evaluated for their performance on health-related data:

  1. Word2Vec: Word2Vec embeddings, obtained using pre-trained models like en_core_web_sm and en_core_web_lg from spaCy, were the initial baseline. However, they did not perform well in capturing health-related context.

  2. ELMo: ELMo embeddings, which capture contextual information effectively, were used to generate embeddings for clinic notes. While ELMo performed better than Word2Vec, it was not specifically trained on health-related data.

  3. BERT: BERT embeddings, including BERT base uncased, BioBERT, ClinicalBERT, and BlueBERT, were evaluated. The BERT base uncased model excels at capturing context but is not specifically trained on health-related data, whereas BioBERT, ClinicalBERT, and BlueBERT are trained on health-related data.

  4. Sentence Transformers: Sentence-transformer variants built on ClinicalBERT and BioBERT were also evaluated.

  5. USE: The Universal Sentence Encoder (USE) was used for sentence embeddings. (Embedding extraction for these model families is sketched below.)
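
To make the comparison concrete, here is a minimal sketch of how a single text can be embedded with two of the model families above: static spaCy vectors (the Word2Vec baseline) and ClinicalBERT with mean pooling over token embeddings. The emilyalsentzer/Bio_ClinicalBERT checkpoint and the mean-pooling choice are assumptions; the notebooks may use different checkpoints or pooling strategies.

```python
# Minimal sketch of embedding extraction for two of the model families above.
# Checkpoint names and mean pooling are assumptions, not the repo's exact setup.
import torch
import spacy
from transformers import AutoModel, AutoTokenizer

text = "patient presents with chest pain"

# Word2Vec-style static vectors via spaCy (average of token vectors)
nlp = spacy.load("en_core_web_lg")
w2v_vec = nlp(text).vector

# ClinicalBERT contextual embedding (mean pooling over token embeddings)
tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
bert = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
with torch.no_grad():
    out = bert(**tok(text, return_tensors="pt"))
bert_vec = out.last_hidden_state.mean(dim=1).squeeze(0)

print(w2v_vec.shape, bert_vec.shape)
```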

K-Means clustering results. Note that K-Means is not able to differentiate among the different categories' data points.

Chart

K-Means clustering results after fine-tuning the model with three different approaches:

  1. Tune the model to differentiate between correct and incorrect keyword pairs (sketched after this list).
  2. Tune the model on clinic notes to predict each note's category.
  3. Tune the model first on clinic notes to predict the correct note category, and then further tune the same model to differentiate between correct and incorrect keyword pairs.
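
Below is a minimal sketch of approach 1 using the sentence-transformers training API: the encoder is tuned so that correct keyword pairs score a high cosine similarity and incorrect pairs a low one. The checkpoint name, example pairs, and hyperparameters are illustrative assumptions, not the repository's exact configuration.

```python
# Minimal sketch of approach 1: tuning a sentence encoder so that correct
# keyword pairs score higher than incorrect ones. Checkpoint and pairs are
# illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Wrap ClinicalBERT as a sentence encoder with mean pooling
word_emb = models.Transformer("emilyalsentzer/Bio_ClinicalBERT")
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# label 1.0 = related keyword pair, 0.0 = unrelated pair
train_examples = [
    InputExample(texts=["heart attack", "myocardial infarction"], label=1.0),
    InputExample(texts=["heart attack", "colonoscopy"], label=0.0),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)  # pushes cosine similarity toward the label

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```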

Chart

Scatter plots of clinic notes embeddings against clinic note categories using the BERT-family models (six charts; see the charts directory).


ClinicalBERT captured the clusters better than any other model.


Below is the cosine similarity result plot from each model.

Chart

Fine-tuned ClinicalBERT: ClinicalBERT was further fine-tuned using the clinic notes data to improve its performance on health-related embeddings. The model was trained on a three-class classification task to predict the category of a given clinic note (a sketch of this fine-tune follows the charts below).

(Three charts for the fine-tuned ClinicalBERT model; see the charts directory.)
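
For reference, here is a minimal sketch of the three-class classification fine-tune described above (approach 2, and the setup behind the fine-tuned ClinicalBERT results). The checkpoint, label set, example notes, and hyperparameters are illustrative assumptions; see model_tuning.ipynb for the actual architecture and parameters.

```python
# Minimal sketch of the three-class classification fine-tune (one training
# step shown). Checkpoint, labels, and notes are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["gastroenterology", "cardiovascular", "neurology"]
tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=len(labels)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

notes = ["...abdominal pain note...", "...chest pain note..."]
targets = torch.tensor([0, 1])  # indices into `labels`

model.train()
batch = tok(notes, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=targets)  # cross-entropy loss over 3 classes
out.loss.backward()
optimizer.step()
```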

Evaluation Results

The evaluation results revealed the following insights:

  • Scatter plots: The scatter plots of clinic notes embeddings showed varying degrees of grouping. ClinicalBERT fine-tuned on clinic notes data exhibited the clearest groupings, indicating better capture of health-related context.

  • Keyword pair similarity: The models were also evaluated on the cosine similarity of medical keyword pairs. ClinicalBERT outperformed the other models, achieving a similarity score of 0.8174 when fine-tuned and 0.8134 when pretrained.

Conclusion

Based on the evaluation, ClinicalBERT-based models, particularly when fine-tuned on both the clinic notes data and the keyword pairs, showed the best performance in capturing health-related context. However, there is room for further research and improvement, including hyperparameter tuning and the incorporation of more health-specific training data.

For a detailed analysis and code implementation, please refer to the Jupyter notebooks and the respective model directories in this repository.

Lastly, the performance of all models on the keyword pair cosine similarity task is shown below:

Chart
