# Sentence Embeddings

This notebook generates sentence embeddings for paper titles and URLs using a pre-trained model from the [sentence-transformers](https://www.sbert.net/index.html) library.

In [129]:
import pandas as pd
from pathlib import Path

from sentence_transformers import SentenceTransformer

In [130]:
PATH_DATA_BASE = Path.cwd().parent / "data"
PATH_SENTENCES = Path.cwd().parent / "models/sentences"
PATH_EMBEDDINGS = Path.cwd().parent / "models/embeddings"

In [131]:
# Setting pandas option to display the full content of DataFrame columns without truncation
pd.set_option('display.max_colwidth', None)

In [132]:
dataset = pd.read_csv(PATH_DATA_BASE / 'filtered_data.csv')
dataset.head()

Unnamed: 0,titles,abstracts,terms,urls
0,"A Survey on Deep Learning for Polyp Segmentation: Techniques, Challenges and Future Trends","Early detection and assessment of polyps play a crucial role in the\nprevention and treatment of colorectal cancer (CRC). Polyp segmentation\nprovides an effective solution to assist clinicians in accurately locating and\nsegmenting polyp regions. In the past, people often relied on manually\nextracted lower-level features such as color, texture, and shape, which often\nhad issues capturing global context and lacked robustness to complex scenarios.\nWith the advent of deep learning, more and more outstanding medical image\nsegmentation algorithms based on deep learning networks have emerged, making\nsignificant progress in this field. This paper provides a comprehensive review\nof polyp segmentation algorithms. We first review some traditional algorithms\nbased on manually extracted features and deep segmentation algorithms, then\ndetail benchmark datasets related to the topic. Specifically, we carry out a\ncomprehensive evaluation of recent deep learning models and results based on\npolyp sizes, considering the pain points of research topics and differences in\nnetwork structures. Finally, we discuss the challenges of polyp segmentation\nand future trends in this field. The models, benchmark datasets, and source\ncode links we collected are all published at\nhttps://github.com/taozh2017/Awesome-Polyp-Segmentation.",['cs.CV'],http://arxiv.org/abs/2311.18373v3
1,A Multi-Scale Feature Extraction and Fusion Deep Learning Method for Classification of Wheat Diseases,"Wheat is an important source of dietary fiber and protein that is negatively\nimpacted by a number of risks to its growth. The difficulty of identifying and\nclassifying wheat diseases is discussed with an emphasis on wheat loose smut,\nleaf rust, and crown and root rot. Addressing conditions like crown and root\nrot, this study introduces an innovative approach that integrates multi-scale\nfeature extraction with advanced image segmentation techniques to enhance\nclassification accuracy. The proposed method uses neural network models\nXception, Inception V3, and ResNet 50 to train on a large wheat disease\nclassification dataset 2020 in conjunction with an ensemble of machine vision\nclassifiers, including voting and stacking. The study shows that the suggested\nmethodology has a superior accuracy of 99.75% in the classification of wheat\ndiseases when compared to current state-of-the-art approaches. A deep learning\nensemble model Xception showed the highest accuracy.","['cs.CV', 'cs.LG']",http://arxiv.org/abs/2501.09938v1
2,"Image Segmentation with transformers: An Overview, Challenges and Future","Image segmentation, a key task in computer vision, has traditionally relied\non convolutional neural networks (CNNs), yet these models struggle with\ncapturing complex spatial dependencies, objects with varying scales, need for\nmanually crafted architecture components and contextual information. This paper\nexplores the shortcomings of CNN-based models and the shift towards transformer\narchitectures -to overcome those limitations. This work reviews\nstate-of-the-art transformer-based segmentation models, addressing\nsegmentation-specific challenges and their solutions. The paper discusses\ncurrent challenges in transformer-based segmentation and outlines promising\nfuture trends, such as lightweight architectures and enhanced data efficiency.\nThis survey serves as a guide for understanding the impact of transformers in\nadvancing segmentation capabilities and overcoming the limitations of\ntraditional models.",['cs.CV'],http://arxiv.org/abs/2501.09372v1
3,Shape-Based Single Object Classification Using Ensemble Method Classifiers,"Nowadays, more and more images are available. Annotation and retrieval of the\nimages pose classification problems, where each class is defined as the group\nof database images labelled with a common semantic label. Various systems have\nbeen proposed for content-based retrieval, as well as for image classification\nand indexing. In this paper, a hierarchical classification framework has been\nproposed for bridging the semantic gap effectively and achieving multi-category\nimage classification. A well known pre-processing and post-processing method\nwas used and applied to three problems; image segmentation, object\nidentification and image classification. The method was applied to classify\nsingle object images from Amazon and Google datasets. The classification was\ntested for four different classifiers; BayesNetwork (BN), Random Forest (RF),\nBagging and Vote. The estimated classification accuracies ranged from 20% to\n99% (using 10-fold cross validation). The Bagging classifier presents the best\nperformance, followed by the Random Forest classifier.","['cs.CV', 'cs.AI', 'cs.CL']",http://arxiv.org/abs/2501.09311v1
4,Few-Shot Adaptation of Training-Free Foundation Model for 3D Medical Image Segmentation,"Vision foundation models have achieved remarkable progress across various\nimage analysis tasks. In the image segmentation task, foundation models like\nthe Segment Anything Model (SAM) enable generalizable zero-shot segmentation\nthrough user-provided prompts. However, SAM primarily trained on natural\nimages, lacks the domain-specific expertise of medical imaging. This limitation\nposes challenges when applying SAM to medical image segmentation, including the\nneed for extensive fine-tuning on specialized medical datasets and a dependency\non manual prompts, which are both labor-intensive and require intervention from\nmedical experts.\n This work introduces the Few-shot Adaptation of Training-frEe SAM (FATE-SAM),\na novel method designed to adapt the advanced Segment Anything Model 2 (SAM2)\nfor 3D medical image segmentation. FATE-SAM reassembles pre-trained modules of\nSAM2 to enable few-shot adaptation, leveraging a small number of support\nexamples to capture anatomical knowledge and perform prompt-free segmentation,\nwithout requiring model fine-tuning. To handle the volumetric nature of medical\nimages, we incorporate a Volumetric Consistency mechanism that enhances spatial\ncoherence across 3D slices. We evaluate FATE-SAM on multiple medical imaging\ndatasets and compare it with supervised learning methods, zero-shot SAM\napproaches, and fine-tuned medical SAM methods. Results show that FATE-SAM\ndelivers robust and accurate segmentation while eliminating the need for large\nannotated datasets and expert intervention. FATE-SAM provides a practical,\nefficient solution for medical image segmentation, making it more accessible\nfor clinical applications.",['cs.CV'],http://arxiv.org/abs/2501.09138v1


## sentence-transformers models

### What is a sentence-transformers model?

A sentence-transformers model maps sentences and paragraphs to an N-dimensional dense vector space. The resulting vectors can be used for tasks such as clustering or semantic search.
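To make the idea of a dense vector space concrete: texts with similar meaning end up as vectors that point in similar directions, and this is what clustering and semantic search rely on. The sketch below is only an illustration. The 3-dimensional vectors are made up (a real model produces hundreds of dimensions), and `cosine_similarity` is a hand-written helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones would come from model.encode()
toy_embeddings = {
    "polyp segmentation survey": [0.9, 0.1, 0.0],
    "medical image segmentation": [0.8, 0.2, 0.1],
    "wheat disease classification": [0.1, 0.9, 0.2],
}

query = toy_embeddings["polyp segmentation survey"]
for text, vector in toy_embeddings.items():
    # A score near 1.0 means the vectors point in almost the same direction
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```

With these toy values the two segmentation texts score much closer to each other than either does to the wheat text, which is the behaviour semantic search depends on.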

### all-MiniLM-L6-v2

MiniLM is a compact transformer distilled from larger BERT-style models, designed to keep most of their language-understanding quality while being significantly smaller and more efficient. `all-MiniLM-L6-v2` is a 6-layer MiniLM checkpoint fine-tuned for sentence embeddings; it maps each input text to a 384-dimensional vector.

I chose this model for the project for the following reasons:

- Efficiency: MiniLM models are smaller and faster than full-size BERT models, which is a major advantage when computational resources are limited or large amounts of data must be processed quickly.
- Performance: Despite their smaller size, MiniLM models often perform comparably to full-size BERT models on a variety of NLP tasks, so little accuracy is sacrificed. The `Performance Sentence Embeddings` metric, which is the average performance on encoding sentences over 14 diverse tasks from different domains, is `68.06` for `all-MiniLM-L6-v2`, a good starting point.
- Ease of Use: Libraries such as Hugging Face's Transformers and sentence-transformers make it relatively straightforward to load a pre-trained MiniLM model and fine-tune it for a specific task.
- Lower Memory Requirements: Because of its smaller size, MiniLM needs less memory for training and inference, which matters on limited hardware.

In [133]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Features we want to encode
titles = dataset['titles']
urls = dataset['urls']
# Features are encoded by calling model.encode(); plain lists of strings
# avoid any dependence on the pandas index
embeddings_titles = model.encode(titles.tolist())
embeddings_url = model.encode(urls.tolist())

In [134]:
# Preview the first few titles and URLs with their embedding dimensions.
# Slicing each series keeps the two loops independent of each other.
N_PREVIEW = 5

for title, embedding in zip(titles[:N_PREVIEW], embeddings_titles):
    print("Title:", title)
    print("Embedding dimension:", len(embedding))
    print("Title length:", len(title))
    print("")

for url, embedding in zip(urls[:N_PREVIEW], embeddings_url):
    print("URL:", url)
    print("Embedding dimension:", len(embedding))
    print("URL length:", len(url))
    print("")

Title: A Survey on Deep Learning for Polyp Segmentation: Techniques, Challenges and Future Trends
Embedding dimension: 384
Title length: 90

Title: A Multi-Scale Feature Extraction and Fusion Deep Learning Method for Classification of Wheat Diseases
Embedding dimension: 384
Title length: 101

Title: Image Segmentation with transformers: An Overview, Challenges and Future
Embedding dimension: 384
Title length: 72

Title: Shape-Based Single Object Classification Using Ensemble Method Classifiers
Embedding dimension: 384
Title length: 74

Title: Few-Shot Adaptation of Training-Free Foundation Model for 3D Medical Image Segmentation
Embedding dimension: 384
Title length: 87

URL: http://arxiv.org/abs/2311.18373v3
Embedding dimension: 384
URL length: 33

URL: http://arxiv.org/abs/2501.09938v1
Embedding dimension: 384
URL length: 33

URL: http://arxiv.org/abs/2501.09372v1
Embedding dimension: 384
URL length: 33

URL: http://arxiv.org/abs/2501.09311v1
Embedding dimension: 384
URL length: 33

URL: http://arxiv.org/abs/2501.09138v1
Embedding dimension: 384
URL length: 33



In [135]:
import pickle
from pathlib import Path

# Function to save a pickle file
def save_pickle(data, file_name, folder_path):
    # Ensure the folder exists
    folder_path.mkdir(parents=True, exist_ok=True)
    
    # Define the full file path
    file_path = folder_path / file_name
    
    # Save the pickle file
    try:
        with open(file_path, "wb") as f:
            pickle.dump(data, f)
        print(f"File '{file_name}' saved successfully at: {file_path}")
    except Exception as e:
        print(f"Error saving file '{file_name}': {e}")

# Save the embeddings
data_files = {
    "Titles.pkl": titles,
    "URLS.pkl": urls,
    "Embedding_Titles.pkl": embeddings_titles,
    "Embedding_URLS.pkl": embeddings_url
}

# The notebook runs one level below the project root, so the "models"
# folder is a sibling of the notebook's working directory
ROOT = Path.cwd()  # Folder the notebook runs from
MODEL = ROOT.parent / "models"  # <project root>/models

# Save all files
for file_name, data in data_files.items():
    save_pickle(data, file_name, MODEL)


File 'Titles.pkl' saved successfully at: /home/umair/UNI/ML Project/models/Titles.pkl
File 'URLS.pkl' saved successfully at: /home/umair/UNI/ML Project/models/URLS.pkl
File 'Embedding_Titles.pkl' saved successfully at: /home/umair/UNI/ML Project/models/Embedding_Titles.pkl
File 'Embedding_URLS.pkl' saved successfully at: /home/umair/UNI/ML Project/models/Embedding_URLS.pkl
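The saved files can later be read back with a matching loader. The sketch below is a minimal, assumed counterpart to `save_pickle`: the name `load_pickle` and the stand-in data are hypothetical and not part of the original notebook. It round-trips a small list through a temporary folder so it runs without the project files.

```python
import pickle
import tempfile
from pathlib import Path

def load_pickle(file_name, folder_path):
    """Load and return the object stored at folder_path / file_name."""
    file_path = Path(folder_path) / file_name
    with open(file_path, "rb") as f:
        return pickle.load(f)

# Round-trip check in a temporary folder, using stand-in data
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    sample_titles = ["A Survey on Deep Learning for Polyp Segmentation"]
    with open(folder / "Titles.pkl", "wb") as f:
        pickle.dump(sample_titles, f)
    restored = load_pickle("Titles.pkl", folder)

print(restored == sample_titles)
```

In the project itself the call would look like `load_pickle("Embedding_Titles.pkl", MODEL)`. Pickle files can execute arbitrary code when loaded, so only unpickle files you created yourself or otherwise trust.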


## Testing the embedding model

In [136]:
paper_you_like = input("Enter your topic of interest here 👇 \n")
paper_you_like

'HI'

In [137]:
from sentence_transformers import util

# Encode the query once and compare it against both sets of embeddings
query_embedding = model.encode(paper_you_like)
cosine_scores_titles = util.cos_sim(embeddings_titles, query_embedding)
cosine_scores_urls = util.cos_sim(embeddings_url, query_embedding)


In [138]:
import torch

# Each score tensor has one row per paper, so the top 5 are taken along dim 0
top_similar_papers_titles = torch.topk(cosine_scores_titles, k=5, dim=0, sorted=True)
top_similar_papers_urls = torch.topk(cosine_scores_urls, k=5, dim=0, sorted=True)
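The top-k step is simply "give me the positions of the k largest scores, best first". The following standard-library sketch shows that idea with a hypothetical score list and a hypothetical helper `top_k_indices`. It illustrates the selection logic only and is not how `torch.topk` is implemented.

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

# Hypothetical similarity scores, one per paper
scores = [0.12, 0.87, 0.45, 0.91, 0.03, 0.60]
top = top_k_indices(scores, k=3)

for rank, idx in enumerate(top, start=1):
    print(f"{rank}. paper #{idx} (score {scores[idx]:.2f})")
```

The returned indices play the same role as the `.indices` field of the `torch.topk` result: they are row numbers that can be used to look up the matching titles and URLs.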

In [139]:
# Each entry of .indices is a one-element tensor; .item() gives the row number.
# This loop lists the URL-similarity ranking, which is the ranking shown below.
# Iterate over top_similar_papers_titles.indices instead to rank by title similarity.
for i in top_similar_papers_urls.indices:
    print('Title:', titles[i.item()])
    print('URL:', urls[i.item()])
    print("")

Title: On the Convergence of the ELBO to Entropy Sums
URL: http://arxiv.org/abs/2209.03077v6

Title: Multimodal Adaptive Inference for Document Image Classification with Anytime Early Exiting
URL: http://arxiv.org/abs/2405.12705v1

Title: ICANet: A Method of Short Video Emotion Recognition Driven by Multimodal Data
URL: http://arxiv.org/abs/2208.11346v2

Title: Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
URL: http://arxiv.org/abs/2501.10080v1

Title: Diffusion Models in Vision: A Survey
URL: http://arxiv.org/abs/2209.04747v6