# Sentence Embeddings

This notebook generates sentence embeddings for paper titles using pre-trained models from the sentence-transformers library.

In [6]:
import pandas as pd
from pathlib import Path

from sentence_transformers import SentenceTransformer

In [7]:
PATH_DATA_BASE = Path.cwd().parent / 'data'
PATH_SENTENCES = Path.cwd().parent / 'models/sentences'
PATH_EMBEDDINGS = Path.cwd().parent / 'models/embeddings'

In [8]:
# Setting pandas option to display the full content of DataFrame columns without truncation
pd.set_option('display.max_colwidth', None)

In [9]:
dataset = pd.read_csv(PATH_DATA_BASE / 'filtered_data.csv')
dataset.head()

Unnamed: 0,titles,abstracts,terms,urls
0,HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization,"Tissue semantic segmentation is one of the key tasks in computational\npathology. To avoid the expensive and laborious acquisition of pixel-level\nannotations, a wide range of studies attempt to adopt the class activation map\n(CAM), a weakly-supervised learning scheme, to achieve pixel-level tissue\nsegmentation. However, CAM-based methods are prone to suffer from\nunder-activation and over-activation issues, leading to poor segmentation\nperformance. To address this problem, we propose a novel weakly-supervised\nsemantic segmentation framework for histopathological images based on\nimage-mixing synthesis and consistency regularization, dubbed HisynSeg.\nSpecifically, synthesized histopathological images with pixel-level masks are\ngenerated for fully-supervised model training, where two synthesis strategies\nare proposed based on Mosaic transformation and B\'ezier mask generation.\nBesides, an image filtering module is developed to guarantee the authenticity\nof the synthesized images. In order to further avoid the model overfitting to\nthe occasional synthesis artifacts, we additionally propose a novel\nself-supervised consistency regularization, which enables the real images\nwithout segmentation masks to supervise the training of the segmentation model.\nBy integrating the proposed techniques, the HisynSeg framework successfully\ntransforms the weakly-supervised semantic segmentation problem into a\nfully-supervised one, greatly improving the segmentation accuracy. Experimental\nresults on three datasets prove that the proposed method achieves a\nstate-of-the-art performance. Code is available at\nhttps://github.com/Vison307/HisynSeg.","['cs.CV', 'cs.AI']",http://arxiv.org/abs/2412.20924v1
1,Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation,"Accurate segmentation of wind turbine blade (WTB) images is critical for\neffective assessments, as it directly influences the performance of automated\ndamage detection systems. Despite advancements in large universal vision\nmodels, these models often underperform in domain-specific tasks like WTB\nsegmentation. To address this, we extend Intrinsic LoRA for image segmentation,\nand propose a novel dual-space augmentation strategy that integrates both\nimage-level and latent-space augmentations. The image-space augmentation is\nachieved through linear interpolation between image pairs, while the\nlatent-space augmentation is accomplished by introducing a noise-based latent\nprobabilistic model. Our approach significantly boosts segmentation accuracy,\nsurpassing current state-of-the-art methods in WTB image segmentation.","['cs.CV', 'cs.AI', 'cs.LG']",http://arxiv.org/abs/2412.20838v1
2,Solar Filaments Detection using Active Contours Without Edges,"In this article, an active contours without edges (ACWE)-based algorithm has\nbeen proposed for the detection of solar filaments in H-alpha full-disk solar\nimages. The overall algorithm consists of three main steps of image processing.\nThese are image pre-processing, image segmentation, and image post-processing.\nHere in the work, contours are initialized on the solar image and allowed to\ndeform based on the energy function. As soon as the contour reaches the\nboundary of the desired object, the energy function gets reduced, and the\ncontour stops evolving. The proposed algorithm has been applied to few\nbenchmark datasets and has been compared with the classical technique of object\ndetection. The results analysis indicates that the proposed algorithm\noutperforms the results obtained using the existing classical algorithm of\nobject detection.","['cs.CV', 'astro-ph.IM', 'astro-ph.SR', 'cs.AI', 'cs.LG']",http://arxiv.org/abs/2412.20749v1
3,TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation,"While large visual models (LVM) demonstrated significant potential in image\nunderstanding, due to the application of large-scale pre-training, the Segment\nAnything Model (SAM) has also achieved great success in the field of image\nsegmentation, supporting flexible interactive cues and strong learning\ncapabilities. However, SAM's performance often falls short in cross-domain and\nfew-shot applications. Previous work has performed poorly in transferring prior\nknowledge from base models to new applications. To tackle this issue, we\npropose a task-adaptive auto-visual prompt framework, a new paradigm for\nCross-dominan Few-shot segmentation (CD-FSS). First, a Multi-level Feature\nFusion (MFF) was used for integrated feature extraction as prior knowledge.\nBesides, we incorporate a Class Domain Task-Adaptive Auto-Prompt (CDTAP) module\nto enable class-domain agnostic feature extraction and generate high-quality,\nlearnable visual prompts. This significant advancement uses a unique generative\napproach to prompts alongside a comprehensive model structure and specialized\nprototype computation. While ensuring that the prior knowledge of SAM is not\ndiscarded, the new branch disentangles category and domain information through\nprototypes, guiding it in adapting the CD-FSS. Comprehensive experiments across\nfour cross-domain datasets demonstrate that our model outperforms the\nstate-of-the-art CD-FSS approach, achieving an average accuracy improvement of\n1.3\% in the 1-shot setting and 11.76\% in the 5-shot setting.",['cs.CV'],http://arxiv.org/abs/2409.05393v2
4,Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation,"Although recent years have witnessed significant advancements in medical\nimage segmentation, the pervasive issue of domain shift among medical images\nfrom diverse centres hinders the effective deployment of pre-trained models.\nMany Test-time Adaptation (TTA) methods have been proposed to address this\nissue by fine-tuning pre-trained models with test data during inference. These\nmethods, however, often suffer from less-satisfactory optimization due to\nsuboptimal optimization direction (dictated by the gradient) and fixed\nstep-size (predicated on the learning rate). In this paper, we propose the\nGradient alignment-based Test-time adaptation (GraTa) method to improve both\nthe gradient direction and learning rate in the optimization procedure. Unlike\nconventional TTA methods, which primarily optimize the pseudo gradient derived\nfrom a self-supervised objective, our method incorporates an auxiliary gradient\nwith the pseudo one to facilitate gradient alignment. Such gradient alignment\nenables the model to excavate the similarities between different gradients and\ncorrect the gradient direction to approximate the empirical gradient related to\nthe current segmentation task. Additionally, we design a dynamic learning rate\nbased on the cosine similarity between the pseudo and auxiliary gradients,\nthereby empowering the adaptive fine-tuning of pre-trained models on diverse\ntest data. Extensive experiments establish the effectiveness of the proposed\ngradient alignment and dynamic learning rate and substantiate the superiority\nof our GraTa method over other state-of-the-art TTA methods on a benchmark\nmedical image segmentation task. The code and weights of pre-trained source\nmodels are available at https://github.com/Chen-Ziyang/GraTa.",['cs.CV'],http://arxiv.org/abs/2408.07343v4


### sentence-transformers models

#### What is a sentence-transformers model?

A sentence-transformers model maps sentences and paragraphs to an N-dimensional dense vector space. The resulting vectors can be used for tasks such as clustering or semantic search.

#### all-MiniLM-L6-v2

MiniLM is a smaller, distilled variant of BERT-style models, designed to provide high-quality language understanding while being significantly smaller and more efficient. `all-MiniLM-L6-v2` is a specific MiniLM configuration with 6 Transformer layers, fine-tuned for sentence embeddings; it produces 384-dimensional vectors.

Here are some reasons why this model is a good choice for our use case:

* Efficiency: MiniLM models are smaller and faster than full-size BERT models, which can be a major advantage if you're working on a project with limited computational resources or if you need to process large amounts of data quickly.

* Performance: Despite their smaller size, MiniLM models often perform at a level comparable to full-size BERT models on a variety of NLP tasks, so you can often use one without sacrificing much quality. The `Performance Sentence Embeddings` metric, which is the average performance on encoding sentences over 14 diverse tasks from different domains, is `68.06` for `all-MiniLM-L6-v2`. That is a good starting point.

* Ease of Use: With a library like sentence-transformers (built on Hugging Face's Transformers), it is relatively straightforward to load a pre-trained MiniLM model and, if needed, fine-tune it for your specific task.

* Lower Memory Requirements: Given its smaller size, MiniLM requires less memory for training and inference. This could be a crucial factor if you're working with limited hardware resources.
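
To make the memory point concrete, the size of the stored embedding matrix can be estimated with simple arithmetic. The sketch below assumes 384-dimensional float32 vectors (4 bytes per value; check `embeddings.dtype` to confirm) and a hypothetical corpus of 40,000 titles, a number chosen only for illustration and not taken from the dataset.

```python
def embedding_matrix_bytes(n_rows: int, dim: int = 384, bytes_per_value: int = 4) -> int:
    """Return the size in bytes of an n_rows x dim matrix of fixed-width values."""
    return n_rows * dim * bytes_per_value


n_titles = 40_000  # hypothetical corpus size, not taken from the dataset
size_mib = embedding_matrix_bytes(n_titles) / 1024**2
print(f"{n_titles} titles -> {size_mib:.1f} MiB")  # 40000 titles -> 58.6 MiB
```

Under these assumptions the full matrix fits comfortably in memory on a typical laptop.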

In [10]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# The feature we want to encode: the paper titles, as a plain list of strings
sentences = dataset['titles'].tolist()

# Features are encoded by calling model.encode()
embeddings = model.encode(sentences)

In [11]:
# Print the first six titles with their embedding dimension and title length
for sentence, embedding in zip(sentences[:6], embeddings[:6]):
    print("Sentence:", sentence)
    print("Embedding dimension:", len(embedding))
    print("Title length:", len(sentence))
    print("")


Sentence: HisynSeg: Weakly-Supervised Histopathological Image Segmentation via Image-Mixing Synthesis and Consistency Regularization
Embedding dimension: 384
Title length: 122

Sentence: Dual-Space Augmented Intrinsic-LoRA for Wind Turbine Segmentation
Embedding dimension: 384
Title length: 65

Sentence: Solar Filaments Detection using Active Contours Without Edges
Embedding dimension: 384
Title length: 61

Sentence: TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation
Embedding dimension: 384
Title length: 72

Sentence: Gradient Alignment Improves Test-Time Adaptation for Medical Image Segmentation
Embedding dimension: 384
Title length: 79

Sentence: A Large-scale Interpretable Multi-modality Benchmark for Facial Image Forgery Localization
Embedding dimension: 384
Title length: 90



In [12]:
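import pickle

# Make sure the output directories exist before writing
PATH_EMBEDDINGS.mkdir(parents=True, exist_ok=True)
PATH_SENTENCES.mkdir(parents=True, exist_ok=True)

# Save the sentences and their corresponding embeddings
with open(PATH_EMBEDDINGS / 'embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

with open(PATH_SENTENCES / 'sentences.pkl', 'wb') as f:
    pickle.dump(sentences, f)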
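
In a later session, the saved files can be read back with `pickle.load`. The sketch below runs the round trip on toy data in a temporary directory so that it works on its own. The file names mirror the ones above, but the titles and vectors are placeholders, not real embeddings. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder data standing in for the real titles and embeddings
toy_sentences = ["first title", "second title"]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Save both objects to disk
    with open(tmp_dir / "sentences.pkl", "wb") as f:
        pickle.dump(toy_sentences, f)
    with open(tmp_dir / "embeddings.pkl", "wb") as f:
        pickle.dump(toy_embeddings, f)

    # Load them back
    with open(tmp_dir / "sentences.pkl", "rb") as f:
        loaded_sentences = pickle.load(f)
    with open(tmp_dir / "embeddings.pkl", "rb") as f:
        loaded_embeddings = pickle.load(f)

# The two objects must stay aligned: one embedding per sentence
print(len(loaded_sentences) == len(loaded_embeddings))
```

Keeping sentences and embeddings in the same order matters, because the search step looks titles up by the row index of their embedding.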

## Testing the embedding model

In [13]:
paper_you_like = input("Enter your topic of interest here 👇 \n")
paper_you_like

'Self Driving'

In [14]:
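from sentence_transformers import util

# Cosine similarity between every title embedding and the query embedding.
# The result has one row per title, i.e. shape (N, 1).
cosine_scores = util.cos_sim(embeddings, model.encode(paper_you_like))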
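
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. As an illustration of what is being computed for each title, here is a pure-Python version applied to made-up 3-dimensional vectors (real embeddings from this model have 384 dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only
query = [1.0, 0.0, 1.0]
docs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

toy_scores = [cosine_similarity(doc, query) for doc in docs]
print([round(s, 3) for s in toy_scores])
```

A vector pointing in the same direction as the query scores close to 1, an orthogonal one scores 0, and partial overlap lands in between.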

In [15]:
import torch

# Pick the five highest similarity scores (and their row indices), best first
top_similar_papers = torch.topk(cosine_scores, dim=0, k=5, sorted=True)
top_similar_papers

torch.return_types.topk(
values=tensor([[0.5887],
        [0.5686],
        [0.5554],
        [0.5535],
        [0.5486]]),
indices=tensor([[13715],
        [ 5041],
        [32691],
        [36365],
        [13906]]))
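
`torch.topk` returns the `k` largest values together with their positions, in descending order when `sorted=True`. A rough pure-Python equivalent for a flat list of scores, using made-up numbers and a hypothetical helper named `top_k`, might look like this:

```python
import heapq


def top_k(scores, k):
    """Return (values, indices) of the k largest scores, highest first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [i for i, _ in best]
    values = [v for _, v in best]
    return values, indices


# Made-up similarity scores, one per paper
toy_scores = [0.12, 0.58, 0.33, 0.57, 0.05, 0.49]
values, indices = top_k(toy_scores, k=3)
print(values)   # [0.58, 0.57, 0.49]
print(indices)  # [1, 3, 5]
```

The indices are what matter for the next step: they are row positions in the embedding matrix, and therefore also positions in the list of titles.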

In [16]:
# Look up and print the title behind each of the top-5 indices
for i in top_similar_papers.indices:
    print(sentences[i.item()])

Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features
Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences
Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting
Learning On-Road Visual Control for Self-Driving Vehicles with Auxiliary Tasks
UniWorld: Autonomous Driving Pre-training via World Models
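
It can help to show each score next to its title. The sketch below hard-codes the scores and titles printed above so that it runs on its own; in the notebook they would come from `top_similar_papers.values` and `top_similar_papers.indices` instead. The output format is just one possible choice.

```python
# Scores and titles copied from the outputs above
scores = [0.5887, 0.5686, 0.5554, 0.5535, 0.5486]
titles = [
    "Unsupervised Domain Adaptation for Self-Driving from Past Traversal Features",
    "Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences",
    "Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting",
    "Learning On-Road Visual Control for Self-Driving Vehicles with Auxiliary Tasks",
    "UniWorld: Autonomous Driving Pre-training via World Models",
]

# Build one ranked line per result: rank, score, title
lines = [
    f"{rank}. [{score:.4f}] {title}"
    for rank, (score, title) in enumerate(zip(scores, titles), start=1)
]
print("\n".join(lines))
```

Seeing the scores makes it easier to judge how close the matches are: here all five fall between roughly 0.55 and 0.59.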
