<a href="https://colab.research.google.com/github/Marian843/1stJupyterNotebook/blob/main/sbert_semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''from google.colab import files
uploaded = files.upload()'''

##**Import the Needed Libraries**

In [None]:
import pandas as pd
import re
import nltk
import spacy
import unicodedata
import numpy as np
from google.colab import files
from sentence_transformers import SentenceTransformer, util
from nltk.tokenize import sent_tokenize

nltk.download('punkt_tab')  # tokenizer data required by sent_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


##**Load the Dataset**


---


- This loads the dataset from a CSV file stored in Google Drive (Drive must already be mounted at `/content/drive`).
- The dataset comes from Kaggle.
- Each row holds the title, abstract (`summaries`), and subject categories (`terms`) of an **arXiv** paper.

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/arxiv_data.csv')

df

| Unnamed: 0 | titles | summaries | terms |
|---|---|---|---|
| 0 | Survey on Semantic Stereo Matching / Semantic ... | Stereo matching is one of the widely used tech... | ['cs.CV', 'cs.LG'] |
| 1 | FUTURE-AI: Guiding Principles and Consensus Re... | The recent advancements in artificial intellig... | ['cs.CV', 'cs.AI', 'cs.LG'] |
| 2 | Enforcing Mutual Consistency of Hard Regions f... | In this paper, we proposed a novel mutual cons... | ['cs.CV', 'cs.AI'] |
| 3 | Parameter Decoupling Strategy for Semi-supervi... | Consistency training has proven to be an advan... | ['cs.CV'] |
| 4 | Background-Foreground Segmentation for Interio... | To ensure safety in automated driving, the cor... | ['cs.CV', 'cs.LG'] |
| ... | ... | ... | ... |
| 51769 | Hierarchically-coupled hidden Markov models fo... | We address the problem of analyzing sets of no... | ['stat.ML', 'physics.bio-ph', 'q-bio.QM'] |
| 51770 | Blinking Molecule Tracking | We discuss a method for tracking individual mo... | ['cs.CV', 'cs.DM'] |
| 51771 | Towards a Mathematical Foundation of Immunolog... | We attempt to set a mathematical foundation of... | ['stat.ML', 'cs.LG', 'q-bio.GN'] |
| 51772 | A Semi-Automatic Graph-Based Approach for Dete... | Diffusion Tensor Imaging (DTI) allows estimati... | ['cs.CV'] |


In [None]:
!pip install nltk contractions



##**Cleaning the Data**


---


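- The title and abstract of each paper go through a basic NLP cleaning pipeline chosen with the SBERT model in mind (sentence punctuation is kept). The steps run in this order:
  - Unicode normalization (NFKC)
  - Contraction expansion (*for example:* don't → do not)
  - Lowercasing
  - URL and e-mail removal (anything starting with `http` or `www.`, and anything containing `@`)
  - Non-word character removal, keeping `.`, `,`, `!` and `?`
  - Latin abbreviation removal (i.e., e.g., etc.)
  - Sentence tokenization
  - Extra whitespace removal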
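To get a feel for what the regex-based steps in this list do, here is a minimal standard-library sketch that applies them to one string. It deliberately leaves out contraction expansion and sentence tokenization, and both the helper name `regex_clean` and the sample sentence are made up for illustration.

```python
import re
import unicodedata

def regex_clean(text: str) -> str:
    """Apply only the regex-based cleaning steps to a single string."""
    text = unicodedata.normalize("NFKC", text)             # Unicode normalization
    text = text.lower()                                     # lowercase
    text = re.sub(r"http\S+|www\.\S+|\S+@\S+", "", text)   # URLs and e-mail addresses
    text = re.sub(r"[^\w\s.,!?]", "", text)                # keep word chars, whitespace, . , ! ?
    text = re.sub(r"\b(e\.g\.|i\.e\.|etc\.)", "", text)    # Latin abbreviations
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

# Hypothetical sample sentence, not taken from the dataset.
sample = "Semi-supervised models (e.g. CNNs) are described at https://example.org/paper today."
print(regex_clean(sample))
# semisupervised models cnns are described at today.
```

Note that removing non-word characters also drops hyphens, so "semi-supervised" becomes "semisupervised".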



In [None]:
import contractions  # installed above with pip; provides contractions.fix

def clean_and_tokenize_text(text):
    if not text or pd.isna(text):
        return []

    # Normalize unicode
    text = unicodedata.normalize("NFKC", text)

    # Expand contractions (e.g., "don't" → "do not")
    text = contractions.fix(text)

    # Lowercase
    text = text.lower()

    # Remove URLs and emails
    text = re.sub(r"http\S+|www\.\S+|\S+@\S+", "", text)

    # Remove non-word characters, keeping basic punctuation (. , ! ?)
    text = re.sub(r"[^\w\s.,!?]", "", text)

    # Remove Latin abbreviations (e.g., i.e., etc.)
    text = re.sub(r'\b(e\.g\.|i\.e\.|etc\.)', '', text, flags=re.IGNORECASE)

    # Sentence tokenization
    sentences = sent_tokenize(text)

    # Remove extra whitespace and empty strings
    cleaned_sentences = [re.sub(r"\s+", " ", s).strip() for s in sentences if s.strip()]

    return cleaned_sentences

df['cleaned_sentences'] = df.apply(
    lambda row: clean_and_tokenize_text(f"{row['titles']}. {row['summaries']}"),
    axis=1
)

df.to_csv("cleaned_sentences_output.csv", index=False)
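One thing to keep in mind about the saved file: a CSV cell can only hold text, so each list in `cleaned_sentences` is written as its string representation. The sketch below uses the standard-library `csv` module as a stand-in for `to_csv`/`read_csv` to show the round trip; the one-row sample is hypothetical.

```python
import ast
import csv
import io

# Hypothetical one-row stand-in for the cleaned DataFrame.
rows = [{
    "titles": "Blinking Molecule Tracking",
    "cleaned_sentences": ["blinking molecule tracking.", "we discuss a method."],
}]

# Write: non-string values are converted with str(), so the list becomes text.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["titles", "cleaned_sentences"])
writer.writeheader()
writer.writerows(rows)

# Read back: the column now holds a string, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["cleaned_sentences"]
print(type(raw).__name__)

# ast.literal_eval safely parses the string back into a Python list.
sentences = ast.literal_eval(raw)
print(sentences)
```

With pandas, the same recovery after `pd.read_csv` would be `df['cleaned_sentences'].apply(ast.literal_eval)`.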



##**Load the SBERT Model and Generate Sentence Embeddings**


---

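This step uses the Sentence Transformers library (also known as SBERT) to convert text into vector embeddings. The cleaned sentences of each paper are joined into one string, and each string is encoded as a single vector.

The pre-trained model is *all-MiniLM-L6-v2*, a compact model that produces 384-dimensional embeddings and is considerably faster than larger SBERT models, at some cost in accuracy.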
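Embeddings are useful because texts with similar meaning end up pointing in similar directions, which is usually measured with cosine similarity. As a rough illustration, the sketch below computes it in plain Python; the three-dimensional vectors and their names are invented for the example and are not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
skin_lesion = [0.9, 0.1, 0.0]
dermatology = [0.8, 0.2, 0.1]
markov_model = [0.0, 0.2, 0.9]

related = cosine_similarity(skin_lesion, dermatology)
unrelated = cosine_similarity(skin_lesion, markov_model)
print(f"related:   {related:.4f}")
print(f"unrelated: {unrelated:.4f}")
```

The first pair scores close to 1 and the second close to 0, which is the signal a semantic search ranks on.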

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

df['joined_sentences'] = df['cleaned_sentences'].apply(lambda sents: " ".join(sents))

embeddings = model.encode(df['joined_sentences'].tolist(), show_progress_bar=True)

np.save('sbert_embeddings.npy', embeddings)



##**Load the Saved Sentence Embeddings**

In [None]:
embeddings = np.load('sbert_embeddings.npy')
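With the embeddings in memory, a search amounts to scoring every row against a query vector and keeping the best matches. `util.semantic_search` scores with cosine similarity by default; the NumPy sketch below is only an approximation of that idea, not the library's implementation. The function name `top_k_cosine` and the toy arrays are hypothetical.

```python
import numpy as np

def top_k_cosine(query_vec, corpus, k=5):
    """Return (index, score) pairs for the k corpus rows most similar to query_vec."""
    # Normalize rows and query so the dot product equals cosine similarity.
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = corpus_norm @ query_norm
    # Sort by descending score and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical 4 x 3 corpus standing in for the real (n_papers x 384) embedding matrix.
toy_corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

results = top_k_cosine(toy_query, toy_corpus, k=2)
for idx, score in results:
    print(idx, round(score, 4))
```

A brute-force scan like this is fine for tens of thousands of vectors; much larger corpora typically call for an approximate nearest-neighbor index.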

In [None]:
query = "Skin tone detection using CNN"
query_embedding = model.encode(query, convert_to_numpy=True)

hits = util.semantic_search(query_embedding, embeddings, top_k=5)[0]

for hit in hits:
    idx = hit['corpus_id']
    score = hit['score']
    print(f"Score: {score:.4f}")
    print("Title:", df.iloc[idx]['titles'])
    print("Summary:", df.iloc[idx]['summaries'])
    print("\n")

Score: 0.6396
Title: Estimating Skin Tone and Effects on Classification Performance in Dermatology Datasets
Summary: Recent advances in computer vision and deep learning have led to
breakthroughs in the development of automated skin image analysis. In
particular, skin cancer classification models have achieved performance higher
than trained expert dermatologists. However, no attempt has been made to
evaluate the consistency in performance of machine learning models across
populations with varying skin tones. In this paper, we present an approach to
estimate skin tone in benchmark skin disease datasets, and investigate whether
model performance is dependent on this measure. Specifically, we use individual
typology angle (ITA) to approximate skin tone in dermatology datasets. We look
at the distribution of ITA values to better understand skin color
representation in two benchmark datasets: 1) the ISIC 2018 Challenge dataset, a
collection of dermoscopic images of skin lesions for the det