# Fine-Tuning SBERT for Clarifying Ambiguous Questions

Fine-Tuning Sentence-BERT on Ambiguous Queries and Clarifying Questions for Conversational Search

In this notebook, we fine-tune a pre-trained Sentence-BERT (SBERT) model on ambiguous user queries and their corresponding clarifying questions. SBERT is an embedding model and does not generate text, so the goal is to improve how well it scores and ranks candidate clarifying questions for a query, which is useful in conversational search systems.

The dataset used in this experiment (`qulac_for_wiki.json`) is turned into pairs of user queries and clarifying questions. We train the model to embed each query close to its paired clarification, so that a suitable follow-up question can be retrieved for an ambiguous query.

The main steps include:
1. Preprocessing the data from a JSON format.
2. Training the model on the query-clarification pairs using the `MultipleNegativesRankingLoss` function.
3. Spot-checking the fine-tuned model with cosine similarity scores and a small reranking example.


# 1. Install Required Libraries


In [None]:
!pip install pandas sentence-transformers datasets requests


In [23]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 2. Load and Explore the JSON Data



In [24]:
import pandas as pd
import json
import requests

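# Download the JSON file from GitHub
url = "https://raw.githubusercontent.com/Ansh0903/JSON/master/qulac_for_wiki.json"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page
data = json.loads(response.text)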

# View the structure
print("Data Type:", type(data))
print("Total Number of Keys:", len(data))
print("First Few Keys:", list(data.keys())[:5])


Data Type: <class 'dict'>
Total Number of Keys: 198
First Few Keys: ['obama family tree', 'cheap internet', 'ritz carlton lake las vegas', 'fickle creek farm', 'madam cj walker']


# 3. Inspect Sample Data

In [25]:
# Pick one sample to understand its structure
sample_key = list(data.keys())[0]
print(f"Sample Key: {sample_key}")
print("Associated Data:", data[sample_key])


Sample Key: obama family tree
Associated Data: ['find the time magazine photo essay \\"barack obama\\\'s family tree\\".', "where did barack obama\\'s parents and grandparents come from?", "find biographical information on barack obama\\'s mother."]
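
The sample output shows that the stored strings carry literal backslash escapes (`\"` and `\'`). A minimal sketch of a cleanup step, assuming plain quotes are wanted before the text is encoded; `unescape_quotes` is a hypothetical helper and is not part of the original notebook:

```python
def unescape_quotes(text: str) -> str:
    """Replace backslash-escaped quotes with plain quotes (hypothetical helper)."""
    return text.replace("\\'", "'").replace('\\"', '"')


# Example string modelled on the sample output above
raw = "where did barack obama\\'s parents and grandparents come from?"
cleaned = unescape_quotes(raw)
print(cleaned)
```

Whether this step helps is untested here; the notebook trains on the strings as downloaded.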


# 4. Extract Queries and Clarifying Questions


In [12]:
queries = []
clarifications = []

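for key, item in data.items():
    # `item` is the list of strings stored under each topic key
    if len(item) > 0:
        queries.append(item[0])  # First list entry is used as the user query
    else:
        queries.append(None)     # Placeholder so both lists stay aligned

    if len(item) > 1:
        clarifications.append(item[1])  # Second list entry is used as the clarification
    else:
        clarifications.append(None)     # Removed later by df.dropna()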

print(f"Extracted {len(queries)} Queries and {len(clarifications)} Clarifications")


Extracted 198 Queries and 198 Clarifications
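
The loop appends one entry per key even when a list is too short, which is why both counts equal the number of keys; incomplete pairs are filled with `None`. A minimal sketch of that behaviour on made-up data (the entries in `toy_data` are hypothetical, not taken from the dataset):

```python
# Toy stand-in for the downloaded dictionary (hypothetical entries)
toy_data = {
    "topic a": ["first facet", "second facet", "third facet"],
    "topic b": ["only one facet"],
    "topic c": [],
}

pairs = []
for topic, facets in toy_data.items():
    query = facets[0] if len(facets) > 0 else None
    clarification = facets[1] if len(facets) > 1 else None
    pairs.append((query, clarification))

# Keep only complete pairs, mirroring the effect of dropping missing values
complete_pairs = [(q, c) for q, c in pairs if q is not None and c is not None]

print(len(pairs), "extracted,", len(complete_pairs), "complete")
```

Note that any entries beyond the second one in a list are ignored by this pairing scheme.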


# 5. Create and Save Dataset for Training


In [13]:
# Create DataFrame
df = pd.DataFrame({
    'User Query': queries,
    'Clarifying Question': clarifications
})

# Remove incomplete data
df = df.dropna()

# Save to CSV
df.to_csv('sbert_training_data.csv', index=False)
print("Dataset saved as 'sbert_training_data.csv'.")

# Show sample
df.head()


Dataset saved as 'sbert_training_data.csv'.


| | User Query | Clarifying Question |
|---|---|---|
| 0 | find the time magazine photo essay \"barack ob... | where did barack obama\'s parents and grandpar... |
| 1 | what are some low-cost broadband internet prov... | do any internet providers still sell dial-up? |
| 2 | find information about the ritz carlton resort... | find a site where i can determine room price a... |
| 3 | find general information about fickle creek fa... | where is fickle creek farm, and how can i go t... |
| 4 | find historical information about madam c. j. ... | find information about the business that c. j.... |


In [14]:
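import os

# Save a copy to Google Drive, creating the target folder if it does not exist
os.makedirs('/content/drive/MyDrive/SBERT', exist_ok=True)
df.to_csv('/content/drive/MyDrive/SBERT/sbert_training_data.csv', index=False)
print("Dataset saved to Google Drive in: /content/drive/MyDrive/SBERT/sbert_training_data.csv")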


Dataset saved to Google Drive in: /content/drive/MyDrive/SBERT/sbert_training_data.csv


# 6. Prepare Data for SBERT Training

In [15]:
from sentence_transformers import InputExample

train_examples = [
    InputExample(texts=[row['User Query'], row['Clarifying Question']])
    for index, row in df.iterrows()
]
print(f"Prepared {len(train_examples)} training examples.")


Prepared 196 training examples.


# 7. Create DataLoader

In [16]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
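
Batch size matters more than usual here because the `MultipleNegativesRankingLoss` named in the introduction draws its negatives from the other pairs in the same batch. A small arithmetic sketch, assuming the 196 examples reported above and the DataLoader's default of keeping the final partial batch:

```python
import math

num_examples = 196   # training examples reported above
batch_size = 16      # value passed to the DataLoader

num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size

# With in-batch negatives, every other pair in the batch acts as a negative
negatives_full_batch = batch_size - 1
negatives_last_batch = last_batch_size - 1

print(num_batches, last_batch_size, negatives_full_batch, negatives_last_batch)
```

Under these assumptions a full batch gives each query 15 negatives, while the short final batch gives far fewer.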


# 8. Load Pre-trained SBERT Model


In [17]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')



# 9. Define Loss Function for Fine-Tuning

In [18]:
from sentence_transformers import losses

train_loss = losses.MultipleNegativesRankingLoss(model)
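
To make the objective concrete, here is a pure-Python sketch of the idea behind this loss: each query should score its own clarification higher than every other clarification in the batch. It assumes cosine similarity and a scale factor of 20, which to my knowledge are the library defaults; `mnr_loss` and the toy vectors are illustrative and are not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(query_vecs, positive_vecs, scale=20.0):
    """Mean cross-entropy where positive i is the correct 'class' for query i."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, p) for p in positive_vecs]
        # log-sum-exp, shifted by the max for numerical stability
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


# Toy 2-D "embeddings" (hypothetical values)
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive matches its query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong query

loss_aligned = mnr_loss(toy_queries, aligned)
loss_swapped = mnr_loss(toy_queries, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The aligned pairing yields the lower loss, which is the behaviour fine-tuning pushes the real embeddings toward.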


# 10. Fine-Tune the Model

In [19]:
pip install datasets


In [20]:
pip install --upgrade "transformers[torch]" accelerate


In [22]:
import os
os.environ["WANDB_DISABLED"] = "true"

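model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,            # Number of passes over the training pairs; adjust as needed
    # NOTE: 196 pairs at batch size 16 give 13 steps per epoch, so 4 epochs is
    # only 52 steps. A 100-step warmup never completes; consider a smaller value.
    warmup_steps=100,
    show_progress_bar=True
)
# Save the fine-tuned model to Google Drive
model.save('/content/drive/MyDrive/SBERT')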



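NameError: name 'Dataset' is not defined

Note: this error most likely comes from a run in which the `datasets` package was not yet available when `sentence-transformers` was imported. After installing it, restarting the runtime and rerunning the cells above should let `model.fit` complete. The cells below load a model from the save path, so they presumably follow a successful run.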
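
The training budget in this cell is small, which is worth checking against the warmup setting. A quick arithmetic sketch, assuming a linear warmup schedule and the values used in this notebook (196 examples, batch size 16, 4 epochs, 100 warmup steps):

```python
import math

num_examples = 196
batch_size = 16
epochs = 4
warmup_steps = 100

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Under linear warmup the learning-rate multiplier at step s is s / warmup_steps
final_lr_fraction = min(1.0, total_steps / warmup_steps)

print(total_steps, final_lr_fraction)
```

Under these assumptions training ends while the learning rate is still ramping up, so a warmup of roughly 10% of the total steps may be a more conventional choice.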

In [15]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('/content/drive/MyDrive/SBERT')


query = "obama family tree"
clarification = "What aspect of the Obama family tree are you asking about?"

# Encode both
query_embedding = model.encode(query, convert_to_tensor=True)
clarification_embedding = model.encode(clarification, convert_to_tensor=True)

# Calculate cosine similarity
similarity = util.pytorch_cos_sim(query_embedding, clarification_embedding)
print("Similarity Score:", similarity.item())

Similarity Score: 0.89948570728302
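
For reference, the score printed above is the cosine similarity between the two embedding vectors. Writing $\mathbf{q}$ for the query embedding and $\mathbf{c}$ for the clarification embedding (symbols introduced here for illustration):

```latex
\operatorname{cos\_sim}(\mathbf{q}, \mathbf{c})
  = \frac{\mathbf{q} \cdot \mathbf{c}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c} \rVert}
  \in [-1, 1]
```

A value near 1 means the two embeddings point in almost the same direction; the score depends only on direction, not on vector length.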


In [16]:
test_query = "Tell me about the Obama family history"
clarification = "What aspect of the Obama family tree are you asking about?"

test_embedding = model.encode(test_query, convert_to_tensor=True)
clarification_embedding = model.encode(clarification, convert_to_tensor=True)

similarity = util.pytorch_cos_sim(test_embedding, clarification_embedding)
print("Similarity Score:", similarity.item())

Similarity Score: 0.7635096311569214


In [55]:
query = "how much is the fees?"
bm25_candidates = [
    "What major you are looking into?",
    "Are you intrested in sports science?",
    "what are the grade requirements?",
    "Are you looking for a degree in biology?",
    "what are fees per year?"
]



In [56]:
from sentence_transformers.util import cos_sim
import torch

def rerank_with_sbert(query, candidates, model, top_k=1):
    query_embed = model.encode(query, convert_to_tensor=True)
    candidate_embeds = model.encode(candidates, convert_to_tensor=True)
    scores = cos_sim(query_embed, candidate_embeds)[0]

    # Optional: print every candidate with its similarity score
    for i, (q, score) in enumerate(zip(candidates, scores)):
        print(f"{i+1}. {q} — Score: {score.item():.4f}")

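    # Clamp k so torch.topk does not fail when fewer candidates are supplied
    top_results = torch.topk(scores, k=min(top_k, len(candidates)))
    top_indices = top_results.indices.tolist()

    # Return a single string for top_k == 1, otherwise a list ordered by score
    if top_k == 1:
        return candidates[top_indices[0]]
    return [candidates[i] for i in top_indices]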


In [57]:
best_question = rerank_with_sbert(query, bm25_candidates, model)
print("Top Clarifying Question:", best_question)


1. What major you are looking into? — Score: 0.1600
2. Are you intrested in sports science? — Score: 0.0635
3. what are the grade requirements? — Score: 0.1561
4. Are you looking for a degree in biology? — Score: 0.1410
5. what are fees per year? — Score: 0.7160
Top Clarifying Question: what are fees per year?
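
The selection step itself is just a sort by score. A dependency-free sketch using the scores printed above; `top_k_candidates` is a hypothetical helper, and the margin calculation is an illustrative addition rather than part of the notebook:

```python
# Scores copied from the reranking output above
scored_candidates = [
    ("What major you are looking into?", 0.1600),
    ("Are you intrested in sports science?", 0.0635),
    ("what are the grade requirements?", 0.1561),
    ("Are you looking for a degree in biology?", 0.1410),
    ("what are fees per year?", 0.7160),
]


def top_k_candidates(scored, k):
    """Return the k highest-scoring (text, score) pairs, best first."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


top_two = top_k_candidates(scored_candidates, k=2)
margin = top_two[0][1] - top_two[1][1]
print(top_two[0][0], f"(margin over runner-up: {margin:.4f})")
```

A large gap between the best and second-best score, as here, suggests the top candidate is a clear winner; a small gap might be a reason to show the user more than one option.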


This notebook demonstrated fine-tuning a Sentence-BERT model on ambiguous user queries paired with clarifying questions. Using `MultipleNegativesRankingLoss`, we trained the model to embed each query close to its paired clarification.

After training, we spot-checked the model in two ways. Scoring query-clarification pairs gave a cosine similarity of about 0.90 for the topic "obama family tree" and 0.76 for a paraphrased query. Reranking five candidate questions put the relevant one first with a score of 0.72, against at most 0.16 for the others. These checks are encouraging but informal, since no held-out evaluation was run; they suggest that the fine-tuned model could help conversational search systems handle ambiguous queries.

