# Fine-Tuning SBERT for Clarifying Ambiguous Questions

Fine-Tuning Sentence-BERT on Ambiguous Queries and Clarifying Questions for Conversational Search

In this notebook, we fine-tune a pre-trained Sentence-BERT (SBERT) model on ambiguous user queries paired with clarifying questions. The goal is to improve the model's ability to match an ambiguous query to an appropriate clarification, which is essential for conversational search systems.

The dataset used in this experiment consists of pairs of user queries and their corresponding clarifying questions. We train the model to embed each query close to its clarification, so that a suitable follow-up question can be retrieved for an ambiguous query.

The main steps include:
1. Preprocessing the data from a JSON format.
2. Training the model on the query-clarification pairs using the `MultipleNegativesRankingLoss` function.
3. Evaluating the model's performance using similarity scoring.


# 1. Install Required Libraries


In [1]:
!pip install pandas sentence-transformers datasets requests



# 2. Load and Explore the JSON Data



In [2]:
import pandas as pd
import json
import requests

# Download JSON from GitHub
url = "https://raw.githubusercontent.com/Sayalinale/Conversational-AI/refs/heads/main/qulac_for_wiki.json"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
data = json.loads(response.text)

# View the structure
print("Data Type:", type(data))
print("Total Number of Keys:", len(data))
print("First Few Keys:", list(data.keys())[:5])


Data Type: <class 'dict'>
Total Number of Keys: 198
First Few Keys: ['obama family tree', 'cheap internet', 'ritz carlton lake las vegas', 'fickle creek farm', 'madam cj walker']


# 3. Inspect Sample Data

In [3]:
# Pick one sample to understand its structure
sample_key = list(data.keys())[0]
print(f"Sample Key: {sample_key}")
print("Associated Data:", data[sample_key])


Sample Key: obama family tree
Associated Data: ['find the time magazine photo essay \\"barack obama\\\'s family tree\\".', "where did barack obama\\'s parents and grandparents come from?", "find biographical information on barack obama\\'s mother."]


# 4. Extract Queries and Clarifying Questions


In [4]:
queries = []
clarifications = []

for key, item in data.items():
    if len(item) > 0:
        queries.append(item[0])  # First list element, treated here as the user query
    else:
        queries.append(None)  # Placeholder; removed later by dropna()

    if len(item) > 1:
        clarifications.append(item[1])  # Second list element, treated here as the clarification
    else:
        clarifications.append(None)  # Placeholder; removed later by dropna()

print(f"Extracted {len(queries)} Queries and {len(clarifications)} Clarifications")


Extracted 198 Queries and 198 Clarifications
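
The count above is taken before any filtering, so it still includes entries whose list has fewer than two items. A minimal sketch of how such entries could be counted, using a small hypothetical dictionary (`toy_data`) in place of the downloaded `data`; the helper name `count_incomplete` is also an assumption made for illustration:

```python
# Hypothetical stand-in for the downloaded `data` dict (key -> list of strings).
toy_data = {
    "topic a": ["first item", "second item", "third item"],
    "topic b": ["only one item"],
    "topic c": [],
}

def count_incomplete(entries):
    """Count entries that cannot yield both a first and a second list element."""
    return sum(1 for items in entries.values() if len(items) < 2)

incomplete = count_incomplete(toy_data)
print(f"{incomplete} of {len(toy_data)} entries lack a second item")
```

Calling `count_incomplete(data)` on the real dictionary should indicate how many rows `dropna()` will remove in the next step.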


# 5. Create and Save Dataset for Training


In [5]:
# Create DataFrame
df = pd.DataFrame({
    'User Query': queries,
    'Clarifying Question': clarifications
})

# Remove incomplete data
df = df.dropna()

# Save to CSV
df.to_csv('sbert_training_data.csv', index=False)
print("Dataset saved as 'sbert_training_data.csv'.")

# Show sample
df.head()


Dataset saved as 'sbert_training_data.csv'.


|   | User Query | Clarifying Question |
|---|---|---|
| 0 | find the time magazine photo essay \"barack ob... | where did barack obama\'s parents and grandpar... |
| 1 | what are some low-cost broadband internet prov... | do any internet providers still sell dial-up? |
| 2 | find information about the ritz carlton resort... | find a site where i can determine room price a... |
| 3 | find general information about fickle creek fa... | where is fickle creek farm, and how can i go t... |
| 4 | find historical information about madam c. j. ... | find information about the business that c. j.... |


In [6]:
# Save to specific local directory (Windows path)
df.to_csv(r'C:\Users\sayali\Downloads\sbert_training_data.csv', index=False)
print("Dataset also saved to C:\\Users\\sayali\\Downloads\\sbert_training_data.csv.")

Dataset also saved to C:\Users\sayali\Downloads\sbert_training_data.csv.


# 6. Prepare Data for SBERT Training

In [7]:
from sentence_transformers import InputExample

train_examples = [
    InputExample(texts=[row['User Query'], row['Clarifying Question']])
    for index, row in df.iterrows()
]
print(f"Prepared {len(train_examples)} training examples.")


Prepared 196 training examples.
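
All 196 pairs are used for training here. If some pairs were to be kept aside for evaluation on unseen data, a deterministic split could look like the sketch below. The function `split_pairs`, the 10% fraction, and the seed are assumptions made for illustration and are not part of the original notebook:

```python
import random

def split_pairs(pairs, held_out_fraction=0.1, seed=42):
    """Shuffle pairs reproducibly and split them into (train, held_out)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]

# Hypothetical pairs standing in for the (query, clarification) rows of `df`.
toy_pairs = [(f"query {i}", f"clarification {i}") for i in range(20)]
train_pairs, held_out_pairs = split_pairs(toy_pairs)
print(len(train_pairs), len(held_out_pairs))
```

With the real data, the rows of `df` could be passed as tuples and only the training portion turned into `InputExample` objects.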


# 7. Create DataLoader

In [7]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)


# 8. Load Pre-trained SBERT Model


In [8]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')


# 9. Define Loss Function for Fine-Tuning

In [9]:
from sentence_transformers import losses

train_loss = losses.MultipleNegativesRankingLoss(model)
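
To make the idea behind this loss concrete, here is a simplified pure-Python sketch, not the library's implementation. For each query in a batch, its own clarification is the positive and every other clarification in the batch serves as a negative; the loss is the cross-entropy over scaled cosine similarities. The toy 2-D vectors and the scale of 20.0 are assumptions chosen for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and all other positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)

# Toy embeddings (hypothetical values, not produced by the model).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong positive

print("aligned loss:", mnr_loss(toy_queries, aligned))
print("swapped loss:", mnr_loss(toy_queries, swapped))
```

The aligned batch gives a loss near zero and the swapped batch a much larger one, which is the signal that pulls each query toward its own clarification during training. Because negatives come from the same batch, larger batches give each query more negatives to contrast against.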


# 10. Fine-Tune the Model

In [14]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [15]:
pip install --upgrade "transformers[torch]" accelerate

Note: you may need to restart the kernel to use updated packages.


In [17]:
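model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,            # number of passes over the training pairs; adjust as needed
    warmup_steps=100,    # learning-rate warmup length, in optimizer steps
    show_progress_bar=True
)

# Save the fine-tuned model so it can be reloaded from 'fine_tuned_sbert' below
model.save('fine_tuned_sbert')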
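
The warmup length is easier to judge next to the total number of optimizer steps. A rough calculation, assuming one optimizer step per batch and the sizes reported earlier (196 examples, batch size 16, 4 epochs); the 10% warmup ratio is a common heuristic assumed here, not something this notebook specifies:

```python
import math

num_examples = 196   # from the "Prepared 196 training examples" output
batch_size = 16      # as passed to DataLoader
epochs = 4           # as passed to model.fit

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
suggested_warmup = math.ceil(0.1 * total_steps)  # assumed 10% heuristic

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("suggested warmup steps:", suggested_warmup)
```

Under these assumptions the run has about 52 steps, so `warmup_steps=100` would keep the learning rate in its warmup phase for the entire run. A smaller value may be worth trying.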



In [1]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('fine_tuned_sbert')

query = "obama family tree"
clarification = "What aspect of the Obama family tree are you asking about?"

# Encode both
query_embedding = model.encode(query, convert_to_tensor=True)
clarification_embedding = model.encode(clarification, convert_to_tensor=True)

# Calculate cosine similarity
similarity = util.pytorch_cos_sim(query_embedding, clarification_embedding)
print("Similarity Score:", similarity.item())

Similarity Score: 0.8960805535316467


In [2]:
test_query = "Tell me about the Obama family history"
clarification = "What aspect of the Obama family tree are you asking about?"

test_embedding = model.encode(test_query, convert_to_tensor=True)
clarification_embedding = model.encode(clarification, convert_to_tensor=True)

similarity = util.pytorch_cos_sim(test_embedding, clarification_embedding)
print("Similarity Score:", similarity.item())

Similarity Score: 0.7534855604171753
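
A single similarity score is hard to interpret in isolation. For conversational search, the more relevant question is whether the intended clarification outranks other candidates. The sketch below ranks candidates by cosine similarity using small hypothetical vectors in place of `model.encode(...)` outputs; `rank_candidates` is an illustrative helper, not a library function:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical 3-D embeddings standing in for encoded sentences.
query_vec = [0.8, 0.6, 0.0]
candidate_vecs = [
    [0.0, 0.0, 1.0],    # unrelated clarification
    [0.7, 0.7, 0.1],    # intended clarification
    [0.9, -0.4, 0.2],   # partially related clarification
]

ranking = rank_candidates(query_vec, candidate_vecs)
print("Candidates from best to worst:", ranking)
```

With real embeddings, the same ranking could be computed for each test query against all clarifying questions, and the position of the correct clarification recorded.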


This notebook demonstrated fine-tuning a Sentence-BERT model on a dataset of ambiguous user queries and clarifying questions. Using `MultipleNegativesRankingLoss`, we trained the model to place each query close to its clarification in embedding space.

After training, we scored two query-clarification pairs with cosine similarity, obtaining about 0.90 for the dataset topic "obama family tree" and about 0.75 for a paraphrased query. These scores suggest the fine-tuned model shows promise for conversational search systems, where handling user query ambiguity is crucial.

