# UniAssist – Retrieval System Notebook

## Purpose
This notebook implements the core retrieval mechanism of UniAssist.
It enables the system to find the most relevant answer from the dataset
based on semantic similarity between user questions and stored questions.

This notebook does not generate new text.
It only returns answers that already exist in the dataset,
so every response can be traced back to a verified entry.

---

## Why Retrieval Is Used
Pure generative models can hallucinate or guess answers.
A retrieval-based approach ensures that:
- All answers come from known data
- The system behaves predictably
- Small datasets are used effectively

---

## What This Notebook Does
✔ Converts questions into semantic representations  
✔ Compares user queries with dataset questions  
✔ Retrieves the best matching answer  
✔ Provides the retrieval core that later stages build on  

---

## What This Notebook Does NOT Do
✘ Train generative models  
✘ Paraphrase answers  
✘ Deploy any application  

Those steps are handled later.


# Stage 1

In [None]:
import os
os.listdir()


['.config', 'UniAssist_training_data.csv', 'sample_data']

In [None]:
import pandas as pd
import numpy as np


In [None]:
DATASET_PATH = "UniAssist_training_data.csv"
qa_frame = pd.read_csv(DATASET_PATH)


In [None]:
qa_frame.head()


,category_id,category_name,intent_id,question,answer
0,1,Attendance and Academic Compliance,C1_Q1,What is the minimum attendance requirement?,Students are required to maintain a minimum of...
1,1,Attendance and Academic Compliance,C1_Q1,What percentage of attendance is required to c...,Students are required to maintain a minimum of...
2,1,Attendance and Academic Compliance,C1_Q1,Is there a minimum attendance criteria for stu...,Students are required to maintain a minimum of...
3,1,Attendance and Academic Compliance,C1_Q1,How much attendance is compulsory in a semester?,Students are required to maintain a minimum of...
4,1,Attendance and Academic Compliance,C1_Q1,What is the required attendance percentage for...,Students are required to maintain a minimum of...


In [None]:
qa_frame.columns.tolist()


['category_id', 'category_name', 'intent_id', 'question', 'answer']

In [None]:
all_questions = qa_frame["question"].astype(str).tolist()
all_answers = qa_frame["answer"].astype(str).tolist()


In [None]:
len(all_questions), len(all_answers)


(1075, 1075)

In [None]:
for i in range(3):
    print(f"Q{i+1}:", all_questions[i])
    print(f"A{i+1}:", all_answers[i])
    print("-" * 50)


Q1: What is the minimum attendance requirement?
A1: Students are required to maintain a minimum of 75% overall attendance and at least 60% attendance in each subject for all programs.
--------------------------------------------------
Q2: What percentage of attendance is required to continue a course?
A2: Students are required to maintain a minimum of 75% overall attendance and at least 60% attendance in each subject for all programs.
--------------------------------------------------
Q3: Is there a minimum attendance criteria for students?
A3: Students are required to maintain a minimum of 75% overall attendance and at least 60% attendance in each subject for all programs.
--------------------------------------------------


In [None]:
retrieval_questions = all_questions.copy()
retrieval_answers = all_answers.copy()


## Stage 1 Summary — Text Preparation

In this stage:
- Questions and answers were extracted from the dataset
- Data types were standardized to strings
- The question and answer lists were confirmed to have the same length (1,075 each)
- Clean working copies were created for retrieval

This ensures that the retrieval system operates on
consistent and reliable text data.
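
Matching list lengths do not guarantee that every entry is usable. In particular, `astype(str)` turns a missing value into the literal string `"nan"`, which would then be embedded like any other question. The sketch below shows one possible extra check; the helper name `find_unusable_pairs` and the toy lists are hypothetical and are not part of the notebook.

```python
def find_unusable_pairs(questions, answers):
    """Return indices whose question or answer is blank or the string 'nan'."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")

    def is_unusable(text):
        cleaned = text.strip().lower()
        return cleaned == "" or cleaned == "nan"

    return [
        i
        for i, (q, a) in enumerate(zip(questions, answers))
        if is_unusable(q) or is_unusable(a)
    ]


# Toy data standing in for all_questions / all_answers (assumption).
toy_questions = ["first question?", "nan", "third question?"]
toy_answers = ["first answer", "second answer", "   "]

bad_rows = find_unusable_pairs(toy_questions, toy_answers)
print(bad_rows)
```

In the notebook, the same function could be called on `all_questions` and `all_answers`; an empty result would mean no blank or missing entries were found.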


# Stage 2

In [None]:
!pip install -q sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer




In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")


In [None]:
question_embeddings = embedding_model.encode(
    retrieval_questions,
    show_progress_bar=True
)


In [None]:
type(question_embeddings), question_embeddings.shape


(numpy.ndarray, (1075, 384))

In [None]:
question_embeddings[0][:10]


array([ 0.08746446,  0.03471295, -0.00994687, -0.00728209, -0.0562158 ,
        0.05762235, -0.06739643, -0.0234088 , -0.0346221 ,  0.00365131],
      dtype=float32)

## Stage 2 Summary — Semantic Embeddings

In this stage:
- A pretrained sentence embedding model was loaded
- All 1,075 questions were converted into 384-dimensional semantic vectors
- These embeddings capture meaning rather than exact wording

These vectors will be used in the next stage
to perform similarity-based retrieval of answers.
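
A common refinement, not used in this notebook, is to scale each embedding to unit length so that cosine similarity reduces to a plain dot product. The sketch below illustrates the idea on toy 2-dimensional vectors rather than real model output; `l2_normalize` is a hypothetical helper name.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return vectors / norms


# Toy vectors standing in for question_embeddings (assumption).
toy_embeddings = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit_embeddings = l2_normalize(toy_embeddings)

# With unit-length rows, one matrix-vector product gives the cosine
# similarity between row 0 and every row.
cosine_to_first = unit_embeddings @ unit_embeddings[0]
print(cosine_to_first)
```

The `encode` method in sentence-transformers also accepts `normalize_embeddings=True`, which should give unit-length vectors directly.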


# Stage 3

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


In [None]:
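def retrieve_best_answer(user_query, top_k=1):
    """
    Retrieves the most relevant answer from the dataset
    based on semantic similarity between the user query
    and stored questions.

    Returns a (best_answer, best_score) tuple. `top_k` is currently
    unused: only the single best match is returned.
    """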
    # Convert user query to embedding
    query_vector = embedding_model.encode([user_query])

    # Compute cosine similarity with all stored question embeddings
    similarity_scores = cosine_similarity(query_vector, question_embeddings)[0]

    # Get index of best matching question
    best_match_index = similarity_scores.argmax()

    # Fetch corresponding answer and score
    best_answer = retrieval_answers[best_match_index]
    best_score = similarity_scores[best_match_index]

    return best_answer, best_score


In [None]:
test_question = retrieval_questions[0]
retrieved_answer, similarity = retrieve_best_answer(test_question)

print("Test Question:")
print(test_question)
print("\nRetrieved Answer:")
print(retrieved_answer)
print("\nSimilarity Score:", similarity)


Test Question:
What is the minimum attendance requirement?

Retrieved Answer:
Students are required to maintain a minimum of 75% overall attendance and at least 60% attendance in each subject for all programs.

Similarity Score: 1.0000001


In [None]:
test_question = "Can you tell me about the attendance requirements?"
retrieved_answer, similarity = retrieve_best_answer(test_question)

print("User Question:")
print(test_question)
print("\nRetrieved Answer:")
print(retrieved_answer)
print("\nSimilarity Score:", similarity)


User Question:
Can you tell me about the attendance requirements?

Retrieved Answer:
Students are required to maintain a minimum of 75% overall attendance and at least 60% attendance in each subject for all programs.

Similarity Score: 0.87689346
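
Because `retrieve_best_answer` always returns the nearest stored answer, a query the dataset does not cover still gets a reply. One possible safeguard is to reject matches whose score falls below a cutoff. The sketch below is only an illustration: the helper name `pick_answer`, the toy scores, and the `min_score=0.6` cutoff are all assumptions, and a real cutoff would need tuning on held-out questions.

```python
import numpy as np


def pick_answer(similarity_scores, answers, min_score=0.6):
    """Return (answer, score), or (None, score) when the best score is too low."""
    scores = np.asarray(similarity_scores, dtype=float)
    best_index = int(scores.argmax())
    best_score = float(scores[best_index])
    if best_score < min_score:
        # No stored question is close enough; let the caller decide what to say.
        return None, best_score
    return answers[best_index], best_score


# Toy answers and scores (assumption); real scores would come from cosine_similarity.
toy_answers = ["answer A", "answer B", "answer C"]
confident = pick_answer([0.12, 0.88, 0.40], toy_answers)
uncertain = pick_answer([0.10, 0.25, 0.31], toy_answers)
print(confident)
print(uncertain)
```

A caller could treat a `None` answer as a signal to reply that the question is outside the dataset.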


## Stage 3 Summary — Similarity-Based Retrieval

In this stage:
- Cosine similarity was used to compare semantic embeddings
- A retrieval function was implemented to find the best match
- Answers are retrieved deterministically from the dataset
- The system now answers questions based on meaning, not keywords

This stage forms the core logic of the UniAssist system.
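
The retrieval function above returns only the single best match. Returning several ranked candidates can be sketched as follows, using NumPy alone on toy vectors. `top_k_matches` is a hypothetical helper, not the notebook's implementation, and it assumes no vector is all zeros.

```python
import numpy as np


def top_k_matches(query_vector, stored_vectors, k=3):
    """Return [(index, cosine_score), ...] for the k most similar rows, best first."""
    query = np.asarray(query_vector, dtype=float)
    stored = np.asarray(stored_vectors, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = stored @ query / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))
    k = min(k, len(scores))
    order = np.argsort(scores)[::-1][:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional vectors standing in for question_embeddings (assumption).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
matches = top_k_matches([1.0, 0.2], toy_vectors, k=2)
print(matches)
```

Each returned index can be used to look up the matching entry in `retrieval_answers`, which would let a caller show alternatives when the top scores are close.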
