# NLP-Based Chatbot Training for University Q&A

This notebook demonstrates how to build a simple NLP-based chatbot using the provided JSON data of university questions and answers. We'll use sentence embeddings and similarity search to retrieve the most relevant response for a user's query.

## Overview
- Load and preprocess the Q&A data from `Innovators_united_queries.json`
- Compute sentence embeddings using a pre-trained model
- Implement a retrieval system based on cosine similarity
- Test the chatbot with sample queries
- Evaluate the system's performance



In [2]:
# Install required libraries (comment out this line if they are already installed)
!pip install sentence-transformers scikit-learn pandas numpy flask



In [3]:
import json
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')






## Step 1: Load and Explore the Data

In [4]:
# Open the file with UTF-8 encoding and parse the JSON
with open('new queries.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data)
print(df.head())


                                           Questions  \
0  From where I can get information about the int...   
1    How can I register for an event in the college?   
2  From where I can get a duplicate ID card if I ...   
3  I lost something in college. Whom should I con...   
4  I am unable to find my bus, whom should I cont...   

                                           RESPONSES  
0  Our college has a Training and Placement Cell ...  
1  For participation, registration links or QR co...  
2  If you lose your ID card, contact your HOD. Yo...  
3  If you have lost something, inform your class ...  
4  Contact transport office.The staff will tell y...  
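Before building the DataFrame, it can help to confirm that every record carries both fields. The sketch below assumes the file is a JSON list of objects with `Questions` and `RESPONSES` keys, which is what the output above suggests but is not confirmed. The `validate_records` helper and the inline sample are hypothetical.

```python
import json

REQUIRED_KEYS = ("Questions", "RESPONSES")

def validate_records(records):
    """Return only records that have non-empty string values for both keys."""
    valid = []
    for record in records:
        if all(isinstance(record.get(key), str) and record[key].strip()
               for key in REQUIRED_KEYS):
            valid.append(record)
    return valid

# Hypothetical inline sample standing in for the real file contents.
sample_json = '''[
  {"Questions": "How can I register for an event in the college?",
   "RESPONSES": "Placeholder answer text."},
  {"Questions": "", "RESPONSES": "Answer with an empty question"},
  {"Questions": "Question with no answer field"}
]'''

records = json.loads(sample_json)
clean_records = validate_records(records)
print(f"Kept {len(clean_records)} of {len(records)} records")
```

Dropping incomplete records early avoids encoding empty strings, which would otherwise still receive an embedding and could be returned as a match.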


## Step 2: Preprocess the Data

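We'll clean the text by converting it to lowercase and stripping leading and trailing whitespace. Note that the responses are lowercased as well, so the answers shown to users lose their original capitalization.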

In [5]:
def preprocess_text(text):
    return text.lower().strip()

# Apply preprocessing
df['Questions'] = df['Questions'].apply(preprocess_text)
df['RESPONSES'] = df['RESPONSES'].apply(preprocess_text)

print("Data after preprocessing:")
print(df.head())

Data after preprocessing:
                                           Questions  \
0  from where i can get information about the int...   
1    how can i register for an event in the college?   
2  from where i can get a duplicate id card if i ...   
3  i lost something in college. whom should i con...   
4  i am unable to find my bus, whom should i cont...   

                                           RESPONSES  
0  our college has a training and placement cell ...  
1  for participation, registration links or qr co...  
2  if you lose your id card, contact your hod. yo...  
3  if you have lost something, inform your class ...  
4  contact transport office.the staff will tell y...  
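The `preprocess_text` function above only trims the ends of each string, so repeated spaces, tabs, or newlines inside a sentence are kept. A minimal sketch of a variant that also collapses internal whitespace follows; the name `normalize_text` is hypothetical and not part of the notebook.

```python
import re

def normalize_text(text):
    """Lowercase, trim, and collapse internal whitespace runs to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize_text("  Where   is the\tLibrary?\n"))
```

Applying a function like this to the `Questions` column only would leave the `RESPONSES` text in its original casing for display, if that is preferred.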


## Step 3: Compute Sentence Embeddings

We'll use a pre-trained SentenceTransformer model to convert questions into vector embeddings.

In [6]:
# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Compute embeddings for all questions
question_embeddings = model.encode(df['Questions'].tolist())
print(f"Embeddings shape: {question_embeddings.shape}")

Embeddings shape: (1383, 384)
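Each of the 1383 questions is now a 384-dimensional vector. One optional step, sketched here with toy vectors so it runs without downloading the model, is to scale every row to unit length so that cosine similarity reduces to a dot product. The helper `l2_normalize` is hypothetical. `SentenceTransformer.encode` also accepts a `normalize_embeddings` argument that should have a similar effect, though it is not used in this notebook.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.where(norms == 0, 1.0, norms)

# Toy 3-dimensional vectors standing in for the real 384-dimensional embeddings.
toy_embeddings = np.array([[3.0, 4.0, 0.0],
                           [0.0, 0.0, 2.0],
                           [1.0, 1.0, 1.0]])
unit_embeddings = l2_normalize(toy_embeddings)

# With unit-length rows, cosine similarity is a plain matrix product.
cosine_matrix = unit_embeddings @ unit_embeddings.T
print(np.round(cosine_matrix, 3))
```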


## Step 4: Implement the Chatbot Response Function

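This function takes a user query, computes its embedding, and finds the most similar stored question using cosine similarity. If the best similarity falls below a threshold (0.45 here), it returns a default answer instead of a stored response.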

In [7]:
# Default answer that guides users when no stored question is similar enough
default_answer = "I don't have specific information about that query. As InVIE, I can help with university-related questions about exams, bus timings, library, college timing, and other campus facilities. Could you try rephrasing your question or ask about one of these topics?"

# Return the top-k responses, or the default answer when similarity is too low
def get_chatbot_response(query, top_k=1):
    # Preprocess the query
    query = preprocess_text(query)
    
    # Compute embedding for the query
    query_embedding = model.encode([query])
    
    # Calculate cosine similarities
    similarities = cosine_similarity(query_embedding, question_embeddings)[0]
    
    # Get the top-k most similar questions
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    # Get the maximum similarity score
    max_sim = similarities[top_indices[0]]
    
    # Check if max similarity is below threshold
    threshold = 0.45  # Adjust this threshold based on testing
    if max_sim < threshold:
        return [(default_answer, max_sim)]
    
    # Return the corresponding responses
    responses = df.iloc[top_indices]['RESPONSES'].tolist()
    similarities_scores = similarities[top_indices]
    
    return list(zip(responses, similarities_scores))
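The ranking and fallback logic can be tried in isolation with toy vectors, without loading the model. The sketch below is an illustration rather than the notebook's function: `retrieve`, its `fallback` text, and the placeholder answers are hypothetical, and the threshold is exposed as a parameter instead of being fixed inside the function body.

```python
import numpy as np

def retrieve(query_vec, index_vecs, answers, top_k=1, threshold=0.45,
             fallback="Sorry, I don't have an answer for that."):
    """Return up to top_k (answer, score) pairs, or the fallback if the best score is low."""
    # Cosine similarity between the query and every indexed vector.
    query_unit = query_vec / np.linalg.norm(query_vec)
    index_unit = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = index_unit @ query_unit

    # Indices of the highest scores, best first.
    order = np.argsort(scores)[::-1][:top_k]
    if scores[order[0]] < threshold:
        return [(fallback, float(scores[order[0]]))]
    return [(answers[i], float(scores[i])) for i in order]

# Toy 2-dimensional index; the answers are placeholders.
index_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
answers = ["answer A", "answer B", "answer C"]

print(retrieve(np.array([0.9, 0.1]), index_vecs, answers, top_k=2))
print(retrieve(np.array([-1.0, -1.0]), index_vecs, answers))
```

The first query points mostly along the first axis, so "answer A" ranks first and "answer C" second. The second query points away from every indexed vector, so its best score is below the threshold and the fallback is returned.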

## Step 5: Test the Chatbot

Let's test the chatbot with some sample queries.

In [8]:
# Test queries
test_queries = [
    "examination fee",
    "what is my bus timming",
    "library kha h?",
    "m ky kru",
    "how to apply for scholarships",
]

for query in test_queries:
    print(f"Query: {query}")
    responses = get_chatbot_response(query)
    print(f"Response: {responses[0][0]}")
    print(f"Similarity: {responses[0][1]:.4f}")
    print("-" * 50)

Query: examination fee
Response: you can visit the official website of university https://www.invertisuniversity.ac.in/admission/fee-structure for fee structure of any course.
Similarity: 0.5490
--------------------------------------------------
Query: what is my bus timming
Response: it differs from one stop to another.you need to contact transport incharge for the timings.
Similarity: 0.5244
--------------------------------------------------
Query: library kha h?
Response: there are 2 libraries in the campus. one is in academic block-1 and the other one is in academic block-3.
Similarity: 0.5523
--------------------------------------------------
Query: m ky kru
Response: I don't have specific information about that query. As InVIE, I can help with university-related questions about exams, bus timings, library, college timing, and other campus facilities. Could you try rephrasing your question or ask about one of these topics?
Similarity: 0.3671
---------------------------------------
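The scores printed above give a small starting point for checking the 0.45 threshold. The sketch below counts, for a few candidate thresholds, how many queries would be wrongly rejected or wrongly accepted. The scores are copied from the run above, but the in-scope flags are hand-assigned assumptions for illustration, and `threshold_errors` is a hypothetical helper. Four queries are far too few to tune on; a real check would need a larger labeled set.

```python
# (query, top similarity from the run above, assumed in-scope flag)
scored_queries = [
    ("examination fee", 0.5490, True),
    ("what is my bus timming", 0.5244, True),
    ("library kha h?", 0.5523, True),
    ("m ky kru", 0.3671, False),
]

def threshold_errors(samples, threshold):
    """Count in-scope queries rejected and out-of-scope queries accepted."""
    false_rejects = sum(1 for _, score, in_scope in samples
                        if in_scope and score < threshold)
    false_accepts = sum(1 for _, score, in_scope in samples
                        if not in_scope and score >= threshold)
    return false_rejects, false_accepts

for candidate in (0.35, 0.45, 0.55):
    rejects, accepts = threshold_errors(scored_queries, candidate)
    print(f"threshold={candidate:.2f} false_rejects={rejects} false_accepts={accepts}")
```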

## Step 6: Evaluate the Model

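We'll perform a simple sanity check by querying with questions that are already in the index and checking that the top response matches the stored one. Because the queries are drawn from the indexed data, a high score confirms that retrieval works but says little about how the chatbot handles unseen or rephrased questions.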

In [9]:
# Simple evaluation: Check if the top response matches the expected one
def evaluate_model(test_data):
    correct = 0
    total = len(test_data)
    
    for question, expected_response in test_data:
        response, _ = get_chatbot_response(question)[0]
        if response == expected_response:
            correct += 1
    
    accuracy = correct / total
    return accuracy

# Use the first 20 indexed pairs as a sanity-check set (these questions are already in the index)
test_data = list(zip(df['Questions'][:20], df['RESPONSES'][:20]))
accuracy = evaluate_model(test_data)
print(f"Model Accuracy on test set: {accuracy:.2%}")

Model Accuracy on test set: 100.00%
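Since the check above queries with questions that are already indexed, each one matches itself with a similarity of 1. A stricter estimate, sketched here with toy data, is leave-one-out: each question is matched only against the other questions and counted as correct if its nearest neighbor shares the same response. This assumes the dataset holds several phrasings per response, which is not confirmed by the notebook. `leave_one_out_accuracy` and the toy values are hypothetical.

```python
import numpy as np

def leave_one_out_accuracy(embeddings, responses):
    """Share of questions whose nearest *other* question has the same response."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Exclude each question's match with itself.
    np.fill_diagonal(sims, -np.inf)
    nearest = sims.argmax(axis=1)
    hits = sum(responses[i] == responses[j] for i, j in enumerate(nearest))
    return hits / len(responses)

# Toy 2-dimensional embeddings: two "bus" phrasings, two "library", one "fees".
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2],
                           [0.1, 1.0], [0.2, 0.9],
                           [1.0, 1.0]])
toy_responses = ["bus", "bus", "library", "library", "fees"]

loo_accuracy = leave_one_out_accuracy(toy_embeddings, toy_responses)
print(f"Leave-one-out accuracy: {loo_accuracy:.2%}")
```

In the toy data the single "fees" question has no paraphrase to match, so it counts as a miss. That is the kind of gap the in-index check cannot reveal.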


## Step 7: Save the Model and Data

To deploy the chatbot, we can save the embeddings and data for later use.

In [10]:
# Save embeddings and data
np.save('question_embeddings.npy', question_embeddings)
df.to_csv('processed_qa_data.csv', index=False)

print("Model and data saved successfully!")

Model and data saved successfully!


## Conclusion

This notebook demonstrates a basic NLP-based chatbot using sentence embeddings and similarity search. The system can be improved by:
- Using larger or fine-tuned embedding models (e.g., `all-mpnet-base-v2`)
- Implementing intent classification
- Adding more sophisticated preprocessing
- Integrating with a web interface for real-time interaction

To use this chatbot in production, load the saved embeddings and data, then use the `get_chatbot_response` function.
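A minimal sketch of that loading step follows, assuming the two files written in Step 7. The helper `load_artifacts` is hypothetical, and the demo writes toy files to a temporary directory rather than touching the real ones.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def load_artifacts(embeddings_path, data_path):
    """Load saved embeddings and Q&A rows, checking that they line up."""
    embeddings = np.load(embeddings_path)
    qa = pd.read_csv(data_path)
    if len(qa) != embeddings.shape[0]:
        raise ValueError(
            f"{len(qa)} rows in {data_path} but {embeddings.shape[0]} embeddings")
    return embeddings, qa

# Round-trip demo with toy data in a temporary directory (not the real files).
with tempfile.TemporaryDirectory() as tmp:
    emb_path = os.path.join(tmp, "question_embeddings.npy")
    csv_path = os.path.join(tmp, "processed_qa_data.csv")
    np.save(emb_path, np.zeros((2, 384), dtype=np.float32))
    pd.DataFrame({"Questions": ["q1", "q2"],
                  "RESPONSES": ["a1", "a2"]}).to_csv(csv_path, index=False)
    embeddings, qa = load_artifacts(emb_path, csv_path)

print(embeddings.shape, list(qa.columns))
```

Only the embeddings and the Q&A table are saved in Step 7, so the `SentenceTransformer('all-MiniLM-L6-v2')` model still has to be loaded again to encode incoming queries.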