# CTSE Lecture Notes Chatbot (Enhanced)
## SE4010 AI/ML Assignment
This Jupyter Notebook implements a Retrieval-Augmented Generation (RAG) chatbot that answers questions from CTSE lecture notes. Enhancements include a persistent FAISS vector store, separator-aware chunking with overlap, `all-mpnet-base-v2` sentence embeddings, error handling with logging, and an interactive question loop.

## 1. Import Libraries

In [1]:
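# Recent LangChain releases ship these integrations in separate packages
# (langchain-community, langchain-text-splitters, langchain-huggingface)
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaLLM
from langchain.chains import RetrievalQA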
from tqdm import tqdm
import os
import logging

# Set up logging for debugging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

## 2. Load and Process Lecture Notes

In [2]:
# Check if lecture_notes folder exists
notes_dir = './lecture_notes/'
if not os.path.exists(notes_dir):
    logging.error(f"Directory {notes_dir} not found. Please create it and add PDF files.")
    raise FileNotFoundError(f"Directory {notes_dir} not found.")

# Load every PDF in the folder (the loader returns one document per page)
logging.info("Loading PDF files...")
loader = PyPDFDirectoryLoader(notes_dir)
try:
    documents = loader.load()
except Exception as e:
    logging.error(f"Error loading PDFs: {e}")
    raise

# Fail early, outside the try block, so the error is logged only once
if not documents:
    logging.error("No documents loaded. Ensure PDFs are text-based and not empty.")
    raise ValueError("No documents loaded.")

# Separator-aware chunking: split on paragraphs first, then lines, sentences, and words
logging.info("Splitting documents into chunks...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Maximum chunk length in characters
    chunk_overlap=100,  # Characters shared between neighbouring chunks to preserve context
    separators=["\n\n", "\n", ".", " ", ""]  # Tried in order, from coarsest to finest
)
chunks = text_splitter.split_documents(documents)
logging.info(f"Created {len(chunks)} chunks.")

2025-05-09 03:12:43,488 - INFO - Loading PDF files...
2025-05-09 03:12:45,027 - INFO - Splitting documents into chunks...
2025-05-09 03:12:45,045 - INFO - Created 124 chunks.
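
To make the effect of `chunk_size` and `chunk_overlap` concrete, the following is a simplified sketch of overlapping windows in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to cut at the listed separators, so its chunks are usually shorter than `chunk_size` and the overlap is approximate. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample standing in for one page of lecture notes
sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])
print(pieces[0][-100:] == pieces[1][:100])  # neighbouring chunks share 100 characters
```

With a 500-character window and a 100-character overlap, each window starts 400 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.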


## 3. Set Up Embeddings and Vector Store

In [3]:
# Reuse a saved FAISS index if one exists; otherwise build it from the chunks and persist it.
# Delete ./faiss_index to force a rebuild after adding or changing PDFs.
faiss_index_path = './faiss_index'
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
if os.path.exists(faiss_index_path):
    logging.info("Loading existing FAISS index...")
    # The saved index includes a pickle file, so only load indexes created by this notebook
    vector_store = FAISS.load_local(faiss_index_path, embeddings, allow_dangerous_deserialization=True)
else:
    logging.info("Creating new FAISS index...")
    vector_store = FAISS.from_documents(chunks, embeddings)
    vector_store.save_local(faiss_index_path)
    logging.info(f"Saved FAISS index to {faiss_index_path}")

2025-05-09 03:13:07,965 - INFO - Creating new FAISS index...
2025-05-09 03:13:40,236 - INFO - Use pytorch device_name: cpu
2025-05-09 03:13:40,237 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


2025-05-09 03:14:56,854 - INFO - Loading faiss with AVX512 support.
2025-05-09 03:14:56,870 - INFO - Could not load library with AVX512 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx512'")
2025-05-09 03:14:56,871 - INFO - Loading faiss with AVX2 support.
2025-05-09 03:14:57,321 - INFO - Successfully loaded faiss with AVX2 support.
2025-05-09 03:14:57,345 - INFO - Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes. This is only an error if you're trying to use GPU Faiss.
2025-05-09 03:14:57,369 - INFO - Saved FAISS index to ./faiss_index
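
As an illustration of what the vector store does at query time, here is a brute-force nearest-neighbour sketch in plain Python. It assumes exact Euclidean (L2) search, which to my understanding is the default for the LangChain FAISS wrapper; the three-dimensional vectors are toy values, whereas `all-mpnet-base-v2` produces 768-dimensional embeddings. The function name `top_k_by_l2` is hypothetical.

```python
import math


def top_k_by_l2(query, vectors, k=3):
    """Return the indices of the k vectors closest to the query (smallest L2 distance)."""
    scored = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first; ties fall back to the lower index
    return [idx for _, idx in scored[:k]]


# Toy embeddings standing in for four chunks (assumed values, for illustration only)
toy_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
    [0.0, 0.0, 1.0],  # chunk 3
]
toy_query = [1.0, 0.0, 0.0]

print(top_k_by_l2(toy_query, toy_vectors, k=2))
```

FAISS performs the same comparison with optimized native code, which is why it stays fast as the number of chunks grows.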


## 4. Set Up Ollama LLM

In [4]:
# Initialize Ollama LLM (Mistral)
logging.info("Initializing Ollama LLM...")
try:
    llm = OllamaLLM(model="mistral", temperature=0.3)  # Lower temperature for precise answers
except Exception as e:
    logging.error(f"Error initializing Ollama: {e}")
    raise

2025-05-09 03:15:16,189 - INFO - Initializing Ollama LLM...


## 5. Build RAG Pipeline with Custom Prompt

In [9]:
from langchain.prompts import PromptTemplate

# Custom prompt for better answer quality
prompt_template = """
You are a knowledgeable assistant for Current Trends in Software Engineering (CTSE). Answer the following question based solely on the provided lecture notes. Provide a clear, concise, and accurate response. If the information is not available, say so.

Context: {context}
Question: {question}

Answer:
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])

# Create RAG chain
logging.info("Building RAG pipeline...")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),  # Retrieve top 3 chunks
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

2025-05-09 03:16:05,995 - INFO - Building RAG pipeline...
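
The `"stuff"` chain type places all retrieved chunks into the `{context}` slot of a single prompt. The sketch below imitates that step with plain string formatting, assuming the chunks are joined by blank lines (which I believe is the chain's default separator). The names `build_stuff_prompt` and `PROMPT_SKELETON` and the sample chunks are hypothetical, and the template is shortened from the one in the cell above.

```python
# Shortened version of the notebook's prompt, keeping only the two placeholders
PROMPT_SKELETON = "Context: {context}\nQuestion: {question}\n\nAnswer:"


def build_stuff_prompt(chunks, question, template=PROMPT_SKELETON):
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return template.format(context=context, question=question)


# Hypothetical retrieved chunks (k=3 in the notebook; two shown here)
retrieved = [
    "The Map phase emits <key, value> pairs.",
    "The Reduce phase aggregates the values for each key.",
]
print(build_stuff_prompt(retrieved, "What are the MapReduce phases?"))
```

Because every chunk lands in one prompt, `k` and `chunk_size` together bound the prompt length: three chunks of at most 500 characters stay well within the context window of a small local model.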


## 6. Interactive Chatbot Interface

In [10]:
# Ask one question; return the answer and a 200-character preview of the top-ranked source chunk
def ask_question(question):
    try:
        # invoke() replaces the deprecated direct call qa_chain({...})
        result = qa_chain.invoke({"query": question})
        answer = result["result"].strip()
        sources = result["source_documents"]
        source = sources[0].page_content[:200] + "..." if sources else "No source found."
        logging.info(f"Question: {question}")
        logging.info(f"Answer: {answer}")
        return answer, source
    except Exception as e:
        logging.error(f"Error processing question: {e}")
        return "Error: Unable to process question.", ""

# Interactive loop (run this cell to interact)
print("CTSE Chatbot: Ask a question about the lecture notes (type 'exit' to quit)")
while True:
    question = input("Question: ").strip()
    if not question:
        continue  # Ignore empty input instead of querying the model
    if question.lower() == 'exit':
        print("Exiting chatbot.")
        break
    answer, source = ask_question(question)
    print(f"Answer: {answer}")
    print(f"Source: {source}\n")

CTSE Chatbot: Ask a question about the lecture notes (type 'exit' to quit)
Exiting chatbot.
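
`ask_question` shows only the first 200 characters of the top-ranked chunk. A possible extension is to list every retrieved chunk with its file name and page. The sketch below uses plain dictionaries in place of LangChain `Document` objects and assumes the `source` and `page` metadata keys that the PyPDF loader attaches (with `page` counted from zero). The function name `format_sources` and the file names are hypothetical.

```python
import os


def format_sources(docs, preview_chars=200):
    """Build one citation line per retrieved chunk: rank, file name, page, and a short preview."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        meta = doc.get("metadata", {})
        name = os.path.basename(meta.get("source", "unknown"))
        page = meta.get("page")
        # Convert the zero-based page index to a human-readable page number
        page_label = f"p. {page + 1}" if isinstance(page, int) else "page unknown"
        # Collapse line breaks and repeated spaces so the preview fits on one line
        preview = " ".join(doc.get("page_content", "").split())[:preview_chars]
        lines.append(f"[{rank}] {name} ({page_label}): {preview}")
    return lines


# Hypothetical retrieved chunks, shaped like Document.page_content / Document.metadata
toy_docs = [
    {"page_content": "SLIIT  -Faculty of Computing\nSubject Name",
     "metadata": {"source": "./lecture_notes/lecture1.pdf", "page": 0}},
    {"page_content": "MapReduce phases", "metadata": {"source": "lecture2.pdf"}},
]
for line in format_sources(toy_docs):
    print(line)
```

With real `Document` objects, the same idea applies by reading `doc.page_content` and `doc.metadata` as attributes instead of dictionary keys.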


## 7. Test with Example Questions

In [11]:
# Test predefined questions
test_questions = [
    "What is the main topic of Lecture 1?",
    "Explain the concept from the lecture 2 notes.",
    "Summarize the key points of Lecture 3."
]

for question in tqdm(test_questions, desc="Testing questions"):
    answer, source = ask_question(question)
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f"Source: {source}\n")

2025-05-09 03:19:42,295 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
2025-05-09 03:21:28,946 - INFO - Question: What is the main topic of Lecture 1?
Testing questions:  33%|███▎      | 1/3 [02:15<04:31, 135.85s/it]

Question: What is the main topic of Lecture 1?
Answer: The main topic of Lecture 1, based on the provided lecture notes, cannot be definitively determined as the information provided does not specify the subject or course name. However, the numbers "4 V’s" and "11" or "23" could potentially refer to concepts within the lecture, but without further context, it is impossible to accurately interpret their meaning.
Source: SLIIT  -Faculty of Computing
Subject Name
4 V’s
11...



2025-05-09 03:21:36,589 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
2025-05-09 03:24:28,382 - INFO - Question: Explain the concept from the lecture 2 notes.

Question: Explain the concept from the lecture 2 notes.
Answer: The concept discussed in the lecture 2 notes pertains to MapReduce phases, which are fundamental steps in the MapReduce programming model used for processing large datasets in a distributed manner.

The Map phase takes input data and applies a mapping function to it, producing output values in the form of <key, value> pairs. In the given example, each word from the input splits is considered as a key, and its frequency of occurrence is the corresponding value.

Following the Map phase, the Shuffle phase consolidates the relevant records from the Map phase output. The same words are grouped together with their respective frequencies.

Finally, the Reduce phase aggregates the output values from the Shuffle phase. This phase combines the values for each key (word in this case) and returns a single output value that represents the summary or total count of that particular word across the entire dataset. In short, the Reduce ph

2025-05-09 03:24:37,343 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
2025-05-09 03:31:14,573 - INFO - Question: Summarize the key points of Lecture 3.

Question: Summarize the key points of Lecture 3.
Answer: Lecture 3, Introduction to Big Data at SLIIT - Faculty of Computing, focuses on the Four V's of Big Data: Volume, Velocity, Variety, and Veracity.

1. **Volume**: This refers to the large amount of data generated every second by various sources such as social media, sensors, and transactions. The challenge lies in storing and managing this vast amount of data efficiently.

2. **Velocity**: This aspect emphasizes the speed at which new data is being produced. It can be real-time data streams or historical data that needs to be processed quickly for immediate insights or analysis.

3. **Variety**: Big Data comes in various formats, including structured (e.g., databases), semi-structured (e.g., XML, JSON), and unstructured data (e.g., text, images, videos). Handling this diversity is crucial for effective data processing.

4. **Veracity**: This V refers to the quality or truthfulness of the data. Big Data may contain errors, inconsi




#Rusith

In [13]:
import ipywidgets as widgets
widgets.IntSlider()

IntSlider(value=0)