# Small Language Model (SLM) for Book-Based Question Answering

## 1. Introduction
This project implements a Small Language Model (SLM) designed to answer questions based on the content of a given book. It uses a combination of **text retrieval** and **sequence-to-sequence learning** to extract relevant information and generate meaningful responses.

## 2. Approach
The approach follows a **Retrieval-Augmented Generation (RAG)** framework:
1. **Text Preprocessing** - The book text is split into manageable chunks.
2. **Embedding & Retrieval** - We encode these chunks using a sentence transformer and store them in a FAISS index for efficient retrieval.
3. **Answer Generation** - Using a T5 model, we generate answers based on the retrieved relevant text.

## 3. Model Architecture
- **Sentence Transformer** (all-MiniLM-L6-v2): Used for encoding book text into dense vector representations for retrieval.
- **FAISS**: A fast similarity search library to retrieve relevant text.
- **T5-Small**: A transformer-based sequence-to-sequence model that generates answers based on retrieved context.

## 4. Preprocessing Techniques
1. **Text Tokenization & Chunking**
   - The book text is split into sentences with `nltk.sent_tokenize`, and consecutive sentences are grouped into chunks of roughly 200 words.
   - Each chunk is stored as a candidate context for answering queries.
2. **Embedding Generation**
   - Each text chunk is passed through `all-MiniLM-L6-v2`, and the embeddings are stored in a FAISS index for retrieval.


## 5. Code Explanation

### **1. Import necessary libraries**

In [2]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoModel
import faiss
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize

In [3]:
# Ensure nltk dependencies are available
# (newer NLTK releases may require nltk.download('punkt_tab') instead)
nltk.download('punkt')


### **2. Loading Pre-trained Models**

In [4]:
# Load pre-trained tokenizers and models (all-MiniLM-L6-v2 for embeddings, T5-Small for generation)
retrieval_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
generation_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
generation_tokenizer = AutoTokenizer.from_pretrained("t5-small")

- **Sentence Transformer**: Encodes book text.
- **T5-Small**: Generates answers.

### **3. Chunking the Book Text**

In [5]:
# Function to split book text into chunks
def chunk_text(text, chunk_size=200):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0
    
    for sentence in sentences:
        current_length += len(sentence.split())
        current_chunk.append(sentence)
        if current_length >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
    
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    
    return chunks

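- Groups whole sentences into chunks of at least `chunk_size` words (200 by default). A chunk can run past the limit because sentences are never split, and the final chunk may be shorter.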
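
Because chunks end at hard boundaries, a passage that straddles two chunks can be split across them. One common variation is to repeat the last sentence or two of each chunk at the start of the next. The sketch below is one possible way to do that, not part of the notebook: `chunk_with_overlap` and `overlap_sentences` are hypothetical names, and the function takes an already-split list of sentences so it does not depend on NLTK.

```python
def chunk_with_overlap(sentences, chunk_size=200, overlap_sentences=1):
    """Group sentences into chunks of roughly `chunk_size` words, repeating
    the last `overlap_sentences` sentences at the start of the next chunk."""
    chunks = []
    current = []         # sentences in the chunk being built
    current_length = 0   # word count of `current`
    fresh = 0            # sentences added since the last flush

    for sentence in sentences:
        current.append(sentence)
        current_length += len(sentence.split())
        fresh += 1
        if current_length >= chunk_size:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of this one
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_length = sum(len(s.split()) for s in current)
            fresh = 0

    # Flush the remainder only if it holds sentences not yet emitted
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

In the notebook it could be called as `chunk_with_overlap(sent_tokenize(book_text))`. Overlap increases the number of chunks and repeats some text, so its effect on retrieval quality would need to be checked on real questions.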

### **4. Encoding Text into Embeddings**

In [6]:
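# Function to encode text into embeddings
def encode_text(text_list, model, tokenizer):
    # Tokenize the batch; inputs longer than the tokenizer's limit are truncated
    inputs = tokenizer(text_list, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
        if hasattr(outputs, 'last_hidden_state'):
            # Use the hidden state of the first ([CLS]) token as the embedding
            embeddings = outputs.last_hidden_state[:, 0, :]
        else:
            embeddings = outputs.pooler_output
    # FAISS expects float32 NumPy arrays
    return embeddings.cpu().numpy()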

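- Converts book chunks into dense vectors by taking the hidden state of the first (`[CLS]`) token of each chunk. All chunks are encoded in a single batch, so a very long book may need to be encoded in smaller batches to fit in memory.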
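
The model card for `all-MiniLM-L6-v2` pools token embeddings by averaging them under the attention mask and then L2-normalizing the result, rather than taking the first token alone. The NumPy sketch below shows that pooling step in isolation. `mean_pool` and `l2_normalize` are hypothetical helper names, not part of the notebook, and whether this pooling improves retrieval for this project has not been tested here.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    token_embeddings: array of shape (batch, tokens, dim)
    attention_mask:   array of shape (batch, tokens), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)
```

Inside `encode_text`, the inputs could be `outputs.last_hidden_state.numpy()` and `inputs["attention_mask"].numpy()`. With unit-length vectors, ranking by L2 distance gives the same order as ranking by cosine similarity.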

### **5. Building FAISS Index**

In [7]:
# Function to build FAISS index
def build_faiss_index(chunks, model, tokenizer):
    embeddings = encode_text(chunks, model, tokenizer)
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return index, embeddings

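- `IndexFlatL2` stores the raw vectors and performs exact (brute-force) nearest-neighbor search by L2 distance, which is adequate for the number of chunks in a single book.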
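
Conceptually, a flat L2 index compares the query with every stored vector and keeps the closest ones. The pure-Python sketch below mimics that behavior on plain lists to show what a search returns. `flat_l2_search` is a hypothetical name used only for illustration; FAISS does the same comparison in optimized native code. As in FAISS, the reported distances are squared Euclidean distances.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, top_k=3):
    """Return (distances, indices) of the top_k stored vectors nearest to query."""
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    nearest = heapq.nsmallest(top_k, scored)  # sorted, closest first
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    return distances, indices
```

One difference from FAISS: when fewer than `top_k` vectors are stored, this sketch returns a shorter list, whereas FAISS pads the missing positions with index `-1`.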

### **6. Retrieving Relevant Text for a Query**

In [8]:
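# Function to retrieve relevant text
def retrieve_text(query, chunks, index, model, tokenizer, top_k=3):
    query_embedding = encode_text([query], model, tokenizer)
    distances, indices = index.search(query_embedding, top_k)
    # FAISS pads missing results with -1 when the index holds fewer than
    # top_k vectors; skip those so chunks[-1] is not selected by accident
    return [chunks[i] for i in indices[0] if i != -1]

- Retrieves the `top_k` chunks closest to the question embedding (3 by default), or fewer when the book yields fewer chunks.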

### **7. Answer Generation using T5**

In [9]:
# Function to generate answers using T5 model
def generate_answer(context, question, model, tokenizer):
    input_text = f"question: {question} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

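- Formats the input as `question: ... context: ...`, the prompt format T5 was trained on for SQuAD-style question answering, and decodes the first generated sequence. `generate` is called with default settings, so a long input is truncated to the tokenizer's maximum length and the answer is limited by the default generation length, which can cut a longer answer short.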
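
Since the combined context can exceed what the generator accepts, one option is to trim it deliberately before building the prompt instead of relying on tokenizer truncation. The sketch below keeps the best-ranked chunks until a word budget is used up. `build_prompt` and `max_context_words` are hypothetical names, the default of 300 is an arbitrary illustration, and word count is only a rough proxy for the token count the model actually sees.

```python
def build_prompt(question, chunks, max_context_words=300):
    """Join ranked chunks until a word budget is reached, then format
    the result as a T5-style question/context prompt."""
    selected = []
    used = 0
    for chunk in chunks:  # assumed to be ordered best match first
        remaining = max_context_words - used
        if remaining <= 0:
            break
        words = chunk.split()
        selected.append(" ".join(words[:remaining]))  # trim the last chunk if needed
        used += min(len(words), remaining)
    context = " ".join(selected)
    return f"question: {question} context: {context}"
```

Trimming from the end favors the closest matches, on the assumption that the retriever returns chunks in order of increasing distance.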

### **8. Main function**

In [10]:
# Main function: chunk the book text, index it, retrieve context, and generate an answer
# Note: the chunks and the FAISS index are rebuilt on every call
def answer_question(book_text, question):
    chunks = chunk_text(book_text)
    index, _ = build_faiss_index(chunks, retrieval_model, tokenizer)
    relevant_text = retrieve_text(question, chunks, index, retrieval_model, tokenizer)
    combined_context = " ".join(relevant_text)
    return generate_answer(combined_context, question, generation_model, generation_tokenizer)
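
Because `answer_question` chunks and indexes the book on every call, each question repeats the most expensive step. A possible restructuring is to prepare the book once and reuse the result. The sketch below shows the idea with the pipeline steps passed in as callables so it stays self-contained; `BookIndex` and its parameter names are hypothetical and not part of the notebook.

```python
class BookIndex:
    """Chunk and index a book once, then answer many questions against it."""

    def __init__(self, book_text, chunker, index_builder):
        self.chunks = chunker(book_text)
        self.index = index_builder(self.chunks)  # built a single time

    def answer(self, question, retriever, generator):
        # Only retrieval and generation run per question
        relevant = retriever(question, self.chunks, self.index)
        return generator(" ".join(relevant), question)
```

With the notebook's functions, `chunker` could be `chunk_text`, `index_builder` a small wrapper around `build_faiss_index`, `retriever` a wrapper around `retrieve_text`, and `generator` a wrapper around `generate_answer`.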


### **9. Interactive User Input**

In [None]:
# Interactive user input
if __name__ == "__main__":
    book_text = input("Enter the book text: ")
    while True:
        question = input("Ask a question (or type 'exit' to quit): ")
        if question.lower() == 'exit':
            break
        answer = answer_question(book_text, question)
        print("Answer:", answer)

Enter the book text:  n a distant kingdom surrounded by towering mountains and vast rivers, there existed an ancient city known for its wisdom and prosperity. The city was home to scholars, inventors, and artists who dedicated their lives to knowledge and creativity. One day, a young boy named Elias discovered an old manuscript hidden deep within the grand library. The manuscript spoke of a legendary artifact—an enchanted compass said to guide its bearer to the lost city of Eldoria, where the secrets of the universe were kept. Driven by curiosity and a thirst for adventure, Elias set out on a journey filled with challenges, encountering mystical creatures, deciphering ancient riddles, and braving perilous landscapes. Along the way, he met companions who shared his quest, each possessing unique skills that contributed to the adventure. As he drew closer to Eldoria, he realized that the journey itself was a test of wisdom and courage. The true treasure was not the secrets hidden in Eldor

Answer: ancient manuscript


Ask a question (or type 'exit' to quit):  What challenges did he face?


Answer: encountering mystical creatures, deciphering ancient riddles, and braving peril


Ask a question (or type 'exit' to quit):  Who did Elias meet on his journey?


Answer: companions


Ask a question (or type 'exit' to quit):  What was the real treasure of the journey?


Answer: knowledge and strength


## 6. Observations & Learnings
### **Key Takeaways:**
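- **FAISS** retrieves the nearest chunks with an exact `IndexFlatL2` search, which is fast at the scale of a single book.
- **T5-Small** is efficient for text generation but may require fine-tuning for better domain adaptation; in the sample run its answers are short phrases lifted from the context.
- **Chunking strategy** affects retrieval quality: larger chunks carry more surrounding context, but their embeddings are less specific and the combined context is more likely to exceed the generator's input limit.
- **Interactive design**: the question loop in Jupyter makes it easy to ask several questions about the same text in one session.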

### **Future Improvements:**
- Fine-tune `T5-Small` on a custom dataset for improved accuracy.
- Experiment with larger models like `T5-Base` for better answer quality.
- Implement **metadata filtering** to improve retrieval precision.
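
As an illustration of the metadata filtering idea, the sketch below attaches a chapter number to each chunk and keeps only retrieved chunks from the requested chapters. Everything here is an assumption made for the example: the `Chunk` class, the `chapter` field, and `filter_by_chapter` do not exist in the notebook, and the filter is applied after retrieval, so the retriever would need to return more candidates than are finally used.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    chapter: int  # hypothetical metadata field

def filter_by_chapter(ranked_chunks, chapters, top_k=3):
    """Keep the best-ranked chunks whose chapter is in `chapters`.

    ranked_chunks is assumed to be ordered best match first.
    """
    allowed = set(chapters)
    kept = [chunk for chunk in ranked_chunks if chunk.chapter in allowed]
    return kept[:top_k]
```

For example, a question about the opening of the book could be restricted to `chapters=[1]`, which removes similar-looking passages from later chapters before the context is assembled.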
