# Small Language Model (SLM) for Question Answering
Created by Sparsh Patel, email: sparshpatel0912@gmail.com

## Overview
This project implements a Small Language Model (SLM) for answering questions based on the contents of a given book. The model extracts text from an EPUB file, preprocesses it, generates embeddings for context retrieval, and then answers user queries by retrieving the most relevant chunk of text.
The book used here is Percy Jackson and the Last Olympian.

## Approach
The implementation follows these key steps:
1. **Extract text from an EPUB file**
2. **Chunk the extracted text into manageable sizes**
3. **Generate embeddings for each chunk using a sentence-transformer model**
4. **Store embeddings in a FAISS index for efficient retrieval**
5. **Retrieve the most relevant chunk for a given question**
6. **Use a question-answering model to generate an answer based on the retrieved chunk**

## Dependencies
- `ebooklib`
- `beautifulsoup4` (imported as `bs4`)
- `nltk`
- `sentence-transformers`
- `faiss-cpu`
- `torch`
- `transformers`
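
Several of these packages are imported under a different name than the one used to install them (for example `faiss-cpu` is imported as `faiss`). A minimal sketch of a pre-flight check is shown below; the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the project.

```python
from importlib.util import find_spec

# Map each pip package name to the module name that is actually imported.
REQUIRED = {
    "ebooklib": "ebooklib",
    "beautifulsoup4": "bs4",
    "nltk": "nltk",
    "sentence-transformers": "sentence_transformers",
    "faiss-cpu": "faiss",
    "torch": "torch",
    "transformers": "transformers",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All dependencies are importable.")
```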

## Model Architecture

The implementation combines several NLP components:

- **Sentence-Transformers (`all-mpnet-base-v2`)**: Embeds text chunks and questions into a shared vector space.
- **FAISS index**: Similarity search engine that finds the most relevant text chunk.
- **Hugging Face QA pipeline (`deepset/roberta-base-squad2`)**: Extracts an answer from the retrieved chunk, which is used as context.

---

## Preprocessing Techniques

### 1. Extract Text from EPUB
```python
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def extract_text_from_epub(epub_path):
    book = epub.read_epub(epub_path)
    text = ""
    
    for item in book.get_items():
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            soup = BeautifulSoup(item.content, 'html.parser')
            text += soup.get_text() + "\n\n"
    
    return text

# Example Usage
epub_path = "path/to/book.epub"  # Replace with actual file path
book_text = extract_text_from_epub(epub_path)
print(book_text[:1000])  # Print a sample
```

### 2. Split Text into Chunks
```python
import nltk
from nltk.tokenize import sent_tokenize

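# Download the sentence tokenizer data
# (recent NLTK releases look for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('punkt_tab')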

def split_into_chunks(text, max_chunk_size=1024):
    """Group sentences into chunks of roughly max_chunk_size characters."""
    sentences = sent_tokenize(text)  # Split into sentences
    chunks = []
    chunk = ""

    for sentence in sentences:
        if len(chunk) + len(sentence) <= max_chunk_size:
            chunk += sentence + " "
        else:
            if chunk:  # Avoid an empty chunk when the first sentence is oversized
                chunks.append(chunk.strip())
            chunk = sentence + " "

    if chunk:
        chunks.append(chunk.strip())
    
    return chunks

# Apply chunking
chunks = split_into_chunks(book_text)
chunks = [chunk for chunk in chunks if chunk.strip()]  # Remove empty chunks
print(f"Total chunks: {len(chunks)}")
print(chunks[:5])  # Print first 5 chunks
```
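
The chunker above starts every chunk from scratch, so a passage that straddles a chunk boundary is split between two chunks. One possible variation, sketched here as an assumption rather than something the project does, repeats the last sentence or two of each chunk at the start of the next one. The function name and the `overlap_sentences` parameter are hypothetical.

```python
def chunk_with_overlap(sentences, max_chunk_size=1024, overlap_sentences=1):
    """Group sentences into chunks of roughly max_chunk_size characters,
    repeating the last `overlap_sentences` sentences in the following chunk."""
    chunks = []
    current = []  # sentences in the chunk being built

    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk over as shared context.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks

# Tiny demonstration with a small size limit so the overlap is visible.
demo = chunk_with_overlap(["aaaa.", "bbbb.", "cccc."], max_chunk_size=11)
print(demo)  # ['aaaa. bbbb.', 'bbbb. cccc.']
```

In this project the input would be `sent_tokenize(book_text)`. Overlap increases the number of chunks and therefore the size of the index, so it is worth measuring whether it actually helps.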

### 3. Generate Embeddings
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the embedding model
embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Convert text chunks into embeddings (FAISS expects a float32 NumPy array)
chunk_embeddings = embedding_model.encode(chunks, convert_to_numpy=True)
chunk_embeddings = np.asarray(chunk_embeddings, dtype=np.float32)

print("Chunks successfully embedded!")
```

### 4. Create FAISS Index for Efficient Retrieval
```python
import faiss

# Create FAISS index
index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
index.add(chunk_embeddings)
```
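
`IndexFlatL2` performs an exact, exhaustive search and ranks vectors by squared Euclidean distance. For intuition only, the same computation can be sketched in plain Python; this is not how FAISS is implemented internally, and the function names and toy vectors are illustrative.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k=1):
    """Compare `query` with every stored vector and return
    (distances, indices) of the k closest, nearest first."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D vectors standing in for the 768-dimensional chunk embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 1.2], k=2)
print(indices)  # [1, 0]
```

Because every query is compared with every stored vector, search cost grows linearly with the number of chunks, which is acceptable for a single book.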

### 5. Retrieve Most Relevant Chunk
```python
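def retrieve_best_chunk(question):
    """Return the chunk whose embedding is closest to the question embedding."""
    # encode() on a one-item list returns a (1, dim) array, the 2-D shape index.search() expects
    question_embedding = embedding_model.encode([question], convert_to_numpy=True)
    _, indices = index.search(question_embedding, 1)  # 1 = number of neighbours to return
    return chunks[indices[0][0]]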
```

### 6. Answer Questions
```python
from transformers import pipeline

# Load a question-answering model
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def answer_question(question):
    best_chunk = retrieve_best_chunk(question)
    result = qa_pipeline(question=question, context=best_chunk)
    return result["answer"]

# Example
question = "Who is Percy's mentor?"
print(answer_question(question))
```
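
`answer_question` depends on a single retrieved chunk, so a retrieval miss cannot be recovered by the reader. A possible extension, sketched below under stated assumptions, retrieves several chunks, runs the reader on each, and keeps the answer with the highest `score` (the Hugging Face question-answering pipeline returns a `score` alongside `answer`). The `retrieve_chunks` and `reader` arguments are hypothetical callables so that the sketch runs without loading any model; in this project they would wrap `index.search(question_embedding, k)` and `qa_pipeline`.

```python
def answer_from_top_k(question, retrieve_chunks, reader, k=3):
    """Run the reader over the k retrieved chunks and return the
    highest-scoring result as a dict with 'answer', 'score' and 'context'."""
    best = None
    for chunk in retrieve_chunks(question, k):
        result = reader(question=question, context=chunk)
        if best is None or result["score"] > best["score"]:
            best = {"answer": result["answer"],
                    "score": result["score"],
                    "context": chunk}
    return best

# Stand-ins for the real retriever and QA pipeline (demo assumptions only).
def fake_retrieve(question, k):
    return ["Chunk about the sea.", "Chunk about camp.", "Chunk about Olympus."][:k]

def fake_reader(question, context):
    scores = {"Chunk about the sea.": 0.2,
              "Chunk about camp.": 0.7,
              "Chunk about Olympus.": 0.4}
    return {"answer": context.split()[-1], "score": scores[context]}

best = answer_from_top_k("Where do they train?", fake_retrieve, fake_reader)
print(best["answer"], best["score"])
```

Reader scores from different contexts are only roughly comparable, so this selection rule is a heuristic rather than a guarantee of a better answer.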

## Evaluation Methodology
- **Accuracy**: The accuracy of the model appears to be around 60%. Because this is an extractive model rather than a generative one, each answer is a span copied from the retrieved chunk, so its correctness depends on whether that chunk contains the right context.
- **Efficiency**: Retrieval and inference are fast; answers are usually returned within 2 seconds.
- **Chunking Strategy**: Different chunk sizes can be tested to find the best accuracy. The current chunk size is 1024 characters and can be adjusted to tune the system. Because the text comes from a book, each chunk should keep at least a few paragraphs together for context.
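
One way to make the accuracy figure reproducible is to score predicted answers against hand-written reference answers. The sketch below uses exact match and token-level F1 in the style commonly used for SQuAD evaluation; it is an assumption about how the evaluation could be done, not a description of how the 60% figure was obtained. The function names are illustrative, and the (prediction, reference) pairs would have to be supplied by the reader.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(pairs):
    """pairs: iterable of (prediction, reference); returns mean EM and F1."""
    pairs = list(pairs)
    em = sum(exact_match(p, r) for p, r in pairs) / len(pairs)
    f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
    return {"exact_match": em, "f1": f1}
```

Running `evaluate` once per candidate chunk size gives a direct way to compare chunking settings on the same question set.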

## Conclusion
This approach successfully extracts text from books, chunks it for efficient retrieval, and answers user questions using embedding-based retrieval combined with a QA model. The system can be further optimized by experimenting with different embedding models and fine-tuning the question-answering model. I used the deepset/roberta-base-squad2 QA model, but if you have the resources, a generative model such as Mistral-7B, Llama-2, or a GPT-based model may improve answer quality.
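
If the extractive reader is swapped for a generative model, the retrieved chunks are typically passed to the model as context inside a prompt. The sketch below shows only the prompt assembly; the `build_prompt` helper, the template wording, and the character budget are assumptions, and the generation call itself depends on the chosen model and is not shown.

```python
def build_prompt(question, chunks, max_context_chars=3000):
    """Assemble a prompt from the question and retrieved chunks.
    Chunks are added in order until the character budget is used up."""
    context_parts = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_context_chars:
            break
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt("Who is Percy's mentor?", ["<retrieved chunk 1>", "<retrieved chunk 2>"]))
```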

---


