# RAG Implementation with ePub Documents

This notebook implements a Retrieval-Augmented Generation (RAG) system for ePub documents using FAISS and LangChain.

## Features
- ePub document loading and processing
- Text chunking and embedding generation
- Vector database storage with FAISS
- Similarity-based document retrieval
- AI-powered response generation

## Environment Setup

Install the required dependencies and import the libraries used for ePub processing, text embeddings, and vector storage.

In [None]:
%pip install pypandoc

In [1]:
# Import necessary libraries for ePub processing and document handling
from langchain_community.document_loaders import UnstructuredEPubLoader
import pypandoc
 
# Import libraries for document chunking, embeddings, and storage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores.faiss import FAISS
 
# Import libraries for LangChain OpenAI integration and text display
# (the response cells below use AzureChatOpenAI from langchain_openai)
from langchain.chat_models import ChatOpenAI
from IPython.display import display, Markdown
 


In [None]:
# Download Pandoc if it is not already installed
# pypandoc.download_pandoc()

## Natural Language Processing Setup

Download the NLTK resources used for text tokenization and tagging:
- `punkt`: Sentence tokenizer
- `punkt_tab`: Additional tokenization support
- `averaged_perceptron_tagger_eng`: Part-of-speech tagger

In [2]:
import nltk

In [None]:
# Uncomment on first run to download the NLTK resources for text processing
# nltk.download('punkt')  # Tokenizer for splitting sentences
# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')  # POS tagger for sentence parsing

## ePub Document Loading

Load an ePub document with `UnstructuredEPubLoader`, which extracts its text content for processing. The first cell downloads a sample book for testing.

In [3]:
# Download a sample ePub file for testing (Alice's Adventures in Wonderland from Project Gutenberg)
import urllib.request
import os

# URL for a free ePub book from Project Gutenberg
epub_url = "https://www.gutenberg.org/ebooks/11.epub.noimages"
epub_filename = "alice_wonderland.epub"

# Check if file already exists
if not os.path.exists(epub_filename):
    try:
        print(f"Downloading sample ePub file: {epub_filename}")
        urllib.request.urlretrieve(epub_url, epub_filename)
        print(f"Successfully downloaded {epub_filename}")
    except Exception as e:
        print(f"Error downloading file: {e}")
        print("Please manually download an ePub file or use a local file.")
else:
    print(f"ePub file {epub_filename} already exists.")

ePub file alice_wonderland.epub already exists.
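
A download can succeed and still leave something other than an ePub on disk, for example an HTML error page saved under the `.epub` name. An ePub file is a ZIP container whose `mimetype` entry reads `application/epub+zip`, so a quick sanity check is possible with the standard library alone. The sketch below is an optional addition, not part of the original notebook; `looks_like_epub` is a hypothetical helper name, and the in-memory archive exists only to exercise it.

```python
import io
import zipfile


def looks_like_epub(source) -> bool:
    """Return True if `source` (a path or file-like object) is a ZIP archive
    whose `mimetype` entry declares an ePub container."""
    try:
        with zipfile.ZipFile(source) as archive:
            mimetype = archive.read("mimetype").decode("ascii", errors="replace").strip()
    except (zipfile.BadZipFile, KeyError, OSError):
        # Not a ZIP file, no `mimetype` entry, or an unreadable path
        return False
    return mimetype == "application/epub+zip"


# Build a tiny in-memory archive to exercise the check (illustrative only)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")

print(looks_like_epub(buffer))                             # well-formed container
print(looks_like_epub(io.BytesIO(b"<html>oops</html>")))   # not a ZIP archive
```

In the notebook, the same check could be run as `looks_like_epub(epub_filename)` before the file is handed to the loader.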


In [6]:
# Check for available ePub files in the current directory
import os
import glob

# Look for ePub files
epub_files = glob.glob("*.epub")
print(f"Available ePub files: {epub_files}")

# If no ePub files found, provide instructions
if not epub_files:
    print("\nNo ePub files found in the current directory.")
    print("Please either:")
    print("1. Download or place an ePub file in this directory")
    print("2. Update the file path in the loader to point to your ePub file")
    print("3. Use a sample ePub file from the internet (like Project Gutenberg)")
else:
    print(f"\nFound {len(epub_files)} ePub file(s). You can use: {epub_files[0]}")

Available ePub files: ['alice_wonderland.epub']

Found 1 ePub file(s). You can use: alice_wonderland.epub


## Document Content Extraction

Extracting text content from ePub format and converting to structured document objects for further processing.

In [7]:
from langchain_community.document_loaders import UnstructuredEPubLoader
import os

# Use the sample download if present; otherwise fall back to a local ePub file
epub_file = "alice_wonderland.epub"
if not os.path.exists(epub_file):
    epub_file = "three little pigs.epub"  # fallback option
    if not os.path.exists(epub_file):
        print("ERROR: No ePub file found. Upload an ePub file to the workspace.")
        print("Or run the sample download cells above.")
    else:
        print(f"Using ePub file: {epub_file}")
else:
    print(f"Using ePub file: {epub_file}")

# Load and extract ePub content
if os.path.exists(epub_file):
    loader = UnstructuredEPubLoader(epub_file)
    docs = loader.load()
    
    print(f"Document loaded with {len(docs)} sections.")
    
    if docs:
        print("Content preview (first 200 characters):")
        content = docs[0].page_content
        # Append an ellipsis only when the content was actually truncated
        print(content[:200] + "..." if len(content) > 200 else content)

Using ePub file: alice_wonderland.epub
Document loaded with 1 sections.
Content preview (first 200 characters):




The Project Gutenberg eBook of Alice's Adventures in Wonderland

This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no r...


## Document Structure Analysis

Analyzing the loaded document structure and content organization to verify successful extraction.

In [13]:
# Display the start of the first loaded document to verify the data
print(str(docs[0])[:200])  # First 200 characters of the document's string representation

page_content='



The Project Gutenberg eBook of Alice's Adventures in Wonderland

This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and wi


# Text Processing and Vector Storage

Converting documents into embeddings and storing in a vector database for efficient similarity search and retrieval.

## Text Chunking

Split the document into chunks of at most 300 characters with `RecursiveCharacterTextSplitter`, so that each chunk can be embedded and retrieved on its own.

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with a chunk size of 300 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300)
 
# Split the loaded documents into chunks
chunks = text_splitter.split_documents(docs)
 
# Print the number of chunks to verify successful splitting
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 1053
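
The splitter also accepts a `chunk_overlap` argument, which the cell above leaves at the library default. To illustrate how chunk size and overlap interact, here is a deliberately simplified fixed-width splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line breaks), and `split_fixed` is a hypothetical name used only for this sketch.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split `text` into windows of at most `chunk_size` characters,
    where consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # windows do not overlap
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # neighbours share 2 characters
```

A larger overlap produces more chunks for the same text, which increases embedding time and index size but reduces the chance that a sentence is cut off from its surrounding context.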


## Embedding Generation

Creating vector embeddings using HuggingFace Sentence Transformers model (all-MiniLM-L6-v2) to capture semantic meaning of text chunks.

In [15]:
from langchain.embeddings import HuggingFaceEmbeddings

# Load the Hugging Face embedding model for sentence embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
 
# Print confirmation of embedding model loading
print("Embedding model loaded successfully.")

Embedding model loaded successfully.


In [None]:
%pip install faiss-cpu

## FAISS Vector Database

Store the chunk embeddings in a FAISS (Facebook AI Similarity Search) index so that the chunks closest to a query can be looked up efficiently.

In [18]:
from langchain.vectorstores.faiss import FAISS

# Embed the chunks as vectors and store them in a FAISS vector database
db_faiss = FAISS.from_documents(chunks, embedding_model)
 
# Print confirmation after FAISS database creation
print("Document chunks embedded and stored in FAISS vector database.")

Document chunks embedded and stored in FAISS vector database.
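
Conceptually, a similarity search compares the query vector with every stored vector and returns the closest ones. The sketch below shows that idea as a brute-force scan over toy 3-dimensional vectors using Euclidean distance. It is an illustration under simplifying assumptions, not how FAISS is implemented internally; `top_k_nearest` and the sample vectors are hypothetical.

```python
import math


def top_k_nearest(query, vectors, k):
    """Return (index, distance) pairs for the k stored vectors closest to
    `query`, measured by Euclidean distance (smaller means more similar)."""
    scored = [(i, math.dist(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 3-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.0]

for index, distance in top_k_nearest(query, stored, k=2):
    print(index, round(distance, 3))
```

A scan like this costs one distance computation per stored vector for every query, which is the work a dedicated vector index is designed to do efficiently at scale.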


# Document Retrieval System

Implementing similarity-based document retrieval functionality for the RAG system.

## Retrieval Function Implementation

Creating a document retrieval function that performs similarity search on the FAISS database to find relevant content based on user queries.

In [19]:
# Function to retrieve relevant documents based on a user query
def retrieve_docs(query, k):
    # Perform similarity search on the FAISS database using the query
    docs_faiss = db_faiss.similarity_search(query, k=k)
    
    # Return the most relevant document chunks
    return docs_faiss

## Retrieval Testing

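Testing the retrieval system with a sample query. The top hit shown below is a chapter heading rather than a passage about an antagonist, so it is worth inspecting all `k` results and not only the first when judging relevance.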

In [20]:
# Test the retrieval function with a specific query
context = retrieve_docs("who is the antagonist of the story?", 5)
 
# Print the first retrieved chunk to verify correct retrieval
print(context[0])

page_content='CHAPTER I. Down the Rabbit-Hole' metadata={'source': 'alice_wonderland.epub'}


# AI Response Generation

Integrating a large language model (LLM) to generate responses grounded in retrieved document content.

## Azure OpenAI Configuration

Set up the Azure OpenAI chat client from environment variables. The educational assistant persona is defined in the system prompt of the next section.

In [24]:
import os

from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI

# Load the Azure OpenAI settings from a local .env file into the environment
load_dotenv()

llm = AzureChatOpenAI(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    temperature=0
)
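
Because the client is built from `os.environ[...]` lookups, a missing setting surfaces as a bare `KeyError`. A small pre-flight check can report all missing names at once. This is an optional sketch, not part of the original notebook; `missing_env_vars` is a hypothetical helper, and the variable names are the four read in the cell above.

```python
import os

REQUIRED_VARS = [
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```

Running this after `load_dotenv()` and before constructing the client makes configuration problems visible without sending a request.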

## Response Generation Pipeline

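Generate a response by passing the query and a context string to the LLM. For demonstration, this cell uses a hard-coded context (a summary of the Three Little Pigs) rather than the chunks retrieved from the FAISS index, so the answer below reflects that text and not the Alice ePub.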

In [25]:
# Define the user query
query = "Who is the antagonist of the story?"

# Hard-coded demo context; this overwrites the chunks retrieved in the earlier test cell
context = "Once upon a time, there was a big bad wolf who wanted to eat the three little pigs. The wolf was very sneaky and tried to trick the pigs in many ways. He huffed and puffed and blew down their houses, but the pigs were clever and built strong houses to keep safe from the wolf."

# Define the system prompt for the assistant's role
system_message = f"""
    You are a Kindergarten teacher helping the user learn through a story.
    Include grammar explanations and clarifications for difficult vocabulary if applicable.
    Correct the user's input when necessary.
    Answer the query: {query} with the context: {context} provided.
"""

messages = [("system", system_message), ("human", query)]

from IPython.display import display, Markdown

# Generate and display the response from the assistant
response = llm.invoke(messages)  # Call the API with the messages
display(Markdown(response.content))  # Display the response in markdown format

The antagonist of the story is the big bad wolf. An antagonist is a character who opposes the main character or creates conflict in the story. In this case, the wolf wants to eat the three little pigs, which makes him the one causing trouble for them. 

If you have any more questions about the story or need help with vocabulary, feel free to ask!
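
To answer from the indexed book instead of a fixed string, the context can be assembled from the retrieved chunks. The sketch below shows one way to do that under stated assumptions: `Chunk` is a stand-in for a retrieved LangChain document (only its `page_content` attribute is used), `build_messages` is a hypothetical helper, the system prompt is a shortened variant of the one above, and the two sample passages are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for a retrieved LangChain Document; only `page_content` is used."""
    page_content: str


def build_messages(query, retrieved):
    """Join the retrieved chunk texts into one context block and wrap it,
    together with the query, in (role, content) message tuples."""
    context = "\n\n".join(chunk.page_content.strip() for chunk in retrieved)
    system_message = (
        "You are a Kindergarten teacher helping the user learn through a story.\n"
        "Answer the query using only the context provided.\n\n"
        f"Context:\n{context}"
    )
    return [("system", system_message), ("human", query)]


# Hypothetical retrieved chunks; in the notebook these would come from
# retrieve_docs(query, k)
retrieved = [
    Chunk("Alice was beginning to get very tired of sitting by her sister on the bank."),
    Chunk("The Queen turned crimson with fury."),
]
messages = build_messages("Who is the antagonist of the story?", retrieved)
print(messages[0][1])
```

In the notebook, the resulting list could be passed to `llm.invoke(messages)` in place of the hand-written `messages`, with `retrieved` set to the output of `retrieve_docs(query, 5)`.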