## Necessary Imports

In [4]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from dotenv import load_dotenv
import os
from groq import Groq
from langchain_groq import ChatGroq
import ipywidgets as widgets
from IPython.display import display, HTML
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import chain
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

#### This is the simple RAG framework. For our purposes, we break it down into a three-step process: indexing, retrieval, and generation.
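
To make the three steps concrete before the real components appear, here is a toy sketch in plain Python. It is not how this notebook implements RAG: word overlap stands in for embeddings, a formatted string stands in for the LLM call, and the function names `index`, `retrieve`, and `generate` are hypothetical.

```python
# Toy illustration only: word overlap replaces embeddings,
# and a formatted prompt string replaces the LLM call.

def index(texts):
    """Indexing: store each text alongside its set of lowercase words."""
    return [(text, set(text.lower().split())) for text in texts]

def retrieve(store, question, k=1):
    """Retrieval: rank stored texts by word overlap with the question."""
    query_words = set(question.lower().split())
    ranked = sorted(store, key=lambda item: len(item[1] & query_words), reverse=True)
    return [text for text, _ in ranked[:k]]

def generate(question, context):
    """Generation: a real system would send this prompt to an LLM."""
    return f"Context: {' '.join(context)}\nQuestion: {question}"

store = index([
    "Lowen is hired to finish a book series",
    "Rachel watches houses from the train",
])
question = "Who watches from the train"
prompt = generate(question, retrieve(store, question))
print(prompt)
```

The rest of the notebook replaces each toy step with a real component: a text splitter and FAISS for indexing, a vector-store retriever for retrieval, and a chat model for generation.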

## Indexing

### Loading Documents
The project uses the following books as the primary data sources for retrieval and generation:
1. "Verity" by Colleen Hoover
    - Genre: Psychological Thriller

    - Description: A gripping novel about a struggling writer, Lowen Ashleigh, who is hired to complete the remaining books in a successful series by the injured author, Verity Crawford. As Lowen works on the manuscripts, she uncovers dark secrets about Verity's life.

    - Use Case: The book's complex narrative and character dynamics make it an excellent source for testing retrieval and generation capabilities.

2. "The Girl on the Train" by Paula Hawkins
    - Genre: Mystery, Thriller

    - Description: A suspenseful story about Rachel, a woman who becomes entangled in a missing persons investigation that she observes during her daily train commute. The novel explores themes of memory, truth, and deception.

    - Use Case: The intricate plot and unreliable narration provide rich content for testing advanced RAG techniques.

In [5]:
# Create a PyMuPDF loader for each PDF; this loader attaches detailed metadata to every page
verity_book = PyMuPDFLoader("./Dataset/Verity-By-Colleen-Hoover.pdf")
tgott_book = PyMuPDFLoader("./Dataset/The-Girl-on-the-Train.pdf")

In [6]:
# Load the pdfs
verity_book_pages = verity_book.load()
tgott_book_pages = tgott_book.load()

In [7]:
def is_page_empty(page, min_text_length=0):
    """
    Check if a page is empty or contains very little text.
    
    Args:
        page: The page object loaded by PyMuPDFLoader.
        min_text_length: Minimum number of characters to consider a page non-empty.
    
    Returns:
        bool: True if the page is empty, False otherwise.
    """
    return len(page.page_content.strip()) <= min_text_length

In [8]:
# Filter out empty pages from Verity book
verity_book_pages_filtered = [page for page in verity_book_pages if not is_page_empty(page)]

# Filter out empty pages from The Girl on the Train book
tgott_book_pages_filtered = [page for page in tgott_book_pages if not is_page_empty(page)]
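
The default `min_text_length=0` only drops pages that are blank after stripping whitespace. The sketch below shows how a higher threshold would also drop near-empty pages such as ones holding a lone page number. `FakePage` is a hypothetical stand-in for a loaded page, and the threshold of 20 is an arbitrary illustration, not a value used in this notebook.

```python
from dataclasses import dataclass

@dataclass
class FakePage:
    """Hypothetical stand-in for a loaded page; only page_content is used."""
    page_content: str

def is_page_empty(page, min_text_length=0):
    # Same check as the notebook's helper, repeated so the sketch runs on its own.
    return len(page.page_content.strip()) <= min_text_length

pages = [
    FakePage("   \n"),                                     # whitespace only
    FakePage("12"),                                        # lone page number
    FakePage("Chapter One begins here with real text."),   # actual content
]

default_kept = [p for p in pages if not is_page_empty(p)]
strict_kept = [p for p in pages if not is_page_empty(p, min_text_length=20)]

print(len(default_kept), len(strict_kept))  # 2 1
```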

In [9]:
documents = verity_book_pages_filtered + tgott_book_pages_filtered

In [10]:
print(documents[0])

page_content='Copyright © 2018 by Colleen Hoover
All rights reserved. No part of this publication may be reproduced, distributed,
or transmitted in any form or by any means, including photocopying, recording,
or other electronic or mechanical methods, without the prior written permission
of the publisher, except in the case of brief quotations embodied in critical
reviews and certain other noncommercial uses permitted by copyright law.
This book is a work of fiction. All names, characters, locations, and incidents are
products of the authors’ imaginations. Any resemblance to actual persons, things,
living or dead, locales, or events is entirely coincidental.
VERITY
Editing by Murphy Rae
Cover Design by Murphy Rae
Interior Formatting by Elaine York, Allusion Graphics, LLC' metadata={'producer': 'calibre 3.33.1 [https://calibre-ebook.com]', 'creator': 'calibre 3.33.1 [https://calibre-ebook.com]', 'creationdate': '2018-12-13T14:15:10+00:00', 'source': './Dataset/Verity-By-Colleen-Hoover.p

### Text Splitting using Recursive Character Text Splitter 

In [11]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

In [12]:
texts = text_splitter.split_documents(documents)
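
To show what `chunk_size=500` and `chunk_overlap=50` mean, the following is a deliberately simplified sliding-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break at separators such as paragraph breaks, newlines, and spaces, so its chunks are usually shorter than `chunk_size`. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)

print([len(c) for c in chunks])          # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.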

### Creating Embeddings and Storing Text in a Vector DB

In [13]:
# Initialize embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")




In [14]:
# Create vector store
vectorstore = FAISS.from_documents(texts, embeddings)

## Retrieval

In [15]:
db_retriever=vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [16]:
db_retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001A9656B9960>, search_kwargs={'k': 5})
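
Conceptually, a similarity retriever with `k=5` embeds the query and returns the five stored chunks whose vectors lie closest to it. The sketch below shows that idea with hand-written 2-D vectors and a brute-force search. It ranks by Euclidean distance, which appears to be the default metric of LangChain's FAISS wrapper; treat that as an assumption. The `top_k` helper and the vectors are hypothetical.

```python
import math

def top_k(query_vector, stored, k=5):
    """Return the k (text, vector) pairs closest to query_vector by L2 distance."""
    return sorted(stored, key=lambda item: math.dist(query_vector, item[1]))[:k]

# Hypothetical 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
stored = [
    ("chunk about trains", (0.9, 0.1)),
    ("chunk about writing", (0.1, 0.9)),
    ("chunk about commuting", (0.8, 0.3)),
]

nearest = top_k((1.0, 0.0), stored, k=2)
print([text for text, _ in nearest])  # ['chunk about trains', 'chunk about commuting']
```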

## Generation using Vector DB and Memory 

In [21]:
# Load the API key from .env file
load_dotenv()
chatgroq_api_key = os.getenv("GROQ_API_KEY")
# chatgroq_api_key

In [22]:
# Initialize the ChatGroq model
chatgroq_model = ChatGroq(temperature=0,
                      model_name="deepseek-r1-distill-llama-70b",
                      api_key=chatgroq_api_key)

In [23]:
# Define a function to extract the content of a message
def get_msg_content(msg):
    return msg.content

# Define the SYSTEM prompt for contextualizing the chat history to come up with a standalone question
contextualize_system_prompt = (
"""Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question which can be understood \
without the chat history. Do NOT answer the question, just reformulate it if needed and otherwise return it as is."""
)

# Define the prompt for contextualizing the chat history to come up with a standalone question
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_system_prompt),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
])

# Define the chain for contextualizing the chat history to come up with a standalone question
contextualize_chain = (
    contextualize_prompt
    | chatgroq_model
    | get_msg_content
)


In [24]:
# Define the question-answering SYSTEM prompt to generate the final answer
qa_system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context, enclosed within the delimiter ###, to answer "
    "the question. If you don't know the answer, say "
    "\"Sorry, I don't know.\""
    "\n\n"
    "###"
    "{context}"
    "###"
)

# Define the question-answering prompt to generate the final answer
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        ("placeholder", "{chat_history}"),
        ("human", "{input}"),
    ]
)

# Define the chain to generate the final answer
qa_chain = (
    qa_prompt
    | chatgroq_model
    | get_msg_content
) 

In [25]:
# Define the overall chain that uses both the retrieved documents and the chat history to answer the question
@chain
def history_aware_qa(input):
    # Rephrase the question if needed
    if input.get('chat_history'):
        question = contextualize_chain.invoke(input)
    else:
        question = input['input']

    # print(input)
    # Get context from the retriever
    context = db_retriever.invoke(question)
    # print(context)
    
    # Get the final answer
    return qa_chain.invoke({
        **input,
        "context": context
    })
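
The control flow of the chain can be checked without calling a model. The sketch below mirrors the same branch with hypothetical stubs in place of `contextualize_chain`, `db_retriever`, and `qa_chain`; the stub behavior is invented purely for illustration.

```python
def answer_with_history(payload, rewrite, retrieve, answer):
    """Rewrite the question only when history exists, then retrieve and answer."""
    if payload.get("chat_history"):
        question = rewrite(payload)
    else:
        question = payload["input"]
    context = retrieve(question)
    # The answering step still receives the original input and history.
    return answer({**payload, "context": context})

# Hypothetical stubs standing in for contextualize_chain, db_retriever, and qa_chain.
def rewrite(payload):
    return f"{payload['input']} (about {payload['chat_history'][-1]})"

def retrieve(question):
    return [f"doc for: {question}"]

def answer(payload):
    return payload["context"][0]

first = answer_with_history({"input": "Who wrote it?"}, rewrite, retrieve, answer)
second = answer_with_history(
    {"input": "Who wrote it?", "chat_history": ["Verity"]}, rewrite, retrieve, answer
)
print(first)   # doc for: Who wrote it?
print(second)  # doc for: Who wrote it? (about Verity)
```

Only the retrieval query changes when history is present: the rewritten, standalone question is what gets embedded and searched, which is why a follow-up like "Who wrote it?" can still find relevant chunks.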

In [None]:
from collections import defaultdict

# Create a dictionary to store chat histories per session
session_chat_histories = defaultdict(InMemoryChatMessageHistory)

# Wrap the chain so that each session_id reads from and writes to its own history

qa_with_history = RunnableWithMessageHistory(
    history_aware_qa,
    lambda session_id: session_chat_histories[session_id],  # Now session-aware
    input_messages_key="input",
    history_messages_key="chat_history",
)
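
The `defaultdict` is what keeps sessions isolated: looking up an unseen `session_id` creates a fresh, empty history instead of raising `KeyError`. A minimal sketch of that behavior, with plain lists standing in for `InMemoryChatMessageHistory` and a hypothetical `record_turn` helper:

```python
from collections import defaultdict

# Plain lists stand in for InMemoryChatMessageHistory objects.
histories = defaultdict(list)

def record_turn(session_id, question, reply):
    """Append one human/ai exchange to the history of the given session."""
    histories[session_id].append(("human", question))
    histories[session_id].append(("ai", reply))

record_turn("123", "What is the main theme of Verity?", "(model reply)")

print(len(histories["123"]))  # 2
print(len(histories["456"]))  # 0, created empty on first access
```

One consequence worth keeping in mind is that these histories live only in process memory, so restarting the kernel clears every session.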

In [None]:
# Define the chatbot response function
def chatbot_response(user_input):
    # Finally, let's invoke the chain
    result = qa_with_history.invoke(
        {"input": user_input},
        config={"configurable": {"session_id": "123"}},
    )
    return f"{result}"

# Create the chatbot UI
# Text input for user messages
user_input = widgets.Text(
    placeholder="Type your message here...",
    description="You:",
    layout=widgets.Layout(width="80%")
)

# Button to submit messages
submit_button = widgets.Button(
    description="Send",
    button_style="success"
)

# Output area for the conversation
output = widgets.Output(
    layout=widgets.Layout(),
    style={"description_width": "initial"}
)

# Function to handle button click
def on_submit_button_click(b):
    with output:
        user_message = user_input.value
        if user_message.strip():  # Check if the input is not empty
            # Display the user's message
            display(HTML(f"<strong>You:</strong> {user_message}"))
            
            # Get the chatbot's response
            bot_response = chatbot_response(user_message)

            # Split the reasoning (inside <think>...</think>) from the final answer;
            # fall back to an empty reasoning block when the tags are missing
            before, sep, after = bot_response.partition('</think>')
            think_content = before.replace('<think>', '', 1).strip() if sep else ""
            answer_content = after.strip() if sep else bot_response.strip()

            # Format the output
            formatted_output = f"""
            <strong>AskAI Thinking:</strong> <think>{think_content}</think>
            <br>
            <strong>AskAI Answer:</strong> {answer_content}
            """
            
            # Display the bot's response
            display(HTML(f"{formatted_output}"))
            display(HTML("<br>"))
            # Clear the input box
            user_input.value = ""
        else:
            display(HTML("<em>Please enter a message.</em>"))

# Attach the function to the button's click event
submit_button.on_click(on_submit_button_click)

# Arrange the widgets vertically
chatbot_ui = widgets.VBox([user_input, submit_button, output])

# Display the chatbot UI
display(chatbot_ui)

VBox(children=(Text(value='', description='You:', layout=Layout(width='80%'), placeholder='Type your message h…

In [None]:
# Get messages for session 123
history_for_123 = session_chat_histories["123"].messages

# Print formatted messages
for msg in history_for_123:
    print(f"{msg.type}: {msg.content}")
    print("\n")

human: What is the main them of Verity?
ai: <think>
Okay, so I need to figure out the main theme of the book "Verity" by Colleen Hoover based on the provided context. Let me go through each document one by one to see what clues I can find.

Starting with the first document (id='f95a3898-4529-4707-8c52-0adfbf1cc83a', page 61), the narrator is talking about skimming through the first and second books, making notes on characters. They mention needing to understand what motivates the characters, what moves them, and what sets them off. This suggests that the story is character-driven and possibly delves into the psychological aspects of the characters.

The second document (id='0b2ac7fc-9250-4201-b2fc-14474f029b0e', page 94) discusses the narrator's obsession with Verity and how she's putting a lot of herself into her series. The narrator justifies their snooping as research, indicating that Verity's personal life and writing are deeply connected. This might hint at themes of identity or t

In [33]:
# Get messages for session 456
history_for_456 = session_chat_histories["456"].messages

if not history_for_456:
    print("No messages found for session 456.")

else:
    for msg in history_for_456:
        print(f"{msg.type}: {msg.content}")

No messages found for session 456.


### Some Example Questions and Answers

---
- You: What is the main theme of Verity?
- AskAI Answer: The main theme of "Verity" by Colleen Hoover revolves around the exploration of human motivations, the darker aspects of relationships, and the psychological depth of characters. The book delves into secrets, betrayal, and the complexities of human behavior, while also touching on the blurred lines between reality and fiction. It represents a departure from Hoover's typical emotional love stories, embracing a more intense and psychological narrative.
---
- You: Who murdered Megan Hipwell?
- AskAI Answer: In "The Girl on the Train" by Paula Hawkins, the murderer of Megan Hipwell is revealed to be **Tom Watson**. Tom, who is also the ex-husband of the protagonist Rachel Watson, killed Megan after she discovered his affair with her.
---
- You: What last 2 questions did I ask?
- AskAI Answer: The last two questions you asked were: 1. "What is the main theme of Verity?" 2. "Who murdered Megan Hipwell?"
---
- You: List all my questions which I asked till now.
- AskAI Answer: Here is a list of all the questions you've asked so far: 1. "What is the main theme of Verity?" 2. "Who murdered Megan Hipwell?" 3. "What last 2 questions did I ask?" 4. "List all my questions which I asked till now."
---
- You: How many children do Verity and Jeremy have and tell me their children names.
- AskAI Answer: From the context provided, Verity and Jeremy have two children, twin daughters named **Chastin** and **Harper**.
---
- You: Who was Crew?
- AskAI Answer: Crew is Jeremy's son, and he plays a significant role in the story. Crew is a young boy who has experienced traumatic events, including witnessing his sister's death and his mother's inaction during the incident. He is close to his father, Jeremy, and is part of a family that includes the narrator, who is pregnant, making Crew a big brother soon. Crew's character deals with adjusting to his new family dynamics and the emotional scars from his past.
---


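#### Simple RAG can answer direct questions about the novels, but it answers complex questions incorrectly. For example, when asked how many children Verity and Jeremy have and what their names are, it names only the twin daughters, Chastin and Harper, while its later answer identifies Crew as Jeremy's son.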