# Implementation of a RAG model for PDF Analysis

### Use Case

- This use case implements a retrieval-augmented generation (RAG) model with a user interface for analyzing the text of any PDF.
- A book on market research is provided in the knowledge base as an example text (feel free to add more specific data to the knowledge base).

### Libraries

- The `os` library provides tools for interacting with the operating system, such as accessing and manipulating file systems, handling environment variables, and executing system-level commands.
- The `glob` library complements this by enabling the retrieval of file and directory paths using Unix shell-style wildcards, making it particularly useful for searching and organizing files that match specific patterns.
- The `dotenv` library, accessed through `from dotenv import load_dotenv`, facilitates secure and convenient management of environment variables by loading them from a `.env` file into the application’s environment, which is ideal for handling sensitive information like API keys or configuration settings.
- The `gradio` library allows the creation of interactive web-based user interfaces, making it simple to showcase and interact with machine learning models or other Python functions. Gradio provides a range of components like text input, sliders, and image uploaders, making it a popular choice for building intuitive and shareable application interfaces. Together, these libraries cover file access, configuration through environment variables, and the user interface of the application.

In [57]:
# imports
import os
import glob
from dotenv import load_dotenv
import gradio as gr
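
To illustrate how `os` and `glob` work together for file discovery, here is a minimal sketch that lists the PDF files in a directory. The helper name `list_pdfs` and the temporary directory are assumptions made for this demonstration; in the notebook the directory would be `knowledge-base`.

```python
import glob
import os
import tempfile


def list_pdfs(directory):
    """Return the sorted paths of all .pdf files directly inside `directory`."""
    pattern = os.path.join(directory, "*.pdf")
    return sorted(glob.glob(pattern))


# Demonstration with a throwaway directory standing in for `knowledge-base`
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(path) for path in list_pdfs(tmp)]

print(found)  # ['a.pdf', 'b.pdf']
```

A loop over such a list would be one way to load several PDFs into the knowledge base instead of a single hard-coded file.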

- The `langchain.document_loaders` module, which includes `DirectoryLoader` and `TextLoader`, enables efficient loading of documents from directories or text files. 
- The `CharacterTextSplitter` from `langchain.text_splitter` is designed for breaking down documents into smaller chunks for easier processing, while the `Document` schema from `langchain.schema` standardizes document representation.
- Language model interaction is facilitated by `OpenAIEmbeddings` and `ChatOpenAI` from `langchain_openai`, enabling the use of OpenAI's embeddings and conversational capabilities. To store and query vectorized document embeddings, the `Chroma` vector store from `langchain_chroma` is employed.
- The code also leverages additional libraries for specific tasks. `numpy` (imported as `np`) provides support for numerical computations, while `PyPDF2` is used for reading and processing PDF files. Dimensionality reduction for visualizing high-dimensional data is available through the `TSNE` class (the t-SNE algorithm) from `sklearn.manifold`.
- To visualize the reduced data in an interactive manner, `plotly.graph_objects` is utilized.
- For conversational applications, `ConversationBufferMemory` from `langchain.memory` manages and retains the context of conversations, and `ConversationalRetrievalChain` from `langchain.chains` integrates conversational capabilities with document retrieval.

In [58]:
# imports for langchain
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
import numpy as np
import PyPDF2
from sklearn.manifold import TSNE
import plotly.graph_objects as go
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

In [59]:
# Model selection and creating a vector database
MODEL = "gpt-4o"
db_name = "vector_db"

In [60]:
# Load environment variables from a file called .env
load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')
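
The cell above falls back to a placeholder string when no key is found, which would only surface as an authentication error later on. As an alternative, a small check can fail early with a clear message. This is a sketch under assumptions: the helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and used only for illustration.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file"
        )
    return value


# Hypothetical variable name used only for this demonstration
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))  # sk-demo
```

In the notebook, the same check could be called with `OPENAI_API_KEY` right after `load_dotenv()`.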

The next cell reads and extracts the text content of a PDF file located in the `knowledge-base` directory. It imports the `PdfReader` class from the `PyPDF2` library and creates a reader for the specified file, which gives access to its pages. An empty string `text` is initialized to accumulate the extracted content. The code then iterates over all pages, calls `extract_text()` on each one, and appends the result to `text`, producing a single continuous string with the text of the whole PDF.

In [61]:
# Import PDF file

from PyPDF2 import PdfReader
reader = PdfReader("knowledge-base/marktforschung.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() or ""  # treat pages without extractable text as empty
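
The accumulation loop can also be written as a small reusable function. The sketch below is not tied to `PyPDF2`: it works with any objects that offer an `extract_text()` method, and the `FakePage` class is a hypothetical stand-in so the example runs without a PDF file. It joins pages with a newline (an assumption, chosen so that the last word of one page does not fuse with the first word of the next) and treats pages that yield no text as empty.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page, treating pages without text as empty."""
    parts = []
    for page in pages:
        page_text = page.extract_text() or ""  # guard against None or empty results
        parts.append(page_text)
    return separator.join(parts)


class FakePage:
    """Stand-in for a PDF page object; it only mimics `extract_text()`."""

    def __init__(self, content):
        self._content = content

    def extract_text(self):
        return self._content


demo_pages = [FakePage("Page one."), FakePage(None), FakePage("Page three.")]
combined = join_page_texts(demo_pages)
print(repr(combined))  # 'Page one.\n\nPage three.'
```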

The next cell wraps the text in a `Document` object and then attempts to split it into smaller, manageable chunks. It imports the `Document` class from `langchain.docstore.document` and the `CharacterTextSplitter` from `langchain.text_splitter`. Wrapping the text in a `Document` makes it compatible with LangChain's document processing tools. The splitter is configured with a `chunk_size` of 1000 characters and an overlap of 100 characters between consecutive chunks, and `split_documents()` is called with the document passed as a list.

Note that `CharacterTextSplitter` splits only on a single separator (by default a blank line, `"\n\n"`). Text that does not contain this separator is not divided, no matter how small `chunk_size` is. That is the likely reason the output below reports a single chunk for this PDF instead of many overlapping ones.

In [62]:
# Use LangChain's Document class to wrap the text.
# Split the text into chunks (if needed) using LangChain's TextSplitter utilities.

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter

# Wrap the text in a Document object
document = Document(page_content=text)

# Optionally split text into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents([document])

In [63]:
len(chunks)

1
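
To make the meaning of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts a string into fixed-width character windows. It is not how LangChain's splitters are implemented (they split on separators first); the function name `sliding_window_chunks` is an assumption used only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(windows)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window starts with the last three characters of the previous one, which is the context continuity that the overlap setting is meant to provide.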

# Assess the number of tokens in the chunks

Assessing the number of tokens in text chunks is crucial to ensure compatibility with model token limits, maintain contextual coherence, optimize performance, and control processing costs. Proper token management prevents errors from exceeding limits, preserves context with overlaps, and ensures efficient, reliable processing in NLP pipelines.

Tiktoken is a tokenization library used with OpenAI's language models to efficiently convert text into tokens and vice versa. It is designed to handle tokenization in a way that aligns with OpenAI's models, enabling precise management of token limits, cost estimation, and processing efficiency. Tiktoken is optimized for performance and supports various encoding formats for different models.

In [64]:
print(type(chunks))  # a list of Document objects, not a plain string

<class 'list'>


In [65]:
if isinstance(chunks, list):
    # Extract text from each Document object
    chunks = " ".join([chunk.page_content for chunk in chunks])

In [66]:
import tiktoken

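# Note: this uses the "gpt-4" encoding; MODEL above is "gpt-4o", whose tokenizer differs, so its count would differ
encoding = tiktoken.encoding_for_model("gpt-4")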

tokens = encoding.encode(chunks)

# Count the number of tokens
num_tokens = len(tokens)

print(f"Number of tokens in the chunk: {num_tokens}")

Number of tokens in the chunk: 265159


Example token limits for OpenAI models:
- GPT-4 (8k context): up to 8,192 tokens
- GPT-4 (32k context): up to 32,768 tokens
- GPT-3.5-turbo: up to 4,096 tokens
- GPT-4o (the model selected in `MODEL` above): up to 128,000 tokens
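
A quick calculation shows why the full text cannot be sent to a model in one piece. The sketch below divides the token count reported above by a few of the listed context limits. It is a lower bound only: it ignores the tokens needed for the prompt and the model's response, and the helper name `min_windows` is an assumption.

```python
import math

num_tokens = 265_159  # token count reported for the full text above

context_limits = {
    "GPT-4 (8k context)": 8_192,
    "GPT-4 (32k context)": 32_768,
    "GPT-3.5-turbo": 4_096,
}


def min_windows(total_tokens, limit):
    """Smallest number of context windows needed to cover `total_tokens`."""
    return math.ceil(total_tokens / limit)


for model, limit in context_limits.items():
    print(f"{model}: at least {min_windows(num_tokens, limit)} windows")
# GPT-4 (8k context): at least 33 windows
# GPT-4 (32k context): at least 9 windows
# GPT-3.5-turbo: at least 65 windows
```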

# RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` is a component in the LangChain library designed to split large text into smaller, manageable chunks while preserving as much semantic coherence as possible. It works by recursively splitting the text at predefined delimiters in a prioritized order, such as paragraphs, sentences, or words. If a chunk exceeds the specified size limit, the splitter moves to the next level of granularity, breaking the text further until the chunks fit within the desired size. This approach ensures that the resulting chunks remain contextually meaningful and well-suited for tasks like retrieval, summarization, or processing by language models with token limits.
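
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration under assumptions, not LangChain's implementation: it has no overlap handling, the function name `recursive_split` is hypothetical, and the separator order (paragraph, line, word, single character) is chosen for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into pieces of at most `chunk_size` characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every `chunk_size` characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; move to the next level of granularity
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging parts while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is still too long: split it with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_text = "aaaa bbbb\n\ncccc dddd eeee\n\nffff"
pieces = recursive_split(demo_text, chunk_size=10)
print(pieces)  # ['aaaa bbbb', 'cccc dddd', 'eeee', 'ffff']
```

The first paragraph fits as a whole, while the second is too long and is therefore split again at word boundaries.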

In [67]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example: Set the max chunk size and overlap (both measured in characters, not tokens)
chunk_size = 8000  # Adjust so each chunk stays well within your model's token limit
chunk_overlap = 200  # Overlap ensures context continuity between chunks

splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Assuming your large text is in a variable called `text`
chunks = splitter.split_text(text)

# Check the number of chunks created
print(f"Number of chunks: {len(chunks)}")

# Example: Inspect the first chunk
print(chunks[0])

Number of chunks: 115
Marktforschung
Datenerhebung und Datenanalyse
8. Auflage  Henning Kreis
Raimund Wildner
Alfred KußMarktforschungSN Flashcards Microlearning
Schnelles und eﬃ  zientes Lernen mit digitalen Karteika rten – 
für Arbeit oder Studium!
Diese Möglichkeiten bieten Ihnen die SN Flashcards:
•Jederzeit und überall auf Ihrem Smar tphone,  Tablet oder Computer  lernen
•D en Inhalt des Buches lernen und Ihr Wissen  testen
•Sich durch verschieden e, mit multimedialen Komponenten angereicher te 
Fragetypen motivieren lassen  und zwischen drei Lernalgorithmen 
(Langzeit gedächtnis-,  Kurzzeitgedächtnis- oder Prüfungs-Modus) wählen
•Ihre eigenen Fragen-Sets erstellen , um Ihre Lerner fahrung zu personalisieren
So greifen Sie auf Ihre SN Flashcards zu:
1. Gehen Sie auf die 1. Seite des 1. Kapitels  dieses Buches und folgen Sie den 
Anweisungen in der Box, um sich für einen SN Flashcards-A ccount anzumelden 
und auf die Flashcards-Inhalte für dieses Buch zuzugreif en.
2. Laden Sie die

# Convert String to Documents

The code creates a vector store by converting text chunks into `Document` objects and generating embeddings using OpenAI's embedding model. These documents and their embeddings are stored in a Chroma vector store, with persistence enabled to save the data in a specified directory (`db_name`). Finally, it prints the number of documents stored in the vector store. This setup facilitates efficient similarity search and retrieval tasks.

In [68]:
# Put the chunks of data into a Vector Store that associates a Vector Embedding with each chunk
# Chroma is a popular open source Vector Database based on SQLite
# Convert strings in `chunks` into `Document` objects

from langchain.docstore.document import Document
documents = [Document(page_content=chunk) for chunk in chunks]
embeddings = OpenAIEmbeddings()

# Create Vector Store

A vector store is a database designed to store, organize, and retrieve vector representations of data, typically numerical embeddings derived from text, images, or other data types. These embeddings are high-dimensional vectors that encode semantic or contextual information, enabling efficient similarity searches. Vector stores are crucial in applications like information retrieval, recommendation systems, and machine learning pipelines, where finding similar items or documents is a core task.

Chroma is an open-source vector store library designed for managing embeddings and facilitating semantic search and retrieval. It is particularly well-integrated with LangChain, making it a popular choice for building NLP and AI-powered applications.
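
The core operation of a vector store, ranking stored vectors by similarity to a query vector, can be sketched without any library. The class `TinyVectorStore` below is a toy written for illustration, not Chroma's API, and the three-dimensional vectors are hand-made stand-ins for real embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Toy in-memory store: keeps (vector, text) pairs and ranks them by cosine similarity."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query_vector, k=2):
        scored = [
            (cosine_similarity(query_vector, vector), text)
            for vector, text in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "survey design")
store.add([0.9, 0.1, 0.0], "questionnaire wording")
store.add([0.0, 0.0, 1.0], "regression analysis")

results = store.search([0.8, 0.0, 0.1], k=2)
print([text for _, text in results])  # ['survey design', 'questionnaire wording']
```

A production vector store adds persistence and approximate nearest-neighbor indexing so that the search stays fast for large collections.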

In [69]:
# Clear chroma

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

# Create vectorstore

vectorstore = Chroma.from_documents(documents=documents, embedding=embeddings, persist_directory=db_name)
print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 115 documents


In [70]:
# Get one vector and find how many dimensions it has

collection = vectorstore._collection
sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]
dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 1,536 dimensions


## Use LangChain for Integration

The following steps set up a conversational AI system with retrieval-augmented generation (RAG) capabilities using OpenAI's language model and LangChain tools:

1. **Creating the Chat Model:**
   - `llm = ChatOpenAI(temperature=0.7, model_name=MODEL)` initializes a conversational language model using OpenAI's API with a specified temperature (controlling creativity) and a model name (here `gpt-4o`, as set in `MODEL`).

2. **Setting up Memory:**
   - `memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)` establishes a memory buffer to store and manage the conversation history, ensuring the chat retains context across turns. The `return_messages` flag ensures the memory returns messages in a structured format.

3. **Defining the Retriever:**
   - `retriever = vectorstore.as_retriever()` creates a retriever from the vector store, enabling semantic search over the stored document embeddings. This is used to retrieve relevant information during the conversation.

4. **Creating the Conversational Retrieval Chain:**
   - `conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)` combines the language model, memory, and retriever into a single conversational framework. This setup allows the chatbot to use the vector store for retrieving contextually relevant information, maintain conversational memory, and generate responses.
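
How the four pieces interact in a single conversational turn can be sketched with plain Python stand-ins. Everything here is an assumption made for illustration: `BufferMemory`, `keyword_retriever`, `fake_llm`, and `ask` are hypothetical names, the retriever counts shared words instead of comparing embeddings, and no model is called. LangChain's actual chain does more, for example it can rephrase a follow-up question using the chat history before retrieval.

```python
class BufferMemory:
    """Stores every (role, text) turn of the conversation in order."""

    def __init__(self):
        self.turns = []

    def add(self, role, text):
        self.turns.append((role, text))


def keyword_retriever(question, documents, k=1):
    """Stand-in for semantic search: rank documents by shared lowercase words."""
    words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def fake_llm(prompt):
    """Stand-in for a chat model: reports only how long the prompt was."""
    return f"(answer generated from a {len(prompt)}-character prompt)"


def ask(question, memory, documents, llm=fake_llm):
    """Run one turn: retrieve context, build the prompt, generate, remember."""
    context = keyword_retriever(question, documents)
    history = "\n".join(f"{role}: {text}" for role, text in memory.turns)
    context_text = "\n".join(context)
    prompt = f"History:\n{history}\n\nContext:\n{context_text}\n\nQuestion: {question}"
    answer = llm(prompt)
    memory.add("user", question)
    memory.add("assistant", answer)
    return answer, context


demo_docs = [
    "Primary research collects new data through surveys.",
    "Secondary research reuses existing data sources.",
]
demo_memory = BufferMemory()

first_answer, first_context = ask("What is primary research", demo_memory, demo_docs)
second_answer, second_context = ask("And what about secondary research", demo_memory, demo_docs)

print(first_context[0])        # Primary research collects new data through surveys.
print(second_context[0])       # Secondary research reuses existing data sources.
print(len(demo_memory.turns))  # 4
```

Because the second prompt includes the stored history, the model sees the earlier exchange when answering the follow-up question.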

In [71]:
# create a new Chat with OpenAI
llm = ChatOpenAI(temperature=0.7, model_name=MODEL)

# set up the conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# the retriever is an abstraction over the VectorStore that will be used during RAG
retriever = vectorstore.as_retriever()

# putting it together: set up the conversation chain with the GPT LLM, the vector store and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

In [72]:
# set up a new conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# putting it together: set up the conversation chain with the GPT LLM, the vector store and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

## Use Gradio as Interface

In [73]:
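# Wrap the chain in a function with the (message, history) signature that Gradio's ChatInterface calls.
# `history` is not used here because the conversation chain keeps its own memory.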

def chat(message, history):
    result = conversation_chain.invoke({"question": message})
    return result["answer"]
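
A slightly more defensive variant of this callback can ignore empty input and show errors in the chat window instead of raising. The sketch below is an assumption-laden illustration: `make_chat_fn` is a hypothetical helper, and `EchoChain` is a stand-in object that only mimics the `invoke()` call used above so the example runs without an API key.

```python
def make_chat_fn(chain):
    """Build a `chat(message, history)` callback around an object exposing `invoke()`."""

    def chat(message, history):
        if not message.strip():
            return "Please enter a question."
        try:
            result = chain.invoke({"question": message})
        except Exception as exc:  # report the failure in the chat instead of crashing
            return f"Sorry, something went wrong: {exc}"
        return result["answer"]

    return chat


class EchoChain:
    """Hypothetical stand-in for the conversation chain, used only for this demo."""

    def invoke(self, inputs):
        return {"answer": f"You asked: {inputs['question']}"}


demo_chat = make_chat_fn(EchoChain())
print(demo_chat("What is a panel survey?", history=[]))  # You asked: What is a panel survey?
print(demo_chat("   ", history=[]))                      # Please enter a question.
```

In the notebook, `make_chat_fn(conversation_chain)` would produce a function that can be passed to `gr.ChatInterface` in the same way as `chat`.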

In [74]:
view = gr.ChatInterface(chat, type="messages").launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7863

To create a public link, set `share=True` in `launch()`.
