<h1>Overview</h1>
<hr>

In this lab we will go over three areas of the RAG pipeline: 1. Data Prep, 2. Embedding, and 3. Vector Database. We will finish with retrieval to show a complete pipeline implementation. Keep in mind this is designed as an intermediate approach to the RAG pipeline that pulls back the cover to show some of the processes, code, and technologies used.

<h1>Data Prep (chunking/splitting)</h1>
<hr>

Data chunking in the context of Retrieval Augmented Generation (RAG) refers to the process of breaking down large documents or datasets into smaller, more manageable units of information.  These units, called "chunks," are designed to be relevant and self-contained enough to be useful for retrieval and subsequent processing by a large language model (LLM).  The goal is to find the right sized chunks – small enough to be specific and relevant, but large enough to contain sufficient context.

In the following example, we'll demonstrate a simple method of how to split a file of text into individual chunks using the [langchain.text_splitter](https://python.langchain.com/api_reference/text_splitters/index.html) library. 

In [None]:
from langchain.document_loaders import TextLoader

First we load a text file with TextLoader so that we have data to split. Later we will look at CharacterTextSplitter with chunk_size set to 10 and see what it does with this text.

In [None]:
documents = []  # Collects the loaded Document objects in a list

loader = TextLoader("data/dissimilar.txt")  # Text file located in the data directory
docs = loader.load()  # Returns a list of Document objects
documents += docs

documents[0].page_content  # Displays the text that was loaded into the first Document

<i>The TextLoader class from LangChain returns the content as a Document object. This object consists of the actual data and metadata that can be used for further classification. An example would be the source of the data, such as the filename or website it was retrieved from.</i>

All of the data from the file was loaded into a single Document object because we only loaded the text and did not perform any chunking or splitting. Here we will verify that a single document was created.

In [None]:
len(documents)

The next cell shows the metadata that was added to the document when the data was loaded.

In [None]:
documents[0].metadata

The next cell shows the content of the document.

In [None]:
documents[0].page_content

As you can see, the document needs to be cleaned up a little to remove unnecessary characters, newlines, and so on. This is part of preparing the data for retrieval so that it provides more context-aware results.
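
As a minimal sketch of the kind of cleanup mentioned above, the snippet below collapses runs of whitespace into single spaces. The helper name `clean_text` and the sample string are made up for illustration and are not part of the lab data. Note that collapsing newlines also removes paragraph breaks, which a splitter could otherwise use as separators, so whether to do this depends on your data.

```python
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # \s+ matches one or more whitespace characters, including "\n" and "\t".
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample text, not taken from data/dissimilar.txt.
raw = "The  quick\n\nbrown\tfox.\n"
cleaned = clean_text(raw)
print(cleaned)
```

The same helper could be applied to `page_content` of each loaded document before splitting, assuming paragraph boundaries are not needed later.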

<h2>CharacterTextSplitter</h2>
<hr>

LangChain provides text splitters for this in the `langchain_text_splitters.character` module. The first one we will look at is the CharacterTextSplitter.

Let us set up the [CharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html#langchain_text_splitters.character.CharacterTextSplitter) class.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size=10,   # Setting the number of characters to split on
    chunk_overlap=0, # Using 0 as the overlap
    separator=""     # setting this to split on each character
)

docs = text_splitter.split_documents(documents)
len(docs)

Now that the data has been split into chunks, many more documents have been created. The content of each document is not a semantic chunk, because it carries no meaning or context on its own. This is an example of hard splitting on a fixed number of characters, which is not much help for retrieval.

Let's take a look at the content of the first 10 documents.

In [None]:
docs[0:10]

The content in each document is split by characters based on the amount defined in chunk_size. Every character counts (letters, spaces, newlines, carriage returns, and so on), which you can see in the results.

The output does not provide useful information: the text is split in the middle of words, so there is no meaning or context in these documents. Our next step is to give the chunks some meaning. Instead of hard character splitting we will split on sentences.

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size=10,
    chunk_overlap=0,
    separator='.'  # Split on periods, which roughly approximates sentence boundaries
)

docs = text_splitter.split_documents(documents)

You will see warnings that a chunk was created that is larger than the specified `chunk_size`. They appear because a sentence longer than `chunk_size` cannot be split any further with the given separator. For now they can be ignored.

Preserving context at the chunk boundaries involves using `chunk_overlap`, which lets you specify how much overlap there should be between consecutive chunks. This means that some of the text from the end of one chunk is repeated at the beginning of the next chunk. Overlapping chunks help maintain the flow of information and ensure that sentences or concepts that are split can still be understood in their entirety.

Since the data used in this sample has no semantic overlap between sentences, the `chunk_overlap` argument is set to 0.
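
To make the effect of `chunk_overlap` concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is a simplified illustration, not LangChain's implementation: the function name `chunk_with_overlap` and the sample string are hypothetical, and separators are ignored entirely.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts (chunk_size - chunk_overlap) characters after the last.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijkl"
no_overlap = chunk_with_overlap(sample, chunk_size=5, chunk_overlap=0)
with_overlap = chunk_with_overlap(sample, chunk_size=5, chunk_overlap=2)

print(no_overlap)    # ['abcde', 'fghij', 'kl']
print(with_overlap)  # ['abcde', 'defgh', 'ghijk', 'jkl']
```

With an overlap of 2, the last two characters of each chunk reappear at the start of the next one, which is the repetition described above.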

<details>
    <summary>Click here for an explanation of CharacterTextSplitter</summary>
Let's break down how they work together:

**`chunk_size`**

*   **Maximum Chunk Length:** This parameter sets the *maximum* number of characters that a single chunk of text can contain. Think of it as a target length for your chunks.
*   **Not a Strict Limit:** It's important to understand that `chunk_size` is not a hard limit. The splitter will try to create chunks around this size, but it might produce chunks that are slightly longer if it can't find a suitable split point within the `chunk_size` limit.

**`chunk_overlap`**

* This determines how much overlap there should be between consecutive chunks. This helps preserve context at the boundaries. (Default: 0)


**`separator`**

*   **Where to Split:** This parameter specifies the character or string that the splitter should use to divide the text into chunks. Common separators include:
    *   `\n\n` (double newline, often used to separate paragraphs)
    *   `\n` (single newline, often used to separate lines)
    *   `.` (period, to split at sentence boundaries)
    *   ` ` (space, to split at word boundaries)
*   **Priority:** The splitter will prioritize splitting the text at the specified separator. It will try to make chunks that end with the separator, as long as they are within the `chunk_size`.

**How They Interact**

1.  **Splitting at Separators:** The `CharacterTextSplitter` first splits the text at every occurrence of the `separator`.
2.  **Chunk Size Check:** It then merges neighboring pieces back together for as long as the combined text is less than or equal to the `chunk_size`.
3.  **Creating Chunks:**
    *   If the next piece still fits within the limit, it is added to the current chunk.
    *   If the next piece would push the chunk over the limit, the current chunk is emitted and a new one is started. A single piece that contains no `separator` and is longer than `chunk_size` cannot be split further, so that chunk will be larger than `chunk_size`.

**Key Points**

*   The `CharacterTextSplitter` aims to create chunks that are close to `chunk_size` but will respect the `separator` where possible.
*   If no suitable `separator` is found within the `chunk_size`, chunks might exceed the specified size.
*   Choosing the right `separator` and `chunk_size` depends on the nature of your text and how you plan to use the chunks.
*   `chunk_overlap` allows you to specify how much overlap there should be between consecutive chunks. This means that some of the text from the end of one chunk will be repeated at the beginning of the next chunk.

</details>
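
The interaction between `separator` and `chunk_size` can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual code: the function name `split_and_merge` is hypothetical, pieces are rejoined with the separator plus a space, and overlap is left out.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + " " + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, not taken from the lab data.
text = "Cats sleep. Dogs bark. Astronauts float in orbit around the Earth. Fish swim."
chunks = split_and_merge(text, separator=".", chunk_size=25)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The two short sentences are merged into one chunk, while the long sentence contains no further separator and therefore ends up as a chunk longer than `chunk_size`, which is the situation the warnings refer to.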

Let's take a look at the number of documents created as well as a snippet of the document contents.

In [None]:
len(docs)

In [None]:
docs[:20]

Since we used '.' as the separator, the output makes a little more sense than before, because it chunks at the end of a sentence rather than in the middle of a word. You still need to pay attention to the data: for example, "Dr." would be split at the "." and leave behind a fragment with no semantic meaning.
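
The "Dr." issue is easy to reproduce with plain string splitting. The sentence below is a made-up example, not text from the lab data.

```python
# Hypothetical sample sentence containing an abbreviation.
text = "Dr. Smith met the crew. They launched at dawn."

# Splitting on every period treats the abbreviation as a sentence boundary.
pieces = [p.strip() for p in text.split(".") if p.strip()]
print(pieces)  # ['Dr', 'Smith met the crew', 'They launched at dawn']
```

The fragment `'Dr'` carries no meaning on its own, and the name it belongs to has ended up in a different piece.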

<h2>RecursiveCharacterTextSplitter</h2>
<hr>

Because there is more than one way to split the data, LangChain has a class called [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#recursivecharactertextsplitter), which provides a way to use multiple separators in a recursive manner.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0
)

docs = text_splitter.split_documents(documents)

docs[:]

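Using this splitter avoids the "Dr." problem from earlier, because its default separators (paragraph breaks, newlines, and spaces) do not include the period, and it splits the data in a more context-aware fashion.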

<details>
 <summary>Here's a breakdown of the key differences between the two functions</summary>

**CharacterTextSplitter**

*   **Simple Splitting:** This is the more basic approach. It splits the text based on a single separator you define (e.g., `\n\n` for paragraphs, `\n` for lines, or even a specific character).
*   **Fixed Chunking:** It tries to create chunks around a `chunk_size` you specify, but it might not always be exact. If it can't find your separator within the `chunk_size`, it might create a chunk that's a bit longer.
*   **Less Context Awareness:** It primarily focuses on the separator and `chunk_size`, without necessarily considering the semantic meaning or relationships between different parts of the text.

**RecursiveCharacterTextSplitter**

*   **Hierarchical Splitting:** This method is more sophisticated. It splits the text recursively, trying to keep related pieces of information together. It starts by splitting on the most significant separators (like `\n\n` for paragraphs), then moves down to less significant ones (like `\n` for lines, then spaces for words), and so on.
*   **Context Preservation:** It aims to maintain context by prioritizing splits at logical boundaries. This is especially useful for longer documents where you want to ensure that related sentences or paragraphs stay within the same chunk.
*   **More Flexible:** It offers more control over how the text is split, allowing you to define a list of separators to try in order.

**Here's a table summarizing the key differences:**

| Feature | CharacterTextSplitter | RecursiveCharacterTextSplitter |
|---|---|---|
| Splitting Approach | Simple, based on separator | Hierarchical, recursive |
| Context Awareness | Lower | Higher |
| Chunking | Tries to match `chunk_size` | Prioritizes logical boundaries |
| Flexibility | Less | More |

**When to Use Which**

*   **`CharacterTextSplitter`:** Suitable for simpler tasks where you just need to break the text into manageable chunks, and context preservation is not critical.
*   **`RecursiveCharacterTextSplitter`:** Recommended for more complex scenarios, especially when dealing with longer documents or when it's important to maintain the relationships between different parts of the text. This is often the better choice for tasks like question answering or summarization, where understanding the context is crucial.

**In essence:** The `RecursiveCharacterTextSplitter` is like a more intelligent version of the `CharacterTextSplitter`. It's designed to create chunks that are not only of a reasonable size but also make sense semantically.

</details>
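
The recursive idea can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: the function names are hypothetical, overlap is omitted, and the separator list is an assumption chosen to mirror the paragraph, line, and word levels described above.

```python
def merge_pieces(pieces: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join neighboring pieces while the result fits in chunk_size."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Try the coarsest separator first and fall back to finer ones when needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))
    return merge_pieces(pieces, sep, chunk_size)


# Hypothetical sample text with two paragraphs.
text = "Dr. Smith trains astronauts.\n\nCats nap all day long in the sun."
chunks = recursive_split(text, chunk_size=30, separators=["\n\n", "\n", " "])
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within the limit and stays whole, including "Dr.", because the period is never used as a separator. The second paragraph is too long, so the sketch falls back to splitting on spaces and merges the words back up to the limit.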


Here is a link to a [graphical site](https://chunkviz.up.railway.app/), along with a screenshot, that visually shows chunk size and chunk overlap on text. It is useful for getting an understanding of how chunk_size and chunk_overlap affect your data.

![screenshot](images/chunkviz-screenshot.png)

<h1>Embedding</h1>
<hr>

The data is in a good place to continue on to the next phase which is embedding. Embedding data refers to the process of converting data into a numerical representation called an "embedding vector."  This vector captures the semantic meaning or characteristics of the data in a way that can be understood and used by machine learning models, especially large language models (LLMs).

<details>


<summary>More details on Embedding Vectors.</summary>

Imagine you have words like "king," "queen," "man," and "woman."  A word embedding model would represent each of these words as a vector of numbers (e.g., `[0.25, 0.78, -0.12, ...]`).  The key is that these vectors are designed so that:

*   **Semantically Similar Items are Close:** The vectors for "king" and "man" would be closer to each other in "vector space" than the vectors for "king" and "woman."  "Queen" and "woman" would also be close.  This reflects the semantic relationships between the words.
*   **Relationships are Encoded:** The *differences* between the vectors can also be meaningful. For example, the difference between the "king" and "queen" vectors might be similar to the difference between the "man" and "woman" vectors, capturing the concept of "gender."

**Why Embed Data?**

*   **Machine Learning Understanding:** Machine learning models, particularly neural networks, work with numbers.  Embeddings provide a way to translate complex data like text, images, or even audio into a numerical form that these models can understand and process.
*   **Semantic Representation:** Embeddings capture the underlying meaning and relationships between data points. This is crucial for tasks like:
    *   **Similarity Search:** Finding items that are related to each other (e.g., finding similar articles, products, or images).
    *   **Clustering:** Grouping similar items together.
    *   **Recommendation Systems:** Recommending items that a user might like based on their past behavior.
    *   **Natural Language Processing (NLP):** Understanding the meaning and context of text.
*   **Dimensionality Reduction:** Embeddings can often represent complex data in a lower-dimensional space, making it easier to work with and reducing computational costs.

**How Does it Work?**

The process of creating embeddings typically involves training a machine learning model on a large dataset.  For text, models like Word2Vec, GloVe, and BERT are commonly used.  These models learn to associate words or phrases with vectors in such a way that the relationships described above are captured.

**Example (Conceptual):**

Let's imagine a simplified 2D embedding space (in reality, embedding vectors have many more dimensions).

```
 Queen ------- King
   |             |
 Woman ------- Man
```

In this example, the position of each word represents its embedding vector.  "King" sits close to "man" and "queen" sits close to "woman," while the horizontal step from "queen" to "king" matches the step from "woman" to "man."

**In Summary**

Embedding data is the process of converting data into numerical vectors that capture its semantic meaning.  These embeddings are essential for enabling machine learning models to understand and work with complex data, leading to a wide range of applications in NLP, information retrieval, and other fields.
</details>
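
Closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses made-up 2D vectors purely for illustration; real embeddings come from a model and have far more dimensions, and the axis meanings assigned here are an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors: axis 0 loosely stands for "royalty", axis 1 for "gender".
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
}

king_man = cosine_similarity(toy["king"], toy["man"])
king_woman = cosine_similarity(toy["king"], toy["woman"])
print(f"king vs man:   {king_man:.3f}")
print(f"king vs woman: {king_woman:.3f}")

# The offset between paired words is the same in this toy space.
king_minus_queen = [k - q for k, q in zip(toy["king"], toy["queen"])]
man_minus_woman = [m - w for m, w in zip(toy["man"], toy["woman"])]
print(king_minus_queen, man_minus_woman)
```

In this toy space "king" scores higher against "man" than against "woman", and the two offsets are identical, which mirrors the two properties listed in the collapsed section.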

Here we set up the embedding model that will be used to embed the documents before they are stored.

In [None]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="llama3.1:latest",
    base_url="http://192.168.15.91:11434"
)

And for fun let's do a test embedding. First we will look at the vector created for a single word.

In [None]:
sentence = "Hello"

embedding_vector = embeddings.embed_query(sentence)

print(embedding_vector)

Let's see how many dimensions (numbers) the vector has.

In [None]:
len(embedding_vector)

Let's see how many dimensions the vector has for a full sentence.

In [None]:
sentence = "This is a test sentence to show the number of vectors created."

embedding_vector = embeddings.embed_query(sentence)

len(embedding_vector)

As you can see, the vector has the same number of dimensions for a single word as for an entire sentence.  The number of dimensions is determined by the model used to do the embedding.  <i>On a side note, you will need to use the same model to embed queries at retrieval time as you did to embed your documents.</i>

<h1>Vector Database</h1>
<hr>

Let us continue with embedding and storing the results.  To store the vectors created by embedding we will use a vector database.  These types of databases differ from traditional ones in that they store vectors in a high-dimensional space and perform similarity searches rather than exact matches against structured data.

<details>
   <summary>Here's a breakdown of the key differences between a traditional database and a vector database</summary>

**Traditional Databases**

*   **Structure:** Organize data into tables with rows (records) and columns (attributes). Think of a spreadsheet.
*   **Data Types:** Best suited for structured data like numbers, dates, and short strings.
*   **Queries:** Designed for exact matches, range queries, and aggregations (e.g., "Find all customers with age between 25 and 30").
*   **Indexing:** Use indexes to speed up searches based on specific columns (e.g., an index on customer ID).
*   **Examples:** MySQL, PostgreSQL, Oracle

**Vector Databases**

*   **Structure:** Store data as vectors (arrays of numbers) in a high-dimensional space. These vectors represent the meaning or features of the data.
*   **Data Types:** Optimized for unstructured data like text, images, audio, and video, which are converted into vectors using embedding models.
*   **Queries:** Designed for similarity searches (e.g., "Find the most similar images to this one").
*   **Indexing:** Use specialized indexing techniques (like Approximate Nearest Neighbor search) to efficiently find similar vectors in high-dimensional space.
*   **Examples:** LanceDB, Chroma, Pinecone, Weaviate, Milvus

**Here's a table summarizing the key differences:**

| Feature | Traditional Database | Vector Database |
|---|---|---|
| **Data Organization** | Tables with rows and columns | Vectors in high-dimensional space |
| **Data Types** | Structured (numbers, dates, strings) | Unstructured (text, images, audio) |
| **Query Type** | Exact matches, range queries | Similarity searches |
| **Indexing** | Standard indexes on columns | Specialized indexes for vector similarity |

**Why the Difference Matters**

*   **Different Use Cases:** Traditional databases are great for managing structured data and performing precise queries. Vector databases excel at handling unstructured data and finding items that are semantically similar.
*   **Performance:** Vector databases are optimized for the kind of queries that are essential for AI applications (like recommendation systems, semantic search, and RAG), which traditional databases would struggle with.

**In Simple Terms**

Imagine you have a library:

*   **Traditional Database:** It's like the card catalog, where you can look up books by title, author, or subject.
*   **Vector Database:** It's like having a map where books on similar topics are located close together. You can point to a topic on the map and instantly find all the related books.

**Key Takeaway:** Vector databases are a new kind of database designed specifically to handle the unique challenges of working with embeddings and performing similarity searches, which are fundamental to many AI applications.

</details>
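
At its core, a similarity search ranks stored vectors by how close they are to a query vector. The sketch below does this by brute force over a tiny in-memory list. The texts, vectors, and the `top_k` helper are made up for illustration; a real vector database would typically use an approximate nearest neighbor index rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Score every stored vector against the query and return the k best."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical (text, vector) pairs standing in for embedded chunks.
store = [
    ("Astronauts train for years.", [0.9, 0.1, 0.0]),
    ("Rockets burn liquid fuel.", [0.8, 0.3, 0.1]),
    ("Bake the bread for an hour.", [0.0, 0.2, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of a space-related question

results = top_k(query_vector, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two space-related entries rank highest because their vectors point in nearly the same direction as the query, while the baking entry scores zero and is left out.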

Here we will use LanceDB as our vector database for this example.  

In [None]:
from langchain.vectorstores import LanceDB

import lancedb

db = lancedb.connect("data/lancedb") # Connect and create a persistent store for the database

vectordb = LanceDB.from_documents(
    docs, 
    embeddings, 
    connection=db,
    table_name="lab_embeddings",
    mode="overwrite"
)  # This wrapper will embed and store the results in the table of the vector database

The documents have now been embedded and the results have been stored in the vector database.

Now we will do a simple retrieval to verify that we can get to the data stored in the vector database.

In [None]:
from langchain_core.vectorstores import VectorStoreRetriever

retriever = vectordb.as_retriever()

rdocs = retriever.invoke("what is their favorite recipe")

for rdoc in rdocs:
    print(rdoc.page_content)


The output above shows the chunks the retriever returned for the query passed to `.invoke`.

<h1>Putting it all together</h1>
<hr>

We have processed the data, embedded it, and stored it in a vector database.  Now we will test it out by adding the LLM portion and sending a query to the LLM using the data from above. We will use the same model we used for the embedding.

In [None]:
from langchain_ollama import OllamaLLM
from langchain.chains import RetrievalQA

llm = OllamaLLM(
    model="llama3.1:latest",
    base_url="http://192.168.15.91:11434"
)

qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

First question is something outside the context of the documents that have been loaded.

In [None]:
query = "Tell me about cats"

answer = qa_chain.invoke(query)
print(answer)

Second question is something inside the context of the documents that have been loaded.

In [None]:
query = "Tell me about astronauts"

answer = qa_chain.invoke(query)
print(answer)
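
To see what the chain does with the retrieved chunks, it helps to sketch the step by hand. `RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The snippet below is a rough approximation of that idea: the `build_prompt` helper, the prompt wording, and the sample chunks are assumptions for illustration, not LangChain's actual template.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical chunks standing in for what the retriever returned.
retrieved = [
    "Astronauts train for years before a mission.",
    "Life in orbit requires daily exercise.",
]
prompt = build_prompt("Tell me about astronauts", retrieved)
print(prompt)
```

This also suggests why an out-of-context question behaves differently: the retriever still returns its closest chunks, but they contain nothing relevant for the LLM to ground its answer in.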

<h1>Summary</h1>
<hr>
In summary, we learned about each piece required for a RAG pipeline.  Keep in mind there are many other options for the technologies we reviewed in this lab, each with a different purpose and need based on the context of the data being ingested. I hope this gives you a better understanding of the behind-the-scenes workings of a RAG chatbot and piques some curiosity to look deeper into the technologies and data preparation and processing methods.