# Preprocessing Data for Embeddings

## Introduction to Preprocessing
Preprocessing is an essential step in preparing text for embedding. Before generating embeddings, a document is usually split into smaller units, such as sentences, sections, or fixed-size windows, so that each embedding represents one coherent piece of text rather than the entire document.

## Using a Chunker for Preprocessing
In this section, we will showcase how a `Chunker` can split data based on sentence boundaries or specific chunk sizes.

### Chunking by Sentence
We will start with an example of preprocessing data by splitting it into sentences before embedding.

In [4]:
from swarmauri.chunkers.concrete.SentenceChunker import SentenceChunker

In [7]:
# Initialize the chunker to split by sentence
chunker = SentenceChunker()

# Example document
document = "This is the first sentence. This is the second sentence. This is the third sentence."

# Preprocess by sentence
chunks = chunker.chunk_text(document)
print("Chunks (Sentence-based):", chunks)

Chunks (Sentence-based): ['This is the first sentence.', 'This is the second sentence.', 'This is the third sentence.']
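
To make the idea concrete, here is a minimal pure-Python sketch of sentence-based chunking. It is not how `SentenceChunker` is implemented; the function name `split_sentences` and the regular expression are assumptions chosen for illustration. The sketch splits after `.`, `!`, or `?` whenever whitespace follows.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after ., ! or ? when followed by whitespace (illustrative only)."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


document = "This is the first sentence. This is the second sentence. This is the third sentence."
sentence_chunks = split_sentences(document)
print("Chunks (regex sketch):", sentence_chunks)
```

A rule this simple will also split after abbreviations such as "Dr." or "e.g.", which is one reason to prefer a dedicated sentence chunker for real documents.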


### Chunking by Size

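Next, we chunk the text by size instead of by sentence boundary. `FixedLengthChunker` is used here with its default settings. The example document is evidently shorter than the default chunk length, so it comes back as a single chunk; to obtain smaller pieces (for instance, chunks of 10 tokens), a smaller chunk size would have to be configured.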

In [13]:
from swarmauri.chunkers.concrete.FixedLengthChunker import FixedLengthChunker

# Basic usage of FixedLengthChunker
fixed_chunker = FixedLengthChunker()

# Checking the resource and type attributes
print(f"Resource: {fixed_chunker.resource}")
print(f"Type: {fixed_chunker.type}")

# Preprocess by chunk size
chunks_by_size = fixed_chunker.chunk_text(document)
print("Chunks (Size-based):", chunks_by_size)

Resource: Chunker
Type: FixedLengthChunker
Chunks (Size-based): ['This is the first sentence. This is the second sentence. This is the third sentence.']
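
Since the default settings return the whole document as one chunk, the following sketch shows what splitting into chunks of 10 tokens could look like. It is a standalone illustration, not the `FixedLengthChunker` implementation: the helper name `chunk_by_tokens`, its `chunk_size` parameter, and the choice to treat whitespace-separated words as tokens are all assumptions.

```python
def chunk_by_tokens(text: str, chunk_size: int = 10) -> list[str]:
    """Group whitespace-separated tokens into chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    tokens = text.split()
    # Step through the token list in strides of chunk_size.
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


document = "This is the first sentence. This is the second sentence. This is the third sentence."
token_chunks = chunk_by_tokens(document, chunk_size=10)
for number, chunk in enumerate(token_chunks, start=1):
    print(f"Chunk {number}: {chunk}")
```

The document contains 15 whitespace-separated tokens, so this yields one chunk of 10 tokens and one of 5. Note that the second chunk boundary happens to fall between sentences here; in general, fixed-size chunks cut through sentences.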


## Generating Embeddings after Chunking

Once the document has been chunked, `TfidfEmbedding` can generate one embedding per chunk. Here we use the sentence-based chunks defined earlier; each result is a `Vector` object whose `value` field holds the TF-IDF weights for that chunk.

In [15]:
from swarmauri.embeddings.concrete.TfidfEmbedding import TfidfEmbedding

# Initialize the TF-IDF embedder
embedder = TfidfEmbedding()

# Generate embeddings for chunks (sentence-based)
sentence_embeddings = embedder.fit_transform(chunks)
for i, embedding in enumerate(sentence_embeddings):
    print(f"Embedding for Chunk {i+1}: {embedding}")


Embedding for Chunk 1: name=None id='ea8653ef-f29a-40f2-bb85-36a224cecf44' members=[] owner=None host=None resource='Vector' version='0.1.0' type='Vector' value=[0.6461289150464732, 0.3816141458138271, 0.0, 0.3816141458138271, 0.3816141458138271, 0.0, 0.3816141458138271]
Embedding for Chunk 2: name=None id='70055066-8fe3-45ab-be8c-2b4e478b954b' members=[] owner=None host=None resource='Vector' version='0.1.0' type='Vector' value=[0.0, 0.3816141458138271, 0.6461289150464732, 0.3816141458138271, 0.3816141458138271, 0.0, 0.3816141458138271]
Embedding for Chunk 3: name=None id='357d9fbd-da17-413c-a303-16c9592c3d52' members=[] owner=None host=None resource='Vector' version='0.1.0' type='Vector' value=[0.0, 0.3816141458138271, 0.0, 0.3816141458138271, 0.3816141458138271, 0.6461289150464732, 0.3816141458138271]
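
The values above can be reproduced by hand. The sketch below assumes a smoothed, L2-normalized TF-IDF weighting, with idf = ln((1 + n) / (1 + df)) + 1, lowercase tokens of at least two characters, and an alphabetically sorted vocabulary. These are the defaults of scikit-learn's `TfidfVectorizer`; that `TfidfEmbedding` behaves the same way is an assumption, although the printed numbers are consistent with it. The function name `tfidf_vectors` is hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_vectors(chunks: list[str]) -> tuple[list[str], list[list[float]]]:
    """Compute smoothed, L2-normalized TF-IDF vectors (illustrative sketch)."""
    # Lowercase and keep tokens made of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", chunk.lower()) for chunk in chunks]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(chunks)
    # Document frequency: number of chunks that contain each term.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        for term in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(weight * weight for weight in raw))
        vectors.append([weight / norm if norm else 0.0 for weight in raw])
    return vocabulary, vectors


chunks = [
    "This is the first sentence.",
    "This is the second sentence.",
    "This is the third sentence.",
]
vocabulary, vectors = tfidf_vectors(chunks)
print("Vocabulary:", vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(f"Chunk {number}:", [round(weight, 4) for weight in vector])
```

Under these assumptions the seven vector positions correspond to the terms `first`, `is`, `second`, `sentence`, `the`, `third`, `this`. The four words shared by every chunk receive the same weight, while the one word that distinguishes each sentence receives the largest weight in that sentence's vector.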


## Explanation

Chunk size affects both the quality and the interpretability of embeddings, especially where context matters. Larger chunks (such as whole documents) capture more context, but they can blend unrelated topics into a single vector, which adds noise. Smaller chunks (such as sentences) tend to produce more focused embeddings, which helps with tasks that need to match or retrieve individual statements, at the cost of losing some surrounding context.
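
One way to see how focused the sentence-level embeddings are is to compare two of them with cosine similarity. The sketch below copies the first two vectors from the TF-IDF output shown earlier; the helper `cosine_similarity` is written here for illustration and is not part of the library used in this notebook.

```python
import math

# Weights copied from the printed embeddings for Chunk 1 and Chunk 2.
SHARED = 0.3816141458138271   # weight of a word that appears in every chunk
UNIQUE = 0.6461289150464732   # weight of the word that appears in one chunk only

chunk_1 = [UNIQUE, SHARED, 0.0, SHARED, SHARED, 0.0, SHARED]
chunk_2 = [0.0, SHARED, UNIQUE, SHARED, SHARED, 0.0, SHARED]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(chunk_1, chunk_2)
print(f"Similarity between chunk 1 and chunk 2: {similarity:.4f}")
```

The two sentences share four of their five words, yet the similarity comes out at roughly 0.58 rather than close to 1, because the single distinguishing word carries the largest weight in each vector. In a larger chunk that mixed several sentences, that distinction would be diluted.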

## Notebook Metadata

In [16]:
import os
import platform
import sys
from datetime import datetime

author_name = "Huzaifa Irshad"
github_username = "irshadhuzaifa"

print(f"Author: {author_name}")
print(f"GitHub Username: {github_username}")

notebook_file = "Notebook_02_Preprocessing_Data_For_Embeddings.ipynb"
try:
    last_modified_time = os.path.getmtime(notebook_file)
    last_modified_datetime = datetime.fromtimestamp(last_modified_time)
    print(f"Last Modified: {last_modified_datetime}")
except Exception as e:
    print(f"Could not retrieve last modified datetime: {e}")

print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python Version: {sys.version}")

try:
    import swarmauri
    print(f"Swarmauri Version: {swarmauri.__version__}")
except ImportError:
    print("Swarmauri is not installed.")

Author: Huzaifa Irshad 
GitHub Username: irshadhuzaifa
Last Modified: 2024-10-18 11:15:10.217636
Platform: Windows 11
Python Version: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:17:27) [MSC v.1929 64 bit (AMD64)]
Swarmauri Version: 0.5.0
