# Introducing Chunking with the Partition Transformer in Spark NLP
This notebook demonstrates how to use **Spark NLP's PartitionTransformer** for chunking of documents, enabling efficient text segmentation.

We further showcase a practical application of this chunking strategy in the context of **Retrieval-Augmented Generation (RAG)**.

RAG can improve the answers of large language models by supplying them with context-relevant passages retrieved from a knowledge base, and well-formed chunks are what make those passages retrievable.

Creating Files

In [5]:
!echo -e "Introduction: RAG stands for Retrieval-Augmented Generation. Why RAG? It improves factual accuracy and adds fresh or private data to LLMs. Chunking: Breaks documents into pieces so they can be embedded. Semantic Chunking: Focus on respecting document structure like sections. Summary: RAG is powerful when paired with good chunking!" > rag_intro.txt

In [6]:
!echo -e "Tomatoes grow best in warm weather with plenty of sun. It's important to water them regularly and use nutrient-rich soil. They are typically planted after the last frost and harvested in late summer." > tomatoes.txt

In [7]:
!cat rag_intro.txt

Introduction: RAG stands for Retrieval-Augmented Generation. Why RAG? It improves factual accuracy and adds fresh or private data to LLMs. Chunking: Breaks documents into pieces so they can be embedded. Semantic Chunking: Focus on respecting document structure like sections. Summary: RAG is powerful when paired with good chunking!


In [8]:
!cat tomatoes.txt

Tomatoes grow best in warm weather with plenty of sun. It's important to water them regularly and use nutrient-rich soil. They are typically planted after the last frost and harvested in late summer.


In [9]:
!mkdir -p txt-data
!cp rag_intro.txt txt-data/rag_intro.txt
!cp tomatoes.txt txt-data/tomatoes.txt
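
The shell cells above rely on notebook `!` magics. For environments without them, a rough Python equivalent can be sketched with `pathlib`. The helper name `write_sample_files`, the temporary directory, and the shortened sample texts are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path
import tempfile

def write_sample_files(target_dir, files):
    """Write each {filename: text} pair into target_dir, creating it if needed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # safe to re-run, like `mkdir -p`
    paths = []
    for name, text in files.items():
        path = target / name
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return sorted(paths)

# Hypothetical, shortened sample texts; the notebook cells use longer ones.
samples = {
    "rag_intro.txt": "Introduction: RAG stands for Retrieval-Augmented Generation.",
    "tomatoes.txt": "Tomatoes grow best in warm weather with plenty of sun.",
}

demo_dir = Path(tempfile.mkdtemp()) / "txt-data"
written = write_sample_files(demo_dir, samples)
print([p.name for p in written])
```

Writing into a temporary directory keeps the sketch from overwriting the `txt-data` folder used by the rest of the notebook.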

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp

spark = sparknlp.start()

## Partitioning Documents

Partition Transformer

In [10]:
from pyspark.ml import Pipeline
from sparknlp.partition.partition_transformer import *

empty_df = spark.createDataFrame([], "string").toDF("text")

partition_transformer = PartitionTransformer() \
    .setInputCols(["text"]) \
    .setContentType("text/plain") \
    .setContentPath("./txt-data") \
    .setOutputCol("chunks") \
    .setChunkingStrategy("basic") \
    .setMaxCharacters(140)

pipeline = Pipeline(stages=[
    partition_transformer
])

pipeline_model = pipeline.fit(empty_df)
result_df = pipeline_model.transform(empty_df)

result_df.show()

+--------------------+--------------------+--------------------+--------------------+
|                path|             content|                text|              chunks|
+--------------------+--------------------+--------------------+--------------------+
|file:/content/txt...|Tomatoes grow bes...|[{NarrativeText, ...|[{document, 0, 19...|
|file:/content/txt...|Introduction: RAG...|[{NarrativeText, ...|[{document, 0, 33...|
+--------------------+--------------------+--------------------+--------------------+
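
To build intuition for what a character budget such as `setMaxCharacters(140)` implies, here is a minimal pure-Python sketch of greedy, character-bounded chunking. It is an assumption-laden illustration, not Spark NLP's actual "basic" strategy: the function name `chunk_by_max_chars` and the word-packing rule are hypothetical.

```python
def chunk_by_max_chars(text, max_chars):
    """Greedily pack whitespace-separated words into chunks of at most max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # In this sketch, a single word longer than max_chars is kept whole.
            current = word
    if current:
        chunks.append(current)
    return chunks

text = ("Tomatoes grow best in warm weather with plenty of sun. "
        "It's important to water them regularly and use nutrient-rich soil.")
for chunk in chunk_by_max_chars(text, 60):
    print(len(chunk), repr(chunk))
```

A real chunker would usually also respect element boundaries (titles, paragraphs) rather than splitting purely on a character count.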



In [11]:
result_df.select("chunks").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunks                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

RAG Pipeline

In [12]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

tokenizer = Tokenizer() \
    .setInputCols(["chunks"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained() \
    .setInputCols(["chunks", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["chunks", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

finisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols(["finished_sentence_embeddings"]) \
    .setOutputAsVector(True)

rag_pipeline = Pipeline(stages=[
    partition_transformer,
    tokenizer,
    bert_embeddings,
    sentence_embeddings,
    finisher
])

small_bert_L2_768 download started this may take some time.
Approximate size to download 135.3 MB
[OK!]
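
The `AVERAGE` pooling strategy used above collapses the per-token vectors of a chunk into a single vector. The idea can be sketched in plain Python as an element-wise mean. The function name `average_pool` and the toy 3-dimensional vectors are assumptions for illustration only; real BERT embeddings have hundreds of dimensions.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]

# Made-up token vectors for a two-token chunk.
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(average_pool(token_vectors))  # [2.0, 3.0, 4.0]
```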


Embed a Knowledge Base

In [13]:
rag_model = rag_pipeline.fit(empty_df)
kb_df = rag_model.transform(empty_df)

In [14]:
kb_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+
|                path|             content|                text|              chunks|               token|          embeddings| sentence_embeddings|finished_sentence_embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+
|file:/content/txt...|Tomatoes grow bes...|[{NarrativeText, ...|[{document, 0, 19...|[{token, 0, 7, To...|[{word_embeddings...|[{sentence_embedd...|        [[0.6935687065124...|
|file:/content/txt...|Introduction: RAG...|[{NarrativeText, ...|[{document, 0, 33...|[{token, 0, 11, I...|[{word_embeddings...|[{sentence_embedd...|        [[0.5774036645889...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+

In [15]:
kb_df.select("chunks").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunks                                                                                                                                                                                                                                                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Next, we prepare the output of the Spark NLP RAG pipeline by aligning each chunk of text with its embedding vector.

In [16]:
from pyspark.sql.functions import posexplode, monotonically_increasing_id
from pyspark.ml.functions import vector_to_array

# Tag each document row with an id so chunks and vectors can be matched up later
kb_df = kb_df.withColumn("doc_id", monotonically_increasing_id())

# One row per chunk text, keeping its position within the document
exploded_chunks = kb_df.selectExpr("doc_id", "chunks.result as chunks") \
                       .select(posexplode("chunks").alias("pos", "chunk_text"), "doc_id")

# One row per embedding vector, with the same position index
exploded_vectors = kb_df.selectExpr("doc_id", "finished_sentence_embeddings as vectors") \
                        .select(posexplode("vectors").alias("pos", "vector"), "doc_id")

# Pair each chunk with its vector by matching on (doc_id, pos)
aligned_df = exploded_chunks.join(exploded_vectors, on=["doc_id", "pos"]).select("doc_id", "chunk_text", "vector")

# Convert the ML vector into a plain array of doubles
aligned_df = aligned_df.withColumn("vector", vector_to_array("vector"))

In [17]:
aligned_df_clean = aligned_df.select("doc_id", "chunk_text", "vector").cache()
aligned_df_clean.show()

+------+--------------------+--------------------+
|doc_id|          chunk_text|              vector|
+------+--------------------+--------------------+
|     0|Tomatoes grow bes...|[0.69356870651245...|
|     1|Introduction: RAG...|[0.57740366458892...|
+------+--------------------+--------------------+
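
The alignment step pairs the i-th chunk of a document with the i-th embedding of the same document. The following pure-Python sketch mimics that explode-then-join logic on toy data. The helper names (`posexplode` as a plain generator, `align`) and the sample rows are hypothetical and only mirror the Spark operations conceptually.

```python
def posexplode(rows, column):
    """Yield (doc_id, pos, item) for each item of a list-valued column."""
    for row in rows:
        for pos, item in enumerate(row[column]):
            yield row["doc_id"], pos, item

def align(rows):
    """Pair each chunk text with the vector at the same (doc_id, pos)."""
    vectors = {(doc_id, pos): vec for doc_id, pos, vec in posexplode(rows, "vectors")}
    return [
        (doc_id, text, vectors[(doc_id, pos)])
        for doc_id, pos, text in posexplode(rows, "chunks")
        if (doc_id, pos) in vectors  # inner-join semantics: drop unmatched chunks
    ]

# Toy rows: document 0 has two chunks, document 1 has one.
rows = [
    {"doc_id": 0, "chunks": ["a", "b"], "vectors": [[0.1, 0.2], [0.3, 0.4]]},
    {"doc_id": 1, "chunks": ["c"], "vectors": [[0.5, 0.6]]},
]
for record in align(rows):
    print(record)
```

Joining on both the document id and the position is what keeps chunks from different documents, or different positions, from being paired with the wrong vector.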



Query Pipeline

In [18]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["sentence", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

query_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    bert_embeddings,
    sentence_embeddings,
    finisher
])

small_bert_L2_768 download started this may take some time.
Approximate size to download 135.3 MB
[OK!]


In [19]:
query = "What is semantic chunking?"
query_df = spark.createDataFrame([[query]]).toDF("text")
query_model = query_pipeline.fit(query_df)
# query_model = rag_pipeline.fit(query_df)
query_result = query_model.transform(query_df)

In [20]:
query_result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+
|                text|            document|            sentence|               token|          embeddings| sentence_embeddings|finished_sentence_embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+
|What is semantic ...|[{document, 0, 25...|[{document, 0, 25...|[{token, 0, 3, Wh...|[{word_embeddings...|[{sentence_embedd...|        [[0.3536282181739...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+



In [21]:
# The query is a single sentence, so take the first sentence embedding of the only row
query_vector = query_result.select("finished_sentence_embeddings").first()[0][0]

In [22]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType
import numpy as np

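def cosine_sim(vec1, vec2):
    # Cosine similarity: dot product divided by the product of the vector norms
    v1, v2 = np.array(vec1), np.array(vec2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Wrap the function as a Spark UDF that compares each chunk vector with the query vector
cosine_sim_udf = udf(lambda v: cosine_sim(v, query_vector), FloatType())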

# Add similarity score to each chunk
scored_chunks = aligned_df_clean.withColumn("similarity", cosine_sim_udf(col("vector"))) \
                          .orderBy(col("similarity").desc())

In [25]:
scored_chunks.show()

+------+--------------------+--------------------+----------+
|doc_id|          chunk_text|              vector|similarity|
+------+--------------------+--------------------+----------+
|     1|Introduction: RAG...|[0.57740366458892...|0.61944675|
|     0|Tomatoes grow bes...|[0.69356870651245...| 0.2762234|
+------+--------------------+--------------------+----------+
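
The ranking above boils down to scoring every chunk against the query and keeping the best matches. A dependency-free sketch of that retrieval step might look as follows. The names `cosine_similarity` and `top_k`, the zero-vector guard, and the 2-dimensional toy vectors are assumptions for illustration, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # sketch-level choice: treat zero vectors as dissimilar
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunks, k=1):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy 2-dimensional vectors standing in for real embeddings.
chunks = [("tomato chunk", [1.0, 0.0]), ("rag chunk", [0.6, 0.8])]
query = [0.0, 1.0]
print(top_k(query, chunks, k=1))
```

In a full RAG setup, the text of the top-ranked chunks would then be passed to the language model as context alongside the question.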

