# Document Service

This notebook forms the core of our document service. It shows how we plan to simplify our document intelligence application using Lakebase and serverless jobs. It has been tested on Serverless Version 3. The job takes a single file or a directory, parses every file, and appends the results to a Postgres table. We can then generate embeddings and use pgvector as the backend for a LangGraph agent.

We use our Databricks user IDs as the main entry point into the workflow and for authentication.

In [0]:
%pip install databricks-langchain databricks-sdk --upgrade
%restart_python

In [0]:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
me = w.current_user.me()
print(me.id)  # This is your Databricks user ID
print(me.user_name) 
USER_ID = me.id

In [0]:
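# Import only the functions the notebook uses; a wildcard import would
# shadow Python builtins such as sum, min, and max.
from pyspark.sql.functions import (
    array, col, current_timestamp, expr, lit, parse_json, struct, to_json
)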

dbutils.widgets.text(
  "file_path", 
  "/Volumes/fnf_demo/income_verification_demo/documents"
  )
dbutils.widgets.text(
  "embedding_endpoint", 
  'databricks-gte-large-en'
  )
dbutils.widgets.text(
  "database_instance", 
  'shm'
  )
dbutils.widgets.text(
  "doc_id", 
  ''
  )

In [0]:
config = dbutils.widgets.getAll()
config
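
The `doc_id` widget defaults to an empty string, so a run that does not set it would write chunks with a blank document ID. Below is a minimal sketch of one way to guard against that. It assumes that a blank `doc_id` should fall back to a generated UUID and that the other three widgets are required. `resolve_config` and `REQUIRED_KEYS` are hypothetical names, not part of the notebook.

```python
import uuid

# Assumption: these widgets must be non-empty for the job to run.
REQUIRED_KEYS = ("file_path", "embedding_endpoint", "database_instance")

def resolve_config(raw: dict) -> dict:
    """Return a cleaned copy of the widget values (hypothetical helper)."""
    # Widget values are strings; strip stray whitespace from each one.
    cleaned = {key: (value or "").strip() for key, value in raw.items()}
    missing = [key for key in REQUIRED_KEYS if not cleaned.get(key)]
    if missing:
        raise ValueError(f"Missing required widget values: {missing}")
    # Assumption: a blank doc_id means "new document", so generate one.
    if not cleaned.get("doc_id"):
        cleaned["doc_id"] = str(uuid.uuid4())
    return cleaned
```

In the notebook this could be called as `config = resolve_config(dbutils.widgets.getAll())`.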

We use ai_parse_document in a serverless job as our document processing service. This step could be any isolated microservice and leaves plenty of room for optimization, but ai_parse_document works well and handles many file types.

In [0]:
parsed_df = (
    spark.read.format("binaryFile")
    .load(config.get("file_path"))
    .withColumn("user_id", lit(USER_ID))
    .select(
        col("path"),
        col("user_id"),
        expr("ai_parse_document(content)").alias("parsed")
    )
    .withColumn(
        "parsed_json",
        parse_json(col("parsed").cast("string"))
    )
    .select(
        col("path"),
        col("user_id"),
        expr("parsed_json:document:pages").alias("pages"),
        expr("parsed_json:document:elements").alias("elements"),
        expr("parsed_json:document:_corrupted_data").alias("_corrupted_data")
    )
)

In [0]:
display(parsed_df)

To get something simple and working, I propose that we treat each page as one chunk for now. We can refine the chunking strategy in this job later, but this gives us a good starting point. The embedding call (ai_query) also runs inside the Spark job, so it scales horizontally with the data.

In [0]:
from pyspark.sql.functions import from_json, explode, col, concat_ws, lit, expr
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType
import uuid

# Define schema for pages based on provided example
page_schema = StructType([
    StructField("content", StringType()),
    StructField("footer", StringType()),
    StructField("header", StringType()),
    StructField("id", IntegerType()),
    StructField("page_number", IntegerType())
])

chunked_pages = (
    parsed_df
    .withColumn(
        "pages_array",
        from_json(
            col("pages").cast("string"),
            ArrayType(page_schema)
        )
    )
    .withColumn(
        "page_chunk",
        explode(col("pages_array"))
    )
    .withColumn(
        "doc_id",
        lit(config['doc_id'])
    )
    .select(
        # uuid() is evaluated per row; lit(str(uuid.uuid4())) would stamp
        # one driver-side UUID onto every chunk.
        expr("uuid()").alias("id"),
        col("doc_id"),
        array(col("page_chunk.id").cast("string")).alias("page_ids"),
        concat_ws(
            "\n",
            concat_ws("", lit("Content: ["), col("page_chunk.content"), lit("]")),
            concat_ws("", lit("Footer: ["), col("page_chunk.footer"), lit("]")),
            concat_ws("", lit("Header: ["), col("page_chunk.header"), lit("]")),
            concat_ws("", lit("ID: ["), col("page_chunk.id").cast("string"), lit("]")),
            concat_ws("", lit("Page Number: ["), col("page_chunk.page_number").cast("string"), lit("]"))
        ).alias("content")
    )
    .withColumn("embedding", expr(f"ai_query('{config.get('embedding_endpoint')}', content)"))
    .withColumn("metadata", to_json(struct(col("_metadata"))))
    .withColumn("created_at", current_timestamp())
)

display(chunked_pages)
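
To make the per-page chunk format concrete, here is a plain-Python sketch of the string that the `concat_ws` expression above builds for one page. It assumes the page fields listed in `page_schema`. `format_page_chunk` is a hypothetical helper for illustration only and is not used by the Spark job, and the sample page is made up. Spark's `concat_ws` skips null inputs, so a missing field renders as empty brackets, which the sketch mirrors.

```python
from typing import Optional

def _bracket(label: str, value: Optional[object]) -> str:
    # concat_ws drops nulls, so a missing value leaves "Label: []".
    return f"{label}: [{'' if value is None else value}]"

def format_page_chunk(page: dict) -> str:
    """Build the chunk text for one parsed page (illustrative mirror)."""
    return "\n".join([
        _bracket("Content", page.get("content")),
        _bracket("Footer", page.get("footer")),
        _bracket("Header", page.get("header")),
        _bracket("ID", page.get("id")),
        _bracket("Page Number", page.get("page_number")),
    ])

# Made-up sample page with no header or footer.
sample_page = {"content": "Gross pay: 4,200", "id": 0, "page_number": 1}
print(format_page_chunk(sample_page))
```

Because the labels are always present, even a page with no header or footer contributes the same five lines to the embedded text. That is worth keeping in mind when the chunking strategy is refined.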

In [0]:
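# toPandas() collects every row to the driver, so this suits small batches.
chunked_pages_pd = chunked_pages.toPandas()
# Embeddings arrive as array-like values; convert each one to a plain Python list.
chunked_pages_pd['embedding'] = chunked_pages_pd['embedding'].apply(list)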

In [0]:
chunked_pages_pd
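
The introduction describes appending these rows to a Postgres table backed by pgvector, a step this notebook stops short of. As a sketch of the final preparation step: pgvector accepts vectors in a bracketed text form such as `[0.1,0.2,0.3]`. The helpers below turn one row of `chunked_pages_pd` into parameters for a parameterized `INSERT`. The table name `document_chunks`, its column list, the psycopg-style `%s` placeholders, and both function names are assumptions for illustration, not the service's actual schema.

```python
from typing import Sequence

# Assumption: hypothetical target table and columns, psycopg-style placeholders.
INSERT_SQL = (
    "INSERT INTO document_chunks (id, doc_id, content, embedding) "
    "VALUES (%s, %s, %s, %s::vector)"
)

def to_pgvector_literal(embedding: Sequence[float]) -> str:
    """Render an embedding in pgvector's text form, e.g. "[0.1,2.0]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

def to_insert_params(row: dict) -> tuple:
    """Map one chunk row to the parameter tuple expected by INSERT_SQL."""
    return (
        row["id"],
        row["doc_id"],
        row["content"],
        to_pgvector_literal(row["embedding"]),
    )
```

With a DB-API cursor, the rows from `chunked_pages_pd.to_dict("records")` could then be passed through `to_insert_params` and sent with `cursor.executemany(INSERT_SQL, params)`.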