# Les Misér-AI-bles: Batching Through the Barricades
## Data Preparation

In [0]:
%pip install bs4 langchain-text-splitters databricks-agents --quiet
%restart_python

## A Chunky View on Les Miserables

This notebook takes Les Miserables from Project Gutenberg and prepares a Spark DataFrame of the chunked sections. It works fine on standard clusters or serverless.

In [0]:
from langchain_text_splitters.html import HTMLHeaderTextSplitter
import requests

response = requests.get('https://www.gutenberg.org/cache/epub/135/pg135-images.html', timeout=60)
response.raise_for_status()  # fail fast if the download did not succeed
miserables_text = response.text

headers_to_split_on = [
    ("h2", "Header 1"),
    ("h3", "Header 2"),
    ("h4", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(miserables_text)
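
To make the idea of header-based splitting concrete, here is a toy sketch built on the standard library's `html.parser`. It is not the LangChain implementation and will not reproduce its output on the real book; the `ToyHeaderSplitter` class, `split_by_headers` helper, and `sample_html` string are all hypothetical names invented for illustration. It only shows the core behavior: body text is collected into a chunk, and each chunk carries the headers that were in scope when it was read.

```python
from html.parser import HTMLParser


class ToyHeaderSplitter(HTMLParser):
    """Toy header-based splitter. NOT the LangChain implementation."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)            # tag -> metadata key
        self.order = [tag for tag, _ in headers_to_split_on]
        self.scope = {}           # headers currently in scope
        self.open_header = None   # tag name while inside a tracked header
        self.header_parts = []    # text pieces of the header being read
        self.body_parts = []      # body text collected since the last header
        self.chunks = []

    def _flush(self):
        # Emit the collected body text as one chunk, if there is any.
        text = " ".join(" ".join(self.body_parts).split())
        if text:
            self.chunks.append({"metadata": dict(self.scope), "page_content": text})
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.labels:
            self._flush()
            self.open_header = tag
            self.header_parts = []

    def handle_endtag(self, tag):
        if tag == self.open_header:
            level = self.order.index(tag)
            # A new header at this level clears it and any deeper headers.
            for deeper in self.order[level:]:
                self.scope.pop(self.labels[deeper], None)
            self.scope[self.labels[tag]] = " ".join("".join(self.header_parts).split())
            self.open_header = None

    def handle_data(self, data):
        if self.open_header:
            self.header_parts.append(data)
        else:
            self.body_parts.append(data)

    def close(self):
        super().close()
        self._flush()


def split_by_headers(html, headers_to_split_on):
    parser = ToyHeaderSplitter(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical miniature document, not taken from the Gutenberg file.
sample_html = """
<h2>VOLUME I</h2>
<h3>CHAPTER I</h3><p>First passage.</p>
<h3>CHAPTER II</h3><p>Second passage.</p>
"""
toy_chunks = split_by_headers(sample_html, [("h2", "Header 1"), ("h3", "Header 2")])
for chunk in toy_chunks:
    print(chunk["metadata"], "->", chunk["page_content"])
```

The point to take away is the metadata: both toy chunks share the same `Header 1` value but differ in `Header 2`, which is the shape the later cells rely on when they read `passage.metadata`.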

The main reason I love this dataset is that it has a nice distribution of lengths and provides a reasonably small test of LLM query performance. See the graph below. We keep chunks whose page content is longer than 1,000 characters and skip the first of them. This gives us 368 chunks, 365 of which are chapters.

In [0]:
import matplotlib.pyplot as plt
import numpy as np

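# Keep chunks longer than 1,000 characters, then skip the first one that remains.
valid_chunks = [x for x in html_header_splits if len(x.page_content) > 1000][1:]
# Length in characters of each kept chunk, used for the histogram.
valid_chunk_lengths = [len(x.page_content) for x in valid_chunks]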

plt.style.use('ggplot')
plt.hist(valid_chunk_lengths, bins=20, edgecolor='black')
plt.title('Histogram of Valid Chunk Lengths')
plt.xlabel('Chunk Length (characters)')
plt.ylabel('Frequency')
plt.show()
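
If you want to sanity-check the filter without downloading the book, here is a minimal sketch on toy data. The `Passage` dataclass is a hypothetical stand-in for the split documents, and `filter_chunks` with its `min_chars` and `skip_first` parameters is an illustrative helper, not something the notebook defines. It mirrors the strict `> 1000` comparison and the `[1:]` slice, then prints a few summary numbers in place of the histogram.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Passage:
    """Hypothetical stand-in for a split document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_chunks(splits, min_chars=1000, skip_first=1):
    """Keep passages longer than min_chars, then drop the first skip_first of them."""
    long_enough = [p for p in splits if len(p.page_content) > min_chars]
    return long_enough[skip_first:]


toy_splits = [
    Passage("x" * 1500),  # long enough, but first in line, so the slice drops it
    Passage("x" * 200),   # too short
    Passage("x" * 1000),  # exactly 1,000 characters: excluded by the strict >
    Passage("x" * 1001),
    Passage("x" * 4000),
]

kept = filter_chunks(toy_splits)
lengths = [len(p.page_content) for p in kept]
summary = {
    "count": len(lengths),
    "min": min(lengths),
    "median": median(lengths),
    "max": max(lengths),
}
print(summary)
```

Note that a passage of exactly 1,000 characters does not survive, since the comparison is strict.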

In [0]:
import pandas as pd

def extract_passage(passage):
    """Flatten a split passage into a dict of its headers and page content."""
    return {
        "header_1": passage.metadata.get("Header 1", ""),
        "header_2": passage.metadata.get("Header 2", ""),
        "page_content": passage.page_content,
    }
  
extracted_passages = [extract_passage(x) for x in valid_chunks]
# Drop passages that have no Header 2 (h3) value.
les_mis_df = pd.DataFrame(extracted_passages).query("header_2 != ''")

We also want to ensure we have a consistent prompt for extracting structured data, so we pregenerate it.

In [0]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def create_extraction_prompt(header1, header2, content):
    """
    Create a prompt for structured data extraction from Les Miserables passages.
    
    Args:
        header1 (str): The Header 1 content
        header2 (str): The Header 2 content
        content (str): The page content
        
    Returns:
        str: Formatted prompt string
    """
    prompt = f"""
         Take this passage from Les Miserables and do structured data extraction in JSON. I want you to provide the title of the chapter, a list of characters, a synopsis of the chapter, and the overall sentiment of the chapter - positive, neutral, or negative. Do not make up anything if the passage isn't part of the novel.
         
         {header1}
         {header2}
         {content}
         """
    return prompt

extraction_prompt_udf = udf(create_extraction_prompt, StringType())

les_mis_sp = spark.createDataFrame(les_mis_df).withColumn(
    "extraction_prompt",
    extraction_prompt_udf(col("header_1"), col("header_2"), col("page_content")),
)
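
To preview what a pregenerated prompt looks like without a Spark session, here is a plain-Python sketch. The `INSTRUCTIONS` string is a shortened stand-in for the full instruction text above, and `build_prompt` and `sample_rows` are hypothetical names with made-up row values. The idea it illustrates is the same one the UDF relies on: the prompt is a pure function of the three columns, so the same row always yields the same prompt.

```python
# Shortened stand-in for the notebook's instruction text (an assumption for brevity).
INSTRUCTIONS = (
    "Extract the chapter title, a list of characters, a synopsis, "
    "and the overall sentiment of this passage as JSON."
)


def build_prompt(header1, header2, content, instructions=INSTRUCTIONS):
    """Join the fixed instructions with the row's headers and passage text."""
    return f"{instructions}\n\n{header1}\n{header2}\n{content}"


# Hypothetical rows shaped like the DataFrame columns, not real passages.
sample_rows = [
    {"header_1": "VOLUME I", "header_2": "CHAPTER I", "page_content": "Sample passage text."},
    {"header_1": "VOLUME I", "header_2": "CHAPTER II", "page_content": "Another sample passage."},
]

prompts = [
    build_prompt(row["header_1"], row["header_2"], row["page_content"])
    for row in sample_rows
]
print(prompts[0])
```

Because every prompt shares the same instruction prefix and differs only in the row-specific tail, storing it as a column keeps later batch runs comparable.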

In [0]:
les_mis_sp.write.mode("overwrite").option("overwriteSchema", "true").format("delta").saveAsTable("shm.default.les_mis_w_prompt")