### Markdown

Separators:

- `\n#{1,6}` - A new line followed by a header (H1 through H6)
- `` ```\n `` - Code blocks
- `\n\\*\\*\\*+\n` - Horizontal lines
- `\n---+\n` - Horizontal lines
- `\n___+\n` - Horizontal lines
- `\n\n` - Double new lines
- `\n` - New lines
- `" "` - Spaces
- `""` - Characters

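To make the priority order concrete, here is a minimal sketch of recursive splitting with the Markdown separators. It is a simplification, not LangChain's implementation: `recursive_split`, `MARKDOWN_SEPARATORS`, and `sample` are hypothetical names, the trailing space in the header pattern is an assumption, and the step that merges small pieces back together up to `chunk_size` is left out.

```python
import re

# Separators in priority order, written as regular expressions.
# Assumption: the header pattern ends with a space ("## Title").
MARKDOWN_SEPARATORS = [
    r"\n#{1,6} ", r"```\n", r"\n\*\*\*+\n", r"\n---+\n", r"\n___+\n",
    r"\n\n", r"\n", r" ", r"",
]


def recursive_split(text, separators, chunk_size):
    """Split on the highest-priority separator, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # A lookahead keeps each separator attached to the piece that follows it.
    pieces = [p for p in re.split(f"(?={sep})", text) if p]
    chunks = []
    for piece in pieces:
        # Oversized pieces fall through to the lower-priority separators.
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = (
    "# Fun in California\n\n## Driving\n\n"
    "Try driving on the 1 down to San Diego\n\n## Hiking\n\nGo to Yosemite"
)
chunks = recursive_split(sample, MARKDOWN_SEPARATORS, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Because nothing is merged afterwards, this sketch can emit tiny whitespace-only pieces; a real splitter would glue neighbouring pieces together until they approach `chunk_size`.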
In [1]:
from langchain.text_splitter import MarkdownTextSplitter

In [2]:
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)

In [3]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [4]:
splitter.create_documents([markdown_text])


[Document(page_content='# Fun in California\n\n## Driving'),
 Document(page_content='Try driving on the 1 down to San Diego'),
 Document(page_content='### Food'),
 Document(page_content="Make sure to eat a burrito while you're"),
 Document(page_content='there'),
 Document(page_content='## Hiking\n\nGo to Yosemite')]

### Python
Separators:

- `\nclass` - Classes first
- `\ndef` - Functions next
- `\n\tdef` - Indented functions
- `\n\n` - Double new lines
- `\n` - New lines
- `" "` - Spaces
- `""` - Characters

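A quick way to see which of these separators can actually fire on a piece of code is to count their occurrences. The sketch below uses plain string counting on a copy of the sample used in this section; `PYTHON_SEPARATORS`, `code`, and `counts` are hypothetical names, and the trailing spaces after `class` and `def` are an assumption about the exact separator strings.

```python
# Assumed separator strings, in priority order.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]

code = (
    "\nclass Person:\n"
    "  def __init__(self, name, age):\n"
    "    self.name = name\n"
    "    self.age = age\n"
    "\n"
    'p1 = Person("John", 36)\n'
    "\n"
    "for i in range(10):\n"
    "    print (i)\n"
)

# Count how often each non-empty separator occurs in the sample.
counts = {sep: code.count(sep) for sep in PYTHON_SEPARATORS if sep}
for sep, n in counts.items():
    print(repr(sep), n)
```

In this sample the method is indented with two spaces rather than a tab, so neither `def` separator matches; the class separator and the blank lines are the only high-priority split points.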
In [5]:
from langchain.text_splitter import PythonCodeTextSplitter


In [6]:
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [7]:
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)


In [8]:
python_splitter.create_documents([python_text])


[Document(page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age'),
 Document(page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print (i)')]

### PDFs w/ tables

PDFs are an extremely common data type for language model work. They often contain tables that hold important information.

This could be financial data, studies, academic papers, etc.

A very convenient way to extract those tables is with `Unstructured`, a library dedicated to making your data LLM-ready.

- To run it locally, you need to install `unstructured[local-inference]`:

    `pip install unstructured[local-inference]`

- You may encounter "poppler not found" and "tesseract not found" errors. Install both and add them to your system `PATH`.




In [8]:
import os

os.environ["UNSTRUCTURED_API_KEY"] = ""  # paste your own Unstructured API key here; never commit a real key
os.environ["UNSTRUCTURED_API_URL"] = "https://api.unstructured.io/general/v0/general"

In [None]:
%pip install unstructured[local-inference]

In [1]:
from unstructured.partition.pdf import partition_pdf




In [2]:
filename = "data/SalesforceFinancial.pdf"
# Extracts the elements from the PDF
elements = partition_pdf(
    filename=filename,

    # Unstructured Helpers
    strategy="hi_res", 
    infer_table_structure=True, 
    model_name="yolox",
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
elements

[<unstructured.documents.elements.NarrativeText at 0x1fb787e3790>,
 <unstructured.documents.elements.NarrativeText at 0x1fb787e30d0>,
 <unstructured.documents.elements.NarrativeText at 0x1fb787e3be0>,
 <unstructured.documents.elements.NarrativeText at 0x1fb78813c10>,
 <unstructured.documents.elements.NarrativeText at 0x1fb78813d00>,
 <unstructured.documents.elements.NarrativeText at 0x1fb78813eb0>,
 <unstructured.documents.elements.NarrativeText at 0x1fb78813040>,
 <unstructured.documents.elements.NarrativeText at 0x1fb788133a0>,
 <unstructured.documents.elements.Title at 0x1fb787e3190>,
 <unstructured.documents.elements.Title at 0x1fb787e3070>,
 <unstructured.documents.elements.Table at 0x1fb78823b20>,
 <unstructured.documents.elements.NarrativeText at 0x1fb787e3430>,
 <unstructured.documents.elements.Text at 0x1fb787e3040>,
 <unstructured.documents.elements.Text at 0x1fb787e30a0>,
 <unstructured.documents.elements.Text at 0x1fb787e39d0>]

In [9]:
elements[-5].metadata.text_as_html

'<table><thead><tr><th>Revenue)</th><th>Guidance $7.69 - $7.70 Billion</th><th>Guidance $31.7 - $31.8 Billion</th></tr></thead><tbody><tr><td>Y/Y Growth</td><td>~21%</td><td>~20%</td></tr><tr><td>FX Impact?)</td><td>~($200M) y/y FX</td><td>~($600M) y/y FX®</td></tr><tr><td>GAAP operating margin</td><td></td><td>~3.8%</td></tr><tr><td>Non-GAAP operating margin)</td><td></td><td>~20.4%</td></tr><tr><td>GAAP earnings (loss) per share</td><td>($0.03) - ($0.02)</td><td>$0.38 - $0.40</td></tr><tr><td>Non-GAAP earnings per share</td><td>$1.01 - $1.02</td><td>$4.74 - $4.76</td></tr><tr><td>Operating Cash Flow Growth (Y/Y)</td><td></td><td>~21% - 22%</td></tr><tr><td>Current Remaining Performance Obligation Growth (Y/Y)</td><td>~15%</td><td></td></tr></tbody></table>'

That table may look messy, but because it's in HTML format, the LLM is able to parse it much more easily than if it were tab- or comma-separated. You can copy and paste that HTML into an online [HTML viewer](https://codebeautify.org/htmlviewer) to see it reconstructed.
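If you'd rather inspect the table in code than in a viewer, the standard library can turn the HTML back into rows. This is a minimal sketch: `TableRows` is a hypothetical helper (not part of `Unstructured`), and `html` is a shortened copy of the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Shortened copy of the text_as_html output shown above.
html = (
    "<table><thead><tr><th>Revenue)</th><th>Guidance $7.69 - $7.70 Billion</th>"
    "<th>Guidance $31.7 - $31.8 Billion</th></tr></thead><tbody>"
    "<tr><td>Y/Y Growth</td><td>~21%</td><td>~20%</td></tr>"
    "<tr><td>GAAP operating margin</td><td></td><td>~3.8%</td></tr>"
    "</tbody></table>"
)

parser = TableRows()
parser.feed(html)
rows = parser.rows
for row in rows:
    print(row)
```

Empty cells come back as empty strings, so every row keeps the same number of columns.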

**Important Point**: A common practice is to summarize the table after you've extracted it, then get an embedding of that summary. If the summary embedding matches what you're looking for, pass the raw table to your LLM.
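The summarize-then-retrieve pattern can be sketched with nothing but the standard library. Everything here is illustrative: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the summaries and table strings are placeholders rather than real extracted data.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Placeholder (summary, raw table) pairs; in practice an LLM writes the summary.
tables = [
    ("Revenue guidance and operating margin for the fiscal year",
     "<table><tr><td>placeholder revenue table</td></tr></table>"),
    ("Employee headcount by region",
     "<table><tr><td>placeholder headcount table</td></tr></table>"),
]

# Index the summary embeddings, but keep the raw table next to each one.
index = [(toy_embed(summary), raw_html) for summary, raw_html in tables]


def retrieve_raw_table(query):
    """Match the query against the summaries, then return the raw table."""
    query_vec = toy_embed(query)
    _, raw_html = max(index, key=lambda item: cosine(query_vec, item[0]))
    return raw_html


best_table = retrieve_raw_table("What is the revenue guidance?")
print(best_table)
```

The search runs over the summaries only; the raw HTML is what you would hand to the LLM once a summary matches.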

### Multi-Modal (text + images)



In [None]:
%pip install "unstructured[all-docs]"

In [11]:
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

In [12]:
filepath = "data/VisualInstruction.pdf"

In [13]:
# Get elements
raw_pdf_elements = partition_pdf(
    filename=filepath,
    
    # Using pdf format to find embedded image blocks
    extract_images_in_pdf=True,
    
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # Hard max on chunk size
    max_characters=4000,
    # Attempt to start a new chunk after 3800 chars
    new_after_n_chars=3800,
    # Attempt to keep chunks > 2000 chars by combining smaller text blocks
    combine_text_under_n_chars=2000,
    # Directory where the extracted images are written
    image_output_dir_path="data/pdfImages/",
)

Head over to data/pdfImages/ to check out the images that were parsed.

But the images don't do anything sitting in a folder; we need to do something with them! Though a bit outside the scope of chunking, let's talk about how to work with these.

Common tactics either use a multi-modal model to generate summaries of the images or use the image itself for your task. Others get embeddings of the images (with a model like CLIP).

Let's generate summaries so you'll be inspired to take this to the next step. We'll use a vision-capable GPT-4 model (the code below sets `model="gpt-4o"`).

In [16]:
from langchain.chat_models import ChatOpenAI
from langchain.schema.messages import HumanMessage
import os
from PIL import Image
import base64
import io

I'm creating a quick helper function to convert the image from a file to base64 so we can pass it to the model.



In [18]:

# Function to convert image to base64
def image_to_base64(image_path):
    with Image.open(image_path) as image:
        buffered = io.BytesIO()
        image.save(buffered, format=image.format)
        img_str = base64.b64encode(buffered.getvalue())
        return img_str.decode('utf-8')

image_str = image_to_base64("figures/figure-15-6.jpg")
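As an alternative sketch, the standard library alone can encode the file's raw bytes and build the full data URL with a MIME type guessed from the file extension, instead of hard-coding `image/jpeg`. `file_to_data_url` and the throwaway `demo.png` file are hypothetical and only there for illustration.

```python
import base64
import mimetypes
import os
import tempfile


def file_to_data_url(path, default_mime="application/octet-stream"):
    """Base64-encode a file's raw bytes and wrap them in a data URL."""
    mime, _ = mimetypes.guess_type(path)  # guessed from the extension only
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime or default_mime};base64,{encoded}"


# Demo on a throwaway file so the sketch runs without any real image.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.png")
    with open(demo_path, "wb") as f:
        f.write(b"not a real image, just demo bytes")
    data_url = file_to_data_url(demo_path)

print(data_url[:22])
```

Unlike the PIL helper, this does not decode or re-save the image, so it will not catch a corrupt file; it only passes the bytes through.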

In [22]:
import os
os.environ['OPENAI_API_KEY'] = ""

Then we can go ahead and pass our image to the LLM.



In [24]:

chat = ChatOpenAI(model="gpt-4o",
                  max_tokens=1024)

msg = chat.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text" : "Please give a summary of the image provided. Be descriptive"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_str}"
                    },
                },
            ]
        )
    ]
)

The returned summary is what we'll put into our vector database. Then, when it comes time to do retrieval, we'll use the summary's embedding for semantic search.

In [25]:
msg.content

'The image features a humorous and creative arrangement of chicken nuggets on a baking tray. The nuggets are positioned to resemble the continents of the Earth, mimicking a map of the world. The text at the top of the image reads, "Sometimes I just look at pictures of the earth from space and I marvel at how beautiful it all is." This adds a layer of humor, as the person is humorously comparing the chicken nugget arrangement to the beauty of the Earth when viewed from space. The nuggets roughly represent the shapes of continents like South America, Africa, Europe, and Asia, though not in precise detail. The image playfully combines food with geography for a lighthearted effect.'