## Metadata Extraction And Chunking


Enriching extracted data with metadata is vital for better hybrid search. It also allows us to chunk data more meaningfully for semantic search.


### What Is Metadata

Metadata is data about data: additional information about the content extracted from source documents.

#### Types Of Metadata

1. **Source Information**

> This is information about the document from which the content was extracted, such as the filename and the last-modified date.


2. **Structure Metadata**

> This is constructed from the structure of the document itself, e.g. element types, hierarchies, and section information.
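As a rough illustration of the two kinds of metadata, the sketch below separates the fields of a single element into source and structure metadata. The element is a made-up example shaped like the partition output shown later in this notebook, and the grouping of keys (`SOURCE_KEYS`, `STRUCTURE_KEYS`) and the helper `split_metadata` are assumptions for illustration, not part of any library.

```python
# Hypothetical element, shaped like the partition output used in this notebook.
element = {
    "type": "NarrativeText",
    "element_id": "abc123",
    "text": "Some paragraph text.",
    "metadata": {
        "filename": "winter-sports.pdf",
        "filetype": "application/pdf",
        "page_number": 130,
        "parent_id": "e99335bc",
    },
}

# Assumed grouping of keys; adjust to the fields your documents carry.
SOURCE_KEYS = {"filename", "filetype", "last_modified"}
STRUCTURE_KEYS = {"page_number", "parent_id"}


def split_metadata(element):
    """Return (source_metadata, structure_metadata) for one element dict."""
    meta = element["metadata"]
    source = {k: v for k, v in meta.items() if k in SOURCE_KEYS}
    structure = {k: v for k, v in meta.items() if k in STRUCTURE_KEYS}
    # The element type lives outside "metadata" but describes structure too.
    structure["type"] = element["type"]
    return source, structure


source, structure = split_metadata(element)
print(source)
print(structure)
```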

### Why Hybrid Search?

Hybrid search is a search strategy that combines semantic search with information retrieval techniques such as filtering and keyword search. It addresses several limitations of semantic (similarity) search used on its own:


1. **Too many matches**

In some cases, similarity search may return too many similar documents.

2. **Most recent information**

Users may want the most recent information and not just the most similar one.


3. **Loss of important information**

Similarity search can lose information that is relevant to the search but is not part of the text itself, such as section information.
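The idea can be sketched without any vector database: filter on metadata first, then rank the remaining documents by similarity. The example below is a toy, not how Chroma works internally. The bag-of-words `embed` function stands in for a real embedding model, and `hybrid_search` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    # Toy "embedding": word counts. A real system would use an embedding model.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query, docs, where, n_results=2):
    # Step 1: metadata filter (exact match on every key in `where`).
    candidates = [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in where.items())
    ]
    # Step 2: rank the surviving documents by similarity to the query.
    q = embed(query)
    ranked = sorted(candidates, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:n_results]


# Made-up sample documents with a "chapter" metadata field.
docs = [
    {"text": "each team has six players on the ice", "metadata": {"chapter": "ICE-HOCKEY"}},
    {"text": "a curling team has four players", "metadata": {"chapter": "TEES AND CRAMPITS"}},
    {"text": "the rink must be swept after a match", "metadata": {"chapter": "ICE-HOCKEY"}},
]

query = "how many players on a team"
hits = hybrid_search(query, docs, where={"chapter": "ICE-HOCKEY"})
unfiltered = hybrid_search(query, docs, where={})
for hit in hits:
    print(hit["metadata"]["chapter"], "|", hit["text"])
```

In this toy data the curling sentence is the closest match when no filter is applied, so the metadata filter is what keeps the results inside the chapter the user asked about.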

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

In [4]:
import json
from IPython.display import JSON

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

import os

import chromadb

In [5]:
s = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

#### Calling API

In [8]:
filename = "./example_datasets/winter-sports.pdf"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)

In [9]:
try:
    resp = s.general.partition(req)
except SDKError as e:
    print(e)

In [15]:
print(json.dumps(resp.elements[0:3], indent=2))

[
  {
    "type": "Image",
    "element_id": "e914244ca866000ec93260287252af9b",
    "text": "WY R! NS PGRTS SWIDZIWRTAND: I A \u00a5.E BENSON",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "winter-sports.pdf"
    }
  },
  {
    "type": "Title",
    "element_id": "21501e320e445a0a9478f2775c43efbd",
    "text": "* A Distributed Proofreaders US Ebook *",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 2,
      "filename": "winter-sports.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "6ec7739f853217b1a845380ed110edae",
    "text": "This ebook is made available at no cost and with very few restrictions. These restrictions apply only if (1) you make a change in the ebook (other than alteration for different display devices), or (2) you are making commercial use of the ebook. If either of these conditions a

#### Filtering

In [11]:
[x for x in resp.elements if x['type'] == 'Title' and 'hockey' in x['text'].lower()]

[{'type': 'Title',
  'element_id': 'e99335bc0cc4901e83f4ed51b34777a8',
  'text': 'ICE-HOCKEY',
  'metadata': {'filetype': 'application/pdf',
   'languages': ['eng'],
   'page_number': 130,
   'filename': 'winter-sports.pdf'}}]

#### Find Elements Associated With Chapters

In [12]:
chapters = [
    "THE SUN-SEEKER",
    "RINKS AND SKATERS",
    "TEES AND CRAMPITS",
    "ICE-HOCKEY",
    "SKI-ING",
    "NOTES ON WINTER RESORTS",
    "FOR PARENTS AND GUARDIANS",
]

In [13]:
# Filtering down to Titles
chapter_ids = {}
for element in resp.elements:
    for chapter in chapters:
        if element["text"] == chapter and element["type"] == "Title":
            chapter_ids[element["element_id"]] = chapter
            break

In [14]:
chapter_ids

{'08ef4b916c44e72dfa21a3b462928b3c': 'THE SUN-SEEKER',
 '83004b57981390b5052bbc077c677e9b': 'RINKS AND SKATERS',
 'd0428d4e418c70676674f93921543ba7': 'TEES AND CRAMPITS',
 'e99335bc0cc4901e83f4ed51b34777a8': 'ICE-HOCKEY',
 '36615698e751eadf28b48d0c96f0384f': 'SKI-ING',
 '06f0ced6df29297c075035fc648700e9': 'NOTES ON WINTER RESORTS'}

In [16]:
chapter_to_id = {v: k for k, v in chapter_ids.items()}
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

{'type': 'NarrativeText',
 'element_id': '62163bb9ede34f67ae877df0f7d2e2f2',
 'text': 'M of the Swiss winter-resorts can put into the field a very strong ice-hockey team, and fine teams from other countries often make winter tours there; but the ice-hockey which the ordinary winter visitor will be apt to join in will probably be of the most elementary and unscientific kind indulged in, when the skating day is drawing to a close, by picked-up sides. As will be readily understood, the ice over which a hockey match has been played is perfectly useless for skaters any more that day until it has been swept, scraped, and sprinkled or flooded; and in consequence, at all Swiss resorts, with the exception of St. Moritz, where there is a rink that has been made for the hockey- player, or when an important match is being played, this sport is supplementary to such others as I have spoken of. Nobody, that is, plays hockey and nothing else, since he cannot play hockey at all till the greedy skaters

#### Load documents into a vector db

In [20]:
client = chromadb.PersistentClient(path="chroma_tmp", settings=chromadb.Settings(allow_reset=True))
client.reset()

True

In [21]:
collection = client.create_collection(
    name="winter_sports",
    # Cosine similarity search
    metadata={"hnsw:space": "cosine"}
)

#### Add Elements To Collection

- This can take a while.

In [22]:
for element in resp.elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")
    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )

#### See the elements in Vector DB

In [23]:
results = collection.peek()
print(results["documents"])

['O C R (colour)', 'To pass this section the candidate must satisfy all the judges in the manner in which he skates each set considered as a whole, and also in the manner in which he skates each individual call.', '“The essentials of correct tracing are: “Maintenance of the long and transverse axes (as the long axis of the figure a line is to be conceived which divides each circle into two equal parts; a transverse axis cuts the long axis at right angles between two circles); approximately equal size of all circles, and of all curves before and after all turns; symmetrical grouping of the individual parts of the figure about the axes; curves without wobbles, skated out—that is, returning nearly to the starting-point. Threes with the turns lying in the long axis; changes of edge with an easy transition, the change falling in the long axis.”', 'Note.—The head, as already stated, consists of the projection of sixteen stones from one crampit towards the house at the other end of the rink, 

#### Perform a hybrid search with metadata

In [25]:
result = collection.query(
    query_texts=["How many players are on a team?"],
    n_results=2,
    where={"chapter": "ICE-HOCKEY"},
)
print(json.dumps(result, indent=2))

{
  "ids": [
    [
      "04db2c6acff79460008854980481b84c",
      "4e83445779c2acfec3d440845954003a"
    ]
  ],
  "distances": [
    [
      0.4859185814857483,
      0.6825367212295532
    ]
  ],
  "metadatas": [
    [
      {
        "chapter": "ICE-HOCKEY"
      },
      {
        "chapter": "ICE-HOCKEY"
      }
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "accuracy of a first-rate team, each member of which knows the play of the other five players. The finer the team, as is always the case, the greater is their interdependence on each other, and the less there is of individual play. Brilliant running and dribbling, indeed, you will see; but as distinguished from a side composed of individuals, however good, who are yet not a team, these brilliant episodes are always part of a plan, and end not in some wild shot but in a pass or a succession of passes, designed to lead to a good opening for scoring. There is, indeed, no game at which team play outwits individual br

#### Chunking

Chunking is simply breaking down a large piece of text into smaller sections. This is important since LLMs have limited context window sizes.

You can chunk based on:

1. Tokens
2. Characters

##### Is Chunking Necessary?

Yes. Vector databases need documents to be split into smaller chunks, both for retrieval and so that the retrieved text fits into the prompt.

##### Query Variability

The same query can return different content depending on the way the document is chunked.

##### Even Size Chunks

The easiest way is to split the document into evenly sized chunks. This can result in related content being split across several chunks.
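A minimal sketch of even-size chunking by characters follows. The helper `chunk_fixed` and the sample text are made up for illustration; the point is only to show how a fixed window cuts through sentences.

```python
def chunk_fixed(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Ice-hockey is played by two teams. Each team has six players."
chunks = chunk_fixed(text, size=20)
for chunk in chunks:
    print(repr(chunk))
```

The second sentence ends up spread over more than one chunk, which is the weakness described above: a query about team size may retrieve only half of the sentence that answers it.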

##### Chunking By Atomic Elements

By identifying atomic elements, you can chunk by combining elements rather than splitting raw text itself. This results in:

- More coherent chunks
- **Example:** Combining content under the same header section into the same chunk.


In chunking based on Elements, we first split the text into sections based on atomic elements and then perform a combining operation over them. This includes the following steps:

1. **Partitioning**

Breaking down a large document into smaller atomic elements.

2. **Combine Elements Into Chunks**

Add one atomic element to the chunk, then keep adding the following elements to the same chunk until it reaches the character or token limit you have set.


3. **Apply Break Condition**

A break condition defines when to stop adding atomic elements to the current chunk and start a new one. In other words, under what circumstances should a new chunk begin? Such conditions can be:

- When content metadata changes, such as the page number
- When the title changes, indicating a new chapter
- When content similarity exceeds a given threshold


You can also apply basic combinative chunking, in which no break condition is applied and elements are combined until the size limit is reached.
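The steps above can be sketched in plain Python. This is not the `unstructured` implementation; `combine_elements` is a hypothetical helper that works on already partitioned elements (made-up dicts with `type` and `text`), combines them up to a character limit, and optionally breaks on a new title.

```python
def combine_elements(elements, max_characters=100, break_on_title=True):
    """Combine partitioned elements into text chunks.

    A new chunk starts when the character limit would be exceeded or,
    if break_on_title is True, when a Title element is encountered.
    An element longer than max_characters is kept whole in its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for el in elements:
        text = el["text"]
        # Elements inside a chunk are joined with a blank line (2 characters).
        added = len(text) + (2 if current else 0)
        too_long = current_len + added > max_characters
        new_section = break_on_title and el["type"] == "Title"
        if current and (too_long or new_section):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up elements standing in for the output of the partitioning step.
elements = [
    {"type": "Title", "text": "ICE-HOCKEY"},
    {"type": "NarrativeText", "text": "Hockey is played on a rink."},
    {"type": "NarrativeText", "text": "Each team has six players."},
    {"type": "Title", "text": "SKI-ING"},
    {"type": "NarrativeText", "text": "Skis are long wooden runners."},
]

by_title = combine_elements(elements, max_characters=100, break_on_title=True)
basic = combine_elements(elements, max_characters=100, break_on_title=False)
print(by_title)
print(basic)
```

With the title break condition each chunk stays within one chapter. Without it, the second title is appended to the first chunk and separated from the text it introduces, because only the size limit decides where a chunk ends.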


**Coherent Chunks**

Coherent chunks are groups of related information presented in a way that makes logical sense and is easy to understand. The goal is to ensure that the information within each chunk is thematically and contextually related, making it easier for the reader or viewer to process and retain the information. Coherence is achieved by focusing on:

**Consistency:** The information within a chunk should follow a consistent theme or topic.

**Relevance:** Each piece of information should be directly related to the main idea of the chunk.

**Flow:** The information should be organized in a logical sequence that is easy to follow.

For example, in a blog post about Python programming, a coherent chunk might cover different data types in Python, with each section dedicated to a specific type like integers, strings, and lists. Each section would provide a clear and comprehensive overview of its respective topic.

**Structured Chunks**

Structured chunks refer to the organization of information into well-defined, hierarchical units that follow a clear and consistent structure. This approach emphasizes the use of headings, subheadings, bullet points, and other organizational tools to create a predictable and navigable layout. Structure is achieved by focusing on:

**Hierarchy:** Information is organized from general to specific, with clear distinctions between main points and subpoints.

**Format:** Consistent use of formatting tools such as headings, bullet points, and numbering to delineate different sections and sub-sections.

**Navigation:** Easy-to-follow layout that allows readers to quickly locate and understand different parts of the content.

For example, a technical manual might use structured chunks to present information. The manual could have chapters (main chunks), which are divided into sections and subsections. Each section might begin with an overview, followed by detailed explanations, examples, and summaries.

**Combining Coherent and Structured Chunks**

Combining coherent and structured chunks can significantly enhance the clarity and usability of information. Coherent chunks ensure that the content within each section is logical and easy to understand, while structured chunks provide a clear framework that helps readers navigate through the information efficiently.


![Chunking Strategies](./images/chunking_strategies.png)

In [26]:
# Convert the deserialized element dicts into element objects
elements = dict_to_elements(resp.elements)

In [27]:
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100,
    max_characters=3000,
)

`combine_text_under_n_chars=100` combines small sections (for example, a run of short titles) until the combined section reaches 100 characters.

`max_characters=3000` sets the maximum number of characters allowed in each chunk.

In [28]:
print(json.dumps(chunks[0].to_dict(), indent=2))

{
  "type": "CompositeElement",
  "element_id": "c191902c-9e67-4d8d-abf0-1ebc6f8771ae",
  "text": "WY R! NS PGRTS SWIDZIWRTAND: I A \u00a5.E BENSON\n\n* A Distributed Proofreaders US Ebook *\n\nThis ebook is made available at no cost and with very few restrictions. These restrictions apply only if (1) you make a change in the ebook (other than alteration for different display devices), or (2) you are making commercial use of the ebook. If either of these conditions applies, please check with an FP administrator before proceeding.\n\nThis work is in the Canadian public domain, but may be under copyright in some countries. If you live outside Canada, check your country's copyright laws. If the book is under copyright in your country, do not download or redistribute this file.\n\nTitle: Winter Sports in Switzerland Author: Benson, E. F. (Edward Frederic) Date of first publication: 1913 Date first posted: August 23, 2019 Date last updated: February 3, 2021 Faded Page ebook#20210233\n\nTitl

In [29]:
len(elements)

744

In [30]:
len(chunks)

180