## Normalizing The Context

### Preprocessing Steps

Document preprocessing extracts three kinds of information from each document:

1. **Document content:** The text extracted from the document, used to perform similarity search and/or keyword search in RAG applications.

2. **Document elements:** These are the building blocks of a document. They are useful for various tasks in a RAG system, such as filtering and document chunking. These include:

- Titles
- List items
- Tables
- Images
- Narrative text content

3. **Element Metadata:** This includes a variety of things like:

- Page number
- File types
- Section
- File name


This metadata is typically used in **hybrid search**.
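As a rough illustration of how element metadata can narrow a search, the sketch below filters a few hand-written element dictionaries by file name and page number. The sample records and the `filter_by_metadata` helper are hypothetical; only the field names (`type`, `text`, `metadata`, `filename`, `page_number`) follow the element structure printed later in this notebook.

```python
# Hypothetical sample elements; the shape mirrors the dictionaries printed later in this notebook.
sample_elements = [
    {"type": "Title", "text": "ChatGPT",
     "metadata": {"filename": "msft_openai.pptx", "page_number": 1}},
    {"type": "ListItem", "text": "Backed by GPT-3.5 model",
     "metadata": {"filename": "msft_openai.pptx", "page_number": 1}},
    {"type": "NarrativeText", "text": "Results for all benchmarks.",
     "metadata": {"filename": "CoT.pdf", "page_number": 2}},
]


def filter_by_metadata(elements, **criteria):
    """Keep elements whose metadata matches every key/value pair in criteria."""
    return [
        el for el in elements
        if all(el["metadata"].get(key) == value for key, value in criteria.items())
    ]


# Restrict the candidate set before any similarity or keyword scoring is applied.
pptx_page_one = filter_by_metadata(
    sample_elements, filename="msft_openai.pptx", page_number=1
)
print([el["text"] for el in pptx_page_one])
```

In a real pipeline the filtered list would then be handed to the similarity or keyword search step, so only elements from the requested file and page are scored.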


### Why Is Data Preprocessing Hard?

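1. **Content Cues**

    > Different document formats signal element types in different ways. For example, HTML marks a title with a heading tag, while a PDF may only indicate it through font size or position.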

2. **Standardization Need**

    > To process documents of different types, they need to be converted into a standard form, which can be hard to implement.


3. **Data Extraction Variability**

    > Each document format requires its own method for extracting content.


4. **Metadata Insight**

    > Extracting metadata requires a deeper understanding of the document structure. For example, web articles do not have pages, but PDF documents do.


### Why We Normalize Diverse Document Types

1. **Documents come in different formats**

    > When building LLM applications, the last thing you want to worry about is where the data came from and what format it was in. For this reason, it makes sense to normalize your diverse documents.

2. **Common Format**

    > The first step in document preprocessing is to convert the raw documents into a common format that identifies common document elements such as titles, paragraphs, and narrative text.


**Normalization Benefits**

1. It allows documents to be processed in the same way regardless of their original format. This means we do not need to create a separate parser for each document type or format.

2. It helps reduce preprocessing costs. The most expensive and time-consuming part of document preprocessing is data extraction. Once the data is normalized, downstream tasks have an easier time because they all operate on the same normalized text.
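To make the first benefit concrete, here is a minimal sketch in which one function handles elements that originated from two different formats. The sample elements and the `narrative_text` helper are hypothetical stand-ins; the point is only that a shared schema removes the need for format-specific downstream code.

```python
# Hypothetical normalized elements from two different source formats.
html_elements = [
    {"type": "Title", "text": "Share", "metadata": {"filetype": "text/html"}},
    {"type": "NarrativeText", "text": "Data is the lifeblood of innovation.",
     "metadata": {"filetype": "text/html"}},
]
pdf_elements = [
    {"type": "Title", "text": "B All Experimental Results",
     "metadata": {"filetype": "application/pdf", "page_number": 1}},
    {"type": "NarrativeText", "text": "This section contains tables.",
     "metadata": {"filetype": "application/pdf", "page_number": 1}},
]


def narrative_text(elements):
    """Return the text of every NarrativeText element, whatever the source format."""
    return [el["text"] for el in elements if el["type"] == "NarrativeText"]


# The same function works on both sources because they share one schema.
all_text = narrative_text(html_elements + pdf_elements)
print(all_text)
```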


### Document Serialization

This is the second step, right after document normalization. Serialization lets us reuse the results of document preprocessing over and over again without having to start from scratch. We can serialize our data as JSON. Other formats exist, but JSON is commonly used because it is supported across many programming languages, has a common and well-understood structure, works as a standard HTTP response, and can also be used for streaming use cases.
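The round trip can be sketched with the standard library alone. The element list below is a hypothetical stand-in for real partition output, and the file name `elements_example.json` is arbitrary; the JSON Lines layout at the end is one common way to stream elements, not something the library requires.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical preprocessed elements standing in for real partition output.
elements = [
    {"type": "Title", "text": "ChatGPT", "metadata": {"page_number": 1}},
    {"type": "ListItem", "text": "Backed by GPT-3.5 model", "metadata": {"page_number": 1}},
]

# Serialize once, right after preprocessing.
out_path = Path(tempfile.gettempdir()) / "elements_example.json"
out_path.write_text(json.dumps(elements, indent=2), encoding="utf-8")

# Reload later without repeating the extraction step.
restored = json.loads(out_path.read_text(encoding="utf-8"))

# For streaming, one JSON object per line (JSON Lines) is a common layout.
jsonl = "\n".join(json.dumps(el) for el in elements)
streamed = [json.loads(line) for line in jsonl.split("\n")]

print(restored == elements and streamed == elements)
```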



In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# ! poetry add unstructured unstructured-client python-pptx

In [2]:
from IPython.display import JSON

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json

### Parse HTML Documents

In [4]:
filename = "./example_datasets/medium_blog_post.html"
elements = partition_html(filename=filename)


In [5]:
element_dict = [el.to_dict() for el in elements]
example_output = json.dumps(element_dict[11:15], indent=2)
print(example_output)

[
  {
    "type": "Title",
    "element_id": "690e0c655cefd034544abc23a00861d7",
    "text": "Share",
    "metadata": {
      "category_depth": 0,
      "last_modified": "2024-06-02T22:52:40",
      "languages": [
        "eng"
      ],
      "file_directory": "./example_datasets",
      "filename": "medium_blog_post.html",
      "filetype": "text/html"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "2e8f75631dafe3e00405ccd3ffb99403",
    "text": "In the vast digital universe, data is the\n                                lifeblood that drives decision-making and\n                                innovation. But not all data is created equal.\n                                Unstructured data in images and documents often\n                                hold a wealth of information that can be\n                                challenging to extract and analyze.",
    "metadata": {
      "last_modified": "2024-06-02T22:52:40",
      "languages": [
        "eng"
      ],
  

In [6]:
JSON(example_output)



<IPython.core.display.JSON object>
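A quick way to get a feel for what the partitioner produced is to tally the element types. The sketch below does this on a small hypothetical sample; the `count_types` helper is an illustration, and in this notebook it could be called on `element_dict` instead.

```python
from collections import Counter

# Hypothetical sample; real input would be the list of element dictionaries.
sample = [
    {"type": "Title", "text": "Share"},
    {"type": "NarrativeText", "text": "First paragraph of the post."},
    {"type": "NarrativeText", "text": "Second paragraph of the post."},
    {"type": "ListItem", "text": "A bullet point."},
]


def count_types(element_dicts):
    """Count how many elements of each type the partitioner returned."""
    return Counter(el["type"] for el in element_dicts)


type_counts = count_types(sample)
print(type_counts.most_common())
```

A page dominated by `Title` elements with little `NarrativeText`, for instance, may be mostly navigation or widgets, which is useful to know before chunking.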

### Parse MS PPTX Documents

In [7]:
filename = "./example_datasets/msft_openai.pptx"
elements = partition_pptx(filename=filename)

In [12]:
element_dict = [el.to_dict() for el in elements]
print(element_dict[:])
JSON(json.dumps(element_dict[:], indent=2))

[{'type': 'Title', 'element_id': 'e53cb06805f45fa23fb6d77966c5ec63', 'text': 'ChatGPT', 'metadata': {'category_depth': 1, 'file_directory': './example_datasets', 'filename': 'msft_openai.pptx', 'last_modified': '2024-06-02T23:00:54', 'page_number': 1, 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}}, {'type': 'ListItem', 'element_id': '34a50527166e6765aa3e40778b5764e1', 'text': 'Chat-GPT: AI Chatbot, developed by OpenAI,\xa0trained to perform conversational tasks and\xa0creative tasks', 'metadata': {'category_depth': 0, 'file_directory': './example_datasets', 'filename': 'msft_openai.pptx', 'last_modified': '2024-06-02T23:00:54', 'page_number': 1, 'languages': ['eng'], 'parent_id': 'e53cb06805f45fa23fb6d77966c5ec63', 'filetype': 'application/vnd.openxmlformats-officedocument.presentationml.presentation'}}, {'type': 'ListItem', 'element_id': '631df69dff044f977d66d71c5cbdab83', 'text': 'Backed by GPT-3.5 model (gpt-35-turbo),

<IPython.core.display.JSON object>
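The `parent_id` field in the output above links each list item to the title it sits under, which is handy for chunking by section. The following sketch groups child text under its parent title. The sample elements use shortened, made-up IDs and text, and `group_under_titles` is a hypothetical helper rather than part of the library.

```python
from collections import defaultdict

# Hypothetical elements with shortened IDs; the field names follow the output above.
sample = [
    {"type": "Title", "element_id": "t1", "text": "ChatGPT",
     "metadata": {"page_number": 1}},
    {"type": "ListItem", "element_id": "l1", "text": "AI chatbot developed by OpenAI",
     "metadata": {"page_number": 1, "parent_id": "t1"}},
    {"type": "ListItem", "element_id": "l2", "text": "Backed by GPT-3.5 model",
     "metadata": {"page_number": 1, "parent_id": "t1"}},
]


def group_under_titles(element_dicts):
    """Map each title's text to the text of the elements that name it as parent."""
    titles = {
        el["element_id"]: el["text"]
        for el in element_dicts
        if el["type"] == "Title"
    }
    groups = defaultdict(list)
    for el in element_dicts:
        parent = el["metadata"].get("parent_id")
        if parent in titles:
            groups[titles[parent]].append(el["text"])
    return dict(groups)


sections = group_under_titles(sample)
print(sections)
```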

### PDF Preprocessing

Processing PDF documents takes a fair amount of compute and hardware resources, so Unstructured provides an API we can use instead.

In [15]:
# ! poetry add python-dotenv

In [26]:
import dotenv
%load_ext dotenv
%dotenv


In [27]:
import os

In [28]:
s = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)
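Because `os.getenv` silently returns `None` when a variable is missing, a small guard can surface a configuration problem before any request is sent. The `require_env` helper and the `DEMO_API_KEY` variable below are hypothetical and exist only for this sketch.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Demonstration with a throwaway variable so the sketch runs anywhere.
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_key = require_env("DEMO_API_KEY")
print(len(demo_key))

# In this notebook the call would be: require_env("UNSTRUCTURED_API_KEY")
```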

In [29]:
filename = "./example_datasets/CoT.pdf"

# Read the PDF as bytes so it can be uploaded to the API
with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# Request the high-resolution strategy with table structure inference
req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
)

try:
    resp = s.general.partition(req)
    # Show the first three elements returned by the API
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)

[
  {
    "type": "Title",
    "element_id": "826446fa7830f0352c88808f40b0cc9b",
    "text": "B All Experimental Results",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "055f2fa97fbdee35766495a3452ebd9d",
    "text": "This section contains tables for experimental results for varying models and model sizes, on all benchmarks, for standard prompting vs. chain-of-thought prompting.",
    "metadata": {
      "filetype": "application/pdf",
      "languages": [
        "eng"
      ],
      "page_number": 1,
      "parent_id": "826446fa7830f0352c88808f40b0cc9b",
      "filename": "CoT.pdf"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "9bf5af5255b80aace01b2da84ea86531",
    "text": "For the arithmetic reasoning benchmarks, some chains of thought (along with the equations produced) were correct, except the model p