In the preprocessing step we converted all JSON files into text chunks, each with text and metadata.
As a first step we now read all .jsonl chunks from the data_chunks folder and convert them into LangChain-compatible Document objects.
It is important to carry over both the text and the metadata from the chunks in full, since both are used later for retrieval and filtering.
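
For orientation, here is a minimal sketch of what a single line in one of the .jsonl files is assumed to look like. Only the keys (text, and metadata with bank/date/source) are taken from the loader in this notebook; the sample line itself is made up for illustration.

```python
import json

# Hypothetical sample line: the values are illustrative, only the key names
# ("text", "metadata" -> bank/date/source) mirror what the loader reads.
sample_line = (
    '{"text": "As of 2023-12-31, UBSG cash flow report: ...", '
    '"metadata": {"bank": "UBSG", "date": "2023-12-31", "source": "cashflow"}}'
)

chunk = json.loads(sample_line)       # one JSON object per line
text = chunk["text"]                  # the chunk text
meta = chunk.get("metadata", {})      # the metadata dict, empty if missing

print(meta.get("bank"), meta.get("date"), meta.get("source"))
```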

# 1. read chunks and create Document objects

Here we create the documents list. Each Document contains page_content and metadata.
The documents can then be searched by text and filtered by metadata.


The result looks like this:

doc = Document(
    page_content=new_text,     # the text, prefixed with bank/date/source
    metadata=meta              # the metadata: {"bank": ..., "date": ..., "source": ...}
)
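
The idea of searching by text and filtering by metadata can be sketched in plain Python. This is only a minimal sketch, not how a vector store filters internally: SimpleDoc is a hypothetical stand-in for LangChain's Document (same two attributes) so the example runs without LangChain, filter_by_metadata is a made-up helper, and the BANK_B entry is invented sample data.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for LangChain's Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Return the docs whose metadata matches every key/value pair in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


# Illustrative sample documents (BANK_B is invented for the example)
docs = [
    SimpleDoc("Bank: UBSG\nDate: 2023-12-31\nSource: cashflow\n\n...",
              {"bank": "UBSG", "date": "2023-12-31", "source": "cashflow"}),
    SimpleDoc("Bank: UBSG\nDate: 2022-12-31\nSource: cashflow\n\n...",
              {"bank": "UBSG", "date": "2022-12-31", "source": "cashflow"}),
    SimpleDoc("Bank: BANK_B\nDate: 2023-12-31\nSource: cashflow\n\n...",
              {"bank": "BANK_B", "date": "2023-12-31", "source": "cashflow"}),
]

# Filter on metadata: all criteria must match
hits = filter_by_metadata(docs, bank="UBSG", date="2023-12-31")

# Naive text search on page_content (a real retriever would use embeddings)
text_hits = [d for d in docs if "2022-12-31" in d.page_content]
```

With real Document objects the same helper can be called as filter_by_metadata(documents, bank="UBSG"), since it only relies on the metadata attribute.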


In [5]:
# imports

import os
import json
from langchain.schema import Document  # newer LangChain versions: from langchain_core.documents import Document
from tqdm import tqdm


In [6]:
# load chunks and create Document objects

CHUNKS_DIR = "data_chunks"

documents = []

for file in tqdm(os.listdir(CHUNKS_DIR)):
    if file.endswith(".jsonl"):
        file_path = os.path.join(CHUNKS_DIR, file)
        with open(file_path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    chunk = json.loads(line)

                    # extract metadata
                    meta = chunk.get("metadata", {})
                    bank = meta.get("bank", "unknown")
                    date = meta.get("date", "unknown")
                    source = meta.get("source", "unknown")

                    # prepend the metadata to the text as well (it also stays in the metadata dict)
                    new_text = f"Bank: {bank}\nDate: {date}\nSource: {source}\n\n{chunk['text']}"

                    doc = Document(
                        page_content=new_text,
                        metadata=meta
                    )
                    documents.append(doc)
                except Exception as e:
                    print(f"Error in file {file}: {e}")

print(f"✅ {len(documents)} Documents loaded, with metadata embedded in the text")



100%|██████████| 62/62 [00:00<00:00, 271.00it/s]

✅ 62984 Documents loaded, with metadata embedded in the text
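
A quick sanity check after loading is to count documents per metadata value, which shows whether every bank and source actually made it into the list. The following is a minimal sketch: count_by is a hypothetical helper, and the demo objects are invented stand-ins so the snippet runs on its own; in the notebook it would be called on the documents list instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_by(docs, key):
    """Count docs per value of the given metadata key ("unknown" if missing)."""
    return Counter(d.metadata.get(key, "unknown") for d in docs)


# Invented demo objects with the same attributes as a Document
demo_docs = [
    SimpleNamespace(page_content="...", metadata={"bank": "UBSG", "source": "cashflow"}),
    SimpleNamespace(page_content="...", metadata={"bank": "UBSG", "source": "cashflow"}),
    SimpleNamespace(page_content="...", metadata={"source": "cashflow"}),  # bank missing
]

bank_counts = count_by(demo_docs, "bank")
source_counts = count_by(demo_docs, "source")
print(bank_counts.most_common())
```

A large "unknown" bucket would hint that some chunks were written without the expected metadata in the preprocessing step.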





In [7]:
# show some Document objects

for doc in documents[:3]:
    print("Text preview:\n", doc.page_content[:300])
    print("Metadata:", doc.metadata)
    print("—" * 80)


Text preview:
 Bank: UBSG
Date: 2023-12-31
Source: cashflow

As of 2023-12-31, UBSG cash flow report:
- Operating cash flow: 86.07 billion USD
- Investing cash flow: 103.23 billion USD
- Financing cash flow: -58.26 billion USD
- Free cash flow: 84.38 billion USD
- Beginning cash position: 195.32 billion USD
- End 
Metadata: {'bank': 'UBSG', 'date': '2023-12-31', 'source': 'cashflow'}
————————————————————————————————————————————————————————————————————————————————
Text preview:
 Bank: UBSG
Date: 2022-12-31
Source: cashflow

As of 2022-12-31, UBSG cash flow report:
- Operating cash flow: 14.65 billion USD
- Investing cash flow: -12.45 billion USD
- Financing cash flow: -9.09 billion USD
- Free cash flow: 13.00 billion USD
- Beginning cash position: 207.88 billion USD
- End c
Metadata: {'bank': 'UBSG', 'date': '2022-12-31', 'source': 'cashflow'}
————————————————————————————————————————————————————————————————————————————————
Text preview:
 Bank: UBSG
Date: 2021-12-31
Source: cashflow

As 