# Prepare data
This notebook prepares raw data for later use in vector storage.

## Sources:
- Documentation (`docs-data`)
- Text of standards (`TEPs`)
- TVM specification docs (generated in ParseSnippets)
- Some contract examples (`tolk-contracts`)

## Method

Documents are parsed into a hierarchical structure based on their heading levels.  
Multi-line code snippets, and any snippet with a declared language, are moved into a separate snippet index.  
This keeps code from polluting vector similarity, so vector search is always performed  
against natural-language content only.
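
The inline-or-extract decision can be sketched as follows. This is a minimal illustration, assuming the rule applied by `process_md` further down (a single-line code span without a language tag stays inline); `should_extract` is a hypothetical helper name that does not exist in the notebook.

```python
def should_extract(code_text: str, lang: str = "") -> bool:
    """Return True when a code block should go to the snippet index."""
    is_one_liner = len(code_text.split("\n")) <= 1
    # One-liners without a language tag are usually terms, so they stay inline.
    return not (is_one_liner and not lang)


print(should_extract("recv_internal"))            # False
print(should_extract("int x = 1;", lang="func"))  # True
print(should_extract("a = 1\nb = 2"))             # True
```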

### Document structure
Document has the following structure:
``` python
new_doc = create_document(doc_content, path, {
    "concept": title,
    "word_count": len(doc_content.split()),
    "token_count": (len(doc_content) + 3) // 4, # Rough anthropic estimate
    "from": path,
    "url_from": url_from,
    "child_nodes": [],
    "references": references.copy(),
    "snippets": ref_snippets.copy(),
    "crumbs": ">>".join(crumbs)
})
```

#### Concept attribute
Name of the concept the document explains.
Usually this is just its heading.

#### Word and token count

These are self-explanatory.

The token count is a rough estimate and
is not used anywhere yet, but it will
be helpful in the future.
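
The estimate is a ceiling division of the character count by four. A minimal sketch, where `estimate_tokens` is a hypothetical name for the expression used in the metadata above:

```python
def estimate_tokens(text: str) -> int:
    # Ceiling division by 4: roughly four characters per token.
    return (len(text) + 3) // 4


print(estimate_tokens(""))       # 0
print(estimate_tokens("abcd"))   # 1
print(estimate_tokens("abcde"))  # 2
```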

#### From attribute
Source path relative to the project root.

#### Url from attribute
URL pointing to the document, relative to the
[docs.ton.org](https://docs.ton.org/) root path,
where applicable.
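
A sketch of how such a URL can be derived from the source path, assuming the first path component is the data root (for example `docs-data`). `url_from_path` is a hypothetical helper, and `PurePosixPath` is used here instead of `pathlib.Path` only to keep the output platform-independent.

```python
from pathlib import PurePosixPath


def url_from_path(path: str) -> str:
    rel = PurePosixPath(path)
    # Drop the first path component (e.g. "docs-data") and the file extension.
    file_url = PurePosixPath(*rel.parts[1:])
    return "/" + str(file_url.parent / file_url.stem)


print(url_from_path("docs-data/languages/func/known-issues.mdx"))
# /languages/func/known-issues
```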

#### Child nodes attribute
Documents that sit below the current document
in the hierarchy.
When a document with a deeper (lower-level) heading is encountered,
it is added to the child nodes of its parent documents.
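
The idea can be illustrated with a simplified sketch. It is not the notebook's exact bookkeeping: it works on `(level, title)` pairs instead of `Document` objects, assumes unique titles, and assumes that every ancestor records all of its descendants. `link_children` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def link_children(headings: List[Tuple[int, str]]) -> Dict[str, List[str]]:
    """Map each heading title to the titles of all headings nested below it."""
    children: Dict[str, List[str]] = {title: [] for _, title in headings}
    stack: List[Tuple[int, str]] = []  # open ancestors, outermost first
    for level, title in headings:
        # Close every heading that is not a strict ancestor of the new one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        for _, ancestor in stack:
            children[ancestor].append(title)
        stack.append((level, title))
    return children


outline = [(1, "Intro"), (2, "Setup"), (3, "Install"), (2, "Usage")]
print(link_children(outline))
# {'Intro': ['Setup', 'Install', 'Usage'], 'Setup': ['Install'], 'Install': [], 'Usage': []}
```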

#### References attribute
Hyperlinks used in this document chunk, converted to document ids where possible.
If the linked document is part of the docs-data tree, its id is added;
otherwise the raw URL is preserved.
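
A minimal sketch of that resolution step, assuming a `location_map` that maps a document URL to the ids of all chunks produced from it. `resolve_references` is a hypothetical helper, and the ids in the example are made up.

```python
from typing import Dict, List


def resolve_references(refs: List[str], location_map: Dict[str, List[str]]) -> List[str]:
    resolved: List[str] = []
    for ref in refs:
        # Known locations expand to every chunk id stored for that URL;
        # anything else is kept as the raw link.
        resolved.extend(location_map.get(ref, [ref]))
    return resolved


location_map = {"/languages/func/known-issues": ["id-1", "id-2"]}
print(resolve_references(["/languages/func/known-issues", "https://ton.org"], location_map))
# ['id-1', 'id-2', 'https://ton.org']
```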

#### Snippets attribute
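List of snippet references used in this document chunk. Each entry holds the snippet `id` and `pos`, the character offset in the chunk text at which the snippet was found.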

#### Document id
Id is generated from the content and path hash:
```python
doc_id = hashlib.sha256("||".join([content, doc_path]).encode('utf8')).hexdigest()
```
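
Because the id is a pure function of content and path, it is stable across runs and changes whenever either input changes. A small sketch of that property, with `make_doc_id` as a hypothetical wrapper around the expression above and made-up paths:

```python
import hashlib


def make_doc_id(content: str, doc_path: str) -> str:
    # Same recipe as above: SHA-256 over "content||path".
    return hashlib.sha256("||".join([content, doc_path]).encode("utf8")).hexdigest()


a = make_doc_id("Hello", "docs-data/a.mdx")
b = make_doc_id("Hello", "docs-data/b.mdx")
print(len(a))                                        # 64
print(a == make_doc_id("Hello", "docs-data/a.mdx"))  # True
print(a == b)                                        # False
```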

### Snippet structure
Snippet object has the following structure:
``` python
codeDoc = Document(page_content=code, id=doc_id, metadata={
        "lang": lang.lower(),
        "desc": desc,
        "word_count": len(code.split()),
        "token_count": (len(code) + 3) // 4, # Rough anthropic estimate
        "concept": concept,
    })
```

#### Lang attribute
Lowercase language name, if present.

#### Desc attribute
Paragraph of text that precedes the code snippet in the document.  
It is not guaranteed to be an accurate description, but most of the time
it works.  
Added opportunistically.

#### Snippet word and token count
Same as in the document.

#### Snippet concept attribute
Heading of the document section in which the snippet appears.

In [None]:
%pip install tqdm beautifulsoup4 markdown2 sentence-transformers faiss-cpu langchain-core pydantic

In [None]:
import pathlib, glob, tqdm
import markdown2 as markdown
from bs4 import BeautifulSoup, Tag, NavigableString
import unicodedata
import hashlib
import sys
import re
from langchain_core.documents import Document
from typing import Dict, List, Optional
from pydantic import BaseModel
    
docs: list[Document] = []
snippets: Dict[str, Document] = {}
location_map: Dict[str, List[str]] = {}  # url_from -> ids of all chunks produced from that page
root_path = pathlib.Path.cwd().parent.resolve()
sys.path.insert(0, str(root_path))
jsx_regex = re.compile(
    r"""^import\s+                     # the word “import” + whitespace
        \{([^}]+)\}\s+                # everything inside the braces → group 1
        from\s+                       # the word “from” + whitespace
        (['"])(/?([^'"]+))\.jsx\2     # quote (group 2), optional leading /,
                                      # the path without extension → group 4,
                                      # then “.jsx” and the same closing quote
        ;?\s+?                        # optional trailing semicolon
    """,
    re.VERBOSE | re.MULTILINE,
)

In [None]:
def process_code(code: str, lang: str, concept: str, desc: str):
    doc_id = hashlib.sha256("||".join([code, concept]).encode('utf8')).hexdigest()
    codeDoc = Document(page_content=code, id=doc_id, metadata={
        "lang": lang.lower(),
        "desc": desc,
        "word_count": len(code.split()),
        "token_count": (len(code) + 3) // 4, # Rough anthropic estimate
        "concept": concept,
    })
    snippets[codeDoc.id] = codeDoc
    return codeDoc
'''
def create_document(content: str, doc_path: str, metadata: dict):
    doc_id = hashlib.sha256("||".join([content, doc_path]).encode('utf8')).hexdigest()
    return Document(page_content=content, id=doc_id, metadata=metadata)
'''

def filter_jsx(content: str):
    # Let's start with something super dumb here.
    # Import is by far the most annoying token; the rest can be ignored for now.
    return jsx_regex.sub("", content)
def create_document(title: str, doc_chunks: List[str], path: str, references: List[str], ref_snippets: List[Dict], crumbs:List[str], children_nodes: Optional[List[str]] = None):
    crumb_str = ">>".join(crumbs)
    # Put the crumbs string at the top instead of title, so the
    # whole hierarchy participates in scoring.
    doc_content = filter_jsx(f"{crumb_str}\n\n" + " ".join(doc_chunks))
    doc_id = hashlib.sha256("||".join([doc_content, path]).encode('utf8')).hexdigest()
    file_url = pathlib.Path(*(pathlib.Path(path).parts[1:]))
    url_from = '/' + str(file_url.parent / file_url.stem)
    
    doc_meta = {
                    "concept": title,
                    "word_count": len(doc_content.split()),
                    "token_count": (len(doc_content) + 3) // 4, # Rough anthropic estimate
                    "from": path,
                    "url_from": url_from,
                    "child_nodes": children_nodes if children_nodes is not None else [],
                    "references": references,
                    "snippets": ref_snippets,
                    "crumbs": crumb_str
                }
    new_doc = Document(id = doc_id, page_content=doc_content, metadata=doc_meta)
    
    if  url_from in location_map:
        location_map[url_from].append(doc_id)
    else:
        location_map[url_from] = [doc_id]
    for snip_ref in ref_snippets:
        snip = snippets[snip_ref["id"]]
        snip.metadata["parent_doc"] = new_doc.id
        new_doc.metadata["token_count"] += (len(snip.page_content) + 3) // 4
        new_doc.metadata["word_count"] += len(snip.page_content.split())
        
    return new_doc

def add_to_hierarchy(doc_id: str, hierarchy: List[Document]):
    # Register the document as a child of every ancestor currently in the hierarchy
    for doc in hierarchy:
        doc.metadata['child_nodes'].append(doc_id)

def process_md(path: str, md_text: str, custom_title="", custom_crumbs=None, skip_top = False):
    raw_md = unicodedata.normalize("NFKC", md_text)
    #raw_md  = md_path.read_text(encoding="utf8")
    md = markdown.Markdown(extras=["metadata", "fenced-code-blocks", "highlightjs-lang"])
    text = md.convert(raw_md)
    markup = BeautifulSoup(text, "html.parser")
    title = custom_title if custom_title else md.metadata["title"].strip('"') if "title" in md.metadata else ""
    crumbs = [title] if title else []
    initial_crumbs = custom_crumbs if custom_crumbs else []
    heading_level = len(crumbs)
    chapter_length = 0
    references = []
    ref_snippets = []
    chapter_chunks = []
    last_added_text = ""
    doc_hierarchy = []
    new_docs = []
    new_snippets = []
    skip_set = set(["code", "h1","h2","h3","h4","h5","h6"])
    for node in markup.descendants:
        # skip text that is already extracted
        if isinstance(node, NavigableString):
            if node.parent and node.parent.name in skip_set:
                #print(f"Skipping {node.parent.name}")
                continue
            text = str(node).strip()
            if text:
                chapter_length += len(text) + int(chapter_length > 0)
                chapter_chunks.append(text)
            continue
        if isinstance(node, Tag) and (any(isinstance(c, str) and c.strip() for c in node.contents)):
            #print(node.text)
            # Potentially one of the heading tags h1 - h6
            if len(node.name) == 2 and node.name[0] == 'h':
                new_lvl = ord(node.name[1]) - ord('0') # 'h1'..'h6' map to levels 1..6
                # Skip two-letter tags starting with 'h' that are not headings
                if new_lvl < 1 or new_lvl > 6:
                    continue
                '''
                doc_content = f"{title}\n\n" + " ".join(chapter_chunks)
                #print(f"Covering concept {title}")
                file_url = pathlib.Path(*(pathlib.Path(path).parts[1:]))
                url_from = '/' + str(file_url.parent / file_url.stem)
                
                new_doc = create_document(doc_content, path, {
                    "concept": title,
                    "word_count": len(doc_content.split()),
                    "token_count": (len(doc_content) + 3) // 4, # Rough anthropic estimate
                    "from": path,
                    "url_from": url_from,
                    "child_nodes": [],
                    "references": references.copy(),
                    "snippets": ref_snippets.copy(),
                    "crumbs": ">>".join(crumbs)
                })

                # Backward snippets mapping
                for snip_ref in ref_snippets:
                    snip = snippets[snip_ref.id]
                    snip.metadata["parent_doc"] = new_doc.id
                # Add to a doc->[snippet obj] zipable list
                
                if  url_from in location_map:
                    location_map[url_from].append(new_doc.id)
                else:
                    location_map[url_from] = [new_doc.id]
                '''

                #print(new_doc)
                new_doc = create_document(
                    title,
                    doc_chunks=chapter_chunks,
                    path=path,
                    references=references,
                    ref_snippets=ref_snippets,
                    crumbs=initial_crumbs + crumbs
                )
                new_docs.append(new_doc)
                #docs.append(new_doc)
                
                title = node.get_text(strip=True, separator=" ").replace('"','').strip() # Strip quote from titles for escaping simplicity
                #print(title)
                #print(node)
                last_added_text = ""
                chapter_text = ""
                references = []
                ref_snippets = []
                chapter_chunks = []
                chapter_length = 0
                next_lvl = new_lvl > heading_level
                prev_lvl = new_lvl < heading_level
                #print(f"New{new_lvl}\n{heading_level}")
                #print(node)
                heading_level = new_lvl
                if next_lvl:
                    crumbs.append(title)
                    add_to_hierarchy(new_doc.id, doc_hierarchy)
                    doc_hierarchy.append(new_doc)
                elif prev_lvl: 
                    #print(crumbs)
                    #print(doc_hierarchy)
                    doc_hierarchy = doc_hierarchy[0:new_lvl - 1]
                    add_to_hierarchy(new_doc.id, doc_hierarchy)
                    doc_hierarchy.append(new_doc)
                    crumbs = crumbs[0:new_lvl - 1] + [title]
                else:
                    crumbs[-1] = title
                    add_to_hierarchy(new_doc.id, doc_hierarchy)
                    
                    '''
                    Hierarchy is for the parents.
                    If level is not changed, should not
                    touch it.
                    if len(doc_hierarchy) > 0:
                        doc_hierarchy[-1] = new_doc
                    else:
                        doc_hierarchy = [new_doc]
                    '''
                    
                #print(f"New crumbs {'>'.join(crumbs)} level {new_lvl}")
            elif node.name == "code":
                code_lang = node.get('class')
                code_text = node.text
                lang = ""
                
                if code_lang:
                    lang = code_lang[0]
                # Keep only one-liners with no language tag inline.
                # These are often terms and are useful in vector search.
                if len(code_text.split("\n")) <= 1 and not lang:
                    last_added_text = f"`{code_text.strip()}`"
                    if last_added_text:
                        chapter_chunks.append(last_added_text)
                        # Data chunk length + delimiter
                        chapter_length += len(last_added_text) + int(chapter_length > 0)
                else:
                    code_doc = process_code(code_text, lang, title, last_added_text)
                    ref_snippets.append({"id": code_doc.id, "pos": chapter_length})
 
            elif node.name == "a":
                ref = node.get('href')
                if ref:
                    references.append(ref)

    if len(title) > 0 or len(chapter_chunks) > 0:
        #print(f"Adding last paragraph {title}")
        new_doc = create_document(
            title,
            doc_chunks=chapter_chunks,
            path=path,
            references=references,
            ref_snippets=ref_snippets,
            crumbs= initial_crumbs + crumbs
        )
        add_to_hierarchy(new_doc.id, doc_hierarchy)
        new_docs.append(new_doc)
        
        
    
    return new_docs[1:] if skip_top else new_docs
    
def read_md_file(md_path: pathlib.Path, crumbs: Optional[List[str]] = None):
    doc_rel_path = md_path.relative_to(root_path)
    print(str(doc_rel_path))
    return process_md(str(doc_rel_path), md_path.read_text(encoding="utf-8"), custom_crumbs=crumbs)
    

In [None]:
docs = []

In [None]:
test_path = root_path / "docs-data/languages/func/known-issues.mdx"
crumbs = list(test_path.relative_to(root_path / "docs-data").parts)[:-1]
test_docs = read_md_file(test_path, crumbs=crumbs)

In [None]:
test_docs

In [None]:
#%%debug
def count_tokens(data: str):
    return (len(data) + 3) // 4
    
from core.rendering import render_docs_batch, render_single_doc
for doc in test_docs:
    if len(doc.metadata["snippets"]) > 0:
        #print(f"Before {doc.page_content}\n\n")
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        print(f"Raw data tokens: {doc.metadata.get('token_count', 0)}")
        rendered = render_single_doc(doc, doc.metadata['snippets'], snippets)
        print(f"Rendered token count: {count_tokens(rendered)}")
        print(f"{rendered}\n\n")

In [None]:
docs_path = root_path / "docs-data/"
for file in docs_path.rglob("*.mdx"):
    print(f"Processing {file}...")
    crumbs = list(file.relative_to(docs_path).parts)[:-1]
    docs.extend(read_md_file(file, crumbs=crumbs))

In [None]:
# Let's add standards documentation
TEPs = root_path / "TEPs/text/"
for file in TEPs.rglob("*.md"):
    print(f"Processing {file}...")
    docs.extend(read_md_file(file))

In [None]:
import json
from langchain_core.documents import Document
def dump_json_docs(documents: list[Document], path: str):
    with open(path, "w", encoding="utf8") as out_file:
        for doc in documents:
            out_file.write(doc.model_dump_json() + "\n")
def load_docs_json(path: str):
    documents = []
    with open(path, "r", encoding="utf8") as docs_input:
        for doc_str in docs_input:
            doc_parsed = json.loads(doc_str)
            documents.append(Document(id=doc_parsed['id'], page_content=doc_parsed['page_content'], metadata = doc_parsed['metadata']))
    return documents
            

In [None]:
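# Translate document references into document ids where possible
for doc in docs:
    resolved_refs = []
    for ref in doc.metadata["references"]:
        if ref in location_map:
            # Link points into the docs-data tree: use the ids of its chunks
            resolved_refs.extend(location_map[ref])
        else:
            # Unknown target: preserve the raw URL
            resolved_refs.append(ref)
    doc.metadata["references"] = resolved_refs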

In [None]:
def add_examples(examples_path: str, glob: str, concept: str, lang: str, label: str = "", doc_text: str = ""):
    code_refs = []
    code_docs = []
    doc_chunks = []
    # Hack, but whatever
    parsed_path = pathlib.Path(examples_path)
    if not parsed_path.is_absolute():
        parsed_path = root_path / parsed_path
    if doc_text:
        doc_chunks = [doc_text]
    '''
    top_leval_doc = create_document(f"{concept}\n", str(examples_path), {
        "concept": concept,
        "word_count": len(concept) + 1,
        "token_count": (len(concept) + 3 + 1) // 4,
        "from": str(parsed_path.relative_to(root_path)),
        "url_from": "",
        "child_nodes": code_refs,
        "references": [],
        "snippets": [],
        "crumbs": ">>".join([concept])
    })
    '''
    top_level_doc = create_document(concept,
                                    doc_chunks,
                                    str(parsed_path.relative_to(root_path)),
                                    references=[],
                                    ref_snippets=[],
                                    crumbs=[concept],
                                    children_nodes=code_refs
                                   )
    
    for doc in parsed_path.rglob(glob):
        with open(doc, "r", encoding="utf8") as example_file:
            print(f"File: {doc}")
            content = example_file.read()
            #print(f"Content:\n{content}")
            file_label = label
            if len(label) == 0:
                file_label = input("Provide description for the source:")
            if file_label.lower() == "skip":
                print(f"Skipping code {doc}")
                continue
            # Use the effective label, which may have been entered interactively
            source_doc = process_code(content, lang, concept, file_label)
            mention_doc = create_document(title = concept,
                                          doc_chunks = [],
                                          path=str(doc.relative_to(root_path)),
                                          references=[],
                                          ref_snippets=[{"id": source_doc.id, "pos": 0}],
                                          crumbs=[concept, file_label]
                                         )
            '''
            mention_doc = create_document(f"{label}\n", str(doc), {
                    "concept": file_label,
                    "word_count": len(file_label) + 1,
                    "token_count": (len(file_label) + 3 + 1) // 4, # Rough anthropic estimate
                    "from": str(doc.relative_to(root_path)),
                    "url_from": "",
                    "child_nodes": [],
                    "references": [],
                    "snippets": [{"id": source_doc.id, "pos": 0}],
                    "crumbs": ">>".join([concept,file_label])
            })
            '''
            code_docs.append(mention_doc)
            code_refs.append(mention_doc.id)
    return [top_level_doc, *code_docs]

In [None]:
#%%debug
jetton_examples = add_examples("tolk-contracts/contracts_Tolk/03_notcoin/", "*.tolk", "TOLK jetton contract example", "tolk")

In [None]:
jetton_examples

In [None]:
jetton_func = add_examples("tolk-contracts/contracts_FunC/03_notcoin/", "*.fc", "FunC jetton contract example", "FunC")

In [None]:
jetton_func

In [None]:
jetton_examples.extend(jetton_func)

In [None]:
nft_examples = add_examples("tolk-contracts/contracts_Tolk/02_nft/", "*.tolk", "TOLK NFT contract example", "tolk")

In [None]:
nft_examples.extend(add_examples("tolk-contracts/contracts_FunC/02_nft/","*.fc", "FunC NFT contract example", "FunC"))

In [None]:
vesting_examples = add_examples("tolk-contracts/contracts_Tolk/06_vesting/", "*.tolk", "TOLK Vesting implementation", "tolk")
vesting_examples.extend(add_examples("tolk-contracts/contracts_FunC/06_vesting/", "*.fc", "FunC Vesting implementation", "FunC"))

In [None]:
telemint_readme="""
# Telemint
This is the smart contract that Telegram intends to use in order to put some of its best usernames up for auction. The blockchain network for this smart contract is The Open Network (https://ton.org).

Anyone who finds serious security vulnerabilities in this smart contract prior to the auction launch will be rewarded.

## Description
There are two smart contracts in the repository: NftCollection and NftItem.

NftCollection source files: [nft-collection.fc](func/nft-collection.fc), [common.fc](func/common.fc) [stdlib.fc](func/stdlib.fc).

NftItem source files: [nft-item.fc](func/nft-item.fc), [common.fc](func/common.fc) [stdlib.fc](func/stdlib.fc).

One may also look at the [tlb decription](telemint.tlb) of internal messages and smart contract data.

There are also two additional smart contracts in the repository: NftCollectionNoDns and NftItemNoDns. They do not support DNS and allow to set additional restrictions on first bid.

NftCollectionNoDns source files: [nft-collection-no-dns.fc](func/nft-collection-no-dns.fc), [common.fc](func/common.fc) [stdlib.fc](func/stdlib.fc).

NftItemNoDns source files: [nft-item-no-dns.fc](func/nft-item-no-dns.fc), [common.fc](func/common.fc) [stdlib.fc](func/stdlib.fc).

### NftCollection

#### Internal messages
The first bidder receives a signed query from the server and sends it to NftCollection with the first bid attached.
```
// Create an NftItem and start an auction. Signed by auction's private key. Acts as a first bid in the auction.
telemint_unsigned_deploy$_ subwallet_id:uint32 valid_since:uint32 valid_till:uint32 token_name:TelemintText
  content:^Cell auction_config:^TeleitemAuctionConfig royalty_params:(Maybe ^NftRoyaltyParams) = TelemintUnsignedDeploy;
telemint_msg_deploy#4637289a  sig:bits512 msg:TelemintUnsignedDeploy = TelemintMsg;
```

The NftCollection interface is also supported.

#### External messages
The smart contract will accept the first external message to simplify the initialization of the smart contract.

### NftItem

#### Internal messages
The first bid is made through NftCollection, which will generate the following message.
```
// Create NftItem and start an auction. Accepted only from NftCollection.
teleitem_msg_deploy#299a3e15 sender_address:MsgAddressInt bid:Grams token_info:^TelemintTokenInfo nft_content:^Cell
  auction_config:^TeleitemAuctionConfig royalty_params:^NftRoyaltyParams = TeleitemMsg;
```

All following bids are simple transfers.

The owner of an NftItem may start a new auction.

```
// Start new auction. Accepted only from the owner.
teleitem_msg_start_auction#487a8e81 query_id:int64 auction_config:^TeleitemAuctionConfig = TeleitemMsg;

// Cancel auction auction. Accepted only from the owner. Forbidden if there are some active bids
teleitem_msg_cancel_auction#371638ae query_id:int64 = TeleitemMsg;
```

The NftItem interface is also supported, including transfer messages.

#### External messages
To finish a completed auction, one may send an empty message.
"""

In [None]:
telemint_examples = add_examples("tolk-contracts/contracts_FunC/07_telemint", "*.fc", "Telegram gift (telemint) contract", "FunC", doc_text = telemint_readme)

In [None]:
tests_examples = add_examples("tolk-contracts/tests", "*.ts", "Unit tests examples", "typescript", doc_text="Collection of unit tests examples")

In [None]:
wrapper_examples = add_examples("tolk-contracts/wrappers", "*.ts", "Contract wrappers examples", "typescript", doc_text="Examples of TypeScript wrapper functions for various contracts")

In [None]:
# Does it make sense to add tolk implementation here? Probably not.
wallet_examples = add_examples("tolk-contracts/contracts_FunC/05_wallet-v5/", "*.fc", "FunC wallet v5 (W5) implementation", "FunC")

In [None]:
src_examples = jetton_examples + tests_examples + wrapper_examples + nft_examples + vesting_examples + wallet_examples + telemint_examples
dump_json_docs(src_examples, "../rag-data/src_examples.jsonl")

In [None]:
docs[1881]

In [None]:
dump_json_docs(docs, "../rag-data/docs_dump.jsonl")

In [None]:
dump_json_docs(snippets.values(), "../rag-data/snippets_dump.jsonl")

In [None]:
#%%debug
import sys

from utils.json import load_json_dict
import re
def fix_instruct_paragraphs(paragraphs: List[str], instruction: str):
    header_re = re.compile(r"^(#+)")
    new_paragraphs = []
    hashes_added = 0
    paragraph_count = 0
    for p in paragraphs:
        #print(p)
        header_match = header_re.match(p)
        if header_match:
            hashes = header_match.group()
            total_hashes = len(hashes)
            if hashes_added == 0 and total_hashes < 3:
                hashes_added = 3 - total_hashes
            if paragraph_count > 0:
                new_paragraphs.append(("#" * hashes_added) + p)
            else:
                # Replace the first paragraph with a unique and more sensible one
                new_paragraphs.append(f"### {instruction}\n instruction specification")
                # Make all the later paragraphs child to this one
                hashes_added = hashes_added + 1
            paragraph_count = paragraph_count + 1
    return new_paragraphs
def load_instructions():    
    instructions_desc = load_json_dict("../rag-data/instructions_desc.json")
    instructions_docs  = []
    for instr_cat in instructions_desc:
        for instruction in instructions_desc[instr_cat]:
            new_paragraphs = fix_instruct_paragraphs(instructions_desc[instr_cat][instruction].split("\n\n"), instruction)
            instructions_desc[instr_cat][instruction] = new_paragraphs
            cat_title = f"{instr_cat} instructions"
            instructions_docs.extend(
                process_md("tvm-specification.json",
                           "\n\n".join(new_paragraphs),
                           custom_title=instr_cat,
                           custom_crumbs=['TVM instructions', cat_title],
                           skip_top=True # Skip top level cat since there are hundreds of instruction in a category. Never pull them all at once
                          )
            )
            #raise "Testing"
    
    return instructions_docs

In [None]:
src_examples = load_docs_json(root_path / "rag-data/src_examples.jsonl")
old_docs = load_docs_json(root_path / "rag-data/docs_dump.jsonl")
old_snippets = load_docs_json(root_path / "rag-data/snippets_dump.jsonl")
docs = []
snippets = {}
for snip in old_snippets:
    snippets[snip.id] = snip

In [None]:
src_examples[0]

In [None]:
instructions_docs = load_instructions()

In [None]:
dump_json_docs(instructions_docs, root_path / "rag-data/instructions_documents.jsonl")
dump_json_docs(snippets.values(), root_path / "rag-data/instructions_snippets.jsonl")

In [None]:
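old_docs.extend(src_examples)
old_docs.extend(instructions_docs)
# `snippets` was seeded from old_snippets, so it already holds both the old and the new snippets.
# Rebuild the list from it; extending old_snippets would duplicate every old entry.
old_snippets = list(snippets.values())
dump_json_docs(path=root_path / "rag-data/latest_docs.jsonl", documents=old_docs)
dump_json_docs(path=root_path / "rag-data/latest_snippets.jsonl", documents=old_snippets)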

In [None]:
len(old_snippets)

# Simple documents update pipeline

Documents are stored as newline-separated JSON objects (*jsonl*).
This means they can be edited in place with a text editor, and the index can be rebuilt afterwards.

However, this approach won't update document hashes (ids) and other derived attributes unless you do it manually.

The snippets below process documents in a more consistent fashion.
Someday there will be an admin panel, but that day is yet to come.
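
For reference, the on-disk format can be round-tripped with the standard library alone. The sketch below uses plain dictionaries rather than `Document` objects; `write_jsonl` and `read_jsonl` are hypothetical helpers, and the record is made-up sample data.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    # One JSON object per line; json.dumps escapes embedded newlines.
    with open(path, "w", encoding="utf8") as out_file:
        for record in records:
            out_file.write(json.dumps(record) + "\n")


def read_jsonl(path):
    with open(path, "r", encoding="utf8") as in_file:
        return [json.loads(line) for line in in_file if line.strip()]


records = [{"id": "a", "page_content": "A>>B\n\nbody", "metadata": {"crumbs": "A>>B"}}]
path = os.path.join(tempfile.mkdtemp(), "docs.jsonl")
write_jsonl(records, path)
print(read_jsonl(path) == records)  # True
```

Editing such a file by hand changes `page_content` but not the stored `id`, which is why the ids are recalculated afterwards.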

In [None]:
def recalc_doc(doc: Document, force_meta_crumbs: bool = False):
    orig_crumbs = doc.metadata["crumbs"]
    new_crumbs  = orig_crumbs.split(">>")
    new_content = doc.page_content
    crumb_split = doc.page_content.find("\n\n")
    if crumb_split >= 0:
        crumbs_found = new_content[:crumb_split]
        # If force_meta_crumbs is set, the crumbs from the text are overwritten by those from the metadata.
        # Otherwise, crumbs found in the actual document text update the metadata.
        if crumbs_found != orig_crumbs and (not force_meta_crumbs):
            #print(f"Crumbs changed {orig_crumbs} \n New: {crumbs_found}")
            new_crumbs = crumbs_found.split(">>")
        new_content = new_content[crumb_split + 2:]
        
    new_doc = create_document(
                    doc.metadata["concept"],
                    doc_chunks=[new_content],
                    path=doc.metadata["from"],
                    references=doc.metadata["references"],
                    ref_snippets=doc.metadata["snippets"],
                    crumbs=new_crumbs,
                    children_nodes=doc.metadata["child_nodes"]
                )
    return (new_doc, doc.id != new_doc.id)

def recalc_batch(input_docs: list[Document]):
    for idx, doc in enumerate(input_docs):
        new_doc = recalc_doc(doc)
        if new_doc[1]:
            yield (idx, new_doc[0])
            

In [None]:
cur_docs = load_docs_json(root_path / "rag-data/latest_docs.jsonl")
cur_snippets = load_docs_json(root_path / "rag-data/latest_snippets.jsonl")

snippets = {}

for snip in cur_snippets:
    snippets[snip.id] = snip

In [None]:
# Do whatever you want with the documents

In [None]:
# Re-calculate the hashes and such
new_docs = [recalc_doc(doc)[0] for doc in cur_docs]    

In [None]:
dump_json_docs(path=root_path / "rag-data/updated_docs.jsonl", documents=new_docs)