# Python source

> Helpers to create and use llms.txt files

In [1]:
#| default_exp core

In [2]:
#| export
import re

In [3]:
#| hide
from nbdev.showdoc import *
import nbdev; nbdev.nbdev_export()

In [4]:
#| export
from fastcore.utils import *
from fastcore.xml import *
from fastcore.script import *
import httpx

## Introduction

The llms.txt spec covers files located at the path `llms.txt` of a website (or, optionally, in a subpath). `llms-sample.txt` is a simple example. A file following the spec contains the following sections as markdown, in this order:

- An H1 with the name of the project or site. This is the only required section
- A blockquote with a short summary of the project, containing key information necessary for understanding the rest of the file
- Zero or more markdown sections (e.g. paragraphs, lists, etc) of any type, except headings, containing more detailed information about the project and how to interpret the provided files
- Zero or more markdown sections delimited by H2 headers, containing "file lists" of URLs where further detail is available
  - Each "file list" is a markdown list in which every item contains a required markdown hyperlink `[name](url)`, optionally followed by a `:` and notes about the file.
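
As a rough sketch of that layout, the snippet below builds a tiny llms.txt document as a Python string and checks its heading structure. The project name, the URLs, and the `mini` variable are made up for illustration and are not part of the sample file used in the rest of this notebook.

```python
# Hypothetical minimal llms.txt document; names and URLs are placeholders.
mini = """# Example Project

> One-line summary of the project.

Free-form details go here, with no headings.

## Docs

- [Quick start](https://example.com/quickstart.md): Getting started notes
- [API reference](https://example.com/api.md)

## Optional

- [Changelog](https://example.com/changelog.md)
"""

lines = mini.splitlines()
h1 = [l[2:] for l in lines if l.startswith('# ')]    # the single H1: the project name
h2 = [l[3:] for l in lines if l.startswith('## ')]   # one H2 per file list
print(h1, h2)
```

Only the H1 line is required; everything after it could be omitted and the file would still follow the spec.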

Here's the start of a sample llms.txt file we'll use for testing:

In [5]:
samp = Path('llms-sample.txt').read_text()
print(samp[:480])

# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

FastHTML is written by Answer.AI, an organization which follows the fast.ai style guide instead of PEP 8, so most examples follow fast.ai style.

## Docs

- [FastHTML quick start](https://docs.fastht.ml/tutorials/quickstart_for_web_devs.html.md): A brief overview of many FastHTML feature


## Reading

We'll implement `parse_llms_file` to pull out the sections of llms.txt into a simple data structure.

In [7]:
#| export
def _opt_re(s): return f'(?:{s})?'

def _parse_llms_txt(txt):
    pat = r"^#\s*(?P<title>[^\n]+)\n+"
    pat += _opt_re(r"^>\s*(?P<summary>.+?)\n+")
    pat += r"(?P<rest>.*)"
    match = re.search(pat, txt, flags=(re.DOTALL | re.MULTILINE))
    return match.groupdict() if match else None

In [8]:
match = _parse_llms_txt(samp)
match['summary']

'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'
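
To see why the summary group is wrapped in `_opt_re`, here is a small self-contained sketch that rebuilds the same pattern and runs it on made-up inputs, with and without a blockquote. The `parse_header` name and the input strings are illustrative assumptions, not part of the exported module.

```python
import re

def parse_header(txt):
    # Same three pieces as above: required H1, optional blockquote, then everything else.
    pat = r"^#\s*(?P<title>[^\n]+)\n+"
    pat += r"(?:^>\s*(?P<summary>.+?)\n+)?"
    pat += r"(?P<rest>.*)"
    m = re.search(pat, txt, flags=re.DOTALL | re.MULTILINE)
    return m.groupdict() if m else None

with_summary = parse_header("# Demo\n\n> A short summary.\n\nDetails.\n")
no_summary = parse_header("# Demo\n\nDetails only.\n")
no_title = parse_header("No heading here.\n")

print(with_summary['summary'])  # the blockquote text, without the leading '>'
print(no_summary['summary'])    # None: the optional group did not match
print(no_title)                 # None: the H1 is required
```

When the blockquote is missing, the text that would have followed it simply ends up in `rest`.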

In [9]:
#| export
def _split_on_h2(text):
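    # Split only where a line starts with '## ', so '###' headings and inline '## ' are left alone
    parts = re.split(r'^## ', text, flags=re.MULTILINE)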
    details = parts[0].strip() if parts[0].strip() else None
    sections = [f"## {p.strip()}" for p in parts[1:] if p.strip()]
    return details, sections

In [10]:
rest = match['rest']
details,sections = _split_on_h2(rest)
details

'FastHTML is written by Answer.AI, an organization which follows the fast.ai style guide instead of PEP 8, so most examples follow fast.ai style.'
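
The splitting step can also be tried in isolation. This is a minimal sketch with a hypothetical `split_on_h2` helper and invented input. It anchors the pattern to the start of a line, on the assumption that `###` subheadings should stay inside their section rather than start a new one.

```python
import re

def split_on_h2(text):
    # Split at every line that begins with '## '; '### ' lines do not match.
    parts = re.split(r'^## ', text, flags=re.MULTILINE)
    details = parts[0].strip() or None
    sections = [f"## {p.strip()}" for p in parts[1:] if p.strip()]
    return details, sections

demo_details, demo_sections = split_on_h2(
    "Intro text.\n\n"
    "## Docs\n\n- [A](https://example.com/a.md)\n\n"
    "## Examples\n\n- [B](https://example.com/b.md)\n"
)
print(demo_details)
print(demo_sections)

# With no text before the first H2, details is None.
no_details, _ = split_on_h2("## Docs\n\n- [A](https://example.com/a.md)\n")
print(no_details)
```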

In [11]:
#| export
def _parse_section(section):
    title = section.split('\n', 1)[0].strip('# ')
    links = re.findall(r'\[(.+?)\]\((.+?)\)(?:: (.+?))?(?=\n|$)', section)
    return title, [(t, u, d.strip() if d else None) for t, u, d in links]

In [12]:
sections_dict = dict(_parse_section(s) for s in sections)
sections_dict['Examples']

[('Todo list application',
  'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py',
  'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.')]
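
The notes after the `:` are optional, which the sample output above does not show. The sketch below reuses the same regex in a standalone `parse_section` function (a renamed copy for illustration) on an invented section where one item has notes and one does not.

```python
import re

def parse_section(section):
    # The heading is the first line, stripped of '#' and spaces.
    title = section.split('\n', 1)[0].strip('# ')
    # Each item is [name](url), optionally followed by ': notes' up to the end of the line.
    links = re.findall(r'\[(.+?)\]\((.+?)\)(?:: (.+?))?(?=\n|$)', section)
    return title, [(t, u, d.strip() if d else None) for t, u, d in links]

demo_title, demo_links = parse_section(
    "## Docs\n\n"
    "- [Quick start](https://example.com/qs.md): Getting started notes\n"
    "- [API reference](https://example.com/api.md)"
)
print(demo_title)
for link in demo_links: print(link)
```

`re.findall` returns an empty string for a group that did not take part in the match, which is why the list comprehension converts falsy notes to `None`.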

In [13]:
#| export
def parse_llms_file(txt):
    parsed = _parse_llms_txt(txt)
    if not parsed: return None
    parsed['details'], sections = _split_on_h2(parsed['rest'])
    parsed['sections'] = dict(_parse_section(s) for s in sections)
    del parsed['rest']
    return dict2obj(parsed)

In [14]:
llmsd = parse_llms_file(samp)
llmsd.summary

'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'

In [15]:
llmsd.sections.Examples

(#1) [('Todo list application', 'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py', 'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.')]

## XML conversion

Some LLMs, such as Claude, work well with XML-formatted context, so we'll provide functions to create that format.

In [16]:
#| export
Sections = partial(ft, 'sections')
Project = partial(ft, 'project')

In [17]:
#| export
def Doc(url, **kw):
    "Create a `Doc` FT object with the text retrieved from `url` as the child, and `kw` as attrs."
    re_comment = re.compile('^<!--.*-->$', flags=re.MULTILINE)
    txt = [o for o in httpx.get(url).text.splitlines() if not re_comment.search(o)]
    return ft('doc', '\n'.join(txt), **kw)
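
`Doc` needs network access, but its comment filter can be checked offline. The sketch below applies the same regex to a made-up string (`raw` is an invented stand-in for a fetched page) and shows that only lines consisting entirely of an HTML comment are dropped.

```python
import re

re_comment = re.compile(r'^<!--.*-->$', flags=re.MULTILINE)

# Invented page text: one comment-only line and one line with an inline comment.
raw = "# Title\n<!-- generated file, do not edit -->\nBody <!-- inline --> text\n"
kept = [o for o in raw.splitlines() if not re_comment.search(o)]
print(kept)
```

Inline comments survive because the pattern is anchored to both ends of the line.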

In [18]:
#| export
def Section(nm, items):
    "Create a `Section` FT object containing a `Doc` object for each child."
    return ft(nm, *[Doc(title=title, url=url, detl=detl) for title,url,detl in items])

In [19]:
#| export
def mk_ctx(d, optional=True):
    "Create a `Project` with a `Section` for each H2 part in `d`, optionally skipping the 'optional' section."
    skip = '' if optional else 'Optional'
    sections = [Section(k, v) for k,v in d.sections.items() if k!=skip]
    return Project(title=d.title, summary=d.summary, details=d.details)(*sections)
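
The `optional` flag is easy to misread, so here is the filtering logic on its own. `keep_sections` is a hypothetical helper that mirrors the `skip` comparison above on a plain list of section names; no documents are fetched.

```python
def keep_sections(names, optional=True):
    # With optional=True, skip is '' and matches no real section name, so nothing is dropped.
    skip = '' if optional else 'Optional'
    return [k for k in names if k != skip]

names = ['Docs', 'Examples', 'Optional']
print(keep_sections(names))                  # all three sections
print(keep_sections(names, optional=False))  # 'Optional' is dropped
```

The comparison is case-sensitive, so a section titled `optional` would not be skipped.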

In [26]:
ctx = mk_ctx(llmsd)
print(to_xml(ctx, do_escape=False)[:490])

<project title="FastHTML" summary='FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore&#39;s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.' details="FastHTML is written by Answer.AI, an organization which follows the fast.ai style guide instead of PEP 8, so most examples follow fast.ai style.">
  <docs>
    <doc title="FastHTML quick start" detl="A brief overview of many FastHTML features"># Web Devs Quickstar
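
The nesting in that output is a `project` element holding one element per section, each holding `doc` elements. As an illustration of the shape only, the sketch below rebuilds it with the standard library's `xml.etree.ElementTree` instead of fastcore's `ft`, using invented titles and text and making no network requests. Unlike `to_xml(..., do_escape=False)`, ElementTree escapes special characters in text and attributes.

```python
import xml.etree.ElementTree as ET

# Placeholder values; in mk_ctx these come from the parsed llms.txt file.
project = ET.Element('project', title='Example Project', summary='One-line summary.')
docs = ET.SubElement(project, 'docs')  # one element per H2 section
doc = ET.SubElement(docs, 'doc', title='Quick start', detl='Getting started notes')
doc.text = '# Quick start\n\nText fetched from the linked URL would go here.'

xml_str = ET.tostring(project, encoding='unicode')
print(xml_str)
```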


In [30]:
#| export
def get_sizes(ctx):
    "Get the size in characters of each doc, grouped by section of the LLM context"
    return {o.tag:{p.title:len(p.children[0]) for p in o.children} for o in ctx.children}

In [29]:
get_sizes(ctx)

{'docs': {'FastHTML quick start': 21997,
  'HTMX reference': 26427,
  'Starlette quick guide': 7936},
 'examples': {'Todo list application': 18588},
 'optional': {'Starlette full documentation': 48331}}
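
One way to use these numbers is to total them per section, for example to judge how much the 'optional' section adds. The sketch below hard-codes the dictionary printed above (`sizes` and `totals` are illustrative names) rather than calling `get_sizes`, so it runs without fetching anything.

```python
# Values copied from the get_sizes(ctx) output above.
sizes = {
    'docs': {'FastHTML quick start': 21997, 'HTMX reference': 26427, 'Starlette quick guide': 7936},
    'examples': {'Todo list application': 18588},
    'optional': {'Starlette full documentation': 48331},
}

# Sum the document sizes within each section, then across all sections.
totals = {sec: sum(docs.values()) for sec, docs in sizes.items()}
grand_total = sum(totals.values())
print(totals)
print(grand_total)
```

In this sample the 'optional' section accounts for 48331 of the 123279 characters of document text, roughly 39%.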

In [31]:
Path('fasthtml.md').write_text(to_xml(ctx, do_escape=False))

124335

In [40]:
#| export
@call_parse
def llms_txt2ctx(
    fname:str, # File name to read
    optional:bool_arg=True # Include 'optional' section?
):
    "Print a `Project` with a `Section` for each H2 part in file read from `fname`, optionally skipping the 'optional' section."
    d = parse_llms_file(Path(fname).read_text())
    ctx = mk_ctx(d, optional=optional)
    print(to_xml(ctx, do_escape=False))

In [43]:
!llms_txt2ctx llms-sample.txt > fasthtml.md

## Export -

In [44]:
#|hide
#|eval: false
from nbdev import nbdev_export
nbdev_export()