# nLab in plain text

The aim of this notebook is to convert the nLab page source into plain text so that it can be fed into standard NLP tools.

In [1]:
import JSON
using JSON3
using DataFrames

skip_pages = (
    "Timeline of category theory and related mathematics",
    "AUTOMATH",
)

cd("/NetMath/nLab2024/2024")
pages = open(JSON.parse,"nlab_scrape.json","r")
print("All pages: ", length(pages),"\n")
filter!(page -> page["name"] ∉ skip_pages, pages)
pages = Dict(pop!(page, "name") => page for page in pages)
print("Filtered pages: ", length(pages),"\n")

All pages: 12912
Filtered pages: 12895
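
Note that the second count is taken after the list has been re-keyed by page name, so it reflects both the skipped pages and any entries that share a name. A minimal sketch of that behavior on made-up records (the names and sources below are illustrative, not taken from the scrape):

```julia
# Hypothetical records; the real ones come from nlab_scrape.json.
sample_pages = [
    Dict("name" => "rig", "source" => "v1"),
    Dict("name" => "AUTOMATH", "source" => "x"),
    Dict("name" => "rig", "source" => "v2"),
]
sample_skip = ("AUTOMATH",)

# Step 1: drop the skipped pages.
filter!(page -> page["name"] ∉ sample_skip, sample_pages)
n_after_filter = length(sample_pages)

# Step 2: re-key by name. Entries with the same name collapse into one key,
# and the later entry overwrites the earlier one.
by_name = Dict(pop!(page, "name") => page for page in sample_pages)

println("After filter: ", n_after_filter)
println("After re-keying: ", length(by_name))
```

Printing `length(pages)` between the `filter!` call and the `Dict` construction would separate the two effects.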


We will use [pandoc](https://pandoc.org/) to convert the Markdown to plain text.

In [2]:
function markdown_to_plain(s)
    io = IOBuffer()
    open(`pandoc --quiet --from markdown --to plain`, io, write=true) do pd
        write(pd, s)
    end
    String(take!(io))
end

markdown_to_plain("A **sentence** with _some_ [Markdown formatting](https://www.markdownguide.org)")

"A sentence with some Markdown formatting\r\n"

Besides Markdown formatting, the nLab supports LaTeX math and wiki syntax, specifically that of Instiki. We strip all LaTeX math that Pandoc does not convert, and all wiki syntax except page links, which are reduced to plain text.

In [3]:
""" Strip LaTeX math, display and inline.
"""
function strip_latex_math(s)
    s = replace(s, r"\$\$(.*?)\$\$"s => "")
    s = replace(s, r"\$(.*?)\$"s => "")
    s = replace(s, r"\\\[(.*?)\\\]"s => "")
end

""" Strip Instiki commands such as includes, redirects, and ToCs.
"""
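function strip_wiki_commands(s)
    # Drop every line that starts with block or command markup.
    s = join(filter(split(s, "\n")) do line
        !any(startswith(lstrip(line), prefix)
             for prefix in ("+--", "=--", "{:", "{#", "[[!"))
    end, "\n")
    # Remove anchors of the form {#label} that remain inside a line.
    s = replace(s, r"\{#(.*?)\}" => "")
end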

""" Replace wiki page links with plain text.
"""
function replace_page_links(s)
    # Links of form [[page name|displayed text]].
    s = replace(s, r"\[\[([^\]]*?)\|(.*?)\]\]" => s"\2")
    # Links of form [[page name]]
    s = replace(s, r"\[\[(.*?)\]\]" => s"\1")
end
    
nlab_to_plain(source) = source |> strip_wiki_commands |> replace_page_links |> 
    markdown_to_plain |> strip_latex_math 

nlab_to_plain (generic function with 1 method)
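
To make the effect of the substitutions concrete, here is a minimal sketch that applies the same regular expressions as `replace_page_links` and `strip_latex_math` to a single made-up line. The sample text is an assumption for illustration, and the Pandoc step is left out so that the snippet runs without external tools.

```julia
# Made-up sample line; not taken from the nLab.
sample = raw"A [[rig]] $R$ is like a [[ring|unital ring]]: $$a + b = b + a$$ holds."

# Page links: handle the piped form [[page|text]] before the plain form [[page]].
linked = replace(sample, r"\[\[([^\]]*?)\|(.*?)\]\]" => s"\2")
linked = replace(linked, r"\[\[(.*?)\]\]" => s"\1")

# Math: remove display math ($$...$$) before inline math ($...$).
plain = replace(linked, r"\$\$(.*?)\$\$"s => "")
plain = replace(plain, r"\$(.*?)\$"s => "")

println(plain)
```

The order of the substitutions matters in both pairs. If the plain-link pattern ran first, it would turn `[[ring|unital ring]]` into `ring|unital ring`. If the inline-math pattern ran first, it would match each `$$` delimiter as an empty formula and leave the display formula's body in the text. Removed formulas leave doubled spaces behind, which are only collapsed later when whitespace is normalized.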

Run this pipeline on the nLab corpus.

In [4]:
using ProgressMeter
ProgressMeter.ijulia_behavior(:clear)

prog = Progress(length(pages))
for (name, page) in pairs(pages)
    ProgressMeter.next!(prog, showvalues=[(:name, name)])    
    page["plain"] = nlab_to_plain(page["source"])
end

Progress: 100%|█████████████████████████████████████████| Time: 0:12:32
  name:  formal deformation quantization


In [5]:
# Collect the non-empty pages as records, normalizing whitespace and hyphens.
records = []
for (name, page) in pairs(pages)
    if !isempty(page["plain"])
        rec = page["plain"]
        rec = replace(rec, r"\s+" => " ")   # also collapses \r\n, \n, and \r
        rec = replace(rec, "-" => " ")
        push!(records, Dict("context" => rec, "title" => name))
    end
end

# Let JSON.print handle escaping and separators: writing the file by hand
# leaves a trailing comma and does not escape quotes or backslashes.
open("nlab_clean.json", "w") do clean
    JSON.print(clean, records, 2)
end
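
One way to check that an exported file is well-formed is to parse it back. The following is a minimal sketch on made-up records, assuming JSON.jl's `JSON.print` and `JSON.parsefile`; the names `sample_records`, `path`, and `loaded` are hypothetical, and a temporary file stands in for `nlab_clean.json`.

```julia
import JSON

# Made-up record whose text contains characters that must be escaped in JSON.
sample_records = [
    Dict("title" => "rig",
         "context" => "A \"rig\" is a ring without negatives"),
]

# Write to a temporary file rather than touching nlab_clean.json.
path = tempname()
open(path, "w") do io
    JSON.print(io, sample_records, 2)
end

# Parsing succeeds only if the file is valid JSON.
loaded = JSON.parsefile(path)
println(loaded[1]["context"] == sample_records[1]["context"])
```

The same `JSON.parsefile` call on `nlab_clean.json` would raise an error if the file contained a trailing comma or an unescaped quote.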

## Examples

In [6]:
pages["rig"]["plain"] |> println

###Context### #### Algebra

#Contents# * table of contents

Idea

Rigs and rig homomorphisms form the category Rig.

Definition

We consider rigs as having an additive unit 0, a multiplicative unit 1
and being such that 0.x = x.0 = 0, as discussed in the entry rig.

We recall that a rig homomorphism f: R → S is a function which is a
monoid homomorphism for both the additive underlying monoid and the
multiplicative underlying monoid.

Properties

Related concepts

-   Ring, CRing



In [7]:
pages["locally posetal 2-category"]["plain"] |> println

#Contents# * automatic table of contents goes here

##Definition

A 2-category C is locally posetal or locally partially ordered or
Pos-enriched if every hom-category C(x, y) is a poset - an object of the
category Pos of partial orders. One can also consider a locally
preordered 2-category, where every hom-category is a proset (a
preordered set); up to equivalence of 2-categories, these aren't any
more general.

Locally posetal 2-categories are the usual model of 2-posets, aka
(1,2)-categories. Just as the motivating example of a 2-category is the
2-category Cat of categories, so the motivating example of a 2-poset is
the 2-poset Pos of posets. If you interpret  as a full
sub-2-category of , then it is indeed locally posetal. Similarly,
the 2-category of prosets is a locally preordered 2-category that is
equivalent to Pos.

Compare the notion of partially ordered category. A locally partially
ordered category is a category enriched over the category Pos of posets,
while a partially ord

In [8]:
pages["hypergraph category"]["plain"] |> println

Context

Category theory

Graph theory

Hypergraph categories

-   table of contents

Idea

A hypergraph category is a monoidal category whose string diagrams are
hypergraphs. Recall that in general the vertices of a string diagram
correspond to morphisms in a category, and its edges to objects. An
ordinary string diagram is a directed graph, where the inputs and
outputs of a vertex describe objects appearing in a tensor
product-decomposition of the domain and codomain of a morphism; each
edge is connected to only one vertex as input and one vertex as output
because of how morphisms in a category are composed. A hypergraph
category allows edges to connect to many vertices as input and many
vertices as output, which category theoretically means that we may
compose many morphisms containing an object in their codomain with many
morphisms containing that object in their domain.

Hypergraph categories have been reinvented many times and given many
different names, such as “well-supported c