# Convert mmark TeX into Pandoc TeX

As part of converting [Skeptric](https://skeptric.com/) from Hugo mmark to Quarto we need to make changes to how TeX equations are represented.

The (deprecated) version of mmark in Hugo uses an unusual syntax for TeX.
It's not documented, but some empirical rules for mmark are:
- `$$...$$` inside a paragraph starts inline math (even with whitespace surrounding ...)
- `$$...$$` after a paragraph starts a math block (even with whitespace surrounding ...)
- A `$` sign not followed by another `$` sign is just a normal `$` sign (a `\$` should also render as a normal `$` sign)
- Math isn't rendered in inline code/code blocks


In Pandoc the syntax is [documented](https://pandoc.org/MANUAL.html#math):

> Anything between two `$` characters will be treated as TeX math. The opening `$` must have a non-space character immediately to its right, while the closing `$` must have a non-space character immediately to its left, and must not be followed immediately by a digit. Thus, \\$20,000 and \\$30,000 won’t parse as math. If for some reason you need to enclose text in literal $ characters, backslash-escape them and they won’t be treated as math delimiters.
> For display math, use `$$` delimiters. (In this case, the delimiters may be separated from the formula by whitespace. However, there can be no blank lines between the opening and closing `$$` delimiters.)

In summary:

- `$...$` delimits inline TeX (with no space allowed just inside the delimiters)
- `$$...$$` delimits a math block
- A `\$` sign is rendered as a normal `$` sign
- Math isn't rendered in inline code/code blocks
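
To make the inline rule concrete, here is a rough regular-expression approximation of the quoted Pandoc rule. It is only a sketch: the names `INLINE_MATH` and `inline_math_spans` are made up for illustration, it is not how Pandoc itself parses, and it ignores code spans and math that spans several lines.

```python
import re

# Rough approximation of Pandoc's inline math rule (not Pandoc's real parser).
INLINE_MATH = re.compile(
    r"(?<![\\$])\$(?![\s$])"    # opening $: not escaped, not part of $$, not followed by a space
    r"(.+?)"                    # the formula itself (single line only)
    r"(?<![\s\\$])\$(?![\d$])"  # closing $: not preceded by a space or backslash, not followed by a digit
)

def inline_math_spans(text):
    """Return the formulas this approximation would treat as inline math."""
    return [m.group(1) for m in INLINE_MATH.finditer(text)]

print(inline_math_spans("And $x=2$"))            # ['x=2']
print(inline_math_spans("And $ x = 2 $"))        # []
print(inline_math_spans("$20,000 and $30,000"))  # []
print(inline_math_spans(r"\$20 and \$30"))       # []
```

The second case is why the mmark form `$$ x = 2 $$` needs its inner whitespace stripped when it becomes Pandoc inline math.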

## Tests

The result should be a function that takes mmark code and returns pandoc code.

Since there is a set of rules, the best way to check the implementation is with some examples.
Each `Example` will have a descriptive name, the `mmark` input and the expected `pandoc` output.

In [1]:
from dataclasses import dataclass

@dataclass
class Example:
    name: str
    mmark: str
    pandoc: str

We'll generate a bunch of examples that satisfy the above rules.

Sometimes there are multiple possibilities, like with `$20,000 to $30,000`, but we will just pick a simple rule to transform them (escaping *every* `$`).

There are a bunch of other cases we won't check (like [indented code blocks](https://spec.commonmark.org/0.30/#indented-code-blocks) and [HTML blocks](https://spec.commonmark.org/0.30/#html-blocks)) because they don't occur in the Skeptric code.

In [2]:
examples = [
    Example("Inline",
            "And $$x=2$$",
            "And $x=2$"),
    
    Example("Inline Space",
            "And $$ x = 2 $$",
            "And $x = 2$"),
    
    Example("Block",
           "And\n\n$$x=2$$\n",
           "And\n\n$$x=2$$\n"),
    
    Example("Block space",
            "And\n\n$$ x = 2 $$\n",
            "And\n\n$$x = 2$$\n"),
    
    Example("Block multiline",
            r"""
$$\begin{align}
& \text{maximize}   && \mathbf{c}^\mathrm{T} \mathbf{x}\\
& \text{subject to} && A \mathbf{x} \le \mathbf{b}, \\
&  && \mathbf{x} \ge \mathbf{0}, \\
\end{align}
$$
""",
                       r"""
$$\begin{align}
& \text{maximize}   && \mathbf{c}^\mathrm{T} \mathbf{x}\\
& \text{subject to} && A \mathbf{x} \le \mathbf{b}, \\
&  && \mathbf{x} \ge \mathbf{0}, \\
\end{align}
$$
"""),
    
    Example("Literal $", "It costs $20,000", r"It costs \$20,000"),
    
    Example("Two Literal $", "$20,000 to $30,000", r"\$20,000 to \$30,000"),
    
    Example("Inline code", "And `$x+=1`", "And `$x+=1`"),
    
    Example("Inline code double $", "As TeX `$$x=2$$`", "As TeX `$$x=2$$`"),
    
    Example("Inline code with escape", r"And `\$x=2`", r"And `\$x=2`"),
    
    Example("Fenced code",
            """
```
$x+=1
```
            """,
                        """
```
$x+=1
```
            """),
    
    Example("Fenced code double $",
            """
```latex
$$x==2$$
```
            """,
            """
```latex
$$x==2$$
```
            """),
    
    Example("Indented code blocks",
            "\n" + r"    %>% mutate_if(is.character, function(x) gsub('\\$', '\\\\$', x))",
            "\n" + r"    %>% mutate_if(is.character, function(x) gsub('\\$', '\\\\$', x))"),
    
    Example("After intended code blocks",
            "Like so\n    $x = 2\nfor $30",
            "Like so\n    $x = 2\nfor \\$30"),
            ]

In [3]:
assert len(set([e.name for e in examples])) == len(examples)

Now we can test our examples by checking our transformation function and returning the failures.

In [4]:
def test(f, examples=examples):
    for example in examples:
        data = example.mmark
        result = f(data)
        expected = example.pandoc
        if result != expected:
            yield({'name': example.name, 'data': data, 'result': result, 'expected': expected})
            
list(test(lambda x: x))

[{'name': 'Inline',
  'data': 'And $$x=2$$',
  'result': 'And $$x=2$$',
  'expected': 'And $x=2$'},
 {'name': 'Inline Space',
  'data': 'And $$ x = 2 $$',
  'result': 'And $$ x = 2 $$',
  'expected': 'And $x = 2$'},
 {'name': 'Block space',
  'data': 'And\n\n$$ x = 2 $$\n',
  'result': 'And\n\n$$ x = 2 $$\n',
  'expected': 'And\n\n$$x = 2$$\n'},
 {'name': 'Literal $',
  'data': 'It costs $20,000',
  'result': 'It costs $20,000',
  'expected': 'It costs \\$20,000'},
 {'name': 'Two Literal $',
  'data': '$20,000 to $30,000',
  'result': '$20,000 to $30,000',
  'expected': '\\$20,000 to \\$30,000'},
 {'name': 'After intended code blocks',
  'data': 'Like so\n    $x = 2\nfor $30',
  'result': 'Like so\n    $x = 2\nfor $30',
  'expected': 'Like so\n    $x = 2\nfor \\$30'}]
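
To see why a one-line fix isn't enough, here is a small self-contained sketch that runs a naive transformation against a reduced copy of four of the examples. The `naive` function, the `samples` list and the `failed_names` helper are illustrative assumptions, not part of the final solution.

```python
from dataclasses import dataclass

@dataclass
class Example:
    name: str
    mmark: str
    pandoc: str

# A reduced copy of four of the examples above
samples = [
    Example("Inline", "And $$x=2$$", "And $x=2$"),
    Example("Block", "And\n\n$$x=2$$\n", "And\n\n$$x=2$$\n"),
    Example("Literal $", "It costs $20,000", r"It costs \$20,000"),
    Example("Inline code double $", "As TeX `$$x=2$$`", "As TeX `$$x=2$$`"),
]

def naive(s):
    """Hypothetical first attempt: collapse every `$$` into `$`."""
    return s.replace("$$", "$")

def failed_names(f, examples):
    """Names of the examples where f does not produce the expected output."""
    return [e.name for e in examples if f(e.mmark) != e.pandoc]

print(failed_names(naive, samples))
# ['Block', 'Literal $', 'Inline code double $']
```

The naive replacement gets inline math right but breaks math blocks and code, and leaves literal `$` signs unescaped, so the transformation has to know which context it is in.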

## Strategy

We will use a simple Deterministic Finite Automaton (DFA) to handle the transitions between the different states:

* In `default` state just yield characters, and look for transitions to other states
* In `inline_code` or `block_code` just yield characters until the end of the code
* In `inline_math` or `block_math` transform the delimiters and strip the whitespace just inside them, leaving the math itself unchanged

### Why not a parser?

A good solution would be to use one of the many Markdown parsers, such as [Marko](https://marko-py.readthedocs.io/en/latest/), [Mistletoe](https://github.com/miyuchina/mistletoe), or even [Pandoc](https://pandoc.org/) itself.
They can all produce Markdown and can be extended, which would allow us to parse mmark maths.

The problem is that they are all *destructive parsers*: they don't preserve things like whitespace, and even an identity parse changes the Markdown significantly.
This makes the git diffs much bigger and the results harder to check.

So we're forced to write our own.

### DFA Diagram

Here's a rough diagram of the DFA; unfortunately `blockdiag` doesn't do well with too many edges so I've had to compress some of the information about returned labels.

In [5]:
from subprocess import run

def get_font(name):
    """Get a font by name on a linux like system"""
    for line in run('fc-list', capture_output=True).stdout.decode('utf-8').split('\n'):
        if not line.strip():
            continue
        path, names, _styles = line.split(':')
        names = names.strip().split(',')
        if name in names:
            return path

We need to provide a font that can handle Unicode, so I use `DejaVu Sans`.
This should be provided in the Conda environment (at `${CONDA_PREFIX}/fonts/DejaVuSans.ttf`).


In [6]:
from blockdiag import parser, builder, drawer
from blockdiag.utils.fontmap import FontMap
from IPython.display import HTML

def show_block_diagram(source, font='DejaVu Sans'):
    fm = FontMap()
    fm.set_default_font(get_font(font))
    
    tree = parser.parse_string(source)
    diagram = builder.ScreenNodeBuilder.build(tree)
    draw = drawer.DiagramDraw("SVG", diagram, filename=None, fontmap=fm)
    draw.draw()
    result = draw.save()
    return HTML(result)

In [7]:
show_block_diagram("""
blockdiag {
     default_shape="ellipse";
     default_fontsize=8;
     
     default -> code_block [label = '↵```/```', dir="both"]
     
     default -> inline_code [label='`', dir="both"]
     
    default -> inline_code [label='↵    /↵.', dir="both"]

     
     default -> block_math [label='↵$$', dir="both"]
     
     default -> inline_math [label='$$', dir="both"]
     
     default [color=pink];
   }
""")

## Implementation

We will map out all the states and transitions.

Each `Action` has:

- an `input_mode` where it applies
- a `match_re`, a regular expression that triggers the action
- an `output_mode` to transition to on a match
- an `output` string to emit on a match, by default the matched string itself

There is also an implicit default action that consumes the next character, emits it unchanged, and stays in the current mode.
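
Each action relies on an anchored regular expression match at the current position. As a minimal illustration (the variable names here are just for the example), `Pattern.match(string, pos)` only succeeds when the match starts exactly at `pos`, whereas `Pattern.search` scans forward:

```python
import re

pattern = re.compile(r"\$\$ *")
text = "And $$ x = 2 $$"

# match() is anchored at the given position
at_start = pattern.match(text, 0)
at_dollars = pattern.match(text, 4)

print(at_start)                    # None: the text at index 0 is "And", not "$$"
print(repr(at_dollars.group(0)))   # '$$ ' (the delimiter plus the space after it)

# search() scans forward, which is not what a character-by-character walk wants
found = pattern.search(text, 0)
print(found.start())               # 4
```

Because the trailing space is part of the match, replacing the whole match with `$` strips that whitespace for free.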

In [8]:
from enum import Enum, auto
import re
from typing import Optional
import logging

logger = logging.getLogger(__name__)

class Mode(Enum):
    DEFAULT = auto()
    INLINE_CODE = auto()
    BLOCK_CODE = auto()
    INLINE_MATH = auto()
    BLOCK_MATH = auto()
    INDENTED_CODE = auto()

@dataclass
class Action:
    input_mode: Mode
    match_re: str
    output_mode: Mode
    output: Optional[str] = None
        
    def __post_init__(self):
        self.pattern = re.compile(self.match_re)
        
    def match(self, s: str, idx: int = 0) -> Optional[dict]:
        # Anchored match: the pattern must start exactly at position idx
        match = self.pattern.match(s, idx)
        if match:
            match_str = match.group(0)
            len_match_str = len(match_str)
            # A zero-width match would never advance the parser
            assert len_match_str > 0
            return {'output': self.output or match_str, 'size': len_match_str}
    
    

actions = [
    # Order matters: the first matching action for the current mode wins,
    # so more specific patterns come before more general ones.
    Action(Mode.DEFAULT, r"\n```", Mode.BLOCK_CODE),
    Action(Mode.DEFAULT, r"`", Mode.INLINE_CODE),
    Action(Mode.DEFAULT, r"\n    ", Mode.INDENTED_CODE),
    Action(Mode.DEFAULT, r"\n\$\$ *", Mode.BLOCK_MATH, "\n$$"),
    Action(Mode.DEFAULT, r"\$\$ *", Mode.INLINE_MATH, "$"),
    Action(Mode.DEFAULT, r"\$", Mode.DEFAULT, r"\$"),
    
    
    Action(Mode.BLOCK_CODE, r"```", Mode.DEFAULT),
    
    Action(Mode.INLINE_CODE, r"`", Mode.DEFAULT),
    
    Action(Mode.INLINE_MATH, r" *\$\$", Mode.DEFAULT, "$"),
    Action(Mode.BLOCK_MATH, r" *\$\$", Mode.DEFAULT, "$$"),
    
    Action(Mode.INDENTED_CODE, r"\n {,3}\S", Mode.DEFAULT),
]

def parse(s):
    mode = Mode.DEFAULT
    idx = 0
    output = []
    
    while idx < len(s):
        logger.debug('Mode: %s, Last output: %s, Next chars: %s',
                     mode, output[-1:], s[idx:idx+5].replace('\n', '\\n'))
        last_idx = idx
        # Try each action for the current mode in order; the first match wins
        for action in actions:
            if action.input_mode != mode:
                continue
            match = action.match(s, idx)
            if match:
                logger.debug('Match: %s', action)
                mode = action.output_mode
                idx += match['size']
                output.append(match['output'])
                break
        else:
            # Implicit default action: copy one character and stay in the current mode
            output.append(s[idx])
            idx += 1
        
        assert idx > last_idx, "Infinite loop"
    
    return ''.join(output)

In [9]:
list(test(parse))

[]

In [10]:
assert not list(test(parse))
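
To watch the state machine move through a mixed line, here is a condensed, self-contained sketch of the same table-driven idea that also records each mode change. It is a simplified variant for illustration only: it covers just three of the modes, and `RULES` and `convert_with_trace` are hypothetical names rather than part of the code above.

```python
import re
from enum import Enum, auto

class Mode(Enum):
    DEFAULT = auto()
    INLINE_CODE = auto()
    INLINE_MATH = auto()

# (input mode, pattern, output mode, replacement or None to echo the match)
RULES = [
    (Mode.DEFAULT, re.compile(r"`"), Mode.INLINE_CODE, None),
    (Mode.DEFAULT, re.compile(r"\$\$ *"), Mode.INLINE_MATH, "$"),
    (Mode.DEFAULT, re.compile(r"\$"), Mode.DEFAULT, r"\$"),
    (Mode.INLINE_CODE, re.compile(r"`"), Mode.DEFAULT, None),
    (Mode.INLINE_MATH, re.compile(r" *\$\$"), Mode.DEFAULT, "$"),
]

def convert_with_trace(s):
    """Convert s and return (converted text, list of modes visited)."""
    mode, idx, out, trace = Mode.DEFAULT, 0, [], [Mode.DEFAULT]
    while idx < len(s):
        for in_mode, pattern, out_mode, replacement in RULES:
            if in_mode is not mode:
                continue
            m = pattern.match(s, idx)
            if m:
                out.append(m.group(0) if replacement is None else replacement)
                idx = m.end()
                if out_mode is not mode:
                    trace.append(out_mode)  # record only real mode changes
                mode = out_mode
                break
        else:
            # No rule matched: copy one character and stay in the current mode
            out.append(s[idx])
            idx += 1
    return "".join(out), trace

text, trace = convert_with_trace("Pay $5 for `$$x$$` or $$ y $$")
print(text)                      # Pay \$5 for `$$x$$` or $y$
print([m.name for m in trace])   # ['DEFAULT', 'INLINE_CODE', 'DEFAULT', 'INLINE_MATH', 'DEFAULT']
```

The lone `$` is escaped without leaving the default mode, the `$$` inside backticks is left alone, and only the final `$$ y $$` is rewritten as inline math.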