# Finding Unescaped Backslashes

While doing some QA on the converted Pandoc output, I found that `dvi-by-example` had a `\special` in paragraph text, which ended up as a broken newline in the output.

Let's find all instances of these using the DFA from `mmark2pandoc_math.ipynb`.

## Imports

In [1]:
from dataclasses import dataclass
from enum import Enum, auto
from pathlib import Path
import re
from typing import Optional

In [2]:
class Mode(Enum):
    DEFAULT = auto()
    INLINE_CODE = auto()
    BLOCK_CODE = auto()
    INLINE_MATH = auto()
    BLOCK_MATH = auto()
    INDENTED_CODE = auto()

In [3]:
@dataclass
class Action:
    input_mode: Mode
    match_re: str
    output_mode: Mode
    output: Optional[str] = None

    def __post_init__(self):
        self.pattern = re.compile(self.match_re)

    def match(self, s: str, idx: int = 0) -> Optional[dict]:
        match = self.pattern.match(s, idx)
        if match:
            match_str = match.group(0)
            len_match_str = len(match_str)
            assert len_match_str > 0
            return {"output": self.output or match_str, "size": len_match_str}
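
The `idx` argument matters because a compiled pattern's `match(s, idx)` only tries to match starting exactly at position `idx`; it does not search ahead. A minimal standalone sketch of that behaviour (the sample string is made up for illustration):

```python
import re

backtick = re.compile("`")
sample = "a `b` c"  # hypothetical input

# match() is anchored at the given position; it does not scan forward.
at_start = backtick.match(sample, 0)
at_backtick = backtick.match(sample, 2)

print(at_start)            # None: position 0 holds "a"
print(at_backtick.span())  # (2, 3)
```

This is what lets the parser step through the text one position at a time and ask, at each position, whether any action applies.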

In [4]:
# Order matters: within a mode, the first action whose pattern matches wins.
# Patterns are raw strings so that "\$" and "\S" reach the regex engine intact.
ACTIONS = (
    Action(Mode.DEFAULT, r"\n```", Mode.BLOCK_CODE),
    Action(Mode.DEFAULT, r"`", Mode.INLINE_CODE),
    Action(Mode.DEFAULT, r"\n\$\$ *", Mode.BLOCK_MATH, "\n$$"),
    Action(Mode.DEFAULT, r"\$\$ *", Mode.INLINE_MATH, "$"),
    Action(Mode.DEFAULT, r"\n    ", Mode.INDENTED_CODE),
    Action(Mode.DEFAULT, r"\$", Mode.DEFAULT, r"\$"),
    Action(Mode.BLOCK_CODE, r"```", Mode.DEFAULT),
    Action(Mode.INLINE_CODE, r"`", Mode.DEFAULT),
    Action(Mode.INLINE_MATH, r" *\$\$", Mode.DEFAULT, "$"),
    Action(Mode.BLOCK_MATH, r" *\$\$", Mode.DEFAULT, "$$"),
    Action(Mode.INDENTED_CODE, r"\n {,3}\S", Mode.DEFAULT),
)

## Parser

Here we just want to find every backslash sequence in `DEFAULT` mode, that is, outside code and math; when we find one, we capture everything from the backslash up to the next space.

In [5]:
ESCAPED_PATTERN = re.compile(r'\\[^ ]*')

ESCAPED_PATTERN.match(r'\special and \test')

<re.Match object; span=(0, 8), match='\\special'>
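
Note that `[^ ]` excludes only the space character, so a capture runs through newlines, tabs and markup until it hits a literal space. A small sketch with a made-up string:

```python
import re

ESCAPED_PATTERN = re.compile(r'\\[^ ]*')

# Hypothetical text: the backslash sequence is followed by a newline, not a space.
text = "\\special.</td>\n<td>next cell"

captured = ESCAPED_PATTERN.match(text).group(0)
print(repr(captured))  # '\\special.</td>\n<td>next'
```

This is why some of the captures in the results contain `\n` and HTML tags.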

In [6]:
def find_backslash_escapes(s):
    mode = Mode.DEFAULT
    idx = 0
    output = []

    while idx < len(s):
        last_idx = idx
        for action in ACTIONS:
            if action.input_mode != mode:
                continue
            match = action.match(s, idx)
            if match:
                mode = action.output_mode
                idx += match["size"]
                break
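        else:
            # No action matched at this position: record a backslash
            # sequence if we are in normal text, then move on one character.
            if s[idx] == '\\' and mode == Mode.DEFAULT:
                output.append(ESCAPED_PATTERN.match(s, idx).group(0))
            idx += 1

        assert idx > last_idx, "Infinite loop"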

    return output
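
To see the effect of mode tracking in isolation, here is a simplified, self-contained sketch that only toggles in and out of inline code. It is not the function above, and `find_outside_inline_code` is a hypothetical name used purely for illustration. Backslashes inside backticks are skipped, while those in ordinary text are captured.

```python
import re

ESCAPED_PATTERN = re.compile(r'\\[^ ]*')


def find_outside_inline_code(s):
    """Collect backslash sequences that are not inside `inline code`."""
    in_code = False
    found = []
    for idx, char in enumerate(s):
        if char == '`':
            # A backtick toggles between normal text and inline code.
            in_code = not in_code
        elif char == '\\' and not in_code:
            found.append(ESCAPED_PATTERN.match(s, idx).group(0))
    return found


sample = r"Use `\special` in code, but \special here is unescaped."
print(find_outside_inline_code(sample))  # ['\\special']
```

The full parser does the same thing with more modes, driven by the `ACTIONS` table instead of a single boolean.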

Load in all the source data.

In [7]:
input_dir = Path('../data/content/post')
extensions = ['mmark', 'md', 'Rmd']

paths = [path for ext in extensions for path in input_dir.glob(f'*.{ext}')]
len(paths)

481

In [8]:
results = {}

for path in paths:
    with open(path) as f:
        data = find_backslash_escapes(f.read())
        if data:
            results[path] = data

There are only a few examples, and these can be manually fixed.

One thing to note is that the markdown converted from Jupyter notebooks uses `\\[` and `\\]` for block math and `\\(` and `\\)` for inline math, and escapes all backslashes and underscores; Quarto doesn't render this.
Since there are only a few examples, we can manually clean them up too.

In [9]:
results

{PosixPath('../data/content/post/ngram-sentence-boundaries.mmark'): ['\\|',
  '\\|'],
 PosixPath('../data/content/post/dvi-by-example.mmark'): ['\\count{0-9}',
  '\\special.</td>\n</tr>\n<tr>\n<td>{F3-F5}</td>\n<td>fnt_def{1-3}</td>\n<td>i[{1-3}]',
  '\\count0',
  '\\count0',
  '\\count{1-9}=0.',
  '\\special{})'],
 PosixPath('../data/content/post/_ideas.mmark'): ['\\\nhead',
  '\\\ncut',
  "\\t'",
  '\\\nsort',
  '\\\nuniq',
  '\\\nsort',
  '\\\nsed',
  '\\\nawk',
  '\\\nhead\n\n#'],
 PosixPath('../data/content/post/history-of-integration.mmark'): ['\\A'],
 PosixPath('../data/content/post/export-athena.mmark'): ["\\n',",
  "\\\\')",
  "\\')",
  "\\N'),"],
 PosixPath('../data/content/post/latex-multiple-equations.mmark'): ['\\…”',
  '\\sin(x)},'],
 PosixPath('../data/content/post/prior-regularise.mmark'): ['\\epsilon$'],
 PosixPath('../data/content/post/normalising-salary.mmark'): ['\\%20Salary\\%20Extracted\\%20From\\%20CommonCrawl\\%20Job\\%20Data.html)',
  '\\%20Extracted\\%20From\\