About OMD

The implementation of this library and command-line tool is based on DFMSD. That description does not define a grammar; it is a guide for human users who are not trying to implement Markdown. In other words, it is ambiguous, which is a problem because Markdown has no notion of a syntax error: every input must be given some meaning. Its design is mostly based on email-writing conventions: the meaning of a phrase is the meaning a human would give it when reading it as the contents of an email. For instance, blank lines that contain spaces (lines that look empty but, from the computer's point of view, contain characters) are invisible to a human reader, so they should be ignored.

Specificities

The following is a list of specificities of OMD. It is probably not exhaustive.

Please note that OMD's semantics have changed over time, but they are becoming more and more stable with time and new releases. The goal is to eventually have a semantics that's as sane as it can possibly be for a Markdown parser. Please browse and open issues if you find something that seems wrong.

  • Email addresses encoding: email addresses are not hex entity-encoded.

  • [foo] is a shortcut for [foo][], but if foo is not a reference then [foo] is printed [foo], not [foo][]. (Taken from GitHub Flavored Markdown.)

  • The Markdown to Markdown conversion may perform some cleaning (some meaningless characters may disappear) or spoiling (some meaningless characters may appear), but inputs and outputs should have the same semantics (otherwise please do report the bug).

  • A list containing at least one item which has at least one paragraph is a list for which all items have paragraphs and/or blocks. In HTML terms: in practice, if an li of a ul or ol has a p, then every other li of that list has at least a p or a pre.

  • It's not possible to emphasise a part of a word using underscores. (Taken from GitHub Flavored Markdown.)

  • A code section declared with at least 3 backquotes (`) as the first element on a line is a code block. The backquotes should be followed by a language name (made of a-z characters) or by a newline.

  • A code block starting with several backquotes (e.g., ```) immediately followed by a word W made of a-z characters is a code block for which the code language is W. (If you use characters other than a-z, the semantics is currently undefined, although it is of course deterministic, because it may change in the near future.) Also, if you use the command-line tool omd, you can define programs that process code blocks according to the languages declared for those code blocks.

  • Each and every tabulation is converted by OMD to 4 spaces at the lexing step. And the behaviour of the parser is undefined for tabulations.

    • Note that it does mean that if you have a document with some code written using the Whitespace language, it will not work very well. This might be fixed in the future but unless you have a very good reason for OMD to support tabulations, it will probably not.
  • Parentheses and square brackets are generally parsed in a way such that [a[b]](http://c/(d)) is the URL http://c/(d) with the text a[b]. If you want a parenthesis or bracket not to count in the balanced parsing, escape it with a backslash, such as in [a\[b](http://c/\(d). This is typically something that's not defined in DFMSD.

    • Note about backslashes in URLs: some web browsers (e.g., Safari) automatically convert \ to /. This is not the case with cURL. However, I assume it's safe to consider that backslashes are not to be used in URLs. Still, it's always possible to backslash-escape them anyway.
  • HTML is somewhat a part of Markdown. OMD will partially parse HTML tags and if you have a tag that isn't a known HTML tag, then it's possible that OMD will not consider it as HTML. For instance, a document containing just <foo></foo> will be converted to <p>&lt;foo&gt;&lt;/foo&gt;</p>.

    • It's possible to ask omd to relax this constraint.
  • Some additional features are available on the command line. For more information, use the command omd -help.
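
Two of the rules above, tab expansion and balanced bracket matching with backslash escapes, can be illustrated with a small standalone sketch. It does not use OMD's lexer or parser: the names expand_tabs, matching_close and parse_link are hypothetical, and the code works on plain strings rather than on OMD's token lists.

```ocaml
(* Standalone illustration; none of these functions are part of OMD. *)

(* Replace every tab with four spaces, as the lexing step is described to do. *)
let expand_tabs (s : string) : string =
  String.concat "    " (String.split_on_char '\t' s)

(* [matching_close s i op cl] assumes that s.[i] = op and returns the index
   of the closer that balances it. A backslash makes the next character
   ordinary, so escaped brackets do not count in the balancing. *)
let matching_close (s : string) (i : int) (op : char) (cl : char) : int option =
  let n = String.length s in
  let rec go j depth =
    if j >= n then None
    else if s.[j] = '\\' then go (j + 2) depth
    else if s.[j] = op then go (j + 1) (depth + 1)
    else if s.[j] = cl then
      if depth = 1 then Some j else go (j + 1) (depth - 1)
    else go (j + 1) depth
  in
  go i 0

(* Split "[text](url)" into (text, url). Escape characters are left in
   place; removing them would be a separate step. *)
let parse_link (s : string) : (string * string) option =
  if s = "" || s.[0] <> '[' then None
  else
    match matching_close s 0 '[' ']' with
    | None -> None
    | Some k ->
      if k + 1 >= String.length s || s.[k + 1] <> '(' then None
      else
        match matching_close s (k + 1) '(' ')' with
        | None -> None
        | Some m ->
          Some (String.sub s 1 (k - 1), String.sub s (k + 2) (m - k - 2))
```

With these definitions, parse_link applied to [a[b]](http://c/(d)) yields the text a[b] and the URL http://c/(d), and the escaped form [a\[b](http://c/\(d) is also accepted as a single link.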

"DFMSD" is short for "Daring Fireball: Markdown Syntax Documentation", which is the HTML title of the page located at http://daringfireball.net/projects/markdown/syntax.

Extension mechanisms

The parser is implemented using a big (very big) recursive function (Omd_parser.Make(Env).main_loop_rev), with a set of some auxiliary functions. Some parts are easy to understand, some parts are not. However, overall, it should be easy enough.

The parser has a double extension mechanism.

  1. To use the first mechanism, you may define a set of functions in the module Env given to instantiate the functor Omd_parser.Make.
  • The value Env.extensions is a list of elements of type Omd_representation.extension which is equal to r -> p -> l -> (r * p * l) option where
    • r = Omd_representation.t and represents the result of the parsing process,
    • p = Omd_representation.tok list and represents the tokens preceding l,
    • and l = Omd_representation.tok list and is the list of tokens to parse.
    • The result, of type (r * p * l) option, is None if the extension has no effect (the parser then continues its job with the state it had before trying the extension), and is Some(r,p,l) when it gives a new set of data to the parser.
  • Each element of the list Env.extensions is applied in a fold left manner. (The first element of that list is applied first.)
  • And they are applied when a standard parsing rule fails.
  2. The second mechanism lies in the representation of the lexemes (Tag of string * extension). It allows extensions to be inserted directly into the lexeme list.
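
The shape of the first mechanism can be sketched without OMD itself. The types tok and t below are simplified stand-ins for Omd_representation.tok and Omd_representation.t, and todo_ext, never_ext and apply_extensions are hypothetical names. The sketch also assumes that the first extension returning Some wins and that p holds the most recent token first; the description above does not settle either point.

```ocaml
(* Simplified stand-ins for the real types in Omd_representation. *)
type tok = Word of string | Star | Space
type t = string list (* hypothetical result type, most recent element first *)

(* Same shape as the extension type described above: r -> p -> l -> option. *)
type extension = t -> tok list -> tok list -> (t * tok list * tok list) option

(* Hypothetical extension: consume the word TODO and emit a marker. *)
let todo_ext : extension = fun r p l ->
  match l with
  | Word "TODO" :: rest -> Some ("<todo/>" :: r, Word "TODO" :: p, rest)
  | _ -> None

(* Hypothetical extension that never has any effect. *)
let never_ext : extension = fun _ _ _ -> None

(* Try the extensions from left to right with a left fold. Assumption:
   once an extension has returned Some, later ones are not consulted. *)
let apply_extensions (exts : extension list) r p l =
  List.fold_left
    (fun acc ext ->
       match acc with
       | Some _ -> acc
       | None -> ext r p l)
    None exts
```

A parser built this way would call apply_extensions only after its standard rules have failed on the current token, and would fall back to its default behaviour when the result is None.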

The Markdown representation also provides an extension mechanism, which is useful if you want to insert “smart objects” (which are as “smart” as smartphones). Those objects have four methods, two of which are particularly useful: to_html and to_t. Implementing at least one of them is necessary. They both return a string option, and a default dummy smart object can be defined as follows:

let dummy =
  X (object
    method name = "dummy"
    method to_html ?(indent=0) _ _ = None
    method to_sexpr _ _ = None
    method to_t _ = None
  end)

History

OMD has been developed by Philippe Wang at OCaml Labs in Cambridge, with valuable feedback and pull requests (cf. next section).

Its development was motivated by at least these facts:

  • We wanted an OCaml implementation of Markdown; some OCaml parsers of Markdown existed before but they were incomplete. It's easier for an OCaml project to depend on a pure-OCaml implementation of Markdown than to depend on some interface to a library implemented in another language, and this is even more important now that Opam exists.

  • We wanted to provide a way to make the contents of the OCaml.org website be essentially in Markdown instead of HTML. And we wanted this website to be implemented in OCaml.

  • Having an OCaml implementation of Markdown is virtually mandatory for those who want to use a Markdown parser in a Mirage application. Note that OMD has replaced the previous Markdown parser of COW, which has been developed as part of the Mirage project.

Thanks

Thank you to Christophe Troestler, Ashish Argawal, Sebastien Mondet, Thomas Gazagnaire, Daniel Bünzli, Amir Chaudry, Anil Madhavapeddy, David Sheets, Jeremy Yallop, and <please insert your name here if you believe you've been forgotten> for their feedback and contributions to this project.

Miscellaneous notes

  • There's been absolutely no effort in making OMD fast, but it should be amongst the fastest parsers of Markdown, just thanks to the fact that it is implemented in OCaml. That being said, there's quite some room for performance improvements. One way would be to make a multi-pass parser with different intermediate representations (there are currently only 2 representations: one for the lexing tokens and one for the parse tree).

  • The hardest part of implementing a parser of Markdown is the process of understanding and unravelling the grammar of Markdown to turn it into a program.

  • OMD 1.0.0 will probably use some external libraries, e.g., UUNF and perhaps Xmlm

  • "OMD" is the name of this library and command-line tool.

    • It might be written "Omd" or "omd" sometimes, but it should be written using capital letters because it should be read əʊ ɛm diː rather than ə'md or ˌɒmd.
  • "Omd" is a module.

    • It's written using monospace font and it's capitalized.
  • "omd" is a command-line tool.

    • It's written using monospace font and always in lowercase letters only because, unless you have a case-insensitive file system, calling Omd on the command line is not just another way of calling omd.
  • OMD was added to the rather long list of Markdown parsers https://github.com/markdown/markdown.github.com/wiki/Implementations on the 29th of January.