Simple Markup

Simple Markup, or SMU, does stuff to text.

Still limited and experimental. (And kinda sloppy.)

This is an attempt to take The Goofy Markup Language Processor and reduce it to its bare minimum - a total re-write, retaining the basic design, but with a goal of 50% of the size. The GMLP code works fine; it is just too big and too complex (which is what evolutionary development can sometimes lead to - even with a well-documented design).


This is version 2.1.5, December 14, 2018. (Documentation is still poor.) It was kind of a hasty release...


The SMU API is in three parts.

  1. Three functions in the file smu.php.
  2. Data consisting of constants and arrays of regular expressions defining how the input text is processed.
  3. User PHP code which puts the two together. (Examples provided.)

Please note: This is an idea more than a bit of code. All my code has been designed with one thought in mind: doing more with less. (More on this later; but I've been wrong before...)

The one main function:


string markup_simple ( mixed $input )

It returns the marked-up input text. The input can be a string or an array, from file_get_contents() or file() respectively. A string is split on newlines into an array (and can optionally have characters stripped); a passed array is expected to have no line-ending characters.
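
The string-or-array handling described above can be sketched as follows. This is only an illustration: sketch_input_lines() is a made-up name, not a function in smu.php, and the optional character stripping is left out.

```php
<?php
// Hypothetical helper, not part of smu.php: turn either kind of input into
// an array of lines that carry no line-ending characters.
function sketch_input_lines($input): array
{
    if (is_array($input)) {
        // An array is assumed to be clean already,
        // e.g. from file($path, FILE_IGNORE_NEW_LINES).
        return $input;
    }
    // A string is split on \r\n, \n or \r, so no line endings survive.
    return preg_split('/\r\n|\n|\r/', (string) $input);
}

print_r(sketch_input_lines("# H1\n\n## H2"));
```

Note that a blank line in the text becomes an empty string in the array.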

This function is a simple loop, shifting a line out of the array, then passing both, by reference, to the other two functions. First is:


mixed markup_lines ( string &$line , array &$input )

This function converts a line or lines to a "paragraph" or block of text, like Markdown's (#) header and (>) block quotes. (Though it can do anything to the line as other examples will show.)

Multiple lines are taken from the input text by shifting the array.

The return is TRUE if the line (or lines) were modified, with the result in $line; if more than one line was consumed, the $input array is modified as well. The return is FALSE if the line was not modified.

The data that is used to modify the input is discussed later.

If this function returns TRUE, the next function is not called.
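
To make the "shift more lines out of the array" idea concrete, here is a rough sketch of a block handler with the same shape as markup_lines(). The name sketch_block_quote() and the HTML it emits are assumptions for illustration; the real function is driven by the data described later, not hard-coded like this.

```php
<?php
// Hypothetical block handler; it only mimics the signature of markup_lines().
function sketch_block_quote(string &$line, array &$input): bool
{
    if (!preg_match('/^>\s?(.*)$/', $line, $m)) {
        return false;                 // not a block quote: $line is untouched
    }
    $text = [$m[1]];
    // Pull any following ">" lines out of the input by shifting the array.
    while ($input && preg_match('/^>\s?(.*)$/', $input[0], $m)) {
        $text[] = $m[1];
        array_shift($input);
    }
    $line = '<blockquote>' . implode(' ', $text) . '</blockquote>';
    return true;                      // $line and $input were both modified
}

$input = ['> one', '> two', 'plain'];
$line = array_shift($input);
$changed = sketch_block_quote($line, $input);
// $changed is true, $line holds the whole block, $input is now ['plain'].
```

Because both arguments are references, the caller sees the converted block in $line and a shorter $input without any extra return values.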


string markup_line ( string $line , boolean $para = TRUE )

This function converts "span" markup, like Markdown's emphasis. (But can do anything.) If $para is TRUE the line is enclosed by the line "begin" and "end" string constants described below.

The return is the line of text whether modified or not.
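
Putting the three functions together, the loop might look roughly like this. It is a sketch under assumptions: $block and $span are stand-ins for markup_lines() and markup_line(), and joining the output with newlines (and wrapping blank lines like any other line) is a guess, not taken from smu.php.

```php
<?php
// Sketch of the driver loop only; $block and $span are hypothetical callables.
function sketch_markup_simple(array $input, callable $block, callable $span): string
{
    $out = [];
    while ($input) {
        $line = array_shift($input);
        // The block handler receives the line and the remaining input by reference.
        if (!$block($line, $input)) {
            $line = $span($line);     // only reached when no block rule matched
        }
        $out[] = $line;
    }
    return implode("\n", $out);
}

// Toy handlers, just enough to exercise the loop.
$block = function (string &$line, array &$input): bool {
    if ($line !== '' && $line[0] === '#') {
        $line = '<h1>' . ltrim($line, '# ') . '</h1>';
        return true;
    }
    return false;
};
$span = function (string $line): string {
    return '<p>' . $line . '</p>';
};

echo sketch_markup_simple(['# Title', 'text'], $block, $span), "\n";
```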


All markup is performed by two arrays of short, single line, regular expressions for converting lines, blocks (multiple lines) and spans (text within a line).

The use of the term "markup" is actually a misnomer as the purpose of this API is to enable "any text to any text" conversions based on the regular expressions (and perhaps supplemental PHP support code).

The data that define the input-to-output "rules" are independent of the API. They are regular expression patterns mapped to replacements, where a replacement is either a string value or a string returned by a support function.

The data for both block ("lines") and span ("line") conversions have the same format:

	'/pattern/' => NULL,
	'/pattern/' => "replacement string",
	'/pattern/' => "function_name",
	'/pattern/' => function($line){return $line;},

These data are globals, $mu_lines and $mu_line respectively.
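
As a rough idea of how a table in this format could be walked, here is a sketch. It is not the code in smu.php: sketch_apply_rules() is a made-up name, "first matching pattern wins" is an assumption, a pattern with no replacement is simply treated as "leave the line alone", and the callback arguments are simplified.

```php
<?php
// Hypothetical rule runner; ordering and callback arguments are assumptions.
function sketch_apply_rules(string $line, array $rules): string
{
    foreach ($rules as $pattern => $action) {
        if (!preg_match($pattern, $line, $matches)) {
            continue;
        }
        if ($action === null || $action === true) {
            return $line;             // matched, but nothing to replace
        }
        if (is_callable($action)) {
            // Covers a function name as well as an anonymous function.
            // Caveat: a replacement string that names a function is called too.
            return $action($line, $pattern, $matches);
        }
        return preg_replace($pattern, $action, $line);
    }
    return $line;
}

$rules = [
    '/^;/'           => null,
    '/\*(.+?)\*/'    => '<em>$1</em>',
    '/^(#+)\s*(.*)/' => function ($line, $pattern, $matches) {
        $n = strlen($matches[1]);
        return "<h$n>{$matches[2]}</h$n>";
    },
];

echo sketch_apply_rules('## Two', $rules), "\n";
echo sketch_apply_rules('a *b* c', $rules), "\n";
```

The point of the design is that adding a conversion means adding one line to the array, not touching the loop.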

Data Code

To use the SMU API the data for the conversions are placed in a PHP file which includes the SMU API and calls its functions. Here is a simple example, Markdown header conversion.

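<?php
include 'smu.php';

// Block rules: a Markdown header line is handed to mdheader().
$mu_lines = array(
	'/^(#+)\s*([^#]*)/' => 'mdheader',
);

$ilines = file($argv[1]);           // input lines, line endings kept
$olines = markup_simple($ilines);
print $olines;

// Support function: the count of leading #'s selects the header level.
function mdheader($line, $pattern, $input, $matches) {
	$n = strlen($matches[1]);
	return "<h$n>{$matches[2]}</h$n>";
}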

$ cat tmp.txt
# H1

## H2

$ php tmp.php tmp.txt

To eliminate the linefeeds, pass the FILE_IGNORE_NEW_LINES flag to file(), or read the input with file_get_contents() instead.
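
The difference between the three ways of reading the input is easy to see with plain PHP. The sample file here is a temporary file created only for the demonstration; nothing in it comes from SMU itself.

```php
<?php
// Write a small sample so the example is self-contained.
$path = tempnam(sys_get_temp_dir(), 'smu');
file_put_contents($path, "# H1\n\n## H2\n");

$with    = file($path);                          // each element keeps its "\n"
$without = file($path, FILE_IGNORE_NEW_LINES);   // line endings removed
$string  = file_get_contents($path);             // the whole file as one string

var_dump($with[0] === "# H1\n");    // bool(true)
var_dump($without[0] === '# H1');   // bool(true)

unlink($path);
```

Either $without or $string would then be what gets passed to markup_simple().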

The SMU API is also too limited as any options, input redirection, etc., have to be created from scratch. (This is being worked on.)

Defined Datasets

A dataset - or as I prefer, "data code", for the data drives the code - is a valid PHP source file that defines the markup rules. These source files include the API and call into it to "do the work".

The supplied datasets are PHP files:

  • smu_md.php - Markdown to HTML
  • smu_mdi.php - Make Markdown Index block
  • smu_cmt.php - Strip source code comments

The latter two are really small datasets which can probably be done with sed. (In fact, I'd like to figure out how to do so as a basic comparison.)

And there are two that are incomplete but show the direction the code wants to go:

  • smu_pod.php - Perl POD to HTML
  • smu_pod2md.php - Perl POD to Markdown

Using Markdown:

php smu_md.php

Using Perl Pod to HTML:

php smu_pod.php /usr/share/perl5/5.26/Pod/

Markdown Dataset

The default dataset and support functions provide the following Markdown markup:

  • headers
  • emphasis
  • links
  • lists
  • code spans
  • pre blocks
  • code blocks
  • HTML blocks
  • block quotes
  • escape characters

Limitations are:

  • lists and block quotes do not nest
  • link references not supported
  • setext headers not supported

Limitations are just a matter of implementation.

A few extras are:

  • headers have id="header_text"
  • [#Relative Link] is shortcut for [Relative Link](#relative_link)
  • lines beginning with ; are treated as comments
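
The relative link shortcut can be pictured with a small sketch. The helper names are made up and the real rules live in smu_md.php; the only mapping taken from this document is "Relative Link" to "relative_link", so how punctuation in a header is handled is an open assumption.

```php
<?php
// Hypothetical helpers; the actual rules in smu_md.php may differ.
function sketch_header_id(string $text): string
{
    // "Relative Link" -> "relative_link"
    return strtolower(preg_replace('/\s+/', '_', trim($text)));
}

function sketch_relative_link(string $line): string
{
    // [#Relative Link] -> [Relative Link](#relative_link)
    return preg_replace_callback('/\[#([^\]]+)\]/', function ($m) {
        return '[' . $m[1] . '](#' . sketch_header_id($m[1]) . ')';
    }, $line);
}

echo sketch_relative_link('See [#Relative Link].'), "\n";
```

The same id function would be used when emitting the header, so the link and its target always agree.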

See the source file smu_md.php for full documentation.

The code that SMU was based on does the same thing as this dataset (and is more complete) but is four times larger and much more complicated.

Perl POD to HTML Dataset

This one tries to do what pod2html does. It is about 90% complete and needs further testing. But it is a proof of concept and does produce comparable HTML.

Known Limitations:

  • index headers do not nest
  • L<> codes do not resolve for Perl functions, etc.
  • only <> format codes supported and can nest to one level
  • preliminary C<< >> format codes implemented
  • =encoding not supported
  • =begin and =for not fully tested (and probably have problems)
  • no attempt has been made to report pod syntax errors
  • <dt> item text is within <p></p>
  • output is to STDOUT only
  • limited command line option support

Again, limitations are just a matter of implementation. See the source file smu_pod.php.

Change Log

Version 2.1.5

  • API Change: markup_getline() moved out of markup_line().
  • API Change: markup_line() arguments changed.
  • POD to HTML works better.
  • Added POD to Markdown as experiment.

Version 2.1.4

  • Specification Change: NULL for ignoring lines replaces TRUE.

  • Fixed bug in markup_line() re: define MU_LINE_CALLBACK.

  • Fixed bug in markup_getline() where premature return could occur.

  • Added input text line counter.

  • Added a couple more API functions.

  • Added (started) an error reporting function.

  • Added option to retain blank lines (for text to text output).

  • Changed how the input line number count is implemented.

Version 2.1.2

  • Added Perl Pod to HTML.
  • Added a PHP error handler.
  • Some clean-up/additional functionality added.

Version 2.1.0

  • Fixed "smart" headers false positives.
  • Fixed many flaws in the Markdown data code.
  • Added documentation in the code.
  • Added a web interface, smu_htm.php.
  • Added an interactive regular expression test program.

Version 2.0.7

  • Added example data code.

Version 2.0.6

  • More comments.
  • More string literals as defines (can control "<p>...</p>" tags).

Version 2.0.4

  • Split API and DATA.

Version 2.0.2

  • Fixed "smart" headers for camel case and lowercase.
  • Fixed emphasis to work within punctuation characters.
  • Added ```` blocks as pre blocks.
  • Shortened (optimised) several regular expressions.
  • Removed the "backward" Markdown link extension (was dumb).
  • This document includes much about how the code works.