Tutorial 2 Altering Markdown Rendering

flywire edited this page Jul 8, 2020 · 1 revision

Introduction

While many extensions to Python-Markdown add new syntax, occasionally, you want to simply alter the way Markdown renders the existing syntax. For example, you may want to display some images inline, but require externally hosted images to simply be links which point to the image.

Suppose the following Markdown was provided:

![a local image](/path/to/image.jpg)

![a remote image](http://example.com/image.jpg)

We would like Python-Markdown to return the following HTML:

<p><img alt="a local image" src="/path/to/image.jpg" /></p>
<p><a href="http://example.com/image.jpg">a remote image</a></p>

Note: This tutorial is very generic and assumes a basic Python 3 development environment. A basic understanding of Python development is expected.

Analysis

Let's consider the options available to us:

  1. Override the image related inline patterns.

    While this would work, we don't need to alter the existing patterns. The parser is recognizing the syntax just fine. All we need to do is alter the HTML output.

    We also want to support both inline image links and reference style image links, which would require redefining both inline patterns, doubling the work.

  2. Leave the existing pattern alone and use a Treeprocessor to alter the HTML.

    This does not alter the tokenization of the Markdown syntax in any way. We can be sure that anything which represents an image will be included, even any new image syntax added by other third-party extensions.

Given the above, let's use option two.

The Solution

To begin, let's create a new Treeprocessor:

from markdown.treeprocessors import Treeprocessor

class InlineImageProcessor(Treeprocessor):
    def run(self, root):
        # Modify the HTML here

The run method of a Treeprocessor receives a root argument, which is an ElementTree Element object holding the parsed document. We need to iterate over all of the img elements within that tree and alter those which contain external URLs. Therefore, add the following code to the run method:

# Iterate over img elements only
for element in root.iter('img'):
    # Copy the element's attributes for later use (a real copy, so that
    # element.clear() below cannot empty it)
    attrib = dict(element.attrib)
    # Check for links to external images
    if attrib['src'].startswith('http'):
        # Save the tail
        tail = element.tail
        # Reset the element
        element.clear()
        # Change the element to a link
        element.tag = 'a'
        # Copy src to href
        element.set('href', attrib.pop('src'))
        # Copy alt to label
        element.text = attrib.pop('alt')
        # Reassign tail
        element.tail = tail
        # Copy all remaining attributes to element
        for k, v in attrib.items():
            element.set(k, v)

A few things to note about the above code:

  1. We make a copy of the element's attributes so that we don't lose them when we later reset the element with element.clear(). The same applies to the tail. As img elements don't have text, we don't need to worry about that.
  2. We explicitly set the href attribute and the element.text, as those are assigned to different attribute names on a elements than on img elements. When doing so, we pop the src and alt attributes from attrib so that they are no longer present when we copy all remaining attributes in the last step.
  3. We don't need to make changes to img elements which point to internal images, so there is no need to reference them in the code (they simply get skipped).
  4. The test for external links (startswith('http')) could be improved and is left as an exercise for the reader.
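
One possible direction for that exercise, offered only as a sketch: let urllib.parse.urlparse decide whether a URL names a host, rather than matching on the leading characters. The helper name is_external is hypothetical and not part of Python-Markdown. Unlike startswith('http'), it treats protocol-relative URLs as external and does not misfire on a relative path that merely begins with "http".

```python
from urllib.parse import urlparse


def is_external(url):
    """Return True if `url` names a host (hypothetical helper)."""
    parts = urlparse(url)
    # A network location is present for absolute URLs ("http://host/...")
    # and for protocol-relative URLs ("//host/..."), but not for paths.
    return bool(parts.netloc)


print(is_external('http://example.com/image.jpg'))  # True
print(is_external('//example.com/image.jpg'))       # True
print(is_external('/path/to/image.jpg'))            # False
print(is_external('httpdocs/image.jpg'))            # False
```

In the run method this would replace the attrib['src'].startswith('http') condition with is_external(attrib['src']).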

Now we need to inform Markdown of our new Treeprocessor with an Extension subclass:

from markdown.extensions import Extension

class ImageExtension(Extension):
    def extendMarkdown(self, md):
        # Register the new treeprocessor
        md.treeprocessors.register(InlineImageProcessor(md), 'inlineimageprocessor', 15)

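We register the Treeprocessor with a priority of 15. Processors with a higher priority run first, and the built-in inline Treeprocessor is registered with a priority of 20, so ours runs after all inline processing is done.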

Test 1

Let's see that all together:

ImageExtension.py

from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension


class InlineImageProcessor(Treeprocessor):
    def run(self, root):
        for element in root.iter('img'):
            attrib = dict(element.attrib)
            if attrib['src'].startswith('http'):
                tail = element.tail
                element.clear()
                element.tag = 'a'
                element.set('href', attrib.pop('src'))
                element.text = attrib.pop('alt')
                element.tail = tail
                for k, v in attrib.items():
                    element.set(k, v)


class ImageExtension(Extension):
    def extendMarkdown(self, md):
        md.treeprocessors.register(InlineImageProcessor(md), 'inlineimageprocessor', 15)

Now, pass our extension to Markdown:

Test.py

import markdown
from ImageExtension import ImageExtension

# Named "text" rather than "input" to avoid shadowing the built-in
text = """
![a local image](/path/to/image.jpg "A title.")

![a remote image](http://example.com/image.jpg "A title.")
"""

html = markdown.markdown(text, extensions=[ImageExtension()])
print(html)

And running python Test.py correctly returns the following output:

<p><img alt="a local image" src="/path/to/image.jpg" title="A title." /></p>
<p><a href="http://example.com/image.jpg" title="A title.">a remote image</a></p>

Success! Note that we included a title for each image, which was also properly retained.
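
The element conversion itself can also be tried in isolation, without Markdown, using only the standard library's ElementTree. The following is a minimal sketch rather than part of the extension: the img_to_link function name and the sample paragraph are made up for illustration, and the attributes are copied explicitly with dict() so they survive element.clear(). It shows that the tail text after the image and the title attribute are both kept.

```python
import xml.etree.ElementTree as etree


def img_to_link(element):
    """Turn an <img> element into an <a> element in place (hypothetical helper)."""
    attrib = dict(element.attrib)  # real copy; clear() resets the element's attributes
    tail = element.tail            # clear() also resets text and tail
    element.clear()
    element.tag = 'a'
    element.set('href', attrib.pop('src'))
    element.text = attrib.pop('alt')
    element.tail = tail
    # Copy whatever attributes remain (here: title)
    for key, value in attrib.items():
        element.set(key, value)


# Sample input, invented for this sketch
root = etree.fromstring(
    '<p>See <img alt="a remote image" src="http://example.com/image.jpg" '
    'title="A title." /> for details.</p>'
)
for element in root.iter('img'):
    img_to_link(element)

print(etree.tostring(root, encoding='unicode'))
# <p>See <a href="http://example.com/image.jpg" title="A title.">a remote image</a> for details.</p>
```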

Adding Configuration Settings

Suppose we want to allow the user to provide a list of known image hosts. Any img tags which point at images on those hosts may be inlined, but any other remote images should become external links. Of course, we want to keep the existing behavior for internal (relative) links.

First we need to add the configuration option to our Extension subclass:

class ImageExtension(Extension):
    def __init__(self, **kwargs):
        # Define a config with defaults
        self.config = {'hosts': [[], 'List of approved hosts']}
        super().__init__(**kwargs)

We defined a hosts configuration setting which defaults to an empty list. Now, we need to pass that option on to our treeprocessor in the extendMarkdown method:

def extendMarkdown(self, md):
    # Pass the hosts setting to the treeprocessor
    processor = InlineImageProcessor(md, hosts=self.getConfig('hosts'))
    md.treeprocessors.register(processor, 'inlineimageprocessor', 15)

Next, we need to modify our treeprocessor to accept the new setting:

class InlineImageProcessor(Treeprocessor):
    def __init__(self, md, hosts):
        # Let the base class store the Markdown instance as self.md
        super().__init__(md)
        # Assign the setting to the hosts attribute of the class instance
        self.hosts = hosts

Then, we can add a method which uses the setting to test a URL:

from urllib.parse import urlparse

class InlineImageProcessor(Treeprocessor):
    ...
    def is_unknown_host(self, url):
        url = urlparse(url)
        # Return False if the network location is empty or a known host
        return bool(url.netloc) and url.netloc not in self.hosts

Finally, we can make use of the test method by replacing the if attrib['src'].startswith('http'): line of the run method with if self.is_unknown_host(attrib['src']):.
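
It may help to see what urlparse actually reports for the kinds of URLs involved. The snippet below is an illustrative sketch using only the standard library. It also shows one caveat of comparing against netloc: netloc keeps any port and the original letter case, whereas the hostname attribute is lowercased and has the port stripped. The standalone is_unknown_host function here is a hypothetical variant that compares hostname instead, and is not the method used in this tutorial.

```python
from urllib.parse import urlparse

for url in (
    '/path/to/image.jpg',
    'http://example.com/image.jpg',
    'http://Example.com:8080/image.jpg',
):
    parts = urlparse(url)
    print(repr(parts.netloc), repr(parts.hostname))
# '' None
# 'example.com' 'example.com'
# 'Example.com:8080' 'example.com'


def is_unknown_host(url, hosts):
    """Hypothetical variant that compares the hostname rather than netloc."""
    hostname = urlparse(url).hostname
    # hostname is None for relative URLs, which should stay inline
    return hostname is not None and hostname not in hosts


print(is_unknown_host('http://Example.com:8080/image.jpg', ['example.com']))  # False
print(is_unknown_host('http://exclude.com/image.jpg', ['example.com']))       # True
```

Whether ports and letter case should matter is a design choice. Comparing netloc, as the tutorial does, is the stricter of the two.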

Test 2

The final result should look like this:

ImageExtension.py

from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension
from urllib.parse import urlparse


class InlineImageProcessor(Treeprocessor):
    def __init__(self, md, hosts):
        super().__init__(md)
        self.hosts = hosts

    def is_unknown_host(self, url):
        url = urlparse(url)
        return bool(url.netloc) and url.netloc not in self.hosts

    def run(self, root):
        for element in root.iter('img'):
            attrib = dict(element.attrib)
            if self.is_unknown_host(attrib['src']):
                tail = element.tail
                element.clear()
                element.tag = 'a'
                element.set('href', attrib.pop('src'))
                element.text = attrib.pop('alt')
                element.tail = tail
                for k, v in attrib.items():
                    element.set(k, v)


class ImageExtension(Extension):
    def __init__(self, **kwargs):
        self.config = {'hosts': [[], 'List of approved hosts']}
        super().__init__(**kwargs)

    def extendMarkdown(self, md):
        processor = InlineImageProcessor(md, hosts=self.getConfig('hosts'))
        md.treeprocessors.register(processor, 'inlineimageprocessor', 15)

Let's test that out:

Test.py

import markdown
from ImageExtension import ImageExtension

# Named "text" rather than "input" to avoid shadowing the built-in
text = """
![a local image](/path/to/image.jpg)

![a remote image](http://example.com/image.jpg)

![an excluded remote image](http://exclude.com/image.jpg)
"""

html = markdown.markdown(text, extensions=[ImageExtension(hosts=['example.com'])])
print(html)

And running python Test.py returns the following output:

<p><img alt="a local image" src="/path/to/image.jpg" /></p>
<p><img alt="a remote image" src="http://example.com/image.jpg" /></p>
<p><a href="http://exclude.com/image.jpg">an excluded remote image</a></p>

Wrapping the above extension up into a package for distribution is left as an exercise for the reader.