Sanitise on input, escape on output. (Trac #561) #561

elgg-gitbot opened this Issue Feb 16, 2013 · 34 comments


None yet

2 participants


Original ticket on 38885458-12-20 by trac user judgej, assigned to unknown.

Elgg version: 1.1

If I use characters in a display name that would normally be used to define HTML tags, they get converted to HTML entities. I don't believe this is the correct action to take.

For example, if I enter the name "Admin>", then it saves the name "Admin>". If this is an input mapping, then it could have repercussions throughout the system. For example, if I used '>' or '<' in a password, then can I be sure those characters are actually going to be stored in the password? If I have a five character limit on an input field, then the what happens when I enter '&&&&&' and find the system expands it to '&&&&&' before it attempts to store it? What if I am entering data that has nothing whatsoever to do with HTML or XML? What relevance is that kind of 'sanitisation' of input?

Mapping of special XML characters to entities is a function of the output (including when displayed as pre-filled form items). It is only relevent when that kind of mapping is important to the output, such as when creating XML or HTML. It has no place being done in the path between the user and the database.

I believe this is a very important issue, because I have seen other projects flounder when they do not follow the simple rule of keeping XML mapping out of the path from the user to the database. It is just asking for double-encoding to happen in various output points, and then you are in serious trouble because your user-entered data has effectively been corrupted.


trac user judgej wrote on 38885464-02-07

This is a similar issue to systems that attempt to sanitise input data by preparing it for database storage (i.e. escaping quotes and backslashes) too early. They run into trouble causing escape sequences to start appearing at the output, and then they need to put extra reverse-sanitisers in to fix that, and then that causes further problems and the whole thing gets ten times more complicated than it should be.


trac user judgej wrote on 38885485-06-06

Another example is the 'location' field in the profile. If I enter "Tyne & Wear" then I expect it to be stored exactly like that. Instead, two things happen:

  • The string I enter is converted to "Tyne & Wear" which then gets stored like that.
  • On my profile page, the location is then displayed without any further conversion. This is dangerous because tags may have been added to that profile through a different API and may not contain HTML-safe characters.

In addition, this stored value becomes a new tag "Tyne & Wear" - but why? The tag should be "Tyne & Wear", because that is what is stored. This is where the output XML conversion is done too early, and then starts to impact other stages (ie. the tag generation) that actually need the raw data to operate on, and not the data that happens to have been converted for a specific output media (a web browser).

Supposing that location is now sent to a vCard, which is a non-XML text file? My address would appear as "Tyne & Wear", which is patently wrong.

Trust me - this must be sorted out and some very clear rules put into place at this early stage, otherwise you will have big data problems down the line, and will end up with so many layers of conversion that it could result in security problems down the line.


Title changed from Strange mapping of display name characters to Sanitise on output not input [Was: Strange mapping of display name characters] by trac user marcus on 38996756-02-04


trac user marcus wrote on 39115060-12-05

(In [svn:2709]) Refs #561: Split filtering into separate function


trac user marcus wrote on 39117387-06-03

I am beginning to think that maybe we shouldn't be using kses for sanitisation in any case - it would appear that the project is dead (hasn't been updated since 2005), and I don't think we should be adopting it as part of elgg.

There is a case for stripping certain tags from input by default, but I agree that &amp encoding should not be performed.


cash wrote on 39118718-07-05

As a point of reference, Wordpress and Drupal both use kses (though both projects may be updating the kses code since it isn't being maintained). Joomla uses custom filtering code.

Joomla has filtering on both input and output:[[BR]] [[BR]] [[BR]]

Drupal only filters on output for more flexibility. They have a very useful extensible filtering system.[[BR]] [[BR]] [[BR]]


trac user marcus wrote on 39120071-03-04

(In [svn:2724]) Refs #561: Removed &amp encoding from kses but retaining script input and entities.


trac user marcus wrote on 39166783-12-15

(In [svn:3009]) Closes #828: Quite correct - rather tired - arrays are individually trimmed - non-arrays are not.
Closes #714: Input filtering now triggers on a plugin hook, this allows plugins to provide other filtering methods than kses (Refs #561).


trac user marcus wrote on 39443423-07-15

Note to self: At some point evaluate turning filtering off by default. For this to happen, output views must either filter tags or in some way sanitise output (should already occur).


trac user darkwingduck wrote on 39554848-08-25

I would only filter on output (or on input, keeping the original stream) and keep the filtered stream in the db in a cache field, keyed with a hash (crc, preferably) of filter names + versions used to filter the cached data. And optionally cache the current filter chain hash in php.

So it would be flexible, open to modifications on existing data, and would require nearly the same cpu time with input-only filtering. Drupal does this right, though they have some small issues with their interface.

We may have input and output filters (or an input/output option on each filter), but then I wouldn't use an input filter on my server anyway, so I always have the original data.


trac user darkwingduck wrote on 39554854-11-22

And optionally cache the current filter chain hash in php.

I mean, hold it in a php variable =)


davetosh wrote on 39678375-08-30

Since v1.6, Elgg has switched to htmlawed - how are people finding that?


cash wrote on 39678589-04-22

I think the large issue is not what is being used to do the filtering, but when the filtering/escaping is happening. The general practice is to filter input for bad stuff and escape the output to get proper html (some also filter output using a different set of filters). Elgg has a tendency to escape the input causing problem elsewhere.

Here are some relevant articles:


cash wrote on 39752539-03-12

To be more explicit - htmLawed as configured escapes certain characters on input rather than just restricting itself to filtering.


Milestone changed to Elgg 1.8 by brettp on 40084594-06-04


Title changed from Sanitise on output not input [Was: Strange mapping of display name characters] to Sanitise on input, escape on output. by brettp on 40084594-06-04


trac user necronet wrote on 40471951-08-22

wouldt be a good idea to add a parameter on:

function serialise_object_to_xml($data, $name = "", $n = 0, $specialChar=true)

the boolean will allowed to know wether the specialChar should be called or not a good example of false would be for RESTful Methods or specify by the object....


ewinslow wrote on 40693177-06-10

It seems like the main problem is that the current filtering/sanitization system assumes that every input is html (htmLawed is run on all inputs with reckless abandon). But I don't always mean to input html.

It seems like it might be ok to have htmlawed make it standards compliant if it's meant to be html input. For example, <p>&</p> becomes <p>&amp;</p>.

However, when you input plain text (or anything else), you don't want that to happen. For example Jack & Jill should not become Jack &amp; Jill. I also wouldn't like it if one of my blog titles was "Introducing <script> injections" and it got "filtered" to "Introducing injections". Rather, I'd expect it to be left alone entirely and then escaped before output to be "Introducing <script> injections".

So it seems like get_input needs an extra specifying what kind of input it's expecting. And then the hooks can listen for only the applicable type of input.


ewinslow wrote on 40876291-07-30

The other option I can think of here is that inputs are not santized at all, but the ElggEntity classes do the sanitization on assignment. In order to choose the correct sanitization algorithm, you would have to manually specify what "type" certain fields should be, probably via plugin hook. For example,

trigger_plugin_hook('attributes', 'object', array('subtype' => 'page'), array());

register_plugin_hook('attributes', 'object', 'myplugin_custom_page_attributes');

myplugin_custom_page_attributes($hook, $type, $attributes, $params) {
    if ($params['subtype'] == 'page') {
        return array(
            'title' => array('type' => 'text'),
            'description' => array('type' => 'longtext'),

If I remember correctly, this is similar to the way custom profile fields work, so it might not be so far off from what we're already doing.


Milestone changed to Elgg 1.9 by cash on 40888586-08-01


trac user twall wrote on 41624431-02-05

This is turning out to be really horrendous for our application and we have not much choice but to disable htmlawed. We rely heavily on the REST API, and very little of the input has to do with HTML input.

I've added an extra flag to the REST API to indicate that an input should not be filtered, but this is getting to be a pain since basically it has to be unfiltered 99% of the time, and we still have problems when a widget setting has an ampersand in the value. We end up having to rewrite action code to avoid filtering the input.


trac user mrclay wrote on 42135297-12-04

Most web apps have issues like this, and it's [ why I wrote this], suggesting devs should explicitly declare the content/escaping model of all string content.

Like Evan said, Elgg seems to assume all content could be [ Markup]. This isn't a sin, but means things like titles (which should be arguably be [ Unescaped]) can't contain anything that looks like markup-ish without it being interpreted or stripped by htmlawed.

But that's not the only problem. It forces Elgg to use hacks like setting to false the 4th argument of htmlspecialchars, which breaks content loading in TinyMCE. Save this via TinyMCE then try editing it:

1 < 2, 2<3, three<four

ewinslow wrote on 42135941-06-30

Why don't we just remove the input hook?


trac user mrclay wrote on 42140093-01-01

Evan I think the general approach of filtering by default (assuming markup) is OK. I see the ultimate goals as being able to remove the 4th arg of htmlspecialchars (by default) in all views, and to store arbitrary strings in titles/names.

It seems most content (including titles/names) is already stored as markup and echoed directly without escaping, so I think the path of least resistance is to document that, then for single line inputs assumed to hold arbitrary content (like titles), run them through htmlspecialchars instead of htmlawed during input.


ewinslow wrote on 42140263-03-28

I think this approach will add unexpected complexity in other parts of the code. Character limits in the wire for example will not behave as expected.

I don't think it's a good idea to train people to trust the data coming out of the database. The rest of the web escapes and filters on output. Since someone can beat the filter you have to filter again on output anyways, so I definitely don't see the value in filtering on input.

We're in a bit of a catch 22 here. No matter what we choose, someone is going to feel the pain and not like it one bit.


trac user mrclay wrote on 42146320-06-23

We're in a bit of a catch 22 here. No matter what we choose, someone is going to feel the pain and not like it one bit.

Very good point, so where to start? I suggest documenting the string subtype of all known attributes and metadata in core classes and plugins, and figuring out what they really should be, and what types of filtering/escaping should be applied during input.

E.g. Currently titles/names are markup and filtered with htmlawed, but we could just htmlspecialchars them. That would fix the data loss immediately with few side effects and give a straightforward method to convert them back to plain text. At some upgrade point these could be converted to plain text alongside new code to escape them in the HTML.


ewinslow wrote on 42146427-03-27

If updating templates to properly escape data in output is the ultimate goal, then I'm not sure I see the value in going through all this trouble to document string types and escape on input accordingly. It just seems like an unnecessary middle step, no?


ewinslow wrote on 42165234-03-20

tl;dr: How about we switch the default third parameter of get_input to false?

This will never mangle data accidentally but is unsafe by default. When input is expected to be html, plugin devs can pass true to turn on html filtering. I'm inclined to say that we should be safe by default, but there are so many places where we would have to be using false (Most inputs are not html) and so few where we actually need to use true that it feels like true is a bad default, and asking devs to pass false everywhere would be more work than just asking plugin devs to explicitly pass true to apply html filtering to inputs.

Doing this means that devs will need to start escaping content in the appropriate locations. That's pretty cumbersome now. We seriously need to provide an h() function to make this easier. Requiring elgg_view('output/text', array('value' => $val)); is a sure way to get people NOT to escape their output. That view should only exist so that auto-generated forms can do "output/$type" and be safe. htmlspecialchars($text, ENT_QUOTES, 'UTF-8'); isn't much better.

This would require a security review of all Elgg views + actions. I could do it -- or we could make it more of a community thing where we write a blog post, raise awareness, and offer some sort of incentive for people who find and fix the most of these escaping bugs.

I'm slating this for 1.9 because I think the root problem needs to get solved soon, whatever approach we take. This ticket has been open for years. It's time to close it.


Milestone changed to Elgg 1.9.0 by ewinslow on 42165234-03-20


trac user mrclay wrote on 42165423-12-20

Evan: Let's say we move forward with that. Things to consider:

  • Timing of this switchover would be crucial.
  • Titles and other non-html data would need to be un-escaped in the DB.
  • The adding of escaping to the associated views (anything outputting titles) would have to be thorough.
  • A plugin not using elgg views to output markup (assuming filtering on input) would create a truck-wide XSS vulnerability.

I suggest deprecating the 3rd argument of get_input (having it always return filtered markup) and instead add a function get_unescaped_input(). If get_input were called with a falsy 3rd argument, throw a deprecation warning. That makes the default for old plugins over-escaped markup instead of hidden XSS holes.

Without this change I think upgrading from 1.8 with non-core/non-security-reviewed plugins would be horribly risky. There are probably other areas of the API needing replacement to make this change as safe as possible.


trac user mrclay wrote on 42165429-09-02

It might also be worth building a tool to flag dangerous-looking code. E.g. things like echo $var->property.


ewinslow wrote on 42190601-08-04

This looks promising:


cash wrote on 42543336-10-18

If this is going into 1.9, we need to start planning on this soon. I believe Evan's views on this have evolved in the last 5 months (do away with escaping functions for templates and replace with auto escaping templates - like Trig). I've created a google doc since reading through all these comments to remember what the current state is is tiring.

mrclay commented Feb 20, 2013

Closing this rambling, philosophical ticket that's not proving very helpful. Solutions should go in separate tickets and should take into account:

  • 3rd party plugins currently may assume a lot of DB content is safe to echo AS IS (changing that contract would be disastrous)
  • Upgrades should not require torturous data migrations to unescape content

IMO #4397 (basic escaping API covering OWASP guidelines) is a first necessary step.

/cc @cash @ewinslow

@mrclay mrclay closed this Feb 20, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment