HTML tag cleaning #2177

LunarWatcher · 2018-05-18T13:22:49Z

Attempts to fix this problem, where parts of a pattern are turned into HTML tags because chat interprets the pattern as Markdown and renders it.

Since removing HTML tags everywhere isn't necessarily an option, this only applies to adding blacklists. It should probably be added to !!/unwatch as well, but it would need some edits to make sure it's possible to remove blacklists with HTML tags (and I'm not familiar enough with Smokey to do that without breaking stuff).

coveralls · 2018-05-18T13:26:19Z

Coverage decreased (-0.08%) to 65.286% when pulling 54e8e89 on LunarWatcher:patch-5 into 8c882f0 on Charcoal-SE:master.

ArtOfCode- · 2018-05-18T13:29:45Z

helpers.py

+    """
+    Removes some HTML tags from input and replaces it with markdown
+    """
+    raw_message = regex.subf("<i>(.*)</i>", "*{1}*", raw_message)


(.*) is greedy; it'll match the closing tag. You want ([^<]+).

(.*?) should also work.

@ArtOfCode- In local testing, this pattern does exclude the closing tag (I checked it against the regex that failed earlier to validate that). I'll change it though.

Although the one you suggested would break if < or > is actually used in the regex in addition to the italics. See this example. There are live examples containing these characters, and they're used in regexes, which means there is a chance it would break with ([^<]+).
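To illustrate the three capture groups discussed in this thread, here is a minimal sketch using Python's standard `re` module. The PR itself uses the third-party `regex` module with `subf`, so this is an approximation, and the sample strings are made up for the illustration.

```python
import re

# Made-up input with two separate italic spans.
two_spans = r"a<i>b</i>c<i>d</i>e"

# Greedy: (.*) runs to the LAST </i>, so the inner tags stay inside the group.
greedy = re.sub(r"<i>(.*)</i>", r"*\1*", two_spans)

# Lazy: (.+?) stops at the FIRST </i> and needs at least one character.
lazy = re.sub(r"<i>(.+?)</i>", r"*\1*", two_spans)

# Negated class: also stops at the first tag, but cannot cross a literal "<".
negated = re.sub(r"<i>([^<]+)</i>", r"*\1*", two_spans)

# Made-up input where the italic span itself contains "<" (a lookbehind).
with_angle = r"x<i>(?<!y)z</i>w"
lazy_angle = re.sub(r"<i>(.+?)</i>", r"*\1*", with_angle)
negated_angle = re.sub(r"<i>([^<]+)</i>", r"*\1*", with_angle)

print(greedy)         # a*b</i>c<i>d*e
print(lazy)           # a*b*c*d*e
print(negated)        # a*b*c*d*e
print(lazy_angle)     # x*(?<!y)z*w
print(negated_angle)  # x<i>(?<!y)z</i>w  (no match, left unchanged)
```

The last two lines show the concern raised above: the negated class silently leaves the tags in place once the wrapped text contains `<`, while the lazy group still converts it.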

@LunarWatcher What's the example regex supposed to be once cleaned up? Should the tags be converted to their Markdown equivalents?

@NobodyNada The goal is to convert HTML-formatted watches and blacklists to Markdown automatically, instead of fixing them manually after running the command.

This specifically applies to regexes like the one linked, and one that came before it, where the raw text contains chars that are interpreted as formatting instead of keeping the plain format.

Examples with the appropriate form and the HTML form:

| Markdown (the goal) | HTML |
| --- | --- |
| `viet\W*cruise\W*(?:travel\W*agency\|tours)` | `viet\W<i>cruise\W*(?:travel\W</i>agency\|tours)` |
| `auto\W*link\W*pk` | `auto\W<i>link\W</i>pk` |
| `total\W*credit\W*restoration` | `total\W<i>credit\W</i>restoration` |
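As a rough check of the examples above, the following sketch converts each HTML form back to Markdown and compares it with the intended pattern. The helper name `italics_to_markdown` is hypothetical, and it uses the standard `re` module rather than the `regex` module used in the PR.

```python
import re


def italics_to_markdown(text: str) -> str:
    """Hypothetical helper: turn each <i>...</i> span back into *...*."""
    return re.sub(r"<i>(.+?)</i>", r"*\1*", text)


# (intended Markdown pattern, HTML form as rendered by chat)
rows = [
    (r"viet\W*cruise\W*(?:travel\W*agency|tours)",
     r"viet\W<i>cruise\W*(?:travel\W</i>agency|tours)"),
    (r"auto\W*link\W*pk", r"auto\W<i>link\W</i>pk"),
    (r"total\W*credit\W*restoration", r"total\W<i>credit\W</i>restoration"),
]

results = [italics_to_markdown(html_form) == goal for goal, html_form in rows]
print(results)  # [True, True, True]
```

This only covers italics; other rendered formatting (bold, code, links) would need its own handling.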

@LunarWatcher I was asking specifically about the regex101 link -- is it a regex meant to match HTML tags, or are the HTML tags supposed to be cleaned out?

never mind, I just realized that regex is an example of the regex Art suggested. My mistake; sorry for the noise.

Require at least one char in the tag to match and replace the group, added lazy modifier

NobodyNada · 2018-05-18T15:28:20Z

Has this been tested?

LunarWatcher · 2018-05-18T15:37:35Z

@NobodyNada I've tested it under the same conditions as a blacklist (just some temporary code to see that it worked properly without necessarily triggering a push). The HTML-cleaned regex (tested with the one that failed earlier) gets cleaned properly, and compiling it doesn't fail.

I haven't run it as an SD instance though.

NobodyNada · 2018-05-18T15:47:47Z

@LunarWatcher Great! I'm running just a couple quick tests (e.g. to make sure we're getting chat messages in the expected format, to make sure edge cases don't break anything) and I'll merge in a bit.

makyen · 2018-05-18T16:48:57Z

Rather than attempt to "clean up" the information, which could get it wrong, particularly if someone wanted to include one or more of those HTML tags in the regex, why not fetch the actual information which was provided by the user?

For instance, the example in first comment here is available from:
https://chat.stackexchange.com/message/44685266?plain=true
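A minimal sketch of that idea, assuming the `?plain=true` URL shape shown above holds for other message IDs. The function names are hypothetical, and the exact encoding of the response body is not confirmed here, so treat the decoding step as an assumption.

```python
from urllib.request import urlopen


def plain_message_url(message_id: int, host: str = "chat.stackexchange.com") -> str:
    """Build the plain-text URL for a chat message (URL shape taken from the example above)."""
    return f"https://{host}/message/{message_id}?plain=true"


def fetch_plain_message(message_id: int, timeout: float = 10.0) -> str:
    """Fetch the text the user typed; assumes a UTF-8 response body."""
    with urlopen(plain_message_url(message_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(plain_message_url(44685266))
# https://chat.stackexchange.com/message/44685266?plain=true
```

The extra request is the source of the added latency mentioned in the replies below.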

NobodyNada · 2018-05-18T16:52:23Z

@makyen That sounds like a good idea. If I remember correctly, there's a way to mark a specific command as accepting plain-text instead of rendered HTML -- it's not the default because it's up to a couple seconds slower.

NobodyNada · 2018-05-18T17:08:21Z

Looks like ChatExchange gives us the plaintext content in msg.content_source. It includes the !!/blacklist-keyword prefix, so we'll have to drop all text before the first space.
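Dropping the prefix could look roughly like this. The function name is hypothetical, and the string argument stands in for the value of `msg.content_source`; how Smokey actually wires this up is not shown in this thread.

```python
def strip_command_prefix(content_source: str) -> str:
    """Drop everything up to and including the first space.

    Returns an empty string if the message has no space, i.e. no argument.
    """
    _command, _separator, rest = content_source.partition(" ")
    return rest


# Stand-in for msg.content_source on a blacklist command.
raw = r"!!/blacklist-keyword viet\W*cruise\W*(?:travel\W*agency|tours)"
pattern = strip_command_prefix(raw)
print(pattern)  # viet\W*cruise\W*(?:travel\W*agency|tours)
```

Using `partition` rather than `split(" ", 1)[1]` avoids an `IndexError` when the command is sent without an argument.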

makyen · 2018-05-18T17:09:16Z

@NobodyNada Assuming it has to perform an additional fetch, I would certainly expect it to be somewhat slower. I'm a bit surprised that it's up to a couple of seconds, but it is whatever it is. For the case of watches and blacklists, that doesn't seem to be that much of a penalty.

I guess what we really need to choose is do we want to use what the user actually enters, or do we want to have processing done on that input? IMO, we should go with using what the user has actually typed. Once we get people away from having already been conditioned to use \ to quote Markdown syntax, I think that we will get fewer surprises if we always use exactly what the user has entered, rather than have it processed multiple times.

If using what the user has actually typed is something that already exists in the code and can be chosen on a command by command basis, that certainly seems to be the right way to go, even if there's a couple/few second penalty for each command of the types so configured (assuming it's only configured that way on commands where it consistently matters).

NobodyNada · 2018-05-18T17:14:39Z

@makyen I think that's the right way to go (cc @quartata, since he's put some thought into fetching the plaintext and the latency it adds). That makes things a lot simpler, and it makes it nearly impossible for Smokey to misinterpret a blacklist regex.

@LunarWatcher Do you want to implement this, or should I?

LunarWatcher · 2018-05-18T17:33:12Z

@NobodyNada I'm not nearly familiar enough with Smokey and chatexchange to add it. It'd probably be easiest and quickest if you did it. I'm also closing this PR, since this won't be used either way.

LunarWatcher added 3 commits May 18, 2018 14:46

Update helpers.py

83291bf

Finalizes HTML cleaning in blacklisting and watching

4be2fd7

Pep8, shadowing fix

77def3b

ArtOfCode- requested changes May 18, 2018

View reviewed changes

Tiny regex change

54e8e89

Require at least one char in the tag to match and replace the group, added lazy modifier

tripleee assigned NobodyNada May 18, 2018

LunarWatcher closed this May 18, 2018

NobodyNada mentioned this pull request May 18, 2018

Use raw message text when blacklisting/watching #2178

Merged
