Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML tag cleaning #2177

Closed
wants to merge 4 commits into from
Closed

HTML tag cleaning #2177

wants to merge 4 commits into from

Conversation

LunarWatcher
Copy link
Member

Attempts to remove this problem, where patterns are converted to HTML tags because the pattern is interpreted as markdown and converted.

Since removing HTML tags everywhere isn't necessarily an option, this only applies to adding blacklists. It should probably be added to !!/unwatch as well, but it would need some edits to make sure it's possible to remove blacklists with HTML tags (and I'm not familiar enough with Smokey to do that without breaking stuff).

@coveralls
Copy link

coveralls commented May 18, 2018

Coverage Status

Coverage decreased (-0.08%) to 65.286% when pulling 54e8e89 on LunarWatcher:patch-5 into 8c882f0 on Charcoal-SE:master.

helpers.py Outdated
"""
Removes some HTML tags from input and replaces it with markdown
"""
raw_message = regex.subf("<i>(.*)</i>", "*{1}*", raw_message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(.*) is greedy; it'll match the closing tag. You want ([^<]+).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(.*?) should also work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArtOfCode- local testing excludes the closing tag using this (checking against the one that failed earlier to validate that). I'll change it though.

All though the one you suggested would break if < or > is actually used in the regex in addition to italic. See this example. There are live examples containing these chars, and they're used in regex, which means there is a chance it would break with ([^<]+)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LunarWatcher What's the example regex supposed to be once cleaned up? Should the tags be converted to their Markdown equivalents?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NobodyNada The goal is to convert HTML formatted watches and blacklists to markdown instantly instead of manually fixing it after running the command.

This specifically applies to regexes like the one linked, and one that came before it, where the raw text contains chars that are interpreted as formatting instead of keeping the plain format.

Examples with the appropriate form and the HTML form:

Markdown (the goal) HTML
viet\W*cruise\W*(?:travel\W*agency|tours) viet\W<i>cruise\W*(?:travel\W</i>agency|tours)
auto\W*link\W*pk auto\W<i>link\W</i>pk
total\W*credit\W*restoration total\W<i>credit\W</i>restoration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LunarWatcher I was asking specifically about the regex101 link -- is it a regex meant to match HTML tags, or are the HTML tags supposed to be cleaned out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind, I just realized that regex is an example of the regex Art suggested. My mistake; sorry for the noise.

Require at least one char in the tag to match and replace the group, added lazy modifier
@NobodyNada
Copy link
Member

Has this been tested?

@LunarWatcher
Copy link
Member Author

@NobodyNada I've tested it under the same conditions as a blacklist (just some temporary code to see it worked properly without necessarily triggering a push), the HTML-cleaned regex (tested with the one that failed earlier) gets cleaned properly and the compiling doesn't fail.

I haven't run it as a SD instance though

@NobodyNada
Copy link
Member

@LunarWatcher Great! I'm running just a couple quick tests (e.g. to make sure we're getting chat messages in the expected format, to make sure edge cases don't break anything) and I'll merge in a bit.

@makyen
Copy link
Contributor

makyen commented May 18, 2018

Rather than attempt to "cleanup" the information, which could get it wrong, particularly if someone was wanting to include one more more of those HTML tags in the regex, why not fetch the actual information which was provided by the user?

For instance, the example in first comment here is available from:
https://chat.stackexchange.com/message/44685266?plain=true

@NobodyNada
Copy link
Member

@makyen That sounds like a good idea. If I remember correctly, there's a way to mark a specific command as accepting plain-text instead of rendered HTML -- it's not the default because it's up to a couple seconds slower.

@NobodyNada
Copy link
Member

Looks like ChatExchange gives us the plaintext content in msg.content_source. It includes the !!/blacklist-keyword prefix, so we'll have to drop all text before the first space.

@makyen
Copy link
Contributor

makyen commented May 18, 2018

@NobodyNada Assuming it has to perform an additional fetch, I would certainly expect it to be somewhat slower. I'm a bit surprised that it's up to a couple of seconds, but it is whatever it is. For the case of watches and blacklists, that doesn't seem to be that much of a penalty.

I guess what we really need to choose is do we want to use what the user actually enters, or do we want to have processing done on that input? IMO, we should go with using what the user has actually typed. Once we get people away from having already been conditioned to use \ to quote Markdown syntax, I think that we will get fewer surprises if we always use exactly what the user has entered, rather than have it processed multiple times.

If using what the user has actually typed is something that already exists in the code and can be chosen on a command by command basis, that certainly seems to be the right way to go, even if there's a couple/few second penalty for each command of the types so configured (assuming it's only configured that way on commands where it consistently matters).

@NobodyNada
Copy link
Member

@makyen I think that's the right way to go (cc @quartata, since he's put some thought into fetching the plaintext and the latency it adds). That makes things a lot simpler, and it makes it nearly impossible for Smokey to misinterpret a blacklist regex.

@LunarWatcher Do you want to implement this, or should I?

@LunarWatcher
Copy link
Member Author

@NobodyNada I'm not near familiar enough with Smokey and chatexchange to add it. It'd probably easiest and quickest if you did it. I'm also closing this PR, since this won't be used either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants