New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML tag cleaning #2177
HTML tag cleaning #2177
Conversation
helpers.py
Outdated
""" | ||
Removes some HTML tags from input and replaces it with markdown | ||
""" | ||
raw_message = regex.subf("<i>(.*)</i>", "*{1}*", raw_message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(.*)
is greedy; it'll match the closing tag. You want ([^<]+)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(.*?)
should also work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArtOfCode- local testing excludes the closing tag using this (checking against the one that failed earlier to validate that). I'll change it though.
All though the one you suggested would break if <
or >
is actually used in the regex in addition to italic. See this example. There are live examples containing these chars, and they're used in regex, which means there is a chance it would break with ([^<]+)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LunarWatcher What's the example regex supposed to be once cleaned up? Should the tags be converted to their Markdown equivalents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NobodyNada The goal is to convert HTML formatted watches and blacklists to markdown instantly instead of manually fixing it after running the command.
This specifically applies to regexes like the one linked, and one that came before it, where the raw text contains chars that are interpreted as formatting instead of keeping the plain format.
Examples with the appropriate form and the HTML form:
Markdown (the goal) | HTML |
---|---|
viet\W*cruise\W*(?:travel\W*agency|tours) |
viet\W<i>cruise\W*(?:travel\W</i>agency|tours) |
auto\W*link\W*pk |
auto\W<i>link\W</i>pk |
total\W*credit\W*restoration |
total\W<i>credit\W</i>restoration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LunarWatcher I was asking specifically about the regex101 link -- is it a regex meant to match HTML tags, or are the HTML tags supposed to be cleaned out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, I just realized that regex is an example of the regex Art suggested. My mistake; sorry for the noise.
Require at least one char in the tag to match and replace the group, added lazy modifier
Has this been tested? |
@NobodyNada I've tested it under the same conditions as a blacklist (just some temporary code to see it worked properly without necessarily triggering a push), the HTML-cleaned regex (tested with the one that failed earlier) gets cleaned properly and the compiling doesn't fail. I haven't run it as a SD instance though |
@LunarWatcher Great! I'm running just a couple quick tests (e.g. to make sure we're getting chat messages in the expected format, to make sure edge cases don't break anything) and I'll merge in a bit. |
Rather than attempt to "cleanup" the information, which could get it wrong, particularly if someone was wanting to include one more more of those HTML tags in the regex, why not fetch the actual information which was provided by the user? For instance, the example in first comment here is available from: |
@makyen That sounds like a good idea. If I remember correctly, there's a way to mark a specific command as accepting plain-text instead of rendered HTML -- it's not the default because it's up to a couple seconds slower. |
Looks like ChatExchange gives us the plaintext content in |
@NobodyNada Assuming it has to perform an additional fetch, I would certainly expect it to be somewhat slower. I'm a bit surprised that it's up to a couple of seconds, but it is whatever it is. For the case of watches and blacklists, that doesn't seem to be that much of a penalty. I guess what we really need to choose is do we want to use what the user actually enters, or do we want to have processing done on that input? IMO, we should go with using what the user has actually typed. Once we get people away from having already been conditioned to use If using what the user has actually typed is something that already exists in the code and can be chosen on a command by command basis, that certainly seems to be the right way to go, even if there's a couple/few second penalty for each command of the types so configured (assuming it's only configured that way on commands where it consistently matters). |
@makyen I think that's the right way to go (cc @quartata, since he's put some thought into fetching the plaintext and the latency it adds). That makes things a lot simpler, and it makes it nearly impossible for Smokey to misinterpret a blacklist regex. @LunarWatcher Do you want to implement this, or should I? |
@NobodyNada I'm not near familiar enough with Smokey and chatexchange to add it. It'd probably easiest and quickest if you did it. I'm also closing this PR, since this won't be used either way. |
Attempts to remove this problem, where patterns are converted to HTML tags because the pattern is interpreted as markdown and converted.
Since removing HTML tags everywhere isn't necessarily an option, this only applies to adding blacklists. It should probably be added to !!/unwatch as well, but it would need some edits to make sure it's possible to remove blacklists with HTML tags (and I'm not familiar enough with Smokey to do that without breaking stuff).