Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

small regex changes for better generic log support #2851

Merged
merged 2 commits into from Apr 17, 2021
Merged

small regex changes for better generic log support #2851

merged 2 commits into from Apr 17, 2021

Conversation

dkanada
Copy link
Contributor

@dkanada dkanada commented Apr 7, 2021

I encountered a few small issues while trying out the new generic log support.

Levels: Serilog uses three letter acronyms for their log levels
Exception: Many programming languages would benefit from highlighted exceptions (Java and C# for example)
UUID: Microsoft uses GUIDs which can be valid without dashes
Date: Serilog and Python output datetimes without a T character by default
Time: RFC 3339 allows +00:00 for the timezone portion

There were a few larger problems I encountered that I'd like to get some input on before changing the order and existence of the current rules.

  1. File paths will often contain UUIDs so I'd like to move the path and hash rules above the UUID rule.
  2. The string matching is extremely greedy and was stealing almost all tokens that would have otherwise matched UUID, path, IP, or URL patterns. I solved it by removing that pattern entirely, but I figured that might be too drastic.
  3. Any paths with spaces in the name weren't caught by the file-path rule. The logfile I was testing used strings to avoid path ambiguity (further exacerbating the string problem) but it doesn't seem like that rule accounts for this situation.

@github-actions
Copy link

github-actions bot commented Apr 7, 2021

JS File Size Changes (gzipped)

A total of 1 files have changed, with a combined diff of +27 B (+2.8%).

file master pull size diff % diff
components/prism-log.min.js 980 B 1.01 KB +27 B +2.8%

Generated by 🚫 dangerJS against 75d7ad0

@dkanada
Copy link
Contributor Author

dkanada commented Apr 7, 2021

I figured this would cause issues with any tests haha, let me know which changes make sense and I can fix the problems.

@RunDevelopment
Copy link
Member

Thank you for the PR @dkanada!

Lots of good ideas in here.

I'll go through your proposed changes one by one. I will only comment on whether these changes should be made not how they were/will be made. (Comments on the implementation will follow.)

In general, please try to share examples if you can. There is no one log format, so I have no idea what the logs you try to highlight look like.

Levels: Serilog uses three letter acronyms for their log levels

Accepted.

Exception: Many programming languages would benefit from highlighted exceptions (Java and C# for example)

From my testing, Java stack traces looked ok but I haven't tested C# exceptions. Could you please provide a few stack traces (just all in one file) that don't look good?

UUID: Microsoft uses GUIDs which can be valid without dashes

Accepted.

Date: Serilog and Python output datetimes without a T character by default

Accepted.

Time: RFC 3339 allows +00:00 for the timezone portion

Accepted.

  1. File paths will often contain UUIDs so I'd like to move the path and hash rules above the UUID rule.

File paths should be tokenized as such. IMO we should rather tokenize UUIDs inside file paths.

  1. The string matching is extremely greedy and was stealing almost all tokens that would have otherwise matched UUID, path, IP, or URL patterns. I solved it by removing that pattern entirely, but I figured that might be too drastic.

Strings are commonly used to store this kind of data. I don't really see why would want to not highlight strings. Do you have an example?

  1. Any paths with spaces in the name weren't caught by the file-path rule. The logfile I was testing used strings to avoid path ambiguity (further exacerbating the string problem) but it doesn't seem like that rule accounts for this situation.

Yes, that is intentional. The problem is that some log formats use tabs or spaces to separate arguments. The separator between a path and an error message might be a single space. A trade-off between false positives and false negatives had to be made and I decided on false negatives.

components/prism-log.js Outdated Show resolved Hide resolved
components/prism-log.js Outdated Show resolved Hide resolved
components/prism-log.js Outdated Show resolved Hide resolved
components/prism-log.js Outdated Show resolved Hide resolved
@dkanada
Copy link
Contributor Author

dkanada commented Apr 8, 2021

Regarding the string issues, I have no problems with matching strings at all. However, it seemed like the string matches were blocking any further matches for UUID or other more specific tokens. I don't personally want to highlight strings in my logs, but I definitely want to use a special color for IP addresses. I thought the UUID was overriding paths but I can't reproduce that one now so you can ignore that comment.

file-path and possible uuid are only matched as string in this ffmpeg log

Output #0, hls, to '/data/transcodes/275395c7ef978782d9755ae338e67cb7.m3u8':
  Metadata:
    encoder         : Lavf58.45.100

url and ip-address are only matched as string

[{timestamp}] [INF] [205] Emby.Server.Implementations.HttpServer.WebSocketManager: WS "{ip}" closed
[{timestamp}] [WRN] [255] Jellyfin.Server.Middleware.ResponseTimeMiddleware: Slow HTTP Response from "{url}" to "{ip}"

this exception isn't even matched with the property token

System.ArgumentNullException: Value cannot be null. (Parameter 'item')
   at Emby.Server.Implementations.Library.MediaSourceManager.GetStaticMediaSources(BaseItem item, Boolean enablePathSubstitution, User user)

seems like the timezone is broken again, I'll look into the changes tomorrow
final two digits here aren't matched correctly

[2021-04-08 00:05:22.310 +00:00]

@dkanada
Copy link
Contributor Author

dkanada commented Apr 8, 2021

It actually might not be an issue with strings but simply the special token matching - perhaps quotes aren't getting counted as word boundaries. I feel like wrapping variables in quotes is a relatively common practice though, so this doesn't seem unreasonable.

One more item I wanted to bring up was the properties, which match a trailing colon. Was that implemented to improve YAML and key-value output? Perhaps it would be better to match the colons as operators and remove them from the property rule with a non-capturing group.

@RunDevelopment
Copy link
Member

it seemed like the string matches were blocking any further matches for UUID or other more specific tokens

This is something I wanted to include in my first comment but forgot to.

The solution to this is to highlight IP-addresses, UUIDs, etc inside strings. Prism has the inside property for this (example).

[...] properties, which match a trailing colon [...] Perhaps it would be better to match the colons as operators and remove them from the property rule [...]?

Good idea. Go for it! (But please use a lookahead.)

Was that implemented to improve YAML and key-value output?

Yes, many logs contain key-value pairs using this pattern. But I actually never tested it on yaml files.

@dkanada
Copy link
Contributor Author

dkanada commented Apr 9, 2021

I decided against the property colon change in the end. One last issue is that the hash rule is designed in a way that currently conflicts with the new GUID changes. How do you want that resolved? Also, it seems the build CI is failing since the .min file isn't updated. Am I expected to update that manually or should I ignore it for now?

Copy link
Member

@RunDevelopment RunDevelopment left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I expected to update that manually

Yes. Run npm run build please.

the hash rule is designed in a way that currently conflicts with the new GUID changes

I haven't thought of that... This is definitely a problem but how are we going to resolve it?

The string 01234567890123456789012345678901 (32 hex digits) could be a hash (e.g. an MD5 hash) or a GUID without hyphens. What is it?

I say we drop unhyphenated GUIDs. UUIDs and GUIDs are supposed to be hyphenated anyway. Let's require the hyphens again.

components/prism-log.js Outdated Show resolved Hide resolved
components/prism-log.js Outdated Show resolved Hide resolved
@dkanada
Copy link
Contributor Author

dkanada commented Apr 12, 2021

I'm poking at the inner rule, but I'm still a bit unclear about its usage. As one example, would the UUID rule need to be duplicated at the root level and the inner block for strings?

@RunDevelopment
Copy link
Member

Yes, you need to duplicate the patterns you want inside strings. Luckily, there are variables :) In the case of log files, it should probably be done kinda like this:

var uuid = { pattern: ..., ... }; // the UUID pattern
var url = /.../; // the URL pattern

Prism.languages.log = {
	'string': {
		// Single-quoted strings must not be confused with plain text. E.g. Can't isn't Susan's Chris' toy
		pattern: /"(?:[^"\\\r\n]|\\.)*"|'(?![st] | \w)(?:[^'\\\r\n]|\\.)*'/,
		greedy: true,
		inside: {
			'url': url,
			'uuid': uuid
		}
	},
	...,
	'url': url,
	...,
	'uuid': uuid,
	...
};

Please don't forget to wrap the whole thing into an IIFE or else we pollute the global namespace.

@dkanada
Copy link
Contributor Author

dkanada commented Apr 17, 2021

After updating the tests for the exception rule, I noticed URLs and other tokens inside the exception were no longer getting matched, which might not be desirable. Since I have somewhat lost my spare time recently, I'd rather get the current changes merged than let the PR grow stale while I consider what to do about the inner matching.

@RunDevelopment
Copy link
Member

Since I have somewhat lost my spare time recently, I'd rather get the current changes merged than let the PR grow stale while I consider what to do about the inner matching.

Perfectly understandable.

Thank you for contributing!

@RunDevelopment RunDevelopment merged commit 45ec4a8 into PrismJS:master Apr 17, 2021
@dkanada dkanada deleted the generic-logs branch April 18, 2021 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants