[RFC] Regex based IP detection #44

Closed
open-dynaMIX opened this issue May 9, 2020 · 5 comments · Fixed by #48

Comments

@open-dynaMIX
Member

open-dynaMIX commented May 9, 2020

Rationale

Our column-based approach to specifying the location of an IP address is not flexible enough to cover all use cases.

A good example of such a use case can be seen in this issue. Since the log format for nginx error logs can't be configured, Anonip can't reliably detect IP addresses in them.

Proposal

I propose an alternative, regex-based way of detecting IP addresses.

I don't intend to match the IP addresses themselves with regexes! Instead, I'd like to provide a way to point Anonip to the locations of IP addresses with a regex.

This alternative approach should be provided alongside the already existing column-based approach.

When using the new --regex argument, the arguments --column and --delimiter will become obsolete.

--replace can still be used for cases where a group matches, but its content is not a valid IP address.
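
To illustrate how a captured group and --replace could interact, here is a minimal sketch. It is not Anonip's actual code: the function name `mask_or_replace`, the default bit counts, and the choice to leave invalid input untouched when no replacement is given are all assumptions made for this example.

```python
import ipaddress


def mask_or_replace(candidate, replacement=None, v4_bits=12, v6_bits=84):
    """Mask a captured string if it is an IP address, else fall back.

    Hypothetical helper: the names and default bit counts are illustrative.
    """
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        # The group matched, but its content is not a valid IP address.
        return candidate if replacement is None else replacement

    bits = v4_bits if ip.version == 4 else v6_bits
    prefix = ip.max_prefixlen - bits
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(mask_or_replace("192.0.2.17"))
print(mask_or_replace("XXX.XXX.XXX.XXX", replacement="0.0.0.0"))
```

The first call zeroes the lowest 12 bits of a documentation address; the second shows the fallback for a group that matched but is not an IP address.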

Example

The regexes provided in the examples are simplified and only illustrate the proposed feature. For production environments, you will want more robust ones.

Let's use the log line from the aforementioned issue:

2020/03/05 19:27:43 [error] 1253#1253: *15347 open() "/usr/share/nginx/html/favicon.ico" failed (2: No such file or directory), client: XXX.XXX.XXX.XXX, server: address.tld, request: "GET /favicon.ico HTTP/1.1", host: "address.tld"

With the new feature in place, we could do:

$ ./anonip.py --regex ".* client\: ([^,]+), .*"

This would then match the provided log line and capture the IP address (XXX.XXX.XXX.XXX) into the first group.

In order to find all IP addresses, Anonip would then iterate over all available matched groups (just one in this example).
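
The matching and group iteration described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `find_candidates` is a hypothetical name, and the placeholder XXX.XXX.XXX.XXX from the log line is swapped for the documentation address 192.0.2.17 so the output is easier to follow.

```python
import re

# Simplified pattern from the example above; not robust enough for production.
PATTERN = re.compile(r".* client: ([^,]+), .*")

# Same log line as above, with an assumed documentation IP address.
LINE = (
    '2020/03/05 19:27:43 [error] 1253#1253: *15347 open() '
    '"/usr/share/nginx/html/favicon.ico" failed (2: No such file or directory), '
    'client: 192.0.2.17, server: address.tld, '
    'request: "GET /favicon.ico HTTP/1.1", host: "address.tld"'
)


def find_candidates(line):
    """Return the text of every capture group that took part in the match."""
    match = PATTERN.match(line)
    if match is None:
        return []
    return [group for group in match.groups() if group is not None]


print(find_candidates(LINE))  # ['192.0.2.17']
```

Each returned string would then be handed to the existing anonymization step, exactly as a column value is today.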

More involved example

Let's say we still want to handle the above log line, but additionally we expect lines in the following format:

1970-01-01 - somefixedstring: XXX.XXX.XXX.XXX - exception foo - XXX.XXX.XXX.XXX

Note the two IP addresses.

This can be handled in one single regex:

$ ./anonip.py --regex "(?:.*, client\: ([^,]+), .*|.* - somefixedstring\: ([^,]+) - .* - ([^,]+))"
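
A short Python sketch shows what such an alternation yields: groups that belong to the branch that did not match come back as None and have to be skipped. The function name `captured` and the sample addresses are assumptions for this example.

```python
import re

# The combined pattern from the command above (simplified, for illustration).
PATTERN = re.compile(
    r"(?:.*, client: ([^,]+), .*|.* - somefixedstring: ([^,]+) - .* - ([^,]+))"
)


def captured(line):
    """Return (group_number, text) pairs for the groups that matched."""
    match = PATTERN.match(line)
    if match is None:
        return []
    return [
        (number, text)
        for number, text in enumerate(match.groups(), start=1)
        if text is not None  # groups of the other branch are None
    ]


nginx_line = "open() failed (2: No such file), client: 192.0.2.17, server: address.tld"
other_line = "1970-01-01 - somefixedstring: 192.0.2.1 - exception foo - 198.51.100.7"

print(captured(nginx_line))  # [(1, '192.0.2.17')]
print(captured(other_line))  # [(2, '192.0.2.1'), (3, '198.51.100.7')]
```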

Considerations

This opens the door to very verbose and hard-to-read commands for running Anonip against certain logs.

But for more advanced users, it would fill the gap that currently exists for log files whose formats Anonip can't parse.

@open-dynaMIX
Member Author

/cc @datenreisen

@open-dynaMIX
Member Author

Over another channel, the question of performance impact was raised.

In order to get some numbers on this, I've implemented a prototype of the feature and hacked together a quick-and-dirty profiling script.

For the tests, I've used an example access.log with 10000 lines.

To smooth out any spikes, I ran this file through Anonip 1000 times with regex matching and another 1000 times with normal column-based matching.

Here are the results:

$ ./profile.sh 1000
regex based: 0.571898 seconds average
column based: 0.509772 seconds average
column based detection is 10.87% faster than regex based detection
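
For reference, the percentage can be reproduced from the two printed averages. The sketch below assumes the figure is taken relative to the regex-based time; computed from the rounded averages shown above it comes out at roughly 10.86%, so the small difference to 10.87% presumably stems from rounding in the script's output. Relative to the column-based time, the same gap is roughly 12.19%.

```python
# Averages as printed by the profiling run above.
regex_avg = 0.571898
column_avg = 0.509772

difference = regex_avg - column_avg

# "Column based is X% faster": time saved relative to the regex-based run.
column_faster_pct = difference / regex_avg * 100

# "Regex based is Y% slower": extra time relative to the column-based run.
regex_slower_pct = difference / column_avg * 100

print(f"column based is {column_faster_pct:.2f}% faster")
print(f"regex based is {regex_slower_pct:.2f}% slower")
```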

Based on those numbers, I'd say the performance hit definitely is a concern. OTOH: For normal parsing of a (configurable) access.log, column-based detection is absolutely sufficient. In the context of webserver logs, regex-based detection would only come into play for (unconfigurable) error.logs.

As the effort needed to properly implement this feature is manageable, I propose we implement it and transparently document its performance impact, advising users to use it only when absolutely necessary and when performance is not critical.

@datenreisen
Contributor

What are your opinions? @rettichschnidi @benib @ganti @ryru @packi @ideadapt

@rettichschnidi
Member

rettichschnidi commented Oct 18, 2020

Some thoughts:

  • I like the idea and I am willing to give it a try on a (test) server
  • A 10% performance penalty seems easily acceptable to me
  • Instead of having the user write a complicated regex to match multiple locations, how about allowing multiple --regex arguments to be passed? They could be concatenated again internally if that helps performance.
  • Being able to read the regex(es) from a file might free the user from having to pass lengthy arguments that require escaping
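
The suggestion of accepting several --regex arguments and joining them internally could look roughly like this. It is a sketch under assumptions, not the actual implementation: the parser setup, the `combine` helper, and joining with `|` inside non-capturing groups are choices made for this example.

```python
import argparse
import re


def build_parser():
    """Hypothetical parser that accepts --regex more than once."""
    parser = argparse.ArgumentParser(prog="anonip-sketch")
    parser.add_argument("--regex", action="append", default=[], metavar="PATTERN")
    return parser


def combine(patterns):
    """Join the patterns into one alternation, each wrapped in (?:...)."""
    return re.compile("|".join(f"(?:{pattern})" for pattern in patterns))


args = build_parser().parse_args([
    "--regex", r".*, client: ([^,]+), .*",
    "--regex", r".* - somefixedstring: ([^,]+) - .* - ([^,]+)",
])
combined = combine(args.regex)

line = "1970-01-01 - somefixedstring: 192.0.2.1 - exception foo - 198.51.100.7"
print(combined.match(line).groups())  # (None, '192.0.2.1', '198.51.100.7')
```

Wrapping each pattern in a non-capturing group keeps the alternation from leaking across pattern boundaries, while the capture groups keep their left-to-right numbering.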

open-dynaMIX added a commit to open-dynaMIX/Anonip that referenced this issue Oct 19, 2020
This commit implements regex based IP detection. This is intended to use
for logfiles where column based detection doesn't work.

See RFC (DigitaleGesellschaft#44) for more information.

Closes DigitaleGesellschaft#44
@open-dynaMIX
Member Author

  • I like the idea and I am willing to give it a try on a (test) server

Awesome 🚀
I've opened a PR (#48) with a draft implementation.

  • Instead of having the user write a complicated regex to match multiple locations, how about allowing multiple --regex arguments to be passed? They could be concatenated again internally if that helps performance.

Great idea! Implemented in a similar way: --regex allows for multiple regexes that are concatenated by anonip.

  • Being able to read the regex(es) from a file might free the user from having to pass lengthy arguments that require escaping

Great idea! Let's save this for a later iteration though.

open-dynaMIX added a commit to open-dynaMIX/Anonip that referenced this issue Nov 4, 2020
This commit implements regex based IP detection. This is intended to use
for logfiles where column based detection doesn't work.

See RFC (DigitaleGesellschaft#44) for more information.

Closes DigitaleGesellschaft#44
open-dynaMIX added a commit to open-dynaMIX/Anonip that referenced this issue Dec 17, 2021
This commit implements regex based IP detection. This is intended to use
for logfiles where column based detection doesn't work.

See RFC (DigitaleGesellschaft#44) for more information.

Closes DigitaleGesellschaft#44
open-dynaMIX added a commit to open-dynaMIX/Anonip that referenced this issue Dec 26, 2021
This commit implements regex based IP detection. This is intended to use
for logfiles where column based detection doesn't work.

See RFC (DigitaleGesellschaft#44) for more information.

Closes DigitaleGesellschaft#44
open-dynaMIX added a commit to open-dynaMIX/Anonip that referenced this issue Dec 26, 2021
This commit implements regex based IP detection. This is intended to use
for logfiles where column based detection doesn't work.

See RFC (DigitaleGesellschaft#44) for more information.

Closes DigitaleGesellschaft#42, closes DigitaleGesellschaft#44