Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of ignore_user_agents #3238

Merged
merged 14 commits into from
May 13, 2023
Merged

Improved handling of ignore_user_agents #3238

merged 14 commits into from
May 13, 2023

Conversation

empiricompany
Copy link
Contributor

Modified the code to skip request logging for all user agents that contain any substring of an ignored agent in the ignore_user_agents config. The previous code only matched the exact string of user_agent with the ignore_user_agents config using in_array function, which resulted in some requests not being skipped from logging. Now, the code uses stristr function to match the string partially and case-insensitive, which improves the functionality of ignore_user_agents.

Modified the code to skip request logging for all user agents that contain any substring of an ignored agent in the ignore_user_agents config. The previous code only matched the exact string of user_agent with the ignore_user_agents config using in_array function, which resulted in some requests not being skipped from logging. Now, the code uses stristr function to match the string partially and case-insensitive, which improves the functionality of ignore_user_agents.
@github-actions github-actions bot added the Component: Log Relates to Mage_Log label May 5, 2023
@empiricompany empiricompany changed the title Improve ignore_user_agents functionality Better handling ignore_user_agents May 5, 2023
@empiricompany
Copy link
Contributor Author

empiricompany commented May 5, 2023

We could also improve the config by including a new bot and optimizing the strings to ignore the bot versions and other information.

app/code/core/Mage/Log/etc/config.xml

<global>

    <ignore_user_agents>
        <google1>Googlebot</google1>
        <amazon>Amazonbot</amazon>
    </ignore_user_agents>

fballiano
fballiano previously approved these changes May 7, 2023
@empiricompany
Copy link
Contributor Author

Do you think it would make sense to add a 4° option 'Logged In Only' to activate logs only for logged-in customers?

@empiricompany
Copy link
Contributor Author

I got a bit confused, sorry. I did some tests and found that stripos is slightly more performant than preg_match. Now, it seems OK

@fballiano
Copy link
Contributor

Let’s go, thanks @empiricompany

@sreichel
Copy link
Contributor

I got a bit confused, sorry. I did some tests and found that stripos is slightly more performant than preg_match. Now, it seems OK

Can you please share test results? With my tests preg_match was faster on large arrays.

I thinks its okay (but i still would move that to a new method to reduce complexity in _construct())?

@ADDISON74
Copy link
Collaborator

See an updated list here https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker. In addition, the number is much higher.

Besides an issue of decreased performance, the advantages of this PR must be carefully analyzed. Everyone should keep their own list and keep it as short as possible. I use .htaccess not to ignore bots but to block them.

@empiricompany
Copy link
Contributor Author

I got a bit confused, sorry. I did some tests and found that stripos is slightly more performant than preg_match. Now, it seems OK

Can you please share test results? With my tests preg_match was faster on large arrays.

I thinks its okay (but i still would move that to a new method to reduce complexity in _construct())?

I got a bit confused, sorry. I did some tests and found that stripos is slightly more performant than preg_match. Now, it seems OK

Can you please share test results? With my tests preg_match was faster on large arrays.

I thinks its okay (but i still would move that to a new method to reduce complexity in _construct())?

Besides having read on various forums that it's more performant, I did some simple tests with microtime. I didn't have time to do more complex tests on resource usage.

The result of the tests is a difference of just a few microseconds on a large array taken from here
https://github.com/fabiomb/is_bot/blob/master/is_bot.php:

2023-05-13T09:07:38+00:00 DEBUG (7): Benchmark preg_match 0.0002
2023-05-13T09:07:38+00:00 DEBUG (7): Benchmark foreach>stripos 0.0001

The code with foreach also seems less complex and clearer

@empiricompany
Copy link
Contributor Author

See an updated list here https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker. In addition, the number is much higher.

Besides an issue of decreased performance, the advantages of this PR must be carefully analyzed. Everyone should keep their own list and keep it as short as possible. I use .htaccess not to ignore bots but to block them.

I also use Nginx to block bad bots at the source, but I don't want to block Googlebot.

I don't think there are performance differences simply by changing the comparison method. This was supposed to be a very simple PR to prevent exclusion if the version changed in the user_agent string for googlebot, like Googlebot/2.1 change to Googlebot/2.0.

I wanted to include the "compatible;" keyword in the exclusion list or by default in the method, which I think is only used by bots?. However, I left this decision to the discretion of the developer.

If we want to develop a more complex mechanism, I'm open to other ideas (for example, we could save the check in session for subsequent requests).

Copy link
Contributor

@fballiano fballiano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still feel think PR could be merged as it is, I like it.

@fballiano
Copy link
Contributor

probably it could be improved in the future (like everything) but I don't see any point that would prevent us from implementing it :-)

@fballiano fballiano changed the title Better handling ignore_user_agents Improved handling of ignore_user_agents May 13, 2023
@fballiano fballiano merged commit c0d136a into OpenMage:main May 13, 2023
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Log Relates to Mage_Log
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants