Very slow to IPSet() from a large source, and very slow to go though iter_cidrs() as well #152

Open
mochouinard opened this issue Feb 11, 2017 · 4 comments

@mochouinard

I have a script that loads, from multiple databases, IP addresses that are considered infected in some way.

It takes about 3.5 minutes for my Python script to run.
It takes 5 seconds to load the data into an array in Python from a remote database (850,000 entries).

I have 1,287 IPs in an array (whitelist), which took 45 milliseconds to load using IPSet(listhere).
I also have 850,000 IPs in an array (blacklist), which took 68 seconds to load using IPSet(listhere).
It then took 75 seconds to remove the whitelist entries from the blacklist.
And it took 34 seconds to go through iter_cidrs().

Are these times considered acceptable? Should I look at some other library?

My goal is to create a shortened list of IP/mask entries that I can feed to Linux ipset to block those IPs.

@snordhausen
Contributor

Which version of netaddr are you using? Newer versions tend to be more optimized for larger data sets like yours. I also recall that version 0.7.18 had a big performance improvement for IPSet.

@mochouinard
Author

mochouinard commented Feb 12, 2017

I was running an older version, but upgraded to the latest GitHub trunk (commit 4205371, dated Sun Jan 22 23:47:00 2017) before running these tests and posting here.

@snordhausen
Contributor

I tried to reproduce loading 850,000 entries from a blacklist into an IPSet with this code:

import random, netaddr, time
# Produce a list of ips like ["1.23.4.5", "10.5.4.3", ...].
ips = [str(netaddr.IPAddress(random.randint(0, 2**32-1))) for _ in range(850000)]

start = time.time()
ip_set = netaddr.IPSet(ips)
print("Total time to build IPSet: %.2f" % (time.time() - start))

With netaddr 0.7.19 this takes

  • 15.22 seconds with Python 3.4
  • 13.59 seconds with Python 2.7

If you have a very slow CPU, you might get up to 68 seconds. That's mostly because netaddr focuses on flexibility (IPv4/6, allowing individual IPs or whole networks), and because 850k IPs is quite a lot.

I also built a quick benchmark to reproduce the entire process. One thing I noticed is that iter_cidrs is unnecessarily slow, taking 16 seconds on my computer. That's because it always sorts the cidrs. If we added an official way to get the cidrs in an unspecified order (e.g. just return list(self._cidrs)) it would take <0.1 seconds.
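
To illustrate the sorted versus unsorted trade-off described above, here is a minimal sketch using a toy stand-in class. It is not netaddr's IPSet code: the class name TinyCidrSet and the method iter_cidrs_unsorted are hypothetical, and storing the networks as dict keys under _cidrs is an assumption made only to mirror the name used in the comment.

```python
import ipaddress


class TinyCidrSet:
    """Toy stand-in for a CIDR set; NOT netaddr's IPSet implementation."""

    def __init__(self, cidrs):
        # Assumption: networks are kept as dict keys, mirroring the
        # `_cidrs` name mentioned in the comment above.
        self._cidrs = dict.fromkeys(ipaddress.ip_network(c) for c in cidrs)

    def iter_cidrs(self):
        # Sorted view: pays an O(n log n) sort on every call.
        return sorted(self._cidrs)

    def iter_cidrs_unsorted(self):
        # Hypothetical method: same members, no ordering guarantee, O(n).
        return list(self._cidrs)


cidr_set = TinyCidrSet(["192.168.0.0/16", "10.0.0.0/8", "172.16.0.0/12"])
print([str(net) for net in cidr_set.iter_cidrs()])
print([str(net) for net in cidr_set.iter_cidrs_unsorted()])
```

Both methods return the same networks; only the ordering guarantee differs, which matters little when the output is fed to a tool that does not care about order.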

@mochouinard
Author

Well, it's not running on the latest hardware on my end, for sure ;)

850k is a lot... But sadly, that's the average number of IPs I need to block on my firewall :(

Trying to work around it, but would be great if it could be made quicker.
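
One possible workaround, sketched here with Python's standard-library ipaddress module rather than netaddr: subtract the whitelist as plain integers, then collapse what remains into CIDR blocks. The function name collapse_blocklist is hypothetical, and the sketch assumes both lists hold only individual IPv4 address strings (no networks). It has not been timed against the figures in this thread, so it would need benchmarking on the real data.

```python
import ipaddress


def collapse_blocklist(blacklist, whitelist):
    """Return CIDR strings covering the blacklist minus the whitelist.

    Assumes every entry is a single IPv4 address string, not a network.
    """
    # Convert to integers so the whitelist removal is a plain set difference.
    blocked = {int(ipaddress.IPv4Address(ip)) for ip in blacklist}
    allowed = {int(ipaddress.IPv4Address(ip)) for ip in whitelist}
    remaining = sorted(blocked - allowed)

    # collapse_addresses merges adjacent addresses into the fewest networks.
    networks = ipaddress.collapse_addresses(
        ipaddress.IPv4Address(n) for n in remaining
    )
    return [str(net) for net in networks]


blacklist = ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.5"]
whitelist = ["10.0.0.5"]
print(collapse_blocklist(blacklist, whitelist))  # ['10.0.0.0/30']
```

The resulting strings are in address/prefix form, which is the shape the original post describes feeding to Linux ipset. If the whitelist ever contains whole networks, the integer set difference no longer applies and a network-aware subtraction would be needed.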
