Parsing error after X responsive pages #14

Closed
h-m-f-t opened this issue Oct 7, 2016 · 6 comments

Comments

@h-m-f-t

h-m-f-t commented Oct 7, 2016

domain-scan's gather feature queries the Censys API to collect .gov subdomains. The query parsed.subject.common_name:".gov" or parsed.extensions.subject_alt_name.dns_names:".gov" works on API pages 1-100, but errors out on any page ≥ 101:

$ ./gather censys --suffix=.gov --start=101 --end=101 --force --delay=5 --debug --parents=current-federal.csv --censys_id=<id> --censys_key=<key>

Starting new HTTPS connection (1): www.censys.io
"GET /api/v1/account HTTP/1.1" 200 245
Censys query:
parsed.subject.common_name:".gov" or parsed.extensions.subject_alt_name.dns_names:".gov"

Fetching up to 100 records, starting at page 101.
Fetching page 101.
"POST /api/v1/search/certificates HTTP/1.1" 400 116
Traceback (most recent call last):

  File "/home/hmft/Desktop/domain-scan-master/gatherers/censys.py", line 100, in gather
    certs = list(certificate_api.search(query, fields=fields, page=current_page, max_records=page_size))

  File "/usr/local/lib/python3.5/dist-packages/censys/base.py", line 151, in search
    payload = self._post(self.search_path, data=data)

  File "/usr/local/lib/python3.5/dist-packages/censys/base.py", line 98, in _post
    return self._make_call(self._session.post, endpoint, args, data)

  File "/usr/local/lib/python3.5/dist-packages/censys/base.py", line 92, in _make_call
    const=const)

censys.base.CensysException: 400 (es_transport_error): Your search was invalid and could not be parsed.

This same query on the web indicates 7,661 responsive pages:

image

...but any page ≥ 401 produces an error:

image

cc: @konklone
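
One way to compare the two failing boundaries is to convert each page number into the index of its first record. The sketch below assumes 1-based pages. The 100-record API page size comes from the log above; the 25-record web page size is an assumption, not something stated in this thread. `first_record` is a hypothetical helper, not part of domain-scan or the censys library.

```python
def first_record(page: int, page_size: int) -> int:
    """Return the 1-based index of the first record on a 1-based page."""
    return (page - 1) * page_size + 1


# API: 100 records per page (from "Fetching up to 100 records" in the log).
api_first_failing = first_record(101, 100)

# Web UI: 25 records per page is an ASSUMPTION made for illustration.
web_first_failing = first_record(401, 25)

print(api_first_failing, web_first_failing)  # prints "10001 10001"
```

Under those assumptions both failures begin at the same record offset, which would suggest a cap on the total number of reachable results rather than a problem with the query itself.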

@zakird
Member

zakird commented Oct 7, 2016

It looks like Elasticsearch 2.x limits a search to 10K results by default, and its documentation has the following to say:

To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results in order to select the overall top 10.

Now imagine that we ask for page 1,000—results 10,001 to 10,010. Everything works in the same way except that each shard has to produce its top 10,010 results. The coordinating node then sorts through all 50,050 results and discards 50,040 of them!

You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query.

We can increase the number some, but probably not infinitely. If you want to iterate over a very large set of results like this, you are going to be better off using the SQL or EXPORT API endpoints.
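
The arithmetic in the quoted passage can be reproduced with a small sketch. It models only the counting described in the quote (each shard returns its top `offset + size` hits, and the coordinating node keeps `size` of them), not Elasticsearch's actual implementation. `deep_paging_cost` and its parameter names are hypothetical.

```python
from typing import Tuple


def deep_paging_cost(offset: int, size: int, shards: int) -> Tuple[int, int]:
    """Return (hits gathered by the coordinating node, hits discarded).

    offset: number of results skipped before the requested page
    size:   number of results on the requested page
    shards: number of primary shards in the index
    """
    per_shard = offset + size        # each shard must produce this many top hits
    gathered = per_shard * shards    # the coordinating node sorts all of them
    discarded = gathered - size      # only one page's worth is returned
    return gathered, discarded


# First page of 10 results across five shards.
print(deep_paging_cost(0, 10, 5))       # prints "(50, 40)"

# Results 10,001 to 10,010 across five shards, as in the quoted example.
print(deep_paging_cost(10000, 10, 5))   # prints "(50050, 50040)"
```

In this simplified model the number of gathered hits grows in proportion to `offset + size`, multiplied by the number of shards, which is why deep pages are expensive even though only a handful of results are returned.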

@konklone

konklone commented Oct 7, 2016

I guess I'm surprised we didn't notice this until now -- maybe I haven't noticed error messages, but it also might be a recent issue. It sounds too basic for it to be recent, but I at least wanted to raise the possibility.

Given the issue, I do think Censys could supply a more precise/graceful error message than:

Error! Your search was invalid and could not be parsed.

You probably want to point users to a different API instead.

Since the REST API is the most user/developer-friendly API to Censys data, I would suggest raising the page limit to whatever Censys.io can reasonably support.

@zakird
Member

zakird commented Oct 8, 2016

We recently migrated from Elasticsearch 1.6 to 2.4, which is when this would have changed. I'll raise the result limit to 25,000, update the documentation, and change this to a better error message. This is the first time this has come up in this particular API. You can also export results through the REST API; it just has a slightly different set of semantics: https://censys.io/api/v1/docs/export.
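
A client that pages through results could guard against the cap by clamping its page range before making any requests. The following is a minimal sketch, assuming the limit applies to the index of the last record on a page and that pages are 1-based. `last_reachable_page` and `clamp_page_range` are hypothetical helpers, not functions from domain-scan or the censys library, and the limit values are passed in rather than read from the API.

```python
from typing import Optional, Tuple


def last_reachable_page(max_results: int, page_size: int) -> int:
    """Highest 1-based page whose records all fall within max_results."""
    return max_results // page_size


def clamp_page_range(start: int, end: int, page_size: int,
                     max_results: int) -> Optional[Tuple[int, int]]:
    """Clamp [start, end] to reachable pages, or return None if none remain."""
    last = last_reachable_page(max_results, page_size)
    if start > last:
        return None  # the whole requested range lies beyond the cap
    return start, min(end, last)


# With a 10,000-result cap, page 101 at 100 records per page is out of range.
print(clamp_page_range(101, 101, 100, 10000))   # prints "None"

# With a 25,000-result cap, a request for pages 1-300 is trimmed to 1-250.
print(clamp_page_range(1, 300, 100, 25000))     # prints "(1, 250)"
```

For result sets larger than the cap, the export endpoint mentioned above remains the route suggested in this thread; clamping only avoids sending requests that are expected to fail.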

h-m-f-t changed the title from "Parsing error in base.py after X responsive pages" to "Parsing error after X responsive pages" on Oct 9, 2016
@h-m-f-t
Author

h-m-f-t commented Oct 9, 2016

Thanks @zakird, and thank you for your efforts on Censys. 👍

h-m-f-t closed this as completed on Oct 9, 2016
@zakird
Member

zakird commented Nov 3, 2016

I just wanted to note that this error message should now describe the problem and paths forward. Example at https://censys.io/certificates?q=*&page=401.

@konklone

konklone commented Nov 6, 2016

Thank you!
