Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten #776
Comments
Good summary of the existing behaviour and an interesting point, thanks. Not sure how frequently this issue would arise, but it is certainly worth considering, although, as you pointed out, there would be a cost in updating the configs and code.
Not very often for non-standard headers, but it happens. I've counted HTTP header names for a random selection of 250 WARC files and got about 14,000 unique lowercased header names, among them:
If URLs are partitioned "byIP", the So, it's probably more a matter of time and size of the crawl until a collision happens. It's also a minor security issue. The prefix would then also mark all prefixed metadata names as unsafe, same as for
For the code: at first glance, there aren't that many changes: the protocol implementations and a few more places. An alternative solution would be to force users to explicitly configure the HTTP header names put into the response metadata:
# lists the (HTTP) protocol response headers put into
# response metadata as pairs <key, value(s)>
# Header names are lowercased. Metadata persisted under the
# same name is overwritten by the response metadata.
metadata.protocol.response:
- etag
- set-cookie
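The allow-list alternative above could look roughly like the following. This is a minimal sketch, not StormCrawler code: the class `HeaderAllowList`, the method `filter`, and the use of plain `Map<String, String[]>` in place of the project's metadata class are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of StormCrawler.
public class HeaderAllowList {

    /** Keeps only the response headers whose lowercased name is in the allow-list. */
    static Map<String, String[]> filter(
            Map<String, String[]> responseHeaders, Set<String> allowed) {
        Map<String, String[]> kept = new HashMap<>();
        for (Map.Entry<String, String[]> header : responseHeaders.entrySet()) {
            // Header names are lowercased, as described in the config comment.
            String name = header.getKey().toLowerCase(Locale.ROOT);
            if (allowed.contains(name)) {
                kept.put(name, header.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Mirrors the two entries of the config fragment above.
        Set<String> allowed = new HashSet<>(Arrays.asList("etag", "set-cookie"));

        Map<String, String[]> headers = new HashMap<>();
        headers.put("ETag", new String[] {"\"abc\""});
        headers.put("Depth", new String[] {"shallow"}); // not listed, so it is dropped

        Map<String, String[]> kept = filter(headers, allowed);
        System.out.println(kept.keySet()); // prints [etag]
    }
}
```

Note that an allow-list only narrows the set of headers that can collide: a listed name such as `etag` would still overwrite metadata persisted under the same key.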
Probably easier to use the prefix instead, there's already enough config everywhere ;-)
…ccidentally overwritten; fixes apache#776 - prefix "etag" metadata key in HTTP protocol implementations
…ccidentally overwritten; fixes apache#776 - prefix "set-cookie" metadata key when read in HTTP protocol implementations - add note in default configuration about protocol metadata prefixes and metadata.persist and metadata.transfer - implement method in Metadata to insert/update of metadata with prefix
…ccidentally overwritten; fixes apache#776 - WARC module: access protocol metadata using the configured prefix
…ccidentally overwritten; fixes apache#776 - read from metadata using the protocol metadata prefix: * HTTP Content-Type * info whether content payload has been trimmed during fetch
* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
  - prefix "etag" metadata key in HTTP protocol implementations
* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
  - prefix "set-cookie" metadata key when read in HTTP protocol implementations
  - add note in default configuration about protocol metadata prefixes and metadata.persist and metadata.transfer
  - implement method in Metadata to insert/update of metadata with prefix
* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
  - WARC module: access protocol metadata using the configured prefix
* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
  - read from metadata using the protocol metadata prefix:
    * HTTP Content-Type
    * info whether content payload has been trimmed during fetch
* Added getValue(s) methods with prefix
* use getValue methods with prefix
* changed default value for md.prefix + removed cleanup param from putAll method

Co-authored-by: Sebastian Nagel <snagel@apache.org>
@sebastian-nagel just came across such a case
Annoyingly, I had a source metadata already in use and it got overridden :-)
Reminds me of one statement in this discussion: "the maxim that web-scale data will provide an example breaking every single one of your shortcuts."
…#777 - assumed that the HTTP last-modified field is kept separately, see `protocol.md.prefix` and apache#776 - set `last-modified` in metadata - for the initial successful fetch (also after error status) - and if a change of the content signature indicates a modification - clear `last-modified` for permanent failures (status ERROR) - set `fetchInterval` in metadata for initial fetches scheduled by DefaultScheduler
…812) - assumed that the HTTP last-modified field is kept separately, see `protocol.md.prefix` and #776 - set `last-modified` in metadata - for the initial successful fetch (also after error status) - and if a change of the content signature indicates a modification - clear `last-modified` for permanent failures (status ERROR) - set `fetchInterval` in metadata for initial fetches scheduled by DefaultScheduler
- add prefix for protocol metadata, cf. apache/incubator-stormcrawler#776 - add protocol.etag to persisted metadata, triggers If-None-Match HTTP requests - add properties to handle cookies (not active for now) - fix typos
HTTP protocol implementations add all HTTP response header fields to the response metadata. The FetcherBolt merges the response metadata into the metadata that is passed forward in the topology as part of the tuples <url, content, metadata>. Existing key-value(s) pairs (persisted in the status index) are overwritten, and the new values are eventually stored in the status index (if listed in "metadata.persist") or even transferred to outlinks ("metadata.transfer").
This makes it easy to reuse response header values in later requests, e.g. cookies or "ETag".
However, webadmins are free to send any header back. This may cause unwanted collisions with metadata keys used by crawler-internal classes, e.g. if the server responds with non-standard headers such as "HostName" or "Depth", or even with standard headers such as "Last-Modified", whose values must follow a specific format for internal use.
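The collision can be reproduced with plain maps. The sketch below is only an illustration of the overwrite behaviour described above: `MetadataCollision` and `merge` are hypothetical names, `Map<String, String[]>` stands in for the crawler's metadata class, and the key `depth` with value `3` is an assumed example of an internal field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not the FetcherBolt implementation.
public class MetadataCollision {

    /** Unprefixed merge: response headers replace existing keys of the same name. */
    static Map<String, String[]> merge(
            Map<String, String[]> existing, Map<String, String[]> responseHeaders) {
        Map<String, String[]> merged = new HashMap<>(existing);
        merged.putAll(responseHeaders); // same-named keys are overwritten here
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String[]> existing = new HashMap<>();
        existing.put("depth", new String[] {"3"}); // assumed internal bookkeeping value

        Map<String, String[]> headers = new HashMap<>();
        headers.put("depth", new String[] {"shallow"}); // arbitrary header sent by a server
        headers.put("etag", new String[] {"\"abc\""});

        Map<String, String[]> merged = merge(existing, headers);
        System.out.println(merged.get("depth")[0]); // prints shallow: the internal value is lost
    }
}
```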
To avoid collisions, why not prefix protocol metadata, e.g. `protocol.content-type` or `http.content-type`? This would also make it clear which component sets the metadata, similar to the prefixes `fetch.` and `parse.` already in use. The drawback would be that users are required to update their configuration.
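A prefixed merge could be sketched as follows. This is an assumption-laden illustration rather than the project's implementation: `PrefixedMerge` and `mergeWithPrefix` are hypothetical names, plain maps replace the metadata class, and the prefix `protocol.` is taken from the first of the two candidates named in the proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the prefix idea, not StormCrawler code.
public class PrefixedMerge {

    /** Copies response headers into the metadata, prepending the prefix to every key. */
    static Map<String, String[]> mergeWithPrefix(
            Map<String, String[]> existing,
            Map<String, String[]> responseHeaders,
            String prefix) {
        Map<String, String[]> merged = new HashMap<>(existing);
        for (Map.Entry<String, String[]> header : responseHeaders.entrySet()) {
            // "depth" becomes "protocol.depth" and no longer clashes with "depth".
            merged.put(prefix + header.getKey(), header.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String[]> existing = new HashMap<>();
        existing.put("depth", new String[] {"3"}); // assumed internal bookkeeping value

        Map<String, String[]> headers = new HashMap<>();
        headers.put("depth", new String[] {"shallow"});
        headers.put("content-type", new String[] {"text/html"});

        Map<String, String[]> merged = mergeWithPrefix(existing, headers, "protocol.");
        System.out.println(merged.get("depth")[0]);          // prints 3
        System.out.println(merged.get("protocol.depth")[0]); // prints shallow
    }
}
```

With this scheme, any configuration that refers to header-derived keys would need the prefixed name, which is the migration cost mentioned above.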