Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten #776

Closed
sebastian-nagel opened this issue Dec 13, 2019 · 5 comments · Fixed by #789

Comments

@sebastian-nagel
Copy link
Contributor

HTTP protocol implementations add all HTTP response header fields to the response metadata. The FetcherBolt merges the response metadata into the metadata which is passed forward in the topology as part of the tuples <url, content, metadata>. Existing key-value(s) pairs (persisted in the status index) are overwritten and later eventually stored in the status index (if listed in "metadata.persist") or even transferred to outlinks ("metadata.transfer").

This allows to easily use the response header values for requests, e.g. cookies or "ETag".

However, webadmins are free to send any header back. This may cause unwanted collisions with metadata keys used by crawler-internal classes. E.g. if the server responds with non-standard headers "HostName" or "Depth". Or even standard headers such as "Last-modified" which require to follow a specific format for internal use.

To avoid collisions: Why not prefix protocol metadata: protocol.content-type or http.content-type? This would also make it clear which component sets the metadata - similar to the prefixes fetch. and parse. already used. The draw-back would be that users are required to update the configuration.

@jnioche
Copy link
Contributor

jnioche commented Dec 16, 2019

Good summary of the existing behaviour and interesting point, thanks. Not sure how frequent this issue would arise but certainly worth considering, although as you pointed out there would be a cost in updating the configs and code.

@sebastian-nagel
Copy link
Contributor Author

Not sure how frequent this issue would arise

Not very often for non-standard headers, but it happens. I've counted HTTP header names for a random selection of 250 WARC files and got about 14,000 unique lowercased header names, among them:

HTTP/1.1 200 OK
Server: nginx
...
ip: 107.23.218.4

If URLs are partitioned "byIP", the ip header field would overwrite the persisted IP eventually causing duplicates in the status index, same URL in various partitions.

So, it's probably more a matter of time and size of the crawl until a collision happens. It's also a minor security issue. The prefix would then also mark all prefixed metadata names as unsafe, same as for parse.keywords - you cannot trust the value unless you crawl your own content.

a cost in updating the configs and code.

For the code: At a first glance, there aren't so many changes: the protocol implementations and few more places.
For the user config: A configurable prefix if empty would allow to delay any changes. That's important if you want to keep your status index and cannot reindex immediately.

An alternative solution would be to force users to explicitly configure the HTTP header names put into the response metadata:

  # lists the (HTTP) protocol response headers put into
  # response metadata as pairs <key, value(s)>
  # Header names are lowercased. Metadata persisted under the
  # same name is overwritten by the response metadata.
  metadata.protocol.response:
   - etag
   - set-cookie

@jnioche
Copy link
Contributor

jnioche commented Dec 21, 2019

hi @sebastian-nagel

An alternative solution would be to force users to explicitly configure the HTTP header names put into the response metadata:

Probably easier to use the prefix instead, there's already enough config everywhere ;-)

jnioche added a commit that referenced this issue Feb 20, 2020
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Feb 21, 2020
…ccidentally overwritten; fixes apache#776

- prefix "etag" metadata key in HTTP protocol implementations
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Feb 26, 2020
…ccidentally overwritten; fixes apache#776

- prefix "set-cookie" metadata key when read in HTTP protocol implementations
- add note in default configuration about protocol metadata prefixes and
  metadata.persist and metadata.transfer
- implement method in Metadata to insert/update of metadata with prefix
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Feb 26, 2020
…ccidentally overwritten; fixes apache#776

- WARC module: access protocol metadata using the configured prefix
sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Feb 26, 2020
…ccidentally overwritten; fixes apache#776

- read from metadata using the protocol metadata prefix:
  * HTTP Content-Type
  * info whether content payload has been trimmed during fetch
jnioche added a commit that referenced this issue Mar 3, 2020
* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776

* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
- prefix "etag" metadata key in HTTP protocol implementations

* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
- prefix "set-cookie" metadata key when read in HTTP protocol implementations
- add note in default configuration about protocol metadata prefixes and
  metadata.persist and metadata.transfer
- implement method in Metadata to insert/update of metadata with prefix

* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
- WARC module: access protocol metadata using the configured prefix

* Prefix protocol metadata to avoid that internal metadata fields are accidentally overwritten; fixes #776
- read from metadata using the protocol metadata prefix:
  * HTTP Content-Type
  * info whether content payload has been trimmed during fetch

* Added getValue(s) methods with prefix

* use getValue methods with prefix

* changed default value for md.prefix + removed cleanup param from putAll method

Co-authored-by: Sebastian Nagel <snagel@apache.org>
@jnioche
Copy link
Contributor

jnioche commented May 5, 2020

@sebastian-nagel just came across such a case

curl -i "https://cdn.prod.www.spiegel.de/images/d1f3784e-24cd-41fa-b6be-f85a19f9cea1_w920_r1.77_fpx45.33_fpy50.jpg"
HTTP/2 200
date: Tue, 05 May 2020 12:40:42 GMT
content-type: image/jpeg
content-length: 94287
cache-control: public, max-age=604800, s-maxage=604800
etag: "0c3fe0e90ffa24814db2517fac84b6a1"
expires: Tue, 12 May 2020 06:45:30 GMT
last-modified: Tue, 05 May 2020 06:44:53 GMT
server: Footprint Distributor V6.1.1162
alt-svc: clear
x-cache: MISS
x-cache-grace: 300.000
x-ttl: 1800.000
source: default
age: 21345
accept-ranges: byte

annoyingly I had a source metadata already in use and it got overridden :-)

@sebastian-nagel
Copy link
Contributor Author

Resembles me about one statement in this discussion: "the maxim that web-scale data will provide an example breaking every single one of your shortcuts."

sebastian-nagel added a commit to sebastian-nagel/storm-crawler that referenced this issue Jul 7, 2020
…#777

- assumed that the HTTP last-modified field is kept separately,
  see `protocol.md.prefix` and apache#776
- set `last-modified` in metadata
  - for the initial successful fetch (also after error status)
  - and if a change of the content signature indicates a modification
- clear `last-modified` for permanent failures (status ERROR)
- set `fetchInterval` in metadata for initial fetches scheduled
  by DefaultScheduler
jnioche pushed a commit that referenced this issue Jul 8, 2020
…812)

- assumed that the HTTP last-modified field is kept separately,
  see `protocol.md.prefix` and #776
- set `last-modified` in metadata
  - for the initial successful fetch (also after error status)
  - and if a change of the content signature indicates a modification
- clear `last-modified` for permanent failures (status ERROR)
- set `fetchInterval` in metadata for initial fetches scheduled
  by DefaultScheduler
sebastian-nagel added a commit to commoncrawl/news-crawl that referenced this issue Jul 29, 2020
- add prefix for protocol metadata, cf. apache/incubator-stormcrawler#776
- add protocol.etag to persisted metadata, triggers If-None-Match HTTP requests
- add properties to handle cookies (not active for now)
- fix typos
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants