This repository has been archived by the owner on Nov 7, 2018. It is now read-only.

sec.gov replies with 404 error to HEAD requests but with 200 to GET requests #67

Open
R-Adrian opened this issue Mar 4, 2017 · 1 comment


R-Adrian commented Mar 4, 2017

here's a problem with accessing SEC filings:

the sec.gov EDGAR archive replies with an HTTP/1.1 404 Not Found to a HEAD request but with HTTP/1.1 200 OK to a GET request for the same URL... this breaks code that relies on HEAD to determine when a file has changed (this usually applies to the .xml index files).

this means that apps have to re-download an entire file just to check whether it has changed... that is not only suboptimal for the client, it will also cause a HUGE increase in traffic to the site, because applications will fall back to re-downloading entire files instead of just reading timestamps from the response headers.

(or maybe this was their exact intention? to boost their traffic volume numbers artificially?)

for example:

* Connected to www.sec.gov (23.63.182.226) port 443
> GET /Archives/edgar/full-index/2017/QTR1/sitemap.quarterlyindex6.xml HTTP/1.1
> Host: www.sec.gov
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en-GB,en;q=0.5
> Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
> Accept-Encoding: gzip, deflate
> Connection: keep-alive
> Cache-Control: max-age=0
> Keep-Alive: 115

< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Encoding: gzip
< Content-Type: text/xml
< ETag: "047699c126f8436b29896706cdcd6274"
< Last-Modified: Sat, 04 Mar 2017 07:47:00 GMT
< Server: AmazonS3
< Vary: Accept-Encoding
.... (and so on)...

but if i use HEAD instead, i'm getting:

* Connected to www.sec.gov (23.63.182.226) port 443
> HEAD /Archives/edgar/full-index/2017/QTR1/sitemap.quarterlyindex6.xml HTTP/1.1
> Host: www.sec.gov
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en-GB,en;q=0.5
> Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
> Accept-Encoding: gzip, deflate
> Connection: keep-alive
> Cache-Control: max-age=0
> Keep-Alive: 115

< HTTP/1.1 404 Not Found
< Accept-Ranges: bytes
< Content-Encoding: gzip
< Content-Length: 6214
< Content-Type: text/html
< Server: Apache
< Vary: Accept-Encoding
...(and so on)....
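
Since the GET response above carries `ETag` and `Last-Modified`, one possible way to check for changes without HEAD is a conditional GET. The following is only a sketch: whether sec.gov actually honors `If-None-Match` is an assumption and has not been verified here. `hasChanged` and `newFakeServer` are hypothetical names, and the local test server merely imitates the observed behaviour (404 on HEAD, normal GET) so that the example runs offline.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// hasChanged sends a GET with If-None-Match set to the ETag of the cached
// copy. A 304 Not Modified reply has no body, so nothing is re-downloaded.
func hasChanged(client *http.Client, url, etag string) (bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("If-None-Match", etag)
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusNotModified:
		return false, nil // cached copy is still current
	case http.StatusOK:
		return true, nil // server sent a (possibly new) full copy
	default:
		return false, fmt.Errorf("unexpected status %s", resp.Status)
	}
}

// newFakeServer is a hypothetical local stand-in for the real site:
// it answers HEAD with 404 and serves GET with a fixed ETag.
func newFakeServer(etag string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodHead {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("ETag", etag)
		// ServeContent evaluates If-None-Match against the ETag set above.
		http.ServeContent(w, r, "index.xml", time.Time{}, strings.NewReader("<index/>"))
	}))
}

func main() {
	const etag = `"047699c126f8436b29896706cdcd6274"`
	srv := newFakeServer(etag)
	defer srv.Close()

	changed, err := hasChanged(srv.Client(), srv.URL, etag)
	fmt.Println(changed, err) // same ETag as the server
	changed, err = hasChanged(srv.Client(), srv.URL, `"some-older-etag"`)
	fmt.Println(changed, err) // stale ETag
}
```

Against the local stand-in this should report no change for the matching ETag and a change for the stale one. If the real server ignores the conditional header, the function still works but every call downloads the full file.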

joebarbere commented Nov 11, 2017

@Aditza2015 Thanks for posting this issue! I tried sending a HEAD request today and got the same 404 response.

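As a workaround, I'm using a GET request with a Range header to reduce the response size. You can still get all the needed headers this way. One caveat: if the server honors the range and replies with 206 Partial Content, Content-Length describes only the returned slice, and the full file size is given after the slash in Content-Range.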

Sample Go Code

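package main

import (
	"fmt"
	"net/http"
)

func main() {
	url := "https://www.sec.gov/Archives/edgar/full-index/2017/QTR1/company.zip"
	resp1, err := http.Head(url) // comes back as 404 on this server
	if err != nil {
		panic(err)
	}
	defer resp1.Body.Close()
	fmt.Println("HEAD response:", resp1.Status)
	req, _ := http.NewRequest("GET", url, nil) // error ignored: method and URL are fixed and valid
	req.Header.Set("Range", "bytes=0-99")      // a Range value needs the "bytes=" unit; this asks for the first 100 bytes
	resp2, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp2.Body.Close()
	fmt.Println("GET (Range: bytes=0-99) response:", resp2.Status)
	fmt.Println("Content-Length:", resp2.Header.Get("Content-Length"))
	fmt.Println("Last-Modified:", resp2.Header.Get("Last-Modified"))
}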
