What's the philosophy of HtmlUnit when a response contains a header "Content-Type: application/octet-stream" #611

Closed
qurikuduo opened this issue Jun 30, 2023 · 9 comments

Comments

@qurikuduo

Hi there,
Some URLs return a response with the header "Content-Type: application/octet-stream". Should I process such a response as an attachment?
After some digging, I found that the AttachmentHandler only handles responses that are marked as attachments as defined in RFC 2183.
The call
attachmentHandler_.isAttachment(webResponse)
returns false when the content type is "application/octet-stream".
I found that org.htmlunit.HttpWebConnection.downloadContent() is then called:
public static DownloadedContent downloadContent(final InputStream is, final int maxInMemory)
It downloads the response content.
If I DON'T want HtmlUnit to download large content (e.g. https://dg.10000gd.tech:12348/shmfile/100), what should I do?
I want to block the download if a resource is larger than 20 MB, to save bandwidth.

Thanks a lot.

@rbri
Member

rbri commented Jun 30, 2023

Maybe a simple solution is to set up your own WebConnectionWrapper and intercept the request URLs. For the large ones, don't call super and simply return a static response.

See https://www.htmlunit.org/faq.html#HowToModifyRequestOrResponse as a starting point.

@rbri
Member

rbri commented Jun 30, 2023

Will try to write a more detailed description ....

@qurikuduo
Author

qurikuduo commented Jun 30, 2023

Sounds like an option.

  1. Specify my own WebConnectionWrapper.
  2. Try to get the content length defined in the response headers.
  3. If the content length is not defined, write my own HttpWebConnection that implements the WebConnection interface, and decide in public static DownloadedContent downloadContent() whether the response body is too large and should be blocked, while reading it:
    while ((readCount = inputStream.read(buffer)) != -1) { //... }
    Is that a solution?
    Thanks.
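
The read loop in step 3 could look roughly like the sketch below. It is plain Java and does not touch HtmlUnit: BoundedReader and readAtMost are hypothetical names chosen for illustration, and the 8 KB buffer size is an arbitrary assumption. The point is only that the byte count is checked while reading, so an oversized body is given up on early instead of being downloaded completely.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;

/** Hypothetical helper (not part of HtmlUnit): reads a stream but gives up past a byte limit. */
final class BoundedReader {

    private BoundedReader() {
    }

    /**
     * Reads the whole stream unless it holds more than maxBytes bytes.
     *
     * @return the body, or an empty Optional once the limit is exceeded
     */
    static Optional<byte[]> readAtMost(final InputStream in, final long maxBytes) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buffer = new byte[8192]; // buffer size is an arbitrary choice
        long total = 0;
        int readCount;
        // InputStream.read(byte[]) signals end of stream with -1, not 0
        while ((readCount = in.read(buffer)) != -1) {
            total += readCount;
            if (total > maxBytes) {
                // too large: stop reading; the caller would abort the request here
                return Optional.empty();
            }
            out.write(buffer, 0, readCount);
        }
        return Optional.of(out.toByteArray());
    }
}
```

A caller would treat an empty result as "blocked" and abort the underlying request, which is where the bandwidth saving comes from.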

@qurikuduo
Author

qurikuduo commented Jul 4, 2023

After trying a few small tricks, I achieved the functionality I wanted.
Here is what I did:

  1. Specify my own web connection, copied from HttpWebConnection and put into the package org.htmlunit: public class MyxxHttpWebConnection extends HttpWebConnection.
    Override public WebResponse getResponse(final WebRequest webRequest),
    read the content length via httpResponse.getFirstHeader(ContentLength).getValue(),
    and check whether it is too large:
    if (contentLengthLong > maxContentLength) {
        System.out.println("Content is too big. url=" + webRequest.getUrl()
                + " contentLength = " + contentLengthLong + ", maxContentLength = " + maxContentLength);
        httpMethod.abort();
        httpResponse.setEntity(null);
    }

  2. Specify my own AttachmentHandler: public class MyxxAttachmentHandler implements AttachmentHandler, with this override:
    @Override
    public void handleAttachment(final Page page) {
        // do not download attachments larger than maxAttachmentSize (100 KB)
        if (page.getWebResponse().getContentLength() > maxAttachmentSize) {
            System.out.println("Attachment is too big. url=" + page.getUrl()
                    + " contentLength = " + page.getWebResponse().getContentLength()
                    + ", maxAttachmentSize = " + maxAttachmentSize);
            try {
                page.getEnclosingWindow().getWebClient().getWebConnection().close();
            } catch (Exception e) {
                logger.error("Error when closing the attachment download.", e);
            }
            // always clean up and replace the page with an empty one, even if close() failed
            try {
                page.getWebResponse().cleanUp();
                page.getEnclosingWindow().setEnclosedPage(new HtmlPage(
                        createWebResponse(
                                new WebRequest(page.getUrl(), page.getWebResponse().getWebRequest().getHttpMethod()),
                                "",
                                page.getWebResponse().getContentType(),
                                page.getWebResponse().getStatusCode(),
                                page.getWebResponse().getStatusMessage()),
                        page.getEnclosingWindow()));
            } catch (Exception e) {
                logger.error("Error when replacing the attachment page.", e);
            }
        } else {
            // small enough: collect the attachment
            collectedAttachments_.add(new Attachment(page));
        }
    }

  3. Create the new instances before calling getPage(url):

webClient.setAttachmentHandler(new MyxxAttachmentHandler(attachmentList));
// create the connection once instead of once per request
final MyxxHttpWebConnection webConnection = new MyxxHttpWebConnection(webClient);
// the wrapper installs itself as the connection of the webClient
new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(final WebRequest request) throws IOException {
        return webConnection.getResponse(request);
    }
};
page = webClient.getPage(url);
// go through the collected attachments
for (final Attachment attachment : attachmentList) {
    final long contentLength = attachment.getPage().getWebResponse().getContentLength();
    if (contentLength == 0 || contentLength > MyxxAttachmentHandler.maxAttachmentSize) {
        System.out.println("Attachment empty or too large, will not save to disk. contentLength = " + contentLength);
        continue;
    }
    // save attachment to file.
}

It works for me now.
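
One detail in step 1 deserves care: reading the Content-Length header with getFirstHeader(...).getValue() fails when the header is missing, because Apache HttpClient's getFirstHeader returns null in that case, and the value may also be malformed. A minimal sketch of a defensive check in plain Java follows. ContentLengthCheck, parseContentLength and isTooLarge are hypothetical names for illustration and not part of HtmlUnit; the header value is passed in as a plain string (null when absent) to keep the sketch independent of any HTTP library.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not part of HtmlUnit) for evaluating a Content-Length header value. */
final class ContentLengthCheck {

    private ContentLengthCheck() {
    }

    /** Returns the declared length, or empty if the header is missing, malformed or negative. */
    static OptionalLong parseContentLength(final String headerValue) {
        if (headerValue == null) {
            return OptionalLong.empty();
        }
        try {
            final long value = Long.parseLong(headerValue.trim());
            return value >= 0 ? OptionalLong.of(value) : OptionalLong.empty();
        } catch (final NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** True only if a valid length is declared and it exceeds the limit. */
    static boolean isTooLarge(final String headerValue, final long maxContentLength) {
        final OptionalLong length = parseContentLength(headerValue);
        return length.isPresent() && length.getAsLong() > maxContentLength;
    }
}
```

With this design an unknown length is not treated as too large, so a response without a usable Content-Length header would still need a size check while the body is being read.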

@rbri
Member

rbri commented Jul 8, 2023

Hi @qurikuduo,

Slowly I am getting an idea of what you would like to do.
I made some small changes, and now I can do something like this:

@Test
public void contentBlocking() throws Exception {
    final byte[] content = new byte[] {77, 44};
    final List<NameValuePair> headers = new ArrayList<>();
    headers.add(new NameValuePair("Content-Encoding", "gzip"));
    headers.add(new NameValuePair(HttpHeader.CONTENT_LENGTH, String.valueOf(content.length)));

    final MockWebConnection conn = getMockWebConnection();
    conn.setResponse(URL_FIRST, content, 200, "OK", MimeType.APPLICATION_JSON, headers);

    startWebServer(getMockWebConnection());

    final WebClient client = getWebClient();
    client.setWebConnection(new HttpWebConnection(client) {
        @Override
        protected WebResponse downloadResponse(final HttpUriRequest httpMethod,
                final WebRequest webRequest, final HttpResponse httpResponse,
                final long startTime) {

            // check the header here if you like
            // call return super.downloadResponse() in case you are happy with the headers

            httpMethod.abort();

            // create empty response and mark as blocked for later
            final DownloadedContent downloaded = new DownloadedContent.InMemory(null);
            final long endTime = System.currentTimeMillis();
            final WebResponse response = makeWebResponse(httpResponse, webRequest, downloaded, endTime - startTime);
            response.markAsBlocked("test blocking");
            return response;
        }
    });

    final UnexpectedPage page = client.getPage(URL_FIRST);
    assertTrue(page.getWebResponse().wasBlocked());
    assertEquals("test blocking", page.getWebResponse().getBlockReason());
}

Will this help to simplify your code? Do you need any other changes for your case?

@rbri
Member

rbri commented Jul 9, 2023

@qurikuduo I just made a new snapshot build, please try it:

3.4.0-SNAPSHOT

@rbri
Member

rbri commented Jul 9, 2023

I have updated the documentation a bit: https://www.htmlunit.org/details.html
Hope that helps.

@rbri
Member

rbri commented Jul 14, 2023

Will close this; I hope the changes and the documentation are sufficient.

@rbri closed this as completed Jul 14, 2023
@qurikuduo
Author

Thank you very much.
