How to stop puppeteer follow redirects #1132

Closed
ali-habibzadeh opened this issue Oct 23, 2017 · 16 comments

Comments

@ali-habibzadeh

Currently, Puppeteer's default behaviour seems to be to follow redirects and return the DOM at the end of the chain.
How can this be changed so that .goto stops after the first redirect and simply returns the HTML of that first response, for example a 301 page?

@GuilloOme

Same problem here…
I agree, page.goto() should return the 30x response in the promise.
It could be a parameter passed when calling it, like followRedirect: false.

@ali-habibzadeh
Author

So I have a solution now, but I agree with @GuilloOme that followRedirect: false should be a .goto option.

let docRedirected = false;

const onResponseHandler = (response) => {
    docRedirected = redirects.isRedirect(response);
};
const onRequestHandler = (request) => {
    if (docRedirected) { return request.abort(); } //Aborts all subsequent requests
    reqRejection.shouldReject(request, options) ? request.abort() : request.continue();
};
page.on('request', onRequestHandler);
page.on('response', onResponseHandler);

Then, in the redirects module, I have:

const redirectStatuses = [301, 302, 303, 307, 308];

exports.isRedirect = (response) => {
    return redirectStatuses.includes(response.status)
        && response.request().resourceType === "document";
};

@GuilloOme

Thanks a lot @ali-habibzadeh for this workaround!
Still, it would be better to have this docRedirected state in puppeteer.

@ali-habibzadeh
Author

Definitely agree. Using let with a side effect, as I have done, is not ideal code. followRedirect: false as an option for .goto would be much better.

@GuilloOme

@ali-habibzadeh, your solution is good enough if you are sure there will not be any concurrent requests. In my context, multiple concurrent requests are possible; so, to avoid blocking the wrong request, I store the redirect response and compare its "location" URL with the URL of the intercepted request.

Here is my workaround:

let lastRedirectResponse = undefined;
page.setRequestInterceptionEnabled(true);

page.on('response', response => {
    // if this response is a redirect
    if ([301, 302, 303, 307, 308].includes(response.status) 
            && response.request().resourceType === 'document') {
        lastRedirectResponse = response;
    }
});

page.on('request', interceptedRequest => {
    // if this request is the one related to the lastRedirect
    if (lastRedirectResponse 
            && lastRedirectResponse.headers.location === interceptedRequest.url) {
        interceptedRequest.abort();
    }
});

@aslushnikov
Contributor

@ali-habibzadeh @GuilloOme I'm curious: why would you need to load without redirects?

@ali-habibzadeh
Author

@GuilloOme Would concurrency of requests be a concern even if Chromium is running with --process-per-tab? Perhaps your solution is more applicable to --process-per-site?

@aslushnikov My rationale is mainly based on two things. Firstly, if you're building a technical tool that reports on redirect chains, it suits the context better to make a clean request, get the response and report on it.
Secondly, most libraries handle requests like this: node and cURL, for example, both require the user to enable the following of redirects. That makes sense in terms of user experience too. If you can enable it, you have both options; if it follows by default, then turning it off requires a fair bit of implementation.

@GuilloOme

@ali-habibzadeh, my concern is more about pages where multiple resources are requested at the "same" time: if one or more of them returns a redirect, I will block the wrong one… You make me think that I should store all the redirect responses (not only the last one) and check against them all.
Using --process-per-tab/site will not prevent the page from requesting multiple resources at the same time, since these calls are async.

@aslushnikov, to add some context to @ali-habibzadeh's point: I use puppeteer with chromium to crawl pages. For a given starting URL, I need to get all the "outbound" URLs without navigating to them. A redirect is a type of "outbound" URL since it goes "away" from my starting URL. Then I decide, based on multiple factors, whether to go through with it or not. In this case, it is very important to have complete control over any navigation (on a side note, that is also why I need a workaround for #823).
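
For readers following along, the "store all the redirect responses and check against them all" idea could look roughly like the sketch below. It is not code from this thread: RedirectTracker is a hypothetical name, and it assumes the method-style Puppeteer API (response.status(), response.headers(), request.url()) rather than the property-style API used in the snippets above. Relative Location headers are resolved against the responding URL before comparison.

```javascript
// Sketch: remember every document redirect target, not just the last one.
// RedirectTracker is a hypothetical helper, not part of Puppeteer.
const REDIRECT_STATUSES = new Set([301, 302, 303, 307, 308]);

class RedirectTracker {
  constructor() {
    this.targets = new Set(); // absolute URLs named by Location headers
  }

  // Intended to be called from page.on('response', ...).
  recordResponse(response) {
    if (!REDIRECT_STATUSES.has(response.status())) return;
    if (response.request().resourceType() !== 'document') return;
    const location = response.headers()['location'];
    if (!location) return;
    // Location may be relative, so resolve it against the responding URL.
    this.targets.add(new URL(location, response.url()).href);
  }

  // Intended to be called from page.on('request', ...).
  // Returns true when the request URL matches a recorded redirect target.
  isRedirectTarget(request) {
    return this.targets.has(new URL(request.url()).href);
  }
}

// Possible wiring (assumes request interception is already enabled):
//   const tracker = new RedirectTracker();
//   page.on('response', (res) => tracker.recordResponse(res));
//   page.on('request', (req) =>
//     tracker.isRedirectTarget(req) ? req.abort() : req.continue());
```

Matching on URL alone can still block an unrelated request that happens to target the same address, so treat this as a starting point.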

@ali-habibzadeh
Author

ali-habibzadeh commented Nov 12, 2017

We potentially need to bring meta refresh redirects under the umbrella of this. Unfortunately, page scripts cannot stop those redirects. Also, there isn't a supported way to modify the response text before the page is rendered (e.g. to modify/remove the meta refresh tags).

Does anyone have a solution for stopping the meta redirect too?

I found this page: https://bugs.chromium.org/p/chromium/issues/detail?id=63107
However, it contains lots of contradictory conversations.

@GuilloOme

@ali-habibzadeh, I tried, and you can prevent the meta redirect with the extension I wrote as a workaround for the "navigation away" problem (here: #823 (comment)). If you have questions about the extension, you can comment on the related gist.

aslushnikov added a commit that referenced this issue Jun 1, 2018
This patch introduces `Request.isNavigationRequest()` method.

Fixes #2627, #1132.
@aslushnikov
Contributor

Coupling request interception with request.isNavigationRequest and request.redirectChain, you can now abort navigation redirects:

await page.setRequestInterception(true);
page.on('request', request => {
  if (request.isNavigationRequest() && request.redirectChain().length)
    request.abort();
  else
    request.continue();
});
await page.goto('https://example.com');

@batt842

batt842 commented Jun 5, 2018

@aslushnikov
Cool! I've been waiting for this feature since last year!

@rotimi-best

@aslushnikov Thank you! Life saver )

@adgonzalez-ml

adgonzalez-ml commented Dec 26, 2018

@aslushnikov

Commenting here to avoid creating a new issue.

I have a problem with this: I'm trying to take screenshots of pages before they redirect. If I use your example I'm unable to do this, as it is treated as a navigation error. Is there no real way to prevent the redirects from taking me to another URL?

In this case they would be 307 (Temporary Redirect) responses.
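
One possible way to live with the navigation error is sketched below: wrap the interception snippet so that the rejected goto() is caught and the redirect details are reported instead. This is an untested-against-Chromium sketch, not an official answer. gotoFirstHop is a hypothetical helper name, `page` is assumed to be a Puppeteer Page with the method-style API, and it assumes the 'response' event for the redirect fires before the redirected request is intercepted. It does not address whether a usable screenshot can be taken after the abort.

```javascript
// Sketch only: gotoFirstHop is a hypothetical helper, not part of Puppeteer.
const REDIRECT_CODES = [301, 302, 303, 307, 308];

async function gotoFirstHop(page, url) {
  let redirect = null; // details of the first navigation redirect, if any

  const onResponse = (response) => {
    if (redirect === null
        && response.request().isNavigationRequest()
        && REDIRECT_CODES.includes(response.status())) {
      redirect = {
        status: response.status(),
        from: response.url(),
        location: response.headers()['location'],
      };
    }
  };
  const onRequest = (request) => {
    // Same rule as the interception snippet earlier in the thread:
    // abort navigation requests that are the result of a redirect.
    if (request.isNavigationRequest() && request.redirectChain().length) request.abort();
    else request.continue();
  };

  await page.setRequestInterception(true);
  page.on('response', onResponse);
  page.on('request', onRequest);
  try {
    const response = await page.goto(url);
    return { response, redirect };
  } catch (err) {
    // The aborted redirect makes goto() reject; anything else is rethrown.
    if (redirect === null) throw err;
    return { response: null, redirect };
  } finally {
    // Remove the handlers and turn interception off so later requests do not stall.
    page.off('response', onResponse);
    page.off('request', onRequest);
    await page.setRequestInterception(false);
  }
}
```

A caller would then check `redirect` on the returned object: if it is non-null, the navigation stopped at a 3xx response and `redirect.location` holds the raw Location header.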

@chan-dev

@aslushnikov if all I want is to know whether the URL redirects, is this code enough?

  const response = await page.goto(uri, {
    waitUntil: 'networkidle2',
  });

  // is a redirect? is this condition enough to know if it's a redirect?
  if (response.status === 301) {
  }
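
A note on the question above: because goto() follows redirects by default, the response it resolves with is normally the final one, so checking it for 301 is unlikely to be enough. A hedged alternative is to inspect request.redirectChain(), mentioned earlier in the thread, on the request behind that response. The sketch below assumes the method-style API (response.request(), request.url(), request.response()); describeRedirects is a hypothetical helper name.

```javascript
// Sketch: list the redirect hops that led to the response goto() resolved with.
// describeRedirects is a hypothetical helper, not part of Puppeteer.
function describeRedirects(response) {
  return response.request().redirectChain().map((request) => {
    const hop = request.response(); // the redirect response for this hop, if available
    return { url: request.url(), status: hop ? hop.status() : null };
  });
}

// Possible usage (goto() can resolve with null, so guard before calling):
//   const response = await page.goto(uri, { waitUntil: 'networkidle2' });
//   const hops = response ? describeRedirects(response) : [];
//   if (hops.length > 0) { /* the URL redirected at least once */ }
```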

@heaven

heaven commented Dec 20, 2021

Any way I can call request.abort(); and make it raise net::ERR_TOO_MANY_REDIRECTS?

8 participants