Get all navigation redirect urls #2163

Closed
leem32 opened this issue Mar 8, 2018 · 4 comments

leem32 commented Mar 8, 2018

I'm trying to use the new request.frame() method to log all navigation/domain redirects, but it only seems to log JS redirects. Besides JS redirects, I also need meta refresh and PHP (server-side) redirects.
Another issue with request.frame() is that it also logs other, non-navigation redirects that I don't want, such as doubleclick.net and image server links.

How can I use request.frame to log all navigation/domain redirects?

request.frame()

If using request.frame() will not work to log all navigation/domain redirects (JS, Meta, PHP), how else can I achieve this with Puppeteer?

Note: Ideally I'd also get the status code of each navigation redirect, e.g. 200, 301, 404, but if this isn't possible I can just curl each URL instead.

Thoughts: In the Network tab of Chrome DevTools, if I select Preserve log and then load a URL that redirects a few times, DevTools picks up all the client and server redirects along with their status codes. Could I access this info somehow, maybe from the request headers? I only need the navigation/domain redirects and their status codes, so I would also need a way to differentiate navigation redirects from all other requests.

  • Puppeteer version:
    v1.1.1

Code example:

const puppeteer = require('puppeteer');
var url = process.argv[2];
(async () => {

    const browser = await puppeteer.launch({headless: true, timeout: 30000, ignoreHTTPSErrors: true});
    const page = await browser.newPage();

    // note: add trailing slash since chrome adds it
    if (!url.endsWith('/'))
        url = url + '/';

    // urls hold redirect chain
    const urls = [];

    try {

        // Get all navigation redirects
        page.on('request', request => {
            const frame = request.frame();
            if (frame.url() !== urls[urls.length - 1] && frame.url() !== "about:blank") {
                urls.push(frame.url());
            }
        });

        await page.goto(url, {timeout: 30000, waitUntil: 'load'}); //default load

        var lastUrl = urls[urls.length - 1]; // get last redirected url
        var fileName = lastUrl.replace(/http.*\/\//g, "").replace("www.", "").split("/")[0];
        var filePath = "screenshot/" + fileName + '.jpg';
        await page.screenshot({path: filePath, type: 'jpeg', quality: 80, fullPage: false});

        console.log(filePath, '|', urls);
        await page.close();
        await browser.close();

    } catch (err) {

        console.log("caught an exception", err);
        if (urls.length > 1) {
            console.log('|', urls);
        } else {
            console.log('|', url);
        }

        await page.close();
        await browser.close();
    }

})();
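
A side note on the trailing-slash step above: appending '/' unconditionally also alters URLs that already carry a file name or query string. A minimal sketch of an alternative, assuming Node's built-in WHATWG URL class is acceptable here (normalizeUrl is a hypothetical helper name, not part of Puppeteer):

```javascript
// Hypothetical helper: normalise a URL the way the browser would,
// so it can be compared against the URLs Puppeteer reports.
function normalizeUrl(input) {
  // The WHATWG URL parser adds the root path "/" to a bare origin
  // and leaves existing paths and query strings untouched.
  return new URL(input).href;
}

console.log(normalizeUrl('http://example.com'));               // http://example.com/
console.log(normalizeUrl('http://example.com/page.html?x=1')); // http://example.com/page.html?x=1
```

This throws a TypeError for strings that are not absolute URLs, which may be preferable to passing them on to page.goto().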

Thanks :)

aslushnikov commented Apr 10, 2018

Since Puppeteer 1.2.0, you can get the redirect chain for every request; see request.redirectChain().

For the main navigation, you're interested in the redirect chain for the main resource:

const response = await page.goto('http://example.com');
const chain = response.request().redirectChain();
console.log(chain.length); // 1
console.log(chain[0].url()); // 'http://example.com'
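
To also get the status code of each hop, as asked in the original question, each request in the chain can be paired with its response. This is a minimal sketch, assuming Puppeteer 1.2.0 or later; describeRedirects and logRedirects are hypothetical helper names. Note that the chain only covers server-side (HTTP 3xx) redirects of the main resource, not JS or meta refresh navigations.

```javascript
// Hypothetical helper: list every hop of the main navigation as {url, status}.
// `response` is the object that page.goto() resolves to (it may be null).
function describeRedirects(response) {
  if (!response) {
    return [];
  }
  const hops = response.request().redirectChain().map(request => {
    const redirectResponse = request.response(); // null if no response was received
    return {
      url: request.url(),
      status: redirectResponse ? redirectResponse.status() : null,
    };
  });
  // The final response is not part of the chain, so append it.
  hops.push({ url: response.url(), status: response.status() });
  return hops;
}

// Hypothetical usage; needs the puppeteer package and is not invoked automatically.
async function logRedirects(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const response = await page.goto(url);
    console.log(describeRedirects(response));
  } finally {
    await browser.close();
  }
}
```

Calling logRedirects('http://example.com') would print one entry per redirecting request plus one for the final response.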

Hope this helps.

faheel commented Jan 22, 2020

This seems to be the solution to @leem32's problem (found at https://groups.google.com/forum/#!topic/chrome-debugging-protocol/rPSMWfFD2Jo):

const puppeteer = require('puppeteer');

// Assumption: the URL to inspect is passed as the first command-line argument
let url = process.argv[2];

puppeteer.launch().then(async browser => {
    const page = await browser.newPage();
    // note: add trailing slash since chrome adds it
    if (!url.endsWith('/'))
        url = url + '/';

    // urls hold redirect chain
    const urls = [url];

    const client = await page.target().createCDPSession();
    await client.send('Network.enable');
    client.on('Network.requestWillBeSent', (e) => {
        if (e.type !== "Document") {
            return;
        }

        console.log("EVENT INFO: ");
        console.log(e.type);
        console.log(e.documentURL);
        console.log("INITIATOR: " + JSON.stringify(e.initiator, null, 4));

        // check if url redirected
        if (typeof e.redirectResponse != "undefined") {
            // get redirect info
            console.log("REDIRECT STATUS CODE: ");
            console.log(e.redirectResponse.status);

            console.log("REDIRECT REQUEST URL: ");
            console.log(e.request.url);
            urls.push(e.redirectResponse.status, e.request.url);
        } else {
            // url did not redirect
            if (e.request.url !== urls[urls.length - 1]) {
                console.log("NO REDIRECT REQUEST URL: ");
                console.log(e.request.url);
                urls.push(e.request.url);
            }
        }
    });
    await page.goto(url);
    console.log("Final urls array: ");
    console.log(urls);

    await browser.close();
});
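
The urls array above ends up mixing status codes and URLs in one flat list. Below is a minimal sketch of keeping each hop as an object instead, assuming the same Network.requestWillBeSent events. recordHop is a hypothetical helper name; the fields it reads (type, request.url, redirectResponse.status, initiator.type) are those of the DevTools protocol event payload.

```javascript
// Hypothetical helper: append one entry per document request to `hops`.
// `event` is the payload of a CDP Network.requestWillBeSent event.
function recordHop(hops, event) {
  if (event.type !== 'Document') {
    return hops; // ignore images, scripts, XHR, ...
  }
  hops.push({
    url: event.request.url,
    // Status of the response that redirected here, or null when this
    // request was not caused by an HTTP redirect.
    redirectStatus: event.redirectResponse ? event.redirectResponse.status : null,
    // What triggered the request, e.g. 'script' or 'other'.
    initiator: event.initiator ? event.initiator.type : null,
  });
  return hops;
}

// Hypothetical wiring, given a CDP session `client` as in the snippet above:
//   const hops = [];
//   client.on('Network.requestWillBeSent', e => recordHop(hops, e));
```

Document requests made by iframes are recorded as well, so this alone does not restrict the list to top-level navigations.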

@vandolphreyes

The snippet above doesn't work for me when the URLs have been shortened with bit.ly or https://www.shorturl.at/shortener.php

srozzo commented Oct 29, 2020

This worked for me for capturing only the redirects that change the browser's location, not the requests for every page asset.

    // Assumed setup, not shown in the original snippet: `page` is an open Puppeteer page
    // and this runs inside an async function
    const redirects = { chain: [] };
    // request.continue() below only works once request interception is enabled
    await page.setRequestInterception(true);

    // Request interception handler
    page.on('request', request => {
        // Capture any navigation request that attempts to load a new document
        // This will capture HTTP Status 301, 302, 303, 307, 308, HTML, and JavaScript redirects
        // Make sure the redirect is in the top-level frame, or we will see the navigation for other frames
        const frame = request.frame();
        if (request.isNavigationRequest() && frame && frame.parentFrame() === null) {
            redirects.chain.push({ url: request.url() });
        }
        // Continue to next request
        request.continue();
    });
