Question: How do I get puppeteer to download a file? #299

Open
aherriot opened this Issue Aug 16, 2017 · 102 comments

aherriot commented Aug 16, 2017

Question: How do I get puppeteer to download a file or make additional http requests and save the response?

Contributor

Garbee commented Aug 16, 2017

I'll look into the specifics after I work on another issue (if no one gets to it before me). But I feel like we'd need to watch for the request going off and then save the buffer of that response body. I'm not sure how you'd trigger the download programmatically, though "clicking" like normal on the right part of the page should do it.

aherriot commented Aug 16, 2017

Let me elaborate a little bit. In my use case, I want to visit a website which contains a list of files. Then I want to trigger HTTP requests for each file and save the response to disk. The list may not already be hyperlinks to the files, just the plain-text name of each file, but from that I can derive the actual URL of the file to download.
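
A minimal sketch of the "derive the URL from the file name" step, assuming every name resolves against one base URL. The helper name `buildFileUrls`, the base URL, and the file names are hypothetical and only illustrate the idea.

```javascript
const { URL } = require('url');

// Hypothetical helper: turn plain-text file names scraped from a page
// into absolute download URLs by resolving them against a base URL.
function buildFileUrls(fileNames, baseUrl) {
  return fileNames
    .map(name => name.trim())
    .filter(name => name.length > 0)
    // encodeURIComponent keeps spaces and other special characters valid in the path
    .map(name => new URL(encodeURIComponent(name), baseUrl).href);
}

// Example with made-up values
const urls = buildFileUrls(['report 2017.csv', ' notes.txt '], 'https://example.com/files/');
console.log(urls);
```

The names themselves would come from the page (for example, text read out of the list elements), and each resulting URL can then be requested and saved.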

Contributor

kensoh commented Aug 16, 2017

I saw this Chromium issue some time ago. It addresses downloads and seems to be moving along well. My impression is that, for security reasons, headless Chrome does not support downloading by clicking on a download button. But the above issue opens up that possibility for this important use case.

aherriot commented Aug 16, 2017

It seems that a key use case of this project is to use it for web scraping and downloading content from the internet. For example CasperJS has a download method for this purpose.

Contributor

kensoh commented Aug 16, 2017

I've been using CasperJS for a couple of years. Yep, that method actually sends raw XMLHttpRequests to the specified URL to grab the contents of a file. However, in practice it is not foolproof. For example, try scripting it to fetch the download zip URL of a GitHub repo: it will produce a zero-byte file. I guess a method like that could be revisited to see if it can be improved to cover edge cases.

Fundamentally, though, my preference is to have the ability to click a download button and get the file directly. This seems easier to script, because for some types of downloads you never get to see the actual URL in the DOM. Only after clicking and going through some JS code does the download start in a normal browser. That type of setup may not work with a download method, because there is no full URL to give to the method in the first place.

Contributor

Garbee commented Aug 16, 2017

Yeah, Chromium doesn't support it. But if you can trigger the request, you should at least be able to get the buffer content and write it to disk using the filesystem API in Node. Or, if the download is blocked outright, get the URL, initiate the request manually, and then write that response's buffer to disk the same way.

Chromium may not support it, but it should be possible to work around it.

aherriot commented Aug 16, 2017

@kensoh Ideally, it could support both downloading by clicking a link and downloading from URL.
@Garbee There may be a way for me to just use NodeJS to make the requests, but if I use a headless browser, it will send the proper session cookies with my requests.

@ebidel ebidel added the feature label Aug 16, 2017

Contributor

pavelfeldman commented Aug 17, 2017

Support for downloads is on the way. It needs changes to Chromium that are under review.

kazaff commented Aug 17, 2017

@pavelfeldman im waiting~~~~

@pavelfeldman pavelfeldman added the P1 label Aug 18, 2017

Contributor

aslushnikov commented Aug 23, 2017

Upstream has landed as r496577, we now need to roll.

intellix commented Aug 30, 2017

I'm having file download issues as well, not sure if it's the same thing. I'm visiting a link that triggers a download like: somewhere.com/download/en-GB.po.

I'm creating a new page per language file I need to download so they're run in parallel and then trying to Promise.all() them before closing the browser. It seems that even after the downloads all finish, the page.goto is never resolved:

const urls = languages.map(lang => `${domain}/download/${lang}/${project}/${lang}.${extension}`);
await Promise.all(urls.map(url => browser.newPage().then(async tab => await tab.goto(url))));
browser.close();

nisrulz commented Sep 1, 2017

@aslushnikov is the change you mentioned in your last commit, now shipped?
If yes, then I am looking for some examples to download a file in headless mode. I couldn't find any documentation on it.

Contributor

aslushnikov commented Sep 1, 2017

@nisrulz additional work is required upstream, we're working on it.

Contributor

pavelfeldman commented Sep 1, 2017

@aslushnikov 496577 landed a week ago, it should be a few lines of code on your end.

Looks like aslushnikov@ bakes something more upstream to deliver an event upon download.

Contributor

aslushnikov commented Sep 1, 2017

@pavelfeldman r496577 is not enough for a complete story; I'm working on a Page.FileDownloaded event to notify about successful downloads.

mmacaula commented Sep 1, 2017

One workaround I found (not ideal for sure): open Chrome with a profile that has a download directory set. It worked for me when I clicked a link that downloaded an audio file. Then in your puppeteer script, just wait for that file to appear and copy it over to where you need it to go.

const browser = await puppeteer.launch({headless: false, args: ['--profile-directory=Default']});

see here for how to find your profile

nisrulz commented Sep 1, 2017

@mmacaula I am looking for a way to download a file when Chrome is running in headless mode. If I am not running in headless mode, the file downloads perfectly into the Downloads folder which is the default location I guess.

It's a much sought-after feature, already available in projects such as CasperJS.

mmacaula commented Sep 1, 2017

Oh yeah my mistake, It doesn't work with headless mode. :(

dagumak commented Sep 9, 2017

Is there a way to just capture the request and have it stored in another remote location instead of local to Chrome/puppeteer?

Member

ebidel commented Sep 9, 2017

@dagumak couldn't you catch the responses and write the files to a location of your choice?

const puppeteer = require('puppeteer');
const fs = require('fs');
const URL = require('url').URL;

(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Collect every response the page receives while loading.
const responses = [];
page.on('response', resp => {
  responses.push(resp);
});

await page.goto('https://news.ycombinator.com/', {waitUntil: 'networkidle'});

// Write all bodies to disk and wait for that to finish before closing the browser.
await Promise.all(responses.map(async resp => {
  const request = resp.request();
  // request.url is a property in this Puppeteer version; later releases make it a method.
  const url = new URL(request.url);

  const split = url.pathname.split('/');
  let filename = split[split.length - 1];
  if (!filename.includes('.')) {
    filename += '.html';
  }

  try {
    const buffer = await resp.buffer();
    fs.writeFileSync(filename, buffer);
  } catch (err) {
    // Some responses (redirects, for example) have no body to read.
    console.warn(`Skipping ${request.url}: ${err.message}`);
  }
}));

await browser.close();
})();

You may need to adjust the timing for your page. Waiting for the load event and networkidle might not be enough.

mickdekkers commented Sep 9, 2017

@ebidel Sorry if I'm missing something, but where are you getting the buffer from in that code?

edit: response.buffer seems to be a function, but when I call it and await the promise it returns I get this error:

Unhandled promise rejection (rejection id: 1): Error: Protocol error (Network.getResponseBody):
No data found for resource with given identifier undefined

This seems to only happen when the file gets downloaded by the browser -- that is to say, when the file appears in the download bar in non-headless mode.

This is the code I used:

// this works
// const downloadUrl = 'https://nodejs.org/dist/v6.11.3/'
// this doesn't work
const downloadUrl = 'https://nodejs.org/dist/v6.11.3/SHASUMS256.txt.sig'

const responseHandler = async (response) => {
  if (response.url !== downloadUrl) {
    return
  }

  const buffer = await response.buffer()
  console.log('response buffer', buffer)
  browser.close()
}
page.on('response', responseHandler)
page.goto(downloadUrl)

Version info:
Windows 10 64-bit
Puppeteer 0.10.2
Chromium 62.0.3198.0 (Developer Build) (64-bit)

dagumak commented Sep 10, 2017

@ebidel I'm going to give that a shot. Thank you!!

Member

ebidel commented Sep 10, 2017

@mickdekkers updated the snippet to include const buffer = await resp.buffer();. Bad copy and paste.

mickdekkers commented Sep 10, 2017

@ebidel alright, thanks! Do you know if the issue I described in my edit is a bug or expected behavior? I couldn't find any info about it and I'd like to report it somewhere if it is, but I'm not sure what's the best place for it. I can make a new issue for it on this tracker if it's a puppeteer bug.

Member

ebidel commented Sep 10, 2017

For resource types that the renderer doesn't support, the default browser behavior is to download the file. That's probably what's going on here. Could you use pure Node APIs to fetch/write the file instead of waiting for the page response? You could also intercept requests with page.on('request') and fetch the file yourself.
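
A rough sketch of the "pure Node APIs" route, using only the built-in http, https, and fs modules. The helper names `clientFor` and `downloadToFile` are hypothetical. This version does not follow redirects, and any session cookies would have to be passed in through `headers` by the caller.

```javascript
const fs = require('fs');
const http = require('http');
const https = require('https');
const { URL } = require('url');

// Pick the Node client module that matches the URL scheme.
function clientFor(url) {
  return new URL(url).protocol === 'https:' ? https : http;
}

// Hypothetical helper: stream a URL to disk with plain Node APIs.
// `headers` can carry e.g. a Cookie header copied from the browser session.
function downloadToFile(url, destPath, headers = {}) {
  return new Promise((resolve, reject) => {
    const request = clientFor(url).get(url, { headers }, res => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error(`Unexpected status ${res.statusCode} for ${url}`));
        return;
      }
      const file = fs.createWriteStream(destPath);
      res.pipe(file);
      file.on('finish', () => resolve(destPath));
      file.on('error', reject);
    });
    request.on('error', reject);
  });
}
```

Streaming the response to disk avoids holding a large file in memory, which is the main difference from buffering the body and calling fs.writeFileSync.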

@aslushnikov would know for sure. He's been working on a download API. There may be a cleaner way to handle cases like this in the future.

elbrodelche commented Jul 19, 2018

You can get the file size and the name of the file from the response, and then use a watch script to compare it against the size of the downloaded file, in order to know when to close the browser.
Then you can use browser.on('disconnected') to do something else after the download is done.

This is an example:

const filename = <set this with some regex in response>;
const dir = <watch folder or file>;

// Register the response listener before clicking, so the download response is not missed
page.on('response', response => {
    const headers = response.headers();
    // If the response carries the expected file
    if (headers['content-disposition'] === `attachment;filename=${filename}`) {
        // Get the size
        const expectedSize = parseInt(headers['content-length'], 10);
        console.log('Size from header: ', expectedSize);
        // Watch the download folder or file
        fs.watchFile(dir, (curr, prev) => {
            // If the current size equals the size from the response, stop watching and close
            if (curr.size === expectedSize) {
                fs.unwatchFile(dir);
                browser.close();
            }
        });
    }
});

// Trigger some event when browser closes.
browser.on('disconnected', async () => {
    <something else>
});

// Download and wait for download
await page.click('#DownloadFile');

The way of matching the response could certainly be improved, but I hope you'll find this useful.
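
One possible improvement to the header matching, sketched under the assumption that the server sends a standard Content-Disposition header: parse the file name out of the header instead of comparing the whole string. The helper name `filenameFromContentDisposition` is hypothetical.

```javascript
// Hypothetical helper: extract the file name from a Content-Disposition header.
// Handles `filename="x"`, `filename=x`, and the RFC 5987 form `filename*=UTF-8''x`.
function filenameFromContentDisposition(header) {
  if (!header) return null;
  // Prefer the extended, percent-encoded form when it is present
  const extended = /filename\*\s*=\s*UTF-8''([^;]+)/i.exec(header);
  if (extended) return decodeURIComponent(extended[1].trim());
  // Otherwise accept a quoted or bare file name
  const plain = /filename\s*=\s*(?:"([^"]*)"|([^;]+))/i.exec(header);
  if (!plain) return null;
  return (plain[1] !== undefined ? plain[1] : plain[2]).trim();
}

console.log(filenameFromContentDisposition('attachment; filename="my file.pdf"'));
```

With this, the response handler can test whether a file name is present at all, rather than needing to know the exact name before the download starts.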

kchen-bv commented Sep 6, 2018

Not sure if my use case is the same as yours, but if you don't provide a path to page.pdf it returns a buffer. You can return this from the endpoint and use it however you want on the client side

Server side

      // fullPage is a page.screenshot() option; page.pdf() already renders the whole document
      const pdf = await page.pdf({
        printBackground: true,
      });

      await browser.close();
      res.set({
        'Content-Disposition': 'attachment; filename="test.pdf"',
        'Content-Type': 'application/pdf'
      });
      res.send(pdf);

then client side, I used a get request and filesaver.

      axios
      .get(`/pdf-gen`, {
        responseType: 'arraybuffer',
        headers: { Accept: 'application/pdf' }
      })
      .then(response => {
        console.log(response);
        const blob = new Blob([response.data], {
          type: 'application/pdf'
        });
        saveAs(blob, 'file.pdf');
      });

dotNetDR commented Sep 26, 2018

Can someone else add fileDownloadCompleted event to puppeteer.Page or puppeteer.Browser?

msprancis commented Oct 21, 2018

I spent hours poring through this thread and Stack Overflow yesterday, trying to figure out how to get Puppeteer to download a csv file by clicking a download link in headless mode in an authenticated session. This article saved the day. In short, fetch:

const res = await this.page.evaluate(() =>
{
    return fetch('https://example.com/path/to/file.csv', {
        method: 'GET',
        credentials: 'include'
    }).then(r => r.text());
});

Any ideas on how to fetch it if I need to save a binary file, such as .pdf? I'm trying blob() instead of text() and other ideas but so far unsuccessful.
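
One way this could work for binary files, as a sketch rather than a confirmed solution from this thread: page.evaluate can only return serializable values, so the bytes are base64-encoded inside the page and decoded into a Buffer in Node. The helper name `fetchBinary` is hypothetical, and `page` is assumed to be a Puppeteer page that is already authenticated.

```javascript
// Hypothetical helper: download a binary file inside the page (so session
// cookies are sent) and hand it to Node as a Buffer.
async function fetchBinary(page, url) {
  const base64 = await page.evaluate(async fileUrl => {
    const response = await fetch(fileUrl, { method: 'GET', credentials: 'include' });
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const bytes = new Uint8Array(await response.arrayBuffer());
    // Build a binary string in chunks to avoid call-stack limits on large files
    let binary = '';
    const chunkSize = 0x8000;
    for (let i = 0; i < bytes.length; i += chunkSize) {
      binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
    }
    return btoa(binary);
  }, url);
  // Decode in Node; the Buffer can be passed straight to fs.writeFile
  return Buffer.from(base64, 'base64');
}
```

Base64 adds roughly a third to the transferred size, so for very large files a download-directory approach may be a better fit.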

xprudhomme commented Oct 22, 2018

@msprancis : see my solution here

msprancis commented Oct 22, 2018

@xprudhomme Thank you, it worked!

xprudhomme commented Oct 23, 2018

@msprancis: You're welcome, happy to help :)

gersonfs commented Oct 26, 2018

> @gersonfs I'm sorry this takes long. Our initial approach turned out to be quite involved and vastly complicated chromium codebase. We're now evaluating a different approach that sounds promising, but there's no ETA. See https://crbug.com/831887

Issue #831887 is fixed now!

ryanrhee commented Oct 26, 2018

I also saw that the chromium issue was marked as fixed. Are there docs on how to use the new feature to download a binary file?

stevestmartin commented Oct 29, 2018

@gersonfs @aslushnikov Is there anything that needs to be done to get this working that is either in progress or one of the community members can submit a PR for?

xprudhomme commented Oct 29, 2018

Well to be honest, I don't really see where there is an issue at this point, everything seems to be working like a charm.

I've even been able to get Puppeteer to download a file simply by clicking on the "Download file" button that triggers the download process.

It works this way:

 function setDownloadBehavior(downloadPath='/tmp/puppeteer/downloads/') {
    return page._client.send('Page.setDownloadBehavior', {
        behavior: 'allow',
        downloadPath
    });
 }

await setDownloadBehavior();
await page.click(downloadButtonSelector);

If the download button triggers a PDF file download, then you will end up with this PDF file being downloaded at the downloadPath location, in the case above it would be located at '/tmp/puppeteer/downloads/whateverPDFname.pdf'

verglor commented Oct 29, 2018

Hi @xprudhomme, for me the problem is that I don't know the filename of the downloaded file, or when the download has finished. There are workarounds, but they are quite cumbersome. Also, using your snippet without some kind of wait at the end means the file will not be downloaded unless it is super tiny.

xprudhomme commented Nov 3, 2018

Hi @verglor ,

I understand the issues you are dealing with. However, they are easy to overcome.

We can :

  1. Determine the downloaded file name by catching it from a customized response handler (because even if we don't know the exact file name, at least we know what the requested url looks like)
  2. Trigger the file download
  3. Wait for the downloaded file name to be 'ready' / caught, and get it
  4. Wait for the downloaded file to be 'ready' / written to local storage

With this code:

function responseHandler(response) { // Related to step 1

    const url = response.url();
    const dlRequestUriPattern = /whatever-pattern\/download/i;
    const urlMatchesDownloadRequestURI = dlRequestUriPattern.test(url);

    // Only deal with requests matching our file download URL pattern
    if(!urlMatchesDownloadRequestURI) {
        return;
    }
    console.log(` [responseHandler] Got response for file download: ${url}`);

    let dlFilename = decodeURIComponent(url.replace(/http.+\//g, '').replace(/\?.+/g, ''));
    setDLFileName(dlFilename);
}


function setDLFileName(filename) { // Related to step 1
    return page.evaluate((name) => {
        window._dlFilename_ = name;
    }, filename);
}


function waitForDLFileName() { // Related to step 3
    return page.waitForFunction(() => (window._dlFilename_ !== null && window._dlFilename_ !== undefined) );
}

function getDLFileName() { // Related to step 3
    return waitForDLFileName()
   .then( () => page.evaluate( () => window._dlFilename_));
}


/**
 * Check if file exists, watching containing directory meanwhile.
 * Resolve if the file exists, or if the file is created before the timeout occurs
 * @param {string} filePath 
 * @param {integer} timeout 
 */
function checkFileExists(filePath, timeout=15000) { // Related to step 4

    return new Promise(function (resolve, reject) {

        let timer = setTimeout(function () {
            watcher.close();
            reject(new Error(' [checkFileExists] File does not exist, and was not created during the timeout delay.'));
        }, timeout);

        fs.access(filePath, fs.constants.R_OK, function (err) {
            if (!err) {
                clearTimeout(timer);
                watcher.close();
                resolve();
            }
        });

        let dir = path.dirname(filePath);
        let basename = path.basename(filePath);
        let watcher = fs.watch(dir, function (eventType, filename) {
            if (eventType === 'rename' && filename === basename) {
                clearTimeout(timer);
                watcher.close();
                resolve();
            }
        });
    });
}

page.on('response', responseHandler); // Step 1

await page.click('whateverDownloadButtonSelector'); // Step 2 

const downloadedFilename  = await getDLFileName(); // Step 3

// Step 4
const filePath = `/tmp/puppeteer/downloads/${downloadedFilename}`;
await checkFileExists(filePath);

verglor commented Nov 4, 2018

Hi @xprudhomme, thank you for sharing.
As I said, a workaround exists but it is cumbersome.
Mine was the following:

const puppeteer = require('puppeteer')
const expect = require('expect-puppeteer')
const { setDefaultOptions } = require('expect-puppeteer')
setDefaultOptions({ timeout: 5000 })
const fs = require('fs')
const mkdirp = require('mkdirp')
const path = require('path')
const uuid = require('uuid/v1')

// Small helper used by the polling loop below (it was not defined in the original snippet)
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Click `selector` and resolve with the full path of the downloaded file
async function download(page, selector) {
    // A fresh directory per download, so the only file in it is the one we want
    const downloadPath = path.resolve(__dirname, 'download', uuid())
    mkdirp.sync(downloadPath)
    console.log('Downloading file to:', downloadPath)
    await page._client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: downloadPath })
    await expect(page).toClick(selector)
    let filename = await waitForFileToDownload(downloadPath)
    return path.resolve(downloadPath, filename)
}

// Poll until a file appears that is no longer a partial `.crdownload` file
async function waitForFileToDownload(downloadPath) {
    console.log('Waiting to download file...')
    let filename
    while (!filename || filename.endsWith('.crdownload')) {
        filename = fs.readdirSync(downloadPath)[0]
        await sleep(500)
    }
    return filename
}
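
The polling loop above never gives up if the download fails to start. A hedged variant with a timeout, using only Node built-ins, might look like this. The helper names `listCompletedDownloads` and `waitForDownload` and the default timings are hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Return the entries of `dir` that are finished downloads, i.e. not
// Chrome's temporary `.crdownload` files.
function listCompletedDownloads(dir) {
  return fs.readdirSync(dir).filter(name => !name.endsWith('.crdownload'));
}

// Hypothetical helper: poll `dir` until a completed file shows up, or give up
// after `timeout` milliseconds instead of looping forever.
async function waitForDownload(dir, { timeout = 30000, interval = 500 } = {}) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    const [filename] = listCompletedDownloads(dir);
    if (filename) return path.join(dir, filename);
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error(`No completed download in ${dir} after ${timeout} ms`);
}
```

Like the original, this assumes the directory is empty before the download starts, so the first completed entry is the file of interest.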

allquixotic commented Dec 26, 2018

@verglor Your solution involving Page.setDownloadBehavior doesn't work for me; the Chromium process hard crashes (native code crash) on macOS.

allquixotic commented Dec 26, 2018

@xprudhomme The issue with your solution is it's required to know in advance what the URL of the file is. If you have some kind of complicated way of resolving the URL of the file (e.g. the server spits out a URL from an XHR in clientside JS upon clicking a button), you'll not be able to write any meaningful code with a regex for const dlRequestUriPattern = /whatever-pattern\/download/i;.

Also, you should probably parameterize that pattern. What if I want to download two completely separate files in the same script that have nothing to do with one another in terms of their pattern?

This is very unwieldy. +1 for an easy, baked-in way to do this!
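
As an illustration of what a parameterized version could look like, here is a sketch that takes the pattern as an argument and resolves with the first matching response. The helper name `waitForMatchingResponse` is hypothetical, and it assumes `page` exposes the Node EventEmitter methods `on` and `removeListener`.

```javascript
// Hypothetical helper: resolve with the first response whose URL matches
// `pattern`, so each download in a script can supply its own pattern.
// Avoid the `g` flag on `pattern`, since it makes RegExp.test stateful.
function waitForMatchingResponse(page, pattern, timeout = 30000) {
  return new Promise((resolve, reject) => {
    const onResponse = response => {
      // response.url is a property in early Puppeteer releases and a method later
      const url = typeof response.url === 'function' ? response.url() : response.url;
      if (!pattern.test(url)) return;
      cleanup();
      resolve(response);
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`No response matched ${pattern} within ${timeout} ms`));
    }, timeout);
    // Always remove the listener and the timer, whichever outcome wins
    function cleanup() {
      clearTimeout(timer);
      page.removeListener('response', onResponse);
    }
    page.on('response', onResponse);
  });
}
```

The promise should be created before the click that triggers the download and awaited afterwards, so the response cannot slip past unobserved. This still requires knowing something about the URL, so it does not address the case where the URL is produced entirely in client-side code.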

GerryOnGithub commented Jan 9, 2019

I got xprudhomme's solution to work. None of the other solutions worked for me, mostly because the page.on('response' never seemed to show the download.

xprudhomme commented Jan 14, 2019

> @xprudhomme The issue with your solution is it's required to know in advance what the URL of the file is. If you have some kind of complicated way of resolving the URL of the file (e.g. the server spits out a URL from an XHR in clientside JS upon clicking a button), you'll not be able to write any meaningful code with a regex for const dlRequestUriPattern = /whatever-pattern\/download/i;.
>
> Also, you should probably parameterize that pattern. What if I want to download two completely separate files in the same script that have nothing to do with one another in terms of their pattern?
>
> This is very unwieldy. +1 for an easy, baked-in way to do this!

@allquixotic
If you don't mind, I would kindly remind you that the topic here is "Question: How do I get puppeteer to download a file?", a question which I have answered with a working solution.

The question is not "How do I get the best improved solution ever". I agree there is room for improvement, such as parameterizing the download URL regex pattern; that totally makes sense.

The solution works, and not just for me. Unfortunately, when you deal with many different websites, you will always have to first figure out the URL pattern for the files you are trying to download, which always requires a bit of reverse engineering work on the target website.

I guess that your point is, or could be: "How do we get a fully automated way to get Puppeteer to download files for us, without any customization of any kind", which is not the question here from my point of view :)

gufranco commented Jan 18, 2019

I did a workaround using puppeteer and axios.

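await page.setRequestInterception(true);

const interceptedRequest = await new Promise((resolve) => {
  // Register the handler before navigating so the request is caught
  page.on('request', (request) => {
    request.abort();
    resolve(request);
  });

  // The navigation is aborted on purpose, so ignore its rejection
  page.goto(downloadLink).catch(() => {});
});

const cookies = await page.cookies();
const headers = interceptedRequest.headers();
headers.Cookie = cookies.map(cookie => `${cookie.name}=${cookie.value}`).join('; ');

const response = await axios.get(interceptedRequest.url(), {
  headers,
  // In Node, axios handles binary data as 'arraybuffer' or 'stream'; 'blob' is browser-only
  responseType: 'arraybuffer',
});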

GerryOnGithub commented Jan 18, 2019

In my case the server generates a filename dynamically, and I can't predict what the filename will be.
page._client.send('Page.setDownloadBehavior', { ... }
sees all the files, except for the file downloaded by clicking a download link.

xprudhomme commented Jan 21, 2019

> In my case the server generates a filename dynamically, and I can't predict what the filename will be.
> page._client.send('Page.setDownloadBehavior', { ... }
> sees all the files, except for the file downloaded by clicking a download link.

@GerryOnGithub: in our case too, we are not able to predict what the filename will be, which is why our solution guesses it: const downloadedFilename = await getDLFileName();

GerryOnGithub commented Jan 21, 2019

Sorry, the client (browser) is creating the filename and downloading it locally - there is no url.

Node.js, running puppeteer on the server can't see the filename. I think I will have the client write the filename to a hidden div/span and then puppeteer should be able to extract it from the page.
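
A different angle that needs neither a URL nor a predictable file name, sketched under the assumption that the file does end up in a directory Node can read: snapshot the download directory before the click and compare afterwards. The helper names `newEntries` and `detectNewFiles` are hypothetical.

```javascript
const fs = require('fs');

// Hypothetical helper: names present in `after` but not in `before`.
function newEntries(before, after) {
  const seen = new Set(before);
  return after.filter(name => !seen.has(name));
}

// Snapshot the download directory, run the action that triggers the
// download, then report which files appeared.
async function detectNewFiles(downloadDir, triggerDownload) {
  const before = fs.readdirSync(downloadDir);
  await triggerDownload();
  return newEntries(before, fs.readdirSync(downloadDir));
}

console.log(newEntries(['a.txt'], ['a.txt', 'b.pdf']));
```

In practice the second listing would need to be repeated until the file is complete, since a download is usually still in progress right after the click.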
