Question: How do I get puppeteer to download a file? #299

Open
aherriot opened this issue Aug 16, 2017 · 197 comments
Comments

@aherriot

Question: How do I get puppeteer to download a file or make additional http requests and save the response?

@Garbee
Contributor

Garbee commented Aug 16, 2017

I'll look into the specifics after I work on another issue (if no one gets to it before me). But I feel like we'd need to watch for the request going out and then save the buffer of that response body. Not sure how you'd trigger the download programmatically, though. Clicking the right part of the page, as a user normally would, should do it.

@aherriot
Author

Let me elaborate a little bit. In my use case, I want to visit a website which contains a list of files. Then I want to trigger HTTP requests for each file and save the response to disk. The list may not already be hyperlinks to the files, just the plain-text name of each file, but from that I can derive the actual URL of the file to download.

@kensoh
Contributor

kensoh commented Aug 16, 2017

I saw this Chromium issue some time ago. It addresses downloads and seems to be moving along well. For security reasons, I have the impression that headless Chrome does not support downloading by clicking a download button. But the issue above opens up that possibility for this important use case.

@aherriot
Author

It seems that a key use case of this project is to use it for web scraping and downloading content from the internet. For example CasperJS has a download method for this purpose.

@kensoh
Contributor

kensoh commented Aug 16, 2017

I've been using CasperJS for a couple of years. Yep, that actually sends raw XMLHttpRequests to the specified URL to grab the contents of a file. However, in practice it is not foolproof. For example, try scripting a fetch of the download-zip URL of a GitHub repo: it will produce a zero-byte file. I guess a method like that could be revisited to see if it can be improved to cover edge cases.

Fundamentally, though, my preference is to be able to click a download button and get the file directly. This seems easier to script, because for some types of downloads you never see the actual URL in the DOM. Only after you click and some JS code runs does the download start in a normal browser. A download method may not work in that kind of setup because there is no full URL to give it in the first place.

@Garbee
Contributor

Garbee commented Aug 16, 2017

Yea, Chromium doesn't support it. But if you can trigger the request, you should be able to at least get the buffer content and write it to disk using the filesystem API in Node. Or, if the download is blocked outright, get the URL and make the request manually, then write that response's buffer to disk the same way.

Chromium may not support it, but it should be possible to work around it.

@aherriot
Author

@kensoh Ideally, it could support both downloading by clicking a link and downloading from URL.
@Garbee There may be a way for me to just use NodeJS to make the requests, but if I use a headless browser, it will send the proper session cookies with my requests.
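
The session-cookie concern can also be handled by copying the browser's cookies into a plain Node request. The sketch below assumes Node 18+ (for the global `fetch`) and that `cookies` is an array of `{name, value}` objects, such as the one Puppeteer's `page.cookies(url)` returns. `toCookieHeader` and `downloadWithCookies` are hypothetical helper names, not part of Puppeteer.

```javascript
const { writeFile } = require('node:fs/promises');

// Turn [{name, value}, ...] into a single Cookie request-header value.
function toCookieHeader(cookies) {
  return cookies.map(c => `${c.name}=${c.value}`).join('; ');
}

// Fetch `url` with the given cookies attached and write the body to `destPath`.
async function downloadWithCookies(url, cookies, destPath) {
  const res = await fetch(url, { headers: { cookie: toCookieHeader(cookies) } });
  if (!res.ok) {
    throw new Error(`Request for ${url} failed with status ${res.status}`);
  }
  await writeFile(destPath, Buffer.from(await res.arrayBuffer()));
  return destPath;
}
```

A call could look like `await downloadWithCookies(fileUrl, await page.cookies(fileUrl), './file.bin')`. This only carries cookies over; sites that also check other headers or tokens would need more than this.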

@ebidel ebidel added the feature label Aug 16, 2017
@pavelfeldman
Contributor

Support for downloads is on the way. It needs changes to Chromium that are under review.

@kazaff

kazaff commented Aug 17, 2017

@pavelfeldman im waiting~~~~

@aslushnikov
Contributor

Upstream has landed as r496577, we now need to roll.

@intellix

I'm having file download issues as well, not sure if it's the same thing. I'm visiting a link that triggers a download like: somewhere.com/download/en-GB.po.

I'm creating a new page per language file I need to download so they're run in parallel and then trying to Promise.all() them before closing the browser. It seems that even after the downloads all finish, the page.goto is never resolved:

const urls = languages.map(lang => `${domain}/download/${lang}/${project}/${lang}.${extension}`);
await Promise.all(urls.map(url => browser.newPage().then(async tab => await tab.goto(url))));
browser.close();

@nisrulz

nisrulz commented Sep 1, 2017

@aslushnikov is the change you mentioned in your last commit, now shipped?
If yes, then I am looking for some examples to download a file in headless mode. I couldn't find any documentation on it.

@aslushnikov
Contributor

@nisrulz additional work is required upstream, we're working on it.

@pavelfeldman
Contributor

pavelfeldman commented Sep 1, 2017

@aslushnikov r496577 landed a week ago; it should be a few lines of code on your end.

Looks like aslushnikov@ bakes something more upstream to deliver an event upon download.

@aslushnikov
Contributor

@pavelfeldman r496577 is not enough for a complete story; I'm working on a Page.FileDownloaded event to notify about a successful download.

@mmacaula

mmacaula commented Sep 1, 2017

One workaround I found (not ideal, for sure): open Chrome with a profile that has a download directory set. It worked for me when I clicked a link that downloaded an audio file. Then, in your Puppeteer script, just wait for that file to appear and copy it to where you need it.

const browser = await puppeteer.launch({headless: false, args: ['--profile-directory=Default']});

see here for how to find your profile

@nisrulz

nisrulz commented Sep 1, 2017

@mmacaula I am looking for a way to download a file when Chrome is running in headless mode. If I am not running in headless mode, the file downloads perfectly into the Downloads folder which is the default location I guess.

It's a much sought-after feature, already available in projects such as CasperJS.

@mmacaula

mmacaula commented Sep 1, 2017

Oh yeah, my mistake. It doesn't work with headless mode. :(

@dagumak

dagumak commented Sep 9, 2017

Is there a way to just capture the request and have it stored in another remote location instead of locally on the Chrome/Puppeteer machine?

@ebidel
Contributor

ebidel commented Sep 9, 2017

@dagumak couldn't you catch the responses and write the files to a location of your choice?

const puppeteer = require('puppeteer');
const fs = require('fs');
const {URL} = require('url');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Collect every response the page receives.
  const responses = [];
  page.on('response', resp => responses.push(resp));

  // 'networkidle0' replaces the 'networkidle' option from Puppeteer 0.x.
  await page.goto('https://news.ycombinator.com/', {waitUntil: 'networkidle0'});

  // Write each body to disk before the browser is closed.
  await Promise.all(responses.map(async resp => {
    // url() is a method in current Puppeteer (it was a property in 0.x).
    const url = new URL(resp.url());

    const split = url.pathname.split('/');
    let filename = split[split.length - 1];
    if (!filename.includes('.')) {
      filename += '.html';
    }

    try {
      const buffer = await resp.buffer();
      fs.writeFileSync(filename, buffer);
    } catch (err) {
      // Redirects and browser-handled downloads have no retrievable body.
      console.error(`Skipping ${resp.url()}: ${err.message}`);
    }
  }));

  await browser.close();
})();

You may need to adjust the timing for your page. Waiting for network idle might not be enough.

@mickdekkers

mickdekkers commented Sep 9, 2017

@ebidel Sorry if I'm missing something, but where are you getting the buffer from in that code?

edit: response.buffer seems to be a function, but when I call it and await the promise it returns I get this error:

Unhandled promise rejection (rejection id: 1): Error: Protocol error (Network.getResponseBody):
No data found for resource with given identifier undefined

This seems to only happen when the file gets downloaded by the browser -- that is to say, when the file appears in the download bar in non-headless mode.

This is the code I used:

// this works
// const downloadUrl = 'https://nodejs.org/dist/v6.11.3/'
// this doesn't work
const downloadUrl = 'https://nodejs.org/dist/v6.11.3/SHASUMS256.txt.sig'

const responseHandler = async (response) => {
  if (response.url !== downloadUrl) {
    return
  }

  const buffer = await response.buffer()
  console.log('response buffer', buffer)
  browser.close()
}
page.on('response', responseHandler)
page.goto(downloadUrl)

Version info:
Windows 10 64-bit
Puppeteer 0.10.2
Chromium 62.0.3198.0 (Developer Build) (64-bit)

@dagumak

dagumak commented Sep 10, 2017

@ebidel I'm going to give that a shot. Thank you!!

@ebidel
Contributor

ebidel commented Sep 10, 2017

@mickdekkers updated the snippet to include const buffer = await resp.buffer();. Bad copy and paste.

@mickdekkers

@ebidel alright, thanks! Do you know if the issue I described in my edit is a bug or expected behavior? I couldn't find any info about it and I'd like to report it somewhere if it is, but I'm not sure what's the best place for it. I can make a new issue for it on this tracker if it's a puppeteer bug.

@ebidel
Contributor

ebidel commented Sep 10, 2017

For resource types that the renderer doesn't support, the default browser behavior is to download the file. That's probably what's going on here. Could you use pure Node APIs to fetch and write the file instead of waiting for the page response? You could also intercept requests with page.on('request') and fetch the file yourself.
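
To guess in advance which responses the browser will hand to its download manager (and for which `response.buffer()` is therefore likely to fail), the response headers can be inspected. The helper below is a rough heuristic, not Chromium's actual decision logic: it treats only `Content-Disposition: attachment` and `application/octet-stream` as download signals. `looksLikeDownload` is a hypothetical name, and it expects a header map with lower-case names like the one returned by `response.headers()`.

```javascript
// Heuristic: does this response look like something the browser will save
// to disk rather than render? `headers` maps lower-case names to values.
function looksLikeDownload(headers) {
  const disposition = (headers['content-disposition'] || '').toLowerCase();
  const type = (headers['content-type'] || '').toLowerCase();

  // An explicit attachment disposition asks the browser to save the body.
  if (disposition.trim().startsWith('attachment')) {
    return true;
  }
  // Generic binary data cannot be rendered, so it is saved instead.
  return type.split(';')[0].trim() === 'application/octet-stream';
}
```

Responses flagged this way are candidates for fetching with plain Node APIs instead of reading the body through the page.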

@aslushnikov would know for sure. He's been working on a download API. There may be a cleaner way to handle cases like this in the future.

@NelsonScott

Not sure if the above still works; I had to use the solution provided here: https://stackoverflow.com/questions/74424735/puppeteer-not-actually-downloading-zip-despite-clicking-link

@cole-jacobs

@JollyGrin @gregtap I've also had success with Page.setDownloadBehavior, but it seems to have broken with headless=new. Has anyone else experienced this?

@jiteshdhamaniya

@cole-jacobs I'm having the same issue.

await page._client.send("Page.setDownloadBehavior", {
    behavior: "allow",
    downloadPath: downloadDir
  });

throws

There was an uncaught error TypeError: page._client.send is not a function

Has anybody found a solution to this? I am using "puppeteer": "^20.9.0".

@mattmillen888

mattmillen888 commented Oct 18, 2023

You could try this.

import puppeteer from 'puppeteer';

async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const client = await page.target().createCDPSession();
  await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: './your-download-directory'
  });
  
  // Your further actions here...

  await browser.close();
}

main().catch(console.error);

Notice the use of page.target().createCDPSession() to get the client object directly and then call .send() on it. This is a more reliable way to access Chrome DevTools Protocol (CDP) methods.
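
The `page._client.send is not a function` error is consistent with `page._client` being a private member whose shape changed between Puppeteer releases, so a small wrapper that sticks to public APIs can make scripts more robust. The sketch below assumes newer releases expose `page.createCDPSession()` and falls back to `page.target().createCDPSession()` otherwise; `getCDPSession` and `allowDownloads` are hypothetical helper names.

```javascript
// Return a CDP session for `page`, preferring the direct public method.
async function getCDPSession(page) {
  if (typeof page.createCDPSession === 'function') {
    return page.createCDPSession(); // available on newer Puppeteer releases
  }
  return page.target().createCDPSession(); // fallback for older releases
}

// Allow downloads for `page` and save them under `downloadPath`.
async function allowDownloads(page, downloadPath) {
  const client = await getCDPSession(page);
  await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath,
  });
  return client;
}
```

Returning the session lets the caller attach event listeners to it or detach it later.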

also try updating puppeteer to latest version:

@jiteshdhamaniya

@mattmillen888 Worked well. thanks

@cipri-tom

Thanks @mattmillen888 !

Note to future readers: if docs are opening in new tabs instead of downloading, you should set the download attribute:

await link.evaluate((a, fn) => {
    a.removeAttribute('target');    // was set to '_blank'
    a.setAttribute('download', fn);  // fn is the filename under which you want to download
    console.log('setting download to', fn); // this logs _in the browser_, not in terminal
}, fileName); // pass the filename to the browser context
console.log(fileName); // this logs in the terminal

link is the ElementHandle of the a element

@sadym-chromium sadym-chromium unpinned this issue Nov 9, 2023
@sirzento

sirzento commented Dec 7, 2023

I'm running the latest version. This code still doesn't work in headless mode for me.

@mattmillen888

mattmillen888 commented Dec 7, 2023

It's tough to know what the exact issue is without error messages. You could try this version, which explicitly sets headless mode, handles potential errors, and ensures that all resources are properly closed if an error occurs.

import puppeteer from 'puppeteer';

async function main() {
  let browser;
  try {
    // Launch browser in headless mode
    browser = await puppeteer.launch({ headless: true });

    // Open a new page
    const page = await browser.newPage();

    // Set download behavior
    const client = await page.target().createCDPSession();
    await client.send('Page.setDownloadBehavior', {
      behavior: 'allow',
      downloadPath: '/absolute/path/to/your-download-directory' // Use an absolute path
    });

    // Your further actions here...

  } catch (error) {
    // Error handling
    console.error('An error occurred:', error);
  } finally {
    // Ensure browser is closed even if an error occurs
    if (browser) {
      await browser.close();
    }
  }
}

main().catch(error => console.error('Failed to run the main function:', error));

A few things to note in this revision are:

  • Ensure that the specified download directory exists and is writable.

  • Replace the placeholder /absolute/path/to/your-download-directory with the actual path where you want the downloads to be saved.

  • The script assumes that your further actions (commented as // Your further actions here...) are properly implemented and error-free. Any code that goes there should also follow best practices for error handling.
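
The first two points can be checked in code before the browser is launched. This is a minimal sketch using only Node's standard library; `prepareDownloadDir` is a hypothetical helper name.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Create the directory if needed, verify it is writable, and return its
// absolute path so it can be passed as `downloadPath`.
function prepareDownloadDir(dir) {
  const absolute = path.resolve(dir);
  fs.mkdirSync(absolute, { recursive: true }); // no-op if it already exists
  fs.accessSync(absolute, fs.constants.W_OK);  // throws if not writable
  return absolute;
}
```

The return value could then be used as `downloadPath: prepareDownloadDir('./downloads')` in the `Page.setDownloadBehavior` call.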

Let me know what errors you receive when using this and I may be able to isolate the issue.

@sirzento

sirzento commented Dec 7, 2023

I use error handling, but there is just no error. When I start it with headless set to false to see the browser, the file is downloaded. After that I delete the file, change only headless to true, and run it again. This time no file is downloaded by the time the script finishes. I even removed the browser.close() part to give it more time in case there were loading issues, but that didn't help. Just in case, I ran it again with headless: false and the file was downloaded again.

I also tried headless: 'new', but that didn't work either.

@mattmillen888

mattmillen888 commented Dec 7, 2023

There are a couple of options here and then a last resort you could try.

  • Using a custom User-Agent
  • Network monitoring: listen to network responses to capture the download URL, wait for network idle so that all network activity has settled, then download the file with axios
  • [LAST RESORT] Running in headful mode in a hidden window

I have generated this script but have not had time to test it yet

import puppeteer from 'puppeteer';
import axios from 'axios';
import fs from 'fs';

async function downloadFile(url, path) {
  const writer = fs.createWriteStream(path);

  const response = await axios({
    url,
    method: 'GET',
    responseType: 'stream'
  });

  response.data.pipe(writer);

  return new Promise((resolve, reject) => {
    writer.on('finish', resolve);
    writer.on('error', reject);
  });
}

async function main() {
  let browser;
  try {
    // Launch browser
    browser = await puppeteer.launch({
      headless: true, // Change to false for headful mode
      args: ['--window-size=1920,1080', '--window-position=-2000,-2000'] // Only for headful mode in hidden window
    });

    const page = await browser.newPage();

    // Set custom user agent
    await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36');

    let downloadUrl = null;

    // Listen to network responses
    page.on('response', async response => {
      if (response.url().endsWith('.pdf')) { // Modify this condition based on your download URL pattern
        downloadUrl = response.url();
      }
    });

    // Your navigation and interaction here...

    // Wait for network to be idle
    await page.waitForNetworkIdle();

    // Close the page
    await page.close();

    // Download the file using axios if URL is captured
    if (downloadUrl) {
      await downloadFile(downloadUrl, '/path/to/download/file.pdf'); // Set your download path
    }

  } catch (error) {
    console.error('An error occurred:', error);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
}

main().catch(error => console.error('Failed to run the main function:', error));

Let me know how you get on!

@sirzento

sirzento commented Dec 8, 2023

Ok it is working now. Setting the user agent did work. I just have the problem now that I don't know when the download is finished. await page.waitForNetworkIdle(); will timeout after 30s even if the download is just a few KB and instantly finished. Do websockets maybe prevent await page.waitForNetworkIdle(); from finishing?

Even client.on('Browser.downloadProgress') won't show me any state changes. Any idea how to tell whether all downloads are finished, i.e. that no download is active at the moment?

@mattmillen888

mattmillen888 commented Dec 9, 2023

As you rightly said, with page.waitForNetworkIdle() it's possible that persistent connections like WebSockets are preventing the network idle state from being reached.

The method waits for a period of idle network activity, so any ongoing connections, including WebSockets, can cause it to time out.

There are a few ways you could solve this, each using a slightly different approach to determine when the download is complete.

Custom Wait Function
Check for Download File Existence
Listen to Download Event
Adjust waitForNetworkIdle Options

  1. Custom Wait Function
    Instead of relying on waitForNetworkIdle, you can create a custom function that waits for the specific network request you are interested in (the file download) to finish.

  2. Check for Download File Existence
    If the download initiates promptly after a certain action, you can periodically check if the file has appeared in the download directory.

  3. Listen to the Download Event
    If you have control over or insights into the backend, you might get a download completion event through WebSocket or another method.

  4. Adjust waitForNetworkIdle Options
    You can adjust the options for waitForNetworkIdle to better suit your needs, although this might not fully resolve the issue if WebSockets are indeed the cause of the timeout.

Here’s an example of how you might implement a custom wait function based on file existence:

async function waitForDownload(downloadPath, timeout = 60000) {
  let startTime = new Date().getTime();
  while (true) {
    if (fs.existsSync(downloadPath)) {
      return true;
    } else if (new Date().getTime() - startTime > timeout) {
      throw new Error('Download timeout');
    }
    await new Promise(resolve => setTimeout(resolve, 1000)); // Wait for 1 second before checking again
  }
}

// Usage in your main function
try {
  // Your code to initiate the download...

  // Wait for the download to complete
  await waitForDownload('/path/to/expected/download/file.pdf');
} catch (error) {
  console.error('Error:', error);
}

This function checks for the existence of the downloaded file every second and times out after a specified duration (default 60 seconds but you can change to whatever you like). You would need to replace '/path/to/expected/download/file.pdf' with the actual path where you expect the file to be downloaded.
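
When the filename is not known in advance, one variation is to snapshot the download directory before triggering the download and then poll for a new entry. This sketch assumes Chrome keeps in-progress downloads under a `.crdownload` suffix until they finish; `newCompletedFiles` and `waitForNewDownload` are hypothetical names.

```javascript
const fs = require('node:fs');

// Names present in `after` but not in `before`, ignoring partial downloads.
function newCompletedFiles(before, after) {
  const known = new Set(before);
  return after.filter(name => !known.has(name) && !name.endsWith('.crdownload'));
}

// Poll `dir` until a new, fully written file shows up; resolves with its name.
async function waitForNewDownload(dir, before, timeout = 60000, interval = 500) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    const [first] = newCompletedFiles(before, fs.readdirSync(dir));
    if (first) {
      return first;
    }
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error('Download timeout');
}
```

The snapshot has to be taken before the click, for example `const before = fs.readdirSync(dir);`, followed by the click and then `await waitForNewDownload(dir, before)`.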

Hopefully one of these options will work for you.

Let me know how you get on

@ivanalemunioz

ivanalemunioz commented Jan 27, 2024

I am going to share my solution after days of trying to solve this over a remote Chrome connection:

const client = await page.target().createCDPSession();

await client.send('Fetch.enable', { patterns: [{ urlPattern: '*', requestStage: 'Response' }] });

// Use the Fetch.requestPaused event as a response middleware
client.on('Fetch.requestPaused', async (reqEvent) => {
	const { requestId } = reqEvent;

	const responseHeaders = reqEvent.responseHeaders || [];
	let contentType = '';

	for (const elements of responseHeaders) {
		if (elements.name.toLowerCase() === 'content-type') {
			contentType = elements.value;
		}
	}

	// Change this to the type you are waiting for
	if (contentType.includes('application/x-x509-ca-cert')) {
		// I remove content-disposition and content-type headers to get the response inline in the browser
		for (let i = responseHeaders.length - 1; i >= 0; i--) {
			if (responseHeaders[i].name.toLowerCase() === 'content-disposition' || responseHeaders[i].name.toLowerCase() === 'content-type') {
				responseHeaders.splice(i, 1);
			}
		}

		responseHeaders.push({
			name: 'Content-Type',
			value: 'text/plain'
		});
		
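		const responseObj = await client.send('Fetch.getResponseBody', {
			requestId
		});

		// The response is in responseObj.body (base64 when responseObj.base64Encoded is true)

		await client.send('Fetch.fulfillRequest', {
			requestId,
			responseCode: 200,
			responseHeaders,
			// fulfillRequest expects a base64-encoded body
			body: responseObj.base64Encoded ? responseObj.body : Buffer.from(responseObj.body).toString('base64')
		});
		
		// Once you get what you want, don't forget to close the client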
		await client.send('Fetch.disable');
		await client.detach();
	}
	else {
		await client.send('Fetch.continueRequest', { requestId });
	}
});

const response = await page.waitForResponse('https://url im waiting for');
const file = await response.text();

@sirzento

sirzento commented Feb 29, 2024

I tried that, but I still can't get it to work with those solutions because:

  1. I don't exactly know how to do that.
  2. I could do that, but I don't know the filename of the file.
  3. Sadly not possible.
  4. I don't think this will work, since there are many WebSockets on this website.

I also looked into the CDP and tried to bind the events Page.downloadProgress, Browser.downloadProgress, Page.downloadWillBegin and Browser.downloadWillBegin, but no event fired, which I find very strange.

I did use this code:

    browser = await puppeteer.launch({ headless: true, ignoreHTTPSErrors: true });
    page = await browser.newPage();

    ...

    const client = await page.target().createCDPSession();
    await client.send('Page.setDownloadBehavior', {
      behavior: 'allow',
      downloadPath: __dirname + '\\'
    });

    client.on('Page.downloadProgress', e => {
      console.log("Page.downloadProgress:", e.state);
    });
    client.on('Browser.downloadProgress', e => {
      console.log("Browser.downloadProgress:", e.state);
    });
    client.on('Browser.downloadWillBegin', e => {
      console.log("Browser.downloadWillBegin:", e.suggestedFilename);
    });
    client.on('Page.downloadWillBegin', e => {
      console.log("Page.downloadWillBegin:", e.suggestedFilename);
    });

    browser.on('downloadProgress', (e: any) => {
      console.log("class downloadProgress:", e.state);
    })

    browser.on('downloadWillBegin', (e: any) => {
      console.log("class downloadWillBegin:", e.suggestedFilename);
    })

    page.on('downloadProgress', (e: any) => {
      console.log("class downloadProgress:", e.state);
    })

    page.on('downloadWillBegin', (e: any) => {
      console.log("class downloadWillBegin:", e.suggestedFilename);
    })
    await dlLink?.click();

The download starts and the file is downloaded successfully, but still no event fires.

Any other idea what to try?

Edit: Network.responseReceived also doesn't fire. I think there is something wrong here. It feels like all events are broken somehow. But client.send still works for setting Page.setDownloadBehavior.
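
One possible reason no download event fires: in the Chrome DevTools Protocol, `Browser.downloadWillBegin` and `Browser.downloadProgress` are only emitted once `Browser.setDownloadBehavior` has been called with `eventsEnabled: true`, an option that `Page.setDownloadBehavior` does not have. The sketch below has not been tested against the site in question. It assumes a browser-level session obtained with `browser.target().createCDPSession()`, and `enableDownloadEvents` and `waitForDownloadEvent` are hypothetical helper names.

```javascript
// Allow downloads into `downloadPath` and turn on CDP download events.
async function enableDownloadEvents(browser, downloadPath) {
  const session = await browser.target().createCDPSession();
  await session.send('Browser.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath,
    eventsEnabled: true, // without this, no Browser.download* events are sent
  });
  return session;
}

// Resolve with { guid, suggestedFilename } when the next download completes;
// reject if it is canceled or if nothing finishes within `timeoutMs`.
function waitForDownloadEvent(session, timeoutMs = 60000) {
  return new Promise((resolve, reject) => {
    const names = new Map(); // guid -> suggested filename

    const onBegin = e => names.set(e.guid, e.suggestedFilename);
    const onProgress = e => {
      if (e.state === 'inProgress') return;
      cleanup();
      if (e.state === 'completed') {
        resolve({ guid: e.guid, suggestedFilename: names.get(e.guid) });
      } else {
        reject(new Error(`Download ${e.guid} ended in state "${e.state}"`));
      }
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error('Download timeout'));
    }, timeoutMs);

    function cleanup() {
      clearTimeout(timer);
      session.off('Browser.downloadWillBegin', onBegin);
      session.off('Browser.downloadProgress', onProgress);
    }

    session.on('Browser.downloadWillBegin', onBegin);
    session.on('Browser.downloadProgress', onProgress);
  });
}
```

The listener has to be registered before the click: call `waitForDownloadEvent(session)` first, then `await dlLink.click()`, then await the returned promise. This does not depend on network idle, so open WebSockets should not interfere.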

@ivanalemunioz

ivanalemunioz commented Feb 29, 2024

@sirzento check this #299 (comment).

Fetch.requestPaused is fired once the browser gets the http headers

@mattmillen888

mattmillen888 commented Mar 1, 2024

@sirzento check this

Fetch.requestPaused is fired once the browser gets the HTTP headers. This could also work well! Here is a version based on @ivanalemunioz's version:

const puppeteer = require('puppeteer');

// Configuration for easy adjustments
const CONFIG = {
    contentType: 'your-content-type-here', // Adjust this to match the specific Content-Type you're targeting
    responseCode: 200,
    downloadPath: __dirname // or any path where you'd like to save downloads
};

// Function to determine if a response should be modified based on its headers
const shouldModifyResponse = (responseHeaders) => {
    return responseHeaders.some(header => 
        header.name.toLowerCase() === 'content-type' && header.value.includes(CONFIG.contentType)
    );
};

// Function to modify response headers
const modifyHeaders = (responseHeaders) => {
    return responseHeaders.reduce((acc, header) => {
        const nameLower = header.name.toLowerCase();
        if (nameLower !== 'content-disposition' && nameLower !== 'content-type') {
            acc.push(header);
        }
        return acc;
    }, [{ name: 'Content-Type', value: 'text/plain' }]);
};

(async () => {
    const browser = await puppeteer.launch({ headless: true, ignoreHTTPSErrors: true });
    const page = await browser.newPage();
    const client = await page.target().createCDPSession();

    try {
        await client.send('Fetch.enable', { patterns: [{ urlPattern: '*', requestStage: 'Response' }] });

        client.on('Fetch.requestPaused', async (reqEvent) => {
            const { requestId, responseHeaders = [] } = reqEvent;

            try {
                if (shouldModifyResponse(responseHeaders)) {
                    const modifiedHeaders = modifyHeaders(responseHeaders);
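                    const responseObj = await client.send('Fetch.getResponseBody', { requestId });
                    await client.send('Fetch.fulfillRequest', {
                        requestId,
                        responseCode: CONFIG.responseCode,
                        responseHeaders: modifiedHeaders,
                        // fulfillRequest expects base64; encode the body if CDP returned plain text
                        body: responseObj.base64Encoded
                            ? responseObj.body
                            : Buffer.from(responseObj.body).toString('base64')
                    });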
                } else {
                    await client.send('Fetch.continueRequest', { requestId });
                }
            } catch (error) {
                console.error('Error handling request:', error);
                await client.send('Fetch.continueRequest', { requestId }); // Ensure continuation in case of error
            }
        });

        // Your code to navigate to the page and initiate the download...
    } catch (error) {
        console.error('An error occurred:', error);
    } finally {
        // Clean up
        await client.send('Fetch.disable');
        await client.detach();
        await browser.close();
    }
})();

@sirzento

sirzento commented Mar 7, 2024

@mattmillen888 @ivanalemunioz I still don't get how this works. Where can I wait for the download to finish? I guess it works when using page.waitForResponse() to start the download, but I don't have a direct link. I need to start the download with a button click. The direct URL contains parameters that can be different each time.

@ivanalemunioz

@sirzento in the @mattmillen888 response, where it says // Your code to navigate to the page and initiate the download... click that button and then wait using page.waitForNetworkIdle() and the downloaded content should be in page.content()

@sirzento

sirzento commented Mar 7, 2024

@ivanalemunioz OK, but page.waitForNetworkIdle() won't finish for me because of the WebSockets on the site :/

@ivanalemunioz

@sirzento in const responseObj = await client.send('Fetch.getResponseBody', { requestId }); you are getting the file content; just save responseObj.body to a file
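
Saving that result could look like the sketch below. `Fetch.getResponseBody` returns an object with a `body` string and a `base64Encoded` flag, so the body has to be decoded when the flag is set; `saveResponseBody` is a hypothetical helper name.

```javascript
const fs = require('node:fs');

// Write the result of Fetch.getResponseBody ({ body, base64Encoded }) to
// `destPath` and return the number of bytes written.
function saveResponseBody(responseObj, destPath) {
  const data = responseObj.base64Encoded
    ? Buffer.from(responseObj.body, 'base64') // binary payloads arrive as base64
    : Buffer.from(responseObj.body, 'utf8');  // text payloads arrive as-is
  fs.writeFileSync(destPath, data);
  return data.length;
}
```

Called inside the `Fetch.requestPaused` handler right after `Fetch.getResponseBody`, this writes the file without waiting for network idle.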

@haixc

haixc commented Aug 5, 2024

With page.waitForNetworkIdle(), it’s as you rightly said possible that persistent connections like WebSockets are preventing the network idle state from being reached.
The method waits for a period of idle network activity, so any ongoing connections, including WebSockets, can cause it to time out.
There are few ways you could solve this by using slightly different approaches to determine when the download is complete.
Custom Wait Function Check for Download File Existence Listen to Download Event Adjust waitForNetworkIdle Options

  1. Custom Wait Function
    Instead of relying on waitForNetworkIdle, you can create a custom function that waits for the specific network request you are interested in (the file download) to finish.
  2. Check for Download File Existence
    If the download initiates promptly after a certain action, you can periodically check if the file has appeared in the download directory.
  3. Listen to the Download Event
    If you have control over or insights into the backend, you might get a download completion event through WebSocket or another method.
  4. Adjust waitForNetworkIdle Options
    You can adjust the options for waitForNetworkIdle to better suit your needs, although this might not fully resolve the issue if WebSockets are indeed the cause of the timeout.
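
Option 1 could be sketched roughly as below, using page.waitForResponse() with a predicate instead of a URL, so the changing query parameters do not matter. This assumes the server marks the file with a Content-Disposition: attachment header, which not every site does; isAttachmentResponse, clickAndWaitForDownloadResponse and the selector are hypothetical names. Note that it only tells you the response has arrived, not that the file has been fully written to disk.

```javascript
// Returns true when a response looks like a file download,
// judged by a "Content-Disposition: attachment" header (an assumption
// about the server; adjust the check for your site).
function isAttachmentResponse(response) {
  const disposition = response.headers()['content-disposition'] || '';
  return /^\s*attachment/i.test(disposition);
}

// Click the element that triggers the download and wait for the matching
// response. The wait is registered before the click so the response
// cannot be missed.
async function clickAndWaitForDownloadResponse(page, selector, timeout = 60000) {
  const [response] = await Promise.all([
    page.waitForResponse(isAttachmentResponse, { timeout }),
    page.click(selector),
  ]);
  return response;
}
```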

Here’s an example of how you might implement a custom wait function based on file existence:

const fs = require('fs');

// Poll until the file exists at downloadPath, or throw once the timeout (in ms) elapses.
async function waitForDownload(downloadPath, timeout = 60000) {
  const startTime = Date.now();
  while (true) {
    if (fs.existsSync(downloadPath)) {
      return true;
    } else if (Date.now() - startTime > timeout) {
      throw new Error('Download timeout');
    }
    await new Promise(resolve => setTimeout(resolve, 1000)); // Wait for 1 second before checking again
  }
}

// Usage in your main function
try {
  // Your code to initiate the download...

  // Wait for the download to complete
  await waitForDownload('/path/to/expected/download/file.pdf');
} catch (error) {
  console.error('Error:', error);
}

This function checks for the existence of the downloaded file every second and times out after a specified duration (default 60 seconds but you can change to whatever you like). You would need to replace '/path/to/expected/download/file.pdf' with the actual path where you expect the file to be downloaded.
Hopefully one of these options will work for you.
Let me know how you get on

I tried that, but I still can't get it to work with those solutions because:

  1. I don't exactly know how to do that.
  2. I could do that but I don't know the filename of the file
  3. Sadly not possible
  4. I don't think this will work since there are many websockets on this website.
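
For point 2, the file name does not strictly have to be known in advance: one possible workaround is to snapshot the download directory before the click and poll for any new entry. A rough sketch, assuming Chrome's usual .crdownload suffix for in-progress files and a directory that nothing else writes to; waitForNewFile and its parameters are made up for illustration:

```javascript
const fs = require('fs');

// Poll `dir` until a file appears that was not in the `before` set and is
// not a partial download (*.crdownload). Resolves with the new file's name,
// or throws once `timeout` ms have elapsed.
async function waitForNewFile(dir, before, timeout = 60000, interval = 500) {
  const start = Date.now();
  while (Date.now() - start <= timeout) {
    const added = fs.readdirSync(dir)
      .filter(name => !before.has(name) && !name.endsWith('.crdownload'));
    if (added.length > 0) return added[0];
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error('Download timeout');
}

// Usage inside an async function (downloadPath and dlLink are yours):
//   const before = new Set(fs.readdirSync(downloadPath));
//   await dlLink.click();
//   const name = await waitForNewFile(downloadPath, before);
```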

I also looked into the CDP and tried to bind the events Page.downloadProgress, Browser.downloadProgress, Page.downloadWillBegin and Browser.downloadWillBegin, but none of them fired, which I find very strange.

I did use this code:

    browser = await puppeteer.launch({ headless: true, ignoreHTTPSErrors: true });
    page = await browser.newPage();

    ...

    const client = await page.target().createCDPSession();
    await client.send('Page.setDownloadBehavior', {
      behavior: 'allow',
      downloadPath: __dirname + '\\'
    });

    client.on('Page.downloadProgress', e => {
      console.log("Page.downloadProgress:", e.state);
    });
    client.on('Browser.downloadProgress', e => {
      console.log("Browser.downloadProgress:", e.state);
    });
    client.on('Browser.downloadWillBegin', e => {
      console.log("Browser.downloadWillBegin:", e.suggestedFilename);
    });
    client.on('Page.downloadWillBegin', e => {
      console.log("Page.downloadWillBegin:", e.suggestedFilename);
    });

    browser.on('downloadProgress', (e: any) => {
      console.log("class downloadProgress:", e.state);
    })

    browser.on('downloadWillBegin', (e: any) => {
      console.log("class downloadWillBegin:", e.suggestedFilename);
    })

    page.on('downloadProgress', (e: any) => {
      console.log("class downloadProgress:", e.state);
    })

    page.on('downloadWillBegin', (e: any) => {
      console.log("class downloadWillBegin:", e.suggestedFilename);
    })
    await dlLink?.click();

The download starts and the file is saved successfully, but still no event fires.

Any other idea what to try?

Edit: Network.responseReceived also doesn't fire. I think there is something wrong here. It feels like all events are broken somehow, but client.send still works for setting Page.setDownloadBehavior.

@sirzento @ivanalemunioz Hi, has this issue been resolved? I'm having the same issue, where no download event is triggered when downloading a file.

  1. opening a page on a.com
  2. then downloading a file from file.a.com
  3. but no download-related events are being triggered.

@ivanalemunioz

ivanalemunioz commented Aug 5, 2024

@haixc I solved it with the Fetch.requestPaused event, as you can see in my answer

@SystemDisc

SystemDisc commented Sep 1, 2024

import * as fs from 'fs';
import * as path from 'path';
import puppeteer from 'puppeteer';

  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  // Ensure the download directory exists
  const downloadPath = path.join('some', 'path', 'to', 'downloads');
  fs.mkdirSync(downloadPath, { recursive: true });

  // Use a browser-level CDP session so Browser.* download events are delivered
  const session = await browser.target().createCDPSession();
  await session.send('Browser.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath,
    eventsEnabled: true, // this must be set
  });
  
  // do your stuff
  
  console.log('Saving file...');
  await new Promise<void>((resolve, reject) => {
    session.on('Browser.downloadProgress', (e) => {
      if (e.state === 'completed') {
        resolve();
      } else if (e.state === 'canceled') {
        reject(new Error('Download canceled'));
      }
    });
  });
  
  await browser.close();

Solution found here: #7173 (comment)
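
Building on the events approach, here is a rough sketch that also reports which file was saved, which may help when the file name is not known up front. It pairs Browser.downloadWillBegin (which carries guid and suggestedFilename) with Browser.downloadProgress for the same guid. waitForDownloadEvent is a hypothetical helper, and the returned path assumes Chrome kept the suggested name under behavior: 'allow' (it may rename the file if one with that name already exists).

```javascript
const path = require('path');

// Resolve with the expected path of the next download once it completes.
// `session` is a browser-level CDP session on which
// Browser.setDownloadBehavior was sent with eventsEnabled: true.
function waitForDownloadEvent(session, downloadPath) {
  return new Promise((resolve, reject) => {
    let guid = null;
    let filename = null;

    const onBegin = (e) => {
      if (guid === null) { // track only the first download that starts
        guid = e.guid;
        filename = e.suggestedFilename;
      }
    };
    const onProgress = (e) => {
      if (e.guid !== guid) return; // ignore unrelated downloads
      if (e.state === 'completed') {
        cleanup();
        resolve(path.join(downloadPath, filename));
      } else if (e.state === 'canceled') {
        cleanup();
        reject(new Error('Download canceled'));
      }
    };
    const cleanup = () => {
      session.off('Browser.downloadWillBegin', onBegin);
      session.off('Browser.downloadProgress', onProgress);
    };

    session.on('Browser.downloadWillBegin', onBegin);
    session.on('Browser.downloadProgress', onProgress);
  });
}
```

It is probably safest to start waiting before triggering the download, for example const [file] = await Promise.all([waitForDownloadEvent(session, downloadPath), dlLink.click()]);, so that a small file cannot finish downloading before the listeners are attached.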
