Different behavior between { headless: false } and { headless: true } #665

Closed
optikalefx opened this issue Sep 2, 2017 · 53 comments

Comments

@optikalefx

I'm curious to know what changes there are between running as headless true vs false. When I run a login to Amazon using headless: true I get an error from Amazon via the screenshot. But when I set headless: false I watch it work just fine, no error.

So I'm trying to figure out what headless: true is doing that is different from when it's not headless.

Thanks for any suggestions.

@Garbee
Contributor

Garbee commented Sep 2, 2017

There could be any number of things going on. They could be looking for the Headless added to the UA string and blocking that. Or they could be using some techniques to detect automated access and prevent it.

If it works in non-headless and fails in headless then the site itself is doing something to prevent automated access. So you'd need to figure out what that is and work around it or move on. Some things are easy to get around (like modifying the UA string) while others are non-trivial to bypass.

@kaushiksundar

I am also facing the same issue.

When Headless is false

page url ===> http://lvh.me:3000/dashboard

When Headless is true

page url ===> about:blank
(node:29206) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1)

@Garbee
Contributor

Garbee commented Sep 3, 2017

Can anyone provide an actual example file to run that reproduces this issue?

@optikalefx
Author

I will try to find something public that I can post. My example is confidential so I can't share it.

@optikalefx
Author

@Garbee just FYI, I'm setting the UA, so I don't think that's it. And I'm performing things like delays and mouse movement etc. Since the only difference is the headless: true it leads me to believe that there is something going on in the lib, and not on the site that I'm scraping. But I will keep trying and hopefully will find an example to post.

Are there other kinds of debugging maybe that can help point to where an issue might be?

@kaushiksundar

kaushiksundar commented Sep 3, 2017

@Garbee Here is the code. This happens only for localhost; if I give an actual website URL (http://www.google.com... etc) it works with both options.

const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto('localhost:3000', {
  networkIdleTimeout: 1000,
  waitUntil: 'networkidle',
  timeout: 3000000
});
console.log(page.url());

Output:
about:blank

Expected output:
localhost:3000

If headless is false I am getting the expected output.

@optikalefx
Author

I'll thicken the plot. I've started debugging the POST requests to my Amazon login. When headless is set to true, Amazon makes an additional POST request that I don't recognize, one that doesn't exist when headless is set to false. That tells me something else is changing with this setting that I don't yet know about.

@optikalefx
Author

I've also inspected the request and response for both headless and non-headless. They seem to be identical in nature.
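
If the requests look identical by eye, a mechanical diff of the captured headers can help rule differences in or out. A minimal sketch, assuming the headers of the same request were captured in both modes as plain objects (for example from request.headers() inside a page 'request' handler); diffHeaders is a hypothetical helper, not part of Puppeteer.

```javascript
// Hypothetical helper: report headers whose values differ between two captures.
// Each argument is a plain object mapping lower-cased header names to values.
function diffHeaders(headful, headless) {
  const names = new Set([...Object.keys(headful), ...Object.keys(headless)]);
  const diff = {};
  for (const name of names) {
    if (headful[name] !== headless[name]) {
      // A missing header shows up as undefined on the side that lacks it.
      diff[name] = { headful: headful[name], headless: headless[name] };
    }
  }
  return diff;
}
```

An empty result would suggest the difference lies elsewhere, for instance in what page scripts can observe rather than in the headers.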

@LoganDark

LoganDark commented Sep 3, 2017

In non-headless mode, screenshots work differently because my screen is in HiDPI mode (MacBook Retina). Here's one of the 'different' screenshots:
[screenshot: example]

@Garbee
Contributor

Garbee commented Sep 3, 2017

Remember that the protocol is required for URLs in goto.

@LoganDark that is a different issue completely. Please file your own for triage and discussion.

@LoganDark

Different issue? Well, I didn't know that because of the title.

@LoganDark

Reading the issue description as well, nothing stands out to me that would make my issue completely different. Here are the parts that made me think my issue did belong here:

I'm curious to know what changes there are between running as headless true vs false.

So I'm trying to figure out what headless: true is doing that is different from when it's not headless.

@kaushiksundar

@Garbee Yes giving the protocol in goto solves the issue.

await page.goto('http://localhost:3000', {
  networkIdleTimeout: 1000,
  waitUntil: 'networkidle',
  timeout: 3000000
});
console.log(page.url());

If I don't give the protocol for google.com, I get the error "Error: Protocol error (Page.navigate): Cannot navigate to invalid URL undefined", whereas for the localhost case above I get about:blank. The error handling is done differently for localhost. Shouldn't it give the protocol error too?

await page.goto('www.google.com', {
  networkIdleTimeout: 1000,
  waitUntil: 'networkidle',
  timeout: 3000000
});
console.log(page.url());

@Garbee
Contributor

Garbee commented Sep 3, 2017

@LoganDark Sorry about the poorly worded title for the issue. There is nothing I can do about that. Your issue is with screenshot functionality, while this one was opened about some navigational problems. They are entirely distinct issues, so a new issue is required to focus on your problem.

@kaushik-sundar Throwing an error for a missing protocol is a good idea IMO. I'll need to look into it though, as it could be non-trivial to set up well due to the number of allowed protocols.
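
Until something like that exists in the library, a check can be done in user code before calling goto. A minimal sketch, assuming Node's built-in WHATWG URL class; requireProtocol and the ALLOWED_PROTOCOLS list are hypothetical and would need adjusting to whatever protocols a script really uses. Note that the URL parser accepts 'localhost:3000' by reading 'localhost:' as the scheme, while 'www.google.com' fails to parse at all, which may be related to the different results reported above (the thread does not confirm this).

```javascript
// Hypothetical guard: reject URLs without an expected protocol before navigating.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:', 'file:', 'about:', 'data:']);

function requireProtocol(input) {
  let parsed;
  try {
    parsed = new URL(input); // throws on strings such as 'www.google.com'
  } catch (err) {
    throw new Error(`Invalid URL (missing protocol?): ${input}`);
  }
  // 'localhost:3000' parses successfully, but with 'localhost:' as its scheme.
  if (!ALLOWED_PROTOCOLS.has(parsed.protocol)) {
    throw new Error(`Unsupported protocol "${parsed.protocol}" in: ${input}`);
  }
  return parsed.href;
}
```

Usage would look like await page.goto(requireProtocol('http://localhost:3000')), so a missing protocol fails loudly instead of ending up on about:blank.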

@optikalefx
Author

My apologies on the title, but I do agree that the protocol issue is separate. My issue is that something about the request from the browser is different when headless is on vs off, causing the site in question to act differently.

@rosshadden

rosshadden commented Sep 3, 2017

Here is a gist of the problem. With params.isHeadless set to false the browser opens and the form successfully logs in, whereas with it set to true I get an auth error page (which I actually cannot replicate through normal means, no matter what permutations of correct/incorrect credentials I try).

Since the problem is behind an auth wall (or rather, the act of authenticating itself) I cannot share the exact code with my own credentials. However if you have or create your own vendorcentral account you should be able to see this behavior.

I wrote the code in such a way that it works for some other services as well, such as imgur. For this, just change params.url (to https://imgur.com/signin for example). It works on Imgur, which implies that Amazon is doing something explicit, however we have been as of yet unable to determine what that is, because as @optikalefx has said we have tried sporadic mouse movement, delayed typing, etc.

Note: I'll open another unrelated issue for this eventually as I need to do more research and experimentation, but I found that page.press('Enter') does not actually press the enter key. At least for me and my environment.

@LoganDark

LoganDark commented Sep 3, 2017

but I found that page.press('Enter') does not actually press the enter key

Try page.press('Return') as well..?

@rosshadden

@LoganDark That didn't work either. I probably shouldn't have brought it up here at all, completely unrelated. Let's ignore it.

@aslushnikov
Contributor

I'm curious to know what changes there are between running as headless true vs false.

@optikalefx The major change is the user agent: headless Chrome identifies itself as HeadlessChrome. Try running the following script in headless and headful modes:

const puppeteer = require('puppeteer');

(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Print the user agent string that the page reports to scripts.
  console.log(await page.evaluate(() => navigator.userAgent));
  await browser.close();
})();

The user agent is sent with every request as the User-Agent header. If needed, it can be changed with the page.setUserAgent method.
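
Rather than hard-coding a user agent, one option is to take the string the browser already reports and drop only the headless marker, so the version numbers stay truthful. A rough sketch, assuming a Puppeteer version that provides browser.userAgent(); stripHeadlessToken and useHeadfulUserAgent are hypothetical names.

```javascript
// Hypothetical helper: turn a headless UA string into its headful equivalent.
function stripHeadlessToken(userAgent) {
  return userAgent.replace('HeadlessChrome/', 'Chrome/');
}

// Sketch of use: browser and page are assumed to come from puppeteer.launch()
// and browser.newPage().
async function useHeadfulUserAgent(browser, page) {
  const original = await browser.userAgent(); // UA the browser reports by default
  await page.setUserAgent(stripHeadlessToken(original));
}
```

This only changes the user agent; as discussed further down, sites may look at other signals too.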

In non-headless mode, screenshots work differently because my screen is in HiDPI mode (MacBook Retina). Here's one of the 'different' screenshots:

@LoganDark please, file a separate issue.

Here is a gist of the problem.

@rosshadden try overriding user-agent in your gist. If this doesn't help, please file a separate issue.

@LoganDark

LoganDark commented Sep 6, 2017

From @Garbee:

@LoganDark that is a different issue completely. Please file your own for triage and discussion.

From @Garbee again:

Therefore a new issue is required to focus on your problem.

From @aslushnikov

@LoganDark please, file a separate issue.

Yeah, 3 times already I've been told to file a different issue.

I haven't. And I won't right now.

Stop telling me to.

@optikalefx
Author

@aslushnikov we need to re-open this ticket IMO. I'm sorry that this issue had unrelated things in it. Setting the user-agent doesn't change anything - as in something is still different about the request. The result of that user-agent log after it's set is exactly what I set it to.

Can you think of anything else that changes when headless is set to true? Something that Amazon is able to detect? Maybe something about cookies? Maybe you could point me in the right direction in the code and I can look through it myself. Since I'm unfamiliar with the codebase, some quick guidance would be very helpful.

@Garbee
Contributor

Garbee commented Sep 6, 2017

There are a few ways Amazon can be detecting headless access. Nothing can really be done internally about them if Amazon is implementing any techniques like this.

The primary difference is the Headless in the UA string. Beyond that, everything should function the same from the user's perspective in headless mode, as stated before.

@optikalefx
Author

@Garbee super interesting. So, why can't we just define things like language, plugins etc? I can't set things on navigator, but I can polyfill other methods to prevent detection. Maybe you guys can set the navigator settings?

@optikalefx
Author

optikalefx commented Sep 6, 2017

It looks like I can polyfill navigator using

Object.defineProperties(navigator, {
  'plugins': {
    value: ['adBlock'],
    writable: true
  }
});
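
Timing may matter here: an override applied with page.evaluate after goto runs only once the page's own scripts have already had a chance to inspect navigator. A minimal sketch of registering the override ahead of navigation with page.evaluateOnNewDocument, which Puppeteer runs in each new document before the page's scripts; overrideNavigator and installNavigatorOverrides are hypothetical names, and the overrides object is assumed to be JSON-serializable.

```javascript
// Runs inside the page: redefine navigator properties from a plain object.
// It must not rely on outer variables, because Puppeteer serializes it.
function overrideNavigator(overrides) {
  for (const [name, value] of Object.entries(overrides)) {
    Object.defineProperty(navigator, name, { get: () => value, configurable: true });
  }
}

// Registers the override so it is applied in every new document of this page.
function installNavigatorOverrides(page, overrides) {
  return page.evaluateOnNewDocument(overrideNavigator, overrides);
}
```

It would be called before navigating, for example await installNavigatorOverrides(page, { languages: ['en-US', 'en'] }) followed by page.goto. Whether this is enough for a given site is not something this sketch can promise.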

@optikalefx
Author

Well, I polyfilled everything in that article, and it passes all of those tests after the goto statement. But it is still getting caught. Quite interesting.

@rosshadden

@aslushnikov While my gist doesn't have a UA set, setting it was the first thing @optikalefx tried when we discovered this problem. What I can do is update my gist with setting the UA and the polyfills/workarounds we have tried since.

@aslushnikov
Contributor

@optikalefx @rosshadden Chrome headless is built atop the content/ layer and doesn't include the chrome/ layer, whereas headful Chrome includes both the content/ and chrome/ layers. So naturally, there might be multiple subtle ways to detect headless.

More on chromium architecture could be found here:

@koreus7

koreus7 commented Jan 10, 2018

As mentioned in the article @Garbee posted the headless version does not have languages set on the navigator object.

Note also that the headless version will not have languages set in its Accept-Language Header. Some sites (ASP.NET in my experience) require this header to be set. Other sites are looking for this header specifically to identify headless browsers.

I copied the value from an example request generated by my normal chrome install. There is probably a more minimal setting for this header that works.

await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
});
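
For anyone who would rather generate the header than copy it, here is a small sketch that builds the value from an ordered list of languages. buildAcceptLanguage is a hypothetical helper, and the 0.1 step between q-values is just an assumption that mirrors the example above.

```javascript
// Hypothetical helper: build an Accept-Language value from an ordered list.
// The first language gets no q-value; each later one drops by 0.1, floored at 0.1.
function buildAcceptLanguage(languages) {
  return languages
    .map((lang, i) => (i === 0 ? lang : `${lang};q=${Math.max(0.1, 1 - i * 0.1).toFixed(1)}`))
    .join(',');
}
```

It would be used as await page.setExtraHTTPHeaders({ 'Accept-Language': buildAcceptLanguage(['en-GB', 'en-US', 'en']) }).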

@hvaoc

hvaoc commented Apr 20, 2018

@koreus7 - Solution worked for Amazon issue reported by @optikalefx

@mercmobily

This is an absolute pearl. Thanks for sharing the code above.

@optikalefx
Author

I would also like to add that, for our implementation, we turned on 2FA and will keep it on. We have set up a number with Twilio (or a Twilio-like service) to receive the SMS code, and our login script then fetches that code from Twilio to enter into the 2FA prompt. We require this because Amazon sometimes asks for it, and rather than writing code that handles 2FA only some of the time, we just always assume 2FA.

@jondlm

jondlm commented Jun 7, 2018

For what it's worth, I've also found that adding the following user agent override can help smooth over differences in some cases:

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')

The UA I've provided is just an example. You can use any valid UA that matches an existing browser.

@felixfbecker

I noticed another difference: in non-headless mode the address seems to change from localhost to 127.0.0.1, which makes it difficult to assert on the URL.
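
One way to keep URL assertions stable across both modes is to normalize the host before comparing. A small sketch, assuming Node's built-in URL class; sameLocalUrl is a hypothetical helper and only handles the localhost / 127.0.0.1 pair.

```javascript
// Hypothetical helper: compare two URLs while treating 127.0.0.1 as localhost.
function sameLocalUrl(a, b) {
  const normalize = (input) => {
    const url = new URL(input);
    if (url.hostname === '127.0.0.1') url.hostname = 'localhost';
    return url.href;
  };
  return normalize(a) === normalize(b);
}
```

A test could then assert sameLocalUrl(page.url(), 'http://localhost:3000/dashboard') instead of comparing the strings directly.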

@roeniss

roeniss commented Aug 21, 2018

As @jondlm said, setting the UserAgent option makes headless Selenium behave the same as non-headless Selenium. Thanks.

@stefpe

stefpe commented Nov 12, 2018

@koreus7 setting the languages works like a charm!

@jslim89

jslim89 commented Apr 9, 2019

I got it working by adding these two:

await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-US,en;q=0.9'
});
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36');

Many thanks to @koreus7 and @jondlm; it won't work if either one of them is missing.

P.S. I was trying to access this site: www.blibli.com

@endel

endel commented Sep 28, 2019

I've made a fake user agent generator that works pretty fine!

function* generateUserAgent() {
  // Counters bumped after every yield so each generated UA is distinct.
  let webkitVersion = 10;
  let chromeVersion = 1000;

  // Platform tokens to rotate through.
  const so = [
    'Windows NT 6.1; WOW64',
    'Windows NT 6.2; Win64; x64',
    "Windows NT 5.1; Win64; x64",
    'Macintosh; Intel Mac OS X 10_12_6',
    "X11; Linux x86_64",
    "X11; Linux armv7l"
  ];
  // Start at a random platform, then cycle through the list in order.
  let soIndex = Math.floor(Math.random() * so.length);

  while (true) {
    yield `Mozilla/5.0 (${so[soIndex++ % so.length]}) AppleWebKit/537.${webkitVersion} (KHTML, like Gecko) Chrome/56.0.${chromeVersion}.87 Safari/537.${webkitVersion} OPR/43.0.2442.991`;

    webkitVersion++;
    chromeVersion++;
  }
}

const userAgents = generateUserAgent();

// ...
await page.setUserAgent(userAgents.next().value);

@andreabisello

So headless true/false changes the user agent and other things?
I have two different tests that work in headless: false mode but fail in headless: true mode, due to differences in font rendering and the time needed for a button to become clickable, but I cannot share them because the website is confidential.
I think headless true/false should not change the rendering process.
Should I consider setting a common user agent to make behaviour more consistent?
Thanks.
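
For the timing part, waiting explicitly for the element instead of relying on how fast one mode renders may help. A minimal sketch using page.waitForSelector with the visible option; clickWhenReady is a hypothetical helper and the 30-second default timeout is an arbitrary choice.

```javascript
// Hypothetical helper: wait until an element is visible, then click it.
async function clickWhenReady(page, selector, timeout = 30000) {
  // With visible: true this resolves once the element is in the DOM and not hidden.
  const handle = await page.waitForSelector(selector, { visible: true, timeout });
  await handle.click();
}
```

This does nothing about font rendering differences; it only removes the dependency on timing.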

@heathera2016

My case is completely the opposite of the OP's situation. I got Amazon's robot check with headless: false, and got past it with headless: true. I solved this issue thanks to @koreus7. Many thanks 👍

@gdossant

Using @koreus7 and @jondlm comments solved my problem

@Bhabaranjan19966

Recently, I had the same experience of getting blocked for using a headless browser while scraping a popular website. Even adding proper headers and a user agent didn't work.

Finally used puppeteer-extra with stealth mode plugin which fixed the problem.
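
For reference, the setup looks roughly like the sketch below. It assumes the puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed; getStealthPuppeteer and launchWithStealth are hypothetical wrapper names, while use(), StealthPlugin() and launch() follow the plugin's documented usage.

```javascript
// Sketch: launch a browser through puppeteer-extra with the stealth plugin.
// Assumes: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
let stealthPuppeteer = null;

function getStealthPuppeteer() {
  if (!stealthPuppeteer) {
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    puppeteer.use(StealthPlugin()); // register the plugin once, before any launch
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

async function launchWithStealth(options = { headless: true }) {
  return getStealthPuppeteer().launch(options);
}
```

The returned browser is used like a regular Puppeteer browser, for example const browser = await launchWithStealth() followed by browser.newPage().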

This thread helped me a lot to figure out what all could go wrong.

Thanks @Garbee @optikalefx

@andreabisello

Not working for me: headless and GUI mode render the page in slightly different ways.
[image]

@Bhabaranjan19966

@Bhabaranjan19966 so this https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra with this https://www.npmjs.com/package/puppeteer-extra-plugin-stealth ? i will try, thanks.

Yes, those are the two packages that fixed my problem. @andreabisello

@pgibler

pgibler commented Apr 18, 2020

I'm having this same issue with peapod.com right now. In headful mode, my program runs successfully. In headless mode, I'm screenshotting to debug and see that the link is clicked, spinner is activated, but the page never changes. How can I debug this better? @aslushnikov , could you provide me some guidance?
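
Beyond screenshots, forwarding the page's own console output, script errors and failed requests to the Node process can show what happens after the click. A minimal sketch using Puppeteer's 'console', 'pageerror' and 'requestfailed' page events; attachDebugLogging is a hypothetical helper.

```javascript
// Hypothetical helper: forward page-side diagnostics to a logger.
function attachDebugLogging(page, log = console.log) {
  page.on('console', (msg) => log(`[console.${msg.type()}] ${msg.text()}`));
  page.on('pageerror', (err) => log(`[pageerror] ${err.message}`));
  page.on('requestfailed', (req) => {
    const failure = req.failure(); // null when no failure details are available
    log(`[requestfailed] ${req.url()} ${failure ? failure.errorText : ''}`);
  });
}
```

Calling attachDebugLogging(page) right after browser.newPage(), in both headless and headful runs, makes it possible to compare the two logs side by side.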

@mewtcor

mewtcor commented Apr 24, 2020

Recently, I had the same experience of getting blocked because of using headless browser. While scraping a popular website. Even after adding proper headers and user agent it didn't work out.

Finally used puppeteer-extra with stealth mode plugin which fixed the problem.

This thread helped me a lot to figure out what all could go wrong.

Thanks @Garbee @optikalefx

The stealth mode did the trick for me too! TYVM

@peterhil

peterhil commented May 7, 2020

None of these suggested solutions work on Mac OS X. To reproduce:

  1. Change your system language to something other than en-US or en, so that applications use that locale.
  2. Test a browser extension or web site that is internationalised by selected user locale.
  3. It is impossible to test or change the browser locale to en-US, in non-headless mode at least.

What I am trying to do, is setup testing with Puppeteer for my browser extension Spellbook.

I have the first test now passing on Mac OS X (using some Finnish strings), and it is probably failing on other systems when you do yarn run test:puppeteer, because I use every method of setting the locale: peterhil/spellbook@3480a73

@harshvats2000

For what it's worth I've also found that adding the following user agents override can help smooth over differences in some cases:

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')

The UA I've provided is just an example. You can use any valid UA that matches an existing browser.

Add this just below where page is defined.

@mishra5047

I'm curious to know what changes there are between running as headless true vs false. When I run a login to Amazon using headless: true I get an error from Amazon via the screenshot. But when I set headless: false I watch it work just fine, no error.

So I'm trying to figure out what headless: true is doing that is different from when it's not headless.

Thanks to any suggestions.

I am using Puppeteer to write a simple automation script that logs in to my Google account. It works fine in headless: false mode, but with headless: true it reports that the selector was not found.

@sajjadafridi

Can anyone provide an actual example file to run that reproduces this issue?

I have the same problem on grainger.com. With headless: false it works, but headless: true returns a promise handling error.

Any help will be appreciated.

@UsmanGhani-Emumba

None of the above methods work for me; I am still facing differences between headless and normal mode. Any help will be appreciated.

@JavedBoqo

I had the same issue, and combining puppeteer-extra with the following library solved it:
https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
