How to scrape an element #489

Closed
xiki808 opened this issue Aug 23, 2017 · 17 comments

@xiki808

xiki808 commented Aug 23, 2017

Hi, I think this is an obvious question but still I cannot figure out how to grab text from an element.

<span class="scrape">HelloPuppeteer</span>

How can I scrape this 'HelloPuppeteer' text using Puppeteer?

xiki808 changed the title from "How to scrape part of the page" to "How to scrape an element" on Aug 23, 2017
@aslushnikov
Contributor

The following should work:

  // ...
  const text = await page.evaluate(() => document.querySelector('.scrape').textContent);
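
Note that page.evaluate() returns a Promise, so the call has to be awaited inside an async function. A minimal sketch of that pattern follows. getText is a hypothetical helper name, not part of Puppeteer, and it assumes page is a Puppeteer Page object. It also guards against the selector matching nothing, which would otherwise throw inside the page.

```javascript
// Hypothetical helper (not a Puppeteer API): resolves to the text of the
// first element matching `selector`, or to null when nothing matches.
// Assumes `page` is a Puppeteer Page.
async function getText(page, selector) {
  return page.evaluate(sel => {
    // This callback runs in the browser, so `document` is the page's DOM.
    const el = document.querySelector(sel);
    return el ? el.textContent : null;
  }, selector);
}
```

Inside an async function this could be called as `const text = await getText(page, '.scrape');`.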

@vsemozhetbyt
Contributor

@Xikky Also, if your element has a more complex structure, it is better to use innerText to get more readable content: textContent returns the raw whitespace from the markup and ignores the whitespace (including line breaks) that block elements introduce. Compare:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const textContent = await page.evaluate(() => document.querySelector('p').textContent);
  const innerText = await page.evaluate(() => document.querySelector('p').innerText);

  console.log(textContent);
  console.log(innerText);

  await browser.close();
})();
This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.

This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.

@xiki808
Author

xiki808 commented Aug 23, 2017

This is very nice of you guys! Thanks for your help :)

@psy901

psy901 commented Feb 21, 2018

How can I handle multiple elements?

For example, what if I use document.querySelectorAll('.scrape')?

@vsemozhetbyt
Contributor

vsemozhetbyt commented Feb 21, 2018

@psy901 You can return an array of texts or a joined string:

  const textsArray = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
  );
  const textsJoined = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
  );
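
As an alternative sketch, Puppeteer's page.$$eval() runs Array.from(document.querySelectorAll(selector)) in the page and passes the resulting array to the callback, so no spread is needed. getAllTexts below is a hypothetical helper name, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper (not a Puppeteer API): resolves to an array with the
// innerText of every element matching `selector`. Assumes `page` is a
// Puppeteer Page.
async function getAllTexts(page, selector) {
  // page.$$eval hands the callback a real array of matching elements,
  // so .map() is available directly.
  return page.$$eval(selector, elems => elems.map(elem => elem.innerText));
}
```

Inside an async function: `const textsArray = await getAllTexts(page, '.scrape');`.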

@psy901

psy901 commented Feb 21, 2018

I tried with your advice as follows:

        const text = await webPage.evaluate(() => {
          return [...document.querySelectorAll(
              "#ewt_main_structure_body > div > div:nth-child(4) > table > tbody > tr"
            )
          ].map(elem => elem.innerText).join("\n");
        });

but received the following error

Error: Evaluation failed: ReferenceError: _toConsumableArray is not defined
    at <anonymous>:2:22
    at ExecutionContext.evaluateHandle (/Users/sangyunpark/Dev/wagyu/node_modules/puppeteer/lib/ExecutionContext.js:66:13)
    at <anonymous>
    at process._tickDomainCallback (internal/process/next_tick.js:228:7)

Any advice would be appreciated 👍

** added
I thought of a way to convert the return value of querySelectorAll, and came up with using Array.prototype.slice.call() to turn the NodeList into an array.

const a_arr = await webPage.evaluate(() => {
    let names = document.querySelectorAll(
        "#ewt_main_structure_body > div > div:nth-child(4) > table > tbody > tr > td > a"
    );
    let arr = Array.prototype.slice.call(names);
    let text_arr = [];
    for (let i = 0; i < arr.length; i += 1) {
        text_arr.push(arr[i].innerHTML);
    }
    return text_arr;
});

A bit verbose though

@vsemozhetbyt
Contributor

Seems like a Babel issue. Do you transpile?

@psy901

psy901 commented Feb 21, 2018

I'm not familiar with the term "transpile", but if you mean whether I am using Babel, yes.
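
The likely cause: Puppeteer sends the function to the browser as source text, so a helper that Babel injects at module scope (here _toConsumableArray, which Babel emits for spread syntax) does not exist in the page. The sketch below reproduces the effect in plain Node.js, with vm.runInNewContext standing in for the page context. It is an illustration under that assumption, not Puppeteer's actual code path.

```javascript
const vm = require('vm');

// Stand-in for the module-scope helper that Babel emits for spread syntax.
function _toConsumableArray(iterable) {
  return Array.from(iterable);
}

// Roughly what a transpiled page function looks like.
const pageFunction = () => _toConsumableArray(new Set(['a', 'b'])).join('\n');

// Works locally, because the helper is in scope here.
const localResult = pageFunction();

// Serialized to text and run in a fresh context, the helper is missing.
let remoteError = null;
try {
  vm.runInNewContext(`(${pageFunction.toString()})()`);
} catch (err) {
  remoteError = `${err.name}: ${err.message}`;
}

console.log(JSON.stringify(localResult));
console.log(remoteError);
```

Possible workarounds that avoid the helper: exclude the scraping script from transpilation, pass the page function as a string (page.evaluate() accepts a string as well as a function), or convert the NodeList without spread syntax, as in the Array.prototype.slice.call() version above.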

@putuoka

putuoka commented Mar 29, 2018

How can I handle multiple elements?

For example, what if I use document.querySelectorAll('.scrape')?

@psy901 You can iterate through it like an array.


function extractItems() {
  const extractedElements = document.querySelectorAll('.scrape');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}

let items = await page.evaluate(extractItems);

@Manngold

Manngold commented Jan 25, 2019

@vsemozhetbyt

  const textsArray = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
  );
  const textsJoined = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
  );

I have one question about this code: why use ... (spread syntax)?

When I delete the spread syntax, the return value is null. Why?

@vsemozhetbyt
Contributor

vsemozhetbyt commented Jan 25, 2019

document.querySelectorAll() returns a NodeList collection. It lacks array methods such as .map(), so we need to make an array from it. We do this by spreading the NodeList iterable into an array. Without spreading, we have an array with just one element, the NodeList collection itself. It has no innerText property, so .map() returns [undefined], which Puppeteer serializes into [null].
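
The difference can be sketched in plain Node.js without a browser. FakeNodeList below is a hypothetical stand-in for a NodeList (iterable, but without .map()), and JSON.stringify is used only to approximate how the result is serialized on its way back from the page.

```javascript
// Hypothetical stand-in for a NodeList: iterable, but it has no .map().
class FakeNodeList {
  constructor(items) {
    this.items = items;
  }
  [Symbol.iterator]() {
    return this.items[Symbol.iterator]();
  }
}

const list = new FakeNodeList([{ innerText: 'one' }, { innerText: 'two' }]);

// Spreading iterates the collection, producing an array of its elements.
const withSpread = [...list].map(elem => elem.innerText);

// Without spreading, the array holds a single item: the collection itself.
const withoutSpread = [list].map(elem => elem.innerText);

console.log(withSpread);                    // [ 'one', 'two' ]
console.log(withoutSpread);                 // [ undefined ]
console.log(JSON.stringify(withoutSpread)); // [null]
```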

@KSmith0x86

@psy901 You can return an array of texts or a joined string:

  const textsArray = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
  );
  const textsJoined = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
  );

This was SOOOOOOOO helpful to me. I've been struggling for like 3 weeks trying to figure out how to scrape innerText from multiple elements. The google example runs perfectly fine (puppeteer-master/examples/search.js) But since I am new to JS it looked a bit complicated and I tried to strip out the code and simplify it down. I must have totally overlooked the map in their function.

vsemozhetbyt - I was building my example from the post above that uses example.com and shows the difference between innerText & textContent. I built a script off that code and I then tried to modify the code to extract multiple paragraphs off a wordpress site, but the arrays always came back empty or undefined or I had unresolved promise errors.

I am really new to JS and have been learning as I play with puppeteer. I kept creating arrays from page.evaluate and then document.querySelectorAll but the array came back as just one or two empty parentheses. I didn't realize you get a NodeList and have to deconstruct it / iterate through it to create a normal array of elements. This was so helpful!!! I'm gonna go review how map works, and see if I can find more info about the returned NodeList (is that an object?) I just wanted to say thank you!!! I'm wondering if a lot of people have had this problem like me and the OP.

I originally was testing scripts on a fedora system and figured there was some kind of error with node/puppeteer due to some security settings in fedora. I installed puppeteer on an ubuntu VM and the example above that scrapes example.com finally worked fine. So I figured some of my issues were with my setup. I'm not very good at debugging JS yet (other than spitting out variables in console.log to see what comes out at different sections of the script) So I was really struggling trying to find out why my variables/arrays from document.querySelector(All) weren't returning what I expected to be getting.
(I assumed I would just get an array with each element as an index of the array)

But alas, even though I was working from most of this code since the beginning, I guess I never scrolled down in this post far enough to see you guys discussing multiple elements!! (doh!) I guess I just assumed if I just created an array and used the same code as when I extracted a single element from example.com it would all work. I couldn't figure out where I went wrong!! I am so excited I found this!!!! Hurray!! Immediately my script that was testing out pulling an array of elements started working after adding .map. I'm glad I kept working at it and didn't try to go do scraping with Python instead. (which would still be fun but I was really looking forward to using and learning chrome headless/js/chrome dev-tools.) Thank you x1000. You guys are amazing.

@vsemozhetbyt
Contributor

vsemozhetbyt commented Feb 4, 2019

@KSmith0x86 Thank you for the kind words!

You can read more about Array.prototype.map() and NodeList on MDN.

Also, it would be useful to read about the JSON.stringify() restrictions to understand what can be returned from page.evaluate() (Nodes and Elements cannot be returned as they have circular references and methods, but their stringifiable properties can be returned, as we can see).
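
Those restrictions can be seen in plain Node.js. The object below is a hypothetical stand-in for a DOM element (it is not a real Element), and JSON.stringify is used as an approximation of what page.evaluate() can send back.

```javascript
// Hypothetical stand-in for a DOM element: it has a method and a
// circular reference through its parent, as real nodes do.
const node = { tagName: 'SPAN', innerText: 'HelloPuppeteer', click() {} };
node.parent = { children: [node] };

// Circular structures cannot be stringified at all.
let errorName = null;
try {
  JSON.stringify(node);
} catch (err) {
  errorName = err.name;
}

// Picking out plain properties works; function-valued ones are dropped.
const plain = JSON.stringify({
  tagName: node.tagName,
  innerText: node.innerText,
  click: node.click,
});

console.log(errorName); // TypeError
console.log(plain);     // {"tagName":"SPAN","innerText":"HelloPuppeteer"}
```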

Happy coding)

@CQBinh

CQBinh commented Mar 5, 2019

@aslushnikov What if I would like to get the href attribute of a span tag?

I tried elementHandle.getProperty(propertyName) but it doesn't work.

P.S. It works when I do the same thing with an <a> tag.

@VersLaFlamme

VersLaFlamme commented Mar 29, 2020

@aslushnikov What if I would like to get the href attribute of a span tag?

I tried elementHandle.getProperty(propertyName) but it doesn't work.

P.S. It works when I do the same thing with an <a> tag.

Have you tried
const hrefAttribute = await page.evaluate(() => document.querySelector('span').href) ?
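
One thing to keep in mind: href is a JavaScript property on elements such as <a>, but a span has no such property, so reading .href (or calling getProperty('href')) on a span yields undefined even when the markup contains an href attribute. Reading the attribute itself with getAttribute() may be more reliable. A minimal sketch, where getAttributeValue is a hypothetical helper name and page is assumed to be a Puppeteer Page:

```javascript
// Hypothetical helper (not a Puppeteer API): resolves to the value of the
// attribute `name` on the first element matching `selector`, or to null if
// the attribute is absent. Assumes `page` is a Puppeteer Page.
// page.$eval rejects if no element matches the selector.
async function getAttributeValue(page, selector, name) {
  return page.$eval(selector, (el, attr) => el.getAttribute(attr), name);
}
```

Inside an async function: `const href = await getAttributeValue(page, 'span', 'href');`.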

@Sekai966

const element = await page.$(".btn");
const text = await page.evaluate(element => element.textContent, element);

If multiple elements have the btn class, how can I get the second or the third one? This code only gives the first one.

@vsemozhetbyt
Contributor

You can use page.$$(selector) with array indices:

  const element = (await page.$$("p"))[0];
  const text = await page.evaluate(element => element.textContent, element);
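
Array indices are zero-based, so the second match is index 1 and the third is index 2. A minimal sketch that does the lookup in a single round trip with page.$$eval(); getNthText is a hypothetical helper name and page is assumed to be a Puppeteer Page:

```javascript
// Hypothetical helper (not a Puppeteer API): resolves to the textContent of
// the element at zero-based `index` among those matching `selector`, or to
// null if there are not that many matches. Assumes `page` is a Puppeteer Page.
async function getNthText(page, selector, index) {
  return page.$$eval(
    selector,
    (elems, i) => (elems[i] ? elems[i].textContent : null),
    index
  );
}
```

Inside an async function, `await getNthText(page, '.btn', 1)` would give the text of the second .btn element.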
