How to scrape an element #489

xiki808 · 2017-08-23T08:25:24Z

Hi, I think this is an obvious question but still I cannot figure out how to grab text from an element.

<span class="scrape">HelloPuppeteer</span>

How can I scrape this 'HelloPuppeteer' text using Puppeteer?


aslushnikov · 2017-08-23T08:49:48Z

The following should work:

  // ...
  const text = await page.evaluate(() => document.querySelector('.scrape').textContent);

vsemozhetbyt · 2017-08-23T08:57:35Z

@Xikky Also, if your element has a more complex structure, innerText usually gives more readable content: textContent returns the raw whitespace from the markup and ignores the whitespace (including line breaks) implied by block elements. Compare:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const textContent = await page.evaluate(() => document.querySelector('p').textContent);
  const innerText = await page.evaluate(() => document.querySelector('p').innerText);

  console.log(textContent);
  console.log(innerText);

  await browser.close();
})();

This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.

This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.

xiki808 · 2017-08-23T09:18:54Z

This is very nice of you guys! Thanks for your help :)

psy901 · 2018-02-21T00:26:06Z

How can I handle multiple elements?

For example, what if I use document.querySelectorAll('.scrape')?

vsemozhetbyt · 2018-02-21T01:11:49Z

@psy901 You can return an array of texts or a joined string:

  const textsArray = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
  );
  const textsJoined = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
  );

psy901 · 2018-02-21T01:46:13Z

I tried with your advice as follows:

        const text = await webPage.evaluate(() => {
          [...document.querySelectorAll(
              "#ewt_main_structure_body > div > div:nth-child(4) > table > tbody > tr"
            )
          ].map(elem => elem.innerText).join("\n");
        });

but received the following error

Error: Evaluation failed: ReferenceError: _toConsumableArray is not defined
    at <anonymous>:2:22
    at ExecutionContext.evaluateHandle (/Users/sangyunpark/Dev/wagyu/node_modules/puppeteer/lib/ExecutionContext.js:66:13)
    at <anonymous>
    at process._tickDomainCallback (internal/process/next_tick.js:228:7)

Any advice would be appreciated 👍

** added
I looked for a way to convert the return value of querySelectorAll and came up with using Array.prototype.slice.call() to turn the NodeList into an array.

const a_arr = await webPage.evaluate(() => {
    let names = document.querySelectorAll(
        "#ewt_main_structure_body > div > div:nth-child(4) > table > tbody > tr > td > a"
    );
    let arr = Array.prototype.slice.call(names);
    let text_arr = [];
    for (let i = 0; i < arr.length; i += 1) {
        text_arr.push(arr[i].innerHTML);
    }
    return text_arr;
});

A bit verbose though

vsemozhetbyt · 2018-02-21T02:02:06Z

Seems like a Babel issue. Do you transpile your code?
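
If Babel is involved, the likely cause is that it rewrites the spread syntax into a call to a helper such as _toConsumableArray that lives outside the function, while Puppeteer sends only the function's own source to the page. One possible workaround, offered as a sketch rather than a confirmed fix for this setup, is to pass page.evaluate() a string, which a transpiler leaves untouched. The helper names buildTextsExpression and scrapeTexts are hypothetical, and page is assumed to be a Puppeteer Page:

```javascript
// Hypothetical helper: builds the page-side code as a plain string,
// so a transpiler such as Babel has nothing to rewrite.
function buildTextsExpression(selector) {
  // JSON.stringify quotes and escapes the selector safely.
  const quoted = JSON.stringify(selector);
  return `Array.from(document.querySelectorAll(${quoted}), el => el.innerText)`;
}

// Hypothetical wrapper: `page` is assumed to be a Puppeteer Page object.
// page.evaluate() accepts a string as well as a function.
async function scrapeTexts(page, selector) {
  return page.evaluate(buildTextsExpression(selector));
}

// Usage (inside an async function):
//   const texts = await scrapeTexts(page, '.scrape');
```

Array.from() with a mapping function does the same job as spreading and then calling .map(), without needing the spread syntax at all.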

psy901 · 2018-02-21T05:11:54Z

I'm not familiar with the term transpile, but if you mean whether I am using Babel, then yes.

putuoka · 2018-03-29T08:50:24Z

How can I handle multiple elements?

For example, what if I use document.querySelectorAll('.scrape')?

@psy901 You can iterate over it like an array:


function extractItems() {
  const extractedElements = document.querySelectorAll('.scrape');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}

let items = await page.evaluate(extractItems);

Manngold · 2019-01-25T07:31:34Z

@vsemozhetbyt

  const textsArray = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
  );
  const textsJoined = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
  );

I have one question about this code: why use ... (rest)?

When I delete the rest syntax, the return value is null. Why?

vsemozhetbyt · 2019-01-25T10:41:00Z

document.querySelectorAll() returns a NodeList collection, which has no array methods like .map(), so we need to make an array from it. We do this by spreading the iterable NodeList into an array. Without spreading, we get an array with just one element, the NodeList collection itself. It has no innerText property, so .map() returns [undefined], which Puppeteer serializes into [null].
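
The difference can be reproduced in plain Node.js without a browser. This is only an illustration: FakeNodeList is a made-up stand-in for a real NodeList (iterable, but with no .map() method), and JSON.stringify() is used here to approximate the serialization step.

```javascript
// Made-up stand-in for a NodeList: iterable, but it has no .map() method.
class FakeNodeList {
  constructor(items) {
    this.items = items;
  }
  [Symbol.iterator]() {
    return this.items[Symbol.iterator]();
  }
}

const list = new FakeNodeList([{ innerText: 'one' }, { innerText: 'two' }]);

// With spreading: each element of the collection becomes an array item.
const spread = [...list].map(elem => elem.innerText);

// Without spreading: the array holds a single item, the collection itself,
// and the collection has no innerText property.
const wrapped = [list].map(elem => elem.innerText);

console.log(spread);                  // [ 'one', 'two' ]
console.log(wrapped);                 // [ undefined ]
console.log(JSON.stringify(wrapped)); // [null]
```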

KSmith0x86 · 2019-02-04T21:44:41Z

@psy901 You can return an array of texts or a joined string:

  const textsArray = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
  );
  const textsJoined = await page.evaluate(
    () => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
  );

This was SOOOOOOOO helpful to me. I've been struggling for like 3 weeks trying to figure out how to scrape innerText from multiple elements. The google example runs perfectly fine (puppeteer-master/examples/search.js) But since I am new to JS it looked a bit complicated and I tried to strip out the code and simplify it down. I must have totally overlooked the map in their function.

vsemozhetbyt - I was building my example from the post above that uses example.com and shows the difference between innerText & textContent. I built a script off that code and I then tried to modify the code to extract multiple paragraphs off a wordpress site, but the arrays always came back empty or undefined or I had unresolved promise errors.

I am really new to JS and have been learning as I play with puppeteer. I kept creating arrays from page.evaluate and then document.querySelectorAll, but the array returned just one or two empty parentheses. I didn't realize you get a NodeList and have to deconstruct it / iterate through it to create a normal array of elements. This was so helpful!!! I'm gonna go review how map works, and see if I can find more info about the returned NodeList (is that an object?). I just wanted to say thank you!!! I'm wondering if a lot of people have had this problem like me and the OP.

I originally was testing scripts on a fedora system and figured there was some kind of error with node/puppeteer due to some security settings in fedora. I installed puppeteer on an ubuntu VM and the example above that scrapes example.com finally worked fine. So I figured some of my issues were with my setup. I'm not very good at debugging JS yet (other than spitting out variables in console.log to see what comes out at different sections of the script) So I was really struggling trying to find out why my variables/arrays from document.querySelector(All) weren't returning what I expected to be getting.
(I assumed I would just get an array with each element as an index of the array)

But alas, even though I was working from most of this code since the beginning, I guess I never scrolled down in this post far enough to see you guys discussing multiple elements!! (doh!) I guess I just assumed if I just created an array and used the same code as when I extracted a single element from example.com it would all work. I couldn't figure out where I went wrong!! I am so excited I found this!!!! Hurray!! Immediately my script that was testing out pulling an array of elements started working after adding .map. I'm glad I kept working at it and didn't try to go do scraping with Python instead. (which would still be fun but I was really looking forward to using and learning chrome headless/js/chrome dev-tools.) Thank you x1000. You guys are amazing.

vsemozhetbyt · 2019-02-04T22:46:06Z

@KSmith0x86 Thank you for the kind words!

You can read more about Array.prototype.map() and NodeList on MDN.

Also, it would be useful to read about the JSON.stringify() restrictions to understand what can be returned from page.evaluate(): Nodes and Elements cannot be returned because they have circular references and methods, but their stringifiable properties can be, as we have seen.
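
A small plain Node.js sketch of those restrictions follows. The fakeElement object is invented for illustration and only imitates two traits of a DOM element (a method and a circular reference); how Puppeteer serializes return values internally may differ between versions.

```javascript
// Invented object imitating two traits of a DOM element.
const fakeElement = {
  tagName: 'SPAN',
  textContent: 'HelloPuppeteer',
  click() {},               // a method
};
fakeElement.ownerDocument = { body: fakeElement }; // a circular reference

// Circular structures cannot be stringified: JSON.stringify() throws.
let circularError = null;
try {
  JSON.stringify(fakeElement);
} catch (err) {
  circularError = err;
}
console.log(circularError instanceof TypeError); // true

// Methods are silently dropped from the output.
const methodsOnly = JSON.stringify({ click() {} });
console.log(methodsOnly); // {}

// Plain string properties copied into a new object survive intact.
const picked = JSON.stringify({
  tag: fakeElement.tagName,
  text: fakeElement.textContent,
});
console.log(picked); // {"tag":"SPAN","text":"HelloPuppeteer"}
```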

Happy coding)

CQBinh · 2019-03-05T03:50:38Z

@aslushnikov What if I would like to get the href attribute of a span tag?

I tried elementHandle.getProperty(propertyName) but it doesn't work.

P.S. It works when I do the same thing on an a tag.

VersLaFlamme · 2020-03-29T14:18:49Z

@aslushnikov What if I would like to get the href attribute of a span tag?

I tried elementHandle.getProperty(propertyName) but it doesn't work.

P.S. It works when I do the same thing on an a tag.

Have you tried the following?
const hrefAttribute = await page.evaluate(() => document.querySelector('span').href);
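
A possible explanation for the difference between the two tags: href exists as a DOM property on elements such as a, but a span element has no such property, so a property lookup yields undefined even when the markup carries an href attribute. Reading the attribute itself may work instead. The sketch below assumes page is a Puppeteer Page; readAttribute is a hypothetical name for a function meant to run inside the page.

```javascript
// Hypothetical page-side function: reads an attribute from the first
// element matching the selector, or returns null if nothing matches.
function readAttribute(selector, attributeName) {
  const element = document.querySelector(selector);
  return element ? element.getAttribute(attributeName) : null;
}

// Usage (inside an async function, with `page` assumed to be a Puppeteer Page):
//   const href = await page.evaluate(readAttribute, 'span', 'href');
```

Extra arguments after the function are passed through by page.evaluate(), so the selector and attribute name do not need to be hard-coded.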

Sekai966 · 2021-01-29T15:51:02Z

const element = await page.$(".btn");
const text = await page.evaluate(element => element.textContent, element);

If multiple elements have the btn class, how do I get the second or third one? This code only gives me the first one.

vsemozhetbyt · 2021-01-29T16:05:45Z

You can use page.$$(selector) with array indices:

  const element = (await page.$$("p"))[0];
  const text = await page.evaluate(element => element.textContent, element);
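
An alternative sketch picks the element by index inside the page, so only the text crosses back to Node.js. The function name textOfNth is hypothetical, page is assumed to be a Puppeteer Page, and the index is zero-based, so 1 selects the second match.

```javascript
// Hypothetical page-side function: returns the text of the element at the
// given zero-based index among all matches, or null if there is no such match.
function textOfNth(selector, index) {
  const elements = document.querySelectorAll(selector);
  const element = elements[index];
  return element ? element.textContent : null;
}

// Usage (inside an async function, with `page` assumed to be a Puppeteer Page):
//   const secondText = await page.evaluate(textOfNth, '.btn', 1);
//   const thirdText = await page.evaluate(textOfNth, '.btn', 2);
```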

xiki808 changed the title ~~How to scrape part of the page~~ How to scrape an element Aug 23, 2017

aslushnikov closed this as completed Aug 23, 2017

tyt34 mentioned this issue Jan 24, 2018

How get information from tag use puppeteer ? #1897

Closed

kirkstrobeck mentioned this issue Nov 2, 2018

Recommended approach to get innerText kpdecker/react-query-selector#2

Closed

Manngold mentioned this issue Jan 25, 2019

How to scrap multiple innerText? using queryselectorAll #3840

Closed

erdun mentioned this issue Jan 10, 2023

[Snyk] Fix for 1 vulnerabilities erdun/puppeteer#6

Open
