How to scrape an element #489
Comments
The following should work:
// ...
const text = await page.evaluate(() => document.querySelector('.scrape').textContent); |
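A null-safe variant of the one-liner above may be useful, because document.querySelector returns null when nothing matches and reading textContent from null throws inside the page. This is only a sketch: the helper name scrapeText is hypothetical (it is not part of Puppeteer), and page is assumed to be a Puppeteer Page object.

```javascript
// Hypothetical helper (not a Puppeteer API): returns the text of the first
// element matching `selector`, or null when nothing matches.
async function scrapeText(page, selector) {
  // The callback runs in the browser context, so `document` is the page's DOM.
  // The selector is passed as an argument because the callback cannot see
  // variables from the Node.js scope.
  return page.evaluate(sel => {
    const element = document.querySelector(sel);
    return element ? element.textContent : null;
  }, selector);
}
```

Inside an async function it could be called as const text = await scrapeText(page, '.scrape');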
@Xikky Also, if your element does not have such a simple structure, it may be better to use innerText instead of textContent:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
const textContent = await page.evaluate(() => document.querySelector('p').textContent);
const innerText = await page.evaluate(() => document.querySelector('p').innerText);
console.log(textContent);
console.log(innerText);
await browser.close();
})();
Output (textContent first, then innerText):
This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.
This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission. |
This is very nice of you guys! Thanks for your help :) |
How can I handle multiple elements? For example, what if I use |
@psy901 You can return an array of texts or a joined string:
const textsArray = await page.evaluate(
() => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText)
);
const textsJoined = await page.evaluate(
() => [...document.querySelectorAll('.scrape')].map(elem => elem.innerText).join('\n')
); |
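As an alternative to spreading the querySelectorAll result by hand, Puppeteer versions that provide page.$$eval can do the conversion for you. The sketch below assumes page is a Puppeteer Page; the helper name scrapeAllTexts is hypothetical.

```javascript
// Hypothetical helper: collects innerText from every element matching `selector`.
async function scrapeAllTexts(page, selector) {
  // page.$$eval runs querySelectorAll in the page and hands the callback a
  // real Array of elements, so .map works without spreading.
  return page.$$eval(selector, elements => elements.map(el => el.innerText));
}
```

For example, const texts = await scrapeAllTexts(page, '.scrape'); resolves to an array of strings, which can then be joined in Node.js if a single string is preferred.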
I tried following your advice:
const text = await webPage.evaluate(() => {
[...document.querySelectorAll(
"#ewt_main_structure_body > div > div:nth-child(4) > table > tbody > tr"
)
].map(elem => elem.innerText).join("\n");
});
but I received the following error
Any advice would be appreciated 👍
Added:
const a_arr = await webPage.evaluate(() => {
let names = document.querySelectorAll(
"#ewt_main_structure_body > div > div:nth-child(4) > table > tbody > tr > td > a"
);
let arr = Array.prototype.slice.call(names);
let text_arr = [];
for (let i = 0; i < arr.length; i += 1) {
text_arr.push(arr[i].innerHTML);
}
return text_arr;
});
A bit verbose, though. |
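Separately from any transpiler problem, one thing that may be worth checking in the first snippet of this comment is the arrow function body: when the body is wrapped in braces, nothing is returned unless there is an explicit return, so evaluate would resolve to undefined. A minimal sketch in plain Node.js, with made-up sample data:

```javascript
// Made-up sample data standing in for the table rows' text.
const rows = ['first row', 'second row'];

// Braces create a block body: the expression is evaluated but nothing is returned.
const withoutReturn = () => {
  rows.map(text => text.toUpperCase()).join('\n');
};

// Either add an explicit `return` ...
const withReturn = () => {
  return rows.map(text => text.toUpperCase()).join('\n');
};

// ... or drop the braces to use a concise body that returns the expression.
const conciseBody = () => rows.map(text => text.toUpperCase()).join('\n');

console.log(withoutReturn());                // prints undefined
console.log(withReturn() === conciseBody()); // prints true
```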
This seems like a Babel issue. Do you transpile your code? |
I'm not familiar with that term. |
@psy901 You can iterate through the array. |
@vsemozhetbyt const textsJoined = await page.evaluate(
I have one question: why use the ... (spread) syntax? When I delete it, the return value is null. Why? |
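A likely reason the spread matters: document.querySelectorAll returns a NodeList, which is iterable but has no map method, so it has to be copied into a real Array first. The sketch below runs in plain Node.js and uses a stand-in class, FakeNodeList, which is purely hypothetical and only mimics the relevant behaviour (iterable, has a length, no map). It does not reproduce the exact null result, which depends on how the snippet was modified.

```javascript
// Stand-in for a browser NodeList so the sketch runs in plain Node.js:
// it is iterable and has a length, but (like NodeList) no map method.
class FakeNodeList {
  constructor(items) {
    items.forEach((item, i) => { this[i] = item; });
    this.length = items.length;
  }
  *[Symbol.iterator]() {
    for (let i = 0; i < this.length; i += 1) yield this[i];
  }
}

const nodes = new FakeNodeList([{ innerText: 'one' }, { innerText: 'two' }]);

// There is no map on the list itself, so nodes.map(...) would throw.
const hasMap = typeof nodes.map === 'function'; // false

// Spreading copies the items into a real Array, which does have map.
const viaSpread = [...nodes].map(node => node.innerText);

// Array.from does the same conversion and can map in one step.
const viaArrayFrom = Array.from(nodes, node => node.innerText);

// Without the spread, the array holds a single item: the list itself,
// which has no innerText property.
const wrapped = [nodes].map(node => node.innerText);

console.log(hasMap);       // false
console.log(viaSpread);    // [ 'one', 'two' ]
console.log(viaArrayFrom); // [ 'one', 'two' ]
console.log(wrapped);      // [ undefined ]
```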
This was SOOOOOOOO helpful to me. I've been struggling for like 3 weeks trying to figure out how to scrape innerText from multiple elements. The Google example runs perfectly fine (puppeteer-master/examples/search.js), but since I am new to JS it looked a bit complicated, so I tried to strip out the code and simplify it down. I must have totally overlooked the map in their function.

vsemozhetbyt - I was building my example from the post above that uses example.com and shows the difference between innerText & textContent. I built a script off that code and then tried to modify it to extract multiple paragraphs off a WordPress site, but the arrays always came back empty or undefined, or I had unresolved promise errors. I am really new to JS and have been learning as I play with Puppeteer. I kept creating arrays from page.evaluate and then document.querySelectorAll, but the array returned just one or two empty parentheses. I didn't realize you get a NodeList and have to deconstruct it / iterate through it to create a normal array of elements. This was so helpful!!! I'm gonna go review how map works, and see if I can find more info about the returned NodeList (is that an object?). I just wanted to say thank you!!!

I'm wondering if a lot of people have had this problem like me and the OP. I originally was testing scripts on a Fedora system and figured there was some kind of error with Node/Puppeteer due to some security settings in Fedora. I installed Puppeteer on an Ubuntu VM and the example above that scrapes example.com finally worked fine, so I figured some of my issues were with my setup. I'm not very good at debugging JS yet (other than spitting out variables in console.log to see what comes out at different sections of the script), so I was really struggling to find out why my variables/arrays from document.querySelector(All) weren't returning what I expected.

But alas, even though I was working from most of this code since the beginning, I guess I never scrolled down in this post far enough to see you guys discussing multiple elements!! (doh!) I guess I just assumed that if I created an array and used the same code as when I extracted a single element from example.com, it would all work. I couldn't figure out where I went wrong!! I am so excited I found this!!!! Hurray!! My script that was testing out pulling an array of elements started working immediately after adding .map. I'm glad I kept working at it and didn't try to go do scraping with Python instead (which would still be fun, but I was really looking forward to using and learning Chrome headless/JS/Chrome DevTools). Thank you x1000. You guys are amazing. |
@KSmith0x86 Thank you for kind words! You can read more about Also, it would be useful to read about Happy coding) |
@aslushnikov How about if I would like to get the I tried elementHandle.getProperty(propertyName) but it doesn't work. P/s: It works when I do the same thing in |
Have you tried |
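One common stumbling block with elementHandle.getProperty is that it resolves to a JSHandle rather than to the plain value; calling jsonValue() on that handle returns the value itself. Whether this is what went wrong in the question above is an assumption. The sketch below assumes page is a Puppeteer Page, and the helper name readProperty is hypothetical.

```javascript
// Hypothetical helper: reads one DOM property from the first element
// matching `selector`, or returns null when nothing matches.
async function readProperty(page, selector, propertyName) {
  // page.$ resolves to an ElementHandle, or null if there is no match.
  const element = await page.$(selector);
  if (!element) return null;
  // getProperty resolves to a JSHandle that wraps the value in the page.
  const handle = await element.getProperty(propertyName);
  // jsonValue serializes the wrapped value back to Node.js.
  return handle.jsonValue();
}
```

For example, const text = await readProperty(page, '.scrape', 'textContent');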
const element = await page.$(".btn");
If there are multiple elements with the btn class, how do I get the second or third one? This code only gives me the first one. |
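You can use page.$$, which returns all matching elements, and pick one by index ([0] is the first, [1] the second, and so on):
const element = (await page.$$("p"))[0];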
const text = await page.evaluate(element => element.textContent, element); |
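The same idea can be wrapped so that an out-of-range index does not throw. This is a sketch only: the helper name nthElementText is hypothetical, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: text of the element at zero-based `index` among all
// elements matching `selector`, or null if there is no such element.
async function nthElementText(page, selector, index) {
  // page.$$ resolves to an array of ElementHandles (empty if nothing matches).
  const elements = await page.$$(selector);
  const element = elements[index];
  if (!element) return null;
  // The handle is passed as an argument and arrives in the page as a DOM element.
  return page.evaluate(el => el.textContent, element);
}
```

For the second element with the btn class, the call would be await nthElementText(page, '.btn', 1).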
Hi, I think this is an obvious question but still I cannot figure out how to grab text from an element.
<span class="scrape">HelloPuppeteer</span>
How can I scrape this 'HelloPuppeteer' text using Puppeteer?