A solution to scrape Markdown from posts #35

Closed
iMrDJAi opened this issue May 26, 2021 · 2 comments

@iMrDJAi
Contributor

iMrDJAi commented May 26, 2021

Note: This solution applies to the desktop version of the Facebook website, as do the other solutions I'm providing to improve this library. You should switch away from the mobile version first, then I'll start making some pull requests.

Scraping text from posts on the desktop version is much more complicated than on the mobile version, since it comes in the form of HTML elements rather than plain text. The key here is finding the right selector for the post body. As for the other elements we need to scrape, like images, videos, the submission permalink, and other stuff, those need a separate issue and a deeper discussion.

Anyway, using the browser inspector we can see what it looks like under the hood:

image

You'll notice that it's located between two pseudo-elements (::before and ::after). We just need to copy the .innerHTML of the parent element, then convert it to Markdown. There is a very good library for that called turndown, and as you can see from the images below, we MADE IT!

image
image
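
A minimal sketch of the conversion step described above, under a few assumptions: `postBodyToMarkdown` is a hypothetical helper name, `postBody` is whatever element your selector finds, and `converter` is any object with a `.turndown(html)` method, such as a turndown `TurndownService` instance. The selector itself is not shown because it depends on the current Facebook markup.

```javascript
// Sketch only: convert a post body's HTML into Markdown.
// `postBody`  - any object exposing `.innerHTML` (a DOM element in the browser).
// `converter` - any object exposing `.turndown(html)`; with the turndown
//               package installed this would be `new TurndownService()`.
function postBodyToMarkdown(postBody, converter) {
  if (!postBody || typeof postBody.innerHTML !== "string") {
    throw new TypeError("postBody must expose an innerHTML string");
  }
  // Hand the raw HTML to the converter and drop surrounding whitespace.
  return converter.turndown(postBody.innerHTML).trim();
}

// Possible usage (assumes turndown is installed and POST_BODY_SELECTOR is
// a selector you found with the inspector; both are placeholders here):
//   const TurndownService = require("turndown");
//   const el = document.querySelector(POST_BODY_SELECTOR);
//   const markdown = postBodyToMarkdown(el, new TurndownService());
```

Passing the converter in as an argument keeps the helper independent of how turndown is loaded (bundled in the page or required in Node).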

Another issue is the See More button. You should click it first to allow the rest of the text to appear:

image
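
A rough sketch of the See More step, with loud assumptions: `expandSeeMore` is a hypothetical helper, and matching on `[role="button"]` plus the visible label "See More" is a guess that may not hold for other locales or future layouts.

```javascript
// Sketch only: click every "See More" control inside a post so the full
// text is present before reading .innerHTML.
// Assumptions: the control matches '[role="button"]' and its visible label
// equals `label` (case-insensitive). Adjust both to the real markup.
function expandSeeMore(postRoot, label = "See More") {
  const wanted = label.toLowerCase();
  const candidates = Array.from(postRoot.querySelectorAll('[role="button"]'));
  const buttons = candidates.filter(
    (el) => (el.textContent || "").trim().toLowerCase() === wanted
  );
  buttons.forEach((el) => el.click());
  return buttons.length; // number of controls that were clicked
}
```

The expanded text may be rendered asynchronously, so the caller will probably need to wait a moment after this returns before copying the HTML.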

And that's all, I hope that this information will help <3

@iMrDJAi
Contributor Author

iMrDJAi commented May 26, 2021

cc @kaanyagci

@kaanyagci kaanyagci added the P3 label May 28, 2021
@kaanyagci
Contributor

Hi @iMrDJAi, thank you very much for your feedback, it's perfectly detailed.

It seems doable to me, but as you know, for now the first priority for the library is the TypeScript migration + NodeJS module support. That's why I've added it to P3.

@kaanyagci kaanyagci self-assigned this May 28, 2021
@kaanyagci kaanyagci added this to P3 in npm module Jun 1, 2021
@iMrDJAi iMrDJAi closed this as completed Jun 2, 2022
npm module automation moved this from P3 to Done Jun 2, 2022