Switching to the desktop layout (done) #41

Merged: 12 commits into Makepad-fr:development, Jul 7, 2021

@iMrDJAi iMrDJAi commented Jun 30, 2021

Facebook's mobile layout provides limited data and low-quality media, so the project should switch entirely to the desktop layout.

Todo list:

  • Desktop layout selectors:
    • I'm currently trying to support posts that contain text, images and videos. As a result, posts with embeds, 360 images, backgrounds and shared posts may not be scraped fully or correctly. The selectors for them are still missing and will be added later, along with any workarounds needed to scrape them (especially 360 images; I have no idea how to deal with them).
      These are all the selectors, explained:
      • group_name: The title element of the document; it contains the group name.
      • group_feed_container: The element that contains the submissions.
      • post_element: The submission element. It's not necessarily a direct child of the group feed container, but it's guaranteed to contain all the data that we're targeting.
      • post_author: An "a" element from which we can extract the author's name and profile URL.
      • post_author_avatar: A small profile picture of the author.
      • post_link: An "a" element containing the post permalink and the time the post was created (the innerText is obfuscated, so the time can't simply be read from it).
      • post_content: Contains the post text. We can convert the innerHTML of this element to Markdown with the turndown library.
      • post_content_expand_button: The "See More" button; it must be clicked before we start scraping the post content.
      • post_attachment: Contains the attachments; it's the element immediately after the post content wrapper.
      • post_video: A selector for all the videos inside the post attachments wrapper; these can be scraped with the following workaround.
      • post_img: A selector for all the images inside the post attachments wrapper. These don't necessarily have the best quality, since you have to click an image to load its full resolution, but they are far better than the images on the mobile layout.
  • Update the scraper code.
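
A minimal sketch of how the selectors listed above could be grouped in one typed object. Only the key names and the `title` value for group_name come from the description; every other value is left as an empty placeholder, and `GroupSelectors`, `desktopSelectors` and `missingSelectors` are hypothetical names, not the project's actual code.

```typescript
// Keys mirror the selector names described above; the interface name is illustrative.
interface GroupSelectors {
  group_name: string;
  group_feed_container: string;
  post_element: string;
  post_author: string;
  post_author_avatar: string;
  post_link: string;
  post_content: string;
  post_content_expand_button: string;
  post_attachment: string;
  post_video: string;
  post_img: string;
}

// Placeholder values: the real selectors depend on Facebook's current markup
// and are deliberately left empty here.
const desktopSelectors: GroupSelectors = {
  group_name: "title", // the document's title element
  group_feed_container: "",
  post_element: "",
  post_author: "",
  post_author_avatar: "",
  post_link: "",
  post_content: "",
  post_content_expand_button: "",
  post_attachment: "",
  post_video: "",
  post_img: "",
};

// Returns the names of selectors that are still empty, so a scraper can
// fail early instead of silently matching nothing.
function missingSelectors(selectors: GroupSelectors): string[] {
  const names = Object.keys(selectors) as (keyof GroupSelectors)[];
  return names.filter((name) => selectors[name].trim() === "");
}
```

Keeping every selector in one object means a layout change on Facebook's side only requires updating this table, not the scraping logic.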

@kaanyagci kaanyagci changed the base branch from master to development July 1, 2021 08:57
@kaanyagci

Hi,

Thank you for your contribution. I changed the base branch from master to development, as some things are still incomplete. I'll try to take a closer look, probably this weekend.

Converting post contents to Markdown with another npm module could be overkill and may also become a maintainability problem. Instead, we could implement something that strips the HTML tags around the string content, without necessarily converting it to Markdown.

@iMrDJAi

iMrDJAi commented Jul 1, 2021

@kaanyagci The point of the Markdown format is to give users a minimal output that can be used to re-render the post content exactly like the original. I only suggested it to avoid including the HTML, which may be large and not human-readable.
But I think you're right: we should reduce the number of third-party modules. It's also a good idea to let users handle the output themselves, and we can provide examples of how to do that.
In that case, we have to include the innerHTML along with the innerText in the output.
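
A sketch of what such a user-side example could look like, assuming the output carries both strings. The `ScrapedContent` shape and the `htmlToPlainText` helper are hypothetical, and the regex-based tag stripping is a rough approximation rather than a full HTML parser.

```typescript
// Hypothetical output shape: the raw markup plus the rendered text.
interface ScrapedContent {
  html: string; // the element's innerHTML
  text: string; // the element's innerText
}

// Rough user-side conversion of the HTML to plain text, with no third-party module.
function htmlToPlainText(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n")  // line breaks become newlines
    .replace(/<\/(p|div)>/gi, "\n") // the end of a block becomes a newline
    .replace(/<[^>]+>/g, "")        // drop every remaining tag
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")         // decode &amp; last to avoid double decoding
    .replace(/\n{2,}/g, "\n")       // collapse runs of blank lines
    .trim();
}
```

Users who want Markdown can still run the `html` field through a converter of their choice, which keeps that dependency out of the library itself.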

@iMrDJAi

iMrDJAi commented Jul 1, 2021

After some testing, I've noticed that when you scrape without authentication, some posts don't provide the author's profile URL; in that case the selector group_post_author won't work.
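
One way to tolerate this, sketched under assumptions: treat the profile URL as optional and fall back to whatever element still shows the author's name. The `PostAuthor` shape, the `buildAuthor` helper and the idea of a plain-text fallback element are all hypothetical, not taken from the scraper.

```typescript
// Hypothetical shape: the profile URL is nullable because unauthenticated
// sessions may not expose it.
interface PostAuthor {
  name: string;
  profileUrl: string | null;
}

// `linkText` and `linkHref` stand for the text and href of the author link when
// it exists; `fallbackText` stands for the text of any other element that still
// shows the author's name.
function buildAuthor(
  linkText: string | null,
  linkHref: string | null,
  fallbackText: string | null
): PostAuthor | null {
  const authorName = (linkText || fallbackText || "").trim();
  if (authorName === "") return null; // nothing usable was found
  return { name: authorName, profileUrl: linkHref || null };
}
```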

Also, elements load quite differently in the desktop layout: they don't load until they enter the viewport, so we should start scrolling before scraping.
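
A minimal sketch of the scroll-first idea, assuming the page can be reduced to two operations. `FeedDriver` and `scrollUntilLoaded` are hypothetical names; with Puppeteer, the two methods would typically be thin wrappers around `page.evaluate()`, and a real implementation would also wait briefly after each scroll.

```typescript
// Minimal view of a page: just the two operations this sketch needs.
interface FeedDriver {
  countPosts(): Promise<number>; // posts currently present in the DOM
  scrollDown(): Promise<void>;   // bring more of the feed into the viewport
}

// Scrolls until at least `wanted` posts are present, or until scrolling adds
// nothing `maxIdleScrolls` times in a row (the feed is probably exhausted).
async function scrollUntilLoaded(
  driver: FeedDriver,
  wanted: number,
  maxIdleScrolls: number = 3
): Promise<number> {
  let seen = await driver.countPosts();
  let idle = 0;
  while (seen < wanted && idle < maxIdleScrolls) {
    await driver.scrollDown();
    const current = await driver.countPosts();
    idle = current > seen ? 0 : idle + 1; // reset the counter whenever new posts appear
    seen = current;
  }
  return seen;
}
```

The idle counter is what keeps the loop from running forever on a short or fully loaded feed.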

kaanyagci and others added 7 commits July 2, 2021 17:33
npm badges added
This is completely a mess! The function `getGroupPosts()` needs a full rewrite!
- A full rewrite for the scraper function.
- MutationObserver implementation.
- Fixed a bug when posts don't have  text content.
- Now it clicks on the "See more..." button before extracting the post content!
@iMrDJAi

iMrDJAi commented Jul 3, 2021

@kaanyagci So yeah, I did it! The scraper now works perfectly with Facebook's new desktop layout, and it has the same functionality as the one on the master branch. I think it's time to merge this into the development branch (after reviewing and testing it, of course). Other features and new fields for the GroupPost interface should be added in a separate pull request to keep things organized!

@kaanyagci

@all-contributors please add @iMrDJAi for code

@allcontributors

@kaanyagci

I've put up a pull request to add @iMrDJAi! 🎉

@kaanyagci

@iMrDJAi This is excellent news! I was really busy with other stuff today. I'll test this first thing tomorrow! Great job! 💯

@iMrDJAi

iMrDJAi commented Jul 5, 2021

@kaanyagci Any updates? Have you tested it? Any issues?

@kaanyagci

Sorry for the delay. I was still a little busy :( I'll look ASAP.

@kaanyagci

kaanyagci commented Jul 7, 2021

Just checked. Sadly, I can't get it to work.

  • The first error I faced occurs at login, with the following reproduction:
```ts
import { FB } from './index';

async function main() {
  const f = await FB.init({
    debug: true,
    output: 'test.json',
    headless: false,
    groupIds: ['774278349295443'],
    useCookies: true,
    disableAssets: true,
  });
  f.login('<redacted>', '<redacted>');
  await f.getGroupPosts(774278349295443, 'groupOutput');
}

main().then(() => {
  console.log('Done');
});
```

Gives the following output:

```
/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115
                    ? new Error(`${response.errorText} at ${url}`)
                      ^

Error: net::ERR_ABORTED at https://facebook.com
    at navigate (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async FrameManager.navigateFrame (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:90:21)
    at async Frame.goto (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:416:16)
    at async Page.goto (/Users/kaanyagci/Documents/makepad/fbjs/node_modules/puppeteer/lib/cjs/puppeteer/common/Page.js:819:16)
    at async Facebook.login (/Users/kaanyagci/Documents/makepad/fbjs/dist/lib/models/fb.js:113:9)
```

Note: The output is the same in both headless and non-headless modes.

  • The second error I faced, without logging in: the web page used for group details is still the mobile page, m.facebook.com.

I'll try to investigate these issues as soon as possible this week.
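
One thing that may be worth ruling out for the first error: in the reproduction, `f.login(...)` is not awaited, and the trace shows it is an async method that calls `Page.goto`. If `getGroupPosts` starts a second navigation while the login navigation is still in flight, the first one can be rejected with `net::ERR_ABORTED`, regardless of headless mode. This is an assumption, not something confirmed in this thread. The toy model below does not use Puppeteer at all; `ToyTab` and both helper functions are hypothetical and only illustrate the race.

```typescript
// Toy stand-in for a browser tab: starting a new navigation aborts the pending one.
class ToyTab {
  private abortPending: (() => void) | null = null;

  goto(url: string): Promise<string> {
    if (this.abortPending) this.abortPending(); // supersede the navigation in flight
    return new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.abortPending = null;
        resolve(url);
      }, 10); // pretend the page takes 10 ms to load
      this.abortPending = () => {
        clearTimeout(timer);
        reject(new Error("net::ERR_ABORTED at " + url));
      };
    });
  }
}

// Without `await`, the second goto() starts before the first one settles.
function navigateWithoutAwait(tab: ToyTab): Promise<string[]> {
  const login = tab.goto("https://facebook.com").catch((e: Error) => e.message);
  const group = tab.goto("https://facebook.com/groups/<id>").catch((e: Error) => e.message);
  return Promise.all([login, group]);
}

// Awaiting the first navigation lets both complete in order.
async function navigateWithAwait(tab: ToyTab): Promise<string[]> {
  const login = await tab.goto("https://facebook.com");
  const group = await tab.goto("https://facebook.com/groups/<id>");
  return [login, group];
}
```

In the reproduction, the equivalent change would be `await f.login(...)` before calling `getGroupPosts`.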

@iMrDJAi

iMrDJAi commented Jul 7, 2021

@kaanyagci Interesting. In fact, I haven't tried logging in; I've always been testing in userless mode. I'll try that later and check what's going on.
For now, you can try this:

```js
;(async () => {
    const { FB } = require("@makepad/fbjs")

    // Userless mode: no login and no saved cookies
    const fb = await FB.init({
        headless: true,
        useCookies: false,
        output: ''
    })

    // Alternative call that also passes an output file:
    // await fb.getGroupPosts("319144912641926", "./output.json")

    await fb.getGroupPosts("319144912641926")
})()
```

@iMrDJAi

iMrDJAi commented Jul 7, 2021

> The second error I've faced without login the web page used for group details is still the mobile page m.facebook.com

@kaanyagci That doesn't make sense; I'm 100% sure that I completely removed the mobile website. Fork my master branch again.

@kaanyagci

My bad, I was trying on another branch 🤦

@kaanyagci

This looks great, actually. For the first issue, I've added a userAgent, as Facebook rejects connections from headless browsers. I'll add this line once it's merged into the development branch! Anyway, great work @iMrDJAi!
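
A small sketch of the user-agent idea, assuming the rejection is keyed on the `HeadlessChrome` token that headless Chrome puts in its default user agent. The helper names and the sample string are illustrative, and the actual line added to the library is not shown in this thread. With Puppeteer, the resulting string would be applied through `page.setUserAgent()`.

```typescript
// Headless Chrome advertises itself with a "HeadlessChrome/<version>" token.
function isHeadlessUserAgent(userAgent: string): boolean {
  return userAgent.indexOf("HeadlessChrome") !== -1;
}

// Rewrites the token so the string matches a regular desktop Chrome.
function toDesktopUserAgent(userAgent: string): string {
  return userAgent.replace(/HeadlessChrome/g, "Chrome");
}

// Example value only; the real string depends on the bundled browser version.
const headlessUserAgent =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
  "HeadlessChrome/91.0.4472.114 Safari/537.36";

const desktopUserAgent = toDesktopUserAgent(headlessUserAgent);
```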

@kaanyagci kaanyagci merged commit 7935979 into Makepad-fr:development Jul 7, 2021
@iMrDJAi iMrDJAi changed the title from "Switching to the desktop layout (work in progress)" to "Switching to the desktop layout (done)" Jul 7, 2021
kaanyagci added a commit that referenced this pull request Jul 7, 2021
* Switching to the desktop layout (work in progress) (#41)

* 📝 Funding documentation added (#40)

* Added desktop layout selectors

* Added "See More" button selector

* Added xpath selectors + some improvements

* Small fix

* README updated (#42)

npm badges added

* Checkpoint

This is completely a mess! The function `getGroupPosts()` needs a full rewrite!

* Updated the scraper code

- A full rewrite for the scraper function.
- MutationObserver implementation.

* Improvements!

- Fixed a bug when posts don't have  text content.
- Now it clicks on the "See more..." button before extracting the post content!

* Removed xpath selectors + a bunch of minor changes

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>

* docs: add iMrDJAi as a contributor for code (#43)

* 📝 Funding documentation added (#40)

* README updated (#42)

npm badges added

* docs: update README.md [skip ci]

* docs: update .all-contributorsrc [skip ci]

* Update README.md for missing badges

* Duplicated all contributors badge removed

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>
Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>

* feat(package): sponsor button added to the npm package

* fix(browser): facebook headless browser rejection issue fixed

user agents added

* feat: concom configuration added

Concom is a Conventional Commit formatter which is actually in private alpha release

* feat(version): Version incremented to 4.1.0

Co-authored-by: ${Mr.DJA} <aoutou.d@gmail.com>
Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>
kaanyagci added a commit that referenced this pull request Jul 7, 2021
* 📝 Funding documentation added (#40)

* Added desktop layout selectors

* Added "See More" button selector

* Added xpath selectors + some improvements

* Small fix

* README updated (#42)

npm badges added

* Checkpoint

This is completely a mess! The function `getGroupPosts()` needs a full rewrite!

* Updated the scraper code

- A full rewrite for the scraper function.
- MutationObserver implementation.

* Improvements!

- Fixed a bug when posts don't have  text content.
- Now it clicks on the "See more..." button before extracting the post content!

* Removed xpath selectors + a bunch of minor changes

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>
@iMrDJAi iMrDJAi deleted the master branch July 7, 2021 21:23
@iMrDJAi iMrDJAi restored the master branch July 7, 2021 21:26
kaanyagci added a commit that referenced this pull request Jul 9, 2021
* README updated (#42)

npm badges added

* Switching to the desktop layout (work in progress) (#41)

* 📝 Funding documentation added (#40)

* Added desktop layout selectors

* Added "See More" button selector

* Added xpath selectors + some improvements

* Small fix

* README updated (#42)

npm badges added

* Checkpoint

This is completely a mess! The function `getGroupPosts()` needs a full rewrite!

* Updated the scraper code

- A full rewrite for the scraper function.
- MutationObserver implementation.

* Improvements!

- Fixed a bug when posts don't have  text content.
- Now it clicks on the "See more..." button before extracting the post content!

* Removed xpath selectors + a bunch of minor changes

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>

* docs: add iMrDJAi as a contributor for code (#43)

* 📝 Funding documentation added (#40)

* README updated (#42)

npm badges added

* docs: update README.md [skip ci]

* docs: update .all-contributorsrc [skip ci]

* Update README.md for missing badges

* Duplicated all contributors badge removed

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>
Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>

* feat(package): sponsor button added to the npm package

* fix(browser): facebook headless browser rejection issue fixed

user agents added

* feat: concom configuration added

Concom is a Conventional Commit formatter which is actually in private alpha release

* feat(version): Version incremented to 4.1.0

* feat: tsconfig build information updated

* Switching to the desktop layout (work in progress) (#41)

* 📝 Funding documentation added (#40)

* Added desktop layout selectors

* Added "See More" button selector

* Added xpath selectors + some improvements

* Small fix

* README updated (#42)

npm badges added

* Checkpoint

This is completely a mess! The function `getGroupPosts()` needs a full rewrite!

* Updated the scraper code

- A full rewrite for the scraper function.
- MutationObserver implementation.

* Improvements!

- Fixed a bug when posts don't have  text content.
- Now it clicks on the "See more..." button before extracting the post content!

* Removed xpath selectors + a bunch of minor changes

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>

* feat: tsconfig build information updated

* [4.1.1] - Bug fixes (#47)

* 🐛 Cookie file double extension issue fixed

The issue was causing the impossibility to load cookies it is fixed by replacing the .json extension if exists by nothing

* 🔧 TypeScript compiler configuration file updated

examples source code is excluded

* 🔇 Unnecessary console.logs removed

* ✨ callback parameter added to get group posts

* 🔧 last build info file added

* 🔧 .npmignore file added

this files contains files to ignore on npm module

* 🚨 source file linted

* 🔧 .eslintignore file updated

example folder will not be linted

* ⬆️ Dependency versions upgrade

package-lock.json file updated

* ✨ local file saving is now optional

a parameter added to getGroupPosts function to save or not on a local file

* 📝 README file updated

Usage example added

* 🔧 last build information added

* 📝 Example app created

* 🐛 group name normalisation issue fixed

Output files by default will be named with group id instead of group name

* 🔖 version incremented to 4.1.1

Co-authored-by: ${Mr.DJA} <aoutou.d@gmail.com>
Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>
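
Two of the 4.1.1 fixes above lend themselves to a short illustration: the double `.json` extension on the cookie file, and naming output files after the group ID. The sketch below only follows the commit descriptions; all three helper names are hypothetical and the library's real code may differ.

```typescript
// Removes one trailing ".json" (case-insensitive) so that the extension can be
// appended exactly once afterwards.
function stripJsonExtension(fileName: string): string {
  return fileName.replace(/\.json$/i, "");
}

// Hypothetical helper: always yields a single ".json" extension.
function cookieFileName(baseName: string): string {
  return stripJsonExtension(baseName) + ".json";
}

// Hypothetical helper: a default output name derived from the group ID, which,
// unlike a free-form group name, needs no normalisation.
function defaultOutputFileName(groupId: string): string {
  return groupId + ".json";
}
```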
kaanyagci added a commit that referenced this pull request Jul 12, 2021
* Added desktop layout selectors

* Added "See More" button selector

* Added xpath selectors + some improvements

* Small fix

* Checkpoint

This is completely a mess! The function `getGroupPosts()` needs a full rewrite!

* Updated the scraper code

- A full rewrite for the scraper function.
- MutationObserver implementation.

* Improvements!

- Fixed a bug when posts don't have  text content.
- Now it clicks on the "See more..." button before extracting the post content!

* Removed xpath selectors + a bunch of minor changes

* Added new fields to the GroupPost interface!

* Bug fixes + now it parse posts one by one

* Small changes

- Updated the group name selector.
- Now it scrolls down a little bit before start scraping to ensure that posts will load.

* Fixed some issues with hovering

* [4.1.1] - Bug fixes (#48)

* Updates...

- Hovering bug fix.
- Renamed the GroupPost interface to.. just Post.
- Added images field to the Post interface.

* Minor changes

* Update fb.ts

Co-authored-by: Kaan Yagci <9104546+kaanyagci@users.noreply.github.com>
Co-authored-by: allcontributors[bot] <46447321+allcontributors[bot]@users.noreply.github.com>