What should be the origin when NAVIGATE #280

Closed
luckylittle opened this issue Apr 10, 2019 · 11 comments
Labels
type/question Further information is requested

Comments

@luckylittle

This is a question rather than an issue. I am trying to get the URLs of files inside a FOR loop (or, ideally, download the PDFs). The flow is: Login page -> Saved content page -> Refcard page -> Download button. I am struggling to understand what the origin should be and how many DOCUMENTs I need. Any help would be appreciated.

// Login works fine.
LET base_url = DOCUMENT("https://dzone.com/", true)
LET login_doc = DOCUMENT("https://dzone.com/users/login.html", true)
LET login_btn = ELEMENT(login_doc, "button[type=submit]")
INPUT(login_doc, "form[role=form] input[name=j_username]", "dzone-refcardz@mailcatch.com", 5)
INPUT(login_doc, "form[role=form] input[name=j_password]", "XXXXXX", 5)
CLICK(login_btn)
WAIT_NAVIGATION(login_doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_doc = DOCUMENT("https://dzone.com/users/3590306/dzone-refcardz.html?sort=saved", true)
LET origin_url = "https://dzone.com/users/3590306/dzone-refcardz.html?sort=saved"
NAVIGATE(login_doc, origin_url, 25000)
WAIT_ELEMENT(origin_doc, 'p[class=comment-title]', 50000)
LET titles = ELEMENTS(origin_doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, click on the 'Download' button, get the URL and then go back. Does not work.
FOR link_url IN links
  LET link_origin_doc = DOCUMENT(origin_url, true)
  NAVIGATE(link_origin_doc, link_url, 50000)
  WAIT_ELEMENT(link_origin_doc, 'button[class="btn download btn-lg"]', 5000)
  LET download_btn = ELEMENT(link_origin_doc, 'button[class="btn download btn-lg"]')
  CLICK(download_btn)
  RETURN(link_origin_doc.url)
  NAVIGATE_BACK(link_origin_doc)
@ziflex
Member

ziflex commented Apr 11, 2019

Hey, thank you for giving Ferret a try!
I think you can use 1-2 documents for this query.
The origin can be anything. Navigating is basically the same as typing a new address into a browser tab.
So I do not see any reason to have so many open documents.
In addition, you are creating a new document on each iteration, which might slow down execution (or even crash Chrome) if there are too many links.
You could either reuse the very first document or open an empty one before the FOR IN loop and reuse it.

@luckylittle
Author

Thanks @ziflex, this is very helpful information. Chrome was indeed crashing, which is why I was asking. Ferret has great potential, but the documentation needs to be improved. Keep up the great work.

@luckylittle
Author

I simplified it and now only use one DOCUMENT:

FOR link_url IN links
  NAVIGATE(login_doc, link_url, 25000)
  WAIT_ELEMENT(login_doc, 'button[class="btn download btn-lg"]', 5000)
  LET download_btn = ELEMENT(login_doc, 'button[class="btn download btn-lg"]')
  CLICK(download_btn)
  RETURN(login_doc.url)
  NAVIGATE_BACK(login_doc)

Unfortunately, it times out:

Failed to execute the query
operation timed out: NAVIGATE(login_doc,link_url,25000)

And this is in the ferret.log:

"error":"cdp.DOM: AttributeModified Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: AttributeRemoved Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: ChildNodeCountUpdated Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: ChildNodeInserted Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: ChildNodeRemoved Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.Page: LoadEventFired Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: DocumentUpdated Recv: rpcc: the connection is closing","message":"unexpected error"}

Any ideas?

@ziflex
Member

ziflex commented Apr 12, 2019

Hey, here is an updated query that works (make sure you have unlimited amount of downloads)

// Login works fine.
LET doc = DOCUMENT("https://dzone.com/users/login.html", true)

INPUT(doc, "form[role=form] input[name=j_username]", @username, 5)
INPUT(doc, "form[role=form] input[name=j_password]", @password, 5)
CLICK(doc, "button[type=submit]")
WAIT_NAVIGATION(doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_url = "https://dzone.com/users/" + @userid + "/" + @username + ".html?sort=saved"
NAVIGATE(doc, origin_url, 25000)
WAIT_ELEMENT(doc, 'p[class=comment-title]', 50000)

LET titles = ELEMENTS(doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, click the 'Download' button, wait for the navigation and return the resulting URL.
FOR link_url IN links
  NAVIGATE(doc, link_url, 50000)
  WAIT_ELEMENT(doc, '.download', 5000)
  CLICK(doc, '.download')
  WAIT_NAVIGATION(doc, 25000)

  RETURN doc.URL

@ziflex
Member

ziflex commented Apr 12, 2019

By the way, you can download the PDF files if you need to. At the moment there is no way to download the currently open PDF file, so you will need to make a plain HTTP request for that.
Note that the files will be returned as base64 strings.

// Login works fine.
LET doc = DOCUMENT("https://dzone.com/users/login.html", true)

INPUT(doc, "form[role=form] input[name=j_username]", @username, 5)
INPUT(doc, "form[role=form] input[name=j_password]", @password, 5)
CLICK(doc, "button[type=submit]")
WAIT_NAVIGATION(doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_url = "https://dzone.com/users/" + @userid + "/" + @username + ".html?sort=saved"
NAVIGATE(doc, origin_url, 25000)
WAIT_ELEMENT(doc, 'p[class=comment-title]', 50000)

LET titles = ELEMENTS(doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, click the 'Download' button, wait for the navigation and return the resulting URL together with the downloaded file.
FOR link_url IN links
  NAVIGATE(doc, link_url, 50000)
  WAIT_ELEMENT(doc, '.download', 5000)
  CLICK(doc, '.download')
  WAIT_NAVIGATION(doc, 25000)

  RETURN { url: doc.URL, file: DOWNLOAD(doc.URL) }
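
Since the query above returns each file as a base64 string, the result still has to be decoded before it is a usable PDF. Here is a minimal Python sketch of that step, assuming the query output has been loaded as a list of `{"url": ..., "file": ...}` objects. The helper names `decode_record` and `save_records` and the file naming scheme are hypothetical and not part of Ferret.

```python
import base64
from pathlib import Path


def decode_record(record):
    """Return (file_name, pdf_bytes) for one {"url": ..., "file": ...} result."""
    # Assumption: "file" holds standard base64 text, as described above.
    data = base64.b64decode(record["file"])
    # Hypothetical naming scheme: last path segment of the URL plus ".pdf".
    name = record["url"].rstrip("/").rsplit("/", 1)[-1] + ".pdf"
    return name, data


def save_records(records, out_dir="."):
    """Write every decoded record into out_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in records:
        name, data = decode_record(record)
        path = out / name
        path.write_bytes(data)
        paths.append(path)
    return paths
```

With the query output saved as JSON, something like `save_records(json.load(open("result.json")), "pdfs")` would write one file per record; the file name `result.json` is only an example.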

@ziflex ziflex added the type/question Further information is requested label Apr 12, 2019
@luckylittle
Author

luckylittle commented Apr 13, 2019

Thanks for the reply.

What do you mean by

make sure you have unlimited amount of downloads

? Is this some hidden settings in ferret?

I tried both queries, and each again times out on line 27, WAIT_NAVIGATION(doc, 25000). Increasing the timeout limit does not make a difference. Nor does the account: one that has all 293 links and one that has just 1 behave the same.
By the way, I use the latest Docker image alpeware/chrome-headless-stable and ferret --cdp http://127.0.0.1:9222 (not -cdp-keep-cookies).

@ziflex
Member

ziflex commented Apr 15, 2019

OK, it seems there is a problem with Chrome in headless mode: the query works fine when Chrome is not in headless mode, but fails when it is.

This needs investigating.

@ziflex
Member

ziflex commented Apr 15, 2019

What do you mean by

make sure you have unlimited amount of downloads

? Is this some hidden settings in ferret?

Not in Ferret, but on the website: the number of downloads is limited if your profile information is not complete.

@ziflex
Member

ziflex commented Apr 15, 2019

OK, it seems Chrome in headless mode does not support PDF files. See:
puppeteer/puppeteer#1872

@ziflex
Member

ziflex commented Apr 15, 2019

// Login works fine.
LET doc = DOCUMENT("https://dzone.com/users/login.html", true)

WAIT_ELEMENT(doc, "form", 25000)

INPUT(doc, "form[role=form] input[name=j_username]", @username, 5)
INPUT(doc, "form[role=form] input[name=j_password]", @password, 5)
CLICK(doc, "button[type=submit]")
WAIT_NAVIGATION(doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_url = "https://dzone.com/users/" + @userid + "/" + @username + ".html?sort=saved"
NAVIGATE(doc, origin_url, 25000)
WAIT_ELEMENT(doc, 'p[class=comment-title]', 50000)

LET titles = ELEMENTS(doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, read the asset path from the 'dz-download' element instead of opening the PDF.
FOR link_url IN links
  NAVIGATE(doc, link_url, 50000)
  WAIT_ELEMENT(doc, 'dz-download[asset]', 5000)
  
  WAIT(1000)

  LET el = ELEMENT(doc, 'dz-download[asset]')
  LET attr = el.attributes.asset

  RETURN "https://dzone.com" + SUBSTITUTE(attr, "'", "")

Here is an updated query that does not require opening the PDF files.

@luckylittle
Author

Thanks for pointing me in the right direction with non-headless mode. I was testing this on CentOS with Chromium v72.0.3626.0. Unfortunately, not all 297 of them can be fully automated. Some have an interstitial/advertisement page before you can access the file, e.g.:

  {
    "name": "297_GitOps_for_Kubernetes.pdf",
    "url": "https://dzone.com/interstitial?asset=2768919&item=358521"
  }

Also, the Download button on some of the old ones (pre-2016) downloads the file directly instead of opening it in the browser:

New - opened in the browser:
https://dzone.com/asset/download/83632

Old - immediately downloaded:
https://dzone.com/asset/download/148

Either way, I made some progress and automated most of these tasks. Thanks for your help.
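
One way to handle the interstitial case described above is to post-process the query results and set those entries aside for manual download. Below is a minimal Python sketch, assuming the results are a list of `{"name": ..., "url": ...}` objects like the example shown. The function name `split_results` is hypothetical, and keying on the `/interstitial` path is inferred from that single example URL.

```python
from urllib.parse import urlparse, parse_qs


def split_results(results):
    """Split results into (direct, interstitial) lists based on the URL path."""
    direct, interstitial = [], []
    for item in results:
        parsed = urlparse(item["url"])
        if parsed.path.startswith("/interstitial"):
            # Keep the "asset" query parameter, if present, for later manual handling.
            asset = parse_qs(parsed.query).get("asset", [None])[0]
            interstitial.append({**item, "asset": asset})
        else:
            direct.append(item)
    return direct, interstitial
```

The `direct` list can then be downloaded automatically, while the `interstitial` list shows which Refcardz still need a manual click-through.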
