What should be the origin when NAVIGATE #280

luckylittle · 2019-04-10T12:15:26Z

This is a question rather than an issue. I am trying to get the URLs of files inside the FOR loop (or ideally download the PDFs). The flow is Login page -> Saved content page -> Refcard page -> Download button. I struggle to understand what the origin should be and how many DOCUMENTs I need. Any help would be appreciated.

// Login works fine.
LET base_url = DOCUMENT("https://dzone.com/", true)
LET login_doc = DOCUMENT("https://dzone.com/users/login.html", true)
LET login_btn = ELEMENT(login_doc, "button[type=submit]")
INPUT(login_doc, "form[role=form] input[name=j_username]", "dzone-refcardz@mailcatch.com", 5)
INPUT(login_doc, "form[role=form] input[name=j_password]", "XXXXXX", 5)
CLICK(login_btn)
WAIT_NAVIGATION(login_doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_doc = DOCUMENT("https://dzone.com/users/3590306/dzone-refcardz.html?sort=saved", true)
LET origin_url = "https://dzone.com/users/3590306/dzone-refcardz.html?sort=saved"
NAVIGATE(login_doc, origin_url, 25000)
WAIT_ELEMENT(origin_doc, 'p[class=comment-title]', 50000)
LET titles = ELEMENTS(origin_doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, click on the 'Download' button, get the URL and then go back. Does not work.
FOR link_url IN links
  LET link_origin_doc = DOCUMENT(origin_url, true)
  NAVIGATE(link_origin_doc, link_url, 50000)
  WAIT_ELEMENT(link_origin_doc, 'button[class="btn download btn-lg"]', 5000)
  LET download_btn = ELEMENT(link_origin_doc, 'button[class="btn download btn-lg"]')
  CLICK(download_btn)
  RETURN(link_origin_doc.url)
  NAVIGATE_BACK(link_origin_doc)


ziflex · 2019-04-11T03:27:55Z

Hey, thank you for giving Ferret a try!
I think you can use 1-2 documents for this query.
The origin can be anything. It's basically the same as typing a new address into a browser tab.
So I do not see any reason to have so many open documents.
In addition, you are creating a new document on each iteration, which might slow down execution (or even crash Chrome) if there are too many links.
You could either reuse the very first document, or open an empty one before the FOR IN loop and reuse that.

luckylittle · 2019-04-11T04:57:54Z

Thanks @ziflex, this is very helpful information. Chrome was indeed crashing, which is why I was asking. Ferret has great potential, but the documentation needs to be improved. Keep up the great work.

luckylittle · 2019-04-12T11:02:43Z

I simplified it and now only use one DOCUMENT:

FOR link_url IN links
  NAVIGATE(login_doc, link_url, 25000)
  WAIT_ELEMENT(login_doc, 'button[class="btn download btn-lg"]', 5000)
  LET download_btn = ELEMENT(login_doc, 'button[class="btn download btn-lg"]')
  CLICK(download_btn)
  RETURN(login_doc.url)
  NAVIGATE_BACK(login_doc)

Unfortunately it times out:

Failed to execute the query
operation timed out: NAVIGATE(login_doc,link_url,25000)

And this is in the ferret.log:

"error":"cdp.DOM: AttributeModified Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: AttributeRemoved Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: ChildNodeCountUpdated Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: ChildNodeInserted Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: ChildNodeRemoved Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.Page: LoadEventFired Recv: rpcc: the connection is closing","message":"unexpected error"}
"error":"cdp.DOM: DocumentUpdated Recv: rpcc: the connection is closing","message":"unexpected error"}

Any ideas?

ziflex · 2019-04-12T16:44:03Z

Hey, here is an updated query that works (make sure you have unlimited amount of downloads)

// Login works fine.
LET doc = DOCUMENT("https://dzone.com/users/login.html", true)

INPUT(doc, "form[role=form] input[name=j_username]", @username, 5)
INPUT(doc, "form[role=form] input[name=j_password]", @password, 5)
CLICK(doc, "button[type=submit]")
WAIT_NAVIGATION(doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_url = "https://dzone.com/users/" + @userid + "/" + @username + ".html?sort=saved"
NAVIGATE(doc, origin_url, 25000)
WAIT_ELEMENT(doc, 'p[class=comment-title]', 50000)

LET titles = ELEMENTS(doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, click the 'Download' button and return the resulting URL.
FOR link_url IN links
  NAVIGATE(doc, link_url, 50000)
  WAIT_ELEMENT(doc, '.download', 5000)
  CLICK(doc, '.download')
  WAIT_NAVIGATION(doc, 25000)

  RETURN doc.URL

ziflex · 2019-04-12T17:03:58Z

Btw, you can download the PDF files if you need to. At the moment there is no way to download a currently open PDF file, so you will need to make a plain HTTP request for that.
Note that the files will be returned as base64 strings.

// Login works fine.
LET doc = DOCUMENT("https://dzone.com/users/login.html", true)

INPUT(doc, "form[role=form] input[name=j_username]", @username, 5)
INPUT(doc, "form[role=form] input[name=j_password]", @password, 5)
CLICK(doc, "button[type=submit]")
WAIT_NAVIGATION(doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_url = "https://dzone.com/users/" + @userid + "/" + @username + ".html?sort=saved"
NAVIGATE(doc, origin_url, 25000)
WAIT_ELEMENT(doc, 'p[class=comment-title]', 50000)

LET titles = ELEMENTS(doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, click the 'Download' button, then return the resulting URL and the downloaded file.
FOR link_url IN links
  NAVIGATE(doc, link_url, 50000)
  WAIT_ELEMENT(doc, '.download', 5000)
  CLICK(doc, '.download')
  WAIT_NAVIGATION(doc, 25000)

  RETURN { url: doc.URL, file: DOWNLOAD(doc.URL) }
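Since the files come back as base64 strings, they need to be decoded before they are usable as PDFs. Below is a minimal Python sketch of that post-processing step. It assumes the query result has been saved as a JSON array of objects with `url` and `file` keys, as in the RETURN above. The function name `save_pdfs` and the file-naming scheme are hypothetical and not part of Ferret.

```python
import base64
import json
from pathlib import Path
from typing import List
from urllib.parse import urlparse


def save_pdfs(ferret_json: str, out_dir: str) -> List[str]:
    """Decode the base64 `file` field of each result and write it to out_dir.

    Assumes `ferret_json` is a JSON array of {"url": ..., "file": ...} objects.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, item in enumerate(json.loads(ferret_json), start=1):
        # Name the file after the last URL path segment; fall back to the index.
        stem = Path(urlparse(item["url"]).path).name or "refcard_{}".format(index)
        target = out / "{}.pdf".format(stem)
        target.write_bytes(base64.b64decode(item["file"]))
        written.append(str(target))
    return written
```

For example, passing the saved query output and a directory name, as in `save_pdfs(open("result.json").read(), "pdfs")`, would write one `.pdf` file per returned item. Items that share the same final URL segment would overwrite each other in this sketch.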

luckylittle · 2019-04-13T11:50:01Z

Thanks for the reply.

What do you mean by "make sure you have unlimited amount of downloads"? Is this some hidden setting in ferret?

I tried both, and again it times out on line #27, WAIT_NAVIGATION(doc, 25000). Increasing the limit does not make a difference. Trying it with an account that has all 293 links or an account that has just 1 also doesn't make a difference.
By the way, I use the latest Docker image alpeware/chrome-headless-stable and ferret --cdp http://127.0.0.1:9222 (not -cdp-keep-cookies).

ziflex · 2019-04-15T00:31:02Z

Ok, it seems there is a problem with Chrome in headless mode.
The query works fine when Chrome is not in headless mode, but fails when it is.

Need to investigate.

ziflex · 2019-04-15T00:32:12Z

What do you mean by "make sure you have unlimited amount of downloads"? Is this some hidden setting in ferret?

Not in Ferret, but on the website: downloads are limited if your profile information is not complete.

ziflex · 2019-04-15T20:43:40Z

Ok, it seems Chrome in headless mode does not support PDF files.
puppeteer/puppeteer#1872

ziflex · 2019-04-15T21:19:21Z

// Login works fine.
LET doc = DOCUMENT("https://dzone.com/users/login.html", true)

WAIT_ELEMENT(doc, "form", 25000)

INPUT(doc, "form[role=form] input[name=j_username]", @username, 5)
INPUT(doc, "form[role=form] input[name=j_password]", @password, 5)
CLICK(doc, "button[type=submit]")
WAIT_NAVIGATION(doc, 25000)

// Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
LET origin_url = "https://dzone.com/users/" + @userid + "/" + @username + ".html?sort=saved"
NAVIGATE(doc, origin_url, 25000)
WAIT_ELEMENT(doc, 'p[class=comment-title]', 50000)

LET titles = ELEMENTS(doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
LET links = (
  FOR el IN titles
    LET refcard_name = ELEMENT(el, "a")
    LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
    RETURN refcard_url
)

// On each Refcard page, read the download URL from the 'dz-download' element's 'asset' attribute instead of clicking the button.
FOR link_url IN links
  NAVIGATE(doc, link_url, 50000)
  WAIT_ELEMENT(doc, 'dz-download[asset]', 5000)
  
  WAIT(1000)

  LET el = ELEMENT(doc, 'dz-download[asset]')
  LET attr = el.attributes.asset

  RETURN "https://dzone.com" + SUBSTITUTE(attr, "'", "")

Here is an updated query that does not require opening PDF files.

luckylittle · 2019-04-23T04:28:03Z

Thanks for pointing me in the right direction with non-headless mode. I was testing this on CentOS and Chromium v72.0.3626.0. Unfortunately, not all 297 of them can be fully automated. Some have an interstitial/advertisement page before you can access the file, e.g.:

  {
    "name": "297_GitOps_for_Kubernetes.pdf",
    "url": "https://dzone.com/interstitial?asset=2768919&item=358521"
  }

And also the Download button on some of the old ones (pre-2016) directly downloads the file instead of opening it in the browser:

New - opened in the browser:
https://dzone.com/asset/download/83632

Old - immediately downloaded:
https://dzone.com/asset/download/148
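
The two URL shapes above can be told apart before deciding how to handle each result. The following is a rough Python sketch that sorts collected URLs into interstitial pages and direct asset links, based only on the URL patterns shown in this comment. The function name `classify` and the returned labels are hypothetical. It does not assume that the `asset` query parameter maps to an `/asset/download/` ID, since that is not confirmed here.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def classify(url: str) -> Tuple[str, Optional[str]]:
    """Return (kind, identifier) for a collected Refcard URL.

    kind is "interstitial", "asset", or "unknown".
    """
    parsed = urlparse(url)
    if parsed.path == "/interstitial":
        # Keep the raw `asset` query value; its meaning is not verified.
        asset = parse_qs(parsed.query).get("asset", [None])[0]
        return ("interstitial", asset)
    if parsed.path.startswith("/asset/download/"):
        # The last path segment is the number shown in the download URL.
        return ("asset", parsed.path.rsplit("/", 1)[-1])
    return ("unknown", None)
```

Results classified as "interstitial" could then be set aside for manual handling, while "asset" URLs are passed on to whatever download step is in use.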

Either way, I made some progress and automated most of these tasks. Thanks for your help.

ziflex added the type/question Further information is requested label Apr 12, 2019

luckylittle closed this as completed Apr 23, 2019
