Cannot scrape this site #592

@LRXYZ

Sample page from a site I am trying to scrape:
https://en.tutiempo.net/records/lemg/1-may-2023.html

Much of the HTML is loaded dynamically with JavaScript, but HtmlUnit should be able to handle that, right?

Inspecting the page in Chrome, I see that it has hundreds of <td> elements.
But the code below prints an empty list.

Can anyone see what the issue is?

I am using HtmlUnit 2.70.0.

import java.net.URL;

import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Test {

	public static void main(String[] args) throws Exception {
		String url = "https://en.tutiempo.net/records/lemg/1-may-2023.html";
		URL u = new URL(url);
		HttpMethod m = HttpMethod.GET;
		WebRequest request = new WebRequest(u, m);

		WebClient webClient = new WebClient();
		webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
		webClient.getOptions().setThrowExceptionOnScriptError(false);
		webClient.getOptions().setUseInsecureSSL(true);
		webClient.getOptions().setJavaScriptEnabled(true);
		webClient.getOptions().setRedirectEnabled(true);
		HtmlPage page = webClient.getPage(request);
		System.out.println("page: " + page.getElementsByTagName("td"));
		webClient.close();
	}
}
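
One thing that may be worth ruling out (an assumption, not a confirmed diagnosis for this site): the code reads the <td> elements immediately after getPage returns, while background JavaScript may still be filling in the table. HtmlUnit offers webClient.waitForBackgroundJavaScript(timeoutMillis) for this. The sketch below shows the same idea with a small polling helper that uses only the standard library. The Poller class and the pollUntil method are hypothetical names introduced for illustration and are not part of HtmlUnit.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper (not part of HtmlUnit): re-reads a value until it passes a check or a timeout expires. */
final class Poller {

    private Poller() {}

    /**
     * Calls read repeatedly, sleeping intervalMillis between attempts,
     * until ready accepts the value or timeoutMillis has elapsed.
     * Returns the last value read, which may still fail the check.
     */
    static <T> T pollUntil(Supplier<T> read, Predicate<? super T> ready,
                           long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        T value = read.get();
        while (!ready.test(value) && System.nanoTime() < deadline) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up early with the current value.
                Thread.currentThread().interrupt();
                return value;
            }
            value = read.get();
        }
        return value;
    }
}
```

With the code above, the lookup could then be written as Poller.pollUntil(() -> page.getElementsByTagName("td"), list -> !list.isEmpty(), 10_000, 500), where the 10 second timeout and 500 ms interval are arbitrary values. If the list is still empty after the timeout, the cause probably lies elsewhere, for example a script error that setThrowExceptionOnScriptError(false) is hiding.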
