
Cloning website #186

Open
jaanuska opened this issue Nov 21, 2015 · 7 comments

@jaanuska

I want to build a system that makes an exact cloned copy of a website and stores it locally. All links in the pages have to be rewritten to point to the local structure, e.g. www.example.com/resource.jpg -> //local/file/system/mirrors/www.example.com/resource.jpg, so that users can browse the copy of the website locally.
In addition, all content also needs to be sent to Solr.
As I understand it, the keepDownloads option is not meant for this purpose. Is there any other way to "clone" a website to a local file system today? If not, should I implement my own committer, using the ICommitter interface for example?
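If I go the custom committer route, here is a rough sketch of what I have in mind, assuming the ICommitter interface from Norconex Committer Core 2.x; the mirror root and the URL-to-path mapping are placeholders for illustration only:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import com.norconex.commons.lang.map.Properties;
import com.norconex.committer.core.ICommitter;

// Sketch of a committer that mirrors each crawled document under a
// local root, reproducing the host/path structure of the source URL.
public class MirrorCommitter implements ICommitter {

    // Placeholder mirror root, for illustration only.
    private final Path mirrorRoot = Paths.get("/local/file/system/mirrors");

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        try {
            URL url = new URL(reference);
            String path = url.getPath().isEmpty() ? "/" : url.getPath();
            if (path.endsWith("/")) {
                path += "index.html";
            }
            Path target = mirrorRoot.resolve(url.getHost() + path);
            Files.createDirectories(target.getParent());
            // Overwrite on recrawl so the mirror stays current.
            Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (Exception e) {
            throw new RuntimeException("Could not mirror " + reference, e);
        }
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // Delete the local copy when the page is gone from the site.
    }

    @Override
    public void commit() {
        // Nothing to flush; files are written as they arrive.
    }
}
```

The missing piece would still be rewriting the links inside each HTML document to local destinations before writing it out.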

@OkkeKlein

I think there are plenty of tools online for cloning a website. You can then maybe use https://www.norconex.com/collectors/collector-filesystem/ to index that content with Solr.

@essiembre

Is the solution proposed by @OkkeKlein working for you?

Otherwise, we can transform this ticket into a feature request to allow pluggable implementations of how files are saved when downloads are kept. One such implementation could try to mimic the same directory structure as the website being crawled. Would that be useful to you?

@jaanuska

I would still like to use Norconex exclusively in my project :). At the moment I am trying to write something myself.
In addition to getting a mirror of the website with all links redirected to local destinations, I need a solution for storing the modified files after every recrawl cycle.

@essiembre

Marking as a feature request: a way to override/customize how downloaded files are stored.

@dgomesbr

dgomesbr commented Sep 28, 2017

+1.

Maybe an exporter, rather than changing how files are stored.

@essiembre

@dgomesbr, can you elaborate on what your exporter would look like? There are a few challenges with cloning websites in general. Dynamic, JavaScript-rendered sites do not work well as static copies, and sites sometimes have server-side logic we know nothing about. One example: http://example.com/home can lead to a home page, so we would make "home" a file. But with what extension? If we do not give it ".html", the cloned static site will not open it properly; if we do give it ".html", we have to update all references to it. And what if http://example.com/home/about.html also exists? Then "home" needs to be a directory, but we have already made it a file.
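To make the ambiguity concrete: one convention that sidesteps it is to never map an extensionless URL to a plain file, and instead always make it a directory holding an index.html. A hypothetical mapper, purely to illustrate the idea:

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical URL-to-path mapper, for illustration only: an
// extensionless path segment always becomes a directory that holds
// an index.html, so a file and a directory never compete for a name.
public final class StaticPathMapper {

    public static Path toLocalPath(Path mirrorRoot, String url) {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";
        } else {
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            if (!lastSegment.contains(".")) {
                // http://example.com/home -> example.com/home/index.html,
                // leaving room for example.com/home/about.html next to it.
                path += "/index.html";
            }
        }
        return mirrorRoot.resolve(uri.getHost() + path);
    }
}
```

With this convention, http://example.com/home maps to example.com/home/index.html while http://example.com/home/about.html maps to example.com/home/about.html, so the two no longer collide. The trade-off is the one mentioned above: every reference to /home in the crawled HTML still has to be rewritten to /home/index.html, or the mirror has to be served by a web server that resolves directory indexes.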

We could add enough configuration options, but it may be hard to come up with a one-size-fits-all solution, so maybe it is best to expose an interface people can extend to custom-code how they want the cloning done?
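Something as small as this could serve as the extension point (a hypothetical interface with made-up names; nothing like it exists in the collector today):

```java
import java.nio.file.Path;

// Hypothetical extension point (made-up name, not an existing
// Norconex interface): given a document reference, decide where
// its local copy should be written when downloads are kept.
public interface IDownloadPathResolver {

    Path resolveLocalPath(String reference);
}
```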

@dgomesbr

dgomesbr commented Oct 4, 2017

The behavior I'm proposing is like what https://www.httrack.com/ does: copy everything locally as static HTML. I don't have an opinion on how SPAs (single-page applications) and other JavaScript scenarios should be treated.
