
Cloning website #186

Open
jaanuska opened this issue Nov 21, 2015 · 7 comments

@jaanuska

I want to build a system that makes an exact cloned copy of a website and stores it locally. All links in the pages have to be rewritten to point to the local structure, e.g. www.example.com/resource.jpg -> //local/file/system/mirrors/www.example.com/resource.jpg, so that users can browse the copy of the website locally.
In addition, all content also needs to be sent to Solr.
As I understand it, the keepDownloads option is not meant for this purpose. Is there any other way to "clone" a website to a local file system today? If not, should I implement my own committer, using the ICommitter interface for example?
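If I go the custom committer route, here is a rough sketch of what I have in mind, assuming the ICommitter interface from Norconex Committer Core 2.x; the mirror root and the URL-to-path mapping are placeholders for illustration only:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import com.norconex.commons.lang.map.Properties;
import com.norconex.committer.core.ICommitter;

// Sketch of a committer that mirrors each crawled document under a
// local root, reproducing the host/path structure of the source URL.
public class MirrorCommitter implements ICommitter {

    // Placeholder mirror root, for illustration only.
    private final Path mirrorRoot = Paths.get("/local/file/system/mirrors");

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        try {
            URL url = new URL(reference);
            String path = url.getPath().isEmpty() ? "/" : url.getPath();
            if (path.endsWith("/")) {
                path += "index.html";
            }
            Path target = mirrorRoot.resolve(url.getHost() + path);
            Files.createDirectories(target.getParent());
            // Overwrite on recrawl so the mirror stays current.
            Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (Exception e) {
            throw new RuntimeException("Could not mirror " + reference, e);
        }
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // Delete the local copy when the page is gone from the site.
    }

    @Override
    public void commit() {
        // Nothing to flush; files are written as they arrive.
    }
}
```

The missing piece would still be rewriting the links inside each HTML document to local destinations before writing it out.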

@OkkeKlein

I think there are plenty of tools online for cloning a website. You can then maybe use https://www.norconex.com/collectors/collector-filesystem/ to index that content with Solr.

@essiembre

Is the solution proposed by @OkkeKlein working for you?

Otherwise, we can transform this ticket into a feature request to allow pluggable implementations of how files are saved when downloads are kept. One such implementation could try to mimic the same directory structure as the website being crawled. Would that be useful to you?

@jaanuska

I would still like to use Norconex exclusively in my project :). At the moment I am trying to write something myself.
In addition to getting a mirror of the website with all links redirected to local destinations, I need a solution for storing the modified files after every recrawl cycle.

@essiembre

Marking as a feature request: a way to override/customize how downloaded files are stored.

@dgomesbr

dgomesbr commented Sep 28, 2017

+1.

Maybe an exporter, rather than changing how files are stored.

@essiembre

@dgomesbr, can you elaborate on what your exporter would look like? There are a few challenges with cloning websites in general. Dynamic, JavaScript-rendered sites do not work well as static copies, and sites sometimes have server-side logic we know nothing about. One example: http://example.com/home can lead to a home page, so we would make "home" a file. But with what extension? If we do not give it ".html", the cloned static site will not open it properly; if we do give it ".html", we have to update all references to it. And what if http://example.com/home/about.html also exists? Then "home" needs to be a directory, but we have already made it a file.
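To make the ambiguity concrete: one convention that sidesteps it is to never map an extensionless URL to a plain file, and instead always make it a directory holding an index.html. A hypothetical mapper, purely to illustrate the idea:

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical URL-to-path mapper, for illustration only: an
// extensionless path segment always becomes a directory that holds
// an index.html, so a file and a directory never compete for a name.
public final class StaticPathMapper {

    public static Path toLocalPath(Path mirrorRoot, String url) {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";
        } else {
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            if (!lastSegment.contains(".")) {
                // http://example.com/home -> example.com/home/index.html,
                // leaving room for example.com/home/about.html next to it.
                path += "/index.html";
            }
        }
        return mirrorRoot.resolve(uri.getHost() + path);
    }
}
```

With this convention, http://example.com/home maps to example.com/home/index.html while http://example.com/home/about.html maps to example.com/home/about.html, so the two no longer collide. The trade-off is the one mentioned above: every reference to /home in the crawled HTML still has to be rewritten to /home/index.html, or the mirror has to be served by a web server that resolves directory indexes.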

We could add enough configuration options, but it may be hard to come up with a one-size-fits-all solution, so maybe it is best to expose an interface people can extend to custom-code how they want the cloning done?
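Something as small as this could serve as the extension point (a hypothetical interface with made-up names; nothing like it exists in the collector today):

```java
import java.nio.file.Path;

// Hypothetical extension point (made-up name, not an existing
// Norconex interface): given a document reference, decide where
// its local copy should be written when downloads are kept.
public interface IDownloadPathResolver {

    Path resolveLocalPath(String reference);
}
```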

@dgomesbr

dgomesbr commented Oct 4, 2017

The behavior I'm proposing is like what https://www.httrack.com/ does: copy everything locally as static HTML. I don't have an opinion on how SPAs (single-page applications) and other JavaScript scenarios should be treated.
