trouble with wikis with more than 500 pages #32

Open
anarcat opened this issue Jan 14, 2016 · 6 comments

anarcat commented Jan 14, 2016

it seems virtually impossible to fetch all the pages from wikivoyage. the wiki is not that large (dumps are 75MB) so it should be possible to fetch all the changes. however, getting all the revisions (over 2M revisions!) seems to be a little prohibitive:

$ git -c remote.origin.fetchStrategy=by_rev -c remote.origin.shallow=true clone mediawiki::https://en.wikivoyage.org/w/
Clonage dans 'w'...
Searching revisions...
No previous mediawiki revision found, fetching from beginning.
Fetching & writing export data by revs...
Listing pages on remote wiki...
500 pages found.
1/2922483: Revision #1 of 1770
2/2922483: Revision #2 of 1liner
3/2922483: Revision #3 of 1st
4/2922483: Revision #4 of 1st
7/2922483: Revision #7 of 2010_FIFA_World_Cup
11/2922483: Revision #11 of 's-Graveland
[...]

notice here how remote.origin.shallow=true is not having any effect... maybe that combination should be an error, but that's another thing.
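
a sketch of what an explicit guard in git-remote-mediawiki.perl might look like; the variable names $fetch_strategy and $shallow_import are assumptions for illustration, not necessarily the ones the script really uses:

# hypothetical check: shallow only makes sense with the by_page strategy,
# so refuse the combination instead of silently ignoring shallow=true
if ($fetch_strategy eq 'by_rev' && $shallow_import) {
    die "remote.<name>.shallow=true has no effect with fetchStrategy=by_rev; use by_page for a shallow import\n";
}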

trying just shallow gets only 500 pages, probably the API limit of mediawiki:

$ git -c remote.origin.shallow=true clone mediawiki::https://en.wikivoyage.org/w/
Clonage dans 'w'...
Searching revisions...
No previous mediawiki revision found, fetching from beginning.
Fetching & writing export data by pages...
Listing pages on remote wiki...
500 pages found.
page 1/500: Aggressive animals
  Found 1 revision (shallow import).
page 2/500: Adjaria
  Found 1 revision (shallow import).
[...]

could there be a hack similar to #16 to get all the pages?
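
for context, the 500 cap is the per-request limit of the list=allpages API; the server returns a continuation token that the client has to feed back until the list is exhausted. a rough sketch of that loop with MediaWiki::API, assuming the modern 'continue' protocol (older MediaWikis answered with query-continue instead):

use strict;
use warnings;
use MediaWiki::API;

my $mw = MediaWiki::API->new({ api_url => 'https://en.wikivoyage.org/w/api.php' });

my %pages;
my %continue;                                  # continuation tokens from the previous reply
while (1) {
    my $result = $mw->api({
        action  => 'query',
        list    => 'allpages',
        aplimit => 'max',                      # 500 for normal users, 5000 for bots
        %continue,
    }) or die $mw->{error}->{code} . ': ' . $mw->{error}->{details};

    foreach my $page (@{ $result->{query}->{allpages} }) {
        $pages{ $page->{title} } = $page;
    }

    last unless $result->{continue};           # no continue block means we are done
    %continue = %{ $result->{continue} };      # carries apcontinue (and continue)
}
print scalar(keys %pages), " pages found.\n";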

anarcat commented Jan 14, 2016

to give you an idea, in 10 minutes i was able to get ~1000 revisions out of the ~3 million revisions. that is 0.03%. at this rate, it would take 20 days to get a clone of the wiki. :)
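
(back-of-the-envelope, with the 2,922,483 total from the transcript above:)

my $total   = 2_922_483;                 # revisions reported by the clone
my $per_min = 1000 / 10;                 # ~100 revisions per minute observed
printf "%.2f%% done, ~%.0f days for a full clone\n",
    100 * 1000 / $total,                 # ~0.03%
    $total / $per_min / 60 / 24;         # ~20 days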

anarcat commented Jan 14, 2016

oh and oops: all the revisions are actually between 1GB and 5GB, not 80MB: the 80MB dump has only the latest revisions!

https://dumps.wikimedia.org/enwikivoyage/20151201/

so yeah, definitely need to go through a shallow copy, and need to overcome the 500 pages limitation.

anarcat commented Jan 14, 2016

and here's a patch:

--- a/contrib/mw-to-git/git-remote-mediawiki.perl
+++ b/contrib/mw-to-git/git-remote-mediawiki.perl
@@ -281,17 +281,30 @@ sub get_mw_tracked_namespaces {
 sub get_mw_all_pages {
        my $pages = shift;
        # No user-provided list, get the list of pages from the API.
-       my $mw_pages = $mediawiki->list({
+        my $query = {
                action => 'query',
                list => 'allpages',
                aplimit => 'max'
-       });
-       if (!defined($mw_pages)) {
+       };
+        my $curpage;
+        my $oldpage = '';
+        while (1) {
+            if (defined($curpage)) {
+                if ($oldpage eq $curpage) {
+                    last;
+                }
+                $query->{apfrom} = $curpage;
+                $oldpage = $curpage;
+            }
+            my $mw_pages = $mediawiki->list($query);
+            if (!defined($mw_pages)) {
                fatal_mw_error("get the list of wiki pages");
-       }
-       foreach my $page (@{$mw_pages}) {
+            }
+            foreach my $page (@{$mw_pages}) {
                $pages->{$page->{title}} = $page;
-       }
+                $curpage = $page->{title};
+            }
+        }
        return;
 }

note that it may seem to hang for large wikis... i thought of doing a progress bar, but couldn't get $| to work somehow... oh well.
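
(a likely reason $| had no effect: it only applies to the currently selected handle, STDOUT by default, while a remote helper has to keep STDOUT for the fast-import stream and so presumably prints its messages to STDERR. flushing STDERR explicitly would look like this; the surrounding variables are placeholders:)

use IO::Handle;                           # autoflush() method on handles
STDERR->autoflush(1);                     # flush each progress line immediately

# ... inside the page listing loop:
print {*STDERR} "page $i/$npages: $title\n";   # $i, $npages, $title are illustrative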

akhuettel commented

Patch added to Gentoo git patchset, thanks!

tobbez added a commit to tobbez/Git-Mediawiki that referenced this issue Jun 22, 2017
anarcat commented Oct 29, 2017

i pushed this on https://github.com/anarcat/git/tree/large-wikis

anarcat commented Nov 21, 2017

filed the patch as #52.
