[MediaWiki] Generators

Chen edited this page Aug 30, 2018 · 29 revisions

The MediaWiki API uses the notions of lists and generators to represent a (possibly long) sequence of items, e.g. allpages, categorymembers, etc. Usually each item in a MediaWiki "list" contains the page ID, title, and namespace ID of a page (i.e. WikiPageStub in WCL). For some lists, such as recentchanges, the items also contain extra information, such as the timestamp of the change and the name of the user who made it. For lists like users, the items are users rather than basic page information.

A MediaWiki "generator", on the other hand, is fed with a "list" and generates a sequence of pages (i.e. WikiPage in WCL), optionally including page content. Generators are convenient if you are more interested in retrieving page properties and/or content, but they lose the list-specific extra information, such as the timestamp and editor of a change reported by recentchanges.

Not every MediaWiki list can be used to feed a generator. users and abusefilters are examples: they do not produce a sequence of pages, so it would be pointless to use them as generators. Conversely, a generator may be fed by something other than a MediaWiki "list", e.g. links is only a property of a MediaWiki page, but it can still be used to feed a generator.

For long sequences, the MediaWiki API splits the result into multiple parts, and the client should use query continuations to ask for the next part. In WCL, continuations are encapsulated in IAsyncEnumerable<T>, which requests more results from the server when necessary. You just need to keep enumerating the returned IAsyncEnumerable<T>.
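To illustrate how a continuation loop can hide behind an asynchronous sequence, here is a minimal sketch. It does not use WCL or the MediaWiki API at all: FetchBatchAsync, RequestCount and the integer offset are hypothetical stand-ins for one API request and its continuation token, and the code assumes C# 8 or later with the built-in IAsyncEnumerable<T> (.NET Core 3.0 or newer) rather than the Ix.Async interfaces discussed on this page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ContinuationSketch
{
    // Number of simulated API requests sent so far.
    public static int RequestCount;

    // Hypothetical stand-in for one API request: returns up to `limit` items
    // starting at `offset`, plus the offset to continue from (null when done).
    static Task<(List<int> Items, int? Continue)> FetchBatchAsync(int offset, int limit, int total)
    {
        RequestCount++;
        int count = Math.Max(0, Math.Min(limit, total - offset));
        var items = Enumerable.Range(offset, count).ToList();
        int next = offset + count;
        return Task.FromResult((Items: items, Continue: next < total ? next : (int?)null));
    }

    // Wraps the continuation loop in an async iterator. The next batch is
    // requested only when the consumer has used up the current one.
    public static async IAsyncEnumerable<int> EnumItemsAsync(int paginationSize, int total)
    {
        if (paginationSize < 1)
            throw new ArgumentOutOfRangeException(nameof(paginationSize));
        int? offset = 0;
        while (offset != null)
        {
            var (items, next) = await FetchBatchAsync(offset.Value, paginationSize, total);
            foreach (var item in items)
                yield return item;
            offset = next;
        }
    }

    public static async Task Main()
    {
        int seen = 0;
        // Stop after 5 of the 10 available items, fetching 3 per request.
        await foreach (var item in EnumItemsAsync(paginationSize: 3, total: 10))
        {
            Console.WriteLine(item);
            if (++seen == 5) break;
        }
        // Only 2 requests were needed; the remaining batches were never fetched.
        Console.WriteLine("Requests sent: " + RequestCount);
    }
}
```

The consumer never sees the continuation value. It only decides how far to enumerate, which is why stopping early also stops further requests.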

In WCL, lists and generators are represented by classes derived from WikiList<TItem> and WikiPageGenerator<TItem, TPage>, respectively. Because the latter is itself derived from the former, every WikiPageGenerator-derived class can be used either as a list or as a generator, depending on your needs. Likewise, list-like properties of a page and their corresponding generators are represented by classes derived from WikiPagePropertyGenerator<TItem, TPage>, which is derived from WikiPagePropertyList<TItem>.

(Image: wcl-generator-classes)

WCL ships a number of ready-made generators in the WikiClientLibrary.Generators namespace. You can also implement your own generator classes if necessary; take a look at the library code for reference.

Library references

How to work with IAsyncEnumerable<T>

IAsyncEnumerable<T> and IAsyncEnumerator<T> are introduced by the Ix.Async package as the asynchronous counterparts of IEnumerable<T> and IEnumerator<T>. With the Ix.Async package, you can consume these asynchronous sequences in much the same way as you work with ordinary enumerables.

  • You can use all the LINQ extension methods on IAsyncEnumerable<T>.
  • You can use the Rx.NET package to convert an IAsyncEnumerable<T> to IObservable<T>, if necessary.
  • You can consume the items of an IAsyncEnumerable<T> sequentially by applying the expanded for-each pattern to its IAsyncEnumerator<T> (see the ShowAllTemplatesAsync method below for an example). C# 8 has since introduced await foreach, which can replace this pattern if your version of the library exposes the built-in .NET IAsyncEnumerable<T> (the one with GetAsyncEnumerator and MoveNextAsync).

Some caveats when consuming the IAsyncEnumerable<T> obtained from the generator classes in Wiki Client Library:

  • Choose a proper PaginationSize. It is the maximum number of items fetched from the server in one MediaWiki API request. For example, if you are working with the top 50 items from RecentChangesGenerator, you might choose 50 rather than the default of 10 as the PaginationSize value, so that they are all fetched in a single request.
  • The maximum allowed PaginationSize is usually 500 for normal users, and 5000 for users with the api-highlimits right (typically bots and sysops).
    • If you are using the PageQueryOptions.FetchContent flag with EnumPagesAsync, this limit is lowered to 1/10, i.e. 50 for normal users, and 500 for users with the api-highlimits right.
    • If you are using the PageQueryOptions.FetchExtract flag with EnumPagesAsync, this limit is lowered to 10 for normal users, and 20 for users with the api-highlimits right.
    • To keep network traffic stable, it is advised that you use 50 for typical batch processing of WikiPage instances. PyWikiBot also uses this value as the default pagination size in its site.preloadpage method.
  • Do not forget to chain the returned IAsyncEnumerable<T> with Take(n) if you are only interested in the top n items of the sequence.
  • In most cases, do not attempt to reverse or sort a sequence returned by WCL using LINQ methods (e.g. AsyncEnumerable.Reverse) unless you know what you are doing, because such methods have to enumerate the whole sequence first. Instead, take a look at the properties of the generator, which may include options for sorting.
  • A common idiom for fetching a small number of results from the generator is as follows.
    • If you are working with a large number of pages, it is recommended that you convert the returned IAsyncEnumerable<T> to something like IObservable<T> or ISourceBlock<T>, or use the expanded for-each pattern.
static async Task ShowRecentChangesAsync()
{
    var generator = new RecentChangesGenerator(myWikiSite)
    {
        // Choose wisely.
        PaginationSize = 50,
        // Configure the generator, e.g. setting filter/sorting criteria
        NamespaceIds = new[] {BuiltInNamespaces.Main, BuiltInNamespaces.File},
        AnonymousFilter = PropertyFilterOption.WithProperty
    };
    // Gets the latest 50 changes made to article and File: namespace,
    // by anonymous users.
    var items = await generator.EnumItemsAsync().Take(50).ToList();
    foreach (var i in items)
    {
        Console.WriteLine(i.Title);
        // Show revision comments.
        Console.Write("\t");
        Console.WriteLine(i.Comment);
    }

    // When you want to fetch extracts for the pages, it's safe to fetch for no more than
    // 10 pages at one time.
    generator.PaginationSize = 10;
    // Gets the latest 50 pages in article and File: namespace that were changed
    // by anonymous users.
    var pages = await generator.EnumPagesAsync(PageQueryOptions.FetchExtract).Take(50).ToList();
    foreach (var i in pages)
    {
        Console.WriteLine(i.Title);
        // Show abstract for each revised page.
        Console.Write("\t");
        Console.WriteLine(i.Extract);
    }
}
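The pagination limits listed in the caveats earlier in this section can be summarized as a small helper. This is only a sketch of the numbers stated on this page: FetchKind, PaginationLimits, MaxPaginationSize and Clamp are hypothetical names and are not part of WCL, and the actual limits are enforced by the MediaWiki server and may differ per site.

```csharp
using System;

// Hypothetical flags standing in for "fetch content" / "fetch extract" options.
[Flags]
public enum FetchKind { None = 0, Content = 1, Extract = 2 }

public static class PaginationLimits
{
    // Largest pagination size according to the limits described on this page.
    public static int MaxPaginationSize(bool hasApiHighLimits, FetchKind fetch)
    {
        // Extracts have the strictest limit: 10, or 20 with api-highlimits.
        if (fetch.HasFlag(FetchKind.Extract))
            return hasApiHighLimits ? 20 : 10;
        // Base limit: 500, or 5000 with api-highlimits.
        int limit = hasApiHighLimits ? 5000 : 500;
        // Fetching content lowers the limit to one tenth.
        return fetch.HasFlag(FetchKind.Content) ? limit / 10 : limit;
    }

    // Clamps a requested size into the range [1, maximum].
    public static int Clamp(int requested, bool hasApiHighLimits, FetchKind fetch)
        => Math.Max(1, Math.Min(requested, MaxPaginationSize(hasApiHighLimits, fetch)));
}

public static class Program
{
    public static void Main()
    {
        // A normal user asking for 500 pages with content gets 50 per request.
        Console.WriteLine(PaginationLimits.Clamp(500, false, FetchKind.Content)); // prints 50
        // A user with api-highlimits asking for extracts gets at most 20.
        Console.WriteLine(PaginationLimits.Clamp(50, true, FetchKind.Extract));   // prints 20
    }
}
```

Clamping on the client side mainly avoids surprises: asking for more than the server allows does not make a batch any larger.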

How to consume IWikiList-implementation classes

static async Task SearchAsync()
{
    Console.Write("Enter your search keyword: ");
    var generator = new SearchGenerator(myWikiSite, Console.ReadLine())
    {
        PaginationSize = 22
    };
    // We are only interested in the top 20 items.
    foreach (var item in await generator.EnumItemsAsync().Take(20).ToList())
    {
        Console.WriteLine(item);
        Console.WriteLine("\t{0}", item.Snippet);
    }
}

Most of the WikiPageGenerator-derived classes (including AllPagesGenerator) implement IWikiList<WikiPageStub>, i.e. .EnumItemsAsync() returns a sequence of WikiPageStub. If you are only interested in the titles of the pages, consider using .EnumItemsAsync() instead of .EnumPagesAsync().

Still, there are some classes implementing IWikiList<T> where T is something other than WikiPageStub, including:

  • class RecentChangesGenerator : WikiPageGenerator<RecentChangeItem, WikiPage>, IWikiList<RecentChangeItem>, IWikiPageGenerator<WikiPage>
  • class SearchGenerator : WikiPageGenerator<SearchResultItem, WikiPage>, IWikiList<SearchResultItem>, IWikiPageGenerator<WikiPage>
  • class GeoSearchGenerator : WikiPageGenerator<GeoSearchResultItem, WikiPage>, IWikiList<GeoSearchResultItem>, IWikiPageGenerator<WikiPage>
  • class RevisionsGenerator : WikiPagePropertyGenerator<Revision, WikiPage>, IWikiList<Revision>, IWikiPageGenerator<WikiPage>

This allows these lists, when used with EnumItemsAsync, to provide more information than just the ID, title, and namespace ID of the pages (as in WikiPageStub).

How to consume IWikiPageGenerator-implementation classes

static async Task ShowAllTemplatesAsync()
{
    var generator = new AllPagesGenerator(myWikiSite)
    {
        StartTitle = "A",
        NamespaceId = BuiltInNamespaces.Template,
        PaginationSize = 50
    };
    // You can specify EnumPagesAsync(PageQueryOptions.FetchContent),
    // if you are interested in the content of each page
    using (var enumerator = generator.EnumPagesAsync().GetEnumerator())
    {
        int index = 0;
        // The Ix.Async enumerator used here predates the "await foreach"
        // statement of C# 8, so to handle the items in sequence one by one
        // we use the expanded for-each pattern.
        while (await enumerator.MoveNext(CancellationToken.None))
        {
            var page = enumerator.Current;
            Console.WriteLine("{0}: {1}", index, page);
            index++;
            // Prompt user to continue listing, every 50 pages.
            if (index % 50 == 0)
            {
                Console.WriteLine("Esc to exit, any other key for next page.");
                if(Console.ReadKey().Key == ConsoleKey.Escape)
                    break;
            }
        }
    }
}

Some more example code

static async Task HelloWikiGenerators()
{
    // Create a MediaWiki API client.
    var wikiClient = new WikiClient();
    // Create a MediaWiki site instance.
    var site = await WikiSite.CreateAsync(wikiClient, "https://en.wikipedia.org/w/api.php");
    // List all pages starting from item "Wiki", without redirect pages.
    var allpages = new AllPagesGenerator(site)
    {
        StartTitle = "Wiki",
        RedirectsFilter = PropertyFilterOption.WithoutProperty
    };
    // Take the first 1000 results
    var pages = await allpages.EnumPagesAsync().Take(1000).ToList();
    foreach (var p in pages)
        Console.WriteLine("{0, -30} {1, 8}B {2}", p, p.ContentLength, p.LastTouched);
            
    // List the first 10 subcategories in Category:Cats
    Console.WriteLine();
    Console.WriteLine("Cats");
    var catmembers = new CategoryMembersGenerator(site, "Category:Cats")
    {
        MemberTypes = CategoryMemberTypes.Subcategory
    };
    pages = await catmembers.EnumPagesAsync().Take(10).ToList();
    foreach (var p in pages)
        Console.WriteLine("{0, -30} {1, 8}B {2}", p, p.ContentLength, p.LastTouched);
}