## Olympians by Wikipedia Pageviews

### Get a list of competing Olympians

Getting a list of people competing in the Olympics turns out to be pretty difficult. Olympians are named by just over 200 National Olympic Committees, each of which publishes a list of its athletes (maybe).

The closest thing to a full list is the apparently maniacally upkept [sports-reference.com](http://www.sports-reference.com/olympics/summer/2012/) list, which tries to include every Olympian ever. However, it does not have data on the 2016 Olympics available yet.

It is possible to search Wikidata for [participants at the 2016 Summer Games](https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%0AWHERE%0A%7B%0A%09%3Fitem%20wdt%3AP1344%20wd%3AQ8613%20.%20%0A%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20%7D%0A%7D%0A), but the query returns hilariously few results.
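The query behind that link is stored percent-encoded in the URL fragment, which makes it hard to read. As a small sketch, it can be decoded with the standard library alone; the helper name `sparql_from_link` is made up for this illustration.

```python
from urllib.parse import unquote, urlsplit

# The link from the paragraph above, split across lines for readability.
QUERY_LINK = (
    "https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%0AWHERE%0A%7B%0A"
    "%09%3Fitem%20wdt%3AP1344%20wd%3AQ8613%20.%20%0A"
    "%09SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20"
    "wikibase%3Alanguage%20%22en%22%20%7D%0A"
    "%7D%0A"
)


def sparql_from_link(link):
    """Return the percent-decoded SPARQL text stored in the link's fragment."""
    return unquote(urlsplit(link).fragment)


if __name__ == "__main__":
    print(sparql_from_link(QUERY_LINK))
```

Decoded, the query is a single triple pattern, `?item wdt:P1344 wd:Q8613`, plus the label service: it only finds items that carry a P1344 statement pointing at Q8613, so any athlete whose item lacks that statement is missed.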

Individual Wikipedias maintain categories for such things, so another approach is to use those. This turns out to be very reasonable: there are ~11,000 Olympians, and [Category:Competitors_at_the_2016_Summer_Olympics](https://en.wikipedia.org/wiki/Category:Competitors_at_the_2016_Summer_Olympics) on the English Wikipedia covers, according to a hand count, at least 10,000 of them. It's quite possible it contains all of them.

Interestingly, for those athletes who lacked an article at the beginning of the games, the coverage seems to be due to the work of [one particularly indefatigable user](https://en.wikipedia.org/w/index.php?title=Special:Contributions&offset=20170101000000&limit=100&contribs=user&target=Yellow+Dingo&namespace=&tagfilter=&newOnly=1).

Thus getting a list which is *almost* complete (I doubt that it is totally complete) is a matter of reading in this category.

We use the `mwapi` package for this task.

In [1]:
import mwapi
session = mwapi.Session('https://en.wikipedia.org', user_agent='olympian-list-fetch')
print(session.get(action='query', meta='userinfo'))

{'batchcomplete': '', 'query': {'userinfo': {'id': 0, 'anon': '', 'name': '108.41.39.170'}}}


The category consists of a stack of subcategories, which we read in first.

In [2]:
olympian_wp_categories = session.get(action='query', 
            list='categorymembers', 
            cmtitle='Category:Competitors_at_the_2016_Summer_Olympics',
            cmlimit=500,
            cmtype='subcat')
olympian_wp_categories = [cat['title'] for cat in olympian_wp_categories['query']['categorymembers'] if cat['title'] != 'Category:LGBT sportspeople at the 2016 Summer Olympics']  # avoid double-counting

Then we read in more data from each of the subcategories. The categories might themselves have subcategories, which we have to scope out as well.

Build a list of all cats and subcats.

In [3]:
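# Build a list of all cats and subcats.
rio_cat_list = list(olympian_wp_categories)  # copy, so the original list is left untouched
# Appending to the list while iterating over it is deliberate: newly found
# subcategories are visited later in the same loop.
for cat in rio_cat_list:
    subcats = session.get(action='query',
                          list='categorymembers', 
                          cmtitle=cat,
                          cmlimit=500,
                          cmtype='subcat')
    # Add the subcats to the list, for further exploration.
    rio_cat_list += [subcat['title'] for subcat in subcats['query']['categorymembers']]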
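The loop above would visit a category twice if it is reachable by two paths, and it would never finish if two categories contained each other. A minimal sketch of a walk that guards against both follows. Here `walk_categories` and `get_subcats` are names invented for this example, and `get_subcats` stands in for the `session.get(...)` call so that the logic can be tried on a toy graph.

```python
from collections import deque


def walk_categories(roots, get_subcats):
    """Breadth-first walk that returns every reachable category title once.

    get_subcats is a hypothetical callable: title -> list of subcategory titles.
    """
    ordered = list(dict.fromkeys(roots))  # de-duplicated, original order kept
    visited = set(ordered)
    queue = deque(ordered)
    while queue:
        cat = queue.popleft()
        for sub in get_subcats(cat):
            if sub not in visited:  # skip anything already queued or visited
                visited.add(sub)
                ordered.append(sub)
                queue.append(sub)
    return ordered


if __name__ == "__main__":
    # Toy graph with a cycle (B points back to A) and a shared child (C).
    toy = {"A": ["B", "C"], "B": ["C", "A"], "C": []}
    print(walk_categories(["A"], lambda title: toy.get(title, [])))
```

With the notebook's session, `get_subcats` could wrap the same `categorymembers` query with `cmtype='subcat'` and return the `title` of each member.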

Fetch all of the articles in all of the categories and combine them.

In [4]:
from tqdm import tqdm

athletes_en = []
for cat in tqdm(rio_cat_list):
    params = dict(action='query',
                  list='categorymembers',
                  cmtitle=cat,
                  cmlimit=500,
                  cmtype='page')
    while True:
        q = session.get(**params)
        # Add this batch of page titles exactly once.
        athletes_en += [d['title'] for d in q['query']['categorymembers']]
        if 'continue' not in q:
            break
        # More results remain: resume from the continuation token.
        params['cmcontinue'] = q['continue']['cmcontinue']

# Athletes can appear in several categories, so de-duplicate.
athletes_en = set(athletes_en)

100%|██████████████████████████████████████████| 40/40 [00:06<00:00,  6.66it/s]


In [5]:
len(athletes_en)

9481

In [7]:
9481/11000

0.861909090909091

About 86% of athletes are represented.

Get the associated Wikidata entities.

At first I tried to use [the following seemingly undocumented capability](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=iwlinks&list=&titles=Michael+Phelps). As you can see below, this only works in a very small minority of cases. Why it works at all is unknown to me.

In [77]:
# wd_items = []
# for athlete in athletes_en:
#     q = session.get(action='query', 
#         format='json', 
#         prop='iwlinks',
#         titles='Michael Phelps')
#     print(q)
#     break

{'query': {'pages': {'19084502': {'iwlinks': [{'prefix': 'c', '*': 'Category:Michael_Phelps'}, {'prefix': 'd', '*': 'Q39562'}, {'prefix': 'n', '*': 'Special:Search/Michael_Phelps'}, {'prefix': 'q', '*': 'Special:Search/Michael_Phelps'}], 'pageid': 19084502, 'ns': 0, 'title': 'Michael Phelps'}}}, 'batchcomplete': ''}


In [84]:
# wd_items = []
# for athlete in tqdm(athletes_en):
#     q = session.get(action='query', 
#         format='json', 
#         prop='iwlinks',
#         titles=athlete)
#     q = q['query']['pages']
#     k = list(q.keys())[0]
#     q = q[k]
#     try:
#         q = q['iwlinks']
#     except:
#         continue
#     wd = [d['*'] for d in q if d['prefix'] == 'd']
#     wd_items += wd

100%|██████████████████████████████████████| 9452/9452 [18:07<00:00,  8.69it/s]


In [85]:
wd_items

['Q16231988',
 'Q10843402',
 'Q21531041',
 'Q24816347',
 'Q5580695',
 'Q39562',
 'Q1338865',
 'Q11459',
 'Q1616699']

Ok, this clearly (and surprisingly) doesn't work.

After reading documentation on interlanguage link management via Wikidata some more, I discovered that I can hackishly get what I want by examining the redirects I get bounced to when I query `Special:ItemByTitle` on Wikidata ([like so](https://www.wikidata.org/wiki/Special:ItemByTitle?site=enwiki&page=Michael_Phelps)).

The following solution is based on [this StackOverflow answer](http://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url).

Unfortunately it is really slow, but meh.

In [8]:
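import requests

wd_urls = []
for en_title in tqdm(athletes_en):
    # Pass the title via `params` so that characters such as "&" or "?" are URL-encoded.
    r = requests.get('https://www.wikidata.org/wiki/Special:ItemByTitle',
                     params={'site': 'enwiki', 'page': en_title})
    wd_urls.append(r.url)  # the final URL, after any redirect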

100%|██████████████████████████████████████| 9481/9481 [38:07<00:00,  4.66it/s]
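One request per athlete is what makes this step slow. A possibly faster route, sketched here as an alternative rather than what this notebook ran, is the MediaWiki query API's `prop=pageprops` with `ppprop=wikibase_item`, which accepts up to 50 titles per request for ordinary users. The helper names and `SAMPLE_RESPONSE` are invented for illustration; the sample is hand-written in the same response shape as the `iwlinks` output shown earlier, and `session` is assumed to be the `mwapi.Session` created at the top.

```python
def chunks(seq, size=50):
    """Yield successive slices of at most `size` items."""
    seq = list(seq)
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def ids_from_response(q):
    """Map page title -> Wikidata ID (or None) for one query response."""
    out = {}
    for page in q.get('query', {}).get('pages', {}).values():
        title = page.get('title')
        if title is not None:
            out[title] = page.get('pageprops', {}).get('wikibase_item')
    return out


def wikidata_ids(session, titles):
    """Look up Wikidata IDs for many titles, 50 titles per request."""
    found = {}
    for batch in chunks(titles):
        q = session.get(action='query', prop='pageprops',
                        ppprop='wikibase_item', titles='|'.join(batch))
        found.update(ids_from_response(q))
    return found


# Hand-written illustration of the response shape; not captured output.
SAMPLE_RESPONSE = {'query': {'pages': {
    '19084502': {'pageid': 19084502, 'ns': 0, 'title': 'Michael Phelps',
                 'pageprops': {'wikibase_item': 'Q39562'}},
    '-1': {'ns': 0, 'title': 'No Such Page', 'missing': ''},
}}}

if __name__ == "__main__":
    print(ids_from_response(SAMPLE_RESPONSE))
```

At 50 titles per request, 9,481 titles would need about 190 requests instead of 9,481. Pages without a linked item simply map to `None`.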


In the following, "S" means that no item was found (we were left on the `Special:ItemByTitle` page), while "Q" means that we were redirected to one.

In [9]:
import pandas as pd

pd.Series([u[:31] for u in wd_urls]).value_counts()

https://www.wikidata.org/wiki/Q    9440
https://www.wikidata.org/wiki/S      41
dtype: int64

We string match on the final URLs to recover the entity IDs.

In [10]:
wd_endpoints = [url.split('/')[-1] if url[30] != 'S' else None for url in wd_urls]

In [11]:
wd_endpoints[:5]

['Q25482912', 'Q26254395', 'Q17096778', 'Q60968', 'Q5108391']
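Checking a fixed character position works only as long as the URL prefix keeps exactly the same length. A slightly more defensive sketch, with the hypothetical helper name `entity_id_from_url`, parses the URL and accepts the last path segment only if it looks like an entity ID:

```python
import re
from urllib.parse import urlsplit

# A Wikidata item ID: "Q" followed by a positive integer.
ENTITY_ID = re.compile(r'Q[1-9]\d*')


def entity_id_from_url(url):
    """Return the 'Q...' ID if the URL points at an item page, else None."""
    last_segment = urlsplit(url).path.rsplit('/', 1)[-1]
    return last_segment if ENTITY_ID.fullmatch(last_segment) else None


if __name__ == "__main__":
    print(entity_id_from_url('https://www.wikidata.org/wiki/Q39562'))
    print(entity_id_from_url(
        'https://www.wikidata.org/wiki/Special:ItemByTitle?site=enwiki&page=Nobody'))
```

Applied to `wd_urls`, a list comprehension over this function would give the same kind of list as `wd_endpoints`.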

Package this into a DataFrame.

In [12]:
rio = pd.DataFrame(data={'enwiki_article': list(athletes_en), 'identifier': wd_endpoints})

(intermediate save)

In [15]:
rio.to_csv("2016_olympian_pageviews.csv")

These are the English article names. Now we will merge in the names used in all of the "other" languages, taken from the labels on each Wikidata entity.

In [19]:
wiki_dict = dict()  # language code -> list of labels, one entry per row of `rio`
n = 0               # number of rows processed so far

def wikify(entity_id):
    """Append this entity's label in each language to wiki_dict (None where absent)."""
    if entity_id:
        d = requests.get("https://www.wikidata.org/wiki/Special:EntityData/{0}.json".format(entity_id)).json()
        labels = d['entities'][entity_id]['labels']
        for lang in labels.keys():
            value = labels[lang]['value']
            if lang in wiki_dict.keys():
                wiki_dict[lang].append(value)
            else:
                # First time we see this language: backfill earlier rows with None,
                # then record the current value (it was previously dropped).
                wiki_dict[lang] = [None]*n + [value]
    # Pad every language that this entity did not supply.
    for lang in wiki_dict.keys():
        if len(wiki_dict[lang]) == n:
            wiki_dict[lang].append(None)

for entity_id in tqdm(rio['identifier']):
    wikify(entity_id)
    n += 1

100%|██████████████████████████████████████| 9481/9481 [31:41<00:00,  5.09it/s]
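The padding bookkeeping is easy to get wrong, so here is a sketch of the same reshaping done in one pass over already-downloaded entities, with no global state. All three helper names are invented for this example. `labels_of` reads the `labels` structure used above. `sitelinks_of` reads the entity's `sitelinks` block (keys such as `enwiki`), which is my assumption about what a later per-wiki pageview lookup would need, since labels are not guaranteed to match article titles.

```python
def labels_of(entity):
    """{language code: label} for one Wikidata entity dict."""
    return {lang: rec['value'] for lang, rec in entity.get('labels', {}).items()}


def sitelinks_of(entity):
    """{site id such as 'enwiki': article title} for one Wikidata entity dict."""
    return {site: rec['title'] for site, rec in entity.get('sitelinks', {}).items()}


def rows_to_columns(rows):
    """Turn a list of {key: value} dicts (or None) into equal-length columns.

    Missing values, and rows that are None, are filled with None.
    """
    keys = dict.fromkeys(key for row in rows for key in (row or {}))
    return {key: [(row or {}).get(key) for row in rows] for key in keys}


if __name__ == "__main__":
    # Tiny hand-written entities, for illustration only.
    entities = [
        {'labels': {'en': {'language': 'en', 'value': 'A'},
                    'de': {'language': 'de', 'value': 'A'}},
         'sitelinks': {'enwiki': {'site': 'enwiki', 'title': 'A'}}},
        None,  # an athlete with no Wikidata item
        {'labels': {'en': {'language': 'en', 'value': 'B'},
                    'fr': {'language': 'fr', 'value': 'B'}}},
    ]
    rows = [labels_of(e) if e else None for e in entities]
    print(rows_to_columns(rows))
```

The resulting dictionary can be assigned to DataFrame columns the same way `wiki_dict` is.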


Now assign that data to our DataFrame.

In [24]:
for lang in wiki_dict.keys():
    rio[lang] = wiki_dict[lang]

(intermediate save)

In [20]:
rio.to_csv("2016_olympian_pageviews.csv")

Now we can tie everything together by bringing in our pageview data.

In [1]:
from mwviews.api import PageviewsClient

In [None]:
p = PageviewsClient()
p.article_views('en.wikipedia', ['Selfie', 'Cat', 'Dog'])
p.project_views(['ro.wikipedia', 'de.wikipedia', 'commons.wikimedia'])
p.top_articles('en.wikipedia', limit=10)

# The client can do more than the API, for example monthly rollups for article views:
# p.article_views('en.wikipedia', ['Selfie', 'Cat'], granularity='monthly', start='2016020100', end='2016043000')

In [6]:
p = PageviewsClient()
p.project_views(['ro.wikipedia', 'de.wikipedia', 'commons.wikimedia'])

defaultdict(dict,
            {datetime.datetime(2016, 7, 24, 0, 0): {'commons.wikimedia': 15532515,
              'de.wikipedia': 35559142,
              'ro.wikipedia': None},
             datetime.datetime(2016, 7, 25, 0, 0): {'commons.wikimedia': 17790369,
              'de.wikipedia': 36038054,
              'ro.wikipedia': None},
             datetime.datetime(2016, 7, 26, 0, 0): {'commons.wikimedia': 16217568,
              'de.wikipedia': 37828679,
              'ro.wikipedia': None},
             datetime.datetime(2016, 7, 27, 0, 0): {'commons.wikimedia': 15852765,
              'de.wikipedia': 36907168,
              'ro.wikipedia': None},
             datetime.datetime(2016, 7, 28, 0, 0): {'commons.wikimedia': 15912657,
              'de.wikipedia': 34625468,
              'ro.wikipedia': None},
             datetime.datetime(2016, 7, 29, 0, 0): {'commons.wikimedia': 17037426,
              'de.wikipedia': 35709470,
              'ro.wikipedia': None},
             datetime.