# Accessing the Wikipedia API
This notebook pulls some information from the Wikipedia API. This API is nice because it doesn't require authentication. (The Twitter API does require authentication; that's a necessary process to go through, but it takes some work.)

In [1]:
import requests

wikipedia_api_url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmlimit=10"

We can start by just building a simple query, getting 10 people born in 1973. We will use the delightful and amazing [`requests`](http://docs.python-requests.org/en/master/) library in Python. The format of the URL is based on a bunch of reading about the [Wikipedia API](https://www.mediawiki.org/wiki/API:Categorymembers) and trial and error. And error. 

In [2]:
full_url = wikipedia_api_url + "&cmtitle=Category:1973_births"

print(full_url)

r = requests.get(full_url)

https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmlimit=10&cmtitle=Category:1973_births
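
The same kind of URL can also be assembled from a dictionary instead of by gluing strings together. This is a minimal sketch using only the standard library; the names `base_url`, `params`, and `built_url` are mine, not part of the notebook. Note that `urlencode` percent-encodes the colon in the category title, which the API should treat the same as the literal `:`.

```python
from urllib.parse import urlencode

base_url = "https://en.wikipedia.org/w/api.php"

# Same parameters as the hand-written URL above, one key per parameter
params = {
    "action": "query",
    "format": "json",
    "list": "categorymembers",
    "cmlimit": 10,
    "cmtitle": "Category:1973_births",
}

# urlencode joins the pairs with "&" and percent-encodes special characters
built_url = base_url + "?" + urlencode(params)
print(built_url)
```

Keeping the parameters in a dictionary makes it easy to change one of them (say, the year) without editing a long string.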


In [3]:
from pprint import pprint

Let's use `pprint` (for "pretty print") to print out the JSON object returned by the API call.

In [4]:
pprint(r.json())

{'batchcomplete': '',
 'continue': {'cmcontinue': 'page|292b2b294d06042949294b292b03062949294b292b04292b2b294d011e01dcc0dcc0dcc0dc08|54679534',
              'continue': '-||'},
 'query': {'categorymembers': [{'ns': 0, 'pageid': 3657357, 'title': 'A-do'},
                               {'ns': 0,
                                'pageid': 16159192,
                                'title': 'Jacob Aagaard'},
                               {'ns': 0, 'pageid': 22785402, 'title': 'Aamani'},
                               {'ns': 0,
                                'pageid': 14516619,
                                'title': 'Hallvard Aamlid'},
                               {'ns': 0,
                                'pageid': 13700991,
                                'title': 'Jason Aaron'},
                               {'ns': 0,
                                'pageid': 6284418,
                                'title': 'Ann Kristin Aarønes'},
                               {'ns': 0,
         

Feel free to click on the link printed above. You'll see the same data that `pprint` printed here. Thanks Wikipedia!

In this next cell, type `r.` and press Tab to see all the attributes and methods available on the response object.

In [14]:
r.url

'https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmlimit=10&cmtitle=Category:1973_births'

One of the most useful is `r.json()`.

In [5]:
r.json()

{'batchcomplete': '',
 'continue': {'cmcontinue': 'page|292b2b294d06042949294b292b03062949294b292b04292b2b294d011e01dcc0dcc0dcc0dc08|54679534',
  'continue': '-||'},
 'query': {'categorymembers': [{'ns': 0, 'pageid': 3657357, 'title': 'A-do'},
   {'ns': 0, 'pageid': 16159192, 'title': 'Jacob Aagaard'},
   {'ns': 0, 'pageid': 22785402, 'title': 'Aamani'},
   {'ns': 0, 'pageid': 14516619, 'title': 'Hallvard Aamlid'},
   {'ns': 0, 'pageid': 13700991, 'title': 'Jason Aaron'},
   {'ns': 0, 'pageid': 6284418, 'title': 'Ann Kristin Aarønes'},
   {'ns': 0, 'pageid': 14513688, 'title': 'John-Ragnar Aarset'},
   {'ns': 0, 'pageid': 55915264, 'title': 'Dagfinn Aarskog (bobsleigh)'},
   {'ns': 0, 'pageid': 36759928, 'title': 'Eileen Abad'},
   {'ns': 0, 'pageid': 6822011, 'title': 'Aydo Abay'}]}}

Compare these results to the entry: https://en.wikipedia.org/wiki/Category:1973_births.

JSON objects look a lot like Python dictionaries. In this case, we've got three main keys: `batchcomplete`, `continue`, and `query`.
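
To make the JSON-to-dictionary connection concrete, here is a small offline sketch. The string `raw_text` is a hand-shortened stand-in for the response body (one category member, and the `cmcontinue` token left out), so treat it as illustrative rather than as the real payload.

```python
import json

# Hand-shortened stand-in for the text the API sends back
raw_text = """
{"batchcomplete": "",
 "continue": {"continue": "-||"},
 "query": {"categorymembers": [
     {"ns": 0, "pageid": 3657357, "title": "A-do"}
 ]}}
"""

# json.loads turns JSON text into ordinary Python dicts, lists, and strings
data = json.loads(raw_text)
print(type(data).__name__)
print(list(data))

# Nested values are reached with normal dictionary and list indexing
first = data["query"]["categorymembers"][0]
print(first["title"])
```

Calling `r.json()` does this decoding step on the response body, which is why its result can be indexed like any other dictionary.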

In [6]:
for item in r.json() :
    print(item)

batchcomplete
continue
query


In [7]:
for item in r.json()['query'] :
    print(item)

categorymembers


`batchcomplete` signals that everything for the current batch of items has been returned. `continue` is used to continue through the results, since we can't request more than 500 items at once; its contents get sent back with the next request. And `query` has the results.

In [8]:
for item in r.json()['query']['categorymembers'] :
    print(item)

{'pageid': 3657357, 'ns': 0, 'title': 'A-do'}
{'pageid': 16159192, 'ns': 0, 'title': 'Jacob Aagaard'}
{'pageid': 22785402, 'ns': 0, 'title': 'Aamani'}
{'pageid': 14516619, 'ns': 0, 'title': 'Hallvard Aamlid'}
{'pageid': 13700991, 'ns': 0, 'title': 'Jason Aaron'}
{'pageid': 6284418, 'ns': 0, 'title': 'Ann Kristin Aarønes'}
{'pageid': 14513688, 'ns': 0, 'title': 'John-Ragnar Aarset'}
{'pageid': 55915264, 'ns': 0, 'title': 'Dagfinn Aarskog (bobsleigh)'}
{'pageid': 36759928, 'ns': 0, 'title': 'Eileen Abad'}
{'pageid': 6822011, 'ns': 0, 'title': 'Aydo Abay'}


Now let's build a list of everyone born in 1973. I've added a way to get out using an iteration counter. Lower the limit in the `iteration > 300` line to pull fewer pages of results; anything of about 50 or more gets all the names.

In [9]:
# Let's build up our request in a more sustainable way
req = {'action':'query',
       'format':'json',
       'list':'categorymembers',
       'cmlimit':500, # move the limit up to the max we can do.
       'cmtitle':'Category:1973_births'}

last_continue = {} # used to keep track of how far we've gone. 
iteration = 1
pages = 0

names = []

while True :
    # Modify it with the values returned in the 'continue' section of the last result.
    req.update(last_continue)
    
    # Call API
    result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json() 
    
    pages += 1
    
    # Grab the names
    for item in result['query']['categorymembers'] :
        names.append(item['title'])
    
    # keep track of our iteration so we can exit if this runs forever
    iteration += 1
    
    # Can we get out?
    if 'continue' not in result :
        break
    else :
        last_continue = result['continue']
    
    if iteration > 300 :
        # it's useful to have a way out of while statements,
        # particularly ones that are framed as "while True"
        break 

print("We pulled {} pages".format(pages))

We pulled 25 pages


In [15]:
pprint(result)

{'batchcomplete': '',
 'query': {'categorymembers': [{'ns': 0,
                                'pageid': 4597415,
                                'title': 'Miho Yamada'},
                               {'ns': 0,
                                'pageid': 52517761,
                                'title': 'Miho Yamada (gymnast)'},
                               {'ns': 0,
                                'pageid': 2104249,
                                'title': 'Sulim Yamadayev'},
                               {'ns': 0,
                                'pageid': 22467670,
                                'title': 'Hidetada Yamagishi'},
                               {'ns': 0,
                                'pageid': 8751597,
                                'title': 'Eri Yamaguchi'},
                               {'ns': 0,
                                'pageid': 26613361,
                                'title': 'Takayuki Yamaguchi (footballer)'},
                               {'ns': 

---

Let's talk through the above code. 
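
One way to see the moving parts is to restate the loop as a generator. This is a sketch under my own assumptions, not the notebook's method: `iter_category_titles`, `fetch`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API with two made-up pages so the example runs offline.

```python
def iter_category_titles(fetch, cmtitle, max_pages=300):
    """Yield member titles of a category, following 'continue' tokens.

    `fetch` is any callable that takes a dict of query parameters and
    returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmlimit": 500,
        "cmtitle": cmtitle,
    }
    # A bounded for-loop plays the role of the iteration counter
    for _ in range(max_pages):
        result = fetch(params)
        for item in result["query"]["categorymembers"]:
            yield item["title"]
        # No 'continue' key means there are no more pages
        if "continue" not in result:
            return
        # Merge the continuation tokens into a fresh copy of the parameters
        params = {**params, **result["continue"]}


def fake_fetch(params):
    """Offline stand-in for the API: serves two tiny pages."""
    if "cmcontinue" not in params:
        return {
            "continue": {"cmcontinue": "page2", "continue": "-||"},
            "query": {"categorymembers": [
                {"ns": 0, "pageid": 3657357, "title": "A-do"}]},
        }
    return {"query": {"categorymembers": [
        {"ns": 0, "pageid": 6822011, "title": "Aydo Abay"}]}}


demo_titles = list(iter_category_titles(fake_fetch, "Category:1973_births"))
print(demo_titles)  # ['A-do', 'Aydo Abay']
```

To run it against Wikipedia, pass something like `lambda p: requests.get('https://en.wikipedia.org/w/api.php', params=p).json()` as `fetch`.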

---

To see what's going on, I'll print the first 10 names and the last 10 names.

In [16]:
print(names[:10])
print(names[-10:])

['A-do', 'Jacob Aagaard', 'Aamani', 'Hallvard Aamlid', 'Jason Aaron', 'Ann Kristin Aarønes', 'John-Ragnar Aarset', 'Dagfinn Aarskog (bobsleigh)', 'Eileen Abad', 'Aydo Abay']
['Bert Zuurman', 'Wodage Zvadya', 'Rimantas Žvingilas', 'Arthur Zwane', 'Mandla Zwane', 'Clint Zweifel', 'Claudia Zwiers', 'Serge Zwikker', 'Noam Zylberman', 'File:SteveCorino2012Cropped.png']
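
Notice that the last entry is a file, not a person. Each category member carries an `ns` (namespace) value; articles are in namespace 0 and files are in namespace 6. A minimal sketch of filtering on that field follows. The list `sample_members` is made up for illustration, and the `pageid` on the file entry is a placeholder because the real one is not shown above.

```python
sample_members = [
    {"ns": 0, "pageid": 3657357, "title": "A-do"},
    {"ns": 0, "pageid": 16159192, "title": "Jacob Aagaard"},
    # Placeholder pageid: the real value for this file was not shown above
    {"ns": 6, "pageid": 0, "title": "File:SteveCorino2012Cropped.png"},
]

# Keep only titles from the article namespace (ns == 0)
article_titles = [m["title"] for m in sample_members if m["ns"] == 0]
print(article_titles)  # ['A-do', 'Jacob Aagaard']
```

In the loop above, the same test could go inside `for item in result['query']['categorymembers']`. The API documentation also lists a `cmnamespace` parameter that should let the server do this filtering, though I have not tried it here.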


The code below does some parsing of the names: it pulls out the part in parentheses (such as "bobsleigh") and counts the most common ones.

In [21]:
from collections import Counter

types = []

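for name in names :
    # Titles like 'Dagfinn Aarskog (bobsleigh)' carry a label in parentheses
    if "(" in name :
        parts = name.split("(")
        #break  # uncomment to stop at the first match and inspect `parts`
        types.append(parts[1].replace(")",""))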

        
c = Counter(types)
c.most_common(10)

[('footballer', 106),
 ('footballer, born 1973', 81),
 ('musician', 56),
 ('American football', 53),
 ('politician', 45),
 ('cricketer', 39),
 ('actor', 38),
 ('baseball', 35),
 ('basketball', 32),
 ('ice hockey', 26)]

In [20]:
parts

['Dagfinn Aarskog ', 'bobsleigh)']

In [23]:
from collections import Counter

c = Counter([name.split()[0] for name in names])
c.most_common(10)

[('David', 119),
 ('Mark', 94),
 ('Chris', 94),
 ('John', 83),
 ('Paul', 83),
 ('Jason', 79),
 ('Michael', 70),
 ('Peter', 63),
 ('Brian', 56),
 ('Mike', 53)]

In [13]:
len(names)

12277

Now your turn. Pick a year, pull all the names for people born in that year, and count up the most common first names and last names. 
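
Last names need a little more care than first names, because some titles end in a parenthetical label. Here is one possible starting point, not the only way to do it: the helper `last_name` and the list `sample_names` are hypothetical, and taking the final word is a rough rule that will misfire on names written family-name-first or with suffixes.

```python
from collections import Counter

# A few titles borrowed from the output earlier in this notebook
sample_names = [
    "Jacob Aagaard",
    "Dagfinn Aarskog (bobsleigh)",
    "Ann Kristin Aarønes",
    "Miho Yamada",
    "Miho Yamada (gymnast)",
]


def last_name(title):
    """Return the final word of a title, ignoring any '(...)' label."""
    base = title.split("(")[0].strip()
    words = base.split()
    # Fall back to the whole title if nothing is left before the parenthesis
    return words[-1] if words else title


last_counts = Counter(last_name(n) for n in sample_names)
print(last_counts.most_common(1))  # [('Yamada', 2)]
```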

In [27]:
from random import choice


choice(names)

'Axel Lawarée'

In [None]:
# Your code here. 

Last year we had an interesting discussion about Wikipedia fame. Are younger people more likely to be "Wikipedia Famous" (i.e., *on* Wikipedia)? Why might this be true? Why might it be false? 

In order to answer this question, it'd be good to have a list of everyone on Wikipedia born in the last 100 or 150 years. If you get here with some extra time, write some code to do this. Your code should define a starting year and then pull everyone born in that year on Wikipedia. Write out this data to a file, keeping track of the year the person was born in. 
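
For the writing-to-a-file part, a CSV with one row per person is one reasonable layout. The sketch below is an assumption about format, not a requirement: `write_births` and `demo_rows` are hypothetical names, and it writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io


def write_births(rows, out_file):
    """Write (year, name) pairs as CSV with a header row."""
    writer = csv.writer(out_file)
    writer.writerow(["year", "name"])
    for year, name in rows:
        writer.writerow([year, name])


# Two made-up rows standing in for the results of the API loop
demo_rows = [(1973, "Jacob Aagaard"), (1973, "Miho Yamada (gymnast)")]

buffer = io.StringIO()
write_births(demo_rows, buffer)
print(buffer.getvalue())
```

To write a real file, open one with `open("births.csv", "w", newline="", encoding="utf-8")` (the filename is just an example) and pass it in place of `buffer`. The `csv` module quotes any names that contain commas.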

Which year did you get with the maximum number of people? 