# Accessing the Wikipedia API
This notebook pulls some information from the Wikipedia API. This API is nice because it doesn't require authentication. (The Twitter API does require authentication; that process is necessary, but it takes some work.)

In [1]:
import requests

wikipedia_api_url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmlimit=10"

We can start by building a simple query that gets 10 people born in 1973. We will use the delightful and amazing [`requests`](http://docs.python-requests.org/en/master/) library in Python. The format of the URL is based on a bunch of reading about the [Wikipedia API](https://www.mediawiki.org/wiki/API:Categorymembers) and trial and error. And error.

In [None]:
full_url = wikipedia_api_url + "&cmtitle=Category:1973_births"

print(full_url)

r = requests.get(full_url)
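
As a hedged aside, the same URL can be assembled from a dictionary of parameters rather than by string concatenation. The sketch below uses only the standard library's `urllib.parse`; the names `base_url`, `params`, `built_url`, and `decoded` are illustrative and not part of the original notebook. Note that `urlencode` percent-encodes the colon in the category title, and parsing the query string back recovers the same value.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Illustrative names; the parameter values mirror the query used above.
base_url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "categorymembers",
    "cmlimit": 10,
    "cmtitle": "Category:1973_births",
}

# urlencode turns the dictionary into "key=value&key=value" form,
# escaping characters such as ":" along the way.
built_url = base_url + "?" + urlencode(params)
print(built_url)

# Parsing the query string back recovers the original values.
decoded = parse_qs(urlsplit(built_url).query)
print(decoded["cmtitle"])
```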

In [None]:
from pprint import pprint

Let's use `pprint` (for "pretty print") to print out the JSON object returned by the API call.

In [None]:
pprint(r.json())

Feel free to click on the link that appears below cell [2]. You'll see a `pprint` version of what was returned. Thanks, Wikipedia!

In this next cell, type `r.` and press Tab to look at all the attributes and methods available on the response object.

In [None]:
r.

One of the most useful is `r.json()`.

In [None]:
r.json()

Compare these results to the entry: https://en.wikipedia.org/wiki/Category:1973_births.

JSON objects look a lot like Python dictionaries. In this case, we've got three top-level keys: `batchcomplete`, `continue`, and `query`.

In [None]:
for item in r.json() :
    print(item)

In [None]:
for item in r.json()['query'] :
    print(item)

`batchcomplete` signals, I think, that the API has returned all the data for the current batch. `continue` holds the values we send back to get the next page of results, since we can't request more than 500 items at once. And `query` has the results.

In [None]:
for item in r.json()['query']['categorymembers'] :
    print(item)
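
To make the nesting concrete without calling the API, here is a small hand-written dictionary shaped like the response. The titles, page IDs, and continuation token are made up for illustration, and the names `sample_response`, `top_level_keys`, `sample_titles`, and `has_more` are hypothetical; only the layout of keys is assumed to match what the cells above print.

```python
# A hand-written stand-in for r.json(); every value here is invented.
sample_response = {
    "batchcomplete": "",
    "continue": {"cmcontinue": "made-up-token", "continue": "-||"},
    "query": {
        "categorymembers": [
            {"pageid": 1, "ns": 0, "title": "Example Person One"},
            {"pageid": 2, "ns": 0, "title": "Example Person Two (actor)"},
        ]
    },
}

# Top-level keys, as in the first loop above.
top_level_keys = list(sample_response)
print(top_level_keys)

# Each category member is itself a small dictionary with a 'title' key.
sample_titles = [
    member["title"] for member in sample_response["query"]["categorymembers"]
]
print(sample_titles)

# The presence of 'continue' means more results are available.
has_more = "continue" in sample_response
print(has_more)
```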

Now let's build a list of everyone born in 1973. I've added a way to get out using an iteration counter. Change the `iteration > n` line (line 36) to get a different number of pages of results, or make it something like 50 to get all the names.

In [None]:
# Let's build up our request in a more sustainable way
req = {'action':'query',
       'format':'json',
       'list':'categorymembers',
       'cmlimit':500, # move the limit up to the max we can do.
       'cmtitle':'Category:1973_births'}

last_continue = {} # used to keep track of how far we've gone. 
iteration = 1
pages = 0

names = []

while True :
    # Modify it with the values returned in the 'continue' section of the last result.
    req.update(last_continue)
    
    # Call API
    result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json() 
    
    pages += 1
    
    # Grab the names
    for item in result['query']['categorymembers'] :
        names.append(item['title'])
    
    # keep track of our iteration so we can exit if this runs forever
    iteration += 1
    
    # Can we get out?
    if 'continue' not in result :
        break
    else :
        last_continue = result['continue']
    
    if iteration > 300 :
        # it's useful to have a way out of while statements,
        # particularly ones that are framed as "while True"
        break 

print("We pulled {} pages".format(pages))

In [None]:
pprint(result)

---

Let's talk through the above code. 
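
One way to follow the loop is to run the same logic against canned pages instead of the live API. In this sketch, `fetch_all_titles`, `fake_fetch_page`, and `FAKE_PAGES` are hypothetical names and the page contents are invented. The function takes the page-fetching step as an argument, so the fake can later be swapped for a real call.

```python
def fetch_all_titles(fetch_page, base_params, max_pages=300):
    """Collect titles page by page until no 'continue' block is returned."""
    params = dict(base_params)  # copy so the caller's dictionary is untouched
    titles = []
    pages = 0
    while pages < max_pages:  # bounded loop instead of "while True"
        result = fetch_page(params)
        pages += 1
        for item in result["query"]["categorymembers"]:
            titles.append(item["title"])
        if "continue" not in result:
            break  # last page reached
        # Carry the continuation values into the next request.
        params.update(result["continue"])
    return titles, pages


# Invented pages, keyed by the continuation token that requests them.
FAKE_PAGES = {
    None: {
        "continue": {"cmcontinue": "token-2", "continue": "-||"},
        "query": {"categorymembers": [{"title": "Person A"}, {"title": "Person B"}]},
    },
    "token-2": {
        "continue": {"cmcontinue": "token-3", "continue": "-||"},
        "query": {"categorymembers": [{"title": "Person C"}]},
    },
    "token-3": {
        "batchcomplete": "",
        "query": {"categorymembers": [{"title": "Person D"}]},
    },
}


def fake_fetch_page(params):
    """Stand-in for the API call: pick a canned page from the token."""
    return FAKE_PAGES[params.get("cmcontinue")]


all_titles, page_count = fetch_all_titles(fake_fetch_page, {"cmlimit": 500})
print(all_titles, page_count)
```

The first request carries no `cmcontinue` value, each later request carries the token from the previous response, and the loop stops at the first response without a `continue` block or when `max_pages` is reached.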

---

To see what's going on, I'll print the first 10 names and the last 10 names.

In [None]:
print(names[:10])
print(names[-10:])

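The code below parses the names: for any title that contains a parenthetical, it pulls out the text in parentheses and counts the most common values.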

In [None]:
from collections import Counter

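types = []

for name in names :
    # Some titles carry a disambiguator in parentheses after the name
    if "(" in name :
        parts = name.split("(")
        # Keep the text after the first "(" and drop the closing ")"
        types.append(parts[1].replace(")",""))

# Count how often each disambiguator appears
c = Counter(types)
c.most_common(10)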

In [None]:
parts
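
The same parsing can be tried on a few invented titles to see what it produces. The names in `sample_names` are hypothetical, and the regular expression is a variant of the `split` approach above (it only looks at a parenthetical at the very end of the title), not the notebook's own method.

```python
import re
from collections import Counter

# Invented titles in the style of Wikipedia article names.
sample_names = [
    "Jane Doe (actress)",
    "John Roe (footballer)",
    "Pat Poe",
    "Sam Moe (footballer)",
]

# Match "(something)" only when it closes the title.
trailing_paren = re.compile(r"\(([^()]*)\)\s*$")

sample_types = []
for name in sample_names:
    match = trailing_paren.search(name)
    if match:
        sample_types.append(match.group(1))

type_counts = Counter(sample_types)
print(type_counts.most_common(2))
```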

In [None]:
from collections import Counter

# Count the first word of each title as a rough stand-in for the first name
c = Counter([name.split()[0] for name in names])
c.most_common(10)
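
A caveat when counting name parts: the last word of a title may be a disambiguator such as "(actor)" rather than a surname. The sketch below, using invented titles and a hypothetical helper `split_first_last`, drops any parenthetical before taking the first and last words. It is a rough heuristic that assumes there is some text before the parenthesis, and it will mislabel single-word names.

```python
from collections import Counter


def split_first_last(title):
    """Return (first word, last word) of a title, ignoring any '(...)' suffix."""
    base = title.split("(")[0].strip()
    words = base.split()
    return words[0], words[-1]


# Invented titles for illustration only.
sample_names = ["Jane Doe (actress)", "John Doe", "Jane Roe (politician)"]

first_counts = Counter()
last_counts = Counter()
for title in sample_names:
    first, last = split_first_last(title)
    first_counts[first] += 1
    last_counts[last] += 1

print(first_counts.most_common(1))
print(last_counts.most_common(1))
```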

In [None]:
len(names)

Now your turn. Pick a year, pull all the names for people born in that year, and count up the most common first names and last names. 

In [None]:
# Your code here. 

We had a discussion about Wikipedia fame. Are younger people more likely to be "Wikipedia Famous" (i.e., *on* Wikipedia)? Why might this be true? Why might it be false? 

In order to answer this question, it'd be good to have a list of everyone on Wikipedia born in the last 100 or 150 years. If you get here with some extra time, write some code to do this. Your code should define a starting year and then pull everyone born in that year on Wikipedia. Write out this data to a file, keeping track of the year the person was born in. 

Which year did you get with the maximum number of people? Is that result surprising?

In [3]:
# In this section, we'll pull everyone on Wikipedia born from 1850 through 2018.

years = range(1850,2019)
output_file = "wikipedia_famous.txt"

with open(output_file,'w',encoding="UTF-8") as ofile :
    ofile.write("year\tname\n")

for year in years :
    this_cmtitle = 'Category:' + str(year) + '_births'
    
    req = {'action':'query',
           'format':'json',
           'list':'categorymembers',
           'cmlimit':500, # move the limit up to the max we can do.
           'cmtitle':this_cmtitle}

    last_continue = {} # used to keep track of how far we've gone. 
    iteration = 1
    pages = 0

    names = []

    while True :
        # Modify it with the values returned in the 'continue' section of the last result.
        req.update(last_continue)

        # Call API
        result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json() 

        pages += 1

        # Grab the names
        for item in result['query']['categorymembers'] :
            names.append(item['title'])

        # keep track of our iteration so we can exit if this runs forever
        iteration += 1

        # Can we get out?
        if 'continue' not in result :
            break
        else :
            last_continue = result['continue']

        if iteration >= 300 :
            # it's useful to have a way out of while statements,
            # particularly ones that are framed as "while True"
            print("Hey, we hit the iteration limit at {} in {}".format(iteration,year))
            break 

    print("We pulled {} pages for {}.".format(pages,year))
    
    # After we've pulled the year, let's write out the results
    with open(output_file,'a',encoding="UTF-8") as ofile : # why the 'a' here? 
        for name in names :
            ofile.write("\t".join([str(year),name]) + "\n")
    

We pulled 5 pages for 1850.
We pulled 5 pages for 1851.
We pulled 5 pages for 1852.
We pulled 4 pages for 1853.
We pulled 5 pages for 1854.
We pulled 5 pages for 1855.
We pulled 5 pages for 1856.
We pulled 5 pages for 1857.
We pulled 6 pages for 1858.
We pulled 6 pages for 1859.
We pulled 6 pages for 1860.
We pulled 6 pages for 1861.
We pulled 6 pages for 1862.
We pulled 6 pages for 1863.
We pulled 6 pages for 1864.
We pulled 6 pages for 1865.
We pulled 6 pages for 1866.
We pulled 7 pages for 1867.
We pulled 7 pages for 1868.
We pulled 7 pages for 1869.
We pulled 7 pages for 1870.
We pulled 7 pages for 1871.
We pulled 7 pages for 1872.
We pulled 7 pages for 1873.
We pulled 7 pages for 1874.
We pulled 7 pages for 1875.
We pulled 8 pages for 1876.
We pulled 7 pages for 1877.
We pulled 8 pages for 1878.
We pulled 8 pages for 1879.
We pulled 8 pages for 1880.
We pulled 8 pages for 1881.
We pulled 9 pages for 1882.
We pulled 9 pages for 1883.
We pulled 9 pages for 1884.
We pulled 9 pages fo

More later on estimating the age effect in Wikipedia!
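
As a starting point for the question of which year has the most people, the tab-separated file can be read back and tallied by year. This is a minimal sketch: `count_by_year` is a hypothetical helper, and the demonstration writes a tiny invented file to a temporary directory rather than reading the real `wikipedia_famous.txt`.

```python
import csv
import os
import tempfile
from collections import Counter


def count_by_year(path):
    """Tally rows per year in a file with a 'year<TAB>name' header."""
    counts = Counter()
    with open(path, encoding="UTF-8", newline="") as infile:
        # QUOTE_NONE keeps quotation marks inside names as plain text.
        reader = csv.DictReader(infile, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[int(row["year"])] += 1
    return counts


# Build a small invented file in the same layout as the output file above.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo_famous.txt")
    with open(demo_path, "w", encoding="UTF-8") as ofile:
        ofile.write("year\tname\n")
        ofile.write("1850\tPerson A\n")
        ofile.write("1851\tPerson B\n")
        ofile.write("1851\tPerson C\n")
    year_counts = count_by_year(demo_path)

print(year_counts.most_common(1))
```

Calling `count_by_year("wikipedia_famous.txt")` after the download cell has finished would give the real tallies.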