# Accessing the Wikipedia API
This notebook pulls some information from the Wikipedia API. This API is nice because it doesn't require authentication. (The Twitter API does require authentication. That process is necessary, but it takes some work.)

In [None]:
import requests
from collections import Counter

wikipedia_api_url = "https://en.wikipedia.org/w/api.php"

req = {'action':'query',
       'format':'json',
       'list':'categorymembers',
       'cmlimit':10}

Q: What sort of object is `req`? 

A: It's a dictionary. The values are strings, except for `req['cmlimit']`, which is an integer.

We can start by just building a simple query, getting 10 people born in 1973. We will use the delightful and amazing [`requests`](http://docs.python-requests.org/en/master/) library in Python. The format of the URL is based on a bunch of reading about the [Wikipedia API](https://www.mediawiki.org/wiki/API:Categorymembers) and trial and error. And error. 

In [None]:
# add on the category we're going to look at. 
req['cmtitle'] = 'Category:1973_births'

In [None]:
r = requests.get(wikipedia_api_url,params=req)

Let's take a look at the URL that was requested.

In [None]:
r.url

Q: What is `requests` doing with the `params` dictionary?

A: It's adding a "?" to the URL. Then it's gluing the parameters together with a `key1=value1&key2=value2&...` syntax. 
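To make the gluing step concrete, here is a minimal sketch that builds the same kind of query string with the standard library's `urllib.parse.urlencode`. It illustrates the idea and is not a description of how `requests` is implemented internally; the names `query_string` and `full_url` are made up for this example.

```python
from urllib.parse import urlencode

wikipedia_api_url = "https://en.wikipedia.org/w/api.php"

req = {'action': 'query',
       'format': 'json',
       'list': 'categorymembers',
       'cmlimit': 10,
       'cmtitle': 'Category:1973_births'}

# urlencode joins the pairs as key1=value1&key2=value2&...
# and percent-encodes characters that are not URL-safe.
query_string = urlencode(req)

# The "?" separates the base URL from the query string.
full_url = wikipedia_api_url + "?" + query_string
print(full_url)
```

Notice that the colon in `Category:1973_births` comes out as `%3A`, and that the integer `10` is converted to text along the way.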

In [None]:
from pprint import pprint

Let's use `pprint` (for "pretty print") to print out the JSON object that the API call returns.

In [None]:
pprint(r.json())

Feel free to use your browser to visit that `r.url` above. You'll see the same JSON that was returned (some browsers format it nicely, much like `pprint`). Thanks Wikipedia!

In this next cell, type `r.` and press Tab to see all the attributes and methods available on the response object.

In [None]:
r.

One of the most useful is `r.json()`.

In [None]:
r.json()

Compare these results to the entry: https://en.wikipedia.org/wiki/Category:1973_births.

Q: What sort of Python object does a JSON object look like? 

A: JSON objects look a lot like Python dictionaries. In this case, we've got three main keys: `batchcomplete`, `continue`, and `query`.

When we have this kind of object, it's nice to iterate through it at various levels and see what's there. 

In [None]:
for item in r.json() :
    print(item)

In [None]:
for item in r.json()['query'] :
    print(item)

`batchcomplete` signals that all the data for the current batch of results has been returned. `continue` holds the values we send back to get the next page of results, since we can't request more than 500 items at once. And `query` has the results.

In [None]:
for item in r.json()['query']['categorymembers'] :
    print(item)

Q: What sort of object is `item`? 

A: `item` is a dictionary with three keys and values. The value for the key `title` has the person's name in it. 
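If you want to poke at this nesting without calling the API, here is a small offline sketch. The JSON text below is made up: it is only shaped like the response described above, and every value in it (page ids, titles, the continue token) is a placeholder, not real Wikipedia data.

```python
import json

# A made-up response string, shaped like the API output.
# All values are placeholders for illustration.
sample_text = '''
{
  "batchcomplete": "",
  "continue": {"cmcontinue": "placeholder-token", "continue": "-||"},
  "query": {
    "categorymembers": [
      {"pageid": 1, "ns": 0, "title": "Example Person"},
      {"pageid": 2, "ns": 0, "title": "Another Example (singer)"}
    ]
  }
}
'''

# json.loads turns JSON text into nested Python dictionaries and lists.
parsed = json.loads(sample_text)
print(type(parsed))
print(list(parsed.keys()))

# Walk down two levels, then pull the title out of each member dictionary.
titles = [member["title"] for member in parsed["query"]["categorymembers"]]
print(titles)
```

Calling `r.json()` does this same conversion on the body of the response.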

Now let's build a list of everyone born in 1973. I've added a way to get out using an iteration counter. Change the `iteration > n` line (currently `n` is 300) to get a different number of pages of results, or make it something like 50 to get all the names.

In [None]:
# Let's build up our request in a more sustainable way.
req = {'action':'query',
       'format':'json',
       'list':'categorymembers',
       'cmlimit':500, # move the limit up to the max we can do.
       'cmtitle':'Category:1973_births'}

last_continue = {} # used to keep track of how far we've gone. 
iteration = 1
pages = 0

names = []

while True :
    # Modify it with the values returned in the 'continue' section of the last result.
    req.update(last_continue)
    
    # Call API
    result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json() 
    
    pages += 1
    
    # Grab the names
    for item in result['query']['categorymembers'] :
        names.append(item['title'])
    
    # keep track of our iteration so we can exit if this runs forever
    iteration += 1
    
    # Can we get out?
    if 'continue' not in result :
        break
    else :
        last_continue = result['continue']
    
    if iteration > 300 :
        # it's useful to have a way out of while statements,
        # particularly ones that are framed as "while True"
        break 

print("We pulled {} pages".format(pages))

Q: How many names did we pull? 

In [None]:
len(names)

---

Work through the above code and let me know what questions you have. This is a good set of code to talk through in class too. 

---
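One way to study the continuation loop is to run the same pattern against a stand-in for the API. The sketch below is only that: `fake_get`, `FAKE_PAGES`, and the tokens are hypothetical and invented for this example, and the "names" are placeholders. The loop itself follows the same steps as the cell above: call, collect titles, then either stop or merge the `continue` values into the next request.

```python
# Hypothetical stand-in for the API: three "pages" of made-up titles,
# keyed by the continue token that requests each page.
FAKE_PAGES = {
    None: (["Name 1", "Name 2"], "token-a"),
    "token-a": (["Name 3", "Name 4"], "token-b"),
    "token-b": (["Name 5"], None),
}

def fake_get(params):
    """Return a dict shaped like the API's JSON for the given params."""
    titles, next_token = FAKE_PAGES[params.get("cmcontinue")]
    result = {"query": {"categorymembers": [{"title": t} for t in titles]}}
    if next_token is not None:
        result["continue"] = {"cmcontinue": next_token, "continue": "-||"}
    return result

params = {"list": "categorymembers", "cmlimit": 2}
demo_names = []
max_pages = 10  # safety valve, playing the role of the iteration check

for page_number in range(1, max_pages + 1):
    result = fake_get(params)
    demo_names.extend(item["title"] for item in result["query"]["categorymembers"])

    # No 'continue' key means there is nothing left to fetch.
    if "continue" not in result:
        break

    # Otherwise, carry the continue values into the next request.
    params.update(result["continue"])

print("We pulled {} pages".format(page_number))
print(demo_names)
```

Using a bounded `for` loop instead of `while True` is a design choice here: the page cap is built into the loop, so there is no separate counter to maintain.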

To see what's going on, I'll print the first 10 names and the last 10 names.

In [None]:
print(names[:10])
print(names[-10:])

Let's look at the most common occupations, which are often listed in parentheses after the name. We'll parse the names to pull out the occupations, then put them through a `Counter` object to see the most common ones.

In [None]:
# Container for the job-like things
types = []

# Iterate over the names, split on parens if they're there, 
# store what's in the second part. 
for name in names :
    if "(" in name :
        parts = name.split("(")
        
        types.append(parts[1].replace(")",""))

        
c = Counter(types)
c.most_common(10)
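As an alternative to splitting on `"("`, here is a sketch that uses a regular expression to grab the text inside a trailing pair of parentheses. The titles in `sample_names` are made up for illustration, and `paren_pattern`, `labels`, and `label_counts` are names invented for this example.

```python
import re
from collections import Counter

# Made-up titles for illustration only.
sample_names = [
    "Alex Example (footballer)",
    "Sam Sample",
    "Pat Placeholder (politician)",
    "Chris Case (footballer)",
    "Jo Demo (footballer, born 1973)",
]

# Capture whatever sits inside a final pair of parentheses.
paren_pattern = re.compile(r"\(([^()]*)\)\s*$")

labels = []
for name in sample_names:
    match = paren_pattern.search(name)
    if match:
        labels.append(match.group(1))

label_counts = Counter(labels)
print(label_counts.most_common(1))
```

In this sample, `"footballer, born 1973"` is counted separately from `"footballer"`. Keep that in mind when reading the counts, along with the fact that the parenthetical is a disambiguation label and not always an occupation.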

Q: What are the most common first names in the names list?

In [None]:
c = Counter([name.split()[0] for name in names])
c.most_common(10)

In [None]:
len(names)

Now pick a different year, pull all the names for people born in that year, and count up the most common first names and last names. This will lead into the assignment.

In [None]:
# Your code here. 