In [1]:
import re
import requests
import pandas as pd
import numpy as np

#### Below is the Wikipedia API call for a category search:

`http://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A+machine+learning&cmlimit=max`

`action=query`: query the Wikipedia API

`format=json`: return the response as JSON

`list=categorymembers`: list the pages that belong to a given category, ordered by page sort title

`cmtitle=Category%3A+machine+learning`: title of the category (`%3A` is the URL-encoded `:`)

`cmlimit=max`: return up to the maximum number of results per request (500)

You can use this to get page titles from the Wikipedia API. Things to watch out for:
* The response contains subcategories as well as articles
* You will want to fetch the articles in those subcategories too

The API's detailed documentation can be found [here](https://www.mediawiki.org/wiki/API:Main_page)
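As a rough sketch of how the parameters above fit together, the query string can be assembled from a dictionary instead of being typed by hand. The helper name `build_category_url` is made up for this example, and it uses only the standard library, so no request is sent.

```python
from urllib.parse import urlencode

API_URL = 'http://en.wikipedia.org/w/api.php'

def build_category_url(category):
    """Return a categorymembers query URL for the given category name.

    Hypothetical helper for illustration; it is not part of the Wikipedia API.
    """
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmtitle': 'Category:{}'.format(category),
        'cmlimit': 'max',
    }
    # urlencode percent-encodes ':' as %3A and turns spaces into '+'
    return API_URL + '?' + urlencode(params)

url = build_category_url('machine learning')
print(url)
```

`requests.get` also accepts a `params=` dictionary directly, which avoids building the string at all.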

In [2]:
r = requests.get('http://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A+machine+learning&cmlimit=max')

In [3]:
type(r)

requests.models.Response

In [4]:
r.status_code # 200 means it worked!

200

In [5]:
r.json().keys()

dict_keys(['batchcomplete', 'limits', 'query'])

In [6]:
r.json()['query'].keys()

dict_keys(['categorymembers'])

In [7]:
pd.DataFrame(r.json()['query']['categorymembers'])

| | ns | pageid | title |
|---|---|---|---|
| 0 | 0 | 43385931 | Data exploration |
| 1 | 0 | 49082762 | List of datasets for machine learning research |
| 2 | 0 | 233488 | Machine learning |
| 3 | 0 | 53587467 | Outline of machine learning |
| 4 | 0 | 3771060 | Accuracy paradox |
| 5 | 0 | 43808044 | Action model learning |
| 6 | 0 | 28801798 | Active learning (machine learning) |
| 7 | 0 | 45049676 | Adversarial machine learning |
| 8 | 0 | 52642349 | AIVA |
| 9 | 0 | 30511763 | AIXI |
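The `ns` column in the output above is the MediaWiki namespace: `0` marks a regular article and `14` marks a category page. Here is a minimal sketch of separating the two. The function name `split_members` and the subcategory entry in the sample data are invented for illustration.

```python
ARTICLE_NS = 0    # main (article) namespace
CATEGORY_NS = 14  # Category namespace

def split_members(members):
    """Split categorymembers entries into (articles, subcategories) by namespace."""
    articles = [m for m in members if m['ns'] == ARTICLE_NS]
    subcategories = [m for m in members if m['ns'] == CATEGORY_NS]
    return articles, subcategories

# The first two entries come from the output above; the last one is a
# made-up subcategory so that both branches are exercised.
sample = [
    {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
    {'ns': 0, 'pageid': 30511763, 'title': 'AIXI'},
    {'ns': 14, 'pageid': 1, 'title': 'Category:Hypothetical subcategory'},
]

articles, subcategories = split_members(sample)
print([m['title'] for m in articles])
print([m['title'] for m in subcategories])
```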


In [8]:
from string import punctuation

def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

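def get_page_contents(pageid):
    """Return the page's extract with HTML tags, newlines, and punctuation removed."""
    # Let requests build and encode the query string from a dict of parameters
    params = {'action': 'query', 'prop': 'extracts', 'format': 'json', 'pageids': pageid}
    my_request = requests.get('http://en.wikipedia.org/w/api.php', params=params).json()

    # The extract is stored under the page id, which is a string key in the JSON
    no_html_string = striphtml(my_request['query']['pages'][str(pageid)]['extract']).replace('\n', ' ')

    return strip_punctuation(no_html_string)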
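To see what the two cleaning helpers do without calling the API, here is a small offline check. The helpers are repeated so the block runs on its own, and the HTML snippet is a made-up stand-in for a real `extract` value.

```python
import re
from string import punctuation

def strip_punctuation(s):
    """Drop every ASCII punctuation character from s."""
    return ''.join(c for c in s if c not in punctuation)

def striphtml(data):
    """Remove anything that looks like an HTML tag (non-greedy match)."""
    return re.sub(r'<.*?>', '', data)

# Invented sample; a real extract would come from the API response.
sample_extract = '<p><b>Machine learning</b> is fun!</p>'

no_html = striphtml(sample_extract)
cleaned = strip_punctuation(no_html)
print(no_html)
print(cleaned)
```

Note that `strip_punctuation` also removes hyphens and apostrophes, so a word like "state-of-the-art" is collapsed into a single token.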

In [1]:
import requests  # this cell runs in a fresh kernel, so requests must be imported again

params = {'action': 'query', 'prop': 'extracts', 'format': 'json', 'pageids': 33547228}
my_request = requests.get('http://en.wikipedia.org/w/api.php', params=params).json()
my_request['query']

#### Make a function that formats a request for pages of a category

In [8]:
# use regex to replace the name of the category in the search string
def format_category_query(category):
    search_string = 'http://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A+{}&cmlimit=max'
    category = re.sub(r'\s', '+', category)  # replace spaces in category with +s so it can be inserted into the search string
    return search_string.format(category)

#### Make a function that uses requests to execute the query and returns the json

In [62]:

def execute_query(url):
    return requests.get(url).json()  # run the query and return the decoded JSON

query = requests.get('http://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A+machine+learning&cmlimit=max')  # raw response, used by the next cell

#### Make a function that turns the json into a DataFrame

Hint: you can't build a DataFrame from the whole JSON response; use the list stored under `['query']['categorymembers']`.

In [63]:
pd.DataFrame(query.json()['query']['categorymembers'])

| | ns | pageid | title |
|---|---|---|---|
| 0 | 0 | 43385931 | Data exploration |
| 1 | 0 | 49082762 | List of datasets for machine learning research |
| 2 | 0 | 233488 | Machine learning |
| 3 | 0 | 53587467 | Outline of machine learning |
| 4 | 0 | 3771060 | Accuracy paradox |
| 5 | 0 | 43808044 | Action model learning |
| 6 | 0 | 28801798 | Active learning (machine learning) |
| 7 | 0 | 45049676 | Adversarial machine learning |
| 8 | 0 | 52642349 | AIVA |
| 9 | 0 | 30511763 | AIXI |
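One possible shape for the extraction step behind the table above is sketched here. The function name `category_members` is invented, and the sample response is a trimmed stand-in with the same top-level keys seen earlier; its `batchcomplete` and `limits` values are placeholders. The block sticks to plain Python so that it runs without pandas.

```python
def category_members(response_json):
    """Return the list of member records from a categorymembers response.

    Falls back to an empty list when the expected keys are missing.
    """
    return response_json.get('query', {}).get('categorymembers', [])

# Trimmed-down stand-in for the real response (placeholder values).
sample_response = {
    'batchcomplete': '',
    'limits': {'categorymembers': 500},
    'query': {
        'categorymembers': [
            {'pageid': 43385931, 'ns': 0, 'title': 'Data exploration'},
            {'pageid': 233488, 'ns': 0, 'title': 'Machine learning'},
        ]
    },
}

rows = category_members(sample_response)
titles = [row['title'] for row in rows]
print(titles)
```

Passing `rows` to `pd.DataFrame` should then give one row per member, with `ns`, `pageid`, and `title` columns.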


#### Extra: Build out methods that take any subcategories and get the articles in those.

Hint: recursion
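A minimal sketch of the recursive idea, under a few assumptions: `fetch_members` is a hypothetical callable that maps a category title to its list of member records (in the notebook it would wrap the API call), and the `max_depth` and `seen` arguments are added here to guard against deep or cyclic category graphs. The in-memory `FAKE_WIKI` data is invented so the example runs offline.

```python
def collect_articles(category, fetch_members, max_depth=2, seen=None):
    """Recursively gather article titles from a category and its subcategories.

    fetch_members is a hypothetical callable: category title -> list of member
    dicts with 'ns' and 'title' keys, as in the categorymembers response.
    """
    if seen is None:
        seen = set()
    # Stop on categories already visited or when the depth budget is used up
    if category in seen or max_depth < 0:
        return set()
    seen.add(category)

    titles = set()
    for member in fetch_members(category):
        if member['ns'] == 14:   # subcategory: recurse into it
            titles |= collect_articles(member['title'], fetch_members,
                                       max_depth - 1, seen)
        elif member['ns'] == 0:  # regular article
            titles.add(member['title'])
    return titles

# Made-up category graph standing in for the live API, including a cycle.
FAKE_WIKI = {
    'Category:A': [
        {'ns': 0, 'title': 'Article 1'},
        {'ns': 14, 'title': 'Category:B'},
    ],
    'Category:B': [
        {'ns': 0, 'title': 'Article 2'},
        {'ns': 14, 'title': 'Category:A'},  # cycle back to the parent
    ],
}

titles = collect_articles('Category:A', lambda c: FAKE_WIKI.get(c, []))
print(sorted(titles))
```

Tracking visited categories matters because Wikipedia categories can reference each other, and without it the recursion might never terminate.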