# Overview

This week is all about working with data. I'm not going to lie to you: this part might be frustrating, but frustration is an integral part of learning. Real data is almost always messy & difficult ... and learning to deal with that fact is a key part of being a data scientist.


Enough about the process, let's get to the content. Today, we will use network science and Wikipedia to learn about American politics. We're going to study how the [US congress](https://en.wikipedia.org/wiki/United_States_Congress) has changed in the past 6 years. We will download the Wikipedia pages of the members of congress, and then create the network of the pages that link to each other. Next time, we'll use our network skills (as well as new ones) to understand that network and its evolution in time. Further down the line, we'll use natural language processing to understand the text displayed on those pages.

But for today, the tasks are

* Learn about regular expressions
* Learn about Pandas dataframes
* Put together some statistics about the composition of the US congress in the past 6 years 
* Download and store (for later use) all the politicians-pages from Wikipedia
* Extract all the internal wikipedia-links that connect the politician-pages on wikipedia
* Generate the network of politicians on wikipedia. 
* Calculate some simple network statistics.

# Prelude: Regular expressions

Before we get started, we have to get a little head start on the _Natural Language Processing_ part of the class. This is a new direction for us: up to now, we've mostly been doing math-y stuff with Python, but today we're going to be using Python to work through text. The central thing we need to be able to do today is to extract internal wikipedia links. And for that we need regular expressions.

> _Exercises_: Regular expressions round 1\.
> 
> * Read [**this tutorial**](https://developers.google.com/edu/python/regular-expressions) to form an overview of regular expressions. It is important to understand the content of the tutorial (it will also be very useful later), so you may actually want to work through the examples.
> * Now, explain in your own words: what are regular expressions?
> * Provide an example of a regex to match 4-digit numbers (by this, I mean precisely 4 digits; you should not match any part of numbers with e.g. 5 digits). In your notebook, use `findall` to show that your regex works on this [test-text](https://raw.githubusercontent.com/suneman/socialgraphs2017/master/files/test.txt). **Hint**: a great place to test out regular expressions is: https://regex101.com.
> * Provide an example of a regex to match words starting with "super". Show that it works on the [test-text](https://raw.githubusercontent.com/suneman/socialgraphs2017/master/files/test.txt).
> 
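
To make the `findall` workflow concrete, here is a small sketch on a made-up sentence (not the test-text). It uses different patterns from the ones asked for in the exercise, so treat it as an illustration of word boundaries (`\b`) and repetition (`{2}`) rather than as a solution.

```python
import re

# Made-up sentence, for illustration only
text = "In 1999 we had 12 apples, 123456 pears and 7 unhappy, unusual customers; 42 left."

# \b marks a word boundary, so \b\d{2}\b only matches runs of exactly two digits
two_digit = re.findall(r"\b\d{2}\b", text)

# \bun anchors the match at the start of a word, \w* extends it to the end of the word
un_words = re.findall(r"\bun\w*", text)

print(two_digit)
print(un_words)
```

Without the two `\b` anchors, the first pattern would also match pieces of longer numbers such as `1999`, which is exactly the pitfall the exercise warns about.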

Finally, we need to figure out how to match internal wiki links. Wiki links come in two flavors. They're always enclosed in double square brackets, e.g. `[[wiki-link]]`, and can either occur like this:

    ... some text [[Aristotle]] some more text ...

which links to the page [`https://en.wikipedia.org/wiki/Aristotle`](https://en.wikipedia.org/wiki/Aristotle). 

The second flavor has two parts, so that links can handle spaces and other more fancy forms of references, here's an example:

    ... some text [[John_McCain|John McCain]] some more text ...

which links to the page [`https://en.wikipedia.org/wiki/John_McCain`](https://en.wikipedia.org/wiki/John_McCain). Now it's your turn.

> _Exercise_: Regular expressions round 2\. Show that you can extract the wiki-links from the [test-text](https://raw.githubusercontent.com/suneman/socialgraphs2017/master/files/test.txt). Perhaps you can find inspiration on stack overflow or similar. **Hint**: Try to solve this exercise on your own (that's what you will get the most out of - learning wise), but if you get stuck ... you will find the solution in one of the video lectures below.
> 

# Prelude part 2: Pandas DataFrames


Before starting, we will also learn a bit about [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), a very user-friendly data structure that you can use to manipulate tabular data. Pandas dataframes are implemented within the [pandas package](https://pandas.pydata.org/).

Pandas dataframes should be intuitive to use. **We suggest you go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/pandas-docs/version/0.22/10min.html#min) to learn what you need to solve the next exercise.**
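
As a quick taste of what the tutorial covers, here is a minimal sketch of a few everyday dataframe operations. The table and its column names (`name`, `team`, `score`) are made up for illustration and have nothing to do with the congress data.

```python
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    "name": ["Ada", "Bob", "Cleo", "Dan"],
    "team": ["red", "blue", "red", "red"],
    "score": [3, 5, 4, 1],
})

# Select rows with a boolean mask
red = df[df["team"] == "red"]

# Count how many rows fall in each category
team_sizes = df["team"].value_counts()

# Aggregate one column per group
mean_score = df.groupby("team")["score"].mean()

# Sort by a column and keep the top two rows
top = df.sort_values("score", ascending=False).head(2)

print(team_sizes)
print(mean_score)
print(top)
```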

# Part A: Study the composition of the US congress

We will study the 113th, 114th, and 115th US congresses. The congress is composed of two groups of politicians: the senate and the house of representatives. For now, we will consider only the house of representatives.

* 113th congress: (Jan 3, 2013 to Jan 3, 2015) [113th house of representatives](https://en.wikipedia.org/wiki/List_of_members_of_the_United_States_House_of_Representatives_in_the_113th_Congress_by_seniority)

* 114th congress: (Jan 3, 2015 to Jan 3, 2017) [114th house of representatives](https://en.wikipedia.org/wiki/List_of_members_of_the_United_States_House_of_Representatives_in_the_114th_Congress_by_seniority)

* 115th congress: (Jan 3, 2017 to Jan 3, 2019) [115th house of representatives](https://en.wikipedia.org/wiki/List_of_members_of_the_United_States_House_of_Representatives_in_the_115th_Congress_by_seniority)

As you can see, each of these pages contains a table listing the members of the house, together with some information about them. Extracting tables from the content returned by the API can be a [nightmare](https://github.com/earwig/mwparserfromhell). To help you out, we have downloaded and parsed the tables, which you can find [here](https://github.com/suneman/socialgraphs2018/tree/master/files/data_US_congress). If you want to find out how we have parsed the tables directly from the html of the page, you can find the code [here](https://github.com/suneman/socialgraphs2018/blob/master/files/additional_codes/Parse_tables_US_House.ipynb). 

Each row in a table corresponds to a member of the house and contains the following: Title of the wikipedia page, State, Party. In the video below, we give you some hints on how to analyse this data.

    from IPython.display import YouTubeVideo
    YouTubeVideo("nN9Bmw9OTwY", width=800, height=450)
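
Since each row of the tables described above holds a page title, a state and a party, a few lines of pandas go a long way. The sketch below is only an illustration: the column names (`WikiPageName`, `State`, `Party`), the inline CSV text and the `Member_A`-style names are all hypothetical stand-ins, so check the actual files for their real layout.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the provided tables (made-up members)
CSV_113 = """WikiPageName,State,Party
Member_A,Alaska,Republican
Member_B,Ohio,Democratic
Member_C,Ohio,Republican
"""
CSV_114 = """WikiPageName,State,Party
Member_A,Alaska,Republican
Member_C,Ohio,Republican
Member_D,Texas,Democratic
"""

def load_table(csv_text, congress):
    """Read one table and remember which congress it belongs to."""
    df = pd.read_csv(io.StringIO(csv_text))
    df["congress"] = congress
    return df

# Stack the tables so that every row is (member, state, party, congress)
members = pd.concat(
    [load_table(CSV_113, 113), load_table(CSV_114, 114)],
    ignore_index=True,
)

# Number of distinct members per congress
size_per_congress = members.groupby("congress")["WikiPageName"].nunique()

# Party composition per congress (rows: congress, columns: party)
party_per_congress = pd.crosstab(members["congress"], members["Party"])

# In how many of the congresses does each member appear?
terms_served = members.groupby("WikiPageName")["congress"].nunique()

print(size_per_congress)
print(party_per_congress)
print(terms_served.value_counts())
```

Keeping everything in one long table with a `congress` column is a design choice: it lets `groupby` and `crosstab` answer the "over time" questions without looping over separate dataframes.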

> _Exercise_: Put together some descriptive statistics on the US house of representatives over time.
> A good way to extract statistics on tabular data is to use [pandas Dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).
>
>   * By the word *member* we mean a politician who has been elected to the house of representatives. Plot the number of *members* of the house of representatives over time. You choose whether to use a line chart or a bar chart. Is this development over time what you would expect? Why? Explain in your own words.
>   * How many members appear in all three congresses? How many in two? How many in one? Plot your results using a histogram.
>   * Which states are most represented in the house of representatives? Which are least? Plot a histogram showing the number of members per state.
>   * How has the party composition of the house of representatives changed over time? Plot your results.

# Part B: Download the Wikipedia pages of politicians


It's time to download all of the pages of the politicians. Use your experience with APIs from Week 1\. To get started, I **strongly** recommend that you re-watch the **APIs video lecture** from that week - it contains lots of useful tips on this specific activity (yes, I had planned this all along!). I've included it below for your convenience.

    YouTubeVideo("9l5zOfh0CRo", width=800, height=450)

**Important:** we are interested in the temporal evolution of the network of politicians. So, we want to look at each network as it was at the time of the corresponding congress.

As an example, the 113th congress ran between Jan 2013 and Jan 2015, so we want to know ***what the links were between its members at that time***. Luckily wikipedia actually stores that information! You might not know this, but each page comes with a record of what it has looked like since the day it was created!! Pretty amazing, right?

Because we now have the power of programming available to us (yay!), *we can simply query the Wikipedia API to retrieve **previous versions** of the pages* (have a look at the documentation [here](https://www.mediawiki.org/wiki/API:Revisions)). To do so, we need to specify a few additional parameters in our query. You can see an example of a query below.

##### Query a revision page: example


The parameters *rvend* and *rvstart* give the boundaries of the time period to consider. *rvdir* specifies the order used to sort the revisions. *rvlimit* specifies the number of revisions returned. The query below will return the newest revision of the "John McCain" page written between Jan 3, 2000 and Jan 3, 2013. *_Note:_* counterintuitively, you should choose *rvend* < *rvstart*. Check the [documentation](https://www.mediawiki.org/wiki/API:Revisions) for details.


    baseurl = "https://en.wikipedia.org/w/api.php?"
    action = "action=query"
    title = "titles=John_McCain"
    content = "prop=revisions"
    rvprop = "rvprop=timestamp|content"  # return the timestamp and the wiki markup
    dataformat = "format=json"
    rvdir = "rvdir=older"  # sort revisions from newest to oldest
    start = "rvend=2000-01-03T00:00:00Z"  # start of my time period
    end = "rvstart=2013-01-03T00:00:00Z"  # end of my time period
    limit = "rvlimit=1"  # consider only the first revision

    # join the base URL and all the parameters into a single query string
    query = "%s%s&%s&%s&%s&%s&%s&%s&%s&%s" % (baseurl, action, title, content, rvprop, dataformat, rvdir, end, start, limit)
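
If you prefer not to glue the query together by hand, the same parameters can be passed through `urllib.parse.urlencode`. The sketch below also shows one way to dig the wiki markup out of the decoded JSON. The helper names `build_revision_query` and `extract_wikitext` are hypothetical, and the sample response is hand-written under the assumption that the content sits under the `"*"` key of the first revision; print a real response and check its layout before relying on this.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def build_revision_query(title, before, after="2000-01-03T00:00:00Z"):
    """Return a URL asking for the newest revision of `title` between `after` and `before`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|content",
        "rvdir": "older",    # newest first
        "rvstart": before,   # end of the time period (yes, rvstart is the later date)
        "rvend": after,      # start of the time period
        "rvlimit": 1,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)

def extract_wikitext(response):
    """Return the wiki markup from a decoded JSON response, or None if there is none."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        revisions = page.get("revisions")
        if revisions:
            # Assumption: the markup is stored under the "*" key
            return revisions[0].get("*")
    return None

url = build_revision_query("John_McCain", before="2015-01-03T00:00:00Z")

# Hand-written sample responses (NOT real API output), used to exercise the parser
sample = json.loads(
    '{"query": {"pages": {"123": {"title": "Example", "revisions": '
    '[{"timestamp": "2012-12-31T00:00:00Z", "*": "Some [[wiki]] text"}]}}}}'
)
missing = json.loads('{"query": {"pages": {"-1": {"title": "Nope", "missing": ""}}}}')

wikitext = extract_wikitext(sample)
nothing = extract_wikitext(missing)
print(url)
print(wikitext, nothing)
```

Returning `None` for pages without revisions makes it easy to collect the titles that failed and inspect them afterwards, instead of crashing halfway through a long download.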

> _Exercise_: Download the wikipedia pages of the members of the house of representatives.
>
> * Consider all politicians in the 113th house of representatives. For each of them, use Wikipedia's API to download the full content (using python) of the latest version of the politician's Wikipedia page written before Jan 2015.
> * Save the page in a text file (see below for tips and tricks).
> * Create a folder, and save the pages in that folder using the politician name as filename.
> * Repeat for the 114th and 115th congresses, remembering to consider the right time period (for the 114th congress, consider the last revision before January 2017 and for the 115th congress, consider the last revision before January 2019).


> **Important:** **Don't get the `html` version of the page**, get the standard [wiki markup](https://en.wikipedia.org/wiki/Help:Wiki_markup), which is what you see when you press "edit" on a wikipedia page.
>
> A couple of extra tips below:
> 
> * Some pages contain unicode characters, so we recommend you save the files using the [`io.open`](http://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python) method with `utf-8` encoding.
> * Store the content of all pages. It's up to you how to do this. One strategy is to use Python's built-in `pickle` format. Or you can simply write the content of wiki-pages to text files and store those in a folder on your computer. I'm sure there are other ways. It's crucial that you store them in a way that's easy to access, since we'll use these pages a lot throughout the remainder of the course (so you don't want to retrieve them from wikipedia every time).
> 
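
The text-file strategy from the tips above could look roughly like this. It is a sketch under a few assumptions: the pages live in a dictionary mapping page title to wiki markup, the helper names `save_pages` and `load_pages` are made up, and the example writes to a temporary folder so that it leaves nothing behind. In Python 3 the built-in `open` is the same function as `io.open`, so it accepts the `encoding` argument directly.

```python
import os
import tempfile

def save_pages(pages, folder):
    """Write each page to `folder/<title>.txt` using utf-8."""
    os.makedirs(folder, exist_ok=True)
    for title, text in pages.items():
        # "/" would be read as a sub-folder, so replace it in file names
        safe_name = title.replace("/", "_")
        path = os.path.join(folder, safe_name + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)

def load_pages(folder):
    """Read every .txt file in `folder` back into a {title: text} dictionary."""
    pages = {}
    for filename in os.listdir(folder):
        if filename.endswith(".txt"):
            path = os.path.join(folder, filename)
            with open(path, encoding="utf-8") as f:
                pages[filename[:-4]] = f.read()
    return pages

# Made-up page content with non-ASCII characters
pages = {"Example_Page": "Text with unicode: Ñ é ü and a [[Link]]"}

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "congress_113")
    save_pages(pages, folder)
    restored = load_pages(folder)

print(restored == pages)
```

One folder per congress keeps the three snapshots of the same politician from overwriting each other.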

# Part C: Building the networks

Now, we're going to build 3 NetworkX directed graphs, one for each of the congresses we are considering. Each network will have as nodes the members of the house of representatives, and an edge from node A to node B should exist if the Wikipedia page of node A includes a link to the Wikipedia page of node B.

 

    YouTubeVideo("9i_c31v9Nb0", width=800, height=450)


> 
> _Exercise_: Build the network of members of the 113th house of representatives.
>
> Take the pages you have downloaded for the 113th house of representatives. Each page corresponds to a politician, which is a node in your network. Find all the hyperlinks in a politician page that link to another node of the network (e.g. another politician who is a member of the same congress). There are many ways to do this, but below, I've tried to break it down into natural steps.
> 
> * Use a regular expression to extract all outgoing links from each of the pages you downloaded above. 
> * For each link you extract, check if the target is a member of the same congress. If yes, keep it. If no, discard it.
> * Use a NetworkX [`DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) to store the network. Store also the properties of the nodes (state and party of each politician).
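
The last two steps above can be sketched as follows, assuming the links have already been extracted. Everything in the example is made up: `members` maps hypothetical page titles to their state and party, and `outgoing_links` maps each page to the link targets found on it. Dropping self-links is a choice made here, not something the exercise requires.

```python
import networkx as nx

# Made-up members and already-extracted link targets, for illustration only
members = {
    "Member_A": {"state": "Ohio", "party": "Democratic"},
    "Member_B": {"state": "Texas", "party": "Republican"},
    "Member_C": {"state": "Ohio", "party": "Republican"},
}
outgoing_links = {
    "Member_A": ["Member_B", "Ohio", "Member_C"],
    "Member_B": ["Member_A", "Member_B"],
    "Member_C": [],
}

def build_network(members, outgoing_links):
    """Build a directed graph with one node per member and one edge per kept link."""
    G = nx.DiGraph()
    # Add every member as a node, storing state and party as node attributes
    for name, attrs in members.items():
        G.add_node(name, **attrs)
    # Keep a link only if it points to another member of the same congress
    for source, targets in outgoing_links.items():
        for target in targets:
            if target in members and target != source:
                G.add_edge(source, target)
    return G

G = build_network(members, outgoing_links)
print(G.nodes(data=True))
print(list(G.edges()))
```

Adding all members as nodes first means that politicians without any links still show up in the network. With real pages you may also need to normalize link targets (for instance spaces versus underscores) before comparing them to the page titles.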


> _Exercise_: Simple network statistics for the 113th house of representatives.
>
> * What is the number of nodes in the network? And the number of links?
> * Plot the in- and out-degree distributions.
> * Who is the most connected representative?
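
As a starting point for these statistics, here is a sketch on a tiny made-up directed graph. It counts nodes and links and tabulates the degree distributions with a `Counter`, which you can then feed to the plotting function of your choice. Note that "most connected" is ambiguous in a directed network (in-degree, out-degree or their sum), so the example simply picks in-degree as one possible reading.

```python
from collections import Counter
import networkx as nx

# Tiny made-up directed graph, for illustration only
G = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")])

n_nodes = G.number_of_nodes()
n_links = G.number_of_edges()

# Degree of every node, as {node: degree} dictionaries
in_deg = dict(G.in_degree())
out_deg = dict(G.out_degree())

# Degree distributions: how many nodes have in-degree 0, 1, 2, ...
in_hist = Counter(in_deg.values())
out_hist = Counter(out_deg.values())

# One reading of "most connected": the node that most other pages link to
most_linked_to = max(in_deg, key=in_deg.get)

print(n_nodes, n_links)
print(sorted(in_hist.items()))
print(sorted(out_hist.items()))
print(most_linked_to)
```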