# Overview

This week is all about working with data. I'm not going to lie to you. This part might be frustrating - but frustration is an integral part of learning. Real data is almost always messy & difficult ... and learning to deal with that fact, is a key part of being a data scientist. 


Enough about the process, let's get to the content. 

![Text](https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/DC_vs_marvel.png "Great image choice, Sune")

Today, we will use network science and Wikipedia to learn about the relationships of **[DC](https://en.wikipedia.org/wiki/Lists_of_DC_Comics_characters)** and **[Marvel](https://en.wikipedia.org/wiki/Lists_of_Marvel_Comics_characters)** characters. 

To create the network, we will download the Wikipedia pages for all characters in each of the DC and Marvel universes. Next, we ill create the network of the pages that link to each other. Since wikipedia pages link to each other. So [Spider-Man](https://en.wikipedia.org/wiki/Spider-Man) links to [Superman](https://en.wikipedia.org/wiki/Superman), for example (it really does, but most links are "within-universe").

Next time, we'll use our network skills (as well as new ones) to understand that network. Further down the line, we'll use natural language processing to understand the text displayed on those pages.

But for today, the tasks are

* Learn about regular expressions
* Learn about Pandas dataframes
* Download and store (for later use) all the character-pages from Wikipedia
* Extract all the internal wikipedia-links that connect the characters on wikipedia
* Generate the network of characters on wikipedia. 
* Calculate some simple network statistics.

## The informal intro (not to be missed)

Today I talk about 

* The COVID-19 situation
* Results of the user satisfaction questionnaire
* Assignment 1
* Today's exercises

In [3]:
from IPython.display import YouTubeVideo
YouTubeVideo("[link]",width=800, height=450)

---

# Prelude: Regular expressions

Before we get started, we have to get a little head start on the _Natural Language Processing_ part of the class. This is a new direction for us, up to now, we've mostly been doing math-y stuff with Python, but today, we're going to be using Python to work through a text. The central thing we need to be able to do today, is to extract internal wikipedia links. And for that we need regular expressions.

> _Exercises_: Regular expressions round 1\.
> 
> * Read [**this tutorial**](https://developers.google.com/edu/python/regular-expressions) to form an overview of regular expressions. This is important to understand the content of the tutorial (also very useful later), so you may actually want to work through the examples.
> * Now, explain in your own words: what are regular expressions?
> * Provide an example of a regex to match 4 digits numbers (by this, I mean precisely 4 digits, you should not match any part of numbers with e.g. 5 digits). In your notebook, use `findall` to show that your regex works on this [test-text](https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt). **Hint**: a great place to test out regular expressions is: https://regex101.com.
> * Provide an example of a regex to match words starting with "super". Show that it works on the [test-text](https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt).
> 

Finally, we need to figure out how how to match internal wiki links. Wiki links come in two flavors. They're always enclosed in double square brackets, e.g. `[[wiki-link]]` and can either occur like this:

    ... some text [[Aristotle]] some more text ...

which links to the page [`https://en.wikipedia.org/wiki/Aristotle`](https://en.wikipedia.org/wiki/Aristotle). 

The second flavor has two parts, so that links can handle spaces and other more fancy forms of references, here's an example:

    ... some text [[John_McCain|John McCain]] some more text ...

which links to the page [`https://en.wikipedia.org/wiki/John_McCain`](https://en.wikipedia.org/wiki/Eudemus_of_Rhodes). Now it's your turn.

> _Exercise_: Regular expressions round 2\. Show that you can extract the wiki-links from the [test-text](https://raw.githubusercontent.com/suneman/socialgraphs2017/master/files/test.txt). Perhaps you can find inspiration on stack overflow or similar. **Hint**: Try to solve this exercise on your own (that's what you will get the most out of - learning wise), but if you get stuck ... you will find the solution in one of the video lectures below.
> 

In [106]:
import re
import requests
from concurrent import futures 
import networkx as nx

**Answer**

* regular expressions is a tool for matching text patterns
* [0-9]{4}
* super[\w]*

In [16]:
test_url = "https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/regex_exercise.txt"
res = requests.get(test_url)

In [17]:
print(re.findall('[0-9]{4}', res.text))
print(re.findall('super[\w]*', res.text))

['1234', '9999', '2345']
['superpolaroid', 'supertaxidermy', 'superbeer']


In [132]:
pattern1 = '\[\[(\w+)\|[\S\s]*?]]|\[\[(\w+)]]'
baseUrl = 'https://en.wikipedia.org/wiki/'
for group in re.findall(pattern1, res.text):
    gp = group[0] or group[1]
    print(baseUrl + gp)

https://en.wikipedia.org/wiki/gentrify
https://en.wikipedia.org/wiki/hashtag
https://en.wikipedia.org/wiki/Bicycle
https://en.wikipedia.org/wiki/Pitchfork


# Prelude part 2: Pandas DataFrames


Before starting, we will also learn a bit about [pandas dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), a very user-friendly data structure that you can use to manipulate tabular data. Pandas dataframes are implemented within the [pandas package] (https://pandas.pydata.org/).

Pandas dataframes should be intuitive to use. **We suggest you to go through the [10 minutes to Pandas tutorial](https://pandas.pydata.org/pandas-docs/version/0.22/10min.html#min) to learn what you need to solve the next exercise.**

---

# Part A: Download the Wikipedia pages of characters

It's time to download all of the pages of the characters. Use your experience with APIs from Week 1\. To get started, I **strongly** recommend that you re-watch the **APIs video lecture** from that week - it contains lots of useful tips on this specific activity (yes, I had planned this all along!). I've included it below for your covenience.

In [4]:
from IPython.display import YouTubeVideo
YouTubeVideo("9l5zOfh0CRo",width=800, height=450)

Now, back in the day, I had all students first download the names of all the characters, starting from 

* https://en.wikipedia.org/wiki/Lists_of_Marvel_Comics_characters
* https://en.wikipedia.org/wiki/Lists_of_DC_Comics_characters

But that resulted in so much pain and suffering that recently I've decided against that. Instead, you can download all the names, nice and clean, here:
 
* **[Marvel List](https://github.com/SocialComplexityLab/socialgraphs2020/blob/master/files/marvel_characters.csv)**
* **[DC List](https://github.com/SocialComplexityLab/socialgraphs2020/blob/master/files/dc_characters.txt)**

*The files contain names and corresponding wiki-link. If link if absent, then the character does not have a specific page and information about this particular character can be find in the [Marvel](https://en.wikipedia.org/wiki/Lists_of_Marvel_Comics_characters) or [DC](https://en.wikipedia.org/wiki/Lists_of_DC_Comics_characters) lists*

> ### A challenge
> However, if you're feeling tough, you can head over to our [Hardcore List Parsing](https://github.com/SocialComplexityLab/socialgraphs2020/blob/master/files/Hardcore_List_Parsing.ipynb) notebook, full of tricks to help you try out creating these lists on your own! If you manage to do both Marvel and DC on our own, you will officially have graduated to brown-belt Python hacker. (Black belt challenges coming later in the year.)

In [124]:
dcUrl = 'https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/dc_characters.txt'
marvelUrl = 'https://raw.githubusercontent.com/SocialComplexityLab/socialgraphs2020/master/files/marvel_characters.csv'
dcNames = requests.get(dcUrl).text.split('\n')
marvelNames = requests.get(marvelUrl).text.split('\n')[1:]

In [110]:
def downloadDc(names):
    path = './data/DC/'
    def task(name):
        baseUrl = 'https://en.wikipedia.org/wiki/'
        url = baseUrl + name
        page = requests.get(url)
        with open(path + name, 'w') as f:
            f.write(page.text)
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        for name in names:
            name = name.replace('\r', '')
            future = executor.submit(task, name)


def downloadMarvel(names):
    path = './data/Marvel/'
    def task(name):
        baseUrl = 'https://en.wikipedia.org/wiki/'
        name = name.replace('\r', '').split(',')
        if name:
            url = baseUrl + name[-1]
            page = requests.get(url)
            with open(path + name[1], 'w') as f:
                f.write(page.text)
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        for name in names:
            future = executor.submit(task, name)


In [73]:
downloadDc(dcNames)

In [111]:
downloadMarvel(marvelNames)

---

# Part B: Building the networks

Now, we're going to build one huge NetworkX directed graph, which includes both DC and Marvel Characters. 

The nodes in the network will be all the characters, and we will place an edge between nodes $A$ and $B$ if the Wikipedia page of node $A$ links to the Wikipedia page of node $B$.

 

In [28]:
YouTubeVideo("9i_c31v9Nb0",width=800, height=450)


> 
> _Exercise_: Build the network of Comics Characters 

> Now we can build the network. Isn't this a little bit cool? What a dataset :)

> The overall strategy for this is the following: 
> Take the pages you have downloaded for each character. 
> Each page corresponds to a politician, which is a node in your network. 
> Find all the hyperlinks in a characters page that link to another node of the network (e.g. an other character). > There are many ways to do this, but below, I've tried to break it down into natural steps. 
> Keep in mind that the network should include **both** DC and Marvel characters (and that it is possible that some DC Characters will have links to Marvel Characters and vice-versa).
> 
> **Note**: When you add a node to the network, also include an `attribute` (i.e. that specifies the universe where the character comes from; either DC, or Marvel)
>
>
> * Use a regular expression to extract all outgoing links from each of the pages you downloaded above. 
> * For each link you extract, check if the target is a character. If yes, keep it. If no, discard it.
> * Use a NetworkX [`DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) to store the network. Store also the properties of the nodes (i.e. from which universe they hail).
> * When have you finished, you'll notice that some nodes do not have any out- or in- degrees. You may *discard* those from the network.


In [125]:
dcNames = [name.replace('\r', '') for name in dcNames if name]
marvelNames = [name.replace('\r', '') for name in marvelNames]
marvelNames = [name.split(',')[1] for name in marvelNames if name]
dcNames = set(dcNames)
marvelNames = set(marvelNames)

In [143]:
DG = nx.DiGraph()

def getNeighbors(text):
    pattern = '\[\[(\w+)\|[\S\s]*?]]|\[\[(\w+)]]'
    rv = []
    for group in re.findall(pattern, text):
        gp = group[0] or group[1]
        if gp in dcNames:
            rv.append(gp, 'DC')
        elif gp in marvelNames:
            rv.append(gp, 'Marvel')
        
    return rv

for name in dcNames:
    DG.add_node(name, universe="DC")
    try:
        with open('./data/DC/' + name, 'r') as f:
            text = f.read()
            neighbors = getNeighbors(text)
            for n in neighbors:
                DG.add_node(n[0], universe=n[1])
                DG.add_edge(name, n[0])
    except FileNotFoundError as e:
        pass

for name in marvelNames:
    DG.add_node(name, universe="Marvel")
    try:
        with open('./data/Marvel/' + name, 'r') as f:
            text = f.read()
            neighbors = getNeighbors(text)
            for n in neighbors:
                DG.add_node(n[0], universe=n[1])
                DG.add_edge(name, n[0])
    except FileNotFoundError as e:
        pass

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [141]:
len(DG.edges)

0


> *Exercise*: Simple network statistics and analysis

> * What is the number of nodes in the network? 
> * More importantly, what is the number of links?
> * What is the number of links connecting Marvel and DC? What do those links mean?
> * Plot the in and out-degree distributions. What do you observe? Can you explain why the in-degree distribution is different from the out-degree distribution?
>     * Compare the degree distribution to a *random network* with the same number of nodes and *p*
>     * Compare the degree distribution to a *scale-free* network with the same number of nodes.
> * Who are top 5 most connected characters? (Report results for in-degrees and out-degrees). Comment on your findings. Is this what you would have expected.
> * Who are the top 5 most connected Marvel characters (again in terms of both in/out-degree)?
> * Who are the top 5 most connected DC characters (again in terms of both in/out-degree)?

> The total degree distribution (in + out degree) for you network should resemble the distribution displayed on the image below *(`Isolated` means that we have discarded the nodes with zero degrees)*:
![img](https://github.com/SocialComplexityLab/socialgraphs2020/blob/master/files/week4_degrees.png?raw=true)

Big thanks to TA Germans for helping design these exercises.