# Network curation
In this notebook we will use API calls to harvest data from Wikipedia. We will then use the NetworkX library to turn that data into different types of networks. The networks will be exported as .gexf files that can be visualized in the open source software [Gephi](https://gephi.org/users/download/) or, if you want a quick and dirty result without having to install any software, in the web application [Gephisto](https://jacomyma.github.io/gephisto/).


## 1. Install and import the necessary libraries
First, we will make sure that the right libraries are installed and import them into this notebook.

In [None]:
!pip install wikipedia-api
!pip install wikipedia
!pip install pandas
!pip install networkx

import wikipediaapi
import wikipedia
import pandas as pd
import networkx as nx

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11696 sha256=b753f3c60e2559bbb0fa311f86be41a9c5f6a2106cb67bb8b23154a3a1558159
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


## 2. Get a set of Wikipedia articles
Before we can start building networks, we need a set of Wikipedia articles to work on. In the example below, we get all the articles in the category ["computer ethics"](https://en.wikipedia.org/wiki/Category:Computer_ethics), including the articles in its sub-categories down to two levels. You can change the script to get articles from a different category. To get this information, we call an API endpoint called "categorymembers".

In [None]:
# Decide which category to get articles from
category_to_extract = "Category:Computer_ethics"

# Create an empty set of articles to fill later on
article_set = set()

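# This is an object we use to connect to the API.
# Note that we configure it to use the English Wikipedia.
# Recent releases of wikipedia-api require a user agent; the string below
# is a placeholder, so replace it with your own project name and contact.
wiki_wiki = wikipediaapi.Wikipedia(
    user_agent='NetworkCurationNotebook (your.email@example.com)',
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)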

# Create the category object (stuff specific to the API library)
cat = wiki_wiki.page(category_to_extract)

# Recursively build the list of pages (because there are sub-categories)
# For the recursion, we create a function that might call itself
def parse_categorymembers(categorymembers, level=0, max_level=2):
    for c in categorymembers.values():
        if c.ns == wikipediaapi.Namespace.MAIN: # This element is an article
            article_set.add(c.title)
        if c.ns == wikipediaapi.Namespace.CATEGORY and level < max_level: # This element is a sub-category
            parse_categorymembers(c.categorymembers, level=level + 1, max_level=max_level)
parse_categorymembers(cat.categorymembers)

# Transform the set into a data frame for convenience
article_df = pd.DataFrame(article_set, columns=["Article"])

# Output the data frame to check if it works
article_df

| | Article |
|---|---|
| 0 | Fake news website |
| 1 | Doxing |
| 2 | ISP redirect page |
| 3 | Pills, porn and poker |
| 4 | Tencent Dajia |
| ... | ... |
| 540 | The Memory Hole (website) |
| 541 | Whitelisting |
| 542 | Upstream collection |
| 543 | Mimecast |
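
For readers curious about what sits behind the library call, the sketch below builds the kind of request URL that a "categorymembers" query corresponds to in the MediaWiki API. It is only an illustration: the helper name `build_categorymembers_url` is hypothetical, the endpoint is assumed to be the standard one for the English Wikipedia, and the URL is constructed but never sent.

```python
from urllib.parse import urlencode

# Assumption: the standard API endpoint of the English Wikipedia
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_categorymembers_url(category_title, limit=500):
    """Build (but do not send) a categorymembers query URL.

    Hypothetical helper, for illustration only.
    """
    params = {
        "action": "query",           # we want to query data
        "list": "categorymembers",   # ... namely the members of a category
        "cmtitle": category_title,   # the category to list
        "cmlimit": limit,            # maximum number of members per request
        "format": "json",            # ask for a JSON response
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_categorymembers_url("Category:Computer_ethics")
print(url)
```

In the notebook itself we do not need to build such URLs by hand, since the `wikipediaapi` library takes care of the requests (and of sub-categories, through our recursive function).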


## 3. Build two different networks
We are now going to build two types of networks between the articles we have just extracted. Both will be monopartite networks where the nodes are the articles. The edges, however, will be built in two different ways.

* In the first network, the edges will be the hyperlinks between the articles. This will show us how articles about computer ethics refer to each other. This network will be unweighted.

* In the second network, the edges will represent the degree to which two articles refer to the same external references. This is essentially a projection of a bipartite network: we only see one type of node (the articles), but two articles are connected only because both are connected to nodes of a second type (the external references) that we do not see here. The network will therefore be weighted: some pairs of articles have many references in common, while others have few.
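
To make the idea of a projected bipartite network more concrete, here is a minimal sketch on toy data. The article and reference names are made up, and the sketch uses NetworkX's built-in `bipartite.weighted_projected_graph` function, which is not necessarily how the notebook builds its own network further down.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy data (hypothetical): three articles and the references each one cites
article_refs = {
    "Article A": ["ref1", "ref2", "ref3"],
    "Article B": ["ref2", "ref3"],
    "Article C": ["ref4"],
}

# Build the bipartite network: articles on one side, references on the other
B = nx.Graph()
B.add_nodes_from(article_refs, bipartite=0)
for article, refs in article_refs.items():
    for ref in refs:
        B.add_node(ref, bipartite=1)
        B.add_edge(article, ref)

# Project onto the articles: the edge weight is the number of shared references
projected = bipartite.weighted_projected_graph(B, list(article_refs))
shared = projected["Article A"]["Article B"]["weight"]
print(shared)
```

Articles A and B end up connected because they share references, while Article C stays isolated because it shares none with the others.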

### 3.1.1 Get all the hyperlinks from the articles
For the first network we will begin by calling the Wikipedia API to get all the hyperlinks from each article. 

In [None]:
# Collect the article titles from the data frame into a plain list
cat_members_all = article_df['Article'].tolist()

seen = set()   # titles that have already been processed
network = {}   # maps each article title to the titles it links to
print("Harvesting all links from "+str(len(cat_members_all))+" wikipedia pages. This might take a while...")
print("")

for count, title in enumerate(cat_members_all, start=1):
    if count % 50 == 0:
        print("All links harvested from "+str(count)+" pages out of "+str(len(cat_members_all))+". Continuing harvest...")
    if title not in seen:
        seen.add(title)
        try:
            # page.links is a dictionary keyed by the titles of the linked pages
            page = wiki_wiki.page(title)
            network[title] = sorted(page.links.keys())
        except Exception:
            # If anything goes wrong for this page, report it and move on
            print('SKIPPED: '+title)
            print("")

print("All pages harvested!")

Harvesting all links from 545 wikipedia pages. This might take a while...

All links harvested from 50 pages out of 545. Continuing harvest...
All links harvested from 100 pages out of 545. Continuing harvest...
All links harvested from 150 pages out of 545. Continuing harvest...
All links harvested from 200 pages out of 545. Continuing harvest...
All links harvested from 250 pages out of 545. Continuing harvest...
All links harvested from 300 pages out of 545. Continuing harvest...
All links harvested from 350 pages out of 545. Continuing harvest...
All links harvested from 400 pages out of 545. Continuing harvest...
All links harvested from 450 pages out of 545. Continuing harvest...
All links harvested from 500 pages out of 545. Continuing harvest...
All pages harvested!


### 3.1.2 Build the network of articles (nodes) connected by hyperlinks (edges)
We then use the NetworkX library to build the network and export the result as a .gexf file. Given that many of the hyperlinks point to articles outside the "computer ethics" category, we first check that a link connects two articles on our list before we include it as an edge.

In [None]:
membersonly_edges = []
members = network.keys()
print("Building network...")
print("")
for source in network:
    for target in network[source]:
        edge = (source,target)
        if target in members:
            membersonly_edges.append(edge)
print("Saving network...")
print("")
G = nx.DiGraph()
G.add_edges_from(membersonly_edges)
nx.write_gexf(G,category_to_extract+'_AllLinksNet_membersonly.gexf')

print('DONE!')

Building network...

Saving network...

DONE!
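
The membership check described above can be tried out on a toy dictionary with the same shape as `network`. This is only a sketch: the link lists are invented, and the `add_nodes_from` call is an optional variation that is not part of the cell above. It keeps articles that have no links to other members, which would otherwise be missing from the graph because only edges are added.

```python
import networkx as nx

# Hypothetical toy data: each article title maps to the titles it links to
toy_network = {
    "Doxing": ["Privacy", "Whitelisting"],
    "Whitelisting": ["Doxing", "Email"],
    "Mimecast": ["Email"],
}
members = set(toy_network)

# Keep only the links that point to another article on our list
toy_edges = [
    (source, target)
    for source, targets in toy_network.items()
    for target in targets
    if target in members
]

toy_G = nx.DiGraph()
toy_G.add_nodes_from(members)    # optional: keeps articles without any member links
toy_G.add_edges_from(toy_edges)
print(toy_edges)
```

Here the links to "Privacy" and "Email" are dropped because those articles are not on the list, and "Mimecast" survives only as an isolated node.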


### 3.2.1 Get all the external references from the articles
For the second network we will begin by calling the Wikipedia API to get all the external references from each article.

In [None]:
cat_members_dict={}
cat_members_list=[]
for title in cat_members_all:
    try:
        page = wikipedia.page(title,auto_suggest=False)
    except wikipedia.exceptions.DisambiguationError:
        print("Wikipedia thinks "+title+" is ambiguous (returns several candidate pages). Trying again with all capitalized letters")
        try:
            # Note: capitalize() upper-cases the first character and lower-cases the rest
            page = wikipedia.page(title.capitalize(),auto_suggest=False)
            print("Success! "+title+" is no longer ambiguous")
        except wikipedia.exceptions.DisambiguationError:
            print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Trying again with all lower letters")
            try:
                page = wikipedia.page(title.lower(),auto_suggest=False)
                print("Success! "+title+" is no longer ambiguous")
            except wikipedia.exceptions.DisambiguationError:
                print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Skipping page...")
                continue
    except wikipedia.exceptions.PageError:
        print("The page "+title+" could not be found. Skipping page...")
        continue
    except Exception as e:
        # Any other error: report it and skip the page, so that we never
        # reuse the page object left over from the previous iteration
        print(e)
        continue

    try:
        # Store the external references of the page under its title
        refs = page.references
        cat_members_dict[title]={"references":refs}
        cat_members_list.append(title)

    except KeyError:
        print("Could not retrieve references for "+title+". Skipping page...")
        continue
print("Successfully retrieved references from "+str(len(cat_members_dict))+" out of "+str(len(cat_members_all))+" wikipedia pages. Generating network....")






Wikipedia thinks Comment spam is ambiguous (returns several candidate pages). Trying again with all capitalized letters
Wikipedia still thinks Comment spam is ambiguous (returns several candidate pages). Trying again with all lower letters
Wikipedia still thinks Comment spam is ambiguous (returns several candidate pages). Skipping page...
Successfully retrieved references from 544 out of 545 wikipedia pages. Generating network....


### 3.2.2 Build the network of articles (nodes) connected by shared references (edges)
We then use the NetworkX library to build the network and export the result as a .gexf file. Given that articles have varying numbers of external references in common, we will weight the edges to reflect the volume of shared references between two nodes.

In [None]:
edges = []

for i,source in enumerate(cat_members_list):
    source_refs = cat_members_dict[source]["references"]
    if len(source_refs)>0:
        # Only look at articles later in the list, so each pair is compared once
        for target in cat_members_list[i+1:]:
            if target==source:
                continue
            target_refs=cat_members_dict[target]["references"]
            if len(target_refs)>0:
                # Number of distinct references the two articles have in common
                overlap = len(set(source_refs).intersection(target_refs))
                if overlap>0:
                    # Normalize by the length of the shorter of the two reference lists
                    norm_overlap_by_smallest = overlap / min(len(source_refs), len(target_refs))
                    edge = (source,target,{'overlap':overlap,'norm_overlap_by_smallest':norm_overlap_by_smallest})
                    edges.append(edge)
print("Network has been generated. Saving...")
G = nx.Graph()
G.add_edges_from(edges)
nx.write_gexf(G,category_to_extract+'_CoReferenceNet_membersonly.gexf')
print("DONE!")

Network has been generated. Saving...
DONE!
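
To see what the two edge attributes mean, here is a small worked example of the same calculation on made-up reference lists. The helper name `reference_overlap` and the example URLs are hypothetical and only serve as illustration.

```python
def reference_overlap(source_refs, target_refs):
    """Return the raw overlap and the overlap normalized by the shorter list.

    Hypothetical helper mirroring the calculation in the cell above.
    """
    overlap = len(set(source_refs).intersection(target_refs))
    smallest = min(len(source_refs), len(target_refs))
    norm_overlap_by_smallest = overlap / smallest if smallest > 0 else 0.0
    return overlap, norm_overlap_by_smallest


# Made-up reference lists: one article with four references, one with two
refs_a = [
    "https://example.org/1",
    "https://example.org/2",
    "https://example.org/3",
    "https://example.org/4",
]
refs_b = ["https://example.org/1", "https://example.org/9"]

overlap, norm = reference_overlap(refs_a, refs_b)
print(overlap, norm)
```

The two articles share one reference, so `overlap` is 1. The shorter list has two references, so the normalized value is 1 / 2 = 0.5: half of the smaller article's references also appear in the larger one.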
