# Mapping controversies script 2: Make two different networks based on all links found on a Wikipedia page

In the script "MCTutorial2_Wikipedia_InText_reference_Network_final" we looked at all the links found in the main text of a Wikipedia article. By doing so, we excluded links that have been assigned to the article through a template. As Wikipedia puts it: _"Templates are pages that are embedded (transcluded) into other pages to allow for the repetition of information"_ ([Wikipedia templates](https://en.wikipedia.org/wiki/Wikipedia:Templates)). The template is typically found at the bottom of a Wikipedia page:

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549631130/Mapping%20controversies%202019/Template.jpg" title="Category:circumcision" style="width: 700px;" /> 

In this tutorial, we will include all "internal" links to other Wikipedia pages found on a page (i.e. the links found in the templates and in the main text). 

This script takes as input a file with category members from Wikipedia (e.g. "cat_members_circumcision_depth_2.json") and builds two networks: one in which only the category members are connected by the links between them, and one containing the category members plus all the pages they point to.
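As a minimal sketch of the difference between the two networks, assume the harvested links are stored in a dictionary that maps each category member to the pages it links to. The page titles below are invented for illustration, and `split_edges` is a hypothetical helper that is not part of the script.

```python
# Toy link data: each key is a category member, each value is the list of
# pages it links to (all titles here are made up for illustration).
toy_links = {
    "Page A": ["Page B", "Outside page 1"],
    "Page B": ["Page A", "Outside page 2"],
}


def split_edges(links):
    """Return (members_only_edges, all_edges) as lists of (source, target)."""
    members = set(links)
    all_edges = [(source, target) for source in links for target in links[source]]
    # Keep an edge in the members-only network when its target is also a member
    members_only = [edge for edge in all_edges if edge[1] in members]
    return members_only, all_edges


members_only_edges, all_edges = split_edges(toy_links)
print(members_only_edges)  # edges between category members only
print(all_edges)           # edges to every linked page
```

The members-only list is always a subset of the full list, which is why the second network is usually much larger than the first.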


## Step 1: Installing the right libraries
Libraries can be understood as pre-programmed pieces of script. This means that instead of writing many lines of code in order to, for example, contact Wikipedia, you can do it with a single command.


__Note: in this workbook we will be using the wikipediaapi (installed as wikipedia-api) and networkx libraries. If you have already installed them once, there is no need to do it again. You may simply skip to step 2.__
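If you are unsure whether the libraries are already installed, one way to check without importing them is sketched below. It relies only on the standard library, and `is_installed` is a hypothetical helper name chosen for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the module without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ["wikipediaapi", "networkx"]:
    status = "installed" if is_installed(name) else "missing"
    print(name, "is", status)
```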

In [1]:
# In this cell Jupyter checks whether you have the right libraries installed 

import sys

try: #First, Jupyter tries to import a library
    import wikipediaapi
    print("wikipediaapi library has been imported")
except ImportError: #If it fails, it will try to install the library
    print("wikipediaapi library not found. Installing...")
    #sys.executable makes sure pip installs into the Python that runs this notebook
    !{sys.executable} -m pip install wikipedia-api
    try:#... and try to import it again
        import wikipediaapi
    except ImportError: #If the import still fails, a message is printed.
        print("Something went wrong in the installation of the wikipediaapi library. Please check your internet connection and consult output from the installation below")
try:
    import networkx
    print("NetworkX library has been imported")
except ImportError:
    print("NetworkX library not found. Installing...")
    !{sys.executable} -m pip install networkx

    try:
        import networkx
    except ImportError:
        print("Something went wrong in the installation of the NetworkX library. Please check your internet connection and consult output from the installation below")

        

wikipediaapi library has been imported
NetworkX library has been imported


## Step 2: Make the networks of all links

The next step is to make the network. Here, you need to input the path to the json files you got from the MCTutorial1_Wikipedia_HarvestCatMembers_final script. 

If the JSON files are in the same directory as the scripts, you only need to give a relative path (i.e. the name of the JSON file, e.g. cat_members_circumcision_depth_2)
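The sketch below illustrates how a path is resolved and how the titles can be read from such a file. The folder, the file name "cat_members_example.json" and its content are made up so that the example runs on its own; a real file comes from the MCTutorial1 script.

```python
import json
import tempfile
from pathlib import Path

# Create a small stand-in for a category members file so the sketch is
# self-contained (the folder, file name and content are made up).
folder = Path(tempfile.mkdtemp())
example_file = folder / "cat_members_example.json"
example_file.write_text(
    json.dumps([{"title": "Page A", "level": 0}, {"title": "Page B", "level": 1}]),
    encoding="utf-8",
)

# A relative path is resolved against the current working directory
# (normally the folder that holds the notebook).
print(Path("cat_members_example.json").resolve())

# A full path works no matter where the notebook is stored.
with open(example_file, encoding="utf-8") as jsonfile:
    cat_members = json.load(jsonfile)

# A set removes duplicate titles; sorting gives a stable order
titles = sorted({member["title"] for member in cat_members})
print(titles)
```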

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549444568/Mapping%20controversies%202019/Script_json_same_folder_in_text.jpg" title="Folder" style="width: 800px;" /> 

In order to run the script, click on the cell below and press "Run" in the menu.

In [3]:
import requests
import json
import csv
import datetime
import re
import wikipediaapi
import networkx as nx

#Load the category members harvested with the MCTutorial1 script
cat_members_all=[]
path="category_members_2019–20_coronavirus_pandemic_depth_3.json"
with open(path, encoding="utf-8") as jsonfile: #the with-block closes the file automatically
    cat_members = json.load(jsonfile)
for every in cat_members:
    cat_members_all.append(every["title"])

#Remove duplicate titles
cat_members_all=list(set(cat_members_all))
    
print('Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):')

input_lan = input()
if not input_lan:
    lan="en"
else:
    lan=input_lan
print(" ")
wiki_wiki = wikipediaapi.Wikipedia(lan)
Revisions = []
count=0
S = requests.Session()
def do_stuff(each,page_dict,cat_member):
    
    title=cat_member
    revid=each["revid"]
    
    page_dict[title]["revisions"][revid]=each
    if "userid" in each:
        page_dict[title]["users"].append(each["userid"])
    else:
        page_dict[title]["users"].append("0")
    page_dict[title]["no_revisions"]+=1
page_dict={}
print("Harvesting revision history...")    
URL = "https://"+lan+".wikipedia.org/w/api.php"
for page_number, cat_member in enumerate(cat_members_all):
    #Report progress for every 100 pages
    if page_number % 100 == 0 and page_number != 0:
        print("The script has harvested from "+str(page_number)+" pages.")
        print(count)
    page_dict[cat_member]={"users":[], "no_revisions":0,"revisions":{}}
    PARAMS = {
        "action": "query",
        "prop": "revisions",
        "titles": cat_member,
        "rvlimit": "500",
        "rvprop": "timestamp|user|userid|ids|size|type|comment|tags|flags",
        "rvdir": "newer",
        "rvstart": "2020-02-02T00:00:00Z",
        "formatversion": "2",
        "format": "json"

    }

    try:
        R = S.get(url=URL, params=PARAMS)


        for each in R.json()["query"]["pages"][0]["revisions"]:
            do_stuff(each,page_dict,cat_member)
            count+=1
    except Exception as e:
        print(e)
        continue
    while 'continue' in R.json().keys():
        PARAMS = {
            "action": "query",
            "prop": "revisions",
            "titles": cat_member,
            "rvlimit": "500",
            "rvprop": "timestamp|user|userid|ids|size|type|comment|tags|flags",
            "rvdir": "newer",
            "rvstart": "2020-02-02T00:00:00Z",
            "formatversion": "2",
            "format": "json",
            "rvcontinue": R.json()['continue']['rvcontinue']

        }
        
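        try:
            R = S.get(url=URL, params=PARAMS)
            DATA = R.json()
            for each in DATA["query"]["pages"][0]["revisions"]:
                do_stuff(each,page_dict,cat_member)
                count+=1
        except Exception as e:
            print(e)
            #Stop paging this article: after a failed request R is unchanged,
            #so "continue" would repeat the same request forever
            break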
#with open("corona_cat_page_revisions.json", 'w') as outfile:
  #  json.dump(page_dict, outfile)

Enter the name of the category members json file you wish to use for keyword search (e.g.cat_members_circumcision_depth_2). If you have multiple files separate them with a comma
category_members_2019–20_coronavirus_pandemic_depth_0
 
Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):
en
 
Harvesting all links from 14 wikipedia pages. This might take a while...
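The paging logic in the cell above can be summarised in a small, offline sketch. It assumes a function `fetch` that takes the request parameters and returns the decoded JSON response; `harvest_all_revisions` and `fake_fetch` are hypothetical names, and the two batches served by `fake_fetch` are made-up data standing in for the API.

```python
def harvest_all_revisions(fetch, title):
    """Collect every revision of `title`, following continuation tokens.

    `fetch` is any function that takes a dict of API parameters and returns
    the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": "500",
        "formatversion": "2",
        "format": "json",
    }
    revisions = []
    while True:
        data = fetch(params)
        pages = data.get("query", {}).get("pages", [])
        if pages:
            revisions.extend(pages[0].get("revisions", []))
        if "continue" not in data:
            break  # no more batches
        # Send the continuation values back with the otherwise unchanged request
        params = {**params, **data["continue"]}
    return revisions


def fake_fetch(params):
    """Offline stand-in for the API that serves two batches (made-up data)."""
    if "rvcontinue" not in params:
        return {
            "continue": {"rvcontinue": "token-1", "continue": "||"},
            "query": {"pages": [{"revisions": [{"revid": 1}, {"revid": 2}]}]},
        }
    return {"query": {"pages": [{"revisions": [{"revid": 3}]}]}}


all_revisions = harvest_all_revisions(fake_fetch, "Example page")
print([revision["revid"] for revision in all_revisions])
```

With the `requests` session used in this notebook, `fetch` could be something like `lambda params: S.get(url=URL, params=params).json()`.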



In [2]:
import json
path="corona_cat_page_revisions.json"
with open(path) as jsonfile: #the with-block closes the file automatically
    page_dict = json.load(jsonfile)
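One detail is worth keeping in mind when reloading saved revisions: the harvesting cell stores each revision under its `revid`, which the API returns as an integer, while JSON only allows string keys. The sketch below shows the effect on a made-up miniature of `page_dict`.

```python
import json

# A tiny stand-in for page_dict with an integer revision id as key
# (the title and id are made up).
page_dict_before = {"Example page": {"revisions": {12345: {"revid": 12345}}}}

# Saving to JSON and loading again turns the integer key into a string
page_dict_after = json.loads(json.dumps(page_dict_before))

revision_keys = list(page_dict_after["Example page"]["revisions"].keys())
print(revision_keys)
print(12345 in page_dict_after["Example page"]["revisions"])
print("12345" in page_dict_after["Example page"]["revisions"])
```

After a reload, look revisions up by the string form of the id, for example `str(revid)`.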

In [3]:

import requests
import json
import csv
import datetime
import re
import time
import os
try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
root_dir="wiki_revision_dumps/"
Revisions = []
lan="en"
S = requests.Session()
    
    
URL = "http://"+lan+".wikipedia.org/w/api.php"
pages=list(page_dict.keys())
print(len(pages))
os.makedirs(root_dir, exist_ok=True) #make sure the output folder exists
for page in pages[941:]: #resumes from page number 941; use pages[0:] to start from the beginning
    start_time=time.time()
    #"/" cannot be used in a file name. str.replace returns a new string, so the result must be stored
    file_name=page.replace("/"," ")+".json"
    if os.path.exists(os.path.join(root_dir,file_name)):
        continue #this page has already been dumped
    dump_dict=page_dict[page]
    revisions=page_dict[page]["revisions"]
    for revision in revisions:
        PARAMS = {
            "action": "parse",
            "oldid":revision,
            "prop":"externallinks|categories|iwlinks|text", 
            "format": "json"

        }
        try:
            R = S.get(url=URL, params=PARAMS)
            data=R.json()
        except Exception as e:
            print(e)
            continue
        if not "parse" in data:
            continue
        parse=data["parse"]
        html=parse["text"]["*"]

        #Strip the HTML tags to get the plain text of the revision
        parsed_html = BeautifulSoup(html, "html.parser")
        parsed_text=parsed_html.text
        #Keep only the text before the last occurrence of "References"
        if "References" in parsed_text:
            new_split_text=parsed_text.rsplit("References",1)[0]
        else:
            new_split_text=parsed_text
        #Collect the titles of the links found inside the paragraphs (<p>...</p>)
        text_links = []
        paragraphs = str(parsed_html).split('<p>')
        for p in paragraphs[1:]:
            p = p.split('</p>')[0]
            links = p.split('<a href="')
            for link in links[1:]:
                if(' title="') in link:
                    link = link.split(' title="')[1]
                    link = link.split('">')[0]
                    link = link.replace("&#39;","'")
                    if link not in text_links:
                        text_links.append(link)
        dump_dict["revisions"][revision]["in_text_links"]=text_links
        dump_dict["revisions"][revision]["text"]=new_split_text
        dump_dict["revisions"][revision]["categories"]=parse["categories"]
        dump_dict["revisions"][revision]["external_links"]=parse["externallinks"]
        dump_dict["revisions"][revision]["inter_wiki_links"]=parse["iwlinks"]

    with open(os.path.join(root_dir,file_name), 'w') as outfile:
        json.dump(dump_dict, outfile)
    print(time.time()-start_time)
    print(pages.index(page))




982
417.20498180389404
942
71.44591307640076
943
93.79015231132507
944
106.90655589103699
945
57.627869844436646
946
14.405474185943604
947
10.623589515686035
948
38.267653942108154
949
46.41962480545044
950
3.694121837615967
951
234.70427441596985
952
272.7630937099457
953
1661.549043893814
954
590.4154393672943
955
37.596540689468384
956
3.951838254928589
957
10.126203775405884
958
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
2077.8182735443115
959
3.3934667110443115
960
19.
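The "Connection aborted" messages in the output above mean that the request for a revision failed and that revision was skipped. One possible way to make the harvest more robust is to retry a failed request a few times with a growing pause. This is a sketch under that assumption, not what the cell above does; `call_with_retries` and `flaky_request` are hypothetical names.

```python
import time


def call_with_retries(func, attempts=3, wait_seconds=5, sleep=time.sleep):
    """Call func() and retry on any exception, waiting longer after each failure.

    Returns the result of func(), or raises the last exception if every
    attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == attempts:
                raise
            print("Attempt " + str(attempt) + " failed: " + str(error))
            sleep(wait_seconds * attempt)


# Offline demonstration with a function that fails twice before succeeding
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection aborted.")
    return "response"


# The pause is replaced by a no-op here so the demonstration runs instantly
result = call_with_retries(flaky_request, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

In the harvesting loop, the wrapped call could be something like `lambda: S.get(url=URL, params=PARAMS)`.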

In [100]:
import requests
import json
import csv
import datetime
import re

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

Revisions = []
lan="en"
S = requests.Session()
    

    
    
URL = "http://"+lan+".wikipedia.org/w/api.php"
for page in page_dict:
    for revision in page_dict[page]["revisions"]:
        PARAMS = {
            "action": "parse",
            "oldid":revision,
            "prop":"externallinks|categories|iwlinks|text", 
            "preview":True,
            "format": "json"

        }

        R = S.get(url=URL, params=PARAMS)
        html=R.json()["parse"]["text"]["*"]

        #Strip the HTML tags to get the plain text of the revision
        parsed_html = BeautifulSoup(html, "html.parser")
        parsed_text=parsed_html.text
        #Keep only the text before the last occurrence of "References"
        if "References" in parsed_text:
            new_split_text=parsed_text.rsplit("References",1)[0]
        else:
            new_split_text=parsed_text

        page_dict[page]["revisions"][revision]["text"]=new_split_text
        parse=R.json()["parse"]
        page_dict[page]["revisions"][revision]["categories"]=parse["categories"]
        page_dict[page]["revisions"][revision]["externallinks"]=parse["externallinks"]
        page_dict[page]["revisions"][revision]["iwlinks"]=parse["iwlinks"]

No references
This article may be affected by a current event. Information in this article may change rapidly as the event progresses. Initial news reports may be unreliable. The last updates to this article may not reflect the most current information. Please feel free to improve this article (but note that updates without valid and reliable references will be removed) or discuss changes on the talk page. (Learn how and when to remove this template message)
2019-2020 China pneumonia outbreak, or China pnumonia, commonly known as Wuhan pnumia (Chinese: 武漢肺炎; pinyin: wǔhàn fèiyán) or pneumonia of unknown origin (Chinese: 不明原因肺炎; pinyin: bùmíng yuányīn fèiyán), is the pneumonia outbreak firstly discovered in Huanan Seafood Market in Wuhan, China. [1][2]</nowiki>
^ "Mystery pneumonia virus probed in China". 2020-01-03. Retrieved 2020-01-05..mw-parser-output cite.citation{font-style:inherit}.mw-parser-output .citation q{quotes:"\"""\"""'""'"}.mw-parser-output .id-lock-free a,.mw-parser-out

In [102]:
R.json()["parse"].keys()

dict_keys(['title', 'pageid', 'revid', 'text', 'categories', 'externallinks', 'iwlinks'])
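Of the keys listed above, `text` holds the rendered HTML of the revision. As a sketch of how link titles inside paragraphs could be pulled out of such HTML with only the standard library, consider the parser below. `ParagraphLinkParser` is a hypothetical class, the sample HTML is made up, and the sketch assumes that every `<p>` tag is closed explicitly.

```python
from html.parser import HTMLParser


class ParagraphLinkParser(HTMLParser):
    """Collect the title attribute of links that appear inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraph_depth = 0
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraph_depth += 1
        elif tag == "a" and self.paragraph_depth > 0:
            title = dict(attrs).get("title")
            if title and title not in self.titles:
                self.titles.append(title)

    def handle_endtag(self, tag):
        if tag == "p" and self.paragraph_depth > 0:
            self.paragraph_depth -= 1


# Made-up HTML in the style of a rendered article
sample_html = (
    '<p>Discovered in <a href="/wiki/Wuhan" title="Wuhan">Wuhan</a>, '
    '<a href="/wiki/China" title="China">China</a>.</p>'
    '<div><a href="/wiki/Template:Example" title="Template:Example">box</a></div>'
)

parser = ParagraphLinkParser()
parser.feed(sample_html)
print(parser.titles)
```

The link inside the `<div>` is ignored because it sits outside a paragraph, which mirrors the idea of keeping in-text links apart from template links.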

In [72]:

html = text#the HTML code you've written above
parsed_html = BeautifulSoup(html)
parsed_text=parsed_html.text
if "References" in parsed_text:
    print(len(parsed_text.split("References")))

2
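The number printed above is how many pieces the text falls into when it is split on the word "References". Two pieces mean the word occurs once. When it occurs more often, one option is to cut at the last occurrence only, as in this sketch (`text_before_references` is a hypothetical helper and the sample text is made up):

```python
def text_before_references(text, marker="References"):
    """Return the part of text before the last occurrence of marker."""
    before, found, _after = text.rpartition(marker)
    # rpartition leaves `found` empty when the marker is absent
    return before if found else text


sample = "Body text. References are mentioned here. References [1] [2]"
trimmed = text_before_references(sample)
print(trimmed)
```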


In [66]:
for each in R.json()["query"]["pages"][0]["revisions"][:4]:
    content=each["content"]
    #Raw strings (r'...') keep the backslashes in the patterns from being read as string escapes
    refs=re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', content)
    regex=r'\[(.*?)\]'
    cats=[]
    links=[]
    for cat in re.findall(regex,content):

        if "Category" in cat:
            cats.append(cat.strip("[").strip("]"))
        else:
            links.append(cat.strip("[").strip("]"))
    print(cats)
    print(links)
    print(refs)
    print(content)
    print("__________")

['Category:2010s medical outbreaks', 'Category:January 2020 events', 'Category:December 2019 events']
['Wuhan', 'China']
['https://www.bbc.com/news/world-asia-china-50984025', 'https://www.bloomberg.com/tosv2.html?vid=&uuid=371a3210-2fd3-11ea-8928-dd02747387de&url=L25ld3MvYXJ0aWNsZXMvMjAyMC0wMS0wNC9jaGluYS1wbmV1bW9uaWEtb3V0YnJlYWstc3B1cnMtd2hvLWFjdGlvbi1hcy1teXN0ZXJ5LWxpbmdlcnM=']
{{Current related}}
'''2019-2020 China pneumonia outbreak''', or '''China pnumonia''', commonly known as '''Wuhan pnumia''' ({{Lang-zh|c='''武漢肺炎'''|s=|t=|p=wǔhàn fèiyán}}) or '''pneumonia of unknown origin''' ({{Lang-zh|c='''不明原因肺炎'''|s=|t=|p=bùmíng yuányīn fèiyán}}), is the pneumonia outbreak firstly discovered in Huanan Seafood Market in [[Wuhan]], [[China]]. <ref>{{Cite news|url=https://www.bbc.com/news/world-asia-china-50984025|title=Mystery pneumonia virus probed in China|date=2020-01-03|access-date=2020-01-05|language=en-GB}}</ref><ref>{{Cite web|url=https://www.bloomberg.com/tosv2.html?vid=&uuid=371a32

['https://www.bbc.com/news/world-asia-china-50984025',
 'https://www.bloomberg.com/tosv2.html?vid=&uuid=371a3210-2fd3-11ea-8928-dd02747387de&url=L25ld3MvYXJ0aWNsZXMvMjAyMC0wMS0wNC9jaGluYS1wbmV1bW9uaWEtb3V0YnJlYWstc3B1cnMtd2hvLWFjdGlvbi1hcy1teXN0ZXJ5LWxpbmdlcnM=']

Category:2010s medical outbreaks
Category:January 2020 events
Category:December 2019 events
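The output above separates categories from ordinary links in the raw wikitext. A slightly stricter variant is sketched below: it matches only double square brackets (internal links) and drops the display label of piped links such as `[[China|the PRC]]`. The wikitext is made up for illustration.

```python
import re

# Made-up wikitext in the style of a Wikipedia revision
wikitext = (
    "Found in [[Wuhan]], [[China|the PRC]]. "
    "See [https://example.org an external site]. "
    "[[Category:January 2020 events]]"
)

# Double square brackets mark internal links; a single bracket marks an external one
internal = re.findall(r"\[\[(.*?)\]\]", wikitext)

categories = []
links = []
for item in internal:
    target = item.split("|")[0].strip()  # drop the display label of piped links
    if target.startswith("Category:"):
        categories.append(target)
    else:
        links.append(target)

print(categories)
print(links)
```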


In [26]:
refs=each["content"].split("refs")


TypeError: string indices must be integers

In [None]:
count=1
seen=set() #titles that have already been harvested
network={} #maps each page title to the titles of the pages it links to
#cat_members holds the full records (title and level); cat_members_all only holds the titles
for each in cat_members:
    title=each["title"]
    if count % 50 == 0:
        print("All links harvested from "+str(count)+" pages out of "+str(len(cat_members))+". Continuing harvest...")
    if not title in seen:
        seen.add(title)
        try:
        
            page=wiki_wiki.page(title)
            text_links = []
            links = page.links
            for link_title in sorted(links.keys()):
                text_links.append(link_title)
            network.update({title:text_links})

        except Exception:
            print('SKIPPED: '+title)
            print("")
    count=count+1
    
print("All pages harvested...")
new_cat_members={}
for each in cat_members:
    new_cat_members[each["title"]]={"level":each["level"]}
    
membersonly_edges = []
all_edges = []
members = network.keys()


In [None]:
print("Calculating networks...")
print("")
for source in network:
    for target in network[source]:
        edge = (source,target)
        all_edges.append(edge)
        if target in members:
            membersonly_edges.append(edge)
print("Saving networks...")
print("")
G = nx.DiGraph()
G.add_edges_from(membersonly_edges)
nx.write_gexf(G,'MCTutorial2_2_'+ filename+'_AllLinksNet_membersonly.gexf')

G = nx.DiGraph()
G.add_edges_from(all_edges)
for each in G.nodes:
    if each in members:
        G.nodes[each]['member_level'] = 'Level '+str(new_cat_members[each]["level"])
    else:
        G.nodes[each]['member_level'] = 'Not a member'
nx.write_gexf(G, 'MCTutorial2_2_'+filename+'_AllLinksNet_allpages.gexf')
import os

print("The script is done. You can find your network files by following these paths: ")
print("")
#os.getcwd() returns the working directory on every operating system, unlike the shell command pwd
locale=os.getcwd()
print(os.path.join(locale,'MCTutorial2_2_'+filename+'_AllLinksNet_membersonly.gexf'))
print("")
print(os.path.join(locale,'MCTutorial2_2_'+filename+'_AllLinksNet_allpages.gexf'))
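To check what ends up in a network file, the same steps can be tried on a toy example. The sketch below assumes that networkx is installed. The edges, levels and file name are made up, and the file is written to a temporary folder so that nothing in your project is overwritten.

```python
import os
import tempfile

import networkx as nx

# Made-up edges and category levels for illustration
edges = [("Page A", "Page B"), ("Page A", "Outside page")]
levels = {"Page A": 0, "Page B": 1}

G = nx.DiGraph()
G.add_edges_from(edges)
for node in G.nodes:
    if node in levels:
        G.nodes[node]["member_level"] = "Level " + str(levels[node])
    else:
        G.nodes[node]["member_level"] = "Not a member"

# Write the network to a temporary folder and read it back to check the result
out_path = os.path.join(tempfile.mkdtemp(), "example_network.gexf")
nx.write_gexf(G, out_path)
reloaded = nx.read_gexf(out_path)

print(reloaded.number_of_nodes(), reloaded.number_of_edges())
print(dict(reloaded.nodes(data="member_level")))
```

The `member_level` attribute is stored with each node in the file, so a tool that opens GEXF files can use it, for instance, to colour the nodes.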