# Mapping controversies script 3: Make two different networks based on in-text links to other Wikipedia pages

Wikipedia references knowledge claims in different ways. The first is the system we used in tutorial 2 to build a co-reference network. In this tutorial we will use what we call in-text references. These references are all "internal" to Wikipedia, meaning that they __only refer to other Wikipedia articles.__
<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549628832/Mapping%20controversies%202019/In-text_reference.jpg" title="Category:circumcision" style="width: 700px;" /> 
By in-text we simply mean that __we only include those references that appear in the main text of an article.__ If you want to include ALL "internal" links, you can use the script presented in tutorial 8.  
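
To make the distinction concrete, here is a minimal sketch of how links inside paragraph tags can be told apart from the other links on a page. The HTML snippet and the class name `ParagraphLinkParser` are made up for illustration, and the script in step 2 uses simple string splitting rather than a parser, so treat this as one possible approach and not as the method used in this workbook.

```python
from html.parser import HTMLParser


class ParagraphLinkParser(HTMLParser):
    """Collect the title attribute of every <a> tag, split by location."""

    def __init__(self):
        super().__init__()
        self.p_depth = 0          # > 0 while we are inside a <p> tag
        self.in_text_links = []   # links that appear in the main text
        self.other_links = []     # links elsewhere (menus, boxes, ...)

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depth += 1
        elif tag == "a":
            title = dict(attrs).get("title")
            if title is None:
                return
            if self.p_depth > 0:
                self.in_text_links.append(title)
            else:
                self.other_links.append(title)

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth > 0:
            self.p_depth -= 1


# Hypothetical HTML, loosely modelled on a Wikipedia article
sample_html = (
    '<div><a href="/wiki/Portal:Medicine" title="Portal:Medicine">Portal</a></div>'
    '<p>See <a href="/wiki/Foreskin" title="Foreskin">foreskin</a> and '
    '<a href="/wiki/Brit_milah" title="Brit milah">brit milah</a>.</p>'
)

parser = ParagraphLinkParser()
parser.feed(sample_html)
print(parser.in_text_links)  # links inside <p>...</p>
print(parser.other_links)    # links outside any paragraph
```

Only the two links inside the paragraph count as in-text references here; the portal link in the `<div>` does not.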

This script takes as input a JSON file with category members from Wikipedia (e.g. "cat_members_circumcision_depth_2.json") and builds two __directed__ networks: one with only the category members, connected by in-text references, and one with the category members plus all the pages they point to, connected by in-text references.

The network will look somewhat like this in structure:

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549628658/Mapping%20controversies%202019/InTextRefNet.jpg" title="Category:circumcision" style="width: 800px;" /> 
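
The difference between the two networks can be sketched with a toy example. The page titles and links below are invented; the variable names mirror the ones used in the script in step 2, but this is only an illustration of the idea in plain Python, without NetworkX.

```python
# Hypothetical harvest result: each category member maps to the pages
# that its main text links to
network = {
    "Circumcision": ["Foreskin", "Brit milah", "Penis"],
    "Brit milah": ["Circumcision", "Judaism"],
    "Foreskin": ["Circumcision"],
}

members = set(network)

# Every link becomes a directed edge (source, target)
all_edges = [
    (source, target)
    for source, targets in network.items()
    for target in targets
]

# The members-only network keeps an edge only if its target is also a member
membersonly_edges = [(s, t) for s, t in all_edges if t in members]

# Pages that are linked to but are not category members themselves
non_members = sorted({t for _, t in all_edges} - members)

print(len(all_edges), "edges in the all-pages network")
print(len(membersonly_edges), "edges in the members-only network")
print(non_members, "appear only in the all-pages network")
```

Because the edges are directed, "Circumcision" pointing to "Foreskin" and "Foreskin" pointing to "Circumcision" are two separate edges.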


## Step 1: Installing the right libraries
Libraries can be understood as pre-written pieces of code. This means that instead of writing many lines of code yourself in order to, for example, contact Wikipedia, you can do it with a single command.


__Note: in this workbook we will be using the wikipedia and networkx libraries. If you have already installed them once, there is no need to do it again. You may simply skip to step 2.__
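
If you are unsure whether the libraries are already installed, a quick check along the following lines may help. It is a small sketch that relies only on Python's standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(name):
    """Return True if a library with this import name can be found."""
    return importlib.util.find_spec(name) is not None


# The import names of the two libraries used in this workbook
for name in ("wikipedia", "networkx"):
    if is_installed(name):
        print(name + " is already installed")
    else:
        print(name + " is not installed yet; run the cell below to install it")
```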

In [1]:
# In this cell Jupyter checks whether you have the right libraries installed

import sys

try:  # First, Jupyter tries to import a library
    import wikipedia
    print("Wikipedia library has been imported")
except ImportError:  # If the library is missing, Jupyter tries to install it
    print("Wikipedia library not found. Installing...")
    # sys.executable makes sure the library is installed for the Python that runs this notebook
    !{sys.executable} -m pip install wikipedia
    try:  # ... and tries to import it again
        import wikipedia
        print("Wikipedia library has been imported")
    except ImportError:  # If the import still fails, a message is printed
        print("Something went wrong in the installation of the wikipedia library. Please check your internet connection and consult output from the installation below")
try:
    import networkx
    print("NetworkX library has been imported")
except ImportError:
    print("NetworkX library not found. Installing...")
    !{sys.executable} -m pip install networkx
    try:
        import networkx
        print("NetworkX library has been imported")
    except ImportError:
        print("Something went wrong in the installation of the NetworkX library. Please check your internet connection and consult output from the installation below")

        

Wikipedia library has been imported
NetworkX library has been imported


## Step 2: Make the in-text links networks

The next step is to make the networks. Here, you need to input the path to the JSON files you got from the MCTutorial1_Wikipedia_HarvestCatMembers_final script.
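
Judging from how the cell below reads the file, each entry is expected to carry at least a `title` and a `level` field. The following sketch checks this before you start a long harvest. The sample entries are invented, the helper name `missing_fields` is hypothetical, and your real file may well contain more fields than these two.

```python
import json

# Invented sample in the shape the script reads: a list of entries,
# each with at least "title" and "level"
sample = json.loads(
    '[{"title": "Circumcision", "level": 0}, {"title": "Brit milah", "level": 1}]'
)


def missing_fields(entries, required=("title", "level")):
    """Return the positions of entries that lack one of the required fields."""
    return [
        i for i, entry in enumerate(entries)
        if not all(key in entry for key in required)
    ]


print(missing_fields(sample))                     # every entry is complete
print(missing_fields(sample + [{"title": "X"}]))  # the added entry has no "level"
```

An empty list means every entry has the fields the script needs.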

If the JSON files are in the same directory as the scripts, you only need to enter a relative path, i.e. the name of the JSON file (e.g. cat_members_circumcision_depth_2).

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549444568/Mapping%20controversies%202019/Script_json_same_folder_in_text.jpg" title="Folder" style="width: 800px;" /> 
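
As an illustration of what a relative path means in practice, the sketch below turns a typed file name into a path and shows where Python will look for it. The helper name `to_json_path` is made up for this example; the script itself does something similar with plain string operations.

```python
from pathlib import Path


def to_json_path(name):
    """Turn user input such as ' cat_members_x ' into a path ending in .json."""
    name = name.strip()
    if not name.endswith(".json"):
        name = name + ".json"
    return Path(name)


path = to_json_path("cat_members_circumcision_depth_2")
print(path)            # the relative path: the name as typed, plus the extension
print(path.resolve())  # the absolute path, based on the current working directory
print(path.exists())   # True only if the file really is in this folder
```

If the last line prints `False`, the JSON file is not in the folder the notebook is running from, and you need to enter the full path instead.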

In order to run the script, click on the cell below and press "Run" in the menu.

In [None]:
import wikipedia
import networkx as nx
import json

cat_members_all = []
print("Enter the name of the category members JSON file you wish to use (e.g. cat_members_circumcision_depth_2). If you have multiple files, separate them with commas")
filename = input()

# Accept one or more comma-separated file names, with or without the .json extension.
# Spaces around the names are removed, so "a, b" works as well as "a,b".
names = [name.strip() for name in filename.split(",") if name.strip()]
for name in names:
    path = name if name.endswith(".json") else name + ".json"
    with open(path) as jsonfile:  # the file is closed automatically after the with block
        cat_members_all.extend(json.load(jsonfile))

# Name used in the output files: the input names without extension, joined by "_"
filename = "_".join(name[:-5] if name.endswith(".json") else name for name in names)

    

    
seen = []
network = {}
print("Harvesting in-text links from "+str(len(cat_members_all))+" wikipedia pages. This might take a while...")
print("")
count=1
for each in cat_members_all:
    title=each["title"]
    if count % 50 == 0:
        print("In text links harvested from "+str(count)+" pages out of "+str(len(cat_members_all))+". Continuing harvest...")
    if not title in seen:
        seen.append(title)
        try:
        
            page = wikipedia.page(title)
            
        except wikipedia.exceptions.DisambiguationError:
            print("Wikipedia thinks "+title+" is ambiguous (returns several candidate pages). Trying again with only the first letter capitalized")
            try:
                page = wikipedia.page(title.capitalize())
                print("Success! "+title+" is no longer ambiguous")
            except wikipedia.exceptions.DisambiguationError:
                print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Trying again with all lowercase letters")
                try:
                    page = wikipedia.page(title.lower())
                    print("Success! "+title+" is no longer ambiguous")
                except wikipedia.exceptions.DisambiguationError:
                    print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Skipping page...")
                    continue
        except wikipedia.exceptions.PageError:
            print("The page "+title+" could not be found. Skipping page...")
            continue

        except Exception:  # any other error; interrupting the kernel still stops the harvest
            print("The page "+title+" failed due to unknown reason. Skipping...")
            print("")
            continue
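        from html import unescape  # standard library; decodes entities such as &amp; and &#39;
        html = page.html()
        text_links = []
        # Everything before the first <p> is not main text, so the first chunk is skipped
        html = html.split('<p>')
        for p in html[1:]:
            p = p.split('</p>')[0]  # keep only the content of this paragraph
            links = p.split('<a href="')
            for l in links[1:]:
                if ' title="' in l:
                    # The title attribute holds the name of the linked page
                    l = l.split(' title="')[1]
                    l = l.split('"')[0]  # stop at the closing quote of the attribute
                    text_links.append(unescape(l))
        network.update({title:text_links})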
    count=count+1
    
print("All pages harvested...")
new_cat_members={}
for each in cat_members_all:
    new_cat_members[each["title"]]={"level":each["level"]}
    
membersonly_edges = []
all_edges = []
members = network.keys()
print("Calculating networks...")
print("")
for source in network:
    for target in network[source]:
        edge = (source,target)
        all_edges.append(edge)
        if target in members:
            membersonly_edges.append(edge)
print("Saving networks...")
print("")
G = nx.DiGraph()
G.add_edges_from(membersonly_edges)
nx.write_gexf(G, 'MCTutorial2_3_'+filename+'_InText_LinkNet_membersonly.gexf')

G = nx.DiGraph()
G.add_edges_from(all_edges)
for each in G.nodes:
    if each in members:
        G.nodes[each]['member_level'] = 'Level '+str(new_cat_members[each]["level"])
    else:
        G.nodes[each]['member_level'] = 'Not a member'
nx.write_gexf(G, 'MCTutorial2_3_'+filename+'_InText_LinkNet_allpages.gexf')
print("The script is done. You can find your network files by following these paths: ")
print("")
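import os
folder = os.getcwd()  # works on Windows, macOS and Linux, unlike the shell command pwd
print(os.path.join(folder, 'MCTutorial2_3_'+filename+'_InText_LinkNet_membersonly.gexf'))
print("")
print(os.path.join(folder, 'MCTutorial2_3_'+filename+'_InText_LinkNet_allpages.gexf'))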