# Mapping controversies script 3: Making a co-reference network from category members  

A co-reference network is based on pages' shared references. An edge is created between two pages if they cite the same reference, and that edge becomes stronger the more references the two pages share.
This script only looks at references that are listed under the section "References", not at the links to other Wikipedia articles found throughout the text. For a script that handles the latter, please consult tutorial 3.
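
As a rough illustration of how the strength of an edge can be derived from shared references, here is a minimal sketch. The function name `co_reference_weight` and the example URLs are made up for illustration. Unlike the script further down, this sketch removes duplicate references before counting, which is an assumption of the sketch and not necessarily how the script behaves.

```python
def co_reference_weight(refs_a, refs_b):
    """Return (overlap, overlap normalised by the smaller reference set).

    Hypothetical helper for illustration; duplicates are removed first.
    """
    set_a, set_b = set(refs_a), set(refs_b)
    overlap = len(set_a & set_b)              # number of shared references
    smallest = min(len(set_a), len(set_b))    # size of the smaller set
    norm = overlap / smallest if smallest else 0.0
    return overlap, norm


# Invented reference lists for two pages
page_a = ["https://example.org/1", "https://example.org/2", "https://example.org/3"]
page_b = ["https://example.org/2", "https://example.org/3",
          "https://example.org/4", "https://example.org/5"]

overlap, norm = co_reference_weight(page_a, page_b)
print(overlap, norm)
```

Here the two pages share two references, and the smaller page has three, so the normalised overlap is two thirds.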

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549394296/Mapping%20controversies%202019/Circumcision_references.jpg" title="Category:circumcision" style="width: 800px;" /> 

The network produced by the script is an __undirected__ network, and its structure will look roughly like this:

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549628658/Mapping%20controversies%202019/CoRefNet.jpg" title="Category:circumcision" style="width: 800px;" /> 
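
Undirected means that an edge between page A and page B is the same as an edge between page B and page A, so every pair of pages only needs to be compared once. The sketch below shows that idea using only the Python standard library. The page names and references are invented for illustration.

```python
from itertools import combinations

# Hypothetical pages and their references, for illustration only
pages = {
    "Page A": {"ref1", "ref2"},
    "Page B": {"ref2", "ref3"},
    "Page C": {"ref4"},
}

edges = []
# combinations() yields each unordered pair exactly once,
# so (A, B) is visited but (B, A) is not
for source, target in combinations(sorted(pages), 2):
    shared = pages[source] & pages[target]
    if shared:  # only connect pages that share at least one reference
        edges.append((source, target, len(shared)))

print(edges)
```

Page C shares no references with the other pages, so it ends up without any edges.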



## Step 1: Installing the right libraries
Libraries for Jupyter can be understood as pre-written pieces of code. This means that instead of writing many lines of code in order to, for example, connect to Wikipedia, you can do it with a single command.

__Note: in this workbook we will be using the wikipedia and networkx libraries. If you have already installed them once, there is no need to do it again. You may simply skip to step 2.__
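
If you are unsure whether the libraries are already installed, one way to check without importing them is sketched below. It uses `importlib.util.find_spec` from the Python standard library. The helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries used in this workbook
for name in ("wikipedia", "networkx"):
    print(name, "is installed" if is_installed(name) else "is missing")
```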

In [None]:
# In this cell Jupyter checks whether you have the right libraries installed 

try: # First, Jupyter tries to import the library
    import wikipedia
    print("Wikipedia library has been imported")
except ImportError: # If the import fails, install the library
    print("Wikipedia library not found. Installing...")
    !pip install wikipedia
    try: # ... and try to import it again
        import wikipedia
    except ImportError: # If it still fails, report the problem
        print("Something went wrong in the installation of the wikipedia library. Please check your internet connection and consult the output from the installation")
try: # The same steps are repeated for NetworkX
    import networkx
    print("NetworkX library has been imported")
except ImportError:
    print("NetworkX library not found. Installing...")
    !pip install networkx
    try:
        import networkx
    except ImportError:
        print("Something went wrong in the installation of the NetworkX library. Please check your internet connection and consult the output from the installation")

        

## Step 2: Make the network

The next step is to make the network. Here, you need to input the path to the JSON files you got from the MCTutorial2_2_Wikipedia_HarvestCatMembers_final script.

If the JSON files are in the same directory as the scripts, you only need to enter a relative path, i.e. the name of the JSON file (e.g. cat_members_circumcision_depth_2).

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549436096/Mapping%20controversies%202019/Script_json_same_folder.jpg" title="Folder" style="width: 800px;" /> 
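
The file names can be entered with or without the `.json` extension, and several files can be separated with commas. The sketch below shows one way such input could be turned into a list of paths. The function name `resolve_json_paths` is hypothetical, and stripping spaces around the commas is an assumption of this sketch.

```python
def resolve_json_paths(user_input):
    """Turn 'a, b.json' into ['a.json', 'b.json'] (illustrative helper)."""
    paths = []
    for name in user_input.split(","):
        name = name.strip()              # ignore spaces around the commas
        if not name:
            continue                     # skip empty entries such as "a,,b"
        if not name.endswith(".json"):
            name += ".json"              # add the extension if it is missing
        paths.append(name)
    return paths


print(resolve_json_paths("cat_members_circumcision_depth_2"))
print(resolve_json_paths("first_file, second_file.json"))
```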

In order to run the script, click on the cell below and press "Run" in the menu.

In [None]:
import json
import wikipedia
import networkx as nx
import csv

cat_members_all=[]
print("Enter the name of the category members JSON file you wish to build the co-reference network from (e.g. cat_members_circumcision_depth_2). If you have multiple files, separate them with a comma")
filename= input()
if "," in filename:

    for each in filename.split(","):
        each = each.strip()  # tolerate spaces after the commas
        if not each.endswith(".json"):
            path = each + ".json"
        else:
            path = each
        # The with-statement closes the file automatically
        with open(path) as jsonfile:
            cat_members = json.load(jsonfile)
        cat_members_all.extend(cat_members)
else:
    print(" ")


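    if not filename.endswith(".json"):
        path = filename + ".json"
    else:
        path = filename
        # Drop only the ".json" suffix, so paths containing other dots still work
        filename = filename[:-len(".json")]
    # The with-statement closes the file automatically
    with open(path) as jsonfile:
        cat_members_all = json.load(jsonfile)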

    
edges = []

cat_members_dict={}
cat_members_list=[]
for cat_member in cat_members_all:
    title=cat_member["title"]
    
    try:
        page = wikipedia.page(title)
    except wikipedia.exceptions.DisambiguationError:
        print("Wikipedia thinks "+title+" is ambiguous (returns several candidate pages). Trying again with only the first letter capitalized")
        try:
            page = wikipedia.page(title.capitalize())
            print("Success! "+title+" is no longer ambiguous")
        except wikipedia.exceptions.DisambiguationError:
            print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Trying again with all lowercase letters")
            try:
                page = wikipedia.page(title.lower())
                print("Success! "+title+" is no longer ambiguous")
            except wikipedia.exceptions.DisambiguationError:
                print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Skipping page...")
                continue
    except wikipedia.exceptions.PageError:
        print("The page "+title+" could not be found. Skipping page...")
        continue
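    except Exception as e:
        # Any other error: report it and skip the page, so that a missing or
        # previously loaded page is never used for this title
        print(e)
        continue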
        
    try:
        refs = page.references  # a KeyError here means the references could not be retrieved
        cat_members_dict[title]={"level":cat_member["level"], "references":refs}
        cat_members_list.append(title)

    except KeyError:
        print("Could not retrieve references for "+title+". Skipping page...")
        continue
print("Successfully retrieved references from "+str(len(cat_members_dict))+" out of "+str(len(cat_members_all))+" Wikipedia pages. Generating network...")

for i,source in enumerate(cat_members_list):
    source_refs = cat_members_dict[source]["references"]
    if len(source_refs)>0:
        for target in cat_members_list[i+1:]:
            if target==source:
                continue
            target_refs=cat_members_dict[target]["references"]
            if len(target_refs)>0:
                overlap = len(set(source_refs).intersection(target_refs))
                if overlap>0:
                    # Normalise by the length of the shorter reference list
                    norm_overlap_by_smallest = overlap / min(len(source_refs), len(target_refs))
                    edge = (source,target,{'overlap':overlap,'norm_overlap_by_smallest':norm_overlap_by_smallest})
                    edges.append(edge)
print("Network has been generated. Saving...")
G = nx.Graph()
G.add_edges_from(edges)
nx.write_gexf(G, 'MCTutorial2_4_'+filename+'_CoReferenceNetwork.gexf')
print('Network saved. You can find the network by following this path: ')
import os  # os.getcwd() works on every platform, unlike the shell command !pwd
print(os.path.join(os.getcwd(), 'MCTutorial2_4_'+filename+'_CoReferenceNetwork.gexf'))
print("")
print("Saving JSON file with all references...")
json_path='MCTutorial2_4_'+filename+"_references.json"

with open(json_path, 'w') as outfile:
    json.dump(cat_members_dict, outfile)

print("")
print("Saving CSV file with all references...")
print("")

headers=['Page_title','Reference']
csv_path='MCTutorial2_4_'+filename+"_references.csv"

# Open the file once and write the header followed by one row per (page, reference) pair
with open(csv_path, "w", newline='', encoding='utf-8') as f:
    wr = csv.writer(f, delimiter=",")
    wr.writerow(headers)
    for cat_member in cat_members_list:
        for ref in cat_members_dict[cat_member]["references"]:
            wr.writerow([cat_member, ref])

print("CSV file saved. The script is done!")