# Mapping controversies script 1: Harvest Wikipedia category members  

Wikipedia articles are arranged in categories. As an example, the __[Wikipedia Category of Circumcision](https://en.wikipedia.org/wiki/Category:Circumcision)__ encompasses an array of articles such as __[Religious male circumcision](https://en.wikipedia.org/wiki/Religious_male_circumcision)__ and __[Circumcision and HIV](https://en.wikipedia.org/wiki/Circumcision_and_HIV)__ as well as subcategories such as __[Female genital mutilation](https://en.wikipedia.org/wiki/Category:Female_genital_mutilation)__.

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549392104/Mapping%20controversies%202019/Category_circumcision.jpg" title="Category:circumcision" style="width: 800px;" /> 



This Jupyter notebook will help you harvest the category members of the topic you are working on. The information will later be used to generate a co-reference network.

## Step 1: Installing the right libraries 
Libraries in Python can be understood as preprogrammed pieces of script. This means that instead of writing many lines of code to, for example, connect to Wikipedia, you can do it with a single command.

In order to run the installation, click on the cell below and press "Run" in the menu. 


In [2]:
# In this cell Jupyter checks whether you have the right libraries installed to carry out the harvest of data from Wikipedia

try: #First, Jupyter tries to import a library
    import wikipedia
    print("Wikipedia library has been imported")
except ImportError: #If the library is missing, it will try to install it
    print("Wikipedia library not found. Installing...")
    !pip install wikipedia
    try: #... and try to import it again
        import wikipedia
    except ImportError: #If the import fails again, a message is printed
        print("Something went wrong in the installation of the wikipedia library. Please check your internet connection and consult the output from the installation below")
try: #The same steps are repeated for the second library
    import wikipediaapi
    print("Wikipedia api library has been imported")
except ImportError:
    print("Wikipedia api library not found. Installing...")
    !pip install wikipedia-api
    try:
        import wikipediaapi
    except ImportError:
        print("Something went wrong in the installation of the wikipedia api library. Please check your internet connection and consult the output from the installation below")

        

Wikipedia library has been imported
Wikipedia api library has been imported
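
The cell above imports each library and installs it if the import fails. As a minimal sketch of a related check, the standard library function `importlib.util.find_spec` can report whether a library is available without importing it. The helper name `is_installed` is hypothetical and not part of the original notebook.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the import system can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Report the status of the two libraries used in this notebook
for name in ("wikipedia", "wikipediaapi"):
    status = "available" if is_installed(name) else "missing"
    print(name + ": " + status)
```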


## Step 2: Harvest the data from Wikipedia
In this step, we will harvest the data from Wikipedia. When you run the cell (Ctrl+Enter or _Run_ in the menu), you will be asked to enter several pieces of information. It is important that you do not set the depth to more than 2, as you might end up harvesting a substantial part of Wikipedia (which will take a long time, will probably crash, and might get the AAU IP address banned from Wikipedia).
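
The script itself does not check the depth you enter. As a hedged sketch, a small helper could reject values outside the recommended range before any harvesting starts. The function name `parse_depth` and the parameter `max_depth` are assumptions made for illustration and are not part of the original script.

```python
def parse_depth(raw, max_depth=2):
    """Convert user input to an integer depth and reject values outside 0..max_depth."""
    depth = int(raw.strip())  # raises ValueError if the input is not a whole number
    if not 0 <= depth <= max_depth:
        raise ValueError("Depth must be between 0 and " + str(max_depth))
    return depth

print(parse_depth("1"))
```

With such a check, entering `3` would stop the script with an error message instead of starting a very large harvest.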

Depth refers to how many steps from the starting point you wish to include. In the case of Circumcision, depth 0 would include only the articles listed _directly_ in the category, whereas depth 1 would also include the articles in its subcategories (e.g. Female genital mutilation) and depth 2 the articles in the subcategories of the subcategories (e.g. [Activists against female genital mutilation](https://en.wikipedia.org/wiki/Category:Activists_against_female_genital_mutilation)).

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1549393775/Mapping%20controversies%202019/Depth.jpg" title="Category:circumcision" style="width: 800px;" /> 
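
To make the idea of depth concrete, here is a minimal sketch that walks a small made-up category tree instead of Wikipedia. The data in `TOY_WIKI` and the function name `collect_members` are hypothetical; only the logic of stopping at a maximum depth mirrors what the harvesting script does.

```python
# Made-up category tree: each category lists its articles and its subcategories
TOY_WIKI = {
    "A": {"articles": ["a1", "a2"], "subcategories": ["B"]},
    "B": {"articles": ["b1"], "subcategories": ["C"]},
    "C": {"articles": ["c1"], "subcategories": []},
}

def collect_members(category, max_depth, level=0):
    """Return (title, level) pairs for all articles within max_depth steps of the category."""
    node = TOY_WIKI[category]
    members = [(title, level) for title in node["articles"]]
    if level < max_depth:  # only descend while the depth limit has not been reached
        for sub in node["subcategories"]:
            members.extend(collect_members(sub, max_depth, level + 1))
    return members

for depth in (0, 1, 2):
    print(depth, collect_members("A", depth))
```

At depth 0 only `a1` and `a2` are returned, depth 1 adds `b1`, and depth 2 adds `c1`.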


The script outputs two files containing the same information: a JSON file and a CSV file. When the script is done, the most convenient way to view the result is to open the CSV file.
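
As a small illustration of what "the same information" means, the sketch below writes two made-up rows to a JSON file and a CSV file in a temporary folder, reads both back, and compares them. The rows and file names are invented for the example; the column names `Title` and `Level` follow the headers used by the script.

```python
import csv
import json
import os
import tempfile

# Two made-up rows in the same shape the script produces
rows = [{"title": "Example article", "level": 0},
        {"title": "Another article", "level": 1}]

with tempfile.TemporaryDirectory() as folder:
    json_path = os.path.join(folder, "members.json")
    csv_path = os.path.join(folder, "members.csv")

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "Level"])
        writer.writerows([r["title"], r["level"]] for r in rows)

    # Read both files back into the same structure
    with open(json_path, encoding="utf-8") as f:
        from_json = json.load(f)
    with open(csv_path, newline="", encoding="utf-8") as f:
        from_csv = [{"title": r["Title"], "level": int(r["Level"])}
                    for r in csv.DictReader(f)]

print(from_json == from_csv)
```

Note that the CSV stores every value as text, so the level has to be converted back to an integer when reading.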

In order to run the script, click on the cell below and press "Run" in the menu.

In [2]:
#imports two Python libraries for interaction with the Wikipedia API
import wikipediaapi
import wikipedia
#imports two standard libraries for writing the JSON and CSV files
import json
import csv

#asks the user to input a language version of Wikipedia and tells the API to use that version
print('Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):')

input_lan = input()
if not input_lan:
    lan="en"
else:
    lan=input_lan
print(" ")
wiki_wiki = wikipediaapi.Wikipedia(lan)

# asks the user to input the wikipedia category
print('Enter the name of the Wikipedia category you wish to scrape for member articles:')
cat_name = input()
#cat_name=cat_name.lower()
print("")
# asks the user to input a depth level
print('Enter the depth level between 0 and 2 you wish to query for categories:')
depth = input()
print("")
depth=int(depth)

print("Logging category members. This might take a while...")
print("")

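def log_categorymembers(categorymembers, level=0):
    #goes through all members of a category and logs the articles
    for c in categorymembers.values():
        if c.ns == 0: #namespace 0 is an ordinary article
            entry = {'title':c.title,'level':level}
            cat_members.append(entry)
        if c.ns == 14 and level < depth: #namespace 14 is a (sub)category
            #subcategories are searched in the same way until the depth level is reached
            log_categorymembers(c.categorymembers, level + 1)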
            
            
cat_members=[]

cat = wiki_wiki.page("Category:"+cat_name)
log_categorymembers(cat.categorymembers)

print("There are "+ str(len(cat_members))+" members in the category "+cat_name+" depth "+str(depth)+". Saving...")
print("")
#replaces spaces in the category name with underscores for use in the file name
new_word=cat_name.replace(" ","_")
filename='category_members_'+new_word+'_depth_'+str(depth)
json_path = filename+'.json'
csv_path=filename+'.csv'

with open(json_path, 'w') as outfile:
    json.dump(cat_members, outfile)

headers=['Title','Level']

#writes the header row and all members to the CSV file in one go
with open(csv_path,"w", newline='',encoding='utf-8') as f:
    wr = csv.writer(f, delimiter=",")
    wr.writerow(headers)
    for each in cat_members:
        wr.writerow([each["title"], each["level"]])

import os
locale=os.getcwd() #the folder the notebook is running in
print('Done harvesting. Find your files in the folder of this script:')
print(os.path.join(locale,csv_path))
print(os.path.join(locale,json_path))

Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):
en
 
Enter the name of the Wikipedia category you wish to scrape for member articles:
2019–20_coronavirus_pandemic

Enter the depth level between 0 and 2 you wish to query for categories:
0

Logging category members. This might take a while...

There are 14 members in the category 2019–20_coronavirus_pandemic depth 0. Saving...

Done harvesting. Find your files in the folder of this script:
/c/Users/ago/Documents/Jupyter/Mapping controversies 2020/category_members_2019–20_coronavirus_pandemic_depth_0.csv
/c/Users/ago/Documents/Jupyter/Mapping controversies 2020/category_members_2019–20_coronavirus_pandemic_depth_0.json
