# Mapping controversies script 6: Extract text from Wikipedia articles 

This script extracts all text from the pages you enter. The text is saved as a .csv file, which works with both spreadsheet editors and the tool Cortext, which we will use later in the course.



## Step 1: Installing the right libraries
Libraries for Jupyter can be understood as pre-written pieces of code. Instead of writing many lines of code to, for example, connect to Wikipedia, you can do it with a single command.


__Note: in this workbook we will be using the wikipediaapi library. If you have already installed it once, there is no need to do it again; you may skip to step 2.__
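
If you are unsure whether the library is already installed, one way to check without importing it is the standard library's `importlib.util.find_spec`. The following is a minimal sketch; the helper name `is_installed` is made up for illustration and is not part of the course scripts.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the module, without importing it."""
    return importlib.util.find_spec(module_name) is not None

# "wikipediaapi" is the import name; the pip package is called "wikipedia-api"
print("wikipediaapi installed:", is_installed("wikipediaapi"))
```

If this prints `True`, the installation cell can be skipped.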

In [None]:
try:
    import wikipediaapi
    print("Wikipedia api library has been imported")
except ImportError:
    # Only a failed import should trigger the installation
    print("Wikipedia api library not found. Installing...")
    !pip install wikipedia-api
    
    try:
        import wikipediaapi
    except ImportError:
        print("Something went wrong in the installation of the wikipedia api library. Please check your internet connection and consult the output from the installation above")


## Step 2A: Harvest the text from a single page and/or pages from a category members json file

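To run the script, click on the cell below and press "Run" in the menu. The pages to harvest are listed in the `pages` variable at the top of the cell; edit that list to choose other articles.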

In [1]:
import wikipediaapi
import csv
import json

pages=["2020 coronavirus pandemic in Afghanistan","2020 coronavirus pandemic in Austria","2020 coronavirus pandemic in Belgium","2020 coronavirus pandemic in Cyprus","2020 coronavirus pandemic in Denmark","2020 coronavirus pandemic in the Netherlands","2020 coronavirus pandemic in Finland","2020 coronavirus pandemic in Germany","2020 coronavirus pandemic in Greece","2020 coronavirus pandemic in Hungary","2020 coronavirus pandemic in Iran","2020 coronavirus pandemic in Israel","2020 coronavirus pandemic in Italy","2020 coronavirus pandemic in Norway","2020 coronavirus pandemic in Spain","2020 coronavirus pandemic in Sweden","2020 coronavirus pandemic in Switzerland","2020 coronavirus pandemic in Turkey","2020 coronavirus pandemic in the United Kingdom","2020 coronavirus pandemic in Serbia","2020 coronavirus pandemic in Bosnia and Herzegovina"]

print('Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):')

input_lan = input()
if not input_lan:
    lan="en"
else:
    lan=input_lan
wiki_wiki = wikipediaapi.Wikipedia(
        language=lan,
        extract_format=wikipediaapi.ExtractFormat.WIKI
)

print("Please enter a prefix for the output file: " )
filename=input()
filename=filename+"_TextFromArticles.csv"

print("Collecting text from "+str(len(pages))+" pages...")
csv_path=filename

headers=["Page", "Page text"]
# Open the file once in write mode so that re-running the cell
# does not append a second header and duplicate rows
with open(csv_path,"w", encoding='utf-8', newline='') as csvfile:
    wr = csv.writer(csvfile, delimiter=';')
    wr.writerow(headers)

    for page in pages:
        p_wiki = wiki_wiki.page(page)
        # The text is lower-cased before it is saved
        page_text=p_wiki.text.lower()
        wr.writerow([page,page_text])


print('CSV file saved. You can find the file by following this path: ')
locale=!pwd
print(locale[0]+"/"+filename)

Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):
en
Please enter a prefix for the output file: 
wiki_corona_EU
Collecting text from 21 pages...
CSV file saved. You can find the file by following this path: 
/c/Users/ago/Documents/Jupyter/Mapping controversies 2020/wiki_corona_EU_TextFromArticles.csv
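
To check the result outside a spreadsheet editor, the file can be read back with Python's `csv` module. The sketch below is an illustration rather than part of the course script: the helper `read_articles` and the sample file are made up, and the sample is written to a temporary folder so the example runs on its own. It assumes the same layout as the output above (semicolon delimiter, one header row, UTF-8). Long articles can exceed the `csv` module's default field size limit, so the limit is raised first.

```python
import csv
import os
import tempfile

# Long article texts can exceed the csv module's default field size limit
csv.field_size_limit(10_000_000)

def read_articles(csv_path):
    """Return a list of (page, text) tuples from a semicolon-delimited file."""
    with open(csv_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter=";")
        next(reader, None)  # skip the header row
        return [(row[0], row[1]) for row in reader]

# Made-up sample file in the same layout as the script's output
sample_path = os.path.join(tempfile.mkdtemp(), "sample_TextFromArticles.csv")
with open(sample_path, "w", encoding="utf-8", newline="") as f:
    wr = csv.writer(f, delimiter=";")
    wr.writerow(["Page", "Page text"])
    wr.writerow(["Example page", "first line\nsecond line; with a semicolon"])

articles = read_articles(sample_path)
for page, text in articles:
    print(page, "-", len(text), "characters")
```

Line breaks and semicolons inside an article's text are preserved because the `csv` writer puts such fields in quotes.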


## Step 2B - Harvest text from multiple category members files
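
The cell below derives each category name from the file name, which is expected to follow the pattern `category_members_<category>_depth_<n>`. As a minimal sketch of that naming convention, the hypothetical helper `category_from_filename` (not part of the course script) takes the words between `category_members_` and `_depth_`:

```python
def category_from_filename(filename):
    """Extract the category name from category_members_<name>_depth_<n>[.json]."""
    stem = filename.strip()
    if stem.endswith(".json"):
        stem = stem[:-len(".json")]
    parts = stem.split("_")
    # Words after "category_members" and before "depth" form the category name;
    # if there is no "depth" part, all remaining words are used
    end = parts.index("depth") if "depth" in parts[2:] else len(parts)
    return " ".join(parts[2:end])

print(category_from_filename("category_members_circumcision_depth_2"))
```

Category names with several words work the same way: underscores in the file name become spaces in the category name.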

In [None]:
import wikipediaapi
import csv
import json

print('Enter the desired language version of wikipedia (e.g. "en","da","fr",etc.) or leave blank to use default (english):')

input_lan = input()
if not input_lan:
    lan="en"
else:
    lan=input_lan
wiki_wiki = wikipediaapi.Wikipedia(
        language=lan,
        extract_format=wikipediaapi.ExtractFormat.WIKI
)

cat_dict = {}

print("Enter the names of the category members json files you wish to extract text from (e.g. category_members_circumcision_depth_2). Separate multiple files with commas")
filename= input()
filename=filename.strip()

for each in filename.split(","):
    each=each.strip()
    if not each:
        continue
    # Accept file names given with or without the .json extension
    if each.endswith(".json"):
        path=each
        each=each[:-len(".json")]
    else:
        path=each+".json"
    # The category name is everything between "category_members_" and "_depth_N";
    # if the name contains no "depth" part, use all remaining words
    parts=each.split("_")
    index_=parts.index("depth") if "depth" in parts[2:] else len(parts)
    cat_name=" ".join(parts[2:index_]).strip()
    with open(path) as jsonfile:
        cat_members = json.load(jsonfile)
    cat_dict[cat_name]=[every['title'] for every in cat_members]

filename="TextFromWikiarticlesMultipleCategories.csv"

csv_path=filename

page_dict={}
pages=[]
for cat in cat_dict:
    for page in cat_dict[cat]:
        pages.append(page)
for page in pages:
    page_dict[page]={"cats":[]}

for cat in cat_dict:
    for page in cat_dict[cat]:
        page_dict[page]["cats"].append(cat)

print("Collecting text from "+str(len(pages))+" pages...")

for page in page_dict:
    try:
        p_wiki = wiki_wiki.page(page)
        page_text=p_wiki.text.lower()
        page_dict[page]["text"]=page_text
    except Exception:
        print('skipping '+page)

csv_list=[]


headers=["page","text","unique_to_cat"]

for cat in cat_dict:
    headers.append(cat)
    
csv_list.append(headers)
    
for page in page_dict:
    if "text" in page_dict[page]:
        entry=[page, page_dict[page]["text"]]
        if len(page_dict[page]['cats']) == 1:
            entry.append(page_dict[page]['cats'][0])
        else:
            entry.append('none')
        for each in headers[3:]: 
            if each in page_dict[page]["cats"]:
                entry.append("yes")
            else:
                entry.append("no")
        csv_list.append(entry)

with open(csv_path,"w", encoding='utf-8', newline='\n') as tsvfile:
    wr = csv.writer(tsvfile, delimiter=';')   

    wr.writerows(csv_list)


print('CSV file saved. You can find the file by following this path: ')
locale=!pwd
print(locale[0]+"/"+filename)
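
The output of Step 2B has one row per page, a `unique_to_cat` column that names the category when the page belongs to exactly one, and one yes/no column per category. The sketch below reproduces that layout on made-up category data; the helper `membership_rows` is hypothetical and the text column is left out for brevity.

```python
def membership_rows(cat_dict):
    """Build a header row and one row per page with category membership flags."""
    headers = ["page", "unique_to_cat"] + list(cat_dict)
    # Collect the categories each page belongs to
    page_cats = {}
    for cat, pages in cat_dict.items():
        for page in pages:
            page_cats.setdefault(page, []).append(cat)
    rows = [headers]
    for page, cats in page_cats.items():
        # Name the category only when the page belongs to exactly one
        unique = cats[0] if len(cats) == 1 else "none"
        flags = ["yes" if cat in cats else "no" for cat in cat_dict]
        rows.append([page, unique] + flags)
    return rows

# Made-up example: "Page B" belongs to both categories
example = {
    "vaccination": ["Page A", "Page B"],
    "public health": ["Page B", "Page C"],
}
for row in membership_rows(example):
    print(";".join(row))
```

In a spreadsheet editor, filtering on `unique_to_cat` then separates pages specific to one category from pages shared between several.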