# Practical exercise: Australian charities

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

In this practical we attempt to scrape information on the organisational status of Australian charities.

### Aims

This practical has one aim:
1. Successfully scrape information relating to Australian charities' organisational status e.g., does it still operate? When was it registered?

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is the general approach for scraping data from a web page?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is the general approach for scraping data from a web page?

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
1. Request the web page using its web address.
2. Parse the structure of the web page so your programming language can work with its contents.
3. Extract the information we are interested in.
4. Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
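
As a rough illustration, the four "do" steps can be written as a skeleton in which each step is a function passed in by the caller. This is a sketch only: the function names (`scrape`, `request_page`, `parse_page`, `extract_rows`), the placeholder address and the stand-in steps are all hypothetical, chosen so the skeleton runs without a network connection.

```python
import csv

def scrape(url, request_page, parse_page, extract_rows, outfile):
    """Run the four 'do' steps in order; each step is supplied as a function."""
    html = request_page(url)       # 1. request the web page using its address
    parsed = parse_page(html)      # 2. parse the structure of the page
    rows = extract_rows(parsed)    # 3. extract the information of interest
    with open(outfile, "w", newline="") as f:  # 4. write it to a file
        csv.writer(f).writerows(rows)
    return rows

# Stand-in steps (hypothetical) so the skeleton runs offline
rows = scrape(
    url="https://example.org/charity/123",             # placeholder address
    request_page=lambda url: "2012-12-03|Registered",  # pretend page contents
    parse_page=lambda html: html.split("|"),           # pretend parser
    extract_rows=lambda parsed: [tuple(parsed)],       # one (date, status) row
    outfile="pseudo-code-demo.csv",
)
print(rows)
```

In the practical that follows, `requests` plays the role of the first step and `BeautifulSoup` the second.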

## Details

This is an example drawn from my (Diarmuid) own research. I am interested in the impact of COVID-19 on the foundation and dissolution of charities across a number of countries. To study these phenomena I need information on the organisational status of individual charities (e.g., foundation and dissolution dates). The Australian charity regulator provides high-quality, open data on the organisational status of charities, with the exception of dissolution status. Therefore I wrote a script that takes a list of charity ids and scrapes information on organisational status from the regulator's website.

Your task is to execute and complete sections of this web scraping script.

It's a bit more complicated than what we've encountered so far, but gives you a sense of what web scraping for social research is really like.

## Practical 1

### Identifying the web address

An example of a charity's web page can be viewed at the following web address: https://www.acnc.gov.au/charity/3b7aa8b31249837c15657331aeb54821

### Locating information

The information we need is located in the *History* tab underneath the **Registration status history** heading.

#### Visually inspecting the underlying HTML code

**TASK**: inspect the web page of our example charity (https://www.acnc.gov.au/charity/3b7aa8b31249837c15657331aeb54821) and insert the relevant HTML into the cell below.

### Requesting the web page

**TASK**: import the `requests`, `csv` and `os` modules into this Python session (`datetime` and `BeautifulSoup` are already listed for you).

In [None]:
# Import modules

import os
import requests
import csv
from datetime import datetime # module for working with dates and time
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

**TASK**: fill in the blanks (e.g., # INSERT URL #) with the necessary code

In [None]:
# Define the URL where the web page can be accessed

url = "https://www.acnc.gov.au/charity/3b7aa8b31249837c15657331aeb54821" # INSERT URL #


# Request the web page from the URL

response = requests.get(url, allow_redirects = False, timeout = 5) # REQUEST THE URL #


# Check if page was requested successfully #

response.status_code
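
The status code is a number, so it can help to translate it into words. The sketch below uses the standard library's `http.HTTPStatus` to do so; the helper name `describe_status` and the wording of the outcomes are our own. Because the request above sets `allow_redirects = False`, a redirect is reported as a 3xx code rather than being followed.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable reading of a standard HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. 200 -> "OK"
    if 200 <= code < 300:
        outcome = "success: the page was returned"
    elif 300 <= code < 400:
        outcome = "redirect: the page lives at another address"
    elif 400 <= code < 500:
        outcome = "client error: check the web address"
    elif 500 <= code < 600:
        outcome = "server error: try again later"
    else:
        outcome = "informational"
    return "{} {} ({})".format(code, phrase, outcome)

for code in (200, 301, 404, 503):
    print(describe_status(code))
```

A value of 200 is what we hope to see before moving on to parsing.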

### Parsing the web page

**TASK**: use the `soup()` function (our alias for `BeautifulSoup`) to parse the requested web page.

In [None]:
# Extract the contents of the web page from the response

soup_response = soup(response.text, "html.parser") # Parse the text as a Beautiful Soup object

**QUESTION**: under which tag(s) is the *Registration status history* information found?

**TASK**: find the section containing the *Registration status history* information and save it to a variable; view the contents of this variable

In [None]:
orgdetails = soup_response.find("div", class_="field field-name-acnc-node-charity-status-history field-type-ds field-label-hidden")

orgdetails

**TASK**: change the `orgtable` and `orgdetails` variable names below in order to match your choices from earlier, and execute the code

**QUESTION**: explain what the `find_all("tr")` method is doing, and how it fits with the preceding methods.

In [None]:
orgtable = orgdetails.find("div", class_="view-content").find("tbody").find_all("tr")
orgtable
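
To see how chained `find()` calls narrow the search before `find_all()` collects the rows, here is a small sketch on hand-written HTML. The snippet, its class name `history` and its values are invented for illustration; the real ACNC page is larger and uses different class names. Note that `find()` returns `None` when nothing matches, so it is worth checking before chaining further calls.

```python
from bs4 import BeautifulSoup as soup

# Hand-written HTML for illustration only; the real page differs
html = """
<div class="history">
  <div class="view-content">
    <table>
      <thead><tr><th>Effective date</th><th>Status</th></tr></thead>
      <tbody>
        <tr><td>03/12/2012</td><td>Registered</td></tr>
        <tr><td>01/07/2020</td><td>Revoked</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

page = soup(html, "html.parser")
section = page.find("div", class_="history")  # first matching <div>, or None

if section is None:  # nothing matched, so there is nothing to chain from
    rows = []
else:
    # each find() narrows the search; find_all() returns every <tr> in <tbody>
    rows = section.find("div", class_="view-content").find("tbody").find_all("tr")

print(len(rows))
for row in rows:
    print([td.text.strip() for td in row.find_all("td")])
```

The header row is not included because it sits in `<thead>`, outside the `<tbody>` we searched.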

### Extracting information

**TASK**: extract organisation status information by inserting code below (HINT: this information is contained in the second column in each row)

In [None]:
varnames = ["status_date", "status"]

with open("aus-charity-details.csv", "w", newline="") as f: # open the file once, before the loop
    writer = csv.writer(f)
    writer.writerow(varnames) # write the header row
    for row in orgtable:
        columns = row.find_all("td") # extract all columns in each row
        date = columns[0].text.strip() # organisation status date
        status = columns[1].text.strip() # INSERT CODE HERE #
        observation = date, status
        writer.writerow(observation) # write one row per entry in the status history
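
The `datetime` module imported earlier can turn the scraped date text into a standard format, which makes later sorting and comparison easier. The sketch below is an illustration only: the helper name `to_iso_date` is hypothetical, and the `%d/%m/%Y` format is an assumption that you should check against the dates you actually see on the page.

```python
from datetime import datetime

def to_iso_date(text, fmt="%d/%m/%Y"):
    """Convert a date string to ISO format (YYYY-MM-DD); return None if it does not match fmt."""
    try:
        return datetime.strptime(text.strip(), fmt).date().isoformat()
    except ValueError:  # the text is not a date in the expected format
        return None

print(to_iso_date("03/12/2012"))
print(to_iso_date("not a date"))
```

Returning `None` for unparseable text, rather than raising an error, lets a long scrape carry on and leaves the gaps to be inspected afterwards.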

### Saving results from the scrape

**TASK**: list the contents of the folder where you saved the results of the scrape

**TASK**: open the CSV file where you saved the results of the scrape &mdash; does it look as expected?

In [None]:
# Check presence of file in the current working directory

os.listdir()

In [None]:
# Open file and read (import) its contents

with open("aus-charity-details.csv", "r") as f:
    data = f.read()
    
print(data)    
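
Reading the file as one block of text is fine for a quick look. If you want to work with the values, the standard library's `csv.DictReader` returns each row as a dictionary keyed by the header. The sketch below writes its own small file (`example-status.csv`, with made-up values) so that it runs independently of the scrape.

```python
import csv

# Write a small example file so this cell runs on its own (values are made up)
with open("example-status.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["status_date", "status"])
    writer.writerow(["03/12/2012", "Registered"])

# Read the file back as a list of dictionaries, one per row
with open("example-status.csv", "r", newline="") as f:
    records = list(csv.DictReader(f))

print(records)
```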

**FINAL TASK**: execute the code below

In [None]:
if 'name' in globals():
    print("{}, good effort on working through this practical!".format(name))
else:
    print("You never told me your name at the beginning but you are still deserving of praise.")

## Conclusion

Congratulations on working through this practical: you have now (at least to some degree) conducted a successful web scrape of real data. I'm sure you can imagine the immense potential of this method for collecting frequently updated social data in an automated and reliable manner.

If you want to see a more complicated version of this practical, then work through Appendix A below.

If you are confident in your abilities so far, then start implementing these techniques on your own web scraping idea by completing the following notebook: *ncrm-web-scraping-practical-own-idea-2021-05-17.ipynb*.

## Appendix A

The following is a snippet of code from a longer programming script that captures data for Australian charities. While more detailed and complicated than what you've encountered so far, the same web scraping logic is applied. 

Execute the commands below to produce the results; then see if you can understand what most of the code does &mdash; don't worry if it is a bit imposing, the code represents a lot of time, effort and errors on my part over the past year!

In [None]:
from datetime import datetime as dt
from bs4 import BeautifulSoup as soup
from time import sleep
import requests
import os
import argparse
import json
import random
import csv
import pandas as pd

In [None]:
# Define functions #

# Download ACNC web pages of charities 

def webpage_download(webid, abn, **args):
    """
        Downloads a charity's web page from the ACNC website, which can be parsed at a later date.

        Takes two mandatory arguments:
            - website id of charity i.e., its unique identifier on the regulator's website
            - abn of charity, which is its unique organisational id

        Dependencies:
            - webid_download | webid_download_from_file 

        Issues:
            - does not deal with cases where a charity has more than one web page (e.g., lots of trustees) [SOLVED]   
    """

    print("Downloading Australian Charity Web Pages")
    print("\r")

    ddate = dt.now().strftime("%Y-%m-%d") # get today's date

    
    # Create folders

    directories = ["webpages", "logs"]

    for directory in directories:
        if not os.path.isdir(directory):
            os.mkdir(directory)
        else:
            #print("{} already exists".format(directory))
            continue 

    
    # Request web page

    session = requests.Session()

    webadd = "https://www.acnc.gov.au/charity/" + str(webid) + "?page=0"
    response = session.get(webadd)

    # Capture metadata

    mdata = dict(response.headers)
    mdata["webid"] = str(webid)
    mdata["abn"] = str(abn)
    mdata["url"] = str(webadd)
    
    # Parse web page

    if response.status_code==200:
        html_org = response.text # Get the text elements of the page.
        soup_org = soup(html_org, "html.parser") # Parse as HTML page


        # Find additional pages i.e., when a charity has more than 16 trustees

        if soup_org.find("li", class_="pager-last"):
            pagination = soup_org.find("li", class_="pager-last").find("a").get("href")
            numpages = int(pagination.split("=")[-1]) + 1 # page numbers start at 0; taking the text after "=" also handles ten or more pages
        else:
            numpages = 1 


        # Save results to file

        pagenum = 1
        outfile = "./webpages/aus-charity-" + str(abn) + "-page-" + str(pagenum) + "-" + ddate + ".txt"

        try: # potential for encoding issues, therefore need to catch
            with open(outfile, "w", encoding = "utf-8") as f:
                f.write(html_org)

            print("Downloaded web page of charity: {}".format(abn))    
            print("\r")
            print("Web page file is here: '{}'".format(outfile))
        except Exception as e:
            print("Could not write to file (potential encoding issue)")
            mdata["write_to_file"] = str(e)

        if numpages > 1: # request the remaining pages
                 
            for i in range(1, numpages):
                webadd = "https://www.acnc.gov.au/charity/" + str(webid) + "?page=" + str(i)
                response = session.get(webadd)

                if response.status_code==200:
                    html_org = response.text # Get the text elements of the page.
                    soup_org = soup(html_org, "html.parser") # Parse as HTML page

                    
                    # Save results to file

                    pagenum = i + 1
                    outfile = "./webpages/aus-charity-" + str(abn) + "-page-" + str(pagenum) + "-" + ddate + ".txt"
                    
                    try: # potential for encoding issues, therefore need to catch
                        with open(outfile, "w", encoding = "utf-8") as f:
                            f.write(html_org)

                        print("Downloaded web page of charity: {}".format(abn))    
                        print("\r")
                        print("Web page file is here: '{}'".format(outfile))
                    except Exception as e:
                        print("Could not write to file (potential encoding issue)")
                        mdata["write_to_file"] = str(e)

                else:
                    print("\r")
                    print("Could not download web page of charity: {}".format(abn))    


    else:
        print("\r")
        print("Could not download web page of charity: {}".format(abn))

    sleep(2)    

    return mdata    


# Download ACNC web pages of charities - from file 

def webpage_download_from_file(infile, prop=1.0, **args):
    """
        Takes a CSV file containing webids for Australian charities and
        downloads a charity's web page from the ACNC website, which can be parsed at a later date.

        Takes one mandatory and one optional argument:
            - CSV file containing a list of abns and webids for Australian charities [mandatory]
            - Proportion of charities to download web pages for; default is all (1.0) [optional]

        Dependencies:
            - webid_download | webid_download_from_file 

        Issues:
            - does not deal with cases where a charity has more than one web page (e.g., lots of trustees) [SOLVED]
    """

    ddate = dt.now().strftime("%Y-%m-%d") # get today's date


    # Read in data

    df = pd.read_csv(infile, encoding = "ISO-8859-1", index_col=False) # import file
    prop = float(prop)
    df = df.sample(frac=prop) # take random sample (default is to keep all rows in dataframe)

    
    # Create list and file to store metadata of requests

    mfile = "./logs/aus-webpages-metadata-" + ddate + ".json"
    mlist = []
    

    # Request web pages

    for row in df.itertuples():
        webid = getattr(row, "webid")
        abn = getattr(row, "abn")
        mdata = webpage_download(webid, abn)
        mlist.append(mdata)

    # Write metadata to file

    with open(mfile, "w") as f:
        json.dump(mlist, f) 

    print("\r")
    print("Finished downloading web pages for charities in file: {}".format(infile))
    print("Check log file for metadata about the download: {}".format(mfile))



# History Data #

def history(source, **args):
    """
        Takes a charity's webpage (.txt file) downloaded from the ACNC website and
        extracts the history of the organisation:
            - registration
            - enforcement
            - subtype (i.e., charitable purpose)

        Takes one mandatory argument:
            - A directory with .txt files containing HTML code of a charity's ACNC web page

        Dependencies:
            - webpage_download | webpage_download_from_file 

        Issues:
            - duplicates the information for the final charity in the loop  [SOLVED]     
    """

    # Create folders

    directories = ["history", "logs"]

    for directory in directories:
        if not os.path.isdir(directory):
            os.mkdir(directory)
        else:
            #print("{} already exists".format(directory))
            continue 

    ddate = dt.now().strftime("%Y-%m-%d") # get today's date
    

    # Define output files

    enffile = "./history/aus-enforcement-" + ddate + ".csv"
    subfile = "./history/aus-subtype-" + ddate + ".csv"
    regfile = "./history/aus-registration-" + ddate + ".csv"
    logfile = "./logs/aus-history-metadata-" + ddate + ".csv"

   
    # Define variable names for the output files
    
    evarnames = ["abn", "enforcement", "enforcement_date", "summary", "variation", "variation_date", "report", "note"]
    rvarnames = ["abn", "status_date", "status", "note"]
    svarnames = ["abn", "purpose", "start_date", "end_date", "note"]
    lvarnames = ["abn", "enforcement_history", "subtype_history", "registration_history"]


    # Write headers to the output files

    with open(enffile, "w", newline="") as f:
        writer = csv.writer(f, evarnames)
        writer.writerow(evarnames)

    with open(regfile, "w", newline="") as f:
        writer = csv.writer(f, rvarnames)
        writer.writerow(rvarnames)  

    with open(subfile, "w", newline="") as f:
        writer = csv.writer(f, svarnames)
        writer.writerow(svarnames)

    with open(logfile, "w", newline="") as f:
        writer = csv.writer(f, lvarnames)
        writer.writerow(lvarnames)    
    

    # Read data

    for file in os.listdir(source):
        if file.endswith(".txt"):
            abn = file[12:23] # extract abn from file name
            f = os.path.join(source, file)
            print("Opening {} of charity {}".format(f, abn))
            with open(f, "r", encoding = "utf-8") as f:
                data = f.read()
                soup_org = soup(data, "html.parser") # Parse the text as a BS object.
            

            # Extract specific pieces of information: registration, enforcement, charitable purpose

            # Enforcement

            #print("Extracting enforcement history of charity: {}".format(abn))

            enfdetails = soup_org.find("div", class_="field field-name-acnc-node-charity-compliance-history field-type-ds field-label-hidden")      
            """
                Groups have an enforcement section on their webpage; they do not for registration or subtype.
            """

            if enfdetails.find("div", class_="view-empty"): # If there is no enforcement history
                enforcement_history = 0
                with open(enffile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "No enforcement history"
                    writer = csv.writer(f)
                    writer.writerow(row)

            elif enfdetails.find("div", class_="view-content"):
                enforcement_history = 1
                enfcontent = enfdetails.find("div", class_="view-content")
                enftable = enfcontent.find("tbody").find_all("tr")
               
                for row in enftable:
                    td_list = row.find_all("td")
                    enftype = td_list[0].text.strip() # Type of enforcement
                    enfdate = td_list[1].text.strip() # Date of enforcement
                    enfsummary = td_list[2].text.strip() # Text summary of enforcement
                    enfvar = td_list[3].text.strip() # Variation in enforcement
                    enfvardate = td_list[4].text.strip() # Date of variation in enforcement
                    enfrep = td_list[5].find("a").get("href") # Link to enforcement report
                    note = "NULL"
                    row = abn, enftype, enfdate, enfsummary, enfvar, enfvardate, enfrep, note
                    print(row)  
                    with open(enffile, "a", newline="") as f:
                        writer = csv.writer(f)
                        writer.writerow(row)

            else: # Couldn't find enforcement details
                enforcement_history = -9
                with open(enffile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "Could not find enforcement information on web page"
                    writer = csv.writer(f)
                    writer.writerow(row)

            
            # Registration and revocation

            #print("Extracting registration and revocation history of charity: {}".format(abn))

            revdetails = soup_org.find("div", class_="field field-name-acnc-node-charity-status-history field-type-ds field-label-hidden")

            if soup_org.find("div", class_="group-info field-group-div"): # If charity is part of a group
                registration_history = -8
                with open(regfile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "Group charity"
                    writer = csv.writer(f)
                    writer.writerow(row)
                # no `continue` here, so the purposes section below still runs for this charity

            elif revdetails.find("div", class_="view-empty"): # If there is no registration history
                registration_history = 0
                with open(regfile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "No status information found"
                    writer = csv.writer(f)
                    writer.writerow(row)

            elif revdetails.find("div", class_="view-content"): # if there is a registration history
                registration_history = 1
                revcontent = revdetails.find("div", class_="view-content")
                revtable = revcontent.find("tbody").find_all("tr")
                
                for row in revtable:
                    td_list = row.find_all("td")
                    # Get relevant tds and write to output file
                    revdate = td_list[0].text.strip() # Effective date
                    revstatus = td_list[1].text.strip() # Status
                    note = "NULL"
                    row = abn, revdate, revstatus, note
                    with open(regfile, "a", newline="") as f:
                        writer = csv.writer(f)
                        writer.writerow(row)

            else: # Could not find registration and revocation details
                registration_history = -9
                with open(regfile, "a", newline="") as f: # write to the registration file, with one value per column in rvarnames
                    row = abn, "NULL", "NULL", "Could not find registration and revocation information on web page"
                    writer = csv.writer(f)
                    writer.writerow(row)


            # Purposes

            #print("Extracting charitable purpose history of charity: {}".format(abn))

            subdetails = soup_org.find("div", class_="field field-name-acnc-node-charity-subtype-history field-type-ds field-label-hidden")

            if soup_org.find("div", class_="group-info field-group-div"): # If charity is part of a group
                subtype_history = -8
                with open(subfile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "NULL", "Group charity"
                    writer = csv.writer(f)
                    writer.writerow(row)
                # no `continue` here, so the log file below still records this charity

            elif subdetails.find("div", class_="view-empty"): # If there is no subtype history
                subtype_history = 0
                with open(subfile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "NULL", "No subtype history"
                    writer = csv.writer(f)
                    writer.writerow(row)

            elif subdetails.find("div", class_="view-content"):
                subtype_history = 1
                subcontent = subdetails.find("div", class_="view-content")
                subtable = subcontent.find("tbody").find_all("tr")
               
                for row in subtable:
                    td_list = row.find_all("td")
                    subtype = td_list[0].text.strip() # Type of purpose
                    sdate = td_list[1].text.strip() # Start date of purpose
                    edate = td_list[2].text.strip() # End date of purpose
                    if edate == "—":
                        edate = edate.replace("—", "NULL")
                    note = "NULL"
                    row = abn, subtype, sdate, edate, note 
                    with open(subfile, "a", newline="") as f:
                        writer = csv.writer(f)
                        writer.writerow(row)

            else: # Couldn't find purpose details
                subtype_history = -9
                with open(subfile, "a", newline="") as f:
                    row = abn, "NULL", "NULL", "NULL", "Could not find subtype information on web page"
                    writer = csv.writer(f)
                    writer.writerow(row)

    
            # Write metadata to logfile

            with open(logfile, "a", newline="") as f:
                row = abn, enforcement_history, subtype_history, registration_history
                writer = csv.writer(f)
                writer.writerow(row)

    print("\r")
    print("Finished extracting history data from charity web pages found in: {}".format(source))
    print("Check log file for metadata about the extraction: {}".format(logfile))
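
The number of pages is read from the "last page" link. As a hedged alternative to slicing the link text, the sketch below uses the standard library's `urllib.parse` to read the `page` query parameter. The helper name `count_pages` and the example links are hypothetical; it assumes the link carries a `page` parameter, as the request addresses built in `webpage_download` do.

```python
from urllib.parse import urlparse, parse_qs

def count_pages(last_page_href):
    """Return the number of pages implied by the 'last page' link (pages are numbered from 0)."""
    query = parse_qs(urlparse(last_page_href).query)  # e.g. {"page": ["3"]}
    last_index = int(query.get("page", ["0"])[0])     # default to page 0 if absent
    return last_index + 1

print(count_pages("/charity/abc123?page=3"))
print(count_pages("/charity/abc123?page=12"))
```

The first call prints 4 and the second prints 13; reading the named parameter keeps working when the page index has two digits.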

In [None]:
# Execute functions #

webpage_download_from_file("./data/aus-webids-master.csv", prop=.10)
history("./webpages/")
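
The two functions communicate through file names: `webpage_download` encodes the abn, page number and date in each file name, and `history` recovers the abn by slicing characters 12 to 23. The sketch below mirrors that naming pattern and shows a regular-expression alternative to slicing. The helper names and the abn are made up for illustration.

```python
import re

def build_filename(abn, pagenum, ddate):
    """Mirror the naming pattern used by webpage_download."""
    return "aus-charity-" + str(abn) + "-page-" + str(pagenum) + "-" + ddate + ".txt"

def parse_filename(filename):
    """Recover abn, page number and download date from a file name, or None if it does not match."""
    pattern = r"aus-charity-(\d+)-page-(\d+)-(\d{4}-\d{2}-\d{2})\.txt"
    match = re.fullmatch(pattern, filename)
    if match is None:
        return None
    abn, pagenum, ddate = match.groups()
    return abn, int(pagenum), ddate

name = build_filename("12345678901", 2, "2021-05-15")  # made-up abn
print(name)
print(parse_filename(name))
print(name[12:23])  # the slice used in history() gives the same abn
```

Slicing is shorter, but it silently returns the wrong text if the naming pattern ever changes; the pattern match returns `None` instead.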

In [None]:
# View downloaded data #

os.listdir("history")

In [None]:
registration = pd.read_csv("./history/aus-registration-" + dt.now().strftime("%Y-%m-%d") + ".csv", index_col = False) # file name carries the date history() was run
registration

In [None]:
purposes = pd.read_csv("./history/aus-subtype-" + dt.now().strftime("%Y-%m-%d") + ".csv", index_col = False) # file name carries the date history() was run
purposes

In [None]:
enforcement = pd.read_csv("./history/aus-enforcement-" + dt.now().strftime("%Y-%m-%d") + ".csv", index_col = False) # file name carries the date history() was run
enforcement
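
Because `history` stamps each output file with the date it was run, the `history` folder can fill up with several dated versions. One possible way to pick the most recent is sketched below; the helper name `latest_file` is hypothetical, and the demonstration uses empty, made-up files in a temporary folder.

```python
import os
import tempfile

def latest_file(folder, prefix):
    """Return the path of the most recent CSV in folder whose name starts with prefix, or None."""
    candidates = sorted(name for name in os.listdir(folder)
                        if name.startswith(prefix) and name.endswith(".csv"))
    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as text
    return os.path.join(folder, candidates[-1]) if candidates else None

# Demonstration in a temporary folder with empty, made-up files
with tempfile.TemporaryDirectory() as folder:
    for ddate in ("2021-05-15", "2021-06-01", "2021-04-30"):
        open(os.path.join(folder, "aus-registration-" + ddate + ".csv"), "w").close()
    newest = latest_file(folder, "aus-registration-")
    print(os.path.basename(newest))
```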

--END OF FILE--