# Practical exercise: Australian charities

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

In this practical we attempt to scrape information on the organisational status of Australian charities.

### Aims

This practical has one aim:
1. Successfully scrape information relating to Australian charities' organisational status (e.g., does a charity still operate? When was it registered?).

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is the general approach for scraping data from a web page?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is the general approach for scraping data from a web page?

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
1. Request the web page using its web address.
2. Parse the structure of the web page so your programming language can work with its contents.
3. Extract the information we are interested in.
4. Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
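The four "do" steps above can be sketched as four small Python functions. This is only an illustrative outline, not the script used later in this practical: the function names and the sample HTML are made up for the example, and step 1 is defined but not called so that the sketch runs without an internet connection.

```python
import csv
import io

from bs4 import BeautifulSoup


def request_page(url):
    """Step 1: request the web page using its web address."""
    import requests  # imported here so steps 2-4 can be tried offline

    response = requests.get(url, timeout=5)
    response.raise_for_status()  # stop early if the request failed
    return response.text


def parse_page(html):
    """Step 2: parse the page so Python can navigate its structure."""
    return BeautifulSoup(html, "html.parser")


def extract_heading(page):
    """Step 3: extract the information of interest (here, the first <h1>)."""
    heading = page.find("h1")
    return heading.get_text(strip=True) if heading else None


def write_result(value, fileobj):
    """Step 4: write the information to a file for future use."""
    writer = csv.writer(fileobj)
    writer.writerow(["heading"])
    writer.writerow([value])


# Hypothetical page, used in place of step 1 so the sketch runs offline
sample_html = "<html><body><h1> Example charity </h1><p>Some text</p></body></html>"

page = parse_page(sample_html)
heading = extract_heading(page)

output = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
write_result(heading, output)
print(output.getvalue())
```

Keeping each step in its own function mirrors the pseudo-code and makes it easier to test one step at a time.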

## Details

This is an example drawn from my (Diarmuid) own research. I am interested in the impact of COVID-19 on the foundation and dissolution of charities across a number of countries. To study these phenomena I need information on individual charities: their foundation and dissolution dates, and their organisational status. The Australian charity regulator provides high quality, open data on the organisational status of charities, with the exception of dissolution status. Therefore I wrote a script that takes a list of charity ids and scrapes information on organisational status from the regulator's website.

Your task is to execute and complete sections of this web scraping script.

It's a bit more complicated than what we've encountered so far, but gives you a sense of what web scraping for social research is really like.

## Practical 1

### Identifying the web address

An example of a charity's web page can be viewed at the following web address: https://www.acnc.gov.au/charity/3b7aa8b31249837c15657331aeb54821
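
Since the script described earlier works from a list of charity ids, it helps to see how an id might be turned into a web address. The sketch below assumes that every charity page follows the same pattern as the single example above (a base address followed by the id); that pattern is inferred from this one address, and `build_charity_url` is a hypothetical helper name.

```python
# Base address inferred from the example page; assumed to hold for other charities
BASE_URL = "https://www.acnc.gov.au/charity/"


def build_charity_url(charity_id):
    """Return the assumed web address of a charity's page, given its id."""
    return BASE_URL + charity_id.strip()


# A list with one id, taken from the example web address
charity_ids = ["3b7aa8b31249837c15657331aeb54821"]

urls = [build_charity_url(charity_id) for charity_id in charity_ids]
print(urls[0])
```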

### Locating information

The information we need is located in the *History* tab underneath the **Registration status history** heading.

#### Visually inspecting the underlying HTML code

**TASK**: inspect the web page of our example charity (https://www.acnc.gov.au/charity/3b7aa8b31249837c15657331aeb54821) and insert the relevant HTML into the cell below.

### Requesting the web page

**TASK**: import the `requests`, `csv` and `os` modules into this Python session (`datetime` and `BeautifulSoup` are already listed for you).

In [None]:
# Import modules

import os
import requests
import csv
from datetime import datetime # module for working with dates and time
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

**TASK**: fill in the blanks (e.g., # INSERT URL #) with the necessary code

In [None]:
# Define the URL where the web page can be accessed

url = "https://www.acnc.gov.au/charity/3b7aa8b31249837c15657331aeb54821" # INSERT URL #


# Request the web page from the URL

response = requests.get(url, allow_redirects=False, timeout=5) # REQUEST THE URL #


# Check if page was requested successfully #

response.status_code
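
The status code is just an integer, so it can help to translate it into words. The sketch below uses Python's built-in `http.HTTPStatus` to do so; `describe_status` is a hypothetical helper written for this illustration, and it raises a `ValueError` for codes that `HTTPStatus` does not recognise. Note that with `allow_redirects=False` a redirect is reported (e.g., 301) rather than followed.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, human-readable description of an HTTP status code."""
    status = HTTPStatus(status_code)  # look up the standard phrase for this code
    if 200 <= status_code < 300:
        outcome = "success"
    elif 300 <= status_code < 400:
        outcome = "redirect (not followed when allow_redirects=False)"
    else:
        outcome = "failure"
    return "{} {}: {}".format(status_code, status.phrase, outcome)


for code in (200, 301, 404):
    print(describe_status(code))
```

In the notebook you could call `describe_status(response.status_code)` after requesting the page.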

### Parsing the web page

**TASK**: use `soup()` (the alias we gave `BeautifulSoup` when importing it) to parse the requested web page.

In [None]:
# Extract the contents of the web page from the response

soup_response = soup(response.text, "html.parser") # Parse the text as a Beautiful Soup object

**QUESTION**: under which tag(s) is the *Registration status history* information found?

**TASK**: find the section containing the *Registration status history* information and save it to a variable; view the contents of this variable

In [None]:
orgdetails = soup_response.find("div", class_="field field-name-acnc-node-charity-status-history field-type-ds field-label-hidden")

orgdetails

**TASK**: change the `orgtable` and `orgdetails` variable names below in order to match your choices from earlier, and execute the code

**QUESTION**: explain what the `find_all("tr")` method is doing, and how it fits with the preceding methods.

In [None]:
orgtable = orgdetails.find("div", class_="view-content").find("tbody").find_all("tr")
orgtable
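
To see how chained `find()` calls and a final `find_all()` behave, it can be useful to try them on a small piece of HTML first. The table below is entirely made up (the dates, statuses and variable names are hypothetical) and is only loosely modelled on a table nested inside a `<div>`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: a table nested inside a <div>
toy_html = """
<div class="view-content">
  <table>
    <thead><tr><th>Effective date</th><th>Status</th></tr></thead>
    <tbody>
      <tr><td>1 Jan 2015</td><td>Registered</td></tr>
      <tr><td>1 Jun 2019</td><td>Revoked</td></tr>
    </tbody>
  </table>
</div>
"""

toy_page = BeautifulSoup(toy_html, "html.parser")

# find() returns the first matching tag; each call searches inside the previous result
toy_body = toy_page.find("div", class_="view-content").find("tbody")

# find_all() returns every matching tag: one item per row in the table body
toy_rows = toy_body.find_all("tr")

print(len(toy_rows))
for row in toy_rows:
    print([cell.get_text(strip=True) for cell in row.find_all("td")])
```

If `find()` does not locate a tag it returns `None`, so a chain of calls like this fails with an `AttributeError` when any step in the chain finds nothing.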

### Extracting information

**TASK**: extract organisation status information by inserting code below (HINT: this information is contained in the second column in each row)

In [None]:
varnames = ["status_date", "status"]

# Open the file once, before the loop: mode "w" empties the file every time it is opened
with open("aus-charity-details.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(varnames) # write the column names once
    for row in orgtable:
        columns = row.find_all("td") # extract all columns in each row
        date = columns[0].text.strip() # organisation status date
        status = columns[1].text.strip() # INSERT CODE HERE #
        writer.writerow([date, status]) # one line per status change
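
A file opened in `"w"` mode is emptied each time it is opened, so where the `open()` call sits relative to a loop determines how many rows survive. The sketch below illustrates this with two made-up observations written to a temporary file; the data and variable names are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical observations: (status date, status)
rows = [("1 Jan 2015", "Registered"), ("1 Jun 2019", "Revoked")]
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Opening inside the loop: each open(..., "w") empties the file first
for row in rows:
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(row)
with open(path, newline="") as f:
    rows_kept_reopening = list(csv.reader(f))

# Opening once: every row is written to the same open file
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
with open(path, newline="") as f:
    rows_kept_single_open = list(csv.reader(f))

print(len(rows_kept_reopening), len(rows_kept_single_open))
```

Only the last observation survives in the first version, whereas both are kept in the second.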

### Saving results from the scrape

**TASK**: list the contents of the folder where you saved the results of the scrape

**TASK**: open the CSV file where you saved the results of the scrape &mdash; does it look as expected?

In [None]:
# Check presence of file in the current working directory

os.listdir()

In [None]:
# Open file and read (import) its contents

with open("aus-charity-details.csv", "r") as f:
    data = f.read()
    
print(data)    
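
Reading the file as one block of text is fine for a quick check. If you want to work with individual values, the `csv` module can also read each row into a dictionary keyed by column name. The sketch below uses made-up file contents held in memory (via `io.StringIO`) rather than the real file, so the values shown are hypothetical.

```python
import csv
import io

# Hypothetical file contents, using the same column names as the practical
example_file = io.StringIO("status_date,status\n1 Jan 2015,Registered\n")

reader = csv.DictReader(example_file)  # treats the first row as column names
records = list(reader)

print(reader.fieldnames)
print(records[0]["status"])
```

With the real file you would pass the object returned by `open("aus-charity-details.csv", newline="")` to `csv.DictReader` instead.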

**FINAL TASK**: execute the code below

In [None]:
if 'name' in globals():
    print("{}, good effort on working through this practical!".format(name))
else:
    print("You never told me your name at the beginning but you are still deserving of praise.")

## Conclusion

Congratulations on working through this practical: you have now (at least to some degree) conducted a successful web scrape of real data. I'm sure you can imagine the immense potential of this method for collecting frequently updated social data in an automated and reliable manner.

If you need help completing this practical then you can view the version containing solutions: *ncrm-web-scraping-practical-aus-charities-solution-2021-05-17.ipynb*.

If you are confident in your abilities so far, then start implementing these techniques on your own web scraping idea by completing the following notebook: *ncrm-web-scraping-practical-own-idea-2021-05-17.ipynb*.
