UK Data Service Webscraping seminar 13 May 2020 - More complex example


Working through the webinar, part 2 (a slightly more complex example)

We're going to scrape some data from the Charity Commission website.

In [15]:
# Import modules needed 

import requests # module for requesting urls
import os # module for performing operating system tasks
import pandas as pd # module for working with datasets
from IPython.display import IFrame # module for embedding web pages, documents etc
from bs4 import BeautifulSoup as soup # module for parsing web pages

Identifying the web address

We're going to use the Charity Commission for England and Wales' website to capture policy data: https://beta.charitycommission.gov.uk/

We're going to focus on just one charity for now, Oxfam; therefore the web address looks like this: https://beta.charitycommission.gov.uk/charity-details/?regId=202918&subId=0
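
Because the address differs between charities only in its query string, it can be assembled from a registration number. A minimal sketch, assuming that regId carries the charity's registration number and subId identifies a subsidiary (this reading of the parameters, and the helper name build_charity_url, are assumptions rather than anything documented by the site):

```python
from urllib.parse import urlencode

BASE_URL = "https://beta.charitycommission.gov.uk/charity-details/"

def build_charity_url(reg_id, sub_id=0):
    """Return the details-page address for one charity (hypothetical helper)."""
    # urlencode turns the dict into "regId=...&subId=..." and escapes the values.
    query = urlencode({"regId": reg_id, "subId": sub_id})
    return f"{BASE_URL}?{query}"

oxfam_url = build_charity_url(202918)
print(oxfam_url)
```

Called with 202918, the helper reproduces the Oxfam address used throughout this example.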


In [16]:
# IFrame embeds the web page in the Jupyter Notebook
IFrame("https://beta.charitycommission.gov.uk/charity-details/?regId=202918&subId=0", width="800", height="650")

Locating information

Policy data is located in the Documents tab under a heading called Policies, which lists the following:
Policies
Risk management
Investment
Safeguarding vulnerable beneficiaries
Conflicts of interest
Volunteer management
Complaints handling
Paying staff

In [17]:
#Requesting the webpage
#Now that we possess the necessary information, let's begin the process of scraping the web page.

url = "https://beta.charitycommission.gov.uk/charity-details/?regId=202918&subId=0"

response = requests.get(url, allow_redirects=True)
response.status_code

200

In [23]:
#Next we parse the webpage
soup_response = soup(response.text, "html.parser")
#soup_response.body # view HTML code

The next bit is fiddly: we need to look at the source HTML and work out where the content that we are interested in is located. The policies are contained within a set of <div></div> tags where the class attribute equals "pcg-charity-details__block col-lg-6". There are multiple sets of tags with this class, therefore we need to use the find_all() method.

In [24]:
sections = soup_response.find_all("div", class_="pcg-charity-details__block col-lg-6")
len(sections) # view how many sets of tags are returned

7
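
One subtlety of the lookup above is that passing a string containing a space to class_ asks Beautiful Soup to match the class attribute's exact text, so a <div> carrying the same classes in a different order would be missed. The sketch below, run on a made-up HTML fragment rather than the live page, compares it with a CSS selector, which matches each class independently:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; not taken from the live page.
html = """
<div class="pcg-charity-details__block col-lg-6"><h3>Policies</h3></div>
<div class="col-lg-6 pcg-charity-details__block"><h3>Reordered</h3></div>
<div class="pcg-charity-details__block col-lg-12"><h3>Wider block</h3></div>
"""
parsed = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the first div qualifies.
exact = parsed.find_all("div", class_="pcg-charity-details__block col-lg-6")

# CSS selector: any div carrying both classes, in any order.
both = parsed.select("div.pcg-charity-details__block.col-lg-6")

print(len(exact), len(both))
```

Whenever the classes always appear in the same order, the two approaches return the same sections; the selector is simply more tolerant of changes in the markup.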

In [25]:
#So we have multiple sets of tags and need to identify the right one

searchterm = "Policies" # search term identifying section containing list of policies

for location, section in enumerate(sections): # for each section and its position in the sections list:
    if searchterm in str(section): # if the search term exists in the section
        policy_location = location # store the list location of the correct section
        print(policy_location) # view the location of the policies in the list (i.e., is it the first element in the list?)

policy_section = sections[policy_location] # create a new variable containing the correct section
policy_section

#this bit of code has identified the part of the html that contains the policy info we 
#are interested in.

5


<div class="pcg-charity-details__block col-lg-6">
<h3>Policies</h3>
<span>Risk management</span>
<br/>
<span>Investment</span>
<br/>
<span>Safeguarding vulnerable beneficiaries</span>
<br/>
<span>Conflicts of interest</span>
<br/>
<span>Volunteer management</span>
<br/>
<span>Complaints handling</span>
<br/>
<span>Paying staff</span>
<br/>
</div>

Let's unpick the logic of the code above:

- We know the list of policies is contained in a section (<div>) where class_="pcg-charity-details__block col-lg-6".
- We find all sections where the class attribute equals "pcg-charity-details__block col-lg-6", and navigate to the correct one by evaluating whether it contains a relevant piece of text ("Policies").
- This process revealed that the list of policies was contained in the sixth section (remember: lists begin at position 0, so 5 identifies the sixth element of a list). If we knew that the list of policies was always contained in the sixth section we wouldn't need a search term, but this way is more robust to deviations in the structure and content of each charity's web page.
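
The search described above can also be wrapped in a small function that returns None when no section matches, instead of leaving policy_location undefined. This is a sketch with plain strings standing in for the parsed <div> sections; the function name find_section_index is hypothetical:

```python
def find_section_index(sections, searchterm):
    """Return the position of the first section containing searchterm, else None."""
    for position, section in enumerate(sections):
        # str() lets the same check work on plain strings and on parsed tags.
        if searchterm in str(section):
            return position
    return None

# Stand-in data: simplified strings, not the real page sections.
example_sections = [
    "<div><h3>Overview</h3></div>",
    "<div><h3>Policies</h3><span>Investment</span></div>",
]

print(find_section_index(example_sections, "Policies"))
print(find_section_index(example_sections, "Trustees"))
```

Checking the result for None before indexing into the list makes it easier to skip a charity whose page has no Policies section.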


In [26]:
#Now that we have the correct set of <div></div> tags, we need to extract the policy data from
#within the <span></span> tags.
policy_list = [] # define a blank list for storing the policy data
charity_name = "Oxfam" # define a variable for storing the charity's name

for tag in policy_section.find_all("span"): # for each set of span tags in the policy section
    policy = tag.text # extract the text from the tag
    observation = [charity_name, policy] # combine charity name and a policy
    policy_list.append(observation) # append the charity name and policy to the blank list
    
policy_list # view list of policies for the charity (long format)

[['Oxfam', 'Risk management'],
 ['Oxfam', 'Investment'],
 ['Oxfam', 'Safeguarding vulnerable beneficiaries'],
 ['Oxfam', 'Conflicts of interest'],
 ['Oxfam', 'Volunteer management'],
 ['Oxfam', 'Complaints handling'],
 ['Oxfam', 'Paying staff']]

Again, let's unpack the code above:

    We define a variable called policy_list which will store the extracted text; at this point the list is empty. We also define a variable for storing the charity's name (charity_name).
    Then, for each set of <span></span> tags in the policy_section variable, we extract the text from within the tags. We also define a variable called observation, which stores a list of two values: the charity's name and a given policy. Finally, we append the observation to the list.
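
Repeating the charity's name in every observation may look redundant for a single charity, but it is what allows rows from several charities to be stacked in one list. A sketch of that idea, assuming the policy text has already been extracted for each charity; the function to_long_format and the second charity are hypothetical:

```python
def to_long_format(policies_by_charity):
    """Flatten {charity: [policy, ...]} into a list of [charity, policy] rows."""
    rows = []
    for charity_name, policies in policies_by_charity.items():
        for policy in policies:
            # strip() removes stray whitespace left over from the HTML.
            rows.append([charity_name, policy.strip()])
    return rows

scraped = {
    "Oxfam": ["Risk management", "Investment"],
    # Placeholder entry, invented purely to show a second charity.
    "Example Charity": ["Complaints handling"],
}
long_rows = to_long_format(scraped)
print(long_rows)
```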



In [27]:
#Saving the web-scraped info by converting it into a pandas DataFrame
policy_data = pd.DataFrame(policy_list, columns=["charity_name", "policy"])
policy_data

   charity_name                                 policy
0         Oxfam                        Risk management
1         Oxfam                             Investment
2         Oxfam  Safeguarding vulnerable beneficiaries
3         Oxfam                  Conflicts of interest
4         Oxfam                   Volunteer management
5         Oxfam                    Complaints handling
6         Oxfam                           Paying staff
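
To keep the scraped data beyond the notebook session, the DataFrame can be written out as CSV. A minimal sketch, rebuilding a two-row version of the table so that the block runs on its own; the file name in the final comment is only an example:

```python
import pandas as pd

rows = [["Oxfam", "Risk management"], ["Oxfam", "Investment"]]
policy_data = pd.DataFrame(rows, columns=["charity_name", "policy"])

# With no path given, to_csv returns the CSV text instead of writing a file.
# index=False leaves out the row numbers, so no extra unnamed column is stored.
csv_text = policy_data.to_csv(index=False)
print(csv_text)

# To write a file instead (example name):
# policy_data.to_csv("oxfam_policies.csv", index=False)
```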
