# Web Scraping

Extracting text from websites requires web scraping, which is relatively simple to do in Python with a library called Beautiful Soup.

**NOTE:** The optional first part of this notebook assumes that you're already familiar with using APIs as shown here: https://github.com/Australian-Text-Analytics-Platform/open-australia-api/blob/main/api.ipynb

You don't need to know how to use APIs to do any web scraping. We only use the information from that notebook to choose which webpages we will scrape.

## The Task

Open Australia have saved parliamentary debates on their website. You can find the latest ones here: <a href="https://www.openaustralia.org.au/debates/">https://www.openaustralia.org.au/debates/</a>

This notebook retrieves one of their debate webpages and scrapes the text from it.

<div class="alert alert-block alert-success">
<b>Skills:</b>
<ul>    
<li> Website scraping </li>
</ul>    
<br>
<b>Skill Level:</b> Beginner/Intermediate
</div>

In [None]:
# Before we begin, let's make sure that we install all the requirements that we need
import sys
!{sys.executable} -m pip install -r requirements.txt

## Getting the Webpages to Scrape (Optional)

<div class="alert alert-block alert-danger">
<b>Before Running the Next Block of Code</b> 

You will need an API key to run the next block of code. You can find out how here: <a href="https://github.com/Australian-Text-Analytics-Platform/open-australia-api/blob/main/api.ipynb">API Notebook</a>
</div>

 
If you're not interested in using APIs, don't worry. You can skip running the following cell and run the rest of the notebook. We are simply using the Open Australia API because their website hosts parliamentary speeches and we are aiming to scrape the text from their webpages.

In [None]:
import requests

# Define constants at the very beginning
OA_API = 'http://www.openaustralia.org/api/'
API_KEY = None  # add your key here as a string, e.g. 'RPLDbrHE9cPoEn2MIfQWfRcA' (with the quotes)
OUTPUT_FORMAT = 'js'

# Define a dictionary of parameters as arguments to the function
params = dict()
params['key'] = API_KEY
params['output'] = OUTPUT_FORMAT

# The function we are using is 'getRepresentatives' so we can fetch the politician's member_id
my_function = 'getRepresentatives'

# The full endpoint is the string: OA_API+my_function. In Python you can concatenate strings with '+' 
response = requests.get(OA_API+my_function, params=params)
result = response.json()

print('NUMBER OF PARLIAMENTARIANS:', len(result))

# Loop through the results to find the relevant information
for politician in result:
    if politician['last_name'] == 'Bandt':
        print('INFO ON BANDT:')
        print(politician)
        bandt = politician

# Use the function getHansard
my_function = 'getHansard'
my_params = dict(params)  # copy the old params, such as the API key and the output type
my_params['order'] = 'd'  # d for date; r for relevance; p for person

# Bandt
my_params['person'] = bandt['person_id']
response = requests.get(OA_API+my_function, params=my_params)
hansard_bandt = response.json()
hansard_bandt

## Without Using the Open Australia API

If you don't already have an API_KEY from Open Australia, then you can simply get the required 'hansard_bandt' data by running the cell below.

In [None]:
import json

# We saved the required 'hansard_bandt' data in a file called hansard_bandt.json that you can read.
FILE = 'hansard_bandt.json'
with open(FILE) as f:
    hansard_bandt = json.load(f)

# Print the output
hansard_bandt

### Inspecting Your Data

It's always a good idea to inspect your data. In this notebook you can do this by simply printing out the variables that you've defined.

In [None]:
# If you're unsure about what kind of variable you have, you can always print out its type
type(hansard_bandt)

dict

<div class="alert alert-block alert-warning">
<b>Data Structure: dict</b> 

A dictionary (dict) is a data type that stores key-value pairs, so you can retrieve a value directly by its key.

You can learn more about this data type here: <a href="https://docs.python.org/3/tutorial/datastructures.html#dictionaries">Dictionaries</a>
</div>
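As a small illustration of key-value lookups, the sketch below builds a made-up dictionary shaped roughly like one row of the data. The values are shortened placeholders for illustration, not real output.

```python
# A made-up row with the same kind of nesting as the real data
# (values are shortened placeholders for illustration only)
row = {
    'hdate': '2021-10-27',
    'body': 'Finally, after eight years...',
    'speaker': {'first_name': 'Adam', 'last_name': 'Bandt'},
}

# Look up a value by its key
print(row['hdate'])

# Values can themselves be dictionaries, so lookups can be chained
print(row['speaker']['last_name'])

# .get() returns a default instead of raising a KeyError for a missing key
print(row.get('party', 'unknown'))

# Loop over the key-value pairs
for key, value in row.items():
    print(key, '->', value)
```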


In [None]:
# Inspect the dictionary structure
hansard_bandt.keys()

dict_keys(['info', 'searchdescription', 'rows'])

In [None]:
# Print the first 3 rows
print(hansard_bandt['rows'][:3])

# Printing out the rows shows us that 'listurl' is where we will find the webpages to scrape the debates
# However, it is a partial url and we need to define the prefix:
URL_BASE = 'https://www.openaustralia.org'

[{'gid': '2021-10-27.125.1', 'hdate': '2021-10-27', 'htype': '12', 'major': '1', 'section_id': '780809', 'subsection_id': '780810', 'relevance': 28, 'speaker_id': '600', 'hpos': '268', 'body': "Finally, after eight years, a piece of legislation that might actually do something good for renewables. Why did it take eight years? Could it be because the energy minister, whos in charge of this legislation&#8212;but doesn't have the courage to come and sit here at the dispatch box, so he leaves it up to his junior to come in to fly the flag&#8212;earned his political stripes campaigning...", 'listurl': '/debates/?id=2021-10-27.116.2&amp;s=speaker%3A10734#g125.1', 'speaker': {'member_id': '600', 'title': '', 'first_name': 'Adam', 'last_name': 'Bandt', 'house': '1', 'constituency': 'Melbourne', 'party': 'Australian Greens', 'person_id': '10734', 'url': '/mp/?m=600'}, 'parent': {'body': 'Bills: Offshore Electricity Infrastructure Bill 2021, Offshore Electricity Infrastructure (Regulatory Levies

### Debate Snippets

Snippets of the whole debate are stored under the key 'body' in each row that's been printed. We can use these snippets to narrow down which debate we're interested in.

In [None]:
# Define key_words to look for in the debate snippet
key_words = ['pollution', 'climate', 'fossil-fuel']
# Define a variable to save the list of debates we've narrowed down
speeches = list()

# Traverse the data and keep the debates whose snippet contains all of the key_words
for row in hansard_bandt['rows']:
    if all(word in row['body'].lower() for word in key_words):
        # Save the url in the variable 'speeches'
        speeches.append(URL_BASE+row['listurl'])
        # Print the row as we find them
        print(row)

{'gid': '2021-10-25.18.2', 'hdate': '2021-10-25', 'htype': '12', 'major': '1', 'section_id': '779732', 'subsection_id': '779751', 'relevance': 35, 'speaker_id': '600', 'hpos': '44', 'body': 'I move: That this bill be now read a second time. With our coal and gas exports, Australia is the worlds third-biggest exporter of fossil-fuel pollution after Russia and Saudi Arabia, who are coincidentally our only allies in the upcoming climate negotiations in Glasgow. Four-fifths of the coal we extract, we export overseas. This is Australias biggest contribution to the climate emergency...', 'listurl': '/debates/?id=2021-10-25.18.1&amp;s=speaker%3A10734#g18.2', 'speaker': {'member_id': '600', 'title': '', 'first_name': 'Adam', 'last_name': 'Bandt', 'house': '1', 'constituency': 'Melbourne', 'party': 'Australian Greens', 'person_id': '10734', 'url': '/mp/?m=600'}, 'parent': {'body': 'Bills: Coal Prohibition (Quit Coal) Bill 2021; Second Reading'}}


In [None]:
# Print the contents of the 'speeches' variable
speeches

['https://www.openaustralia.org/debates/?id=2021-10-25.18.1&amp;s=speaker%3A10734#g18.2']
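Notice that the URL above still contains `&amp;`, which is the HTML-escaped form of `&`. The rest of this notebook uses the URL as it is. If you would like a clean URL, one option is to unescape it with `html.unescape` from Python's standard library. This step is an addition to the notebook, shown here only as a sketch:

```python
from html import unescape

# The URL exactly as it was built from 'listurl' above
escaped_url = 'https://www.openaustralia.org/debates/?id=2021-10-25.18.1&amp;s=speaker%3A10734#g18.2'

# Convert HTML entities such as '&amp;' back to plain characters
clean_url = unescape(escaped_url)
print(clean_url)
```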

<div class="alert alert-block alert-success">
<b>Try it Yourself</b>
<ul>    
<li> Check out how many results you get in 'speeches' when you change the 'key_words' </li>
<li> Within the for-loop above, there is the conditional if-clause. What would happen if you changed 'all()' to 'any()' in this condition?</li>
</ul> 
</div>
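If you want to check your intuition about `all()` and `any()` before trying them on the real data, the sketch below applies both to two made-up snippets. The snippets are invented for illustration and do not come from the Open Australia data.

```python
key_words = ['pollution', 'climate', 'fossil-fuel']

# Two made-up snippets for illustration (not taken from the real data)
snippets = [
    'Fossil-fuel pollution is driving the climate emergency.',
    'The climate summit starts next week.',
]

results = list()
for snippet in snippets:
    text = snippet.lower()
    # all() is True only when every key word appears in the text
    matches_all = all(word in text for word in key_words)
    # any() is True when at least one key word appears in the text
    matches_any = any(word in text for word in key_words)
    results.append((matches_all, matches_any))
    print(matches_all, matches_any, '-', snippet)
```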

In [None]:
# There is only one speech so let's save that
speech_url = speeches[0]
speech_url

## Scraping

We'll be using Beautiful Soup to help us scrape the text from the webpage 'speech_url'. This library is designed for extracting data from HTML and XML documents. It parses the HTML/XML and then allows you to traverse the embedded structure.

You can learn more about Beautiful Soup here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
# Import the libraries to retrieve the webpage and scrape it
from requests import get
from bs4 import BeautifulSoup

# Retrieve the webpage (html)
url_response = get(speech_url)
speech_html = url_response.text

# Print the first 1000 characters
speech_html[:1000]

In [None]:
# Parse the webpage and save it as 'page'
page = BeautifulSoup(speech_html, 'html.parser')

### Beautiful But Not Magical

There are no magic tricks when it comes to webpage scraping. Beautiful Soup is a great tool for scraping text from a webpage, but you still need to tell it what to scrape.

One way to do this is to inspect the HTML to find what you want to scrape. Luckily there are conventions, such as the tag ```<p>``` being used to mark up paragraphs.

In this HTML page, the paragraphs of the speech are normally inside a ```<p>``` tag with the class "speaker". Our first step is to find this ```<p>``` tag, and then we can iterate over all the unlabelled ```<p>``` paragraphs inside it and extract their text.

The algorithm above won't be the same for every webpage you encounter, which is why it's a good idea to inspect the HTML source that you want to scrape. This might seem like overkill for one webpage, but if you want to do this for hundreds of pages, it's worthwhile!

Different browsers have different ways to inspect the HTML source, and a quick web search will help you find out how.

<img src="https://github.com/Australian-Text-Analytics-Platform/web-scraping/blob/main/img/bandt-oa.png?raw=true" class="center"/>

In [None]:
# In the Beautiful Soup object 'page', find the first <p> tag with the class "speaker"
labelled = page.find('p', class_="speaker")

The word “class” is a reserved word in Python, so if you use ```class``` as an argument you will get a syntax error. Instead you can search for the item "class" using the keyword ```class_``` in Beautiful Soup.
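To try this out without a network connection, the sketch below uses a tiny made-up HTML snippet (not the real Open Australia page, whose structure is different) and shows three ways that Beautiful Soup offers to search by class. It then keeps only the paragraphs that have no class attribute.

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML snippet for illustration (not the real Open Australia page)
html_doc = '''
<div>
  <p class="speaker">Speaker name</p>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Three equivalent ways to find the first <p> tag with the class "speaker"
by_keyword = soup.find('p', class_='speaker')
by_attrs = soup.find('p', attrs={'class': 'speaker'})
by_selector = soup.select_one('p.speaker')
print(by_keyword.text, '|', by_attrs.text, '|', by_selector.text)

# Keep only the paragraphs that have no 'class' attribute
unlabelled = [p.text for p in soup.find_all('p') if 'class' not in p.attrs]
print(unlabelled)
```

Which of the three forms you use is a matter of taste. The `attrs` dictionary is handy when you need to match several attributes at once, and CSS selectors are convenient if you already know them from web development.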

In [None]:
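# Collect the text of every unlabelled paragraph inside the 'speaker' section
motion = list()
paragraphs = labelled.find_all('p')
for p in paragraphs:
    # Skip paragraphs that have a 'class' attribute, as these are labelled
    if 'class' not in p.attrs:
        motion.append(p.text)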

In [None]:
# Print out how many paragraphs were extracted
len(motion)

44

In [None]:
# Each paragraph is saved as an element in the list
# Print out the first 10 paragraphs
motion[:10]

['I move:',
 "With our coal and gas exports, Australia is the world's third-biggest exporter of fossil-fuel pollution after Russia and Saudi Arabia, who are coincidentally our only allies in the upcoming climate negotiations in Glasgow.",
 "Four-fifths of the coal we extract, we export overseas. This is Australia's biggest contribution to the climate emergency that our carbon accounts exclude and our establishment political parties ignore.",
 'But, without a plan for coal, there is no plan to stop runaway global heating.',
 'Almost all of our exports go to Japan, China, and South Korea. All three have pledged net zero, which means the first thing they will target is pushing thermal coal out of their electricity system.',
 "We are at a crossroads and right now we're staring down the wrong path. On the eve of the Glasgow climate summit, while the rest of the world are making plans to get out of coal and gas, here we have the Liberal and Labor parties backing more coal and gas—including t

<div class="alert alert-block alert-success">
<b>Try it Yourself</b>

In this notebook we scraped the debate text from the page. There is a lot more information contained in this webpage to scrape!

For example you can:
<ul>    
<li> Extract the names of the politicians involved in this motion.</li>
<li> Find out which electorate each of the politicians represents.</li>
</ul> 
</div>