UK Data Service Webscraping seminar 13 May 2020 - Simple example


In [1]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

Enter your name and press enter:

Hello Sonja, enjoy learning more about Python and web-scraping!


Working through the webinar....part 1 (simple example)

This lesson - Collecting data I: web-scraping - has two aims:

 - Demonstrate how to use Python to collect data found on websites.
 - Cultivate your computational thinking skills through coding examples. In particular, how to define and solve a data collection problem using a computational method.



What is the general approach for scraping data from a web page?

We begin by identifying a web page containing information we are interested in collecting. Then we need to know the following:

    1) The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
    2) The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And do the following:

    1) Request the web page using its web address.
    2) Parse the structure of the web page so your programming language can work with its contents.
    3) Extract the information we are interested in.
    4) Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this pseudo-code, as it captures the main tasks and the order in which they need to be executed.
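One way to make pseudo-code concrete is to write each step as a small placeholder function and then call the functions in order. The sketch below is illustrative only: the function names (request_page, parse_page, extract_text, save_text, scrape) and the example address are hypothetical, and the request step returns a canned page rather than contacting a website, so it runs offline.

```python
import os
import tempfile

def request_page(url):
    """Step 1: request the web page. Placeholder: returns a canned page
    instead of contacting the (hypothetical) address."""
    return "<html><body><p>Example text</p></body></html>"

def parse_page(html):
    """Step 2: parse the page. Placeholder: keeps the raw string."""
    return html

def extract_text(page):
    """Step 3: extract the text between the first <p> and </p> tags."""
    start = page.index("<p>") + len("<p>")
    end = page.index("</p>")
    return page[start:end]

def save_text(text, path):
    """Step 4: write the extracted text to a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def scrape(url, path):
    """Run the four steps in order and return the extracted text."""
    page = parse_page(request_page(url))
    text = extract_text(page)
    save_text(text, path)
    return text

# Write to the system's temporary folder so no working files are touched
outfile = os.path.join(tempfile.gettempdir(), "pseudo-code-example.txt")
print(scrape("https://example.org/", outfile))  # prints "Example text"
```

In a real scrape, each placeholder would be swapped for a proper implementation while the overall order of the steps stays the same.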


In [3]:
#1) Identify the webpage
from IPython.display import IFrame
IFrame("https://httpbin.org/html", width="800", height="650")


2) Locating information we are interested in

Our task is to extract the text on this web page. In order to do so, we need to understand where the text is located within the underlying source code of the web page. Web pages are written in a language called HyperText Markup Language (HTML). HTML describes the structure of a web page, and consists of a number of elements (e.g., paragraphs, tables, headers), with each element represented by a tag (e.g., <p>, <table>, <h1>).

See <a href="https://www.w3schools.com/html/html_intro.asp" target=_blank>https://www.w3schools.com/html/html_intro.asp</a> for more information on HTML.
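To see how elements map onto tags, the sketch below lists the opening tags found in a small, made-up piece of HTML. It uses Python's built-in html.parser module; the class name TagLister and the sample_html string are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collect the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag the parser encounters
        self.tags.append(tag)

# A made-up page with a header, a paragraph and an empty table
sample_html = "<html><body><h1>Title</h1><p>Some text</p><table></table></body></html>"

lister = TagLister()
lister.feed(sample_html)
print(lister.tags)  # ['html', 'body', 'h1', 'p', 'table']
```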
Visually inspecting the underlying HTML code

Therefore, what we need are the tags that identify the section of the web page where the text is stored. We can discover the tags by examining the source code (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select View Page Source from the list of options. (Chrome: View page source; Safari: follow <a href="https://www.lifewire.com/view-html-source-in-safari-3469315" target=_blank>these instructions</a>).


Viewing the source....view-source:https://httpbin.org/html

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
      <h1>Herman Melville - Moby-Dick</h1>

      <div>
        <p>
          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the 
          bla bla abbreviated bla bla
        is old blacksmith to thyself ere his full ruin came upon him, then had the young widow had a delicious grief, and her orphans a truly venerable, legendary sire to dream of in their after years; and all of them a care-killing competency.
        </p>
      </div>
  </body>
</html>

In [4]:
#Now we have the info let's scrape the webpage

In [5]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

Successfully imported necessary modules


In [6]:
# Define the URL where the web page can be accessed

url = "https://httpbin.org/html"

# Request the web page

response = requests.get(url, allow_redirects=True) # request the url and call it "response"
response.status_code # check if page was requested successfully

200

We get a status code of 200, which means the request was successful. A status code in the 400s or 500s represents an unsuccessful attempt at requesting a web page (see <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> for a succinct description of different types of response status codes).

Let's unpack the code a bit. First, we define a variable (also known as an 'object' in Python) called url that contains the web address of the page we want to request. Next, we use the get() function of the requests module to request the web page and, in the same line of code, store the result of the request in a variable called response. Finally, we check whether the request was successful by reading the status_code attribute of the response variable.
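The grouping of status codes into ranges can be sketched with Python's built-in http module. The helper name describe_status is hypothetical, and the three codes are examples only; nothing here makes a real request.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of a standard HTTP status code."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "OK"
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"

for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK: success
# 404 Not Found: client error
# 500 Internal Server Error: server error
```

When working with the requests module itself, calling response.raise_for_status() is a common alternative to checking the number by hand: it raises an exception if the status code is in the 400s or 500s.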


In [7]:
response.headers

{'Date': 'Thu, 14 May 2020 08:53:42 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '3741', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
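The headers describe the response rather than the page itself. As a rough sketch, the Content-Type and Content-Length values shown above can be picked apart with the standard library; the headers dictionary below is a hand-copied subset of that output, not a live response.

```python
from email.message import Message

# Subset of the headers shown above, copied by hand
headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Content-Length": "3741",
}

# The email.message.Message class knows how to split a Content-Type value
msg = Message()
msg["Content-Type"] = headers["Content-Type"]

media_type = msg.get_content_type()          # kind of document returned
charset = msg.get_content_charset()          # character encoding of the text
size_bytes = int(headers["Content-Length"])  # size of the body in bytes

print(media_type, charset, size_bytes)  # text/html utf-8 3741
```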

In [8]:
response.text

"<!DOCTYPE html>\n<html>\n  <head>\n  </head>\n  <body>\n      <h1>Herman Melville - Moby-Dick</h1>\n\n      <div>\n        <p>\n          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm.

In [14]:
type(response) # so this is a requests.models.Response type of object. Never come across this before.

requests.models.Response

#So we have the info.
#It's difficult to use in this format, so this is where Beautiful Soup comes in.
Now it's time to identify and understand the structure of the web page we requested. We do this by converting the content contained in the response.text attribute into a BeautifulSoup variable. BeautifulSoup is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:

In [9]:
# Extract the contents of the webpage from the response

soup_response = soup(response.text, "html.parser") # Parse the text as a Beautiful Soup object
soup_response
# The hierarchical structure of the web page is now recognised by Python. Not only that,
# BeautifulSoup provides some methods for accessing the tags contained in the web page

<!DOCTYPE html>

<html>
<head>
</head>
<body>
<h1>Herman Melville - Moby-Dick</h1>
<div>
<p>
          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petula
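Once the page is parsed, tags can be reached as attributes of the BeautifulSoup object. The sketch below uses a shortened, hand-written copy of the page (sample_html and sample_soup are names chosen for illustration) so that it runs without a web request.

```python
from bs4 import BeautifulSoup

# Shortened, hand-written stand-in for the requested page
sample_html = """
<html>
  <body>
    <h1>Herman Melville - Moby-Dick</h1>
    <div><p>Availing himself of the mild, summer-cool weather</p></div>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

heading = sample_soup.h1          # first <h1> tag in the document
print(heading.name)               # h1
print(heading.text)               # Herman Melville - Moby-Dick
print(sample_soup.p.parent.name)  # div (the <p> tag sits inside a <div>)
```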

In [10]:
#The next stage is to find the info we are interested in, which means using the HTML tags.
#We use the find() method on the soup_response variable to capture the first set of <p></p> tags on the
#page. Remember, we used our visual inspection of the source code to identify that the text
#we needed was contained within a set of <p></p> tags, and that there was only one set.
paragraph = soup_response.find("p")
paragraph

<p>
          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petulance did come from him. Silent, slow, and solemn; bowing over still further his chronicall
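The find() method returns only the first matching tag, which is enough here because the page has a single set of <p></p> tags. On a page with several paragraphs, find_all() returns every match. A small sketch on made-up HTML (sample_html is hypothetical):

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs and no table
sample_html = "<div><p>First paragraph</p><p>Second paragraph</p></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first = sample_soup.find("p")        # first match only
every = sample_soup.find_all("p")    # all matches, in document order
missing = sample_soup.find("table")  # no <table> in this HTML

print(first.text)   # First paragraph
print(len(every))   # 2
print(missing)      # None
```

Because find() returns None when nothing matches, it is worth checking the result before asking for its text.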

In [11]:
#We're near the end of the scrape: we just need to extract the text from within the tags like so:
data = paragraph.text
print(data)


          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm. No murmur, no impatience, no petulance did come from him. Silent, slow, and solemn; bowing over still further his chronically b
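The extracted text keeps the line breaks and indentation from the HTML source. If that whitespace is not wanted, plain string methods can tidy it up. A minimal sketch on a shortened, made-up string (raw is a stand-in for data):

```python
# Shortened stand-in for the scraped text, with the same kind of padding
raw = "\n          Availing himself of the mild,\n          summer-cool weather\n        "

trimmed = raw.strip()              # drop leading and trailing whitespace only
normalised = " ".join(raw.split()) # also collapse internal runs of whitespace

print(repr(trimmed))
print(repr(normalised))  # 'Availing himself of the mild, summer-cool weather'
```

Beautiful Soup's get_text() method also accepts strip=True, which trims the whitespace around each piece of text as it is extracted.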

In [None]:
#Saving results from scrape
# Define a file to store the data

outfile = "./moby-dick-scraped-data.txt" # location and name of file

# Open the file and write (save) the data to it

with open(outfile, "w") as f:
    f.write(data)

In [13]:
# Check presence of file in current folder

#os.listdir()
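Listing a folder shows everything in it; to check for one particular file, os.path.exists() is more direct. The sketch below works in a temporary folder so it does not touch your own files, and the file name is hypothetical.

```python
import os
import tempfile

# Work in a temporary folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as folder:
    example_file = os.path.join(folder, "example-scraped-data.txt")

    before = os.path.exists(example_file)  # file not written yet

    with open(example_file, "w", encoding="utf-8") as f:
        f.write("some scraped text")

    after = os.path.exists(example_file)   # file now present
    listing = os.listdir(folder)           # names of everything in the folder

print(before, after, listing)  # False True ['example-scraped-data.txt']
```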

In [None]:
# Open file and read (import) its contents

with open(outfile, "r") as f:
    data = f.read()
    
print(data)
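A quick way to confirm that a save worked is a round trip: write the text, read it back, and compare. The sketch below also passes encoding="utf-8" explicitly, since the default text encoding can differ between machines. The file name and sample text are made up for illustration.

```python
import os
import tempfile

# Sample text including a non-ASCII character (an ellipsis)
text = "Availing himself of the mild, summer-cool weather \u2026"

path = os.path.join(tempfile.gettempdir(), "round-trip-example.txt")

# Write the text, then read it back with the same encoding
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)  # True

os.remove(path)  # tidy up the example file
```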