In [6]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


# Scraping
Scraping is the automated extraction of data from websites.

This notebook:

1. Demonstrates scraping HTML tables using a Wikipedia table of G7 meetings as an example
2. Introduces more advanced scraping techniques

# Data Scraping


For this example, we will scrape the list of G7 summits from Wikipedia, <a href="https://en.wikipedia.org/wiki/G7#List_of_summits">here</a>. It's a good target because it is:

- Available online
- Not available as a clean download (Excel, CSV)

<img
height=500 src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/g7_wiki.png"> </img>

First, we define the URL of the web page we want to scrape.




In [7]:
url = "https://en.wikipedia.org/wiki/G7#List_of_summits" # The URL of the webpage
tables_from_webpage = pd.read_html(url) # Read the tables from the webpage into a list of DataFrames

pd.read_html returns a list of dataframes, one for each table on the webpage.
We can look through this list one by one by trying tables_from_webpage[0], tables_from_webpage[1], ...
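
Rather than opening each table in turn, one option is to print a short summary of every DataFrame in the list. The sketch below uses two small stand-in DataFrames in place of the real `tables_from_webpage` list (the first is made up, the second copies three rows from the summit table), and `summarise_tables` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def summarise_tables(tables):
    """Return one row per table: its position, size and first few column names."""
    rows = []
    for position, table in enumerate(tables):
        rows.append({
            "position": position,
            "n_rows": table.shape[0],
            "n_columns": table.shape[1],
            "first_columns": ", ".join(str(c) for c in table.columns[:4]),
        })
    return pd.DataFrame(rows)

# Stand-in tables for illustration only; in the notebook, pass tables_from_webpage instead
stand_in_tables = [
    pd.DataFrame({"Column A": [1, 2], "Column B": [3, 4]}),  # made-up placeholder table
    pd.DataFrame({
        "#": ["1st", "2nd", "3rd"],
        "Date": ["15–17 November 1975", "27–28 June 1976", "7–8 May 1977"],
        "Host": ["France", "United States", "United Kingdom"],
    }),
]

summary = summarise_tables(stand_in_tables)
print(summary)
```

`pd.read_html` also accepts a `match` argument, which keeps only the tables containing a given piece of text; that can narrow the list before inspecting it.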

In [8]:
tables_from_webpage[2].head(10)

Unnamed: 0,#,Date,Host,Host leader,Location held,Notes
0,1st,15–17 November 1975,France,Valéry Giscard d'Estaing,"Château de Rambouillet, Yvelines",The first and last G6 summit.
1,2nd,27–28 June 1976,United States,Gerald R. Ford,"Dorado, Puerto Rico[74]","Also called ""Rambouillet II"". Canada joined th..."
2,3rd,7–8 May 1977,United Kingdom,James Callaghan,"London, England",The President of the European Commission was i...
3,4th,16–17 July 1978,West Germany,Helmut Schmidt,"Bonn, North Rhine-Westphalia",
4,5th,28–29 June 1979,Japan,Masayoshi Ōhira,Tokyo,
5,6th,22–23 June 1980,Italy,Francesco Cossiga,"Venice, Veneto",Prime Minister Ōhira died in office on 12 June...
6,7th,20–21 July 1981,Canada,Pierre E. Trudeau,"Montebello, Québec",
7,8th,4–6 June 1982,France,François Mitterrand,"Versailles, Yvelines",
8,9th,28–30 May 1983,United States,Ronald Reagan,"Williamsburg, Virginia",
9,10th,7–9 June 1984,United Kingdom,Margaret Thatcher,"London, England",


Trying tables_from_webpage[0], tables_from_webpage[1] and tables_from_webpage[2], we can see that tables_from_webpage[2] is the table we need.

For ease of use, let's set the variable df equal to this table and take a look at the first few rows with df.head().

In [9]:
df = tables_from_webpage[2]
df.head()

Unnamed: 0,#,Date,Host,Host leader,Location held,Notes
0,1st,15–17 November 1975,France,Valéry Giscard d'Estaing,"Château de Rambouillet, Yvelines",The first and last G6 summit.
1,2nd,27–28 June 1976,United States,Gerald R. Ford,"Dorado, Puerto Rico[74]","Also called ""Rambouillet II"". Canada joined th..."
2,3rd,7–8 May 1977,United Kingdom,James Callaghan,"London, England",The President of the European Commission was i...
3,4th,16–17 July 1978,West Germany,Helmut Schmidt,"Bonn, North Rhine-Westphalia",
4,5th,28–29 June 1979,Japan,Masayoshi Ōhira,Tokyo,
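
Some cells in the scraped table carry Wikipedia footnote markers, such as the `[74]` in "Dorado, Puerto Rico[74]". If left in place, the same location with and without a marker would be treated as two different values when counting. A minimal sketch of removing them with a regular expression, using the ten locations displayed earlier as stand-in data rather than the full table:

```python
import pandas as pd

# The "Location held" values from the ten rows displayed earlier (stand-in for the full table)
locations = pd.Series(
    [
        "Château de Rambouillet, Yvelines",
        "Dorado, Puerto Rico[74]",
        "London, England",
        "Bonn, North Rhine-Westphalia",
        "Tokyo",
        "Venice, Veneto",
        "Montebello, Québec",
        "Versailles, Yvelines",
        "Williamsburg, Virginia",
        "London, England",
    ],
    name="Location held",
)

# Remove bracketed numeric footnote markers such as "[74]" and any leftover whitespace
clean_locations = locations.str.replace(r"\[\d+\]", "", regex=True).str.strip()

# Count how often each cleaned location appears
location_counts = clean_locations.value_counts()
print(location_counts.head())
```

In the notebook itself, the same `str.replace` call could be applied to `df['Location held']` before grouping.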


## Manipulating the Data

We can now manipulate the data. Let's try to make a chart of the number of G7 meetings held at each location.

First, let's group by the column 'Location held' and sort for just the most common places.

In [10]:
df = tables_from_webpage[2]
df = df.groupby('Location held').aggregate({'Host': 'count'}) # Group the data by the 'Location held' column and count the number of occurrences
df = df.sort_values(by='Host', ascending=False) # Sort the data by the number of occurrences in descending order
df = df[df['Host'] > 1] # Keep only the rows where the number of occurrences is greater than 1
df = df.rename(columns={'Host': 'Count'}) # Rename the 'Host' column to 'Count'

df

Unnamed: 0_level_0,Count
Location held,Unnamed: 1_level_1
Tokyo,3
"London, England",3
"Bonn, North Rhine-Westphalia",2
"Venice, Veneto",2


## Uploading the Data

Now let's save our table so that we can upload it to GitHub and use it in Vega-Lite.

In [11]:
df.to_csv('g7_summits.csv') # Save the data to a CSV file
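
Because the grouped table uses 'Location held' as its index, `to_csv` writes that index as the first column of the file. The sketch below shows this on a stand-in copy of the grouped table; calling `to_csv()` without a path returns the CSV text instead of writing a file, which makes it easy to inspect.

```python
import io
import pandas as pd

# Stand-in for the grouped table built in the previous section
counts = pd.DataFrame(
    {"Count": [3, 3, 2, 2]},
    index=pd.Index(
        ["Tokyo", "London, England", "Bonn, North Rhine-Westphalia", "Venice, Veneto"],
        name="Location held",
    ),
)

csv_text = counts.to_csv()  # no path given, so the CSV is returned as a string
print(csv_text)

# Reading the text back gives two ordinary columns: the location and its count
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.columns.tolist())
```

Values that contain a comma, such as "London, England", are quoted in the CSV so that they still occupy a single column.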

Next, we have to upload our output (e.g. "g7_summits.csv") to GitHub. Go to your own repository and click 'Add file':

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/uploading_to_github.png"> </img>

Then find the file and click 'Raw':

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/getting_raw.png"> </img>

Finally, copy the URL to use in Vega-Lite:

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/getting_url.png"> </img>





# Scraping the HTML source (advanced)

Scraping HTML tables is easy, but sometimes we want to access data that isn't as nicely formatted. For example:

- **Prices**: you might want data on a type of product or from a shop
- **Weather**: maybe you want to automate the collection of weather data from the Met Office or weather.com
- **News and Media**: scraping headlines and summaries can tell you about current affairs

In this example, we will scrape the Economics Observatory website to collect the latest article names.

## Investigating the webpage

Before writing any code, let's take a look at the webpage.

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/eco_website.png"> </img>

We want to extract a list of article titles, such as "What do we know about labour market power in the UK?". To do this, we need to know where they appear in the HTML and how they are defined. By using inspect-element (right/ctrl click), we can see the HTML code that creates the titles.

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/inspect_element.png"> </img>

Here we can see that article titles have the class "home__blocks-item-title". We'll use this information to extract just the article titles.

## Scraping the page

First, we'll download the HTML which defines the page, using the requests module.

In [12]:
req = requests.get("https://www.economicsobservatory.com") # Make a request to the ECO home-page
page_html = req.text # store the HTML in page_html
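
A request can fail or hang, for instance if the site is down or rejects automated traffic. A hedged sketch of a slightly more defensive download follows; `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice, not something the original notebook specifies.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error if the request fails."""
    response = requests.get(url, timeout=timeout)  # give up if the server does not respond in time
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `page_html = fetch_html("https://www.economicsobservatory.com")` would do the same job as the two lines in the cell above, but stop with a clear error instead of silently storing an error page.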

Now we have the page's source stored in `page_html`. Next we're going to use a module called BeautifulSoup to turn this text into a representation of the page we can interact with. We'll store this in a variable called `soup`.

In [13]:
soup = BeautifulSoup(page_html, 'html.parser') # Create a BeautifulSoup object to interact with the page's HTML

Now we'll look for article titles by searching for elements with the class "home__blocks-item-title" which we identified above.


In [14]:
article_title_elements = soup.find_all(class_="home__blocks-item-title")
article_title_elements

[<h3 class="home__blocks-item-title">Is work in the UK becoming more insecure?</h3>,
 <h3 class="home__blocks-item-title">What do we know about labour market power in the UK?</h3>,
 <h3 class="home__blocks-item-title">How can we reduce gender gaps in mathematics education?</h3>,
 <h3 class="home__blocks-item-title">How have minorities been treated by the UK’s judicial system?</h3>,
 <h3 class="home__blocks-item-title">How are plastics harming marine ecosystems?</h3>,
 <h3 class="home__blocks-item-title">Read the latest edition of our magazine here</h3>,
 <h3 class="home__blocks-item-title">How might house prices affect workers’ productivity in OECD economies?</h3>,
 <h3 class="home__blocks-item-title">Youth custody: who ends up there and how does it affect their later lives?</h3>,
 <h3 class="home__blocks-item-title" style="text-align: left;">Central Bank Independence by Continent</h3>,
 <h3 class="home__blocks-item-title">The UK’s productivity gap: what did it look like twenty years a
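
Each item in this list is an HTML element rather than a plain string. A small sketch of pulling out just the text is shown here, using a hard-coded HTML snippet in place of the live page so that it runs without a network connection. Two titles are copied from the output above; the third element and its `other-title` class are made up to show that non-matching elements are ignored.

```python
from bs4 import BeautifulSoup

sample_html = """
<div>
  <h3 class="home__blocks-item-title">Is work in the UK becoming more insecure?</h3>
  <h3 class="home__blocks-item-title" style="text-align: left;">Central Bank Independence by Continent</h3>
  <h3 class="other-title">Not an article title</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the elements whose class matches the one found with inspect-element
title_elements = sample_soup.find_all(class_="home__blocks-item-title")

# get_text(strip=True) returns the text inside each element without surrounding whitespace
titles = [element.get_text(strip=True) for element in title_elements]
print(titles)
```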

We also care about the taglines/'teasers' of each article.

These are contained in \<span\> elements. Let's search for all \<span\> elements to get these summaries.

In [None]:
span_elements = soup.find_all("span") # Find every <span> element on the page
span_elements

Where do we go from here?
We now have a list of article titles. How could this be useful?

- **Automated News Roundups**: you could write code to collect news titles each day to produce a daily roundup
- **Sentiment Analysis**: if you scale up the data collection, you could perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) to learn about the emotional valence of news stories.

### Making a Chart: Term Frequencies

Today, we can make a chart of term frequencies from the headlines. This will tell us about the topics covered by the website.

To do this, we will:

1. Define a list of common words to avoid (e.g. "the", "how", "should")
2. Work out how many times each word appears, excluding the common words
3. Save our data

#### 1: Making a list of common words

Thankfully, someone has already defined a list of common words [here](https://raw.githubusercontent.com/6/stopwords-json/master/dist/en.json). We can download this list to use.

In [17]:
# downloading the list of common words into a list
common_words = requests.get("https://raw.githubusercontent.com/6/stopwords-json/master/dist/en.json").json()

#### 2: Counting how many times each word appears

In [19]:
# we'll store how many times each word appears in words
words = {}

# using a loop to go through every article title
for title in article_title_elements:
  title = title.text # We only care about the title text itself, not the whole HTML that defines it
  title = title.lower() # making it lowercase
  for word in title.split():
    if word in common_words:
      continue # if this word is a common word (e.g. "the"), skip it
    if word in words: # if we've already seen this word, just increase the count
      words[word] += 1
    else:
      words[word] = 1


words

{'work': 1,
 'uk': 1,
 'insecure?': 1,
 'labour': 1,
 'market': 1,
 'power': 1,
 'uk?': 1,
 'reduce': 1,
 'gender': 1,
 'gaps': 1,
 'mathematics': 1,
 'education?': 1,
 'minorities': 1,
 'treated': 1,
 'uk’s': 3,
 'judicial': 1,
 'system?': 1,
 'plastics': 1,
 'harming': 1,
 'marine': 1,
 'ecosystems?': 1,
 'read': 1,
 'latest': 1,
 'edition': 1,
 'magazine': 1,
 'house': 1,
 'prices': 1,
 'affect': 2,
 'workers’': 1,
 'productivity': 4,
 'oecd': 1,
 'economies?': 1,
 'youth': 1,
 'custody:': 1,
 'ends': 1,
 'lives?': 1,
 'central': 1,
 'bank': 1,
 'independence': 1,
 'continent': 1,
 'gap:': 1,
 'twenty': 1,
 'years': 1,
 'ago?': 1,
 'slow': 1,
 'growing': 1,
 'policy': 1,
 'institution': 1,
 'solve': 1,
 'problem?': 1,
 'investments': 1,
 'human': 1,
 'capital': 1,
 'boost': 1,
 'growth?': 1,
 'what’s': 1,
 'worth': 1,
 'reading': 1,
 '2023': 1,
 'holiday': 1,
 'season?': 1}
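
In the output above, punctuation stays attached to words, so 'uk' and 'uk?' are counted separately. One possible refinement, sketched here with `collections.Counter` and two titles from the scraped list, strips punctuation from both ends of each word before counting. The short `stop_words` set is a stand-in for the downloaded `common_words` list, and `count_terms` is a hypothetical helper name.

```python
import string
from collections import Counter

# Two titles copied from the scraped list, used as stand-in data
sample_titles = [
    "Is work in the UK becoming more insecure?",
    "What do we know about labour market power in the UK?",
]

# Small stand-in for common_words; a set makes membership checks fast
stop_words = {"is", "in", "the", "becoming", "more", "what", "do", "we", "know", "about"}

def count_terms(titles, stop_words):
    """Count words across titles, ignoring case, edge punctuation and stop words."""
    strip_chars = string.punctuation + "’‘“”"  # ASCII punctuation plus curly quotes
    counts = Counter()
    for title in titles:
        for word in title.lower().split():
            word = word.strip(strip_chars)  # 'uk?' becomes 'uk'
            if word and word not in stop_words:
                counts[word] += 1
    return counts

term_counts = count_terms(sample_titles, stop_words)
print(term_counts.most_common(3))
```

Note that `strip` only removes characters at the start and end of a word, so a possessive such as 'uk’s' would still be counted separately from 'uk'.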

In [14]:
common_words

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [12]:
words

{'is': 1,
 'work': 1,
 'in': 5,
 'the': 7,
 'uk': 1,
 'becoming': 1,
 'more': 1,
 'insecure?': 1,
 'what': 2,
 'do': 1,
 'we': 2,
 'know': 1,
 'about': 1,
 'labour': 1,
 'market': 1,
 'power': 1,
 'uk?': 1,
 'how': 5,
 'can': 1,
 'reduce': 1,
 'gender': 1,
 'gaps': 1,
 'mathematics': 1,
 'education?': 1,
 'have': 1,
 'minorities': 1,
 'been': 1,
 'treated': 1,
 'by': 2,
 'uk’s': 3,
 'judicial': 1,
 'system?': 1,
 'are': 1,
 'plastics': 1,
 'harming': 1,
 'marine': 1,
 'ecosystems?': 1,
 'read': 1,
 'latest': 1,
 'edition': 1,
 'of': 1,
 'our': 1,
 'magazine': 1,
 'here': 1,
 'might': 1,
 'house': 1,
 'prices': 1,
 'affect': 2,
 'workers’': 1,
 'productivity': 4,
 'oecd': 1,
 'economies?': 1,
 'youth': 1,
 'custody:': 1,
 'who': 1,
 'ends': 1,
 'up': 1,
 'there': 1,
 'and': 1,
 'does': 1,
 'it': 2,
 'their': 1,
 'later': 1,
 'lives?': 1,
 'central': 1,
 'bank': 1,
 'independence': 1,
 'continent': 1,
 'gap:': 1,
 'did': 1,
 'look': 1,
 'like': 1,
 'twenty': 1,
 'years': 1,
 'ago?': 1,
 

In [8]:
title.text

'What’s worth reading over the 2023 holiday season?'
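
Step 3 of the plan, saving the data, has no cell of its own in this notebook. A minimal sketch is given here, assuming a `words` dictionary like the one built earlier (a four-entry stand-in with counts taken from that output is used) and a hypothetical output file name `term_frequencies.csv`:

```python
import pandas as pd

# Stand-in for the words dictionary, with counts taken from the earlier output
words = {"work": 1, "uk’s": 3, "affect": 2, "productivity": 4}

# Turn the dictionary into a two-column table, most frequent words first
term_frequencies = (
    pd.DataFrame(list(words.items()), columns=["word", "count"])
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

# Without a path, to_csv returns the CSV text, which is handy for checking the result
csv_text = term_frequencies.to_csv(index=False)
print(csv_text)

# To write a file for upload instead (hypothetical file name):
# term_frequencies.to_csv("term_frequencies.csv", index=False)
```

The resulting file can then be uploaded to GitHub and used in Vega-Lite in the same way as the G7 table.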