# Scraping the HTML source (advanced)

Scraping HTML tables is easy, but sometimes we want to access data that isn't as nicely formatted. For example:

- **Prices**: you might want price data for a type of product, or for everything a particular shop sells
- **Weather**: maybe you want to automate the collection of weather data from the Met Office or weather.com
- **News and Media**: scraping headlines and summaries can tell you about current affairs

In this example, we will scrape the Economics Observatory website to collect the latest article titles.

## Investigating the webpage

Before writing any code, let's take a look at the webpage.

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/eco_website.png"> </img>

We want to extract a list of article titles, such as "What do we know about labour market power in the UK?". To do this, we need to know where the titles appear in the HTML and how they are marked up. Using the browser's inspect-element tool (right-click, or Ctrl-click on a Mac), we can see the HTML code that creates the titles.

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/inspect_element.png"> </img>

Here we can see that article titles have the class "home__blocks-item-title". We'll use this information to extract just the article titles.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Scraping the page

First, we'll download the HTML which defines the page, using the requests module.

In [2]:
req = requests.get("https://www.economicsobservatory.com") # Make a request to the ECO home-page
page_html = req.text # store the HTML in page_html
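
Network requests can fail or hang, so in a longer-running script it may be worth adding a timeout and checking the status code before using the response. The sketch below is one way to do that; the helper name `fetch_html` and the 10-second default are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error on failure.

    `fetch_html` and the 10-second default are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.text


# Example usage (needs network access):
# page_html = fetch_html("https://www.economicsobservatory.com")
```

Wrapping the call in a function also makes it easier to reuse the same checks if you later scrape more than one page.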

Now we have the page's source stored in `page_html`. Next we're going to use a module called BeautifulSoup to turn this text into a representation of the page we can interact with. We'll store this in a variable called `soup`.

In [3]:
soup = BeautifulSoup(page_html, 'html.parser') # Create a BeautifulSoup object to interact with the page's HTML

Now we'll look for article titles by searching for elements with the class "home__blocks-item-title" which we identified above.


In [9]:
article_title_elements = soup.find_all(class_="home__blocks-item-title") # Find all elements with the class "home__blocks-item-title"
article_titles = [element.text for element in article_title_elements] # Extract the text from each element
article_titles

['Is work in the UK becoming more insecure?',
 'What do we know about labour market power in the UK?',
 'How can we reduce gender gaps in mathematics education?',
 'How have minorities been treated by the UK’s judicial system?',
 'How are plastics harming marine ecosystems?',
 'Read the latest edition of our magazine here',
 'How might house prices affect workers’ productivity in OECD economies?',
 'Youth custody: who ends up there and how does it affect their later lives?',
 'Central Bank Independence by Continent',
 'The UK’s productivity gap: what did it look like twenty years ago?',
 'Slow growing',
 'Could a new policy institution help solve the UK’s productivity problem?',
 'Which investments in human capital will boost productivity growth?',
 'What’s worth reading over the 2023 holiday season?']
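
To check a selector without hitting the live site, it can help to try it on a small hand-written snippet first. The HTML below is made up for illustration (only the class name comes from the page above), and `soup.select` with a CSS selector is shown as a possible alternative to `find_all(class_=...)`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; only the class name is taken from the real page
sample_html = """
<div>
  <h3 class="home__blocks-item-title"> Is work in the UK becoming more insecure? </h3>
  <h3 class="home__blocks-item-title">Slow growing</h3>
  <h3 class="something-else">Not an article title</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Option 1: search by class, as in the cell above
by_class = sample_soup.find_all(class_="home__blocks-item-title")

# Option 2: the same search written as a CSS selector
by_selector = sample_soup.select(".home__blocks-item-title")

# get_text(strip=True) removes leading/trailing whitespace from each title
sample_titles = [element.get_text(strip=True) for element in by_selector]
print(sample_titles)
```

Both options find the same two elements here; which one to use is mostly a matter of taste.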

We also care about the tagline (or 'teaser') of each article.

These are held in \<span\> and \<p\> tags inside divs with the class "home__blocks-item-teaser display".

In [10]:
# find all divs with the class "home__blocks-item-teaser display"
tagline_divs = soup.find_all(class_="home__blocks-item-teaser display")
# get the first <p> tag from each div (find returns None if a div has no <p>)
taglines = [div.find("p") for div in tagline_divs]
# extract the text from each tag, skipping divs where no <p> was found
tagline_texts = [tagline.text for tagline in taglines if tagline is not None]
tagline_texts

['The increase in zero-hour contracts and the emergence of the gig economy over the past decade have raised concerns that working life is becoming less secure. Evidence suggests that while the share of workers experiencing insecure work has not gone up, some groups are more at risk.',
 'Disparate treatment of minorities by the judicial system is not a construct of modern institutions. Irish defendants and victims faced harsher treatment and outcomes at London’s Old Bailey in the 19th century. Lessons from this period can help to inform criminal justice policy today.',
 'The oceans have become a waste-sink for plastics—just like the atmosphere is for greenhouse gas emissions. A higher carbon price may help tackle both problems.',
 'Higher house prices may be partly to blame for the sluggish growth of labour productivity in the OECD countries in recent decades. The adverse impact seems to be less severe in more complex economies – those that produce a greater diversity of products based 
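
Note that `div.find("p")` returns `None` when a div has no \<p\> tag, and that collecting titles and taglines in two separate lists can leave them misaligned if some articles have no teaser. One way to keep them together is to start from each title and look for a teaser inside the same parent element. This is only a sketch: the sample HTML and its `home__blocks-item` wrapper class are assumptions, so check the real structure with inspect-element first.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the "home__blocks-item" wrapper is an assumed structure
sample_html = """
<div class="home__blocks-item">
  <h3 class="home__blocks-item-title">How are plastics harming marine ecosystems?</h3>
  <div class="home__blocks-item-teaser display"><p>The oceans have become a waste-sink for plastics.</p></div>
</div>
<div class="home__blocks-item">
  <h3 class="home__blocks-item-title">Slow growing</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

articles = []
for title_element in sample_soup.find_all(class_="home__blocks-item-title"):
    container = title_element.parent  # the element wrapping this article
    # select_one returns None when the container has no teaser paragraph
    teaser = container.select_one(".home__blocks-item-teaser p")
    articles.append({
        "title": title_element.get_text(strip=True),
        "tagline": teaser.get_text(strip=True) if teaser is not None else None,
    })

print(articles)
```

Each dictionary keeps a title with its own tagline, and articles without a teaser get `None` rather than borrowing the next article's text.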

## Where to from here?

We now have a list of articles. How could this be useful?

- **Automated News Roundups**: you could write code to collect news titles each day to produce a daily roundup
- **Sentiment Analysis**: if you scale up the data collection, you could perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) to learn about the emotional valence of news stories.

### Making a Chart: Term Frequencies

As a first step, we can prepare the data for a chart of term frequencies from the headlines and taglines. This will tell us which topics the website covers.

To do this, we will:

1. Define a list of common words to avoid (e.g. "the", "how", "should")
2. Work out how many times each word appears, excluding the common words
3. Save our data

#### 1: Making a list of common words

Thankfully, someone has already defined a list of common words [here](https://raw.githubusercontent.com/6/stopwords-json/master/dist/en.json). We can download this list to use.

In [13]:
# downloading the list of common words into a list
common_words = requests.get("https://raw.githubusercontent.com/6/stopwords-json/master/dist/en.json").json()

#### 2: Counting how many times each word appears

In [22]:
# we'll store how many times each word appears in words
words = {}

# using a loop to go through every article title and tagline
for text in article_titles+tagline_texts:
  text = text.lower() # making it lowercase
  for word in text.split():
    if word in common_words or not word.isalpha():
      continue # skip common words (e.g. "the") and anything containing non-letters (e.g. "2023", "uk?")
    if word in words: # if we've already seen this word, just increase the count
      words[word] += 1
    else:
      words[word] = 1

words

{'work': 2,
 'uk': 4,
 'labour': 2,
 'market': 1,
 'power': 1,
 'reduce': 1,
 'gender': 1,
 'gaps': 1,
 'mathematics': 1,
 'minorities': 2,
 'treated': 1,
 'judicial': 2,
 'plastics': 1,
 'harming': 1,
 'marine': 1,
 'read': 1,
 'latest': 1,
 'edition': 1,
 'magazine': 1,
 'house': 2,
 'prices': 2,
 'affect': 2,
 'productivity': 8,
 'oecd': 2,
 'youth': 3,
 'ends': 1,
 'central': 2,
 'bank': 1,
 'independence': 1,
 'continent': 1,
 'twenty': 2,
 'years': 2,
 'slow': 1,
 'growing': 1,
 'policy': 2,
 'institution': 1,
 'solve': 1,
 'investments': 1,
 'human': 1,
 'capital': 1,
 'boost': 2,
 'worth': 1,
 'reading': 1,
 'holiday': 1,
 'increase': 1,
 'contracts': 1,
 'emergence': 1,
 'gig': 1,
 'economy': 1,
 'past': 1,
 'decade': 1,
 'raised': 1,
 'concerns': 1,
 'working': 1,
 'life': 1,
 'evidence': 1,
 'suggests': 2,
 'share': 1,
 'workers': 1,
 'experiencing': 1,
 'insecure': 1,
 'groups': 1,
 'disparate': 1,
 'treatment': 2,
 'system': 1,
 'construct': 1,
 'modern': 1,
 'irish': 1,
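
Because `isalpha()` is `False` for any token containing punctuation, words such as "insecure?" or "uk’s" are dropped entirely by the loop above. A possible variation, sketched below, strips punctuation from the ends of each word before the check and lets `collections.Counter` do the counting. The short `sample_texts` and `sample_common_words` lists are stand-ins for `article_titles + tagline_texts` and `common_words`, and the helper name `count_terms` is hypothetical.

```python
import string
from collections import Counter

# Characters to trim from the ends of each token: ASCII punctuation plus curly quotes and the en dash
PUNCTUATION = string.punctuation + "‘’“”–"


def count_terms(texts, stop_words):
    """Count words across texts, ignoring stop words and surrounding punctuation."""
    stop_words = set(stop_words)  # set membership checks are faster than list checks
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            word = token.strip(PUNCTUATION)  # "insecure?" -> "insecure"
            if word.isalpha() and word not in stop_words:
                counts[word] += 1
    return counts


# Stand-in data for illustration
sample_texts = [
    "Is work in the UK becoming more insecure?",
    "What do we know about labour market power in the UK?",
]
sample_common_words = ["is", "in", "the", "what", "do", "we", "about", "more", "becoming", "know"]

term_counts = count_terms(sample_texts, sample_common_words)
print(term_counts.most_common(3))
```

With this approach "uk" and "uk?" are counted as the same word. Tokens with punctuation in the middle, such as "uk’s", are still skipped by the `isalpha()` check.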
 

#### 3: Saving the data

In [24]:
df = pd.DataFrame(words.items(), columns=["word", "count"]) # Create a DataFrame from the words dictionary
df = df.sort_values("count", ascending=False).head(10) # Sort the DataFrame by count and take the top 10
df.to_csv("top_words.csv", index=False) # Save the DataFrame to a CSV file
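
Many words share the same count, so which of the tied words make it into the top 10 depends on the order they happen to be sorted in. If a reproducible result matters, one option is to break ties alphabetically. The sketch below uses a small made-up `sample_words` dictionary in place of `words`, and takes the top 4 rather than the top 10 to keep the output short.

```python
import pandas as pd

# Made-up counts standing in for the `words` dictionary
sample_words = {"uk": 4, "work": 2, "productivity": 8, "labour": 2, "youth": 3, "affect": 2}

sample_df = pd.DataFrame(list(sample_words.items()), columns=["word", "count"])

# Sort by count (largest first), then alphabetically so ties always come out in the same order
top_words = (
    sample_df.sort_values(["count", "word"], ascending=[False, True])
    .head(4)
    .reset_index(drop=True)
)
print(top_words)
```

Here "affect", "labour" and "work" all appear twice, and the alphabetical tiebreak means "affect" is always the one that takes the last place.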


In [21]:
"–".isalpha()

False

In [14]:
common_words

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c
