In this tutorial we use the [requests](https://pypi.org/project/requests/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) Python packages to extract information from BBC articles.

## Package imports

In [1]:
import requests
from bs4 import BeautifulSoup

## News article URL

In [2]:
url = "https://www.bbc.co.uk/news/business-65321487"

## Getting the HTML source code with requests

We start by using the requests package to get the HTML source code for the URL above.

In [3]:
page_html = requests.get(url=url)

In [4]:
page_html

<Response [200]>

In [50]:
# The response body is stored as raw bytes
type(page_html.content)

bytes
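
The call above assumes the request succeeds. As a minimal sketch, assuming you want failures to surface early, the download can be wrapped in a small helper that sets a timeout and raises on HTTP error codes. The name `fetch_html` and the 10-second default are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its body as raw bytes.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    # Without a timeout, requests can wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would then return the same kind of bytes object as `page_html.content`, or raise an exception if the page could not be retrieved.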

## Parsing HTML with BeautifulSoup

In [6]:
soup = BeautifulSoup(markup=page_html.content, features="html.parser")

In [7]:
type(soup)

bs4.BeautifulSoup

We use `print()` together with the `prettify()` method to better understand the HTML structure.

In [9]:
# prettify() adds indentation and print() renders the line breaks, which makes the HTML easier to read
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title data-rh="true">
   The 'exploding' demand for giant heat pumps - BBC News
  </title>
  <meta content="Whole towns in Europe are being heated by huge, energy efficient heat pumps." data-rh="true" name="description"/>
  <meta content="#FFFFFF" data-rh="true" name="theme-color"/>
  <meta content="https://www.facebook.com/bbcnews" data-rh="true" property="article:author"/>
  <meta content="100004154058350" data-rh="true" property="fb:admins"/>
  <meta content="1609039196070050" data-rh="true" property="fb:app_id"/>
  <meta content="1143803202301544,317278538359186,1392506827668140,742734325867560,185246968166196,156060587793370,137920769558355,193435954068976,21263239760,156400551056385,929399697073756,154344434967,228735667216,80758950658,260212261199,294662213128,1086451581439054,283348121682053,295830058648,239931389545417,3

## Let us now extract information from our "soup"

### Main Heading

`find()` returns the first matching item, or `None` if nothing matches.

In [11]:
title = soup.find(id="main-heading")

In [12]:
title.text

"The 'exploding' demand for giant heat pumps"

### Body of text

Let us start by using `find()` to get the first matching item.

In [17]:
soup.find(name='div', attrs={'data-component':"text-block"})

<div class="ssrcss-11r1m41-RichTextComponentWrapper ep2nwvo0" data-component="text-block"><div class="ssrcss-7uxr49-RichTextContainer e5tfeyi1"><p class="ssrcss-1q0x1qg-Paragraph eq5iqo00"><b class="ssrcss-hmf8ql-BoldText e5tfeyi3">There are 2.5 million litres of water in an Olympic-sized swimming pool. </b></p></div></div>

In [18]:
first_match = soup.find(name='div', attrs={'data-component':"text-block"})
print(first_match.prettify())

<div class="ssrcss-11r1m41-RichTextComponentWrapper ep2nwvo0" data-component="text-block">
 <div class="ssrcss-7uxr49-RichTextContainer e5tfeyi1">
  <p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">
   <b class="ssrcss-hmf8ql-BoldText e5tfeyi3">
    There are 2.5 million litres of water in an Olympic-sized swimming pool.
   </b>
  </p>
 </div>
</div>



In [19]:
first_match.text

'There are 2.5 million litres of water in an Olympic-sized swimming pool. '
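
Both `find()` calls so far assume that the element exists. When nothing matches, `find()` returns `None`, and accessing `.text` on it raises an `AttributeError`. The following is a minimal sketch of a guard, using a small made-up HTML snippet rather than the live page. The helper name `text_or_none` is an illustrative assumption, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = '<html><body><h1 id="main-heading">Example heading</h1></body></html>'
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text().strip() if tag is not None else None


heading = text_or_none(soup.find(id="main-heading"))
missing = text_or_none(soup.find(id="no-such-id"))

print(heading)
print(missing)
```

A guard like this is useful when the same extraction rule is applied to several articles whose layouts may differ.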

`find_all()` returns a list of all matching items.

In [20]:
# Full list of matching items
all_matches = soup.find_all(name='div', attrs={'data-component':"text-block"})
all_matches

[<div class="ssrcss-11r1m41-RichTextComponentWrapper ep2nwvo0" data-component="text-block"><div class="ssrcss-7uxr49-RichTextContainer e5tfeyi1"><p class="ssrcss-1q0x1qg-Paragraph eq5iqo00"><b class="ssrcss-hmf8ql-BoldText e5tfeyi3">There are 2.5 million litres of water in an Olympic-sized swimming pool. </b></p></div></div>,
 <div class="ssrcss-11r1m41-RichTextComponentWrapper ep2nwvo0" data-component="text-block"><div class="ssrcss-7uxr49-RichTextContainer e5tfeyi1"><p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">If for some reason you wanted to bring it from a pleasant 20C to boiling point, German firm MAN Energy Solutions (MAN ES) has a heat pump that could do it. And it would take less time than Kenneth Branagh's film version of Hamlet.</p></div></div>,
 <div class="ssrcss-11r1m41-RichTextComponentWrapper ep2nwvo0" data-component="text-block"><div class="ssrcss-7uxr49-RichTextContainer e5tfeyi1"><p class="ssrcss-1q0x1qg-Paragraph eq5iqo00">"We can do this in less than four hours," ex

In [21]:
# List size
len(all_matches)

31

In [22]:
# List of text strings
[t.text for t in all_matches]

['There are 2.5 million litres of water in an Olympic-sized swimming pool. ',
 "If for some reason you wanted to bring it from a pleasant 20C to boiling point, German firm MAN Energy Solutions (MAN ES) has a heat pump that could do it. And it would take less time than Kenneth Branagh's film version of Hamlet.",
 '"We can do this in less than four hours," explains Raymond Decorvet, who works in business development at MAN ES. "Or we could freeze the whole thing in about 11 hours."',
 'Theirs is among the largest heat pump units in the world. Heat pumps work by compressing gently warmed refrigerants to raise the temperature of these fluids. That heat can then be passed on to homes or industrial machinery.',
 'Heat pumps require electricity to work but can produce around three or four kilowatt hours of heat for every kilowatt hour of power they consume.',
 "Heat pumps are increasingly popular with some home owners but domestic devices are relatively small and tend to have outputs of sever

In [23]:
# All sentences together
print(" ".join([t.text for t in all_matches]))

There are 2.5 million litres of water in an Olympic-sized swimming pool.  If for some reason you wanted to bring it from a pleasant 20C to boiling point, German firm MAN Energy Solutions (MAN ES) has a heat pump that could do it. And it would take less time than Kenneth Branagh's film version of Hamlet. "We can do this in less than four hours," explains Raymond Decorvet, who works in business development at MAN ES. "Or we could freeze the whole thing in about 11 hours." Theirs is among the largest heat pump units in the world. Heat pumps work by compressing gently warmed refrigerants to raise the temperature of these fluids. That heat can then be passed on to homes or industrial machinery. Heat pumps require electricity to work but can produce around three or four kilowatt hours of heat for every kilowatt hour of power they consume. Heat pumps are increasingly popular with some home owners but domestic devices are relatively small and tend to have outputs of several kilowatts or so. MA
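
Note the double space after the first sentence in the output above: the first block's text ends with a trailing space, and the join adds another. A minimal sketch of one way to avoid this, assuming it is acceptable to strip leading and trailing whitespace from each block, is shown below on two made-up paragraphs.

```python
from bs4 import BeautifulSoup

# Made-up paragraphs; the first one ends with a trailing space on purpose.
html = """
<div data-component="text-block"><p><b>First paragraph. </b></p></div>
<div data-component="text-block"><p>Second paragraph.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

blocks = soup.find_all("div", attrs={"data-component": "text-block"})
# Strip each block's text before joining so that separators are not doubled.
paragraphs = [block.get_text().strip() for block in blocks]
article_text = " ".join(paragraphs)

print(article_text)
```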

### Publication time

Time as text:

In [26]:
time_info = soup.find_all(name='time')
time_info

[<time data-testid="timestamp" datetime="2023-05-29T23:23:50.000Z"><span class="ssrcss-1mh4yp1-IconContainer ecn1o5v0" data-testid="time-and-date:clock"><svg aria-hidden="true" class="ssrcss-xi5oyi-StyledIcon e6m7o991" focusable="false" height="1em" viewbox="0 0 32 32" width="1em"><path d="M16 31c8.5 0 15-6.5 15-15S24.5 1 16 1 1 7.5 1 16s6.5 15 15 15zm0-2.7C9 28.3 3.7 23 3.7 16S9 3.7 16 3.7C23 3.7 28.3 9 28.3 16S23 28.3 16 28.3zm6.2-6.7 1-1.5-5.7-4.5-.6-8.6H15l-.7 10.5 7.9 4.1z"></path></svg></span>30 May</time>]

In [28]:
print(time_info[0].prettify())

<time data-testid="timestamp" datetime="2023-05-29T23:23:50.000Z">
 <span class="ssrcss-1mh4yp1-IconContainer ecn1o5v0" data-testid="time-and-date:clock">
  <svg aria-hidden="true" class="ssrcss-xi5oyi-StyledIcon e6m7o991" focusable="false" height="1em" viewbox="0 0 32 32" width="1em">
   <path d="M16 31c8.5 0 15-6.5 15-15S24.5 1 16 1 1 7.5 1 16s6.5 15 15 15zm0-2.7C9 28.3 3.7 23 3.7 16S9 3.7 16 3.7C23 3.7 28.3 9 28.3 16S23 28.3 16 28.3zm6.2-6.7 1-1.5-5.7-4.5-.6-8.6H15l-.7 10.5 7.9 4.1z">
   </path>
  </svg>
 </span>
 30 May
</time>



In [29]:
soup.find(name='time').text

'30 May'

Time as datetime:

In [31]:
soup.find(name='time').attrs

{'data-testid': 'timestamp', 'datetime': '2023-05-29T23:23:50.000Z'}

In [42]:
soup.find(name='time').attrs["datetime"]

'2023-05-29T23:23:50.000Z'
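
The `datetime` attribute is still a string. As a minimal sketch, it can be parsed into a timezone-aware `datetime` object with the standard library. The conversion to UK time below assumes a fixed offset of one hour (British Summer Time, which applies in late May). It shows why the visible text reads "30 May" even though the UTC timestamp falls on 29 May.

```python
from datetime import datetime, timedelta, timezone

raw = "2023-05-29T23:23:50.000Z"

# %z accepts the trailing "Z" (UTC) on Python 3.7 and later.
published_utc = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Assumption: a fixed +01:00 offset for British Summer Time.
bst = timezone(timedelta(hours=1), name="BST")
published_uk = published_utc.astimezone(bst)

print(published_utc.isoformat())                 # 2023-05-29T23:23:50+00:00
print(published_uk.strftime("%d %B %Y %H:%M"))   # 30 May 2023 00:23
```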

### All links found on the webpage

In [40]:
# First link
soup.find(name="a").get("href")

'https://www.bbc.co.uk'

In [39]:
# Printing all links found on the webpage
for link in soup.find_all(name='a'):
    print(link.get('href'))


https://www.bbc.co.uk
#main-heading
https://www.bbc.co.uk/accessibility/
https://account.bbc.com/account?lang=en-GB&ptrt=https://www.bbc.co.uk/news/business-65321487
https://www.bbc.co.uk/notifications
https://www.bbc.co.uk
https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport
https://www.bbc.co.uk/weather
https://www.bbc.co.uk/iplayer
https://www.bbc.co.uk/sounds
https://www.bbc.co.uk/bitesize
#chameleon-global-navigation-more-menu
#chameleon-global-navigation-more-menu
/search?d=NEWS_PS
https://www.bbc.co.uk
https://www.bbc.co.uk/news
https://www.bbc.co.uk/sport
https://www.bbc.co.uk/weather
https://www.bbc.co.uk/iplayer
https://www.bbc.co.uk/sounds
https://www.bbc.co.uk/bitesize
https://www.bbc.co.uk/cbbc
https://www.bbc.co.uk/cbeebies
https://www.bbc.co.uk/food
#more-menu-button
/news
#product-navigation-menu
/news
/news/topics/cljev4jz3pjt
/news/world-60525350
/news/science-environment-56837908
/news/uk
/news/world
/news/business
/news/politics
/news/entertainment_and_arts
/news/
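
The list mixes absolute URLs, site-relative paths such as `/news/uk`, and in-page anchors such as `#main-heading`. A minimal sketch of how these could be normalised to absolute URLs with `urllib.parse.urljoin` follows. The short `hrefs` list is a hand-picked sample of the values above, plus a `None` to stand in for an `<a>` tag without an `href`.

```python
from urllib.parse import urljoin

page_url = "https://www.bbc.co.uk/news/business-65321487"

# Sample of href values as returned by link.get('href'); None means no href attribute.
hrefs = ["https://www.bbc.co.uk/sport", "/news/business", "#main-heading", None]

# urljoin leaves absolute URLs unchanged and resolves the rest against page_url.
absolute_links = [urljoin(page_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```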

### Topic links

In [37]:
# List of topic links
all_links = soup.find_all(name='a')
# `or ""` guards against <a> tags without an href, for which get() returns None
topic_links = [link.get('href') for link in all_links if "topics" in (link.get('href') or "")]
topic_links

['/news/topics/cljev4jz3pjt',
 '/news/topics/c77jz3md4rwt',
 '/news/topics/c77jz3mdqv1t',
 '/news/topics/cvpmrdkex04t',
 '/news/topics/cx1m7zg0gpet']

### Topic names list

In [43]:
# Topics all in one string without a clear separator
[i.text for i in soup.find_all(name='div', attrs={'data-component':"topic-list"})]

['Related TopicsSwedenDenmarkHeat pumpsRenewable energy']

In [44]:
soup.find(name='div', attrs={'data-component':"topic-list"}).find_all("li")

[<li><a class="ssrcss-w6az1r-StyledLink ed0g1kj0" href="/news/topics/c77jz3md4rwt">Sweden</a></li>,
 <li><a class="ssrcss-w6az1r-StyledLink ed0g1kj0" href="/news/topics/c77jz3mdqv1t">Denmark</a></li>,
 <li><a class="ssrcss-w6az1r-StyledLink ed0g1kj0" href="/news/topics/cvpmrdkex04t">Heat pumps</a></li>,
 <li><a class="ssrcss-w6az1r-StyledLink ed0g1kj0" href="/news/topics/cx1m7zg0gpet">Renewable energy</a></li>]

In [45]:
[topic.text for topic in soup.find(name='div', attrs={'data-component':"topic-list"}).find_all("li")]

['Sweden', 'Denmark', 'Heat pumps', 'Renewable energy']
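
Topic names and topic links can also be collected together. The sketch below builds a dictionary from name to link on a trimmed-down copy of the topic list. The `<h2>` and `<ul>` wrappers in the snippet are assumptions about the surrounding markup, and the CSS classes have been dropped for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed-down topic list; the <h2> and <ul> wrappers are assumed.
html = """
<div data-component="topic-list">
 <h2>Related Topics</h2>
 <ul>
  <li><a href="/news/topics/c77jz3md4rwt">Sweden</a></li>
  <li><a href="/news/topics/c77jz3mdqv1t">Denmark</a></li>
 </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

topic_list = soup.find("div", attrs={"data-component": "topic-list"})
# href=True keeps only <a> tags that actually carry an href attribute.
topics = {a.get_text().strip(): a["href"] for a in topic_list.find_all("a", href=True)}

print(topics)
```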

### Author information


In [46]:
# We only really want the author, but .text also includes the job title
soup.find(name='div', attrs={'data-component':"byline-block"}).text

'By Chris BaraniukTechnology of Business reporter'

In [48]:
by = soup.find(name='div', attrs={'data-component':"byline-block"})
print(by.prettify())

<div class="ssrcss-1bdte2-BylineComponentWrapper e8mq1e90" data-component="byline-block">
 <div class="ssrcss-qt5zqv-BylineWrapper e8mq1e917">
  <div class="ssrcss-h3c0s8-ContributorContainer e8mq1e916">
   <div class="ssrcss-1u2in0b-Container-ContributorDetails e8mq1e913">
    <div class="ssrcss-68pt20-Text-TextContributorName e8mq1e96">
     By Chris Baraniuk
    </div>
    <div class="ssrcss-84ltp5-Text e8mq1e910">
     Technology of Business reporter
    </div>
   </div>
  </div>
 </div>
 <div class="ssrcss-jlwt2c-Divider e8mq1e915">
 </div>
</div>



In [49]:
by.find("div", attrs={"class":"ssrcss-68pt20-Text-TextContributorName e8mq1e96"}).text

'By Chris Baraniuk'
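
The class string used above contains generated fragments such as `ssrcss-68pt20` and `e8mq1e96`, which may differ between articles or change over time. As a hedged alternative, the sketch below matches only the readable part of the class name with a regular expression and removes the leading "By ". It runs on a simplified copy of the byline markup, and whether `ContributorName` stays stable on the live site is an assumption.

```python
import re

from bs4 import BeautifulSoup

# Simplified copy of the byline block shown above.
html = """
<div data-component="byline-block">
 <div class="ssrcss-68pt20-Text-TextContributorName e8mq1e96">By Chris Baraniuk</div>
 <div class="ssrcss-84ltp5-Text e8mq1e910">Technology of Business reporter</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
by = soup.find("div", attrs={"data-component": "byline-block"})

# Match any class containing "ContributorName" instead of the full generated string.
name_tag = by.find("div", class_=re.compile("ContributorName"))
author = name_tag.get_text().strip()
if author.startswith("By "):
    author = author[len("By "):]

# stripped_strings yields each piece of text separately, so name and role stay apart.
byline_parts = list(by.stripped_strings)

print(author)
print(byline_parts)
```

Iterating over `stripped_strings` also avoids the run-together text seen in `'By Chris BaraniukTechnology of Business reporter'`.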

## Exercises

1. Does the code above work on other articles from bbc.co.uk? 

You can give it a try with the following ones or others:

```
url = "https://www.bbc.co.uk/news/science-environment-57159056"
url = "https://www.bbc.co.uk/news/business-64261457"
```

2. This article `https://www.bbc.co.uk/news/science-environment-57159056` contains headlines (besides the main title). Write code to extract those headlines.

**Steps:**

a) Use requests to get the HTML source code.

b) Use BeautifulSoup to parse the HTML.

c) Build a rule that allows you to extract the information you need. 

3. Extract the figure captions from one of the articles.

4. Extract relevant metadata information from the pictures in one of the articles.

5. How do you interpret the following `robots.txt` files?


**Robots #1:**

```
User-agent: Google
Disallow:

User-agent: *
Disallow: /
```

**Robots #2:**

```
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /search/
Request-rate: 15/100
```