# Examples, course on webscraping

*By Olav ten Bosch and Dick Windmeijer*

#### Documentation: [Requests.py](http://docs.python-requests.org) [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) [regular expressions](https://developers.google.com/edu/python/regular-expressions)

In [4]:
# Imports:
import requests
from bs4 import BeautifulSoup
import re
import time

In [9]:
# Retrieving home page of Statistics Netherlands:
r1 = requests.get('https://www.cbs.nl/en-gb')

#print(r1.headers)
print(r1.status_code, r1.headers['content-type'], r1.encoding)
#print(r1.text)

200 text/html; charset=utf-8 utf-8


In [25]:
# Retrieving home page of Statistics Netherlands with user-agent string:
headers = {'user-agent': 'scrapingCourseBot'}
r2 = requests.get('https://www.cbs.nl/en-gb', headers=headers)

# Headers of the request:
print(r2.request.headers)
# Headers of the response:
print(r2.headers)
#print(r2.status_code, r2.headers['content-type'], r2.encoding)
#print(r2.text)

{'user-agent': 'scrapingCourseBot', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
{'Cache-Control': 'private', 'Content-Type': 'text/html; charset=utf-8', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Set-Cookie': 'website#lang=en-GB; path=/; secure, ASP.NET_SessionId=q3ezmiemgighgcyohnn3b3ns; path=/; secure; HttpOnly, SC_ANALYTICS_GLOBAL_COOKIE=17b032bc36134a3fa9c3e5b2dcf49cf8|False; expires=Sun, 11-Feb-2029 14:05:32 GMT; path=/; secure; HttpOnly', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=31536000', 'Access-Control-Allow-Origin': '*', 'Date': 'Mon, 11 Feb 2019 14:05:32 GMT', 'Content-Length': '13294'}


In [13]:
# Use a regexp to retrieve the title of the main article:
# The main article is in an h2:
#  <h2 style="background-color:#0058b8;" class="bottom">TITLE</h2>
# Non-greedy .*? stops at the first closing </h2> instead of the last one on the line
match = re.search(r'<h2 style="background-color:#0058b8;" class="bottom">.*?</h2>', r1.text)
if match:
    print(match.group())
else:
    print('not found')

<h2 style="background-color:#0058b8;" class="bottom">Nearly 80 million air passengers in 2018</h2>
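
The match above returns the whole `<h2>` element, tags included. As a minimal sketch, a capture group can pull out only the title text. The `sample_html` string is copied from the output above so that the example runs offline; on the live page you would search `r1.text` instead. The names `sample_html`, `pattern` and `title` are illustrative only.

```python
import re

# Sample HTML copied from the output above (stand-in for r1.text).
sample_html = ('<h2 style="background-color:#0058b8;" class="bottom">'
               'Nearly 80 million air passengers in 2018</h2>')

# The parentheses capture only the text between the tags;
# .*? is non-greedy, so it stops at the first closing </h2>.
pattern = re.compile(
    r'<h2 style="background-color:#0058b8;" class="bottom">(.*?)</h2>')

match = pattern.search(sample_html)
title = match.group(1) if match else None
print(title)
```

Note that a pattern tied to an inline style attribute like this one breaks as soon as the site changes its layout, which is one reason the later cells switch to BeautifulSoup.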


In [23]:
# Issue a request with parameters:
pars = {'products': 2, 'years': 2}
r3 = requests.get('http://testing-ground.scraping.pro/table', params=pars)  # requests adds the '?' itself
print(r3.url)
#print(r3.text)

http://testing-ground.scraping.pro/table?products=2&years=2
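
To see how `params` ends up in the URL without sending anything over the network, a request can be built and prepared first. This is a sketch under the assumption that you only want to inspect the encoded URL; the variable names `req` and `prepared` are illustrative.

```python
import requests

pars = {'products': 2, 'years': 2}

# Build the request object but do not send it.
req = requests.Request('GET', 'http://testing-ground.scraping.pro/table',
                       params=pars)

# prepare() encodes the parameters into the query string.
prepared = req.prepare()
print(prepared.url)
```

The printed URL should match the `r3.url` shown above, because `requests.get` prepares the request in the same way before sending it.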


In [24]:
# Using BeautifulSoup to access parts of the page:
r4 = requests.get('https://www.cbs.nl/en-gb')
soup = BeautifulSoup(r4.text, 'lxml')
print(soup.title.text)
print(soup.find("h2").text)

CBS - Statistics Netherlands
More asylum seekers in 2018


In [48]:
# Get the URLs of all CBS news articles:
articles = soup.find_all("div", class_='thumbnail')
for article in articles:
    link = article.find("a")['href']
    print(link)

/en-gb/dossier/brexit-monitor
/en-gb/news/2019/05/dutch-manufacturers-less-positive
/en-gb/news/2019/05/manufacturing-output-prices-almost-1-percent-up
/en-gb/news/2019/05/rising-imports-and-exports-of-construction-services
/en-gb/news/2019/04/eu-webshops-earn-over-400-million-in-the-netherlands
/en-gb/news/2019/04/house-prices-over-8-percent-higher-in-december
/en-gb/news/2019/04/brussels-sprouts-exports-hit-record-high
/en-gb/news/2019/04/over-600-billion-kg-of-inbound-goods-in-2017
/en-gb/news/2019/04/investments-up-in-november
/en-gb/news/2019/04/household-consumption-2-percent-up-in-november
/en-gb/news/2019/04/largest-drop-in-consumer-confidence-in-over-7-years
/en-gb/news/2019/03/agricultural-export-value-over-90-bn-euros-in-2018
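
The links printed above are relative to the site root. As a small sketch, `urllib.parse.urljoin` from the standard library turns them into absolute URLs, which is a bit more robust than string concatenation. The list `links` below holds two of the paths from the output above so that the example runs offline; `base_url` and `absolute_links` are illustrative names.

```python
from urllib.parse import urljoin

base_url = 'https://www.cbs.nl/en-gb'

# Two of the relative links from the output above.
links = [
    '/en-gb/dossier/brexit-monitor',
    '/en-gb/news/2019/05/dutch-manufacturers-less-positive',
]

# A path starting with '/' replaces the path of the base URL.
absolute_links = [urljoin(base_url, link) for link in links]
for url in absolute_links:
    print(url)
```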


In [53]:
# Use a CSS selector to get the headlines (h3 elements inside div.thumbnail):
headlines = soup.select("div.thumbnail h3")

print(headlines)

[<h3>Brexit Monitor</h3>, <h3>Dutch manufacturers less positive</h3>, <h3>Manufacturing output prices almost 1 percent up</h3>, <h3>Rising imports and exports of construction services</h3>, <h3>EU webshops earn over €400 million in the Netherlands</h3>, <h3>House prices over 8 percent higher in December</h3>, <h3>Brussels sprouts exports hit record high</h3>, <h3>Over 600 billion kg of inbound goods in 2017</h3>, <h3>Investments up in November</h3>, <h3>Household consumption 2 percent up in November</h3>, <h3>Largest drop in consumer confidence in over 7 years</h3>, <h3>Agricultural export value over 90 bn euros in 2018</h3>]
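
`select` returns a list of tag objects, which is why the output above still contains the `<h3>` markup. A minimal sketch of extracting only the headline text follows. The `sample_html` snippet is a shortened, assumed reconstruction of the page structure (the real markup may differ), and it is parsed with the built-in `html.parser` so that the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the CBS home page; structure is an assumption.
sample_html = """
<div class="thumbnail"><a href="/en-gb/dossier/brexit-monitor">
  <h3>Brexit Monitor</h3></a></div>
<div class="thumbnail"><a href="/en-gb/news/2019/05/dutch-manufacturers-less-positive">
  <h3>Dutch manufacturers less positive</h3></a></div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# get_text(strip=True) returns the text content without surrounding whitespace.
titles = [h3.get_text(strip=True)
          for h3 in sample_soup.select('div.thumbnail h3')]
print(titles)
```

On the live page the same list comprehension can be applied to `headlines` directly.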


In [60]:
# Get the lead text of each CBS news article:
articles = soup.find_all("div", class_='thumbnail')
links3 = []
for article in articles:
    links3.append(article.find("a")['href'])

for link in links3:
    r = requests.get('https://www.cbs.nl'+link)
    #print(r.url)
    soup2 = BeautifulSoup(r.text, 'lxml')
    leadtext = soup2.find('section', class_='leadtext')
    # Pages without a lead text (such as dossiers) are skipped, but we still pause
    if leadtext is not None:
        print(leadtext.text)
    time.sleep(1)  # in robots.txt CBS advises a delay of 1 second


                According to Statistics Netherlands (CBS), producer confidence among Dutch manufacturers has declined in January 2019. Confidence stands at 5.8 in January, down from 7.5 in December. Producers are mainly less positive about their future output.
            

                Statistics Netherlands (CBS) reports that prices of Dutch-manufactured products were 0.6 percent up in December 2018 year-on-year. The price increase was smaller than in the previous month, when prices in manufacturing were up by 2.7 percent.
            

                In the first three quarters of 2018, Dutch companies had transactions with foreign companies in construction and dredging services to a total amount of 4.3 billion euros, the highest amount ever recorded over three consecutive quarters. Statistics Netherlands (CBS) reports this on the basis of the latest figures on international service trade.
            

                In Q3 2018, Dutch consumers purchased approximately 406 mil