# Some examples of web scraping with BeautifulSoup

For a few years web scraping was very popular. However, many modern websites now block automated requests or build their pages in ways that simple scraping cannot handle, so several of the examples below return no useful data.

## Contents
0. Install packages
1. My first BeautifulSoup
2. Getting financial data from the web (examples protected)
3. Getting text from a .pdf with PyPDF2
4. Getting data from OryxSpioenkop
5. Generating .pdfs from web sites with pdfkit
6. Webscraping with pandas.read_html()

## 0. Install packages

In [1]:
!pip install requests



In [2]:
!pip install bs4

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Created wheel for bs4: filename=bs4-0.0.1-cp37-none-any.whl size=1278 sha256=c35451e9ac33c3d4f881a4f2a21a6b549fa4cbbb6b32cbbed67cd5062e1a5b25
  Stored in directory: C:\Users\Michiel\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


## 1. My first BeautifulSoup

In [33]:
import requests
result = requests.get("https://www.python.org/")
                      
print(result.status_code)
print(result.text)

200
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js">

    <meta name="application-name" content="Python.org">
    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
    <meta name="apple-mobile-web-app-title" content="Python.org">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">

    <meta na

In [29]:
from bs4 import BeautifulSoup
src = result.text
soup = BeautifulSoup(src, 'lxml')
h2_headings = soup.find_all("h2")
h2_headings

[<h2 class="widget-title"><span aria-hidden="true" class="icon-get-started"></span>Get Started</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-download"></span>Download</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-documentation"></span>Docs</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-jobs"></span>Jobs</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-news"></span>Latest News</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-calendar"></span>Upcoming Events</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-success-stories"></span>Success Stories</h2>,
 <h2 class="widget-title"><span aria-hidden="true" class="icon-python"></span>Use Python for…</h2>,
 <h2 class="widget-title">
 <span class="prompt">&gt;&gt;&gt;</span> <a href="/dev/peps/">Python Enhancement Proposals<span class="say-no-more"> (PEPs)</span></a>: The future of Python<span class="say-no-more"> is di

In [35]:
for heading in h2_headings:
    print(heading.name + " " + heading.text.strip())

h2 Get Started
h2 Download
h2 Docs
h2 Jobs
h2 Latest News
h2 Upcoming Events
h2 Success Stories
h2 Use Python for…
h2 >>> Python Enhancement Proposals (PEPs): The future of Python is discussed here.
 RSS
h2 >>> Python Software Foundation


In [36]:
# A demo script to select all h2 headers from python.org
import requests
from bs4 import BeautifulSoup

#get the data from the python.org website
result = requests.get("https://www.python.org/")
src = result.text

# Do the beautiful soup magic
soup = BeautifulSoup(src, 'lxml')

#select and print all the h2 headings in the soup
h2_headings = soup.find_all("h2")
for heading in h2_headings:
    print(heading.name + " " + heading.text.strip())

h2 Get Started
h2 Download
h2 Docs
h2 Jobs
h2 Latest News
h2 Upcoming Events
h2 Success Stories
h2 Use Python for…
h2 >>> Python Enhancement Proposals (PEPs): The future of Python is discussed here.
 RSS
h2 >>> Python Software Foundation


In [28]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.python.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'lxml')
print("List of all the h1, h2, h3 :")
for heading in soup.find_all(["h1","h2","h3"]):
    print(heading.name + ' ' + heading.text.strip())

List of all the h1, h2, h3 :
h1 
h1 Functions Defined
h1 Compound Data Types
h1 Intuitive Interpretation
h1 All the Flow You’d Expect
h1 Quick & Easy to Learn
h2 Get Started
h2 Download
h2 Docs
h2 Jobs
h2 Latest News
h2 Upcoming Events
h2 Success Stories
h2 Use Python for…
h2 >>> Python Enhancement Proposals (PEPs): The future of Python is discussed here.
 RSS
h2 >>> Python Software Foundation


### Getting the weather forecast from the KNMI (Royal Netherlands Meteorological Institute)

In [53]:
url = 'https://www.knmi.nl/nederland-nu/weer/verwachtingen'
knmi = requests.get(url)
knmi.status_code
knmi.text



In [54]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(knmi.text, 'lxml')

In [56]:
para = soup.find_all("p")
for p in para:
    print(p.name + " " + p.text.strip())

p Eerst plaatselijk dichte mist en gladheid
p Waarschuwingen Plaatselijk gladheid, in het midden zeer plaatselijk dichte mist (code geel).Vanochtend komt er met uitzondering van het zuidoosten plaatselijk dichte mist voor en kan het bovendien nog plaatselijk glad zijn. De gladheid zal op de meeste plaatsen verdwijnen, de mist kan op sommig plekken hardnekkig aanwezig blijven. Verder is er hier en daar ook ruimte voor de zon en is het op de meeste plaatsen droog. Er staat weinig wind, in het noordelijk kustgebied staat een matige wind uit zuid tot zuidwest.Vanmiddag hebben we een afwisseling van zon en bewolking en op een enkele plek kan het nog mistig zijn. Daar waar de zon schijnt kan het 3°C worden, in gebieden met hardnekkige mist komt de temperatuur amper boven nul. Vanavond is het droog en kan er opnieuw mist ontstaan. Bovendien gaat het overal licht vriezen waardoor het weer glad kan worden. Er staat weinig wind.
p Komende nacht ontstaat plaatselijk weer mist en kan het ook glad 

In [12]:
print(dir(soup))



In [22]:
# Get the tags (this cell was run on the python.org soup, not the KNMI one)
tag = soup.h2  # the first <h2> tag in the document
print(tag)
for heading in soup.find_all(["h1","h2","h3"]):
    print(heading.name)
    print('-------------------------------')
    print(heading.name + ' ' + heading.text.strip())

<h2 class="widget-title"><span aria-hidden="true" class="icon-get-started"></span>Get Started</h2>
h1
-------------------------------
h1 
h1
-------------------------------
h1 Functions Defined
h1
-------------------------------
h1 Compound Data Types
h1
-------------------------------
h1 Intuitive Interpretation
h1
-------------------------------
h1 All the Flow You’d Expect
h1
-------------------------------
h1 Quick & Easy to Learn
h2
-------------------------------
h2 Get Started
h2
-------------------------------
h2 Download
h2
-------------------------------
h2 Docs
h2
-------------------------------
h2 Jobs
h2
-------------------------------
h2 Latest News
h2
-------------------------------
h2 Upcoming Events
h2
-------------------------------
h2 Success Stories
h2
-------------------------------
h2 Use Python for…
h2
-------------------------------
h2 >>> Python Enhancement Proposals (PEPs): The future of Python is discussed here.
 RSS
h2
-------------------------------
h2 >>> 

In [2]:
# A script to get the headlines of the Dutch news website nu.nl
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.nu.nl/"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    # Without a page there is nothing to parse
    print(e)
else:
    soup = BeautifulSoup(page, "html.parser")

    # The first element with class "timestamp" contains the headline spans
    block = soup.find(class_="timestamp")
    spans = block.find_all('span')

    for titel in spans:
        # Spans without a title attribute print as "Title: None"
        print("Title: {}".format(titel.get("title")))

Title: Voorgevel woning aan diggelen door explosie in Roosendaal
Title: None
Title: None
Title: A7 tussen Heerenveen en Groningen dicht vanwege brandende kraanwagen
Title: None
Title: Hoe volg jij het nieuws? Praat mee over kranten en media
Title: None
Title: Britse krant The Guardian vermoedelijk getroffen door ransomwareaanval
Title: None
Title: Krapte op arbeidsmarkt is volgens het UWV over het hoogtepunt heen
Title: None
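
The `Title: None` lines appear because not every `<span>` inside the block has a `title` attribute. Below is a minimal sketch of one way to skip those spans. The HTML snippet is a made-up stand-in for the real page (whose markup will differ), and `html.parser` is used so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the nu.nl markup; the real page will differ.
sample_html = """
<div class="timestamp">
  <span title="First headline">10:01</span>
  <span>no title here</span>
  <span title="Second headline">10:05</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
block = soup.find(class_="timestamp")

# title=True keeps only the spans that actually have a title attribute
titles = [span["title"] for span in block.find_all("span", title=True)]
for title in titles:
    print("Title: {}".format(title))
```

The same `title=True` filter could be passed to `find_all` in the script above instead of testing each result for `None`.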


## 2. Getting financial data from the web

Source: https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/

### 2a. Getting value of AEX from Bloomberg (no sensible results, website protected)
Web:  https://www.bloomberg.com/quote/AEX:IND

In [18]:
import requests
import urllib
from bs4 import BeautifulSoup

In [19]:
quote_page = 'http://www.bloomberg.com/quote/AEX:IND'
result = requests.get(quote_page)

src = result.content
print(src)

b'<!doctype html>\n<html>\n<head>\n    <title>Bloomberg - Are you a robot?</title>\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <link rel="stylesheet" type="text/css" href="https://assets.bwbx.io/font-service/css/BWHaasGrotesk-55Roman-Web,BWHaasGrotesk-75Bold-Web,BW%20Haas%20Text%20Mono%20A-55%20Roman/font-face.css">\n    <style rel="stylesheet" type="text/css">\n        html, body, div, span, applet, object, iframe,\n        h1, h2, h3, h4, h5, h6, p, blockquote, pre,\n        a, abbr, acronym, address, big, cite, code,\n        del, dfn, em, img, ins, kbd, q, s, samp,\n        small, strike, strong, sub, sup, tt, var,\n        b, u, i, center,\n        dl, dt, dd, ol, ul, li,\n        fieldset, form, label, legend,\n        table, caption, tbody, tfoot, thead, tr, th, td,\n        article, aside, canvas, details, embed,\n        figure, figcaption, footer, header, hgroup,\n        menu, nav, output, ruby, section, summary,\n        time, mark, audio,

In [20]:
soup = BeautifulSoup(src, 'lxml')
# Collect all <h1> tags of the returned page
headings = []
for h1_tag in soup.find_all('h1'):
    headings.append(h1_tag)
print(headings)

[<h1 class="logo">Bloomberg</h1>]
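
The response above is Bloomberg's "Are you a robot?" page rather than the quote page, which is why the only `h1` is the logo. The sketch below shows two things that can be tried: recognising such a page before parsing it, and sending a browser-like `User-Agent` header. The helper names `looks_like_bot_check` and `fetch` and the list of marker phrases are assumptions for illustration, and a different header is not guaranteed to get past the check.

```python
import urllib.request

# Phrases that suggest a bot-check page; this list is an assumption, extend as needed.
BLOCK_MARKERS = ("are you a robot", "captcha", "access denied")

def looks_like_bot_check(html):
    """Return True if the HTML looks like a bot-check page rather than real content."""
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)

def fetch(url, user_agent="Mozilla/5.0"):
    """Fetch a page with an explicit User-Agent header (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

# Title taken from the Bloomberg response shown above
blocked_page = "<title>Bloomberg - Are you a robot?</title>"
print(looks_like_bot_check(blocked_page))
```

Calling `looks_like_bot_check(fetch(url))` before handing the HTML to BeautifulSoup makes it clear whether an empty result comes from the selector or from the site refusing the request.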


### 2b. Getting the Dutch TTF data (same problem)

In [44]:
import requests
# Other candidate URLs:
#url = 'https://www.theice.com/products/27996665/Dutch-TTF-Gas-Futures/data?marketId=5429405'
#url ='https://www.barchart.com/futures/quotes/TG*1'
url = 'https://nl.investing.com/commodities/ice-dutch-ttf-gas-c1-futures'
result = requests.get(url)
src = result.text

# Take everything after the "Bied/laat" (bid/ask) label ...
text = src.split("Bied/laat")[1]
# ... and cut it at each <span class="">; the bid is the first value that follows
span = text.split('span class="">')
print(span[1].split("<")[0])

155,905


In [42]:
span

['<!-- -->:</div><div class="trading-hours_value__2MrOn" data-test="bid-value"><',
 '156,315</span><span class="px-1">/</span><',
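
Splitting on raw strings works, but the result is still a string with a Dutch decimal comma, and the code raises an `IndexError` as soon as the label is missing. Below is a minimal sketch that uses a regular expression keyed on the `data-test="bid-value"` attribute visible in the `span` output above and converts the value to a float. The function names are hypothetical, and the sample fragment is pieced together from that output and is not the full page.

```python
import re

def parse_dutch_number(text):
    """Convert a Dutch-formatted number such as '1.234,5' to a float."""
    return float(text.replace(".", "").replace(",", "."))

def extract_bid(html):
    """Return the bid value as a float, or None when the marker is not found."""
    match = re.search(r'data-test="bid-value"><span class="">([\d.,]+)<', html)
    return parse_dutch_number(match.group(1)) if match else None

# Fragment pieced together from the `span` output of this notebook
sample = ('Bied/laat<!-- -->:</div><div class="trading-hours_value__2MrOn" '
          'data-test="bid-value"><span class="">156,315</span>'
          '<span class="px-1">/</span>')

print(extract_bid(sample))
```

Returning `None` instead of raising makes it easier to notice when the site changes its markup.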

In [45]:
import requests
url ='https://evcompany.evc-net.com/RechargeSpots/Detail/1000011184'
response = requests.get(url)
print(response.text)

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport">
        <link rel='stylesheet' href='/min/?g=css_ext&v=220825'><link rel='stylesheet' href='/api/style/css&v=220825'><link rel='stylesheet' href='/min/?g=css&v=220825'><link rel='shortcut icon' href='/api/style/profile/favicon'>
        <script type='text/javascript'>
            var JS_FOLDER = "\/js\/";
            var GOOGLE_KEY = "AIzaSyAoKg0_I3WMgjbpM6jwlffsWxXpW9NriI0";
            var DECIMAL_SEPERATOR = ".";
            var DATE_FORMAT = "DD-MM-YYYY";
            var DATE_TIME_FORMAT = "DD-MM-YYYY HH:mm:ss";
            var LANGUAGE = "en";
            var CURRENCY = {"code":"EUR","exchangeRate":null,"symbol":"\u20ac","name":"Euro","supported":true};
            var AUTOFOCUS = false;
            var DARKMODE = 0;
  

In [10]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(src, 'lxml')
matches = []
# Note: find_all('158') looks for a tag named <158>, not for the text "158",
# so the list stays empty
for tag in soup.find_all('158'):
    matches.append(tag)
print(matches)

[]
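
The empty list is expected: the first argument of `find_all` is a tag name, and no tag is called `158`. To search on the text inside tags, BeautifulSoup accepts a `string=` argument. The sketch below shows the difference on a made-up snippet, since the structure of the real page is not known here.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet; the real page structure is unknown here.
sample_html = '<div><span class="price">158,20</span><span class="price">99,10</span></div>'
soup = BeautifulSoup(sample_html, "html.parser")

# A tag-name search finds nothing, because there is no <158> element
print(soup.find_all("158"))

# string= matches on the text inside tags; each hit is a text node
hits = soup.find_all(string=re.compile("158"))
for hit in hits:
    print(hit.parent.name, hit.strip())
```

This only helps when the value is present in the downloaded HTML. If the page fills it in with JavaScript, the text search will also come back empty.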


In [11]:
soup

<!DOCTYPE html>
<html class="no-js" data-is-cms-page="true" lang="en"><head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1, minimum-scale=1" name="viewport"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><title>ICE Futures and Options</title><link href="https://www.ice.com/api/static/icegroupweb-styles/6.0.0/css/ice.css" rel="stylesheet"/><link href="https://static.ice.com/cms/35.0.5/css/ice.css" rel="stylesheet"/><link href="https://static.ice.com/favicons/1.0.0/ice.ico" rel="shortcut icon" type="image/x-icon"/><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-TDCRN82');</script><script defer="" src="https://static.ice.com/cms/35.0.5/js/hydrate/index.js" type="module"></

## 3. Getting text from a .pdf with PyPDF2 (to clean up)

Source: https://medium.com/@umerfarooq_26378/python-for-pdf-ef0fac2808b0

In [10]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py): started
  Building wheel for PyPDF2 (setup.py): finished with status 'done'
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61087 sha256=05e8f61b1ed16660924a476f72d1b6c12d0c82ba2557ea8cc36a90ac6f00f754
  Stored in directory: c:\users\31653\appdata\local\pip\cache\wheels\b1\1a\8f\a4c34be976825a2f7948d0fa40907598d69834f8ab5889de11
Successfully built PyPDF2
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0


In [11]:
import PyPDF2

# Insert the name of your own PDF file here
pdf_name = 'COVID-19_WebSite_rapport_20200512_1143 (1).pdf'

# Open the file in binary mode; the with-block closes it again afterwards
with open(pdf_name, 'rb') as pdfFileObj:
    # PDF reader object (PyPDF2 1.26 API)
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # Number of pages in the PDF
    print('pages = ' + str(pdfReader.numPages))
    # A page object; pages are counted from 0, so this is the tenth page
    pageObj = pdfReader.getPage(9)
    # Extract and print the text of that page; it can also be stored in a string
    print(pageObj.extractText())

FileNotFoundError: [Errno 2] No such file or directory: 'COVID-19_WebSite_rapport_20200512_1143 (1).pdf'

In [12]:
!pip install tabula-py

Collecting tabula-py
  Downloading tabula_py-2.2.0-py3-none-any.whl (11.7 MB)
Collecting distro
  Downloading distro-1.5.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.5.0 tabula-py-2.2.0


In [13]:
import tabula
# Read the PDF file that contains the table data.
# In tabula-py 2.x, read_pdf returns a list of pandas DataFrames, one per table found
tables = tabula.read_pdf("COVID-19_WebSite_rapport_20200512_1143 (1).pdf")
# Show the first 5 rows of the first table
tables[0].head()

FileNotFoundError: [Errno 2] No such file or directory: 'COVID-19_WebSite_rapport_20200512_1143 (1).pdf'

## 4. Getting data from OryxSpioenkop


In [9]:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.oryxspioenkop.com/2022/02/attack-on-europe-documenting-equipment.html")
                      
print(result.status_code)
print(result.headers)

200
{'Content-Type': 'text/html; charset=UTF-8', 'Expires': 'Tue, 05 Apr 2022 09:13:51 GMT', 'Date': 'Tue, 05 Apr 2022 09:13:51 GMT', 'Cache-Control': 'private, max-age=0', 'Last-Modified': 'Tue, 05 Apr 2022 08:16:05 GMT', 'ETag': 'W/"8f89c699b861e3f922a86b4edb3da876d3fe03e6acdb1083e32602da03eac609"', 'Content-Encoding': 'gzip', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Server': 'GSE', 'Transfer-Encoding': 'chunked'}


In [19]:
src = result.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all("a")
links

[<a href="https://www.oryxspioenkop.com/">Home</a>,
 <a href="https://www.oryxspioenkop.com/p/contact.html" itemprop="url">Contact</a>,
 <a href="https://www.patreon.com/oryxspioenkop" itemprop="url">Patreon</a>,
 <a href="https://www.helion.co.uk/military-history-books/the-armed-forces-of-north-korea-on-the-path-of-songun.php?sid=6d05c760e672b5fa614872a16a896afa" itemprop="url">Our Book</a>,
 <a href="https://oryxspioenkoptr.blogspot.com/" itemprop="url">Türkçe Için</a>,
 <a href="http://spioenkopjp.blogspot.com/" itemprop="url">日本語で</a>,
 <a class="twitter" href="https://twitter.com/oryxspioenkop" title="twitter"></a>,
 <a class="facebook" href="https://www.facebook.com/oryxspioenkop" title="facebook"></a>,
 <a class="youtube" href="https://www.youtube.com/channel/UCFqUJaJJWQusy1hvBQXrV3w/videos" title="youtube"></a>,
 <a class="email" href="mailto:oryxspioenkop@gmail.com" title="email"></a>,
 <a href="https://www.oryxspioenkop.com/" style="display: block"><h1 style="display:none"></

In [40]:
name_box = soup.find_all('h3')
name_box[3]

<h3><span class="mw-headline" id="Pistols">Tanks </span>(425, of which destroyed: 201, damaged: 6, abandoned: 42, captured: 176)</h3>

In [41]:
name_box = name_box[3]
name = name_box.text.strip() # strip() removes leading and trailing whitespace
print(name)

Tanks (425, of which destroyed: 201, damaged: 6, abandoned: 42, captured: 176)


In [29]:
type(name)

str

In [42]:
split = name.split(",")

In [43]:
split

['Tanks (425',
 ' of which destroyed: 201',
 ' damaged: 6',
 ' abandoned: 42',
 ' captured: 176)']

In [36]:
# To do: use pandas to convert this string to tabular data
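
One possible way to turn the heading string into structured data is sketched below, using only the standard library. The function name `parse_loss_heading` is hypothetical, and the pattern assumes that every heading follows the `Category (total, of which status: count, ...)` layout seen above.

```python
import re

def parse_loss_heading(text):
    """Split a heading like 'Tanks (425, of which destroyed: 201, ...)' into a dict."""
    match = re.match(
        r"\s*(?P<category>[^(]+?)\s*\((?P<total>\d+),\s*of which\s+(?P<rest>.*)\)\s*$",
        text,
    )
    if match is None:
        return None
    row = {"category": match.group("category"), "total": int(match.group("total"))}
    # The remainder is a list of 'status: count' pairs
    for status, count in re.findall(r"(\w+):\s*(\d+)", match.group("rest")):
        row[status] = int(count)
    return row

name = "Tanks (425, of which destroyed: 201, damaged: 6, abandoned: 42, captured: 176)"
row = parse_loss_heading(name)
print(row)
```

Applying the function to every `h3` heading gives a list of dicts, which `pandas.DataFrame` accepts directly to build a table with one row per equipment category.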

## 5. Generating .pdfs from web sites with pdfkit (websites often protected)

In [3]:
pip install pdfkit

Collecting pdfkit
  Downloading pdfkit-1.0.0-py3-none-any.whl (12 kB)
Installing collected packages: pdfkit
Successfully installed pdfkit-1.0.0
Note: you may need to restart the kernel to use updated packages.


In [4]:
# Works properly with google.com (pdfkit needs the wkhtmltopdf tool to be installed)
import pdfkit
pdfkit.from_url('http://google.com', 'out.pdf')

True

In [26]:
from glob import glob
my_pdfs = glob('*.pdf')
my_pdfs

['out.pdf', 'TTF.pdf']

In [10]:
# Converting this article from the newspaper Trouw results in an empty .pdf
import pdfkit
url = 'https://www.trouw.nl/duurzaamheid-economie/de-groene-baan-lonkt-en-deze-mensen-gooiden-het-roer-om~bfa32614/'
name = 'Gideon.pdf'
pdfkit.from_url(url, name)

True

In [27]:
# The TTF website also results in an empty .pdf
import pdfkit
url = 'https://www.theice.com/products/27996665/Dutch-TTF-Gas-Futures/data?marketId=5429405'
name = 'TTF.pdf'
pdfkit.from_url(url, name)

True

## 6. Webscraping with pandas.read_html()

Source: https://pythonbasics.org/pandas-web-scraping/
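
A minimal sketch of how `pandas.read_html()` can be used, assuming pandas is installed together with a parser such as `lxml`. The table below is made up so that the example runs without a network connection. With a real site, the URL or the downloaded HTML would be passed instead.

```python
import io
import pandas as pd

# Made-up table; replace with the HTML of a real page.
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```

Because every `<table>` on the page ends up as a separate DataFrame, it is usually worth printing `len(tables)` first and then picking the one that is needed.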