<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Example-01:-Extract-text" data-toc-modified-id="Example-01:-Extract-text-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Example 01: Extract text</a></span><ul class="toc-item"><li><span><a href="#Title?" data-toc-modified-id="Title?-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Title?</a></span></li><li><span><a href="#Text-per-section-(e.g.-1.-What-is-cryptocurrency?)" data-toc-modified-id="Text-per-section-(e.g.-1.-What-is-cryptocurrency?)-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Text per section (e.g. 1. What is cryptocurrency?)</a></span></li></ul></li><li><span><a href="#Example-02:-Extract-table-info" data-toc-modified-id="Example-02:-Extract-table-info-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Example 02: Extract table info</a></span></li><li><span><a href="#Example-03:-Extract-information-from-hyperlink" data-toc-modified-id="Example-03:-Extract-information-from-hyperlink-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Example 03: Extract information from hyperlink</a></span></li></ul></li></ul></div>

# Introduction

In [1]:
# importing packages

import requests
from bs4 import BeautifulSoup
import pandas as pd

## Example 01: Extract text

In [2]:
url_01 = "https://www.nerdwallet.com/article/investing/cryptocurrency-7-things-to-know#:~:text=A%20cryptocurrency%20(or%20%E2%80%9Ccrypto%E2%80%9D,sell%20or%20trade%20them%20securely."

In [3]:
# Send request and catch response
response = requests.get(url_01)

# get the content of the response
content = response.content

# parse webpage
parser = BeautifulSoup(content, 'lxml')

`parser` is a `BeautifulSoup` object, which represents the document as a nested data structure.

The `prettify()` method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, which makes the tree structure much easier to visualize.
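
To make the idea of a nested data structure concrete, here is a minimal sketch on a small, made-up HTML string (the snippet and the variable names are assumptions for illustration, not part of the scraped page). It uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = "<html><head><title>Demo page</title></head><body><p>Hello <b>world</b></p></body></html>"

# "html.parser" ships with Python, so no extra parser is needed here
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes, following the nesting of the document
title_text = soup.head.title.text
bold_text = soup.body.p.b.text

# prettify() returns a string with tags and text on separate, indented lines
pretty = soup.prettify()
print(pretty)
```

Each attribute access (`soup.head.title`) returns the first matching tag below the current one, which is convenient for unique tags such as `<title>`.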

In [4]:
# make it easier to visualize

print(parser.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <title>
   What Is Cryptocurrency? Here’s What You Should Know - NerdWallet
  </title>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <link as="style" href="https://www.nerdwallet.com/cdn/apps/prod/global-markup/nds.992df188256af94dc2a7.css" rel="preload"/>
  <link as="style" href="https://www.nerdwallet.com/cdn/apps/prod/global-markup/nav.992df188256af94dc2a7.css" rel="preload"/>
  <link as="style" href="https://www.nerdwallet.com/cdn/apps/prod/article-client/build/css/app.2fc6d537a77322b03f6c.css" rel="preload"/>
  <link as="style" href="https://www.nerdwallet.com/cdn/apps/prod/article-client/build/css/chunks/components/table-of-contents.bf46fe98c0f46c9ebdae.css" rel="preload"/>
  <link as="style" href="https://www.nerdwallet.com/cdn/apps/prod/article-client/build/css/chunks/components/mini-product-card.2d9241cf96498cba430d.css" rel="preload"/>
  <link as="style" href="https://www.nerdwa

In [5]:
def parse_website(url):
    """ 
    Parse content of a website
    
    Args:
        url (str): URL of the website whose content we want to access
        
    Return:
        parser: representation of the document as a nested data structure.
    """
    # Send request and catch response
    response = requests.get(url)

    # get the content of the response
    content = response.content

    # parse webpage
    parser = BeautifulSoup(content, "lxml")
    
    return parser  

In [6]:
parser_01 = parse_website(url_01)

### Title?

In [7]:
# access title of the web page
title = parser_01.title

# obtain the text between the tags
title = title.text
title

'What Is Cryptocurrency? Here’s What You Should Know - NerdWallet'

### Text per section (e.g. 1. What is cryptocurrency?)

![](../images/crypto_currency_section.png)

In [8]:
subtitles = parser_01.find_all("span", class_ = "_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p")

In [9]:
texts = parser_01.find_all("span", class_="_2GMChG _3-to_p")
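
The two `find_all` calls above depend on how Beautiful Soup matches `class_`. A string containing spaces is compared against the full value of the `class` attribute, whereas a single class name matches any tag that carries that class. The sketch below shows the difference on a made-up snippet (the HTML and the class names are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one heading span and two body spans
html = """
<span class="big bold body">1. Heading</span>
<span class="bold body">First paragraph</span>
<span class="body bold">Second paragraph</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string must equal the class attribute exactly, order included
exact = soup.find_all("span", class_="bold body")

# A single class name matches every tag that has that class
single = soup.find_all("span", class_="body")

# A CSS selector matches tags that have both classes, in any order
css = soup.select("span.bold.body")

print([s.text for s in exact])
print(len(single), len(css))
```

This is why the shorter class string selects only the body text spans and not the subtitle spans, which carry additional classes.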

In [10]:
subtitles

[<span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">1. Cryptocurrency definition</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">2. How to buy cryptocurrency</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">3. Best cryptocurrencies by market capitalization</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">4. Keeping crypto safe</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">5. Pros and cons of cryptocurrency</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">6. Crypto investing guidelines</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">7. Legality of cryptocurrencies</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">1. Cryptocurrency definition</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">2. How to buy cryptocurrency</span>,
 <span class="_23_GlS _3ZRnke _2McN3Z _2GMChG _3-to_p">3. Best cryptocurrencies by market capitalization</span>,
 <span class="_23_GlS _3ZRnke _2M

In [11]:
text_01 = texts[0:6]

In [12]:
text_01

[<span class="_2GMChG _3-to_p">Cryptocurrencies are digital assets created using computer networking software that enables secure trading and ownership.</span>,
 <span class="_2GMChG _3-to_p"> </span>,
 <span class="_2GMChG _3-to_p">Bitcoin</span>,
 <span class="_2GMChG _3-to_p"> and most other cryptocurrencies are supported by a technology known as blockchain, which maintains a tamper-resistant record of transactions and keeps track of who owns what. Public blockchains are usually decentralized, which means they operate without a central authority such as a bank or government.</span>,
 <span class="_2GMChG _3-to_p">The term cryptocurrencies comes from the cryptographic processes that developers have put in place to guard against fraud. These innovations addressed a problem faced by previous efforts to create purely digital currencies: how to prevent people from making copies of their holdings and attempting to spend them twice.</span>,
 <span class="_2GMChG _3-to_p">Individual units o

In [13]:
text_01 = text_01[0:4]

Just some cleaning...

Notice that we use the [`unidecode` module](https://pypi.org/project/Unidecode/) to convert the non-ASCII character `\xa0` (a non-breaking space) to its closest ASCII equivalent. In this case a simple `replace` would do the job, but `unidecode` is a more general approach if the extracted text contains more cases like this.

In [14]:
# !pip install Unidecode

In [15]:
import unidecode

# Remove None values, if there are any
text_01 = list(filter(None, text_01))
# Keep only the text of each tag and transliterate non-ASCII characters to ASCII
text_01 = [unidecode.unidecode(txt.text) for txt in text_01]
text_01

['Cryptocurrencies are digital assets created using computer networking software that enables secure trading and ownership.',
 ' ',
 'Bitcoin',
 ' and most other cryptocurrencies are supported by a technology known as blockchain, which maintains a tamper-resistant record of transactions and keeps track of who owns what. Public blockchains are usually decentralized, which means they operate without a central authority such as a bank or government.']

In [16]:
text_01 = ' '.join(text_01)
text_01

'Cryptocurrencies are digital assets created using computer networking software that enables secure trading and ownership.   Bitcoin  and most other cryptocurrencies are supported by a technology known as blockchain, which maintains a tamper-resistant record of transactions and keeps track of who owns what. Public blockchains are usually decentralized, which means they operate without a central authority such as a bank or government.'
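
The joined string still contains runs of several spaces. As a hedged sketch of the cleaning step using only the standard library, the example below handles `\xa0` in two ways and then collapses repeated whitespace. The fragments are made up for illustration, and `unicodedata.normalize` is swapped in here as an alternative to `unidecode`, not as the method used in this notebook. Unlike `unidecode`, it does not transliterate accented letters to ASCII.

```python
import unicodedata

# Hypothetical fragments, mimicking text pulled out of several <span> tags
fragments = [
    "Cryptocurrencies are digital\xa0assets.",
    " ",
    "Bitcoin",
    " and most others use blockchain.",
]

# Option 1: replace the non-breaking space explicitly
replaced = [frag.replace("\xa0", " ") for frag in fragments]

# Option 2: NFKC normalization maps U+00A0 to a regular space
normalized = [unicodedata.normalize("NFKC", frag) for frag in fragments]

# Join the fragments, then collapse runs of whitespace into single spaces
joined = " ".join(normalized)
clean = " ".join(joined.split())
print(clean)
```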

## Example 02: Extract table info

In [17]:
url_02 = "https://www.worldometers.info/population/countries-in-the-eu-by-population/"

In [18]:
parser_02 = parse_website(url_02)

In [19]:
print(parser_02.prettify())

<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Countries in the EU by Population (2022) - Worldometer
  </title>
  <meta content="List of countries the European Union ranked by population, from the most populous. Growth rate, median age, fertility rate, area, density, population density, urbanization, urban population, share of world population." name="description"/>
  <link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="/favicon/apple-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="/favicon/apple-icon-72x72.png" rel="apple-touch-icon" siz

![](../images/population_EU.png)

In [20]:
# Obtain information from tag <table>
table = parser_02.find('table', id='example2')
table

<table cellspacing="0" class="table table-striped table-bordered" id="example2" width="100%"> <thead> <tr> <th>#</th> <th>Country (or dependency)</th> <th>Population<br/> (2020)</th> <th>Yearly<br/> Change</th> <th>Net<br/> Change</th> <th>Density<br/> (P/Km²)</th> <th>Land Area<br/> (Km²)</th> <th>Migrants<br/> (net)</th> <th>Fert.<br/> Rate</th> <th>Med.<br/> Age</th> <th>Urban<br/> Pop %</th> <th>World<br/> Share</th> </tr> </thead> <tbody> <tr> <td>1</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/germany-population/">Germany</a></td> <td style="font-weight: bold;">83,783,942</td> <td>0.32 %</td> <td>266,897</td> <td>240</td> <td>348,560</td> <td>543,822</td> <td>1.6</td> <td>46</td> <td>76 %</td> <td>1.07 %</td> </tr> <tr> <td>2</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/france-population/">France</a></td> <td style="font-weight: bold;">65,273,511</td> <td>0.22 %</td> <td>143,783</td

In [21]:
print(table.prettify())

<table cellspacing="0" class="table table-striped table-bordered" id="example2" width="100%">
 <thead>
  <tr>
   <th>
    #
   </th>
   <th>
    Country (or dependency)
   </th>
   <th>
    Population
    <br/>
    (2020)
   </th>
   <th>
    Yearly
    <br/>
    Change
   </th>
   <th>
    Net
    <br/>
    Change
   </th>
   <th>
    Density
    <br/>
    (P/Km²)
   </th>
   <th>
    Land Area
    <br/>
    (Km²)
   </th>
   <th>
    Migrants
    <br/>
    (net)
   </th>
   <th>
    Fert.
    <br/>
    Rate
   </th>
   <th>
    Med.
    <br/>
    Age
   </th>
   <th>
    Urban
    <br/>
    Pop %
   </th>
   <th>
    World
    <br/>
    Share
   </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>
    1
   </td>
   <td style="font-weight: bold; font-size:15px; text-align:left">
    <a href="/world-population/germany-population/">
     Germany
    </a>
   </td>
   <td style="font-weight: bold;">
    83,783,942
   </td>
   <td>
    0.32 %
   </td>
   <td>
    266,897
   </td>
   <td>
    24

In [22]:
# Obtain the column names from the <th> tags
list_col = table.find_all('th')
list_col = [item.text.strip() for item in list_col]
list_col

['#',
 'Country (or dependency)',
 'Population (2020)',
 'Yearly Change',
 'Net Change',
 'Density (P/Km²)',
 'Land Area (Km²)',
 'Migrants (net)',
 'Fert. Rate',
 'Med. Age',
 'Urban Pop %',
 'World Share']

In [23]:
# Create a dataframe
EU_population_data = pd.DataFrame(columns = list_col)

From the prettified table we can see that each row sits inside a `<tr>` tag and each cell inside a `<td>` tag. Looping over the `tr` tags, and over the `td` tags within each of them, lets us fill the dataframe.

In [24]:
# Fill EU_population_data row by row, skipping the header row
for j in table.find_all('tr')[1:]:
    row_data = j.find_all('td')
    # keep only the text of each cell
    row = [i.text for i in row_data]
    # append the row at the next free index
    length = len(EU_population_data)
    EU_population_data.loc[length] = row

In [25]:
EU_population_data

|   | # | Country (or dependency) | Population (2020) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Germany | 83783942 | 0.32 % | 266897 | 240 | 348560 | 543822 | 1.6 | 46 | 76 % | 1.07 % |
| 1 | 2 | France | 65273511 | 0.22 % | 143783 | 119 | 547557 | 36527 | 1.9 | 42 | 82 % | 0.84 % |
| 2 | 3 | Italy | 60461826 | -0.15 % | -88249 | 206 | 294140 | 148943 | 1.3 | 47 | 69 % | 0.78 % |
| 3 | 4 | Spain | 46754778 | 0.04 % | 18002 | 94 | 498800 | 40000 | 1.3 | 45 | 80 % | 0.60 % |
| 4 | 5 | Poland | 37846611 | -0.11 % | -41157 | 124 | 306230 | -29395 | 1.4 | 42 | 60 % | 0.49 % |
| 5 | 6 | Romania | 19237691 | -0.66 % | -126866 | 84 | 230170 | -73999 | 1.6 | 43 | 55 % | 0.25 % |
| 6 | 7 | Netherlands | 17134872 | 0.22 % | 37742 | 508 | 33720 | 16000 | 1.7 | 43 | 92 % | 0.22 % |
| 7 | 8 | Belgium | 11589623 | 0.44 % | 50295 | 383 | 30280 | 48000 | 1.7 | 42 | 98 % | 0.15 % |
| 8 | 9 | Czech Republic (Czechia) | 10708981 | 0.18 % | 19772 | 139 | 77240 | 22011 | 1.6 | 43 | 74 % | 0.14 % |
| 9 | 10 | Greece | 10423054 | -0.48 % | -50401 | 81 | 128900 | -16000 | 1.3 | 46 | 85 % | 0.13 % |
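
Growing a dataframe one row at a time with `.loc` is fine for a small table. A common alternative is to collect all rows in a list and build the dataframe once. The sketch below does this on a made-up two-row table with the same shape as the scraped one (the HTML, the `demo` id, and the variable names are assumptions), and also converts the population column to integers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table, shaped like the one scraped above
html = """
<table id="demo">
 <thead><tr><th>#</th><th>Country</th><th>Population<br/> (2020)</th></tr></thead>
 <tbody>
  <tr><td>1</td><td><a href="/de/">Germany</a></td><td>83,783,942</td></tr>
  <tr><td>2</td><td><a href="/fr/">France</a></td><td>65,273,511</td></tr>
 </tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", id="demo")

# Column names: join the text pieces around <br/> with a single space
columns = [th.get_text(" ", strip=True) for th in table.find_all("th")]

# One list of cell texts per body row
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

# Build the dataframe in a single step
df = pd.DataFrame(rows, columns=columns)

# Scraped cells are strings; drop thousands separators before converting
df["Population (2020)"] = (
    df["Population (2020)"].str.replace(",", "", regex=False).astype(int)
)
print(df)
```

Converting the text columns to numbers early makes later sorting and aggregation behave as expected.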


## Example 03: Extract information from hyperlink

Applying web scraping to [`https://jadsmkbdatalab.nl/voorbeeldcases/`](https://jadsmkbdatalab.nl/voorbeeldcases/).

Right-click on the page and choose `Inspect` to examine its elements. Notice that the information we are looking for sits inside `<h3>` tags.

![](../images/mkb_inspect_page.PNG)

In [26]:
url_03 = 'https://jadsmkbdatalab.nl/voorbeeldcases/'
response = requests.get(url_03)
response

<Response [403]>

In this case, a request sent without headers returns HTTP 403, a status code meaning that access to the requested resource is forbidden. Sending a browser-like `User-Agent` header resolves this here.

In [27]:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url_03, headers=headers)
response

<Response [200]>

The remaining steps are the same as before.

In [28]:
# get the content of the response
content = response.content

# parse webpage
parser_03 = BeautifulSoup(content, "lxml")

In [29]:
print(parser_03.prettify())

<!DOCTYPE html>
<html lang="nl" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <!-- Search Engine Optimization by Rank Math - https://s.rankmath.com/home -->
  <title>
   Voorbeeldcases - JADS MKB Datalab
  </title>
  <meta content="follow, index, max-snippet:-1, max-video-preview:-1, max-image-preview:large" name="robots"/>
  <link href="https://jadsmkbdatalab.nl/voorbeeldcases/" rel="canonical"/>
  <meta content="nl_NL" property="og:locale"/>
  <meta content="article" property="og:type"/>
  <meta content="Voorbeeldcases - JADS MKB Datalab" property="og:title"/>
  <meta content='Lees hier meer over bedrijven die we geholpen hebben, guidelines die jouw organisatie verder kunnen helpen en tutorials om zelf mee aan de slag te gaan! Bekijk onze voorbeeldcases Voorbeeld cases Een model voor Welvaarts Weegsystemen dat voorspelt hoe vol een onder

In [30]:
links = parser_03.find_all('h3')
links

[<h3 class="uael-post__title">
 <a href="https://jadsmkbdatalab.nl/hoe-via-met-hulp-van-het-mkb-datalab-de-verkeersveiligheid-vergroot-2/" target="_self">
 				Hoe VIA met hulp van het MKB Datalab de verkeersveiligheid vergroot			</a>
 </h3>,
 <h3 class="uael-post__title">
 <a href="https://jadsmkbdatalab.nl/merford/" target="_self">
 </h3>,
 <h3 class="uael-post__title">
 <a href="https://jadsmkbdatalab.nl/welvaarts-weegsystemen/" target="_self">
 				Een kritische blik op de data van Welvaarts Weegsystemen om uiteindelijk transportkosten te besparen			</a>
 </h3>,
 <h3 class="uael-post__title">
 <a href="https://jadsmkbdatalab.nl/ravu/" target="_self">
 				Tijdbesparing en Werknemerstevredenheid verhoging middels Rooster Optimalisatie			</a>
 </h3>,
 <h3 class="uael-post__title">
 <a href="https://jadsmkbdatalab.nl/academictransfer/" target="_self">
 				Vacatures Monitoren middels een PowerBI Dashboard			</a>
 </h3>]

In [31]:
urls = []

for h3_tag in parser_03.find_all('h3'):
    a_tag = h3_tag.find('a')
    urls.append(a_tag.attrs['href'])
    
print(urls)

['https://jadsmkbdatalab.nl/hoe-via-met-hulp-van-het-mkb-datalab-de-verkeersveiligheid-vergroot-2/', 'https://jadsmkbdatalab.nl/merford/', 'https://jadsmkbdatalab.nl/welvaarts-weegsystemen/', 'https://jadsmkbdatalab.nl/ravu/', 'https://jadsmkbdatalab.nl/academictransfer/']
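
The loop above assumes that every `<h3>` contains an `<a>` tag. If one does not, `h3_tag.find('a')` returns `None` and the attribute access fails. A minimal sketch of a more defensive variant, using a CSS selector on made-up HTML (the snippet and the `example.com` URLs are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical headings; the second one has no link
html = """
<h3 class="post-title"><a href="https://example.com/case-a/">Case A</a></h3>
<h3 class="post-title">Heading without a link</h3>
<h3 class="post-title"><a href="https://example.com/case-b/">  Case B  </a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

# "h3 a[href]" selects only <a> tags inside an <h3> that have an href attribute
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("h3 a[href]")]
print(links)
```

Keeping the link text next to the URL also makes it easier to label the pages later on.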


We update the `parse_website` function so that it sends these headers with every request.

In [32]:
def parse_website(url):
    """ 
    Parse content of a website
    
    Args:
        url (str): URL of the website whose content we want to access
        
    Return:
        parser: representation of the document as a nested data structure.
    """
    # Send request and catch response
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    response = requests.get(url, headers=headers)

    # get the content of the response
    content = response.content

    # parse webpage
    parser = BeautifulSoup(content, "lxml")
    
    return parser  
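
The function above parses whatever the server sends back, including an error page. As a hedged sketch, a stricter fetch step could raise an exception on error status codes and set a timeout. The helper name `fetch` and the timeout value are assumptions, not part of the original notebook. The last lines build the request without sending it, so the header can be inspected offline.

```python
import requests

# Browser-like User-Agent, the same value used in the function above
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def fetch(url, headers=HEADERS, timeout=10):
    """Hypothetical helper: GET a page and fail loudly on 4xx/5xx responses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response


# Prepare (but do not send) a request to check which headers would go out
prepared = requests.Request(
    "GET", "https://jadsmkbdatalab.nl/voorbeeldcases/", headers=HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

With this variant, a 403 response surfaces as an exception at the request step instead of silently producing a parse tree of the error page.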

In [33]:
# parse and prettify one of the obtained urls
parser_03_0 = parse_website(urls[0])
print(parser_03_0.prettify())

<!DOCTYPE html>
<html lang="nl" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <!-- Search Engine Optimization by Rank Math - https://s.rankmath.com/home -->
  <title>
   Hoe VIA met hulp van het MKB Datalab de verkeersveiligheid vergroot - JADS MKB Datalab
  </title>
  <meta content="index, follow, max-snippet:-1, max-video-preview:-1, max-image-preview:large" name="robots"/>
  <link href="https://jadsmkbdatalab.nl/hoe-via-met-hulp-van-het-mkb-datalab-de-verkeersveiligheid-vergroot-2/" rel="canonical"/>
  <meta content="nl_NL" property="og:locale"/>
  <meta content="article" property="og:type"/>
  <meta content="Hoe VIA met hulp van het MKB Datalab de verkeersveiligheid vergroot - JADS MKB Datalab" property="og:title"/>
  <meta content='VIA is een verkeerskundig ICT-bureau dat haar software zelf ontwikkeld. De VIA Software biedt interactie

In [34]:
# find all paragraphs
paragraphs = parser_03_0.find_all('p')
paragraphs

[<p>VIA is een verkeerskundig ICT-bureau dat haar software zelf ontwikkeld. De VIA Software biedt interactieve kaarten en grafieken om inzicht te krijgen in de verkeersongevallen, gereden snelheden en wegkenmerken. In samenwerking met VIA ontwikkelde het MKB Datalab een algoritme om de verkeersveiligheid van Nederlandse gemeenten inzichtelijk te maken en te benchmarken.</p>,
 <p>Giedo Donkers van VIA kwam in contact met het MKB Datalab tijdens een inspiratiesessie: “We wilden aan de slag met Data Science en verkeersveiligheid, maar misten de kennis en technieken.” Tijdens een intake sessie werden de wensen van VIA geïnventariseerd en is ook gekeken of de beschikbare data geschikt was.</p>,
 <p>Het is de ambitie van VIA om een blauwdruk van de verkeersveiligheid per gemeente te creëren. Op die manier is het makkelijker om vergelijkingen te maken tussen gemeenten en kan de verbetering – of verslechtering – in veiligheid worden gemonitord. </p>,
 <p>Het bestaande platform van VIA boodt al

In [35]:
# Obtain text of paragraphs
for paragraph in paragraphs:
    print(paragraph.text)

VIA is een verkeerskundig ICT-bureau dat haar software zelf ontwikkeld. De VIA Software biedt interactieve kaarten en grafieken om inzicht te krijgen in de verkeersongevallen, gereden snelheden en wegkenmerken. In samenwerking met VIA ontwikkelde het MKB Datalab een algoritme om de verkeersveiligheid van Nederlandse gemeenten inzichtelijk te maken en te benchmarken.
Giedo Donkers van VIA kwam in contact met het MKB Datalab tijdens een inspiratiesessie: “We wilden aan de slag met Data Science en verkeersveiligheid, maar misten de kennis en technieken.” Tijdens een intake sessie werden de wensen van VIA geïnventariseerd en is ook gekeken of de beschikbare data geschikt was.
Het is de ambitie van VIA om een blauwdruk van de verkeersveiligheid per gemeente te creëren. Op die manier is het makkelijker om vergelijkingen te maken tussen gemeenten en kan de verbetering – of verslechtering – in veiligheid worden gemonitord. 
Het bestaande platform van VIA boodt al de mogelijkheden om dwarsdoors

In [36]:
# save the content of this page
# note: mode 'a' appends, so re-running this cell adds the text again

with open("../data/processed/text_mkb.txt", 'a', encoding="utf-8") as file:
    for paragraph in paragraphs:
        file.write(paragraph.text)


In [37]:
# for a version without empty rows or paragraphs
# with open("../data/processed/text_mkb_02.txt",'a') as file:
#     for paragraph in paragraphs:
#         file.write(paragraph.text.replace('\n','').replace('\t',' ').strip())
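
A minimal sketch of the cleaned variant, assuming the paragraph texts are already available as plain strings (the sample strings and the output file name are made up). It normalizes whitespace, drops empty paragraphs, and writes one paragraph per line with an explicit UTF-8 encoding.

```python
import tempfile
from pathlib import Path

# Hypothetical strings standing in for the `paragraph.text` values
paragraph_texts = ["Eerste alinea \n", "", "\tTweede alinea met é."]

# Collapse whitespace and drop paragraphs that end up empty
cleaned = [" ".join(text.split()) for text in paragraph_texts]
cleaned = [text for text in cleaned if text]

# Write to a temporary directory here; the notebook uses ../data/processed/
out_path = Path(tempfile.mkdtemp()) / "text_mkb_demo.txt"
out_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")

print(out_path.read_text(encoding="utf-8"))
```

Writing with `write_text` replaces the file on each run, which avoids the duplicated content that append mode produces when a cell is executed twice.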
