
# **Web scraping**


Web scraping, also known as web harvesting or web data extraction, is a type of data scraping used to gather information from websites.

In this session, we will cover the following concepts with the help of a business use case:
* Data acquisition through Web scraping

## **Health Care Rankings for Different European Countries** 

**Beautiful Soup** is a Python package for parsing HTML and XML, which makes it well suited to web scraping. The `requests` package downloads the page over HTTP, and `pandas` stores the scraped table as a DataFrame.

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests


- We are going to scrape data from the Wikipedia page on healthcare in Europe. The page reports rankings from the Euro Health Consumer Index on health indices such as patient rights and information, accessibility (waiting time for treatment), outcomes, range and reach of services provided, prevention, and pharmaceuticals. The first table on the page, which we scrape below, lists the EU countries with the highest life expectancy (2019). In the following code, we download the page with `requests` and then parse it with Beautiful Soup, which yields a **bs4.BeautifulSoup** object.

In [3]:
url = 'https://en.wikipedia.org/wiki/Healthcare_in_Europe' 
page = requests.get(url)
print(page.content)


b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Healthcare in Europe - Wikipedia</title>\n<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cookie=document.
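Before parsing, it can help to confirm that the download actually succeeded. The following is a minimal sketch, not part of the original notebook: the helper name `fetch_html`, the 10-second timeout, and the User-Agent string are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    # Hypothetical User-Agent string; replace it with one that identifies your project.
    headers = {"User-Agent": "health-rankings-tutorial/0.1 (example contact)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` with the `url` defined above would return the same bytes as `page.content`, but would fail loudly instead of silently handing an error page to the parser.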

In [4]:
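# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.content, 'html.parser')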
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Healthcare in Europe - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled";(function(){var cooki

- First, we must choose the table that we want to scrape. Because a webpage often contains several tables, we retrieve the `class` attribute of each table from the HTML and store these class names in a list called `lst`.

- This list `lst` has a length of 5.

- Now let us display `lst`.

- We will scrape the first table, so index 0 of `lst` gives us its class name, which in this case is "wikitable floatright sortable". We then read the table with Beautiful Soup's `find` method, passing that class name either typed out directly or taken from `lst`.
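The notebook does not show how `lst` is built, so here is one possible sketch. It runs on a small, made-up HTML snippet rather than the live page; the snippet and the variable `demo_soup` are assumptions for illustration. Beautiful Soup returns the `class` attribute as a list of words, so the words are joined back into a single string.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: three small tables.
html = """
<table class="wikitable floatright sortable"><tr><td>a</td></tr></table>
<table class="wikitable sortable"><tr><td>b</td></tr></table>
<table><tr><td>no class attribute</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

lst = []
for t in demo_soup.find_all("table"):
    classes = t.get("class")  # a list such as ['wikitable', 'sortable'], or None
    if classes:
        lst.append(" ".join(classes))

print(len(lst))
print(lst[0])

# The stored name can then be passed straight to find().
first_table = demo_soup.find("table", {"class": lst[0]})
```

On the real page the same loop would run over `soup` instead of `demo_soup`; the number of entries depends on how many tables carry a `class` attribute.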

In [5]:
# Pass the attribute filter as a dict (key: value), not a set
table = soup.find('table', {'class': 'wikitable floatright sortable'})


In [7]:
type(table)

bs4.element.Tag

In [8]:
table

<table class="wikitable floatright sortable" style="font-size: 90%">
<caption style="font-size:100%">EU countries with the highest life expectancy (2019)<sup class="reference" id="cite_ref-hdr-life-exp_4-0"><a href="#cite_note-hdr-life-exp-4">[4]</a></sup>
</caption>
<tbody><tr>
<th>World<br/>Rank
</th>
<th>EU<br/>Rank
</th>
<th>Country
</th>
<th colspan="3">Life expectancy<br/>at birth (years)
</th></tr>
<tr>
<td>5.
</td>
<td>1.
</td>
<td><a href="/wiki/Spain" title="Spain">Spain</a>
</td>
<td>83.4
</td></tr>
<tr>
<td>6.
</td>
<td>2.
</td>
<td><a href="/wiki/Italy" title="Italy">Italy</a>
</td>
<td>83.4
</td></tr>
<tr>
<td>11.
</td>
<td>3.
</td>
<td><a href="/wiki/Sweden" title="Sweden">Sweden</a>
</td>
<td>82.7
</td></tr>
<tr>
<td>12.
</td>
<td>4.
</td>
<td><a href="/wiki/France" title="France">France</a>
</td>
<td>82.5
</td></tr>
<tr>
<td>13.
</td>
<td>5.
</td>
<td><a href="/wiki/Malta" title="Malta">Malta</a>
</td>
<td>82.4
</td></tr>
<tr>
<td>16.
</td>
<td>6.
</td>
<td><a href="/w

- Next, we read the header cells and the data rows separately, so that we can easily build a DataFrame from them later.

In [10]:
headers= [header.text for header in table.find_all('th')]

In [11]:
headers

['WorldRank\n', 'EURank\n', 'Country\n', 'Life expectancyat birth (years)\n']
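The header strings above run words together ("WorldRank") and keep a trailing newline, because `.text` drops the `<br/>` tags without inserting a space. A possible alternative, shown here as a sketch on a copy of the header markup rather than on the live page, is `get_text` with a separator and `strip=True`. The names `header_html` and `clean_headers` are introduced only for this example.

```python
from bs4 import BeautifulSoup

# Header row copied from the table markup shown earlier.
header_html = """
<table><tr>
<th>World<br/>Rank
</th>
<th>EU<br/>Rank
</th>
<th>Country
</th>
<th colspan="3">Life expectancy<br/>at birth (years)
</th></tr></table>
"""
header_table = BeautifulSoup(header_html, "html.parser").find("table")

# Join the text fragments of each cell with a space and strip surrounding whitespace.
clean_headers = [th.get_text(" ", strip=True) for th in header_table.find_all("th")]
print(clean_headers)
# ['World Rank', 'EU Rank', 'Country', 'Life expectancy at birth (years)']
```

The rest of this session keeps the original `headers` list, so the later outputs still show the unstripped names.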

In [12]:
rows = []
for row in table.find_all('tr'):
    # Collect the text of every data cell (<td>) in this table row
    rows.append([val.text for val in row.find_all('td')])


- Now, all elements, rows, and headers are available to build the DataFrame, which we will call `df1`.

In [13]:
df1 = pd.DataFrame(rows, columns=headers)

- Let's display the first nine rows of `df1`.

In [14]:
df1.head(9)

Unnamed: 0,WorldRank\n,EURank\n,Country\n,Life expectancyat birth (years)\n
0,,,,
1,5.\n,1.\n,Spain\n,83.4\n
2,6.\n,2.\n,Italy\n,83.4\n
3,11.\n,3.\n,Sweden\n,82.7\n
4,12.\n,4.\n,France\n,82.5\n
5,13.\n,5.\n,Malta\n,82.4\n
6,16.\n,6.\n,Ireland\n,82.1\n
7,17.\n,7.\n,Netherlands\n,82.1\n
8,19.\n,8.\n,Luxembourg\n,82.1\n
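Row 0 is empty because the first `<tr>` of the table holds only `<th>` header cells, so `find_all('td')` returns nothing for it. One way to avoid the empty row at the source is to skip rows without data cells, sketched below on a shortened, two-column copy of the table markup. The names `table_html`, `demo_table`, and `demo_rows` are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table: one header row followed by two data rows.
table_html = """
<table><tbody><tr>
<th>EU<br/>Rank
</th>
<th>Country
</th></tr>
<tr>
<td>1.
</td>
<td><a href="/wiki/Spain" title="Spain">Spain</a>
</td></tr>
<tr>
<td>2.
</td>
<td><a href="/wiki/Italy" title="Italy">Italy</a>
</td></tr></tbody></table>
"""
demo_table = BeautifulSoup(table_html, "html.parser").find("table")

demo_rows = []
for tr in demo_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        demo_rows.append(cells)

print(demo_rows)
# [['1.', 'Spain'], ['2.', 'Italy']]
```

The session itself takes the other route and removes the empty row afterwards with `dropna`.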


In [15]:
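def preproc(dat):
    # Drop rows in which every cell is missing (the header row produced no <td> cells)
    dat.dropna(axis=0, how='all', inplace=True)
    # Remove the newline characters left over from the HTML cell text
    dat.replace("\n", "", regex=True, inplace=True)
    return dat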

In [16]:
df1 = preproc(df1)

In [17]:
df1

Unnamed: 0,WorldRank\n,EURank\n,Country\n,Life expectancyat birth (years)\n
1,5.0,1.0,Spain,83.4
2,6.0,2.0,Italy,83.4
3,11.0,3.0,Sweden,82.7
4,12.0,4.0,France,82.5
5,13.0,5.0,Malta,82.4
6,16.0,6.0,Ireland,82.1
7,17.0,7.0,Netherlands,82.1
8,19.0,8.0,Luxembourg,82.1
9,20.0,9.0,Greece,82.1
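The column labels above still end in a newline, since `preproc` cleans only the cell values, and the scraped values are text rather than numbers. A possible follow-up step is sketched below on a two-row sample that mimics the cleaned cells; the sample frame `demo` and the choice of integer and float types are assumptions, not part of the original notebook.

```python
import pandas as pd

# Two-row sample mimicking df1 after preproc: string cells, labels ending in "\n".
demo = pd.DataFrame(
    [["5.", "1.", "Spain", "83.4"], ["6.", "2.", "Italy", "83.4"]],
    columns=["WorldRank\n", "EURank\n", "Country\n", "Life expectancyat birth (years)\n"],
)

# Strip whitespace (including the trailing newline) from every column label.
demo.columns = demo.columns.str.strip()

# Ranks look like "5."; drop the trailing dot before converting to numbers.
for col in ["WorldRank", "EURank"]:
    demo[col] = pd.to_numeric(demo[col].str.rstrip("."))

# Life expectancy is a decimal number stored as text.
life_col = "Life expectancyat birth (years)"
demo[life_col] = pd.to_numeric(demo[life_col])

print(demo.dtypes)
```

With numeric columns in place, operations such as sorting by rank or averaging life expectancy behave as expected instead of comparing strings.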
