# Web Scraping

Web scraping, also known as web harvesting or web data extraction, is a type of data scraping used to gather information from websites.

In this notebook, we acquire data through web scraping.

We are going to scrape data from Wikipedia. The data indicate rankings on different health indices such as patient rights and
information, accessibility (waiting time for treatment), outcomes, range, the reach of services provided, prevention, and
pharmaceuticals.

The data are from the Euro Health Consumer Index. In the following code, we download the page and use Beautiful Soup
to parse the HTML into a `bs4.BeautifulSoup` object.

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
## Import the required libraries for web scraping

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
import re
import urllib.request as urllib2
from datetime import datetime
import os
import sys
import matplotlib.pyplot as plt
import matplotlib.image as mpimg


In [4]:
## Get the data from the URL and parse the HTML
url = 'https://en.wikipedia.org/wiki/Healthcare_in_Europe'
r = requests.get(url)
HCE = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
type(HCE)

bs4.BeautifulSoup
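
A slightly more defensive version of the request step can be sketched as follows. The helper names `fetch_html` and `make_soup`, the 30-second timeout and the User-Agent string are illustrative assumptions, not part of the original notebook. The sketch names the parser explicitly (`html.parser`, which ships with Python) so the result does not depend on which parsers happen to be installed.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical User-Agent string; replace it with one that identifies your project.
HEADERS = {"User-Agent": "web-scraping-notebook/0.1 (educational example)"}


def fetch_html(url, timeout=30):
    """Download a page and return its HTML, raising on HTTP errors (4xx/5xx)."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()
    return response.text


def make_soup(html):
    """Parse HTML with an explicitly named parser."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration with a tiny inline document (no network access needed).
demo = make_soup("<html><body><table class='wikitable'><tr><td>1</td></tr></table></body></html>")
print(type(demo).__name__)            # BeautifulSoup
print(demo.find("table")["class"])    # ['wikitable']
```

With network access, `make_soup(fetch_html(url))` would play the role of `HCE` above.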

In [13]:
## Retrieve the lines of the raw HTML that contain 'table class'
webpage = urllib2.urlopen(url)
htmlpage = webpage.readlines()
#print(htmlpage)
lst = []

for line in htmlpage:
    line = str(line).rstrip()
    if re.search('table class', line):
        lst.append(line)
print(len(lst))

5


In [15]:
## Display the table list
lst

['b\'<table class="wikitable floatright sortable" style="font-size: 90%">\\n\'',
 'b\'<div class="navbox-styles"><style data-mw-deduplicate="TemplateStyles:r1129693374">.mw-parser-output .hlist dl, ... (output truncated)
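
The line-by-line search also picks up lines that merely contain the text `table class` somewhere, such as the `<div class="navbox-styles">` line above. As a hedged alternative, the class names can be read from the parsed document instead. The inline HTML below is a made-up stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a downloaded page (hypothetical markup).
html = """
<table class="wikitable floatright sortable"><tr><th>Country</th></tr></table>
<table class="nowraplinks navbox-inner"><tr><td>links</td></tr></table>
<table><tr><td>no class</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the class attribute of every <table>; tables without one give an empty string.
table_classes = [" ".join(t.get("class", [])) for t in soup.find_all("table")]
print(table_classes)
# ['wikitable floatright sortable', 'nowraplinks navbox-inner', '']
```

Only real `<table>` elements are listed, so no filtering of unrelated lines is needed.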

In [16]:
## Read the table with 'find', matching the class name 'wikitable floatright sortable'

table = HCE.find('table', {'class': 'wikitable floatright sortable'})


In [17]:
type(table)

bs4.element.Tag
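
Matching on the full class string only succeeds when the attribute is written exactly that way. A CSS selector is one way to make the lookup independent of class order; the sketch below uses made-up markup in which the classes appear in a different order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as above, but in a different order.
html = """
<table class="sortable wikitable floatright"><tr><td>Spain</td></tr></table>
<table class="navbox"><tr><td>other</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# The exact-string lookup does not match, because the order differs.
exact = soup.find("table", {"class": "wikitable floatright sortable"})
print(exact)  # None

# select_one takes a CSS selector; every listed class must be present, in any order.
table = soup.select_one("table.wikitable.sortable")
print(table.td.text)  # Spain
```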

>- Alternatively, this step can be automated: take the first matching line from the list and use a regular expression to extract the first double-quoted string, which is the class name.

In [22]:
x = lst[0]
extr = re.findall(r'"([^"]*)"', x)  # all double-quoted strings in the line
print(extr[0])
table = HCE.find('table', {'class': extr[0]})

wikitable floatright sortable


In [19]:
type(table)


bs4.element.Tag
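
To make the regular expression concrete, here is a small standalone sketch that applies it to the first matched line shown earlier. It relies on the assumption that `class` is the first quoted attribute in the tag.

```python
import re

# One matched line, written as the string that str(bytes_line) produces in the loop.
line = str(b'<table class="wikitable floatright sortable" style="font-size: 90%">\n')

# "([^"]*)" captures the text between each pair of double quotes.
quoted = re.findall(r'"([^"]*)"', line)
print(quoted)
# ['wikitable floatright sortable', 'font-size: 90%']
```

The first element is the class string; later elements are the values of the remaining attributes.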

In [24]:
## Reading the header from the table

headers = [header.text for header in table.find_all('th')]

In [25]:
headers

['WorldRank\n', 'EURank\n', 'Country\n', 'Life expectancyat birth (years)\n']
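
The trailing `\n` and the fused words (`Life expectancyat`) can also be avoided at extraction time. The sketch below assumes the header cells contain a `<br/>` line break between the words, which is a guess about the page markup; `get_text` with a separator and `strip=True` then yields clean labels.

```python
from bs4 import BeautifulSoup

# Hypothetical header row; the <br/> tags are an assumption about the page markup.
html = (
    "<table><tr>"
    "<th>World<br/>Rank\n</th>"
    "<th>Life expectancy<br/>at birth (years)\n</th>"
    "</tr></table>"
)
table = BeautifulSoup(html, "html.parser")

raw = [th.text for th in table.find_all("th")]
clean = [th.get_text(" ", strip=True) for th in table.find_all("th")]
print(raw)    # ['WorldRank\n', 'Life expectancyat birth (years)\n']
print(clean)  # ['World Rank', 'Life expectancy at birth (years)']
```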

In [27]:
## Read the rows from the table (the header row has no <td> cells, so it yields an empty list)
rows = []
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])

In [29]:
rows

[[],
 ['5.\n', '1.\n', 'Spain\n', '83.4\n'],
 ['6.\n', '2.\n', 'Italy\n', '83.4\n'],
 ['11.\n', '3.\n', 'Sweden\n', '82.7\n'],
 ['12.\n', '4.\n', 'France\n', '82.5\n'],
 ['13.\n', '5.\n', 'Malta\n', '82.4\n'],
 ['16.\n', '6.\n', 'Ireland\n', '82.1\n'],
 ['17.\n', '7.\n', 'Netherlands\n', '82.1\n'],
 ['19.\n', '8.\n', 'Luxembourg\n', '82.1\n'],
 ['20.\n', '9.\n', 'Greece\n', '82.1\n']]
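
The empty first entry comes from the header row, which contains only `<th>` cells. A minimal sketch of skipping such rows while also stripping whitespace, using made-up markup that carries two of the values shown above:

```python
from bs4 import BeautifulSoup

# Small stand-in table with one header row and two data rows.
html = """
<table>
<tr><th>Country</th><th>Life expectancy</th></tr>
<tr><td>Spain
</td><td>83.4
</td></tr>
<tr><td>Italy
</td><td>83.4
</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").table

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # header rows contain only <th>, so their cell list is empty
        rows.append(cells)
print(rows)  # [['Spain', '83.4'], ['Italy', '83.4']]
```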

In [30]:
## Creating the dataframe by using header and rows

df1 = pd.DataFrame(rows, columns=headers)

In [32]:
## Display the first five rows of df1
df1.head()

Unnamed: 0,WorldRank\n,EURank\n,Country\n,Life expectancyat birth (years)\n
0,,,,
1,5.\n,1.\n,Spain\n,83.4\n
2,6.\n,2.\n,Italy\n,83.4\n
3,11.\n,3.\n,Sweden\n,82.7\n
4,12.\n,4.\n,France\n,82.5\n


## Health Expenditure

>- As with the previous page (**Health Care Rankings for Different European Countries**), we repeat the same steps for this page (**Health Expenditure**).

In [36]:
## Get the table class value

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita'
r = requests.get(url)
HEE = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly
webpage = urllib2.urlopen(url)
htmlpage = webpage.readlines()
lst = []
for line in htmlpage:
    line = str(line).rstrip()
    if re.search('table class', line):
        lst.append(line)
x = lst[1]  # on this page, the second matching line is the table we want
print(x)

b'<table class="wikitable sortable static-row-numbers plainrowheaders srn-white-background" border="1" style="text-align:right;">\n'


In [38]:
## Create the dataframe from the table's headers and rows

extr = re.findall(r'"([^"]*)"', x)
table = HEE.find('table', {'class': extr[0]})
headers = [header.text.replace("\n", "") for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text for val in row.find_all('td')])
df2 = pd.DataFrame(rows, columns=headers)

In [40]:
## Display the first five rows of the data
df2.head()

Unnamed: 0,Location,2018,2019,2020,2021
0,,,,,
1,Australia *\n,"5,194\n","5,130\n","5,627\n",..\n
2,Austria *\n,"5,519\n","5,624\n","5,883\n","6,693\n"
3,Belgium *\n,"5,315\n","5,353\n","5,407\n",..\n
4,Canada *\n,"5,308\n","5,190\n","5,828\n","5,905\n"


If we look at the DataFrame, we can see that some issues still prevent numeric computations.
>- There are undesired newline characters ('\n').
>- Country names carry a trailing footnote marker (' *').
>- The thousands separators (,) should be removed.
>- Cells with non-numeric characters ('x') should become NaN.
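
To illustrate these clean-up steps on individual values, here is a minimal pure-Python sketch. The helper name `clean_cell` and the set of placeholder strings are assumptions made for this example; the notebook itself does the work on whole DataFrames.

```python
import re


def clean_cell(value):
    """Return a float for numeric-looking text, None for placeholders, else the cleaned string."""
    text = value.strip()                  # drop the trailing "\n" and surrounding spaces
    text = re.sub(r"\s*\*$", "", text)    # drop a trailing footnote marker such as " *"
    if text in {"", "..", "x"}:           # assumed placeholder values for missing data
        return None
    candidate = text.replace(",", "")     # "5,194" -> "5194"
    try:
        return float(candidate)
    except ValueError:
        return text


print([clean_cell(v) for v in ["Australia *\n", "5,194\n", "..\n"]])
# ['Australia', 5194.0, None]
```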

In [42]:
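## Create the 'preproc' function to clean and format the data

def preproc(dat):
    # Drop rows that are entirely empty (the header row produced no <td> cells)
    dat.dropna(axis=0, how='all', inplace=True)
    # Remove newlines from the column names and from the cell values
    dat.columns = dat.columns.str.replace("\n", "")
    dat.replace(["\n"], [""], regex=True, inplace=True)
    # Remove a trailing " *" footnote marker and the thousands separators
    dat.replace([r"\s\*$"], [""], regex=True, inplace=True)
    dat.replace([","], [""], regex=True, inplace=True)
    # Set cells that contain a stand-alone single letter (such as 'x') to NaN
    dat.replace(r"\b[a-zA-Z]\b", np.nan, regex=True, inplace=True)
    # Remove one leading whitespace character
    dat.replace([r"^\s"], [""], regex=True, inplace=True)
    # Convert each column to numeric if all of its values parse; otherwise leave it unchanged
    dat = dat.apply(pd.to_numeric, errors='ignore')
    return dat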


In [44]:
## Apply the function to both dataframes
df1 = preproc(df1)
df2 = preproc(df2)

In [45]:
df2.head()

Unnamed: 0,Location,2018,2019,2020,2021
1,Australia,5194,5130,5627,..
2,Austria,5519,5624,5883,6693
3,Belgium,5315,5353,5407,..
4,Canada,5308,5190,5828,5905
5,Chile,2281,2297,2413,2608


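The newline characters, footnote markers and thousands separators are gone. Note that the `2021` column still contains the placeholder `..` for missing values, so it cannot yet be treated as numeric.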

In [46]:
## Check for null values in both dataframes (df1, df2)

print(df1.isnull().sum().sum())
print(df2.isnull().sum().sum())

0
0
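
A count of zero here does not mean that no values are missing: the `..` placeholder is an ordinary string, so `isnull` does not see it. The sketch below shows this on a tiny frame built from values displayed above; replacing the placeholder with NaN is one possible way to make it visible.

```python
import numpy as np
import pandas as pd

# Small frame using values shown in the output above.
df = pd.DataFrame({"Country": ["Australia", "Austria"], "2021": ["..", "6693"]})

n_null = df.isnull().sum().sum()
n_placeholder = (df == "..").sum().sum()
marked = df.replace("..", np.nan)
n_null_marked = marked.isnull().sum().sum()

print(n_null)          # 0: ".." is not counted as null
print(n_placeholder)   # 1
print(n_null_marked)   # 1
```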


In [47]:
## df1 data types
df1.dtypes

WorldRank                          float64
EURank                             float64
Country                             object
Life expectancyat birth (years)    float64
dtype: object

In [49]:
## df2 data types
df2.dtypes

Location    object
2018         int64
2019         int64
2020         int64
2021        object
dtype: object
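
The `2021` column stays `object` because a single non-numeric value (`..`) makes `errors='ignore'` return the column unchanged. One hedged alternative is `errors='coerce'`, which turns unparseable values into NaN; recent pandas releases also deprecate `errors='ignore'`. The frame and the `year_cols` list below are illustrative.

```python
import pandas as pd

# Small frame using values shown in the output above.
df = pd.DataFrame(
    {"Country": ["Australia", "Austria"], "2020": ["5627", "5883"], "2021": ["..", "6693"]}
)

# Convert only the year columns; values that cannot be parsed become NaN.
year_cols = ["2020", "2021"]
df[year_cols] = df[year_cols].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
print(df)
```

A column that gains a NaN becomes `float64`, since NumPy integer columns cannot hold missing values.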

In [51]:
## Shorten the column names by renaming them

df1.columns = ['WorldRank', 'EURank', 'Country', 'Life expectancy in (years)']
df2.columns = ['Country', '2018', '2019', '2020', '2021']
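
Assigning a full list to `columns` depends on the column order staying the same. As a sketch of an alternative, `rename` with a mapping changes only the named columns and leaves the rest untouched.

```python
import pandas as pd

# Small frame using values shown in the output above.
df = pd.DataFrame({"Location": ["Australia"], "2018": [5194]})

# Rename by name rather than by position.
df = df.rename(columns={"Location": "Country"})
print(list(df.columns))  # ['Country', '2018']
```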

**Inspecting the final dataframes**

In [52]:
df1.head()

Unnamed: 0,WorldRank,EURank,Country,Life expectancy in (years)
1,5.0,1.0,Spain,83.4
2,6.0,2.0,Italy,83.4
3,11.0,3.0,Sweden,82.7
4,12.0,4.0,France,82.5
5,13.0,5.0,Malta,82.4


In [53]:
df2.head()

Unnamed: 0,Country,2018,2019,2020,2021
1,Australia,5194,5130,5627,..
2,Austria,5519,5624,5883,6693
3,Belgium,5315,5353,5407,..
4,Canada,5308,5190,5828,5905
5,Chile,2281,2297,2413,2608


In [54]:
## Merge both dataframes on the Country column

pd.merge(df1, df2, how='left', on='Country').head()

Unnamed: 0,WorldRank,EURank,Country,Life expectancy in (years),2018,2019,2020,2021
0,5.0,1.0,Spain,83.4,3427.0,3523.0,3718.0,..
1,6.0,2.0,Italy,83.4,3496.0,3565.0,3747.0,4038
2,11.0,3.0,Sweden,82.7,5419.0,5388.0,5757.0,6262
3,12.0,4.0,France,82.5,5099.0,5168.0,5468.0,6115
4,13.0,5.0,Malta,82.4,,,,
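
Malta has no expenditure values because a left join keeps every row of `df1` and fills unmatched columns with NaN, which is also why the year columns appear as floats. To list such unmatched countries explicitly, `pd.merge` accepts `indicator=True`. The two small frames below are stand-ins built from values shown above.

```python
import pandas as pd

# Stand-ins for df1 and df2, using values shown in the outputs above.
left = pd.DataFrame({"Country": ["Spain", "Malta"], "EURank": [1, 5]})
right = pd.DataFrame({"Country": ["Spain"], "2020": [3718]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(left, right, how="left", on="Country", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)  # ['Malta']
```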
