# Web Scraping with BeautifulSoup


## BeautifulSoup

Sooner or later, most data scientists and analysts find themselves needing to scrape data off a website.

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It creates a parse tree for parsed pages that can be used to extract data from HTML/XML, which is useful for web scraping.

Here, we'll go through the typical workflow for extracting data from an HTML web page with the BeautifulSoup library in Python, and then converting it into a format more conducive to analysis.

We will scrape [this webpage](https://www.diabetes.org.uk/about_us/news_landing_page/uk-has-worlds-5th-highest-rate-of-type-1-diabetes-in-children/list-of-countries-by-incidence-of-type-1-diabetes-ages-0-to-14), which lists type 1 diabetes incidence by country (per 100,000 individuals, for people under 15).

## Introduction
We'll start by importing the libraries we'll be using:

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

We will use `requests.get(url).text` to request the page and return its HTML as a string.

The `BeautifulSoup()` constructor will then turn that HTML into a BeautifulSoup object, using Python's built-in HTML parser (`"html.parser"`).

In [2]:
html = requests.get("https://www.diabetes.org.uk/about_us/news_landing_page/uk-has-worlds-5th-highest-rate-of-type-1-diabetes-in-children/list-of-countries-by-incidence-of-type-1-diabetes-ages-0-to-14").text

soup = BeautifulSoup(html, "html.parser")
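The request above assumes the page loads successfully. A minimal sketch of a more defensive version is shown below; `fetch_soup` and its `timeout` default are hypothetical names introduced here for illustration, not part of the original notebook. The parsing step is also run on a small inline string, since it needs no network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, for illustration only)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Parsing works on any HTML string, so it can be tried without a network call.
sample_soup = BeautifulSoup("<html><body><p>Hello</p></body></html>", "html.parser")
print(sample_soup.p.text)  # Hello
```

Calling `fetch_soup(url)` with the URL above would then return the same kind of object as `soup`, provided the site responds normally.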

We'll use the `prettify()` method to see how the tags are nested in the webpage.

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
 <head>
  <!-- Page-hiding snippet (recommended)  -->
  <style>
   .async-hide { opacity: 0 !important}
  </style>
  <script>
   (function(a,s,y,n,c,h,i,d,e){s.className+=' '+y;h.start=1*new Date;
    h.end=i=function(){s.className=s.className.replace(RegExp(' ?'+y),'')};
    (a[n]=a[n]||[]).hide=h;setTimeout(function(){i();h.end=null},c);h.timeout=c;
    })(window,document.documentElement,'async-hide','dataLayer',4000,
    {'GTM-5LGZWF7':true});
  </script>
  <!-- Modified Analytics tracking code with Optimize plugin -->
  <script>
   (function(i,s,o,g,r,a,m){i['GoogleAnalytic

We see that the table we want to extract information from is listed under `<div class="description" itemprop="description">`

## Extracting the information

Let's use the `soup.find()` method to isolate that table by extracting **only the first** tag that matches our specified arguments. In contrast, `soup.find_all()` extracts all tags that match the specified arguments; we will use it later.
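The difference between the two methods can be sketched on a small inline snippet. The HTML below is invented for illustration and is not taken from the scraped page.

```python
from bs4 import BeautifulSoup

# Inline HTML invented for illustration; it is not the scraped page.
snippet = """
<div class="description"><p>first</p></div>
<div class="description"><p>second</p></div>
<div class="other"><p>third</p></div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

# find() returns the first matching Tag, or None if nothing matches.
first_match = demo_soup.find("div", {"class": "description"})
# find_all() returns a list-like ResultSet of every matching Tag.
all_matches = demo_soup.find_all("div", {"class": "description"})

print(first_match.p.text)                   # first
print([div.p.text for div in all_matches])  # ['first', 'second']
```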

In [4]:
table = soup.find('div', {"class" : "description"})

table

<div class="description" itemprop="description"><table border="1" width="100%"><tbody><tr><th>Position</th>
<th>Country</th>
<th>Incidence<br/>
			(per 100,000)</th>
</tr><tr><td>1</td>
<td>Finland</td>
<td>57.6</td>
</tr><tr><td>2</td>
<td>Sweden</td>
<td>43.1</td>
</tr><tr><td>3</td>
<td>Saudi Arabia</td>
<td>31.4</td>
</tr><tr><td>4</td>
<td>Norway</td>
<td>27.9</td>
</tr><tr><td>5</td>
<td>United Kingdom</td>
<td>24.5</td>
</tr><tr><td>6</td>
<td>USA</td>
<td>23.7</td>
</tr><tr><td>7</td>
<td>Australia</td>
<td>22.5</td>
</tr><tr><td>8</td>
<td>Kuwait</td>
<td>22.3</td>
</tr><tr><td>9</td>
<td>Denmark</td>
<td>22.2</td>
</tr><tr><td>10</td>
<td>Canada</td>
<td>21.7</td>
</tr><tr><td>11</td>
<td>Netherlands</td>
<td>18.6</td>
</tr><tr><td>12</td>
<td>Germany</td>
<td>18</td>
</tr><tr><td>12</td>
<td>New Zealand</td>
<td>18</td>
</tr><tr><td>14</td>
<td>Poland</td>
<td>17.3</td>
</tr><tr><td>15</td>
<td>Czech Republic</td>
<td>17.2</td>
</tr><tr><td>16</td>
<td>Estonia</td>
<td>17.1<

Looking over the above output, we see that we have exactly the table we need. All that remains is to parse this HTML into a format that is easy to work with in Python. We will use the `find_all()` method to extract **all information** within the `<th>` and `<td>` tags.

Keep in mind that the `<th>` (header) and `<td>` (data) tags are nested within `<tr>` (row) tags. We will thus have to loop over all the `<tr>` tags first.

### Extracting Headers

In [5]:
header_row = [] # Initialize an empty list.


for each_row in table.find_all('tr'): # Loop over every <tr> (row) tag in the table.
    
    headers = each_row.find_all("th") # From the current row, extract all <th> tags.
    
    header_row.append(headers) # Append the (possibly empty) list of <th> tags to our list.
    
# At this point header_row is a list of lists. Only the first list within it is populated, as only one row has <th> tags.
# Let's see what the first element within it looks like:

header_row[0]

[<th>Position</th>, <th>Country</th>, <th>Incidence<br/>
 			(per 100,000)</th>]

We see above that we successfully retrieved all the `<th>` tags. Because the `for each_row in table.find_all('tr')` loop visits one `<tr>` tag at a time, the tags from each row are stored in a separate list from those of all other rows.

Now that we know what the objects we retrieved from the above code look like, we'll repeat the same code, but this time we'll extract just the text elements.

In [6]:
header_row_text = [] 


for each_row in table.find_all('tr'):   
    
    headers = each_row.find_all("th")
    
    # A try clause is used to avoid an IndexError when iterating over rows without "th" tags.
    try:
        # From the headers list, all 3 elements are indexed, and only their text content is isolated.
        header_row_text.append((headers[0].text, headers[1].text, headers[2].text))
    except IndexError: # Catch only the expected error, so that other bugs are not silently hidden.
        pass

header_row_text

[('Position', 'Country', 'Incidence\n\t\t\t(per 100,000)')]
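The third header still carries the newline and tabs left behind by the `<br/>` tag. One way to clean this up, sketched here on a standalone copy of the header row rather than on the live page, is `get_text()` with a separator and `strip=True`:

```python
from bs4 import BeautifulSoup

# Header row copied from the table output above into a standalone snippet.
header_html = (
    "<table><tr><th>Position</th>\n<th>Country</th>\n"
    "<th>Incidence<br/>\n\t\t\t(per 100,000)</th>\n</tr></table>"
)
header_soup = BeautifulSoup(header_html, "html.parser")

# strip=True trims each text fragment; " " joins the fragments around <br/>.
clean_headers = [th.get_text(" ", strip=True) for th in header_soup.find_all("th")]
print(clean_headers)  # ['Position', 'Country', 'Incidence (per 100,000)']
```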

### Extracting Rows

We'll repeat the above process for the remaining rows that aren't headers, i.e. the rows whose cells are inside `<td>` tags.

In [7]:
row_text = [] 


for each_row in table.find_all('tr'):   
    
    rows = each_row.find_all("td")
    
    # A try clause is used to avoid an IndexError when iterating over rows without "td" tags (i.e. the header row).
    try:
        # From the rows list, all 3 elements are indexed, and their text content is isolated.
        row_text.append((rows[0].text, rows[1].text, rows[2].text))
        
        # Note the use of double parentheses above when appending, to pass in the tuple of 3 strings as a single argument!
    except IndexError: # Catch only the expected error, so that other bugs are not silently hidden.
        pass

row_text

[('1', 'Finland', '57.6'),
 ('2', 'Sweden', '43.1'),
 ('3', 'Saudi Arabia', '31.4'),
 ('4', 'Norway', '27.9'),
 ('5', 'United Kingdom', '24.5'),
 ('6', 'USA', '23.7'),
 ('7', 'Australia', '22.5'),
 ('8', 'Kuwait', '22.3'),
 ('9', 'Denmark', '22.2'),
 ('10', 'Canada', '21.7'),
 ('11', 'Netherlands', '18.6'),
 ('12', 'Germany', '18'),
 ('12', 'New Zealand', '18'),
 ('14', 'Poland', '17.3'),
 ('15', 'Czech Republic', '17.2'),
 ('16', 'Estonia', '17.1'),
 ('17', 'Puerto Rico', '16.8'),
 ('18', 'Ireland', '16.3'),
 ('18', 'Montenegro', '16.3'),
 ('20', 'Malta', '15.6'),
 ('21', 'Luxembourg', '15.5'),
 ('22', 'Belgium', '15.4'),
 ('23', 'Cyprus', '14.9'),
 ('24', 'Iceland', '14.7'),
 ('25', 'Slovakia', '13.6'),
 ('26', 'Austria', '13.3'),
 ('27', 'Portugal', '13.2'),
 ('28', 'Spain', '13'),
 ('29', 'Serbia', '12.9'),
 ('30', 'United States Virgin Islands', '12.8'),
 ('31', 'France', '12.2'),
 ('32', 'Italy', '12.1'),
 ('32', 'Russian Federation', '12.1'),
 ('34', 'Qatar', '11.4'),
 ('35', 'H
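As an alternative to the `try` clause, the number of cells in each row can be checked explicitly. The sketch below uses a few rows copied from the table above; it is one possible variant, not the approach used in the rest of this walkthrough.

```python
from bs4 import BeautifulSoup

# A few rows copied from the table above; the header row has no <td> cells.
rows_html = """
<table>
<tr><th>Position</th><th>Country</th><th>Incidence</th></tr>
<tr><td>1</td><td>Finland</td><td>57.6</td></tr>
<tr><td>2</td><td>Sweden</td><td>43.1</td></tr>
</table>
"""
rows_soup = BeautifulSoup(rows_html, "html.parser")

parsed_rows = []
for tr in rows_soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row yields an empty list and is skipped
        parsed_rows.append(tuple(cells))

print(parsed_rows)  # [('1', 'Finland', '57.6'), ('2', 'Sweden', '43.1')]
```

The explicit length check makes the skipping rule visible in the code and would also skip any malformed row with too few or too many cells.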

To simplify things, we'll merge the contents of `header_row_text` and `row_text`.

In [8]:
for each in row_text:
    header_row_text.append(each)

## Converting to DataFrame

With our resulting full list of tuples above, it looks like we are very near the final product. 

We'll just convert it into a DataFrame using the pandas library.

In [9]:
df = pd.DataFrame(header_row_text)

df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Position | Country | Incidence\n\t\t\t(per 100,000) |
| 1 | 1 | Finland | 57.6 |
| 2 | 2 | Sweden | 43.1 |
| 3 | 3 | Saudi Arabia | 31.4 |
| 4 | 4 | Norway | 27.9 |


To wrap up our job here, we'll simply set the first row of our DataFrame as the column headers.

In [10]:
# Setting column names as first row:
df.columns = df.iloc[0]

# Dropping that first row from the dataframe:
df = df.drop(df.index[0])

# Resetting index. drop=True prevents the new index from also being saved as a column.
df = df.reset_index(drop=True)

In [11]:
df.head(3)

|   | Position | Country | Incidence (per 100,000) |
|---|---|---|---|
| 0 | 1 | Finland | 57.6 |
| 1 | 2 | Sweden | 43.1 |
| 2 | 3 | Saudi Arabia | 31.4 |


In [12]:
df.tail(3)

|   | Position | Country | Incidence (per 100,000) |
|---|---|---|---|
| 86 | 86 | Thailand | 0.3 |
| 87 | 88 | Papua New Guinea | 0.1 |
| 88 | 88 | Venezuala | 0.1 |
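When the headers and the data rows are kept in separate lists, the header row does not need to pass through the DataFrame at all. A minimal sketch, using three rows copied from the output above and a cleaned-up header list (an assumption, since the scraped header contains extra whitespace):

```python
import pandas as pd

# Sample values copied from the scraped output above; the header text is
# assumed to have been cleaned of its newline and tabs.
headers = ["Position", "Country", "Incidence (per 100,000)"]
rows = [("1", "Finland", "57.6"), ("2", "Sweden", "43.1"), ("3", "Saudi Arabia", "31.4")]

# Passing columns= names the columns directly, so no row has to be dropped.
demo_df = pd.DataFrame(rows, columns=headers)

# Scraped values are strings; convert the numeric columns before analysis.
demo_df["Position"] = pd.to_numeric(demo_df["Position"])
demo_df["Incidence (per 100,000)"] = pd.to_numeric(demo_df["Incidence (per 100,000)"])

print(demo_df.dtypes)
```

Converting to numeric types matters for later steps such as sorting or averaging, which would otherwise operate on strings.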


## End

That's it for the web-scraping. Optionally, we could have also modified the dataframe above so that the `Position` column is set to be the index:

In [13]:
df = df.set_index('Position')

df.head(3)

| Position | Country | Incidence (per 100,000) |
|---|---|---|
| 1 | Finland | 57.6 |
| 2 | Sweden | 43.1 |
| 3 | Saudi Arabia | 31.4 |
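One caveat with using `Position` as the index: the table contains ties (Germany and New Zealand both sit at position 12), so the resulting index is not unique. The sketch below illustrates this on three rows copied from the scraped output; `tied_df` is a name introduced here for illustration.

```python
import pandas as pd

# Rows copied from the scraped output above, including the tie at position 12.
tied_df = pd.DataFrame(
    [("11", "Netherlands", "18.6"), ("12", "Germany", "18"), ("12", "New Zealand", "18")],
    columns=["Position", "Country", "Incidence (per 100,000)"],
).set_index("Position")

print(tied_df.index.is_unique)                # False
# A label lookup on a tied position returns every matching row.
print(tied_df.loc["12", "Country"].tolist())  # ['Germany', 'New Zealand']
```

Note that the positions are still strings here, as in the scraped data, so the lookup uses `"12"` rather than `12`.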
