# Web Scraping

## Goals

- Learn the basics of scraping content from web pages
- Scrape text from a web page
- Extract an HTML table from a web page into a pandas DataFrame

Web scraping is the automated extraction of information from a web page. That information is often the page's visible content, but it can also include the page's headers, the links on the page, or anything else embedded in its HTML. This flexibility has made scraping one of the most popular ways to collect data from the web.

With basic knowledge of HTML and the help of a few Python libraries, you can extract information from most pages on the internet.

### Scraping a Simple Web Page

In [2]:
import requests

url = 'https://thegurus.tech'
html = requests.get(url).content
html[:500]

b'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n        <meta name="viewport" content="width=device-width, initial-scale=1">\n\n\n        <title>The Gurus</title>\n\n            <link href="https://thegurus.tech/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="The Gurus Full Atom Feed" />\n        <!-- Bootstrap Core CSS -->\n        <link href="https://thegurus.tech/theme/css/bootstrap.min.css" '
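
The request above has no timeout and does not check the HTTP status, so a slow server can hang the notebook and an error page can be parsed as if it were the real content. A minimal, more defensive sketch follows; the helper name `fetch_html` and its default 10-second timeout are hypothetical choices for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url`, or raise on failure.

    `fetch_html` and the 10-second default are illustrative assumptions.
    """
    # A timeout keeps the call from waiting forever on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # .content is the undecoded body (bytes), as in the cell above.
    return response.content


# Example call (performs a real network request):
# html = fetch_html('https://thegurus.tech')
```

The helper returns bytes, like `requests.get(url).content`, so it can be passed to `BeautifulSoup` in the same way.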

### Beautiful Soup

In [3]:
!pip install BeautifulSoup4



In [4]:
!pip install lxml



In [5]:
from bs4 import BeautifulSoup

# 'lxml' tells Beautiful Soup which parser to use (installed in the previous cell)
soup = BeautifulSoup(html, 'lxml')
soup.contents[0:5]

['html', <html lang="en">
 <head>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <title>The Gurus</title>
 <link href="https://thegurus.tech/feeds/all.atom.xml" rel="alternate" title="The Gurus Full Atom Feed" type="application/atom+xml"/>
 <!-- Bootstrap Core CSS -->
 <link href="https://thegurus.tech/theme/css/bootstrap.min.css" rel="stylesheet"/>
 <!-- Custom CSS -->
 <link href="https://thegurus.tech/theme/css/clean-blog.min.css" rel="stylesheet"/>
 <!-- Code highlight color scheme -->
 <link href="https://thegurus.tech/theme/css/code_blocks/github.css" rel="stylesheet"/>
 <!-- CSS specified by the user -->
 <link href="https://thegurus.tech/static/css/custom.css" rel="stylesheet"/>
 <!-- Custom Fonts -->
 <link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet" type="text/css"/>
 <link href="https://fonts.googleapis.com/css?fami

In [7]:
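# HTML defines heading levels h1 through h6 only
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p']
# find_all accepts a list of tag names and returns the matching elements in document order
text = list(soup.find_all(tags))
text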

[<h1>The Gurus</h1>, <h2 class="post-title">
                     Deepracer re:Invent 2019
                 </h2>, <p class="article-summary">
                     Our experience competing in the most important autonomous driving competition
                 </p>, <p class="post-meta">Posted by
                     <a href="https://thegurus.tech/author/pedro-munoz-botas.html">Pedro Muñoz Botas</a>
                  on Mon 09 December 2019
             </p>, <p>There are <a data-disqus-identifier="posts/2019/12/deepracer/" href="https://thegurus.tech/posts/2019/12/deepracer/#disqus_thread">comments</a>.</p>, <h2 class="post-title">
                     Our experience as Lead Teachers at IronHack Data <span class="amp">&amp;</span> Analytics Bootcamp
                 </h2>, <p class="article-summary">
                     How is teaching in one of the best digital schools?
                 </p>, <p class="post-meta">Posted by
                     <a href="https://thegurus.tech/author/dav
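
The list above still holds tag objects, markup included. To get plain text, each element's `get_text()` can be called and the whitespace collapsed. The sketch below runs on a small hand-written fragment modelled on the output above rather than on the live page, and it uses Python's built-in `'html.parser'` so that it does not depend on lxml; the names `sample_html`, `sample_soup`, and `texts` are illustrative.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the page output (an assumption,
# not the real page source).
sample_html = """
<h1>The Gurus</h1>
<h2 class="post-title">
    Deepracer re:Invent 2019
</h2>
<p class="post-meta">Posted by
    <a href="#">Pedro Muñoz Botas</a>
    on Mon 09 December 2019
</p>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

tags = ['h1', 'h2', 'p']
# get_text() drops the markup; split() + join() collapses runs of whitespace.
texts = [' '.join(el.get_text().split()) for el in sample_soup.find_all(tags)]
print(texts)
```

Collapsing whitespace with `split()` and `join()` keeps the words of nested elements (such as the author link) separated by single spaces.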

### More Complex Single-Page Scraping

Suppose we want to extract the data in an HTML table and store it in a pandas DataFrame.

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

In [9]:
# The class name 'sortable wikitable' was found by inspecting the table element in the browser
table = soup.find_all('table', {'class': 'sortable wikitable'})[0]
table

<table class="sortable wikitable">
<tbody><tr bgcolor="#efefef">
<th>Rank
</th>
<th>Country</th>
<th><a href="/wiki/List_of_countries_by_life_expectancy" title="List of countries by life expectancy">Life expectancy</a><sup class="reference" id="cite_ref-:0_1-1"><a href="#cite_note-:0-1">[1]</a></sup>
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="750" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/19px-Flag_of_Monaco.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/29px-Flag_of_Monaco.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/38px-Flag_of_Monaco.svg.png 2x" width="19"/> </span><a href="/wiki/Monaco" title="Monaco">Monaco</a><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
</td>
<td>89.4
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon"><i

In [10]:
# Each <tr> element is one table row
rows = table.find_all('tr')
rows_parsed = [row.text for row in rows]
rows_parsed

['\nRank\n\nCountry\nLife expectancy[1]\n',
 '\n1\n\n\xa0Monaco[2]\n\n89.4\n',
 '\n2\n\n\xa0San Marino[3]\n\n83.4\n',
 '\n3\n\n\xa0\xa0Switzerland\n83.0\n',
 '\n4\n\n\xa0Spain\n82.8\n',
 '\n5\n\n\xa0Liechtenstein\n82.7\n',
 '\n6\n\n\xa0Italy\n82.5\n',
 '\n7\n\n\xa0Norway\n82.5\n',
 '\n8\n\n\xa0Iceland\n82.5\n',
 '\n9\n\n\xa0Luxembourg\n82.3\n',
 '\n10\n\n\xa0France\n82.3\n',
 '\n11\n\n\xa0Sweden\n82.2\n',
 '\n12\n\n\xa0Malta\n81.8\n',
 '\n13\n\n\xa0Finland\n81.8\n',
 '\n14\n\n\xa0Ireland\n81.6\n',
 '\n15\n\n\xa0Netherlands\n81.5\n',
 '\n16\n\n\xa0Portugal\n81.1\n',
 '\n17\n\n\xa0Greece\n81.0\n',
 '\n18\n\n\xa0United Kingdom\n81.0\n',
 '\n19\n\n\xa0Belgium\n81.0\n',
 '\n20\n\n\xa0Austria\n80.9\n',
 '\n21\n\n\xa0Slovenia\n80.8\n',
 '\n22\n\n\xa0Denmark\n80.7\n',
 '\n23\n\n\xa0Germany\n80.6\n',
 '\n24\n\n\xa0Cyprus\n80.5\n',
 '\n25\n\n\xa0Albania\n78.3\n',
 '\n26\n\n\xa0Czech Republic\n78.3\n',
 '\n27\n\n\xa0Croatia\n78.0\n',
 '\n28\n\n\xa0Estonia\n77.7\n',
 '\n29\n\n\xa0Poland\n77.5\n',
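
Reading `row.text` merges all cells of a row into one string, so cell boundaries have to be recovered from newlines afterwards. An alternative is to read each `<th>` or `<td>` cell separately and to remove the footnote `<sup>` elements before extracting text. The sketch below shows that approach on a hand-written table fragment (a simplified assumption about the page's markup, parsed with the built-in `'html.parser'`); `sample_table` and `cells_by_row` are illustrative names.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written fragment in the shape of the Wikipedia table.
sample_table = """
<table class="sortable wikitable">
<tr><th>Rank</th><th>Country</th><th>Life expectancy<sup class="reference">[1]</sup></th></tr>
<tr><td>1</td><td><span class="flagicon"> </span><a href="/wiki/Monaco">Monaco</a><sup class="reference">[2]</sup></td><td>89.4</td></tr>
<tr><td>2</td><td><a href="/wiki/San_Marino">San Marino</a><sup class="reference">[3]</sup></td><td>83.4</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_table, 'html.parser')

# Remove footnote markers such as [1] from the tree before reading any text.
for sup in sample_soup.find_all('sup', class_='reference'):
    sup.decompose()

# One inner list per row, one string per header or data cell.
cells_by_row = [
    [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    for row in sample_soup.find_all('tr')
]
print(cells_by_row)
```

Because every cell yields exactly one string, an empty cell stays in its column as an empty string instead of disappearing.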


In [11]:
import re

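def smart_parser(row_text):
    # Collapse doubled newlines and trim the leading and trailing ones
    row_text = row_text.replace('\n\n', '\n').strip('\n')
    # Remove footnote markers such as [1] or [12]
    row_text = re.sub(r'\[\d+\]', '', row_text)
    # One list item per cell; strip() also removes the non-breaking space '\xa0'
    return [cell.strip() for cell in row_text.split('\n')]

well_parsed = [smart_parser(row) for row in rows_parsed]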

well_parsed

[['Rank', 'Country', 'Life expectancy'],
 ['1', 'Monaco', '89.4'],
 ['2', 'San Marino', '83.4'],
 ['3', 'Switzerland', '83.0'],
 ['4', 'Spain', '82.8'],
 ['5', 'Liechtenstein', '82.7'],
 ['6', 'Italy', '82.5'],
 ['7', 'Norway', '82.5'],
 ['8', 'Iceland', '82.5'],
 ['9', 'Luxembourg', '82.3'],
 ['10', 'France', '82.3'],
 ['11', 'Sweden', '82.2'],
 ['12', 'Malta', '81.8'],
 ['13', 'Finland', '81.8'],
 ['14', 'Ireland', '81.6'],
 ['15', 'Netherlands', '81.5'],
 ['16', 'Portugal', '81.1'],
 ['17', 'Greece', '81.0'],
 ['18', 'United Kingdom', '81.0'],
 ['19', 'Belgium', '81.0'],
 ['20', 'Austria', '80.9'],
 ['21', 'Slovenia', '80.8'],
 ['22', 'Denmark', '80.7'],
 ['23', 'Germany', '80.6'],
 ['24', 'Cyprus', '80.5'],
 ['25', 'Albania', '78.3'],
 ['26', 'Czech Republic', '78.3'],
 ['27', 'Croatia', '78.0'],
 ['28', 'Estonia', '77.7'],
 ['29', 'Poland', '77.5'],
 ['30', 'Montenegro', '77.1'],
 ['31', 'Bosnia and Herzegovina', '76.9'],
 ['32', 'Slovakia', '76.6'],
 ['33', 'Turkey', '75.8'],
 ['
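
The row-cleaning logic can be checked in isolation on a few hand-written strings. The sketch below uses a hypothetical standalone variant, `clean_row`, whose footnote pattern `\[\d+\]` matches markers with any number of digits; the `[12]` marker in the last sample row is made up to exercise that case.

```python
import re


def clean_row(row_text):
    """Split the text of one table row into a list of clean cell strings."""
    # Drop footnote markers of any length, e.g. [2] or [12].
    row_text = re.sub(r'\[\d+\]', '', row_text)
    # Split on newlines, strip each piece (this also removes non-breaking
    # spaces), and discard the pieces that end up empty.
    return [piece.strip() for piece in row_text.split('\n') if piece.strip()]


sample_rows = [
    '\nRank\n\nCountry\nLife expectancy[1]\n',
    '\n1\n\n\xa0Monaco[2]\n\n89.4\n',
    '\n10\n\n\xa0France[12]\n82.3\n',  # [12] is a made-up two-digit marker
]
print([clean_row(row) for row in sample_rows])
```

Discarding empty pieces is a simplification: a row with a genuinely empty cell would lose that cell and shift the remaining values one column to the left.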

In [12]:
import pandas as pd

colnames = well_parsed[0]
data = well_parsed[1:]

df = pd.DataFrame(data, columns=colnames)
df

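|   | Rank | Country | Life expectancy |
|---|------|---------|-----------------|
| 0 | 1 | Monaco | 89.4 |
| 1 | 2 | San Marino | 83.4 |
| 2 | 3 | Switzerland | 83.0 |
| 3 | 4 | Spain | 82.8 |
| 4 | 5 | Liechtenstein | 82.7 |
| 5 | 6 | Italy | 82.5 |
| 6 | 7 | Norway | 82.5 |
| 7 | 8 | Iceland | 82.5 |
| 8 | 9 | Luxembourg | 82.3 |
| 9 | 10 | France | 82.3 |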
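
Every value in this DataFrame is still a string, because it was built from scraped text. Sorting or averaging the life-expectancy column only behaves as expected after converting it to a numeric type. A minimal sketch on a few rows copied from the table follows; `sample` and `df_sample` are illustrative names.

```python
import pandas as pd

# A few rows in the same shape as the parsed table: every value is a string.
sample = [
    ['Rank', 'Country', 'Life expectancy'],
    ['1', 'Monaco', '89.4'],
    ['2', 'San Marino', '83.4'],
    ['3', 'Switzerland', '83.0'],
]
df_sample = pd.DataFrame(sample[1:], columns=sample[0])

# Convert the numeric columns; pd.to_numeric picks int or float as needed.
df_sample['Rank'] = pd.to_numeric(df_sample['Rank'])
df_sample['Life expectancy'] = pd.to_numeric(df_sample['Life expectancy'])

print(df_sample.dtypes)
print(df_sample['Life expectancy'].mean())
```

`pd.to_numeric` raises an error on values it cannot parse, which helps surface rows where the scraping left stray characters behind.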
