# Web Scraping

## Goals

- Learn the basics of scraping content from web pages
- Scrape text from a web page
- Extract an HTML table from a web page into a pandas DataFrame

Web scraping is the automated extraction of information from a web page. This is usually the page's visible content, but it can also include its headers, the links it contains, or anything else embedded in its HTML. Because so much information is exposed this way, scraping has become one of the most popular ways to collect data from the web.

With basic knowledge of HTML and the help of a few Python libraries, you can obtain information from just about any page on the internet.

### Scraping a Simple Web Page

In [2]:
import requests

url = 'https://www.reuters.com/article/us-shazam-m-a-apple-eu/eu-clears-apples-purchase-of-shazam-idUSKCN1LM1TZ'
html = requests.get(url).content
html[0:500]

b'<!--[if !IE]> This has NOT been served from cache <![endif]-->\n<!--[if !IE]> Request served from apache server: produs--i-078cab3c64aed2f56 <![endif]-->\n<!--[if !IE]> token: ebec98cb-57bb-4b48-88b9-77820198533c <![endif]-->\n<!--[if !IE]> App Server /produs--i-078cab3c64aed2f56/ <![endif]-->\n\n<!doctype html><html lang="en" data-edition="BETAUS">\n    <head>\n\n    <title>\n                EU clears Apple\'s purchase of Shazam - Reuters</title>\n        <meta http-equiv="X-UA-Compatible" content="IE=edg'
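The one-line fetch above assumes the request succeeds. A slightly more defensive version might look like the sketch below. The helper name `fetch_html`, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of a page, raising on HTTP errors.

    The timeout and User-Agent values are illustrative assumptions.
    """
    headers = {"User-Agent": "scraping-tutorial-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.content


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy')
```

Like `requests.get(url).content`, the helper returns `bytes`, which is why the output above starts with `b'`.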

### Beautiful Soup

In [3]:
!pip install BeautifulSoup4



In [4]:
!pip install lxml



In [5]:
from bs4 import BeautifulSoup

# 'lxml' tells Beautiful Soup which parser to use
soup = BeautifulSoup(html, 'lxml')
soup

<!--[if !IE]> This has NOT been served from cache <![endif]--><!--[if !IE]> Request served from apache server: produs--i-078cab3c64aed2f56 <![endif]--><!--[if !IE]> token: ebec98cb-57bb-4b48-88b9-77820198533c <![endif]--><!--[if !IE]> App Server /produs--i-078cab3c64aed2f56/ <![endif]--><!DOCTYPE html>
<html data-edition="BETAUS" lang="en">
<head>
<title>
                EU clears Apple's purchase of Shazam - Reuters</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta charset="utf-8"/><meta content="on" http-equiv="x-dns-prefetch-control"/><link href="//s1.reutersmedia.net" rel="dns-prefetch"/><link href="//s2.reutersmedia.net" rel="dns-prefetch"/><link href="//s3.reutersmedia.net" rel="dns-prefetch"/><link href="//s4.reutersmedia.net" rel="dns-prefetch"/><link href="//static.reuters.com" rel="dns-prefetch"/><link href="//www.googletagservices.com" rel="dns-prefetch"/><link href="//www.googletagmanager.com" rel="dns-prefetch"/><link href="//www.google-analytics.com" rel
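Once the page is parsed, individual elements can be read from the tree instead of printing the whole document. The sketch below runs on a small hypothetical snippet modeled on the head of the page above, and uses Python's built-in `html.parser` so that it works even without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the head of the page shown above.
sample_html = """
<html lang="en">
<head>
<title>
                EU clears Apple's purchase of Shazam - Reuters</title>
<meta charset="utf-8"/>
<link href="//s1.reutersmedia.net" rel="dns-prefetch"/>
<link href="//static.reuters.com" rel="dns-prefetch"/>
</head>
<body></body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed here.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> tag with surrounding whitespace removed.
page_title = sample_soup.title.get_text(strip=True)
# Value of the charset attribute on the first <meta> tag that has one.
charset = sample_soup.find("meta", charset=True)["charset"]
# href of every <link> tag that defines one.
link_hrefs = [tag["href"] for tag in sample_soup.find_all("link", href=True)]

print(page_title)
print(charset)
print(link_hrefs)
```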

In [6]:
# HTML defines heading levels h1 through h6 only
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p']
text = [element.text for element in soup.find_all(tags)]
text

["EU clears Apple's purchase of Shazam",
 '2 Min Read',
 'BRUSSELS (Reuters) - The European Union approved Apple’s planned acquisition of British music discovery app Shazam on Thursday, saying an EU antitrust investigation showed it would not harm competition in the bloc. ',
 'The deal, announced in December last year, would help the iPhone maker better compete with Spotify, the industry leader in music streaming services. Shazam identifies songs when a smartphone is pointed at an audio source. ',
 '“After thoroughly analyzing Shazam’s user and music data, we found that their acquisition by Apple would not reduce competition in the digital music streaming market,” EU competition commissioner Margrethe Vestager said in a statement. ',
 '“Data is key in the digital economy. We must therefore carefully review transactions which lead to the acquisition of important sets of data, including potentially commercially sensitive ones,” she added. ',
 'The European Commission opened a full-scale 
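The list above keeps trailing spaces and would also keep any empty paragraphs. One possible clean-up, sketched on a short hypothetical snippet that reuses a few strings from the output above, is to strip each element's text and drop the blanks before joining everything into a single string.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet reusing a few strings from the article above.
sample_html = """
<html><body>
<h1>EU clears Apple's purchase of Shazam</h1>
<p>2 Min Read</p>
<p>   </p>
<p>Shazam identifies songs when a smartphone is pointed at an audio source. </p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

tags = ["h1", "h2", "h3", "h4", "h5", "h6", "p"]
# get_text(strip=True) trims whitespace around each piece of text.
pieces = [element.get_text(strip=True) for element in sample_soup.find_all(tags)]
# Drop elements that contained only whitespace.
pieces = [piece for piece in pieces if piece]

article_text = "\n\n".join(pieces)
print(article_text)
```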

### More Complex Single-Page Scraping

Suppose we want to extract the data in an HTML table and store it in a pandas DataFrame.
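For simple tables, pandas offers a shortcut, `pandas.read_html`, which is a different technique from the manual parsing used in the rest of this section. The sketch below applies it to a small hypothetical table modeled on the Wikipedia one. It assumes an HTML parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table modeled on the Wikipedia page used in this section.
sample_html = """
<table class="sortable wikitable">
<tbody><tr><th>Rank</th><th>Country</th><th>Life expectancy</th></tr>
<tr><td>1</td><td>Monaco</td><td>89.4</td></tr>
<tr><td>2</td><td>San Marino</td><td>83.4</td></tr>
</tbody></table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(sample_html))
sample_df = tables[0]
print(sample_df)
```

Parsing the table by hand, as the following cells do, is still worth knowing for pages where the markup is too irregular for the shortcut.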

In [7]:
url = 'https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy'
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")

In [8]:
# The class name 'sortable wikitable' was found by inspecting the table element in the browser
table = soup.find_all('table',{'class':'sortable wikitable'})[0]
table

<table class="sortable wikitable">
<tbody><tr bgcolor="#efefef">
<th>Rank
</th>
<th>Country</th>
<th><a href="/wiki/List_of_countries_by_life_expectancy" title="List of countries by life expectancy">Life expectancy</a><sup class="reference" id="cite_ref-:0_1-1"><a href="#cite_note-:0-1">[1]</a></sup>
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="750" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/19px-Flag_of_Monaco.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/29px-Flag_of_Monaco.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ea/Flag_of_Monaco.svg/38px-Flag_of_Monaco.svg.png 2x" width="19"/> </span><a href="/wiki/Monaco" title="Monaco">Monaco</a><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
</td>
<td>89.4
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon"><i
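The lookup above passes the full class string, so it only matches when the classes appear in exactly that order in the page source. A CSS selector is one way to make the match independent of order. The sketch below uses hypothetical markup with the two classes reversed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in the opposite order.
sample_html = """
<table class="wikitable sortable"><tr><td>1</td></tr></table>
<table class="infobox"><tr><td>2</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Matching on the full attribute string is sensitive to class order.
exact_matches = sample_soup.find_all("table", {"class": "sortable wikitable"})

# A CSS selector matches tables that carry both classes, in any order.
css_matches = sample_soup.select("table.sortable.wikitable")

print(len(exact_matches), len(css_matches))
```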

In [10]:
# Each <tr> element represents one table row
rows = table.find_all('tr')
rows_parsed = [row.text for row in rows]
rows_parsed

['\nRank\n\nCountry\nLife expectancy[1]\n',
 '\n1\n\n\xa0Monaco[2]\n\n89.4\n',
 '\n2\n\n\xa0San Marino[3]\n\n83.4\n',
 '\n3\n\n\xa0\xa0Switzerland\n83.0\n',
 '\n4\n\n\xa0Spain\n82.8\n',
 '\n5\n\n\xa0Liechtenstein\n82.7\n',
 '\n6\n\n\xa0Italy\n82.5\n',
 '\n7\n\n\xa0Norway\n82.5\n',
 '\n8\n\n\xa0Iceland\n82.5\n',
 '\n9\n\n\xa0Luxembourg\n82.3\n',
 '\n10\n\n\xa0France\n82.3\n',
 '\n11\n\n\xa0Sweden\n82.2\n',
 '\n12\n\n\xa0Malta\n81.8\n',
 '\n13\n\n\xa0Finland\n81.8\n',
 '\n14\n\n\xa0Ireland\n81.6\n',
 '\n15\n\n\xa0Netherlands\n81.5\n',
 '\n16\n\n\xa0Portugal\n81.1\n',
 '\n17\n\n\xa0Greece\n81.0\n',
 '\n18\n\n\xa0United Kingdom\n81.0\n',
 '\n19\n\n\xa0Belgium\n81.0\n',
 '\n20\n\n\xa0Austria\n80.9\n',
 '\n21\n\n\xa0Slovenia\n80.8\n',
 '\n22\n\n\xa0Denmark\n80.7\n',
 '\n23\n\n\xa0Germany\n80.6\n',
 '\n24\n\n\xa0Cyprus\n80.5\n',
 '\n25\n\n\xa0Albania\n78.3\n',
 '\n26\n\n\xa0Czech Republic\n78.3\n',
 '\n27\n\n\xa0Croatia\n78.0\n',
 '\n28\n\n\xa0Estonia\n77.7\n',
 '\n29\n\n\xa0Poland\n77.5\n',
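Taking the text of a whole row leaves newlines, non-breaking spaces (`\xa0`), and citation markers to clean up afterwards. An alternative sketch, shown on a hypothetical two-row table modeled on the one above, reads each cell separately and removes the citation `<sup>` tags before extracting the text.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table modeled on the Wikipedia markup above.
sample_table = """
<table class="sortable wikitable">
<tbody><tr>
<th>Rank
</th>
<th>Country</th>
<th><a href="/wiki/List_of_countries_by_life_expectancy">Life expectancy</a><sup class="reference"><a href="#cite_note-1">[1]</a></sup>
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon">\xa0</span><a href="/wiki/Monaco">Monaco</a><sup class="reference"><a href="#cite_note-2">[2]</a></sup>
</td>
<td>89.4
</td></tr>
</tbody></table>
"""
table = BeautifulSoup(sample_table, "html.parser").find("table")

# Remove citation markers such as [1] from the tree before reading any text.
for sup in table.find_all("sup", class_="reference"):
    sup.decompose()

# One list per row, one stripped string per header or data cell.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]
print(rows)
```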


In [19]:
import re

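def smart_parser(row_text):
    # Collapse the blank lines left by nested tags, then trim both ends
    row_text = row_text.replace('\n\n', '\n').strip('\n')
    # Drop citation markers such as [1]; the raw string avoids invalid-escape warnings
    row_text = re.sub(r'\[\d+\]', '', row_text)
    # One list item per cell; strip() also removes non-breaking spaces (\xa0)
    return [cell.strip() for cell in row_text.split('\n')]

well_parsed = [smart_parser(row) for row in rows_parsed]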

well_parsed

[['Rank', 'Country', 'Life expectancy'],
 ['1', 'Monaco', '89.4'],
 ['2', 'San Marino', '83.4'],
 ['3', 'Switzerland', '83.0'],
 ['4', 'Spain', '82.8'],
 ['5', 'Liechtenstein', '82.7'],
 ['6', 'Italy', '82.5'],
 ['7', 'Norway', '82.5'],
 ['8', 'Iceland', '82.5'],
 ['9', 'Luxembourg', '82.3'],
 ['10', 'France', '82.3'],
 ['11', 'Sweden', '82.2'],
 ['12', 'Malta', '81.8'],
 ['13', 'Finland', '81.8'],
 ['14', 'Ireland', '81.6'],
 ['15', 'Netherlands', '81.5'],
 ['16', 'Portugal', '81.1'],
 ['17', 'Greece', '81.0'],
 ['18', 'United Kingdom', '81.0'],
 ['19', 'Belgium', '81.0'],
 ['20', 'Austria', '80.9'],
 ['21', 'Slovenia', '80.8'],
 ['22', 'Denmark', '80.7'],
 ['23', 'Germany', '80.6'],
 ['24', 'Cyprus', '80.5'],
 ['25', 'Albania', '78.3'],
 ['26', 'Czech Republic', '78.3'],
 ['27', 'Croatia', '78.0'],
 ['28', 'Estonia', '77.7'],
 ['29', 'Poland', '77.5'],
 ['30', 'Montenegro', '77.1'],
 ['31', 'Bosnia and Herzegovina', '76.9'],
 ['32', 'Slovakia', '76.6'],
 ['33', 'Turkey', '75.8'],
 ['
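The row parser depends on a regular expression to remove citation markers. The sketch below isolates that step in a variant named `parse_row` (a hypothetical name) and checks it on one row string from the output above plus one made-up row with a two-digit marker, which a single-digit pattern would miss.

```python
import re

# Matches citation markers such as [2] or [12].
CITATION = re.compile(r"\[\d+\]")


def parse_row(row_text):
    """Split the text of one table row into a list of clean cell strings."""
    cleaned = CITATION.sub("", row_text)
    # Splitting on newlines leaves empty pieces; keep only non-empty cells.
    # str.strip() also removes non-breaking spaces (\xa0).
    return [piece.strip() for piece in cleaned.split("\n") if piece.strip()]


# Row text taken from the rows_parsed output above.
monaco = parse_row("\n1\n\n\xa0Monaco[2]\n\n89.4\n")
# Made-up row, used only to exercise a two-digit citation marker.
made_up = parse_row("\n40\n\n\xa0Example[12]\n70.0\n")

print(monaco)
print(made_up)
```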

In [20]:
import pandas as pd

colnames = well_parsed[0]
data = well_parsed[1:]

df = pd.DataFrame(data, columns=colnames)
df

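|  | Rank | Country | Life expectancy |
|---|---|---|---|
| 0 | 1 | Monaco | 89.4 |
| 1 | 2 | San Marino | 83.4 |
| 2 | 3 | Switzerland | 83.0 |
| 3 | 4 | Spain | 82.8 |
| 4 | 5 | Liechtenstein | 82.7 |
| 5 | 6 | Italy | 82.5 |
| 6 | 7 | Norway | 82.5 |
| 7 | 8 | Iceland | 82.5 |
| 8 | 9 | Luxembourg | 82.3 |
| 9 | 10 | France | 82.3 |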
9,10,France,82.3


In [21]:
df['Country']

0                     Monaco
1                 San Marino
2                Switzerland
3                      Spain
4              Liechtenstein
5                      Italy
6                     Norway
7                    Iceland
8                 Luxembourg
9                     France
10                    Sweden
11                     Malta
12                   Finland
13                   Ireland
14               Netherlands
15                  Portugal
16                    Greece
17            United Kingdom
18                   Belgium
19                   Austria
20                  Slovenia
21                   Denmark
22                   Germany
23                    Cyprus
24                   Albania
25            Czech Republic
26                   Croatia
27                   Estonia
28                    Poland
29                Montenegro
30    Bosnia and Herzegovina
31                  Slovakia
32                    Turkey
33           North Macedonia
34            

In [22]:
print(df['Country'][0])

Monaco


In [23]:
df['Country'][0]

'Monaco'
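The `[0]` in `df['Country'][0]` looks up the row labeled 0, which is only the first row while the index is in its original order. The sketch below, on a three-row sample from the table above, shows how label-based `.loc` and position-based `.iloc` diverge after sorting.

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Monaco", "San Marino", "Switzerland"],
                   "Life expectancy": [89.4, 83.4, 83.0]})

# Sorting reorders the rows but keeps their original index labels.
by_expectancy = df.sort_values("Life expectancy")

label_zero = by_expectancy.loc[0, "Country"]        # the row labeled 0
first_position = by_expectancy["Country"].iloc[0]   # the first row after sorting

print(label_zero)
print(first_position)
```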