# Web scraping 

This section relies heavily on [Web Scraping with Python](https://proquest.safaribooksonline.com/book/programming/python/9781491985564) (2018) by Ryan Mitchell. Python is well suited to web scraping because its [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) library makes parsing HTML easy.

## Definition

Web scraping means collecting data from the Web without using an API. You do this by writing a program that queries a web server, requests data, and parses the returned HTML to extract the information you need.

**Web scraping workflow**

- Request: a user (likely you) sends a request to a server
- Respond: the server sends back a response
- Parse: you parse the HTML in the response

In most cases, collecting data from an API is more convenient and legally safer. But when no API exists, you have to scrape within *technical*, *legal*, and *ethical* boundaries. The issues around web scraping are complex because they are tied to Internet security, intellectual property, and the idea of knowledge as a commons.
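One technical boundary you can check in code is a site's `robots.txt` file, which states which paths automated clients are asked to avoid. The sketch below uses `urllib.robotparser` from the standard library. The rules, the user-agent name `my-scraper`, and the `example.org` URLs are made up for illustration, and passing this check does not settle the legal or ethical questions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would read the site's own file
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent request this URL?
allowed_public = parser.can_fetch("my-scraper", "https://example.org/public/page.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.org/private/data.html")
print(allowed_public, allowed_private)
```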


## Request and respond

Let's start with the Wikipedia entry for the [Democracy Index](https://en.wikipedia.org/wiki/Democracy_Index). The output of this code is nearly illegible unless you can parse an HTML document with your own eyes. Most novices can't.

The basic idea behind web scraping is to mimic how a web browser works. The result below shows that the URL returns an HTML document.

In [1]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    page = urlopen('https://en.wikipedia.org/wiki/Democracy_Index')
except HTTPError as e: # the server answered with an error status (e.g. 404)
    print(e)
except URLError as e: # the server could not be reached at all
    print("The server is broken")
else:
    print("The site is working")
    print(page.read()) # print the raw bytes of the response

The site is working
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Democracy Index - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Democracy_Index","wgTitle":"Democracy Index","wgCurRevisionId":875951842,"wgRevisionId":875951842,"wgArticleId":8775637,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Multiple names: authors list","All articles with specifically marked weasel-worded phrases","Articles with specifically marked weasel-worded phrases from November 2015","All articles with unsourced statements","Articles with unsourced statements from April 2018","Commons category

The document is telling us something, but it is hard to read. It is not obvious where to start. The `requests` library offers a shorter way to make the same request.

In [2]:
import requests 

page = requests.get('https://en.wikipedia.org/wiki/Democracy_Index')

print(page.status_code) # check whether the download was successful (200 means OK)

200
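The request step can also be wrapped in a small helper that returns decoded text instead of raw bytes. This is only a sketch using the standard library: the function name `fetch_html`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration, not part of the original workflow.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch_html(url, timeout=10):
    """Return the document at `url` as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()  # raw bytes, like the b'...' output above
            # Use the charset declared by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
    except HTTPError as e:
        print("The server returned an error:", e.code)
        return None
    except URLError as e:
        print("The server could not be reached:", e.reason)
        return None
    return raw.decode(charset, errors="replace")

# Example call (needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/Democracy_Index')
```

Returning `None` on failure keeps the calling code simple, but you could just as well let the exceptions propagate.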


## Parse

Beautiful Soup makes parsing HTML much easier.

You can install the Beautiful Soup library in several ways.

- Linux (Debian/Ubuntu): type `sudo apt-get install python3-bs4` in a terminal. The same applies on Windows, provided you run it in bash.
- Mac: `sudo easy_install pip` (in case you haven't installed pip already), then `pip install beautifulsoup4`

### HTML parser

The most commonly used parser is `html.parser`, which is built into Python. For malformed HTML documents, the `lxml` and `html5lib` parsers often work better, but they have to be installed separately.

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")
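If you are not sure which parsers are installed, you can try them in order and fall back to the built-in one. The helper name `make_soup` and the preference order are assumptions for this sketch. Beautiful Soup raises `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse `markup` with the first parser in `preferred` that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("None of the requested parsers is available")

# A deliberately malformed snippet: the <p> tag is never closed
toy_soup, parser_used = make_soup("<p>Unclosed paragraph")
print(parser_used, toy_soup.find("p").get_text())
```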

### Parsing HTML

You can inspect the document using `prettify()`.

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Democracy Index - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Democracy_Index","wgTitle":"Democracy Index","wgCurRevisionId":875951842,"wgRevisionId":875951842,"wgArticleId":8775637,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Multiple names: authors list","All articles with specifically marked weasel-worded phrases","Articles with specifically marked weasel-worded phrases from November 2015","All articles with unsourced statements","Articles with unsourced statements from April 2018","Commons category li

After exploring the web site of interest, you can extract parts of the document by identifying specific HTML/CSS tags or attributes.

In [5]:
soup.find_all('table') # all tables

soup.find_all('tr') # all rows in a table 

soup.find_all('td') # all cells in a table 

soup.find_all('div') # all sections 

soup.find_all('a') # all hyperlinks 

# you can chain these commands, starting from find(): soup.find('table').find_all('a')
# (find_all() returns a list of matches, so you cannot call find_all() on its result directly)
# you can search for several tags at once: soup.find_all(['h1','h2','h3'])

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="image" href="/wiki/File:EIU_Democracy_Index_2017.svg"><img alt="" class="thumbimage" data-file-height="1314" data-file-width="2560" height="282" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/f8/EIU_Democracy_Index_2017.svg/550px-EIU_Democracy_Index_2017.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/f8/EIU_Democracy_Index_2017.svg/825px-EIU_Democracy_Index_2017.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f8/EIU_Democracy_Index_2017.svg/1100px-EIU_Democracy_Index_2017.svg.png 2x" width="550"/></a>,
 <a class="internal" href="/wiki/File:EIU_Democracy_Index_2017.svg" title="Enlarge"></a>,
 <a href="/wiki/Economist_Intelligence_Unit" title="Economist Intelligence Unit">Economist Intelligence Unit</a>,
 <a href="#cite_note-index2017-1">[1]</a>,
 <a class="new" href="/w/index.php?title=Templa
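The difference between `find()` and `find_all()` matters when you chain searches. The toy HTML below is invented for illustration: `find()` returns a single tag you can search inside, while `find_all()` returns a list-like `ResultSet` that you have to loop over.

```python
from bs4 import BeautifulSoup

# Toy document, not taken from the Wikipedia page
html = """
<h1>Title</h1>
<table>
  <tr><td><a href="/wiki/Norway">Norway</a></td></tr>
  <tr><td><a href="/wiki/Iceland">Iceland</a></td></tr>
</table>
<h2>Notes</h2>
<a href="/about">About</a>
"""
toy_soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, so find_all() can be chained onto it
table_links = toy_soup.find("table").find_all("a")
link_texts = [a.get_text() for a in table_links]

# find_all() returns a list of matches; loop over it to search inside each one
links_per_table = [len(t.find_all("a")) for t in toy_soup.find_all("table")]

# A list of tag names matches any of them, in document order
headings = [h.name for h in toy_soup.find_all(["h1", "h2", "h3"])]

print(link_texts, links_per_table, headings)
```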

### Extracting a table

In [6]:
wiki_table = soup.find('table',{'class':'wikitable sortable'})

# the same search can be written in other ways, for example:
# soup.find('table', class_='wikitable sortable')
# also try 'sortable' instead of 'wikitable sortable'. Does it work?
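How Beautiful Soup matches the `class` attribute is easy to get wrong, so here is a small experiment on invented HTML. A single class name matches any tag that carries it. A string with several class names only matches the exact attribute value, in the same order. A CSS selector via `select_one()` does not care about the order.

```python
from bs4 import BeautifulSoup

# Toy document with two tables that share the class "wikitable"
html = """
<table class="wikitable"><tr><td>plain</td></tr></table>
<table class="wikitable sortable"><tr><td>target</td></tr></table>
"""
toy_soup = BeautifulSoup(html, "html.parser")

exact = toy_soup.find("table", {"class": "wikitable sortable"})  # exact attribute string
single = toy_soup.find("table", class_="sortable")               # any table with this class
reordered = toy_soup.find("table", class_="sortable wikitable")  # different order: no match
css = toy_soup.select_one("table.sortable.wikitable")            # CSS selector ignores order

print(exact.td.get_text(), single.td.get_text(), reordered, css.td.get_text())
```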

In [7]:
wiki_table

<table class="wikitable sortable" style="text-align:center;">
<caption>Democracy Index 2017
</caption>
<tbody><tr>
<th data-sort-type="number">Rank
</th>
<th data-sort-type="text">Country
</th>
<th data-sort-type="number">Score
</th>
<th data-sort-type="number" style="line-height: 1em;">Electoral process<br/>and pluralism
</th>
<th data-sort-type="number" style="line-height: 1em;">Functioning of<br/>government
</th>
<th data-sort-type="number" style="line-height: 1em;">Political<br/>participation
</th>
<th data-sort-type="number" style="line-height: 1em;">Political<br/>culture
</th>
<th data-sort-type="number" style="line-height: 1em;">Civil<br/>liberties
</th>
<th data-sort-type="number">Category
</th></tr>
<tr>
<td>1</td>
<td style="text-align:left;"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="800" data-file-width="1100" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Norway.svg/21px-Flag_of_Norway.svg.png" srcset="//upload.wi

#### Specific solution

Now, let's learn how to save the country information from the table using a particular attribute.

In [8]:
country_list = wiki_table('a') # shorthand for wiki_table.find_all('a'): all hyperlinks in the table

In [9]:
country_list

[<a href="/wiki/Norway" title="Norway">Norway</a>,
 <a href="/wiki/Iceland" title="Iceland">Iceland</a>,
 <a href="/wiki/Sweden" title="Sweden">Sweden</a>,
 <a href="/wiki/New_Zealand" title="New Zealand">New Zealand</a>,
 <a href="/wiki/Denmark" title="Denmark">Denmark</a>,
 <a href="/wiki/Republic_of_Ireland" title="Republic of Ireland">Ireland</a>,
 <a href="/wiki/Canada" title="Canada">Canada</a>,
 <a href="/wiki/Australia" title="Australia">Australia</a>,
 <a href="/wiki/Finland" title="Finland">Finland</a>,
 <a href="/wiki/Switzerland" title="Switzerland">Switzerland</a>,
 <a href="/wiki/Netherlands" title="Netherlands">Netherlands</a>,
 <a href="/wiki/Luxembourg" title="Luxembourg">Luxembourg</a>,
 <a href="/wiki/Germany" title="Germany">Germany</a>,
 <a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a>,
 <a href="/wiki/Austria" title="Austria">Austria</a>,
 <a href="/wiki/Mauritius" title="Mauritius">Mauritius</a>,
 <a href="/wiki/Malta" title="Malta">Malta<

In [10]:
countries = []

for country in country_list:
    # get('title') returns only the title attribute,
    # not the whole Beautiful Soup tag object
    countries.append(country.get('title'))
    
print(countries)

['Norway', 'Iceland', 'Sweden', 'New Zealand', 'Denmark', 'Republic of Ireland', 'Canada', 'Australia', 'Finland', 'Switzerland', 'Netherlands', 'Luxembourg', 'Germany', 'United Kingdom', 'Austria', 'Mauritius', 'Malta', 'Uruguay', 'Spain', 'South Korea', 'United States', 'Italy', 'Japan', 'Cape Verde', 'Costa Rica', 'Chile', 'Portugal', 'Botswana', 'France', 'Estonia', 'Israel', 'Belgium', 'Taiwan', 'Taiwan', 'Czech Republic', 'Cyprus', 'Slovenia', 'Lithuania', 'Greece', 'Jamaica', 'Latvia', 'South Africa', 'India', 'East Timor', 'Slovakia', 'Panama', 'Trinidad and Tobago', 'Bulgaria', 'Argentina', 'Brazil', 'Suriname', 'Philippines', 'Ghana', 'Poland', 'Colombia', 'Dominican Republic', 'Lesotho', 'Hungary', 'Croatia', 'Malaysia', 'Mongolia', 'Peru', 'Sri Lanka', 'Guyana', 'Romania', 'El Salvador', 'Serbia', 'Mexico', 'Indonesia', 'Tunisia', 'Singapore', 'Hong Kong', 'Namibia', 'Paraguay', 'Senegal', 'Papua New Guinea', 'Ecuador', 'Albania', 'Moldova', 'Georgia (country)', 'Guatemala'
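The same loop can be written as a list comprehension. The sketch below, on a two-row toy table, also shows that the `title` attribute and the visible link text are not always the same, which is why the list above contains 'Republic of Ireland' rather than 'Ireland'. Note that `get()` returns `None` when a link has no `title` attribute.

```python
from bs4 import BeautifulSoup

# Toy table modeled on two of the links shown above
html = """
<table>
  <tr><td><a href="/wiki/Republic_of_Ireland" title="Republic of Ireland">Ireland</a></td></tr>
  <tr><td><a href="/wiki/Canada" title="Canada">Canada</a></td></tr>
</table>
"""
toy_table = BeautifulSoup(html, "html.parser").find("table")

titles = [a.get("title") for a in toy_table.find_all("a")]           # attribute values
labels = [a.get_text(strip=True) for a in toy_table.find_all("a")]   # visible link text

print(titles)
print(labels)
```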

#### General solution

You can scrape the entire table with a loop. You also need regular expressions to tell numbers apart from other strings (among other tasks).

In [11]:
wiki_table.find_all('th') # heading 
#wiki_table.find_all('tr')[1].find_all('td') # to get some ideas about how looping would work 
#len(wiki_table.find_all('tr')[1].find_all('td'))

[<th data-sort-type="number">Rank
 </th>, <th data-sort-type="text">Country
 </th>, <th data-sort-type="number">Score
 </th>, <th data-sort-type="number" style="line-height: 1em;">Electoral process<br/>and pluralism
 </th>, <th data-sort-type="number" style="line-height: 1em;">Functioning of<br/>government
 </th>, <th data-sort-type="number" style="line-height: 1em;">Political<br/>participation
 </th>, <th data-sort-type="number" style="line-height: 1em;">Political<br/>culture
 </th>, <th data-sort-type="number" style="line-height: 1em;">Civil<br/>liberties
 </th>, <th data-sort-type="number">Category
 </th>, <th data-sort-type="number">Rank
 </th>, <th data-sort-type="text">Country
 </th>, <th data-sort-type="number">Score
 </th>, <th data-sort-type="number" style="line-height: 1em;">Electoral process<br/>and pluralism
 </th>, <th data-sort-type="number" style="line-height: 1em;">Functioning of<br/>government
 </th>, <th data-sort-type="number" style="line-height: 1em;">Political<br/>

In [12]:
import re

# create empty lists
rank = []  
country = []
score = []
electoral = []  
government = [] 
participation = []
culture = []
liberties = [] 
category = []

for row in wiki_table.find_all('tr'): # loop over the table rows
    cells = row.find_all('td') # the data cells in this row
    if len(cells) == 9: # data rows have 9 cells; header rows have none
        rank.append(cells[0].find(text=re.compile('[0-9]+'))) # keep only text that contains digits
        country.append(cells[1].find_all(text=True))
        score.append(cells[2].find(text=re.compile('[0-9]+')))
        electoral.append(cells[3].find(text=re.compile('[0-9]+')))
        government.append(cells[4].find(text=re.compile('[0-9]+')))
        participation.append(cells[5].find(text=re.compile('[0-9]+')))
        culture.append(cells[6].find(text=re.compile('[0-9]+')))
        liberties.append(cells[7].find(text=re.compile('[0-9]+')))
        category.append(cells[8].find(text=True))
        print(len(rank), len(score), len(electoral), len(government),
             len(participation), len(culture), len(liberties), len(category)) # for debugging
    else:
        print("Something is wrong") # for debugging; header rows (no td cells) end up here

Something is wrong
1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9
10 10 10 10 10 10 10 10
11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19
20 20 20 20 20 20 20 20
21 21 21 21 21 21 21 21
22 22 22 22 22 22 22 22
23 23 23 23 23 23 23 23
24 24 24 24 24 24 24 24
25 25 25 25 25 25 25 25
26 26 26 26 26 26 26 26
27 27 27 27 27 27 27 27
28 28 28 28 28 28 28 28
29 29 29 29 29 29 29 29
30 30 30 30 30 30 30 30
31 31 31 31 31 31 31 31
32 32 32 32 32 32 32 32
33 33 33 33 33 33 33 33
34 34 34 34 34 34 34 34
35 35 35 35 35 35 35 35
36 36 36 36 36 36 36 36
37 37 37 37 37 37 37 37
38 38 38 38 38 38 38 38
39 39 39 39 39 39 39 39
40 40 40 40 40 40 40 40
41 41 41 41 41 41 41 41
42 42 42 42 42 42 42 42
43 43 43 43 43 43 43 43
44 44 44 44 44 44 44 
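An alternative to keeping one list per column is to collect each row as a dictionary keyed by the header names. The function name `table_to_records` and the three-column toy table (with placeholder countries and scores) are assumptions for this sketch. It skips any row whose number of `td` cells differs from the header, which covers repeated header rows.

```python
from bs4 import BeautifulSoup

def table_to_records(table):
    """Turn an HTML table into a list of {header: cell text} dictionaries."""
    rows = table.find_all("tr")
    header = [th.get_text(" ", strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = row.find_all("td")
        if len(cells) != len(header):
            continue  # repeated header rows or malformed rows
        values = [cell.get_text(" ", strip=True) for cell in cells]
        records.append(dict(zip(header, values)))
    return records

# Toy table with placeholder values, including a repeated header row
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th><th>Score</th></tr>
  <tr><td>1</td><td><span class="flagicon"></span> <a title="Country A">Country A</a></td><td>1.5</td></tr>
  <tr><th>Rank</th><th>Country</th><th>Score</th></tr>
  <tr><td>2</td><td><span class="flagicon"></span> <a title="Country B">Country B</a></td><td>2.5</td></tr>
</table>
"""
toy_table = BeautifulSoup(html, "html.parser").find("table")
records = table_to_records(toy_table)

# Cell text is always a string; convert numeric columns explicitly
scores = [float(r["Score"]) for r in records]
print(records, scores)
```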

## Turn into a data frame

Combine these lists into a single data frame.

In [13]:
import pandas as pd # convention

demo_pd = pd.DataFrame() # create a data frame
 
demo_pd['rank'] = rank
demo_pd['country'] = country
demo_pd['score'] = score
demo_pd['electoral'] = electoral
demo_pd['government'] = government
demo_pd['participation'] = participation
demo_pd['culture'] = culture
demo_pd['liberties'] = liberties
demo_pd['category'] = category

ValueError: Length of values does not match length of index
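One way to make this step easier to debug is to build the data frame from a dictionary of plain Python lists and to check the list lengths first. The sketch below uses short placeholder lists (prefixed `toy_`) rather than the scraped ones, so it illustrates the pattern without diagnosing the specific error above.

```python
import pandas as pd

# Placeholder values standing in for the scraped lists (not real index data)
toy_rank = ["1", "2"]
toy_country = ["Country A", "Country B"]
toy_score = ["1.5", "2.5"]

columns = {"rank": toy_rank, "country": toy_country, "score": toy_score}

# Check up front that every list has the same length
lengths = {name: len(values) for name, values in columns.items()}
assert len(set(lengths.values())) == 1, lengths

toy_df = pd.DataFrame(columns)

# Scraped values are strings; convert the numeric columns explicitly
toy_df["rank"] = toy_df["rank"].astype(int)
toy_df["score"] = toy_df["score"].astype(float)
print(toy_df)
```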

In [None]:
demo_pd

But the values in the country column look odd. What is going on, and how can you fix it?

In [None]:
type(demo_pd['country'][1])

The solution was already suggested above in the specific solution: extract the `title` attribute of the hyperlink. Exploring both ways of scraping a table was not a waste of time after all.

In [None]:
wiki_table.find_all('tr')[1].find_all('td')[1].find('a').get('title') # get some ideas about how looping works 
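To see why the country values look odd, compare the two approaches on a single cell. The toy cell below imitates the structure of a country cell in the table. The keyword `string=` is the current name for what the loop above spells `text=`. `find_all(string=True)` returns a list of every text fragment in the cell, whereas the `title` attribute is a single plain string.

```python
from bs4 import BeautifulSoup

# Toy cell imitating a country cell: flag icon, non-breaking space, link
html = (
    '<td><span class="flagicon"><img alt=""/></span>&#160;'
    '<a href="/wiki/Norway" title="Norway">Norway</a></td>'
)
cell = BeautifulSoup(html, "html.parser").find("td")

pieces = cell.find_all(string=True)   # a list of text fragments, not one string
title = cell.find("a").get("title")   # one plain string

print(type(pieces), list(pieces))
print(type(title), title)
```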


## Export the file

In [None]:
# demo_pd.to_csv("type the file path where you want to save the data frame")
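A minimal sketch of the export step, assuming a small placeholder data frame and a temporary directory. The file name `democracy_index.csv` is hypothetical. Passing `index=False` leaves out the row index, which is usually what you want for scraped tables.

```python
import os
import tempfile

import pandas as pd

# Placeholder data frame standing in for demo_pd
toy_df = pd.DataFrame({"country": ["Country A", "Country B"], "score": [1.5, 2.5]})

# Write to a temporary directory; replace this with your own path
out_path = os.path.join(tempfile.mkdtemp(), "democracy_index.csv")
toy_df.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm that it was written as expected
round_trip = pd.read_csv(out_path)
print(round_trip)
```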