# Session 9: Web Scraping

*Nicklas Johansen*

## Agenda

In this session you will be introduced to `web scraping`: 
- The Web Scraping Recipe
- Connecting to the Internet
- Introduction to HTML
- Parsing HTML with BeautifulSoup
- Navigating websites to collect links
- Ethical Considerations
- Interactions and Automated Browsing

## The Web Scraping Recipe

Scraping information from the web involves three steps:
1. **MAPPING**: Finding URLs of the pages containing the information you want.
2. **DOWNLOAD**: Fetching the pages via HTTP.
3. **PARSE**: Extracting the information from HTML.  
  
  
You could also add `connection`, `storing`, `logging`, etc.        
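
The three steps can be sketched as three small functions. This is only an illustrative skeleton: the function names `map_urls`, `download` and `parse`, the `?page=` paging scheme and the `example.com` address are assumptions made for the example, not part of any particular site.

```python
import requests
from bs4 import BeautifulSoup


def map_urls(base_url, n_pages):
    """MAPPING: build the list of page URLs to visit (hypothetical paging scheme)."""
    return [f"{base_url}?page={i}" for i in range(1, n_pages + 1)]


def download(url):
    """DOWNLOAD: fetch one page over HTTP and return its HTML as a string."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse(html):
    """PARSE: extract the text of every <h2> element from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


# The parsing step can be tried offline on a small HTML string
example_html = "<html><body><h2>Cases</h2><h2>Velkommen</h2></body></html>"
print(parse(example_html))
print(map_urls("https://example.com/list", 2))
```

Keeping the steps separate makes it easier to add storing or logging later, since each function can be tested on its own.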
   


### Packages used
Today we will mainly build on the Python skills you have acquired so far; tomorrow we will look into more specialized packages.

* for connecting to the internet we use: **requests**
* for parsing: **beautifulsoup** and **regex**
* for automatic browsing / screen scraping: **selenium** 
* for mitigating errors we use: **time**

We will write our scrapers in basic Python. For larger projects, consider looking into the package **scrapy**.

In [1]:
# check that you can import these libraries
# otherwise they can easily be installed using pip
# example: https://pypi.org/project/beautifulsoup4/

import requests
from bs4 import BeautifulSoup
import re
import selenium
import time
import pandas as pd

## Connecting to the Internet


**Connecting to the internet via HTTP**

*URL*: the address line in our browser.

Via HTTP we send a **GET** request to an *address* with *instructions* (or rather, a DNS server translates the domain name into the IP address that our request is sent to).

*Address / Domain*: www.google.com

*Instructions*: /search?q=who+is+mister+miyagi

*Header*: information sent along with the request, including user agent (operating system, browser), cookies, and preferred encoding.

*HTML*: HyperText Markup Language, the language used for displaying web content.
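
To make the split between address and instructions concrete, here is a small sketch that takes the example URL apart with the standard library's `urllib.parse`. Nothing is sent over the network; the variable names are chosen for this illustration only.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://www.google.com/search?q=who+is+mister+miyagi"
parts = urlsplit(url)

print(parts.scheme)  # the protocol
print(parts.netloc)  # the address / domain
print(parts.path)    # the path part of the instructions
print(parts.query)   # the raw query string

# Decode the query string into a dictionary of parameters
params = parse_qs(parts.query)
print(params)

# Going the other way: encode parameters back into a query string
rebuilt = f"{parts.scheme}://{parts.netloc}{parts.path}?{urlencode({'q': 'who is mister miyagi'})}"
print(rebuilt)
```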


In [30]:
# DO2021 website
url = 'https://nicklasjohansen.github.io/DO2021'
print(url)

https://nicklasjohansen.github.io/DO2021


In [31]:
requests.get(url)

<Response [200]>

In [32]:
# Datadrevet Organisationsanalyse
response = requests.get(url)
response.text


'<!DOCTYPE html>\n<html lang="en" itemscope itemtype="http://schema.org/WebPage">\n  <head>\n    \n\n  <meta charset="utf-8" />\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">\n\n  <title>Datadrevet Organisationsanalyse - Datadrevet Organisationsanalyse</title>\n  <meta name="description" content="Datadrevet Organisationsanalyse afholdes for første gang i efteråret 2021. Det er et valgfag der udbydes på det Institut for Statskundskab ved Københavns Universitet. Denne side indeholder kursusinformation til det underliggende GitHub repository.">\n  <meta name="author" content="Nicklas Johansen"/><script type="application/ld+json">\n{\n    "@context": "http://schema.org",\n    "@type": "WebSite",\n    "name": "Datadrevet Organisationsanalyse",\n    \n    "url": "https:\\/\\/nicklasjohansen.github.io\\/DO2021\\/"\n}\n</script><script type="application/ld+json">\n{\n  "@context": "http://sch

## Introduction to HTML
[What is HTML?](https://www.w3schools.com/whatis/whatis_html.asp)  

HTML has a Tree structure. 

Each node in the tree has:
- Children, siblings, parents, descendants. 
- Ids and attributes

<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png"/>


### Important syntax and patterns
_______________
```html 
<p>The p tag indicates a paragraph</p>
```
_______________
```html 
<b>The b tag makes the text bold, giving us a clue to its importance </b>
```
output: <b>The b tag makes the text bold, giving us a clue to its importance </b>
```html 

<em>The em tag emphasizes the text</em>, giving us a clue to its importance
```
output: <em>The em tag emphasizes the text</em>, giving us a clue to its importance
___________
```html 
<h1>h1</h1><h2>h2</h2><h3>h3</h3><b>Headers give similar clues</b>
```
output:
<h1>h1</h1><h2>h2</h2><h3>h3</h3><b>Headers give similar clues</b>  
  
```html 
<a href="https://www.google.com">The a tag creates a hyperlink</a>
```
output: <a href="https://www.google.com">The a tag creates a hyperlink</a>

### How do we find our way around this tree?
1. ```BeautifulSoup```: A powerful, principled and readable way to parse data and navigate HTML
2. CSS selectors and XPath: Specifying paths to elements using CSS selector or XPath syntax.
3. Regex: Extracting string patterns using `.split` and regular expressions.
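
As a rough comparison, the sketch below extracts the same information from a small, made-up HTML snippet in each of the three ways. The snippet only imitates the structure of the course page and is an assumption for illustration. XPath is left out because BeautifulSoup itself does not support it.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="posts">
  <h2 class="post-title">Cases</h2>
  <a href="/DO2021/post/cases/">Cases</a>
  <h2 class="post-title">Install</h2>
  <a href="/DO2021/post/install/">Install</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. BeautifulSoup: search by tag name and class attribute
titles_bs = [h2.text for h2 in soup.find_all("h2", class_="post-title")]

# 2. CSS selector: every <a> inside the element with id="posts"
links_css = [a["href"] for a in soup.select("#posts a")]

# 3. Regular expression on the raw string (brittle, shown for comparison)
links_re = re.findall(r'href="([^"]+)"', html)

print(titles_bs)
print(links_css)
print(links_re)
```

The regular expression works here only because the snippet is tiny and regular; on real pages the first two approaches tend to be more robust.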

## Parsing HTML with BeautifulSoup
BeautifulSoup makes the html tree navigable. 
It allows you to:
- Search for elements by tag name and/or by attribute.
- Iterate through them, go up, sideways or down the tree.
- It also helps with standard tasks such as extracting raw text from HTML. Hardcoding this with `.split` commands would be very tedious, and hand-written regular expressions tend to be unstable.
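
A minimal sketch of moving up, sideways and down the tree, using a made-up HTML string rather than a real page:

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>First <b>bold</b> paragraph</p><p>Second paragraph</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")  # the first <p> element in the document

parent_name = first_p.parent.name                        # up: the enclosing tag
next_text = first_p.find_next_sibling("p").get_text()    # sideways: the next <p>
child_names = [c.name for c in first_p.find_all(True)]   # down: all descendant tags
raw_text = first_p.get_text()                            # the text with all tags removed

print(parent_name, next_text, child_names, raw_text, sep="\n")
```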

In [38]:
# DO2021 Website using BeautifulSoup
url = 'https://nicklasjohansen.github.io/DO2021/'
response = requests.get(url)
soup = BeautifulSoup(response.text,'lxml')
soup

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0" name="viewport"/>
<title>Datadrevet Organisationsanalyse - Datadrevet Organisationsanalyse</title>
<meta content="Datadrevet Organisationsanalyse afholdes for første gang i efteråret 2021. Det er et valgfag der udbydes på det Institut for Statskundskab ved Københavns Universitet. Denne side indeholder kursusinformation til det underliggende GitHub repository." name="description"/>
<meta content="Nicklas Johansen" name="author"/><script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "WebSite",
    "name": "Datadrevet Organisationsanalyse",
    
    "url": "https:\/\/nicklasjohansen.github.io\/DO2021\/"
}
</script><script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Organization",
  "na

In [37]:
# DO2021 Website without using BeautifulSoup
response = requests.get(url)
response.text

'<!DOCTYPE html>\n<html lang="en" itemscope itemtype="http://schema.org/WebPage">\n  <head>\n    \n\n  <meta charset="utf-8" />\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0">\n\n  <title>Datadrevet Organisationsanalyse - Datadrevet Organisationsanalyse</title>\n  <meta name="description" content="Datadrevet Organisationsanalyse afholdes for første gang i efteråret 2021. Det er et valgfag der udbydes på det Institut for Statskundskab ved Københavns Universitet. Denne side indeholder kursusinformation til det underliggende GitHub repository.">\n  <meta name="author" content="Nicklas Johansen"/><script type="application/ld+json">\n{\n    "@context": "http://schema.org",\n    "@type": "WebSite",\n    "name": "Datadrevet Organisationsanalyse",\n    \n    "url": "https:\\/\\/nicklasjohansen.github.io\\/DO2021\\/"\n}\n</script><script type="application/ld+json">\n{\n  "@context": "http://sch

In [42]:
print(type(response.text))
print(type(soup))

<class 'str'>
<class 'bs4.BeautifulSoup'>


In [16]:
soup.find_all('h1')

[<h1>Datadrevet Organisationsanalyse</h1>]

In [17]:
soup.find_all('h2')

[<h2 class="post-title">Cases</h2>,
 <h2 class="post-title">Velkommen</h2>,
 <h2 class="post-title">Install</h2>,
 <h2 class="post-title">Assignments</h2>,
 <h2 class="post-title">Eksamen</h2>]

In [23]:
soup.find_all('a')

[<a class="navbar-brand" href="https://nicklasjohansen.github.io/DO2021/">Datadrevet Organisationsanalyse</a>,
 <a href="/DO2021/" title="Home">Home</a>,
 <a class="navlinks-parent">Posts</a>,
 <a href="/DO2021/post/cases/">Cases</a>,
 <a href="/DO2021/post/velkommen/">Velkommen</a>,
 <a href="/DO2021/post/install/">Install</a>,
 <a href="/DO2021/post/assignments/">Assignments</a>,
 <a href="/DO2021/post/eksamen/">Eksamen</a>,
 <a href="/DO2021/page/l%c3%a6sning/" title="Læsning">Læsning</a>,
 <a href="/DO2021/page/tidsplan/" title="Tidsplan">Tidsplan</a>,
 <a href="https://nicklasjohansen.github.io/DO2021/" title="Datadrevet Organisationsanalyse">
 <img alt="Datadrevet Organisationsanalyse" class="avatar-img" src="https://nicklasjohansen.github.io/DO2021/img/ku_logo_uk_v.png"/>
 </a>,
 <a href="https://kurser.ku.dk/course/astk18379u/2021-2022">Datadrevet Organisationsanalyse</a>,
 <a href="https://polsci.ku.dk/">Institut for Statskundskab</a>,
 <a href="https://github.com/NicklasJohan

In [28]:
soup.find_all('a')[0].get('href')

'https://nicklasjohansen.github.io/DO2021/'
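
Many of the links in the output above are relative (for example `/DO2021/post/cases/`), and one `<a>` tag has no `href` at all. The sketch below shows one way to turn such links into absolute URLs with `urljoin` from the standard library. The HTML string is a shortened, hand-copied excerpt of the output above, so no request is made.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://nicklasjohansen.github.io/DO2021/"
html = """
<a class="navlinks-parent">Posts</a>
<a href="/DO2021/post/cases/">Cases</a>
<a href="https://polsci.ku.dk/">Institut for Statskundskab</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")  # None when the tag has no href attribute
    if href is None:
        continue
    links.append(urljoin(base_url, href))  # absolute URLs are left unchanged

print(links)
```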

## Navigating websites to collect links
Now I will show you a few common ways of finding the links to the pages you want to scrape.

### Building URLs using a recognizable pattern
A nice trick is to understand how urls are constructed to communicate with a server. 

Let's look at how [jobindex.dk](https://www.jobindex.dk/) does it. We simply click around and take note of how the address line changes.

This will allow us to navigate the page, without having to parse information from the html or click any buttons.

* `/` separates path segments, like folders on your computer.
* `?` marks the start of a query with parameters.
* `=` assigns a value to a parameter, e.g. `page=1000`, `offset=100` or `showNumber=20`.
* `&` separates different parameters.
* `+` is the URL encoding of a space.
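
Instead of gluing strings together by hand, the query part can be built with `urlencode` from the standard library, which inserts `?`-style separators, `=`, `&` and `+` for us. A small sketch, where `build_url` is a hypothetical helper and the parameter names `page` and `q` are taken from the jobindex.dk address line:

```python
from urllib.parse import urlencode

base_url = "https://www.jobindex.dk/jobsoegning/storkoebenhavn"


def build_url(page, query):
    """Combine the base address with encoded query parameters."""
    return base_url + "?" + urlencode({"page": page, "q": query})


print(build_url(2, "python"))
print(build_url(1, "data scientist"))  # the space is encoded as +
```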

In [8]:
# Mapping exercise
url = 'https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=2&q=python'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup


<!DOCTYPE html>

<html lang="da-DK">
<head>
<title>Ledige job - Python - Storkøbenhavn, side 2 ud af 7 | Jobindex</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<link href="/res/bootstrap-multiselect/dist/css/bootstrap-multiselect.css?h=6a6b68a249811e054fa8d759452816a9248c0748" rel="stylesheet"/><link href="/res/select2/dist/css/select2.min.css?h=a170ecdd58f00519741ed4b63abc064ef35db1a9" rel="stylesheet"/><link href="/res/bootstrap-datepicker/dist/css/bootstrap-datepicker3.standalone.min.css?h=f02cbfe4614ade97b3e5823be92702ae7bd445cd" rel="stylesheet"/><link href="/res/mapbox-gl/dist/mapbox-gl.css?h=0221a0dab467f93c80e8f5264c4f146e6d11496d" rel="stylesheet"/><link href="/res/font-awesome/css/font-awesome.css?h=ee906a8196d0fbd581c27a9d5615db4c250860f2" rel="stylesheet"/><link href="/css/_scss/fonts/roboto.css?h=e5bcd6527330b9ea940dd0de1cc29edbdd15519b" rel="stylesheet"/><link href="/css/_scss/fonts/frank_ruhl_libre.css?h=f908f8924442cd3fc8a7a73091

In [55]:
jobs = int(soup.find('span',attrs={'class':'d-md-none'}).text[0:3])
jobs

126

In [10]:
# 20 jobs per page
for i in range(round(jobs/20)+1):
    print(i)

0
1
2
3
4
5
6


In [11]:
for i in range(round(jobs/20)+1):
    print('https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=' + str(i) +'&q=python')

https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=0&q=python
https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=1&q=python
https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=2&q=python
https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=3&q=python
https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=4&q=python
https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=5&q=python
https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=6&q=python
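
The loops above use `round(jobs/20)+1` and start counting at page 0. A possible alternative is to round up with `math.ceil`. The sketch assumes that result pages are numbered from 1, which the page title "side 2 ud af 7" for `page=2` suggests but which is not verified here; `page_numbers` is a hypothetical helper.

```python
import math


def page_numbers(n_results, per_page=20, first_page=1):
    """Return the page numbers needed to cover n_results results."""
    n_pages = math.ceil(n_results / per_page)  # always round up
    return list(range(first_page, first_page + n_pages))


print(page_numbers(126))       # pages needed for 126 jobs
print(len(page_numbers(140)))  # 140 jobs fit exactly on 7 pages
print(round(140 / 20) + 1)     # the rounding rule above would give 8 here
```

With exact multiples of 20 the `round(...)+1` rule requests one page more than needed, whereas rounding up does not.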


In [59]:
data = []

for i in range(round(jobs/20)+1):
    url = 'https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=' + str(i) +'&q=python'
    response = requests.get(url)
    soup = BeautifulSoup(response.text,'lxml')
    temp = soup.find_all('b')
    data.append(temp[::2])
    
data

## Ethical Considerations
* If a regular user can’t access it, we shouldn’t try to get it. [That is considered hacking](https://www.dr.dk/nyheder/penge/gjorde-opmaerksom-paa-cpr-hul-nu-bliver-han-politianmeldt-hacking). 
* Don't hit it too fast: sending requests too rapidly is essentially a denial-of-service (DoS) attack. [Again considered hacking](https://www.dr.dk/nyheder/indland/folketingets-hjemmeside-ramt-af-hacker-angreb). 
* Add headers stating your name and email with your requests to ensure transparency. 
* Be careful with copyrighted material.
* Fair use (take only the stuff you need)
* If you monetize the data, be careful not to compete directly with the party you took the data from.
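
Two of the points above, identifying yourself and not hitting the server too fast, can be sketched with a `requests.Session`. The name and email are placeholders, and `polite_get` with its one-second default delay is a hypothetical helper, not a recommended standard.

```python
import time
import requests

# A session sends the same headers with every request made through it
session = requests.Session()
session.headers.update({
    "User-Agent": "course-scraper (Jane Doe, jane.doe@example.com)",  # placeholder identity
    "From": "jane.doe@example.com",                                   # placeholder email
})


def polite_get(url, delay=1.0):
    """Wait `delay` seconds, then fetch the URL with the identifying headers."""
    time.sleep(delay)
    response = session.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response


print(session.headers["User-Agent"])
```

Calling `polite_get(url)` inside a scraping loop in place of `requests.get(url)` would then space out the requests.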

<img src="https://github.com/snorreralund/images/raw/master/Sk%C3%A6rmbillede%202017-08-03%2014.46.32.png"/>

## Interactions and Automated Browsing
Sometimes a scraping task demands interaction (e.g. login, scrolling, clicking) and no XHR data can easily be found, so you need a browser to execute the scripts before you can get the data. XHR is short for XMLHttpRequest, a JavaScript API, like the one we found in the jobnet.dk exercise.

Here we use the `Selenium` package in combination with the `ChromeDriver`; you can download the latest release [here](https://chromedriver.chromium.org/downloads). It allows you to automate a browser.

Make sure to download the driver as well as the newest version of Selenium. `pip install selenium` should do the trick.

Some developers prefer to use [geckodriver](https://github.com/mozilla/geckodriver/releases) as an alternative to `ChromeDriver`.


In [61]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager


url = 'https://google.com'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

[WDM] - Current google-chrome version 95.0.4638
[WDM] - Trying to download new driver from http://chromedriver.storage.googleapis.com/95.0.4638.69/chromedriver_mac64.zip


 


[WDM] - Unpack archive /Users/nicklasjohansen/.wdm/drivers/chromedriver/95.0.4638.69/mac64/chromedriver.zip


In [48]:
# You can also download the driver to your computer
# Save it in your working directory and write the code

# import os
# directory = os.getcwd()
# path = os.path.join(directory, 'chromedriver')
# driver = webdriver.Chrome(executable_path=path)

### Benefits of automating browsing
1. You can access data that is not directly in the HTML code but is generated while browsing
2. You can get through login screens and other scraping barriers
3. You can automate browsing behaviour such as scrolling down

## Example: nboard.dk

In [62]:
# step 1: load the webpage we want to scrape in our virtual browser
url = 'https://nboard.dk/search'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

[WDM] - Cache is valid for [10/11/2021]
[WDM] - Looking for [chromedriver 95.0.4638.69 mac64] driver in cache 
[WDM] - Driver found in cache [/Users/nicklasjohansen/.wdm/drivers/chromedriver/95.0.4638.69/mac64/chromedriver]


 


In [63]:
# step 2: scroll down the page to load more profiles
import time

url = 'https://nboard.dk/search'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
time.sleep(3)

for i in range(5):
    time.sleep(3)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")



[WDM] - Cache is valid for [10/11/2021]
[WDM] - Looking for [chromedriver 95.0.4638.69 mac64] driver in cache 
[WDM] - Driver found in cache [/Users/nicklasjohansen/.wdm/drivers/chromedriver/95.0.4638.69/mac64/chromedriver]


 


NoSuchWindowException: Message: no such window: window was already closed
  (Session info: chrome=95.0.4638.69)


In [64]:
# step 3: save the soup and keep track of runtime

import time
from bs4 import BeautifulSoup

start_time = time.time()

url = 'https://nboard.dk/search'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

for i in range(5):
    time.sleep(3)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")

soup = BeautifulSoup(driver.page_source, 'lxml')

print("--- %s seconds ---" % round((time.time() - start_time),2))

[WDM] - Cache is valid for [10/11/2021]
[WDM] - Looking for [chromedriver 95.0.4638.69 mac64] driver in cache 
[WDM] - Driver found in cache [/Users/nicklasjohansen/.wdm/drivers/chromedriver/95.0.4638.69/mac64/chromedriver]


 
--- 20.61 seconds ---


In [54]:
# step 3 (alternative): keep scrolling until the page height stops growing, then save the soup

import time
from bs4 import BeautifulSoup

start_time = time.time()

url = 'https://nboard.dk/search'
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

time.sleep(3)

lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match=False
while(match==False):
    lastCount = lenOfPage
    time.sleep(1)
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount==lenOfPage:
        match=True

time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'lxml')

print("--- %s seconds ---" % round((time.time() - start_time),2))

[WDM] - Cache is valid for [12/08/2020]
[WDM] - Looking for [chromedriver 84.0.4147.30 mac64] driver in cache 
[WDM] - Driver found in cache [/Users/nicklasjohansen/.wdm/drivers/chromedriver/84.0.4147.30/mac64/chromedriver]


 
--- 193.0 seconds ---
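
The scroll-until-nothing-new-loads logic from the cell above can be wrapped in a function with an upper bound on the number of rounds, so that it cannot loop forever on a page that keeps growing. This is a sketch: `scroll_to_bottom`, `pause` and `max_rounds` are names invented here, and the only thing assumed about `driver` is that it has Selenium's `execute_script` method.

```python
import time

# Scroll to the bottom and report the current page height
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;"


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached."""
    last_height = driver.execute_script(SCROLL_JS)
    for _ in range(max_rounds):
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script(SCROLL_JS)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return last_height
```

After `driver.get(url)`, calling `scroll_to_bottom(driver)` would replace the `while` loop, and `driver.page_source` can then be passed to BeautifulSoup as before.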


In [68]:
# step 4: use the soup to generate our mapping of urls (profiles) that we want to scrape

names = soup.find_all('span', {'class': 'name'})

urls = []
for i in range(len(names)):
    temp = 'https://nboard.dk/candidate_profile/'+ str(names[i].text)
    temp = temp.replace(' ','-')
    temp = temp.replace('--','-')
    urls.append(temp)

print(len(urls))
print(urls[3])

80
https://nboard.dk/candidate_profile/Niels-Brinch


In [70]:
# step 5: scraping profiles 
import requests
import pandas as pd

start_time = time.time()

name = []
subtitle = []
location = []
resume = []

for i in range(5): #len(urls)
    response = requests.get(urls[i])
    html = response.text
    
    if 'Internal server error' in html:
        continue
    
    soup = BeautifulSoup(html, "html.parser")
    name.append(soup.find('title').text)
    subtitle.append(soup.find('span', {'class': 'sub-title'}).text)
    location.append(soup.find('span', {'class': 'location'}).text)
    resume.append(soup.find('span', {'class': 'resume'}).text)

df = pd.DataFrame({'name':name, 
                   'subtitle':subtitle, 
                   'location':location, 
                   'resume':resume})

print("--- %s seconds ---" % round((time.time() - start_time),2))

df

--- 37.55 seconds ---


| | name | subtitle | location | resume |
|---|---|---|---|---|
| 0 | Carsten Nielsen | Erfaren leder med stærke kommercielle kompetencer | København, Danmark | Mere end 25 års ledelseserfaring som ejerleder... |
| 1 | Niels Brinch | Specialist i ledelse af SaaS-produkter | København, Danmark | Leder af SaaS-lignende produkter siden 1999 og... |
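
In the loop above, `soup.find(...)` returns `None` when an element is missing, and calling `.text` on `None` raises an `AttributeError`. One possible way to guard against this is sketched below. The helpers `text_or_none` and `parse_profile` are hypothetical, and the HTML string is invented for the example; only the class names are taken from the cell above.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, class_name):
    """Return the element's text, or None if the element is not found."""
    element = soup.find(tag, {"class": class_name})
    return element.get_text(strip=True) if element is not None else None


def parse_profile(html):
    """Parse one profile page into a dictionary with one key per field."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title")
    return {
        "name": title.get_text(strip=True) if title is not None else None,
        "subtitle": text_or_none(soup, "span", "sub-title"),
        "location": text_or_none(soup, "span", "location"),
        "resume": text_or_none(soup, "span", "resume"),
    }


# Made-up profile page without a resume element
example_html = """
<html><head><title>Jane Doe</title></head>
<body><span class="sub-title">Board member</span>
<span class="location">København, Danmark</span></body></html>
"""
row = parse_profile(example_html)
print(row)
```

Collecting one dictionary per profile in a list and passing that list to `pd.DataFrame` keeps the columns aligned even when a field is missing.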


### Next level scrapers

You have now learned some of the fundamentals of collecting and parsing data and should be well equipped for your exam project. It is important to note, though, that you might run into challenges we have not yet learned to deal with. Facebook, LinkedIn, Google and the other big tech firms are battling scrapers and have done all kinds of things to make it hard for us to scrape public data on their sites. I have found some articles that you might find interesting.

- [Most Commonly used techniques to Prevent Scraping:](https://medium.com/@betoayesa/using-the-content-as-an-anti-scrape-weapon-draft-9bb10cd30e5c)
- [Advanced Web Scraping Tactics](https://www.pluralsight.com/guides/advanced-web-scraping-tactics-python-playbook)
- [Scraping Sites That Use JavaScript and AJAX](https://oup-arc.com/protected/files/content/file/1505319833942-CH9---Scraping-Sites-that-Use-JavaScript-and-AJAX.pdf)
- [Get Started Scraping LinkedIn With Python and Selenium](https://medium.com/nerd-for-tech/linked-in-web-scraper-using-selenium-15189959b3ba)

# Associated Readings

Readings:
- [Python for Data Analysis, chapter 6](https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf)
- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
- [An introduction to web scraping with Python](https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-a2601e8619e5)
- [Introduction to Web Scraping using Selenium](https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72)

# session_9_exercises.ipynb
Will be uploaded to GitHub.
- Method 1: sync your cloned repo
- Method 2: download from git repo

`Remember` to create a local copy of the notebook