# Session 8: Advanced webscraping; Automated Browsing and Regular Expressions



## Recap

In sessions 6 and 7 we learned how to:

1. Map our URLs of interest
2. Download the HTML of the webpages
3. Parse the data from the HTML
    - HTML is the language behind webpages
    - We can use `BeautifulSoup` to find the right places in the HTML (where is the data of interest hidden in the HTML?)
    - We learned how to structure our acquired data in dataframes

Our focus was on the HTML and how to extract information from it! 
- Sometimes just downloading the HTML is not enough to extract the data you need

We might need to interact with the webpage to bring forward the information in the HTML
- Here automated browsing is our friend, and that is the focus of session 8!

In this session, you will also learn about regular expressions that you can use to process raw text (i.e. not HTML text)

## Required readings

- Introduction to Web Scraping using Selenium: [An introduction to Selenium in Python](https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72)

- Introduction to pattern matching using regex: [An introduction to regex in python](https://www.digitalocean.com/community/tutorials/an-introduction-to-regex-in-python)

# General Questions in session 7

The webpage: https://www.basketball-reference.com/leagues/NBA_2018.html

1. How do you locate the "Eastern Conference" table?
2. How do I go through all the rows in the HTML code?

# 1. How do you locate the "Eastern Conference" table?
- Go to the Chrome Developer Tools on the webpage
- Notice that the tag is ```<table>```, and it has an "id" attribute: id = "confs_standings_E" --> we can uniquely identify the table!

In [44]:
import requests
from bs4 import BeautifulSoup

# Define our URL
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html' 

# Connects to site
response = requests.get(url)

# Parse data with BeautifulSoup
soup = BeautifulSoup(response.content,'lxml')

# Identify table to scrape by inspecting site
table_node = soup.find(id = 'confs_standings_E') 

## Now we begin to extract the information in the table

### First extract the column names:

In [45]:
columns_html = table_node.thead.find_all('th')
# Extract the text
columns = []
for col in columns_html:
    columns.append(col.text)

In [46]:
table_node.find('thead').find_all('th')

[<th aria-label="Eastern Conference" class="poptip sort_default_asc left" data-stat="team_name" scope="col">Eastern Conference</th>,
 <th aria-label="Wins" class="poptip right" data-stat="wins" data-tip="Wins" scope="col">W</th>,
 <th aria-label="Losses" class="poptip right" data-stat="losses" data-tip="Losses" scope="col">L</th>,
 <th aria-label="Win-Loss Percentage" class="poptip right" data-stat="win_loss_pct" data-tip="Win-Loss Percentage" scope="col">W/L%</th>,
 <th aria-label="GB" class="poptip sort_default_asc right" data-stat="gb" data-tip="Games Behind" scope="col">GB</th>,
 <th aria-label="Points Per Game" class="poptip right" data-stat="pts_per_g" data-tip="Points Per Game" scope="col">PS/G</th>,
 <th aria-label="Opponent Points Per Game" class="poptip right" data-stat="opp_pts_per_g" data-tip="Opponent Points Per Game" scope="col">PA/G</th>,
 <th aria-label="Simple Rating System" class="poptip right" data-stat="srs" data-tip="Simple Rating System; a team rating that takes i

In [47]:
columns

['Eastern Conference', 'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']

### Second we want to find the HTML of the row nodes where the rest of the data is:

In [48]:
rows_list = table_node.tbody.find_all('tr')

In [49]:
rows_list

[<tr class="full_table"><th class="left" data-stat="team_name" scope="row"><a href="/teams/TOR/2018.html">Toronto Raptors</a>*</th><td class="right" data-stat="wins">59</td><td class="right" data-stat="losses">23</td><td class="right" data-stat="win_loss_pct">.720</td><td class="right" data-stat="gb">—</td><td class="right" data-stat="pts_per_g">111.7</td><td class="right" data-stat="opp_pts_per_g">103.9</td><td class="right" data-stat="srs">7.29</td></tr>,
 <tr class="full_table"><th class="left" data-stat="team_name" scope="row"><a href="/teams/BOS/2018.html">Boston Celtics</a>*</th><td class="right" data-stat="wins">55</td><td class="right" data-stat="losses">27</td><td class="right" data-stat="win_loss_pct">.671</td><td class="right" data-stat="gb">4.0</td><td class="right" data-stat="pts_per_g">104.0</td><td class="right" data-stat="opp_pts_per_g">100.4</td><td class="right" data-stat="srs">3.23</td></tr>,
 <tr class="full_table"><th class="left" data-stat="team_name" scope="row">

#### Here are the children of one of the rows:

In [50]:
rows_list[0].contents

[<th class="left" data-stat="team_name" scope="row"><a href="/teams/TOR/2018.html">Toronto Raptors</a>*</th>,
 <td class="right" data-stat="wins">59</td>,
 <td class="right" data-stat="losses">23</td>,
 <td class="right" data-stat="win_loss_pct">.720</td>,
 <td class="right" data-stat="gb">—</td>,
 <td class="right" data-stat="pts_per_g">111.7</td>,
 <td class="right" data-stat="opp_pts_per_g">103.9</td>,
 <td class="right" data-stat="srs">7.29</td>]

# 2. How do I go through all the rows in the HTML code?
- We now have the column names and the HTML of the row nodes
    - We want to extract the rest of the data from the HTML of the row nodes
        - --> We need to take each row (first loop) and then go through all different elements in the row node HTML (second loop)

#### Let's first do it for *one* row node only:

In [51]:
row_node = rows_list[0]
row_node

<tr class="full_table"><th class="left" data-stat="team_name" scope="row"><a href="/teams/TOR/2018.html">Toronto Raptors</a>*</th><td class="right" data-stat="wins">59</td><td class="right" data-stat="losses">23</td><td class="right" data-stat="win_loss_pct">.720</td><td class="right" data-stat="gb">—</td><td class="right" data-stat="pts_per_g">111.7</td><td class="right" data-stat="opp_pts_per_g">103.9</td><td class="right" data-stat="srs">7.29</td></tr>

In [52]:
row = []
for child in row_node.children:
    row.append(child.text)

In [53]:
row

['Toronto Raptors*', '59', '23', '.720', '—', '111.7', '103.9', '7.29']

#### Now we will loop through all the row nodes:

In [54]:
data = []
for row_node in rows_list:
    row = []
    for child in row_node.children:
        row.append(child.text)
    data.append(row)

In [55]:
data

[['Toronto Raptors*', '59', '23', '.720', '—', '111.7', '103.9', '7.29'],
 ['Boston Celtics*', '55', '27', '.671', '4.0', '104.0', '100.4', '3.23'],
 ['Philadelphia 76ers*', '52', '30', '.634', '7.0', '109.8', '105.3', '4.30'],
 ['Cleveland Cavaliers*', '50', '32', '.610', '9.0', '110.9', '109.9', '0.59'],
 ['Indiana Pacers*', '48', '34', '.585', '11.0', '105.6', '104.2', '1.18'],
 ['Miami Heat*', '44', '38', '.537', '15.0', '103.4', '102.9', '0.15'],
 ['Milwaukee Bucks*', '44', '38', '.537', '15.0', '106.5', '106.8', '-0.45'],
 ['Washington Wizards*', '43', '39', '.524', '16.0', '106.6', '106.0', '0.53'],
 ['Detroit Pistons', '39', '43', '.476', '20.0', '103.8', '103.9', '-0.26'],
 ['Charlotte Hornets', '36', '46', '.439', '23.0', '108.2', '108.0', '0.07'],
 ['New York Knicks', '29', '53', '.354', '30.0', '104.5', '108.0', '-3.53'],
 ['Brooklyn Nets', '28', '54', '.341', '31.0', '106.6', '110.3', '-3.67'],
 ['Chicago Bulls', '27', '55', '.329', '32.0', '102.9', '110.0', '-6.84'],
 ['O
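With `columns` and `data` in hand, each row can be paired with the column names to get one record per team. The sketch below is a minimal illustration that hard-codes the column names and two of the rows from the outputs above; `records` is a name introduced here for illustration. A list of dictionaries like this can, for example, be passed on to `pandas.DataFrame`.

```python
# Column names and two rows, copied from the outputs above
columns = ['Eastern Conference', 'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']
data = [
    ['Boston Celtics*', '55', '27', '.671', '4.0', '104.0', '100.4', '3.23'],
    ['Philadelphia 76ers*', '52', '30', '.634', '7.0', '109.8', '105.3', '4.30'],
]

# Pair every value with its column name: one dictionary per team
records = [dict(zip(columns, row)) for row in data]

print(records[0]['Eastern Conference'], records[0]['W'])
```

Note that all values are still strings at this point; converting them to numbers is a separate step.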

## Overview of Session 8
Today we will learn about automated browsing and regular expressions

1. Automated browsing
    - Why is it useful?
    - Learning by doing: We will browse through www.nboard.dk
        - You will learn about scrolling, clicking, sending keys and combining `Selenium` and `BeautifulSoup`
2. Regular Expressions
    - What is it?
    - Where can you learn more?
    - Build simple regular expressions

# 1. Interactions and Automated Browsing

Automated browsing means letting the computer do the things you normally do:
- Log in
- Type in search text
- Scroll down page
- Click on links

Sometimes webscraping demands such interactions with the webpage to extract the information you want

To make the interactions we use the Python package [`Selenium`](https://selenium-python.readthedocs.io/). In combination with the virtual browser [`ChromeDriver`](https://chromedriver.chromium.org/) we can completely automate our browsing!

Note: If you have not installed `Selenium` yet, `pip install selenium` should do the trick.

#### Let's see how it works:

- In the code below we first open our virtual browser 
- From the virtual browser we can execute `Selenium` commands (for example go to google.com)

In [56]:
#pip install selenium

In [57]:
#pip install webdriver_manager

In [59]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager # Not used below; only needed if Selenium cannot fetch the driver itself

# We open up google.dk in a virtual browser
url = 'https://www.google.dk'

# Start a Chrome session. Selenium 4.6 and later locate or download a matching ChromeDriver automatically
driver = webdriver.Chrome()

# Once the driver is set up, we can use it to navigate to the provided URL
driver.get(url) # Go to google.dk

## Benefits from automated browsing
1. You can access data that is not directly in the HTML but is generated while browsing
2. You can get through login screens and other scraping barriers
3. You can automate browsing behaviour such as scrolling down

# Video 8.1: Automated browsing with Selenium

## Learning by doing: Automated browsing of www.nboard.dk

www.nboard.dk is a website that connects companies with potential board members. 

In this exercise we want to browse the site for potential board members. We will do this automatically with `Selenium`.

#### Step 1: 
Load the webpage we want to scrape in our virtual browser

In [61]:
url = 'https://nboard.dk/search'
driver = webdriver.Chrome()
driver.get(url)

#### Step 2: 
We want to click away the "cookie notification"

In [63]:
from selenium.webdriver.common.by import By
cookie = driver.find_element(By.CSS_SELECTOR, '.cc-dismiss') #Here we use a CSS selector
cookie.click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".cc-dismiss"}
  (Session info: chrome=115.0.5790.114); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
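The `NoSuchElementException` above is raised because `find_element` found nothing matching `.cc-dismiss`, for instance when the banner is not shown or has not loaded yet. One way to make this step tolerant, sketched below, is to use `find_elements` (plural), which returns an empty list instead of raising. `click_if_present` is a hypothetical helper name, not part of Selenium; it only assumes a driver object with a `find_elements(by, value)` method.

```python
def click_if_present(driver, by, selector):
    """Click the first element matching the selector; return False if none exists."""
    matches = driver.find_elements(by, selector)  # empty list when nothing matches
    if not matches:
        return False
    matches[0].click()
    return True

# Intended use with the driver from Step 1 (requires a running browser):
# click_if_present(driver, By.CSS_SELECTOR, '.cc-dismiss')
```

The return value tells you whether a banner was actually dismissed, so the rest of the script can continue either way.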


#### Step 3: 
We only want the board members ("Bestyrelsesmedlem")
- We need to click the box "Bestyrelsesmedlem"

In [None]:
boardmember = driver.find_element(By.ID, 'mat-checkbox-2') #Here we use the id attribute to find the boardmember box
boardmember.click()

#### Step 4: 
Now we scroll down the page to load more profiles

In [None]:
import time

for i in range(5): #We scroll down 5 times and sleep for 3 seconds each time to wait for the webpage to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") #Execute JavaScript in the browser that scrolls to the bottom of the page
    time.sleep(3)
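A fixed number of scrolls can stop too early or keep going after the page has run out of content. A possible alternative, sketched here, compares the page height before and after each scroll and stops when it no longer grows. `scroll_to_bottom`, `pause` and `max_rounds` are names introduced for illustration; the function only assumes a driver with Selenium's `execute_script` method.

```python
import time

def scroll_to_bottom(driver, pause=3, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new profiles
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded
            return round_number
        last_height = new_height
    return max_rounds

# Intended use with the driver from Step 1 (requires a running browser):
# scroll_to_bottom(driver)
```

The `max_rounds` cap is there so that a page with endless content cannot keep the loop running forever.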

#### Step 5: 
We decide that we only want profiles with the surname "Hansen". So we need to go to "Søg på kandidatnavn" and type in Hansen

In [64]:
# Find the place to type in search text
candidate = driver.find_element(By.ID, 'mat-input-4')
candidate.click() #And click
# Type the search text
candidate.send_keys('Hansen') #`.send_keys` imitates your keyboard, so you can also send special keys such as `Keys.RETURN` or `Keys.PAGE_DOWN` (from `selenium.webdriver.common.keys`).



#### Step 6: 
Now we want to know how many profiles satisfy our criteria. So first we need to save the HTML with `BeautifulSoup`

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml') #The Selenium Driver keeps the HTML in the `.page_source` property

# Find the place where the number of profiles is shown
results = soup.find_all('span', class_ = "ng-tns-c10-0")
results = results[4] #Several elements match the search, so we keep only the one that shows the number of profiles

# We only want the text content of the HTML
profiles = results.text #We take the text

In [None]:
profiles

#### Step 7: 
We only want the number of profiles, but there is still some other text as well

We cannot get any further with `BeautifulSoup`, so we use `Regex` to take out the number from the string

In [None]:
# We are only interested in the number, so we take out the number from the string using RegEx
import re
# '\d' matches a digit, and '+' means "one or more", so r'\d+' matches the first run of consecutive digits.
# The 'r' prefix makes this a raw string, so backslashes are passed to regex unchanged (for example, '\n' is not turned into a new line).
number_profiles = re.search(r'\d+', profiles)
number_profiles = number_profiles.group()

In [None]:
number_profiles
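`re.search` returns `None` when nothing matches, and `.group()` always returns a string. The sketch below guards against the first and converts the result to an integer. The example strings are made up and do not come from the website, and `extract_first_number` is a hypothetical helper name.

```python
import re

def extract_first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

print(extract_first_number('Showing 128 profiles'))  # made-up example string
print(extract_first_number('No profiles found'))     # no digits, so None
```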

## Next level scrapers

You have now learned the fundamentals of collecting and parsing data from the web. 

#### One last note about challenges of scraping: 
Many online companies have made a business out of data
- They do not necessarily want to share all their data with you 
- Facebook, LinkedIn, Google and all the other big tech firms are battling scrapers by creating all kinds of obstacles to make it hard for us to scrape their data

I have found some articles that address these obstacles. You might find them interesting.

- [Most Commonly used techniques to Prevent Scraping:](https://medium.com/@betoayesa/using-the-content-as-an-anti-scrape-weapon-draft-9bb10cd30e5c)
- [Advanced Web Scraping Tactics](https://www.pluralsight.com/guides/advanced-web-scraping-tactics-python-playbook)
- [Scraping Sites That Use JavaScript and AJAX](https://oup-arc.com/protected/files/content/file/1505319833942-CH9---Scraping-Sites-that-Use-JavaScript-and-AJAX.pdf)
- [Get Started Scraping LinkedIn With Python and Selenium](https://medium.com/nerd-for-tech/linked-in-web-scraper-using-selenium-15189959b3ba)

# Remember
You know the fundamentals of web scraping
- But a web scraping course will never be able to prepare you for all situations
- The only way to get better is to go work on your own web scraping problems

# Video 8.2: Extracting patterns from text using RegEx

# 2. Regex
A regular expression (shortened as regex) is a sequence of characters that defines a search pattern

The patterns are used by string-searching algorithms for "find"- or "find and replace"- operations on strings

### Examples
- Extract currency and amount from raw text: $ 20, 10.000 dollars, 10,000 £
- Email addresses: Design a pattern, that captures only the uses of @ within an email.
- URLs: Define all the different ways of writing URLs (https, http, no http). 
- Dates: There are many variations: 17th of June 2017, 06/17/17 or 17. June 17
- Addresses 
- Phone numbers: 88888888 or 88 88 88 88 or +45 88 88 88 88
- Emojis in text: Capturing all the different ways of expressing smiley faces with one regular expression
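To make one of these examples concrete, here is a minimal sketch of a pattern for the phone-number formats listed above. It assumes eight-digit numbers written either without spaces or in pairs, with an optional `+45` prefix; real-world data will usually need a more forgiving pattern. The sample sentence is made up.

```python
import re

# Optional "+45" prefix, then four pairs of digits with optional single spaces
phone_pattern = re.compile(r'(?:\+45 ?)?(?:\d{2} ?){3}\d{2}')

sample = 'Call 88888888, 88 88 88 88 or +45 88 88 88 88'  # made-up sentence
print(phone_pattern.findall(sample))
```

Because the groups are non-capturing, `findall` returns the full matched numbers rather than parts of them.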

#### Note:
- We will only scratch the surface of regex!

- It takes time to understand the intuition behind regex

- The only way to become better is by using it in practice!

## Resources
- Use this interactive regex tester to test your regex: http://regexr.com/
- Interactive tutorial: https://regexone.com/
- Lookup all special characters: https://www.regular-expressions.info/refquick.html

## Some important syntax for building your own expressions
### See more in this [tutorial](https://www.digitalocean.com/community/tutorials/an-introduction-to-regex-in-python) and this [guide](https://www.regular-expressions.info/refquick.html)
* \+ = 1 or more times  -- e.g. "a+" will match: "a", and "aaa"
* \* = 0 or more times  -- e.g. "ba*" will match: "b", and "ba", and "baaa"
* {3} = exactly three times --- e.g. "ba{3}" will match "baaa", but not "baa"
* ? = once or none
* \\ = escape character, used to find characters that have a special meaning in regex: e.g. \+ \*
* [] = allows you to define a set of characters
* () = groups a part of the regular expression
* ^ = applied within a set, it becomes the inverse of the set defined. Applied outside a set it entails the beginning of a string. $ entails the end of a string.
* . = any character except a line break
* | = or statement. -- e.g. a|b means find characters a or b.
* \d = digits
* \D = non-digits.
* \s = whitespace character (space, tab, line break)
* \w = matches any alphanumeric character or underscore [a-zA-Z0-9_]
* \W = matches any character that is not alphanumeric or underscore [^a-zA-Z0-9_]
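A quick way to get a feel for this syntax is to test a few strings against small patterns. The sketch below uses `re.fullmatch`, which only succeeds when the whole string fits the pattern; the patterns follow the list above, and some of the test strings are made up.

```python
import re

# (pattern, candidate string) pairs to try out
cases = [
    (r'a+', 'aaa'),         # one or more "a"
    (r'ba*', 'b'),          # "b" followed by zero or more "a"
    (r'ba{3}', 'baa'),      # needs exactly three "a", so this one fails
    (r'colou?r', 'color'),  # the "u" is optional
    (r'[^0-9]+', 'abc'),    # inverted set: anything but digits
    (r'a|b', 'b'),          # either "a" or "b"
]

results = {(p, s): bool(re.fullmatch(p, s)) for p, s in cases}
for (pattern, string), matched in results.items():
    print(pattern, repr(string), matched)
```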

Sequences
* (?:) = Defines a Non-capturing group. -- e.g. "(?:abc)+", will match "abc" and "abcabcabc", but not "aabbcc"
* (?=)	= Positive lookahead - only match a pattern if a certain other pattern comes after it.
* (?!)	= Negative lookahead - only match a pattern if a certain other pattern does **not** come after it.
* (?<=)	= Positive lookbehind - only match a pattern if a certain other pattern precedes it.
* (?<!) = Negative lookbehind - only match a pattern if a certain other pattern does **not** precede it.
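Lookarounds check the surrounding text without including it in the match. The sketch below tries all four on a made-up Danish sentence; the `\b` word boundaries are added so that the negative variants cannot match just a part of a number.

```python
import re

sample = 'cirka 20 procent af 167 millioner, ikke 33 procent'  # made-up sentence

# Positive lookahead: numbers followed by " procent"
followed = re.findall(r'\d+(?= procent)', sample)
# Negative lookahead: whole numbers not followed by " procent"
not_followed = re.findall(r'\b\d+\b(?! procent)', sample)
# Positive lookbehind: numbers preceded by "cirka "
preceded = re.findall(r'(?<=cirka )\d+', sample)
# Negative lookbehind: whole numbers not preceded by "cirka "
not_preceded = re.findall(r'(?<!cirka )\b\d+', sample)

print(followed, not_followed, preceded, not_preceded)
```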

## Regular expressions in action

### In the code pieces below, you will see some common uses of regular expressions

#### First we need some text to practice on
We will use a piece of one of the articles we downloaded from www.dr.dk in session 7

In [None]:
text = 'Gazprom halverer gasleverancerne til Europa via Nord Stream 1. Årsagen er ifølge selskabet vedligehold af en gasturbine. Den daglige gasforsyning via gasledningen vil fra onsdag morgen blive reduceret til 33 millioner kubikmeter, oplyser Gazprom.Det svarer til cirka 20 procent af den maksimale kapacitet, og det fremgår ikke, hvor længe den yderligt reducerede forsyning af gas vil stå på.Den tyske regering anser den forklaringen om vedligeholdelse for at være opfundet til lejligheden.- Ifølge vores oplysninger er der ingen teknisk grund til en reduktion i leverancerne, siger en talskvinde for Finansministeriet og minister Robert Habeck til Frankfurter Allgemeine Zeitung.Tyskerne får 25 procent af deres energi fra gas, hvor en overvejende del er kommet fra Rusland.Gasprisen stiger med 10 procentDet er anden gang indenfor en uge, at Gazprom reducerer leverancen af gas under påskud af reperation af gasturbiner. Da Gazprom efter ti dages vedligehold i sidste uge genåbnede for gasforsyningen til Tyskland, var meldingen, at der dagligt ville blive leveret cirka 67 millioner kubikmeter.Gazproms seneste melding betyder altså, at leverancerne til Europa bliver omtrent halveret fra onsdag. Gasledningen kan, når den kører for fuld kraft, levere cirka 167 millioner kubikmeter gas om dagen.Nordstream 1-faciliter i Lubmin i Tyskland. (Foto:\xa0HANNIBAL HANSCHKE ©\xa0Ritzau Scanpix)'

In [None]:
text

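#### 1. Find the first number in a text
Use the [`search()`](https://www.pythontutorial.net/python-regex/python-regex-search/) function

- `\d` matches any single digit; `\d+` matches one or more digits in a row, so the code below returns the first whole number in the text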

In [None]:
import re
first_digit = re.search(r'\d+', text) 
first_digit.group() #group() returns the matched string

#### 2. Find all numbers in a text
Use the [`findall()`](https://www.pythontutorial.net/python-regex/python-regex-findall/) function:

In [None]:
all_digits = re.findall(r'\d+', text) 
all_digits

#### 3. Find all numbers followed by 'millioner'


In [None]:
millioner = re.findall(r'\d+ millioner', text) 
millioner
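If only the number is wanted, without the word 'millioner', a capturing group does the job: when the pattern contains exactly one group, `re.findall` returns just the text matched by that group. The sample string below is a shortened, made-up stand-in for `text`.

```python
import re

sample = 'reduceret til 33 millioner kubikmeter, tidligere 67 millioner, maksimalt 167 millioner'

# The parentheses capture the digits; " millioner" must follow but is not returned
amounts = re.findall(r'(\d+) millioner', sample)
print(amounts)
print([int(a) for a in amounts])  # convert the strings to integers
```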

#### 4. We are now interested in the quotes in the text
We need to search for the text with the pattern: 
- First a '-'
- Then the text
- Ended by a ','

`\w` matches any alphanumeric character or underscore \[a-zA-Z0-9_\]; in Python 3 it also matches letters such as æ, ø and å. Conversely, `\W` matches any other character \[^a-zA-Z0-9_\]

In [None]:
quote = re.findall(r'- [\w ]+,', text) 
quote
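In Python 3, `\w` in a string pattern also matches non-ASCII letters such as æ, ø and å, which is why the pattern above copes with Danish text. The sketch below contrasts the default behaviour with the `re.ASCII` flag, which restricts `\w` to `[a-zA-Z0-9_]`.

```python
import re

word = 'Ifølge'

unicode_words = re.findall(r'\w+', word)                  # default: Unicode-aware
ascii_words = re.findall(r'\w+', word, flags=re.ASCII)    # ASCII letters, digits and "_" only

print(unicode_words)
print(ascii_words)
```

With `re.ASCII` the "ø" no longer counts as a word character, so the word is cut into two pieces.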

#### 5. What if we want the information about the photo 
We want the information about the photo inside parentheses
- Remember that "(" and ")" are special characters in regex, so we have to escape them with "\"
- "." matches any character except "\n" (new line)

In [None]:
photo = re.findall(r'\(.+\)', text)
photo
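Note that `.+` is greedy: it matches as much as possible, so in a text with several parenthesised passages it runs from the first "(" to the last ")". Adding `?` after the quantifier makes it lazy. The example sentence below is made up.

```python
import re

sample = 'First (photo one) and then (photo two).'

greedy = re.findall(r'\(.+\)', sample)   # longest possible match
lazy = re.findall(r'\(.+?\)', sample)    # shortest possible matches

print(greedy)
print(lazy)
```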

#### 6. The text consists of different sentences. We want to break the text down to each sentence

The [`split()`](https://www.pythontutorial.net/python-regex/python-regex-split/) function can split a string at the occurrences of matches of a regular expression

Each sentence ends with a ".". Let us split on that:

In [None]:
sentences = re.split(r'\.', text) #Remember that "." is a special character in regex, so we need to escape it with "\"
sentences
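Splitting on every "." leaves leading spaces on some pieces and an empty string after the final full stop. Both are easy to tidy up, as sketched below on a short made-up string; `clean_sentences` is a name introduced for illustration.

```python
import re

sample = 'Gas prices rise. Deliveries are halved.No end date is given.'

parts = re.split(r'\.', sample)
print(parts)  # note the leading space and the trailing empty string

# Strip whitespace and drop empty pieces
clean_sentences = [part.strip() for part in parts if part.strip()]
print(clean_sentences)
```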