<img src="images/lasalle_logo.png" style="width:375px;height:110px;">

# Week 7 - Web Scraping

### WIM250 - Introduction to Scripting Languages 
### Instructor: Ivaldo Tributino

Source:
    
- Automate The Boring Stuff With Python by Al Sweigart
- Python for Everybody Exploring Data Using Python 3 by Dr. Charles R. Severance
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://realpython.com/beautiful-soup-web-scraper-python/


## Introduction

`Web scraping` or `web data extraction` is used for extracting data from websites. The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To effectively harvest that data, you’ll need to become skilled at `web scraping`. The Python libraries `requests` and `Beautiful Soup` are powerful tools for the job.  

<img src="images/webScraping.png" style="width:400px;height:300px;">

For example, Google runs many web scraping programs to index web pages for its search engine. In this class, you will learn about several `modules` that make it easy to scrape web pages in Python.

- **webbrowser** comes with Python and opens a browser to a specific page.
- **Requests** is a Python HTTP library that downloads files and web pages from the internet.
- **urllib** is a package that collects several modules for working with URLs.
- **Beautiful Soup** parses HTML and XML documents and can be used to extract data from them.


## 1. webbrowser Module

The `webbrowser` module provides a high-level interface to allow displaying Web-based documents to users. Under most circumstances, simply calling the `open()` function from this module will do the right thing.

For more information, see: https://docs.python.org/3/library/webbrowser.html

In [None]:
import webbrowser
webbrowser.open('https://www.lasallecollegevancouver.com')

The browser opened a tab at the URL `https://www.lasallecollegevancouver.com`. This is about the only thing the webbrowser module can do. Even so, the `open()` function does make some interesting things possible. For example, given a street address, you can bring up a map of it on Google Maps.

In [None]:
ad1 = '2665 Renfrew St. Vancouver'
ad2 = '4600 Cambie St, Vancouver'
ad3 = '1423 E 13th Ave, Vancouver'

for ad in [ad1, ad2, ad3]:
    webbrowser.open('https://www.google.com/maps/place/%s' %ad)
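The addresses above contain spaces and commas, which browsers usually tolerate but which are not valid characters in a URL path. A minimal sketch of percent-encoding each address first with the standard library's `urllib.parse.quote`; the helper name `maps_url` is an assumption made for this example and is not part of the course code:

```python
from urllib.parse import quote

def maps_url(address):
    """Build a Google Maps URL with the address percent-encoded."""
    return 'https://www.google.com/maps/place/' + quote(address)

for ad in ['2665 Renfrew St. Vancouver', '4600 Cambie St, Vancouver']:
    print(maps_url(ad))
```

The resulting string can be passed to `webbrowser.open()` exactly as before.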

## 2. Requests Module

The `requests module` lets you easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression. The requests module doesn’t come with Python, so you’ll have to install it first.

To install libraries in an Anaconda environment, you could try 

<img src="images/install1.png">

or

Open the terminal, select the environment and install the library according to the example below:

<img src="images/install.png">

In [None]:
import requests

In [None]:
url = 'https://automatetheboringstuff.com/files/rj.txt'
webbrowser.open(url)
r = requests.get(url) # GET request to retrieve data
r.status_code # 200 The request has succeeded, 404 status code for Not Found

In [None]:
type(r)

In [None]:
r.encoding # ISO-8859-1 is also known as "Latin alphabet no. 1" and consists of 191 characters from the Latin script.

In [None]:
r.headers['content-type'] # the Content-Type header describes the media type of the response body

Plain text is regular text with no formatting options such as bold, italics, underlines, or special layout options.

In [None]:
st = r.text[:251]
print(st)  

### 2.1 Checking for Errors

If you run a cell with the following line:
```python
res = requests.get('https://www.miktutors.com/page_that_does_not_exist')
```
You will not receive a `Traceback`. However, this `url` does not exist. Therefore, it is a good idea to use `raise_for_status()` to check the success of the call.

<img src="images/404.png" style="width:350px;height:225px;">

In [None]:
res = requests.get('https://www.miktutors.com/page_that_does_not_exist') 
res.status_code # to confirm that the url does not exist

In [None]:
res = requests.get('https://www.miktutors.com/page_that_does_not_exist')
res.raise_for_status() # raises an HTTPError exception if the request returned an error status

The `raise_for_status()` method is a good way to ensure that a program halts if a bad download occurs. You can wrap the `raise_for_status()` line in `try` and `except` statements to handle this error case without crashing.

In [None]:
res = requests.get('https://www.miktutors.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
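Catching `Exception` also hides unrelated bugs. A hedged sketch of catching only `requests.exceptions.HTTPError` follows. To stay independent of the network, it builds `requests.Response` objects by hand and sets their status codes; that is an assumption made purely for illustration, since real responses come from `requests.get()`. The helper name `check` is also hypothetical.

```python
import requests

def check(response):
    """Return 'ok' or a short error message for a Response object."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return 'There was a problem: %s' % exc
    return 'ok'

# Hand-made stand-ins for illustration, not real downloads.
fake_404 = requests.Response()
fake_404.status_code = 404
fake_200 = requests.Response()
fake_200.status_code = 200

print(check(fake_404))
print(check(fake_200))
```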

### 2.2 Saving downloaded files to the hard drive

First, you must open the file in write binary mode by passing the string `'wb'` as the second argument to `open()`. Even if the page is in plaintext (such as the Romeo and Juliet text we will download in the next cell), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.

To write the web page to a file, you can use a for loop with the Response object’s `iter_content()` method.

<img src="images/savingFiles.png" style="width:200px;height:270px;">

In [None]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))
    
    
playFile = open('RomeoAndJuliet.txt', 'wb') # "wb" mode opens the file in binary format for writing
for chunk in res.iter_content(100000): # One hundred thousand bytes at a time
    playFile.write(chunk)

playFile.close()
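The same chunked write can be sketched with a `with` statement, which closes the file even if a write fails. To keep the example offline, the chunks come from a hypothetical helper `fake_iter_content` that mimics what `iter_content()` yields; with a real `Response` you would loop over `res.iter_content(100000)` instead. The name `save_chunks` is likewise an assumption.

```python
import os
import tempfile

def fake_iter_content(data, chunk_size):
    """Yield successive byte chunks, similar to Response.iter_content()."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path in binary mode."""
    total = 0
    with open(path, 'wb') as out_file:   # closed automatically on exit
        for chunk in chunks:
            total += out_file.write(chunk)
    return total

data = 'Romeo and Juliet'.encode('utf-8') * 3
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
written = save_chunks(fake_iter_content(data, 10), path)
print(written)
```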

Now that we have a file on our hard drive, let's obtain some information from it, such as the `Title`, `Author` and `Posting Date`, using the `Regular Expressions` we learned in the last class.

In [None]:
import re # Regular expression
fhand = open('RomeoAndJuliet.txt')

for line in fhand:
    x = re.findall('Posting Date: (.+) ',line) # try 'Author: (.+)' and 'Posting Date: (.+) \[' 
    if len(x)>0:
        print(x)
        break
     

Now we are going to do something more interesting: discovering which name appears more often, Romeo or Juliet.

In [None]:
# In how many lines does each name, Romeo and Juliet, appear?
fhand = open('RomeoAndJuliet.txt')
dic = {}
for line in fhand:
    if line.find('Romeo')==-1 and line.find('Juliet')==-1 :
        continue
    if line.find('Romeo')!=-1: 
        dic['Romeo'] = dic.get('Romeo',0)+1
    if re.search('Juliet',line): 
        dic['Juliet'] = dic.get('Juliet',0)+1
dic
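Note that the loop above counts the lines that contain each name, so a line mentioning Romeo twice still counts once. A small sketch of counting every occurrence instead, using made-up sample lines rather than the downloaded file:

```python
import re

# Hypothetical sample lines, not taken from the downloaded text.
sample_lines = [
    'Romeo, Romeo! wherefore art thou Romeo?',
    'Juliet speaks to Romeo.',
    'No names here.',
]

counts = {}
for line in sample_lines:
    # findall returns one entry per occurrence, not one per line.
    for name in re.findall(r'Romeo|Juliet', line):
        counts[name] = counts.get(name, 0) + 1

print(counts)
```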

The result may be surprising: Romeo appears more than twice as often as Juliet.

Okay, but we are not only interested in text files. We would also like to get information in HTML format. For that we need `Beautiful Soup`. To install it, look for the package `beautifulsoup4`.

## 3. Beautiful Soup

`Beautiful Soup` is a Python library for pulling data out of `HTML` and `XML` files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Hypertext Markup Language (HTML) is the format that web pages are written in. This course assumes you have some basic experience with HTML, but if you need a beginner tutorial, I suggest the following site:
https://htmldog.com/guides/html/beginner/

**A Quick Refresher**
In case it’s been a while since you’ve looked at any `HTML`, here’s a quick overview of the basics. An `HTML` file is a plaintext file with the .html file extension. The text in these files is surrounded by `tags`, which are words enclosed in `angle brackets`.

```
<head>
  <title>
   The Dormouse's story
  </title>
 </head>

```

There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the `angle brackets`. For example, the `<a>` tag encloses text that should be a link. The URL that the text links to is determined by the `href` attribute. Here’s an example:

```
Al's free <a href="https://inventwithpython.com">Python books</a>
```


Let's work through an example using `Beautiful Soup`:


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>     

In [None]:
# The HTML used to produce the cell above
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
from bs4 import BeautifulSoup

soup1 = BeautifulSoup(html_doc, 'html.parser') # parsing text files formatted in HTML

type(soup1)

In [None]:
print(soup1.prettify()) # it will enable us to view how the tags are nested in the document.

### 3.1 Extracting all the URLs found within a page’s \<a> tags:

In [None]:
# The <a> tag defines a hyperlink. The href attribute specifies the URL of the page.
for link in soup1.find_all('a'):   #  This code finds all the <a> tags in the document:
    print(link.get('href'))        # to get the attributes

### 3.2 Extracting all the text from a page:

In [None]:
# Be careful with this function, we don't want to print anything that is excessively heavy.
print(soup1.get_text()) 

<font color='red'>DON’T USE REGULAR EXPRESSIONS TO PARSE HTML</font>

Locating a specific piece of HTML in a string seems like a perfect case for regular expressions. However, I advise you against it. There are many different ways that HTML can be formatted and still be considered valid HTML, but trying to capture all these possible variations in a regular expression can be tedious and error prone. A module developed specifically for parsing HTML, such as bs4, will be less likely to result in bugs.

You can find an extended argument for why you shouldn’t parse HTML with regular expressions at https://stackoverflow.com/a/1732454/1893164/.
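A short illustration of the point, under the assumption of a small made-up HTML snippet: a naive regular expression that expects `href` to be the first attribute and to use double quotes misses a perfectly valid link, while the parser finds both.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet: the second link has an extra attribute and single quotes.
html = """<p>
<a href="https://example.com/one">One</a>
<a class="ext" href='https://example.com/two'>Two</a>
</p>"""

# A naive pattern that assumes href comes first and uses double quotes.
regex_links = re.findall(r'<a href="(.+?)">', html)

# The parser does not care about attribute order or quote style.
soup = BeautifulSoup(html, 'html.parser')
parser_links = [a.get('href') for a in soup.find_all('a')]

print(regex_links)
print(parser_links)
```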

### 3.3 Now let's try to get information directly from a website.

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://nostarch.com'
res = requests.get(url)        
res.raise_for_status()
  
soup2 = BeautifulSoup(res.text, 'html.parser') 
type(soup2)

In [None]:
# We will see the information in HTML format
print(soup2.prettify()[0:1000])    # show only the first 1000 characters

In [None]:
#One common task is extracting all the URLs found within a page’s <link> tags:
for link in soup2.find_all('link'):
    print(link.get('href'))


In [None]:
# Take a look at the site
# MacOS
chrome_path = r'open -a /Applications/Google\ Chrome.app %s'  # raw string keeps the backslash as-is

# Windows
# chrome_path = 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s'

# Linux
# chrome_path = '/usr/bin/google-chrome %s'

import webbrowser
webbrowser.get(chrome_path).open('https://nostarch.com')

### 3.4 Finding an Element with the `select()` Method

You can retrieve a web page element from a `BeautifulSoup` object by calling the `select()` method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: they specify a pattern to look for, in this case in HTML pages instead of general text strings.

**CSS** stands for `Cascading Style Sheets`. CSS is a style sheet language for describing the presentation of documents written in HTML or XML.

A full discussion of CSS selector syntax is beyond the scope of this course. However, here’s a short introduction to selectors.


| Selector passed to the `select()` method | Will match . . . |
|---|---|
| `soup.select('div')` | All elements named `<div>` |
| `soup.select('#author')` | The element with an id attribute of `author` |
| `soup.select('.notice')` | All elements that use a **CSS** class attribute named notice |
| `soup.select('div span')` | All elements named `<span>` that are within an element named `<div>` |
| `soup.select('div > span')` | All elements named `<span>` that are directly within an element named `<div>`, with no other element in between |
| `soup.select('input[name]')` | All elements named `<input>` that have a name attribute with any value |
| `soup.select('input[type="button"]')` | All elements named `<input>` that have an attribute named type with value button |
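To see the selectors listed above in action, here is a minimal sketch that runs each one against a small made-up HTML snippet and counts the matches. The snippet and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML containing one example of each pattern.
html = """
<div id="author" class="notice"><span>Al</span></div>
<div><p><span>nested</span></p></div>
<input name="q" type="text">
<input type="button" value="Go">
"""
soup = BeautifulSoup(html, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# Map each selector to the number of elements it matches.
matches = {sel: len(soup.select(sel)) for sel in selectors}
for sel, count in matches.items():
    print(sel, '->', count)
```

Note how `'div span'` matches both `<span>` elements, while `'div > span'` matches only the one that is a direct child of a `<div>`.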



In [None]:
# First, let's work on a local file called example.html
from bs4 import BeautifulSoup

exampleFile = open('example.html')
exampleSoup = BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('#author')
print(type(elems)) # elems is a list-like collection of Tag objects (a ResultSet, which subclasses list, in recent bs4 versions)
print(len(elems))

In [None]:
exampleFile = open('example.html')
a = exampleFile.read()
a

In [None]:
elems[0].get('id') # like {'id': 'author'}.get('id')

In [None]:
print(elems[0].getText())

In [None]:
elems[0].attrs

In [None]:
print(exampleSoup.prettify())

Now that we are familiar with the `select()` method, we will use it on the LaSalle College website to create a list of the words in the website's footer.

<img src="images/footerLinks.png" style="width:600px;height:300px;">

In [None]:
exceeded = False
try:
    resLasalle = requests.get('https://www.lasallecollegevancouver.com')
except requests.exceptions.ConnectionError:
    exceeded = True
    print('Max retries exceeded with URL in requests')
    
    
if exceeded is False:
    lasalleSoup = BeautifulSoup(resLasalle.text, 'html.parser')
    elems = lasalleSoup.select('#FooterLinksSection')

    if len(elems)>0:

        footer = elems[0].getText()
        print(footer)

If you send many requests from the same IP address in a short period of time, you will likely receive the `'Max retries exceeded with URL in requests'` alert. If so, try the program below.

```python

import urllib.request
import ssl
import re
from bs4 import BeautifulSoup

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.lasallecollegevancouver.com'
uh = urllib.request.urlopen(url, context=ctx)

lasalleSoup = BeautifulSoup(uh, 'html.parser')
elems = lasalleSoup.select('#FooterLinksSection')

if len(elems)>0:

    footer = elems[0].getText()
    print(footer)
```    

In [None]:
footer

In [None]:
import re
dic = {}
keys = re.findall('\n\n(.+)\n\n',footer)
for idx in range(len(keys)):
    start = footer.find(keys[idx])
    try:
        end = footer.find(keys[idx+1])
    except IndexError:  # the last key has no following key
        end = len(footer)    
    dic[keys[idx]] = re.findall('\n(.+)',footer[start:end])
dic 

In [None]:
dic['About Us']
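The footer-parsing idea is easier to follow on a small string. The sketch below applies the same two patterns to made-up footer text, assuming each heading is surrounded by blank lines and followed by its links, one per line. The real footer may be laid out differently, and the names `sample_footer`, `headings` and `sections` are hypothetical.

```python
import re

# Made-up footer text: a heading surrounded by blank lines, then its links.
sample_footer = '\n\nAbout Us\n\nOur Story\nCareers\n\nPrograms\n\nDesign\nGaming\n'

headings = re.findall(r'\n\n(.+)\n\n', sample_footer)
sections = {}
for idx, heading in enumerate(headings):
    start = sample_footer.find(heading)
    # The section ends where the next heading starts, or at the end of the text.
    if idx + 1 < len(headings):
        end = sample_footer.find(headings[idx + 1])
    else:
        end = len(sample_footer)
    sections[heading] = re.findall(r'\n(.+)', sample_footer[start:end])

print(sections)
```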

### 3.5 Using the developer tools to find HTML elements

What if you’re interested in scraping Corrie Heringa's background? Right-click where it is on the page (or CONTROL-click on macOS) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page. The figure below shows the developer tools open to the HTML of:

```
https://www.lasallecollegevancouver.com/about-us/our-professors/corrie-heringa
```
Note that if the site changes the design of its web pages, you’ll need to repeat this process to inspect the new elements.

<img src="images/corrie.png" style="width:1000px;height:500px;">

In [None]:
import requests, bs4, time

exceeded = False
url = 'https://www.lasallecollegevancouver.com/about-us/our-professors/corrie-heringa'
try:
    resCorrie = requests.get(url)
except requests.exceptions.ConnectionError:
    exceeded = True
    print('Max retries exceeded with URL in requests')
    
if exceeded is False:
    CorrieSoup = bs4.BeautifulSoup(resCorrie.text, 'html.parser')
    elems = CorrieSoup.select('[class="MainContent"]')

    if len(elems)>0:

        text = elems[0].getText()
        print(text)

### 3.6 Download images from a website

Let's follow a project from the book <i>Automate The Boring Stuff With Python by Al Sweigart</i>. The project is to download each comic from the website http://xkcd.com/ . The website has a Prev button that guides the user back through prior comics. So we need to create a program that:

- Loads the XKCD home page.
- Saves the comic image on the page.
- Follows the Previous Comic link.
- Repeats until it reaches the first comic.


You’ll have a url variable that starts with the value https://xkcd.com and repeatedly update it 
(in a while loop). At every step in the loop, you’ll download the comic at url. You’ll know to end the loop 
when url ends with '#'. 

```python
url = 'https://xkcd.com'
```
You will download the image files to a folder in the current working directory named xkcd. 
The call `os.makedirs()` ensures that this folder exists, and the exist_ok=True keyword argument prevents 
the function from throwing an exception if this folder already exists. The remaining code is just 
comments that outline the rest of your program.

```python
os.makedirs('xkcd', exist_ok=True) 
```
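Before running the full program, it may help to see how its URLs are assembled. The sketch below assumes, as the program in this section does, that the image `src` value lacks the `https:` scheme and that the Prev link's `href` is a site-relative path, or `'#'` on the first comic. The helper names and the sample `src` value are hypothetical.

```python
import os

BASE = 'https://xkcd.com'

def image_url(src):
    """Prepend the scheme to a src value that starts with //."""
    return 'https:' + src

def next_page(href):
    """Build the next page URL from the Prev link's href."""
    return BASE + href

src = '//imgs.xkcd.com/comics/example_comic.png'   # hypothetical src value
print(os.path.basename(image_url(src)))   # file name used when saving
print(next_page('/1/'))
print(next_page('#').endswith('#'))       # the loop's stop condition
```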


In [None]:
import requests, os, bs4

url = 'https://xkcd.com'

os.makedirs('xkcd', exist_ok=True) 

while not url.endswith('#'):
    
    #  Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()  # to throw an exception and end the program if something went wrong with the download.
    
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img') # <img> element inside a <div> element with the id comic
    if not comicElem:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src') # get the src attribute
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd (only when an image was actually found).
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)),'wb') # naming the images
        for chunk in res.iter_content(100000): # writes out chunks of the image data (at most 100,000 bytes each)
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]      # <a> element with the rel attribute set to prev
    url = 'https://xkcd.com' + prevLink.get('href') # get the previous comic’s URL

print('Done.')

Now let's take a look at Hill Tribe's work. For more information, see https://fusionip.wixsite.com/thitima.

In [None]:
import re, requests, webbrowser, bs4

exceeded = False
try:
    res = requests.get('https://www.lcieducation.com/en/Portfolios/Students/20697-15868.aspx')
except requests.exceptions.ConnectionError:
    exceeded = True
    print('Max retries exceeded with URL in requests')

if exceeded is False:
    soup4 = bs4.BeautifulSoup(res.text, 'html.parser')
    images = soup4.find_all('img', {'src':re.compile(r'\.jpg')}) # find_all(name, attrs, recursive, string, limit, **kwargs)
    for image in images: 
        webbrowser.open(image['src'])


If you have exceeded the maximum number of requests, run the program below using the `urllib.request` library instead of `requests`.

For more information, see: https://docs.python.org/3/library/urllib.request.html

```python

import urllib.request
import ssl
import bs4

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.lasallecollegevancouver.com/about-us/our-professors/corrie-heringa'
uh = urllib.request.urlopen(url, context=ctx)

CorrieSoup = bs4.BeautifulSoup(uh, 'html.parser')
elems = CorrieSoup.select('[class="MainContent"]')

if len(elems)>0:

    text = elems[0].getText()
    print(text)

```

```python
import urllib.request
import ssl
import re
import webbrowser
import bs4

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.lcieducation.com/en/Portfolios/Students/20697-15868.aspx'
uh = urllib.request.urlopen(url, context=ctx)

soup4 = bs4.BeautifulSoup(uh, 'html.parser')
images = soup4.find_all('img', {'src':re.compile(r'\.jpg')})
for image in images: 
    webbrowser.open(image['src'])
```
