# MAIS Fall 2019 Workshop 3 - Scraping and Cleaning

In this notebook, we will walk through the basics of web scraping using the `requests`, `BeautifulSoup4`, and `selenium` Python packages. With these packages at your disposal, you should be able to scrape both statically and dynamically loaded sites by the end of this workshop.

The content in this notebook is broken up into the following sections:

1.   Introduction to `requests`, `BeautifulSoup4`, and DOM traversal
2.   Scrape a simple statically loaded site with `requests` and `bs4`
3.   Scrape a dynamically loaded site with `selenium` and `bs4`

## 1. Introduction to requests, BeautifulSoup4, and DOM traversal

We'll start by importing the necessary packages

In [0]:
import requests
from bs4 import BeautifulSoup

Next, we'll play around a bit with these packages to get familiar with them. Specifically, we'll try scraping the 7-day weather forecast for Boston, MA.

URL: https://forecast.weather.gov/MapClick.php?lat=42.3587&lon=-71.0567

Start by checking out the website (use Inspect Element), and get a feel for the site's HTML structure.

In [0]:
URL = 'https://forecast.weather.gov/MapClick.php?lat=42.3587&lon=-71.0567'

Once you've checked out the site and know what you're looking for, it's time to get scraping! We'll start by using the `requests` package which, as we've already mentioned, lets us retrieve a site's HTML via a GET request.

In [0]:
page = requests.get(URL)
print(page)

A `<Response [200]>` result from a GET request is good news! The `Response` object's `content` attribute contains the website's HTML as a byte string.

In [0]:
# print out the page's "content" field
#--- YOUR CODE HERE ---

print(page.content)

#----------------------
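To make the "byte string" point concrete, here is a small offline sketch. The `raw` value is a made-up stand-in for `page.content`, not data from the weather site, and decoding it as UTF-8 is an assumption that holds for most modern pages. `requests` also exposes an already-decoded `page.text` attribute.

```python
# Hypothetical stand-in for page.content: raw bytes, as received over the network.
raw = b'<p>Caf\xc3\xa9 forecast: 62 \xc2\xb0F</p>'

# Bytes must be decoded before they can be treated as text.
html_text = raw.decode('utf-8')

print(type(raw).__name__)
print(type(html_text).__name__)
print(html_text)
```

`BeautifulSoup` accepts either form, which is why the cells in this notebook pass `page.content` to it directly.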

As you can see, the HTML is messy. Lucky for us we can use `BeautifulSoup4` to parse it.

In [0]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)

The `BeautifulSoup` object also comes with a handful of attributes and methods for traversing the HTML DOM tree.

In [0]:
# 'soup' is a BeautifulSoup instance
print(type(soup))

# it comes with a .children attribute, which is an iterator over its direct children
print(type(soup.children))

In [0]:
# let's iterate through the immediate children of soup
# print out the variable type of each child
#--- YOUR CODE HERE ---

for child in soup.children:
  print(type(child))

#----------------------

As you can see, all the children of the HTML tree have been converted to `bs4` elements.

Next, let's convert the iterator to a list so it's easier to work with

In [0]:
#--- YOUR CODE HERE ---

children = list(soup.children)
print(children)

#----------------------

The third element of the children list contains the actual HTML document.

Retrieve the third element (index 2). We will convert its children to a list in the step after that.

In [0]:
#--- YOUR CODE HERE ---

html = list(soup.children)[2]
print(html)

#----------------------
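Why index 2? The sketch below parses a tiny hand-written document (an assumption standing in for the real page) and lists the type of each top-level child. With `html.parser`, the doctype and the newline after it each count as a child, so the `<html>` tag comes third.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real forecast page is far larger.
sample = "<!DOCTYPE html>\n<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
sample_soup = BeautifulSoup(sample, 'html.parser')

# Collect the class name of each direct child of the soup.
child_types = [type(child).__name__ for child in sample_soup.children]
print(child_types)

# The Tag object is the one we can keep traversing.
html_tag = list(sample_soup.children)[2]
print(html_tag.name)
```

Indexing by position like this is fragile, because a page with different whitespace would shift the indices. That is one reason the tag, id, and class lookups covered later in this section are usually preferable.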

We can keep traversing down this tree, until we reach an endpoint. Let's try parsing another child element one more time.

In [0]:
# get a list of the children of the html bs4 element
html_children = list(html.children)
print(html_children)

In [0]:
# the fourth element is the body of the HTML
body = html_children[3]
print(body)

While we can keep traversing the tree to find what we want, this clearly isn't very efficient. The last thing we will cover in this section is how to access HTML elements by a known tag, id, or class.

In [0]:
# BeautifulSoup4 lets us find all instances of a tag
# we can also find the first instance with just soup.find('p')
p_tags = list(soup.find_all('p'))
print(p_tags)

In [0]:
# we can search for all instances of an id
forecast_list_id = list(soup.find_all(id='seven-day-forecast-list'))
print(forecast_list_id)

In [0]:
# we can search for all instances of a class too
# and we can search within a previously found element
tombstone_classes = list(soup.find_all(class_='forecast-tombstone'))
print(tombstone_classes)

In [0]:
# lastly, we can use CSS combinator selectors with the 'select()' method
# to learn more about CSS combinator selectors, visit
# https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Combinators_and_multiple_selectors

# this line finds all <p> elements in a <div> element
p_in_div = list(soup.select("div p"))
print(p_in_div)

Once we get to a leaf node in the HTML tree structure, we can use the `get_text()` function to retrieve the content.

In [0]:
print(p_in_div[3].get_text())
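The lookups from this section can be compared side by side on a small inline document. This is a sketch only: the markup and the `demo_` names are invented for illustration and do not come from the weather site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, kept tiny so the results are easy to check by eye.
demo_html = """
<div id="forecast">
  <p class="period">Tonight</p>
  <p class="period">Monday</p>
</div>
<p>Outside the div</p>
"""
demo_soup = BeautifulSoup(demo_html, 'html.parser')

first_p = demo_soup.find('p')                    # first <p> only
all_p = demo_soup.find_all('p')                  # every <p> in the document
by_id = demo_soup.find(id='forecast')            # lookup by id
by_class = demo_soup.find_all(class_='period')   # lookup by class
nested = demo_soup.select('div p')               # CSS descendant selector

# Searches can also start from a previously found element.
inside_div = by_id.find_all('p')

texts = [p.get_text() for p in nested]
print(texts)
```

Note that `find_all('p')` returns three elements while `select('div p')` returns two, since the last paragraph is not inside a `<div>`.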

## 2. Scrape a simple statically loaded site with requests and bs4

We should now have enough knowledge to scrape the URL from the National Weather Service. From the previous section, we should also have a good idea of what elements in the HTML we should focus on. However, if you are unsure, you can always use the `Inspect Element` tool in Chrome to find the relevant tags/ids/classes.

To begin, let's rerun our GET request and create a fresh `soup` object.

In [0]:
URL = 'https://forecast.weather.gov/MapClick.php?lat=42.3587&lon=-71.0567'
# send GET request
page = requests.get(URL)
# instantiate BeautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')

For this scrape, we want to get the information stored in this container:

![7-day-forecast](https://drive.google.com/uc?id=1sFbnCXFWaGZDDENo1bFRwTVW3ojZlDqS)

Formatted in the following manner:

period | short_desc | temp
--- | --- | ---
Tonight | Showers \n Likely and \n Patchy Fog | Low: 62 °F 
Monday | Mostly Sunny | High: 75 °F 

In [0]:
# Start by extracting the 7-day forecast panel shown in the first image above
# Hint: find the id for this object, and use one of the functions covered in the previous section

#--- YOUR CODE HERE ---

seven_day = soup.find(id='seven-day-forecast')

#----------------------

# Within this new object, select all elements with the ".period-name" class that
# are inside a ".tombstone-container" element and save them to a variable
# Hint: CSS combinator selectors may be helpful here

#--- YOUR CODE HERE ---

period_tags = seven_day.select('.tombstone-container .period-name')

#----------------------

# In a similar manner to above, select all elements with the ".short-desc" class
# that are inside a ".tombstone-container" element and save them to a variable
# Hint: CSS combinator selectors may be helpful here

#--- YOUR CODE HERE ---

short_descs_tags = seven_day.select('.tombstone-container .short-desc')

#----------------------

# Once again, select all elements with the ".temp" class that are inside
# a ".tombstone-container" element and save them to a variable
# Hint: CSS combinator selectors may be helpful here

#--- YOUR CODE HERE ---

temps_tags = seven_day.select('.tombstone-container .temp')

#----------------------

In [0]:
# print some of your results out to make sure you did it right

#--- YOUR CODE HERE ---

print(seven_day)
print(period_tags)

#----------------------

In [0]:
# Now that we have all of our data in an iterable form (as bs4.ResultSet),
# we can loop through the elements we want and use the get_text() method to retrieve
# our data without HTML tags

# loop through your ".period-name" classes from above to get your period names
# save the results to a list

#--- YOUR CODE HERE ---

periods = [pt.get_text() for pt in period_tags]

#----------------------

# loop through the ".short-desc" classes from above to get your short descriptions
# save the results to a list

#--- YOUR CODE HERE ---

short_descs = [sd.get_text() for sd in short_descs_tags]

#----------------------

# loop through the ".temp" classes from above to get your temperatures
# save the results to a list

#--- YOUR CODE HERE ---

temps = [t.get_text() for t in temps_tags]

#----------------------
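One detail worth checking is how `get_text()` treats line breaks inside an element, since the short descriptions span several lines. The sketch below uses simplified markup modelled on the table above (an assumption, not the site's exact HTML) and compares the default call with the `separator` and `strip` arguments.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on one forecast entry.
snippet = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Showers<br/>Likely and<br/>Patchy Fog</p>
  <p class="temp">Low: 62 °F</p>
</div>
"""
demo = BeautifulSoup(snippet, 'html.parser')
desc_tag = demo.select('.tombstone-container .short-desc')[0]

# Default: text fragments around <br/> are concatenated with nothing between them.
joined = desc_tag.get_text()

# With a separator, each fragment is stripped and joined by a single space.
spaced = desc_tag.get_text(separator=' ', strip=True)

print(repr(joined))
print(repr(spaced))
```

If the scraped descriptions come out with words run together, passing a separator in the list comprehensions above is one possible fix.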

In [0]:
# lastly, we combine these lists into a pandas dataframe
import pandas as pd

weather = pd.DataFrame({
    'period': periods, #--- YOUR CODE HERE ---,
    'short_desc': short_descs, #--- YOUR CODE HERE ---,
    'temp': temps, #--- YOUR CODE HERE ---
})

weather

Congrats, you just scraped your first website!

We can continue to process the data we just collected in many ways. For instance, we could extract the numerical value of the temperature from the `temp` column rather than leaving it as a string. However, we're now going to move on to scraping more advanced sites.
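As a minimal sketch of that temperature extraction, the helper below pulls the label and the integer out of strings shaped like the ones in the table above. The function name `parse_temp` and the assumed `Low: NN °F` / `High: NN °F` format are illustrative; real forecast strings may vary.

```python
import re

def parse_temp(text):
    """Return (kind, degrees) from a string such as 'Low: 62 °F', or None if it does not match."""
    match = re.search(r'(Low|High):\s*(-?\d+)', text)
    if match is None:
        return None
    return match.group(1).lower(), int(match.group(2))

# Sample strings copied from the example table above.
samples = ['Low: 62 °F', 'High: 75 °F']
parsed = [parse_temp(s) for s in samples]
print(parsed)
```

Returning `None` for unmatched strings keeps one odd entry from crashing the whole scrape; the same function could be applied to the `temp` column with `weather['temp'].apply(parse_temp)`.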

## 3. Scrape a dynamically loaded site with selenium and bs4

We now know that we can use `selenium` to automate web browsing. `selenium` works by instantiating a "WebDriver" (in this case, we will use the ChromeDriver) to run and interface with a web browser. Your code can then control what this browser does, such as clicking all of the "Read More" buttons on a page or sending keystrokes.

**Selenium Documentation**:  
https://selenium-python.readthedocs.io/  
https://selenium-python.readthedocs.io/locating-elements.html

### 3.1 Installing Selenium

To use `selenium`, we need to install the package and the Chrome WebDriver executable

In [0]:
!apt update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium


### 3.2 Simple HTML Page Demo


Let's test out `selenium` on a simple website: https://duckduckgo.com/. First, start our Chrome web browser...

In [0]:
import time
from selenium import webdriver

# Specify the configuration for the Chrome webdriver.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
options.add_argument('--no-sandbox')

# Start the Chrome webdriver (i.e., the web browser).
# chromedriver is located via the PATH (it was copied to /usr/bin above).
driver = webdriver.Chrome(options=options)

In [0]:
from selenium.webdriver.common.by import By

# Instruct the web driver to navigate to DuckDuckGo.
URL = 'https://duckduckgo.com/'
driver.get(URL)
time.sleep(1)

# Use selenium's API to find the HTML element representing the search bar,
# and then type in our query.
search_bar = driver.find_element(By.ID, 'search_form_input_homepage')
search_bar.send_keys('Hello World!')
time.sleep(2)

# Lastly, find the HTML element representing the 'Search' button and click it.
search_button = driver.find_element(By.ID, 'search_button_homepage')
search_button.click()
time.sleep(2)

In [0]:
# Once we are done interacting with the page, we can also download and save the 
# contents of the webpage we are currently on. We can also use BS4 to process
# the HTML file afterwards if we desire.
page_source = driver.page_source
with open('duckduckgo.html','w', encoding='utf-8') as file:
  file.write(page_source)
driver.close()

from google.colab import files
files.download('duckduckgo.html')

### 3.3 TripAdvisor Demo

Next, let's repeat this exercise on a more complicated website...

In [0]:
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
options.add_argument('--no-sandbox')
driver = webdriver.Chrome(options=options)

We can now use the webdriver's built-in functions to find and press all the "Read More" buttons.

If we use "Inspect Element" on the TripAdvisor page, we'll see that the "Read More" buttons all share the class name: <br>
"location-review-review-list-parts-ExpandableReview__cta--2mR2g"

The cell below locates the buttons by their visible text ("Read more") rather than by this class name.

In [0]:
from selenium.webdriver.common.by import By

URL = 'https://www.tripadvisor.com/Airline_Review-d8729157-Reviews-Spirit-Airlines#REVIEWS'
# get webpage
driver.get(URL)

# get a list of the "Read More" buttons; clicking any one of them will expand all of the reviews
time.sleep(2)
read_more_buttons = driver.find_elements(By.XPATH, "//span[text()='Read more']")
for button in read_more_buttons:
    if button.is_displayed():
        button.click()
        break
      
# save source HTML of page now that the JS has run
page_source = driver.page_source

Now that we have the source HTML, the process is the same as before. We can use `bs4` to traverse the HTML DOM and get all the info we need.

In [0]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Use the BeautifulSoup "find_all" method to get the elements that contain the expanded reviews
# Hint: you may need to open the page in your browser and find the correct class after pressing the "Read More" button

#--- YOUR CODE HERE ---

reviews_selector = soup.find_all('q', class_='location-review-review-list-parts-ExpandableReview__reviewText--gOmRC')

#----------------------

When we begin scraping larger webpages, it's more memory efficient to write to a file (such as a csv) line by line, rather than constructing an entire dataframe and converting it to a csv all at once. The example above is not large enough to require this, but we'll try it for learning purposes.

In [0]:
# Open the output file to write to.
with open('trip_advisor_scrape.txt', 'w', encoding='utf-8') as file:

    # Loop through the bs4.ResultSet obtained from the previous cell.
    # For each entry, get its text, use the strip() method to remove
    # leading and trailing whitespace, and then write it to the file with:
          # file.write(line)
          # file.write('\n')
    
    #--- YOUR CODE HERE ---
    for review_selector in reviews_selector:
        line = review_selector.get_text().strip()
        file.write(line)
        file.write('\n')
    #----------------------
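Writing one review per line works as long as no review contains a newline of its own. If a true csv is wanted, the standard-library `csv` module can write rows one at a time and quote such fields. The sketch below uses made-up reviews and an in-memory buffer (both assumptions for illustration) so it runs without touching the file system.

```python
import csv
import io

# Hypothetical review texts; the second one contains an embedded newline.
reviews = ['Great flight, friendly crew.  ', 'Delayed twice.\nNever again.']

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['review'])            # header row
for review in reviews:
    writer.writerow([review.strip()])  # one row written at a time

# Read the data back to confirm each review survived as a single field.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

With a real file, the buffer would be replaced by something like `open('reviews.csv', 'w', newline='', encoding='utf-8')`, where the file name is a placeholder.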

Finally, since we're working in Google Colab, we need to download the file.

In [0]:
from google.colab import files
files.download('trip_advisor_scrape.txt')