# Web Scraping Demo

This demonstration was given at an informal Women in Data meetup talk.

# Method Comparisons

Here we compare three approaches to web scraping: BeautifulSoup, BeautifulSoup with multithreading, and Scrapy. We scrape a list of country profile pages from the CIA World Factbook and time each approach.

### Create the list of sites

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import time

url = 'https://www.cia.gov/library/publications/the-world-factbook/geos/af.html'
r = requests.get(url)
html_contents = r.text
html_soup = BeautifulSoup(html_contents, 'html.parser')

country_letters = {}
for found in html_soup.find_all('option'):
    match = re.search(r'([^x][^x])\.(html)', str(found))
    if match:
        name = str(found.string)
        country_letters[name] = match.group(1)

sites = ['https://www.cia.gov/library/publications/the-world-factbook/geos/%s.html' % code
         for code in country_letters.values()]
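
To see what the pattern above keeps and skips, the following minimal sketch runs the same regular expression on a few hand-written `<option>` tags. The sample tags, including the `mx` code, are made up for illustration and are not copied from the live page. Note that `[^x][^x]` rejects any code with an `x` in either position, not only `xx`.

```python
import re

# Same pattern as in the cell above: two non-"x" characters before ".html"
pattern = re.compile(r'([^x][^x])\.(html)')

# Hypothetical <option> tags, written by hand for illustration
sample_options = [
    '<option value="">Please select a country</option>',
    '<option value="geos/xx.html">World</option>',
    '<option value="geos/af.html">Afghanistan</option>',
    '<option value="geos/mx.html">Example</option>',
]

matched_codes = []
for option in sample_options:
    match = pattern.search(option)
    if match:
        matched_codes.append(match.group(1))

print(matched_codes)
```

If codes containing an `x` should be kept, the pattern would need to exclude only the literal `xx` entry instead.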


### BeautifulSoup Functions

In [2]:
def get_soup(url, sleep_length=5):
    """
    Return a BeautifulSoup object of the url
    
    Parameters:
    ----------------
    url: string
        Website address of a country profile in the CIA world factbook
    sleep_length: int
        Length of pause in seconds before trying another request
        
    Returns:    
    ----------------
    html_soup: BeautifulSoup object
        Parsed HTML of the webpage
    raw_text: string
        Raw html text of the webpage
    """ 
    while True:
        try:
            r = requests.get(url)
            break
        except requests.exceptions.RequestException as e:
            print(e)
            time.sleep(sleep_length)
    
    raw_text = r.text
    html_soup = BeautifulSoup(raw_text, 'html.parser')
    return html_soup, raw_text

def site_crawl(country_site, sleep_length=0):
    """
    Test retrieval of site information by printing out country name.
    
    Parameters
    -----------
    country_site : string
        URL of the country profile site
    sleep_length : int
        Number of seconds to pause before retrying a web page request
    """
    
    html_soup, text = get_soup(country_site, sleep_length)
    try:
        print(html_soup.find('span', {'class': 'region_name1 countryName'}).string)
    except AttributeError:
        # find() returned None: the expected <span> is missing from the page
        print("Trouble with: %s" % country_site)
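
The retry loop in `get_soup` keeps trying until a request succeeds, which can loop forever if a site stays down. As a hedged alternative, here is a minimal sketch of a bounded retry. The helper name `fetch_with_retries`, its parameters, and the `flaky_fetch` stand-in are all hypothetical and exist only so the example runs without network access.

```python
import time

def fetch_with_retries(fetch, max_attempts=3, sleep_length=0.0):
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # get_soup would catch requests.exceptions.RequestException here
            last_error = error
            print("Attempt %d failed: %s" % (attempt, error))
            if attempt < max_attempts:
                time.sleep(sleep_length)
    # Every attempt failed: surface the last error to the caller
    raise last_error

# Stand-in for requests.get: fails twice, then succeeds
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch, max_attempts=5)
print(result)
```

Giving up after a fixed number of attempts lets the caller decide whether to skip the page or stop the crawl.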



### BeautifulSoup Performance

In [3]:
import time
from threading import Thread
from queue import Queue
import pandas as pd

In [4]:
methods = ['BeautifulSoup', 'BeautifulSoup + Multithreading', 'Scrapy']
times = []

start_time = time.time()
for site in sites:
    site_crawl(site)
end_time = time.time()

bs_time = end_time - start_time
times.append(bs_time)
print("Serial time=", bs_time)

Afghanistan
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Ashmore and Cartier Islands
Atlantic Ocean
Australia
Austria
Azerbaijan
Bahamas, The
Bahrain
United States Pacific Island Wildlife Refuges
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
British Virgin Islands
Bulgaria
Burkina Faso
Burma
Burundi
Cabo Verde
Cambodia
Cameroon
Canada
Cayman Islands
Central African Republic
Chad
Chile
China
Christmas Island
Clipperton Island
Cocos (Keeling) Islands
Colombia
Comoros
Congo, Democratic Republic of the
Congo, Republic of the
Cook Islands
Coral Sea Islands
Costa Rica
Cote d'Ivoire
Croatia
Cuba
Curacao
Cyprus
Czechia
Denmark
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Eswatini
Ethiopia
Falkland Islands (Islas Malvinas)
Faroe Islands
Fiji
Finland
France
French Polynesi
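
The same start/stop timing pattern is repeated for each method in this notebook. One possible way to factor it out is a small context manager, sketched below. The name `timed` and the `timings` dictionary are hypothetical, and the workload is a placeholder rather than a real crawl.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed('example', timings):
    sum(range(100000))  # placeholder workload instead of a real crawl

print("example time=", timings['example'])
```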

### BeautifulSoup with Multithreading Performance

In [5]:
# Set up threading
NUM_WORKERS = 8
task_queue = Queue()

def worker():
    while True:
        # Constantly check the queue for addresses
        site = task_queue.get()
        site_crawl(site)

        # Mark task as done
        task_queue.task_done()

start_time = time.time()

# Create worker threads; daemon threads will not block interpreter exit
threads = [Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]

# Add websites to the task queue
for site in sites:
    task_queue.put(site)

# Start all workers
for thread in threads:
    thread.start()

# Wait for all the tasks in the queue to be processed
task_queue.join()
end_time = time.time()

bs_multi_time = end_time - start_time
times.append(bs_multi_time)
print("Threading time=", bs_multi_time)


Andorra
AnguillaAntarctica

American Samoa
Afghanistan
Angola
Albania
Algeria
Antigua and Barbuda
Ashmore and Cartier Islands
Armenia
Atlantic Ocean
Aruba
Australia
Argentina
Austria
Azerbaijan
Bahrain
United States Pacific Island Wildlife Refuges
Bahamas, The
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bouvet Island
Bhutan
British Indian Ocean Territory
Bosnia and Herzegovina
Botswana
BrazilBolivia

British Virgin Islands
Cabo Verde
Cameroon
Burma
Burkina Faso
CambodiaBulgaria

Burundi
Canada
Christmas IslandCocos (Keeling) Islands

Clipperton Island
ChadCentral African Republic

Cayman Islands
Chile
China
Coral Sea Islands
Cook Islands
Congo, Republic of the
Comoros
ColombiaCote d'Ivoire
Costa Rica

Congo, Democratic Republic of the
Czechia
Curacao
Cuba
DenmarkCyprus

Dominica
Djibouti
Croatia
Equatorial Guinea
El Salvador
Eritrea
EswatiniEcuadorDominican Republic


Estonia
Egypt
Faroe Islands
Finland
French Polynesia
Falkland Islands (Islas Malvinas)
Fiji
Ethiopia
Franc
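
The run-together names in the output above (for example `AnguillaAntarctica`) appear because several worker threads call `print` at the same time. One way to avoid this, sketched below, is to have workers return their results and print from the main thread only. This sketch swaps the hand-built `Queue` and `Thread` workers for `concurrent.futures.ThreadPoolExecutor` from the standard library. `fake_crawl` and the three example URLs are hypothetical stand-ins so the example runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_crawl(url):
    """Hypothetical stand-in for site_crawl: returns a name instead of printing it."""
    time.sleep(0.01)  # simulate network latency
    return url.rsplit('/', 1)[-1].replace('.html', '')

example_urls = ['https://example.com/geos/%s.html' % code for code in ('af', 'al', 'ag')]

# map() yields results in the order of the inputs, regardless of which thread finishes first
with ThreadPoolExecutor(max_workers=8) as pool:
    crawled = list(pool.map(fake_crawl, example_urls))

# Only the main thread prints, so lines cannot interleave
for name in crawled:
    print(name)
```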

### Scrapy Framework

Here we use the list of sites but Scrapy's own methods of finding content on a page.

In [6]:
from twisted.internet import reactor
import scrapy
import logging
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
import time

class CountryNameSpider(scrapy.Spider):
    name = "names"
    start_urls = sites
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        }
    
    def parse(self, response):        
        name = response.xpath('//*[@id="geos_title"]/span[1]/text()').get()
        print(name)

start_time = time.time()
configure_logging()
runner = CrawlerRunner()
runner.crawl(CountryNameSpider)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

end_time = time.time()
scrapy_time = end_time - start_time
times.append(scrapy_time)
 
print("Scrapy time=", scrapy_time)

Albania
Angola
Afghanistan
Antarctica
American Samoa
Anguilla
Andorra
Algeria
Ashmore and Cartier Islands
Armenia
Antigua and Barbuda
Atlantic Ocean
Argentina
Aruba
Australia
Austria
Bahamas, The
Azerbaijan
Bahrain
United States Pacific Island Wildlife Refuges
Bangladesh
Barbados
Belarus
Benin
Belgium
Belize
Bermuda
Bhutan
Bolivia
Bouvet Island
British Indian Ocean Territory
Botswana
Brazil
Bosnia and Herzegovina
British Virgin Islands
Burkina Faso
Cabo Verde
Burundi
Bulgaria
Burma
Cambodia
Cameroon
Canada
Chile
Cayman Islands
Clipperton Island
Chad
Cocos (Keeling) Islands
Central African Republic
Christmas Island
China
Colombia
Congo, Democratic Republic of the
Congo, Republic of the
Comoros
Coral Sea Islands
Cook Islands
Cote d'Ivoire
Costa Rica
Croatia
Czechia
Cuba
Curacao
Cyprus
Djibouti
Denmark
Dominica
Dominican Republic
Ecuador
Eritrea
Equatorial Guinea
Egypt
El Salvador
Eswatini
Ethiopia
Falkland Islands (Islas Malvinas)
Faroe Islands
Estonia
Fiji
French Polynesia
Finland
Franc
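
To illustrate what the XPath in `parse` selects, here is a small sketch on a hand-written, well-formed snippet. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, rather than Scrapy's selectors, so the trailing `/text()` step is replaced by reading `.text`. The snippet's structure is an assumption based on the `geos_title` id and the `region_name1 countryName` class used earlier, not a copy of the real page.

```python
import xml.etree.ElementTree as ET

# Hand-written snippet; the real page layout may differ
snippet = """
<html>
  <body>
    <div id="geos_title">
      <span class="region_name1 countryName">Afghanistan</span>
      <span class="other">ignored</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(snippet)

# Any element with id="geos_title", then its first <span> child
first_span = root.find(".//*[@id='geos_title']/span[1]")
country_name = first_span.text
print(country_name)
```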

### Performance Time Summary

In [7]:
d = {'Methods': methods, 'Performance Times (s)': times}
df = pd.DataFrame(d)
df

|   | Methods | Performance Times (s) |
|---|---|---|
| 0 | BeautifulSoup | 110.8075 |
| 1 | BeautifulSoup + Multithreading | 60.715519 |
| 2 | Scrapy | 11.251928 |
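
To put the times in perspective, the sketch below computes each method's speedup relative to the serial BeautifulSoup run. With the numbers above this comes to roughly 1.8x for multithreading and roughly 9.8x for Scrapy. These figures come from a single run and will vary with network conditions. The dictionary name `performance_times` is introduced here for illustration.

```python
# Times copied from the summary table above, in seconds
performance_times = {
    'BeautifulSoup': 110.8075,
    'BeautifulSoup + Multithreading': 60.715519,
    'Scrapy': 11.251928,
}

baseline = performance_times['BeautifulSoup']

# Speedup = serial time divided by the method's time
speedups = {method: baseline / seconds for method, seconds in performance_times.items()}

for method, factor in speedups.items():
    print("%s: %.1fx" % (method, factor))
```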


### Selenium Demo

#### Searching Google

In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a browser session to drive
driver = webdriver.Chrome()

# Load the page
driver.get('http://www.google.com')

# Interact with the page
# (Selenium 4 replaces the find_element_by_* helpers with find_element(By..., ...))
search_box = driver.find_element(By.XPATH, '//*[@id="tsf"]/div[2]/div/div[1]/div/div[1]/input')
search_box.send_keys('Statistics')

search_box.submit()
time.sleep(5) # Let the user actually see something!
driver.quit()

#### WID Attendance List

In [14]:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Load the page
driver.get('https://www.meetup.com/Sacramento-Women-in-Data/events/263664767/attendees/')

# Each attendee name is an <h4> inside a list item
names = driver.find_elements(By.XPATH, """//*[@id="mupMain"]/div/div/div/section/div[3]
                              /div/div/div/ul/li[*]/div/div[2]/div[1]/div/a/h4""")

for name in names:
    print(name.text)

    

Eddie C.
Michelle Harmon L.
Spencer J.
Maryam
Suriani A.
Kasey
Sjn
Mark K.
Emma
Codewerk
Joel B.
muna
Jonathan S.
earth'smart
Emily B.
josh
Dan F.
Liz A.
Ashma S.
Serena K.
Ting T.
Ingrid D.
Tammy C.
Happy W.
Tony S.
Saravanan K.
Sasa
Aiko
Andy S.
Kate W.
Phoenix S.
Anushree S.
Adam F.
Daniella E.
Chris
Marielle
Shweta
Bascomb A.
Melissa
Jenessa P.
Hanna Maria K.


In [15]:
from selenium.webdriver.common.by import By

# Click the second tab, then read the names it lists
not_going = driver.find_element(By.XPATH, '//*[@id="mupMain"]/div/div/div/section/nav/ul/li[2]/span')
not_going.click()

not_going_names = driver.find_elements(By.XPATH, """//*[@id="mupMain"]/div/div/div/section/div[3]
/div/div/div/ul/li/div/div[2]/div[1]/div/a/h4""")


In [16]:
for i in not_going_names:
    print(i.text)

Priya P.
Tran (Tron) W.
