# Web Scraping

*Oh what a tangled web we weave  
When first we practice to deceive!*

This notebook introduces some of the tools and techniques used in web scraping, working up from pandas through Requests to BeautifulSoup.

## Set up

### Imports

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup  # Third-party package: pip install beautifulsoup4
from time import sleep

### Utilities

In [2]:
def extract_tor_book_details(card):
    """
    Takes details of a book scraped from Tor as a BeautifulSoup object;
    returns a dictionary of key information.
    """
    return {
        "author": card.find("p", class_="result--author").find("a").get_text(),
        "title": card.find("h2", class_="result--title").find("a").get_text(),
        "description": card.find("section", class_="result--description").get_text().strip(),
        "formats": [x.get_text() for x in card.find_all("a", class_="result--link")]
    }
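
The function above assumes every card contains every element. BeautifulSoup's `find` returns `None` when nothing matches, so a card with a missing author or title would raise an `AttributeError`. Below is a minimal sketch of a more defensive lookup; the helper name `text_or_none` and the sample HTML are made up for illustration and only mirror the class names used above.

```python
from bs4 import BeautifulSoup


def text_or_none(card, tag, class_name, child=None):
    """Return the stripped text of the first matching element, or None if absent."""
    node = card.find(tag, class_=class_name)
    if node is not None and child is not None:
        node = node.find(child)
    return node.get_text().strip() if node is not None else None


# Hypothetical card: note that it has no author element
SAMPLE_CARD = """
<article class="result">
  <h2 class="result--title"><a href="#">Example Title</a></h2>
  <section class="result--description"> A short blurb. </section>
  <a class="result--link" href="#">e-Book</a>
  <a class="result--link" href="#">Hardcover</a>
</article>
"""

card = BeautifulSoup(SAMPLE_CARD, "html.parser").find("article")

details = {
    "author": text_or_none(card, "p", "result--author", child="a"),
    "title": text_or_none(card, "h2", "result--title", child="a"),
    "description": text_or_none(card, "section", "result--description"),
    "formats": [a.get_text() for a in card.find_all("a", class_="result--link")],
}
print(details)
```

With this approach a missing field shows up as `None` in the dictionary rather than stopping the whole scrape.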

### URLs

In [3]:
SPIDER_ENEMIES_URL = "https://en.wikipedia.org/wiki/List_of_Spider-Man_enemies"
DIVING_SPIDER_URL = "https://www.nationalgeographic.com/science/article/the-diving-bell-and-the-spider"
TOR_URL = "https://publishing.tor.com/books"

## Web scraping with Pandas

In [4]:
# Get the <table> elements with class "wikitable" that contain the text "Vulture", as dataframes

tables = pd.read_html(SPIDER_ENEMIES_URL, 
                      attrs={"class": "wikitable"}, match="Vulture")

In [5]:
# Count your tables

len(tables)

1

In [6]:
# Check a dataframe

tables[0].sample()

|   | Group name | Original members | First Appearance | Description |
|---|---|---|---|---|
| 3 | Spider-Man Revenge Squad[139] | SpotGrizzlyKangaroo IIGibbon | The Spectacular Spider-Man #246 (May 1997) | A team of lesser-known and weaker Spider-Man v... |


## Web scraping with Requests

In [7]:
# Grab a whole page

response = requests.get(DIVING_SPIDER_URL)

In [8]:
# Check that all is okay: any 2xx status code means success

response

<Response [200]>
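
The first digit of the status code gives the broad outcome of the request. The sketch below uses only the standard library to label a few codes; the helper name `status_class` is made up for illustration. On a real Requests response, `response.ok` and `response.raise_for_status()` offer a similar check without inspecting the code by hand.

```python
from http import HTTPStatus


def status_class(code):
    """Return a rough label for an HTTP status code, based on its first digit."""
    labels = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return labels.get(code // 100, "unknown")


for code in (200, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found"
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```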

In [9]:
# Pull out the body of the response

raw_text = response.text

In [10]:
# Manually extract the information

start = raw_text.find("In the days before scuba tanks")
end = raw_text.find("in the water”.") + len("in the water”.")

content = raw_text[start : end]
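
One pitfall with this approach: `str.find` returns `-1` when the marker is missing, so a changed page produces a silently wrong slice rather than an error. A minimal sketch of a stricter version follows; the helper name `extract_between` and the sample string are made up for illustration.

```python
def extract_between(text, start_marker, end_marker):
    """Return the slice from start_marker through end_marker, inclusive."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")

    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")

    return text[start : end + len(end_marker)]


# Hypothetical page body
sample = "<html><body><p>skip me</p><p>START of story ... END.</p></body></html>"
print(extract_between(sample, "START", "END."))
```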

## Web scraping with BeautifulSoup

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [11]:
# Read the raw information as HTML

soup = BeautifulSoup(response.content, "html.parser")

In [12]:
# Search for relevant information by HTML attributes

title = soup.find(class_="Article__Headline__Title").get_text()
author = soup.find(class_="Byline__Author").get_text()

In [13]:
# Check them

title, author

('The diving bell and the spider', 'Ed Yong')

In [14]:
# Find elements by tag

links = soup.find_all("a")

In [15]:
# Keep only absolute links; .get() avoids a KeyError on <a> tags without an href

links = [x["href"] for x in links if x.get("href", "").startswith("http")]
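
The filter above discards relative links such as `/science/...`. A possible alternative, sketched here with the standard library's `urljoin`, resolves them against the page URL instead. The helper name `absolute_http_links`, the sample hrefs, and the base URL are all made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def absolute_http_links(hrefs, base_url):
    """Resolve hrefs against base_url and keep only http(s) links."""
    resolved = (
        urljoin(base_url, href)
        for href in hrefs
        if href and not href.startswith("#")  # skip missing hrefs and in-page anchors
    )
    return [url for url in resolved if urlparse(url).scheme in ("http", "https")]


# Hypothetical href values, as might be collected from <a> tags
hrefs = ["https://example.com/a", "/science/b", "#top", "mailto:someone@example.com", None]
base = "https://www.example.org/science/article/c"

print(absolute_http_links(hrefs, base))
```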

## Web scraping with subterfuge

In [16]:
# Pose as a browser by sending a Firefox User-Agent header

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0"}

response = requests.get(TOR_URL, headers=headers)

In [17]:
# Read the page as HTML

soup = BeautifulSoup(response.content, "html.parser")

In [18]:
# Grab the desired data

cards = soup.find_all("article")

## Web scraping with pagination

In [19]:
# Holder for cards

cards = []

# Loop through result pages 1 to 24
for page in range(1, 25):
    
    # Get the page
    response = requests.get(f"{TOR_URL}/?page_number={page}", headers=headers)
    
    # Read it as HTML
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Add the data to the holder
    cards.extend(soup.find_all("article", class_="result"))
    
    # Pause, politely
    sleep(1)
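
The loop above hard-codes the number of pages. When the page count is unknown, one option is to keep requesting pages until one comes back empty. The sketch below shows that pattern with a stand-in fetcher so it runs offline; `collect_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, and in this notebook `fetch_page` would wrap the `requests.get`, `BeautifulSoup`, and `find_all` steps.

```python
from time import sleep


def collect_pages(fetch_page, max_pages=100, delay=1.0):
    """Call fetch_page(1), fetch_page(2), ... until a page returns no items."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # an empty page is taken to mean the results have run out
        items.extend(batch)
        sleep(delay)  # pause between requests, as in the loop above
    return items


def fake_fetch(page):
    """Stand-in for a real request: three pages of ten cards, then nothing."""
    return [f"card-{page}-{i}" for i in range(10)] if page <= 3 else []


cards = collect_pages(fake_fetch, delay=0)
print(len(cards))  # 30
```

The `max_pages` cap is a safety net in case the site never returns an empty page.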

In [20]:
# How much did we get?

len(cards)

240

## Data extraction

In [21]:
# Convert the dicts to a dataframe with one row per book/format pair and a one-hot column per format

books = pd.get_dummies(pd.DataFrame([extract_tor_book_details(x)
                                     for x in cards]).explode('formats'),
                       columns=['formats'])
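
Because `explode` produces one row per book/format pair, a book sold in several formats appears several times. If one row per book is preferred, the one-hot columns can be collapsed by grouping on the original index. This is a sketch on made-up records shaped like the output of `extract_tor_book_details`; the names `per_book` and `books_wide` are illustrative.

```python
import pandas as pd

# Made-up records with the same "formats" list structure as the scraped data
records = [
    {"title": "Book A", "formats": ["e-Book", "Hardcover"]},
    {"title": "Book B", "formats": ["e-Book"]},
]
df = pd.DataFrame(records)

# One-hot encode the exploded formats; the index still points at the source row
dummies = pd.get_dummies(df["formats"].explode(), prefix="formats")

# Collapse back to one row per book: a format is 1 if any of its rows had it
per_book = dummies.groupby(level=0).max().astype(int)

books_wide = df.drop(columns="formats").join(per_book)
print(books_wide)
```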

In [22]:
# Check it

books.head()

|   | author | title | description | formats_ | formats_Compact Disc | formats_Digital Audio | formats_Hardcover | formats_Trade Paperback | formats_e-Book | formats_e-Book Bundle |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Michael R. Underwood | The Absconded Ambassador | Last Week, She Was Working Open Mics. Now She’... | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | Michael R. Underwood | The Absconded Ambassador | Last Week, She Was Working Open Mics. Now She’... | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | Michael R. Underwood | The Absconded Ambassador | Last Week, She Was Working Open Mics. Now She’... | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | Dave Hutchinson | Acadie | The first humans still hunt their children acr... | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Dave Hutchinson | Acadie | The first humans still hunt their children acr... | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
