# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing many pages of a website, handling errors, and storing data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility, we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve an archived version of the events page for the week of March 30, 2019.

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open your web browser's inspect element feature to preview the underlying HTML associated with the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a Pandas DataFrame with columns for the `Event_Name`, `Venue`, `Event_Date`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.

In [1]:
# Relevant imports
import pandas as pd
import json
from bs4 import BeautifulSoup
import numpy as np
import time
import requests

In [2]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"

# Exploration: making the request and parsing the response
response = requests.get(EVENTS_PAGE_URL)
soup = BeautifulSoup(response.content, "html.parser")

In [3]:
# Find the container with event listings in it
events_all_div = soup.find('div', attrs={'data-tracking-id': 'events-all'})

In [4]:
# Find a list of events by date within that container
event_listings = events_all_div.find('ul').find('li')

print(len(event_listings))
print(event_listings.text[:1000])

13
̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457Cocoon New York: Sven Väth, Ilario Alicante, Butch & TaimurSven Vath, Butch, Taimur, Ilario Alicante99 Scott AveRARA Tickets407Horse Meat Disco - New York ResidencyHorse Meat Disco, The Carry Nation, Amber ValentineElsewhereRARA Tickets375RA PickRave: Underground Resistance All NightNomadico, Mark FlashNowadaysRARA Tickets232Detroit techno's most iconic crew dispatches two of its DJs for a rare New York gig at one of the city's homiest clubs. Believe You Me // Beta Librae, Stephan Kimbel, Born 2 BoogieBeta Librae, Stephan KimbelTBA - New YorkRARA Tickets89M.A.N.D.Y. with DJ Lisa Frank Plus Mnmlktchn with DJ QuM.A.N.D.Y., DJ Qu, Michel Harruch, DJ Lisa FrankGood RoomRARA Tickets87Nails & Wax presents: RekaReka, pulsewidthmod, dolli B, Andres TavioH0L0RARA Tickets67Wings of Illusion - Burning Man 2019 Art Car FundraiserAmine K, Bo (Borzu), CarlitaBogart HouseRARA Tickets30T

In [5]:
dates = event_listings.findChildren(recursive=False)

print("0 index:", dates[0].text[:100])
print("1 index:", dates[1].text[:100])
print("2 index:", dates[2].text[:100])

0 index: ̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRA
1 index: 
2 index: ̸Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New


In [6]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers
first_date = dates[0]
print(first_date.text[:8])

̸Sat, 30


In [7]:
date = first_date.find('div', class_='sticky-header').text
print(date)

̸Sat, 30 Mar


In [8]:
# Extract the name, venue, and number of attendees from one of the
# events within that container
first_date_events = first_date.findChildren('ul')
first_event = first_date_events[0]

name = first_event.find('h3').text
print(name)

UnterMania II


In [9]:
venue_and_attendees = first_event.findAll('div', attrs={'height': 30})
print(venue_and_attendees)

[<div class="Box-omzyfs-0 sc-AxjAm fOOuYI" height="30"><div class="Box-omzyfs-0 sc-AxjAm hoMiiH" color="accent" height="24" width="24"><svg aria-label="Location" height="100%" viewbox="0 0 24 24" width="100%"><g fill="none" fill-rule="evenodd"><path d="M0 0h24v24H0z" fill="none"></path><path d="M13.613 15.075c.26-.501.505-.988.732-1.456C15.393 11.456 16 9.785 16 9a4 4 0 10-8 0c0 .785.607 2.456 1.655 4.619.227.468.472.955.732 1.456A83.615 83.615 0 0012 18.022c.55-.962 1.1-1.96 1.613-2.947zM18 9c0 1.2-.67 3.045-1.855 5.491-.236.486-.49.99-.758 1.506a86.17 86.17 0 01-2.532 4.522 1 1 0 01-1.71 0 85.564 85.564 0 01-.793-1.35 86.17 86.17 0 01-1.74-3.172 60.318 60.318 0 01-.757-1.506C6.67 12.045 6 10.201 6 9a6 6 0 016-6 6 6 0 016 6z" fill="currentColor"></path><path d="M13 9a1 1 0 11-2 0 1 1 0 112 0" fill="currentColor"></path></g></svg></div><span class="Text-sc-1t0gn2o-0 hhfigA" color="primary" font-weight="normal">TBA - New York</span></div>, <div class="Box-omzyfs-0 sc-AxjAm jYikCb" heigh

In [10]:
venue = venue_and_attendees[0].text
number_attend = int(venue_and_attendees[-1].text)

print("Name:", name)
print("Venue:", venue)
print("Date:", date)
print("Number of Attendees:", number_attend)

Name: UnterMania II
Venue: TBA - New York
Date: ̸Sat, 30 Mar
Number of Attendees: 457
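
The date printed above starts with a stray character that appears to be a combining mark. As a minimal sketch of one way to clean it, the helper below drops every character whose Unicode category is a mark. The name `strip_marks` is hypothetical, and the sample string assumes the stray character is U+0338; neither comes from the lab itself.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (Unicode categories Mn, Mc, Me) and trim whitespace."""
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("M")]
    return "".join(kept).strip()

# Assumed sample: the stray leading mark is taken to be U+0338
sample = "\u0338Sat, 30 Mar"
print(strip_marks(sample))  # Sat, 30 Mar
```

This is only meant for the date strings. Applied to event names, it could also strip accents from text that stores them as separate combining characters.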


In [11]:
# Try the same extraction on the last event of the first date
last_event = first_date_events[-1]
name = last_event.find('h3').text

venue_and_attendees = last_event.findAll('div', attrs={'height': 30})
venue = venue_and_attendees[0].text
number_attend = int(venue_and_attendees[-1].text)

ValueError: invalid literal for int() with base 10: 'H0L0'

In [12]:
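# Some events list no attendee count, so the last matching div is the
# venue (here 'H0L0'); fall back to NaN when the text is not a number
try: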
    number_attend = int(venue_and_attendees[-1].text)
except ValueError:
    number_attend = np.nan
print("Name:", name)
print("Venue:", venue)
print("Date:", date)
print("Number of Attendees:", number_attend)


Name: Petra, Matthusen & Lang, White & Pitsiokos, and Zorn
Venue: H0L0
Date: ̸Sat, 30 Mar
Number of Attendees: nan
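
The `try`/`except` fallback above can also be wrapped in a small helper so the same logic is reused for every event. This is a sketch rather than part of the lab: `parse_attendees` is a hypothetical name, and it uses `math.nan` from the standard library in place of `np.nan` (both are the same float NaN value).

```python
import math

def parse_attendees(text):
    """Return the attendee count as an int, or NaN when the text is not a number."""
    try:
        return int(text.strip())
    except ValueError:
        return math.nan

print(parse_attendees("457"))   # 457
print(parse_attendees("H0L0"))  # nan
```

One caveat of this approach: a venue whose name consists only of digits would be misread as an attendee count.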


In [13]:
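# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe
rows = []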

for date_container in dates:
    if not date_container.text:
        continue
    
    date = date_container.find('div', class_="sticky-header").text
    date = date.strip("'/")
    
    events = date_container.findChildren('ul')
    for event in events:
        name = event.find('h3').text
        venue_and_attendees = event.findAll('div', attrs={'height': 30})
        venue = venue_and_attendees[0].text
        try:
            number_attend = int(venue_and_attendees[-1].text)
        except ValueError:
            number_attend = np.nan
        rows.append([name, venue, date, number_attend])

df = pd.DataFrame(rows)
df

|     | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | UnterMania II | TBA - New York | ̸Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | ̸Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | ̸Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | ̸Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | ̸Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 114 | A Night at the Baths | C'mon Everybody | ̸Fri, 5 Apr | 1.0 |
| 115 | Blaqk Audio | Music Hall of Williamsburg | ̸Fri, 5 Apr | 1.0 |
| 116 | Erik the Lover | Erv's | ̸Fri, 5 Apr | 1.0 |
| 117 | Wax On Vissions | Starliner | ̸Fri, 5 Apr | 1.0 |


In [14]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(events_page_url):
    # Make the request and parse the response as HTML
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the container with the relevant content
    events_all_div = soup.find('div', attrs={"data-tracking-id": "events-all"})
    event_listings = events_all_div.find("ul").find("li")
    dates = event_listings.findChildren(recursive=False)
    
    # Loop over all dates, and all events on each date, and
    # add them to the list
    rows = []
    for date_container in dates:
        
        if not date_container.text:
            continue

        date = date_container.find("div", class_="sticky-header").text
        date = date.strip("'̸")

        events = date_container.findChildren("ul")
        for event in events:
            
            name = event.find("h3").text
            venue_and_attendees = event.findAll("div", attrs={"height": 30})
            venue = venue_and_attendees[0].text
            try:
                num_attendees = int(venue_and_attendees[-1].text)
            except ValueError:
                num_attendees = np.nan

            rows.append([name, venue, date, num_attendees])

    df = pd.DataFrame(rows)
    # This time also specify the column names
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]  
    return df


In [15]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 114 | A Night at the Baths | C'mon Everybody | Fri, 5 Apr | 1.0 |
| 115 | Blaqk Audio | Music Hall of Williamsburg | Fri, 5 Apr | 1.0 |
| 116 | Erik the Lover | Erv's | Fri, 5 Apr | 1.0 |
| 117 | Wax On Vissions | Starliner | Fri, 5 Apr | 1.0 |


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

The link's `href` is a relative path, so make sure you add `https://web.archive.org` to the front to get the full URL.

![next page](images/ra_next.png)
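
As an alternative to plain string concatenation, the standard library's `urljoin` can combine the base with a relative path. The sketch below assumes the `href` has the shape shown in the comment, which is inferred from the URLs used in this lab rather than read from the page; `BASE_URL` is a name introduced here for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://web.archive.org"

# Assumed example of the relative href found on the "Next Week" link
relative_path = "/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06"

# urljoin resolves the relative path against the base URL
next_page_url = urljoin(BASE_URL, relative_path)
print(next_page_url)
```

For a path that starts with `/`, this gives the same result as concatenation. `urljoin` also handles an `href` that is already a full URL, in which case it returns it unchanged.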

In [16]:
# Find the button, find the relative path, create the URL for the current `soup`
svg = soup.find('svg', attrs={'aria-label': 'Right arrow'})

# The link is the sibling just before the arrow icon's parent element
svg_parent = svg.parent
link = svg_parent.previous_sibling

relative_path = link.get('href')
next_page_url = 'https://web.archive.org' + relative_path
next_page_url

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

In [None]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(current_page_url):
    response = requests.get(current_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the "Right arrow" icon; the link precedes the icon's parent element
    svg = soup.find('svg', attrs={'aria-label': 'Right arrow'})
    link = svg.parent.previous_sibling
    relative_path = link.get('href')
    return 'https://web.archive.org' + relative_path

In [18]:
# Test out your function
next_page(EVENTS_PAGE_URL)

ConnectionError: HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000028A43800B50>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
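
The `ConnectionError` above appears to be a network timeout rather than a bug in `next_page`, and the archive can be slow to respond. One possible way to cope is to retry the request a few times. The sketch below is an assumption-laden example, not part of the lab: `fetch_with_retries` is a hypothetical helper, and it takes the fetching function as an argument so it can wrap `requests.get` or any other callable.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying after `delay` seconds when it raises OSError."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:
            # requests' exceptions (including ConnectionError) subclass OSError
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay)

# Possible usage inside the lab's functions:
# response = fetch_with_retries(requests.get, EVENTS_PAGE_URL)
```

Passing a `timeout` argument to `requests.get` is a complementary option, since it stops a single request from hanging indefinitely.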

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.
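
The overall loop can be sketched independently of the scraping details. In the example below, `collect_rows`, `scrape`, and `get_next` are hypothetical names: `scrape` stands in for a function returning a list of rows for a page, and `get_next` for one returning the next page's URL. The `max_pages` cap is an added safeguard, assumed here so that the loop still ends if pages stop yielding events.

```python
import time

def collect_rows(start_url, scrape, get_next, min_rows=500, max_pages=20, delay=0.2):
    """Follow pages until at least min_rows rows are collected or max_pages is hit."""
    rows, url = [], start_url
    for _ in range(max_pages):
        rows.extend(scrape(url))
        if len(rows) >= min_rows:
            break
        time.sleep(delay)  # pause between requests to avoid rate limiting
        url = get_next(url)
    return rows
```

With the lab's functions, the same shape applies, except that each page yields a DataFrame to concatenate instead of a list to extend.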

In [None]:
# Your code here
overall_df = pd.DataFrame()

current_url = EVENTS_PAGE_URL
while overall_df.shape[0] < 500:
    # Get all events from the current URL
    df = scrape_events(current_url)
    time.sleep(.2)
    # Add the data to the overall df, renumbering rows so the index stays unique
    overall_df = pd.concat([overall_df, df], ignore_index=True)
    # Get the next URL and set it as the current URL
    current_url = next_page(current_url)
    time.sleep(.2)

overall_df

In [None]:
overall_df.sort_values("Number_of_Attendees", ascending=False)

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!