# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import time

In [4]:
events_page_url = 'https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30'
resp = requests.get(events_page_url)
soup = BeautifulSoup(resp.content, 'html.parser')

In [5]:
# Find the container with event listings in it

# This page is organized somewhat unusually, and many of
# the CSS attributes seem auto-generated. We notice that
# there is a div with "events-all" in its attributes that
# looks promising
events_all_div = soup.find('div', attrs={"data-tracking-id": "events-all"})

# The actual content is nested in a ul containing a single
# li within that div. Unclear why they are using a "list"
# concept for one element, but let's go ahead and select it
event_listings = events_all_div.find("ul").find("li")

# Print out some chunks of the text inside to make sure we
# have everything we need in here

# Beginning has events for March 30th
print(event_listings.text[:200])
print()
# Later we have events for March 31st
march_31st_start = event_listings.text.find("Sun, 31 Mar")
print(event_listings.text[march_31st_start:march_31st_start + 200])

# It looks like everything we need will be inside this event_listings tag

̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457Cocoon New York: Sven Väth, Ilario Alicante, Butch & TaimurSven Vath, Butch, Taimur, Il

Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New DadAce Hotel3ParadiscoOccupy The DiscoLe Bain3Sunday Soiree: Unknown Showcase (Detroit)Ryan Dahl, Ha


In [6]:
# Find a list of events by date within that container

# Now we look at what is inside of that event_listings li tag.
# Based on looking at the HTML with developer tools, we see
# that there are 13 children of that tag, all divs. Each div
# is either a container of events on a given date, or empty

# Let's create a collection of those divs. recursive=False
# means we stop at 1 level below the event_listings li
dates = event_listings.findChildren(recursive=False)

# Now let's print out the start of the March 30th and March
# 31st sections again. This time each is in its own "date"
# container

# March 30th is at the 0 index
print("0 index:", dates[0].text[:200])
print()
# The 1 index is empty. We'll need to skip this later
print("1 index: ", dates[1].text)
print()
# March 31st is at the 2 index
print("2 index:", dates[2].text[:200])

# Now we know we can loop over all of the items in the dates
# list of divs to find the dates, although some will be blank
# so we'll need to skip them

0 index: ̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457Cocoon New York: Sven Väth, Ilario Alicante, Butch & TaimurSven Vath, Butch, Taimur, Il

1 index:  

2 index: ̸Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New DadAce Hotel3ParadiscoOccupy The DiscoLe Bain3Sunday Soiree: Unknown Showcase (Detroit)Ryan Dahl, H


In [7]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers

# Grabbing just one to practice on
first_date = dates[0]

# This div contains a div with the date, followed by several uls
# containing actual event information

# The div with the date happens to have another human-readable
# CSS class, so let's use that to select it then grab its text
date = first_date.find("div", class_="sticky-header").text

# There is a / thing used for aesthetic reasons; let's remove it
date = date.strip("'̸")
date

'Sat, 30 Mar'

In [8]:
first_date_events = first_date.findChildren("ul")
first_date_events

[<ul class="Grid__GridStyled-sc-1l00ugd-0 fuNsvk grid" data-test-id="ticketed-event"><li class="Column-sc-18hsrnn-0 jHShKh"><div class="Box-omzyfs-0 sc-AxjAm dqkjhR" data-test-id="ticketed-event"><h3 class="Box-omzyfs-0 Heading__StyledBox-sc-120pa9w-0 fhMVGI"><a class="Link__AnchorWrapper-k7o46r-1 bmWkiB" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" href="/web/20210325230938/https://ra.co/events/1234892"><span class="Text-sc-1t0gn2o-0 Link__StyledLink-k7o46r-0 fAmOyf" color="primary" data-test-id="event-listing-heading" data-tracking-id="/events/1234892" font-weight="normal" href="/events/1234892">UnterMania II</span></a></h3><div class="Box-omzyfs-0 sc-AxjAm jVLhoy"><span class="Text-sc-1t0gn2o-0 dWOMtb" color="primary" font-weight="normal">Mary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady Machine</span></div><div class="Box-omzyfs-0 sc-AxjAm fCFvgO"><div class="Box-omzyfs-0 sc-AxjAm ebaaK"><div class="Box-omzyfs-0 sc-AxjAm fOOuYI" height="30"><div class="

In [9]:
# Extract the name, venue, and number of attendees from one of the
# events within that container

# As noted previously, the div with information about events on
# this date contains several ul tags, each with information about
# a specific event. Get a list of them.
# (Again this is an odd use of HTML, to have an unordered list
# containing a single list item. But we scrape what we find!)
first_date_events = first_date.findChildren("ul")

# Grabbing the first event ul to practice on
first_event = first_date_events[0]

# Each event ul contains a single h3 with the event name, easy enough
name = first_event.find("h3").text

# Venue and attendees is more complicated. Across the bottom are 1-3
# divs with height 30. The 0th contains a location pin SVG and then
# the location text. The -1th (last), when present, contains a person
# icon SVG and then the number of attendees. Sometimes there is a
# middle div with a ticket icon SVG and the words "RA Tickets", which
# we will plan to ignore

# First, get all 1-3 divs that match this description
venue_and_attendees = first_event.findAll("div", attrs={"height": 30})
# The venue is the 0th (left-most) div, get its text
venue = venue_and_attendees[0].text
# The number of attendees is the last div (although it's sometimes
# missing), get its text
num_attendees = int(venue_and_attendees[-1].text)

# Print out everything for one event
print("Name:", name)
print("Venue:", venue)
print("Date:", date)
print("Number of attendees:", num_attendees)

Name: UnterMania II
Venue: TBA - New York
Date: Sat, 30 Mar
Number of attendees: 457


In [10]:
# Testing that code out on an event with a missing attendee count
last_event = first_date_events[-1]

name = last_event.find("h3").text

venue_and_attendees = last_event.findAll("div", attrs={"height": 30})
venue = venue_and_attendees[0].text
num_attendees = int(venue_and_attendees[-1].text)

ValueError: invalid literal for int() with base 10: 'H0L0'

In [11]:
# Ok, that crashes because there is no attendee count. Let's
# put a try/except and set the attendee count to NaN, since
# that represents "missing data" reasonably

try:
    num_attendees = int(venue_and_attendees[-1].text)
except ValueError:
    num_attendees = np.nan
    
print("Name:", name)
print("Venue:", venue)
print("Date:", date)
print("Number of attendees:", num_attendees)

# Now we have code that should work for events with and
# without attendee counts

Name: Petra, Matthusen & Lang, White & Pitsiokos, and Zorn
Venue: H0L0
Date: Sat, 30 Mar
Number of attendees: nan


In [13]:
# Loop over all of the event entries, extract this information
# from each, and assemble a dataframe

# Create an empty list to hold results

rows = []

# Loop over all date containers on the page

for date_container in dates:
    # First check if this is one of the empty divs. If it is,
    # skip ahead to the next one
    if not date_container.text:
        continue
    # Same logic as above to extract the date
    date = date_container.find("div", class_="sticky-header")
    
    
    # This time, loop over all of the events
    events = date_container.findChildren("ul")
    for event in events:
        
         # Same logic as above to extract the name, venue, attendees
        name = event.find("h3").text
        venue_and_attendees = event.findAll("div", attrs={"height": 30})
        venue = venue_and_attendees[0].text
        try:
            num_attendees = int(venue_and_attendees[-1].text)
        except ValueError:
            num_attendees = np.nan
        # New piece here: appending the new information to rows list
        rows.append([name, venue, date, num_attendees])
        
# Make the list of lists into a dataframe and display
df = pd.DataFrame(rows)
df
        


Unnamed: 0,0,1,2,3
0,UnterMania II,TBA - New York,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",457.0
1,"Cocoon New York: Sven Väth, Ilario Alicante, B...",99 Scott Ave,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",407.0
2,Horse Meat Disco - New York Residency,Elsewhere,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",375.0
3,Rave: Underground Resistance All Night,Nowadays,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",232.0
4,"Believe You Me // Beta Librae, Stephan Kimbel,...",TBA - New York,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",89.0
...,...,...,...,...
114,A Night at the Baths,C'mon Everybody,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",1.0
115,Blaqk Audio,Music Hall of Williamsburg,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",1.0
116,Erik the Lover,Erv's,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",1.0
117,Wax On Vissions,Starliner,"[[[<span class=""Text-sc-1t0gn2o-0 gSvLLX"" colo...",1.0


In [None]:
def scrape_events(events_page_url):
    #Your code here
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

## Write a Function to Retrieve the URL for the Next Page

In [None]:
def next_page(url):
    #Your code here
    return next_page_url

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [None]:
#Your code here

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!