# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills again on a full-fledged site!

In this lab, you'll practice your scraping skills on an online music magazine and events website called Resident Advisor.

## Objectives

You will be able to:

* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the resulting data

## View the Website

For this lab, you'll be scraping the https://ra.co website. For reproducibility, we will use the [Internet Archive](https://archive.org/) Wayback Machine to retrieve an archived version of the page that lists events for the week of March 30, 2019.

Start by navigating to the events page [here](https://web.archive.org/web/20210325230938/https://ra.co/events/us/newyork?week=2019-03-30) in your browser. It should look something like this:

<img src="images/ra_top.png">

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

## Write a Function to Scrape all of the Events on the Given Page

The function should return a pandas DataFrame with columns for the `Event_Name`, `Venue`, `Event_Date`, and `Number_of_Attendees`.

Start by importing the relevant libraries, making a request to the relevant URL, and exploring the contents of the response with `BeautifulSoup`. Then fill in the `scrape_events` function with the relevant code.
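Before working with the live response, it can help to rehearse the navigation calls on a tiny hand-written snippet. The sketch below uses hypothetical, heavily simplified markup as a stand-in for the real page. Only the `data-tracking-id="events-all"` attribute and the `sticky-header` class come from this lab; everything else is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the nesting explored in this lab
SAMPLE_HTML = """
<div data-tracking-id="events-all">
  <ul>
    <li><div><div class="sticky-header">Sat, 30 Mar</div><ul><li><h3>UnterMania II</h3></li></ul></div><div></div></li>
  </ul>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select a tag by an arbitrary attribute with attrs={...}
container = sample_soup.find("div", attrs={"data-tracking-id": "events-all"})

# Chain find() calls to walk down to the first li inside the first ul
listing = container.find("ul").find("li")

# recursive=False keeps only the direct child tags of that li
date_divs = listing.find_all(recursive=False)

print(len(date_divs))
print(date_divs[0].find("div", class_="sticky-header").text)
print(date_divs[0].find("h3").text)
```

The second child div in this sample is deliberately empty, which mirrors the empty date containers that need to be skipped on the real page.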

In [15]:
# Relevant imports

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

In [16]:
EVENTS_PAGE_URL = "https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-03-30"

# Exploration: making the request and parsing the response
html = requests.get(EVENTS_PAGE_URL)
soup = BeautifulSoup(html.content, "html.parser")

In [17]:
# Find the container with event listings in it
# Some hints are given along the way

# This page is organized somewhat unusually, and many of
# the CSS attributes seem auto-generated. We notice that
# there is a div with "events-all" in its attributes that
# looks promising. Select it with soup.find() and call the
# result events_all_div

events_all_div = soup.find('div', attrs={"data-tracking-id": "events-all"})

# The actual content is nested in a ul containing a single
# li within that div. It's unclear why they are using a "list"
# concept for one element, but let's go ahead and select it.
# Find the ul inside events_all_div, then the li inside that
# ul, and call the result event_listings

event_listings = events_all_div.find("ul").find("li")


# Print out some chunks of the text inside to make sure we
# have everything we need in here
# For example print the events for March 30th and 31st

print(event_listings.text[:200])
start = event_listings.text.find("Sun, 31 Mar")
print(event_listings.text[start: start+200])

̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457Cocoon New York: Sven Väth, Ilario Alicante, Butch & TaimurSven Vath, Butch, Taimur, Il
Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New DadAce Hotel3ParadiscoOccupy The DiscoLe Bain3Sunday Soiree: Unknown Showcase (Detroit)Ryan Dahl, Ha


In [18]:
# Find a list of events by date within that container

# Now we look at what is inside of that event_listings li tag.
# Based on looking at the HTML with developer tools, we see
# that there are 13 children of that tag, all divs. Each div
# is either a container of events on a given date, or empty

# Let's create a collection of those divs. recursive=False
# means we stop at 1 level below the event_listings li
dates = event_listings.findChildren(recursive=False)

# Now let's print out the start of the March 30th and March
# 31st sections again. This time each is in its own "date"
# container

# March 30th is at the 0 index
print("0 index:", dates[0].text[:200])
print()
# The 1 index is empty. We'll need to skip this later
print("1 index: ", dates[1].text)
print()
# March 31st is at the 2 index
print("2 index:", dates[2].text[:200])

# Now we know we can loop over all of the items in the dates
# list of divs to find the dates, although some will be blank,
# so we'll need to skip them

0 index: ̸Sat, 30 MarUnterMania IIMary Yuzovskaya, Manni Dee, Umfang, Juana, The Lady MachineTBA - New YorkRARA Tickets457Cocoon New York: Sven Väth, Ilario Alicante, Butch & TaimurSven Vath, Butch, Taimur, Il

1 index:  

2 index: ̸Sun, 31 MarSunday: Soul SummitNowadaysRARA Tickets132New Dad & Aaron Clark (Honcho)Aaron Clark, New DadAce Hotel3ParadiscoOccupy The DiscoLe Bain3Sunday Soiree: Unknown Showcase (Detroit)Ryan Dahl, H


In [19]:
# Extract the date (e.g. Sat, 30 Mar) from one of those containers
# Call this first_date

# Grabbing just one to practice on
first_date = dates[0]

# This div contains a div with the date, followed by several uls
# containing actual event information

# The div with the date happens to have a human-readable CSS
# class, so let's use that to select it and then grab its text.
# Call this date, and pass the "sticky-header" class to
# first_date.find
date = first_date.find("div", {"class": "sticky-header"}).text

# A decorative slash-like character precedes the date; remove it
# by slicing off the first character
date = date[1:]
print(date)

Sat, 30 Mar
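Slicing off the first character works as long as exactly one decorative character precedes the date. A slightly more defensive variant is sketched below with a hypothetical helper named `clean_date`, which drops every leading character that is not a letter or digit. The `"\u0338"` code point used in the example is an assumption about which character the page uses.

```python
def clean_date(raw_text):
    """Drop any leading characters that are not letters or digits."""
    start = 0
    while start < len(raw_text) and not raw_text[start].isalnum():
        start += 1
    return raw_text[start:].strip()


# "\u0338" is assumed here to be the decorative marker in front of the date
print(clean_date("\u0338Sat, 30 Mar"))
# Text without a marker passes through unchanged
print(clean_date("Sat, 30 Mar"))
```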


In [20]:
# Extract the name, venue, and number of attendees from one of the
# events within that container

# As noted previously, the div with information about events on
# this date contains several ul tags, each with information about
# a specific event. Get a list of them.
# (Again this is an odd use of HTML, to have an unordered list
# containing a single list item. But we scrape what we find!)
first_date_events = first_date.findChildren("ul")

# Grabbing the first event ul to practice on
first_event = first_date_events[0]

# Each event ul contains a single h3 with the event name, easy enough
name = first_event.find("h3").text

# The venue and the attendee count are in divs that have a
# height attribute of 30. There are 1-3 such divs per event, so
# get all of them with first_event.findAll and {"height": 30}
venue_and_attendees = first_event.findAll("div", {"height": 30})
# The venue is the 0th (left-most) div, get its text
venue = venue_and_attendees[0].text
# The number of attendees is the last div (although it's sometimes
# missing), get its text
num_attendees = venue_and_attendees[-1].text

In [26]:
# Run the code below
# Make sure you understand it, since it will
# form the basis of the definition of scrape_events below

# Create an empty list to hold results
rows = []

# Loop over all date containers on the page
for date_container in dates:
    
    # First check if this is one of the empty divs. If it is,
    # skip ahead to the next one
    if not date_container.text:
        continue
    
    # Same logic as above to extract the date
    date = date_container.find("div", class_="sticky-header").text
    date = date.strip("'̸")
    
    # This time, loop over all of the events
    events = date_container.findChildren("ul")
    for event in events:
        
        # Same logic as above to extract the name, venue, attendees
        name = event.find("h3").text
        venue_and_attendees = event.findAll("div", attrs={"height": 30})
        venue = venue_and_attendees[0].text
        try:
            num_attendees = int(venue_and_attendees[-1].text)
        except ValueError:
            num_attendees = np.nan
            
        # New piece here: appending the new information to rows list
        rows.append([name, venue, date, num_attendees])

# Make the list of lists into a dataframe and display
df = pd.DataFrame(rows)
df  

|     | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 114 | A Night at the Baths | C'mon Everybody | Fri, 5 Apr | 1.0 |
| 115 | Blaqk Audio | Music Hall of Williamsburg | Fri, 5 Apr | 1.0 |
| 116 | Erik the Lover | Erv's | Fri, 5 Apr | 1.0 |
| 117 | Wax On Vissions | Starliner | Fri, 5 Apr | 1.0 |
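The `try`/`except ValueError` step in the loop is what keeps one event without an attendee count from stopping the whole scrape. When the count is missing, the last matching div holds the venue text instead, and `int()` raises `ValueError`. As a minimal sketch, that step can be isolated in a hypothetical helper named `parse_attendees`. It uses `math.nan` from the standard library, which is the same floating-point NaN value as `np.nan`.

```python
import math


def parse_attendees(text):
    """Return the attendee count as an int, or NaN if text is not a number."""
    try:
        return int(text)
    except ValueError:
        # Non-numeric text (for example a venue name) means the count is missing
        return math.nan


print(parse_attendees("457"))
print(parse_attendees("TBA - New York"))
```

Because NaN is a float, pandas stores the whole column as floats, which is why the counts are displayed as `457.0` rather than `457`.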


In [34]:
# Bring it all together in a function that makes the request, gets the
# list of entries from the response, loops over that list to extract the
# name, venue, date, and number of attendees for each event, and returns
# that list of events as a dataframe

def scrape_events(events_page_url):
    
    html = requests.get(events_page_url)
    soup = BeautifulSoup(html.content, "html.parser")
    
    events_all_div = soup.find('div', attrs={"data-tracking-id": "events-all"})
    event_listings = events_all_div.find("ul").find("li")
    
    dates = event_listings.findChildren(recursive=False)
    
    rows = []
    
    for date_container in dates:

        # First check if this is one of the empty divs. If it is,
        # skip ahead to the next one
        if not date_container.text:
            continue

        # Same logic as above to extract the date
        date = date_container.find("div", class_="sticky-header").text
        date = date.strip("'̸")

        # This time, loop over all of the events
        events = date_container.findChildren("ul")
        for event in events:

            # Same logic as above to extract the name, venue, attendees
            name = event.find("h3").text
            venue_and_attendees = event.findAll("div", attrs={"height": 30})
            venue = venue_and_attendees[0].text
            try:
                num_attendees = int(venue_and_attendees[-1].text)
            except ValueError:
                num_attendees = np.nan

            # New piece here: appending the new information to rows list
            rows.append([name, venue, date, num_attendees])
    
    df2 = pd.DataFrame(rows)
    df2.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df2

In [35]:
# Test out your function
scrape_events(EVENTS_PAGE_URL)

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 114 | A Night at the Baths | C'mon Everybody | Fri, 5 Apr | 1.0 |
| 115 | Blaqk Audio | Music Hall of Williamsburg | Fri, 5 Apr | 1.0 |
| 116 | Erik the Lover | Erv's | Fri, 5 Apr | 1.0 |
| 117 | Wax On Vissions | Starliner | Fri, 5 Apr | 1.0 |


## Write a Function to Retrieve the URL for the Next Page

As you scroll down, there should be a button labeled "Next Week" that will take you to the next page of events. Write code to find that button and extract the URL from it.

The link's `href` is a relative path that begins with `/`, so prepend `https://web.archive.org` (with no trailing slash) to get the full URL.

![next page](images/ra_next.png)
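Plain string concatenation is sensitive to whether the prefix ends with a slash. One alternative is the standard library's `urllib.parse.urljoin`, which resolves a path against a base URL. The sketch below assumes a hypothetical helper named `build_absolute_url`; the relative path is the one this page's "Next Week" link is expected to contain.

```python
from urllib.parse import urljoin

ARCHIVE_ROOT = "https://web.archive.org"


def build_absolute_url(relative_path, root=ARCHIVE_ROOT):
    """Resolve a path taken from an href against the archive's root URL."""
    return urljoin(root, relative_path)


relative = "/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06"
print(build_absolute_url(relative))
```

With `urljoin`, the result is the same whether or not `root` ends with a trailing slash, so no doubled `//` appears after the domain.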

In [45]:
# Find the button, find the relative path, create the URL for the current `soup`

# This is tricky again, since there are not a lot of
# human-readable CSS classes

# One unique thing we notice is a ">" icon on the part where
# you click to go to the next page. It's an SVG with an
# aria-label of "Right arrow", so this soup.find() call takes
# {"aria-label": "Right arrow"} as its attrs argument

svg = soup.find("svg", {"aria-label": "Right arrow"})

# That SVG is inside of a div
svg_parent = svg.parent

# And the tag right before that div (its "previous sibling")
# is an anchor (link) tag with the path we need
link = svg_parent.previousSibling

# Then we can extract the path from that link to build the full URL.
# The path already starts with "/", so the prefix has no trailing slash
relative_path = link.attrs["href"]
next_page_url = "https://web.archive.org" + relative_path
next_page_url

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

In [49]:
# Fill in this function, to take in the current page's URL and return the
# next page's URL
def next_page(url):
    # Request and parse the current page
    html = requests.get(url)
    soup = BeautifulSoup(html.content, "html.parser")
    
    # Locate the "Right arrow" icon; the link we need is the tag
    # just before the icon's parent div
    svg = soup.find("svg", {"aria-label": "Right arrow"})
    svg_parent = svg.parent
    link = svg_parent.previousSibling
    relative_path = link.attrs["href"]
    
    # The path already starts with "/", so the prefix has no trailing slash
    next_page_url = "https://web.archive.org" + relative_path
    
    return next_page_url

In [50]:
# Test out your function
next_page(EVENTS_PAGE_URL)

'https://web.archive.org/web/20210326225933/https://ra.co/events/us/newyork?week=2019-04-06'

## Scrape the Next 500 Events

In other words, repeatedly call `scrape_events` and `next_page` until you have assembled a dataframe with at least 500 rows.

Display the data sorted by the number of attendees, greatest to least.

We recommend adding a brief `time.sleep` call between `requests.get` calls to avoid rate limiting.
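The general shape of such a loop is sketched below. This is only an illustration: `collect_rows`, `fake_scrape`, and `fake_next` are hypothetical names, the stand-in functions avoid any network access, and the `max_pages` cap is an added safety assumption so the loop cannot run indefinitely if pages stop yielding rows.

```python
import time


def collect_rows(start_url, scrape_page, get_next_url, min_rows,
                 delay_seconds=1.0, max_pages=20):
    """Scrape page after page until min_rows rows are collected."""
    rows = []
    url = start_url
    pages_visited = 0
    while len(rows) < min_rows and pages_visited < max_pages:
        rows.extend(scrape_page(url))
        pages_visited += 1
        if len(rows) >= min_rows:
            break
        # Pause before moving on to the next page to avoid rate limiting
        time.sleep(delay_seconds)
        url = get_next_url(url)
    return rows


# Stand-in functions so the sketch runs without network access
def fake_scrape(url):
    return [(url, i) for i in range(3)]


def fake_next(url):
    return url + "+"


rows = collect_rows("page", fake_scrape, fake_next, min_rows=7, delay_seconds=0)
print(len(rows))
```

In the lab itself, `scrape_events` and `next_page` would play the roles of the two callables, and the rows would be accumulated in a DataFrame rather than a list.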

In [67]:
# Your code here
import time

# Make a dataframe to store results. We will concatenate
# additional dfs as they are returned
overall_df = pd.DataFrame()

current_url = EVENTS_PAGE_URL

# Keep scraping until overall_df has at least 500 rows
while overall_df.shape[0] < 500:
    overall_df = pd.concat([overall_df, scrape_events(current_url)], ignore_index=True)
    current_url = next_page(current_url)
    # Pause briefly between iterations to avoid rate limiting
    time.sleep(1)

In [68]:
overall_df

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 0 | UnterMania II | TBA - New York | Sat, 30 Mar | 457.0 |
| 1 | Cocoon New York: Sven Väth, Ilario Alicante, B... | 99 Scott Ave | Sat, 30 Mar | 407.0 |
| 2 | Horse Meat Disco - New York Residency | Elsewhere | Sat, 30 Mar | 375.0 |
| 3 | Rave: Underground Resistance All Night | Nowadays | Sat, 30 Mar | 232.0 |
| 4 | Believe You Me // Beta Librae, Stephan Kimbel,... | TBA - New York | Sat, 30 Mar | 89.0 |
| ... | ... | ... | ... | ... |
| 601 | Sleepy & Boo, Unseen., Dysco, Joeoh | Rose Gold | Fri, 3 May | 2.0 |
| 602 | Diving for Disco with Jake From Extra Water | Our Wicked Lady | Fri, 3 May | 2.0 |
| 603 | The Happy Hour After Work Party at Doha Nightclub | Doha Club | Fri, 3 May | 1.0 |
| 604 | Best of the Boogie | Erv's | Fri, 3 May | 1.0 |


In [70]:
# Display overall_df in the specified sorted order:
# by Number_of_Attendees, descending

overall_df.sort_values(by="Number_of_Attendees", ascending=False)

|     | Event_Name | Venue | Event_Date | Number_of_Attendees |
| --- | --- | --- | --- | --- |
| 119 | Zero presents... The Masquerade | The 1896 | Sat, 6 Apr | 919.0 |
| 300 | Secret Solstice Pre-Party (Free Entry): Metro ... | Kings Hall - Avant Gardner | Thu, 18 Apr | 670.0 |
| 353 | Nina Kraviz / James Murphy / Justin Cudmore | Knockdown Center | Sat, 20 Apr | 501.0 |
| 208 | Stavroz live! presented by Zero | The Williamsburg Hotel | Fri, 12 Apr | 481.0 |
| 91 | Teksupport: Honey Dijon (All Night Long) Sold Out | 99 Scott Ave | Fri, 5 Apr | 463.0 |
| ... | ... | ... | ... | ... |
| 409 | 420: A Musical Experience | The Kraine Theater | Mon, 22 Apr | NaN |
| 414 | 420: A Musical Experience | The Kraine Theater | Tue, 23 Apr | NaN |
| 428 | 420: A Musical Experience | The Kraine Theater | Wed, 24 Apr | NaN |
| 516 | Klandestino Brunch with Electronic Music | Avena Downtown | Sat, 27 Apr | NaN |
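Events with no attendee count end up at the bottom of the sorted output because `sort_values` places missing values last by default, regardless of sort direction. The sketch below shows this on a small hand-built DataFrame (three rows copied from the results above) and, as an optional extra, converts the column to pandas' nullable `Int64` dtype so that counts display as whole numbers.

```python
import numpy as np
import pandas as pd

# A small sample built by hand from rows shown in this lab
sample = pd.DataFrame(
    {
        "Event_Name": [
            "UnterMania II",
            "420: A Musical Experience",
            "Zero presents... The Masquerade",
        ],
        "Number_of_Attendees": [457.0, np.nan, 919.0],
    }
)

# na_position="last" is the default; it is spelled out here for clarity
ranked = sample.sort_values(
    by="Number_of_Attendees", ascending=False, na_position="last"
)

# Optional: the nullable integer dtype keeps missing values as <NA>
ranked["Number_of_Attendees"] = ranked["Number_of_Attendees"].astype("Int64")
print(ranked)
```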


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!