<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scraping-Concerts---Lab" data-toc-modified-id="Scraping-Concerts---Lab-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scraping Concerts - Lab</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#View-the-Website" data-toc-modified-id="View-the-Website-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>View the Website</a></span></li><li><span><a href="#Open-the-Inspect-Element-Feature" data-toc-modified-id="Open-the-Inspect-Element-Feature-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Open the Inspect Element Feature</a></span></li><li><span><a href="#Write-a-Function-to-Scrape-all-of-the-Events-on-the-Given-Page-Events-Page" data-toc-modified-id="Write-a-Function-to-Scrape-all-of-the-Events-on-the-Given-Page-Events-Page-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Write a Function to Scrape all of the Events on the Given Page Events Page</a></span></li><li><span><a href="#Write-a-Function-to-Retrieve-the-URL-for-the-Next-Page" data-toc-modified-id="Write-a-Function-to-Retrieve-the-URL-for-the-Next-Page-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Write a Function to Retrieve the URL for the Next Page</a></span></li><li><span><a href="#Scrape-the-Next-1000-Events-for-Your-Area" data-toc-modified-id="Scrape-the-Next-1000-Events-for-Your-Area-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Scrape the Next 1000 Events for Your Area</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Summary</a></span></li></ul></li></ul></div>

# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [9]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [10]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [14]:
import re
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time

# Exploration; designing/testing function parts
response = requests.get("https://www.residentadvisor.net/events/us/california")
soup = BeautifulSoup(response.content, 'html.parser')

event_listings = soup.find('div', id='event-listing') # define the search event

entries = event_listings.findAll('li') # tell Python to search for all the event listings via listed items.
print(len(entries), entries[0]) #prints the number of entries, as well as the first entry (entries[0])

# Successive exploration in function development

rows = [] # create rows for dataframe
for entry in entries:
    # Is it a date? If so, set current date.
    date = entry.find('p', class_="eventDate date")
    event = entry.find('h1', class_="event-title")
    if event:
        details = event.text.split(' at') # tell Python that you want to split text, and where.
        # Format the split string exactly as it appears in the html, like you do here
        event_name = details[0].strip()
        venue = details[1].strip()
        try:
            n_attendees = int(re.match("(\d*)", entry.find('p', class_="attending").text)[0])
            # \d looks for any decimals
        except:
                n_attendees = np.nan
        rows.append([event_name, venue, cur_date, n_attendees])
    elif date: # I don't really understand why there would not be a date?
        cur_date = date.text
    else:
        continue
        
df = pd.DataFrame(rows)
df.head()

# Final function
def scrape_events(events_page_url):
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    entries = event_listings.findAll('li')
    rows = []
    for entry in entries:
        # Is it a date? If so, set current date.
        date = entry.find('p', class_="eventDate date")
        event = entry.find('h1', class_="event-title")
        if event:
            details = event.text.split(' at ')
            event_name = details[0].strip()
            venue = details[1].strip()
            try:
                n_attendees = int(re.match("(\d*)"), entry.find('p', class_="attending".text)[0])
            except:
                n_attendees = np.nan
            rows.append([event_name, venue, cur_date, n_attendees])
        elif date:
            cur_date = date.text
        else:
            continue
    df = pd.DataFrame(rows)
    df.head()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

91 <li><p class="eventDate date"><a href="/events.aspx?ai=308&amp;v=day&amp;mn=2&amp;yr=2020&amp;dy=4"><span>Tue, 04 Feb 2020 /</span></a></p></li>


## Write a Function to Retrieve the URL for the Next Page

In [15]:
soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
def next_page(url):
    response =  requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    url_ext = soup.find('a', attrs={'ga-event-action':"Next "}).attrs['href']
    next_page_url = "https://www.residentadvisor.net" + url_ext
    return next_page_url

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [22]:
dfs = []
total_rows = 0 # start counter at 0.
cur_url = "https://www.residentadvisor.net/events/us/newyork"
while total_rows <= 1000:
    df = scrape_events(cur_url)
    dfs.append(df)
    total_rows += len(df)
    cur_url = next_page(cur_url) # force while loop to move on to next page
    time.sleep(.2) # give the page a chance to load?
df = pd.concat(dfs)
df = df.iloc[:1000]
print(len(df))
df.head(1000)

    

1000


Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,Mirrorshades,"The Lash, Los Angeles","Tue, 04 Feb 2020 /",
1,"Brett Ballou's Birthday - DJ Dan, Wally Caller...","The Circle OC, Los Angeles","Tue, 04 Feb 2020 /",
2,Clinic with Rafa Barrios (Toolroom),"The Sayers Club, Los Angeles","Wed, 05 Feb 2020 /",
3,Housepitality: Garth - Cole - Smokes,"F8 1192 Folsom, San Francisco","Wed, 05 Feb 2020 /",
4,An Intimate Night with Broun Fellinis & Charlo...,"The Great Northern, San Francisco","Wed, 05 Feb 2020 /",
5,Negress: BAE 2 BAE,"El Dorado, Los Angeles","Wed, 05 Feb 2020 /",
6,"Warhol Vinyl: Dahlia, Dog People, Dr D, Michae...","The Lexington, Los Angeles","Wed, 05 Feb 2020 /",
7,"Mintyboi presents: Actress, Sentimental Rave, ...","Chewing Foil, Los Angeles","Thu, 06 Feb 2020 /",
8,[CANCELLED] The Rose presents: Doc Martin (Sub...,"The Rose, Los Angeles","Thu, 06 Feb 2020 /",
9,"Supernature x ICD: Skyler Redondo, Anderson Ch...","Monarch, San Francisco","Thu, 06 Feb 2020 /",


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!