# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to again practice those skills on a full-fledged site!
In this lab, you'll practice your scraping skills on a music website: https://www.residentadvisor.net.
## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [67]:
from bs4 import BeautifulSoup
import requests
import re
import numpy as np
import shutil
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd
from IPython.display import Image, HTML
import time

In [68]:
# Load the https://www.residentadvisor.net/events page in your browser.
html_page = requests.get('https://www.residentadvisor.net/events')
soup = BeautifulSoup(html_page.content, 'html.parser')


## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser

## Write a Function to Scrape all of the Events on the Given Page Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [69]:
events = soup.find('div', id="event-listing")
#events

In [70]:
evententries = events.findAll('li')
#evententries

In [71]:
rows = []
for entry in evententries: # iterate through evententries and scrape useful data 
    date = entry.find('p', class_="eventDate date") # pull date from event date class 
    event = entry.find('h1', class_="event-title") # pull title from event header
    if event:
        details = event.text.split(' at ') # splits event name string between act and venue
        event_name = details[0].strip() # declares former part of string as event name
        venue = details[1].strip() # declares latter part of string as venue name
        try: # pulls amount of people confirmed to attend the show by searching for attending class and declaring it an int
            n_attendees = int(re.match("(\d)", entry.find('p', class_="attending").text)[0]) 
        except: # if no attending class exists(no confirmed attendees), declare NaN
            n_attendees = np.nan
        rows.append([event_name, venue, cur_date, n_attendees]) # append the rows with data; name, venue, event date, # of attendees
    elif date:
        cur_date = date.text # takes date text from above if the event isn't the current date
    else: 
        continue # continue to next iteration
df = pd.DataFrame(rows) # plug row data into pandas dataframe
df

Unnamed: 0,0,1,2,3
0,Meow Wolf Dark Palace,National Western Complex,"Thu, 09 Apr 2020 /",9.0
1,Meow Wolf Dark Palace,National Western Complex,"Fri, 10 Apr 2020 /",9.0
2,William Black,Bluebird Theater,"Fri, 10 Apr 2020 /",1.0
3,Vincent,Temple Nightclub,"Fri, 10 Apr 2020 /",
4,Meow Wolf Dark Palace,National Western Complex,"Sat, 11 Apr 2020 /",9.0
5,Sofi Tukker,Ogden Theatre,"Wed, 15 Apr 2020 /",5.0


In [72]:
def scrape_events(events_page_url): #copy steps from above into one big function
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    #events = soup.find('div', id="event-listing")
    evententries = events.findAll('li')
    rows = []
    for entry in evententries: # iterate through evententries and scrape useful data 
        date = entry.find('p', class_="eventDate date") # pull date from event date class 
        event = entry.find('h1', class_="event-title") # pull title from event header
        if event:
            details = event.text.split(' at ') # splits event name string between act and venue
            event_name = details[0].strip() # declares former part of string as event name
            venue = details[1].strip() # declares latter part of string as venue name
            try: # pulls amount of people confirmed to attend the show by searching for attending class and declaring it an int
                n_attendees = int(re.match("(\d)", entry.find('p', class_="attending").text)[0]) 
            except: # if no attending class exists(no confirmed attendees), declare NaN
                n_attendees = np.nan
            rows.append([event_name, venue, cur_date, n_attendees]) # append the rows with data; name, venue, event date, # of attendees
        elif date:
            cur_date = date.text # takes date text from above if the event isn't the current date
        else: 
            continue # continue to next iteration
    df = pd.DataFrame(rows) # plug row data into pandas dataframe
    df
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [73]:
scrape_events("https://www.residentadvisor.net/events")

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,Meow Wolf Dark Palace,National Western Complex,"Thu, 09 Apr 2020 /",9.0
1,Meow Wolf Dark Palace,National Western Complex,"Fri, 10 Apr 2020 /",9.0
2,William Black,Bluebird Theater,"Fri, 10 Apr 2020 /",1.0
3,Vincent,Temple Nightclub,"Fri, 10 Apr 2020 /",
4,Meow Wolf Dark Palace,National Western Complex,"Sat, 11 Apr 2020 /",9.0
5,Sofi Tukker,Ogden Theatre,"Wed, 15 Apr 2020 /",5.0


## Write a Function to Retrieve the URL for the Next Page

In [74]:
soup.find('a', attrs={'ga-event-action':'Next '}).attrs['href'] # pulls link extension from next button 
# PAY ATTENTION TO SPACING. THANKS. 

'/events/us/colorado/week/2020-04-16'

In [97]:
def next_page(url):
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    url_ext = soup.find('a', attrs={'ga-event-action':'Next '}).attrs['href']
    next_page_url = "https://www.residentadvisor.net" + url_ext # need to use base URL or it duplicates part of url
    #print(url_ext) commented this out, had to insert to debug where I was going wrong in this function
    return next_page_url


In [93]:
next_page('https://www.residentadvisor.net/events')

/events/us/colorado/week/2020-04-16


'https://www.residentadvisor.net/events/us/colorado/week/2020-04-16'

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [98]:
#Your code here
dfs = [] # create empty list of dataframes
total_rows = 0 # set starting number of rows to 0
current_url = 'https://www.residentadvisor.net/events/us/colorado'
while total_rows <= 100: # kept throwing an error because there are less than 1000 events in my area
    df = scrape_events(current_url) # scrape events page by page and store them as a df
    dfs.append(df) # append each individual df to the dfs list
    total_rows += len(df) # increase total rows by number of entries in the appended df
    current_url = next_page(current_url) # set current url to the next page's url by calling next_page function
    time.sleep(.2)
df = pd.concat(dfs) # create an aggregate df by concatenating the dfs created in the iterations
df = df.iloc[:1000] # limit the length to 1000 
print(len(df)) # print length of the df to ensure it functioned correctly
df.head()


102


Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,Meow Wolf Dark Palace,National Western Complex,"Thu, 09 Apr 2020 /",9.0
1,Meow Wolf Dark Palace,National Western Complex,"Fri, 10 Apr 2020 /",9.0
2,William Black,Bluebird Theater,"Fri, 10 Apr 2020 /",1.0
3,Vincent,Temple Nightclub,"Fri, 10 Apr 2020 /",
4,Meow Wolf Dark Palace,National Western Complex,"Sat, 11 Apr 2020 /",9.0


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!