# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills again on a full-fledged site!
In this lab, you'll scrape a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

In [3]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

In [35]:
html_page = requests.get('https://www.residentadvisor.net/events')
soup = BeautifulSoup(html_page.content, 'html.parser')
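
The request above assumes the page loads successfully. As a minimal sketch of the error handling mentioned in the objectives, the hypothetical helper `fetch_soup` below (not part of the lab's starter code) adds a timeout, checks the HTTP status, and returns `None` when the request fails. The 10-second timeout is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return the parsed HTML for `url`, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises for 4xx and 5xx responses
    except requests.exceptions.RequestException as error:
        # Covers connection errors, timeouts, bad URLs and HTTP error statuses
        print(f"Request for {url} failed: {error}")
        return None
    return BeautifulSoup(response.content, 'html.parser')
```

You could then call `fetch_soup('https://www.residentadvisor.net/events')` and skip the page whenever the result is `None`.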


In [38]:
soup.find('h1', class_='event-title') # event

<h1 class="event-title" itemprop="summary"><a href="/events/1328205" itemprop="url" title="Event details of Spanksgiving // Verlk // Bedrock">Spanksgiving // Verlk // Bedrock</a> <span>at <a href="/club.aspx?id=62829">Boondocks</a>, <a href="/events.aspx?ai=63">Houston</a></span></h1>

In [45]:
soup.find('h1', class_='event-title').findAll('a')[0].text

'Spanksgiving // Verlk // Bedrock'

In [46]:
soup.find('h1', class_='event-title').findAll('a')[1].text

'Boondocks'

In [75]:
soup.find('h1', class_='event-title').parent.parent.find('time').text[:-6]

'2019-11-27'

In [71]:
soup.findAll('h1', class_='event-title')[2].parent.parent.find('time').text[:-6]

'2019-11-29'

In [90]:
# This event has no attendee element after its title, so nextSibling.nextSibling is None
soup.findAll('h1', class_='event-title')[6].nextSibling.nextSibling.find('span').text

AttributeError: 'NoneType' object has no attribute 'find'
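
The error above comes from calling `.find` on `None`: the sibling lookup found nothing for this event. One way to guard against that, sketched on a small made-up HTML snippet rather than the live page, is to check each lookup before using its result. The markup in `SAMPLE_HTML` and the helper name `attendee_count` are assumptions for illustration, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first event lists attendees, the second does not
SAMPLE_HTML = """
<article>
  <div>
    <h1 class="event-title"><a href="/events/1">First Event</a> <span>at <a href="/club.aspx?id=1">Venue A</a></span></h1>
    <p class="attending"><span>4</span> Attending</p>
  </div>
</article>
<article>
  <div>
    <h1 class="event-title"><a href="/events/2">Second Event</a> <span>at <a href="/club.aspx?id=2">Venue B</a></span></h1>
  </div>
</article>
"""


def attendee_count(title_tag):
    """Return the attendee count that follows an event title, or 0 if none is listed."""
    # find_next_sibling(True) returns the next sibling that is a tag,
    # skipping whitespace text nodes, or None when there is no such tag
    sibling = title_tag.find_next_sibling(True)
    if sibling is None:
        return 0
    span = sibling.find('span')
    if span is None:
        return 0
    return int(span.text.strip())


soup_sample = BeautifulSoup(SAMPLE_HTML, 'html.parser')
counts = [attendee_count(h1) for h1 in soup_sample.find_all('h1', class_='event-title')]
print(counts)
```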

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a pandas DataFrame with the columns Event_Name, Venue, Event_Date, and Number_of_Attendees.

In [96]:
def scrape_events(events_page_url):
    """Scrape one events page into a DataFrame with one row per event."""
    names = []
    venues = []
    dates = []
    num_attendees = []

    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    event_list = soup.findAll('h1', class_='event-title')

    for event in event_list:
        links = event.findAll('a')  # first link is the event, second is the venue
        names.append(links[0].text)
        venues.append(links[1].text)
        # Drop the last six characters of the <time> text to keep only the date
        dates.append(event.parent.parent.find('time').text[:-6])
        # Events with no attendees have no element after the title
        attendees = event.nextSibling.nextSibling
        if attendees is None:
            num_attendees.append(0)
        else:
            # Store counts as integers so they sort numerically
            # (assumes the span holds a plain integer)
            num_attendees.append(int(attendees.find('span').text))

    df = pd.DataFrame({"Event_Name": names, "Venue": venues, "Event_Date": dates, "Number_of_Attendees": num_attendees})
    return df


In [97]:
scrape_events('https://www.residentadvisor.net/events')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Spanksgiving // Verlk // Bedrock | Boondocks | 2019-11-27 | 4 |
| 1 | Praia at Night Feat. Carlo Lio B2B Nathan Barato | Bauhaus | 2019-11-29 | 5 |
| 2 | Nero (DJ Set) | Empire Control Room & Garage | 2019-11-29 | 3 |
| 3 | Chus Ceballos | It'll Do | 2019-11-29 | 3 |
| 4 | Black Friday Midwest Sessions with Demarkus Lewis | Plush | 2019-11-29 | 2 |
| 5 | Kinda Super Disco // Erick Morillo | Numbers | 2019-11-30 | 5 |
| 6 | Brett Johnson | It'll Do | 2019-11-30 | 0 |


## Write a Function to Retrieve the URL for the Next Page

In [113]:
soup.find(id='liNext2').find('a').attrs['href'] #next button url

'/events/us/texas/week/2019-12-04'

In [128]:
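def next_page(url):
    """Return the URL of the next events page, or the same URL if there is none."""
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')

    base_url = 'https://www.residentadvisor.net'

    # Look up the "next" button once and guard against it being absent
    next_item = soup.find(id='liNext2')
    link = next_item.find('a') if next_item is not None else None
    if link is not None and link.has_attr('href'):
        return base_url + link['href']
    return url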

In [126]:
next_page('https://www.residentadvisor.net/events')

'https://www.residentadvisor.net/events/us/texas/week/2019-12-04'
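
The function above builds the full URL by prepending a hard-coded base to the relative link. As an alternative sketch, not the lab's required approach, the standard library's `urljoin` resolves the link against the page it came from, and also copes with links that are already absolute. The second link below is a made-up example.

```python
from urllib.parse import urljoin

page_url = 'https://www.residentadvisor.net/events'

# Relative link, as returned by the "next" button lookup above
relative_href = '/events/us/texas/week/2019-12-04'
next_url = urljoin(page_url, relative_href)
print(next_url)

# Made-up absolute link: urljoin returns it unchanged
absolute_href = 'https://www.residentadvisor.net/events/us/texas'
same_url = urljoin(page_url, absolute_href)
print(same_url)
```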

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees, from most to fewest. If there is a tie in the number attending, sort by event date, earliest first.

In [129]:
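# Collect pages until 1000 events are gathered or there is no next page
df = pd.DataFrame()
url = 'https://www.residentadvisor.net/events'

while len(df) < 1000:
    df1 = scrape_events(url)
    df = pd.concat([df, df1], ignore_index=True)
    next_url = next_page(url)  # request the next-page link once per iteration
    if next_url == url:
        break  # no further pages
    url = next_url

print(len(df))
df.tail()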


28


| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 23 | Gatsby's Penthouse - Dallas New Year's 2020 | The Le Méridien Dallas Stoneleigh | 2019-12-31 | 1 |
| 24 | Gatsby's House - Houston New Year's Eve 2020 | Houston | 2019-12-31 | 1 |
| 25 | Gladys Knight in Concert | Arena Theatre | 2020-01-09 | 1 |
| 26 | Oscar G - Made In Miami | Bauhaus | 2020-01-18 | 1 |
| 27 | Chaka Khan in Concert | Arena Theatre | 2020-02-01 | 2 |
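
The loop above stops for one of two reasons: enough events were collected, or the site has no further page. A minimal sketch of that stopping logic, runnable offline, is shown below. The function `collect_events`, the `FAKE_PAGES` dictionary, and the two stand-in functions are hypothetical. In the lab you would pass `scrape_events` and `next_page` instead.

```python
import pandas as pd


def collect_events(start_url, scrape, get_next, target=1000):
    """Follow next-page links until `target` rows are collected or the pages run out."""
    frames = []
    total = 0
    url = start_url
    seen = set()
    # Stop when the target is reached or when a page repeats (no new next page)
    while total < target and url not in seen:
        seen.add(url)
        page_df = scrape(url)
        frames.append(page_df)
        total += len(page_df)
        url = get_next(url)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Made-up site: each page maps to (event names, next page); the last page points to itself
FAKE_PAGES = {
    'page1': (['a', 'b'], 'page2'),
    'page2': (['c'], 'page3'),
    'page3': (['d', 'e'], 'page3'),
}


def fake_scrape(url):
    return pd.DataFrame({'Event_Name': FAKE_PAGES[url][0]})


def fake_next(url):
    return FAKE_PAGES[url][1]


all_events = collect_events('page1', fake_scrape, fake_next)
first_three = collect_events('page1', fake_scrape, fake_next, target=3)
print(len(all_events), len(first_three))
```

Collecting the per-page DataFrames in a list and concatenating once at the end avoids rebuilding the combined DataFrame on every iteration.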


In [132]:
df.sort_values(['Number_of_Attendees','Event_Date'], ascending=[False,True])

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 1 | Praia at Night Feat. Carlo Lio B2B Nathan Barato | Bauhaus | 2019-11-29 | 5 |
| 5 | Kinda Super Disco // Erick Morillo | Numbers | 2019-11-30 | 5 |
| 14 | Tourist & Matthew Dear | The Parish | 2019-12-14 | 5 |
| 20 | Bas_mrkt 2020 NYE! Barbuto, Sara Landry, Nymbl... | The Oven | 2019-12-31 | 5 |
| 0 | Spanksgiving // Verlk // Bedrock | Boondocks | 2019-11-27 | 4 |
| 7 | House of Tones presents: LA Riots & Vanilla Ace | Voodoo Rm 3rd Floor | 2019-12-05 | 4 |
| 11 | Steve Lawler (Viva Music) | Club Here I Love You | 2019-12-08 | 4 |
| 12 | Tacky Sweater Tech House Party | Voodoo Rm 3rd Floor | 2019-12-08 | 4 |
| 16 | Private Label presents: Sacha Robbotti | The Terrace at Stereo Live | 2019-12-15 | 4 |
| 2 | Nero (DJ Set) | Empire Control Room & Garage | 2019-11-29 | 3 |
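
Sorting by attendee count only ranks events correctly when the counts are numbers. If a scraper stores them as text, string comparison ranks '5' above '10', and a column that mixes text with integers can raise an error when sorted. A minimal sketch with made-up rows (not data from the site) converts the column with `pd.to_numeric` before applying the same `sort_values` call:

```python
import pandas as pd

# Made-up rows: counts arrive as a mix of strings and an integer
demo = pd.DataFrame({
    "Event_Name": ["A", "B", "C", "D"],
    "Event_Date": ["2019-12-01", "2019-11-29", "2019-11-30", "2019-11-28"],
    "Number_of_Attendees": ["5", "10", "5", 0],
})

# Convert the counts to numbers so 10 ranks above 5
demo["Number_of_Attendees"] = pd.to_numeric(demo["Number_of_Attendees"])

# Most attendees first; ties broken by the earlier date
ranked = demo.sort_values(["Number_of_Attendees", "Event_Date"], ascending=[False, True])
print(ranked["Event_Name"].tolist())
```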


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!