# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [140]:
def get_names(soup):
    try:
        names = []
        for h1 in soup.find_all('h1', class_='event-title'):
            names.append(h1.a.text.strip())
        return names
    except AttributeError as e:
        print(e)

In [141]:
get_names(events)

["Seth's Coming Home Tour",
 'Jacques Greene',
 'Transform: Odd Mobb',
 'AfterHours Anonymous presents: Maceo Plex',
 'Punjahbae',
 'Party Favor',
 'H.C.P x Miley Serious',
 'Archie Hamilton',
 'Destructo',
 "They're (not) There: Time Ravelers",
 'La Roux',
 "Tennyson's Taco Tuesday • Peer Review"]

In [146]:
def get_venues(soup):
    venues = []
    try:
        for h1 in soup.find_all('h1', class_='event-title'):
            if h1.span.a is None:
                venues.append('TBD')
            else:
                venues.append(h1.span.a.text)
        return venues
    except AttributeError as e:
        print(e)

In [147]:
get_venues(events)

['Vinyl',
 "Ophelia's Electric Soapbox",
 'Temple Nightclub',
 'TBD',
 'Larimer Lounge',
 'Temple Nightclub',
 'Rhinoceropolis',
 'Bar Standard',
 'Temple Nightclub',
 'TBD',
 'Gothic Theatre',
 "Tennyson's Tap"]

In [190]:
def get_dates(soup):
    dates = []
    try:
        for h1 in soup.find_all('h1', class_='event-title'):
            dates.append(h1.parent.parent.time.get('datetime'))
        return dates
    except AttributeError as e:
        print(e)

In [191]:
get_dates(events)

['2020-03-12T00:00',
 '2020-03-12T00:00',
 '2020-03-12T00:00',
 '2020-03-13T00:00',
 '2020-03-13T00:00',
 '2020-03-13T00:00',
 '2020-03-14T00:00',
 '2020-03-14T00:00',
 '2020-03-14T00:00',
 '2020-03-14T00:00',
 '2020-03-16T00:00',
 '2020-03-17T00:00']

In [204]:
def get_attends(soup):
    attends = []
    try:
        for p in soup.find_all('p', class_='attending'):
            if p.span.text:
                attends.append(int(p.span.text))
            else:
                attends.append(0)
        return attends
    except AttributeError as e:
        print(e)

In [201]:
get_attends(events)

[16, 3, 26, 3, 23, 4, 2, 1, 2, 1]
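
Note that `get_attends` returned 10 values while `get_names` returned 12, so building columns from separate lists can pair an attendee count with the wrong event. One way around this is to parse each event as a unit. The sketch below is a minimal illustration on a hand-written HTML sample: the markup, the event names, and the helper `parse_event` are assumptions inferred from the selectors used in the functions above, not the site's actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the selectors used in this lab
SAMPLE_HTML = """
<div id="event-listing">
  <article>
    <time datetime="2020-03-12T00:00"></time>
    <div>
      <h1 class="event-title"><a href="#">Event A</a> <span>at <a href="#">Venue A</a></span></h1>
      <p class="attending"><span>16</span> Attending</p>
    </div>
  </article>
  <article>
    <time datetime="2020-03-13T00:00"></time>
    <div>
      <h1 class="event-title"><a href="#">Event B</a> <span>at TBA</span></h1>
    </div>
  </article>
</div>
"""

def parse_event(article):
    """Extract all four fields from a single event container."""
    h1 = article.find('h1', class_='event-title')
    venue_link = h1.span.a if h1.span else None
    attending = article.find('p', class_='attending')
    has_count = attending is not None and attending.span is not None
    time_tag = article.find('time')
    return {
        'Event_Name': h1.a.text.strip(),
        'Venue': venue_link.text if venue_link is not None else 'TBD',
        'Event_Date': time_tag.get('datetime') if time_tag is not None else None,
        # A missing "attending" paragraph becomes 0 instead of being skipped
        'Number_of_Attendees': int(attending.span.text) if has_count else 0,
    }

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [parse_event(article) for article in soup.find_all('article')]
print(rows)
```

Because every field comes from the same container, an event without an attendee count still produces a full row, and the rows can be passed directly to `pd.DataFrame(rows)`.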

In [202]:
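def get_events_div(page_url):
    html = requests.get(page_url)
    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(html.content, 'html.parser')
    # The event listings live inside the div with id="event-listing"
    div = soup.find('div', id='event-listing')
    return div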

In [139]:
url = 'https://www.residentadvisor.net/events'
events = get_events_div(url)

In [195]:
def scrape_events(events_page_url):
    events = get_events_div(events_page_url)
    # One list per column; transpose so that each row describes one event
    data = [get_names(events), get_venues(events), get_dates(events), get_attends(events)]
    df = pd.DataFrame(data).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [216]:
scrape_events(next_page(url))

|  | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Christian Martin | Cervantes' Masterpiece Ballroom | 2020-03-19T00:00 | 1.0 |
| 1 | Elsewhere feat. Gettoblaster | Larimer Lounge | 2020-03-19T00:00 | 1.0 |
| 2 | Armnhmr | Bluebird Theater | 2020-03-19T00:00 | 27.0 |
| 3 | Prince Fox | Temple Nightclub | 2020-03-19T00:00 | 2.0 |
| 4 | [CANCELLED] Robag Wruhme & Leafar Legov: Below... | Cervantes' Masterpiece Ballroom | 2020-03-20T00:00 | 2.0 |
| 5 | We Should Remember This feat. Straight White T... | TBD | 2020-03-20T00:00 | 2.0 |
| 6 | Chris Stussy | Bar Standard | 2020-03-20T00:00 | 2.0 |
| 7 | Kristian Nairn | Temple Nightclub | 2020-03-20T00:00 | 1.0 |
| 8 | We Should Remember This feat. Straight White T... | TBD | 2020-03-21T00:00 | 1.0 |
| 9 | Classixx (DJ Set) Touch Sensitive (DJ Set) | Vinyl | 2020-03-21T00:00 |  |


## Write a Function to Retrieve the URL for the Next Page

In [213]:
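def next_page(url):
    # The first 38 characters are 'https://www.residentadvisor.net/events'
    base_url = url[:38]
    try:
        html = requests.get(url)
        soup = BeautifulSoup(html.content, 'html.parser')
        # find() returns None when the page has no "Next" link
        next_page_url = soup.find('a', attrs={'ga-event-action': 'Next '}).get('href')
        # Drop the first two '/'-separated segments of the href before re-attaching the base URL
        next_page_url = '/'.join(next_page_url.split('/')[2:])
        return base_url + '/' + next_page_url
    except AttributeError as e:
        # No "Next" link was found; the function implicitly returns None
        print(e)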

In [215]:
next_page('https://www.residentadvisor.net/events/us/colorado/week/2020-03-19')

'https://www.residentadvisor.net/events/us/colorado/week/2020-03-26'
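
The next-page URL can also be built without slicing the current URL to a fixed length. The sketch below uses the standard library's `urljoin` to resolve an `href` against the page it was found on. The helper name `resolve_next_url` is hypothetical, and the `href` value is an assumed format that should be checked against the real attribute in the inspector.

```python
from urllib.parse import urljoin

def resolve_next_url(current_url, href):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(current_url, href)

current = 'https://www.residentadvisor.net/events/us/colorado/week/2020-03-19'
# Assumed href format for the "Next" link
href = '/events/us/colorado/week/2020-03-26'

print(resolve_next_url(current, href))
# https://www.residentadvisor.net/events/us/colorado/week/2020-03-26
```

`urljoin` also leaves an already absolute `href` unchanged, so the same call works whichever form the link takes.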

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

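In [232]:
num_events = 0
url = 'https://www.residentadvisor.net/events'
cols = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
df = pd.DataFrame(columns=cols)

# Keep scraping until 1000 events are collected or the pages run out
while num_events < 1000:
    try:
        new_df = scrape_events(url)
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        df = pd.concat([df, new_df])
        url = next_page(url)
        num_events = df.shape[0]
    except Exception:
        # Raised once next_page has returned None and there is no page left to request
        print('err')
        break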

'NoneType' object has no attribute 'get'
err
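
The loop above stops only because an exception is raised after `next_page` returns `None`. A minimal sketch of an alternative is shown below: it treats `None` as the normal end of pagination and concatenates all pages once at the end. The function `collect_events` and the stub pages are hypothetical stand-ins so the sketch runs offline; in the lab, `scrape_events` and `next_page` would be passed in instead.

```python
import pandas as pd

def collect_events(start_url, scrape, get_next, max_events):
    """Follow 'next' links until max_events rows are collected or no page is left."""
    frames = []
    total = 0
    url = start_url
    while url is not None and total < max_events:
        page = scrape(url)
        frames.append(page)
        total += len(page)
        url = get_next(url)  # None signals that there is no further page
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Stub data standing in for real pages (illustrative only)
FAKE_PAGES = {
    'page-1': (['A', 'B'], 'page-2'),
    'page-2': (['C'], None),
}

def fake_scrape(url):
    return pd.DataFrame({'Event_Name': FAKE_PAGES[url][0]})

def fake_next(url):
    return FAKE_PAGES[url][1]

all_events = collect_events('page-1', fake_scrape, fake_next, max_events=1000)
print(all_events)
```

Passing `ignore_index=True` gives every row a unique index, which avoids the repeated index labels that appear when pages are concatenated with their original indices.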


In [236]:
df.sort_values(['Number_of_Attendees', 'Event_Date'], ascending=False).head()

|  | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 1 | All Day I Dream of the Mile High City | Sculpture Park | 2020-05-30T00:00 | 217 |
| 2 | Mersiv | Ogden Theatre | 2020-03-27T00:00 | 37 |
| 3 | Black Caviar | Temple Nightclub | 2020-03-27T00:00 | 28 |
| 2 | Armnhmr | Bluebird Theater | 2020-03-19T00:00 | 27 |
| 2 | Transform: Odd Mobb | Temple Nightclub | 2020-03-12T00:00 | 26 |
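
Sorting on two keys can use a separate direction for each key. The sketch below, on a small made-up frame, converts the attendee counts to numbers first (scraped values may arrive as strings or be missing), then sorts by attendees in descending order and breaks ties by the earlier date. The data and the choice of tie-break direction are illustrative assumptions; ISO-formatted date strings such as `2020-03-12T00:00` sort chronologically as plain text.

```python
import pandas as pd

# Illustrative data only
sample = pd.DataFrame({
    'Event_Name': ['A', 'B', 'C', 'D'],
    'Event_Date': ['2020-03-14T00:00', '2020-03-12T00:00',
                   '2020-03-13T00:00', '2020-03-12T00:00'],
    'Number_of_Attendees': ['3', '26', '3', None],
})

# Turn the counts into numbers; unparseable or missing values become NaN
sample['Number_of_Attendees'] = pd.to_numeric(
    sample['Number_of_Attendees'], errors='coerce'
)

# Most attended first; among ties, the earlier date first; NaN rows last
ranked = sample.sort_values(
    ['Number_of_Attendees', 'Event_Date'],
    ascending=[False, True],
    na_position='last',
)
print(ranked)
```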


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!