# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site!
In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [2]:
# Open the inspect element feature in your browser

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a pandas DataFrame with the columns Event_Name, Venue, Event_Date, and Number_of_Attendees.

In [3]:
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import numpy as np

In [19]:
def scrape_events(events_page_url):
    """Return a DataFrame of the events listed on one events page."""
    response = requests.get(events_page_url)
    soup = BS(response.content, 'html.parser')
    events = soup.find('div', id='event-listing')
    listings = events.find_all('li')
    rows = []
    current_date = None  # most recent date header seen while walking the list

    for entry in listings:
        date = entry.find('p', class_='date')
        event = entry.find('article', class_='event-item')
        if event:
            event_name = event.find('h1').find('a').text
            venue = event.find('h1').find('span').contents[1].text
            try:
                attending = int(event.find('p', class_='attending').find('span').text)
            except (AttributeError, ValueError):
                # No attendee count is listed, or it is not a plain integer
                attending = np.nan

            rows.append([event_name, venue, current_date, attending])
        elif date:
            # Date headers are separate list items that precede their events
            current_date = date.text

    df = pd.DataFrame(rows, columns=["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"])

    return df

scrape_events('https://www.residentadvisor.net/events')

|   | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Clinic with Sebastian Mullaert (Circle of Live) | The Sayers Club | Wed, 08 Apr 2020 / | 7.0 |
| 1 | CCVO Thursdays with Kerim Bey & Joe Dismal | Underground SF | Thu, 09 Apr 2020 / | 1.0 |
| 2 | Postponed - Damian Lazarus (Crosstown Rebels) | Public Works | Thu, 09 Apr 2020 / | 22.0 |
| 3 | [CANCELLED] DJ Koze & Floating Points | 1015 Folsom | Fri, 10 Apr 2020 / | 37.0 |
| 4 | Cancelled | F8 1192 Folsom | Fri, 10 Apr 2020 / | 22.0 |
| 5 | Desert Dream feat. Walker & Royce, Will Clarke... | Equl Estate | Fri, 10 Apr 2020 / | 2.0 |
| 6 | Canceled - Om Unit, The Librarian, J:Kenzo by ... | Public Works | Fri, 10 Apr 2020 / | 2.0 |
| 7 | [POSTPONED] Coachella 2020 | Empire Polo Club | Fri, 10 Apr 2020 / | 23.0 |
| 8 | DUBLAB x Doom Trip present: The Doom Mix IV Re... | TBA - Downtown LA | Fri, 10 Apr 2020 / | 2.0 |
| 9 | Rich Medina presents Home | Resident | Fri, 10 Apr 2020 / | 1.0 |
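
The parsing logic can also be exercised without hitting the network. The sketch below is a minimal illustration, not the lab's reference solution: `SAMPLE_HTML` and `parse_events` are hypothetical names, and the markup only mimics the structure the scraper assumes (date headers and events as sibling list items inside `div#event-listing`). The live page may be laid out differently.

```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

# Hypothetical markup that imitates the assumed page structure.
SAMPLE_HTML = """
<div id="event-listing">
  <ul>
    <li><p class="date">Fri, 10 Apr 2020 /</p></li>
    <li>
      <article class="event-item">
        <h1><a href="#">Event A</a> <span>at <a href="#">Venue A</a></span></h1>
        <p class="attending"><span>12</span> Attending</p>
      </article>
    </li>
    <li>
      <article class="event-item">
        <h1><a href="#">Event B</a> <span>at <a href="#">Venue B</a></span></h1>
      </article>
    </li>
  </ul>
</div>
"""


def parse_events(html):
    """Turn one page of listing HTML into a DataFrame of events."""
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find("div", id="event-listing")
    rows = []
    current_date = None  # last date header seen
    for entry in listing.find_all("li"):
        date = entry.find("p", class_="date")
        event = entry.find("article", class_="event-item")
        if event is not None:
            title = event.find("h1")
            name = title.find("a").get_text(strip=True)
            venue = title.find("span").find("a").get_text(strip=True)
            attending_tag = event.find("p", class_="attending")
            count_tag = attending_tag.find("span") if attending_tag is not None else None
            try:
                attending = int(count_tag.get_text(strip=True))
            except (AttributeError, ValueError):
                attending = np.nan  # count missing or not an integer
            rows.append([name, venue, current_date, attending])
        elif date is not None:
            current_date = date.get_text(strip=True)
    return pd.DataFrame(
        rows, columns=["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    )


sample_df = parse_events(SAMPLE_HTML)
print(sample_df)
```

Separating the download from the parsing like this makes it easier to see which step fails when the site's markup changes.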


In [9]:
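# Explore the page structure interactively before writing the function.
# soup has to be created here because it only exists inside scrape_events.
html = requests.get('https://www.residentadvisor.net/events')
soup = BS(html.content, 'html.parser')
events = soup.find('div', id='event-listing')
listings = events.find_all('li')
# The first list item holds a date header
listings[0].find('p', class_='date').text
# The next list item holds an event; the venue is the second child of the <span> in its <h1>
listings[1].find('article', class_='event-item')
listings[1].find('h1').find('span').contents[1].text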

## Write a Function to Retrieve the URL for the Next Page

In [10]:
def next_page(url):
    #Your code here
    html = requests.get(url)
    soup = BS(html.content, 'html.parser')
    next_end = soup.find('li', id='liNext2').find('a').attrs['href']
    next_page_url = 'https://www.residentadvisor.net' + next_end
    return next_page_url

next_page('https://www.residentadvisor.net/events/us/california/week/2020-04-15')

'https://www.residentadvisor.net/events/us/california/week/2020-04-22'
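
One possible hardening of this step is to return `None` when no "next" link is present instead of raising an `AttributeError`. The sketch below assumes the same `li#liNext2` element used above. `extract_next_url` and the sample markup are hypothetical, and it takes HTML text rather than a URL so it can be tried offline.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.residentadvisor.net"


def extract_next_url(html, base_url=BASE_URL):
    """Return the absolute URL of the next page, or None if no link is found."""
    soup = BeautifulSoup(html, "html.parser")
    item = soup.find("li", id="liNext2")
    link = item.find("a", href=True) if item is not None else None
    if link is None:
        return None
    # urljoin handles both relative and absolute href values
    return urljoin(base_url, link["href"])


# Hypothetical markup for illustration only
sample = '<ul><li id="liNext2"><a href="/events/us/california/week/2020-04-22">Next</a></li></ul>'
next_url = extract_next_url(sample)
missing = extract_next_url("<ul></ul>")
print(next_url)
print(missing)
```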

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [23]:
# Collect one DataFrame per page, then concatenate them once at the end.
dfs_to_group = []
length_of_df = 0
iterations = 0
web_page = 'https://www.residentadvisor.net/events/us/california'

# Stop once more than 170 events have been collected
while length_of_df <= 170:
    df_toadd = scrape_events(web_page)
    dfs_to_group.append(df_toadd)
    web_page = next_page(web_page)
    length_of_df += len(df_toadd)
    iterations += 1
    print(iterations, " ", length_of_df)

df_final = pd.concat(dfs_to_group, ignore_index=True)
print('COMPLETE')

1   28
2   54
3   73
4   85
5   94
6   110
7   116
8   121
9   127
10   135
11   139
12   146
13   146
14   148
15   149
16   153
17   155
18   155
19   155
20   155
21   156
22   156
23   157
24   158
25   159
26   161
27   165
28   166
29   167
30   168
31   170
32   171
COMPLETE
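
The loop above relies on a hand-picked threshold. A hedged alternative is to stop on whichever comes first: the target row count, a page cap, or a missing "next" link. In the sketch below, `collect_pages`, its parameters, and `fake_pages` are all hypothetical. The scraping and next-page functions are passed in as arguments so the control flow can be tested without any network access.

```python
import time


def collect_pages(start_url, scrape, get_next, target_rows=1000, max_pages=50, delay=0.0):
    """Follow 'next' links until enough rows are gathered or pages run out."""
    frames, total, url, seen = [], 0, start_url, set()
    while url and url not in seen and len(frames) < max_pages and total < target_rows:
        seen.add(url)          # guard against a page that links back to itself
        frame = scrape(url)
        frames.append(frame)
        total += len(frame)
        url = get_next(url)    # None ends the loop
        time.sleep(delay)      # optional pause between requests
    return frames, total


# Stand-in data: each "page" maps to (rows, next page)
fake_pages = {"p1": (["a", "b"], "p2"), "p2": (["c"], "p3"), "p3": ([], None)}
frames, total = collect_pages(
    "p1",
    scrape=lambda u: fake_pages[u][0],
    get_next=lambda u: fake_pages[u][1],
    target_rows=10,
)
print(len(frames), total)
```

With real DataFrames returned by the scraping function, `pd.concat(frames, ignore_index=True)` would combine the pages into a single table.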


In [27]:
df_final.sort_values('Number_of_Attendees', ascending = False)

|   | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 137 | All Day I Dream of Golden Days | Golden Gate Park | Sun, 21 Jun 2020 / | 722.0 |
| 136 | All Day I Dream of Golden Days | Golden Gate Park | Sat, 20 Jun 2020 / | 722.0 |
| 120 | Paradise in the Park | Pershing Square | Sun, 31 May 2020 / | 511.0 |
| 117 | Paradise in the Park | Pershing Square | Sat, 30 May 2020 / | 511.0 |
| 80 | [POSTPONED] Rumors Los Angeles Block Party 2020 | Gin Ling Way, Chinatown | Sat, 02 May 2020 / | 393.0 |
| ... | ... | ... | ... | ... |
| 86 | Boshke Beats XX Years Anniversary Tour | AC Lounge | Thu, 07 May 2020 / | NaN |
| 151 | FNGRS CRSSD Pres: OTR | Bang Bang | Sat, 25 Jul 2020 / | NaN |
| 153 | Sequence Feat. Gentlemen's Club & Dirty Snatcha | DNA Lounge | Thu, 30 Jul 2020 / | NaN |
| 168 | Cancelled - Mardeleva Live in San Francisco | 906 World Cultural Center | Sat, 07 Nov 2020 / | NaN |
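
The task also asks for ties in attendance to be ordered by event date. Because `Event_Date` is scraped as text such as `"Sat, 20 Jun 2020 /"`, sorting on it directly would compare strings rather than dates. The sketch below is one way to handle that, assuming every date follows that pattern. The `Parsed_Date` column and the four sample rows are illustrative, not scraped data.

```python
import numpy as np
import pandas as pd

# Illustrative rows in the same shape as the scraped table
df = pd.DataFrame({
    "Event_Name": ["A", "B", "C", "D"],
    "Event_Date": [
        "Sun, 21 Jun 2020 /",
        "Sat, 20 Jun 2020 /",
        "Sat, 30 May 2020 /",
        "Thu, 07 May 2020 /",
    ],
    "Number_of_Attendees": [722.0, 722.0, 511.0, np.nan],
})

# Strip the trailing " /" and parse the remainder as a real date
df["Parsed_Date"] = pd.to_datetime(
    df["Event_Date"].str.rstrip(" /"), format="%a, %d %b %Y"
)

# Most attended first; ties broken by the earlier date; missing counts last
ranked = df.sort_values(
    ["Number_of_Attendees", "Parsed_Date"],
    ascending=[False, True],
    na_position="last",
)
print(ranked[["Event_Name", "Event_Date", "Number_of_Attendees"]])
```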


In [28]:
# The results are limited to about 170 rows because fewer than 1000 upcoming events were listed.
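
The objectives mention storing the data, so here is a minimal sketch of one way to do that with CSV. It writes to an in-memory buffer so it runs anywhere. In practice a file path such as `'events.csv'` (a hypothetical name) could be passed to `to_csv` instead. The single sample row is made up for illustration.

```python
import io
import pandas as pd

events = pd.DataFrame(
    [["Example Event", "Example Venue", "Fri, 10 Apr 2020 /", 12.0]],
    columns=["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"],
)

buffer = io.StringIO()
# index=False avoids an extra unnamed index column when the file is read back
events.to_csv(buffer, index=False)

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```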

## Summary

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!