# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to apply those skills to a full-fledged site!
In this lab, you'll practice scraping on a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open your browser's inspect element feature to preview the HTML underlying the page.

In [1]:
# Open the inspect element feature in your browser
import pandas as pd
from bs4 import BeautifulSoup
import requests

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [77]:
def scrape_events(events_page_url):
    """Scrape one events page into a DataFrame with one row per event."""
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    content_div = soup.find('div', class_="content clearfix")
    event_listing = content_div.find('div', id="event-listing")
    # Event name: the link's title attribute minus its fixed 17-character prefix
    event_names = [h1.find('a').attrs['title'][17:]
                   for h1 in event_listing.find_all('h1', class_="event-title")]
    # Venue: first <span> in each "bbox" div minus its 3-character prefix
    venue_names = [div.find('span').text.strip()[3:]
                   for div in event_listing.find_all('div', class_="bbox")]
    # Date: text of the hidden <time> element minus its last 6 characters
    event_dates = [span.find('time').text.strip()[:-6]
                   for span in event_listing.find_all('span', style="display:none;")]
    # Attendee count: text of the <span> inside each "attending" paragraph
    num_attendees = [p.find('span').text.strip()
                     for p in event_listing.find_all('p', class_='attending')]
    df = pd.DataFrame([event_names, venue_names, event_dates, num_attendees]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df
# scrape_events('https://www.residentadvisor.net/events/us/newyork/week/2020-04-07')

Unnamed: 0,Event_Name,Venue,Event_Date,Number_of_Attendees
0,"Virtual Tuesday: Skillshare, Jadalareign, and ...",Nowadays,2020-04-07,1.0
1,Virtual Wednesday: Planetarium with DJ Python ...,Nowadays,2020-04-08,18.0
2,Passover Seders - NYC Kosher Restaurants - Tal...,Talia's Steakhouse,2020-04-08,4.0
3,Arthur Moon,public records,2020-04-08,1.0
4,Caspian with Pianos Become The Teeth and Maserati,Le Poisson Rouge,2020-04-08,5.0
5,Postponed / Open Decks Session 107,Eris,2020-04-08,4.0
6,Seven Davis Jr. presents 'Friends',Chelsea Music Hall,2020-04-09,4.0
7,Passover Seders - NYC Kosher Restaurants - Tal...,Talia's Steakhouse,2020-04-09,1.0
8,"Virtual Thursday: Seltzer with Bearcat, Precol...",Nowadays,2020-04-09,3.0
9,Anything Goes: Demuir,1 Oak,2020-04-09,309.0


## Write a Function to Retrieve the URL for the Next Page

In [44]:
# Alternative: build the weekly URLs by hand ("URL hacking"). That requires tracking the
# current day of the month in a variable such as todayDate, and may break past a one-year span.
def next_page(url):
    """Return the relative URL of the next events page, or None if there is no 'next' link."""
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    next_button = soup.find('li', id='liNext')
    if next_button:
        return next_button.find('a').attrs['href']
    return None
next_page('https://www.residentadvisor.net/events/us/newyork/week/2020-04-14')

'/events/us/newyork/week/2020-04-21'
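
The value returned above is a relative path, so it has to be joined to the site root before it can be requested. A minimal sketch using the standard library's `urljoin`; the helper name `absolute_url` and the `BASE_URL` constant are illustrative assumptions, not part of the lab:

```python
from urllib.parse import urljoin

# Assumed site root; the lab's URLs all start with this prefix.
BASE_URL = 'https://www.residentadvisor.net'

def absolute_url(href, base=BASE_URL):
    """Resolve a relative href against the site root; pass None through unchanged."""
    if href is None:
        return None
    return urljoin(base, href)

print(absolute_url('/events/us/newyork/week/2020-04-21'))
# https://www.residentadvisor.net/events/us/newyork/week/2020-04-21
```

Passing `None` through unchanged lets the caller treat "no next page" as a normal stopping condition rather than an error.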

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.
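
The crawl this task calls for (scrape a page, follow the "next" link, stop once enough events are collected) can be sketched independently of any particular site. The function `crawl` and its `scrape_page` and `find_next` parameters are hypothetical names used only for illustration; catching every exception and skipping the page is one possible error-handling policy, not the only reasonable one.

```python
import time

def crawl(start_url, scrape_page, find_next, target_rows, delay=0.2):
    """Collect rows page by page until target_rows is reached or pages run out.

    scrape_page(url) returns a list of rows; find_next(url) returns the next
    URL or None. Both are caller-supplied (hypothetical) callables.
    """
    rows = []
    url = start_url
    while url is not None and len(rows) < target_rows:
        try:
            rows.extend(scrape_page(url))
        except Exception as err:  # e.g. a network error or an unexpected page layout
            print(f"Skipping {url}: {err}")
        url = find_next(url)
        if url is not None:
            time.sleep(delay)  # pause between requests to go easy on the server
    return rows[:target_rows]
```

In this lab, `scrape_page` and `find_next` would be thin wrappers around `scrape_events` and `next_page`.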

In [112]:
#Your code here
import time

BASE_URL = 'https://www.residentadvisor.net'

def parseData(iterations):
    """Scrape successive events pages until more than `iterations` rows are collected.

    Returns the total number of rows scraped.
    """
    frames = []
    total_rows = 0
    url = BASE_URL + '/events'
    # Stop once the row count exceeds `iterations` or there is no next page
    while url is not None and total_rows <= iterations:
        df = scrape_events(url)
        frames.append(df)
        total_rows += len(df)
        next_url = next_page(url)  # relative path, or None on the last page
        url = BASE_URL + next_url if next_url else None
        time.sleep(.2)  # brief pause between pages to avoid hammering the server
    return len(pd.concat(frames))

parseData(200)
    

201
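
The sort order the task asks for can be sketched with `sort_values`. The rows below are copied from the `scrape_events` output shown earlier. Writing the attendee counts as strings (to mimic raw scraped text) and ranking the most-attended events first are both assumptions made for this example.

```python
import pandas as pd

# Sample rows from the scrape_events output; counts are strings here (assumption)
events = pd.DataFrame({
    "Event_Name": ["Arthur Moon",
                   "Caspian with Pianos Become The Teeth and Maserati",
                   "Postponed / Open Decks Session 107",
                   "Seven Davis Jr. presents 'Friends'",
                   "Anything Goes: Demuir"],
    "Venue": ["public records", "Le Poisson Rouge", "Eris", "Chelsea Music Hall", "1 Oak"],
    "Event_Date": ["2020-04-08", "2020-04-08", "2020-04-08", "2020-04-09", "2020-04-09"],
    "Number_of_Attendees": ["1", "5", "4", "4", "309"],
})

# Convert text to real numbers and dates so the sort is numeric, not alphabetical
events["Number_of_Attendees"] = pd.to_numeric(events["Number_of_Attendees"], errors="coerce")
events["Event_Date"] = pd.to_datetime(events["Event_Date"])

# Most attended first; ties on attendance are broken by the earlier date
ranked = events.sort_values(
    by=["Number_of_Attendees", "Event_Date"], ascending=[False, True]
).reset_index(drop=True)
print(ranked)
```

Without the numeric conversion, the string "5" would sort above "309", which is why the counts are converted before sorting.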


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!