# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site.
In this lab, you'll scrape a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that traverses many pages of a website, handles errors, and stores the data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
# Load the https://www.residentadvisor.net/events page in your browser.
from bs4 import BeautifulSoup
from datetime import datetime as dt
import pandas as pd
import requests

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [4]:
# Testing scrape
url = 'https://www.residentadvisor.net/events/us/newyork'
#url = 'https://www.residentadvisor.net/events/us/virginia/month/2020-01-04'
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
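
Since the lab's objectives mention dealing with errors, it may help to check the HTTP response before parsing it. The following is a minimal sketch, not part of the original notebook: `fetch_soup` is a hypothetical helper name, and the `get` parameter exists only so the function can be exercised without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Download a page and parse it, raising on HTTP errors.

    `fetch_soup` and its `get` parameter are illustrative names,
    not part of the original lab.
    """
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')


# Example call (requires network access):
# soup = fetch_soup('https://www.residentadvisor.net/events/us/newyork')
```

Passing a `timeout` keeps a stalled connection from hanging the whole scrape.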

In [5]:
# All the information about events is contained in this section:
events_container = soup.find('ul', {'id':"items"})
print(events_container.prettify())

<ul class="list" id="items">
 <li>
  <p class="eventDate date">
   <a href="/events.aspx?ai=8&amp;v=day&amp;mn=1&amp;yr=2020&amp;dy=5">
    <span>
     Sun, 05 Jan 2020 /
    </span>
   </a>
  </p>
 </li>
 <li class="">
  <article class="event-item clearfix tickets-bkg-logo" itemscope="" itemtype="http://data-vocabulary.org/Event">
   <a href="/events/1363143#tickets">
    <img class="nohide" src="https://residentadvisor.net/images/ra-tix.png" style="height: 23px; width: 40px; right: 0px; position: absolute; top: 1px;"/>
   </a>
   <span style="display:none;">
    <time datetime="2020-01-05T00:00" itemprop="startDate">
     2020-01-05T00:00
    </time>
   </span>
   <a href="/events/1363143">
    <img height="76" src="/images/events/flyer/2020/1/us-0105-1363143-list.jpg" width="152"/>
   </a>
   <div class="bbox">
    <h1 class="event-title" itemprop="summary">
     <a href="/events/1363143" itemprop="url" title="Event details of Weird Science n.17 with Ron Morelli, L.Sangre, Maroje T"

In [6]:
# Test block for the parsing logic. Venues needed special consideration because their
# containing elements were not consistent between pages, so the venue is located by its
# position among the children of the parent element, which is the h1 header.

dates = [dt.strptime(d.text.strip()[:10],'%Y-%m-%d').strftime("%A %B %d, %Y") for d in events_container.find_all('time')]
titles = [list(t.children)[0].text.strip() for t in events_container.find_all('h1', {'class': "event-title"})]
venues = [list(v.children)[2].text.strip()[3:] for v in events_container.find_all('h1', {'class': "event-title"})]
attends = [int(a.text.strip().split()[0]) for a in events_container.find_all('p', {'class': "attending"})]

print(dates,titles, venues, attends)

['Sunday January 05, 2020', 'Sunday January 05, 2020', 'Sunday January 05, 2020', 'Sunday January 05, 2020', 'Sunday January 05, 2020', 'Sunday January 05, 2020', 'Monday January 06, 2020', 'Monday January 06, 2020', 'Monday January 06, 2020', 'Monday January 06, 2020', 'Tuesday January 07, 2020', 'Tuesday January 07, 2020', 'Tuesday January 07, 2020', 'Tuesday January 07, 2020', 'Tuesday January 07, 2020', 'Wednesday January 08, 2020', 'Wednesday January 08, 2020', 'Wednesday January 08, 2020', 'Thursday January 09, 2020', 'Thursday January 09, 2020', 'Thursday January 09, 2020', 'Thursday January 09, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Friday January 10, 2020', 'Frid
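
The four lists above are built independently, so they only line up if every event has every field. One alternative is to parse one event at a time, as in the sketch below. It assumes a simplified, hypothetical HTML snippet modeled on the output shown earlier; the real page structure may differ. `parse_event` and `SAMPLE_HTML` are illustrative names.

```python
from datetime import datetime as dt
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the second event has no "attending" paragraph.
SAMPLE_HTML = """
<ul id="items">
  <li><article class="event-item">
    <time datetime="2020-01-05T00:00">2020-01-05T00:00</time>
    <h1 class="event-title"><a href="/events/1">First Event</a> <span>at Venue One</span></h1>
    <p class="attending"><span>49</span> Attending</p>
  </article></li>
  <li><article class="event-item">
    <time datetime="2020-01-06T00:00">2020-01-06T00:00</time>
    <h1 class="event-title"><a href="/events/2">Second Event</a> <span>at Venue Two</span></h1>
  </article></li>
</ul>
"""


def parse_event(article):
    """Extract one event's fields from its <article> element."""
    title_children = list(article.find('h1', {'class': 'event-title'}).children)
    attending_tag = article.find('p', {'class': 'attending'})
    raw_date = article.find('time').text.strip()[:10]
    return {
        'Event_Name': title_children[0].text.strip(),
        # Same positional lookup as above: third child, minus the leading "at ".
        'Venue': title_children[2].text.strip()[3:],
        'Event_Date': dt.strptime(raw_date, '%Y-%m-%d').strftime('%A %B %d, %Y'),
        # None marks a missing count instead of shifting later rows.
        'Number_of_Attendees': int(attending_tag.text.strip().split()[0]) if attending_tag else None,
    }


container = BeautifulSoup(SAMPLE_HTML, 'html.parser').find('ul', {'id': 'items'})
rows = [parse_event(a) for a in container.find_all('article', {'class': 'event-item'})]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, and an event without an attendee count keeps its own row.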

## Write a Function to Scrape All of the Events on a Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [7]:
def scrape_events(events_page_url):
    # Get the HTML and build the soup object:
    html = requests.get(events_page_url)
    soup = BeautifulSoup(html.content, 'html.parser')

    # Select the container that holds every event on the page:
    events_container = soup.find('ul', {'id': "items"})

    # Parse the info for each event into parallel lists:
    dates = [dt.strptime(d.text.strip()[:10],'%Y-%m-%d').strftime("%A %B %d, %Y") for d in events_container.find_all('time')]
    titles = [list(t.children)[0].text.strip() for t in events_container.find_all('h1', {'class': "event-title"})]
    venues = [list(v.children)[2].text.strip()[3:] for v in events_container.find_all('h1', {'class': "event-title"})]
    attends = [int(a.text.strip().split()[0]) for a in events_container.find_all('p', {'class': "attending"})]

    # Combine the lists into one DataFrame, one row per event:
    info_list = [titles, dates, attends, venues]
    df = pd.DataFrame(info_list).transpose()
    df.columns = ["Event_Name", "Event_Date", "Number_of_Attendees", "Venue"]
    return df

In [8]:
scrape_events(url)

|   | Event_Name | Event_Date | Number_of_Attendees | Venue |
|---|---|---|---|---|
| 0 | Weird Science n.17 with Ron Morelli, L.Sangre,... | Sunday January 05, 2020 | 49.0 | Magick City |
| 1 | Taj Lounge NYC Hip Hop vs. Reggae™ Sunday Fund... | Sunday January 05, 2020 | 1.0 | Taj Lounge |
| 2 | Zephyr Ann Bday with Paul Nickerson, Ivan Mone... | Sunday January 05, 2020 | 8.0 | public records |
| 3 | Sunday Soiree: Tony Paniro and HŸPNØTÏX | Sunday January 05, 2020 | 8.0 | TBA Brooklyn |
| 4 | The Office presents: Anti-Social Takeover | Sunday January 05, 2020 | 4.0 | TBA - Brooklyn |
| 5 | Cotton / Motiv-A / Extol / January | Sunday January 05, 2020 | 3.0 | Bossa Nova Civic Club |
| 6 | The Office presents: Shhh Music By: Mike Nervous | Monday January 06, 2020 | 1.0 | TBA - Brooklyn |
| 7 | Fermented Frequency | Monday January 06, 2020 | 1.0 | Bossa Nova Civic Club |
| 8 | Industry Night with Special Guests | Monday January 06, 2020 | 8.0 | Rose Gold |
| 9 | Jamais Vu presents Sheepshead, Feverdream, Ser... | Monday January 06, 2020 | 7.0 | Jupiter Disco |


## Write a Function to Retrieve the URL for the Next Page

In [9]:
def next_page(url):
    # Get the HTML and build the soup object:
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser')

    # Find the "next" button; return None when there is no next page:
    url_container = soup.find('li', {'id': "liNext", 'class': "but arrow-right right"})
    if url_container is None:
        return None
    link = url_container.find('a')
    if link is None:
        return None

    # The href is relative, so prepend the site root:
    url_suffix = link.attrs['href']
    next_page_url = 'https://residentadvisor.net{}'.format(url_suffix)

    return next_page_url

In [10]:
#Test to see if there are errors when last page is reached:
next_page('https://www.residentadvisor.net/events/us/virginia/week/2020-01-11')
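
As an alternative to formatting the site root into a string by hand, the standard library's `urljoin` can resolve a relative `href` against the page it was found on. This is a small sketch: `absolute_url` is a hypothetical helper, and the `href` value below is a made-up example path, not one taken from the site.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve a (possibly relative) href against the page it came from."""
    return urljoin(page_url, href)


page = 'https://www.residentadvisor.net/events/us/newyork'
# Hypothetical relative link, used only for illustration:
print(absolute_url(page, '/events/us/newyork/week/2020-01-12'))
```

Because the result reuses the scheme and host of `page`, the `www.` prefix stays consistent from one request to the next.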

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [12]:
# Fewer than 1000 events are listed, so scrape everything that is available.
next_url = 'https://residentadvisor.net/events/us/newyork'
df = pd.DataFrame()

while next_url:
    temp_df = scrape_events(next_url)
    df = pd.concat([df, temp_df], axis=0)
    next_url = next_page(next_url)

df.head(20)

|   | Event_Name | Event_Date | Number_of_Attendees | Venue |
|---|---|---|---|---|
| 0 | Weird Science n.17 with Ron Morelli, L.Sangre,... | Sunday January 05, 2020 | 49 | Magick City |
| 1 | Taj Lounge NYC Hip Hop vs. Reggae™ Sunday Fund... | Sunday January 05, 2020 | 1 | Taj Lounge |
| 2 | Zephyr Ann Bday with Paul Nickerson, Ivan Mone... | Sunday January 05, 2020 | 8 | public records |
| 3 | Sunday Soiree: Tony Paniro and HŸPNØTÏX | Sunday January 05, 2020 | 8 | TBA Brooklyn |
| 4 | The Office presents: Anti-Social Takeover | Sunday January 05, 2020 | 4 | TBA - Brooklyn |
| 5 | Cotton / Motiv-A / Extol / January | Sunday January 05, 2020 | 3 | Bossa Nova Civic Club |
| 6 | The Office presents: Shhh Music By: Mike Nervous | Monday January 06, 2020 | 1 | TBA - Brooklyn |
| 7 | Fermented Frequency | Monday January 06, 2020 | 1 | Bossa Nova Civic Club |
| 8 | Industry Night with Special Guests | Monday January 06, 2020 | 8 | Rose Gold |
| 9 | Jamais Vu presents Sheepshead, Feverdream, Ser... | Monday January 06, 2020 | 7 | Jupiter Disco |
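
A crawl over many pages can also be written so that one bad page does not stop the run, and so that the frames are concatenated once at the end with a fresh index. The sketch below is one possible shape under stated assumptions: `crawl_events`, `max_pages`, and `delay_seconds` are hypothetical names, and the two callables stand in for functions like `scrape_events` and `next_page`.

```python
import time

import pandas as pd


def crawl_events(start_url, scrape_page, get_next_url, max_pages=50, delay_seconds=1.0):
    """Follow "next" links from start_url, collecting one DataFrame per page.

    Returns (combined_frame, failed), where failed lists (url, error) pairs.
    """
    frames, failed, seen = [], [], set()
    url = start_url
    # Stop on a missing next link, a repeated URL, or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        try:
            frames.append(scrape_page(url))
        except (AttributeError, ValueError, IndexError) as err:
            # Typical parsing failures: missing container, bad number, short list.
            failed.append((url, repr(err)))
        url = get_next_url(url)
        if url:
            time.sleep(delay_seconds)  # pause between requests
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed


# Example call (requires network access and the functions defined earlier):
# df, failed = crawl_events('https://residentadvisor.net/events/us/newyork',
#                           scrape_events, next_page)
```

Collecting the frames in a list and concatenating once avoids copying the growing DataFrame on every iteration, and `ignore_index=True` gives each event a unique row label.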


In [13]:
print(len(df))

284
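
The sort requested at the start of this section could be sketched as follows. Because `Event_Date` is stored as text such as "Sunday January 05, 2020", sorting it directly would order rows alphabetically by weekday, so the sketch parses it into real dates first. It runs on a small made-up frame, and it assumes that "sorted by the number of attendees" means largest first with ties broken by earliest date; `sort_events` is a hypothetical helper name.

```python
import pandas as pd

# Small made-up frame with the same columns as the scraped data.
events = pd.DataFrame({
    "Event_Name": ["A", "B", "C"],
    "Event_Date": ["Monday January 06, 2020", "Sunday January 05, 2020", "Sunday January 05, 2020"],
    "Number_of_Attendees": [8, 8, 49],
    "Venue": ["Venue A", "Venue B", "Venue C"],
})


def sort_events(df):
    """Sort by attendees (descending), breaking ties by event date (ascending)."""
    out = df.copy()
    out["Number_of_Attendees"] = pd.to_numeric(out["Number_of_Attendees"], errors="coerce")
    # Temporary datetime column so ties are ordered chronologically.
    out["_date"] = pd.to_datetime(out["Event_Date"], format="%A %B %d, %Y")
    out = out.sort_values(["Number_of_Attendees", "_date"], ascending=[False, True])
    return out.drop(columns="_date").reset_index(drop=True)


sorted_events = sort_events(events)
print(sorted_events)
```

Event C comes first with 49 attendees, and the tie between A and B at 8 attendees is resolved in favor of B, which falls on the earlier date.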


## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!