# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills again on a full-fledged site!
In this lab, you'll scrape a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Create a full scraping pipeline that involves traversing over many pages of a website, dealing with errors and storing data

## View the Website

For this lab, you'll be scraping the https://www.residentadvisor.net website. Start by navigating to the events page [here](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [None]:
# Load the https://www.residentadvisor.net/events page in your browser.

## Open the Inspect Element Feature

Next, open the inspect element feature from your web browser in order to preview the underlying HTML associated with the page.

In [None]:
# Open the inspect element feature in your browser

#### Neha:
Importing all necessary libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import shutil
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from IPython.display import Image, HTML

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [3]:
html_page = requests.get('https://www.residentadvisor.net/events/us/newyork')
soup = BeautifulSoup(html_page.content, 'html.parser')

In [42]:
main_container = soup.find('ul', id='items')

In [43]:
events_container = main_container.findAll('article', class_="event-item clearfix tickets-bkg-logo")
events_container[0].find('time')

<time datetime="2020-05-15T00:00" itemprop="startDate">2020-05-15T00:00</time>

In [158]:
event_date = events_container[8].find('time').attrs['datetime'][:-6]
event_date

'2020-05-17'
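
Slicing with `[:-6]` works only while the `datetime` attribute always ends in a `THH:MM` suffix. As a minimal sketch, assuming the attribute stays in the ISO-like form shown above, the value could be parsed instead of sliced. The helper name `parse_event_date` is hypothetical and not part of the lab.

```python
from datetime import datetime

def parse_event_date(datetime_attr):
    """Return the calendar date (YYYY-MM-DD) from a value like '2020-05-15T00:00'."""
    # fromisoformat raises ValueError on an unexpected format,
    # which surfaces a markup change instead of silently mis-slicing.
    return datetime.fromisoformat(datetime_attr).date().isoformat()

print(parse_event_date("2020-05-17T00:00"))
```

A parsed date also sorts correctly as a string, which helps when ordering events later in the lab.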

In [59]:
event_name = events_container[0].find('a', itemprop="url").attrs['title']
event_name

'Event details of Spacecraft'
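
The `title` attribute carries an "Event details of " prefix that is not part of the event name. One possible way to drop it is sketched below; `clean_event_name` is a hypothetical helper, and the prefix is assumed to be the same on every listing, as it is in the sample above.

```python
TITLE_PREFIX = "Event details of "

def clean_event_name(title):
    """Strip the link-title prefix if present; otherwise return the title unchanged."""
    if title.startswith(TITLE_PREFIX):
        return title[len(TITLE_PREFIX):]
    return title

print(clean_event_name("Event details of Spacecraft"))
```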

In [123]:
event_venue = events_container[0].find('span', class_="grey").text
event_venue

'TBA - New York'

In [132]:
events_container[1].find('h1', class_="event-title").find('span').find('a').text

'Skyport Marina'

In [64]:
attendees = events_container[0].find('p', class_="attending").find('span').text
attendees

'2'
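
The attendee count comes back as a string, so it needs converting before it can be sorted numerically. The sketch below is one way to do that defensively. `parse_attendees` is a hypothetical helper, and the handling of surrounding words or thousands separators is an assumption, since the sample above is just `'2'`.

```python
import re

def parse_attendees(text):
    """Return the first integer found in the text, or None if there is none."""
    # Assumption: counts may appear with thousands separators such as "1,234".
    match = re.search(r"\d+", text.replace(",", ""))
    return int(match.group()) if match else None

print(parse_attendees("2"))
```

Returning `None` for a missing count keeps a listing without attendance data from raising an error partway through a page.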

In [152]:
events_container[6]

<article class="event-item clearfix tickets-bkg-logo" itemscope="" itemtype="http://data-vocabulary.org/Event"><a href="/events/1399517#tickets"><img class="nohide" src="https://residentadvisor.net/images/ra-tix.png" style="height: 23px; width: 40px; right: 0px; position: absolute; top: 1px;"/></a><span style="display:none;"><time datetime="2020-05-16T00:00" itemprop="startDate">2020-05-16T00:00</time></span><a href="/events/1399517"><img height="76" src="/images/events/flyer/2020/5/us-0516-1399517-list.jpg" width="152"/></a><div class="bbox"><h1 class="event-title" itemprop="summary"><a href="/events/1399517" itemprop="url" title="Event details of Higher Level: Richie Hawtin, Charlotte de Witte &amp; More - Postponed">Higher Level: Richie Hawtin, Charlotte de Witte &amp; More - Postponed</a> <span>at <a href="/club.aspx?id=141127">Avant Gardner</a></span></h1><div class="grey event-lineup">Richie Hawtin, Charlotte de Witte, Octave One live, A-Z, Anastasia Kristensen, CLARA 3000, Fadi 

In [140]:
main_container.findAll('h1', class_="event-title")[1]

<h1 class="event-title" itemprop="summary"><a href="/events/1398647" itemprop="url" title="Event details of NYC Hip Hop vs. Reggae ® Yacht Party Cabana Yacht 2020">NYC Hip Hop vs. Reggae ® Yacht Party Cabana Yacht 2020</a> <span>at <a href="/club.aspx?id=104916">Skyport Marina</a></span></h1>

In [143]:
main_container.findAll('h1', class_="event-title")[1].text.split(' at ')

['NYC Hip Hop vs. Reggae ® Yacht Party Cabana Yacht 2020', 'Skyport Marina']
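
A plain `split(' at ')` yields more than two pieces when the event name itself contains " at ". A hedged alternative is to split on the last occurrence only, on the assumption that the venue follows the final " at " in the heading. `split_title` is a hypothetical helper, not part of the lab.

```python
def split_title(heading_text):
    """Split 'Event name at Venue' on the last ' at '; venue is None if absent."""
    name, separator, venue = heading_text.rpartition(" at ")
    if not separator:
        # No ' at ' in the heading, so there is no venue to extract.
        return heading_text.strip(), None
    return name.strip(), venue.strip()

print(split_title("NYC Hip Hop vs. Reggae ® Yacht Party Cabana Yacht 2020 at Skyport Marina"))
```

This still mis-splits a venue whose own name contains " at ", so reading the venue from its link element, as in the earlier cell, remains the more robust option.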

In [129]:
def scrape_events1(events_page_url):
    # Download and parse the requested events page
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    main_container = soup.find('ul', id='items')
    # Build the article list from this page rather than relying on the notebook-level variable
    events_container = main_container.findAll('article', class_="event-item clearfix tickets-bkg-logo")
    event_date = [e.find('time').attrs['datetime'][:-6] for e in events_container]
    event_name = [n.find('a', itemprop="url").attrs['title'] for n in events_container]
    # On the page explored above, the first listing keeps its venue in a grey span; the others link to it inside the title
    event_venue = [events_container[0].find('span', class_="grey").text]+[v.find('h1', class_="event-title").find('span').find('a').text for v in events_container[1:]]
    attendees = [a.find('p', class_="attending").find('span').text for a in events_container]
    df = pd.DataFrame([event_name, event_venue, event_date, attendees]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

In [130]:
scrape_events1('https://www.residentadvisor.net/events/us/newyork')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Event details of Spacecraft | TBA - New York | 2020-05-15 | 2 |
| 1 | Event details of NYC Hip Hop vs. Reggae ® Yach... | Skyport Marina | 2020-05-15 | 1 |
| 2 | Event details of [POSTPONED] Tale Of Us | Knockdown Center | 2020-05-15 | 302 |
| 3 | Event details of [POSTPONED] Derrick Carter, D... | Webster Hall | 2020-05-16 | 31 |
| 4 | Event details of LPR Presents: Netherfriends | The Broadway | 2020-05-16 | 1 |
| 5 | Event details of [POSTPONED] Teksupport: Sven ... | The 1896 | 2020-05-16 | 107 |
| 6 | Event details of Higher Level: Richie Hawtin, ... | Avant Gardner | 2020-05-16 | 70 |
| 7 | Event details of [CANCELLED] Elevation: Season... | House Of Yes | 2020-05-16 | 5 |
| 8 | Event details of Boris Brejcha - Postponed | Brooklyn Mirage | 2020-05-17 | 110 |


Alternatively, split each heading's text on ' at ' to get the event name and venue:

In [144]:
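def scrape_events(events_page_url):
    # Download and parse the requested events page
    html_page = requests.get(events_page_url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    main_container = soup.find('ul', id='items')
    # Build both element lists from this page rather than relying on notebook-level variables
    events_container = main_container.findAll('article', class_="event-item clearfix tickets-bkg-logo")
    titles = main_container.findAll('h1', class_="event-title")
    # Caution: events_container holds only articles with the ticket-logo class, while titles
    # holds every event heading, so the two lists can differ in length and the columns can misalign
    event_date = [e.find('time').attrs['datetime'][:-6] for e in events_container]
    event_name = [n.text.split(' at ')[0] for n in titles]
    event_venue = [n.text.split(' at ')[1] for n in titles]
    attendees = [a.find('p', class_="attending").find('span').text for a in events_container]
    df = pd.DataFrame([event_name, event_venue, event_date, attendees]).transpose()
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df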

In [145]:
scrape_events('https://www.residentadvisor.net/events/us/newyork')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Spacecraft | TBA - New York | 2020-05-15 | 2.0 |
| 1 | NYC Hip Hop vs. Reggae ® Yacht Party Cabana Ya... | Skyport Marina | 2020-05-15 | 1.0 |
| 2 | Virtual Friday: Roza Terenzi and Toni Yotzi | Nowadays | 2020-05-15 | 302.0 |
| 3 | [POSTPONED] Tale Of Us | Knockdown Center | 2020-05-16 | 31.0 |
| 4 | [POSTPONED] Derrick Carter, DJ Sneak, Mark Farina | Webster Hall | 2020-05-16 | 1.0 |
| 5 | LPR Presents: Netherfriends | The Broadway | 2020-05-16 | 107.0 |
| 6 | Virtual Saturday: Juliana Huxtable and Lydo | Nowadays | 2020-05-16 | 70.0 |
| 7 | O/NDA + Digital Discoteca | Zoom Party Live | 2020-05-16 | 5.0 |
| 8 | Gotta Have House Hits Queens | Resorts World Casino | 2020-05-17 | 110.0 |
| 9 | [POSTPONED] Teksupport: Sven Väth (All Night L... | The 1896 | | |


Solution approach:

In [148]:
import numpy as np

def scrape_events3(events_page_url):
    # Download and parse the requested events page
    response = requests.get(events_page_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Look up the listing container in the page just downloaded, not in a notebook-level soup
    event_listings = soup.find('div', id="event-listing")
    entries = event_listings.findAll('li')
    rows = []
    cur_date = None  # most recent date header seen; the events that follow inherit it
    for entry in entries:
        # Each <li> is either a date header or an event
        date = entry.find('p', class_="eventDate date")
        event = entry.find('h1', class_="event-title")
        if event:
            details = event.text.split(' at ')
            event_name = details[0].strip()
            venue = details[1].strip()
            try:
                # The leading digits of the "attending" text are the attendee count
                n_attendees = int(re.match(r"(\d*)", entry.find('p', class_="attending").text)[0])
            except (AttributeError, ValueError):
                # No "attending" element, or its text has no leading digits
                n_attendees = np.nan
            rows.append([event_name, venue, cur_date, n_attendees])
        elif date:
            cur_date = date.text
    df = pd.DataFrame(rows, columns=["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"])
    return df

In [149]:
scrape_events3('https://www.residentadvisor.net/events/us/newyork')

| | Event_Name | Venue | Event_Date | Number_of_Attendees |
|---|---|---|---|---|
| 0 | Spacecraft | TBA - New York | Fri, 15 May 2020 / | 2 |
| 1 | NYC Hip Hop vs. Reggae ® Yacht Party Cabana Ya... | Skyport Marina | Fri, 15 May 2020 / | 1 |
| 2 | Virtual Friday: Roza Terenzi and Toni Yotzi | Nowadays | Fri, 15 May 2020 / | 2 |
| 3 | [POSTPONED] Tale Of Us | Knockdown Center | Fri, 15 May 2020 / | 302 |
| 4 | [POSTPONED] Derrick Carter, DJ Sneak, Mark Farina | Webster Hall | Sat, 16 May 2020 / | 31 |
| 5 | LPR Presents: Netherfriends | The Broadway | Sat, 16 May 2020 / | 1 |
| 6 | Virtual Saturday: Juliana Huxtable and Lydo | Nowadays | Sat, 16 May 2020 / | 4 |
| 7 | O/NDA + Digital Discoteca | Zoom Party Live | Sat, 16 May 2020 / | 3 |
| 8 | Gotta Have House Hits Queens | Resorts World Casino | Sat, 16 May 2020 / | 2 |
| 9 | [POSTPONED] Teksupport: Sven Väth (All Night L... | The 1896 | Sat, 16 May 2020 / | 107 |


## Write a Function to Retrieve the URL for the Next Page

In [None]:
def next_page(url):
    #Your code here
    return next_page_url
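
The links in the listing HTML above are relative (for example `/events/1399517`), so whatever `href` the next-page link provides will probably need resolving against the current page URL before it can be requested. The sketch below covers only that step. How to locate the link depends on the page markup, which is not shown here, and the `week/2020-05-22` path in the example is a made-up placeholder.

```python
from urllib.parse import urljoin

def absolute_url(page_url, href):
    """Resolve a possibly relative href against the URL of the page it came from."""
    return urljoin(page_url, href)

current = "https://www.residentadvisor.net/events/us/newyork"
# Hypothetical href, used only to illustrate the resolution step.
print(absolute_url(current, "/events/us/newyork/week/2020-05-22"))
```

`urljoin` leaves an already absolute `href` unchanged, so the same helper works whether the site emits relative or full links.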

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.
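
The ordering rule can be sketched on plain rows before applying it to a DataFrame. This example assumes "sorted by the number of attendees" means most attended first, which the lab does not state, and that dates are ISO strings so they sort chronologically. `sort_events` and the sample rows are illustrative only.

```python
def sort_events(rows):
    """Sort (name, venue, date, attendees) rows by attendees descending, then date ascending."""
    return sorted(rows, key=lambda row: (-row[3], row[2]))

sample = [
    ("Event A", "Venue 1", "2020-05-16", 5),
    ("Event B", "Venue 2", "2020-05-15", 5),
    ("Event C", "Venue 3", "2020-05-17", 110),
]
for row in sort_events(sample):
    print(row)
```

With a DataFrame whose attendee column is numeric, the equivalent call would be `df.sort_values(["Number_of_Attendees", "Event_Date"], ascending=[False, True])`.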

In [None]:
#Your code here
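
One possible shape for the collection loop is sketched below. It takes the page scraper and the next-page lookup as arguments so the loop can be checked without network access. `collect_events` and its parameters are hypothetical names, and the sketch assumes `scrape_page` returns a list of rows and `get_next_url` returns `None` when there are no more pages.

```python
def collect_events(start_url, scrape_page, get_next_url, max_events=1000):
    """Follow pagination until max_events rows are collected or pages run out."""
    rows = []
    seen = set()  # guards against a next-page link that loops back
    url = start_url
    while url and url not in seen and len(rows) < max_events:
        seen.add(url)
        rows.extend(scrape_page(url))
        url = get_next_url(url)
    return rows[:max_events]

# Stand-in pages for a dry run: url -> (rows, next url)
fake_pages = {"p1": ([1, 2, 3], "p2"), "p2": ([4, 5, 6], "p3"), "p3": ([7], None)}
print(collect_events("p1", lambda u: fake_pages[u][0], lambda u: fake_pages[u][1], max_events=5))
```

In the lab itself, `scrape_page` could wrap one of the scraping functions above, for example `lambda u: scrape_events3(u).values.tolist()`, with `next_page` passed as `get_next_url`.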

## Summary 

Congratulations! In this lab, you successfully developed a pipeline to scrape a website for concert event information!