# Scraping Concerts - Lab

## Introduction

Now that you've seen how to scrape a simple website, it's time to practice those skills on a full-fledged site. In this lab, you'll scrape event listings from a music website: https://www.residentadvisor.net.

## Objectives

You will be able to:
* Scrape events from a website
* Follow links to those events to retrieve further information
* Clean and store scraped data

## View the Website

For this lab, you'll scrape https://www.residentadvisor.net. Start by opening the [events page](https://www.residentadvisor.net/events) in your browser.

<img src="images/ra.png">

In [1]:
import pandas as pd
import numpy as np
from datetime import date
from bs4 import BeautifulSoup
import requests
import json

In [2]:
#soup.findAll('p', class_ = "eventDate date")
# print([x for x in soup.findAll('h1', class_ = 'event-title')[0].children][0].text)
# print('\n')
# print([x for x in soup.findAll('h1', class_ = 'event-title')[0].children][2].contents[3].text)
# print('\n')
# print([x for x in soup.findAll('h1', class_ = 'event-title')[0].children][2].contents[1].text)

In [25]:
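def next_sib_search(container):
    """Count the consecutive 'event-item' siblings starting at `container`."""
    result = list(container.children)[0]['class'][0]
    count = 0
    while result == 'event-item':
        count += 1
        container = container.next_sibling
        try:
            result = list(container.children)[0]['class'][0]
        except Exception:
            # Ran out of siblings, or the next one has no classed first child.
            break
    return count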


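def retrieve_dates(soup):
    """Return one date string per event, repeating each date header for its events."""
    found = soup.findAll('p', class_ = "eventDate date")
    event_dates = []
    for date_tag in found:
        # The header's parent is followed by that day's event siblings.
        n_events = next_sib_search(date_tag.parent.next_sibling)
        # Drop the last two characters of the header text before storing it.
        event_dates.extend([date_tag.text[:-2]] * n_events)
    return event_dates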
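
The sibling-counting idea is easier to see on a tiny page. The sketch below is a minimal illustration, not the site's real markup: the HTML string and the helper name `count_events_after` are made up for this example, and the only assumption carried over from the lab is that each date header is followed by the `event-item` entries for that day.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a date header <li> followed by that day's event <li>s.
html = (
    '<ul>'
    '<li><p class="eventDate date">Fri, 06 Sep 2019 /</p></li>'
    '<li><article class="event-item">A</article></li>'
    '<li><article class="event-item">B</article></li>'
    '<li><p class="eventDate date">Sat, 07 Sep 2019 /</p></li>'
    '<li><article class="event-item">C</article></li>'
    '</ul>'
)


def count_events_after(date_tag):
    """Count the consecutive event <li>s that follow a date header."""
    count = 0
    for sibling in date_tag.parent.find_next_siblings('li'):
        # The first sibling without an event-item is the next date header.
        if sibling.find(class_='event-item') is None:
            break
        count += 1
    return count


soup = BeautifulSoup(html, 'html.parser')
dates = []
for tag in soup.find_all('p', class_='eventDate date'):
    label = tag.text[:-2]  # strip the trailing " /"
    dates.extend([label] * count_events_after(tag))

print(dates)
```

Each date label is repeated once per event, so the resulting list lines up with the lists of names and venues.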

In [24]:
def retrieve_names(soup):
    """Return the text of the first child of each event-title heading."""
    found = soup.findAll('h1', class_ = 'event-title')
    return [list(title.children)[0].text for title in found]

In [23]:
def retrieve_venues(soup):
    """Return the venue text nested in the third child of each event-title heading."""
    found = soup.findAll('h1', class_ = 'event-title')
    return [list(title.children)[2].contents[1].text for title in found]

In [22]:
def retrieve_goers(soup):
    attendees = [x.contents[0].text for x in soup.findAll('p', class_ = 'attending')]
    return attendees

## Write a Function to Scrape All of the Events on the Given Events Page

The function should return a Pandas DataFrame with columns for the Event_Name, Venue, Event_Date and Number_of_Attendees.

In [26]:
def scrape_events(url):
    html_page = requests.get(url)
    soup = BeautifulSoup(html_page.content, 'html.parser')
    names = retrieve_names(soup)
    venues = retrieve_venues(soup)
    dates = retrieve_dates(soup)
    attendees = retrieve_goers(soup)
    data = [names, venues, dates, attendees]
    df = pd.DataFrame(data)
    df = df.T
    df.columns = ["Event_Name", "Venue", "Event_Date", "Number_of_Attendees"]
    return df

## Scrape the Next 1000 Events for Your Area

Display the data sorted by the number of attendees. If there is a tie for the number attending, sort by event date.

In [166]:
def scrape_thousand_events(events_page_url):
    """Scrape week-by-week listings until at least 1000 events are collected."""
    import datetime as dt
    day = date.today()
    one_week = dt.timedelta(weeks = 1)
    # The region path is hard-coded to Illinois; change it for your area.
    dfs = [scrape_events(events_page_url + f'/us/illinois/week/{day}')]
    count = dfs[0].shape[0]
    while count < 1000:
        day += one_week
        new_df = scrape_events(events_page_url + f'/us/illinois/week/{day}')
        # Stop early if a week has no listings, so the loop always terminates.
        if new_df.shape[0] == 0:
            break
        count += new_df.shape[0]
        dfs.append(new_df)
    return pd.concat(dfs)
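
To check the week-by-week URLs without hitting the network, you can generate them separately. This is a minimal sketch: the helper name `weekly_urls` and its parameters are hypothetical, and the `/week/YYYY-MM-DD` pattern is taken from the URL built in the cell above rather than from any documented site API.

```python
from datetime import date, timedelta


def weekly_urls(base_url, region_path, start, weeks):
    """Yield one listing URL per week, beginning at `start`."""
    for offset in range(weeks):
        day = start + timedelta(weeks=offset)
        # isoformat() gives YYYY-MM-DD, the same text an f-string renders for a date.
        yield f"{base_url}/{region_path}/week/{day.isoformat()}"


urls = list(weekly_urls(
    "https://www.residentadvisor.net/events",
    "us/illinois",
    date(2019, 9, 2),  # fixed start date so the output is reproducible
    3,
))
for url in urls:
    print(url)
```

Keeping URL generation apart from the scraping loop makes it easy to change the region or starting week.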

    

In [134]:
events = scrape_thousand_events("https://www.residentadvisor.net/events")

In [135]:
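def replace_int(x):
    """Convert x to int, falling back to 0 when it is not a valid integer."""
    try:
        return int(x)
    except (ValueError, TypeError, OverflowError):
        return 0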

In [136]:
events['Number_of_Attendees'] = events['Number_of_Attendees'].apply(replace_int)

In [138]:
events['Event_Day'] = events['Event_Date'].str[:3]
events['Event_Date'] = events['Event_Date'].str[4:]
events['Event_Date'] = pd.to_datetime(events['Event_Date'])
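
If you want to confirm how a single date string splits before applying the slices to a whole column, the standard library is enough. This sketch assumes the scraped text looks like `Fri, 06 Sep 2019`, which is inferred from the slicing above and is not guaranteed by the site. The helper name `split_event_date` is hypothetical, and `%a` and `%b` expect English day and month abbreviations under the default locale.

```python
from datetime import date, datetime


def split_event_date(raw):
    """Split a string like 'Fri, 06 Sep 2019' into (weekday, date)."""
    # Assumed format: abbreviated weekday, day, abbreviated month, year.
    parsed = datetime.strptime(raw.strip(), "%a, %d %b %Y")
    return parsed.strftime("%a"), parsed.date()


weekday, event_date = split_event_date("Fri, 06 Sep 2019")
print(weekday, event_date)
```

A string that does not match the assumed format raises `ValueError`, which surfaces layout changes on the site early.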

In [162]:
events = events.sort_values(by = ['Number_of_Attendees', 'Event_Date'], ascending = [False, True])
events = events.reset_index(drop = True)
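
The sort-then-reindex step can be checked on a small frame first. The rows below are made up for illustration and are not scraped data. The sketch shows attendees sorted in descending order, with ties broken by the earlier date, and a fresh 0-based index.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Event_Name": ["A", "B", "C", "D"],
    "Event_Date": pd.to_datetime(
        ["2019-10-05", "2019-09-06", "2019-10-25", "2019-09-12"]
    ),
    "Number_of_Attendees": [29, 103, 29, 64],
})

ranked = (
    toy.sort_values(
        by=["Number_of_Attendees", "Event_Date"],
        ascending=[False, True],  # most attendees first, earlier date wins ties
    )
    .reset_index(drop=True)  # discard the old index instead of keeping it as a column
)
print(ranked)
```

Note that `reset_index(inplace=True)` returns `None`, so it cannot be chained with `.drop(...)`. Assigning the result of `reset_index(drop=True)` avoids that problem.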

In [165]:
events.head(10)

| | Event_Name | Venue | Event_Date | Number_of_Attendees | Event_Day |
|---|---|---|---|---|---|
| 0 | Format x Sleepwalker present: FJAAK | TBA - Chicago | 2019-09-06 | 103 | Fri |
| 1 | Spybar & Nightsweat presents: Dax J | Spybar | 2019-09-12 | 64 | Thu |
| 2 | Obscure 033: Scuba presents SCB | TBA - Chicago | 2019-09-13 | 51 | Fri |
| 3 | Format and Synthetik Minds present: Perc, Remc... | TBA - Chicago | 2019-10-11 | 46 | Fri |
| 4 | Excursions: Underground | The Post | 2019-10-05 | 29 | Sat |
| 5 | WYA Chi: Mike Servito (The Bunker NY) | TBA - Chicago | 2019-10-25 | 29 | Fri |
| 6 | Loose Ends with Honey Dijon / Harry Cross | smartbar | 2019-10-12 | 27 | Sat |
| 7 | The Black Madonna / Peach / Phillip Stone | smartbar | 2019-09-28 | 21 | Sat |
| 8 | Chris Liebing | Spybar | 2019-09-06 | 20 | Fri |
| 9 | Ghostly 20 - Metro/smartbar All-Building - fea... | smartbar | 2019-10-19 | 20 | Sat |


## Summary 

Congratulations! In this lab, you successfully scraped a website for concert event information!