## Web Scraping the Ryman Calendar

In this exercise, you will use BeautifulSoup to build a dataset of upcoming events at the Ryman. This information is available at https://ryman.com/events/; your task is to scrape the contents of that page and convert them into a pandas DataFrame.

The website splits the events across multiple pages, but start by working on just the first page. Later in the exercise, you'll apply what you've done for the first page to the other pages.

In [None]:
import requests
from bs4 import BeautifulSoup as BS
from bs4 import Comment
import pandas as pd
import re

1. Start by using either the browser inspector or the page source. Can you identify a tag that might be helpful for finding the names of all performers? For now, just worry about the headliner and don't worry about the opener. (E.g., for Vince Gill, featuring Wendy Moten, we only care about Vince Gill.) Use this tag to create a list containing just the name of each headliner.

In [None]:
URL = 'https://ryman.com/events/'
response = requests.get(URL)
response.status_code

In [None]:
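#Name the parser explicitly so results do not depend on which parsers are installed.
soup = BS(response.text, 'html.parser')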
#print(soup.prettify())

In [None]:
tribe_events = soup.find_all('a', attrs={'class': 'tribe-event-url'})
performers = [x.get('title') for x in tribe_events]
print(len(performers))
#performers
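
To see how the `title` attribute lookup behaves without requesting the live site, here is a minimal sketch run against a small hand-written HTML fragment. The fragment is an assumption made for illustration: only the `tribe-event-url` class comes from the code above, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the 'tribe-event-url' class is taken from the exercise.
sample_html = """
<div><a class="tribe-event-url" href="/event/a/" title="Vince Gill">Vince Gill</a></div>
<div><a class="tribe-event-url" href="/event/b/" title="Example Headliner">Details</a></div>
<div><a class="other-link" href="/about/" title="About">About</a></div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only the anchors carrying the event class, then read their title attribute.
sample_links = sample_soup.find_all('a', attrs={'class': 'tribe-event-url'})
sample_performers = [link.get('title') for link in sample_links]
print(sample_performers)
```

Note that `get('title')` reads the attribute rather than the link text, so the second entry comes back as the title even though the visible text is different.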

2. Next, find a tag that could be used to get the date and time for each show. Extract these into two lists, one containing the dates and the other containing the times. (E.g., THURSDAY, AUGUST 4, 2022 AT 8:00 PM CDT should be split into August 4, 2022 and 8:00 PM CDT.)

In [None]:
event_time_tags = soup.find_all('time')
event_datetimes = [x.text for x in event_time_tags]

print(len(event_datetimes))
#event_datetimes

In [None]:
event_datetimes = [x.split(' at ') for x in event_datetimes]
event_dates_full = [x[0] for x in event_datetimes]
#Using regex is probably overkill here, but it is included for practice.
#The raw string (r'...') keeps \s from being treated as an invalid string escape.
event_dates = [re.search(r'[a-zA-Z]+,\s(.+)', x).group(1) for x in event_dates_full]
event_times = [x[1] for x in event_datetimes]
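
The split-then-regex step can be tried on a single string before running it over the whole list. The sketch below wraps it in a helper; the name `split_event_datetime` is hypothetical, and it assumes the text contains a lowercase `' at '` separator, as the code above does.

```python
import re

def split_event_datetime(text):
    """Split 'Weekday, Month D, YYYY at H:MM PM TZ' into (date, time)."""
    # Assumes exactly the ' at ' separator used in the exercise code.
    date_full, time_part = text.split(' at ', 1)
    match = re.search(r'[a-zA-Z]+,\s(.+)', date_full)
    # Fall back to the unsplit date if there is no leading weekday.
    date_part = match.group(1) if match else date_full
    return date_part, time_part

example = split_event_datetime('Thursday, August 4, 2022 at 8:00 PM CDT')
print(example)
```

The pattern matches the weekday and its comma, then captures everything after the following whitespace, which is why the weekday is dropped from the date.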

3. Take the lists you created in parts 1 and 2 and convert them into a pandas DataFrame.

In [None]:
d = {'performer':performers,'event_date':event_dates, 'event_time':event_times}

In [None]:
events_df = pd.DataFrame(d)
print(len(events_df))
events_df.head()

4. Now take what you created for the first page and apply it across the rest of the pages so that you can scrape more events. Notice how the URL changes when you click the "More Events" button at the top of the page. Check that the code you wrote for the first page still works for page 2. Once you have verified that it does, write a for loop that cycles through the first five pages of events.

In [None]:
performers = []
event_dates = []
event_times = []

for page in range(1, 6):
    URL = 'https://ryman.com/events/list/?tribe_event_display=list&tribe_paged=' + str(page)

    response = requests.get(URL)
    if response.status_code == 200:
        soup = BS(response.text, 'html.parser')

        #Note: This approach finds all of each type of tag separately.
        #      The approach used in Bonus #1 finds a shared parent tag and then finds tags within that,
        #      which I think is the better of the two.
        tribe_events = soup.find_all('a', attrs={'class': 'tribe-event-url'})
        page_performers = [tag.get('title') for tag in tribe_events]
        performers = performers + page_performers

        event_time_tags = soup.find_all('time')
        event_datetimes = [tag.text for tag in event_time_tags]
        event_datetimes_split = [text.split(' at ') for text in event_datetimes]

        page_event_dates_full = [parts[0] for parts in event_datetimes_split]
        page_event_dates = [re.search(r'[a-zA-Z]+,\s(.+)', text).group(1) for text in page_event_dates_full]
        page_event_times = [parts[1] for parts in event_datetimes_split]

        event_dates = event_dates + page_event_dates
        event_times = event_times + page_event_times

d = {'performer':performers,'event_date':event_dates, 'event_time':event_times}
events_df = pd.DataFrame(d)

events_df
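
One way to keep the paging logic easy to check is to build the page URLs separately from the requests. The sketch below is an illustration only: the helper name `build_page_url` is hypothetical, and the query parameters are copied from the URL used in the loop above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://ryman.com/events/list/'

def build_page_url(page_number):
    """Return the list-view URL for a given page (pages start at 1)."""
    query = urlencode({'tribe_event_display': 'list', 'tribe_paged': page_number})
    return BASE_URL + '?' + query

# Pages 1 through 5, matching the loop in the exercise.
page_urls = [build_page_url(page) for page in range(1, 6)]
for url in page_urls:
    print(url)
```

Printing the URLs before requesting them makes it easy to confirm that the page numbers start at 1 rather than 0.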

5. **Bonus #1:** Add to your DataFrame the opening act for all shows that list an opener.

In [None]:
#WORKING VERSION -- LOOK AT PRIOR COMMENTS FOR OPENERS
performers = []
openers = []
event_dates = []
event_times = []
event_hrefs = []   #included to help set up Bonus #2
event_prices = []  #included to help set up Bonus #2

for x in range(5):
    URL='https://ryman.com/events/list/?tribe_event_display=list&tribe_paged=' + str(x+1)
    
    response = requests.get(URL)
    if response.status_code == 200:
        soup = BS(response.text, 'html.parser')
        
        tribe_beside_image_tags = soup.find_all('div', attrs={'class':'tribe-beside-image'})
        
        for tribe_beside_image_tag in tribe_beside_image_tags:
            
            #find tribe_event_url_tag
            tribe_event_url_tag = tribe_beside_image_tag.find('a', attrs={'class':'tribe-event-url'})
            
            #get performer from tribe_event_url_tag and add to list of performers
            performer = tribe_event_url_tag.get('title')
            performers = performers + [performer]

            #get href from tribe_event_url_tag and add to list of hrefs
            event_href = tribe_event_url_tag.get('href')
            event_hrefs = event_hrefs + [event_href]

            #ISSUE: CAN BE MULTIPLE OPENER TAGS
            #ISSUE: HOW TO DETERMINE IF OPENER TAG IS ACTUALLY AN OPENER
            #find opener tag, get text, and add to openers list
            
            #Set opener to None as its default value. Opener will be changed if a legitimate opener is found.
            opener = None
            
            #Find all opener tags
            opener_tags = tribe_beside_image_tag.find_all('span', attrs={'class':'opener'})
            
            #if no opener tags are found, for loop will end immediately.
            for opener_tag in opener_tags:
                #the opener tag must have an 'OPENER' comment two siblings before it.
                #Guard against a missing sibling so .previous_sibling is never called on None.
                previous_node = opener_tag.previous_sibling
                marker = previous_node.previous_sibling if previous_node is not None else None
                if isinstance(marker, Comment) and str(marker).strip().upper() == 'OPENER':
                    #the opener text must start with 'With ', 'Featuring', or 'Hosted by'.
                    if opener_tag.text.upper().startswith(('WITH ', 'FEATURING', 'HOSTED BY')):
                        opener = opener_tag.text

            openers = openers + [opener]

            #find datetime tag, get text, and split out date and time
            datetime_tag = tribe_beside_image_tag.find('time')
            #Named datetime_parts to avoid confusion with Python's datetime module.
            datetime_parts = datetime_tag.text.split(' at ')
            event_date = re.search(r'[a-zA-Z]+,\s(.+)', datetime_parts[0]).group(1)
            event_dates = event_dates + [event_date]
            event_times = event_times + [datetime_parts[1]]

            #Build list of empty values (must be same length as other lists)
            event_prices = event_prices + [None]
            
event_dict = {'performer':performers,
              'opener':openers,
              'event_date':event_dates, 
              'event_time':event_times,
              'event_href':event_hrefs,
              'event_price':event_prices}
events_df = pd.DataFrame(event_dict)
events_df
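
The opener check relies on an HTML comment sitting two siblings before the `span`, which is easier to see on a small example. The sketch below uses hand-written markup, so treat the structure as an assumption: only the `opener` class, the `OPENER` comment, and the three text prefixes come from the code above. The helper name `find_opener` is hypothetical.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: a marked opener followed by an unrelated span with the same class.
sample_html = """<div class="tribe-beside-image">
<!-- OPENER -->
<span class="opener">Featuring Wendy Moten</span>
<!-- NOTE -->
<span class="opener">Sold Out</span>
</div>"""

OPENER_PREFIXES = ('WITH ', 'FEATURING', 'HOSTED BY')

def find_opener(container):
    """Return the text of the last span that passes both opener checks, else None."""
    opener = None
    for span in container.find_all('span', attrs={'class': 'opener'}):
        # The newline between the comment and the span is its own text node,
        # so the comment is two siblings back.
        previous_node = span.previous_sibling
        marker = previous_node.previous_sibling if previous_node is not None else None
        is_marked = isinstance(marker, Comment) and str(marker).strip().upper() == 'OPENER'
        if is_marked and span.text.upper().startswith(OPENER_PREFIXES):
            opener = span.text
    return opener

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_opener = find_opener(sample_soup.find('div'))
print(sample_opener)
```

The second span is skipped because its preceding comment is not `OPENER`, which is the case the two-step check is meant to filter out.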

6. **Bonus #2:** If you click the "MORE INFO" button for an event, it will take you to a page which shows ticket prices. Write code that can be used to retrieve the ticket prices for each show that you have scraped. Make sure that your code can handle cases where the show has been canceled (e.g., https://ryman.com/event/nhabit-worship-experience/).

In [None]:
for x in range(events_df.shape[0]):
    url = events_df.event_href[x]
    response = requests.get(url)
    if response.status_code == 200:
        soup = BS(response.text, 'html.parser')
        event_price_tag = soup.find('p', attrs={'class':'theprices'})
        #Pages without a price tag (e.g. canceled shows) keep event_price as None.
        if event_price_tag is not None:
            event_price = str(event_price_tag.text)
            
            #NOT SURE WHETHER TO INCLUDE THIS:
            #Add escape character before $ so it will display appropriately.
            event_price = event_price.replace('$', '\\$')
            
            #Use .loc rather than chained assignment, which may not update events_df.
            events_df.loc[x, 'event_price'] = event_price
            #print(events_df.event_price[x])
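
The `None` check is what lets the loop cope with pages that have no price. A minimal sketch of that behavior follows, using two hand-written pages. The markup and the helper name `extract_price` are assumptions for illustration; only the `theprices` class comes from the code above, and a real canceled-event page may look different.

```python
from bs4 import BeautifulSoup

def extract_price(html):
    """Return the price text from an event page, or None if no price tag exists."""
    price_tag = BeautifulSoup(html, 'html.parser').find('p', attrs={'class': 'theprices'})
    # find() returns None when nothing matches, so check before reading .text.
    return price_tag.text.strip() if price_tag is not None else None

# Hypothetical pages: one with a price paragraph, one without.
page_with_price = '<div><p class="theprices"> $35 - $75 </p></div>'
page_without_price = '<div><p class="status">Canceled</p></div>'

print(extract_price(page_with_price))
print(extract_price(page_without_price))
```

Unlike the loop above, this sketch also strips surrounding whitespace from the price text; drop the `strip()` call if you want the raw text.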

In [None]:
events_df = events_df[['performer',
                       'opener',
                       'event_date',
                       'event_time',
                       'event_price']]

In [None]:
events_df

In [None]:
events_df.to_csv('../data/ryman_events_df.csv')

In [None]:
events_df.event_price.head()