# IBDB Webscraping
This file scrapes publicly available data from the Broadway League's IBDB (Internet Broadway Database) in order to analyze how much Tony Awards outcomes matter for Broadway productions.

This file borrows code in its first few chunks and takes general inspiration from the following [Colaboratory Jupyter Notebook](https://colab.research.google.com/drive/1IVwOhBMYay14NkO7kGkrPu0Ij9dSDdEP) by Yaakov Bressler.

In [1]:
import io
import requests
from lxml import etree
from bs4 import BeautifulSoup
import urllib.request
import re
import string
#import time
#import json   #might not need commented out ones
import datetime
import pandas as pd
import numpy as np
import urllib
import ast
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

## Scraping the names of the shows we are interested in

### Create a function that grabs links from a page, using a tag to identify value of link

In [2]:
def getLinks_tagged_fast(url, tag):
    """
    This function finds the <a> elements at a given url whose 'href' attribute matches `tag` and returns a list
    of those href values. Note that `tag` is compiled as a regular expression, not matched as a literal string.
    """
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser')
    links = []
    #keep only the links whose href matches the tag pattern
    for link in soup.find_all('a', attrs={'href': re.compile(tag)}):
        links.append(link.get('href'))
    return links
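
Because `getLinks_tagged_fast` compiles `tag` with `re.compile`, characters such as `.` and `?` in the tags used later (for example `'browseshows.cfm?'`) act as regex metacharacters. The sketch below uses only made-up hrefs to show the difference between the unescaped pattern and one passed through `re.escape`; it is an illustration, not a change to the scraper.

```python
import re

# Hypothetical hrefs for illustration; they are not scraped values.
hrefs = [
    'browseshows.cfm?showtype=BR&year=1979',
    'browseshows.cf',
    'about.cfm',
]
tag = 'browseshows.cfm?'

# Unescaped: '.' matches any character and 'm?' makes the 'm' optional.
loose_matches = [h for h in hrefs if re.search(tag, h)]

# Escaped: every character of tag must appear literally, including '?'.
strict_matches = [h for h in hrefs if re.search(re.escape(tag), h)]

print(loose_matches)
print(strict_matches)
```

On real pages the two patterns will usually select the same links, so escaping mainly guards against accidental matches.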

https://www.broadwayworld.com/browseshows.cfm?showtype=BR

The above link is our starting point. It will allow us to get the name of every Broadway production that opened between 1979 and now.

In [3]:
def get_show_links_year(year_url):
    """
    This function webscrapes from a page on broadwayworld.com to get the urls to many webpages for different
    Broadway shows. Each page is for a different show that has been on Broadway. This function scrapes the links for
    every show that opened since 1979. It returns a list of the urls of interest.
    """
    url = year_url
    tag_year = 'browseshows.cfm?'
    #calling previous function to get the link I want
    years = getLinks_tagged_fast(url, tag_year)[1:]
    page_base = 'https://www.broadwayworld.com/'
    years_loop =[]
    for year in years:
        #focusing on 1979 or later
        if year[-4:].isdigit() and int(year[-4:]) >= 1979:
            years_loop.append(page_base+year)
    
    # Now you have all the years
    tag_show = 'https://www.broadwayworld.com/shows/backstage.php?'
    show_links_nested = []
    for year in years_loop:
        show_links_nested.append(getLinks_tagged_fast(year,tag_show))
    show_links = sum(show_links_nested, [])
    
    return show_links
    
#running function to get my list of links to the productions I want
year_url = 'https://www.broadwayworld.com/browseshows.cfm?showtype=BR'
show_links = get_show_links_year(year_url)

In [4]:
len(show_links)

1877
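
The year filter and the list flattening inside `get_show_links_year` can be tried in isolation. This is a minimal sketch on made-up hrefs: the helper name `is_1979_or_later` and the `year=` query parameter are assumptions for illustration, and `itertools.chain.from_iterable` is shown as an alternative to `sum(show_links_nested, [])`.

```python
from itertools import chain

def is_1979_or_later(href):
    """Mirror the filter above: the last four characters must be a year >= 1979."""
    suffix = href[-4:]
    return suffix.isdigit() and int(suffix) >= 1979

# Hypothetical hrefs; the real query string may look different.
sample_hrefs = [
    'browseshows.cfm?showtype=BR&year=1978',
    'browseshows.cfm?showtype=BR&year=1979',
    'browseshows.cfm?showtype=BR&year=2023',
    'browseshows.cfm?showtype=BR',
]
kept = [h for h in sample_hrefs if is_1979_or_later(h)]

# Flatten a list of lists in one pass instead of repeated list concatenation.
nested = [['a', 'b'], [], ['c']]
flat = list(chain.from_iterable(nested))

print(kept)
print(flat)
```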

From here, we need to iterate through `show_links` to get the name of every show that has been on Broadway since 1979. This is important because we need to know which shows to search for on the IBDB (shows with productions from 1979 or later) and which shows to ignore (shows that only ran before 1979).

In [5]:
def get_show_name(url):
    """
    This function takes a url for a given show as input and uses an xpath statement to search for the name of the
    show on the website. It returns the name of the show. 
    """
    response = requests.get(url)
    #make sure url actually exists
    assert response.status_code == 200

    show_html = response.text
    htmlparser = etree.HTMLParser()
    tree = etree.parse(io.StringIO(show_html), parser=htmlparser)
    showroot = tree.getroot()
    #this xpath is the same for every url; it gives the name of the show for that url
    show_name = showroot.xpath("//span[@itemprop = 'name']/text()")[1]
    
    return show_name

In [6]:
#testing the function on a random url (tried this with several urls)
get_show_name('https://www.broadwayworld.com/shows/backstage.php?showid=6366')

'Oliver!'

In [7]:
#using get_show_name function to make a list of every single show I want to find data on
#took about 10 minutes to run this chunk
show_names = []
for link in show_links:
    show_name = get_show_name(link)
    show_names.append(show_name)
    
len(show_names)

1877

In [9]:
#due to the presence of revivals in the list, some show names appear more than once
#therefore, to reduce redundancy when I search for these shows on the database, I am only keeping unique show names
unique_show_names = []
for show in show_names:
    if show not in unique_show_names:
        unique_show_names.append(show)    

#print(unique_show_names)     
len(unique_show_names)

1590
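
The loop above checks membership in a list on every iteration, which is fine at this size. As an alternative sketch (not the method used in this notebook), `dict.fromkeys` removes duplicates while keeping first-seen order; the sample names below are placeholders.

```python
# Placeholder show names, with revivals repeated.
show_names_sample = ['Camelot', "Dancin'", 'Camelot', 'Oliver!', "Dancin'"]

# dict keys are unique and keep insertion order (Python 3.7+).
unique_sample = list(dict.fromkeys(show_names_sample))

print(unique_sample)
```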

In [10]:
#database will not accept searches with & or + in them, so I must replace these with 'and'
for i in range(len(unique_show_names)):
    unique_show_names[i] = unique_show_names[i].replace('+', 'and')
    unique_show_names[i] = unique_show_names[i].replace('&', 'and')

In [11]:
#unique_show_names

## Searching for shows of interest on the IBDB
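For each unique show name, Selenium opens the IBDB show search page (https://www.ibdb.com/shows/), types the name into the search box, and submits it. If the resulting page lists New York productions, their urls are added to `urls_list`. Otherwise, two alternate xpaths look for a link on the results page whose text exactly matches the show name; the urls found this way are stored in `alt_urls_list` and scraped for production urls in a later chunk. Searches that match none of the three paths are recorded in `failed_searches`.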

In [14]:
def search_alt_path(path):
    """
    When the initial path for webscraping (in the next chunk) does not produce results, this function uses an
    alternate path to find the urls that I want. There are two potential alternate paths that can yield results.
    This function has no return value; it edits alt_urls_list in place.
    """
    alt_web_elts = driver.find_elements('xpath', path)
    assert len(alt_web_elts) > 0
    alt_results = [elt.get_attribute("href") for elt in alt_web_elts]
    alt_urls_list.extend(alt_results)

In [15]:
#this chunk took about 1:30 hours to run
#the xpath statement that will get me the href for each production I am interested in
prod_xpath = '//div[@id = "nyc-productions" and @data-id = "nyc-productions"]/div/div/div/div/a'#/@href'
#alt_path = f"//div[@data-id = 'shows']/div[@class = 'row']/div/a[text() = {show}]"

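#this will be a list of all the production urls that I need 
urls_list = []
#urls of show pages found through the alternate paths; a later chunk scrapes production urls from these pages
alt_urls_list = []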
#list of show name searches that did not produce needed results 
failed_searches = []

driver = webdriver.Chrome()
#test_shows = ['The King and I', 'Dear Evan Hansen', "Rodgers and Hammerstein's Cinderella",'Spring Awakening']
#test_shows = ["Is He Dead?", 'Sweeney Todd','Awake and Sing!', 'Jerry Springer: The Opera']
for show in unique_show_names:
    driver.get('https://www.ibdb.com/shows/')     #website I am searching from (IBDB)
    search_box = driver.find_element('name','ShowProperName')    #locating the searchbar
    search_box.send_keys(show)       #automating the searches (will store results in a list later)
    search_box.submit()
    #search_box.send_keys(Keys.ENTER)  #other way to do above line of code
    web_elts = driver.find_elements('xpath', prod_xpath)
    try:
        assert len(web_elts) > 0
    except:
        #alt_path is the first option for searches that do not immediately give us the desired results
        try:
            alt_path = f'//div[@data-id = "shows"]/div[@class = "row"]/div/a[text() = "{show}"]'
            search_alt_path(alt_path)
            #alt_web_elts = driver.find_elements('xpath', alt_path)
            #assert len(alt_web_elts) > 0
            #alt_results = [elt.get_attribute("href") for elt in alt_web_elts]
            #alt_urls_list.extend(alt_results)
        except:
            #second option for searches that do not immediately give desired results
            try:
                alt_path_2 = f'//p/a[text() = "{show}"]'
                search_alt_path(alt_path_2)
                #alt_web_elts = driver.find_elements('xpath', alt_path_2)
                #print(alt_web_elts)
                #assert len(alt_web_elts) > 0
                #alt_results = [elt.get_attribute("href") for elt in alt_web_elts]
                #alt_urls_list.extend(alt_results)
            except:
                failed_searches.append(show)

    #getting href I need from each selenium WebElement
    results = [elt.get_attribute("href") for elt in web_elts]
    urls_list.extend(results)
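
The nested `try`/`except` blocks above amount to a fallback chain: try the primary xpath, then each alternate path, and record a failure if none of them returns anything. The sketch below shows that control flow without Selenium. The dictionaries, the `example.org` urls, and the helper `first_match` are hypothetical stand-ins for `driver.find_elements`, not part of the scraper.

```python
def first_match(show, finders):
    """Return (index, results) for the first finder that yields results, else (None, [])."""
    for index, finder in enumerate(finders):
        results = finder(show)
        if results:
            return index, results
    return None, []

# Hypothetical lookup tables standing in for pages queried by xpath.
primary = {'Oliver!': ['https://example.org/production/1']}
alternate = {'Sweeney Todd': ['https://example.org/show/2']}
finders = [lambda s: primary.get(s, []), lambda s: alternate.get(s, [])]

urls, alt_urls, failed = [], [], []
for show in ['Oliver!', 'Sweeney Todd', 'Unknown Show']:
    index, results = first_match(show, finders)
    if index == 0:
        urls.extend(results)        # primary path: production urls
    elif index is not None:
        alt_urls.extend(results)    # any alternate path: show-page urls
    else:
        failed.append(show)         # nothing matched

print(urls, alt_urls, failed)
```

Checking for an empty result directly, rather than asserting and catching the error, keeps unrelated exceptions from being silently treated as a failed search.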


In [36]:
len(alt_urls_list)

353

In [37]:
len(failed_searches)

137

In [18]:
#fail_path works, put this code under except
#from here, put all unfail_results in a list
#have separate chunk that iterates through them, grabs relevant info, and extends onto urls_list
#driver = webdriver.Chrome()

#example = 'Sweeney Todd'
#fail_path = f"//div[@data-id = 'shows']/div[@class = 'row']/div/a[text() = '{example}']"
#driver.get('https://www.ibdb.com/shows/')     #website I am searching from (IBDB)
#search_box = driver.find_element('name','ShowProperName')    #locating the searchbar
#search_box.send_keys(example)       #automating the searches (will store results in a list later)
#search_box.submit()
#fail_web_elts = driver.find_elements('xpath', fail_path)
#unfail_results = [elt.get_attribute("href") for elt in fail_web_elts]

In [19]:
#unfail_results

In [20]:
len(alt_urls_list)

353

In [21]:
#the list is longer because urls for several pre-1979 productions were also scraped; these will be excluded by the next big chunk
len(urls_list)

2105

In [24]:
#can fix some of this by accounting for searches that result in "Did you mean..."
#might not be necessary though
len(failed_searches)

137

In [25]:
for url in alt_urls_list:
    resp = requests.get(url)
    prod_html = resp.text
    soup = BeautifulSoup(prod_html, 'html.parser')
    tree = etree.HTML(str(soup)) 
    prod_urls = tree.xpath(prod_xpath + '/@href')
    prod_urls = ['https://www.ibdb.com' + elt for elt in prod_urls]
    urls_list.extend(prod_urls)

In [26]:
#once again: many of these urls are for pre 1979 productions. They will be filtered out in the next chunk
len(urls_list)

2897

In [27]:
#took about 2 hours to run this chunk
#this LoD (poorly structured because of IBDB layout) will have weekly gross/capacity info
prod_data_list = []

#this LoD will have Tony nom/win info
tony_data_LoD = []

#this xpath gives me the name of each award the production was nominated for
noms_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div/div/div/h4/text()"
#this xpath will give me the number of awards the production won
wins_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div[@class='col s1 right-align']/img[@src = '/Images/award.png']"
#this xpath will give me the year that the production was eligible for awards
#it only works if the production received nominations, so productions with 0 nominations fall back to alt_year_path below
year_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div[@class = 'col s11']/div/div[@class = 'col s12' and position() = 2]/text()"
#this path gives me the opening date for a production; it is used for productions with 0 noms
alt_year_path = "//div[@class = 'col s5 m3 l5 txt-paddings']/div[@class = 'xt-main-title']/text()"
#this xpath will give me the type of show the production was (musical, play, or special)
type_path = "//div[@class='row wrapper hide-on-med-and-up']/div[@class='col s12 txt-paddings tag-block-compact']/i[position() = 1]/text()"

for url in urls_list:
    resp = requests.get(url)
    prod_html = resp.text
    soup = BeautifulSoup(prod_html, 'html.parser')
    #finding node that has javascript text with our data
    script = soup.find_all('script', type='text/javascript')[1]   #node at [1] is one with our data
    #making script into a string so I can easily parse through it
    js: str = script.text
        
    #need a 'try:' because the next few lines of code will not work for productions with no financial data 
    #(i.e. pre 1979)
    try:
        #using a regex to search for the dict we are looking for (i.e. the one that has the data)
        #raw string so that \[ and \] reach the regex engine as escaped brackets
        raw_json = re.search(r'var grossdata = {0:\[.*\] };', js, flags=re.MULTILINE).group(0)
        #[16:-1] drops the leading 'var grossdata = ' and the trailing ';'
        data = ast.literal_eval(raw_json[16:-1])
        #adding key:value pair to dict to keep track of which production is which
        data['production'] = url[41:]
        prod_data_list.append(data)
        
        #scraping Tony info now
        #We must do this within the try because this allows us to skip Tony info for productions with no financial data
        #REACH GOAL: split awards into major/minor categories
        # ^ Based on the layout of the website, it would be tedious/difficult to do this
        tree = etree.HTML(str(soup)) 
        nominations = tree.xpath(noms_path)
        num_noms = len(nominations)
        num_wins = len(tree.xpath(wins_path))
        try:
            year = int(tree.xpath(year_path)[0][26:30])
        except:
            year = tree.xpath(alt_year_path)[0]
        show_type = tree.xpath(type_path)[0]
        prod_award_dict = {'production': url[41:], 'nominations': num_noms, 'wins': num_wins, 'type': show_type, 'year': year}
        tony_data_LoD.append(prod_award_dict)
    except:
        pass
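
The extraction step above can be checked offline. This sketch runs a similar regex against a made-up script string shaped like the pattern the scraper expects, and uses a capture group in place of the `[16:-1]` slice. The sample string is an assumption for illustration; the real page content is longer.

```python
import ast
import re

# Made-up script text shaped like the pattern the regex above expects.
js = (
    "var other = 1;\n"
    "var grossdata = {0:[['Feb 19, 2023','$318,478','100%']],"
    "1:[['Feb 26, 2023','$684,822','93%']] };\n"
    "var another = 2;"
)

# Capture only the object literal so that no index arithmetic is needed.
match = re.search(r"var grossdata = (\{0:\[.*\] \});", js)
data = ast.literal_eval(match.group(1)) if match else {}

print(sorted(data))
print(data[1][0])
```

`ast.literal_eval` fits here because a literal with bare integer keys and single-quoted strings is not valid JSON but is a valid Python literal.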

In [13]:
#testing new stuff
#TO DO: delete this
noms_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div/div/div/h4/text()"
wins_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div[@class='col s1 right-align']/img[@src = '/Images/award.png']"
year_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div[@class = 'col s11']/div/div[@class = 'col s12' and position() = 2]/text()"
type_path = "//div[@class='row wrapper hide-on-med-and-up']/div[@class='col s12 txt-paddings tag-block-compact']/i[position() = 1]/text()"

test_url = 'https://www.ibdb.com/broadway-production/escape-to-margaritaville-515030'
test_url = 'https://www.ibdb.com/broadway-production/the-king-and-i-497593'
resp = requests.get(test_url)
prod_html = resp.text
soup = BeautifulSoup(prod_html, 'html.parser')
tree = etree.HTML(str(soup)) 
nominations = tree.xpath(noms_path)
num_noms = len(nominations)
num_wins = len(tree.xpath(wins_path))
try:
    year = int(tree.xpath(year_path)[0][26:30])
except:
    year = tree.xpath("//div[@class = 'col s5 m3 l5 txt-paddings']/div[@class = 'xt-main-title']/text()")[0]
show_type = tree.xpath(type_path)[0]
print(num_noms, num_wins, year, show_type)

9 4 2015 Musical


In [28]:
tony_data_LoD[:10]

[{'production': 'bad-cinderella-535361',
  'nominations': 0,
  'wins': 0,
  'type': 'Musical',
  'year': 'Mar 23, 2023'},
 {'production': 'dancin-4051',
  'nominations': 7,
  'wins': 2,
  'type': 'Musical',
  'year': 1978},
 {'production': 'dancin-535808',
  'nominations': 1,
  'wins': 0,
  'type': 'Musical',
  'year': 2023},
 {'production': 'camelot-13313',
  'nominations': 2,
  'wins': 0,
  'type': 'Musical',
  'year': 1981},
 {'production': 'camelot-4143',
  'nominations': 0,
  'wins': 0,
  'type': 'Musical',
  'year': 'Nov 15, 1981'},
 {'production': 'camelot-4571',
  'nominations': 0,
  'wins': 0,
  'type': 'Musical',
  'year': 'Jun 21, 1993'},
 {'production': 'camelot-534339',
  'nominations': 5,
  'wins': 0,
  'type': 'Musical',
  'year': 2023},
 {'production': 'el-mago-pop-536773',
  'nominations': 0,
  'wins': 0,
  'type': 'Special',
  'year': 'Aug 20, 2023'},
 {'production': 'fat-ham-535958',
  'nominations': 5,
  'wins': 0,
  'type': 'Play',
  'year': 2023},
 {'production': 

In [29]:
prod_data_list[0][1]

[['Feb 19, 2023',
  '$318,478',
  '-2147483648%',
  '2,796',
  '100%',
  'Feb 19',
  318478.0,
  0.0,
  2796.0,
  2796.0,
  '2',
  '0'],
 ['Feb 26, 2023',
  '$684,822',
  '-2147483648%',
  '9,107',
  '93%',
  'Feb 26',
  684822.0,
  0.0,
  9107.0,
  9786.0,
  '7',
  '0'],
 ['Mar 5, 2023',
  '$568,165',
  '-2147483648%',
  '8,695',
  '89%',
  'Mar 5',
  568165.0,
  0.0,
  8695.0,
  9786.0,
  '7',
  '0'],
 ['Mar 12, 2023',
  '$592,938',
  '-2147483648%',
  '8,663',
  '89%',
  'Mar 12',
  592938.0,
  0.0,
  8663.0,
  9786.0,
  '7',
  '0'],
 ['Mar 19, 2023',
  '$642,196',
  '-2147483648%',
  '8,752',
  '89%',
  'Mar 19',
  642196.0,
  0.0,
  8752.0,
  9786.0,
  '7',
  '0'],
 ['Mar 26, 2023',
  '$633,929',
  '-2147483648%',
  '10,185',
  '91%',
  'Mar 26',
  633929.0,
  0.0,
  10185.0,
  11184.0,
  '3',
  '5'],
 ['Apr 2, 2023',
  '$631,890',
  '-2147483648%',
  '8,978',
  '80%',
  'Apr 2',
  631890.0,
  0.0,
  8978.0,
  11184.0,
  '0',
  '8'],
 ['Apr 9, 2023',
  '$639,391',
  '-2147483648%'

In [30]:
#TO DO: Make these comments (and others) into markdown chunks
#structure of prod_data_list (it is a poorly structured LoD):
#each dict is one production
#each key represents one season of data, with the exception of the key I added to represent the production
#the value for each season key is an LoL 
#each list in the LoL represents one week of data for that production
#in each list, we care about the vals at indices: 0 (date of week), 4 (weekly capacity), 6 (weekly gross) 

#end result should be LoD of the following structure:
#[{'production': 'the-king-and-i-497593', 'date': 'May 29, 2016', 'capacity':'76%', 'gross': 546476.0},...]

prod_LoD = []
#prod_data_list[2][0][0]
for prod in prod_data_list:
    for season in prod:
        for week in prod[season]:
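            #the 'production' value is a string, so iterating over it yields characters; keep only the week lists
            if isinstance(week, list):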
                #print(week)
                relevant_data = {'production': prod['production'], 'date': week[0], 'capacity': int(week[4][:-1]), 'gross': week[6]}
                prod_LoD.append(relevant_data)

#see what first few rows of data look like
prod_LoD[:5]

[{'production': 'bad-cinderella-535361',
  'date': 'May 28, 2023',
  'capacity': 69,
  'gross': 351163.0},
 {'production': 'bad-cinderella-535361',
  'date': 'Jun 4, 2023',
  'capacity': 77,
  'gross': 384017.0},
 {'production': 'bad-cinderella-535361',
  'date': 'Feb 19, 2023',
  'capacity': 100,
  'gross': 318478.0},
 {'production': 'bad-cinderella-535361',
  'date': 'Feb 26, 2023',
  'capacity': 93,
  'gross': 684822.0},
 {'production': 'bad-cinderella-535361',
  'date': 'Mar 5, 2023',
  'capacity': 89,
  'gross': 568165.0}]
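
The rows above are not in chronological order (May 2023 appears before February 2023), and the dates are still strings. A minimal sketch of parsing and sorting them with the standard library follows, using three rows copied from the output above. The helper `week_date` is a hypothetical name, and the `%b` format assumes English month abbreviations.

```python
from datetime import datetime

rows = [
    {'production': 'bad-cinderella-535361', 'date': 'May 28, 2023', 'capacity': 69, 'gross': 351163.0},
    {'production': 'bad-cinderella-535361', 'date': 'Feb 19, 2023', 'capacity': 100, 'gross': 318478.0},
    {'production': 'bad-cinderella-535361', 'date': 'Mar 5, 2023', 'capacity': 89, 'gross': 568165.0},
]

def week_date(row):
    """Parse a date such as 'Mar 5, 2023' into a datetime.date."""
    return datetime.strptime(row['date'], '%b %d, %Y').date()

# Sort by production first, then by the parsed week date.
rows_sorted = sorted(rows, key=lambda r: (r['production'], week_date(r)))

print([r['date'] for r in rows_sorted])
```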

In [31]:
#TO DO: DELETE THIS CHUNK

#this xpath give me the name of each award the production was nominated for
#noms_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div/div/div/h4/text()"
#this xpath will give me the number of awards the production won
#wins_path = "//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div[@class='col s1 right-align']/img[@src = '/Images/award.png']"
#tony_data_list = []
#type_list =[]
#for url in urls_list[0:20]:
#test = 'https://www.ibdb.com/broadway-production/the-king-and-i-497593'
#test = 'https://www.ibdb.com/broadway-production/good-night-oscar-535325'
    #resp = requests.get(url)
    #prod_html = resp.text
    #soup = BeautifulSoup(prod_html, 'html.parser')
    #tree = etree.HTML(str(soup)) 
#year = tree.xpath("//div[@class = 'collapsible-body awards-tab']/div[position() = 1]/div[@class = 'col s11']/div/div[@class = 'col s12' and position() = 2]/text()")[0]
    #show_type = tree.xpath('//div[@class="row wrapper hide-on-med-and-up"]/div[@class="col s12 txt-paddings tag-block-compact"]/i[position() = 1]/text()')[0]
    #type_list.append(show_type)
#nominations = tree.xpath(noms_path)
#num_noms = len(nominations)
#num_wins = len(tree.xpath(wins_path))
#prod_award_dict = {"production": url[41:], "nominations": num_noms, "wins": num_wins}
#tony_data_list.append(prod_award_dict)


In [32]:
#tony_data_list

In [33]:
#print(nominations)
#num_wins

In [34]:
tony_data = pd.DataFrame(tony_data_LoD)
tony_data = tony_data.set_index('production')
tony_data

production,nominations,wins,type,year
bad-cinderella-535361,0,0,Musical,"Mar 23, 2023"
dancin-4051,7,2,Musical,1978
dancin-535808,1,0,Musical,2023
camelot-13313,2,0,Musical,1981
camelot-4143,0,0,Musical,"Nov 15, 1981"
...,...,...,...,...
a-life-522289,3,0,Play,2020
barnum-3949,10,3,Musical,1980
home-3953,2,0,Play,1980
nuts-3948,1,0,Play,1980
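
The `year` column mixes integers (from `year_path`) with opening-date strings (from `alt_year_path`). One way to make the column uniform is sketched below with a hypothetical helper, `to_year`. It returns the calendar year of the opening date, which is an assumption: that year is not necessarily the year the production was eligible for awards.

```python
def to_year(value):
    """Return a four-digit year from an int or from a string ending in the year."""
    if isinstance(value, int):
        return value
    # Strings such as 'Mar 23, 2023' end with the year.
    return int(str(value).strip()[-4:])

# Values copied from the table above.
sample_years = [1978, 'Mar 23, 2023', 'Nov 15, 1981', 2023]
normalized = [to_year(v) for v in sample_years]

print(normalized)
```

With pandas, the same helper could be applied as `tony_data['year'].map(to_year)`.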


In [35]:
weekly_data = pd.DataFrame(prod_LoD)
weekly_data = weekly_data.set_index(['production','date'])
weekly_data

production,date,capacity,gross
bad-cinderella-535361,"May 28, 2023",69,351163.0
bad-cinderella-535361,"Jun 4, 2023",77,384017.0
bad-cinderella-535361,"Feb 19, 2023",100,318478.0
bad-cinderella-535361,"Feb 26, 2023",93,684822.0
bad-cinderella-535361,"Mar 5, 2023",89,568165.0
...,...,...,...
nuts-3948,"Jul 20, 1980",63,46596.0
bent-3823,"Jun 8, 1980",36,43510.0
bent-3823,"Jun 15, 1980",51,57117.0
bent-3823,"Jun 22, 1980",48,52885.0


In [38]:
#save to data folder
tony_data.to_csv('data/tony_data.csv')
weekly_data.to_csv('data/weekly_data.csv')