# 🎭 Broadway Show Scraper – IBDB.com

### 📌 Overview
This notebook scrapes show details from IBDB.com, saves the data to CSV files, detects newly added shows, and sends email notifications. A scheduler reruns the whole job every 3 minutes.

# PART 1

## 🔧 Step 1: Setup – Import Required Libraries

In [None]:
import re
import time
import os
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import schedule
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart


In [None]:
chrome_driver_path = "C:\\WebDriver\\chromedriver.exe"


### Why is chrome_driver_path Important?

What is it?
This is the file path where the ChromeDriver executable is located on your computer.

Why do we need it?
Selenium uses ChromeDriver to automate and control the Chrome browser programmatically. ChromeDriver acts as a bridge between your Python code and the Chrome browser.

How does it work?
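When you pass this path to Selenium, for example with `webdriver.Chrome(service=Service(chrome_driver_path))` (where `Service` comes from `selenium.webdriver.chrome.service`), Selenium launches that ChromeDriver executable, which in turn opens the Chrome window that Selenium controls. The cells below call `webdriver.Chrome()` with no arguments, so they do not use this variable; they rely on ChromeDriver being on your PATH or, in Selenium 4.6 and later, on Selenium Manager locating a driver automatically.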

Note on path formatting:
Since backslashes (\) are escape characters in Python strings, use either double backslashes (\\) or raw strings (r"...") to avoid errors.
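The note on path formatting can be checked directly. The sketch below compares the spellings of the driver path and shows what goes wrong with single backslashes; the `C:\temp\new.exe` path is a made-up example chosen because `\t` and `\n` are real escape sequences.

```python
from pathlib import PureWindowsPath

# Three spellings of the same Windows path.
escaped = "C:\\WebDriver\\chromedriver.exe"   # doubled backslashes
raw = r"C:\WebDriver\chromedriver.exe"        # raw string
forward = "C:/WebDriver/chromedriver.exe"     # forward slashes

# The first two are exactly the same string.
same_string = escaped == raw

# PureWindowsPath normalizes separators, so the forward-slash form
# compares equal to the backslash form when treated as a path.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

# Hypothetical path showing the pitfall: "\t" becomes a tab and "\n" a newline.
broken = "C:\temp\new.exe"
has_control_chars = "\t" in broken and "\n" in broken

print(same_string, same_path, has_control_chars)
```

Forward slashes are often the least error-prone choice, since Windows APIs generally accept them and no escaping is involved.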

## 🚀 Step 2: Scraping Function
This cell uses Selenium to open the IBDB shows page, parses the show blocks with BeautifulSoup, and collects data from each show's detail page, stopping after 40 shows.
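One step of the parsing, pulling the poster URL out of an inline `style` attribute, can be tried in isolation. The sketch below is a hypothetical helper (the name `extract_background_url` and the example URLs are assumptions, not taken from the notebook). It applies the same `url\((.*?)\)` pattern as the scraper and additionally strips quotes, in case the CSS value is written as `url('...')`.

```python
import re

def extract_background_url(style):
    """Return the URL inside a CSS url(...) value without quotes, or None."""
    match = re.search(r"url\((.*?)\)", style or "")
    if match is None:
        return None
    # Remove surrounding whitespace and optional single or double quotes.
    return match.group(1).strip().strip("'\"")

# Example inputs (made-up URLs for illustration only).
quoted = "background-image: url('https://example.com/poster.jpg');"
unquoted = "background-image:url(https://example.com/poster.jpg)"
print(extract_background_url(quoted))
print(extract_background_url(unquoted))
print(extract_background_url("color: red"))
```

Returning `None` instead of raising lets the caller decide whether a missing image should skip the show or simply leave the field empty.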

In [11]:

# Start Chrome browser
driver = webdriver.Chrome()
driver.get("https://www.ibdb.com/shows")

# Wait for show blocks to load
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "xt-iblock-inner"))
)

# Parse initial page for show links
soup = BeautifulSoup(driver.page_source, "html.parser")
blocks = soup.select(".xt-iblock-inner")
all_shows = []

# Set page load timeout
driver.set_page_load_timeout(15)

# Loop through blocks (limit to first 40)
for i, block in enumerate(blocks):
    if len(all_shows) >= 40:
        break
    try:
        # Get detail page URL and image
        relative_link = block.select_one("a")["href"]
        detail_url = f"https://www.ibdb.com{relative_link}"
        style = block.select_one("span")["style"]
        image_url = re.search(r"url\((.*?)\)", style).group(1)

        # Load detail page
        driver.get(detail_url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        detail_soup = BeautifulSoup(driver.page_source, "html.parser")

        # Extract title
        title_element = detail_soup.select_one("h3.title-label")
        title = title_element.text.strip() if title_element else "N/A"

        # Extract show types and remove duplicates
        type_elements = detail_soup.select(".col.s12.txt-paddings.tag-block-compact i")
        show_types = [elem.text.strip() for elem in type_elements]
        show_types = list(dict.fromkeys(show_types))  # Remove duplicates while preserving order

        # Extract opening and closing dates
        date_blocks = detail_soup.select(".xt-main-title")
        opening_date = date_blocks[0].text.strip() if len(date_blocks) > 0 else "N/A"
        closing_date = date_blocks[1].text.strip() if len(date_blocks) > 1 else "N/A"

        # Extract performances
        performances = "N/A"
        performance_blocks = detail_soup.select("div.col.s7.m6.l7.txt-paddings.vertical-divider")
        # Use a separate loop variable so the outer `block` is not overwritten
        for block_perf in performance_blocks:
            label = block_perf.select_one("div.xt-lable")
            if label and "Performances" in label.text:
                value = block_perf.select_one("div.xt-main-title")
                performances = value.text.strip() if value else "N/A"
                break

        # Compile show data
        show_data = {
            "Title": title,
            "Image URL": image_url,
            "Detail Link": detail_url,
            "Opening Date": opening_date,
            "Closing Date": closing_date,
            "Type(s)": ", ".join(show_types),
            "Performances": performances
        }
        all_shows.append(show_data)
        print(f"[{len(all_shows)}] ✅ Collected: {title}")

        time.sleep(1)

    except Exception as e:
        print(f"[{i+1}] ❌ Error: {e}")
        continue

# Display results
print("\n\n=== Top 40 Shows Preview ===")
for idx, show in enumerate(all_shows, 1):
    print(f"\nShow {idx}:")
    for key, value in show.items():
        print(f"{key}: {value}")
    print("-" * 40)

# Close browser
driver.quit()


[1] ✅ Collected: & Juliet
[2] ❌ Error: Message: timeout: Timed out receiving message from renderer: 14.769
  (Session info: chrome=136.0.7103.114)
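The timeout in the output above means one detail page did not finish loading within the 15-second page load limit, so that show was skipped. One possible mitigation, sketched here as an assumption rather than part of the original notebook, is to retry a failed action a few times with an increasing delay. The helper name `with_retries` and its parameters are hypothetical.

```python
import time

def with_retries(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: let the caller's except block handle it
            delay = base_delay * 2 ** (attempt - 1)  # 2s, 4s, 8s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# Possible use inside the scraping loop (driver and detail_url come from the notebook):
# with_retries(lambda: driver.get(detail_url))
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.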

## Export to DataFrame and view the data

In [16]:
# Convert to DataFrame
df = pd.DataFrame(all_shows)

# Display preview (a bare string literal prints nothing, so call print explicitly)
print("\n\n=== DataFrame Preview ===")
df.head()

| | Title | Image URL | Detail Link | Opening Date | Closing Date | Type(s) | Performances |
|---|---|---|---|---|---|---|---|
| 0 | & Juliet | https://www.broadway.org/assets/shows/andjulie... | https://www.ibdb.com/broadway-production/-juli... | Nov 17, 2022 | | Musical, Original, Broadway | 1045 |
| 1 | BOOP! The Musical | https://www.broadway.org/assets/shows/boop-100... | https://www.ibdb.com/broadway-production/boop-... | Apr 05, 2025 | | Musical, Original, Broadway | 49 |
| 2 | Buena Vista Social Club | https://www.broadway.org/assets/shows/buenavis... | https://www.ibdb.com/broadway-production/buena... | Mar 19, 2025 | | Musical, Original, Broadway | 71 |
| 3 | Cabaret | https://www.broadway.org/assets/shows/cabaret-... | https://www.ibdb.com/broadway-production/cabar... | Apr 21, 2024 | | Musical, Drama, Revival, Broadway | 448 |
| 4 | Chicago | https://www.broadway.org/assets/shows/chicago-... | https://www.ibdb.com/broadway-production/chica... | Nov 14, 1996 | | Musical, Comedy, Revival, Broadway | 11209 |


## Save to CSV

In [18]:
# Save to CSV
df.to_csv("Broadway shows.csv", index=False)

## Find where the file was saved

In [19]:
import os
os.getcwd()

'C:\\Users\\Ayo'
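Because `to_csv` was given a relative filename, the file lands in the working directory reported above. As a small sketch using only the standard library, the full path of the CSV can be built in one step:

```python
from pathlib import Path

# Resolve the relative filename used in df.to_csv against the current working directory.
csv_path = Path("Broadway shows.csv").resolve()
print(csv_path)
```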

# PART 2

## 📧 Step 3: Send Email Notification (for New Shows)

In [69]:
def send_email(new_shows, sender_email, sender_password, recipient_email):
    subject = f"New IBDB Shows Detected: {len(new_shows)} New Show(s)"
    body = "New shows found:\n\n"
    for show in new_shows:
        body += f"- {show['Title']} (Opening: {show['Opening Date']})\n  Link: {show['Detail Link']}\n\n"

    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = recipient_email
    msg['Subject'] = subject
    msg.attach(MIMEText(body, 'plain'))

    try:
        # The context manager closes the connection even if login or sending fails
        with smtplib.SMTP('smtp.gmail.com', 587) as server:
            server.starttls()
            server.login(sender_email, sender_password)
            server.send_message(msg)
        print("✅ Notification email sent.")
    except Exception as e:
        print(f"❌ Failed to send email: {e}")
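To check the wording of the notification without contacting an SMTP server, the message can be built by a separate function. The sketch below is one way to do that with the standard library's `EmailMessage`; the function name `build_notification`, the example addresses, and the sample show are assumptions for illustration, while the subject and body format follow `send_email` above.

```python
from email.message import EmailMessage

def build_notification(new_shows, sender_email, recipient_email):
    """Build the notification email without sending it."""
    msg = EmailMessage()
    msg["From"] = sender_email
    msg["To"] = recipient_email
    msg["Subject"] = f"New IBDB Shows Detected: {len(new_shows)} New Show(s)"

    lines = ["New shows found:", ""]
    for show in new_shows:
        lines.append(f"- {show['Title']} (Opening: {show['Opening Date']})")
        lines.append(f"  Link: {show['Detail Link']}")
        lines.append("")
    msg.set_content("\n".join(lines))
    return msg

# Made-up sample record with the same keys the scraper produces.
sample = [{
    "Title": "Example Show",
    "Opening Date": "Jan 01, 2025",
    "Detail Link": "https://example.com/show",
}]
preview = build_notification(sample, "sender@example.com", "recipient@example.com")
print(preview["Subject"])
print(preview.get_content())
```

The returned object can be passed to `server.send_message(...)` in the same way as the `MIMEMultipart` message.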


## 💾 Step 4: Save to CSV and Detect New Shows

In [20]:
pip install schedule


Collecting schedule
  Obtaining dependency information for schedule from https://files.pythonhosted.org/packages/20/a7/84c96b61fd13205f2cafbe263cdb2745965974bdf3e0078f121dfeca5f02/schedule-1.2.2-py3-none-any.whl.metadata
  Downloading schedule-1.2.2-py3-none-any.whl.metadata (3.8 kB)
Downloading schedule-1.2.2-py3-none-any.whl (12 kB)
Installing collected packages: schedule
Successfully installed schedule-1.2.2
Note: you may need to restart the kernel to use updated packages.


In [68]:
def save_data(df_or_list, base_dir="data",
              sender_email=None, sender_password=None, recipient_email=None):
    os.makedirs(base_dir, exist_ok=True)

    df = pd.DataFrame(df_or_list) if isinstance(df_or_list, list) else df_or_list
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    snapshot_name = f"{base_dir}/ibdb_shows_{timestamp}.csv"
    master_path = f"{base_dir}/ibdb_master.csv"

    if os.path.exists(master_path):
        master_df = pd.read_csv(master_path)
        combined = pd.concat([master_df, df]).drop_duplicates(subset=["Title", "Opening Date"])
        new_shows_df = combined.merge(master_df, how='outer', indicator=True)
        new_shows_df = new_shows_df[new_shows_df['_merge'] == 'left_only']
        new_shows = new_shows_df.drop(columns=['_merge']).to_dict('records')
    else:
        combined = df
        new_shows = df.to_dict('records')

    combined.to_csv(master_path, index=False)
    df.to_csv(snapshot_name, index=False)

    print(f"✅ Data saved to:\n - {master_path}\n - {snapshot_name}")

    if new_shows and sender_email and sender_password and recipient_email:
        send_email(new_shows, sender_email, sender_password, recipient_email)
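`save_data` finds new shows by merging on every column, which can be sensitive to values that change between the scrape and the reloaded CSV (for example, `pd.read_csv` turns the string "N/A" into a missing value by default). A more direct alternative, shown here as a sketch and not as the notebook's method, compares only the identifying fields. The helper name `find_new_shows` is hypothetical; it works on plain lists of dictionaries, such as those returned by `df.to_dict("records")`.

```python
def find_new_shows(scraped, known, key_fields=("Title", "Opening Date")):
    """Return scraped records whose key fields do not appear in the known records."""
    def key(record):
        # Compare as stripped strings so "1996" and 1996 count as the same value.
        return tuple(str(record.get(field, "")).strip() for field in key_fields)

    seen = {key(record) for record in known}
    new_shows = []
    for record in scraped:
        record_key = key(record)
        if record_key not in seen:
            seen.add(record_key)  # also collapses duplicates within one scrape
            new_shows.append(record)
    return new_shows

# Made-up records for illustration.
known = [{"Title": "Chicago", "Opening Date": "Nov 14, 1996"}]
scraped = [
    {"Title": "Chicago", "Opening Date": "Nov 14, 1996", "Performances": "11209"},
    {"Title": "Cabaret", "Opening Date": "Apr 21, 2024", "Performances": "448"},
]
print(find_new_shows(scraped, known))
```

Inside `save_data` this could be called as `find_new_shows(df.to_dict("records"), master_df.to_dict("records"))`. Passing `keep_default_na=False` to `pd.read_csv` would keep "N/A" as a string so the keys stay comparable across runs.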


## ⏲️ Step 5: Schedule the Job to Run Every 3 Minutes (or Any Interval You Choose)

In [None]:
def job():
    print("Starting scrape...")
    shows_list = scrape_ibdb()
    save_data(shows_list,
              sender_email="osuyaikechukwu7@gmail.com",
              sender_password="your_app_password_here",
              recipient_email="osuyaikechukwu7@gmail.com")

schedule.every(3).minutes.do(job)

print("⏰ Scheduler started. Press Ctrl+C to stop.")
while True:
    schedule.run_pending()
    time.sleep(1)
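Two behaviours of this loop are worth knowing. The first run happens 3 minutes after the scheduler starts, so call `job()` once before the loop if an immediate scrape is wanted. Also, as far as the `schedule` documentation describes, the library does not catch exceptions raised inside a job, so an unhandled error would end the `while True` loop. The decorator below is a minimal sketch of one way to guard against that; the name `catch_exceptions` is an assumption, and it uses only the standard library.

```python
import functools
import traceback

def catch_exceptions(job_func):
    """Wrap a job so an exception is logged instead of stopping the scheduler loop."""
    @functools.wraps(job_func)
    def wrapper(*args, **kwargs):
        try:
            return job_func(*args, **kwargs)
        except Exception:
            print(f"Job {job_func.__name__} failed:")
            traceback.print_exc()
            return None
    return wrapper

# Possible use with the scheduler from this step:
# schedule.every(3).minutes.do(catch_exceptions(job))
```

With this wrapper a failed scrape is reported and the next scheduled run still takes place.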


In [2]:
pip install streamlit 


Collecting streamlit
  Obtaining dependency information for streamlit from https://files.pythonhosted.org/packages/13/e6/69fcbae3dd2fcb2f54283a7cbe03c8b944b79997f1b526984f91d4796a02/streamlit-1.45.1-py3-none-any.whl.metadata
  Downloading streamlit-1.45.1-py3-none-any.whl.metadata (8.9 kB)
Collecting altair<6,>=4.0 (from streamlit)
  Obtaining dependency information for altair<6,>=4.0 from https://files.pythonhosted.org/packages/aa/f3/0b6ced594e51cc95d8c1fc1640d3623770d01e4969d29c0bd09945fafefa/altair-5.5.0-py3-none-any.whl.metadata
  Downloading altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting blinker<2,>=1.5.0 (from streamlit)
  Obtaining dependency information for blinker<2,>=1.5.0 from https://files.pythonhosted.org/packages/10/cb/f2ad4230dc2eb1a74edf38f1a38b9b52277f75bef262d8908e60d957e13c/blinker-1.9.0-py3-none-any.whl.metadata
  Downloading blinker-1.9.0-py3-none-any.whl.metadata (1.6 kB)
Collecting cachetools<6,>=4.0 (from streamlit)
  Obtaining dependency information 

## ✅ Final Notes (Combined Script to Run)
1. Be sure to replace "your_app_password_here" with your actual Gmail App Password.

2. To stop the scheduled job, interrupt the kernel, or press Ctrl+C if you are running it as a script.

3. In production, keep credentials in a `.env` file or environment variables rather than hard-coding them in the notebook.
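The third note can be sketched as follows. The function name `load_email_settings` and the `IBDB_*` variable names are assumptions chosen for this example; any names work as long as they match what is set in the environment. A package such as python-dotenv can load a `.env` file into the environment first, but the sketch itself only reads `os.environ`.

```python
import os

def load_email_settings(env=os.environ):
    """Read email credentials from environment variables (hypothetical names)."""
    required = ("IBDB_SENDER_EMAIL", "IBDB_SENDER_PASSWORD", "IBDB_RECIPIENT_EMAIL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Keys match the keyword arguments of save_data.
    return {
        "sender_email": env["IBDB_SENDER_EMAIL"],
        "sender_password": env["IBDB_SENDER_PASSWORD"],
        "recipient_email": env["IBDB_RECIPIENT_EMAIL"],
    }

# Possible use inside job():
# save_data(shows_list, **load_email_settings())
```

Failing early with a clear message is usually easier to debug than a later SMTP login error caused by an empty password.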

In [3]:
import re
import time
import os
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import schedule
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart


def scrape_ibdb():
    driver = webdriver.Chrome()
    driver.get("https://www.ibdb.com/shows")

    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "xt-iblock-inner"))
    )

    soup = BeautifulSoup(driver.page_source, "html.parser")
    blocks = soup.select(".xt-iblock-inner")
    all_shows = []

    driver.set_page_load_timeout(15)

    for i, block in enumerate(blocks):
        if len(all_shows) >= 40:
            break
        try:
            relative_link = block.select_one("a")["href"]
            detail_url = f"https://www.ibdb.com{relative_link}"
            style = block.select_one("span")["style"]
            image_url = re.search(r"url\((.*?)\)", style).group(1)

            driver.get(detail_url)
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            detail_soup = BeautifulSoup(driver.page_source, "html.parser")

            title_element = detail_soup.select_one("h3.title-label")
            title = title_element.text.strip() if title_element else "N/A"

            type_elements = detail_soup.select(".col.s12.txt-paddings.tag-block-compact i")
            show_types = [elem.text.strip() for elem in type_elements]
            show_types = list(dict.fromkeys(show_types))  # remove duplicates

            date_blocks = detail_soup.select(".xt-main-title")
            opening_date = date_blocks[0].text.strip() if len(date_blocks) > 0 else "N/A"
            closing_date = date_blocks[1].text.strip() if len(date_blocks) > 1 else "N/A"

            performances = "N/A"
            performance_blocks = detail_soup.select("div.col.s7.m6.l7.txt-paddings.vertical-divider")
            for block_perf in performance_blocks:
                label = block_perf.select_one("div.xt-lable")
                if label and "Performances" in label.text:
                    value = block_perf.select_one("div.xt-main-title")
                    performances = value.text.strip() if value else "N/A"
                    break

            show_data = {
                "Title": title,
                "Image URL": image_url,
                "Detail Link": detail_url,
                "Opening Date": opening_date,
                "Closing Date": closing_date,
                "Type(s)": ", ".join(show_types),
                "Performances": performances
            }
            all_shows.append(show_data)
            print(f"[{len(all_shows)}] ✅ Collected: {title}")

            time.sleep(1)

        except Exception as e:
            print(f"[{i+1}] ❌ Error: {e}")
            continue

    driver.quit()
    return all_shows


def send_email(new_shows, sender_email, sender_password, recipient_email):
    subject = f"New IBDB Shows Detected: {len(new_shows)} New Show(s)"
    body = "New shows found:\n\n"
    for show in new_shows:
        body += f"- {show['Title']} (Opening: {show['Opening Date']})\n  Link: {show['Detail Link']}\n\n"

    msg = MIMEMultipart()
    msg['From'] = sender_email
    msg['To'] = recipient_email
    msg['Subject'] = subject
    msg.attach(MIMEText(body, 'plain'))

    try:
        # The context manager closes the connection even if login or sending fails
        with smtplib.SMTP('smtp.gmail.com', 587) as server:
            server.starttls()
            server.login(sender_email, sender_password)
            server.send_message(msg)
        print("✅ Notification email sent.")
    except Exception as e:
        print(f"❌ Failed to send email: {e}")

def save_data(df_or_list, base_dir="data",
              sender_email=None, sender_password=None, recipient_email=None):
    os.makedirs(base_dir, exist_ok=True)

    if isinstance(df_or_list, list):
        df = pd.DataFrame(df_or_list)
    else:
        df = df_or_list

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    snapshot_name = f"{base_dir}/ibdb_shows_{timestamp}.csv"
    master_path = f"{base_dir}/ibdb_master.csv"

    if os.path.exists(master_path):
        master_df = pd.read_csv(master_path)
        combined = pd.concat([master_df, df]).drop_duplicates(subset=["Title", "Opening Date"])
        # Detect new shows:
        new_shows_df = combined.merge(master_df, how='outer', indicator=True)
        new_shows_df = new_shows_df[new_shows_df['_merge'] == 'left_only']
        new_shows = new_shows_df.drop(columns=['_merge']).to_dict('records')
    else:
        combined = df
        new_shows = df.to_dict('records')

    combined.to_csv(master_path, index=False)
    df.to_csv(snapshot_name, index=False)
    print(f"✅ Data saved to:\n - {master_path}\n - {snapshot_name}")

    # Send email if new shows found & email info provided
    if new_shows and sender_email and sender_password and recipient_email:
        send_email(new_shows, sender_email, sender_password, recipient_email)

    print("Scrape done!\n")

def job():
    print("Starting scrape...")
    shows_list = scrape_ibdb()
    save_data(shows_list,
              sender_email="osuyaikechukwu7@gmail.com",
              sender_password="your_app_password_here",
              recipient_email="osuyaikechukwu7@gmail.com")


# Schedule the job every 3 minutes
schedule.every(3).minutes.do(job)

print("Scheduler started. Press Ctrl+C to stop.")

while True:
    schedule.run_pending()
    time.sleep(1)


### Where the CSV files get saved

By default, `save_data` writes its files to a folder named `data` inside the directory where you run the script. The master file, which accumulates every show seen so far, is:

In [None]:
data/ibdb_master.csv


### Timestamped snapshot file (for each run):

In [None]:
data/ibdb_shows_YYYYMMDD_HHMMSS.csv
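Since the timestamp in each snapshot name follows the `%Y%m%d_%H%M%S` format used in `save_data`, the names sort chronologically and can be parsed back into datetimes. The two helpers below are a sketch with hypothetical names (`snapshot_time`, `latest_snapshot`), using only the standard library.

```python
from datetime import datetime
from pathlib import Path

PREFIX = "ibdb_shows_"

def snapshot_time(path):
    """Parse the timestamp embedded in a snapshot filename."""
    stem = Path(path).stem            # e.g. "ibdb_shows_20250519_143000"
    stamp = stem[len(PREFIX):]        # e.g. "20250519_143000"
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")

def latest_snapshot(base_dir="data"):
    """Return the most recent snapshot path in base_dir, or None if there is none."""
    snapshots = sorted(Path(base_dir).glob(PREFIX + "*.csv"))
    return snapshots[-1] if snapshots else None

print(snapshot_time("data/ibdb_shows_20250519_143000.csv"))
```

The glob pattern only matches files starting with `ibdb_shows_`, so `ibdb_master.csv` is never returned as a snapshot.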
