## Import Statements

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import warnings

## Notebook Settings

In [8]:
warnings.filterwarnings("ignore") # hide warnings
plt.style.use('ggplot')

## Data Dictionary

`grosses.csv` contains weekly box office grosses from [playbill.com](https://www.playbill.com/grosses).

For details on how this data was scraped using R, see https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-04-28.

| variable             | description                                                  |
| :------------------- |  :----------------------------------------------------------- |
| week_ending          |  Date of the end of the weekly measurement period. Always a Sunday. |
| week_number          |  Week number in the Broadway season. The season starts after the Tony Awards, held in early June. Some seasons have 53 weeks. |
| weekly_gross_overall |  Weekly box office gross for all shows                        |
| show                 | Name of show. Some shows have the same name, but multiple runs. |
| theatre              |  Name of theatre                                              |
| weekly_gross         |  Weekly box office gross for individual show/theatre                 |
| potential_gross      | Weekly box office gross if all seats are sold at full price. Shows can exceed their potential gross by selling premium tickets and/or standing room tickets. |
| avg_ticket_price     |  Average price of tickets sold                                |
| top_ticket_price     |  Highest price of tickets sold                                |
| seats_sold           |  Total seats sold for all performances and previews           |
| seats_in_theatre     |  Theatre seat capacity                                        |
| pct_capacity         |  Percent of theatre capacity sold. Shows can exceed 100% capacity by selling standing room tickets. |
| performances         |  Number of performances in the week                           |
| previews             |  Number of preview performances in the week. Previews occur before a show's official opening. |
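
The descriptions above hint at how some columns relate to each other. As a rough sanity check, the sketch below recomputes `pct_capacity` and `avg_ticket_price` for three rows copied from the first week of the data set (week ending 1985-06-09). The two formulas are assumptions, not something the source states: `pct_capacity` is taken to be `seats_sold` divided by the seats available over all performances and previews, and `avg_ticket_price` is taken to be `weekly_gross` divided by `seats_sold`. The names `sample`, `pct_capacity_check` and `avg_price_check` are illustrative only.

```python
import pandas as pd

# Three rows copied from the week ending 1985-06-09.
sample = pd.DataFrame({
    "show": ["42nd Street", "A Chorus Line", "Aren't We All?"],
    "weekly_gross": [282368.0, 222584.0, 249272.0],
    "avg_ticket_price": [30.42, 27.25, 33.75],
    "seats_sold": [9281, 8167, 7386],
    "seats_in_theatre": [1655, 1472, 1088],
    "pct_capacity": [0.701, 0.6935, 0.8486],
    "performances": [8, 8, 8],
    "previews": [0, 0, 0],
})

# Assumption: capacity is counted over performances plus previews.
seats_available = sample.seats_in_theatre * (sample.performances + sample.previews)
sample["pct_capacity_check"] = (sample.seats_sold / seats_available).round(4)

# Assumption: the average ticket price is the gross divided by seats sold.
sample["avg_price_check"] = (sample.weekly_gross / sample.seats_sold).round(2)

print(sample[["show", "pct_capacity", "pct_capacity_check",
              "avg_ticket_price", "avg_price_check"]])
```

If the recomputed columns agree with the reported ones on these rows but not on the full file, the assumed formulas probably do not hold everywhere (for example in weeks with missing or zero values).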

## Read the Data

In [9]:
filepath = 'grosses.csv'
df = pd.read_csv(filepath, header=0)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47524 entries, 0 to 47523
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   week_ending           47524 non-null  object 
 1   week_number           47524 non-null  int64  
 2   weekly_gross_overall  47524 non-null  float64
 3   show                  47524 non-null  object 
 4   theatre               47524 non-null  object 
 5   weekly_gross          47524 non-null  float64
 6   potential_gross       34911 non-null  float64
 7   avg_ticket_price      47524 non-null  float64
 8   top_ticket_price      36167 non-null  float64
 9   seats_sold            47524 non-null  int64  
 10  seats_in_theatre      47524 non-null  int64  
 11  pct_capacity          47524 non-null  float64
 12  performances          47524 non-null  int64  
 13  previews              47524 non-null  int64  
dtypes: float64(6), int64(5), object(3)
memory usage: 5.1+ MB
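
The `df.info()` output shows that `week_ending` is stored as `object` (plain strings) and that `potential_gross` and `top_ticket_price` contain missing values. A minimal sketch of converting the dates and counting missing values per column follows. It uses a small hand-made frame (`sample`, an illustrative name) instead of `grosses.csv`; its two rows are based on the first and last weeks of the data.

```python
import pandas as pd

sample = pd.DataFrame({
    "week_ending": ["1985-06-09", "2020-03-01"],
    "show": ["42nd Street", "Wicked"],
    "potential_gross": [None, 1779845.0],
    "top_ticket_price": [None, 250.0],
})

# Convert the date strings to datetime64 so they can be sorted and filtered as dates.
sample["week_ending"] = pd.to_datetime(sample["week_ending"], format="%Y-%m-%d")

# Count missing values per column.
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)

# The data dictionary says week_ending is always a Sunday; this checks the sample rows.
print(sample["week_ending"].dt.day_name().tolist())
```

On the real file, passing `parse_dates=["week_ending"]` to `pd.read_csv` should have the same effect as the explicit conversion.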


In [12]:
df.head()

|   | week_ending | week_number | weekly_gross_overall | show | theatre | weekly_gross | potential_gross | avg_ticket_price | top_ticket_price | seats_sold | seats_in_theatre | pct_capacity | performances | previews |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1985-06-09 | 1 | 3915937.0 | 42nd Street | St. James Theatre | 282368.0 | NaN | 30.42 | NaN | 9281 | 1655 | 0.701 | 8 | 0 |
| 1 | 1985-06-09 | 1 | 3915937.0 | A Chorus Line | Sam S. Shubert Theatre | 222584.0 | NaN | 27.25 | NaN | 8167 | 1472 | 0.6935 | 8 | 0 |
| 2 | 1985-06-09 | 1 | 3915937.0 | Aren't We All? | Brooks Atkinson Theatre | 249272.0 | NaN | 33.75 | NaN | 7386 | 1088 | 0.8486 | 8 | 0 |
| 3 | 1985-06-09 | 1 | 3915937.0 | Arms and the Man | Circle in the Square Theatre | 95688.0 | NaN | 20.87 | NaN | 4586 | 682 | 0.8405 | 8 | 0 |
| 4 | 1985-06-09 | 1 | 3915937.0 | As Is | Lyceum Theatre | 61059.0 | NaN | 20.78 | NaN | 2938 | 684 | 0.5369 | 8 | 0 |


In [13]:
df.tail()

|   | week_ending | week_number | weekly_gross_overall | show | theatre | weekly_gross | potential_gross | avg_ticket_price | top_ticket_price | seats_sold | seats_in_theatre | pct_capacity | performances | previews |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 47519 | 2020-03-01 | 40 | 26109896.25 | The Phantom of the Opera | Majestic Theatre | 639215.93 | 1358986.0 | 72.18 | 213.0 | 8856 | 1605 | 0.6897 | 8 | 0 |
| 47520 | 2020-03-01 | 40 | 26109896.25 | Tina: The Tina Turner Musical | Lunt-Fontanne Theatre | 1320766.0 | 1566688.0 | 132.02 | 297.0 | 10004 | 1478 | 0.8461 | 8 | 0 |
| 47521 | 2020-03-01 | 40 | 26109896.25 | To Kill A Mockingbird | Sam S. Shubert Theatre | 1132278.54 | 1549625.0 | 115.41 | 423.0 | 9811 | 1435 | 0.9767 | 7 | 0 |
| 47522 | 2020-03-01 | 40 | 26109896.25 | West Side Story | Broadway Theatre | 1598947.32 | 1722464.0 | 114.87 | 373.0 | 13920 | 1740 | 1.0 | 8 | 0 |
| 47523 | 2020-03-01 | 40 | 26109896.25 | Wicked | Gershwin Theatre | 1202089.5 | 1779845.0 | 96.33 | 250.0 | 12479 | 1807 | 0.8632 | 8 | 0 |


In [15]:
# How many theatres in total?
print(len(df.theatre.unique()))

58


In [16]:
all_theatres = df.theatre.unique()
for theatre in all_theatres:
    print(theatre)

St. James Theatre
Sam S. Shubert Theatre
Brooks Atkinson Theatre
Circle in the Square Theatre
Lyceum Theatre
Eugene O'Neill Theatre
Neil Simon Theatre
46th Street Theatre
Winter Garden Theatre
Ritz Theatre
Mark Hellinger Theatre
Palace Theatre
Ambassador Theatre
Edison Theatre
Gershwin Theatre
Booth Theatre
Broadway Theatre
Broadhurst Theatre
Minskoff Theatre
Royale Theatre
Plymouth Theatre
Lunt-Fontanne Theatre
Helen Hayes Theatre
Biltmore Theatre
Imperial Theatre
John Golden Theatre
Music Box Theatre
Nederlander Theatre
Ethel Barrymore Theatre
Longacre Theatre
Virginia Theatre
Jack Lawrence Theatre
Vivian Beaumont Theater
Marquis Theatre
Cort Theatre
Martin Beck Theatre
Majestic Theatre
Criterion Center Stage Right
Belasco Theatre
Richard Rodgers Theatre
Walter Kerr Theatre
Comedy Theatre
New Amsterdam Theatre
Ford Center for the Performing Arts
Studio 54
American Airlines Theatre
Henry Miller's Theatre
Al Hirschfeld Theatre
Gerald Schoenfeld Theatre
Hilton Theatre
Bernard B. Jacobs 

### Observation

According to this [Wikipedia article](https://en.wikipedia.org/wiki/Broadway_theatre), Broadway theatre, or Broadway, refers to the theatrical performances presented in the 41 professional theatres located in the Theater District and Lincoln Center along Broadway, in Midtown Manhattan, New York City.

In our data set we have 58 theatre names. Let's compare our list of theatres to Wikipedia's list.
We'll first read the data we scraped from Wikipedia.

In [18]:
broadway_df = pd.read_csv('broadway_theatres.csv', header=0)

In [21]:
broadway_df

|   | theatre | address | capacity | owner |
| :--- | :--- | :--- | :--- | :--- |
| 0 | Al Hirschfeld Theatre | W. 45th St. (No. 302) | 1424 | Jujamcyn Theaters |
| 1 | Ambassador Theatre | W. 49th St. (No. 219) | 1125 | Shubert Organization |
| 2 | American Airlines Theatre | W. 42nd St. (No. 227) | 740 | Roundabout Theatre Company |
| 3 | August Wilson Theatre | W. 52nd St. (No. 245) | 1228 | Jujamcyn Theaters |
| 4 | Belasco Theatre | W. 44th St. (No. 111) | 1018 | Shubert Organization |
| 5 | Bernard B. Jacobs Theatre | W. 45th St. (No. 242) | 1078 | Shubert Organization |
| 6 | Booth Theatre | W. 45th St. (No. 222) | 766 | Shubert Organization |
| 7 | Broadhurst Theatre | W. 44th St. (No. 235) | 1186 | Shubert Organization |
| 8 | Broadway Theatre | W. 53rd St & Broadway (No. 1681) | 1761 | Shubert Organization |
| 9 | Circle in the Square Theatre | W. 50th St. (No. 235) | 840 | Independent |

In [25]:
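# Theatre names from our data set that are missing from Wikipedia's list
missing_theatres = [t for t in all_theatres if t not in broadway_df.theatre.values]
for count, theatre in enumerate(missing_theatres):
    print(count, ':', theatre)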

0 : Sam S. Shubert Theatre
1 : Brooks Atkinson Theatre
2 : 46th Street Theatre
3 : Ritz Theatre
4 : Mark Hellinger Theatre
5 : Edison Theatre
6 : Royale Theatre
7 : Plymouth Theatre
8 : Helen Hayes Theatre
9 : Biltmore Theatre
10 : Virginia Theatre
11 : Jack Lawrence Theatre
12 : Cort Theatre
13 : Martin Beck Theatre
14 : Criterion Center Stage Right
15 : Comedy Theatre
16 : Ford Center for the Performing Arts
17 : Henry Miller's Theatre
18 : Hilton Theatre
19 : Foxwoods Theatre
20 : Helen Hayes Theater


### Theatres in our data set that were renamed or closed:

* Brooks Atkinson Theatre - renamed [Lena Horne Theatre](https://en.wikipedia.org/wiki/Lena_Horne_Theatre)
* 46th Street Theatre - renamed [Richard Rodgers Theatre](https://en.wikipedia.org/wiki/Richard_Rodgers_Theatre) in 1990
* Ritz Theatre - renamed [Walter Kerr Theatre](https://en.wikipedia.org/wiki/Walter_Kerr_Theatre) in 1990
* [Mark Hellinger Theatre](https://en.wikipedia.org/wiki/Mark_Hellinger_Theatre) - formerly the 51st Street Theatre and the Hollywood Theatre
* [Edison Theatre](https://en.wikipedia.org/wiki/Edison_Theatre) - closed February 24, 1991
* Royale Theatre - renamed [Bernard B. Jacobs Theatre](https://en.wikipedia.org/wiki/Bernard_B._Jacobs_Theatre)
* Plymouth Theatre - renamed [Gerald Schoenfeld Theatre](https://en.wikipedia.org/wiki/Gerald_Schoenfeld_Theatre)
* Biltmore Theatre - renamed [Samuel J. Friedman Theatre](https://en.wikipedia.org/wiki/Samuel_J._Friedman_Theatre)
* Virginia Theatre - renamed [August Wilson Theatre](https://en.wikipedia.org/wiki/August_Wilson_Theatre)
* Jack Lawrence Theatre - closed (?), see this [BroadwayWorld thread](https://www.broadwayworld.com/board/readmessage.php?thread=873297)
* Cort Theatre - renamed [James Earl Jones Theatre](https://en.wikipedia.org/wiki/James_Earl_Jones_Theatre)
* Martin Beck Theatre - renamed [Al Hirschfeld Theatre](https://en.wikipedia.org/wiki/Al_Hirschfeld_Theatre)
* Criterion Center Stage Right - renamed (?) [Olympia Theatre](https://en.wikipedia.org/wiki/Olympia_Theatre_(New_York_City))
* [Comedy Theatre](https://en.wikipedia.org/wiki/Comedy_Theatre_(New_York_City)) - closed 1942 (?)
* Henry Miller's Theatre - renamed [Stephen Sondheim Theatre](https://en.wikipedia.org/wiki/Stephen_Sondheim_Theatre)
* Ford Center for the Performing Arts - renamed [Lyric Theatre](https://en.wikipedia.org/wiki/Lyric_Theatre_(New_York_City,_1998))
* Hilton Theatre - renamed [Lyric Theatre](https://en.wikipedia.org/wiki/Lyric_Theatre_(New_York_City,_1998))
* Foxwoods Theatre - renamed [Lyric Theatre](https://en.wikipedia.org/wiki/Lyric_Theatre_(New_York_City,_1998))

### Theatre name in our data set vs theatre name in Wikipedia's list:

1. "Sam S. Shubert Theatre" vs "Shubert Theatre"
2. "Brooks Atkinson Theatre" vs "Lena Horne Theatre"
3. "46th Street Theatre" vs "Richard Rodgers Theatre"
4. "Ritz Theatre" vs "Walter Kerr Theatre"
5. "Royale Theatre" vs "Bernard B. Jacobs Theatre"
6. "Plymouth Theatre" vs "Gerald Schoenfeld Theatre"
7. "Helen Hayes Theatre" vs "Hayes Theater"
8. "Helen Hayes Theater" vs "Hayes Theater"
9. "Biltmore Theatre" vs "Samuel J. Friedman Theatre"
10. "Virginia Theatre" vs "August Wilson Theatre"
11. "Cort Theatre" vs "James Earl Jones Theatre"
12. "Martin Beck Theatre" vs "Al Hirschfeld Theatre"
13. "Ford Center for the Performing Arts" vs "Lyric Theatre"
14. "Henry Miller's Theatre" vs "Stephen Sondheim Theatre"
15. "Hilton Theatre" vs "Lyric Theatre"
16. "Foxwoods Theatre" vs "Lyric Theatre"
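
The pairs above were matched by hand. Spelling variants (as opposed to true renames) can also be suggested automatically with approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; it is a swapped-in technique, not the method used in this notebook, and the `cutoff=0.8` threshold and the short `wikipedia_names` list are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# A few names from Wikipedia's list (illustrative subset).
wikipedia_names = ["Hayes Theater", "Shubert Theatre", "Lyric Theatre",
                   "Lena Horne Theatre"]

# Names from our data set that had no exact match.
unmatched = ["Helen Hayes Theater", "Sam S. Shubert Theatre",
             "Brooks Atkinson Theatre"]

for name in unmatched:
    # Return at most one candidate whose similarity ratio is at least 0.8.
    candidates = get_close_matches(name, wikipedia_names, n=1, cutoff=0.8)
    print(name, "->", candidates)
```

With this threshold the two spelling variants find their counterparts, while a true rename such as "Brooks Atkinson Theatre" returns no candidate. Renames therefore still need a manual lookup like the one above.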

In [49]:
# Take a closer look at Edison Theatre 
df[df.theatre == "Edison Theatre"].head(1)

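|   | week_ending | week_number | weekly_gross_overall | show | theatre | weekly_gross | potential_gross | avg_ticket_price | top_ticket_price | seats_sold | seats_in_theatre | pct_capacity | performances | previews |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 13 | 1985-06-09 | 1 | 3915937.0 | Oh! Calcutta! | Edison Theatre | 32330.0 | NaN | 14.22 | NaN | 2273 | 499 | 0.5061 | 9 | 0 |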


In [50]:
df[df.theatre == "Edison Theatre"].tail(1)

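|   | week_ending | week_number | weekly_gross_overall | show | theatre | weekly_gross | potential_gross | avg_ticket_price | top_ticket_price | seats_sold | seats_in_theatre | pct_capacity | performances | previews |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 5959 | 1991-02-24 | 38 | 4438845.0 | Those Were the Days | Edison Theatre | 75580.0 | NaN | 21.76 | NaN | 3473 | 499 | 0.87 | 8 | 0 |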


The last recorded week for the Edison Theatre ends on 1991-02-24, which matches the closing date of February 24, 1991 noted above.
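
The same first-and-last-week check can be run for every theatre at once by grouping on `theatre`. A minimal sketch, using four rows whose dates are taken from the outputs in this notebook in place of the full `df` (`sample` and `span` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({
    "week_ending": ["1985-06-09", "1991-02-24", "1985-06-09", "2020-03-01"],
    "theatre": ["Edison Theatre", "Edison Theatre",
                "Sam S. Shubert Theatre", "Sam S. Shubert Theatre"],
})

# Parse the dates so that min and max compare dates rather than strings.
sample["week_ending"] = pd.to_datetime(sample["week_ending"])

# Earliest and latest recorded week for each theatre name.
span = sample.groupby("theatre")["week_ending"].agg(["min", "max"])
print(span)
```

A theatre name whose latest week falls well before the end of the data is a candidate for having been closed or renamed.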

## Conclusion

Of the 58 theatre names in our dataset, 21 do not appear in Wikipedia's list.
Investigating further, we find that:
* 1 is the name of a theatre that has closed: Edison Theatre
* 1 is the name of a theatre that should already have been closed during the period covered: Comedy Theatre (error in data collection?)
* 16 are names of theatres that have been renamed or whose fate is unclear (13 confirmed renames, plus Mark Hellinger Theatre, Jack Lawrence Theatre and Criterion Center Stage Right)
* 3 are names that are spelled slightly differently: Sam S. Shubert Theatre, Helen Hayes Theatre and Helen Hayes Theater


In [1]:
# Map each historical or variant theatre name to the name used in Wikipedia's list
rename_theatres = {
    "Sam S. Shubert Theatre": "Shubert Theatre",
    "Brooks Atkinson Theatre": "Lena Horne Theatre",
    "46th Street Theatre": "Richard Rodgers Theatre",
    "Ritz Theatre": "Walter Kerr Theatre",
    "Royale Theatre": "Bernard B. Jacobs Theatre",
    "Plymouth Theatre": "Gerald Schoenfeld Theatre",
    "Helen Hayes Theatre": "Hayes Theater",
    "Helen Hayes Theater": "Hayes Theater",
    "Biltmore Theatre": "Samuel J. Friedman Theatre",
    "Virginia Theatre": "August Wilson Theatre",
    "Cort Theatre": "James Earl Jones Theatre",
    "Martin Beck Theatre": "Al Hirschfeld Theatre",
    "Ford Center for the Performing Arts": "Lyric Theatre",
    "Henry Miller's Theatre": "Stephen Sondheim Theatre",
    "Hilton Theatre": "Lyric Theatre",
    "Foxwoods Theatre": "Lyric Theatre",
}
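
The mapping above is only defined here, not yet applied. One way to apply it is `Series.replace`, sketched below on a small hypothetical frame with a shortened copy of the mapping. The names `rename_sample`, `sample` and `theatre_current` are illustrative and not part of the original notebook.

```python
import pandas as pd

# A shortened copy of the mapping, for illustration only.
rename_sample = {
    "Royale Theatre": "Bernard B. Jacobs Theatre",
    "Helen Hayes Theatre": "Hayes Theater",
    "Helen Hayes Theater": "Hayes Theater",
}

sample = pd.DataFrame({"theatre": ["Royale Theatre", "Helen Hayes Theatre",
                                   "Helen Hayes Theater", "Booth Theatre"]})

# Names found in the mapping are replaced; all other names are left unchanged.
sample["theatre_current"] = sample["theatre"].replace(rename_sample)

print(sample)
print(sample["theatre_current"].nunique())
```

Writing the result to a new column rather than overwriting `theatre` keeps the historical name available, which may matter when a show needs to be tied to the name the venue had at the time.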