# Hollywood Theatrical Market Synopsis 1995 to 2021

-- Synopsis of the North American (domestic) theatrical movie market

All movies released since 1995 are categorized according to the following attributes: creative type (factual, contemporary fiction, fantasy, etc.), source (book, play, original screenplay, etc.), genre (drama, horror, documentary, etc.), MPAA rating, production method (live action, digital animation, etc.) and distributor. To provide a fair comparison between movies released in different years, all rankings are based on ticket sales, which are calculated using the average ticket prices announced by the MPAA in its annual state of the industry report.
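The ticket counts in this dataset appear to follow the rule described above: box office gross divided by that year's average ticket price. A minimal sketch, using the 2021 and 2020 figures from the AnnualTicketSales preview further down; the function name `estimate_tickets` is made up for illustration:

```python
def estimate_tickets(gross_dollars, avg_ticket_price):
    """Estimate tickets sold as gross divided by the average ticket price."""
    return gross_dollars / avg_ticket_price

# (year, total box office, average ticket price, reported tickets sold),
# copied from the AnnualTicketSales preview
rows = [
    (2021, 3_881_777_912, 9.16, 423_774_881),
    (2020, 2_048_534_616, 9.16, 223_638_958),
]

for year, gross, price, reported in rows:
    estimate = estimate_tickets(gross, price)
    # Relative gap between the estimate and the reported ticket count
    gap = abs(estimate - reported) / reported
    print(year, round(estimate), reported, f"{gap:.1e}")
```

The estimates land very close to the reported counts but are not guaranteed to match exactly, presumably because of rounding in the source data.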

This dataset shows the growth of box office collections and ticket sales over time, as well as the decline from 2020 onward due to the Covid-19 pandemic. It can also be used to study which genres attract the largest audiences, which may help when planning a genre-specific plot.


Source: https://www.kaggle.com/johnharshith/hollywood-theatrical-market-synopsis-1995-to-2021


In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
# Tickets sold per year
TicketSales = pd.read_csv('AnnualTicketSales.csv')
# Top movies per year
HighestGross = pd.read_csv('HighestGrossers.csv')
# Rankings of top 9 creative types
PopularCreative = pd.read_csv('PopularCreativeTypes.csv')
# Top 10 distributors
TopDistr = pd.read_csv('TopDistributors.csv')
# Top 10 genres
TopGenre = pd.read_csv('TopGenres.csv')
# Top 8 MPAA ratings
TopGrossRate = pd.read_csv('TopGrossingRatings.csv')
# Top 10 movie sources
TopGrossSrc = pd.read_csv('TopGrossingSources.csv')
# Top 7 production methods
TopProdMethod = pd.read_csv('TopProductionMethods.csv')
# Number of releases per year by top 6 distributors and others (total)
WideRelease = pd.read_csv('WideReleasesCount.csv')


In [3]:
files = [TicketSales, HighestGross, PopularCreative, TopDistr, TopGenre, TopGrossRate, TopGrossSrc, TopProdMethod, WideRelease]
file_name = ['TicketSales', 'HighestGross', 'PopularCreative', 'TopDistr', 'TopGenre', 'TopGrossRate', 'TopGrossSrc', 'TopProdMethod', 'WideRelease']

In [4]:
# Preview the first two rows of every table
for i, (name, df) in enumerate(zip(file_name, files), start=1):
    print('File #', i, ':', name)
    display(df.head(2))
    print('─' * 100)

File # 1 : TicketSales


Unnamed: 0,YEAR,TICKETS SOLD,TOTAL BOX OFFICE,TOTAL INFLATION ADJUSTED BOX OFFICE,AVERAGE TICKET PRICE
0,2021,423774881,"$3,881,777,912","$3,881,777,912",$9.16
1,2020,223638958,"$2,048,534,616","$2,048,534,616",$9.16


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 2 : HighestGross


Unnamed: 0,YEAR,MOVIE,GENRE,MPAA RATING,DISTRIBUTOR,TOTAL FOR YEAR,TOTAL IN 2019 DOLLARS,TICKETS SOLD
0,1995,Batman Forever,Drama,PG-13,Warner Bros.,"$184,031,112","$387,522,978",42306002
1,1996,Independence Day,Adventure,PG-13,20th Century Fox,"$306,169,255","$634,504,608",69269062


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 3 : PopularCreative


Unnamed: 0,RANK,CREATIVE TYPES,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE
0,1.0,Contemporary Fiction,7442,"$96,203,727,036","$12,927,133",40.46%
1,2.0,Kids Fiction,564,"$32,035,539,746","$56,800,602",13.47%


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 4 : TopDistr


Unnamed: 0,RANK,DISTRIBUTORS,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE
0,1,Walt Disney,588,"$40,472,424,278","$68,830,654",17.02%
1,2,Warner Bros.,824,"$36,269,425,479","$44,016,293",15.25%


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 5 : TopGenre


Unnamed: 0,RANK,GENRES,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE
0,1,Adventure,1102,"$64,529,536,530","$58,556,748",27.14%
1,2,Action,1098,"$49,339,974,493","$44,936,224",20.75%


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 6 : TopGrossRate


Unnamed: 0,RANK,MPAA RATINGS,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE
0,1,PG-13,3243,"$113,524,789,243","$35,006,102",47.75%
1,2,R,5480,"$63,497,164,978","$11,587,074",26.71%


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 7 : TopGrossSrc


Unnamed: 0,RANK,SOURCES,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE
0,1,Original Screenplay,7946,"$106,375,196,782","$13,387,264",44.74%
1,2,Based on Fiction Book/Short Story,2150,"$47,005,613,207","$21,863,076",19.77%


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 8 : TopProdMethod


Unnamed: 0,RANK,PRODUCTION METHODS,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE
0,1,Live Action,14613,"$179,637,201,848","$12,292,972",75.56%
1,2,Animation/Live Action,264,"$30,346,622,254","$114,949,327",12.76%


────────────────────────────────────────────────────────────────────────────────────────────────────
File # 9 : WideRelease


Unnamed: 0,YEAR,WARNER BROS,WALT DISNEY,20TH CENTURY FOX,PARAMOUNT PICTURES,SONY PICTURES,UNIVERSAL,TOTAL MAJOR 6,TOTAL OTHER STUDIOS
0,2021,17,7,0,4,16,17,61,38
1,2020,5,3,1,3,9,13,34,23


────────────────────────────────────────────────────────────────────────────────────────────────────


# Data cleanup
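
The previews above show why cleanup is needed: dollar amounts and percentages carry `$`, `,` and `%` characters, so pandas reads those columns as text. A small sketch, using two rows copied from the TopDistr preview and an in-memory CSV (`sample_csv` is a stand-in for the real file):

```python
import io
import pandas as pd

# Two rows copied from the TopDistr preview; a stand-in for TopDistributors.csv
sample_csv = io.StringIO(
    'RANK,DISTRIBUTORS,MOVIES,TOTAL GROSS,AVERAGE GROSS,MARKET SHARE\n'
    '1,Walt Disney,588,"$40,472,424,278","$68,830,654",17.02%\n'
    '2,Warner Bros.,824,"$36,269,425,479","$44,016,293",15.25%\n'
)
sample = pd.read_csv(sample_csv)

# Columns that pandas could not parse as numbers and that need cleaning
text_columns = [col for col in sample.columns
                if not pd.api.types.is_numeric_dtype(sample[col])]
print(text_columns)
```

Apart from the name column, every column this lists needs its symbols stripped before it can be converted to a numeric dtype.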

In [28]:
# File #1: Annual Ticket Sales
# Drop columns that contain no data at all
TicketSales = TicketSales.dropna(axis=1, how='all')

# Strip '$' and ',' so the values can be converted to numbers;
# astype(str) keeps this working if a column was already parsed as numeric
for col in ['TICKETS SOLD', 'TOTAL BOX OFFICE', 'TOTAL INFLATION ADJUSTED BOX OFFICE', 'AVERAGE TICKET PRICE']:
    TicketSales[col] = TicketSales[col].astype(str).str.replace(r'[$,]', '', regex=True)

TicketSales = TicketSales.astype({"TICKETS SOLD":'int64', "TOTAL BOX OFFICE":'int64',
                                 "TOTAL INFLATION ADJUSTED BOX OFFICE":'int64', "AVERAGE TICKET PRICE":'float'})

In [149]:
# File #2: Highest Grossers
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['TOTAL FOR YEAR', 'TOTAL IN 2019 DOLLARS', 'TICKETS SOLD']:
    HighestGross[col] = HighestGross[col].astype(str).str.replace(r'[$,]', '', regex=True)
HighestGross = HighestGross.astype({"TOTAL FOR YEAR":'int64', "TOTAL IN 2019 DOLLARS":'int64',
                                 "TICKETS SOLD":'int64'})

In [69]:
# File #3: Popular Creative Types
# Keep only the integer part of the rank (e.g. '1.0' -> '1')
PopularCreative['RANK'] = PopularCreative['RANK'].astype(str).str.split('.').str[0]
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['MOVIES', 'TOTAL GROSS', 'AVERAGE GROSS']:
    PopularCreative[col] = PopularCreative[col].astype(str).str.replace(r'[$,]', '', regex=True)

PopularCreative.rename(columns={'MARKET SHARE': 'MARKET SHARE (%)'}, inplace=True)
PopularCreative['MARKET SHARE (%)'] = PopularCreative['MARKET SHARE (%)'].astype(str).str.replace('%', '', regex=False)

In [86]:
nan_value = float("NaN")
PopularCreative.replace("", nan_value, inplace=True)
PopularCreative.dropna(subset = ["CREATIVE TYPES"], inplace=True)
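
The cell above turns empty strings into NaN and then drops rows that have no creative type. A toy illustration of those two calls, with made-up rows rather than real data:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last one mimics a blank row
toy = pd.DataFrame({
    'CREATIVE TYPES': ['Contemporary Fiction', 'Kids Fiction', ''],
    'MOVIES': ['7442', '564', ''],
})

# Empty strings are not treated as missing until they are replaced by NaN
toy = toy.replace('', np.nan)
toy = toy.dropna(subset=['CREATIVE TYPES'])
print(len(toy))
```

Without the `replace` step, `dropna` would keep the blank row, because an empty string counts as a value.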

In [88]:
PopularCreative = PopularCreative.astype({"RANK":'int', "MOVIES":'int', "TOTAL GROSS":'int64', 
                                 "AVERAGE GROSS":'int64', "MARKET SHARE (%)":'float'})

In [99]:
# File #4: Top Distributors
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['TOTAL GROSS', 'AVERAGE GROSS']:
    TopDistr[col] = TopDistr[col].astype(str).str.replace(r'[$,]', '', regex=True)
TopDistr['MARKET SHARE'] = TopDistr['MARKET SHARE'].astype(str).str.replace('%', '', regex=False)

TopDistr.rename(columns={'MARKET SHARE': 'MARKET SHARE (%)'}, inplace=True)

TopDistr = TopDistr.astype({"TOTAL GROSS":'int64', "AVERAGE GROSS":'int64', "MARKET SHARE (%)":'float'})

In [106]:
# File #5: Top Genres
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['MOVIES', 'TOTAL GROSS', 'AVERAGE GROSS']:
    TopGenre[col] = TopGenre[col].astype(str).str.replace(r'[$,]', '', regex=True)
TopGenre['MARKET SHARE'] = TopGenre['MARKET SHARE'].astype(str).str.replace('%', '', regex=False)

TopGenre.rename(columns={'MARKET SHARE': 'MARKET SHARE (%)'}, inplace=True)

TopGenre = TopGenre.astype({"MOVIES":'int', "TOTAL GROSS":'int64', "AVERAGE GROSS":'int64', "MARKET SHARE (%)":'float'})

In [112]:
# File #6: Top Grossing Ratings
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['MOVIES', 'TOTAL GROSS', 'AVERAGE GROSS']:
    TopGrossRate[col] = TopGrossRate[col].astype(str).str.replace(r'[$,]', '', regex=True)
TopGrossRate['MARKET SHARE'] = TopGrossRate['MARKET SHARE'].astype(str).str.replace('%', '', regex=False)

TopGrossRate.rename(columns={'MARKET SHARE': 'MARKET SHARE (%)'}, inplace=True)

TopGrossRate = TopGrossRate.astype({"MOVIES":'int', "TOTAL GROSS":'int64', "AVERAGE GROSS":'int64', "MARKET SHARE (%)":'float'})

In [118]:
# File #7: Top Grossing Sources
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['MOVIES', 'TOTAL GROSS', 'AVERAGE GROSS']:
    TopGrossSrc[col] = TopGrossSrc[col].astype(str).str.replace(r'[$,]', '', regex=True)
TopGrossSrc['MARKET SHARE'] = TopGrossSrc['MARKET SHARE'].astype(str).str.replace('%', '', regex=False)

TopGrossSrc.rename(columns={'MARKET SHARE': 'MARKET SHARE (%)'}, inplace=True)

TopGrossSrc = TopGrossSrc.astype({"MOVIES":'int', "TOTAL GROSS":'int64', "AVERAGE GROSS":'int64', "MARKET SHARE (%)":'float'})

In [123]:
# File #8: Top Production Methods
# Strip '$' and ','; astype(str) also covers columns already parsed as numeric
for col in ['MOVIES', 'TOTAL GROSS', 'AVERAGE GROSS']:
    TopProdMethod[col] = TopProdMethod[col].astype(str).str.replace(r'[$,]', '', regex=True)
TopProdMethod['MARKET SHARE'] = TopProdMethod['MARKET SHARE'].astype(str).str.replace('%', '', regex=False)

TopProdMethod.rename(columns={'MARKET SHARE': 'MARKET SHARE (%)'}, inplace=True)

TopProdMethod = TopProdMethod.astype({"MOVIES":'int', "TOTAL GROSS":'int64', "AVERAGE GROSS":'int64', "MARKET SHARE (%)":'float'})
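
Files 3 to 8 share the same column layout, so the repeated steps above could be folded into one function. A possible sketch; `clean_ranking_table` is a hypothetical helper, not part of the original notebook, and the sample row is copied from the TopProdMethod preview:

```python
import pandas as pd

def clean_ranking_table(df):
    """Strip '$', ',' and '%' from the shared columns and convert them to numbers."""
    out = df.copy()
    for col in ['MOVIES', 'TOTAL GROSS', 'AVERAGE GROSS']:
        # astype(str) lets the same code handle already-numeric columns
        out[col] = out[col].astype(str).str.replace(r'[$,]', '', regex=True).astype('int64')
    # Remove the old column and add it back under a name that states the unit
    share = out.pop('MARKET SHARE').astype(str).str.replace('%', '', regex=False)
    out['MARKET SHARE (%)'] = share.astype(float)
    return out

# One row copied from the TopProdMethod preview
raw = pd.DataFrame({
    'RANK': [1],
    'PRODUCTION METHODS': ['Live Action'],
    'MOVIES': [14613],
    'TOTAL GROSS': ['$179,637,201,848'],
    'AVERAGE GROSS': ['$12,292,972'],
    'MARKET SHARE': ['75.56%'],
})
cleaned = clean_ranking_table(raw)
print(cleaned.dtypes)
```

Because the helper works on a copy and returns a new frame, the raw table stays available for comparison.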

In [131]:
# File #9: Wide Releases 
WideRelease = WideRelease.dropna(axis=1, how='all')
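
`dropna(axis=1, how='all')` removes only columns in which every value is missing; columns that hold any data are kept. A toy example, where `EMPTY` is a made-up all-NaN column and the other values come from the WideRelease preview:

```python
import numpy as np
import pandas as pd

# 'EMPTY' is a made-up column used only to show the effect
toy = pd.DataFrame({
    'YEAR': [2021, 2020],
    'WARNER BROS': [17, 5],
    'EMPTY': [np.nan, np.nan],
})

trimmed = toy.dropna(axis=1, how='all')
print(list(trimmed.columns))
```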

In [145]:
# Overwrite the original CSV files with the cleaned data
TicketSales.to_csv('AnnualTicketSales.csv', index=False, header=True)
HighestGross.to_csv('HighestGrossers.csv', index=False, header=True)
PopularCreative.to_csv('PopularCreativeTypes.csv', index=False, header=True)
TopDistr.to_csv('TopDistributors.csv', index=False, header=True)
TopGenre.to_csv('TopGenres.csv', index=False, header=True)
TopGrossRate.to_csv('TopGrossingRatings.csv', index=False, header=True)
TopGrossSrc.to_csv('TopGrossingSources.csv', index=False, header=True)
TopProdMethod.to_csv('TopProductionMethods.csv', index=False, header=True)
WideRelease.to_csv('WideReleasesCount.csv', index=False, header=True)
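
Because the cell above replaces the source files, the raw data is no longer on disk afterwards. One alternative, sketched here on the assumption that a separate output location is acceptable, is to write cleaned copies elsewhere and confirm that the numeric dtypes survive the round trip. The temporary folder, the file name and the `cleaned` frame below are stand-ins:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a cleaned table; values copied from the TicketSales preview
cleaned = pd.DataFrame({
    'YEAR': [2021, 2020],
    'TICKETS SOLD': [423774881, 223638958],
    'AVERAGE TICKET PRICE': [9.16, 9.16],
})

with tempfile.TemporaryDirectory() as tmp:
    # A temporary folder is used here; a real run would pick a permanent one
    out_path = Path(tmp) / 'AnnualTicketSales_clean.csv'
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# After cleaning, read_csv parses the columns as numbers without extra work
print(reloaded.dtypes)
```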