# About

The dataset comprises box office data and supplemental information for all theatrically released films adapted from Marvel Comics and DC Comics core superhero universes. TV specials and other projects that did not receive a wide theatrical release are not included.

Column explanation and suggested use:

- Film: The title of the film in the U.S. market. Note that this may differ in other territories. For example, in the United Kingdom The Avengers was retitled Avengers Assemble to distinguish it from the unrelated television series of the same name.

- U.S. release date: The first day the film was available in theatres in the United States of America to the general public.

- Box office gross Domestic (U.S. and Canada): The total gross earnings of the film in U.S. dollars in Hollywood’s domestic market (comprising the United States of America and Canada). Typically, the distributor for a motion picture receives slightly more than half of the final gross with the remainder going to the exhibitor (i.e. the cinema) across the theatrical window.

- Box office gross Other territories: The total gross earnings of the film in U.S. dollars in Hollywood’s international market (comprising any countries the film released in except the USA and Canada). We might note that the split between distributors and exhibitors can be more variable than within the domestic market, but this is granular detail beyond the present analysis.

- Box office gross Worldwide: The sum total gross earnings for all territories in U.S. dollars.

- Budget: The production budget for the movie in U.S. dollars. This does not include any additional expenses relating to the movie beyond what it cost to make, most notably the marketing budget which can equal or even exceed the production budget on major motion pictures.

- MCU: Indicates whether the film is part of the Marvel Cinematic Universe. Recorded as a Boolean value, with TRUE indicating that the film is part of the MCU. Note that some pre-existing films (such as 20th Century Fox X-Men films and Sony Pictures Spider-Man films) have been retroactively made part of the MCU’s multiverse (separate continuities that exist within a larger continuity superstructure). This has mostly been driven by corporate acquisitions and mergers, but not entirely. There remain questions over the degree of interconnectedness for certain movies, particularly Sony’s films based on Spider-Man-related characters. To avoid confusion, this value is only given as TRUE when a film satisfies both of the following criteria: (a) it was part of the MCU at the time of release, and (b) it is part of the “main” MCU timeline, elsewhere known as “The Sacred Timeline”.

- Phase: States which phase of the MCU the film belongs to (NA if not applicable). “Phase” was originally used as an internal planning term at Marvel Studios, but has since become widely known and used by the general public. Noting the specific phases can be helpful in understanding public perception of the brand/franchise and public opinion on its health and reliability.

- Distributor: The name of the film studio that distributed the movie within the United States. You may wish to note that several of the listed distributors are divisions of larger studios. Twentieth Century Fox was rebranded as 20th Century Studios following its acquisition by the Walt Disney Company. Columbia Pictures is a division of Sony Pictures Entertainment. New Line Cinema merged with what is now Warner Bros. Discovery in 2008 (then Time-Warner).

- MPAA Rating: The age rating awarded under the Motion Picture Association film rating system. Note that age ratings can vary in other markets (as can the final cut of the film). The current available classifications from the MPA are:

    - G (General Audiences). All ages admitted. Nothing that would offend parents for viewing by children.
    - PG (Parental Guidance Suggested). Some material may not be suitable for children. Parents urged to give “parental guidance”. May contain some material parents might not like for their young children.
    - PG-13 (Parents Strongly Cautioned). Parents are urged to be cautious. Some material may be inappropriate for pre-teenagers.
    - R (Restricted). Under 17 requires accompanying parent or adult guardian. Contains some adult material. Parents are urged to learn more about the film before taking their young children with them.
    - NC-17 (No One 17 and Under Admitted). Clearly adult. Children are not admitted.
    
This information can help with understanding the demographics of the audience, most notably by who would be excluded. It can also provide insight into the commercial viability of films given their relative MPAA rating. It may also help the user understand the relative maturity of a given film, though this is very much in terms of what the MPAA deems suitable for a given age rather than say the themes explored by the film.

- Length: The length of the U.S. theatrical cut of the film given in hours, minutes and seconds, rounded to the nearest minute. This may differ from edits of the film shown in other markets or in secondary markets such as television, streaming and physical media releases.

- Minutes: The same as above but presented only in minutes.

- Source: The sources consulted for all aforementioned box-office figures and production budgets. Further information on my process can be found under Provenance.

- Character family: The character family that the protagonist(s) of the film fall under, primarily gauged by title with secondary reference to how the character(s) are divided according to comic book editorial team, and film continuity (for example Venom existing in a separate film continuity to Spider-Man despite originating in Spider-Man comics as an antagonist). This is somewhat subjective, but useful for identifying films that are related to each other by leading character(s), but exist across multiple often unrelated series and even studios. It can be assumed that a portion of the audience will be aware of other films that feature the character(s) and that the quality of previous instalments may have a bearing on the success of future instalments whether or not they exist in the same fictional continuity. It may also be a helpful data point to determine how much the MCU branding can affect the fortunes of films featuring a particular character. This category excludes cross-over and cameo characters, only the protagonist(s) is used to determine character family. There is therefore the potential for this to be somewhat misleading in isolation, for example, Captain America: Civil War positions Captain America as its protagonist, Iron Man as deuteragonist/antagonist, features Spider-Man, and could be considered an Avengers movie based on how prominently that set of characters feature but is here listed simply as part of the Captain America character family.

- Domestic %: The percentage of the worldwide gross made up by the domestic (U.S. and Canada) gross. This may be a helpful metric for gauging relative popularity outside North America.

- Gross to budget: The ratio between the worldwide gross and the production budget. A higher ratio would be indicative of greater profitability.

- Rotten Tomatoes Critic Score: The average score from all professional critics on the “Tomatometer” from the popular review aggregator website. More information on Rotten Tomatoes curation and process can be found on their website. This figure has been included to give some insight into the general critical consensus of the movie, but is of course extremely reductive in isolation. It may be a useful measure for determining how “critic-proof” these films are, i.e. is there a strong relationship between critic score and box office performance? Elsewhere referred to as RT Score.

- Male/Female-led: This column records whether the film had a male or female lead, or whether women and men co-star roughly equally. This is a highly subjective measure, based on the assumed gender of the protagonist of the film. It may have some value in analysing box office performance of superhero films based on the gender of the lead. It can also be revealing in how male-dominated the genre is (in terms of characters featured as protagonists).

- Year: The calendar year the film was released theatrically in the United States of America.

- Inflation Adjusted Worldwide Gross: The worldwide gross adjusted for inflation, given in 2023 U.S. dollars. This calculation was derived from U.S. Consumer Price Index (CPI) data, and should be considered a rough estimate. It is a useful metric for looking at the relative box office success of these pictures irrespective of when they were released. Without this data point for comparison, more recent movies tend to look more successful than older films due to inflation.

- Inflation Adjusted Budget: The production budget adjusted for inflation, given in 2023 U.S. dollars. Calculated using CPI data as above.

- 2.5x prod: The production budget multiplied by 2.5. This is a common rule of thumb for determining whether a theatrical release achieved profitability for the distributor after marketing costs and exhibitors have taken their cut. It is a rough estimate, and does not take into account other income streams such as merchandise sales, home video releases, streaming, etc.

- Break Even: Determines whether the film reached profitability by checking whether the worldwide gross exceeded two and a half times the production budget (the rule of thumb outlined above). This is a binary distinction: a single dollar over the 2.5x value will register as a success. This column is primarily useful for determining if a film can be considered a flop. Even if the rule of thumb does not necessarily hold true in each case in practice, the public’s awareness of the rule can help us understand which films are perceived by members of the public as flops. The degree of success (or failure) can be more accurately gauged by looking at the grosses and the Gross to Budget column.
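
The derived columns described in the list above (Domestic %, Gross to Budget, 2.5x prod, Break Even and the inflation adjustment) can be sketched roughly as follows. This is an illustrative reconstruction from the column descriptions, not necessarily how the dataset's author computed them: the function names, the shortened column names and the CPI arguments are assumptions, and the rounding used in the published CSV may differ.

```python
import pandas as pd

def add_derived_columns(films: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `films` with the derived metrics added.

    Expects numeric 'Domestic', 'Worldwide' and 'Budget' columns
    (hypothetical short names used only for this sketch).
    """
    out = films.copy()
    # Share of the worldwide gross earned in the U.S. and Canada, in percent.
    out['Domestic %'] = 100 * out['Domestic'] / out['Worldwide']
    # Worldwide gross per dollar of production budget.
    out['Gross to Budget'] = out['Worldwide'] / out['Budget']
    # Rule-of-thumb break-even threshold.
    out['2.5x prod'] = 2.5 * out['Budget']
    # Anything strictly above the threshold counts as a success.
    above = out['Worldwide'] > out['2.5x prod']
    out['Break Even'] = above.map({True: 'Success', False: 'Flop'})
    return out

def adjust_for_inflation(amount, cpi_release_year, cpi_target_year):
    """Scale a nominal amount by the ratio of two CPI index values."""
    return amount * cpi_target_year / cpi_release_year

# Two rows taken from the table displayed further down in this notebook.
sample = pd.DataFrame({
    'Film': ['Superman', 'Superman III'],
    'Domestic': [134_478_449, 59_950_623],
    'Worldwide': [300_478_449, 80_250_623],
    'Budget': [55_000_000, 39_000_000],
})
print(add_derived_columns(sample))
```

The CPI values themselves are deliberately left as arguments, since the notebook does not state which CPI series or reference months were used.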

# Imports

In [1]:
import pandas as pd

# Loading, cleaning and preparation

## Load

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/CelioMaciel179/PDS_ANALISE_BilheteriaDCMarvel/main/datasets/dc_marvel_movie_performance.csv')

In [3]:
pd.set_option('display.max_columns', None)
df

Unnamed: 0,Film,U.S. release date,Box office gross Domestic (U.S. and Canada ),Box office gross Other territories,Box office gross Worldwide,Budget,MCU,Phase,Distributor,MPAA Rating,Length,Minutes,Franchise,Character Family,Domestic %,Gross to Budget,Rotten Tomatoes Critic Score,Male/Female-led,Year,Inflation Adjusted Worldwide Gross,Inflation Adjusted Budget,2.5x prod,Break Even
0,Superman,15/12/1978,"$134,478,449","$166,000,000","$300,478,449","$55,000,000",False,,Warner Bros.,PG,02:23:00,143.0,DC,Superman,45%,5.46,94.0,Male,1978.0,"$1,404,237,104","$257,033,544","$137,500,000",Success
1,Superman II,19/06/1981,"$108,185,706","$108,200,000","$216,385,706","$54,000,000",False,,Warner Bros.,PG,02:07:00,127.0,DC,Superman,50%,4.01,83.0,Male,1981.0,"$725,336,273","$181,010,842","$135,000,000",Success
2,Superman III,17/06/1983,"$59,950,623","$20,300,000","$80,250,623","$39,000,000",False,,Warner Bros.,PG,02:05:00,125.0,DC,Superman,75%,2.06,29.0,Male,1983.0,"$245,506,947","$119,310,861","$97,500,000",Flop
3,Supergirl,21/11/1984,"$14,296,438",,"$14,296,438","$35,000,000",False,,Tri-Star Pictures,PG,02:04:00,124.0,DC,Superman,100%,0.41,8.0,Female,1984.0,"$41,926,345","$102,642,497","$87,500,000",Flop
4,Howard the Duck,01/08/1986,"$16,295,774","$21,667,000","$37,962,774","$37,000,000",False,,Universal Pictures,PG,01:50:00,110.0,Marvel,Howard the Duck,43%,1.03,13.0,Male,1986.0,"$37,962,774","$37,000,000","$92,500,000",Flop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,Blue Beetle,18/08/2023,"$72,488,072","$56,800,000","$129,288,072","$120,000,000",False,,Warner Bros.,PG-13,02:07:00,127.0,DC,Blue Beetle,56%,1.08,78.0,Male,2023.0,"$129,288,072","$120,000,000","$300,000,000",Flop
110,The Marvels,10/11/2023,"$84,500,223","$121,373,601","$205,873,824","$219,800,000",True,5.0,Walt Disney Studios Motion Pictures,PG-13,01:45:00,105.0,Marvel,Captain Marvel,41%,0.94,62.0,Female,2023.0,"$205,873,824","$219,800,000","$549,500,000",Flop
111,Aquaman and the Lost Kingdom,20/12/2023,"$124,436,589","$309,900,000","$434,336,589","$205,000,000",False,,Warner Bros.,PG-13,02:04:00,124.0,DC,Aquaman,29%,2.12,35.0,Male,2023.0,"$434,336,589","$205,000,000","$512,500,000",Flop
112,Madame Web,14/02/2024,"$42,619,699","$54,000,000","$96,619,699","$80,000,000",False,,Columbia Pictures,PG-13,01:56:00,116.0,Marvel,Spider-Man Allies and Villains,44%,1.21,13.0,Female,2024.0,"$96,619,699","$80,000,000","$200,000,000",Flop


In [4]:
# Deleting empty row
df.drop(index=113, inplace=True)
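
Dropping by a hard-coded index works for this file, but would break if the row order changed. As a hedged alternative, and assuming the empty row is missing in every column, `dropna(how='all')` removes it without naming an index. The toy frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Film': ['Superman', 'Supergirl', np.nan],
    'Budget': ['$55,000,000', '$35,000,000', np.nan],
    'Other': ['$166,000,000', np.nan, np.nan],
})

# how='all' drops only rows where every column is missing,
# so Supergirl (one missing value) is kept.
cleaned = toy.dropna(how='all').reset_index(drop=True)
print(cleaned)
```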

## Overview

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 23 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Film                                          113 non-null    object 
 1   U.S. release date                             113 non-null    object 
 2   Box office gross Domestic (U.S. and Canada )  113 non-null    object 
 3   Box office gross Other territories            112 non-null    object 
 4   Box office gross Worldwide                    113 non-null    object 
 5   Budget                                        113 non-null    object 
 6   MCU                                           113 non-null    object 
 7   Phase                                         33 non-null     float64
 8   Distributor                                   113 non-null    object 
 9   MPAA Rating                                   113 non-null    obj

In [6]:
def show_unique_values(df):
    object_columns = df.select_dtypes(include=['object']).columns
    
    for col in object_columns:
        unique_values = df[col].unique()
        print(f"Unique values '{col}':")
        print(unique_values)
        print('-' * 40)

show_unique_values(df)

Unique values 'Film':
['Superman' 'Superman II' 'Superman III' 'Supergirl' 'Howard the Duck'
 'Superman IV: The Quest for Peace' 'The Return of Swamp Thing' 'Batman'
 'The Punisher' 'Batman Returns' 'Batman: Mask of the Phantasm'
 'Batman Forever' 'Batman & Robin' 'Steel' 'Blade' 'X-Men' 'Blade II'
 'Spider-Man' 'Daredevil' 'X2' 'Hulk' 'Spider-Man 2' 'Catwoman'
 'Blade: Trinity' 'Elektra' 'Constantine' 'Batman Begins' 'Fantastic Four'
 'X-Men: The Last Stand' 'Superman Returns' 'Ghost Rider' 'Spider-Man 3'
 'Fantastic Four: Rise of the Silver Surfer' 'Iron Man'
 'The Incredible Hulk' 'The Dark Knight' 'Punisher: War Zone' 'Watchmen'
 'X-Men Origins: Wolverine' 'Iron Man 2' 'Jonah Hex' 'Thor'
 'X-Men: First Class' 'Green Lantern'
 'Captain America: The First Avengers' 'Ghost Rider: Spirit of Vengeance'
 'The Avengers' 'The Amazing Spider-Man' 'The Dark Knight Rises'
 'Iron Man 3' 'Man of Steel' 'The Wolverine' 'Thor: The Dark World'
 'Captain America: The Winter Soldier' 'The Amazing Sp

## Checking duplicates and nulls

In [7]:
# Checking duplicates
df.duplicated().sum()

0
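
`df.duplicated()` only flags rows that match in every column. A looser check, sketched below on invented rows, compares a key such as title plus year. The choice of `['Film', 'Year']` as the key is an assumption, made because remakes can legitimately share a title.

```python
import pandas as pd

toy = pd.DataFrame({
    'Film': ['Film A', 'Film A', 'Film B', 'Film B'],
    'Year': [2005, 2015, 2003, 2003],
    'Budget': [1.0, 2.0, 3.0, 4.0],  # placeholder numbers, not real budgets
})

# No row is an exact copy of another...
print(toy.duplicated().sum())

# ...but one (Film, Year) pair appears twice; keep=False shows both rows.
key_dupes = toy[toy.duplicated(subset=['Film', 'Year'], keep=False)]
print(key_dupes)
```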

In [8]:
# Checking null values
df.isnull().sum()

Film                                             0
U.S. release date                                0
Box office gross Domestic (U.S. and Canada )     0
Box office gross Other territories               1
Box office gross Worldwide                       0
Budget                                           0
MCU                                              0
Phase                                           80
Distributor                                      0
MPAA Rating                                      0
Length                                           0
Minutes                                          0
Franchise                                        0
Character Family                                 0
Domestic %                                       0
Gross to Budget                                  0
Rotten Tomatoes Critic Score                     0
Male/Female-led                                  0
Year                                             0
Inflation Adjusted Worldwide Gr

## Cleaning currency and percent signs

In [9]:
# list of columns with signs
cols = ['Box office gross Domestic (U.S. and Canada )', 'Box office gross Other territories',
        'Box office gross Worldwide', 'Budget', 'Domestic %', 'Inflation Adjusted Worldwide Gross', 'Inflation Adjusted Budget',
        '2.5x prod']


df[cols] = df[cols].replace({'\\$': '', ',': '', '%': ''}, regex=True)
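
To see what the replacement does, here is a small self-contained sketch on invented values (the frame `toy` is hypothetical). With `regex=True` each dictionary key is treated as a pattern and substituted wherever it occurs inside a string, which is why `$` has to be escaped.

```python
import pandas as pd

toy = pd.DataFrame({
    'Budget': ['$55,000,000', '$35,000,000'],
    'Domestic %': ['45%', '100%'],
})

# Strip dollar signs, thousands separators and percent signs in one pass.
stripped = toy.replace({r'\$': '', ',': '', '%': ''}, regex=True)
print(stripped)

# The values are still strings at this point; conversion is a separate step.
as_numbers = stripped.astype(float)
print(as_numbers.dtypes)
```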

## Fill null values

In [10]:
index_null = df[df['Box office gross Other territories'].isnull()].index
i = index_null[0]

# Supergirl has no gross recorded for other territories, so set it to 0.
df.at[i, 'Box office gross Other territories'] = 0

In [11]:
# Non-MCU films have no phase; use 0 as a placeholder
df['Phase'] = df['Phase'].fillna(0)
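
Filling `Phase` with 0 keeps the column numeric, at the cost of 0 acting as a sentinel that has to be excluded from any per-phase statistics. One possible alternative, shown here on a toy Series, is pandas' nullable `Int64` dtype, which stores whole numbers and keeps missing values as `<NA>`.

```python
import numpy as np
import pandas as pd

phase = pd.Series([np.nan, 1.0, 5.0, np.nan], name='Phase')

# Option used in this notebook: 0 stands for "not an MCU film".
filled = phase.fillna(0).astype(int)

# Alternative: nullable integers, so missing stays missing.
nullable = phase.astype('Int64')

print(filled.tolist())
print(nullable.mean())  # missing values are skipped by default
```

With the sentinel approach the zeros pull the mean down, whereas the nullable column averages only the films that actually have a phase.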

## Convert types

In [12]:
df['U.S. release date'] = pd.to_datetime(df['U.S. release date'], format='%d/%m/%Y')

float_cols = [
    'Box office gross Domestic (U.S. and Canada )',
    'Box office gross Other territories',
    'Box office gross Worldwide',
    'Budget',
    'Domestic %',
    'Gross to Budget',
    'Rotten Tomatoes Critic Score',
    'Inflation Adjusted Worldwide Gross',
    'Inflation Adjusted Budget',
    '2.5x prod'
]

df[float_cols] = df[float_cols].astype(float)

df['Minutes'] = df['Minutes'].astype(int)
df['Year'] = df['Year'].astype(int)
df['Phase'] = df['Phase'].astype(int)

# Map any string flags to booleans, then cast explicitly so the dtype is bool
df['MCU'] = df['MCU'].replace({'True': True, 'False': False}).astype(bool)
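
The explicit `format='%d/%m/%Y'` matters because the dates in this file are day-first. A short sketch on two values from the table displayed earlier shows the parsed result; the variable names are illustrative only.

```python
import pandas as pd

raw = pd.Series(['01/08/1986', '15/12/1978'])

# Day-first format: '01/08/1986' is 1 August 1986, not 8 January.
dates = pd.to_datetime(raw, format='%d/%m/%Y')

print(dates.dt.month.tolist())
print(dates.dt.year.tolist())
```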


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 23 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   Film                                          113 non-null    object        
 1   U.S. release date                             113 non-null    datetime64[ns]
 2   Box office gross Domestic (U.S. and Canada )  113 non-null    float64       
 3   Box office gross Other territories            113 non-null    float64       
 4   Box office gross Worldwide                    113 non-null    float64       
 5   Budget                                        113 non-null    float64       
 6   MCU                                           113 non-null    bool          
 7   Phase                                         113 non-null    int32         
 8   Distributor                                   113 non-null    object  

In [14]:
# Remove 'Length'; the running time is kept in 'Minutes'
remove = ['Length']
df = df.drop(columns=remove)
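
Once the gross columns are numeric, a quick consistency check becomes possible: the worldwide figure should equal domestic plus other territories. The sketch below uses three rows from the table displayed earlier, with shortened, hypothetical column names. Whether every row of the full dataset passes is not verified here.

```python
import pandas as pd

check = pd.DataFrame({
    'Film': ['Superman', 'Superman II', 'Supergirl'],
    'Domestic': [134_478_449.0, 108_185_706.0, 14_296_438.0],
    'Other': [166_000_000.0, 108_200_000.0, 0.0],
    'Worldwide': [300_478_449.0, 216_385_706.0, 14_296_438.0],
})

# Difference between the reported total and the sum of its parts.
check['Gap'] = check['Worldwide'] - (check['Domestic'] + check['Other'])

# Rows where the gap exceeds one dollar would deserve a closer look.
mismatches = check[check['Gap'].abs() > 1]
print(mismatches)
```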

## Export

In [18]:
# df.to_csv('df_clean.csv', index=False)
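
If the export is enabled, note that CSV does not store dtypes: on reload the release date comes back as text unless `parse_dates` is passed. A minimal round-trip sketch, using an in-memory buffer in place of `df_clean.csv` and a toy frame:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'Film': ['Superman'],
    'U.S. release date': pd.to_datetime(['1978-12-15']),
    'MCU': [False],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same call as above, different target
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot carry.
reloaded = pd.read_csv(buffer, parse_dates=['U.S. release date'])
print(reloaded.dtypes)
```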