# Movie Revenue Analysis

## Introduction

This analysis explores the factors associated with movie revenues using the TMDB dataset. We investigate the following research questions:

* **Main Question**: Which properties are associated with the highest movie revenues?
* **Sub-questions**:
  * What is the relationship between budget and revenue?
  * Which genres tend to generate the highest revenues?
  * How does popularity correlate with revenue?
  * Has the relationship between movie properties and revenue changed over time?

## 1. Data Loading and Initial Exploration

First, we'll import the necessary libraries and load our dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

df = pd.read_csv('./tmdb-movies.csv')

df.head()

Let's examine the dataset structure and summary statistics.

In [None]:
print(f"Dataset shape: {df.shape}")
print("\nColumn data types:")
print(df.dtypes)

print("\nSummary statistics for numerical columns:")
df.describe()

## 2. Data Cleaning and Wrangling

Before analysis, we need to clean and prepare our data, including handling missing values and transforming variables.

In [None]:
print("Missing values per column:")
print(df.isnull().sum())

df_clean = df.drop(['homepage', 'tagline', 'imdb_id', 'overview'], axis=1)

df_clean = df_clean.dropna(subset=['revenue_adj', 'budget_adj'])

df_clean = df_clean[(df_clean['revenue_adj'] > 0) & (df_clean['budget_adj'] > 0)]

df_clean['release_date'] = pd.to_datetime(df_clean['release_date'])

df_clean['genres'] = df_clean['genres'].str.split('|')
df_exploded = df_clean.explode('genres')

print(f"\nCleaned dataset shape: {df_clean.shape}")
print(f"Number of rows removed: {df.shape[0] - df_clean.shape[0]}")

### Summary of Data Cleaning Steps

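1. Removed columns not used in this analysis (`homepage`, `tagline`, `imdb_id`, `overview`)
2. Dropped rows with missing `revenue_adj` or `budget_adj`
3. Kept only movies whose adjusted revenue and adjusted budget are both greater than zero
4. Converted `release_date` to datetime format
5. Split the pipe-delimited `genres` string into a list and exploded it into one row per movie-genre pair (`df_exploded`)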

These steps ensure we have quality data for our analysis, focusing on movies with complete financial information.
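
The genre step is the least obvious of the five, so here is a minimal sketch of what it does, run on made-up rows rather than the TMDB file. The titles and revenue figures are placeholders chosen for illustration only.

```python
import pandas as pd

# Toy rows standing in for the TMDB data (values are made up).
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B"],
    "genres": ["Action|Adventure", "Drama"],
    "revenue_adj": [300.0, 100.0],
})

# Split the pipe-delimited string into a list, then give each genre its own row.
toy["genres"] = toy["genres"].str.split("|")
toy_exploded = toy.explode("genres")

# Film A now appears twice, so its revenue counts toward both of its genres.
genre_median = toy_exploded.groupby("genres")["revenue_adj"].median()

print(toy_exploded)
print(genre_median)
```

Because a multi-genre movie is counted once per genre, per-genre statistics computed from `df_exploded` are not independent of each other, and row counts in `df_exploded` exceed the number of movies.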

## 3. Exploratory Data Analysis (EDA)

Now we'll conduct a thorough exploration of the data to uncover patterns and relationships.

We first define a reusable plotting helper that the following sections call.

In [None]:
from typing import Optional, Tuple


def plot_data(
    data: pd.DataFrame,
    x: str,
    y: Optional[str],
    plot_type: str = "scatter",
    figsize: Tuple[int, int] = (12, 6),
    title: Optional[str] = None,
    xlabel: Optional[str] = None,
    ylabel: Optional[str] = None,
    alpha: float = 0.6,
    hue: Optional[str] = None,
    style: Optional[str] = None,
    size: Optional[str] = None,
    palette: Optional[str] = None,
    legend: bool = False,
    xlim: Optional[Tuple[float, float]] = None,
    ylim: Optional[Tuple[float, float]] = None,
    plain_format: bool = True,
    grid: bool = False,
    ax: Optional[plt.Axes] = None,
    **kwargs
):
    """
    A versatile plotting function that creates different types of plots based on the input parameters.
    
    Parameters:
    -----------
    data : pd.DataFrame
        The DataFrame containing the data to plot
    x : str
        The column name for x-axis data
    y : str
        The column name for y-axis data
    plot_type : str, default="scatter"
        Type of plot to create. Supported: "scatter", "dist" (alias "distplot"); any other value raises ValueError
    figsize : tuple, default=(12, 6)
        Figure size as (width, height) in inches
    title : str, optional
        Title of the plot
    xlabel : str, optional
        Label for x-axis (defaults to x variable name if None)
    ylabel : str, optional
        Label for y-axis (defaults to y variable name if None)
    alpha : float, default=0.6
        Transparency of the plot elements
    hue : str, optional
        Column name for color encoding
    style : str, optional
        Column name for styling points
    size : str, optional
        Column name for sizing points
    palette : str, optional
        Color palette name
    legend : bool, default=False
        Whether to show the legend
    xlim : tuple, optional
        x-axis limits as (min, max)
    ylim : tuple, optional
        y-axis limits as (min, max)
    plain_format : bool, default=True
        If True, uses plain number format instead of scientific notation
    grid : bool, default=False
        Whether to show grid lines
    ax : matplotlib.axes.Axes, optional
        Existing axes to plot on
    **kwargs
        Additional keyword arguments to pass to the underlying plotting function
    
    Returns:
    --------
    matplotlib.axes.Axes
        The axes object with the plot
    
    Examples:
    ---------
    #scatter plot
    plot_data(df, x='popularity', y='revenue_adj', title='Revenue vs Popularity')
    
    #distribution plot with a grid
    plot_data(df, x='revenue_adj', y=None, plot_type='dist', bins=50, kde=True,
              title='Revenue Distribution', ylabel='Count', grid=True)
    
    #subplots
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    plot_data(df, x='popularity', y='revenue_adj', ax=axes[0], title='Revenue vs Popularity')
    plot_data(df, x='popularity', y='budget_adj', ax=axes[1], title='Budget vs Popularity')
    plt.tight_layout()
    plt.show()
    """

    if ax is None:
        fig, ax = plt.subplots(figsize=figsize)
    
    
    if xlabel is None:
        xlabel = x
    if ylabel is None:
        ylabel = y
    
    if plot_type == "scatter":
        sns.scatterplot(data=data, x=x, y=y, hue=hue, style=style, size=size, 
                        palette=palette, alpha=alpha, ax=ax, legend=legend, **kwargs)
    
        
    elif plot_type in ["dist", "distplot"]:
        data_array = data[x] if isinstance(data, pd.DataFrame) and x in data.columns else x
        
        if 'transform' in kwargs:
            transform_func = kwargs.pop('transform')
            data_array = transform_func(data_array)
            
        # sns.distplot is deprecated; histplot accepts the same bins/kde keyword arguments
        sns.histplot(data_array, ax=ax, **kwargs)
    
    else:
        raise ValueError(f"Plot type '{plot_type}' not recognized")
    
    if title:
        ax.set_title(title, fontsize=16)
    ax.set_xlabel(xlabel, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=12)

    if xlim:
        ax.set_xlim(xlim)
    if ylim:
        ax.set_ylim(ylim)
    
    if plain_format:
        ax.ticklabel_format(style='plain', axis='y')
        ax.ticklabel_format(style='plain', axis='x')
    
    if grid:
        ax.grid(True, alpha=0.3)
    
    return ax

### 3.1 Distribution of Revenue

Let's first look at the distribution of movie revenues.

In [None]:
plot_data(
    data=df_clean,
    x='revenue_adj',
    y=None,
    plot_type='distplot',
    title='Distribution of Adjusted Revenue',
    xlabel='Adjusted Revenue (USD)',
    ylabel='Frequency',
    bins=50,
    kde=True,
    plain_format=True
)
plt.show()

plot_data(
    data=pd.DataFrame({'log_revenue': np.log10(df_clean['revenue_adj'])}),
    x='log_revenue',
    y=None, 
    plot_type='dist', 
    title='Distribution of Log-Transformed Adjusted Revenue',
    xlabel='Log10(Adjusted Revenue)',
    ylabel='Frequency',
    kde=True,
    bins=50 
)
plt.show()
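
The visual difference between the two histograms can also be expressed as a number. The sketch below compares sample skewness before and after a log10 transform. It uses synthetic log-normal values, not the TMDB data, so the printed figures say nothing about the real dataset; with the notebook loaded, the same two `.skew()` calls could be applied to `df_clean['revenue_adj']`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "revenue" values (an assumption for illustration).
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=17, sigma=1.5, size=2000))

# Sample skewness: close to 0 for a symmetric distribution,
# large and positive when there is a long right tail.
raw_skew = revenue.skew()
log_skew = np.log10(revenue).skew()

print(f"Skewness of raw values:   {raw_skew:.2f}")
print(f"Skewness of log10 values: {log_skew:.2f}")
```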

### 3.2 Correlation Analysis

Let's examine correlations between key numerical variables.

In [None]:
corr_matrix = df_clean[['revenue_adj', 'budget_adj', 'popularity', 'vote_average', 'vote_count', 'runtime']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Key Variables', fontsize=16)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()

### 3.3 Revenue vs. Budget Relationship

One of our key research questions is the relationship between budget and revenue.

In [None]:
ax = plot_data(
    data=df_clean,
    x='budget_adj',
    y='revenue_adj',
    plot_type='scatter',
    title='Adjusted Revenue vs. Adjusted Budget',
    xlabel='Adjusted Budget (USD)',
    ylabel='Adjusted Revenue (USD)',
    alpha=0.6,
    plain_format=True
)

max_val = max(df_clean['budget_adj'].max(), df_clean['revenue_adj'].max())
ax.plot([0, max_val], [0, max_val], 'r--', alpha=0.7, label='Break-even line')

ax.legend()
plt.show()
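
Points above the dashed line earned more than they cost. A hedged way to summarize that visually dense plot is to compute the share of movies above the line and the median revenue-to-budget ratio. The rows below are made up, and the `roi` column name is an assumption, not a column in the dataset.

```python
import pandas as pd

# Toy budget/revenue pairs (made-up values, same units for both columns).
toy = pd.DataFrame({
    "budget_adj":  [10.0, 50.0, 100.0, 200.0],
    "revenue_adj": [5.0, 150.0, 100.0, 800.0],
})

# Revenue earned per unit of budget; 1.0 corresponds to the break-even line.
toy["roi"] = toy["revenue_adj"] / toy["budget_adj"]

# Share of movies strictly above the break-even line.
share_profitable = (toy["revenue_adj"] > toy["budget_adj"]).mean()
median_roi = toy["roi"].median()

print(f"Share above break-even: {share_profitable:.0%}")
print(f"Median revenue/budget ratio: {median_roi:.2f}")
```

Note that this ratio ignores marketing and distribution costs, so "above break-even" here means only that reported revenue exceeds reported production budget.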

### 3.4 Revenue by Genre

Which genres tend to generate the highest revenues?

In [None]:
#Revenue by genre
genre_revenue = df_exploded.groupby('genres')['revenue_adj'].median().sort_values(ascending=False)[:10]
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_revenue.values, y=genre_revenue.index, palette='viridis')
plt.title('Median Adjusted Revenue by Genre (Top 10)', fontsize=16)
plt.xlabel('Adjusted Revenue (USD)', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.ticklabel_format(style='plain', axis='x')
plt.show()

plt.figure(figsize=(12, 6))
genre_counts = df_exploded['genres'].value_counts().sort_values(ascending=False)[:15]
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='muted')
plt.title('Number of Movies by Genre (Top 15)', fontsize=16)
plt.xlabel('Count', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.show()

### 3.5 Revenue Trends Over Time

Let's analyze how movie revenues have changed over time.

In [None]:
yearly_revenue = df_clean.groupby('release_year')['revenue_adj'].median()
plt.figure(figsize=(14, 6))
sns.lineplot(data=yearly_revenue, marker='o')
plt.title('Median Adjusted Revenue Over Time', fontsize=16)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Adjusted Revenue (USD)', fontsize=12)
plt.ticklabel_format(style='plain', axis='y')
plt.xticks(rotation=45)
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

yearly_budget = df_clean.groupby('release_year')['budget_adj'].median()
plt.figure(figsize=(14, 6))

plt.plot(yearly_revenue.index, yearly_revenue.values, 'b-', marker='o', label='Revenue')
plt.plot(yearly_budget.index, yearly_budget.values, 'g-', marker='s', label='Budget')

plt.title('Median Adjusted Revenue and Budget Over Time', fontsize=16)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('USD (Adjusted)', fontsize=12)
plt.ticklabel_format(style='plain', axis='y')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
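
The plots above show how revenue and budget levels move over time, but not whether the relationship between them has changed, which is what the fourth sub-question asks. One possible approach, sketched here under assumptions, is to compute the budget-revenue correlation within each decade. The helper name `correlation_by_decade` and the toy values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd


def correlation_by_decade(data, x="budget_adj", y="revenue_adj",
                          year_col="release_year"):
    """Return the Pearson correlation between x and y for each decade."""
    # Map each year to the start of its decade, e.g. 1987 -> 1980.
    decade = (data[year_col] // 10) * 10
    rows = []
    for dec, group in data.groupby(decade):
        rows.append({
            "decade": int(dec),
            "n": len(group),
            "pearson_r": group[x].corr(group[y]),
        })
    return pd.DataFrame(rows).set_index("decade")


# Toy data: revenue rises with budget in the 1980s and falls with it in the 1990s.
toy = pd.DataFrame({
    "release_year": [1981, 1985, 1989, 1992, 1995, 1999],
    "budget_adj":   [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "revenue_adj":  [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})
by_decade = correlation_by_decade(toy)
print(by_decade)
```

With the real data, decades that contain few movies would give unstable estimates, so the `n` column should be read alongside each correlation.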

### 3.6 Popularity and Revenue

How does popularity relate to revenue?

In [None]:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df_clean, x='popularity', y='revenue_adj', alpha=0.6)
plt.title('Adjusted Revenue vs. Popularity', fontsize=16)
plt.xlabel('Popularity Score', fontsize=12)
plt.ylabel('Adjusted Revenue (USD)', fontsize=12)
plt.ticklabel_format(style='plain', axis='y')
plt.show()

## 4. Statistical Analysis

Now we'll perform more rigorous statistical analysis to quantify the relationships we've observed.

In [None]:
from scipy.stats import pearsonr

# Pearson correlation: budget vs. revenue
corr, p_value = pearsonr(df_clean['budget_adj'], df_clean['revenue_adj'])
# Use general format for the p-value so that very small values do not print as 0.0000
print(f"Pearson Correlation (Budget vs. Revenue): {corr:.2f}, p-value: {p_value:.3g}")

# Pearson correlation: popularity vs. revenue
pop_corr, pop_p = pearsonr(df_clean['popularity'], df_clean['revenue_adj'])
print(f"Pearson Correlation (Popularity vs. Revenue): {pop_corr:.2f}, p-value: {pop_p:.3g}")

#budget vs. revenue
sns.lmplot(data=df_clean, x='budget_adj', y='revenue_adj', height=6, aspect=2, scatter_kws={'alpha':0.5})
plt.title('Regression Plot: Budget vs. Revenue', fontsize=16)
plt.xlabel('Adjusted Budget (USD)', fontsize=12)
plt.ylabel('Adjusted Revenue (USD)', fontsize=12)
plt.ticklabel_format(style='plain', axis='both')
plt.show()
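
Pearson's r measures linear association and can be dominated by a handful of extreme values, which matters for right-skewed variables such as revenue and budget. A common companion check, not part of the original analysis, is Spearman's rank correlation. The sketch below computes it as the Pearson correlation of the ranks, using made-up values with one blockbuster-sized outlier; the helper name `spearman_r` is hypothetical.

```python
import pandas as pd


def spearman_r(a: pd.Series, b: pd.Series) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return a.rank().corr(b.rank())


# Toy data: four small films plus one extreme outlier (made-up values).
toy = pd.DataFrame({
    "budget_adj":  [1.0, 2.0, 3.0, 4.0, 1000.0],
    "revenue_adj": [2.0, 3.0, 5.0, 4.0, 10000.0],
})

pearson = toy["budget_adj"].corr(toy["revenue_adj"])
spearman = spearman_r(toy["budget_adj"], toy["revenue_adj"])

print(f"Pearson r:    {pearson:.3f}")
print(f"Spearman rho: {spearman:.3f}")
```

In this toy case the outlier pulls Pearson's r close to 1 even though the ordering among the small films is imperfect; the rank-based value is lower. If the two measures diverge on the real data, that is a hint that a few high-grossing movies are driving the linear correlation.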

## 5. Conclusions and Limitations

Based on our analysis, we can draw several conclusions about the factors associated with movie revenue.

### Key Findings:

1. **Budget and Revenue**: Moderate-to-strong positive correlation (r = 0.57, p < 0.0001). Higher budgets are associated with higher revenues, though the relationship is not perfectly linear.

2. **Popularity**: Moderately correlated with revenue (r = 0.55), indicating popular movies tend to earn more. This is consistent with public reception playing a role, although the dataset contains no marketing data to test that link directly.

3. **Temporal Trend**: Median revenue has increased significantly since the 1980s, with particularly strong growth in the 2000s. Budget growth has followed a similar pattern, suggesting increasing production costs over time.

### Limitations:

- **Causality**: Correlation does not imply causation. High budgets may reflect studio confidence in a movie's potential rather than directly causing high revenue.

- **Data Scope**: Our analysis does not account for marketing expenditure, which can significantly impact a movie's commercial success. It also doesn't incorporate factors like director reputation, star power, or external market conditions.

- **Inflation Adjustment**: While `revenue_adj` and `budget_adj` are provided, their accuracy depends on the adjustment methodology used in the dataset creation.

- **Selection Bias**: Our analysis focuses only on movies with complete data and positive budget/revenue values, which may exclude certain types of films.
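
The selection-bias concern can be made concrete by checking whether the cleaning filter removes rows evenly across groups. The sketch below, with a hypothetical helper `retention_by_group` and made-up rows, reports the fraction of rows in each group that would survive the "both values greater than zero" filter. Missing values compare as false and are therefore counted as dropped, matching the combined effect of the `dropna` and filter steps.

```python
import pandas as pd


def retention_by_group(raw, group_col,
                       value_cols=("revenue_adj", "budget_adj")):
    """Fraction of rows per group with all value_cols present and > 0."""
    # NaN > 0 evaluates to False, so missing values count as dropped.
    kept = (raw[list(value_cols)] > 0).all(axis=1)
    return kept.groupby(raw[group_col]).mean()


# Toy rows (made-up values): zeros stand in for unusable financial figures.
toy = pd.DataFrame({
    "release_year": [1970, 1970, 2010, 2010, 2010],
    "revenue_adj":  [0.0, 5.0, 10.0, 20.0, 0.0],
    "budget_adj":   [1.0, 0.0, 5.0, 5.0, 5.0],
})
retention = retention_by_group(toy, "release_year")
print(retention)
```

Applied to the raw `df` before cleaning, large differences in retention between years or genres would indicate which kinds of films the conclusions do not cover.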

### Future Work:

- Investigate the impact of production companies and directors on movie revenue.
- Analyze seasonal trends (e.g., summer vs. winter releases) to identify optimal release timing.
- Build predictive models to forecast movie revenue based on pre-release characteristics.
- Include marketing budget data to get a more complete picture of movie economics.
- Explore audience demographics and their relationship to movie performance.
