
# Comprehensive Movie Data Analysis
# 1. Business Understanding

This notebook provides a structured approach to analyzing movie datasets to answer the business question:

**"What kinds of movies should a new studio produce for financial success?"**

We will proceed through the following sections:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Analysis
5. Visualization

**Objectives:**
- Analyze which genres are most profitable.
- Examine the relationship between production budget and revenue.
- Assess the impact of review scores on financial performance.

By integrating multiple movie datasets, we aim to provide actionable insights for new studios to maximize their chances of financial success.

**Import Required Libraries**

Import all necessary libraries for data manipulation, visualization, and analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

# 2. Data Understanding
**Load and Inspect Datasets**

Loading all provided datasets (CSV, TSV, SQLite) into pandas DataFrames. Displaying the first few rows and data types for each DataFrame. Printing the shape of each DataFrame to confirm successful loading.

In [None]:
# File paths: all datasets live in one folder, so the folder is defined once
from pathlib import Path
data_dir = Path(r'C:\Users\DAVID\Documents\Moringa\Phase2\Assessments\dsc-phase-2-project-PT11-Group4\Data')
bom_path = data_dir / 'bom.movie_gross.csv'
tn_path = data_dir / 'tn.movie_budgets.csv'
tmdb_path = data_dir / 'tmdb.movies.csv'
rt_info_path = data_dir / 'rt.movie_info.tsv'
rt_reviews_path = data_dir / 'rt.reviews.tsv'
imdb_db_path = data_dir / 'im.db'

# Load CSV/TSV files
df_bom = pd.read_csv(bom_path)
df_tn = pd.read_csv(tn_path, encoding='latin-1')
df_tmdb = pd.read_csv(tmdb_path)
df_rt_info = pd.read_csv(rt_info_path, sep='\t')
df_rt_reviews = pd.read_csv(rt_reviews_path,encoding='latin-1',sep='\t')

# Load SQLite tables
conn = sqlite3.connect(imdb_db_path)
df_movie_basics = pd.read_sql_query("SELECT * FROM movie_basics", conn)
df_movie_ratings = pd.read_sql_query("SELECT * FROM movie_ratings", conn)
conn.close()

# Inspect DataFrames
for name, df in [
    ("df_bom", df_bom), ("df_tn", df_tn), ("df_tmdb", df_tmdb),
    ("df_rt_info", df_rt_info), ("df_rt_reviews", df_rt_reviews),
    ("df_movie_basics", df_movie_basics), ("df_movie_ratings", df_movie_ratings)
]:
    print(f"{name}: shape={df.shape}")
    display(df.head())
    display(df.dtypes)
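Beyond shapes and dtypes, a per-column count of missing values is often useful at this stage. The helper below is a minimal sketch, not part of the original notebook: `summarize_missing` is a hypothetical name, and the toy frame only stands in for one of the loaded datasets (its values are made up).

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column dtype, missing count, and missing share, worst first."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)

# Toy frame standing in for one of the loaded datasets (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["X", None, "Y", None],
    "domestic_gross": [1.0, 2.0, None, 4.0],
})
missing_report = summarize_missing(toy)
print(missing_report)
```

In the notebook, the same function could be called inside the inspection loop, for example `display(summarize_missing(df))`.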

# 3. Data Preparation
**Cleaning and Preparing Data**

Handling missing values, removing duplicates, standardizing column names, and ensuring consistent data types. Converting relevant columns to numeric or datetime as needed. Addressing encoding issues and outliers.

In [None]:
# Clean df_tn
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    df_tn[col] = df_tn[col].astype(str).str.replace(r'[$,]', '', regex=True)
    df_tn[col] = pd.to_numeric(df_tn[col], errors='coerce')
df_tn['release_date'] = pd.to_datetime(df_tn['release_date'], errors='coerce')
df_tn = df_tn.drop_duplicates()
df_tn.columns = df_tn.columns.str.lower().str.replace(' ', '_')
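# Drop rows with unparseable values rather than filling with 0: a 0 would read as a
# zero-dollar budget or gross and would put an integer into the datetime column
df_tn = df_tn.dropna(subset=['production_budget', 'domestic_gross', 'worldwide_gross', 'release_date'])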

# Remove outliers using IQR for financial columns
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    Q1 = df_tn[col].quantile(0.25)
    Q3 = df_tn[col].quantile(0.75)
    IQR = Q3 - Q1
    df_tn = df_tn[(df_tn[col] >= Q1 - 1.5 * IQR) & (df_tn[col] <= Q3 + 1.5 * IQR)]

# Clean df_bom
df_bom.columns = df_bom.columns.str.lower().str.replace(' ', '_')
df_bom = df_bom.drop_duplicates()

# Clean df_tmdb
df_tmdb.columns = df_tmdb.columns.str.lower().str.replace(' ', '_')
df_tmdb['release_date'] = pd.to_datetime(df_tmdb['release_date'], errors='coerce')
df_tmdb = df_tmdb.drop_duplicates()

# Clean df_rt_info
df_rt_info.columns = df_rt_info.columns.str.lower().str.replace(' ', '_')
df_rt_info['theater_date'] = pd.to_datetime(df_rt_info['theater_date'], errors='coerce')
df_rt_info['dvd_date'] = pd.to_datetime(df_rt_info['dvd_date'], errors='coerce')
df_rt_info = df_rt_info.drop_duplicates()

# Clean df_rt_reviews
df_rt_reviews.columns = df_rt_reviews.columns.str.lower().str.replace(' ', '_')
df_rt_reviews = df_rt_reviews.drop_duplicates()

# Clean IMDB tables
df_movie_basics.columns = df_movie_basics.columns.str.lower().str.replace(' ', '_')
df_movie_ratings.columns = df_movie_ratings.columns.str.lower().str.replace(' ', '_')
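The currency parsing and IQR filtering steps above can be sketched on toy data as follows. This is an illustration under assumptions: `parse_currency` and `iqr_mask` are hypothetical helper names, and the values are made up. Unlike the cell above, the sketch returns a boolean mask so that the rows flagged as outliers can be inspected before anything is dropped.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,234,567' to floats; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)

# Made-up budgets: four similar values, one extreme value, one unparseable entry.
raw = pd.Series(["$10,000,000", "$12,000,000", "$11,000,000",
                 "$13,000,000", "$400,000,000", "n/a"])
budget = parse_currency(raw)
inside = iqr_mask(budget)

print(budget[inside])    # values kept by the IQR rule
print(budget[~inside])   # values flagged as outliers or missing
```

Note that in film data the values flagged this way tend to be the largest productions, so it may be worth reviewing them before removal.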

# 4. Data Analysis
**Merging Datasets and Feature Engineering**

Merging the cleaned datasets on appropriate keys (e.g., movie title or ID). Creating new features such as total revenue, profit margin, and ROI. Preparing genre information for analysis.
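Merging on raw titles is sensitive to case, punctuation, and stray whitespace. The sketch below shows one possible way to normalize titles before merging and to check how many rows matched. It is an assumption-laden illustration: `normalize_title` and `title_key` are hypothetical names, the titles and numbers are made up, and the regular expression strips non-ASCII characters, which may not suit every title.

```python
import pandas as pd

def normalize_title(title: pd.Series) -> pd.Series:
    """Lowercase, strip punctuation, and collapse whitespace so near-identical titles match."""
    return (
        title.astype(str)
        .str.lower()
        .str.replace(r"[^a-z0-9 ]", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

# Made-up rows standing in for the budget and TMDB tables.
budgets = pd.DataFrame({
    "movie": ["Film One", "Film: Two", "Film Three"],
    "production_budget": [100, 200, 300],
})
tmdb = pd.DataFrame({
    "original_title": ["film one", "Film  Two ", "Film Four"],
    "vote_average": [7.0, 6.0, 5.0],
})

budgets["title_key"] = normalize_title(budgets["movie"])
tmdb["title_key"] = normalize_title(tmdb["original_title"])

# indicator=True adds a `_merge` column; validate raises if the right-hand keys repeat.
merged = budgets.merge(tmdb, on="title_key", how="left",
                       indicator=True, validate="many_to_one")
print(merged["_merge"].value_counts())
```

The `validate` argument is useful here because repeated titles on the right-hand side would otherwise silently duplicate budget rows.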

In [None]:
# Merge tn and tmdb on movie/original_title
df_merged = pd.merge(df_tn, df_tmdb, left_on='movie', right_on='original_title', how='inner')

# Merge with bom on movie/title
df_merged = pd.merge(df_merged, df_bom, left_on='movie', right_on='title', how='left')

# rt_info is not merged here: its `id` column is a numeric identifier and the table
# has no title column, so joining it to `movie` (a title) could only produce
# coincidental matches (e.g. a film whose title is a number)

# Feature engineering
# worldwide_gross already includes domestic_gross, so it is used directly as total revenue
df_merged['total_revenue'] = df_merged['worldwide_gross']
profit = df_merged['total_revenue'] - df_merged['production_budget']
# Replace zero denominators with NaN first so the ratios never become +/-inf
df_merged['profit_margin'] = (profit / df_merged['total_revenue'].replace(0, np.nan)).fillna(0)
df_merged['roi'] = (profit / df_merged['production_budget'].replace(0, np.nan)).fillna(0)

# Prepare genre information
def extract_genre_ids(genre_str):
    if isinstance(genre_str, str):
        try:
            return [int(x) for x in genre_str[1:-1].split(',') if x.strip() != '']
        except ValueError:
            return []
    else:
        return []
df_merged['genre_ids_list'] = df_merged['genre_ids'].apply(extract_genre_ids)
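The feature engineering step can be checked on a small example. The sketch below assumes revenue is measured by `worldwide_gross` alone and uses made-up numbers; `safe_ratio` and `parse_genre_ids` are hypothetical helpers, with `ast.literal_eval` shown as one alternative to manual string splitting.

```python
import ast
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, returning NaN (not inf) where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

def parse_genre_ids(value) -> list:
    """Parse a string like '[12, 14]' into a list of ints; return [] for anything else."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
        return [int(g) for g in parsed] if isinstance(parsed, list) else []
    except (ValueError, SyntaxError, TypeError):
        return []

# Made-up rows: a profitable film, a loss, and a row with a zero budget.
toy = pd.DataFrame({
    "production_budget": [100.0, 50.0, 0.0],
    "worldwide_gross": [300.0, 25.0, 10.0],
    "genre_ids": ["[12, 14]", "[]", None],
})
profit = toy["worldwide_gross"] - toy["production_budget"]
toy["roi"] = safe_ratio(profit, toy["production_budget"])
toy["profit_margin"] = safe_ratio(profit, toy["worldwide_gross"])
toy["genre_ids_list"] = toy["genre_ids"].apply(parse_genre_ids)
print(toy)
```

Leaving the zero-budget row as NaN, rather than filling it with 0, keeps it out of later averages instead of counting it as a break-even film.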

**Exploratory Data Analysis: Key Variables**

Analyzing distributions and relationships between production budget, revenue, ratings, and ROI. Visualizing correlations and summary statistics to identify patterns.

In [None]:
# Correlation matrix
corr = df_merged[['production_budget', 'total_revenue', 'profit_margin', 'roi']].corr()
display(corr)

# Scatter plots
plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
sns.scatterplot(data=df_merged, x='production_budget', y='total_revenue')
plt.title('Production Budget vs. Total Revenue')
plt.subplot(2, 2, 2)
sns.scatterplot(data=df_merged, x='production_budget', y='profit_margin')
plt.title('Production Budget vs. Profit Margin')
plt.subplot(2, 2, 3)
sns.scatterplot(data=df_merged, x='production_budget', y='roi')
plt.title('Production Budget vs. ROI')
plt.subplot(2, 2, 4)
sns.scatterplot(data=df_merged, x='total_revenue', y='roi')
plt.title('Total Revenue vs. ROI')
plt.tight_layout()
plt.show()
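Financial variables in film data are usually heavily skewed, so a rank-based correlation may complement the Pearson matrix above. The sketch below computes Spearman correlation as the Pearson correlation of ranks; `spearman_matrix` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def spearman_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation: Pearson correlation of per-column ranks (rows with NaN dropped)."""
    return df.dropna().rank().corr(method="pearson")

# Made-up data: revenue rises with budget, but not linearly.
toy = pd.DataFrame({
    "production_budget": [1, 2, 3, 4, 5],
    "total_revenue": [1, 4, 9, 16, 1000],
})
pearson = toy.corr().loc["production_budget", "total_revenue"]
spearman = spearman_matrix(toy).loc["production_budget", "total_revenue"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

A large gap between the two coefficients suggests that a few extreme titles drive the linear correlation.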

**Genre Analysis and Visualization**

Analyzing the performance of different genres using ROI. Creating a bar chart of average ROI by genre and a line plot of genre popularity (number of movies per year) over time.

In [None]:
# Average ROI by genre
genre_roi = df_merged.explode('genre_ids_list').groupby('genre_ids_list')['roi'].mean().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_roi.index, y=genre_roi.values, palette='viridis')
plt.title('Average ROI by Genre')
plt.xlabel('Genre ID')
plt.ylabel('Average ROI')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Genre popularity over time: count movies per genre and release year
df_merged['release_year'] = pd.to_datetime(df_merged['release_date_x'], errors='coerce').dt.year
genre_counts = (
    df_merged.explode('genre_ids_list')
    .dropna(subset=['genre_ids_list', 'release_year'])
    .groupby(['release_year', 'genre_ids_list'])
    .size()
    .reset_index(name='movie_count')
)
# Treat genre IDs as categories so each one gets its own line and legend entry
genre_counts['genre_ids_list'] = genre_counts['genre_ids_list'].astype(int).astype(str)
plt.figure(figsize=(14, 8))
sns.lineplot(data=genre_counts, x='release_year', y='movie_count', hue='genre_ids_list')
plt.title('Genre Popularity Over Time')
plt.xlabel('Release Year')
plt.ylabel('Number of Movies')
plt.legend(title='Genre ID', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
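The charts above label genres by numeric ID, and a mean can be dominated by a handful of titles. The sketch below maps IDs to names and reports the count and median alongside the mean. The `TMDB_GENRES` mapping is an assumption based on TMDB's published movie genre list and should be verified against the TMDB API before use; `genre_summary` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

# Assumed mapping, taken from TMDB's public movie genre list; verify before relying on it.
TMDB_GENRES = {
    28: "Action", 12: "Adventure", 16: "Animation", 35: "Comedy", 80: "Crime",
    99: "Documentary", 18: "Drama", 10751: "Family", 14: "Fantasy", 36: "History",
    27: "Horror", 10402: "Music", 9648: "Mystery", 10749: "Romance",
    878: "Science Fiction", 10770: "TV Movie", 53: "Thriller", 10752: "War", 37: "Western",
}

def genre_summary(df: pd.DataFrame, id_col: str = "genre_ids_list",
                  value_col: str = "roi") -> pd.DataFrame:
    """One row per genre with movie count, mean, and median of value_col."""
    exploded = df.explode(id_col).dropna(subset=[id_col])
    exploded = exploded.assign(genre=exploded[id_col].map(TMDB_GENRES).fillna("Unknown"))
    return (
        exploded.groupby("genre")[value_col]
        .agg(n_movies="count", mean_roi="mean", median_roi="median")
        .sort_values("median_roi", ascending=False)
    )

# Made-up rows: the last film has no genre and is excluded from the summary.
toy = pd.DataFrame({
    "genre_ids_list": [[28, 12], [28], [18], []],
    "roi": [2.0, 4.0, 1.0, 9.0],
})
summary = genre_summary(toy)
print(summary)
```

Reporting `n_movies` next to the averages makes it easier to spot genres whose ranking rests on very few films.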

**Budget and Revenue Analysis**

Investigating the relationship between production budget and box office revenue. Using scatter plots and regression lines to visualize and interpret the results. Segment analysis by budget tiers.

In [None]:
# Scatter plot with regression
plt.figure(figsize=(10, 6))
sns.regplot(x='production_budget', y='total_revenue', data=df_merged, order=2, scatter_kws={'s': 20, 'alpha': 0.5}, line_kws={'color': 'red'})
plt.title('Production Budget vs Revenue (Polynomial Regression)')
plt.xlabel('Production Budget')
plt.ylabel('Total Revenue')
plt.tight_layout()
plt.show()

# ROI by budget tier
quartiles = df_merged['production_budget'].quantile([0.25, 0.5, 0.75])
def budget_tier(budget):
    if budget <= quartiles[0.25]:
        return 'Low'
    elif budget <= quartiles[0.75]:
        return 'Medium'
    else:
        return 'High'
df_merged['budget_tier'] = df_merged['production_budget'].apply(budget_tier)
plt.figure(figsize=(10, 6))
sns.boxplot(x='budget_tier', y='roi', data=df_merged, order=['Low', 'Medium', 'High'])
plt.title('ROI Distribution by Budget Tier')
plt.xlabel('Budget Tier')
plt.ylabel('ROI')
plt.tight_layout()
plt.show()
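The same tiering (bottom quarter, middle half, top quarter) can also be expressed with `pd.qcut`, which returns an ordered categorical so plots and group-bys follow the Low, Medium, High order. This is a sketch on made-up budgets, not a replacement for the cell above.

```python
import pandas as pd

# Made-up budgets; in the notebook this would be df_merged['production_budget'].
budgets = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="production_budget")

# Bin edges at the 25th and 75th percentiles, matching the tier rule used above.
tiers = pd.qcut(budgets, q=[0, 0.25, 0.75, 1.0], labels=["Low", "Medium", "High"])
print(tiers.value_counts().reindex(["Low", "Medium", "High"]))
```

If many films share the same budget, the quantile edges can coincide and `pd.qcut` raises an error; in that case the bin edges would need to be chosen differently.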

**Ratings and ROI Analysis**

Examining the impact of movie ratings on ROI and revenue. Creating scatter plots and calculating correlation coefficients. Segment analysis by budget or genre if relevant.

In [None]:
# Scatter plots for ratings vs revenue/ROI
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(data=df_merged, x='vote_average', y='total_revenue')
plt.title('Vote Average vs. Total Revenue')
plt.xlabel('Vote Average')
plt.ylabel('Total Revenue')
plt.subplot(1, 2, 2)
sns.scatterplot(data=df_merged, x='vote_average', y='roi')
plt.title('Vote Average vs. ROI')
plt.xlabel('Vote Average')
plt.ylabel('ROI')
plt.tight_layout()
plt.show()

# Correlation coefficients
correlation_rating_revenue = df_merged['vote_average'].corr(df_merged['total_revenue'])
correlation_rating_roi = df_merged['vote_average'].corr(df_merged['roi'])
print(f"Correlation between Vote Average and Total Revenue: {correlation_rating_revenue:.3f}")
print(f"Correlation between Vote Average and ROI: {correlation_rating_roi:.3f}")
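An average rating based on only a few votes is noisy and can distort these correlations. The sketch below recomputes the correlation using only titles above a vote threshold. It assumes the merged data carries a `vote_count` column from the TMDB table; `rating_correlation` and the threshold of 100 are hypothetical, and the rows are made up.

```python
import pandas as pd

def rating_correlation(df: pd.DataFrame, min_votes: int,
                       rating_col: str = "vote_average",
                       count_col: str = "vote_count",
                       target_col: str = "roi") -> float:
    """Pearson correlation between rating and target, using only titles with enough votes."""
    reliable = df[df[count_col] >= min_votes]
    return reliable[rating_col].corr(reliable[target_col])

# Made-up rows: the first two titles have extreme ratings backed by almost no votes.
toy = pd.DataFrame({
    "vote_average": [10.0, 1.0, 6.0, 7.0, 8.0],
    "vote_count": [1, 2, 500, 800, 900],
    "roi": [-0.5, 3.0, 1.0, 2.0, 3.0],
})
corr_all = rating_correlation(toy, min_votes=0)
corr_reliable = rating_correlation(toy, min_votes=100)
print(f"All titles: {corr_all:.3f}, titles with >= 100 votes: {corr_reliable:.3f}")
```

Comparing the two values at several thresholds gives a sense of how much the conclusion depends on thinly rated titles.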

**Temporal Trends in Movie Performance**

Analyzing trends in average ROI over time using a line plot and a centered 5-year rolling average. Identifying any long-term patterns.

In [None]:
# ROI over time
roi_by_year = df_merged.groupby('release_year')['roi'].mean()
plt.figure(figsize=(12, 6))
sns.lineplot(x=roi_by_year.index, y=roi_by_year.values)
plt.title('Average ROI Over Time')
plt.xlabel('Release Year')
plt.ylabel('Average ROI')
plt.tight_layout()
plt.show()

# Rolling average
rolling_avg_window = 5
rolling_avg = roi_by_year.rolling(window=rolling_avg_window, center=True).mean()
plt.figure(figsize=(12, 6))
sns.lineplot(x=roi_by_year.index, y=roi_by_year.values, label='Average ROI')
sns.lineplot(x=rolling_avg.index, y=rolling_avg.values, label=f'{rolling_avg_window}-Year Rolling Average')
plt.title('Average ROI Over Time with Rolling Average')
plt.xlabel('Release Year')
plt.ylabel('Average ROI')
plt.legend()
plt.tight_layout()
plt.show()
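A centered rolling mean with the default settings is undefined wherever the window is incomplete, so the first and last years of the smoothed line are missing. The sketch below, on made-up values, shows the effect and one possible adjustment using `min_periods`.

```python
import pandas as pd

# Made-up yearly averages standing in for roi_by_year.
roi_by_year = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=range(2010, 2016))

# Default: every window needs all 5 values, so the edges become NaN.
strict = roi_by_year.rolling(window=5, center=True).mean()

# min_periods=3 accepts shorter windows at the edges.
lenient = roi_by_year.rolling(window=5, center=True, min_periods=3).mean()

print(pd.DataFrame({"roi": roi_by_year, "strict": strict, "lenient": lenient}))
```

The window counts rows rather than calendar years, so if some years have no films the index may need to be reindexed to a continuous range first.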

# 5. Visualization 
**Business Recommendations with Supporting Visualizations**

Presenting three concrete business recommendations based on the analysis. Supporting each recommendation with clear, well-formatted visualizations and concise explanations.

## Recommendation 1: Focus on High-ROI Genres

Certain genres consistently deliver higher average ROI. The studio should prioritize producing films in these genres to maximize profitability.


In [None]:
# Visualize average ROI by genre
plt.figure(figsize=(12, 6))
sns.barplot(x=genre_roi.index, y=genre_roi.values, palette='viridis')
plt.title('Average ROI by Genre')
plt.xlabel('Genre ID')
plt.ylabel('Average ROI')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Recommendation 2: Optimize Production Budgets

Higher budgets do not guarantee higher ROI. The studio should carefully allocate budgets, targeting the "medium" tier for optimal balance between risk and reward.


In [None]:
# Visualize ROI by budget tier
plt.figure(figsize=(10, 6))
sns.boxplot(x='budget_tier', y='roi', data=df_merged, order=['Low', 'Medium', 'High'])
plt.title('ROI Distribution by Budget Tier')
plt.xlabel('Budget Tier')
plt.ylabel('ROI')
plt.tight_layout()
plt.show()

## Recommendation 3: Leverage Ratings and Monitor Trends

While ratings have a weak correlation with ROI, higher-rated movies tend to earn more revenue. The studio should aim for quality to boost revenue and monitor temporal trends to capitalize on emerging genres.


In [None]:
# Visualize ratings vs revenue and ROI
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(data=df_merged, x='vote_average', y='total_revenue')
plt.title('Vote Average vs. Total Revenue')
plt.xlabel('Vote Average')
plt.ylabel('Total Revenue')
plt.subplot(1, 2, 2)
sns.scatterplot(data=df_merged, x='vote_average', y='roi')
plt.title('Vote Average vs. ROI')
plt.xlabel('Vote Average')
plt.ylabel('ROI')
plt.tight_layout()
plt.show()

# Visualize ROI trend over time
plt.figure(figsize=(12, 6))
sns.lineplot(x=roi_by_year.index, y=roi_by_year.values)
plt.title('Average ROI Over Time')
plt.xlabel('Release Year')
plt.ylabel('Average ROI')
plt.tight_layout()
plt.show()

---

# Summary

This analysis provides actionable insights for a new movie studio:

- **Target high-ROI genres** for production focus.
- **Optimize budget allocation** to maximize ROI, especially in the medium budget tier.
- **Aim for quality and monitor trends** to boost revenue and adapt to changing audience preferences.

## Detailed Explanation on Further Steps

**Targeted Genre Production:** Prioritize genres with consistently high ROI, considering the number of movies in each genre and their potential for profitability. Investigate the reasons for success in these genres further.

**Budget Allocation Strategy:** Refine the budget allocation strategy by segmenting movies by genre and performing more robust regression analysis to identify optimal budget levels for different genres. Consider other factors beyond budget, such as marketing and distribution.
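One possible starting point for the regression mentioned above is a log-log fit, where the slope estimates how revenue scales with budget. The sketch below is an assumption, not the notebook's method: `loglog_fit` is a hypothetical helper, and the toy data is generated from a known power law so the fit can be checked. In practice the function could be applied to each genre's subset of `df_merged`.

```python
import numpy as np
import pandas as pd

def loglog_fit(df: pd.DataFrame, x_col: str = "production_budget",
               y_col: str = "total_revenue") -> tuple:
    """Fit log10(y) = slope * log10(x) + intercept on rows where both values are positive."""
    valid = df[(df[x_col] > 0) & (df[y_col] > 0)]
    slope, intercept = np.polyfit(np.log10(valid[x_col]), np.log10(valid[y_col]), deg=1)
    return slope, intercept

# Synthetic data generated as revenue = 3 * budget^0.8, so the true slope is 0.8.
budget = np.array([1e6, 5e6, 2e7, 8e7, 1.5e8])
toy = pd.DataFrame({"production_budget": budget, "total_revenue": 3.0 * budget ** 0.8})

slope, intercept = loglog_fit(toy)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A slope below 1 would indicate that revenue grows less than proportionally with budget, which is the kind of pattern the budget-tier recommendation relies on.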
