# Movie Insights Analysis
  
**Group Members:** Mathews Wandera, Tinah, Diana, Pacificah, Night, Frank  
**Branch Owners:**  
- `Mathews-Tableau`  
- `Tinah-Presentation`  
- `Diana-Data_Preparation`  
- `Pacificah-EDA`  
- `Night-Visualization`  
- `Frank-README`  

---

### Objective
To analyze multiple movie-related datasets from various platforms (IMDb, Rotten Tomatoes, The Numbers, TMDb, and Box Office Mojo) and provide strategic insights to support decision-making in the film industry.

## 1. Business Understanding

### Problem Statement
The movie industry produces a vast number of films every year, but only a few turn out to be box office hits or critical successes. Stakeholders—such as producers, distributors, and marketing teams—require data-driven insights to inform their decisions. By analyzing trends in budgets, genres, ratings, and revenues, we aim to uncover factors that contribute to a movie's commercial and critical success.

### Project Goal
This project seeks to:
- Identify the key characteristics of successful movies.
- Understand how different platforms rate movies (IMDb, Rotten Tomatoes, etc.).
- Analyze financial patterns like budget vs. revenue.
- Provide data-driven recommendations for improving the profitability and impact of future productions.

### Key Questions
- What genres perform best in terms of revenue and ratings?
- How do production budgets relate to revenue or critical scores?
- Are there specific patterns in release dates that impact success?
- What are the most profitable platforms or combinations of features?

### Target Audience
- Film producers and investors
- Distribution and marketing teams
- Data-driven creative teams in media
- Streaming platform analysts

## 2. Data Understanding
We'll now systematically explore each dataset to understand:

- Structure (columns and data types)

- Sample data (via .head())

- Missing values

- Duplicates

- Basic statistics (via .describe())

We'll do this for:

**bom.movie_gross.csv.gz**

**rt.movie_info.tsv.gz**

**rt.reviews.tsv.gz**

**tmdb.movies.csv.gz**

**tn.movie_budgets.csv.gz**

Key tables from `im.db`: movie_basics, movie_ratings, principals, etc.
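
The checks listed above can be bundled into one small helper so that every dataset is inspected the same way. This is a minimal sketch rather than the notebook's own code: `summarize` is a hypothetical helper and the toy DataFrame holds made-up values standing in for a loaded dataset.

```python
import pandas as pd


def summarize(df: pd.DataFrame, name: str) -> dict:
    """Collect the basic structural facts about one DataFrame."""
    return {
        "name": name,
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy stand-in for one of the loaded datasets (values are made up).
toy = pd.DataFrame(
    {
        "title": ["A", "B", "B", "C"],
        "gross": [100.0, 50.0, 50.0, None],
    }
)

report = summarize(toy, "toy")
print(report)
print(toy.head())      # sample rows
print(toy.describe())  # basic statistics for the numeric columns
```

The same call could be repeated for each loaded frame, which keeps the missing-value and duplicate counts comparable across sources.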



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import zipfile
import os

## Confirm available files

In [3]:

# Confirm available files
print(" Contents of 'datasets':")
print(os.listdir('datasets'))

 Contents of 'datasets':
['tmdb.movies.csv.gz', 'im.db.zip', 'rt.movie_info.tsv.gz', '.gitkeep', 'bom.movie_gross.csv.gz', 'rt.reviews.tsv.gz', 'im.db', '.gitignore', 'tn.movie_budgets.csv.gz']


## Load external CSV/TSV datasets

In [4]:
# Load external CSV/TSV datasets
bom = pd.read_csv("datasets/bom.movie_gross.csv.gz")
rt_info = pd.read_csv("datasets/rt.movie_info.tsv.gz", sep="\t")
rt_reviews = pd.read_csv("datasets/rt.reviews.tsv.gz", sep="\t", encoding="latin1")  # latin1 works around an encoding error with the default UTF-8
tmdb = pd.read_csv("datasets/tmdb.movies.csv.gz")
tn_budgets = pd.read_csv("datasets/tn.movie_budgets.csv.gz")
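
The `encoding="latin1"` argument used for the reviews file can be chosen by a quick check rather than by trial and error. The sketch below uses only the standard library and a throwaway gzip file; `pick_encoding` is a hypothetical helper, and the sample file content is made up.

```python
import gzip
import os
import tempfile


def pick_encoding(path, candidates=("utf-8", "latin1")):
    """Return the first candidate encoding that decodes the whole gzip file."""
    with gzip.open(path, "rb") as fh:
        raw = fh.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates} can decode {path}")


# Build a small gzip file containing a byte that is invalid as UTF-8.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(sample_path, "wb") as fh:
    fh.write("id\treview\n1\tcaf\u00e9\n".encode("latin1"))

chosen = pick_encoding(sample_path)
print(chosen)
```

Because latin1 maps every byte to a character, it never raises a decoding error. It is a safe last resort, but it can silently mis-render text that was written in another encoding.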


## Unzip im.db.zip

In [5]:
if not os.path.exists("datasets/im.db"):
    with zipfile.ZipFile("datasets/im.db.zip", 'r') as zip_ref:
        zip_ref.extractall("datasets")

In [6]:
# Connect to the SQLite database
conn = sqlite3.connect("datasets/im.db")

# Preview available tables
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn)
print("\n Tables in im.db:")
print(tables)


 Tables in im.db:
            name
0   movie_basics
1      directors
2      known_for
3     movie_akas
4  movie_ratings
5        persons
6     principals
7        writers
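
Beyond the table names, it helps to know each table's columns before writing queries. The following is a minimal sketch using SQLite's `PRAGMA table_info` on an in-memory database. The two tables and their columns are made up for illustration and are not the real `im.db` schema; `table_columns` is a hypothetical helper.

```python
import sqlite3

# In-memory stand-in for im.db; the columns below are made up for illustration.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movie_basics (movie_id TEXT, title TEXT, year INTEGER)")
demo.execute("CREATE TABLE movie_ratings (movie_id TEXT, rating REAL)")


def table_columns(connection, table):
    """Return (column name, declared type) pairs for one table."""
    # PRAGMA statements cannot take bound parameters, so pass trusted names only.
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]


table_names = [
    row[0]
    for row in demo.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
schema = {name: table_columns(demo, name) for name in table_names}
for name, columns in schema.items():
    print(name, columns)
```

Pointing the same helper at the real connection would list the actual columns of `movie_basics`, `movie_ratings`, and the other tables.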


## 3. Data Preparation

Now that we have successfully loaded all datasets and explored their structure, we'll proceed with data preparation. This involves:

- Merging key datasets into a master data table.
- Cleaning and standardizing fields.
- Saving the final cleaned dataset for EDA and Tableau use.
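
As a rough illustration of the merging step, the sketch below joins two toy frames on a normalized title key and uses `indicator=True` to count how many rows found a match. The frames, their values, and the `add_title_key` helper are hypothetical. Matching on lower-cased, stripped titles is one possible choice, not necessarily the approach used later in this project.

```python
import pandas as pd

# Toy stand-ins for two cleaned frames (values are made up).
left = pd.DataFrame(
    {"movie_title": ["Alpha", "Beta", "Gamma"], "domestic_gross": [10.0, 20.0, 30.0]}
)
right = pd.DataFrame(
    {"movie_title": ["alpha ", "Beta", "Delta"], "production_budget": [5.0, 8.0, 12.0]}
)


def add_title_key(df):
    """Return a copy with a lower-cased, whitespace-stripped join key."""
    out = df.copy()
    out["title_key"] = out["movie_title"].str.strip().str.lower()
    return out


merged = add_title_key(left).merge(
    add_title_key(right)[["title_key", "production_budget"]],
    on="title_key",
    how="left",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)
matched = int((merged["_merge"] == "both").sum())
print(merged)
print(f"{matched} of {len(merged)} titles matched")
```

Checking the `_merge` column before dropping it shows how much data a title-based join loses, which matters when titles differ in punctuation or casing across sources.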


## Step 1 – Prepare Individual DataFrames

### Select and Rename Relevant Columns

We’ll simplify each dataset and rename columns for consistency before merging. This ensures our final dataset is clean and easy to work with.


In [8]:
# Check column names in Rotten Tomatoes info dataset
print(rt_info.columns.tolist())


['id', 'synopsis', 'rating', 'genre', 'director', 'writer', 'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime', 'studio']


### 1: BOM (Box Office Mojo)

In [9]:
# BOM dataset (Box Office Mojo)
bom_clean = bom[['title', 'domestic_gross', 'foreign_gross', 'year']].copy()
bom_clean.rename(columns={'title': 'movie_title'}, inplace=True)


### 2: TN (The Numbers) Budgets

In [10]:
# TN Budgets dataset
tn_budgets_clean = tn_budgets[['movie', 'production_budget', 'worldwide_gross', 'release_date']].copy()
tn_budgets_clean.rename(columns={'movie': 'movie_title'}, inplace=True)
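
If the money columns turn out to be stored as formatted strings such as `"$1,000,000"`, they need converting before any arithmetic. That is an assumption here, since the notebook has not yet shown their dtypes. A minimal sketch of such a conversion follows; `parse_money` is a hypothetical helper and the sample values are made up.

```python
def parse_money(value):
    """Convert a string such as "$1,234,567" to a float; return None if blank."""
    if value is None:
        return None
    text = str(value).replace("$", "").replace(",", "").strip()
    if not text or text.lower() == "nan":
        return None
    return float(text)


# Made-up sample values covering the formatted, plain, and missing cases.
samples = ["$425,000,000", "1,200", " $0 ", "", None]
parsed = [parse_money(s) for s in samples]
print(parsed)
```

On a DataFrame, the helper could be applied with something like `tn_budgets_clean['production_budget'].map(parse_money)`, again assuming the column holds strings.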


### 3: TMDb

In [11]:
# TMDb dataset
tmdb_clean = tmdb[['title', 'id', 'popularity', 'vote_average', 'vote_count']].copy()
tmdb_clean.rename(columns={'title': 'movie_title', 'id': 'tmdb_id'}, inplace=True)


### 4: Rotten Tomatoes (info)

In [12]:
# Rotten Tomatoes info dataset
rt_info_clean = rt_info[['genre', 'rating', 'studio', 'theater_date', 'runtime']].copy()

# rt_info has no title column, so nothing can be renamed to movie_title here; any join has to happen later, if possible
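
The column list printed earlier shows that `rt_info` has an `id` column, which the selection above drops. If the reviews file shares that `id`, keeping it would allow a later join. That is an assumption, since the notebook does not print the reviews file's columns. The sketch below uses toy frames with made-up values to show the idea.

```python
import pandas as pd

# Toy stand-ins; only the `id` column of rt_info is confirmed by the notebook output.
info = pd.DataFrame({"id": [1, 2, 3], "genre": ["Drama", "Comedy", "Action"]})
reviews = pd.DataFrame({"id": [1, 1, 3], "review": ["good", "fine", "weak"]})  # assumed columns

# Count reviews per movie id, then attach the count to the info frame.
review_counts = reviews.groupby("id").size().rename("n_reviews").reset_index()
info_with_counts = info.merge(review_counts, on="id", how="left")

# Movies without any review get a count of zero instead of NaN.
info_with_counts["n_reviews"] = info_with_counts["n_reviews"].fillna(0).astype(int)
print(info_with_counts)
```

Whether to keep `id` in `rt_info_clean` depends on whether the Rotten Tomatoes data ends up in the master table at all, since it has no title to match against the other sources.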


## Step 2 – Load Tables from the SQLite Database
We'll extract and preview the following key tables:

- movie_basics – core movie metadata

- movie_ratings – IMDb average rating + votes

- principals – cast/crew info

- (optionally) persons, directors, writers, etc., if needed later



### 5: Connect to the SQLite DB and List Tables

In [13]:
# Connect to the SQLite database
conn = sqlite3.connect('datasets/im.db')

# Check available tables
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
pd.read_sql(tables_query, conn)


            name
0   movie_basics
1      directors
2      known_for
3     movie_akas
4  movie_ratings
5        persons
6     principals
7        writers
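
Once the tables are known, core metadata and ratings can be combined in a single query instead of being loaded separately. The sketch below runs a `LEFT JOIN` on an in-memory database. The column names (`movie_id`, `primary_title`, `genres`, `averagerating`, `numvotes`) and the rows are assumptions for illustration and should be checked against the real `im.db` schema.

```python
import sqlite3

# In-memory stand-in for im.db with assumed column names and made-up rows.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE movie_basics (movie_id TEXT PRIMARY KEY, primary_title TEXT, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES ('tt1', 'Alpha', 'Drama'), ('tt2', 'Beta', 'Comedy');
    INSERT INTO movie_ratings VALUES ('tt1', 7.5, 1200);
    """
)

# LEFT JOIN keeps every movie, including those without a rating.
query = """
SELECT b.movie_id, b.primary_title, b.genres, r.averagerating, r.numvotes
FROM movie_basics AS b
LEFT JOIN movie_ratings AS r ON r.movie_id = b.movie_id
ORDER BY b.movie_id;
"""
rows = demo.execute(query).fetchall()
for row in rows:
    print(row)
```

Against the real database, the same query string could be passed to `pd.read_sql(query, conn)` to get a DataFrame. An `INNER JOIN` would instead drop unrated movies, which may or may not be desirable for the analysis.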
