# Movie Success Predictor
## Notebook 01: Data Exploration

### Objective
In this notebook, we:
- Load the TMDB movies and credits datasets
- Understand dataset structure and columns
- Identify potential target variables and features
- Perform initial inspection without modifying data

> Note: No cleaning or feature engineering is done in this notebook.

## Problem Context

This project aims to predict movie success using historical movie data.
Two prediction tasks are addressed in later phases:

1. **Hit / Flop Classification**
   - Based on financial performance using Return on Investment (ROI)

2. **Rating Prediction**
   - Predicting TMDB audience ratings (`vote_average`) on a 0–10 scale

The purpose of this phase is to understand the structure, scope, and quality
of the raw data before any cleaning or feature engineering is performed.


## Step 1: Load Required Libraries
We import libraries required for data handling and numerical analysis.


In [1]:
import pandas as pd
import numpy as np
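import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Show all columns and up to 100 rows when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)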

print("Libraries imported successfully.")

Libraries imported successfully.


## Step 2: Load the TMDB Datasets

In this step, we load the raw TMDB movies and credits datasets from the `data/raw/` directory.
These datasets will be used throughout the project for exploration, cleaning, and modeling.


In [2]:
# Load TMDB datasets
movies = pd.read_csv("../data/raw/tmdb_5000_movies.csv")
credits = pd.read_csv("../data/raw/tmdb_5000_credits.csv")
print("DataFrames loaded successfully.")

DataFrames loaded successfully.


## Step 3: Check Dataset Shape

In this step, we check the number of rows and columns in each dataset.
This helps verify data completeness and understand dataset size.

In [3]:
print(f"Movies dataset shape: rows: {movies.shape[0]}, columns: {movies.shape[1]}")
print(f"Credits dataset shape: rows: {credits.shape[0]}, columns: {credits.shape[1]}")

Movies dataset shape: rows: 4803, columns: 20
Credits dataset shape: rows: 4803, columns: 4


## Step 4: Dataset Structure and Data Types

In this step, we inspect the structure of both datasets to understand:
- Data types of each column
- Presence of missing values
- Overall memory usage

> Note: `DataFrame.info()` already reports the data type of every column,
> so no separate data type inspection is needed at this stage.

In [4]:
print("Movies dataset info:")
movies.info()  # info() prints its report directly and returns None
print()
print("Credits dataset info:")
credits.info()

Movies dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  

### Observations

**Movies Dataset**
- Contains 4803 movies with 20 columns.
- Numerical features include:
  - `budget`, `revenue`, `popularity`, `runtime`, `vote_average`, `vote_count`
- Categorical/text features include:
  - `genres`, `production_companies`, `spoken_languages`, `overview`, etc.
- Missing values are present in:
  - `homepage`
  - `overview`
  - `release_date`
  - `runtime`
  - `tagline`
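
The missing-value list above can be read off `info()`, but a per-column count is often easier to scan. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the toy frame only stands in for `movies` (its values are illustrative, not taken from the dataset). It reads the data without modifying it.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `movies` (illustrative values only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [None, "http://example.com", None, None],
    "runtime": [100.0, np.nan, 95.0, 120.0],
})
result = missing_summary(toy)
print(result)
```

On the real data, the call would be `missing_summary(movies)`.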

**Credits Dataset**
- Contains cast and crew information for all 4803 movies.
- Key columns:
  - `movie_id` (used for merging)
  - `cast` (JSON-like structure)
  - `crew` (JSON-like structure)

At this stage, no cleaning or preprocessing is performed.
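
Because `movie_id` is the intended merge key, it may be worth checking that it lines up with `id` in the movies dataset before relying on a one-to-one join. Matching row counts alone do not establish that. The sketch below assumes the keys are `id` and `movie_id`, as the sample rows suggest; `check_one_to_one` is a hypothetical helper and the toy frames are illustrative.

```python
import pandas as pd


def check_one_to_one(left, right, left_key, right_key):
    """Report whether two frames match one-to-one on the given key columns."""
    left_unique = bool(left[left_key].is_unique)
    right_unique = bool(right[right_key].is_unique)
    same_keys = set(left[left_key]) == set(right[right_key])
    return {
        "left_unique": left_unique,
        "right_unique": right_unique,
        "same_keys": same_keys,
        "one_to_one": left_unique and right_unique and same_keys,
    }


# Toy frames standing in for `movies` and `credits` (illustrative values only)
toy_movies = pd.DataFrame({"id": [19995, 285, 7], "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({"movie_id": [285, 7, 19995], "cast": ["[]", "[]", "[]"]})

report = check_one_to_one(toy_movies, toy_credits, "id", "movie_id")
print(report)

# pandas can enforce the same property at merge time:
# validate="one_to_one" raises MergeError if either key column has duplicates
merged = toy_movies.merge(
    toy_credits, left_on="id", right_on="movie_id", validate="one_to_one"
)
```

On the real data, the call would be `check_one_to_one(movies, credits, "id", "movie_id")`.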

## Step 5: Preview Dataset Samples

In this step, we view sample rows from both datasets to understand how values are represented.


In [5]:
print("Sample data from Movies dataset:")
movies.head(2)


Sample data from Movies dataset:


| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |


In [6]:
print("Sample data from Credits dataset:")
credits.head(2)

Sample data from Credits dataset:


| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |


### Observations from Sample Rows

**Movies Dataset**
- Financial information such as `budget` and `revenue` is clearly present.
- Columns like `genres`, `keywords`, `production_companies`, and `spoken_languages`
  are stored as JSON-like strings.
- `release_date` is stored as a string and will need date parsing later.
- `vote_average` represents the movie rating (0–10 scale).
- `vote_count` indicates the number of users who rated the movie.
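
To illustrate what the JSON-like and date columns look like once decoded, here is a minimal sketch of what later phases might do. It assumes the strings are valid JSON (the sample rows suggest so, but this is not verified here) and works on a toy frame, so the raw data stays untouched. `extract_names` is a hypothetical helper; the genre string is assembled from objects visible in the sample rows.

```python
import json

import pandas as pd


def extract_names(json_text: str) -> list:
    """Decode a JSON list of {"id": ..., "name": ...} objects into a list of names."""
    return [item["name"] for item in json.loads(json_text)]


# Toy frame standing in for `movies` (illustrative values only)
toy = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        "[]",
    ],
    "release_date": ["2009-12-10", None],
})

# assign() returns a new frame, so `toy` itself is not modified
parsed = toy.assign(
    genre_names=toy["genres"].apply(extract_names),
    # errors="coerce" turns unparseable or missing dates into NaT
    release_date=pd.to_datetime(toy["release_date"], errors="coerce"),
)
print(parsed[["genre_names", "release_date"]])
```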

**Credits Dataset**
- `cast` and `crew` columns are stored as JSON-like strings.
- Each movie has multiple cast members and crew roles.
- Director information is embedded inside the `crew` column.
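
As a rough sketch of how director information could be pulled out of `crew` later on: the sample rows are truncated, so the key names `job` and `name` used below are assumptions about the crew entries rather than something this notebook has confirmed. `get_directors` is a hypothetical helper, and the sample string is made up for illustration.

```python
import json


def get_directors(crew_json: str) -> list:
    """Return names of crew entries whose "job" is "Director" (key names assumed)."""
    crew = json.loads(crew_json)
    return [member.get("name") for member in crew if member.get("job") == "Director"]


# Made-up crew string; real entries may carry different or additional keys
sample_crew = json.dumps([
    {"credit_id": "x1", "department": "Directing", "job": "Director", "name": "Person A"},
    {"credit_id": "x2", "department": "Editing", "job": "Editor", "name": "Person B"},
])

directors = get_directors(sample_crew)
print(directors)  # ['Person A']
```

Returning a list rather than a single name allows for movies with more than one credited director.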



## Step 6: Initial Target Variable Identification

At this stage, we conceptually identify potential target variables
for both classification and regression tasks without defining thresholds.

In [7]:
movies[['budget', 'revenue', 'vote_average','runtime']].describe()

| | budget | revenue | vote_average | runtime |
|---|---|---|---|---|
| count | 4803.0 | 4803.0 | 4803.0 | 4801.0 |
| mean | 29045040.0 | 82260640.0 | 6.092172 | 106.875859 |
| std | 40722390.0 | 162857100.0 | 1.194612 | 22.611935 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 790000.0 | 0.0 | 5.6 | 94.0 |
| 50% | 15000000.0 | 19170000.0 | 6.2 | 103.0 |
| 75% | 40000000.0 | 92917190.0 | 6.8 | 118.0 |
| max | 380000000.0 | 2787965000.0 | 10.0 | 338.0 |


- `budget` and `revenue` can be used to compute profit or ROI for hit/flop classification.
- `vote_average` represents the movie rating and will be used for regression.
- Statistical summaries help understand value ranges before defining targets.
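
For reference, a minimal sketch of how ROI could be derived from these two columns, assuming the common definition ROI = (revenue − budget) / budget. The source does not fix a formula or a hit/flop threshold, so none is applied here. Treating zero `budget` or zero `revenue` as missing is also an assumption, and `add_roi` is a hypothetical helper. The first two toy rows reuse the budget and revenue figures from the sample rows above.

```python
import pandas as pd


def add_roi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an `roi` column; rows with zero budget or revenue get NaN."""
    out = df.copy()
    # Assumption: a zero in either column means "unreported", not a true zero
    valid = (out["budget"] > 0) & (out["revenue"] > 0)
    budget = out["budget"].where(valid)    # NaN where the row is not usable
    revenue = out["revenue"].where(valid)
    out["roi"] = (revenue - budget) / budget
    return out


# First two rows use the sample values shown earlier; the third is a made-up zero-budget row
toy = pd.DataFrame({
    "title": ["Avatar", "Pirates of the Caribbean: At World's End", "Unknown budget"],
    "budget": [237000000, 300000000, 0],
    "revenue": [2787965087, 961000000, 5000000],
})
result = add_roi(toy)
print(result[["title", "roi"]])
```

Masking invalid rows before dividing avoids infinite ROI values from zero budgets.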

### Observations from Target-Related Columns

- **Budget**
  - Mean budget is approximately $29M.
  - Minimum budget is 0, indicating missing or unreported values.
  - The budget distribution is right-skewed: the mean (about $29M) is nearly twice the median ($15M), pulled up by a few very high-budget films.

- **Revenue**
  - Mean revenue is approximately $82M.
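  - At least a quarter of movies report zero revenue (the 25th percentile is 0), which may indicate:
    - Unreported revenue
    - Movies that did not perform commercially
  - Revenue has a much larger spread than budget (standard deviation of about $163M versus about $41M).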

- **Vote Average**
  - Mean rating is approximately 6.1.
  - Ratings range from 0 to 10.
  - The middle 50% of movies fall between 5.6 and 6.8 (the interquartile range), indicating a narrow rating band.
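
The observations about zero values and the quartile band can be quantified directly. This is a small sketch with hypothetical helper names (`zero_share`, `quartile_band`) and a made-up toy frame; it only reads the data.

```python
import pandas as pd


def zero_share(df: pd.DataFrame, columns: list) -> pd.Series:
    """Fraction of rows equal to 0 in each listed column."""
    return (df[columns] == 0).mean()


def quartile_band(series: pd.Series) -> tuple:
    """Return the 25th and 75th percentiles of a numeric series."""
    q1, q3 = series.quantile([0.25, 0.75])
    return float(q1), float(q3)


# Toy frame standing in for `movies` (illustrative values only)
toy = pd.DataFrame({
    "budget": [0, 0, 15000000, 40000000],
    "revenue": [0, 19170000, 92917190, 0],
    "vote_average": [5.6, 6.2, 6.8, 7.2],
})

shares = zero_share(toy, ["budget", "revenue"])
band = quartile_band(toy["vote_average"])
print(shares)
print(band)
```

On the real data, `zero_share(movies, ["budget", "revenue"])` would give the exact proportion of zero entries that the summary statistics only hint at.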


## Phase 1 Summary: Setup & Data Exploration

In this phase, we:
- Successfully loaded the TMDB movies and credits datasets.
- Verified dataset sizes: both datasets contain 4803 rows, which is consistent with (but does not by itself confirm) a one-to-one relationship.
- Inspected dataset structure, data types, and missing values.
- Explored sample rows to understand data representation.
- Identified potential target variables for:
  - Classification (movie success)
  - Regression (TMDB rating, `vote_average`)

No data cleaning or feature engineering was performed in this phase.
The dataset is now well understood and ready for preprocessing.
