# 01 — Data Exploration & TMDB API Validation
Professional exploration notebook for the TMDB Data Analyst project.

This notebook demonstrates:
- Validation of the raw → processed → clean pipeline
- Structural and content checks for API-extracted JSON files
- Basic exploratory data analysis (EDA) for recruiter-friendly inspection

The goal is to provide a transparent first step in the data engineering workflow and to ensure that TMDB source data is complete, correctly transformed, and ready for downstream analytics.

## 0. Environment & Project Path Resolution
This notebook resolves all paths dynamically from the current working directory, so it works when launched from anywhere inside the project folder. No absolute paths are hard-coded.

In [20]:
import pandas as pd
import os
import json
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

# === UNIVERSAL NOTEBOOK PATH RESOLUTION ===
# Resolves to the "tmdb-data-analyst" root when run from anywhere inside the project.

NOTEBOOK_ROOT = Path().resolve()

# Find the project root by checking the current directory first, then each
# parent in turn, until the folder name matches.
for candidate in (NOTEBOOK_ROOT, *NOTEBOOK_ROOT.parents):
    if candidate.name == "tmdb-data-analyst":
        PROJECT_ROOT = candidate
        break
else:
    raise RuntimeError("Project root 'tmdb-data-analyst' not found.")

RAW_DIR = PROJECT_ROOT / "data" / "raw"
PROC_DIR = PROJECT_ROOT / "data" / "processed"
CLEAN_DIR = PROJECT_ROOT / "data" / "clean"

print("NOTEBOOK_ROOT:", NOTEBOOK_ROOT)
print("PROJECT_ROOT:", PROJECT_ROOT)
print("RAW_DIR:", RAW_DIR)
print("PROC_DIR:", PROC_DIR)
print("CLEAN_DIR:", CLEAN_DIR)


NOTEBOOK_ROOT: C:\Users\casco\Desktop\tmdb-data-analyst\src\notebooks
PROJECT_ROOT: C:\Users\casco\Desktop\tmdb-data-analyst
RAW_DIR: C:\Users\casco\Desktop\tmdb-data-analyst\data\raw
PROC_DIR: C:\Users\casco\Desktop\tmdb-data-analyst\data\processed
CLEAN_DIR: C:\Users\casco\Desktop\tmdb-data-analyst\data\clean


In [21]:
import sys
sys.executable


'c:\\Users\\casco\\Desktop\\tmdb-data-analyst\\.venv\\Scripts\\python.exe'

In [22]:
import os
os.getcwd()


'c:\\Users\\casco\\Desktop\\tmdb-data-analyst\\src\\notebooks'
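The upward search used in this section can also be wrapped in a small reusable helper. The following is a minimal sketch, not part of the project code: the function name `find_project_root` and its parameters are hypothetical, and it assumes the project folder is still named `tmdb-data-analyst`.

```python
from pathlib import Path


def find_project_root(start: Path, name: str = "tmdb-data-analyst") -> Path:
    """Return the nearest directory called `name`, checking `start` itself first.

    Hypothetical helper for illustration; the notebook inlines this logic.
    """
    start = start.resolve()
    # Path.parents does not include the path itself, so prepend it explicitly.
    for candidate in (start, *start.parents):
        if candidate.name == name:
            return candidate
    raise RuntimeError(f"Project root {name!r} not found at or above {start}")
```

Calling `find_project_root(Path())` from the notebook folder would return the same directory printed above as `PROJECT_ROOT`. Matching on the folder name is simple but breaks if the repository is cloned under a different name.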

## 1. Raw JSON Overview
Inspection of the raw files extracted directly from TMDB.

In [23]:
raw_files = sorted(RAW_DIR.glob('*.json'))[:5]  # glob order is not guaranteed, so sort first
raw_files

[WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/raw/genres.json'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/raw/popular.json'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/raw/top_rated.json'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/raw/trending.json'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/raw/upcoming.json')]

In [24]:
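if raw_files:
    sample = raw_files[0]
    with open(sample, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Expressions inside if/else are not auto-displayed by the notebook,
    # so build a preview and leave it as the cell's last top-level expression.
    preview = data if isinstance(data, dict) else data[:3]
else:
    preview = 'No raw files found'
preview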
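Beyond previewing a single file, the structure of every raw file can be summarized in one pass. The sketch below is an illustration under stated assumptions: `summarize_json` is a hypothetical helper, and it assumes nothing about the TMDB payloads beyond their being valid JSON.

```python
import json
from pathlib import Path


def summarize_json(path: Path) -> dict:
    """Report the top-level type, size, and keys of a JSON file.

    Hypothetical helper for illustration only.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        # Iterating a dict yields its keys; JSON object keys are always strings.
        return {"type": "dict", "size": len(data), "keys": sorted(data)}
    if isinstance(data, list):
        # For a list of records, report the keys of the first record only.
        keys = sorted(data[0]) if data and isinstance(data[0], dict) else []
        return {"type": "list", "size": len(data), "keys": keys}
    return {"type": type(data).__name__, "size": 1, "keys": []}
```

In the notebook this could be applied with `{p.name: summarize_json(p) for p in raw_files}` to compare the five raw files side by side.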

## 2. Processed Layer Validation
Checks that the transformation scripts created valid intermediate JSON tables.

In [25]:
proc_files = list(PROC_DIR.glob('*.json'))
proc_files

[WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/processed/credits.json'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/processed/details.json'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/processed/genres.json')]

In [26]:
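# Note: the listing above shows credits.json, details.json and genres.json,
# but no movies.json, so this cell is expected to report the file as missing.
movies_path = PROC_DIR / 'movies.json'
if movies_path.exists():
    df_movies = pd.read_json(movies_path)
    result = df_movies.head()
else:
    result = 'movies.json not found'
result  # last top-level expression, so the notebook displays it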
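Listing the processed files shows that they exist, not that they are valid. A minimal sketch of a validity check follows, assuming only that each processed table is stored as a `*.json` file; the helper name `validate_json_dir` and its status labels are hypothetical.

```python
import json
from pathlib import Path


def validate_json_dir(directory: Path) -> dict:
    """Map each *.json file name to 'ok', 'empty', or an 'invalid: ...' message.

    Hypothetical helper for illustration only.
    """
    report = {}
    for path in sorted(directory.glob("*.json")):
        try:
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            report[path.name] = f"invalid: {exc}"
            continue
        # A falsy top-level value (e.g. [] or {}) is flagged as empty.
        report[path.name] = "ok" if data else "empty"
    return report
```

Running `validate_json_dir(PROC_DIR)` in the notebook would return one status per processed file, which makes a failed or partial transformation visible at a glance.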

## 3. Clean Layer Validation
These CSV files represent the final cleaned datasets ready for loading into analytical databases.

In [27]:
clean_files = list(CLEAN_DIR.glob('*.csv'))
clean_files

[WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/clean/cast.csv'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/clean/crew.csv'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/clean/genres.csv'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/clean/movies.csv'),
 WindowsPath('C:/Users/casco/Desktop/tmdb-data-analyst/data/clean/movie_genres.csv')]

In [28]:
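clean_movies_path = CLEAN_DIR / 'movies.csv'
if clean_movies_path.exists():
    df_clean_movies = pd.read_csv(clean_movies_path)
    result = df_clean_movies.head()
else:
    result = 'movies.csv not found'
result  # last top-level expression, so the notebook displays it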
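Tables destined for an analytical database usually need a key column that is unique and never null. The sketch below shows one way to check that with pandas. The helper `check_primary_key` is hypothetical, and the column name `id` is an assumption: the actual columns of `movies.csv` are not shown in this notebook, so the example runs on a tiny synthetic sample instead.

```python
import io

import pandas as pd


def check_primary_key(df: pd.DataFrame, key: str) -> dict:
    """Count rows, null keys, and repeated key values for one column.

    Hypothetical helper for illustration only.
    """
    return {
        "rows": len(df),
        "null_keys": int(df[key].isna().sum()),
        # duplicated() marks every occurrence after the first one.
        "duplicate_keys": int(df[key].duplicated().sum()),
    }


# Synthetic sample (not project data): one repeated id and one missing id.
sample = pd.read_csv(io.StringIO("id,title\n1,A\n2,B\n2,C\n,D\n"))
key_report = check_primary_key(sample, "id")
print(key_report)
```

A clean table would report zero for both `null_keys` and `duplicate_keys`.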

## 4. Basic Exploratory Data Analysis (EDA)
A preliminary overview to validate dataset completeness.

In [29]:
if 'df_clean_movies' in locals():
    summary = df_clean_movies.describe(include='all')
else:
    summary = 'No clean dataset loaded'
summary  # last top-level expression, so the notebook displays it
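`describe()` reports counts per column, but completeness is often easier to read as a share of missing values. The following is a minimal sketch using a synthetic sample, since the real columns of the clean dataset are not shown here; `missing_share` is a hypothetical helper name and the sample column names are placeholders.

```python
import io

import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first.

    Hypothetical helper for illustration only.
    """
    # isna() gives a boolean frame; the column mean is the share of True values.
    return df.isna().mean().sort_values(ascending=False)


# Synthetic sample (not project data) with gaps in two columns.
sample = pd.read_csv(io.StringIO("id,title,runtime\n1,A,100\n2,,\n3,C,\n"))
share = missing_share(sample)
print(share)
```

In the notebook, `missing_share(df_clean_movies)` would rank the clean columns by how incomplete they are.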

## Final Notes
This notebook checks that the expected files are present at each ETL stage (raw, processed, clean) and previews their contents as a first check of structural integrity and analytic readiness. It forms the foundation for downstream steps including dimensional modeling, SQL analytics, and BI dashboard development.