# 01 – Data Ingestion & Exploration

**Project:** StreamSense – Netflix Hit Predictor  
**Goal of this notebook:**
- Load the raw Netflix CSV dataset from DBFS
- Create a managed Delta table: `netflix_raw`
- Perform initial exploratory analysis:
  - Schema & data types
  - Row counts, duplicates
  - Nulls / missing data
  - Basic distributions for key fields (type, rating, release_year, etc.)

This notebook should be safe to re-run end-to-end.
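
The exploratory checks listed above can be previewed without a cluster. What follows is a minimal, Spark-free sketch using only the Python standard library. The three sample rows and the `profile_rows` helper are made up for illustration and are not part of the Netflix dataset or of this notebook's pipeline. It counts rows, missing values (empty strings treated as missing), duplicate keys, and value frequencies for one column.

```python
import csv
import io
from collections import Counter

# Hypothetical sample, invented for illustration only (not real dataset rows).
SAMPLE_CSV = """show_id,type,director,release_year
s1,Movie,Jane Doe,2019
s2,TV Show,,2020
s2,Movie,,2020
"""


def profile_rows(rows, key="show_id", category="type"):
    """Return basic profiling stats for a list of dict rows."""
    columns = list(rows[0].keys()) if rows else []
    # A value is "missing" if it is None or an empty string.
    missing = {c: sum(1 for r in rows if r[c] in (None, "")) for c in columns}
    # Count key values that occur more than once.
    key_counts = Counter(r[key] for r in rows)
    duplicate_keys = sum(1 for n in key_counts.values() if n > 1)
    # Frequency of each category value, most common first.
    distribution = Counter(r[category] for r in rows).most_common()
    return {
        "row_count": len(rows),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
        "distribution": distribution,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
stats = profile_rows(rows)
print(stats)
```

The Spark cells in this notebook compute the same kinds of quantities at scale. The plain-Python version is only meant to make the definitions of "missing" and "duplicate" concrete.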

In [0]:
# 1. Configuration

# Update this if your filename/path is different
DATA_PATH = "dbfs:/FileStore/netflix_data.csv"  # e.g. netflix_data.csv or netflix_titles.csv

print(f"Using data from: {DATA_PATH}")

In [0]:
from pyspark.sql import functions as F

# 2. Load raw CSV into Spark DataFrame

df_raw = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(DATA_PATH)
)

print(f"Row count (raw): {df_raw.count():,}")
df_raw.printSchema()

In [0]:
# 3. Save as a managed Delta table for reuse

table_name = "netflix_raw"

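(
    df_raw
    .write
    .mode("overwrite")
    .format("delta")
    # Let re-runs replace the table schema, e.g. if inferSchema yields different types
    .option("overwriteSchema", "true")
    .saveAsTable(table_name)
)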

print(f"Saved table: {table_name}")

In [0]:
from pyspark.sql import functions as F

# Read back from the Delta table so exploration runs against the persisted data
df = spark.table("netflix_raw")

print(f"Rows: {df.count():,}")
print(f"Columns: {len(df.columns)}")
print(df.columns)

In [0]:
display(df.limit(10))

In [0]:
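# 4. Missing-value counts per column
# Empty strings count as missing, but only string columns are compared to ""
# (comparing a numeric column to "" can fail under ANSI SQL mode).

string_cols = {name for name, dtype in df.dtypes if dtype == "string"}

def is_missing(c):
    col = F.col(c)
    return (col.isNull() | (col == "")) if c in string_cols else col.isNull()

null_counts = df.select([
    F.sum(F.when(is_missing(c), 1).otherwise(0)).alias(c)
    for c in df.columns
])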

display(null_counts)

In [0]:
# 5. Duplicate check: number of distinct show_id values that appear more than once
if "show_id" in df.columns:
    dup_count = (
        df.groupBy("show_id")
          .count()
          .filter(F.col("count") > 1)
          .count()
    )
    print(f"Duplicate show_id values: {dup_count}")
else:
    print("No 'show_id' column found – will choose a different key later.")

In [0]:
# 6. Value counts for key fields, most frequent first
for col_name in ["type", "rating", "release_year", "country"]:
    if col_name in df.columns:
        print(f"\nValue counts for '{col_name}':")
        display(
            df.groupBy(col_name)
              .count()
              .orderBy(F.col("count").desc())
        )
    else:
        print(f"\nColumn '{col_name}' not found in this dataset.")

## Initial Findings

- **Row count:** `<fill in from output>`
- **Columns:** `<list key ones e.g. type, title, release_year, rating, ...>`
- **Null patterns:**
  - `director` has many nulls
  - `country` partially missing
- **Duplicates:**
  - `show_id` appears unique (if present); otherwise, record how many values are duplicated
- **Interesting distributions:**
  - Most titles are Movies rather than TV Shows
  - Release years range from X to Y
  - Ratings concentrated around TV-MA / TV-14, etc.

These findings will drive:
- How we define our `is_hit` label
- Which features are most promising for the model
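
Once the cells above have run, the placeholders in this section can be filled in from the computed values rather than by hand. The sketch below is a hypothetical helper (`render_findings` is not part of the notebook) that turns a few numbers into markdown bullets. It assumes the per-column missing counts are available as a plain dict, which in a real run could be obtained with something like `null_counts.first().asDict()`. The numbers passed in the example call are placeholders, not results from the dataset.

```python
def render_findings(row_count, null_counts, dup_count, year_range):
    """Build markdown bullets for the findings cell from computed values."""
    # Keep only columns that actually have missing values, worst first.
    missing = sorted(
        ((col, n) for col, n in null_counts.items() if n > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    lines = [f"- **Row count:** {row_count:,}", "- **Null patterns:**"]
    if missing:
        for col, n in missing:
            pct = 100 * n / row_count if row_count else 0.0
            lines.append(f"  - `{col}`: {n:,} missing ({pct:.1f}%)")
    else:
        lines.append("  - no missing values found")
    lines.append(f"- **Duplicates:** {dup_count:,} duplicated `show_id` values")
    lines.append(f"- **Release years:** {year_range[0]} to {year_range[1]}")
    return "\n".join(lines)


# Placeholder numbers for illustration only; substitute the real outputs.
example = render_findings(
    row_count=1000,
    null_counts={"director": 300, "country": 50, "title": 0},
    dup_count=0,
    year_range=(1990, 2020),
)
print(example)
```

Generating the bullets from the same values the notebook prints keeps the written findings in sync with the data whenever the notebook is re-run.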