# Netflix Data Collection

This notebook handles the initial data collection phase of the Netflix analysis pipeline:

1. Load netflix1.csv into a staging PostgreSQL table
2. Profile row counts, nulls, and duplicates
3. Export raw CSV to data/raw/netflix_raw.csv

## Prerequisites
- PostgreSQL database running
- Database credentials configured in src/utils.py
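The contents of src/utils.py are not shown in this notebook, so the following is only a minimal sketch of one way credentials could be supplied without hard-coding them. The helper name `build_postgres_url` and the default database name `netflix` are hypothetical; the `PG*` environment variable names follow the ones libpq reads.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=None):
    """Build a PostgreSQL connection URL from PG* environment variables.

    Hypothetical helper: the real src/utils.py may configure credentials
    differently. Defaults below are placeholders, not project settings.
    """
    env = os.environ if env is None else env
    # Percent-encode user and password so characters like '@' stay valid in a URL
    user = quote_plus(env.get("PGUSER", "postgres"))
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "netflix")  # assumed database name
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with explicit values instead of the real environment
demo_url = build_postgres_url(
    {"PGUSER": "u", "PGPASSWORD": "p@ss", "PGHOST": "db", "PGPORT": "5433", "PGDATABASE": "nf"}
)
```

A URL of this shape is what SQLAlchemy's `create_engine` accepts, which is presumably what a function like `get_engine` would wrap.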


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import sys
import os

# Add src directory to path for imports
sys.path.append('../src')

from utils import connect_db, get_engine, print_data_summary

# Enable autoreload for development
%load_ext autoreload
%autoreload 2

print("Libraries imported successfully!")


Libraries imported successfully!


## 1. Load Raw Netflix Data


In [2]:
# Load the raw Netflix dataset
raw_data_path = '../data/raw/netflix1.csv'

print(f"Loading data from: {raw_data_path}")
df_raw = pd.read_csv(raw_data_path)

print("\nData loaded successfully!")
print(f"Dataset shape: {df_raw.shape}")
print("\nColumn names:")
for i, col in enumerate(df_raw.columns, 1):
    print(f"{i:2d}. {col}")


Loading data from: ../data/raw/netflix1.csv

Data loaded successfully!
Dataset shape: (8790, 10)

Column names:
 1. show_id
 2. type
 3. title
 4. director
 5. country
 6. date_added
 7. release_year
 8. rating
 9. duration
10. listed_in
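The overview lists loading the CSV into a staging PostgreSQL table, but no cell in this notebook shows that step. A minimal sketch of how it might look with `DataFrame.to_sql` follows. The function name `load_to_staging` and the table name `netflix_staging` are hypothetical, and an in-memory SQLite connection stands in for PostgreSQL so the example runs without a database server.

```python
import sqlite3

import pandas as pd


def load_to_staging(df, con, table_name="netflix_staging"):
    """Write df to a staging table and return the row count found there.

    Hypothetical helper; table_name is an assumed name and must be a
    trusted constant because it is interpolated into the SQL text.
    """
    # Replace the table on each run so reruns of the notebook stay idempotent
    df.to_sql(table_name, con, if_exists="replace", index=False)
    count = pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {table_name}", con)
    return int(count["n"].iloc[0])


# Stand-in data and connection (SQLite instead of PostgreSQL)
demo = pd.DataFrame({"show_id": ["s1", "s2"], "title": ["A", "B"]})
con = sqlite3.connect(":memory:")
rows_loaded = load_to_staging(demo, con)
con.close()
```

Against PostgreSQL, the call would presumably be `load_to_staging(df_raw, get_engine())`, assuming `get_engine` returns a SQLAlchemy engine. Comparing the returned count with `len(df_raw)` gives a quick check that no rows were dropped.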


## 2. Initial Data Profiling


In [3]:
# Generate comprehensive data summary
print_data_summary(df_raw, "Raw Netflix Dataset Profile")



Raw Netflix Dataset Profile
Shape: (8790, 10)
Memory usage: 4.66 MB

Data types:
object    9
int64     1
Name: count, dtype: int64

Missing values:
Empty DataFrame
Columns: [Missing Count, Percentage]
Index: []
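The summary above covers shape, dtypes, and missing values, but not the duplicate check mentioned in the overview. A minimal sketch of that check is below. The helper `profile_duplicates` is hypothetical, and treating `show_id` as the unique key is an assumption based on the column name.

```python
import pandas as pd


def profile_duplicates(df, key="show_id"):
    """Count fully duplicated rows and repeated key values.

    Hypothetical helper; `key` is assumed to identify a title uniquely.
    """
    return {
        "rows": len(df),
        # Rows identical to an earlier row in every column
        "duplicate_rows": int(df.duplicated().sum()),
        # Rows whose key value already appeared, even if other columns differ
        "duplicate_keys": int(df[key].duplicated().sum()),
    }


# Small stand-in frame: one exact duplicate row, two repeated keys
demo = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s2", "s3", "s3"],
        "title": ["A", "B", "B", "C", "D"],
    }
)
result = profile_duplicates(demo)
```

In the notebook this would be called as `profile_duplicates(df_raw)`. A nonzero `duplicate_keys` with zero `duplicate_rows` would point to conflicting records for the same title, not to simple repeats.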



## 3. Export Raw Data


In [5]:
# Export to standardized raw data file
output_path = '../data/raw/netflix_raw.csv'
print(f"Exporting raw data to: {output_path}")

df_raw.to_csv(output_path, index=False)
print("Raw data exported successfully!")

# Verify export
if os.path.exists(output_path):
    file_size = os.path.getsize(output_path) / (1024 * 1024)  # MB
    print(f"   File size: {file_size:.2f} MB")

    # Quick verification load
    df_verify = pd.read_csv(output_path)
    print(f"   Verification: {len(df_verify):,} rows loaded from exported file")
    print(f"   Match with original: {'Yes' if len(df_verify) == len(df_raw) else 'No'}")


Exporting raw data to: ../data/raw/netflix_raw.csv
Raw data exported successfully!
   File size: 1.02 MB
   Verification: 8,790 rows loaded from exported file
   Match with original: Yes
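The check above compares row counts only, so a file with the right number of rows but altered columns or values would still pass. A slightly stricter round-trip check could look like the sketch below. The helper name `verify_roundtrip` is hypothetical, and a temporary file stands in for the real export path.

```python
import io
import os
import tempfile

import pandas as pd


def verify_roundtrip(df, path):
    """Reload the CSV at `path` and compare it with `df`.

    Hypothetical helper. DataFrame.equals also requires matching dtypes,
    so it suits frames that were themselves loaded with read_csv.
    """
    df_check = pd.read_csv(path)
    return {
        "same_shape": df_check.shape == df.shape,
        "same_columns": list(df_check.columns) == list(df.columns),
        "same_values": df_check.equals(df),
    }


# Stand-in frame loaded from CSV text, mirroring how df_raw is created
demo = pd.read_csv(io.StringIO("show_id,release_year\ns1,2020\ns2,2021\n"))
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    demo.to_csv(demo_path, index=False)
    checks = verify_roundtrip(demo, demo_path)
```

In the notebook the equivalent call would be `verify_roundtrip(df_raw, output_path)`. Any `False` entry would indicate that the export changed the data.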


---