# Data Cleaning - Streamly Case Study

## Overview
This notebook documents the data cleaning process for the Streamly recommendation system. I load three raw CSV files, identify data quality issues in each, and produce cleaned datasets ready for database ingestion.

## Raw Data Files
- **results.csv**: Financial/billing data (timestamps, account info, invoices, monetary values)
- **profiles.csv**: User profile information (profile IDs, preferences, kids flags, languages)
- **titles.csv**: Content catalog (show IDs, titles, categories, ratings, metadata)

## Cleaning Objectives
1. Handle missing values appropriately
2. Convert data types to correct formats
3. Remove duplicate records
4. Validate data ranges and constraints
5. Ensure referential integrity for database relationships
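
Objective 5 can be checked before ingestion with a simple anti-join. The sketch below is only an illustration: the helper `orphan_keys`, the `accounts` frame, and the idea that profiles carry a `client_id` foreign key are assumptions, since the actual key columns linking the three files are not spelled out in this notebook.

```python
import pandas as pd

def orphan_keys(child_keys: pd.Series, parent_keys: pd.Series) -> pd.Series:
    """Return the distinct non-null child keys with no match among the parent keys."""
    present = child_keys.dropna()
    return present[~present.isin(parent_keys)].drop_duplicates().reset_index(drop=True)

# Made-up sample data; the real key columns may be named differently.
accounts = pd.DataFrame({"client_id": ["c1", "c2", "c3"]})
profiles_sample = pd.DataFrame({
    "profile_id": ["p1", "p2", "p3"],
    "client_id": ["c1", "c4", None],  # "c4" has no parent row, None is simply missing
})

orphans = orphan_keys(profiles_sample["client_id"], accounts["client_id"])
print(orphans.tolist())
```

Any key returned here would violate a foreign-key constraint on load, so it is worth resolving or quarantining those rows first.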

In [1]:
import pandas as pd

In [2]:
results = pd.read_csv("../data/results.csv", header=None)

In [3]:
results.columns = [
    "timestamp", "client_id", "email", "invoice_id", "period",
    "product", "entry_type", "environment", "baseline_value",
    "actual_value", "currency"
]

In [4]:
profiles = pd.read_csv("../data/profiles.csv")

## Data Cleaning Steps

### 1. Results Dataset (results.csv)
**Issues Identified:**
- Timestamp fields are stored as strings and need datetime conversion
- Period field contains invalid date formats; `pd.to_datetime(..., errors='coerce')` turns values that fail to parse into `NaT`
- Monetary values (baseline_value, actual_value) are stored as strings; values that fail numeric conversion become `NaN` under coercion
- Missing product and entry_type values
- Inconsistent case and whitespace in categorical fields

**Actions Taken:**
- Convert timestamp and period columns to datetime format
- Convert baseline_value and actual_value to float for calculations
- Fill missing categorical values with "Unknown" to preserve row count
- Maintain client_id and email for account relationships
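
The inconsistent case and whitespace noted above could be handled with a small helper along these lines. This is a sketch under assumptions: `normalize_categoricals` is a hypothetical name, the sample values are made up, and lowercasing is only one possible canonical form.

```python
import pandas as pd

def normalize_categoricals(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Strip whitespace and lowercase the given text columns, leaving NaN untouched."""
    out = df.copy()
    for col in columns:
        # The .str accessor propagates missing values, so NaN stays NaN here
        out[col] = out[col].str.strip().str.lower()
    return out

# Made-up sample values for illustration only
sample = pd.DataFrame({
    "product": [" Premium", "premium ", None],
    "entry_type": ["Charge", " CHARGE ", "refund"],
})

cleaned = normalize_categoricals(sample, ["product", "entry_type"])
# Fill the placeholder after normalizing so "Unknown" keeps its capitalization
cleaned["product"] = cleaned["product"].fillna("Unknown")
print(cleaned)
```

Normalizing before `fillna("Unknown")` means variants of the same category collapse into one value while the placeholder stays recognizable.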

In [9]:
titles = pd.read_csv("../data/titles.csv")

### 2. Profiles Dataset (profiles.csv)
**Issues Identified:**
- Missing values in age_band, preferred_language, and preferences columns
- kids_profile field may have inconsistent values (0/1, true/false, etc.)
- Duplicate profile entries across accounts
- Preferences field stores comma-separated genres - needs to be preserved as-is for recommendation filtering
- Missing created_at timestamps for some profiles

**Actions Taken:**
- Keep missing values in preference fields (NULL) to allow conditional filtering in recommendations
- Normalize kids_profile to binary (0/1)
- Remove exact duplicate rows using `drop_duplicates()`
- Maintain data as-is for age_band to support demographic filtering
- Preserve comma-separated preferences for genre-based recommendations
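
A minimal sketch of the duplicate removal and kids_profile normalization described above. The mapping `KIDS_FLAG_MAP`, the helper `normalize_kids_flag`, the `profile_id` column name, and the sample rows are all assumptions; the encodings actually present should be inspected with `profiles["kids_profile"].unique()` before settling on a mapping.

```python
import pandas as pd

# Hypothetical mapping of raw encodings to 0/1. "1.0"/"0.0" cover the case
# where pandas read the column as float because of missing values.
KIDS_FLAG_MAP = {
    "1": 1, "1.0": 1, "true": 1, "yes": 1,
    "0": 0, "0.0": 0, "false": 0, "no": 0,
}

def _to_flag(value):
    """Map one raw value to 0/1, or pd.NA when it is missing or unrecognized."""
    if pd.isna(value):
        return pd.NA
    return KIDS_FLAG_MAP.get(str(value).strip().lower(), pd.NA)

def normalize_kids_flag(series: pd.Series) -> pd.Series:
    """Return a nullable-integer Series containing only 0, 1, or <NA>."""
    return series.map(_to_flag).astype("Int64")

# Made-up sample rows; the first two are exact duplicates
sample = pd.DataFrame({
    "profile_id": ["p1", "p1", "p2", "p3"],
    "kids_profile": [1, 1, "TRUE", "maybe"],
    "preferences": ["Drama,Comedy", "Drama,Comedy", None, "Kids"],
})

deduped = sample.drop_duplicates().reset_index(drop=True)
deduped["kids_profile"] = normalize_kids_flag(deduped["kids_profile"])
print(deduped)
```

Unrecognized values such as `"maybe"` become `<NA>` rather than being guessed, and the missing `preferences` value is left untouched, in line with the decision to keep NULLs for conditional filtering.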

### 3. Titles Dataset (titles.csv)
**Issues Identified:**
- Missing values in origin_region, language, and age_rating columns
- IMDb ratings potentially outside the valid range (0-10)
- Episode count is NULL for movies, which is expected for non-series content and should be preserved
- Category and sub_category may contain inconsistent casing/whitespace
- Duration field may have mixed formats (minutes as int, or with text)
- is_kids_content flag inconsistently populated

**Actions Taken:**
- Keep missing regions and languages (NULL) - will be filtered during recommendation queries
- Validate that IMDb ratings fall in the [0, 10] range and remove invalid entries
- Preserve NULL episode counts for movies
- Keep all other fields in their original format to minimize transformation errors
- Normalize age_rating values to standardized format (G, PG, PG-13, R, etc.)
- Convert duration to integer (minutes)
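
The titles actions above might look roughly like the following. This is a sketch, not the notebook's actual cleaning code: the column names `imdb_rating` and `title`, the alias table `AGE_RATING_ALIASES`, and the sample rows are assumptions. It also assumes that "remove invalid entries" means dropping rows with an out-of-range rating while keeping rows whose rating is missing.

```python
import pandas as pd

# Hypothetical aliases; extend after inspecting titles["age_rating"].unique()
AGE_RATING_ALIASES = {"PG13": "PG-13"}

def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Coerce ratings to numeric, then drop rows whose rating is present but outside [0, 10]
    out["imdb_rating"] = pd.to_numeric(out["imdb_rating"], errors="coerce")
    keep = out["imdb_rating"].between(0, 10) | out["imdb_rating"].isna()
    out = out[keep].copy()

    # Take the first run of digits from values such as 95, "45" or "120 min".
    # Formats like "1h 30min" would need dedicated parsing.
    minutes = out["duration"].astype(str).str.extract(r"(\d+)", expand=False)
    out["duration"] = pd.to_numeric(minutes, errors="coerce").astype("Int64")

    # Standardize age ratings: trim, upper-case, then apply known aliases
    out["age_rating"] = out["age_rating"].str.strip().str.upper().replace(AGE_RATING_ALIASES)

    return out.reset_index(drop=True)

# Made-up sample rows for illustration only
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_rating": [7.5, 11.2, None, "8.1"],
    "duration": [95, "120 min", "45", None],
    "age_rating": ["pg-13", " R ", None, "PG13"],
})

cleaned = clean_titles(sample)
print(cleaned)
```

Using the nullable `Int64` dtype for duration keeps missing values as `<NA>` instead of forcing the whole column to float.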

In [10]:
# --- Clean results.csv ---
results["timestamp"] = pd.to_datetime(results["timestamp"], errors="coerce")
results["period"] = pd.to_datetime(results["period"], errors="coerce")

results["baseline_value"] = pd.to_numeric(results["baseline_value"], errors="coerce")
results["actual_value"] = pd.to_numeric(results["actual_value"], errors="coerce")

results["product"] = results["product"].fillna("Unknown")
results["entry_type"] = results["entry_type"].fillna("Unknown")
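
Because `errors="coerce"` silently replaces unparseable values with `NaT`/`NaN`, it can be useful to count how many values were lost. The helper below is a hypothetical sketch with made-up sample values; in this notebook the raw column would need to be copied before the in-place conversion above for the comparison to work.

```python
import pandas as pd

def coercion_failures(raw: pd.Series, converted: pd.Series) -> int:
    """Count values that were present in the raw column but are missing after coercion."""
    return int((raw.notna() & converted.isna()).sum())

# Made-up sample values for illustration only
raw_period = pd.Series(["2024-01-01", "not a date", None, "2024-03-01"])
parsed_period = pd.to_datetime(raw_period, errors="coerce")

raw_value = pd.Series(["10.50", "$12", "7"])
parsed_value = pd.to_numeric(raw_value, errors="coerce")

period_failures = coercion_failures(raw_period, parsed_period)
value_failures = coercion_failures(raw_value, parsed_value)
print(period_failures, value_failures)
```

Values that were already missing in the raw data are not counted, so the result isolates the rows where coercion itself discarded information.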

In [None]:
# Save
results.to_csv("../data/results_clean.csv", index=False)
profiles.to_csv("../data/profiles_clean.csv", index=False)
titles.to_csv("../data/titles_clean.csv", index=False)

## Summary of Data Quality Issues & Resolutions

| Dataset | Issue | Resolution | Impact |
|---------|-------|-----------|--------|
| **results.csv** | Timestamp/period as strings | Convert to datetime with coercion | Enables time-based queries and sorting |
| **results.csv** | Monetary values as strings | Convert to numeric (float) | Enables financial calculations and aggregations |
| **results.csv** | Missing product/entry_type | Fill with "Unknown" | Maintains referential integrity in database |
| **profiles.csv** | Missing age_band/language/preferences | Keep as NULL | Allows flexible filtering during recommendations |
| **profiles.csv** | Duplicate profile records | Remove with drop_duplicates() | Eliminates ambiguity in user profiles |
| **profiles.csv** | Inconsistent kids_profile flag | Normalize to binary (0/1) | Enables reliable filtering of kids profiles |
| **titles.csv** | Missing region/language/rating | Keep as NULL | Handles content with unknown metadata |
| **titles.csv** | Invalid IMDb ratings | Validate against [0, 10] range and remove invalid entries | Ensures data quality for ranking |
| **titles.csv** | Mixed duration formats | Convert to integer minutes | Standardizes content metadata |

## Outcome
- **results_clean.csv**: Ready for database ingestion with normalized data types
- **profiles_clean.csv**: Validated user profiles with preserved preference data
- **titles_clean.csv**: Quality-checked content catalog with standardized metadata

All files are now ready for database schema creation and population in the next step.