# Data Cleaning Summary - MovieLens 100K SVD Ready
✅ **Data cleaning complete - ready for SVD model training**

In [None]:
import pandas as pd
import numpy as np
import os

# Load cleaned data
data_dir = os.path.join("..", "data")
final_data = pd.read_csv(os.path.join(data_dir, "final_data_cleaned.csv"))
# svd_data.csv is optional; fall back to None when it has not been generated
svd_path = os.path.join(data_dir, "svd_data.csv")
svd_data = pd.read_csv(svd_path) if os.path.exists(svd_path) else None

print("📊 Data Cleaning Results Summary")
print("=" * 50)
print(f"✅ Total processed ratings: {len(final_data):,}")
print(f"✅ Unique users: {final_data['user_id'].nunique():,}")
print(f"✅ Unique movies: {final_data['item_id'].nunique():,}")
print(f"✅ Rating range: {final_data['rating'].min()} - {final_data['rating'].max()}")
print(f"✅ Data sparsity: {((1 - len(final_data) / (final_data['user_id'].nunique() * final_data['item_id'].nunique())) * 100):.2f}%")

print("\n📋 Available columns:")
for i, col in enumerate(final_data.columns, 1):
    print(f"{i:2d}. {col}")
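
The sparsity figure above is one minus the ratio of observed ratings to all possible user-item pairs. As a minimal sketch, the same calculation can be wrapped in a helper and checked on a toy frame. The function name `rating_sparsity` and the toy data are illustrative assumptions, not part of the cleaning pipeline; the helper also counts distinct (user, item) pairs, which only differs from `len(final_data)` if duplicate ratings exist.

```python
import pandas as pd


def rating_sparsity(ratings: pd.DataFrame,
                    user_col: str = "user_id",
                    item_col: str = "item_id") -> float:
    """Return the percentage of user-item pairs that have no rating."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    # Count distinct (user, item) pairs so duplicate rows do not understate sparsity
    n_observed = len(ratings.drop_duplicates([user_col, item_col]))
    return (1 - n_observed / (n_users * n_items)) * 100


# Toy data (hypothetical): 3 users x 3 items, 4 observed ratings
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
toy_sparsity = rating_sparsity(toy)
print(f"Toy sparsity: {toy_sparsity:.2f}%")
```

On the real data this would presumably be called as `rating_sparsity(final_data)`.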

In [None]:
# Display sample of cleaned data
print("🎬 Sample of Cleaned Data:")
print("=" * 50)
final_data.head(10)

In [None]:
# Rating distribution
print("⭐ Rating Distribution:")
print("=" * 30)
rating_dist = final_data['rating'].value_counts().sort_index()
for rating, count in rating_dist.items():
    percentage = (count / len(final_data)) * 100
    print(f"Rating {rating}: {count:,} ({percentage:.1f}%)")

# Genre distribution (top 10)
print("\n🎭 Top 10 Genres:")
print("=" * 20)
genre_counts = {}
for genres_str in final_data['genres'].dropna():
    for genre in genres_str.split(', '):
        genre_counts[genre] = genre_counts.get(genre, 0) + 1

top_genres = sorted(genre_counts.items(), key=lambda x: x[1], reverse=True)[:10]
for genre, count in top_genres:
    print(f"{genre}: {count:,}")
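
The genre loop above splits each comma-separated string and tallies the pieces in a dictionary. A possible vectorized equivalent, sketched here on hypothetical toy data and assuming the same `', '` separator, uses `str.split` followed by `explode` and `value_counts`:

```python
import pandas as pd

# Toy data (hypothetical); None stands in for a movie without genre information
toy = pd.DataFrame({"genres": ["Action, Comedy", "Comedy", None, "Drama, Comedy"]})

genre_counts_series = (
    toy["genres"]
    .dropna()            # skip rows with no genre string
    .str.split(", ")     # one list of genres per rating row
    .explode()           # one row per (rating, genre) pair
    .value_counts()      # tally, sorted in descending order
)
print(genre_counts_series.head(10))
```

As in the loop, these counts are per rating row rather than per movie, so frequently rated movies weigh more heavily.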

In [None]:
# User demographics
print("👥 User Demographics:")
print("=" * 25)

print("Gender Distribution:")
gender_dist = final_data.groupby('user_id')['gender'].first().value_counts()
for gender, count in gender_dist.items():
    # Divide by the number of users, not by the number of gender categories
    percentage = (count / gender_dist.sum()) * 100
    print(f"  {gender}: {count} ({percentage:.1f}%)")

print("\nAge Group Distribution:")
age_dist = final_data.groupby('user_id')['age_group'].first().value_counts()
for age_group, count in age_dist.items():
    # Divide by the number of users, not by the number of age groups
    percentage = (count / age_dist.sum()) * 100
    print(f"  {age_group}: {count} ({percentage:.1f}%)")

print("\nTop 5 Occupations:")
occupation_dist = final_data.groupby('user_id')['occupation'].first().value_counts().head()
for occupation, count in occupation_dist.items():
    percentage = (count / final_data['user_id'].nunique()) * 100
    print(f"  {occupation}: {count} ({percentage:.1f}%)")
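
Because `final_data` holds one row per rating, demographic shares have to be computed over users rather than over rows. A minimal sketch of that idea, assuming each user's demographic fields are constant across their rows (toy data is hypothetical), reduces the frame to one row per user and lets `value_counts(normalize=True)` supply the denominator:

```python
import pandas as pd

# Toy data (hypothetical): user 1 has three ratings, user 2 one, user 3 two
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "gender": ["M", "M", "M", "F", "M", "M"],
})

# Keep a single row per user so heavy raters are not over-counted
users = toy.drop_duplicates("user_id")

# normalize=True divides each count by the number of users with a non-missing value
gender_pct = users["gender"].value_counts(normalize=True) * 100
for gender, pct in gender_pct.items():
    print(f"  {gender}: {pct:.1f}%")
```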

In [None]:
# Data quality checks
print("✅ Data Quality Verification:")
print("=" * 35)

# Missing values check
missing_values = final_data.isnull().sum()
print("Missing Values:")
for col, missing in missing_values.items():
    if missing > 0:
        percentage = (missing / len(final_data)) * 100
        print(f"  {col}: {missing} ({percentage:.2f}%)")
    else:
        print(f"  {col}: ✅ No missing values")

# Data type verification
print("\nData Types:")
for col in final_data.columns:
    dtype = final_data[col].dtype
    print(f"  {col}: {dtype}")

# Rating validation
invalid_ratings = final_data[(final_data['rating'] < 1) | (final_data['rating'] > 5)]
print(f"\nRating Validation: {'✅ All ratings valid (1-5)' if len(invalid_ratings) == 0 else f'❌ {len(invalid_ratings)} invalid ratings found'}")
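
One caveat about the range filter above: comparisons against `NaN` evaluate to `False`, so a missing rating is not counted as invalid by `(rating < 1) | (rating > 5)`. A hedged alternative, shown on hypothetical toy data, is to negate `Series.between`, which treats missing values as out of range:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two out-of-range values and one missing rating
toy = pd.DataFrame({"rating": [1, 5, 0, 6, np.nan, 3]})

# between() is inclusive on both ends by default and returns False for NaN
valid_mask = toy["rating"].between(1, 5)
invalid = toy[~valid_mask]
print(f"Invalid or missing ratings: {len(invalid)}")
```

Whether missing ratings should be reported here or only in the missing-values check is a design choice; the sketch simply makes them visible in both places.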

In [None]:
# SVD readiness confirmation
print("🚀 SVD Model Training Readiness:")
print("=" * 40)

# Check if user-item matrix can be created
try:
    user_item_matrix = final_data.pivot_table(
        index='user_id', 
        columns='item_id', 
        values='rating'
    )
    print(f"✅ User-Item Matrix: {user_item_matrix.shape}")
    print(f"✅ Matrix sparsity: {((user_item_matrix.isna().sum().sum() / user_item_matrix.size) * 100):.2f}%")
except Exception as e:
    print(f"❌ User-Item Matrix creation failed: {e}")

# Check data consistency
print(f"✅ Unique user IDs: {final_data['user_id'].nunique()}")
print(f"✅ Unique movie IDs: {final_data['item_id'].nunique()}")
print(f"✅ Total ratings: {len(final_data):,}")

# Files created
print("\n📁 Files Created:")
files_to_check = [
    "final_data_cleaned.csv",
    "svd_data.csv",
    "user_item_matrix.csv"
]

for filename in files_to_check:
    filepath = os.path.join(data_dir, filename)
    if os.path.exists(filepath):
        size_mb = os.path.getsize(filepath) / (1024 * 1024)
        print(f"  ✅ {filename} ({size_mb:.1f} MB)")
    else:
        print(f"  ❌ {filename} (missing)")

print("\n🎯 Next Steps:")
print("1. Proceed to notebook 02_exploratory_analysis.ipynb")
print("2. Then move to 03_model_svd.ipynb for SVD training")
print("3. Data is ready for collaborative filtering algorithms!")
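
To illustrate why the pivoted user-item matrix matters for the next step, here is a minimal sketch of a rank-`k` approximation using `numpy.linalg.svd`. It assumes per-user mean centering with missing entries filled by zero, which is one simple choice and not necessarily what `03_model_svd.ipynb` does; the toy data and the value of `k` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 3 users x 3 items, each user rated 2 items
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "item_id": [10, 20, 10, 30, 20, 30],
    "rating": [5, 3, 4, 2, 1, 5],
})
matrix = toy.pivot_table(index="user_id", columns="item_id", values="rating")

# Center each user's ratings on their own mean, then treat unknowns as "average"
user_means = matrix.mean(axis=1)
centered = matrix.sub(user_means, axis=0).fillna(0.0)

# Full SVD of the small dense matrix, then keep the k largest singular values
k = 2
U, s, Vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Add the user means back to get predicted ratings for every user-item pair
predicted = pd.DataFrame(approx, index=matrix.index, columns=matrix.columns)
predicted = predicted.add(user_means, axis=0)
print(predicted.round(2))
```

For the full MovieLens 100K matrix, a sparse or iterative factorization would likely be more practical than a dense SVD.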