# Yelp Dataset Preprocessing Exploration

This notebook provides an interactive exploration of the data preprocessing pipeline for the LLM-powered Business Improvement Agent project.

## Objectives
1. **Missing Value Analysis**: Identify and handle missing data across all datasets
2. **Duplicate Detection**: Remove exact and near-duplicate records
3. **Spam Detection**: Filter out spam and bot-generated reviews
4. **Data Consistency**: Standardize formats and validate data quality
5. **Quality Assessment**: Generate comprehensive quality reports
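
As a rough illustration of the first two objectives, the sketch below computes per-column missing-value rates and counts exact duplicate rows on a toy DataFrame. The column names and the `quality_summary` helper are hypothetical and are not part of `YelpDataPreprocessor`. Near-duplicate and spam detection need text-similarity logic that is not shown here.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Return basic quality metrics for a DataFrame (hypothetical helper)."""
    return {
        "rows": len(df),
        # Fraction of missing values per column
        "missing_rate": df.isna().mean().to_dict(),
        # Rows that exactly repeat an earlier row
        "exact_duplicates": int(df.duplicated().sum()),
    }


# Toy data; the column names are illustrative only
toy = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r2", "r3"],
        "stars": [5, 4, 4, None],
        "text": ["Great food", "Slow service", "Slow service", None],
    }
)

summary = quality_summary(toy)
print(summary)
```

Returning a plain dict keeps the metrics easy to serialize into a later quality report.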

## Dataset Overview
- **Business Data**: Business information, ratings, categories
- **Review Data**: User reviews with text and ratings
- **User Data**: User profiles and activity
- **Tip Data**: Short user tips and recommendations
- **Check-in Data**: Business visit patterns


In [None]:
# Import required libraries
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path
import warnings

# Suppress all warnings to keep notebook output readable (this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Add src directory to path
project_root = Path('../').resolve()
sys.path.insert(0, str(project_root / "src"))

# Import our custom preprocessor
from data_preprocessor import YelpDataPreprocessor

# Set up plotting style
# Note: the 'seaborn-v0_8' style name requires matplotlib >= 3.6
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("🚀 Libraries imported successfully!")
print(f"📁 Project root: {project_root}")

# Initialize preprocessor
preprocessor = YelpDataPreprocessor(
    raw_data_path=project_root / "raw_data",
    processed_data_path=project_root / "data" / "processed"
)

print("✅ Data preprocessor initialized")


## Step 1: Load and Explore Raw Data

Let's start by loading a sample of the business, review, user, and tip datasets to understand the data structure and identify quality issues.
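
The implementation of `load_dataset_stream` is not shown in this notebook. As a minimal sketch of the general idea, assuming the raw files are line-delimited JSON (one record per line), a sample can be read without loading the whole file into memory. The `read_jsonl_sample` function below is a hypothetical stand-in, not the preprocessor's actual method.

```python
import io
import json
from itertools import islice

import pandas as pd


def read_jsonl_sample(lines, sample_size: int) -> pd.DataFrame:
    """Parse at most `sample_size` records from an iterable of JSON lines."""
    # Skip blank lines lazily so they do not count toward the sample
    non_empty = (line for line in lines if line.strip())
    records = [json.loads(line) for line in islice(non_empty, sample_size)]
    return pd.DataFrame(records)


# In-memory stand-in for a raw file; with a real file, pass the open file handle
fake_file = io.StringIO(
    '{"business_id": "b1", "stars": 4.5}\n'
    '{"business_id": "b2", "stars": 3.0}\n'
    '{"business_id": "b3", "stars": 5.0}\n'
)

sample = read_jsonl_sample(fake_file, sample_size=2)
print(sample)
```

Because `islice` stops after `sample_size` records, the rest of the input is never parsed, which keeps sampling cheap for large files.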


In [None]:
# Load a small sample of each dataset for interactive exploration
print("📂 Loading sample datasets for exploration...")

datasets = {}

# Number of records to sample per dataset (check-in data is not sampled here)
sample_sizes = {
    'business': 5000,
    'review': 10000,
    'user': 3000,
    'tip': 2000
}

for dataset_name, sample_size in sample_sizes.items():
    try:
        print(f"Loading {dataset_name} ({sample_size:,} records)...")
        df = preprocessor.load_dataset_stream(dataset_name, sample_size)
        datasets[dataset_name] = df
        print(f"✅ {dataset_name}: {len(df):,} records loaded")
    except FileNotFoundError:
        print(f"⚠️ {dataset_name} dataset not found")
    except Exception as e:
        print(f"❌ Error loading {dataset_name}: {e}")

print(f"\n📊 Successfully loaded {len(datasets)} datasets")
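
Once the samples are loaded, a compact per-dataset overview can help spot structural problems early. The sketch below assumes `datasets` is a dict of pandas DataFrames, as built above. `summarize_frames` is a hypothetical helper, and the toy frames only stand in for the real samples.

```python
import pandas as pd


def summarize_frames(frames: dict) -> pd.DataFrame:
    """One row per dataset: record count, column count, and missing cells."""
    rows = [
        {
            "dataset": name,
            "records": len(df),
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
        }
        for name, df in frames.items()
    ]
    return pd.DataFrame(rows).set_index("dataset")


# Toy frames standing in for the loaded samples
toy_datasets = {
    "business": pd.DataFrame({"business_id": ["b1", "b2"], "stars": [4.0, None]}),
    "tip": pd.DataFrame({"text": ["Try the soup"]}),
}

overview = summarize_frames(toy_datasets)
print(overview)
```

In the notebook itself, the same helper could be called as `summarize_frames(datasets)`.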
