# Customer Deposit Forecasting - Phase 1: Data Preparation & EDA

## Overview
This notebook covers Phase 1 of the customer deposit forecasting project:
1. Generate synthetic dataset (1000 customers, 12 months)
2. Perform comprehensive Exploratory Data Analysis
3. Create preprocessing pipeline

**Author**: Data Science Team  
**Date**: 2024-12-01  
**Business Goal**: Predict next-day deposit amounts for financial planning

## 1. Setup and Imports

In [None]:
# Core imports
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Custom modules
from data_generator import CustomerDepositGenerator, save_dataset
from eda import DepositEDA
from preprocessing import DepositPreprocessor

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')

print("✓ All imports successful!")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Generate Synthetic Dataset

The generator creates realistic deposit data with:
- 1000 customers across 7 behavior segments
- 365 days (12 months) of historical data
- Seasonality (weekend effects, month-end spikes)
- Trends (growing, declining, stable patterns)
- Realistic noise and outliers
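
The internals of `CustomerDepositGenerator` are not shown in this notebook, so the following is only a minimal sketch of how the ingredients listed above (trend, weekend effect, month-end spike, noise, outliers) could be combined for a single customer. The function name `simulate_customer_deposits`, all parameter names, and all default values are hypothetical and chosen purely for illustration.

```python
import numpy as np
import pandas as pd


def simulate_customer_deposits(n_days=365, base_amount=100.0, weekend_boost=1.3,
                               month_end_boost=1.5, daily_trend=0.001,
                               noise_scale=0.1, outlier_prob=0.01, seed=42):
    """Simulate one customer's daily deposits (hypothetical, illustrative values)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2024-01-01", periods=n_days, freq="D")

    # Trend: positive slope = growing, negative = declining, zero = stable
    trend = 1.0 + daily_trend * np.arange(n_days)

    # Seasonality: scale up Saturdays/Sundays and days 25-31 of each month
    weekend = np.where(dates.dayofweek >= 5, weekend_boost, 1.0)
    month_end = np.where(dates.day >= 25, month_end_boost, 1.0)

    # Multiplicative Gaussian noise plus occasional 5x outliers
    noise = rng.normal(1.0, noise_scale, n_days)
    outliers = np.where(rng.random(n_days) < outlier_prob, 5.0, 1.0)

    amount = base_amount * trend * weekend * month_end * noise * outliers
    amount = np.clip(amount, 0, None).round(2)  # deposits cannot be negative
    return pd.DataFrame({"date": dates, "deposit_amount": amount})
```

Setting `noise_scale=0`, `outlier_prob=0`, and `daily_trend=0` isolates the seasonal multipliers, which makes the weekend and month-end effects easy to inspect on their own.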

In [None]:
# Initialize data generator
generator = CustomerDepositGenerator(n_customers=1000, n_days=365, random_seed=42)

# Generate dataset
df_raw = generator.generate_full_dataset()

# Save raw data
save_dataset(df_raw, '../data/customer_deposits_raw.csv')

In [None]:
# Display first records
print("\n" + "=" * 80)
print("FIRST 20 RECORDS")
print("=" * 80)
display(df_raw.head(20))

In [None]:
# Quick data overview
print("\n" + "=" * 80)
print("DATASET INFO")
print("=" * 80)
df_raw.info()  # info() prints its report directly and returns None

print("\n" + "=" * 80)
print("BASIC STATISTICS")
print("=" * 80)
display(df_raw.describe())

## 3. Exploratory Data Analysis (EDA)

Comprehensive analysis of deposit patterns, customer behavior, and temporal trends.

In [None]:
# Initialize EDA class
eda = DepositEDA(df_raw)

# Get summary statistics
summary_stats = eda.get_summary_statistics()

### 3.1 Customer-Level Statistics

In [None]:
# Customer statistics
customer_stats = eda.get_customer_statistics()

# Display top customers
print("\n" + "=" * 80)
print("TOP 10 CUSTOMERS BY TOTAL DEPOSITS")
print("=" * 80)
display(customer_stats.nlargest(10, 'total_deposits'))

### 3.2 Segment Analysis

In [None]:
# Segment statistics
segment_stats = eda.analyze_segments()

# Display segment breakdown
print("\n" + "=" * 80)
print("CUSTOMER SEGMENTS BREAKDOWN")
print("=" * 80)
display(segment_stats)

### 3.3 Deposit Distribution Visualizations

In [None]:
# Plot deposit distributions
eda.plot_deposit_distributions(save_path='../visualizations/01_deposit_distributions.png')

### 3.4 Time Series Patterns

In [None]:
# Plot time series patterns
eda.plot_time_series_patterns(save_path='../visualizations/02_time_series_patterns.png')

### 3.5 Customer Behavior Analysis

In [None]:
# Plot customer behavior
eda.plot_customer_behavior(save_path='../visualizations/03_customer_behavior.png')

### 3.6 Sample Customer Time Series

In [None]:
# Plot sample customers from different segments
eda.plot_sample_customers(n_samples=9, save_path='../visualizations/04_sample_customers.png')

### 3.7 Correlation Analysis

In [None]:
# Analyze correlations
eda.analyze_correlations()

## 4. Data Preprocessing

The preprocessing pipeline applies four steps:
1. Handle missing values (forward-fill)
2. Remove extreme outliers (more than 3 standard deviations from the mean, matching `outlier_threshold=3.0`)
3. Create daily aggregates
4. Add temporal features
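
The four steps above can be sketched in plain pandas as follows. This is not the `DepositPreprocessor` implementation, which lives in `../src` and is not shown here; the function name `preprocess_deposits`, the global (rather than per-customer) outlier statistics, and the output column names are assumptions made for illustration. Only the input columns `customer_id`, `date`, and `deposit_amount` are taken from this notebook.

```python
import numpy as np
import pandas as pd


def preprocess_deposits(df, outlier_threshold=3.0):
    """Hypothetical sketch: forward-fill, drop outliers, aggregate daily, add features."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df = df.sort_values(["customer_id", "date"])

    # 1. Forward-fill missing amounts within each customer's own history
    df["deposit_amount"] = df.groupby("customer_id")["deposit_amount"].ffill()

    # 2. Separate rows more than `outlier_threshold` std devs from the global mean
    mean, std = df["deposit_amount"].mean(), df["deposit_amount"].std()
    is_outlier = (df["deposit_amount"] - mean).abs() > outlier_threshold * std
    outliers, kept = df[is_outlier], df[~is_outlier]

    # 3. Daily aggregates across all remaining customer records
    daily = (kept.groupby("date")["deposit_amount"]
                 .agg(["sum", "mean", "count"])
                 .reset_index())

    # 4. Temporal features derived from the calendar date
    daily["day_of_week"] = daily["date"].dt.dayofweek
    daily["is_weekend"] = daily["day_of_week"] >= 5
    daily["is_month_end"] = daily["date"].dt.day >= 25
    return daily, outliers
```

A mean-and-standard-deviation rule is sensitive to the very outliers it tries to detect, so a median-based threshold may be worth comparing if the flagged set looks too small.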

In [None]:
# Initialize preprocessor
preprocessor = DepositPreprocessor(outlier_threshold=3.0, scaling_method='standard')

# Run preprocessing pipeline
df_processed, outliers = preprocessor.preprocess_pipeline(df_raw)

In [None]:
# Display processed data
print("\n" + "=" * 80)
print("PROCESSED DATA SAMPLE")
print("=" * 80)
display(df_processed.head(20))

In [None]:
# Display preprocessing statistics
print("\n" + "=" * 80)
print("PREPROCESSING STATISTICS")
print("=" * 80)
stats = preprocessor.get_statistics(df_processed)
for key, value in stats.items():
    print(f"{key:.<40} {value}")

In [None]:
# Check outliers
if not outliers.empty:
    print("\n" + "=" * 80)
    print(f"OUTLIERS DETECTED: {len(outliers)}")
    print("=" * 80)
    display(outliers.head(10))
    
    # Visualize outliers
    plt.figure(figsize=(12, 6))
    plt.hist(outliers['deposit_amount'], bins=30, edgecolor='black', alpha=0.7, color='red')
    plt.xlabel('Deposit Amount ($)')
    plt.ylabel('Frequency')
    plt.title('Distribution of Detected Outliers', fontsize=14, fontweight='bold')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.savefig('../visualizations/05_outliers_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("\nNo outliers detected.")

## 5. Save Processed Data

In [None]:
# Save processed data
df_processed.to_csv('../data/customer_deposits_preprocessed.csv', index=False)
print("✓ Processed data saved to: ../data/customer_deposits_preprocessed.csv")

if not outliers.empty:
    outliers.to_csv('../data/outliers_detected.csv', index=False)
    print("✓ Outliers saved to: ../data/outliers_detected.csv")

# Save customer statistics
customer_stats.to_csv('../data/customer_statistics.csv', index=False)
print("✓ Customer statistics saved to: ../data/customer_statistics.csv")

## 6. Key Insights from Phase 1

### Customer Segments:
1. **High Frequency Regular** (15%): Daily depositors with consistent amounts
2. **Medium Frequency Stable** (25%): Weekly depositors; the largest segment
3. **Growing Users** (20%): Increasing deposit amounts over time
4. **Declining Users** (15%): Decreasing activity
5. **Sporadic High Value** (10%): Rare but large deposits
6. **Weekend Warriors** (10%): Primarily weekend depositors
7. **Inactive Declining** (5%): Very low activity

### Temporal Patterns:
- **Weekend Effect**: Increased deposits on Saturdays and Sundays
- **Month-End Spike**: Higher deposits between days 25-31 (salary effect)
- **Weekly Cycles**: Clear weekly patterns visible in moving averages
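
The patterns above can be checked numerically rather than read off the plots. The sketch below is one possible way to do that, assuming a DataFrame with `date` and `deposit_amount` columns; the helper name `temporal_effects` and the returned keys are hypothetical.

```python
import pandas as pd


def temporal_effects(df):
    """Hypothetical helper: quantify weekend, month-end, and weekly patterns."""
    dates = pd.to_datetime(df["date"])
    amounts = df["deposit_amount"]

    is_weekend = dates.dt.dayofweek >= 5   # Saturday = 5, Sunday = 6
    is_month_end = dates.dt.day >= 25      # days 25-31

    # Total deposits per calendar day, smoothed with a 7-day moving average
    daily_total = amounts.groupby(dates).sum()

    return {
        # Ratios above 1.0 mean higher average deposits in the flagged period
        "weekend_ratio": amounts[is_weekend].mean() / amounts[~is_weekend].mean(),
        "month_end_ratio": amounts[is_month_end].mean() / amounts[~is_month_end].mean(),
        "weekly_moving_avg": daily_total.rolling(7).mean(),
    }
```

A ratio noticeably above 1.0 supports the corresponding effect; a 7-day moving average that stays flat while the raw daily totals oscillate suggests the variation is mostly a weekly cycle.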

### Data Quality:
- Dataset successfully generated with realistic patterns
- Extreme outliers (more than 3 standard deviations from the mean) detected, removed, and saved separately
- Temporal features engineered for modeling

### Next Steps (Phase 2):
- Create lag features and rolling statistics
- Build customer behavior metrics
- Prepare train/validation/test splits
- Engineer features for ML models
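
As a preview of the first and third items, here is a minimal sketch of per-customer lag features, a rolling mean, and a chronological split. Phase 2 may do this differently; the function names, lag choices, window size, and split fractions are all assumptions for illustration.

```python
import pandas as pd


def add_lag_features(df, lags=(1, 7), window=7):
    """Hypothetical sketch: per-customer lags and a rolling mean of past deposits."""
    df = df.sort_values(["customer_id", "date"]).copy()
    grouped = df.groupby("customer_id")["deposit_amount"]
    for lag in lags:
        # shift() within each group so one customer's history never leaks into another's
        df[f"lag_{lag}"] = grouped.shift(lag)
    # Shift by one day first so the rolling mean uses only strictly earlier days
    df[f"rolling_mean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window).mean()
    )
    return df


def chronological_split(df, train_frac=0.7, val_frac=0.15):
    """Split by date (not randomly) so validation and test data lie in the future."""
    dates = df["date"].sort_values().unique()
    n = len(dates)
    train_end = dates[int(n * train_frac) - 1]
    val_end = dates[int(n * (train_frac + val_frac)) - 1]
    train = df[df["date"] <= train_end]
    val = df[(df["date"] > train_end) & (df["date"] <= val_end)]
    test = df[df["date"] > val_end]
    return train, val, test
```

Splitting on dates rather than shuffling rows keeps the evaluation honest for a next-day forecasting goal, because the model is never trained on observations that come after the ones it is tested on.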

## 7. Summary Statistics

In [None]:
# Create comprehensive summary
summary = {
    'Total Customers': df_processed['customer_id'].nunique(),
    'Total Days': df_processed['date'].nunique(),
    'Total Records': len(df_processed),
    'Non-Zero Deposits': (df_processed['deposit_amount'] > 0).sum(),
    'Zero Deposits': (df_processed['deposit_amount'] == 0).sum(),
    'Total Volume': f"${df_processed['deposit_amount'].sum():,.2f}",
    'Avg Daily Volume': f"${df_processed.groupby('date')['deposit_amount'].sum().mean():,.2f}",
    'Avg Customer LTV': f"${df_processed.groupby('customer_id')['deposit_amount'].sum().mean():,.2f}",
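    # to_datetime is a no-op if 'date' is already datetime; it guards against string dates
    'Date Range': f"{pd.to_datetime(df_processed['date']).min().date()} to {pd.to_datetime(df_processed['date']).max().date()}"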
}

print("\n" + "="*80)
print("PHASE 1 COMPLETION SUMMARY")
print("="*80)
for key, value in summary.items():
    print(f"{key:.<40} {value}")

print("\n" + "="*80)
print("✓ PHASE 1 COMPLETED SUCCESSFULLY!")
print("="*80)