# CRM Customer Churn Analysis - Phase 1 Complete Tutorial

## 🎯 Learning Objectives

This notebook provides a comprehensive walkthrough of Phase 1 of the CRM Analytics Pipeline:

1. **Data Collection**: Load customer data from multiple sources (simulated when the database is unavailable)
2. **Data Cleaning**: Handle missing values, outliers, and data quality issues
3. **Data Validation**: Ensure data meets quality standards
4. **Feature Engineering**: Create meaningful features from raw data
5. **Exploratory Data Analysis**: Understand patterns and relationships
6. **Model Preparation**: Prepare datasets for Phase 2 modeling

---

## 📦 Setup and Imports

First, let's import all necessary libraries and set up our environment.

In [1]:
# Standard library imports
import sys
import os
import warnings
from pathlib import Path
from datetime import datetime, timedelta
import json

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistics
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set pandas display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

# Set matplotlib style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Add project root to path (assumes the notebook lives in <project_root>/notebooks)
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(project_root))

print("✅ All imports successful!")
print(f"📁 Project root: {project_root}")
print(f"🐍 Python version: {sys.version.split()[0]}")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")

## 🔧 Import Project Modules

Now let's import our custom modules from the project.

In [2]:
# Import project modules
from config.settings import get_settings
from src.data.collector import DataCollector
from src.data.cleaner import DataCleaner
from src.data.validator import DataValidator
from src.features.engineer import FeatureEngineer
from src.analysis.eda import ExploratoryDataAnalysis
from src.pipeline.orchestrator import PipelineOrchestrator, PipelineStage

# Initialize settings
settings = get_settings()

print("✅ Project modules imported successfully!")
print(f"📂 Data directory: {settings.paths.DATA_DIR}")
print(f"📊 Reports directory: {settings.paths.REPORTS_DIR}")

---

# 1️⃣ Data Collection

## Understanding the Data Sources

Our CRM system collects data from four main sources:

1. **Customers**: Demographics and account information
2. **Transactions**: Purchase history and behavior
3. **Interactions**: Customer service touchpoints
4. **Marketing**: Campaign engagement data

Let's collect and explore each dataset.

In [3]:
# Initialize the data collector
collector = DataCollector(settings)

# Collect data from all sources
# Note: This will use simulated data if database is unavailable
print("🔄 Starting data collection...")
print("⏱️  This may take a few minutes...\n")

raw_data = collector.collect_all_data(
    sources=['customers', 'transactions', 'interactions', 'marketing'],
    use_cache=True
)

print("\n✅ Data collection complete!")
print("\n📊 Dataset Summary:")
print("=" * 60)

for name, df in raw_data.items():
    memory_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
    print(f"\n{name.upper():15s} | Rows: {len(df):>8,} | Columns: {len(df.columns):>3} | Memory: {memory_mb:>7.2f} MB")
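
The collector falls back to simulated data when the database is unavailable, but how it does so is not shown in this notebook. As a rough sketch of what such a fallback might look like, the snippet below generates a small synthetic customer table with NumPy. The function name `simulate_customers`, the category values, and the 20% churn probability are all illustrative assumptions, not the project's actual implementation.

```python
import numpy as np
import pandas as pd

def simulate_customers(n: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Hypothetical stand-in for simulated customer data (not the project's code)."""
    rng = np.random.default_rng(seed)  # seeded generator keeps the output reproducible
    return pd.DataFrame({
        "customer_id": np.arange(1, n + 1),
        "age": rng.integers(18, 80, size=n),  # upper bound is exclusive: ages 18..79
        "gender": rng.choice(["F", "M", "Other"], size=n, p=[0.48, 0.48, 0.04]),
        "customer_segment": rng.choice(["Basic", "Standard", "Premium"], size=n),
        "churned": rng.random(n) < 0.2,  # roughly 20% churn, an arbitrary choice
    })

sim_customers = simulate_customers(n=500)
print(sim_customers.head())
print(f"Simulated churn rate: {sim_customers['churned'].mean():.1%}")
```

Seeding the generator means repeated runs produce the same table, which helps when comparing results across sessions.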

## 1.1 Explore Customer Data

Let's take a detailed look at the customer dataset.

In [None]:
customers_df = raw_data['customers']

print("👥 CUSTOMER DATA OVERVIEW")
print("=" * 60)
print(f"\nTotal Customers: {len(customers_df):,}")
print(f"\nColumn Names and Types:")
print(customers_df.dtypes)

print("\n📋 Sample Records:")
display(customers_df.head(10))

print("\n📊 Statistical Summary:")
display(customers_df.describe())

print("\n❓ Missing Values:")
missing = customers_df.isnull().sum()
missing_pct = (missing / len(customers_df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
}).sort_values('Missing Count', ascending=False)
display(missing_df[missing_df['Missing Count'] > 0])

### Customer Demographics Visualization

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
fig.suptitle('Customer Demographics Overview', fontsize=16, fontweight='bold')

# Age distribution
customers_df['age'].hist(bins=30, ax=axes[0, 0], edgecolor='black')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Gender distribution
customers_df['gender'].value_counts().plot(kind='bar', ax=axes[0, 1], color=['skyblue', 'pink', 'lightgreen'])
axes[0, 1].set_title('Gender Distribution')
axes[0, 1].set_xlabel('Gender')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=0)

# Customer Segment
customers_df['customer_segment'].value_counts().plot(kind='bar', ax=axes[0, 2], color='coral')
axes[0, 2].set_title('Customer Segments')
axes[0, 2].set_xlabel('Segment')
axes[0, 2].set_ylabel('Count')
axes[0, 2].tick_params(axis='x', rotation=45)

# Acquisition Channel
customers_df['acquisition_channel'].value_counts().plot(kind='barh', ax=axes[1, 0], color='lightblue')
axes[1, 0].set_title('Acquisition Channels')
axes[1, 0].set_xlabel('Count')
axes[1, 0].set_ylabel('Channel')

# Churn Distribution (sort by label so counts line up with the 0/1 label order)
churn_counts = customers_df['churned'].value_counts().sort_index()
axes[1, 1].pie(churn_counts.values, labels=['Not Churned', 'Churned'],
               autopct='%1.1f%%', colors=['green', 'red'], startangle=90)
axes[1, 1].set_title('Churn Distribution')

# State Distribution (Top 10)
customers_df['state'].value_counts().head(10).plot(kind='bar', ax=axes[1, 2], color='purple')
axes[1, 2].set_title('Top 10 States')
axes[1, 2].set_xlabel('State')
axes[1, 2].set_ylabel('Count')
axes[1, 2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n📈 Key Insights:")
print(f"  • Average age: {customers_df['age'].mean():.1f} years")
print(f"  • Churn rate: {customers_df['churned'].mean()*100:.2f}%")
print(f"  • Most common segment: {customers_df['customer_segment'].mode()[0]}")
print(f"  • Top acquisition channel: {customers_df['acquisition_channel'].mode()[0]}")

## 1.2 Explore Transaction Data

In [None]:
transactions_df = raw_data['transactions']

print("💳 TRANSACTION DATA OVERVIEW")
print("=" * 60)
print(f"\nTotal Transactions: {len(transactions_df):,}")
print(f"Unique Customers: {transactions_df['customer_id'].nunique():,}")
print(f"\nTransaction Value Summary:")
print(f"  • Total Revenue: ${transactions_df['total_amount'].sum():,.2f}")
print(f"  • Average Transaction: ${transactions_df['total_amount'].mean():.2f}")
print(f"  • Median Transaction: ${transactions_df['total_amount'].median():.2f}")
print(f"  • Max Transaction: ${transactions_df['total_amount'].max():.2f}")

print("\n📋 Sample Transactions:")
display(transactions_df.head(10))

In [None]:
# Transaction Analytics
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Transaction Analytics', fontsize=16, fontweight='bold')

# Transaction amount distribution
transactions_df['total_amount'].hist(bins=50, ax=axes[0, 0], edgecolor='black')
axes[0, 0].set_title('Transaction Amount Distribution')
axes[0, 0].set_xlabel('Amount ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_xlim(0, transactions_df['total_amount'].quantile(0.95))

# Product category distribution
if 'product_category' in transactions_df.columns:
    transactions_df['product_category'].value_counts().plot(kind='bar', ax=axes[0, 1], color='orange')
    axes[0, 1].set_title('Product Categories')
    axes[0, 1].set_xlabel('Category')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].tick_params(axis='x', rotation=45)

# Payment method distribution
if 'payment_method' in transactions_df.columns:
    transactions_df['payment_method'].value_counts().plot(kind='pie', ax=axes[1, 0], autopct='%1.1f%%')
    axes[1, 0].set_title('Payment Methods')
    axes[1, 0].set_ylabel('')

# Channel distribution
if 'channel' in transactions_df.columns:
    transactions_df['channel'].value_counts().plot(kind='bar', ax=axes[1, 1], color='green')
    axes[1, 1].set_title('Transaction Channels')
    axes[1, 1].set_xlabel('Channel')
    axes[1, 1].set_ylabel('Count')
    axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 1.3 Explore Interaction Data

In [None]:
interactions_df = raw_data['interactions']

print("📞 INTERACTION DATA OVERVIEW")
print("=" * 60)
print(f"\nTotal Interactions: {len(interactions_df):,}")
print(f"Unique Customers: {interactions_df['customer_id'].nunique():,}")

if 'satisfaction_score' in interactions_df.columns:
    print(f"\nSatisfaction Metrics:")
    print(f"  • Average Satisfaction: {interactions_df['satisfaction_score'].mean():.2f}/5")
    print(f"  • Median Satisfaction: {interactions_df['satisfaction_score'].median():.0f}/5")

print("\n📋 Sample Interactions:")
display(interactions_df.head(10))

In [None]:
# Interaction Analytics
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Customer Interaction Analytics', fontsize=16, fontweight='bold')

# Interaction type distribution
if 'interaction_type' in interactions_df.columns:
    interactions_df['interaction_type'].value_counts().plot(kind='bar', ax=axes[0, 0], color='steelblue')
    axes[0, 0].set_title('Interaction Types')
    axes[0, 0].set_xlabel('Type')
    axes[0, 0].set_ylabel('Count')
    axes[0, 0].tick_params(axis='x', rotation=45)

# Satisfaction score distribution
if 'satisfaction_score' in interactions_df.columns:
    interactions_df['satisfaction_score'].value_counts().sort_index().plot(kind='bar', ax=axes[0, 1], color='coral')
    axes[0, 1].set_title('Satisfaction Scores')
    axes[0, 1].set_xlabel('Score')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].tick_params(axis='x', rotation=0)

# Channel distribution
if 'channel' in interactions_df.columns:
    interactions_df['channel'].value_counts().plot(kind='pie', ax=axes[1, 0], autopct='%1.1f%%')
    axes[1, 0].set_title('Interaction Channels')
    axes[1, 0].set_ylabel('')

# Duration distribution
if 'duration_seconds' in interactions_df.columns:
    # Convert to minutes for better readability
    (interactions_df['duration_seconds'] / 60).hist(bins=30, ax=axes[1, 1], edgecolor='black')
    axes[1, 1].set_title('Interaction Duration')
    axes[1, 1].set_xlabel('Duration (minutes)')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_xlim(0, (interactions_df['duration_seconds'] / 60).quantile(0.95))

plt.tight_layout()
plt.show()

## 1.4 Explore Marketing Data

In [None]:
marketing_df = raw_data['marketing']

print("📧 MARKETING DATA OVERVIEW")
print("=" * 60)

if len(marketing_df) > 0:
    print(f"\nTotal Campaigns: {len(marketing_df):,}")
    print(f"Unique Customers Reached: {marketing_df['customer_id'].nunique():,}")
    
    # Report each engagement metric only if its column is present
    print("\nEngagement Metrics:")
    for label, col in [('Open', 'opened'), ('Click', 'clicked'), ('Conversion', 'converted')]:
        if col in marketing_df.columns:
            print(f"  • {label} Rate: {marketing_df[col].mean()*100:.2f}%")
    
    display(marketing_df.head(10))
else:
    print("\n⚠️  No marketing data available")

---

# 2️⃣ Data Cleaning

## Understanding Data Quality Issues

Before we can analyze data, we need to clean it. Common issues include:
- Missing values
- Duplicates
- Outliers
- Invalid data types
- Business rule violations

Let's clean our datasets!

In [None]:
# Initialize the data cleaner
cleaner = DataCleaner(settings)

print("🧹 Starting data cleaning...")
print("⏱️  This may take a minute...\n")

# Clean all datasets
cleaned_data = cleaner.clean_all_data(
    data_dict=raw_data,
    deep_clean=True
)

print("\n✅ Data cleaning complete!")
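
The internals of `DataCleaner` are not shown in this notebook. To make the issues listed above concrete, here is a minimal sketch of three common cleaning steps on a toy table: dropping duplicate rows, imputing a missing numeric value with the median, and clipping outliers to the 1.5 × IQR fences. The column names and the choice of median imputation and IQR clipping are assumptions for illustration; the project's cleaner may use different rules.

```python
import numpy as np
import pandas as pd

toy_raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5],
    "age": [34.0, 51.0, 51.0, np.nan, 29.0, 42.0],
    "total_spent": [120.0, 80.0, 80.0, 95.0, 110.0, 5000.0],
})

# 1. Drop exact duplicate rows (customer 2 appears twice)
toy_clean = toy_raw.drop_duplicates().reset_index(drop=True)

# 2. Impute missing numeric values with the column median
toy_clean["age"] = toy_clean["age"].fillna(toy_clean["age"].median())

# 3. Clip outliers to the 1.5 * IQR fences
q1, q3 = toy_clean["total_spent"].quantile([0.25, 0.75])
iqr = q3 - q1
toy_clean["total_spent"] = toy_clean["total_spent"].clip(
    lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
)

print(toy_clean)
```

Clipping keeps the extreme customer in the dataset instead of dropping the row, which matters when the row also carries a churn label.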

## 2.1 Review Cleaning Report

In [None]:
print("\n📊 CLEANING REPORT SUMMARY")
print("=" * 80)

for dataset_name, report in cleaner.cleaning_reports.items():
    print(f"\n{dataset_name.upper()}")
    print("-" * 40)
    print(f"Records before: {report.total_records_before:,}")
    print(f"Records after:  {report.total_records_after:,}")
    print(f"Duplicates removed: {report.duplicates_removed:,}")
    
    if report.missing_values_handled:
        print(f"\nMissing values handled:")
        for col, count in report.missing_values_handled.items():
            print(f"  • {col}: {count:,}")

    if report.outliers_handled:
        print(f"\nOutliers handled:")
        for col, count in report.outliers_handled.items():
            print(f"  • {col}: {count:,}")

## 2.2 Compare Before/After Cleaning

In [None]:
# Compare customer data before and after cleaning
print("📊 CUSTOMER DATA: BEFORE vs AFTER CLEANING")
print("=" * 80)

print("\nBEFORE CLEANING:")
print(f"Shape: {raw_data['customers'].shape}")
print(f"Missing values: {raw_data['customers'].isnull().sum().sum()}")
print(f"Duplicates: {raw_data['customers'].duplicated().sum()}")

print("\nAFTER CLEANING:")
print(f"Shape: {cleaned_data['customers'].shape}")
print(f"Missing values: {cleaned_data['customers'].isnull().sum().sum()}")
print(f"Duplicates: {cleaned_data['customers'].duplicated().sum()}")

print("\nNew columns added:")
new_cols = set(cleaned_data['customers'].columns) - set(raw_data['customers'].columns)
for col in new_cols:
    print(f"  • {col}")

---

# 3️⃣ Data Validation

## Quality Checks

Let's validate our cleaned data to ensure it meets quality standards.

In [None]:
# Initialize validator
validator = DataValidator(settings)

print("🔍 Starting data validation...\n")

validation_results = {}

for name, df in cleaned_data.items():
    print(f"\nValidating {name}...")
    is_valid = validator.validate_data(df, dataset_name=name)
    validation_results[name] = {
        'valid': is_valid,
        'summary': validator.validation_summary.copy()
    }
    
print("\n✅ Validation complete!")
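
The rules applied by `DataValidator` are not visible here. As a hedged illustration of what such quality checks can look like, the sketch below runs a few rule-based checks on a toy customer table and computes a pass rate. The function `run_basic_checks`, the specific rules, and the 18 to 120 age range are assumptions made for this example.

```python
import pandas as pd

def run_basic_checks(df: pd.DataFrame) -> dict:
    """Return a check-name -> bool mapping of simple quality rules (illustrative only)."""
    return {
        "customer_id_unique": bool(df["customer_id"].is_unique),
        "customer_id_not_null": bool(df["customer_id"].notna().all()),
        "age_in_range": bool(df["age"].between(18, 120).all()),
        "churned_is_binary": bool(df["churned"].isin([0, 1]).all()),
    }

toy_customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [25, 47, 130],  # 130 violates the assumed age range
    "churned": [0, 1, 0],
})

check_results = run_basic_checks(toy_customers)
pass_rate = sum(check_results.values()) / len(check_results)

for check_name, passed in check_results.items():
    print(f"{check_name:25s} {'PASS' if passed else 'FAIL'}")
print(f"Pass rate: {pass_rate:.0%}")
```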

## 3.1 Validation Summary

In [None]:
print("\n📊 VALIDATION SUMMARY")
print("=" * 80)

summary_data = []
for name, result in validation_results.items():
    summary = result['summary']
    summary_data.append({
        'Dataset': name.upper(),
        'Total Checks': summary['total_checks'],
        'Passed': summary['passed'],
        'Failed': summary['failed'],
        'Errors': summary['errors'],
        'Warnings': summary['warnings'],
        'Pass Rate': f"{summary['pass_rate']*100:.1f}%",
        'Status': '✅ Valid' if summary['is_valid'] else '⚠️ Issues Found'
    })

summary_df = pd.DataFrame(summary_data)
display(summary_df)

---

# 4️⃣ Feature Engineering

## Creating Meaningful Features

Feature engineering is where we transform raw data into features that better represent the underlying problem.

We'll create:
1. **Transaction features**: Purchase patterns, frequency, monetary values
2. **Interaction features**: Support engagement, satisfaction metrics
3. **RFM features**: Recency, Frequency, Monetary analysis
4. **Behavioral features**: Engagement scores, activity patterns
5. **Time-based features**: Temporal patterns and trends

In [None]:
# Initialize feature engineer
engineer = FeatureEngineer(settings)

print("⚙️ Starting feature engineering...")
print("⏱️  This may take 1-2 minutes...\n")

# Create features
master_features = engineer.create_features(cleaned_data)

print("\n✅ Feature engineering complete!")
print(f"\n📊 Master Feature Set:")
print(f"  • Total records: {len(master_features):,}")
print(f"  • Total features: {len(master_features.columns):,}")
print(f"  • Memory usage: {master_features.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")
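
Of the feature families listed above, RFM is the easiest to show in isolation. The sketch below derives recency, frequency, and monetary values per customer from a toy transaction table. It is not the `FeatureEngineer` implementation: the `transaction_date` column name and the choice of snapshot date (one day after the latest transaction) are assumptions, while `customer_id` and `total_amount` match the columns used earlier in this notebook.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "transaction_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2024-02-10",
        "2024-01-20", "2024-02-25", "2024-03-10",
    ]),
    "total_amount": [50.0, 70.0, 200.0, 20.0, 30.0, 25.0],
})

# Reference point for recency: one day after the most recent transaction
snapshot_date = toy_tx["transaction_date"].max() + pd.Timedelta(days=1)

rfm = toy_tx.groupby("customer_id").agg(
    recency_days=("transaction_date", lambda d: (snapshot_date - d.max()).days),
    frequency=("transaction_date", "count"),
    monetary=("total_amount", "sum"),
)
print(rfm)
```

Lower recency and higher frequency or monetary values usually indicate a more engaged customer, which is why these three numbers are often binned into RFM segments.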

## 4.1 Explore Generated Features

In [None]:
print("\n📋 FEATURE CATEGORIES")
print("=" * 80)

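# Group features by category using substring matching.
# This is approximate and categories can overlap: 'age' also matches 'engagement' and 'average'.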
feature_categories = {
    'Customer Demographics': [col for col in master_features.columns if any(x in col for x in ['age', 'gender', 'state'])],
    'Transaction Features': [col for col in master_features.columns if any(x in col for x in ['transaction', 'spent', 'purchase'])],
    'Interaction Features': [col for col in master_features.columns if any(x in col for x in ['interaction', 'satisfaction', 'support'])],
    'RFM Features': [col for col in master_features.columns if any(x in col for x in ['recency', 'frequency', 'monetary', 'rfm'])],
    'Behavioral Features': [col for col in master_features.columns if any(x in col for x in ['engagement', 'activity', 'behavior'])],
    'Time Features': [col for col in master_features.columns if any(x in col for x in ['date', 'days', 'lifetime', 'age_'])]
}

for category, features in feature_categories.items():
    if features:
        print(f"\n{category} ({len(features)} features):")
        for feat in features[:10]:  # Show first 10
            print(f"  • {feat}")
        if len(features) > 10:
            print(f"  ... and {len(features) - 10} more")

In [None]:
# Show sample of master features
print("\n📊 Sample of Master Features:")
display(master_features.head(10))

## 4.2 Feature Importance Analysis

In [None]:
if hasattr(engineer, 'feature_importance') and engineer.feature_importance is not None:
    print("\n🎯 TOP 20 MOST IMPORTANT FEATURES")
    print("=" * 80)
    
    top_features = engineer.feature_importance.head(20)
    display(top_features)
    
    # Visualize
    plt.figure(figsize=(12, 8))
    plt.barh(range(len(top_features)), top_features['importance'].values)
    plt.yticks(range(len(top_features)), top_features['feature'].values)
    plt.xlabel('Importance Score')
    plt.title('Top 20 Most Important Features for Churn Prediction')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("\n⚠️  Feature importance not calculated")

---

# 5️⃣ Exploratory Data Analysis (EDA)

## Understanding Patterns and Relationships

Now let's dive deep into the data to understand patterns, correlations, and insights.

In [None]:
# Initialize EDA analyzer
analyzer = ExploratoryDataAnalysis(settings)

print("📊 Starting Exploratory Data Analysis...")
print("⏱️  This may take 1-2 minutes...\n")

# Perform EDA
eda_report = analyzer.perform_eda(
    df=master_features,
    target_col='churned',
    generate_plots=True
)

print("\n✅ EDA complete!")

## 5.1 Data Overview

In [None]:
print("\n📊 DATA OVERVIEW")
print("=" * 80)

overview = eda_report.get('data_overview', {})

print(f"\nDataset Shape: {overview.get('shape', 'N/A')}")
print(f"Memory Usage: {overview.get('memory_usage_mb', 0):.2f} MB")
print(f"Numeric Features: {overview.get('numeric_features', 0)}")
print(f"Categorical Features: {overview.get('categorical_features', 0)}")
print(f"Duplicate Rows: {overview.get('duplicate_rows', 0):,}")

if overview.get('zero_variance_features'):
    print(f"\n⚠️  Zero Variance Features: {len(overview['zero_variance_features'])}")

if overview.get('high_cardinality_features'):
    print(f"⚠️  High Cardinality Features: {len(overview['high_cardinality_features'])}")

## 5.2 Target Analysis (Churn)

In [None]:
print("\n🎯 TARGET VARIABLE ANALYSIS (CHURN)")
print("=" * 80)

target_analysis = eda_report.get('target_analysis', {})

print(f"\nChurn Distribution:")
for label, count in target_analysis.get('distribution', {}).items():
    pct = target_analysis.get('percentage', {}).get(label, 0)
    print(f"  • {label}: {count:,} ({pct:.2f}%)")

if target_analysis.get('class_ratio'):
    print(f"\nClass Imbalance Ratio: {target_analysis['class_ratio']:.2f}:1")
    
if target_analysis.get('entropy'):
    print(f"Entropy: {target_analysis['entropy']:.4f}")
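
How the EDA module computes `class_ratio` and `entropy` is not shown. One plausible reading, sketched below on a toy target with 20% churn, is the majority-to-minority count ratio and the Shannon entropy of the class distribution in bits. Both definitions are assumptions; the module may, for example, use the natural logarithm instead of base 2.

```python
import numpy as np
import pandas as pd

toy_target = pd.Series([0] * 80 + [1] * 20, name="churned")

class_counts = toy_target.value_counts()

# Majority-to-minority ratio: 4.0 means four non-churners per churner
class_ratio = class_counts.max() / class_counts.min()

# Shannon entropy in bits: 1.0 for a perfectly balanced binary target, 0.0 for a single class
probs = class_counts / class_counts.sum()
entropy_bits = float(-(probs * np.log2(probs)).sum())

print(f"Class ratio: {class_ratio:.2f}:1")
print(f"Entropy: {entropy_bits:.4f} bits")
```

An entropy well below 1 bit is a quick signal that the target is imbalanced and that stratified splitting is worth using.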

## 5.3 Feature Correlations with Target

In [None]:
print("\n📈 TOP FEATURES CORRELATED WITH CHURN")
print("=" * 80)

correlation_analysis = eda_report.get('correlation_analysis', {})
target_correlations = correlation_analysis.get('target_correlations', {})

if target_correlations:
    # Rank all features by absolute correlation, then keep the top 15
    corr_df = pd.DataFrame(list(target_correlations.items()),
                          columns=['Feature', 'Correlation'])
    corr_df['Abs_Correlation'] = corr_df['Correlation'].abs()
    corr_df = corr_df.sort_values('Abs_Correlation', ascending=False).head(15)
    
    display(corr_df[['Feature', 'Correlation']].head(15))
    
    # Visualize
    plt.figure(figsize=(12, 8))
    top_15 = corr_df.head(15)
    colors = ['red' if x < 0 else 'green' for x in top_15['Correlation']]
    plt.barh(range(len(top_15)), top_15['Correlation'].values, color=colors, alpha=0.7)
    plt.yticks(range(len(top_15)), top_15['Feature'].values)
    plt.xlabel('Correlation with Churn')
    plt.title('Top 15 Features Correlated with Churn')
    plt.axvline(x=0, color='black', linestyle='--', linewidth=0.5)
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("No correlation data available")
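
For a binary target, the Pearson correlation between a numeric feature and the 0/1 label is the point-biserial correlation. The sketch below computes it with `DataFrame.corrwith` on synthetic data and ranks features by absolute value. The feature names and effect sizes are invented for illustration, and this is not necessarily how the EDA module computes `target_correlations`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
toy_churn = pd.Series(rng.integers(0, 2, size=n), name="churned")

toy_features = pd.DataFrame({
    "recency_days": 30 + 40 * toy_churn + rng.normal(0, 10, size=n),   # higher for churners
    "satisfaction": 4 - 1.5 * toy_churn + rng.normal(0, 0.5, size=n),  # lower for churners
    "noise": rng.normal(0, 1, size=n),                                 # unrelated to churn
})

# Pearson correlation of each column with the binary target
target_corr = toy_features.corrwith(toy_churn)

# Rank by strength of association, keeping the sign for interpretation
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation is just as informative as a strong positive one.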

## 5.4 Segment Analysis

In [None]:
print("\n👥 CHURN RATE BY CUSTOMER SEGMENTS")
print("=" * 80)

segmentation = eda_report.get('segmentation_analysis', {})

# Customer Segment
if 'by_customer_segment' in segmentation:
    print("\nBy Customer Segment:")
    segment_data = segmentation['by_customer_segment']
    segment_df = pd.DataFrame(segment_data).T
    if 'churned_mean' in segment_df.columns:
        segment_df = segment_df.sort_values('churned_mean', ascending=False)
        display(segment_df)

# RFM Segment
if 'by_rfm_segment' in segmentation:
    print("\nBy RFM Segment:")
    rfm_data = segmentation['by_rfm_segment']
    rfm_df = pd.DataFrame(rfm_data).T
    if 'churned_mean' in rfm_df.columns:
        rfm_df = rfm_df.sort_values('churned_mean', ascending=False)
        display(rfm_df.head(10))
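
A per-segment churn table like the ones displayed above can be produced with a single group-by. This is a minimal sketch on toy data; the segment labels are made up and the output column names differ from the report's `churned_mean`.

```python
import pandas as pd

toy_seg = pd.DataFrame({
    "customer_segment": ["Basic", "Basic", "Basic", "Premium", "Premium", "Standard"],
    "churned": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the churn rate; count gives the segment size
segment_churn = (
    toy_seg.groupby("customer_segment")["churned"]
    .agg(churn_rate="mean", customers="count")
    .sort_values("churn_rate", ascending=False)
)
print(segment_churn)
```

Showing the customer count next to the rate helps avoid over-reading a high churn rate in a segment with only a handful of customers.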

## 5.5 Statistical Tests

In [None]:
print("\n📊 STATISTICAL SIGNIFICANCE TESTS")
print("=" * 80)

stat_tests = eda_report.get('statistical_tests', {})

# T-tests for numeric features
if 't_tests' in stat_tests:
    print("\nT-Tests (Top 10 Most Significant):")
    t_tests = stat_tests['t_tests']
    if t_tests:
        t_test_df = pd.DataFrame(t_tests)
        t_test_df = t_test_df.sort_values('p_value')
        display(t_test_df.head(10))

# Chi-square tests for categorical features
if 'chi_square_tests' in stat_tests:
    print("\nChi-Square Tests (Top 10 Most Significant):")
    chi_tests = stat_tests['chi_square_tests']
    if chi_tests:
        chi_df = pd.DataFrame(chi_tests)
        chi_df = chi_df.sort_values('p_value')
        display(chi_df.head(10))
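
The two test families reported above can be reproduced directly with SciPy. The sketch below runs a Welch t-test on a numeric feature and a chi-square test of independence on a categorical feature, both against the churn flag. The data is synthetic, the column names are illustrative, and the use of Welch's variant (`equal_var=False`) is an assumption about a reasonable default, not a statement about the EDA module.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
toy_stats = pd.DataFrame({
    "churned": np.repeat([0, 1], 200),
    "monthly_spend": np.concatenate([
        rng.normal(100, 15, 200),  # retained customers
        rng.normal(85, 15, 200),   # churned customers spend less on average
    ]),
    "segment": np.concatenate([
        rng.choice(["Basic", "Premium"], 200, p=[0.4, 0.6]),
        rng.choice(["Basic", "Premium"], 200, p=[0.7, 0.3]),
    ]),
})

# Welch's t-test: does mean spend differ between churned and retained customers?
t_stat, t_p = stats.ttest_ind(
    toy_stats.loc[toy_stats["churned"] == 1, "monthly_spend"],
    toy_stats.loc[toy_stats["churned"] == 0, "monthly_spend"],
    equal_var=False,
)

# Chi-square test of independence: is segment associated with churn?
contingency = pd.crosstab(toy_stats["segment"], toy_stats["churned"])
chi2, chi_p, dof, expected = stats.chi2_contingency(contingency)

print(f"t = {t_stat:.2f}, p = {t_p:.3g}")
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {chi_p:.3g}")
```

When many features are tested at once, some small p-values are expected by chance, so a multiple-comparison correction is worth considering before drawing conclusions.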

## 5.6 View Generated Visualizations

In [None]:
print("\n📊 GENERATED VISUALIZATIONS")
print("=" * 80)

if hasattr(analyzer, 'visualizations') and analyzer.visualizations:
    print(f"\nTotal visualizations created: {len(analyzer.visualizations)}")
    print("\nVisualization files:")
    for viz in analyzer.visualizations:
        viz_path = Path(viz)
        print(f"  • {viz_path.name}")
        if viz_path.suffix == '.html':
            print(f"    → Open in browser: {viz_path}")
else:
    print("No visualizations generated")

---

# 6️⃣ Model Preparation

## Preparing Datasets for Phase 2

Let's prepare our final datasets for machine learning modeling.

In [None]:
print("\n🎯 PREPARING DATASETS FOR MODELING")
print("=" * 80)

# Separate features and target
target_col = 'churned'
feature_cols = [col for col in master_features.columns 
                if col not in ['customer_id', target_col]]

# Select only numeric features for initial modeling
numeric_features = master_features[feature_cols].select_dtypes(include=[np.number]).columns.tolist()

X = master_features[numeric_features]
y = master_features[target_col]

print(f"\nFeature Matrix (X):")
print(f"  • Shape: {X.shape}")
print(f"  • Features: {len(numeric_features)}")
print(f"  • Memory: {X.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

print(f"\nTarget Vector (y):")
print(f"  • Shape: {y.shape}")
print(f"  • Churn rate: {y.mean()*100:.2f}%")

## 6.1 Train/Validation/Test Split

In [None]:
# Split into train/temp (70/30)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=0.3, 
    random_state=42, 
    stratify=y
)

# Split temp into validation/test (15/15)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, 
    test_size=0.5, 
    random_state=42, 
    stratify=y_temp
)

print("\n📊 DATASET SPLITS")
print("=" * 80)

splits_info = [
    ['Training Set', len(X_train), len(X_train)/len(X)*100, y_train.mean()*100],
    ['Validation Set', len(X_val), len(X_val)/len(X)*100, y_val.mean()*100],
    ['Test Set', len(X_test), len(X_test)/len(X)*100, y_test.mean()*100],
    ['Total', len(X), 100.0, y.mean()*100]
]

splits_df = pd.DataFrame(splits_info, 
                        columns=['Dataset', 'Records', 'Percentage', 'Churn Rate (%)'])
display(splits_df)
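
The two-step split above can be wrapped in a small helper so the 70/15/15 proportions are stated once. The function `three_way_split` below is a hypothetical convenience wrapper around scikit-learn's `train_test_split`, shown on toy data with a 20% positive rate. Stratifying both splits keeps the churn rate nearly identical across all three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_size=0.15, test_size=0.15, seed=42):
    """Stratified train/validation/test split (hypothetical helper)."""
    holdout = val_size + test_size
    # First cut: training set vs. everything held out
    X_tr, X_hold, y_tr, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=seed, stratify=y
    )
    # Second cut: divide the holdout into validation and test
    X_v, X_te, y_v, y_te = train_test_split(
        X_hold, y_hold, test_size=test_size / holdout,
        random_state=seed, stratify=y_hold
    )
    return X_tr, X_v, X_te, y_tr, y_v, y_te

toy_X = pd.DataFrame({"f1": np.arange(1000)})
toy_y = pd.Series([1] * 200 + [0] * 800)

X_tr, X_v, X_te, y_tr, y_v, y_te = three_way_split(toy_X, toy_y)
for split_name, part in [("train", y_tr), ("val", y_v), ("test", y_te)]:
    print(f"{split_name:5s} n={len(part):4d} positive rate={part.mean():.3f}")
```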

## 6.2 Feature Scaling

In [None]:
# Scale features using StandardScaler
scaler = StandardScaler()

X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)

X_val_scaled = pd.DataFrame(
    scaler.transform(X_val),
    columns=X_val.columns,
    index=X_val.index
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

print("\n✅ Feature scaling complete!")
print("\n📊 Scaled Feature Statistics:")
print("\nTraining Set:")
print(f"  Mean: {X_train_scaled.mean().mean():.6f}")
print(f"  Std:  {X_train_scaled.std().mean():.6f}")
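
The scaler above is fitted on the training set only and then reused for validation and test data. That ordering is what prevents information from held-out rows leaking into the preprocessing. The sketch below shows the same pattern on synthetic arrays; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(50, 10, size=(700, 2))
toy_val = rng.normal(50, 10, size=(150, 2))

# Learn mean and standard deviation from the training rows only
toy_scaler = StandardScaler().fit(toy_train)

toy_train_s = toy_scaler.transform(toy_train)
toy_val_s = toy_scaler.transform(toy_val)  # reuses the training statistics

print("Train mean:", toy_train_s.mean(axis=0).round(6))
print("Train std: ", toy_train_s.std(axis=0).round(6))
print("Val mean:  ", toy_val_s.mean(axis=0).round(3))  # close to 0, but not exactly 0
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to `ddof=1`, so the training "Std" printed by the cell above can come out slightly above 1 even though the scaling is correct.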

## 6.3 Save Prepared Datasets

In [None]:
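# Save datasets to disk (to_parquet requires pyarrow or fastparquet)
processed_dir = Path(settings.paths.PROCESSED_DATA_DIR)
processed_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing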

datasets = {
    'X_train': X_train_scaled,
    'X_val': X_val_scaled,
    'X_test': X_test_scaled,
    'y_train': y_train,
    'y_val': y_val,
    'y_test': y_test
}

print("\n💾 Saving datasets...")
for name, data in datasets.items():
    filepath = processed_dir / f"{name}.parquet"
    if isinstance(data, pd.Series):
        data.to_frame().to_parquet(filepath)
    else:
        data.to_parquet(filepath)
    print(f"  ✓ Saved {name} to {filepath.name}")

print("\n✅ All datasets saved successfully!")

---

# 7️⃣ Complete Pipeline Execution

## Running the Full Pipeline

Now let's run the complete pipeline using the orchestrator.

In [None]:
# Initialize orchestrator
orchestrator = PipelineOrchestrator(settings)

print("\n🚀 RUNNING COMPLETE PIPELINE")
print("=" * 80)
print("\nThis will execute all stages:")
print("  1. Data Collection")
print("  2. Data Cleaning")
print("  3. Data Validation")
print("  4. Feature Engineering")
print("  5. Exploratory Analysis")
print("  6. Model Preparation")
print("\n⏱️  Estimated time: 3-5 minutes")
print("\nStarting pipeline...\n")

In [None]:
# Run the pipeline
results = orchestrator.run_pipeline(
    use_cache=True,
    continue_on_error=False,
    generate_plots=True,
    deep_clean=True,
    target_column='churned'
)

print("\n✅ Pipeline execution complete!")

## 7.1 Pipeline Summary

In [None]:
# Get pipeline summary
summary = orchestrator.get_pipeline_summary()

print("\n📊 PIPELINE EXECUTION SUMMARY")
print("=" * 80)

print(f"\nTotal Stages: {summary['total_stages']}")
print(f"Successful: {summary['successful_stages']} ✅")
print(f"Failed: {summary['failed_stages']} ❌")
print(f"Total Time: {summary['total_execution_time']:.2f} seconds")

print("\n📋 Stage Results:")
print("-" * 80)

stage_results = []
for stage_name, stage_info in summary['stages'].items():
    status = "✅" if stage_info['success'] else "❌"
    stage_results.append([
        status,
        stage_name,
        f"{stage_info['execution_time']:.2f}s",
        stage_info['error'] if stage_info['error'] else "-"
    ])

results_df = pd.DataFrame(stage_results, 
                         columns=['Status', 'Stage', 'Time', 'Error'])
display(results_df)

---

# 8️⃣ Final Summary and Next Steps

## What We Accomplished

Let's review what we've achieved in Phase 1.

In [None]:
print("\n🎉 PHASE 1 COMPLETE!")
print("=" * 80)

print("\n✅ Achievements:")
print("\n1. Data Collection")
print(f"   • Collected {len(raw_data)} datasets")
print(f"   • Total records: {sum(len(df) for df in raw_data.values()):,}")

print("\n2. Data Cleaning")
print(f"   • Cleaned {len(cleaned_data)} datasets")
print(f"   • Handled missing values, outliers, and duplicates")

print("\n3. Feature Engineering")
print(f"   • Created {len(master_features.columns)} features")
print(f"   • From {len(master_features):,} customer records")

print("\n4. Exploratory Analysis")
print(f"   • Generated comprehensive EDA report")
print(f"   • Created {len(analyzer.visualizations) if hasattr(analyzer, 'visualizations') else 0} visualizations")

print("\n5. Model Preparation")
print(f"   • Training set: {len(X_train):,} records")
print(f"   • Validation set: {len(X_val):,} records")
print(f"   • Test set: {len(X_test):,} records")

print("\n\n📁 Output Files:")
print(f"   • Processed Data: {settings.paths.PROCESSED_DATA_DIR}")
print(f"   • Reports: {settings.paths.REPORTS_DIR}")
print(f"   • Visualizations: {settings.paths.FIGURES_DIR}")

print("\n\n🎯 Key Metrics:")
print(f"   • Churn Rate: {y.mean()*100:.2f}%")
print(f"   • Total Features: {len(numeric_features)}")
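# Report the actual validation outcome from Section 3 instead of a fixed message
print(f"   • Data Quality: {'✅ Validated' if all(r['valid'] for r in validation_results.values()) else '⚠️ Issues Found'}")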

print("\n\n📈 Next Steps (Phase 2):")
print("   1. Model Selection")
print("      • Try XGBoost, Random Forest, Logistic Regression")
print("   2. Hyperparameter Tuning")
print("      • Optimize model parameters")
print("   3. Model Evaluation")
print("      • Target: 85%+ accuracy")
print("      • Focus on precision and recall")
print("   4. Model Deployment")
print("      • Create API endpoint")
print("      • Set up monitoring")

print("\n" + "=" * 80)
# Only show the dashboard tip when the file was actually generated
dashboard_path = settings.paths.FIGURES_DIR / 'interactive_dashboard.html'
if dashboard_path.exists():
    print("\n💡 Tip: Review the interactive dashboard at:")
    print(f"   {dashboard_path}")
    print("\n" + "=" * 80)

---

# 🎓 Learning Summary

## Key Concepts Covered

1. **Data Pipeline Architecture**: Modular, scalable design
2. **Data Quality**: Importance of cleaning and validation
3. **Feature Engineering**: Creating meaningful predictors
4. **EDA**: Understanding data patterns and relationships
5. **Model Preparation**: Train/val/test splits and scaling

## Best Practices Demonstrated

- ✅ Reproducible analysis with random seeds
- ✅ Proper train/validation/test splits
- ✅ Feature scaling before modeling
- ✅ Comprehensive data validation
- ✅ Documentation and reporting

## Common Pitfalls Avoided

- ❌ Data leakage between train/test
- ❌ Ignoring class imbalance
- ❌ Poor feature engineering
- ❌ Inadequate data validation
- ❌ Missing documentation

---

**Congratulations! You've completed Phase 1 of the CRM Analytics Pipeline! 🎉**