# DataProf Working Demo - Data Profiling & Quality Analysis

This notebook walks through the parts of the `dataprof` library that were confirmed to work for data profiling and quality assessment: CSV column profiling, quality reports, and batch analysis.

**Version:** 0.4.1 (confirmed working)  
**Date:** 2025-01-19

## 1. Setup and Environment Check

In [1]:
import os

import pandas as pd
import dataprof as dp

print("📊 DataProf Working Demo")
print("=" * 50)
print(f"Dataprof version: {getattr(dp, '__version__', 'Unknown')}")
print(f"Pandas version: {pd.__version__}")
print()
print("Available dataprof functions:")
functions = [f for f in dir(dp) if not f.startswith('_') and callable(getattr(dp, f))]
for func in functions:
    print(f"  ✓ {func}")
print()
print("✅ Setup completed successfully!")

📊 DataProf Working Demo
Dataprof version: Unknown
Pandas version: 2.3.2

Available dataprof functions:
  ✓ PyBatchResult
  ✓ PyColumnProfile
  ✓ PyQualityIssue
  ✓ PyQualityReport
  ✓ analyze_csv_file
  ✓ analyze_csv_with_quality
  ✓ analyze_json_file
  ✓ batch_analyze_directory
  ✓ batch_analyze_glob

✅ Setup completed successfully!
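The version prints as `Unknown` in the output above, presumably because the module does not expose a `__version__` attribute. A minimal sketch of an alternative lookup through the standard library's `importlib.metadata` follows. The helper name `installed_version` is hypothetical, and the sketch assumes the package is installed under the distribution name `dataprof`.

```python
from importlib import metadata


def installed_version(dist_name: str, default: str = "Unknown") -> str:
    """Return the installed version of a distribution, or `default` if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return default


# Assumption: the distribution name matches the import name, "dataprof".
print(f"Dataprof version: {installed_version('dataprof')}")
```

The lookup reads the installed package metadata, so it works even when the module itself carries no version attribute.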


## 2. Sample Data Creation

In [3]:
# Create a realistic dataset with various data quality issues
import numpy as np

np.random.seed(42)

data = {
    'customer_id': range(1, 101),
    'age': [25, 32, 45, None, 28, 31, 67, 23, 29, 41] * 10,
    'income': [35000, 52000, 78000, 45000, None, 28000, 95000, 31000, 48000, 67000] * 10,
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', None, 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas'] * 10,
    'purchase_amount': np.random.normal(100, 30, 100).round(2),
    'satisfaction_score': np.random.randint(1, 6, 100)
}

# Introduce some missing values
data['age'][15] = None
data['age'][33] = None
data['income'][22] = None
data['city'][8] = None

df = pd.DataFrame(data)

print("📋 Sample Dataset Created")
print("=" * 30)
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print()
print("First 5 rows:")
print(df.head())
print()
print("Missing values per column:")
missing_data = df.isnull().sum()
for col, missing in missing_data.items():
    print(f"  {col}: {missing} ({missing/len(df)*100:.1f}%)")

📋 Sample Dataset Created
Shape: (100, 6)
Columns: ['customer_id', 'age', 'income', 'city', 'purchase_amount', 'satisfaction_score']

First 5 rows:
   customer_id   age   income         city  purchase_amount  \
0            1  25.0  35000.0     New York           114.90   
1            2  32.0  52000.0  Los Angeles            95.85   
2            3  45.0  78000.0      Chicago           119.43   
3            4   NaN  45000.0      Houston           145.69   
4            5  28.0      NaN      Phoenix            92.98   

   satisfaction_score  
0                   1  
1                   5  
2                   1  
3                   3  
4                   2  

Missing values per column:
  customer_id: 0 (0.0%)
  age: 11 (11.0%)
  income: 11 (11.0%)
  city: 11 (11.0%)
  purchase_amount: 0 (0.0%)
  satisfaction_score: 0 (0.0%)
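The count of 11 missing values per column follows from how the lists are built: each base pattern contains one `None` and is repeated 10 times, and the manual assignments add at most one more. A small pandas-free sketch that rebuilds the three columns and counts the gaps:

```python
# Rebuild the three columns exactly as the cell above does.
age = [25, 32, 45, None, 28, 31, 67, 23, 29, 41] * 10
income = [35000, 52000, 78000, 45000, None, 28000, 95000, 31000, 48000, 67000] * 10
city = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', None,
        'Philadelphia', 'San Antonio', 'San Diego', 'Dallas'] * 10

age[15] = None     # replaces 31
age[33] = None     # already None (33 % 10 == 3), so the count does not change
income[22] = None  # replaces 78000
city[8] = None     # replaces 'San Diego'

missing = {name: col.count(None)
           for name, col in [('age', age), ('income', income), ('city', city)]}
print(missing)  # {'age': 11, 'income': 11, 'city': 11}
```

Note that `age` receives two assignments but only gains one missing value, because index 33 was already `None` in the repeated pattern.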


## 3. Basic Data Profiling

In [4]:
# Save dataset for analysis
csv_file = "customer_data.csv"
df.to_csv(csv_file, index=False)
print(f"💾 Dataset saved to: {csv_file}")

# Perform basic analysis
print("\n🔍 Basic Data Profiling")
print("=" * 30)

try:
    profiles = dp.analyze_csv_file(csv_file)
    print(f"✅ Analysis completed: {len(profiles)} column profiles generated")
    
    print("\n📊 Column Profile Summary:")
    for i, profile in enumerate(profiles):
        print(f"\nColumn {i+1} Profile:")
        
        # Get available attributes
        attrs = [attr for attr in dir(profile) if not attr.startswith('_') and not callable(getattr(profile, attr))]
        
        for attr in attrs[:8]:  # Show first 8 attributes
            try:
                value = getattr(profile, attr)
                print(f"  {attr}: {value}")
            except Exception as e:
                print(f"  {attr}: Error - {e}")
                
except Exception as e:
    print(f"❌ Error during analysis: {e}")
    import traceback
    traceback.print_exc()

# Clean up
if os.path.exists(csv_file):
    os.remove(csv_file)
    print(f"\n🧹 Cleaned up {csv_file}")

💾 Dataset saved to: customer_data.csv

🔍 Basic Data Profiling
✅ Analysis completed: 6 column profiles generated

📊 Column Profile Summary:

Column 1 Profile:
  data_type: integer
  name: customer_id
  null_count: 0
  null_percentage: 0.0
  total_count: 100
  unique_count: 100
  uniqueness_ratio: 1.0

Column 2 Profile:
  data_type: float
  name: age
  null_count: 11
  null_percentage: 11.0
  total_count: 100
  unique_count: 10
  uniqueness_ratio: 0.1

Column 3 Profile:
  data_type: float
  name: purchase_amount
  null_count: 0
  null_percentage: 0.0
  total_count: 100
  unique_count: 99
  uniqueness_ratio: 0.99

Column 4 Profile:
  data_type: string
  name: city
  null_count: 11
  null_percentage: 11.0
  total_count: 100
  unique_count: 10
  uniqueness_ratio: 0.1

Column 5 Profile:
  data_type: float
  name: income
  null_count: 11
  null_percentage: 11.0
  total_count: 100
  unique_count: 10
  uniqueness_ratio: 0.1

Column 6 Profile:
  data_type: integer
  name: satisfaction_score
  nu
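The printed fields appear to be related by simple arithmetic: `null_percentage` is `null_count / total_count * 100`, and `uniqueness_ratio` is `unique_count / total_count`. The `age` column has 9 distinct non-null values but reports `unique_count: 10`, which suggests that a missing value is counted as one distinct value. The sketch below reproduces the `age` numbers under that assumption. It is inferred from the output above, not taken from dataprof's source, and `ColumnStats` and `column_stats` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    name: str
    total_count: int
    null_count: int
    null_percentage: float
    unique_count: int
    uniqueness_ratio: float


def column_stats(name, values):
    """Compute profile-style statistics for one column given as a list."""
    total = len(values)
    nulls = sum(v is None for v in values)
    # Assumption inferred from the output: None counts as one distinct value.
    unique = len(set(values))
    return ColumnStats(
        name=name,
        total_count=total,
        null_count=nulls,
        null_percentage=100 * nulls / total if total else 0.0,
        unique_count=unique,
        uniqueness_ratio=unique / total if total else 0.0,
    )


age = [25, 32, 45, None, 28, 31, 67, 23, 29, 41] * 10
age[15] = None
age[33] = None

stats = column_stats("age", age)
print(stats)
```

Also note that the profiles above are not listed in DataFrame column order (`purchase_amount` comes before `city` and `income`), so it is safer to look profiles up by their `name` attribute than by position.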

## 4. Quality Assessment

In [5]:
# Quality analysis
csv_file = "customer_data_quality.csv"
df.to_csv(csv_file, index=False)

print("🎯 Data Quality Assessment")
print("=" * 35)

try:
    quality_report = dp.analyze_csv_with_quality(csv_file)
    
    print(f"📈 Quality Score: {quality_report.quality_score()}/100")
    print(f"📊 Total Rows: {quality_report.total_rows:,}")
    print(f"📊 Total Columns: {quality_report.total_columns}")
    print(f"⏱️  Scan Time: {quality_report.scan_time_ms} ms")
    print(f"⚠️  Issues Found: {len(quality_report.issues)}")
    
    if quality_report.issues:
        print("\n🚨 Quality Issues Detected:")
        for i, issue in enumerate(quality_report.issues, 1):
            # dataprof reports severity in lowercase (e.g. "medium"), so normalize before comparing
            severity = str(issue.severity).lower()
            severity_icon = "🔴" if severity == "high" else "🟡" if severity == "medium" else "🟢"
            print(f"  {i}. {severity_icon} {issue.description}")
            print(f"     Column: {issue.column} | Severity: {issue.severity}")
    else:
        print("\n✅ No quality issues detected!")

    # Additional quality metrics
    print("\n📋 Quality Report Details:")
    attrs = [attr for attr in dir(quality_report) if not attr.startswith('_') and not callable(getattr(quality_report, attr))]
    for attr in attrs:
        try:
            value = getattr(quality_report, attr)
            print(f"  {attr}: {value}")
        except Exception:
            # Skip attributes that cannot be read
            pass

except Exception as e:
    print(f"❌ Error during quality analysis: {e}")
    import traceback
    traceback.print_exc()

# Clean up
if os.path.exists(csv_file):
    os.remove(csv_file)
    print(f"\n🧹 Cleaned up {csv_file}")

🎯 Data Quality Assessment
📈 Quality Score: 50.0/100
📊 Total Rows: 100
📊 Total Columns: 6
⏱️  Scan Time: 533 ms
⚠️  Issues Found: 7

🚨 Quality Issues Detected:
  1. 🟡 11 null values (11%) in column 'age'
     Column: age | Severity: medium
  2. 🟢 90 duplicate values in column 'age'
     Column: age | Severity: low
  3. 🟡 11 null values (11%) in column 'income'
     Column: income | Severity: medium
  4. 🟢 90 duplicate values in column 'income'
     Column: income | Severity: low
  5. 🟡 11 null values (11%) in column 'city'
     Column: city | Severity: medium
  6. 🟢 90 duplicate values in column 'city'
     Column: city | Severity: low
  7. 🟢 95 duplicate values in column 'satisfaction_score'
     Column: satisfaction_score | Severity: low

📋 Quality Report Details:
  column_profiles: [<builtins.PyColumnProfile object at 0x000002CEBB7FC4B0>, <builtins.PyColumnProfile object at 0x000002CEBB7FE330>, <builtins.PyColumnProfile object at 0x000002CEBB7FE430>, <builtins.PyColumnProfile object 
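The issue list above can be summarized by severity and turned into a simple pass/fail check. The sketch below uses stand-in objects that carry the same `column`, `severity`, and `description` attributes the notebook reads from each issue, with values copied from the output. The helper names and the thresholds (`min_score`, `max_blocking`) are hypothetical and not part of dataprof.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for the issue objects printed above (values copied from the output).
issues = [
    SimpleNamespace(column="age", severity="medium", description="11 null values (11%) in column 'age'"),
    SimpleNamespace(column="age", severity="low", description="90 duplicate values in column 'age'"),
    SimpleNamespace(column="income", severity="medium", description="11 null values (11%) in column 'income'"),
    SimpleNamespace(column="income", severity="low", description="90 duplicate values in column 'income'"),
    SimpleNamespace(column="city", severity="medium", description="11 null values (11%) in column 'city'"),
    SimpleNamespace(column="city", severity="low", description="90 duplicate values in column 'city'"),
    SimpleNamespace(column="satisfaction_score", severity="low",
                    description="95 duplicate values in column 'satisfaction_score'"),
]


def count_by_severity(issues):
    """Count issues per severity, comparing case-insensitively."""
    return Counter(str(issue.severity).lower() for issue in issues)


def passes_gate(quality_score, issues, min_score=80.0, max_blocking=0):
    """Hypothetical gate: require a minimum score and limit medium/high issues."""
    counts = count_by_severity(issues)
    blocking = counts["high"] + counts["medium"]
    return quality_score >= min_score and blocking <= max_blocking


counts = count_by_severity(issues)
print(dict(counts))                # {'medium': 3, 'low': 4}
print(passes_gate(50.0, issues))   # False
```

Normalizing the case matters here because the severity values in the output are lowercase (`medium`, `low`). The duplicate counts also match `total_count - unique_count` from the column profiles (for example, 100 - 10 = 90 for `age`).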

## 5. Batch Processing Demo

In [6]:
print("📦 Batch Processing Demo")
print("=" * 30)

# Create multiple test datasets
test_files = []
datasets = {
    'sales_q1': {
        'month': ['Jan', 'Feb', 'Mar'] * 10,
        'sales': np.random.normal(50000, 10000, 30).round(2),
        'region': ['North', 'South', 'East', 'West'] * 7 + ['North', 'South']
    },
    'sales_q2': {
        'month': ['Apr', 'May', 'Jun'] * 8,
        'sales': np.random.normal(55000, 12000, 24).round(2),
        'region': ['North', 'South', 'East', 'West'] * 6
    },
    'employee_data': {
        'employee_id': range(1, 21),
        'department': ['IT', 'HR', 'Finance', 'Marketing'] * 5,
        'salary': np.random.normal(70000, 15000, 20).round(2),
        'years_experience': np.random.randint(1, 16, 20)
    }
}

# Create files
for name, data in datasets.items():
    filename = f"batch_{name}.csv"
    pd.DataFrame(data).to_csv(filename, index=False)
    test_files.append(filename)
    print(f"📄 Created: {filename}")

try:
    # Run batch analysis
    print("\n🔄 Running batch analysis...")
    batch_result = dp.batch_analyze_glob("batch_*.csv")
    
    print("✅ Batch analysis completed!")
    print(f"📊 Result type: {type(batch_result)}")
    
    # Explore batch result attributes
    print("\n📋 Batch Result Attributes:")
    attrs = [attr for attr in dir(batch_result) if not attr.startswith('_') and not callable(getattr(batch_result, attr))]
    for attr in attrs:
        try:
            value = getattr(batch_result, attr)
            print(f"  {attr}: {value}")
        except Exception:
            # Skip attributes that cannot be read
            pass
            
except Exception as e:
    print(f"❌ Error during batch analysis: {e}")
    import traceback
    traceback.print_exc()

# Clean up
for filename in test_files:
    if os.path.exists(filename):
        os.remove(filename)
        
print(f"\n🧹 Cleaned up {len(test_files)} batch test files")

📦 Batch Processing Demo
📄 Created: batch_sales_q1.csv
📄 Created: batch_sales_q2.csv
📄 Created: batch_employee_data.csv

🔄 Running batch analysis...
✅ Batch analysis completed!
📊 Result type: <class 'builtins.PyBatchResult'>

📋 Batch Result Attributes:
  average_quality_score: 96.94444444444446
  failed_files: 0
  processed_files: 3
  total_duration_secs: 0.4270756
  total_quality_issues: 6

🧹 Cleaned up 3 batch test files
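The batch attributes above can be combined into per-file figures. The sketch below uses a stand-in object holding the printed values. `batch_summary` is a hypothetical helper, and the sketch assumes that `processed_files` counts the files that were analyzed successfully.

```python
from types import SimpleNamespace

# Stand-in for the PyBatchResult printed above (values copied from the output).
batch_result = SimpleNamespace(
    average_quality_score=96.94444444444446,
    failed_files=0,
    processed_files=3,
    total_duration_secs=0.4270756,
    total_quality_issues=6,
)


def batch_summary(result):
    """Derive per-file averages, guarding against an empty batch."""
    n = result.processed_files
    return {
        "issues_per_file": result.total_quality_issues / n if n else 0.0,
        "secs_per_file": result.total_duration_secs / n if n else 0.0,
        "any_failures": result.failed_files > 0,
    }


summary = batch_summary(batch_result)
print(summary)
```

For this run the 6 issues across 3 files work out to 2 issues per file on average.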


## 6. Summary and Next Steps

In [9]:
print("🎉 DataProf Working Demo Complete!")
print("=" * 45)

print("\n✅ Successfully Demonstrated:")
print("   📊 Basic data profiling with analyze_csv_file()")
print("   🎯 Quality assessment with analyze_csv_with_quality()")
print("   📦 Batch processing with batch_analyze_glob()")
print("   🔍 Column profiling and statistics")
print("   ⚠️  Quality issue detection")
print("   📈 Quality scoring")

print("\n🛠️  DataProf Functions Used (v0.4.1):")
working_functions = [
    'analyze_csv_file() - Basic CSV analysis',
    'analyze_csv_with_quality() - Quality assessment',
    'batch_analyze_glob() - Batch file processing',
    'PyColumnProfile - Column statistics object',
    'PyQualityReport - Quality assessment object',
    'PyBatchResult - Batch processing result object'
]

for func in working_functions:
    print(f"   ✓ {func}")

print("\n🚀 Next Steps:")
print("   1. Explore analyze_json_file() for JSON data")
print("   2. Try batch_analyze_directory() for folder processing")
print("   3. Integrate dataprof into your data pipelines")
print("   4. Set up automated quality monitoring")
print("   5. Create custom quality thresholds")
print("   6. Export quality reports for stakeholders")

print("\n📚 Documentation: https://github.com/AndreaBozzo/dataprof")
print("🐛 Issues: https://github.com/AndreaBozzo/dataprof/issues")

print("\n" + "=" * 50)
print("    DataProf Demo - Ready for Production Use! 🚀")
print("=" * 50)

🎉 DataProf Working Demo Complete!

✅ Successfully Demonstrated:
   📊 Basic data profiling with analyze_csv_file()
   🎯 Quality assessment with analyze_csv_with_quality()
   📦 Batch processing with batch_analyze_glob()
   🔍 Column profiling and statistics
   ⚠️  Quality issue detection
   📈 Quality scoring

🛠️  DataProf Functions Used (v0.4.1):
   ✓ analyze_csv_file() - Basic CSV analysis
   ✓ analyze_csv_with_quality() - Quality assessment
   ✓ batch_analyze_glob() - Batch file processing
   ✓ PyColumnProfile - Column statistics object
   ✓ PyQualityReport - Quality assessment object
   ✓ PyBatchResult - Batch processing result object

🚀 Next Steps:
   1. Explore analyze_json_file() for JSON data
   2. Try batch_analyze_directory() for folder processing
   3. Integrate dataprof into your data pipelines
   4. Set up automated quality monitoring
   5. Create custom quality thresholds
   6. Export quality reports for stakeholders

📚 Documentation: https://github.com/AndreaBozzo/dataprof
🐛
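For the "export quality reports" step, one possible approach is to serialize the fields this notebook already reads from the quality report. The sketch below uses a stand-in object with the same `quality_score()` method and `total_rows`, `total_columns`, and `issues` attributes. The function `report_to_json` is hypothetical and not a dataprof API.

```python
import json
from types import SimpleNamespace


def report_to_json(report):
    """Serialize the report fields used in this notebook to a JSON string."""
    payload = {
        "quality_score": report.quality_score(),
        "total_rows": report.total_rows,
        "total_columns": report.total_columns,
        "issues": [
            {
                "column": issue.column,
                "severity": str(issue.severity),
                "description": issue.description,
            }
            for issue in report.issues
        ],
    }
    return json.dumps(payload, indent=2)


# Stand-in report with values copied from the quality assessment output.
report = SimpleNamespace(
    quality_score=lambda: 50.0,
    total_rows=100,
    total_columns=6,
    issues=[
        SimpleNamespace(column="age", severity="medium",
                        description="11 null values (11%) in column 'age'"),
    ],
)

exported = report_to_json(report)
print(exported)
```

With a real report, the same function could be called on the object returned by `analyze_csv_with_quality()`, provided its attributes behave as they do in the cells above.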