# AI-Powered Data Cleaning Agent - Main Demo

**GenAI Competition - UoM DSCubed x UWA DSC**  
**Author:** Rudra Tiwari  
**Complete Data Cleaning Agent with WHO Health Data Support**

---

## What This Notebook Demonstrates:
1. **Core Data Cleaning Engine** - Comprehensive data quality analysis and cleaning
2. **AI-Powered Intelligence** - OpenAI integration for smart cleaning suggestions
3. **WHO Health Data Processing** - Real-world health dataset cleaning
4. **Multi-Sheet Excel Support** - Advanced Excel file handling
5. **Interactive Visualizations** - Before/after comparisons of the data
6. **Professional Reporting** - Detailed cleaning reports and insights

**This is the main demonstration notebook for the competition!**


## Step 1: Import Required Libraries and Setup


In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
import json

# Data cleaning agent
from data_cleaning_agent import DataCleaningAgent
from ai_data_cleaning import AIDataCleaningAgent
from config import OPENAI_API_KEY

# Suppress warnings for clean output
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("All libraries imported successfully!")
print("AI-Powered Data Cleaning Agent ready!")
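# Treat an empty key or the unchanged placeholder as "not configured"
key_configured = bool(OPENAI_API_KEY) and OPENAI_API_KEY != 'your-openai-api-key-here'
print(f"OpenAI API Key configured: {'Yes' if key_configured else 'Please set in config.py'}")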


## Step 2: Initialize AI-Powered Data Cleaning Agent


In [None]:
# Initialize the AI-powered data cleaning agent
agent = AIDataCleaningAgent()

print("AI-Powered Data Cleaning Agent initialized!")
print("Features available:")
print("- Missing value detection and handling")
print("- Duplicate identification and removal")
print("- Data type optimization")
print("- Outlier detection and treatment")
print("- Text standardization")
print("- Memory optimization")
print("- AI-powered cleaning suggestions")
print("- Multi-sheet Excel support")
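
The agent's internals are not shown in this notebook. As a rough illustration of what two of the listed features (text standardization and duplicate removal) can look like in plain pandas, here is a minimal sketch. `standardize_text` and the sample frame are hypothetical and are not part of `AIDataCleaningAgent`.

```python
import pandas as pd

def standardize_text(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: tidy whitespace in column names and string cells."""
    out = df.copy()
    # Strip the ends of each column name and collapse repeated whitespace.
    out.columns = [" ".join(str(c).split()) for c in out.columns]
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric columns have no text to tidy
        # Only touch real strings so NaN and other values pass through unchanged.
        out[col] = out[col].map(
            lambda v: " ".join(v.split()) if isinstance(v, str) else v
        )
    return out

# Illustrative data only: the first two rows differ just by stray spaces.
sample = pd.DataFrame(
    {" Country ": ["  Kenya", "Kenya ", "Peru"], "Year": [2015, 2015, 2015]}
)
tidy = standardize_text(sample).drop_duplicates().reset_index(drop=True)
print(tidy)
```

Standardizing before deduplicating matters here: rows that differ only by stray whitespace are not seen as duplicates until the text has been tidied.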


## Step 3: Load and Clean WHO Health Data


In [None]:
# Load WHO health data
df = pd.read_csv('datasets/Life Expectancy Data.csv')

print("WHO Health Data loaded successfully!")
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# Show basic information
print("\nDataset Overview:")
df.info()  # info() prints its summary directly and returns None

# Show first few rows
print("\nFirst 5 rows:")
display(df.head())

# Perform data quality analysis
print("\nData Quality Analysis:")
quality_report = agent.analyze_data_quality(df)
print(f"Overall Quality Score: {quality_report['overall_score']:.1f}%")
print(f"Missing Values: {quality_report['missing_values']}")
print(f"Duplicate Rows: {quality_report['duplicates']}")
print(f"Outliers Detected: {quality_report['outliers']}")
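
How `analyze_data_quality` arrives at these numbers is not shown here. To make the four report fields concrete, the following is a minimal sketch of one way such a report could be computed with pandas alone. It assumes a 1.5 × IQR rule for outliers and a score equal to the share of non-missing cells. Both choices, and the name `basic_quality_report`, are assumptions for illustration rather than the agent's actual logic.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for a quality report; not the agent's real method."""
    missing = int(df.isna().sum().sum())
    duplicates = int(df.duplicated().sum())

    # Count numeric cells lying more than 1.5 * IQR outside their column's quartiles.
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    outliers = int(outside.sum().sum())

    # Assumed score: percentage of cells that are not missing.
    total_cells = df.size
    score = 100.0 * (1 - missing / total_cells) if total_cells else 100.0
    return {
        "overall_score": score,
        "missing_values": missing,
        "duplicates": duplicates,
        "outliers": outliers,
    }

# Illustrative data: one missing value, one duplicated row, one extreme value.
demo = pd.DataFrame(
    {"a": [1.0, 2.0, 2.0, 3.0, 100.0, None], "b": ["x", "y", "y", "z", "w", "v"]}
)
report = basic_quality_report(demo)
print(report)
```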


## Step 4: Apply AI-Powered Data Cleaning


In [None]:
# Store original data for comparison
original_df = df.copy()
original_shape = df.shape
original_memory = df.memory_usage(deep=True).sum() / 1024

print("APPLYING AI-POWERED DATA CLEANING")
print("=" * 40)
print(f"Original dataset: {original_shape[0]} rows × {original_shape[1]} columns")
print(f"Original memory usage: {original_memory:.1f} KB")

# Apply comprehensive cleaning
cleaned_df = agent.intelligent_clean(df)

cleaned_shape = cleaned_df.shape
cleaned_memory = cleaned_df.memory_usage(deep=True).sum() / 1024

print(f"\nCleaned dataset: {cleaned_shape[0]} rows × {cleaned_shape[1]} columns")
print(f"Cleaned memory usage: {cleaned_memory:.1f} KB")

# Calculate improvements
rows_removed = original_shape[0] - cleaned_shape[0]
memory_saved = original_memory - cleaned_memory
memory_improvement = (memory_saved / original_memory) * 100

print(f"\nCleaning Results:")
print(f"- Rows removed: {rows_removed}")
print(f"- Memory saved: {memory_saved:.1f} KB ({memory_improvement:.1f}% improvement)")

# Show cleaned data
print("\nCleaned Data Sample:")
display(cleaned_df.head())

print("\nAI-powered cleaning complete!")
print("Data is now ready for analysis and modeling.")
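
The memory savings reported above depend on what `intelligent_clean` does internally, which this notebook does not show. One common way to reduce a DataFrame's footprint is dtype downcasting, sketched below under stated assumptions: integers are narrowed with `pd.to_numeric`, low-cardinality text columns become `category`, and floats are left alone because narrowing to `float32` can lose precision. `downcast_frame` and its `max_unique_ratio` threshold are hypothetical names, not part of the agent.

```python
import pandas as pd

def downcast_frame(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: use smaller dtypes where values are preserved exactly."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest signed integer type that holds every value.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_object_dtype(series) or isinstance(series.dtype, pd.StringDtype):
            # Repeated labels are cheaper to store as category codes.
            if len(series) > 0 and series.nunique() / len(series) <= max_unique_ratio:
                out[col] = series.astype("category")
    return out

# Illustrative data: a small-range integer column and a two-label text column.
demo = pd.DataFrame(
    {"year": [2000, 2001] * 500, "status": ["Developing", "Developed"] * 500}
)
small = downcast_frame(demo)

before_kb = demo.memory_usage(deep=True).sum() / 1024
after_kb = small.memory_usage(deep=True).sum() / 1024
print(f"Before: {before_kb:.1f} KB, after: {after_kb:.1f} KB")
```

Measuring with `memory_usage(deep=True)` before and after, as the cell above does, is what makes the saving visible, since the shallow figure ignores the size of the strings themselves.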


## Step 5: Health Crisis Analysis with WHO Data


In [None]:
# Import health crisis analyzer
from features.health_crisis_analysis import analyze_health_data

print("HEALTH CRISIS ANALYSIS")
print("=" * 30)

# Perform health crisis analysis on cleaned data
crisis_analysis = analyze_health_data(cleaned_df)

print("\nHealth crisis analysis complete!")
print("This demonstrates real-world health data processing capabilities.")
print("Perfect for showing practical application in the hackathon demo!")

