# AI-Powered Data Cleaning Agent - Interactive Demo

**GenAI Competition - UoM DSCubed x UWA DSC**  
**Author:** Rudra Tiwari  
**Designed to run as a Google Colab demo.**

---

## What This Demo Does:
1. **Step 1:** Load and examine your dataset
2. **Step 2:** Show data quality issues
3. **Step 3:** Get AI-powered cleaning suggestions
4. **Step 4:** Apply intelligent cleaning
5. **Step 5:** Show before/after comparison
6. **Step 6:** Generate final report

**Just upload your dataset and run each cell!**


## Step 0: Install Required Libraries


In [None]:
# Install required libraries
%pip install pandas numpy matplotlib seaborn openpyxl langchain langchain-openai python-dotenv

# This message prints even if an install failed, so check the pip output above for errors
print("Install command finished.")
print("Ready to start the demo!")
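
A failed install may not stop the notebook, so it can help to confirm that each package is importable before moving on. The following is a minimal sketch, not part of the original demo: `missing_modules` and `REQUIRED` are hypothetical names, and the list assumes the usual import names for the packages installed in Step 0 (`langchain_openai` for `langchain-openai`, `dotenv` for `python-dotenv`).

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Assumed import names for the packages installed in Step 0
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "openpyxl",
            "langchain", "langchain_openai", "dotenv"]

missing = missing_modules(REQUIRED)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All required modules are importable.")
```

If anything is reported missing, re-running the install cell (or restarting the runtime) is the first thing to try.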


## Step 1: Upload Your Dataset

**Upload your dataset file (CSV, Excel, or JSON) in the file uploader below:**


In [None]:
from google.colab import files
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime
import json
import io

warnings.filterwarnings('ignore')

print("Please upload your dataset file...")
print("Supported formats: CSV, Excel (.xlsx), JSON")

# File upload
uploaded = files.upload()

# Get the uploaded file (if several files were uploaded, only the first is used)
if uploaded:
    file_name = list(uploaded.keys())[0]
    print(f"File uploaded: {file_name}")

    # Load the dataset based on file type; compare in lowercase so that
    # names such as "DATA.CSV" are recognised too
    ext = file_name.lower()
    if ext.endswith('.csv'):
        df = pd.read_csv(io.BytesIO(uploaded[file_name]))
    elif ext.endswith(('.xlsx', '.xls')):
        # Note: pandas reads legacy .xls files through xlrd, which Step 0 does not install
        df = pd.read_excel(io.BytesIO(uploaded[file_name]))
    elif ext.endswith('.json'):
        df = pd.read_json(io.BytesIO(uploaded[file_name]))
    else:
        print("Unsupported file format")
        df = None
    
    if df is not None:
        print(f"Dataset loaded successfully!")
        print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
        print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
else:
    print("No file uploaded")
    df = None
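
The cell above picks a pandas reader from the file extension. The same dispatch can be sketched as a small function that also works outside Colab, which makes it easy to try on in-memory bytes. This is an illustration under assumptions rather than the demo's own code: `READERS` and `load_dataset` are hypothetical names, and the mapping covers only the three formats listed in the upload message.

```python
import io
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".json": pd.read_json,
}

def load_dataset(file_name, raw_bytes):
    """Return a DataFrame for a supported file, or None for an unknown extension."""
    suffix = Path(file_name).suffix.lower()  # "People.CSV" -> ".csv"
    reader = READERS.get(suffix)
    if reader is None:
        return None
    return reader(io.BytesIO(raw_bytes))

# Try it on a tiny in-memory CSV with one missing value
sample = b"name,age\nAda,36\nGrace,\n"
df_sample = load_dataset("People.CSV", sample)
print(df_sample.shape)  # (2, 2)
```

In Colab, the same function could be called as `load_dataset(file_name, uploaded[file_name])`.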


## Step 2: Examine Your Dataset

**Let's take a detailed look at your data:**


In [None]:
if df is not None:
    print("🔍 DATASET EXAMINATION")
    print("=" * 50)
    
    # Basic information
    print(f"📊 Dataset Overview:")
    print(f"   - Shape: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"   - Total cells: {df.shape[0] * df.shape[1]:,}")
    print(f"   - Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
    
    # Column information
    print(f"\n📝 Column Information:")
    for i, col in enumerate(df.columns, 1):
        print(f"   {i:2d}. {col}")
    
    # Data types
    print(f"\n🔧 Data Types:")
    dtype_counts = df.dtypes.value_counts()
    for dtype, count in dtype_counts.items():
        print(f"   - {dtype}: {count} columns")
    
    # Show first few rows
    print(f"\n📊 First 5 rows of your data:")
    display(df.head())
    
    print(f"\n✅ Dataset examination complete!")
else:
    print("❌ No dataset loaded. Please upload a file first.")
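
The examination above prints column names and a count of data types separately. As a possible complement, the sketch below gathers the per-column details into one table. `column_summary` is a hypothetical helper, not part of the original demo, and the `demo` frame is made-up illustration data.

```python
import pandas as pd

def column_summary(frame):
    """Build one row per column: dtype, non-null count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.count(),   # count() excludes missing values
        "unique": frame.nunique(),   # nunique() ignores missing values by default
    })

# Made-up data for illustration only
demo = pd.DataFrame({"name": ["Ada", "Grace", "Ada"],
                     "age": [36.0, None, 36.0]})
summary = column_summary(demo)
print(summary)
```

On the uploaded data this would be called as `column_summary(df)`; in a notebook, `display(...)` renders the result as a table.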
