# adqia Demo Notebook

This notebook demonstrates how to use the adqia (Auto Data QA & Insight Agent) orchestrator to analyze CSV data.

## Features Demonstrated:
- Data ingestion from CSV
- Schema inference and tracking
- Data quality assessment
- Anomaly detection
- Insight generation
- Report creation

## 1. Setup and Imports

In [None]:
import sys
import os

# Add the project root (the parent directory) to the path so the `src` package can be imported
sys.path.insert(0, os.path.abspath('..'))

from src.orchestrator import Orchestrator
import pandas as pd

## 2. Initialize Orchestrator

In [None]:
# Create orchestrator instance
orchestrator = Orchestrator(use_llm=False)

print("Orchestrator initialized successfully!")

## 3. Load and Preview Sample Data

In [None]:
# Path to sample data
data_path = "../data/sample_sales.csv"

# Preview the data
df = pd.read_csv(data_path)
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
df.head()

## 4. Run Complete Analysis

In [None]:
# Run the analysis pipeline
results = orchestrator.analyze(
    filepath=data_path,
    generate_report=True,
    report_dir="../reports"
)

print("\n✅ Analysis complete!")

## 5. View Dataset Information

In [None]:
info = results['dataset_info']

print("Dataset Information:")
print("="*50)
print(f"File: {info['filepath']}")
print(f"Rows: {info['rows']}")
print(f"Columns: {info['columns']}")
print(f"Column Names: {', '.join(info['column_names'])}")

## 6. View Schema

In [None]:
schema = results['schema']

print("Schema:")
print("="*50)
for col, dtype in schema.items():
    print(f"  {col:20s} -> {dtype}")
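
The notebook does not show how adqia infers the schema. As a rough illustration only, a column-to-dtype mapping of the same shape can be built with plain pandas. The helper name `infer_schema` and the sample columns below are hypothetical and are not part of adqia.

```python
import pandas as pd


def infer_schema(frame: pd.DataFrame) -> dict:
    """Map each column name to the string name of its pandas dtype.

    Hypothetical helper for illustration; adqia's own inference may differ.
    """
    return {col: str(dtype) for col, dtype in frame.dtypes.items()}


# Small made-up frame, used only to exercise the helper
sample = pd.DataFrame({
    "region": ["north", "south"],
    "units": [3, 5],
    "price": [9.99, 12.50],
})

sample_schema = infer_schema(sample)
for col, dtype in sample_schema.items():
    print(f"  {col:20s} -> {dtype}")
```

If adqia records richer type information (for example dates or categories), this sketch will not reproduce it.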

## 7. Data Quality Assessment Results

In [None]:
qa = results['qa_results']

print("Data Quality Assessment:")
print("="*50)

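# Missing values (keep only columns with at least one missing value)
missing = {col: n for col, n in qa.get('missing_values', {}).items() if n}
null_fraction = qa.get('null_fraction', {})
if missing:
    print("\nMissing Values:")
    for col, count in missing.items():
        frac = null_fraction.get(col, 0)
        print(f"  - {col}: {count} ({frac*100:.2f}%)")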
else:
    print("\n✅ No missing values detected")

# Duplicates
duplicates = qa.get('duplicate_rows', 0)
print(f"\nDuplicate Rows: {duplicates}")

if duplicates == 0:
    print("✅ No duplicates")
else:
    print(f"⚠️ {duplicates} duplicate(s) found")
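
For readers who want to see what checks like these involve, here is a minimal pandas sketch that produces a dictionary with the same three keys read above (`missing_values`, `null_fraction`, `duplicate_rows`). It is an assumption about the kind of computation involved, not adqia's actual implementation. The function name `basic_quality_checks` and the sample data are made up.

```python
import pandas as pd


def basic_quality_checks(frame: pd.DataFrame) -> dict:
    """Count missing values per column and fully duplicated rows.

    Hypothetical helper; adqia's QA step may compute these differently.
    """
    null_counts = frame.isna().sum()
    # Report only columns that actually contain missing values
    missing = {col: int(n) for col, n in null_counts.items() if n > 0}
    n_rows = len(frame)
    null_fraction = {
        col: (count / n_rows if n_rows else 0.0)
        for col, count in missing.items()
    }
    return {
        "missing_values": missing,
        "null_fraction": null_fraction,
        # duplicated() flags every repeat of an earlier identical row
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Made-up data: two missing amounts and one repeated row
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 3, 4],
    "amount": [10.0, None, 5.0, 5.0, None],
})

qa_sketch = basic_quality_checks(sample)
print(qa_sketch)
```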

## 8. Anomaly Detection Results

In [None]:
anomaly = results['anomaly_results']

print("Anomaly Detection Results:")
print("="*50)

outliers = anomaly.get('outliers', {})
if outliers:
    print("\nOutliers Detected:")
    for col, count in outliers.items():
        print(f"  - {col}: {count} outlier(s)")
        
        # Show stats
        stats = anomaly.get('summary_stats', {}).get(col, {})
        if stats:
            print(f"    Mean: {stats.get('mean', 0):.2f}, Std: {stats.get('std', 0):.2f}")
else:
    print("\n✅ No outliers detected")
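
The notebook does not state which outlier rule adqia applies. Because the results expose a mean and standard deviation per column, one plausible reading is a z-score rule, sketched below with plain pandas. The 3-standard-deviation threshold, the function name `zscore_outlier_counts`, and the sample data are all assumptions made for illustration.

```python
import pandas as pd


def zscore_outlier_counts(frame: pd.DataFrame, threshold: float = 3.0) -> dict:
    """Count values more than `threshold` standard deviations from the column mean.

    Hypothetical helper; the rule adqia uses is not documented in this notebook.
    """
    outliers, summary_stats = {}, {}
    for col in frame.select_dtypes(include="number").columns:
        series = frame[col].dropna()
        mean, std = series.mean(), series.std()
        summary_stats[col] = {"mean": float(mean), "std": float(std)}
        # Skip constant or empty columns, where a z-score is undefined
        if pd.isna(std) or std == 0:
            continue
        count = int(((series - mean).abs() > threshold * std).sum())
        if count:
            outliers[col] = count
    return {"outliers": outliers, "summary_stats": summary_stats}


# Made-up data: one extreme amount among otherwise identical values
sample = pd.DataFrame({
    "amount": [10.0] * 19 + [500.0],
    "qty": [1, 2] * 10,
})

anomaly_sketch = zscore_outlier_counts(sample)
print(anomaly_sketch["outliers"])
```

A z-score rule is sensitive to the outliers themselves, since extreme values inflate the standard deviation. An IQR-based rule is a common alternative when that is a concern.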

## 9. View Insights

In [None]:
print("Generated Insights:")
print("="*70)
print(results['insights'])

## 10. View Recommendations

In [None]:
recommendations = results.get('recommendations', [])

print("Actionable Recommendations:")
print("="*70)

if recommendations:
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec}")
else:
    print("No specific recommendations at this time.")

## 11. Generated Report Files

In [None]:
report_paths = results.get('report_paths')
if report_paths:
    print("Generated Reports:")
    print("="*70)
    for report_type, path in report_paths.items():
        print(f"  - {report_type.upper()}: {path}")
else:
    print("No reports were generated.")

## 12. Quick Summary Method

In [None]:
# Use the quick_summary method for a condensed view
summary = orchestrator.quick_summary(data_path)
print(summary)

## 13. Memory State Check

In [None]:
# Check what's stored in memory
memory_state = orchestrator.get_memory_state()

print("Memory State:")
print("="*50)
print(f"Stored keys: {memory_state['keys']}")
print(f"\nStored schema: {memory_state['schema']}")
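
One use for a stored schema is spotting changes between runs. The sketch below compares two schemas, assuming each is a plain dictionary that maps column names to dtype strings, like `results['schema']` above. The helper `diff_schemas` is hypothetical and not part of adqia, and the actual structure of `memory_state['schema']` may differ.

```python
def diff_schemas(previous: dict, current: dict) -> dict:
    """Report columns added, removed, or retyped between two schema dicts.

    Hypothetical helper for illustration only.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # Columns present in both runs whose dtype string changed
    changed = {
        col: (previous[col], current[col])
        for col in previous
        if col in current and previous[col] != current[col]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Made-up schemas standing in for two successive runs
old_schema = {"order_id": "int64", "amount": "float64", "region": "object"}
new_schema = {"order_id": "int64", "amount": "object", "channel": "object"}

schema_diff = diff_schemas(old_schema, new_schema)
print(schema_diff)
```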

## 14. Visualization (Optional)

In [None]:
import matplotlib.pyplot as plt

# Plot the distribution of up to two numeric columns that have outliers
# (relies on `df` and `anomaly` from the earlier cells)
if anomaly.get('outliers'):
    for col in list(anomaly['outliers'].keys())[:2]:  # First 2 columns with outliers
        plt.figure(figsize=(10, 4))
        df[col].hist(bins=20, edgecolor='black')
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.grid(axis='y', alpha=0.5)
        plt.show()

## Conclusion

This notebook demonstrated the complete workflow of adqia:

1. ✅ Data ingestion and schema inference
2. ✅ Quality checks (missing values, duplicates)
3. ✅ Anomaly detection (outliers)
4. ✅ Insight generation
5. ✅ Report creation

You can now use this workflow with your own CSV files!