# DataWash Jupyter Demo

This notebook demonstrates DataWash's features in a Jupyter environment:
- Rich HTML rendering of reports
- Interactive exploration of suggestions
- Step-by-step cleaning with selective application
- Before/after quality score visualization

## Installation

```bash
pip install datawash
```

## 1. Setup and Import

In [None]:
import pandas as pd
import numpy as np
from datawash import analyze

# For visualization
import matplotlib.pyplot as plt

print("DataWash loaded successfully!")

## 2. Create Sample Messy Data

Let's create a dataset with various quality issues that DataWash can detect and fix.

In [None]:
# Create messy sample data with multiple issues
df = pd.DataFrame({
    "customer_name": ["John Smith", "JANE DOE", "bob wilson", "  Alice Brown  ", "Charlie Davis",
                      "Diana Miller", "EDWARD JONES", "fiona garcia", "George Lee", "Hannah White"],
    "email": ["john@email.com", "", "bob@email.com", "alice@email.com", None,
              "diana@email.com", "  edward@email.com  ", "", "george@email.com", "hannah@email.com"],
    "age": ["28", "34", "45", "29", "38", "42", "31", "27", "35", "40"],  # Stored as strings!
    "purchase_amount": [150.00, 230.50, 89.99, 1250.00, 175.25, 95.00, 310.00, 88.50, 450.00, 125.75],
    "is_premium": ["yes", "Yes", "YES", "no", "No", "NO", "true", "True", "false", "False"],
    "signup_date": ["2023-01-15", "15/02/2023", "March 10, 2023", "2023-04-20", "2023/05/25",
                    "25-Jun-2023", "July 4, 2023", "2023-08-15", "15/09/2023", "October 1, 2023"],
})

# Add a duplicate row
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

print(f"Created dataset with {len(df)} rows and {len(df.columns)} columns")
df
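
Before handing the data to DataWash, it can help to confirm a few of these problems by hand. The sketch below uses plain pandas on a smaller, made-up frame (`sample` and the counter names are illustrative only). It is not how DataWash detects issues, just a quick manual check of the same problem types.

```python
import pandas as pd

# A small made-up frame with the same kinds of problems as the demo dataset
sample = pd.DataFrame({
    "customer_name": ["John Smith", "  Alice Brown  ", "JANE DOE", "John Smith"],
    "email": ["john@email.com", "", None, "john@email.com"],
    "age": ["28", "29", "34", "28"],
})

# Exact duplicate rows (only the later copies are flagged)
n_duplicates = int(sample.duplicated().sum())

# Missing emails: count real nulls and empty/blank strings together
email = sample["email"]
n_missing_email = int((email.isna() | (email.fillna("").str.strip() == "")).sum())

# Names with leading or trailing whitespace
names = sample["customer_name"]
n_padded_names = int((names != names.str.strip()).sum())

# Numbers stored as text: True when every value parses as a number
age_is_numeric_text = bool(pd.to_numeric(sample["age"], errors="coerce").notna().all())

print(n_duplicates, n_missing_email, n_padded_names, age_is_numeric_text)
```

The same checks can be run against `df` from the cell above by swapping `sample` for `df`.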

## 3. Analyze the Data

Use DataWash to analyze the dataset. In Jupyter, the report renders as a rich HTML table!

In [None]:
# Analyze the data
report = analyze(df)

# In Jupyter, this automatically renders as HTML!
report
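
Jupyter shows rich output for any object that defines a `_repr_html_` method and is the last expression in a cell. The sketch below illustrates that mechanism with a hypothetical `MiniReport` class. It is a stand-in for illustration, not DataWash's actual report class, whose implementation may differ.

```python
from html import escape


class MiniReport:
    """Hypothetical stand-in for a report object; not part of DataWash."""

    def __init__(self, quality_score, issues):
        self.quality_score = quality_score
        self.issues = issues

    def __repr__(self):
        # Plain-text fallback used by consoles and print()
        return f"MiniReport(score={self.quality_score}, issues={len(self.issues)})"

    def _repr_html_(self):
        # Jupyter calls this method, if present, to render the object as HTML
        rows = "".join(f"<tr><td>{escape(issue)}</td></tr>" for issue in self.issues)
        return (
            f"<p><b>Quality score:</b> {self.quality_score}/100</p>"
            f"<table><tr><th>Issue</th></tr>{rows}</table>"
        )


mini = MiniReport(72, ["1 duplicate row", "age stored as <str>"])
html = mini._repr_html_()
print(repr(mini))
```

Ending a notebook cell with `mini` would display the HTML table, while `print(mini)` would still show the plain-text form.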

## 4. Explore the Quality Score

In [None]:
print(f"Data Quality Score: {report.quality_score}/100")
print(f"\nIssues Found: {len(report.issues)}")
print(f"Suggestions: {len(report.suggestions)}")

## 5. View Detailed Summary

In [None]:
print(report.summary())

## 6. Explore Issues in Detail

In [None]:
# Create a DataFrame of issues for easy exploration
issues_df = pd.DataFrame([
    {
        "Type": issue.issue_type,
        "Severity": issue.severity.value,
        "Columns": ", ".join(issue.columns) if issue.columns else "all",
        "Message": issue.message[:50] + "..." if len(issue.message) > 50 else issue.message,
        "Confidence": f"{issue.confidence:.0%}"
    }
    for issue in report.issues
])

issues_df

## 7. Explore Suggestions

In [None]:
# Create a DataFrame of suggestions
suggestions_df = pd.DataFrame([
    {
        "ID": s.id,
        "Priority": s.priority.value,
        "Action": s.action,
        "Transformer": s.transformer,
        "Impact": s.impact
    }
    for s in report.suggestions
])

suggestions_df

## 8. Apply Specific Suggestions

Instead of applying all suggestions, let's selectively apply some using their IDs.

In [None]:
# Let's apply only the first 3 suggestions
ids_to_apply = [s.id for s in report.suggestions[:3]]
print(f"Applying suggestions: {ids_to_apply}")

for s in report.suggestions[:3]:
    print(f"  [{s.id}] {s.action}")

In [None]:
# Apply selected suggestions
partial_clean_df = report.apply(ids_to_apply)

print(f"\nApplied {len(ids_to_apply)} transformations")
partial_clean_df
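
Slicing the first three suggestions is only one way to build the ID list. The sketch below picks IDs by transformer instead. It uses a hypothetical `FakeSuggestion` dataclass that mirrors the `id`, `action`, and `transformer` attributes used above; the transformer names and actions are made up for illustration, so check `suggestions_df` for the real values.

```python
from dataclasses import dataclass


@dataclass
class FakeSuggestion:
    """Hypothetical stand-in mirroring the attributes used in this notebook."""
    id: int
    action: str
    transformer: str


# Made-up suggestions; real ones come from report.suggestions
suggestions = [
    FakeSuggestion(1, "Remove 1 duplicate row", "duplicates"),
    FakeSuggestion(2, "Convert 'age' to numeric", "types"),
    FakeSuggestion(3, "Trim whitespace in 'customer_name'", "formats"),
    FakeSuggestion(4, "Standardize 'signup_date'", "formats"),
]

# Keep only suggestions produced by the transformers we care about
wanted_transformers = {"duplicates", "formats"}
selected_ids = [s.id for s in suggestions if s.transformer in wanted_transformers]

print(selected_ids)
```

With a real report, the same list comprehension over `report.suggestions` would yield a list that can be passed to `report.apply(...)`.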

## 9. Apply All Suggestions

In [None]:
# Now apply ALL suggestions to the original data for comparison.
# Re-analyze first to get a fresh report.
report = analyze(df)
clean_df = report.apply_all()

print(f"Applied all {len(report.suggestions)} transformations")
clean_df

## 10. Before/After Quality Score Visualization

In [None]:
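# Get quality scores
# Note: the leading underscore marks these as private attributes,
# so they may change between DataWash versions.
score_before = report._last_score_before
score_after = report._last_score_after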

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart
ax1 = axes[0]
bars = ax1.bar(['Before', 'After'], [score_before, score_after], 
               color=['#ff6b6b', '#51cf66'], edgecolor='black')
ax1.set_ylim(0, 100)
ax1.set_ylabel('Quality Score')
ax1.set_title('Data Quality Score: Before vs After')
ax1.axhline(y=80, color='green', linestyle='--', alpha=0.5, label='Good threshold')
ax1.axhline(y=60, color='orange', linestyle='--', alpha=0.5, label='Warning threshold')

# Add value labels on bars
for bar, score in zip(bars, [score_before, score_after]):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2, 
             f'{score}', ha='center', va='bottom', fontsize=14, fontweight='bold')

# Improvement gauge (pie slices must be non-negative, so clamp the gain to 0-100)
ax2 = axes[1]
improvement = score_after - score_before
gain = min(max(improvement, 0), 100)
ax2.pie([gain, 100 - gain],
        labels=[f'{improvement:+} points', ''],
        colors=['#51cf66', '#e9ecef'],
        startangle=90,
        explode=[0.05, 0])
ax2.set_title(f'Quality Improvement: {improvement:+} points')

plt.tight_layout()
plt.show()

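print("\nImprovement Summary:")
print(f"  Before: {score_before}/100")
print(f"  After:  {score_after}/100")
# Skip the relative change when the starting score is 0 to avoid dividing by zero
relative = f" ({improvement / score_before * 100:+.1f}% change)" if score_before else ""
print(f"  Change: {improvement:+} points{relative}")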
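
The two dashed lines in the chart mark a "good" threshold at 80 and a "warning" threshold at 60. A minimal sketch of reading a score against those lines follows. The `score_band` helper, the inclusive comparisons, and the "poor" label for scores below 60 are assumptions made for this example, not DataWash definitions.

```python
GOOD_THRESHOLD = 80     # green dashed line in the chart
WARNING_THRESHOLD = 60  # orange dashed line in the chart


def score_band(score):
    """Classify a 0-100 quality score against the chart's two thresholds."""
    if score >= GOOD_THRESHOLD:
        return "good"
    if score >= WARNING_THRESHOLD:
        return "warning"
    return "poor"  # assumed label for scores under the warning line


for example_score in (85, 72, 41):  # made-up scores
    print(example_score, score_band(example_score))
```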

## 11. Compare Original vs Cleaned Data

In [None]:
print("ORIGINAL DATA:")
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
print(f"Data types: {dict(df.dtypes)}")
print()

print("CLEANED DATA:")
print(f"Rows: {len(clean_df)}, Columns: {len(clean_df.columns)}")
print(f"Data types: {dict(clean_df.dtypes)}")
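
Printing two dtype dictionaries works, but a small table is easier to scan. The sketch below builds one with plain pandas. The `before` and `after` frames are made-up stand-ins for `df` and `clean_df`, and the conversions are done by hand rather than by DataWash.

```python
import pandas as pd

# Made-up stand-ins for df and clean_df
before = pd.DataFrame({
    "age": ["28", "34"],
    "signup_date": ["2023-01-15", "2023-04-20"],
})
after = before.assign(
    age=pd.to_numeric(before["age"]),
    signup_date=pd.to_datetime(before["signup_date"], format="%Y-%m-%d"),
)

# One row per column, with the dtype names side by side
dtype_changes = pd.DataFrame({
    "before": before.dtypes.astype(str),
    "after": after.dtypes.astype(str),
})
dtype_changes["changed"] = dtype_changes["before"] != dtype_changes["after"]

print(dtype_changes)
```

Replacing `before` and `after` with `df` and `clean_df` gives the same table for the demo data, assuming both frames share the same columns.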

In [None]:
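# Side-by-side comparison of specific columns
# Explicit import: `display` is only predefined inside IPython/Jupyter sessions
from IPython.display import display
comparison_cols = ['customer_name', 'is_premium', 'signup_date']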

print("Before cleaning:")
display(df[comparison_cols].head())

print("\nAfter cleaning:")
display(clean_df[comparison_cols].head())

## 12. Generate Reproducible Code

In [None]:
# Generate Python code that reproduces the cleaning
code = report.generate_code(style="function")
print(code)
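
To give a sense of what function-style cleaning code can look like, here is a hand-written pandas sketch. It is not output produced by `generate_code`. The function name `clean_customers`, the title-casing of names, and the yes/no mapping are all assumptions chosen for this example.

```python
import pandas as pd


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Hand-written illustration of a cleaning function; not DataWash output."""
    out = df.copy()
    # Trim whitespace and normalize capitalization
    out["customer_name"] = out["customer_name"].str.strip().str.title()
    # Convert numeric text to numbers; unparseable values become NaN
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Map the mixed yes/no/true/false spellings to booleans
    truthy = {"yes": True, "true": True, "no": False, "false": False}
    out["is_premium"] = out["is_premium"].str.strip().str.lower().map(truthy)
    # Drop exact duplicates last, after values have been normalized
    return out.drop_duplicates(ignore_index=True)


raw = pd.DataFrame({
    "customer_name": ["John Smith", "JANE DOE", "  Alice Brown  ", "John Smith"],
    "age": ["28", "34", "29", "28"],
    "is_premium": ["yes", "Yes", "NO", "yes"],
})
cleaned = clean_customers(raw)
print(cleaned)
```

Keeping the steps in one function makes the cleaning easy to rerun on new data with the same columns.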

## 13. Use Case Comparison: General vs ML

In [None]:
# Compare suggestions for different use cases
report_general = analyze(df, use_case="general")
report_ml = analyze(df, use_case="ml")

print("GENERAL use case priorities:")
for s in report_general.suggestions[:5]:
    print(f"  [{s.priority.value:6}] {s.action}")

print("\nML use case priorities:")
for s in report_ml.suggestions[:5]:
    print(f"  [{s.priority.value:6}] {s.action}")

## 14. Export Cleaned Data

In [None]:
# Save cleaned data to CSV
clean_df.to_csv("cleaned_data.csv", index=False)
print("Saved cleaned data to 'cleaned_data.csv'")

# You can also save to other formats
# clean_df.to_parquet("cleaned_data.parquet")  # Requires pyarrow
# clean_df.to_excel("cleaned_data.xlsx", index=False)  # Requires openpyxl
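
One thing to keep in mind when exporting to CSV is that the format stores only text, so column types such as dates have to be restored on load. The sketch below shows this with a made-up two-row frame and an in-memory buffer instead of a file. The frame contents are illustrative, not the demo's cleaned output.

```python
import io

import pandas as pd

# Made-up cleaned frame with a boolean and a datetime column
cleaned = pd.DataFrame({
    "is_premium": [True, False],
    "signup_date": pd.to_datetime(["2023-01-15", "2023-04-20"]),
})

# Write to CSV text in memory (to_csv returns a string when no path is given)
csv_text = cleaned.to_csv(index=False)

# A plain read leaves the dates as text
naive = pd.read_csv(io.StringIO(csv_text))

# parse_dates restores the datetime column on load
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=["signup_date"])

print(naive.dtypes)
print(restored.dtypes)
```

Formats such as Parquet keep column types in the file itself, which avoids this step.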

## Summary

In this notebook, we demonstrated:

1. **Rich HTML rendering** - Reports render as HTML tables directly in Jupyter
2. **Interactive exploration** - Easily explore issues and suggestions as DataFrames
3. **Selective application** - Apply specific fixes using `report.apply([1, 2, 3])`
4. **Quality visualization** - Track improvement with before/after scores
5. **Code generation** - Get reproducible Python code
6. **Use case comparison** - Different priorities for ML vs general cleaning

### Next Steps

- Try with your own data: `report = analyze("your_data.csv")`
- Experiment with different use cases: `analyze(df, use_case="ml")`
- Use interactive mode: `report.apply_interactive()`
- Check the CLI: `datawash analyze your_data.csv`