# Exploring dataprof v0.4.5 for Data Engineering & ML Workflows

This notebook is a hands-on introduction to the [`dataprof`](https://github.com/AndreaBozzo/dataprof) library, v0.4.5.  
It demonstrates **data profiling**, **quality checks**, and **ML readiness assessment**, and shows how dataprof can serve as a helper in **ML pipelines**.

## 🆕 New in v0.4.5:
- **ML Readiness Assessment System** with comprehensive scoring
- **Enhanced Pandas Integration** with DataFrame outputs  
- **Context Managers** for resource management
- **Security enhancements** and comprehensive fixes
- **Python Logging Integration** with configurable levels

In [None]:
# Install dependencies if needed
# %pip install dataprof pandas scikit-learn

In [1]:
import pandas as pd
import dataprof as dp
import os

print(f"Dataprof version: {getattr(dp, '__version__', 'Unknown')}")
print("Available functions:", [f for f in dir(dp) if not f.startswith('_') and callable(getattr(dp, f))])

# 🆕 NEW in v0.4.5: Configure logging
print("\n=== Configuring Python Logging (New in v0.4.5) ===")
try:
    dp.configure_logging(level="INFO")
    print("✅ Python logging configured successfully!")
except AttributeError:
    print("ℹ️  Logging configuration not available (requires v0.4.5+)")

print("Setup completed successfully!")

Dataprof version: 0.4.5

=== Configuring Python Logging (New in v0.4.5) ===
✅ Python logging configured successfully!
Setup completed successfully!
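
The `configure_logging(level="INFO")` call above takes a level name. As a rough sketch of what such a setting usually corresponds to in Python's standard `logging` module, the block below configures a logger by hand. The logger name `"dataprof"` and the helper `configure_logger` are assumptions made for illustration; the notebook does not state which logger name the library writes to.

```python
import io
import logging

# Hypothetical logger name; the notebook does not say which name dataprof logs under.
LOGGER_NAME = "dataprof"

def configure_logger(level: str = "INFO", stream=None) -> logging.Logger:
    """Attach a single stream handler to the named logger at the given level."""
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()  # avoid duplicate handlers when a cell is re-run
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # do not also emit through the root logger
    return logger

buffer = io.StringIO()
log = configure_logger("INFO", stream=buffer)
log.debug("hidden at INFO level")   # filtered out
log.info("profiling started")       # written to the buffer
captured = buffer.getvalue()
print(captured, end="")
```

Clearing the handlers before adding a new one matters in notebooks, where re-running a cell would otherwise print every message several times.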


In [2]:
# Create a sample dataset with some data quality issues
data = {
    "age": [25, 32, 40, None, 18, 22, 45, 33],
    "income": [30000, 50000, 70000, 45000, None, 22000, 80000, 55000],
    "gender": ["M", "F", "M", "F", "F", None, "M", "F"],
    "city": ["NYC", "LA", "Chicago", "NYC", "Boston", "LA", "Chicago", "NYC"]
}

df = pd.DataFrame(data)
print("Sample dataset created:")
print(df)
print(f"\nDataset shape: {df.shape}")
print(f"Missing values per column:\n{df.isnull().sum()}")

Sample dataset created:
    age   income gender     city
0  25.0  30000.0      M      NYC
1  32.0  50000.0      F       LA
2  40.0  70000.0      M  Chicago
3   NaN  45000.0      F      NYC
4  18.0      NaN      F   Boston
5  22.0  22000.0   None       LA
6  45.0  80000.0      M  Chicago
7  33.0  55000.0      F      NYC

Dataset shape: (8, 4)
Missing values per column:
age       1
income    1
gender    1
city      0
dtype: int64


In [3]:
# 🆕 NEW in v0.4.5: Enhanced Pandas Integration with DataFrame outputs
print("=== Enhanced Pandas Integration (New in v0.4.5) ===")

# Create a more comprehensive dataset for ML analysis
ml_data = {
    "age": [25, 32, 40, None, 18, 22, 45, 33, 28, 35],
    "income": [30000, 50000, 70000, 45000, None, 22000, 80000, 55000, 42000, 65000],
    "experience_years": [2, 8, 15, None, 0, 1, 20, 10, 5, 12],
    "gender": ["M", "F", "M", "F", "F", None, "M", "F", "M", "F"],
    "city": ["NYC", "LA", "Chicago", "NYC", "Boston", "LA", "Chicago", "NYC", "Boston", "LA"],
    "target": [0, 1, 1, 0, 0, 0, 1, 1, 0, 1]  # Binary target for ML
}

df = pd.DataFrame(ml_data)
print("Enhanced ML dataset created:")
print(df)
print(f"\nDataset shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts()}")

# Save to CSV for analysis
csv_file = "ml_sample_data.csv"
df.to_csv(csv_file, index=False)
print(f"Dataset saved to {csv_file}")

try:
    # 🆕 NEW: Enhanced pandas integration with DataFrame output
    print("\n=== Pandas DataFrame Integration (New in v0.4.5) ===")
    profiles_df = dp.analyze_csv_dataframe(csv_file)
    print(f"Profiles DataFrame shape: {profiles_df.shape}")
    print("Profiles DataFrame columns:", profiles_df.columns.tolist())
    print("\nFirst few rows of profiles:")
    print(profiles_df.head())
    
except AttributeError:
    print("ℹ️  Enhanced pandas integration not available (requires v0.4.5+)")
    print("Using standard analysis instead...")
    analysis_result = dp.analyze_csv_file(csv_file)
    print(f"Standard analysis completed: {len(analysis_result)} column profiles generated")

# Clean up
if os.path.exists(csv_file):
    os.remove(csv_file)
    print(f"\nCleaned up {csv_file}")

=== Enhanced Pandas Integration (New in v0.4.5) ===
Enhanced ML dataset created:
    age   income  experience_years gender     city  target
0  25.0  30000.0               2.0      M      NYC       0
1  32.0  50000.0               8.0      F       LA       1
2  40.0  70000.0              15.0      M  Chicago       1
3   NaN  45000.0               NaN      F      NYC       0
4  18.0      NaN               0.0      F   Boston       0
5  22.0  22000.0               1.0   None       LA       0
6  45.0  80000.0              20.0      M  Chicago       1
7  33.0  55000.0              10.0      F      NYC       1
8  28.0  42000.0               5.0      M   Boston       0
9  35.0  65000.0              12.0      F       LA       1

Dataset shape: (10, 6)
Target distribution:
target
0    5
1    5
Name: count, dtype: int64
Dataset saved to ml_sample_data.csv

=== Pandas DataFrame Integration (New in v0.4.5) ===
Profiles DataFrame shape: (6, 7)
Profiles DataFrame columns: ['uniqueness_ratio', 'colum
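
The cell above deletes its CSV only if execution reaches the cleanup lines, so an exception other than `AttributeError` would leave the file behind. A minimal sketch of a more robust pattern, using only the standard library, is shown below. `temporary_csv` is a hypothetical helper, not part of dataprof; with a pandas DataFrame, `df.to_csv(path, index=False)` could replace the `csv.writer` calls.

```python
import csv
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_csv(rows, header):
    """Write rows to a temporary CSV file, yield its path, and always delete it."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        yield path
    finally:
        # Runs even if the analysis inside the `with` block raises.
        if os.path.exists(path):
            os.remove(path)

with temporary_csv([[25, "NYC"], [32, "LA"]], header=["age", "city"]) as csv_path:
    existed_inside = os.path.exists(csv_path)
    with open(csv_path, newline="") as handle:
        line_count = sum(1 for _ in handle)
    # A call such as dp.analyze_csv_file(csv_path) would go here.

exists_after = os.path.exists(csv_path)
print(existed_inside, line_count, exists_after)
```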

In [4]:
# 🆕 NEW in v0.4.5: ML Readiness Assessment System
print("=== ML Readiness Assessment (Major New Feature in v0.4.5) ===")

# Create a comprehensive ML dataset for testing
ml_dataset = {
    "feature_1": [1.2, 2.5, 3.8, 4.1, 5.3, 6.7, 7.9, 8.2, 9.1, 10.5],
    "feature_2": [100, 200, 300, 400, None, 600, 700, 800, 900, 1000],
    "category": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "B"],
    "timestamp": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
                 "2023-01-06", "2023-01-07", "2023-01-08", "2023-01-09", "2023-01-10"],
    "target_variable": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
}

ml_df = pd.DataFrame(ml_dataset)
print("ML dataset for readiness assessment:")
print(ml_df)

# Save for analysis
ml_csv = "ml_readiness_test.csv"
ml_df.to_csv(ml_csv, index=False)

try:
    # 🆕 NEW: ML Readiness Assessment
    print("\n=== ML Readiness Score (New in v0.4.5) ===")
    ml_score = dp.ml_readiness_score(ml_csv)
    print(f"ML Ready: {ml_score.is_ml_ready()} (Score: {ml_score.overall_score:.1f}%)")
    print(f"Feature analysis completed for {len(ml_score.features)} features")
    
    # Display feature analysis
    print("\n=== Feature Analysis ===")
    for i, feature in enumerate(ml_score.features):
        print(f"Feature {i+1}: {feature.name}")
        print(f"  - Type: {feature.feature_type}")
        print(f"  - ML Ready: {feature.is_ml_ready}")
        if hasattr(feature, 'recommendations'):
            print(f"  - Recommendations: {len(feature.recommendations)} items")
    
except AttributeError:
    print("ℹ️  ML readiness assessment not available (requires v0.4.5+)")

try:
    # 🆕 NEW: Feature analysis DataFrame
    print("\n=== Feature Analysis DataFrame (New in v0.4.5) ===")
    features_df = dp.feature_analysis_dataframe(ml_csv)
    print(f"Features DataFrame shape: {features_df.shape}")
    print("Features DataFrame:")
    print(features_df)
    
except AttributeError:
    print("ℹ️  Feature analysis DataFrame not available (requires v0.4.5+)")

# Clean up
if os.path.exists(ml_csv):
    os.remove(ml_csv)
    print(f"\nCleaned up {ml_csv}")

=== ML Readiness Assessment (Major New Feature in v0.4.5) ===
ML dataset for readiness assessment:
   feature_1  feature_2 category   timestamp  target_variable
0        1.2      100.0        A  2023-01-01                0
1        2.5      200.0        B  2023-01-02                1
2        3.8      300.0        A  2023-01-03                0
3        4.1      400.0        C  2023-01-04                1
4        5.3        NaN        B  2023-01-05                1
5        6.7      600.0        A  2023-01-06                0
6        7.9      700.0        C  2023-01-07                1
7        8.2      800.0        B  2023-01-08                0
8        9.1      900.0        A  2023-01-09                1
9       10.5     1000.0        B  2023-01-10                0

=== ML Readiness Score (New in v0.4.5) ===
ML Ready: True (Score: 95.9%)
ℹ️  ML readiness assessment not available (requires v0.4.5+)

=== Feature Analysis DataFrame (New in v0.4.5) ===
Features DataFrame shape: (5, 6)
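
In the recorded output, the `ML Ready: True` line is followed directly by the "not available" message. One plausible reading, not confirmed by the source, is that `dp.ml_readiness_score` exists and ran, and that a later attribute access inside the same `try` block raised the `AttributeError`. The sketch below separates "the function is missing" from "the result lacks an attribute". It uses stand-in objects only and does not reproduce dataprof's real API; `get_optional` is a hypothetical helper.

```python
from types import SimpleNamespace

def get_optional(obj, name):
    """Return (True, value) if obj has the attribute, else (False, None)."""
    sentinel = object()
    value = getattr(obj, name, sentinel)
    found = value is not sentinel
    return found, (value if found else None)

# Stand-ins for illustration only.
fake_score = SimpleNamespace(overall_score=95.9)  # deliberately has no `features`
fake_module = SimpleNamespace(ml_readiness_score=lambda path: fake_score)

messages = []
has_func, func = get_optional(fake_module, "ml_readiness_score")
if not has_func:
    messages.append("function missing: upgrade the library")
else:
    score = func("ml_readiness_test.csv")
    has_features, features = get_optional(score, "features")
    if has_features:
        messages.append(f"{len(features)} features analysed")
    else:
        messages.append("score computed, but it exposes no `features` attribute")

print(messages)
```

Checking each attribute separately keeps the version hint from appearing when the function is in fact available.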

In [5]:
# Save the current df (the 10-row ML dataset from In [3]) to CSV for dataprof analysis
csv_file = "sample_data.csv"
df.to_csv(csv_file, index=False)
print(f"Dataset saved to {csv_file}")

# Basic analysis with dataprof
try:
    print("\n=== Basic Analysis ===")
    analysis_result = dp.analyze_csv_file(csv_file)
    print(f"Analysis completed: {len(analysis_result)} column profiles generated")
    
    # Inspect the column profiles
    print("\n=== Column Profile Details ===")
    for i, profile in enumerate(analysis_result):
        print(f"\nColumn {i+1}:")
        # Check available attributes of the profile object
        attrs = [attr for attr in dir(profile) if not attr.startswith('_')]
        for attr in attrs[:10]:  # Show first 10 attributes
            try:
                value = getattr(profile, attr)
                if not callable(value):
                    print(f"  {attr}: {value}")
            except Exception:
                pass
    
except Exception as e:
    print(f"Error during analysis: {e}")
    import traceback
    traceback.print_exc()

# Clean up
if os.path.exists(csv_file):
    os.remove(csv_file)
    print(f"\nCleaned up {csv_file}")

Dataset saved to sample_data.csv

=== Basic Analysis ===
Analysis completed: 6 column profiles generated

=== Column Profile Details ===

Column 1:
  data_type: string
  name: gender
  null_count: 1
  null_percentage: 10.0
  total_count: 10
  unique_count: 3
  uniqueness_ratio: 0.3

Column 2:
  data_type: integer
  name: target
  null_count: 0
  null_percentage: 0.0
  total_count: 10
  unique_count: 2
  uniqueness_ratio: 0.2

Column 3:
  data_type: float
  name: age
  null_count: 1
  null_percentage: 10.0
  total_count: 10
  unique_count: 10
  uniqueness_ratio: 1.0

Column 4:
  data_type: string
  name: city
  null_count: 0
  null_percentage: 0.0
  total_count: 10
  unique_count: 4
  uniqueness_ratio: 0.4

Column 5:
  data_type: float
  name: income
  null_count: 1
  null_percentage: 10.0
  total_count: 10
  unique_count: 10
  uniqueness_ratio: 1.0

Column 6:
  data_type: float
  name: experience_years
  null_count: 1
  null_percentage: 10.0
  total_count: 10
  unique_count: 10
  uniqu
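
The profile fields printed above can be cross-checked with plain pandas. The sketch below recomputes them for a subset of the columns. It assumes that a missing value counts as one distinct value, which is an inference from the recorded output (`age` has nine distinct numbers plus one null and is reported with `unique_count: 10`), not something confirmed from dataprof's documentation.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 40, None, 18, 22, 45, 33, 28, 35],
    "gender": ["M", "F", "M", "F", "F", None, "M", "F", "M", "F"],
    "city": ["NYC", "LA", "Chicago", "NYC", "Boston", "LA", "Chicago", "NYC", "Boston", "LA"],
    "target": [0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})

summary = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_percentage": df.isnull().mean() * 100,
    "total_count": len(df),
    # dropna=False counts a missing value as its own distinct value (assumption above).
    "unique_count": df.nunique(dropna=False),
})
summary["uniqueness_ratio"] = summary["unique_count"] / summary["total_count"]
print(summary)
```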

In [6]:
# Quality analysis
csv_file = "sample_data_quality.csv"
df.to_csv(csv_file, index=False)

try:
    print("=== Quality Analysis ===")
    quality_report = dp.analyze_csv_with_quality(csv_file)
    
    print(f"Quality score: {quality_report.quality_score()}")
    print(f"Total rows: {quality_report.total_rows}")
    print(f"Total columns: {quality_report.total_columns}")
    print(f"Scan time: {quality_report.scan_time_ms} ms")
    print(f"Issues found: {len(quality_report.issues)}")
    
    if quality_report.issues:
        print("\nQuality issues detected:")
        for i, issue in enumerate(quality_report.issues, 1):
            print(f"{i}. {issue.description} (Column: {issue.column}, Severity: {issue.severity})")
    else:
        print("No quality issues detected!")
        
except Exception as e:
    print(f"Error during quality analysis: {e}")
    import traceback
    traceback.print_exc()

# Clean up
if os.path.exists(csv_file):
    os.remove(csv_file)
    print(f"\nCleaned up {csv_file}")

=== Quality Analysis ===
Quality score: 45.0
Total rows: 10
Total columns: 6
Scan time: 94 ms
Issues found: 7

Quality issues detected:
1. 6 duplicate values in column 'city' (Column: city, Severity: low)
2. 8 duplicate values in column 'target' (Column: target, Severity: low)
3. 1 null values (10%) in column 'age' (Column: age, Severity: medium)
4. 1 null values (10%) in column 'income' (Column: income, Severity: medium)
5. 1 null values (10%) in column 'experience_years' (Column: experience_years, Severity: medium)
6. 1 null values (10%) in column 'gender' (Column: gender, Severity: medium)
7. 7 duplicate values in column 'gender' (Column: gender, Severity: low)

Cleaned up sample_data_quality.csv
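
Most of the issues listed above are low-severity duplicate warnings on categorical columns, which are expected for such data. A small sketch of grouping issues by severity follows, so that the medium-severity null warnings stand out. The `Issue` dataclass is a stand-in that mirrors the three attributes printed by the cell (`column`, `severity`, `description`); it assumes severity compares as the lowercase strings shown in the output, and the entries are copied from a subset of that output.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Issue:
    """Stand-in for a quality issue; not dataprof's real class."""
    column: str
    severity: str
    description: str

issues = [
    Issue("city", "low", "6 duplicate values in column 'city'"),
    Issue("target", "low", "8 duplicate values in column 'target'"),
    Issue("age", "medium", "1 null values (10%) in column 'age'"),
    Issue("gender", "medium", "1 null values (10%) in column 'gender'"),
    Issue("gender", "low", "7 duplicate values in column 'gender'"),
]

# Collect the affected columns under each severity label.
by_severity = defaultdict(list)
for issue in issues:
    by_severity[issue.severity].append(issue.column)

# Columns with anything above "low" are the ones worth fixing first.
actionable = sorted(set(by_severity["medium"]) | set(by_severity.get("high", [])))
print(dict(by_severity))
print("Columns to fix first:", actionable)
```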


## Summary - dataprof v0.4.5 Capabilities

This notebook demonstrated key dataprof capabilities including major new features in v0.4.5:

### 🆕 New Features in v0.4.5:
- **`ml_readiness_score()`** - Complete ML readiness assessment with feature analysis
- **`analyze_csv_dataframe()`** - Enhanced pandas integration with DataFrame outputs
- **`feature_analysis_dataframe()`** - ML feature analysis in DataFrame format
- **`configure_logging()`** - Python logging integration with configurable levels
- **Context Managers** - `PyBatchAnalyzer`, `PyMlAnalyzer`, `PyCsvProcessor` for resource management
- **Enhanced Security** - Comprehensive SQL injection protection and input validation

### Core Functions (v0.4.1-0.4.5):
- `analyze_csv_file()` - Basic CSV column profiling
- `analyze_csv_with_quality()` - Quality assessment with scoring
- `batch_analyze_glob()` - Batch processing of multiple files
- `analyze_json_file()` - JSON file analysis

### Key Features:
- **Fast analysis** of CSV data quality
- **ML readiness assessment** with comprehensive scoring and recommendations
- **Missing value detection** and quantification
- **Quality scoring** for datasets with severity-based issues
- **Batch processing** for multiple files with progress tracking
- **ML workflow integration** for preprocessing validation
- **Enhanced pandas integration** with DataFrame outputs
- **Context managers** for proper resource management
- **Security hardening** with input validation and error sanitization

### ML/AI Enhancements:
- **Feature type detection** (numeric_ready, categorical_needs_encoding, temporal_needs_engineering)
- **Blocking issues detection** (missing targets, all-null features, data leakage)
- **ML preprocessing recommendations** with priority levels
- **Scikit-learn integration** examples and pipeline building
- **Jupyter notebook support** with rich HTML displays

### Next Steps:
- Explore database ML readiness with `profile_database_with_ml()`
- Try directory-wide analysis with enhanced batch processing
- Implement automated ML preprocessing pipelines
- Generate comprehensive quality and ML readiness reports
- Leverage context managers for production data workflows
- Use enhanced security features for production deployments

### Version Upgrade Benefits:
- **Comprehensive ML readiness** assessment for data science workflows
- **Enhanced pandas integration** for data analysis pipelines
- **Resource management** with context managers
- **Security improvements** for production use
- **Performance optimizations** and reliability enhancements


In [8]:
# Batch analysis example
print("=== Batch Analysis Example ===")

# Create multiple test files
test_files = []
for i in range(3):
    # Create slightly different datasets
    sample_data = {
        "value1": [i*10 + j for j in range(5)],
        "value2": [j*2 + i for j in range(5)],
        "category": [f"Type_{j%2}" for j in range(5)]
    }
    
    filename = f"batch_test_{i}.csv"
    pd.DataFrame(sample_data).to_csv(filename, index=False)
    test_files.append(filename)
    print(f"Created {filename}")

try:
    # Use batch analysis
    print("\nRunning batch analysis...")
    batch_result = dp.batch_analyze_glob("batch_test_*.csv")
    
    print(f"Batch analysis completed, result type: {type(batch_result)}")
    
    # Show results
    print(f"\nBatch result attributes:")
    attrs = [attr for attr in dir(batch_result) if not attr.startswith('_')]
    for attr in attrs[:10]:  # Show first 10 attributes
        try:
            value = getattr(batch_result, attr)
            if not callable(value):
                print(f"  {attr}: {value}")
        except Exception:
            pass
                
except Exception as e:
    print(f"Error during batch analysis: {e}")
    import traceback
    traceback.print_exc()

# Clean up test files
for filename in test_files:
    if os.path.exists(filename):
        os.remove(filename)
        
print("\nCleaned up batch test files")

=== Batch Analysis Example ===
Created batch_test_0.csv
Created batch_test_1.csv
Created batch_test_2.csv

Running batch analysis...
Batch analysis completed, result type: <class 'builtins.PyBatchResult'>

Batch result attributes:
  average_quality_score: 98.33333333333333
  failed_files: 0
  processed_files: 3
  total_duration_secs: 0.1200045
  total_quality_issues: 3

Cleaned up batch test files
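
Before handing a pattern to `batch_analyze_glob()`, it can help to preview which files it would match. The sketch below does this with the standard `glob` module inside a temporary directory, so nothing is written to the working directory. It assumes the library follows the usual shell-style glob semantics, which the source does not confirm; the recorded `processed_files: 3` is at least consistent with that.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Three small CSV files named like the ones in the cell above.
    for i in range(3):
        path = os.path.join(workdir, f"batch_test_{i}.csv")
        with open(path, "w", newline="") as handle:
            handle.write("value1,value2,category\n")
            handle.write(f"{i * 10},{i},Type_0\n")

    # A file that should NOT match the pattern.
    with open(os.path.join(workdir, "notes.txt"), "w") as handle:
        handle.write("not a dataset\n")

    pattern = os.path.join(workdir, "batch_test_*.csv")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```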


In [9]:
# ML workflow example with Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

print("=== ML Workflow with Iris Dataset ===")

# Load Iris dataset
iris = load_iris(as_frame=True)
iris_df = iris.frame
print(f"Iris dataset shape: {iris_df.shape}")
print("\nFirst few rows:")
print(iris_df.head())

# Save for dataprof analysis
iris_file = "iris_data.csv"
iris_df.to_csv(iris_file, index=False)

try:
    # Analyze with dataprof
    print("\n=== Dataprof Analysis on Iris ===")
    iris_profiles = dp.analyze_csv_file(iris_file)
    print(f"Iris analysis: {len(iris_profiles)} column profiles")
    
    iris_quality = dp.analyze_csv_with_quality(iris_file)
    print(f"Iris quality score: {iris_quality.quality_score()}")
    print(f"Iris issues: {len(iris_quality.issues)}")
    
    if iris_quality.issues:
        print("Quality issues in Iris dataset:")
        for issue in iris_quality.issues:
            print(f"- {issue.description}")
    else:
        print("✅ No quality issues found in Iris dataset!")
    
except Exception as e:
    print(f"Error analyzing Iris: {e}")

# ML training
print("\n=== Training ML Model ===")
X = iris_df.drop('target', axis=1)
y = iris_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print(f"Training accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")

# Clean up
if os.path.exists(iris_file):
    os.remove(iris_file)
    print(f"\nCleaned up {iris_file}")

print("\n🎉 Dataprof demo completed successfully!")

=== ML Workflow with Iris Dataset ===
Iris dataset shape: (150, 5)

First few rows:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

=== Dataprof Analysis on Iris ===
Iris analysis: 5 column profiles
Iris quality score: 70.0
Iris issues: 6
Quality issues in Iris dataset:
- 115 duplicate values in column 'sepal length (cm)'
- 147 duplicate values in column 'target'
- 128 duplicate values in column 'petal width (cm)'
- 127 duplicate values in column 'sepal width (cm)'
- 1 outlier values in column '
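
The duplicate counts reported above are consistent with "total values minus distinct values". This is an inference from the recorded numbers, not a statement about dataprof's implementation. Under that reading, any classification target will be flagged: Iris has 150 rows and 3 classes, hence 147 "duplicates". A minimal sketch:

```python
def duplicate_count(values):
    """Number of entries beyond the first occurrence of each distinct value."""
    return len(values) - len(set(values))

# Iris has 50 samples for each of its 3 classes.
iris_target = [0] * 50 + [1] * 50 + [2] * 50
# The gender column from the earlier 10-row dataset; None counts as a distinct value.
gender = ["M", "F", "M", "F", "F", None, "M", "F", "M", "F"]

iris_duplicates = duplicate_count(iris_target)
gender_duplicates = duplicate_count(gender)
print(iris_duplicates, gender_duplicates)
```

Read this way, a quality score of 70.0 on Iris mostly reflects repeated measurement values and class labels rather than defects in the data.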

## Summary

This notebook demonstrated these core dataprof capabilities:

### Core Functions (available since v0.4.1):
- `analyze_csv_file()` - Basic CSV column profiling
- `analyze_csv_with_quality()` - Quality assessment with scoring
- `batch_analyze_glob()` - Batch processing of multiple files
- `analyze_json_file()` - JSON file analysis (available, but not demonstrated here)

### Key Features:
- **Fast analysis** of CSV data quality
- **Missing value detection** and quantification
- **Quality scoring** for datasets
- **Batch processing** for multiple files
- **ML workflow integration** for preprocessing validation

### Next Steps:
- Explore JSON analysis with `analyze_json_file()`
- Try directory-wide analysis with `batch_analyze_directory()`
- Implement data validation pipelines
- Generate automated quality reports
- Keep dataprof up to date to pick up additional features