# Introduction

This notebook demonstrates fundamental techniques for enhancing **data security, quality, and governance** in software applications. The focus is on ensuring data reliability, tracking data lineage, sanitizing sensitive information, and conducting regular security audits. These practices are essential in modern data-driven systems to uphold integrity, compliance, and trustworthiness.

The code covers four primary areas:

1. **Data Quality Checks**: Validating datasets to ensure all required fields are present and non-empty.
2. **Data Provenance Tracking**: Logging the source and processing stage of each data item, with a timestamp, for accountability and traceability.
3. **Data Sanitization and Redaction**: Protecting sensitive information by replacing the values of specified fields with a `[REDACTED]` placeholder.
4. **Regular Security Audits**: Computing a SHA-256 hash for file integrity checks and reporting file permissions to help detect tampering and overly broad access.

By incorporating these techniques, developers can build systems that are robust, secure, and aligned with best practices for handling sensitive data.


In [1]:
import hashlib
import os
import uuid
from datetime import datetime

# 1. Data Quality Checks
def validate_data_quality(data, required_fields):
    """
    Validates data quality by ensuring all required fields are present and non-empty.
    :param data: List of dictionaries representing the dataset.
    :param required_fields: List of fields that must be present and non-empty.
    :return: Tuple with valid and invalid data entries.
    """
    valid_data, invalid_data = [], []
    
    for entry in data:
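        # A field counts as populated unless it is missing, None, or an empty string,
        # so legitimate falsy values such as 0 or False are not rejected.
        if all(entry.get(field) not in (None, "") for field in required_fields):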
            valid_data.append(entry)
        else:
            invalid_data.append(entry)
    
    return valid_data, invalid_data

# 2. Data Provenance Tracking
class DataProvenance:
    """Keeps an in-memory log of where data came from and how it was processed."""

    def __init__(self):
        self.provenance_log = []

    def log_entry(self, data_id, source, processing_stage):
        """
        Logs data provenance information.
        :param data_id: Unique identifier for the data.
        :param source: Source of the data.
        :param processing_stage: Current stage of data processing.
        """
        self.provenance_log.append({
            "data_id": data_id,
            "source": source,
            "processing_stage": processing_stage,
            "timestamp": datetime.utcnow().isoformat()
        })

    def get_provenance_log(self):
        return self.provenance_log

# 3. Data Sanitization and Redaction
def sanitize_data(data, fields_to_redact):
    """
    Redacts sensitive fields from the dataset.
    :param data: List of dictionaries representing the dataset.
    :param fields_to_redact: List of fields to redact.
    :return: Sanitized dataset.
    """
    redacted_data = []
    for entry in data:
        sanitized_entry = entry.copy()
        for field in fields_to_redact:
            if field in sanitized_entry:
                sanitized_entry[field] = "[REDACTED]"
        redacted_data.append(sanitized_entry)
    return redacted_data

# 4. Regular Security Audits
def perform_security_audit(file_path):
    """
    Conducts a simple security audit by calculating file integrity hash and checking permissions.
    :param file_path: Path to the file being audited.
    :return: Audit results as a dictionary.
    """
    audit_results = {}
    
    # Check file permissions: keep only the permission bits as a 3-digit octal string (e.g. "644")
    file_permissions = format(os.stat(file_path).st_mode & 0o777, "03o")
    audit_results['permissions'] = file_permissions
    
    # Compute file hash for integrity check
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        for byte_block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(byte_block)
    audit_results['file_hash'] = sha256_hash.hexdigest()
    
    return audit_results

# Example Usage
if __name__ == "__main__":
    # Data Quality Check Example
    dataset = [
        {"name": "Alice", "email": "alice@example.com", "age": 25},
        {"name": "", "email": "bob@example.com", "age": 30},
        {"name": "Charlie", "email": None, "age": 35},
    ]
    required_fields = ["name", "email", "age"]
    valid_data, invalid_data = validate_data_quality(dataset, required_fields)
    print("Valid Data:", valid_data)
    print("Invalid Data:", invalid_data)

    # Data Provenance Tracking Example
    provenance_tracker = DataProvenance()
    data_id = str(uuid.uuid4())
    provenance_tracker.log_entry(data_id, "user_upload", "initial_validation")
    provenance_tracker.log_entry(data_id, "user_upload", "transformation")
    print("Provenance Log:", provenance_tracker.get_provenance_log())

    # Data Sanitization Example
    dataset = [
        {"name": "Alice", "email": "alice@example.com", "ssn": "123-45-6789"},
        {"name": "Bob", "email": "bob@example.com", "ssn": "987-65-4321"},
    ]
    fields_to_redact = ["ssn"]
    sanitized_dataset = sanitize_data(dataset, fields_to_redact)
    print("Sanitized Data:", sanitized_dataset)

    # Security Audit Example
    file_to_audit = "example_data.csv"  # Ensure this file exists for the demo
    if os.path.exists(file_to_audit):
        audit_result = perform_security_audit(file_to_audit)
        print("Security Audit Results:", audit_result)
    else:
        print(f"File '{file_to_audit}' does not exist. Please create it for the audit demo.")

Valid Data: [{'name': 'Alice', 'email': 'alice@example.com', 'age': 25}]
Invalid Data: [{'name': '', 'email': 'bob@example.com', 'age': 30}, {'name': 'Charlie', 'email': None, 'age': 35}]
Provenance Log: [{'data_id': 'd1066537-d3c0-499d-8788-053c72f1e61d', 'source': 'user_upload', 'processing_stage': 'initial_validation', 'timestamp': '2024-11-23T03:01:41.298332'}, {'data_id': 'd1066537-d3c0-499d-8788-053c72f1e61d', 'source': 'user_upload', 'processing_stage': 'transformation', 'timestamp': '2024-11-23T03:01:41.298332'}]
Sanitized Data: [{'name': 'Alice', 'email': 'alice@example.com', 'ssn': '[REDACTED]'}, {'name': 'Bob', 'email': 'bob@example.com', 'ssn': '[REDACTED]'}]
File 'example_data.csv' does not exist. Please create it for the audit demo.
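
The audit step above was skipped because `example_data.csv` was not present. The following is a minimal sketch of how the same audit could be exercised without that file: it writes a temporary file, records a baseline result, and audits again after the file is modified. The helper names `audit_file` and `demo_audit` and the sample contents are assumptions made for illustration; `audit_file` mirrors `perform_security_audit` rather than replacing it.

```python
import hashlib
import os
import tempfile


def audit_file(file_path):
    """Return the permission bits and SHA-256 hash of a file (hypothetical helper)."""
    sha256_hash = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read in 4 KiB blocks so large files are not loaded into memory at once.
        for block in iter(lambda: f.read(4096), b""):
            sha256_hash.update(block)
    return {
        # Keep only the owner/group/other permission bits, e.g. "644".
        "permissions": format(os.stat(file_path).st_mode & 0o777, "03o"),
        "file_hash": sha256_hash.hexdigest(),
    }


def demo_audit():
    """Audit a temporary file before and after a change; return both results."""
    # Sample contents are made up for illustration.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
        tmp.write("name,email\nAlice,alice@example.com\n")
        path = tmp.name
    try:
        baseline = audit_file(path)
        with open(path, "a") as f:
            f.write("Mallory,mallory@example.com\n")
        current = audit_file(path)
    finally:
        os.remove(path)
    return baseline, current


baseline, current = demo_audit()
print("Hash changed:", baseline["file_hash"] != current["file_hash"])
```

Comparing a fresh hash against a stored baseline is what turns the hash into an integrity check; where the baseline is stored and how it is protected is left open here.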


# Conclusion

This notebook provides a **basic framework** for implementing critical data management and security practices. The included examples demonstrate how to validate data quality, maintain data provenance logs, sanitize sensitive information, and conduct security audits. These techniques are vital in creating systems that not only meet high standards of security but also ensure data accuracy and transparency.

It is important to treat this code as a **starting point** for more comprehensive implementations. Real-world applications often demand additional customizations, domain-specific considerations, and integration with larger systems. Developers are encouraged to expand upon these examples, adopting advanced methodologies and tools as necessary.
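
As one example of such an extension, the `log_entry` method above builds its timestamp with `datetime.utcnow()`, which returns a naive datetime and is deprecated as of Python 3.12. A minimal sketch of a timezone-aware variant follows. The class name `TimezoneAwareProvenance` is hypothetical, and returning a copy of the log is an added design choice rather than part of the original class.

```python
from datetime import datetime, timezone


class TimezoneAwareProvenance:
    """Hypothetical variant of DataProvenance with timezone-aware timestamps."""

    def __init__(self):
        self.provenance_log = []

    def log_entry(self, data_id, source, processing_stage):
        self.provenance_log.append({
            "data_id": data_id,
            "source": source,
            "processing_stage": processing_stage,
            # Aware UTC timestamp; isoformat() ends with "+00:00".
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def get_provenance_log(self):
        # Return a shallow copy so callers cannot append to or reorder the internal log.
        return list(self.provenance_log)


tracker = TimezoneAwareProvenance()
tracker.log_entry("demo-id", "user_upload", "initial_validation")
print(tracker.get_provenance_log()[0]["timestamp"])
```

An explicit UTC offset in each timestamp removes ambiguity when logs from systems in different time zones are merged.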

Applied consistently, these foundational practices help organizations foster trust, support regulatory compliance, and protect their data assets in an increasingly complex digital landscape.
