### Healthcare – Patient Data Accuracy

**Task 1**: Patient Record Accuracy Assessment

**Objective**: Assess a sample of patient records for accuracy problems and define validation rules that catch those problems when data is entered.

**Steps**:
1. Examine a sample patient dataset for common inaccuracies.
2. Identify at least three common issues, such as medication errors or misdiagnoses.
3. Propose validation measures to ensure data accuracy at the point of entry.

In [1]:
# Write your code from here
import pandas as pd
import numpy as np

# 1. Sample patient dataset (including some errors)
data = {
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'age': [29, -1, 999, 35, 45, 27],
    'diagnosis_code': ['D123', 'DXYZ', '', 'D456', '1234', 'D789'],
    'medication': ['MedA', 'MedB', 'MedC', 'MedA', 'MedX', 'MedB'],
    'dosage_mg': [500, -20, 100000, 250, 400, None]
}

df = pd.DataFrame(data)
print("\n🔍 Sample Patient Records:")
print(df)

# 2. Identify common issues

# A. Invalid Ages (e.g., negative or unrealistic)
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]

# B. Invalid or missing diagnosis codes (should start with 'D' followed by 3 digits)
invalid_diagnosis = df[~df['diagnosis_code'].str.match(r'^D\d{3}$', na=False)]

# C. Medication errors: negative, missing or extreme dosage
invalid_dosage = df[(df['dosage_mg'].isnull()) | (df['dosage_mg'] <= 0) | (df['dosage_mg'] > 10000)]

print("\n⚠️ Invalid Age Entries:")
print(invalid_age)

print("\n⚠️ Misdiagnoses (Invalid Diagnosis Codes):")
print(invalid_diagnosis)

print("\n⚠️ Medication Errors (Invalid Dosages):")
print(invalid_dosage)

# 3. Proposed validation rules
def validate_patient_record(record):
    issues = []
    
    # Age must be between 0 and 120
    if not (0 <= record['age'] <= 120):
        issues.append("Invalid age")
    
    # Diagnosis code must match pattern D###
    code = record['diagnosis_code']
    code_ok = (
        isinstance(code, str)       # rejects None/NaN before string methods run
        and len(code) == 4
        and code.startswith('D')
        and code[1:].isdigit()
    )
    if not code_ok:
        issues.append("Invalid diagnosis code")
    
    # Dosage must be a positive number below 10,000 mg
    if pd.isnull(record['dosage_mg']) or record['dosage_mg'] <= 0 or record['dosage_mg'] > 10000:
        issues.append("Invalid dosage")

    return issues

# Apply validation to each record
print("\n✅ Record-wise Validation Results:")
for _, row in df.iterrows():  # the index label is not needed here
    result = validate_patient_record(row)
    if result:
        print(f"❌ Patient {row['patient_id']} - Issues: {', '.join(result)}")
    else:
        print(f"✅ Patient {row['patient_id']} - Record is valid")



🔍 Sample Patient Records:
  patient_id     name  age diagnosis_code medication  dosage_mg
0       P001    Alice   29           D123       MedA      500.0
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0
3       P004    Diana   35           D456       MedA      250.0
4       P005      Eve   45           1234       MedX      400.0
5       P006    Frank   27           D789       MedB        NaN

⚠️ Invalid Age Entries:
  patient_id     name  age diagnosis_code medication  dosage_mg
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0

⚠️ Misdiagnoses (Invalid Diagnosis Codes):
  patient_id     name  age diagnosis_code medication  dosage_mg
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0
4       P005      Eve   45           1234       MedX      400.0

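⚠️ Medication Errors (Invalid Dosages):
  patient_id     name  age diagnosis_code medication  dosage_mg
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0
5       P006    Frank   27           D789       MedB        NaN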
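
Step 3 asks for validation at the point of entry, whereas `validate_patient_record` checks rows that are already stored. One possible way to reject bad values before they reach the dataset is sketched below. The `PatientEntry` class and the `try_enter` helper are hypothetical names introduced for illustration; the limits (age 0 to 120, code `D###`, dosage above 0 and at most 10,000 mg) are taken from the checks in this notebook.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed rule, mirroring the notebook: 'D' followed by exactly three digits.
DIAGNOSIS_PATTERN = re.compile(r"D\d{3}")


@dataclass(frozen=True)
class PatientEntry:
    """Hypothetical entry-form record that refuses invalid values on creation."""
    patient_id: str
    name: str
    age: int
    diagnosis_code: str
    medication: str
    dosage_mg: Optional[float]

    def __post_init__(self):
        # Collect every problem so the person entering data sees all of them at once.
        issues = []
        if not (0 <= self.age <= 120):
            issues.append("Invalid age")
        if not DIAGNOSIS_PATTERN.fullmatch(self.diagnosis_code or ""):
            issues.append("Invalid diagnosis code")
        if self.dosage_mg is None or not (0 < self.dosage_mg <= 10000):
            issues.append("Invalid dosage")
        if issues:
            raise ValueError(f"Patient {self.patient_id}: {', '.join(issues)}")


def try_enter(**fields):
    """Return (record, None) on success or (None, error message) on rejection."""
    try:
        return PatientEntry(**fields), None
    except ValueError as exc:
        return None, str(exc)


ok, err = try_enter(patient_id="P001", name="Alice", age=29,
                    diagnosis_code="D123", medication="MedA", dosage_mg=500)
bad, bad_err = try_enter(patient_id="P002", name="Bob", age=-1,
                         diagnosis_code="DXYZ", medication="MedB", dosage_mg=-20)
print(err)
print(bad_err)
```

Raising inside `__post_init__` means an invalid record can never be constructed, so downstream code does not need to re-check these fields. A real entry form would likely also check the medication name against a formulary, which this sketch does not attempt.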

**Task 2**: Implement Healthcare Data Quality Checks

**Objective**: Keep health records accurate by running a repeatable, automated set of data quality checks over patient data.

**Steps**:
1. Develop a validation workflow for patient data.
2. Automate the checks for common errors in software (this notebook uses pandas).
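
As a complement to the per-check functions in the cell that follows, the same workflow can be expressed as a single rule table that yields one summary row per patient. This is only a sketch: `RULES` and `summarize_issues` are hypothetical names, and the thresholds are assumed to match the ones used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rule table: each label maps to a function that returns a
# boolean Series marking the rows that FAIL the rule.
RULES = {
    "Invalid age": lambda d: ~((d["age"] >= 0) & (d["age"] <= 120)),
    "Invalid diagnosis code": lambda d: ~d["diagnosis_code"].str.match(r"^D\d{3}$", na=False),
    # NaN compares as False, so a missing dosage is flagged as well.
    "Invalid dosage": lambda d: ~((d["dosage_mg"] > 0) & (d["dosage_mg"] <= 10000)),
}


def summarize_issues(df):
    """Return one row per patient with a comma-separated list of failed rules."""
    flags = pd.DataFrame({label: rule(df) for label, rule in RULES.items()})
    issues = [
        ", ".join(label for label, failed in zip(flags.columns, row) if failed)
        for row in flags.itertuples(index=False)
    ]
    return pd.DataFrame({"patient_id": df["patient_id"].tolist(), "issues": issues})


sample = pd.DataFrame({
    "patient_id": ["P001", "P002", "P006"],
    "age": [29, -1, 27],
    "diagnosis_code": ["D123", "DXYZ", "D789"],
    "dosage_mg": [500, -20, None],
})
summary = summarize_issues(sample)
print(summary)
```

Adding a new check then means adding one entry to `RULES`, and the summary shows every problem for a patient on a single line. An empty `issues` string means the record passed all rules.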

In [2]:
# Write your code from here
import pandas as pd

# Sample patient data (simulate real input data)
data = {
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', None],
    'age': [29, -1, 999, 35, 45, 27, 50],
    'diagnosis_code': ['D123', 'DXYZ', '', 'D456', '1234', 'D789', 'D234'],
    'medication': ['MedA', 'MedB', 'MedC', 'MedA', 'MedX', 'MedB', 'MedC'],
    'dosage_mg': [500, -20, 100000, 250, 400, None, 300]
}
df = pd.DataFrame(data)

# Validation functions

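def check_missing_values(df):
    """Return rows where any required field is null (None/NaN).

    Empty strings are not treated as missing here; an empty diagnosis
    code is reported by check_diagnosis_code instead.
    """
    required = ['patient_id', 'name', 'age', 'diagnosis_code', 'medication', 'dosage_mg']
    missing = df[df[required].isnull().any(axis=1)]
    return missing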

def check_age_validity(df):
    """Age must be between 0 and 120"""
    invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
    return invalid_age

def check_diagnosis_code(df):
    """Diagnosis codes must start with 'D' followed by 3 digits"""
    invalid_diag = df[~df['diagnosis_code'].str.match(r'^D\d{3}$', na=False)]
    return invalid_diag

def check_dosage(df):
    """Dosage must be >0 and <= 10000"""
    invalid_dosage = df[(df['dosage_mg'].isnull()) | (df['dosage_mg'] <= 0) | (df['dosage_mg'] > 10000)]
    return invalid_dosage

def run_validation_workflow(df):
    """Run every check and return a dict mapping issue name to the offending rows."""
    report = {}

    report['Missing Values'] = check_missing_values(df)
    report['Invalid Age'] = check_age_validity(df)
    report['Invalid Diagnosis Code'] = check_diagnosis_code(df)
    report['Invalid Dosage'] = check_dosage(df)

    return report

# Run validation
validation_report = run_validation_workflow(df)

# Display results
print("\n=== Healthcare Data Quality Validation Report ===")
for issue, records in validation_report.items():
    print(f"\n{issue} ({len(records)} records):")
    if records.empty:
        print(" - None found")
    else:
        print(records)

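# Optional: Save report to Excel for audit.
# Writing .xlsx files requires the openpyxl package (pip install openpyxl);
# without it, pd.ExcelWriter raises ModuleNotFoundError.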
with pd.ExcelWriter('healthcare_data_validation_report.xlsx') as writer:
    for issue, records in validation_report.items():
        records.to_excel(writer, sheet_name=issue[:31], index=False)  # Sheet name max length is 31 chars

print("\nReport saved as 'healthcare_data_validation_report.xlsx'")




=== Healthcare Data Quality Validation Report ===

Missing Values (2 records):
  patient_id   name  age diagnosis_code medication  dosage_mg
5       P006  Frank   27           D789       MedB        NaN
6       P007   None   50           D234       MedC      300.0

Invalid Age (2 records):
  patient_id     name  age diagnosis_code medication  dosage_mg
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0

Invalid Diagnosis Code (3 records):
  patient_id     name  age diagnosis_code medication  dosage_mg
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0
4       P005      Eve   45           1234       MedX      400.0

Invalid Dosage (3 records):
  patient_id     name  age diagnosis_code medication  dosage_mg
1       P002      Bob   -1           DXYZ       MedB      -20.0
2       P003  Charlie  999                      MedC   100000.0
5       P006    Frank   27           D789       MedB        NaN

ModuleNotFoundError: No module named 'openpyxl'
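
The error above means the Excel export needs `openpyxl`, which was not installed in this environment. Installing it (`pip install openpyxl`) is the direct fix. Alternatively, the export step can be made tolerant of the missing package, as in the sketch below. The `save_report` helper and the CSV fallback are assumptions added for illustration and are not part of the original workflow.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd


def save_report(report, out_dir=".", stem="healthcare_data_validation_report"):
    """Write the report to one Excel workbook, or to CSV files if openpyxl is missing.

    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    if importlib.util.find_spec("openpyxl") is not None:
        path = out_dir / f"{stem}.xlsx"
        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            for issue, records in report.items():
                # Excel limits sheet names to 31 characters.
                records.to_excel(writer, sheet_name=issue[:31], index=False)
        return [path]

    # Fallback (an assumption): one CSV file per issue type.
    paths = []
    for issue, records in report.items():
        path = out_dir / f"{stem}_{issue.lower().replace(' ', '_')}.csv"
        records.to_csv(path, index=False)
        paths.append(path)
    return paths


demo_report = {"Invalid Age": pd.DataFrame({"patient_id": ["P002"], "age": [-1]})}
with tempfile.TemporaryDirectory() as tmp:
    written = save_report(demo_report, out_dir=tmp)
    all_written = all(p.exists() for p in written)
print(all_written)
```

In the notebook, `save_report(validation_report)` would replace the `pd.ExcelWriter` block, so the audit files are produced whether or not `openpyxl` is available.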