
TADSTech/PublicDatasetAudit


Public Dataset Audit


Overview

This project performs a comprehensive audit of the Kaggle Audit Risk dataset to identify and document data quality issues, including missing values, duplicates, inconsistencies, and structural problems. The audit results are compiled into a detailed report with findings and recommendations.

Key Findings

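| Metric             | Value         | Status       |
|--------------------|---------------|--------------|
| Total Records      | 776           | -            |
| Total Columns      | 27            | -            |
| Missing Values     | 1 (0.00%)     | ✅ Excellent |
| Duplicate Rows     | 13 (1.68%)    | ⚠️ Cleaned   |
| Data Quality Score | 98.12%        | ✅ Excellent |
| Outliers           | 1,935 (9.59%) | ⚠️ Review    |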
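
As a quick sanity check, the percentages above can be reproduced from the raw counts. This is only a sketch: `pct` is a hypothetical helper, not part of `src/audit_checks.py`, and the outlier denominator (all rows times 26 columns) is an assumption that happens to reproduce 9.59%; the audit itself may count it differently.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

rows, cols = 776, 27  # totals reported in the table

# One missing cell out of every cell in the table.
missing_pct = pct(1, rows * cols)

# 13 exact duplicate rows out of all rows, and what is left after removal.
duplicate_pct = pct(13, rows)
rows_after_dedup = rows - 13

# ASSUMPTION: outliers are counted over 26 columns (one column excluded).
outlier_pct = pct(1935, rows * (cols - 1))

print(missing_pct, duplicate_pct, rows_after_dedup, outlier_pct)
```

The missing-value share is roughly 0.005%, which is why it shows as 0.00% at two decimals, and removing the 13 duplicates leaves the 763 rows of the cleaned dataset.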

Critical Issues Identified

  1. Duplicate Column Name: "Score_B" appears twice in the CSV, causing data loss
  2. Duplicate Records: 13 rows were exact duplicates (removed in cleaned dataset)
  3. High Outlier Count: Several columns have 15-18% outliers requiring domain expert review
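
A duplicated column name is easy to miss, because `pandas.read_csv` by default renames a repeated header (for example to `Score_B.1`) instead of raising an error. One way to catch it is to inspect the header row before loading. The sketch below uses only the standard library; `find_duplicate_headers` is a hypothetical helper, not the repository's `check_duplicate_columns()`, and the sample columns other than `Score_B` are made up.

```python
import csv
import io
from collections import Counter

def find_duplicate_headers(csv_file):
    """Return header names that occur more than once, with their counts."""
    header = next(csv.reader(csv_file))
    counts = Counter(name.strip() for name in header)
    return {name: n for name, n in counts.items() if n > 1}

# Tiny in-memory sample standing in for the real file (values are invented).
sample = io.StringIO("col_a,Score_B,col_b,Score_B\n1.0,0.2,0.5,0.4\n")
duplicates = find_duplicate_headers(sample)
print(duplicates)
```

For the real file, the same function can be called on `open('data/raw/audit_data.csv', newline='')`.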

Dataset Source

  • Source: Kaggle
  • Name: Audit Risk Dataset
  • Description: Dataset for classifying fraudulent firms based on risk indicators
  • Link: Kaggle Audit Data

How to Run

Prerequisites

  • Python 3.8+
  • Jupyter Notebook / JupyterLab (optional, for exploration)
  • Dependencies listed in requirements.txt

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd PublicDatasetAudit
  2. Install dependencies:

    pip install -r requirements.txt
  3. Place the raw data in the data/raw/ directory (audit_data.csv is already included).

Running the Audit

  1. Explore the data (interactive):

    jupyter notebook notebooks/audit-exploration.ipynb
  2. Run validation checks (programmatic):

    import pandas as pd
    from src.audit_checks import generate_audit_report, clean_dataframe
    
    # Load data
    df = pd.read_csv('data/raw/audit_data.csv')
    
    # Generate comprehensive audit report
    report = generate_audit_report(df)
    print(f"Quality Score: {report['quality_scores']['overall'] * 100:.2f}%")
    
    # Clean the data
    df_clean, changes = clean_dataframe(df)
    print(f"Cleaned: {changes}")
  3. Generate PDF report:

    typst compile docs/audit-report.typ reports/audit-report.pdf

Project Structure

PublicDatasetAudit/
├── data/
│   ├── raw/                    # Original dataset files
│   │   └── audit_data.csv      # Raw audit data (776 rows, 27 cols)
│   ├── processed/              # Cleaned data after audit fixes
│   │   └── audit_data_cleaned.csv  # Cleaned dataset (763 rows)
│   └── external/               # Reference files and data dictionaries
├── notebooks/
│   └── audit-exploration.ipynb # Interactive data exploration
├── src/
│   └── audit_checks.py         # Reusable audit functions
├── docs/
│   └── audit-report.typ        # Typst report source
├── reports/                    # Generated PDF reports
├── requirements.txt            # Python dependencies
├── LICENSE                     # MIT License
└── README.md                   # This file

Audit Functions

The src/audit_checks.py module provides the following functions:

| Function                  | Description                          |
|---------------------------|--------------------------------------|
| check_missing()           | Analyze missing values by column     |
| check_duplicates()        | Detect duplicate rows                |
| check_duplicate_columns() | Find duplicate column names          |
| check_data_types()        | Analyze data types with samples      |
| check_outliers()          | Detect outliers using IQR or Z-score |
| calculate_quality_score() | Compute comprehensive quality score  |
| generate_audit_report()   | Generate full audit report           |
| clean_dataframe()         | Apply standard cleaning operations   |
| save_cleaned_data()       | Export cleaned data to CSV           |
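
To illustrate the IQR rule mentioned for `check_outliers()`, here is a minimal standalone sketch. It is not the module's implementation, whose signature and defaults are not documented here: `iqr_outliers` and the multiplier `k=1.5` are assumptions based on the conventional form of the rule.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Invented sample: Q1 = 3, Q3 = 7, so the fences are -3 and 13.
flagged = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(flagged)
```

A flagged value is only a candidate for review. In heavily skewed columns, a large share of legitimate records can fall outside the fences.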

License

This project is licensed under the MIT License. See the LICENSE file for details.

Technologies Used

  • Python: Data manipulation and validation
  • Pandas: Data analysis and cleaning
  • NumPy: Numerical operations
  • Matplotlib/Seaborn: Static visualizations
  • Plotly: Interactive visualizations
  • Jupyter: Interactive exploration
  • Typst: PDF report generation

Author

Michael Tunwashe (TADS)


Last updated: February 1, 2026
