This project performs a comprehensive audit of the Kaggle Audit Risk dataset to identify and document data quality issues, including missing values, duplicates, inconsistencies, and structural problems. The audit results are compiled into a detailed report with findings and recommendations.
| Metric | Value | Status |
|---|---|---|
| Total Records | 776 | - |
| Total Columns | 27 | - |
| Missing Values | 1 (0.00%) | ✅ Excellent |
| Duplicate Rows | 13 (1.68%) | ⚠️ Needs review |
| Data Quality Score | 98.12% | ✅ Excellent |
| Outliers | 1,935 (9.59%) | ⚠️ Needs review |
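A composite score like the 98.12% above typically combines per-check results such as completeness and uniqueness. As a rough illustration only (the actual weighting in `src/audit_checks.py` may differ), one simple way to derive such a score is:

```python
import pandas as pd

def quality_score(df: pd.DataFrame) -> float:
    """Naive quality score for illustration: the fraction of cells that
    are present, scaled by the fraction of rows that are not exact
    duplicates. The project's own scoring may weight checks differently."""
    completeness = 1 - df.isna().mean().mean()  # share of non-missing cells
    uniqueness = 1 - df.duplicated().mean()     # share of non-duplicate rows
    return completeness * uniqueness
```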
- Duplicate Column Name: "Score_B" appears twice in the CSV, causing data loss
- Duplicate Records: 13 rows were exact duplicates (removed in cleaned dataset)
- High Outlier Count: Several columns have 15-18% outliers requiring domain expert review
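The duplicate-column issue is easy to miss because pandas silently renames repeated headers on load (a second `Score_B` becomes `Score_B.1`). A minimal sketch of how to catch it by inspecting the raw CSV header, with `find_duplicate_columns` being a hypothetical helper (the project's own `check_duplicate_columns()` may work differently):

```python
import csv
from collections import Counter

def find_duplicate_columns(path: str) -> list[str]:
    """Return column names that appear more than once in the CSV header.
    Reads only the first row, so the check is cheap even on large files."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    counts = Counter(header)
    return [name for name, n in counts.items() if n > 1]
```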
- Source: Kaggle
- Name: Audit Risk Dataset
- Description: Dataset for classifying fraudulent firms based on risk indicators
- Link: Kaggle Audit Data
- Python 3.8+
- Jupyter Notebook / JupyterLab (optional, for exploration)
- Dependencies listed in `requirements.txt`
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd PublicDatasetAudit
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Place raw data in the `data/raw/` directory (included: `audit_data.csv`)

4. Explore the data (interactive):

   ```bash
   jupyter notebook notebooks/audit-exploration.ipynb
   ```

5. Run validation checks (programmatic):

   ```python
   import pandas as pd

   from src.audit_checks import generate_audit_report, clean_dataframe

   # Load data
   df = pd.read_csv('data/raw/audit_data.csv')

   # Generate comprehensive audit report
   report = generate_audit_report(df)
   print(f"Quality Score: {report['quality_scores']['overall'] * 100:.2f}%")

   # Clean the data
   df_clean, changes = clean_dataframe(df)
   print(f"Cleaned: {changes}")
   ```

6. Generate the PDF report:

   ```bash
   typst compile docs/audit-report.typ reports/audit-report.pdf
   ```
```
PublicDatasetAudit/
├── data/
│   ├── raw/                       # Original dataset files
│   │   └── audit_data.csv         # Raw audit data (776 rows, 27 cols)
│   ├── processed/                 # Cleaned data after audit fixes
│   │   └── audit_data_cleaned.csv # Cleaned dataset (763 rows)
│   └── external/                  # Reference files and data dictionaries
├── notebooks/
│   └── audit-exploration.ipynb    # Interactive data exploration
├── src/
│   └── audit_checks.py            # Reusable audit functions
├── docs/
│   └── audit-report.typ           # Typst report source
├── reports/                       # Generated PDF reports
├── requirements.txt               # Python dependencies
├── LICENSE                        # MIT License
└── README.md                      # This file
```
The `src/audit_checks.py` module provides the following functions:

| Function | Description |
|---|---|
| `check_missing()` | Analyze missing values by column |
| `check_duplicates()` | Detect duplicate rows |
| `check_duplicate_columns()` | Find duplicate column names |
| `check_data_types()` | Analyze data types with samples |
| `check_outliers()` | Detect outliers using IQR or Z-score |
| `calculate_quality_score()` | Compute comprehensive quality score |
| `generate_audit_report()` | Generate full audit report |
| `clean_dataframe()` | Apply standard cleaning operations |
| `save_cleaned_data()` | Export cleaned data to CSV |
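For context on the IQR method mentioned above, here is a minimal standalone sketch of IQR-based outlier detection (the actual `check_outliers()` signature and behavior in `src/audit_checks.py` may differ, and it also supports Z-score detection, which is not shown here):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside the Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the conventional multiplier."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)
```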
This project is licensed under the MIT License; see the LICENSE file for details.
- Python: Data manipulation and validation
- Pandas: Data analysis and cleaning
- NumPy: Numerical operations
- Matplotlib/Seaborn: Static visualizations
- Plotly: Interactive visualizations
- Jupyter: Interactive exploration
- Typst: PDF report generation
Michael Tunwashe (TADS)
Last updated: February 1, 2026