# Analysis of National Government Ministries, Departments and Agencies Budget Data

## Business Understanding

### Business Problem:

This project investigates how funds have been allocated and spent across Kenya’s national government ministries, departments, and agencies (MDAs) over the past three financial years (2021/2022, 2022/2023, 2023/2024). The aim is to determine whether recurring discrepancies between approved budgets and actual expenditures exist, and if so, whether these discrepancies indicate inefficiencies, weak budget execution, or potential audit concerns.

### Introduction — Real-world problem the project aims to solve:

Kenya’s public funds must be allocated and utilized transparently to ensure accountability, efficiency, and value for money. While single-year audits provide snapshots, they often miss patterns such as persistent underspending, overspending, or repeated audit concerns. By consolidating data from three consecutive Auditor-General reports, this project will uncover long-term trends in budget allocation and execution, highlight systemic inefficiencies, and flag potential risks for audit and governance review.

### Stakeholders:

1. Auditor-General / Audit Offices: Prioritize follow-up audits on entities with repeated large variances or recurring findings.

2. Controller of Budget: Identify ministries with poor budget execution or recurring irregularities to guide hearings and budget sanctions.

3. Policy analysts & Ministry finance teams: Target reforms (procurement, budgeting discipline, capacity building) where execution gaps are persistent.

4. NGOs & advocacy groups: Create evidence-based transparency reports and campaigns.

5. Investigative journalists & researchers: Produce data-driven stories on spending patterns and accountability failures.

### Implications for the real world and stakeholders:

A structured, longitudinal analysis enables detection of recurring inefficiencies and systematic audit concerns that single-year reviews miss. Findings can guide targeted audits, improve budget discipline (by showing where approved budgets routinely diverge from expenditures), and inform policy reforms (e.g., strengthening procurement controls, rolling budget ceilings, or capacity support). For civil society and media, the dataset supplies evidence for public accountability campaigns. Overall, the project strengthens governance by turning Auditor-General PDFs into persistent, actionable intelligence.

## Data Understanding

### Data sources and why they are suitable:
This project draws on official Auditor-General reports, which provide the authoritative record of Kenya’s national government budgets, expenditures, and audit observations. 

### Core sources

1. Auditor-General Reports (FY2021/22, FY2022/23, FY2023/24) - 
Authoritative, legally mandated audits with: (i) budget vs actuals, (ii) opinion types, (iii) control/governance findings, (iv) recurrent queries and pending bills. The 2023/24 MDAs report will anchor the latest year’s audited actuals and narrative risk signals (opinions; budget execution notes; control weaknesses). 

2. National Government Budget “Blue Book” (FY2021/22) - 
Official approved estimates at vote/program level, which serve as the baseline for “approved_budget” across MDAs.

3. Kenya_National_Govt_Budget_2021_2024.csv - 
A structured, machine-readable compilation for FY2021/22–FY2023/24 that speeds up descriptive statistics, joins, and sanity checks across years (vote/MDA, approved vs actual, etc.).

4. National-Government-MinistriesDepartments-And-Agencies-2023-2024.pdf - 
The latest Auditor-General MDAs report. It provides detailed, vote-level audit opinions and the “Statement of Comparison of Budget and Actual Amounts,” plus systemic issues (e.g., pending bills, late releases, control weaknesses) that contextualize execution gaps.

These datasets are the official, publicly available reports that include approved budgets, actual expenditures, and audit observations for ministries, departments, and agencies (MDAs). They are suitable because they are government-issued, comprehensive, and structured around the exact problem of interest: budget allocation and execution.

### Planned extraction and structuring of the data:

Use Python PDF extraction tools (pdfplumber) to pull out the “Statement of Comparison of Budget and Actual Amounts,” “Summary Statement of Appropriation,” and “Budgetary Control and Performance” sections from the PDFs.
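The extraction step could be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: it assumes `pdfplumber` is installed, that the target sections can be found by searching each page's text for their headings, and that amounts in parentheses denote negatives. The function names `parse_amount` and `extract_section_tables` are illustrative, and real report layouts will likely need tuning.

```python
import re

# Section headings named in the plan above.
SECTION_HEADINGS = (
    "Statement of Comparison of Budget and Actual Amounts",
    "Summary Statement of Appropriation",
    "Budgetary Control and Performance",
)


def parse_amount(cell):
    """Convert a table cell such as '1,234,567' or '(1,234)' to a float.

    Returns None for blank or non-numeric cells. Treating parentheses as
    negatives is an assumption about the reports' notation.
    """
    if cell is None:
        return None
    text = cell.strip().replace(",", "")
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    if not re.fullmatch(r"-?\d+(\.\d+)?", text):
        return None
    value = float(text)
    return -value if negative else value


def extract_section_tables(pdf_path, headings=SECTION_HEADINGS):
    """Yield (page_number, heading, table) for pages that mention a heading."""
    import pdfplumber  # third-party; imported lazily so parse_amount works without it

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Collapse whitespace so headings broken across lines still match.
            text = " ".join((page.extract_text() or "").lower().split())
            for heading in headings:
                if heading.lower() in text:
                    for table in page.extract_tables():
                        yield page.page_number, heading, table
```

Each yielded table is a list of rows, and each row is a list of cell strings (or `None`), so `parse_amount` can be applied cell by cell before the rows are loaded into the unified dataset.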

Normalize MDA names across years (to account for mergers, renaming, or restructuring).
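One way to approach the name normalization is a cleaned-key lookup against a hand-maintained alias table. The sketch below is illustrative only: the entries in `ALIASES` are hypothetical placeholders rather than real MDA names, and the cleaning rules (lowercasing, treating `&` as "and", dropping punctuation) are assumptions about the kinds of variation found in the reports.

```python
import re

# Hypothetical alias table: keys are cleaned name variants, values are the
# canonical label used in the unified dataset. Real entries would be
# compiled manually from the three reports.
ALIASES = {
    "state department for example services": "State Department for Example Services",
    "state department for example and sample services": "State Department for Example Services",
}


def clean_name(raw):
    """Lowercase, spell out '&', drop punctuation, and collapse whitespace."""
    text = raw.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def canonical_mda(raw, aliases=ALIASES):
    """Return the canonical name, or the stripped input if no alias is known."""
    return aliases.get(clean_name(raw), raw.strip())
```

Names that pass through unchanged are candidates for manual review, since they may be genuinely new entities or variants missing from the table.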

Build a unified dataset with the following features:

* MDA_name

* financial_year

* approved_budget

* actual_expenditure

* variance (approved – actual)

* pct_variance (variance as % of approved)

* audit_observations (structured tags or extracted text)

Dataset size: number of MDAs × 3 years (several hundred rows expected, depending on how many MDAs are listed per year).
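The two derived features follow directly from the definitions in the list above. As a minimal sketch, assuming one record per MDA per financial year (the `BudgetRecord` class is a hypothetical container, and audit observations are omitted for brevity):

```python
from dataclasses import dataclass


@dataclass
class BudgetRecord:
    mda_name: str
    financial_year: str
    approved_budget: float
    actual_expenditure: float

    @property
    def variance(self):
        # Positive when the MDA underspent, negative when it overspent.
        return self.approved_budget - self.actual_expenditure

    @property
    def pct_variance(self):
        # Variance as a percentage of the approved budget.
        # Undefined when nothing was approved, so return None in that case.
        if self.approved_budget == 0:
            return None
        return 100.0 * self.variance / self.approved_budget
```

For example, a record with an approved budget of 200.0 and actual expenditure of 150.0 has a variance of 50.0 and a `pct_variance` of 25.0.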

### Descriptive statistics to compute:

* For approved and actual expenditures: count, total, mean, median, min, max, standard deviation.

* For variances: total variance, average % under/overspending, distribution of % variances across MDAs.

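* Frequency of MDAs with significant underspending (more than 5% of the approved budget unspent), overspending (any expenditure above the approved budget), or execution within tolerance (0 to 5% unspent).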

* Audit observations summarized by category (e.g., procurement irregularities, unsupported expenditures, late disbursements).
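The summary statistics and execution bands could be computed along these lines. This sketch uses only the standard library and makes two assumptions that should be revisited: the standard deviation is the sample (n-1) version, and the bands treat any overspending as its own category, with underspending of up to 5% counted as within tolerance. The function names are illustrative.

```python
import statistics
from collections import Counter


def summarize(values):
    """Count, total, mean, median, min, max and sample standard deviation."""
    values = list(values)
    return {
        "count": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Sample standard deviation needs at least two values.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def classify_execution(pct_variance, tolerance=5.0):
    """Band a percentage variance, where positive means underspending."""
    if pct_variance < 0:
        return "overspending"
    if pct_variance > tolerance:
        return "significant underspending"
    return "within tolerance"


def band_frequencies(pct_variances, tolerance=5.0):
    """Count how many MDA-year records fall into each execution band."""
    return Counter(classify_execution(p, tolerance) for p in pct_variances)
```

`summarize` can be applied separately to the approved and actual columns, while `band_frequencies` takes the `pct_variance` column.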

### Justification for chosen features:

* MDA_name and financial_year are necessary identifiers for longitudinal analysis.

* approved_budget and actual_expenditure form the basis of budget execution analysis.

* variance and pct_variance allow comparisons across MDAs regardless of size.

* audit_observations provide explanatory context for discrepancies and help flag recurring governance issues.

### Limitations of the data and implications:

* Format inconsistency: reports are published as PDFs with mixed tables and text, requiring a hybrid extraction strategy and some manual cleaning.

* Naming inconsistencies: some MDAs change names or merge, which complicates longitudinal tracking.

* Accounting basis differences: reporting conventions may differ slightly year to year, affecting comparability.

* Granularity limits: Blue Books and Audit Reports provide institution-level data but not always project-level detail, limiting root-cause analysis.

* Audit text variability: audit observations are qualitative and may require natural language processing or manual tagging to be comparable across years.
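Before reaching for heavier natural language processing, a simple keyword-based tagger may be enough to make audit text roughly comparable across years. The sketch below is a starting point under stated assumptions: the category names come from this document, but the regular expressions in `TAG_PATTERNS` are illustrative guesses at typical wording and would need to be refined against the actual reports.

```python
import re

# Category -> pattern. The patterns are assumptions about common phrasing.
TAG_PATTERNS = {
    "procurement irregularities": r"\bprocurement\b|\btender",
    "unsupported expenditures": r"\bunsupported\b|\bsupporting documents?\b",
    "late disbursements": r"\b(late|delay(ed)?)\b.*\b(disbursement|release|exchequer)",
    "pending bills": r"\bpending bills?\b",
}


def tag_observation(text):
    """Return the sorted list of category tags whose pattern matches the text."""
    lowered = text.lower()
    return sorted(
        tag for tag, pattern in TAG_PATTERNS.items() if re.search(pattern, lowered)
    )
```

Observations that receive no tag can be routed to manual tagging, which also helps reveal categories the pattern table is missing.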

### Mitigation strategies:

* Combine automated extraction with manual review for problematic entries.

* Maintain a canonical MDA name mapping across years.

* Clearly document assumptions and cleaning steps in the notebook.

* Include a confidence flag for parsed figures (high when numbers are extracted from tables, lower when parsed from narrative text).
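The confidence flag can be carried alongside each parsed figure so that low-confidence values are easy to pull out for manual review. A minimal sketch, assuming the parser labels each figure's origin as either `"table"` or `"narrative"` (the `ParsedFigure` class and these labels are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedFigure:
    mda_name: str
    financial_year: str
    value: float
    source: str  # "table" or "narrative" (assumed labels set by the parser)

    @property
    def confidence(self):
        # High when read from a table cell, low when parsed from narrative text.
        return "high" if self.source == "table" else "low"


def needs_manual_review(figures):
    """Return the figures whose confidence flag is not 'high'."""
    return [f for f in figures if f.confidence != "high"]
```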