# ***Modeling Firm-Level Loss Behaviour and Corporate Income Tax (CIT) Revenue Risk in Kenya***


### Authors: 
Brian Kahiu, John Karanja, Cyrus Mutuku, Catherine Gachiri, Fredrick Nzeve, Grace Kinyanjui, Jeremy Onsongo

### EXECUTIVE SUMMARY

**Business Problem:** 

Kenya Revenue Authority (KRA) experiences persistent Corporate Income Tax (CIT) revenue leakage as firms report losses despite ongoing business activity, limiting effective audit targeting and fiscal planning.

### Approach
Using 300,000+ firm-year CIT returns, financial ratios were engineered from accounting data and machine learning models applied to identify high-risk loss-reporting firms. Prior to modelling, the data were subjected to a structured pre-processing pipeline to ensure data integrity, eliminate duplication, prevent information leakage, stabilise engineered features, and guarantee reproducibility across training and test samples. After all preprocessing and feature engineering steps, the final modelling dataset contains 99,332 firms, split into 74,499 training observations and 24,833 test observations, with 289 features used in estimation. The observed loss rate in the test sample is 36 per cent.

### Key Results
A tuned XGBoost model achieved ***78.8%*** ROC-AUC with 57.3% precision in loss detection, improving performance by 21.6% over baseline and identifying cost-to-turnover ratio as the strongest predictor.

### Business Impact
The model enables:

1. 40% improvement in audit efficiency
2. KSh 50M+ annual revenue recovery potential
3. Shift from reactive to predictive compliance

### Recommendation
KRA should integrate the model into audit selection workflows to prioritize high-risk firms, supported by SHAP-based explanations for transparency and operational trust.

### 1.0 Business Understanding
### Background Information
The Kenya Revenue Authority was established by an Act of Parliament, Chapter 469 of the Laws of Kenya, which became effective on 1st July 1995. KRA is charged with collecting revenue on behalf of the Government of Kenya. The core functions of the Authority are:

• To assess, collect and account for all revenues in accordance with the written laws and the specified provisions of the written laws.

• To advise on matters relating to the administration of, and collection of revenue under the written laws or the specified provisions of the written laws.

• To perform such other functions in relation to revenue as the Minister may direct.

Corporate Income Tax (CIT) in Kenya is administered by the Kenya Revenue Authority under the Income Tax Act (Cap 470), with a standard rate of 30% for resident companies and 37.5% for non-residents, although some sectors receive incentives (for example, SEZs and EPZs). Key requirements include online filing via iTax, quarterly instalment payments, and specific rules for permanent establishments (PEs); compliance now relies heavily on valid eTIMS invoices.

A company is considered resident in Kenya if it is incorporated under Kenyan law or if the management and control of its affairs are exercised in Kenya for any given year of income. It is also considered resident if the Cabinet Secretary, National Treasury & Planning, declares the company to be tax resident for a particular year of income in a notice published in the Kenya Gazette.

At the end of the accounting period, companies are required to have their books of accounts audited before filing their annual return, which is due within six months after the end of the accounting period. The company tax return, popularly known as ITC2, is available on the iTax platform under the Returns menu, via the 'File Return' option.

The taxable income declared in the corporation tax return is arrived at by declaring the gross income earned during the year and deducting expenses that have been wholly and exclusively incurred in the production of that income, as guided by the Income Tax Act (Cap 470).
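The computation described above can be illustrated with a minimal sketch. The function names and figures are hypothetical, and the sketch assumes the 30% resident rate quoted earlier; it deliberately ignores instalments, capital allowances, loss carry-forward and sector incentives, so it is an illustration of the arithmetic rather than a statement of the full tax computation.

```python
def taxable_income(gross_income: float, allowable_expenses: float) -> float:
    """Gross income less expenses wholly and exclusively incurred in producing it."""
    return gross_income - allowable_expenses


def cit_liability(taxable: float, rate: float = 0.30) -> float:
    """Tax due at the given rate; a loss (negative taxable income) yields zero tax."""
    return max(taxable, 0.0) * rate


# Hypothetical firms with the same turnover but different declared expenses
profit_case = taxable_income(10_000_000, 7_500_000)   # profitable firm
loss_case = taxable_income(10_000_000, 11_000_000)    # loss-reporting firm

print("Profit case:", profit_case, "tax:", cit_liability(profit_case))
print("Loss case:  ", loss_case, "tax:", cit_liability(loss_case))
```

The second case shows why loss reporting matters for revenue: a firm with ongoing turnover contributes no CIT once declared expenses exceed declared income.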

### Business Problem Definition

Kenya has persistently failed to meet Corporate Income Tax (CIT) revenue targets. The high prevalence of firms reporting losses significantly erodes the effective tax base, creating fiscal uncertainty. The central problem is the lack of an empirical, data-driven framework for:

1. Identifying which firm-level characteristics are associated with loss reporting.
2. Proactively identifying high-risk firms and sectors.
3. Assessing how firm-level loss behavior translates into systemic CIT revenue risk.

### Our Solution

An automated risk scoring system that:

1. Processes firm-level CIT return data using the methodology outlined in the project proposal.
2. Employs an iterative modeling approach, beginning with interpretable logistic regression as a primary benchmark.
3. Applies machine learning to identify high-risk loss-reporting firms for targeted compliance.

### Project Objectives
***Main Objective***
To develop a supervised predictive model estimating the probability of a firm reporting a loss, as defined in the project proposal.

***Specific Objectives***

1. To empirically identify firm-level characteristics associated with loss reporting in CIT returns.
2. To develop a supervised predictive model estimating the probability of a firm reporting a loss.
3. To assess the concentration and distribution of loss behavior across sectors and firm groups.
4. To translate firm-level loss probabilities into insights on aggregate CIT revenue risk.
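One possible way to read the fourth objective is as a probability-weighted sum of the tax each firm would pay if it were profitable. The sketch below is an assumption for illustration only: the column names `p_loss` and `tax_if_profitable` and all figures are hypothetical, and the source does not state that this is the aggregation actually used.

```python
import pandas as pd

# Hypothetical firm-level scores: predicted loss probability and the
# CIT the firm would be expected to pay if it reported a profit.
scores = pd.DataFrame({
    "sector": ["CONSTRUCTION", "CONSTRUCTION", "EDUCATION"],
    "p_loss": [0.80, 0.20, 0.50],
    "tax_if_profitable": [1_000_000.0, 500_000.0, 200_000.0],
})

# Expected tax at risk per firm = probability of loss x tax exposure
scores["expected_tax_at_risk"] = scores["p_loss"] * scores["tax_if_profitable"]

# Aggregate to sector level and to the whole CIT base
by_sector = scores.groupby("sector")["expected_tax_at_risk"].sum()
total_at_risk = by_sector.sum()

print(by_sector)
print("Total expected tax at risk:", total_at_risk)
```

Summing by sector also speaks to the third objective, since it shows where expected losses are concentrated.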

### Methodology: CRISP-DM Framework
This project follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) to ensure a structured, transparent, and policy-relevant analytics workflow.

### Business Understanding
Stakeholder needs were identified, the business problem was defined, and success metrics were established to align analytical outputs with compliance and fiscal objectives.

### Data Understanding
Corporate Income Tax return data were explored to assess structure, data quality, and preliminary patterns in loss-reporting behavior across firms and sectors.

### Data Preparation
Raw accounting variables were transformed into financial ratios, with outlier treatment and feature creation applied to improve data quality and model stability.
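As a minimal sketch of this step, the example below builds a cost-to-turnover ratio with a guarded division and then caps extreme values by winsorising. The helper names `safe_ratio` and `winsorise`, the toy figures and the percentile cut-offs are assumptions for illustration; the source does not specify the exact outlier rule.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is zero or missing."""
    denom = denominator.where(denominator != 0)  # zero becomes NaN
    return numerator / denom


def winsorise(values: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values to the given lower and upper quantiles (NaN is left as NaN)."""
    lo, hi = values.quantile([lower, upper])
    return values.clip(lower=lo, upper=hi)


# Toy data using two column names from the raw CIT file
toy = pd.DataFrame({
    "grossturnover": [100.0, 200.0, 0.0, 50.0, 1_000.0],
    "cost_of_sales": [60.0, 150.0, 10.0, 5_000.0, 700.0],
})

toy["cost_to_turnover"] = safe_ratio(toy["cost_of_sales"], toy["grossturnover"])
toy["cost_to_turnover_w"] = winsorise(toy["cost_to_turnover"], 0.05, 0.95)
print(toy)
```

In a modelling pipeline the quantile limits would normally be estimated on the training sample only and then applied to the test sample, so that outlier treatment does not leak information across the split.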

### Modeling
A baseline Logistic Regression model was developed as an interpretable benchmark, followed by an optimized XGBoost model using systematic hyperparameter tuning.

### Evaluation
Model performance was assessed using ROC-AUC, precision, recall, and F1-score, alongside business impact analysis and SHAP-based explainability.

### Deployment
A high-level implementation roadmap was defined, including model packaging, integration into audit selection workflows, and monitoring considerations.

### Success Metrics
### Technical Metrics

Model performance is assessed using AUC-ROC, precision, recall, and F1-score.
Validation follows a time-based split to reflect real-world forecasting conditions.
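To make the four metrics concrete, the sketch below computes them from first principles with NumPy on a handful of made-up labels and scores. The function name `classification_metrics` and the 0.5 threshold are assumptions; in practice an established library routine such as scikit-learn's `roc_auc_score` would normally be preferred, and the pairwise AUC used here is only suitable for small illustrations.

```python
import numpy as np


def classification_metrics(y_true, y_score, threshold=0.5):
    """Precision, recall, F1 at a threshold, plus threshold-free ROC-AUC."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = (y_score >= threshold).astype(int)

    tp = int(((y_pred == 1) & (y_true == 1)).sum())  # losses correctly flagged
    fp = int(((y_pred == 1) & (y_true == 0)).sum())  # profitable firms flagged
    fn = int(((y_pred == 0) & (y_true == 1)).sum())  # losses missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # ROC-AUC: probability that a random positive scores above a random
    # negative, counting ties as one half.
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

    return {"precision": precision, "recall": recall, "f1": f1, "roc_auc": float(auc)}


# Made-up example: 1 = firm reported a loss
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2]
metrics = classification_metrics(y_true, y_score)
print(metrics)
```

Precision answers "of the firms flagged, how many actually reported a loss", while recall answers "of the loss-reporting firms, how many were flagged"; ROC-AUC summarises ranking quality across all thresholds.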

### Business Metrics

- Support risk-based compliance management for the Kenya Revenue Authority.
- Provide clearer understanding of structural weaknesses in the CIT base for the National Treasury.
- Inform policy discussions on capital allowances, financing structures, and related-party transactions.

### Primary Stakeholders

1. KRA Compliance Directors

***Problem:*** Manual audit selection misses high-risk loss-reporting firms

***Need:*** Prioritize firms with highest evasion probability for investigation

***Business Value:*** Improved audit efficiency and revenue recovery

2. Tax Policy Analysts at National Treasury

***Problem:*** Revenue forecasting uncertainty due to loss declaration patterns

***Need:*** Data-driven risk assessment for fiscal planning and budgeting

***Business Value:*** Improved accuracy in CIT revenue projections

3. Field Tax Officers

***Problem:*** Wasted time on low-risk audits with minimal revenue recovery

***Need:*** Focus investigations on firms with highest probability of tax avoidance

***Business Value:*** Higher productivity and improved targeting outcomes

### 2.0 Data Understanding

The analysis is based on raw year 2024 Corporate Income Tax (CIT) return data comprising 313,870 firm-year observations and 61 variables obtained from administrative tax filings. The dataset is predominantly numeric (47 numeric variables) with 14 categorical variables capturing sectoral classification and firm size.

We load the raw CIT return data. Cleaning focuses on defining the modelling scope:

1. ***Validity:*** We only keep firms with positive turnover (active businesses).
2. ***Target Definition:*** A firm is flagged as "Risk" (is_loss = 1) if it reports a negative Profit Before Tax.
3. ***Sector Standardization:*** We clean messy sector names and group rare sectors into "Other" to prevent the model from overfitting to tiny industries.
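The three scope rules can be sketched on a toy frame as follows. The data, the `MIN_SECTOR_SIZE` value, the upper-casing of sector labels and the decision to drop rows with a missing profit figure are assumptions made for the illustration (the notebook itself collapses sectors with fewer than 200 firms).

```python
import numpy as np
import pandas as pd

# Toy data using column names from the raw CIT file
toy = pd.DataFrame({
    "grossturnover": [1_000.0, 0.0, 500.0, np.nan, 250.0, 300.0],
    "profit_loss_before_tax": [-50.0, 10.0, 75.0, -5.0, np.nan, -20.0],
    "sector": ["CONSTRUCTION", "EDUCATION", " construction ",
               "MINING", "EDUCATION", "MINING"],
})

# Validity: keep only firms with positive turnover (NaN and zero are excluded)
scope = toy[toy["grossturnover"] > 0].copy()

# Target definition needs an observed profit figure (assumption: drop if missing)
scope = scope.dropna(subset=["profit_loss_before_tax"])
scope["is_loss"] = (scope["profit_loss_before_tax"] < 0).astype(int)

# Sector standardisation: tidy labels, then collapse rare sectors into "Other"
scope["sector"] = scope["sector"].str.strip().str.upper()
MIN_SECTOR_SIZE = 2  # hypothetical cut-off for the toy data
counts = scope["sector"].value_counts()
rare = counts[counts < MIN_SECTOR_SIZE].index
scope.loc[scope["sector"].isin(rare), "sector"] = "Other"

print(scope)
```

Applying the validity filter before counting sectors means the rarity threshold reflects the firms that actually enter the model rather than the raw file.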

***Import the initial required libraries***

In [1]:
# ----------------------------
# Imports
# ----------------------------
import numpy as np
import pandas as pd

# ----------------------------
# Global seed (reproducibility)
# ----------------------------
SEED = 42
np.random.seed(SEED)

# ----------------------------
# Display settings
# ----------------------------
pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 200)
pd.set_option("display.float_format", "{:,.4f}".format)

***Loading data***

In [2]:
# ----------------------------
# Load raw data
# ----------------------------
DATA_PATH = "CIT2024.csv"


df = pd.read_csv(DATA_PATH, low_memory=False)

# ----------------------------
# Basic structural checks
# ----------------------------
print("Dataset shape:", df.shape)

print("\nFirst five rows:")
display(df.head(5))

print("\nData types summary:")
display(df.dtypes.value_counts())

print("\nDuplicate rows:", df.duplicated().sum())


Dataset shape: (313870, 61)

First five rows:


Unnamed: 0,unique_id,business_type,business_subtype,epz_effective_dt,period_from,period_to,filing_date,is_nil_return,return_type,assmt_type,eff_dt_com_activity,sector,division_,group_,class_,grossturnover,cost_of_sales,total_opening_stock,total_purchase_and_imports,odc_tot_of_other_direct_costs,odc_factory_rent_and_rates,fact_ovh_fuel_and_power,fact_ovh_indirect_wages,fact_ovh_consumables,fact_ovh_depreciation,other_factory_overheads,total_factory_overheads,total_closing_stock,gross_profit,total_other_income,total_other_income_int,oi_dividend,oi_commision,oi_natural_resource_payments,oi_royalties,oi_gift_in_conn_wth_prprty,oi_prof_of_disposal_of_assets,oi_realized_exchange_gain,oi_unrealized_exchange_gain,oi_prvsn_for_bad_doubtful_db,insurance_comp,tot_opexp,admexp_depreciation,admexp__loss_disposal_assets,admexp__scntfc_research_exp,admexp__mgmt_exp,total_administrative_exp,total_employment_exp,total_financing_exp,profit_loss_before_tax,income_tax_exp,prof_loss_tax_div_bal_st,empexp__salary_wages,init_plant_mach_allow,init_indu_buld_allow,cap_allw_indu_buld,wear_tear_dedc_rbm,wear_tear_dedc_slm,deduct_agri_land,tot_allow_deductions,avg_no_of_employees
0,1210000124.0,Company,Private Company,,1/1/2024,31/12/2024,27/06/2025,N,Original,S,,SERVICE ACTIVITIES,951-Repair of computers and personal and house...,9521-Repair of consumer electronics,8411 - General public administration activities,3605224.88,2350913.8,0.0,2350913.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1254311.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,842870.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8191.08,,8191.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,12100006324.0,Company,Private Company,,1/1/2024,31/12/2024,27/06/2025,Y,Original,S,,SERVICE ACTIVITIES,961-personal service activities,9602-Hairdressing and other beauty treatment,9602 - Hairdressing and other beauty treatment,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0
2,12100019424.0,Company,Private Company,,1/1/2024,31/12/2024,27/05/2025,N,Original,S,,"WHOLESALE AND RETAIL TRADE, REPAIR OF MOTOR VE...","461-Wholesale trade, except of motor vehicles ...",4614-Wholesale of other household goods,8110 - Combined facilities support activities,67712664.54,66436372.67,107900.0,67328772.67,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1000300.0,1276291.87,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,525700.0,1853.1,0.0,0.0,0.0,251853.1,250000.0,74854.64,151384.13,,151384.13,250000.0,0.0,0.0,0.0,1853.1,0.0,0.0,1853.1,
3,12100019624.0,Company,Private Company,,1/1/2024,31/12/2024,20/05/2025,Y,Original,S,,"WHOLESALE AND RETAIL TRADE, REPAIR OF MOTOR VE...","471-Retail trade, except of motor vehicles and...","4720-Retail sale of hardware, paints and glass...",8620 - Medical and dental practice activities,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,12100020724.0,Company,Private Company,,1/1/2024,31/12/2024,29/06/2025,N,Original,S,,"PROFESSIONAL, SCIENTIFIC AND TECHNICAL ACTIVITIES",701-Activities of head offices; management con...,7020-Management consultancy activities,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,200833.0,202523.0,1690.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15000.0,0.0,0.0,0.0,0.0,0.0,0.0,8237.0,114704.0,,114704.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,



Data types summary:


float64    47
object     14
Name: count, dtype: int64


Duplicate rows: 3011


### 2. Initial Data Quality Checks

The raw dataset contains 313,870 observations and 61 variables, with a predominantly numeric structure: 47 variables are numeric (float64) and 14 are categorical (object). This composition is well-suited for ratio-based feature engineering and supervised modelling, with limited reliance on text-heavy fields.

A duplicate check identified 3,011 exact duplicate rows, which were removed to prevent artificial inflation of patterns during modelling. After deduplication, the dataset was reduced to 310,859 unique firm-year observations.

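Missingness is substantial in several fields. The most affected are the date fields eff_dt_com_activity and epz_effective_dt (both above 99.9% missing), followed by income_tax_exp (96.2%), avg_no_of_employees (80.6%) and class_ (65.0%), while a number of financial line items are missing in 63.8% of rows. Basic sanity checks on cost of sales and profit/loss before tax (112,492 non-missing values each) indicate wide dispersion, consistent with firm heterogeneity, and no blanket exclusions were applied at this stage, although the negative minimum for cost of sales merits review. Turnover was not covered by this check because the raw column name, grossturnover, is not among the candidate names used to locate it.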

Columns imported as text were screened for predominantly numeric content and coerced to numeric types where more than 90% of values parsed as numbers. In this run no column met the threshold, so none was coerced.
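One detail of the coercion screen is worth illustrating: the cell below measures the numeric share over all rows, so missing values count against a column. A numeric column stored as text but with many blanks would therefore fail the 90% rule. The sketch shows an alternative that measures the share among non-missing entries only; the helper `numeric_share` and the toy data are hypothetical and are not part of the original pipeline.

```python
import pandas as pd


def numeric_share(col: pd.Series) -> float:
    """Share of non-missing entries that parse as numbers."""
    present = col.dropna()
    if present.empty:
        return 0.0
    return float(pd.to_numeric(present, errors="coerce").notna().mean())


# Toy frame: a numeric field stored as text with 40% missing, and a true label field
toy = pd.DataFrame({
    "amount_text": ["10", "20.5", None, None, "30"],
    "sector": ["A", "B", "C", None, "D"],
}, dtype=object)

# Share over all rows: missing values lower the score
whole_column_share = pd.to_numeric(toy["amount_text"], errors="coerce").notna().mean()

print("Share over all rows:        ", whole_column_share)
print("Share among present values: ", numeric_share(toy["amount_text"]))
print("Label column (present only):", numeric_share(toy["sector"]))
```

Which denominator is appropriate depends on intent: the all-rows version is conservative, while the present-values version separates the question "is this field numeric?" from "how complete is it?".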

At this point, the dataset is structurally sound and ready for standardisation and domain-specific cleaning, beginning with sector harmonisation and alignment of core financial fields.

In [3]:
# ============================================================
# 2) Initial Data Quality Checks (single clean cell)
#    - missingness (top 15)
#    - duplicates (count + drop)
#    - data types summary
#    - numeric sanity checks (turnover, costs, profit)
#    - coerce mostly-numeric object columns
# ============================================================

import numpy as np
import pandas as pd

# --- A) Data types summary ---
dtype_summary = df.dtypes.value_counts()
print("\nData types summary:\n")
print(dtype_summary)

# --- B) Duplicate check + drop ---
dup_count = df.duplicated().sum()
print(f"\nDuplicate rows identified: {dup_count:,}")

df = df.drop_duplicates().reset_index(drop=True)
print("Shape after dropping duplicates:", df.shape)

# --- C) Missingness (%), top 15 columns ---
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
missing_table = missing_pct.reset_index()
missing_table.columns = ["column", "missing_percent"]

print("\nTop 15 columns by missingness (%):")
display(missing_table.head(15))

# --- D) Coerce mixed-type columns (object -> numeric where mostly numeric) ---
coerced_cols = []
for col in df.columns:
    if df[col].dtype == "object":
        coerced = pd.to_numeric(df[col], errors="coerce")
        if coerced.notna().mean() > 0.90:  # heuristic: mostly numeric values
            df[col] = coerced
            coerced_cols.append(col)

print("\nColumns coerced to numeric (if any):")
print(coerced_cols if coerced_cols else "None")

# --- E) Numeric sanity checks (min/max/mean) for key financial fields ---
# Try common candidate names so the cell works even if your raw column names differ.
TURNOVER_CANDS = ["gross_business", "business_gross_turnover", "gross_turnover", "turnover", "sales", "total_sales"]
COST_CANDS     = ["total_costs", "total_cost", "total_expenses", "total_expenditure", "cost_of_sales"]
PROFIT_CANDS   = ["profit_loss_before_tax", "profit_before_tax", "profit_loss", "pbt", "taxable_profit"]

turnover_col = next((c for c in TURNOVER_CANDS if c in df.columns), None)
cost_col     = next((c for c in COST_CANDS if c in df.columns), None)
profit_col   = next((c for c in PROFIT_CANDS if c in df.columns), None)

key_cols = [c for c in [turnover_col, cost_col, profit_col] if c is not None]

print("\nSelected key columns for sanity checks:")
print({"turnover": turnover_col, "total_costs": cost_col, "profit": profit_col})

if key_cols:
    tmp = df[key_cols].apply(pd.to_numeric, errors="coerce")
    sanity = tmp.describe().T[["count", "min", "max", "mean"]]
    print("\nSanity check summary (count/min/max/mean):")
    display(sanity)
else:
    print("\nSanity checks skipped: could not find turnover/cost/profit columns in the dataset.")



Data types summary:

float64    47
object     14
Name: count, dtype: int64

Duplicate rows identified: 3,011
Shape after dropping duplicates: (310859, 61)

Top 15 columns by missingness (%):


| | column | missing_percent |
|---|---|---|
| 0 | eff_dt_com_activity | 99.9727 |
| 1 | epz_effective_dt | 99.9521 |
| 2 | income_tax_exp | 96.1568 |
| 3 | avg_no_of_employees | 80.6362 |
| 4 | class_ | 65.0211 |
| 5 | prof_loss_tax_div_bal_st | 63.8125 |
| 6 | insurance_comp | 63.8125 |
| 7 | oi_dividend | 63.8125 |
| 8 | oi_commision | 63.8125 |
| 9 | oi_natural_resource_payments | 63.8125 |



Columns coerced to numeric (if any):
None

Selected key columns for sanity checks:
{'turnover': None, 'total_costs': 'cost_of_sales', 'profit': 'profit_loss_before_tax'}

Sanity check summary (count/min/max/mean):


| | count | min | max | mean |
|---|---|---|---|---|
| cost_of_sales | 112492.0 | -484608805.0 | 565124000000.0 | 66155627.3856 |
| profit_loss_before_tax | 112492.0 | -19606609233.0 | 138126000000.0 | 8963990.8759 |


### 3. Standardisation and Core Field Alignment

From the initial checks, the dataset is largely numeric and structurally usable after removing duplicates. Missingness, however, is heavily concentrated in a subset of fields—especially incentive-related indicators (e.g., EPZ fields) and several detailed cost components. Before feature engineering, we standardise key categorical fields (notably sector) and align the core accounting fields required for modelling (turnover, costs, profit). This step ensures consistent definitions and prevents downstream feature construction from failing due to type inconsistencies or fragmented labels.

We also explicitly tag “high-missingness” variables for exclusion from modelling, rather than attempting to impute variables that are effectively absent for most firms.

In [4]:
# ============================================================
# 3) Standardisation and Core Field Alignment (ACTUAL VARIABLES)
#   - sector standardisation
#   - align core accounting fields needed downstream
#   - flag high-missing columns (>=60%) for exclusion later
# ============================================================

import numpy as np
import pandas as pd

# --- A) Sector standardisation ---
df["sector"] = (
    df["sector"]
    .astype(str)
    .str.strip()
    .replace({"": np.nan, "nan": np.nan, "None": np.nan})
    .fillna("Unknown")
)

# Collapse very rare sectors into "Other" (stability)
sector_counts = df["sector"].value_counts()
df.loc[df["sector"].isin(sector_counts[sector_counts < 200].index), "sector"] = "Other"

print("Sector summary (top 10):")
display(df["sector"].value_counts().head(10))

# --- B) Align core accounting fields (your actual variable names) ---
TURNOVER_COL = "grossturnover"
PROFIT_COL   = "profit_loss_before_tax"
DEDUCT_COL   = "tot_allow_deductions"

required = [TURNOVER_COL, PROFIT_COL, DEDUCT_COL]
missing_req = [c for c in required if c not in df.columns]
if missing_req:
    raise ValueError(f"Missing required column(s): {missing_req}")

# Coerce to numeric (safe)
for c in required:
    df[c] = pd.to_numeric(df[c], errors="coerce")

print("\nCore fields aligned:")
print({"turnover": TURNOVER_COL, "profit": PROFIT_COL, "deductions": DEDUCT_COL})

# --- C) Flag very-high-missingness columns (>=60%) ---
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
high_missing_cols = missing_pct[missing_pct >= 60].index.tolist()

print("\nColumns with ≥60% missingness (flagged for exclusion):", len(high_missing_cols))
print(high_missing_cols[:20], "..." if len(high_missing_cols) > 20 else "")


Sector summary (top 10):


sector
CONSTRUCTION                                                            47476
SERVICE ACTIVITIES                                                      44699
WHOLESALE AND RETAIL TRADE, REPAIR OF MOTOR VEHICLES AND MOTORCYCLES    36937
ADMINISTRATIVE AND SUPPORT SERVICE ACTIVITIES                           25403
AGRICULTURE, FORESTRY AND FISHING                                       22814
REAL ESTATE ACTIVITIES                                                  21240
PROFESSIONAL, SCIENTIFIC AND TECHNICAL ACTIVITIES                       14341
INFORMATION AND COMMUNICATION                                           14227
EDUCATION                                                               10921
FINANCIAL AND INSURANCE ACTIVITIES                                      10679
Name: count, dtype: int64


Core fields aligned:
{'turnover': 'grossturnover', 'profit': 'profit_loss_before_tax', 'deductions': 'tot_allow_deductions'}

Columns with ≥60% missingness (flagged for exclusion): 49
['eff_dt_com_activity', 'epz_effective_dt', 'income_tax_exp', 'avg_no_of_employees', 'class_', 'prof_loss_tax_div_bal_st', 'insurance_comp', 'oi_dividend', 'oi_commision', 'oi_natural_resource_payments', 'oi_royalties', 'oi_gift_in_conn_wth_prprty', 'oi_prof_of_disposal_of_assets', 'oi_realized_exchange_gain', 'oi_unrealized_exchange_gain', 'oi_prvsn_for_bad_doubtful_db', 'tot_opexp', 'profit_loss_before_tax', 'admexp_depreciation', 'gross_profit'] ...
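The flagged list above includes profit_loss_before_tax, the field from which the loss target is defined, so dropping every flagged column literally would remove a core field. A minimal sketch of one way to guard against this is shown below; the `CORE_FIELDS` protection step and the toy data are assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Core fields named in the alignment step; protected from exclusion here
CORE_FIELDS = ["grossturnover", "profit_loss_before_tax", "tot_allow_deductions"]
THRESHOLD = 60.0  # percent missing, matching the flagging rule

# Toy frame in which two core fields are 80% missing
toy = pd.DataFrame({
    "grossturnover": [100.0, np.nan, np.nan, np.nan, np.nan],
    "profit_loss_before_tax": [5.0, np.nan, np.nan, np.nan, np.nan],
    "tot_allow_deductions": [0.0, 1.0, 2.0, 3.0, 4.0],
    "epz_effective_dt": [np.nan] * 5,
    "sector": list("ABCDE"),
})

missing_pct = toy.isna().mean().mul(100)
flagged = missing_pct[missing_pct >= THRESHOLD].index.tolist()

# Split the flagged columns into protected core fields and droppable columns
protected = [c for c in flagged if c in CORE_FIELDS]
to_drop = [c for c in flagged if c not in CORE_FIELDS]

print("Flagged:  ", flagged)
print("Protected:", protected)
print("To drop:  ", to_drop)
```

For protected fields, high missingness is usually better handled by restricting the rows (for example, to firms with observed turnover and profit) than by removing the column.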
