# Tax Evasion Prediction — A Machine Learning Case Study

**Author:** Dong Liu  
**Last updated:** Feb 27, 2026

### Executive Overview
This notebook walks through end‑to‑end development of a tax evasion prediction model, including data loading, cleaning, exploratory analysis, feature engineering, model training, and evaluation.  
The focus is clarity and reproducibility. The **original modeling logic and outputs are preserved**; only structure and documentation were added for portfolio presentation.

### How to Run
- Environment: Python 3.x, common data‑science stack (pandas, numpy, scikit‑learn, matplotlib, seaborn, etc.).  
- Reproducibility: set a global random seed where applicable.  
- Run all cells from top to bottom. If data files are required, place them under a `data/` folder at project root or adjust paths accordingly.

---


## Contents
1. Data Loading & Cleaning
2. Model Training  
3. Evaluation & Error Analysis  
4. Insights & Next Steps


---


# Machine Learning - Tax Evasion Prediction

## Data Loading & Cleaning



In logistic regression, omitting a necessary interaction term biases the estimates, because the model can only represent effects that are explicitly specified. KNN is non-parametric: it classifies each observation from its local neighborhood, so it can capture complex interactions without explicit interaction terms in the model.
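To make the point about interaction terms concrete, here is a minimal sketch on hypothetical XOR-style toy data (not the audit data). A coarse grid search over linear decision rules stands in for a fitted logistic regression, which is a simplifying assumption; the helper names `linear_accuracy` and `nearest_label` are illustrative.

```python
from itertools import product

# Hypothetical XOR-style toy data: the label depends only on the
# interaction between x1 and x2, not on either feature alone.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]


def linear_accuracy(weights, features):
    """Accuracy of the linear rule w.x + b > 0 (last weight is the bias b)."""
    *w, b = weights
    preds = [int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in features]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def nearest_label(query):
    """1-nearest-neighbour prediction using squared Euclidean distance."""
    dists = [sum((q - p) ** 2 for q, p in zip(query, pt)) for pt in points]
    return labels[dists.index(min(dists))]


grid = [i / 2 for i in range(-6, 7)]  # candidate weights from -3 to 3

# Main effects only (x1, x2): no linear rule classifies all four points.
best_main = max(linear_accuracy(w, points) for w in product(grid, repeat=3))

# Adding the explicit interaction x1 * x2 makes the classes separable.
with_interaction = [(x1, x2, x1 * x2) for x1, x2 in points]
best_inter = max(
    linear_accuracy(w, with_interaction) for w in product(grid, repeat=4)
)

print(f"Best linear accuracy, main effects only: {best_main:.2f}")
print(f"Best linear accuracy, with interaction:  {best_inter:.2f}")
# A nearest-neighbour rule needs no interaction term on the raw features.
print("1-NN prediction near (1, 1):", nearest_label((0.9, 0.95)))
```

The linear rule only reaches full accuracy once the product term is supplied by hand, whereas the neighbour-based rule works on the raw features.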



In [None]:
import numpy as np
import statsmodels.api as sm
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, roc_curve, roc_auc_score, RocCurveDisplay
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Read the data
audit = pd.read_csv('Data-Audit.csv')
audit.head()


In [None]:
# Count the missing values in each column
print(audit.isna().sum())

# Display rows where Money_Value is missing
print(audit[audit["Money_Value"].isna()])

# Remove the rows where Money_Value is missing
audit = audit.dropna(subset=['Money_Value'])
print(f'remaining number of rows: {len(audit)}')


In [None]:
X = audit.drop(columns = ['Risk'])
y = audit['Risk']
display(X.head())
display(y.head())

In [None]:
# Check data types
X.dtypes

In [None]:
# Split the data 50/50 into training and test sets, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state = 13, stratify=y)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state = 13)

In [None]:
# Define and train the logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# Predict the positive-class probabilities on the test set
y_pred_log = model.predict_proba(X_test)[:, 1] 




## Logistic Regression Model (Threshold = 0.5)

In [None]:
# Continuous to binary predictions
y_pred_log_bin_1 = np.where(y_pred_log > 0.5, 1, 0)
y_pred_log_bin_1[:10]

In [None]:
# Confusion Matrix
cm_log_1 = confusion_matrix(y_test, y_pred_log_bin_1)
print(cm_log_1)

# Compute Accuracy
accuracy_log_1 = accuracy_score(y_test, y_pred_log_bin_1)
error_rate_log_1 = 1 - accuracy_log_1


In [None]:
ax = sns.heatmap(cm_log_1, annot=True, 
            fmt='d', cmap='Blues')

ax.set_title('Logistic Regression: Confusion Matrix with Threshold 0.5 \n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

## Tick labels - list must follow the sorted class order (0, then 1)
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()

## Logistic Regression Model (Threshold = 0.6)

In [None]:

# Continuous to binary predictions
y_pred_log_bin_2 = np.where(y_pred_log > 0.6, 1, 0)
y_pred_log_bin_2[:10]

# Confusion Matrix
cm_log_2 = confusion_matrix(y_test, y_pred_log_bin_2)
print(cm_log_2)

# Compute Accuracy
accuracy_log_2 = accuracy_score(y_test, y_pred_log_bin_2)
error_rate_log_2 = 1 - accuracy_log_2



In [None]:
# Plot Confusion Matrix
ax = sns.heatmap(cm_log_2, annot=True, 
            fmt='d', cmap='Blues')

ax.set_title('Logistic Regression: Confusion Matrix with Threshold 0.6\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

## Tick labels - list must follow the sorted class order (0, then 1)
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
# Print results
print(f"\nAccuracy with threshold 0.5: {accuracy_log_1 * 100:.2f}%")
print(f"Error Rate with threshold 0.5: {error_rate_log_1 * 100:.2f}%")

# Print results
print(f"\nAccuracy with threshold 0.6: {accuracy_log_2 * 100:.2f}%")
print(f"Error Rate with threshold 0.6: {error_rate_log_2 * 100:.2f}%")

Overall, the model with threshold 0.6 has higher accuracy.

## Logistic Model AUC

In [None]:
# plot ROC
fpr_audit, tpr_audit, thresholds_audit = roc_curve(y_test, y_pred_log)
auc_audit = roc_auc_score(y_test, y_pred_log)
roc_curve_audit = RocCurveDisplay(fpr=fpr_audit, tpr=tpr_audit, 
                                roc_auc=auc_audit,
                                estimator_name='Logistic Regression')


fig, ax = plt.subplots(figsize=(5, 5))
roc_curve_audit.plot(ax=ax)
ax.set_ylabel("True Positive Rate (TPR)", fontsize=20)
ax.set_xlabel("False Positive Rate (FPR)", fontsize=20)
ax.set_title("Receiver Operating Characteristic (ROC) Curve")
plt.show()

print(f"The AUC score is {auc_audit*100:.2f}%")



The ROC curve visualizes the tradeoff between the TPR and FPR at all possible decision thresholds, not only the 0.5 and 0.6 thresholds compared above.  
The ROC plot shows that the model achieves a high TPR while keeping the FPR low, indicating that it performs well across different thresholds.  
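As a rough illustration of what the ROC curve summarizes, the sketch below computes (FPR, TPR) points by sweeping a threshold over hypothetical scores, and computes the AUC as the share of positive/negative pairs that are ranked correctly. The toy labels, scores, and helper names are assumptions, not outputs of the model above.

```python
# Hypothetical labels and predicted probabilities (not from the audit data).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.30, 0.45, 0.70, 0.40, 0.65, 0.80, 0.95]


def roc_point(threshold):
    """Return (FPR, TPR) when predicting 1 for scores above the threshold."""
    tp = sum(s > threshold and t == 1 for s, t in zip(scores, y_true))
    fp = sum(s > threshold and t == 0 for s, t in zip(scores, y_true))
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives


def pairwise_auc():
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


for thr in (0.2, 0.5, 0.8):
    fpr, tpr = roc_point(thr)
    print(f"threshold={thr:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC = {pairwise_auc():.4f}")
```

Each threshold contributes one point to the curve, while the AUC does not depend on any single threshold.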

In this context, the government prioritizes identifying firms with a high probability of tax evasion. False negatives (FN) are more concerning than false positives (FP) because:  

1. A false negative (FN) occurs when the model fails to detect a tax evader, undermining efforts to increase tax revenue.

2. A false positive (FP) leads to unnecessary audits, but the financial and administrative costs are relatively lower compared to the consequences of missing actual evaders.

Based on the ROC curve, the model effectively keeps the overall false positive rate low. 

Increasing the threshold will increase the FN rate, while decreasing the threshold will increase the FP rate. Given the government’s focus on minimizing false negatives, I would recommend a lower threshold in this case.
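The threshold tradeoff can be sketched with a small hypothetical example. The probabilities below are made up, and the 5:1 cost ratio between a missed evader and an unnecessary audit is purely an assumption for illustration; the actual ratio would have to come from the tax authority.

```python
# Hypothetical labels and predicted probabilities (not from the audit data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.92, 0.71, 0.55, 0.42, 0.58, 0.35, 0.30, 0.22, 0.15, 0.05]

# Assumed relative costs: a missed evader is taken to be five times as
# costly as an unnecessary audit. This ratio is illustrative only.
COST_FN, COST_FP = 5, 1


def errors_at(threshold):
    """Count false negatives and false positives at a given threshold."""
    preds = [int(p > threshold) for p in probs]
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    return fn, fp


for thr in (0.4, 0.5, 0.6):
    fn, fp = errors_at(thr)
    cost = COST_FN * fn + COST_FP * fp
    print(f"threshold={thr:.1f}  FN={fn}  FP={fp}  cost={cost}")
```

Under these assumed costs, the lowest threshold gives the lowest total cost because it avoids the more expensive false negatives.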



## KNN Model Unscaled (k=5)

In [None]:
# Run KNN5
knn_5 = KNeighborsClassifier(n_neighbors=5)
knn_5.fit(X_train, y_train)
knn_5_prob = knn_5.predict_proba(X_test)[:, 1] 

# Apply 0.5 threshold
knn_5_pred = (knn_5_prob > 0.5).astype(int)

# Calculate Accuracy
accuracy_knn_5 = accuracy_score(y_test, knn_5_pred)
error_rate_knn_5 = 1 - accuracy_knn_5

In [None]:
# Confusion Matrix
knn_5_cm = confusion_matrix(y_test, knn_5_pred)
print(knn_5_cm)

# Plot Confusion Matrix
ax = sns.heatmap(knn_5_cm, annot=True, 
            fmt='d', cmap='Blues')

ax.set_title('KNN_5 Confusion Matrix \n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

## Tick labels - list must follow the sorted class order (0, then 1)
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()



In [None]:

# Print accuracy results

print(f"KNN 5 Error Rate: {error_rate_knn_5 * 100:.2f}%")
print(f"KNN 5 Overall accuracy: {accuracy_knn_5 * 100:.2f}%")



Under 5-neighbor KNN, the proportion of firms predicted to evade taxes that actually evaded them (the precision) is $\frac{138}{138+6} \approx 95.8\%$.

## KNN Model Scaled (k=5)

In [None]:

scaler = StandardScaler()  # Initialize the scaler
X_train_scaled = scaler.fit_transform(X_train)  
X_test_scaled = scaler.transform(X_test)  


In [None]:
# Run KNN5 with scaled X
knn_5_scaled = KNeighborsClassifier(n_neighbors=5)
knn_5_scaled.fit(X_train_scaled, y_train)
knn_5_prob_scaled = knn_5_scaled.predict_proba(X_test_scaled)[:, 1]

# Apply 0.5 threshold
knn_5_pred_scaled = (knn_5_prob_scaled > 0.5).astype(int)

# Calculate Accuracy
accuracy_knn_5_scaled = accuracy_score(y_test, knn_5_pred_scaled)
error_rate_knn_5_scaled = 1 - accuracy_knn_5_scaled

In [None]:
# Confusion Matrix
knn_5_scaled_cm = confusion_matrix(y_test, knn_5_pred_scaled)
print(knn_5_scaled_cm)

# Plot Confusion Matrix
ax = sns.heatmap(knn_5_scaled_cm, annot=True, 
            fmt='d', cmap='Blues')

ax.set_title('KNN_5_scaled Confusion Matrix \n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')

## Tick labels - list must follow the sorted class order (0, then 1)
ax.xaxis.set_ticklabels(['False','True'])
ax.yaxis.set_ticklabels(['False','True'])

## Display the visualization of the Confusion Matrix.
plt.show()

In [None]:
print(f"KNN 5 scaled Error Rate: {error_rate_knn_5_scaled * 100:.2f}%")
print(f"KNN 5 scaled Overall accuracy: {accuracy_knn_5_scaled * 100:.2f}%")


Using scaled data, the proportion of firms predicted to evade taxes that actually evaded them (the precision) is $\frac{144}{144+5} \approx 96.64\%$.
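The precision figure follows directly from the confusion-matrix counts. A small sketch of the arithmetic, using the 144 true positives and 5 false positives quoted in the text (the helper name `precision` is illustrative):

```python
def precision(true_positives, false_positives):
    """Share of predicted positives that are actual positives: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Counts quoted in the text for the scaled 5-neighbour KNN model.
scaled_precision = precision(144, 5)
print(f"Scaled KNN precision: {scaled_precision * 100:.2f}%")
```

In the notebook itself, the same quantity could be obtained from `precision_score(y_test, knn_5_pred_scaled)`, which is already imported.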

The model with scaling performs better in this case: it has higher overall accuracy and lower FNR and FPR. A likely reason is that KNN relies on distances, so on unscaled data the predictors $x_i \in X$ with higher variation dominate the distance. Scaling helps when predictors with lower variation carry more weight in the true prediction function than those with higher variation.
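To illustrate why scaling matters for a distance-based method, the sketch below uses two hypothetical features on very different scales (the feature names and values are made up). Each column is standardized with its mean and population standard deviation, which mirrors the default behaviour of `StandardScaler`.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical firms: (money value in thousands, risk score between 0 and 1).
firms = [(100.0, 0.10), (105.0, 0.90), (300.0, 0.15), (310.0, 0.85)]
query = (110.0, 0.12)


def standardize(rows, reference):
    """Scale rows column-wise using the mean and std of the reference rows."""
    cols = list(zip(*reference))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def nearest_index(rows, point):
    """Index of the row closest to the point in Euclidean distance."""
    return min(range(len(rows)), key=lambda i: dist(rows[i], point))


raw_nearest = nearest_index(firms, query)

# The query is scaled with statistics from the reference rows only,
# in the same way the test set is transformed with the training-set scaler.
scaled_firms = standardize(firms, firms)
scaled_query = standardize([query], firms)[0]
scaled_nearest = nearest_index(scaled_firms, scaled_query)

print("Nearest neighbour on raw features:   ", firms[raw_nearest])
print("Nearest neighbour on scaled features:", firms[scaled_nearest])
```

On the raw features the large-scale column decides the neighbour and the risk score is almost ignored; after standardization both columns contribute to the distance.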

In [None]:
print(f"Total Sample Size: {len(y)}")
n = int(np.sqrt(len(y)))
print(n)
ks = list(range(1, 27, 2))  
para = {'n_neighbors': ks}
print(para)

In [None]:
# Initialize the KNN classifier
knni = KNeighborsClassifier()

# Set up 5-fold cross-validation scheme
kfcv = KFold(5, random_state=13, shuffle=True)
knn_cv = GridSearchCV(knni, para, cv=kfcv) 

# Fit the model
knn_cv.fit(X_train, y_train)
knn_cv_pred = knn_cv.predict(X_test)
knn_cv_pred_acc = accuracy_score(y_test, knn_cv_pred)

# Print the results
print("Best parameters :", knn_cv.best_params_)
print(f'Best cross validation score: {knn_cv.best_score_:.4f}')
print(f'Test accuracy score: {knn_cv_pred_acc:.4f}')

Based on 5-fold cross-validation, the model performs best when $k=1$, with a cross-validation score of 96.39% and a test accuracy of 95.88%.
This could be because the sample is small (only 775 observations), so a larger k could lead to underfitting.

In the long run, if the government relies too heavily on a KNN model with k=1, several issues may arise:  

Overfitting: The model is too flexible and may not generalize well.  

Limited Sample Size: With only 775 samples, the data may not represent the entire population.  

Bias in Detection: The model may miss tax evaders with different characteristics not captured in the training data.  
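The overfitting risk with k=1 can be seen on a small hypothetical example. With one neighbour, every training point is its own nearest neighbour, so training accuracy is perfect even when a label is noisy, and that noise carries over to nearby predictions. The 1-D data and the helper `knn_predict` below are illustrative assumptions.

```python
from collections import Counter

# Hypothetical 1-D training data: class 0 on the left, class 1 on the right,
# with one noisy label (the point at 2.0 is labelled 1).
train_x = [1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0, 7.5, 8.0]
train_y = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]


def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def train_accuracy(k):
    """Accuracy when predicting the training points themselves."""
    hits = sum(knn_predict(x, k) == y for x, y in zip(train_x, train_y))
    return hits / len(train_x)


print(f"k=1 training accuracy: {train_accuracy(1):.2f}")
print(f"k=5 training accuracy: {train_accuracy(5):.2f}")
# A new point next to the noisy observation inherits its label when k=1.
print("Prediction at x=2.1 with k=1:", knn_predict(2.1, 1))
print("Prediction at x=2.1 with k=5:", knn_predict(2.1, 5))
```

Perfect training accuracy at k=1 is therefore not evidence of good generalization; held-out or cross-validated accuracy is the relevant measure.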

---

## Insights & Next Steps

**Key Takeaways (from current results):**
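- Logistic regression: the 0.6 threshold gave higher accuracy than 0.5, and the ROC curve shows a high TPR at a low FPR across thresholds. Because false negatives are costlier here, a lower threshold is still recommended.
- KNN (k=5): scaling the features raised precision from 95.8% to 96.64% and improved overall accuracy.
- Validation: the train/test split was stratified on `Risk`, and 5-fold cross-validation selected $k=1$ (cross-validation score 96.39%, test accuracy 95.88%), which carries an overfitting risk given only 775 samples.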

**Potential Improvements (future work):**
- Try calibrated probabilities and threshold tuning for business KPIs.
- Evaluate additional models (e.g., gradient boosting, stacking) with proper cross‑validation.
- Add domain features (e.g., behavior over time, peer comparisons, network features).
- Build a lightweight pipeline (sklearn `Pipeline`) for portability.
- Draft a short model card (intended use, limitations, fairness).


