# Machine Learning - Mini Project: Tax Evasion Prediction

Name: Kelsey Xing \
Date: Feb 4, 2024



## Case Background:
The capacity of the government to collect taxes is pivotal to long-run economic growth because, without tax revenue, the state cannot provide public goods. One way to increase tax revenue is to reduce the probability of successful tax evasion; as the probability of success decreases, the incentive to cheat weakens. To do this without increasing government expenditure on audits, the government wants to raise the probability of catching tax evasion by performing fewer audits on firms that paid their taxes and more audits on firms that evaded them.

The following project applies machine learning techniques to this problem. Using a Linear Probability Model (LPM) and a k-Nearest Neighbors (KNN) classifier separately, we try to predict whether a firm has evaded taxes with a low classification error rate.

The dataset used for this project contains information on firms that the government of India suspected of tax evasion and that the Comptroller and Auditor General (CAG) of India subsequently audited. Table 1 contains the variable names and their definitions. The outcome variable (Risk) records whether the auditor found that the firm evaded taxes. The predictors include various quantitative measures about the firms.



## Data Analysis: 

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, KFold

### Data Processing

In [2]:
# Load data
audit = pd.read_csv('Data-Audit.csv')
# Check and drop N/A
audit.isnull().sum()

Sector_score     0
PARA_A           0
Risk_A           0
PARA_B           0
Risk_B           0
Money_Value      1
Risk_D           0
Score            0
Inherent_Risk    0
Audit_Risk       0
Risk             0
dtype: int64

In [3]:
audit = audit.dropna()
pd.set_option('display.max_columns', None)
display(audit.head())

Unnamed: 0,Sector_score,PARA_A,Risk_A,PARA_B,Risk_B,Money_Value,Risk_D,Score,Inherent_Risk,Audit_Risk,Risk
0,3.89,4.18,2.508,2.5,0.5,3.38,0.676,2.4,8.574,1.7148,1
1,3.89,0.0,0.0,4.83,0.966,0.94,0.188,2.0,2.554,0.5108,0
2,3.89,0.51,0.102,0.23,0.046,0.0,0.0,2.0,1.548,0.3096,0
3,3.89,0.0,0.0,10.8,6.48,11.75,7.05,4.4,17.53,3.506,1
4,3.89,0.0,0.0,0.08,0.016,0.0,0.0,2.0,1.416,0.2832,0


In [4]:
# Training/Validation set split
X = audit.drop(columns = ['Risk'])
y = audit['Risk']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=13)

In [5]:
# Merge X_train and y_train on the index so the formula API can see the outcome column
train = X_train.merge(y_train, how = 'left', left_index =True, right_index=True)

### LPM Construction & Error Rate Report

In [6]:
#Fit a LPM model
result = smf.ols(
    'Risk ~ Sector_score + PARA_A + Risk_A + PARA_B + Risk_B + Money_Value + Risk_D + Score + Inherent_Risk + Audit_Risk',
    data = train
).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                   Risk   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.632
Method:                 Least Squares   F-statistic:                     67.18
Date:                Fri, 16 Feb 2024   Prob (F-statistic):           1.97e-77
Time:                        16:48:52   Log-Likelihood:                -70.544
No. Observations:                 387   AIC:                             163.1
Df Residuals:                     376   BIC:                             206.6
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.8535      0.081    -10.548
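
An LPM is ordinary least squares applied to a 0/1 outcome, so the fitted values are read as probabilities even though nothing constrains them to lie in [0, 1]. The following is a minimal sketch of that idea on synthetic data, not the audit dataset; the names `x_demo`, `y_demo` and the data-generating process are assumptions made purely for illustration.

```python
import numpy as np

# Synthetic data (an assumption, not the audit data): one predictor, binary outcome
rng = np.random.default_rng(0)
x_demo = rng.normal(size=200)
y_demo = (x_demo + rng.normal(scale=0.5, size=200) > 0).astype(int)

# OLS with an intercept: this is all a linear probability model is
X_design = np.column_stack([np.ones_like(x_demo), x_demo])
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# Fitted values are interpreted as probabilities of y = 1
fitted = X_design @ beta

# Classify with a 0.5 threshold, as the notebook does for the audit data
labels = (fitted >= 0.5).astype(int)

# Share of fitted "probabilities" that fall outside the unit interval
share_outside = np.mean((fitted < 0) | (fitted > 1))

print(f"intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
print(f"share of fitted values outside [0, 1] = {share_outside:.2%}")
```

Because the fitted values are only used to rank firms and compare against a threshold, values below 0 or above 1 do not prevent classification, but they should not be read literally as probabilities.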

In [7]:
# Make prediction
prediction = pd.DataFrame(result.predict(X_val), columns = ['Prediction']) 
prediction

Unnamed: 0,Prediction
135,0.930844
4,0.118232
356,0.124035
413,0.539158
259,0.254490
...,...
474,0.121852
345,0.180353
360,0.360905
284,0.912360


In [8]:
risk = pd.DataFrame(y_val, columns = ['Risk'])
result = risk.merge(prediction, how = 'inner', left_index =True, right_index=True)
result

Unnamed: 0,Risk,Prediction
135,1,0.930844
4,0,0.118232
356,0,0.124035
413,1,0.539158
259,0,0.254490
...,...,...
474,0,0.121852
345,0,0.180353
360,1,0.360905
284,1,0.912360


In [9]:
# Classify firms with a predicted probability of tax evasion of at least 0.5 as evaders (input to the confusion matrix)
result['Category_a'] = np.where(result['Prediction'] >= 0.5, 1, 0)
result

Unnamed: 0,Risk,Prediction,Category_a
135,1,0.930844,1
4,0,0.118232,0
356,0,0.124035,0
413,1,0.539158,1
259,0,0.254490,0
...,...,...,...
474,0,0.121852,0
345,0,0.180353,0
360,1,0.360905,0
284,1,0.912360,1


In [10]:
#Confusion matrix
#Analyze counts
cm_ols_c = confusion_matrix(result['Risk'],result['Category_a'])
print(cm_ols_c)
#Analyze percentage
cm_ols_p = confusion_matrix(result['Risk'],result['Category_a'], normalize = 'true')
print(cm_ols_p)

[[221   8]
 [ 29 130]]
[[0.9650655  0.0349345 ]
 [0.18238994 0.81761006]]


In [11]:
# Classify firms with a predicted probability of tax evasion of at least 0.6 as evaders (input to the confusion matrix)
result['Category_b'] = np.where(result['Prediction'] >= 0.6, 1, 0)
result

Unnamed: 0,Risk,Prediction,Category_a,Category_b
135,1,0.930844,1,1
4,0,0.118232,0,0
356,0,0.124035,0,0
413,1,0.539158,1,0
259,0,0.254490,0,0
...,...,...,...,...
474,0,0.121852,0,0
345,0,0.180353,0,0
360,1,0.360905,0,0
284,1,0.912360,1,1


In [12]:
#Confusion matrix of LPM (threshold = 0.6)
#Analyze counts
cm_ols_c = confusion_matrix(result['Risk'], result['Category_b'])
print(cm_ols_c)
#Analyze percentage
cm_ols_p = confusion_matrix(result['Risk'], result['Category_b'], normalize = 'true')
print(cm_ols_p)

[[225   4]
 [ 39 120]]
[[0.98253275 0.01746725]
 [0.24528302 0.75471698]]


In [13]:
#report error rate
# Threshold = 0.5:
accuracy = accuracy_score(result['Risk'], result['Category_a'])
test_error = 1 - accuracy
print(f'For threshold = 0.5, error rate = {test_error:.2%}')

# Threshold = 0.6:
accuracy = accuracy_score(result['Risk'], result['Category_b'])
test_error = 1 - accuracy
print(f'For threshold = 0.6, error rate = {test_error:.2%}')

For threshold = 0.5, error rate = 9.54%
For threshold = 0.6, error rate = 11.08%


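Judged by the overall error rate, the LPM with threshold = 0.5 predicts more accurately (9.54% vs. 11.08%). The 0.6 threshold, however, produces fewer false positives (4 vs. 8) at the cost of more missed evaders (39 vs. 29).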

In [14]:
#proportion of the firms predicted to evade their taxes that actually evaded taxes (precision)
# Threshold = 0.5:
precision_a = precision_score(result['Risk'], result['Category_a'])
print(f'For threshold = 0.5, {precision_a:.2%} of the firms predicted to evade their taxes actually evaded taxes')

# Threshold = 0.6:
precision_b = precision_score(result['Risk'], result['Category_b'])
print(f'For threshold = 0.6, {precision_b:.2%} of the firms predicted to evade their taxes actually evaded taxes')


For threshold = 0.5, 94.20% of the firms predicted to evade their taxes actually evaded taxes
For threshold = 0.6, 96.77% of the firms predicted to evade their taxes actually evaded taxes
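
The error rates and precision figures reported for the LPM follow directly from the confusion-matrix counts printed earlier. As a cross-check, the sketch below recomputes them by hand; the helper name `error_and_precision` is hypothetical, and the layout assumed is scikit-learn's (rows are true classes, columns are predicted classes).

```python
import numpy as np

def error_and_precision(cm):
    """Return (error rate, precision) from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]]."""
    cm = np.asarray(cm)
    tn, fp, fn, tp = cm.ravel()
    error_rate = (fp + fn) / cm.sum()   # share of all firms misclassified
    precision = tp / (tp + fp)          # share of flagged firms that truly evaded
    return error_rate, precision

# Counts copied from the LPM confusion matrices above
cm_05 = [[221, 8], [29, 130]]   # threshold = 0.5
cm_06 = [[225, 4], [39, 120]]   # threshold = 0.6

err_05, prec_05 = error_and_precision(cm_05)
err_06, prec_06 = error_and_precision(cm_06)

print(f"threshold 0.5: error = {err_05:.2%}, precision = {prec_05:.2%}")
print(f"threshold 0.6: error = {err_06:.2%}, precision = {prec_06:.2%}")
```

Raising the threshold trades false positives (audits wasted on compliant firms) for false negatives (evaders left unaudited), which is why precision rises while the overall error rate also rises.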


### KNN Model Construction & Error Rate Report


#### Without data normalization

In [15]:
#fit KNN model with k = 5
knn_5 = KNeighborsClassifier(n_neighbors = 5)
knn_5.fit(X_train, y_train)
knn_5_pred = knn_5.predict(X_val)

In [16]:
#Confusion matrix of KNN
#Analyze counts
cm_KNN_c = confusion_matrix(y_val, knn_5_pred)
print(cm_KNN_c)

#Analyze percentage
cm_KNN_p = confusion_matrix(y_val, knn_5_pred, normalize = 'true')
print(cm_KNN_p)

[[226   3]
 [ 11 148]]
[[0.98689956 0.01310044]
 [0.06918239 0.93081761]]


In [17]:
#report error rate
accuracy = accuracy_score(y_val, knn_5_pred)
test_error = 1 - accuracy
print(f'KNN model with k = 5:  error rate = {test_error:.2%} , accuracy = {accuracy:.2%}')

KNN model with k = 5:  error rate = 3.61% , accuracy = 96.39%


In [18]:
#report the proportion of the firms predicted to evade their taxes actually evaded taxes
precision = precision_score(y_val, knn_5_pred) 
print(f'For KNN with k = 5, {precision:.2%} of the firms predicted to evade their taxes actually evaded taxes')

For KNN with k = 5, 98.01% of the firms predicted to evade their taxes actually evaded taxes


#### With data normalization

In [19]:
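# Normalizing data
# Note: the scaler is fit on the full dataset before the train/validation split,
# so validation rows influence the scaling statistics (a mild form of data leakage).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)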
cols = X.columns
X_final = pd.DataFrame(X_scaled, columns=cols)
display(X_final.head())

Unnamed: 0,Sector_score,PARA_A,Risk_A,PARA_B,Risk_B,Money_Value,Risk_D,Score,Inherent_Risk,Audit_Risk
0,-0.669071,0.304129,0.335827,-0.166006,-0.194273,-0.161614,-0.190146,-0.353484,-0.166753,-0.141265
1,-0.669071,-0.432005,-0.393216,-0.119482,-0.178777,-0.198271,-0.202356,-0.819385,-0.276733,-0.172402
2,-0.669071,-0.34219,-0.363566,-0.211331,-0.20937,-0.212393,-0.207059,-0.819385,-0.295112,-0.177606
3,-0.669071,-0.432005,-0.393216,-0.000278,0.004583,-0.03587,-0.030676,1.976022,-0.003134,-0.09494
4,-0.669071,-0.432005,-0.393216,-0.214326,-0.210368,-0.212393,-0.207059,-0.819385,-0.297523,-0.178289


In [20]:
# Training/Validation split
X_train_n, X_val_n, y_train_n, y_val_n = train_test_split(X_final, y, test_size=0.5, random_state=13)

#fit KNN model with k = 5
knn_5.fit(X_train_n, y_train_n)
knn_5_pred_n = knn_5.predict(X_val_n)

In [21]:
#Confusion matrix of KNN
#Analyze counts
cm_KNN_c_n = confusion_matrix(y_val_n, knn_5_pred_n)
print(cm_KNN_c_n)

#Analyze percentage
cm_KNN_p_n = confusion_matrix(y_val_n, knn_5_pred_n, normalize = 'true')
print(cm_KNN_p_n)

[[222   7]
 [ 15 144]]
[[0.96943231 0.03056769]
 [0.09433962 0.90566038]]


In [22]:
#report error rate
accuracy_n = accuracy_score(y_val_n, knn_5_pred_n)
test_error_n = 1 - accuracy_n
print(f'After normalizing, KNN model with k = 5:  error rate = {test_error_n:.2%} , accuracy = {accuracy_n:.2%}')

After normalizing, KNN model with k = 5:  error rate = 5.67% , accuracy = 94.33%


In [23]:
#report the proportion of the firms predicted to evade their taxes actually evaded taxes
precision = precision_score(y_val_n, knn_5_pred_n, zero_division=0)
print(f'After normalizing, for KNN with k = 5, {precision:.2%} of the firms predicted to evade their taxes actually evaded taxes')

After normalizing, for KNN with k = 5, 95.36% of the firms predicted to evade their taxes actually evaded taxes


#### Compare the performance of KNN model with and without data normalization

In this case, the KNN model without normalized predictors performs better. Its precision is 98.01%, higher than the 95.36% of the model with normalized predictors, and its accuracy is also higher (96.39% vs. 94.33%).

This is an unusual result, since we would generally expect a KNN model with normalized predictors to outperform one without normalization. KNN computes distances between observations, and if one feature has a much larger scale than another, it can dominate the distance calculation, so the algorithm effectively ignores the smaller-scale feature even if that feature is important for the classification.
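
To make the scale argument concrete, here is a small sketch on made-up numbers (the array `X_toy` is an assumption for illustration, not drawn from the audit data). Column 0 is measured in the thousands and column 1 in single units; the nearest neighbor of the first row changes once both columns are standardized the way StandardScaler does it (subtract the mean, divide by the population standard deviation).

```python
import numpy as np

# Hypothetical toy data: row 0 is the query point
X_toy = np.array([
    [1000.0, 0.0],   # query
    [1020.0, 6.0],   # close on the large-scale feature, far on the small-scale one
    [1080.0, 0.5],   # farther on the large-scale feature, close on the small-scale one
    [ 900.0, 3.0],   # background rows that help set each feature's scale
    [1150.0, 8.0],
])

def nearest_to_query(X):
    """Index of the row closest (Euclidean distance) to row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

# Raw features: the large-scale column dominates the distance
raw_neighbor = nearest_to_query(X_toy)

# Standardized features: both columns contribute on a comparable scale
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
std_neighbor = nearest_to_query(X_std)

print("nearest neighbor on raw features:         row", raw_neighbor)
print("nearest neighbor on standardized features: row", std_neighbor)
```

Whether the change of neighbor helps or hurts depends on which feature actually carries the signal, which is the question the audit results raise.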

Possible reasons for this result:
1. Irrelevant features. Normalization puts every feature on the same scale, so noisy or irrelevant features that previously contributed little to the distance now carry as much weight as the informative ones, which can degrade the model's performance.
2. The influence of outliers. Outliers can skew the scaling statistics. This notebook uses StandardScaler, whose mean and standard deviation are both sensitive to extreme values; with min-max scaling the effect is even stronger, because a single extreme value compresses most of the data into a very small part of the feature range. In such cases, KNN without normalization might perform better.
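
The outlier point can be illustrated with a few made-up values (an assumption for illustration only). A single extreme observation inflates the standard deviation, so after standardization the typical observations end up almost indistinguishable from one another.

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Hypothetical feature values: five typical firms, then the same five plus one outlier
typical = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(typical, 1000.0)

z_clean = standardize(typical)
z_outlier = standardize(with_outlier)[:5]   # the same five firms, scaled with the outlier present

# Range spanned by the five typical firms after scaling
spread_clean = z_clean.max() - z_clean.min()
spread_outlier = z_outlier.max() - z_outlier.min()

print(f"spread without outlier: {spread_clean:.3f}")
print(f"spread with outlier:    {spread_outlier:.3f}")
```

When the typical firms are compressed this way, the feature contributes almost nothing to the distances between them, so any information it carried is effectively lost to KNN.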

#### Find the k with the lowest classification error rate

In [24]:
# Shuffle and split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=13, shuffle=True
)

ks = list(range(1, 20))  # k = 1..19; the upper bound approximates sqrt(387) ≈ 19.7, the square root of the training-set size

para = {'n_neighbors': ks}

# Initialize the KNN classifier
knni = KNeighborsClassifier()

# Set up 5-fold cross-validation scheme
knn_cv = GridSearchCV(knni, para, cv=KFold(5, random_state=42, shuffle=True))

# Fit the model
knn_cv.fit(X_train, y_train)

print("Best parameters:", knn_cv.best_params_)
print(f'Best cross-validation score: {knn_cv.best_score_:.2%}')
print(f'Lowest error rate: {1-knn_cv.best_score_:.2%}')

Best parameters: {'n_neighbors': 3}
Best cross-validation score: 95.88%
Lowest error rate: 4.12%


With k = 3, the model (fit on the unnormalized predictors) yields the lowest 5-fold cross-validated error rate on the training set, 4.12%.
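
The search for k could also be combined with normalization without letting held-out rows influence the scaling. One way, sketched below on synthetic data from `make_classification` (an assumption standing in for the audit data), is to wrap the scaler and the classifier in a scikit-learn `Pipeline`, so that each cross-validation fold fits the scaler on its own training portion only. This is a variant for comparison, not the procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the audit data (an assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=13
)

# Scaler and classifier travel together, so scaling is re-fit inside every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Same grid of k values and the same 5-fold scheme as the cell above
grid = GridSearchCV(
    pipe,
    {"knn__n_neighbors": list(range(1, 20))},
    cv=KFold(5, shuffle=True, random_state=42),
)
grid.fit(X_tr, y_tr)

best_k = grid.best_params_["knn__n_neighbors"]
cv_error = 1 - grid.best_score_
test_error = 1 - grid.score(X_te, y_te)   # error on rows never seen during the search

print(f"best k = {best_k}")
print(f"cross-validated error = {cv_error:.2%}, held-out error = {test_error:.2%}")
```

Reporting the held-out error alongside the cross-validated one gives a check on whether the chosen k generalizes beyond the rows used to select it.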

### Problems and Risks of using KNN model to target audits in the long run

The dataset is biased. Variables such as 'Sector_score' and 'Inherent_Risk', which record the historical risk scores of sectors and firms, do not represent the entire population of firms, only the firms that were previously selected for audits. For firms that have never been audited, the historical risk scores would be biased towards 0. Training data for future years is therefore not fully trustworthy: a model trained on a biased sample can become biased towards targeting previously audited firms while ignoring other types of firms that might be non-compliant.

