# Taxpayer Risk Classification Project

## Business Problem
Tax authorities face significant challenges in efficiently identifying taxpayers who are at high risk of non-compliance or tax evasion. Without accurate risk profiling, audits and enforcement actions may be misallocated, leading to revenue losses and wasted resources. There is a need for a data-driven approach to classify taxpayers based on their risk levels, enabling focused compliance efforts.

## Stakeholders
| Stakeholder             | Role / Interest                                                                                 |
|------------------------|------------------------------------------------------------------------------------------------|
| Tax Authority / Revenue Service | Responsible for tax collection, compliance enforcement, and overall revenue maximization.    |
| Audit Teams             | Use risk classifications to prioritize audits and investigations for efficient resource use.   |
| Policy Makers           | Use insights from the model to improve tax regulations and compliance strategies.               |
| Taxpayers               | Subject to audits and compliance monitoring; directly impacted by classification outcomes.     |
| Data Analysts / Data Scientists | Develop, validate, and maintain predictive models, providing actionable insights to stakeholders. |


## Business Objectives
1. **How can we improve audit efficiency by prioritizing audits on high-risk taxpayers to reduce costs and increase revenue recovery?**

2. **How can we develop a predictive model to categorize taxpayers by risk level (Low, Medium, High) for ongoing compliance monitoring and early intervention?**

3. **How can we proactively identify potential non-compliant taxpayers to reduce tax evasion and maximize government revenue?**

## Analysis Objectives
- **How can we build and validate a machine learning classification model that accurately predicts the risk label of taxpayers based on their financial and behavioral features?**

- **Which features (e.g., revenue, expenses, late filings) most significantly influence the taxpayer risk classification and how can these insights inform targeted policy actions?**

## Data Description
| Column Name          | Description                                                                                       | Data Type              |
|----------------------|---------------------------------------------------------------------------------------------------|------------------------|
| Taxpayer_ID          | Unique identifier for each taxpayer                                                               | Categorical (ID)       |
| Revenue              | Total revenue reported by the taxpayer over a specific period                                     | Numeric (Continuous)   |
| Expenses             | Total expenses declared by the taxpayer                                                           | Numeric (Continuous)   |
| Profit               | Net profit calculated as Revenue minus Expenses                                                   | Numeric (Continuous)   |
| Industry_Type        | Sector in which the taxpayer operates (e.g., Retail, Manufacturing)                              | Categorical            |
| Business_Age         | Number of years the business has been operating                                                   | Numeric (Discrete)     |
| Late_Filings         | Number of late tax return submissions                                                              | Numeric (Discrete)     |
| Audit_History        | Count of previous audits conducted on the taxpayer                                                | Numeric (Discrete)     |
| Payment_Delays       | Number of instances where tax payments were delayed                                               | Numeric (Discrete)     |
| Employee_Count       | Number of employees reported by the business                                                      | Numeric (Discrete)     |
| Compliance_Score     | Internal score indicating taxpayer’s level of compliance (e.g., 0–100 scale)                      | Numeric (Continuous)   |
| Region               | Geographical location or jurisdiction of the taxpayer                                             | Categorical            |
| Risk_Label           | Target variable: classification of the taxpayer’s compliance risk (Low, Medium, High)            | Categorical (Target)   |



## Prediction Target
We are predicting the **Taxpayer Risk Label** (`Risk_Label`), which assigns each taxpayer to one of three levels of non-compliance or tax-evasion risk, **Low**, **Medium**, or **High**, based on their financial data and compliance behavior. This classification helps tax authorities prioritize audits and enforcement actions effectively.
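
Because the three labels are ordered, one option is to store the target as an ordered categorical so that the Low < Medium < High ordering is explicit. The snippet below is a minimal sketch on a small illustrative sample; `RISK_ORDER` and `encode_risk` are hypothetical names, not part of the original notebook, and the integer coding is an assumption rather than the encoding the project necessarily uses.

```python
import pandas as pd

# Assumed ordering of the target classes, lowest risk first.
RISK_ORDER = ["Low", "Medium", "High"]

def encode_risk(labels: pd.Series) -> pd.Series:
    """Return integer codes (0=Low, 1=Medium, 2=High) for risk labels."""
    ordered = pd.Categorical(labels, categories=RISK_ORDER, ordered=True)
    # Labels outside RISK_ORDER receive the code -1.
    return pd.Series(ordered.codes, index=labels.index, name="Risk_Code")

# Illustrative sample of labels.
sample = pd.Series(["High", "Medium", "High", "High", "Low"], name="Risk_Label")
codes = encode_risk(sample)
print(codes.tolist())
```

Any label that is not in `RISK_ORDER` is mapped to `-1`, which makes unexpected values easy to spot before modeling.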


# Import modules & packages

In [11]:
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Statistics
from scipy import stats

# Preprocessing and model selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler  # encode categorical features, scale numeric features
from imblearn.over_sampling import SMOTE  # oversampling to address class imbalance

# Performance metrics
from sklearn.metrics import (
    accuracy_score, f1_score, recall_score, precision_score,
    confusion_matrix, roc_curve, auc, classification_report,
    ConfusionMatrixDisplay,
)

# Algorithms for supervised learning
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.linear_model import LogisticRegression

# Set the seaborn plot style to "dark"
sns.set_style("dark")

# Load Dataset

In [12]:

data = pd.read_csv(r"C:\Users\Harriet\Downloads\PHASE 3 PROJECT\data\tax_risk_dataset.csv")
data.head()

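|   | Taxpayer_ID | Revenue | Expenses | Tax_Liability | Tax_Paid | Late_Filings | Compliance_Violations | Industry | Profit | Tax_Compliance_Ratio | Audit_Findings | Audit_to_Tax_Ratio | Risk_Label |
|---|-------------|---------|----------|---------------|----------|--------------|-----------------------|----------|--------|----------------------|----------------|--------------------|------------|
| 0 | 1 | 1149014.25 | 979871.09 | 39872.33 | 28921.92 | 2 | 1 | Finance | 169143.16 | 0.73 | 0 | 0.0 | High |
| 1 | 2 | 958520.71 | 884926.74 | 47832.22 | 39396.15 | 1 | 1 | Retail | 73593.97 | 0.82 | 0 | 0.0 | Medium |
| 2 | 3 | 1194306.56 | 711926.07 | 38113.7 | 43863.94 | 4 | 0 | Manufacturing | 482380.49 | 1.15 | 3 | 0.0 | High |
| 3 | 4 | 1456908.96 | 570612.64 | 45380.58 | 66876.88 | 4 | 2 | Finance | 886296.32 | 1.47 | 1 | 0.0 | High |
| 4 | 5 | 929753.99 | 839644.66 | 21595.78 | 53565.53 | 0 | 0 | Tech | 90109.33 | 2.48 | 2 | 0.0 | Low |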
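
The preview suggests that two columns are derived from others. `Profit` matches Revenue minus Expenses, and `Tax_Compliance_Ratio` appears to equal `Tax_Paid / Tax_Liability` rounded to two decimals. The second relationship is an assumption inferred from these five rows, not something the dataset documents. A minimal sketch that checks both on the preview rows (the helper name `check_derived_columns` is hypothetical):

```python
import numpy as np
import pandas as pd

# The five preview rows shown above, restricted to the columns being checked.
preview = pd.DataFrame({
    "Revenue": [1149014.25, 958520.71, 1194306.56, 1456908.96, 929753.99],
    "Expenses": [979871.09, 884926.74, 711926.07, 570612.64, 839644.66],
    "Tax_Liability": [39872.33, 47832.22, 38113.70, 45380.58, 21595.78],
    "Tax_Paid": [28921.92, 39396.15, 43863.94, 66876.88, 53565.53],
    "Profit": [169143.16, 73593.97, 482380.49, 886296.32, 90109.33],
    "Tax_Compliance_Ratio": [0.73, 0.82, 1.15, 1.47, 2.48],
})

def check_derived_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Flag, per row, whether the derived columns match their assumed formulas."""
    profit_ok = np.isclose(df["Profit"], df["Revenue"] - df["Expenses"], atol=0.01)
    # Assumed formula: ratio = Tax_Paid / Tax_Liability, stored to two decimals.
    ratio_ok = np.isclose(
        df["Tax_Compliance_Ratio"], df["Tax_Paid"] / df["Tax_Liability"], atol=0.005
    )
    return pd.DataFrame({"profit_ok": profit_ok, "ratio_ok": ratio_ok}, index=df.index)

result = check_derived_columns(preview)
print(result.all())
```

If the same check holds on the full dataset, these columns carry no information beyond their inputs, which is worth keeping in mind when interpreting feature importance later.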


# EXPLORATORY DATA ANALYSIS

## DATA INSPECTION

In [13]:
print("\n Column names:")
print(data.columns.tolist())

print("\n Data types and non-null counts:")
data.info()  # info() prints its report directly and returns None

print("\n Summary statistics (numerical columns):")
print(data.describe())

print("\n Preview of the first 5 rows:")
print(data.head())

print("\n Number of rows and columns:")
print(data.shape)


 Column names:
['Taxpayer_ID', 'Revenue', 'Expenses', 'Tax_Liability', 'Tax_Paid', 'Late_Filings', 'Compliance_Violations', 'Industry', 'Profit', 'Tax_Compliance_Ratio', 'Audit_Findings', 'Audit_to_Tax_Ratio', 'Risk_Label']

 Data types and non-null counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Taxpayer_ID            1000 non-null   int64  
 1   Revenue                1000 non-null   float64
 2   Expenses               1000 non-null   float64
 3   Tax_Liability          1000 non-null   float64
 4   Tax_Paid               1000 non-null   float64
 5   Late_Filings           1000 non-null   int64  
 6   Compliance_Violations  1000 non-null   int64  
 7   Industry               1000 non-null   object 
 8   Profit                 1000 non-null   float64
 9   Tax_Compliance_Ratio   1000 non-null   float64
 10  Audit_

## DATA PREPARATION

### Checking for missing, duplicated and placeholder values.

We begin data cleaning by checking the dataset for missing values, duplicated rows, and placeholder values. A single helper function, `data_prep`, reports all three.
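
To illustrate what each of these checks detects, here is a minimal sketch on a small hypothetical DataFrame. The values and the names `toy` and `PLACEHOLDERS` are invented for illustration and are not drawn from the tax dataset. It counts nulls per column, fully duplicated rows, and cells whose text matches a common placeholder token.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one null, one duplicated row, two placeholder cells.
toy = pd.DataFrame({
    "Industry": ["Retail", "N/A", "Tech", "Tech", "?"],
    "Late_Filings": [1, 2, 0, 0, np.nan],
})

# Tokens treated as placeholders (compared after stripping and lowercasing).
PLACEHOLDERS = ["placeholder", "na", "n/a", "?"]

# Missing values per column.
null_counts = toy.isnull().sum()

# Rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# Cells per column whose normalised text is a placeholder token.
placeholder_counts = toy.apply(
    lambda col: col.astype(str).str.strip().str.lower().isin(PLACEHOLDERS).sum()
)

print(null_counts)
print(duplicate_rows)
print(placeholder_counts)
```

Note that a true `NaN` is caught by the null check rather than the placeholder check, which is why the two are reported separately.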

In [14]:
# Function that reports null, duplicated and placeholder values in the dataset.

def data_prep(df):
    """Print counts of missing values, duplicated rows and placeholder values per column."""
    placeholder_tokens = ['placeholder', 'na', 'n/a', '?']
    print('-------------------------Missing Values Check---------------------------------------\n')
    print(f'Number of null values in each column in the dataset:\n{df.isnull().sum()}\n')
    print('-------------------------Duplicated Values Check------------------------------------\n')
    print(f'Number of duplicated values in the dataset: {df.duplicated().sum()}\n')
    print('-------------------------Placeholder Values Check-----------------------------------\n')
    for column in df.columns:
        # Normalise each cell to lowercase text so that ' N/A ' and 'n/a' both match.
        normalised = df[column].astype(str).str.strip().str.lower()
        is_placeholder = normalised.isin(placeholder_tokens)
        placeholders = df.loc[is_placeholder, column].unique().tolist()
        placeholder_count = int(is_placeholder.sum())  # number of affected cells, not distinct values

        print(f"Column: '{column}'")
        print(f"Placeholders found: {placeholders}")
        print(f"Count of placeholders: {placeholder_count}\n")

# Checking in our dataset.
data_prep(data)

-------------------------Missing Values Check---------------------------------------

Number of null values in each column in the dataset:
Taxpayer_ID              0
Revenue                  0
Expenses                 0
Tax_Liability            0
Tax_Paid                 0
Late_Filings             0
Compliance_Violations    0
Industry                 0
Profit                   0
Tax_Compliance_Ratio     0
Audit_Findings           0
Audit_to_Tax_Ratio       0
Risk_Label               0
dtype: int64

-------------------------Duplicated Values Check------------------------------------

Number of duplicated values in the dataset: 0

-------------------------Placeholder Values Check-----------------------------------

Column: 'Taxpayer_ID'
Placeholders found: []
Count of placeholders: 0

Column: 'Revenue'
Placeholders found: []
Count of placeholders: 0

Column: 'Expenses'
Placeholders found: []
Count of placeholders: 0

Column: 'Tax_Liability'
Placeholders found: []
Count of placeholders: 0