In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
df = pd.read_csv("D:/Projects/Naive bayes tax evasion/Tax_Evasion.csv", index_col = 0)

This dataset has records of tax evasion from an XYZ County. Our goal is to build an ML model that successfully classifies, based on the given features, whether a new entrant in the database is likely to evade taxes or not.

Since our dataset comprises both continuous and categorical (which will be converted to discrete) features, it is suitable that we use a Mixed Naive Bayes algorithm.

In [2]:
df.head()

Tid,Refund,Marital Status,Taxable Income,Evasion
1,Yes,Single,125000,No
2,No,Married,100000,No
3,No,Single,70000,No
4,Yes,Married,120000,No
5,No,Divorced,95000,Yes


Separating the target variable from the independent variables

In [3]:
# separate the features and the target variable
X = df.drop('Evasion', axis=1)
y = df['Evasion']

Since Naive Bayes classifiers require numeric entries, we convert our categorical variables to numeric by label encoding.

In [4]:
le = LabelEncoder()
X['Refund'] = le.fit_transform(X['Refund'])
X['Marital Status'] = le.fit_transform(X['Marital Status'])
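LabelEncoder assigns integer codes in the sorted order of the labels, and a single encoder only remembers the last column it was fitted on. A minimal sketch of one possible variant, keeping one encoder per column so the code-to-label mapping stays available, is shown below. The rows in `toy` and the `encoders` dictionary are illustrative and are not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative rows mirroring the first records shown by df.head().
toy = pd.DataFrame({
    "Refund": ["Yes", "No", "No", "Yes", "No"],
    "Marital Status": ["Single", "Married", "Single", "Married", "Divorced"],
})

# One encoder per column, so each mapping can be inspected or inverted later.
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_[i] is the original label that was mapped to the integer i.
refund_classes = list(encoders["Refund"].classes_)
marital_classes = list(encoders["Marital Status"].classes_)
print(refund_classes, marital_classes)
print(toy)
```

Note that the codes for Marital Status carry an arbitrary order (alphabetical), which a model may interpret as a numeric scale.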

In [5]:
print(f"The type of Refund is: {X['Refund'].dtype}, The type of Marital Status is: {X['Marital Status'].dtype} and the type of y is: {y.dtype}")

The type of Refund is: int32, The type of Marital Status is: int32 and the type of y is: object


Standardizing the continuous feature

In [6]:
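# Only instantiate the scaler here; it is fitted inside the pipeline on the
# training split, which avoids scaling the column twice and leaking test-set statistics.
sc = StandardScaler()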

Creating the column transformer to apply different preprocessing on different columns.

In our case we apply standardization to the Taxable Income column, as it is a continuous variable, and pass the two categorical variables through unchanged.

The two transformers defined are 'num' and 'cat'. The 'num' transformer uses StandardScaler, which standardizes the numerical column to have mean 0 and standard deviation 1, while the 'cat' columns are passed through without any transformation.

In [7]:
ct = ColumnTransformer(transformers=[('num', sc, ['Taxable Income']),
                                     ('cat', 'passthrough', ['Refund', 'Marital Status'])])
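To see what such a transformer returns, the sketch below applies the same two-branch layout to four made-up rows. The names `toy` and `toy_ct` and the values are illustrative. The output columns follow the order of the transformers list, so the scaled income comes first, followed by the untouched categorical codes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative, already label-encoded rows.
toy = pd.DataFrame({
    "Refund": [1, 0, 0, 1],
    "Marital Status": [2, 1, 2, 1],
    "Taxable Income": [125000, 100000, 70000, 120000],
})

toy_ct = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Taxable Income"]),
    ("cat", "passthrough", ["Refund", "Marital Status"]),
])

out = toy_ct.fit_transform(toy)  # NumPy array: [scaled income, Refund, Marital Status]

income_mean = out[:, 0].mean()   # close to 0 after standardization
income_std = out[:, 0].std()     # close to 1 (population standard deviation)
print(out)
```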

Creating the Mixed Naive Bayes pipeline to sequentially apply the preprocessing and the estimator.

In our case the preprocessor is the ct object defined above, which standardizes the data, and the estimator is GaussianNB(), which models every column it receives, including the label-encoded ones, as Gaussian.

In [8]:
pipeline = Pipeline(steps=[('preprocessor', ct),
                           ('classifier', GaussianNB())])
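If the categorical columns should be modelled with a categorical likelihood rather than a Gaussian one, one possible approach (not the one used in this notebook) is to fit GaussianNB on the continuous column and CategoricalNB on the encoded columns, then combine their log posteriors. The toy arrays and the names `gnb`, `cnb` and `joint` below are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Hypothetical toy data: one continuous column and two label-encoded columns.
X_num = np.array([[125.0], [100.0], [70.0], [120.0], [95.0], [60.0], [220.0], [85.0]])
X_cat = np.array([[1, 2], [0, 1], [0, 2], [1, 1], [0, 0], [0, 1], [1, 0], [0, 2]])
y_toy = np.array(["No", "No", "No", "No", "Yes", "No", "No", "Yes"])

gnb = GaussianNB().fit(X_num, y_toy)     # Gaussian likelihood for the continuous part
cnb = CategoricalNB().fit(X_cat, y_toy)  # categorical likelihood for the encoded part

# Each model returns log P(y | its own features). Adding them counts the class
# prior twice, so one copy of the log prior is subtracted. The result is the
# joint log posterior up to a per-row constant, which is enough for argmax.
joint = (gnb.predict_log_proba(X_num)
         + cnb.predict_log_proba(X_cat)
         - np.log(gnb.class_prior_))

mixed_pred = gnb.classes_[np.argmax(joint, axis=1)]
print(mixed_pred)
```

This relies on the naive independence assumption between the two feature groups given the class, which is the same assumption Naive Bayes already makes within each group.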

Split the data into training and testing sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
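With a test set this small, a purely random split can leave very few positive cases to evaluate on. One option, not used in the cell above, is to pass stratify so that both splits keep the class proportions of the full data. The sketch below uses made-up labels (30 "No", 20 "Yes"); the real class balance in Tax_Evasion.csv may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 "No" and 20 "Yes" (not taken from Tax_Evasion.csv).
y_toy = np.array(["No"] * 30 + ["Yes"] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Count each class in the test split.
labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels.tolist(), counts.tolist()))
print(test_counts)
```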

Fitting the Mixed Naive Bayes model and making predictions

In [10]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Lastly, we would like to see how well our classifier is performing.

In [11]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          No       0.69      1.00      0.82         9
         Yes       1.00      0.33      0.50         6

    accuracy                           0.73        15
   macro avg       0.85      0.67      0.66        15
weighted avg       0.82      0.73      0.69        15



This classification report presents the performance evaluation of a binary classification model on a test set of 15 instances. The model predicts two classes, "Yes" and "No," and the report shows performance metrics for each class as well as overall metrics.

The precision for the "No" class is 0.69, which means that out of all instances predicted as "No," 69% of them were actually "No." The recall for the "No" class is 1.0, which means that out of all instances that are actually "No," the model correctly predicted all of them.

For the "Yes" class, the precision is 1.0, which means that out of all instances predicted as "Yes," all of them were actually "Yes." The recall for the "Yes" class is 0.33, which means that out of all instances that are actually "Yes," only 33% of them were correctly predicted by the model.

The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. The F1-score for the "No" class is 0.82, and for the "Yes" class, it is 0.5.

The accuracy of the model on the test set is 0.73, which means that it correctly predicted 73% of instances (11 of 15). The macro average of precision, recall, and F1-score is the unweighted mean of each metric across both classes. The macro average precision, recall, and F1-score are 0.85, 0.67, and 0.66, respectively. The weighted average instead weights each class's metric by its support (9 for "No", 6 for "Yes"), which accounts for the imbalance in class sizes; the weighted average precision, recall, and F1-score are 0.82, 0.73, and 0.69, respectively.
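As a cross-check, the per-class counts implied by the report (all 9 "No" instances predicted "No"; 2 of the 6 "Yes" instances predicted "Yes") can be fed back into scikit-learn's metric functions. The label lists below are reconstructed from those counts and are not the actual test rows.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Labels reconstructed from the report's support and recall values (an inference).
y_true = ["No"] * 9 + ["Yes"] * 6
y_hat = ["No"] * 9 + ["Yes"] * 2 + ["No"] * 4

p_no = precision_score(y_true, y_hat, pos_label="No")   # 9 correct out of 13 predicted "No"
r_yes = recall_score(y_true, y_hat, pos_label="Yes")    # 2 found out of 6 actual "Yes"
acc = accuracy_score(y_true, y_hat)                     # 11 correct out of 15
f1_macro = f1_score(y_true, y_hat, average="macro")        # unweighted mean over classes
f1_weighted = f1_score(y_true, y_hat, average="weighted")  # mean weighted by support

print(round(p_no, 2), round(r_yes, 2), round(acc, 2),
      round(f1_macro, 2), round(f1_weighted, 2))
```

The rounded values agree with the report, which suggests the confusion counts above are a consistent reading of it.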