<a href="https://colab.research.google.com/github/KwakuBonfulBosompim/MSc-Data-Analytics-and-ML-Projects/blob/main/Bill_Authentication_using_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project detects whether banknotes are real or fake using a machine learning algorithm called XGBoost.

XGBoost (eXtreme Gradient Boosting) works like a team of tiny decision-making robots (trees). Each robot is trained to fix the mistakes left by the previous ones, and the final prediction combines the answers of the whole team.
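
The idea of each tree correcting the previous ones can be sketched with a toy example. This is not XGBoost's actual implementation (which also uses second-order gradient information and regularization). It is a minimal illustration under simplifying assumptions: one feature, squared-error loss, and trees with a single split (stumps). The helper names `fit_stump`, `predict_stump`, and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits the residuals."""
    best = None
    # Skip the largest value so the right side of the split is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        # Squared error if each side predicts its own mean
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    prediction = np.full(len(y), y.mean())
    stumps, errors = [], []
    for _ in range(n_rounds):
        residual = y - prediction              # mistakes of the ensemble so far
        stump = fit_stump(x, residual)         # new "robot" learns those mistakes
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return stumps, errors

# Made-up data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x)

stumps, errors = boost(x, y)
print("training error after round 1: ", errors[0])
print("training error after round 20:", errors[-1])
```

The training error shrinks as rounds are added, which is why `n_estimators` and `learning_rate` are the first parameters people tune.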

This model:
Is very fast on big data,
Handles missing values automatically,
Can limit overfitting (memorizing the training data instead of learning patterns),
Is often very accurate, and has won many competitions


Now we are going to use a dataset named bill_authentication.csv to implement the XGBoost model


CSV file with features: Variance, Skewness, Curtosis, Entropy
And the target: Class (0 or 1)

Training set → teach the model
Test set → check if the model learned correctly


Tune Parameters ⚙️
You can adjust how many trees are built (n_estimators), how deep each tree is (max_depth), or the learning rate (learning_rate) to improve the model

First we import the libraries and load the dataset

In [1]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('bill_authentication.csv')

# Preview the first five rows of the dataset
data.head()

Unnamed: 0,Variance,Skewness,Curtosis,Entropy,Class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [4]:
data.info()

In [7]:
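# Note: [:, :1] keeps only the first column (Variance); data.iloc[:, :-1] would keep all four features
X = data.iloc[:,:1].values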
X[:5]

array([[3.6216 ],
       [4.5459 ],
       [3.866  ],
       [3.4566 ],
       [0.32924]])
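
The output above has a single column because `iloc[:, :1]` keeps only the first column. A small sketch of the difference, using the five rows shown by `data.head()` (the name `sample` is introduced here for illustration only):

```python
import pandas as pd

# The five rows shown by data.head() above
sample = pd.DataFrame({
    "Variance": [3.6216, 4.5459, 3.866, 3.4566, 0.32924],
    "Skewness": [8.6661, 8.1674, -2.6383, 9.5228, -4.4552],
    "Curtosis": [-2.8073, -2.4586, 1.9242, -4.0112, 4.5718],
    "Entropy": [-0.44699, -1.4621, 0.10645, -3.5944, -0.9888],
    "Class": [0, 0, 0, 0, 0],
})

first_column_only = sample.iloc[:, :1].values   # Variance only
all_features = sample.iloc[:, :-1].values       # every column except Class

print(first_column_only.shape)  # (5, 1)
print(all_features.shape)       # (5, 4)
```

Training on all four features would most likely change every result further down, so the numbers reported in this notebook apply to the single-feature model only.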

In [10]:
y = data.iloc[:, -1].values
y[:17]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Now we split the data

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
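
The split above is random, so the numbers change on every run. One possible variant, shown here on made-up labels rather than the banknote data, fixes `random_state` for reproducibility and uses `stratify` to keep the class proportions the same in both sets. The arrays `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 60 of class 0 and 40 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # same split on every run
    stratify=y_demo,    # keep the 60/40 class ratio in both sets
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr), np.bincount(y_te))
```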

We import the XGBoost library

It can be used for classification, regression, and other tasks such as ranking


Our project is a classification problem:
whether a banknote is forged or not

In [12]:
from xgboost import XGBClassifier

Now we build the model and specify its parameters
n_estimators : Optional[int]
    Number of boosting rounds.
max_depth : Optional[int]
    Maximum tree depth for base learners.

In [15]:
bst = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.5, objective='binary:logistic')

Now we train the model (fit it to the training data)

In [16]:
bst.fit(X_train, y_train)

Now we make predictions on the test set

In [17]:
y_pred = bst.predict(X_test)

In [19]:
y_pred[:5]

array([0, 0, 1, 0, 0])

Now we check how accurately the model predicted the target values on the test data, which it did not see during training

In [20]:
from sklearn.metrics import confusion_matrix, classification_report

In [21]:
print(confusion_matrix(y_test, y_pred))

[[142  24]
 [ 28  81]]


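Rows are the actual classes and columns are the predicted classes, both in the order 0, 1. Reading class 0 as real bills and treating it as the positive class, the layout is:

[[ True Positive, False Negative ],
 [ False Positive, True Negative ]]

142 → Real bills correctly detected ✅

24 → Real bills incorrectly labeled as fake ❌

28 → Fake bills incorrectly labeled as real ❌ (this is risky!)

81 → Fake bills correctly detected ✅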
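
The four counts can be turned into summary metrics. The arithmetic below uses only the numbers from the confusion matrix above. Treating fake bills (class 1) as the class of interest is a choice made here for illustration.

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted)
real_as_real, real_as_fake = 142, 24
fake_as_real, fake_as_fake = 28, 81

total = real_as_real + real_as_fake + fake_as_real + fake_as_fake

# Share of all bills that were classified correctly
accuracy = (real_as_real + fake_as_fake) / total

# Of the bills that really are fake, how many were caught?
recall_fake = fake_as_fake / (fake_as_real + fake_as_fake)

# Of the bills flagged as fake, how many really are fake?
precision_fake = fake_as_fake / (real_as_fake + fake_as_fake)

print(f"accuracy         = {accuracy:.3f}")        # 0.811
print(f"recall (fake)    = {recall_fake:.3f}")     # 0.743
print(f"precision (fake) = {precision_fake:.3f}")  # 0.771
```

In this run roughly one in four fake bills was labeled as real, which is the error that matters most for this task.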

**Now we try to optimize the model by searching for better parameter values**


**Optimization is like tuning your robot’s “settings or configuration”**

Four common approaches

**Manual tuning 🎛️**
You change one parameter at a time and see if accuracy improves.
Example: Increase n_estimators from 100 → 200 → 300 and check results.
Simple, but slow

**Grid Search 🔍**
You define a grid of candidate values for each parameter.
The computer tries all combinations and picks the one with the best cross-validated score.
Thorough, but can take a long time with many parameters


**Randomized Search 🎲**
Like Grid Search, but instead of trying all combinations, it samples a fixed number of them at random.
Faster when the grid is large, and still usually finds good settings
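
A randomized search could look like the sketch below. It runs on synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in so that it works without XGBoost installed. `XGBClassifier` could be passed in its place, since it follows the same estimator interface. The names `X_demo`, `y_demo`, and the candidate values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with four features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# 3 x 4 x 3 = 36 possible combinations
param_distributions = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.05, 0.1, 0.3, 0.5],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # try only 5 of the 36 combinations
    cv=3,
    random_state=0,    # make the random sample repeatable
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```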

**Bayesian Optimization (advanced)**
The computer learns which settings are better over time instead of trying everything.
Very efficient, but more advanced

In [22]:
from sklearn.model_selection import GridSearchCV

In [23]:
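# GridSearchCV with XGBoost

# Note: 250 is listed twice under n_estimators, so only 6 of the 9 candidates are distinct
param_grid = {'n_estimators': [150, 250, 250],
              'learning_rate': [0.1, 0.5, 1]}
# Total candidates = 3 × 3 = 9
# With 5-fold cross-validation (cv=5):
# Total fits = 9 × 5 = 45, plus one final refit of the best candidate on all training data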
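
The counts for the grid in the cell above can be checked with a few lines of plain Python. This only enumerates the combinations and does not train anything. GridSearchCV may visit them in a different order.

```python
from itertools import product

param_grid = {'n_estimators': [150, 250, 250],
              'learning_rate': [0.1, 0.5, 1]}

# Every (n_estimators, learning_rate) pair the grid search will try
candidates = list(product(param_grid['n_estimators'], param_grid['learning_rate']))
distinct = set(candidates)

cv = 5
print(len(candidates))        # 9 candidates
print(len(distinct))          # 6 distinct settings, because 250 appears twice
print(len(candidates) * cv)   # 45 cross-validation fits
```

Replacing the repeated 250 with a third value would make all nine candidates distinct.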

In [24]:
# Now we create the grid search object (XGBClassifier() uses default values for all other parameters)
mygrid = GridSearchCV(XGBClassifier(), param_grid, cv=5)

In [25]:
mygrid.fit(X_train, y_train)

In [26]:
mygrid.best_params_

{'learning_rate': 0.1, 'n_estimators': 150}

In [27]:
mygrid.best_score_

np.float64(0.8359402241594023)
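
`best_score_` is the mean cross-validated score on the training data, not a score on the held-out test set. A sketch of checking a tuned model on test data follows. It uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in, so the names `X_demo` and `y_demo` and the numbers it prints are not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with four features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [25, 50], "learning_rate": [0.1, 0.5]},
    cv=5,
)
grid.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
test_accuracy = accuracy_score(y_te, grid.best_estimator_.predict(X_te))

print("mean CV accuracy:", grid.best_score_)
print("test accuracy:   ", test_accuracy)
```

In this notebook the equivalent check would be `mygrid.best_estimator_.predict(X_test)` followed by the same confusion matrix as before.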

Now we build a full summary of the cross-validation results

All results are summarized in this table

In [28]:
mygrid_summary = pd.DataFrame(mygrid.cv_results_)

In [29]:
mygrid_summary.info()

In [30]:
mygrid_summary['param_learning_rate']

Unnamed: 0,param_learning_rate
0,0.1
1,0.1
2,0.1
3,0.5
4,0.5
5,0.5
6,1.0
7,1.0
8,1.0


Rank of each parameter combination (1 = best)

In [31]:
mygrid_summary['rank_test_score']

Unnamed: 0,rank_test_score
0,1
1,2
2,2
3,7
4,8
5,8
6,6
7,4
8,4


Mean test score of each combination

In [32]:
mean_test_score = mygrid_summary['mean_test_score']

Scores per fold

In [33]:
# Test score of each candidate on each cross-validation fold
mygrid_summary.iloc[:, 7:12]

Unnamed: 0,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score
0,0.831818,0.813636,0.808219,0.858447,0.86758
1,0.827273,0.813636,0.808219,0.86758,0.858447
2,0.827273,0.813636,0.808219,0.86758,0.858447
3,0.822727,0.809091,0.808219,0.853881,0.863014
4,0.822727,0.809091,0.812785,0.844749,0.863014
5,0.822727,0.809091,0.812785,0.844749,0.863014
6,0.831818,0.818182,0.803653,0.853881,0.863014
7,0.831818,0.813636,0.808219,0.853881,0.863014
8,0.831818,0.813636,0.808219,0.853881,0.863014
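
Averaging each row of the table above gives the mean test score per candidate, which is how the ranking is produced. The sketch below redoes that arithmetic with the values copied from the table. The variable names are introduced here for illustration.

```python
import numpy as np

# Per-fold test scores copied from the table above (one row per candidate)
fold_scores = np.array([
    [0.831818, 0.813636, 0.808219, 0.858447, 0.86758],
    [0.827273, 0.813636, 0.808219, 0.86758, 0.858447],
    [0.827273, 0.813636, 0.808219, 0.86758, 0.858447],
    [0.822727, 0.809091, 0.808219, 0.853881, 0.863014],
    [0.822727, 0.809091, 0.812785, 0.844749, 0.863014],
    [0.822727, 0.809091, 0.812785, 0.844749, 0.863014],
    [0.831818, 0.818182, 0.803653, 0.853881, 0.863014],
    [0.831818, 0.813636, 0.808219, 0.853881, 0.863014],
    [0.831818, 0.813636, 0.808219, 0.853881, 0.863014],
])

mean_per_candidate = fold_scores.mean(axis=1)   # average over the 5 folds
best_row = int(mean_per_candidate.argmax())     # candidate with the highest mean
spread = mean_per_candidate.max() - mean_per_candidate.min()

print(best_row)                      # 0
print(mean_per_candidate.round(4))
print(round(float(spread), 4))
```

Row 0 has the highest mean, about 0.836, which agrees with `best_score_`. The gap between the best and worst candidate is under 0.01, so in this run the choice of parameters changed the score only slightly.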


Impact of the Project 🌟

Helps banks and businesses detect fake bills quickly.

Reduces human errors in bill authentication.

Shows how machine learning can automate important security tasks

Insights Achieved 🔍

XGBoost is fast, accurate, and handles missing data automatically.

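Grid Search selected learning_rate=0.1 and n_estimators=150, with a mean cross-validated accuracy of about 0.836; in this run the scores differed only slightly between parameter combinations.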

The confusion matrix shows which kinds of mistakes the model makes, which helps decide what to improve.

Even a simple dataset can be used to train a powerful model for real-world problems.

What We Achieved 🏁

Built an XGBoost model to detect real vs fake bills.

Tuned parameters using Grid Search with 5-fold cross-validation.

Successfully demonstrated machine learning in bill authentication.

Created a clear workflow from dataset → model → evaluation → optimization.