### Task:
- Use the Fake Bills dataset to classify whether a given bill is genuine or fake using KNN.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# raw string, so the backslashes in the Windows path are not read as escape sequences
bills = pd.read_csv(r'x:REDACTED\07 - KNN Project\_bills.csv')

# from google.colab import files
# uploaded = files.upload()
# bills = pd.read_csv('_bills.csv')



In [3]:
bills.head()

,is_genuine,diagonal,height_left,height_right,margin_low,margin_up,length
0,True,171.81,104.86,104.95,4.52,2.89,112.83
1,True,171.46,103.36,103.66,3.77,2.99,113.09
2,True,172.69,104.48,103.5,4.4,2.94,113.16
3,True,171.36,103.91,103.94,3.62,3.01,113.51
4,True,171.73,104.28,103.46,4.04,3.48,112.54


In [4]:
bills.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   is_genuine    1500 non-null   bool   
 1   diagonal      1500 non-null   float64
 2   height_left   1500 non-null   float64
 3   height_right  1500 non-null   float64
 4   margin_low    1463 non-null   float64
 5   margin_up     1500 non-null   float64
 6   length        1500 non-null   float64
dtypes: bool(1), float64(6)
memory usage: 71.9 KB


In [5]:
bills.isnull().sum()

column,missing_count
is_genuine,0
diagonal,0
height_left,0
height_right,0
margin_low,37
margin_up,0
length,0


In [6]:
# keep original safe copy
df = bills.copy()

# impute margin_low with median
imputer = SimpleImputer(strategy="median")
df["margin_low"] = imputer.fit_transform(df[["margin_low"]])

`is_genuine` is boolean, so we convert it to an integer (0/1). We then split the data into train and test sets with stratification, so that class proportions are preserved in both.

In [7]:
# encode target
df["is_genuine"] = df["is_genuine"].astype(int)  # False->0, True->1

# features / target
X = df.drop(columns=["is_genuine"])
y = df["is_genuine"]

# train-test split (stratify to keep class balance), 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
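One thing to be aware of: the imputer above is fitted on the full dataset before the split, so the median it learns includes rows that later land in the test set. With only 37 missing values the effect is probably small, but a stricter variant fits the imputer on the training rows only. The sketch below shows that pattern on a small synthetic frame. The data and every `toy_` name are made up for illustration; only `SimpleImputer` and `train_test_split` are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bills data (values are made up for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "margin_low": rng.normal(4.5, 0.6, size=200),
    "length": rng.normal(112.7, 0.9, size=200),
})
# Knock out 10 values so there is something to impute.
toy.loc[toy.sample(10, random_state=0).index, "margin_low"] = np.nan
toy_y = pd.Series(rng.integers(0, 2, size=200))

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

# Fit the imputer on the training rows only, then reuse it on the test rows.
toy_imputer = SimpleImputer(strategy="median")
toy_X_train = toy_X_train.copy()
toy_X_test = toy_X_test.copy()
toy_X_train[["margin_low"]] = toy_imputer.fit_transform(toy_X_train[["margin_low"]])
toy_X_test[["margin_low"]] = toy_imputer.transform(toy_X_test[["margin_low"]])

print("Median learned from training rows:", toy_imputer.statistics_[0])
```

In the notebook itself this would mean splitting first and imputing afterwards, which may shift the reported scores slightly.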

#### Feature Scaling, so that one factor cannot dominate the predictions

In [8]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

#### For picking a `k` value
- Rule-of-thumb: $k \approx \sqrt{n}$. With $n = 1500$, $\sqrt{1500} \approx 38.73$ (≈ 39); counting only the 1200 training rows, $\sqrt{1200} \approx 34.64$ (≈ 35). It's a guideline, but often too large.

- Smaller k (e.g., 3–9) captures local structure and can work well if noise is moderate.

- Prefer choosing k by cross-validation.
- Practical pick: start with `k = 7`. It's odd (avoids ties in a two-class vote), small enough to be sensitive to local patterns, and often a good trade-off between variance and bias for moderate datasets. Treat it as a starting point and confirm it by comparing several k values with cross-validation.
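One way to compare several k values by cross-validation is sketched below. It is a minimal example, not a tuned search: the data comes from `make_classification` as a stand-in so the block runs on its own, and the names `demo_X`, `demo_y`, `k_values`, `cv_means` and `best_k`, the range of odd k from 1 to 31, and the 5 folds are all choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in so the sketch runs on its own; in the notebook,
# pass the unscaled X_train and y_train instead.
demo_X, demo_y = make_classification(
    n_samples=600, n_features=6, n_informative=4, random_state=42
)

k_values = range(1, 32, 2)  # odd values only, to avoid tied votes
cv_means = []
for k in k_values:
    # Scaling lives inside the pipeline, so each fold is scaled
    # using only the training part of that fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, demo_X, demo_y, cv=5, scoring="accuracy")
    cv_means.append(scores.mean())

best_k = list(k_values)[int(np.argmax(cv_means))]
print(f"Best k by 5-fold CV: {best_k} (mean accuracy {max(cv_means):.3f})")
```

If several k values score almost the same, picking the smallest or the one nearest the middle of that plateau are both reasonable; the difference is often within fold-to-fold noise.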

In [9]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train_scaled, y_train)

# evaluate
y_pred = knn.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")
print(classification_report(y_test, y_pred))

Accuracy: 98.3%
              precision    recall  f1-score   support

           0       0.99      0.96      0.97       100
           1       0.98      0.99      0.99       200

    accuracy                           0.98       300
   macro avg       0.98      0.98      0.98       300
weighted avg       0.98      0.98      0.98       300
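
The report above gives rates per class; a confusion matrix shows the counts behind them, which makes it easier to see how many fakes would slip through. The sketch below uses hypothetical labels (`demo_true` and `demo_pred` are made up and are not the notebook's results), with 0 = fake and 1 = genuine as in the encoding above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = fake, 1 = genuine);
# in the notebook, use y_test and y_pred instead.
demo_true = [0, 0, 0, 1, 1, 1, 1, 1]
demo_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(demo_true, demo_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"Fakes accepted as genuine: {fp}, genuine bills flagged as fake: {fn}")
```

For a counterfeit detector the first number (fakes accepted as genuine) is usually the costlier error, so it is worth tracking separately from overall accuracy.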



#### Why scale? (and other notes)
- If you don't scale, features with larger numeric ranges (e.g., `length`) dominate **Euclidean distance** and thus the neighbor choice.

- Imputation choice: The median is robust to outliers. If the missing `margin_low` values follow a pattern (for example, if they are concentrated in one class), consider a more advanced imputation method.

- Class imbalance: If your classes are imbalanced, consider using `class_weight` in other algorithms or evaluate metrics beyond accuracy (precision/recall/F1). KNN has no `class_weight`, but you can inspect confusion matrix and class-wise metrics.

- Speed: KNN stores the training set and does computation at predict time; with larger datasets consider KD-trees or approximate neighbors (or other models).
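
To make the scaling point concrete, here is a small standard-library sketch. The six (`margin_up`, `length`) pairs are made up for illustration, and `nearest_to_first` is a hypothetical helper, not part of scikit-learn. It finds the nearest neighbor of the first point twice: once on raw values and once after standardizing each column the way `StandardScaler` does (subtract the mean, divide by the population standard deviation).

```python
import math
import statistics

# Made-up (margin_up, length) pairs in mm, for illustration only.
points = [
    (2.90, 113.0),  # query bill
    (3.50, 113.1),
    (2.95, 112.0),
    (3.20, 110.0),
    (3.10, 114.0),
    (3.40, 111.5),
]

def nearest_to_first(rows):
    """Return the index of the row closest (Euclidean) to rows[0]."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Standardize each column: subtract the mean, divide by the population std.
columns = list(zip(*points))
means = [statistics.fmean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]
scaled = [
    tuple((value - m) / s for value, m, s in zip(row, means, stds))
    for row in points
]

print("Nearest neighbor on raw values:", nearest_to_first(points))
print("Nearest neighbor after scaling:", nearest_to_first(scaled))
```

On this toy data the raw nearest neighbor is index 1, which is close in `length` but far in `margin_up`; after scaling it becomes index 2, because a 0.6 mm gap in `margin_up` is large relative to that column's spread.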

### Test Run

In [10]:
# Example new data points
new_bills = pd.DataFrame([
    [171.60, 103.20, 104.80, 6.50, 3.85, 113.00],   # fake
    [171.95, 104.00, 104.10, 4.30, 3.20, 113.10]    # real
], columns=X_train.columns)

new_bills_scaled = scaler.transform(new_bills)

In [11]:
predictions = knn.predict(new_bills_scaled)

for i, pred in enumerate(predictions):
    label = "Real Bill" if pred == 1 else "Fake Bill"
    print(f"Bill {i+1}: {label}")

Bill 1: Fake Bill
Bill 2: Real Bill
