# Banknote Authentication — Midterm Classification Project
**Author:** Deb St. Cyr  
**Date:** November 11, 2025  

## Introduction
This project predicts whether a banknote is authentic (target: `class`) using four numerical features extracted from wavelet-transformed images: `variance`, `skewness`, `curtosis`, and `entropy`.  
I compare baseline and improved classifiers, quantify performance (accuracy, precision, recall, F1), and visualize results.
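
As a reference for how the four metrics named above relate to one another, here is a minimal sketch that derives them from confusion-matrix counts. The helper name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only (not taken from this notebook's models)
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In the notebook itself, the `sklearn.metrics` functions imported in Section 0 compute the same quantities directly from label arrays.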

**Dataset:** UCI Banknote Authentication (placed locally at `data/banknote_authentication.txt`)

**Repro Steps**
1) `python -m venv .venv`, then activate it with `source .venv/Scripts/activate` (Windows, Git Bash) or `source .venv/bin/activate` (macOS/Linux)  
2) `pip install -r requirements.txt`  
3) Open this notebook and run all cells (Kernel → Restart & Run All)
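
Before running all cells, it can help to confirm that the environment from step 2 actually contains what the notebook imports. The sketch below is one possible check, not part of the original workflow: the helper `missing_packages` is hypothetical, and the `REQUIRED` list is inferred from the imports cell in this notebook rather than from `requirements.txt`, whose contents are not shown here.

```python
import importlib.util

# Top-level import names used by the imports cell in this notebook.
# Assumption: requirements.txt covers these; note that the pip name for
# `sklearn` is `scikit-learn`.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```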


In [1]:
# Section 0. Imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Display and plot defaults
pd.set_option("display.max_columns", 50)
sns.set_theme()


## 1. Import and Inspect the Data

**1.1** Load the dataset and display the first five rows.  
**1.2** Check for missing values and display summary statistics.

**Reflection 1.** What do you notice about the dataset? Any data issues?


In [3]:
# 1.1 Load

CSV_PATH = "data/banknote_authentication.txt"  # adjust if running from repo root

# The file has no header row, so supply the column names (UCI reference) at load time
COLUMNS = ["variance", "skewness", "curtosis", "entropy", "class"]
df = pd.read_csv(CSV_PATH, header=None, names=COLUMNS, encoding="utf-8")

# read_csv raises UnicodeDecodeError on invalid UTF-8, so reaching this line means decoding succeeded
print("Encoding check complete (UTF-8 expected).")

# Preview first 5 rows
display(df.head())

# Shape and summary
print(f"Shape: {df.shape}")
print("\nMissing values per column:")
print(df.isna().sum())

# Quick distribution of target classes
print("\nTarget variable counts:")
print(df['class'].value_counts())


Encoding check complete (UTF-8 expected).


| | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


Shape: (1372, 5)

Missing values per column:
variance    0
skewness    0
curtosis    0
entropy     0
class       0
dtype: int64

Target variable counts:
class
0    762
1    610
Name: count, dtype: int64
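
The missing-value counts above cover only one kind of data issue. As a hedged sketch of a slightly broader check, the snippet below also counts exact duplicate rows and unexpected labels. The helper `quick_checks` is hypothetical, and the sample reuses rows from the preview above with the first row repeated on purpose, so the duplicate it reports is illustrative and says nothing about the real file.

```python
import io

import pandas as pd

# Rows copied from the preview above; the first row is repeated deliberately
# so the duplicate check has something to find (illustrative only).
SAMPLE = """\
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.6216,8.6661,-2.8073,-0.44699,0
"""

COLUMNS = ["variance", "skewness", "curtosis", "entropy", "class"]


def quick_checks(frame):
    """Return simple data-quality counts for a banknote-style DataFrame."""
    return {
        "n_rows": len(frame),
        "n_missing": int(frame.isna().sum().sum()),
        "n_duplicate_rows": int(frame.duplicated().sum()),
        "unexpected_labels": sorted(set(frame["class"]) - {0, 1}),
    }


sample_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)
report = quick_checks(sample_df)
print(report)
```

Calling `quick_checks(df)` on the full DataFrame loaded in this section would give the corresponding counts for the real data.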


In [4]:
# 1.2 Missing + summary
display(df.isna().sum())
display(df.describe().T)

# Target distribution quick look
df['class'].value_counts(normalize=True).rename('proportion')


variance    0
skewness    0
curtosis    0
entropy     0
class       0
dtype: int64

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| variance | 1372.0 | 0.433735 | 2.842763 | -7.0421 | -1.773 | 0.49618 | 2.821475 | 6.8248 |
| skewness | 1372.0 | 1.922353 | 5.869047 | -13.7731 | -1.7082 | 2.31965 | 6.814625 | 12.9516 |
| curtosis | 1372.0 | 1.397627 | 4.31003 | -5.2861 | -1.574975 | 0.61663 | 3.17925 | 17.9274 |
| entropy | 1372.0 | -1.191657 | 2.101013 | -8.5482 | -2.41345 | -0.58665 | 0.39481 | 2.4495 |
| class | 1372.0 | 0.444606 | 0.497103 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


class
0    0.555394
1    0.444606
Name: proportion, dtype: float64
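
These proportions also fix a floor for any classifier: always predicting the majority class is correct for exactly the majority share of rows. The short sketch below works that out from the class counts reported earlier (762 and 610). The variable names are illustrative, and the notebook's own baseline model may differ from this trivial one.

```python
# Class counts copied from the value_counts() output in Section 1.1
class_counts = {0: 762, 1: 610}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = class_counts[majority_label] / total

print(f"Total rows: {total}")
print(f"Majority class: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model evaluated later should be judged against this figure rather than against 50%, since the two classes are not perfectly balanced.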