# Dowdle's Banknote Authentication Prediction
**Author:** Brittany Dowdle  
**Date:** March 31, 2025  
**Objective:** This project will demonstrate my ability to apply classification modeling techniques to a real-world dataset. I will:  
- Load and explore a dataset.
- Analyze feature distributions and consider feature selection.
- Train and evaluate a classification model.
- Compare different classification approaches.
- Document my work in a structured Jupyter Notebook.
- Conduct a peer review of a classmate’s project. 

## Introduction
This project uses the UCI Banknote Authentication Dataset to detect fake banknotes based on features such as Variance, Skewness, Kurtosis, and Entropy. The goal is to predict whether each note is genuine or fake. I will split the data into training and test sets, train a Random Forest model, evaluate its performance using key metrics, and create visualizations to interpret the results.

****

## Imports
The code cell below imports the Python libraries needed for this notebook. *All imports should be at the top of the notebook.*

In [1]:
# Import pandas for data manipulation and analysis (we might want to do more with it)
import pandas as pd
from pandas.plotting import scatter_matrix

# Import numpy for numerical operations on arrays
import numpy as np

# Import matplotlib for creating static visualizations
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# Import seaborn for statistical data visualization (built on matplotlib)
import seaborn as sns

# Import for splitting data into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

# Import classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

# Import performance metrics for model evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

****

## Section 1. Import and Inspect the Data
### 1.1 Load the dataset and display the first 10 rows.

In [7]:
# Load the dataset
df = pd.read_csv(r"C:\Projects\ml-classification-dowdle\data\data_banknote_authentication.txt", delimiter=",", header=None)

# Assign column names
df.columns = ['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Class']

# Display the first 10 rows
df.head(10)

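|   | Variance | Skewness | Kurtosis | Entropy | Class |
|---|----------|----------|----------|---------|-------|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |
| 5 | 4.3684 | 9.6718 | -3.9606 | -3.1625 | 0 |
| 6 | 3.5912 | 3.0129 | 0.72888 | 0.56421 | 0 |
| 7 | 2.0922 | -6.81 | 8.4636 | -0.60216 | 0 |
| 8 | 3.2032 | 5.7588 | -0.75345 | -0.61251 | 0 |
| 9 | 1.5356 | 9.1772 | -2.2718 | -0.73535 | 0 |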
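
The absolute Windows path in the cell above only resolves on the author's machine. Below is a minimal sketch of a more portable load. The helper name `load_banknotes` and the relative `data/` location are assumptions for illustration and are not part of the original notebook. Column names are passed through `names=` so no separate rename step is needed, and the demo reads the first three rows shown above from an in-memory buffer so it runs without the file.

```python
from io import StringIO
from pathlib import Path

import pandas as pd

COLUMNS = ["Variance", "Skewness", "Kurtosis", "Entropy", "Class"]


def load_banknotes(source):
    """Read the headerless, comma-separated banknote file into a DataFrame.

    `source` may be a path or a file-like object (hypothetical helper).
    """
    return pd.read_csv(source, header=None, names=COLUMNS)


# Assumed layout: the data file sits in a data/ folder next to the notebook.
data_path = Path("data") / "data_banknote_authentication.txt"

# Demo on the first three rows printed above, so no file is required here.
sample = StringIO(
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "3.866,-2.6383,1.9242,0.10645,0\n"
)
sample_df = load_banknotes(sample)
print(sample_df)
```

In the notebook itself, the call would be `df = load_banknotes(data_path)`, provided the working directory is the project root.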


### 1.2 Check for missing values and display summary statistics.

In [11]:
# Only the last expression in a cell is displayed automatically, so wrap earlier results in print().
print('Missing Values:')
print(df.isnull().sum(), '\n') 
print('Summary Statistics:')
print(df.describe())

Missing Values:
Variance    0
Skewness    0
Kurtosis    0
Entropy     0
Class       0
dtype: int64 

Summary Statistics:
          Variance     Skewness     Kurtosis      Entropy        Class
count  1372.000000  1372.000000  1372.000000  1372.000000  1372.000000
mean      0.433735     1.922353     1.397627    -1.191657     0.444606
std       2.842763     5.869047     4.310030     2.101013     0.497103
min      -7.042100   -13.773100    -5.286100    -8.548200     0.000000
25%      -1.773000    -1.708200    -1.574975    -2.413450     0.000000
50%       0.496180     2.319650     0.616630    -0.586650     0.000000
75%       2.821475     6.814625     3.179250     0.394810     1.000000
max       6.824800    12.951600    17.927400     2.449500     1.000000
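
Because `Class` holds only 0s and 1s, its mean in the summary is the fraction of rows labeled 1. The short check below uses only the two numbers printed above (`count` and `mean`) to recover the per-class row counts. It is a sketch of the arithmetic, not a substitute for inspecting the column directly.

```python
# Values copied from the describe() output above.
n_rows = 1372
class_mean = 0.444606

# For a 0/1 column, mean = (number of 1s) / (number of rows).
n_class_1 = round(n_rows * class_mean)
n_class_0 = n_rows - n_class_1

print(f"Class 1: {n_class_1} rows, Class 0: {n_class_0} rows")
```

This gives 610 rows of class 1 and 762 rows of class 0, which is a mild imbalance rather than a severe one.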


### Reflection 1: What do you notice about the dataset? Are there any data issues?

1. There are no missing values. The Class column is binary with a mean of 0.44. Feature ranges vary significantly: Skewness runs from about -13.8 to 13.0 and Kurtosis from about -5.3 to 17.9, so the extremes of both are far from their medians.
2. The Class column's distribution should be checked for imbalance. Because the features span different scales, scaling (such as standardization or min-max scaling) might be needed for scale-sensitive models, although tree-based models such as Random Forest are largely unaffected by feature scale. Boxplots or histograms would help confirm whether there are outliers in Skewness and Kurtosis.
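
The two follow-up checks named in the reflection can be sketched as below. The helper names `class_balance` and `iqr_outlier_counts` are hypothetical, the 1.5 × IQR fence is one common outlier convention chosen here as an assumption, and the toy frame is made up for illustration only and does not contain rows from the banknote data.

```python
import pandas as pd


def class_balance(df, target="Class"):
    """Return the share of rows in each class, sorted by label."""
    return df[target].value_counts(normalize=True).sort_index()


def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="outliers")


# Toy frame for illustration only; these are NOT rows from the banknote data.
toy = pd.DataFrame({
    "Skewness": [1.0, 2.0, 3.0, 4.0, 100.0],
    "Kurtosis": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Class": [0, 0, 0, 1, 1],
})

print(class_balance(toy))
print(iqr_outlier_counts(toy, ["Skewness", "Kurtosis"]))
```

On the notebook's data, the calls would be `class_balance(df)` and `iqr_outlier_counts(df, ['Variance', 'Skewness', 'Kurtosis', 'Entropy'])`. Applying the same fence by hand to the quartiles in the summary statistics gives an upper fence of about 10.31 for Kurtosis, below its maximum of 17.93, so at least one Kurtosis value would be flagged. The Skewness fences (about -14.49 and 19.60) contain both its minimum and its maximum.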

****