# Assignment 5 – Naïve Bayes Classification
## Breast‑Cancer Wisconsin Dataset

Please enter the last name, first name, matriculation number and study program below.  
Every member of the group has to be listed:

- *Name: , First Name: , matr. number: , study program:.*
- *Name: , First Name: , matr. number: , study program:.*
- *Name: , First Name: , matr. number: , study program:.*

### Background
The **Breast‑Cancer Wisconsin (Diagnostic)** dataset contains 30 real‑valued
features computed from digitised images of a fine‑needle aspirate (FNA) of a breast mass.  
Each instance is labelled **malignant** (`0`) or **benign** (`1`).  Reliable automated
diagnosis can help clinicians decide whether further invasive procedures are
necessary. The following pages describe the dataset in more detail:

1. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
2. https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

In this assignment you will:
1. perform a short exploratory analysis;
2. implement **Gaussian Naïve Bayes (GNB) from scratch**;
3. evaluate it with accuracy, precision, recall and *macro* $F_1$;
4. compare it to scikit‑learn’s reference implementation.
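
The metrics named in step 3 can all be derived from a confusion matrix. The block below is a minimal sketch of that derivation, not part of the required solution. The helper name `macro_f1_from_confusion` and the counts in `toy_cm` are hypothetical and chosen for illustration only. It assumes the row/column layout used by `sklearn.metrics.confusion_matrix` (rows are true classes, columns are predicted classes).

```python
import numpy as np

def macro_f1_from_confusion(cm):
    """Accuracy, per-class precision/recall/F1 and macro F1 from a confusion matrix.

    cm[i, j] counts samples of true class i that were predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm).copy()        # correctly classified samples per class
    predicted = cm.sum(axis=0)     # column sums: how often each class was predicted
    actual = cm.sum(axis=1)        # row sums: how often each class truly occurs

    # Classes that were never predicted (or never occur) get a score of 0
    # instead of triggering a division by zero.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    accuracy = tp.sum() / cm.sum()
    macro_f1 = f1.mean()           # unweighted mean over classes
    return accuracy, precision, recall, f1, macro_f1

# Hypothetical counts, for illustration only (not results from this dataset).
toy_cm = [[8, 2],
          [1, 9]]
acc, prec, rec, f1, macro = macro_f1_from_confusion(toy_cm)
print(f"accuracy={acc:.3f}  macro F1={macro:.3f}")
```

Macro averaging weights both classes equally, regardless of how many samples each class has, which is why it is a useful complement to accuracy on an imbalanced dataset.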

In [None]:
# === imports (please don't import sklearn GaussianNB yet, you will use it later when told to do so) ===
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report


In [None]:
# === load data ===
data = load_breast_cancer()
X_full = pd.DataFrame(data.data, columns=data.feature_names)
y_full = pd.Series(data.target, name='target')  # 0 malignant, 1 benign
print('Shape:', X_full.shape)
print('\nClass distribution (0 malignant / 1 benign):')
print(y_full.value_counts())

# quick statistics
X_full.describe().T.head()

In [None]:
# === train / test split ===
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_full, y_full.values, test_size=0.2, stratify=y_full, random_state=42)

# === standardise ===
scaler = StandardScaler().fit(X_train_raw)  # fit ONLY on train
X_train = scaler.transform(X_train_raw)
X_test  = scaler.transform(X_test_raw)
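
What "fit only on train" means can be sketched with plain NumPy. The block below is an illustration on synthetic data, not a replacement for `StandardScaler`; the arrays `toy_train` and `toy_test` are hypothetical. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for a train and a test split (illustration only).
toy_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Statistics are estimated on the training split only ...
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)          # ddof=0, i.e. population standard deviation

# ... and then applied unchanged to both splits.
toy_train_std = (toy_train - mu) / sigma
toy_test_std = (toy_test - mu) / sigma

print("train mean:", toy_train_std.mean(axis=0).round(6))
print("train std: ", toy_train_std.std(axis=0).round(6))
print("test mean: ", toy_test_std.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 by construction. The test columns are only approximately standardised, which is expected: the test set must not influence the statistics used for preprocessing.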

In [None]:
# === 2‑D PCA for visualisation ===
pca2 = PCA(n_components=2, random_state=42).fit(X_train)
X_pca_full = pca2.transform(scaler.transform(X_full))
plt.figure(figsize=(6, 4))
# with cmap='bwr', class 0 (malignant) is drawn in blue and class 1 (benign) in red
plt.scatter(X_pca_full[:, 0], X_pca_full[:, 1], c=y_full, alpha=0.5, cmap='bwr')
plt.title('PCA projection')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
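
For intuition, a two-component PCA projection can be sketched with a singular value decomposition of the centred data. This is a simplified illustration on synthetic data (the array `toy` is hypothetical). The signs of the components may differ from what `sklearn.decomposition.PCA` returns, since the direction of each principal axis is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with correlated features (illustration only).
toy = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

centred = toy - toy.mean(axis=0)                 # PCA operates on centred data
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

components = Vt[:2]                              # first two principal axes (rows)
projected = centred @ components.T               # 2-D coordinates of every sample
explained = S**2 / (S**2).sum()                  # fraction of variance per component

print("explained variance ratio (PC1, PC2):", explained[:2].round(3))
```

The explained-variance ratio indicates how much of the total variance survives in the 2‑D plot, which helps to judge how far the picture can be trusted.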

In [None]:
class MyGaussianNB:
    """Gaussian Naïve Bayes implemented from scratch."""

    def __init__(self, var_smoothing: float = 1e-9):
        self.var_smoothing = var_smoothing          # small value to avoid zero variance

    def fit(self, X, y):
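        """
        Estimate the class priors and the per-class feature means and variances.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)

        Returns
        -------
        self : MyGaussianNB
            The fitted estimator, so that calls can be chained
            (e.g. ``MyGaussianNB().fit(X, y).predict(X)``).
        """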
        # === YOUR CODE HERE ===
        raise NotImplementedError

    def _log_gauss(self, x, mean, var):
        # === YOUR CODE HERE ===
        raise NotImplementedError

    def _joint_log_likelihood(self, x):
        # === YOUR CODE HERE ===
        raise NotImplementedError

    def predict(self, X):
        # === YOUR CODE HERE ===
        raise NotImplementedError

    def predict_proba(self, X):
        # === YOUR CODE HERE ===
        raise NotImplementedError
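
A numerical hint for `predict_proba`: summing 30 log-densities per class often gives very negative values, and exponentiating them directly can underflow to zero. One common remedy is the log-sum-exp trick, sketched below. The helper name `normalise_log_joint` and the values in `toy_jll` are hypothetical, and the sketch assumes one row per sample and one column per class; how to integrate this into the class is left open.

```python
import numpy as np

def normalise_log_joint(jll):
    """Turn joint log-likelihoods of shape (n_samples, n_classes) into posteriors."""
    jll = np.asarray(jll, dtype=float)
    # Subtracting the row maximum leaves the ratios between classes unchanged
    # but makes the largest exponent exactly 0, so exp() cannot underflow for it.
    shifted = jll - jll.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical log-likelihoods; exp(-1050) underflows to 0.0 in float64.
toy_jll = np.array([[-1050.0, -1052.0],
                    [-3.0, -1.0]])
posterior = normalise_log_joint(toy_jll)
print(posterior)
print("naive exp of first row:", np.exp(toy_jll[0]))
```

Each row of `posterior` sums to 1, and the class order within a row is the same as in the input.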


In [None]:
# === train & evaluate custom GNB on FULL feature set ===
gnb_full = MyGaussianNB().fit(X_train, y_train)
y_pred_full = gnb_full.predict(X_test)
print('Confusion matrix (custom GNB – full features):')
print(confusion_matrix(y_test, y_pred_full))
print('\nClassification report:')
print(classification_report(y_test, y_pred_full, digits=4))

In [None]:
# === Now use scikit-learn's GaussianNB implementation and compare its results with yours ===
# === YOUR CODE HERE ===