# How to Prepare Data

Comment: Maybe add a section about adding variance to the data, e.g. by mirroring pictures.

# 1. For Keras

## 1.1 Binary Classification

In this section, we use the **Breast Cancer Wisconsin Diagnostic** dataset provided by `scikit-learn`. This dataset is commonly used for binary classification tasks and is well suited to demonstrating how to prepare data for training neural networks.

The classification goal is to predict whether a tumor is **malignant (0)** or **benign (1)** based on various measurements computed from digitized images of breast tissue.

If you're interested in exploring the structure, distributions, and correlations of the dataset before proceeding with data preparation, we recommend checking the `VisualiseData.ipynb` notebook. That notebook provides insights into the features and their relationships to the target.  

The dataset is loaded in the cell below.

In [1]:
# Import Standard Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the Dataset 
breast_cancer = datasets.load_breast_cancer(as_frame=True)
df = breast_cancer['frame']  # this is the DataFrame

### 1.1.1 Dataset Format Expectations

Just like in `scikit-learn`, models in Keras expect the following input formats:

#### Design matrix $X$
A NumPy array of shape `[n_samples, n_features]` that represents the input features. Each **row** is one example, and each **column** is one feature.

#### Target vector $y$
An array of shape `[n_samples]` containing the class labels (for classification) or continuous values (for regression).

#### Optional: Weight vector $w$
An array of shape `[n_samples]` assigning a weight to each example, which can be useful for handling class imbalance or emphasizing certain samples during training.

> **Important:** The indices of `X`, `y`, and `w` must align — i.e., the `i`-th row in `X` must correspond to the `i`-th value in `y` and optionally in `w`.
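
As a minimal sketch of these three arrays, the cell below builds `X`, `y`, and an illustrative weight vector `w` from the same dataset and checks that their first dimensions agree. The choice of "balanced" weights via `compute_sample_weight` is an assumption made here for illustration; it is not part of the original workflow.

```python
from sklearn import datasets
from sklearn.utils.class_weight import compute_sample_weight

# Load the same dataset used in this notebook
breast_cancer = datasets.load_breast_cancer(as_frame=True)
df = breast_cancer["frame"]

# Design matrix X and target vector y as NumPy arrays
X = df[breast_cancer.feature_names].to_numpy(dtype="float32")
y = df["target"].to_numpy()

# Illustrative weight vector w (assumption): "balanced" weights are
# inversely proportional to the frequency of each class
w = compute_sample_weight(class_weight="balanced", y=y)

# The i-th row of X must belong to the i-th entry of y and w,
# so all three need the same number of rows
assert X.shape[0] == y.shape[0] == w.shape[0]
print(X.shape, y.shape, w.shape)
```

In Keras, such a vector would typically be passed through the `sample_weight` argument of `model.fit`.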

### 1.1.2 First Checks and Data Splitting

Before training a model, it's essential to:
- Check how many examples and classes are present
- Understand the balance between class labels
- Ensure the feature matrix `X` has the right shape and type
- Split the data into training and testing subsets for evaluation

In [2]:
from sklearn.model_selection import train_test_split

# Inspect class distribution in the full dataset
print("Class distribution in the full dataset:")
print(df.groupby("target").size())
print("\nExpected output: 212 = Malignant, 357 = Benign") # The expected output comes from print(breast_cancer.DESCR)

# Select subset of features (can be expanded)
X = df[breast_cancer.feature_names[:10]]
y = df["target"]

# Split the data for training and testing (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Check class distribution in training set
n_benign = np.sum(y_train)
n_malignant = len(y_train) - n_benign
print(f"\nIn training data: {n_malignant} Malignant examples, {n_benign} Benign examples")

Class distribution in the full dataset:
target
0    212
1    357
dtype: int64

Expected output: 212 = Malignant, 357 = Benign

In training data: 148 Malignant examples, 250 Benign examples
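
The cell above covers class counts and the split, but not the shape and type of `X`. The following is a small sketch of how those checks might look; the variable names `all_numeric` and `n_missing` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

# Rebuild the same feature subset as in the cell above
breast_cancer = datasets.load_breast_cancer(as_frame=True)
df = breast_cancer["frame"]
X = df[breast_cancer.feature_names[:10]]

# Shape: expect (n_samples, n_features)
print("Shape of X:", X.shape)

# Type: every column should be numeric before it is fed to a network
all_numeric = all(np.issubdtype(dtype, np.number) for dtype in X.dtypes)
print("All columns numeric:", all_numeric)

# Missing values: count NaNs over the whole matrix
n_missing = int(X.isna().sum().sum())
print("Missing values:", n_missing)
```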


### 1.1.3 Notes on Stratification and Scaling

When splitting the data, using the `stratify` parameter ensures that the class distribution remains similar in both training and test sets.
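
To see the effect, one can compare the fraction of benign examples with and without `stratify`. This is a sketch under the same split settings as above; `benign_fraction` is a helper defined here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)


def benign_fraction(labels):
    """Fraction of examples labelled 1 (benign)."""
    return float(labels.mean())


# Split with stratification on the class label
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Same split settings, but without stratification
_, _, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Full dataset:        {benign_fraction(y):.3f}")
print(f"Stratified train:    {benign_fraction(y_train_s):.3f}")
print(f"Stratified test:     {benign_fraction(y_test_s):.3f}")
print(f"Unstratified train:  {benign_fraction(y_train_r):.3f}")
print(f"Unstratified test:   {benign_fraction(y_test_r):.3f}")
```

With stratification, the train and test fractions stay close to the fraction in the full dataset; without it, they depend on the random draw.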

Depending on the model you choose in Keras, it may also be necessary to scale the input features. For instance:
- Neural networks often perform better when features are **standardized** (zero mean, unit variance)
- Alternatively, you can use **min-max scaling** if your architecture or activation functions prefer inputs in [0, 1]
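
The min-max alternative could be sketched as follows with `MinMaxScaler` from `sklearn.preprocessing`. The variable names `X_train_mm` and `X_test_mm` are chosen here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Default feature_range is (0, 1)
minmax = MinMaxScaler()

# Learn per-feature min and max from the training data only
X_train_mm = minmax.fit_transform(X_train)
# Reuse the training min and max for the test data
X_test_mm = minmax.transform(X_test)

print("Train range:", X_train_mm.min(), X_train_mm.max())
print("Test range: ", X_test_mm.min(), X_test_mm.max())
```

Because the minimum and maximum come from the training data, test values can fall slightly outside [0, 1].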

### 1.1.4 Feature Scaling with `StandardScaler`

Neural networks are sensitive to the scale of input features. If one variable spans values in the thousands while another ranges between 0 and 1, the network may struggle to converge efficiently. This is because features with large magnitudes dominate the gradient updates, so the weights have to compensate for the differing scales across dimensions.

To ensure that each feature contributes equally to the model during training, it's common practice to **scale and center** the data.

We use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from `sklearn.preprocessing`, which transforms each feature to have:
- Zero mean
- Unit variance
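
Concretely, each value is transformed as `z = (x - mean) / std`, computed per feature. The sketch below checks this against `StandardScaler` on a small toy matrix; the numbers in `X_toy` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two features on very different scales
X_toy = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 6000.0],
])

# Manual standardization, column by column
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)  # population standard deviation (ddof=0)
Z_manual = (X_toy - mu) / sigma

# The same transformation via scikit-learn
Z_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(Z_manual, Z_sklearn))
```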

#### Important:
- **Always fit the scaler on the training data only.**
- Then, use the same fitted scaler to transform the test data.

This mimics the realistic scenario where the model sees new, possibly out-of-distribution inputs at inference time, and avoids leaking information from the test set into the training pipeline.


In [6]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on training data, transform both train and test
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
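
As a quick sanity check, the sketch below repeats the split and scaling steps in a self-contained form and prints the per-feature statistics. The names `X_train_scaled` and `X_test_scaled` are used here only to keep the unscaled data available.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same feature subset and split settings as earlier in this notebook
breast_cancer = datasets.load_breast_cancer(as_frame=True)
df = breast_cancer["frame"]
X = df[breast_cancer.feature_names[:10]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit on the training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features: mean ~0 and standard deviation ~1 by construction
print("Train means:", np.round(X_train_scaled.mean(axis=0), 3))
print("Train stds: ", np.round(X_train_scaled.std(axis=0), 3))

# Test features: close to, but generally not exactly, 0 and 1
print("Test means: ", np.round(X_test_scaled.mean(axis=0), 3))
print("Test stds:  ", np.round(X_test_scaled.std(axis=0), 3))
```

Small deviations in the test statistics are expected, since the scaler never saw the test data.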