# 🧼 Breast Cancer Diagnosis – Data Preprocessing and Feature Preparation

This notebook focuses on preparing the dataset for model training. It includes:

- Cleaning and removing non-informative columns
- Converting categorical labels into numerical format
- Feature scaling
- Splitting the dataset into training and test sets

Proper preprocessing ensures that machine learning algorithms can learn effectively from the data.

## 🧭 Table of Contents

1. [Objectives of this Notebook](#objectives-of-this-notebook)  
2. [Column Removal](#column-removal)  
3. [Label Encoding](#label-encoding)  
4. [Feature Scaling](#feature-scaling)  
5. [Train-Test Split](#train-test-split)  
6. [Exporting Processed Data](#exporting-processed-data)

## 1. Objectives of this Notebook <a id="objectives-of-this-notebook"></a>

In this notebook, we will prepare the dataset for supervised learning by performing the following steps:

- **Column Removal**: Drop irrelevant or non-predictive columns (e.g., `id`, `Unnamed: 32`)
- **Label Encoding**: Convert the target variable `diagnosis` into binary format
- **Feature Scaling**: Standardize numerical features to improve model performance
- **Train-Test Split**: Divide the dataset into training and test sets (e.g., 80/20 split)
- **Exporting**: Save the processed datasets for modeling in the next phase

> This notebook prepares the dataset for machine learning by ensuring clean, scaled, and encoded features.


## 2. Column Removal <a id="column-removal"></a>

Before modeling, we must remove columns that do not provide predictive value or are irrelevant for training.

The following columns will be dropped:

- `id`: A sample identifier with no clinical meaning or predictive power.
- `Unnamed: 32`: An empty column included due to formatting in the original CSV file.

Removing these ensures that only meaningful variables remain for preprocessing and modeling.


In [1]:
# Import the libraries used in this notebook
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the dataset
df = pd.read_csv('../data/raw/data.csv')

# Create a copy to avoid modifying the original DataFrame
df_cleaned = df.copy()

# Drop the identifier column and the empty column
# (previously identified as irrelevant in the EDA notebook)
df_cleaned = df_cleaned.drop(columns=['id', 'Unnamed: 32'])

# Check resulting shape and remaining columns
print(f"Remaining columns: {df_cleaned.shape[1]}")
print(df_cleaned.columns.tolist())

Remaining columns: 31
['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
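The drop above names both columns explicitly. As a minimal sketch, assuming a small hypothetical DataFrame `demo` that stands in for the raw CSV, the same cleanup can first confirm that the column is really empty and then tolerate files where a listed column is absent:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw CSV: an ID, a label, one feature,
# and an entirely empty column.
demo = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Confirm the column holds no values before discarding it
assert demo["Unnamed: 32"].isna().all()

# errors="ignore" skips any listed column that is not present
demo_cleaned = demo.drop(columns=["id", "Unnamed: 32"], errors="ignore")

print(demo_cleaned.columns.tolist())  # ['diagnosis', 'radius_mean']
```

Whether to ignore missing columns silently is a design choice; raising an error may be preferable if the input schema is expected to be fixed.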


## 3. Label Encoding <a id="label-encoding"></a>

Now that irrelevant columns have been removed, we will encode the target variable for binary classification.

The `diagnosis` column contains two categorical values:

- `M` – Malignant tumor
- `B` – Benign tumor

These will be converted to numeric format:

- `M` → `1`
- `B` → `0`

This transformation allows machine learning models to interpret the target variable as a binary classification task.


In [3]:
from sklearn.preprocessing import LabelEncoder

# Encode diagnosis using sklearn's LabelEncoder
le = LabelEncoder()
df_cleaned['target'] = le.fit_transform(df_cleaned['diagnosis'])

# Confirm encoding
print("LabelEncoder mapping:", dict(zip(le.classes_, le.transform(le.classes_))))

LabelEncoder mapping: {'B': 0, 'M': 1}
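`LabelEncoder` assigns integers in sorted order of the class labels, which is why `B` becomes `0` and `M` becomes `1` here. As an alternative sketch (not what this notebook uses), an explicit dictionary with `Series.map` states the mapping directly. The `labels` series below is a hypothetical stand-in for `df_cleaned['diagnosis']`.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned['diagnosis']
labels = pd.Series(["M", "B", "B", "M"], name="diagnosis")

mapping = {"B": 0, "M": 1}
target = labels.map(mapping)

# map() yields NaN for any value missing from the dictionary
if target.isna().any():
    raise ValueError("Unexpected diagnosis label found")

target = target.astype(int)
print(target.tolist())  # [1, 0, 0, 1]
```

The explicit dictionary makes the positive class visible in the code and fails loudly on unexpected labels.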


## 4. Feature Scaling <a id="feature-scaling"></a>

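Many machine learning models, particularly distance-based and gradient-based ones such as k-nearest neighbors, support vector machines, and regularized logistic regression, perform better when input features are on a similar scale. Since the variables in this dataset vary significantly in range (e.g., `area_worst` vs. `smoothness_mean`), feature scaling is an important step here.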

In this step, we will:

- Select only the numeric predictor features (excluding `diagnosis` and `target`)
- Apply **standardization** using `StandardScaler`:
  - Subtracts the mean and scales to unit variance
- Store the scaled features in a new DataFrame for training

Standardization helps ensure that each feature contributes equally to the model and avoids bias toward larger-scale variables.


In [4]:
import os
import joblib
from sklearn.preprocessing import StandardScaler

# Select features (the target was already encoded in the previous cell)
X = df_cleaned.drop(columns=["diagnosis", "target"])
y = df_cleaned["target"]

# Scale features, keeping the original column names and row index
scaler = StandardScaler()
X_scaled_array = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns=X.columns, index=X.index)

# Save fitted scaler, creating the output directory if it doesn't exist
os.makedirs("../outputs", exist_ok=True)
joblib.dump(scaler, "../outputs/scaler.joblib")

# Check scaling
print("Mean after scaling:\n", X_scaled.mean())
print("Standard deviation after scaling:\n", X_scaled.std(ddof=0))


Mean after scaling:
 radius_mean               -1.373633e-16
texture_mean               6.868164e-17
perimeter_mean            -1.248757e-16
area_mean                 -2.185325e-16
smoothness_mean           -8.366672e-16
compactness_mean           1.873136e-16
concavity_mean             4.995028e-17
concave points_mean       -4.995028e-17
symmetry_mean              1.748260e-16
fractal_dimension_mean     4.745277e-16
radius_se                  2.372638e-16
texture_se                -1.123881e-16
perimeter_se              -1.123881e-16
area_se                   -1.311195e-16
smoothness_se             -1.529727e-16
compactness_se             1.748260e-16
concavity_se               1.623384e-16
concave points_se          0.000000e+00
symmetry_se                8.741299e-17
fractal_dimension_se      -6.243785e-18
radius_worst              -8.241796e-16
texture_worst              1.248757e-17
perimeter_worst           -3.746271e-16
area_worst                 0.000000e+00
smoothness_worst   

> 💾 The fitted `StandardScaler` is saved using `joblib` for future use (e.g., during inference or deployment).
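As a rough sketch of that later use, the saved scaler can be reloaded with `joblib.load` and applied to new samples with `transform`. The toy arrays and the temporary directory below are assumptions for illustration; the notebook itself writes to `../outputs/scaler.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for the 30 numeric features
X_fit = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler().fit(X_fit)

# Round-trip the fitted scaler through a file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.joblib")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# New samples are transformed with the stored training statistics;
# the scaler is not refit on them.
X_new = np.array([[2.0, 200.0], [4.0, 400.0]])
X_new_scaled = restored.transform(X_new)
print(X_new_scaled)
```

Calling `transform` rather than `fit_transform` at inference time keeps new data on the same scale the model was trained on.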


### ⚙️ Feature Scaling with Standardization

Feature scaling is an essential preprocessing step for many machine learning algorithms. Since the variables in this dataset differ significantly in magnitude (e.g., `area_worst` can be in the thousands while `smoothness_mean` is typically < 0.2), unscaled data can bias the model toward features with larger numeric values.

In this notebook, we applied **standardization** using `StandardScaler`, which transforms each feature according to the following formula:

$$
z = \frac{x - \mu}{\sigma}
$$


Where:
- $x$ is the original value
- $\mu$ is the mean of the feature
- $\sigma$ is the standard deviation of the feature
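The formula can be checked against `StandardScaler` directly. The sketch below uses a made-up matrix `X_demo` with two features on very different scales and compares a manual computation with scikit-learn's result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two features on very different scales
X_demo = np.array([
    [0.08, 500.0],
    [0.10, 1500.0],
    [0.12, 2500.0],
    [0.10, 1500.0],
])

mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # NumPy defaults to ddof=0 (population std)
z_manual = (X_demo - mu) / sigma

z_sklearn = StandardScaler().fit_transform(X_demo)

print(np.allclose(z_manual, z_sklearn))  # True
```

Both columns end up with mean 0 and unit variance, regardless of their original magnitudes.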

---

### 🧪 Expected Outcomes

After scaling:
- Each feature should have a **mean close to 0**
- Each feature should have a **standard deviation close to 1**
- The shape of each feature's distribution is preserved, because standardization only shifts and rescales the values

This transformation puts all features on a comparable scale, so that high-magnitude variables do not dominate scale-sensitive models.

To confirm the process, we printed the mean and standard deviation of all features post-scaling. The means are within floating-point error of 0 (on the order of 1e-16), and the standard deviations, computed with `ddof=0` to match `StandardScaler`, are expected to equal 1 up to the same rounding error, indicating successful standardization.
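The `ddof=0` argument in the check matters because pandas' `std` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation. A small sketch on synthetic data (the frame `X_demo` is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 rows, 2 features
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(loc=50, scale=10, size=(20, 2)), columns=["a", "b"]
)

scaled = pd.DataFrame(
    StandardScaler().fit_transform(X_demo), columns=X_demo.columns
)

std_population = scaled.std(ddof=0)  # matches StandardScaler: 1 up to rounding
std_sample = scaled.std()            # pandas default ddof=1: sqrt(n / (n - 1))

print(std_population.round(6).tolist())
print(std_sample.round(6).tolist())
```

With the default `ddof=1`, the reported values would sit slightly above 1, and the gap shrinks as the number of rows grows.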


## 5. Train-Test Split <a id="train-test-split"></a>

To assess how well our model generalizes to new, unseen data, we must divide the dataset into two separate subsets:

- **Training set**: Used to train the model
- **Test set**: Used to evaluate model performance on unseen data

We will use an **80/20 split** — allocating 80% of the samples for training and 20% for testing. This is a widely accepted default that balances training robustness with evaluation reliability.

To ensure reproducibility of our results, we will set a fixed `random_state`.

---

### ✅ What we will do:

- Use `train_test_split()` from `sklearn.model_selection`
- Input: preprocessed feature matrix `X_scaled` and target vector `y`
- Parameters:
  - `test_size=0.2`
  - `random_state=42`
  - `stratify=y` to preserve the class distribution in both sets

After splitting, we will print the shape of each subset to confirm the correct partitioning.

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# Split the dataset into training and test sets
# 80% of the data will be used to train the model, 20% to evaluate it
# We use stratify=y to preserve the same class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled,  # Scaled features
    y,         # Target variable
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Print the shape of the resulting sets to verify the split
print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape: (455, 30)
X_test shape:  (114, 30)
y_train shape: (455,)
y_test shape:  (114,)


We split the dataset using an 80/20 ratio, where 80% of the samples are used for training and 20% for testing. This produced the following shapes:

- `X_train`: shape (455, 30) — Scaled features for training
- `X_test`: shape (114, 30) — Scaled features for testing
- `y_train`: shape (455,) — Corresponding target labels for training
- `y_test`: shape (114,) — Corresponding target labels for testing

The `train_test_split()` function includes several key parameters:

- `test_size=0.2`: Reserves 20% of the data for the test set
- `random_state=42`: Ensures reproducibility of the split
- `stratify=y`: Maintains the original class distribution in both training and test sets
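To see what `stratify` does in isolation, here is a small sketch with synthetic labels (`X_demo` and `y_demo` are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 70 samples of class 0 and 30 of class 1
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, each subset keeps the 70/30 class ratio
print("Overall positive rate:", y_demo.mean())
print("Train positive rate:  ", y_tr.mean())
print("Test positive rate:   ", y_te.mean())
```

Without `stratify`, the class ratio in a small test set can drift from the overall ratio purely by chance.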

Using `stratify=y` is particularly useful in medical classification problems, where class imbalance (e.g., benign vs. malignant tumors) can lead to misleading performance metrics if the split distorts the class ratio. Stratification keeps the proportion of benign and malignant cases in the training and test sets as close as possible to that of the full dataset.
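In the workflow above, the scaler is fit on the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A commonly used variant splits first and fits the scaler on the training rows only. The sketch below uses scikit-learn's bundled copy of the Wisconsin diagnostic data rather than `../data/raw/data.csv`; note that in the bundled copy the target is coded the other way around (0 = malignant, 1 = benign).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset, used here as a stand-in for the raw CSV
X_raw, y_raw = load_breast_cancer(return_X_y=True, as_frame=True)

# Split the unscaled features first
X_train_raw, X_test_raw, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Fit on the training rows only, then reuse those statistics for the test rows
scaler = StandardScaler().fit(X_train_raw)
X_train_std = scaler.transform(X_train_raw)
X_test_std = scaler.transform(X_test_raw)

print(X_train_std.shape, X_test_std.shape)
print(X_train_std.mean(axis=0).round(6)[:3])  # training means are ~0
print(X_test_std.mean(axis=0).round(6)[:3])   # test means are near 0, not exactly 0
```

On a dataset of this size the numerical difference is usually small, but fitting on training data only keeps the test set fully unseen during preprocessing.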


## 6. Exporting Processed Data <a id="exporting-processed-data"></a>

After cleaning, encoding, scaling, and splitting the dataset, we now save the processed data to disk. This allows us to reuse these subsets directly in the modeling phase without repeating the preprocessing steps.

We will export the following components as CSV files:

- `X_train.csv` and `X_test.csv`: Feature matrices
- `y_train.csv` and `y_test.csv`: Target vectors

All files will be saved under the `data/processed/` directory.


In [7]:
import os

# Create the output directory if it doesn't exist
os.makedirs("../data/processed", exist_ok=True)

# Save the processed training and test sets
X_train.to_csv("../data/processed/X_train.csv", float_format="%.6f", index=False)
X_test.to_csv("../data/processed/X_test.csv", float_format="%.6f", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)

print("✅ Processed data successfully exported.")

✅ Processed data successfully exported.
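When the files are read back in the modeling phase, two details are worth keeping in mind: `float_format="%.6f"` rounds the features to six decimals, and a target saved with `Series.to_csv` comes back as a one-column DataFrame. A minimal round-trip sketch, using made-up values and a temporary directory instead of `../data/processed`:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-ins for X_train and y_train
X_demo = pd.DataFrame({"radius_mean": [0.1234567, -1.7654321]})
y_demo = pd.Series([1, 0], name="target")

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_train.csv")
    y_path = os.path.join(tmp_dir, "y_train.csv")

    X_demo.to_csv(x_path, float_format="%.6f", index=False)
    y_demo.to_csv(y_path, index=False)

    X_loaded = pd.read_csv(x_path)
    # The target file has a single column; squeeze turns it back into a Series
    y_loaded = pd.read_csv(y_path).squeeze("columns")

print(X_loaded["radius_mean"].tolist())  # values rounded to six decimals
print(y_loaded.tolist())                 # [1, 0]
```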


## 📌 Summary and Next Steps <a id="summary-and-next-steps"></a>

In this notebook, we prepared the Breast Cancer Wisconsin dataset for machine learning by performing the following steps:

- 🧹 Removed irrelevant columns (`id`, `Unnamed: 32`)
- 🏷️ Encoded the `diagnosis` variable into a binary `target` column
- 📏 Scaled all numerical features using `StandardScaler`
- 🔀 Split the dataset into training and test sets (80/20), preserving class balance with `stratify=y`
- 💾 Exported all processed data subsets for reuse in future modeling

---

### 🚀 Next Steps

In the next notebook, we will:

- Load the processed data
- Train and evaluate several classification models (e.g., Logistic Regression and Random Forest)
- Compare their performance using cross-validation and metrics such as accuracy, precision, recall, and ROC-AUC

We are now ready to begin the modeling phase of the project.
