## Data Preprocessing – Feature Preparation for Modeling

Before training any machine learning model, we need to **prepare the dataset** by separating the features from the label, splitting the data into training and test sets, and scaling the feature values. A held-out test set lets us estimate how well the model generalizes to unseen data, and scaling puts all features on a comparable range so that none dominates purely because of its units.

---

### Preprocessing Workflow

1. **Load dataset**  
   Read the cancer classification CSV file (`cancer_classification.csv`) into a pandas DataFrame.

2. **Separate features and target variable**  
   - `X` = all input features (every column except the target)  
   - `y` = target label (`benign_0__mal_1`)

3. **Split data into train and test sets**  
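   - Use `train_test_split()` from `sklearn.model_selection` with `test_size=0.2` and a fixed `random_state` for reproducibility  
   - Pass `stratify=y` so the train and test sets keep roughly the same class proportions as the full dataset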

4. **Apply feature scaling**  
   - Use `StandardScaler` to standardize each feature to zero mean and unit variance  
   - Fit the scaler on the training data only, then use it to transform both the train and test sets, so that no test-set statistics leak into training
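
To make the scaling step concrete, here is a minimal pure-Python sketch of what standardization computes: each value becomes `z = (x - mean) / std`, where the mean and standard deviation are learned from the training data alone. It assumes the population standard deviation, which is what `StandardScaler` uses. The helper names `fit_standardizer` and `standardize` and the toy numbers are hypothetical, chosen only for illustration; they are not part of scikit-learn or of this notebook.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant features so we never divide by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in values]


# Toy single-feature data (made-up numbers, not taken from the cancer dataset).
train_feature = [10.0, 12.0, 14.0, 16.0, 18.0]
test_feature = [11.0, 20.0]

# The statistics come from the training set only ...
mean, std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mean, std)
# ... and are reused unchanged for the test set.
test_scaled = standardize(test_feature, mean, std)

print("train mean after scaling:", round(fmean(train_scaled), 6))
print("train std after scaling:", round(pstdev(train_scaled), 6))
print("test values after scaling:", [round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, and that is expected: they are expressed relative to the training distribution, which is how the model will see any new data.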




```python
# Breast Cancer Tumor Classifier - Preprocessing Step

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: Load the dataset
file_path = "C:/Users/sanja/1. Breast_Cancer_Tumor_Classifier/1.Breast_Cancer_Tumor_Classifier/data/raw/cancer_classification.csv"
df = pd.read_csv(file_path)

print(" Dataset loaded successfully.")
print("Shape of dataset:", df.shape)

# Step 2: Separate features (all columns except the target) and target
X = df.drop(columns=['benign_0__mal_1'])
y = df['benign_0__mal_1']

print("\n Features and target separated.")
print("X shape:", X.shape)
print("y shape:", y.shape)

# Step 3: Stratified train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("\n Train-test split done.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train distribution:\n", y_train.value_counts())
print("y_test distribution:\n", y_test.value_counts())

# Step 4: Feature scaling
# Fit on the training set only, then reuse those statistics for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n Feature scaling completed using StandardScaler.")
print("Sample of scaled X_train:\n", X_train_scaled[:5])
```



 Dataset loaded successfully.
Shape of dataset: (569, 31)

 Features and target separated.
X shape: (569, 30)
y shape: (569,)

 Train-test split done.
X_train shape: (455, 30)
X_test shape: (114, 30)
y_train distribution:
 benign_0__mal_1
1    285
0    170
Name: count, dtype: int64
y_test distribution:
 benign_0__mal_1
1    72
0    42
Name: count, dtype: int64

 Feature scaling completed using StandardScaler.
Sample of scaled X_train:
 [[-1.07200079e+00 -6.58424598e-01 -1.08808010e+00 -9.39273639e-01
  -1.35939882e-01 -1.00871795e+00 -9.68358632e-01 -1.10203235e+00
   2.81062120e-01 -1.13231479e-01 -7.04860874e-01 -4.40938351e-01
  -7.43948977e-01 -6.29804931e-01  7.48061001e-04 -9.91572979e-01
  -6.93759567e-01 -9.83284458e-01 -5.91579010e-01 -4.28972052e-01
  -1.03409427e+00 -6.23497432e-01 -1.07077336e+00 -8.76534437e-01
  -1.69982346e-01 -1.03883630e+00 -1.07899452e+00 -1.35052668e+00
  -3.52658049e-01 -5.41380026e-01]
 [ 1.74874285e+00  6.65017334e-02  1.75115682e+00  1.74555856e+