## Data Preprocessing – Feature Preparation for Modeling

Before training any machine learning model, we need to **prepare the dataset**: separate features from labels, split the data into training and test sets, and scale the feature values. Holding out a test set lets us estimate how well the model generalizes to unseen data, and scaling keeps features measured on different scales from dominating scale-sensitive models.

---

### Preprocessing Workflow

1. **Load dataset**  
   Read the cancer classification CSV file.

2. **Separate features and target variable**  
   - `X` = All input features  
   - `y` = Target label (`benign_0__mal_1`)

3. **Split data into train and test sets**  
   - Use `train_test_split()` from `sklearn.model_selection`  
   - Stratify to maintain class proportions

4. **Apply feature scaling**  
   - Use `StandardScaler` to standardize each feature to zero mean and unit variance  
   - Fit the scaler on the training data only, then use it to transform both the train and test sets, so no test-set statistics leak into training
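
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: it builds a small synthetic DataFrame in place of the cancer CSV so that it runs on its own, and the 80/20 split (`test_size=0.2`) and `random_state=42` are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1 (assumption): a synthetic stand-in for the cancer CSV.
# With the real file, this would be a pd.read_csv(...) call instead.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "mean radius": rng.normal(14.0, 3.5, size=200),
    "mean texture": rng.normal(19.0, 4.0, size=200),
    "benign_0__mal_1": rng.integers(0, 2, size=200),
})

# Step 2: separate input features from the target label
X = df.drop(columns="benign_0__mal_1")
y = df["benign_0__mal_1"]

# Step 3: stratified split keeps the class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: fit the scaler on the training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Only the training set is guaranteed to have exactly zero mean and unit variance after scaling; the test set is transformed with the training statistics, so its mean and variance will be close to, but not exactly, 0 and 1.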




In [6]:
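import os
import joblib

# Inspect the available columns.
# `df` and `model` must already be defined in the session,
# e.g. by running the loading and training cell first.
print(df.columns)

# Create the output directory if it doesn't exist
os.makedirs("models", exist_ok=True)

# Save the trained model
joblib.dump(model, "models/classifier.pkl")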


Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'benign_0__mal_1'],
      dtype='object')


['models/classifier.pkl']

In [7]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load data
df = pd.read_csv("C:/Users/sanja/1. Breast_Cancer_Tumor_Classifier/1.Breast_Cancer_Tumor_Classifier/data/raw/cancer_classification.csv")

# Select input features and target
selected_features = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']
X = df[selected_features]
y = df['benign_0__mal_1']

# Train a Random Forest on the five selected features, without scaling
# (tree-based models split on thresholds, so feature scale does not affect them).
# Note: this fits on the full dataset; no train/test split is applied here.
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Save the model (the "models" directory must already exist)
joblib.dump(model, 'models/classifier.pkl')


['models/classifier.pkl']
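
To check that a saved model can be restored, a round trip through `joblib.dump` and `joblib.load` might look like the sketch below. It is illustrative only: the data is synthetic, `n_estimators=10` is an arbitrary choice to keep it fast, and the file is written to a temporary directory rather than to `models/classifier.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 100 samples, 5 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.pkl")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original model's predictions
same_predictions = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

A reloaded model expects the same features, in the same order, as the data it was trained on, so the list of selected features should be kept alongside the saved file.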