#  Notebook 2 — Data Preprocessing

##  Abstract
This notebook cleans and preprocesses the **Wisconsin Breast Cancer Diagnostic Dataset** for downstream analysis and feature selection.  
Steps include:
- **Label encoding** of the target variable.
- **Handling missing values**, if any.
- **Scaling numerical features** so that all of them are on a comparable scale.
- **Saving the processed dataset** for reproducibility.


##  Data Loading and Initial Setup

We begin by **loading the dataset** and preparing it for preprocessing.  
This step includes importing essential libraries, defining the file path, and previewing the data.


In [1]:
import os

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load dataset from raw folder
data_path = "C:/Users/sanja/2.Feature_Selection_Biomarker_Identification/Feature_Selection_Biomarker_Identification/data/raw/breast-cancer.csv"
df = pd.read_csv(data_path)
df.head()


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


##  Encoding Target Variable — Diagnosis

The target variable **`diagnosis`** is currently **categorical**, containing two values:
- **M** → Malignant tumor
- **B** → Benign tumor  

Since most machine learning models require **numerical inputs**, we will convert these categorical values into numeric form.

In [2]:
# Encode 'diagnosis': M = 1 (Malignant), B = 0 (Benign)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

# Verify encoding
df['diagnosis'].value_counts()


diagnosis
0    357
1    212
Name: count, dtype: int64
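
One property of the encoding step is easy to miss: `Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary. The counts above sum to 569, so nothing appears to have been lost here, but a quick check can guard against unexpected labels. The following is a minimal sketch on a hypothetical toy column (the stray `"m"` and `"?"` values are invented for illustration and do not come from the dataset):

```python
import pandas as pd

# Hypothetical toy labels; "m" and "?" are invented to trigger the failure mode.
labels = pd.Series(["M", "B", "m", "B", "?"], name="diagnosis")

mapping = {"M": 1, "B": 0}
encoded = labels.map(mapping)

# Values missing from the mapping become NaN, so collect the offending raw labels.
unmapped = labels[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list means every label was covered by the mapping.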

##  Checking and Handling Missing Values

Before training a machine learning model, it is important to **check for missing values** in the dataset.  
Missing data can cause:
- **Errors** during model training.
- **Bias** in results if not handled properly.

In [3]:
# Check missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

# Drop any rows that contain missing values
# (this leaves the data unchanged when no values are missing, as in the output below)
df = df.dropna()


Missing values:
 id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
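
The notebook handles missing values by dropping rows, which changes nothing here because no values are missing. If a dataset did contain gaps, dropping rows would discard the other measurements in those rows as well. As a hedged alternative, the sketch below compares dropping with median imputation on a small toy frame; the `NaN` entries are invented for illustration, and median imputation is only one possible choice, not the notebook's method:

```python
import numpy as np
import pandas as pd

# Toy frame with two invented gaps (the real dataset has none).
toy = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 19.69, 11.42],
    "texture_mean": [10.38, 17.77, np.nan, 20.38],
})

n_missing = int(toy.isnull().sum().sum())

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the median of its own column.
imputed = toy.fillna(toy.median(numeric_only=True))

print("Missing cells:", n_missing)
print("Rows after dropna:", len(dropped))
print("Rows after imputation:", len(imputed))
```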


##  Splitting Features and Target Variable

For supervised machine learning, the dataset must be split into:
- **Features (X)** → Independent variables used for prediction.
- **Target (y)** → Dependent variable (label) we want to predict.

In [4]:
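# Features: every column except the target (note that `id` is kept in X here)
X = df.drop(columns=['diagnosis'])
# Target: the encoded diagnosis label
y = df['diagnosis']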
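
The split above drops only `diagnosis`, so the `id` column stays in `X` and is standardized along with the measurements. Since `id` appears to be a record identifier rather than a measurement, it may be preferable to exclude it before feature selection. A minimal sketch of that variant, using a toy frame built from the first rows shown earlier (the names `toy`, `X_toy`, and `y_toy` are hypothetical):

```python
import pandas as pd

# Toy frame using a few values from the preview above.
toy = pd.DataFrame({
    "id": [842302, 842517, 84300903],
    "diagnosis": [1, 1, 1],
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, 17.77, 21.25],
})

# Drop both the identifier and the target from the feature matrix.
X_toy = toy.drop(columns=["id", "diagnosis"])
y_toy = toy["diagnosis"]

print(X_toy.columns.tolist())
```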

##  Feature Scaling with StandardScaler

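Many machine learning algorithms, particularly distance-based and gradient-based ones, perform better when features are on a **similar scale**.  
The features in this dataset have very different ranges (for example, `area_mean` is in the hundreds to thousands, while `smoothness_mean` is around 0.1), so we apply **z-score standardization** using **`StandardScaler`**, which rescales each column to zero mean and unit variance.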
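
To make the transformation concrete: z-score standardization replaces each value `x` with `(x - mean) / std`, computed per column. The sketch below checks on a small toy array (two columns with values taken from the preview above) that a manual computation matches `StandardScaler`, which uses the population standard deviation (`ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: one column on a small scale, one on a large scale.
X_toy = np.array([
    [17.99, 1001.0],
    [20.57, 1326.0],
    [19.69, 1203.0],
    [11.42, 386.1],
])

# Manual z-score per column; np.std defaults to ddof=0 (population std).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

print("Manual and StandardScaler agree:", np.allclose(manual, scaled))
print("Column means after scaling:", scaled.mean(axis=0).round(6))
print("Column stds after scaling:", scaled.std(axis=0).round(6))
```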


In [5]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame with column names
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Add target back
processed_df = pd.concat([y.reset_index(drop=True), X_scaled_df], axis=1)

processed_df.head()


Unnamed: 0,diagnosis,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,-0.236405,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1,-0.236403,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1,0.431741,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,1,0.432121,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1,0.432201,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


##  Saving the Processed Dataset

Once the dataset is **cleaned and standardized**, it is good practice to save it in a **processed data folder**.  
This allows us to:
- Avoid repeating preprocessing steps.
- Maintain a **reproducible workflow**.
- Keep raw and processed datasets **separate** for better project organization.

In [6]:
# Ensure processed data folder exists
processed_dir = "C:/Users/sanja/2.Feature_Selection_Biomarker_Identification/Feature_Selection_Biomarker_Identification/data/processed"
os.makedirs(processed_dir, exist_ok=True)

# Save processed CSV
processed_path = os.path.join(processed_dir, "breast_cancer_processed.csv")
processed_df.to_csv(processed_path, index=False)

print(f"Processed dataset saved to: {processed_path}")


Processed dataset saved to: C:/Users/sanja/2.Feature_Selection_Biomarker_Identification/Feature_Selection_Biomarker_Identification/data/processed\breast_cancer_processed.csv
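
As a quick sanity check on the saved file, it can be read back and compared with the frame that was written. The sketch below does this in a temporary directory so that it runs anywhere; the toy frame, its values, and the `round_trip_ok` name are assumptions for illustration, not part of the notebook:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for processed_df; the values are invented.
toy_processed = pd.DataFrame({
    "diagnosis": [1, 0],
    "radius_mean": [0.5, -0.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the data/processed layout inside a throwaway directory.
    processed_dir = os.path.join(tmp_dir, "data", "processed")
    os.makedirs(processed_dir, exist_ok=True)

    processed_path = os.path.join(processed_dir, "breast_cancer_processed.csv")
    toy_processed.to_csv(processed_path, index=False)

    # Read the file back before the temporary directory is removed.
    reloaded = pd.read_csv(processed_path)

# Same shape and same column order means nothing was lost or added on disk.
round_trip_ok = (
    reloaded.shape == toy_processed.shape
    and list(reloaded.columns) == list(toy_processed.columns)
)
print("Round trip OK:", round_trip_ok)
```

Writing with `index=False`, as the notebook does, is what keeps an extra index column from appearing when the file is reloaded.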
