#  Notebook 3 — Feature Selection Methods

##  Abstract
This notebook applies **three independent feature selection methods** to identify the most important biomarkers for **breast cancer detection**:
1. **ANOVA F-test (SelectKBest)** — Finds features with highest variance between classes.
2. **Random Forest Feature Importance** — Tree-based ranking of features by predictive power.
3. **Recursive Feature Elimination (RFE)** — Iteratively removes the least important features.

The goal is to produce **rankings from each method** for use in the **final consensus biomarker identification**.


##  Feature Selection Setup — Loading Processed Data

Now that we have a **processed and standardized dataset**, we move toward **feature selection**.  
Feature selection helps identify the most **important predictors (biomarkers)** that contribute to distinguishing between **malignant** and **benign** cases.


In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import os
# Path to processed dataset
processed_path = "C:/Users/sanja/2.Feature_Selection_Biomarker_Identification/Feature_Selection_Biomarker_Identification/data/processed/breast_cancer_processed.csv"
df = pd.read_csv(processed_path)
df.head()


Unnamed: 0,diagnosis,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1,-0.236405,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1,-0.236403,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1,0.431741,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,1,0.432121,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1,0.432201,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971


##  Train-Test Split — Preserving Class Balance

Before applying **feature selection** and building models, we split the dataset into **training** and **testing** subsets.  
This ensures that model evaluation is performed on **unseen data** for a realistic performance estimate.

In [11]:
# Separate features (X) and target (y)
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

# Train-Test Split (stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")



Training samples: 455
Testing samples: 114


## Feature Selection — ANOVA F-test (SelectKBest)

To identify the most **statistically significant features** (potential biomarkers) for breast cancer diagnosis,  
we use the **ANOVA F-test** with `SelectKBest`.  
This method evaluates the relationship between each feature and the target variable individually.

In [12]:
# Select all features, ranked by ANOVA F-test score
selector = SelectKBest(score_func=f_classif, k='all')
selector.fit(X_train, y_train)

anova_scores = pd.DataFrame({
    'Feature': X.columns,
    'ANOVA_F_Score': selector.scores_
}).sort_values(by='ANOVA_F_Score', ascending=False)

anova_scores.reset_index(drop=True, inplace=True)
anova_scores.head()


Unnamed: 0,Feature,ANOVA_F_Score
0,concave points_worst,733.724933
1,perimeter_worst,717.246487
2,radius_worst,692.861395
3,concave points_mean,684.526845
4,perimeter_mean,548.413236


##  Feature Selection — Random Forest Importance

In addition to statistical feature selection methods like **ANOVA F-test**,  
we can use **tree-based models** such as **Random Forest** to measure feature importance based on their contribution to model accuracy.

In [13]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

rf_importances = pd.DataFrame({
    'Feature': X.columns,
    'RandomForest_Importance': rf.feature_importances_
}).sort_values(by='RandomForest_Importance', ascending=False)

rf_importances.reset_index(drop=True, inplace=True)
rf_importances.head()



Unnamed: 0,Feature,RandomForest_Importance
0,area_worst,0.131017
1,perimeter_worst,0.130272
2,concave points_worst,0.103211
3,radius_worst,0.097813
4,concave points_mean,0.087399


##  Feature Selection — Recursive Feature Elimination (RFE) with Logistic Regression

To further validate our **biomarker identification**,  
we use **Recursive Feature Elimination (RFE)** with a **Logistic Regression** model.  
RFE works by **iteratively removing the least important features** until the desired number of features is reached.

In [6]:
log_reg = LogisticRegression(max_iter=500, solver='liblinear', random_state=42)
rfe = RFE(log_reg, n_features_to_select=10)
rfe.fit(X_train, y_train)

rfe_ranking = pd.DataFrame({
    'Feature': X.columns,
    'RFE_Rank': rfe.ranking_
}).sort_values(by='RFE_Rank', ascending=True)

rfe_ranking.reset_index(drop=True, inplace=True)
rfe_ranking


Unnamed: 0,Feature,RFE_Rank
0,radius_se,1
1,area_se,1
2,concave points_mean,1
3,concavity_worst,1
4,radius_worst,1
5,texture_worst,1
6,perimeter_worst,1
7,compactness_se,1
8,concave points_worst,1
9,area_worst,1


##  Feature Selection — Recursive Feature Elimination (RFE) with Logistic Regression

We apply **Recursive Feature Elimination (RFE)** using a **Logistic Regression** model to identify the top **10 most important features** (potential biomarkers) for breast cancer classification.


In [14]:
log_reg = LogisticRegression(max_iter=500, solver='liblinear', random_state=42)
rfe = RFE(log_reg, n_features_to_select=10)
rfe.fit(X_train, y_train)

rfe_ranking = pd.DataFrame({
    'Feature': X.columns,
    'RFE_Rank': rfe.ranking_
}).sort_values(by='RFE_Rank', ascending=True)

rfe_ranking.reset_index(drop=True, inplace=True)
rfe_ranking.head()



Unnamed: 0,Feature,RFE_Rank
0,radius_se,1
1,area_se,1
2,concave points_mean,1
3,concavity_worst,1
4,radius_worst,1


##  Saving Feature Selection Results

After performing **three complementary feature selection methods**  
(ANOVA F-test, Random Forest importance, and RFE),  
we save the ranked feature lists as CSV files for **documentation** and **future analysis**.

In [15]:
# Ensure tables directory exists
tables_dir = "C:/Users/sanja/2.Feature_Selection_Biomarker_Identification/Feature_Selection_Biomarker_Identification/results/tables"
os.makedirs(tables_dir, exist_ok=True)

# Save files
anova_scores.to_csv(os.path.join(tables_dir, "anova_scores.csv"), index=False)
rf_importances.to_csv(os.path.join(tables_dir, "rf_importances.csv"), index=False)
rfe_ranking.to_csv(os.path.join(tables_dir, "rfe_ranking.csv"), index=False)

print(f" Rankings saved in: {tables_dir}")


 Rankings saved in: C:/Users/sanja/2.Feature_Selection_Biomarker_Identification/Feature_Selection_Biomarker_Identification/results/tables
