# **Feature Selection**

# *B.1 Feature Selection using Wrapper Methods for Breast Cancer Diagnostic Dataset*

# **Part 1: Data Loading and Preprocessing**

**Step 1: Import Required Libraries**

In [5]:
import numpy as np
import pandas as pd

# Dataset
from sklearn.datasets import load_breast_cancer

# Model and feature selection
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Train-test split
from sklearn.model_selection import train_test_split

# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


**Step 2: Load the Breast Cancer Dataset**

In [6]:
# Load dataset
data = load_breast_cancer()

# Convert to a DataFrame/Series so each column keeps its feature name
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)


**Step 3: Exploratory Data Analysis (EDA)**

*Dataset Shape*

In [7]:
print("Dataset shape:", X.shape)


Dataset shape: (569, 30)


*Summary Statistics*

In [8]:
X.describe()


|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


*Check Missing Values*

In [9]:
print("Missing values:\n", X.isnull().sum())


Missing values:
 mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64
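
Class balance is also worth checking at this stage, since it affects how accuracy, precision, and recall should be read later. The sketch below counts the samples per class; mapping the integer labels through `data.target_names` is an addition that is not in the original cells, and the names `diagnosis` and `class_counts` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Translate the integer labels (0/1) into their class names
diagnosis = pd.Series(data.target_names[data.target], name="diagnosis")

# Number of samples per class
class_counts = diagnosis.value_counts()
print(class_counts)
```

In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign, so the default positive class for `precision_score` and `recall_score` is benign.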


**Step 4: Train–Test Split (80% Train, 20% Test)**

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
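
A stratified split is a common variation when the classes are not equally frequent. The sketch below is an alternative, not the split used for the results in this notebook; passing `stratify=y` keeps the class proportions roughly equal in both subsets, and the `_s` suffixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Assumption: stratify on the target so both subsets keep similar class ratios
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train_s.shape, "Test shape:", X_test_s.shape)
# The mean of a 0/1 target is the share of label 1 (benign)
print("Benign share (train):", round(y_train_s.mean(), 3))
print("Benign share (test):", round(y_test_s.mean(), 3))
```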


# **Part 2: Apply Wrapper Method (RFE)**

**Step 5: Initialize Logistic Regression Model**

In [11]:
model = LogisticRegression(max_iter=200)


**Step 6: Apply RFE (Select Top 5 Features)**

In [12]:
# RFE with top 5 features
rfe = RFE(estimator=model, n_features_to_select=5)

# Fit RFE
rfe.fit(X_train, y_train)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
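
The warning above suggests scaling the data or raising `max_iter`. One way to follow that advice is sketched below: a `Pipeline` that standardizes the features before running RFE. This is a variation under stated assumptions rather than the method used in this notebook; `max_iter=1000` and the names `scaled_rfe`, `scaled_support`, and `scaled_selected` are illustrative, and the features it selects may differ from the ones reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: standardize first so the lbfgs solver converges more easily
scaled_rfe = Pipeline([
    ("scale", StandardScaler()),
    ("rfe", RFE(estimator=LogisticRegression(max_iter=1000),
                n_features_to_select=5)),
])
scaled_rfe.fit(X_train, y_train)

# Boolean mask of the features kept by the RFE step
scaled_support = scaled_rfe.named_steps["rfe"].support_
scaled_selected = X_train.columns[scaled_support]
print(list(scaled_selected))
```

Fitting the scaler inside the pipeline means it only ever sees the training data, which avoids leaking test-set statistics into the selection step.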

**Step 7: View Selected Features and Rankings**

In [13]:
selected_features = X.columns[rfe.support_]
feature_ranking = rfe.ranking_

print("Selected Top 5 Features:")
print(selected_features)

print("\nFeature Rankings:")
for feature, rank in zip(X.columns, feature_ranking):
    print(feature, ":", rank)


Selected Top 5 Features:
Index(['mean radius', 'texture error', 'worst radius', 'worst concavity',
       'worst symmetry'],
      dtype='object')

Feature Rankings:
mean radius : 1
mean texture : 9
mean perimeter : 15
mean area : 25
mean smoothness : 8
mean compactness : 17
mean concavity : 2
mean concave points : 4
mean symmetry : 6
mean fractal dimension : 14
radius error : 24
texture error : 1
perimeter error : 5
area error : 11
smoothness error : 23
compactness error : 22
concavity error : 12
concave points error : 18
symmetry error : 20
fractal dimension error : 26
worst radius : 1
worst texture : 7
worst perimeter : 10
worst area : 21
worst smoothness : 19
worst compactness : 16
worst concavity : 1
worst concave points : 3
worst symmetry : 1
worst fractal dimension : 13
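
The flat printout is easier to scan when sorted by rank. The helper below is a hypothetical convenience (`ranking_table` is not part of scikit-learn); it is shown on a few feature/rank pairs copied from the output above, and in this notebook it could be called as `ranking_table(X.columns, rfe.ranking_)`.

```python
import pandas as pd


def ranking_table(feature_names, ranking):
    """Return features sorted by RFE rank, where rank 1 means selected."""
    table = pd.Series(list(ranking), index=list(feature_names), name="rank")
    # A stable sort keeps the original column order among tied ranks
    return table.sort_values(kind="stable")


# A few (feature, rank) pairs copied from the output above
demo = ranking_table(
    ["mean radius", "mean texture", "mean area",
     "texture error", "worst concave points"],
    [1, 9, 25, 1, 3],
)
print(demo)
```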


# **Part 3: Train Model Using Selected Features**

**Step 8: Transform Dataset**

In [14]:
# Keep only the 5 columns that RFE selected
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)


**Step 9: Train Logistic Regression on Selected Features**

In [15]:
# Refit the logistic regression on the selected features only
model.fit(X_train_rfe, y_train)


**Step 10: Predictions**

In [16]:
y_pred = model.predict(X_test_rfe)
y_prob = model.predict_proba(X_test_rfe)[:, 1]


# **Part 4: Model Evaluation**

*Accuracy*

In [17]:
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9736842105263158


*Precision*

In [18]:
print("Precision:", precision_score(y_test, y_pred))


Precision: 0.9722222222222222


*Recall*

In [19]:
print("Recall:", recall_score(y_test, y_pred))


Recall: 0.9859154929577465


*F1-Score*

In [20]:
print("F1 Score:", f1_score(y_test, y_pred))


F1 Score: 0.9790209790209791


*ROC-AUC Score*

In [21]:
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


ROC-AUC Score: 0.9983622666229938
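
The accuracy, precision, recall, and F1 values above are consistent with a single confusion matrix. The counts below are inferred from those reported scores and a 114-sample test set (20% of 569, rounded up); they are not read from the notebook's output, so treat them as an assumption. The sketch recomputes each metric from the counts.

```python
# Assumed confusion-matrix counts (positive class = label 1),
# inferred from the scores reported above
tp, fp, fn, tn = 70, 2, 1, 41

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all
precision = tp / (tp + fp)                   # correct positives / predicted positives
recall = tp / (tp + fn)                      # correct positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

With scikit-learn's label encoding for this dataset, label 1 is benign, so these precision and recall values describe how well benign cases are identified. Passing `pos_label=0` to `precision_score` and `recall_score` would report them for the malignant class instead.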


# **Part 5: Comparison (All Features vs Selected Features)**

**Model Using All Features**

In [22]:
# Refit the same estimator on all 30 features (this overwrites the 5-feature fit)
model.fit(X_train, y_train)
y_pred_all = model.predict(X_test)
y_prob_all = model.predict_proba(X_test)[:, 1]

print("Accuracy with ALL features:", accuracy_score(y_test, y_pred_all))
print("ROC-AUC with ALL features:", roc_auc_score(y_test, y_prob_all))


Accuracy with ALL features: 0.956140350877193
ROC-AUC with ALL features: 0.9977071732721914


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
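
A side-by-side table makes this comparison easier to read. The sketch below is a variation under stated assumptions: it standardizes the features and uses `max_iter=1000` to avoid the convergence warning, so its numbers will not match the outputs above exactly. The `evaluate` helper and the `top5` list (copied from the Step 7 output) are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature names copied from the "Selected Top 5 Features" output
top5 = ["mean radius", "texture error", "worst radius",
        "worst concavity", "worst symmetry"]


def evaluate(columns):
    """Fit a scaled logistic regression on the given columns and score it."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_train[columns], y_train)
    pred = clf.predict(X_test[columns])
    prob = clf.predict_proba(X_test[columns])[:, 1]
    return {
        "n_features": len(columns),
        "accuracy": accuracy_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, prob),
    }


comparison = pd.DataFrame.from_dict(
    {"all features": evaluate(list(X.columns)), "RFE top 5": evaluate(top5)},
    orient="index",
)
print(comparison)
```

On a 114-sample test set, the gap between the accuracies of 0.9737 and 0.9561 reported above corresponds to two test samples, so small differences of this kind are best read cautiously.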


# **Part 6: Experiment (Different Feature Counts)**

**Example: Top 3 Features**

In [23]:
rfe_3 = RFE(model, n_features_to_select=3)
rfe_3.fit(X_train, y_train)

X_train_3 = rfe_3.transform(X_train)
X_test_3 = rfe_3.transform(X_test)

# Refit the classifier on the 3 selected features
model.fit(X_train_3, y_train)
y_pred_3 = model.predict(X_test_3)

print("Accuracy with Top 3 features:", accuracy_score(y_test, y_pred_3))


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy with Top 3 features: 0.9649122807017544
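
The same experiment can be repeated for several feature counts in a loop. The sketch below assumes a scaled pipeline (to avoid the convergence warnings) and arbitrary counts of 3, 5, and 10, so its accuracies need not match the value printed above; `results` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for k in (3, 5, 10):  # assumed feature counts to try
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("rfe", RFE(LogisticRegression(max_iter=1000),
                    n_features_to_select=k)),
    ])
    pipe.fit(X_train, y_train)
    # RFE.predict reduces X to the selected features, then predicts
    results[k] = accuracy_score(y_test, pipe.predict(X_test))

for k, acc in results.items():
    print(f"Accuracy with top {k} features: {acc:.4f}")
```

Because these accuracies come from a single train-test split, a cross-validated tool such as `RFECV` from `sklearn.feature_selection` may give a more stable picture of how many features to keep.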
