Part 1: Data Loading and Preprocessing
1. Load the Breast Cancer Prognostic Dataset.
2. Dataset is available in Drive.
3. Perform basic exploratory data analysis (EDA) to understand the dataset:
• Summarize key statistics for each feature.
• Check for missing values and handle them appropriately.
4. Split the dataset into training (80%) and testing (20%) sets.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

In [5]:
#Load the dataset
from ucimlrepo import fetch_ucirepo
#fetch the dataset
breast_cancer_wisconsin_original=fetch_ucirepo(id=15)
#extract feature and target variable
X=breast_cancer_wisconsin_original.data.features
y=breast_cancer_wisconsin_original.data.targets

In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump_thickness              699 non-null    int64  
 1   Uniformity_of_cell_size      699 non-null    int64  
 2   Uniformity_of_cell_shape     699 non-null    int64  
 3   Marginal_adhesion            699 non-null    int64  
 4   Single_epithelial_cell_size  699 non-null    int64  
 5   Bare_nuclei                  683 non-null    float64
 6   Bland_chromatin              699 non-null    int64  
 7   Normal_nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
dtypes: float64(1), int64(8)
memory usage: 49.3 KB


In [7]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Class   699 non-null    int64
dtypes: int64(1)
memory usage: 5.6 KB


In [8]:
X.isnull().sum()

Unnamed: 0,0
Clump_thickness,0
Uniformity_of_cell_size,0
Uniformity_of_cell_shape,0
Marginal_adhesion,0
Single_epithelial_cell_size,0
Bare_nuclei,16
Bland_chromatin,0
Normal_nucleoli,0
Mitoses,0


In [9]:
#find indices where column 'Bare_nuclei' has missing values
nan_indices=X[X['Bare_nuclei'].isnull()].index

In [10]:
#drop rows with null values
X=X.dropna()
#check number of null values in each column after dropping
X.isnull().sum()

Unnamed: 0,0
Clump_thickness,0
Uniformity_of_cell_size,0
Uniformity_of_cell_shape,0
Marginal_adhesion,0
Single_epithelial_cell_size,0
Bare_nuclei,0
Bland_chromatin,0
Normal_nucleoli,0
Mitoses,0


In [11]:
#drop target values corresponding to indices where features had null values
y=y.drop(nan_indices)

In [13]:
#print the shape
print(X.shape)
print(y.shape)

(683, 9)
(683, 1)


In [14]:
#split into training and testing sets
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)

Part 2: Apply a Wrapper Method
1. Use Recursive Feature Elimination (RFE) with a Logistic Regression model to perform feature selection:
• Select the top 5 features that contribute the most to predicting the target variable.
• Visualize the ranking of features.
2. Train the Logistic Regression model using only the selected features.

In [15]:
#define the model
model= LogisticRegression(max_iter=200)

In [16]:
#initialize RFE
n_features_to_select= 2
rfe= RFE(estimator=model, n_features_to_select=n_features_to_select)

In [17]:
#fit RFE
rfe.fit(X_train, y_train)

In [18]:
#get boolean mask of selected features from RFE
selected_features=rfe.support_
#get ranking of features from RFE
ranking= rfe.ranking_

In [19]:
#transform the feature set to keep only selected feature
X_train_rfe= rfe.transform(X_train)
X_test_rfe= rfe.transform(X_test)

In [20]:
#train model based on selected features
model.fit(X_train_rfe, y_train)

Part 3: Model Evaluation
1. Evaluate the model’s performance using the testing set:
• Metrics to calculate: Accuracy, Precision, Recall, F1-Score, and ROC-AUC.
2. Compare the performance of the model trained on all features versus the model trained on the selected
features.

Part 4: Experiment
1. Experiment with different numbers of selected features (e.g., top 3, top 7).
2. Discuss how feature selection affects model performance.

In [21]:
#predict and evaluate the model
y_pred= model.predict(X_test_rfe)
accuracy= accuracy_score(y_test, y_pred)
print(f"Selected features mask:{selected_features}")
print(f"Feature ranking:{ranking}")
print(f"Model accuracy with selected features:{accuracy}")

Selected features mask:[False False  True False False False  True False False]
Feature ranking:[3 8 1 5 7 4 1 6 2]
Model accuracy with selected features:0.8978102189781022
