## Library Imports

The following libraries are imported for use throughout the project:

- **pandas**: Data manipulation and analysis.
- **sklearn.svm.SVC**: Support Vector Classifier.
- **sklearn.linear_model.LogisticRegression**: Logistic Regression model.
- **sklearn.ensemble.RandomForestClassifier**: Random Forest Classifier.
- **sklearn.metrics**: Evaluation metrics including accuracy, F1 score, recall, precision, and confusion matrix.
- **sklearn.model_selection**: Tools for splitting data and hyperparameter tuning using GridSearchCV.
- **sklearn.pipeline.Pipeline**: Streamlined model building with preprocessing and modeling steps.
- **sklearn.feature_selection**: Feature selection using SelectKBest and Recursive Feature Elimination (RFE).
- **sklearn.preprocessing.StandardScaler**: Feature scaling that standardizes input variables to zero mean and unit variance.


In [1]:
import pandas as pd
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

## Data Loading and Initial Preprocessing

- The dataset is loaded from a CSV file using `pandas`.
- Display options are adjusted to show all columns and their full content.
- The `id` column is dropped as it is not useful for prediction.
- The target column `diagnosis` is converted from categorical (`M` for malignant, `B` for benign) to binary numeric format (1 for malignant, 0 for benign).


In [2]:
df = pd.read_csv("Breast Cancer Wisconsin Dataset.csv")
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None

df.drop(columns=["id"], inplace=True)
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
df["diagnosis"].unique()

array([1, 0])
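
A minimal sketch, on toy data, of encoding the labels by assigning a mapped column back to the frame and then checking that no unexpected label slipped through. The `toy` frame is hypothetical and only stands in for the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# Assign the mapped column back instead of modifying df[col] in place.
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

# map() turns any label missing from the dictionary into NaN, so check for that.
assert toy["diagnosis"].notna().all()
print(toy["diagnosis"].tolist())  # [1, 0, 0, 1]
```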

## Feature Overview

The column names are listed to show which features remain after preprocessing.


In [3]:
df.columns

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'diagnosis'],
      dtype='object')

The first five rows of the dataset give a snapshot of the feature values and the target variable.

In [4]:
df.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1


The distribution of the target classes (benign and malignant) is examined to assess class balance.
The dataset contains 357 benign cases (0) and 212 malignant cases (1), indicating a moderate class imbalance.

In [5]:
df["diagnosis"].value_counts()

diagnosis
0    357
1    212
Name: count, dtype: int64
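
To make the imbalance concrete, the sketch below turns the counts above into class proportions and into the weights that scikit-learn's `class_weight="balanced"` option (used by the models in this notebook) would assign, following its documented formula `n_samples / (n_classes * count)`. It uses the full-dataset counts for illustration; the models themselves are fit on the training split, so their actual weights will differ slightly.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 357, 1: 212}  # 0 = benign, 1 = malignant

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the dataset.
proportions = {label: n / n_samples for label, n in counts.items()}

# "balanced" weights: rarer classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label in counts:
    print(label, round(proportions[label], 3), round(weights[label], 3))
# 0 0.627 0.797
# 1 0.373 1.342
```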

The features show significant variation in their ranges. For example, `radius_mean` ranges from 6.98 to 28.11, and `area_mean` ranges from 143.5 to 2501.0. This variation highlights the need for feature scaling before model training.


In [7]:
df.describe()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405172,1.216853,2.866059,40.337079,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.372583
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277313,0.551648,2.021855,45.491006,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.2324,0.8339,1.606,17.85,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.3242,1.108,2.287,24.53,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,0.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.4789,1.474,3.357,45.19,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873,4.885,21.98,542.2,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0
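
As a rough illustration of what scaling does to these ranges, the sketch below standardizes the maximum of `radius_mean` and `area_mean` using the mean and standard deviation from the table above. Note one assumption: `describe()` reports the sample standard deviation, while `StandardScaler` divides by the population standard deviation, so the numbers are approximate (the difference is negligible with 569 rows).

```python
# Mean, std, and max copied from the describe() output above.
summary = {
    "radius_mean": {"mean": 14.127292, "std": 3.524049, "max": 28.11},
    "area_mean": {"mean": 654.889104, "std": 351.914129, "max": 2501.0},
}

# z-score: how many standard deviations a value lies from the mean.
z_scores = {
    name: (s["max"] - s["mean"]) / s["std"] for name, s in summary.items()
}

for name, z in z_scores.items():
    print(f"{name}: raw max = {summary[name]['max']}, standardized max = {z:.2f}")
# radius_mean: raw max = 28.11, standardized max = 3.97
# area_mean: raw max = 2501.0, standardized max = 5.25
```

After standardization both features live on a comparable scale, even though their raw maxima differ by two orders of magnitude.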


### Data Splitting

The dataset is split into features and target variable:

- `x` contains the features from `radius_mean` to `fractal_dimension_worst`.
- `y` is the target variable, which is the `diagnosis` column.

The data is then split into training and testing sets with an 80/20 ratio using `train_test_split`, ensuring reproducibility with a fixed `random_state` value of 10.


In [7]:
x = df.loc[:, "radius_mean" : "fractal_dimension_worst"]
y = df["diagnosis"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=10)
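
The split above is not stratified, so the class ratio in the two subsets depends on the random seed. As an optional variant (not what this notebook does), passing `stratify=y` keeps the ratio nearly equal in both subsets. The sketch assumes scikit-learn's built-in copy of the Wisconsin data instead of the CSV file; that copy encodes malignant as 0 and benign as 1, the reverse of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Built-in copy of the dataset, used here so the sketch is self-contained.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y preserves the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

print(len(X_tr), len(X_te))  # 455 114
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # class-1 share in each subset
```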

### Model Evaluation Function

A custom function `test_scores` is defined to evaluate model performance on the test set. It returns a dictionary containing:

- **Model**: The model's name (for reference).
- **Accuracy**: Overall correctness of the model.
- **Precision**: Share of predicted positive cases that are actually positive.
- **Recall**: Share of actual positive cases that the model detects.
- **F1 Score**: Harmonic mean of precision and recall.
- **Confusion Matrix**: Detailed breakdown of true/false positives and negatives.

This function simplifies comparison between models by standardizing performance metrics.


In [8]:
def test_scores(model_name, predictions):
    
    accuracy = accuracy_score(y_test, predictions)
    confusion = confusion_matrix(y_test, predictions)
    precision = precision_score(y_test, predictions)
    recall = recall_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    
    # Rows of the matrix are true labels, columns are predicted labels.
    tn, fp, fn, tp = confusion.ravel()
    confusion_str = (f"True Negative: {tn}, "
                     f"True Positive: {tp}, "
                     f"False Positive: {fp}, "
                     f"False Negative: {fn}")
    
    return {
        "Model": model_name,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
        "Confusion Matrix": confusion_str
    }

### Support Vector Classifier (SVC) Pipeline and Hyperparameter Tuning

An SVC model is built using a pipeline that includes:

- **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance.
- **SelectKBest**: Selects the top `k` features based on the ANOVA F-value (`f_classif`).
- **SVC**: A Support Vector Classifier with `class_weight="balanced"` to handle class imbalance.

GridSearchCV is used to tune hyperparameters across a 10-fold cross-validation setup. The parameters searched include:

- `select__k`: Number of features to select (from 10 to 30).
- `svc__C`: Regularization parameter values `[0.01, 0.1, 1, 10]`.
- `svc__kernel`: Kernel type (`linear` or `rbf`).

The grid search evaluates multiple scoring metrics and refits the model based on the best recall score, optimizing for sensitivity in detecting positive cases.
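
The size of this search follows directly from the grid. A small sketch that enumerates the combinations with the standard library (not with scikit-learn itself) gives the number of candidate configurations and the number of cross-validation fits:

```python
from itertools import product

# Same grid as described above.
param_grid = {
    "select__k": range(10, 31),
    "svc__C": [0.01, 0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_folds = 10

print(len(candidates))            # 168
print(len(candidates) * n_folds)  # 1680
```

One more fit is needed at the end, when the best configuration is refit on the whole training set.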


In [9]:
svcPipe = Pipeline([
    ("scaler", StandardScaler()),
    ('select', SelectKBest(score_func=f_classif)),
    ('svc', SVC(class_weight="balanced"))
])

cv_SVC = GridSearchCV(svcPipe, {
    "select__k": range(10,31),
    "svc__C": [0.01, 0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"]
    }, cv=10,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    refit="recall",
    n_jobs=-1)

cv_SVC.fit(x_train, y_train)

### Evaluating SVC Cross-Validation Results

The results from the `GridSearchCV` are stored in a DataFrame for analysis. Key steps include:

- **Selecting Relevant Columns**: Focuses on hyperparameters (`k`, `C`, `kernel`) and performance metrics (mean and standard deviation of accuracy, precision, recall, and F1 score).
- **Ranking by Recall**: Since the model is refit using recall, results are sorted by `rank_test_recall` to prioritize configurations that best capture positive cases.

The best-performing model configuration (based on recall) is isolated for detailed inspection, and the top 10 configurations are displayed for comparison.


In [10]:
svc_results_df = pd.DataFrame(cv_SVC.cv_results_)

selected_cols = [
    "param_select__k", "param_svc__C", "param_svc__kernel",
    "mean_test_accuracy", "mean_test_precision", "mean_test_recall", "mean_test_f1",
    "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1",
    "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"
]

cv_metrics = ["mean_test_accuracy", "mean_test_precision", "mean_test_recall", "mean_test_f1", "rank_test_recall"]

svc_cv_results_df = svc_results_df[cv_metrics].sort_values("rank_test_recall").head(1).drop(columns="rank_test_recall")
svc_results_df[selected_cols].sort_values("rank_test_recall").head(10)

Unnamed: 0,param_select__k,param_svc__C,param_svc__kernel,mean_test_accuracy,mean_test_precision,mean_test_recall,mean_test_f1,std_test_accuracy,std_test_precision,std_test_recall,std_test_f1,rank_test_accuracy,rank_test_precision,rank_test_recall,rank_test_f1
125,25,1.0,rbf,0.978019,0.972113,0.970588,0.970872,0.024081,0.037437,0.03946,0.032287,8,37,1,8
134,26,10.0,linear,0.973575,0.960313,0.970588,0.965154,0.029446,0.037104,0.047425,0.039515,25,65,1,22
166,30,10.0,linear,0.96256,0.934101,0.970588,0.951531,0.033003,0.048041,0.047425,0.043153,77,105,1,76
84,20,1.0,linear,0.980193,0.982604,0.965033,0.973406,0.020844,0.026666,0.038834,0.028347,1,8,4,1
117,24,1.0,rbf,0.975797,0.972113,0.964706,0.967652,0.025009,0.037437,0.047059,0.033813,14,37,5,14
100,22,1.0,linear,0.980145,0.982604,0.964706,0.973044,0.023096,0.026666,0.047059,0.031644,2,8,5,2
92,21,1.0,linear,0.980145,0.982604,0.964706,0.973044,0.023096,0.026666,0.047059,0.031644,2,8,5,2
116,24,1.0,linear,0.977971,0.977341,0.964706,0.970341,0.022128,0.027879,0.047059,0.030354,9,25,5,9
142,27,10.0,linear,0.96913,0.955168,0.964706,0.959202,0.033261,0.042496,0.059988,0.045323,53,77,5,46
167,30,10.0,rbf,0.9757,0.970799,0.964706,0.967346,0.025169,0.029329,0.047059,0.034338,17,48,5,15
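
The rank columns in this table jump from 1 to 4 because tied configurations share the lowest rank of their group. The sketch below reproduces that pattern in plain Python for the first six recall values; the helper `min_rank_desc` is a hypothetical name, not a scikit-learn function.

```python
def min_rank_desc(scores):
    """Rank scores from best (highest) to worst; ties share the lowest rank."""
    return [1 + sum(other > s for other in scores) for s in scores]

# mean_test_recall of the first six rows in the table above.
mean_recall = [0.970588, 0.970588, 0.970588, 0.965033, 0.964706, 0.964706]

print(min_rank_desc(mean_recall))  # [1, 1, 1, 4, 5, 5]
```

Three configurations therefore tie for the best recall, and which of them is listed first does not reflect any difference in recall.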


To put the best-recall model's precision rank of 37 in context, the number of candidate configurations (entries in the `mean_test_precision` column) is checked.


In [11]:
svc_results_df["mean_test_precision"].count()

np.int64(168)

Next, the maximum value of `mean_test_precision` is checked to see the upper bound of precision achieved during cross-validation.


In [12]:
svc_results_df["mean_test_precision"].max()

np.float64(0.9888544891640866)
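
Putting these two numbers together with the table above gives a sense of how far the best-recall model really is from the best precision. The arithmetic below only uses values already shown in this notebook.

```python
n_candidates = 168                   # number of configurations in the grid
precision_rank = 37                  # rank_test_precision of the best-recall model
precision_mean = 0.972113            # its mean_test_precision
precision_std = 0.037437             # its std_test_precision
max_precision = 0.9888544891640866   # best mean_test_precision in the grid

share = precision_rank / n_candidates
gap = max_precision - precision_mean

print(f"{share:.0%}")       # 22%
print(round(gap, 4))        # 0.0167
print(gap < precision_std)  # True
```

Rank 37 is still within the top quarter of the grid, and the gap to the best precision is smaller than one fold-to-fold standard deviation.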

### Selected Features from Best SVC Model

The best-performing SVC pipeline includes a `SelectKBest` step with `k=25`. After fitting, the selection mask is retrieved to see which inputs were kept. `SelectKBest` scores each feature on its own with the ANOVA F-test, so the list shows the 25 features with the strongest individual association with the diagnosis, not their contribution inside the fitted SVC.


In [13]:
svc = cv_SVC.best_estimator_

svc_selected_features = svc.named_steps["select"].get_support()
x_train.columns[svc_selected_features]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'radius_se', 'perimeter_se',
       'area_se', 'compactness_se', 'concavity_se', 'concave points_se',
       'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
       'smoothness_worst', 'compactness_worst', 'concavity_worst',
       'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

### Excluded Features from Best SVC Model

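Features not selected by `SelectKBest` were excluded from the final SVC model because they had the lowest ANOVA F-scores among the 30 features. A low univariate score means a weak individual association with the diagnosis; it does not by itself show that the feature would add nothing to the classifier.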


In [14]:
x_train.columns[~svc_selected_features]

Index(['fractal_dimension_mean', 'texture_se', 'smoothness_se', 'symmetry_se',
       'fractal_dimension_se'],
      dtype='object')
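
How `SelectKBest` arrives at such a split can be sketched on scikit-learn's built-in copy of the dataset. This is an illustration under assumptions, not the notebook's fitted pipeline: it fits the selector on all rows without scaling, and the built-in column names differ from the CSV (for example `mean radius` instead of `radius_mean`).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Score every feature with the ANOVA F-test and keep the 25 best.
selector = SelectKBest(score_func=f_classif, k=25).fit(X, y)
mask = selector.get_support()  # boolean array, True for kept columns

kept = X.columns[mask]
dropped = X.columns[~mask]     # ~mask inverts the boolean mask

# The dropped columns are the ones with the five lowest F-scores.
lowest = X.columns[np.argsort(selector.scores_)[:5]]
print(len(kept), len(dropped))
print(sorted(dropped) == sorted(lowest))
```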

### SVC Model Performance on Test Set

The final SVC model achieved strong performance on the test set:

- **Accuracy**: 97.4%
- **Precision**: 92.9%
- **Recall**: 100%
- **F1 Score**: 96.3%
- **Confusion Matrix**: 72 true negatives, 39 true positives, 3 false positives, 0 false negatives

The model identified all positive cases, reflecting its recall-oriented optimization, at the cost of three false positives.


In [15]:
svc_predictions = svc.predict(x_test)

svc_scores = test_scores("SVC", svc_predictions)
svc_scores

{'Model': 'SVC',
 'Accuracy': 0.9736842105263158,
 'Precision': 0.9285714285714286,
 'Recall': 1.0,
 'F1': 0.9629629629629629,
 'Confusion Matrix': 'True Negative: 72, True Positive: 39, False Positive: 3, False Negative: 0'}
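
The reported scores can be re-derived by hand from the four confusion-matrix counts, which is a quick way to check that the metrics mean what the definitions say:

```python
# Counts from the SVC confusion matrix above.
tn, fp, fn, tp = 72, 3, 0, 39

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # true positives / predicted positives
recall = tp / (tp + fn)                     # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9737 0.9286 1.0 0.963
```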

### Logistic Regression Pipeline and Hyperparameter Tuning

A pipeline is constructed for Logistic Regression, incorporating feature scaling and selection. Hyperparameters—regularization strength (`C`) and the number of selected features (`k`)—are optimized using `GridSearchCV` with 10-fold cross-validation and multiple evaluation metrics.


In [16]:
lgPipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("lg", LogisticRegression(max_iter=1000, class_weight="balanced"))
                   ])

cv_LG = GridSearchCV(lgPipe, {
    "select__k": range(10,31),
    "lg__C": [0.01, 0.1, 1, 10, 100]
    }, cv=10
    , scoring=['accuracy', 'precision', 'recall', 'f1'],
    refit="f1",
    n_jobs=-1
    )

cv_LG.fit(x_train, y_train)
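
With several scorers, the `refit` argument names the metric that defines the "best" configuration. The sketch below shows this on a deliberately small grid so that it runs quickly. It assumes scikit-learn's built-in copy of the dataset, where the positive class (1) is benign, so its scores are not comparable to the ones in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("lg", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Reduced grid and fewer folds than in the notebook, for speed only.
search = GridSearchCV(
    pipe,
    {"select__k": [10, 20, 30], "lg__C": [0.1, 1]},
    cv=5,
    scoring=["accuracy", "precision", "recall", "f1"],
    refit="f1",
)
search.fit(X, y)

# best_index_ points at the row of cv_results_ that ranks first on the refit metric.
print(search.best_params_)
print(search.cv_results_["rank_test_f1"][search.best_index_])  # 1
```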

### Evaluating Logistic Regression Cross-Validation Results

The `GridSearchCV` results for Logistic Regression are analyzed by focusing on key metrics such as accuracy, precision, recall, and F1 score. The results are sorted by **F1 score** because it offers a balanced measure of a model’s performance, considering both precision and recall.

The configuration that comes first when sorting by recall shares the top recall rank (mean recall 0.965) but ranks lower on precision and F1. The configuration ranked first on F1 has the same mean recall, also ranks first on accuracy and precision, and uses one more feature (`k=30`). This makes the F1-sorted model the better overall choice, as it does not give up any recall while improving the other measures.

The top-performing model, based on F1 score, is identified by sorting the results and isolating the configuration with the highest F1 rank. The top 10 configurations are displayed for comparison, showing different combinations of feature selection (`k`) and regularization strength (`C`).


In [17]:
lg_results_df = pd.DataFrame(cv_LG.cv_results_)
selected_cols = [
    "param_select__k", "param_lg__C",
    "mean_test_accuracy", "mean_test_precision", "mean_test_recall", "mean_test_f1",
    "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1",
    "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"
]

lg_cv_results_df = lg_results_df[["mean_test_accuracy", "mean_test_precision", "mean_test_recall", "mean_test_f1", "rank_test_f1"]].sort_values("rank_test_f1").head(1).drop(columns=["rank_test_f1"])
lg_results_df[selected_cols].sort_values("rank_test_f1").head(10)

Unnamed: 0,param_select__k,param_lg__C,mean_test_accuracy,mean_test_precision,mean_test_recall,mean_test_f1,std_test_accuracy,std_test_precision,std_test_recall,std_test_f1,rank_test_accuracy,rank_test_precision,rank_test_recall,rank_test_f1
62,30,1.0,0.980145,0.982604,0.964706,0.973044,0.023096,0.026666,0.047059,0.031644,1,1,1,1
52,20,1.0,0.978019,0.982604,0.95915,0.970376,0.01977,0.026666,0.037525,0.026927,2,1,5,2
39,28,0.1,0.977971,0.982604,0.958824,0.970203,0.01977,0.026666,0.037665,0.026925,3,1,11,3
61,29,1.0,0.977971,0.977049,0.964706,0.970187,0.022128,0.028205,0.047059,0.030344,3,27,1,4
59,27,1.0,0.977971,0.982604,0.958824,0.970013,0.022128,0.026666,0.045943,0.030341,3,1,11,5
41,30,0.1,0.977923,0.982604,0.958824,0.970013,0.022128,0.026666,0.045943,0.030341,10,1,11,5
40,29,0.1,0.977923,0.982604,0.958824,0.970013,0.022128,0.026666,0.045943,0.030341,10,1,11,5
55,23,1.0,0.977971,0.982604,0.958824,0.970013,0.022128,0.026666,0.045943,0.030341,3,1,11,5
53,21,1.0,0.977971,0.982604,0.958824,0.970013,0.022128,0.026666,0.045943,0.030341,3,1,11,5
56,24,1.0,0.977971,0.982604,0.958824,0.970013,0.022128,0.026666,0.045943,0.030341,3,1,11,5


### Selected Features from Best Logistic Regression Model

The grid search selected `k=30` for the best Logistic Regression pipeline. Because the dataset has exactly 30 features, `SelectKBest` keeps every column and the selection step has no effect. This suggests that Logistic Regression does not benefit from dropping features here, although the margin is small: the next-best configuration (`k=20`) has a mean F1 of 0.970 against 0.973.


In [18]:
lg = cv_LG.best_estimator_

lg_selected_features = lg.named_steps["select"].get_support()
x_train.columns[lg_selected_features]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [19]:
x_train.columns[~lg_selected_features]

Index([], dtype='object')
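
An empty exclusion list is what one would expect whenever `k` equals the number of input columns. A small sketch on random data (hypothetical values, 30 columns to match this dataset) shows that the selector then passes the input through unchanged, which is the same behavior as `k="all"`:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Random stand-in data: 60 rows, 30 features, two balanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = np.array([0, 1] * 30)

full = SelectKBest(f_classif, k=30).fit(X, y)

# Every feature is kept, so transform() returns the input unchanged.
print(full.get_support().all())  # True
same = np.array_equal(full.transform(X), X)
print(same)                      # True
```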

### Logistic Regression Model Performance on Test Set

The Logistic Regression model achieved the following performance on the test set:

- **Accuracy**: 95.6%
- **Precision**: 92.5%
- **Recall**: 94.9%
- **F1 Score**: 93.7%
- **Confusion Matrix**: 72 true negatives, 37 true positives, 3 false positives, 2 false negatives

The model shows strong overall performance with a high recall, indicating good detection of positive cases, and good precision and F1 scores, reflecting a balanced performance.


In [20]:
lg_predictions = lg.predict(x_test)
lg_scores = test_scores("Logistic Regression", lg_predictions)

lg_scores

{'Model': 'Logistic Regression',
 'Accuracy': 0.956140350877193,
 'Precision': 0.925,
 'Recall': 0.9487179487179487,
 'F1': 0.9367088607594937,
 'Confusion Matrix': 'True Negative: 72, True Positive: 37, False Positive: 3, False Negative: 2'}

### Random Forest Pipeline and Hyperparameter Tuning

A pipeline is built for the Random Forest Classifier, using Recursive Feature Elimination (RFE) for feature selection. The model is tuned via `GridSearchCV`, exploring different configurations for:

- The number of features to select (`n_features_to_select`) in the RFE step (ranging from 10 to 30).
- The number of estimators (`rf__n_estimators`) in the final Random Forest step (100, 200, 300, or 400).

Cross-validation is performed with multiple evaluation metrics, and the model is refitted based on the best recall score to prioritize sensitivity in detecting positive cases.


In [21]:
rf = RandomForestClassifier(criterion="entropy", class_weight="balanced", random_state=10)

rfPipe = Pipeline([
    ("select", RFE(rf)),
    ("rf", rf)
])

cv_RF = GridSearchCV(rfPipe, {
    "select__n_features_to_select": range(10,31),
    "rf__n_estimators": [100, 200, 300, 400]
    }, cv=10
    , scoring=['accuracy', 'precision', 'recall', 'f1'],
    refit="recall",
    n_jobs=-1
        )

cv_RF.fit(x_train, y_train)
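
One detail of this setup may be worth checking: the grid key `rf__n_estimators` addresses only the final `rf` step. The forest inside `RFE` is a separate copy once `GridSearchCV` clones the pipeline, and it would be addressed as `select__estimator__n_estimators`. The sketch below imitates that cloning step by hand; `forest` and `candidate` are illustrative names.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

forest = RandomForestClassifier(random_state=10)
pipe = Pipeline([("select", RFE(forest)), ("rf", forest)])

# GridSearchCV works on clones, which turns the shared forest into two copies.
candidate = clone(pipe)
candidate.set_params(rf__n_estimators=400)

inner = candidate.named_steps["select"].estimator  # forest used by RFE
final = candidate.named_steps["rf"]                # forest used for prediction

print(inner.n_estimators, final.n_estimators)  # 100 400
print("select__estimator__n_estimators" in candidate.get_params())  # True
```

Under this reading, feature elimination in the search above always runs with the default 100 trees, whatever value is tried for the final forest.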

### Evaluating Random Forest Cross-Validation Results

The `GridSearchCV` results for the Random Forest model are analyzed by focusing on key metrics such as accuracy, precision, recall, and F1 score. The results are sorted by **recall** since the model is optimized to prioritize detecting positive cases.

The top-performing model, based on recall, is identified by sorting the results and isolating the configuration with the highest recall rank. The top 10 configurations are displayed for comparison, showing the combinations of selected features (`n_features_to_select`) and the number of estimators (`n_estimators`).


In [22]:
rf_results_df = pd.DataFrame(cv_RF.cv_results_)
selected_cols = [
    "param_select__n_features_to_select", "param_rf__n_estimators",
    "mean_test_accuracy", "mean_test_precision", "mean_test_recall", "mean_test_f1",
    "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1",
    "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"
]

rf_cv_results_df = rf_results_df[cv_metrics].sort_values("rank_test_recall").head(1).drop(columns="rank_test_recall")
rf_results_df[selected_cols].sort_values("rank_test_recall").head(10)

Unnamed: 0,param_select__n_features_to_select,param_rf__n_estimators,mean_test_accuracy,mean_test_precision,mean_test_recall,mean_test_f1,std_test_accuracy,std_test_precision,std_test_recall,std_test_f1,rank_test_accuracy,rank_test_precision,rank_test_recall,rank_test_f1
9,19,100,0.964734,0.965033,0.94183,0.953085,0.022647,0.038834,0.026339,0.02978,1,29,1,1
6,16,100,0.964638,0.965325,0.941503,0.952718,0.03023,0.046785,0.045575,0.040094,4,25,2,3
4,14,100,0.962512,0.964665,0.935948,0.949687,0.026398,0.039075,0.04129,0.035269,7,48,3,12
25,14,200,0.962512,0.964665,0.935948,0.949687,0.026398,0.039075,0.04129,0.035269,7,48,3,12
29,18,200,0.962464,0.965033,0.935948,0.949865,0.029953,0.046905,0.04129,0.039563,17,29,3,9
30,19,200,0.962512,0.964665,0.935948,0.949876,0.024456,0.039075,0.031825,0.032495,14,48,3,8
26,15,200,0.96029,0.95915,0.935948,0.947013,0.029513,0.045827,0.04129,0.039055,37,63,3,36
27,16,200,0.962512,0.964665,0.935948,0.949687,0.026398,0.039075,0.04129,0.035269,7,48,3,12
24,13,200,0.96029,0.95915,0.935948,0.947013,0.029513,0.045827,0.04129,0.039055,37,63,3,36
61,29,300,0.962464,0.965686,0.935948,0.950028,0.024512,0.045781,0.031825,0.032,17,24,3,7


### Selected Features from Best Random Forest Model

The best-performing Random Forest model includes a feature selection step using Recursive Feature Elimination (RFE). The selected features are retrieved to identify which variables were considered most important for the final model. These features are the ones that contributed the most to the model’s classification ability.


In [23]:
rf = cv_RF.best_estimator_

rf_selected_features = rf.named_steps["select"].get_support()
x_train.columns[rf_selected_features]

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst', 'concavity_worst',
       'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

### Excluded Features from Best Random Forest Model

The features not selected by Recursive Feature Elimination (RFE) were excluded from the final Random Forest model. These features were deemed less informative for the classification task and did not contribute significantly to the model's performance.


In [24]:
x_train.columns[~rf_selected_features]

Index(['smoothness_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'texture_se', 'smoothness_se', 'compactness_se', 'concavity_se',
       'concave points_se', 'symmetry_se', 'fractal_dimension_se',
       'compactness_worst'],
      dtype='object')

### Random Forest Model Performance on Test Set

The Random Forest model achieved the following performance on the test set:

- **Accuracy**: 97.4%
- **Precision**: 95.0%
- **Recall**: 97.4%
- **F1 Score**: 96.2%
- **Confusion Matrix**: 73 true negatives, 38 true positives, 2 false positives, 1 false negative

The model performs exceptionally well, with high precision and recall, indicating a good balance between detecting positive cases and minimizing false positives.


In [25]:
# GridSearchCV already refit best_estimator_ on x_train (refit="recall"), so no extra fit is needed.

rf_predictions = rf.predict(x_test)
rf_scores = test_scores("Random Forest", rf_predictions)

rf_scores

{'Model': 'Random Forest',
 'Accuracy': 0.9736842105263158,
 'Precision': 0.95,
 'Recall': 0.9743589743589743,
 'F1': 0.9620253164556962,
 'Confusion Matrix': 'True Negative: 73, True Positive: 38, False Positive: 2, False Negative: 1'}

### Random Forest Model without Feature Selection (RFE)

This Random Forest model is trained without the RFE feature selection step used in the previous model. RFE was dropped because it made training of the first model considerably slower. Training without it is faster and allows a direct comparison of performance with and without feature selection.

The hyperparameters explored include the number of estimators (`n_estimators`), ranging from 100 to 400, to assess the impact of the ensemble size on model performance.


In [26]:
rf2 = RandomForestClassifier(criterion="entropy", class_weight="balanced", random_state=10)

rf2Pipe = Pipeline([
    ("rf", rf2)
])

cv_RF2 = GridSearchCV(rf2Pipe, {
    "rf__n_estimators": [100, 200, 300, 400]
    }, cv=10
    , scoring=['accuracy', 'precision', 'recall', 'f1'],
    refit="recall",
    n_jobs=-1
        )

cv_RF2.fit(x_train, y_train)

### Evaluating Random Forest Model without Feature Selection (RFE)

The `GridSearchCV` results for the Random Forest model without feature selection are analyzed by focusing on key metrics such as accuracy, precision, recall, and F1 score. The results are sorted by **recall**, as the model is optimized to prioritize detecting positive cases.

The top-performing model, based on recall, is identified by sorting the results and isolating the configuration with the highest recall rank. The grid contains only four configurations, one per value of `n_estimators` (100, 200, 300, and 400), so all of them are displayed with their performance metrics.


In [27]:
rf2_results_df = pd.DataFrame(cv_RF2.cv_results_)
selected_cols = [
    "param_rf__n_estimators",
    "mean_test_accuracy", "mean_test_precision", "mean_test_recall", "mean_test_f1",
    "std_test_accuracy", "std_test_precision", "std_test_recall", "std_test_f1",
    "rank_test_accuracy", "rank_test_precision", "rank_test_recall", "rank_test_f1"
]

rf2_cv_results_df = rf2_results_df[cv_metrics].sort_values("rank_test_recall").head(1).drop(columns="rank_test_recall")
rf2_results_df[selected_cols].sort_values("rank_test_recall").head(10)

Unnamed: 0,param_rf__n_estimators,mean_test_accuracy,mean_test_precision,mean_test_recall,mean_test_f1,std_test_accuracy,std_test_precision,std_test_recall,std_test_f1,rank_test_accuracy,rank_test_precision,rank_test_recall,rank_test_f1
1,200,0.960242,0.965033,0.930065,0.946835,0.027835,0.046905,0.035535,0.036661,1,2,1,1
2,300,0.960242,0.965033,0.930065,0.946835,0.027835,0.046905,0.035535,0.036661,1,2,1,1
3,400,0.960242,0.965033,0.930065,0.946835,0.027835,0.046905,0.035535,0.036661,1,2,1,1
0,100,0.960242,0.965325,0.929739,0.946468,0.027835,0.046785,0.044118,0.036987,1,1,4,4


### Random Forest Model Performance on Test Set (Without Feature Selection)

The Random Forest model trained without feature selection achieved the following performance on the test set:

- **Accuracy**: 97.4%
- **Precision**: 95.0%
- **Recall**: 97.4%
- **F1 Score**: 96.2%
- **Confusion Matrix**: 73 true negatives, 38 true positives, 2 false positives, 1 false negative

This model shows identical test-set metrics to the model with feature selection, so removing the feature selection step did not change the classification results on this split. In cross-validation, the model with RFE scored slightly higher on mean recall (0.942 against 0.930).


In [28]:
rf2 = cv_RF2.best_estimator_
# best_estimator_ is already fitted on x_train by GridSearchCV, so no extra fit is needed.

rf2_predictions = rf2.predict(x_test)
rf2_scores = test_scores("Random Forest Without Feature Selection", rf2_predictions)

rf2_scores

{'Model': 'Random Forest Without Feature Selection',
 'Accuracy': 0.9736842105263158,
 'Precision': 0.95,
 'Recall': 0.9743589743589743,
 'F1': 0.9620253164556962,
 'Confusion Matrix': 'True Negative: 73, True Positive: 38, False Positive: 2, False Negative: 1'}
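
Identical metrics show that two models made the same number of each kind of error, not necessarily the same errors. The toy example below (hypothetical labels, plain Python) has two prediction lists with equal confusion counts that still disagree on four rows. In the notebook, the same comparison could be run on `rf_predictions` and `rf2_predictions`.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp

# Toy labels and two different sets of predictions.
y_true = [1, 1, 0, 0, 0, 1]
pred_a = [1, 0, 1, 0, 0, 1]
pred_b = [0, 1, 0, 1, 0, 1]

print(confusion_counts(y_true, pred_a))  # (2, 1, 1, 2)
print(confusion_counts(y_true, pred_b))  # (2, 1, 1, 2)

# Number of rows where the two prediction lists differ.
disagreements = sum(a != b for a, b in zip(pred_a, pred_b))
print(disagreements)  # 4
```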

### Comparison of Model Performance on Test Set

The following table compares the performance of the models based on their predictions on the test set (`x_test`). The models are sorted by **Recall**, then **Accuracy**, **Precision**, and **F1 Score**. 

- **SVC** achieved the highest recall (100%, no false negatives), but its precision (92.9%) is slightly lower than that of the Random Forest models (95.0%).
- Both **Random Forest** models, with and without feature selection, produce identical results and outperform Logistic Regression in both precision and recall.
- **Logistic Regression** has slightly lower recall (94.9%) and accuracy (95.6%) than the Random Forest models, but it still performs well, with precision (92.5%) and recall close to each other.



In [29]:
combined_scores = [svc_scores, lg_scores, rf_scores, rf2_scores]
comparative_df = pd.DataFrame(combined_scores)

comparative_df.sort_values(["Recall", "Accuracy", "Precision", "F1"], ascending=False)

Unnamed: 0,Model,Accuracy,Precision,Recall,F1,Confusion Matrix
0,SVC,0.973684,0.928571,1.0,0.962963,"True Negative: 72, True Positive: 39, False Positive: 3, False Negative: 0"
2,Random Forest,0.973684,0.95,0.974359,0.962025,"True Negative: 73, True Positive: 38, False Positive: 2, False Negative: 1"
3,Random Forest Without Feature Selection,0.973684,0.95,0.974359,0.962025,"True Negative: 73, True Positive: 38, False Positive: 2, False Negative: 1"
1,Logistic Regression,0.95614,0.925,0.948718,0.936709,"True Negative: 72, True Positive: 37, False Positive: 3, False Negative: 2"
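
Since the confusion matrix is stored as a formatted string, the counts cannot be compared numerically as they stand. The following is a minimal sketch of one way to parse those strings back into integers; `parse_confusion` is a hypothetical helper, not part of the notebook, and the input strings are copied from the table above.

```python
import re

# Confusion-matrix strings copied from the comparison table.
confusion_strings = {
    "SVC": "True Negative: 72, True Positive: 39, False Positive: 3, False Negative: 0",
    "Random Forest": "True Negative: 73, True Positive: 38, False Positive: 2, False Negative: 1",
    "Random Forest Without Feature Selection": "True Negative: 73, True Positive: 38, False Positive: 2, False Negative: 1",
    "Logistic Regression": "True Negative: 72, True Positive: 37, False Positive: 3, False Negative: 2",
}


def parse_confusion(text):
    """Turn 'True Negative: 72, ...' into {'True Negative': 72, ...}."""
    pairs = re.findall(r"([A-Za-z ]+):\s*(\d+)", text)
    return {label.strip(): int(count) for label, count in pairs}


parsed = {model: parse_confusion(text) for model, text in confusion_strings.items()}

# Missed malignant cases (false negatives) per model, fewest first.
missed = {model: counts["False Negative"] for model, counts in parsed.items()}
for model, fn_count in sorted(missed.items(), key=lambda item: item[1]):
    print(f"{model}: {fn_count} false negative(s)")
```

Storing the four counts as separate numeric columns from the start would avoid the parsing step altogether.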


### Comparison of Cross Validation Results

The following comparison summarizes the models' cross-validation results, sorted primarily by **Recall** (highest first), with **Accuracy** and **Precision** breaking ties.

- **Logistic Regression** shows the highest **mean_test_accuracy** and **mean_test_precision**, though its **mean_test_recall** is slightly lower than that of **SVC**.
- **SVC** outperforms the other models in **mean_test_recall**, indicating it is most effective at identifying positive instances.
- **Random Forest** (with feature selection) and **Random Forest Without Feature Selection** score lower than **Logistic Regression** and **SVC** on every metric. The two share the same mean precision (0.965), and **Random Forest Without Feature Selection** has the lowest accuracy, recall, and F1 score.
- Overall, **SVC** and **Logistic Regression** provide better balanced performance in terms of recall and precision compared to the Random Forest models.

This comparison of cross-validation results helps confirm the robustness of **SVC** and **Logistic Regression**, while suggesting that feature selection in Random Forest might have a minimal impact on overall performance.


In [30]:
compare_cv_result_df = pd.concat(
    [svc_cv_results_df, lg_cv_results_df, rf_cv_results_df, rf2_cv_results_df],
    keys=["SVC", "Logistic Regression", "Random Forest", "Random Forest Without Feature Selection"],
)
# Keep only the model name (outer index level) as the row label.
compare_cv_result_df.index = compare_cv_result_df.index.get_level_values(0)

# Rank by recall, then accuracy, precision, and F1 as tie-breakers.
sort_keys = ["mean_test_recall", "mean_test_accuracy", "mean_test_precision", "mean_test_f1"]
compare_cv_result_df = compare_cv_result_df.sort_values(sort_keys, ascending=False)
compare_cv_result_df

Model,mean_test_accuracy,mean_test_precision,mean_test_recall,mean_test_f1
SVC,0.978019,0.972113,0.970588,0.970872
Logistic Regression,0.980145,0.982604,0.964706,0.973044
Random Forest,0.964734,0.965033,0.94183,0.953085
Random Forest Without Feature Selection,0.960242,0.965033,0.930065,0.946835
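
One rough way to judge whether feature selection matters for Random Forest is to compare the gap in mean recall with the fold-to-fold spread. The sketch below uses the two mean recalls from the table above and the `std_test_recall` reported for the best configuration without feature selection. Comparing a gap to one standard deviation is only a heuristic, not a significance test.

```python
# Mean cross-validation recall for the two Random Forest variants (from the table above).
rf_recall = 0.94183        # with feature selection
rf2_recall = 0.930065      # without feature selection

# Fold-to-fold standard deviation of recall for the best model without feature selection.
rf2_recall_std = 0.035535

gap = rf_recall - rf2_recall
gap_in_stds = gap / rf2_recall_std

print(f"recall gap: {gap:.4f}")
print(f"gap relative to one standard deviation: {gap_in_stds:.2f}")

# A gap well below one standard deviation is consistent with fold-to-fold noise.
if gap_in_stds < 1:
    print("The gap is smaller than the variation across folds.")
```

Here the gap is roughly a third of one standard deviation, which is consistent with the reading that feature selection has little effect on Random Forest in this dataset.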


### Conclusion

Based on the two comparison tables, one from the **test set results** and the other from **cross-validation**, we can draw several key conclusions about the performance of the models in the context of **breast cancer classification**:

1. **Test Set Results**:
   - **SVC** stands out with a **perfect recall** of 100%, which is crucial in breast cancer classification where identifying all positive (cancerous) cases is of utmost importance. This comes at the cost of slightly lower **precision** (3 false positives versus 2 for the Random Forest models), meaning it flags a few more non-cancerous cases as cancerous.
   - Both **Random Forest** models (with and without feature selection) offer a strong balance between **precision** and **recall**, performing well in detecting positive cases without generating too many false positives. These models provide reliable results but with a slight trade-off in recall when compared to **SVC**.
   - **Logistic Regression**, while solid in overall performance, has a slightly lower **recall** and **accuracy** compared to **SVC** and **Random Forest** models. This means that **Logistic Regression** might miss a few cancerous cases, which could be a concern in the context of early detection.

2. **Cross-Validation Results**:
   - **Logistic Regression** performs exceptionally well in cross-validation, showing the highest **accuracy** and **precision**. It is consistent and accurate across multiple data splits, but its **recall** is slightly below that of **SVC**, making it less effective at identifying all cancerous cases.
   - **SVC** shows strong performance, particularly in **recall**, which is critical for ensuring that positive (cancerous) instances are not missed. Its **precision** is somewhat lower than that of **Logistic Regression**, meaning it might incorrectly label more non-cancerous cases as cancerous.
   - **Random Forest** models, particularly the one without feature selection, show lower recall and accuracy than the other models. They may not be as effective at identifying all cancerous instances.

### Overall Conclusion
- **SVC** emerges as the best model when **maximizing recall** is critical in breast cancer classification. Ensuring that as many cancerous cases as possible are detected is paramount, and **SVC** does this effectively. However, its lower precision means that further tuning may be necessary to reduce false positives, which could lead to unnecessary follow-up procedures.
- **Logistic Regression** offers strong performance in terms of **accuracy** and **precision**, making it a solid option when false positives are a concern. However, it sacrifices some recall, meaning it may miss a few cancerous cases.
- **Random Forest** models offer a balanced approach and match **SVC** in test-set accuracy (97.4%), but they miss one cancerous case on the test set where **SVC** misses none, and they have the lowest recall and accuracy of all models in cross-validation. This makes them less suitable for this specific task.

In summary, **SVC** is the top choice for breast cancer classification, as it maximizes the detection of cancerous cases, which is the most critical factor in early diagnosis.
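
The preference for recall over precision can be expressed as a single number with the F-beta score, which weights recall `beta` times as heavily as precision. The notebook does not compute this metric. The sketch below is an illustration under the assumption that `beta=2` is a reasonable weighting, using the test-set precision and recall values reported earlier; `f_beta` and `pr_on_test_set` are hypothetical names.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: weighted harmonic mean that favours recall when beta > 1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) on the test set, copied from the comparison table.
# Both Random Forest variants have identical values, so one entry covers them.
pr_on_test_set = {
    "SVC": (0.928571, 1.0),
    "Random Forest": (0.95, 0.974359),
    "Logistic Regression": (0.925, 0.948718),
}

f2_scores = {model: f_beta(p, r) for model, (p, r) in pr_on_test_set.items()}

# Rank models from highest to lowest F2.
ranking = sorted(f2_scores, key=f2_scores.get, reverse=True)
for model in ranking:
    print(f"{model}: F2 = {f2_scores[model]:.4f}")
```

Under this weighting **SVC** ranks first, ahead of Random Forest and Logistic Regression, which matches the recall-first conclusion above. A different choice of `beta` could change the ordering.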
