# IV – Cancer Gene Expression Feature Selection – Genetic Algorithm Case Study

Genetic algorithms (GAs) are population-based evolutionary optimization methods inspired by the principles of natural selection. They iteratively evolve a set of candidate solutions through selection, crossover, and mutation, enabling effective exploration of large and complex search spaces. GAs are particularly well suited for non-convex, discrete, and combinatorial optimization problems, where the objective function is non-differentiable and characterized by many local optima. In this case study, we apply a GA to the problem of gene selection from high-dimensional cancer gene expression data, with the goal of identifying a compact subset of genes that maximizes predictive performance.

Gene expression datasets are a canonical example of a challenging optimization problem in biomedical machine learning. They typically contain thousands to tens of thousands of gene expression measurements for only a few hundred or thousand samples, resulting in an extreme high-dimensional, low-sample-size regime. Each sample corresponds to a patient, and the target label represents a clinical outcome such as cancer subtype or disease status. Predictive signal in such data rarely resides in individual genes; instead, it emerges from complex, nonlinear interactions among small groups of genes, often reflecting underlying biological pathways.

### Genetic Algorithm Formulation

- Variables: Each candidate solution is represented as a binary chromosome of length d, where d is the number of genes. A value of 1 indicates that the corresponding gene is included in the feature subset, while 0 indicates exclusion.

- Objective: Maximize classifier performance on held-out validation folds. In our GA, the fitness of a mask is the cross-validated score (average precision in the configuration used in this notebook) of a logistic regression model trained on the selected features.

- Constraints: To encourage interpretability and biological plausibility, we optionally impose a maximum number of selected genes (e.g., 30–50). The GA naturally handles this discrete constraint without requiring relaxation or approximation.

- GA Steps: An initial population of random masks is generated. In each generation, masks are evaluated (fitness), then the best are selected for reproduction. New offspring are created via crossover (combining bits from two parents) and mutation (flipping bits) to introduce diversity. Elite individuals may be carried over. This process repeats for many generations, gradually improving the feature subset.
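
The steps listed above can be sketched from scratch in NumPy. This is a minimal illustration under stated assumptions, not the implementation used by sklearn-genetic-opt: the data are synthetic, the fitness is the training accuracy of a least-squares linear classifier (a cheap stand-in for a cross-validated score), and the helper names (fitness, repair, tournament) and all hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data (an assumption for illustration): 200 samples, 30 features,
# of which only features 3, 7, and 11 carry signal.
n_samples, n_features, max_features = 200, 30, 5
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 3] + X[:, 7] - X[:, 11] > 0).astype(int)


def fitness(mask):
    """Toy fitness: training accuracy of a least-squares linear classifier."""
    if not mask.any():
        return 0.0
    Xs = np.column_stack([X[:, mask], np.ones(n_samples)])
    w, *_ = np.linalg.lstsq(Xs, 2.0 * y - 1.0, rcond=None)
    return float(((Xs @ w > 0).astype(int) == y).mean())


def repair(mask):
    """Enforce the sparsity constraint by switching off random surplus bits."""
    on = np.flatnonzero(mask)
    if on.size > max_features:
        drop = rng.choice(on, size=on.size - max_features, replace=False)
        mask[drop] = False
    return mask


def tournament(pop, scores, k=3):
    """Return a copy of the best of k randomly drawn individuals."""
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmax(scores[idx])]].copy()


pop_size, n_generations, n_elite, p_mut = 40, 30, 2, 0.02
pop = np.array([repair(rng.random(n_features) < 0.1) for _ in range(pop_size)])
history = []  # best fitness per generation

for _ in range(n_generations):
    scores = np.array([fitness(m) for m in pop])
    history.append(scores.max())
    elite = pop[np.argsort(scores)[::-1][:n_elite]].copy()      # elitism
    children = []
    while len(children) < pop_size - n_elite:
        p1, p2 = tournament(pop, scores), tournament(pop, scores)
        child = np.where(rng.random(n_features) < 0.5, p1, p2)  # uniform crossover
        child ^= rng.random(n_features) < p_mut                 # bit-flip mutation
        children.append(repair(child))
    pop = np.vstack([elite, np.array(children)])

scores = np.array([fitness(m) for m in pop])
best_mask = pop[np.argmax(scores)]
print("Best fitness:", scores.max())
print("Selected features:", np.flatnonzero(best_mask))
```

The repair step is one simple way to handle the maximum-feature constraint directly on the binary mask; penalizing oversized masks in the fitness function would be another. Because the elite individuals are copied unchanged and the fitness is deterministic, the best score in this sketch never decreases from one generation to the next.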

### Implementation
We use the open-source Python library sklearn-genetic-opt (which is built on DEAP) to handle the GA mechanics. This library implements GAFeatureSelectionCV, which wraps scikit-learn estimators in a GA that optimizes cross-validation score while minimizing feature count. 

### Why GA for feature selection?

A GA is well suited to this problem because feature selection is a non-convex, combinatorial optimization task with an exponentially large search space ($2^d$ possible subsets). Ensemble-based permutation importance methods (e.g., permuting features in random forests or gradient-boosted trees) score features largely one at a time or under conditional perturbations, implicitly assuming that each feature's contribution can be separated from the rest. This assumption breaks down in the presence of feature interactions, redundancy, and multicollinearity, where predictive power emerges from specific combinations of features rather than from individual ones.
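
A toy example makes the interaction argument concrete. The sketch below uses two synthetic binary "genes" (hypothetical, not taken from the dataset) whose XOR defines the label: scored one at a time, each is close to uninformative, while the pair predicts the label perfectly. It illustrates one-at-a-time scoring in general and is not a simulation of any particular importance method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Two synthetic binary "genes"; the label depends only on their combination (XOR).
g1 = rng.integers(0, 2, size=n)
g2 = rng.integers(0, 2, size=n)
y = g1 ^ g2


def best_rule_accuracy(feature, label):
    """Accuracy of the better of two single-feature rules: the feature or its negation."""
    acc = (feature == label).mean()
    return max(acc, 1.0 - acc)


acc_g1 = best_rule_accuracy(g1, y)
acc_g2 = best_rule_accuracy(g2, y)
acc_pair = ((g1 ^ g2) == y).mean()

print(f"g1 alone:          {acc_g1:.3f}")
print(f"g2 alone:          {acc_g2:.3f}")
print(f"g1 and g2 jointly: {acc_pair:.3f}")
```

A search that evaluates whole subsets, as a GA does, can reward the pair even though neither member looks useful on its own.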

More importantly, permutation importance is diagnostic rather than optimization-driven: it explains a trained ensemble model but does not itself solve a well-defined optimization problem. Selecting features by thresholding importance scores is heuristic and does not guarantee near-optimal performance for a downstream model. In contrast, a GA directly optimizes the target objective (e.g., a cross-validated score with a sparsity constraint) by evaluating entire feature subsets at once. By maintaining a population of candidate solutions and applying crossover and mutation, a GA explores multiple regions of the non-convex landscape in parallel and is less prone to getting trapped in poor local optima than a single greedy search, although it offers no guarantee of finding the global optimum. GAs are designed for discrete, non-differentiable, non-convex problems, which makes them a principled choice when the goal is subset discovery rather than feature ranking.

In [1]:
#!pip install sklearn-genetic-opt

In [2]:
import numpy as np
import pandas as pd

from sklearn.base import clone
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn_genetic import GAFeatureSelectionCV
from sklearn.compose import ColumnTransformer

SEED = 42
np.random.seed(SEED)


In [3]:
df = pd.read_csv('data/brca_data_w_subtypes/brca_data_w_subtypes.csv')

print("Dataset shape:", df.shape)
df.head()

target_col = "vital.status"
X = df.drop(columns=[target_col])
y = df[target_col]

print("Unique values:", y.unique())
print("Value counts:\n", y.value_counts())

feature_names = X.columns.tolist()

Dataset shape: (705, 1941)
Unique values: [0 1]
Value counts:
 vital.status
0    611
1     94
Name: count, dtype: int64
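
The class counts above matter when reading the accuracy figures later in the notebook. A quick calculation, using only the counts printed by value_counts(), shows what a trivial classifier that always predicts the majority class would score; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
n_negative, n_positive = 611, 94
n_total = n_negative + n_positive

positive_rate = n_positive / n_total
majority_accuracy = n_negative / n_total  # accuracy of always predicting class 0

print(f"Positive rate:           {positive_rate:.3f}")      # 0.133
print(f"Majority-class accuracy: {majority_accuracy:.3f}")  # 0.867
```

With roughly 13% positives, accuracy alone says little about how well the minority class is detected, which is why F1 and average precision are the more informative metrics here.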


In [4]:
# Train / test split
# 50/50 stratified split: a large held-out set to stress generalization
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.5,
    stratify=y,
    random_state=SEED
)

print(f"Samples: {X.shape[0]}")
print(f"Genes: {X.shape[1]}")
print(f"Train samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")



Samples: 705
Genes: 1940
Train samples: 352
Test samples: 353


### Only ~700 samples and ~2,000 features

In [5]:
num_cols = X_train.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns
print(f"Numeric features: {len(num_cols)}")
print(f"Categorical features: {len(cat_cols)}")

Numeric features: 1936
Categorical features: 4


In [6]:
# preprocessing
preprocess_only = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

# Fit transform to get numeric matrix
X_train_enc = preprocess_only.fit_transform(X_train)
X_test_enc  = preprocess_only.transform(X_test)

# Get encoded feature names (for reporting)
feat_names = preprocess_only.get_feature_names_out()

# Now GA runs on numeric encoded matrix
lr = LogisticRegression(
    max_iter=10000,
    solver="saga",
    class_weight="balanced",
    n_jobs=-1,
    random_state=SEED,
)

print(f"Total features: {len(feat_names)}")


Total features: 1955


In [7]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)

In [8]:
# Helper function: train/eval logistic regression on a given set of columns
def eval_logreg(col_idx, label, model=lr, scoring_cv="f1"):
    """
    col_idx: list/array of integer feature indices OR boolean mask
    """

    m = clone(model)

    Xtr = X_train_enc[:, col_idx]
    Xte = X_test_enc[:, col_idx]

    # Fit
    m.fit(Xtr, y_train)

    # Test metrics
    y_pred = m.predict(Xte)
    test_acc = accuracy_score(y_test, y_pred)
    test_f1  = f1_score(y_test, y_pred)

    # CV on training
    cv_score = cross_val_score(m, Xtr, y_train, cv=cv, scoring=scoring_cv).mean()

    return {
        "Approach": label,
        "NumFeatures": Xtr.shape[1],
        f"CV_{scoring_cv}_mean(train)": cv_score,
        "Test_Accuracy": test_acc,
        "Test_F1": test_f1,
    }


results = []


In [9]:
## Baseline - all features
all_idx = np.arange(X_train_enc.shape[1])
results.append(eval_logreg(all_idx, "All features (LogReg)"))



In [10]:
results

[{'Approach': 'All features (LogReg)',
  'NumFeatures': 1955,
  'CV_f1_mean(train)': 0.24580016388839918,
  'Test_Accuracy': 0.8101983002832861,
  'Test_F1': 0.32323232323232326}]

The genetic algorithm maintains a population of candidate feature subsets and iteratively evolves them through selection, crossover, and mutation. Population size determines the diversity of candidate solutions explored in each generation, the number of generations controls the depth of evolutionary search, and elitism ensures that high-performing feature subsets are preserved across generations. A sparsity constraint limits the maximum number of selected features, guiding the search toward compact and generalizable solutions. Together, these parameters balance exploration and exploitation in a highly non-convex combinatorial search space.
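
To put the exploration budget in perspective, the back-of-the-envelope calculation below compares the number of candidate masks the GA can evaluate with the size of the unconstrained search space. It assumes roughly one fitness evaluation per individual per generation, which is an order-of-magnitude simplification: the exact count depends on the evolutionary scheme the library uses. The parameter values are the ones used in this notebook's GA configuration.

```python
import math

d = 1955                 # encoded features, from the preprocessing output
population_size = 500    # values from the GA configuration used in this notebook
generations = 100
cv_folds = 10

# Order-of-magnitude estimate: one fitness evaluation per individual per generation.
approx_evaluations = population_size * generations
approx_model_fits = approx_evaluations * cv_folds

# Size of the unconstrained search space, 2^d, expressed as a power of ten.
log10_subsets = d * math.log10(2)

print(f"Candidate masks evaluated: roughly {approx_evaluations:,}")
print(f"Model fits (x{cv_folds} folds): roughly {approx_model_fits:,}")
print(f"Unconstrained subsets: about 10^{log10_subsets:.0f}")
```

Even a large population run for many generations samples a vanishing fraction of the space, so the quality of the result depends on how well selection, crossover, and mutation steer the search rather than on coverage.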

In [None]:
ga_selector = GAFeatureSelectionCV(
    estimator=lr,                  # LogisticRegression(...)
    cv=cv,
    scoring="average_precision",   # fitness metric optimized by the GA
    population_size=500,
    generations=100,
    keep_top_k=5,
    max_features=1000,             # sparsity constraint: upper bound on selected features
    verbose=False,
    n_jobs=1,
)


ga_selector.fit(X_train_enc, y_train)

ga_mask = np.array(ga_selector.best_features_, dtype=bool)
ga_idx = np.where(ga_mask)[0]

ga_cols = list(np.array(feat_names)[ga_idx])

print(f"Selected {len(ga_cols)} encoded features")
print(ga_cols[:20])



In [None]:
results.append(eval_logreg(ga_idx, f"GA-selected (n={len(ga_idx)})", scoring_cv="f1"))


In [None]:
results

[{'Approach': 'All features (LogReg)',
  'NumFeatures': 1955,
  'CV_f1_mean(train)': 0.27673350041771094,
  'Test_Accuracy': 0.8632075471698113,
  'Test_F1': 0.32558139534883723},
 {'Approach': 'GA-selected (n=48)',
  'NumFeatures': 48,
  'CV_f1_mean(train)': 0.2914684603471789,
  'Test_Accuracy': 0.8584905660377359,
  'Test_F1': 0.21052631578947367},
 {'Approach': 'GA-selected (n=88)',
  'NumFeatures': 88,
  'CV_f1_mean(train)': 0.40913804713804713,
  'Test_Accuracy': 0.839622641509434,
  'Test_F1': 0.2608695652173913}]