# Feature selection

We score every subset of one to six features with the Bayesian Information Criterion (BIC): for each subset, we fit a logistic regression model and record its BIC. We then train our final model on the subset with the lowest BIC.
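To get a sense of how many models this exhaustive search fits, the count can be sketched with the standard library. The sketch assumes the 30 features of scikit-learn's breast cancer dataset:

```python
from math import comb

# Assumption: 30 candidate features, as in scikit-learn's breast cancer dataset
n_features = 30

# Number of subsets of each size k = 1..6
subsets_per_size = {k: comb(n_features, k) for k in range(1, 7)}
total_models = sum(subsets_per_size.values())

for k, count in subsets_per_size.items():
    print(f"{k}-element subsets: {count}")
print("Total models to fit:", total_models)
```

Each subset requires one logistic regression fit, so the search runs several hundred thousand fits, and most of them are spent on the 6-element subsets.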

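The BIC trades off goodness of fit against model complexity: it rewards a higher likelihood but adds a penalty that grows with the number of parameters. Preferring the subset with the lowest BIC therefore helps guard against overfitting.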
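As a rough illustration of this tradeoff, the standard definition `BIC = k * ln(n) - 2 * ln(L)` can be written out directly, where `k` is the number of estimated parameters (including the intercept), `n` is the number of observations, and `L` is the maximized likelihood. The helper name `bic_score` and the log-likelihood values are hypothetical, chosen only for illustration:

```python
import math

def bic_score(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Return BIC = k * ln(n) - 2 * ln(L) for a fitted model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

n_obs = 569  # number of samples in the breast cancer dataset

# Hypothetical log-likelihoods: the larger model fits only slightly better
small_model = bic_score(log_likelihood=-40.0, n_params=6, n_obs=n_obs)
large_model = bic_score(log_likelihood=-39.0, n_params=7, n_obs=n_obs)

# The extra parameter costs ln(569), about 6.34, while the fit gains only 2 * 1.0 = 2.0
print("Small model preferred:", small_model < large_model)
```

In this made-up case the smaller model has the lower BIC, because the gain in fit does not cover the penalty for the extra parameter.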

In [22]:
import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer
import numpy as np
from itertools import combinations


breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

n_features = X.shape[1]
best_bic = np.inf
best_features = None

for k in range(1, 7):
  for combo in combinations(range(n_features), k):
    # Get the subset of features
    X_subset = X[:, combo]

    # Add constant column of 1's to serve as the bias term
    X_with_const = sm.add_constant(X_subset)

    try:
      model = sm.Logit(y, X_with_const).fit(disp=False)
      bic = model.bic

      # BIC is better if it's smaller
      if bic < best_bic:
        best_bic = bic
        best_features = combo
    except Exception as e:
      # Report subsets for which the fit fails, then move on
      print(e)
      print("combo:", breast_cancer.feature_names[list(combo)])
      continue

  print(f"Done with {k}-element subsets")
  print("Best BIC:", best_bic)
  print("Best features:", breast_cancer.feature_names[list(best_features)])

print("\nOverall best BIC:", best_bic)
print("Overall best features:", breast_cancer.feature_names[list(best_features)])

Done with 1-element subsets
Best BIC: 222.1677016482091
Best features: ['worst perimeter']
Done with 2-element subsets
Best BIC: 155.1611406813787
Best features: ['worst area' 'worst concave points']
Done with 3-element subsets
Best BIC: 123.36269508812477
Best features: ['worst texture' 'worst area' 'worst concave points']
Done with 4-element subsets
Best BIC: 114.01058850507678
Best features: ['radius error' 'worst texture' 'worst area' 'worst concave points']
Done with 5-element subsets
Best BIC: 110.18011767702438
Best features: ['radius error' 'worst texture' 'worst area' 'worst smoothness'
 'worst concave points']
Done with 6-element subsets
Best BIC: 110.18011767702438
Best features: ['radius error' 'worst texture' 'worst area' 'worst smoothness'
 'worst concave points']

Overall best BIC: 110.18011767702438
Overall best features: ['radius error' 'worst texture' 'worst area' 'worst smoothness'
 'worst concave points']


## Selected Features

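The lowest BIC (about 110.18) was reached by a five-feature subset, and no six-feature subset improved on it. The selected features are: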
* radius error [index 10]
* worst texture [index 21]
* worst area [index 23]
* worst smoothness [index 24]
* worst concave points [index 27]

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# From BIC, the best features correspond to the following indices
features = (10, 21, 23, 24, 27)

# Load data
breast_cancer = load_breast_cancer()

# Get the best features based on BIC
X = breast_cancer.data[:, list(features)]
y = breast_cancer.target

avgScore = 0
runs = 20

for i in range(runs):
    # Split data into 75:25
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)

    # Scale data: fit the scaler on the training split only, then apply
    # the same transformation to the test split
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train the logistic regression model
    model = LogisticRegression(max_iter=15)
    model.fit(X_train, y_train)

    # Evaluate the model on the test data
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    avgScore += score

# Report the mean accuracy across all runs
print("Accuracy:", avgScore/runs)

Accuracy: 0.9741258741258744
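
When standardizing before evaluation, the scaling statistics are usually estimated on the training split only and then reused on the test split, so that no information from the test data leaks into preprocessing. A minimal NumPy sketch of that idea, with made-up toy arrays and a hypothetical helper name `standardize_with_train_stats`:

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Scale both splits using the mean and std of the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std

# Toy data (not taken from the breast cancer dataset)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

X_train_scaled, X_test_scaled = standardize_with_train_stats(X_train, X_test)
print(X_train_scaled.mean(axis=0))  # approximately zero in each column
print(X_test_scaled)
```

With `StandardScaler`, the same pattern corresponds to calling `fit_transform` on the training data and `transform` on the test data.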


In [17]:
print(best_features)
print("best feature:", breast_cancer.feature_names[list(best_features)])


(10, 21, 23, 24, 27)
best feature: ['radius error' 'worst texture' 'worst area' 'worst smoothness'
 'worst concave points']
