# Feature Selection for Categorical Features using Iterative Hard Thresholding

Feature selection is a very important task in machine learning tasks. Finding a relevant subset of features improves generalization, robustness to noise, and convergence to targets. A prominent technique in feature selection *Iterative Hard Thresholding (IHT)*. Although IHT is a powerful technique for feature selection, it is not designed for categorical features. In this notebook, I will give a walkthrough on **how to extend IHT to categorical datasets**. For validation, I will compare IHT to other machine learning techniques that also perform feature selections such as LASSO and Random Forest.


# Iterative Hard Thresholding



Consider a dataset $(\mathbf{X_i}, y_i)$ for $i = [n] = 1,2,3 ... n$, where $\mathbf{X_i} \in \mathbb{R}^m$ and $y_i \in \mathbb{R}.$ We seek to find a k-sparse vector $\mathbf{B} \in  \mathbb{R}^m$ with the following objective: 


$$ \min_{\|\mathbf{B}\|_0 = k} \| \mathbf{y} - f(\mathbf{X} ;\mathbf{B}) \|
$$

where $\mathbf{X} \in \mathbb{R}^{n \times m}, \mathbf{y} \in  \mathbb{R}^{n} $ denote the data matrix and the response vector. The $l_0$ norm denotes the number of non-zero elements in $\mathbf{B}$ and $f$ is a given task-specific function. 





Intuitively, learning $\mathbf{B}$ is equivalent to learning a set of $k$ relevant features that best recover the response vector. Thus, by solving this optimization problem, one can identify relevant features. 

IHT assumes there exists a linear relationship between a subset of the measured variable and the response vector , and seeks to retain the top-k values of the entire feature vector after each iteration. Let $\mathbf{B}^0 = \mathbf{0}$, IHT follows the update rule:

$$
\mathbf{B}^{t+1} =  \mathbf{H}_k(\mathbf{B}^{t} - \lambda \bigtriangledown_{\mathbf{B}^t} L(f(\mathbf{X} ;\mathbf{B}),y))
$$

In [1]:
from jax import grad
import jax.numpy as NP
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import class_weight
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# DATASET

In [2]:
epoch = 2000
lr = 0.1
s = 50
dataPath ="molecularData.csv"
labelPath= "molecularLabel.csv"

In [3]:
data = np.genfromtxt(dataPath, dtype= str, delimiter = ",")
label= np.genfromtxt(labelPath, dtype = str, delimiter = ",")

classW = class_weight.compute_class_weight('balanced', np.unique(label), label)
uniqueClasses = np.unique(label).tolist()
numClasses = len(uniqueClasses)
classWL = np.array([classW[uniqueClasses.index(i)] for i in label]).reshape(-1,1) #Calculate weight class for each sample
label = LabelEncoder().fit_transform(label)
X_train, X_test, y_train, y_test = train_test_split(
    data, label, test_size=0.3, random_state=42)

In [4]:
OHC = OneHotEncoder().fit(data)
OHCL = OneHotEncoder().fit(label.reshape(-1,1))
classCount = list(map(len, OHC.categories_)) #Number of all possible values for each categorical feature
indices = np.cumsum(classCount) # For indexing between coefficient and categorical features
DIM = indices[-1]
indices = np.insert(indices,0, 0)
X_trainC = OHC.transform(X_train).toarray()
X_testC = OHC.transform(X_test).toarray()
dataC = OHC.transform(data).toarray()
X_current, Y_current, weight =  None, None, None #Placeholder for batch data
num_train = X_train.shape[0]

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [5]:
def aggregateFeature(gradients, numSplit):
    gradients = gradients.flatten()
    sumList = []
    current = 0
    for i in numSplit:
        sumList.append(np.mean(np.abs(gradients[current: current + i ])))
        current = current + i
    return sumList

def thresholding(coeff):
    copyCof = coeff[:]
    coeff = np.sum(np.abs(np.array(coeff)), axis = 0) 
    sum_coeff = aggregateFeature(coeff, classCount)
    rankingBest = np.argsort(np.abs(sum_coeff)).ravel()[-s:]
    if rankingBest.shape[0] < s:
        print("rankBest less than s features")
    selected = set([i for j in rankingBest for i in range(indices[j], indices[j+1])]) #List of selected coefficient
    coeff = coeff.flatten()
    notSelected = list(set(range(len(coeff))).difference(selected))
    copyCof[:,notSelected] = 0
    return copyCof

In [6]:
def generateBatch():
    #int(num_train / 5)
    index = np.random.choice(num_train, size =int(num_train / 1), replace =  False)
    global X_current
    global Y_current
    global weight
    X_current = X_trainC[index]
    Y_current = y_train[index].reshape(-1,1)
    weight = classWL[index]

In [7]:
def regression(coeff):  # Define the softmax function
  y = NP.exp(NP.dot(X_current, coeff.T))
  s = NP.expand_dims(NP.sum(y, axis = 1), 1)
  y = y / s
  Y_currentT = OHCL.transform(Y_current).toarray()
  label_logprobs = NP.multiply(NP.log(y) , Y_currentT) #+ NP.multiply(NP.log(1 - y) , (1 - Y_currentT))
  label_logprobs = weight * label_logprobs
  return -NP.mean(label_logprobs)

  
def regressionTest(coeff):  # Define a function
  y = np.exp(np.dot(X_testC, coeff.T))
  s = np.expand_dims(np.sum(y, axis = 1), 1)
  y = y / s 
  return y

In [8]:
cof = np.zeros((numClasses, DIM))
grad_regression  = grad(regression)
for i in range(epoch):
    generateBatch()
    trainError =  regression(cof)
    acc = accuracy_score(np.argmax(regressionTest(cof), axis = 1) , y_test)
    if i % 100 == 0:
        print("Training error:", trainError, "Accuracy:",acc )
    gradient = np.array(grad_regression(cof))
    cof = thresholding(cof - lr * gradient)



Training error: 0.4223118 Accuracy: 0.25914315569487983
Training error: 0.1705807 Accuracy: 0.9456635318704284
Training error: 0.11736325 Accuracy: 0.9498432601880877
Training error: 0.09449802 Accuracy: 0.9498432601880877
Training error: 0.08155732 Accuracy: 0.9508881922675027
Training error: 0.073095605 Accuracy: 0.955067920585162
Training error: 0.06705454 Accuracy: 0.955067920585162
Training error: 0.0624808 Accuracy: 0.955067920585162
Training error: 0.058869965 Accuracy: 0.9561128526645768
Training error: 0.055928886 Accuracy: 0.9571577847439916
Training error: 0.053474702 Accuracy: 0.9571577847439916
Training error: 0.05138715 Accuracy: 0.9571577847439916
Training error: 0.049583487 Accuracy: 0.9571577847439916
Training error: 0.048004765 Accuracy: 0.9571577847439916
Training error: 0.04660785 Accuracy: 0.9582027168234065
Training error: 0.0453602 Accuracy: 0.9561128526645768
Training error: 0.04423694 Accuracy: 0.9561128526645768
Training error: 0.043218527 Accuracy: 0.95611285

# Baseline Performance

In [9]:
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier


X_train, X_test, y_train, y_test = train_test_split(dataC, label, random_state=1, test_size = 0.3)
        

clf = linear_model.SGDClassifier(loss ='log', penalty = 'l1')

clf.fit(X_train, y_train)


print(accuracy_score(y_test, clf.predict(X_test)))


clf = RandomForestClassifier()

clf.fit(X_train, y_train)


print(accuracy_score(y_test, clf.predict(X_test)))

0.9414838035527691
0.9310344827586207


