# Can we improve anomaly detection results using feature selection? 

Let's start by taking a look at the isolation forest algorithm. For this comparison, we will keep the `max_samples` parameter constant at 2000. See the file IsolationForest_Plots for why this value was chosen. To understand how feature selection affects the effectiveness of this algorithm, we will compare recall, precision, and f1 scores for different subsets of the KDDcup99 dataset.

In [1]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np

def calculations(actual, predictions):
    r = recall_score(actual, predictions, pos_label = -1)
    p = precision_score(actual, predictions, pos_label = -1)
    f = f1_score(actual, predictions, pos_label = -1)
    print('Recall =', r, '\nPrecision =', p, '\nf1 =', f)

In [2]:
from sklearn.datasets import fetch_kddcup99
kdd99_data = fetch_kddcup99(subset='http')
X_http = kdd99_data['data']
y_http = kdd99_data['target']
# Encode labels the way IsolationForest.predict does: 1 = normal, -1 = anomaly
y_http[y_http == b'normal.'] = 1
y_http[y_http != 1] = -1
y_http = np.int64(y_http)

In [3]:
isof = IsolationForest(max_samples= 2000, random_state=42)
isof.fit(X_http)
anomaly_predictions_http = isof.predict(X_http)
anomaly_predictions_http = np.array(anomaly_predictions_http)

In [4]:
calculations(y_http, anomaly_predictions_http)

Recall = 1.0 
Precision = 0.5189100305379375 
f1 = 0.6832663161150634


Running isolation forest on the http subset of the KDDcup99 dataset gives the recall, precision, and f1 scores above. All anomalies were detected, but some normal points were labelled as anomalous as well: recall is 100%, while only about 52% of the points labelled as anomalous were correctly labelled. The f1 score, which is the harmonic mean of precision and recall, is therefore about 0.68 on the http subset. It is important to note that only 3 features are considered in the http subset.
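To make the link between the three numbers explicit, here is a minimal sketch that recomputes f1 from the precision and recall printed above. The helper name `f1_from` is hypothetical and only used for illustration; scikit-learn's `f1_score` works from the labels directly.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the http subset output above
http_f1 = f1_from(precision=0.5189100305379375, recall=1.0)
print(round(http_f1, 4))  # 0.6833
```

Because f1 is a harmonic mean, it is pulled towards the lower of the two scores, which is why perfect recall still yields an f1 of only about 0.68 here.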

-------------------------------------------------------------------------------------------------------------------------------

Now, let's see what happens when we use a randomly selected subset of the full data. Because the original dataset is much too large, we draw a random sample with a similar number of instances to the http subset.

In [5]:
from sklearn.datasets import fetch_kddcup99
kdd99_data = fetch_kddcup99()
X = kdd99_data['data']
y = kdd99_data['target']
y[y == b'normal.'] = 1
y[y != 1] = -1
y = np.int64(y)

In [6]:
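# 59000 row indices, drawn with replacement (the np.random.choice default) and without a fixed seed
sample_indices = np.random.choice(range(len(y)), 59000)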
X_sample = X[sample_indices,:]
y_sample=y[sample_indices]
print(X_sample.shape)
print(y_sample.shape)

(59000, 41)
(59000,)
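The sampling cell uses `np.random.choice`, which by default samples with replacement and, without a seed, gives a different sample on every run. If repeated rows or run-to-run variation are a concern, one possible variant is sketched below. The helper `sample_rows`, the seed value, and the toy arrays are assumptions for illustration, not part of the original experiment.

```python
import numpy as np

def sample_rows(X, y, n_rows, seed=0):
    """Draw n_rows distinct rows from X and y with a seeded generator."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=n_rows, replace=False)  # no repeated rows
    return X[idx], y[idx]

# Toy stand-in for the KDD arrays: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([1, 1, 1, -1, 1, 1, -1, 1, 1, 1])

X_s, y_s = sample_rows(X_demo, y_demo, n_rows=5, seed=42)
print(X_s.shape, y_s.shape)
```

Calling `sample_rows` again with the same seed returns the same rows, so later cells can be rerun against an identical sample.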


Each of the 59000 sampled rows has 41 features. Three of them (columns 1 to 3: protocol_type, service, and flag) are categorical, while the others are numerical. Handling the categorical features would make the problem more complex, so for the sake of simplicity they will be removed.

In [7]:
X_num_sample = np.delete(X_sample,[1,2,3],1)
print(X_num_sample.shape)
print(y_sample.shape)

(59000, 38)
(59000,)
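As a small illustration of what the `np.delete` call does, the sketch below applies the same column indices to two hand-written rows shaped like the first five KDD columns. The row values are made up for the example; the explicit cast to `float` is an extra step not present in the cell above.

```python
import numpy as np

# Toy rows: duration, protocol_type, service, flag, src_bytes
rows = np.array([
    [0, b"tcp", b"http", b"SF", 181],
    [0, b"udp", b"domain_u", b"SF", 44],
], dtype=object)

categorical_columns = [1, 2, 3]  # protocol_type, service, flag
numeric = np.delete(rows, categorical_columns, axis=1).astype(float)
print(numeric.shape)  # (2, 2)
```

The cast fails loudly if any non-numeric column is left behind, which makes it a cheap check that the right columns were dropped.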


In [8]:
isof = IsolationForest(max_samples= 2000, random_state=42)
isof.fit(X_num_sample)
anomaly_predictions = isof.predict(X_num_sample)
anomaly_predictions = np.array(anomaly_predictions)

In [9]:
calculations(y_sample, anomaly_predictions)

Recall = 0.024347973115779686 
Precision = 0.48040033361134277 
f1 = 0.0463469584808497


Using all 38 numerical features impaired our results. While the precision score did not change much (0.48 versus 0.52), the recall score dropped sharply: only about 2% of the existing anomalies were detected, so the f1 score falls to about 0.05. Let's see if we can improve these results by using the scikit-learn feature selection method SelectKBest.

-------------------------------------------------------------------------------------------------------------------------------

In [10]:
from sklearn.datasets import fetch_kddcup99
kdd99_data = fetch_kddcup99()
X_feature = kdd99_data['data']
y_feature = kdd99_data['target']
y_feature[y_feature == b'normal.'] = 1
y_feature[y_feature != 1] = -1
y_feature = np.int64(y_feature)

In [11]:
sample_indices_feat = np.random.choice(range(len(y_feature)), 59000)
X_feat_sample = X_feature[sample_indices_feat,:]
y_feat_sample=y_feature[sample_indices_feat]
print(X_feat_sample.shape)
print(y_feat_sample.shape)

(59000, 41)
(59000,)


In [12]:
X_num = np.delete(X_feat_sample,[1,2,3],1)
print(X_num.shape)
print(y_feat_sample.shape)

(59000, 38)
(59000,)


In [13]:
fs = SelectKBest(score_func = f_classif, k = 3)
X_selected = fs.fit_transform(X_num,y_feat_sample)

  f = msb / msw
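The stray `f = msb / msw` line above looks like the tail of a runtime warning from the ANOVA F-test behind `f_classif`. This is a reading of the truncated output, not something the notebook states. Such a warning typically appears when a feature is constant, because both the between-class and within-class mean squares are then zero. The sketch below reimplements the one-way F statistic in NumPy to show the effect. The function `anova_f` and the toy data are hypothetical, and this is not scikit-learn's own code.

```python
import numpy as np

def anova_f(x, y):
    """One-way ANOVA F statistic of a single feature x across the classes in y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    grand_mean = x.mean()
    # Between-class and within-class sums of squares
    ssb = sum(x[y == c].size * (x[y == c].mean() - grand_mean) ** 2 for c in classes)
    ssw = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    msb = ssb / (len(classes) - 1)
    msw = ssw / (x.size - len(classes))
    # 0 / 0 for a constant feature gives nan; silence the warning in this demo
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.float64(msb) / np.float64(msw)

labels = np.array([1, 1, 1, -1, -1, -1])
informative = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
constant = np.zeros(6)

print(anova_f(informative, labels))  # large F: the classes are well separated
print(anova_f(constant, labels))     # nan: no variance to compare
```

If constant columns are the cause, dropping them before feature selection (for example with scikit-learn's `VarianceThreshold`) would be one way to avoid the warning.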


In [14]:
isof = IsolationForest(max_samples= 2000, random_state=42)
isof.fit(X_selected)
anomaly_predictions_feature = isof.predict(X_selected)
anomaly_predictions_feature = np.array(anomaly_predictions_feature)

In [15]:
calculations(y_feat_sample, anomaly_predictions_feature)

Recall = 0.1062814123188863 
Precision = 0.35697132363328155 
f1 = 0.16379575764450421


----------------------------------------------------------------------------------------------------------------------------

In [16]:
print('http Subset Results:')
calculations(y_http, anomaly_predictions_http)
print('\n')
print('Random Sample Without Feature Selection Results:')
calculations(y_sample, anomaly_predictions)
print('\n')
print('Feature Selection Results:')
calculations(y_feat_sample, anomaly_predictions_feature)

http Subset Results:
Recall = 1.0 
Precision = 0.5189100305379375 
f1 = 0.6832663161150634


Random Sample Without Feature Selection Results:
Recall = 0.024347973115779686 
Precision = 0.48040033361134277 
f1 = 0.0463469584808497


Feature Selection Results:
Recall = 0.1062814123188863 
Precision = 0.35697132363328155 
f1 = 0.16379575764450421


Using all 38 numerical features without dimensionality reduction proves to be an ineffective approach. Selecting the 3 best features with SelectKBest raised recall from about 0.02 to 0.11 and f1 from about 0.05 to 0.16, but precision fell from 0.48 to 0.36, so the results are still poor. In this experiment the http subset gives by far the best scores. However, the gap could be due entirely to the feature selection method itself: it is possible that other feature selection methods would do better at finding relationships between the features and the target variable.
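One caveat about the comparison: the run without feature selection and the run with it each drew their own random sample, so part of the difference may come from the samples rather than the features. A minimal sketch of how both variants could be scored on the same rows is shown below. The function `compare_on_same_sample`, the `max_samples=256` setting, and the synthetic data are assumptions chosen to keep the example self-contained; they are not the notebook's data or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score

def compare_on_same_sample(X, y, k, seed=42):
    """Return the anomaly-class f1 for all features and for the k best features."""
    candidates = {
        "all_features": X,
        "k_best": SelectKBest(score_func=f_classif, k=k).fit_transform(X, y),
    }
    scores = {}
    for name, features in candidates.items():
        isof = IsolationForest(max_samples=256, random_state=seed)
        predictions = isof.fit(features).predict(features)
        scores[name] = f1_score(y, predictions, pos_label=-1)
    return scores

# Synthetic stand-in data: 1000 normal rows and 50 anomalies
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(1000, 6))
X_anomalous = rng.normal(size=(50, 6))
X_anomalous[:, :2] += 6.0  # anomalies differ only in the first two features
X_demo = np.vstack([X_normal, X_anomalous])
y_demo = np.concatenate([np.ones(1000, dtype=int), -np.ones(50, dtype=int)])

demo_scores = compare_on_same_sample(X_demo, y_demo, k=2)
print(demo_scores)
```

With the real data, `X_demo` and `y_demo` would be replaced by a single sample such as `X_num_sample` and `y_sample`, so that any change in the scores can be attributed to the selected features alone.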