# Comparing Original and Enhanced Pipelines in Intrusion Detection

This notebook compares two pipelines for intrusion detection:

- **Original Pipeline**: Data preprocessing and feature selection without Isolation Forest.
- **Enhanced Pipeline**: Data preprocessing and feature selection with Isolation Forest as an anomaly detection filter.

We aim to assess the impact of Isolation Forest on anomaly detection and dataset efficiency before applying tree-based models.


## Import libraries ----- ADDED SOME IMPORTS 4371 -----

In [1]:
# Dependencies added by the 4371 group so the notebook runs in a fresh environment

# Core data manipulation and scientific libraries
!pip install numpy pandas scipy

# Data visualization libraries
!pip install seaborn matplotlib

# Machine learning libraries
!pip install scikit-learn xgboost

# Imbalanced data handling
!pip install imbalanced-learn

# Hyperparameter optimization libraries
!pip install hyperopt scikit-optimize

# Custom FCBF module, installed straight from its GitHub repo.
# As the output below shows, this command fails because the repo has no setup.py or pyproject.toml.
!pip install git+https://github.com/SantiagoEG/FCBF_module.git


Collecting git+https://github.com/SantiagoEG/FCBF_module.git
  Cloning https://github.com/SantiagoEG/FCBF_module.git to c:\users\logan\appdata\local\temp\pip-req-build-8_hp_4rb
  Resolved https://github.com/SantiagoEG/FCBF_module.git to commit 092b60b65ee6ceaf9b0227d12b575f2a3336b287


  Running command git clone --filter=blob:none --quiet https://github.com/SantiagoEG/FCBF_module.git 'C:\Users\Logan\AppData\Local\Temp\pip-req-build-8_hp_4rb'
ERROR: git+https://github.com/SantiagoEG/FCBF_module.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.


In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,precision_recall_fscore_support
from sklearn.metrics import f1_score,roc_auc_score
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb
from xgboost import plot_importance

# Isolation Forest Import -- 4371
from sklearn.ensemble import IsolationForest

# ---Original Codebase Pipeline (Without Isolation Forest)---

## Read the sampled CICIDS2017 dataset
The CICIDS2017 dataset is publicly available at: https://www.unb.ca/cic/datasets/ids-2017.html  
Due to the large size of this dataset, a sampled subset of CICIDS2017 is used. The subsets are in the "data" folder.  
To use this code on another dataset (e.g., the CAN-intrusion dataset), change the dataset name and follow the same steps. The models in this code are generic and can be applied to any intrusion detection or network traffic dataset.

In [4]:
# Original Pipeline (Without Isolation Forest)
print("---Original Pipeline (Without Isolation Forest)---")

# Read the sampled CICIDS2017 dataset
df_orig = pd.read_csv('./data/CICIDS2017_sample.csv')

---Original Pipeline (Without Isolation Forest)---


In [5]:
df_orig.Label.value_counts()

Label
BENIGN          22731
DoS             19035
PortScan         7946
BruteForce       2767
WebAttack        2180
Bot              1966
Infiltration       36
Name: count, dtype: int64

### Preprocessing (normalization and padding values)

In [6]:
# Z-score normalization
features_orig = df_orig.dtypes[df_orig.dtypes != 'object'].index
df_orig[features_orig] = df_orig[features_orig].apply(
    lambda x: (x - x.mean()) / (x.std())
)
# Fill empty values by 0
df_orig = df_orig.fillna(0)
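
The `fillna(0)` call matters for the z-score step itself, not only for values missing from the CSV. A minimal sketch on a hypothetical toy frame (the column names are made up for illustration): a constant column has zero standard deviation, so its z-scores are undefined and come out as NaN until they are filled.

```python
import pandas as pd

# Hypothetical toy frame: one varying column and one constant column
toy = pd.DataFrame({"varying": [1.0, 2.0, 3.0], "constant": [5.0, 5.0, 5.0]})

# Same z-score transform as the notebook: subtract the mean, divide by the sample std
z = toy.apply(lambda x: (x - x.mean()) / x.std())

# The constant column has std 0, so 0/0 yields NaN in every row
nan_before = int(z["constant"].isna().sum())

# fillna(0) maps those undefined scores to 0, i.e. "exactly at the mean"
z = z.fillna(0)
print(nan_before)
print(z)
```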

### Data sampling
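Due to the space limit of GitHub files and the large size of network traffic data, we sample a small subset for model learning using **k-means cluster sampling**: the majority classes are clustered with k-means and a fixed fraction of each cluster is drawn, while the minority classes are kept in full.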
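
The idea can be tried on synthetic data first. The sketch below is an illustration under assumptions, not the notebook's exact code: the toy frame, the 10 clusters, and the 10% fraction are all made up, and it uses pandas' built-in `groupby(...).sample(...)` in place of the `apply`-based helper used in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for the majority-class rows: 600 samples, 4 features
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(600, 4)), columns=["f0", "f1", "f2", "f3"])

# Cluster the rows, then remember which cluster each row landed in
kmeans = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0).fit(toy)
toy["klabel"] = kmeans.labels_

# Draw the same fraction from every cluster so each region of feature space stays represented
sampled = toy.groupby("klabel").sample(frac=0.1, random_state=0)

# The cluster id is only a sampling aid, so drop it before any modelling
sampled = sampled.drop(columns="klabel")
print(len(toy), "->", len(sampled))
```

Dropping the helper column right after sampling keeps it from leaking into the feature matrix later on.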

In [8]:
labelencoder_orig = LabelEncoder()
df_orig.iloc[:, -1] = labelencoder_orig.fit_transform(df_orig.iloc[:, -1])

In [9]:
df_orig.Label.value_counts()

Label
0    22731
3    19035
5     7946
2     2767
6     2180
1     1966
4       36
Name: count, dtype: int64

In [10]:
# retain the minority class instances and sample the majority class instances
df_minor_orig = df_orig[
    (df_orig['Label'] == 6) | (df_orig['Label'] == 1) | (df_orig['Label'] == 4)
]
df_major_orig = df_orig.drop(df_minor_orig.index)

In [11]:
X_orig = df_major_orig.drop(['Label'], axis=1)
y_orig = df_major_orig['Label'].values
y_orig=np.ravel(y_orig)

In [12]:
# use k-means to cluster the data samples and select a proportion of data from each cluster
from sklearn.cluster import MiniBatchKMeans
kmeans_orig = MiniBatchKMeans(n_clusters=1000, random_state=0).fit(X_orig)

In [13]:
klabel_orig = kmeans_orig.labels_
df_major_orig['klabel'] = klabel_orig

In [14]:
df_major_orig['klabel'].value_counts()

klabel
20     482
842    411
312    348
324    337
745    334
      ... 
973      1
727      1
594      1
410      1
100      1
Name: count, Length: 979, dtype: int64

In [15]:
cols_orig = list(df_major_orig)
cols_orig.insert(78, cols_orig.pop(cols_orig.index('Label')))
df_major_orig = df_major_orig.loc[:, cols_orig]

In [17]:
def typicalSampling_orig(group):
    name = group.name
    frac = 0.008
    return group.sample(frac=frac)

result_orig = df_major_orig.groupby(
    'klabel', group_keys=False
).apply(typicalSampling_orig)

In [18]:
result_orig['Label'].value_counts()

Label
3    120
0    119
5     57
2     18
Name: count, dtype: int64

## 4371: Modified the cell below to use pandas.concat(), the recommended way to combine DataFrames in recent versions of pandas

In [19]:
# 'result_orig' still carries the 'klabel' column added during clustering, while
# 'df_minor_orig' does not, so the concatenated frame has NaN in 'klabel' for
# every minority-class row (handled by the imputation step further down)

# Concatenate 'result_orig' and 'df_minor_orig' DataFrames
result_orig = pd.concat([result_orig, df_minor_orig], ignore_index=True)

print("DataFrames concatenated successfully.")
print("Updated DataFrame head:")
print(result_orig.head())

DataFrames concatenated successfully.
Updated DataFrame head:
   Flow Duration  Total Fwd Packets  Total Backward Packets  \
0       1.721079           0.037025                0.008402   
1      -0.523395          -0.068426               -0.051737   
2      -0.486085          -0.050851               -0.021667   
3      -0.368223          -0.015701               -0.081806   
4      -0.512273          -0.015701               -0.081806   

   Total Length of Fwd Packets  Total Length of Bwd Packets  \
0                    -0.031683                     0.057881   
1                    -0.030559                    -0.046494   
2                    -0.033088                     0.057881   
3                    -0.032901                    -0.048343   
4                    -0.032901                    -0.048343   

   Fwd Packet Length Max  Fwd Packet Length Min  Fwd Packet Length Mean  \
0              -0.218767              -0.211174               -0.207684   
1              -0.188874      

In [20]:
result_orig.to_csv('./data/CICIDS2017_sample_km_orig.csv', index=False)

### split train set and test set

In [21]:
# Original Pipeline (Without Isolation Forest)
print("---Original Pipeline (Without Isolation Forest)---")

# Read the sampled CICIDS2017 dataset
df_orig = pd.read_csv('./data/CICIDS2017_sample_km_orig.csv')
print(df_orig.isnull().sum())

---Original Pipeline (Without Isolation Forest)---
Flow Duration                     0
Total Fwd Packets                 0
Total Backward Packets            0
Total Length of Fwd Packets       0
Total Length of Bwd Packets       0
                               ... 
Idle Std                          0
Idle Max                          0
Idle Min                          0
klabel                         4182
Label                             0
Length: 79, dtype: int64


## ----- ADDED LINES BELOW 4371 TO FIX ISSUE WITH ValueError: Input X contains NaN. ------

In [22]:
from sklearn.impute import SimpleImputer

# Create an imputer object with the desired strategy (mean, median, most_frequent)
imputer_orig = SimpleImputer(strategy='mean')

# Apply the imputer to the DataFrame
df_orig[df_orig.columns] = imputer_orig.fit_transform(df_orig)

# SimpleImputer replaces each NaN with the mean of its column. In the isnull() output above the
# missing values sit in 'klabel' (4182 rows, the minority-class rows that were never clustered).
# Removing them lets mutual_info_classif run without the "Input X contains NaN" error.
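
What mean imputation does to a column like `klabel` can be seen on a small example. This is a sketch with a hypothetical four-row frame, not the notebook's data: two rows have a cluster id and two do not, mirroring the sampled rows versus the minority-class rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame: 'klabel' is missing for the rows that were never clustered
toy = pd.DataFrame({
    "feature": [0.5, -1.0, 0.25, 2.0],
    "klabel": [0.0, 2.0, np.nan, np.nan],
})

# Mean strategy: each NaN is replaced by the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The filled value (the average cluster id) has no real meaning as a feature, so dropping `klabel` before modelling may be a reasonable alternative to imputing it.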

In [23]:
X_orig = df_orig.drop(['Label'], axis=1).values
y_orig = df_orig['Label'].values
y_orig=np.ravel(y_orig)

In [24]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X_orig, y_orig, train_size=0.8, test_size=0.2, random_state=0, stratify=y_orig
)

## Feature engineering

### Feature selection by information gain

In [25]:
from sklearn.feature_selection import mutual_info_classif
importances_orig = mutual_info_classif(X_train_orig, y_train_orig)

In [26]:
# Rank features by (rounded) mutual-information score, highest first
f_list_orig = sorted(
    zip(map(lambda x: round(x, 4), importances_orig), features_orig), reverse=True
)
# Total importance, used to normalize the scores in the next cell
Sum_orig = sum(score for score, _ in f_list_orig)

# Accumulate the running total and the ranked feature names
Sum = 0
fs = []

for i in range(0, len(f_list_orig)):
    Sum = Sum + f_list_orig[i][0]
    fs.append(f_list_orig[i][1])

In [None]:
# Select the important features from top to bottom until the accumulated importance reaches 90%
f_list2 = sorted(
    zip(map(lambda x: round(x, 4), importances_orig / Sum_orig), features_orig),
    reverse=True
)

Sum2 = 0
fs_selected = []

for i in range(0, len(f_list2)):
    Sum2 = Sum2 + f_list2[i][0]
    fs_selected.append(f_list2[i][1])
    if Sum2 >= 0.9:
        break
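
The 90% cut-off is easiest to follow on a handful of made-up scores. The sketch below assumes hypothetical feature names and importance values and uses a NumPy cumulative sum in place of the explicit loop; it is meant only to show the arithmetic.

```python
import numpy as np

# Hypothetical importance scores for five features (not taken from the dataset)
names = np.array(["a", "b", "c", "d", "e"])
scores = np.array([0.15, 0.50, 0.03, 0.30, 0.02])

# Sort from most to least important and normalize so the scores sum to 1
order = np.argsort(scores)[::-1]
normalized = scores[order] / scores.sum()

# Running total of normalized importance: roughly 0.50, 0.80, 0.95, 0.98, 1.00
cumulative = np.cumsum(normalized)

# Keep features up to and including the first one that pushes the total to 0.9 or more
k = int(np.searchsorted(cumulative, 0.9) + 1)
selected = names[order][:k].tolist()
print(selected)
```

Here the first two features cover only about 80% of the total, so a third is needed to cross the threshold.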

In [27]:
# Use the feature names selected in the previous cell (fs_selected)
X_fs_orig = df_orig[fs_selected].values

In [None]:
X_fs_orig.shape

### Feature selection by Fast Correlation Based Filter (FCBF)

The module is imported from the GitHub repo: https://github.com/SantiagoEG/FCBF_module

In [None]:
from FCBF_module import FCBF, FCBFK, FCBFiP, get_i
fcbf_orig = FCBFK(k=20)
#fcbf.fit(X_fs, y)

In [None]:
X_fss_orig = fcbf_orig.fit_transform(X_fs_orig, y_orig)

In [None]:
X_fss_orig.shape

### Re-split train & test sets after feature selection

In [None]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X_fss_orig, y_orig, train_size=0.8, test_size=0.2, random_state=0, stratify=y_orig
)


In [None]:
X_train_orig.shape

In [None]:
pd.Series(y_train_orig).value_counts()

# Data on original codebase pipeline without Isolation Forest filtering

In [None]:
# Class distribution in training data
print("Original Training Data Class Distribution:")
print(pd.Series(y_train_orig).value_counts())

# Dataset size
print(f"Original Training Data Shape: {X_train_orig.shape}")
print(f"Original Test Data Shape: {X_test_orig.shape}")

# ---Enhanced Pipeline (With Isolation Forest)---

## Read the sampled CICIDS2017 dataset
The CICIDS2017 dataset is publicly available at: https://www.unb.ca/cic/datasets/ids-2017.html  
Due to the large size of this dataset, a sampled subset of CICIDS2017 is used. The subsets are in the "data" folder.  
To use this code on another dataset (e.g., the CAN-intrusion dataset), change the dataset name and follow the same steps. The models in this code are generic and can be applied to any intrusion detection or network traffic dataset.

In [None]:
#Read dataset
df = pd.read_csv('./data/CICIDS2017_sample.csv') 
# The results in this code are based on the original CICIDS2017 dataset. Please go to cell [21] if you work on the sampled dataset.

In [None]:
df

In [None]:
df.Label.value_counts()

### Preprocessing (normalization and padding values)

In [None]:
# Z-score normalization
features = df.dtypes[df.dtypes != 'object'].index
df[features] = df[features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# Fill empty values by 0
df = df.fillna(0)

### Data sampling
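Due to the space limit of GitHub files and the large size of network traffic data, we sample a small subset for model learning using **k-means cluster sampling**: the majority classes are clustered with k-means and a fixed fraction of each cluster is drawn, while the minority classes are kept in full.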

In [None]:
labelencoder = LabelEncoder()
df.iloc[:, -1] = labelencoder.fit_transform(df.iloc[:, -1])

In [None]:
df.Label.value_counts()

In [None]:
# retain the minority class instances and sample the majority class instances
df_minor = df[(df['Label']==6)|(df['Label']==1)|(df['Label']==4)]
df_major = df.drop(df_minor.index)

In [None]:
X = df_major.drop(['Label'],axis=1) 
y = df_major.iloc[:, -1].values.reshape(-1,1)
y=np.ravel(y)

In [None]:
# use k-means to cluster the data samples and select a proportion of data from each cluster
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=1000, random_state=0).fit(X)

In [None]:
klabel=kmeans.labels_
df_major['klabel']=klabel

In [None]:
df_major['klabel'].value_counts()

In [None]:
cols = list(df_major)
cols.insert(78, cols.pop(cols.index('Label')))
df_major = df_major.loc[:, cols]

In [None]:
df_major

In [None]:
def typicalSampling(group):
    name = group.name
    frac = 0.008
    return group.sample(frac=frac)

result = df_major.groupby(
    'klabel', group_keys=False
).apply(typicalSampling)

In [None]:
result['Label'].value_counts()

In [None]:
result

## 4371: Modified the cell below to use pandas.concat(), the recommended way to combine DataFrames in recent versions of pandas

In [None]:
# 'result' still carries the 'klabel' column added during clustering, while
# 'df_minor' does not, so the concatenated frame has NaN in 'klabel' for
# every minority-class row (handled by the imputation step further down)

# Concatenate 'result' and 'df_minor' DataFrames
result = pd.concat([result, df_minor], ignore_index=True)

print("DataFrames concatenated successfully.")
print("Updated DataFrame head:")
print(result.head())

In [None]:
result.to_csv('./data/CICIDS2017_sample_km.csv',index=0)

### split train set and test set

In [None]:
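# Enhanced Pipeline (With Isolation Forest)
print("---Enhanced Pipeline (With Isolation Forest)---")

# Read the k-means-sampled CICIDS2017 dataset
df_enh = pd.read_csv('./data/CICIDS2017_sample_km.csv')
# The cells below refer to this frame as `df`, so rebind that name to the sampled data;
# otherwise they would keep operating on the full, unsampled frame loaded earlier
df = df_enh
print(df_enh.isnull().sum())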

## ----- ADDED LINES BELOW 4371 TO FIX ISSUE WITH ValueError: Input X contains NaN. ------

In [None]:
from sklearn.impute import SimpleImputer

# Create an imputer object with the desired strategy (mean, median, most_frequent)
imputer = SimpleImputer(strategy='mean')

# Apply the imputer to the DataFrame
df[df.columns] = imputer.fit_transform(df)

# SimpleImputer replaces each NaN with the mean of its column, which removes the missing
# values from the dataset and lets mutual_info_classif run without errors.

In [None]:
X = df.drop(['Label'],axis=1).values
y = df.iloc[:, -1].values.reshape(-1,1)
y=np.ravel(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, random_state = 0,stratify = y)

## Feature engineering

### Feature selection by information gain

In [None]:
from sklearn.feature_selection import mutual_info_classif
importances = mutual_info_classif(X_train, y_train)

In [None]:
# calculate the sum of importance scores
f_list = sorted(zip(map(lambda x: round(x, 4), importances), features), reverse=True)
Sum = 0
fs = []
for i in range(0, len(f_list)):
    Sum = Sum + f_list[i][0]
    fs.append(f_list[i][1])

In [None]:
# select the important features from top to bottom until the accumulated importance reaches 90%
f_list2 = sorted(zip(map(lambda x: round(x, 4), importances/Sum), features), reverse=True)
Sum2 = 0
fs = []
for i in range(0, len(f_list2)):
    Sum2 = Sum2 + f_list2[i][0]
    fs.append(f_list2[i][1])
    if Sum2>=0.9:
        break        

In [None]:
X_fs = df[fs].values

In [None]:
X_fs.shape

### Feature selection by Fast Correlation Based Filter (FCBF)

The module is imported from the GitHub repo: https://github.com/SantiagoEG/FCBF_module

In [None]:
from FCBF_module import FCBF, FCBFK, FCBFiP, get_i
fcbf = FCBFK(k = 20)
#fcbf.fit(X_fs, y)

In [None]:
X_fss = fcbf.fit_transform(X_fs,y)

In [None]:
X_fss.shape

## Isolation Forest Implementation 

After performing feature selection using Information Gain (IG) and Fast Correlation-Based Filter (FCBF), we apply Isolation Forest to detect and filter out anomalies in the dataset. This step is intended to improve the model's ability to differentiate between actual threats and benign unusual behavior by removing potential outliers before training.

**Parameters:**

- `n_estimators=100`: Number of trees in the forest.
- `contamination='auto'`: Use the score threshold from the original Isolation Forest paper instead of fixing the expected proportion of anomalies in advance.
- `random_state=42`: For reproducibility.

**Anomaly detection:**

- `anomaly_predictions == 1`: Inliers (normal instances).
- `anomaly_predictions == -1`: Outliers (anomalies).

**Filtering data:**

- `X_filtered`: Contains only the inlier instances.
- `y_filtered`: Corresponding labels for the inliers.


In [None]:
# Apply Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
iso_forest.fit(X_fss)

# Obtain anomaly scores and predictions
anomaly_scores = iso_forest.decision_function(X_fss)
anomaly_predictions = iso_forest.predict(X_fss)

## Visualizing Anomaly Scores

To understand how the Isolation Forest has assigned anomaly scores to our data points, we visualize the distribution of these scores. This helps us assess the threshold and proportion of data considered anomalous, providing insights into the filtering process.
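
When reading the histogram, it helps to know where the cut falls: in scikit-learn, `predict` returns `-1` exactly where `decision_function` is negative, so the threshold sits at a score of 0. The sketch below checks this on synthetic data; the two-dimensional toy points are an assumption made for illustration and have nothing to do with CICIDS2017.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: a dense Gaussian blob plus a few far-away points
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.uniform(6, 8, size=(5, 2)),
])

# Same settings as the notebook's Isolation Forest
iso = IsolationForest(n_estimators=100, contamination="auto", random_state=42).fit(X_toy)
scores = iso.decision_function(X_toy)
preds = iso.predict(X_toy)

# Outlier predictions coincide with negative decision scores
same = np.array_equal(preds == -1, scores < 0)
print(same, int((preds == -1).sum()), "points flagged")
```

Everything to the left of 0 on the histogram is therefore what the filtering step removes.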

In [None]:
# Visualize anomaly scores
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.hist(anomaly_scores, bins=50, color='skyblue', edgecolor='black')
plt.title('Histogram of Anomaly Scores')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.show()

## Filtering Out Detected Anomalies

Using the predictions from the Isolation Forest, we filter out the anomalies (outliers) from our dataset. We retain only the inlier data points (those predicted as normal) for model training. This step aims to improve the quality of our training data by removing noise and potential outliers.
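
Because the filter is unsupervised, it can remove samples unevenly across classes, which matters for rare attack types. One way to check is to compute the share of each class that gets dropped. The helper below is a hypothetical sketch (`removal_rate_by_class` is not part of the notebook or any library), shown on made-up labels.

```python
import pandas as pd

def removal_rate_by_class(labels, inlier_mask):
    """Return, per class label, the fraction of samples flagged as outliers."""
    removed = (~pd.Series(inlier_mask, dtype=bool)).astype(float)
    return removed.groupby(pd.Series(labels)).mean()

# Made-up example: class 0 loses one of four samples, class 1 loses both of its samples
toy_labels = [0, 0, 0, 0, 1, 1]
toy_mask = [True, True, True, False, False, False]
rates = removal_rate_by_class(toy_labels, toy_mask)
print(rates)
```

In the notebook this could be called as `removal_rate_by_class(y, inlier_mask)` once the mask has been computed.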

In [None]:
# Filter out anomalies
inlier_mask = anomaly_predictions == 1
X_filtered = X_fss[inlier_mask]
y_filtered = y[inlier_mask]

### Re-split train & test sets after feature selection and anomaly filtering

In [None]:
# Train-test split after filtering
X_train_enh, X_test_enh, y_train_enh, y_test_enh = train_test_split(
    X_filtered, y_filtered, train_size=0.8, test_size=0.2, random_state=0, stratify=y_filtered
)

In [None]:
X_train_enh.shape

In [None]:
pd.Series(y_train_enh).value_counts()

In [None]:
# Class distribution in training data
print("Enhanced Training Data Class Distribution:")
print(pd.Series(y_train_enh).value_counts())

# Dataset size
print(f"Enhanced Training Data Shape: {X_train_enh.shape}")
print(f"Enhanced Test Data Shape: {X_test_enh.shape}")

# Full Comparison of Original pipeline against Isolation Forest Implementation

In [None]:
# Total number of samples before and after anomaly detection
total_samples_before = X_fss.shape[0]
total_samples_after = X_filtered.shape[0]
anomalies_removed = total_samples_before - total_samples_after

print(f"Total Samples Before Isolation Forest: {total_samples_before}")
print(f"Total Samples After Isolation Forest: {total_samples_after}")
print(f"Anomalies Detected and Removed: {anomalies_removed}")

# Percentage of anomalies detected
anomaly_percentage = (anomalies_removed / total_samples_before) * 100
print(f"Percentage of Anomalies Detected: {anomaly_percentage:.2f}%")

In [None]:
# Class distribution before Isolation Forest
print("Class Distribution Before Isolation Forest:")
print(pd.Series(y).value_counts())

# Class distribution after Isolation Forest
print("\nClass Distribution After Isolation Forest:")
print(pd.Series(y_filtered).value_counts())

# Plotting class distributions
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Before Isolation Forest
sns.countplot(x=y, ax=ax[0])
ax[0].set_title('Before Isolation Forest')
ax[0].set_xlabel('Class')
ax[0].set_ylabel('Count')

# After Isolation Forest
sns.countplot(x=y_filtered, ax=ax[1])
ax[1].set_title('After Isolation Forest')
ax[1].set_xlabel('Class')
ax[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
# Original dataset sizes
print("Original Training Data Shape:", X_train_orig.shape)
print("Original Test Data Shape:", X_test_orig.shape)

# Enhanced dataset sizes
print("\nEnhanced Training Data Shape:", X_train_enh.shape)
print("Enhanced Test Data Shape:", X_test_enh.shape)

In [None]:
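# time_orig / time_enh are never assigned in the cells above, so fall back to 'N/A'
# unless preprocessing timings have been recorded under those names
time_orig_str = f"{time_orig:.2f}" if 'time_orig' in globals() else 'N/A'
time_enh_str = f"{time_enh:.2f}" if 'time_enh' in globals() else 'N/A'

summary_data = {
    'Metric': [
        'Total Samples',
        'Anomalies Detected and Removed',
        'Percentage of Anomalies Detected',
        'Training Data Shape',
        'Test Data Shape',
        'Time Taken for Preprocessing (seconds)'
    ],
    'Original Pipeline': [
        X_fss_orig.shape[0],  # sample count of the original pipeline (no anomaly filter)
        'N/A',
        'N/A',
        X_train_orig.shape,
        X_test_orig.shape,
        time_orig_str
    ],
    'Enhanced Pipeline': [
        total_samples_after,  # X_filtered.shape[0]
        anomalies_removed,
        f"{anomaly_percentage:.2f}%",
        X_train_enh.shape,
        X_test_enh.shape,
        time_enh_str
    ]
}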

summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))
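
The summary table reports a preprocessing time for each pipeline, but no cell in this notebook records one. A possible way to capture the values is sketched below; the `timed` helper and the placeholder workloads are assumptions for illustration, and the real preprocessing cells would need to run inside the `with` blocks before the summary cell is executed.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(results, key):
    """Store the wall-clock duration of the enclosed block in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[key] = time.perf_counter() - start

timings = {}

# Placeholder workloads standing in for each pipeline's preprocessing steps
with timed(timings, "orig"):
    sum(i * i for i in range(10_000))
with timed(timings, "enh"):
    sum(i * i for i in range(20_000))

# Names expected by the summary cell
time_orig = timings["orig"]
time_enh = timings["enh"]
print(f"{time_orig:.2f}", f"{time_enh:.2f}")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.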


In [None]:
# Density plot of the anomaly scores
plt.figure(figsize=(10, 6))
sns.kdeplot(anomaly_scores, fill=True, color='red')  # `fill` replaces the deprecated `shade` argument
plt.title('Density Plot of Anomaly Scores')
plt.xlabel('Anomaly Score')
plt.ylabel('Density')
plt.show()