Artificial Bee Colony (ABC) is an optimization algorithm
inspired by the foraging behavior of honeybees. It is a
metaheuristic: a general-purpose search strategy that looks
for good, though not provably optimal, solutions in complex,
high-dimensional search spaces.

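The ABC algorithm works by simulating the behavior of a colony
of bees as they forage for food. Each food source stands for a
candidate solution, and the amount of nectar it holds stands for
that solution's quality. The colony is made up of three types of
bees: employed bees, onlookers, and scouts.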

Employed bees explore the neighborhood around a food source.
Each employed bee takes the candidate solution at its current
food source and modifies it slightly. The modified solution is
then evaluated, and if it is better than the current solution,
the bee keeps it in place of the old one. Back at the hive,
the employed bees share the quality of their food sources
with the other bees.
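
The slight modification an employed bee makes is commonly written as v_ij = x_ij + phi * (x_ij - x_kj), where k is a different, randomly chosen food source, j is a random dimension, and phi is drawn uniformly from [-1, 1]. The sketch below illustrates that update in plain Python. The function name, the list-of-lists representation, and the clipping to the bounds are assumptions made for illustration.

```python
import random

def neighbor_candidate(sources, i, bounds, rng=random):
    """Return a copy of sources[i] with one randomly chosen coordinate perturbed.

    Applies v_ij = x_ij + phi * (x_ij - x_kj) with phi uniform in [-1, 1],
    where k is a food source other than i and j is a random dimension.
    """
    k = rng.choice([s for s in range(len(sources)) if s != i])
    j = rng.randrange(len(bounds))
    phi = rng.uniform(-1.0, 1.0)
    candidate = list(sources[i])
    value = candidate[j] + phi * (candidate[j] - sources[k][j])
    low, high = bounds[j]
    candidate[j] = min(max(value, low), high)  # keep the candidate inside the search space
    return candidate
```

The bee then keeps whichever of the two solutions, old or candidate, scores better.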

Onlooker bees observe the dances of the employed bees and
choose a food source based on the quality each dance advertises,
so better food sources attract more onlookers. Each onlooker then
visits its chosen food source and explores the neighborhood
around it in the same way an employed bee does.
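
In the algorithm, the quality of a dance is usually modeled as a selection probability proportional to fitness, p_i = fit_i / sum(fit). The sketch below shows this roulette-wheel choice, assuming non-negative fitness values with a positive sum. Both helper names are illustrative.

```python
import random

def selection_probabilities(fitness):
    """Probability that an onlooker picks each food source, proportional to fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]

def choose_sources(fitness, n_onlookers, rng=random):
    """Pick one food source index per onlooker by roulette-wheel selection."""
    return rng.choices(range(len(fitness)), weights=fitness, k=n_onlookers)
```

For example, with fitness values [1.0, 3.0] the second source is chosen three times as often as the first on average.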

Scout bees are responsible for exploring the search space beyond
the neighborhood of the food sources. If a food source has not
been improved for a certain number of iterations, it is abandoned
and a scout bee replaces it with a randomly chosen point in the
search space.
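
The "certain number of iterations" is usually a parameter called the limit: each food source carries a counter of consecutive failed improvement attempts. The sketch below replaces every source whose counter exceeds the limit. Some formulations abandon at most one source per cycle, so treat this as one possible variant. All names are hypothetical.

```python
import random

def scout_phase(sources, trials, limit, bounds, rng=random):
    """Replace each source whose failure counter exceeds `limit` with a random point.

    Modifies `sources` and `trials` in place and returns the replaced indices.
    """
    replaced = []
    for i, count in enumerate(trials):
        if count > limit:
            sources[i] = [rng.uniform(low, high) for low, high in bounds]
            trials[i] = 0  # the fresh source starts with a clean counter
            replaced.append(i)
    return replaced
```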

The ABC algorithm continues to run until a stopping criterion is met,
such as a maximum number of iterations or a minimum improvement threshold.
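
Both stopping criteria can be checked against the history of best scores. The sketch below assumes a maximization problem and a list holding the best score after each iteration. The `patience` window and the default threshold are illustrative choices, not values prescribed by ABC.

```python
def should_stop(best_history, max_iterations, min_improvement=1e-4, patience=5):
    """Return True once the iteration budget is spent or progress has stalled.

    Progress has stalled when the best score gained less than `min_improvement`
    over the last `patience` iterations.
    """
    if len(best_history) >= max_iterations:
        return True
    if len(best_history) > patience:
        gain = best_history[-1] - best_history[-1 - patience]
        return gain < min_improvement
    return False
```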

In [4]:
!pip3 install pyabc
!pip3 install pandas
!pip3 install scikit-learn

Collecting cloudpickle>=1.5.0
  Downloading cloudpickle-2.2.1-py3-none-any.whl (25 kB)
Collecting scikit-learn>=0.23.1
  Downloading scikit_learn-1.2.2-cp311-cp311-macosx_10_9_x86_64.whl (9.0 MB)
Collecting click>=7.1.2
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting redis>=2.10.6
  Downloading redis-4.5.3-py3-none-any.whl (238 kB)
Collecting distributed>=2022.10.2
  Downloading distributed-2023.3.2-py3-none-any.whl (956 kB)

Load the NSL-KDD dataset into a pandas DataFrame

In [29]:
import pandas as pd

# Load the NSL-KDD dataset into a Pandas dataframe

col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]


df = pd.read_csv('../NSL-KDD/NSL_KDD_Train.csv', header=None, names=col_names)


print(df.columns)

# Separate the feature columns from the label column
X = df.drop(columns="label")
y = df["label"]


Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_access_files', 'num_outbound_cmds', 'is_host_login',
       'is_guest_login', 'count', 'srv_count', 'serror_rate',
       'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
       'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_same_srv_rate',
       'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
       'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate', 'label'],
      dtype='object')


Prepare the data by separating the features and labels, and normalizing the feature values

In [30]:
from sklearn.preprocessing import StandardScaler

# Separate the categorical and numerical features by column name.
# The label column is excluded so that it does not leak into the features.
cat_cols = ["protocol_type", "service", "flag"]
num_cols = [c for c in df.columns if c not in cat_cols and c != "label"]

# One-hot encode the categorical features
cat_data = pd.get_dummies(df[cat_cols])

# Combine the one-hot encoded categorical features and numerical features
X = pd.concat([cat_data, df[num_cols]], axis=1)

# Normalize the feature values
scaler = StandardScaler()
X = scaler.fit_transform(X)

Define the objective function: In ABC, the objective function evaluates the quality of each candidate solution. For intrusion detection, it can be the accuracy of a machine learning model trained on the NSL-KDD dataset and scored on held-out data.

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the objective function
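def objective_function(params):
    # ABC may propose non-integer positions, so round each parameter to an integer
    n_estimators, max_depth = (int(round(p)) for p in params)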
    # Train a Random Forest Classifier with the specified parameters
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    clf.fit(X_train, y_train)
    # Evaluate the model on the testing set and return the accuracy
    y_pred = clf.predict(X_test)
    return accuracy_score(y_test, y_pred)

Define the search space: In ABC, the search space is the range of values the algorithm is allowed to explore. For intrusion detection with a Random Forest Classifier, it can be the range for the number of trees and the range for the maximum depth of each tree.

In [32]:
import numpy as np

# Define the search space
search_space = [
    ("n_estimators", (50, 200)),
    ("max_depth", (2, 20))
]
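
Because ABC moves through a continuous space while n_estimators and max_depth must be integers, each candidate position needs to be rounded and clipped before it reaches the classifier. The sketch below shows one way to do that. `decode` is a hypothetical helper, not part of any library, and the search space is repeated only to keep the example self-contained.

```python
def decode(position, search_space):
    """Map a continuous position to integer hyperparameters inside their bounds."""
    params = {}
    for value, (name, (low, high)) in zip(position, search_space):
        params[name] = int(min(max(round(value), low), high))
    return params

search_space = [
    ("n_estimators", (50, 200)),
    ("max_depth", (2, 20)),
]

# 123.6 rounds to 124; 25.2 rounds to 25 and is clipped to the upper bound 20
print(decode([123.6, 25.2], search_space))
```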

Once the objective function and search space are defined, you can run the ABC algorithm to search for the best-scoring parameters.

In [34]:
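# pyabc implements Approximate Bayesian Computation, not Artificial Bee Colony,
# so a minimal bee-colony loop is written out with NumPy instead. The settings
# are small illustrative values; every evaluation trains a Random Forest.
rng = np.random.default_rng(42)
low, high = np.array([bounds for _, bounds in search_space]).T
n_sources, limit, n_iterations = 5, 3, 10
sources = rng.integers(low, high + 1, size=(n_sources, len(low)))
scores = np.array([objective_function(s) for s in sources])
trials = np.zeros(n_sources, dtype=int)

def explore(i):
    # Shift one parameter of source i relative to a random other source; keep it if accuracy improves
    k, j = rng.choice([s for s in range(n_sources) if s != i]), rng.integers(len(low))
    candidate = sources[i].copy()
    step = rng.uniform(-1, 1) * (sources[i, j] - sources[k, j])
    candidate[j] = np.clip(np.rint(sources[i, j] + step), low[j], high[j])
    score = objective_function(candidate)
    if score > scores[i]:
        sources[i], scores[i], trials[i] = candidate, score, 0
    else:
        trials[i] += 1

for _ in range(n_iterations):
    for i in range(n_sources):  # employed bees visit every source
        explore(i)
    for i in rng.choice(n_sources, size=n_sources, p=scores / scores.sum()):  # onlookers
        explore(i)
    stale = trials.argmax()  # scout: re-seed the most exhausted source, but never the best one
    if trials[stale] > limit and stale != scores.argmax():
        sources[stale], trials[stale] = rng.integers(low, high + 1), 0
        scores[stale] = objective_function(sources[stale])

best = scores.argmax()
print(dict(zip([name for name, _ in search_space], sources[best])), scores[best])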