## CTU-13 Botnet Detection

An example of how to review a dataset

The CTU-13 dataset is a collection of network traffic captured in a university environment, containing both normal and botnet traffic. It is often used for research in botnet detection.

In [None]:
# importing libraries
import numpy as np
import pandas as pd

In [None]:
# path of sample data
path = 'C:\\Home\\CLDC\\CTU-13/'

In [None]:
# sample dataframe
df = pd.read_parquet(path+'1-Neris-20110810.binetflow.parquet')

In [None]:
df.head()

|   | dur | proto | dir | state | stos | dtos | tot_pkts | tot_bytes | src_bytes | label | Family |
|---|-----|-------|-----|-------|------|------|----------|-----------|-----------|-------|--------|
| 0 | 1.026539 | tcp | -> | S_RA | 0.0 | 0.0 | 4 | 276 | 156 | flow=Background-Established-cmpgw-CVUT | 20110810.binetflow.csv |
| 1 | 1.009595 | tcp | -> | S_RA | 0.0 | 0.0 | 4 | 276 | 156 | flow=Background-Established-cmpgw-CVUT | 20110810.binetflow.csv |
| 2 | 3.056586 | tcp | -> | SR_A | 0.0 | 0.0 | 3 | 182 | 122 | flow=Background-TCP-Attempt | 20110810.binetflow.csv |
| 3 | 3.111769 | tcp | -> | SR_A | 0.0 | 0.0 | 3 | 182 | 122 | flow=Background-TCP-Attempt | 20110810.binetflow.csv |
| 4 | 3.083411 | tcp | -> | SR_A | 0.0 | 0.0 | 3 | 182 | 122 | flow=Background-TCP-Attempt | 20110810.binetflow.csv |


In [None]:
# details of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1621173 entries, 0 to 1621172
Data columns (total 11 columns):
 #   Column     Non-Null Count    Dtype   
---  ------     --------------    -----   
 0   dur        1621173 non-null  float32 
 1   proto      1621173 non-null  category
 2   dir        1621173 non-null  category
 3   state      1621172 non-null  category
 4   stos       1616028 non-null  float32 
 5   dtos       1529939 non-null  float32 
 6   tot_pkts   1621173 non-null  int32   
 7   tot_bytes  1621173 non-null  int64   
 8   src_bytes  1621173 non-null  int64   
 9   label      1621173 non-null  category
 10  Family     1621173 non-null  category
dtypes: category(5), float32(3), int32(1), int64(2)
memory usage: 58.8 MB
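
The `Non-Null Count` column above shows that `state`, `stos`, and `dtos` contain missing values. A minimal sketch of how such gaps could be counted and filled, using a small hypothetical frame (`sample`) in place of the real data. Filling the ToS fields with 0 and the state with an explicit `'unknown'` category is an assumption, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with the same column names and dtypes.
sample = pd.DataFrame({
    'state': pd.Categorical(['S_RA', None, 'CON', 'CON']),
    'stos': np.array([0.0, np.nan, 0.0, 0.0], dtype='float32'),
    'dtos': np.array([0.0, np.nan, np.nan, 0.0], dtype='float32'),
})

# Number of missing values per column.
missing = sample.isna().sum()
print(missing)

# Assumed fill strategy: 0 for the ToS fields, 'unknown' for the state.
cleaned = sample.copy()
cleaned[['stos', 'dtos']] = cleaned[['stos', 'dtos']].fillna(0.0)
# A categorical column only accepts known categories, so register the new one first.
cleaned['state'] = cleaned['state'].cat.add_categories('unknown').fillna('unknown')
```

Whether 0 is a sensible default for a missing ToS value depends on why the value is missing, so it is worth checking which protocols the gaps belong to before settling on a strategy.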


In [None]:
# list of all columns
df.columns

Index(['dur', 'proto', 'dir', 'state', 'stos', 'dtos', 'tot_pkts', 'tot_bytes',
       'src_bytes', 'label', 'Family'],
      dtype='object')

There are five categorical columns in total (the `Family` column here only holds the name of the data file). Their categories are as follows:

In [None]:
# all categories of protocol
df['proto'].value_counts()

udp          1153511
tcp           448832
icmp           13411
rtp             2570
rtcp            2304
arp              443
ipv6-icmp         50
esp               10
ipv6              10
udt               10
ipx/spx            8
pim                7
rarp               4
igmp               2
unas               1
Name: proto, dtype: int64

In [None]:
# all categories of direction
df['dir'].value_counts()

  <->    1109168
   ->     502286
  <?>       6139
  <-        1702
   ?>       1426
  who        447
  <?           5
Name: dir, dtype: int64

In [None]:
# all categories of state
df['state'].value_counts()

CON            1108492
FSPA_FSPA       215447
S_RA             52222
INT              49410
SRPA_FSPA        35374
                ...   
FSRPAE_FSPA          1
SPA_FSA              1
FSAU_SA              1
FSPA_FA              1
SPA_SRPAC            1
Name: state, Length: 230, dtype: int64
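
With 230 distinct `state` values, many of which occur only once, it may help to fold the rare values into a single bucket before encoding. A sketch on a hypothetical series; the helper name `collapse_rare` and the threshold `min_count` are illustrative choices, not part of the original notebook:

```python
import pandas as pd

def collapse_rare(values: pd.Series, min_count: int, other: str = 'other') -> pd.Series:
    """Replace values seen fewer than min_count times (and missing values) with `other`."""
    as_text = values.astype('object')
    counts = as_text.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent values, swap everything else for the `other` bucket.
    return as_text.where(as_text.isin(frequent), other).astype('category')

# Hypothetical sample using state values that appear in the output above.
states = pd.Series(['CON', 'CON', 'CON', 'S_RA', 'S_RA', 'FSAU_SA', 'SPA_FSA'])
collapsed = collapse_rare(states, min_count=2)
print(collapsed.value_counts())
```

The threshold trades detail for a smaller encoded feature space, so a suitable value has to be chosen against the real counts.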

In [None]:
# all categories of label
df['label'].value_counts()

flow=Background-UDP-Established                                            815604
flow=To-Background-UDP-CVUT-DNS-Server                                     220506
flow=Background-TCP-Established                                            218569
flow=Background-Established-cmpgw-CVUT                                     136333
flow=Background-TCP-Attempt                                                 60097
                                                                            ...  
flow=From-Botnet-V42-TCP-HTTP-Not-Encrypted-Down-2                              1
flow=From-Botnet-V42-TCP-Established-HTTP-Ad-60                                 1
flow=From-Botnet-V42-TCP-Established-HTTP-Binary-Download-Custom-Port-5         1
flow=From-Botnet-V42-ICMP                                                       1
flow=From-Botnet-V42-TCP-Established-HTTP-Adobe-4                               1
Name: label, Length: 113, dtype: int64
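
The 113 raw labels are too fine-grained to use directly as a target. One possible simplification, assumed here rather than taken from the original notebook, is to map each label to a coarse class by substring: labels containing `Botnet` become `botnet`, labels containing `Normal` become `normal` (an assumption about the label scheme, since no such label appears in the truncated output above), and everything else becomes `background`. The function name `coarse_class` is hypothetical.

```python
import pandas as pd

def coarse_class(label: str) -> str:
    """Map a raw flow label to 'botnet', 'normal', or 'background'."""
    if 'Botnet' in label:
        return 'botnet'
    if 'Normal' in label:  # assumed naming for normal traffic
        return 'normal'
    return 'background'

# Label strings copied from the value counts above.
labels = pd.Series([
    'flow=Background-UDP-Established',
    'flow=To-Background-UDP-CVUT-DNS-Server',
    'flow=From-Botnet-V42-ICMP',
    'flow=From-Botnet-V42-TCP-Established-HTTP-Ad-60',
])

classes = labels.map(coarse_class)
# Binary target: 1 for botnet flows, 0 for everything else.
is_botnet = (classes == 'botnet').astype(int)
print(classes.value_counts())
```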

### Features

First, we need to understand the features of the dataset. The features in this CTU-13 file are:

1. 'dur': This feature represents the duration of the flow, in seconds. A flow is a sequence of packets that share common properties such as source and destination IP addresses, source and destination ports, and protocol.


2. 'proto': This feature represents the protocol used by the flow, such as TCP, UDP, or ICMP. Not every value is a transport-layer protocol; ARP, for example, also appears in the counts above.


3. 'dir': This feature represents the direction of the flow, for example '->' (source to destination) or '<->' (traffic in both directions).


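4. 'state': This feature represents the state of the flow, whose meaning depends on the protocol. Values such as 'CON', 'INT', and 'S_RA' appear in the counts above; for TCP, the value combines the flags seen in each direction, separated by an underscore.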


5. 'stos': This feature represents the type of service (ToS) value of the packets sent by the source, which indicates the priority of the traffic.


6. 'dtos': This feature represents the ToS value of the packets sent by the destination.


7. 'tot_pkts': This feature represents the total number of packets in the flow.


8. 'tot_bytes': This feature represents the total number of bytes in the flow.


9. 'src_bytes': This feature represents the number of bytes sent from the source IP address.


10. 'label': This feature represents the label or class of the flow, indicating whether it is background, normal, or botnet traffic. The raw values are fine-grained strings such as 'flow=Background-TCP-Attempt'.


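11. 'Family': This feature is named for the botnet family associated with the flow, but in the sample rows above it only holds the name of the source data file ('20110810.binetflow.csv'), so it is unlikely to be useful as a model input.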
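
Some useful quantities are not stored directly but follow from the columns listed above. Since 'tot_bytes' covers the whole flow and 'src_bytes' only the bytes sent by the source, the bytes sent by the destination can be obtained by subtraction. The sketch below uses values from the sample rows shown earlier; the new column names `dst_bytes`, `bytes_per_pkt`, and `src_ratio` are hypothetical additions, not part of the dataset.

```python
import pandas as pd

# Two flows taken from the df.head() output.
flows = pd.DataFrame({
    'dur': [1.026539, 3.056586],
    'tot_pkts': [4, 3],
    'tot_bytes': [276, 182],
    'src_bytes': [156, 122],
})

# Bytes sent by the destination = total bytes minus source bytes.
flows['dst_bytes'] = flows['tot_bytes'] - flows['src_bytes']
# Average packet size over the whole flow.
flows['bytes_per_pkt'] = flows['tot_bytes'] / flows['tot_pkts']
# Share of the traffic that came from the source.
flows['src_ratio'] = flows['src_bytes'] / flows['tot_bytes']
print(flows)
```

For the first flow this gives 276 - 156 = 120 destination bytes and an average of 276 / 4 = 69 bytes per packet.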


These features are used as input to machine learning algorithms for botnet detection. By analyzing these features, machine learning algorithms can learn to distinguish between normal traffic and botnet traffic and identify the characteristics of different botnet families.

### Approach

Several machine learning algorithms can be used for botnet detection in the CTU-13 dataset. The choice of algorithm will depend on the specific characteristics of the dataset, the available computational resources, and the performance metrics of interest. Some of the machine learning algorithms for botnet detection include:


1. Decision Trees: Decision trees are simple and easy to interpret. They can be used for both classification and regression problems, and they can handle categorical and numerical data, although some implementations require categorical columns to be encoded first.


2. Random Forests: Random forests are an ensemble learning method that combines multiple decision trees trained on random subsets of the data and features. They are effective for handling high-dimensional data, and some implementations can handle missing values.


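3. Support Vector Machines (SVM): SVM is a powerful algorithm for classification problems. With a non-linear kernel it can model non-linear decision boundaries, and it can handle high-dimensional data. Training cost grows quickly with the number of samples, which matters for a file with more than 1.6 million flows.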


4. Naive Bayes: Naive Bayes is a simple and fast algorithm that is effective for handling high-dimensional data. It is based on Bayes' theorem and assumes that the features are conditionally independent given the class.


5. Artificial Neural Networks (ANN): ANNs are a flexible family of models for complex data. They can model non-linear relationships and can handle high-dimensional data.


6. K-Nearest Neighbors (KNN): KNN is a simple and effective algorithm for classification problems. It works by finding the k nearest neighbors to a new data point and using their labels to classify the new point.


7. Gradient Boosting Machines (GBM): GBM is an ensemble learning method that builds weak models one after another, each correcting the errors of the previous ones, to form a strong model. It is effective for handling complex data, and some implementations can handle missing values.
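
As an illustration of how one of the models listed above could be fitted, the sketch below one-hot encodes the categorical columns and trains a random forest with scikit-learn. The data is synthetic and randomly generated, so the predictions carry no meaning; the binary target `y` and the chosen hyperparameters are also assumptions, not the original author's setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the flow table (not real CTU-13 data).
X = pd.DataFrame({
    'dur': rng.exponential(2.0, n),
    'proto': rng.choice(['tcp', 'udp', 'icmp'], n),
    'dir': rng.choice(['->', '<->'], n),
    'tot_pkts': rng.integers(1, 50, n),
    'tot_bytes': rng.integers(60, 5000, n),
})
y = rng.integers(0, 2, n)  # hypothetical binary target: 1 = botnet

categorical = ['proto', 'dir']
numeric = ['dur', 'tot_pkts', 'tot_bytes']

model = Pipeline([
    # One-hot encode the categorical columns, pass numeric ones through unchanged.
    ('encode', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('num', 'passthrough', numeric),
    ])),
    ('forest', RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Stratify so both classes keep the same proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Keeping the encoder inside the pipeline means the same encoding learned on the training split is applied to the test split, and any of the other classifiers could be swapped in for the `'forest'` step.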


The choice of algorithm should be based on the requirements of the problem and the characteristics of the dataset. The chosen algorithm should then be evaluated with appropriate metrics such as accuracy, precision, recall, and F1-score. Because background labels dominate the label counts above, accuracy alone can look high even when few botnet flows are detected, so precision and recall on the botnet class are more informative.
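
The metrics named above can all be computed from the four confusion-matrix counts. The sketch below uses made-up counts (20 botnet flows among 1000) to show how a high accuracy can coexist with a low recall when botnet flows are rare; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 for the positive (botnet) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up counts: 20 botnet flows out of 1000, half of them detected.
metrics = classification_metrics(tp=10, fp=5, fn=10, tn=975)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

With these counts the accuracy is 0.985, yet the recall is only 0.5: half of the botnet flows are missed.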