# AC&CD: Active C&C Detector
![ACCD](./images/ACCD.png)

# How AC&CD works

AC&CD analyzes the time delta distribution and the data size (sent and received bytes) distribution of the traffic between the same source and destination, and produces an overall score.

## Time delta distribution scoring
AC&CD calculates a score based on time delta distribution. 

In a real attack, attackers change the sleep and jitter configuration on the fly according to their needs, such as SOCKS tunneling or off-business hours. These changes make the traffic look less like beaconing and cause false negatives. To solve the issue, AC&CD analyzes only the part of the time delta set where higher sleep values can't cause false negatives, which is roughly the first 60% of the time delta set when ordered by value. AC&CD therefore calculates the time delta distribution score from that part by using the 15th, 30th, and 45th percentiles of the whole dataset.
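
As a rough illustration of why the low percentiles stay stable, the sketch below builds a hypothetical beacon whose operator switches from a short sleep to a long one. The sleep values (60 s and 3600 s), the 10% jitter, and the 60/40 split are assumptions chosen for the example, not values taken from AC&CD.

```python
import numpy as np

# Hypothetical beacon: 60 check-ins at ~60 s sleep, then 40 check-ins at ~3600 s sleep.
rng = np.random.default_rng(seed=0)
interactive = rng.uniform(54, 66, size=60)     # 60 s sleep with +/-10% jitter
idle = rng.uniform(3240, 3960, size=40)        # 3600 s sleep with +/-10% jitter
deltas = np.concatenate([interactive, idle])

# The 15th, 30th and 45th percentiles all fall inside the short-sleep values.
low, mid, high = np.percentile(deltas, [15, 30, 45])
print("percentiles:", low, mid, high)

# Statistics over the whole set are dominated by the long sleeps instead.
print("mean:", deltas.mean())
```

Under these assumptions the three percentiles remain within the 54 to 66 second range, while the mean is pulled far above it by the long sleeps.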

### Median Absolute Deviation of time delta
When jitter is used in beaconing, the time delta set is expected to have some dispersion around its median. Median Absolute Deviation (MAD) is used to measure that dispersion. Dividing the MAD by the median of the time delta set gives a normalized dispersion value, which approximates the jitter in the dataset. The normalized dispersion value is used to calculate the time delta score.
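
A minimal sketch of the normalized dispersion, using the textbook MAD definition. The helper name `normalized_mad` and the sample lists are hypothetical. Note that the notebook cells later in this document compute a percentile-based variant (the 30th percentile of the absolute deviations divided by the 30th percentile of the deltas) rather than this exact formula.

```python
import numpy as np

def normalized_mad(deltas):
    """Return MAD / median for a list of time deltas (0.0 if the median is 0)."""
    values = np.asarray(deltas, dtype=float)
    median = np.median(values)
    if median == 0:
        return 0.0
    mad = np.median(np.abs(values - median))
    return mad / median

steady = [60, 61, 59, 60, 62, 58, 60]        # tight spread around a 60 s sleep
noisy = [5, 300, 42, 900, 17, 61, 2400]      # no stable interval

print(normalized_mad(steady))  # about 0.017
print(normalized_mad(noisy))   # about 0.92
```

A small ratio suggests a stable interval with little jitter, while a ratio close to 1 suggests the deltas have no dominant period.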


## Data size scoring
AC&CD calculates an overall data size score based on the data size (sent and received bytes) distribution.

### Sent Bytes Distribution
AC&CD calculates the sent bytes distribution score by using 15th, 30th and 45th percentiles of the whole dataset as in time delta distribution scoring.

### Received Bytes Distribution
In real attacks, attackers send commands and tools to the beacon. Therefore, there has to be some communication with higher received bytes. AC&CD checks the maximum received bytes in the dataset and, if it is less than 20,000 bytes, reduces the data size score.
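
The check can be sketched as a small helper. The function name `penalize_small_responses` is hypothetical; the 20,000-byte limit and the 0.3 reduction mirror the values used in the scoring cell of the notebook.

```python
def penalize_small_responses(score, received_bytes, min_max_bytes=20_000, penalty=0.3):
    """Reduce the data size score when no response ever reached min_max_bytes."""
    if max(received_bytes) < min_max_bytes:
        return score - penalty
    return score

# Every response is tiny, so the score is reduced.
print(penalize_small_responses(1.0, [257, 257, 300]))
# One large response (for example a delivered tool) keeps the score intact.
print(penalize_small_responses(1.0, [257, 257, 386058]))
```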

## Skewness
Jitter usage in beaconing causes random distribution of the sleep values and does not guarantee uniform distribution. Therefore, AC&CD doesn't use skewness in the scoring.

---

# Implementing the algorithm using Jupyter notebook

## Steps:  
0. Prepare data
1. Group connections between the same hosts and aggregate timestamps into a list
2. Calculate connection count and remove short sessions
3. Sort the connection timestamps in ascending order
4. Calculate time deltas
5. Generate variables required for score calculation (time delta and data size)
6. Calculate the score (time delta, data size, overall score)

In [36]:
import pandas as pd
import numpy as np
import warnings
# Disable warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Preparing the Data

### Loading the data

In [37]:
http_df = pd.read_csv('./sample-data/sample_beacon_dataset.csv')
http_df.head()

Unnamed: 0,Timestamp,DestinationHostName,DestinationPort,DestinationIP,ReceivedBytes,SentBytes,Protocol,RequestMethod,SourceIP
0,"7/14/2023, 1:11:02.921 AM",www.amazon.com,,1.2.3.4.5,257,279,,GET,10.10.25.52
1,"7/14/2023, 1:10:10.909 AM",www.amazon.com,,1.2.3.4.5,257,279,,GET,10.10.25.52
2,"7/14/2023, 1:09:13.932 AM",www.amazon.com,,1.2.3.4.5,257,279,,GET,10.10.25.52
3,"7/14/2023, 1:08:06.920 AM",www.amazon.com,,1.2.3.4.5,257,279,,GET,10.10.25.52
4,"7/14/2023, 1:06:58.916 AM",www.amazon.com,,1.2.3.4.5,257,279,,GET,10.10.25.52


In [38]:
# assign field/column names to variables
f_timestamp = 'Timestamp'
f_src_ip = 'SourceIP'
f_dst_ip = 'DestinationIP'
f_dst_host = 'DestinationHostName'
f_dst_port = 'DestinationPort'
f_http_method = 'RequestMethod'
f_sent_bytes = 'SentBytes'
f_received_bytes = 'ReceivedBytes'

columns_to_filter = [f_timestamp, f_src_ip, f_dst_ip, f_dst_host, f_dst_port, f_http_method, f_sent_bytes, f_received_bytes]
columns_to_groupby = [f_src_ip, f_dst_host, f_http_method]
# columns to display after the analysis
columns_to_display = ['Score', 'tsScore', 'dsScore', 'conn_count', 'dsResponseMax', f_src_ip, f_dst_ip, f_dst_host, f_http_method, f_dst_port, f_sent_bytes, f_received_bytes, 'deltas']

### Filtering Required Columns
```
df.loc[first_row_index:last_row_index , ['column1', 'column3']]
```

In [39]:
# get all rows and only the required columns
http_df = http_df.loc[:,columns_to_filter]
http_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 876 entries, 0 to 875
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Timestamp            876 non-null    object 
 1   SourceIP             876 non-null    object 
 2   DestinationIP        876 non-null    object 
 3   DestinationHostName  876 non-null    object 
 4   DestinationPort      0 non-null      float64
 5   RequestMethod        876 non-null    object 
 6   SentBytes            876 non-null    int64  
 7   ReceivedBytes        876 non-null    int64  
dtypes: float64(1), int64(2), object(5)
memory usage: 54.9+ KB


### Fixing the Timestamp Column Type

In [40]:
# `unit` applies to numeric epoch timestamps and should match their resolution (try ns, ms, or s)
http_df[f_timestamp] = pd.to_datetime(http_df[f_timestamp], unit='ns')
http_df.head(2)

Unnamed: 0,Timestamp,SourceIP,DestinationIP,DestinationHostName,DestinationPort,RequestMethod,SentBytes,ReceivedBytes
0,2023-07-14 01:11:02.921,10.10.25.52,1.2.3.4.5,www.amazon.com,,GET,279,257
1,2023-07-14 01:10:10.909,10.10.25.52,1.2.3.4.5,www.amazon.com,,GET,279,257


## Analysing the Data
Now the data is ready for statistical analysis. 

### Grouping the Connections
We can group the connections between the same hosts and aggregate the timestamps into a list. This way, we can use **pd.Series().diff** to easily calculate time deltas.  
First, let's group the traffic and aggregate.

In [41]:
# If you have a large dataset, using groupby and aggregate(list) might be slow. Consider using dask
http_df = http_df.groupby(columns_to_groupby).agg(list)
http_df.head(5)

SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes
10.10.25.52,www.amazon.com,GET,"[2023-07-14 01:11:02.921000, 2023-07-14 01:10:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ..."
10.10.25.52,www.amazon.com,POST,"[2023-07-14 01:03:40.909000, 2023-07-14 01:02:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[595, 425214, 595, 5116, 595, 992910, 1870023,...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ..."


#### Resetting the index
We need to reset the index to use all the columns.

In [42]:
http_df.reset_index(inplace=True)
http_df.head(2)

Unnamed: 0,SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes
0,10.10.25.52,www.amazon.com,GET,"[2023-07-14 01:11:02.921000, 2023-07-14 01:10:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ..."
1,10.10.25.52,www.amazon.com,POST,"[2023-07-14 01:03:40.909000, 2023-07-14 01:02:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[595, 425214, 595, 5116, 595, 992910, 1870023,...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ..."


### Calculating Connection Count
The length of the timestamp list (the 'Timestamp' column) equals the connection count, since we grouped the traffic and aggregated the timestamps into a list.  

In [43]:
# create a new column 'conn_count': for each row, take the length of the list in the 'Timestamp' column and assign it to 'conn_count'
http_df['conn_count'] = http_df[f_timestamp].apply(lambda x: len(x))
http_df.head(2)

Unnamed: 0,SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes,conn_count
0,10.10.25.52,www.amazon.com,GET,"[2023-07-14 01:11:02.921000, 2023-07-14 01:10:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",848
1,10.10.25.52,www.amazon.com,POST,"[2023-07-14 01:03:40.909000, 2023-07-14 01:02:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[595, 425214, 595, 5116, 595, 992910, 1870023,...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",28


### Remove short sessions
Filter out traffic where the connection count is too small for a meaningful analysis (fewer than 24 connections here).

In [44]:
http_df = http_df.loc[http_df['conn_count'] >= 24]
http_df.shape

(2, 9)

#### Sorting the Timestamps
Sort the timestamp list of each row in ascending order, so that the differences between consecutive timestamps are meaningful.

In [45]:
http_df[f_timestamp] = http_df[f_timestamp].apply(lambda x: sorted(x))
http_df.head(2)

Unnamed: 0,SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes,conn_count
0,10.10.25.52,www.amazon.com,GET,"[2023-07-13 11:05:59.388000, 2023-07-13 11:06:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",848
1,10.10.25.52,www.amazon.com,POST,"[2023-07-13 11:10:33.076000, 2023-07-13 11:10:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[595, 425214, 595, 5116, 595, 992910, 1870023,...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",28


### Calculating Time Delta
Apply **pd.Series.diff()** to the sorted 'Timestamp' column and assign the resulting list to a new column.

In [46]:
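# Convert list into a Series object, get time delta, convert the result back into a list and assign it to the 'deltas' column
# Note: .dt.seconds returns only the seconds component of each delta (whole days are dropped); use .dt.total_seconds() if gaps can exceed 24 hours
http_df['deltas'] = http_df[f_timestamp].apply(lambda x: pd.Series(x).diff().dt.seconds.dropna().tolist())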
http_df.head(2)

Unnamed: 0,SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes,conn_count,deltas
0,10.10.25.52,www.amazon.com,GET,"[2023-07-13 11:05:59.388000, 2023-07-13 11:06:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",848,"[10.0, 155.0, 68.0, 39.0, 0.0, 8.0, 8.0, 54.0,..."
1,10.10.25.52,www.amazon.com,POST,"[2023-07-13 11:10:33.076000, 2023-07-13 11:10:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[595, 425214, 595, 5116, 595, 992910, 1870023,...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",28,"[0.0, 0.0, 0.0, 0.0, 4100.0, 123.0, 458.0, 60...."
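
The difference between `.dt.seconds` and `.dt.total_seconds()` matters once a gap between two connections exceeds a day. The short sketch below uses made-up timestamps to show both results side by side.

```python
import pandas as pd

# Made-up timestamps: the last connection comes 1 day and 1 hour after the previous one.
timestamps = pd.Series(pd.to_datetime([
    "2023-07-13 11:00:00.000",
    "2023-07-13 11:01:00.500",
    "2023-07-14 12:01:00.500",
]))

diffs = timestamps.diff().dropna()

# .dt.seconds keeps only the seconds component (0 to 86399) of each timedelta.
print(diffs.dt.seconds.tolist())          # [60, 3600]
# .dt.total_seconds() returns the full duration, including days and fractions.
print(diffs.dt.total_seconds().tolist())  # [60.5, 90000.0]
```

For traffic captured within a single day with short sleeps, both approaches give nearly the same deltas. The difference only shows up for long gaps and sub-second precision.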


### Generate variables
We need to generate the required variables to analyze time delta distribution, median absolute deviation, and connection count.

#### Variables for time delta dispersion

In [47]:
http_df['tsLow'] = http_df['deltas'].apply(lambda x: np.percentile(np.array(x), 15))
http_df['tsMid'] = http_df['deltas'].apply(lambda x: np.percentile(np.array(x), 30))
http_df['tsHigh'] = http_df['deltas'].apply(lambda x: np.percentile(np.array(x), 45))
http_df['tsMadm'] = http_df['deltas'].apply(lambda x: np.percentile(np.absolute(np.array(x) - np.median(np.array(x))), 30))

#### Variables for data size dispersion

In [48]:
http_df['dsLow'] = http_df[f_sent_bytes].apply(lambda x: np.percentile(np.array(x), 15))
http_df['dsMid'] = http_df[f_sent_bytes].apply(lambda x: np.percentile(np.array(x), 30))
http_df['dsHigh'] = http_df[f_sent_bytes].apply(lambda x: np.percentile(np.array(x), 45))
http_df['dsMadm'] = http_df[f_sent_bytes].apply(lambda x: np.percentile(np.absolute(np.array(x) - np.median(np.array(x))), 30))
# Get max of received bytes
http_df['dsResponseMax'] = http_df[f_received_bytes].apply(lambda x: max(x))

In [49]:
http_df.head(3)

Unnamed: 0,SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes,conn_count,deltas,tsLow,tsMid,tsHigh,tsMadm,dsLow,dsMid,dsHigh,dsMadm,dsResponseMax
0,10.10.25.52,www.amazon.com,GET,"[2023-07-13 11:05:59.388000, 2023-07-13 11:06:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",848,"[10.0, 155.0, 68.0, 39.0, 0.0, 8.0, 8.0, 54.0,...",18.0,52.0,56.0,4.0,279.0,279.0,279.0,0.0,386058
1,10.10.25.52,www.amazon.com,POST,"[2023-07-13 11:10:33.076000, 2023-07-13 11:10:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[595, 425214, 595, 5116, 595, 992910, 1870023,...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",28,"[0.0, 0.0, 0.0, 0.0, 4100.0, 123.0, 458.0, 60....",47.7,59.4,66.1,13.8,595.0,595.0,944.0,420.0,61478


### Interactive Beaconing Duration and Connection Count Validation
Calculate the duration of the assumed interactive beaconing phase.  
Based on tsHigh and the sum of the deltas up to the tsHigh value in the time delta list, there should be enough connections.  
A session with just 5 connections in 10 minutes shouldn't be considered beaconing.

In [52]:
def calc_duration(deltas: list) -> float:
    _tmp = np.percentile(np.array(deltas), 45)
    # If sleep is set to 0, deltas may show up as 0 seconds, but the actual time delta is around 0.2 seconds in that situation.
    # So we count each delta that is less than or equal to 0 as 0.2 seconds and sum all deltas up to tsHigh, which is the 45th percentile.
    return sum(d if d > 0 else 0.2 for d in deltas if d <= _tmp)

In [57]:
# interactivity_duration: seconds
# min_required_conn_count: minimum number of connections required to be considered as a valid beacon
http_df['interactivity_duration'] = http_df['deltas'].apply(lambda x: calc_duration(x))
http_df['min_required_conn_count'] = http_df['interactivity_duration'] / http_df['tsHigh']

In [59]:
# False positives can be removed by filtering out the rows whose 'conn_count' is less than 'min_required_conn_count'.
# Additionally, we can filter out the rows with 'interactivity_duration' less than a certain threshold.
http_df = http_df.loc[(http_df['min_required_conn_count'] <= http_df['conn_count']) & (http_df['interactivity_duration'] >= 3600)] # min interactivity duration: 1 hour
http_df.head()

Unnamed: 0,SourceIP,DestinationHostName,RequestMethod,Timestamp,DestinationIP,DestinationPort,SentBytes,ReceivedBytes,conn_count,deltas,...,tsMid,tsHigh,tsMadm,dsLow,dsMid,dsHigh,dsMadm,dsResponseMax,interactivity_duration,min_required_conn_count
0,10.10.25.52,www.amazon.com,GET,"[2023-07-13 11:05:59.388000, 2023-07-13 11:06:...","[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...","[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...",848,"[10.0, 155.0, 68.0, 39.0, 0.0, 8.0, 8.0, 54.0,...",...,52.0,56.0,4.0,279.0,279.0,279.0,0.0,386058,13475.2,240.628571
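
To make the validation concrete, here is a worked example on a small, made-up delta list. The `calc_duration` function is repeated so the block runs on its own. The list, the variable names, and the reuse of the 3600-second threshold are illustrative assumptions.

```python
import numpy as np

def calc_duration(deltas):
    """Sum the deltas at or below the 45th percentile, counting 0 s deltas as 0.2 s."""
    ts_high = np.percentile(np.array(deltas), 45)
    return sum(d if d > 0 else 0.2 for d in deltas if d <= ts_high)

# Made-up session: a short burst of ~10 s beacons plus two long sleeps.
deltas = [0, 0, 10, 12, 11, 9, 10, 3600, 7200, 11]
conn_count = len(deltas) + 1              # n deltas come from n + 1 connections

ts_high = np.percentile(np.array(deltas), 45)      # 10.05
duration = calc_duration(deltas)                   # 0.2 + 0.2 + 10 + 9 + 10 = 29.4
min_required_conn_count = duration / ts_high       # about 2.93

passes_count_check = min_required_conn_count <= conn_count
passes_duration_check = duration >= 3600           # 1 hour, as in the filter above
print(duration, min_required_conn_count, passes_count_check, passes_duration_check)
```

In this example the session passes the connection count check but is dropped by the duration check, because its interactive phase lasts only about 29 seconds.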


### Calculating the score

In [60]:
# Set jitter percentage threshold
jitter_threshold = 55 # => 55%

# Time delta score calculation
def ts_score(tsMadm, tsMid, jitter_threshold = 55):
    score = 0
    if tsMid > 0:
        if (tsMadm / tsMid) * 100 < jitter_threshold:
            score = 1
        else:
            score = 1 - ((tsMadm / tsMid) * 0.4)
    else:
        score = 1
    return score

# Data size score calculation
def ds_score(dsMadm, dsMid, dsResponseMax, jitter_threshold = 25):
    score = 0
    if dsMid > 0:
        if (dsMadm / dsMid) * 100 < jitter_threshold:
            score = 1
        else:
            score = 1 - ((dsMadm / dsMid) * 0.4)
    else:
        score = 1
    if dsResponseMax < 20000:
        score = score - 0.3
    return score

http_df['tsScore'] = http_df.apply(lambda x: ts_score(x['tsMadm'], x['tsMid']), axis=1)
http_df['dsScore'] = http_df.apply(lambda x: ds_score(x['dsMadm'], x['dsMid'], x['dsResponseMax']), axis=1)

# Final Score calculation
http_df['Score'] = (http_df['dsScore'] + http_df['tsScore']) / 2

# Sort Results
http_df.sort_values(by= 'Score', ascending=False, inplace=True, ignore_index=True)
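
As a worked example, the scoring functions can be applied by hand to the two rows shown in the variable table earlier (the GET and POST traffic). The functions below are condensed copies of the ones in the cell above so that the block is self-contained. The POST row is included only for illustration, since the notebook removes it in the interactivity duration filter before scoring.

```python
def ts_score(tsMadm, tsMid, jitter_threshold=55):
    """Time delta score: 1 when the normalized dispersion is below the threshold."""
    if tsMid <= 0:
        return 1
    ratio = tsMadm / tsMid
    return 1 if ratio * 100 < jitter_threshold else 1 - ratio * 0.4

def ds_score(dsMadm, dsMid, dsResponseMax, jitter_threshold=25):
    """Data size score, reduced by 0.3 when the largest response is under 20,000 bytes."""
    if dsMid <= 0:
        score = 1
    else:
        ratio = dsMadm / dsMid
        score = 1 if ratio * 100 < jitter_threshold else 1 - ratio * 0.4
    return score - 0.3 if dsResponseMax < 20000 else score

# GET row: tsMadm=4.0, tsMid=52.0, dsMadm=0.0, dsMid=279.0, dsResponseMax=386058
get_score = (ts_score(4.0, 52.0) + ds_score(0.0, 279.0, 386058)) / 2

# POST row: tsMadm=13.8, tsMid=59.4, dsMadm=420.0, dsMid=595.0, dsResponseMax=61478
post_score = (ts_score(13.8, 59.4) + ds_score(420.0, 595.0, 61478)) / 2

print(get_score)              # 1.0
print(round(post_score, 4))   # 0.8588
```

The GET traffic has low dispersion in both dimensions and scores 1.0. The POST traffic has widely varying sent bytes (420 / 595 is about 71%, above the 25% threshold), which lowers its data size score to about 0.72.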


## Display the traffic with high scores

In [61]:
http_df[columns_to_display].query("Score > 0.85")

Unnamed: 0,Score,tsScore,dsScore,conn_count,dsResponseMax,SourceIP,DestinationIP,DestinationHostName,RequestMethod,DestinationPort,SentBytes,ReceivedBytes,deltas
0,1.0,1,1,848,386058,10.10.25.52,"[1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1.2.3.4.5, 1...",www.amazon.com,GET,"[nan, nan, nan, nan, nan, nan, nan, nan, nan, ...","[279, 279, 279, 279, 279, 279, 279, 279, 279, ...","[257, 257, 257, 257, 257, 257, 257, 257, 257, ...","[10.0, 155.0, 68.0, 39.0, 0.0, 8.0, 8.0, 54.0,..."
