# HTTP C2 Beaconing Detection using Statistical Analysis

# Statistics 101
## Median
The median of a finite list of numbers is the "middle" number when those numbers are listed in order from smallest to greatest.  
For example, the median of the data (1, 1, 2, 2, 4, 6, 9) is 2.
## Median Absolute Deviation
The median absolute deviation (MAD) measures how widely the values are spread around their median: a small MAD means a narrow distribution, a large MAD a wide one.  
Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations about 2 are (1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1.  
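
The worked example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `median_abs_deviation` is made up for illustration.

```python
from statistics import median

def median_abs_deviation(values):
    """Median of the absolute deviations from the median (hypothetical helper)."""
    center = median(values)
    return median(abs(v - center) for v in values)

data = [1, 1, 2, 2, 4, 6, 9]
data_median = median(data)
data_mad = median_abs_deviation(data)
print(data_median, data_mad)  # 2 1
```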

## Skewness 
Skewness is a measure of the asymmetry of the distribution of a real-valued random variable about its mean (average).  

![skewness](./images/skewness.png)

# Regular user traffic vs. Beacon traffic

Sample beacon config:  

|Sleep|Jitter|Sleep interval|Cobalt Strike (CS) sleep interval|
|--|--|--|--|
|300 seconds| 20% | 240-360 seconds|240-300 seconds|

That is roughly 300 connections in 24 hours (86,400 s / 300 s = 288).  
Note: the sleep value is the time delta between consecutive connections.  
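
The intervals in the table follow directly from the sleep and jitter values. The sketch below reproduces them under one assumption read off the table: in the generic column jitter moves the sleep in both directions, while in the CS column it only shortens it. The function name `sleep_interval` and its parameters are hypothetical.

```python
def sleep_interval(sleep, jitter_pct, symmetric=True):
    """Return (shortest, longest) sleep in seconds for a given jitter percentage."""
    spread = sleep * jitter_pct / 100
    low = sleep - spread
    # assumption: symmetric jitter can lengthen the sleep, one-sided jitter cannot
    high = sleep + spread if symmetric else float(sleep)
    return low, high

generic_interval = sleep_interval(300, 20)               # (240.0, 360.0)
cs_interval = sleep_interval(300, 20, symmetric=False)   # (240.0, 300.0)

# connections per day if the beacon sleeps 300 seconds on average
connections_per_day = 24 * 60 * 60 / 300                 # 288.0
print(generic_interval, cs_interval, connections_per_day)
```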

### Users randomly visit/refresh the same web page  

---

### Beacon traffic: 300 time deltas, each between 240-360 seconds


### User traffic: 300 time deltas, each between 60-600 seconds (assumption)

![Normal vs Beacon](./images/Normal-vs-Beacon.png)

- Median of time delta of beacon traffic: 319 seconds
- Median of time delta of user traffic: 236 seconds
- MAD of time delta of beacon traffic: 21 seconds
- MAD of time delta of user traffic: 95 seconds  

MAD = Median absolute deviation
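
A comparison like this can be simulated with the standard library. The sketch below draws 300 time deltas for each kind of traffic from a uniform range, which is an assumption made for simplicity, so the numbers will not match the figures above; the point is only that the narrower range produces a much smaller MAD.

```python
import random
from statistics import median

def mad(values):
    """Median absolute deviation about the median."""
    center = median(values)
    return median(abs(v - center) for v in values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# assumption: both traffic types are drawn uniformly from their ranges
beacon_deltas = [rng.uniform(240, 360) for _ in range(300)]
user_deltas = [rng.uniform(60, 600) for _ in range(300)]

beacon_mad = mad(beacon_deltas)
user_mad = mad(user_deltas)
print(f"beacon: median={median(beacon_deltas):.0f} MAD={beacon_mad:.0f}")
print(f"user:   median={median(user_deltas):.0f} MAD={user_mad:.0f}")
```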

## What does it mean?

Simply put:
- Beacon traffic ==> **uniform** distribution and **small** Median Absolute Deviation
- User traffic ==> **skewed** distribution and **large** Median Absolute Deviation

## How can we detect beaconing traffic using this method?

# RITA to the rescue!

## RITA (Real Intelligence Threat Analytics)

![RITA Logo](./images/rita-logo.png)


Sponsored by [Active Countermeasures](https://activecountermeasures.com/).


RITA is an open source framework for network traffic analysis.

The framework ingests [Zeek Logs](https://www.zeek.org/) in TSV format, and currently supports the following major features:
 - **Beaconing Detection**: Search for signs of beaconing behavior in and out of your network
 - **DNS Tunneling Detection**: Search for signs of DNS-based covert channels
 - **Blacklist Checking**: Query blacklists to search for suspicious domains and hosts

## How RITA works

RITA analyzes the distribution of time deltas and the distribution of data sizes (bytes sent) for traffic between the same source and destination, and calculates an overall score for each. For now, we'll focus only on the time delta.

RITA calculates the overall time delta score from three sub-scores.

## 1. Time delta distribution
Perfect beacons should have a symmetric time delta distribution. The Bowley skewness measure is used to check symmetry.
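
Bowley skewness compares how far two outer percentiles sit from the median. The sketch below is a minimal version that assumes the 20th and 80th percentiles, matching the notebook cells in this document; the function name `bowley_skewness` is hypothetical and this is not RITA's own code.

```python
import numpy as np

def bowley_skewness(deltas, low_pct=20, high_pct=80):
    """(low + high - 2*mid) / (high - low); 0.0 when it cannot be computed."""
    low, mid, high = np.percentile(deltas, [low_pct, 50, high_pct])
    denominator = high - low
    # same guard as the notebook: degenerate distributions count as symmetric
    if denominator == 0 or mid == low or mid == high:
        return 0.0
    return float((low + high - 2 * mid) / denominator)

symmetric_skew = bowley_skewness([290, 295, 300, 305, 310])
right_skew = bowley_skewness([1, 1, 1, 2, 2, 3, 50, 400])
print(symmetric_skew, right_skew)
```

A value near 0 indicates a symmetric distribution; values toward -1 or 1 indicate that the deltas pile up on one side of the median.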

## 2. Median Absolute Deviation of time delta
Perfect beacons should have very low dispersion around the median of their delta times. Median Absolute Deviation is used to check dispersion.
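
As a sketch of how dispersion can be turned into a score between 0 and 1, the function below scales the MAD against a 30-second ceiling, the constant used in the scoring cell of this notebook. The name `madm_score` and the parameter `max_mad` are hypothetical.

```python
from statistics import median

def madm_score(deltas, max_mad=30.0):
    """1.0 for no dispersion, falling linearly to 0.0 at max_mad seconds or more."""
    center = median(deltas)
    mad = median(abs(d - center) for d in deltas)
    return max(0.0, 1.0 - mad / max_mad)

tight_score = madm_score([60, 60, 61, 60, 59])    # MAD is 0
loose_score = madm_score([2, 831, 880, 8, 862])   # MAD is 49
print(tight_score, loose_score)  # 1.0 0.0
```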

## 3. Connection Count
The more connections are observed relative to the time span between the first and last connection, the more likely the traffic is a beacon; sparse connections lower the score.
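
A minimal sketch of the connection count score and of how the three sub-scores can be averaged, assuming the same constants as the notebook code in this document (a divisor of 90 and a cap at 1.0). The function names are hypothetical, and returning 1.0 for a zero-length time span is an assumption that mirrors how the capped division behaves in pandas.

```python
def conn_count_score(conn_count, duration_seconds, divisor=90.0):
    """Connections seen divided by duration / divisor, capped at 1.0."""
    expected = duration_seconds / divisor
    if expected == 0:
        return 1.0  # assumption: treat a zero-length time span as the maximum score
    return min(1.0, conn_count / expected)

def overall_score(skew_score, madm_score, conn_score):
    """Plain average of the three sub-scores."""
    return (skew_score + madm_score + conn_score) / 3.0

# 300 connections spread over 24 hours
conn_score = conn_count_score(300, 24 * 60 * 60)
total = overall_score(1.0, 1.0, conn_score)
print(conn_score, round(total, 4))  # 0.3125 0.7708
```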

# Implementing the algorithm using Jupyter notebook

We will:  
- Implement RITA beacon analyzer in Jupyter Notebook
  - Use dataset shared by Ali Alwashali ([@ali_alwashali](https://twitter.com/ali_alwashali))
    - Zeek logs from "malware-traffic-analysis.net" PCAP files, from 2013 to 2021
    - Suricata alerts triggered by the PCAP analysis
    
RITA beacon analyzer code: [https://github.com/activecm/rita/blob/master/pkg/beacon/analyzer.go](https://github.com/activecm/rita/blob/master/pkg/beacon/analyzer.go)

## Steps:  
0. Prepare data
1. Group connections between the same hosts and aggregate timestamps into a list
2. Calculate connection count and remove short sessions
3. Sort the connection timestamps in ascending order
4. Calculate time deltas
5. Generate variables required for score calculation
6. Calculate the score
7. Validation with Suricata alerts

In [1]:
import math
import pandas as pd
import numpy as np
import warnings
# Disable warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

## Preparing the Data

### Loading the data

In [2]:
# assign field/column names to variables
f_timestamp = 'ts'
f_src_ip = 'id.orig_h'
f_dst_ip = 'id.resp_h'
f_dst_host = 'host'
f_dst_port = 'id.resp_p'
f_http_method = 'method'
f_delimiter = '\t'

columns_to_filter = [f_timestamp, f_src_ip, f_dst_ip, f_dst_host, f_dst_port, f_http_method]
columns_to_groupby = [f_src_ip, f_dst_ip, f_dst_host, f_dst_port, f_http_method]
# columns to display after the analysis
columns_to_display = ['tsScore','conn_count',f_src_ip,f_dst_ip,f_dst_host,f_http_method, f_dst_port,'deltas']

http_df = pd.read_csv('./sample-data/http-dataset.log', sep=f_delimiter)
http_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108967 entries, 0 to 108966
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ts                 108967 non-null  float64
 1   uid                108967 non-null  object 
 2   id.orig_h          108967 non-null  object 
 3   id.orig_p          108967 non-null  int64  
 4   id.resp_h          108967 non-null  object 
 5   id.resp_p          108967 non-null  int64  
 6   trans_depth        108967 non-null  int64  
 7   method             108967 non-null  object 
 8   host               108967 non-null  object 
 9   uri                108967 non-null  object 
 10  referrer           108967 non-null  object 
 11  version            108967 non-null  object 
 12  user_agent         108967 non-null  object 
 13  origin             108949 non-null  object 
 14  request_body_len   108967 non-null  int64  
 15  response_body_len  108967 non-null  int64  
 16  st

### Filtering Required Columns
```
df.loc[first_row_index:last_row_index , ['column1', 'column3']]
```
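
A small illustration of this selection pattern on a made-up DataFrame (the column names and values are placeholders). Note that `.loc` slices by label and includes both ends of the row range.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c'],
    'column3': [0.1, 0.2, 0.3],
})

# rows labelled 0 through 1 (inclusive), two of the three columns
first_rows = toy_df.loc[0:1, ['column1', 'column3']]

# ':' keeps every row, which is the form used in the next cell
all_rows = toy_df.loc[:, ['column1', 'column3']]
print(first_rows.shape, all_rows.shape)  # (2, 2) (3, 2)
```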

In [3]:
# get all rows and only the required columns
http_df = http_df.loc[:,columns_to_filter]
http_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108967 entries, 0 to 108966
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   ts         108967 non-null  float64
 1   id.orig_h  108967 non-null  object 
 2   id.resp_h  108967 non-null  object 
 3   host       108967 non-null  object 
 4   id.resp_p  108967 non-null  int64  
 5   method     108967 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 5.0+ MB


### Fixing the Timestamp Column Type

In [4]:
http_df[f_timestamp] = pd.to_datetime(http_df[f_timestamp], unit='s')
http_df.head(2)

| | ts | id.orig_h | id.resp_h | host | id.resp_p | method |
|--|--|--|--|--|--|--|
| 0 | 2013-06-19 00:25:23.332814848 | 192.168.122.178 | 173.247.253.210 | www.insightcrime.org | 80 | GET |
| 1 | 2013-06-19 00:25:23.961981184 | 192.168.122.178 | 93.171.172.220 | 93.171.172.220 | 80 | GET |
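
Zeek writes `ts` as seconds since the Unix epoch, which is why `unit='s'` is passed. A small standalone sketch with made-up epoch values:

```python
import pandas as pd

# made-up epoch timestamps in seconds, 60.5 seconds apart
epochs = pd.Series([1371601523.0, 1371601583.5])
converted = pd.to_datetime(epochs, unit='s')

print(converted.iloc[0])                        # 2013-06-19 00:25:23
print(converted.iloc[1] - converted.iloc[0])    # 0 days 00:01:00.500000
```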


## Analysing the Data
Now the data is ready for statistical analysis. 

### Grouping the Connections
We can group the connections between the same hosts and aggregate the timestamps into a list. This way, we can use **pd.Series().diff** to easily calculate time deltas.  
First, let's group the traffic and aggregate.

In [5]:
# If you have a large dataset, using groupby and aggregate(list) might be slow. Consider using dask
http_df = http_df.groupby(columns_to_groupby).agg(list)
http_df.head(5)

| id.orig_h | id.resp_h | host | id.resp_p | method | ts |
|--|--|--|--|--|--|
| 1.8.31.101 | 104.168.98.206 | 104.168.98.206 | 80 | GET | [2019-08-31 13:07:54.941824, 2019-08-31 13:08:... |
| 1.8.31.101 | 13.107.4.50 | www.download.windowsupdate.com | 80 | GET | [2019-08-31 13:00:38.149271040] |
| 1.8.31.101 | 147.135.15.186 | ip-api.com | 80 | POST | [2019-08-31 12:44:03.817665024] |
| 1.8.31.101 | 170.238.117.187 | 170.238.117.187 | 8082 | POST | [2019-08-31 13:07:53.975055872, 2019-08-31 13:... |
| 1.8.31.101 | 172.217.12.228 | www.google.com | 80 | GET | [2019-08-31 12:44:28.299268864, 2019-08-31 12:... |
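
The same grouping pattern on a tiny made-up DataFrame shows what `agg(list)` produces. The column names `src`, `dst` and the values are placeholders, not fields from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'src': ['10.0.0.1', '10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'dst': ['1.2.3.4', '1.2.3.4', '1.2.3.4', '5.6.7.8'],
    'ts': [30, 10, 20, 40],
})

# one row per (src, dst) pair; every other column becomes a list of that pair's values
grouped = toy.groupby(['src', 'dst']).agg(list)
print(grouped['ts'].tolist())  # [[30, 10], [40], [20]]
```

The lists keep the original row order rather than being sorted, which is why the timestamps are sorted explicitly in a later step.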


#### Resetting the Index
After grouping, the group-by columns form the index. We need to reset the index so they become regular columns again.

In [6]:
http_df.reset_index(inplace=True)
http_df.head(2)

| | id.orig_h | id.resp_h | host | id.resp_p | method | ts |
|--|--|--|--|--|--|--|
| 0 | 1.8.31.101 | 104.168.98.206 | 104.168.98.206 | 80 | GET | [2019-08-31 13:07:54.941824, 2019-08-31 13:08:... |
| 1 | 1.8.31.101 | 13.107.4.50 | www.download.windowsupdate.com | 80 | GET | [2019-08-31 13:00:38.149271040] |


### Calculating Connection Count
Length of the timestamp list ('ts') = connection count (since we grouped the traffic and aggregated the timestamp into a list)  

In [7]:
# create a new column 'conn_count' holding the length of each row's 'ts' list
http_df['conn_count'] = http_df[f_timestamp].apply(len)
http_df.head(2)

| | id.orig_h | id.resp_h | host | id.resp_p | method | ts | conn_count |
|--|--|--|--|--|--|--|--|
| 0 | 1.8.31.101 | 104.168.98.206 | 104.168.98.206 | 80 | GET | [2019-08-31 13:07:54.941824, 2019-08-31 13:08:... | 3 |
| 1 | 1.8.31.101 | 13.107.4.50 | www.download.windowsupdate.com | 80 | GET | [2019-08-31 13:00:38.149271040] | 1 |


### Remove short sessions
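Filter out host pairs with 20 or fewer connections; too few time deltas make the statistics unreliable.

In [8]:
# .copy() avoids SettingWithCopyWarning when new columns are added to the filtered frame
http_df = http_df.loc[http_df['conn_count'] > 20].copy()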
http_df.shape

(433, 7)

#### Sorting the Timestamps
Sort each row's timestamp list in ascending order so that the differences between consecutive timestamps are meaningful.

In [9]:
http_df[f_timestamp] = http_df[f_timestamp].apply(sorted)
http_df.head(2)

| | id.orig_h | id.resp_h | host | id.resp_p | method | ts | conn_count |
|--|--|--|--|--|--|--|--|
| 41 | 10.0.0.102 | 46.29.183.211 | 46.29.183.211 | 8080 | POST | [2019-10-25 16:30:30.652640, 2019-10-25 16:30:... | 35 |
| 49 | 10.0.0.134 | 5.252.177.17 | 5.252.177.17 | 80 | GET | [2021-06-15 15:08:54.879534848, 2021-06-15 15:... | 25 |


### Calculating Time Delta
Apply **pd.Series.diff()** to the 'ts' column, which is sorted, and assign the resulting list into a new column

In [10]:
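# Convert list into a Series object, get time delta, convert the result back into a list and assign it to the 'deltas' column
# Note: .dt.seconds keeps only the whole-second component and ignores full days; use .dt.total_seconds() if gaps can exceed 24 hours
http_df['deltas'] = http_df[f_timestamp].apply(lambda x: pd.Series(x).diff().dt.seconds.dropna().tolist())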
http_df.head(2)

| | id.orig_h | id.resp_h | host | id.resp_p | method | ts | conn_count | deltas |
|--|--|--|--|--|--|--|--|--|
| 41 | 10.0.0.102 | 46.29.183.211 | 46.29.183.211 | 8080 | POST | [2019-10-25 16:30:30.652640, 2019-10-25 16:30:... | 35 | [2.0, 831.0, 880.0, 8.0, 862.0, 3.0, 120.0, 80... |
| 49 | 10.0.0.134 | 5.252.177.17 | 5.252.177.17 | 80 | GET | [2021-06-15 15:08:54.879534848, 2021-06-15 15:... | 25 | [60.0, 60.0, 61.0, 60.0, 60.0, 60.0, 60.0, 60.... |
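
The difference between the seconds component and the full duration of a time delta is easy to miss. The sketch below uses three made-up timestamps, the last one more than a day after the second, to show what each accessor returns.

```python
import pandas as pd

stamps = pd.Series([
    pd.Timestamp('2021-06-15 15:00:00'),
    pd.Timestamp('2021-06-15 15:01:00.750'),
    pd.Timestamp('2021-06-16 15:02:00.750'),  # one day and 60 seconds later
])

gaps = stamps.diff().dropna()  # the first difference is NaT and is dropped

component_seconds = gaps.dt.seconds.tolist()      # seconds part only
full_seconds = gaps.dt.total_seconds().tolist()   # whole duration in seconds
print(component_seconds)  # [60, 60]
print(full_seconds)       # [60.75, 86460.0]
```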


### Generate variables
We need to generate the required variables to analyze time delta distribution, median absolute deviation, and connection count.

#### Variables for time delta dispersion

In [11]:
# 20th, 50th (median) and 80th percentiles of the time deltas
http_df['tsLow'] = http_df['deltas'].apply(lambda x: np.percentile(np.array(x), 20))
http_df['tsMid'] = http_df['deltas'].apply(lambda x: np.percentile(np.array(x), 50))
http_df['tsHigh'] = http_df['deltas'].apply(lambda x: np.percentile(np.array(x), 80))
# Bowley skewness = (low + high - 2*mid) / (high - low)
http_df['tsBowleyNum'] = http_df['tsLow'] + http_df['tsHigh'] - 2*http_df['tsMid']
http_df['tsBowleyDen'] = http_df['tsHigh'] - http_df['tsLow']
# skewness is set to 0 when the denominator is 0 or the median equals either outer percentile
http_df['tsSkew'] = http_df[['tsLow','tsMid','tsHigh','tsBowleyNum','tsBowleyDen']].apply(
    lambda x: x['tsBowleyNum'] / x['tsBowleyDen'] if x['tsBowleyDen'] != 0 and x['tsMid'] != x['tsLow'] and x['tsMid'] != x['tsHigh'] else 0.0, axis=1
    )
# median absolute deviation about the median of the time deltas
http_df['tsMadm'] = http_df['deltas'].apply(lambda x: np.median(np.absolute(np.array(x) - np.median(np.array(x)))))
# time between the first and last connection, divided by 90 (note: .seconds ignores whole days)
http_df['tsConnDiv'] = http_df[f_timestamp].apply(lambda x: (x[-1].to_pydatetime() - x[0].to_pydatetime()).seconds / 90)

In [12]:
http_df.head(3)

| | id.orig_h | id.resp_h | host | id.resp_p | method | ts | conn_count | deltas | tsLow | tsMid | tsHigh | tsBowleyNum | tsBowleyDen | tsSkew | tsMadm | tsConnDiv |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 41 | 10.0.0.102 | 46.29.183.211 | 46.29.183.211 | 8080 | POST | [2019-10-25 16:30:30.652640, 2019-10-25 16:30:... | 35 | [2.0, 831.0, 880.0, 8.0, 862.0, 3.0, 120.0, 80... | 2.6 | 824.5 | 880.0 | -766.4 | 877.4 | -0.87349 | 74.5 | 1758.0 |
| 49 | 10.0.0.134 | 5.252.177.17 | 5.252.177.17 | 80 | GET | [2021-06-15 15:08:54.879534848, 2021-06-15 15:... | 25 | [60.0, 60.0, 61.0, 60.0, 60.0, 60.0, 60.0, 60.... | 60.0 | 60.0 | 60.0 | 0.0 | 0.0 | 0.0 | 0.0 | 145.8 |
| 50 | 10.0.0.134 | 5.252.177.17 | 5.252.177.17 | 443 | GET | [2021-06-15 15:08:55.172674816, 2021-06-15 15:... | 1689 | [60.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 150.1 |


### Calculating the score

In [13]:
# skew score: 1.0 for a perfectly symmetric distribution
http_df['tsSkewScore'] = 1.0 - abs(http_df['tsSkew'])
# MADM score: 1.0 for no dispersion, floored at 0 for a MADM of 30 seconds or more
http_df['tsMadmScore'] = 1.0 - http_df['tsMadm']/30.0
http_df['tsMadmScore'] = http_df['tsMadmScore'].apply(lambda x: 0 if x < 0 else x)
# connection count score, capped at 1.0
http_df['tsConnCountScore'] = (http_df['conn_count']) / http_df['tsConnDiv']
http_df['tsConnCountScore'] = http_df['tsConnCountScore'].apply(lambda x: 1.0 if x > 1.0 else x)
# overall score: mean of the three sub-scores
http_df['tsScore'] = (http_df['tsSkewScore'] + http_df['tsMadmScore'] + http_df['tsConnCountScore']) / 3.0
http_df.sort_values(by= 'tsScore', ascending=False, inplace=True, ignore_index=True)
http_df[columns_to_display].head(30)

| | tsScore | conn_count | id.orig_h | id.resp_h | host | method | id.resp_p | deltas |
|--|--|--|--|--|--|--|--|--|
| 0 | 1.0 | 21 | 192.168.204.145 | 69.174.53.234 | www.techo-bloc.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | 1.0 | 81 | 192.168.204.148 | 188.95.248.45 | www.divxatope.com | GET | 80 | [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | 1.0 | 21 | 10.6.25.102 | 62.76.188.61 | cerberhhyed5frqa.xmfir0.top | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | 1.0 | 21 | 10.6.26.101 | 103.208.86.43 | cerberhhyed5frqa.raress.win | GET | 80 | [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | 1.0 | 158 | 192.168.1.43 | 192.157.76.194 | www.knowyourteeth.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 5 | 1.0 | 106 | 10.6.28.101 | 50.233.80.221 | www.thetechhaus.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 6 | 1.0 | 26 | 192.168.1.138 | 188.116.34.246 | www1.v5ipk3gc8hug1du9459.4pu.com | GET | 80 | [0.0, 1.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 7 | 1.0 | 24 | 192.168.204.151 | 67.215.234.26 | womenshealthhelp.net | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ... |
| 8 | 1.0 | 36 | 192.168.204.151 | 64.12.245.3 | l.5min.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, ... |
| 9 | 1.0 | 26 | 192.168.204.150 | 184.72.236.84 | www.programmersheaven.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, ... |
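
To see how the sub-scores combine for a low-scoring pair, the row with index 41 shown before sorting can be worked through by hand. The sketch below plugs in the displayed (rounded) values of that row, so the result is approximate.

```python
# displayed values for the row with index 41 (10.0.0.102 -> 46.29.183.211)
ts_low, ts_mid, ts_high = 2.6, 824.5, 880.0
ts_madm = 74.5
conn_count = 35
ts_conn_div = 1758.0

# Bowley skewness from the three percentiles
ts_skew = (ts_low + ts_high - 2 * ts_mid) / (ts_high - ts_low)

skew_score = 1.0 - abs(ts_skew)                    # heavily skewed, so a low score
madm_score = max(0.0, 1.0 - ts_madm / 30.0)        # MADM above 30 s, so 0
conn_score = min(1.0, conn_count / ts_conn_div)    # few connections for the time span

ts_score = (skew_score + madm_score + conn_score) / 3.0
print(round(ts_skew, 5), round(ts_score, 3))  # -0.87349 0.049
```

A score this far below the 0.80 cut-off used in the next cell means the pair does not show up among the beacon candidates.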


## Display the traffic with high scores

In [18]:
http_df.loc[http_df['tsScore'] > 0.80, columns_to_display]

| | tsScore | conn_count | id.orig_h | id.resp_h | host | method | id.resp_p | deltas |
|--|--|--|--|--|--|--|--|--|
| 0 | 1.000000 | 21 | 192.168.204.145 | 69.174.53.234 | www.techo-bloc.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 1 | 1.000000 | 81 | 192.168.204.148 | 188.95.248.45 | www.divxatope.com | GET | 80 | [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 2 | 1.000000 | 21 | 10.6.25.102 | 62.76.188.61 | cerberhhyed5frqa.xmfir0.top | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 3 | 1.000000 | 21 | 10.6.26.101 | 103.208.86.43 | cerberhhyed5frqa.raress.win | GET | 80 | [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 4 | 1.000000 | 158 | 192.168.1.43 | 192.157.76.194 | www.knowyourteeth.com | GET | 80 | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 269 | 0.830368 | 69 | 10.6.10.102 | 8.208.101.157 | addlock.mitial.at | GET | 80 | [22.0, 21.0, 21.0, 21.0, 21.0, 21.0, 20.0, 20.... |
| 270 | 0.822222 | 52 | 192.168.204.149 | 178.74.214.194 | oak-tureght.ru | GET | 80 | [2.0, 1.0, 2.0, 6.0, 0.0, 1.0, 2.0, 4.0, 2.0, ... |
| 271 | 0.822222 | 34 | 10.4.18.106 | 212.124.117.180 | n43adshostnet.com | GET | 80 | [1.0, 6.0, 0.0, 6.0, 0.0, 0.0, 21.0, 0.0, 4.0,... |
| 272 | 0.816667 | 27 | 10.4.19.101 | 208.66.66.17 | uat-net.technoratimedia.com | GET | 80 | [7.0, 1.0, 0.0, 8.0, 0.0, 5.0, 6.0, 0.0, 3.0, ... |


# Validation with Suricata Alerts

In [20]:
suricata_df = pd.read_csv('./sample-data/suricata alerts.csv', sep=',')
suricata_df.loc[suricata_df['dest_ip'].isin(['107.181.187.14','5.199.162.3','23.74.28.9','35.198.166.240',
                                             '193.33.134.7','31.184.192.202','5.149.222.125','185.180.198.24',
                                             '173.254.231.111','139.60.161.74','192.254.79.71','185.68.93.18',
                                             '69.16.143.110','37.59.68.215','31.44.184.33','178.74.214.194']),
                                             ['src_ip','dest_ip','alert.signature']].drop_duplicates()

| | src_ip | dest_ip | alert.signature |
|--|--|--|--|
| 1392 | 10.17.6.93 | 139.60.161.74 | ET MALWARE Cobalt Strike Beacon Observed |
| 1772 | 10.17.6.93 | 139.60.161.74 | ET HUNTING GENERIC SUSPICIOUS POST to Dotted Q... |
| 8151 | 10.5.26.4 | 5.199.162.3 | ET MALWARE Cobalt Strike Malleable C2 Profile ... |
| 11403 | 10.2.2.101 | 192.254.79.71 | ET MALWARE Cobalt Strike Beacon Observed |
| 11433 | 10.2.2.101 | 192.254.79.71 | ET JA3 Hash - [Abuse.ch] Possible Dridex |
| 13811 | 10.2.2.101 | 192.254.79.71 | ET HUNTING GENERIC SUSPICIOUS POST to Dotted Q... |
| 20123 | 10.2.2.101 | 192.254.79.71 | ET ADWARE_PUP Fun Web Products Spyware User-Ag... |
| 62016 | 10.7.25.101 | 31.44.184.33 | ET MALWARE Cobalt Strike Beacon Observed |
| 72576 | 10.7.25.101 | 31.44.184.33 | ET HUNTING GENERIC SUSPICIOUS POST to Dotted Q... |
| 73327 | 10.7.22.101 | 31.44.184.33 | ET HUNTING GENERIC SUSPICIOUS POST to Dotted Q... |
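
Instead of pasting destination IPs by hand, the scored pairs could be joined against the alerts directly. The sketch below shows one way to do that with `DataFrame.merge` on two tiny made-up frames; the IP addresses, scores, and signature text are placeholders, and only the column names are taken from the cells above.

```python
import pandas as pd

# made-up beacon candidates (same column names as the scored DataFrame)
candidates = pd.DataFrame({
    'id.orig_h': ['10.0.0.5', '10.0.0.6'],
    'id.resp_h': ['203.0.113.10', '198.51.100.7'],
    'tsScore': [0.95, 0.88],
})

# made-up alerts (same column names as the Suricata CSV)
alerts = pd.DataFrame({
    'src_ip': ['10.0.0.5', '10.0.0.5'],
    'dest_ip': ['203.0.113.10', '203.0.113.10'],
    'alert.signature': ['EXAMPLE signature A', 'EXAMPLE signature A'],
})

# left join keeps every candidate; '_merge' tells whether an alert matched
matched = candidates.merge(
    alerts.drop_duplicates(),
    left_on=['id.orig_h', 'id.resp_h'],
    right_on=['src_ip', 'dest_ip'],
    how='left',
    indicator=True,
)
print(matched[['id.orig_h', 'id.resp_h', 'tsScore', 'alert.signature', '_merge']])
```

Candidates marked `both` have at least one alert for the same source and destination, while `left_only` rows are high-scoring pairs with no matching alert and may deserve a manual look.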
