# Data Analysis for Network Security

This is the [Wildcard 400 of the 2019 Trendmicro CTF](https://ctf.trendmicro.com). It's a fun set of exercises!

## Introduction

You are a network security administrator for the medium-sized business XYZcorp.  You often use network flow data to uncover anomalous security events.  This challenge provides some sample aggregated data on flows, and uses answers from the anomalous events to construct the flag.

Knowledge of network security or protocols is not required.  This challenge requires data stacking, slicing, and/or anomaly detection.

### Data
  - timestamp,src,dst,port,bytes
  - Internal hosts have IPs beginning with 12-14
  - External IPs include everything else

## Preprocessing

In [1]:
import pandas as pd 
import numpy as np

In [2]:
df = pd.read_csv(
    '/kaggle/input/2019-trendmicro-ctf-wildcard-400/gowiththeflow_20190826.csv',
    header = 0, 
    names= ['ts', 'src', 'dst', 'port', 'bytes']
)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105747729 entries, 0 to 105747728
Data columns (total 5 columns):
ts       int64
src      object
dst      object
port     int64
bytes    int64
dtypes: int64(3), object(2)
memory usage: 3.9+ GB


Before starting the analysis, we pre-process the data, since the derived columns are used throughout the entire challenge. We make the following modifications:
* Implement a function that checks whether an address is internal,
* Apply the 'is_internal_host()' function to the 'src' and 'dst' columns,
* Convert the timestamp into a datetime.

In [3]:
def is_internal_host(host_ip):
    return host_ip.startswith(("12.", "13.", "14."))

df['internal_host_src'] = df['src'].map(is_internal_host)
df['internal_host_dst'] = df['dst'].map(is_internal_host)
df['date_time'] = pd.to_datetime(df['ts'], unit='ms')
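
The prefixes in `is_internal_host()` end with a dot for a reason: without it, an address such as 120.5.5.5 would be counted as internal. A minimal sketch of the same three steps on a toy frame (the addresses and timestamps below are invented for illustration) could look like this:

```python
import pandas as pd

# Toy flows; addresses and timestamps are invented for illustration.
toy = pd.DataFrame({
    "ts": [1566777600000, 1566781200000, 1566784800000],
    "src": ["12.30.1.1", "120.5.5.5", "14.2.3.4"],
    "dst": ["8.8.8.8", "13.1.1.1", "12.9.9.9"],
})

def is_internal_host(host_ip):
    # The trailing dot keeps "120.x.x.x" from matching the "12." prefix.
    return host_ip.startswith(("12.", "13.", "14."))

toy["internal_host_src"] = toy["src"].map(is_internal_host)
toy["internal_host_dst"] = toy["dst"].map(is_internal_host)
toy["date_time"] = pd.to_datetime(toy["ts"], unit="ms")

print(toy[["src", "internal_host_src", "dst", "internal_host_dst", "date_time"]])
```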

## Challenges

The data used here is highly synthetic, so it should be obvious when you get the _right_ answer. 

In [None]:
answers = []

### Question 1: Discover Data Exfiltration 1

*Our intellectual property is leaving the building in large chunks. A machine inside is being used to send out all of our widget designs. One host is sending out much more data from the enterprise than the others. What is its IP?*

To answer this question, we analyze 'src' against 'bytes': the answer is the internal machine that sends the most data to external machines, i.e., the one that transmits the most bytes.

In [None]:
bytes_internal_host_suspects = df[(df["internal_host_src"]==True) & 
                                  (df["internal_host_dst"]==False)]
data_suspects = {'src': bytes_internal_host_suspects['src'], 
                 'bytes': bytes_internal_host_suspects['bytes']} 

df_suspects = pd.DataFrame(data_suspects)  

In [None]:
sum_bytes_suspects = df_suspects.groupby('src').sum()
sum_bytes_suspects['bytes'].sort_values(ascending = False).head()
machine_result = sum_bytes_suspects['bytes'].idxmax()

In [None]:
answers.append(machine_result)
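
As a minimal sketch of the same idea on toy data (hosts and byte counts are invented), the filter and the aggregation can be written with boolean masks and a single `groupby`. Note that the large internal-to-internal transfer is ignored, because only outbound flows count:

```python
import pandas as pd

# Toy flows; hosts and byte counts are invented for illustration.
flows = pd.DataFrame({
    "src":   ["12.0.0.1", "12.0.0.1", "13.0.0.2", "13.0.0.2", "8.8.8.8"],
    "dst":   ["8.8.8.8",  "13.0.0.2", "1.1.1.1",  "9.9.9.9",  "12.0.0.1"],
    "bytes": [500,        90000,      40000,      30000,      99999],
})

internal = ("12.", "13.", "14.")
src_in = flows["src"].map(lambda ip: ip.startswith(internal))
dst_in = flows["dst"].map(lambda ip: ip.startswith(internal))

# Outbound traffic only: internal source, external destination.
outbound = flows[src_in & ~dst_in]
totals = outbound.groupby("src")["bytes"].sum().sort_values(ascending=False)
top_sender = totals.idxmax()
print(totals)
```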

### Question 2: Discover Data Exfiltration 2

*Another attacker has a job scheduled that exports the contents of our internal wiki. One host is sending out much more data during off hours from the enterprise than the others, different from the host in Question 1. What is its IP?*



Finding the peak hours by counting the outbound flows per hour of the day:

In [None]:
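# assign() returns a new frame, which avoids a SettingWithCopyWarning on the filtered slice
bytes_internal_host_suspects = bytes_internal_host_suspects.assign(hour=bytes_internal_host_suspects['date_time'].dt.hour)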
bytes_internal_host_suspects['hour'].value_counts()

Building a dataset with only the off hours, taken here to be hours 0 through 15:

In [None]:
bytes_internal_host_suspects = bytes_internal_host_suspects[(bytes_internal_host_suspects["hour"]>=0) & 
                                                            (bytes_internal_host_suspects["hour"]<=15)]

src_suspects = {'src': bytes_internal_host_suspects['src'], 
                'bytes': bytes_internal_host_suspects['bytes']} 

df_src_suspects = pd.DataFrame(src_suspects)  

Finding the 'src' host that sent the most data outside business hours, after excluding the host found in the previous question:

In [None]:
sum_bytes_src_suspects = df_src_suspects[df_src_suspects['src'] != '13.37.84.125']
sum_bytes_src_suspects = sum_bytes_src_suspects.groupby('src').sum()
sum_bytes_src_suspects['bytes'].idxmax()

In [None]:
answers.append(sum_bytes_src_suspects['bytes'].idxmax())
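
The effect of the off-hours filter is easy to see on a small example. The sketch below uses invented hosts and timestamps and assumes the same 0 to 15 window as above; the host that dominates overall is not the one that dominates at night:

```python
import pandas as pd

# Toy outbound flows; hosts, times and byte counts are invented.
flows = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2019-08-26 02:00", "2019-08-26 03:00",
        "2019-08-26 17:00", "2019-08-26 18:00", "2019-08-26 19:00",
    ]),
    "src":   ["12.0.0.7", "12.0.0.7", "13.0.0.9", "13.0.0.9", "12.0.0.7"],
    "bytes": [8000, 9000, 20000, 25000, 100],
})

flows["hour"] = flows["date_time"].dt.hour
# Assumed off-hours window for this toy data: hours 0 through 15 inclusive.
off_hours = flows[flows["hour"].between(0, 15)]

overall_sender = flows.groupby("src")["bytes"].sum().idxmax()
night_sender = off_hours.groupby("src")["bytes"].sum().idxmax()
print(overall_sender, night_sender)
```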

### Question 3: Discover Data Exfiltration 3

*Some assailant is grabbing all the employee and vendor email addresses, and sending them out on a channel normally reserved for other uses. This is similar to attackers abusing DNS for data exfiltration. One host is sending out much more data on some port from the enterprise than other hosts do, different from the hosts in Questions 1 and 2. What is its port?*


Filtering only internal hosts that have an external destination:

In [None]:
bytes_internal_host_suspects = df[(df["internal_host_src"]==True) & 
                                  (df["internal_host_dst"]==False)]

data_suspects_3 = {'src': bytes_internal_host_suspects['src'], 
                 'port': bytes_internal_host_suspects['port'], 
                 'bytes': bytes_internal_host_suspects['bytes']} 

data_suspects_3 = pd.DataFrame(data_suspects_3)

For each port, we measure how far its largest flow lies above the port's mean, in units of the port's standard deviation (a maximum z-score), so we can find out which port is behaving suspiciously:

In [None]:
ports = data_suspects_3['port'].unique()

res = []
for i in range(len(ports)):
    ports_bytes = data_suspects_3[data_suspects_3['port'] == ports[i]]
    ports_bytes = ports_bytes['bytes']
    if ports_bytes.std() == 0:
        res.append(0)
    else:
        res.append(max(ports_bytes - ports_bytes.mean())/ports_bytes.std())

data_suspects_3_1 = {'port': ports, 
                     'metric': res} 
data_suspects_3_1 = pd.DataFrame(data_suspects_3_1)

Thus, we return the port with the highest score:

In [None]:
suspect = data_suspects_3_1[data_suspects_3_1['metric'] == max(data_suspects_3_1['metric'])]
answers.append(pd.DataFrame(suspect['port'])['port'].iloc[0])
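
The per-port loop can also be expressed as one grouped computation, which tends to be faster on large frames. This is a sketch on invented data rather than a drop-in replacement; ports whose standard deviation is zero or undefined (a single flow) get a score of 0, as in the loop above:

```python
import pandas as pd

# Toy outbound flows; ports and byte counts are invented for illustration.
flows = pd.DataFrame({
    "port":  [53, 53, 53, 53, 53, 80, 80, 80, 80, 123],
    "bytes": [100, 100, 100, 100, 5000, 200, 220, 210, 190, 64],
})

per_port = flows.groupby("port")["bytes"]
# (max - mean) / std per port; NaN (zero or undefined std) becomes 0.
score = ((per_port.max() - per_port.mean()) / per_port.std()).fillna(0)
suspect_port = score.idxmax()
print(score.sort_values(ascending=False))
```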

### Question 4: Private C&C channel

*We're always running a low-grade infection; some internal machines will always have some sort of malware. Some of these infected hosts phone home to C&C on a private channel. What unique port is used by external malware C&C to marshal its bots?*

The first step is to filter all flows with an external 'src' host, so we can track which ports the external malware uses:

In [None]:
#external malware
malware_suspects = df[df["internal_host_src"] == False]

data_malware_suspects = {'src': malware_suspects['src'], 
                         'port': malware_suspects['port']} 

data_malware_suspects = pd.DataFrame(data_malware_suspects)

We then standardize how often each port appears in the traffic from external hosts (a z-score of the port frequencies). The solution will be the port with the lowest score, i.e., the one that is used least often:

In [None]:
data_malware_suspects.groupby(['src','port']).sum()

ports_frequency = data_malware_suspects['port'].value_counts()
ports_metric = ((ports_frequency - ports_frequency.mean())/ports_frequency.std())

data_port_suspects = {'port': data_malware_suspects['port'], 
                      'metric': ports_metric} 

data_port_suspects = pd.DataFrame(data_port_suspects)
answers.append(data_port_suspects['metric'].idxmin())
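
Because the z-score is a monotonic rescaling of the counts, the port with the lowest score is simply the least frequent port. A small sketch with invented port numbers illustrates this:

```python
import pandas as pd

# Toy ports seen in flows from external sources; values are invented.
ports = pd.Series([80] * 6 + [443] * 5 + [22] * 4 + [4444] * 1, name="port")

counts = ports.value_counts()
z = (counts - counts.mean()) / counts.std()

rare_port = z.idxmin()
print(z)
```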

### Question 5: Internal P2P

*Sometimes our low-grade infection is visible in other ways.  One particular virus has spread through a number of machines, which now are used to relay commands to each other.  The malware has created an internal P2P network.  What unique port is used by the largest internal clique, of all hosts talking to each other?*

If the malware created an internal P2P network, we know the communication went from an internal host to an internal one. Therefore, we filter the dataset following these settings:

In [None]:
# internal P2P network
malware_suspects = df[(df["internal_host_src"] == True) & 
                      (df["internal_host_dst"] == True)]

data_malware_suspects = {'src': malware_suspects['src'], 
                         'port': malware_suspects['port']} 

data_malware_suspects = pd.DataFrame(data_malware_suspects)

The answer will then be the port that appears most often in the communication between internal hosts, i.e., the one with the highest standardized frequency (z-score):

In [None]:
data_malware_suspects.groupby(['src','port']).sum()

ports_frequency = data_malware_suspects['port'].value_counts()
ports_metric = ((ports_frequency - ports_frequency.mean())/ports_frequency.std())

data_port_suspects = {'port': data_malware_suspects['port'], 
                      'metric': ports_metric} 

data_port_suspects = pd.DataFrame(data_port_suspects)

In [None]:
answers.append(data_port_suspects['metric'].idxmax())
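
The frequency heuristic above does not look at cliques directly. As an alternative sketch (not the method used in this notebook), one could build an undirected graph per port from hosts that talk to each other in both directions and search for the largest clique. The flows below are invented, and the basic Bron-Kerbosch search is exponential, so this is only practical on small or pre-filtered graphs:

```python
from collections import defaultdict

# Toy internal flows as (src, dst, port); all values are invented.
flows = [
    ("12.0.0.1", "12.0.0.2", 83), ("12.0.0.2", "12.0.0.1", 83),
    ("12.0.0.1", "12.0.0.3", 83), ("12.0.0.3", "12.0.0.1", 83),
    ("12.0.0.2", "12.0.0.3", 83), ("12.0.0.3", "12.0.0.2", 83),
    ("12.0.0.1", "12.0.0.2", 22), ("12.0.0.2", "12.0.0.1", 22),
    ("12.0.0.3", "12.0.0.1", 22),  # one-way only, so it adds no edge
]

def largest_clique(adj):
    """Basic Bron-Kerbosch search; exponential, so only for small graphs."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best

seen = set(flows)
graphs = defaultdict(lambda: defaultdict(set))
for src, dst, port in flows:
    # Keep an undirected edge only when both directions were observed.
    if (dst, src, port) in seen:
        graphs[port][src].add(dst)
        graphs[port][dst].add(src)

cliques = {port: largest_clique(adj) for port, adj in graphs.items()}
p2p_port = max(cliques, key=lambda port: len(cliques[port]))
print(p2p_port, sorted(cliques[p2p_port]))
```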

### Question 6: Malware Controller

*We were just blacklisted by an IP reputation service, because some host in our network is behaving badly.  One host is a bot herder receiving C&C callbacks from its botnet, which has little other reason to communicate with hosts in the enterprise.  What is its IP?*

The first step here is to find the host that communicates with an external C&C. So we filter all external -> internal communication:

In [6]:
#host in our network receiving C&C callbacks from its botnet
cc_suspects = df[(df["internal_host_src"] == False) & 
                  (df["internal_host_dst"] == True)]

data_cc_suspects = {'src': cc_suspects['src'],
                     'dst': cc_suspects['dst']} 

data_cc_suspects = pd.DataFrame(data_cc_suspects)

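According to the statement, the bot herder receives callbacks from hosts that have little other reason to communicate with the enterprise. We therefore look for the external 'src' hosts that contact the fewest distinct internal hosts, and then count which 'dst' they talk to.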

In [7]:
src_dst_frequency = data_cc_suspects.groupby(["src", "dst"]).size().reset_index(name = "communications")
src_frequency = src_dst_frequency.groupby("src").size().reset_index(name = "src_communications")
src_min_frequency = src_frequency[src_frequency['src_communications'] == min(src_frequency['src_communications'])]
cc_suspects = (src_min_frequency['src']).to_numpy()

In [8]:
bot_suspects = []
for i in range(len(cc_suspects)):
    possible_bot = src_dst_frequency[src_dst_frequency['src'] == cc_suspects[i]]
    bot_suspects.append((possible_bot['dst']).to_numpy())

# pd.value_counts() is deprecated in recent pandas; count on a flat Series instead
pd.Series(np.concatenate(bot_suspects)).value_counts()

14.45.67.46    172
dtype: int64

In [16]:
answers.append('14.45.67.46')
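
The same reasoning can be sketched without explicit loops by counting distinct destinations per source. The addresses below are invented; the sources that reach the fewest internal hosts are kept, and the destination they share is the candidate bot herder:

```python
import pandas as pd

# Toy external -> internal flows; all addresses are invented.
flows = pd.DataFrame({
    "src": ["5.5.5.5", "5.5.5.5", "5.5.5.5", "6.6.6.6",
            "6.6.6.6", "7.7.7.7", "8.8.4.4", "8.8.4.4"],
    "dst": ["12.0.0.1", "13.0.0.2", "14.0.0.3", "12.0.0.1",
            "13.0.0.2", "14.0.0.9", "14.0.0.9", "14.0.0.9"],
})

# Number of distinct internal hosts each external source talks to.
distinct_dsts = flows.groupby("src")["dst"].nunique()
narrow_srcs = distinct_dsts[distinct_dsts == distinct_dsts.min()].index

# Count each (src, dst) pair once, then see which dst the narrow sources share.
pairs = flows[flows["src"].isin(narrow_srcs)].drop_duplicates(["src", "dst"])
herder_counts = pairs["dst"].value_counts()
herder = herder_counts.idxmax()
print(herder_counts)
```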

### Question 7: Infected Host

*One host is part of the botnet from Question 6, what is its IP?*

To get the answer, we will analyze how this 'dst' communicates with the internal hosts:

In [18]:
#has little other reason to communicate with hosts in the enterprise
bot_suspects = df[(df["internal_host_src"] == True) & 
                  (df["internal_host_dst"] == True)]

data_bot_suspects = {'src': bot_suspects['src'],
                     'dst': bot_suspects['dst']} 

data_bot_suspects = pd.DataFrame(data_bot_suspects)

src_dst_frequency = data_bot_suspects.groupby(["src", "dst"]).size().reset_index(name = "communications")
dst_frequency = data_bot_suspects.groupby(["dst"]).size().reset_index(name = "src_communications")

possible_bot = src_dst_frequency[src_dst_frequency['dst'] == '14.45.67.46']
possible_bot

| | src | dst | communications |
|---|---|---|---|
| 117337 | 13.42.70.40 | 14.45.67.46 | 44 |
| 235758 | 14.51.84.50 | 14.45.67.46 | 3 |


As we can see, only two internal hosts communicate with it, and both do so at a low frequency. Therefore, we infer that the bot is the host that communicates least with the malicious host.

In [19]:
answers.append('14.51.84.50')

### Question 8: Botnet Inside
*There is a stealthier botnet in the network, using low frequency periodic callbacks to external C&C, with embedded higher frequency calls.  What port does it use?*

We know the botnet calls back to an external C&C, so we keep the flows whose destination is external:

In [20]:
botnet_suspects = df[df["internal_host_dst"] == False]

data_botnet_suspects = {'dst': botnet_suspects['dst'],
                         'port': botnet_suspects['port']} 
data_botnet_suspects = pd.DataFrame(data_botnet_suspects)

Since the callbacks are low frequency, we count the flows per port and pick the port with the fewest flows:

In [21]:
port_frequency = data_botnet_suspects.groupby("port").size().reset_index(name = "communications")
botnet = port_frequency['port'][port_frequency['communications'].idxmin()]
botnet

51

In [22]:
answers.append(botnet)
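
Counting flows per port captures "low frequency" but not "periodic". As a hedged alternative (not the approach used above), one could measure how regular the gaps between consecutive flows on a port are. The sketch below uses invented timestamps in seconds and a hypothetical helper `modal_share`, which returns the share of gaps equal to the most common gap:

```python
import pandas as pd

# Toy flows; timestamps (seconds) and ports are invented for illustration.
flows = pd.DataFrame({
    "ts":   [0, 3600, 7200, 10800, 5, 17, 460, 9000],
    "port": [9001, 9001, 9001, 9001, 443, 443, 443, 443],
})

flows = flows.sort_values(["port", "ts"])
flows["gap"] = flows.groupby("port")["ts"].diff()

def modal_share(gaps):
    # Hypothetical helper: fraction of gaps equal to the most common gap.
    gaps = gaps.dropna()
    if gaps.empty:
        return 0.0
    return gaps.value_counts().iloc[0] / len(gaps)

regularity = flows.groupby("port")["gap"].apply(modal_share)
periodic_port = regularity.idxmax()
print(regularity)
```

A share close to 1 means that almost every gap is identical, which is what a scheduled callback would produce; real traffic would likely need some tolerance around the modal gap.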

### Question 9: Lateral Brute

*Once a machine is popped, it's often used to explore what else can be reached.  One host is being used to loudly probe the entire enterprise, trying to find ways onto every other host in the enterprise.  What is its IP?*


If the host is probing the entire enterprise, it must have communicated with every other host. We therefore analyze the communications with an internal 'src' and an internal 'dst'.

In [8]:
probe_suspects = df[(df["internal_host_src"] == True) & 
                    (df["internal_host_dst"] == True)]

data_probe_suspects = {'src': probe_suspects['src'],
                     'dst': probe_suspects['dst'],
                     'port': probe_suspects['port']} 

data_probe_suspects = pd.DataFrame(data_probe_suspects)

Thus, among the internal communications carried out, we check which source host contacted the greatest number of distinct destination hosts.

In [9]:
src_dst_frequency = data_probe_suspects.groupby(["src", "dst"]).size().reset_index(name = "communications")
src_probe_frequency = src_dst_frequency.groupby("src").size().reset_index(name = "src_communications")

probe_suspect = src_probe_frequency.loc[src_probe_frequency['src_communications'].idxmax()]
probe_suspect

src                   13.42.70.40
src_communications            999
Name: 456, dtype: object

In [25]:
answers.append(probe_suspect['src'])
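
The two-step `groupby` above amounts to counting distinct destinations per source, which `nunique()` does directly. In this sketch on invented data, both hosts send the same number of flows, but only one of them fans out to many destinations:

```python
import pandas as pd

# Toy internal flows; all addresses are invented.
flows = pd.DataFrame({
    "src": ["12.0.0.1"] * 4 + ["13.0.0.2"] * 4,
    "dst": ["12.0.0.5", "12.0.0.5", "12.0.0.5", "12.0.0.6",
            "12.0.0.5", "12.0.0.6", "12.0.0.7", "12.0.0.8"],
})

# Distinct destinations per source, regardless of how many flows were sent.
fan_out = flows.groupby("src")["dst"].nunique()
prober = fan_out.idxmax()
print(fan_out)
```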

### Question 10: Lateral Spy

*One host is trying to find a way onto every other host more quietly.  What is its IP?*

Let's now analyze all the combinations of 'src', 'dst' and 'port' present in the internal traffic, leaving out the host found in Question 9:

In [4]:
spy_suspects = df[(df["internal_host_src"] == True) & 
                    (df["internal_host_dst"] == True)]
spy_suspects = spy_suspects[spy_suspects['src'] != '13.42.70.40']

data_spy_suspects = {'src': spy_suspects['src'],
                     'dst': spy_suspects['dst'],
                     'port': spy_suspects['port']} 

data_spy_suspects = pd.DataFrame(data_spy_suspects)
src_dst_port_frequency = data_spy_suspects.groupby(["src", "dst", "port"]).size().reset_index(name = "communications")

The spy will be the host that reaches the greatest number of distinct ('dst', 'port') combinations among the internal hosts:

In [5]:
src_frequency = src_dst_port_frequency.groupby(["src"]).size().reset_index(name = "src_communications")
# idxmax() returns an index label, so select the row with .loc rather than .iloc
spy = src_frequency.loc[src_frequency['src_communications'].idxmax()]

In [None]:
answers.append(spy['src'])
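
A compact sketch of this counting on invented data: repeated flows to the same ('dst', 'port') pair are collapsed first, so a host that retries one service many times does not outrank a host that quietly touches many different services. The `loud_host` value is a hypothetical stand-in for the host excluded above:

```python
import pandas as pd

# Toy internal flows; addresses and ports are invented.
flows = pd.DataFrame({
    "src":  ["12.0.0.1", "12.0.0.1", "12.0.0.1", "14.0.0.4",
             "14.0.0.4", "14.0.0.4", "13.0.0.2"],
    "dst":  ["12.0.0.5", "12.0.0.5", "12.0.0.5", "12.0.0.5",
             "12.0.0.6", "12.0.0.6", "12.0.0.5"],
    "port": [80, 80, 80, 22, 22, 3389, 80],
})
loud_host = "13.0.0.2"  # hypothetical stand-in for the Question 9 host

quiet = flows[flows["src"] != loud_host]
# Each (src, dst, port) combination counts once, however often it repeats.
combos = quiet.drop_duplicates(["src", "dst", "port"])
reach = combos.groupby("src").size()
spy_host = reach.idxmax()
print(reach)
```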

## Checking the answers

Use the following code to check if your answers are correct.

In [13]:
import hashlib
# Port answers are integers, so convert every answer to str before joining
answer_hash = hashlib.md5(':'.join(str(a) for a in answers).encode('utf-8')).hexdigest()
assert answer_hash == 'ec766132cac80b821793fb9e7fdfd763'
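
The list of answers mixes IP addresses (strings) with port numbers (integers), and `str.join` only accepts strings. A minimal sketch with made-up answers shows how the colon-separated string and its MD5 digest can be built safely:

```python
import hashlib

# Hypothetical answers, mixing strings and integers as the notebook does.
sample_answers = ["10.0.0.1", 31337, "10.0.0.2"]

# Convert every element to str before joining with ':'.
joined = ":".join(str(a) for a in sample_answers)
digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
print(joined)
print(digest)
```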