# Data Analysis for Network Security

This is the [Wildcard 400 of the 2019 Trendmicro CTF](https://ctf.trendmicro.com). It's a fun set of exercises!

## Introduction

You are a network security administrator for the medium-sized business XYZcorp.  You often use network flow data to uncover anomalous security events.  This challenge provides some sample aggregated data on flows, and uses answers from the anomalous events to construct the flag.

Knowledge of network security or protocols is not required.  This challenge requires data stacking, slicing, and/or anomaly detection.

### Data
  - Columns: timestamp,src,dst,port,bytes
  - Internal hosts have IPs whose first octet is 12, 13, or 14
  - External IPs include everything else
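As a quick illustration of the internal/external rule, here is a minimal sketch that parses the first octet instead of matching string prefixes. The rows and the helper name `is_internal` are hypothetical and not part of the challenge data.

```python
def is_internal(ip: str) -> bool:
    """True when the first octet is 12, 13, or 14 (the rule stated above)."""
    first_octet = int(ip.split(".")[0])
    return 12 <= first_octet <= 14

# Hypothetical rows in the timestamp,src,dst,port,bytes layout.
rows = [
    (1566800000000, "13.37.84.125", "203.0.113.7", 80, 5120),
    (1566800001000, "198.51.100.9", "12.55.77.96", 22, 640),
    (1566800002000, "14.45.67.46", "12.30.96.87", 443, 128),
]

# One (src_internal, dst_internal) pair per row.
directions = [(is_internal(src), is_internal(dst)) for _, src, dst, _, _ in rows]
print(directions)
```

Comparing the octet numerically means an address such as `120.1.2.3` is not mistaken for an internal one.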

## Preprocessing

In [None]:
import pandas as pd

import gc
import networkx
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from networkx.algorithms.approximation.clique import large_clique_size 

In [None]:
df = pd.read_csv(
    '/kaggle/input/2019-trendmicro-ctf-wildcard-400/gowiththeflow_20190826.csv',
    header = 0, 
    names= ['ts', 'src', 'dst', 'port', 'bytes']
)
df.info()

In [None]:
def isInternalHost(hst):
    """
    Return True if the host IP is internal (first octet 12, 13, or 14), else False.
    """
    return hst.startswith(('12.', '13.', '14.'))

In [None]:
# Flag whether the source and the destination of each flow are internal hosts
df['src_internal'] = df['src'].map(isInternalHost)
df['dst_internal'] = df['dst'].map(isInternalHost)

# Parse the timestamp (milliseconds since the epoch) and extract the hour and minute
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
df['hour'] = df.ts.dt.hour.astype('uint8')
df['minute'] = df.ts.dt.minute.astype('uint8')

detected_ips = []

gc.collect()
df.head()

## Challenges

The data used here is highly synthetic, so it should be obvious when you get the _right_ answer. 

In [None]:
answers = []

### Question 1: Discover Data Exfiltration 1

*Our intellectual property is leaving the building in large chunks. A machine inside is being used to send out all of our widget designs. One host is sending out much more data from the enterprise than the others. What is its IP?*

In [None]:
# Sum the bytes each internal host sends to external hosts, largest first
src_out = df[(df['src_internal']==1) & (df['dst_internal']==0)].groupby('src').bytes.sum().pipe(lambda x: x[x>0]).sort_values(ascending=False)
print(src_out.to_frame().head())

In [None]:
answers.append('13.37.84.125')
detected_ips.append('13.37.84.125')
del src_out
gc.collect()

### Question 2: Discover Data Exfiltration 2

*Another attacker has a job scheduled that exports the contents of our internal wiki. One host is sending out much more data from the enterprise during off hours than the others, and it is different from the host in Question 1. What is its IP?*



In [None]:
# Plot the number of flows per hour of day; the quiet hours are the off hours
plt.hist(df['hour'], bins=24, range=(0, 24))
plt.show()

In [None]:
# Sum the bytes internal hosts send out during off hours (hours 0 to 15, read off the histogram above)
off_hours = df[((df['src_internal']==1) & (df['dst_internal']==0)) & (df['hour']>=0) & (df['hour']<16)].groupby('src').bytes.sum().sort_values(ascending=False).where(lambda x: x>0)
off_hours.head()
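The off-hours window above is hard-coded after looking at the histogram. As a rough sketch of how the same cut could be derived from the counts, the example below flags quiet hours with a threshold. The hour values and the 10% rule of thumb are assumptions for illustration, not something the challenge specifies.

```python
from collections import Counter

# Hypothetical hour-of-day values, one per flow record.
hours = [9] * 50 + [10] * 60 + [11] * 55 + [2] * 3 + [3] * 4 + [23] * 2

per_hour = Counter(hours)
# Hours that never appear count as zero traffic.
counts = [per_hour.get(h, 0) for h in range(24)]

# Assumed rule of thumb: an hour is "off" when it carries
# under 10% of the busiest hour's flows.
threshold = 0.10 * max(counts)
quiet_hours = [h for h, n in enumerate(counts) if n < threshold]
business_hours = [h for h in range(24) if h not in quiet_hours]
print(business_hours)
```

On the real data the same idea would start from the values in `df['hour']`.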

In [None]:
answers.append('12.55.77.96')
detected_ips.append('12.55.77.96')
del off_hours
gc.collect()

### Question 3: Discover Data Exfiltration 3

*Some assailant is grabbing all the employee and vendor email addresses, and sending them out on a channel normally reserved for other uses. This is similar to attackers abusing DNS for data exfiltration. One host is sending out much more data on some port from the enterprise than other hosts do, and it is different from the hosts in Questions 1 and 2. What is its port?*


In [None]:
# Sum the bytes sent out per (source, port) pair, then plot the total per port
src_port = df[(df['src_internal']==1) & (df['dst_internal']==0)].groupby(['src','port']).bytes.sum().reset_index()
src_port.groupby('port').bytes.sum().sort_values(ascending=False).plot.bar(figsize=(20,5))

In [None]:
# For each port, compute the largest z-score among its sources' byte totals
src_port.groupby('port').apply(lambda x: np.max((x.bytes-x.bytes.mean())/x.bytes.std())).sort_values(ascending=False).dropna().head(5)
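To make the z-score step concrete, here is a small worked sketch for a single port. The byte totals are made up; the point is that one source sending far more than its peers on the same port stands out with the largest z-score.

```python
from statistics import mean, stdev

# Hypothetical byte totals, one per source, all on the same port.
bytes_per_source = [100, 110, 90, 105, 95, 1000]

mu = mean(bytes_per_source)
# Sample standard deviation, matching the default of pandas' .std().
sigma = stdev(bytes_per_source)
z_scores = [(b - mu) / sigma for b in bytes_per_source]

# Index of the source with the largest z-score.
outlier = max(range(len(z_scores)), key=z_scores.__getitem__)
print(outlier, round(z_scores[outlier], 2))
```

Ranking ports by this maximum highlights the port where a single host dominates, even if the port's total traffic is small.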

In [None]:
# Find the source that sends the most bytes on port 124
print(src_port.pipe(lambda x: x[x['port']==124]).sort_values('bytes',ascending=False).head(1))

In [None]:
answers.append('124')
detected_ips.append('12.30.96.87')
del src_port
gc.collect()

### Question 4: Private C&C channel

*We're always running a low-grade infection; some internal machines will always have some sort of malware. Some of these infected hosts phone home to C&C on a private channel. What unique port is used by external malware C&C to marshal its bots?*

In [None]:
# Count distinct external sources per port and list the ports with the fewest
df[df['src_internal']==0].drop_duplicates(['src','port']).groupby('port').size().sort_values().head()
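The idea behind this cell can be sketched on a toy list: a port that only one external source ever uses is a candidate for a private channel. The addresses and port numbers below are hypothetical.

```python
from collections import defaultdict

# Hypothetical flows originating outside the enterprise, as (src, port).
external_flows = [
    ("203.0.113.1", 80), ("203.0.113.2", 80), ("203.0.113.3", 80),
    ("203.0.113.1", 443), ("203.0.113.4", 443),
    ("198.51.100.9", 9999), ("198.51.100.9", 9999),
]

# Collect the set of distinct sources seen on each port.
sources_per_port = defaultdict(set)
for src, port in external_flows:
    sources_per_port[port].add(src)

# Ports ordered from fewest to most distinct sources.
ranked = sorted(sources_per_port, key=lambda port: len(sources_per_port[port]))
print(ranked[0], len(sources_per_port[ranked[0]]))
```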

In [None]:
answers.append('113')

### Question 5: Internal P2P

*Sometimes our low-grade infection is visible in other ways.  One particular virus has spread through a number of machines, which now are used to relay commands to each other.  The malware has created an internal P2P network.  What unique port is used by the largest internal clique, of all hosts talking to each other?*

In [None]:
# A clique is a complete subgraph: every pair of hosts in it talks directly to each other.
# Keep one row per internal (src, dst, port) combination.
int_edges = df[(df['src_internal']==1) & (df['dst_internal']==1)].drop_duplicates(['src', 'dst', 'port'])
int_ports = int_edges.port.unique()

In [None]:
upper_bounds = []
for p in int_ports:
    # Build the undirected edge set for this port, then bound its clique size using the node degrees
    internal_edges = int_edges.pipe(lambda x: x[x['port'] == p]).drop_duplicates(['src','dst'])
    edges = set()
    for l, r in zip(internal_edges.src, internal_edges.dst):
        k = min((l, r), (r, l))
        edges.add(k)
    degrees = Counter()
    for (l, r) in edges:
        degrees[l] += 1
        degrees[r] += 1
    max_clique_size = 0
    min_degrees = len(degrees)
    for idx, (node, degree) in enumerate(degrees.most_common()):
        min_degrees = min(min_degrees, degree)
        if min_degrees >= idx:
            max_clique_size = max(max_clique_size, idx+1)
        if min_degrees < max_clique_size:
            break
    upper_bounds.append((p, max_clique_size + 1))

In [None]:
# Identify the port with the largest clique
max_port = 0
max_clique = 0
for p, upper_bound in upper_bounds:
    # upper_bounds is not sorted, so skip this port instead of stopping the loop
    if max_clique > upper_bound:
        continue
    internal_edges = int_edges.pipe(lambda x: x[x['port']==p]).drop_duplicates(['src','dst'])
    internal_nodes = set(internal_edges.src) | set(internal_edges.dst)
    G = networkx.Graph()
    G.add_nodes_from(internal_nodes)
    for l, r in zip(internal_edges.src, internal_edges.dst):
        G.add_edge(l, r)        
    size = large_clique_size(G) 
    if max_clique < size:
        max_clique = size
        max_port = p

In [None]:
print(max_port, max_clique)
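Note that `large_clique_size` is a heuristic, so the size it reports is a lower bound on the true maximum. For intuition about what is being measured, here is a minimal exhaustive sketch on a tiny hypothetical graph. It is only viable for a handful of nodes and is not the method used above.

```python
from itertools import combinations

# Hypothetical undirected edges between internal hosts on one port.
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}

# Build an adjacency map: host -> set of neighbours.
adjacency = {}
for left, right in edges:
    adjacency.setdefault(left, set()).add(right)
    adjacency.setdefault(right, set()).add(left)

def is_clique(nodes):
    """True when every pair of nodes is directly connected."""
    return all(v in adjacency[u] for u, v in combinations(nodes, 2))

def max_clique_size(adjacency):
    """Exhaustive search, largest candidate size first."""
    hosts = sorted(adjacency)
    for size in range(len(hosts), 0, -1):
        if any(is_clique(group) for group in combinations(hosts, size)):
            return size
    return 0

print(max_clique_size(adjacency))
```

Here hosts `a`, `b`, and `c` all talk to each other, while `d` only talks to `c`, so the largest clique has three members.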

In [None]:
answers.append('83')
del int_edges,int_ports,upper_bounds,G
gc.collect()

### Question 6: Malware Controller

*We were just blacklisted by an IP reputation service, because some host in our network is behaving badly.  One host is a bot herder receiving C&C callbacks from its botnet, which has little other reason to communicate with hosts in the enterprise.  What is its IP?*

In [None]:
# Identify external sources that contact exactly one internal destination
single_dst = df[(df['src_internal']==0) & (df['dst_internal']==1)].drop_duplicates(['src','dst']).src.value_counts().pipe(lambda x: x[x==1]).index

In [None]:
# Count those single-destination external sources per internal destination
df[(df['src_internal']==0) & (df['dst_internal']==1)].pipe(lambda x: x[x.src.isin(single_dst)]).drop_duplicates(['src','dst']).groupby('dst').size().where(lambda x: x>0).dropna()
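The two cells above can be read as: find external hosts that only ever talk to one internal host, then see which internal host collects the most of them. A minimal sketch on made-up host names:

```python
from collections import Counter, defaultdict

# Hypothetical inbound flows as (external_src, internal_dst).
inbound = [
    ("bot1", "herder"), ("bot2", "herder"), ("bot3", "herder"),
    ("cdn", "web1"), ("cdn", "web2"),  # talks to several internal hosts
    ("partner", "web1"),
]

# Distinct internal destinations contacted by each external source.
dsts_by_src = defaultdict(set)
for src, dst in inbound:
    dsts_by_src[src].add(dst)

# For sources with exactly one destination, count how often each destination is chosen.
exclusive = Counter(
    next(iter(dsts)) for dsts in dsts_by_src.values() if len(dsts) == 1
)
print(exclusive.most_common(1))
```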

In [None]:
answers.append('14.45.67.46')
detected_ips.append('14.45.67.46')
del single_dst
gc.collect()

### Question 7: Infected Host

*One host is part of the botnet from Question 6, what is its IP?*

In [None]:
# Identify internal hosts that talk to the Question 6 IP on port 27
df[(df['src_internal']==1) & (df['dst_internal']==1) & (df['dst']=='14.45.67.46') & (df['port']==27)].drop_duplicates('src')

In [None]:
answers.append('14.51.84.50')
detected_ips.append('14.51.84.50')

### Question 8: Botnet Inside

*There is a stealthier botnet in the network, using low frequency periodic callbacks to external C&C, with embedded higher frequency calls.  What port does it use?*

In [None]:
# Count flows to external hosts per port and list the least-used ports
df[(df['dst_internal']==0)].groupby('port').size().sort_values().head()

In [None]:
answers.append('51')

### Question 9: Lateral Brute

*Once a machine is popped, it's often used to explore what else can be reached.  One host is being used to loudly probe the entire enterprise, trying to find ways onto every other host in the enterprise.  What is its IP?*


In [None]:
# Count the distinct internal destinations each internal source contacts, largest first
df[(df['src_internal']==1) & (df['dst_internal']==1)].drop_duplicates(['src','dst']).groupby('src').size().sort_values(ascending=False).head()

In [None]:
answers.append('13.42.70.40')
detected_ips.append('13.42.70.40')

### Question 10: Lateral Spy

*One host is trying to find a way onto every other host more quietly.  What is its IP?*

In [None]:
# Internal-to-internal connections, excluding hosts already identified, one row per (src, dst, port)
internal_hosts = df[(df['src_internal']==1) & (df['dst_internal']==1)].pipe(lambda x: x[~x.src.isin(detected_ips)]).drop_duplicates(['src','dst','port'])
# For each (dst, port) pair, list the sources that contacted it
dst_ports = internal_hosts.groupby(['dst','port']).src.apply(list).dropna()

In [None]:
dst_ports.pipe(lambda x: x[x.map(len)==1]).to_frame().reset_index().explode('src').src.value_counts()
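The reasoning here is that a quiet scanner touches many (destination, port) pairs that nobody else uses. The sketch below counts, for each source, the pairs where it is the sole visitor. The flows and host names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical de-duplicated internal flows as (src, dst, port).
flows = [
    ("h1", "h2", 80), ("h3", "h2", 80),  # port 80 on h2 is used by two hosts
    ("spy", "h1", 7), ("spy", "h2", 9), ("spy", "h3", 11),  # only spy touches these
    ("h1", "h3", 22),
]

# Distinct sources seen for each (dst, port) pair.
sources_by_service = defaultdict(set)
for src, dst, port in flows:
    sources_by_service[(dst, port)].add(src)

# Count, per source, the pairs it alone has contacted.
sole_visitors = Counter(
    next(iter(srcs)) for srcs in sources_by_service.values() if len(srcs) == 1
)
print(sole_visitors.most_common(1))
```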

In [None]:
answers.append('12.49.123.62')
detected_ips.append('12.49.123.62')

## Checking the answers

Use the following code to check if your answers are correct.

In [None]:
answers

In [None]:
import hashlib
answer_hash = hashlib.md5(':'.join(answers).encode('utf-8')).hexdigest()
assert answer_hash == 'ec766132cac80b821793fb9e7fdfd763'
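If the assertion fails, it gives no hint about what went wrong. As a small sketch, the hypothetical helper `check_answers` below performs the same comparison (answers joined with `:` and hashed with MD5) but returns a boolean, so it can be called repeatedly while experimenting. The toy answers are placeholders.

```python
import hashlib

def check_answers(candidates, expected_md5):
    """Join the answers with ':' and compare the MD5 hex digest to the expected one."""
    joined = ":".join(candidates)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return digest == expected_md5

# Toy demonstration with placeholder answers, not the challenge's.
toy_expected = hashlib.md5(b"a:b").hexdigest()
print(check_answers(["a", "b"], toy_expected))
print(check_answers(["b", "a"], toy_expected))
```

Because the answers are joined before hashing, both their values and their order must match.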