# Data Analysis for Network Security

This is the [Wildcard 400 challenge from the 2019 Trend Micro CTF](https://ctf.trendmicro.com). It's a fun set of exercises!

## Introduction

You are a network security administrator for the medium-sized business XYZcorp. You often use network flow data to uncover anomalous security events. This challenge provides some sample aggregated data on flows, and uses answers from the anomalous events to construct the flag.

Knowledge of network security or protocols is not required.  This challenge requires data stacking, slicing, and/or anomaly detection.

### Data
  - Columns: timestamp, src, dst, port, bytes
  - Internal hosts have IPs beginning with 12-14
  - External IPs include everything else

## Preprocessing

In [None]:
import pandas as pd 
import numpy as np
import gc

In [None]:
df = pd.read_csv(
    '/kaggle/input/2019-trendmicro-ctf-wildcard-400/gowiththeflow_20190826.csv',
    header = 0, 
    names= ['ts', 'src', 'dst', 'port', 'bytes']
)
# Port numbers can reach 65535, so uint8 would silently wrap values above 255
df['port'] = df['port'].astype('uint16')

# Internal hosts have IPs beginning with 12-14.
# The trailing dot keeps addresses such as 120.x.x.x from matching.

def is_internal(s):
    return s.str.startswith(('12.', '13.', '14.'))

df['internal_src'] = is_internal(df['src'])
df['internal_dst'] = is_internal(df['dst'])

df.head()
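
A prefix test on the raw string depends on exactly which characters are compared: a two-character prefix such as `'12'` also matches an address like `120.8.8.8`. The sketch below compares that loose test with a numeric check on the first octet. The addresses are made up for illustration, and `first_octet_internal` is a hypothetical helper, not part of the challenge.

```python
import pandas as pd

# Synthetic addresses, made up for illustration only.
ips = pd.Series(['12.30.1.5', '13.37.84.125', '14.45.67.46', '120.8.8.8', '8.8.8.8'])

def first_octet_internal(s):
    # Hypothetical helper: parse the first octet and test it as a number.
    first_octet = s.str.split('.', n=1).str[0].astype(int)
    return first_octet.between(12, 14)

# Loose test: only the first two characters are compared.
loose = ips.str[:2].isin(['12', '13', '14'])
# Strict test: the whole first octet must be 12, 13 or 14.
strict = first_octet_internal(ips)

print(pd.DataFrame({'ip': ips, 'loose': loose, 'strict': strict}))
```

The two tests disagree only on `120.8.8.8`, which the loose version would count as internal.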

## Challenges

The data used here is highly synthetic, so it should be obvious when you get the _right_ answer. 

In [None]:
answers = []
infected_hosts = []
relevant_ports = []

In [None]:
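def save_answer(answer, tipo):
    # Record an answer. Hosts and ports are also tracked separately
    # so that later questions can exclude them.
    if tipo == 'host':
        infected_hosts.append(answer)
    elif tipo == 'port':
        relevant_ports.append(answer)
    answers.append(answer)
    gc.collect()
    return answer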

### Question 1: Discover Data Exfiltration 1

*Our intellectual property is leaving the building in large chunks. A machine inside is being used to send out all of our widget designs. One host is sending out much more data from the enterprise than the others. What is its IP?*

In [None]:
# A "machine inside" means it's an internal IP.
# One host is sending out much more data than the others.

result = df.loc[(df['internal_src'])]\
           .groupby(['src'])\
           .bytes.sum()\
           .sort_values(ascending=False)\
           .head(1)

In [None]:
answer = result.index[0]
save_answer(answer,'host')
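
The query for this question sums every byte sent by an internal source, whatever the destination. A stricter reading of "leaving the building" would keep only flows that go from an internal host to an external one. The sketch below shows how the two readings can differ, using made-up flows and a hypothetical `first_octet_internal` helper; it is an illustration, not the challenge's reference solution.

```python
import pandas as pd

# Synthetic flows, made up for illustration only.
flows = pd.DataFrame({
    'src':   ['12.0.0.1', '12.0.0.1', '13.0.0.2', '13.0.0.2', '8.8.8.8'],
    'dst':   ['13.0.0.2', '8.8.8.8',  '9.9.9.9',  '9.9.9.9',  '12.0.0.1'],
    'bytes': [9000,       100,        400,        500,        7000],
})

def first_octet_internal(s):
    # Hypothetical helper: internal means the first octet is 12, 13 or 14.
    return s.str.split('.', n=1).str[0].astype(int).between(12, 14)

# Keep only traffic that actually leaves the enterprise.
egress = flows[first_octet_internal(flows['src']) & ~first_octet_internal(flows['dst'])]
outbound = egress.groupby('src')['bytes'].sum().sort_values(ascending=False)
top_sender = outbound.index[0]

print(outbound)
```

In this toy data `12.0.0.1` sends the most bytes overall, but almost all of them stay inside the network, so `13.0.0.2` is the top sender of outbound data.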

### Question 2: Discover Data Exfiltration 2

*Another attacker has a job scheduled that exports the contents of our internal wiki. One host is sending out much more data during off hours from the enterprise than the others, different from the host in Question 1. What is its IP?*



In [None]:
df['ts'] = (pd.to_datetime(df['ts'],unit='ms'))
df['hour'] = df['ts'].dt.hour.astype('uint8')
df.head()

In [None]:
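# Off hours are taken here to be hours 0 through 16.
# The host found in Question 1 is excluded through `answers`.
result = df.loc[(df['hour'] >= 0) &\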
                (df['hour'] <=16) &\
                ~df['src'].isin(answers)]\
           .groupby(['src'])\
           .bytes.sum()\
           .sort_values(ascending=False)\
           .head(1)

In [None]:
answer = result.index[0]
save_answer(answer,'host')
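
The challenge text does not say which hours count as "off hours", so the cutoff is a judgment call. One way to pick it is to look at how many flows and bytes fall into each hour and see where activity drops. The sketch below builds such an hourly profile on made-up epoch-millisecond timestamps; the values are assumptions for illustration only.

```python
import pandas as pd

# Synthetic epoch-millisecond timestamps, made up for illustration only.
flows = pd.DataFrame({
    'ts':    [0, 3_600_000, 3_600_001, 61_200_000, 61_200_500, 61_201_000],
    'bytes': [10, 20, 30, 400, 500, 600],
})
flows['hour'] = pd.to_datetime(flows['ts'], unit='ms').dt.hour

# Flow count ('size') and total bytes ('sum') for each hour of the day.
by_hour = flows.groupby('hour')['bytes'].agg(['size', 'sum'])
print(by_hour)
```

On the real data, a clear gap between busy and quiet hours in this table would suggest where business hours start and end.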

### Question 3: Discover Data Exfiltration 3

*Some assailant is grabbing all the employee and vendor email addresses, and sending them out on a channel normally reserved for other uses. This is similar to attackers abusing DNS for data exfiltration. One host is sending out much more data on some port from the enterprise than other hosts do, different from the hosts in Questions 1 and 2. What is its port?*


In [None]:
# In this case we're not interested in absolute numbers, but in relative ones.
# One host and port pair is sending much more data than the other hosts on that same port.
# Therefore the first row of result is not necessarily the answer we want.

result = df.loc[(~df['src'].isin(infected_hosts))]\
           .groupby(['src','port'])\
           .bytes.sum()\
           .sort_values(ascending=False)\
           .reset_index()
result

In [None]:
# Method .drop_duplicates() returns a dataframe. Method .unique() returns an array. Both exclude duplicate elements.

ports = result['port'].unique()
ports

In [None]:
# Since we are looking for an atypical entry in our data, we can use the z-score:

# A z-score measures how far a value lies from the mean of its group, in units of standard deviation.
# It shows whether a value is typical for its data set or atypical.

result = result.groupby('port')\
               .apply(lambda x: np.max((x.bytes - x.bytes.mean()) / x.bytes.std()))\
               .sort_values(ascending=False).head(1)

In [None]:
answer = result.index[0]
save_answer(answer,'port')
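
For a value $x$ in a group with mean $\mu$ and standard deviation $\sigma$, the z-score is $z = (x - \mu) / \sigma$. The sketch below works through the per-port maximum z-score on made-up totals for four hypothetical hosts; `max_z` is an illustrative helper name. Note that pandas' `std()` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Synthetic per-host totals for two ports, made up for illustration only.
totals = pd.DataFrame({
    'src':   ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd'],
    'port':  [53, 53, 53, 53, 80, 80, 80, 80],
    'bytes': [100, 110, 90, 1000, 500, 520, 480, 510],
})

def max_z(b):
    # Largest z-score within one port's group of per-host byte totals.
    return ((b - b.mean()) / b.std()).max()

scores = totals.groupby('port')['bytes'].apply(max_z).sort_values(ascending=False)
most_atypical_port = scores.index[0]

print(scores)
```

Port 80 carries more bytes in total, but its hosts all send similar amounts. Port 53 ranks first because host `d` stands far apart from the other hosts on that port.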

### Question 4: Private C&C channel

*We're always running a low-grade infection; some internal machines will always have some sort of malware. Some of these infected hosts phone home to C&C on a private channel. What unique port is used by external malware C&C to marshal its bots?*

In [None]:
# The unique port we want is probably the least used one, since legitimate ports tend to be used many times.
# Also, the C&C is external, so only flows from external sources are counted.

result = df.loc[(~df['internal_src'])]\
           .groupby('port')\
           .size().reset_index(name='counts').sort_values(by='counts')\
           .head(10)
result.reset_index(drop=True,inplace=True)
result

In [None]:
answer = result.port[0]
save_answer(answer,'port')

### Question 5: Internal P2P

*Sometimes our low-grade infection is visible in other ways.  One particular virus has spread through a number of machines, which now are used to relay commands to each other.  The malware has created an internal P2P network.  What unique port is used by the largest internal clique, of all hosts talking to each other?*

In [None]:
answers.append('<Port>')
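
This question is left unanswered here. One possible approach, offered only as a sketch, is to build one undirected graph per port, where an edge means two internal hosts exchanged traffic on that port, and then look for the port whose graph contains the largest clique. The code below does this on made-up flows with a plain Bron-Kerbosch search. Treating traffic in either direction as an edge is an assumption, and `largest_clique` is a hypothetical helper.

```python
from collections import defaultdict

# Synthetic internal flows as (src, dst, port), made up for illustration only.
flows = [
    ('a', 'b', 10), ('b', 'c', 10), ('c', 'a', 10), ('a', 'd', 10),
    ('a', 'b', 20), ('b', 'c', 20), ('c', 'd', 20),
]

def largest_clique(adj):
    """Return one maximum clique of an undirected graph (Bron-Kerbosch with pivoting)."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest seen so far.
            if len(r) > len(best):
                best = set(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best

# One undirected graph per port: an edge means the two hosts exchanged traffic.
graphs = defaultdict(lambda: defaultdict(set))
for src, dst, port in flows:
    if src != dst:
        graphs[port][src].add(dst)
        graphs[port][dst].add(src)

cliques = {port: largest_clique(adj) for port, adj in graphs.items()}
best_port = max(cliques, key=lambda port: len(cliques[port]))

print(best_port, sorted(cliques[best_port]))
```

Clique search is exponential in the worst case, so on the full data set it would help to first restrict the flows to internal sources and destinations and to drop ports with very few host pairs.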

### Question 6: Malware Controller

*We were just blacklisted by an IP reputation service, because some host in our network is behaving badly.  One host is a bot herder receiving C&C callbacks from its botnet, which has little other reason to communicate with hosts in the enterprise.  What is its IP?*

In [None]:
# "Little other reason to communicate with hosts in the enterprise" probably means that the bots
# talk to very few other hosts in the dataset.

result = df.drop_duplicates(['src','dst']).src.value_counts().sort_values()
result

In [None]:
# Let's isolate the hosts that communicate with only one other host:
result = result.pipe(lambda x: x[x == 1])
result

In [None]:
# Let's find out which host they are communicating with:
herder_candidate = df.loc[(df['src'] == result.index[0])]\
                     .drop_duplicates(['src','dst'])\
                     .dst.iloc[0]
herder_candidate

In [None]:
# Use the candidate found above instead of hard-coding its address
answer = df.query('dst == @herder_candidate')\
           .drop_duplicates(['src','dst'])
answer

In [None]:
index_as_list = list(result.index)
answer.loc[(answer['src'].isin(index_as_list))]

In [None]:
# Since the number of hosts that communicate with only one other host (i.e. len(result)) 
# matches that of the above dataframe, and all of its entries use the same port, 
# it is safe to assume that our herder_candidate is in fact the answer we want

save_answer(herder_candidate, 'host')

### Question 7: Infected Host

*One host is part of the botnet from Question 6, what is its IP?*

In [None]:
# The botnet herded by 14.45.67.46 uses only port 27. We are interested in unique internal hosts that have the same pair dst,port :

test = df.loc[(~df['src'].isin(index_as_list)) &\
              (~df['src'].isin(infected_hosts)) &\
              (df['internal_src']) &\
              (df['internal_dst']) &\
              (df['dst'] == '14.45.67.46') &\
              (df['port'] == 27)]\
         .drop_duplicates('src')
test

In [None]:
# We have two candidates. Since 13.42.70.40 is the answer to Question 9,
# the answer here is the other one:

save_answer('14.51.84.50', 'host')

### Question 8: Botnet Inside

*There is a stealthier botnet in the network, using low frequency periodic callbacks to external C&C, with embedded higher frequency calls. What port does it use?*

In [None]:
answers.append('<Port>')

### Question 9: Lateral Brute

*Once a machine is popped, it's often used to explore what else can be reached.  One host is being used to loudly probe the entire enterprise, trying to find ways onto every other host in the enterprise.  What is its IP?*


In [None]:
# If said host is trying to probe every other host, it is probably the one that contacts the most distinct destinations.

result = df.drop_duplicates(['src','dst']).groupby(['src']).size().reset_index(name='counts').sort_values(by='counts',ascending=False)
result.reset_index(drop=True,inplace=True)
result.head(10)

In [None]:
answer = result.src[0]
save_answer(answer,'host')

### Question 10: Lateral Spy

*One host is trying to find a way onto every other host more quietly.  What is its IP?*

In [None]:
answers.append('<IP address>')

## Checking the answers

Use the following code to check if your answers are correct.

In [None]:
# Convert every answer to a string so they can be joined
answers = [str(a) for a in answers]

In [None]:
import hashlib
answer_hash = hashlib.md5(':'.join(answers).encode('utf-8')).hexdigest()
assert answer_hash == 'ec766132cac80b821793fb9e7fdfd763'

In [None]:
print('\n'.join(answers))
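
The check joins all answers with `:` and hashes the result, so the digest depends on every entry and on their order. The sketch below illustrates this with placeholder values. `flag_hash` is a hypothetical helper and the inputs are not real answers.

```python
import hashlib

def flag_hash(answers):
    # Join the stringified answers with ':' and hash the UTF-8 bytes with MD5.
    joined = ':'.join(str(a) for a in answers)
    return hashlib.md5(joined.encode('utf-8')).hexdigest()

# Placeholder values, not real answers.
example = ['10.0.0.1', 42, '<Port>']
digest = flag_hash(example)
reordered = flag_hash(list(reversed(example)))

print(digest != reordered)
```

A single wrong, missing, or misordered entry changes the digest, so a failing assert does not tell you which answer is off.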