# Lab 1: Exploring Honeypot Data

This lab examines data gathered from several honeypot sensors. The records come from four packages: Snort, Amun, Dionaea, and Glastopf. Snort is a network intrusion detection system rather than a honeypot; it looks for patterns in network traffic and can run alongside the honeypots. Amun is a low-interaction honeypot that listens on several ports and records connections to those ports. Dionaea is a low-interaction honeypot that emulates network services and records connections and captured downloads. Glastopf is another low-interaction honeypot that runs a web server and records client requests.
Time series graphs and other exploration techniques are used to understand the types and frequency of scans and attacks against the honeypot infrastructure.

In [1]:
%matplotlib inline
from datetime import datetime
import json
import pandas as pd
import re

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,5)

In [2]:
data_path = 'data/honeypot.json' # change this to the location of the honeypot.json file in your Google Drive

In [3]:
%%bash 
head 'data/honeypot.json'

{ "_id" : { "$oid" : "5426456e9f8c6d41306aea57" }, "ident" : "a16f5f36-3c41-11e4-9ee4-0a0b6e7c3e9e", "timestamp" : { "$date" : "2014-09-27T05:04:46.363+0000" }, "normalized" : true, "payload" : "{\"pattern\": \"style_css\", \"time\": \"2014-09-27 05:04:58\", \"filename\": null, \"source\": [\"162.197.24.67\", 60871], \"request_raw\": \"GET /style.css HTTP/1.1\\r\\nAccept: text/css,*/*;q=0.1\\r\\nAccept-Encoding: gzip,deflate,sdch\\r\\nAccept-Language: en-US,en;q=0.8\\r\\nConnection: keep-alive\\r\\nDnt: 1\\r\\nHost: ec2-54-68-96-53.us-west-2.compute.amazonaws.com\\r\\nReferer: http://ec2-54-68-96-53.us-west-2.compute.amazonaws.com/comments\\r\\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36\", \"request_url\": \"/style.css\"}", "channel" : "glastopf.events" }
{ "_id" : { "$oid" : "542645799f8c6d41306aea59" }, "ident" : "a16f5f36-3c41-11e4-9ee4-0a0b6e7c3e9e", "timestamp" : { "$date" : "2014-09-27T05:04

# Section 1: Read in Data

This script parses the honeypot.json file that the data_path variable defined above points to. It reads the file line by line, deserializes each JSON object, and uses the channel key to decide which dataset the record belongs to. The nested payload string is decoded and merged into the record, which is then appended to that dataset's list.
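
A minimal sketch of what the parsing step does to a single line, assuming a made-up record shaped like the sample output above (the sample_line values are illustrative and not taken from the dataset). The outer object is decoded first; the payload field is itself a JSON string, so it needs a second json.loads before its keys can be merged into the record.

```python
import json

# Hypothetical record shaped like one line of honeypot.json (values are made up).
sample_line = json.dumps({
    "_id": {"$oid": "000000000000000000000000"},
    "ident": "example-sensor",
    "timestamp": {"$date": "2014-09-27T05:04:46.363+0000"},
    "normalized": True,
    "payload": json.dumps({"pattern": "style_css", "source": ["192.0.2.10", 60871]}),
    "channel": "glastopf.events",
})

record = json.loads(sample_line)              # first pass: the outer object
del record["_id"]
del record["ident"]
payload = json.loads(record.pop("payload"))   # second pass: payload is a JSON string
record.update(payload)                        # flatten the payload keys into the record

print(sorted(record))
```

After the merge, the record is a flat dictionary, which is why a list of such records converts cleanly into a DataFrame with one column per key.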

In [4]:
dionaea_conn = []
amun = []
dionaea_cap = []
glastopf = []
snort = []

# Map each channel name to the list that collects its records.
datasets = {
    'dionaea.connections': dionaea_conn,
    'amun.events': amun,
    'dionaea.capture': dionaea_cap,
    'glastopf.events': glastopf,
    'snort.alerts': snort,
}

with open(data_path, 'r') as f:
    for line in f:
        columns = json.loads(line)

        del columns['_id']
        del columns['ident']

        target = datasets.get(columns['channel'])
        if target is None:
            continue  # skip channels that are not analyzed in this lab

        # The payload is itself a JSON string; decode it and merge its keys into the record.
        payload = json.loads(columns.pop("payload"))
        columns.update(payload)
        target.append(columns)

With the dataset lists, we then build pandas DataFrame objects, which make the data easier to manipulate and explore.

In [5]:
dionaea_conn_df = pd.DataFrame(dionaea_conn)
dionaea_cap_df = pd.DataFrame(dionaea_cap)
amun_df = pd.DataFrame(amun)
glastopf_df = pd.DataFrame(glastopf)
snort_df = pd.DataFrame(snort)

dionaea_conn_df["timestamp"] = [x['$date'] for x in dionaea_conn_df['timestamp'].values]
dionaea_cap_df['timestamp'] = [x['$date'] for x in dionaea_cap_df['timestamp'].values]
# amun_df['timestamp'] = [x['$date'] or None for x in amun_df['timestamp'].values]
glastopf_df['timestamp'] = [x['$date'] for x in glastopf_df['timestamp'].values]
snort_df['timestamp'] = [x['$date'] for x in snort_df['timestamp'].values]

In [6]:
type(dionaea_conn_df)

pandas.core.frame.DataFrame

We now have a bunch of pandas dataframes!

In [7]:
dionaea_conn_df.head() # .head() shows the first 5 rows of a dataframe; use .head(x) to show the first x rows

Unnamed: 0,channel,connection_protocol,connection_transport,connection_type,local_host,local_port,normalized,remote_host,remote_hostname,remote_port,timestamp
0,dionaea.connections,pcap,tcp,reject,162.244.30.100,23,True,176.232.136.46,,44516,2015-03-03T16:40:31.681+0000
1,dionaea.connections,pcap,tcp,reject,162.244.30.100,3128,True,61.160.213.108,,33122,2015-03-03T16:50:35.359+0000
2,dionaea.connections,pcap,tcp,reject,162.244.30.100,23,True,115.50.182.177,,56252,2015-03-03T16:56:09.910+0000
3,dionaea.connections,pcap,tcp,reject,162.244.30.100,80,True,104.207.136.102,,42412,2015-03-03T17:02:31.759+0000
4,dionaea.connections,pcap,tcp,reject,162.244.30.100,3128,True,61.160.213.108,,33122,2015-03-03T17:08:44.617+0000


In [8]:
dionaea_conn_df.dtypes # show the feature types

channel                 object
connection_protocol     object
connection_transport    object
connection_type         object
local_host              object
local_port               int64
normalized                bool
remote_host             object
remote_hostname         object
remote_port              int64
timestamp               object
dtype: object

### Clean Timestamps & Set Index
**[Task]** Review the pandas documentation and online resources to convert the timestamp column to a datetime data type.  Then set the index to the timestamp column.

In [9]:
dionaea_conn_df.head(2)

Unnamed: 0,channel,connection_protocol,connection_transport,connection_type,local_host,local_port,normalized,remote_host,remote_hostname,remote_port,timestamp
0,dionaea.connections,pcap,tcp,reject,162.244.30.100,23,True,176.232.136.46,,44516,2015-03-03T16:40:31.681+0000
1,dionaea.connections,pcap,tcp,reject,162.244.30.100,3128,True,61.160.213.108,,33122,2015-03-03T16:50:35.359+0000


**[Task/Question]**  
1. Over what time period are the dionaea_conn logs collected? (Hint: .index.min() )
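
A minimal sketch of the conversion on a toy dataframe, assuming timestamps in the same string format as the output above (toy_df and its values are made up for illustration). Once the index holds datetimes, its minimum and maximum give the collection window.

```python
import pandas as pd

# Toy frame with ISO-8601 strings in the same format as the honeypot timestamps.
toy_df = pd.DataFrame({
    "timestamp": ["2015-03-03T16:40:31.681+0000", "2015-03-03T17:08:44.617+0000"],
    "local_port": [23, 3128],
})

toy_df["timestamp"] = pd.to_datetime(toy_df["timestamp"])  # strings -> datetimes
toy_df = toy_df.set_index("timestamp")                      # use the timestamps as the row index

print(toy_df.index.min(), toy_df.index.max())
print(toy_df.index.max() - toy_df.index.min())              # length of the collection window
```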

## Section 1 Questions
1. What are honeypots?  What is the difference between a low-interaction and a high-interaction honeypot?*  
2. What are 3 other honeypots on Github?     
3. What are use cases of honeypots?  What are the limitations?*  
*Please write 2-3 paragraphs and cite at least 2 references.

1.  
2.   
3. 

# Section 2: Explore Data
This section shows examples of how to analyze honeypot data with pandas.

## Review properties of dataframes 
**[Task]** For each dataset [dionaea_conn_df, amun_df, dionaea_cap_df, glastopf_df, snort_df] show the dataframe dimensions and variables.

You can use the df.shape attribute to get the number of rows and columns, and list(df) to list the column names.
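
As a quick illustration on a made-up dataframe (toy_df is not part of the lab data), shape is an attribute that returns a (rows, columns) tuple, while list(df) returns the column names in order:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "remote_host": ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
    "remote_port": [44516, 33122, 56252],
})

print(toy_df.shape)   # (rows, columns) tuple; an attribute, so no parentheses
print(list(toy_df))   # column names in order
```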

In [10]:
# Fill this in
for df in [dionaea_conn_df, dionaea_cap_df, amun_df, glastopf_df, snort_df]:
    pass  # replace with code that shows df.shape and list(df)

In [None]:
glastopf[1] # from the glastopf list, second item

In [None]:
glastopf_df.iloc[1] # from the glastopf dataframe, second row

**[Question]** Describe each of the honeypots used in this dataset (dionaea, amun, glastopf).

### Explore Amun Dataset

In [None]:
amun_df.head(1)

**[Question]**
1. What are the use cases for Amun? How/would you use this in a production enterprise?  
2. What are the limitations?
3. Describe the features (columns) in the amun dataframe

First, let's drop unnecessary columns.

In [None]:
len(amun_df)

In [None]:
amun_df.isnull().sum() # let's get the count of NAs across all columns

In [None]:
amun_df = amun_df.drop(['attackerID'], axis=1) # since attackerID is only filled with NAs
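
Columns that are entirely empty can also be found programmatically rather than by reading the isnull().sum() output by eye. A minimal sketch on a toy dataframe; the column names are borrowed from this notebook, but the values are made up:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "attackerIP": ["192.0.2.1", "192.0.2.2"],
    "attackerID": [None, None],          # entirely missing
    "shellcodeName": ["plain", None],    # partially missing
})

null_counts = toy_df.isnull().sum()                       # missing values per column
all_null = null_counts[null_counts == len(toy_df)].index  # columns with no data at all
cleaned = toy_df.drop(columns=all_null)

print(list(all_null))
print(list(cleaned))
```

Calling toy_df.dropna(axis=1, how="all") should give the same result in one step; the longer form just makes it visible which columns were removed.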

In [None]:
amun_df.downloadMethod.value_counts()[:5]

In [None]:
amun_df.shellcodeName.value_counts()

In [None]:
amun_df.vulnName.value_counts()

**[Question]**
1. Describe any interesting things you see in the Amun dataset.

### Explore Dionaea Dataset

In [None]:
dionaea_conn_df.head(1)

In [None]:
dionaea_cap_df.head(1)

**[Question]**
1. What are the use cases for Dionaea? How/would you use this in a production enterprise?  
2. What are the limitations?
3. Describe each of the features (columns) in the dionaea_conn_df and dionaea_cap_df datasets

**First let's drop unnecessary columns**  
**[Task]**  
Explain why you dropped the columns you deemed unnecessary.

dionaea_conn_df

In [None]:
len(dionaea_conn_df)

In [None]:
dionaea_conn_df = dionaea_conn_df.drop([], axis=1)

dionaea_cap_df

In [None]:
len(dionaea_cap_df)

In [None]:
dionaea_cap_df = dionaea_cap_df.drop([], axis=1)

In [None]:
dionaea_cap_df['url'].value_counts()[:5]

**[Question]**
1. Describe any interesting things you see in the Dionaea dataset.

### Explore Glastopf Dataset

In [None]:
glastopf_df.head(1)

**[Question]**
1. What are the use cases for Glastopf? How/would you use this in a production enterprise?  
2. What are the limitations?
3. Describe the features (columns) in the glastopf dataframe

**First, let's drop unnecessary columns**

In [None]:
len(glastopf_df)

In [None]:
glastopf_df.isnull().sum() # let's get the count of NAs across all columns

In [None]:
glastopf_df = glastopf_df.drop(['channel', 'filename', 'normalized', 'time'], axis=1)

In [None]:
glastopf_df.head(1) # print the first row of the dataframe

**Regex**  
We can then use regex (regular expressions) to extract the user agent string from the request_raw column.  Let's add that to the dataframe.

In [None]:
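import re

# Capture the User-Agent header value (without leading whitespace) up to the end of that header line.
regex = re.compile(r'user-agent:[ \t]*(.*?)(?:\r|$)', re.IGNORECASE)

def extract_user_agent(raw):
    match = regex.search(raw)  # search once and reuse the match
    return match.group(1) if match else None

glastopf_df['user-agent'] = glastopf_df['request_raw'].apply(extract_user_agent)
glastopf_df.head(1)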

**[Task/Question]**  
- Add a feature, HTTP Method, to the glastopf dataframe.  Read about using the .str.split() method.  (hint: '/')
- Find out which HTTP method is used most often.
- Is there anything unusual?
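
A minimal sketch of the splitting idea on made-up request strings (toy_requests is illustrative only). In a request line such as "GET /style.css HTTP/1.1", everything before the first '/' is the method plus a trailing space, so the first piece of the split is stripped before counting.

```python
import pandas as pd

toy_requests = pd.Series([
    "GET /style.css HTTP/1.1\r\nHost: example.com",
    "POST /login HTTP/1.1\r\nHost: example.com",
    "GET / HTTP/1.1\r\nHost: example.com",
])

# Split on '/', keep the first piece, and strip the trailing space.
http_method = toy_requests.str.split("/").str[0].str.strip()

print(http_method.value_counts())
```

Requests that do not follow the usual "METHOD /path" layout will produce odd values in this column, which is one way unusual traffic can show up.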

**[Question]** 
1. What are the 5 most popular user-agent strings? (hint: value_counts()[:5])
2. What are the 5 most popular source IPs? Use the glastopf_df['source'] column
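
In the sample record shown at the top of the lab, source is an [IP, port] pair rather than a plain string, so counting the column directly is likely to fail because lists are not hashable. A minimal sketch, on made-up data, of pulling the IP into its own column first (the source_ip column name is an assumption, not part of the dataset):

```python
import pandas as pd

# Each 'source' value is an [ip, port] pair, as in the sample record.
toy_df = pd.DataFrame({
    "source": [["192.0.2.1", 60871], ["192.0.2.1", 60872], ["198.51.100.7", 1234]],
})

toy_df["source_ip"] = toy_df["source"].str[0]   # first element of each pair
print(toy_df["source_ip"].value_counts()[:5])
```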

**Searching Strings**  
Just as we can extract and manipulate strings from columns, we can search string columns to find things.  Since we have the raw message of the request, we can use this to search for suspicious things in requests.
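
One detail worth knowing before searching: .str.contains() treats its pattern as a regular expression by default, so characters such as '.', '(' and '{' carry special meaning. A small sketch on made-up requests showing the difference between a regex search, an escaped regex, and a literal substring search with regex=False:

```python
import pandas as pd

toy_requests = pd.Series([
    "GET /index.html HTTP/1.1",
    "GET /admin.php?cmd=(id) HTTP/1.1",
    "GET /adminXphp HTTP/1.1",
])

as_regex = toy_requests.str.contains("admin.php")                 # '.' matches any character
escaped = toy_requests.str.contains(r"admin\.php")                # '\.' matches a literal dot
literal = toy_requests.str.contains("admin.php", regex=False)     # plain substring search

print(as_regex.tolist())
print(escaped.tolist())
print(literal.tolist())
```

When the search string is copied verbatim from an article or an exploit, regex=False is usually the safer choice.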

Read and learn about Shellshock: https://blog.cloudflare.com/inside-shellshock/

**[Question]**
1. What are the patterns used in Shellshock?

Let's search the request_raw column...

In [None]:
glastopf_df[glastopf_df['request_raw'].str.contains(r'\.\.')]['request_raw'].value_counts()[:3] # only show the first 3 rows

**[Question]**  
- What do you think the pattern '\.\.' is used to detect?  

In [None]:
shell_shock_pattern = '() { :; };'  # the 'magic string' referenced in the Cloudflare article above

In [None]:
shell_shock_requests = glastopf_df[glastopf_df['request_raw'].str.contains(shell_shock_pattern, regex=False)]['request_raw'].value_counts()[:3] # literal match (regex=False); only show the first 3 rows
shell_shock_requests

**[Question]**  
- Great, so now we know there's evidence that some attackers tried to exploit the Shellshock vulnerability on the honeypot.  What IP addresses are they from?

In [None]:
shell_shock_raw = glastopf_df[glastopf_df['request_raw'].str.contains(shell_shock_pattern, regex=False)]['request_raw']
shell_shock_raw.str.extract(r'(http://\S+)', expand=False).dropna().unique() # first http:// URL in each matching request

### Explore Snort Dataset

In [None]:
snort_df.head(1)

**[Question]**
1. What is Snort? How/would you use this in a production enterprise?  
2. Describe the features (columns) in the snort dataframe

In [None]:
len(snort_df)

In [None]:
snort_df.isnull().sum()

In [None]:
snort_df = snort_df.drop(['channel'], axis=1)

In [None]:
snort_df.classification.value_counts()

**[Question]**
1.  What does Snort classify as Potentially Bad Traffic, Misc activity, and Attempted Denial of Service?

In [None]:
bad_traffic = snort_df[snort_df['classification']=='Potentially Bad Traffic']
bad_traffic.shape

In [None]:
bad_traffic.signature.value_counts()

In [None]:
bad_traffic.head(1)

## Section 2 Questions


**[Question]**
1. Describe what you learned from looking at this honeypot data
2. What is missing?  What did you expect?

# Section 3: Further Applications

- GeoLite2 IP geolocation databases (used below to enrich attacker IPs): https://dev.maxmind.com/geoip/geoip2/geolite2/

In [None]:
import geoip2.database

geo = geoip2.database.Reader('GeoLite2-City.mmdb')

# amun_df['attackerCountry'] = amun_df['attackerIP'].apply(lambda x: (reader.city(x).subdivisions.most_specific.name) if (reader.city(x)) else None )

In [None]:
def get_state(ip):
    try:
        response = geo.city(ip)
        return response.subdivisions.most_specific.name
    except Exception:  # e.g. the address is missing from the database or malformed
        return float('nan')

In [None]:
amun_df['attackerState'] = amun_df['attackerIP'].apply(get_state)

In [None]:
amun_df.attackerState.value_counts()[:5]

**[Challenge Task]** 
- Get the get_lat_long function working.  Then visualize a choropleth plot of all of the data to show where the attackers come from.

In [None]:
def get_lat_long(ip):
    try:
        response = geo.city(ip)
        return [response.location.latitude, response.location.longitude]
    except Exception:  # e.g. the address is missing from the database or malformed
        return [float('nan'), float('nan')]

In [None]:
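# zip(*...) turns the per-row [lat, long] pairs into one sequence of latitudes and one of longitudes
amun_df['attackerLat'], amun_df['attackerLong'] = zip(*amun_df['attackerIP'].apply(get_lat_long))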

In [None]:
amun_df.head()

In [None]:
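# Assumes unique_ips (a Series of distinct attacker IPs, indexed by IP) and get_country (a lookup similar to get_state) are defined first
amun_df['attackerCountry'] = amun_df['attackerIP'].map(unique_ips.apply(get_country))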
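
The cell above relies on looking up each distinct IP once and then mapping the results back onto every row, which avoids repeating a database lookup for attackers that appear many times. A minimal sketch of that pattern with a stand-in lookup; map_unique, fake_get_country, and the country names are hypothetical, and a real lookup might wrap geo.city(ip).country.name with the same error handling as get_state:

```python
import pandas as pd

def map_unique(series, lookup):
    """Call `lookup` once per distinct value and map the results back onto every row."""
    distinct = pd.Series(series.unique(), index=series.unique())
    return series.map(distinct.apply(lookup))

# Stand-in for a GeoIP lookup (made-up data).
fake_countries = {"192.0.2.1": "Exampleland", "198.51.100.7": "Testonia"}
calls = []

def fake_get_country(ip):
    calls.append(ip)                 # record how often the lookup runs
    return fake_countries.get(ip)

toy_ips = pd.Series(["192.0.2.1", "192.0.2.1", "198.51.100.7", "192.0.2.1"])
countries = map_unique(toy_ips, fake_get_country)

print(countries.tolist())
print(len(calls))
```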

**Time Plots**

In [None]:
snort_df['timestamp'] = pd.to_datetime(snort_df['timestamp'])
snort_df.set_index('timestamp', inplace=True)

glastopf_df['timestamp'] = pd.to_datetime(glastopf_df['timestamp'])
glastopf_df.set_index('timestamp', inplace=True)

In [None]:
plt.plot(snort_df['source_ip'].resample("D").count(), label="Total Events")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)    
plt.show()

In [None]:
plt.plot(glastopf_df['source'].resample("D").count(), label="Total Events")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)    
plt.show()
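
To see what daily resampling produces before plotting it, here is a small sketch on made-up events (toy_events is illustrative only). Resampling with "D" groups rows into calendar-day bins, and count() reports how many events fall into each bin, including days with none.

```python
import pandas as pd

toy_events = pd.DataFrame(
    {"source_ip": ["192.0.2.1", "192.0.2.2", "192.0.2.1", "198.51.100.7"]},
    index=pd.to_datetime([
        "2015-03-03 16:40", "2015-03-03 23:10", "2015-03-05 01:00", "2015-03-05 09:30",
    ]),
)

daily_counts = toy_events["source_ip"].resample("D").count()  # one bin per calendar day
print(daily_counts)
```

Other frequencies such as "H" (hourly) or "W" (weekly) work the same way and can reveal different patterns in scan activity.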

## Section 3 Questions

1. What other datasets can be used to enrich honeypot data?

# Section 4: Deploying Honeypots

This section explores how to deploy a honeypot on your local machine with Docker.

**Readings/Resources** 
- http://www.isg.rhul.ac.uk/~pnai166/thesis.pdf
- Awesome list of honeypots: https://github.com/paralax/awesome-honeypots
- Docker: https://2018.djangocon.us/talk/an-intro-to-docker-for-djangonauts/

**Requirements**  
- [Docker](https://www.docker.com) must be installed.
- https://github.com/cowrie/cowrie

## Cowrie
Run Cowrie with Docker  
$ docker run -p 2222:2222 cowrie/cowrie

**[Question]** What is Cowrie? What are the use cases of it? What kind of information can it collect?

**[Task]** Take a picture of Cowrie running

Open another terminal and run:   
$ ssh -p 2222 root@localhost  
**[Task]** Take a picture of SSHing into Cowrie.  Hint: don't worry about the password :) 

**[Task]** Take a picture of running 'cat' on the /etc/passwd file

**[Task/Question]** What is honeypot fingerprinting?  How can you fingerprint Cowrie?  Add a snapshot for full credit.

## Snare
$ git clone https://github.com/mushorg/snare.git

## Section 4 Questions
**[Question/Task]**
1. Speculate on how researchers can improve honeypots (~2 paragraphs)
2. What is honeypot camouflage?  In what ways can camouflage elements enhance honeypots?