# Analysing network traffic with Pandas

Dirk Loss, http://dirk-loss.de, @dloss. 
v1.1, 2013-06-02

This IPython notebook shows how to analyse network traffic using the following tools:
    
* **Pandas**, a Python library for analysing data <http://pandas.pydata.org/>
* **tshark**, the command line version of the Wireshark sniffer <http://www.wireshark.org/>
* **Matplotlib**, a Python plotting library <http://matplotlib.org/>

Pandas allows for very flexible analysis, treating your PCAP files as a timeseries of packet data. 

So if the statistics provided by Wireshark are not enough, you might want to try this. And it's more fun, of course. :) 

## Get a PCAP file

First we need a PCAP file. I chose a sample file from the Digital Corpora site that has been used for courses in network forensics:

In [ ]:
from IPython.display import HTML
HTML('<iframe src=http://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario width=600 height=300></iframe>')

In [ ]:
!mkdir -p pcap

In [ ]:
cd pcap

We can download it using curl or pure Python. Just uncomment one of the following cells:

In [ ]:
url="http://digitalcorpora.org/corp/nps/packets/2008-nitroba/nitroba.pcap"

In [ ]:
# If you have curl installed, we can get nice progress bars:
#!curl -o nitroba.pcap $url

In [ ]:
# Or use pure Python:
# import urllib
# urllib.urlretrieve(url, "nitroba.pcap")

In [ ]:
ls -l nitroba.pcap

In [ ]:
!md5sum nitroba.pcap
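
If `md5sum` is not available (for example on Windows), the checksum can also be computed in Python. This is only a minimal sketch; the helper name `md5_of_file` and the chunk size are made up for this example.

```python
import hashlib
import os

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Only hash the capture if it has actually been downloaded
if os.path.exists("nitroba.pcap"):
    print(md5_of_file("nitroba.pcap"))
```

Reading in chunks keeps memory usage flat, which matters for larger captures.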

## Convert PCAP to a CSV using tshark

We can use the `tshark` command from the Wireshark tool suite to read the PCAP file and convert it into a tab-separated file. This might not be very fast, but it is very flexible, because all of Wireshark's display filters can be used to select the packets that we are interested in.

In [ ]:
!tshark -v

For now, I just select the frame number and the frame length and redirect the output to a file:
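
The cell that wrote `frame.len` is not shown here. A call along the following lines would produce such a file. This is a sketch, assuming the same tshark options that `read_pcap` uses further down (`-n`, `-T fields`, `-Eheader=y`); the helper name `tshark_fields_cmd` is made up for this example.

```python
import os
import subprocess

def tshark_fields_cmd(pcap, fields):
    """Build a tshark argument list that prints the given fields
    tab-separated, with a header row."""
    cmd = ["tshark", "-n", "-r", pcap, "-T", "fields", "-Eheader=y"]
    for field in fields:
        cmd += ["-e", field]
    return cmd

cmd = tshark_fields_cmd("nitroba.pcap", ["frame.number", "frame.len"])
print(" ".join(cmd))

# Only run tshark if the capture is present; output goes to frame.len
if os.path.exists("nitroba.pcap"):
    with open("frame.len", "w") as out:
        subprocess.check_call(cmd, stdout=out)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.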

In [ ]:
!head -10 frame.len

Two columns, tab-separated. (Not exactly CSV, but who cares. ;-)

Pandas can read those tables into a DataFrame object:

In [ ]:
import pandas as pd

In [ ]:
df=pd.read_table("frame.len")

The object has a nice default representation that shows the number of values in each column:

In [ ]:
df

Some statistics about the frame length:

In [ ]:
df["frame.len"].describe()

The minimum and maximum frame lengths are plausible for an Ethernet connection.
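
The plausibility check can also be done explicitly. The sketch below uses made-up sample data in the same layout as `frame.len` and assumes the usual Ethernet bounds when the frame check sequence is not captured: 60 bytes minimum and 1514 bytes maximum (14 bytes header plus 46 to 1500 bytes payload). Other link types or jumbo frames would need different limits.

```python
import io
import pandas as pd

# Made-up sample in the same layout as frame.len (header row, tab-separated)
sample = "frame.number\tframe.len\n1\t60\n2\t1514\n3\t342\n4\t54\n"
df = pd.read_table(io.StringIO(sample))

# Assumed bounds: Ethernet frames without the 4-byte FCS
ETH_MIN, ETH_MAX = 60, 1514

stats = df["frame.len"].describe()
outside = df[(df["frame.len"] < ETH_MIN) | (df["frame.len"] > ETH_MAX)]
print(stats[["min", "max"]])
print(outside)
```

Frames shorter than 60 bytes are not necessarily an error: outgoing frames are often captured before the network card pads them.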

## Plotting

For a better overview, we plot the frame length over time.

We initialise IPython to show inline graphics:

In [ ]:
%pylab inline

Set a figure size in inches:

In [ ]:
figsize(10,6)

Pandas automatically uses Matplotlib for plotting. We plot with small dots and an alpha channel of 0.2:

In [ ]:
df["frame.len"].plot(style=".", alpha=0.2)
title("Frame length")
ylabel("bytes")
xlabel("frame number")

So there are always lots of small packets (< 100 bytes) and lots of large packets (> 1400 bytes). Some bursts of packets with other sizes (around 400 bytes, 1000 bytes, etc.) can be clearly seen.
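
To put numbers on that impression, the frame lengths can be sorted into size classes. A minimal sketch with made-up lengths standing in for `df["frame.len"]`; the class boundaries follow the thresholds mentioned above, the labels are my own.

```python
import pandas as pd

# Made-up frame lengths standing in for df["frame.len"]
lengths = pd.Series([60, 66, 54, 1514, 1514, 1460, 400, 1000, 74, 1514])

# Left-closed bins: [0, 100), [100, 1401), [1401, 1600)
bins = [0, 100, 1401, 1600]
labels = ["< 100", "100-1400", "> 1400"]
size_class = pd.cut(lengths, bins=bins, labels=labels, right=False)

# Fraction of frames in each class, in bin order
share = size_class.value_counts(normalize=True).sort_index()
print(share)
```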

### A Python function to read PCAP files into Pandas DataFrames

Passing all those arguments to tshark is quite cumbersome. Here is a convenience function that reads the given fields into a Pandas DataFrame:

In [ ]:
import subprocess
import datetime
import pandas as pd

def read_pcap(filename, fields=[], display_filter="", 
              timeseries=False, strict=False):
    """ Read PCAP file into Pandas DataFrame object. 
    Uses tshark command-line tool from Wireshark.

    filename:       Name or full path of the PCAP file to read
    fields:         List of fields to include as columns
    display_filter: Additional filter to restrict frames
    strict:         Only include frames that contain all given fields 
                    (Default: false)
    timeseries:     Create DatetimeIndex from frame.time_epoch 
                    (Default: false)

    Syntax for fields and display_filter is specified in
    Wireshark's Display Filter Reference:
 
      http://www.wireshark.org/docs/dfref/
    """
    if timeseries:
        fields = ["frame.time_epoch"] + fields
    fieldspec = " ".join("-e %s" % f for f in fields)

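    # Copy the list so that the caller's `fields` list is not modified
    display_filters = list(fields) if strict else []
    if display_filter:
        display_filters.append(display_filter)
    # Only pass -R if there is something to filter on
    filterspec = ""
    if display_filters:
        filterspec = "-R '%s'" % " and ".join(display_filters)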

    options = "-r %s -n -T fields -Eheader=y" % filename
    cmd = "tshark %s %s %s" % (options, filterspec, fieldspec)
    proc = subprocess.Popen(cmd, shell = True, 
                                 stdout=subprocess.PIPE)
    if timeseries:
        df = pd.read_table(proc.stdout, 
                        index_col = "frame.time_epoch", 
                        parse_dates=True, 
                        date_parser=datetime.datetime.fromtimestamp)
    else:
        df = pd.read_table(proc.stdout)
    return df

We will use this function in the rest of the analysis.
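
The timestamp handling in `read_pcap` relies on the `date_parser` argument, which newer pandas releases discourage. As a hedged alternative, the epoch column can be converted after reading. The sketch below uses made-up tshark-style output. Note one difference: `datetime.fromtimestamp` yields local time, while `pd.to_datetime(..., unit="s")` yields UTC-based timestamps.

```python
import io
import pandas as pd

# Made-up tshark-style output: header row, tab-separated, epoch seconds
sample = "frame.time_epoch\tframe.len\n0.000000\t60\n86400.500000\t1514\n"
df = pd.read_table(io.StringIO(sample))

# Convert seconds since the epoch to timestamps, then use them as index
df["frame.time_epoch"] = pd.to_datetime(df["frame.time_epoch"], unit="s")
df = df.set_index("frame.time_epoch")
print(df)
```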

## Bandwidth

By summing up the frame lengths we can calculate the complete (Ethernet) bandwidth used.
First use our convenience function to read the PCAP into a DataFrame:

In [ ]:
framelen=read_pcap("nitroba.pcap", ["frame.len"], timeseries=True)
framelen

Then we re-sample the timeseries into buckets of 1 second, summing over the lengths of all frames that were captured in that second:

In [ ]:
bytes_per_second=framelen.resample("S", how="sum")

Here are the first 5 rows. We get NaN for those timestamps where no frames were captured:

In [ ]:
bytes_per_second.head()

In [ ]:
bytes_per_second.plot()
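
The `how=` keyword reflects the pandas API of 2013. In newer pandas versions the aggregation is chained onto `resample` instead. A small sketch with a made-up capture of three frames, assuming a reasonably recent pandas; `min_count=1` keeps seconds without frames as NaN, as described above, instead of 0.

```python
import pandas as pd

# Made-up capture: three frames, none of them between second 1 and 2
index = pd.to_datetime([0.1, 0.6, 2.3], unit="s")
framelen = pd.Series([60, 100, 1514], index=index, name="frame.len")

# Sum the frame lengths per second; empty seconds stay NaN
bytes_per_second = framelen.resample("s").sum(min_count=1)

# Bandwidth is usually quoted in bits per second
bits_per_second = bytes_per_second * 8
print(bytes_per_second)
```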

## TCP Time-Sequence Graph

Let's try to replicate the TCP Time-Sequence Graph that is known from Wireshark (Statistics > TCP Stream Analysis > Time-Sequence Graph (Stevens)).

In [ ]:
fields=["tcp.stream", "ip.src", "ip.dst", "tcp.seq", "tcp.ack", "tcp.window_size", "tcp.len"]
ts=read_pcap("nitroba.pcap", fields, timeseries=True, strict=True)
ts

Now we have to select a TCP stream to analyse. As an example, we just pick stream number 10:

In [ ]:
stream=ts[ts["tcp.stream"] == 10]

In [ ]:
stream

Pandas only prints the overview because the table is too wide. So we force a display:

In [ ]:
print(stream.to_string())

Add a column that shows who sent the packet (client or server). 

The fancy lambda expression is a function that distinguishes between the client and the server side of the stream by comparing the source IP address with the source IP address of the first packet in the stream (for TCP streams that should have been sent by the client).

In [ ]:
stream["type"] = stream.apply(lambda x: "client" if x["ip.src"] == stream.iloc[0]["ip.src"] else "server", axis=1)
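
The same labelling can be done without a row-wise lambda by comparing the whole column at once. A sketch with a made-up three-packet stream (the IP addresses are invented); it assumes, as above, that the first packet was sent by the client.

```python
import numpy as np
import pandas as pd

# Made-up stream: the first packet comes from the client
stream = pd.DataFrame({
    "ip.src": ["192.168.1.64", "74.125.19.83", "192.168.1.64"],
    "tcp.seq": [0, 0, 1],
})

# Source address of the first packet identifies the client
client_ip = stream["ip.src"].iloc[0]

# Vectorised comparison instead of apply() with a lambda
stream["type"] = np.where(stream["ip.src"] == client_ip, "client", "server")
print(stream)
```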

In [ ]:
print(stream.to_string())

In [ ]:
client_stream=stream[stream.type == "client"]

In [ ]:
client_stream["tcp.seq"].plot(style="r-o")

Notice that the x-axis shows the real timestamps.

For comparison, change the x-axis to be the packet number in the stream:

In [ ]:
client_stream.index = arange(len(client_stream))
client_stream["tcp.seq"].plot(style="r-o")

Looks different of course.

## Bytes per stream

In [ ]:
per_stream=ts.groupby("tcp.stream")
per_stream.head()

In [ ]:
bytes_per_stream = per_stream["tcp.len"].sum()
bytes_per_stream.head()

In [ ]:
bytes_per_stream.plot()

In [ ]:
bytes_per_stream.max()

In [ ]:
biggest_stream=bytes_per_stream.idxmax()
biggest_stream

In [ ]:
bytes_per_stream.loc[biggest_stream]
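
The cells in this section group the packets by `tcp.stream`, sum up `tcp.len` per group and look up the stream with the largest total. The same steps on a tiny made-up table, as a sketch:

```python
import pandas as pd

# Made-up packets: stream number and TCP payload length
ts = pd.DataFrame({
    "tcp.stream": [0, 0, 1, 1, 1, 2],
    "tcp.len": [0, 512, 1460, 1460, 300, 0],
})

# Total payload bytes per stream
bytes_per_stream = ts.groupby("tcp.stream")["tcp.len"].sum()

# idxmax() returns the index label (the stream number), not the value
biggest_stream = bytes_per_stream.idxmax()
print(biggest_stream, bytes_per_stream.loc[biggest_stream])
```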

## Ethernet Padding

Let's have a look at the padding of the Ethernet frames. Some cards have been leaking data in the past. For more details, see
http://www.securiteam.com/securitynews/5BP01208UO.html

In [ ]:
trailer_df = read_pcap("nitroba.pcap", ["eth.src", "eth.trailer"], timeseries=True)
trailer_df

In [ ]:
trailer=trailer_df["eth.trailer"]
trailer

Ok. Most frames do not seem to have padding, but some do. Let's count per value to get an overview:

In [ ]:
trailer.value_counts()

Mostly zeros, but some data. Let's decode the hex strings: 

In [ ]:
import binascii

def unhex(s, sep=":"):
    return binascii.unhexlify("".join(s.split(sep)))

In [ ]:
s=unhex("3b:02:a7:19:aa:aa:03:00:80:c2:00:07:00:00:00:02:3b:02")
s

In [ ]:
padding = trailer_df.dropna()

In [ ]:
padding["unhex"]=padding["eth.trailer"].map(unhex)

In [ ]:
def printable(s):
    chars = []
    for c in s:
        if c.isalnum():
            chars.append(c)
        else:
            chars.append(".")
    return "".join(chars)
           

In [ ]:
printable("\x95asd\x33")

In [ ]:
padding["printable"]=padding["unhex"].map(printable)

In [ ]:
padding["printable"].value_counts()

In [ ]:
def ratio_printable(s):
    printable = sum(1.0 for c in s if c.isalnum())
    return printable / len(s)         

In [ ]:
ratio_printable("a\x93sdfs")
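
The helpers above rely on Python 2 byte strings. Under Python 3, `binascii.unhexlify` returns `bytes`, and iterating over `bytes` yields integers, so `c.isalnum()` no longer applies. A hedged sketch of equivalent helpers for Python 3; the `_bytes` names are made up to keep them apart from the functions above.

```python
import binascii
import string

# Byte values of ASCII letters and digits
ALNUM = set((string.ascii_letters + string.digits).encode("ascii"))

def unhex_bytes(s, sep=":"):
    """Decode a separator-delimited hex string into bytes."""
    return binascii.unhexlify(s.replace(sep, ""))

def printable_bytes(data):
    """Replace every non-alphanumeric byte with a dot."""
    return "".join(chr(b) if b in ALNUM else "." for b in data)

def ratio_printable_bytes(data):
    """Fraction of alphanumeric bytes in data."""
    return sum(1.0 for b in data if b in ALNUM) / len(data)

print(printable_bytes(unhex_bytes("61:62:00:63")))
```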

In [ ]:
padding["ratio_printable"] = padding["unhex"].map(ratio_printable)

In [ ]:
padding[padding["ratio_printable"] > 0.5]

In [ ]:
_.printable.value_counts()

Now find out which Ethernet cards sent those packets with more than 50% alphanumeric characters in their padding:

In [ ]:
padding[padding["ratio_printable"] > 0.5]['eth.src'].drop_duplicates()

In [ ]:
HTML('<iframe src=http://www.coffer.com/mac_find/?string=00%3A1d%3Ad9%3A2e%3A4f%3A61 width=600 height=300></iframe>')

That's "Hon Hai Precision" (and "Netopia Inc" for the other MAC address).