# Analysis of packet labels for sample flows

### Loading the data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/labels.csv')

In [3]:
df.head()

Unnamed: 0,FlowFileName,Label,NumPackets
0,tcp_syn/0.0.0.0_37148_216.58.223.106_443_15518...,Google,1
1,tcp_syn/0.0.0.0_37150_216.58.223.106_443_15518...,Google,1
2,tcp_syn/0.0.0.0_49666_10.2.3.1_80_1557510056.pcap,HTTP,1
3,tcp_syn/0.0.0.0_55856_157.240.22.54_443_155694...,WhatsApp,1
4,tcp_syn/10.2.0.10_32798_188.65.76.133_443_1554...,TLS,1


### Summary statistics

Note:
- 268 560 flows
- Average of 8.19 packets per flow (median of 2)
- Maximum of 67 472 packets per flow

In [4]:
df.describe()

Unnamed: 0,NumPackets
count,268560.0
mean,8.185545
std,208.140353
min,1.0
25%,2.0
50%,2.0
75%,4.0
max,67472.0
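
The three figures in the note above can also be computed directly from the `NumPackets` column rather than copied from the `describe()` table. The following is a minimal sketch: `flow_summary` is a hypothetical helper name, and `sample` is a stand-in frame built from the five rows shown by `df.head()` so that the snippet runs without the CSV.

```python
import pandas as pd

# Stand-in for the real frame: Label and NumPackets of the five rows
# shown by df.head() above.
sample = pd.DataFrame({
    "Label": ["Google", "Google", "HTTP", "WhatsApp", "TLS"],
    "NumPackets": [1, 1, 1, 1, 1],
})


def flow_summary(frame):
    """Return the number of flows and the mean and max packets per flow."""
    packets = frame["NumPackets"]
    return {
        "flows": len(frame),                              # one row per flow
        "mean_packets": round(float(packets.mean()), 2),
        "max_packets": int(packets.max()),
    }


summary = flow_summary(sample)
print(summary)
```

Assuming `NumPackets` has no missing values, `flow_summary(df)` on the full frame should agree with the `count`, `mean` and `max` rows of the table above.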


### Flow and packet counts analysis

Top 15 labels ordered by number of flows per label
e.g. DNS has 65 979 flows, and WhatsApp has 2 892 flows

In [5]:
df["Label"].value_counts()[:15]

DNS               65979
Google            50928
HTTP              27429
TLS               22707
TeamViewer        19568
Microsoft         15029
Amazon            12128
GoogleServices     9345
LLMNR              7078
Facebook           6868
YouTube            3333
WhatsApp           2892
Skype              2887
PlayStore          2512
WindowsUpdate      2409
Name: Label, dtype: int64
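
Raw counts are easier to compare as shares of all flows. On the full frame, `df["Label"].value_counts(normalize=True)` returns these fractions directly. The sketch below does the same arithmetic by hand, using the first five counts printed above and the total of 268 560 flows from `describe()`, so it runs without the CSV. The names `flow_counts`, `total_flows`, `flow_share` and `top5_share` are illustrative only.

```python
import pandas as pd

# Counts copied from the value_counts() output above (first five labels only).
flow_counts = pd.Series(
    {"DNS": 65979, "Google": 50928, "HTTP": 27429, "TLS": 22707, "TeamViewer": 19568}
)
total_flows = 268560  # "count" row of df.describe()

# Fraction of all flows carried by each label.
flow_share = flow_counts / total_flows

# Combined fraction of flows covered by these five labels.
top5_share = float(flow_share.sum())

print((flow_share * 100).round(1))
print(round(top5_share * 100, 1))
```

With these numbers DNS alone accounts for roughly a quarter of all flows, and the five largest labels together for a little over two thirds.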

Top 15 labels ordered by number of packets per label
e.g. YouTube has 569 413 packets, and WhatsApp has 35 786 packets

In [6]:
pkt_counts = df.groupby("Label")["NumPackets"].sum().sort_values(ascending=False)
pkt_counts[:15]

Label
YouTube           569413
HTTP              329519
Google            227416
DNS               180344
TLS               146195
Unknown           130525
Instagram         103521
Microsoft          77268
Facebook           71406
MDNS               68881
Amazon             51115
TeamViewer         50469
WhatsApp           35786
GoogleServices     31569
LLMNR              14050
Name: NumPackets, dtype: int64
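
The two rankings differ because labels differ in how many packets a typical flow carries: YouTube is only 11th by flows but first by packets. On the full frame, `df.groupby("Label")["NumPackets"].mean()` gives the mean packets per flow for every label. The sketch below reproduces that ratio for a few labels that appear in both outputs above, using the printed values only. The variable names are illustrative.

```python
import pandas as pd

# Values copied from the two outputs above, for labels that appear in both.
flows = pd.Series({"YouTube": 3333, "HTTP": 27429, "Google": 50928,
                   "DNS": 65979, "WhatsApp": 2892})
packets = pd.Series({"YouTube": 569413, "HTTP": 329519, "Google": 227416,
                     "DNS": 180344, "WhatsApp": 35786})

# Index-aligned division gives the mean number of packets per flow per label.
packets_per_flow = (packets / flows).sort_values(ascending=False)

print(packets_per_flow.round(1))
```

For these five labels, YouTube flows are by far the longest on average, while DNS flows are the shortest at under three packets each.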