# Understanding Network Data

**Objective:** In this discussion section, we will examine the protocols observed at each layer of a packet and learn how to process and work with this data programmatically.


## 5-Layer Network Model
![](images/network_layers.png?raw=true)

## The Hourglass ("Thin-Waist") Model in IP Networking

The **"thin-waist"** or **"hourglass" model** is a powerful analogy used to conceptualize the architecture of the Internet Protocol (IP) networking stack. This model highlights how a broad range of applications and technologies converge on a narrow, unified middle layer (the "waist" of the hourglass) to ensure interoperability and scalability.

![](images/hourglass_model.png?raw=true)

### Structure of the Hourglass Model
- **Applications (Top-most Layer):**
    - Represents the Application Layer where there's a vast variety of applications such as web browsers, email clients, messaging apps, and more.
    - Each application might use the Internet differently, but all depend on the layers below.
- **Transport (Upper Layer):**
    - The Transport Layer sits just above the "thin waist" of the Internet Layer. It enables end-to-end communication and carries responsibilities like:
        - Segmentation and Reassembly: Breaking down large messages and reassembling them at the destination.
        - Error Control: Ensuring reliable data transmission (e.g., TCP retransmits lost packets).
        - Flow Control: Managing data transmission rates suitable for the receiver.
        - Connection Management: Handling the setup, maintenance, and termination of connections in protocols like TCP.
        - Port Numbers: Allowing multiple services on a device to simultaneously use the network.
    - The Transport Layer ensures that application data reaches the right process on the destination host in the expected form.
- **IP (Middle Layer):**
    - This is the crux of the hourglass: the Internet Layer, specifically IP (Internet Protocol).
    - It acts as a universal glue that binds all parts of the Internet together, irrespective of the application above or the technology below.
    - The benefits of this design include:
        - Interoperability: Ensuring communication across devices and networks from different vendors and technologies.
        - Flexibility: Facilitating the introduction of new technologies without disrupting the IP layer.
        - Innovation: Allowing the creation of new applications or network tech without redesigning the Internet's core.
- **Link (Bottom Layer):**
    - Represents technologies that transmit IP packets: Ethernet, Wi-Fi, cellular networks, satellite links, fiber optics, and more.
    - These technologies may vary in operation but all serve to transport IP packets from one device to another.

Together, the hourglass model and the Transport Layer's role illustrate the modular design of the Internet protocol stack.

In [None]:
# Imports
from scapy.all import *        # Packet manipulation
import pandas as pd            # Pandas - Create and Manipulate DataFrames
import numpy as np             # Math Stuff (don't worry only used for one line :] )
import binascii                # Binary to Ascii 
import seaborn as sns          # Graphing Library
sns.set(color_codes=True)
%matplotlib inline

## Reading in a PCAP
We can read in a packet capture programmatically using the `scapy` library. The `rdpcap` function returns an object of type `scapy.plist.PacketList`, which represents a list of packet objects that we can iterate over. Printing it gives a summary of the transport protocols used in this pcap.

In [None]:
pcap = rdpcap('pcaps/video_call.pcap')

In [None]:
print(pcap)

In [None]:
sample_pkt = pcap[127]

In [None]:
print(sample_pkt)

In [None]:
sample_pkt.show()

## Ethernet Layer

### MAC Address
A **MAC (Media Access Control)** address is a unique identifier assigned to network interfaces for communications on the physical network segment. It's essentially the "physical address" of a device on a local network, and it plays a pivotal role in the data link layer of the OSI model, where it is used to deliver frames within a local area network (LAN).

#### Key Points:
**Uniqueness:** Every network interface card (NIC) manufactured is typically given a unique MAC address, ensuring that every device on a local network can be uniquely identified by its MAC address.

**Physical Addressing:** While IP addresses operate at the network layer to route packets over wide networks, MAC addresses function at the data link layer to deliver packets within local networks.

**Immutable:** In theory, a MAC address is permanent, burned into the hardware. However, many modern NICs and software interfaces allow for MAC address modification, known as MAC spoofing.

**Format:**
A MAC address is 48 bits long and is usually represented as six groups of two hexadecimal digits, separated by colons (:) or hyphens (-).
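The format described above can be sketched in a few lines of plain Python. The helper names `mac_str_to_bytes` and `mac_bytes_to_str` are hypothetical and are not part of `scapy`; the sample address is the TP-Link address that appears later in this notebook.

```python
def mac_str_to_bytes(mac: str) -> bytes:
    """Parse 'aa:bb:cc:dd:ee:ff' or 'aa-bb-cc-dd-ee-ff' into 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address is exactly 6 bytes (48 bits)")
    return raw


def mac_bytes_to_str(raw: bytes, sep: str = ":") -> str:
    """Format 6 raw bytes as six groups of two hexadecimal digits."""
    if len(raw) != 6:
        raise ValueError("a MAC address is exactly 6 bytes (48 bits)")
    return sep.join(f"{b:02x}" for b in raw)


raw = mac_str_to_bytes("e4:c3:2a:92:b6:57")
print(len(raw) * 8)                # 48
print(mac_bytes_to_str(raw, "-"))  # e4-c3-2a-92-b6-57

# The first three bytes are the vendor prefix (OUI) used by lookup services
oui = ":".join(f"{b:02x}" for b in raw[:3])
print(oui)                         # e4:c3:2a
```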

```
 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                     Destination MAC Address                    |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                       Source MAC Address                       |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           EtherType          |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
|                           Payload                             |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Frame Check Sequence (FCS)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

```
**Destination MAC Address (6 bytes):** This is the MAC address of the destination network interface card (NIC). The data will be read by the device that matches this address.

**Source MAC Address (6 bytes):** The MAC address of the NIC that sent the frame.  

**EtherType (2 bytes):** This field indicates which protocol is encapsulated in the payload of the frame. For instance, an EtherType value of 0x0800 indicates that the payload is an IPv4 packet.  

**Payload (46-1500 bytes):** This contains the encapsulated protocol data, like an IP packet or ARP message. The minimum length is 46 bytes, and the maximum is 1500 bytes. If a smaller message needs to be sent, padding is used to reach the 46-byte minimum.  

**Frame Check Sequence (FCS) (4 bytes):** This is a checksum value used to detect errors in the transmitted frame. It's generated by the source NIC and verified by the destination NIC.  

This diagram is a simplified view and is not drawn to scale: the bit ruler at the top does not match the actual field widths, so rely on the byte counts listed above. Ethernet standards and variations (like VLAN-tagged frames) can introduce additional fields, but the Ethernet II frame format is foundational and widely used.
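As a rough illustration of the field layout, the 14-byte Ethernet II header can be unpacked with Python's `struct` module. This is a minimal sketch, not how `scapy` does it internally; `parse_ethernet_header` is a hypothetical helper and the frame bytes are hand-built for the example. Captured frames often do not include the FCS, so the sketch does not look for it.

```python
import struct


def parse_ethernet_header(frame: bytes):
    """Split an Ethernet II frame into (dst, src, ethertype, payload)."""
    if len(frame) < 14:
        raise ValueError("an Ethernet II header is 14 bytes long")
    # "!" = network byte order; 6s = 6 raw bytes; H = unsigned 16-bit integer
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])

    def fmt(mac: bytes) -> str:
        return ":".join(f"{b:02x}" for b in mac)

    return fmt(dst), fmt(src), ethertype, frame[14:]


# Hand-built frame: destination MAC, source MAC, EtherType 0x0800 (IPv4),
# followed by a 46-byte placeholder payload (not a valid IP packet).
frame = bytes.fromhex("f84d897b4283" "e4c32a92b657" "0800") + bytes(46)

dst, src, ethertype, payload = parse_ethernet_header(frame)
print(src + " ==> " + dst)          # e4:c3:2a:92:b6:57 ==> f8:4d:89:7b:42:83
print(hex(ethertype), len(payload)) # 0x800 46
```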

In [None]:
# Ethernet Layer is the first layer we read
ethernet_frame = sample_pkt

In [None]:
type(ethernet_frame)

In [None]:
# Timestamp saved for packet on capture device
# Epoch time
print(ethernet_frame.time)
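Epoch time counts seconds since 1970-01-01 00:00:00 UTC, which is hard to read at a glance. A small sketch of converting it to a calendar date follows; the value below is made up for illustration, and with a real packet one would pass `float(ethernet_frame.time)` instead (assuming the timestamp converts cleanly to a float).

```python
from datetime import datetime, timezone

# Made-up epoch timestamp, standing in for float(ethernet_frame.time)
epoch_seconds = 1700000000

# Interpret the value as UTC so the result does not depend on the local timezone
captured_at = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(captured_at.isoformat())  # 2023-11-14T22:13:20+00:00
```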

In [None]:
print(ethernet_frame.src + " ==> " + ethernet_frame.dst) 

Vendor Lookup available at: https://maclookup.app/search

`e4:c3:2a:92:b6:57` TP-Link Technologies  
`f8:4d:89:7b:42:83` Apple  

## IP Layer

```
 0                   1                   2                   3   
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```
**Version (4 bits):** Specifies the version of the IP protocol. For IPv4, this is always set to 4.

**IHL (Internet Header Length, 4 bits):** Indicates the header length in 32-bit words. The minimum value for this field is 5, which represents a header length of 20 bytes.

**Type of Service (8 bits):** Specifies how the datagram should be handled, such as its priority.

**Total Length (16 bits):** Indicates the entire packet size, including header and data, in bytes.

**Identification (16 bits):** Helps in assembling the fragments of a datagram.

**Flags (3 bits):** Control or identify fragments. There are three flags: Reserved (always zero), Don't Fragment (DF), and More Fragments (MF).

**Fragment Offset (13 bits):** Indicates where in the datagram this fragment belongs.

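**Time to Live (TTL, 8 bits):** Limits the datagram's lifetime. The field was originally specified in seconds, but in practice each router decrements it by one, so it acts as a hop count; the datagram is discarded when the value reaches zero.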

**Protocol (8 bits):** Defines the protocol used in the data portion of the IP datagram.

**Header Checksum (16 bits):** Used for error-checking the header.

**Source Address (32 bits):** The IPv4 address of the source node.

**Destination Address (32 bits):** The IPv4 address of the destination node.

**Options (variable):** Allows for various options, such as security. Its presence is determined by the IHL field.

**Padding (variable):** Ensures the header is a multiple of 32 bits in length. The length of the padding is determined by the IHL field.
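To make the field list concrete, here is a minimal sketch that unpacks the fixed 20-byte part of an IPv4 header with `struct` and verifies the header checksum. The helper names are hypothetical, the header bytes are a hand-written sample rather than a packet from the capture, and options are ignored.

```python
import ipaddress
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, as used by the IPv4 header checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte portion of an IPv4 header."""
    (version_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = version_ihl & 0x0F                # header length in 32-bit words
    return {
        "version": version_ihl >> 4,
        "header_bytes": ihl * 4,
        "total_length": total_length,
        "df": bool(flags_frag & 0x4000),    # Don't Fragment
        "mf": bool(flags_frag & 0x2000),    # More Fragments
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": proto,                  # 6 = TCP, 17 = UDP
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
        # Summing a header that includes a correct checksum yields zero
        "checksum_ok": internet_checksum(packet[:ihl * 4]) == 0,
    }


sample_header = bytes.fromhex("4500007300004000" "4011b861c0a80001" "c0a800c7")
info = parse_ipv4_header(sample_header)
print(info["src"] + " ==> " + info["dst"])  # 192.168.0.1 ==> 192.168.0.199
print(info["header_bytes"], info["total_length"], info["ttl"], info["protocol"])  # 20 115 64 17
print(info["checksum_ok"])                  # True
```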

In [None]:
# IP Layer is the next layer we read
ethernet_frame = sample_pkt

In [None]:
ip_pkt = ethernet_frame.payload

In [None]:
type(ip_pkt)

In [None]:
# Total length of the IP packet in bytes (header + data)
print(ip_pkt.len)

In [None]:
# Incoming packet
print(ip_pkt.src + " ==> " + ip_pkt.dst) 

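### Local Subnets

The following IPv4 ranges are reserved for private networks (RFC 1918). An address in one of them identifies a device on a local network rather than a publicly routable host:
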
1. **10.0.0.0/8**       (10.0.0.0 - 10.255.255.255)   
2. **172.16.0.0/12**    (172.16.0.0 - 172.31.255.255)   
3. **192.168.0.0/16**   (192.168.0.0 - 192.168.255.255)
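One way to check whether an address falls in one of these ranges is Python's standard `ipaddress` module. The helper `is_local` is a hypothetical name, and the sample addresses are chosen only to exercise the range boundaries.

```python
import ipaddress

# The three ranges listed above
LOCAL_SUBNETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_local(address: str) -> bool:
    """Return True if the IPv4 address belongs to one of the local subnets."""
    ip = ipaddress.ip_address(address)
    return any(ip in subnet for subnet in LOCAL_SUBNETS)


for address in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "192.168.1.10", "8.8.8.8"):
    print(address, is_local(address))
# 10.1.2.3 True
# 172.31.255.255 True
# 172.32.0.1 False
# 192.168.1.10 True
# 8.8.8.8 False
```

Applied to `ip_pkt.src` and `ip_pkt.dst`, a check like this can help tell which side of a packet is the local device.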

## Transport Layer

### TCP

```
0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Source Port          |       Destination Port        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Sequence Number                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Acknowledgment Number                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Data |           |U|A|P|R|S|F|                               |
| Offset| Reserved  |R|C|S|S|Y|I|            Window             |
|       |           |G|K|H|T|N|N|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |         Urgent Pointer        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             data                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
02 04 05 a0 01 03 03 05 01 01 08 0a 1d 74 65 c5 00 00 00 00 04 02 00 00
```
**Source Port (16 bits):** Specifies the source port number.

**Destination Port (16 bits):** Specifies the destination port number.

**Sequence Number (32 bits):** If the SYN flag is set, this is the initial sequence number. Otherwise, it's the accumulated sequence number of the first data byte in this segment (with the first byte during the connection setup counting as 1).

**Acknowledgment Number (32 bits):** If the ACK flag is set, this field contains the value of the next sequence number the sender of the segment is expecting to receive.

**Data Offset (4 bits):** Specifies the size of the TCP header in 32-bit words. Also known as the header length field.

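**Reserved (6 bits in the diagram above):** Reserved for future use and should be set to zero. Later RFCs reassigned some of these bits as additional flags (for example, ECE and CWR for Explicit Congestion Notification), so newer references list fewer reserved bits and more flags.

**Flags (6 bits in the diagram above):**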

- URG (1 bit): Urgent pointer field significant.
- ACK (1 bit): Acknowledgment field significant.
- PSH (1 bit): Push Function.
- RST (1 bit): Reset the connection.
- SYN (1 bit): Synchronize sequence numbers.
- FIN (1 bit): No more data from the sender.
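The six flags above occupy the low six bits of one header byte, with FIN as the least significant bit. A minimal sketch of decoding that byte with bit masks follows; `decode_tcp_flags` is a hypothetical helper, and the later ECN-related flags are left out.

```python
# Bit masks for the six classic TCP flags, least significant bit first
FLAG_BITS = [
    ("FIN", 0x01),
    ("SYN", 0x02),
    ("RST", 0x04),
    ("PSH", 0x08),
    ("ACK", 0x10),
    ("URG", 0x20),
]


def decode_tcp_flags(value: int) -> list:
    """Return the names of the flags that are set in the flags byte."""
    return [name for name, bit in FLAG_BITS if value & bit]


print(decode_tcp_flags(0x02))  # ['SYN']          first step of the handshake
print(decode_tcp_flags(0x12))  # ['SYN', 'ACK']   second step of the handshake
print(decode_tcp_flags(0x10))  # ['ACK']
print(decode_tcp_flags(0x18))  # ['PSH', 'ACK']   typical data segment
```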

**Window (16 bits):** Specifies the size of the sender's receive window (or buffer space) that is available.

**Checksum (16 bits):** Used for error-checking of the header and data. It checks whether the data has been transmitted accurately.

**Urgent Pointer (16 bits):** If the URG flag is set, then this 16-bit field is an offset from the sequence number indicating the last urgent data byte.

**Options (variable):** The length of this field is determined by the data offset field. This field could be zero or more 32-bit words.

**Padding (variable):** Ensures the header is a multiple of 32 bits in length. The length of the padding is determined by the data offset field.

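**Data (variable):** The actual data transmitted by the TCP segment. TCP has no length field of its own, so the data length is derived from the IP header: the IP Total Length, minus the IP header length (IHL × 4 bytes), minus the TCP header length (Data Offset × 4 bytes).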

This header format provides an overview of how TCP segments are structured. Remember, TCP is a connection-oriented protocol, ensuring reliable, ordered, and error-checked delivery of a stream of octets between applications running on hosts in an IP network.
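The line of hex bytes printed beneath the TCP header diagram looks like an example of the Options field; that reading is an assumption, since the notebook does not label it. Under that assumption, the sketch below walks the bytes as kind/length/value entries. `parse_tcp_options` is a hypothetical helper that only names a handful of common option kinds.

```python
def parse_tcp_options(raw: bytes) -> list:
    """Decode a TCP Options field into (name, value) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                      # End of Option List: stop parsing
            options.append(("EOL", None))
            break
        if kind == 1:                      # No-Operation: single padding byte
            options.append(("NOP", None))
            i += 1
            continue
        if i + 1 >= len(raw):
            break                          # truncated option, stop
        length = raw[i + 1]                # covers kind + length + data
        if length < 2 or i + length > len(raw):
            break                          # malformed option, stop
        data = raw[i + 2:i + length]
        if kind == 2 and len(data) == 2:
            options.append(("MSS", int.from_bytes(data, "big")))
        elif kind == 3 and len(data) == 1:
            options.append(("WScale", data[0]))
        elif kind == 4:
            options.append(("SAckOK", None))
        elif kind == 8 and len(data) == 8:
            tsval = int.from_bytes(data[:4], "big")
            tsecr = int.from_bytes(data[4:], "big")
            options.append(("Timestamp", (tsval, tsecr)))
        else:
            options.append((f"kind-{kind}", data))
        i += length
    return options


raw_options = bytes.fromhex(
    "02 04 05 a0 01 03 03 05 01 01 08 0a 1d 74 65 c5 00 00 00 00 04 02 00 00"
)
for name, value in parse_tcp_options(raw_options):
    print(name, value)
```

The 24 bytes decode to a maximum segment size of 1440, a window scale of 5, a timestamp pair, and the SACK-permitted option, separated by NOP padding and closed by an End of Option List byte. Options like these are typically exchanged in SYN segments.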

In [None]:
transport_segment = ip_pkt.payload

In [None]:
type(transport_segment)

In [None]:
transport_segment.show()

In [None]:
# ACK flag is set
print(transport_segment.flags)

In [None]:
print(transport_segment.seq)

In [None]:
print(transport_segment.ack)

### UDP

```
 0                   1                   2                   3   
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Source Port (16 bits)     |  Destination Port (16 bits)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Length (16 bits)    |       Checksum (16 bits)      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              data                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

```
**Source Port (16 bits):** Specifies the port number on the source machine. This is often assigned dynamically by the source machine for outgoing packets (ephemeral port). For well-known services, this field often contains the service's assigned port number.   

**Destination Port (16 bits):** Specifies the port number on the destination machine. For well-known services, this field contains the service's assigned port number.   

**Length (16 bits):** Specifies the length in bytes of the entire datagram: header and data. The minimum value for this field is 8, which corresponds to the length of the header.   

**Checksum (16 bits):** Used for error-checking the header and data. The checksum field provides a means of detecting errors introduced during transmission or reception. The algorithm to compute the checksum is defined in the RFC for UDP.   

**Data (variable):** The actual data transmitted by the UDP datagram. Its size can be determined by subtracting the length of the header (8 bytes) from the value in the Length field.   

UDP is a connectionless protocol that doesn't guarantee delivery or ordering and provides only a checksum for error detection, but it has less overhead than TCP because it needs no connection setup, acknowledgments, or retransmissions. It's often used for real-time services like video or voice streaming, where occasional lost packets are acceptable and can be managed at higher layers or by the application itself.
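Because the UDP header is only four 16-bit fields, it is easy to build and parse by hand. The sketch below uses `struct` on a made-up datagram; `parse_udp_header` is a hypothetical helper, and the checksum is left at zero, which in IPv4 means the sender did not compute one.

```python
import struct


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the 8-byte UDP header and slice out the data it describes."""
    if len(datagram) < 8:
        raise ValueError("a UDP header is 8 bytes long")
    sport, dport, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {
        "sport": sport,
        "dport": dport,
        "length": length,            # header + data, in bytes
        "checksum": checksum,
        "data": datagram[8:length],  # data size = Length - 8
    }


# Made-up datagram: ephemeral source port 50000, destination port 53
payload = b"hello"
datagram = struct.pack("!HHHH", 50000, 53, 8 + len(payload), 0) + payload

udp = parse_udp_header(datagram)
print(udp["sport"], udp["dport"], udp["length"])  # 50000 53 13
print(udp["length"] - 8, udp["data"])             # 5 b'hello'
```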

In [None]:
sample_pkt2 = pcap[68]
ip_pkt2 = sample_pkt2.payload
transport_segment = ip_pkt2.payload

In [None]:
type(transport_segment)

In [None]:
print(transport_segment.sport)

In [None]:
print(transport_segment.dport)

# Convert PCAP to Dataframe

In [None]:
def pcap_to_df(pcap):
    # Collect field names from IP/TCP/UDP (these become the DataFrame columns)
    ip_fields = [field.name for field in IP().fields_desc]
    tcp_fields = [field.name for field in TCP().fields_desc]
    udp_fields = [field.name for field in UDP().fields_desc]

    dataframe_fields = (
        ip_fields + ['time']
        + ["tcp_{}".format(i) for i in tcp_fields]
        + ["udp_{}".format(i) for i in udp_fields]
        + ['payload', 'payload_raw', 'payload_hex']
    )

    def layer_values(layer, fields):
        # One value per field name; None if the layer is absent from this packet.
        # 'options' is recorded as the number of options, not the raw list.
        values = []
        for field in fields:
            value = None if layer is None else layer.fields.get(field)
            if field == 'options' and value is not None:
                value = len(value)
            values.append(value)
        return values

    rows = []
    for packet in pcap[IP]:
        ip_layer = packet[IP]
        transport = ip_layer.payload
        # Fill the tcp_* columns only for TCP segments and the udp_* columns
        # only for UDP datagrams, so shared field names (sport, dport, chksum)
        # do not leak from one protocol's columns into the other's.
        tcp_layer = transport if isinstance(transport, TCP) else None
        udp_layer = transport if isinstance(transport, UDP) else None

        row = layer_values(ip_layer, ip_fields)
        row.append(packet.time)
        row += layer_values(tcp_layer, tcp_fields)
        row += layer_values(udp_layer, udp_fields)

        # Payload carried above the transport layer: length, raw bytes, hex
        payload = transport.payload
        row.append(len(payload))
        row.append(payload.original)
        row.append(binascii.hexlify(payload.original))
        rows.append(row)

    # Build the DataFrame once; concatenating inside the loop is quadratic
    return pd.DataFrame(rows, columns=dataframe_fields)

In [None]:
pandas_df = pcap_to_df(pcap)

### Dataframe Basics

In [None]:
# Retrieve first row from DataFrame
print(pandas_df.iloc[0])

In [None]:
print(pandas_df.shape)

In [None]:
# Return first 5 rows
pandas_df.head()

In [None]:
# Return last 5 rows
pandas_df.tail()

In [None]:
# Return the Source Address for all rows
pandas_df['src']

In [None]:
# Return Src Address, Dst Address, Src Port, Dst Port
pandas_df[['src','dst','tcp_sport','tcp_dport']]

# Analyzing Ookla Speedtest

**Objective:** In this section, we will examine two captures representing single-connection and multi-connection speed tests and analyze the difference between the two measurement approaches.


### Single Connection

In [None]:
speed_test_single_filename = "pcaps/speed_test_single_sample10000.pcap"
speed_test_single_pcap = rdpcap(speed_test_single_filename)
speed_test_single_df = pcap_to_df(speed_test_single_pcap)

In [None]:
speed_test_single_df[['src','dst','tcp_sport','tcp_dport']]

### Multi Connection

In [None]:
speed_test_multi_filename = "pcaps/speed_test_multi_sample10000.pcap"
speed_test_multi_pcap = rdpcap(speed_test_multi_filename)
speed_test_multi_df = pcap_to_df(speed_test_multi_pcap)
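One possible starting point for comparing the two captures is to count how many distinct TCP connections each contains. The sketch below assumes the column names produced by `pcap_to_df` (`proto`, `src`, `dst`, `tcp_sport`, `tcp_dport`); `count_tcp_connections` is a hypothetical helper, and the tiny DataFrame is made-up data so the example runs without the pcap files.

```python
import pandas as pd


def count_tcp_connections(df: pd.DataFrame) -> int:
    """Count distinct TCP connections, treating both directions as one."""
    tcp = df[df["proto"] == 6]  # IP protocol number 6 = TCP
    connections = set()
    for src, sport, dst, dport in zip(
        tcp["src"], tcp["tcp_sport"], tcp["dst"], tcp["tcp_dport"]
    ):
        a, b = (src, int(sport)), (dst, int(dport))
        # Sort the two endpoints so A->B and B->A map to the same key
        connections.add((min(a, b), max(a, b)))
    return len(connections)


# Made-up rows: two TCP connections (one seen in both directions) and one UDP packet
toy_df = pd.DataFrame({
    "proto":     [6, 6, 6, 17],
    "src":       ["192.168.1.5", "203.0.113.7", "192.168.1.5", "192.168.1.5"],
    "dst":       ["203.0.113.7", "192.168.1.5", "203.0.113.7", "203.0.113.7"],
    "tcp_sport": [50001, 8080, 50002, 5353],
    "tcp_dport": [8080, 50001, 8080, 5353],
})

print(count_tcp_connections(toy_df))  # 2
```

Calling the same function on `speed_test_single_df` and `speed_test_multi_df` would give one number per capture to compare.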