## Pre-Processing

The raw dataset is very large and contains many unwanted and sensitive entries. This notebook generates a slimmer CSV file for the [Network Fingerprint AI Model Notebook](https://github.com/Charm-q/AI-Capstone/blob/main//Network%20Fingerprint%20AI%20Model.ipynb) by removing unwanted entries and obfuscating sensitive data. Several sections of this notebook were altered after processing to avoid publicly uploading sensitive information.

The dataset covers all local network traffic, but only two users on four devices are targeted for the fingerprinting process. The dataset is filtered for these four devices and their unique MAC addresses are obfuscated:
- User 1 - Phone
- User 1 - Computer
- User 2 - Phone
- User 2 - Computer

### Initialize

Import pandas and `datetime`, then load the dataset from `traffic.csv`. This data was captured using a man-in-the-middle network attack documented in the [writeup](https://github.com/Charm-q/AI-Capstone/blob/main/README.md).

In [73]:
import pandas as pd
from datetime import datetime

df = pd.read_csv('data/traffic.csv')

### Dataset Breakdown

Each entry in the dataset is a network frame described only by its high-level details.

- `No.` is a leftover index from Wireshark.
- `Time` represents the time since the network capture started.
- `Source` represents the FQDN or domain of whoever is sending a network frame.
- `Destination` represents the FQDN or domain of the intended receiver.
- `Protocol` represents the protocol used to send the network frame.
- `Length` represents the size in bytes of the network frame.
- `Receiver address` represents the unique MAC address of the intended receiver.
- `Source address` represents the MAC address of the frame sender.
- `Info` includes a short description of the network frame's objective.


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2048708 entries, 0 to 2048707
Data columns (total 9 columns):
 #   Column            Dtype  
---  ------            -----  
 0   No.               int64  
 1   Time              float64
 2   Source            object 
 3   Destination       object 
 4   Protocol          object 
 5   Length            int64  
 6   Receiver address  object 
 7   Source address    object 
 8   Info              object 
dtypes: float64(1), int64(2), object(6)
memory usage: 140.7+ MB


### Obfuscate MAC Addresses
MAC addresses are unique to each device's network card, so the addresses of the users' devices are obfuscated for their privacy. In the cell below, the real addresses were replaced with the placeholder `{MAC ADDRESS}` after pre-processing.

In [75]:
df = df.replace({'{MAC ADDRESS}': 'User 1 - Phone'}, regex=True)
df = df.replace({'{MAC ADDRESS}': 'User 1 - Computer'}, regex=True)
df = df.replace({'{MAC ADDRESS}': 'User 2 - Phone'}, regex=True)
df = df.replace({'{MAC ADDRESS}': 'User 2 - Computer'}, regex=True)

df.head()

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Receiver address,Source address,Info
0,32421,6.676769,s3-1-w.amazonaws.com,MacBook-Air-4.local,TCP,96,User 1 - Computer,80:da:c2:c0:64:80,"https(443) > 52037 [FIN, ACK] Seq=1 Ack=1 Wi..."
1,32431,6.679702,s3-1-w.amazonaws.com,MacBook-Air-4.local,TCP,96,User 1 - Computer,80:da:c2:c0:64:80,"[TCP Out-Of-Order] https(443) > 52037 [FIN, ..."
2,32664,6.742886,us-ne-courier-4.push-apple.com.akadns.net,MacBook-Air-4.local,TCP,102,User 1 - Computer,80:da:c2:c0:64:80,hpvirtgrp(5223) > 49942 [ACK] Seq=1 Ack=1 Wi...
3,32780,6.753906,2600:9000:2508:6600:c:cfd4:a580:93a1,21a0c8cd-34bb-4b35-a2f4-9300143e832b.local,TCP,122,User 1 - Computer,80:da:c2:c0:64:80,"https(443) > 52025 [FIN, ACK] Seq=1 Ack=1 Wi..."
4,32826,6.769492,d27xxe7juh1us6.cloudfront.net,MacBook-Air-4.local,TCP,102,User 1 - Computer,80:da:c2:c0:64:80,"https(443) > 52022 [FIN, ACK] Seq=1 Ack=1 Wi..."
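
The replacement step above can also be sketched with a single lookup table. This is a minimal illustration, not the notebook's actual code: the MAC addresses in `mac_labels` are made up (the real ones are redacted), and `toy` is a hypothetical two-row frame standing in for the dataset.

```python
import pandas as pd

# Hypothetical MAC addresses for illustration only; the real ones are redacted.
mac_labels = {
    "aa:bb:cc:00:00:01": "User 1 - Phone",
    "aa:bb:cc:00:00:02": "User 1 - Computer",
}

# Toy stand-in for the two address columns of the dataset.
toy = pd.DataFrame({
    "Source address": ["aa:bb:cc:00:00:01", "80:da:c2:c0:64:80"],
    "Receiver address": ["80:da:c2:c0:64:80", "aa:bb:cc:00:00:02"],
})

address_cols = ["Source address", "Receiver address"]

# Exact-match replacement restricted to the address columns;
# addresses that are not in the mapping are left unchanged.
toy[address_cols] = toy[address_cols].replace(mac_labels)

print(toy)
```

Restricting the replacement to the two address columns uses exact matches rather than regular expressions. The trade-off is that a MAC address embedded in another column, such as `Info`, would not be rewritten, so the whole-frame `regex=True` approach may still be preferable when that matters.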


### Filter for Targets, Source and Client Hello

The following filters are applied to the dataset:

- Devices on a local network communicate with each other automatically, and this traffic is not interesting. Any entries with both the `Source address` and `Receiver address` belonging to the `targets` device list are filtered out.

- Only user activity is of interest, so another filter removes any entry whose `Source address` isn't in the `targets` device list. Anything else is a response from the domain the user is accessing and not actual user activity.

- Network frames are relatively small in size. A streaming service such as Netflix or YouTube will produce far more frames than something like Facebook or Reddit, even if the user spent more time on the latter. This skews the data, but the skew can be filtered out using the `Info` column. A `Client Hello` frame descriptor signifies a new connection to that domain, so filtering for it gives a better representation of the user's activity.

In [76]:
targets = ['User 1 - Phone', 'User 1 - Computer', 'User 2 - Phone', 'User 2 - Computer']

# filter out local network communication
filter_df = df[df['Receiver address'].isin(targets)]
filter_df = filter_df[filter_df['Source address'].isin(targets)]
df = df[~df['No.'].isin(filter_df['No.'])]

# filter for requests made directly by the user
df = df[df['Source address'].isin(targets)]

# filter for new connection
df = df[df['Info'] == "Client Hello"]

df.head()

Unnamed: 0,No.,Time,Source,Destination,Protocol,Length,Receiver address,Source address,Info
8,33121,6.865295,21a0c8cd-34bb-4b35-a2f4-9300143e832b.local,doh2.gslb2.xfinity.com,TLSv1.2,706,80:da:c2:c0:64:81,User 1 - Computer,Client Hello
2256,179536,40.384901,MacBook-Air-4.local,api.segment.io,TLSv1.2,619,80:da:c2:c0:64:81,User 1 - Computer,Client Hello
2258,179538,40.38496,MacBook-Air-4.local,ec2-52-25-39-107.us-west-2.compute.amazonaws.com,TLSv1.2,619,80:da:c2:c0:64:81,User 1 - Computer,Client Hello
2960,263625,57.798034,Besart-Copas-iPhone.local,securetoken.googleapis.com,TLSv1.3,619,80:da:c2:c0:64:81,User 1 - Phone,Client Hello
2962,263627,57.799276,Besart-Copas-iPhone.local,us-ne-courier-4.push-apple.com.akadns.net,TLSv1.2,619,80:da:c2:c0:64:81,User 1 - Phone,Client Hello
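
The three filters can also be expressed as boolean masks combined in one step. The sketch below runs on a hypothetical four-row frame (`toy`, with a made-up `router` address) and assumes that removing target-to-target rows by `No.` is equivalent to masking those rows directly, which holds when `No.` is unique.

```python
import pandas as pd

targets = ["User 1 - Phone", "User 1 - Computer"]

# Toy frames: only row 1 is a Client Hello sent by a target to a non-target.
toy = pd.DataFrame({
    "Source address":   ["User 1 - Phone", "User 1 - Phone", "router", "User 1 - Computer"],
    "Receiver address": ["User 1 - Computer", "router", "User 1 - Phone", "router"],
    "Info":             ["Client Hello", "Client Hello", "Server Hello", "Application Data"],
})

from_target = toy["Source address"].isin(targets)   # sent by a tracked device
to_target = toy["Receiver address"].isin(targets)   # received by a tracked device
is_hello = toy["Info"] == "Client Hello"            # start of a new connection

# Keep frames sent by a target, not addressed to another target,
# and marking a new connection.
filtered = toy[from_target & ~to_target & is_hello]

print(filtered)
```

Naming each mask keeps the intent of every filter visible and avoids building an intermediate DataFrame just to look up row numbers.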


### Drop Unwanted Columns
- `No.` is the Wireshark index and is not useful.
- `Protocol` is trivial because Wireshark had already filtered the capture for TCP-related protocols only.
- `Info` describes activity based on TCP/IP frame info. This was useful for filtering by `Client Hello` but is no longer necessary.
- `Receiver address` represents the MAC address of the destination device. Only the FQDN or domain listed in `Destination` is useful, because hosts often have many servers with many MAC addresses.
- `Source` is the FQDN of the user's device and is not constant. Only the MAC address listed in `Source address` is constant and useful.
- `Length` could be useful, but because the data was filtered down to `Client Hello` frames only, it becomes trivial.

In [77]:
df = df.drop(columns=['No.', 'Protocol', 'Info', 'Receiver address', 'Source', 'Length']).reset_index(drop=True)
df.head()

Unnamed: 0,Time,Destination,Source address
0,6.865295,doh2.gslb2.xfinity.com,User 1 - Computer
1,40.384901,api.segment.io,User 1 - Computer
2,40.38496,ec2-52-25-39-107.us-west-2.compute.amazonaws.com,User 1 - Computer
3,57.798034,securetoken.googleapis.com,User 1 - Phone
4,57.799276,us-ne-courier-4.push-apple.com.akadns.net,User 1 - Phone


### Upgrade the `Time` Column

The `Time` column records how many seconds have passed since the network capture was started. Since the capture was started on Wed Feb 1st, 2023 at 17:04:20 EST, it can be replaced by two separate and more useful columns:
- `Week` will represent the day of the week the frame was sent.
    - Mon/Tue/Wed/Thu/Fri/Sat/Sun
- `Day` will represent the time of day the frame was sent.
    - Morning (4:00 - 12:00)
    - Afternoon (12:00 - 20:00)
    - Evening (20:00 - 4:00)

In [78]:
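def time_of_day(hour):
    # converts an hour (0-23) to a time of day; lower bounds are inclusive
    # so that hours 4 and 12 fall into Morning and Afternoon respectively
    if 4 <= hour < 12:
        return "Morning"
    if 12 <= hour < 20:
        return "Afternoon"
    else:
        return "Evening"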

In [79]:
# convert the capture start date to seconds to be used as an offset
epoch = datetime(2023, 2, 1, 17, 4, 20).timestamp()

# apply the offset to each row and convert to day of the week
df['Week'] = df.apply(
    lambda row : 
        datetime.fromtimestamp(epoch+row['Time']).strftime('%A')[:3], axis = 1)

# apply the offset to each row and convert to time of day
df['Day'] = df.apply(
    lambda row : 
        time_of_day(datetime.fromtimestamp(epoch+row['Time']).hour), axis = 1)

df = df.drop(columns=['Time']).reset_index(drop=True)
df.index.names = ['Index']
df.head()

Index,Destination,Source address,Week,Day
0,doh2.gslb2.xfinity.com,User 1 - Computer,Wed,Afternoon
1,api.segment.io,User 1 - Computer,Wed,Afternoon
2,ec2-52-25-39-107.us-west-2.compute.amazonaws.com,User 1 - Computer,Wed,Afternoon
3,securetoken.googleapis.com,User 1 - Phone,Wed,Afternoon
4,us-ne-courier-4.push-apple.com.akadns.net,User 1 - Phone,Wed,Afternoon
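
The same conversion can be sketched without a row-wise `apply`, using pandas datetime arithmetic. This is an alternative illustration under stated assumptions: `capture_start` treats the start time from the text as a naive local timestamp, and the `Time` values in `toy` are invented to land in different buckets, including one frame at exactly 12:00.

```python
import pandas as pd

# Capture start from the text, treated as a naive local timestamp (assumption).
capture_start = pd.Timestamp("2023-02-01 17:04:20")

# Hypothetical offsets in seconds since the capture started.
toy = pd.DataFrame({"Time": [6.865295, 25_000.0, 50_000.0, 68_140.0]})

# Absolute send time of every frame.
sent_at = pd.to_timedelta(toy["Time"], unit="s") + capture_start

# Three-letter day of the week, e.g. "Wed".
toy["Week"] = sent_at.dt.day_name().str[:3]

# Time-of-day buckets with inclusive lower bounds.
hour = sent_at.dt.hour
toy["Day"] = "Evening"
toy.loc[hour.between(4, 11), "Day"] = "Morning"
toy.loc[hour.between(12, 19), "Day"] = "Afternoon"

print(toy)
```

Working with a naive `pd.Timestamp` keeps the result independent of the time zone of the machine running the notebook, whereas `datetime.timestamp()` and `datetime.fromtimestamp()` both go through the local time zone.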


In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Destination     1014 non-null   object
 1   Source address  1014 non-null   object
 2   Week            1014 non-null   object
 3   Day             1014 non-null   object
dtypes: object(4)
memory usage: 31.8+ KB


### Export to CSV

Data pre-processing is complete. The data was slimmed down from ~2 million entries to ~1 thousand. The results are exported to `preprocessed.csv` for use by the [Network Fingerprint AI Model Notebook](https://github.com/Charm-q/AI-Capstone/blob/main//Network%20Fingerprint%20AI%20Model.ipynb).

In [82]:
df.to_csv('data/preprocessed.csv')
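
Because the index is named `Index`, `to_csv` writes it as the first column of the file. A consumer of the file can restore it with `index_col`. The sketch below is a hypothetical round trip through an in-memory buffer instead of `data/preprocessed.csv`, with a single made-up row.

```python
import io

import pandas as pd

# One-row stand-in for the pre-processed dataset.
toy = pd.DataFrame({
    "Destination": ["api.segment.io"],
    "Source address": ["User 1 - Computer"],
    "Week": ["Wed"],
    "Day": ["Afternoon"],
})
toy.index.names = ["Index"]

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, using the "Index" column as the row index again.
restored = pd.read_csv(buffer, index_col="Index")

print(restored)
```

Without `index_col`, the saved index would come back as an ordinary data column named `Index`.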