## Network Fingerprinting

Physical fingerprints are used in forensics to piece together crimes. Network fingerprints can be used in the same way, because people fall into habitual patterns of activity online. This notebook uses captured network data from two users, each with a phone and a laptop, and attempts to train an AI model to recognize and differentiate between them.

### Initialize

Import Pandas, other useful libraries, and the dataset `preprocessed.csv`. This data was captured using a man-in-the-middle network attack documented in the [writeup](https://github.com/Charm-q/AI-Capstone/blob/main/README.md). Due to the very large size of the network capture, preprocessing has already been completed in the Data Pre-Processing [notebook](https://github.com/Charm-q/AI-Capstone/blob/main/Data%20Pre-Processing.ipynb).

In [67]:
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.neural_network import MLPClassifier

df = pd.read_csv('data/preprocessed.csv').set_index('Index')
df.head()

Index,Destination,Source address,Week,Day
0,doh2.gslb2.xfinity.com,User 1 - Computer,Wed,Afternoon
1,api.segment.io,User 1 - Computer,Wed,Afternoon
2,ec2-52-25-39-107.us-west-2.compute.amazonaws.com,User 1 - Computer,Wed,Afternoon
3,securetoken.googleapis.com,User 1 - Phone,Wed,Afternoon
4,us-ne-courier-4.push-apple.com.akadns.net,User 1 - Phone,Wed,Afternoon


### Dataset Breakdown

Each entry in the dataset is a preprocessed network frame that retains only the high-level, useful details.


- `Destination` represents the FQDN or domain of the intended receiver.
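- `Source address` identifies the frame sender. It is derived from the sender's MAC address and appears in this dataset as a user and device label such as `User 1 - Computer`.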
- `Week` represents the day of the week the frame was sent.
- `Day` represents the time of day the frame was sent, either Morning, Afternoon or Evening.
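
Before encoding, it can help to check how many frames each label has and how many distinct values each column contains, since every distinct value becomes its own column later. The sketch below runs on a small hand-made frame rather than the real capture; the rows, and the `User 2 - Computer` label in particular, are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for preprocessed.csv; these rows are illustrative, not real capture data.
toy = pd.DataFrame({
    "Destination": ["api.segment.io", "zoom.us", "zoom.us", "youtubei.googleapis.com"],
    "Source address": ["User 1 - Computer", "User 1 - Phone",
                       "User 2 - Computer", "User 1 - Computer"],  # "User 2" label is assumed
    "Week": ["Wed", "Wed", "Thu", "Thu"],
    "Day": ["Afternoon", "Evening", "Morning", "Afternoon"],
})

# Frames per label: a heavily skewed count would make plain accuracy misleading.
label_counts = toy["Source address"].value_counts()

# Distinct values per column: each one becomes a column after One-Hot-Encoding.
cardinality = toy.nunique()

print(label_counts)
print(cardinality)
```

Running the same two calls on `df` would show whether one device dominates the capture.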

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1014 entries, 0 to 1013
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Destination     1014 non-null   object
 1   Source address  1014 non-null   object
 2   Week            1014 non-null   object
 3   Day             1014 non-null   object
dtypes: object(4)
memory usage: 39.6+ KB


### One-Hot-Encode Columns
The `Destination`, `Week` and `Day` features all need to be One-Hot-Encoded so that the machine learning models can consume them. `Source address` is the exception, as it is the label identifying the users whose habits need to be learned.



In [69]:
def encode_and_bind(dataframe, features_to_encode):
    # One-Hot-Encode each listed feature and append the resulting columns;
    # the original columns are kept, so the caller drops them afterwards
    for feature in features_to_encode:
        dummies = pd.get_dummies(dataframe[[feature]])
        dataframe = pd.concat([dataframe, dummies], axis=1)
    return dataframe
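
To make the effect of One-Hot-Encoding concrete, here is a minimal sketch on a three-row toy frame. It uses the `columns` argument of `pd.get_dummies`, which encodes and drops the original columns in one call; this is an alternative shown for illustration, not the helper used in this notebook. The toy values are assumptions.

```python
import pandas as pd

# Three illustrative rows with two categorical features.
toy = pd.DataFrame({
    "Week": ["Wed", "Thu", "Wed"],
    "Day": ["Afternoon", "Evening", "Morning"],
})

# Each distinct value becomes its own 0/1 column named <feature>_<value>.
encoded = pd.get_dummies(toy, columns=["Week", "Day"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Every row ends up with exactly one `1` per original feature, which is why no further scaling is needed before training.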

In [70]:
columns = ['Destination', 'Week', 'Day']
df = encode_and_bind(df, columns).drop(columns=columns)
df.head()

Index,Source address,Destination_104.16.185.44,Destination_104.18.155.62,Destination_104.244.42.129,Destination_138.251.186.35.bc.googleusercontent.com,Destination_141.226.224.32,Destination_141.226.224.48,Destination_146.75.78.109,Destination_146.75.78.133,Destination_146.75.78.137,...,Destination_xx-fbcdn6-shv-01-ord5.fbcdn.net,Destination_xx-fbcdn6-shv-02-ord5.fbcdn.net,Destination_yi-in-f95.1e100.net,Destination_youtubei.googleapis.com,Destination_zoom.us,Week_Thu,Week_Wed,Day_Afternoon,Day_Evening,Day_Morning
0,User 1 - Computer,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
1,User 1 - Computer,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
2,User 1 - Computer,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
3,User 1 - Phone,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
4,User 1 - Phone,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0


### Training AI Models - SVC

All data preparation is complete. Two models will be used to learn the users' habits. The first is a Support Vector Machine, chosen because it works well with high-dimensional One-Hot-Encoded features and has slightly beaten basic neural net models on this classification task.


An SVC essentially finds a boundary (a hyperplane, in the case of the linear kernel) that separates the classes in a multidimensional feature space. Since users tend to use the same websites at the same time of day, this is a good way of differentiating between them. There will of course be some overlap, for example when different users contact the same destination in the same time slot, so a perfect training and test score is not realistic.
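
The idea of a separating boundary can be sketched on a tiny hand-made dataset. The four feature columns and the two labels below are assumptions chosen so that the classes are separable by destination alone; they are not taken from the capture.

```python
import numpy as np
from sklearn.svm import SVC

# One-hot toy features: [site_a, site_b, morning, evening]; purely illustrative.
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
])
y = np.array(["User 1", "User 1", "User 1", "User 2", "User 2", "User 2"])

clf = SVC(kernel="linear").fit(X, y)

# coef_ holds the weights of the separating hyperplane (available for the linear kernel only).
print(clf.coef_)

# The sign of decision_function tells which side of the hyperplane each sample falls on.
print(clf.decision_function(X))

# A frame to site_a in the evening is assigned to whichever side it lands on.
print(clf.predict([[1, 0, 0, 1]]))
```

With the real data the same mechanism applies, only with one dimension per encoded destination, weekday and time slot.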

In [106]:
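# training parameters for the grid search
# (gamma is ignored by the linear kernel; coef0 only affects the sigmoid kernel here)
svc_params = {'svc__kernel': ['sigmoid', 'linear', 'rbf'],
              'svc__gamma': [0.1, 1.0, 10.0],
              'svc__coef0': [0, 1, 2]
             }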

In [137]:
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='Source address'), df['Source address'])
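
The split above uses the defaults, so it changes on every run and does not guarantee that each label is represented proportionally in both sets. A minimal sketch of a reproducible, stratified split follows; the toy frame, the `test_size` value and the `random_state` seed are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced label: 15 frames from one device, 5 from another.
toy = pd.DataFrame({
    "feature": range(20),
    "Source address": ["User 1 - Computer"] * 15 + ["User 1 - Phone"] * 5,
})
X = toy.drop(columns="Source address")
y = toy["Source address"]

# stratify=y keeps the label proportions equal in train and test;
# random_state makes the split repeatable between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

A fixed seed would also make the scores of the two models directly comparable, since both would then be evaluated on the same test rows.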

In [93]:
# generate the Support Vector Machine, no scaling necessary since everything is One-Hot-Encoded
svc_pipe = Pipeline([('svc', SVC())])
svc_grid = GridSearchCV(svc_pipe, param_grid=svc_params)

svc = svc_grid.fit(X_train, y_train)

In [94]:
# get training and test results (accuracy_score expects y_true first, then y_pred)
svc_train_score = accuracy_score(y_train, svc.predict(X_train))
svc_test_score = accuracy_score(y_test, svc.predict(X_test))
svc_best_params = svc.best_params_

print("Train accuracy: ", svc_train_score)
print("Test accuracy: ", svc_test_score)
print("Best parameters: ", svc_best_params)

Train accuracy:  0.9302631578947368
Test accuracy:  0.8267716535433071
Best parameters:  {'svc__coef0': 0, 'svc__gamma': 0.1, 'svc__kernel': 'linear'}
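
Accuracy alone does not show which devices get confused with each other. The `precision_score` and `recall_score` functions imported earlier can report this per label. The sketch below uses six hand-written true and predicted labels, which are assumptions for illustration and not output from the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels = ["User 1 - Computer", "User 1 - Phone"]

# Hand-written example: one Computer frame is mistaken for a Phone frame.
y_true = ["User 1 - Computer"] * 3 + ["User 1 - Phone"] * 3
y_pred = ["User 1 - Computer"] * 2 + ["User 1 - Phone"] * 4

# average=None returns one score per label, in the order given by `labels`.
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Rows are true labels, columns are predicted labels.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(matrix)
```

Applied to `y_test` and `svc.predict(X_test)`, the same calls would show whether errors are mostly between a user's own devices or between users.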


### Training AI Models - Neural Net MLPClassifier

To compare the results, a neural net using an `MLPClassifier` was trained. Its results are very similar to those of the SVC above, with about the same training time. On average the SVC seems to perform very slightly better, although individual runs can go either way.

In [138]:
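# training parameters (hidden_layer_sizes is the MLPClassifier argument that sets the neurons per layer)
params = {'hidden_layer_sizes': [(10,), (50,), (100,)]}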

In [139]:
# generate the neural net MLPClassifier and train it
model = MLPClassifier(hidden_layer_sizes=(50,50), max_iter=500, alpha=0.0001,
                      solver='adam', verbose=10,  random_state=21, tol=0.000000001)

# note: ann_grid is constructed but never fit; the results below come from
# fitting the single (50, 50) model directly
ann_grid = GridSearchCV(model, param_grid=params)
model.fit(X_train, y_train)

Iteration 1, loss = 1.15430037
Iteration 2, loss = 1.11000515
Iteration 3, loss = 1.07062877
Iteration 4, loss = 1.03358880
Iteration 5, loss = 0.99917632
Iteration 6, loss = 0.96308219
Iteration 7, loss = 0.92683405
Iteration 8, loss = 0.89017808
Iteration 9, loss = 0.85083320
Iteration 10, loss = 0.81522764
Iteration 11, loss = 0.78044583
Iteration 12, loss = 0.74678786
Iteration 13, loss = 0.71626432
Iteration 14, loss = 0.68861877
Iteration 15, loss = 0.66268644
Iteration 16, loss = 0.63843464
Iteration 17, loss = 0.61546454
Iteration 18, loss = 0.59335884
Iteration 19, loss = 0.57204052
Iteration 20, loss = 0.54920878
Iteration 21, loss = 0.52683945
Iteration 22, loss = 0.50354439
Iteration 23, loss = 0.47977079
Iteration 24, loss = 0.45521660
Iteration 25, loss = 0.43054904
Iteration 26, loss = 0.40531317
Iteration 27, loss = 0.38154330
Iteration 28, loss = 0.35796685
Iteration 29, loss = 0.33552021
Iteration 30, loss = 0.31414211
Iteration 31, loss = 0.29380981
Iteration 32, los

In [140]:
# get training results
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy}")

Test accuracy: 0.8346456692913385
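
The score above comes from a single fixed architecture, since the grid search object is never fit. A minimal sketch of how a grid search over layer sizes could be run is shown below. It uses synthetic data from `make_classification` so that it runs on its own; the candidate sizes, `cv=3` and the dataset shape are assumptions, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the encoded capture data.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# hidden_layer_sizes takes one tuple per candidate architecture.
param_grid = {"hidden_layer_sizes": [(10,), (50,), (50, 50)]}

grid = GridSearchCV(MLPClassifier(max_iter=500, random_state=21),
                    param_grid=param_grid, cv=3)
grid.fit(X_tr, y_tr)  # fit the grid itself, not the bare model

test_score = grid.score(X_te, y_te)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", test_score)
```

The key difference from the cell above is that `fit` is called on the `GridSearchCV` object, which then refits the best candidate on the full training set.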


### Conclusion

It is possible to train an AI model that can fingerprint and classify individuals solely based on their internet activity, with a test accuracy of roughly 83% from each of two separate models (about 82.7% for the SVC and 83.5% for the MLPClassifier in the runs shown). Improvements could be made to increase this accuracy.
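
One way to put the accuracy in context is to compare it with a trivial baseline under cross-validation, which also averages out the luck of a single split. The sketch below does this on synthetic four-class data; the four classes (two users with two devices each), the dataset shape and `cv=5` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic four-class stand-in for the user/device labels.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)

# Baseline that always predicts the most frequent class.
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"),
                                  X, y, cv=5)

# Linear SVC evaluated on the same five folds.
svc_scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)

print("Baseline mean accuracy:", baseline_scores.mean())
print("SVC mean accuracy:", svc_scores.mean())
```

On the real data, the gap between the two means would indicate how much of the accuracy reflects learned habits rather than label imbalance.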