# Assignment: Application Classification

To this point in the class, you have learned various techniques for application/traffic classification. In this assignment, you will put them into practice by training a model to identify applications from a network traffic trace.

**Submission Instructions**: 
- Ensure your assignment is submitted through Gradescope. To do so, sign up for the course on Gradescope using the code: B2W3YG. 
- You should submit a single notebook containing your feature-extraction code, your model evaluation, and your response to Part 3.
- You should assume the CSVs are located in a folder called `data`, co-located with the notebook.
- Make sure the notebook is styled well. Write code in the relevant sections of the notebook. 
- I should be able to run the entire notebook without any errors. 

## Dataset download and Warmup

We will use a public dataset that consists of annotated traffic logs. The dataset for this assignment is available on [OneDrive](https://csciitd-my.sharepoint.com/:u:/g/personal/tmangla_csciitd_onmicrosoft_com/EafyJbnixmJIvZN1bgwD2W4BIrzc5yy9AP9uNrkmNTMfoA?e=Qa1NVU). The data consists of TSV (tab-separated) files with headers, each containing the packets of one application. Each row corresponds to one packet. The headers follow this schema:
```
columns = ["frame.time_epoch", "frame.len", "ip.src", "ip.dst", "ip.proto",
    "udp.srcport", "udp.dstport", "tcp.srcport", "tcp.dstport",
    "tcp.flags", "tcp.flags.syn", "tcp.flags.fin", "dns.qry.name"]
```

**Getting application ground truth:** You can use the filename of the CSV file to determine the application label.

Download the dataset and read it. You can read the data into a dataframe:
```
df = pd.read_csv(filename, sep="\t", header=None, names=columns)
```
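The file-name convention can be turned into a label with a small helper. The sketch below assumes file names of the form `<label>_<part>.csv` (for example, a hypothetical `nonvpn_netflix_1.csv`); `label_from_filename` is an illustrative name and not part of the dataset.

```python
import os

def label_from_filename(path):
    """Derive the application label from a capture file name.

    Assumes (hypothetically) that names look like '<label>_<part>.csv',
    so the text after the last underscore is a part number to drop.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.rsplit("_", 1)[0]

print(label_from_filename("data/nonvpn_netflix_1.csv"))  # -> nonvpn_netflix
```

If the real file names follow a different pattern, only the `rsplit` step should need to change.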

## Part 1: Extracting Features

### Data cleaning

Your goal is to extract the following features from the dataset: 
- Flow-level (5 features): flow duration, volume (upstream and downstream), number of packets (upstream and downstream)
- Packet-level features (36 features): Statistics on packet inter-arrival times (IAT) and packet sizes, computed separately for the upstream and downstream directions. For each flow, compute the following nine statistics: mean, median, std, min, max, quartiles (25th and 75th percentiles), and deciles (10th and 90th percentiles). Compute these statistics per feature (IAT, size) and per direction (upstream, downstream), which gives 9 × 2 × 2 = 36 features.
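As a minimal sketch of the packet-level statistics, the snippet below computes the nine statistics for the inter-arrival times and sizes of one direction of one toy flow. The helper name `summary_stats`, the `up_` prefixes, and the toy values are assumptions made for illustration. The key step is that inter-arrival times are the differences between consecutive timestamps, not the timestamps themselves.

```python
import numpy as np
import pandas as pd

STAT_NAMES = ["mean", "median", "std", "min", "max", "p25", "p75", "p10", "p90"]

def summary_stats(values, prefix):
    """Return the nine statistics listed above for one series of values."""
    v = np.asarray(values, dtype=float)
    if v.size == 0:
        # A direction with no packets (or one packet, for IAT) has no statistics.
        return {f"{prefix}_{name}": np.nan for name in STAT_NAMES}
    stats = [v.mean(), np.median(v), v.std(), v.min(), v.max(),
             np.percentile(v, 25), np.percentile(v, 75),
             np.percentile(v, 10), np.percentile(v, 90)]
    return {f"{prefix}_{name}": s for name, s in zip(STAT_NAMES, stats)}

# Toy upstream packets of a single flow (timestamps in seconds, sizes in bytes).
timestamps = pd.Series([0.0, 0.1, 0.4, 1.0])
sizes = pd.Series([60, 1500, 1500, 52])

# Inter-arrival time: gap between consecutive packets in the same direction.
iat = timestamps.sort_values().diff().dropna()

features = {**summary_stats(iat, "up_iat"), **summary_stats(sizes, "up_size")}
print(len(features))  # 18 features for one direction; 36 with downstream added
```

Repeating the same two calls for the downstream packets yields the remaining 18 features.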

**Defining Flows**: For TCP, a flow is the same as a connection (determined using SYN/FIN packets). You should define UDP flows using an inactivity timeout (as discussed in class). Use an inactivity timeout of 60 s.
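One way to apply the inactivity timeout is sketched below. It is an illustration under stated assumptions rather than the required method: packets are grouped by a direction-independent endpoint pair (so both directions of a conversation share a flow), and a new flow starts whenever the gap to the previous packet of the same pair exceeds the timeout. The function name `assign_udp_flows` and the columns `conv` and `flow_id` are hypothetical.

```python
import numpy as np
import pandas as pd

def assign_udp_flows(df, timeout=60.0):
    """Label UDP packets with a flow id using a per-conversation inactivity timeout."""
    df = df.sort_values("frame.time_epoch").copy()
    src = df["ip.src"].astype(str) + ":" + df["udp.srcport"].astype(str)
    dst = df["ip.dst"].astype(str) + ":" + df["udp.dstport"].astype(str)
    # Order the two endpoints so that A->B and B->A map to the same key.
    df["conv"] = np.where(src < dst, src + "|" + dst, dst + "|" + src)
    # Gap to the previous packet of the same conversation (NaN for the first one).
    gap = df.groupby("conv")["frame.time_epoch"].diff()
    # Count timeouts seen so far within each conversation.
    restarts = (gap > timeout).astype(int).groupby(df["conv"]).cumsum()
    df["flow_id"] = df["conv"] + "#" + restarts.astype(str)
    return df

# Toy packets: one conversation with a 99 s pause, plus a second conversation.
packets = pd.DataFrame({
    "frame.time_epoch": [0.0, 1.0, 100.0, 101.0],
    "ip.src": ["10.0.0.1", "8.8.8.8", "10.0.0.1", "10.0.0.1"],
    "ip.dst": ["8.8.8.8", "10.0.0.1", "8.8.8.8", "8.8.8.8"],
    "udp.srcport": [5000, 443, 5000, 6000],
    "udp.dstport": [443, 5000, 443, 443],
})
flows = assign_udp_flows(packets)
print(flows["flow_id"].nunique())  # 3 flows
```

Computing the gap per conversation, rather than over the whole trace, keeps interleaved conversations from being merged into one flow.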

Make sure you filter out the non-IP traffic as well as the DNS traffic from the data.
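A possible filter is sketched below. It assumes that non-IP packets have empty `ip.src`/`ip.dst` fields and that DNS packets can be recognized either by a populated `dns.qry.name` field or by port 53; both are assumptions about how the trace was exported, and `drop_non_ip_and_dns` is an illustrative name.

```python
import numpy as np
import pandas as pd

def drop_non_ip_and_dns(df):
    """Keep only IP packets that are not DNS traffic."""
    is_ip = df["ip.src"].notna() & df["ip.dst"].notna()
    port_cols = ["udp.srcport", "udp.dstport", "tcp.srcport", "tcp.dstport"]
    uses_port_53 = df[port_cols].eq(53).any(axis=1)
    has_dns_query = df["dns.qry.name"].notna()
    return df[is_ip & ~(uses_port_53 | has_dns_query)].copy()

# Toy rows: a non-IP frame, a DNS query, a TCP packet, and a UDP packet.
packets = pd.DataFrame({
    "ip.src": [np.nan, "10.0.0.1", "10.0.0.1", "10.0.0.1"],
    "ip.dst": [np.nan, "8.8.8.8", "1.2.3.4", "1.2.3.4"],
    "udp.srcport": [np.nan, 40000, np.nan, 50000],
    "udp.dstport": [np.nan, 53, np.nan, 443],
    "tcp.srcport": [np.nan, np.nan, 41000, np.nan],
    "tcp.dstport": [np.nan, np.nan, 443, np.nan],
    "dns.qry.name": [np.nan, "example.com", np.nan, np.nan],
})
clean = drop_non_ip_and_dns(packets)
print(len(clean))  # 2 rows remain
```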

**Checkpoint**: Once you do that, summarize the number of flows for each application. You can extract the application name from the file name. VPN and non-VPN applications should be treated differently. You can remove classes with fewer than 10 instances for the next part.
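The checkpoint can be sketched as a count followed by a filter. The snippet assumes a per-flow table with a hypothetical `app` column that holds the label derived from the file name; the toy table stands in for the real flow table.

```python
import pandas as pd

# Toy per-flow table: 12 flows of one class and 3 of another.
flows = pd.DataFrame({"app": ["nonvpn_netflix"] * 12 + ["vpn_rdp"] * 3})

# Number of flows per application.
counts = flows["app"].value_counts()
print(counts)

# Keep only classes with at least 10 flows.
keep = counts[counts >= 10].index
filtered = flows[flows["app"].isin(keep)]
```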

#### Basic Code Setup

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')  # Suppress warning messages
columns = ["frame.time_epoch", "frame.len", "ip.src", "ip.dst", "ip.proto",
    "udp.srcport", "udp.dstport", "tcp.srcport", "tcp.dstport",
    "tcp.flags", "tcp.flags.syn", "tcp.flags.fin", "dns.qry.name"]

**Read all CSV files from the `data` folder**

In [2]:
folder_path = './data/'

# Collect the CSV files in the folder
csv_files = [file for file in os.listdir(folder_path) if file.endswith('.csv')]
# Sort the files so that the parts of each capture are read in order
csv_files.sort()

# Per-capture DataFrames and file-name lists, kept separately for VPN and non-VPN traffic
dfs_vpn_dict = {}
dfs_nonvpn_dict = {}
filenames_vpn = []
filenames_nonvpn = []

# Iterate over each file in the folder and prepare dataFrame accordingly
previousFileName = ''
for file in csv_files:
    newFileName = file.rsplit('_', 1)[0]
    isVPN = newFileName.find("nonvpn") == -1
    selected_dfs_dict = dfs_vpn_dict if isVPN else dfs_nonvpn_dict
    selected_name_list = filenames_vpn if isVPN else filenames_nonvpn
    file_path = os.path.join(folder_path, file)
    # read_csv
    df_file = pd.read_csv(file_path, sep="\t", header=None, names=columns)
    if(newFileName == previousFileName):
        # append
        selected_dfs_dict[newFileName] = pd.concat([selected_dfs_dict[newFileName], df_file], ignore_index=True)
    else:
        # Create new df
        selected_dfs_dict[newFileName] = df_file
        selected_name_list.append(newFileName)
    previousFileName = newFileName

**Create common df for vpn and non-vpn connections**


In [3]:
# Create common df for vpn and non-vpn connections
merged_df_nonvpn = pd.concat(dfs_nonvpn_dict.values(), keys=dfs_nonvpn_dict.keys(), names=['file_name'])
merged_df_vpn = pd.concat(dfs_vpn_dict.values(), keys=dfs_vpn_dict.keys(), names=['file_name'])

**getTCP_UDPFeatures(dataframe)**: Returns the flow-level and packet-level features for TCP and UDP flows.

For UDP, this method relies on one assumption: a packet is labeled upstream when its source port is less than or equal to its destination port, and downstream otherwise.
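A port-ordering rule can mislabel packets when the client side does not follow that ordering. An alternative heuristic, sketched here as an assumption and not as the method used in this notebook, treats the sender of the first packet of each flow as the client, so packets from that address are upstream. The name `label_direction` and the `flow_id` column are hypothetical, and the heuristic only holds if the capture started before the flow did.

```python
import numpy as np
import pandas as pd

def label_direction(df, flow_col="flow_id"):
    """Mark packets sent by each flow's first sender as upstream."""
    df = df.sort_values("frame.time_epoch").copy()
    # Source address of the earliest packet in each flow.
    initiator = df.groupby(flow_col)["ip.src"].transform("first")
    df["direction"] = np.where(df["ip.src"] == initiator, "upstream", "downstream")
    return df

# Toy packets: flow 1 is started by 10.0.0.1, flow 2 by 8.8.8.8.
packets = pd.DataFrame({
    "flow_id": [1, 1, 1, 2],
    "frame.time_epoch": [0.0, 1.0, 2.0, 0.5],
    "ip.src": ["10.0.0.1", "8.8.8.8", "10.0.0.1", "8.8.8.8"],
})
labeled = label_direction(packets)
print(labeled[["flow_id", "ip.src", "direction"]])
```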

In [4]:
def getTCP_UDPFeatures(df):
    # Filter out non-IP traffic
    df = df[df['ip.src'].notnull() & df['ip.dst'].notnull()]
    # Split into TCP (protocol 6) and UDP (protocol 17). Copy each subset so that
    # the column assignments below do not write into a view of `df`.
    df_tcp = df[df['ip.proto'].isin([6])].copy()
    df_udp = df[df['ip.proto'].isin([17])].copy()


    def collect_tcp_features(df_tcp):
        # Define flows for TCP based on SYN/FIN flags
        df_tcp['flow_id'] = (df_tcp['tcp.flags.syn'] == 1) | (df_tcp['tcp.flags.fin'] == 1)
        df_tcp['flow_id'] = df_tcp.groupby(['ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport', ])['flow_id'].cumsum()
        df_tcp['direction'] = np.where(df_tcp['tcp.flags.syn'] == 1, 'upstream', 'downstream')

        # Extract Flow-level Features
        flow_features = df_tcp.groupby(['flow_id','direction']).agg({
            'frame.time_epoch': lambda x: x.max() - x.min(),  # Flow duration
            'frame.len': ['sum', 'count'],
            'tcp.flags.syn': ['sum'],
            'tcp.flags.fin': ['sum'],
        }).reset_index()
        flow_features.columns = ['Flow', 'Direction', 'FlowDuration', 'TotalVolume', 'PacketCount','SYNCount', 'FINCount']

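        # Extract packet-level features. Note: the IAT* columns are statistics of the raw
        # frame.time_epoch values in each group, not of the gaps between consecutive packets.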
        packet_features = df_tcp.groupby(['flow_id', 'direction']).agg({
            'frame.time_epoch': ['mean', 'median', lambda x: np.std(x), 'min', 'max', percentile_25, percentile_75, percentile_10, percentile_90],
            'frame.len': ['mean', 'median', lambda x: np.std(x), 'min', 'max', percentile_25, percentile_75, percentile_10, percentile_90]
        }).reset_index()
        packet_features.columns = ['Flow', 'Direction', 'IATMean', 'IATMedian', 'IATStd', 'IATMin', 'IATMax', 'IATQ1', 'IATQ3', 'IATD1', 'IATD9',
                            'SizeMean', 'SizeMedian', 'SizeStd', 'SizeMin', 'SizeMax', 'SizeQ1', 'SizeQ3', 'SizeD1', 'SizeD9']

        return flow_features, packet_features

    def collect_udp_features(df_udp):
        # Define UDP flows using inactivity timeout of 60s
        df_udp['flow_id'] = (df_udp['frame.time_epoch'].diff() > 60).cumsum()
        # Assuming that source port is smaller than dst port. In Netflix example, 53 port number is open for all clients
        df_udp['direction'] = np.where(df_udp['udp.srcport'] <= df_udp['udp.dstport'], 'upstream', 'downstream')

        # Extract Flow-level Features
        flow_features = df_udp.groupby(['flow_id','direction']).agg({
            'frame.time_epoch': lambda x: x.max() - x.min(),  # Flow duration
            'frame.len': ['sum', 'count']
        }).reset_index()
        flow_features.columns = [ 'Flow', 'Direction', 'FlowDuration', 'TotalVolume', 'PacketCount']

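        # Extract packet-level features. Note: the IAT* columns are statistics of the raw
        # frame.time_epoch values in each group, not of the gaps between consecutive packets.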
        packet_features = df_udp.groupby(['flow_id', 'direction']).agg({
            'frame.time_epoch': ['mean', 'median', lambda x: np.std(x), 'min', 'max', percentile_25, percentile_75, percentile_10, percentile_90],
            'frame.len': ['mean', 'median', lambda x: np.std(x), 'min', 'max', percentile_25, percentile_75, percentile_10, percentile_90]
        }).reset_index()
        packet_features.columns = ['Flow', 'Direction', 'IATMean', 'IATMedian', 'IATStd', 'IATMin', 'IATMax', 'IATQ1', 'IATQ3', 'IATD1', 'IATD9',
                            'SizeMean', 'SizeMedian', 'SizeStd', 'SizeMin', 'SizeMax', 'SizeQ1', 'SizeQ3', 'SizeD1', 'SizeD9']

        return flow_features, packet_features

    def percentile_25(x):
        return np.percentile(x, 25)

    def percentile_75(x):
        return np.percentile(x, 75)

    def percentile_10(x):
        return np.percentile(x, 10)

    def percentile_90(x):
        return np.percentile(x, 90)

    # Collecting TCP features
    flow_features, packet_features = collect_tcp_features(df_tcp)
    merged_df_tcp = pd.merge(flow_features, packet_features, on=['Flow', 'Direction'], how='inner')
    # Collecting UDP features
    udp_flow_features, udp_packet_features = collect_udp_features(df_udp)
    merged_df_udp = pd.merge(udp_flow_features, udp_packet_features, on=['Flow', 'Direction'], how='inner')

    return merged_df_tcp, merged_df_udp

**getSummary(fileNames, merged_df)**: For each file name, prints the following and returns the TCP and UDP summaries as two dictionaries keyed by file name:
1. Summary for: [FileName]
2. TCP Summary
3. UDP Summary


In [5]:
def getSummary(fileNames, merged_df):
    dfs_dict_tcp={}
    dfs_dict_udp={}
    for fileName in fileNames:
        print("Summary for : " + fileName)
        df = merged_df.loc[fileName]
        merged_df_tcp, merged_df_udp  = getTCP_UDPFeatures(df)
        print("TCP Summary for : " + fileName)
        print(merged_df_tcp)
        print("UDP Summary for : " + fileName)
        print(merged_df_udp)
        #Create dictionary for TCP Summary
        if fileName in dfs_dict_tcp:
            dfs_dict_tcp[fileName] = pd.concat([dfs_dict_tcp[fileName], merged_df_tcp], ignore_index=True)
        else:
            dfs_dict_tcp[fileName] = merged_df_tcp

        #Create dictionary for UDP Summary
        if fileName in dfs_dict_udp:
            dfs_dict_udp[fileName] = pd.concat([dfs_dict_udp[fileName], merged_df_udp], ignore_index=True)
        else:
            dfs_dict_udp[fileName] = merged_df_udp
    return dfs_dict_tcp, dfs_dict_udp

#### Summary for NON VPN Applications

In [6]:
summary_nonvpn = getSummary(filenames_nonvpn, merged_df_nonvpn)
summary_nonvpn

Summary for : nonvpn_netflix


TCP Summary for : nonvpn_netflix
   Flow   Direction  FlowDuration  TotalVolume  PacketCount  SYNCount  \
0     0  downstream  9.467833e+01       653900          889       0.0   
1     1  downstream  1.740715e+06    642891485       683427       0.0   
2     1    upstream  1.740713e+06        20520          342     342.0   
3     2  downstream  1.740695e+06    112310447        83340       0.0   
4     2    upstream  0.000000e+00           60            1       1.0   
5     3  downstream  1.740277e+06          676           13       0.0   
6     4  downstream  3.129411e-01          332            4       0.0   
7     5  downstream  3.210020e-01          332            4       0.0   
8     6  downstream  3.379910e-01          332            4       0.0   
9     7  downstream  6.131309e+00          748           12       0.0   

   FINCount       IATMean     IATMedian         IATStd  ...         IATD9  \
0       0.0  1.563295e+09  1.563295e+09       7.350833  ...  1.563295e+09   
1       2

({'nonvpn_netflix':    Flow   Direction  FlowDuration  TotalVolume  PacketCount  SYNCount  \
  0     0  downstream  9.467833e+01       653900          889       0.0   
  1     1  downstream  1.740715e+06    642891485       683427       0.0   
  2     1    upstream  1.740713e+06        20520          342     342.0   
  3     2  downstream  1.740695e+06    112310447        83340       0.0   
  4     2    upstream  0.000000e+00           60            1       1.0   
  5     3  downstream  1.740277e+06          676           13       0.0   
  6     4  downstream  3.129411e-01          332            4       0.0   
  7     5  downstream  3.210020e-01          332            4       0.0   
  8     6  downstream  3.379910e-01          332            4       0.0   
  9     7  downstream  6.131309e+00          748           12       0.0   
  
     FINCount       IATMean     IATMedian         IATStd  ...         IATD9  \
  0       0.0  1.563295e+09  1.563295e+09       7.350833  ...  1.563295e+09

#### Summary for VPN Applications

In [7]:
summary_vpn = getSummary(filenames_vpn, merged_df_vpn)
summary_vpn

Summary for : vpn_netflix


TCP Summary for : vpn_netflix
Empty DataFrame
Columns: [FlowDuration, TotalVolume, PacketCount, SYNCount, FINCount, Flow, Direction, IATMean, IATMedian, IATStd, IATMin, IATMax, IATQ1, IATQ3, IATD1, IATD9, SizeMean, SizeMedian, SizeStd, SizeMin, SizeMax, SizeQ1, SizeQ3, SizeD1, SizeD9]
Index: []

[0 rows x 25 columns]
UDP Summary for : vpn_netflix
   Flow Direction  FlowDuration  TotalVolume  PacketCount       IATMean  \
0     0  upstream   1793.492512    326802136       317820  1.563293e+09   

      IATMedian      IATStd        IATMin        IATMax  ...         IATD9  \
0  1.563293e+09  531.781413  1.563293e+09  1.563294e+09  ...  1.563294e+09   

      SizeMean  SizeMedian     SizeStd  SizeMin  SizeMax  SizeQ1  SizeQ3  \
0  1028.261708      1482.0  613.372432      122     1482   154.0  1482.0   

   SizeD1  SizeD9  
0   154.0  1482.0  

[1 rows x 23 columns]
Summary for : vpn_rdp
TCP Summary for : vpn_rdp
Empty DataFrame
Columns: [FlowDuration, TotalVolume, PacketCount, SYNCount, FIN

({'vpn_netflix': Empty DataFrame
  Columns: [FlowDuration, TotalVolume, PacketCount, SYNCount, FINCount, Flow, Direction, IATMean, IATMedian, IATStd, IATMin, IATMax, IATQ1, IATQ3, IATD1, IATD9, SizeMean, SizeMedian, SizeStd, SizeMin, SizeMax, SizeQ1, SizeQ3, SizeD1, SizeD9]
  Index: []
  
  [0 rows x 25 columns],
  'vpn_rdp': Empty DataFrame
  Columns: [FlowDuration, TotalVolume, PacketCount, SYNCount, FINCount, Flow, Direction, IATMean, IATMedian, IATStd, IATMin, IATMax, IATQ1, IATQ3, IATD1, IATD9, SizeMean, SizeMedian, SizeStd, SizeMin, SizeMax, SizeQ1, SizeQ3, SizeD1, SizeD9]
  Index: []
  
  [0 rows x 25 columns],
  'vpn_rdp_capture': Empty DataFrame
  Columns: [FlowDuration, TotalVolume, PacketCount, SYNCount, FINCount, Flow, Direction, IATMean, IATMedian, IATStd, IATMin, IATMax, IATQ1, IATQ3, IATD1, IATD9, SizeMean, SizeMedian, SizeStd, SizeMin, SizeMax, SizeQ1, SizeQ3, SizeD1, SizeD9]
  Index: []
  
  [0 rows x 25 columns],
  'vpn_rsync': Empty DataFrame
  Columns: [FlowDuration

## Part 2: Application Classification

### Prepare your data
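For application classification, the target variable should be the application label and not a per-packet property. A minimal sketch of assembling such a table from per-application feature frames is shown below; the dictionary `per_app`, the `app` column, and the toy feature values are assumptions for illustration.

```python
import pandas as pd

# Toy per-application feature frames, keyed by the label from the file name.
per_app = {
    "nonvpn_netflix": pd.DataFrame({"FlowDuration": [1.0, 2.0], "PacketCount": [10, 20]}),
    "vpn_rdp": pd.DataFrame({"FlowDuration": [3.0], "PacketCount": [30]}),
}

# Stack the frames and turn the dictionary key into an 'app' column.
table = (pd.concat(per_app, names=["app", "row"])
           .reset_index(level="app")
           .reset_index(drop=True))

X = table.drop(columns="app")  # feature matrix
y = table["app"]               # target: application label
print(table)
```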

### Train Your Model
- Select a model of your choice.
- Train the model using the training data.

### Tune Your Model
Perform hyperparameter tuning to find optimal parameters for your model.
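A small grid search is sketched below on synthetic data from `make_classification`, which stands in for the real feature table. The model, the grid values, and the scoring choice are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data as a placeholder for the extracted flow features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Try every grid combination and keep the one with the best mean macro F1.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=cv, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the class proportions similar across folds, which matters when some applications have few flows.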

### Evaluate Your Model

**Checkpoint**: Evaluate your model on the following metrics using 10-fold cross-validation:

- Accuracy
- F1 Score
- Confusion Matrix
- ROC/AUC

Your code should evaluate these metrics in separate cells.
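The four metrics can be computed from out-of-fold predictions, as in the following sketch. It uses synthetic 3-class data in place of the real features, and the multi-class ROC AUC is computed one-vs-rest with macro averaging; both choices are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 3-class data as a placeholder for the extracted flow features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Each sample is predicted by a model that did not see it during training.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
y_pred = proba.argmax(axis=1)  # labels here are 0..2, matching the column order

acc = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, average="macro")
cm = confusion_matrix(y, y_pred)
auc = roc_auc_score(y, proba, multi_class="ovr", average="macro")

print(f"accuracy={acc:.3f}  macro-F1={f1:.3f}  ROC AUC (OvR)={auc:.3f}")
print(cm)
```

With string labels, the predicted class would be looked up from the sorted label list instead of using the `argmax` index directly.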

In [8]:
## Code here
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, roc_auc_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')  # Suppress console warnings
# Extract relevant features for training
X_TCP = ['FlowDuration', 'TotalVolume', 'PacketCount',
                     'SYNCount', 'FINCount',
                     'IATMean', 'IATMedian', 'IATStd', 'IATMin', 'IATMax', 'IATQ1', 'IATQ3',
                     'IATD1', 'IATD9', 'SizeMean', 'SizeMedian', 'SizeStd', 'SizeMin', 'SizeMax',
                     'SizeQ1', 'SizeQ3', 'SizeD1', 'SizeD9']
X_UDP = ['FlowDuration', 'TotalVolume', 'PacketCount',
                     'IATMean', 'IATMedian', 'IATStd', 'IATMin', 'IATMax', 'IATQ1', 'IATQ3',
                     'IATD1', 'IATD9', 'SizeMean', 'SizeMedian', 'SizeStd', 'SizeMin', 'SizeMax',
                     'SizeQ1', 'SizeQ3', 'SizeD1', 'SizeD9']
FOLD = 10

def evaluateModel(X,y,FOLD):
    # Encode the labels if they are not numerical
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Check if the number of samples is sufficient for 10-fold cross-validation
    if len(X_train) >= FOLD:
        # Train a RandomForestClassifier (you can choose another model if needed)
        model = RandomForestClassifier(random_state=42)

        # Hyperparameter tuning using GridSearchCV
        param_grid = {
            'n_estimators': [50, 100, 200],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }

        grid_search = GridSearchCV(model, param_grid, cv=FOLD, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train, y_train)

        # Get the best parameters
        best_params = grid_search.best_params_

        # Use the best model for evaluation
        best_model = grid_search.best_estimator_

        # Evaluate the model using 10-fold cross-validation
        cv_accuracy = cross_val_score(best_model, X_train, y_train, cv=FOLD, scoring='accuracy')
        cv_f1_score = cross_val_score(best_model, X_train, y_train, cv=FOLD, scoring='f1_macro')

        # Print the results
        print(f"Best Hyperparameters: {best_params}")
        print(f"Cross-Validation Accuracy: {cv_accuracy.mean()}")
        print(f"Cross-Validation F1 Score: {cv_f1_score.mean()}")

        # Evaluate the model on the test set
        y_pred = best_model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='macro')
        conf_matrix = confusion_matrix(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

        # Print the test set results
        print(f"Test Set Accuracy: {accuracy}")
        print(f"Test Set F1 Score: {f1}")
        print("Confusion Matrix:")
        print(conf_matrix)
        print(f"ROC AUC: {roc_auc}")
    else:
        print("Not enough samples for 10-fold cross-validation. Please use a larger dataset.")


In [9]:
TCP_NONVPN, UDP_NONVPN = summary_nonvpn
merged_df_tcp_nonvpn = pd.concat(TCP_NONVPN.values())
merged_df_udp_nonvpn = pd.concat(UDP_NONVPN.values())

In [10]:
TCP_VPN, UDP_VPN = summary_vpn
merged_df_tcp_vpn = pd.concat(TCP_VPN.values())
merged_df_udp_vpn = pd.concat(UDP_VPN.values())

In [11]:
# Create the feature matrix X and the target variable y
X = merged_df_tcp_nonvpn[X_TCP]
y = merged_df_tcp_nonvpn['Direction']
evaluateModel(X,y,FOLD) #TCP NON VPN

Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Cross-Validation Accuracy: 0.9666666666666666
Cross-Validation F1 Score: 0.9400000000000001
Test Set Accuracy: 0.8
Test Set F1 Score: 0.6875
Confusion Matrix:
[[7 0]
 [2 1]]
ROC AUC: 1.0


In [12]:
# Create the feature matrix X and the target variable y
X = merged_df_udp_nonvpn[X_UDP]
y = merged_df_udp_nonvpn['Direction']
evaluateModel(X,y,FOLD) #UDP NON VPN

Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Cross-Validation Accuracy: 0.9985714285714286
Cross-Validation F1 Score: 0.9985711369667278
Test Set Accuracy: 0.9827586206896551
Test Set F1 Score: 0.9827443720868732
Confusion Matrix:
[[83  0]
 [ 3 88]]
ROC AUC: 0.9986760227724084


In [13]:
# Create the feature matrix X and the target variable y
X = merged_df_tcp_vpn[X_TCP]
y = merged_df_tcp_vpn['Direction']
evaluateModel(X,y,FOLD) #TCP VPN

Not enough samples for 10-fold cross-validation. Please use a larger dataset.


In [14]:
# Create the feature matrix X and the target variable y
X = merged_df_udp_vpn[X_UDP]
y = merged_df_udp_vpn['Direction']
evaluateModel(X,y,FOLD) #UDP VPN

Best Hyperparameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Cross-Validation Accuracy: 0.9666666666666666
Cross-Validation F1 Score: 0.9400000000000001
Test Set Accuracy: 1.0
Test Set F1 Score: 1.0
Confusion Matrix:
[[1 0]
 [0 7]]
ROC AUC: 1.0


## Part 3: Results analysis

Write a short report summarizing the results. Also, explain your results by answering the following questions:

- Which categories of applications were classified correctly (or incorrectly), and why?
- For application categories that were predicted incorrectly, how would you improve their accuracy? Be specific in your answer. For instance, do not simply write "I will collect more data"; explain what data you would collect and why it would help.
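To ground the answers to these questions, per-class recall and the most frequent confusion can be read off a confusion matrix. The sketch below uses made-up labels and counts purely to show the computation; they are not results from this dataset.

```python
import numpy as np

# Hypothetical labels and confusion matrix (rows: true class, columns: predicted).
labels = ["netflix", "rdp", "rsync"]
cm = np.array([[50,  0,  0],
               [ 5, 30, 15],
               [ 2, 10, 38]])

# Recall per class: correctly classified flows divided by all flows of that class.
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(labels, recall):
    print(f"{name}: recall={r:.2f}")

# Largest off-diagonal cell: the most common misclassification.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"most common confusion: {labels[i]} predicted as {labels[j]}")
```

The pair of classes found this way points to the features that fail to separate them, which is a starting point for deciding what additional data or features would help.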

In [15]:
## Report here