# Assignment: DarkNet traffic detection

## Task: Traffic Classification

Kaggle challenge: https://www.kaggle.com/peterfriedrich1/cicdarknet2020-internet-traffic

### CIC-Darknet2020

A darknet is the unused address space of the internet, which is not expected to interact with other computers in the world. Because this space only listens passively, accepting incoming packets without sending any, all communication involving it is treated as suspicious. Since the darknet contains no legitimate hosts, any traffic is assumed to be unsolicited and is typically classified as probe, backscatter, or misconfiguration. Darknets are also known as network telescopes, sinkholes, or blackholes.

Classifying darknet traffic is important for categorizing real-time applications. Analyzing darknet traffic supports early monitoring of malware before an attack and detection of malicious activity after an outbreak.


### Data
In the CICDarknet2020 dataset, a two-layered approach is used: the first layer separates benign from darknet traffic, and the second layer divides the darknet traffic into Audio-Stream, Browsing, Chat, Email, P2P, Transfer, Video-Stream and VOIP. To build a representative dataset, the dataset authors combined their earlier datasets, ISCXTor2016 and ISCXVPN2016, and merged the respective VPN and Tor traffic into the corresponding darknet categories.

## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set...
* What assumptions can we make about the data?
** Labels are correct
**
* What problems are we expecting?
** problematic values
** null values
** infinite values
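
The expected problems listed above can be checked before any cleaning. The following is a minimal sketch on a small hypothetical frame (the column names and values are made up for illustration and are not taken from the data set) that counts missing and infinite entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real flow data.
toy = pd.DataFrame({
    "Flow Bytes/s": [1.0, np.inf, np.nan, 4.0],
    "Flow Duration": [229, 407, 431, 359],
})

# Missing values per column.
null_counts = toy.isna().sum()

# Infinite values per column; restrict to numeric columns first,
# because np.isinf is not defined for strings.
numeric = toy.select_dtypes(include=[np.number])
inf_counts = np.isinf(numeric).sum()

print(null_counts)
print(inf_counts)
```

Running the same two counts on the real DataFrame would show how many rows a later `dropna()` is going to remove.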

## Task 2: First Data Analysis, Cleaning and Feature Extraction
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...
* Can you use the raw data directly, or should you extract features? What features are suitable?


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [2]:
data = pd.read_csv('Darknet.CSV', encoding="ISO-8859-1")

In [3]:
data.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,...,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Label.1
0,10.152.152.11-216.58.220.99-57158-443-6,10.152.152.11,57158,216.58.220.99,443,6,24/07/2015 04:09:48 PM,229,1,1,...,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,10.152.152.11-216.58.220.99-57159-443-6,10.152.152.11,57159,216.58.220.99,443,6,24/07/2015 04:09:48 PM,407,1,1,...,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,10.152.152.11-216.58.220.99-57160-443-6,10.152.152.11,57160,216.58.220.99,443,6,24/07/2015 04:09:48 PM,431,1,1,...,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,10.152.152.11-74.125.136.120-49134-443-6,10.152.152.11,49134,74.125.136.120,443,6,24/07/2015 04:09:48 PM,359,1,1,...,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,10.152.152.11-173.194.65.127-34697-19305-6,10.152.152.11,34697,173.194.65.127,19305,6,24/07/2015 04:09:45 PM,10778451,591,400,...,0,0,0,0,1437765000000000.0,3117718.0,1437765000000000.0,1437765000000000.0,Non-Tor,AUDIO-STREAMING


In [4]:
#data.describe()

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 141530 entries, 0 to 141529
Data columns (total 85 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Flow ID                     141530 non-null  object 
 1   Src IP                      141530 non-null  object 
 2   Src Port                    141530 non-null  int64  
 3   Dst IP                      141530 non-null  object 
 4   Dst Port                    141530 non-null  int64  
 5   Protocol                    141530 non-null  int64  
 6   Timestamp                   141530 non-null  object 
 7   Flow Duration               141530 non-null  int64  
 8   Total Fwd Packet            141530 non-null  int64  
 9   Total Bwd packets           141530 non-null  int64  
 10  Total Length of Fwd Packet  141530 non-null  int64  
 11  Total Length of Bwd Packet  141530 non-null  int64  
 12  Fwd Packet Length Max       141530 non-null  int64  
 13  Fwd Packet Len

In [6]:
data["Label"].unique()

array(['Non-Tor', 'NonVPN', 'Tor', 'VPN'], dtype=object)
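
The data description says the first layer separates benign from darknet traffic, while the `Label` column holds four values. A minimal sketch, assuming that `Tor` and `VPN` count as darknet and `Non-Tor` and `NonVPN` as benign (this grouping is an assumption, the notebook does not state it), of how the labels could be collapsed and the class balance inspected on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample of first-layer labels.
labels = pd.Series(["Non-Tor", "NonVPN", "Tor", "VPN", "Non-Tor", "Non-Tor"])

# Assumed grouping: Tor and VPN traffic form the darknet class.
DARKNET = {"Tor", "VPN"}
is_darknet = labels.isin(DARKNET)

# Relative frequency of each original label, useful for spotting imbalance.
class_share = labels.value_counts(normalize=True)

print(class_share)
print(is_darknet.sum(), "darknet flows out of", len(labels))
```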

In [24]:
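from sklearn.preprocessing import LabelEncoder
#Keep the fitted encoders so the integer codes can be mapped back to label names later
label_encoder = LabelEncoder()
label1_encoder = LabelEncoder()
data["Label"] = label_encoder.fit_transform(data["Label"])
data["Label.1"] = label1_encoder.fit_transform(data["Label.1"])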


In [25]:
#Flow ID is a collection from Src and Dst IP and Port => duplicate information
#errors="ignore" keeps the cell safe to re-run once the column is already gone
data = data.drop("Flow ID", axis = 1, errors = "ignore")

#Strip IPs and Timestamp from problematic characters
#regex=True is passed explicitly because newer pandas versions treat the pattern literally by default
data['Src IP'] = data['Src IP'].str.replace(r'\W', '', regex=True)
data['Dst IP'] = data['Dst IP'].str.replace(r'\W', '', regex=True)
#Also remove the AM/PM marker; note that this discards the half-day information
data['Timestamp'] = data['Timestamp'].str.replace(r'[\WAPM]', '', regex=True)
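
Stripping the separators turns the IP addresses and timestamps into digit strings, but it is lossy: different addresses can collapse to the same string, and removing AM/PM makes morning and afternoon indistinguishable. As a hedged alternative (not what this notebook does), the standard-library `ipaddress` module and `pd.to_datetime` could be used. The sample values and the `Hour` column below are illustrative assumptions:

```python
import ipaddress
import pandas as pd

# Two distinct addresses become identical once the dots are removed.
collide = "10.152.152.11".replace(".", "") == "101.52.152.11".replace(".", "")

# Hypothetical sample in the same format as the notebook's columns.
sample = pd.DataFrame({
    "Src IP": ["10.152.152.11", "216.58.220.99"],
    "Timestamp": ["24/07/2015 04:09:48 PM", "24/07/2015 04:09:45 AM"],
})

# Map each dotted IPv4 address to its unique 32-bit integer value.
sample["Src IP"] = sample["Src IP"].map(lambda ip: int(ipaddress.ip_address(ip)))

# Parse day-first timestamps with a 12-hour clock, keeping AM/PM.
parsed = pd.to_datetime(sample["Timestamp"], format="%d/%m/%Y %I:%M:%S %p")
sample["Hour"] = parsed.dt.hour

print(collide)
print(sample)
```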

In [26]:

data = data.replace([np.inf, -np.inf], np.nan)
data = data.dropna()

In [27]:
data.head()

Unnamed: 0,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,...,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Label.1
0,1015215211,57158,2165822099,443,6,24072015040948,229,1,1,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0,0
1,1015215211,57159,2165822099,443,6,24072015040948,407,1,1,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0,0
2,1015215211,57160,2165822099,443,6,24072015040948,431,1,1,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0,0
3,1015215211,49134,74125136120,443,6,24072015040948,359,1,1,0,...,0,0,0,0,0.0,0.0,0.0,0.0,0,0
4,1015215211,34697,17319465127,19305,6,24072015040945,10778451,591,400,64530,...,0,0,0,0,1437765000000000.0,3117718.0,1437765000000000.0,1437765000000000.0,0,0


In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141481 entries, 0 to 141529
Data columns (total 84 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Src IP                      141481 non-null  object 
 1   Src Port                    141481 non-null  int64  
 2   Dst IP                      141481 non-null  object 
 3   Dst Port                    141481 non-null  int64  
 4   Protocol                    141481 non-null  int64  
 5   Timestamp                   141481 non-null  object 
 6   Flow Duration               141481 non-null  int64  
 7   Total Fwd Packet            141481 non-null  int64  
 8   Total Bwd packets           141481 non-null  int64  
 9   Total Length of Fwd Packet  141481 non-null  int64  
 10  Total Length of Bwd Packet  141481 non-null  int64  
 11  Fwd Packet Length Max       141481 non-null  int64  
 12  Fwd Packet Length Min       141481 non-null  int64  
 13  Fwd Packet Len

## Task 3: Train a Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, and how can you deal with them?
* Discuss the results -> possible improvements?


In [16]:
from sklearn.model_selection import train_test_split

X = data.drop("Label", axis = 1)
y = data["Label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
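
With imbalanced classes, a plain random split can leave rare classes under-represented in the test set. One possible refinement is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on synthetic data (the toy arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, strongly imbalanced toy problem: 950 samples of class 0, 50 of class 1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 950 + [1] * 50)

# stratify=y_toy preserves the 95/5 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
```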

In [17]:
from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier(random_state=1, n_jobs=4)
rfmodel.fit(X_train, y_train)
predictrfmodel = rfmodel.predict(X_test)
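
If the classes turn out to be imbalanced, one option among several is to reweight them during training. `RandomForestClassifier` accepts `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below uses synthetic data from `make_classification`; the sample sizes and the name `weighted_rf` are illustrative assumptions, and whether this helps on the darknet data would have to be measured:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem with roughly 95% of the samples in one class.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.95, 0.05], random_state=0
)

# class_weight="balanced" gives the rare class a larger weight during training.
weighted_rf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=1
)
weighted_rf.fit(X_toy, y_toy)

toy_predictions = weighted_rf.predict(X_toy)
print(toy_predictions[:10])
```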


In [18]:
#feature_importance = rfmodel.feature_importances_
#f_names = np.array(X.columns)  # a DataFrame has no get_feature_names(); use its columns
#df = pd.DataFrame(list(zip(f_names, feature_importance)), columns = ["name", "importance"])
#df.head()
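
The commented-out cell above aims at listing feature importances. A minimal runnable sketch of the same idea on synthetic data, pairing `feature_importances_` with the DataFrame's column names (the toy frame and the `feature_i` names are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data wrapped in a DataFrame so that the columns have names.
X_arr, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

toy_rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important.
importance = pd.Series(
    toy_rf.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(importance.head())
```

On the real data, a very high importance for identifier-like columns such as IPs or timestamps would be a hint that the model is memorizing capture sessions instead of traffic behavior.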

## Task 4: Evaluate 
* Report the F1-Score on the test data - who will build the best model?

In [19]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictrfmodel))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18655
           1       1.00      1.00      1.00      4752
           2       1.00      0.96      0.98       282
           3       1.00      1.00      1.00      4608

    accuracy                           1.00     28297
   macro avg       1.00      0.99      0.99     28297
weighted avg       1.00      1.00      1.00     28297
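
The report above lists a macro and a weighted average. The two differ in how much the rare classes count: the macro average treats every class equally, while the weighted average is dominated by the large classes. A small sketch with hypothetical labels (not the notebook's predictions) that computes both with `f1_score`:

```python
from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions: one sample of the rare class 2 is missed.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)       # one F1 value per class
macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean over classes
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support

print(per_class, macro, weighted)
```

For an imbalanced problem, the macro F1 is usually the more informative single number to report, since errors on a small class are not hidden by the large ones.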



NotFittedError: This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
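
scikit-learn raises this `NotFittedError` when `transform` or `inverse_transform` is called on a `LabelEncoder` that has never been fitted, for example a fresh `LabelEncoder()` instance. The cell that triggered it is not shown, so the following is only a guess at the cause: it sketches how keeping the fitted encoder allows the integer codes to be decoded, and how a fresh instance fails. The variable names are illustrative.

```python
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder

# Fit once and keep the encoder object around.
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(["Non-Tor", "NonVPN", "Tor", "VPN", "Tor"])

# The fitted encoder can translate integer codes back to the original names.
decoded = label_encoder.inverse_transform([0, 1, 2, 3])

# A new, unfitted encoder knows no classes and therefore raises NotFittedError.
try:
    LabelEncoder().inverse_transform([0])
    fresh_failed = False
except NotFittedError:
    fresh_failed = True

print(list(encoded), list(decoded), fresh_failed)
```

Since the encoder sorts the class names, the codes 0 to 3 in the classification report would correspond to `Non-Tor`, `NonVPN`, `Tor` and `VPN`, in that order.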