# Exploring the CIC-Darknet 2020 Dataset and a Classification Study

---

## 0- Imports

In [55]:
import pandas as pd
pd.options.display.min_rows = 100
pd.options.display.max_columns = 100

import os
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import mutual_info_classif

---

## 1- Introduction

**Purpose:** This presentation examines the CIC-Darknet 2020 dataset to understand its significance in its field, its characteristics, and its potential applications.

---

## 2- Background

Data plays a central role in cybersecurity. The field, one of whose core goals is keeping our data secure, has many datasets that serve as material for studies and development work.

CIC-Darknet 2020 is an important dataset for three reasons: it covers the Darknet, a key concept in cybersecurity; it lends itself to classification work; and a cleaned version converted to a Parquet file is available online (even though this presentation starts from the raw version).

---

## 3- Examining the Data

### 3.1- Introduction to the Examination
Before examining the data in general, let us look at the concept it deals with: the Darknet. The Darknet is the unused address space of the internet, which is not expected to interact with other computers. Because of its passive listening nature, any communication coming from this space is treated with suspicion. Incoming packets are accepted, but outgoing packets are not supported.

The Darknet is also known as:
- Network telescope: an internet system that lets one observe various large-scale events taking place around the world
- Sinkhole: a DNS server set up to hand out non-routable addresses for all domain names
- Blackhole: a block of unused, routable addresses belonging to an internet service provider

Darknet traffic classification is important for categorizing real-time applications. Darknet analysis helps monitor malware before an attack and detect malicious activity after an outbreak.

This is an academic network traffic classification dataset. It was published by CIC (Canadian Institute for Cybersecurity) and is built by combining two earlier CIC datasets ([ISCXTor2016](https://www.unb.ca/cic/datasets/tor.html) and [ISCXVPN2016](https://www.unb.ca/cic/datasets/vpn.html)).

The dataset uses a two-layer approach. The first layer generates benign and Darknet traffic. In the second layer, the Darknet traffic is made up of Audio-Stream, Browsing, Chat, Email, P2P, Transfer, Video-Stream, and VOIP (Voice over Internet Protocol). Table 1 lists the Darknet traffic categories and the applications used to generate them.

##### **Table 1: Darknet Network Traffic Details**<sup>[1]</sup>
| Category     | Applications Used                                                                            |
|--------------|----------------------------------------------------------------------------------------------|
| Audio-Stream | Vimeo and YouTube                                                                            |
| Browsing     | Firefox and Chrome                                                                           |
| Chat         | ICQ, AIM, Skype, Facebook, and Hangouts                                                      |
| Email        | SMTPS, POP3S, and IMAPS                                                                      |
| P2P          | uTorrent and Transmission (BitTorrent)                                                       |
| Transfer     | Skype, FileZilla, and FTP over SSH (SFTP) and FTP over SSL (FTPS) using an external service  |
| Video-Stream | Vimeo and YouTube                                                                            |
| VOIP         | Facebook, Skype, and Hangouts voice calls                                                    |

---

### 3.2- Details of the Data
The two-layer approach described above and the distribution of the categories are shown in Figure 1.<sup>[2]</sup>

![title](darknet.jpg)

The first layer shows benign and Darknet traffic, while the second layer shows the distribution of the categories that make up the Darknet traffic.

The raw data has 85 columns (features) and 141,530 rows. The features are described below:
- **Flow ID:** Unique identifier for each network flow.
- **Src IP:** Source IP address of the traffic.
- **Src Port:** Source port number of the traffic.
- **Dst IP:** Destination IP address.
- **Dst Port:** Destination port number.
- **Protocol:** Protocol used for the traffic (e.g., TCP, UDP).
- **Timestamp:** Timestamp of when the traffic occurred.
- **Flow Duration:** Duration of the flow, i.e., the time between its first and last packet.
- **Total Fwd Packet:** Total number of packets sent in the forward direction.
- **Total Bwd packets:** Total number of packets sent in the backward direction.
- **Total Length of Fwd Packet:** Total length of the packets sent in the forward direction.
- **Total Length of Bwd Packet:** Total length of the packets sent in the backward direction.
- **Fwd Packet Length Max:** Maximum length of the packets sent in the forward direction.
- **Fwd Packet Length Min:** Minimum length of the packets sent in the forward direction.
- **Fwd Packet Length Mean:** Mean length of the packets sent in the forward direction.
- **Fwd Packet Length Std:** Standard deviation of the length of the packets sent in the forward direction.
- **Bwd Packet Length Max:** Maximum length of the packets sent in the backward direction.
- **Bwd Packet Length Min:** Minimum length of the packets sent in the backward direction.
- **Bwd Packet Length Mean:** Mean length of the packets sent in the backward direction.
- **Bwd Packet Length Std:** Standard deviation of the length of the packets sent in the backward direction.
- **Flow Bytes/s:** Flow rate in bytes per second.
- **Flow Packets/s:** Flow rate in packets per second.
- **Flow IAT Mean:** Mean time between packets in the flow (IAT = inter-arrival time).
- **Flow IAT Std:** Standard deviation of the time between packets in the flow.
- **Flow IAT Max:** Maximum time between packets in the flow.
- **Flow IAT Min:** Minimum time between packets in the flow.
- **Fwd IAT Total:** Total time between packets sent in the forward direction.
- **Fwd IAT Mean:** Mean time between packets sent in the forward direction.
- **Fwd IAT Std:** Standard deviation of the time between packets sent in the forward direction.
- **Fwd IAT Max:** Maximum time between packets sent in the forward direction.
- **Fwd IAT Min:** Minimum time between packets sent in the forward direction.
- **Bwd IAT Total:** Total time between packets sent in the backward direction.
- **Bwd IAT Mean:** Mean time between packets sent in the backward direction.
- **Bwd IAT Std:** Standard deviation of the time between packets sent in the backward direction.
- **Bwd IAT Max:** Maximum time between packets sent in the backward direction.
- **Bwd IAT Min:** Minimum time between packets sent in the backward direction.
- **Fwd PSH Flags:** Number of forward packets with the PUSH flag set (the flag asking that the data be delivered to the receiving application as soon as possible).
- **Bwd PSH Flags:** Number of backward packets with the PUSH flag set.
- **Fwd URG Flags:** Number of forward packets with the URGENT flag set (the flag indicating that the data needs urgent attention).
- **Bwd URG Flags:** Number of backward packets with the URGENT flag set.
- **Fwd Header Length:** Total header length of the packets sent in the forward direction.
- **Bwd Header Length:** Total header length of the packets sent in the backward direction.
- **Fwd Packets/s:** Packets per second in the forward direction.
- **Bwd Packets/s:** Packets per second in the backward direction.
- **Packet Length Min:** Minimum packet length.
- **Packet Length Max:** Maximum packet length.
- **Packet Length Mean:** Mean packet length.
- **Packet Length Std:** Standard deviation of the packet length.
- **Packet Length Variance:** Variance of the packet length.
- **FIN Flag Count:** Number of packets with the FIN flag set (Finish, signals that the TCP connection is being closed).
- **SYN Flag Count:** Number of packets with the SYN flag set (Synchronize, starts the setup of a TCP connection).
- **RST Flag Count:** Number of packets with the RST flag set (Reset, used to reset the connection).
- **PSH Flag Count:** Number of packets with the PSH flag set.
- **ACK Flag Count:** Number of packets with the ACK flag set (Acknowledgement, used to confirm received data).
- **URG Flag Count:** Number of packets with the URG flag set.
- **CWE Flag Count:** Number of packets with the CWR flag set (Congestion Window Reduced, signals that the sender received a TCP segment with the ECE flag set and reduced its transmission rate accordingly). The column name spells the flag as "CWE".
- **ECE Flag Count:** Number of packets with the ECE flag set (Explicit Congestion Notification Echo, tells the sender that congestion occurred in the network).
- **Down/Up Ratio:** Ratio of download traffic to upload traffic.
- **Average Packet Size:** Average packet size.
- **Fwd Segment Size Avg:** Average size of the segments sent in the forward direction.
- **Bwd Segment Size Avg:** Average size of the segments sent in the backward direction.
- **Fwd Bytes/Bulk Avg:** Average number of bytes sent in bulk in the forward direction.
- **Fwd Packet/Bulk Avg:** Average number of packets sent in bulk in the forward direction.
- **Fwd Bulk Rate Avg:** Average bulk rate in the forward direction.
- **Bwd Bytes/Bulk Avg:** Average number of bytes sent in bulk in the backward direction.
- **Bwd Packet/Bulk Avg:** Average number of packets sent in bulk in the backward direction.
- **Bwd Bulk Rate Avg:** Average bulk rate in the backward direction.
- **Subflow Fwd Packets:** Number of subflow packets sent in the forward direction.
- **Subflow Fwd Bytes:** Number of subflow bytes sent in the forward direction.
- **Subflow Bwd Packets:** Number of subflow packets sent in the backward direction.
- **Subflow Bwd Bytes:** Number of subflow bytes sent in the backward direction.
- **FWD Init Win Bytes:** Initial window size in the forward direction.
- **Bwd Init Win Bytes:** Initial window size in the backward direction.
- **Fwd Act Data Pkts:** Number of forward packets that carry actual data.
- **Fwd Seg Size Min:** Minimum segment size in the forward direction.
- **Active Mean:** Mean time a flow was active.
- **Active Std:** Standard deviation of the time a flow was active.
- **Active Max:** Maximum time a flow was active.
- **Active Min:** Minimum time a flow was active.
- **Idle Mean:** Mean time a flow was idle.
- **Idle Std:** Standard deviation of the time a flow was idle.
- **Idle Max:** Maximum time a flow was idle.
- **Idle Min:** Minimum time a flow was idle.
- **Label:** Classification label giving the traffic type (Non-Tor, NonVPN, Tor, VPN).
- **Label.1:** Category of the record (AUDIO-STREAMING, Browsing, Chat, Email, File-Transfer, File-transfer, P2P, Video-Streaming, Audio-Streaming, Video-streaming, VOIP).
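
To make a few of the definitions above concrete, here is a minimal sketch that derives some flow-level features from a toy packet list. The function `flow_features` and the tuple layout are hypothetical and are not part of the tool that produced the dataset. The time unit is assumed to be microseconds, an inference from the first sample row of the data (a duration of 229 with 2 packets and 8733.624454 packets/s), not something the dataset description states here.

```python
from statistics import mean


def flow_features(packets):
    """Derive a few flow features from (timestamp_us, direction, payload_bytes) tuples.

    Hypothetical helper: packets must be sorted by time, direction is "fwd" or "bwd",
    and the flow must last longer than zero microseconds.
    """
    times = [t for t, _, _ in packets]
    duration_us = times[-1] - times[0]                  # first to last packet
    iats = [b - a for a, b in zip(times, times[1:])]    # inter-arrival times
    n_fwd = sum(1 for _, d, _ in packets if d == "fwd")
    n_bwd = len(packets) - n_fwd
    total_bytes = sum(size for _, _, size in packets)
    seconds = duration_us / 1_000_000                   # assumed unit: microseconds
    return {
        "Flow Duration": duration_us,
        "Total Fwd Packet": n_fwd,
        "Total Bwd packets": n_bwd,
        "Flow IAT Mean": mean(iats),
        "Flow Bytes/s": total_bytes / seconds,
        "Flow Packets/s": len(packets) / seconds,
        "Fwd Packets/s": n_fwd / seconds,
    }


# One empty forward packet followed 229 microseconds later by one empty backward packet
toy_flow = [(0, "fwd", 0), (229, "bwd", 0)]
features = flow_features(toy_flow)
print(features)
```

Under the microsecond assumption, the toy flow reproduces the rate values of the first sample row, which is a useful sanity check when reading the per-second columns.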


---

## 4- Data Preprocessing

To begin, we first need to load the data.

In [2]:
data = pd.read_csv('DarknetHamVeriSeti.CSV')
df = pd.DataFrame(data)

Now let us look at the percentage of missing values in each feature.

In [5]:
(df.isnull().sum().sort_values(ascending=False) / len(df)) * 100

Flow Bytes/s              0.033209
Flow ID                   0.000000
Bwd Bytes/Bulk Avg        0.000000
Fwd Packet/Bulk Avg       0.000000
Fwd Bytes/Bulk Avg        0.000000
Bwd Segment Size Avg      0.000000
Fwd Segment Size Avg      0.000000
Average Packet Size       0.000000
Down/Up Ratio             0.000000
ECE Flag Count            0.000000
CWE Flag Count            0.000000
URG Flag Count            0.000000
ACK Flag Count            0.000000
PSH Flag Count            0.000000
RST Flag Count            0.000000
SYN Flag Count            0.000000
FIN Flag Count            0.000000
Packet Length Variance    0.000000
Packet Length Std         0.000000
Packet Length Mean        0.000000
Packet Length Max         0.000000
Fwd Bulk Rate Avg         0.000000
Bwd Packet/Bulk Avg       0.000000
Bwd Packets/s             0.000000
Bwd Bulk Rate Avg         0.000000
Label                     0.000000
Idle Min                  0.000000
Idle Max                  0.000000
Idle Std            

Only the *Flow Bytes/s* feature has missing values, about 0.03% of its rows. Dropping such a small fraction will not cause any problems, so instead of trying to fill these gaps we simply drop the affected rows.

In [6]:
df.dropna(inplace=True)
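
The missing-value check and the drop above can be tried out on a toy frame. This is only an illustrative sketch: the `toy` data is made up, and `isna().mean() * 100` is used as an equivalent shorthand for dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value in one of two columns
toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.nan, 30.0, 40.0],
    "Flow Packets/s": [1.0, 2.0, 3.0, 4.0],
})

# Share of missing values per column, in percent (same result as sum() / len())
null_pct = toy.isna().mean().sort_values(ascending=False) * 100
print(null_pct)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "row(s) dropped")
```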

In [7]:
df.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Label.1
0,10.152.152.11-216.58.220.99-57158-443-6,10.152.152.11,57158,216.58.220.99,443,6,24/07/2015 04:09:48 PM,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,10.152.152.11-216.58.220.99-57159-443-6,10.152.152.11,57159,216.58.220.99,443,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,10.152.152.11-216.58.220.99-57160-443-6,10.152.152.11,57160,216.58.220.99,443,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,10.152.152.11-74.125.136.120-49134-443-6,10.152.152.11,49134,74.125.136.120,443,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,10.152.152.11-173.194.65.127-34697-19305-6,10.152.152.11,34697,173.194.65.127,19305,6,24/07/2015 04:09:45 PM,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Non-Tor,AUDIO-STREAMING


The first thing that stands out is that the data actually contains two columns named *Label*, so the second one was named *Label.1*. We rename it to *Category*.

In [9]:
df = df.rename(columns={'Label.1': 'Category'})
df.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Category
0,10.152.152.11-216.58.220.99-57158-443-6,10.152.152.11,57158,216.58.220.99,443,6,24/07/2015 04:09:48 PM,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,10.152.152.11-216.58.220.99-57159-443-6,10.152.152.11,57159,216.58.220.99,443,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,10.152.152.11-216.58.220.99-57160-443-6,10.152.152.11,57160,216.58.220.99,443,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,10.152.152.11-74.125.136.120-49134-443-6,10.152.152.11,49134,74.125.136.120,443,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,10.152.152.11-173.194.65.127-34697-19305-6,10.152.152.11,34697,173.194.65.127,19305,6,24/07/2015 04:09:45 PM,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Non-Tor,AUDIO-STREAMING


Now we can drop some columns that will not be useful. First among these are the identifier columns. Since they are identifiers rather than measurements, we can neither examine their distributions meaningfully nor use them to draw inferences. In addition, this kind of metadata would act as an overly strong shortcut predictor<sup>[3]</sup> at the prediction stage and would hurt the model.

Fortunately, a drop list that works for most datasets shared by CIC is publicly available, so we can drop these features without hunting for them manually.

In [10]:
drop_columns = [
    "Flow ID",    
    'Fwd Header Length.1',
    "Source IP", "Src IP",
    "Source Port", "Src Port",
    "Destination IP", "Dst IP",
    "Destination Port", "Dst Port",
    # "Timestamp", not dropped because I plan to use it
]

In [11]:
df.drop(columns=drop_columns, inplace=True, errors="ignore")
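
Because the drop list covers several naming schemes, most of its entries do not exist in this particular file, and `errors="ignore"` silently skips them. A small sketch with made-up data (the `toy` frame and the `candidates` list are illustrative) shows how to see which names actually matched:

```python
import pandas as pd

# Made-up frame with two identifier columns and one real feature
toy = pd.DataFrame({
    "Flow ID": ["a", "b"],
    "Src IP": ["10.0.0.1", "10.0.0.2"],
    "Protocol": [6, 17],
})

# "Source IP" is not a column here, just like some entries of the shared drop list
candidates = ["Flow ID", "Source IP", "Src IP"]

# Names from the list that are really present in the frame
present = [c for c in candidates if c in toy.columns]

# errors="ignore" skips the missing names instead of raising KeyError
trimmed = toy.drop(columns=candidates, errors="ignore")

print("dropped:", present)
print("remaining:", list(trimmed.columns))
```

Listing the matched names before dropping makes it easy to confirm how many columns were really removed.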

In [12]:
df.shape

(141483, 80)

As the shape shows, 5 features have been dropped (85 columns down to 80).

Now let us check for duplicate rows. Duplicate rows inflate the weight of the records they repeat, so they can introduce unwanted bias at the prediction stage.

In [13]:
df.duplicated().sum()

24905

There are 24,905 duplicate rows. We need to remove them from the data.

In [14]:
df.drop_duplicates(inplace=True)
df.shape

(116578, 80)
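
As a quick illustration of what the two calls above count and remove, here is a sketch on made-up rows. Note that `duplicated()` flags only the repeats, not the first occurrence. Also, since the check runs after the identifier columns were dropped, rows that differed only in those columns are treated as duplicates.

```python
import pandas as pd

# Made-up rows: the first row appears three times in total
toy = pd.DataFrame({
    "Flow Duration": [229, 229, 407, 229],
    "Label": ["Non-Tor", "Non-Tor", "Non-Tor", "Non-Tor"],
})

# Counts the second and third copies, but not the first occurrence
n_dupes = int(toy.duplicated().sum())

# Keeps the first occurrence of each distinct row
deduped = toy.drop_duplicates()

print(n_dupes, "duplicate row(s);", len(deduped), "row(s) kept")
```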

We now have a cleaned dataset. It is worth saving in case we need it later. We will save it as a Parquet file rather than CSV so that later operations such as reading are faster.

In [16]:
# Create the output directory if it does not exist yet
os.makedirs('./artifacts', exist_ok=True)

if os.path.exists('./artifacts/Darknet.parquet'):
    os.remove('./artifacts/Darknet.parquet')

df.to_parquet('./artifacts/Darknet.parquet')

A sample of the data in its current state:

In [17]:
df.head()

Unnamed: 0,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Category
0,6,24/07/2015 04:09:48 PM,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,6,24/07/2015 04:09:45 PM,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Non-Tor,AUDIO-STREAMING


Now we need to encode the categorical data. The *Label* and *Category* variables are categorical. Let us examine them in turn.

In [52]:
df.Label.value_counts()

Label
Non-Tor    68840
NonVPN     23761
VPN        22798
Tor         1179
Name: count, dtype: int64

*Label* is the feature we will predict later. Rather than encoding it, for now we will collapse it into a Benign/Darknet classification.

In [21]:
encoded_df = df.copy()

d = {
    'Non-Tor': 'Benign',
    'NonVPN': 'Benign',
    'Tor': 'Darknet',
    'VPN': 'Darknet'
}

encoded_df['Label'] = encoded_df['Label'].map(d)
encoded_df.head()

Unnamed: 0,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Category
0,6,24/07/2015 04:09:48 PM,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,AUDIO-STREAMING
1,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,AUDIO-STREAMING
2,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,AUDIO-STREAMING
3,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,AUDIO-STREAMING
4,6,24/07/2015 04:09:45 PM,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Benign,AUDIO-STREAMING


In [22]:
encoded_df.Label.value_counts()

Label
Benign     92601
Darknet    23977
Name: count, dtype: int64
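
One thing to keep in mind with `Series.map` and a dictionary: any value missing from the dictionary becomes `NaN` without a warning. The sketch below, with a made-up `"Unknown"` label that does not occur in the dataset, shows one way to check for such values after mapping.

```python
import pandas as pd

# "Unknown" is a hypothetical label used only to trigger the NaN case
labels = pd.Series(["Non-Tor", "NonVPN", "Tor", "VPN", "Unknown"])

mapping = {
    "Non-Tor": "Benign",
    "NonVPN": "Benign",
    "Tor": "Darknet",
    "VPN": "Darknet",
}

mapped = labels.map(mapping)

# Original values that had no entry in the mapping
unmapped = labels[mapped.isna()].unique().tolist()
print("unmapped labels:", unmapped)
```

In the notebook the two counts after mapping add up to the full row count (92601 + 23977 = 116578), so no label was lost there.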

Next is the *Category* feature.

In [23]:
encoded_df['Category'].unique()

array(['AUDIO-STREAMING', 'Browsing', 'Chat', 'Email', 'File-Transfer',
       'File-transfer', 'P2P', 'Video-Streaming', 'Audio-Streaming',
       'Video-streaming', 'VOIP'], dtype=object)

First, we see several entries that refer to the same thing. We need to pick one of each pair: *AUDIO-STREAMING*/*Audio-Streaming*, *Video-Streaming*/*Video-streaming*, and *File-Transfer*/*File-transfer*. For consistency, we capitalize each word unless the value is an abbreviation.

In [24]:
d = {
    'AUDIO-STREAMING': 'Audio-Streaming',
    'Audio-Streaming': 'Audio-Streaming',
    'Video-streaming': 'Video-Streaming',
    'Video-Streaming': 'Video-Streaming',
    'File-transfer': 'File-Transfer',
    'File-Transfer': 'File-Transfer',
    'Browsing': 'Browsing',
    'Chat': 'Chat',
    'Email': 'Email',
    'P2P': 'P2P',
    'VOIP': 'VOIP'
}

encoded_df['Category'] = encoded_df['Category'].map(d)
encoded_df['Category'].unique()

array(['Audio-Streaming', 'Browsing', 'Chat', 'Email', 'File-Transfer',
       'P2P', 'Video-Streaming', 'VOIP'], dtype=object)
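
The capitalization rule can also be written as a small function instead of an explicit dictionary. This is an alternative sketch, not what the notebook does: `normalize_category` and the `ACRONYMS` set are hypothetical names, and the set has to be maintained by hand because an all-caps value such as *AUDIO-STREAMING* is not an abbreviation.

```python
# Values that should stay fully upper-case (assumed list, maintained by hand)
ACRONYMS = {"P2P", "VOIP"}


def normalize_category(raw):
    """Capitalize each hyphen-separated word unless the value is a known acronym."""
    if raw.upper() in ACRONYMS:
        return raw.upper()
    return "-".join(part.capitalize() for part in raw.split("-"))


raw_values = [
    "AUDIO-STREAMING", "Browsing", "Chat", "Email", "File-Transfer",
    "File-transfer", "P2P", "Video-Streaming", "Audio-Streaming",
    "Video-streaming", "VOIP",
]

normalized = sorted({normalize_category(v) for v in raw_values})
print(normalized)
```

The explicit dictionary used in the notebook has the advantage that an unexpected value shows up as `NaN` rather than being reshaped quietly.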

We have fixed the format of the entries. On reflection, however, we conclude that feeding this feature to the model does not make much sense. The work was not wasted, though: we can use it to fix the format of the same variable in the original DataFrame.

In [25]:
df['Category'] = encoded_df['Category']
encoded_df.drop(columns=['Category'], inplace=True)

df['Category'].value_counts()

Category
Browsing           32567
P2P                24243
Audio-Streaming    17822
Chat               11428
File-Transfer      11106
Video-Streaming     9715
Email               6131
VOIP                3566
Name: count, dtype: int64

We also save the data as it stands before the next steps so that we can come back here if something goes wrong.

In [26]:
if os.path.exists('./artifacts/EncodedDarknet.parquet'):
    os.remove('./artifacts/EncodedDarknet.parquet')

encoded_df.to_parquet('./artifacts/EncodedDarknet.parquet')

Now we need to check for outliers. We will use the IQR method. Checking for outliers before normalization, and handling them as needed, improves both the quality of the data and the benefit we get from normalization.

In [27]:
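def outlier_finder(data):
    # Quartiles and interquartile range of the column
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Inner fences: values beyond 1.5 * IQR are at least mild outliers
    inner_lower_fence = Q1 - 1.5*IQR
    inner_upper_fence = Q3 + 1.5*IQR

    # Outer fences: values beyond 3 * IQR are extreme outliers
    outer_lower_fence = Q1 - 3*IQR
    outer_upper_fence = Q3 + 3*IQR

    # Comparisons are strict, so a value lying exactly on a fence falls in neither list
    mild_outliers = [i for i in data if (i < inner_lower_fence and i > outer_lower_fence) or (i > inner_upper_fence and i < outer_upper_fence)]
    extreme_outliers = [i for i in data if i < outer_lower_fence or i > outer_upper_fence]

    return mild_outliers, extreme_outliers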


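def outlier_printer(outliers, feature, df_):
    # Prints, in Turkish, the count and the percentage of mild and extreme outliers.
    # "<feature> özelliğinde N mild outlier var." means "<feature> has N mild outliers."
    # "<feature> özelliği X% mild outlierdan oluşuyor." means "X% of <feature> is mild outliers."
    mild_outliers = outliers[0]
    extreme_outliers = outliers[1]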
    
    if len(mild_outliers) != 0:
        o1 = f"{feature} özelliğinde {len(mild_outliers)} mild outlier var."
        o2 = f"{feature} özelliği {round((len(mild_outliers) / len(df_)) * 100, 2)}% mild outlierdan oluşuyor."

        print(o1)
        print(o2)

        print()

    if len(extreme_outliers) != 0:
        o1 = f"{feature} özelliğinde {len(extreme_outliers)} extreme outlier var."
        o2 = f"{feature} özelliği {round((len(extreme_outliers) / len(df_)) * 100, 2)}% extreme outlierdan oluşuyor."

        print(o1)
        print(o2)
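
To see how the fences behave, here is a vectorized sketch of the same idea on a tiny made-up series. The function `iqr_outlier_masks` is a hypothetical variant, not the notebook's `outlier_finder`; it uses the same strict comparisons but returns boolean masks instead of lists.

```python
import pandas as pd


def iqr_outlier_masks(series):
    """Return (mild, extreme) boolean masks based on 1.5*IQR and 3*IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    # Mild: strictly between an inner fence and the matching outer fence
    mild = ((series < inner_low) & (series > outer_low)) | (
        (series > inner_high) & (series < outer_high)
    )
    # Extreme: strictly beyond an outer fence
    extreme = (series < outer_low) | (series > outer_high)
    return mild, extreme


# Made-up values: Q1 = 3.25, Q3 = 7.75, IQR = 4.5,
# so the upper fences are 14.5 (inner) and 21.25 (outer)
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 18, 100])
mild_mask, extreme_mask = iqr_outlier_masks(toy)

print("mild:", toy[mild_mask].tolist())
print("extreme:", toy[extreme_mask].tolist())
```

Boolean masks avoid the Python-level loop over every value, which may matter on a column with more than a hundred thousand rows.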

We are not going to run an outlier check on every feature, especially since some features only hold summary statistics of other base features, so checking them for outliers would not make sense. We first need to choose the features to check.

In [28]:
o_list = [
    'Flow Duration',
    'Total Fwd Packet',
    'Total Bwd packets',
    'Total Length of Fwd Packet',
    'Total Length of Bwd Packet',
    'Flow Bytes/s',
    'Flow Packets/s',
    'Fwd IAT Total',
    'Bwd IAT Total',
    'Fwd Header Length',
    'Bwd Header Length',
    'Fwd Packets/s',
    'Bwd Packets/s',
    'Subflow Fwd Packets',
    'Subflow Fwd Bytes',
    'Subflow Bwd Packets',
    'Subflow Bwd Bytes',
    'FWD Init Win Bytes',
    'Bwd Init Win Bytes',
]

That does not mean we will act on the outliers of every one of these features. Outlier handling is delicate: a value we discard as an outlier may actually carry information. These are only the suspected features. We will inspect the output for each feature and then decide.

In [29]:
outliers_d = {}

for column in o_list: 
    print(f"{column} için outlier bilgisi:")
    print("-"*len(f"{column} için outlier bilgisi:"))
    outliers = outlier_finder(encoded_df[column])
    outlier_printer(outliers, column, encoded_df)
    outliers_d[column] = outliers
    if column != o_list[-1]:
        print("\n")

Flow Duration için outlier bilgisi:
-----------------------------------
Flow Duration özelliğinde 1606 mild outlier var.
Flow Duration özelliği 1.38% mild outlierdan oluşuyor.

Flow Duration özelliğinde 22572 extreme outlier var.
Flow Duration özelliği 19.36% extreme outlierdan oluşuyor.


Total Fwd Packet için outlier bilgisi:
--------------------------------------
Total Fwd Packet özelliğinde 4496 mild outlier var.
Total Fwd Packet özelliği 3.86% mild outlierdan oluşuyor.

Total Fwd Packet özelliğinde 14334 extreme outlier var.
Total Fwd Packet özelliği 12.3% extreme outlierdan oluşuyor.


Total Bwd packets için outlier bilgisi:
---------------------------------------
Total Bwd packets özelliğinde 3767 mild outlier var.
Total Bwd packets özelliği 3.23% mild outlierdan oluşuyor.

Total Bwd packets özelliğinde 13786 extreme outlier var.
Total Bwd packets özelliği 11.83% extreme outlierdan oluşuyor.


Total Length of Fwd Packet için outlier bilgisi:
-------------------------------------

Since the data set is already fairly large, let's drop the outliers and see how much data remains.

In [30]:
encoded_df.shape

(116578, 79)

In [31]:
outlier_removed_df = encoded_df.copy()

for key, value in outliers_d.items():
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(value[1])]
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(value[0])]

outlier_removed_df.shape

(45136, 79)

We lose about 60% of the data (45,136 of 116,578 rows remain). This is not a desirable result: although it cleans the data, it would greatly increase bias. As noted earlier, it is not necessary to remove outliers for every feature anyway. So we need to examine the features one by one and decide. Looking at the summary statistics of the features may also help at this stage.

In [32]:
encoded_df[o_list].describe()

Unnamed: 0,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Flow Bytes/s,Flow Packets/s,Fwd IAT Total,Bwd IAT Total,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes
count,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0,116578.0
mean,20069320.0,162.371382,164.391257,118070.0,138609.3,inf,inf,19232370.0,16073270.0,3098.832,3294.911,6749.886,4687.435864,0.258968,44.440795,0.0,62.20461,5156.647901,1798.067457
std,37866920.0,2536.073439,3721.614251,3508775.0,4990447.0,,,37506880.0,35494450.0,50676.85,76191.02,38011.79,21732.160477,0.43807,147.094654,0.0,138.879967,10176.550596,7820.753841
min,0.0,1.0,0.0,0.0,0.0,0.0,0.01666866,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,980.0,1.0,0.0,0.0,0.0,0.0,0.8313146,0.0,0.0,16.0,0.0,0.5400239,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,411276.5,2.0,1.0,44.0,20.0,106.879,8.393401,87869.5,0.0,24.0,16.0,4.879721,1.843236,0.0,18.0,0.0,1.0,913.0,0.0
75%,10088140.0,4.0,3.0,196.0,236.0,1566.665,2254.791,9577738.0,924540.8,92.0,72.0,1169.591,57.52086,1.0,22.0,0.0,75.0,7992.0,1016.0
max,120000000.0,238161.0,470862.0,769307400.0,670428700.0,inf,inf,120000000.0,120000000.0,4768644.0,9417240.0,2000000.0,1000000.0,1.0,6644.0,0.0,4872.0,65535.0,65535.0


Here we see inf, that is infinite, values in *Flow Bytes/s* and *Flow Packets/s*. These features need intervention. Apart from these, *Bwd Packets/s*, *Subflow Fwd Bytes*, *Subflow Bwd Bytes*, *FWD Init Win Bytes* and *Bwd Init Win Bytes* also look problematic. However, when we read the descriptions of these features, we conclude that their outlying values can be considered normal and may even carry important information. We therefore leave these features untouched.
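
The infinite mean and the NaN standard deviation in the summary come from infinite entries in these two columns. The sketch below, on a small made-up frame (`toy_df` and its values are hypothetical), shows one way to count such entries and to isolate the rows that are fully finite. It is only a quick check, not the removal step used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two rate columns.
toy_df = pd.DataFrame({
    "Flow Bytes/s": [0.0, 106.9, np.inf, 1566.7],
    "Flow Packets/s": [0.8, np.inf, np.inf, 8.4],
})

# Number of infinite entries per column.
inf_counts = np.isinf(toy_df).sum()

# Rows in which every value is finite.
finite_rows = toy_df[~np.isinf(toy_df).any(axis=1)]

# A single infinite entry is enough to make the column mean infinite.
mean_with_inf = toy_df["Flow Bytes/s"].mean()

print(inf_counts)
print(len(finite_rows), mean_with_inf)
```

Because an infinite value is larger than any finite upper fence, the IQR rule classifies it as an extreme outlier, so dropping the outliers of these two columns also drops the infinite rows.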

In [33]:
outlier_removed_df = encoded_df.copy()

for key in ['Flow Bytes/s', 'Flow Packets/s']:
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(outliers_d[key][1])]
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(outliers_d[key][0])]

outlier_removed_df.shape

(81879, 79)

We have lost roughly 30% of the data (81,879 of 116,578 rows remain). This would normally count as a large share, but since a considerable amount of data remains even after this step, it is not a problem. We save this version of the data as well and continue.

In [34]:
if os.path.exists('./artifacts/OutlierRemovedDarknet.parquet'):
    os.remove('./artifacts/OutlierRemovedDarknet.parquet')
    
outlier_removed_df.to_parquet('./artifacts/OutlierRemovedDarknet.parquet')

In [35]:
outlier_removed_df.head()

Unnamed: 0,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
1,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign
2,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign
3,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign
5,6,24/07/2015 04:10:00 PM,421362,5,3,72,79,72,0,14.4,32.199379,79,0,26.333333,45.610671,358.361694,18.98605,60194.57143,157282.8248,416869,13,421362,105340.5,207813.1,417054,190,417945,208972.5,294010.0,416869,1076,0,0,0,0,112,72,11.866281,7.119769,0,79,16.777778,33.338333,1111.444444,1,2,0,2,7,0,0,0,0,18.875,14.4,26.333333,0,0,0,0,0,0,0,9,0,9,14600,913,1,20,0,0,0,0,1437770000000000.0,186611.1,1437770000000000.0,1437770000000000.0,Benign
6,6,24/07/2015 04:09:45 PM,119682119,488,487,89259,314105,1460,0,182.907787,360.042956,1460,0,644.979466,647.036832,3370.294605,8.14658,122876.9189,822593.3869,13485507,2,119642787,245673.0739,1151352.0,13485507,5,119028043,244913.6687,1155469.0,13522678,18,0,0,0,0,9760,9740,4.077468,4.069112,0,1460,413.282787,571.826981,326986.0964,0,0,0,403,975,0,0,0,0,413.706667,182.907787,644.979466,0,0,0,0,44,9225,0,91,0,322,12108,9520,209,20,0,0,0,0,1437770000000000.0,31846300.0,1437770000000000.0,1437760000000000.0,Benign


Now we can move on to feature engineering. I have one idea here that I think could be useful.

Using the Timestamp feature, we can extract the hour of the network flow as a feature. The intensity of malicious activity may vary with the time of day.

In [41]:
hours = []

for i in outlier_removed_df['Timestamp']:
    hours.append(int(i.split()[1].split(':')[0]))

outlier_removed_df['Hour'] = hours
outlier_removed_df.drop(columns=['Timestamp'], inplace=True)

outlier_removed_df.head()

Unnamed: 0,Protocol,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Hour
1,6,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,4
2,6,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,4
3,6,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Benign,4
5,6,421362,5,3,72,79,72,0,14.4,32.199379,79,0,26.333333,45.610671,358.361694,18.98605,60194.57143,157282.8248,416869,13,421362,105340.5,207813.1,417054,190,417945,208972.5,294010.0,416869,1076,0,0,0,0,112,72,11.866281,7.119769,0,79,16.777778,33.338333,1111.444444,1,2,0,2,7,0,0,0,0,18.875,14.4,26.333333,0,0,0,0,0,0,0,9,0,9,14600,913,1,20,0,0,0,0,1437770000000000.0,186611.1,1437770000000000.0,1437770000000000.0,Benign,4
6,6,119682119,488,487,89259,314105,1460,0,182.907787,360.042956,1460,0,644.979466,647.036832,3370.294605,8.14658,122876.9189,822593.3869,13485507,2,119642787,245673.0739,1151352.0,13485507,5,119028043,244913.6687,1155469.0,13522678,18,0,0,0,0,9760,9740,4.077468,4.069112,0,1460,413.282787,571.826981,326986.0964,0,0,0,403,975,0,0,0,0,413.706667,182.907787,644.979466,0,0,0,0,44,9225,0,91,0,322,12108,9520,209,20,0,0,0,0,1437770000000000.0,31846300.0,1437770000000000.0,1437760000000000.0,Benign,4
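
Note that the loop above reads only the hour digits of the timestamp and does not look at the AM/PM marker, so `04:09:48 PM` is recorded as hour 4. If a 24-hour value is wanted, a sketch along the following lines could be used. It assumes every timestamp follows the `dd/mm/YYYY hh:MM:SS AM/PM` layout seen in the rows above, and the toy series is made up for the example. The outputs shown in the rest of this notebook were produced with the original 12-hour values.

```python
import pandas as pd

# Hypothetical sample in the same layout as the Timestamp column.
toy_ts = pd.Series([
    "24/07/2015 04:09:48 PM",
    "24/07/2015 04:10:00 AM",
    "24/07/2015 12:00:01 AM",
])

# %I is the 12-hour clock hour and %p is the AM/PM marker.
parsed = pd.to_datetime(toy_ts, format="%d/%m/%Y %I:%M:%S %p")

# Hour of day on a 24-hour clock.
hours_24 = parsed.dt.hour.tolist()
print(hours_24)
```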


Now we can move on to normalizing the data. Since a model can work better with normalized data, we will create and store a normalized version of our final data set.

We will use the StandardScaler class for normalization. We choose it because it performs z-score normalization (standardization).
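
As a quick sanity check of what StandardScaler computes, the sketch below compares its output with the z-score formula `(x - mean) / std` on a toy column. The array `x` is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column input (hypothetical values).
x = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score; np.std uses the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled, manual))
```

Since StandardScaler treats each column independently, fitting it once on all numeric columns gives the same result as fitting it column by column.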

In [51]:
normalized_df = outlier_removed_df.copy()

scaler = StandardScaler()

for column in normalized_df.columns:
    if column != 'Label':
        normalized_df[column] = scaler.fit_transform(normalized_df[[column]])

normalized_df.head()

Unnamed: 0,Protocol,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Hour
1,-0.882372,-0.637194,-0.024254,-0.014846,-0.129765,-0.161487,-0.499564,-0.558656,-0.536426,-0.460476,-0.395573,-0.367389,-0.444693,-0.386884,-0.548607,6.929582,-0.428737,-0.488296,-0.592191,-0.181287,-0.613295,-0.423379,-0.420155,-0.565324,-0.282574,-0.528956,-0.346735,-0.376716,-0.478098,-0.211557,-0.310173,0.0,0.0,0.0,-0.022046,-0.015671,5.003937,8.330437,-0.636723,-0.488344,-0.602205,-0.54941,-0.212007,2.911797,-0.587581,-0.077637,-0.090463,-0.023071,0.0,0.0,0.0,0.084998,-0.648532,-0.536426,-0.444693,0.0,0.0,0.0,0.0,-0.062094,-0.029465,-0.690266,-0.480155,0.0,-0.428806,-0.370054,-0.079226,-0.098412,0.613286,0.0,0.0,0.0,0.0,-1.189111,-0.333684,-1.212176,-0.950426,Benign,-0.603738
2,-0.882372,-0.637193,-0.024254,-0.014846,-0.129765,-0.161487,-0.499564,-0.558656,-0.536426,-0.460476,-0.395573,-0.367389,-0.444693,-0.386884,-0.548607,6.532759,-0.428734,-0.488296,-0.59219,-0.181283,-0.613295,-0.423379,-0.420155,-0.565324,-0.282574,-0.528956,-0.346735,-0.376716,-0.478098,-0.211557,-0.310173,0.0,0.0,0.0,-0.022046,-0.015671,4.714712,7.857764,-0.636723,-0.488344,-0.602205,-0.54941,-0.212007,2.911797,-0.587581,-0.077637,-0.090463,-0.023071,0.0,0.0,0.0,0.084998,-0.648532,-0.536426,-0.444693,0.0,0.0,0.0,0.0,-0.062094,-0.029465,-0.690266,-0.480155,0.0,-0.428806,-0.363069,-0.079226,-0.098412,0.613286,0.0,0.0,0.0,0.0,-1.189111,-0.333684,-1.212176,-0.950426,Benign,-0.603738
3,-0.882372,-0.637195,-0.024254,-0.014846,-0.129765,-0.161487,-0.499564,-0.558656,-0.536426,-0.460476,-0.395573,-0.367389,-0.444693,-0.386884,-0.548607,7.882399,-0.428743,-0.488296,-0.592193,-0.181295,-0.613295,-0.423379,-0.420155,-0.565324,-0.282574,-0.528956,-0.346735,-0.376716,-0.478098,-0.211557,-0.310173,0.0,0.0,0.0,-0.022046,-0.015671,5.698399,9.465378,-0.636723,-0.488344,-0.602205,-0.54941,-0.212007,2.911797,-0.587581,-0.077637,-0.090463,-0.023071,0.0,0.0,0.0,0.084998,-0.648532,-0.536426,-0.444693,0.0,0.0,0.0,0.0,-0.062094,-0.029465,-0.690266,-0.480155,0.0,-0.428806,-0.367688,-0.079226,-0.098412,0.613286,0.0,0.0,0.0,0.0,-1.189111,-0.333684,-1.212176,-0.950426,Benign,-0.603738
5,-0.882372,-0.626844,-0.01916,-0.012559,-0.123563,-0.157829,-0.331757,-0.558656,-0.409622,-0.250211,-0.25759,-0.367389,-0.310106,-0.159963,-0.018268,-0.169162,-0.421043,-0.469696,-0.572281,-0.18135,-0.602857,-0.415096,-0.393229,-0.545102,-0.282558,-0.518035,-0.328307,-0.335866,-0.456371,-0.211451,-0.310173,0.0,0.0,0.0,-0.015961,-0.012702,-0.164974,-0.133379,-0.636723,-0.364776,-0.472985,-0.365599,-0.206505,1.088104,1.69362,-0.077637,-0.065831,-0.018764,0.0,0.0,0.0,-0.094053,-0.512487,-0.409622,-0.310106,0.0,0.0,0.0,0.0,-0.062094,-0.029465,-0.690266,-0.360429,0.0,-0.340456,1.050951,-0.099024,-0.079153,0.613286,0.0,0.0,0.0,0.0,0.889237,-0.333684,0.809676,1.03725,Benign,-0.603738
6,-0.882372,2.305313,0.596017,0.540783,7.559409,14.38115,2.903204,-0.558656,1.074233,1.890644,2.154491,-0.367389,2.851727,2.832233,4.43909,-0.184881,-0.412976,-0.39102,0.05249,-0.181352,2.350573,-0.40406,-0.270973,0.088572,-0.282573,2.581362,-0.325137,-0.216177,0.226701,-0.211556,-0.310173,0.0,0.0,0.0,0.622166,0.539181,-0.181439,-0.143918,-0.636723,1.795318,2.580837,2.603369,1.40673,-0.735589,-0.587581,-0.077637,4.872902,0.815165,0.0,0.0,0.0,-0.094053,2.333321,1.074233,2.851727,0.0,0.0,0.0,0.0,0.53208,-0.018521,-0.690266,0.730409,0.0,2.732164,0.770197,1.172641,3.926795,0.613286,0.0,0.0,0.0,0.0,0.889237,-0.333684,0.809676,1.037236,Benign,-0.603738


Since we will feed the normalized data to a model, we also need to encode the Label variable. Because this is a binary classification problem, we can apply a 0/1 encoding.

In [52]:
d = {
    'Benign': 0,
    'Darknet': 1
}

normalized_df['Label'] = normalized_df['Label'].map(d)

In [53]:
if os.path.exists('./artifacts/NormalizedDarknet.parquet'):
    os.remove('./artifacts/NormalizedDarknet.parquet')

normalized_df.to_parquet('./artifacts/NormalizedDarknet.parquet')

## 5- Data Analysis

First, let's recall the features we have:

In [54]:
outlier_removed_df.columns

Index(['Protocol', 'Flow Duration', 'Total Fwd Packet', 'Total Bwd packets',
       'Total Length of Fwd Packet', 'Total Length of Bwd Packet',
       'Fwd Packet Length Max', 'Fwd Packet Length Min',
       'Fwd Packet Length Mean', 'Fwd Packet Length Std',
       'Bwd Packet Length Max', 'Bwd Packet Length Min',
       'Bwd Packet Length Mean', 'Bwd Packet Length Std', 'Flow Bytes/s',
       'Flow Packets/s', 'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max',
       'Flow IAT Min', 'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std',
       'Fwd IAT Max', 'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean',
       'Bwd IAT Std', 'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags',
       'Bwd PSH Flags', 'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Packet Length Min', 'Packet Length Max', 'Packet Length Mean',
       'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
       'SYN Flag Count', 'RST Flag Count',

Examining all of these features would be too time-consuming and probably unnecessary. Let's pick the ones most related to the feature we want to predict and move on. For this we will use the *mutual_info_classif* function, which estimates the mutual information between each feature and the target.
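
To give a feel for the scores this function returns, here is a minimal sketch on synthetic data: one feature that closely follows the label and one that is pure noise. The names `X_toy`, `y_toy`, `informative` and `noise` are made up for the example and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic binary target.
y_toy = rng.integers(0, 2, size=500)

X_toy = pd.DataFrame({
    # Tracks the label closely, so it should get a high score.
    "informative": y_toy + rng.normal(0, 0.1, size=500),
    # Independent of the label, so it should score near zero.
    "noise": rng.normal(0, 1, size=500),
})

mi_toy = pd.Series(
    mutual_info_classif(X_toy, y_toy, random_state=0),
    index=X_toy.columns,
).sort_values(ascending=False)

print(mi_toy)
```

The scores are non-negative, and a score near zero suggests the feature tells us little about the label on its own.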

In [56]:
y = outlier_removed_df['Label']
X = outlier_removed_df.drop(columns=['Label'])

mi = mutual_info_classif(X, y, random_state=42)

mi = pd.Series(mi, index=X.columns)
mi = mi.sort_values(ascending=False)

mi

Packet Length Max             0.191858
Idle Max                      0.184460
Packet Length Mean            0.182635
Packet Length Std             0.177801
Idle Mean                     0.177754
Average Packet Size           0.177338
Packet Length Variance        0.177267
Flow IAT Max                  0.174556
Bwd Packet Length Max         0.173535
Flow Duration                 0.167115
Bwd Packet Length Mean        0.166407
Bwd Segment Size Avg          0.166276
Idle Min                      0.164423
Flow IAT Min                  0.158818
Bwd Packet Length Min         0.154446
Total Length of Bwd Packet    0.152151
Fwd Header Length             0.142431
Total Length of Fwd Packet    0.142221
Fwd IAT Max                   0.134019
Flow Packets/s                0.128556
Bwd Header Length             0.127773
Fwd Packets/s                 0.126610
Subflow Bwd Bytes             0.124752
Bwd Packets/s                 0.121628
Flow IAT Mean                 0.120938
Hour                     

No feature looks strongly related to the target. On the other hand, some are rated as almost unrelated, so leaving them out may help. If we assume that we would normally set a threshold of about 0.3 and keep the 0.3-1.0 range, then a threshold of about 0.06 is reasonable for these features, whose highest score is only about 0.19.

In [59]:
features = mi[mi > 0.06].index
features

Index(['Packet Length Max', 'Idle Max', 'Packet Length Mean',
       'Packet Length Std', 'Idle Mean', 'Average Packet Size',
       'Packet Length Variance', 'Flow IAT Max', 'Bwd Packet Length Max',
       'Flow Duration', 'Bwd Packet Length Mean', 'Bwd Segment Size Avg',
       'Idle Min', 'Flow IAT Min', 'Bwd Packet Length Min',
       'Total Length of Bwd Packet', 'Fwd Header Length',
       'Total Length of Fwd Packet', 'Fwd IAT Max', 'Flow Packets/s',
       'Bwd Header Length', 'Fwd Packets/s', 'Subflow Bwd Bytes',
       'Bwd Packets/s', 'Flow IAT Mean', 'Hour', 'Fwd IAT Total',
       'FWD Init Win Bytes', 'Fwd IAT Mean', 'Fwd IAT Min', 'Flow Bytes/s',
       'Fwd Packet Length Max', 'Fwd Packet Length Mean',
       'Fwd Segment Size Avg', 'Fwd Seg Size Min', 'Bwd Init Win Bytes',
       'Fwd Packet Length Min', 'Packet Length Min', 'Total Fwd Packet',
       'Total Bwd packets', 'Flow IAT Std'],
      dtype='object')

In [60]:
features.size

41

We are left with 41 features. This way we do not lose much of the variance, and we also trim the data somewhat.
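
One possible reading of the 0.06 threshold is a relative cutoff of 0.3 scaled by the highest observed score (0.3 × 0.19 ≈ 0.06). The sketch below shows that idea on made-up scores. The names `mi_toy` and `relative_cutoff` are hypothetical, and this is only an illustration of the reasoning, not the selection code used above.

```python
import pandas as pd

# Made-up mutual information scores for five hypothetical features.
mi_toy = pd.Series({"a": 0.19, "b": 0.12, "c": 0.07, "d": 0.03, "e": 0.0})

# Keep features that score above 30% of the best score.
relative_cutoff = 0.3
threshold = relative_cutoff * mi_toy.max()

selected = mi_toy[mi_toy > threshold].index.tolist()
print(round(threshold, 3), selected)
```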

## References
[0] Veri seti: Arash Habibi Lashkari, Gurdip Kaur, and Abir Rahali, “DIDarknet: A Contemporary Approach to Detect and Characterize the Darknet Traffic using Deep Image Learning”, 10th International Conference on Communication and Network Security, Tokyo, Japan, November 2020.

[1], [2] https://www.unb.ca/cic/datasets/darknet2020.html (Accessed April 2024.)

[3] D’hooge, L., Verkerken, M., Volckaert, B., Wauters, T., De Turck, F. (2022). Establishing the Contaminating Effect of Metadata Feature Inclusion in Machine-Learned Network Intrusion Detection Models. In: Cavallaro, L., Gruss, D., Pellegrino, G., Giacinto, G. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2022. Lecture Notes in Computer Science, vol 13358. Springer, Cham. https://doi.org/10.1007/978-3-031-09484-2_2