# Exploring the CIC-Darknet 2020 Dataset and a Classification Study

---

## 0- Imports

In [39]:
import pandas as pd
pd.options.display.min_rows = 100
pd.options.display.max_columns = 100

import os
from pyarrow import parquet as pq
from sklearn.preprocessing import StandardScaler

---

## 1- Introduction

**Purpose:** The purpose of this presentation is to examine the CIC-Darknet 2020 dataset and to understand its significance in its field, its features, and its potential application areas.

---

## 2- Background

The importance of data in cybersecurity cannot be overstated. In this field, one of whose main goals is keeping our data secure, there are many datasets that serve as material for a variety of studies and developments.

The CIC-Darknet 2020 dataset is an important one for three reasons: it concerns Darknet, a significant concept in cybersecurity; it is well suited to classification studies; and (although this presentation starts from the raw version) a cleaned version converted to a Parquet file is available online.

---

## 3- Examining the Data

### 3.1- Introduction to the Examination
Before examining the data in general, let's get to know the concept it deals with: Darknet. Darknet is the unused address space of the internet, which is not expected to interact with other computers. Any communication coming from this space is treated with suspicion because of its passive listening nature. Incoming packets are accepted, but outgoing packets are not supported.

Darknet is also known as:
- Network telescope: An internet system that lets one observe different large-scale events taking place around the world
- Sinkhole: A DNS server configured to hand out non-routable addresses for all domain names
- Blackhole: An unused, routable address space of an internet service provider

Darknet traffic classification is very important for categorizing real-time applications. Darknet analysis helps monitor malware before an attack and detect malicious activity after an outbreak.

This dataset is an academic network traffic classification dataset. It was published by the CIC (Canadian Institute for Cybersecurity) and is a combination of two earlier CIC datasets ([ISCXTor2016](https://www.unb.ca/cic/datasets/tor.html) and [ISCXVPN2016](https://www.unb.ca/cic/datasets/vpn.html)).

The dataset follows a two-layer approach in which benign and Darknet traffic are generated in the first layer. The Darknet traffic consists of Audio-Stream, Browsing, Chat, Email, P2P, Transfer, Video-Stream, and VOIP (Voice over Internet Protocol) traffic, which is generated in the second layer. Table 1 lists the Darknet traffic categories and the applications used to generate them.

##### **Table 1: Darknet Network Traffic Details**<sup>[1]</sup>
| Category     | Applications Used                                                                              |
|--------------|------------------------------------------------------------------------------------------------|
| Audio-Stream | Vimeo and YouTube                                                                              |
| Browsing     | Firefox and Chrome                                                                             |
| Chat         | ICQ, AIM, Skype, Facebook, and Hangouts                                                        |
| Email        | SMTPS, POP3S, and IMAPS                                                                        |
| P2P          | uTorrent and Transmission (BitTorrent)                                                         |
| Transfer     | Skype, FileZilla, and FTP over SSH (SFTP) and FTP over SSL (FTPS) using an external service    |
| Video-Stream | Vimeo and YouTube                                                                              |
| VOIP         | Facebook, Skype, and Hangouts voice calls                                                      |

---

### 3.2- Details of the Data
The two-layer approach described above and the distribution of the categories can be seen in Figure 1.<sup>[2]</sup>

![title](darknet.jpg)

The first layer shows benign and Darknet traffic, while the second layer shows the distribution of the categories that make up the Darknet traffic.

The raw data has 85 columns (features) and 141,530 rows. The features are described below:
- **Flow ID:** Unique identifier for each network traffic flow.
- **Src IP:** Source IP address of the traffic.
- **Src Port:** Source port number of the traffic.
- **Dst IP:** Destination IP address.
- **Dst Port:** Destination port number.
- **Protocol:** Protocol used for the traffic (e.g., TCP, UDP).
- **Timestamp:** Timestamp at which the traffic occurred.
- **Flow Duration:** Duration of the flow, i.e., the time difference between its first and last packet.
- **Total Fwd Packet:** Total number of packets sent in the forward direction.
- **Total Bwd packets:** Total number of packets sent in the backward direction.
- **Total Length of Fwd Packet:** Total length of the packets sent in the forward direction.
- **Total Length of Bwd Packet:** Total length of the packets sent in the backward direction.
- **Fwd Packet Length Max:** Maximum length of the packets sent in the forward direction.
- **Fwd Packet Length Min:** Minimum length of the packets sent in the forward direction.
- **Fwd Packet Length Mean:** Mean length of the packets sent in the forward direction.
- **Fwd Packet Length Std:** Standard deviation of the length of the packets sent in the forward direction.
- **Bwd Packet Length Max:** Maximum length of the packets sent in the backward direction.
- **Bwd Packet Length Min:** Minimum length of the packets sent in the backward direction.
- **Bwd Packet Length Mean:** Mean length of the packets sent in the backward direction.
- **Bwd Packet Length Std:** Standard deviation of the length of the packets sent in the backward direction.
- **Flow Bytes/s:** Flow rate in bytes per second.
- **Flow Packets/s:** Flow rate in packets per second.
- **Flow IAT Mean:** Mean time between the packets of the flow (IAT = inter-arrival time).
- **Flow IAT Std:** Standard deviation of the time between the packets of the flow.
- **Flow IAT Max:** Maximum time between the packets of the flow.
- **Flow IAT Min:** Minimum time between the packets of the flow.
- **Fwd IAT Total:** Total time between the packets sent in the forward direction.
- **Fwd IAT Mean:** Mean time between the packets sent in the forward direction.
- **Fwd IAT Std:** Standard deviation of the time between the packets sent in the forward direction.
- **Fwd IAT Max:** Maximum time between the packets sent in the forward direction.
- **Fwd IAT Min:** Minimum time between the packets sent in the forward direction.
- **Bwd IAT Total:** Total time between the packets sent in the backward direction.
- **Bwd IAT Mean:** Mean time between the packets sent in the backward direction.
- **Bwd IAT Std:** Standard deviation of the time between the packets sent in the backward direction.
- **Bwd IAT Max:** Maximum time between the packets sent in the backward direction.
- **Bwd IAT Min:** Minimum time between the packets sent in the backward direction.
- **Fwd PSH Flags:** Number of packets in the forward direction with the PUSH flag set (the flag indicating that the data should be delivered to the receiving application as quickly as possible).
- **Bwd PSH Flags:** Number of packets in the backward direction with the PUSH flag set.
- **Fwd URG Flags:** Number of packets in the forward direction with the URGENT flag set (the flag indicating that the data needs urgent attention).
- **Bwd URG Flags:** Number of packets in the backward direction with the URGENT flag set.
- **Fwd Header Length:** Total header length of the packets sent in the forward direction.
- **Bwd Header Length:** Total header length of the packets sent in the backward direction.
- **Fwd Packets/s:** Number of packets per second in the forward direction.
- **Bwd Packets/s:** Number of packets per second in the backward direction.
- **Packet Length Min:** Minimum packet length.
- **Packet Length Max:** Maximum packet length.
- **Packet Length Mean:** Mean packet length.
- **Packet Length Std:** Standard deviation of the packet length.
- **Packet Length Variance:** Variance of the packet length.
- **FIN Flag Count:** Number of packets with the FIN flag set (Finish, the flag indicating that the TCP connection is being terminated).
- **SYN Flag Count:** Number of packets with the SYN flag set (Synchronize, the flag that initiates a TCP connection).
- **RST Flag Count:** Number of packets with the RST flag set (Reset, the flag used to reset the connection).
- **PSH Flag Count:** Number of packets with the PSH flag set.
- **ACK Flag Count:** Number of packets with the ACK flag set (Acknowledgement, the flag used to acknowledge received data).
- **URG Flag Count:** Number of packets with the URG flag set.
- **CWE Flag Count:** Number of packets with the CWR flag set (Congestion Window Reduced, the flag by which the sender signals that it received a TCP segment with the ECE flag set and reduced its transmission rate accordingly). The column is spelled "CWE" in the dataset.
- **ECE Flag Count:** Number of packets with the ECE flag set (Explicit Congestion Notification Echo, the flag that notifies the sender of congestion in the network).
- **Down/Up Ratio:** Ratio of download traffic to upload traffic.
- **Average Packet Size:** Average packet size.
- **Fwd Segment Size Avg:** Average size of the segments sent in the forward direction.
- **Bwd Segment Size Avg:** Average size of the segments sent in the backward direction.
- **Fwd Bytes/Bulk Avg:** Average number of bytes per bulk sent in the forward direction.
- **Fwd Packet/Bulk Avg:** Average number of packets per bulk sent in the forward direction.
- **Fwd Bulk Rate Avg:** Average bulk rate in the forward direction.
- **Bwd Bytes/Bulk Avg:** Average number of bytes per bulk sent in the backward direction.
- **Bwd Packet/Bulk Avg:** Average number of packets per bulk sent in the backward direction.
- **Bwd Bulk Rate Avg:** Average bulk rate in the backward direction.
- **Subflow Fwd Packets:** Number of subflow packets sent in the forward direction.
- **Subflow Fwd Bytes:** Number of subflow bytes sent in the forward direction.
- **Subflow Bwd Packets:** Number of subflow packets sent in the backward direction.
- **Subflow Bwd Bytes:** Number of subflow bytes sent in the backward direction.
- **FWD Init Win Bytes:** Initial window size in the forward direction.
- **Bwd Init Win Bytes:** Initial window size in the backward direction.
- **Fwd Act Data Pkts:** Number of packets in the forward direction that carry actual data.
- **Fwd Seg Size Min:** Minimum segment size in the forward direction.
- **Active Mean:** Mean time a flow was active.
- **Active Std:** Standard deviation of the time a flow was active.
- **Active Max:** Maximum time a flow was active.
- **Active Min:** Minimum time a flow was active.
- **Idle Mean:** Mean time a flow was idle.
- **Idle Std:** Standard deviation of the time a flow was idle.
- **Idle Max:** Maximum time a flow was idle.
- **Idle Min:** Minimum time a flow was idle.
- **Label:** Classification label indicating the traffic type (Non-Tor, NonVPN, Tor, VPN).
- **Label.1:** Category of the traffic (AUDIO-STREAMING, Browsing, Chat, Email, File-Transfer, File-transfer, P2P, Video-Streaming, Audio-Streaming, Video-streaming, VOIP).


---

## 4- Data Preprocessing

To preprocess the data, we first need to load it.

In [2]:
data = pd.read_csv('DarknetHamVeriSeti.CSV')
df = pd.DataFrame(data)

Now let's check whether there are any missing entries.

In [5]:
(df.isnull().sum().sort_values(ascending=False) / len(df)) * 100

Flow Bytes/s              0.033209
Flow ID                   0.000000
Bwd Bytes/Bulk Avg        0.000000
Fwd Packet/Bulk Avg       0.000000
Fwd Bytes/Bulk Avg        0.000000
Bwd Segment Size Avg      0.000000
Fwd Segment Size Avg      0.000000
Average Packet Size       0.000000
Down/Up Ratio             0.000000
ECE Flag Count            0.000000
CWE Flag Count            0.000000
URG Flag Count            0.000000
ACK Flag Count            0.000000
PSH Flag Count            0.000000
RST Flag Count            0.000000
SYN Flag Count            0.000000
FIN Flag Count            0.000000
Packet Length Variance    0.000000
Packet Length Std         0.000000
Packet Length Mean        0.000000
Packet Length Max         0.000000
Fwd Bulk Rate Avg         0.000000
Bwd Packet/Bulk Avg       0.000000
Bwd Packets/s             0.000000
Bwd Bulk Rate Avg         0.000000
Label                     0.000000
Idle Min                  0.000000
Idle Max                  0.000000
Idle Std            

Only the *Flow Bytes/s* feature has missing values, and they make up about 0.03% of its rows. Dropping such a small portion will not cause any problem, so instead of trying to fill these gaps we simply drop the affected rows.

In [6]:
df.dropna(inplace=True)
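
The two steps above can be illustrated on a small example. In this sketch the DataFrame `toy` and its values are made up; only the column names come from the dataset. Note that `isnull().mean()` yields the same fraction as dividing `isnull().sum()` by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "Flow Bytes/s": [8733.6, np.nan, 4640.4, 5571.0],
    "Flow Duration": [229, 407, 431, 359],
})

# Percentage of missing values per column, largest first.
missing_pct = toy.isnull().mean().sort_values(ascending=False) * 100

# Drop every row that has at least one missing value.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_dropped = rows_before - len(toy_clean)
```

Comparing the row count before and after, as done with `rows_dropped`, is a cheap way to confirm that only the expected small share of rows was removed.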

In [7]:
df.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Label.1
0,10.152.152.11-216.58.220.99-57158-443-6,10.152.152.11,57158,216.58.220.99,443,6,24/07/2015 04:09:48 PM,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,10.152.152.11-216.58.220.99-57159-443-6,10.152.152.11,57159,216.58.220.99,443,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,10.152.152.11-216.58.220.99-57160-443-6,10.152.152.11,57160,216.58.220.99,443,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,10.152.152.11-74.125.136.120-49134-443-6,10.152.152.11,49134,74.125.136.120,443,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,10.152.152.11-173.194.65.127-34697-19305-6,10.152.152.11,34697,173.194.65.127,19305,6,24/07/2015 04:09:45 PM,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Non-Tor,AUDIO-STREAMING


The first thing that stands out is that the data actually contains two columns named *Label*, so the second one has been named *Label.1*. We rename it to *Category*.

In [8]:
df = df.rename(columns={'Label.1': 'Category'})
df.head()

Unnamed: 0,Flow ID,Src IP,Src Port,Dst IP,Dst Port,Protocol,Timestamp,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Category
0,10.152.152.11-216.58.220.99-57158-443-6,10.152.152.11,57158,216.58.220.99,443,6,24/07/2015 04:09:48 PM,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,10.152.152.11-216.58.220.99-57159-443-6,10.152.152.11,57159,216.58.220.99,443,6,24/07/2015 04:09:48 PM,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,10.152.152.11-216.58.220.99-57160-443-6,10.152.152.11,57160,216.58.220.99,443,6,24/07/2015 04:09:48 PM,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,10.152.152.11-74.125.136.120-49134-443-6,10.152.152.11,49134,74.125.136.120,443,6,24/07/2015 04:09:48 PM,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,10.152.152.11-173.194.65.127-34697-19305-6,10.152.152.11,34697,173.194.65.127,19305,6,24/07/2015 04:09:45 PM,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Non-Tor,AUDIO-STREAMING


Now we can drop some columns that will not be useful to us. First among these are the identifier columns. Because they are unique identifiers, we can neither examine their distributions nor use them to draw any inference. In addition, this kind of metadata would act as overly strong shortcut predictors<sup>[3]</sup> during prediction and would therefore harm the prediction stage as well.

Fortunately, a drop list that can be used for CIC datasets in general has been shared publicly, so we can drop such features without searching for them manually.

In [9]:
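# Identifier-like columns to drop. Both naming variants are listed
# so that the same list can be reused across CIC datasets.
drop_columns = [
    "Flow ID",
    "Fwd Header Length.1",
    "Source IP", "Src IP",
    "Source Port", "Src Port",
    "Destination IP", "Dst IP",
    "Destination Port", "Dst Port",
    "Timestamp",
]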

In [10]:
df.drop(columns=drop_columns, inplace=True, errors="ignore")

In [11]:
df.shape

(141483, 79)

As can be seen, 6 features have been dropped.
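
Because `errors="ignore"` silently skips names that are not present, it can help to check which entries of a drop list actually match. The following is a minimal sketch on made-up data; `toy` and `candidates` (a shortened copy of the drop list) are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data with a few of the dataset's column names.
toy = pd.DataFrame({
    "Flow ID": ["a-b-1-2-6", "a-c-3-4-6"],
    "Src IP": ["10.0.0.1", "10.0.0.1"],
    "Protocol": [6, 6],
    "Flow Duration": [229, 407],
})

candidates = ["Flow ID", "Source IP", "Src IP", "Timestamp"]

# Names from the list that exist in this DataFrame and will really be dropped.
present = [c for c in candidates if c in toy.columns]
# Names that errors="ignore" will silently skip.
absent = [c for c in candidates if c not in toy.columns]

reduced = toy.drop(columns=candidates, errors="ignore")
```

Applied to the real data, a check like this would be expected to list the six columns removed above.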

Now let's check whether there are any duplicate rows. Because duplicate rows inflate the weight of the records they repeat, they can cause unwanted bias during prediction.

In [12]:
df.duplicated().sum()

38360

There are 38,360 duplicate rows. We need to remove them from the data.

In [13]:
df.drop_duplicates(inplace=True)
df.shape

(103123, 79)
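
To illustrate what `duplicated()` counts: by default it marks every repeat of a row except its first occurrence, so the count equals the number of rows that `drop_duplicates()` removes. A sketch on made-up data (the DataFrame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 3 repeat row 0.
toy = pd.DataFrame({
    "Protocol": [6, 6, 17, 6],
    "Flow Duration": [229, 229, 407, 229],
})

# True for each row that repeats an earlier row (keep="first" is the default).
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# Keeps the first occurrence of each distinct row.
deduplicated = toy.drop_duplicates()
```

This matches the shapes above: 141,483 rows minus 38,360 duplicates leaves 103,123 rows.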

We now have a cleaned dataset. Since we may need it later, it is worth saving. We will save it as a Parquet file rather than a CSV, so that operations such as reading are faster in later use.

In [15]:
# exist_ok=True avoids a FileExistsError when the cell is re-run
os.makedirs('./artifacts', exist_ok=True)
df.to_parquet('./artifacts/Darknet.parquet')

A sample of the data in its current form:

In [16]:
df.head()

Unnamed: 0,Protocol,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label,Category
0,6,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
1,6,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
2,6,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
3,6,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,Non-Tor,AUDIO-STREAMING
4,6,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,Non-Tor,AUDIO-STREAMING


Next, we will create a normalized version of the data for later use. This requires that no categorical variables remain, but the *Label* and *Category* variables are categorical, so we need to encode them. Let's examine these features in turn.

In [18]:
df.Label.value_counts()

Label
Non-Tor    64806
NonVPN     20216
VPN        16922
Tor         1179
Name: count, dtype: int64

The *Label* feature has only a few unique values, and there is no particular order among them. One-hot encoding is therefore the best fit here.

In [34]:
encoded_df = pd.get_dummies(df, columns=['Label'])
encoded_df.head()

Unnamed: 0,Protocol,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Category,Label_Non-Tor,Label_NonVPN,Label_Tor,Label_VPN
0,6,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,True,False,False,False
1,6,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,True,False,False,False
2,6,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,True,False,False,False
3,6,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,True,False,False,False
4,6,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,AUDIO-STREAMING,True,False,False,False


The new feature names don't look very nice. We need to tidy them up in bulk and store 1/0 instead of True/False.

In [35]:
# Map the one-hot column names back to the plain label names
d = {
    'Label_Non-Tor': 'Non-Tor',
    'Label_NonVPN': 'NonVPN',
    'Label_Tor': 'Tor',
    'Label_VPN': 'VPN'
}

encoded_df = encoded_df.rename(columns=d)

# Convert the boolean indicator columns to 1/0
for value in d.values():
    encoded_df[value] = encoded_df[value].astype('int8')

encoded_df.head()

Unnamed: 0,Protocol,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Total,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Total,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Packet Length Min,Packet Length Max,Packet Length Mean,Packet Length Std,Packet Length Variance,FIN Flag Count,SYN Flag Count,RST Flag Count,PSH Flag Count,ACK Flag Count,URG Flag Count,CWE Flag Count,ECE Flag Count,Down/Up Ratio,Average Packet Size,Fwd Segment Size Avg,Bwd Segment Size Avg,Fwd Bytes/Bulk Avg,Fwd Packet/Bulk Avg,Fwd Bulk Rate Avg,Bwd Bytes/Bulk Avg,Bwd Packet/Bulk Avg,Bwd Bulk Rate Avg,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Category,Non-Tor,NonVPN,Tor,VPN
0,6,229,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,8733.624454,229.0,0.0,229,229,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,4366.812227,4366.812227,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1892,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,1,0,0,0
1,6,407,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4914.004914,407.0,0.0,407,407,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2457.002457,2457.002457,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1987,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,1,0,0,0
2,6,431,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,4640.37123,431.0,0.0,431,431,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2320.185615,2320.185615,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2049,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,1,0,0,0
3,6,359,1,1,0,0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,5571.030641,359.0,0.0,359,359,0,0.0,0.0,0,0,0,0.0,0.0,0,0,0,0,0,0,20,20,2785.51532,2785.51532,0,0,0.0,0.0,0.0,2,0,0,0,2,0,0,0,1,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,2008,1047,0,20,0,0,0,0,0.0,0.0,0.0,0.0,AUDIO-STREAMING,1,0,0,0
4,6,10778451,591,400,64530,6659,131,0,109.187817,22.283313,498,0,16.6475,46.833714,6604.75239,91.942711,10887.32424,11412.46641,78158,13,10778451,18268.56102,11786.14309,81171,126,10747836,26936.93233,15897.73845,78158,307,1,0,0,0,11820,8000,54.831627,37.111084,0,498,71.876008,56.93647,3241.761603,1,0,0,659,991,0,0,0,0,71.948537,109.187817,16.6475,0,0,0,0,659,6605,0,65,0,6,1382,2320,581,20,0,0,0,0,1437760000000000.0,3117718.131,1437760000000000.0,1437760000000000.0,AUDIO-STREAMING,1,0,0,0
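
The renaming and the type conversion can also be folded into the encoding call itself. The sketch below uses a made-up DataFrame `toy` and passes an empty `prefix` and `prefix_sep` together with `dtype` to `pd.get_dummies`. It is an alternative illustration, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the four label values from the dataset.
toy = pd.DataFrame({
    "Flow Duration": [229, 407, 431, 359],
    "Label": ["Non-Tor", "VPN", "Tor", "NonVPN"],
})

# prefix="" and prefix_sep="" make the new column names equal to the label values;
# dtype="int8" yields 1/0 instead of True/False.
encoded = pd.get_dummies(
    toy, columns=["Label"], prefix="", prefix_sep="", dtype="int8"
)
```

One trade-off of this shortcut is that the indicator columns no longer carry a prefix showing which feature they came from, which matters if several columns are encoded at once.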


Now it is the *Category* feature's turn.

In [36]:
encoded_df['Category'].unique()

array(['AUDIO-STREAMING', 'Browsing', 'Chat', 'Email', 'File-Transfer',
       'File-transfer', 'P2P', 'Video-Streaming', 'Audio-Streaming',
       'Video-streaming', 'VOIP'], dtype=object)

First, we see that several entries denote the same thing. We need to pick one spelling from each of the pairs *AUDIO-STREAMING*/*Audio-Streaming*, *Video-Streaming*/*Video-streaming*, and *File-Transfer*/*File-transfer*. To standardize, we will capitalize each word unless the entry is an abbreviation.

In [37]:
d = {
    'AUDIO-STREAMING': 'Audio-Streaming',
    'Audio-Streaming': 'Audio-Streaming',
    'Video-streaming': 'Video-Streaming',
    'Video-Streaming': 'Video-Streaming',
    'File-transfer': 'File-Transfer',
    'File-Transfer': 'File-Transfer',
    'Browsing': 'Browsing',
    'Chat': 'Chat',
    'Email': 'Email',
    'P2P': 'P2P',
    'VOIP': 'VOIP'
}

encoded_df['Category'] = encoded_df['Category'].map(d)
encoded_df['Category'].unique()

array(['Audio-Streaming', 'Browsing', 'Chat', 'Email', 'File-Transfer',
       'P2P', 'Video-Streaming', 'VOIP'], dtype=object)
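The same normalization could also be expressed as a case-insensitive lookup, so that only the canonical spellings need to be listed. This is a minimal sketch, not the notebook's method: `canonical_names`, `lookup`, and `normalize_category` are hypothetical names introduced here, and the sample series stands in for `encoded_df['Category']`.

```python
import pandas as pd

# Canonical spellings, taken from the mapping output above
canonical_names = [
    'Audio-Streaming', 'Browsing', 'Chat', 'Email',
    'File-Transfer', 'P2P', 'Video-Streaming', 'VOIP',
]

# Lower-cased spelling -> canonical spelling
lookup = {name.lower(): name for name in canonical_names}


def normalize_category(series: pd.Series) -> pd.Series:
    """Map every spelling variant to its canonical form (unknown values become NaN)."""
    return series.str.lower().map(lookup)


# Small stand-in for encoded_df['Category']
sample = pd.Series(['AUDIO-STREAMING', 'File-transfer', 'Video-streaming', 'VOIP', 'P2P'])
normalized = normalize_category(sample)
print(normalized.tolist())
```

As with the explicit dictionary, any value missing from the lookup maps to `NaN`, so a quick `isna().sum()` check after the call would reveal unexpected labels.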

The entries are now consistently formatted. On reflection, however, feeding this feature into the model does not make much sense. The work was not wasted, though: we can use it to fix the formatting of the same column in the original DataFrame.

In [38]:
df['Category'] = encoded_df['Category']
encoded_df.drop(columns=['Category'], inplace=True)

df['Category'].value_counts()

Category
Browsing           29862
P2P                23404
Audio-Streaming    11328
File-Transfer      10648
Chat               10365
Video-Streaming     9013
Email               5442
VOIP                3061
Name: count, dtype: int64

We also save the data as it stands before the next stages, so that we can return to this point if something goes wrong.

In [40]:
encoded_df.to_parquet('./artifacts/EncodedDarknet.parquet')

Now we need to check for outliers, which we will do with the IQR method. Checking for outliers before normalization, and treating them as needed, should improve both the quality of the data and the benefit we get from normalization.

In [51]:
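def outlier_finder(data):
    # Quartiles and interquartile range of the feature
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Inner fences (1.5 * IQR) mark where mild outliers begin
    inner_lower_fence = Q1 - 1.5*IQR
    inner_upper_fence = Q3 + 1.5*IQR

    # Outer fences (3 * IQR) mark where extreme outliers begin
    outer_lower_fence = Q1 - 3*IQR
    outer_upper_fence = Q3 + 3*IQR

    # Mild: strictly between an inner and an outer fence; extreme: strictly beyond an outer fence
    mild_outliers = [i for i in data if (i < inner_lower_fence and i > outer_lower_fence) or (i > inner_upper_fence and i < outer_upper_fence)]
    extreme_outliers = [i for i in data if i < outer_lower_fence or i > outer_upper_fence]

    return mild_outliers, extreme_outliers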


def outlier_printer(outliers, feature, df_):
    # outliers is the (mild, extreme) tuple returned by outlier_finder
    mild_outliers = outliers[0]
    extreme_outliers = outliers[1]

    if len(mild_outliers) != 0:
        # Messages: "<feature> has N mild outliers." / "<feature> consists of X% mild outliers."
        o1 = f"{feature} özelliğinde {len(mild_outliers)} mild outlier var."
        o2 = f"{feature} özelliği {(len(mild_outliers) / len(df_)) * 100}% mild outlierdan oluşuyor."

        print(o1)
        print(o2)

        print()

    if len(extreme_outliers) != 0:
        # Same two messages, reported for the extreme outliers
        o1 = f"{feature} özelliğinde {len(extreme_outliers)} extreme outlier var."
        o2 = f"{feature} özelliği {(len(extreme_outliers) / len(df_)) * 100}% extreme outlierdan oluşuyor."

        print(o1)
        print(o2)
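For large columns, the list comprehensions in `outlier_finder` can be slow. The sketch below shows one vectorized alternative that returns boolean masks instead of value lists. It is an illustration, not the notebook's implementation: `iqr_fence_masks` is a hypothetical helper, and unlike `outlier_finder` it counts a value lying exactly on an outer fence as mild rather than as neither.

```python
import pandas as pd


def iqr_fence_masks(data: pd.Series):
    """Return (mild_mask, extreme_mask) boolean Series based on IQR fences."""
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1

    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    # Extreme: beyond an outer fence
    extreme = (data < outer_low) | (data > outer_high)
    # Mild: beyond an inner fence but not extreme
    mild = ((data < inner_low) | (data > inner_high)) & ~extreme
    return mild, extreme


# Toy data: Q1 = 3.5, Q3 = 8.5, IQR = 5, so the upper fences are 16 and 23.5
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100])
mild_mask, extreme_mask = iqr_fence_masks(toy)
print(toy[mild_mask].tolist(), toy[extreme_mask].tolist())
```

Masks also make row removal direct, for example `toy[~(mild_mask | extreme_mask)]`, without matching values back against the column.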

We are not going to check every feature for outliers. Some features merely hold summary statistics of other, primary features, so checking them for outliers would make little sense. We first need to select the features to check.

In [52]:
o_list = [
    'Flow Duration',
    'Total Fwd Packet',
    'Total Bwd packets',
    'Total Length of Fwd Packet',
    'Total Length of Bwd Packet',
    'Flow Bytes/s',
    'Flow Packets/s',
    'Fwd IAT Total',
    'Bwd IAT Total',
    'Fwd Header Length',
    'Bwd Header Length',
    'Fwd Packets/s',
    'Bwd Packets/s',
    'Subflow Fwd Packets',
    'Subflow Fwd Bytes',
    'Subflow Bwd Packets',
    'Subflow Bwd Bytes',
    'FWD Init Win Bytes',
    'Bwd Init Win Bytes',
    
]

That does not mean we will act on the outliers of all these features either. Outlier handling is delicate: a value we discard as anomalous may actually carry information. These are only the features under suspicion. We will inspect the output for each feature and then decide.

In [56]:
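outliers_d = {}

for column in o_list:
    header = f"{column} için outlier bilgisi:"  # "Outlier info for <column>:"
    print(header)
    print("-" * len(header))
    outliers = outlier_finder(encoded_df[column])
    outlier_printer(outliers, column, encoded_df)
    outliers_d[column] = outliers  # keep the (mild, extreme) lists for later filtering
    # Separate the reports; compare against the last entry of o_list, the list being iterated
    if column != o_list[-1]:
        print("\n")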

Flow Duration için outlier bilgisi:
-----------------------------------
Flow Duration özelliğinde 4504 mild outlier var.
Flow Duration özelliği 4.367599856482065% mild outlierdan oluşuyor.

Flow Duration özelliğinde 17998 extreme outlier var.
Flow Duration özelliği 17.452944541954754% extreme outlierdan oluşuyor.


Total Fwd Packet için outlier bilgisi:
--------------------------------------
Total Fwd Packet özelliğinde 3945 mild outlier var.
Total Fwd Packet özelliği 3.825528737527031% mild outlierdan oluşuyor.

Total Fwd Packet özelliğinde 11635 extreme outlier var.
Total Fwd Packet özelliği 11.2826430573199% extreme outlierdan oluşuyor.


Total Bwd packets için outlier bilgisi:
---------------------------------------
Total Bwd packets özelliğinde 3428 mild outlier var.
Total Bwd packets özelliği 3.3241856811768473% mild outlierdan oluşuyor.

Total Bwd packets özelliğinde 11239 extreme outlier var.
Total Bwd packets özelliği 10.898635609902737% extreme outlierdan oluşuyor.


Total Le

Subflow Bwd Bytes özelliğinde 5436 mild outlier var.
Subflow Bwd Bytes özelliği 5.271374959999224% mild outlierdan oluşuyor.

Subflow Bwd Bytes özelliğinde 8047 extreme outlier var.
Subflow Bwd Bytes özelliği 7.803302851934098% extreme outlierdan oluşuyor.


FWD Init Win Bytes için outlier bilgisi:
----------------------------------------
FWD Init Win Bytes özelliğinde 135 mild outlier var.
FWD Init Win Bytes özelliği 0.13091162980130522% mild outlierdan oluşuyor.

FWD Init Win Bytes özelliğinde 1663 extreme outlier var.
FWD Init Win Bytes özelliği 1.6126373359968194% extreme outlierdan oluşuyor.


Bwd Init Win Bytes için outlier bilgisi:
----------------------------------------
Bwd Init Win Bytes özelliğinde 3575 mild outlier var.
Bwd Init Win Bytes özelliği 3.466733900293824% mild outlierdan oluşuyor.

Bwd Init Win Bytes özelliğinde 5443 extreme outlier var.
Bwd Init Win Bytes özelliği 5.278162970433366% extreme outlierdan oluşuyor.




Since the dataset is already fairly large, let us drop the outliers and see how much data remains.

In [58]:
encoded_df.shape

(103123, 81)

In [57]:
outlier_removed_df = encoded_df.copy()

for key, value in outliers_d.items():
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(value[1])]
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(value[0])]

outlier_removed_df.shape

(41087, 81)

We lose about 60% of the data, which is not a desirable outcome. However much it cleans the data, it would greatly increase bias. As noted earlier, it is also not necessary to remove outliers for every feature. We therefore need to examine the features one by one and decide. Looking at the features' summary statistics may help with that decision.

In [60]:
encoded_df[o_list].describe()

Statistic,Flow Duration,Total Fwd Packet,Total Bwd packets,Total Length of Fwd Packet,Total Length of Bwd Packet,Flow Bytes/s,Flow Packets/s,Fwd IAT Total,Bwd IAT Total,Fwd Header Length,Bwd Header Length,Fwd Packets/s,Bwd Packets/s,Subflow Fwd Packets,Subflow Fwd Bytes,Subflow Bwd Packets,Subflow Bwd Bytes,FWD Init Win Bytes,Bwd Init Win Bytes
count,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0,103123.0
mean,22679510.0,183.400308,185.735714,133471.8,156673.8,inf,inf,21733510.0,18170430.0,3501.326,3723.561,4624.713,3106.101642,0.266817,48.580705,0.0,60.024553,5739.192741,1998.307303
std,39521650.0,2695.740842,3956.465459,3730389.0,5305770.0,,,39193250.0,37230780.0,53868.56,80999.42,32707.49,18566.831714,0.442298,155.888977,0.0,144.716974,10597.954921,8292.673233
min,0.0,1.0,0.0,0.0,0.0,0.0,0.01666866,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,80788.0,1.0,0.0,0.0,0.0,0.2663485,0.6548717,0.0,0.0,16.0,0.0,0.3916324,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,440316.0,2.0,1.0,44.0,8.0,105.8367,6.22666,409819.0,0.0,40.0,16.0,4.826383,0.783385,0.0,18.0,0.0,1.0,913.0,0.0
75%,16643740.0,5.0,4.0,372.0,225.0,923.5991,64.38449,13069290.0,2580879.0,104.0,92.0,39.46447,8.530203,1.0,26.0,0.0,49.0,14600.0,1055.0
max,120000000.0,238161.0,470862.0,769307400.0,670428700.0,inf,inf,120000000.0,120000000.0,4768644.0,9417240.0,2000000.0,1000000.0,1.0,6644.0,0.0,4872.0,65535.0,65535.0


Looking at the table, we see inf, that is, infinite, values in *Flow Bytes/s* and *Flow Packets/s*. These features need intervention. Beyond these, *Bwd Packets/s*, *Subflow Fwd Bytes*, *Subflow Bwd Bytes*, *FWD Init Win Bytes*, and *Bwd Init Win Bytes* also look problematic. When we read the descriptions of these features, however, we conclude that their outlying values can be considered normal and may even hold important information. We therefore leave these features untouched.

In [61]:
outlier_removed_df = encoded_df.copy()

for key in ['Flow Bytes/s', 'Flow Packets/s']:
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(outliers_d[key][1])]
    outlier_removed_df = outlier_removed_df[~outlier_removed_df[key].isin(outliers_d[key][0])]

outlier_removed_df.shape

(74098, 81)
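The share of rows removed by each filtering pass follows directly from the shapes printed above. The short sketch below works through that arithmetic; `fraction_removed` is a hypothetical helper, and the row counts are copied from the notebook outputs.

```python
def fraction_removed(rows_before: int, rows_after: int) -> float:
    """Share of rows dropped by a filtering step, as a value between 0 and 1."""
    return 1 - rows_after / rows_before


total_rows = 103123  # encoded_df.shape[0]

# Removing outliers for every feature in o_list left 41087 rows
loss_all_features = fraction_removed(total_rows, 41087)

# Removing outliers only for Flow Bytes/s and Flow Packets/s left 74098 rows
loss_two_features = fraction_removed(total_rows, 74098)

print(f"{loss_all_features:.1%}")
print(f"{loss_two_features:.1%}")
```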

We have lost roughly 30% of the data. That would normally count as a large share, but since a substantial amount of data remains even after this step, it is not a problem. We save this state of the data as well and move on.

In [62]:
outlier_removed_df.to_parquet('./artifacts/OutlierRemovedDarknet.parquet')
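The IQR filter above removes the `inf` entries because they lie beyond the outer fences, but it removes many finite rows along with them. If the goal were only to get rid of the non-finite values, one alternative would be to drop them explicitly. The sketch below illustrates that option on toy data. It is not what this notebook does, and `drop_non_finite` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def drop_non_finite(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop rows holding inf, -inf or NaN in any of the given columns."""
    cleaned = frame.copy()
    # Turn infinities into NaN so a single dropna call can handle both
    cleaned[columns] = cleaned[columns].replace([np.inf, -np.inf], np.nan)
    return cleaned.dropna(subset=columns)


# Toy frame standing in for encoded_df
demo = pd.DataFrame({
    'Flow Bytes/s': [0.0, 105.8, np.inf],
    'Flow Packets/s': [0.65, np.inf, 6.2],
    'Flow Duration': [10, 20, 30],
})
finite_demo = drop_non_finite(demo, ['Flow Bytes/s', 'Flow Packets/s'])
print(finite_demo.shape)
```

Whether to keep the finite but extreme rates is then a separate decision, which matches the feature-by-feature approach taken above.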

## References
[0] Dataset: Arash Habibi Lashkari, Gurdip Kaur, and Abir Rahali, “DIDarknet: A Contemporary Approach to Detect and Characterize the Darknet Traffic using Deep Image Learning”, 10th International Conference on Communication and Network Security, Tokyo, Japan, November 2020.

[1], [2] https://www.unb.ca/cic/datasets/darknet2020.html (Accessed April 2024.)

[3] D’hooge, L., Verkerken, M., Volckaert, B., Wauters, T., De Turck, F. (2022). Establishing the Contaminating Effect of Metadata Feature Inclusion in Machine-Learned Network Intrusion Detection Models. In: Cavallaro, L., Gruss, D., Pellegrino, G., Giacinto, G. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2022. Lecture Notes in Computer Science, vol 13358. Springer, Cham. https://doi.org/10.1007/978-3-031-09484-2_2