## Task 3: Botnet Classification
#### For this task, we use the CTU-13 data , containing NetFlows from 13 scenarios of both benign and infected hosts, build classifier for detecting anomalous behaviour based on paper by Garcia, Sebastian, et al.

In [1]:
import pandas as pd
df = pd.read_csv('capture20110810.binetflow')
df.head(5)

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
0,2011/08/10 09:46:59.607825,1.026539,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
1,2011/08/10 09:47:00.634364,1.009595,tcp,94.44.127.113,1577,->,147.32.84.59,6881,S_RA,0.0,0.0,4,276,156,flow=Background-Established-cmpgw-CVUT
2,2011/08/10 09:47:48.185538,3.056586,tcp,147.32.86.89,4768,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
3,2011/08/10 09:47:48.230897,3.111769,tcp,147.32.86.89,4788,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt
4,2011/08/10 09:47:48.963351,3.083411,tcp,147.32.86.89,4850,->,77.75.73.33,80,SR_A,0.0,0.0,3,182,122,flow=Background-TCP-Attempt


In [2]:
df['Label'].value_counts()

flow=Background-UDP-Established                                            1169677
flow=To-Background-UDP-CVUT-DNS-Server                                      941706
flow=Background-TCP-Established                                             223543
flow=Background-Established-cmpgw-CVUT                                      137257
flow=Background-TCP-Attempt                                                 105438
flow=Background-UDP-Attempt                                                  66699
flow=Background                                                              40216
flow=Background-Attempt-cmpgw-CVUT                                           30983
flow=From-Botnet-V42-UDP-DNS                                                 26140
flow=To-Background-CVUT-Proxy                                                19542
flow=From-Normal-V42-Stribrek                                                18438
flow=From-Botnet-V42-TCP-Attempt-SPAM                                         8105
flow

In [3]:
df.shape

(2824636, 15)

###  About the label of flows
Sicco: Please note that the labels of the flows generated by the malware start with “From-Botnet”. The labels “To-Botnet” are flows sent to the botnet by unknown computers, so they should not be considered malicious perse. Also for the normal computers, the counts are for the labels “From-Normal”. The labels “To-Normal” are flows sent to the botnet by unknown computers, so they should not be considered malicious perse.”
You should filter the data to from-botnet and from-normal, we don't know whether the connections to the botnet/normal hosts are malicious or not

In [4]:
# drop out the rows labeled with "Background"
df2 = df[df['Label'].str.contains("Background")==False]
# drop out the rows labeled with "To-Normal""To-Botnet"
df3 = df2[df2['Label'].str.contains("To")==False]

df3[:1]

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label
150,2011/08/10 09:46:53.160043,0.0,udp,147.32.80.9,53,->,147.32.86.111,54230,INT,0.0,,1,141,141,flow=From-Normal-V42-UDP-CVUT-DNS-Server


In [5]:
# replace the labels to binary value: Botnet --> 1, Normal --> 0
df3.loc[df3['Label'].str.contains("Botnet"), 'biLabel'] = '1'
df3.loc[df3['Label'].str.contains("Normal"), 'biLabel'] = '0'
df3[:1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,dTos,TotPkts,TotBytes,SrcBytes,Label,biLabel
150,2011/08/10 09:46:53.160043,0.0,udp,147.32.80.9,53,->,147.32.86.111,54230,INT,0.0,,1,141,141,flow=From-Normal-V42-UDP-CVUT-DNS-Server,0


In [6]:
df3['biLabel'].value_counts()

1    40948
0    30267
Name: biLabel, dtype: int64

In [7]:
df3['Proto'].value_counts()

udp     50080
tcp     21085
icmp       37
arp        13
Name: Proto, dtype: int64

In [8]:
df3['State'].value_counts()

CON           46946
S_             9802
FSPA_FSPA      5585
FSA_FSA        3204
INT            3142
FSPA_FSRPA      462
SRPA_FSPA       457
SRPA_SPA        323
S_RA            200
FSRPA_FSPA      192
SPA_SRPA        143
FSRA_FSPA        91
FSA_SRA          85
PA_PA            63
SA_SRA           57
SRA_FSPA         56
SPA_SPA          51
SPA_SRA          48
URP              37
FSPA_FSA         33
SRA_SA           29
FSRPA_SPA        27
FSPA_FSRA        18
SPA_FSRPA        18
FSRPA_SRPA       17
FSRPA_SA         16
FSA_FSRA         15
RA_              11
PA_A             10
SPA_FSPA         10
SRPA_SA           8
FPA_FPA           6
RSP               5
S_FRA             5
FSRA_FSA          4
SR_SA             4
S_R               3
FSRA_SPA          3
FSPA_SRPA         3
PA_RPA            3
SA_               3
RPA_FRPA          2
FA_FA             2
SRA_SPA           1
A_PA              1
FSPA_SA           1
PA_FPA            1
FPA_RPA           1
FSA_SA            1
FPA_FA            1


In [9]:
df3['dTos'].value_counts()

0.0    58205
2.0        2
Name: dTos, dtype: int64

In [10]:
df3['SrcAddr'].value_counts()

147.32.84.165    40948
147.32.84.170    18438
147.32.84.164     7654
147.32.84.134     3808
147.32.87.36       269
147.32.80.9         83
147.32.87.11         6
147.32.86.187        4
147.32.87.252        3
147.32.86.140        1
147.32.86.111        1
Name: SrcAddr, dtype: int64

In [11]:
df3['Dir'].value_counts()

  <->    46938
   ->    24200
  <?>       52
  who       13
   ?>       12
Name: Dir, dtype: int64

## Now, do pre-processing on the data to prepare for classification.

In [12]:
df3.isnull().sum()

StartTime        0
Dur              0
Proto            0
SrcAddr          0
Sport           13
Dir              0
DstAddr          0
Dport           13
State            0
sTos            13
dTos         13008
TotPkts          0
TotBytes         0
SrcBytes         0
Label            0
biLabel          0
dtype: int64

### We drop the rows with small number of NAN, and drop the column 'dTos' when doing classificaiton since there are too many NaN value. Also we leave out some other columns with irrelevant features.

In [13]:
df3.drop('dTos',axis=1,inplace=True)
df3.drop('Label',axis=1,inplace=True)
df3.drop('StartTime',axis=1,inplace=True)
# df3.drop('SrcAddr',axis=1,inplace=True)
df3.drop('DstAddr',axis=1,inplace=True)
df3.dropna(inplace=True)
df3[:1]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stab

Unnamed: 0,Dur,Proto,SrcAddr,Sport,Dir,Dport,State,sTos,TotPkts,TotBytes,SrcBytes,biLabel
150,0.0,udp,147.32.80.9,53,->,54230,INT,0.0,1,141,141,0


In [14]:
df3.shape

(71202, 12)

In [15]:
# source IP convertion dic
dic = {}
srcIP = df3['SrcAddr'].unique()
for idx,ip in enumerate(srcIP):
    dic[ip] = idx
df3["SrcAddr"].replace(dic, inplace=True)

dic_inverse = { k:v for v, k in dic.items() }
dic
# dic_inverse

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


{'147.32.80.9': 0,
 '147.32.84.134': 1,
 '147.32.84.164': 2,
 '147.32.84.165': 9,
 '147.32.84.170': 3,
 '147.32.86.111': 5,
 '147.32.86.140': 6,
 '147.32.86.187': 7,
 '147.32.87.11': 10,
 '147.32.87.252': 8,
 '147.32.87.36': 4}

#### Categorical features have to be mapped either using a numerical mapping or one-hot encoding. The latter is preferred in general because it doesn't create artificial "neighbours" caused by the numerical indexing, but it has the disadvantage that each category is mapped to a new feature leading to a big sparse feature map, so it's not suitable for categories with many cardinality.

We decide to use a simple numerical mapping for performance reasons.

In [16]:

for col in ['Proto', 'State','Dir','Sport','Dport']:
    df3[col] = df3[col].astype('category').cat.codes


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


###  get the label for each source IP address, if one of flows has been labeled as malicious, then we label the host as Botnet

In [17]:
host_label = df3.groupby('SrcAddr').biLabel.max()

In [18]:
df_label = df3['biLabel']
# host_label = df3.groupby('biLabel').sum()
df3.drop('biLabel',axis=1,inplace=True)
df3[:1]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,Dur,Proto,SrcAddr,Sport,Dir,Dport,State,sTos,TotPkts,TotBytes,SrcBytes
150,0.0,2,0,17392,0,59,27,0.0,1,141,141


In [19]:
df3.dtypes

Dur         float64
Proto          int8
SrcAddr       int64
Sport         int16
Dir            int8
Dport          int8
State          int8
sTos        float64
TotPkts       int64
TotBytes      int64
SrcBytes      int64
dtype: object

## Now, let's do classification.
### Evaluate based on (1) flow level and (2) host level.
### For host level: do classification on flow level and then aggregare for per IP address(host) and evaluate the performance.

In [20]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from imblearn.over_sampling import SMOTE

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, VotingClassifier
import numpy as np


# ------------Split into training and testing
X = df3
y = df_label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

## -------------prepare for host level evaluation---------------
dftest = X_test.copy()
dftest['y_test'] = y_test
host_label = dftest.groupby('SrcAddr').y_test.max()
# print(host_label)

classifiers = {
    'RandomForestClassifier': {'instance': RandomForestClassifier()},
    'KNeighborsClassifier': {'instance': KNeighborsClassifier()},
}

#----------- Perform oversampling of training set using SMOTE----------
sm = SMOTE(ratio='auto', kind='regular')
X_train_os, y_train_os = sm.fit_sample(X_train, y_train)


for name, classifier in classifiers.items():
    print("Classifier: {}".format(name))
    clf = classifier['instance']


    clf.fit(X_train_os, y_train_os)
    y_predict = clf.predict(X_test)
    
    #-------------------This is for flow level----------------

    cf_matrix = confusion_matrix(y_test, y_predict)
    
    #-------------------This is for host level----------------
    dftest["SrcAddr"].replace(dic_inverse, inplace=True)
    dftest['y_predict'] = y_predict
    host_label_predict = dftest.groupby('SrcAddr').y_predict.max()
    host_list = dftest.groupby('SrcAddr').apply(list).index
    cf_matrix_label = confusion_matrix(host_label, host_label_predict)

    print("test size is :{}".format(len(dftest['y_test'])))

    print('- Flow level:With SMOTE -> TN: {}, FP: {}, FN: {}, TP: {}, Acc: {}'.format(*cf_matrix.flatten(), accuracy_score(y_test, y_predict)))
    print('- Host level:TN: {}, FP: {}, FN: {}, TP: {}, Acc: {}'.format(*cf_matrix_label.flatten(), accuracy_score(host_label, host_label_predict)))
    print("host_label:{}, host_label_predict{}".format(np.array(host_label),np.array(host_label_predict)))

Classifier: KNeighborsClassifier
test size is :14241
- Flow level:With SMOTE -> TN: 5966, FP: 25, FN: 123, TP: 8127, Acc: 0.9896074713854364
- Host level:TN: 3, FP: 5, FN: 0, TP: 1, Acc: 0.4444444444444444
host_label:['0' '0' '0' '0' '0' '0' '0' '1' '0'], host_label_predict['0' '1' '1' '1' '1' '0' '0' '1' '1']
Classifier: RandomForestClassifier
test size is :14241
- Flow level:With SMOTE -> TN: 5990, FP: 1, FN: 0, TP: 8250, Acc: 0.9999297802120638
- Host level:TN: 6, FP: 2, FN: 1, TP: 0, Acc: 0.6666666666666666
host_label:['0' '0' '0' '0' '0' '0' '0' '1' '0'], host_label_predict['0' '0' '0' '1' '0' '0' '1' '0' '0']


In [31]:
print("This the host addresses in test set:")
host_list

This the host addresses in test set:


Index(['147.32.80.9', '147.32.84.134', '147.32.84.164', '147.32.84.165',
       '147.32.84.170', '147.32.86.187', '147.32.87.11', '147.32.87.252',
       '147.32.87.36'],
      dtype='object', name='SrcAddr')