##### **Name:** Rohan Karthikeyan
##### **Roll Number:** MDS202226

In the third task of our midsemester exam, we use ML methods for traffic classification on a custom dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.impute import SimpleImputer

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import make_scorer

# Set a dataframe for holding results
model_performance = pd.DataFrame(columns = ['Precision', 'Accuracy'])
pd.set_option('display.max_columns', 50)

### Data Preparation

The following steps were performed to convert the raw PCAP file into a CSV:
* Split the file into smaller files each of size 1000 MB;
* Use two tools, `Argus` and `Bro-IDS` (now known as Zeek), for feature extraction;
* Match the extracted features and output the CSV.

Only the last step is Python-based and is shown in the code cells below.

**Disclaimer:** The raw PCAP file covers 15 minutes of traffic. While Zeek produced complete output for each 1 GB split, Argus threw an error on every split. Only the first and the fourth splits yielded a substantial number of records from Argus; in all other splits, Argus terminated with an error within the first few hundred records.

I could not determine the cause of the error, so all subsequent analysis is based on incomplete data.
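To illustrate the matching step described above, here is a minimal sketch on hypothetical toy frames (`zeek_toy` and `argus_toy` are made-up names and values, not the real extracts). Called without arguments, `DataFrame.merge` performs an inner join on every column name the two frames share, so only records present in both outputs survive.

```python
import pandas as pd

# Hypothetical miniature versions of the two tool outputs
zeek_toy = pd.DataFrame({
    "stime": [1.0, 2.0],
    "saddr": ["a", "b"],
    "service": ["http", "dns"],
})
argus_toy = pd.DataFrame({
    "stime": [1.0, 3.0],
    "saddr": ["a", "c"],
    "dur": [0.5, 0.1],
})

# No `on=` given: pandas joins on the shared columns (stime, saddr)
# and keeps only the rows that match in both frames (inner join)
merged = argus_toy.merge(zeek_toy)
print(merged)
```

Only the record with `stime == 1.0` and `saddr == "a"` appears in both frames, so the merged result has a single row. This is one reason the incomplete Argus output shrinks the final dataset.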

In [2]:
common_path = '../input/simply/Actual/Split'
zeek_headers = ['stime', 'saddr', 'sport', 'daddr', 'dport', 'proto', 'service']
argus_headers = ['stime', 'saddr', 'daddr', 'sport', 'dport', 'proto', 'state', 'dur', 'sbytes', 'dbytes', 'sttl', 'dttl',
                 'sloss', 'dloss', 'sload', 'dload', 'spkts', 'dpkts', 'smeansz', 'dmeansz', 'sjit', 'djit',
                 'sintpkt', 'dintpkt', 'synack', 'ackdat']

In [3]:
%%time
# Merge the Zeek and Argus outputs of each split
subsets = []

for file_num in range(6):
    zeek = pd.read_csv(common_path + '{}.tsv'.format(file_num), sep='\t', names=zeek_headers)
    # A raw string avoids an invalid-escape warning for the regex separator
    argus = pd.read_table(common_path + '{}.txt'.format(file_num), sep=r'\s+', header=0, names=argus_headers)

    # Change datatypes of some columns; unparseable entries become NaN
    cols_to_change = ['sport', 'dport', 'dur', 'sloss', 'dloss', 'sload', 'dload', 'sjit']
    for col in cols_to_change:
        argus[col] = pd.to_numeric(argus[col], errors='coerce')

    # Inner join on all columns the two frames have in common
    subsets.append(argus.merge(zeek))

# Concatenate once instead of growing the dataframe inside the loop
flow_data = pd.concat(subsets, ignore_index=True)
flow_data



CPU times: user 42.6 s, sys: 6.01 s, total: 48.6 s
Wall time: 59.4 s


Unnamed: 0,stime,saddr,daddr,sport,dport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,sload,dload,spkts,dpkts,smeansz,dmeansz,sjit,djit,sintpkt,dintpkt,synack,ackdat,service
0,1.681535e+09,202.151.45.173,120.29.57.200,80.0,443.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
1,1.681535e+09,120.29.57.200,202.151.45.173,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
2,1.681535e+09,162.44.164.99,202.151.45.173,8848.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
3,1.681535e+09,162.44.164.99,202.151.45.1,8848.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
4,1.681535e+09,120.29.57.200,202.151.45.1,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1173390,1.681536e+09,200.246.48.17,202.151.45.173,443.0,80.0,tcp,RST,0.0,188.0,396.0,246.0,248.0,1.0,0.0,0.0,0.0,2.0,6.0,94.0,66.0,0.0,0.0,,,,,-
1173391,1.681536e+09,151.21.149.111,202.151.45.1,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
1173392,1.681536e+09,151.21.149.111,202.151.45.173,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-
1173393,1.681536e+09,162.44.164.99,202.151.45.173,8848.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,-


Many entries in the `service` column are coded as `-`, so we need to fill in the services manually.<br>

According to [Wikipedia](https://en.wikipedia.org/wiki/Port_(computer_networking)#Port_number),
> Applications implementing common services often use ***specifically reserved well-known port numbers*** for **receiving** service requests from clients. This process is known as listening, and involves the receipt of a request on the well-known port potentially establishing a one-to-one server-client dialog, using this listening port. Other clients may simultaneously connect to the same listening port; this works because a TCP connection is identified by a tuple consisting of the local address, the local port, the remote address, and the remote port... Conversely, the client end of a connection typically uses a high port number allocated for short-term use...

As a result, we use the destination port to identify the services.

The 15 ports used as keys in the dictionary below are based on an inspection of frequently occurring destination ports (with a value less than 1000).

In [4]:
port_service_map = {21: "ftp", 22: "ssh", 23: "telnet", 25: "smtp", 53: "dns", 80: "http", 110: "pop3",
                    123: "ntp", 143: "imap", 161: "snmp", 179: "bgp", 443: "ssl", 465: "urd",
                    587: "submission", 993: "imaps"}
flow_data['service'] = flow_data['dport'].map(port_service_map)
flow_data

Unnamed: 0,stime,saddr,daddr,sport,dport,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,sload,dload,spkts,dpkts,smeansz,dmeansz,sjit,djit,sintpkt,dintpkt,synack,ackdat,service
0,1.681535e+09,202.151.45.173,120.29.57.200,80.0,443.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,ssl
1,1.681535e+09,120.29.57.200,202.151.45.173,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http
2,1.681535e+09,162.44.164.99,202.151.45.173,8848.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http
3,1.681535e+09,162.44.164.99,202.151.45.1,8848.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http
4,1.681535e+09,120.29.57.200,202.151.45.1,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1173390,1.681536e+09,200.246.48.17,202.151.45.173,443.0,80.0,tcp,RST,0.0,188.0,396.0,246.0,248.0,1.0,0.0,0.0,0.0,2.0,6.0,94.0,66.0,0.0,0.0,,,,,http
1173391,1.681536e+09,151.21.149.111,202.151.45.1,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http
1173392,1.681536e+09,151.21.149.111,202.151.45.173,443.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http
1173393,1.681536e+09,162.44.164.99,202.151.45.173,8848.0,80.0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,http


From the output below, we observe that *pandas* has mapped the service to ***a null value*** wherever the destination port is not one of the 15 keys in the dictionary.
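The behaviour of `Series.map` with a dictionary can be checked on a small hypothetical example (`toy_map` and the port values below are illustrative only): any value without a matching key, including a missing value, comes back as `NaN`.

```python
import pandas as pd

# Toy destination ports; 8848 is deliberately absent from the map
dport_toy = pd.Series([80.0, 443.0, 8848.0, None])
toy_map = {80: "http", 443: "ssl"}

# Values not found among the dictionary keys are mapped to NaN
mapped = dport_toy.map(toy_map)
print(mapped.tolist())
print("unmapped entries:", mapped.isna().sum())
```

Here the float ports `80.0` and `443.0` still match the integer keys, while `8848.0` and the missing value are left unmapped.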

In [5]:
flow_data.service.unique()

array(['ssl', 'http', 'pop3', 'ssh', 'telnet', nan, 'dns', 'ntp', 'ftp',
       'smtp', 'submission', 'imaps', 'snmp', 'urd', 'bgp', 'imap'],
      dtype=object)

#### Q0: What is the number of flows?

**Assumption:** I consider each record as an individual flow due to the lack of a flow ID.
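As a point of comparison only, a flow is often identified by its 5-tuple (source address, destination address, source port, destination port, protocol). The sketch below, on a hypothetical toy frame, shows how the record count and the number of distinct 5-tuples can differ; it is not the definition used in this notebook.

```python
import pandas as pd

# Hypothetical records: the first two rows share the same 5-tuple
toy = pd.DataFrame({
    "saddr": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "daddr": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "sport": [5000, 5000, 6000],
    "dport": [80, 80, 443],
    "proto": ["tcp", "tcp", "tcp"],
})

five_tuple = ["saddr", "daddr", "sport", "dport", "proto"]
n_records = len(toy)
n_five_tuples = len(toy.drop_duplicates(subset=five_tuple))
print(n_records, "records,", n_five_tuples, "distinct 5-tuples")
```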

In [6]:
# Find the number of records
print('There are {} flows in the dataset.'.format(len(flow_data)))

There are 1173395 flows in the dataset.


#### Q1: What is the duration of flows?

The unit of the `dur` column is in seconds.

In [7]:
# Sum up the `dur` column (seconds) and convert to days: 86400 s = 1 day
num_days = flow_data['dur'].sum()/86400
print('The flows last for a total of {:.2f} days.'.format(num_days))

The flows last for a total of 0.17 days.


#### Q2: What are the sizes of the packets?

In [8]:
# Total number of bytes sent in both directions
total_bytes = (flow_data['sbytes'] + flow_data['dbytes']).sum()
# Round to the nearest megabyte
print('{} MB was sent across the network (considering available records).'.format(round(total_bytes/1e6)))

66 MB was sent across the network (considering available records).


#### Feature selection

In [9]:
# Columns with at least one null value
cols_with_missing = list(flow_data.columns[flow_data.isnull().any()])

# Find the number of NaNs in these columns
flow_data[cols_with_missing].isna().sum()

dloss          461
sload           41
dload          413
spkts       130354
dpkts       130355
smeansz     130375
dmeansz     130376
sjit        130404
djit       1170211
sintpkt    1173393
dintpkt    1173393
synack     1173393
ackdat     1173394
service     865371
dtype: int64

Almost all the records in the `djit`, `sintpkt`, `dintpkt`, `synack`, and `ackdat` columns are null, so it is best to drop them.

We also remove some other unnecessary columns:
* We drop `saddr` and `daddr` because the service used seldom depends on the source and destination IP addresses;
* We drop `stime` because it is a timestamp and won't be useful for prediction.

Should we retain `sport` and `dport` for predicting `service`? There is an almost one-to-one correspondence between port numbers and the service used ('almost' because of possible port masquerading).<br>Moreover, the `service` labels here were themselves derived from `dport`, so keeping the ports would leak the target into the features. We drop these as well.

In [10]:
to_drop = ['saddr', 'daddr', 'sport', 'dport', 'stime', 'djit', 'sintpkt',
           'dintpkt', 'synack', 'ackdat']
flow_data.drop(to_drop, axis=1, inplace=True)

We drop the rows where the `service` column is null.

In [11]:
flow_data.dropna(subset=['service'], inplace=True, ignore_index=True)
flow_data

Unnamed: 0,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,sload,dload,spkts,dpkts,smeansz,dmeansz,sjit,service
0,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,ssl
1,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http
2,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http
3,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http
4,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308019,tcp,RST,0.0,188.0,396.0,246.0,248.0,1.0,0.0,0.0,0.0,2.0,6.0,94.0,66.0,0.0,http
308020,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http
308021,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http
308022,tcp,INT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,http


Starting with 1.17M records, we are now left with ~308K records, a drop of about 74%!

#### Check distribution of numeric columns

In [12]:
numeric_cols = flow_data.select_dtypes(include=np.number).columns.tolist()
print(numeric_cols)

['dur', 'sbytes', 'dbytes', 'sttl', 'dttl', 'sloss', 'dloss', 'sload', 'dload', 'spkts', 'dpkts', 'smeansz', 'dmeansz', 'sjit']


In [13]:
flow_data[numeric_cols].describe()

Unnamed: 0,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,sload,dload,spkts,dpkts,smeansz,dmeansz,sjit
count,308024.0,308024.0,308024.0,308024.0,308024.0,308024.0,307708.0,308006.0,307720.0,219335.0,219335.0,219318.0,219317.0,219290.0
mean,0.010958,45.319306,7.017801,95.071124,0.915127,0.003961,5.5e-05,0.011048,0.718182,0.040737,61.249965,2.820016,0.259806,0.011165
std,0.180865,139.42283,2326.649,97.732131,14.386457,0.187537,0.008645,1.83596,2.705012,1.53339,73.348178,30.367057,9.546624,3.906513
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.0,0.0,0.0,0.0
50%,0.0,54.0,0.0,55.0,0.0,0.0,0.0,0.0,1.0,0.0,54.0,0.0,0.0,0.0
75%,0.0,54.0,0.0,240.0,0.0,0.0,0.0,0.0,1.0,0.0,58.0,0.0,0.0,0.0
max,4.990934,34977.0,1289025.0,253.0,393.0,21.0,3.0,644.0,528.0,564.0,2019.0,2333.0,1568.488892,1728.053248


All 14 numeric features are nonnegative and their histograms are right-skewed, resembling an exponential distribution.

Hence, ***a log-transformation is recommended***.
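A small sketch of what the transformation does, using a hypothetical series built from a few values seen in the `sbytes` summary above (0, 54, 58, and the maximum 34977). `np.log1p` computes $\log(1 + x)$, so zeros stay at zero while the long right tail is compressed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: mostly small values plus one very large one
raw = pd.Series([0.0, 0.0, 54.0, 54.0, 54.0, 58.0, 34977.0])

# log1p(x) = log(1 + x): defined at 0 and monotone increasing
logged = np.log1p(raw)

print("skewness before:", raw.skew())
print("skewness after: ", logged.skew())
print("max before/after:", raw.max(), logged.max())
```

The transformed sample is still right-skewed, but noticeably less so, and the largest value no longer dominates the scale.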

#### Apply log-transformation to numeric columns

In [14]:
for feature in numeric_cols:
    flow_data[feature] = np.log1p(flow_data[feature])

# Check distribution of numeric cols now
flow_data[numeric_cols].describe()

Unnamed: 0,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,sload,dload,spkts,dpkts,smeansz,dmeansz,sjit
count,308024.0,308024.0,308024.0,308024.0,308024.0,308024.0,307708.0,308006.0,307720.0,219335.0,219335.0,219318.0,219317.0,219290.0
mean,0.005352,2.844342,0.082773,3.308658,0.058316,0.001394,3.6e-05,0.000285,0.480521,0.021732,3.970318,0.126886,0.009596,0.000565
std,0.081091,1.897745,0.607448,2.190763,0.365582,0.046352,0.005301,0.038311,0.325158,0.133488,0.719499,0.73464,0.20931,0.022915
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.007333,0.0,0.0,0.0
50%,0.0,4.007333,0.0,4.025352,0.0,0.0,0.0,0.0,0.693147,0.0,4.007333,0.0,0.0,0.0
75%,0.0,4.007333,0.0,5.484797,0.0,0.0,0.0,0.0,0.693147,0.0,4.077537,0.0,0.0,0.0
max,1.790247,10.462475,14.069397,5.537334,5.976351,3.091042,1.386294,6.46925,6.270988,6.336826,7.610853,7.755339,7.358505,7.455329


#### Encoding the categorical target variable, `service`

We label-encode the target `service` with `LabelEncoder`. The categorical features `proto` and `state` are not ordinal, so they are one-hot encoded later, in the preprocessing step.

In [15]:
le = LabelEncoder()
flow_data['service'] = le.fit_transform(flow_data['service'])
flow_data

Unnamed: 0,proto,state,dur,sbytes,dbytes,sttl,dttl,sloss,dloss,sload,dload,spkts,dpkts,smeansz,dmeansz,sjit,service
0,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,11
1,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3
2,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3
3,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3
4,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308019,tcp,RST,0.0,5.241747,5.983936,5.509388,5.517453,0.693147,0.0,0.0,0.0,1.098612,1.94591,4.553877,4.204693,0.0,3
308020,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3
308021,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3
308022,tcp,INT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,,,,,,3


In [16]:
# LabelEncoder stores the sorted class names in `classes_`;
# the position of each name is its encoded value
encoded_target_column = {code: str(name) for code, name in enumerate(le.classes_)}
print(encoded_target_column)

{0: 'bgp', 1: 'dns', 2: 'ftp', 3: 'http', 4: 'imap', 5: 'imaps', 6: 'ntp', 7: 'pop3', 8: 'smtp', 9: 'snmp', 10: 'ssh', 11: 'ssl', 12: 'submission', 13: 'telnet', 14: 'urd'}


#### Feature engineering

In [17]:
# Fill in missing values using the mean of the column
miss_cols = ['dloss', 'sload', 'dload', 'spkts', 'dpkts', 'smeansz', 'dmeansz', 'sjit']
for col in miss_cols:
    flow_data[col] = flow_data[col].fillna(flow_data[col].mean())

flow_data.isna().sum()

proto      0
state      0
dur        0
sbytes     0
dbytes     0
sttl       0
dttl       0
sloss      0
dloss      0
sload      0
dload      0
spkts      0
dpkts      0
smeansz    0
dmeansz    0
sjit       0
service    0
dtype: int64
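The fill above computes each column mean on the full dataset, before the train/test split, so test rows influence the imputed values. A minimal sketch of an alternative, assuming the imputation is instead placed inside the preprocessing pipeline with `SimpleImputer` (imported at the top of the notebook but otherwise unused); the toy arrays and the name `num_pipe` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature train and test sets with missing values
X_tr = np.array([[1.0], [3.0], [np.nan]])
X_te = np.array([[np.nan], [100.0]])

# Impute first, then scale; both steps learn their statistics in fit()
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
num_pipe.fit(X_tr)

# The fill value comes from the training rows only: mean of 1.0 and 3.0
fill_value = num_pipe.named_steps["impute"].statistics_[0]
X_te_prepared = num_pipe.transform(X_te)
print(fill_value)
print(X_te_prepared)
```

Such a pipeline could be used as the `'num'` transformer in the column transformer, so that the statistics are learned from `X_train` alone when the model is fitted.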

In [18]:
# The nominal (unordered) categorical features
# are one-hot encoded
nominal_cols = ['proto', 'state']

# Define data preparation for the columns
t = [('num', MinMaxScaler(), numeric_cols),
     ('nom', OneHotEncoder(), nominal_cols)]

col_transform = ColumnTransformer(transformers=t, remainder='passthrough')

#### Split data into train and test sets

We first obtain the number of records for each service: this will help us quantify the amount of imbalance in the data.

In [19]:
X = flow_data.copy()
y = X.pop('service')

y.value_counts()

service
13    114192
3      62713
11     54732
1      30188
10     13774
14      7459
8       7063
12      6581
2       3102
7       2384
6       1824
0       1684
5       1178
9        817
4        333
Name: count, dtype: int64

The services are heavily imbalanced: the largest class (114,192 records) is over 300 times the size of the smallest (333 records). Hence, we perform a stratified train-test split.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state = 15,
                                                    stratify = y)
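The effect of `stratify` can be verified on a hypothetical two-class example with an 80/20 imbalance: both the training and the test portion keep the same class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class 0, 20 of class 1
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.DataFrame({"x": range(len(y_toy))})

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=15, stratify=y_toy
)

# Class shares in each part of the split
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

The 20-row test set contains 16 samples of class 0 and 4 of class 1, matching the 80/20 ratio of the full data.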

### Decision Tree

We go with the decision tree model first because it performs automatic feature selection.<br>
That is, when it is building the tree, it only does so by splitting on features that cause the greatest increase in node purity, so features that a feature selection method would have eliminated aren’t used in the model.
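This implicit feature selection can be observed through the `feature_importances_` attribute of a fitted tree. The sketch below uses hypothetical toy data in which the label depends only on the first column; the second column is pure noise and receives zero importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(15)

# Hypothetical data: column 0 determines the label, column 1 is noise
informative = rng.integers(0, 2, size=200)
noise = rng.normal(size=200)
X_toy = np.column_stack([informative, noise])
y_toy = informative

tree = DecisionTreeClassifier(criterion="gini", random_state=15)
tree.fit(X_toy, y_toy)

# Importance is the (normalized) total impurity decrease per feature
print(tree.feature_importances_)
```

A single split on the first column yields pure leaves, so the tree never needs the noise column.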

In [21]:
%%time
dtree = DecisionTreeClassifier(criterion='gini', class_weight='balanced')

Dtree = Pipeline([
    ("preprocess", col_transform),
    ("classifier", dtree)
])

# Get fitted model
model1 = Dtree.fit(X_train, y_train)
# Get predictions
y_preds = model1.predict(X_test)

CPU times: user 994 ms, sys: 1.66 ms, total: 996 ms
Wall time: 1 s


In [22]:
# The 'weighted' option averages the per-class precision, weighted by class support
precision = precision_score(y_test, y_preds, average = 'weighted')
accuracy = model1.score(X_test, y_test)  # Return mean accuracy

print("Precision: "+ "{:.2%}".format(precision))
print("Accuracy: "+ "{:.2%}".format(accuracy))
model_performance.loc['Decision Tree'] = [precision, accuracy]

Precision: 83.87%
Accuracy: 58.76%
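A weighted precision far above the accuracy may look surprising. The hypothetical example below shows how it can arise: when a classifier over-predicts a minority class, the majority class keeps a high precision (which dominates the weighted average) while its recall, and therefore the accuracy, drops. Note that support-weighted recall always equals accuracy.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 8 samples of majority class 0, 2 of minority class 1
y_true = [0] * 8 + [1] * 2
# The classifier sends half of the majority samples to the minority class
y_pred = [0] * 4 + [1] * 4 + [1] * 2

acc = accuracy_score(y_true, y_pred)
w_prec = precision_score(y_true, y_pred, average="weighted")
w_rec = recall_score(y_true, y_pred, average="weighted")

print("accuracy:           {:.2%}".format(acc))
print("weighted precision: {:.2%}".format(w_prec))
print("weighted recall:    {:.2%}".format(w_rec))
```

In this toy case class 0 has precision 1.0 and class 1 has precision 1/3, giving a weighted precision of about 0.87 against an accuracy of 0.60.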


In [23]:
# The classification report gives the complete picture
print(classification_report(y_test, y_preds, target_names=list(encoded_target_column.values())))

              precision    recall  f1-score   support

         bgp       0.16      0.72      0.26       337
         dns       0.98      0.94      0.96      6038
         ftp       0.12      0.43      0.18       620
        http       0.80      0.35      0.49     12543
        imap       0.03      0.77      0.05        66
       imaps       0.27      0.43      0.33       236
         ntp       0.81      0.75      0.78       365
        pop3       0.02      0.71      0.04       477
        smtp       0.56      0.03      0.05      1413
        snmp       0.65      0.82      0.73       163
         ssh       0.72      0.37      0.49      2755
         ssl       0.86      0.54      0.66     10946
  submission       0.35      0.30      0.32      1316
      telnet       0.96      0.75      0.84     22838
         urd       0.38      0.18      0.25      1492

    accuracy                           0.59     61605
   macro avg       0.51      0.54      0.43     61605
weighted avg       0.84   

### Histogram Gradient Boosting

According to the docs, this estimator is much faster than `GradientBoostingClassifier` for big datasets ($n_{\text{samples}} \geq 10000$).

In [24]:
%%time
our_scorer = make_scorer(precision_score, average='weighted')

# max_depth=None leaves the depth of each tree unconstrained;
# `scoring` is only used for early stopping
histgbc = HistGradientBoostingClassifier(max_depth=None, class_weight='balanced',
                                         scoring=our_scorer, learning_rate=0.1,
                                         random_state=15, max_iter=200)

Hist = Pipeline([
    ("preprocess", col_transform),
    ("classifier", histgbc)
])

# Get fitted model
model2 = Hist.fit(X_train, y_train)
# Get predictions
y_preds = model2.predict(X_test)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


CPU times: user 34.8 s, sys: 142 ms, total: 35 s
Wall time: 9.56 s


In [25]:
precision = precision_score(y_test, y_preds, average = 'weighted')
accuracy = model2.score(X_test, y_test)

print("Precision: "+ "{:.2%}".format(precision))
print("Accuracy: "+ "{:.2%}".format(accuracy))
model_performance.loc['Hist. GrBoost'] = [precision, accuracy]

Precision: 83.58%
Accuracy: 58.75%


In [26]:
# The classification report gives the complete picture
print(classification_report(y_test, y_preds, target_names=list(encoded_target_column.values())))

              precision    recall  f1-score   support

         bgp       0.16      0.73      0.26       337
         dns       0.98      0.94      0.96      6038
         ftp       0.13      0.42      0.20       620
        http       0.78      0.36      0.50     12543
        imap       0.03      0.80      0.05        66
       imaps       0.34      0.42      0.37       236
         ntp       0.82      0.75      0.78       365
        pop3       0.02      0.71      0.04       477
        smtp       0.61      0.03      0.05      1413
        snmp       0.59      0.82      0.69       163
         ssh       0.69      0.37      0.48      2755
         ssl       0.86      0.53      0.66     10946
  submission       0.35      0.37      0.36      1316
      telnet       0.96      0.74      0.84     22838
         urd       0.40      0.12      0.18      1492

    accuracy                           0.59     61605
   macro avg       0.51      0.54      0.43     61605
weighted avg       0.84   

### Logistic Regression

In [27]:
%%time
logreg = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=250)

LogReg = Pipeline([
    ("preprocess", col_transform),
    ("classifier", logreg)
])

# Get fitted model
model3 = LogReg.fit(X_train, y_train)
# Get predictions
y_preds = model3.predict(X_test)

CPU times: user 2min 11s, sys: 1min 12s, total: 3min 24s
Wall time: 55 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
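The warning above means that the `lbfgs` solver stopped at the iteration limit before converging, so the fitted coefficients may not be optimal. A minimal sketch, on hypothetical toy data, of how to detect this programmatically and compare iteration budgets; the helper `fit_and_check` is an illustrative name, not part of scikit-learn.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(15)

# Hypothetical noisy binary problem
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)


def fit_and_check(max_iter):
    """Fit with the given budget; return (iterations used, hit the limit?)."""
    clf = LogisticRegression(solver="lbfgs", max_iter=max_iter)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf.fit(X_toy, y_toy)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(clf.n_iter_[0]), hit_limit


print(fit_and_check(1))     # a budget of one iteration is too small
print(fit_and_check(1000))  # a generous budget lets the solver converge
```

For the model in this notebook, the analogous check would be to raise `max_iter` above 250 and see whether the warning disappears.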


In [28]:
precision = precision_score(y_test, y_preds, average = 'weighted')
accuracy = model3.score(X_test, y_test)

print("Precision: {:.2%}".format(precision))
print("Accuracy: {:.2%}".format(accuracy))
model_performance.loc['Logistic Regression'] = [precision, accuracy]

Precision: 78.62%
Accuracy: 44.78%


In [29]:
# The classification report gives the complete picture
print(classification_report(y_test, y_preds, target_names=list(encoded_target_column.values())))

              precision    recall  f1-score   support

         bgp       0.07      0.74      0.13       337
         dns       0.98      0.51      0.67      6038
         ftp       0.02      0.20      0.04       620
        http       0.95      0.05      0.10     12543
        imap       0.02      0.11      0.03        66
       imaps       0.03      0.08      0.04       236
         ntp       0.08      0.37      0.13       365
        pop3       0.02      0.58      0.04       477
        smtp       0.36      0.02      0.04      1413
        snmp       0.06      0.55      0.11       163
         ssh       0.34      0.09      0.15      2755
         ssl       0.62      0.55      0.59     10946
  submission       0.26      0.40      0.31      1316
      telnet       0.96      0.70      0.81     22838
         urd       0.02      0.00      0.00      1492

    accuracy                           0.45     61605
   macro avg       0.32      0.33      0.21     61605
weighted avg       0.79   

### Conclusion

On a closing note, we provide the model performance statistics for the three models we fitted above.

In [30]:
model_performance

Unnamed: 0,Precision,Accuracy
Decision Tree,0.838697,0.587598
Hist. GrBoost,0.835772,0.587485
Logistic Regression,0.786163,0.447756


The logistic regression model gives the lowest precision and accuracy, while the performance of the other two models is quite similar (a weighted precision of about 84% and an accuracy of about 59% for both).