## Hackathon Datasets Information

The dataset contains detailed records of network activity, capturing attributes of each network connection. Each record is labeled as either normal activity or a "Neptune" attack, which makes this a binary classification problem.

A Neptune attack, also known as a SYN flood attack, is a type of denial-of-service (DoS) attack where an attacker overwhelms a target system with a high number of SYN requests, causing the system to become unresponsive to legitimate traffic. It exploits the TCP handshake process to consume resources on the target machine.
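
The mechanism described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the function name `simulate_backlog`, the fixed backlog size, and the absence of timeouts are all hypothetical and do not reflect how any real TCP stack is implemented.

```python
def simulate_backlog(events, backlog_size):
    """Toy model of a server's queue of half-open connections.

    events: list of (client, completes_handshake) tuples, one per incoming SYN.
    Returns the clients whose SYN was dropped because the queue was full.
    """
    backlog = []
    dropped = []
    for client, completes in events:
        if len(backlog) >= backlog_size:
            # No free slot: the SYN is dropped, whoever sent it.
            dropped.append(client)
            continue
        backlog.append(client)
        if completes:
            # The final ACK arrives, so the half-open slot is freed.
            backlog.remove(client)
    return dropped


legit = [("alice", True), ("bob", True)]
flood = [("attacker", False)] * 5  # SYNs that never complete the handshake

dropped_clients = simulate_backlog(legit + flood + legit, backlog_size=4)
print(dropped_clients)  # ['attacker', 'alice', 'bob']
```

Before the flood, both legitimate clients connect. Once the attacker's half-open connections fill the queue, the same clients are turned away.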

The training set contains 86,845 rows, each labeled in the attack column as normal or not. Use it to train a machine learning model, then predict whether each of the 21,712 entries in the test set is normal activity or a Neptune attack.

Note: in the target variable (attack),

“normal” means normal activity (no attack), i.e., attack = 0;

“neptune” means a Neptune attack, i.e., attack = 1.
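
The label convention above can be written down as a small lookup, which is handy for turning numeric predictions back into readable names. The names `LABEL_TO_INT`, `INT_TO_LABEL` and `decode` are hypothetical helpers for illustration, not part of the dataset or of any library.

```python
# Mapping taken from the note above: normal -> 0, neptune -> 1.
LABEL_TO_INT = {"normal": 0, "neptune": 1}
INT_TO_LABEL = {value: name for name, value in LABEL_TO_INT.items()}


def decode(predictions):
    """Convert numeric predictions (0/1) back to their label names."""
    return [INT_TO_LABEL[int(p)] for p in predictions]


decoded = decode([1, 0, 1])
print(decoded)  # ['neptune', 'normal', 'neptune']
```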


In [1]:
#Modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix

# Load the training and test data
train_data=pd.read_csv("./Train_Data.csv")
test_data=pd.read_csv("./Test_Data.csv")

In [2]:
cols=train_data.columns
print(cols)
train_data.head()

Index(['duration', 'protocoltype', 'service', 'flag', 'srcbytes', 'dstbytes',
       'land', 'wrongfragment', 'urgent', 'hot', 'numfailedlogins', 'loggedin',
       'numcompromised', 'rootshell', 'suattempted', 'numroot',
       'numfilecreations', 'numshells', 'numaccessfiles', 'numoutboundcmds',
       'ishostlogin', 'isguestlogin', 'count', 'srvcount', 'serrorrate',
       'srvserrorrate', 'rerrorrate', 'srvrerrorrate', 'samesrvrate',
       'diffsrvrate', 'srvdiffhostrate', 'dsthostcount', 'dsthostsrvcount',
       'dsthostsamesrvrate', 'dsthostdiffsrvrate', 'dsthostsamesrcportrate',
       'dsthostsrvdiffhostrate', 'dsthostserrorrate', 'dsthostsrvserrorrate',
       'dsthostrerrorrate', 'dsthostsrvrerrorrate', 'lastflag', 'attack'],
      dtype='object')


Unnamed: 0,duration,protocoltype,service,flag,srcbytes,dstbytes,land,wrongfragment,urgent,hot,...,dsthostsamesrvrate,dsthostdiffsrvrate,dsthostsamesrcportrate,dsthostsrvdiffhostrate,dsthostserrorrate,dsthostsrvserrorrate,dsthostrerrorrate,dsthostsrvrerrorrate,lastflag,attack
0,0,tcp,netbios_dgm,REJ,0,0,0,0,0,0,...,0.06,0.06,0.0,0.0,0.0,0.0,1.0,1.0,21,1
1,0,tcp,smtp,SF,1239,400,0,0,0,0,...,0.45,0.04,0.0,0.0,0.11,0.0,0.02,0.0,18,0
2,0,tcp,http,SF,222,945,0,0,0,0,...,1.0,0.0,0.02,0.03,0.0,0.0,0.0,0.0,21,0
3,0,tcp,http,SF,235,1380,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21,0
4,0,tcp,uucp_path,REJ,0,0,0,0,0,0,...,0.01,0.08,0.0,0.0,0.0,0.0,1.0,1.0,19,1


In [3]:
# EDA
print(train_data.info())
print(train_data.describe())
print(train_data.isnull().sum())

# print(train_data.flag.unique())
# print(train_data.protocoltype.unique())
# print(train_data.service.unique())

for i in cols:
    if train_data[i].dtype=='object':
        print(train_data[i].unique())

# The data has no null values, so there is no missing data to handle.
# Most columns are numeric, so standardization and a correlation check are needed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86845 entries, 0 to 86844
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   duration                86845 non-null  int64  
 1   protocoltype            86845 non-null  object 
 2   service                 86845 non-null  object 
 3   flag                    86845 non-null  object 
 4   srcbytes                86845 non-null  int64  
 5   dstbytes                86845 non-null  int64  
 6   land                    86845 non-null  int64  
 7   wrongfragment           86845 non-null  int64  
 8   urgent                  86845 non-null  int64  
 9   hot                     86845 non-null  int64  
 10  numfailedlogins         86845 non-null  int64  
 11  loggedin                86845 non-null  int64  
 12  numcompromised          86845 non-null  int64  
 13  rootshell               86845 non-null  int64  
 14  suattempted             86845 non-null

In [4]:
test_data.head()

# for i in cols:
#     if test_data[i].dtype=='object':
#         print(test_data[i].unique(),train_data[i].unique())

Unnamed: 0,duration,protocoltype,service,flag,srcbytes,dstbytes,land,wrongfragment,urgent,hot,...,dsthostsrvcount,dsthostsamesrvrate,dsthostdiffsrvrate,dsthostsamesrcportrate,dsthostsrvdiffhostrate,dsthostserrorrate,dsthostsrvserrorrate,dsthostrerrorrate,dsthostsrvrerrorrate,lastflag
0,0,tcp,mtp,REJ,0,0,0,0,0,0,...,7,0.03,0.08,0.0,0.0,0.0,0.0,1.0,1.0,20
1,0,tcp,http,SF,199,1721,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,21
2,0,tcp,discard,S0,0,0,0,0,0,0,...,14,0.05,0.09,0.0,0.0,1.0,1.0,0.0,0.0,18
3,0,tcp,telnet,S0,0,0,0,0,0,0,...,2,0.01,0.09,0.0,0.0,1.0,1.0,0.0,0.0,18
4,0,tcp,exec,S0,0,0,0,0,0,0,...,16,0.06,0.06,0.0,0.0,1.0,1.0,0.0,0.0,20


In [5]:
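# Encode protocoltype, service and flag with LabelEncoder.
# Earlier attempt, kept for reference: it reused a single encoder, so every
# test-set transform would use the mapping fitted last (flag).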
# le = LabelEncoder()
# train_data['protocoltype'] = le.fit_transform(train_data['protocoltype'])
# train_data['service'] = le.fit_transform(train_data['service'])
# train_data['flag'] = le.fit_transform(train_data['flag'])

# test_data['protocoltype'] = le.transform(test_data['protocoltype'])
# test_data['service'] = le.transform(test_data['service'])
# test_data['flag'] = le.transform(test_data['flag'])

label_encoders = {}
# Encode categorical columns in training data
for col in ['protocoltype', 'service', 'flag']:
    le = LabelEncoder()
    train_data[col] = le.fit_transform(train_data[col])
    label_encoders[col] = le

# Transform categorical columns in test data using the fitted encoders
for col in ['protocoltype', 'service', 'flag']:
    test_data[col] = label_encoders[col].transform(test_data[col])

# Standardize every feature column (all columns except the target, attack).
# Note: this also rescales the label-encoded categorical columns.
scaler = StandardScaler()
num_features = cols[:-1]
print(num_features)
train_data[num_features] = scaler.fit_transform(train_data[num_features])


Index(['duration', 'protocoltype', 'service', 'flag', 'srcbytes', 'dstbytes',
       'land', 'wrongfragment', 'urgent', 'hot', 'numfailedlogins', 'loggedin',
       'numcompromised', 'rootshell', 'suattempted', 'numroot',
       'numfilecreations', 'numshells', 'numaccessfiles', 'numoutboundcmds',
       'ishostlogin', 'isguestlogin', 'count', 'srvcount', 'serrorrate',
       'srvserrorrate', 'rerrorrate', 'srvrerrorrate', 'samesrvrate',
       'diffsrvrate', 'srvdiffhostrate', 'dsthostcount', 'dsthostsrvcount',
       'dsthostsamesrvrate', 'dsthostdiffsrvrate', 'dsthostsamesrcportrate',
       'dsthostsrvdiffhostrate', 'dsthostserrorrate', 'dsthostsrvserrorrate',
       'dsthostrerrorrate', 'dsthostsrvrerrorrate', 'lastflag'],
      dtype='object')
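
One caveat with the encoding step: `LabelEncoder.transform` raises a `ValueError` for any label it did not see during `fit`, so the test-set transform would fail if the test data contained, say, a service value absent from the training data. The sketch below shows one possible alternative, scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"` (available from scikit-learn 0.24). The tiny arrays are made-up illustration data, not rows from the hackathon files.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up example values; "mtp" does not appear in the training column.
train_service = np.array([["http"], ["smtp"], ["http"]], dtype=object)
test_service = np.array([["smtp"], ["mtp"]], dtype=object)

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_service)

encoded = encoder.transform(test_service)
print(encoded.ravel().tolist())  # [1.0, -1.0]
```

Whether -1 is a sensible code for unseen values depends on the downstream model, so treat this as a starting point rather than a recommendation.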


In [6]:
train_data.describe()

Unnamed: 0,duration,protocoltype,service,flag,srcbytes,dstbytes,land,wrongfragment,urgent,hot,...,dsthostsamesrvrate,dsthostdiffsrvrate,dsthostsamesrcportrate,dsthostsrvdiffhostrate,dsthostserrorrate,dsthostsrvserrorrate,dsthostrerrorrate,dsthostsrvrerrorrate,lastflag,attack
count,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,...,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0,86845.0
mean,-4.172685e-17,1.495621e-16,6.659932000000001e-17,-1.184306e-16,3.927233e-18,3.2726939999999997e-19,-3.190876e-18,0.0,-4.909040999999999e-19,5.3999450000000004e-18,...,7.881056e-17,1.129079e-17,-1.2763510000000001e-17,-4.7617690000000007e-17,1.505439e-17,8.247188e-17,2.327703e-17,-6.741749e-17,-8.012782e-16,0.379964
std,1.000006,1.000006,1.000006,1.000006,1.000006,1.000006,1.000006,0.0,1.000006,1.000006,...,1.000006,1.000006,1.000006,1.000006,1.000006,1.000006,1.000006,1.000006,1.000006,0.48538
min,-0.1003276,-3.239013,-1.92715,-2.532093,-0.02360532,-0.04928902,-0.00678684,0.0,-0.004552693,-0.07887654,...,-1.176832,-0.4857206,-0.363793,-0.2939405,-0.6992758,-0.6880943,-0.32674,-0.3234844,-13.40952,0.0
25%,-0.1003276,-0.3007457,-0.508099,-0.8665781,-0.02360532,-0.04928902,-0.00678684,0.0,-0.004552693,-0.07887654,...,-1.065062,-0.4857206,-0.363793,-0.2939405,-0.6992758,-0.6880943,-0.32674,-0.3234844,-0.7461017,0.0
50%,-0.1003276,-0.3007457,-0.4405252,0.7989372,-0.0234769,-0.04849522,-0.00678684,0.0,-0.004552693,-0.07887654,...,0.05262858,-0.1953661,-0.363793,-0.2939405,-0.6992758,-0.6880943,-0.32674,-0.3234844,0.660945,0.0
75%,-0.1003276,-0.3007457,1.046099,0.7989372,-0.02284599,-0.03543373,-0.00678684,0.0,-0.004552693,-0.07887654,...,1.058551,0.1917731,-0.2678178,-0.1130116,1.454557,1.469389,-0.32674,-0.3234844,0.660945,1.0
max,38.91081,2.637522,2.397576,1.215316,250.056,126.7528,147.344,0.0,263.5812,41.81568,...,1.058551,9.192761,4.434968,17.79895,1.454557,1.469389,3.20975,3.229542,0.660945,1.0
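
The std row above shows 1.000006 rather than exactly 1. This is consistent with `StandardScaler` dividing by the population standard deviation (ddof=0) while `describe()` reports the sample standard deviation (ddof=1), which differ by a factor of sqrt(n / (n - 1)). The short check below uses only the standard library; the eight sample values are arbitrary illustration data.

```python
import math
import statistics

# Ratio between sample std (ddof=1) and population std (ddof=0) for n rows.
n_rows = 86845
ratio = math.sqrt(n_rows / (n_rows - 1))
print(round(ratio, 6))  # 1.000006

# Same effect on a tiny example: scale with the population std,
# then measure the result with the sample std.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)
scaled = [(v - mean) / pop_std for v in values]
sample_std_of_scaled = statistics.stdev(scaled)
print(sample_std_of_scaled)
```

The wrongfragment column has std 0.0 because it is constant in the training data, so scaling leaves it at zero.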


In [7]:

# Hold out 20% of the training data for validation
# (the scaler above was fitted on all rows, including these)
X = train_data.drop(['attack'], axis=1)
y = train_data['attack']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

In [9]:
# Model Evaluation
y_pred = model.predict(X_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     10762
           1       1.00      1.00      1.00      6607

    accuracy                           1.00     17369
   macro avg       1.00      1.00      1.00     17369
weighted avg       1.00      1.00      1.00     17369

[[10762     0]
 [    0  6607]]
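
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, so for labels 0 and 1 the layout is [[TN, FP], [FN, TP]]. The variable names below are chosen for illustration.

```python
# Confusion matrix copied from the output above.
confusion = [[10762, 0],
             [0, 6607]]

tn, fp = confusion[0]  # true label 0 (normal)
fn, tp = confusion[1]  # true label 1 (neptune)

total = tn + fp + fn + tp
precision = tp / (tp + fp)   # of predicted attacks, how many were attacks
recall = tp / (tp + fn)      # of actual attacks, how many were caught
accuracy = (tp + tn) / total

print(total, precision, recall, accuracy)  # 17369 1.0 1.0 1.0
```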


In [10]:
# Hyperparameter Tuning with GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
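
To get a feel for the cost of this search, the number of model fits can be counted directly from the grid. The sketch below assumes the defaults used above: 5-fold cross-validation and `refit=True`, which adds one final fit of the best configuration on all of `X_train`.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
}

# Every combination of the listed values is tried once per fold.
combinations = list(product(*param_grid.values()))
cv_folds = 5
cv_fits = len(combinations) * cv_folds
total_fits = cv_fits + 1  # plus the final refit on the full training split

print(len(combinations), cv_fits, total_fits)  # 36 180 181
```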


In [11]:
# Cross-Validation
kfold = KFold(n_splits=5)
results = cross_val_score(best_model, X, y, cv=kfold)
print("Cross-Validation Scores:", results)
print("Mean Accuracy:", results.mean())


Cross-Validation Scores: [1.         1.         0.99988485 1.         1.        ]
Mean Accuracy: 0.9999769704646209
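
The scores above can be read more concretely. With 86,845 rows and 5 folds, each fold holds 17,369 rows, so a fold accuracy of 0.99988485 is consistent with two misclassified rows in that fold. The check below uses the rounded scores as printed, so the mean matches the reported value only up to rounding.

```python
import math

# Scores copied from the printed output above (rounded as displayed).
scores = [1.0, 1.0, 0.99988485, 1.0, 1.0]

n_rows = 86845
fold_size = n_rows // len(scores)  # 17369 rows per fold, no remainder
implied_errors = round((1 - min(scores)) * fold_size)

mean_accuracy = sum(scores) / len(scores)
print(fold_size, implied_errors, mean_accuracy)
```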


In [12]:
# Ensemble Method
model1 = RandomForestClassifier(random_state=42)
model2 = GradientBoostingClassifier(random_state=42)
model3 = LogisticRegression(random_state=42)

ensemble_model = VotingClassifier(estimators=[
    ('rf', model1), ('gb', model2), ('lr', model3)], voting='hard')
ensemble_model.fit(X_train, y_train)
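
With `voting='hard'`, the ensemble predicts the class chosen by the majority of its member models. The sketch below reproduces that rule in plain Python on made-up predictions; the `hard_vote` helper and the example votes are hypothetical and are not taken from the fitted models.

```python
from collections import Counter


def hard_vote(votes):
    """Return the label predicted by the most models for one sample."""
    return Counter(votes).most_common(1)[0][0]


# Made-up predictions of three models for three samples.
per_model = {
    "rf": [1, 0, 1],
    "gb": [1, 0, 0],
    "lr": [0, 0, 1],
}

# zip(*...) regroups the predictions sample by sample.
ensemble_predictions = [hard_vote(votes) for votes in zip(*per_model.values())]
print(ensemble_predictions)  # [1, 0, 1]
```

With three voters and two classes a tie cannot occur, so the majority is always well defined here.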


In [13]:
# The test data is already encoded; scale it with the scaler fitted on the training data, then predict
test_data[num_features] = scaler.transform(test_data[num_features])

test_predictions = ensemble_model.predict(test_data)

In [20]:
submission_df = pd.DataFrame({
    'attack': test_predictions  # column name assumed to match the expected submission format
})

# Save predictions to a CSV file
submission_df.to_csv('output.csv', index=False)

In [21]:
submission_format=pd.read_csv("./output.csv")
# submission_format['attack']=test_predictions
submission_format.head()

Unnamed: 0,attack
0,1
1,0
2,1
3,1
4,1
