# Network Security, Information Security, Cyber Security

### Dataset Description

KDDTrain+.ARFF The full NSL-KDD train set with binary labels in ARFF format <br>

KDDTrain+.TXT The full NSL-KDD train set including attack-type labels and difficulty level in CSV format <br>

KDDTrain+_20Percent.ARFF A 20% subset of the KDDTrain+.arff file <br>

KDDTrain+_20Percent.TXT A 20% subset of the KDDTrain+.txt file <br>

KDDTest+.ARFF The full NSL-KDD test set with binary labels in ARFF format <br>

KDDTest+.TXT The full NSL-KDD test set including attack-type labels and difficulty level in CSV format <br>

KDDTest-21.ARFF A subset of the KDDTest+.arff file which does not include records with difficulty level of 21 out of 21 <br>

KDDTest-21.TXT A subset of the KDDTest+.txt file which does not include records with difficulty level of 21 out of 21 <br>
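
The ARFF files carry binary labels while the TXT files carry attack-type labels. As a minimal sketch, a binary target can be derived from the attack-type column, assuming the benign class is spelled `normal` in the TXT files and that this runs before any label encoding. The helper name `to_binary_label` and the toy frame are illustrative only.

```python
import pandas as pd

def to_binary_label(outcome: pd.Series, normal_label: str = "normal") -> pd.Series:
    """Map attack-type strings to 0 (normal traffic) or 1 (any attack)."""
    return outcome.ne(normal_label).astype(int)

# Toy frame standing in for the 'outcome' column of KDDTrain+.txt
toy = pd.DataFrame({"outcome": ["normal", "neptune", "normal", "smurf"]})
toy["is_attack"] = to_binary_label(toy["outcome"])
print(toy)
```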


### Improvements to the KDD'99 data set


The NSL-KDD data set has the following advantages over the original KDD data set:

It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records.

There are no duplicate records in the proposed test sets; therefore, the performance of the learners is not biased by methods that have better detection rates on the frequent records.

The number of selected records from each difficulty-level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary over a wider range, which makes it easier to evaluate different learning techniques accurately.

The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.
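
The redundancy claims above can be checked once a file is loaded. The sketch below counts rows that repeat an earlier row; the helper name `count_duplicate_rows`, its `ignore` parameter, and the toy frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def count_duplicate_rows(df: pd.DataFrame, ignore=()) -> int:
    """Number of rows that repeat an earlier row, ignoring the given columns."""
    return int(df.drop(columns=list(ignore)).duplicated().sum())

# Toy frame: three columns standing in for the 43 NSL-KDD columns
toy = pd.DataFrame({
    "duration": [0, 0, 5, 0],
    "protocol_type": ["tcp", "tcp", "udp", "tcp"],
    "level": [21, 20, 18, 21],
})
exact_duplicates = count_duplicate_rows(toy)
duplicates_without_level = count_duplicate_rows(toy, ignore=["level"])
print(exact_duplicates, duplicates_without_level)
```

On the real data this would be called as, for example, `count_duplicate_rows(KDD_TRAIN, ignore=["level"])` after the read cell.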

Dataset: https://www.kaggle.com/datasets/hassan06/nslkdd

# Mount Google Drive with Colab

In [None]:
# mount drive with colab
from google.colab import drive
drive.mount('/content/mydrive')

# Import Libraries

In [None]:
# preprocessing Libraries
import numpy as np
import pandas as pd
# Label Encoding Library
from sklearn import preprocessing
# Model Selection and Evaluation Libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Logistic Regression Model Library
from sklearn.linear_model import LogisticRegression
# DecisionTree Model Library
from sklearn.tree import DecisionTreeClassifier
# SVM Model Library
from sklearn.svm import SVC
# KNN Model Library
from sklearn.neighbors import KNeighborsClassifier
# Random forest and Gradient Boosting Classifier Library
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Read Datasets

In [None]:
# Define the Columns of Dataset
columns = (['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot'
,'num_failed_logins','logged_in','num_compromised','root_shell','su_attempted','num_root','num_file_creations'
,'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count','serror_rate'
,'srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate','dst_host_count','dst_host_srv_count'
,'dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate'
,'dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','outcome','level'])


In [None]:
# read the test dataset
KDD_TEST=pd.read_csv('/content/mydrive/MyDrive/dataset/NSL-KDD/NSL-KDD/KDDTest+.txt' , names= columns)
print(KDD_TEST.head())

In [None]:
# read the KDDTest-21 subset (test records with difficulty level 21 out of 21 removed)
KDD_TEST_21=pd.read_csv('/content/mydrive/MyDrive/dataset/NSL-KDD/NSL-KDD/KDDTest-21.txt' , names= columns)
print(KDD_TEST_21.head())

In [None]:
# read the train dataset
KDD_TRAIN=pd.read_csv('/content/mydrive/MyDrive/dataset/NSL-KDD/NSL-KDD/KDDTrain+.txt' , names= columns)
print(KDD_TRAIN.head())

In [None]:
# read the 20% subset of the train dataset
KDD_TRAIN_20=pd.read_csv('/content/mydrive/MyDrive/dataset/NSL-KDD/NSL-KDD/KDDTrain+_20Percent.txt' , names= columns)
print(KDD_TRAIN_20.head())

# Dataset Preprocessing - EDA



> Identify the Null Values From Different Datasets



In [None]:
# KDD_TEST
KDD_TEST.isnull().sum()

In [None]:
# KDD_TEST_21
KDD_TEST_21.isnull().sum()

In [None]:
# KDD_TRAIN
KDD_TRAIN.isnull().sum()

In [None]:
# KDD_TRAIN_20
KDD_TRAIN_20.isnull().sum()

**Note**: <br> *None of the datasets (KDD_TEST, KDD_TEST_21, KDD_TRAIN and KDD_TRAIN_20) contain null or missing values.*
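
The four null checks can also be collapsed into one summary. This is a minimal sketch with a hypothetical helper, `null_report`, demonstrated on toy frames rather than the NSL-KDD files.

```python
import pandas as pd

def null_report(datasets: dict) -> pd.Series:
    """Total number of missing cells for each named DataFrame."""
    return pd.Series(
        {name: int(df.isnull().sum().sum()) for name, df in datasets.items()}
    )

# Toy frames: one without gaps, one with two missing cells
toy_datasets = {
    "clean": pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
    "dirty": pd.DataFrame({"a": [1, None], "b": ["x", None]}),
}
report = null_report(toy_datasets)
print(report)
```

In the notebook the call would look like `null_report({"KDD_TEST": KDD_TEST, "KDD_TEST_21": KDD_TEST_21, "KDD_TRAIN": KDD_TRAIN, "KDD_TRAIN_20": KDD_TRAIN_20})`.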



> Information of Different Datasets



In [None]:
KDD_TEST.info()

In [None]:
KDD_TEST_21.info()

In [None]:
KDD_TRAIN.info()

In [None]:
KDD_TRAIN_20.info()



> Encoding Labels



In [None]:
from sklearn import preprocessing

def label_encode(dataset):
  # Encode the categorical columns in place. A fresh encoder is fitted for each
  # column, so the integer codes are specific to the dataset passed in.
  for column in ['protocol_type', 'service', 'flag', 'outcome']:
    dataset[column] = preprocessing.LabelEncoder().fit_transform(dataset[column])

In [None]:
label_encode(KDD_TEST)

In [None]:
label_encode(KDD_TEST_21)

In [None]:
label_encode(KDD_TRAIN)

In [None]:
label_encode(KDD_TRAIN_20)
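
Because the encoder is refitted for every dataset, the same string (for example a service name) may receive different integers in different frames. That does not matter while each frame is split and evaluated on its own, as in this notebook, but it would if a model trained on one frame were scored on another. The sketch below fits one encoder per column on the union of all frames; `fit_shared_encoders` and `apply_encoders` are hypothetical helpers, shown on toy data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

CATEGORICAL = ["protocol_type", "service", "flag", "outcome"]

def fit_shared_encoders(frames, columns=CATEGORICAL):
    """Fit one LabelEncoder per column on the values from every frame."""
    encoders = {}
    for col in columns:
        values = pd.concat([frame[col] for frame in frames], ignore_index=True)
        encoders[col] = LabelEncoder().fit(values)
    return encoders

def apply_encoders(frame, encoders):
    """Return an encoded copy of the frame, leaving the input untouched."""
    out = frame.copy()
    for col, encoder in encoders.items():
        out[col] = encoder.transform(out[col])
    return out

# Toy frames with only two of the four categorical columns
train = pd.DataFrame({"protocol_type": ["tcp", "udp"], "outcome": ["normal", "neptune"]})
test = pd.DataFrame({"protocol_type": ["udp", "icmp"], "outcome": ["smurf", "normal"]})

encoders = fit_shared_encoders([train, test], columns=["protocol_type", "outcome"])
train_enc = apply_encoders(train, encoders)
test_enc = apply_encoders(test, encoders)
print(train_enc)
print(test_enc)
```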



> Description of Different Datasets



In [None]:
KDD_TEST.describe()

In [None]:
KDD_TEST_21.describe()

In [None]:
KDD_TRAIN.describe()

In [None]:
KDD_TRAIN_20.describe()

# Visualization



> KDD_TEST


In [None]:
KDD_TEST.hist(color='Teal' , figsize=(22,21))



> KDD_TEST_21



In [None]:
KDD_TEST_21.hist(color = 'green' , figsize=(22,21))



> KDD_TRAIN



In [None]:
KDD_TRAIN.hist( color = 'Olive' , figsize=(20,18))



> KDD_TRAIN_20



In [None]:
KDD_TRAIN_20.hist(figsize=(20,18))

# Correlation



> Identify which variables affect the target variable





> KDD_TEST_21



In [None]:
KDD_TEST_21_cor=KDD_TEST_21.corr()
print(KDD_TEST_21_cor)



> HeatMap



In [None]:
# figure size
plt.figure(figsize=(39,40))
# mask the upper triangle; KDD_TEST_21_cor is already the correlation matrix
mask = np.triu(np.ones_like(KDD_TEST_21_cor, dtype=bool))
# Configure a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(KDD_TEST_21_cor, annot=True, mask=mask, cmap=cmap)
plt.show()
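
A 43-column heatmap is hard to read, so a ranked list of correlations with the target can help. The following is a rough sketch on synthetic data; `rank_by_target_correlation` is a hypothetical helper. Since `outcome` is a label-encoded class column, Pearson correlation with it is only a coarse guide, not a feature-importance measure.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(df: pd.DataFrame, target: str = "outcome") -> pd.Series:
    """Absolute correlation of every other column with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)

# Synthetic frame: 'strong' drives the target, 'weak' is independent noise
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
weak = rng.normal(size=200)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "outcome": 2 * strong + 0.1 * rng.normal(size=200),
})
ranking = rank_by_target_correlation(toy)
print(ranking)
```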



>Drop the Unnecessary Columns



In [None]:
x_KDD_TEST_21=KDD_TEST_21.drop(columns=['outcome' , 'level' , 'num_outbound_cmds'] , axis=1)
y_KDD_TEST_21=KDD_TEST_21['outcome']
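
The notebook does not state why `num_outbound_cmds` is treated as unnecessary. One plausible reason, assumed here, is that the column holds a single value and so carries no information. A small sketch with a hypothetical helper, `constant_columns`, shows how such columns could be found on a toy frame.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Names of columns that contain at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame: only the first column is constant
toy = pd.DataFrame({
    "num_outbound_cmds": [0, 0, 0],
    "src_bytes": [10, 0, 232],
    "is_host_login": [0, 0, 1],
})
to_drop = constant_columns(toy)
print(to_drop)
```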

In [None]:
# shape
print(x_KDD_TEST_21.shape)
print(y_KDD_TEST_21.shape)

In [None]:
x_KDD_TEST_21.columns

In [None]:
# split the x and y into training and testing dataset
x_train_KDD_TEST_21   , x_test_KDD_TEST_21 , y_train_KDD_TEST_21, y_test_KDD_TEST_21=train_test_split(x_KDD_TEST_21,y_KDD_TEST_21 , test_size=0.23 , random_state=42)

In [None]:
# shape
print(x_train_KDD_TEST_21.shape)
print(x_test_KDD_TEST_21.shape)
print(y_train_KDD_TEST_21.shape)
print(y_test_KDD_TEST_21.shape)

# Testing Machine Learning Models

# Logistic Regression

In [None]:
# imports for a randomized hyper-parameter search (not used in the cells below)
from scipy.stats import loguniform
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Build the logistic regression model
lr = LogisticRegression().fit(x_train_KDD_TEST_21,y_train_KDD_TEST_21 )
pred1_KDD_TEST_21 = lr.predict(x_test_KDD_TEST_21)

In [None]:
# Display the accuracy, confusion matrix and classification report
accuracy_lr = accuracy_score(y_test_KDD_TEST_21, pred1_KDD_TEST_21)
print(f'accuracy: {accuracy_lr}')
print(f' confusion matrix:')
print(confusion_matrix(y_test_KDD_TEST_21, pred1_KDD_TEST_21))
print(f' classification report:')
print(classification_report(y_test_KDD_TEST_21, pred1_KDD_TEST_21))
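
The features here sit on very different scales (byte counts next to rates between 0 and 1), which may hurt logistic regression as well as the distance- and kernel-based models further down. As a sketch only, the model can be wrapped in a pipeline with `StandardScaler`. The data below is synthetic, so the printed accuracy says nothing about NSL-KDD.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=42)
X[:, 0] *= 10_000  # mimic a byte-count column on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.23, random_state=42
)

# The scaler is fitted on the training split only, then reused for prediction
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, scaled_lr.predict(X_test))
print(f"accuracy with scaling: {scaled_accuracy:.3f}")
```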

# KNN

In [None]:
# Build the KNN model
KNN_model=KNeighborsClassifier()
KNN_model.fit(x_train_KDD_TEST_21, y_train_KDD_TEST_21)
pred2_KDD_TEST_21 = KNN_model.predict(x_test_KDD_TEST_21)

In [None]:
# Display the Accuracy , confusion matrix and Classification Report
accuracy_knn = accuracy_score(y_test_KDD_TEST_21, pred2_KDD_TEST_21)
print(f'accuracy: {accuracy_knn}')
print(f' confusion matrix:')
print(confusion_matrix(y_test_KDD_TEST_21, pred2_KDD_TEST_21))
print(f' classification report:')
print(classification_report(y_test_KDD_TEST_21, pred2_KDD_TEST_21))

# Random Forest

In [None]:
# Build the Random Forest model
Ran_model=RandomForestClassifier()
Ran_model.fit(x_train_KDD_TEST_21, y_train_KDD_TEST_21)
pred3_KDD_TEST_21 = Ran_model.predict(x_test_KDD_TEST_21)

In [None]:
# Display the accuracy, confusion matrix and classification report
accuracy_ram = accuracy_score(y_test_KDD_TEST_21, pred3_KDD_TEST_21)
print(f'accuracy: {accuracy_ram}')
print(f' confusion matrix:')
print(confusion_matrix(y_test_KDD_TEST_21, pred3_KDD_TEST_21))
print(f' classification report:')
print(classification_report(y_test_KDD_TEST_21, pred3_KDD_TEST_21))

In [None]:
print(x_train_KDD_TEST_21.shape)
print(y_train_KDD_TEST_21.shape)

# SVM

In [None]:
# Build the SVM model
svc_model=SVC()
svc_model.fit(x_train_KDD_TEST_21, y_train_KDD_TEST_21)
pred4_KDD_TEST_21 = svc_model.predict(x_test_KDD_TEST_21)

In [None]:
from sklearn.model_selection import GridSearchCV


In [None]:
# Display the Accuracy , confusion matrix and classification report
accuracy_svm = accuracy_score(y_test_KDD_TEST_21, pred4_KDD_TEST_21)
print(f'accuracy: {accuracy_svm}')
print(f' confusion matrix:')
print(confusion_matrix(y_test_KDD_TEST_21, pred4_KDD_TEST_21))
print(f' classification report:')
print(classification_report(y_test_KDD_TEST_21, pred4_KDD_TEST_21))

# using GridSearch

# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(svc_model, param_grid, refit = True, verbose = 3)

# fitting the model for grid search
grid.fit(x_train_KDD_TEST_21, y_train_KDD_TEST_21)

In [None]:
# print best parameter after tuning
print(grid.best_params_)

# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)
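
The grid above tries 5 x 5 = 25 parameter combinations, each fitted once per cross-validation fold, which can be slow for an SVM. A randomized search samples a fixed number of candidates instead. The sketch below uses `RandomizedSearchCV` with `loguniform` over the same ranges on synthetic data; the `n_iter` and `cv` values are arbitrary choices for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic problem so the search finishes quickly
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

# Sample C and gamma on a log scale over the ranges used in the grid
param_distributions = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["rbf"],
}

search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=8, cv=3, random_state=0
)
search.fit(X, y)
print(search.best_params_)
```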

# Gradient Boosting

In [None]:
# build the gradient boosting classifier (scikit-learn's GradientBoostingClassifier, not the xgboost library)
Gra_model=GradientBoostingClassifier()
Gra_model.fit(x_train_KDD_TEST_21, y_train_KDD_TEST_21)
pred5_KDD_TEST_21 = Gra_model.predict(x_test_KDD_TEST_21)

In [None]:
# identify the accuracy , confusion matrix and classification report
accuracy_grad_model= accuracy_score(y_test_KDD_TEST_21, pred5_KDD_TEST_21)
print(f'accuracy: {accuracy_grad_model}')
print(f' confusion matrix:')
print(confusion_matrix(y_test_KDD_TEST_21, pred5_KDD_TEST_21))
print(f' classification report:')
print(classification_report(y_test_KDD_TEST_21, pred5_KDD_TEST_21))
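
The five fit, predict and report cells above follow the same pattern, so they could be driven from one dictionary of models. This is a sketch on synthetic data; `fit_and_score` is a hypothetical helper and the short model names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return {name: accuracy on the held-out split}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores

# Synthetic stand-in for the encoded NSL-KDD features
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.23, random_state=42
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "GB": GradientBoostingClassifier(random_state=42),
}
scores = fit_and_score(models, X_train, X_test, y_train, y_test)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Keeping names and accuracies in one dictionary also means that bar labels and bar heights cannot drift apart in a comparison chart, for example `plt.bar(list(scores), list(scores.values()))`.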

# Compare All ML Models

In [None]:
fig = plt.figure()
bars = ['SVM', 'RF', 'LR', 'KNN', 'GB']
height = [accuracy_svm, accuracy_ram, accuracy_lr, accuracy_knn, accuracy_grad_model]
x_pos = np.arange(len(bars))
plt.bar(x_pos, height,  color=['skyblue', 'palegreen' , 'blue' , 'Olive' , 'green'] )
plt.xticks(x_pos, bars)
# Show graph
plt.title("Comparison of models accuracy")
plt.show()

# Combined Prediction

In [None]:
# combine the predictions into a new dataset with five features (one per base model)
pred1 = lr.predict(x_train_KDD_TEST_21)
pred2 = KNN_model.predict(x_train_KDD_TEST_21)
pred3 = Ran_model.predict(x_train_KDD_TEST_21)
pred4 = Gra_model.predict(x_train_KDD_TEST_21)
pred5 = svc_model.predict(x_train_KDD_TEST_21)

new_X_train = np.vstack((pred1, pred2, pred3, pred4, pred5)).T
new_X_test = np.vstack((lr.predict(x_test_KDD_TEST_21), KNN_model.predict(x_test_KDD_TEST_21), Ran_model.predict(x_test_KDD_TEST_21), Gra_model.predict(x_test_KDD_TEST_21), svc_model.predict(x_test_KDD_TEST_21))).T

# train a decision tree model on the new dataset
dt_model = DecisionTreeClassifier()
dt_model.fit(new_X_train, y_train_KDD_TEST_21)

In [None]:
# make predictions using the decision tree model
dt_pred = dt_model.predict(new_X_test)
# evaluate the performance of the decision tree model
Testing_accuracy = accuracy_score(y_test_KDD_TEST_21, dt_pred)
print(f"Accuracy of the decision tree model: {Testing_accuracy}")
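
In the cells above, the decision tree is trained on predictions the base models make for the same rows they were fitted on, which can make those predictions look more reliable than they are on unseen data. scikit-learn's `StackingClassifier` builds the meta-features from cross-validated predictions instead. The sketch below runs on synthetic data with three base models; `stack_method="predict"` is chosen to mirror the notebook's use of hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded NSL-KDD features
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.23, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=42)),
    ],
    final_estimator=DecisionTreeClassifier(random_state=42),
    stack_method="predict",  # feed hard labels to the meta-model
    cv=5,                    # meta-features come from out-of-fold predictions
)
stack.fit(X_train, y_train)

stacked_accuracy = accuracy_score(y_test, stack.predict(X_test))
print(f"Accuracy of the stacked model: {stacked_accuracy:.3f}")
```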

In [None]:
from sklearn.model_selection import learning_curve

In [None]:
train_sizes, train_scores, test_scores = learning_curve(DecisionTreeClassifier(max_depth=10, random_state=1),new_X_train, y_train_KDD_TEST_21, cv=10, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.01, 1.0, 50))
#calculated the mean and standard deviation of the train and test scores.
train_score_mean = np.mean(train_scores, axis=1)
train_score_std = np.std(train_scores, axis=1)
test_score_mean = np.mean(test_scores, axis=1)
test_score_std = np.std(test_scores, axis=1)
# plot the learning curve graph
plt.subplots(1, figsize=(8,8))
plt.plot(train_sizes, train_score_mean, '--', color="#111111",  label="Training score")
plt.plot(train_sizes, test_score_mean, color="#111111", label="Cross-validation score")

plt.fill_between(train_sizes, train_score_mean - train_score_std, train_score_mean + train_score_std, color="#DDDDDD")
plt.fill_between(train_sizes, test_score_mean - test_score_std, test_score_mean + test_score_std, color="#DDDDDD")
#title of the graph
plt.title("Learning Curve")
# label the axes (x-axis = training set size, y-axis = accuracy score)
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
#To show
plt.show()

# KDD_TRAIN_20

### Correlation

In [None]:
KDD_TRAIN_20_corr=KDD_TRAIN_20.corr()
print(KDD_TRAIN_20_corr)

### HeatMap

In [None]:
# Use the heatmap function from the seaborn package
plt.figure(figsize=(40,55))
sns.heatmap(KDD_TRAIN_20.corr() , annot=True , cmap="YlGnBu")
plt.show()

In [None]:
KDD_TRAIN_20.columns



> Drop the unnecessary columns



In [None]:
x=KDD_TRAIN_20.drop(columns=['outcome' , 'level','is_host_login' , 'num_outbound_cmds'] , axis=1)
y=KDD_TRAIN_20['outcome']

In [None]:
# shape
print(x.shape)
print(y.shape)

In [None]:
# split the x ,y into x_train , y_train , x_test and y_test
x_train , x_test , y_train, y_test=train_test_split(x,y , test_size=0.23 , random_state=42)

In [None]:
# shape
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

# Training Machine Learning Model

# Logistic Regression

In [None]:
# build the logistic regression model
lr = LogisticRegression().fit(x_train, y_train)
pred1 = lr.predict(x_test)

In [None]:
# identify the accuracy , confusion matrix and classification report
accuracy_lr = accuracy_score(y_test, pred1)
print(f'accuracy: {accuracy_lr}')
print(f' confusion matrix:')
print(confusion_matrix(y_test, pred1))
print(f' classification report:')
print(classification_report(y_test, pred1))

# KNN Model

In [None]:
# Define  the KNN model
KNN_model=KNeighborsClassifier()
KNN_model.fit(x_train, y_train)
pred2 = KNN_model.predict(x_test)

In [None]:
# Identify the accuracy, confusion matrix and classification report
accuracy_knn = accuracy_score(y_test, pred2)
print(f'accuracy: {accuracy_knn}')
print(f' confusion matrix:')
print(confusion_matrix(y_test, pred2))
print(f' classification report:')
print(classification_report(y_test, pred2))

# Random Forest

In [None]:
# Build the Random Forest model
Ran_model=RandomForestClassifier()
Ran_model.fit(x_train, y_train)
pred3 = Ran_model.predict(x_test)

In [None]:
# Identify the accuracy, confusion matrix and classification report
accuracy_ram = accuracy_score(y_test, pred3)
print(f'accuracy: {accuracy_ram}')
print(f' confusion matrix:')
print(confusion_matrix(y_test, pred3))
print(f' classification report:')
print(classification_report(y_test, pred3))

# SVM

In [None]:
# Build the SVM model
svc_model=SVC()
svc_model.fit(x_train, y_train)
pred4 = svc_model.predict(x_test)

In [None]:
# identify the accuracy , confusion matrix and classification report
accuracy_svm = accuracy_score(y_test, pred4)
print(f'accuracy: {accuracy_svm}')
print(f' confusion matrix:')
print(confusion_matrix(y_test, pred4))
print(f' classification report:')
print(classification_report(y_test, pred4))

# Gradient Boosting

In [None]:
# Build the Gradient Boosting Classifier model
Gra_model=GradientBoostingClassifier()
Gra_model.fit(x_train, y_train)
pred5 = Gra_model.predict(x_test)

In [None]:
# identify the accuracy , confusion matrix and classification report
accuracy_grad_model = accuracy_score(y_test, pred5)
print(f'accuracy: {accuracy_grad_model}')
print(f' confusion matrix:')
print(confusion_matrix(y_test, pred5))
print(f' classification report:')
print(classification_report(y_test, pred5))

# Compare All ML Models

In [None]:
fig = plt.figure()
bars = ['SVM', 'RF', 'LR', 'KNN', 'GB']
height = [accuracy_svm, accuracy_ram, accuracy_lr, accuracy_knn, accuracy_grad_model]
x_pos = np.arange(len(bars))
plt.bar(x_pos, height,  color=['skyblue', 'palegreen' , 'blue' , 'Olive' , 'green'] )
plt.xticks(x_pos, bars)
# Show graph
plt.title("Comparison of models accuracy")
plt.show()

# Combined Prediction

In [None]:
# combine the predictions into a new dataset with five features (one per base model)
pred1 = lr.predict(x_train)
pred2 = KNN_model.predict(x_train)
pred3 = Ran_model.predict(x_train)
pred4 = Gra_model.predict(x_train)
pred5 = svc_model.predict(x_train)

new_X_train = np.vstack((pred1, pred2, pred3, pred4, pred5)).T
new_X_test = np.vstack((lr.predict(x_test), KNN_model.predict(x_test), Ran_model.predict(x_test), Gra_model.predict(x_test), svc_model.predict(x_test))).T

# train a decision tree model on the new dataset
dt_model = DecisionTreeClassifier()
dt_model.fit(new_X_train, y_train)


In [None]:
# make predictions using the decision tree model
dt_pred = dt_model.predict(new_X_test)

# evaluate the performance of the decision tree model
accuracy = accuracy_score(y_test, dt_pred)
print(f"Accuracy of the decision tree model: {accuracy}")

In [None]:
train_sizes, train_scores, test_scores = learning_curve(DecisionTreeClassifier(max_depth=10, random_state=1),new_X_train, y_train, cv=10, scoring='accuracy', n_jobs=-1, train_sizes=np.linspace(0.01, 1.0, 50))
#calculated the mean and standard deviation of the train and test scores.
train_score_mean = np.mean(train_scores, axis=1)
train_score_std = np.std(train_scores, axis=1)
test_score_mean = np.mean(test_scores, axis=1)
test_score_std = np.std(test_scores, axis=1)
# plot the learning curve graph
plt.subplots(1, figsize=(8,8))
plt.plot(train_sizes, train_score_mean, '--', color="#111111",  label="Training score")
plt.plot(train_sizes, test_score_mean, color="#111111", label="Cross-validation score")

plt.fill_between(train_sizes, train_score_mean - train_score_std, train_score_mean + train_score_std, color="#DDDDDD")
plt.fill_between(train_sizes, test_score_mean - test_score_std, test_score_mean + test_score_std, color="#DDDDDD")
#title of the graph
plt.title("Learning Curve")
# label the axes (x-axis = training set size, y-axis = accuracy score)
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), plt.legend(loc="best")
plt.tight_layout()
#To show
plt.show()

# Conclusion

The NSL-KDD dataset provides separate training and testing sets. Using different machine learning models, I obtained the following accuracies.

**In Training Dataset (KDD_TRAIN_20)**
> Random Forest Model = 0.99 <br>
> Logistic Regression Model = 0.86 <br>
> KNN Model = 0.97 <br>
> Gradient Boosting = 0.99 <br>
> SVM = 0.52 <br>

The predictions of these models were then used as inputs to a Decision Tree model, which reached an accuracy of approximately 99%.

**In Testing Dataset (KDD_TEST_21)**
> Random Forest Model = 0.96 <br>
> Logistic Regression Model = 0.27 <br>
> KNN Model = 0.92 <br>
> Gradient Boosting = 0.96 <br>
> SVM = 0.238 <br>

The predictions of these models were then used as inputs to a Decision Tree model, which reached an accuracy of approximately 96%.










