### **CCT College Dublin**
### **Lecturer Name:** David McQuaid
### **Module Title:** Machine Learning for AI 
### **Student Full Name:** Jefferson de Oliveira Lima
### **Date of Submission:** 22/04/2024
### **GitHub:** https://github.com/JeffOlima/JeffersonO.Lima_ML_CA1.git

#### The code below loads the dataset and prints its first few rows and its shape, giving a first view of its structure and dimensions. It then counts the missing values in each column. Summary statistics for the numerical features are produced with describe(), and the frequency distribution of the categorical feature "Action" is shown with value_counts(). Together, these checks give an early picture of the dataset's completeness, structure, and main characteristics, which guides the later stages of model development and analysis.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the dataset
df = pd.read_csv("./log2.csv")

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Display the number of rows and columns in the dataset
print("\nNumber of rows and columns in the dataset:")
print(df.shape)

# Check for missing values
print("\nMissing values in the dataset:")
print(df.isnull().sum())

# Summary statistics for numerical features
print("\nSummary statistics for numerical features:")
print(df.describe())

# Frequency of categorical feature "Action"
print("\nFrequency of the 'Action' column:")
print(df['Action'].value_counts())

First few rows of the dataset:
   Source Port  Destination Port  NAT Source Port  NAT Destination Port  \
0        57222                53            54587                    53   
1        56258              3389            56258                  3389   
2         6881             50321            43265                 50321   
3        50553              3389            50553                  3389   
4        50002               443            45848                   443   

   Bytes  Bytes Sent  Bytes Received  Packets  Elapsed Time (sec)  pkts_sent  \
0    177          94              83        2                  30          1   
1   4768        1600            3168       19                  17         10   
2    238         118             120        2                1199          1   
3   3327        1438            1889       15                  17          8   
4  25358        6778           18580       31                  16         13   

   pkts_received Action  
0          

#### The following code prepares the network traffic data for classification. It first one-hot encodes the categorical feature "Action", producing one binary column per category. It then standardises the numerical features with StandardScaler so that they share a common scale, which can improve the performance of many classification methods. After this preparation, the dataset is split 80-20 into training and testing sets. The output shows the shapes of the resulting sets and confirms that the split succeeded: the training set has 52,425 samples with 11 features and the testing set has 13,107 samples with 11 features. Likewise, the target variable has shape (52,425, 4) for the training set and (13,107, 4) for the testing set, with one column per one-hot encoded category. The resulting datasets can now be used for model training and evaluation.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical target "Action" into one binary column per class
df_encoded = pd.get_dummies(df, columns=['Action'])

# Separate the 11 numerical features (X) from the one-hot encoded target columns (y)
action_columns = ['Action_allow', 'Action_deny', 'Action_drop', 'Action_reset-both']
X = df_encoded.drop(columns=action_columns)
y = df_encoded[action_columns]

# Standardise the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (52425, 11)
Shape of X_test: (13107, 11)
Shape of y_train: (52425, 4)
Shape of y_test: (13107, 4)
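
In the cell above, StandardScaler is fitted on all rows before train_test_split is called, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data; the array X_demo, the labels y_demo, and the name demo_scaler are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 11 numerical features (assumption: not the real log2.csv data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(1000, 11))
y_demo = rng.integers(0, 4, size=1000)

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("Training shape:", X_tr_scaled.shape)
print("Testing shape:", X_te_scaled.shape)
print("Training means are ~0:", np.allclose(X_tr_scaled.mean(axis=0), 0.0))
```

With this ordering the training columns have zero mean by construction, while the test columns are only approximately centred, which is the expected behaviour for unseen data.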


#### The following code implements two classification algorithms, Logistic Regression and K-Nearest Neighbors (KNN), to predict the class attribute from the input features. The training and testing sets created in the previous cell are passed through a second StandardScaler, fitted on the training set only. Both models are then trained on the scaled training set; MultiOutputClassifier wraps Logistic Regression so that it can handle the four one-hot target columns, while KNN accepts them directly. Each model is evaluated with its accuracy and a classification report giving precision, recall, and F1-score for every class. Both models attain high accuracy, and KNN (0.997) matches or exceeds Logistic Regression (0.965) on every per-class metric shown, which makes it the better option for this classification task. Neither model, however, identifies any of the 6 test samples of class 3 (reset-both): recall is 0.00 for both, and the reported precision of 1.00 for that class is a consequence of zero_division=1 rather than of correct predictions.

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# X_train and X_test were already standardised in the previous cell;
# this second scaler is fitted on the training rows only and applied to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression: MultiOutputClassifier fits one binary model per one-hot target column
logistic_regression = LogisticRegression(max_iter=1000, random_state=42)
multi_target_logistic_regression = MultiOutputClassifier(logistic_regression)
multi_target_logistic_regression.fit(X_train_scaled, y_train)

# With a multi-column target, accuracy_score counts a row as correct only if all four columns match
y_pred = multi_target_logistic_regression.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Logistic Regression:", accuracy)

# zero_division=1 reports 1.0 where a metric is undefined (e.g. a class that is never predicted)
print("Classification Report for Logistic Regression:")
print(classification_report(y_test, y_pred, zero_division=1))

# K-Nearest Neighbors (default n_neighbors=5) accepts the multi-column target directly
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train_scaled, y_train)

y_pred_knn = knn_classifier.predict(X_test_scaled)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print("Accuracy of KNN Classifier:", accuracy_knn)

print("Classification Report for KNN Classifier:")
print(classification_report(y_test, y_pred_knn, zero_division=1))


Accuracy of Logistic Regression: 0.9654383154039826
Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      7545
           1       0.99      0.88      0.93      2994
           2       0.94      1.00      0.97      2562
           3       1.00      0.00      0.00         6

   micro avg       0.98      0.97      0.98     13107
   macro avg       0.98      0.72      0.72     13107
weighted avg       0.98      0.97      0.98     13107
 samples avg       0.99      0.97      0.97     13107

Accuracy of KNN Classifier: 0.9971770809491112
Classification Report for KNN Classifier:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7545
           1       0.99      0.99      0.99      2994
           2       1.00      1.00      1.00      2562
           3       1.00      0.00      0.00         6

   micro avg       1.00      1.00      1.00     131
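
Because each row has exactly one action, the task can also be framed as ordinary multiclass classification with a single label column instead of four independent binary targets. The sketch below shows that framing on synthetic data generated with make_classification; the data, the variable names, and the choice of class_sep=2.0 are assumptions for illustration, and the scores it prints say nothing about the firewall dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem with 11 features (assumption: stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=2000,
    n_features=11,
    n_informative=6,
    n_classes=4,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=42,
)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training rows and reuses it at prediction time
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

# Exactly one class is predicted per row
y_hat = clf.predict(X_te)
print(classification_report(y_te, y_hat, zero_division=0))
```

With a single label column, every row receives exactly one predicted class, so the report contains no "samples avg" row and accuracy has its usual per-row meaning.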

#### The following code is intended to compare the accuracy of the trained KNN classifier on the test and training datasets, using the accuracy_score function from sklearn.metrics. However, accuracy_knn was computed in the previous cell from y_test and y_pred_knn, so both printed values are the test-set accuracy (0.9972) and are identical by construction. The output therefore confirms the high accuracy on the test set but does not show how the model performs on the training data. A genuine comparison requires predicting on X_train_scaled and scoring those predictions against y_train; only then would a small gap between the two accuracies indicate that the model generalises to new data rather than overfitting.

In [4]:
from sklearn.metrics import accuracy_score

# Accuracy of the KNN predictions on the held-out test set
test_accuracy = accuracy_score(y_test, y_pred_knn)
print("Accuracy on Test Set:", test_accuracy)

# Note: accuracy_knn was also computed from y_test in the previous cell,
# so this line prints the test accuracy again, not a training accuracy
print("Accuracy on Training Set (KNN Classifier):", accuracy_knn)

if test_accuracy > accuracy_knn:
    print("The model performs better on the test set compared to the training set.")
elif test_accuracy < accuracy_knn:
    print("The model performs better on the training set compared to the test set.")
else:
    print("The model has similar performance on both the training and test sets.")

Accuracy on Test Set: 0.9971770809491112
Accuracy on Training Set (KNN Classifier): 0.9971770809491112
The model has similar performance on both the training and test sets.
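
A training-versus-test comparison needs a separate set of predictions on the training rows. The sketch below shows one way to compute both accuracies for a KNN classifier; it uses synthetic data from make_classification, and the names knn_demo, train_acc, and test_acc are illustrative assumptions rather than variables from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem with 11 features (assumption: stand-in for the real data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

knn_demo = KNeighborsClassifier()
knn_demo.fit(X_tr, y_tr)

# Score the same fitted model on the rows it has seen and on the rows it has not
train_acc = accuracy_score(y_tr, knn_demo.predict(X_tr))
test_acc = accuracy_score(y_te, knn_demo.predict(X_te))

print("Accuracy on training set:", train_acc)
print("Accuracy on test set:", test_acc)
print("Gap (train - test):", train_acc - test_acc)
```

A large positive gap would suggest overfitting, while a small gap would support the claim that the model generalises.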


#### The following code predicts the action for a new observation using the trained K-Nearest Neighbors (KNN) classifier. X_new holds a single observation with one value for each of the 11 features, and the predict method of knn_classifier returns one boolean per one-hot encoded class. The output [[ True False False False]] means the model assigns the observation to the first class, "allow". These values are class labels, not probabilities, so they do not express confidence by themselves; predict_proba would be needed for that. Note also that X_new is passed in raw units while the model was trained on standardised features, so the observation should go through the same scaling steps before the prediction is relied upon.

In [5]:
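# One new observation, in the column order of X: Source Port, Destination Port,
# NAT Source Port, NAT Destination Port, Bytes, Bytes Sent, Bytes Received,
# Packets, Elapsed Time (sec), pkts_sent, pkts_received
# Note: these are raw values; the model was trained on standardised features
X_new = [[1000, 80, 40000, 80, 1000, 500, 500, 10, 60, 5, 5]]

predicted_action = knn_classifier.predict(X_new)

print("Predicted Action:", predicted_action)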


Predicted Action: [[ True False False False]]
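
One way to make sure a new observation receives the same scaling as the training data is to bundle the scaler and the classifier in a scikit-learn Pipeline, which also exposes predict_proba for an estimate of confidence. The sketch below is a minimal illustration on synthetic data with two well-separated groups; the three-feature data, the labels, and the name model are assumptions for illustration, not the notebook's own objects.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic groups (assumption: stand-in for real traffic rows)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 3)),
    rng.normal(50.0, 1.0, size=(100, 3)),
])
y_demo = np.array(["allow"] * 100 + ["deny"] * 100)

# The pipeline scales raw inputs with the fitted scaler before KNN sees them
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_demo, y_demo)

# A raw, unscaled observation close to the second group
x_new = np.array([[50.0, 50.0, 50.0]])

label = model.predict(x_new)
proba = model.predict_proba(x_new)

print("Predicted label:", label[0])
print("Class probabilities:", dict(zip(model.classes_, proba[0])))
```

For KNN, predict_proba returns the share of the nearest neighbours that belong to each class, so with n_neighbors=5 the probabilities are multiples of 0.2.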
