 # Network Intrusion Detection with Logistic Regression

  ## Project Overview
  This notebook implements a basic logistic regression model to detect network intrusions
  using the UNSW-NB15 dataset. We'll build a binary classifier to distinguish between normal
  network traffic and attack traffic.

  ## Objectives
  - Load and explore the raw UNSW-NB15 dataset
  - Preprocess the data (handle categorical features, scaling, etc.)
  - Train a logistic regression model
  - Evaluate model performance on network intrusion detection

  ## Dataset
  - **Training set**: UNSW_NB15_training-set.csv (~175k samples)
  - **Test set**: UNSW_NB15_testing-set.csv (~82k samples)
  - **Task**: Binary classification (Normal vs Attack)
  - **Features**: 42 network flow features, including protocol, packet-count, and timing features
  - **Preprocessing needed**: Categorical encoding, feature scaling, data cleaning
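
The train-and-evaluate objectives above can be sketched end to end before touching the real data. This is a minimal sketch on synthetic data from `make_classification`, not the UNSW-NB15 features: the sample size, feature count, `weights`, `max_iter=1000`, and the 70/30 split are all assumptions chosen for illustration, and the resulting scores say nothing about performance on the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flow features (NOT UNSW-NB15 data).
# weights roughly mimics a minority "normal" class (0) and majority "attack" class (1).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.32, 0.68], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then reuse it on the test split
# so that no test-set statistics leak into training.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))
toy_f1 = f1_score(y_test, y_pred)
print(classification_report(y_test, y_pred, target_names=["Normal", "Attack"]))
```

With the real data, the separate training and test CSVs already provide the split, so `train_test_split` would only be needed to carve out a validation set.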

## Import Libraries

In [7]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, confusion_matrix

print("Libraries imported successfully!")

Libraries imported successfully!


## Load Data

In [8]:
train_df = pd.read_csv('../data/UNSW_NB15_training-set.csv')
test_df = pd.read_csv('../data/UNSW_NB15_testing-set.csv')

print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print("Data loaded successfully!")

Training set shape: (175341, 45)
Test set shape: (82332, 45)
Data loaded successfully!
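
Both frames report 45 columns, but matching shapes do not guarantee matching column names or dtypes. A minimal sketch of a schema check follows; `schema_mismatches` is a hypothetical helper (not part of pandas), and the toy frames stand in for `train_df` and `test_df`.

```python
import pandas as pd

def schema_mismatches(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Return columns present in only one frame and shared columns whose dtypes differ."""
    only_train = sorted(set(train.columns) - set(test.columns))
    only_test = sorted(set(test.columns) - set(train.columns))
    shared = [c for c in train.columns if c in test.columns]
    dtype_diff = {
        c: (str(train[c].dtype), str(test[c].dtype))
        for c in shared
        if train[c].dtype != test[c].dtype
    }
    return {"only_in_train": only_train, "only_in_test": only_test, "dtype_diff": dtype_diff}

# Toy frames for illustration only; with the real data, pass train_df and test_df.
toy_train = pd.DataFrame({"dur": [0.1, 0.2], "proto": ["tcp", "udp"], "label": [0, 1]})
toy_test = pd.DataFrame({"dur": [0.3], "proto": ["tcp"], "label": [1]})

report = schema_mismatches(toy_train, toy_test)
print(report)
```

An empty report on the real frames would confirm that the same preprocessing can be applied to both files.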


In [10]:
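# Only the last expression in a cell is rendered, so the output below shows dtypes only.
# Use display(train_df.head()) to see the first rows as well.
train_df.head()
train_df.dtypes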

id                     int64
dur                  float64
proto                 object
service               object
state                 object
spkts                  int64
dpkts                  int64
sbytes                 int64
dbytes                 int64
rate                 float64
sttl                   int64
dttl                   int64
sload                float64
dload                float64
sloss                  int64
dloss                  int64
sinpkt               float64
dinpkt               float64
sjit                 float64
djit                 float64
swin                   int64
stcpb                  int64
dtcpb                  int64
dwin                   int64
tcprtt               float64
synack               float64
ackdat               float64
smean                  int64
dmean                  int64
trans_depth            int64
response_body_len      int64
ct_srv_src             int64
ct_state_ttl           int64
ct_dst_ltm             int64
ct_src_dport_l

### Explore target variables

In [14]:
print("Target Variable Analysis:")
print("\nBinary labels (label column):")
print(train_df['label'].value_counts())
print(f"Attack percentage: {(train_df['label'].sum() / len(train_df)) * 100:.1f}%")

print("\nAttack categories (attack_cat column):")
print(train_df['attack_cat'].value_counts())

Target Variable Analysis:

Binary labels (label column):
label
1    119341
0     56000
Name: count, dtype: int64
Attack percentage: 68.1%

Attack categories (attack_cat column):
attack_cat
Normal            56000
Generic           40000
Exploits          33393
Fuzzers           18184
DoS               12264
Reconnaissance    10491
Analysis           2000
Backdoor           1746
Shellcode          1133
Worms               130
Name: count, dtype: int64
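
With 68.1% of training rows labelled as attacks, accuracy needs a reference point. The arithmetic below is a small sketch using the counts printed above: it computes the accuracy of a trivial classifier that always predicts "attack". The variable names are illustrative, and the figure applies to the training set only, since the test-set class balance has not been shown.

```python
# Class counts copied from the value_counts() output above.
n_attack, n_normal = 119_341, 56_000
n_total = n_attack + n_normal

# A classifier that always predicts "attack" is correct on every attack row
# and wrong on every normal row.
baseline_accuracy = n_attack / n_total

# It never predicts "normal", so its recall on the Normal class is zero,
# which accuracy alone would hide.
baseline_normal_recall = 0 / n_normal

print(f"Always-attack accuracy: {baseline_accuracy:.3f}")  # prints 0.681
print(f"Always-attack recall on Normal: {baseline_normal_recall:.3f}")  # prints 0.000
```

Any trained model should be compared against this baseline, and per-class precision and recall are more informative here than accuracy alone. If the imbalance proves to matter, `LogisticRegression(class_weight="balanced")` is one option to try.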


In [19]:
# Check for missing values

print(f"Training set missing values: {train_df.isnull().sum().sum()}")
print(f"Test set missing values: {test_df.isnull().sum().sum()}")

Training set missing values: 0
Test set missing values: 0
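
A zero from `isnull()` only rules out NaN values. It does not catch infinite values or placeholder strings standing in for "unknown". The sketch below shows how such cases could be counted on a toy frame; the values, including the `"-"` placeholder, are hypothetical, and whether the UNSW-NB15 files contain anything like them is not established here.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (NOT UNSW-NB15 data).
toy = pd.DataFrame({
    "rate": [1.5, np.inf, 3.0],
    "service": ["http", "-", "dns"],  # "-" is a hypothetical placeholder value
})

# NaN check, as in the cell above: finds nothing in this frame.
nan_count = int(toy.isnull().sum().sum())

# Infinite values are not NaN, so they need a separate check on numeric columns.
numeric = toy.select_dtypes(include=[np.number])
inf_count = int(np.isinf(numeric.to_numpy()).sum())

# Placeholder strings must be counted per column against the suspected marker.
placeholder_count = int((toy["service"] == "-").sum())

print(nan_count, inf_count, placeholder_count)
```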


In [23]:
numeric_cols = train_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = train_df.select_dtypes(include=['object']).columns.tolist()

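# Remove target variables. Note that 'id' stays in numeric_cols and should be
# excluded before modelling if it is only a row index.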
if 'label' in numeric_cols:
    numeric_cols.remove('label')
if 'attack_cat' in categorical_cols:
    categorical_cols.remove('attack_cat')

In [24]:
print(f"Numeric features ({len(numeric_cols)}): {numeric_cols}")
print(f"\nCategorical features ({len(categorical_cols)}): {categorical_cols}")


Numeric features (40): ['id', 'dur', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth', 'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm', 'ct_srv_dst', 'is_sm_ips_ports']

Categorical features (3): ['proto', 'service', 'state']
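
The two feature lists map naturally onto separate preprocessing branches. The following is a minimal sketch of one way to do that with scikit-learn's `ColumnTransformer`, assuming one-hot encoding for the categorical columns and standard scaling for the numeric ones; it is not necessarily the approach this notebook takes later. The toy frames reuse a few real column names, but the values are made up, and `id` is left out of the numeric list on the assumption that it is a row index.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames for illustration only (NOT UNSW-NB15 data).
toy_train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "dur": [0.1, 0.5, 0.2, 0.9],
    "sbytes": [100, 2000, 300, 50],
    "proto": ["tcp", "udp", "tcp", "arp"],
    "label": [0, 1, 1, 0],
})
toy_test = pd.DataFrame({
    "id": [1, 2],
    "dur": [0.3, 0.4],
    "sbytes": [150, 900],
    "proto": ["udp", "icmp"],  # "icmp" never appears in toy_train
    "label": [1, 0],
})

numeric_cols = ["dur", "sbytes"]  # 'id' deliberately left out
categorical_cols = ["proto"]

# Columns not listed (here 'id' and 'label') are dropped by default.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros
    # instead of raising an error at transform time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_train = preprocess.fit_transform(toy_train)  # fit on training data only
X_test = preprocess.transform(toy_test)        # reuse the fitted statistics

print(X_train.shape, X_test.shape)
```

One-hot encoding is chosen here over `LabelEncoder` because integer codes would impose an arbitrary ordering on protocols that a linear model would treat as meaningful.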
