# Robust Binary Model Training

**Goal:** Train a binary classifier (BENIGN vs ATTACK) using NFStream-extracted features.

## Data Sources:
- **BENIGN:** Monday, Tuesday, Thursday, User traffic
- **ATTACK:** Friday (13:00-16:30 = PortScan + DDoS), Wednesday (9:43-11:24 = DoS)

## Key Principle:
Training and inference use the same extraction method (NFStream), so the model sees identically named and identically computed features at both stages.
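
One way to enforce this at inference time is to reorder incoming columns to the saved training order and fail loudly when a column is missing. The sketch below is illustrative only: `align_features` is a hypothetical helper, not part of NFStream or this project, and the three-column feature list is a stand-in for the full list defined later in this notebook.

```python
import pandas as pd

def align_features(df, feature_names):
    """Return df restricted to feature_names, in exactly that order.

    Hypothetical helper: raises instead of silently filling missing columns,
    so an extraction mismatch is caught before the model sees the data.
    """
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        raise KeyError(f"Missing features at inference time: {missing}")
    return df.loc[:, feature_names]

# Stand-in feature list (the real one has 46 NFStream attributes).
train_features = ["dst_port", "bidirectional_packets", "bidirectional_bytes"]

# Inference frame with columns in a different order plus an extra column.
inference_df = pd.DataFrame({
    "bidirectional_bytes": [1200, 64],
    "src_ip": ["10.0.0.1", "10.0.0.2"],
    "dst_port": [443, 22],
    "bidirectional_packets": [10, 1],
})

aligned = align_features(inference_df, train_features)
print(list(aligned.columns))
```

Raising on a missing column is a design choice made for this sketch; filling with zeros would keep the pipeline running but could hide a mismatch between the two extraction runs.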

In [1]:
# Setup
import pandas as pd
import numpy as np
from pathlib import Path
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Check NFStream
try:
    from nfstream import NFStreamer
    import nfstream
    print(f"✅ NFStream version: {nfstream.__version__}")
except ImportError:
    raise ImportError("NFStream not installed! Run: pip install nfstream")

# Paths
BASE_DIR = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
MODELS_DIR = BASE_DIR / 'models'
DATA_DIR = BASE_DIR / 'data_processed'
USER_TRAFFIC_DIR = BASE_DIR / 'user_traffic'

MODELS_DIR.mkdir(exist_ok=True)
DATA_DIR.mkdir(exist_ok=True)

print(f"Base directory: {BASE_DIR}")
print(f"User traffic: {USER_TRAFFIC_DIR}")

✅ NFStream version: 6.5.4
Base directory: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai
User traffic: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\user_traffic


## PCAP File Locations

Update these paths to match your system:

In [2]:
# CICIDS2017 PCAP Locations - UPDATE THESE!
PCAP_PATHS = {
    'Monday': Path(r'D:\pcap\Monday-WorkingHours.pcap'),
    'Tuesday': Path(r'D:\pcap\Tuesday-WorkingHours.pcap'),
    'Wednesday': Path(r'F:\Wednesday-WorkingHours.pcap'),
    'Thursday': Path(r'F:\Thursday-WorkingHours.pcap'),
    'Friday': BASE_DIR / 'pcap' / 'Friday-WorkingHours.pcap',
}

# Verify files exist
print("CICIDS2017 PCAP Files:")
for day, path in PCAP_PATHS.items():
    if path.exists():
        size_gb = path.stat().st_size / (1024**3)
        print(f"  ✅ {day}: {path} ({size_gb:.2f} GB)")
    else:
        print(f"  ❌ {day}: {path} (NOT FOUND)")

# User traffic
print(f"\nUser Traffic Files:")
user_pcaps = list(USER_TRAFFIC_DIR.glob('*.pcap'))
for pcap in user_pcaps:
    size_mb = pcap.stat().st_size / (1024**2)
    print(f"  ✅ {pcap.name} ({size_mb:.2f} MB)")
print(f"  Total: {len(user_pcaps)} files")

CICIDS2017 PCAP Files:
  ✅ Monday: D:\pcap\Monday-WorkingHours.pcap (10.08 GB)
  ✅ Tuesday: D:\pcap\Tuesday-WorkingHours.pcap (10.29 GB)
  ✅ Wednesday: F:\Wednesday-WorkingHours.pcap (12.50 GB)
  ✅ Thursday: F:\Thursday-WorkingHours.pcap (7.73 GB)
  ✅ Friday: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\pcap\Friday-WorkingHours.pcap (8.23 GB)

User Traffic Files:
  ✅ gaming_browsing_01.pcap (39.86 MB)
  ✅ gaming_browsing_02.pcap (85.65 MB)
  ✅ gaming_browsing_03.pcap (214.45 MB)
  ✅ gaming_browsing_04.pcap (71.24 MB)
  Total: 4 files


## NFStream Feature Extraction

In [4]:
# NFStream attributes to extract (46 features)
NFSTREAM_FEATURES = [
    'dst_port',
    'bidirectional_duration_ms',
    'src2dst_packets', 'dst2src_packets', 'bidirectional_packets',
    'src2dst_bytes', 'dst2src_bytes', 'bidirectional_bytes',
    'src2dst_max_ps', 'src2dst_min_ps', 'src2dst_mean_ps', 'src2dst_stddev_ps',
    'dst2src_max_ps', 'dst2src_min_ps', 'dst2src_mean_ps', 'dst2src_stddev_ps',
    'bidirectional_min_ps', 'bidirectional_max_ps', 'bidirectional_mean_ps', 'bidirectional_stddev_ps',
    'bidirectional_mean_piat_ms', 'bidirectional_stddev_piat_ms', 'bidirectional_max_piat_ms', 'bidirectional_min_piat_ms',
    'src2dst_duration_ms', 'src2dst_mean_piat_ms', 'src2dst_stddev_piat_ms', 'src2dst_max_piat_ms', 'src2dst_min_piat_ms',
    'dst2src_duration_ms', 'dst2src_mean_piat_ms', 'dst2src_stddev_piat_ms', 'dst2src_max_piat_ms', 'dst2src_min_piat_ms',
    'src2dst_psh_packets', 'src2dst_urg_packets', 'src2dst_syn_packets', 'src2dst_fin_packets', 'src2dst_rst_packets', 'src2dst_ack_packets',
    'dst2src_psh_packets', 'dst2src_urg_packets', 'dst2src_syn_packets', 'dst2src_fin_packets', 'dst2src_rst_packets', 'dst2src_ack_packets',
]

print(f"Features to extract: {len(NFSTREAM_FEATURES)}")

Features to extract: 46


In [5]:
def extract_flows(pcap_path, max_flows=250000, label=None, skip_flows=0):
    """
    Extract NFStream features from PCAP.
    
    Args:
        pcap_path: Path to PCAP file
        max_flows: Max flows to extract
        label: Label to assign ('BENIGN' or 'ATTACK')
        skip_flows: Number of initial flows to skip
    """
    print(f"\n{'='*60}")
    print(f"Extracting from: {Path(pcap_path).name}")
    print(f"Label: {label} | Max: {max_flows:,} | Skip: {skip_flows:,}")
    print(f"{'='*60}")
    
    streamer = NFStreamer(
        source=str(pcap_path),
        statistical_analysis=True,
        splt_analysis=0,
        n_dissections=0,
    )
    
    flows_list = []
    start_time = time.time()
    skipped = 0
    
    for i, flow in enumerate(streamer):
        # Skip initial flows if requested
        if skipped < skip_flows:
            skipped += 1
            continue
        
        # Extract features
        flow_dict = {}
        for attr in NFSTREAM_FEATURES:
            try:
                value = getattr(flow, attr, 0)
                flow_dict[attr] = 0 if value is None else value
            except Exception:  # a bare except would also swallow KeyboardInterrupt
                flow_dict[attr] = 0
        
        if label:
            flow_dict['label'] = label
        
        flows_list.append(flow_dict)
        
        if len(flows_list) % 50000 == 0:
            elapsed = time.time() - start_time
            rate = len(flows_list) / elapsed if elapsed > 0 else 0
            print(f"  Extracted {len(flows_list):,} flows... ({rate:.0f}/sec)")
        
        if len(flows_list) >= max_flows:
            break
    
    elapsed = time.time() - start_time
    print(f"✅ Done! {len(flows_list):,} flows in {elapsed:.1f}s")
    
    return pd.DataFrame(flows_list)
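
The skip-then-cap loop in `extract_flows` can also be written with `itertools.islice`, which works on any iterator. The sketch below uses a plain generator as a stand-in for `NFStreamer` (an assumption, so that the example runs without a PCAP file), and `take_window` is a hypothetical name.

```python
from itertools import islice

def take_window(iterable, skip=0, limit=None):
    """Yield at most `limit` items after discarding the first `skip` items."""
    stop = None if limit is None else skip + limit
    return islice(iterable, skip, stop)

def fake_flows(n):
    """Stand-in for an NFStreamer: yields n dict 'flows' in capture order."""
    for i in range(n):
        yield {"flow_index": i}

# Skip the first 250 flows, then keep the next 200.
window = list(take_window(fake_flows(1000), skip=250, limit=200))
print(window[0]["flow_index"], window[-1]["flow_index"], len(window))
```

Skipped items are still produced by the source before being discarded, so skipping does not avoid the cost of parsing the early part of a capture.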

## Step 1: Extract BENIGN Traffic

Sources:
- Monday (all benign)
- Tuesday (97% benign)
- Thursday (98% benign)
- User traffic (100% benign)

In [6]:
# Extract BENIGN from Monday
df_monday = extract_flows(PCAP_PATHS['Monday'], max_flows=200000, label='BENIGN')
print(f"Monday: {len(df_monday):,} flows")


Extracting from: Monday-WorkingHours.pcap
Label: BENIGN | Max: 200,000 | Skip: 0
  Extracted 50,000 flows... (245/sec)
  Extracted 100,000 flows... (465/sec)
  Extracted 150,000 flows... (657/sec)
  Extracted 200,000 flows... (834/sec)
✅ Done! 200,000 flows in 239.7s
Monday: 200,000 flows


In [7]:
# Extract BENIGN from Tuesday
df_tuesday = extract_flows(PCAP_PATHS['Tuesday'], max_flows=150000, label='BENIGN')
print(f"Tuesday: {len(df_tuesday):,} flows")


Extracting from: Tuesday-WorkingHours.pcap
Label: BENIGN | Max: 150,000 | Skip: 0
  Extracted 50,000 flows... (232/sec)
  Extracted 100,000 flows... (431/sec)
  Extracted 150,000 flows... (611/sec)
✅ Done! 150,000 flows in 245.6s
Tuesday: 150,000 flows


In [8]:
# Extract BENIGN from Thursday
df_thursday = extract_flows(PCAP_PATHS['Thursday'], max_flows=150000, label='BENIGN')
print(f"Thursday: {len(df_thursday):,} flows")


Extracting from: Thursday-WorkingHours.pcap
Label: BENIGN | Max: 150,000 | Skip: 0
  Extracted 50,000 flows... (1319/sec)
  Extracted 100,000 flows... (2070/sec)
  Extracted 150,000 flows... (2774/sec)
✅ Done! 150,000 flows in 54.1s
Thursday: 150,000 flows


In [9]:
# Extract BENIGN from User Traffic (ALL files)
user_dfs = []
for pcap in user_pcaps:
    df = extract_flows(pcap, max_flows=500000, label='BENIGN')  # Extract all
    user_dfs.append(df)

df_user = pd.concat(user_dfs, ignore_index=True) if user_dfs else pd.DataFrame()
print(f"\n📊 Total User Traffic: {len(df_user):,} flows")


Extracting from: gaming_browsing_01.pcap
Label: BENIGN | Max: 500,000 | Skip: 0
✅ Done! 959 flows in 2.2s

Extracting from: gaming_browsing_02.pcap
Label: BENIGN | Max: 500,000 | Skip: 0
✅ Done! 343 flows in 3.1s

Extracting from: gaming_browsing_03.pcap
Label: BENIGN | Max: 500,000 | Skip: 0
✅ Done! 1,957 flows in 4.7s

Extracting from: gaming_browsing_04.pcap
Label: BENIGN | Max: 500,000 | Skip: 0
✅ Done! 563 flows in 2.7s

📊 Total User Traffic: 3,822 flows


In [10]:
# Combine all BENIGN
df_benign = pd.concat([df_monday, df_tuesday, df_thursday, df_user], ignore_index=True)
print(f"\n✅ TOTAL BENIGN: {len(df_benign):,} flows")


✅ TOTAL BENIGN: 503,822 flows


## Step 2: Extract ATTACK Traffic

Sources:
- **Friday (after 13:00)**: PortScan (13:00-15:28) + DDoS (15:51-16:21)
- **Wednesday (9:43-11:24)**: DoS attacks

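**Strategy**: Skip early flows to reach the attack time window. Every flow extracted from that window is labeled ATTACK, so benign flows captured during the same period are included under that label.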
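
Because flow position only approximates time, an alternative is to label each flow by its own start timestamp. The sketch below is a hedged illustration, not the method used in this notebook. It assumes that NFStream's `bidirectional_first_seen_ms` attribute (epoch milliseconds) is kept for each flow, that the capture dates are 5 and 7 July 2017, and that the window times listed above are in the capture's local time, taken here as UTC-3. `label_flow`, `ATTACK_WINDOWS`, `local`, and `to_ms` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: capture-local time is UTC-3; adjust if your PCAP timestamps differ.
LOCAL_TZ = timezone(timedelta(hours=-3))

def local(year, month, day, hour, minute):
    """Build a timezone-aware datetime in the assumed capture-local zone."""
    return datetime(year, month, day, hour, minute, tzinfo=LOCAL_TZ)

# Window times copied from the source list above; the dates are assumptions.
ATTACK_WINDOWS = [
    (local(2017, 7, 5, 9, 43), local(2017, 7, 5, 11, 24)),  # Wednesday DoS
    (local(2017, 7, 7, 13, 0), local(2017, 7, 7, 16, 30)),  # Friday PortScan + DDoS
]

def label_flow(first_seen_ms):
    """Return 'ATTACK' if the flow started inside any window, else 'BENIGN'."""
    started = datetime.fromtimestamp(first_seen_ms / 1000, tz=timezone.utc)
    for window_start, window_end in ATTACK_WINDOWS:
        if window_start <= started <= window_end:
            return "ATTACK"
    return "BENIGN"

def to_ms(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)

morning = to_ms(local(2017, 7, 7, 9, 30))
afternoon = to_ms(local(2017, 7, 7, 14, 0))
print(label_flow(morning), label_flow(afternoon))
```

A time window alone still marks benign flows inside it as ATTACK; it only removes the dependence on how many flows precede the window.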

In [11]:
# Friday: Skip first ~250K flows to get to afternoon attacks (PortScan + DDoS)
# The Friday PCAP starts in the morning, attacks begin around 13:00
# Skipping 250K gets us past the morning benign traffic

df_friday_attack = extract_flows(
    PCAP_PATHS['Friday'], 
    max_flows=200000, 
    label='ATTACK',
    skip_flows=250000  # Skip morning benign traffic
)
print(f"Friday ATTACK: {len(df_friday_attack):,} flows")


Extracting from: Friday-WorkingHours.pcap
Label: ATTACK | Max: 200,000 | Skip: 250,000
  Extracted 50,000 flows... (999/sec)
  Extracted 100,000 flows... (1847/sec)
  Extracted 150,000 flows... (2582/sec)
  Extracted 200,000 flows... (3136/sec)
✅ Done! 200,000 flows in 63.8s
Friday ATTACK: 200,000 flows


In [12]:
# Wednesday: DoS attacks are in morning (9:43-11:24)
# This is near the start of the PCAP, so we don't skip
# But we limit extraction to ~150K flows to stay in attack window

df_wednesday_attack = extract_flows(
    PCAP_PATHS['Wednesday'], 
    max_flows=150000, 
    label='ATTACK',
    skip_flows=0  # Attacks are at start
)
print(f"Wednesday ATTACK: {len(df_wednesday_attack):,} flows")


Extracting from: Wednesday-WorkingHours.pcap
Label: ATTACK | Max: 150,000 | Skip: 0
  Extracted 50,000 flows... (796/sec)
  Extracted 100,000 flows... (1152/sec)
  Extracted 150,000 flows... (1608/sec)
✅ Done! 150,000 flows in 93.3s
Wednesday ATTACK: 150,000 flows


In [13]:
# Combine all ATTACK
df_attack = pd.concat([df_friday_attack, df_wednesday_attack], ignore_index=True)
print(f"\n✅ TOTAL ATTACK: {len(df_attack):,} flows")


✅ TOTAL ATTACK: 350,000 flows


## Step 3: Combine & Balance Dataset

In [14]:
# Combine datasets
print("Combining datasets...")
df_combined = pd.concat([df_benign, df_attack], ignore_index=True)

print(f"\nCombined dataset: {len(df_combined):,} flows")
print(f"\nLabel distribution:")
print(df_combined['label'].value_counts())

Combining datasets...

Combined dataset: 853,822 flows

Label distribution:
label
BENIGN    503822
ATTACK    350000
Name: count, dtype: int64


In [15]:
# Balance classes (undersample majority)
benign_count = (df_combined['label'] == 'BENIGN').sum()
attack_count = (df_combined['label'] == 'ATTACK').sum()
min_count = min(benign_count, attack_count)

print(f"Balancing to {min_count:,} samples per class...")

df_benign_sample = df_combined[df_combined['label'] == 'BENIGN'].sample(n=min_count, random_state=42)
df_attack_sample = df_combined[df_combined['label'] == 'ATTACK'].sample(n=min_count, random_state=42)

df_balanced = pd.concat([df_benign_sample, df_attack_sample], ignore_index=True)
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle

print(f"\n✅ Balanced dataset: {len(df_balanced):,} flows")
print(df_balanced['label'].value_counts())

Balancing to 350,000 samples per class...

✅ Balanced dataset: 700,000 flows
label
ATTACK    350000
BENIGN    350000
Name: count, dtype: int64
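
The same undersampling can be expressed in one chain with `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The toy frame below is made up for illustration; only the `label` column name comes from this notebook.

```python
import pandas as pd

# Toy imbalanced frame: 6 BENIGN rows, 4 ATTACK rows.
df = pd.DataFrame({
    "dst_port": [80, 443, 53, 22, 8080, 443, 80, 80, 21, 22],
    "label": ["BENIGN"] * 6 + ["ATTACK"] * 4,
})

# The smallest class sets the per-class sample size.
min_count = int(df["label"].value_counts().min())

balanced = (
    df.groupby("label")
      .sample(n=min_count, random_state=42)  # undersample each class
      .sample(frac=1, random_state=42)       # shuffle rows
      .reset_index(drop=True)
)

print(balanced["label"].value_counts().to_dict())
```

Undersampling discards majority-class rows, so the benign sources end up represented in proportion to how many flows each one contributed before balancing.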


In [16]:
# Save combined dataset
combined_path = DATA_DIR / 'nfstream_robust_binary_training.csv'
df_balanced.to_csv(combined_path, index=False)
print(f"💾 Saved to: {combined_path}")

💾 Saved to: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\data_processed\nfstream_robust_binary_training.csv


## Step 4: Train Random Forest Model

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib

# Prepare features and labels
X = df_balanced[NFSTREAM_FEATURES].copy()
y = df_balanced['label']

# Handle inf/nan
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(0)

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")

Features shape: (700000, 46)
Labels shape: (700000,)
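
As a small illustration of the cleaning step above, the sketch below applies the same replace-then-fill logic to a made-up frame and checks it against `numpy.nan_to_num`. The column names are borrowed from the feature list; the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up values: one inf, one -inf, one NaN.
X_demo = pd.DataFrame({
    "bidirectional_mean_ps": [60.0, np.inf, 1500.0],
    "bidirectional_mean_piat_ms": [np.nan, 2.5, -np.inf],
})

# Same two steps as the notebook cell: inf -> NaN, then NaN -> 0.
cleaned = X_demo.replace([np.inf, -np.inf], np.nan).fillna(0)

# Equivalent one-call NumPy form on the raw array.
cleaned_np = np.nan_to_num(X_demo.to_numpy(), nan=0.0, posinf=0.0, neginf=0.0)

print(cleaned.to_numpy().tolist())
```

Filling with 0 treats an undefined value the same as a true zero; whether that is appropriate depends on the feature.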


In [18]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train):,}")
print(f"Test set: {len(X_test):,}")

Training set: 560,000
Test set: 140,000


In [19]:
# Train model
print("Training Random Forest...")
print("="*60)

model = RandomForestClassifier(
    n_estimators=150,
    max_depth=30,
    min_samples_split=5,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

start = time.time()
model.fit(X_train, y_train)
print(f"\n✅ Training done in {time.time()-start:.1f}s")

Training Random Forest...


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   16.4s



✅ Training done in 76.8s


[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  1.3min finished


## Step 5: Evaluate Model

In [20]:
# Predictions
y_pred = model.predict(X_test)

# Metrics
accuracy = accuracy_score(y_test, y_pred)

print("="*60)
print("MODEL PERFORMANCE")
print("="*60)
print(f"\nAccuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.1s
[Parallel(n_jobs=12)]: Done 150 out of 150 | elapsed:    0.9s finished


MODEL PERFORMANCE

Accuracy: 0.7710 (77.10%)

Classification Report:
              precision    recall  f1-score   support

      ATTACK       0.90      0.61      0.73     70000
      BENIGN       0.70      0.94      0.80     70000

    accuracy                           0.77    140000
   macro avg       0.80      0.77      0.76    140000
weighted avg       0.80      0.77      0.76    140000
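
The report shows high ATTACK precision (0.90) but low ATTACK recall (0.61). One common lever is to lower the decision threshold on the ATTACK probability instead of relying on `predict`. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so its numbers say nothing about this notebook's model), and the 0.3 threshold is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: class 1 plays the role of ATTACK.
X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.1, random_state=42)
y = np.where(y == 1, "ATTACK", "BENIGN")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Column order of predict_proba follows clf.classes_ (sorted labels).
attack_col = list(clf.classes_).index("ATTACK")
attack_proba = clf.predict_proba(X_test)[:, attack_col]

def recall_at(threshold):
    """ATTACK recall when flagging every flow with P(ATTACK) >= threshold."""
    pred = np.where(attack_proba >= threshold, "ATTACK", "BENIGN")
    return recall_score(y_test, pred, pos_label="ATTACK")

print(f"recall @0.5: {recall_at(0.5):.3f}")
print(f"recall @0.3: {recall_at(0.3):.3f}")
```

Lowering the threshold trades false negatives for false positives, so the false positive rate should be re-checked at whatever value is chosen.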



In [21]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred, labels=['BENIGN', 'ATTACK'])
tn, fp, fn, tp = cm[0,0], cm[0,1], cm[1,0], cm[1,1]

print("Confusion Matrix:")
print(f"                 Predicted")
print(f"                 BENIGN  ATTACK")
print(f"Actual BENIGN    {tn:>6}  {fp:>6}")
print(f"       ATTACK    {fn:>6}  {tp:>6}")

# Key metrics
fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
fnr = fn / (fn + tp) if (fn + tp) > 0 else 0

print(f"\n" + "="*50)
print(f"FALSE POSITIVE RATE: {fpr:.4f} ({fpr*100:.2f}%)")
print(f"FALSE NEGATIVE RATE: {fnr:.4f} ({fnr*100:.2f}%)")
print("="*50)

if fpr < 0.05:
    print("\n✅ FPR < 5% - Excellent!")
elif fpr < 0.10:
    print("\n⚠️ FPR 5-10% - Good")
else:
    print("\n❌ FPR > 10% - Needs improvement")

Confusion Matrix:
                 Predicted
                 BENIGN  ATTACK
Actual BENIGN     65531    4469
       ATTACK     27587   42413

FALSE POSITIVE RATE: 0.0638 (6.38%)
FALSE NEGATIVE RATE: 0.3941 (39.41%)

⚠️ FPR 5-10% - Good
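
The rates above follow directly from the four confusion-matrix cells. As a worked check, the sketch below recomputes them from the printed counts, treating ATTACK as the positive class.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 65531, 4469    # actual BENIGN row
fn, tp = 27587, 42413   # actual ATTACK row

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
fpr = fp / (fp + tn)              # benign flows flagged as attacks
fnr = fn / (fn + tp)              # attack-labeled flows missed
attack_recall = tp / (tp + fn)    # equals 1 - FNR
attack_precision = tp / (tp + fp)

print(f"accuracy={accuracy:.4f} fpr={fpr:.4f} fnr={fnr:.4f}")
print(f"attack recall={attack_recall:.4f} precision={attack_precision:.4f}")
```

An FNR of 39.41% means roughly two in five attack-labeled flows are classified as benign, which the FPR-only check above does not reflect.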


## Step 6: Save Model

In [22]:
# Save model
model_name = 'nfstream_robust_binary'

model_path = MODELS_DIR / f'random_forest_{model_name}.joblib'
joblib.dump(model, model_path)
print(f"✅ Model saved: {model_path}")

# Save features
features_path = MODELS_DIR / f'feature_names_{model_name}.joblib'
joblib.dump(NFSTREAM_FEATURES, features_path)
print(f"✅ Features saved: {features_path}")

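# Save classes in the model's own order (classes_ is sorted: ATTACK, BENIGN),
# so the list lines up with predict_proba columns
classes_path = MODELS_DIR / f'class_names_{model_name}.joblib'
joblib.dump(model.classes_.tolist(), classes_path)
print(f"✅ Classes saved: {classes_path}")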

# Save info
info_path = MODELS_DIR / f'random_forest_{model_name}_info.txt'
with open(info_path, 'w') as f:
    f.write(f"Model: NFStream Robust Binary\n")
    f.write(f"Trained: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
    f.write(f"Accuracy: {accuracy:.4f}\n")
    f.write(f"FPR: {fpr:.4f}\n")
    f.write(f"Training data: CICIDS2017 + User traffic\n")
print(f"✅ Info saved: {info_path}")

✅ Model saved: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\models\random_forest_nfstream_robust_binary.joblib
✅ Features saved: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\models\feature_names_nfstream_robust_binary.joblib
✅ Classes saved: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\models\class_names_nfstream_robust_binary.joblib
✅ Info saved: c:\Users\Ghulam Mohayudin\Projects\Others\usod\network-based-ai\models\random_forest_nfstream_robust_binary_info.txt
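
When saved artifacts like these are loaded for inference, the order of `predict_proba` columns is given by the fitted model's `classes_` attribute, which scikit-learn stores in sorted order. The round-trip sketch below uses a tiny made-up model and a temporary directory (both assumptions for illustration) to show the check.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set with string labels, as in this notebook.
X = np.array([[80, 10], [443, 12], [22, 900], [23, 950]])
y = np.array(["BENIGN", "BENIGN", "ATTACK", "ATTACK"])
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_model.joblib"
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# classes_ is sorted, so ATTACK comes before BENIGN.
class_order = loaded.classes_.tolist()
print(class_order)

# The reloaded model reproduces the original predictions.
same = bool((loaded.predict(X) == model.predict(X)).all())
print(same)
```

Reading the class order from the loaded model, instead of from a separately maintained list, avoids a silent mismatch if the two ever disagree.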


## Step 7: Quick Test on Friday PCAP

In [23]:
# Test on held-out Friday data
print("Testing on held-out Friday PCAP data...")

# Extract flows we haven't used (skip first 450K)
df_test = extract_flows(
    PCAP_PATHS['Friday'],
    max_flows=50000,
    skip_flows=450000
)

X_friday = df_test[NFSTREAM_FEATURES].replace([np.inf, -np.inf], np.nan).fillna(0)
predictions = model.predict(X_friday)

# Results
print("\n" + "="*60)
print("HELD-OUT FRIDAY TEST RESULTS")
print("="*60)
attack_count = (predictions == 'ATTACK').sum()
benign_count = (predictions == 'BENIGN').sum()
print(f"ATTACK: {attack_count:,} ({attack_count/len(predictions)*100:.1f}%)")
print(f"BENIGN: {benign_count:,} ({benign_count/len(predictions)*100:.1f}%)")

if attack_count > benign_count:
    print("\n✅ Model correctly detects attacks in Friday PCAP!")
else:
    print("\n⚠️ Model may need tuning")

Testing on held-out Friday PCAP data...

Extracting from: Friday-WorkingHours.pcap
Label: None | Max: 50,000 | Skip: 450,000
  Extracted 50,000 flows... (740/sec)
✅ Done! 50,000 flows in 67.6s


[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  26 tasks      | elapsed:    0.0s



HELD-OUT FRIDAY TEST RESULTS
ATTACK: 31,010 (62.0%)
BENIGN: 18,990 (38.0%)

✅ Model correctly detects attacks in Friday PCAP!


[Parallel(n_jobs=12)]: Done 150 out of 150 | elapsed:    0.2s finished


## Done!

Model saved to `models/random_forest_nfstream_robust_binary.joblib`

To use:
```python
from src.analyzer import NetworkThreatAnalyzer
analyzer = NetworkThreatAnalyzer()
results = analyzer.analyze_pcap('your_file.pcap', model_type='nfstream')
```