# K-Nearest Neighbors Training on CICIDS2017 Dataset

This notebook trains a KNN classifier on the CICIDS2017 intrusion detection dataset.

**Key Features:**
- Feature scaling (critical for KNN performance)
- SMOTE balancing applied within CV pipeline
- Optimal k-value selection
- Hyperparameter tuning
- Comprehensive evaluation

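**Note:** KNN can be slow on large datasets because every prediction compares the query against the stored training samples. This notebook works on a sample (see Section 4); larger samples take proportionally longer.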

## 1. Setup and Imports

In [None]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# Add project root to path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..')))

from CICIDS2017.preprocessing.dataset import CICIDS2017
from scripts.models.model_utils import (
    evaluate_model,
    check_data_leakage,
    remove_low_variance_features
)

# Import model-specific modules
from scripts.models.knn.knn import create_knn_pipeline, train_knn, find_optimal_k

from scripts.logger import LoggerManager

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")

## 2. Initialize Logger

In [None]:
logger = LoggerManager(log_name="knn_notebook").get_logger()
logger.info("Starting KNN training notebook")

## 3. Load and Preprocess Data

In [None]:
# Load dataset
logger.info("Loading CICIDS2017 dataset...")
dataset = CICIDS2017(logger=logger)
dataset.encode().optimize_memory()
data = dataset.data

print(f"Dataset shape: {data.shape}")
data.head()

## 4. Sample Data

**Important:** KNN is computationally expensive. We'll use a smaller sample (50k) for faster training.
Increase this if you have time and resources.
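
A uniform random sample can leave rare attack classes with very few rows. One possible refinement, not used in this notebook, is to sample each class at the same rate. The sketch below assumes pandas 1.1 or newer; `stratified_sample` is a hypothetical helper and the toy frame is synthetic.

```python
import numpy as np
import pandas as pd

def stratified_sample(df, label_col, n, random_state=0):
    """Draw roughly n rows while keeping each class's share of the data."""
    frac = min(1.0, n / len(df))
    # Sample the same fraction from every class instead of from the whole frame.
    return df.groupby(label_col, group_keys=False).sample(
        frac=frac, random_state=random_state
    )

# Toy, imbalanced frame standing in for the real dataset (synthetic values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Attack Type": ["BENIGN"] * 900 + ["DoS"] * 80 + ["Bot"] * 20,
})

toy_sample = stratified_sample(toy, "Attack Type", n=100)
print(toy_sample["Attack Type"].value_counts())
```

On the real data this would be called as `stratified_sample(data, 'Attack Type', SAMPLE_SIZE)` in place of `data.sample(...)`. Classes that are very small may still round down to zero rows.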

In [None]:
# Sample size - adjust based on your resources
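# With brute-force search, each KNN prediction costs O(n * d) for n training rows and d features,
# so a smaller sample speeds up cross-validation and evaluation (fitting itself is cheap)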
SAMPLE_SIZE = 50000  # Start small, increase if needed

logger.info(f"Sampling {SAMPLE_SIZE} rows from dataset...")
data_sample = data.sample(n=min(SAMPLE_SIZE, len(data)), random_state=0)

print(f"Sampled data shape: {data_sample.shape}")
print(f"\n⚠️ KNN training time scales with sample size.")
print(f"   Current sample: {SAMPLE_SIZE} rows")
print(f"   Estimated training time: ~{SAMPLE_SIZE/10000:.1f}-{SAMPLE_SIZE/5000:.1f} minutes")

## 5. Prepare Features and Labels

In [None]:
# Split features and labels
X = data_sample.drop('Attack Type', axis=1)
y = data_sample['Attack Type']

# Remove known leakage features
leakage_features = ['Attack Number']
existing_leakage = [f for f in leakage_features if f in X.columns]

if existing_leakage:
    logger.warning(f"üö® REMOVING LEAKAGE FEATURES: {existing_leakage}")
    X = X.drop(columns=existing_leakage)

# Convert to numeric
X = X.apply(pd.to_numeric, errors='coerce')

# Handle missing values
if X.isnull().sum().sum() > 0:
    n_missing = X.isnull().sum().sum()
    logger.info(f"Filling {n_missing} missing values with 0")
    X = X.fillna(0)

# Remove low variance features
X, removed_features = remove_low_variance_features(X, threshold=0.01, logger=logger)

print(f"Feature matrix shape: {X.shape}")
print(f"\nClass distribution:")
print(y.value_counts())

## 6. Data Leakage Check

In [None]:
diagnostics = check_data_leakage(X, y, logger=logger)

## 7. Train/Test Split

In [None]:
# Remove classes with fewer than 2 samples
class_counts = y.value_counts()
valid_classes = class_counts[class_counts >= 2].index
X = X[y.isin(valid_classes)]
y = y[y.isin(valid_classes)]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=0,
    stratify=y
)

print(f"Train set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

## 8. Find Optimal k Value

Let's test different k values to find the optimal one.
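
The internals of `find_optimal_k` are not shown in this notebook. A minimal sketch of the same idea with plain scikit-learn, assuming the selection criterion is mean cross-validated accuracy, could look like the following. The data is synthetic, and the scaler sits inside the pipeline so that it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

k_values = list(range(3, 21, 2))  # odd k from 3 to 19
mean_scores = []
for k in k_values:
    # Scaling happens inside each CV fold, so validation rows never
    # influence the scaler statistics.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores.append(scores.mean())

# Pick the k with the highest mean CV accuracy (first one on ties).
best_k = k_values[mean_scores.index(max(mean_scores))]
print(f"best k = {best_k}, CV accuracy = {max(mean_scores):.4f}")
```

Odd values of k avoid tied votes in binary problems; with more than two classes ties are still possible.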

In [None]:
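# Create a scaled version for k-finding (KNN needs scaling!)
# Note: the scaler is fit on all of X_train before cross-validation, so each
# validation fold contributes to the scaling statistics used here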
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Find optimal k
logger.info("Finding optimal k value...")
k_results = find_optimal_k(
    X_train_scaled, 
    y_train, 
    k_range=range(3, 21, 2),  # Test k=3,5,7,9,11,13,15,17,19
    cv=5,
    logger=logger
)

# Plot results
plt.figure(figsize=(12, 6))
plt.errorbar(k_results['k_values'], k_results['mean_scores'], 
             yerr=k_results['std_scores'], marker='o', capsize=5, linewidth=2)
plt.axvline(x=k_results['optimal_k'], color='r', linestyle='--', 
            label=f"Optimal k={k_results['optimal_k']}")
plt.xlabel('Number of Neighbors (k)', fontsize=12)
plt.ylabel('CV Accuracy', fontsize=12)
plt.title('KNN Performance vs k Value', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n✓ Optimal k: {k_results['optimal_k']}")
print(f"  CV Score: {k_results['optimal_score']:.4f}")

## 9. Create KNN Pipeline with SMOTE

**Important:** We include StandardScaler because KNN is distance-based and highly sensitive to feature scales.
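
To make the scale sensitivity concrete, here is a small worked example in plain Python. The flows and their two features (byte count, flag count) are made up for illustration, and `standardize` mimics what `StandardScaler` does (zero mean, unit variance per column).

```python
import math
from statistics import mean, pstdev

# Hypothetical flows: (bytes transferred, flag count). Values are made up.
flows = [
    (1000.0, 1.0),  # query point
    (1050.0, 9.0),  # similar byte count, very different flag count
    (1400.0, 1.0),  # somewhat different byte count, identical flag count
    (5000.0, 5.0),
    (9000.0, 2.0),
    (3000.0, 8.0),
]

def standardize(rows):
    """Rescale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to rows[0], excluding rows[0]."""
    query = rows[0]
    return min(range(1, len(rows)), key=lambda i: math.dist(query, rows[i]))

raw_nearest = nearest_to_first(flows)
scaled_nearest = nearest_to_first(standardize(flows))
print(raw_nearest, scaled_nearest)
```

On the raw values the byte count dominates the distance, so the nearest neighbor is row 1 even though its flag count is very different. After standardization both features contribute comparably and row 2 becomes the nearest neighbor.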

In [None]:
# Create pipeline with optimal k
pipeline = create_knn_pipeline(
    n_neighbors=k_results['optimal_k'],
    weights='distance',  # Weight by distance (closer neighbors more important)
    metric='minkowski',
    p=2,  # Euclidean distance
    random_state=0,
    use_smote=True,
    use_scaler=True  # CRITICAL for KNN!
)

print("Pipeline created with steps:")
for name, step in pipeline.steps:
    print(f"  - {name}: {step.__class__.__name__}")

## 10. Cross-Validation

⚠️ **This may take several minutes depending on your sample size.**

In [None]:
# Perform cross-validation
logger.info("Performing 5-fold cross-validation...")
start_time = time()

cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, n_jobs=-1)

elapsed_time = time() - start_time

print("\n" + "="*50)
print("CROSS-VALIDATION RESULTS")
print("="*50)
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"Time elapsed: {elapsed_time:.2f} seconds ({elapsed_time/60:.2f} minutes)")

# Plot CV scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, marker='o', markersize=10, linewidth=2)
plt.axhline(y=cv_scores.mean(), color='r', linestyle='--', 
            label=f'Mean: {cv_scores.mean():.4f}')
plt.xlabel('Fold', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('KNN Cross-Validation Scores', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 11. Train Final Model

In [None]:
# Train on full training set
logger.info("Training final model on full training set...")
start_time = time()

pipeline.fit(X_train, y_train)

training_time = time() - start_time

print(f"✓ Model training completed in {training_time:.2f} seconds")

## 12. Evaluate on Test Set

In [None]:
# Evaluate model
start_time = time()
results = evaluate_model(pipeline, X_test, y_test, logger=logger)
prediction_time = time() - start_time

print("\n" + "="*50)
print("TEST SET RESULTS")
print("="*50)
print(f"Test Accuracy: {results['accuracy']:.4f}")
print(f"Prediction time: {prediction_time:.2f} seconds")
print(f"Time per sample: {prediction_time/len(X_test)*1000:.2f} ms")
print(f"\nClassification Report:")
print(results['report'])

## 13. Confusion Matrix

In [None]:
# Plot confusion matrix
plt.figure(figsize=(12, 10))
disp = ConfusionMatrixDisplay(confusion_matrix=results['confusion_matrix'],
                               display_labels=pipeline.classes_)
disp.plot(cmap='Blues', xticks_rotation=45)
plt.title('KNN Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 14. Performance Summary

In [None]:
print("\n" + "="*70)
print("FINAL PERFORMANCE SUMMARY")
print("="*70)
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
print(f"Test Accuracy: {results['accuracy']:.4f}")
print(f"\nModel Configuration:")
print(f"  - Number of neighbors (k): {pipeline.named_steps['knn'].n_neighbors}")
print(f"  - Weight function: {pipeline.named_steps['knn'].weights}")
print(f"  - Distance metric: {pipeline.named_steps['knn'].metric}")
print(f"  - Training samples: {len(X_train)}")
print(f"  - Feature scaling: Enabled")
print(f"  - SMOTE: Enabled")
print(f"\nTiming:")
print(f"  - Training time: {training_time:.2f}s")
print(f"  - Prediction time: {prediction_time:.2f}s ({prediction_time/len(X_test)*1000:.2f}ms per sample)")

# Performance indicators
if cv_scores.mean() >= 0.95:
    print("\n✓ Excellent performance achieved (CV score ≥ 0.95)")
elif cv_scores.mean() >= 0.90:
    print("\n✓ Good performance achieved (CV score ≥ 0.90)")
else:
    print("\n⚠️  Performance below 0.90")
    print("   Consider:")
    print("   - Trying different k values")
    print("   - Using different distance metrics")
    print("   - Increasing training data size")
    print("   - Feature selection to reduce noise")

# KNN-specific notes
print("\nüìù KNN Notes:")
print(f"   - KNN stores all {len(X_train)} training samples")
print(f"   - Memory usage: ~{X_train.memory_usage(deep=True).sum() / 1024**2:.2f} MB for training data")
print(f"   - Consider dimensionality reduction for large datasets")

logger.info("Notebook execution completed successfully!")

## 15. (Optional) Hyperparameter Tuning

⚠️ **Warning:** This is computationally expensive! Only run if you have time.
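
The project's `tune_knn_hyperparameters` helper is not shown here. As a rough illustration of what such a search can look like, the sketch below uses scikit-learn's `GridSearchCV` on synthetic data. The grid values and the step name `scaler` are assumptions; `knn` matches the step name used in Section 14.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid only; 3 * 2 * 2 = 12 candidates, each fit cv times.
param_grid = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2],  # 1 = Manhattan, 2 = Euclidean
}

grid_search = GridSearchCV(search_pipeline, param_grid, cv=3, n_jobs=1)
grid_search.fit(X_demo, y_demo)
print(grid_search.best_params_, f"{grid_search.best_score_:.4f}")
```

The cost grows with the product of the grid sizes and the number of folds, which is why the search is left optional on the full sample.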

In [None]:
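# Uncomment to run hyperparameter tuning
# This will take a LONG time!
# Note: tune_knn_hyperparameters is not imported in Section 1; import it from
# wherever the project defines it before uncommenting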

# logger.info("Starting hyperparameter tuning (this may take a while)...")
# best_params, best_score, grid_search = tune_knn_hyperparameters(
#     X_train_scaled, y_train, cv=3, logger=logger
# )

# print("\nBest parameters found:")
# for param, value in best_params.items():
#     print(f"  {param}: {value}")
# print(f"\nBest CV score: {best_score:.4f}")

## Tips for Improving KNN Performance

1. **Increase sample size** - More data generally helps KNN
2. **Feature selection** - Remove irrelevant features to reduce noise
3. **Dimensionality reduction** - Use PCA or other techniques
4. **Try different k values** - The optimal k depends on your data
5. **Experiment with distance metrics** - Euclidean, Manhattan, Chebyshev
6. **Use weights='distance'** - Gives more weight to closer neighbors
7. **Consider approximate KNN** - For very large datasets (e.g., FAISS, Annoy)
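
For tip 5, the three metrics differ only in how per-feature differences are combined. A small worked example in plain Python, using two arbitrary points:

```python
def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Chebyshev distance: the largest single-feature difference."""
    return max(abs(x - y) for x, y in zip(a, b))

a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)  # per-feature differences: 3, 4, 0

manhattan = minkowski(a, b, p=1)  # 3 + 4 + 0 = 7
euclidean = minkowski(a, b, p=2)  # sqrt(9 + 16 + 0) = 5
cheb = chebyshev(a, b)            # max(3, 4, 0) = 4

print(manhattan, euclidean, cheb)
```

In scikit-learn's `KNeighborsClassifier`, `metric='minkowski'` with `p=1` or `p=2` selects the first two, and `metric='chebyshev'` selects the third.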