# Selected Algorithms

For this IoT intrusion detection problem, we have selected a diverse set of classic supervised learning algorithms:

1.  **Logistic Regression**
2.  **Decision Tree**
3.  **Random Forest**
4.  **k-Nearest Neighbors (KNN)**
5.  **Support Vector Machine (SVM)**
6.  **Naïve Bayes (GaussianNB)**

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
import joblib
import os

# Load processed data
train_df = pd.read_csv('../data/processed/train_processed.csv')

# Separate features and target
target_col = 'Attack_type'
X_train = train_df.drop(target_col, axis=1)
y_train = train_df[target_col]

# Define models
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(random_state=42),
    'NaiveBayes': GaussianNB()
}

### Justification

The selection of these models is based on the following criteria:

*   **Interpretability**: **Decision Trees** and **Logistic Regression** offer a clear view of how features influence classification, which is crucial for understanding the nature of attacks.
*   **Efficiency**: **Naïve Bayes** is cheap both to train and to query, while **KNN** has almost no training cost (its prediction time and memory grow with the size of the stored training set). This makes them interesting baselines for resource-constrained IoT devices.
*   **Expected Performance**: **Random Forest** and **SVM** are known for their robustness and ability to handle complex, high-dimensional data, typical characteristics of network traffic in attack scenarios.
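
To illustrate the interpretability point, here is a minimal sketch of how a shallow decision tree and a logistic regression can be inspected after fitting. The data is synthetic and the names `X_demo`, `y_demo`, and `feature_0`..`feature_3` are placeholders, not the columns of the processed IoT dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 4 hypothetical features, binary label.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# A shallow tree can be printed as nested threshold rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)
tree_rules = export_text(tree, feature_names=feature_names)
print(tree_rules)

# Logistic regression exposes one signed weight per feature.
log_reg = LogisticRegression(max_iter=1000, random_state=42).fit(X_demo, y_demo)
coefficients = dict(zip(feature_names, log_reg.coef_[0]))
for name, weight in coefficients.items():
    print(f"{name}: {weight:+.3f}")
```

On standardized features, the sign and relative magnitude of each coefficient give a rough indication of how that feature pushes the prediction; on a multi-class target, `coef_` has one row per class.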

# Model Training

The training process includes:

*   **Hyperparameters**: Specific search grids are defined for each algorithm (e.g., `n_estimators` for RF, `C` for SVM, `n_neighbors` for KNN).
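*   **Cross-Validation**: We use **GridSearchCV** with 3-fold cross-validation (`cv=3`), so each candidate's score is averaged over three train/validation splits instead of depending on a single split.
*   **Optimization**: `GridSearchCV` exhaustively evaluates every hyperparameter combination in the grid and keeps the one with the highest mean cross-validated score (`scoring='accuracy'`). The result is the best configuration within the grid, not necessarily a global optimum.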
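
The cost of an exhaustive search follows directly from the grid sizes: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below works this out with the standard library only; `demo_grids` copies the grids used in this notebook, and `count_fits` is a hypothetical helper written for this illustration.

```python
from math import prod

# Copy of the notebook's search grids, kept separate so nothing is overwritten.
demo_grids = {
    "LogisticRegression": {"C": [0.1, 1, 10]},
    "DecisionTree": {"max_depth": [None, 10, 20], "criterion": ["gini", "entropy"]},
    "RandomForest": {"n_estimators": [50, 100], "max_depth": [None, 10]},
    "KNN": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    "SVM": {"C": [0.1, 1], "kernel": ["rbf", "linear"]},
    "NaiveBayes": {"var_smoothing": [1e-9, 1e-8]},
}
cv_folds = 3


def count_fits(grid, folds):
    """Return (candidates, cross-validation fits) for an exhaustive grid search."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * folds


fit_counts = {name: count_fits(grid, cv_folds) for name, grid in demo_grids.items()}
for name, (candidates, fits) in fit_counts.items():
    print(f"{name}: {candidates} candidates x {cv_folds} folds = {fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set after these cross-validation fits.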

In [2]:
# Define parameter grids
param_grids = {
    'LogisticRegression': {'C': [0.1, 1, 10]},
    'DecisionTree': {'max_depth': [None, 10, 20], 'criterion': ['gini', 'entropy']},
    'RandomForest': {'n_estimators': [50, 100], 'max_depth': [None, 10]},
    'KNN': {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']},
    'SVM': {'C': [0.1, 1], 'kernel': ['rbf', 'linear']},
    'NaiveBayes': {'var_smoothing': [1e-9, 1e-8]}
}

best_models = {}
os.makedirs('../results/models', exist_ok=True)

for name, model in models.items():
    print(f"Training {name}...")
    # Note: the search runs on the full X_train. SVM and KNN can be slow on
    # large datasets, so a smaller subset may be used to speed up a demonstration.
    grid_search = GridSearchCV(model, param_grids[name], cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    
    best_models[name] = grid_search.best_estimator_
    print(f"Best {name} Score: {grid_search.best_score_:.4f}")
    print(f"Best Parameters: {grid_search.best_params_}")
    
    # Save model
    joblib.dump(grid_search.best_estimator_, f'../results/models/{name}_best.pkl')
    print(f"Model saved to ../results/models/{name}_best.pkl")
    print("-" * 30)

Training LogisticRegression...
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Best LogisticRegression Score: 0.9929
Best Parameters: {'C': 10}
Model saved to ../results/models/LogisticRegression_best.pkl
------------------------------
Training DecisionTree...
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Best DecisionTree Score: 0.9985
Best Parameters: {'criterion': 'entropy', 'max_depth': None}
Model saved to ../results/models/DecisionTree_best.pkl
------------------------------
Training RandomForest...
Fitting 3 folds for each of 4 candidates, totalling 12 fits
Best RandomForest Score: 0.9987
Best Parameters: {'max_depth': None, 'n_estimators': 100}
Model saved to ../results/models/RandomForest_best.pkl
------------------------------
Training KNN...
Fitting 3 folds for each of 6 candidates, totalling 18 fits
Best KNN Score: 0.9970
Best Parameters: {'n_neighbors': 3, 'weights': 'distance'}
Model saved to ../results/models/KNN_best.pkl
------------------------
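
The saved `.pkl` files can be reloaded later for evaluation with `joblib.load`. The following is a minimal round-trip sketch: it uses synthetic data and a temporary directory instead of the notebook's `../results/models` folder, and `model_path` is a placeholder name.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the processed IoT dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "DecisionTree_best.pkl")
    joblib.dump(model, model_path)      # same call the training loop uses
    restored = joblib.load(model_path)  # rebuilds the fitted estimator

# The restored estimator should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickled estimators are best reloaded with the same scikit-learn version that produced them.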

### Challenges Encountered

During the modeling phase, it is common to face the following challenges:

*   **Class Imbalance**: As seen in the exploratory analysis, some attack classes are heavily underrepresented. This can bias models towards the majority class, and accuracy in particular can look high even when minority classes are misclassified. Mitigation techniques such as resampling (SMOTE) or class weights (`class_weight='balanced'`) could be used.
*   **Training Time**: Algorithms like SVM and KNN are computationally expensive on large datasets (90k+ rows), which slows down the hyperparameter search.
*   **Overfitting**: Complex models like Random Forest or deep trees can memorize training data noise. Cross-validation and limiting depth (`max_depth`) help control this.
*   **Preprocessing**: Scale-sensitive algorithms (SVM, KNN, Logistic Regression) work best when features are on a comparable scale (e.g., standardized with `StandardScaler`), a step we performed in the previous stage.
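
Several of these mitigations can be combined in one search. The sketch below is one possible setup rather than the configuration used in this notebook: it draws a stratified subsample to cut training time, scales inside a `Pipeline` so the scaler is refit on each training fold, applies `class_weight='balanced'`, and scores with macro-averaged F1 instead of accuracy. The data is synthetic, and the step names `scaler` and `clf` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in data (assumption): roughly 90% / 10% classes.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Training time: a stratified subsample shrinks the data but keeps class proportions.
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=0.25, stratify=y_demo, random_state=42
)

# Preprocessing: the scaler is part of the pipeline, so it is fitted per fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42)),
])

# Class imbalance: stratified folds plus a metric that weights all classes equally.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="f1_macro",
    n_jobs=1,
)
search.fit(X_small, y_small)
print(search.best_params_, round(search.best_score_, 4))
```

Parameters of a pipeline step are addressed as `<step>__<parameter>`, which is why the grid key is `clf__C` rather than `C`.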