# Feature Engineering

This notebook focuses on extracting and engineering features from parsed Suricata rules.

We will:
1. Extract basic features (action, protocol, ports, etc.)
2. Extract option-based features (content, pcre, flow, etc.)
3. Extract metadata features (classtype, priority, message)
4. Create TF-IDF features from rule messages
5. Combine all features into a feature matrix

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from suricata_rule_clustering import parser, features

# Set display options
pd.set_option('display.max_columns', None)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Parsed Rules

In [None]:
# Load the parsed rules from the previous notebook
df = parser.load_parsed_rules('../data/parsed_rules.pkl')

print(f"Loaded {len(df)} rules")
print(f"Columns: {df.columns.tolist()}")

## 2. Create Feature Extractor

In [None]:
# Initialize feature extractor
extractor = features.RuleFeatureExtractor()

print("Feature extractor initialized")

## 3. Extract Basic Features

In [None]:
# Extract basic features
df_basic = extractor.extract_basic_features(df)

# Show new columns
new_cols = [col for col in df_basic.columns if col not in df.columns]
print(f"New feature columns: {new_cols}")

# Display sample
df_basic[new_cols].head()
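
To make the idea of "basic features" concrete, here is a minimal sketch of how a rule header could be turned into a few simple features. It is not the implementation behind `extract_basic_features`; the function name `basic_features_from_header` and the output keys are hypothetical, and the sketch assumes a plain seven-token header with no spaces inside address or port groups.

```python
def basic_features_from_header(rule: str) -> dict:
    """Split the header of a Suricata rule into simple categorical/binary features."""
    # Everything before the first "(" is the header; the rest is the option body.
    header = rule.split("(", 1)[0].split()
    # Assumed layout: action protocol src_addr src_port direction dst_addr dst_port
    if len(header) != 7:
        raise ValueError(f"unexpected rule header: {header!r}")
    action, protocol, src_addr, src_port, direction, dst_addr, dst_port = header
    return {
        "action": action,
        "protocol": protocol,
        "src_port_is_any": int(src_port == "any"),
        "dst_port_is_any": int(dst_port == "any"),
        "is_bidirectional": int(direction == "<>"),
    }


# Made-up rule used only for illustration.
example_rule = (
    'alert tcp $HOME_NET any -> $EXTERNAL_NET 80 '
    '(msg:"Example HTTP rule"; content:"GET"; sid:1000001; rev:1;)'
)
header_features = basic_features_from_header(example_rule)
print(header_features)
```

Categorical values such as `action` and `protocol` would still need to be encoded numerically (for example one-hot) before clustering.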

## 4. Extract Option Features

In [None]:
# Extract option-based features
df_options = extractor.extract_option_features(df_basic)

# Show new columns
new_cols = [col for col in df_options.columns if col not in df_basic.columns]
print(f"New option feature columns: {new_cols}")

# Display sample
df_options[new_cols].head(10)

In [None]:
# Analyze option features
option_cols = [col for col in df_options.columns if col.startswith(('has_', 'num_'))]

# Show statistics for numeric option features
print("Option feature statistics:")
df_options[option_cols].describe()
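
The `has_` and `num_` prefixes suggest presence flags and counts per option keyword. The sketch below shows one way such values could be derived from a raw rule string. The helper names and the feature names `num_content`, `has_pcre`, and `has_flow` are assumptions for illustration and may not match what `extract_option_features` produces. The option splitting is also simplified: it only treats `\;` as an escaped semicolon.

```python
import re
from collections import Counter


def option_keyword_counts(rule: str) -> Counter:
    """Count how often each option keyword appears in a rule's option body."""
    # Take the text between the first "(" and the last ")".
    body = rule[rule.index("(") + 1 : rule.rindex(")")]
    # Options end with ";"; a semicolon preceded by a backslash is treated as escaped.
    options = [opt.strip() for opt in re.split(r"(?<!\\);", body) if opt.strip()]
    # The keyword is the text before the first ":" (or the whole option for
    # value-less modifiers such as "nocase").
    return Counter(opt.split(":", 1)[0].strip() for opt in options)


def option_features(rule: str) -> dict:
    """Derive a few hypothetical count and presence features from the keyword counts."""
    counts = option_keyword_counts(rule)
    return {
        "num_content": counts["content"],
        "has_pcre": int(counts["pcre"] > 0),
        "has_flow": int(counts["flow"] > 0),
    }


# Made-up rule used only for illustration.
example_rule = (
    'alert http any any -> any any (msg:"Example"; flow:established,to_server; '
    'content:"GET"; content:"/admin"; nocase; sid:1000002; rev:1;)'
)
example_option_features = option_features(example_rule)
print(example_option_features)
# {'num_content': 2, 'has_pcre': 0, 'has_flow': 1}
```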

## 5. Extract Metadata Features

In [None]:
# Extract metadata features
df_metadata = extractor.extract_metadata_features(df_options)

# Show new columns
new_cols = [col for col in df_metadata.columns if col not in df_options.columns]
print(f"New metadata feature columns: {new_cols}")

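# Display sample (an expression inside an `if` block is not rendered
# automatically by the notebook, so call display() explicitly)
if new_cols:
    display(df_metadata[new_cols].head())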
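
As a rough illustration of metadata features, the sketch below pulls the classtype, the priority, and two simple message statistics out of a raw rule string with regular expressions. The function name `metadata_features` and its output keys are hypothetical; the actual columns created by `extract_metadata_features` may differ.

```python
import re


def metadata_features(rule: str) -> dict:
    """Pull classtype, priority, and simple message statistics out of a rule string."""

    def first_match(pattern):
        match = re.search(pattern, rule)
        return match.group(1) if match else None

    # The msg value is a double-quoted string that may contain escaped characters.
    msg = first_match(r'msg\s*:\s*"((?:[^"\\]|\\.)*)"') or ""
    classtype = first_match(r"classtype\s*:\s*([^;]+);")
    priority = first_match(r"priority\s*:\s*(\d+)\s*;")
    return {
        "classtype": classtype.strip() if classtype else "unknown",
        # None means the rule has no explicit priority keyword.
        "priority": int(priority) if priority else None,
        "msg_length": len(msg),
        "msg_word_count": len(msg.split()),
    }


# Made-up rule used only for illustration.
example_rule = (
    'alert tcp any any -> any any (msg:"Example trojan beacon detected"; '
    'classtype:trojan-activity; sid:1000003; rev:1;)'
)
example_metadata = metadata_features(example_rule)
print(example_metadata)
```

Rules without an explicit `priority` keyword yield `None` here, so a downstream step would need to decide how to fill that value.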

## 6. Extract Text Features (TF-IDF)

In [None]:
# Extract TF-IDF features from rule messages
df_text, tfidf_matrix = extractor.extract_text_features(
    df_metadata,
    max_features=50,  # Start with 50 features
    ngram_range=(1, 2)  # Unigrams and bigrams
)

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print("\nFirst 20 TF-IDF feature names (by column index, not ranked by weight):")
print(extractor.tfidf_vectorizer.get_feature_names_out()[:20])
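
For readers who want to see what the TF-IDF step computes, here is a small standard-library sketch over unigrams and bigrams. It assumes whitespace tokenization, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows; the extractor's vectorizer may tokenize and weight differently, and the sketch does not implement a `max_features` cut-off. It also shows one way to rank terms by mean weight rather than by position in the vocabulary.

```python
import math
from collections import Counter


def ngrams(text, ngram_range=(1, 2)):
    """Lower-case, whitespace-tokenize, and emit all n-grams in the given range."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    return [
        " ".join(tokens[i : i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(messages, ngram_range=(1, 2)):
    """Return (vocabulary, rows); rows[i][j] is the L2-normalized TF-IDF weight."""
    docs = [Counter(ngrams(m, ngram_range)) for m in messages]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of messages in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (an assumption of this sketch).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in docs:
        weights = [doc[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


# Made-up rule messages used only for illustration.
messages = [
    "malware beacon detected",
    "malware download detected",
    "ssh brute force attempt",
]
vocabulary, rows = tfidf(messages)

# Rank terms by their mean weight across all messages.
mean_weight = {
    term: sum(row[j] for row in rows) / len(rows)
    for j, term in enumerate(vocabulary)
}
top_terms = sorted(mean_weight, key=mean_weight.get, reverse=True)[:5]
print(top_terms)
```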

## 7. Create Complete Feature Matrix

In [None]:
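# Create feature matrix for clustering.
# Note: this call starts again from the parsed rules in `df` and requests
# 100 TF-IDF features, not the 50 explored in the previous section.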
X = extractor.create_feature_matrix(
    df,
    include_tfidf=True,
    tfidf_max_features=100
)

print(f"Final feature matrix shape: {X.shape}")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
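
Conceptually, building the final matrix means putting the numeric feature block and the TF-IDF block side by side, one row per rule. The sketch below does this in plain Python so each step is visible, and it standardizes the numeric columns first. Both helper names are hypothetical, and whether `create_feature_matrix` scales features in this way is an assumption, not something shown in this notebook.

```python
import statistics


def standardize_columns(rows):
    """Z-score each column; constant columns become 0.0 to avoid division by zero."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


def combine_blocks(numeric_rows, tfidf_rows):
    """Concatenate scaled numeric features with TF-IDF features, row by row."""
    if len(numeric_rows) != len(tfidf_rows):
        raise ValueError("both blocks must describe the same rules in the same order")
    scaled = standardize_columns(numeric_rows)
    return [num + list(txt) for num, txt in zip(scaled, tfidf_rows)]


# Toy values: three rules, three numeric features, two TF-IDF features.
numeric_rows = [[2, 1, 0], [4, 1, 1], [6, 1, 0]]
tfidf_rows = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]
combined = combine_blocks(numeric_rows, tfidf_rows)
print(len(combined), len(combined[0]))
```

Scaling matters for distance-based clustering because a feature with a large numeric range would otherwise dominate the distances.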

## 8. Analyze Feature Matrix

In [None]:
# Check for NaN or infinite values
print(f"NaN values: {np.isnan(X).sum()}")
print(f"Infinite values: {np.isinf(X).sum()}")

# Basic statistics
print("\nFeature matrix statistics:")
print(f"Mean: {X.mean():.4f}")
print(f"Std: {X.std():.4f}")
print(f"Min: {X.min():.4f}")
print(f"Max: {X.max():.4f}")

In [None]:
# Visualize feature variance
feature_variance = X.var(axis=0)

plt.figure(figsize=(12, 4))
plt.plot(feature_variance)
plt.title('Feature Variance Across All Features')
plt.xlabel('Feature Index')
plt.ylabel('Variance')
plt.show()

print(f"Features with zero variance: {(feature_variance == 0).sum()}")
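
Columns with zero variance take the same value for every rule, so they cannot help separate clusters. If the count above is non-zero, one option is to drop those columns before clustering. The following is a minimal sketch assuming a dense NumPy array; `drop_constant_columns` is a hypothetical helper, not part of the project package.

```python
import numpy as np


def drop_constant_columns(X, tol=0.0):
    """Return (X_reduced, kept_indices), dropping columns whose variance is <= tol."""
    X = np.asarray(X, dtype=float)
    if not np.isfinite(X).all():
        raise ValueError("matrix contains NaN or infinite values; clean it first")
    keep = X.var(axis=0) > tol
    return X[:, keep], np.flatnonzero(keep)


# Toy matrix: the first column is constant and should be removed.
X_demo = np.array([
    [1.0, 0.0, 3.0],
    [1.0, 1.0, 5.0],
    [1.0, 0.0, 7.0],
])
X_reduced, kept = drop_constant_columns(X_demo)
print(X_reduced.shape, kept.tolist())
# (3, 2) [1, 2]
```

Keeping `kept` around makes it possible to map the remaining columns back to their original feature names.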

## 9. Save Features

In [None]:
# Save the feature matrix and processed DataFrame
np.save('../data/feature_matrix.npy', X)
df_metadata.to_pickle('../data/processed_rules.pkl')

print("Saved feature matrix and processed DataFrame")

## Summary

We have successfully:
- Extracted basic features (action, protocol, ports)
- Extracted option-based features (content, pcre, flow patterns)
- Extracted metadata features (classtype, priority, message characteristics)
- Created TF-IDF features from rule messages
- Combined all features into a scaled feature matrix

## Next Steps

Proceed to **03_clustering.ipynb** to apply clustering algorithms to the feature matrix.