# PortKodiak AI Shield - Model Training

This notebook trains an **Isolation Forest** model to detect anomalous network traffic based on data collected by the PortKodiak agent.

## 1. Setup & Upload
First, upload your `traffic_export_TIMESTAMP.csv` file using the file uploader on the left.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
import joblib

# Check files
!ls -lh *.csv

## 2. Load Data
Update the filename below to match your uploaded CSV.

In [None]:
csv_file = "traffic_export.csv" # CHANGE THIS to your actual filename if different

try:
    df = pd.read_csv(csv_file)
    print(f"Loaded {len(df)} records.")
    display(df.head())
except FileNotFoundError:
    print("File not found! Please upload your CSV.")
    raise  # stop here; later cells need `df` and would otherwise fail with a NameError
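
Before going further, it can help to confirm that the export has the columns the later cells rely on. The helper below is a hypothetical sketch, not part of PortKodiak; the column list is taken from the cells in this notebook and the demo frame is made up.

```python
import pandas as pd

# Columns referenced by later cells in this notebook.
REQUIRED_COLUMNS = ["timestamp", "process_name", "process_path", "remote_ip", "remote_port"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that `frame` does not have."""
    return [col for col in required if col not in frame.columns]

# Made-up frame standing in for the uploaded CSV.
demo = pd.DataFrame({
    "timestamp": ["2024-01-01T00:00:00"],
    "process_name": ["chrome.exe"],
    "remote_port": [443],
})
print(missing_columns(demo))
```

On the real data, `missing_columns(df)` should return an empty list.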

## 3. Preprocessing
The model needs numeric input, so each feature is encoded as follows:

- **Process Name**: hashed with `HashingVectorizer` into a fixed number of columns, so arbitrary process names can be handled without exploding dimensionality.
- **Process Path**: hashed the same way.
- **Ports**: Standard Scaling.
- **Direction**: One-Hot Encoding.
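
To see what hashing does to a process name, the sketch below applies `HashingVectorizer` to a few made-up process names, using the same settings as this notebook. The names are illustrative and not taken from a real export.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical process names, for illustration only.
names = ["chrome.exe", "svchost.exe", "chrome.exe"]

# No vocabulary is learned, so there is no fit step.
vec = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)
hashed = vec.transform(names).toarray()

print(hashed.shape)                     # one row per name, 16 columns
print((hashed[0] == hashed[2]).all())   # identical names hash identically
```

Different names can land in the same column (a hash collision). A larger `n_features` makes that less likely at the cost of a wider feature matrix.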

In [None]:
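# Feature Selection
# We focus on: remote_port, process_name and process_path (hashed), direction
# Advanced features (IP geo, time of day) can be added later.
from sklearn.base import BaseEstimator, TransformerMixin

class HashingTransformer(TransformerMixin, BaseEstimator):
    """Hashes a single text column into `n_features` numeric columns."""
    def __init__(self, n_features=32):
        # Store the parameter under its own name so get_params()/clone() work
        # (ColumnTransformer clones each transformer before fitting).
        self.n_features = n_features
    def fit(self, X, y=None):
        # HashingVectorizer is stateless, so there is nothing to learn.
        return self
    def transform(self, X):
        # X is DataFrame or Series. Flatten to string list.
        if isinstance(X, pd.DataFrame):
            X = X.iloc[:, 0]
        vec = HashingVectorizer(n_features=self.n_features, alternate_sign=False, norm=None)
        return vec.transform(X.astype(str)).toarray()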

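preprocessor = ColumnTransformer(
    transformers=[
        ('port', StandardScaler(), ['remote_port']),
        ('proc', HashingTransformer(n_features=16), ['process_name']),
        ('path', HashingTransformer(n_features=32), ['process_path']),
        # Assumes the export has a `direction` column, as described above.
        ('dir', OneHotEncoder(handle_unknown='ignore'), ['direction']),
    ],
    remainder='drop'
)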

# Create Pipeline
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=42, n_jobs=-1)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', clf)
])

## 4. Train Model
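We train on the assumption that most of your data is "Normal". With `contamination=0.01`, the model flags roughly the most unusual 1% of the training records as outliers, whether or not they are actually malicious.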

In [None]:
print("Training model...")
pipeline.fit(df)
print("Training Complete!")

# Score the training data to see distribution
scores = pipeline.decision_function(df)
plt.figure(figsize=(10, 6))
sns.histplot(scores, kde=True)
plt.title("Anomaly Scores Distribution")
plt.xlabel("Score (Lower = More Anomalous)")
plt.show()
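
The score plotted above comes from `decision_function`, which is the raw `score_samples` value shifted by a threshold (`offset_`) that `contamination` determines, so records scoring below zero are the ones `predict` labels `-1`. A minimal sketch on synthetic data (random numbers, not PortKodiak traffic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # synthetic stand-in for the encoded features

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
forest.fit(X)

demo_scores = forest.decision_function(X)
labels = forest.predict(X)

# decision_function is score_samples shifted by the fitted threshold.
print(np.allclose(demo_scores, forest.score_samples(X) - forest.offset_))
# predict returns -1 exactly where the shifted score is negative.
print(np.array_equal(labels == -1, demo_scores < 0))
# Share of training rows flagged; close to the contamination setting.
print((labels == -1).mean())
```

Raising `contamination` moves the threshold so that a larger share of records falls below zero. It does not change how the trees are built.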

## 5. Evaluate (Anomalies)
Let's look at the top anomalies found in your own training set.

In [None]:
df['score'] = scores
df['anomaly'] = pipeline.predict(df)

# Show top anomalies (score < 0)
anomalies = df[df['anomaly'] == -1].sort_values('score')
print(f"Found {len(anomalies)} anomalies in training set.")
display(anomalies[['timestamp', 'process_name', 'remote_ip', 'remote_port', 'score']].head(20))

## 6. Export Model
Download `portkodiak_model.pkl` and place it in your `ml/models/` directory.

In [None]:
joblib.dump(pipeline, "portkodiak_model.pkl")
print("Model saved as portkodiak_model.pkl")

from google.colab import files
files.download('portkodiak_model.pkl')
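
On the consuming side, the saved pipeline can be restored with `joblib.load` and used to score new records. The sketch below round-trips a small stand-in pipeline through a temporary file; the data and the `demo_*` names are made up. One caveat applies to the real model: pickles store custom classes such as `HashingTransformer` by reference, so the code that loads `portkodiak_model.pkl` needs that class available under the same module path it had when the model was saved.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: mostly common web ports.
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({"remote_port": rng.choice([80, 443], size=200)})

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([("port", StandardScaler(), ["remote_port"])])),
    ("classifier", IsolationForest(n_estimators=50, random_state=42)),
])
demo_pipeline.fit(demo_train)

# Save and restore, as the consuming code would after the file is copied over.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
joblib.dump(demo_pipeline, demo_path)
restored = joblib.load(demo_path)

# Score unseen records: -1 means anomalous, 1 means normal.
demo_new = pd.DataFrame({"remote_port": [443, 31337]})
print(restored.predict(demo_new))
print(restored.decision_function(demo_new))
```

Loading with the same scikit-learn version that trained the model avoids version-mismatch warnings or errors when unpickling.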