# EIDO-Enhanced vs. Raw Text ML Model Comparison

This notebook illustrates the core hypothesis of the SentinelAI project: **training a machine learning model on structured EIDO-JSON data yields better performance than training on raw, unstructured text.**

We will perform a classification task to predict the **priority level** of an emergency incident.

1.  **Baseline Model**: Trained on raw text using TF-IDF.
2.  **EIDO-Enhanced Model**: Trained on structured features extracted from EIDO-JSONs.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import json
import warnings
import seaborn as sns
import matplotlib.pyplot as plt

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

## 1. Data Simulation

First, we create a small simulated dataset of ten incidents. Each one pairs raw incident text with the EIDO-JSON that the EIDO Agent would produce for it, plus a priority label (1=High, 2=Medium, 3=Low).

In [None]:
# Raw incident text as received, each with a priority label (1=High, 2=Medium, 3=Low).
data = [
    {'raw_text': 'Man down at the library, he is clutching his chest, possible heart attack.', 'priority_label': 1},
    {'raw_text': 'Reports of a shooting near Price Center, multiple victims seen, suspect has a handgun.', 'priority_label': 1},
    {'raw_text': 'A student reported a break-in at the student dorms on floor 3, suspect fled on foot.', 'priority_label': 3},
    {'raw_text': 'There is a large car fire on the I-5 freeway near the campus exit ramp.', 'priority_label': 2},
    {'raw_text': 'Suspicious person reported loitering near the bus stop. No weapon seen.', 'priority_label': 3},
    {'raw_text': 'Two cars collided at the intersection of Gilman and Villa La Jolla. Minor injuries reported.', 'priority_label': 2},
    {'raw_text': 'Caller states his roommate is unconscious and not breathing. Medical emergency.', 'priority_label': 1},
    {'raw_text': 'A group of people are fighting outside the bar on campus. One person has a knife.', 'priority_label': 1},
    {'raw_text': 'My bike was stolen from the rack outside the engineering building sometime yesterday.', 'priority_label': 3},
    {'raw_text': 'Loud party complaint at the apartments on Nobel Drive.', 'priority_label': 3}
]
# These are the corresponding EIDO-JSONs that would be generated by the EIDO Agent.
eido_jsons = [
    '{"personComponent": [{"personIncidentRoleRegistryText": ["Victim"]}], "incidentComponent": {"incidentTypeCommonRegistryText": "Medical"}}',
    '{"personComponent": [{"personIncidentRoleRegistryText": ["Victim"]}, {"personIncidentRoleRegistryText": ["Suspect"]}], "itemComponent": [{}], "incidentComponent": {"incidentTypeCommonRegistryText": "Crime-Violent"}}',
    '{"personComponent": [{"personIncidentRoleRegistryText": ["Suspect"]}], "incidentComponent": {"incidentTypeCommonRegistryText": "Crime-Property"}}',
    '{"vehicleComponent": [{}], "incidentComponent": {"incidentTypeCommonRegistryText": "Fire"}}',
    '{"personComponent": [{"personIncidentRoleRegistryText": ["Suspect"]}], "incidentComponent": {"incidentTypeCommonRegistryText": "Suspicious-Activity"}}',
    '{"vehicleComponent": [{}, {}], "personComponent": [{"personIncidentRoleRegistryText": ["Victim"]}], "incidentComponent": {"incidentTypeCommonRegistryText": "Traffic-Collision"}}',
    '{"personComponent": [{"personIncidentRoleRegistryText": ["Victim"]}], "incidentComponent": {"incidentTypeCommonRegistryText": "Medical"}}',
    '{"personComponent": [{"personIncidentRoleRegistryText": ["Victim"]}, {"personIncidentRoleRegistryText": ["Suspect"]}], "itemComponent": [{}], "incidentComponent": {"incidentTypeCommonRegistryText": "Crime-Violent"}}',
    '{"itemComponent": [{}], "incidentComponent": {"incidentTypeCommonRegistryText": "Crime-Property"}}',
    '{"incidentComponent": {"incidentTypeCommonRegistryText": "Disturbance"}}'
]

df = pd.DataFrame(data)
df['eido_json'] = eido_jsons

df.head()

## 2. Baseline Model (Raw Text)

Here, we train a classifier using only the `raw_text` of the incident. This is the traditional approach, where the model must infer all meaning from unstructured language.

In [None]:
print("="*50)
print("--- Training Baseline Model (Raw Text Features) ---")
print("="*50)

X_base = df['raw_text']
y_base = df['priority_label']

X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(X_base, y_base, test_size=0.3, random_state=42, stratify=y_base)

# Simple pipeline: TF-IDF followed by a classifier
pipeline_base = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=100)),
    ('clf', GradientBoostingClassifier(random_state=42, n_estimators=50))
])

pipeline_base.fit(X_train_base, y_train_base)

print("\nBaseline Model Performance:")
y_pred_base = pipeline_base.predict(X_test_base)
print(f"\nAccuracy: {accuracy_score(y_test_base, y_pred_base):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_base, y_pred_base, zero_division=0))

# Plotting Confusion Matrix
cm_base = confusion_matrix(y_test_base, y_pred_base)
plt.figure(figsize=(6, 4))
sns.heatmap(cm_base, annot=True, fmt='d', cmap='Blues', xticklabels=pipeline_base.classes_, yticklabels=pipeline_base.classes_)
plt.title('Baseline Model Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

## 3. EIDO-Enhanced Model (Structured Features)

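Now, we engineer features from the structured `eido_json`. This gives the model explicit, clean signals such as the number of victims or the standardized incident type, rather than forcing it to infer them from text. One exception: `has_weapon` is still derived by keyword matching on `raw_text`, because the simulated EIDO-JSONs carry no weapon details.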
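
As a minimal sketch of what that extraction looks like for a single record, the block below parses one of the simulated EIDO-JSON strings with the standard library and counts person roles. The names `role_counts` and `incident_type` are illustrative; the full feature function in the next cell does the same for every row.

```python
import json
from collections import Counter

# One simulated EIDO-JSON record (the violent-crime example from the dataset).
eido_json = (
    '{"personComponent": ['
    '{"personIncidentRoleRegistryText": ["Victim"]}, '
    '{"personIncidentRoleRegistryText": ["Suspect"]}], '
    '"itemComponent": [{}], '
    '"incidentComponent": {"incidentTypeCommonRegistryText": "Crime-Violent"}}'
)

eido = json.loads(eido_json)

# Count how often each role appears across all listed persons.
role_counts = Counter(
    role
    for person in eido.get("personComponent", [])
    for role in person.get("personIncidentRoleRegistryText", [])
)

# Fall back to 'Unknown' when the incident component or type is missing.
incident_type = eido.get("incidentComponent", {}).get(
    "incidentTypeCommonRegistryText", "Unknown"
)

print(role_counts["Victim"], role_counts["Suspect"], incident_type)
```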

In [None]:
def feature_engineer_from_eido(df):
    """
    Extracts structured features from the 'eido_json' column.
    """
    features = []
    for index, row in df.iterrows():
        try:
            eido = json.loads(row['eido_json'])
        except (json.JSONDecodeError, TypeError):
            eido = {}

        person_component = eido.get('personComponent', [])
        incident_component = eido.get('incidentComponent', {})

        num_victims = sum(1 for p in person_component if isinstance(p, dict) and "Victim" in p.get('personIncidentRoleRegistryText', []))
        num_suspects = sum(1 for p in person_component if isinstance(p, dict) and "Suspect" in p.get('personIncidentRoleRegistryText', []))
        incident_type = incident_component.get('incidentTypeCommonRegistryText', 'Unknown')
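        # NOTE: this flag comes from keyword matching on the raw text, not from the EIDO-JSON,
        # because the simulated 'itemComponent' entries are empty. The explicit negation
        # "no weapon" is excluded so that "No weapon seen." is not flagged as a weapon.
        text = row['raw_text'].lower()
        has_weapon = int('no weapon' not in text and any(k in text for k in ('weapon', 'knife', 'gun')))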

        features.append({
            'num_victims': num_victims,
            'num_suspects': num_suspects,
            'incident_type': incident_type,
            'has_weapon': has_weapon
        })
    
    return pd.concat([df.reset_index(drop=True), pd.DataFrame(features)], axis=1)

df_featured = feature_engineer_from_eido(df)
display(df_featured[['raw_text', 'priority_label', 'num_victims', 'num_suspects', 'incident_type', 'has_weapon']].head())

In [None]:
print("\n" + "="*50)
print("--- Training EIDO-Enhanced Model (Structured Features) ---")
print("="*50)

# Define feature columns and the target
feature_cols = ['num_victims', 'num_suspects', 'incident_type', 'has_weapon']
X_eido = df_featured[feature_cols]
y_eido = df_featured['priority_label']

X_train_eido, X_test_eido, y_train_eido, y_test_eido = train_test_split(X_eido, y_eido, test_size=0.3, random_state=42, stratify=y_eido)

# Preprocessing for categorical vs. numerical features
categorical_features = ['incident_type']
numerical_features = ['num_victims', 'num_suspects', 'has_weapon']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

pipeline_eido = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', GradientBoostingClassifier(random_state=42, n_estimators=50))
])

pipeline_eido.fit(X_train_eido, y_train_eido)

print("\nEIDO-Enhanced Model Performance:")
y_pred_eido = pipeline_eido.predict(X_test_eido)
print(f"\nAccuracy: {accuracy_score(y_test_eido, y_pred_eido):.4f}")
print("\nClassification Report:")
print(classification_report(y_test_eido, y_pred_eido, zero_division=0))

# Plotting Confusion Matrix
cm_eido = confusion_matrix(y_test_eido, y_pred_eido)
plt.figure(figsize=(6, 4))
sns.heatmap(cm_eido, annot=True, fmt='d', cmap='Greens', xticklabels=pipeline_eido.classes_, yticklabels=pipeline_eido.classes_)
plt.title('EIDO-Enhanced Model Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
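
The `handle_unknown='ignore'` setting in the preprocessor matters with so little data: an incident type that occurs only once (for example `Disturbance`) may end up in the test split without ever being seen during training. The sketch below, using a hypothetical three-category training set, shows how the encoder treats such a value.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training categories; 'Disturbance' is deliberately left out.
train_types = np.array([["Medical"], ["Fire"], ["Crime-Violent"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_types)

# Encode one known and one unseen incident type.
encoded = encoder.transform(np.array([["Fire"], ["Disturbance"]])).toarray()

print(encoder.categories_[0])  # learned categories, sorted alphabetically
print(encoded)                 # the unseen type becomes an all-zero row
```

An all-zero row means the model receives no incident-type signal for that sample and has to rely on the numeric features alone.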

## 4. Conclusion

Let's compare the results.

In [None]:
baseline_accuracy = accuracy_score(y_test_base, y_pred_base)
eido_accuracy = accuracy_score(y_test_eido, y_pred_eido)

print(f"Baseline Model Accuracy (Raw Text): {baseline_accuracy:.4f}")
print(f"EIDO-Enhanced Model Accuracy (Structured Features): {eido_accuracy:.4f}")
print("-"*40)

if eido_accuracy > baseline_accuracy:
    print("The EIDO-Enhanced model performed better.")
    print("This is consistent with the hypothesis that structuring raw text into a standardized format like EIDO-JSON provides a cleaner, less noisy signal for machine learning tasks.")
    print("By providing explicit features like 'num_victims' and 'incident_type', we remove ambiguity and allow the model to learn more robust patterns.")
else:
    print("The baseline model performed as well as or better than the EIDO-Enhanced model.")
    print("This could be due to the small size of the sample data or features that were not sufficiently predictive. With a larger, more complex dataset, the benefits of structured data are expected to be much more pronounced.")
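
When reading these numbers, it helps to keep the size of the hold-out set in mind. The sketch below uses only the labels from the simulated dataset and the `test_size=0.3` setting from the splits above to work out how many incidents land in the test set and which accuracy values are possible. It assumes the fractional test size is rounded up to a whole number of samples; the names `labels`, `n_test`, and `possible_accuracies` are introduced for this example.

```python
import math
from collections import Counter

# Priority labels of the ten simulated incidents, in dataset order.
labels = [1, 1, 3, 2, 3, 2, 1, 1, 3, 3]
test_size = 0.3

# Assumption: a fractional test size is rounded up to whole samples.
n_test = math.ceil(test_size * len(labels))
n_train = len(labels) - n_test

# How many incidents each priority class contributes overall.
class_counts = Counter(labels)

# Every accuracy value a model can reach on a test set of this size.
possible_accuracies = [round(k / n_test, 4) for k in range(n_test + 1)]

print(f"train={n_train}, test={n_test}")
print(dict(class_counts))
print(possible_accuracies)
```

With three test incidents, a single prediction moves accuracy by a third, so a gap between the two models can come down to one incident.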
