# Complete Solution: Text Analysis Without Pre-trained Vectorizer v0.8
## Project Vigil - K-Gram Analysis with Proper Model Training

## **The Real Problem & Complete Solution**

### What Went Wrong in v0.5-v0.7?

❌ **You have**: `classifier.pkl` (the model)
❌ **You DON'T have**: `vectorizer.pkl` (the text-to-numbers converter)
❌ **Result**: Model expects specific features, gets random ones → Doesn't work!
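
To make the mismatch concrete, here is a minimal sketch using a toy bag-of-words encoder in plain Python. It is not the real vectorizer: `fit_vocabulary`, `to_counts`, and both corpora are hypothetical stand-ins. It only illustrates that a vocabulary fitted on different text assigns different columns, and a different number of columns, to the same prompt.

```python
def fit_vocabulary(corpus):
    """Map each distinct word to a column index (toy stand-in for a vectorizer)."""
    words = sorted({w for doc in corpus for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def to_counts(text, vocab):
    """Encode text as a word-count vector using the given vocabulary."""
    vec = [0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:  # words unseen at fit time are silently dropped
            vec[vocab[w]] += 1
    return vec


# Hypothetical corpora: the one the model was trained on, and a different one.
original_corpus = ["ignore previous instructions", "what is the weather"]
other_corpus = ["bypass the filter", "ignore the warning please"]

original_vocab = fit_vocabulary(original_corpus)
refit_vocab = fit_vocabulary(other_corpus)

prompt = "ignore the filter"
vec_original = to_counts(prompt, original_vocab)  # what the model was trained to read
vec_refit = to_counts(prompt, refit_vocab)        # what a re-fitted encoder produces

print(len(vec_original), len(vec_refit))
print(original_vocab["ignore"], refit_vocab["ignore"])
```

The two vectors differ in length, and the word "ignore" lands in a different column in each, so a model trained on one encoding cannot interpret the other.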

### Solution in v0.8:

We'll implement **3 approaches** so you can choose the best one:

1. ✅ **Retrain Model** (RECOMMENDED) - Train new model + vectorizer together
2. ✅ **LIME Text Explainer** - Model-agnostic, works on raw text
3. ✅ **K-Gram Statistical Analysis** - No model needed, just patterns

### Author: Project Vigil Team
### Version: 0.8 (Complete Solution)
### Date: 2025-11-16

## Part 1: Install Libraries

In [None]:
# Install required packages
!pip install -q scikit-learn pandas numpy matplotlib seaborn tqdm xgboost lime

import os
import pickle
import numpy as np
import pandas as pd
import json
import warnings
warnings.filterwarnings('ignore')

import urllib.request
import ssl

from tqdm.auto import tqdm

# Scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# LIME for explanations
from lime.lime_text import LimeTextExplainer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 8)

print("✓ All libraries imported successfully")

## Part 2: Configuration

In [None]:
# URLs
GITHUB_REPO = "https://raw.githubusercontent.com/Meet2304/Project-Vigil/claude/fix-kgram-dataset-01VTpiw6P21u1bbgrvx2rVb2"
DATASET_URL = f"{GITHUB_REPO}/Dataset/MPDD.csv"

# Config
CONFIG = {
    'sample_size': 10000,      # Use 10K for good coverage
    'test_size': 0.2,          # 20% for testing
    'random_state': 42,
    'ngram_range': (1, 3),     # 1-3 word n-grams
    'max_features': 5000,      # Top 5000 features
    'min_df': 2,               # Must appear in at least 2 docs
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## Part 3: Load Dataset

In [None]:
# Download
def download_file(url, path):
    """Download url to path; return True on success, False on failure."""
    try:
        print(f"Downloading {url}...")
        # The previous version built an SSL context with certificate checks
        # disabled but never passed it to urlretrieve; it has been removed so
        # the download keeps the default certificate verification.
        urllib.request.urlretrieve(url, path)
        print("✓ Downloaded")
        return True
    except Exception as e:
        print(f"✗ Error: {e}")
        return False

if not os.path.exists("MPDD.csv"):
    download_file(DATASET_URL, "MPDD.csv")

# Load
df = pd.read_csv("MPDD.csv")
print(f"\n✓ Loaded {len(df):,} samples")

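# Sample. DataFrame.sample() has no `stratify` argument, so draw the
# stratified subset with train_test_split (imported in Part 1).
if CONFIG['sample_size'] < len(df):
    df, _ = train_test_split(df, train_size=CONFIG['sample_size'],
                             random_state=CONFIG['random_state'],
                             stratify=df['isMalicious'])
    df = df.reset_index(drop=True)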

texts = df['Prompt'].astype(str).tolist()
labels = df['isMalicious'].astype(int).values

print(f"\n📊 Dataset:")
print(f"  Total: {len(texts):,}")
print(f"  Malicious: {sum(labels):,} ({sum(labels)/len(labels)*100:.1f}%)")
print(f"  Benign: {len(labels)-sum(labels):,}")

## APPROACH 1: Retrain Model with Matching Vectorizer (RECOMMENDED)

### Why This is Best:
- ✅ Model and vectorizer are guaranteed to match
- ✅ You can save both for future use
- ✅ Full control over features
- ✅ Can use SHAP properly

### Steps:
1. Create vectorizer
2. Transform text to features
3. Train new model
4. Save BOTH model and vectorizer
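5. Analyze feature importance (this notebook uses XGBoost's built-in scores; SHAP can be run on the same model and vectorizer pair)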
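
Step 4 is the one that went missing in earlier versions. One way to make it harder to lose half of the pair is to write both objects into a single file. The sketch below is a possible variation, not what this notebook does (the cell below saves two separate files). `save_bundle`, `load_bundle`, and `vigil_bundle.pkl` are hypothetical names, and plain dicts stand in for the trained model and fitted vectorizer.

```python
import os
import pickle
import tempfile


def save_bundle(model, vectorizer, path):
    """Write the model and its matching vectorizer into one pickle file."""
    with open(path, "wb") as f:
        pickle.dump({"model": model, "vectorizer": vectorizer}, f)


def load_bundle(path):
    """Return (model, vectorizer) from a file written by save_bundle."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["model"], bundle["vectorizer"]


# Stand-in objects so the sketch runs without training anything.
demo_model = {"kind": "stand-in model"}
demo_vectorizer = {"kind": "stand-in vectorizer", "vocabulary": {"ignore": 0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "vigil_bundle.pkl")  # hypothetical file name
    save_bundle(demo_model, demo_vectorizer, bundle_path)
    loaded_model, loaded_vectorizer = load_bundle(bundle_path)

print(loaded_model["kind"], "/", loaded_vectorizer["kind"])
```

As with any pickle, only load bundle files that come from a source you trust.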

In [None]:
print("="*80)
print("APPROACH 1: RETRAIN MODEL WITH MATCHING VECTORIZER")
print("="*80)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, 
    test_size=CONFIG['test_size'],
    random_state=CONFIG['random_state'],
    stratify=labels
)

print(f"\nTrain: {len(X_train):,}, Test: {len(X_test):,}")

# Create vectorizer
print("\nCreating vectorizer...")
vectorizer = TfidfVectorizer(
    ngram_range=CONFIG['ngram_range'],
    max_features=CONFIG['max_features'],
    min_df=CONFIG['min_df'],
    lowercase=True,
    strip_accents='unicode'
)

# Fit on training data
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

feature_names = vectorizer.get_feature_names_out()
print(f"✓ Vectorizer created: {len(feature_names):,} features")

# Train model
print("\nTraining XGBoost model...")
model = xgb.XGBClassifier(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    random_state=CONFIG['random_state'],
    eval_metric='logloss'
)

model.fit(X_train_vec, y_train)
print("✓ Model trained")

# Evaluate
y_pred_train = model.predict(X_train_vec)
y_pred_test = model.predict(X_test_vec)

train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print(f"\n📊 Performance:")
print(f"  Training Accuracy: {train_acc:.4f} ({train_acc*100:.2f}%)")
print(f"  Test Accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)")

print(f"\n📋 Classification Report (Test Set):")
print(classification_report(y_test, y_pred_test, target_names=['Benign', 'Malicious']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
print(f"Confusion Matrix:")
print(f"  TN={cm[0,0]:,}, FP={cm[0,1]:,}")
print(f"  FN={cm[1,0]:,}, TP={cm[1,1]:,}")

# Save both model and vectorizer
print(f"\n💾 Saving model and vectorizer...")
with open('new_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('new_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
print("✓ Saved: new_classifier.pkl, new_vectorizer.pkl")
print("  ⚠️  IMPORTANT: Always use BOTH together!")

## APPROACH 1 Analysis: Feature Importance from New Model

In [None]:
# Get feature importance from XGBoost
print("Analyzing feature importance...\n")

# XGBoost feature importance
importance_scores = model.feature_importances_

feature_imp = pd.DataFrame({
    'feature': feature_names,
    'importance': importance_scores
}).sort_values('importance', ascending=False)

print("="*80)
print("TOP 30 MOST IMPORTANT FEATURES (XGBoost Feature Importance)")
print("="*80)
print(f"\nRank | Feature | Importance")
print("-"*80)

for i, (_, row) in enumerate(feature_imp.head(30).iterrows(), 1):
    print(f"{i:3d}. | {row['feature']:<50} | {row['importance']:.6f}")

print("="*80)

# Visualize
plt.figure(figsize=(12, 8))
top_20 = feature_imp.head(20)
plt.barh(range(len(top_20)), top_20['importance'], color='steelblue')
plt.yticks(range(len(top_20)), top_20['feature'])
plt.xlabel('Feature Importance', fontsize=12, fontweight='bold')
plt.title('Top 20 Features by XGBoost Importance', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## APPROACH 2: LIME Text Explainer (Model-Agnostic)

### Why LIME?
- ✅ Explains raw text: it perturbs the prompt itself, with the vectorizer wrapped inside the prediction function
- ✅ Model-agnostic (works with ANY classifier)
- ✅ Explains individual predictions
- ✅ Shows which WORDS matter in each prompt

### How it works:
1. Takes a text prompt
2. Creates perturbations (removes words)
3. Sees how predictions change
4. Identifies important words
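
The perturbation idea in these steps can be sketched without LIME itself. The toy below is a deliberate simplification: it removes one word at a time and measures how far the score drops, whereas LIME samples many random word subsets and fits a weighted linear model to them. `toy_malicious_score` and the `SUSPICIOUS` word list are hypothetical stand-ins for the real prediction function.

```python
SUSPICIOUS = {"ignore", "bypass", "override"}  # hypothetical word list


def toy_malicious_score(text):
    """Toy stand-in for a classifier: 0.4 per suspicious word, capped at 1.0."""
    hits = sum(1 for w in text.lower().split() if w in SUSPICIOUS)
    return min(1.0, 0.4 * hits)


def leave_one_out_importance(text, score_fn):
    """Score drop caused by deleting each word in turn."""
    words = text.split()
    base = score_fn(text)
    importances = []
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])  # text without word i
        importances.append((word, base - score_fn(perturbed)))
    return importances


weights = leave_one_out_importance(
    "please ignore all previous rules", toy_malicious_score
)
for word, weight in weights:
    print(f"{word:10} {weight:+.2f}")
```

Only the word whose removal changes the score receives a non-zero weight, which is the same signal LIME reports, in a more robust form, for the real model.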

In [None]:
print("="*80)
print("APPROACH 2: LIME TEXT EXPLAINER")
print("="*80)

# Create LIME explainer
print("\nCreating LIME explainer...")

# Prediction function for LIME
def predict_proba_lime(texts_list):
    """Predict probabilities for LIME."""
    X_vec = vectorizer.transform(texts_list)
    return model.predict_proba(X_vec)

explainer = LimeTextExplainer(class_names=['Benign', 'Malicious'])
print("✓ LIME explainer created")

# Explain some examples
print("\nExplaining sample predictions...\n")

# Find interesting examples
test_indices = np.arange(len(X_test))
mal_indices = test_indices[y_test == 1]
ben_indices = test_indices[y_test == 0]

# Correctly classified malicious
correct_mal = mal_indices[y_pred_test[mal_indices] == 1]
if len(correct_mal) > 0:
    idx = correct_mal[0]
    
    print("="*80)
    print("EXAMPLE 1: Malicious Prompt")
    print("="*80)
    print(f"\nPrompt: {X_test[idx]}")
    print(f"True: MALICIOUS, Predicted: MALICIOUS")
    
    # Get LIME explanation
    exp = explainer.explain_instance(
        X_test[idx],
        predict_proba_lime,
        num_features=10,
        num_samples=1000
    )
    
    print(f"\n🔥 Words Contributing to MALICIOUS:")
    for word, weight in exp.as_list():
        if weight > 0:
            print(f"  '{word:20}' → +{weight:.4f}")
    
    print(f"\n🔵 Words Contributing to BENIGN:")
    for word, weight in exp.as_list():
        if weight < 0:
            print(f"  '{word:20}' → {weight:.4f}")
    
    # Visualize
    fig = exp.as_pyplot_figure()
    plt.title('LIME Explanation: Malicious Prompt', fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Correctly classified benign
correct_ben = ben_indices[y_pred_test[ben_indices] == 0]
if len(correct_ben) > 0:
    idx = correct_ben[0]
    
    print("\n" + "="*80)
    print("EXAMPLE 2: Benign Prompt")
    print("="*80)
    print(f"\nPrompt: {X_test[idx]}")
    print(f"True: BENIGN, Predicted: BENIGN")
    
    exp = explainer.explain_instance(
        X_test[idx],
        predict_proba_lime,
        num_features=10,
        num_samples=1000
    )
    
    print(f"\n🔵 Words Contributing to BENIGN:")
    for word, weight in exp.as_list():
        if weight < 0:
            print(f"  '{word:20}' → {weight:.4f}")
    
    print(f"\n🔥 Words Contributing to MALICIOUS:")
    for word, weight in exp.as_list():
        if weight > 0:
            print(f"  '{word:20}' → +{weight:.4f}")
    
    fig = exp.as_pyplot_figure()
    plt.title('LIME Explanation: Benign Prompt', fontsize=12, fontweight='bold')
    plt.tight_layout()
    plt.show()

## APPROACH 3: Statistical K-Gram Analysis (No Model Needed)

### Why This Approach?
- ✅ No model required!
- ✅ Pure statistical analysis
- ✅ Shows which phrases appear in malicious vs benign
- ✅ Fast and simple

### What it does:
- Extracts k-grams from all prompts
- Calculates how often each appears in malicious vs benign
- Ranks by discriminative power
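
The scoring can be worked through by hand on a tiny corpus. The sketch below uses plain Python instead of `CountVectorizer`, with hypothetical helper names (`word_kgrams`, `class_rates`, `smoothed_ratio`) and made-up prompts. It uses the same smoothing constant as the cell below, but unlike that cell it also scores k-grams that appear in only one class.

```python
from collections import Counter


def word_kgrams(text, k_min=1, k_max=3):
    """All word-level k-grams of length k_min..k_max in a lowercased text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + k])
        for k in range(k_min, k_max + 1)
        for i in range(len(tokens) - k + 1)
    ]


def class_rates(docs):
    """Occurrences of each k-gram per document in one class."""
    counts = Counter(g for doc in docs for g in word_kgrams(doc))
    return {g: c / len(docs) for g, c in counts.items()}


def smoothed_ratio(mal_rate, ben_rate, eps=0.001):
    """Malicious-to-benign rate ratio; eps avoids division by zero."""
    return (mal_rate + eps) / (ben_rate + eps)


# Made-up prompts for illustration only.
mal_docs = ["ignore previous instructions", "ignore previous rules now"]
ben_docs = ["summarize the previous chapter", "translate this sentence"]

mal_rates = class_rates(mal_docs)
ben_rates = class_rates(ben_docs)

ratios = {
    g: smoothed_ratio(mal_rates.get(g, 0.0), ben_rates.get(g, 0.0))
    for g in set(mal_rates) | set(ben_rates)
}

for gram in ("ignore previous", "previous", "translate"):
    print(f"{gram:<18} {ratios[gram]:.3f}")
```

A ratio well above 1 marks a k-gram as characteristic of malicious prompts, a ratio near 1 as uninformative, and a ratio well below 1 as characteristic of benign prompts.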

In [None]:
print("="*80)
print("APPROACH 3: STATISTICAL K-GRAM ANALYSIS (NO MODEL)")
print("="*80)

# Separate by class
mal_texts = [texts[i] for i in range(len(texts)) if labels[i] == 1]
ben_texts = [texts[i] for i in range(len(texts)) if labels[i] == 0]

print(f"\nMalicious: {len(mal_texts):,}")
print(f"Benign: {len(ben_texts):,}")

# Extract k-grams from each class
print("\nExtracting k-grams...")

vec_mal = CountVectorizer(ngram_range=(1, 3), min_df=5)
vec_ben = CountVectorizer(ngram_range=(1, 3), min_df=5)

X_mal = vec_mal.fit_transform(mal_texts)
X_ben = vec_ben.fit_transform(ben_texts)

# Calculate frequencies
mal_freq = X_mal.sum(axis=0).A1
ben_freq = X_ben.sum(axis=0).A1

# Get feature names
mal_features = vec_mal.get_feature_names_out()
ben_features = vec_ben.get_feature_names_out()

# Find common features
# Note: k-grams that pass min_df in only one class are excluded here.
common_features = set(mal_features) & set(ben_features)
print(f"✓ Found {len(common_features):,} k-grams in both classes")

# Calculate discriminative scores
analysis = []
for feature in sorted(common_features):
    # vocabulary_ maps each k-gram to its column index directly,
    # avoiding a linear list search on every iteration.
    mal_idx = vec_mal.vocabulary_[feature]
    ben_idx = vec_ben.vocabulary_[feature]
    
    mal_count = mal_freq[mal_idx]
    ben_count = ben_freq[ben_idx]
    
    # Normalize by class size
    mal_rate = mal_count / len(mal_texts)
    ben_rate = ben_count / len(ben_texts)
    
    # Ratio (with smoothing)
    ratio = (mal_rate + 0.001) / (ben_rate + 0.001)
    
    analysis.append({
        'feature': feature,
        'mal_count': int(mal_count),
        'ben_count': int(ben_count),
        'mal_rate': mal_rate,
        'ben_rate': ben_rate,
        'ratio': ratio
    })

analysis_df = pd.DataFrame(analysis)

# Most malicious (high ratio)
most_mal = analysis_df.sort_values('ratio', ascending=False)

print("\n" + "="*80)
print("TOP 30 MOST MALICIOUS K-GRAMS (Statistical Analysis)")
print("="*80)
print(f"\nRank | K-Gram | Mal Count | Ben Count | Ratio")
print("-"*80)

for i, (_, row) in enumerate(most_mal.head(30).iterrows(), 1):
    print(f"{i:3d}. | {row['feature']:<40} | {row['mal_count']:6d} | "
          f"{row['ben_count']:6d} | {row['ratio']:6.2f}x")

# Most benign (low ratio)
most_ben = analysis_df.sort_values('ratio', ascending=True)

print("\n" + "="*80)
print("TOP 30 MOST BENIGN K-GRAMS (Statistical Analysis)")
print("="*80)
print(f"\nRank | K-Gram | Mal Count | Ben Count | Ratio")
print("-"*80)

for i, (_, row) in enumerate(most_ben.head(30).iterrows(), 1):
    print(f"{i:3d}. | {row['feature']:<40} | {row['mal_count']:6d} | "
          f"{row['ben_count']:6d} | {row['ratio']:6.2f}x")

print("="*80)

## Summary & Recommendations

In [None]:
print("="*80)
print("SUMMARY: THREE APPROACHES COMPARED")
print("="*80)

print("\n1️⃣  APPROACH 1: Retrain Model + Vectorizer")
print("   ✅ PROS: Perfect feature matching, can use SHAP, full control")
print("   ❌ CONS: Need to retrain model")
print("   📊 Results: Model trained with {:.2f}% test accuracy".format(test_acc*100))
print("   🎯 USE WHEN: You want a production-ready solution")

print("\n2️⃣  APPROACH 2: LIME Text Explainer")
print("   ✅ PROS: Works on raw text, model-agnostic, per-sample explanations")
print("   ❌ CONS: Slower, approximate explanations")
print("   📊 Results: Can explain any individual prediction")
print("   🎯 USE WHEN: You want to explain specific predictions to users")

print("\n3️⃣  APPROACH 3: Statistical K-Gram Analysis")
print("   ✅ PROS: No model needed, fast, simple, interpretable")
print("   ❌ CONS: Doesn't show how MODEL uses features")
print("   📊 Results: Pure statistical patterns in data")
print("   🎯 USE WHEN: You want to understand data patterns only")

print("\n" + "="*80)
print("RECOMMENDATION")
print("="*80)
print("\n✅ Use APPROACH 1 (Retrain) as your main solution")
print("   - Save both new_classifier.pkl and new_vectorizer.pkl")
print("   - Use them together for all future predictions")
print("   - Now you can use SHAP properly!")
print("\n✅ Use APPROACH 2 (LIME) for explaining predictions to users")
print("   - Shows which words triggered the detection")
print("   - Good for transparency and debugging")
print("\n✅ Use APPROACH 3 (Statistical) for quick data exploration")
print("   - Find common malicious phrases")
print("   - Create security filter rules")

print("\n" + "="*80)

## Export Results

In [None]:
# Export feature importance
feature_imp.to_csv('approach1_feature_importance.csv', index=False)
print("✓ Saved: approach1_feature_importance.csv")

# Export statistical analysis
most_mal.head(100).to_csv('approach3_malicious_kgrams.csv', index=False)
most_ben.head(100).to_csv('approach3_benign_kgrams.csv', index=False)
print("✓ Saved: approach3_malicious_kgrams.csv")
print("✓ Saved: approach3_benign_kgrams.csv")

# Summary
summary = {
    'approach1': {
        'train_accuracy': float(train_acc),
        'test_accuracy': float(test_acc),
        'features': len(feature_names),
        'files_saved': ['new_classifier.pkl', 'new_vectorizer.pkl']
    },
    'approach3': {
        'common_features': len(common_features),
        'top_malicious_features': [
            {'rank': i+1, 'feature': row['feature'], 'ratio': float(row['ratio'])}
            for i, (_, row) in enumerate(most_mal.head(10).iterrows())
        ]
    },
    'recommendation': 'Use Approach 1 (retrained model) with Approach 2 (LIME) for explanations'
}

with open('complete_analysis_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)
print("✓ Saved: complete_analysis_summary.json")