# NLP Multiclass Fraud Detection Baseline Model

This notebook implements a comprehensive baseline model for **multiclass fraud and scam detection** using Natural Language Processing techniques. We'll build and compare multiple approaches from traditional machine learning to modern transformer-based models.

## 🎯 Objectives
1. Build traditional ML baselines for **multiclass classification** (TF-IDF + Logistic Regression, SVM)
2. Implement BERT-based multiclass classification
3. Evaluate and compare model performance across the **9 classes** listed below (8 scam types + legitimate)
4. Provide a foundation for more advanced fraud detection systems

## 📊 Dataset
We'll work with a comprehensive fraud dataset containing:
- **legitimate**: Normal, non-fraudulent messages  
- **phishing**: Email/message phishing attempts
- **popup_scam**: Fake popup advertisements and scams
- **sms_spam**: SMS spam messages
- **reward_scam**: Fake reward and prize scams
- **tech_support_scam**: Fake technical support scams
- **refund_scam**: Fake refund scams
- **ssn_scam**: Social Security Number scams
- **job_scam**: Fake job opportunity scams

## ✨ Multiclass Benefits
- **Granular Detection**: Identify specific types of fraud
- **Better Insights**: Understand fraud patterns by category
- **Targeted Defense**: Apply appropriate countermeasures per scam type

---

## 1. Import Required Libraries

Let's start by importing all the necessary libraries for our fraud detection system.

In [None]:
# Core data processing libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Text processing and NLP
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, 
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

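# Download required NLTK data (each resource is checked on its own,
# so a missing package is fetched even if the others are already present)
nltk_resources = {
    'punkt': 'tokenizers/punkt',
    'punkt_tab': 'tokenizers/punkt_tab',  # used by word_tokenize in newer NLTK releases
    'stopwords': 'corpora/stopwords',
    'wordnet': 'corpora/wordnet',
    'omw-1.4': 'corpora/omw-1.4',
}
for resource, path in nltk_resources.items():
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)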

import sklearn

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"🤖 Scikit-learn version: {sklearn.__version__}")

## 2. Load and Explore Dataset

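We'll load the dataset from `final_fraud_detection_dataset.csv`, using its `text` and `detailed_category` columns as the message and the label. If the file cannot be read, the code falls back to a small hand-written sample set (10 messages per class) so the rest of the notebook can still run.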

In [None]:
def create_fraud_dataset():
    """
    Load multiclass fraud/scam data from CSV dataset
    """
    try:
        # Load the full dataset for multiclass classification
        df = pd.read_csv('final_fraud_detection_dataset.csv')
        print(f"✅ Loaded dataset with {len(df)} samples")
        
        # Use detailed_category for multiclass classification
        df = df[['text', 'detailed_category']].copy()
        df.columns = ['message', 'label']
        
        print(f"\n📊 Dataset Overview:")
        print(f"Total samples: {len(df)}")
        print(f"Number of classes: {df['label'].nunique()}")
        print(f"\n🏷️ Class distribution:")
        class_counts = df['label'].value_counts()
        for label, count in class_counts.items():
            percentage = (count / len(df)) * 100
            print(f"  {label}: {count:,} ({percentage:.1f}%)")
        
        return df
        
    except FileNotFoundError:
        print("❌ Dataset file 'final_fraud_detection_dataset.csv' not found!")
        print("📝 Creating sample multiclass data for demonstration...")
        return create_sample_multiclass_data()
    except Exception as e:
        print(f"❌ Error loading dataset: {e}")
        print("📝 Creating sample multiclass data for demonstration...")
        return create_sample_multiclass_data()

def create_sample_multiclass_data():
    """
    Create sample multiclass fraud data for demonstration
    """
    # Sample data for each class
    data = {
        'legitimate': [
            "Thank you for your purchase. Your order will ship tomorrow.",
            "Meeting scheduled for 3 PM in conference room A.",
            "Happy birthday! Hope you have a wonderful day.",
            "The weather forecast shows rain this weekend.",
            "Please review the attached quarterly report.",
            "Lunch meeting confirmed for tomorrow at noon.",
            "Your subscription renewal is due next month.",
            "Project deadline moved to next Friday.",
            "Thanks for attending today's presentation.",
            "Weekend plans include hiking and relaxation."
        ],
        'phishing': [
            "URGENT: Verify your bank account immediately to avoid suspension.",
            "Your PayPal account has been limited. Click here to restore access.",
            "Security alert: Suspicious login detected. Confirm your identity.",
            "Your Amazon account requires immediate verification.",
            "Banking security notice: Update your credentials now.",
            "IRS tax refund pending: Provide your SSN to process.",
            "Credit card company: Verify transaction or account will be closed.",
            "Your email will be deleted unless you confirm your password.",
            "Microsoft security: Your account was accessed from unknown device.",
            "Government notification: Social Security benefits suspended."
        ],
        'popup_scam': [
            "You've won $1,000,000! Click here to claim your prize!",
            "Congratulations! You're the 1,000,000th visitor!",
            "Your computer is infected! Download our antivirus now!",
            "Free iPhone! Complete this survey to claim yours!",
            "You've won a free vacation to Hawaii!",
            "Your browser is out of date! Update now for security!",
            "Free gift card worth $500! Claim now!",
            "You've been selected for a special offer!",
            "Warning: Your computer performance is critically low!",
            "Free casino chips! Play now and win big!"
        ],
        'sms_spam': [
            "Free msg: Txt STOP to cancel. Win cash prizes by texting WIN to 12345!",
            "URGENT: Your loan application approved. Call now!",
            "Free ringtones! Reply YES to get started!",
            "Your mobile won a car! Claim now!",
            "Get rich quick! Work from home opportunity!",
            "Free credit check! Text INFO to receive details!",
            "Limited time offer: Free trial, then £5/week!",
            "Congratulations! You've won a shopping voucher!",
            "Cash advance available! No credit check needed!",
            "Free dating service! Meet singles in your area!"
        ],
        'reward_scam': [
            "Congratulations! You've won a $5000 gift card!",
            "You've been selected for a luxury cruise vacation!",
            "Free cash reward! Claim your $1000 now!",
            "Winner notification: You've won an iPad!",
            "Exclusive reward: Free shopping spree worth $2000!",
            "You've won a year's supply of groceries!",
            "Congratulations! Free car giveaway winner!",
            "You've been chosen for a cash prize!",
            "Special reward: Free vacation package!",
            "Winner alert: Claim your prize money now!"
        ],
        'tech_support_scam': [
            "Microsoft support: Your computer has been infected with malware.",
            "Windows security alert: Your PC is at risk!",
            "Tech support: Your computer is sending error reports.",
            "Apple support: Your device has security issues.",
            "Computer warning: Virus detected on your system!",
            "Technical alert: Your computer performance is compromised.",
            "Microsoft: Your Windows license has expired.",
            "Tech support: Call immediately to fix computer problems.",
            "Security warning: Your computer is infected!",
            "System alert: Computer protection has expired."
        ],
        'refund_scam': [
            "Tax refund notification: $2847 refund pending.",
            "IRS refund alert: Claim your tax refund now!",
            "Government refund: You're eligible for $1500 refund.",
            "Tax office: Refund of $3200 requires verification.",
            "HMRC refund: You have an unclaimed tax refund.",
            "Refund processing: Confirm details to receive money.",
            "Government payment: Refund check is ready.",
            "Tax refund urgent: Action required to process refund.",
            "Refund notification: Update bank details to receive payment.",
            "Official refund: You're entitled to a refund."
        ],
        'ssn_scam': [
            "Social Security Administration: Your SSN has been suspended.",
            "SSN alert: Suspicious activity detected on your number.",
            "Your Social Security number will be blocked immediately.",
            "SSN security notice: Verify your number to avoid suspension.",
            "Social Security fraud alert: Your number is compromised.",
            "SSN suspension notice: Call immediately to reactivate.",
            "Your Social Security benefits are suspended.",
            "SSN alert: Update your information to avoid penalties.",
            "Social Security: Your number linked to illegal activity.",
            "SSN warning: Immediate action required to avoid legal issues."
        ],
        'job_scam': [
            "Work from home opportunity! Earn $5000 per week!",
            "Easy job: Earn $200 per day stuffing envelopes!",
            "Home-based business opportunity! No experience needed!",
            "Data entry job: Earn $50 per hour working from home!",
            "Online job: Make $3000 per month in your spare time!",
            "Mystery shopper needed! Earn money while shopping!",
            "Work from home: Guaranteed income with minimal effort!",
            "Part-time job: Earn $100 per day online!",
            "Employment opportunity: High pay for simple tasks!",
            "Job offer: Make money fast with our proven system!"
        ]
    }
    
    # Create DataFrame
    messages = []
    labels = []
    
    for label, texts in data.items():
        messages.extend(texts)
        labels.extend([label] * len(texts))
    
    df = pd.DataFrame({
        'message': messages,
        'label': labels
    })
    
    # Shuffle the data
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    print(f"✅ Created sample dataset with {len(df)} samples")
    print(f"Number of classes: {df['label'].nunique()}")
    print(f"\n🏷️ Class distribution:")
    for label, count in df['label'].value_counts().items():
        print(f"  {label}: {count}")
    
    return df

# Load the dataset
print("🔄 Loading multiclass fraud detection dataset...")
df = create_fraud_dataset()

# Display sample data
print(f"\n📋 Sample data:")
print(df.head(10))

In [None]:
# Create visualizations for multiclass data exploration
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Class distribution - using bar plot for multiclass
label_counts = df['label'].value_counts()
colors = plt.cm.Set3(np.linspace(0, 1, len(label_counts)))

bars = axes[0, 0].bar(range(len(label_counts)), label_counts.values, color=colors)
axes[0, 0].set_title('Distribution of Fraud Classes', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Fraud Classes')
axes[0, 0].set_ylabel('Number of Messages')
axes[0, 0].set_xticks(range(len(label_counts)))
axes[0, 0].set_xticklabels(label_counts.index, rotation=45, ha='right')

# Add value labels on bars
for bar, count in zip(bars, label_counts.values):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 10,
                    f'{count:,}', ha='center', va='bottom', fontsize=10)

# 2. Message length distribution by class
df['message_length'] = df['message'].str.len()

# Create box plot for all classes
class_data = [df[df['label'] == label]['message_length'] for label in label_counts.index]
box_plot = axes[0, 1].boxplot(class_data, labels=label_counts.index, patch_artist=True)

# Color the boxes
for patch, color in zip(box_plot['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[0, 1].set_title('Message Length Distribution by Class', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Fraud Classes')
axes[0, 1].set_ylabel('Message Length (characters)')
axes[0, 1].tick_params(axis='x', rotation=45)

# 3. Word count distribution
df['word_count'] = df['message'].str.split().str.len()

# Average metrics by class
stats_data = df.groupby('label').agg({
    'message_length': 'mean',
    'word_count': 'mean'
}).round(2)

# Plot average message length
x_pos = range(len(stats_data))
bars_length = axes[1, 0].bar(x_pos, stats_data['message_length'], color=colors)
axes[1, 0].set_title('Average Message Length by Class', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Fraud Classes')
axes[1, 0].set_ylabel('Average Message Length')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(stats_data.index, rotation=45, ha='right')

# Add value labels
for bar, value in zip(bars_length, stats_data['message_length']):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 2,
                    f'{value:.1f}', ha='center', va='bottom', fontsize=9)

# 4. Average word count by class
bars_words = axes[1, 1].bar(x_pos, stats_data['word_count'], color=colors)
axes[1, 1].set_title('Average Word Count by Class', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Fraud Classes')
axes[1, 1].set_ylabel('Average Word Count')
axes[1, 1].set_xticks(x_pos)
axes[1, 1].set_xticklabels(stats_data.index, rotation=45, ha='right')

# Add value labels
for bar, value in zip(bars_words, stats_data['word_count']):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.5,
                    f'{value:.1f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

# Print comprehensive statistics for multiclass
print("\n📊 MULTICLASS DATASET STATISTICS")
print("="*60)
print(f"Total samples: {len(df):,}")
print(f"Number of classes: {df['label'].nunique()}")
print(f"\n🏷️ Class distribution:")
for label, count in label_counts.items():
    percentage = (count / len(df)) * 100
    print(f"  {label}: {count:,} samples ({percentage:.1f}%)")

print(f"\n📏 Message characteristics by class:")
class_stats = df.groupby('label').agg({
    'message_length': ['mean', 'std', 'min', 'max'],
    'word_count': ['mean', 'std', 'min', 'max']
}).round(2)

for label in df['label'].unique():
    print(f"\n  {label}:")
    print(f"    Avg length: {class_stats.loc[label, ('message_length', 'mean')]:.1f} chars")
    print(f"    Avg words: {class_stats.loc[label, ('word_count', 'mean')]:.1f}")
    print(f"    Length range: {class_stats.loc[label, ('message_length', 'min')]:.0f}-{class_stats.loc[label, ('message_length', 'max')]:.0f} chars")

print(f"\n📊 Overall statistics:")
print(f"  Total characters: {df['message_length'].sum():,}")
print(f"  Average message length: {df['message_length'].mean():.1f} characters")
print(f"  Average word count: {df['word_count'].mean():.1f} words")
print(f"  Length range: {df['message_length'].min()}-{df['message_length'].max()} characters")

## 3. Data Preprocessing and Text Cleaning

Now we'll clean and preprocess the text data to prepare it for machine learning models. This includes removing noise, normalizing text, and creating features that our models can understand.
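
To make the cleaning steps concrete, here is a minimal standalone sketch of what they do to a single message. It uses only the standard library. The function name `simple_clean` and the tiny stop-word set are assumptions made for this illustration; the notebook's own pipeline uses NLTK stop words and lemmatization.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch;
# the notebook itself relies on NLTK's English stop words).
STOP_WORDS = {"your", "to", "the", "a", "at", "is", "and"}

def simple_clean(text):
    """Lowercase, strip URLs/emails/non-letters, drop stop words and short tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # remove email addresses
    text = re.sub(r"[^a-z\s]", " ", text)                # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) > 2]
    return " ".join(tokens)

example = "URGENT: Verify your account at http://bad.example/login or email help@bad.example!"
print(simple_clean(example))  # urgent verify account email
```

Note that lowercasing and punctuation removal discard signals such as capital letters and exclamation marks, which is why those are captured separately as numeric features from the raw message.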

In [None]:
class TextPreprocessor:
    """
    Comprehensive text preprocessing pipeline for fraud detection
    """
    
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
        self.stop_words = set(stopwords.words('english'))
        
        # Add domain-specific words to stop words if needed
        # self.stop_words.update(['would', 'could', 'should'])
    
    def clean_text(self, text):
        """
        Clean and normalize text
        """
        # Convert to lowercase
        text = text.lower()
        
        # Remove URLs
        text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
        text = re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
        
        # Remove email addresses
        text = re.sub(r'\S+@\S+', '', text)
        
        # Remove phone numbers
        text = re.sub(r'[\+]?[1-9]?[0-9]{7,15}', '', text)
        
        # Remove special characters and digits
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def tokenize_and_lemmatize(self, text):
        """
        Tokenize text and apply lemmatization
        """
        # Tokenize
        tokens = word_tokenize(text)
        
        # Remove stopwords and short words
        tokens = [token for token in tokens if token not in self.stop_words and len(token) > 2]
        
        # Lemmatize
        tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
        
        return ' '.join(tokens)
    
    def preprocess(self, text):
        """
        Complete preprocessing pipeline
        """
        # Clean text
        cleaned = self.clean_text(text)
        
        # Tokenize and lemmatize
        processed = self.tokenize_and_lemmatize(cleaned)
        
        return processed
    
    def extract_features(self, text):
        """
        Extract additional features from text
        """
        features = {}
        
        # Basic text features
        features['char_count'] = len(text)
        features['word_count'] = len(text.split())
        features['sentence_count'] = len(re.findall(r'[.!?]+', text))
        features['avg_word_length'] = np.mean([len(word) for word in text.split()]) if text.split() else 0
        
        # Uppercase features
        features['upper_case_count'] = sum(1 for c in text if c.isupper())
        features['upper_case_ratio'] = features['upper_case_count'] / len(text) if len(text) > 0 else 0
        
        # Punctuation features
        features['exclamation_count'] = text.count('!')
        features['question_count'] = text.count('?')
        features['dollar_count'] = text.count('$')
        
        # Fraud-specific features
        fraud_indicators = ['urgent', 'click', 'verify', 'winner', 'prize', 'money', 'free', 'offer']
        features['fraud_words'] = sum(1 for word in fraud_indicators if word in text.lower())
        
        return features

# Initialize preprocessor
preprocessor = TextPreprocessor()

# Apply preprocessing to the dataset
print("🔄 Preprocessing text data...")
df['cleaned_message'] = df['message'].apply(preprocessor.preprocess)

# Extract additional features
print("🔍 Extracting additional features...")
feature_data = df['message'].apply(preprocessor.extract_features)
feature_df = pd.DataFrame(list(feature_data), index=df.index)

# Combine with main dataframe. Columns that already exist (word_count was
# added during exploration) are dropped first to avoid duplicate column names.
df = pd.concat([df.drop(columns=feature_df.columns, errors='ignore'), feature_df], axis=1)

print("✅ Preprocessing complete!")

# Show preprocessing examples
print("\n📝 PREPROCESSING EXAMPLES")
print("="*60)

sample_indices = [0, 15, 30]  # Show examples from different categories
for i, idx in enumerate(sample_indices):
    print(f"\nExample {i+1} - Label: {df.iloc[idx]['label'].upper()}")
    print(f"Original: {df.iloc[idx]['message'][:80]}...")
    print(f"Cleaned:  {df.iloc[idx]['cleaned_message'][:80]}...")
    print("-" * 60)

# Show feature statistics
print("\n📊 EXTRACTED FEATURES STATISTICS")
print("="*50)
feature_cols = ['char_count', 'word_count', 'upper_case_ratio', 'fraud_words']
print(df.groupby('label')[feature_cols].mean().round(2))

## 4. Feature Engineering with TF-IDF

We'll convert the cleaned text into numerical features that machine learning algorithms can understand using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.
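
As a rough illustration of the weighting itself, the sketch below computes TF-IDF by hand for three toy messages. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is scikit-learn's default behavior; it leaves out the bigrams, `min_df`/`max_df` filtering, and sublinear TF scaling configured in this notebook. The helper name `tfidf` and the toy messages are made up for the example.

```python
import math
from collections import Counter

docs = [
    "claim free prize now",
    "free prize winner",
    "meeting moved to friday",
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term frequency times smoothed inverse document frequency
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

vectors = tfidf(docs)
for term, weight in sorted(vectors[1].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the second message, "winner" appears in only one document and so receives a higher weight than "free" and "prize", which are shared with the first message.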

In [None]:
# Configure TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,          # Limit vocabulary to top 5000 words
    ngram_range=(1, 2),         # Use unigrams and bigrams
    stop_words='english',       # Remove English stop words
    min_df=2,                   # Ignore terms that appear in less than 2 documents
    max_df=0.95,               # Ignore terms that appear in more than 95% of documents
    lowercase=True,             # Convert to lowercase
    sublinear_tf=True          # Apply sublinear tf scaling
)

print("🔄 Creating TF-IDF features...")

# Fit and transform the cleaned text
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_message'])

print(f"✅ TF-IDF vectorization complete!")
print(f"📊 Feature matrix shape: {X_tfidf.shape}")
print(f"📚 Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")

# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame for easier handling
X_tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)

print(f"\n🔍 Sample TF-IDF features:")
print(f"First 10 features: {list(feature_names[:10])}")

# Analyze most important features for each class
print("\n📊 TOP TF-IDF FEATURES BY CLASS")
print("="*50)

# Calculate mean TF-IDF scores for each class (rows: classes, columns: terms).
# The labels are multiclass, so we group by label rather than using a
# binary fraud/normal mask (which would match no rows in this dataset).
class_tfidf_mean = X_tfidf_df.groupby(df['label'].values).mean()
class_names = list(class_tfidf_mean.index)

# Get top features for every class
top_features = {}
for label in class_names:
    top_features[label] = class_tfidf_mean.loc[label].nlargest(10)
    print(f"\n🚨 Top 10 features for '{label}':")
    for feature, score in top_features[label].items():
        print(f"  {feature}: {score:.4f}")

# Visualize top features, one panel per class
n_cols = 3
n_rows = int(np.ceil(len(class_names) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows), squeeze=False)
for ax, label in zip(axes.flat, class_names):
    color = 'green' if label == 'legitimate' else 'red'
    top_features[label].sort_values().plot(kind='barh', ax=ax, color=color, alpha=0.7)
    ax.set_title(f'Top 10 Features: {label}', fontsize=12, fontweight='bold')
    ax.set_xlabel('Average TF-IDF Score')
for ax in axes.flat[len(class_names):]:
    ax.axis('off')  # hide unused panels

plt.tight_layout()
plt.show()

# Create word clouds for visual representation, one per class
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 5 * n_rows), squeeze=False)
for ax, label in zip(axes.flat, class_names):
    class_text = ' '.join(df.loc[df['label'] == label, 'cleaned_message'])
    if class_text.strip():  # WordCloud raises ValueError on empty text
        wordcloud = WordCloud(width=400, height=300, background_color='white').generate(class_text)
        ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(f'{label} Word Cloud', fontsize=12, fontweight='bold')
for ax in axes.flat[len(class_names):]:
    ax.axis('off')  # hide unused panels

plt.tight_layout()
plt.show()

# Combine TF-IDF features with additional engineered features
additional_features = ['char_count', 'word_count', 'upper_case_ratio', 'fraud_words', 
                      'exclamation_count', 'dollar_count']

# Normalize additional features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_additional = scaler.fit_transform(df[additional_features])

# Combine features
X_combined = np.hstack([X_tfidf.toarray(), X_additional])
print(f"\n🔗 Combined feature matrix shape: {X_combined.shape}")
print(f"   TF-IDF features: {X_tfidf.shape[1]}")
print(f"   Additional features: {len(additional_features)}")
print(f"   Total features: {X_combined.shape[1]}")

## 5. Train-Test Split

Now we'll split our data into training and testing sets, ensuring proper stratification to maintain class balance.
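
To show what stratification means in practice, here is a simplified standard-library sketch that splits each class separately so the train and test sets keep the same class proportions. The helper `stratified_split` and the toy labels are hypothetical and only illustrate the idea; they are not scikit-learn's actual implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Imbalanced toy labels: 80 legitimate messages, 20 phishing messages
labels = ["legitimate"] * 80 + ["phishing"] * 20
train_idx, test_idx = stratified_split(labels)

print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Both subsets keep the original 80/20 ratio (56 vs. 14 in training, 24 vs. 6 in testing), so a rare class cannot end up missing from the test set by chance.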

In [None]:
# Prepare target variable for MULTICLASS classification
from sklearn.preprocessing import LabelEncoder

print("🔄 Preparing multiclass labels...")

# Use LabelEncoder for multiclass labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['label'])

print(f"✅ Label encoding complete!")
print(f"Classes: {label_encoder.classes_}")
print(f"Number of classes: {len(label_encoder.classes_)}")

print("🔄 Splitting dataset...")

# Split the data with stratification for multiclass
X_train, X_test, y_train, y_test = train_test_split(
    X_combined, y, 
    test_size=0.3,          # 70% train, 30% test
    random_state=42,        # For reproducibility
    stratify=y             # Maintain class balance across all classes
)

print("✅ Dataset split complete!")

# Print split information
print(f"\n📊 MULTICLASS DATASET SPLIT SUMMARY")
print("="*50)
print(f"Total samples: {len(df)}")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(df)*100:.1f}%)")
print(f"Testing samples: {len(X_test)} ({len(X_test)/len(df)*100:.1f}%)")

print(f"\n🏷️ MULTICLASS LABEL DISTRIBUTION")
print("-"*40)
print("Training set:")
train_labels = pd.Series(y_train).map({i: label for i, label in enumerate(label_encoder.classes_)})
train_counts = train_labels.value_counts()
for label, count in train_counts.items():
    percentage = (count / len(y_train)) * 100
    print(f"  {label}: {count} samples ({percentage:.1f}%)")

print("\nTesting set:")
test_labels = pd.Series(y_test).map({i: label for i, label in enumerate(label_encoder.classes_)})
test_counts = test_labels.value_counts()
for label, count in test_counts.items():
    percentage = (count / len(y_test)) * 100
    print(f"  {label}: {count} samples ({percentage:.1f}%)")

## 6. Build Baseline Models

We'll implement and compare multiple baseline models to establish a strong foundation for fraud detection.

In [None]:
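# Initialize MULTICLASS baseline models
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

models = {
    # Explicit One-vs-Rest wrapper (the multi_class argument of
    # LogisticRegression is deprecated in recent scikit-learn releases)
    'Logistic Regression': OneVsRestClassifier(
        LogisticRegression(random_state=42, max_iter=2000)
    ),
    'SVM': SVC(
        random_state=42, 
        probability=True,
        decision_function_shape='ovr'  # One-vs-Rest shaped decision function
    ),
    # MultinomialNB rejects negative inputs, and the standardized extra
    # features can be negative, so rescale all features to [0, 1] first
    'Naive Bayes': make_pipeline(MinMaxScaler(clip=True), MultinomialNB()),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, 
        random_state=42,
        class_weight='balanced'  # Handle class imbalance
    )
}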

# Store results
results = {}

print("🤖 Training multiclass baseline models...")
print("="*60)

# Train and evaluate each model for multiclass
for name, model in models.items():
    print(f"\n🔄 Training {name} for multiclass classification...")
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)
    
    # Calculate multiclass metrics
    accuracy = accuracy_score(y_test, y_pred)
    
    # For multiclass, use macro and weighted averages
    precision_macro = precision_score(y_test, y_pred, average='macro', zero_division=0)
    recall_macro = recall_score(y_test, y_pred, average='macro', zero_division=0)
    f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0)
    
    precision_weighted = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall_weighted = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1_weighted = f1_score(y_test, y_pred, average='weighted', zero_division=0)
    
    # Store results
    results[name] = {
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'accuracy': accuracy,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'f1_macro': f1_macro,
        'precision_weighted': precision_weighted,
        'recall_weighted': recall_weighted,
        'f1_weighted': f1_weighted
    }
    
    print(f"✅ {name} complete!")
    print(f"   Accuracy: {accuracy:.3f}")
    print(f"   F1-Score (Macro): {f1_macro:.3f}")
    print(f"   F1-Score (Weighted): {f1_weighted:.3f}")

print(f"\n🎯 MULTICLASS MODEL PERFORMANCE SUMMARY")
print("="*80)
print(f"{'Model':<18} {'Accuracy':<9} {'F1-Macro':<9} {'F1-Weighted':<11} {'Prec-Macro':<10} {'Rec-Macro':<10}")
print("-" * 80)

for name, metrics in results.items():
    print(f"{name:<18} {metrics['accuracy']:<9.3f} {metrics['f1_macro']:<9.3f} "
          f"{metrics['f1_weighted']:<11.3f} {metrics['precision_macro']:<10.3f} {metrics['recall_macro']:<10.3f}")

# Find best model based on weighted F1-score (each class weighted by its support)
best_model_name = max(results.keys(), key=lambda x: results[x]['f1_weighted'])
best_f1 = results[best_model_name]['f1_weighted']

print(f"\n🏆 Best Model: {best_model_name} (F1-Weighted: {best_f1:.3f})")

# Show detailed classification report for best model
print(f"\n📊 DETAILED CLASSIFICATION REPORT - {best_model_name}")
print("="*60)
best_model = results[best_model_name]['model']
best_y_pred = results[best_model_name]['y_pred']

report = classification_report(
    y_test, 
    best_y_pred, 
    target_names=label_encoder.classes_,
    zero_division=0
)
print(report)

# Show confusion matrix for best model
print(f"\n🔍 CONFUSION MATRIX - {best_model_name}")
print("="*50)
cm = confusion_matrix(y_test, best_y_pred)
print("Classes:", label_encoder.classes_)
print("Confusion Matrix:")
print(cm)
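
The summary table reports both macro and weighted F1, and the two can diverge noticeably when classes are imbalanced. The sketch below works through a toy case by hand, using the standard definition F1 = 2·TP / (2·TP + FP + FN). The helper `per_class_f1` and the labels are made up for illustration.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} computed from true and predicted labels."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, tp + fn)  # support = number of true samples
    return scores

# Toy imbalanced case: 8 legitimate messages, 2 job scams (one scam is missed)
y_true = ["legitimate"] * 8 + ["job_scam"] * 2
y_pred = ["legitimate"] * 8 + ["legitimate", "job_scam"]

scores = per_class_f1(y_true, y_pred)
total = sum(support for _, support in scores.values())
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
weighted_f1 = sum(f1 * support for f1, support in scores.values()) / total

print(f"macro F1:    {macro_f1:.3f}")     # 0.804
print(f"weighted F1: {weighted_f1:.3f}")  # 0.886
```

Macro F1 treats every class equally, so the missed scam pulls it down more; weighted F1 is dominated by the large legitimate class. When rare scam types matter, macro F1 is the more demanding number to watch.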

## 7. Model Training and Evaluation

Let's perform cross-validation and detailed analysis of our models to ensure robust performance estimates.
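
The cross-validation output reports each metric as a mean followed by a ± value equal to twice the standard deviation across folds. As a small sketch of that calculation, the snippet below uses hypothetical fold scores (not results from this notebook) and the standard library's `statistics` module.

```python
import statistics

# Hypothetical accuracy scores from a 5-fold cross-validation run
fold_scores = [0.90, 0.92, 0.88, 0.91, 0.89]

mean_score = statistics.fmean(fold_scores)
# NumPy's .std() defaults to the population standard deviation (ddof=0),
# which corresponds to statistics.pstdev
spread = 2 * statistics.pstdev(fold_scores)

print(f"Accuracy: {mean_score:.3f} (±{spread:.3f})")  # Accuracy: 0.900 (±0.028)
```

With only five folds this band is a rough indication of variability rather than a formal confidence interval.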

In [None]:
# Perform cross-validation for more robust evaluation
print("🔄 Performing cross-validation...")
print("="*50)

cv_results = {}
cv_folds = 5

for name, model in models.items():
    print(f"\n📊 Cross-validating {name}...")
    
    # Perform cross-validation on different metrics. Multiclass targets need
    # averaged scorers; the plain 'precision', 'recall', 'f1', and 'roc_auc'
    # scorers are defined for binary targets only.
    cv_accuracy = cross_val_score(model, X_combined, y, cv=cv_folds, scoring='accuracy')
    cv_precision = cross_val_score(model, X_combined, y, cv=cv_folds, scoring='precision_macro')
    cv_recall = cross_val_score(model, X_combined, y, cv=cv_folds, scoring='recall_macro')
    cv_f1 = cross_val_score(model, X_combined, y, cv=cv_folds, scoring='f1_macro')
    cv_auc = cross_val_score(model, X_combined, y, cv=cv_folds, scoring='roc_auc_ovr')
    
    cv_results[name] = {
        'accuracy': cv_accuracy,
        'precision': cv_precision,
        'recall': cv_recall,
        'f1': cv_f1,
        'auc': cv_auc
    }
    
    print(f"   Accuracy:  {cv_accuracy.mean():.3f} (±{cv_accuracy.std()*2:.3f})")
    print(f"   Precision: {cv_precision.mean():.3f} (±{cv_precision.std()*2:.3f})")
    print(f"   Recall:    {cv_recall.mean():.3f} (±{cv_recall.std()*2:.3f})")
    print(f"   F1-Score:  {cv_f1.mean():.3f} (±{cv_f1.std()*2:.3f})")
    print(f"   AUC:       {cv_auc.mean():.3f} (±{cv_auc.std()*2:.3f})")

# Visualize cross-validation results
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
metrics = ['accuracy', 'precision', 'recall', 'f1', 'auc']

for idx, metric in enumerate(metrics):
    row = idx // 3
    col = idx % 3
    
    # Prepare data for box plot
    data_to_plot = [cv_results[name][metric] for name in models.keys()]
    model_names = list(models.keys())
    
    # Create box plot
    box_plot = axes[row, col].boxplot(data_to_plot, labels=model_names, patch_artist=True)
    
    # Color the boxes
    colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
    for patch, color in zip(box_plot['boxes'], colors):
        patch.set_facecolor(color)
    
    axes[row, col].set_title(f'{metric.upper()} - Cross Validation', fontsize=12, fontweight='bold')
    axes[row, col].set_ylabel(metric.title())
    axes[row, col].grid(True, alpha=0.3)
    axes[row, col].tick_params(axis='x', rotation=45)

# Remove the empty subplot
axes[1, 2].remove()

plt.tight_layout()
plt.show()

# Summary statistics per model and metric (no significance test is run here)
print(f"\n📈 CROSS-VALIDATION SUMMARY")
print("="*70)
print(f"{'Model':<20} {'Metric':<12} {'Mean':<8} {'Std':<8} {'Min':<8} {'Max':<8}")
print("-" * 70)

for name in models.keys():
    for metric in ['f1', 'precision', 'recall', 'accuracy', 'auc']:
        scores = cv_results[name][metric]
        print(f"{name:<20} {metric:<12} {scores.mean():<8.3f} {scores.std():<8.3f} "
              f"{scores.min():<8.3f} {scores.max():<8.3f}")
    print("-" * 70)

# Hyperparameter tuning for best models
print(f"\n🔧 HYPERPARAMETER TUNING")
print("="*40)

# Tune Logistic Regression
print("🔄 Tuning Logistic Regression...")
lr_params = {
    'C': [0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

lr_grid = GridSearchCV(
    LogisticRegression(random_state=42, max_iter=1000),
    lr_params,
    cv=3,
    scoring='f1_macro',  # plain 'f1' only works for binary targets
    n_jobs=-1
)

lr_grid.fit(X_train, y_train)
print(f"✅ Best LR params: {lr_grid.best_params_}")
print(f"✅ Best LR score: {lr_grid.best_score_:.3f}")

# Tune SVM
print(f"\n🔄 Tuning SVM...")
svm_params = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

svm_grid = GridSearchCV(
    SVC(random_state=42, probability=True),
    svm_params,
    cv=3,
    scoring='f1_macro',  # plain 'f1' only works for binary targets
    n_jobs=-1
)

svm_grid.fit(X_train, y_train)
print(f"✅ Best SVM params: {svm_grid.best_params_}")
print(f"✅ Best SVM score: {svm_grid.best_score_:.3f}")

# Store tuned models
tuned_models = {
    'Tuned Logistic Regression': lr_grid.best_estimator_,
    'Tuned SVM': svm_grid.best_estimator_
}

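# Evaluate tuned models
print(f"\n🎯 TUNED MODEL PERFORMANCE")
print("="*50)

for name, model in tuned_models.items():
    y_pred = model.predict(X_test)
    # Keep the full (n_samples, n_classes) probability matrix for multiclass AUC
    y_pred_proba = model.predict_proba(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    # Macro averaging weights every class equally, regardless of its frequency
    precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
    recall = recall_score(y_test, y_pred, average='macro', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='macro', zero_division=0)
    # One-vs-rest AUC; assumes every class appears at least once in y_test
    auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')
    
    print(f"\n{name}:")
    print(f"  Accuracy:  {accuracy:.3f}")
    print(f"  Precision: {precision:.3f}")
    print(f"  Recall:    {recall:.3f}")
    print(f"  F1-Score:  {f1:.3f}")
    print(f"  AUC:       {auc:.3f}")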
    
    # Update results with tuned models
    results[name] = {
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'auc_score': auc
    }

## 8. Performance Metrics and Confusion Matrix

Let's dive deep into the performance analysis with detailed confusion matrices and error analysis.
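
In a multiclass setting the most useful part of a confusion matrix is often the off-diagonal cells, which show which classes get mistaken for which. Here is a minimal sketch of ranking those cells. The helper `top_confusions` and the 3-class demo matrix are hypothetical and not part of the notebook's pipeline.

```python
import numpy as np

def top_confusions(cm, class_names, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off_diag = np.array(cm, copy=True)
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Flattened cell indices, largest count first
    flat_order = np.argsort(off_diag, axis=None)[::-1][:k]
    pairs = []
    for flat_idx in flat_order:
        i, j = np.unravel_index(flat_idx, off_diag.shape)
        if off_diag[i, j] > 0:
            pairs.append((class_names[i], class_names[j], int(off_diag[i, j])))
    return pairs

# Hypothetical matrix: rows are true classes, columns are predicted classes
demo_classes = ["legitimate", "phishing", "sms_spam"]
demo_cm = np.array([
    [50, 2, 1],
    [4, 30, 6],
    [0, 3, 20],
])
print(top_confusions(demo_cm, demo_classes, k=2))
# [('phishing', 'sms_spam', 6), ('phishing', 'legitimate', 4)]
```

On the real data the same call would take the matrix from `confusion_matrix` together with `label_encoder.classes_`.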

In [None]:
# Detailed confusion matrix analysis
def plot_confusion_matrices(results, y_test, class_names):
    """Plot one multiclass confusion matrix per model (at most six)"""
    
    fig, axes = plt.subplots(2, 3, figsize=(24, 14))
    axes = axes.flatten()
    
    for idx, (name, result) in enumerate(results.items()):
        if idx >= len(axes):
            break
            
        # Fix the label order so rows and columns line up with class_names
        cm = confusion_matrix(y_test, result['y_pred'],
                              labels=np.arange(len(class_names)))
        
        # Plot confusion matrix
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                   xticklabels=class_names, 
                   yticklabels=class_names,
                   ax=axes[idx])
        
        axes[idx].set_title(f'{name}\nAccuracy: {result["accuracy"]:.3f}', 
                           fontsize=12, fontweight='bold')
        axes[idx].set_xlabel('Predicted Label')
        axes[idx].set_ylabel('True Label')
        axes[idx].tick_params(axis='x', rotation=45)
        axes[idx].tick_params(axis='y', rotation=0)
        
        # Add percentage annotations for every cell of the n x n matrix
        total = cm.sum()
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                percent = cm[i, j] / total * 100
                axes[idx].text(j + 0.5, i + 0.8, f'({percent:.1f}%)', 
                             ha='center', va='center', fontsize=6, color='red')
    
    # Remove unused subplots
    for idx in range(len(results), len(axes)):
        fig.delaxes(axes[idx])
    
    plt.tight_layout()
    plt.show()

# Plot confusion matrices
plot_confusion_matrices(results, y_test, label_encoder.classes_)

# Detailed classification reports
print("📊 DETAILED CLASSIFICATION REPORTS")
print("="*60)

for name, result in results.items():
    print(f"\n🤖 {name}")
    print("-" * 40)
    print(classification_report(y_test, result['y_pred'], 
                              target_names=label_encoder.classes_,
                              digits=3,
                              zero_division=0))

# Error analysis
print(f"\n🔍 ERROR ANALYSIS")
print("="*50)

# Get the best performing model for detailed analysis
best_model_name = max(results.keys(), key=lambda x: results[x]['f1_score'])
best_result = results[best_model_name]

print(f"Analyzing errors for: {best_model_name}")
print(f"Best F1-Score: {best_result['f1_score']:.3f}")

# Work on plain arrays so positional indexing behaves the same for Series and ndarray
y_true_arr = np.asarray(y_test)
y_pred_best = np.asarray(best_result['y_pred'])
class_names = list(label_encoder.classes_)
legit_idx = class_names.index('legitimate')

# Find misclassified samples (positions within the test set)
misclassified_indices = np.flatnonzero(y_true_arr != y_pred_best)
print(f"\nTotal misclassified samples: {len(misclassified_indices)}")

# False positive: a legitimate message flagged as any scam type.
# False negative: any scam type passed as legitimate.
# Scam-to-scam confusions are counted separately.
false_positives = [i for i in misclassified_indices if y_true_arr[i] == legit_idx]
false_negatives = [i for i in misclassified_indices if y_pred_best[i] == legit_idx]
wrong_scam_type = [i for i in misclassified_indices
                   if y_true_arr[i] != legit_idx and y_pred_best[i] != legit_idx]

print(f"False Positives (legitimate predicted as a scam): {len(false_positives)}")
print(f"False Negatives (scam predicted as legitimate): {len(false_negatives)}")
print(f"Wrong scam type (scam predicted as a different scam): {len(wrong_scam_type)}")

# Recover the test rows. df.loc needs the original index labels; the fallback
# (last rows of df) is only correct if the split was not shuffled.
df_test = df.loc[y_test.index] if hasattr(y_test, 'index') else df.iloc[-len(y_test):]

def show_examples(title, positions, limit=3):
    """Print up to `limit` misclassified messages with true and predicted labels"""
    if len(positions) == 0:
        return
    print(f"\n❌ {title}:")
    print("-" * 30)
    for n, pos in enumerate(positions[:limit], 1):
        message = str(df_test['message'].iloc[pos])
        print(f"{n}. {message[:100]}...")
        print(f"   True: {class_names[y_true_arr[pos]]}, "
              f"Predicted: {class_names[y_pred_best[pos]]}")
        print()

# Show examples of misclassified samples
show_examples("FALSE POSITIVE EXAMPLES", false_positives)
show_examples("FALSE NEGATIVE EXAMPLES", false_negatives)

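# Cost analysis for fraud detection
print(f"\n💰 COST ANALYSIS")
print("="*30)

# Assuming costs: FN (missed fraud) = $1000, FP (false alarm) = $10
cost_fn = 1000  # Cost of missing a fraud
cost_fp = 10    # Cost of false alarm

# Collapse the multiclass matrix to legitimate vs. scam before counting errors;
# confusing one scam type with another is not costed here.
class_labels = np.arange(len(label_encoder.classes_))
legit_idx = list(label_encoder.classes_).index('legitimate')

for name, result in results.items():
    cm = confusion_matrix(y_test, result['y_pred'], labels=class_labels)
    fn = cm[:, legit_idx].sum() - cm[legit_idx, legit_idx]  # scams predicted as legitimate
    fp = cm[legit_idx, :].sum() - cm[legit_idx, legit_idx]  # legitimate predicted as a scam
    
    total_cost = fn * cost_fn + fp * cost_fp
    print(f"{name}:")
    print(f"  False Negatives: {fn} x ${cost_fn} = ${fn * cost_fn}")
    print(f"  False Positives: {fp} x ${cost_fp} = ${fp * cost_fp}")
    print(f"  Total Cost: ${total_cost}")
    print(f"  Cost per sample: ${total_cost / len(y_test):.2f}")
    print()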

# Performance at different thresholds
print(f"\n📈 THRESHOLD ANALYSIS")
print("="*30)

from sklearn.metrics import precision_recall_curve

# Thresholding needs a binary problem, so collapse to scam vs. legitimate:
# the scam score is 1 - P(legitimate). predict_proba columns follow the
# encoded labels, so the legitimate column is legit_idx.
best_model = best_result['model']
legit_idx = list(label_encoder.classes_).index('legitimate')
y_proba = 1.0 - best_model.predict_proba(X_test)[:, legit_idx]
y_test_binary = (np.asarray(y_test) != legit_idx).astype(int)

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
print(f"Model: {best_model_name}")
print(f"{'Threshold':<10} {'Precision':<10} {'Recall':<10} {'F1':<10} {'Cost':<10}")
print("-" * 50)

for threshold in thresholds:
    y_pred_thresh = (y_proba >= threshold).astype(int)
    
    precision_thresh = precision_score(y_test_binary, y_pred_thresh, zero_division=0)
    recall_thresh = recall_score(y_test_binary, y_pred_thresh, zero_division=0)
    f1_thresh = f1_score(y_test_binary, y_pred_thresh, zero_division=0)
    
    # Calculate cost (labels=[0, 1] keeps the matrix 2x2 even if one class is absent)
    cm_thresh = confusion_matrix(y_test_binary, y_pred_thresh, labels=[0, 1])
    tn, fp, fn, tp = cm_thresh.ravel()
    cost_thresh = fn * cost_fn + fp * cost_fp
    
    print(f"{threshold:<10.1f} {precision_thresh:<10.3f} {recall_thresh:<10.3f} "
          f"{f1_thresh:<10.3f} ${cost_thresh:<9.0f}")

# Plot precision-recall curve (scam vs. legitimate)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

precision_curve, recall_curve, thresholds_pr = precision_recall_curve(y_test_binary, y_proba)

ax1.plot(recall_curve, precision_curve, color='blue', linewidth=2)
ax1.set_xlabel('Recall')
ax1.set_ylabel('Precision')
ax1.set_title(f'Precision-Recall Curve\n{best_model_name}', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Plot threshold vs metrics
ax2.plot(thresholds, [precision_score(y_test_binary, (y_proba >= t).astype(int), zero_division=0) for t in thresholds], 
         label='Precision', marker='o')
ax2.plot(thresholds, [recall_score(y_test_binary, (y_proba >= t).astype(int), zero_division=0) for t in thresholds], 
         label='Recall', marker='s')
ax2.plot(thresholds, [f1_score(y_test_binary, (y_proba >= t).astype(int), zero_division=0) for t in thresholds], 
         label='F1-Score', marker='^')

ax2.set_xlabel('Threshold')
ax2.set_ylabel('Score')
ax2.set_title('Metrics vs Threshold', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
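
The two-cost view above (missed fraud vs. false alarm) can be generalized to a per-cell cost matrix, so that confusing one scam type with another also carries a price. This is a sketch under stated assumptions: the `total_cost` helper, the 3-class demo matrix, and the $50 scam-to-scam cost are illustrative values, not figures from the dataset.

```python
import numpy as np

def total_cost(cm, cost_matrix):
    """Sum of count * cost over every (true, predicted) cell."""
    cm = np.asarray(cm)
    cost_matrix = np.asarray(cost_matrix)
    if cm.shape != cost_matrix.shape:
        raise ValueError("confusion matrix and cost matrix must have the same shape")
    return float((cm * cost_matrix).sum())

# Hypothetical 3-class matrix: rows are true classes, index 0 is legitimate
demo_cm = np.array([
    [90, 3, 2],
    [1, 40, 4],
    [2, 5, 30],
])
n_classes = demo_cm.shape[0]
legit = 0

# Assumed costs: $1000 for a scam passed as legitimate, $10 for a false alarm,
# $50 for one scam type mistaken for another, $0 for correct predictions
cost_matrix = np.full((n_classes, n_classes), 50.0)
cost_matrix[:, legit] = 1000.0   # predicted legitimate, actually a scam
cost_matrix[legit, :] = 10.0     # actually legitimate, flagged as a scam
np.fill_diagonal(cost_matrix, 0.0)

print(total_cost(demo_cm, cost_matrix))  # 3500.0
```

Setting the scam-to-scam cost to zero recovers the collapsed legitimate-vs-scam accounting.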

## 9. Model Prediction on New Text Samples

Now let's test our trained models on new, unseen text samples to see how they perform in practice.
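
For a multiclass model it is often more informative to report the few most likely classes than a single label. Below is a minimal sketch of turning one row of `predict_proba` output into ranked labels. The helper `top_k_labels` and the demo probabilities are hypothetical; in the notebook the class names would come from `label_encoder.classes_`.

```python
import numpy as np

def top_k_labels(probabilities, class_names, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    probabilities = np.asarray(probabilities, dtype=float)
    order = np.argsort(probabilities)[::-1][:k]  # indices sorted by descending probability
    return [(class_names[i], float(probabilities[i])) for i in order]

# Hypothetical probability row for one message
demo_classes = ["job_scam", "legitimate", "phishing", "sms_spam"]
demo_proba = [0.05, 0.15, 0.70, 0.10]
print(top_k_labels(demo_proba, demo_classes, k=2))
# [('phishing', 0.7), ('legitimate', 0.15)]
```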

In [None]:
class FraudDetectionPredictor:
    """
    Wrapper class for making predictions on new text samples
    """
    
    def __init__(self, model, vectorizer, preprocessor, scaler, feature_cols,
                 label_encoder, legit_label='legitimate'):
        self.model = model
        self.vectorizer = vectorizer
        self.preprocessor = preprocessor
        self.scaler = scaler
        self.feature_cols = feature_cols
        self.label_encoder = label_encoder
        self.legit_label = legit_label
        
    def predict_message(self, message):
        """
        Predict the class of a message (legitimate or a specific scam type)
        """
        # Preprocess the message
        cleaned_message = self.preprocessor.preprocess(message)
        
        # Extract TF-IDF features
        tfidf_features = self.vectorizer.transform([cleaned_message])
        
        # Extract additional features
        additional_features_dict = self.preprocessor.extract_features(message)
        additional_features = np.array([[additional_features_dict[col] for col in self.feature_cols]])
        additional_features_scaled = self.scaler.transform(additional_features)
        
        # Combine features
        combined_features = np.hstack([tfidf_features.toarray(), additional_features_scaled])
        
        # Make prediction and decode it back to the class name
        prediction = self.model.predict(combined_features)[0]
        probability = self.model.predict_proba(combined_features)[0]
        predicted_class = self.label_encoder.inverse_transform([prediction])[0]
        
        # predict_proba columns follow model.classes_ (the encoded labels)
        legit_code = self.label_encoder.transform([self.legit_label])[0]
        normal_probability = probability[list(self.model.classes_).index(legit_code)]
        
        return {
            'message': message,
            'prediction': 'Normal' if predicted_class == self.legit_label else 'Fraud',
            'predicted_class': predicted_class,
            'confidence': max(probability),
            'fraud_probability': 1.0 - normal_probability,
            'normal_probability': normal_probability
        }

# Create predictor with the best model
best_model = results[best_model_name]['model']
predictor = FraudDetectionPredictor(
    model=best_model,
    vectorizer=tfidf_vectorizer,
    preprocessor=preprocessor,
    scaler=scaler,
    feature_cols=additional_features,
    label_encoder=label_encoder
)

# Test on new sample messages
test_messages = [
    # Clear fraud examples
    "URGENT: Your account is suspended! Click here immediately to restore access or lose your money forever!",
    "Congratulations! You've won $1,000,000 in our lottery! Send $500 processing fee to claim your prize now!",
    "FINAL NOTICE: Pay $2000 immediately or we will take legal action against you today!",
    "Your bank account has been compromised. Verify your details at suspicious-bank-link.com right now!",
    
    # Clear normal examples  
    "Hey, thanks for helping me with the project yesterday. The presentation went really well!",
    "Reminder: Your doctor's appointment is scheduled for tomorrow at 2 PM",
    "The team meeting has been moved to Friday at 10 AM in conference room B",
    "Happy birthday! Hope you have a wonderful celebration with your family",
    
    # Ambiguous/edge cases
    "Limited time offer: 50% off all items. Sale ends soon!",
    "Your order has been delayed due to shipping issues. Additional fees may apply",
    "Security alert: We detected unusual activity on your account. Please review",
    "Investment opportunity: High returns guaranteed with our new fund"
]

print("🔮 TESTING ON NEW MESSAGES")
print("="*80)

# Predict for all test messages
predictions = []
for i, message in enumerate(test_messages, 1):
    result = predictor.predict_message(message)
    predictions.append(result)
    
    # Determine emoji and confidence band based on prediction
    emoji = "🚨" if result['prediction'] == 'Fraud' else "✅"
    confidence_level = "HIGH" if result['confidence'] > 0.8 else "MEDIUM" if result['confidence'] > 0.6 else "LOW"
    
    print(f"\n{emoji} Test Message {i}:")
    print(f"Message: {message}")
    print(f"Prediction: {result['prediction']} (Confidence: {confidence_level} - {result['confidence']:.3f})")
    print(f"Predicted class: {result['predicted_class']}")
    print(f"Fraud Probability: {result['fraud_probability']:.3f}")
    print("-" * 80)

# Create visualization of predictions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

# 1. Prediction distribution
pred_counts = pd.Series([p['prediction'] for p in predictions]).value_counts()
colors = ['green' if label == 'Normal' else 'red' for label in pred_counts.index]
ax1.pie(pred_counts.values, labels=pred_counts.index, autopct='%1.1f%%', colors=colors, alpha=0.7)
ax1.set_title('Prediction Distribution on Test Messages', fontsize=14, fontweight='bold')

# 2. Confidence distribution
confidences = [p['confidence'] for p in predictions]
fraud_confidences = [p['confidence'] for p in predictions if p['prediction'] == 'Fraud']
normal_confidences = [p['confidence'] for p in predictions if p['prediction'] == 'Normal']

ax2.hist(fraud_confidences, alpha=0.7, label='Fraud Predictions', color='red', bins=5)
ax2.hist(normal_confidences, alpha=0.7, label='Normal Predictions', color='green', bins=5)
ax2.set_xlabel('Confidence Score')
ax2.set_ylabel('Number of Predictions')
ax2.set_title('Confidence Distribution by Prediction', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Interactive prediction function
def interactive_prediction():
    """
    Interactive function for testing custom messages
    """
    print("\n🎯 INTERACTIVE FRAUD DETECTION")
    print("="*50)
    print("Enter a message to test (or 'quit' to exit):")
    
    while True:
        user_message = input("\nMessage: ").strip()
        
        if user_message.lower() in ['quit', 'exit', 'q']:
            print("👋 Thanks for testing!")
            break
            
        if not user_message:
            print("⚠️ Please enter a message.")
            continue
            
        try:
            result = predictor.predict_message(user_message)
            
            emoji = "🚨" if result['prediction'] == 'Fraud' else "✅"
            print(f"\n{emoji} Prediction: {result['prediction']}")
            print(f"   Confidence: {result['confidence']:.3f}")
            print(f"   Fraud Probability: {result['fraud_probability']:.3f}")
            
            if result['prediction'] == 'Fraud':
                print("   ⚠️ This message appears to be fraudulent!")
            else:
                print("   ✅ This message appears to be legitimate.")
                
        except Exception as e:
            print(f"❌ Error processing message: {e}")

# Uncomment the line below to run interactive prediction
# interactive_prediction()

# Save the best model for future use
print(f"\n💾 MODEL PERSISTENCE")
print("="*30)

import joblib

# Save the complete pipeline
model_pipeline = {
    'model': best_model,
    'vectorizer': tfidf_vectorizer,
    'preprocessor': preprocessor,
    'scaler': scaler,
    'label_encoder': label_encoder,
    'feature_columns': additional_features,
    'model_name': best_model_name,
    'performance_metrics': {
        'accuracy': results[best_model_name]['accuracy'],
        'precision': results[best_model_name]['precision'],
        'recall': results[best_model_name]['recall'],
        'f1_score': results[best_model_name]['f1_score'],
        'auc_score': results[best_model_name]['auc_score']
    }
}

# Save to file
# joblib.dump(model_pipeline, 'fraud_detection_pipeline.pkl')
print("Model pipeline ready for saving with joblib.dump()")

# The dict holds metadata keys the constructor does not accept, so unpacking it
# whole with ** would raise a TypeError; pass only the matching keys.
print(f"\nTo load the model later:")
print("import inspect")
print("model_pipeline = joblib.load('fraud_detection_pipeline.pkl')")
print("accepted = inspect.signature(FraudDetectionPredictor.__init__).parameters")
print("kwargs = {k: v for k, v in model_pipeline.items() if k in accepted}")
print("kwargs['feature_cols'] = model_pipeline['feature_columns']")
print("predictor = FraudDetectionPredictor(**kwargs)")

# Summary of best practices
print(f"\n📋 IMPLEMENTATION SUMMARY")
print("="*50)
print(f"🏆 Best Model: {best_model_name}")
print(f"📊 Performance: F1-Score = {results[best_model_name]['f1_score']:.3f}")
print(f"🔧 Key Features:")
print(f"   - TF-IDF vectorization with {X_tfidf.shape[1]} features")
print(f"   - Additional engineered features: {len(additional_features)}")
print(f"   - Cross-validation for robust evaluation")
print(f"   - Hyperparameter tuning")
print(f"   - Cost-sensitive analysis")
print(f"\n✨ Next Steps:")
print(f"   1. Deploy as web service (Flask/FastAPI)")
print(f"   2. Implement real-time monitoring")
print(f"   3. Add more sophisticated features")
print(f"   4. Experiment with BERT/transformer models")
print(f"   5. Collect and retrain on real-world data")