<div style="padding: 2.5rem; border-radius: 1.5rem; background: linear-gradient(135deg, #1e293b, #0f172a); color: #f8fafc; border: 1px solid rgba(255,255,255,0.1);">
    <h1 style="color: #38bdf8; font-size: 3rem; font-weight: 800; margin: 0;">AI Spam Sentry: Engine Development</h1>
    <p style="color: #94a3b8; font-size: 1.2rem; margin-top: 0.5rem;">High-Performance Neural-Linguistic Pipeline for Secure Messaging</p>
    <div style="margin-top: 1.5rem; font-size: 0.9rem; opacity: 0.8;">
        <span>Developed by: ADVERK Intelligence Team</span> | 
        <span>Framework: Scikit-Learn + NLTK</span>
    </div>
</div>

---

## 1. Environment Orchestration
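We import the libraries used throughout the notebook: pandas and NumPy for data handling, Matplotlib and Seaborn for plotting, NLTK for the English stop-word list, and scikit-learn for vectorization, modeling, and metrics.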

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import string
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
nltk.download('stopwords')

print("✅ Core Libraries Operational")

## 2. Data Ingestion & Transformation
Retrieving the SMS Spam Collection dataset and restructuring for supervised learning.

In [None]:
# Load the SMS Spam Collection CSV from a public GitHub mirror
url = "https://raw.githubusercontent.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/master/spam.csv"
df = pd.read_csv(url, encoding='latin-1')

# Keep only the label and message columns, then give them readable names
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df.columns = ['label', 'message']

print(f"Dataset Statistics: {df.shape[0]} samples processed.")
df.head()

## 3. Linguistic Preprocessing Pipeline
A custom function normalizes each message: it lowercases the text, strips punctuation, and removes English stop words.

In [None]:
stop_words = set(stopwords.words('english'))

def extract_linguistic_features(text):
    # Lowercase the text and strip ASCII punctuation
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])

    # Split on whitespace and drop English stop words
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

df['processed_message'] = df['message'].apply(extract_linguistic_features)
print("✅ Pipeline Phase 1: Normalization Complete")
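
To make the effect of the normalization step concrete, here is a minimal standalone sketch of the same three operations applied to one message. The sample text and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins; the notebook itself uses NLTK's full English stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"a", "the", "to", "you", "have", "is", "your"}

def normalize(text, stop_words=DEMO_STOP_WORDS):
    # Lowercase, strip ASCII punctuation, then drop stop words.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "Congratulations! You have WON a FREE prize. Call 0800-123-456 to claim."
cleaned = normalize(sample)
print(cleaned)  # congratulations won free prize call 0800123456 claim
```

Note that stripping punctuation also fuses the hyphenated phone number into a single token, which may or may not be desirable depending on how such numbers should be treated as features.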

## 4. Vectorization: TF-IDF Embedding
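We encode the labels (ham = 0, spam = 1), hold out 30% of the messages with a stratified split, and convert the text into sparse TF-IDF (Term Frequency-Inverse Document Frequency) vectors over at most 5,000 unigram and bigram features.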

In [None]:
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
X_train, X_test, y_train, y_test = train_test_split(
    df['processed_message'], 
    df['label_num'], 
    test_size=0.3, 
    random_state=42, 
    stratify=df['label_num']
)

tfidf = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"Vector Space Dimension: {X_train_tfidf.shape[1]} features")
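
For intuition about what the vectorizer computes, the following sketch derives TF-IDF weights by hand in plain Python. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization) and, for brevity, covers unigrams only. The three-document corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already normalized and tokenizable on whitespace.
docs = ["free prize call now", "call me later", "free free prize"]

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

idf, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

Terms that appear in fewer documents (such as `now`) receive a larger IDF than terms shared across documents (such as `call`), and repeated terms within a message (such as `free` in the third document) receive proportionally more weight.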

## 5. Model Synthesis: Logistic Regression
Training a logistic regression classifier (liblinear solver, default regularization strength C=1.0) on the TF-IDF features, then predicting labels for the held-out test set.

In [None]:
classifier = LogisticRegression(C=1.0, solver='liblinear')
classifier.fit(X_train_tfidf, y_train)

y_pred = classifier.predict(X_test_tfidf)
print("✅ Model Optimization Successful")
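
As a rough illustration of how a fitted logistic regression turns a TF-IDF vector into a decision, the sketch below applies the sigmoid to a weighted sum of feature values and thresholds at 0.5. The weights, bias, and feature values are hypothetical and are not taken from the trained model.

```python
import math

def predict_proba(weights, bias, features):
    # Linear score: bias plus the weighted sum of the features present.
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    # The sigmoid maps the score to a probability of the positive class (spam).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: positive values push toward spam, negative toward ham.
weights = {"free": 2.1, "prize": 1.7, "later": -1.2}
bias = -1.5

spam_p = predict_proba(weights, bias, {"free": 0.8, "prize": 0.6})
ham_p = predict_proba(weights, bias, {"later": 0.9, "call": 0.4})

for name, p in [("spam-like", spam_p), ("ham-like", ham_p)]:
    print(f"{name}: p(spam) = {p:.3f} -> {'spam' if p >= 0.5 else 'ham'}")
```

In the notebook, the corresponding learned values are available as `classifier.coef_` and `classifier.intercept_` after fitting.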

## 6. Comprehensive Performance Audit

In [None]:
print("CORE ANALYTICS REPORT")
print("=====================")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 score:  {f1_score(y_test, y_pred):.4f}")

plt.figure(figsize=(10, 8))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='inferno')
plt.title('Confusion Matrix (0 = ham, 1 = spam)', fontsize=15)
plt.show()
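
The reported scores follow directly from the confusion matrix counts. The sketch below shows the arithmetic for the spam class, using hypothetical counts chosen for easy mental checking; they are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: of the messages flagged as spam, how many really were spam.
    precision = tp / (tp + fp)
    # Recall: of the real spam messages, how many were caught.
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 90 spam caught, 10 ham wrongly flagged, 30 spam missed.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

For a spam filter, precision is often the more sensitive number, since each false positive is a legitimate message hidden from the user.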

## 7. Artifact Serialization
Saving the trained classifier and the fitted TF-IDF vectorizer so the Sentry App can load both for real-time inference.

In [None]:
joblib.dump(classifier, 'spam_model.pkl')
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')
print("📦 Artifacts cached in 2_Mini_Spam_Email_Detection/")
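
One way the saved artifacts might be consumed at inference time is sketched below. The `preprocess` and `classify` helpers are hypothetical names, not part of the Sentry App. The key assumption is that incoming messages must pass through the same normalization used in training before they reach the vectorizer. The loading code only runs if both `.pkl` files exist in the working directory.

```python
import os
import string

def preprocess(text, stop_words=frozenset()):
    # Same steps as training: lowercase, strip punctuation, drop stop words.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(w for w in text.split() if w not in stop_words)

def classify(message, model, vectorizer, stop_words=frozenset()):
    # Vectorize the single cleaned message and map the 0/1 label back to text.
    features = vectorizer.transform([preprocess(message, stop_words)])
    return "spam" if model.predict(features)[0] == 1 else "ham"

if os.path.exists("spam_model.pkl") and os.path.exists("tfidf_vectorizer.pkl"):
    import joblib
    from nltk.corpus import stopwords

    model = joblib.load("spam_model.pkl")
    vectorizer = joblib.load("tfidf_vectorizer.pkl")
    nltk_stop_words = set(stopwords.words("english"))
    print(classify("WIN a FREE prize now!!!", model, vectorizer, nltk_stop_words))
```

Loading the vectorizer alongside the model matters because the classifier's coefficients are only meaningful for the exact vocabulary and IDF weights learned during training.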