# **TechNova - Spam Detection (AI Model Training)**

## **COS30049 - Computing Technology Innovation Project**

---

### **Project Overview**
This notebook presents a comprehensive Machine Learning pipeline for **Spam Detection** - classifying text messages/emails as either **Spam** or **Ham** (not spam).

The pipeline includes:
1. **Data Loading & Exploration** - Understanding the dataset distribution and characteristics
2. **Data Preprocessing** - Text cleaning, normalization, and feature engineering
3. **Feature Extraction** - TF-IDF Vectorization
4. **Model Training** - Naive Bayes, SVM, Logistic Regression, Random Forest
5. **Model Evaluation** - Accuracy, Precision, Recall, F1-Score, Confusion Matrix, ROC Curve
6. **Cross-Validation** - 5-Fold CV for robust performance estimation
7. **Hyperparameter Tuning** - GridSearchCV for optimal model configuration
8. **Model Saving** - Exporting the best model, vectorizer, and label encoder with joblib
9. **Prediction Interface** - Function to classify new text inputs


---
## **Step 1: Setup & Import Libraries**

We use the following libraries:
- **pandas** - data manipulation and analysis
- **numpy** - numerical operations
- **matplotlib & seaborn** - data visualization
- **scikit-learn** - ML models, preprocessing, evaluation
- **nltk** - natural language processing (stopwords, tokenization)
- **wordcloud** - word cloud visualization
- **joblib** - model serialization


In [None]:
# Install required packages
# pip install pandas numpy matplotlib seaborn scikit-learn nltk wordcloud joblib

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
import string
import warnings
warnings.filterwarnings("ignore")

# NLP
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Scikit-learn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc
)
from sklearn.calibration import CalibratedClassifierCV

# Visualization
from wordcloud import WordCloud

# Model saving
import joblib

# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["font.size"] = 12

print("All libraries imported successfully!")


### **1.1 Upload & Load Dataset**

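Upload your CSV dataset file(s). Each file should contain the message text and its label (spam/ham). Later cells assume these columns are named `v2` and `v1` respectively, so adjust `text_col` and `label_col` in Step 2.2 if your file uses different names.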
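
As a minimal sketch of checking that a loaded file has the columns the later cells rely on: `check_required_columns` is a hypothetical helper (not part of the notebook), and the default names `v1` and `v2` simply mirror the label and text columns set in Step 2.2.

```python
import pandas as pd


def check_required_columns(frame, required=("v1", "v2")):
    """Return the required column names that are missing from `frame`."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for an uploaded CSV (illustrative data only)
toy_frame = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["see you at 7", "win a free prize now"],
})

missing = check_required_columns(toy_frame)
print(f"Missing columns: {missing}")
```

An empty list means the frame can go through the rest of the pipeline unchanged; otherwise the missing names indicate which columns to rename first.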


In [None]:
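# In Google Colab, upload files interactively.
# Outside Colab (assumption), fall back to every .csv file in the working directory.
try:
    from google.colab import files

    print("Please upload your dataset file(s) (.csv):")
    uploaded = files.upload()
    filenames = list(uploaded.keys())
except ImportError:
    import glob

    filenames = sorted(glob.glob("*.csv"))

print(f"\nDataset files: {filenames}")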


In [None]:
# Load and merge data
dfs = []
print("--- Loading Data ---")

for file in filenames:
    try:
        # Try reading with 'utf-8' first, then 'latin1' if it fails
        try:
            df = pd.read_csv(file)
        except UnicodeDecodeError:
            print(f"  'utf-8' decoding failed for {file}. Trying 'latin1'...")
            df = pd.read_csv(file, encoding='latin1')

        dfs.append(df)
        print(f"Loaded {file}: {df.shape}")
    except Exception as e:
        print(f"Error loading {file}: {e}")

if dfs:
    df = pd.concat(dfs, ignore_index=True)
    print(f"\nTotal shape after merging: {df.shape}")
else:
    raise ValueError("No data loaded!")


---
## **Step 2: Exploratory Data Analysis (EDA)**

Before building any model, we need to understand our dataset thoroughly.


### **2.1 Basic Data Information**


In [None]:
# Display first few rows
print("=" * 60)
print("FIRST 5 ROWS")
print("=" * 60)
df.head()


In [None]:
# Dataset info
print("=" * 60)
print("DATASET INFO")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nData Types:")
print(df.dtypes)
print(f"\nMissing Values:")
print(df.isnull().sum())
print(f"\nDuplicate Rows: {df.duplicated().sum()}")


In [None]:
# Statistical summary
print("=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
df.describe(include="all")


### **2.2 Target Distribution**


In [None]:
# Identify the label column
# Adjust this if your dataset uses a different column name
label_col = "v1"
text_col = "v2"

print(f"Label column: \"{label_col}\"")
print(f"Text column: \"{text_col}\"")
print(f"\nClass distribution:")
print(df[label_col].value_counts())
print(f"\nClass proportions:")
print(df[label_col].value_counts(normalize=True).round(4) * 100)

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
colors = ["#2ecc71", "#e74c3c"]
df[label_col].value_counts().plot(kind="bar", ax=axes[0], color=colors, edgecolor="black")
axes[0].set_title("Class Distribution (Bar Chart)", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Class")
axes[0].set_ylabel("Count")
axes[0].tick_params(axis="x", rotation=0)

# Pie chart
df[label_col].value_counts().plot(kind="pie", ax=axes[1], autopct="%1.1f%%",
                                   colors=colors, startangle=90,
                                   textprops={"fontsize": 12})
axes[1].set_title("Class Distribution (Pie Chart)", fontsize=14, fontweight="bold")
axes[1].set_ylabel("")

plt.tight_layout()
plt.show()


### **2.3 Text Length Analysis**


In [None]:
# Add text length features (cast to str so missing values do not raise a TypeError)
df["text_length"] = df[text_col].apply(lambda x: len(str(x)))
df["word_count"] = df[text_col].apply(lambda x: len(str(x).split()))
df["special_char_count"] = df[text_col].apply(lambda x: sum(1 for c in str(x) if c in string.punctuation))

print("Text Length Statistics by Class:")
print(df.groupby(label_col)[["text_length", "word_count", "special_char_count"]].describe().round(2))


In [None]:
# Distribution of text lengths by class
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

features = ["text_length", "word_count", "special_char_count"]
titles = ["Text Length Distribution", "Word Count Distribution", "Special Character Count"]

for i, (feat, title) in enumerate(zip(features, titles)):
    for label in df[label_col].unique():
        subset = df[df[label_col] == label]
        axes[i].hist(subset[feat], bins=50, alpha=0.6, label=label, edgecolor="black")
    axes[i].set_title(title, fontsize=13, fontweight="bold")
    axes[i].set_xlabel(feat)
    axes[i].set_ylabel("Frequency")
    axes[i].legend()

plt.tight_layout()
plt.show()


### **2.4 Word Cloud Visualization**


In [None]:
# Word Cloud for Spam vs Ham
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for i, label in enumerate(df[label_col].unique()):
    text = " ".join(df[df[label_col] == label][text_col].astype(str).tolist())
    wc = WordCloud(width=800, height=400, background_color="white",
                   max_words=200, colormap="viridis").generate(text)
    axes[i].imshow(wc, interpolation="bilinear")
    axes[i].set_title(f"Word Cloud \u2014 {label.upper()}", fontsize=14, fontweight="bold")
    axes[i].axis("off")

plt.tight_layout()
plt.show()


---
## **Step 3: Data Preprocessing**

Preprocessing is crucial for NLP tasks. We will:
1. Remove duplicates and handle missing values
2. Clean text: lowercase, remove punctuation, remove stopwords
3. Encode labels (spam → 1, ham → 0)
4. Split data into train/test sets


### **3.1 Data Cleaning**


In [None]:
# Remove duplicates and missing values
print(f"Shape before cleaning: {df.shape}")
print(f"Duplicates: {df.duplicated().sum()}")
print(f"Missing values: {df.isnull().sum().sum()}")

df.drop_duplicates(inplace=True)
df.dropna(subset=[text_col, label_col], inplace=True)

print(f"\nShape after cleaning: {df.shape}")


### **3.2 Text Preprocessing**


In [None]:
# Text cleaning function
stop_words = set(stopwords.words("english"))

def clean_text(text):
    """
    Clean and preprocess text:
    1. Convert to lowercase
    2. Remove URLs
    3. Remove email addresses
    4. Remove numbers
    5. Remove punctuation
    6. Remove stopwords
    7. Remove extra whitespace
    """
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)  # Remove URLs
    text = re.sub(r"\S+@\S+", "", text)                   # Remove emails
    text = re.sub(r"\d+", "", text)                        # Remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)
    tokens = [w for w in tokens if w not in stop_words and len(w) > 1]
    return " ".join(tokens)

# Apply cleaning
print("Cleaning text data...")
df["cleaned_text"] = df[text_col].apply(clean_text)

# Show before vs after
print("\n--- Before vs After Cleaning ---")
for i in range(3):
    print(f"\nOriginal:  {df[text_col].iloc[i][:100]}...")
    print(f"Cleaned:   {df['cleaned_text'].iloc[i][:100]}...")
print("\nText cleaning complete!")


### **3.3 Label Encoding**


In [None]:
# Encode labels: ham → 0, spam → 1
le = LabelEncoder()
df["label"] = le.fit_transform(df[label_col])

print("Label Encoding Mapping:")
for cls, encoded in zip(le.classes_, le.transform(le.classes_)):
    print(f"  {cls} \u2192 {encoded}")

print(f"\nEncoded distribution:")
print(df["label"].value_counts())


### **3.4 Feature Extraction - TF-IDF Vectorization**

TF-IDF (Term Frequency-Inverse Document Frequency) converts text into numerical feature vectors.
- **TF**: How often a word appears in a document
- **IDF**: Penalizes words that appear in many documents (common words)
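
To make the IDF idea concrete, here is a small worked sketch on a toy corpus. It uses the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, `ln((1 + n) / (1 + df)) + 1`. The corpus and the helper name `smoothed_idf` are illustrative assumptions, not part of the notebook.

```python
import math

# Toy corpus (illustrative only)
toy_docs = [
    "free prize win now",
    "free lunch tomorrow",
    "see you tomorrow",
]


def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


idf_free = smoothed_idf("free", toy_docs)    # appears in 2 of 3 documents
idf_prize = smoothed_idf("prize", toy_docs)  # appears in 1 of 3 documents

print(f"idf(free)  = {idf_free:.3f}")   # 1.288
print(f"idf(prize) = {idf_prize:.3f}")  # 1.693
```

The rarer term, "prize", gets the larger weight, which is why distinctive spam vocabulary tends to stand out in the feature matrix.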


In [None]:
# TF-IDF Vectorization
print("--- TF-IDF Vectorization ---")
tfidf = TfidfVectorizer(max_features=5000, stop_words="english",
                        ngram_range=(1, 2),   # unigrams + bigrams
                        min_df=2, max_df=0.95)

X = tfidf.fit_transform(df["cleaned_text"])
y = df["label"]

print(f"TF-IDF Matrix shape: {X.shape}")
print(f"Number of features (vocabulary size): {len(tfidf.get_feature_names_out())}")
print(f"\nSample feature names: {list(tfidf.get_feature_names_out()[:20])}")


### **3.5 Train/Test Split**


In [None]:
# Split data: 80% train, 20% test (stratified to maintain class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set:  {X_test.shape[0]} samples")
print(f"\nTraining class distribution:")
print(y_train.value_counts())
print(f"\nTesting class distribution:")
print(y_test.value_counts())


---
## **Step 4: Model Training & Evaluation**

We will train and compare **4 different classification algorithms**:
1. **Multinomial Naive Bayes** - Fast, works well with text data
2. **Support Vector Machine (SVM)** - Effective in high-dimensional spaces such as TF-IDF features
3. **Logistic Regression** - Simple yet powerful linear classifier
4. **Random Forest** - Ensemble of decision trees, less prone to overfitting than a single tree


### **4.1 Define Models**


In [None]:
# Define models
models = {
    "Naive Bayes": MultinomialNB(),
    "SVM (LinearSVC)": CalibratedClassifierCV(LinearSVC(random_state=42, max_iter=10000)),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
}

print(f"Models to train: {list(models.keys())}")


### **4.2 Train & Evaluate All Models**


In [None]:
# Train and evaluate each model
results = {}

print("=" * 70)
print("MODEL TRAINING & EVALUATION")
print("=" * 70)

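# Build the separator outside the f-strings: a backslash inside an
# f-string expression is a SyntaxError before Python 3.12
separator = "\u2500" * 50

for name, model in models.items():
    print(f"\n{separator}")
    print(f"Training: {name}")
    print(separator)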

    # Train
    model.fit(X_train, y_train)

    # Predict
    y_pred = model.predict(X_test)

    # Calculate metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average="binary")
    rec = recall_score(y_test, y_pred, average="binary")
    f1 = f1_score(y_test, y_pred, average="binary")

    results[name] = {
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1-Score": f1,
        "model": model,
        "y_pred": y_pred
    }

    print(f"  Accuracy:  {acc:.4f}")
    print(f"  Precision: {prec:.4f}")
    print(f"  Recall:    {rec:.4f}")
    print(f"  F1-Score:  {f1:.4f}")
    print(f"\n  Classification Report:")
    print(classification_report(y_test, y_pred, target_names=le.classes_))

print("\nAll models trained successfully!")


---
## **Step 5: Model Comparison & Visualization**


### **5.1 Confusion Matrices**
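
As a worked sketch of how the reported metrics follow from a confusion matrix: the counts below are hypothetical, not results from this notebook. With labels encoded as ham = 0 and spam = 1, scikit-learn's `confusion_matrix` returns `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes).

```python
# Hypothetical counts, with spam as the positive class
tp, fp, fn, tn = 40, 5, 10, 945

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of messages flagged as spam, the share that really was spam
recall = tp / (tp + fn)     # of actual spam messages, the share that was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.9850
print(f"Precision: {precision:.4f}")  # 0.8889
print(f"Recall:    {recall:.4f}")     # 0.8000
print(f"F1-Score:  {f1:.4f}")         # 0.8421
```

With an imbalanced dataset, accuracy can stay high while recall on spam is noticeably lower, which is why the notebook ranks models by F1 rather than accuracy.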


In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for idx, (name, res) in enumerate(results.items()):
    ax = axes[idx // 2, idx % 2]
    cm = confusion_matrix(y_test, res["y_pred"])
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax,
                xticklabels=le.classes_, yticklabels=le.classes_,
                annot_kws={"size": 14})
    ax.set_title(f"Confusion Matrix \u2014 {name}", fontsize=13, fontweight="bold")
    ax.set_ylabel("Actual")
    ax.set_xlabel("Predicted")

plt.tight_layout()
plt.show()


### **5.2 Performance Comparison**


In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    name: {k: v for k, v in res.items() if k not in ["model", "y_pred"]}
    for name, res in results.items()
}).T

print("\nModel Performance Comparison:")
print(comparison_df.round(4).to_string())

# Bar chart comparison
fig, ax = plt.subplots(figsize=(12, 6))
comparison_df.plot(kind="bar", ax=ax, colormap="Set2", edgecolor="black", width=0.8)
ax.set_title("Model Performance Comparison", fontsize=15, fontweight="bold")
ax.set_ylabel("Score")
ax.set_xlabel("Model")
ax.set_ylim(0.7, 1.02)
ax.legend(loc="lower right", fontsize=10)
ax.tick_params(axis="x", rotation=15)
plt.tight_layout()
plt.show()


### **5.3 ROC Curve**
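
One way to read the AUC values in the plot: AUC equals the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below checks that on a handful of made-up scores. `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
# Made-up classifier scores (illustrative only)
spam_scores = [0.9, 0.8, 0.4]
ham_scores = [0.7, 0.3, 0.2, 0.1]


def pairwise_auc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


toy_auc = pairwise_auc(spam_scores, ham_scores)
print(f"AUC = {toy_auc:.4f}")  # 11 of 12 pairs ranked correctly -> 0.9167
```

Only one pair is misordered here (spam at 0.4 versus ham at 0.7), so the AUC falls just short of 1.0.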


In [None]:
# ROC Curves
fig, ax = plt.subplots(figsize=(10, 7))
colors = ["#e74c3c", "#3498db", "#2ecc71", "#f39c12"]

for idx, (name, res) in enumerate(results.items()):
    model = res["model"]
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
    elif hasattr(model, "decision_function"):
        y_prob = model.decision_function(X_test)
    else:
        continue

    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color=colors[idx], lw=2,
            label=f"{name} (AUC = {roc_auc:.4f})")

ax.plot([0, 1], [0, 1], "k--", lw=1, label="Random Classifier")
ax.set_title("ROC Curve Comparison", fontsize=15, fontweight="bold")
ax.set_xlabel("False Positive Rate", fontsize=12)
ax.set_ylabel("True Positive Rate", fontsize=12)
ax.legend(loc="lower right", fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


---
## **Step 6: Cross-Validation (5-Fold)**

Cross-validation provides a more robust estimate of model performance by training and testing on different subsets of data.
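
In this notebook the TF-IDF vectorizer is fitted on the full dataset before splitting, so each fold's training step has already seen vocabulary statistics from its validation data. A possible way to avoid that, sketched here on a toy corpus, is to wrap the vectorizer and classifier in a scikit-learn pipeline so the vectorizer is refitted inside every fold. The texts, labels, and variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only): 1 = spam, 0 = ham
toy_texts = [
    "win free prize now", "free cash claim now",
    "urgent prize claim free", "win cash now urgent",
    "lunch tomorrow at noon", "see you at dinner",
    "notes from the lecture", "meeting moved to noon",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits TF-IDF on each training fold only
leak_free_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

toy_scores = cross_val_score(
    leak_free_model, toy_texts, toy_labels, cv=toy_cv, scoring="f1"
)
print(f"Mean F1 over {len(toy_scores)} folds: {toy_scores.mean():.4f}")
```

With the real data, the same pattern would take the cleaned text column instead of a precomputed matrix, at the cost of re-vectorizing in every fold.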


In [None]:
# 5-Fold Stratified Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}

print("=" * 60)
print("5-FOLD CROSS-VALIDATION RESULTS")
print("=" * 60)

# Re-define models (fresh instances)
cv_models = {
    "Naive Bayes": MultinomialNB(),
    "SVM (LinearSVC)": CalibratedClassifierCV(LinearSVC(random_state=42, max_iter=10000)),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
}

for name, model in cv_models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    cv_results[name] = {
        "Mean F1": scores.mean(),
        "Std F1": scores.std(),
        "Scores": scores
    }
    print(f"\n{name}:")
    print(f"  F1 Scores per fold: {[f'{s:.4f}' for s in scores]}")
    print(f"  Mean F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

In [None]:
# Visualize cross-validation results
fig, ax = plt.subplots(figsize=(10, 6))

names = list(cv_results.keys())
means = [cv_results[n]["Mean F1"] for n in names]
stds = [cv_results[n]["Std F1"] for n in names]

bars = ax.bar(names, means, yerr=stds, capsize=5,
              color=["#3498db", "#e74c3c", "#2ecc71", "#f39c12"],
              edgecolor="black")
ax.set_title("5-Fold Cross-Validation \u2014 Mean F1-Score", fontsize=14, fontweight="bold")
ax.set_ylabel("F1-Score")
ax.set_ylim(0.7, 1.02)
ax.tick_params(axis="x", rotation=10)

for bar, mean in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f"{mean:.4f}", ha="center", va="bottom", fontweight="bold")

plt.tight_layout()
plt.show()


---
## **Step 7: Hyperparameter Tuning (GridSearchCV)**

We use **GridSearchCV** to search a predefined grid for the hyperparameter combination that maximizes cross-validated F1 for our best-performing model.
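
A grid search fits one model per parameter combination per fold, so the cost grows multiplicatively with each added parameter. As a rough sketch, the count for the Random Forest grid used in this step can be worked out with `itertools.product`. The variable names are illustrative, and the final refit on the full training set is not included in the total.

```python
from itertools import product

# Same values as the Random Forest grid in this step
rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5],
}
n_folds = 5

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(rf_grid, combo)) for combo in product(*rf_grid.values())]
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
# 24 candidates x 5 folds = 120 fits
```

This is why the grids for the slower models are kept small.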


In [None]:
# Determine best model from CV results
best_model_name = max(cv_results, key=lambda n: cv_results[n]["Mean F1"])
print(f"Best model from CV: {best_model_name}")
print(f"Mean F1: {cv_results[best_model_name]['Mean F1']:.4f}")


In [None]:
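# Define hyperparameter grids for each model
param_grids = {
    "Naive Bayes": {
        "alpha": [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
    },
    # The "estimator__" prefix reaches the LinearSVC wrapped by CalibratedClassifierCV.
    # The wrapper's parameter is named "estimator" from scikit-learn 1.2; older releases call it "base_estimator".
    "SVM (LinearSVC)": {
        "estimator__C": [0.01, 0.1, 1.0, 10.0],
        "estimator__max_iter": [5000, 10000]
    },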
    "Logistic Regression": {
        "C": [0.01, 0.1, 1.0, 10.0, 100.0],
        "solver": ["lbfgs", "liblinear"]
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5]
    }
}

# Fresh, unfitted instance of every candidate; only the best one is tuned
best_model_instances = {
    "Naive Bayes": MultinomialNB(),
    "SVM (LinearSVC)": CalibratedClassifierCV(LinearSVC(random_state=42)),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42, n_jobs=-1)
}

print(f"\n--- GridSearchCV for {best_model_name} ---")
grid_search = GridSearchCV(
    estimator=best_model_instances[best_model_name],
    param_grid=param_grids[best_model_name],
    cv=5,
    scoring="f1",
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV F1-Score: {grid_search.best_score_:.4f}")


In [None]:
# Evaluate tuned model on test set
tuned_model = grid_search.best_estimator_
y_pred_tuned = tuned_model.predict(X_test)

print("=" * 60)
print(f"TUNED {best_model_name.upper()} \u2014 TEST SET RESULTS")
print("=" * 60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_tuned):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_tuned):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_tuned):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_tuned):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_tuned, target_names=le.classes_))

# Compare before vs after tuning
orig_f1 = results[best_model_name]["F1-Score"]
tuned_f1 = f1_score(y_test, y_pred_tuned)
print(f"Before tuning F1: {orig_f1:.4f}")
print(f"After tuning F1:  {tuned_f1:.4f}")
print(f"Improvement:      {(tuned_f1 - orig_f1):.4f}")


---
## **Step 8: Save Best Model**

We save the best model and the TF-IDF vectorizer using **joblib** so they can be loaded later for the web application.
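
An alternative worth considering for deployment is to bundle the vectorizer and classifier into a single scikit-learn pipeline, so the web application loads one artifact instead of keeping several files in sync. The sketch below shows a joblib round trip on toy data. The file name `spam_pipeline.joblib` and the toy texts are assumptions, and the notebook's `clean_text` step would still need to run before the pipeline.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only)
toy_texts = [
    "win free prize now", "free cash claim now",
    "lunch tomorrow at noon", "see you at dinner",
]
toy_labels = ["spam", "spam", "ham", "ham"]

bundle = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(toy_texts, toy_labels)

# Save to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "spam_pipeline.joblib")  # hypothetical file name
    joblib.dump(bundle, bundle_path)
    restored = joblib.load(bundle_path)

new_messages = ["claim your free prize", "dinner at noon tomorrow"]
same_predictions = list(restored.predict(new_messages)) == list(bundle.predict(new_messages))
print(f"Restored pipeline matches original: {same_predictions}")
```

Files written by joblib should be loaded with the same scikit-learn version that produced them, and only from trusted sources, since loading executes pickled code.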


In [None]:
# Save model and vectorizer
os.makedirs("saved_models", exist_ok=True)

model_path = "saved_models/spam_model.pkl"
vectorizer_path = "saved_models/tfidf_vectorizer.pkl"
encoder_path = "saved_models/label_encoder.pkl"

joblib.dump(tuned_model, model_path)
joblib.dump(tfidf, vectorizer_path)
joblib.dump(le, encoder_path)

print(f"Model saved to: {model_path}")
print(f"Vectorizer saved to: {vectorizer_path}")
print(f"Label encoder saved to: {encoder_path}")

# Verify file sizes
for path in [model_path, vectorizer_path, encoder_path]:
    size_kb = os.path.getsize(path) / 1024
    print(f"   {path}: {size_kb:.1f} KB")


---
## **Step 9: Prediction Interface**

Create a reusable function to predict whether a given text is spam or ham.


In [None]:
# Load saved model (simulating deployment)
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_encoder = joblib.load(encoder_path)

def predict_spam(text, model=loaded_model, vectorizer=loaded_vectorizer, encoder=loaded_encoder):
    """
    Predict whether a text message is Spam or Ham.

    Args:
        text (str): Input text message
        model: Trained ML model
        vectorizer: Fitted TF-IDF vectorizer
        encoder: Fitted label encoder

    Returns:
        dict: Prediction result with label and confidence
    """
    # Clean the text
    cleaned = clean_text(text)

    # Vectorize
    features = vectorizer.transform([cleaned])

    # Predict
    prediction = model.predict(features)[0]
    label = encoder.inverse_transform([prediction])[0]

    # Confidence score
    if hasattr(model, "predict_proba"):
        proba = model.predict_proba(features)[0]
        confidence = max(proba)
    else:
        confidence = None

    return {
        "text": text,
        "prediction": label,
        # Compare against None explicitly so a valid numeric score is never reported as "N/A"
        "confidence": f"{confidence:.2%}" if confidence is not None else "N/A"
    }

print("Prediction function ready!")


In [None]:
# Test predictions with sample texts
sample_texts = [
    "Congratulations! You have won a $1000 Walmart gift card. Click here to claim now!",
    "Hey, are you free for lunch tomorrow? Let me know.",
    "URGENT: Your account has been compromised. Verify your identity immediately at this link.",
    "Hi Mom, I will be home for dinner tonight. See you at 7!",
    "FREE entry in 2 weekly competitions. Text WIN to 80808. Conditions apply.",
    "Can you send me the notes from today's lecture? Thanks!",
    "You have been selected for a secret shopper position. Earn $500/day working from home!",
    "Meeting rescheduled to 3 PM. Please confirm your attendance.",
]

print("=" * 70)
print("SPAM DETECTION \u2014 SAMPLE PREDICTIONS")
print("=" * 70)

for text in sample_texts:
    result = predict_spam(text)
    icon = "SPAM" if result["prediction"] == "spam" else "HAM"
    display_text = text[:80] + ("..." if len(text) > 80 else "")
    print(f"\n[{icon}] (Confidence: {result['confidence']})")
    print(f"   \"{display_text}\"")


---
## **Step 10: Summary & Conclusions**

### **Key Findings:**

1. **Dataset**: We analyzed a dataset of text messages labeled as spam or ham.
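2. **Preprocessing**: Text was cleaned with NLTK (lowercasing; removal of URLs, email addresses, numbers, punctuation, and stopwords).
3. **Feature Extraction**: TF-IDF vectorization with unigrams and bigrams (max 5,000 features).
4. **Models Trained**: Naive Bayes, SVM, Logistic Regression, Random Forest.
5. **Evaluation**: Each model was scored on the held-out 20% test set using Accuracy, Precision, Recall, and F1-Score, alongside confusion matrices and ROC curves; the measured values are reported in Step 5.2.
6. **Cross-Validation**: 5-fold stratified CV on F1 gave a per-model mean and standard deviation, a check that the model ranking does not hinge on a single train/test split.
7. **Hyperparameter Tuning**: GridSearchCV tuned the model with the highest mean CV F1, and its test-set F1 was compared before and after tuning.
8. **Model Saved**: The tuned model, TF-IDF vectorizer, and label encoder were exported with joblib for deployment in the web application.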

### **Next Steps (Assignment 3 - Web Application):**
- Integrate the saved model into a **FastAPI** backend
- Build a **React/HTML** frontend for user interaction
- Add data visualization features to help users understand results
- Deploy the application for user testing

---
*TechNova Team COS30049 Computing Technology Innovation Project*
