## Instructions 

### It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set (https://archive.ics.uci.edu/dataset/94/spambase)

### For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source, such as your own spam folder).

### For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages) that have already been classified (e.g., tagged), then analyze these documents to predict how new documents should be classified.

In [1]:
#pip install ucimlrepo

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 

{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

### X contains 57 numeric features (word frequencies, character frequencies, capitalization stats, etc.)

### y  contains the binary target variable (1 = spam, 0 = not spam)

### There are 4601 email samples, and the dataset's stated typical performance is about 7% misclassification error. The dataset is small and already numeric, so no text preprocessing is needed here, and I can move straight to training and evaluating models.

In [2]:
# Train Test Split 
from sklearn.model_selection import train_test_split

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [3]:
# Classifier training 
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from tqdm.notebook import tqdm  # progress bar
import pandas as pd

# Flatten y to avoid DataConversionWarning
y_train_flat = y_train.values.ravel()
y_test_flat = y_test.values.ravel()

# Scale features for SVM and Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models
models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Max Entropy (Logistic Regression)": LogisticRegression(max_iter=5000, random_state=42),
    "Support Vector Machine": SVC(kernel='linear', random_state=42)
}

# Store results
results = []

# Add progress bar
for name, model in tqdm(models.items(), desc="Training Models"):
    # Use scaled features for models sensitive to feature magnitude
    if name in ["Max Entropy (Logistic Regression)", "Support Vector Machine"]:
        model.fit(X_train_scaled, y_train_flat)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train_flat)
        y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test_flat, y_pred)
    report = classification_report(y_test_flat, y_pred, output_dict=True)
    
    results.append({
        "Model": name,
        "Accuracy": acc,
        "Precision (Spam=1)": report['1']['precision'],
        "Recall (Spam=1)": report['1']['recall'],
        "F1-score (Spam=1)": report['1']['f1-score']
    })

# Display results
results_df = pd.DataFrame(results).sort_values(by="Accuracy", ascending=False).reset_index(drop=True)
print("Model Comparison Results:\n")
display(results_df)


Model Comparison Results:



| Model | Accuracy | Precision (Spam=1) | Recall (Spam=1) | F1-score (Spam=1) |
|---|---|---|---|---|
| Support Vector Machine | 0.926167 | 0.935135 | 0.887179 | 0.910526 |
| Max Entropy (Logistic Regression) | 0.919653 | 0.931694 | 0.874359 | 0.902116 |
| Decision Tree | 0.918567 | 0.92 | 0.884615 | 0.901961 |
| Naive Bayes | 0.820847 | 0.719298 | 0.946154 | 0.817276 |


In [4]:
# Cross-validation to confirm accuracy score stability

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
import numpy as np

# Scale features for SVM and Logistic Regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define stratified k-fold
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define models
models = {
    "Linear SVM": LinearSVC(max_iter=10000, random_state=42, dual=False),
    "Max Entropy (Logistic Regression)": LogisticRegression(max_iter=5000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": MultinomialNB()  # note: the hold-out comparison above used GaussianNB
}

# Store results
cv_results = []

for name, model in models.items():
    print(f"Running 5-Fold CV for {name} ...")
    
    # Use scaled data for Linear SVM and Logistic Regression
    if name in ["Linear SVM", "Max Entropy (Logistic Regression)"]:
        X_use = X_scaled
    else:
        X_use = X
    
    accuracy = cross_val_score(model, X_use, np.ravel(y), cv=kfold, scoring='accuracy')
    precision = cross_val_score(model, X_use, np.ravel(y), cv=kfold, scoring=make_scorer(precision_score))
    recall = cross_val_score(model, X_use, np.ravel(y), cv=kfold, scoring=make_scorer(recall_score))
    f1 = cross_val_score(model, X_use, np.ravel(y), cv=kfold, scoring=make_scorer(f1_score))
    
    cv_results.append({
        "Model": name,
        "Accuracy Mean": np.mean(accuracy),
        "Accuracy Std": np.std(accuracy),
        "Precision Mean": np.mean(precision),
        "Recall Mean": np.mean(recall),
        "F1 Mean": np.mean(f1)
    })

# Convert to DataFrame and display
cv_df = pd.DataFrame(cv_results)
print("\nCross-Validation Results (5-Fold):")
display(cv_df.sort_values(by="Accuracy Mean", ascending=False))


Running 5-Fold CV for Linear SVM ...
Running 5-Fold CV for Max Entropy (Logistic Regression) ...
Running 5-Fold CV for Decision Tree ...
Running 5-Fold CV for Naive Bayes ...

Cross-Validation Results (5-Fold):


| Model | Accuracy Mean | Accuracy Std | Precision Mean | Recall Mean | F1 Mean |
|---|---|---|---|---|---|
| Max Entropy (Logistic Regression) | 0.925019 | 0.006116 | 0.920916 | 0.886355 | 0.903028 |
| Linear SVM | 0.924801 | 0.005574 | 0.921365 | 0.88525 | 0.902665 |
| Decision Tree | 0.906327 | 0.006282 | 0.876786 | 0.887469 | 0.881878 |
| Naive Bayes | 0.791565 | 0.01304 | 0.743051 | 0.721469 | 0.731883 |
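
### One caveat about the cross-validation cell: the scaler is fit on all of X before the folds are split, so each held-out fold has already influenced the scaling. The effect is usually small, but a pipeline avoids it by refitting the scaler inside every fold. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for Spambase (57 numeric features, binary target), and the names `X_demo`, `y_demo`, `pipe` and `kfold_demo` are hypothetical. It also uses `cross_validate` to collect all four metrics in a single pass.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Spambase (assumption: 57 numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=57, n_informative=10, random_state=42
)

# The scaler is part of the estimator, so it is refit on the
# training portion of every fold and never sees the held-out portion
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000, random_state=42),
)

kfold_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

# One call evaluates every metric on the same folds
scores = cross_validate(pipe, X_demo, y_demo, cv=kfold_demo, scoring=metrics)

for metric in metrics:
    values = scores[f"test_{metric}"]
    print(f"{metric:9s}: mean={values.mean():.3f}  std={values.std():.3f}")
```

### With the real data, `X_demo` and `y_demo` would be replaced by `X` and `np.ravel(y)`; the numbers may then differ slightly from the table above.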


### Overall, the linear Support Vector Machine and Logistic Regression perform nearly identically, with about 92.5% mean accuracy and low standard deviations across folds (about 0.006), which indicates that the models are stable. Precision is the fraction of emails predicted as spam that actually were spam. Recall is the fraction of actual spam emails that were correctly identified. These are about 92.1% and 88.6% respectively, so the Max Entropy and SVM models rarely label good emails as spam and catch most actual spam emails. F1 is the harmonic mean of precision and recall, and is about 90% for both models. The models are now trained and cross-validated, and ready to be used with new data. I will use Kaggle's SMS Spam Collection dataset to see how a model trained on email data performs on text-message data.
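
### To make the metric definitions concrete, here is a small worked example that computes accuracy, precision, recall and F1 directly from confusion-matrix counts. The counts are hypothetical, chosen only to illustrate the formulas; they are not taken from the Spambase results, and `spam_metrics` is a helper name introduced for this sketch.

```python
def spam_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the spam (positive) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted spam, how much really was spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real spam, how much was caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for 1,000 emails (illustration only)
example = spam_metrics(tp=350, fp=30, fn=45, tn=575)
for name, value in example.items():
    print(f"{name:9s}: {value:.4f}")
```

### Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot reach a high F1 by trading away recall for precision or the reverse.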

In [5]:
import os
import shutil

# This requires Kaggle username and key: http://bit.ly/kaggle-creds
# Define the .kaggle folder
kaggle_folder = os.path.join(os.path.expanduser("~"), ".kaggle")
os.makedirs(kaggle_folder, exist_ok=True)

# Copy kaggle.json to new location
shutil.copy(r"C:\Users\Ron\OneDrive\Desktop\kaggle.json", os.path.join(kaggle_folder, "kaggle.json"))

# Fetch data using credentials in kaggle.json
import opendatasets as od
od.download("https://www.kaggle.com/uciml/sms-spam-collection-dataset")

# Load Data
# os.path.join keeps the path portable across Windows, macOS and Linux
kaggle_df = pd.read_csv(os.path.join("sms-spam-collection-dataset", "spam.csv"), encoding='latin-1')
# Keep only relevant columns and rename
kaggle_df = kaggle_df[['v1','v2']].rename(columns={'v1':'Category','v2':'Message'})
print(kaggle_df.head())

Skipping, found downloaded files in ".\sms-spam-collection-dataset" (use force=True to force download)
  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


### The SMS data has been loaded from Kaggle, but it is raw text, whereas Spambase is numeric (word and character frequencies, capital-run statistics). The SVM trained on Spambase can't be applied directly to SMS text, because the messages don't come with the original 57 numeric features and there's no ready-made 1-to-1 mapping. I therefore need to compute the same 57 features from each SMS message so that it matches the format of the training data.

In [6]:
# Compute Spambase features from Kaggle SMS Messages

import pandas as pd
import numpy as np
import re

# List of words from Spambase features
spambase_words = [
    "make", "address", "all", "3d", "our", "over", "remove", "internet", 
    "order", "mail", "receive", "will", "people", "report", "addresses", 
    "free", "business", "email", "you", "credit", "your", "font", "000",
    "money", "hp", "hpl", "george", "650", "lab", "labs", "telnet", "857",
    "data", "415", "85", "technology", "1999", "parts", "pm", "direct",
    "cs", "meeting", "original", "project", "re", "edu", "table", "conference"
]

# List of characters for char_freq features
spambase_chars = [';', '(', '[', '!', '$', '#']

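def compute_spambase_features(message):
    """Return the 57 Spambase-style features for one message."""
    message_lower = message.lower()
    # Whitespace-delimited tokens, split once instead of once per word.
    # Note: a token with attached punctuation ("free!") is not counted as "free".
    tokens = message_lower.split()
    total_words = len(re.findall(r'\b\w+\b', message_lower)) or 1
    total_chars = len(message) or 1
    
    features = []

    # Word frequencies (percentage of the words in the message)
    for w in spambase_words:
        freq = 100 * tokens.count(w) / total_words
        features.append(freq)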
    
    # Char frequencies
    for c in spambase_chars:
        freq = 100 * message.count(c) / total_chars
        features.append(freq)
    
    # Capital run lengths
    caps = re.findall(r'[A-Z]+', message)
    if caps:
        lengths = [len(seq) for seq in caps]
        # capital_run_length_average
        features.append(np.mean(lengths))  
        # capital_run_length_longest
        features.append(np.max(lengths)) 
        # capital_run_length_total
        features.append(np.sum(lengths))         
    else:
        features.extend([0, 0, 0])
    
    return features
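
### One detail of the feature function is worth illustrating. Spambase's documentation defines a word as a string of alphanumeric characters bounded by non-alphanumeric characters, while the function above counts whitespace-delimited tokens, so a word followed directly by punctuation is missed. The sketch below contrasts the two approaches on a made-up message; `word_freq_whitespace` and `word_freq_alnum` are hypothetical helpers written for this comparison and are not the code that produced the results in this notebook.

```python
import re

def word_freq_whitespace(message, word):
    """Percentage of whitespace-delimited tokens equal to `word`."""
    tokens = message.lower().split()
    return 100 * tokens.count(word) / len(tokens) if tokens else 0.0

def word_freq_alnum(message, word):
    """Percentage of alphanumeric tokens equal to `word` (Spambase-style)."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return 100 * tokens.count(word) / len(tokens) if tokens else 0.0

# Made-up message in which "free" always touches punctuation
sample = "Claim your FREE! prize now, it's free."

ws = word_freq_whitespace(sample, "free")
an = word_freq_alnum(sample, "free")
print(f"whitespace tokens  : {ws:.1f}")
print(f"alphanumeric tokens: {an:.1f}")
```

### Switching the feature function to the alphanumeric tokenization would change the SMS features and therefore the metrics reported below, so it is shown here only as an option to consider.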


In [11]:
# Fit the SVM on the full Spambase training data
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler

# Fit scaler on Spambase features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X from Spambase

# Fit LinearSVC on the full dataset
svm_model = LinearSVC(max_iter=10000, random_state=42, dual=False)
svm_model.fit(X_scaled, np.ravel(y))  # y from Spambase

# Compute features for all SMS messages
X_sms_features = np.array([compute_spambase_features(msg) for msg in kaggle_df['Message']])

# Check shape of Spambase and Kaggle SMS features
print("Shape of SMS features:", X_sms_features.shape)
print("Shape expected by scaler:", X_scaled.shape)

Shape of SMS features: (5572, 57)
Shape expected by scaler: (4601, 57)


### With the features prepped, I can get predictions and accuracy metrics. 

In [15]:
# Scale SMS features with the same scaler
X_sms_scaled = scaler.transform(X_sms_features)
# Predictions
predictions = svm_model.predict(X_sms_scaled)

kaggle_df['Predicted'] = predictions
kaggle_df['Predicted_Label'] = kaggle_df['Predicted'].map({0: 'ham', 1: 'spam'})
print(kaggle_df[['Category', 'Message', 'Predicted_Label']].head(10))

# Get metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# True labels
y_true = (kaggle_df['Category'] == 'spam').astype(int)

# Predicted labels
y_pred = kaggle_df['Predicted']

# Compute metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Display results
print("_________________________________________________________")
print("SMS Spam Prediction Metrics (using Spambase-trained SVM):")
print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")

  Category                                            Message Predicted_Label
0      ham  Go until jurong point, crazy.. Available only ...             ham
1      ham                      Ok lar... Joking wif u oni...             ham
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...            spam
3      ham  U dun say so early hor... U c already then say...             ham
4      ham  Nah I don't think he goes to usf, he lives aro...             ham
5     spam  FreeMsg Hey there darling it's been 3 week's n...             ham
6      ham  Even my brother is not like to speak with me. ...             ham
7      ham  As per your request 'Melle Melle (Oru Minnamin...            spam
8     spam  WINNER!! As a valued network customer you have...             ham
9     spam  Had your mobile 11 months or more? U R entitle...            spam
_________________________________________________________
SMS Spam Prediction Metrics (using Spambase-trained SVM):
Accuracy : 0.8110
Precisio



### Looking at the results, the SVM trained on Spambase's email data does not perform as well on SMS text: accuracy falls to 81.1%, compared with about 92.5% in cross-validation on Spambase. Two factors likely contribute. First, the SMS dataset is imbalanced, with more non-spam than spam messages. Second, the feature distributions differ: the word usage, message length, and general linguistic structure of informal text messages are unlike those of the emails the model was trained on.
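
### On an imbalanced test set, accuracy is easier to interpret next to a majority-class baseline (always predicting the most common label) and a breakdown of the error types. The sketch below shows both on hypothetical toy labels; the helper names and the toy data are assumptions made for illustration, not values from the SMS results.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive class."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == positive and p == positive for t, p in pairs),
        "fp": sum(t != positive and p == positive for t, p in pairs),
        "fn": sum(t == positive and p != positive for t, p in pairs),
        "tn": sum(t != positive and p != positive for t, p in pairs),
    }

# Hypothetical toy labels (1 = spam, 0 = ham)
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

baseline = majority_baseline_accuracy(y_true_toy)
cm = confusion_counts(y_true_toy, y_pred_toy)
model_acc = (cm["tp"] + cm["tn"]) / len(y_true_toy)

print(f"Majority baseline accuracy: {baseline:.2f}")
print(f"Model accuracy            : {model_acc:.2f}")
print(f"Confusion counts          : {cm}")
```

### Applied to the notebook's `y_true` and `y_pred`, the same two checks would show whether the model beats the trivial baseline and whether its errors are mostly missed spam or ham flagged as spam.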