<a href="https://colab.research.google.com/github/PhuocOng/Spam-Classifier-Email/blob/main/Text_Classification_A_Spam_Classifier_for_emails.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
!pip install datasets



In [10]:
!pip install transformers



In [11]:
from datasets import load_dataset

# Load the SMS Spam Collection dataset from the Hugging Face Hub.
dataset = load_dataset("sms_spam")

In [80]:
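from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 adds a binary classification head (0 = not spam, 1 = spam).
# The head is newly initialized (see the warning below), so predictions are
# close to random until the model is fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)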

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
import torch

In [32]:
def preprocess(text):
    """Tokenize text into PyTorch tensors (input_ids and attention_mask)."""
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    print(f"After processing the {text}, the inputs we have is:")
    print(inputs)
    return inputs
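
To make the effect of `padding` and the `attention_mask` concrete, here is a minimal pure-Python sketch of padding a batch of token-id lists to a common length. The helper `pad_batch` is hypothetical and not part of `transformers`. It assumes a pad token id of 0, which matches `[PAD]` in the `distilbert-base-uncased` vocabulary. The real tokenizer handles this internally, so treat this only as an illustration.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad each list of token ids to the longest length in the batch.

    Hypothetical helper for illustration; returns (input_ids, attention_mask).
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens first, then padding up to the batch maximum.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token to attend to, 0 marks padding to ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Two toy sequences wrapped in [CLS] (101) and [SEP] (102).
ids, mask = pad_batch([[101, 4931, 1010, 102], [101, 102]])
print(ids)   # [[101, 4931, 1010, 102], [101, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

When `preprocess` receives a single string there is nothing to pad against, which is why every `attention_mask` entry in the notebook's printed output is 1.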

In [63]:
def predict_spam(text):
    """Return the predicted class index (0 = not spam, 1 = spam)."""
    inputs = preprocess(text)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)

    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()
    return predicted_class

def predict_spam_confidence(text):
    """Return the softmax probability of the predicted class."""
    inputs = preprocess(text)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=1)
    predicted_class = torch.argmax(logits, dim=1).item()
    confidence = probabilities[0, predicted_class].item()
    return confidence

In [96]:
def classify_message(text):
    """Return (label, confidence) for one message, printing intermediate values."""
    class_map = {0: "Not Spam", 1: "Spam"}
    inputs = preprocess(text)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    print(f"logits: {logits}")
    probabilities = torch.softmax(logits, dim=1)  # Convert logits to probabilities
    predicted_class = torch.argmax(logits, dim=1).item()
    print(f"probabilities: {probabilities}")
    print(f"predicted_class: {predicted_class}")
    # Both indexing styles select the same element; .item() converts it to a Python float.
    print("probabilities[0][0]", probabilities[0][0])
    print("probabilities[0][1]", probabilities[0][1])
    print("probabilities[0, 0]", probabilities[0, 0])
    print("probabilities[0, 1]", probabilities[0, 1].item())
    confidence = probabilities[0][predicted_class].item()  # Get confidence score

    return class_map[predicted_class], confidence

In [98]:
example_messages = [
    "Congratulations! You've won a $1000 Walmart gift card. Click here to claim now.",
    "Hey, are we still meeting for coffee later?",
    "URGENT! Your account has been compromised. Reset your password immediately.",
    "Can you send me the project report by tomorrow?",
    "eiufnqbuivbeuiv",
    "Hi Peter! We have a job give you $500000 in 1 hour!!",
    "Hey, do you want to win the fantasy football league this year? Let's strategize!",
    "Hey Mom, my college is offering free lunch this Friday. Want to join?",
    "You have WON a brand new car!!! Click this link NOW to claim your prize.",
]

for msg in example_messages:
    label, confidence = classify_message(msg)
    print(f"📩 Message: {msg}")
    print(f"🔍 Prediction: {label} (Confidence: {confidence:.4f})\n")

After processing the Congratulations! You've won a $1000 Walmart gift card. Click here to claim now., the inputs we have is:
{'input_ids': tensor([[  101, 23156,   999,  2017,  1005,  2310,  2180,  1037,  1002,  6694,
         24547, 22345,  5592,  4003,  1012, 11562,  2182,  2000,  4366,  2085,
          1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
logits: tensor([[ 0.0069, -0.0142]])
probabilities: tensor([[0.5053, 0.4947]])
predicted_class: 0
probabilities[0][0] tensor(0.5053)
probabilities[0][1] tensor(0.4947)
probabilities[0, 0] tensor(0.5053)
probabilities[0, 1] 0.49470338225364685
📩 Message: Congratulations! You've won a $1000 Walmart gift card. Click here to claim now.
🔍 Prediction: Not Spam (Confidence: 0.5053)
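
The probabilities printed above can be checked by hand: softmax exponentiates each logit and divides by the sum of the exponentials. Below is a minimal pure-Python sketch using the rounded logits printed for the first message. The `softmax` helper is written here for illustration and is not `torch.softmax`, so small differences in the last digits are expected.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.0069, -0.0142]  # rounded logits from the output above
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # argmax
confidence = probs[predicted_class]

print([round(p, 4) for p in probs])           # [0.5053, 0.4947]
print(predicted_class, round(confidence, 4))  # 0 0.5053
```

A confidence of about 0.5 on a two-class problem is essentially a coin flip. That is consistent with the warning earlier in the notebook that the classifier weights are newly initialized and the model still needs to be trained before its predictions mean anything.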

After processing the Hey, are we still meeting for coffee later?, the inputs we have is:
{'input_ids': tensor([[ 101, 4931, 1010, 2024, 2057, 2145, 3116, 2005, 4157, 2101, 1029,  102]]), 'attention_mask'