# **Exercise for Unit 4.1 - NLP Text Classification**

**Name:** Kirk Henrich Gamo & Myrrhea Belle B. Junsay <br>
**Date:** February 12, 2026 <br>
**Year and Section:** BSCS3A -AI <br>

This notebook implements a Naïve Bayes classifier by hand to label messages as spam or ham, using a bag of words, class priors, and Laplace-smoothed word likelihoods.

### Dataset
| doc | class |
|-----|-------|
| Free money now!!! | SPAM |
| Hi mom, how are you? | HAM |
| Lowest price for your meds | SPAM |
| Are we still on for dinner? | HAM |
| Win a free iPhone today | SPAM |
| Let's catch up tomorrow at the office | HAM |
| Meeting at 3 PM tomorrow | HAM |
| Get 50% off, limited time! | SPAM |
| Team meeting in the office | HAM |
| Click here for prizes! | SPAM |
| Can you send the report? | HAM |

In [1]:
import re
from collections import defaultdict
import math

# Dataset
documents = [
    ("Free money now!!!", "SPAM"),
    ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"),
    ("Are we still on for dinner?", "HAM"),
    ("Win a free iPhone today", "SPAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"),
    ("Get 50% off, limited time!", "SPAM"),
    ("Team meeting in the office", "HAM"),
    ("Click here for prizes!", "SPAM"),
    ("Can you send the report?", "HAM")
]

## a. Generate a Bag of Words (for word frequency)

In [2]:
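def preprocess_text(text):
    """
    Preprocess text: lowercase it and split it into word tokens
    """
    # Lowercase so that "Free" and "free" count as the same token
    text = text.lower()
    # Keep runs of word characters (letters, digits, underscore); punctuation
    # is dropped, so "Let's" becomes "let" and "s", and "50%" becomes "50"
    words = re.findall(r'\w+', text)
    return words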

def generate_bag_of_words(documents):
    """
    Generate a bag of words with word frequency for each class
    Returns:
        - vocabulary: set of all unique words
        - word_freq_by_class: dict with class -> word frequencies
    """
    vocabulary = set()
    word_freq_by_class = defaultdict(lambda: defaultdict(int))
    
    for doc, label in documents:
        words = preprocess_text(doc)
        for word in words:
            vocabulary.add(word)
            word_freq_by_class[label][word] += 1
    
    return vocabulary, word_freq_by_class

# Generate Bag of Words
vocabulary, word_freq_by_class = generate_bag_of_words(documents)

print("=" * 60)
print("PART A: BAG OF WORDS")
print("=" * 60)
print(f"\nVocabulary size: {len(vocabulary)}")
print(f"\nVocabulary: {sorted(vocabulary)}")

print(f"\n--- Word Frequency for SPAM ---")
print(dict(word_freq_by_class['SPAM']))

print(f"\n--- Word Frequency for HAM ---")
print(dict(word_freq_by_class['HAM']))

PART A: BAG OF WORDS

Vocabulary size: 45

Vocabulary: ['3', '50', 'a', 'are', 'at', 'can', 'catch', 'click', 'dinner', 'for', 'free', 'get', 'here', 'hi', 'how', 'in', 'iphone', 'let', 'limited', 'lowest', 'meds', 'meeting', 'mom', 'money', 'now', 'off', 'office', 'on', 'pm', 'price', 'prizes', 'report', 's', 'send', 'still', 'team', 'the', 'time', 'today', 'tomorrow', 'up', 'we', 'win', 'you', 'your']

--- Word Frequency for SPAM ---
{'free': 2, 'money': 1, 'now': 1, 'lowest': 1, 'price': 1, 'for': 2, 'your': 1, 'meds': 1, 'win': 1, 'a': 1, 'iphone': 1, 'today': 1, 'get': 1, '50': 1, 'off': 1, 'limited': 1, 'time': 1, 'click': 1, 'here': 1, 'prizes': 1}

--- Word Frequency for HAM ---
{'hi': 1, 'mom': 1, 'how': 1, 'are': 2, 'you': 2, 'we': 1, 'still': 1, 'on': 1, 'for': 1, 'dinner': 1, 'let': 1, 's': 1, 'catch': 1, 'up': 1, 'tomorrow': 2, 'at': 2, 'the': 3, 'office': 2, 'meeting': 2, '3': 1, 'pm': 1, 'team': 1, 'in': 1, 'can': 1, 'send': 1, 'report': 1}
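
A quick check of the tokenizer may help explain two vocabulary entries that look odd at first: the stray `s` (from "Let's") and `50` (from "50%"). The sketch below re-creates the same `\w+` tokenization in a standalone helper so that it runs on its own; the name `tokenize` is illustrative and not part of the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of word characters."""
    return re.findall(r"\w+", text.lower())

# The apostrophe is not a word character, so "Let's" splits in two.
print(tokenize("Let's catch up tomorrow at the office"))
# ['let', 's', 'catch', 'up', 'tomorrow', 'at', 'the', 'office']

# The percent sign and comma are dropped; the digits survive as a token.
print(tokenize("Get 50% off, limited time!"))
# ['get', '50', 'off', 'limited', 'time']
```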


## b. Calculate the Prior for the classes HAM and SPAM

In [3]:
def calculate_priors(documents):
    """
    Calculate prior probability P(class) for each class
    Prior = (count of documents in class) / (total documents)
    """
    class_counts = defaultdict(int)
    total_docs = len(documents)
    
    for doc, label in documents:
        class_counts[label] += 1
    
    priors = {}
    for class_label in class_counts:
        priors[class_label] = class_counts[class_label] / total_docs
    
    return priors, class_counts

priors, class_counts = calculate_priors(documents)

print("\n" + "=" * 60)
print("PART B: CLASS PRIORS")
print("=" * 60)
print(f"\nTotal documents: {len(documents)}")
print(f"\nClass counts:")
for class_label, count in class_counts.items():
    print(f"  {class_label}: {count}")

print(f"\nPrior Probabilities:")
for class_label, prior in priors.items():
    print(f"  P({class_label}) = {class_counts[class_label]}/{len(documents)} = {prior:.4f}")


PART B: CLASS PRIORS

Total documents: 11

Class counts:
  SPAM: 5
  HAM: 6

Prior Probabilities:
  P(SPAM) = 5/11 = 0.4545
  P(HAM) = 6/11 = 0.5455
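
As a sanity check, the priors can also be computed as exact fractions and confirmed to sum to 1. This is a minimal standalone sketch; `label_list`, `doc_counts`, and `exact_priors` are illustrative names chosen so they do not overwrite the notebook's own variables.

```python
from collections import Counter
from fractions import Fraction

# Class labels of the 11 training documents, in dataset order.
label_list = ["SPAM", "HAM", "SPAM", "HAM", "SPAM", "HAM",
              "HAM", "SPAM", "HAM", "SPAM", "HAM"]

doc_counts = Counter(label_list)
exact_priors = {c: Fraction(n, len(label_list)) for c, n in doc_counts.items()}

for c, p in exact_priors.items():
    print(f"P({c}) = {p} ~ {float(p):.4f}")

# The priors form a probability distribution, so they must sum to exactly 1.
assert sum(exact_priors.values()) == 1
```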


## c. Calculate the Likelihood of the tokens in the vocabulary with respect to the class

In [4]:
def calculate_likelihoods(vocabulary, word_freq_by_class, classes):
    """
    Calculate likelihood P(word|class) for each word in vocabulary and each class
    Using Laplace smoothing: P(word|class) = (count(word in class) + 1) / (total words in class + vocabulary size)
    """
    likelihoods = {}
    vocab_size = len(vocabulary)
    
    for class_label in classes:
        likelihoods[class_label] = {}
        # Total word count in this class
        total_words = sum(word_freq_by_class[class_label].values())
        
        for word in vocabulary:
            # Laplace smoothing: add 1 to count and to denominator
            count = word_freq_by_class[class_label].get(word, 0)
            likelihood = (count + 1) / (total_words + vocab_size)
            likelihoods[class_label][word] = likelihood
    
    return likelihoods

classes = ['HAM', 'SPAM']
likelihoods = calculate_likelihoods(vocabulary, word_freq_by_class, classes)

print("\n" + "=" * 60)
print("PART C: LIKELIHOODS P(word|class)")
print("=" * 60)

print(f"\nVocabulary size: {len(vocabulary)}")
print(f"\n--- Likelihood for SPAM (sample words) ---")
spam_words = sorted(word_freq_by_class['SPAM'].items(), key=lambda x: x[1], reverse=True)[:10]
for word, count in spam_words:
    print(f"  P({word}|SPAM) = {likelihoods['SPAM'][word]:.6f}")

print(f"\n--- Likelihood for HAM (sample words) ---")
ham_words = sorted(word_freq_by_class['HAM'].items(), key=lambda x: x[1], reverse=True)[:10]
for word, count in ham_words:
    print(f"  P({word}|HAM) = {likelihoods['HAM'][word]:.6f}")


PART C: LIKELIHOODS P(word|class)

Vocabulary size: 45

--- Likelihood for SPAM (sample words) ---
  P(free|SPAM) = 0.044776
  P(for|SPAM) = 0.044776
  P(money|SPAM) = 0.029851
  P(now|SPAM) = 0.029851
  P(lowest|SPAM) = 0.029851
  P(price|SPAM) = 0.029851
  P(your|SPAM) = 0.029851
  P(meds|SPAM) = 0.029851
  P(win|SPAM) = 0.029851
  P(a|SPAM) = 0.029851

--- Likelihood for HAM (sample words) ---
  P(the|HAM) = 0.050633
  P(are|HAM) = 0.037975
  P(you|HAM) = 0.037975
  P(tomorrow|HAM) = 0.037975
  P(at|HAM) = 0.037975
  P(office|HAM) = 0.037975
  P(meeting|HAM) = 0.037975
  P(hi|HAM) = 0.025316
  P(mom|HAM) = 0.025316
  P(how|HAM) = 0.025316
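
The printed likelihoods can be reproduced by hand from the Part A counts. The sketch below assumes the token totals obtained by summing the two frequency tables (22 for SPAM, 34 for HAM) and the vocabulary size of 45; `smoothed` is an illustrative helper, not a function from the notebook.

```python
from fractions import Fraction

VOCAB_SIZE = 45                          # vocabulary size reported in Part A
TOTAL_TOKENS = {"SPAM": 22, "HAM": 34}   # sums of the Part A frequency tables

def smoothed(count, label):
    """Add-one (Laplace) estimate of P(word | label) as an exact fraction."""
    return Fraction(count + 1, TOTAL_TOKENS[label] + VOCAB_SIZE)

p_free_spam = smoothed(2, "SPAM")   # "free" occurs twice in SPAM
p_the_ham = smoothed(3, "HAM")      # "the" occurs three times in HAM
p_free_ham = smoothed(0, "HAM")     # "free" never occurs in HAM

print(p_free_spam, round(float(p_free_spam), 6))   # 3/67 0.044776
print(p_the_ham, round(float(p_the_ham), 6))       # 4/79 0.050633
print(p_free_ham, round(float(p_free_ham), 6))     # 1/79 0.012658
```

Because every vocabulary word gets one extra count and the denominator grows by the vocabulary size, the smoothed likelihoods of each class still sum to 1 over the vocabulary, and no word ends up with probability zero.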


## d. Determine the class of test sentences

In [5]:
def predict_class(text, priors, likelihoods, vocabulary, classes):
    """
    Predict the class of a given text using Naïve Bayes
    P(class|document) ∝ P(class) * ∏P(word|class) for all words in document
    """
    words = preprocess_text(text)
    
    scores = {}
    for class_label in classes:
        # Start with the prior probability (using log for numerical stability)
        score = math.log(priors[class_label])
        
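        # Add the log-likelihood of each word; summing logs is equivalent to
        # multiplying the likelihoods but avoids floating-point underflow
        for word in words:
            if word in vocabulary:
                score += math.log(likelihoods[class_label][word])
            else:
                # Out-of-vocabulary word: apply Laplace smoothing with a count of 0.
                # Note: this reads the module-level word_freq_by_class.
                score += math.log(1 / (sum(word_freq_by_class[class_label].values()) + len(vocabulary)))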
        
        scores[class_label] = score
    
    # Find the class with the highest score
    predicted_class = max(scores, key=scores.get)
    return predicted_class, scores

print("\n" + "=" * 60)
print("PART D: CLASSIFICATION OF TEST SENTENCES")
print("=" * 60)

# Test sentences
test_sentences = [
    "Limited offer, click here!",
    "Meeting at 2 PM with the manager."
]

for i, test_sentence in enumerate(test_sentences, 1):
    predicted_class, scores = predict_class(test_sentence, priors, likelihoods, vocabulary, classes)
    print(f"\n{i}. Test Sentence: \"{test_sentence}\"")
    print(f"   Predicted Class: {predicted_class}")
    print(f"   Scores: HAM={scores['HAM']:.4f}, SPAM={scores['SPAM']:.4f}")


PART D: CLASSIFICATION OF TEST SENTENCES

1. Test Sentence: "Limited offer, click here!"
   Predicted Class: SPAM
   Scores: HAM=-18.0839, SPAM=-15.5278

2. Test Sentence: "Meeting at 2 PM with the manager."
   Predicted Class: HAM
   Scores: HAM=-26.9156, SPAM=-30.2213
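
The scores for the first test sentence can be checked by hand. The sketch below plugs in the priors, token totals, and vocabulary size from Parts A to C; the variable names are illustrative, and the out-of-vocabulary token "offer" is treated the same way as in `predict_class` (a smoothed count of 0).

```python
import math

# Values taken from Parts A-C.
prior_spam, prior_ham = 5 / 11, 6 / 11
denom_spam = 22 + 45   # SPAM tokens + vocabulary size
denom_ham = 34 + 45    # HAM tokens + vocabulary size

# "Limited offer, click here!" -> limited, offer, click, here
# SPAM saw "limited", "click", "here" once each; "offer" is out of vocabulary.
score_spam = (math.log(prior_spam)
              + 3 * math.log(2 / denom_spam)
              + math.log(1 / denom_spam))

# HAM saw none of the four tokens, so each contributes log(1/79).
score_ham = math.log(prior_ham) + 4 * math.log(1 / denom_ham)

print(f"HAM={score_ham:.4f}, SPAM={score_spam:.4f}")
# HAM=-18.0839, SPAM=-15.5278
```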


## Summary of Results

In [None]:
print("\n" + "=" * 60)
print("MANUAL NAÏVE BAYES - SUMMARY")
print("=" * 60)

print(f"\n✓ Bag of Words Generated: Vocabulary size = {len(vocabulary)}")
print(f"\n✓ Class Priors Calculated:")
for class_label in classes:
    print(f"  - P({class_label}) = {priors[class_label]:.4f}")

print(f"\n✓ Likelihoods Calculated: P(word|class) for all {len(vocabulary)} words")

print(f"\n✓ Test Sentences Classified:")
for i, test_sentence in enumerate(test_sentences, 1):
    predicted_class, _ = predict_class(test_sentence, priors, likelihoods, vocabulary, classes)
    print(f"  {i}. \"{test_sentence}\" → {predicted_class}")


MANUAL NAÏVE BAYES - SUMMARY

✓ Bag of Words Generated: Vocabulary size = 45

✓ Class Priors Calculated:
  - P(HAM) = 0.5455
  - P(SPAM) = 0.4545

✓ Likelihoods Calculated: P(word|class) for all 45 words

✓ Test Sentences Classified:
  1. "Limited offer, click here!" → SPAM
  2. "Meeting at 2 PM with the manager." → HAM
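
The log scores from Part D are unnormalized, so they show which class wins but not by how much. One common way to turn them into probabilities that sum to 1 is to exponentiate after subtracting the largest score (the log-sum-exp trick). The helper `normalize_log_scores` below is an illustrative addition, not part of the exercise, and the resulting numbers are only as trustworthy as the model's independence assumption.

```python
import math

def normalize_log_scores(log_scores):
    """Convert unnormalized log scores into probabilities that sum to 1."""
    peak = max(log_scores.values())   # subtracting the max avoids underflow
    shifted = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

# Scores printed in Part D for "Limited offer, click here!"
posterior = normalize_log_scores({"HAM": -18.0839, "SPAM": -15.5278})
for label, prob in posterior.items():
    print(f"P({label} | sentence) ~ {prob:.3f}")
```

For the first test sentence this gives roughly 0.93 for SPAM, a clear but not overwhelming margin given how few training documents there are.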


## Part 2: Using Scikit-Learn

Use the scikit-learn package to train and test a Multinomial Naïve Bayes classifier.

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Convert the existing documents dataset to a DataFrame
docs = [doc for doc, label in documents]
labels = [label for doc, label in documents]

df = pd.DataFrame({
    'doc': docs,
    'class': labels
})

display(df.head())

| | doc | class |
|---|-----|-------|
| 0 | Free money now!!! | SPAM |
| 1 | Hi mom, how are you? | HAM |
| 2 | Lowest price for your meds | SPAM |
| 3 | Are we still on for dinner? | HAM |
| 4 | Win a free iPhone today | SPAM |


In [8]:
model = make_pipeline(CountVectorizer(), MultinomialNB())

model.fit(df['doc'], df['class'])

print("Multinomial Naive Bayes model trained successfully using a pipeline.")

Multinomial Naive Bayes model trained successfully using a pipeline.
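
One difference from Part 1 is worth keeping in mind: the documented default `token_pattern` of `CountVectorizer` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, whereas `preprocess_text` also keeps single-character tokens. The sketch below imitates both tokenizers with the standard `re` module; it does not call scikit-learn, and the helper names are illustrative.

```python
import re

# Default token pattern documented for CountVectorizer: 2+ word characters.
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

def manual_tokens(text):
    """Tokenization used in Part 1: every run of word characters."""
    return re.findall(r"\w+", text.lower())

def sklearn_like_tokens(text):
    """Approximation of CountVectorizer's default lowercasing and tokenization."""
    return re.findall(SKLEARN_DEFAULT_PATTERN, text.lower())

sample = "Meeting at 3 PM tomorrow"
print(manual_tokens(sample))        # ['meeting', 'at', '3', 'pm', 'tomorrow']
print(sklearn_like_tokens(sample))  # ['meeting', 'at', 'pm', 'tomorrow']
```

Under this default, single-character tokens such as `a`, `s`, and `3` would not become features, so the two models do not work from exactly the same vocabulary.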


### a. Determine the class of the following test sentences:
1. Limited offer, click here!
2. Meeting at 2 PM with the manager.

In [9]:
test_sentences = [
    'Limited offer, click here!',
    'Meeting at 2 PM with the manager.'
]

print("\n--- Predictions using Scikit-Learn Multinomial Naïve Bayes ---\n")

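# Predict all sentences in one call; predict() accepts an iterable of raw strings
predictions = model.predict(test_sentences)
for sentence, prediction in zip(test_sentences, predictions):
    print(f"'{sentence}' is classified as: {prediction}")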

print("\n--- Prediction process complete ---")


--- Predictions using Scikit-Learn Multinomial Naïve Bayes ---

'Limited offer, click here!' is classified as: SPAM
'Meeting at 2 PM with the manager.' is classified as: HAM

--- Prediction process complete ---
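
Both approaches agree on the two test sentences, even though they treat unseen words differently: `predict_class` assigns an out-of-vocabulary word a smoothed likelihood, while a `CountVectorizer` fitted on the training data simply has no column for it, so the word is ignored. The standard-library sketch below approximates that second behaviour (tokens of two or more characters, add-one smoothing, unseen tokens skipped). It is an illustration under those assumptions, not scikit-learn's code, and names such as `nb_sketch` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

DOCS = [
    ("Free money now!!!", "SPAM"), ("Hi mom, how are you?", "HAM"),
    ("Lowest price for your meds", "SPAM"), ("Are we still on for dinner?", "HAM"),
    ("Win a free iPhone today", "SPAM"),
    ("Let's catch up tomorrow at the office", "HAM"),
    ("Meeting at 3 PM tomorrow", "HAM"), ("Get 50% off, limited time!", "SPAM"),
    ("Team meeting in the office", "HAM"), ("Click here for prizes!", "SPAM"),
    ("Can you send the report?", "HAM"),
]

def tokens(text):
    """Approximate CountVectorizer defaults: lowercase, tokens of 2+ characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def train(docs, alpha=1.0):
    """Collect document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(tokens(text))
    vocab = set().union(*word_counts.values())
    return class_docs, word_counts, vocab, alpha

def predict(text, fitted):
    """Return the best class and the log score of every class."""
    class_docs, word_counts, vocab, alpha = fitted
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        score = math.log(class_docs[label] / n_docs)
        for tok in tokens(text):
            if tok in vocab:   # unseen tokens are skipped, not smoothed
                score += math.log((word_counts[label][tok] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get), scores

nb_sketch = train(DOCS)
print("Vocabulary size:", len(nb_sketch[2]))
for sentence in ["Limited offer, click here!", "Meeting at 2 PM with the manager."]:
    label, _ = predict(sentence, nb_sketch)
    print(f"'{sentence}' -> {label}")
```

Skipping unseen tokens affects every class equally, whereas the smoothed fallback in Part 1 penalizes the class with more training tokens slightly more; on this dataset the predicted labels come out the same either way.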
