# Naive Bayes Message Classification

This notebook demonstrates a simple **Naive Bayes classifier** using word probabilities to classify short messages as either **spam** or **ham (not spam)**.


import pandas as pd
from collections import Counter

## Step 1: Define the Dataset

We start with a small sample dataset of labeled messages.
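
The cell that builds this table is not shown, so the following is a minimal sketch. It assumes the data lives in a pandas DataFrame called `df` (a hypothetical name) with the `message` and `label` columns that appear in the output.

```python
import pandas as pd

# Five labeled messages, copied from the table in this step.
# The variable name `df` is an assumption; the original cell is not shown.
df = pd.DataFrame({
    "message": [
        "free money now",
        "call now for free prize",
        "hello how are you",
        "meet me at noon",
        "win free ticket now",
    ],
    "label": ["spam", "spam", "ham", "ham", "spam"],
})
print(df)
```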


| | message | label |
|---|---|---|
| 0 | free money now | spam |
| 1 | call now for free prize | spam |
| 2 | hello how are you | ham |
| 3 | meet me at noon | ham |
| 4 | win free ticket now | spam |


## Step 2: Tokenize Messages

We split each message into lowercase words (tokens) for word frequency analysis.
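
One way to produce the `tokens` column is sketched below. It assumes whitespace splitting is sufficient, which holds for these messages because they contain no punctuation. The DataFrame name `df` is hypothetical.

```python
import pandas as pd

# Rebuild the sample data so this sketch runs on its own.
df = pd.DataFrame({
    "message": [
        "free money now",
        "call now for free prize",
        "hello how are you",
        "meet me at noon",
        "win free ticket now",
    ],
    "label": ["spam", "spam", "ham", "ham", "spam"],
})

# Lowercase each message, then split on whitespace to get a list of tokens.
df["tokens"] = df["message"].str.lower().str.split()
print(df)
```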


| | message | label | tokens |
|---|---|---|---|
| 0 | free money now | spam | [free, money, now] |
| 1 | call now for free prize | spam | [call, now, for, free, prize] |
| 2 | hello how are you | ham | [hello, how, are, you] |
| 3 | meet me at noon | ham | [meet, me, at, noon] |
| 4 | win free ticket now | spam | [win, free, ticket, now] |


## Step 3: Count Word Frequencies

We count the frequency of each word in spam and ham messages separately.
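
A sketch of the counting step, using plain Python lists instead of a DataFrame to keep it short. The names `spam_counts` and `ham_counts` are assumptions; the output only shows that two `Counter` objects are returned, spam first.

```python
from collections import Counter

# (message, label) pairs copied from the dataset in Step 1.
messages = [
    ("free money now", "spam"),
    ("call now for free prize", "spam"),
    ("hello how are you", "ham"),
    ("meet me at noon", "ham"),
    ("win free ticket now", "spam"),
]

spam_counts = Counter()
ham_counts = Counter()
for text, label in messages:
    tokens = text.lower().split()
    # Add this message's tokens to the counter for its class.
    if label == "spam":
        spam_counts.update(tokens)
    else:
        ham_counts.update(tokens)

print(spam_counts)
print(ham_counts)
```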


(Counter({'free': 3,
          'now': 3,
          'money': 1,
          'call': 1,
          'for': 1,
          'prize': 1,
          'win': 1,
          'ticket': 1}),
 Counter({'hello': 1,
          'how': 1,
          'are': 1,
          'you': 1,
          'meet': 1,
          'me': 1,
          'at': 1,
          'noon': 1}))

## Step 4: Compute Prior Probabilities

We estimate each class prior as the fraction of training messages with that label: 3 of the 5 messages are spam and 2 are ham.
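
As a quick sketch, the priors can be computed directly from the label list. The names `p_spam` and `p_ham` are hypothetical.

```python
# Labels copied from the dataset in Step 1.
labels = ["spam", "spam", "ham", "ham", "spam"]

# Prior = fraction of training messages belonging to each class.
p_spam = labels.count("spam") / len(labels)
p_ham = labels.count("ham") / len(labels)
print((p_spam, p_ham))
```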


(0.6, 0.4)

## Step 5: Define Vocabulary and Word Probability Function

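The vocabulary is the set of all distinct words seen in the training messages. We use Laplace (add-one) smoothing so that a word never seen in a class still gets a small nonzero probability instead of forcing the whole product to zero.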
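
The original cell is not shown; the sketch below is one plausible form of it. The function name `word_prob`, its signature, and the default `alpha=1` are assumptions, with `alpha=1` corresponding to add-one smoothing.

```python
from collections import Counter

# Per-class word counts, rebuilt from the training messages.
spam_counts = Counter(
    "free money now call now for free prize win free ticket now".split()
)
ham_counts = Counter("hello how are you meet me at noon".split())

# Vocabulary: every distinct word seen in either class.
vocab = set(spam_counts) | set(ham_counts)

def word_prob(word, counts, vocab_size, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    total = sum(counts.values())
    # Counter returns 0 for unseen words, so they get alpha / denominator.
    return (counts[word] + alpha) / (total + alpha * vocab_size)

print(len(vocab))
print(word_prob("free", spam_counts, len(vocab)))
print(word_prob("free", ham_counts, len(vocab)))
```

With 16 vocabulary words, 12 spam tokens, and 8 ham tokens, `free` gets (3 + 1) / (12 + 16) under spam and (0 + 1) / (8 + 16) under ham.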


## Step 6: Classify New Message

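Now we classify a new message using the Naive Bayes formula: for each class, we multiply the class prior by the smoothed probability of every token in the message.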
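
The message being classified is not shown in this notebook. The sketch below assumes it is `free money now`, chosen only for illustration, and computes the unnormalized score of each class with add-one smoothing. The helper `score` is hypothetical.

```python
from collections import Counter

# Training statistics from the earlier steps.
spam_counts = Counter(
    "free money now call now for free prize win free ticket now".split()
)
ham_counts = Counter("hello how are you meet me at noon".split())
vocab_size = len(set(spam_counts) | set(ham_counts))
priors = {"spam": 0.6, "ham": 0.4}

def score(tokens, counts, prior):
    """Unnormalized posterior: prior times the product of smoothed word probabilities."""
    total = sum(counts.values())
    result = prior
    for word in tokens:
        result *= (counts[word] + 1) / (total + vocab_size)
    return result

new_message = "free money now"  # assumed test message, not from the source
tokens = new_message.lower().split()

spam_score = score(tokens, spam_counts, priors["spam"])
ham_score = score(tokens, ham_counts, priors["ham"])
print(spam_score, ham_score)
```

For longer messages, summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow; with three tokens the direct product is fine.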


## Step 7: Normalize and Compare Probabilities

We divide each class score by the sum of both scores so that the two posteriors sum to 1, then predict the class with the higher posterior probability.
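
A sketch of the normalization, assuming the message being classified is `free money now` (the notebook does not show which message was used). Under that assumption, and with add-one smoothing over a 16-word vocabulary, the two scores are as written in the code.

```python
# Unnormalized scores for the assumed message "free money now":
# prior * P(free | class) * P(money | class) * P(now | class).
spam_score = 0.6 * (4 / 28) * (2 / 28) * (4 / 28)
ham_score = 0.4 * (1 / 24) ** 3

# Normalize so the two posteriors sum to 1.
total = spam_score + ham_score
p_spam = spam_score / total
p_ham = ham_score / total

# Predict the class with the larger posterior.
prediction = "spam" if p_spam > p_ham else "ham"
print(f"spam: {p_spam:.2f}, ham: {p_ham:.2f}, prediction: {prediction}")
```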


High probability of spam: 0.97
Lower likelihood of ham: 0.03
