### Naive Bayes (Probabilistic)

**Concept:**  
Based on **Bayes' Theorem**:

$$
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
$$

**Naive:**  
Assumes all features are **conditionally independent given the class** (e.g., within spam, seeing the word "Free" tells you nothing extra about whether "Money" appears).  
- This is rarely true in practice, yet the classifier still works surprisingly well for text classification.

**MultinomialNB:**  
Used for **discrete counts**, such as **word frequencies** in spam email filters.

## 1️⃣ Bayes' Theorem

Bayes Theorem:

P(A | B) = ( P(B | A) * P(A) ) / P(B)

Where:

- P(A | B) → Posterior (the probability of A after we have seen B)
- P(B | A) → Likelihood (the probability of seeing the evidence B if A is true)
- P(A) → Prior (how likely A is before seeing any evidence)
- P(B) → Evidence (the probability of seeing the evidence regardless of the class)

## 2️⃣ Bayes' Theorem for Classification

In ML:

$$
P(Class \mid Features) = \frac{P(Features \mid Class) \cdot P(Class)}{P(Features)}
$$


Since `P(Features)` is the same for every class, it does not affect which class wins,
so we compare only:

`P(Features | Class) * P(Class)`

and choose the class with the highest value.
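
To see why the evidence term can be dropped, here is a minimal sketch with made-up numbers (the `joint` values are hypothetical and not taken from any dataset). Dividing every score by the same `P(Features)` rescales the scores but leaves the winning class unchanged.

```python
# Hypothetical unnormalized scores P(Features | Class) * P(Class) for two classes.
joint = {"Spam": 0.012, "Ham": 0.003}

# P(Features) is the sum of the joint scores over all classes.
evidence = sum(joint.values())

# Full posteriors P(Class | Features) after dividing by the evidence.
posterior = {cls: score / evidence for cls, score in joint.items()}

# The winner is the same whether or not we divide by the evidence.
best_by_joint = max(joint, key=joint.get)
best_by_posterior = max(posterior, key=posterior.get)

print(posterior)
print(best_by_joint, best_by_posterior)
```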

## Calculate Prior Probabilities

Suppose we have:

- 400 Spam emails
- 600 Ham emails
- Total = 1000 emails

P(Spam) = 400 / 1000 = 0.4  
P(Ham) = 600 / 1000 = 0.6

P(Spam | Free)

means:
After seeing the word "free", what is the probability that the message is spam?
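
As a rough illustration of that question, the sketch below plugs the priors above into Bayes' theorem. The two likelihoods (the share of spam and of ham messages that contain "free") are assumed values chosen only for the example; they are not measured from data.

```python
# Priors from the email counts above.
p_spam, p_ham = 0.4, 0.6

# Assumed for illustration: fraction of messages in each class that contain "free".
p_free_given_spam = 0.30
p_free_given_ham = 0.02

# Evidence: overall probability of seeing "free" (law of total probability).
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Posterior: probability that a message is spam given that it contains "free".
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))  # → 0.909
```

Under these assumed likelihoods, seeing "free" raises the spam probability from the prior of 0.4 to roughly 0.91.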

## Calculate Likelihoods

Suppose from training data:

Word Counts:

| Word  | Spam Count | Ham Count |
|--------|------------|------------|
| win    | 50         | 5          |
| free   | 60         | 10         |
| money  | 45         | 8          |

Total words in Spam = 2000  
Total words in Ham = 3000  

Now calculate:

P(win | Spam) = 50 / 2000 = 0.025  
P(win | Ham) = 5 / 3000 ≈ 0.0017

Repeat for all words.
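
Repeating the calculation for every word in the table can be sketched in a few lines of Python. The dictionary names (`counts`, `total_words`, `likelihood`) are illustrative choices; the numbers come straight from the table above.

```python
# Word counts per class, copied from the table above.
counts = {
    "win":   {"Spam": 50, "Ham": 5},
    "free":  {"Spam": 60, "Ham": 10},
    "money": {"Spam": 45, "Ham": 8},
}
total_words = {"Spam": 2000, "Ham": 3000}

# P(word | class) = count of the word in that class / total words in that class
likelihood = {
    word: {cls: n / total_words[cls] for cls, n in per_class.items()}
    for word, per_class in counts.items()
}

for word, probs in likelihood.items():
    print(f"P({word} | Spam) = {probs['Spam']:.4f}   P({word} | Ham) = {probs['Ham']:.4f}")
```

These are raw, unsmoothed ratios. scikit-learn's `MultinomialNB` additionally applies additive smoothing by default (`alpha=1.0`), so its internal estimates will differ slightly from these hand calculations.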

## Naive Independence Assumption

Naive Bayes assumes:

All words are independent of one another, given the class.

So, for a message containing "win", "free", and "money":

P(Words | Spam) = P(win | Spam) * P(free | Spam) * P(money | Spam)

Then multiply by prior:

Score(Spam) = P(Words | Spam) * P(Spam)
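
A minimal sketch of this calculation for a message containing "win", "free", and "money", using the priors and word counts from the earlier sections. The `score` helper is a hypothetical name introduced for the example.

```python
from math import prod

# Priors and word counts from the worked example above.
priors = {"Spam": 0.4, "Ham": 0.6}
total_words = {"Spam": 2000, "Ham": 3000}
counts = {
    "win":   {"Spam": 50, "Ham": 5},
    "free":  {"Spam": 60, "Ham": 10},
    "money": {"Spam": 45, "Ham": 8},
}

message = ["win", "free", "money"]

def score(cls):
    """Prior times the product of per-word likelihoods (naive independence)."""
    return priors[cls] * prod(counts[w][cls] / total_words[cls] for w in message)

scores = {cls: score(cls) for cls in priors}
print(scores)
```

With these counts, Score(Spam) comes out at about 6.75e-06 and Score(Ham) at about 8.9e-09, so the spam score is several hundred times larger.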

## Final Decision Rule

We calculate two scores:

- **Score(Spam)**
- **Score(Ham)**

### Step 1: Compare the scores

If:

Score(Spam) > Score(Ham)

➡️ **Predict: Spam**

Else if:

Score(Ham) > Score(Spam)

➡️ **Predict: Ham**

---

## Mathematical Form

Score(Spam) = P(Words | Spam) × P(Spam)

Score(Ham) = P(Words | Ham) × P(Ham)

---

## Final Decision Rule (Compact Form)

Predict the class with the **higher probability score**.

$$
\hat{y} = \arg\max_{class \in \{Spam, Ham\}} P(Words \mid class) \times P(class)
$$
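
Multiplying many small probabilities can underflow to zero for long messages, so implementations commonly compare sums of logarithms instead. This is a standard trick rather than something taken from this notebook. A minimal sketch, reusing the example counts from the likelihood table (the `log_score` and `predict` helpers are hypothetical names):

```python
import math

priors = {"Spam": 0.4, "Ham": 0.6}
total_words = {"Spam": 2000, "Ham": 3000}
counts = {
    "win":   {"Spam": 50, "Ham": 5},
    "free":  {"Spam": 60, "Ham": 10},
    "money": {"Spam": 45, "Ham": 8},
}

def log_score(cls, words):
    """log P(class) + sum of log P(word | class); same ranking as the product form."""
    return math.log(priors[cls]) + sum(
        math.log(counts[w][cls] / total_words[cls]) for w in words
    )

def predict(words):
    """Return the class with the highest log score (the arg max)."""
    return max(priors, key=lambda cls: log_score(cls, words))

print(predict(["win", "free", "money"]))  # → Spam
```

Because the logarithm is monotonic, the class ranking is identical to the product form. This sketch only handles the three words in the table; any other word would raise a `KeyError`.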

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

# 1. Load Data (SMS Spam Collection)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df_sms = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

In [6]:
df_sms

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [9]:
df_sms.duplicated().sum()

403

In [11]:
df_sms.drop_duplicates(inplace=True)

In [12]:
df_sms

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [15]:


# Convert labels to numbers (ham=0, spam=1)
df_sms['label_num'] = df_sms.label.map({'ham':0, 'spam':1})

X = df_sms['message']
y = df_sms['label_num']

In [16]:
df_sms

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,1
5568,ham,Will ü b going to esplanade fr home?,0
5569,ham,"Pity, * was in mood for that. So...any other s...",0
5570,ham,The guy did some bitching but I acted like i'd...,0


In [17]:
df_sms.head(-1)

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
...,...,...,...
5566,spam,REMINDER FROM O2: To get 2.50 pounds free call...,1
5567,spam,This is the 2nd time we have tried 2 contact u...,1
5568,ham,Will ü b going to esplanade fr home?,0
5569,ham,"Pity, * was in mood for that. So...any other s...",0


In [19]:
df_sms.shape

(5169, 3)

In [20]:
# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Text Preprocessing (Bag of Words)
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary from the training set, then count
X_test_dtm = vect.transform(X_test)        # count only, reusing the training vocabulary

# 3. Train Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
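
Conceptually, `fit_transform` learns a vocabulary from the training messages and counts words, while `transform` reuses that vocabulary and ignores words it has never seen. The following is a simplified pure-Python sketch of that idea, not scikit-learn's actual implementation. The `tokenize`, `fit_vocab`, and `to_counts` helpers and the whitespace tokenizer are hypothetical simplifications.

```python
from collections import Counter

def tokenize(text):
    # Simplified: lowercase and split on whitespace (CountVectorizer's real tokenizer differs).
    return text.lower().split()

def fit_vocab(docs):
    """Learn a sorted vocabulary mapping each word to a column index."""
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def to_counts(doc, vocab):
    """Count words in one document; words outside the vocabulary are ignored."""
    row = [0] * len(vocab)
    for word, n in Counter(tokenize(doc)).items():
        if word in vocab:
            row[vocab[word]] = n
    return row

train = ["free money now", "call me now"]
vocab = fit_vocab(train)
print(vocab)                                      # → {'call': 0, 'free': 1, 'me': 2, 'money': 3, 'now': 4}
print(to_counts("free free prize now", vocab))    # → [0, 2, 0, 0, 1]
```

The word "prize" never appeared in the toy training set, so it contributes nothing to the count vector. This is why the vocabulary should be learned from the training split only.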


In [21]:
# 4. Evaluate
y_pred = nb.predict(X_test_dtm)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Test a custom message
custom_msg = ["Congratulations! You won a free lottery ticket. Call now."]
custom_vec = vect.transform(custom_msg)
print(f"Prediction for '{custom_msg[0]}': {'Spam' if nb.predict(custom_vec)[0]==1 else 'Ham'}")

Confusion Matrix:
 [[886   8]
 [ 11 129]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99       894
           1       0.94      0.92      0.93       140

    accuracy                           0.98      1034
   macro avg       0.96      0.96      0.96      1034
weighted avg       0.98      0.98      0.98      1034

Prediction for 'Congratulations! You won a free lottery ticket. Call now.': Spam
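
The spam row of the classification report can be re-derived by hand from the confusion matrix above. This sketch assumes scikit-learn's convention that rows are actual labels and columns are predicted labels, in the order ham (0), spam (1).

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
tn, fp = 886, 8     # actual ham:  predicted ham, predicted spam
fn, tp = 11, 129    # actual spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # of the messages flagged as spam, how many really were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# → precision=0.94 recall=0.92 f1=0.93 accuracy=0.98
```

The 8 false positives are legitimate messages that would land in the spam folder, which is usually the costlier mistake for a spam filter.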


In [23]:
test_msgs = [
    "Hey, are we meeting tomorrow?",
    "Free entry in 2 a weekly competition! Call now!",
    "Can you send me the report?",
    "Congratulations! You won a prize."
]

test_vec = vect.transform(test_msgs)
predictions = nb.predict(test_vec)

for msg, pred in zip(test_msgs, predictions):
    print(f"{msg} --> {'Spam' if pred==1 else 'Ham'}")

Hey, are we meeting tomorrow? --> Ham
Free entry in 2 a weekly competition! Call now! --> Spam
Can you send me the report? --> Ham
Congratulations! You won a prize. --> Spam


# 📧 Rules That Commonly Make an Email Spam

Spam filters use different rules and machine learning techniques to classify emails as Spam or Ham (Not Spam).

---

## 1️⃣ Suspicious Keywords

Common spam words:
- "Congratulations"
- "You won"
- "Free"
- "Lottery"
- "Urgent"
- "Act now"
- "Call immediately"

Emails with many promotional or urgent words are often flagged.
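
A keyword rule like this could be sketched as a simple case-insensitive lookup. The `keyword_hits` helper is a hypothetical name, and the list is just the keywords from above.

```python
# Keywords taken from the list above, lowercased for matching.
SPAM_KEYWORDS = [
    "congratulations", "you won", "free", "lottery",
    "urgent", "act now", "call immediately",
]

def keyword_hits(text):
    """Return the listed keywords that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in SPAM_KEYWORDS if kw in lowered]

hits = keyword_hits("Congratulations! You won a free lottery ticket. Call now.")
print(hits)  # → ['congratulations', 'you won', 'free', 'lottery']
```

Substring matching is deliberately crude: it would also flag "freedom" because it contains "free". That weakness is one reason statistical approaches such as Naive Bayes tend to be preferred over fixed keyword lists.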

---

## 2️⃣ Excessive Capitalization & Symbols

- ALL CAPS TEXT
- Too many exclamation marks (!!!!!!)
- Repeated symbols such as `$$$$$`
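
These signals are easy to turn into numeric features. A minimal sketch follows; the `shouting_features` helper and its feature names are hypothetical, and any cutoffs for "excessive" would need to be chosen separately.

```python
def shouting_features(text):
    """Compute simple capitalization and symbol counts for a message."""
    letters = [ch for ch in text if ch.isalpha()]
    # Share of alphabetic characters that are uppercase (0.0 if there are no letters).
    caps_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "caps_ratio": caps_ratio,
        "exclamations": text.count("!"),
        "dollar_signs": text.count("$"),
    }

features = shouting_features("WIN $$$ NOW!!!")
print(features)  # → {'caps_ratio': 1.0, 'exclamations': 3, 'dollar_signs': 3}
```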

---

## 3️⃣ Suspicious Links

- Unknown domains
- Shortened URLs
- Too many links
- Mismatch between link text and actual URL

---

## 4️⃣ Poor Grammar and Spelling

Example:
> "You are winner claim prize now urgent reply"

---

## 5️⃣ Asking for Sensitive Information

Spam emails often request:
- Passwords
- Bank details
- Credit card numbers
- OTP codes

---

## 6️⃣ Suspicious Attachments

Common risky file types:
- .exe
- .zip
- .scr
- .docm
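
An extension check along these lines might look like the sketch below. The `is_risky_attachment` helper is a hypothetical name, and the set contains only the file types listed above.

```python
from pathlib import PurePath

# File extensions from the list above.
RISKY_EXTENSIONS = {".exe", ".zip", ".scr", ".docm"}

def is_risky_attachment(filename):
    """Check the final extension of a filename, ignoring case."""
    return PurePath(filename).suffix.lower() in RISKY_EXTENSIONS

print(is_risky_attachment("Invoice.EXE"))    # → True
print(is_risky_attachment("photo.pdf.scr"))  # → True
print(is_risky_attachment("report.docx"))    # → False
```

Only the last suffix is inspected, so a double extension such as `photo.pdf.scr` is still caught. A filename alone says nothing about the file's real contents, so this is at best a first-pass filter.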

---

## 7️⃣ Machine Learning Detection (Naive Bayes)

Spam filters calculate:

P(Spam | Message)

If the probability is high → classified as Spam.
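
To turn the two unnormalized scores into an actual probability, divide the spam score by the sum of both scores. The sketch below uses the priors and word counts from the worked example earlier in this document. The `THRESHOLD` of 0.5 and the `spam_probability` helper are assumptions for illustration.

```python
def spam_probability(score_spam, score_ham):
    """Normalize the two unnormalized scores into P(Spam | Message)."""
    return score_spam / (score_spam + score_ham)

# Scores for a message containing "win", "free", "money" (prior * word likelihoods).
score_spam = 0.4 * (50 / 2000) * (60 / 2000) * (45 / 2000)
score_ham = 0.6 * (5 / 3000) * (10 / 3000) * (8 / 3000)

THRESHOLD = 0.5  # assumed cutoff; a real filter would tune this value

p = spam_probability(score_spam, score_ham)
label = "Spam" if p >= THRESHOLD else "Ham"
print(f"P(Spam | Message) = {p:.4f} -> {label}")
```

Raising the threshold trades missed spam for fewer legitimate messages wrongly flagged.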

In [24]:
df=pd.read_csv("IMDB-Dataset.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [25]:
df["review"] = df["review"].str.replace("<br /><br />", "", regex=False)  # strip HTML line breaks (literal match)

In [26]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [28]:
df["sentiment_label"]=df.sentiment.map({'negative':0,"positive":1})
df

Unnamed: 0,review,sentiment,sentiment_label
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. The filming tec...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
...,...,...,...
49995,I thought this movie did a down right good job...,positive,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,0
49997,I am a Catholic taught in parochial elementary...,negative,0
49998,I'm going to have to disagree with the previou...,negative,0


In [29]:



X = df['review']
y = df['sentiment_label']

# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Text Preprocessing (Bag of Words)
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary from the training set, then count
X_test_dtm = vect.transform(X_test)        # count only, reusing the training vocabulary

# 3. Train Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

In [40]:
# 4. Evaluate
y_pred = nb.predict(X_test_dtm)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Test a custom review
custom_msg = ["Wow!"]
custom_vec = vect.transform(custom_msg)
print(f"Prediction for '{custom_msg[0]}': {'Positive' if nb.predict(custom_vec)[0]==1 else 'Negative'}")

Confusion Matrix:
 [[4362  599]
 [ 909 4130]]

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.88      0.85      4961
           1       0.87      0.82      0.85      5039

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000

Prediction for 'Wow!': Negative
