Naive Bayes Classifier: a simple yet effective algorithm widely used in spam detection, text classification, and medical diagnosis.

✅ NAIVE BAYES – Step-by-Step
🔹 Why it matters:
- Based on Bayes’ Theorem
- Assumes features are independent (naive!)
- Fast and effective for high-dimensional data (like text)
- Performs surprisingly well in real-world applications
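To make the "naive" independence assumption concrete, here is a minimal hand-computed sketch. The priors and per-word likelihoods are hypothetical numbers chosen for illustration, not values estimated from any dataset. The point is only that the score for a class is its prior times the product of per-feature likelihoods.

```python
from math import prod

# Hypothetical, hand-picked probabilities (illustration only)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "win": 0.20},
    "ham": {"free": 0.05, "win": 0.02},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Unnormalized score: P(class) * product of P(word | class)
    scores = {
        label: priors[label] * prod(likelihoods[label][w] for w in words)
        for label in priors
    }
    total = sum(scores.values())  # P(words), the normalizing constant
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "win"])
print(result)  # spam ≈ 0.976, ham ≈ 0.024
```

Because the features are treated as independent given the class, the model only needs one likelihood per (word, class) pair, which is why it stays fast on high-dimensional data.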

In [11]:
# ✅ STEP 1: Load Dataset (we'll use Iris for simplicity)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X = iris.data
y = iris.target
print(X)
print(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [12]:
# ✅ STEP 2: Train Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()


In [13]:
# ✅ STEP 3: Predict + Accuracy
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



🔍 What's Going On?
GaussianNB assumes each continuous feature is normally distributed within each class (a good fit for features like height or weight).

It calculates:

$$P(\text{class} \mid \text{features}) = \frac{P(\text{features} \mid \text{class}) \cdot P(\text{class})}{P(\text{features})}$$

and picks the class with the maximum probability.

Next up: Multinomial Naive Bayes, a variant well suited to text classification (spam detection, news categorization, sentiment analysis).

✅ MULTINOMIAL NAIVE BAYES – For Text Data
🔹 Why it matters:
- Used when features are discrete counts, like word frequency in documents.
- Fast, efficient, and surprisingly accurate for many NLP tasks.

In [14]:
# ✅ STEP 1: Sample Dataset (SMS Spam)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

# Load data
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Encode labels
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df)


      label                                            message
0         0  Go until jurong point, crazy.. Available only ...
1         0                      Ok lar... Joking wif u oni...
2         1  Free entry in 2 a wkly comp to win FA Cup fina...
3         0  U dun say so early hor... U c already then say...
4         0  Nah I don't think he goes to usf, he lives aro...
...     ...                                                ...
5567      1  This is the 2nd time we have tried 2 contact u...
5568      0               Will ü b going to esplanade fr home?
5569      0  Pity, * was in mood for that. So...any other s...
5570      0  The guy did some bitching but I acted like i'd...
5571      0                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [15]:
# ✅ STEP 2: Text Preprocessing (Bag of Words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])
y = df['label']
print(X)
print(y)


<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 74169 stored elements and shape (5572, 8713)>
  Coords	Values
  (0, 3571)	1
  (0, 8084)	1
  (0, 4374)	1
  (0, 5958)	1
  (0, 2338)	1
  (0, 1316)	1
  (0, 5571)	1
  (0, 4114)	1
  (0, 1767)	1
  (0, 3655)	1
  (0, 8548)	1
  (0, 4501)	1
  (0, 1765)	1
  (0, 2061)	1
  (0, 7694)	1
  (0, 3615)	1
  (0, 1082)	1
  (0, 8324)	1
  (1, 5538)	1
  (1, 4537)	1
  (1, 4342)	1
  (1, 8450)	1
  (1, 5567)	1
  (2, 4114)	1
  (2, 3373)	1
  :	:
  (5570, 4245)	1
  (5570, 8371)	1
  (5570, 1097)	1
  (5570, 4642)	1
  (5570, 7089)	1
  (5570, 3323)	1
  (5570, 7674)	1
  (5570, 1451)	1
  (5570, 5367)	1
  (5570, 2606)	1
  (5570, 8120)	1
  (5570, 1794)	1
  (5570, 7099)	1
  (5570, 2905)	1
  (5570, 3489)	1
  (5570, 1802)	1
  (5570, 3709)	1
  (5570, 4188)	1
  (5570, 914)	1
  (5570, 1561)	1
  (5571, 7806)	1
  (5571, 5276)	1
  (5571, 4253)	2
  (5571, 7938)	1
  (5571, 6548)	1
0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    

In [16]:
# ✅ STEP 3: Train + Predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


In [17]:
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9856502242152466


🔍 What's Happening?

Each distinct word in the vocabulary is a feature.
CountVectorizer converts each message into a numeric vector of word counts.
Multinomial Naive Bayes uses how often each word appears in each class to predict spam vs ham.

Mini Project: YouTube Comment Spam Classifier using Multinomial Naive Bayes.
This follows the same workflow used in spam moderation systems, scaled down to a toy dataset.

✅ Project Goal:
Detect whether a YouTube comment is spam or not using word frequencies.

In [18]:
# ✅ STEP 1: Simulate a Real Dataset
import pandas as pd

data = {
    "Comment": [
        "Subscribe to my channel and win an iPhone",
        "Nice explanation, thanks a lot!",
        "Click here to get free followers",
        "Loved the tutorial, very helpful",
        "Buy cheap views now!",
        "Thanks, it worked perfectly",
        "Free Bitcoin giveaway!!! Click now",
        "Great content, keep going",
        "Visit this site for free stuff",
        "Awesome bro, keep uploading!"
    ],
    "Label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = Spam, 0 = Not Spam
}

df = pd.DataFrame(data)


In [19]:
# ✅ STEP 2: Vectorize Text
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["Comment"])
y = df["Label"]

In [20]:
# ✅ STEP 3: Train/Test Split + Model
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)


MultinomialNB()


In [21]:
# ✅ STEP 4: Predict + Accuracy
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 1.0


In [22]:
# ✅ STEP 5: Try a New Comment
comment = ["Subscribe to this channel for rewards"]
new_vector = vectorizer.transform(comment)
print("Spam" if model.predict(new_vector)[0] else "Not Spam")


Spam


🧠 Concept Reinforced:

Text → Features using CountVectorizer
Naive Bayes models word count probabilities
Simple but effective model for classification
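The vectorize-then-classify steps can also be bundled into a single object. The sketch below is one possible arrangement using scikit-learn's `make_pipeline`, trained on all ten comments from the mini project. The name `spam_filter` and the two new comments are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

comments = [
    "Subscribe to my channel and win an iPhone",
    "Nice explanation, thanks a lot!",
    "Click here to get free followers",
    "Loved the tutorial, very helpful",
    "Buy cheap views now!",
    "Thanks, it worked perfectly",
    "Free Bitcoin giveaway!!! Click now",
    "Great content, keep going",
    "Visit this site for free stuff",
    "Awesome bro, keep uploading!",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = Spam, 0 = Not Spam

# The pipeline applies CountVectorizer, then MultinomialNB, in one fit/predict call
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(comments, labels)

new_comments = ["Click now for free followers", "Thanks, very helpful tutorial"]
predictions = spam_filter.predict(new_comments)
for text, pred in zip(new_comments, predictions):
    print("Spam" if pred else "Not Spam", "<-", text)
```

A pipeline takes raw strings directly, so there is no separate `vectorizer.transform` step to forget when scoring new comments. With only ten training examples, the predictions are illustrative rather than reliable.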

🧠 Naive Bayes vs Multinomial Naive Bayes

Naive Bayes (General)
- What is it? A family of classifiers based on Bayes' theorem with feature-independence assumptions
- Use case: Text, binary, categorical, or continuous (Gaussian) features
- Examples: GaussianNB, BernoulliNB, MultinomialNB
- Formula basis: Bayes' Theorem + a feature distribution assumption (Gaussian, Bernoulli, Multinomial)

Multinomial Naive Bayes
- What is it? A specific type of Naive Bayes used when features are counts (like word frequency)
- Use case: Mostly text classification where data = word counts or tf-idf
- Examples: MultinomialNB (one member of the Naive Bayes family)
- Formula basis: Uses the multinomial distribution to model word count vectors
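As a rough illustration of the comparison, the sketch below fits three members of the family, each on the kind of feature it assumes. All of the data is made up and tiny, so this only shows that the variants share one API while differing in their distribution assumption.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

toy_labels = np.array([1, 1, 0, 0])

# Continuous measurements -> GaussianNB
continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 0.5], [3.2, 0.4]])
# Word-count style features -> MultinomialNB
counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]])
# Presence/absence features -> BernoulliNB
binary = (counts > 0).astype(int)

variants = {
    "GaussianNB": (GaussianNB(), continuous),
    "MultinomialNB": (MultinomialNB(), counts),
    "BernoulliNB": (BernoulliNB(), binary),
}

results = {}
for name, (estimator, features) in variants.items():
    # Same fit/predict interface, different likelihood model underneath
    results[name] = estimator.fit(features, toy_labels).predict(features).tolist()
    print(name, results[name])
```

Predicting on the training rows is done here only to keep the example short. It is not a measure of how well any variant generalizes.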

🎯 Summary:
"Naive Bayes" is the umbrella.
"Multinomial Naive Bayes" is used for text classification where features are word frequencies or counts.