# Naive Bayes Classification Types

| Type        | Use When Features Are... | Example                  |
| ----------- | ------------------------ | ------------------------ |
| Gaussian    | Numbers (real values)    | Age, weight, scores      |
| Multinomial | Counts                   | Word frequencies         |
| Bernoulli   | Yes/No flags             | Word exists or not       |
| Categorical | Named groups             | Car type, region, gender |
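
As a rough illustration of the table above, each variant has a matching class in `sklearn.naive_bayes`. The sketch below fits all four on tiny made-up arrays; the data, labels, and variable names are assumptions chosen only to show which kind of feature each class expects.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB

# Toy labels, invented for illustration only
y = np.array([0, 0, 1, 1])

# Gaussian: real-valued features (e.g. age, weight)
X_real = np.array([[25.0, 60.5], [30.0, 65.0], [52.0, 88.0], [60.0, 92.5]])
# Multinomial: non-negative counts (e.g. word frequencies)
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Bernoulli: binary flags (word present or not)
X_flags = (X_counts > 0).astype(int)
# Categorical: integer-encoded categories (e.g. 0=sedan, 1=SUV, 2=truck)
X_cats = np.array([[0, 1], [0, 0], [2, 1], [1, 1]])

models = {
    "Gaussian": (GaussianNB(), X_real),
    "Multinomial": (MultinomialNB(), X_counts),
    "Bernoulli": (BernoulliNB(), X_flags),
    "Categorical": (CategoricalNB(), X_cats),
}

predictions = {}
for name, (clf, X) in models.items():
    clf.fit(X, y)
    # Predict the class of the first training row as a sanity check
    predictions[name] = int(clf.predict(X[:1])[0])
    print(f"{name:12s} -> predicted class {predictions[name]}")
```

The rest of this notebook uses `MultinomialNB`, since text features are word counts (or TF-IDF weights derived from them).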


# TF-IDF

TF-IDF stands for:



> Term Frequency – Inverse Document Frequency



It’s a statistical method to measure:

> How important a word is in a document, relative to a collection of documents.

1. Term Frequency (TF)

How often does a word appear in the document?
$$ TF(t,d)=\frac{\text{Number of times term } t \text{ appears in } d}{\text{Total number of words in } d} $$

2. Inverse Document Frequency (IDF)

How rare is a word across all documents?
$$ IDF(t) =\log \frac{N}{1+df(t)} $$
where,

* N = total number of documents
* df(t)= number of documents containing word t

Combining the two:
$$ \text{TF-IDF}(t,d)=TF(t,d)\times IDF(t) $$
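
The three formulas can be checked by hand on a tiny corpus. The sketch below is plain Python and follows the definitions above literally; the three sentences and the helper names (`tf`, `idf`, `tf_idf`) are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
import math

# Tiny made-up corpus; whitespace tokenization is a simplification
docs = [
    "the rocket launched into space",
    "the team won the game",
    "the graphics card renders the game",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

def tf(term, doc_tokens):
    """Share of the document's words that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    """log(N / (1 + df(t))), as in the formula above."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(N / (1 + df))

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

score_the = tf_idf("the", tokenized[0])        # appears in every document
score_rocket = tf_idf("rocket", tokenized[0])  # appears in one document

print(f"the:    {score_the:.4f}")
print(f"rocket: {score_rocket:.4f}")
```

With this form of IDF, a word that occurs in every document gets a slightly negative weight, because $N/(1+N) < 1$. The rare word "rocket" ends up with the higher score.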


***Where TF-IDF is Used***

* Text classification (e.g., spam detection)

* Clustering (grouping documents)

* Search engines (relevance ranking)

* Keyword extraction



#  Convert to Pandas DataFrame

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np

newsgroups = fetch_20newsgroups(subset='all')

df = pd.DataFrame({
   'text':newsgroups.data,
   'label_id':newsgroups.target,
   'label_name':[newsgroups.target_names[i] for i in newsgroups.target]
})

print(df.head(100))

print(newsgroups.data[0])
print(newsgroups.target[0])
print(newsgroups.target_names[newsgroups.target[0]])  # label name of document 0

                                                 text  label_id  \
0   From: Mamatha Devineni Ratnam <mr47+@andrew.cm...        10   
1   From: mblawson@midway.ecn.uoknor.edu (Matthew ...         3   
2   From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...        17   
3   From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...         3   
4   From: Alexander Samuel McDiarmid <am2o+@andrew...         4   
..                                                ...       ...   
95  From: jcmorris@mbunix.mitre.org (Morris)\nSubj...         3   
96  From: shiva@leland.Stanford.EDU (Matt Jacobson...         2   
97  From: tmenner@sei.cmu.edu (Thomas Menner)\nSub...        10   
98  From: scatt@apg.andersen.com (Scott Cattanach)...        16   
99  From: sxs@extol.Convergent.Com (S. Sridhar)\nS...         5   

                  label_name  
0           rec.sport.hockey  
1   comp.sys.ibm.pc.hardware  
2      talk.politics.mideast  
3   comp.sys.ibm.pc.hardware  
4      comp.sys.mac.hardware  
..       

#  View First 5 Posts with Labels

In [None]:
for i in range(5):
    print(f"\n--- Document {i} ---")
    print(f"Label: {newsgroups.target_names[newsgroups.target[i]]}")
    print(newsgroups.data[i][:500])  # First 500 characters only



--- Document 0 ---
Label: rec.sport.hockey
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killin

--- Document 1 ---
Label: comp.sys.ibm.pc.hardware
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Narrowing Down the Dataset

| Subset Name | Meaning                              |
| ----------- | ------------------------------------ |
| `'train'`   | About 60% of the data → for model training |
| `'test'`    | About 40% of the data → for evaluation     |
| `'all'`     | Entire dataset (train + test)        |


Set `random_state` whenever `shuffle=True` so that the document order is reproducible across runs, especially in research or training setups.

In [None]:
categories = ['sci.space', 'comp.graphics', 'rec.sport.baseball']
training_data = fetch_20newsgroups(subset='train', categories=categories,shuffle=True, random_state=50)

In [None]:
# Print only the first 30 lines of the second document (index 1) in the training set, to avoid printing the entire long text.

print("\n".join(training_data.data[1].split("\n")[:30]))
print("Target is:",training_data.target_names[training_data.target[1]])

From: mscrap@halcyon.com (Marta Lyall)
Subject: Re: Video in/out
Organization: Northwest Nexus Inc. (206) 455-3505
Lines: 29

Organization: "A World of Information at your Fingertips"
Keywords: 

In article <628@toontown.columbiasc.ncr.com> craig@toontown.ColumbiaSC.NCR.COM (Craig S. Williamson) writes:
>
>I'm getting ready to buy a multimedia workstation and would like a little
>advice.  I need a graphics card that will do video in and out under windows.
>I was originally thinking of a Targa+ but that doesn't work under Windows.
>What cards should I be looking into?
>
>Thanks,
>Craig
>
>-- 
>                                             "To forgive is divine, to be
>-Craig Williamson                              an airhead is human."
> Craig.Williamson@ColumbiaSC.NCR.COM                -Balki Bartokomas
> craig@toontown.ColumbiaSC.NCR.COM (home)                  Perfect Strangers


Craig,

You should still consider the Targa+. I run windows 3.1 on it all the
time at work and it works f

# Counting Word Occurrences

# What is a Document-Term Matrix?

A Document-Term Matrix (DTM) is a table (or 2D array) that shows:

How often each word (term) appears in each document.

| Term ↓ / Doc → | Doc 1 | Doc 2 | Doc 3 |
| -------------- | ----- | ----- | ----- |
| space          | 1     | 0     | 2     |
| NASA           | 0     | 1     | 0     |
| goal           | 0     | 0     | 3     |
| graphics       | 1     | 1     | 0     |


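* Rows = words (terms)

* Columns = documents

* Values = word counts or TF-IDF scores

The table above is drawn with terms as rows for readability. scikit-learn stores the transpose: each row is a document and each column is a term.

This matrix becomes the input to algorithms such as KMeans clustering, LDA topic modeling, or a Naive Bayes classifier.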

# Types of DTM

| Type          | Built With          | Values Contain                   |
| ------------- | ------------------- | -------------------------------- |
| Count Matrix  | `CountVectorizer()` | Raw word counts                  |
| TF-IDF Matrix | `TfidfVectorizer()` | Weighted importance of each word |
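
To make the two rows of this table concrete, the sketch below builds both matrices on a made-up three-sentence corpus. It also checks that `TfidfVectorizer` gives the same result as `CountVectorizer` followed by `TfidfTransformer` when both use their default settings; the corpus and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up corpus for illustration
toy_docs = [
    "space shuttle reaches space station",
    "graphics card renders graphics",
    "baseball team wins baseball game",
]

# Count matrix: raw word counts per document
counts = CountVectorizer().fit_transform(toy_docs)

# TF-IDF matrix, built in two steps from the counts
two_step = TfidfTransformer().fit_transform(counts)

# TF-IDF matrix, built in one step from the raw text
one_step = TfidfVectorizer().fit_transform(toy_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Count matrix shape: ", counts.shape)
print("TF-IDF matrix shape:", one_step.shape)
print("Two-step and one-step TF-IDF agree:", same)
```

The two-step route is the one used in the cells that follow, which makes the intermediate count matrix available for inspection.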



In [None]:
count_vector = CountVectorizer()
x_train_counts = count_vector.fit_transform(training_data.data)
print(count_vector.vocabulary_, x_train_counts)

	with 252780 stored elements and shape (1774, 28834)>
  Coords	Values
  (0, 12388)	7
  (0, 16333)	2
  (0, 8874)	4
  (0, 26907)	3
  (0, 10628)	3
  (0, 15458)	1
  (0, 25105)	1
  (0, 24364)	20
  (0, 11693)	2
  (0, 625)	3
  (0, 693)	3
  (0, 14058)	3
  (0, 26205)	48
  (0, 5725)	6
  (0, 4604)	16
  (0, 5186)	30
  (0, 15763)	1
  (0, 12363)	1
  (0, 5102)	1
  (0, 21507)	1
  (0, 5057)	1
  (0, 5188)	1
  (0, 11457)	1
  (0, 17359)	8
  (0, 1075)	1
  :	:
  (1773, 8657)	1
  (1773, 25926)	1
  (1773, 4938)	1
  (1773, 16209)	2
  (1773, 23665)	2
  (1773, 5512)	1
  (1773, 7518)	2
  (1773, 28791)	6
  (1773, 28403)	1
  (1773, 21319)	1
  (1773, 4794)	1
  (1773, 23768)	1
  (1773, 27310)	2
  (1773, 20238)	5
  (1773, 28245)	2
  (1773, 4090)	1
  (1773, 4236)	1
  (1773, 4237)	1
  (1773, 10275)	1
  (1773, 11038)	1
  (1773, 11036)	1
  (1773, 11037)	1
  (1773, 18619)	1
  (1773, 18620)	1
  (1773, 16043)	1


# Explanation

| Term             | Meaning                                            |
| ---------------- | -------------------------------------------------- |
| `x_train_counts` | Sparse matrix: documents as rows, words as columns |
| `(0, 12388) 7`   | Word with index 12388 appears 7 times in doc 0     |
| `vocabulary_`    | Word → index mapping used to build the matrix      |
| Sparse Format    | Saves memory by storing only non-zero values       |
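
The entries of the sparse output are easier to read on a small example. The sketch below, using a made-up corpus, looks up the column index of one word through `vocabulary_` and then turns the stored entries of document 0 back into words; `index_to_word` and `words_in_doc0` are helper names introduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration
toy_docs = ["space space nasa", "graphics nasa", "goal goal goal space"]

vec = CountVectorizer()
X = vec.fit_transform(toy_docs)  # sparse: documents as rows, words as columns

# vocabulary_ maps word -> column index
col = vec.vocabulary_["space"]
count_in_doc0 = int(X[0, col])
print(f"'space' is column {col} and appears {count_in_doc0} times in doc 0")

# Invert the mapping to read the stored (row, column, value) entries as words
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}
coo = X.tocoo()  # coordinate format exposes row, col and data arrays
words_in_doc0 = {
    index_to_word[j]: int(v)
    for i, j, v in zip(coo.row, coo.col, coo.data)
    if i == 0
}
print(words_in_doc0)
```

Only non-zero counts are stored, which is why words absent from document 0 do not show up in `words_in_doc0`.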


In [None]:
tfid_transformer = TfidfTransformer()
x_train_tfidf = tfid_transformer.fit_transform(x_train_counts)

print(x_train_tfidf)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 252780 stored elements and shape (1774, 28834)>
  Coords	Values
  (0, 60)	0.0220399995597289
  (0, 97)	0.011634908554226994
  (0, 154)	0.009630185518216098
  (0, 308)	0.01226338322218642
  (0, 408)	0.00820444228507001
  (0, 625)	0.02382684838861426
  (0, 626)	0.028114237365476445
  (0, 693)	0.022075616415870938
  (0, 694)	0.011410958017113795
  (0, 1059)	0.01364233830514972
  (0, 1064)	0.01134095269012306
  (0, 1075)	0.00726471133796042
  (0, 1284)	0.05184861325631628
  (0, 1829)	0.007957811823156326
  (0, 1876)	0.014296813183136001
  (0, 2090)	0.010441618738317752
  (0, 2225)	0.010847132326657445
  (0, 2267)	0.010792310550677236
  (0, 2403)	0.009083865986031765
  (0, 2502)	0.011207090928265738
  (0, 2554)	0.012369215960726878
  (0, 2696)	0.011410958017113795
  (0, 3029)	0.025439815546172726
  (0, 3043)	0.015634024092787864
  (0, 3341)	0.02117070067222264
  :	:
  (1773, 21319)	0.0937351317979693
  (1773, 21866)	0.05887068146

This step converts the raw word counts into TF-IDF scores.

Some words (like "the", "and", "is") appear in almost every document, so plain counting makes them seem important, but they are not.

TF-IDF downweights common words and upweights rare but meaningful words.
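
The downweighting can be seen directly in the fitted transformer's `idf_` attribute. Note that scikit-learn's default IDF is not exactly the textbook formula given earlier: with `smooth_idf=True` it uses `ln((1 + n) / (1 + df)) + 1` and then L2-normalizes each row. The sketch below checks that on a made-up corpus; the sentences and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up corpus: "the" occurs in every document, the other words in one each
toy_docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
    "the renderer drew the scene",
]

vec = CountVectorizer()
counts = vec.fit_transform(toy_docs)

tfidf = TfidfTransformer()  # defaults: smooth_idf=True, norm="l2"
tfidf.fit(counts)

# Words in column order, so they line up with tfidf.idf_
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
idf_by_word = dict(zip(words, tfidf.idf_))

# Recompute the default IDF by hand for comparison
n_docs = counts.shape[0]
df = np.asarray((counts > 0).sum(axis=0)).ravel()
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1

print(f"idf('the')    = {idf_by_word['the']:.3f}")
print(f"idf('rocket') = {idf_by_word['rocket']:.3f}")
print("Matches hand-computed IDF:", np.allclose(tfidf.idf_, expected_idf))
```

Because of the `+ 1`, a word that appears in every document keeps a weight of 1 rather than dropping to zero or below.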

In [None]:
model = MultinomialNB().fit(x_train_tfidf, training_data.target)

| Method            | What It Does                                                      | When to Use               |
| ----------------- | ----------------------------------------------------------------- | ------------------------- |
| `fit()`           | Learns from the data (builds vocabulary, computes mean/var, etc.) | On **training** data only |
| `transform()`     | Applies what was learned to new data                              | On **both train & test**  |
| `fit_transform()` | Does both steps in one line                                       | On **training** data only |
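
A small sketch of why the distinction matters: the vocabulary is learned once from the training text, and `transform()` reuses it, so new text is mapped onto the same columns and words never seen during fitting are silently dropped. The sentences below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration
train_docs = ["nasa launched a rocket", "the pitcher threw a strike"]
test_docs = ["nasa hired a pitcher and an astronaut"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary, then counts
X_test = vec.transform(test_docs)        # counts only, vocabulary unchanged

# Words in the test sentence that were never seen during fit
unseen = [w for w in ["hired", "astronaut"] if w not in vec.vocabulary_]

print("Vocabulary:", sorted(vec.vocabulary_))
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Ignored at transform time:", unseen)
print("Total counted words in test sentence:", int(X_test.sum()))
```

Calling `fit_transform()` on the test text instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.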


In [None]:
new_sentences = [
    "NASA is planning a new mission to Mars in 2027.",
    "The 3D rendering in this graphics software is mind-blowing.",
    "The Yankees won the baseball game in extra innings.",
    "Astrophysicists discovered a new black hole near the Milky Way.",
    "Photoshop is the most powerful image editing tool I’ve used.",
    "The pitcher threw a perfect game in last night’s match.",
    "SpaceX successfully launched another satellite.",
    "OpenGL is essential for real-time rendering in game engines.",
    "The outfielder made an incredible diving catch!",

    "Astronauts train for months before heading to the International Space Station."
]


x_new_counts = count_vector.transform(new_sentences)
x_new_tfidf = tfid_transformer.transform(x_new_counts)

predicted = model.predict(x_new_tfidf)

for doc, category in zip(new_sentences,predicted):
  print('%r----------------->%s'%(doc,training_data.target_names[category]))

'NASA is planning a new mission to Mars in 2027.'----------------->sci.space
'The 3D rendering in this graphics software is mind-blowing.'----------------->comp.graphics
'The Yankees won the baseball game in extra innings.'----------------->rec.sport.baseball
'Astrophysicists discovered a new black hole near the Milky Way.'----------------->sci.space
'Photoshop is the most powerful image editing tool I’ve used.'----------------->comp.graphics
'The pitcher threw a perfect game in last night’s match.'----------------->rec.sport.baseball
'SpaceX successfully launched another satellite.'----------------->sci.space
'OpenGL is essential for real-time rendering in game engines.'----------------->rec.sport.baseball
'The outfielder made an incredible diving catch!'----------------->rec.sport.baseball
'Astronauts train for months before heading to the International Space Station.'----------------->sci.space


# Pipeline

A Pipeline is a way to chain multiple steps in your ML workflow so they run sequentially and consistently.

In [None]:
from sklearn.pipeline import Pipeline

model_pipeline = Pipeline([
    ('Counter',CountVectorizer()),
    ('vectorizer',TfidfTransformer()),
    ('model',MultinomialNB())
])

# Train the pipeline on the raw text; each step is fitted in order

model_pipeline.fit(training_data.data,training_data.target)

predicted = model_pipeline.predict(new_sentences)
print(predicted)

for doc, category in zip(new_sentences,predicted):
  print(f"{doc!r}----->{training_data.target_names[category]}")

[2 0 1 2 0 1 2 1 1 2]
'NASA is planning a new mission to Mars in 2027.'----->sci.space
'The 3D rendering in this graphics software is mind-blowing.'----->comp.graphics
'The Yankees won the baseball game in extra innings.'----->rec.sport.baseball
'Astrophysicists discovered a new black hole near the Milky Way.'----->sci.space
'Photoshop is the most powerful image editing tool I’ve used.'----->comp.graphics
'The pitcher threw a perfect game in last night’s match.'----->rec.sport.baseball
'SpaceX successfully launched another satellite.'----->sci.space
'OpenGL is essential for real-time rendering in game engines.'----->rec.sport.baseball
'The outfielder made an incredible diving catch!'----->rec.sport.baseball
'Astronauts train for months before heading to the International Space Station.'----->sci.space
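
The OpenGL sentence is labelled `rec.sport.baseball`, which looks wrong; the word "game" may be pulling it that way, though that is a guess and not verified here. One way to look into a borderline prediction is `predict_proba`, which a pipeline exposes when its final step supports it. The sketch below uses a tiny made-up corpus with two invented classes rather than the newsgroups data, so the numbers it prints say nothing about the model above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts and labels, for illustration only
toy_texts = [
    "rocket launch orbit satellite",
    "orbit mission astronaut rocket",
    "pitcher inning game home run",
    "baseball game pitcher catch",
]
toy_labels = [0, 0, 1, 1]
toy_names = ["space", "baseball"]

toy_pipeline = Pipeline([
    ("counter", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# A sentence that mixes vocabulary from both classes
sentence = "the game engine renders a rocket"
probs = toy_pipeline.predict_proba([sentence])[0]

# classes_ gives the label order of the probability columns
for label, p in zip(toy_pipeline.named_steps["model"].classes_, probs):
    print(f"{toy_names[label]:10s} {p:.3f}")
```

Probabilities close to each other suggest the model has little evidence either way, which is a reasonable prompt to check which words in the sentence are actually in the vocabulary.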
