<a href="https://colab.research.google.com/github/Arnavvv16/AI_Notes/blob/main/KTS_25_26_NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multinomial Naive Bayes

Dataset used: 20 Newsgroups is a popular text dataset containing roughly 20,000 newsgroup posts (i.e., forum messages), organized into 20 topic categories such as rec.sport.baseball, rec.sport.hockey, and sci.space. Each post is labelled with its category.

Here, we fetch the dataset via sklearn.datasets. Documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

1. Import Modules

In [None]:
from sklearn.datasets import fetch_20newsgroups #The dataset
from sklearn.feature_extraction.text import CountVectorizer #Keeps track of the count of words
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

2. Load the data. We load only the space and hockey categories for this example.

In [None]:
categories = ['sci.space', 'rec.sport.hockey']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

texts = newsgroups.data
labels = newsgroups.target
label_names = newsgroups.target_names  # ['rec.sport.hockey', 'sci.space']

3. CountVectorizer: It converts a collection of text documents into a matrix of token counts.
* It first finds all unique words across the documents and builds a vocabulary.
* Then, for each document, it forms a vector holding the count of each vocabulary word that occurs in that document.
* Finally, it stacks these vectors into a (sparse) matrix of word counts, one row per document, which MultinomialNB takes as input.

In [None]:
#Vectorize text
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

#Train/test split
X_train, X_test, y_train, y_test, texts_train, texts_test = train_test_split(
    X, labels, texts, test_size=0.2, random_state=42
)
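
Note that the cell above builds the vocabulary from all documents before splitting, so words that occur only in test documents still become features. A common alternative is to learn the vocabulary from the training texts alone by wrapping both steps in a Pipeline. The sketch below illustrates this on hypothetical toy sentences (`train_texts`, `test_texts`, and `pipe` are illustrative names, not part of the notebook).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 0 = hockey, 1 = space
train_texts = [
    "the goalie stopped the puck",
    "the team scored in overtime",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
]
train_labels = [0, 0, 1, 1]
test_texts = [
    "satellite in orbit",
    "the zamboni followed the puck and the goalie",
]

# fit() learns the vocabulary and the model from the training texts only
pipe = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
pipe.fit(train_texts, train_labels)

# predict() reuses the training vocabulary; unseen words are ignored
toy_pred = pipe.predict(test_texts)
learned_vocab = pipe.named_steps['countvectorizer'].vocabulary_

print(toy_pred)
print('zamboni' in learned_vocab)
```

Here "zamboni" appears only in a test sentence, so it never enters the vocabulary and has no effect on the prediction.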

4. Train and Predict!

In [None]:
#Train Multinomial Naive Bayes
model = MultinomialNB()
model.fit(X_train, y_train)

#Predict on test data
y_pred = model.predict(X_test)

#Show predictions for a few examples
print("🔍 Example Predictions:\n")
for i in range(10):
    print(f"Text #{i + 1}:")
    print(texts_test[i][:300].replace("\n", " ") + "...")
    print(f"Actual label:    {label_names[y_test[i]]}")
    print(f"Predicted label: {label_names[y_pred[i]]}")
    print("-" * 80)

#Overall metrics
print("\n📊 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\n📈 Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_names))


🔍 Example Predictions:

Text #1:
 Major league baseball has told the Blue Jays and the Expos not to sign Oscar Linares (I think that is his name) ...Canada does not have the restrictions against Cubans that the US has and other major league teams have told the Canadian teams that they would be very unhappy if the Expos or the Blue ...
Actual label:    rec.sport.hockey
Predicted label: rec.sport.hockey
--------------------------------------------------------------------------------
Text #2:
                           ^^          Funny you should mention it...this is exactly the case I was going to make.   I will grant that a star like Mario will draw fans, even if the team sucks.  But this is short term only; I still do not think the attendance increase  will last, unless the team is a...
Actual label:    rec.sport.hockey
Predicted label: rec.sport.hockey
--------------------------------------------------------------------------------
Text #3:
  Kerry-- I'm guessing a little at this, be

# Gaussian Naive Bayes

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Iris dataset contains measurements from three iris species: setosa, versicolor, and virginica
# The measurements are sepal length, sepal width, petal length and petal width
# 0 - setosa
# 1 - versicolor
# 2 - virginica

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

#To predict on new data
new_data = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 4.7, 1.5]]
predictions = model.predict(new_data)
print("\nPredictions:", predictions)

Accuracy: 0.9778

Predictions: [0 1]
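
The predictions are class indices, so `[0 1]` means setosa and versicolor. As a small sketch, the indices can be mapped back to species names with `iris.target_names`, and `predict_proba` shows how confident the model is in each class. The names `gnb`, `pred_names`, and `probs` are illustrative; the split and model settings repeat those used above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Same data, split, and model as in the cell above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
gnb = GaussianNB().fit(X_train, y_train)

# Feature order: sepal length, sepal width, petal length, petal width
new_data = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.1, 4.7, 1.5]]

pred = gnb.predict(new_data)           # class indices
pred_names = iris.target_names[pred]   # indices mapped to species names
probs = gnb.predict_proba(new_data)    # one probability per class, per sample

for row, name, p in zip(new_data, pred_names, probs):
    print(row, "->", name, p.round(3))
```

Each row of `probs` sums to 1, and the predicted class is the one with the highest probability.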
