# Task 1: Theory Questions

Ques.1. What is the core assumption of Naive Bayes?

Ans.1. Naive Bayes assumes that all features are conditionally independent of one another given the class label. In other words, once the class is known, the value of one feature gives no additional information about the others. The joint likelihood therefore factorizes into a product of per-feature probabilities, which makes the calculation much simpler.
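To make the factorization concrete, here is a minimal sketch that computes a posterior by hand. The priors, the feature names (`contains_free`, `contains_meeting`), and the likelihood values are all hypothetical numbers chosen for illustration, not estimates from any dataset used in this notebook.

```python
# Hypothetical class priors P(class) and per-feature likelihoods P(feature present | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"contains_free": 0.8, "contains_meeting": 0.1},
    "ham": {"contains_free": 0.1, "contains_meeting": 0.5},
}


def posterior(observed):
    """Return P(class | observed features) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        # Independence given the class lets us multiply one term per feature.
        for feature, present in observed.items():
            p = likelihoods[cls][feature]
            score *= p if present else (1.0 - p)
        scores[cls] = score
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {cls: s / total for cls, s in scores.items()}


result = posterior({"contains_free": True, "contains_meeting": False})
print(result)
```

With these made-up numbers the unnormalized scores are 0.4 × 0.8 × 0.9 = 0.288 for spam and 0.6 × 0.1 × 0.5 = 0.03 for ham, so spam is the more probable class.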

Ques.2. Differentiate between GaussianNB, MultinomialNB, and BernoulliNB.

Ans.2. 
GaussianNB is intended for continuous features and assumes that each feature follows a Gaussian (normal) distribution within each class.

MultinomialNB is typically used for data represented as counts or frequencies, like term counts in documents.

BernoulliNB is suitable for binary feature data, where each feature indicates the presence or absence (1 or 0) of a particular attribute. It is commonly used in tasks like spam detection.
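The three variants can be tried side by side on data shaped for each one. The sketch below uses small synthetic arrays (an assumption made purely for illustration) and reports training accuracy only, so the numbers say nothing about real-world performance.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 50)

# Continuous features: each class is drawn from a normal distribution with a different mean.
X_continuous = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 3)),
    rng.normal(3.0, 1.0, size=(50, 3)),
])

# Count features: class 0 favors the first "term", class 1 favors the last one.
X_counts = np.vstack([
    rng.poisson([8, 1, 1], size=(50, 3)),
    rng.poisson([1, 1, 8], size=(50, 3)),
])

# Binary features: 1 if a term occurs more than twice, else 0.
X_binary = (X_counts > 2).astype(int)

scores = {
    "GaussianNB": GaussianNB().fit(X_continuous, labels).score(X_continuous, labels),
    "MultinomialNB": MultinomialNB().fit(X_counts, labels).score(X_counts, labels),
    "BernoulliNB": BernoulliNB().fit(X_binary, labels).score(X_binary, labels),
}
for name, score in scores.items():
    print(f"{name}: training accuracy = {score:.2f}")
```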

Ques.3. Why is Naive Bayes considered suitable for high-dimensional data?

Ans.3.
Naive Bayes performs well with high-dimensional datasets because it estimates each feature's class-conditional distribution separately instead of modelling interactions between features. As a result, the number of parameters and the training cost grow only linearly with the number of features. This keeps the model fast and effective in applications like text classification, where the number of features (e.g., words) can be extremely large.
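The linear growth in parameters can be checked directly on a fitted model. This is a rough sketch on a synthetic sparse count matrix; the sizes (200 samples, 20,000 features) and the alternating labels are assumptions chosen only to mimic a bag-of-words setting.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

n_samples, n_features = 200, 20_000
rng = np.random.default_rng(0)

# Synthetic, mostly-zero count matrix standing in for a bag-of-words representation.
X = csr_matrix(rng.poisson(0.01, size=(n_samples, n_features)))
y = np.arange(n_samples) % 2  # alternating hypothetical labels, both classes present

model = MultinomialNB().fit(X, y)

# One log-probability per (class, feature) pair plus one log-prior per class.
n_parameters = model.feature_log_prob_.size + model.class_log_prior_.size
print(model.feature_log_prob_.shape)
print(n_parameters)
```

Even with far more features than samples, the model stores just one value per class and feature, and fitting only requires summing counts per class.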

# Task 2: Spam Detection using MultinomialNB

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

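# Load the tab-separated file; it has no header row, so the column names are set explicitly.
data = pd.read_csv("email.csv", sep='\t', header=None, names=["category", "text"])
# Drop rows with missing message text.
data.dropna(subset=['text'], inplace=True)
# Encode the labels as integers (ham -> 0, spam -> 1). map is used instead of replace,
# which raises a downcasting FutureWarning in recent pandas versions.
data['category'] = data['category'].map({'ham': 0, 'spam': 1})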

text_train, text_test, label_train, label_test = train_test_split(
    data['text'], data['category'], test_size=0.2, random_state=42
)

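# Learn the vocabulary from the training text only, then reuse it for the test text
# so that no information from the test set leaks into the features.
vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(text_train)
test_vectors = vectorizer.transform(text_test)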

model = MultinomialNB()
model.fit(train_vectors, label_train)
predictions = model.predict(test_vectors)

accuracy = accuracy_score(label_test, predictions)
precision = precision_score(label_test, predictions)
recall = recall_score(label_test, predictions)
conf_matrix = confusion_matrix(label_test, predictions)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print("Confusion Matrix:")
print(conf_matrix)


Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
Confusion Matrix:
[[132   0]
 [  0  16]]
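As a cross-check, the reported metrics can be recomputed from the confusion matrix alone. The sketch below hard-codes the matrix printed above and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels (0 = ham, 1 = spam).

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true label, columns = prediction.
conf_matrix = np.array([[132, 0],
                        [0, 16]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
spam_share = (tp + fn) / conf_matrix.sum()  # class balance in the test split

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")
print(f"Share of spam in the test split: {spam_share:.4f}")
```

The test split holds 148 messages, of which 16 are spam, so the classes are clearly imbalanced. That is why precision and recall are reported alongside accuracy.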




# Task 3: GaussianNB with Iris or Wine Dataset

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

dataset = load_iris()
features, labels = dataset.data, dataset.target

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)

print("Evaluation of Gaussian Naive Bayes:")
print("Accuracy:", accuracy_score(y_test, nb_predictions))
print(classification_report(y_test, nb_predictions, target_names=dataset.target_names))

log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
lr_predictions = log_reg.predict(X_test)

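# Fix the seed so that the tree's choice between equally good splits is reproducible.
tree_model = DecisionTreeClassifier(random_state=42)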
tree_model.fit(X_train, y_train)
tree_predictions = tree_model.predict(X_test)

print("Accuracy Comparison Across Models:")
print(f"Naive Bayes Accuracy: {accuracy_score(y_test, nb_predictions):.4f}")
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_predictions):.4f}")
print(f"Decision Tree Accuracy: {accuracy_score(y_test, tree_predictions):.4f}")


Evaluation of Gaussian Naive Bayes:
Accuracy: 1.0
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Accuracy Comparison Across Models:
Naive Bayes Accuracy: 1.0000
Logistic Regression Accuracy: 1.0000
Decision Tree Accuracy: 1.0000
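All three models score the same on this single 30-sample test split, so the split cannot separate them. One possible follow-up, sketched below, is stratified 5-fold cross-validation; this is not part of the original comparison, and the fold count and seeds are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Stratified folds keep the three species balanced in every train/test partition.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_means = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_means[name] = fold_scores.mean()
    print(f"{name}: mean accuracy = {fold_scores.mean():.4f} (std = {fold_scores.std():.4f})")
```

Averaging over several folds uses every sample for testing exactly once, which gives a steadier basis for comparing the models than one small hold-out set.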
