
Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, CategoricalNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

Loading Dataset

In [None]:
# The file is tab-separated and has no header row, so column names are supplied explicitly
df = pd.read_csv('/content/SMSSpamCollection.csv', sep='\t', header=None, names=['label', 'message'])

In [None]:
X = df['message']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
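
The split above is random and not stratified. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both partitions, which can matter when one class is rare. The sketch below uses made-up messages and labels; `toy_messages` and `toy_labels` are hypothetical names.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 ham and 4 spam messages (not the SMS dataset)
toy_messages = [f"message {i}" for i in range(20)]
toy_labels = ["ham"] * 16 + ["spam"] * 4

# stratify=toy_labels keeps the 80/20 ham/spam ratio in both partitions
msg_train, msg_test, lab_train, lab_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

train_label_counts = Counter(lab_train)
test_label_counts = Counter(lab_test)
print("Train:", dict(train_label_counts))
print("Test:", dict(test_label_counts))
```

Switching the notebook itself to a stratified split would change which rows land in the test set, so the accuracies reported below would not be directly comparable.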

Extracting text features using bag-of-words vectorization

In [None]:
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

print("Shape of X_train_vectors:", X_train_vectors.shape)
print("Shape of X_test_vectors:", X_test_vectors.shape)

Shape of X_train_vectors: (3900, 7263)
Shape of X_test_vectors: (1672, 7263)
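
To make the shapes above concrete, here is a minimal sketch of what `CountVectorizer` does on a tiny corpus. The messages are invented for illustration and are not taken from the SMS dataset. The vocabulary is learned from the training text only, so a word that appears only in the test text gets no column and is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages (not from the SMS dataset)
toy_train = ["free prize now", "call me now", "free free call"]
toy_test = ["free unknown word"]

toy_vectorizer = CountVectorizer()
toy_train_counts = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
toy_test_counts = toy_vectorizer.transform(toy_test)        # reuses that vocabulary

# Order the vocabulary by column index so it lines up with the matrix columns
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_train_counts.toarray())
print(toy_test_counts.toarray())
```

Each row is one message and each column is one vocabulary word, which is why both matrices in the notebook share the same second dimension (7263).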


Encoding Labels

In [None]:
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print("Original labels:", label_encoder.classes_)
print("Encoded labels (train):", np.unique(y_train_encoded))
print("Encoded labels (test):", np.unique(y_test_encoded))

Original labels: ['ham' 'spam']
Encoded labels (train): [0 1]
Encoded labels (test): [0 1]
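
`LabelEncoder` sorts the class names before numbering them, which is why 'ham' maps to 0 and 'spam' to 1 above. A small sketch on made-up labels, showing that `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration
toy_labels = ["spam", "ham", "ham", "spam"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)
toy_decoded = toy_encoder.inverse_transform(toy_encoded)

print(toy_encoder.classes_)  # classes in sorted order
print(toy_encoded)
print(toy_decoded)
```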


Training the models on the training set

In [None]:
# Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train_vectors, y_train_encoded)
print("Multinomial Naive Bayes trained.")

# Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train_vectors, y_train_encoded)
print("Bernoulli Naive Bayes trained.")

# Gaussian Naive Bayes (requires dense input)
gnb = GaussianNB()
gnb.fit(X_train_vectors.toarray(), y_train_encoded)
print("Gaussian Naive Bayes trained.")

# Categorical Naive Bayes (requires dense integer category codes, so counts are binarized to 0/1)
cgnb = CategoricalNB()
cgnb.fit((X_train_vectors.toarray() > 0).astype(int), y_train_encoded)
print("Categorical Naive Bayes trained.")

Multinomial Naive Bayes trained.
Bernoulli Naive Bayes trained.
Gaussian Naive Bayes trained.
Categorical Naive Bayes trained.
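
A note on the Categorical model: `BernoulliNB` binarizes its input at a threshold of 0.0 by default, so with default settings it should describe the same model as `CategoricalNB` fitted on 0/1 presence features, provided every feature takes both values somewhere in the training data. The sketch below checks this on a small hand-made count matrix; `X_counts_toy`, `y_toy`, and `X_new_toy` are hypothetical and unrelated to the SMS data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB

# Hypothetical word counts: 6 documents, 4 words; every column has a nonzero entry
X_counts_toy = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 1],
    [1, 0, 0, 2],
    [0, 1, 1, 0],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])
X_new_toy = np.array([[1, 0, 0, 3], [0, 2, 1, 0]])

def binarize(counts):
    """Turn counts into 0/1 presence indicators, as done for CategoricalNB above."""
    return (counts > 0).astype(int)

bnb_toy = BernoulliNB().fit(X_counts_toy, y_toy)             # binarizes internally
cnb_toy = CategoricalNB().fit(binarize(X_counts_toy), y_toy)  # needs explicit 0/1 input

proba_bernoulli = bnb_toy.predict_proba(X_new_toy)
proba_categorical = cnb_toy.predict_proba(binarize(X_new_toy))
print(np.allclose(proba_bernoulli, proba_categorical))
```

If the same holds on the full dataset, the Bernoulli and Categorical models would be expected to make identical predictions on the test set.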


Evaluating on the held-out 30% test set

In [None]:
models = {
    "Multinomial Naive Bayes": mnb,
    "Bernoulli Naive Bayes": bnb,
    "Gaussian Naive Bayes": gnb,
    "Categorical Naive Bayes": cgnb
}

for name, model in models.items():
    print(f"\n--- {name} ---")

    # GaussianNB and CategoricalNB need dense input for prediction
    # CategoricalNB also needs binarized input
    if name == "Gaussian Naive Bayes":
        y_pred = model.predict(X_test_vectors.toarray())
    elif name == "Categorical Naive Bayes":
        y_pred = model.predict((X_test_vectors.toarray() > 0).astype(int))
    else:
        y_pred = model.predict(X_test_vectors)

    accuracy = accuracy_score(y_test_encoded, y_pred)

    print(f"Accuracy: {accuracy:.4f}")


--- Multinomial Naive Bayes ---
Accuracy: 0.9904

--- Bernoulli Naive Bayes ---
Accuracy: 0.9791

--- Gaussian Naive Bayes ---
Accuracy: 0.9103

--- Categorical Naive Bayes ---
Accuracy: 0.9791


Conclusion: Multinomial Naive Bayes achieves the highest test accuracy (0.9904),
ahead of Bernoulli and Categorical Naive Bayes (0.9791 each) and Gaussian Naive Bayes (0.9103).
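
Accuracy is the only metric reported here, although `classification_report` and `confusion_matrix` are imported at the top. When one class is much rarer than the other, accuracy can look high while many examples of the rare class are missed. The sketch below shows this on made-up predictions (the arrays are hypothetical, not results from the models above); the same calls could be applied to `y_test_encoded` and each model's `y_pred`.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    recall_score,
)

# Hypothetical encoded labels: 0 = ham, 1 = spam
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one spam message is missed

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_cm = confusion_matrix(y_true_toy, y_pred_toy)        # rows: true class, columns: predicted
toy_spam_recall = recall_score(y_true_toy, y_pred_toy)   # recall for the positive class (spam)

print(f"Accuracy: {toy_accuracy:.2f}")
print(f"Spam recall: {toy_spam_recall:.2f}")
print(toy_cm)
print(classification_report(y_true_toy, y_pred_toy, target_names=["ham", "spam"]))
```

In this toy case accuracy is 0.90 while only half of the spam messages are caught, which is the kind of gap a per-class report makes visible.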