**Naive Bayes Text Classifier**

**Goal**: Build a spam detector using the Multinomial Naive Bayes algorithm on a text dataset.

**Download the Dataset**

In [1]:
import kagglehub
uciml_sms_spam_collection_dataset_path = kagglehub.dataset_download('uciml/sms-spam-collection-dataset')

print('Data source import complete.')

Data source import complete.


**Install and Import Required Libraries**

In [8]:
!pip install pandas scikit-learn matplotlib seaborn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score



**Load the SMS Spam Dataset**

In [17]:
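# Read only the label (v1) and message (v2) columns; the file is Latin-1 encoded
df = pd.read_csv(
    "/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='latin1'
)[['v1', 'v2']]

# Rename columns to descriptive names
df.columns = ['label', 'message']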

# Preview the data
print(df.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


**Preprocess the Data**

In [18]:
# Convert labels to binary: spam = 1, ham = 0
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
# Check class balance
print(df['label'].value_counts())

label
0    4825
1     747
Name: count, dtype: int64
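The counts above show that the classes are imbalanced, which matters when reading accuracy later on. As a rough sketch, using only the two counts printed above (the variable names are illustrative), the spam share and the accuracy of a trivial "always predict ham" baseline work out as follows:

```python
# Class counts copied from the value_counts() output above
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# Share of messages that are spam
spam_share = spam_count / total

# Accuracy of a baseline that labels every message as ham
majority_baseline_accuracy = ham_count / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"always-ham baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Any model should be judged against that baseline, and per-class precision and recall are more informative here than accuracy alone.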


**Split Data into Training and Test Sets**

In [20]:
from sklearn.model_selection import train_test_split

X = df['message']  # Text data
y = df['label']    # Spam or ham

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
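The split above is purely random. With imbalanced classes, one option (not used in this notebook, so the results reported here do not reflect it) is to pass `stratify=y`, which keeps the class proportions the same in both subsets. A minimal sketch on hypothetical toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 4 of them positive (20%)
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([1] * 4 + [0] * 16)

# stratify=y_toy preserves the 20% positive rate in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```

Applying this to the SMS data would change which messages land in the test set, so the metrics would differ slightly from those shown in this notebook.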


**Convert Text to Vectors (TF-IDF)**

In [23]:
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
# Fit on training data and transform both train and test
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

**Train the Naive Bayes Classifier**

In [28]:
# Create and train the model
model = MultinomialNB()
model.fit(X_train_vec, y_train)
# Predict on test data
y_pred = model.predict(X_test_vec)

**Evaluate the Model**

In [26]:
# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
# Confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# Classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9668161434977578
Confusion Matrix:
 [[965   0]
 [ 37 113]]
Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115
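The spam-class figures in the report can be recomputed by hand from the confusion matrix, which helps in reading both. The sketch below copies the matrix from the output above; in scikit-learn's layout, rows are true classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: true class (0 = ham, 1 = spam), columns: predicted class
cm = np.array([[965, 0],
               [37, 113]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The 37 false negatives are the spam messages that were classified as ham, and they account for the entire gap between precision and recall.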



**Summary**

In this project, we built a machine learning model to classify SMS messages as either Spam or Ham (Not Spam) using the Multinomial Naive Bayes algorithm. The raw message text was first converted into numerical features using TF-IDF vectorization, which weights each word by how often it occurs in a message and how rare it is across all messages, after removing common English stop words.

After training the model on 80% of the data, we evaluated it on the remaining 20%. **The results showed:**
* Accuracy: 96.7%
* Precision (Spam): 100% (no false positives)
* Recall (Spam): 75% (some spam was missed)
* F1 Score (Spam): 86%

**Key Insights:**

* All 965 ham messages in the test set were classified correctly (100% ham recall).
* 37 of the 150 spam messages were classified as ham, which lowered the spam recall to 75%.

This trade-off (no false spam predictions, but some missed spam) can be shifted by lowering the decision threshold for the spam class, which catches more spam at the cost of some false positives, or by trying other models such as an SVM or ensemble methods.
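As a rough sketch of threshold adjustment: `MultinomialNB.predict` picks the more probable class, which for two classes amounts to a 0.5 cutoff on the spam probability. Using `predict_proba` instead allows a different cutoff. The toy messages, the helper `predict_with_threshold`, and the 0.3 threshold below are all hypothetical choices for illustration, not values tuned on the SMS dataset:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages, for illustration only
texts = [
    "win a free prize now", "free entry claim your cash prize",
    "are we still meeting for lunch", "call me when you get home",
    "see you at the office tomorrow", "ok thanks talk later",
]
labels = np.array([1, 1, 0, 0, 0, 0])  # 1 = spam, 0 = ham

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

new_texts = ["free lunch tomorrow", "claim your prize", "see you later"]
# Column 1 of predict_proba corresponds to class 1 (spam)
spam_proba = clf.predict_proba(vec.transform(new_texts))[:, 1]

def predict_with_threshold(proba, threshold):
    """Label a message as spam when its spam probability reaches the threshold."""
    return (proba >= threshold).astype(int)

default_pred = predict_with_threshold(spam_proba, 0.5)
lenient_pred = predict_with_threshold(spam_proba, 0.3)  # flags spam more readily

print("spam probabilities:", np.round(spam_proba, 3))
print("threshold 0.5:", default_pred)
print("threshold 0.3:", lenient_pred)
```

Lowering the threshold can only add spam predictions, never remove them, so recall rises or stays the same while precision may fall. In practice the cutoff would be chosen on a validation set rather than the test set.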