## Using Multinomial Naive Bayes and TF-IDF Vectorizer

In [None]:
import pandas as pd

path = "../src/hamvsspam.csv"

# latin1 avoids decode errors on non-UTF-8 bytes in the file
df = pd.read_csv(path, encoding='latin1')

# Keep only the first two columns (label and message)
df = df.iloc[:, :2]

# Rename the columns
df.columns = ['label', 'message']

print(df.head())



  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [None]:

# Features (raw message text) and target (ham/spam label)
X = df['message']
y = df['label']

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 25% of the messages for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer is a text feature extractor from sklearn.feature_extraction.text.
#
# - It converts raw text into numeric feature vectors that a model can be trained on.
#
# - It weights each term by TF-IDF (Term Frequency–Inverse Document Frequency): terms that
#   are frequent in a message but rare across the corpus receive higher weights.
#
# - Its output is a sparse matrix in which each row is a message and each column is a term.

from sklearn.naive_bayes import MultinomialNB

# MultinomialNB is a classifier from sklearn.naive_bayes that is commonly used for text
# classification. It is designed for count features, but fractional values such as
# TF-IDF scores also work in practice.
#
# - Based on Bayes’ Theorem.
#
# - Assumes word features are conditionally independent given the class.
#
# - Works well for spam detection, sentiment analysis, etc.

from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Create a pipeline: TF-IDF → Naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

# Predict on train set
y_pred_train = model.predict(X_train)

# Predict on test set
y_pred_test = model.predict(X_test)



In [None]:
# Training-set report (first table in the output)
print(classification_report(y_train, y_pred_train))

# Test-set report (second table in the output)
print(classification_report(y_test, y_pred_test))


              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      4987
        spam       1.00      0.86      0.92       767

    accuracy                           0.98      5754
   macro avg       0.99      0.93      0.96      5754
weighted avg       0.98      0.98      0.98      5754

              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1645
        spam       1.00      0.79      0.88       274

    accuracy                           0.97      1919
   macro avg       0.98      0.89      0.93      1919
weighted avg       0.97      0.97      0.97      1919
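
In the test report, spam precision is 1.00 while spam recall is 0.79: messages flagged as spam are almost always spam, but roughly one in five spam messages is missed. As a rough illustration of how those two numbers are derived, the sketch below computes them from a confusion matrix. The label lists are hypothetical, chosen to give a similar pattern, and are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (assumption: illustration only)
y_true_toy = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
y_pred_toy = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "ham"]

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["ham", "spam"])
tn, fp, fn, tp = cm.ravel()

# Precision: of the messages predicted spam, how many are spam
spam_precision = tp / (tp + fp)
# Recall: of the actual spam messages, how many were caught
spam_recall = tp / (tp + fn)

print("confusion matrix:\n", cm)
print("spam precision:", spam_precision)
print("spam recall:", spam_recall)

# The same values via scikit-learn's helpers
sk_precision = precision_score(y_true_toy, y_pred_toy, pos_label="spam")
sk_recall = recall_score(y_true_toy, y_pred_toy, pos_label="spam")
```

Here one of the four spam messages is predicted as ham, so recall drops to 0.75 while precision stays at 1.0 because no ham message is flagged.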

