#Spam Mail Classification using Bag of Words (BoW)

##Import Libraries

Import pandas and NumPy for data manipulation. The scikit-learn components are imported later, in the cells where they are first used.

In [21]:
import pandas as pd
import numpy as np

##Load Dataset

Load the spam mail dataset from a CSV file into a pandas DataFrame.

In [22]:
df = pd.read_csv('spam.csv')

##Display First Five Rows

Display the first five rows of the dataset to understand its structure.

In [23]:
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Check for Missing Values

Check for any missing values in the dataset.

In [24]:
df.isnull().sum()

Category    0
Message     0
dtype: int64

##Count of Each Category

Display the count of each category (ham or spam) in the dataset.

In [25]:
df.Category.value_counts()

Category
ham     4825
spam     747
Name: count, dtype: int64
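
The counts show a clear class imbalance, which matters when reading accuracy figures later. As a quick check that uses only the numbers printed above, the arithmetic below computes the spam share and the accuracy of a trivial baseline that labels every message as ham. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total          # fraction of messages that are spam
baseline_accuracy = ham_count / total    # accuracy of always predicting "ham"

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.1%}")
```

A classifier therefore has to beat roughly 87% accuracy to be useful at all, and per-class precision and recall are more informative than accuracy alone.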

##Encode Labels

Encode the 'Category' column labels (ham and spam) into numerical values using LabelEncoder. LabelEncoder assigns integers to the sorted class names, so ham becomes 0 and spam becomes 1.

In [26]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['Category'] = encoder.fit_transform(df['Category'])
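
To confirm which integer stands for which label, the fitted encoder can be inspected through its `classes_` attribute. A small sketch on a made-up list of labels (not the dataset itself); `enc` and `mapping` are illustrative names:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Category' column
labels = ["ham", "spam", "ham", "spam", "ham"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the sorted class names; the position of each name is its code
mapping = {str(name): int(code)
           for name, code in zip(enc.classes_, enc.transform(enc.classes_))}

print(mapping)
print(list(codes))
```

The same check works on the notebook's `encoder` object, and `inverse_transform` converts the integers back to the original labels.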

##Display First Five Rows of Encoded Data

Display the first five rows of the DataFrame with the encoded 'Category' column.

In [27]:
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


##Split Data

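Split the dataset into training and testing sets, holding out 20% of the messages for testing (`test_size=0.2`). With 5572 messages this leaves 4457 for training and 1115 for testing. No `random_state` is set, so the split, and therefore the exact scores below, will vary from run to run.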

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.Message, df.Category, test_size=0.2)

In [29]:
X_train.shape

(4457,)
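
Because spam is the minority class, a purely random split can leave the training and testing sets with slightly different spam rates. A minimal sketch on made-up toy data (not the spam dataset), assuming a reproducible split that preserves class proportions is wanted; the `stratify` and `random_state` arguments are part of `train_test_split`, while the seed value 42 is arbitrary:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 messages, 20% of them spam (label 1)
messages = [f"message {i}" for i in range(100)]
labels = [1 if i < 20 else 0 for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels,
    test_size=0.2,
    stratify=labels,     # keep the spam/ham ratio the same in both splits
    random_state=42,     # arbitrary seed so the split is repeatable
)

print(len(X_tr), len(X_te))    # sizes of the two splits
print(sum(y_tr), sum(y_te))    # number of spam messages in each split
```

Passing the same two arguments to the notebook's call would change which rows land in each split, so the scores reported later would differ slightly.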

##Initialize CountVectorizer

Initialize the CountVectorizer to convert text data into a bag-of-words model.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()


##Fit and Transform Training Data

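Fit the CountVectorizer on the training messages to learn the vocabulary, then transform each message into a vector of word counts. The result is a sparse matrix, so only the first two rows are converted to a dense array for inspection.

In [31]:
X_train_cv = v.fit_transform(X_train.values)
X_train_cv[:2].toarray()  # densify only the first two rows, not the whole matrix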

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [32]:
v.get_feature_names_out().shape

(7712,)

In [33]:
X_train_cv.shape

(4457, 7712)
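
Each of the 7712 columns corresponds to one vocabulary word, and each row holds the word counts for one message. The idea can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it only mirrors CountVectorizer's default lowercasing and its default rule of keeping tokens with two or more word characters. The function names are hypothetical.

```python
import re
from collections import Counter

# Tokens of two or more word characters, similar to CountVectorizer's default
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def to_count_vectors(docs, vocab):
    """Turn each document into a list of per-word counts."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocab)
        for word, count in counts.items():
            if word in vocab:          # words outside the vocabulary are skipped
                row[vocab[word]] = count
        rows.append(row)
    return rows

train_docs = ["Free entry to win win", "Ok see you later"]
vocab = fit_vocabulary(train_docs)
train_vectors = to_count_vectors(train_docs, vocab)

print(sorted(vocab, key=vocab.get))
print(train_vectors)
```

Word order is discarded and only counts remain, which is what "bag of words" refers to.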

##Initialize Multinomial Naive Bayes

Initialize the Multinomial Naive Bayes classifier, which models per-class word counts and is therefore a natural fit for bag-of-words features.

In [34]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()


##Train the Model

Train the Multinomial Naive Bayes model on the training data.

In [35]:
model.fit(X_train_cv, y_train)
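
What fitting does can be sketched in plain Python: estimate a prior for each class and a smoothed probability for each word within each class, then score a message by summing log-probabilities. This is a simplified illustration, not scikit-learn's code. It assumes whitespace tokenization and add-one (Laplace) smoothing with `alpha=1.0`, which matches MultinomialNB's default smoothing value; the function names and toy messages are made up.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = sorted(set(labels))
    vocab = sorted({word for doc in docs for word in doc.split()})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            word: math.log((counts[word] + alpha) / total) for word in vocab
        }
    return classes, log_prior, log_likelihood

def predict_nb(doc, classes, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in classes:
        known = [w for w in doc.split() if w in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in known)
    return max(scores, key=scores.get)

docs = ["win free prize", "free entry win", "see you later", "ok see you"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham
params = train_nb(docs, labels)

spam_guess = predict_nb("free prize", *params)
ham_guess = predict_nb("see you", *params)
print(spam_guess, ham_guess)
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.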

##Transform Testing Data

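Transform the testing messages into vectors using the fitted CountVectorizer. Use transform rather than fit_transform here so that the test vectors share the vocabulary learned from the training data; words that were not seen during training are ignored.

In [36]:
X_test_cv = v.transform(X_test)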

##Predict and Classification Report

Predict the labels for the test data and print the classification report to evaluate the model's performance. In the report, class 0 is ham and class 1 is spam.

In [37]:
from sklearn.metrics import classification_report
y_pred = model.predict(X_test_cv)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       955
           1       0.99      0.89      0.93       160

    accuracy                           0.98      1115
   macro avg       0.98      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115
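
The vectorize, train, and predict steps can also be chained so that raw text goes in and labels come out. A minimal sketch using scikit-learn's `Pipeline` on made-up toy messages; this is one possible arrangement rather than the notebook's own method, and the step names `vectorizer` and `nb` are arbitrary:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data: 1 = spam, 0 = ham
train_msgs = ["win free prize", "free entry win", "see you later", "ok see you"]
train_labels = [1, 1, 0, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),   # text -> word counts
    ("nb", MultinomialNB()),             # word counts -> class label
])
clf.fit(train_msgs, train_labels)

new_msgs = ["win free prize now", "see you later ok"]
predictions = clf.predict(new_msgs)
print(list(predictions))
```

With a pipeline, the vectorizer is fitted only on the data passed to `fit`, which removes the risk of accidentally refitting it on test messages.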



##Confusion Matrix

Compute the confusion matrix and display it as a labelled table to see where the model's errors fall. Rows are the actual labels and columns are the predicted labels (0 = ham, 1 = spam).

In [38]:
# Compute the confusion matrix as a NumPy array (rows: actual, columns: predicted)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Show the same counts as a labelled table
cm_table = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
print(cm_table)


Predicted    0    1
Actual             
0          953    2
1           18  142
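
The figures in the classification report follow directly from these four counts. The arithmetic below recomputes the spam-class metrics and the overall accuracy from the matrix above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (spam is the positive class)
tn, fp = 953, 2     # actual ham:  predicted ham, predicted spam
fn, tp = 18, 142    # actual spam: predicted ham, predicted spam

precision = tp / (tp + fp)    # 142 / 144: share of spam predictions that were right
recall = tp / (tp + fn)       # 142 / 160: share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)    # 1095 / 1115

print(f"spam precision: {precision:.4f}")
print(f"spam recall:    {recall:.4f}")
print(f"spam f1-score:  {f1:.4f}")
print(f"accuracy:       {accuracy:.4f}")
```

The 18 spam messages classified as ham account for the lower spam recall, while only 2 legitimate messages were flagged as spam.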
