
We use Multinomial Naive Bayes for spam classification.
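
As a rough illustration of what Multinomial Naive Bayes computes, the sketch below scores a message by adding the log prior of each class to the log likelihood of each word, using add-one (Laplace) smoothing. The toy messages and the helper names train_nb and predict_nb are made up for illustration; the rest of the notebook relies on scikit-learn's MultinomialNB, which applies the same idea to word counts.

```python
import math
from collections import Counter

# Hypothetical toy data, not taken from spam.csv (1 = spam, 0 = ham).
toy_messages = [
    ("win free prize now", 1),
    ("free entry win cash", 1),
    ("see you at lunch", 0),
    ("are you coming home", 0),
]

def train_nb(messages, alpha=1.0):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab, alpha

def predict_nb(model, text):
    """Return the class with the highest log posterior score."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(class)
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are skipped
            # log P(word | class) with add-alpha smoothing
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total_words + alpha * len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

toy_model = train_nb(toy_messages)
print(predict_nb(toy_model, "free prize"))        # 1 (spam)
print(predict_nb(toy_model, "coming home soon"))  # 0 (ham)
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.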


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer


Load the dataset

In [3]:
spam = pd.read_csv('/content/spam.csv')

Data analysis and description

In [4]:
spam.head(5)

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
spam.describe()

,Category,Message
count,5572,5572
unique,2,5157
top,ham,"Sorry, I'll call later"
freq,4825,30


In [6]:
spam.shape


(5572, 2)

In [7]:
spam.isnull().sum()

Category    0
Message     0
dtype: int64

There are no null values, so no imputation is required.


We group by Category to summarize each class separately.

In [8]:
spam.groupby('Category').describe()

,Message,Message,Message,Message
Category,count,unique,top,freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


In [46]:
spam['spam'] = spam['Category'].apply(lambda x:1 if x=='spam' else 0)

In [47]:
spam.head()

,Category,Message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


We created a target column, spam, that is 1 for spam messages and 0 for ham.

This numeric label is used for training and testing. We hold out 30% of the messages as the test set.

In [49]:
x_train,x_test,y_train,y_test = train_test_split(spam['Message'],spam['spam'],test_size=0.3,random_state=2)
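
Only about 13% of the messages are spam (747 of 5572), so a purely random split can leave the two sets with slightly different class ratios. One optional variation, not used in this notebook, is to pass stratify to train_test_split so that both sets keep the same ratio. The sketch below uses made-up labels and variable names; the results reported in this notebook come from the unstratified split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 ham (0) and 20 spam (1), i.e. 20% spam.
toy_labels = np.array([0] * 80 + [1] * 20)
toy_texts = np.array([f"message {i}" for i in range(100)])

# stratify=toy_labels keeps the 20% spam ratio in both splits.
t_train, t_test, l_train, l_test = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=2, stratify=toy_labels
)

# Spam fraction in the training and test splits.
print(l_train.mean(), l_test.mean())
```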

In [58]:
print(x_train)

1915    New TEXTBUDDY Chat 2 horny guys in ur area 4 j...
1056                             I'm at work. Please call
3717              Networking technical support associate.
5375    I cant pick the phone right now. Pls send a me...
945     I sent my scores to sophas and i had to do sec...
                              ...                        
3335    That's fine, have him give me a call if he kno...
1099    NO GIFTS!! You trying to get me to throw mysel...
2514    U have won a nokia 6230 plus a free digital ca...
3606                      Jordan got voted out last nite!
2575    Your next amazing xxx PICSFREE1 video will be ...
Name: Message, Length: 3900, dtype: object


In [51]:
print(y_train)

1915    1
1056    0
3717    0
5375    0
945     0
       ..
3335    0
1099    0
2514    1
3606    0
2575    1
Name: spam, Length: 3900, dtype: int64


In [52]:
x_train.describe()

count                       3900
unique                      3679
top       Sorry, I'll call later
freq                          25
Name: Message, dtype: object

In [53]:
cv = CountVectorizer(lowercase=True,stop_words='english')


In [54]:
x_train_transformed = cv.fit_transform(x_train)

In [55]:
x_test_transformed = cv.transform(x_test)

In [56]:
x_train_transformed.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [60]:
# Class balance in the training set (0 = ham, 1 = spam)
y_train.value_counts()

0    3380
1     520
Name: spam, dtype: int64

In [62]:
cv.get_feature_names_out()

array(['00', '000', '000pes', ..., 'zouk', 'ú1', '〨ud'], dtype=object)

Model Training

In [63]:
model = MultinomialNB()
model.fit(x_train_transformed,y_train)

In [65]:
y_test_predicted_labels=model.predict(x_test_transformed)

In [66]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [68]:
print('Accuracy Score after testing :',accuracy_score(y_test,y_test_predicted_labels)*100,'%')

Accuracy Score after testing : 97.78708133971293 %


We use a confusion matrix to see where the predictions are right and where they are wrong.

In [70]:
results = confusion_matrix(y_test,y_test_predicted_labels)
print(results)

[[1434   11]
 [  26  201]]


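scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], with spam (1) as the positive class: 1434 true negatives, 11 false positives, 26 false negatives, and 201 true positives. Only 37 of the 1672 test messages are misclassified, so the model is accurate overall, although 26 of the 227 spam messages (about 11%) slip through as ham.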
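
The usual summary metrics can be recomputed by hand from these four counts. The following is a minimal sketch, assuming spam (label 1) is the positive class and the matrix follows scikit-learn's [[TN, FP], [FN, TP]] ordering; the matrix values are copied from the output above.

```python
import numpy as np

# Confusion matrix printed above, ordered as
# [[TN, FP],
#  [FN, TP]] for labels 0 (ham) and 1 (spam).
results = np.array([[1434, 11],
                    [26, 201]])
tn, fp, fn, tp = results.ravel()

accuracy = (tp + tn) / results.sum()
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9779 precision=0.95 recall=0.89 f1=0.92
```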

Now we generate a classification report using sklearn.metrics.classification_report.

In [72]:
report = classification_report(y_test,y_test_predicted_labels)
print(report)

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1445
           1       0.95      0.89      0.92       227

    accuracy                           0.98      1672
   macro avg       0.97      0.94      0.95      1672
weighted avg       0.98      0.98      0.98      1672
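
The macro avg and weighted avg rows combine the per-class scores in different ways: the macro average treats both classes equally, while the weighted average weights each class by its support. As a small sketch, the recall column can be recomputed from the confusion matrix [[1434, 11], [26, 201]] reported earlier; the variable names here are illustrative.

```python
# Per-class support and recall, taken from the confusion matrix
# [[1434, 11], [26, 201]] (rows are the true classes 0 and 1).
support = {0: 1434 + 11, 1: 26 + 201}   # 1445 ham, 227 spam
recall = {0: 1434 / support[0], 1: 201 / support[1]}

total = sum(support.values())
# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)
# Weighted average: each class is weighted by its number of test messages.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(round(macro_recall, 2), round(weighted_recall, 2))  # 0.94 0.98
```

With an imbalanced test set like this one, the macro average is the more sensitive of the two to weak performance on the smaller spam class.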



Now we predict whether a new message is spam or ham.

In [73]:
new_msg = "You have won a ticket to australia , congrats!"
new_msg_transformed = cv.transform([new_msg])
new_msg_transformed

<1x6942 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [74]:
new_msg_transformed.toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [76]:
ans = model.predict(new_msg_transformed)
# predict returns an array with one label per input message
if ans[0] == 1:
    print("The message is a spam message")
else:
    print("The message is a ham message")

The message is a spam message
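
The vectorize-then-predict steps can also be bundled so that raw text goes in and a label comes out. The sketch below is one possible way to do that with scikit-learn's make_pipeline. It is trained on a few made-up messages rather than on spam.csv, and the helper name classify is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = spam, 0 = ham).
toy_texts = ["win free prize now", "free entry win cash",
             "see you at lunch", "are you coming home"]
toy_labels = [1, 1, 0, 0]

# The pipeline runs the vectorizer and then the classifier in one call.
toy_pipeline = make_pipeline(
    CountVectorizer(lowercase=True, stop_words='english'),
    MultinomialNB(),
)
toy_pipeline.fit(toy_texts, toy_labels)

def classify(message):
    """Return 'spam' or 'ham' for a single raw text message."""
    label = toy_pipeline.predict([message])[0]
    return "spam" if label == 1 else "ham"

print(classify("free prize"))             # spam
print(classify("coming home for lunch"))  # ham
```

Keeping the vectorizer and the model in one object means a new message cannot accidentally be transformed with a different vocabulary from the one the model was trained on.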
