### Activity 5: Naive Bayes for Spam Detection
#### Members:
 1. __TUGADO, JUDE PHILIPPE M.__		
 2. __ALAMO,  ED CHRISTIAN A.__		
 3. __BONITA, KIRBY H.__		
 4. __RODRIGUEZ, AARON LANCE D.__		

#### Import Necessary Libraries

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#### Read data

In [86]:
train_df = pd.read_csv("TrainingData.csv", encoding='latin1' )
test_df = pd.read_csv("TestData.csv", encoding='latin1')

train_df

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
3895,spam,tells u 2 call 09066358152 to claim å£5000 pri...
3896,ham,No. Thank you. You've been wonderful
3897,ham,Otherwise had part time job na-tuition..
3898,ham,ÌÏ mean it's confirmed... I tot they juz say o...


#### Inspect test data

In [88]:
test_df

Unnamed: 0,message
0,That depends. How would you like to be treated...
1,"Right on brah, see you later"
2,Waiting in e car 4 my mum lor. U leh? Reach ho...
3,Your 2004 account for 07XXXXXXXXX shows 786 un...
4,Do you want a new video handset? 750 anytime a...
...,...
1667,This is the 2nd time we have tried 2 contact u...
1668,Will Ì_ b going to esplanade fr home?
1669,"Pity, * was in mood for that. So...any other s..."
1670,The guy did some bitching but I acted like i'd...


#### Encode the spam/ham labels as 1/0 and save them in a separate column

In [48]:
train_df['spam'] = train_df['label'].apply(lambda x: 1 if x == 'spam' else 0)
train_df

Unnamed: 0,label,message,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
...,...,...,...
3895,spam,tells u 2 call 09066358152 to claim å£5000 pri...,1
3896,ham,No. Thank you. You've been wonderful,0
3897,ham,Otherwise had part time job na-tuition..,0
3898,ham,ÌÏ mean it's confirmed... I tot they juz say o...,0


#### Split Train Dataset

In [50]:
x_train, x_test, y_train, y_test = train_test_split(train_df.message, train_df.spam, test_size = 0.25)
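
The split above is random, so the accuracy reported later can change from run to run, and with far fewer spam than ham messages the spam share in each part can drift. `train_test_split` accepts `random_state` and `stratify` arguments for this. The sketch below shows the idea in plain Python. The helper `stratified_split`, the seed value, and the toy labels are hypothetical and only illustrate the technique; this is not scikit-learn's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels (illustrative only, not the activity's data).
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)
spam_in_test = sum(labels[i] == "spam" for i in test_idx)
print(len(train_idx), len(test_idx), spam_in_test)  # 75 25 5
```

Splitting each class separately is what keeps the 20% spam share identical in both parts of this toy example.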

#### Use CountVectorizer to convert messages into numerical features

In [52]:
cv = CountVectorizer()
x_train_count = cv.fit_transform(x_train.values)

#### Inspect numerical features

In [56]:
x_train_count.toarray()


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
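
The array is mostly zeros because each column stands for one vocabulary word and a single message uses only a few of them. As a rough illustration of what the count matrix holds, here is a minimal bag-of-words sketch in plain Python. The helpers `tokenize`, `fit_vocabulary`, and `transform` and the two sample sentences are hypothetical; the sketch assumes the vectorizer's default settings (lowercasing, tokens of two or more word characters) and is not the library's actual code.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern:
# it keeps tokens of two or more word characters, so "u" or "c" are dropped.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Count how often each vocabulary token occurs in each document."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


docs = ["Free entry free prize", "Ok see you later"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(list(vocab))  # ['entry', 'free', 'later', 'ok', 'prize', 'see', 'you']
print(matrix)       # [[1, 2, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 1, 1]]

# Words that were not seen while fitting (here "lunch") are ignored.
unseen = transform(["free lunch"], vocab)
print(unseen)       # [[0, 1, 0, 0, 0, 0, 0]]
```

This also shows why the test split and new messages are passed through `transform` only: the vocabulary is fixed when the vectorizer is fitted.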

#### Train the model with MultinomialNB() using training data split

In [58]:
model = MultinomialNB()
model.fit(x_train_count, y_train)
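
For intuition about what fitting does, the following is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. The functions `train_nb` and `predict_nb` and the toy token lists are hypothetical and written only for illustration; the library's implementation works on sparse matrices and differs in detail.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for counts in word_counts.values() for tok in counts}

    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)  # smoothing avoids zero probabilities
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((word_counts[label][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model


def predict_nb(model, tokens):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in tokens:
            # Tokens outside the training vocabulary are skipped.
            if tok in params["log_likelihood"]:
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, already-tokenized messages (illustrative only).
train_docs = [
    ["free", "prize", "call"],
    ["win", "free", "cash"],
    ["see", "you", "later"],
    ["call", "me", "later"],
]
train_labels = ["spam", "spam", "ham", "ham"]
toy_model = train_nb(train_docs, train_labels)
print(predict_nb(toy_model, ["free", "cash", "prize"]))  # spam
print(predict_nb(toy_model, ["see", "you", "later"]))    # ham
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.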

#### Determine Accuracy using test data split

In [61]:
x_test_count = cv.transform(x_test)
model.score(x_test_count, y_test)

0.9907692307692307

#### Test Model with New Email Messages

In [124]:
emails = ["Win a free prize now!","Meeting tomorrow at 10 AM", "I kinda like you", "Buy 1 Get 1 Free! Call NOW!"]
emails_transformed = cv.transform(emails)

predictions = model.predict(emails_transformed)
for email, prediction in zip(emails,predictions):
    print(f"Email: {email} | Prediction: {'Spam' if prediction == 1 else 'Ham'}")

Email: Win a free prize now! | Prediction: Spam
Email: Meeting tomorrow at 10 AM | Prediction: Ham
Email: I kinda like you | Prediction: Ham
Email: Buy 1 Get 1 Free! Call NOW! | Prediction: Spam


#### Test Model with Original Test Data 

In [90]:
test_data_transformed = cv.transform(test_df['message'])

predictions = model.predict(test_data_transformed)

test_df['prediction'] = ['Spam' if pred == 1 else 'Ham' for pred in predictions]
prediction_counts = test_df['prediction'].value_counts()

print(prediction_counts)


prediction
Ham     1454
Spam     218
Name: count, dtype: int64


label,message count,message unique,message top,message freq
ham,3381,3195,"Sorry, I'll call later",23
spam,519,473,"Loan for any purpose å£500 - å£75,000. Homeown...",3
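
Because the test file has no labels, accuracy cannot be computed for it. One rough sanity check is to compare the share of messages predicted as spam with the share of spam in the labeled training data, using the counts printed above. A similar share does not prove the predictions are correct; a very different one would suggest a problem. The variable names below are hypothetical.

```python
# Counts copied from the outputs above.
predicted = {"Ham": 1454, "Spam": 218}   # predictions on the test data
training = {"ham": 3381, "spam": 519}    # labels in the training data

predicted_spam_rate = predicted["Spam"] / sum(predicted.values())
training_spam_rate = training["spam"] / sum(training.values())

print(f"Predicted spam share: {predicted_spam_rate:.4f}")  # 0.1304
print(f"Training spam share:  {training_spam_rate:.4f}")   # 0.1331
```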


#### Evaluate the model on the full training set using accuracy and a classification report

In [129]:
from sklearn.metrics import accuracy_score, classification_report

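# Predict on every labeled message. Note: this set includes the rows the model was fitted on.
train_messages = cv.transform(train_df['message'])
predictions = model.predict(train_messages)

# Map the numeric predictions back to the original label strings.
train_df['guess'] = ['spam' if pred == 1 else 'ham' for pred in predictions]

# Compare the true labels with the predicted ones.
accuracy = accuracy_score(train_df['label'], train_df['guess'])
print(f"Accuracy: {accuracy}")

report = classification_report(train_df['label'], train_df['guess'])
print("Classification Report: \n", report)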



Accuracy: 0.9930769230769231
Classification Report: 
               precision    recall  f1-score   support

         ham       0.99      1.00      1.00      3381
        spam       0.99      0.96      0.97       519

    accuracy                           0.99      3900
   macro avg       0.99      0.98      0.98      3900
weighted avg       0.99      0.99      0.99      3900
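
To make the columns of the report concrete, here is a minimal sketch of how precision, recall, and F1 for the spam class follow from counts of true positives, false positives, and false negatives. The function `binary_report` and the ten toy labels are hypothetical. The last lines use the support figures from the report above to compute the accuracy of always predicting ham, which is a useful floor when reading a score on imbalanced data.

```python
def binary_report(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged spam that is spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # real spam that was caught
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels (illustrative only): one missed spam, one ham flagged as spam.
y_true = ["spam"] * 4 + ["ham"] * 6
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]
metrics = binary_report(y_true, y_pred)
print(metrics)  # {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}

# Majority-class baseline from the support column: 3381 ham out of 3900 rows.
baseline_accuracy = 3381 / 3900
print(f"{baseline_accuracy:.4f}")  # 0.8669
```

In the report above, spam recall (0.96) is lower than spam precision (0.99), which means missed spam is more common than ham wrongly flagged as spam.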

