Develop a spam filtering system based on Bloom filtering.

First, download the dataset from: https://www.kaggle.com/datasets/ozlerhakan/spam-or-not-spam-dataset

A second dataset is available from: https://archive.ics.uci.edu/dataset/94/spambase
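
For readers new to the data structure: a Bloom filter is a bit array plus several hash functions. Adding an item sets a few bits, and a lookup answers "possibly present" only if all of those bits are set, so false positives can occur but false negatives cannot. The sketch below is a toy illustration under stated assumptions. `SimpleBloomFilter`, its parameters, and the sample message are hypothetical names chosen here, and this is not how `pybloom_live` is implemented.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy fixed-size Bloom filter (hypothetical helper, not the pybloom_live API)."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing formulas for the number of bits (m) and hash functions (k)
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit position derived from the item
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # "Possibly present" only if every derived bit is set
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


spam_filter = SimpleBloomFilter(capacity=1000)
spam_filter.add(b"win a free prize now")
print(b"win a free prize now" in spam_filter)  # True: an added item is always found
```

Items must be passed as bytes, which is why the notebook later converts each vector with `tobytes()` before adding it.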

In [1]:
import spacy
import numpy as np
import pandas as pd
import warnings 
warnings.filterwarnings('ignore')


In [2]:
df = pd.read_csv('spam_or_not_spam.csv')
df.head(5)

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0


In [3]:
# Load the pre-trained English language model with medium-sized word vectors
nlp = spacy.load("en_core_web_md")

In [4]:
def email_to_vector(email):
    # Process the email with spaCy
    doc = nlp(email)

    # Extract word vectors and calculate the average vector
    word_vectors = [token.vector for token in doc if token.has_vector]
    
    if word_vectors:
        avg_vector = np.mean(word_vectors, axis=0)
        return avg_vector
    else:
        # If no word vectors are found, return a zero vector
        return np.zeros(nlp.vocab.vectors.shape[1])

In [5]:
def process_emails(df):
    vectors = []
    labels = []

    for index, row in df.iterrows():
        # Cast to str so a non-string value (e.g. NaN for a missing email) does not break spaCy
        email = str(row['email'])
        label = row['label']
        # Convert email to vector
        vector = email_to_vector(email)

        # Append vector and label to lists
        vectors.append(vector)
        labels.append(label)

    return np.array(vectors), np.array(labels)

In [6]:
# Process emails and labels
vectors, labels = process_emails(df)

# Print vectors and labels
print("Vectors:\n", vectors[:5])
print("Labels:\n", labels[:5])

Vectors:
 [[-0.34246749 -0.31992501 -1.38089931 ... -1.76729345 -1.80014443
   0.03079836]
 [-1.0956682   1.04971981 -2.71677542 ... -1.53641939 -3.20208693
   1.19321811]
 [-1.69103181  0.94615138 -1.79006374 ... -1.25015318 -2.36384511
   1.25686538]
 [-0.8480776   0.58890224 -1.95315635 ... -1.97730875 -1.83998895
   0.63771141]
 [-0.87737048  1.10694838 -2.03774714 ... -1.11400509 -3.75652146
   1.08556521]]
Labels:
 [0 0 0 0 0]


In [7]:
from pybloom_live import ScalableBloomFilter
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2, random_state=42)

# Populate a Bloom filter with the spam vectors from the training set
bloom_filter = ScalableBloomFilter(mode=ScalableBloomFilter.SMALL_SET_GROWTH)
for vector, label in zip(X_train, y_train):
    if label == 1:  # Spam
        bloom_filter.add(vector.tobytes())

# Predict spam (1) for every test vector whose bytes are found in the filter
predictions = [1 if vector.tobytes() in bloom_filter else 0 for vector in X_test]

# Evaluate the accuracy of the Bloom filter
accuracy = accuracy_score(y_test, predictions)
print("Bloom Filter Accuracy:", accuracy)

Bloom Filter Accuracy: 0.8966666666666666
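
Because the filter stores `vector.tobytes()`, a test email is flagged only when its averaged vector is byte-for-byte identical to one seen in training. The sketch below illustrates that behaviour with standard-library stand-ins: `array` replaces the NumPy vectors, a plain `set` replaces the Bloom filter, and the three vectors are made-up values, not data from the notebook.

```python
from array import array

# Hypothetical averaged embeddings; vec_b differs from vec_a only slightly
vec_a = array("f", [0.25, -1.5, 0.75])
vec_b = array("f", [0.25, -1.5, 0.75 + 1e-3])
vec_c = array("f", [0.25, -1.5, 0.75])  # exact duplicate of vec_a

# A plain set stands in for the Bloom filter (assumption for illustration)
seen = {vec_a.tobytes()}

print(vec_c.tobytes() in seen)  # True: byte-identical vector
print(vec_b.tobytes() in seen)  # False: a tiny change yields different bytes
```

In other words, this setup behaves like an exact-duplicate detector for spam rather than a similarity-based classifier.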


In [8]:
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix. scikit-learn sorts the labels, so with
# labels 0 (not spam) and 1 (spam) the layout is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, predictions)

# Convert the confusion matrix to a labelled pandas DataFrame
confusion_matrix_df = pd.DataFrame(cm, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])

# Extract values from the confusion matrix
TP = confusion_matrix_df.loc['Actual Positive', 'Predicted Positive']
TN = confusion_matrix_df.loc['Actual Negative', 'Predicted Negative']
FP = confusion_matrix_df.loc['Actual Negative', 'Predicted Positive']
FN = confusion_matrix_df.loc['Actual Positive', 'Predicted Negative']

# True positive rate (sensitivity): share of actual spam that is flagged
tpr = TP / (TP + FN)
# False positive rate: share of actual non-spam that is flagged
fpr = FP / (FP + TN)

print(f"True Positive Rate (TPR):  {tpr:.4f}")
print(f"False Positive Rate (FPR): {fpr:.4f}")

True Positive Rate (TPR):  0.3474
False Positive Rate (FPR): 0.0000
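
For reference, the standard definitions are TPR = TP / (TP + FN) and FPR = FP / (FP + TN), with spam (label 1) as the positive class. The sketch below computes both from plain label lists without scikit-learn. `binary_rates` is a hypothetical helper name and the sample labels are made up for illustration.

```python
def binary_rates(y_true, y_pred):
    """Return (tpr, fpr) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Made-up labels: 3 spam and 5 non-spam emails
sample_true = [1, 1, 1, 0, 0, 0, 0, 0]
sample_pred = [1, 0, 0, 0, 0, 0, 0, 1]
sample_tpr, sample_fpr = binary_rates(sample_true, sample_pred)
print(sample_tpr, sample_fpr)
```

Here one of three spam emails is caught and one of five legitimate emails is wrongly flagged, so the rates are 1/3 and 1/5.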
