# BERT-Based Spam Filtering Project by Mateo Velarde & Mohammad Fanous

In this project, we apply Bidirectional Encoder Representations from Transformers (BERT), a pretrained transformer language model, to spam detection.

## Milestone 1: Data Download and Preprocessing
Download and preprocess the data that will be used in the project. The code should produce input (X) and output (Y) matrices that can later be used to train deep neural networks.
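
As a rough illustration of the target shapes, the sketch below builds X and Y from a few already-tokenized messages. The token IDs, the `pad_or_truncate` helper, and the `MAX_LEN` cap are hypothetical and chosen only for this example; they are not part of the project code.

```python
import numpy as np

PAD_ID = 0    # assumed padding ID for this toy example
MAX_LEN = 6   # hypothetical length cap, kept tiny so the result is easy to read

def pad_or_truncate(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return a list of exactly max_len token IDs (hypothetical helper)."""
    ids = list(ids)[:max_len]                      # cut sequences that are too long
    return ids + [pad_id] * (max_len - len(ids))   # right-pad sequences that are too short

# Made-up token ID sequences of different lengths, one per message
toy_token_ids = [
    [101, 2175, 102],
    [101, 2489, 4443, 1999, 1016, 1037, 2773, 102],
    [101, 7929, 2474, 2099, 102],
]
toy_labels = [0, 1, 0]  # 1 = spam, 0 = ham

# Note: this naive cut drops the trailing end-of-sequence ID of the long message;
# a real tokenizer's truncation handles special tokens itself.
X_toy = np.array([pad_or_truncate(ids) for ids in toy_token_ids])
Y_toy = np.array(toy_labels)

print(X_toy.shape, Y_toy.shape)
```

Every row of `X_toy` has the same length, so the rows stack into a 2-D matrix with one row per message, while `Y_toy` holds one label per row.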

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
import numpy as np

# Assuming the dataset is downloaded and available as 'spam.csv'
# Read the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']] # v1: label, v2: message
data.columns = ['label', 'message']

# Preprocessing
# No stop-word removal or other text cleaning is applied here;
# the uncased BERT tokenizer lowercases the text and splits it into subword tokens

# Label encoding (spam: 1, not spam: 0)
data['label'] = data['label'].map({'spam': 1, 'ham': 0})

# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

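# Tokenize the messages; each entry is a list of token IDs including [CLS] and [SEP]
# Truncate to 512 tokens, the maximum sequence length BERT accepts
tokenized = data['message'].apply(
    lambda x: tokenizer.encode(x, add_special_tokens=True, truncation=True, max_length=512)
)

# Pad every sequence with the tokenizer's [PAD] ID (0 for bert-base-uncased) up to the longest one
max_len = max(len(ids) for ids in tokenized)
pad_id = tokenizer.pad_token_id
padded = np.array([ids + [pad_id] * (max_len - len(ids)) for ids in tokenized])

# Prepare input (X) and output (Y) matrices
X = padded
Y = data['label'].values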

# Split the dataset into training (60%), validation (20%), and test sets (20%)
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.4, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_temp, Y_temp, test_size=0.5, random_state=42)

# X_train, X_val, X_test, Y_train, Y_val, and Y_test are now ready for training and evaluation

print('X_train shape:', X_train.shape)
print('X_val shape:', X_val.shape)
print('X_test shape:', X_test.shape)
print('Y_train shape:', Y_train.shape)
print('Y_val shape:', Y_val.shape)
print('Y_test shape:', Y_test.shape)

print(X, Y)


X_train shape: (3343, 238)
X_val shape: (1114, 238)
X_test shape: (1115, 238)
Y_train shape: (3343,)
Y_val shape: (1114,)
Y_test shape: (1115,)
[[  101  2175  2127 ...     0     0     0]
 [  101  7929  2474 ...     0     0     0]
 [  101  2489  4443 ...     0     0     0]
 ...
 [  101 12063  1010 ...     0     0     0]
 [  101  1996  3124 ...     0     0     0]
 [  101 20996 10258 ...     0     0     0]] [0 0 1 ... 0 0 0]
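
Because shorter messages are padded with zeros, a BERT model is normally also given an attention mask that marks which positions hold real tokens. The sketch below derives such a mask from a padded matrix. It assumes the padding ID is 0, as in the padding step above, and uses a small hypothetical matrix (`X_example`) in place of the real `X`.

```python
import numpy as np

PAD_ID = 0  # assumed padding ID, matching the zeros used for padding above

# Hypothetical padded token-ID matrix standing in for the real X
X_example = np.array([
    [101, 2175, 2127, 102, 0, 0],
    [101, 2489, 4443, 1999, 1016, 102],
    [101, 7929, 102, 0, 0, 0],
])

# 1 where the position holds a real token, 0 where it holds padding
attention_mask = (X_example != PAD_ID).astype(np.int64)

# Number of real (non-padding) tokens in each row
lengths = attention_mask.sum(axis=1)

print(attention_mask)
print(lengths)
```

The same expression applied to `X` would give a mask with the same shape as `X`. It would have to be split with the same indices as `X` and `Y` so that each mask row stays aligned with its message.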
