<a href="https://colab.research.google.com/github/HaywhyCoder/spam_classifier/blob/main/Spam_Not_spam_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import Libraries

In [None]:
import tensorflow as tf
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset

### Load the Dataset

In [None]:
dataset = load_dataset('Deysi/spam-detection-dataset')

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8175
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2725
    })
})
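Before subsampling, it can help to check how the classes are balanced. The following is a minimal sketch using only the standard library. `label_shares` is a hypothetical helper, and the toy list stands in for `dataset['train']['label']`; the label name `'not_spam'` is an assumption, since only `'spam'` appears in this notebook.

```python
from collections import Counter

def label_shares(labels):
    """Return a dict mapping each label to its fraction of the examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

# Toy stand-in for dataset['train']['label']; the real values are an assumption.
toy_labels = ['spam', 'not_spam', 'spam', 'not_spam', 'not_spam']
shares = label_shares(toy_labels)
print(shares)
```

A strongly skewed split would be a reason to look beyond plain accuracy when judging the classifier.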

### Load Pre-Trained Tokenizer and Model

In [None]:
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

### Tokenize the Dataset

In [None]:
def build_tokens(data):
  """
  Tokenize a batch of examples, truncating and padding each sequence to the
  model's maximum length.

  Return: a dict of TensorFlow tensors (input_ids, attention_mask)
  """
  return tokenizer(data['text'], padding='max_length', truncation=True, return_tensors='tf')

encoded_dataset = dataset.map(build_tokens, batched=True)

#### Prepare the Dataset

In [None]:
encoded_dataset = encoded_dataset.remove_columns(['text'])
encoded_dataset = encoded_dataset.rename_column("label", "labels")

train_dataset = encoded_dataset['train'].shuffle(seed=42).select(range(3000))
test_dataset = encoded_dataset['test'].shuffle(seed=42).select(range(1000))

In [None]:
def prepare_dataset(dataset):
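  """
  Convert a tokenized Hugging Face dataset into a batched TensorFlow dataset.

  Return: tf.data.Dataset of (features, labels) pairs in batches of 8
  """
  # Encode the string labels as integers: 'spam' -> 1, anything else -> 0
  dataset = dataset.map(lambda x: {'labels': 1 if x['labels'] == 'spam' else 0})
  # Every column except the labels is a model input (input_ids, attention_mask)
  features = {key: dataset[key] for key in dataset.features if key != "labels"}
  return tf.data.Dataset.from_tensor_slices((features, dataset['labels'])).batch(8)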

train_tf_dataset = prepare_dataset(train_dataset)
test_tf_dataset = prepare_dataset(test_dataset)
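The label mapping in `prepare_dataset` turns every value other than `'spam'` into 0, so a misspelled or differently cased label would silently count as "not spam". A stricter variant might look like the sketch below. `LABEL_TO_ID` and `encode_label` are hypothetical names, and `'not_spam'` is an assumed label value.

```python
# Hypothetical lookup table; 'not_spam' is an assumed label name.
LABEL_TO_ID = {'not_spam': 0, 'spam': 1}

def encode_label(label):
    """Map a string label to its integer id, failing loudly on unknown values."""
    try:
        return LABEL_TO_ID[label]
    except KeyError:
        raise ValueError(f"unexpected label: {label!r}") from None

print(encode_label('spam'))
print(encode_label('not_spam'))
```

The function could replace the inline conditional inside the `map` call if stricter validation is wanted.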

#### Compile the Model

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
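The loss is built with `from_logits=True` because the classification head returns raw scores rather than probabilities. As a rough illustration of what that loss computes for a single example, here is a plain-Python sketch. `sparse_ce_from_logits` is a hypothetical helper, not the Keras implementation.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Equal logits mean a 50/50 guess, so the loss is log(2), about 0.693.
print(sparse_ce_from_logits([0.0, 0.0], 1))
```

The label is passed as an integer class index, which is why the "sparse" variant of the loss fits the 0/1 labels produced earlier.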

#### Train the Model

In [None]:
model.fit(train_tf_dataset, validation_data=test_tf_dataset, epochs=1)



<tf_keras.src.callbacks.History at 0x7a93000d7050>
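The compiled metric is accuracy only. For a spam filter it may also be useful to look at precision and recall on the spam class. The sketch below computes both from lists of integer labels; `precision_recall` is a hypothetical helper, and the toy lists are made up for illustration rather than taken from this model.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```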

#### Test the Model

In [None]:
import numpy as np

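def spam_filter(email):
  """Print whether a single email is classified as spam."""
  token = build_tokens({'text': [email]})
  pred = model.predict(token)
  # One input gives one row of logits; take the index of the larger logit
  pred = np.argmax(pred.logits, axis=1)[0]

  if pred == 1: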
    print("Spam")
  else:
    print("Not spam")

spam_filter("""
WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
""")

Spam
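`spam_filter` picks the class with the larger logit, which amounts to a 0.5 probability cutoff. If a stricter cutoff is wanted, the logits can be converted to a probability first. The sketch below shows one way to do that in plain Python. `spam_probability` and `classify` are hypothetical helpers, and the logit values are made up.

```python
import math

def spam_probability(logits):
    """Softmax probability of class 1 (spam) from a pair of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)

def classify(logits, threshold=0.5):
    """Return "Spam" only when the spam probability exceeds the threshold."""
    return "Spam" if spam_probability(logits) > threshold else "Not spam"

# Made-up logits: the same scores flip to "Not spam" under a stricter cutoff.
print(classify([0.0, 1.0]))
print(classify([0.0, 1.0], threshold=0.9))
```

Raising the threshold trades missed spam for fewer legitimate messages flagged by mistake.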
