<a href="https://colab.research.google.com/github/Kimi-Gingercat/IAT360-FinalProj-SpamDetection/blob/main/SMS_Spam_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup Python Libraries (pip)

In [5]:
# install some Python packages with pip

%pip install numpy torch datasets transformers evaluate --quiet

Note: you may need to restart the kernel to use updated packages.


In [7]:
# let's check the version we are using

%pip freeze | grep -E '^numpy|^torch|^datasets|^transformers|^evaluate'

datasets==3.1.0
evaluate==0.4.3
numpy==1.26.4
torch==2.5.1
torchvision==0.20.1
transformers==4.47.0
Note: you may need to restart the kernel to use updated packages.


# Create Spam Dataset for Fine-tuning BERT

## Let's load the Spam Dataset

In [11]:
from datasets import load_dataset
import pandas as pd

# Paths to the Parquet files
testSet = "Documents/IAT360/FinalProj/data/test-00000-of-00001-fa9b3e8ade89a333.parquet"
trainSet = "Documents/IAT360/FinalProj/data/train-00000-of-00001-daf190ce720b3dbb.parquet"

# Load each file into separate DataFrames
test_df = pd.read_parquet(testSet)
train_df = pd.read_parquet(trainSet)

In [12]:
dataset_summary = {
    "train": {
        "features": list(train_df.columns),
        "num_rows": len(train_df),
    },
    "test": {
        "features": list(test_df.columns),
        "num_rows": len(test_df),
    },
}

# Print the structure
print("Dataset Summary:")
for key, value in dataset_summary.items():
    print(f"{key}:")
    print(f"  Features: {value['features']}")
    print(f"  Num Rows: {value['num_rows']}")

Dataset Summary:
train:
  Features: ['text', 'label']
  Num Rows: 8175
test:
  Features: ['text', 'label']
  Num Rows: 2725


The raw dataset is split 75/25 between train and test (8,175 and 2,725 rows). It has 10,900 entries in total.
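As a quick check of those proportions, the split shares can be computed from the row counts printed in the summary. This is a small sketch; the variable names are illustrative and not part of the original notebook.

```python
# Row counts copied from the dataset summary above
train_rows = 8175
test_rows = 2725

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"total: {total_rows}")
print(f"train: {train_share:.0%}, test: {test_share:.0%}")
```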

In [13]:
print(train_df.head())
print(test_df.head())

                                                text     label
0  hey I am looking for Xray baggage datasets can...  not_spam
1  "Get rich quick! Make millions in just days wi...      spam
2  URGENT MESSAGE: YOU WON'T BELIEVE WHAT WE HAVE...      spam
3  [Google AI Blog: Contributing Data to Deepfake...  not_spam
4  Trying to see if anyone already has timestamps...  not_spam
                                                text     label
0   Deezer.com 10,406,168 Artist DB\n\nWe have sc...  not_spam
1  🚨 ATTENTION ALL USERS! 🚨\n\n🆘 Are you looking ...      spam
2  I'm working on a stats project to test some of...  not_spam
3  [[Sorry, I cannot generate inappropriate or sp...      spam
4  L@@k at these Unbelievable diet pills that can...      spam


In [14]:
# New entries to add to the training set
new_train_entries = [
    {"text": "Congratulations! You've won a free gift card. Claim now.", "label": "spam"},
    {"text": "Does anyone have experience with Python's Pandas library?", "label": "not_spam"},
]

# New entries to add to the test set
new_test_entries = [
    {"text": "Breaking news: Stock market crashes due to unforeseen events.", "label": "not_spam"},
    {"text": "Click here to claim your lottery winnings!", "label": "spam"},
]

# Convert new entries to DataFrame
new_train_df = pd.DataFrame(new_train_entries)
new_test_df = pd.DataFrame(new_test_entries)

# Append new entries to the existing DataFrames
merged_train_df = pd.concat([train_df, new_train_df], ignore_index=True)
merged_test_df = pd.concat([test_df, new_test_df], ignore_index=True)

# Save the merged DataFrames as new files
merged_train_path = "Documents/IAT360/FinalProj/data/merged_train.parquet"
merged_test_path = "Documents/IAT360/FinalProj/data/merged_test.parquet"

merged_train_df.to_parquet(merged_train_path, index=False)
merged_test_df.to_parquet(merged_test_path, index=False)
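Hand-written rows only help if they follow the same schema as the rest of the data. The sketch below checks new entries before they are appended. `find_invalid_entries`, `EXPECTED_KEYS`, and `ALLOWED_LABELS` are hypothetical names introduced for illustration, and the allowed labels are assumed from the `label` values printed earlier.

```python
# Hypothetical helper, not part of the original notebook
EXPECTED_KEYS = {"text", "label"}
ALLOWED_LABELS = {"spam", "not_spam"}  # assumed from the label column shown above

def find_invalid_entries(entries):
    """Return (index, reason) pairs for entries that do not match the dataset schema."""
    problems = []
    for index, entry in enumerate(entries):
        if set(entry) != EXPECTED_KEYS:
            problems.append((index, "unexpected keys"))
        elif entry["label"] not in ALLOWED_LABELS:
            problems.append((index, "unknown label"))
        elif not entry["text"].strip():
            problems.append((index, "empty text"))
    return problems

sample_entries = [
    {"text": "Click here to claim your lottery winnings!", "label": "spam"},
    {"text": "Does anyone have experience with Python's Pandas library?", "label": "ham"},
]
print(find_invalid_entries(sample_entries))  # the second entry uses a label the dataset does not have
```

Running a check like this before `pd.concat` keeps a mistyped label from silently becoming a third class.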


In [15]:
# Load the new files and inspect
merged_train_df = pd.read_parquet(merged_train_path)
merged_test_df = pd.read_parquet(merged_test_path)

print(merged_train_df.tail())  # Check last few rows of the new training set
print(merged_test_df.tail())   # Check last few rows of the new testing set


                                                   text     label
8172  Hi\n\nI am working on a project and need penal...  not_spam
8173  Do you want to BLOW UP your social media follo...      spam
8174  WAZZUP MY FELLOW NETIZENS! Time to get your sc...      spam
8175  Congratulations! You've won a free gift card. ...      spam
8176  Does anyone have experience with Python's Pand...  not_spam
                                                   text     label
2722  Would love if anyone knew of any really good d...  not_spam
2723     Fields = Hashrate, VRAM, TDP, MSRP, Profit/day  not_spam
2724  Feelin’ like you’re not getting enough attenti...      spam
2725  Breaking news: Stock market crashes due to unf...  not_spam
2726         Click here to claim your lottery winnings!      spam


In [16]:
dataset_summary = {
    "train": {
        "features": list(merged_train_df.columns),
        "num_rows": len(merged_train_df),
    },
    "test": {
        "features": list(merged_test_df.columns),
        "num_rows": len(merged_test_df),
    },
}

# Print the structure
print("Dataset Summary:")
for key, value in dataset_summary.items():
    print(f"{key}:")
    print(f"  Features: {value['features']}")
    print(f"  Num Rows: {value['num_rows']}")

Dataset Summary:
train:
  Features: ['text', 'label']
  Num Rows: 8177
test:
  Features: ['text', 'label']
  Num Rows: 2727


In [26]:
# Combine the current train and test datasets
combined_df = pd.concat([merged_train_df, merged_test_df])

## Let's create the train, validation, test sets

Split dataset into train 70%, validation 15%, test 15%

In [28]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Step 1: Split into train, validation, and test (70%, 15%, 15%)
train_df, temp_df = train_test_split(combined_df, test_size=0.3, random_state=42, shuffle=True)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, shuffle=True)

# Step 2: Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

# Combine into a DatasetDict
dataset = DatasetDict({
    "train": train_dataset,
    "val": val_dataset,
    "test": test_dataset,
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 7632
    })
    val: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 1636
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 1636
    })
})

In [29]:
dataset = dataset.remove_columns(["__index_level_0__"])
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 7632
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 1636
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1636
    })
})

After merging, the dataset has 10,904 entries: 7,632 for training, 1,636 for validation, and 1,636 for testing.
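The split sizes shown in the `DatasetDict` output can be reproduced by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up and gives the remainder to the other part; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Size a two-way split, rounding the held-out part up (assumed scikit-learn behaviour)."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

total_rows = 10904
n_train, n_temp = split_sizes(total_rows, 0.3)  # first split: 70% train, 30% held out
n_val, n_test = split_sizes(n_temp, 0.5)        # second split: held-out part cut in half

print(n_train, n_val, n_test)
```

Chaining `test_size=0.3` and `test_size=0.5` is what yields the 70/15/15 proportions, since half of 30% is 15%.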

## We start by tokenizing our dataset with DistilBERT's fast tokenizer

In [17]:
# let's import the pretrained fast tokenizer from Hugging Face
# source: (https://huggingface.co/distilbert-base-uncased)

from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [30]:
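# tokenize the text in batches with truncation and padding based on BERT requirements,
# and map the string labels to the integer ids the model expects

label2id = {'not_spam': 0, 'spam': 1}

def tokenization(example):
    encoded = tokenizer(example['text'], truncation=True, padding=True)
    encoded['label'] = [label2id[label] for label in example['label']]
    return encoded

tokenized_dataset = dataset.map(tokenization, batched=True, remove_columns=['text'])
tokenized_dataset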

Map: 100%|████████████████████████| 7632/7632 [00:00<00:00, 13645.56 examples/s]
Map: 100%|████████████████████████| 1636/1636 [00:00<00:00, 14252.98 examples/s]
Map: 100%|████████████████████████| 1636/1636 [00:00<00:00, 14452.94 examples/s]


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 7632
    })
    val: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1636
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1636
    })
})
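To make the `truncation=True, padding=True` arguments concrete, here is a toy version of what happens to a batch of token ids. It is only a sketch: `pad_and_truncate` is a hypothetical helper, the ids other than 101 (`[CLS]`), 102 (`[SEP]`), and 0 (`[PAD]`) are made up, and a real tokenizer keeps the closing `[SEP]` when it truncates.

```python
# Toy illustration only; the real work is done by the Hugging Face tokenizer
PAD_ID = 0  # id of [PAD], as listed in the tokenizer output above

def pad_and_truncate(batch, max_length):
    """Cut each sequence to max_length, then right-pad to the longest sequence in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 2936, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The `attention_mask` column in `tokenized_dataset` plays the same role: it tells the model which positions are padding.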

# Setup Training Metrics (Accuracy, F1)

In [31]:
import evaluate
import numpy as np

# we setup the training to evaluate the accuracy and f1 scores

accuracy_metric = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels)
    return {**accuracy, **f1}

Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 10.1MB/s]
Downloading builder script: 100%|██████████| 6.77k/6.77k [00:00<00:00, 18.3MB/s]
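For intuition about what `compute_metrics` returns, the same two numbers can be computed by hand on a tiny batch. This is a plain-Python sketch with made-up logits and labels; it assumes binary F1 with label 1 (`spam`) as the positive class, and `accuracy_and_f1` is a hypothetical name.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_and_f1(logits, labels, positive=1):
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in zip(predictions, labels))
    fp = sum(p == positive and y != positive for p, y in zip(predictions, labels))
    fn = sum(p != positive and y == positive for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Made-up model outputs: one row of two logits per example
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
labels = [0, 1, 0, 1]
print(accuracy_and_f1(logits, labels))
```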


In [35]:
%pip install tf-keras

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting tf-keras
  Downloading tf_keras-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Collecting keras>=3.5.0 (from tensorflow<2.19,>=2.18->tf-keras)
  Downloading keras-3.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading tf_keras-2.18.0-py3-none-any.whl (1.7 MB)
Downloading keras-3.7.0-py3-none-any.whl (1.2 MB)
Installing collected packages: keras, tf-keras
  Attempting uninstall: keras
    Found existing installation: keras 2.11.0
    Uninstalling keras-2.11.0:
      Successfully uninstalled keras-2.11.0
Successfully installed keras-3.7.0 tf-keras-2.18.0

In [41]:
%pip install "transformers[torch]"



Collecting accelerate>=0.26.0 (from transformers[torch])
  Downloading accelerate-1.1.1-py3-none-any.whl.metadata (19 kB)
Downloading accelerate-1.1.1-py3-none-any.whl (333 kB)
Installing collected packages: accelerate
Successfully installed accelerate-1.1.1

Note: you may need to restart the kernel to use updated packages.


In [46]:
%pip install --upgrade pip
%pip install "accelerate>=0.26.0"


Collecting pip
  Downloading pip-24.3.1-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-24.3.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.2
    Uninstalling pip-24.2:
      Successfully uninstalled pip-24.2
Successfully installed pip-24.3.1
Note: you may need to restart the kernel to use updated packages.



Note: you may need to restart the kernel to use updated packages.


In [47]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Setup Training Configurations

In [48]:
import os
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# disable wandb logging; set this before the training arguments and Trainer are built
os.environ['WANDB_DISABLED'] = "true"

# get the DistilBERT model with a sequence classification head for spam detection
# source: (https://huggingface.co/distilbert-base-uncased)
checkpoint = 'distilbert-base-uncased'
num_labels = 2
id2label = {0: 'not_spam', 1: 'spam'}
label2id = {'not_spam': 0, 'spam': 1}
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels, id2label=id2label, label2id=label2id)

# setup custom training arguments
# 1. store training checkpoints to 'results' output directory
# 2. fine-tune for just 1 epoch
# 3,4. use 16 as a batch size to speed things up
# 5. evaluate validation set every 500 steps (this is the default steps)
# 6. load the best model based on the lowest validation loss at the end of training
training_args = TrainingArguments(
    seed=42,
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy='steps',  # `evaluation_strategy` is the deprecated name for this argument
    load_best_model_at_end=True,
)

# setup trainer with custom metrics (accuracy, f1)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'],
    compute_metrics=compute_metrics,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.26.0`: Please run `pip install transformers[torch]` or `pip install 'accelerate>={ACCELERATE_MIN_VERSION}'`
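The `ImportError` above appears even though `accelerate` 1.1.1 was installed a few cells earlier. A likely cause (an assumption, since the notebook does not confirm it) is that the kernel was not restarted after the install, as the pip note suggests. The sketch below reports which `accelerate` version is installed on disk; `describe_package` and the other helpers are hypothetical names, and the version parsing is deliberately simple.

```python
from importlib import metadata
from itertools import takewhile

def version_tuple(version, width=3):
    """Parse the leading digits of each dotted component, e.g. '1.1.1' -> (1, 1, 1)."""
    numbers = []
    for piece in version.split(".")[:width]:
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    # Pad short versions such as '0.26' so both tuples have the same length
    return tuple(numbers + [0] * (width - len(numbers)))

def meets_minimum(installed, minimum):
    return version_tuple(installed) >= version_tuple(minimum)

def describe_package(name, minimum):
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name} is not installed in this kernel"
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    return f"{name} {installed} ({status}, need >= {minimum})"

print(describe_package("accelerate", "0.26.0"))
```

If this reports a new enough version and the error persists, restarting the kernel and re-running the notebook from the top is the next thing to try.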

# Evaluate UnFine-Tuned BERT on Test Set for a Baseline Metric


In [None]:
# let's first evaluate unfine-tuned model with test set

trainer.evaluate(tokenized_dataset['test'])


{'eval_loss': 0.6972731947898865,
 'eval_model_preparation_time': 0.0028,
 'eval_accuracy': 0.4988,
 'eval_f1': 0.6642550911039657,
 'eval_runtime': 389.6548,
 'eval_samples_per_second': 64.159,
 'eval_steps_per_second': 4.011}

Without fine-tuning, the model scores about **50% Accuracy (eval_accuracy)** and **66% F1 (eval_f1)** in the output above. That is roughly what guessing achieves when the test set is split about evenly between the two classes, so it is not a useful classifier yet. 😕
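One reading consistent with the `eval_accuracy` of 0.4988 and `eval_f1` of 0.664 in the output above is that the untrained classification head assigns nearly every example to the positive class. This is an assumption, since the per-class predictions are not shown. On a balanced test set that would give precision $P \approx 0.5$ and recall $R \approx 1$, so:

$$
F_1 = \frac{2PR}{P + R} \approx \frac{2 \cdot 0.5 \cdot 1}{0.5 + 1} \approx 0.67
$$

This is why F1 can sit well above accuracy for a model that has learned nothing.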


Let's make it better with transfer learning! 🦾

# Fine-Tune BERT with the Spam Dataset

In [None]:
# let's fine-tune BERT with the spam dataset

trainer.train()

Step,Training Loss,Validation Loss,Model Preparation Time,Accuracy,F1
500,0.3424,0.285967,0.0028,0.9016,0.900081
1000,0.2415,0.229587,0.0028,0.9172,0.918343


TrainOutput(global_step=1250, training_loss=0.27448213500976565, metrics={'train_runtime': 1203.8007, 'train_samples_per_second': 16.614, 'train_steps_per_second': 1.038, 'total_flos': 2649347973120000.0, 'train_loss': 0.27448213500976565, 'epoch': 1.0})

In [None]:
# let's see how well it did in the test set

trainer.evaluate(tokenized_dataset['test'])

{'eval_loss': 0.21517841517925262,
 'eval_model_preparation_time': 0.0028,
 'eval_accuracy': 0.92296,
 'eval_f1': 0.9246596776717259,
 'eval_runtime': 393.8639,
 'eval_samples_per_second': 63.474,
 'eval_steps_per_second': 3.968,
 'epoch': 1.0}

**WOAH!** We got a **92% Accuracy (eval_accuracy)** and **92% F1 (eval_f1)** with just **1 epoch**! 🤯

# Try out some examples!

In [None]:
from transformers import pipeline
import torch

# use the first GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1

# create a text-classification pipeline with our custom model and tokenizer
sentiment_classifier = pipeline(task='text-classification', model=model, tokenizer=tokenizer, device=device)

In [None]:
# let's see how our model classifies a good review
# this is from 'justinvitelli' (https://www.imdb.com/review/rw8972952)

review = """
First off this movie is for kids and fans of Nintendo and the Mario franchise.
I still think an adult who isnt a fan could still enjoy it but this movie is so
full of fan service that it will have you smiling the whole time.
The voice acting I was skeptical but they all work and work well too.
Jack Black is the star here. I love how they kept the story simple like all of the games.
Truly felt like a video game on screen.
This movie felt like a beautifully animated amusement park ride.
The audio in the movie was amazing too.
The sounds and the score with reimagined iconic music was perfect.
Some of the songs in the movie felt unnecessary but they worked.
I think they should've bumped the run time to 105-120 min.
90 min felt too short as it goes by quick.
I havent had this much wholesome fun at the movies in a long time.
If youre a fan you HAVE to see it.
"""
sentiment_classifier(review)

[{'label': 'POSITIVE', 'score': 0.9938808679580688}]

That is **99% POSITIVE**! *justinvitelli* loves the movie!

In [None]:
# let's see how our model classifies a bad review
# this is from 'industriousbug16' (https://www.imdb.com/review/rw8998214)

review = """
Flat, visual noise.
Fundamentally incurious. Potentially injurious.
The mystique generated by the characters in the games is here raked over and presented
haphazardly by hacks.
A hobbled attempt to explain a long and random evolution of characters who were never meant
to be narratised fails.
Doing it well is near impossible when you insist on EVERY LITTLE BIT OF LORE,
from the last forty years being shoehorned into 90 minutes.
Makes little sense, shamelessly leans on member berries to stimulate older viewers but offers
nothing else.
I feel sad for the animators who did a sterling job, but to no end as this movie has no soul.
"""
sentiment_classifier(review)

[{'label': 'NEGATIVE', 'score': 0.9951890707015991}]

That is **99% NEGATIVE**! *industriousbug16* must hate the movie very badly.
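The `score` in these results is, as far as I understand the pipeline's behaviour for a two-label model, the softmax probability of the winning label. The sketch below shows that conversion in plain Python. The logits are hypothetical values chosen for illustration, and `id2label` mirrors the mapping passed to the model in the training setup.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "not_spam", 1: "spam"}
logits = [-2.1, 3.0]  # hypothetical raw outputs for one message

probabilities = softmax(logits)
best = max(range(len(probabilities)), key=probabilities.__getitem__)
print({"label": id2label[best], "score": round(probabilities[best], 4)})
```

A score near 1.0 therefore means the two logits were far apart, not that the prediction is guaranteed to be right.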

# Resources

### If you would like to use this model without running the entire notebook, try the model at my [HuggingFace](https://huggingface.co/wesleyacheng/movie-review-sentiment-classifier-with-bert).

### If you would like to find this on GitHub, here's my [repo](https://github.com/wesleyacheng/movie-review-sentiment-classifier-with-bert).