# Spam/Jailbreak Classification

---

## Dependencies

### Modules

In [None]:
%pip install fastai torch transformers datasets tokenizers scikit-learn matplotlib spacy

### Imports

In [76]:
import pandas as pd
from sklearn.model_selection import train_test_split
from fastai.text.all import *
from transformers import AutoTokenizer


---

## Data

190K+ Spam | Ham Email Dataset for Classification: https://www.kaggle.com/datasets/meruvulikith/190k-spam-ham-email-dataset-for-classification

Emails for spam or ham classification (Trec 2007): https://www.kaggle.com/datasets/bayes2003/emails-for-spam-or-ham-classification-trec-2007?select=email_text.csv

### Filtering

In [17]:
df_a = pd.read_csv("data/email_text.csv")
df_b = pd.read_csv("data/spam_Emails_data.csv")

In [28]:
df_a['label'] = df_a['label'].map({1: 'spam', 0: 'ham'})
df_b['label'] = df_b['label'].str.strip().str.lower()

In [33]:
df_b['label'].unique()

array(['spam', 'ham'], dtype=object)
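
`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so counting missing values after the mapping is a cheap way to catch unexpected label codes. A minimal sketch on a toy frame (the values below are made up for illustration and are not taken from either dataset):

```python
import pandas as pd

# Toy frame standing in for df_a; the label 2 is a deliberately unexpected code.
toy = pd.DataFrame({"label": [1, 0, 2], "text": ["a", "b", "c"]})

mapped = toy["label"].map({1: "spam", 0: "ham"})

# Values missing from the mapping become NaN instead of raising an error.
unmapped_count = int(mapped.isna().sum())
print(unmapped_count)
```

A non-zero count here would mean some rows lose their label silently and should be inspected before merging.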

In [34]:
merged = pd.concat([df_a[['label', 'text']], df_b[['label', 'text']]], ignore_index=True)
merged

| | label | text |
|---|---|---|
| 0 | spam | do you feel the pressure to perform and not ri... |
| 1 | ham | hi i've just updated from the gulus and i chec... |
| 2 | spam | mega authenticv i a g r a discount pricec i a ... |
| 3 | spam | hey billy it was really fun going out the othe... |
| 4 | spam | system of the home it will have the capabiliti... |
| ... | ... | ... |
| 247515 | ham | on escapenumber escapenumber escapenumber rob ... |
| 247516 | spam | we have everything you need escapelong cialesc... |
| 247517 | ham | hi quick question say i have a date variable i... |
| 247518 | spam | thank you for your loan request which we recie... |


In [36]:
merged['label'].unique()

array(['spam', 'ham'], dtype=object)

In [41]:
merged['text'].unique()

array(['do you feel the pressure to perform and not rising to the occasion try v ia gr a your anxiety will be a thing of the past and you will be back to your old self ',
       "hi i've just updated from the gulus and i check on other mirrors it seems there is a little typo in debian readme file example http gulus usherbrooke ca debian readme ftp ftp fr debian org debian readme testing or lenny access this release through dists testing the current tested development snapshot is named etch packages which have been tested in unstable and passed automated tests propogate to this release etch should be replace by lenny like in the readme html yan morin consultant en logiciel libre yan morin savoirfairelinux com escapenumber escapenumber escapenumber to unsubscribe email to debian mirrors request lists debian org with a subject of unsubscribe trouble contact listmaster lists debian org",
       'mega authenticv i a g r a discount pricec i a l i s discount pricedo not miss it click here htt

In [42]:
merged = merged.drop_duplicates().reset_index(drop=True)

In [59]:
merged.describe()

| | label | text |
|---|---|---|
| count | 193852 | 193850 |
| unique | 2 | 193848 |
| top | ham | hi |
| freq | 102160 | 2 |


In [62]:
merged = merged.dropna(subset=['text']).reset_index(drop=True)
merged = merged.drop_duplicates(subset=['text']).reset_index(drop=True)
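
In the first `describe()` output, `text` has 193850 non-null values but only 193848 unique ones, and `hi` still appears twice after whole-row deduplication, which suggests that some texts occur under both labels. Because `drop_duplicates(subset=['text'])` keeps the first occurrence, the surviving label for such a text depends on row order. A minimal sketch, on made-up toy data, of how the conflicting texts could be listed before they are dropped:

```python
import pandas as pd

# Toy data: "hi" appears with both labels, and one text is missing.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "text": ["hi", "hi", "see you at noon", None],
})

# Number of distinct labels attached to each text (groupby skips missing texts).
labels_per_text = toy.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

# Same two steps as the cell above: drop missing text, then keep the first row per text.
cleaned = (
    toy.dropna(subset=["text"])
       .drop_duplicates(subset=["text"])
       .reset_index(drop=True)
)
print(conflicting, len(cleaned))
```

Whether to keep the first label, or drop conflicting texts entirely, is a design choice; the notebook keeps the first.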

In [63]:
merged.describe()

| | label | text |
|---|---|---|
| count | 193848 | 193848 |
| unique | 2 | 193848 |
| top | ham | do you feel the pressure to perform and not ri... |
| freq | 102158 | 1 |


In [64]:
merged.to_csv("data/merged_spam_ham.csv", index=False)

### Splitting

In [74]:
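# 80% train, 20% held out; stratify keeps the spam/ham ratio in each split.
# random_state is an added assumption (42 is arbitrary) so the split is reproducible.
train, temp = train_test_split(
    merged,
    train_size=0.8,
    stratify=merged['label'],
    shuffle=True,
    random_state=42,
)

# Split the held-out 20% evenly into validation and test sets.
val, test = train_test_split(
    temp,
    test_size=0.5,
    stratify=temp['label'],
    shuffle=True,
    random_state=42,
)

label_map = {'ham': 0, 'spam': 1}

# Work on copies so the label assignment cannot trigger SettingWithCopyWarning.
train, val, test = train.copy(), val.copy(), test.copy()
train['label'] = train['label'].map(label_map)
val['label'] = val['label'].map(label_map)
test['label'] = test['label'].map(label_map)
train = train.reset_index(drop=True)
val = val.reset_index(drop=True)
test = test.reset_index(drop=True)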
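
The two-stage split above should give roughly 80/10/10 proportions with the same class balance in every part. A minimal sketch of how that could be checked, using a synthetic frame as a stand-in for `merged` (the 60/40 class balance and the `random_state` value are assumptions for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `merged`: 60% ham, 40% spam.
toy = pd.DataFrame({
    "label": ["ham"] * 600 + ["spam"] * 400,
    "text": [f"message {i}" for i in range(1000)],
})

# random_state is fixed here only so the sketch is repeatable.
train, temp = train_test_split(toy, train_size=0.8, stratify=toy["label"], random_state=0)
val, test = train_test_split(temp, test_size=0.5, stratify=temp["label"], random_state=0)

# Fraction of spam in each split; stratification should keep these close to 0.4.
spam_share = {
    name: (part["label"] == "spam").mean()
    for name, part in [("train", train), ("val", val), ("test", test)]
}
sizes = (len(train), len(val), len(test))
print(sizes, spam_share)
```

Running the same share computation on the real `train`, `val`, and `test` frames is a quick sanity check before saving them.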

In [75]:
train.to_csv("filtered_data/train/spam_ham_train.csv", index=False)
val.to_csv("filtered_data/validation/spam_ham_val.csv", index=False)
test.to_csv("filtered_data/test/spam_ham_test.csv", index=False)
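
`DataFrame.to_csv` does not create missing parent folders, so the cell above assumes that `filtered_data/train`, `filtered_data/validation`, and `filtered_data/test` already exist. A minimal sketch of creating a folder before writing, using a temporary directory as a stand-in for the project root (the toy frame is made up for illustration):

```python
from pathlib import Path
import tempfile

import pandas as pd

# A temporary directory stands in for the project root in this sketch.
root = Path(tempfile.mkdtemp())
out_path = root / "filtered_data" / "train" / "spam_ham_train.csv"

# Create the parent folders if they are missing; exist_ok avoids an error on reruns.
out_path.parent.mkdir(parents=True, exist_ok=True)

toy = pd.DataFrame({"label": [0, 1], "text": ["hello", "win a prize"]})
toy.to_csv(out_path, index=False)

# Read the file back to confirm the write succeeded.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```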

---

## Classifier