### This application is designed to classify emails as spam or ham.

To do this, we proceed as follows: 

1. install prerequisites
2. load and prepare CSV data - load the CSV file and split it into training and test datasets
3. tokenise the messages - tokenise the texts with the tokeniser of a pre-trained model
4. create PyTorch datasets - create dataset objects for PyTorch
5. load model and configure training - select a model for text classification (e.g. BERT), define training parameters and train the model with the training data
6. evaluate model - evaluate the model on the test data
7. save and load model
8. make predictions - use the model to classify new emails

### 1. Install prerequisites (note: it is good practice to use Anaconda and to install all packages in a dedicated environment)

<span style="color:red"><b>CLI command:</b> pip install torch transformers datasets scikit-learn pandas</span><br>
<span style="color:red"><b>If Applicable:</b> pip install ipywidgets</span>

- torch: for training the model
- transformers: for using pre-trained models from HuggingFace
- datasets: a helpful tool from HuggingFace for processing and loading datasets
- scikit-learn: for splitting the data and for evaluation metrics such as accuracy and F1 score
- pandas: for loading and editing CSV files
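
Once the packages are installed, a quick check like the following can confirm that they are importable in the active environment. This is a minimal sketch: note that scikit-learn is imported under the name `sklearn`, and the helper name `check_packages` is made up for this example.

```python
import importlib.util

# Import names of the packages listed above (scikit-learn is imported as "sklearn")
REQUIRED = ["torch", "transformers", "datasets", "sklearn", "pandas"]

def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any package reported as missing can be installed with the pip command shown above.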

### 2. Load and prepare CSV

Load and prepare your CSV file with the messages and categories. We will convert the data into a format that works well with HuggingFace.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split # function that splits data into training and test sets

# load the CSV into a pandas DataFrame (a table-like structure that stores data in rows and columns)
df = pd.read_csv('s_spam.csv')

# display first lines from dataframe for verification
print(df.head())

# map the Category column to numerical labels so it can be used for training: 'ham' (not spam) -> 0, 'spam' -> 1
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

# split into training and test set (80% training, 20% test)
# random_state=42 makes the split reproducible (i.e. the same split is generated each time the code is run)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# show the number of rows in the training and test set to check that the split was carried out correctly
print(f"Training dataset size: {len(train_df)}")
print(f"Test dataset size: {len(test_df)}")

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
Training dataset size: 4457
Test dataset size: 1115
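
Spam datasets are often imbalanced, and a purely random split does not guarantee the same share of spam in both parts. One option is the `stratify` parameter of `train_test_split`. The sketch below demonstrates it on a small toy DataFrame that is invented for illustration (40 ham, 10 spam); it is not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (invented for illustration): 40 ham (0) and 10 spam (1) messages
toy_df = pd.DataFrame({
    "Category": [0] * 40 + [1] * 10,
    "Message": [f"message {i}" for i in range(50)],
})

# stratify keeps the ham/spam ratio the same in both splits
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["Category"]
)

print(toy_train["Category"].value_counts().to_dict())
print(toy_test["Category"].value_counts().to_dict())
```

With stratification, both parts keep the 80/20 ratio of ham to spam from the toy data.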


### 3. Tokenisation of messages

Before the texts can be passed to a HuggingFace model, they must be converted into tokens. We use the tokeniser that belongs to a pre-trained model such as BERT or DistilBERT, both of which are available from HuggingFace.

In [None]:
# Tokeniser that prepares texts for input into BERT models by converting them into token IDs
from transformers import BertTokenizer

# Load the tokeniser that belongs to 'bert-base-uncased', a pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Helper for tokenising a batch of messages (not used below; the data is tokenised directly)
def tokenize_function(examples):
    return tokenizer(examples['Message'], padding="max_length", truncation=True)

# Tokenise the training and test set: messages longer than 512 tokens are truncated,
# shorter ones are padded to the longest message in the set
train_encodings = tokenizer(list(train_df['Message']), truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(list(test_df['Message']), truncation=True, padding=True, max_length=512)
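
To make the `truncation` and `padding` options more concrete, the following sketch mimics them with a toy whitespace tokeniser. It is a simplified illustration, not BERT's WordPiece algorithm: the vocabulary, the ID values and the function name `toy_encode` are invented for this example. It only shows how truncation and padding produce `input_ids` of equal length, and how the `attention_mask` marks real tokens with 1 and padding with 0.

```python
PAD_ID = 0  # invented ID for the padding token
UNK_ID = 1  # invented ID for unknown words

def toy_encode(texts, vocab, max_length):
    """Encode texts as equal-length ID lists plus attention masks."""
    # Split on whitespace, look up IDs, and truncate to max_length
    ids = [[vocab.get(w, UNK_ID) for w in t.lower().split()][:max_length] for t in texts]
    # Pad every sequence to the longest one in the batch (like padding=True)
    longest = max(len(seq) for seq in ids)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

vocab = {"free": 2, "entry": 3, "ok": 4, "win": 5, "a": 6, "prize": 7}
batch = toy_encode(["Free entry win a prize now", "Ok"], vocab, max_length=4)
print(batch["input_ids"])       # [[2, 3, 5, 6], [4, 0, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 0, 0, 0]]
```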


### 4. Creating PyTorch datasets

The HuggingFace Trainer works with PyTorch dataset objects. We can create such a dataset from the tokenised data.

In [None]:
# PyTorch, used for working with tensors (multidimensional arrays) and for building datasets and models
import torch

# Custom subclass of torch.utils.data.Dataset, the base class PyTorch uses for datasets
class EmailDataset(torch.utils.data.Dataset):

    # Saves the tokenised texts (encodings) and labels (spam/ham)
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # Returns the example (text and label) at the specified position (idx) for use in a mini-batch
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    # Returns the number of elements in the dataset, which is necessary for the mini-batch creation
    def __len__(self):
        return len(self.labels)

# Create dataset
train_dataset = EmailDataset(train_encodings, list(train_df['Category']))
test_dataset = EmailDataset(test_encodings, list(test_df['Category']))
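
During training, mini-batches are drawn from these dataset objects. The sketch below shows the same mechanism with a plain `torch.utils.data.DataLoader` on hand-written toy encodings. The ID values are invented, and the class is repeated only so that the snippet runs on its own.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class EmailDataset(Dataset):
    """Same structure as the class above, repeated to keep the sketch self-contained."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Toy encodings for three already padded messages (IDs are invented)
toy_encodings = {
    'input_ids': [[101, 7, 102, 0], [101, 8, 9, 102], [101, 5, 102, 0]],
    'attention_mask': [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_dataset = EmailDataset(toy_encodings, [0, 1, 0])

# The default collate function stacks the per-example tensors into batch tensors
loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)
first_batch = next(iter(loader))
print({key: tuple(val.shape) for key, val in first_batch.items()})
```

Each batch is a dictionary with the keys `input_ids`, `attention_mask` and `labels`, which are the keyword arguments the model is called with.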

### 5. Load model and configure training

Now we can use a model like BertForSequenceClassification from HuggingFace, which is suitable for text classification.

In [4]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load model for binary classification (2 classes: Spam and Ham)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define training parameters
training_args = TrainingArguments(
    output_dir='./results',          # Directory for model output
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=8,   # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    warmup_steps=500,                # Number of steps for learning-rate warm-up
    weight_decay=0.01,               # Weight decay (regularisation)
    logging_dir='./logs',            # Directory for logs
    logging_steps=10,                # Log the training loss every 10 steps
)

# Create a trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Start training
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1674 [00:00<?, ?it/s]

{'loss': 0.7299, 'grad_norm': 10.679104804992676, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.02}
{'loss': 0.6885, 'grad_norm': 9.798615455627441, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.04}
{'loss': 0.6448, 'grad_norm': 9.575231552124023, 'learning_rate': 3e-06, 'epoch': 0.05}
{'loss': 0.5767, 'grad_norm': 8.170449256896973, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.07}
{'loss': 0.5026, 'grad_norm': 2.919818878173828, 'learning_rate': 5e-06, 'epoch': 0.09}
{'loss': 0.3693, 'grad_norm': 5.2771711349487305, 'learning_rate': 6e-06, 'epoch': 0.11}
{'loss': 0.3051, 'grad_norm': 4.223723411560059, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.13}
{'loss': 0.1915, 'grad_norm': 2.7134225368499756, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.14}
{'loss': 0.1833, 'grad_norm': 6.674505710601807, 'learning_rate': 9e-06, 'epoch': 0.16}
{'loss': 0.0952, 'grad_norm': 3.6974830627441406, 'learning_rate': 1e-05, 'epoch': 0.18}
{'loss': 0.0788, 'grad_norm': 1.25

TrainOutput(global_step=1674, training_loss=0.05949286560165417, metrics={'train_runtime': 1321.2296, 'train_samples_per_second': 10.12, 'train_steps_per_second': 1.267, 'total_flos': 1635347236816440.0, 'train_loss': 0.05949286560165417, 'epoch': 3.0})
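
The numbers in the log can be reproduced by hand. The sketch below assumes training on a single device without gradient accumulation, the default learning rate of 5e-5 and the default linear schedule (linear warm-up followed by linear decay). None of these are set explicitly in the cell above, so treat them as assumptions. The function name `linear_schedule_lr` is invented for this example.

```python
import math

train_samples, batch_size, epochs = 4457, 8, 3
steps_per_epoch = math.ceil(train_samples / batch_size)   # 558
total_steps = steps_per_epoch * epochs                    # 1674, as in the progress bar

def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total=total_steps):
    """Learning rate after `step` optimizer steps: linear warm-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total - step) / (total - warmup_steps))

for step in (10, 100, 500, 1674):
    print(step, linear_schedule_lr(step))
```

Under these assumptions, the values for steps 10 and 100 agree with the logged learning rates of about 1e-06 and 1e-05.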

### 6. Evaluate the model

After training, we evaluate the model on the test data to check its performance.


In [None]:
# Evaluate test data
results = trainer.evaluate()

print("Evaluation Results:", results)

  0%|          | 0/70 [00:00<?, ?it/s]

Evaluation Results: {'eval_loss': 0.06995370239019394, 'eval_runtime': 17.525, 'eval_samples_per_second': 63.624, 'eval_steps_per_second': 3.994, 'epoch': 3.0}
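
The result above contains only the loss and runtime statistics, because no metric function was passed to the `Trainer`. One possible way to also report accuracy and F1 score is a `compute_metrics` function, sketched below with scikit-learn. It assumes the evaluation predictions arrive as a pair of arrays (logits, labels); the toy arrays at the end are invented and only demonstrate the call.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Turn (logits, labels) into accuracy and F1 for the spam class (label 1)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions),
    }

# Toy check with invented logits for four messages (true labels: ham, spam, spam, ham)
toy_logits = np.array([[2.0, -1.0], [0.1, 1.5], [1.2, 0.3], [3.0, -2.0]])
toy_labels = np.array([0, 1, 1, 0])
print(compute_metrics((toy_logits, toy_labels)))
```

Such a function would be passed as `compute_metrics=compute_metrics` when creating the `Trainer`, so that `trainer.evaluate()` reports the additional values alongside the loss.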


### 7. Save and load model

After training, we can save the model and the tokeniser to disk and load them again later for reuse.


In [6]:
# Save model
model.save_pretrained('./spam_classifier_model')
tokenizer.save_pretrained('./spam_classifier_model')

# Load model
model = BertForSequenceClassification.from_pretrained('./spam_classifier_model')
tokenizer = BertTokenizer.from_pretrained('./spam_classifier_model')

### 8. Make predictions

Finally, we can use the trained model to make predictions. Here is an example of how we could do this for a new email.

In [None]:
def predict(message):
    # return_tensors='pt' makes the tokenizer return PyTorch tensors (instead of Python lists), which is the format the model expects
    inputs = tokenizer(message, return_tensors='pt', truncation=True, padding=True, max_length=512)

    # Call the model on the tokenised inputs; torch.no_grad() disables gradient tracking, which is not needed for inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Logits are the raw, unnormalised scores the model assigns to each class (a higher score means the class is more likely)
    logits = outputs.logits

    # argmax returns the index of the largest logit, i.e. the class that the model considers most likely
    prediction = torch.argmax(logits, dim=-1).item()
    return 'spam' if prediction == 1 else 'ham'

# Example 1
message1 = "Congratulations! You've won a free vacation!"
print("Message 1 is: " + predict(message1))

# Example 2
message2 = "Please call me back, in love, Tracy."
print("Message 2 is: " + predict(message2))

# Example 3
message3 = "Hey Honey, we've won a free trip to Las Vegas! Isn't this awesome? Call me back soon, so we can discuss our planings for the trip."
print("Message 3 is: " + predict(message3))

# Example 4
message4 = "Hey Honey, I miss you, last night was soooo unbelievable. If you need real love, call 123456789. Now! I'am waiting for u."
print("Message 4 is: " + predict(message4))

# Example 5
message5 = "Hi Mom. Here is my new whatsapp number. Please call me back today. Here is my number: 33445678789. I'am your child Clara."
print("Message 5 is: " + predict(message5))

# Example 6
message6 = "Yor delifery is larte. Your DHL pakage wil come later. Please follo the link here: https://www.link.com for mor details."
print("Message 6 is: " + predict(message6))

# Example 7
message7 = "Please call 911, I am in danger. Jutta."
print("Message 7 is: " + predict(message7))

# Example 8
message8 = "YOU WON A FREE TRIP TO LAS VEGAS!"
print("Message 8 is: " + predict(message8))

# Example 9
message9 = "https://www.hotshit.com!"
print("Message 9 is: " + predict(message9))

# Example 10
message10 = "Hi, my name is Max and I am 10 years old. I want to be a trainee in your company."
print("Message 10 is: " + predict(message10))

Message 1 is: spam
Message 2 is: ham
Message 3 is: ham
Message 4 is: spam
Message 5 is: spam
Message 6 is: spam
Message 7 is: ham
Message 8 is: spam
Message 9 is: spam
Message 10 is: ham
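
The labels above do not show how confident the model is. Applying a softmax to the logits turns them into probabilities that sum to 1. The sketch below uses a hand-written logits tensor in place of `outputs.logits`; the values and the helper name `label_with_confidence` are invented for illustration.

```python
import torch

def label_with_confidence(logits):
    """Return (label, probability) for a logits tensor of shape (1, 2)."""
    probs = torch.softmax(logits, dim=-1)          # normalise scores to probabilities
    prediction = torch.argmax(probs, dim=-1).item()
    label = 'spam' if prediction == 1 else 'ham'
    return label, probs[0, prediction].item()

toy_logits = torch.tensor([[-2.0, 3.0]])           # invented values
label, confidence = label_with_confidence(toy_logits)
print(f"{label} ({confidence:.3f})")
```

A probability close to 0.5 would indicate a borderline message that may deserve a manual check.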
