


![Spam_Email.jpg](https://www.nelsonslaw.co.uk/wp-content/uploads/2024/03/iStock-902015458-970x546.jpg)
Image credits: https://www.nelsonslaw.co.uk/wp-content/uploads/2024/03/iStock-902015458-970x546.jpg

## We recommend using the Colab GPU.

You can train neural networks much faster with a dedicated graphics processing unit (GPU). Fortunately, Google Colab offers free access to GPU instances. We recommend using them for quicker training.

Before starting your instance, go to <code>Runtime</code>→<code>Change runtime type</code> in the top menu. Select the available GPU for <code>Hardware accelerator</code> and click <code>Save</code>. The following bash command will show information about the GPU allocated to you.


In [None]:
!nvidia-smi
#Check allocated GPU

# Introduction to Spam Email Classification

Have you ever wondered how your email client identifies and filters spam emails? Why do some emails end up in the 'junk' folder?

In this assignment, we will explore spam email classification using Convolutional Neural Networks (CNNs).

Such classifiers are needed because the internet is flooded with spam emails that can be annoying, deceptive, or even harmful. Effective spam classification improves user experience and protects against threats. Traditional methods, like rule-based filters and Bayesian classifiers, often struggle to keep up with spammers' changing tactics. This is where machine learning, especially CNNs, becomes useful. By learning patterns from a large dataset of labeled emails, CNNs can learn to classify emails accurately.


# **Section 1** Preprocessing the emails for training (3 tasks to complete)

## Data loading

First, we load the email dataset into the Colab runtime. Upload the JSON file ```full_spam_dataset.json``` into the ```sample_data``` folder in the runtime. You can access the runtime folder by clicking the small folder icon on the left panel. Keep in mind that the uploaded file will be lost when your runtime disconnects. Make sure to upload it again after starting a new runtime.

In [2]:
DATA_PATH = 'sample_data/full_spam_dataset.json'

## Step 0: Clean the emails

The first step is to clean the email text by removing punctuation and converting all characters to lowercase. These steps are already completed for you below. We strongly recommend that you **do not modify** this cell, as changes could affect the outputs of your code in later parts of the assignment.
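
To make the effect of cleaning concrete, here is a minimal sketch that applies the same kind of regular-expression steps to a made-up email. The helper name `clean_email_demo` and the sample text are assumptions for illustration only, and the sketch leaves out the escape-sequence step; the actual `clean_email` function is defined in the cell below.

```python
import re
from html import unescape

def clean_email_demo(text):
    """Toy version of the cleaning steps, for illustration only."""
    text = unescape(text.lower())            # lowercase, then decode entities such as &amp;
    text = re.sub(r'<[^>]+>', ' ', text)     # replace HTML tags with a space
    text = re.sub(r'http\S+', '', text)      # drop URLs
    text = re.sub(r'[^a-z ]', '', text)      # keep only lowercase letters and spaces
    text = re.sub(' +', ' ', text)           # collapse runs of spaces
    return text.strip()

raw = "Win a <b>FREE</b> prize &amp; more! Visit http://spam.example.com now 100%"
cleaned = clean_email_demo(raw)
print(cleaned)  # win a free prize more visit now
```

Note that digits and punctuation disappear entirely, so only plain lowercase words reach the later steps.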

In [None]:
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
#-# STEP 0:  Clean email.  Remove punctuations, caps, HTML entities #-#
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#

import sys
import subprocess
import warnings
import os
from collections import Counter
from html import unescape
import re
import numpy as np

def clean_email(text):
    text = text.lower()  # Convert to lowercase
    text = unescape(text)  # Decode HTML entities like &amp; or &lt;
    text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML tags
    text = re.sub(r'\\[a-z]', ' ', text)  # Remove any escape sequences like \n, \t
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'[^a-z ]', '', text)  # Remove special characters and numbers
    text = re.sub(' +', ' ', text)  # Replace multiple spaces with single space
    return text.strip()  # Strip leading/trailing spaces

if __name__ == '__main__':
    warnings.filterwarnings("ignore", category=UserWarning) # Mute HuggingFace warnings temporarily
    original_stderr = sys.stderr
    sys.stderr = open('/dev/null', 'w') # Mute Linux dependency stderr temporarily
    try:
        subprocess.run(['pip', 'install', 'datasets'], check=True)
    finally: # Restore stderr and warnings
        sys.stderr.close()
        sys.stderr = original_stderr
        warnings.filterwarnings("default", category=UserWarning)

    from datasets import load_dataset
    print("Loading dataset...")
    ds = load_dataset('json', data_files=DATA_PATH)

    SUBSET_SIZE = 1000  # Define the size of loaded subset
    if 'train' in ds:
        ds = ds['train'].shuffle(seed=42).select(range(SUBSET_SIZE))

        print("Cleaning emails...")
        labels_of_emails = []
        cleaned_emails = []
        for email in ds:
          cleaned_emails.append(clean_email(email['text']))
          labels_of_emails.append(email['label'])

        print("Sample cleaned email:", cleaned_emails[:30])
        print("Sample label:", labels_of_emails[:30])  # 0 means ham (not spam) and 1 means spam
        print("Labels_of_emails length:", len(labels_of_emails))
    else:
        print("Failed to load the dataset.")

    os.makedirs("models", exist_ok=True) # create a new folder "models" to save your model


## Step 1: Create the vocabulary and mapping

Next, use the cleaned text to create the following **four** variables. To keep the vocabulary manageable and filter out rare words, the vocabulary (`vocab`) should include only words that appear **at least 5 times** across `cleaned_emails` (a list of strings where each entry is a cleaned email text).

You will complete this function:
<center>

| function name  | parameter name      | parameter datatype | return type                                                              |
|----------------|---------------------|--------------------|--------------------------------------------------------------------------|
| `create_vocabulary` | `cleaned_emails` | `list[str]`         | `list[str], int, dict[str, int], dict[int, str]`        |

</center>

Which returns the following variables:

<center>

| Variable Name   | Data Type          | Description                                                                                                  |
|-----------------|--------------------|--------------------------------------------------------------------------------------------------------------|
| `vocab`         | `list[str]`         | A sorted list of unique words that appear **5 times or more** in the cleaned text.                              |
| `vocab_size`    | `int`               | The total number of **unique** words in the vocabulary (`vocab`).                                               |
| `word_to_index` | `dict[str, int]`    | A dictionary mapping each unique word in `vocab` to a unique integer index, based on alphabetical order.       |
| `index_to_word` | `dict[int, str]`    | A dictionary mapping each index back to the corresponding word in `vocab`.                                     |

</center>

These variables are used to create a vocabulary model that maps words to indices and allows us to index words back to their original form. This process filters the vocabulary to include only frequently used words, optimizing memory usage and computation efficiency in later steps.
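
The frequency-filtering idea can be sketched on a toy corpus. The names `toy_emails`, `MIN_COUNT` and `frequent_words` are hypothetical, and the threshold of 2 is chosen only so the tiny example has survivors; the assignment itself uses 5.

```python
from collections import Counter

toy_emails = ["free prize free offer", "claim your free prize", "meeting at noon"]
MIN_COUNT = 2  # toy threshold; the assignment uses 5

# Count every word across all emails
word_counts = Counter(word for email in toy_emails for word in email.split())

# Keep words that meet the threshold, in alphabetical order
frequent_words = sorted(w for w, c in word_counts.items() if c >= MIN_COUNT)
print(frequent_words)  # ['free', 'prize']
```

Words such as `offer` or `noon` appear only once here, so they are dropped from the toy vocabulary.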



---


**Example**:

For the cleaned text (assuming each of these words appears at least 5 times in all emails):
> our special offer expires tomorrow dont miss out on this opportunity

The generated variables would be:

| Variable Name   | Content                                                                                                                   |
|-----------------|---------------------------------------------------------------------------------------------------------------------------|
| `vocab`         | `['dont', 'expires', 'miss', 'offer', 'on', 'opportunity', 'our', 'out', 'special', 'this', 'tomorrow']`                  |
| `vocab_size`    | `11`                                                                                                                      |
| `word_to_index` | `{'dont': 0, 'expires': 1, 'miss': 2, 'offer': 3, 'on': 4, 'opportunity': 5, 'our': 6, 'out': 7, 'special': 8, 'this': 9, 'tomorrow': 10}` |
| `index_to_word` | `{0: 'dont', 1: 'expires', 2: 'miss', 3: 'offer', 4: 'on', 5: 'opportunity', 6: 'our', 7: 'out', 8: 'special', 9: 'this', 10: 'tomorrow'}` |

**Note**:

Be aware that an incorrect implementation of this function may lead to inconsistent or inaccurate results in later tasks, which could affect your scores, even if your implementations for those tasks are correct. Please check the provided `sample_output_TODO1.txt` to verify the correctness of your implementation.

In [None]:
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
#-# STEP 1:  Compile corpus, create vocab, vocab_size, word_to_index, index_to_word #-#
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#

def create_vocabulary(cleaned_emails):
    """
    Processes a list of cleaned email texts to generate a vocabulary and associated mappings.

    Args:
        cleaned_emails (list of str): A list where each entry is a cleaned email text.

    Returns:
        vocab (list of str): A **sorted** list of unique words appearing 5 or more times.
        vocab_size (int): The number of unique words in the vocabulary.
        word_to_index (dict of str: int): A dictionary mapping each word in `vocab` to a unique integer index.
        index_to_word (dict of int: str): A dictionary mapping each integer index back to the corresponding word in `vocab`.
    """

    word_to_index = {}
    index_to_word = {}

    ###############################################################################
    # TODO 1: your code starts here
    # hint: you may consider using:
    #   .join():  concatenates a list of strings into a single string.
    #   .split(): splits a string into a list. it splits by whitespace by default.
    #   collections.Counter():  counts the frequency of elements in a list and returns a dictionary-like object.
    #       e.g. you may use something like word_counts = Counter(words)
    #                                   and word_counts[a_word_in_words] >= 5
    #   .sort():  sorts a list in place.
    #   set():    can remove duplicates from a list.
    #   len():    returns the length of a list.
    #   enumerate(): gives both the index and value when iterating over a list.
    #
    # Please make sure your vocab, vocab_size, word_to_index and index_to_word only consider words that appear >= 5 times,
    # or the following steps might become impossible to run because of huge memory consumption.


    # TODO 1: your code ends here
    ###############################################################################

    # do not modify this
    return vocab, vocab_size, word_to_index, index_to_word

# Create the vocabulary
if __name__ == '__main__':
    vocab, vocab_size, word_to_index, index_to_word = create_vocabulary(cleaned_emails)


    # Sample outputs:
    print(f"Number of Unique Words that appear >=5 times (size of the vocabulary): {vocab_size}\n")

    print(f"Case 1")
    print(f"vocab samples: {vocab[:20]}")
    print(f"word_to_index samples: {list(word_to_index.items())[:20]}")
    print(f"index_to_word samples: {list(index_to_word.items())[:20]}\n")

    print(f"Case 2")
    print(f"vocab samples: {vocab[1000:1020]}")
    print(f"word_to_index samples: {list(word_to_index.items())[1000:1020]}")
    print(f"index_to_word samples: {list(index_to_word.items())[1000:1020]}\n")

    print(f"Case 3")
    print(f"vocab samples: {vocab[4000:4020]}")
    print(f"word_to_index samples: {list(word_to_index.items())[4000:4020]}")
    print(f"index_to_word samples: {list(index_to_word.items())[4000:4020]}\n")


## Step 2: Encode the emails

After completing the preparation work in the previous step, you need to create `encode_emails` to encode a list of cleaned email texts into sequences of integers. Each word in an email will be represented by a unique integer index from the `word_to_index` dictionary.

The vocabulary includes words that appear at least 5 times in the corpus, ensuring that only frequent and meaningful words are included, while infrequent or possibly incorrect words are encoded as 0.

To efficiently check if a word is in the vocabulary, consider converting the `vocab` list into a **`set`**. A `set` provides faster lookup than a `list` because it uses a hash table for implementation.
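
A rough way to see the difference is to time the same membership test on a list and on a set. This is only a sketch: the word list is synthetic, the names are hypothetical, and the absolute timings will vary by machine.

```python
import timeit

words = [f"word{i}" for i in range(50_000)]   # hypothetical vocabulary
word_set = set(words)
target = "word49999"                          # worst case for the list: the last element

# Each call repeats the membership test 200 times
list_time = timeit.timeit(lambda: target in words, number=200)
set_time = timeit.timeit(lambda: target in word_set, number=200)
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

The list has to be scanned element by element, while the set locates the word through its hash, so the gap widens as the vocabulary grows.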

You will complete the following function:

<center>

| function name | parameter name            | parameter datatype | return type                      |
|--------------------------|----------------------------------------|-------------------|-------------------------------|
| `encode_emails` | `cleaned_emails`<br> `vocab`<br> `word_to_index` | `list[str]`<br> `list[str]`<br> `dict[str, int]` | `list[list[int]]` |

</center>

Which returns the following variable:

<center>

| variable name             | datatype          | description                                                                          |
|---------------------------|-------------------|--------------------------------------------------------------------------------------|
| `encoded_emails`      | `list[list[int]]` | A list of lists, where each inner list is a sequence of integers representing a cleaned email, with each word mapped to an integer index from `word_to_index`. Words not in `vocab` are assigned a value of `0`.     |

</center>




---


**Example:**

Let's assume you have the following inputs:

| Variable Name   | Content                                                                                                                   |
|-----------------|---------------------------------------------------------------------------------------------------------------------------|
| `cleaned_emails`| `["hello world", "foo email"]`                                                                                           |
| `vocab`         | `["hello", "world", "email"]`                                                                                             |
| `word_to_index` | `{"hello": 1, "world": 2, "email": 3}`                                                                                    |


The function `encode_emails(cleaned_emails, vocab, word_to_index)` will return: `[[1, 2], [0, 3]]`

In this example:
- "hello" is mapped to 1,
- "world" is mapped to 2,
- "foo" is not in the `vocab`, so it is mapped to 0,
- "email" is mapped to 3.

**Note**:

Be aware that an incorrect implementation of this function may lead to inconsistent or inaccurate results in later tasks, which could affect your scores, even if your implementations for those tasks are correct. Please check the provided `sample_output_TODO2.txt` to verify the correctness of your implementation.


In [None]:
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
#-# STEP 2: Encode the emails #-#
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#

def encode_emails(cleaned_emails, vocab, word_to_index):
    """
    Encodes a list of cleaned email texts into sequences of integers.

    Args:
        cleaned_emails (list of str): A list where each entry is a cleaned email text.
        vocab (list of str): List of words that appear at least 5 times in the corpus.
        word_to_index (dict of str: int): A dictionary mapping each word to a unique integer index.

    Returns:
        encoded_emails (list of [list of int]): A list of lists, where each inner list is a sequence of integers representing a cleaned email.
    """
    encoded_emails = []

    ###############################################################################
    # TODO 2: your code starts here
    # hint: you may consider using:
    #   .split(): splits a string into a list. it splits by whitespace by default.
    #   .append(): adds an element to the end of a list.


    # TODO 2: your code ends here
    ###############################################################################

    return encoded_emails


if __name__ == '__main__': # Encode the emails
    encoded_emails = encode_emails(cleaned_emails, vocab, word_to_index)


    # Sample outputs:
    print(f"Case 1")
    print(f"Sample email content: \n{cleaned_emails[0]}")
    print(f"Encoded email: \n{encoded_emails[0]}\n")

    print(f"Case 2")
    print(f"Sample email content: \n{cleaned_emails[341]}")
    print(f"Encoded email: \n{encoded_emails[341]}\n")

    print(f"Case 3")
    print(f"Sample email content: \n{cleaned_emails[923]}")
    print(f"Encoded email: \n{encoded_emails[923]}\n")

## Step 3: Create paddings

Since the CNN requires uniform input length, you will implement a function that pads or truncates sequences of integers representing encoded email texts to a maximum length of **`ENCODE_LENGTH` = 300**. You will create a function that takes a list of encoded email sequences (`encoded_emails`) and outputs a 2D numpy array, where each row is a padded or truncated sequence of integers.

Padding is needed for sequences of varying lengths, while truncation is used for sequences that exceed a certain length. Padding will be performed with zeros added to the right of the sequences, and truncation will remove extra integers from the end. This ensures all sequences have the same fixed length, which is essential for further processing in machine learning models.
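
For a single sequence, the idea can be sketched as follows. The helper `fit_to_length` is hypothetical and handles one sequence at a time; it is meant to illustrate right-padding and tail truncation, not to replace the function you are asked to write.

```python
import numpy as np

def fit_to_length(seq, max_len):
    """Hypothetical helper: right-pad with zeros or cut off the tail."""
    row = np.zeros(max_len, dtype=int)
    kept = seq[:max_len]          # truncation: drop anything past max_len
    row[:len(kept)] = kept        # padding: the remaining positions stay 0
    return row

print(fit_to_length([1, 2, 3], 5))           # [1 2 3 0 0]
print(fit_to_length([4, 5, 6, 7, 8, 9], 5))  # [4 5 6 7 8]
```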

DO NOT MODIFY **ENCODE_LENGTH = 300**, as changing it will result in 0 points for the tasks.

You will complete the following function:

<center>

| function name          | parameter name        | parameter datatype    | return type               |
|------------------------|-----------------------|-----------------------|---------------------------|
| `pad_encoded_emails`   | `encoded_emails`<br> `max_len` | `list[list[int]]`<br> `int` | `np.ndarray` |

</center>

Which returns the following variable:

<center>

| variable name          | datatype              | description                                                                     |
|------------------------|-----------------------|---------------------------------------------------------------------------------|
| `padded_emails`        | `np.ndarray`          | A 2D numpy array of shape `(num_emails, max_len)` where each row is a padded or truncated sequence of integers. |

</center>


---

**Example**:

Given the example inputs `encoded_emails` and `max_len` below, the resulting `padded_emails` would look like this:


| variable name          | datatype              | example                                                                         |
|------------------------|-----------------------|---------------------------------------------------------------------------------|
| `encoded_emails`       | `list[list[int]]`     | `[[1, 2, 3], [4, 5, 6, 7, 8], [9]]`                                             |
| `max_len`              | `int`                 | `4`                                                                             |
| `padded_emails`        | `np.ndarray`          | `[[1, 2, 3, 0], [4, 5, 6, 7], [9, 0, 0, 0]]`                                    |


**Note**:

Be aware that an incorrect implementation of this function may lead to inconsistent or inaccurate results in later tasks, which could affect your scores, even if your implementations for those tasks are correct. Please check the provided `sample_output_TODO3.txt` to verify the correctness of your implementation.

In [None]:
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
#-# STEP 3: Create paddings (and truncations) #-#
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#

def pad_encoded_emails(encoded_emails, max_len = 300):
    """
    Pads or truncates encoded email sequences to a specified length using NumPy.

    Args:
        encoded_emails (list of list of int): A list of lists, where each inner list is a sequence of integers representing a cleaned email.
        max_len (int): The maximum length of the sequences after padding/truncation.

    Returns:
        padded_emails (numpy array of shape (num_emails, max_len)): A 2D numpy array containing the padded/truncated email sequences.
    """
    # Initialize an array of zeros with shape (number of emails, max_len)
    padded_emails = np.zeros((len(encoded_emails), max_len), dtype=int)

    ###############################################################################
    # TODO 3: your code starts here
    # hint: you may consider using:
    #   enumerate(): gives both the index and value when iterating over a list.

    # TODO 3: your code ends here
    ###############################################################################

    return padded_emails

if __name__ == '__main__': # Pad/Truncate the encoded emails
    ENCODE_LENGTH = 300 # DO NOT MODIFY THIS. the ENCODE_LENGTH variable should be 300.
    padded_emails = pad_encoded_emails(encoded_emails, max_len=ENCODE_LENGTH)

    # Sample outputs:
    print(f"Shape of padded emails: {padded_emails.shape}\n")

    print(f"Case 1")
    print(f"Sample padded email: {padded_emails[0]}\n")

    print(f"Case 2")
    print(f"Sample padded email: {padded_emails[821]}\n")

    print(f"Case 3")
    print(f"Sample padded email: {padded_emails[345]}\n")





# **Section 2** Compiling and training your spam email classification model (2 tasks to complete)

## Step 4: Build the CNN model

After preprocessing your data, you will build a Convolutional Neural Network (CNN) for text classification. The model will process input text sequences that have been converted into numerical representations through previous encoding and padding.

The model-building function takes two parameters:
*  `vocab_size`: The size of the vocabulary in your dataset, which sets the input dimension for the embedding layer.
*  `input_length`: The length of the input sequences, which is equal to `ENCODE_LENGTH` from step 3.


## What is an Embedding?

In language processing, an **embedding** transforms words from discrete indices into dense vectors that capture semantic relationships. This is important because raw integer representations lack meaning—words like 'urgent' and 'important' may have different indices but should be treated similarly.

An **embedding** layer addresses this by learning a semantic representation of words during training, converting each word index into a dense, low-dimensional vector (embedding). These vectors encode semantic information, so words with similar meanings will have similar vector representations. This is especially important in text classification tasks, as understanding word relationships helps the model make better predictions.

The embedding layer in your model will convert these integer sequences into meaningful dense vectors. This prepares the data for further processing by the convolutional layers, which will extract relevant features for classification and enable the model to learn patterns and relationships between words.
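
Conceptually, an embedding layer is a lookup table with one row per word index. The NumPy sketch below illustrates only the lookup; the table is filled with random numbers and the sizes are hypothetical, whereas in the real layer the rows are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_dim = 6, 4                                 # hypothetical sizes
embedding_table = rng.normal(size=(toy_vocab_size, toy_dim))   # one row per word index

encoded_email = np.array([3, 1, 0, 0])     # a padded sequence of word indices
vectors = embedding_table[encoded_email]   # row lookup: one vector per position
print(vectors.shape)  # (4, 4)
```

Positions that share an index (the two trailing zeros here) receive the same vector, which is why every occurrence of a word is represented identically.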

Refer to the documentation below for the key layers you may want to use when implementing the model:
*   [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding): Converts integer sequences to dense vectors of fixed size.
<!-- *   [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): Captures sequential dependencies in the data. -->
*   [Conv1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D): 1D convolutional layer to detect local patterns.
<!-- *   [BatchNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization): Normalizes the output of a previous layer. -->
*   [MaxPooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MaxPooling1D): Downsamples the input representation.
*   [Flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten): Flattens the input to prepare for dense layers.
*   [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense): Fully connected layer to perform classification.
*   [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout): Regularization layer to prevent overfitting.



**Hint**:

1.   Your model should begin with an [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer, which maps each integer word index produced by preprocessing to a dense vector. Set the `output_dim` parameter to 300.

<!-- 2.   A [`LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) layer is expected after the embedding layer to capture the sequential information in the email. -->

2.   In this assignment, a convolutional 'block' should consist of a ```Conv1D``` layer followed by a ```MaxPooling1D``` layer. You can stack multiple blocks to deepen the model (hence the term 'deep learning'). Typically, as you progress through the blocks, the input size decreases (due to the pooling layer in the previous block) while the number of filters in the convolutional layer increases.

3.   Since this is a binary classification task, the final layer of your model should have the appropriate dimensions and use a ```sigmoid``` activation function.
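
To see how the sequence length shrinks from block to block, the sketch below tracks it by hand. It assumes the Keras defaults (`Conv1D` with stride 1 and `'valid'` padding, `MaxPooling1D` with `'valid'` padding and strides equal to `pool_size`); the filter counts, kernel sizes and pool sizes are hypothetical choices, not a recommended architecture.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and strides equal to pool_size."""
    return length // pool_size

length = 300  # ENCODE_LENGTH
# Each tuple is (filters, kernel_size, pool_size) for one hypothetical block
for block, (filters, kernel_size, pool_size) in enumerate([(32, 5, 2), (64, 5, 2)], start=1):
    length = maxpool1d_out_len(conv1d_out_len(length, kernel_size), pool_size)
    print(f"block {block}: {filters} filters, sequence length {length}")
# block 1: 32 filters, sequence length 148
# block 2: 64 filters, sequence length 72
```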


You will complete the following function:
<center>

| function name  | parameter name            | parameter datatype | return type               |
|----------------|---------------------------|---------------------|---------------------------|
| `build_model`  | `vocab_size`<br> `input_length` | `int`<br> `int` | `Sequential`              |

</center>

Which returns the following variable:

<center>

| variable name             | datatype          | description                                                                 |
|---------------------------|-------------------|-----------------------------------------------------------------------------|
| `model`                   | `Sequential`      | A CNN model ready to be compiled and trained on text data (see Step 5).     |

</center>

**Optional Task**:

For sequential data like text, [Recurrent Neural Networks (RNN)](https://keras.io/api/layers/recurrent_layers) are often a natural fit. While we do not cover the details of RNNs, you are encouraged to study them on your own and implement them for this task.

You **must** ensure your final model has fewer than 10,000,000 parameters (1e7 / 10 million); otherwise, you will receive 0 points for all model-related tasks. For your reference, a model with approximately 2 million parameters can achieve over 90% accuracy on the test set.
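
A back-of-the-envelope parameter count can help you stay under the limit before calling `model.summary()`. The sketch below uses the standard counting formulas for embedding, 1D convolution and dense layers; the vocabulary size of 5,000 and the layer sizes (including a flattened output of 72 positions by 64 filters) are hypothetical, so substitute your own values.

```python
def embedding_params(vocab_size, output_dim):
    # One vector of length output_dim per vocabulary entry
    return vocab_size * output_dim

def conv1d_params(in_channels, filters, kernel_size):
    # kernel_size x in_channels weights per filter, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    # One weight per input-unit pair, plus one bias per unit
    return in_features * units + units

toy_vocab_size = 5000  # hypothetical; use your own vocab_size
total = (
    embedding_params(toy_vocab_size, 300)   # 1,500,000
    + conv1d_params(300, 32, 5)             # 48,032
    + conv1d_params(32, 64, 5)              # 10,304
    + dense_params(72 * 64, 1)              # 4,609
)
print(f"{total:,}")  # 1,562,945
```

In this toy configuration the embedding layer accounts for most of the parameters, so the vocabulary size and `output_dim` dominate the budget.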

In [None]:
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
#-# STEP 4:  Build the model  #-#
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
SEED = 2211
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense, Dropout, Input
from tensorflow.keras.utils import set_random_seed
set_random_seed(SEED)


def build_model(vocab_size, input_length):
    """
    Build the CNN model for binary classification of spam email.

    Args:
        vocab_size (int): The size of the vocabulary.
        input_length (int): The length of input sequences.

    Returns:
        Sequential: The CNN model (compiled later, in Step 5).
    """

    model = Sequential()

    model.add(Input((input_length,),batch_size=32))

    ###############################################################################
    # TODO 4: your code starts here

    # TODO 4: your code ends here
    ###############################################################################

    return model

if __name__ == "__main__":
    # Split the data into training and test sets. Do not modify the "random_state" parameter, as doing so will give an entirely different split.
    X_train, X_test, y_train, y_test = train_test_split(padded_emails, labels_of_emails, test_size=0.3, random_state=SEED)
    X_train = np.array(X_train)
    print(np.sum(X_train[0]))
    print(np.sum(X_test[0]))
    X_test = np.array(X_test)
    y_train = np.array(y_train)
    y_test = np.array(y_test)
    print(f"Training data shape: {X_train.shape}")
    print(f"Test data shape: {X_test.shape}")
    print(f"Training labels shape: {len(y_train)}")
    print(f"Test labels shape: {len(y_test)}")

    # create the model
    model = build_model(vocab_size, X_train.shape[1]) # (vocab_size, 300)
    model.summary()  # summary() prints the table itself; wrapping it in print() would also print "None"

## Step 5: Compile and Train the model

It's time to train your model! Now you need to **compile** and **train** the CNN model returned by `build_model(vocab_size, input_length)`.

**Overview of Steps:**

1. **Model Compilation**:
   - In this step, you will compile the CNN model using the [model.compile( )](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile) method.
     - **Optimizer**: This controls the learning process. You will use the `Adam` optimizer.
     - **Loss Function**: This measures how well the model's predictions match the true labels. For binary classification, `binary_crossentropy` is appropriate. (Use `categorical_crossentropy` for multi-class classification.)
     - **Metrics**: These evaluate the model's performance during training. You will use the `accuracy` metric to track the proportion of correctly classified instances.

2. **Model Training**:
   - After compiling the model, you will train it using the [model.fit( )](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method.
      - This involves providing the model with training data and labels (`X_train` and `y_train`) for several epochs. An epoch represents one complete pass through the training dataset.
      - **Set `epochs` = 25** and feel free to adjust the `batch_size`. A good starting point is to set `batch_size = 32`.
      - You will also use `validation_data` (`X_test` and `y_test`) to monitor the model's performance after each epoch, which helps you detect overfitting.
      - Be sure to **include** `callbacks=[model_checkpoint]` in your code.

3. **Callbacks**:
   - Callbacks are tools provided by Keras to customize the training process. The `model_checkpoint` callback is provided for you, and you need to include it in your `model.fit()`.
     - **ModelCheckpoint**: This saves the best version of the model during training based on its performance on the validation set.
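
To build some intuition for the loss function mentioned above, binary cross-entropy can be computed by hand with NumPy. This is a minimal sketch: the function below is a hypothetical stand-in written for illustration, the `eps` clipping is an assumption added to avoid `log(0)`, and the predictions are made-up numbers.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly right and sure about it
unsure = [0.6, 0.4, 0.5, 0.5]      # right side of 0.5 at best, but hesitant

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)
print(loss_confident, loss_unsure)
```

Predictions that are both correct and confident give a lower loss than hesitant ones, which is the signal the optimizer follows during training.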


In [None]:
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
#-# STEP 5:  Compile and Train the model  #-#
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#

import os
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint


def compile_and_train_model(model, X_train, y_train, X_test, y_test,
                            weights_save_path='model_pa2_cnn.keras',
                            epochs=25, batch_size=32):
    """
    Compiles and trains the given model.

    Parameters:
    - model: The Keras model to compile and train.
    - X_train: Training data features.
    - y_train: Training data labels.
    - X_test: Validation data features.
    - y_test: Validation data labels.
    - weights_save_path: Path to save the best model weights (default is 'model_pa2_cnn.keras').
    - epochs: Number of training epochs (default is 25).
    - batch_size: Batch size for training (default is 32).

    Returns:
    - history: Training history object.
    - model: The trained Keras model.
    """

    # Set up ModelCheckpoint callback
    model_checkpoint = ModelCheckpoint(filepath=weights_save_path,
                                       save_best_only=True,
                                       monitor='val_loss',
                                       verbose=1)

    ###############################################################################
    # TODO 5: your code starts here
    # hint: you may consider using:
    #   model.compile(optimizer(you can set the learning rates here), loss, metrics)
    #   history = model.fit(X_train, y_train, epochs, batch_size, validation_data, callbacks=[model_checkpoint])



    # TODO 5: your code ends here
    ###############################################################################

    # Evaluate the model
    loss, accuracy = model.evaluate(X_test, y_test)
    print(f"Test Accuracy: {accuracy * 100:.2f}%")

    return history, model

if __name__ == '__main__':
    # Call the function to compile, train, evaluate, and save the model
    history, trained_model = compile_and_train_model(model, X_train, y_train, X_test, y_test)

# **Section 3** Visualising the outcome (no task to complete)

Now that the model is trained, use the following code to evaluate and visualize your results!

In [None]:
import matplotlib.pyplot as plt
import tensorflow.keras as keras

def visualize_hist(history):
    plt.figure(figsize=(8, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    model = keras.models.load_model('model_pa2_cnn.keras')
    val_loss, val_accuracy = model.evaluate(X_test, y_test, verbose=1)
    print(f'Validation accuracy: {val_accuracy}')
    visualize_hist(history)

# **Appendix** (no tasks to complete)

## Evolution of Spam Detection
Spam detection has evolved significantly over the years, resembling an ongoing chess match between clever spammers and vigilant defenders of our inboxes. In this section, we will discuss some earlier spam detection methods that existed before the rise of current deep learning-based approaches.

### Early Days: Rule-Based Systems
Initially, simple rule-based systems served as our first line of defense. These systems used predefined rules to filter out spam. For instance, an email containing the word 'free' could be flagged as spam. Although easy to implement, these systems were soon outsmarted by clever spammers who constantly changed their tactics. It became a never-ending game of 'whack-a-mole.'

In [None]:
def is_spam(email):
    # Define spam keywords
    spam_keywords = ['free', 'win', 'prize', 'click here']

    # Lowercase the email once, then flag it if any keyword occurs as a substring
    text = email.lower()
    for word in spam_keywords:
        if word in text:
            return True
    return False

if __name__ == '__main__':
    email = "Congratulations! Click here to win a free prize!"
    print(is_spam(email))  # Output: True
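
The 'whack-a-mole' problem is easy to reproduce. The sketch below repeats the keyword rule from the cell above (so that it runs on its own) and feeds it two made-up emails: an obfuscated spam message that slips through, and a legitimate message that is flagged because `'win'` is a substring of `'window'`.

```python
def is_spam(email):
    spam_keywords = ['free', 'win', 'prize', 'click here']
    text = email.lower()
    return any(keyword in text for keyword in spam_keywords)

# Hypothetical examples, not taken from the dataset
evasive_spam = "Cl1ck h3re to w1n a fr3e pr1ze!"
legit_email = "The window in meeting room 2 is stuck again."

print(is_spam(evasive_spam))  # False: the obfuscated spelling evades every keyword
print(is_spam(legit_email))   # True: 'win' matches inside 'window'
```

Both failure modes come from the same source: the rule looks at fixed character sequences rather than at what the email means.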

### Bayesian Classifiers: A Smarter Approach

Next, Bayesian classifiers emerged, introducing a more adaptive strategy. Using probabilistic models, these classifiers analyzed word frequencies to assess the likelihood of an email being spam. As statistician John Tukey noted:

> "The greatest value of a picture is when it forces us to notice what we never expected to see."

Bayesian classifiers enabled us to identify patterns and probabilities in text, providing a more sophisticated method for combating spam.
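
Concretely, the naive Bayes variant implemented in the next cell scores an email with words $w_1, \dots, w_n$ for each class $c$ as follows (the notation is ours, chosen to mirror the quantities the code computes):

$$
\text{score}(c) = \log P(c) + \sum_{i=1}^{n} \log \frac{\text{count}(w_i, c) + 1}{N_c + |V|}, \qquad c \in \{\text{spam}, \text{ham}\}
$$

Here $P(c)$ is the fraction of training emails labelled $c$, $\text{count}(w, c)$ is how often word $w$ occurs in emails of class $c$, $N_c$ is the total number of words in that class, and $|V|$ is the size of the combined vocabulary. The email is labelled spam when $\text{score}(\text{spam}) > \text{score}(\text{ham})$. The $+1$ in the numerator is Laplace smoothing: it keeps a word that never appeared in a class from producing $\log 0$.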

In [None]:
from collections import Counter
import math

def train_bayes(emails, labels):
    # Initialize counters for spam and ham words
    spam_words = Counter()
    ham_words = Counter()
    spam_count = 0
    ham_count = 0

    # Count words in spam and ham emails
    for email, label in zip(emails, labels):
        words = email.lower().split()
        if label == 'spam':
            spam_words.update(words)
            spam_count += 1
        else:
            ham_words.update(words)
            ham_count += 1

    return spam_words, ham_words, spam_count, ham_count

def predict_bayes(email, spam_words, ham_words, spam_count, ham_count):
    words = email.lower().split()

    # Calculate prior probabilities
    p_spam = spam_count / (spam_count + ham_count)
    p_ham = ham_count / (spam_count + ham_count)

    # Word totals and vocabulary size, needed for Laplace smoothing
    total_spam_words = sum(spam_words.values())
    total_ham_words = sum(ham_words.values())
    vocab = set(spam_words) | set(ham_words)
    vocab_size = len(vocab)

    # Log-likelihoods are accumulated as sums of logs to avoid numerical underflow
    spam_likelihood = 0
    ham_likelihood = 0

    # Calculate log-likelihoods with Laplace smoothing
    for word in words:
        spam_likelihood += math.log((spam_words[word] + 1) / (total_spam_words + vocab_size))
        ham_likelihood += math.log((ham_words[word] + 1) / (total_ham_words + vocab_size))

    # Add the log priors to obtain (unnormalized) log-posterior scores
    p_spam_given_email = spam_likelihood + math.log(p_spam)
    p_ham_given_email = ham_likelihood + math.log(p_ham)

    # Predict the class with the higher score
    return 'spam' if p_spam_given_email > p_ham_given_email else 'ham'

if __name__ == '__main__':
    # Training data
    emails = ["Win a free prize now", "Meeting at 10am", "Click here to claim your prize"]
    labels = ["spam", "ham", "spam"]

    # Train the model
    spam_words, ham_words, spam_count, ham_count = train_bayes(emails, labels)

    # Test the model
    email = "Free prize for you"
    print(predict_bayes(email, spam_words, ham_words, spam_count, ham_count))  # Output: spam
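
To see why the toy example prints `spam`, the two scores can be turned into a normalized probability. The sketch below recomputes the counts for the same three training emails in a compact form (it does not call the functions above, so it runs on its own) and normalizes the two log scores; the helper name `spam_probability` is ours.

```python
from collections import Counter
import math

def spam_probability(email, emails, labels):
    """Posterior P(spam | email) under naive Bayes with Laplace smoothing."""
    counts = {'spam': Counter(), 'ham': Counter()}
    n_docs = Counter(labels)
    for text, label in zip(emails, labels):
        counts[label].update(text.lower().split())
    vocab_size = len(set(counts['spam']) | set(counts['ham']))

    scores = {}
    for label in ('spam', 'ham'):
        total = sum(counts[label].values())
        score = math.log(n_docs[label] / len(labels))  # log prior
        for word in email.lower().split():
            score += math.log((counts[label][word] + 1) / (total + vocab_size))
        scores[label] = score

    # Normalize the two log scores into a probability for the spam class
    return 1 / (1 + math.exp(scores['ham'] - scores['spam']))

emails = ["Win a free prize now", "Meeting at 10am", "Click here to claim your prize"]
labels = ["spam", "ham", "spam"]
print(round(spam_probability("Free prize for you", emails, labels), 2))  # 0.7
```

The words 'free' and 'prize' pull the result towards spam, while the unseen words 'for' and 'you' slightly favour ham, because the ham class has fewer total words and so a larger smoothed probability for unseen words.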

### Classical Machine Learning: Enter SVMs and Random Forests
The advent of machine learning marked a significant leap forward in spam detection. Algorithms such as Support Vector Machines (SVMs) and Random Forests gained popularity for their ability to handle large datasets and achieve high accuracy. However, they required extensive feature engineering: the meticulous process of identifying and creating relevant features from raw data.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

if __name__ == '__main__':
    emails = ["Win a free prize now", "Meeting at 10am", "Click here to claim your prize"]
    labels = [1, 0, 1]  # 1 for spam, 0 for ham

    # Turn each email into a vector of word counts (bag of words)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    # Train a Support Vector Classifier (scikit-learn's SVM classifier) on those vectors
    model = SVC()
    model.fit(X, labels)

    email = "Free prize for you"
    email_vec = vectorizer.transform([email])
    print(model.predict(email_vec)[0])  # Output: 1 (spam)
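
The feature-engineering step in this cell is the `CountVectorizer` call, which turns each email into a vector of word counts. As a rough illustration of what that representation looks like, here is a standard-library approximation. It mimics the vectorizer's default behaviour as we understand it (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary), but it is not scikit-learn's implementation, and the helper names are ours.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_count_vector(text, vocabulary):
    """Count known tokens; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]

emails = ["Win a free prize now", "Meeting at 10am", "Click here to claim your prize"]
vocabulary = fit_vocabulary(emails)
print(list(vocabulary))
# ['10am', 'at', 'claim', 'click', 'free', 'here', 'meeting', 'now', 'prize', 'to', 'win', 'your']
print(to_count_vector("Free prize for you", vocabulary))
# [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```

Only two of the twelve columns are non-zero for the test email: 'for' and 'you' vanish entirely because they never appeared in the emails used to build the vocabulary, and the single-character word 'a' is dropped by the token pattern.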