# **Homework 20: Natural Language Processing I**
---

### **Description**
In this homework, you will apply what you learned in this week's lab to classify emails as either `"spam"` or `"not spam"`.


<br>

### **Structure**
**Part 1**: Detecting Spam Emails with a DNN

**Part 2**: Detecting Spam Emails with a CNN




<br>

### **Cheat Sheets**
[Natural Language Processing I](https://docs.google.com/document/d/1ZaLtMF7aQsG05myetJpoTJlr-sAIURP_a9sQr66pfqw/edit?usp=drive_link)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**

In [None]:
import numpy as np
import pandas as pd

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Input, TextVectorization, Dense, Lambda, Reshape, Conv1D, MaxPooling1D, Flatten
from keras.optimizers import Adam
from keras.utils import to_categorical

from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

---
## **Part 1: Detecting Spam Emails with a DNN**
---


**Run the code provided below to import the dataset.**

In [None]:
df = pd.read_excel('https://docs.google.com/spreadsheets/d/e/2PACX-1vQaAZH50du-EP6meYf_LjHztynjYFZ2mg1miSvjgz8nLNh_lnbSdgARSQC10UdhhQ/pub?output=xlsx')

inputs = df[["Column2"]]
output = df["Column1"]

x_train, x_test, y_train, y_test = train_test_split(inputs, output, test_size=0.2, random_state=42)
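
In the split above, `test_size=0.2` holds out 20% of the rows for testing and `random_state=42` makes the shuffle reproducible. Here is a minimal pure-Python sketch of that idea. Note that `simple_split` is a hypothetical helper written for illustration only; it is not how scikit-learn implements `train_test_split`.

```python
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows and hold out a test_size fraction (illustration only)."""
    rng = random.Random(seed)          # a fixed seed gives the same shuffle every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

emails = [f"email {i}" for i in range(10)]   # toy stand-in for the real dataset
train, test = simple_split(emails)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, and calling the function again with the same seed returns the same split.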

#### **Exercise #1.1: Create the `TextVectorization` layer**


Let's create a `TextVectorization` layer to vectorize this data.

Specifically,
1. Initialize the layer with the specified parameters.

2. Adapt the layer to the training data.

##### **1. Initialize the layer with the specified parameters.**

* `max_tokens = 5000`
* `output_mode = 'int'`
* `output_sequence_length = 50`

In [None]:
vectorize_layer = TextVectorization(
    # WRITE YOUR CODE HERE
  )

vectorize_layer.adapt(x_train)

###### **Solution**


In [None]:
vectorize_layer = TextVectorization(
    max_tokens = 5000,
    output_mode = 'int',
    output_sequence_length = 50
  )

vectorize_layer.adapt(x_train)
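
To get a feel for what `adapt` does, here is a rough pure-Python sketch: count how often each word appears, then keep the most frequent words up to `max_tokens`. Two slots are reserved, one for padding (`''`) and one for unknown words (`'[UNK]'`). The `build_vocabulary` helper and the toy emails are assumptions made for illustration, and Keras may order tied words differently.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Return a vocabulary list: padding token, unknown token, then words by frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary words,
    # so only max_tokens - 2 real words fit.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

toy_emails = ["win a free prize", "free free money", "meeting at noon"]
vocab = build_vocabulary(toy_emails, max_tokens=5)
print(vocab)
```

The most frequent word (`free`) lands at index 2, right after the two reserved slots.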

#### **Exercise #1.2: Look at the vocabulary**


Print the first 50 words of the vocabulary.

In [None]:
# WRITE YOUR CODE HERE

###### **Solution**

In [None]:
vectorize_layer.get_vocabulary()[0:50]

['',
 '[UNK]',
 'the',
 'and',
 'for',
 'you',
 'this',
 'that',
 'enron',
 'will',
 'with',
 'ect',
 'have',
 'from',
 'are',
 'your',
 'hou',
 'our',
 'not',
 'would',
 'has',
 'all',
 'ees',
 'please',
 'can',
 'any',
 'com',
 'they',
 'but',
 'out',
 'get',
 'was',
 'more',
 'power',
 'been',
 'which',
 'also',
 'energy',
 'some',
 'kitchen',
 'one',
 'these',
 'time',
 'there',
 'over',
 'need',
 'subject',
 'what',
 'know',
 'their']

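#### **Exercise #1.3: Add the input and text vectorization layers to the model**


Initialize a `Sequential` model, then add an `Input` layer that accepts one string per example, followed by the `vectorize_layer` you adapted above.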

In [None]:
model = Sequential()

model.add(Input(# COMPLETE THIS LINE
model.add(# COMPLETE THIS LINE

###### **Solution**


In [None]:
model = Sequential()

model.add(Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)

#### **Exercise #1.4: Look at the vectorization of an example**


Add your own sentence below to see how it would be vectorized with our newly adapted layer.

<br>

**NOTE:** `TextVectorization` ignores punctuation and treats upper and lower case as the same by default. Extra parameters can be set to adjust this behavior.
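
The default cleanup step (controlled by the layer's `standardize` parameter) can be approximated in plain Python as shown in this sketch. The `standardize` function here is a hypothetical stand-in, and the exact set of characters Keras strips may differ slightly from `string.punctuation`.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation (illustration only)."""
    lowered = text.lower()
    # str.translate with a deletion table removes every punctuation character.
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(standardize("Hello, World! FREE prize..."))
```

After this step, `"FREE"` and `"free"` are the same token, so they share one vocabulary entry.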

In [None]:
vector_0 = model.predict([# COMPLETE THIS LINE

print(vector_0)
print(vector_0.shape)

###### **Solution**


In [None]:
vector_0 = model.predict(['hello world my name is Adam'])

print(vector_0)
print(vector_0.shape)

[[1176  379    1  229    1    1    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0]]
(1, 50)
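
The lookup and padding behind an output like this can be sketched in plain Python. The `vectorize` helper and `toy_vocab` below are assumptions for illustration; the real layer uses the vocabulary learned during `adapt`.

```python
def vectorize(text, vocab, sequence_length):
    """Map each word to its vocabulary index, then pad or truncate to a fixed length."""
    index_of = {word: i for i, word in enumerate(vocab)}
    # Words missing from the vocabulary map to index 1, the '[UNK]' slot.
    ids = [index_of.get(word, 1) for word in text.lower().split()]
    ids = ids[:sequence_length]
    # Index 0 is the padding token that fills any remaining positions.
    return ids + [0] * (sequence_length - len(ids))

toy_vocab = ["", "[UNK]", "the", "and", "hello", "world"]
print(vectorize("Hello world my name", toy_vocab, sequence_length=8))
# prints [4, 5, 1, 1, 0, 0, 0, 0]
```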


#### ***STOP!* Answer the following question under Problem #8: When using the text vectorization layer in Keras, what do the numbers in the vectorization represent?**

#### **Exercise #1.5: Add hidden layers and an output layer**


Add two dense layers with 1000 neurons and ReLU activation.

Then, because this is a binary classification task, create the output layer with a single neuron (for spam/not spam) and the sigmoid activation function.
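
As a quick illustration of why one sigmoid neuron is enough for two classes, the sketch below squashes a raw score into a probability between 0 and 1 and then thresholds it at 0.5. Treating class 1 as "spam" is an assumption here; the actual meaning depends on how the labels in the dataset are encoded.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    # Assumption: probabilities of 0.5 or more are read as class 1 ("spam").
    label = "spam" if p >= 0.5 else "not spam"
    print(f"z={z:+.1f} -> p={p:.3f} -> {label}")
```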

In [None]:
model.add(Dense(# COMPLETE THIS LINE
model.add(Dense(# COMPLETE THIS LINE
model.add(Dense(# COMPLETE THIS LINE

###### **Solution**


In [None]:
model.add(Dense(1000, activation = 'relu'))
model.add(Dense(1000, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

Let's take a look at our DNN.

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 50)                0         
 Vectorization)                                                  
                                                                 
 dense (Dense)               (None, 1000)              51000     
                                                                 
 dense_1 (Dense)             (None, 1000)              1001000   
                                                                 
 dense_2 (Dense)             (None, 1)                 1001      
                                                                 
Total params: 1053001 (4.02 MB)
Trainable params: 1053001 (4.02 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
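
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-neuron pair plus one bias per neuron. The short sketch below redoes that arithmetic for the three dense layers (`dense_params` is a helper name chosen for this example).

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# (inputs, units) for each dense layer: 50 token ids feed the first layer.
layers = [(50, 1000), (1000, 1000), (1000, 1)]
counts = [dense_params(n_in, n_out) for n_in, n_out in layers]
print(counts, sum(counts))
# prints [51000, 1001000, 1001] 1053001
```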


#### **Exercise #1.6: Compile and fit the model**


Compile and fit the model with the following parameters:
* Adam learning rate of 0.001
* `binary_crossentropy` for the loss function
* Accuracy as the metric
* For the fit, use `epochs=5` and `batch_size=128`
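
For intuition about the loss function, here is a minimal pure-Python sketch of binary cross-entropy averaged over a batch. It is a simplified illustration, not the Keras implementation; the `eps` clipping value is an assumption used to avoid taking `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels (0 or 1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log never receives 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
```

Because there is only one output probability per email, the loss needs only the label `y` and the single prediction `p`.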

In [None]:
opt = Adam(learning_rate = # COMPLETE THIS LINE
model.compile(optimizer = opt, loss = # COMPLETE THIS LINE

model.fit(# COMPLETE THIS LINE

###### **Solution**


In [None]:
opt = Adam(learning_rate = 0.001)
model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

model.fit(x_train, y_train, epochs = 5, batch_size = 128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7ec1b399f910>

#### ***STOP!* Answer the following question under Problem #9: Why are we using the binary crossentropy loss function instead of the categorical crossentropy function for this task?**

#### **Exercise #1.7: Evaluate the model**


Now, evaluate the model for both the training and test sets.

<br>

**NOTE:** As a baseline, randomly guessing between the 2 possible classes would achieve an accuracy of roughly 0.5.

In [None]:
# Evaluate the training set
model.evaluate(# COMPLETE THIS LINE

# Evaluate the test set
model.evaluate(# COMPLETE THIS LINE

###### **Solution**


In [None]:
# Evaluate the training set
model.evaluate(x_train, y_train)

# Evaluate the test set
model.evaluate(x_test, y_test)



[23.089759826660156, 0.5699999928474426]
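
`model.evaluate` returns the loss followed by each compiled metric, so a list like the one above reads as `[loss, accuracy]`. The sketch below shows how an accuracy value can be computed from predicted probabilities using a 0.5 threshold. The labels and probabilities are made-up toy values, not results from this model.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

labels = [1, 0, 1, 0]          # toy labels
probs = [0.8, 0.3, 0.4, 0.6]   # toy sigmoid outputs
print(binary_accuracy(labels, probs))
# prints 0.5
```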

---
## **Part 2: Detecting Spam Emails with a CNN**
---


#### **Exercise #2.1: Initialize the model with an input and vectorizer layer**


*Hint: This is the same as Exercise #1.3.*

In [None]:
cnn_model = # COMPLETE THIS LINE

cnn_model.add(Input(# COMPLETE THIS LINE
cnn_model.add(# COMPLETE THIS LINE

###### **Solution**


In [None]:
cnn_model = Sequential()

cnn_model.add(Input(shape=(1,), dtype=tf.string))
cnn_model.add(vectorize_layer)

#### **Exercise #2.2: Finish building the CNN**


Build your CNN with the following layers:
* a convolutional layer with 128 filters, a kernel size of 5, and ReLU activation
* a max pooling layer with a pool size of 2
* a convolutional layer with 256 filters, a kernel size of 5, and ReLU activation
* a max pooling layer with a pool size of 2
* a flatten layer
* a dense layer with 512 neurons and ReLU activation
* the output layer
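
To see what a 1D convolution and a max pooling step do to a sequence, here is a rough pure-Python sketch with a single filter. It leaves out the bias and the ReLU activation, and the averaging kernel is a hypothetical choice; a real `Conv1D` layer learns its kernel values during training.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence with no padding and a stride of 1."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def max_pool1d(sequence, pool_size):
    """Keep the maximum of each non-overlapping window, dropping any remainder."""
    return [
        max(sequence[i:i + pool_size])
        for i in range(0, len(sequence) - pool_size + 1, pool_size)
    ]

tokens = [float(t) for t in range(50)]         # stand-in for 50 token ids
feature_map = conv1d_valid(tokens, [0.2] * 5)  # one hypothetical averaging filter
pooled = max_pool1d(feature_map, pool_size=2)
print(len(feature_map), len(pooled))
# prints 46 23
```

A kernel of size 5 shortens the sequence by 4 positions, and a pool size of 2 halves what remains.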

In [None]:
# The convolution layer requires us to cast the inputs to a different data type
# and reshape the input as well. We have done this for you.
cnn_model.add(Lambda(lambda x: tf.cast(x, 'float32')))
cnn_model.add(Reshape((50, 1)))

# Start building your CNN below.
# WRITE YOUR CODE HERE

###### **Solution**

In [None]:
# The convolution layer requires us to cast the inputs to a different data type
# and reshape the input as well. We have done this for you.
cnn_model.add(Lambda(lambda x: tf.cast(x, 'float32')))
cnn_model.add(Reshape((50, 1)))

# Start building your CNN below.
cnn_model.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
cnn_model.add(MaxPooling1D(pool_size=2))
cnn_model.add(Conv1D(filters=256, kernel_size=5, activation='relu'))
cnn_model.add(MaxPooling1D(pool_size=2))
cnn_model.add(Flatten())
cnn_model.add(Dense(512, activation = 'relu'))
cnn_model.add(Dense(1, activation = 'sigmoid'))
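
The output shapes of this stack can be traced by hand. The sketch below assumes the Keras defaults (no padding, a convolution stride of 1, and a pooling stride equal to the pool size); the helper names are chosen for this example only.

```python
def conv_out(length, kernel_size):
    """Sequence length after a convolution with no padding and stride 1."""
    return length - kernel_size + 1

def pool_out(length, pool_size):
    """Sequence length after non-overlapping pooling (any remainder is dropped)."""
    return (length - pool_size) // pool_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of size kernel_size * in_channels per filter, plus one bias each."""
    return kernel_size * in_channels * filters + filters

length = 50
length = conv_out(length, 5)   # 46 after the first convolution
length = pool_out(length, 2)   # 23 after the first pooling layer
length = conv_out(length, 5)   # 19 after the second convolution
length = pool_out(length, 2)   # 9 after the second pooling layer
flattened = length * 256       # 256 filters in the second convolution
print(length, flattened, conv1d_params(1, 128, 5))
# prints 9 2304 768
```

The lengths 46, 23, and 19 and the 768 parameters of the first convolution agree with the `cnn_model.summary()` output.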

#### ***STOP!* Answer the following question under Problem #10: Why did we use the Conv1D layer instead of the Conv2D layer?**

Let's take a look at the completed model.

In [None]:
cnn_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 50)                0         
 Vectorization)                                                  
                                                                 
 lambda (Lambda)             (None, 50)                0         
                                                                 
 reshape (Reshape)           (None, 50, 1)             0         
                                                                 
 conv1d (Conv1D)             (None, 46, 128)           768       
                                                                 
 max_pooling1d (MaxPooling1  (None, 23, 128)           0         
 D)                                                              
                                                                 
 conv1d_1 (Conv1D)           (None, 19, 256)          

#### **Exercise #2.3: Compile and fit the model**


Compile and fit the model with the same parameters as Part 1.

In [None]:
opt = Adam(# COMPLETE THIS LINE
cnn_model.compile(# COMPLETE THIS LINE

cnn_model.fit(# COMPLETE THIS LINE

###### **Solution**


In [None]:
opt = Adam(learning_rate = 0.001)
cnn_model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])

cnn_model.fit(x_train, y_train, epochs = 5, batch_size = 128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7ec1a82e75b0>

#### **Exercise #2.4: Evaluate the model**


Now, evaluate the model for both the training and test sets.

In [None]:
cnn_model.evaluate(# COMPLETE THIS LINE
cnn_model.evaluate(# COMPLETE THIS LINE

###### **Solution**

In [None]:
# Evaluate the training set
cnn_model.evaluate(x_train, y_train)

# Evaluate the test set
cnn_model.evaluate(x_test, y_test)



[0.7203307151794434, 0.5950000286102295]

---
#End of notebook

© 2024 The Coding School, All rights reserved