## Neural Networks for Data Science Applications (academic year 2023-2024)
### Lab session 3: Text classification with 1D CNNs (and transfer learning)

**Contents**:
1. Tokenizing and embedding text sentences.
2. Training a 1D CNN.
3. Transfer learning from a pre-trained sentence embedder.
4. Visualizing the embeddings.

In [None]:
%pip install tiktoken --quiet


In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tiktoken

### Step 1 - Data preprocessing (tokenization)

In [None]:
# The dataset concerns a sentence classification task: https://www.tensorflow.org/datasets/catalog/trec
data = tfds.load('trec')

Downloading and preparing dataset 350.79 KiB (download: 350.79 KiB, generated: 636.90 KiB, total: 987.69 KiB) to /root/tensorflow_datasets/trec/1.0.0...



Dataset trec downloaded and prepared to /root/tensorflow_datasets/trec/1.0.0. Subsequent calls will reuse this data.


In [None]:
train_data = data['train']
test_data = data['test']

In [None]:
el = next(iter(train_data))

In [None]:
# For the purpose of this lab, we will use the coarse label. The TFDS catalog lists
# 6 coarse classes; the models below use 7 output units, so one output stays unused.
el

{'label-coarse': <tf.Tensor: shape=(), dtype=int64, numpy=3>,
 'label-fine': <tf.Tensor: shape=(), dtype=int64, numpy=4>,
 'text': <tf.Tensor: shape=(), dtype=string, numpy=b'Who was Camp David named for ?'>}

In [None]:
# tiktoken (https://github.com/openai/tiktoken) is an open-source implementation
# of the OpenAI tokenizer. We use here the small tokenizer (approx. 50k possible tokens).
enc = tiktoken.get_encoding('r50k_base')

In [None]:
enc.n_vocab

50257

In [None]:
# Tokenize a sentence into a sequence of token IDs.
enc.encode('What is the weather today?')

[2061, 318, 262, 6193, 1909, 30]

In [None]:
def preprocess(el):
  # Transform the dataset into an (x,y) format.
  return el['text'], el['label-coarse']

In [None]:
# To convert our data into strings, we first extract the NumPy representation (byte strings),
# before decoding to the correct text format.
el['text'].numpy().decode('utf8')

'Who was Camp David named for ?'

In [None]:
# py_function wraps our function into an op that tf.data can call: tiktoken runs
# in plain Python (on strings), not on TensorFlow tensors inside a graph.
@tf.py_function(name='tokenize', Tout=(tf.int32, tf.int64))
def tokenize(text, label):
  text = text.numpy().decode('utf8')
  tokens = enc.encode(text)
  return tf.convert_to_tensor(tokens), label

In [None]:
for x, y in train_data.map(preprocess).shuffle(1000).map(tokenize):
  print(x)
  print(y)
  break

tf.Tensor([ 2437   318   262  1573  4600 10662   328   506   705 16293  5633], shape=(11,), dtype=int32)
tf.Tensor(0, shape=(), dtype=int64)


In [None]:
# We only need to run the tokenization step once, so we cache the result to avoid
# unnecessary recomputations. Without arguments, cache() keeps the data in memory;
# pass a filename to cache on disk instead.
train_data_p = train_data.map(preprocess).map(tokenize).cache()
test_data_p = test_data.map(preprocess).map(tokenize).cache()

In [None]:
for xb, yb in train_data_p.shuffle(1000).padded_batch(4, padded_shapes=([None], [])):
  # Build a mini-batch, by zero-padding to the largest sentence in the mini-batch
  # (try to run multiple times and see how the output changes).
  print(xb)
  print(yb)
  break

tf.Tensor(
[[ 2061   318 12963 39148   705    82  1336  1438  5633     0     0     0
      0     0]
 [ 2061  1499  2497   262  8159   286   262  7740 34070  5633     0     0
      0     0]
 [ 8241  6928 22108 20604   355   262  2583   286   383  4544  2253  7873
    415  5633]
 [ 2061   318   262  3139   286 36421  5633     0     0     0     0     0
      0     0]], shape=(4, 14), dtype=int32)
tf.Tensor([3 5 3 5], shape=(4,), dtype=int64)
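
The padding step can be sketched in plain Python. This is only an illustration of what `padded_batch` does to the token sequences, not its actual implementation; `pad_batch` is a hypothetical helper, and the two sequences are token IDs copied from the outputs above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Two token sequences of different lengths (IDs copied from the outputs above).
batch = [
    [2061, 318, 262, 6193, 1909, 30],
    [2061, 318, 262, 3139, 286, 36421, 5633],
]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

The padded length depends on the longest sentence in each mini-batch, which is why the second dimension changes from run to run. Note that 0 is also a regular token ID in this vocabulary, so the model cannot distinguish padding from that token unless a mask is used.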


### Step 2 - Playing with layers

In [None]:
from tensorflow.keras import layers

#### 2a: Embedding

In [None]:
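# Note: the tokenizer has 50,257 tokens, so IDs >= 50,000 fall outside this table;
# passing enc.n_vocab as the first argument would cover the full vocabulary.
emb = layers.Embedding(50_000, 6)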

In [None]:
# Each token is converted to a 6-dimensional vector.
emb(xb).shape

TensorShape([4, 14, 6])

In [None]:
for p in emb.trainable_variables:
  # The only parameter is a (vocabulary_size x embedding_dimension) matrix.
  print(p.shape)

(50000, 6)
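
Conceptually, an embedding layer is a table lookup: each token ID selects one row of the parameter matrix. The sketch below shows this in plain Python with a tiny, hypothetical vocabulary of 10 tokens; the names `table` and `embed` are made up for illustration and the random initialization is an arbitrary choice.

```python
import random

random.seed(0)
vocab_size, emb_dim = 10, 6  # tiny hypothetical vocabulary, for illustration only

# The embedding matrix: one row of emb_dim numbers per token ID.
table = [[random.uniform(-0.05, 0.05) for _ in range(emb_dim)]
         for _ in range(vocab_size)]

def embed(batch_ids):
    """Replace every token ID with the corresponding row of the table."""
    return [[table[i] for i in seq] for seq in batch_ids]

ids = [[1, 4, 4, 0], [7, 2, 0, 0]]   # shape (2, 4)
out = embed(ids)                      # shape (2, 4, 6)
print(len(out), len(out[0]), len(out[0][0]))
```

The same token ID always maps to the same vector, regardless of its position in the sentence.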


#### 2b: Dropout

In [None]:
# The input is the probability of masking an element.
drop = layers.Dropout(0.75)

In [None]:
x = tf.random.normal((3, 2))
print(x)

tf.Tensor(
[[ 0.0985296  -0.63593316]
 [ 0.15297675  0.028723  ]
 [-0.47992012  0.6675854 ]], shape=(3, 2), dtype=float32)


In [None]:
# By default, the layer works in "evaluation mode", i.e., it does nothing.
drop(x)

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 0.0985296 , -0.63593316],
       [ 0.15297675,  0.028723  ],
       [-0.47992012,  0.6675854 ]], dtype=float32)>

In [None]:
# Note that unmasked values are multiplied by 1/(1-p).
drop(x, training=True)

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 0.       ,  0.       ],
       [ 0.611907 ,  0.       ],
       [-1.9196805,  2.6703415]], dtype=float32)>
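
The rescaling can be reproduced with a few lines of plain Python. This is a minimal sketch of "inverted" dropout under the assumption that each element is dropped independently; `dropout` here is a hypothetical helper, not the Keras layer, and the random mask will differ from the one above.

```python
import random

def dropout(values, p, rng):
    """Zero each value with probability p; rescale the survivors by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)
x_row = [0.0985296, -0.63593316, 0.15297675, 0.028723]  # values from the cells above
y_row = dropout(x_row, p=0.75, rng=rng)
print(y_row)
```

With p = 0.75 the surviving values are multiplied by 4, which matches the output above (for example, 0.15297675 becomes 0.611907). The rescaling keeps the expected value of each element unchanged between training and evaluation.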

#### 2c: Batch normalization

In [None]:
bn = layers.BatchNormalization()

In [None]:
# Construct the parameters of the layer without calling it.
bn.build((None, 2))

In [None]:
# The parameters are the scale and shift factors applied to the normalized inputs
# (alpha and beta in the slides; Keras names the scale factor gamma).
for p in bn.trainable_variables:
  print(p)

<tf.Variable 'gamma:0' shape=(2,) dtype=float32, numpy=array([1., 1.], dtype=float32)>
<tf.Variable 'beta:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>


In [None]:
# The non-trainable variables are the running mean and the running variance of the inputs,
# which replace the batch statistics when training=False.
for p in bn.non_trainable_variables:
  print(p)

<tf.Variable 'moving_mean:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>
<tf.Variable 'moving_variance:0' shape=(2,) dtype=float32, numpy=array([1., 1.], dtype=float32)>


In [None]:
tf.reduce_mean(x, 0)

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-0.07613792,  0.02012507], dtype=float32)>

In [None]:
bn(x, training=True)

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 0.09848037, -0.6356154 ],
       [ 0.15290031,  0.02870865],
       [-0.47968033,  0.6672518 ]], dtype=float32)>

In [None]:
# Try with training=True and training=False, and check how the mean and the non-trainable
# variables are being modified.
tf.reduce_mean(bn(x, training=True), 0)

<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-0.07609988,  0.02011502], dtype=float32)>
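
To see what the two modes do, here is a plain-Python sketch of the training-mode computation on a single feature column. It assumes a momentum of 0.99 and an epsilon of 0.001 (which I believe are the Keras defaults); `batch_norm_train` is a hypothetical helper written for this illustration.

```python
def batch_norm_train(column, moving_mean, moving_var,
                     gamma=1.0, beta=0.0, momentum=0.99, eps=1e-3):
    """Normalize one feature column with batch statistics and update the moving ones."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n   # biased (population) variance
    out = [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in column]
    # Exponential moving averages, used later when training=False.
    moving_mean = momentum * moving_mean + (1 - momentum) * mean
    moving_var = momentum * moving_var + (1 - momentum) * var
    return out, moving_mean, moving_var

col = [0.0985296, 0.15297675, -0.47992012]   # first column of x above
out, mm, mv = batch_norm_train(col, moving_mean=0.0, moving_var=1.0)
print(sum(out) / len(out))   # close to 0: the batch mean has been removed
print(mm, mv)                # moving statistics moved slightly toward the batch ones
```

In training mode the output has (approximately) zero mean per column and the moving statistics are updated; in evaluation mode the stored moving statistics are used instead and nothing is updated.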

#### 2d: Regularizers

In [None]:
from tensorflow.keras import regularizers

In [None]:
# Compute the regularization value on a single parameter.
# In a custom training loop, this value would be added to the loss by hand.
regularizers.L2(0.001)(bn.trainable_variables[0])

<tf.Tensor: shape=(), dtype=float32, numpy=0.002>
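
The value above is simply the coefficient times the sum of squared entries of the parameter. A minimal sketch in plain Python, where `l2_penalty` is a hypothetical helper and not part of Keras:

```python
def l2_penalty(weights, coeff):
    """L2 regularization term: coeff times the sum of squared weights."""
    return coeff * sum(w * w for w in weights)

gamma = [1.0, 1.0]               # initial value of the scale parameter shown above
penalty = l2_penalty(gamma, 0.001)
print(penalty)                   # 0.002, matching the Keras output above
```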

In [None]:
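# Attach the regularizer to a layer. fit() adds the resulting penalty to the loss
# automatically; in a custom training loop you must add the layer's losses yourself.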
layers.BatchNormalization(beta_regularizer=regularizers.L2(0.001))

<keras.src.layers.normalization.batch_normalization.BatchNormalization at 0x7de4ff5f3220>

### Step 3 - Building the model

In [None]:
class TextClassifier(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # This is a simple model composed of 3 convolutional blocks (Conv1D - BatchNorm - ReLU,
        # with max-pooling after the first two), followed by global average pooling,
        # dropout, and a single fully-connected layer.
        self.emb = layers.Embedding(50_000, 4)
        self.conv1 = layers.Conv1D(32, 5, padding='same')
        self.conv2 = layers.Conv1D(64, 5, padding='same')
        self.conv3 = layers.Conv1D(128, 5, padding='same')
        self.bn1 = layers.BatchNormalization()
        self.bn2 = layers.BatchNormalization()
        self.bn3 = layers.BatchNormalization()
        self.max_pool = layers.MaxPool1D(2)
        self.global_pool = layers.GlobalAvgPool1D()
        self.drop = layers.Dropout(0.3)
        self.dense = layers.Dense(7)

    def call(self, x, training=False):
        # x has shape (None, max_seq_len)
        x = self.emb(x)                         # (None, max_seq_len, 4)
        x = tf.nn.relu(self.bn1(self.conv1(x),
                                training=training)) # (None, max_seq_len, 32)
        x = self.max_pool(x)                    # (None, max_seq_len/2, 32)
        x = tf.nn.relu(self.bn2(self.conv2(x),
                                training=training)) # (None, max_seq_len/2, 64)
        x = self.max_pool(x)                    # (None, max_seq_len/4, 64)
        x = tf.nn.relu(self.bn3(self.conv3(x),
                                training=training)) # (None, max_seq_len/4, 128)
        x = self.global_pool(x)                 # (None, 128)
        x = self.drop(x, training=training)
        return self.dense(x)                    # (None, 7)

In [None]:
cnn = TextClassifier()

In [None]:
cnn(xb).shape

TensorShape([4, 7])

In [None]:
cnn.summary()

Model: "text_classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  200000    
                                                                 
 conv1d (Conv1D)             multiple                  672       
                                                                 
 conv1d_1 (Conv1D)           multiple                  10304     
                                                                 
 conv1d_2 (Conv1D)           multiple                  41088     
                                                                 
 batch_normalization_2 (Bat  multiple                  128       
 chNormalization)                                                
                                                                 
 batch_normalization_3 (Bat  multiple                  256       
 chNormalization)                                  
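
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer shapes defined in `TextClassifier`; the helper functions are hypothetical names introduced for this check, and the batch normalization count includes the two non-trainable moving statistics.

```python
def embedding_params(vocab_size, dim):
    # One vector of size `dim` per token.
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, out_channels):
    # One kernel of shape (kernel_size, in_channels) per output channel, plus a bias.
    return kernel_size * in_channels * out_channels + out_channels

def batchnorm_params(channels):
    # gamma, beta, moving mean, moving variance: four vectors of size `channels`.
    return 4 * channels

def dense_params(in_features, out_features):
    # Weight matrix plus one bias per output.
    return in_features * out_features + out_features

counts = {
    'embedding': embedding_params(50_000, 4),
    'conv1': conv1d_params(5, 4, 32),
    'conv2': conv1d_params(5, 32, 64),
    'conv3': conv1d_params(5, 64, 128),
    'bn1': batchnorm_params(32),
    'bn2': batchnorm_params(64),
    'bn3': batchnorm_params(128),
    'dense': dense_params(128, 7),
}
for name, n in counts.items():
    print(f'{name:10s} {n}')
```

Almost all parameters sit in the embedding matrix, even with an embedding dimension as small as 4.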

### Step 4 - Training the model

In [None]:
from tensorflow.keras import losses, metrics, optimizers, callbacks

In [None]:
loss = losses.SparseCategoricalCrossentropy(from_logits=True)
# Use a distinct name so that the imported `metrics` module is not shadowed.
metric_list = [metrics.SparseCategoricalAccuracy()]
optimizer = optimizers.Adam()

In [None]:
cnn.compile(loss=loss, metrics=metric_list, optimizer=optimizer)

In [None]:
cnn.evaluate(test_data_p.padded_batch(32, padded_shapes=([None], [])))



[1.9462552070617676, 0.13199999928474426]
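
The loss of the untrained model is a useful sanity check: a network that assigns roughly equal probability to each of its 7 outputs should have a cross-entropy close to ln 7. A short sketch of this computation:

```python
import math

num_outputs = 7                        # size of the final Dense layer
uniform_prob = 1.0 / num_outputs       # softmax output when all logits are equal
chance_loss = -math.log(uniform_prob)  # cross-entropy of a uniform prediction
print(round(chance_loss, 4))           # 1.9459
```

The measured value (about 1.946) is very close to this, which suggests that the randomly initialized model starts from near-uniform predictions.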

In [None]:
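# Callbacks add functionality to the training procedure.
# Check out the full list here: https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback
# Note: the test split doubles as validation data here for simplicity; with early stopping,
# this makes the final test accuracy optimistic, so a separate validation split is preferable.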
es = callbacks.EarlyStopping(monitor='val_sparse_categorical_accuracy',
                             patience=5,
                             restore_best_weights=True)

In [None]:
cnn.fit(train_data_p.shuffle(1000).padded_batch(32, padded_shapes=([None], [])),
        validation_data=test_data_p.padded_batch(32, padded_shapes=([None], [])),
        epochs=10_000,
        callbacks=[es])

Epoch 1/10000
Epoch 2/10000
Epoch 3/10000
Epoch 4/10000
Epoch 5/10000
Epoch 6/10000
Epoch 7/10000
Epoch 8/10000
Epoch 9/10000
Epoch 10/10000
Epoch 11/10000


<keras.src.callbacks.History at 0x7de4dc6010f0>
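
The stopping rule can be sketched in plain Python. This is a simplified model of a patience-based rule for a metric to maximize, not the Keras implementation (which has more options); `early_stopping` is a hypothetical helper and the accuracy values are invented for illustration.

```python
def early_stopping(val_scores, patience):
    """Return (stopping epoch, best epoch), both 1-based, for a metric to maximize."""
    best_score, best_epoch, wait = float('-inf'), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # Strict improvement: remember it and reset the counter.
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch

# Hypothetical validation accuracies (not the ones obtained in the run above).
scores = [0.50, 0.62, 0.70, 0.74, 0.78, 0.81, 0.80, 0.79, 0.81, 0.80, 0.78, 0.77]
stop_epoch, best_epoch = early_stopping(scores, patience=5)
print(stop_epoch, best_epoch)   # 11 6
```

With `patience=5`, stopping at epoch 11 (as in the run above) is consistent with the best validation score having been reached at epoch 6; `restore_best_weights=True` then rolls the model back to the weights of that epoch.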

### Step 5: Transfer learning

In [None]:
%pip install tensorflow_text --quiet


In [None]:
import tensorflow_text
import tensorflow_hub as tfhub

In [None]:
# Download a pre-trained sentence embedding network from the hub:
# https://www.kaggle.com/models/google/nnlm/frameworks/TensorFlow2/variations/en-dim128/versions/1
embedder = tfhub.KerasLayer(
    "https://www.kaggle.com/models/google/nnlm/frameworks/TensorFlow2/variations/en-dim128/versions/1")

In [None]:
text = tf.constant(['What is the weather?'])

In [None]:
embedder(text)

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[ 0.2625199 ,  0.0655278 , -0.03559397,  0.05916893, -0.06252968,
        -0.11400567, -0.01118497, -0.09423669, -0.0154704 ,  0.01580296,
         0.07717817, -0.12524146, -0.04622058, -0.05460491,  0.05132222,
         0.04918185,  0.1659416 ,  0.05348625, -0.1531514 ,  0.38159302,
        -0.03752544, -0.03765873,  0.15985799,  0.03962875, -0.0772755 ,
         0.14130831, -0.1802341 ,  0.03392623, -0.05066182,  0.12486257,
         0.08414268, -0.07169367, -0.0294523 , -0.09322049, -0.07138393,
        -0.00666364,  0.05540091,  0.08182998, -0.1606449 , -0.05262072,
        -0.02823647, -0.11852064, -0.02871278, -0.08204059,  0.05881826,
        -0.12198396, -0.05434574, -0.14797848, -0.06854214,  0.05576418,
        -0.00258158,  0.06265221,  0.13194905,  0.02321389, -0.02113196,
         0.03364733, -0.05678733, -0.01316416,  0.08929291, -0.04095168,
        -0.1090198 , -0.06177203, -0.13504025, -0.01373049, -0.03007997,
 

In [None]:
# The only trainable component is now the additional fully-connected layer (< 1k parameters).
classifier_v2 = tf.keras.Sequential([
    embedder,
    layers.Dense(7)
])

In [None]:
data = tfds.load('trec')

In [None]:
train_data = data['train'].map(preprocess).shuffle(1000).batch(64)
test_data = data['test'].map(preprocess).batch(64)

In [None]:
for xb, yb in train_data:
  print(xb.shape)
  print(classifier_v2(xb).shape)
  break

(64,)
(64, 7)


In [None]:
loss = losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]
optimizer = optimizers.Adam()

In [None]:
classifier_v2.compile(loss=loss, metrics=metrics, optimizer=optimizer)

In [None]:
classifier_v2.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer_5 (KerasLayer)  (None, 128)               124642688 
                                                                 
 dense_3 (Dense)             (None, 7)                 903       
                                                                 
Total params: 124643591 (475.48 MB)
Trainable params: 903 (3.53 KB)
Non-trainable params: 124642688 (475.47 MB)
_________________________________________________________________


In [None]:
classifier_v2.fit(train_data, validation_data=test_data, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7de40c10c940>

### Step 6: Visualizing the embeddings

Check out the [TensorFlow projector](https://projector.tensorflow.org/).

In [None]:
test_data = data['test'].map(preprocess)

In [None]:
# This is a metadata file, with each sentence in a row.
with open('metadata.tsv', 'w') as f:
  for sentence, y in test_data:
    f.write(sentence.numpy().decode('utf-8') + '\n')

In [None]:
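# `sentence` still holds the last element of the loop above; print the first
# component of its embedding, converted to a string.
for i in map(str, embedder(sentence[None])[0].numpy()):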
  print(i)
  break

0.3548469


In [None]:
# These are the embeddings: one row per sentence, one column per embedding dimension.
with open('embeddings.tsv', 'w') as f:
  for sentence, y in test_data:
    o = map(str, embedder(sentence[None])[0].numpy())
    f.write('\t'.join(o) + '\n')

Download the two files and load them in the projector to visualize the embeddings.
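
As far as I know, the projector pairs the two files by row position, so they must contain the same number of rows in the same order. The sketch below writes both files in a single pass to guarantee this and then checks the result. It uses a hypothetical `fake_embed` function as a stand-in for the real embedder (its outputs are meaningless) and writes to a temporary directory.

```python
import os
import tempfile

def fake_embed(sentence, dim=4):
    """Hypothetical stand-in for the real embedder: NOT a meaningful embedding."""
    return [float(len(sentence) % (i + 2)) for i in range(dim)]

sentences = ['Who was Camp David named for ?', 'What is the weather today?']

out_dir = tempfile.mkdtemp()
meta_path = os.path.join(out_dir, 'metadata.tsv')
emb_path = os.path.join(out_dir, 'embeddings.tsv')

# Write one metadata row and one embedding row per sentence, in the same order.
with open(meta_path, 'w', encoding='utf-8') as f_meta, \
     open(emb_path, 'w', encoding='utf-8') as f_emb:
    for s in sentences:
        f_meta.write(s + '\n')
        f_emb.write('\t'.join(map(str, fake_embed(s))) + '\n')

# Check that both files have the same number of rows and a fixed number of columns.
with open(meta_path, encoding='utf-8') as f:
    n_meta = sum(1 for _ in f)
with open(emb_path, encoding='utf-8') as f:
    rows = [line.rstrip('\n').split('\t') for line in f]
print(n_meta, len(rows), len(rows[0]))
```

With the real model, embedding the sentences in batches rather than one at a time would also make the export considerably faster.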