### Transformer in Text Classification

* Implementation of a few custom Keras layers used to
build an artificial neural network based on the
**Transformer** architecture.
* Saving the trained model and restoring the weights of a
previously trained model.


In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np



In [2]:
df = pd.read_csv("dataset/train.csv", encoding='ISO-8859-1')
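The outputs that follow report 1599999 rows and use the first tweet as column names, which suggests the file has no header row and that `read_csv` consumed the first record as the header. A minimal sketch of how that record could be kept, assuming the file really is header-less; the column names and the in-memory sample are hypothetical and only serve to make the example runnable:

```python
import io
import pandas as pd

# Two made-up rows in the same six-field layout as the preview in this notebook.
sample_csv = io.StringIO(
    '"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","first tweet"\n'
    '"4","2193601966","Tue Jun 16 08:40:49 PDT 2009","NO_QUERY","AmandaMarie1028","second tweet"\n'
)

# Hypothetical column names: the original file does not provide any.
column_names = ["target", "id", "date", "flag", "user", "text"]

# header=None tells pandas that the first line is data, not a header.
df_sample = pd.read_csv(sample_csv, header=None, names=column_names)
print(df_sample[["target", "text"]])
```

With named columns, the label and text could then be selected by name instead of by position.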


In [3]:
df

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
...,...,...,...,...,...,...
1599994,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599999 entries, 0 to 1599998
Data columns (total 6 columns):
 #   Column                                                                                                               Non-Null Count    Dtype 
---  ------                                                                                                               --------------    ----- 
 0   0                                                                                                                    1599999 non-null  int64 
 1   1467810369                                                                                                           1599999 non-null  int64 
 2   Mon Apr 06 22:19:45 PDT 2009                                                                                         1599999 non-null  object
 3   NO_QUERY                                                                                                             1599999 non-null  object
 4   _

In [5]:
df = df.iloc[:, [0, -1]]

In [6]:
df

Unnamed: 0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,is upset that he can't update his Facebook by ...
1,0,@Kenichan I dived many times for the ball. Man...
2,0,my whole body feels itchy and like its on fire
3,0,"@nationwideclass no, it's not behaving at all...."
4,0,@Kwesidei not the whole crew
...,...,...
1599994,4,Just woke up. Having no school is the best fee...
1599995,4,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,Happy 38th Birthday to my boo of alll time!!! ...


In [7]:
y = df.iloc[:, 0]  
x = df.iloc[:, 1]  

# Print the results

print(y)
print(x)

0          0
1          0
2          0
3          0
4          0
          ..
1599994    4
1599995    4
1599996    4
1599997    4
1599998    4
Name: 0, Length: 1599999, dtype: int64
0          is upset that he can't update his Facebook by ...
1          @Kenichan I dived many times for the ball. Man...
2            my whole body feels itchy and like its on fire 
3          @nationwideclass no, it's not behaving at all....
4                              @Kwesidei not the whole crew 
                                 ...                        
1599994    Just woke up. Having no school is the best fee...
1599995    TheWDB.com - Very cool to hear old Walt interv...
1599996    Are you ready for your MoJo Makeover? Ask me f...
1599997    Happy 38th Birthday to my boo of alll time!!! ...
1599998    happy #charitytuesday @theNSPCC @SparksCharity...
Name: @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D, Length: 1599999, dtype: 

In [8]:
print(x.shape)  # Shape of the text (feature) series
print(y.shape)  # Shape of the label series

(1599999,)
(1599999,)
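The printed label series shows only the values 0 and 4; whether other values occur in the rows hidden by the preview is not visible here. The model defined later uses a 5-unit softmax, which keeps 4 a valid class index. As an optional alternative, the labels could be remapped to contiguous ids. A small sketch on a toy array (the array is illustrative, not taken from the dataset):

```python
import numpy as np

# Toy labels mimicking the values visible in the preview.
labels = np.array([0, 4, 4, 0, 4])

# `classes` holds the sorted distinct labels; `remapped` holds, for each
# element, the position of its label inside `classes`.
classes, remapped = np.unique(labels, return_inverse=True)
print(classes)   # distinct original labels
print(remapped)  # contiguous ids usable with a len(classes)-unit output layer
```

Running `np.unique(y)` on the real label series would show how many classes are actually present.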


In [9]:
# Implementation of a Transformer block
# by subclassing the Keras Layer class

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.5):
        super().__init__()
        # multi-head self-attention over the input sequence
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # position-wise feed-forward network
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-8)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-8)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # self-attention: the sequence attends to itself
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # residual connection followed by layer normalization
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
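To make the first half of the block more concrete, here is a simplified NumPy sketch of self-attention followed by the residual connection and layer normalization. It is not what `layers.MultiHeadAttention` does internally: it uses a single head, random projection matrices, no output projection, no dropout, and a layer normalization without the learnable scale and offset. All sizes are toy values.

```python
import numpy as np

def layer_norm(x, eps=1e-8):
    # Normalize each time step over the feature axis (no learnable parameters here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Project the same sequence into queries, keys and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scaled dot-product scores between every pair of time steps.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, embed_dim = 4, 8  # toy sizes
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))

attn_out, attn_weights = self_attention(x, w_q, w_k, w_v)
out1 = layer_norm(x + attn_out)  # residual connection + normalization
print(out1.shape)
```

Each row of `attn_weights` sums to 1, so every output time step is a weighted average of the value vectors of all time steps.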

In [10]:
# Implementation of the Embedding block,
# which adds position vectors
# to the word token vectors,
# by subclassing the Keras Layer class

class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
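The lookup-and-add performed by this layer can be sketched in NumPy. The tables below are random stand-ins for the learned embedding weights, and all sizes are toy values chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, embed_dim = 10, 5, 4  # toy sizes

# Stand-ins for the weights of the two Embedding layers.
token_table = rng.normal(size=(vocab_size, embed_dim))
position_table = rng.normal(size=(maxlen, embed_dim))

# A batch of two padded sequences of token ids.
token_ids = np.array([[0, 0, 3, 7, 2],
                      [0, 5, 1, 1, 9]])

# Positions 0..maxlen-1 are the same for every sequence in the batch.
positions = np.arange(token_ids.shape[-1])

# (batch, maxlen, embed_dim) + (maxlen, embed_dim): the position vectors
# are broadcast over the batch axis.
embedded = token_table[token_ids] + position_table[positions]
print(embedded.shape)
```

The same token id therefore gets a different vector at different positions, which is how the model can distinguish word order.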

In [11]:
# Dataset creation

vocab_size = 5000000  # upper bound on the vocabulary size: it sets the number
                    # of rows of the token embedding table and must exceed
                    # the largest index produced by the tokenizer
maxlen = 200  # every tweet is padded or truncated to 200 tokens
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
print(len(x_train), "training sequences")
print(len(x_test), "validation sequences")

1119999 training sequences
480000 validation sequences
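The two sizes printed above follow from the 30% test share: with a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining samples to the training set. A quick check of the arithmetic:

```python
import math

n_samples = 1_599_999  # rows in the DataFrame
test_size = 0.3

# The test share is rounded up; the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```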


In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = x_train

# Step 1: create a tokenizer and fit it on the training texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Step 2: convert the sentences into sequences of integer ids
x_train = tokenizer.texts_to_sequences(texts)

# Step 3: apply padding
maxlen = 200  # desired maximum sequence length
x_train = pad_sequences(x_train, maxlen=maxlen)

print(x_train)

[[   0    0    0 ...   68    4  541]
 [   0    0    0 ... 7548    2   14]
 [   0    0    0 ...  720   15  560]
 ...
 [   0    0    0 ...   95  164  918]
 [   0    0    0 ... 1075   14   17]
 [   0    0    0 ...   51   52 6678]]
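The leading zeros in the printed matrix come from `pad_sequences`, which by default pads (and truncates) at the start of each sequence. The three steps can be imitated in plain Python. This is a simplified sketch, not the Keras implementation: it splits on whitespace only, without the punctuation filtering that `Tokenizer` applies, and the function names are made up for the example.

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by frequency; index 0 is reserved for padding.
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, maxlen):
    # Words that were not seen during fitting are dropped.
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]                       # keep the last `maxlen` ids
    return [0] * (maxlen - len(ids)) + ids    # pad with zeros on the left

train_texts = ["good day", "bad day today"]
word_index = fit_word_index(train_texts)
padded = to_padded_sequence("good day unseen", word_index, maxlen=5)
print(word_index)
print(padded)
```

Because unseen words are dropped, the test texts must be encoded with the word index built from the training texts.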


In [13]:
# Test texts
texts2 = x_test

# Step 1: reuse the tokenizer fitted on the training texts, so that train
# and test share the same word index (refitting on `texts` is redundant)

# Step 2: convert the sentences into sequences of integer ids
x_test = tokenizer.texts_to_sequences(texts2)

# Step 3: apply padding
maxlen = 200  # desired maximum sequence length
x_test = pad_sequences(x_test, maxlen=maxlen)

print(x_test)

[[    0     0     0 ...   298   123   426]
 [    0     0     0 ...   239   100    97]
 [    0     0     0 ...  4805    35  3175]
 ...
 [    0     0     0 ...   490     6  5751]
 [    0     0     0 ...  1166   622   118]
 [    0     0     0 ... 13624  2549  2148]]


In [14]:

# padding so that all sequences have the same
# length (200 tokens); the arrays were already padded
# above, so this step leaves them unchanged
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

In [16]:
# Implementation of a classification model
# using the custom Transformer-based layer
# defined earlier.
# The Transformer layer outputs one vector for every
# time step of the input sequence.
# GlobalAveragePooling averages over all time steps,
# and a feed-forward network then classifies
# the resulting vector.

embed_dim = 32  # size of each token embedding vector
num_heads = 8  # number of attention heads
ff_dim = 16  # number of units in the feed-forward layers

inputs = layers.Input(shape=(maxlen,))
x = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x, training=True)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(ff_dim, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(5, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)

I0000 00:00:1734345489.702229    3092 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5529 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4060 Ti, pci bus id: 0000:01:00.0, compute capability: 8.9
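With `vocab_size = 5000000`, the token embedding table dominates the size of the model. A back-of-the-envelope count, assuming 4-byte float32 weights:

```python
vocab_size = 5_000_000
maxlen = 200
embed_dim = 32

# One embed_dim-sized vector per vocabulary entry and per position.
token_embedding_params = vocab_size * embed_dim
position_embedding_params = maxlen * embed_dim

# Assuming float32 weights (4 bytes each).
token_embedding_bytes = 4 * token_embedding_params
print(token_embedding_params, position_embedding_params, token_embedding_bytes)
```

That is 160 million parameters (about 640 MB) for the token table alone. If memory becomes a constraint, a value such as `len(tokenizer.word_index) + 1` would be a tighter choice for `vocab_size`, since it is the smallest size that still covers every index the tokenizer can produce.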


In [17]:
# check the model architecture
model.summary()

In [18]:
# diagram of the model architecture
keras.utils.plot_model(model)

You must install pydot (`pip install pydot`) for `plot_model` to work.


In [19]:
# compile the model
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
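The `sparse_categorical_crossentropy` loss takes integer class labels rather than one-hot vectors: for each sample it is the negative log of the probability the model assigns to the true class. A NumPy sketch on made-up probabilities for the 5 output units:

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    # Pick, for every row, the predicted probability of its true class.
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(picked)

# Made-up softmax outputs for two samples over 5 classes.
probs = np.array([
    [0.70, 0.05, 0.05, 0.05, 0.15],
    [0.10, 0.05, 0.05, 0.05, 0.75],
])
y_true = np.array([0, 4])  # integer labels, as in this dataset

losses = sparse_categorical_crossentropy(y_true, probs)
print(losses, losses.mean())
```

The label 4 indexes the last of the 5 output units, which is why the output layer needs at least 5 units when the labels are used without remapping.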

In [20]:
# training
history = model.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=5,
    validation_data=(x_test, y_test)
)

Epoch 1/5
35000/35000 ━━━━━━━━━━━━━━━━━━━━ 1147s 32ms/step - accuracy: 0.5379 - loss: 0.6786 - val_accuracy: 0.7999 - val_loss: 0.4553
Epoch 2/5
35000/35000 ━━━━━━━━━━━━━━━━━━━━ 1133s 32ms/step - accuracy: 0.8109 - loss: 0.4291 - val_accuracy: 0.8092 - val_loss: 0.4461
Epoch 3/5
35000/35000 ━━━━━━━━━━━━━━━━━━━━ 1124s 32ms/step - accuracy: 0.8492 - loss: 0.3678 - val_accuracy: 0.8020 - val_loss: 0.4792
Epoch 4/5
35000/35000 ━━━━━━━━━━━━━━━━━━━━ 1111s 32ms/step - accuracy: 0.8857 - loss: 0.2968 - val_accuracy: 0.7957 - val_loss: 0.4964
Epoch 5/5
35000/35000 ━━━━━━━━━━━━━━━━━━━━ 1104s 32ms/step - accuracy: 0.8958 - loss: 0.2712 - val_accuracy: 0.7983 - val_loss: 0.5402


In [21]:
# evaluate model performance
model.evaluate(x_test, y_test)

15000/15000 ━━━━━━━━━━━━━━━━━━━━ 28s 2ms/step - accuracy: 0.7980 - loss: 0.5443


[0.5402283668518066, 0.7983208298683167]
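The evaluation result equals the validation figures of the last epoch, because the test split is also used as validation data. In the training log the validation loss is lowest at the second epoch and rises afterwards while the training loss keeps falling, which is a typical sign of overfitting. A short sketch that finds the best epoch from the logged values (the numbers are copied from the training output above):

```python
# Per-epoch losses copied from the training log.
train_loss = [0.6786, 0.4291, 0.3678, 0.2968, 0.2712]
val_loss = [0.4553, 0.4461, 0.4792, 0.4964, 0.5402]

# 1-based index of the epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss at each epoch.
gaps = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]
print(best_epoch, gaps)
```

One possible way to act on this would be to pass `keras.callbacks.EarlyStopping(monitor="val_loss", restore_best_weights=True)` to `model.fit`, so that training stops once the validation loss no longer improves; whether that helps here has not been tested in this notebook.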

In [24]:
# save the model
model.save('my_model1.keras')


In [25]:
# reset and restart the kernel,
# run every cell up to (but not including)
# the model training cell,
# then continue from here:

# load the model weights from file
model.load_weights('my_model1.keras')

model.evaluate(x_test, y_test)

15000/15000 ━━━━━━━━━━━━━━━━━━━━ 29s 2ms/step - accuracy: 0.7980 - loss: 0.5443


[0.5402283668518066, 0.7983208298683167]