# Making a Closed Domain Chatbot Via Text Classification

Return to the [index](https://github.com/Nkluge-correa/Aira-EXPERT).

**In this notebook, we train the `Bi-LSTM`, `Ensemble Bi-LSTM`, and `decoder-only transformer` versions of Ai.ra.**

## Data & Preprocessing

**The cell below prepares the data for a text classification model: it builds the training and testing sets.**

**First, the code opens the file "_data/tags_pt.txt_" (the name depends on the `language` variable) using the `with` statement and reads it line by line. Each line holds a text followed by an integer label. The text is added to a list called `X` and the label to another list called `Y`.**
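
**The parsing step can be sketched on a couple of in-memory lines. This is a minimal sketch that assumes each line ends with an integer label separated from the text by a space, which is what the slicing in the cell below implies. The sample lines and the names `parse_line`, `X_demo`, and `Y_demo` are made up for illustration.**

```python
# Hypothetical sample lines in the "<text> <label>" layout (not taken from the dataset).
sample_lines = [
    "what is your name 1\n",
    "o que é aprendizado de máquina 2\n",
]

def parse_line(line):
    """Split a line into (text, label), treating the last token as the integer label."""
    text, label = line.strip().rsplit(" ", 1)
    return text, int(label)

pairs = [parse_line(line) for line in sample_lines]
X_demo = [text for text, _ in pairs]
Y_demo = [label for _, label in pairs]

print(X_demo)
print(Y_demo)
```

**Splitting from the right keeps any spaces inside the text intact and yields the text and the label in one pass.**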

**Next, the code sets the model's hyperparameters: the size of the vocabulary, the size of the embedding layer, and the maximum sequence length. It then creates a `TextVectorization` layer. This layer maps text (here, word n-grams up to length 3) to a sequence of integers and pads or truncates each sequence so that all of them have the same length.**
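
**To make the effect of `ngrams=3` and `output_sequence_length` concrete, here is a pure-Python imitation of the encoding step. It is a sketch, not the Keras implementation: it assumes shorter n-grams are emitted first, that index `0` is padding, and that index `1` stands for out-of-vocabulary terms. The toy vocabulary and the helper names are hypothetical.**

```python
def ngrams_up_to(tokens, n):
    """Return all 1..n-grams of a token list, shorter n-grams first."""
    grams = []
    for width in range(1, n + 1):
        for start in range(len(tokens) - width + 1):
            grams.append(" ".join(tokens[start:start + width]))
    return grams

def encode(text, vocab, n=3, length=10):
    """Map n-grams to integer ids, then truncate or zero-pad to `length`."""
    index = {term: i for i, term in enumerate(vocab)}
    grams = ngrams_up_to(text.lower().split(), n)
    ids = [index.get(gram, 1) for gram in grams]  # 1 = out-of-vocabulary
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

# Toy vocabulary: index 0 is reserved for padding and index 1 for unknown terms.
toy_vocab = ["", "[UNK]", "what", "is", "what is", "sgd"]

print(encode("What is SGD", toy_vocab))
print(encode("What is SGD", toy_vocab, length=4))
```

**Under these assumptions, a seven-word question with a sequence length of 10 keeps its seven unigrams and only the first three bigrams, so most of the bigram and trigram information is truncated for longer inputs.**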

**The code then adapts the `TextVectorization` layer to the `X` data to learn the vocabulary. The vocabulary is saved to a file named "_aira/vocabulary_pt.txt_".**

**Finally, the code encodes the text data using the `TextVectorization` layer, one-hot encodes the labels using `to_categorical` from Keras, and splits the result into training and testing sets (90% and 10%) using `train_test_split` from scikit-learn. The resulting training and testing data are stored in variables named `x_train`, `x_test`, `y_train`, and `y_test`.**
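
**The `[:, 1:]` slice applied after `to_categorical` drops the column for class `0`. A minimal sketch of the resulting encoding, assuming the labels in the tags file start at 1 (the slice suggests this, but the file itself is not shown here). The helper `one_hot_shifted` is hypothetical.**

```python
def one_hot_shifted(labels, num_classes):
    """One-hot encode 1-based labels so that label k lands in column k - 1."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label - 1] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_shifted([1, 3, 2], num_classes=3)
for row in encoded:
    print(row)
```

**With this layout, the index of the largest predicted probability is zero-based, which is how the answers list is indexed in the test cell at the end of the notebook.**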

In [8]:
import tensorflow as tf

language = "pt" # "bilingual", "pt", or "en"

with open(f'data/tags_{language}.txt', encoding='utf-8') as fp:
    lines = [line.strip() for line in fp]

# Each line is "<text> <label>": the last token is the integer class label
X = [[' '.join(line.split(' ')[:-1])] for line in lines]
Y = [int(line.split(' ')[-1]) for line in lines]

vocab_size = 2000 #4000 for bilingual or 2000 for english or portuguese
embed_size = 256
sequence_length = 10

# Create a vectorization layer and adapt it to the text
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    ngrams=3,
    output_sequence_length=sequence_length)

vectorize_layer.adapt(X)
vocabulary = vectorize_layer.get_vocabulary()

# Save the vocabulary for later inspection (UTF-8, to match how it is read back)
with open(f'aira/vocabulary_{language}.txt', 'w', encoding='utf-8') as fp:
    for word in vocabulary:
        fp.write("%s\n" % word)

encoded_X = vectorize_layer(X)

encoded_X
one_hot_encoded_Y = tf.keras.utils.to_categorical(Y)[:,1:]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(encoded_X.numpy(), 
                                                    one_hot_encoded_Y, 
                                                    test_size=0.1, 
                                                    random_state=42)


## `Bi-LSTM`

**This code is defining and training a `Bidirectional Long Short-Term Memory` neural network for text classification.**

> **A `BiLSTM` is a type of recurrent neural network (`RNN`) that is commonly used for processing sequential data, such as text or speech. It extends the capabilities of a regular `LSTM` by processing the input sequence in both forward and backward directions, allowing the network to capture dependencies that exist in both directions.**

**The model is defined using the Keras functional API. The input layer is defined using `tf.keras.Input`, and the text data is passed through an Embedding layer to create dense word embeddings. The Embedding layer is followed by two Bidirectional LSTM layers with 128 units each. The output of the second LSTM layer is passed through a Dense layer with a softmax activation function, which outputs a probability distribution over the classes.**
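
**The layer sizes described above can be sanity-checked against the `model.summary()` output printed after the cell. The sketch below uses the standard LSTM parameter formula and the hyperparameters from this notebook. The helper name `lstm_params` is made up for illustration.**

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

vocab_size, embed_size, units, num_classes = 2000, 256, 128, 142

embedding = vocab_size * embed_size
bilstm_1 = 2 * lstm_params(embed_size, units)   # forward + backward direction
bilstm_2 = 2 * lstm_params(2 * units, units)    # input is the 256-wide output of the first layer
dense = (2 * units) * num_classes + num_classes

print(embedding, bilstm_1, bilstm_2, dense)
# -> 512000 394240 394240 36494
```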

**The model is compiled using the Adam optimizer, categorical cross-entropy loss function, and categorical accuracy as the evaluation metric.**

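**The code then defines the callbacks used during training: early stopping and learning-rate reduction on a plateau (both monitoring validation accuracy), and model checkpointing (monitoring training accuracy). The model is trained in batches of 8 with 20% of the training data held out for validation. The saved checkpoint is then reloaded and evaluated on the test set.**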
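
**The patience rule behind early stopping can be illustrated with a short simulation. This is a simplified sketch, not the Keras callback: it ignores options such as `min_delta` and `restore_best_weights`, and the validation scores are invented for the example.**

```python
def train_with_early_stopping(val_scores, patience):
    """Return (epochs_run, best_epoch) under a simplified patience rule."""
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            # Improvement: remember it and reset the counter.
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` consecutive epochs: stop.
                return epoch, best_epoch
    return len(val_scores), best_epoch

# Made-up validation accuracies, for illustration only.
scores = [0.50, 0.62, 0.70, 0.69, 0.70, 0.68, 0.71, 0.70]

print(train_with_early_stopping(scores, patience=3))
print(train_with_early_stopping(scores, patience=10))
```

**With a small patience the run stops before the later improvement at epoch 7 is ever seen, which is the trade-off a larger value such as `patience=10` is meant to soften.**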


In [9]:
import tensorflow as tf

inputs = tf.keras.Input(shape=(x_train.shape[1],), dtype="int32")

embedded_inputs = tf.keras.layers.Embedding(
    input_dim=vocab_size, 
    output_dim=embed_size, 
    mask_zero=True)(inputs)

x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(embedded_inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x)

outputs = tf.keras.layers.Dense(142, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_categorical_accuracy', 
                                    patience=10), 
    tf.keras.callbacks.ModelCheckpoint(filepath=f'aira/Aira_BiLSTM_{language}.keras', 
                                        monitor='categorical_accuracy', 
                                        save_best_only=True,),  
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                        factor=0.1, 
                                        patience=10), 
]

model.fit(x_train,
          y_train,
          validation_split = 0.2,
          epochs=100,
          batch_size=8,
          verbose=1,
          callbacks=callbacks)

model = tf.keras.models.load_model(f'aira/Aira_BiLSTM_{language}.keras')
test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 1)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')

Version:  2.10.1
Eager mode:  True
GPU is available
Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_5 (Embedding)     (None, 10, 256)           512000    
                                                                 
 bidirectional_6 (Bidirectio  (None, 10, 256)          394240    
 nal)                                                            
                                                                 
 bidirectional_7 (Bidirectio  (None, 256)              394240    
 nal)                                                            
                                                                 
 dense_7 (Dense)             (None, 142)               36494     
                                                                 
Total p

## `Ensemble Bi-LSTM`

**This model combines two parallel `BiLSTM` branches, each with its own input and embedding layer. The two branches are processed in parallel, and their outputs are concatenated and fed into a final dense layer with a softmax activation function. In this notebook, both branches receive the same encoded text.**

> **Ensembling is a machine learning technique that combines several individual models to improve the performance of the overall system.**

**The code reuses the hyperparameters set earlier (vocabulary size, sequence length, and embedding size). It defines two input layers, `inputs_1` and `inputs_2`, each taking integer sequences of fixed length (`sequence_length`). Each input is passed through its own embedding layer (`embedded_inputs_1` and `embedded_inputs_2`, respectively), which converts the input integers into dense vectors of fixed size.**

**Next, each embedding output is fed into its own stack of two bidirectional LSTM layers with 128 units per direction, which process the input sequence in both forward and backward directions to capture both past and future context. The final outputs of the two branches are then concatenated along the last axis using `tf.keras.layers.concatenate` to produce a single tensor.**

**Finally, the concatenated tensor is passed through a dense layer with a softmax activation function that produces class probabilities.**
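
**The shape bookkeeping for the concatenation can be checked with simple arithmetic. This sketch assumes the default behavior of `Bidirectional`, which concatenates the forward and backward states, and uses the layer sizes from the cell below.**

```python
units, num_classes = 128, 142

branch_width = 2 * units            # forward and backward states of one branch
concat_width = 2 * branch_width     # two branches joined along the last axis
dense_params = concat_width * num_classes + num_classes  # weights + biases

print(branch_width, concat_width, dense_params)
# -> 256 512 72846
```

**Since both branches are fed `x_train`, they differ only through their separately initialized and trained weights.**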

In [10]:
import tensorflow as tf

inputs_1= tf.keras.Input(shape=(x_train.shape[1],), dtype="int32",  name='input_1')

embedded_inputs_1 = tf.keras.layers.Embedding(
    input_dim=vocab_size, 
    output_dim=embed_size, 
    mask_zero=True,
    name="embedded_inputs_1")(inputs_1)

x_1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(embedded_inputs_1)
x_1 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x_1)

inputs_2= tf.keras.Input(shape=(x_train.shape[1],), dtype="int32",  name='input_2')

embedded_inputs_2 = tf.keras.layers.Embedding(
    input_dim=vocab_size, 
    output_dim=embed_size, 
    mask_zero=True,
    name="embedded_inputs_2")(inputs_2)

x_2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(embedded_inputs_2)
x_2 = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x_2)

concatenated = tf.keras.layers.concatenate([x_1, x_2], axis=-1)
outputs = tf.keras.layers.Dense(142, activation="softmax")(concatenated)

model = tf.keras.Model([inputs_1, inputs_2], outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")
model.summary()


callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_categorical_accuracy', 
                                    patience=10), 
    tf.keras.callbacks.ModelCheckpoint(filepath=f'aira/Aira_BiLSTM_ENSEMB_{language}.keras', 
                                        monitor='categorical_accuracy', 
                                        save_best_only=True,),  
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                        factor=0.1, 
                                        patience=10), 
]

model.fit([x_train, x_train],
          y_train,
          validation_split = 0.2,
          epochs=100,
          batch_size=8,
          verbose=1,
          callbacks=callbacks)

model = tf.keras.models.load_model(f'aira/Aira_BiLSTM_ENSEMB_{language}.keras')
test_loss_score, test_acc_score = model.evaluate([x_test,x_test], y_test)

print(f'Final Loss: {round(test_loss_score, 1)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')

Version:  2.10.1
Eager mode:  True
GPU is available
Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 10)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 10)]         0           []                               
                                                                                                  
 embedded_inputs_1 (Embedding)  (None, 10, 256)      512000      ['input_1[0][0]']                
                                                                                                  
 embedded_inputs_2 (Embedding)  (None, 10, 256)      512000      ['input_2[0][0]']                
                                        

## `Decoder-Transformer`

**The code is implementing a `Transformer` model to perform text classification.**

> **`Decoder-only transformers` are a type of transformer architecture that only consists of the decoder module, while omitting the encoder module.**

**The first section imports the `PositionalEmbedding` and `TransformerEncoder` classes from the `tblocks` file.**

**The input is passed through a `PositionalEmbedding` layer that applies embedding to the input tokens and adds positional information to the embeddings. The output of the `PositionalEmbedding` layer is passed through a `TransformerEncoder` layer that performs multi-head self-attention on the input sequence and then applies dense projections on the concatenated outputs of the attention heads.**
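
**The `tblocks` module is not shown in this notebook, so the internals of both layers are assumptions here. One layout that reproduces the parameter counts in the `model.summary()` output below is a token table plus a learned position table for `PositionalEmbedding`, and multi-head attention with `key_dim` equal to the embedding size, two dense projections, and two layer normalizations for `TransformerEncoder`. The arithmetic is sketched as:**

```python
vocab_size, sequence_length, embed_size = 2000, 10, 256
num_heads, dense_dim = 6, 512

# Assumed layout of PositionalEmbedding: token table + learned position table.
positional_embedding = vocab_size * embed_size + sequence_length * embed_size

# Assumed layout of TransformerEncoder (tblocks is not shown in this notebook).
key_dim = embed_size  # assumption: every head is embed_size wide
qkv = 3 * (embed_size * num_heads * key_dim + num_heads * key_dim)
attention_out = num_heads * key_dim * embed_size + embed_size
feed_forward = (embed_size * dense_dim + dense_dim) + (dense_dim * embed_size + embed_size)
layer_norms = 2 * (2 * embed_size)  # gamma and beta for each of the two norms

transformer_encoder = qkv + attention_out + feed_forward + layer_norms

print(positional_embedding, transformer_encoder)
# -> 514560 1841664
```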

**The output of the `TransformerEncoder` layer is passed through a `GlobalMaxPooling1D` layer that performs global max pooling along the temporal axis of the tensor. Then, the output is passed through a `Dropout` layer with a rate of `0.5` to reduce overfitting. Finally, a `Dense` layer with `softmax` activation is used to produce the output probabilities for each of the classes.**
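
**Global max pooling and softmax are simple enough to sketch in plain Python. The toy tensor below is invented, and the softmax is applied directly to the pooled vector for brevity, whereas the actual model has a `Dense` layer in between (and `Dropout` during training).**

```python
import math

def global_max_pool(sequence):
    """Collapse (timesteps, features) to (features,) by taking the max over time."""
    return [max(column) for column in zip(*sequence)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy tensor: 3 timesteps, 2 features.
sequence = [[0.1, 2.0],
            [0.7, -1.0],
            [0.3, 0.5]]

pooled = global_max_pool(sequence)
probs = softmax(pooled)

print(pooled)
print(probs)
```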

In [11]:
from tblocks import PositionalEmbedding, TransformerEncoder
import tensorflow as tf

num_heads = 6
dense_dim = 512

inputs = tf.keras.Input(shape=(x_train.shape[1],), dtype="int64")

x = PositionalEmbedding(sequence_length, vocab_size, embed_size)(inputs)
x = TransformerEncoder(embed_size, dense_dim, num_heads)(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x) 
x = tf.keras.layers.Dropout(0.5)(x)

outputs = tf.keras.layers.Dense(142, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", 
    loss="categorical_crossentropy", 
    metrics=["categorical_accuracy"])
model.summary()

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_categorical_accuracy', 
                                    patience=10), 
    tf.keras.callbacks.ModelCheckpoint(filepath=f'aira/Aira_transformer_{language}.keras', 
                                        monitor='categorical_accuracy', 
                                        save_best_only=True,),  
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_categorical_accuracy', 
                                        factor=0.1, 
                                        patience=10), 
]

model.fit(x_train,
          y_train,
          validation_split = 0.2,
          epochs=100,
          batch_size=8,
          verbose=1,
          callbacks=callbacks)

model = tf.keras.models.load_model(f"aira/Aira_transformer_{language}.keras", 
                                custom_objects={"TransformerEncoder": TransformerEncoder, 
                                                 "PositionalEmbedding": PositionalEmbedding})

test_loss_score, test_acc_score = model.evaluate(x_test, y_test)

print(f'Final Loss: {round(test_loss_score, 1)}.')
print(f'Final Performance: {round(test_acc_score * 100, 2)} %.')

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 10)]              0         
                                                                 
 positional_embedding_1 (Pos  (None, 10, 256)          514560    
 itionalEmbedding)                                               
                                                                 
 transformer_encoder_1 (Tran  (None, 10, 256)          1841664   
 sformerEncoder)                                                 
                                                                 
 global_max_pooling1d_1 (Glo  (None, 256)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                           

## Testing the Models

**Below you can load one of the trained models for inspection.**

In [12]:
import tensorflow as tf

model = tf.keras.models.load_model(f"aira/Aira_BiLSTM_ENSEMB_{language}.keras")

# to load the transformer, pass the:
#
# 'custom_objects={"TransformerEncoder": TransformerEncoder, 
#                   "PositionalEmbedding": PositionalEmbedding}' 
# 
# argument.

with open(f'aira/vocabulary_{language}.txt', encoding='utf-8') as fp:
    # Strip only the newline so the empty padding token on the first line is preserved
    vocabulary = [line.rstrip('\n') for line in fp]

with open('data/answers_en.txt', encoding='utf-8') as fp:
    answers = [line.strip() for line in fp]


vocab_size = 4000 #4000 for bilingual or 2000 for english or portuguese
sequence_length = 10

text_vectorization = tf.keras.layers.TextVectorization(max_tokens=vocab_size, 
                                        output_mode="int", 
                                        ngrams=3,
                                        vocabulary=vocabulary,
                                        output_sequence_length=sequence_length)

**Here are some strings/questions to test the trained models.**

In [16]:
import string
import numpy as np
from IPython.display import Markdown, display

#text = '''what is Interpretability?'''
#text = '''What is the problem of alignment?'''
#text = '''O que é Interpretabilidade?'''
text = '''O que é o problema de Alinhamento?'''
#text = '''What is Machine Learning?'''
#text = '''O que é Ética das Virtudes?'''
#text = '''What is your name?'''
#text = '''Qual é o seu nome?'''
#text = '''O que é SGD?'''
#text = '''What is Stochastic Gradient Descent?'''

print(f'Questions: {text}\n')

encoded_sentence = text_vectorization(text.lower()\
                                      .translate(str.maketrans('', '', string.punctuation)))
print(f'Encoded Sentence: {encoded_sentence}\n')

INPUT = tf.keras.backend.expand_dims(encoded_sentence, axis=0)

# The ensemble takes two inputs; single-input models expect `model.predict(INPUT, verbose=0)`
preds = model.predict([INPUT, INPUT], verbose=0)[0]
output = answers[np.argmax(preds)]

display(Markdown(f'Answers: \n\n{output} \n\n[Confidence: {max(preds) * 100: .2f} %]'))

Questions: O que é o problema de Alinhamento?

Encoded Sentence: [  2   3 652   2  25   8  35   4 730   1]



Answers: 

`Outer-alignment`, in the context of Machine Learning, is the **extent to which the specified objective function is aligned with the intended goal of its designers**. This is an intuitive notion, in part because human intentions themselves are not well understood. This is what is usually discussed as the "_[value alignment](https://intelligence.org/files/ValueLearningProblem.pdf)_" problem. 

[Confidence:  95.72 %]
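
**The pre- and post-processing around the model call can be tried without a trained model. In this sketch the probabilities and answers are invented, and the helper names `normalize` and `pick_answer` are hypothetical. Only the normalization mirrors the cell above.**

```python
import string

def normalize(text):
    """Lowercase and strip ASCII punctuation, as done before vectorization."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def pick_answer(probabilities, answers):
    """Return the answer with the highest probability, and that probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return answers[best], probabilities[best]

# Made-up probabilities and answers, for illustration only.
toy_answers = ["answer A", "answer B", "answer C"]
toy_probs = [0.03, 0.95, 0.02]

print(normalize("O que é o problema de Alinhamento?"))
print(pick_answer(toy_probs, toy_answers))
```

**The reported confidence is just the largest softmax probability, so it says how strongly the model prefers one class over the others, not whether the retrieved answer is correct.**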
